Creating and configuring DigitalOcean droplets with IPython notebooks.

I recently had an extremely large web scraping gig that required four VPSes at once. The client wanted to use DigitalOcean, since you pay for each VPS by the hour instead of by the month. To make the job more difficult, the client would specify changes as the job progressed, and to apply each change I had to SSH into every VPS individually. On DigitalOcean you also pay for a VPS whether it is powered up or off, but you do not pay for the storage of snapshots. Since the client only wanted to scrape for a few hours a day, it was in my best interest to bring the VPSes up, run my scrapers, and then, when finished, power them off and take snapshots as quickly as possible. Time is money.
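The power-off-and-snapshot cycle can be sketched against the DigitalOcean API v2 droplet action endpoints. This is a minimal illustration, not my actual notebook code: the token, droplet IDs, and snapshot name below are placeholders.

```python
"""Minimal sketch of powering off droplets and snapshotting them via
the DigitalOcean API v2. The action endpoints are real; the token,
droplet IDs, and snapshot name are placeholders."""
import json
import urllib.request

API_BASE = "https://api.digitalocean.com/v2"


def action_payload(action_type, **params):
    """Build the JSON body for a droplet action request."""
    body = {"type": action_type}
    body.update(params)
    return body


def droplet_action(token, droplet_id, payload):
    """POST one action (e.g. power_off, snapshot) to one droplet."""
    req = urllib.request.Request(
        f"{API_BASE}/droplets/{droplet_id}/actions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)


if __name__ == "__main__":
    token = "YOUR_DO_TOKEN"                    # placeholder
    for droplet_id in (111, 222, 333, 444):    # hypothetical droplet IDs
        droplet_action(token, droplet_id, action_payload("power_off"))
        droplet_action(token, droplet_id,
                       action_payload("snapshot", name="scraper-done"))
```

Driving this from an IPython notebook means the whole fleet can be shut down and snapshotted in one cell instead of four SSH sessions.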

Read More

Injecting XMLHttpRequests into Python Selenium.

When scraping websites with a headless browser, it is sometimes possible to make the site's XMLHttpRequest calls directly using Selenium-Requests, an extension of Selenium that adds the requests library's API to the WebDriver. The Selenium-Requests library works by creating a small web server, spawning another Selenium window, and copying over all of the browser's cookies. The solution is ingenious, and making calls with the requests library makes things a lot easier.
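In use, the library hangs a `request()` method off the driver, with the same signature as the requests library. A minimal sketch, assuming a hypothetical search XHR endpoint on `example.com` (the URL and form fields are made up; `driver.request()` is the method Selenium-Requests adds):

```python
"""Sketch of calling a site's XHR endpoint through Selenium-Requests.
The endpoint and form fields are hypothetical; driver.request() is
the API the library adds on top of a normal WebDriver."""


def search_payload(query, page=1):
    """Form data the hypothetical search XHR expects."""
    return {"q": query, "page": str(page)}


if __name__ == "__main__":
    # Requires: pip install selenium-requests, plus a Firefox install.
    from seleniumrequests import Firefox

    driver = Firefox()
    driver.get("https://example.com")       # establish the session cookies
    resp = driver.request(                  # same call style as requests
        "POST", "https://example.com/api/search",
        data=search_payload("widgets"),
    )
    print(resp.status_code)
    driver.quit()
```

Because the cookies come from the live browser session, the XHR call is indistinguishable from one the page itself would make.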

Read More

Controlling Docker containers with Python rpyc

In this post I am going to talk about controlling Docker containers with the Python module rpyc. The main reason you would want to do this is to control multiple processes, each running in an isolated container. I will then introduce rpycdocker, a module I created that makes it simpler to control multiple Docker instances. I use rpycdocker primarily to run headless browsers, so that I can control multiple instances of a browser, each one a different version if I desire. Some websites cause the browser to freeze or crash, but when each browser runs in its own Docker container this is not a problem: a frozen or crashed browser is isolated from the rest of your browser grid.
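The fan-out pattern can be sketched with plain rpyc: one rpyc service per container, each published on its own host port. The port scheme and the `exposed_fetch` service method below are hypothetical illustrations, not part of rpyc or rpycdocker itself.

```python
"""Sketch of driving one rpyc server per Docker container. Assumes
each container runs an rpyc service published on a distinct host
port; the port scheme and service methods are hypothetical."""

BASE_PORT = 18861  # rpyc's conventional default port; containers at +0, +1, ...


def container_port(index, base=BASE_PORT):
    """Host port that container number `index` is published on."""
    return base + index


if __name__ == "__main__":
    import rpyc  # third-party: pip install rpyc

    urls = ["https://example.com/a", "https://example.com/b"]
    for i, url in enumerate(urls):
        # Each connection talks to the service inside one container,
        # so a crash in one browser cannot take down the others.
        conn = rpyc.connect("localhost", container_port(i))
        conn.root.fetch(url)  # hypothetical exposed_fetch on the service
        conn.close()
```

The design point is that the client only ever holds a network connection; when a container dies, the client gets a connection error it can handle, rather than a wedged in-process browser.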

Read More

Injecting JavaScript into Python Selenium to increase scraping speed

When Python Selenium communicates with the web browser, it sends each request through a bridge. For locating a single element on a page and reading its data or clicking on it, this is not much of a problem. But if you are scraping 100+ results from a page, the round trips add up to a long delay, sometimes up to several seconds.
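The fix can be sketched with Selenium's standard `execute_script()`: push one JavaScript snippet into the page that collects everything, so there is a single bridge round trip instead of one per element. The `.result` selector is a hypothetical example.

```python
"""Sketch of replacing 100+ find_element round trips with a single
execute_script() call. execute_script is standard Selenium WebDriver
API; the '.result' selector is hypothetical."""


def batch_extract_js():
    """JavaScript that returns the text of every matching element at once.

    The CSS selector is passed in as arguments[0] by execute_script.
    """
    return (
        "return Array.from(document.querySelectorAll(arguments[0]))"
        ".map(function (el) { return el.textContent; });"
    )


if __name__ == "__main__":
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    driver.get("https://example.com/search")
    # One bridge round trip instead of one per element:
    texts = driver.execute_script(batch_extract_js(), ".result")
    print(len(texts))
    driver.quit()
```

Everything the snippet gathers comes back as one serialized list, so the per-call bridge latency is paid once rather than a hundred times.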

Read More