Using Scrapy with proxies

I’m currently working on scraping some websites for B-kam.com. I used to develop in PHP, but when I searched for the best scraping / crawling tools, I found that Scrapy (written in Python) is the best.

You can read more about it and how to start here : 
http://readthedocs.org/docs/scrapy/en/latest/index.html

I searched a lot for how to use proxies with Scrapy but couldn’t find a simple / straightforward way to do it. Everyone talks about middlewares and the Request object, but not about how to use them.

So, here are the steps to use Scrapy with proxies :

1 – Create a new file called “middlewares.py”, save it in your Scrapy project (in the same folder as settings.py), and add the following code to it.

# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request, which Scrapy calls for every outgoing request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode is used because
        # encodestring appends a newline (and was removed in Python 3.9).
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
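
To see what the authentication lines produce, here is a minimal standalone sketch that builds the same header value using only the standard library. The helper name basic_proxy_auth is hypothetical (it is not part of Scrapy), and the credentials are placeholders.

```python
import base64


def basic_proxy_auth(username, password):
    """Return a value for the Proxy-Authorization header (HTTP Basic scheme).

    Hypothetical helper for illustration; not part of Scrapy.
    """
    credentials = "%s:%s" % (username, password)
    # b64encode works on bytes and does not append a trailing newline
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    # Placeholder credentials, as in the middleware
    print(basic_proxy_auth("USERNAME", "PASSWORD"))
```

The value must stay on a single line, so it is worth checking that nothing in the encoding step adds a newline.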

2 – Open your project’s configuration file (./project_name/settings.py) and add the following code

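DOWNLOADER_MIDDLEWARES = {
    # Scrapy 1.0+ path; older releases used
    # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware'
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}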
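
The numbers in this setting are priorities: Scrapy calls process_request on downloader middlewares in increasing order, so the custom middleware (100) runs before the built-in proxy middleware (110). The following is a rough sketch of that ordering in plain Python, with no Scrapy import. The helper request_order is hypothetical, and it ignores the built-in defaults that Scrapy merges into this setting.

```python
def request_order(middlewares):
    """List middleware paths in the order their process_request would run.

    Hypothetical helper for illustration. Entries set to None are treated
    as disabled and skipped.
    """
    enabled = [(order, path) for path, order in middlewares.items()
               if order is not None]
    return [path for order, path in sorted(enabled)]


DOWNLOADER_MIDDLEWARES = {
    # Built-in proxy middleware (path used by Scrapy 1.0 and later)
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

if __name__ == "__main__":
    for path in request_order(DOWNLOADER_MIDDLEWARES):
        print(path)
```

Running it lists project_name.middlewares.ProxyMiddleware first, which is the order you want: the proxy is already set on the request when the built-in middleware sees it.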

Now your requests should go through this proxy. Simple, isn’t it?

If you want to test it, just create a new spider with the name test, and add the following code

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following url is subject to change, you can get the last updated one from here :
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://automation.whatismyip.com/n09230945.asp"]

    def parse(self, response):
        # Save the response body, which contains the IP address the site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)

Then run the spider with scrapy crawl test and cat test.html to find the IP. It should be the proxy’s IP, not your own.

For cheap / reasonable proxies, try the following websites :
http://proxymesh.com/pricing/
http://squidproxies.com
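
If your provider gives you several proxies, the same idea extends to picking one per request. This is only a sketch under assumptions: the class name RandomProxyMiddleware and the PROXIES list are hypothetical, the proxy URLs are placeholders, and a random choice is just one simple way to rotate.

```python
import random


class RandomProxyMiddleware(object):
    """Downloader middleware sketch: pick a proxy at random for every request."""

    # Hypothetical placeholders; replace with your own proxy URLs
    PROXIES = [
        "http://YOUR_PROXY_IP_1:PORT",
        "http://YOUR_PROXY_IP_2:PORT",
    ]

    def process_request(self, request, spider):
        # Same mechanism as the single-proxy middleware:
        # Scrapy reads the proxy location from request.meta
        request.meta['proxy'] = random.choice(self.PROXIES)
        # Returning None lets Scrapy continue processing the request
        return None
```

To try it, save the class in middlewares.py and register 'project_name.middlewares.RandomProxyMiddleware' in DOWNLOADER_MIDDLEWARES with the same priority (100) in place of the single-proxy middleware.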

References :
http://snippets.scrapy.org/snippets/32/
https://groups.google.com/forum/?fromgroups#!msg/scrapy-users/mX9d05qcZw8/RkjWkqBT-HIJ
