Using Scrapy with proxies
I’m currently working on scraping some websites for B-kam.com. I used to develop in PHP, but when I looked for the best scraping / crawling tool, Scrapy (written in Python) came out on top.
You can read more about it and how to get started here:
http://readthedocs.org/docs/scrapy/en/latest/index.html
I searched a lot for how to use proxies with Scrapy but couldn’t find a simple, straightforward way to do it. Everyone talks about middlewares and the Request object, but not about how to actually use them.
So, here are the steps to use Scrapy with proxies:
1 – Create a new file called “middlewares.py”, save it in your Scrapy project (next to settings.py), and add the following code to it:
# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64


# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy.
        # b64encode works on bytes and returns a single line with no trailing newline.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
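If you want to check the authentication header on its own before wiring it into Scrapy, here is a minimal sketch using only the standard library. The helper name basic_proxy_auth is my own (it is not part of Scrapy), and the credentials are placeholders.

```python
import base64


def basic_proxy_auth(username, password):
    """Build the value of a Proxy-Authorization header (HTTP Basic scheme)."""
    # Basic auth is "username:password", base64-encoded
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # b64encode returns bytes on a single line; decode to get a str header value
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


if __name__ == "__main__":
    # Placeholder credentials, replace with your proxy's
    print(basic_proxy_auth("USERNAME", "PASSWORD"))
```

The returned string is what ends up in request.headers['Proxy-Authorization'] inside the middleware.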
2 – Open your project’s configuration file (./project_name/settings.py) and add the following code:
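DOWNLOADER_MIDDLEWARES = {
    # Built-in proxy middleware (on old Scrapy releases the path was
    # scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware)
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Our middleware has the lower number, so it sets the proxy first
    'project_name.middlewares.ProxyMiddleware': 100,
}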
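The numbers in that dictionary matter: Scrapy calls process_request on downloader middlewares in increasing order of their number, and a value of None disables a middleware. Here is a rough sketch of that ordering, under a couple of assumptions: the function request_order is a made-up helper, the built-in middleware path is written the way recent Scrapy releases spell it, and the sketch ignores the defaults that Scrapy merges in on its own.

```python
def request_order(middlewares):
    """Return middleware paths in the order their process_request hooks run."""
    # A priority of None means the middleware is disabled
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    # Lower numbers sit closer to the engine and see the request first
    return sorted(enabled, key=enabled.get)


DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    "project_name.middlewares.ProxyMiddleware": 100,
}

if __name__ == "__main__":
    for path in request_order(DOWNLOADER_MIDDLEWARES):
        print(path)
```

With 100 and 110, ProxyMiddleware comes first, so request.meta['proxy'] is already set by the time HttpProxyMiddleware sees the request.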
Now your requests should go through this proxy. Simple, isn’t it?
If you want to test it, just create a new spider named “test” and add the following code:
import scrapy


# A plain Spider is enough here; CrawlSpider uses parse() internally, so it should not be overridden there
class TestSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following URL is subject to change; you can get the latest one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://automation.whatismyip.com/n09230945.asp"]

    def parse(self, response):
        # Save the raw response body so the reported IP can be inspected
        with open('test.html', 'wb') as f:
            f.write(response.body)
Then run the spider with “scrapy crawl test” and cat test.html to find the IP. It should be your proxy’s address, not your own.
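If you would rather do that check from Python than with cat, something like the following could work. It is only a sketch: it assumes the saved page contains the address as plain IPv4 text, and find_ipv4 and read_reported_ip are hypothetical helper names of mine.

```python
import re

# Loose IPv4 matcher: four dot-separated groups of 1 to 3 digits.
# It does not validate that each group is in the 0-255 range.
IPV4_PATTERN = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ipv4(body):
    """Return the first IPv4-looking string in a response body, or None."""
    match = IPV4_PATTERN.search(body)
    return match.group(0).decode("ascii") if match else None


def read_reported_ip(path="test.html"):
    """Read the file written by the test spider and extract the reported IP."""
    with open(path, "rb") as f:
        return find_ipv4(f.read())
```

After the crawl finishes, calling read_reported_ip() and comparing the result with YOUR_PROXY_IP tells you whether the request really went through the proxy.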
For cheap / reasonably priced proxies, try the following websites:
http://proxymesh.com/pricing/
http://squidproxies.com
References:
http://snippets.scrapy.org/snippets/32/
https://groups.google.com/forum/?fromgroups#!msg/scrapy-users/mX9d05qcZw8/RkjWkqBT-HIJ