
Scrapy extension to store spider statistics in a PostgreSQL DB

While working on a Scrapy project, I wanted to store all spider statistics in a database so that I could access them later, so I wrote an extension that does this.

Using Scrapy with different / many proxies

In the previous post (Using Scrapy with proxies), I showed how to use a SINGLE proxy with Scrapy.

Now, what if you have several proxies? Here are a few simple changes to make it work.

1. Add a new list with your proxies to your config file as follows:

PROXIES = [{'ip_port': 'xx.xx.xx.xx:xxxx', 'user_pass': 'foo:bar'},
           {'ip_port': 'PROXY2_IP:PORT_NUMBER', 'user_pass': 'username:password'},
           {'ip_port': 'PROXY3_IP:PORT_NUMBER', 'user_pass': ''},]

2. Update your middleware file to the following:

import base64
import random
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random proxy for each request
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # Add basic authentication only when credentials are provided
        # (an empty 'user_pass' means the proxy needs no authentication)
        if proxy['user_pass']:
            # b64encode, unlike encodestring, does not append a trailing newline
            encoded_user_pass = base64.b64encode(proxy['user_pass'].encode('utf-8')).decode('ascii')
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
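
The per-proxy logic can also be pulled out into a plain function so it can be checked without running a crawl. The following is only a minimal sketch, assuming Python 3; the function name proxy_settings and the sample addresses are hypothetical and not part of Scrapy or of the original middleware.

```python
import base64

# Hypothetical sample data in the same shape as the PROXIES setting
PROXIES = [
    {'ip_port': '10.0.0.1:8080', 'user_pass': 'foo:bar'},
    {'ip_port': '10.0.0.2:3128', 'user_pass': ''},
]

def proxy_settings(proxy):
    """Return (proxy_url, auth_header) for one PROXIES entry.

    auth_header is None when the entry carries no credentials.
    """
    proxy_url = "http://%s" % proxy['ip_port']
    if not proxy.get('user_pass'):
        return proxy_url, None
    # Basic authentication is the base64 form of "username:password"
    token = base64.b64encode(proxy['user_pass'].encode('utf-8')).decode('ascii')
    return proxy_url, 'Basic ' + token
```

A middleware could call this with random.choice(PROXIES), put the first value in request.meta['proxy'], and set the Proxy-Authorization header only when the second value is not None.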

That’s it :) !

Using Scrapy with proxies

I’m currently working on scraping some websites. I used to develop in PHP, but when I searched for the best scraping / crawling tools, I found that Scrapy (written in Python) is the best.

You can read more about it and how to start here :

I searched a lot for how to use proxies with Scrapy but couldn’t find a simple, straightforward way to do it. Everyone talks about middlewares and the Request object, but not about how to use them.

So, here are the steps to use Scrapy with proxies:

1 – Create a new file called “” in your Scrapy project (step 2 refers to it as project_name.middlewares) and add the following code to it.

# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy. b64encode is used because
        # encodestring appends a newline to the header value (and was removed in Python 3.9)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
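
To double-check what will actually be sent to the proxy, the header value can be built and decoded back outside of Scrapy. This is a rough sketch, assuming Python 3; basic_auth_header and decode_basic_auth are hypothetical helper names, not Scrapy functions.

```python
import base64

def basic_auth_header(user_pass):
    """Build a Proxy-Authorization value from a 'USERNAME:PASSWORD' string."""
    token = base64.b64encode(user_pass.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

def decode_basic_auth(header_value):
    """Reverse basic_auth_header, handy for checking the credentials in a header."""
    scheme, token = header_value.split(' ', 1)
    if scheme != 'Basic':
        raise ValueError('not a Basic credential: %r' % scheme)
    return base64.b64decode(token).decode('utf-8')
```

The value should be a single line with no trailing newline, and decoding it should give back exactly the USERNAME:PASSWORD string you started with.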

2 – Open your project’s configuration file (./project_name/ and add the following code

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should go through this proxy. Simple, isn’t it?
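
The numbers matter: Scrapy calls process_request on downloader middlewares in increasing order, so the custom middleware (100) sets the proxy before the built-in HttpProxyMiddleware (110) sees the request. Here is a small sketch of that ordering; the request_order helper is hypothetical and only mimics the sorting, it does not use Scrapy itself.

```python
# Same shape as the downloader middleware entries: path -> order number
middlewares = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

def request_order(entries):
    """Return middleware paths in the order their process_request would run
    (lowest number first). Entries set to None are treated as disabled."""
    enabled = [(order, path) for path, order in entries.items() if order is not None]
    return [path for order, path in sorted(enabled)]
```

With the values above, request_order(middlewares) lists project_name.middlewares.ProxyMiddleware first.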

If you want to test it, just create a new spider named test and add the following code:

from scrapy.spider import BaseSpider

# A plain spider is enough here: there are no crawl rules to follow
class TestSpider(BaseSpider):
    name = "test"
    domain_name = ""
    # The following URL is subject to change; you can get the latest one from here:
    start_urls = [""]

    def parse(self, response):
        # Save the response body so the reported IP can be inspected
        with open('test.html', 'wb') as f:
            f.write(response.body)

Then cat test.html to find the IP. If the proxy is in use, it should be the proxy's IP rather than your own.
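
If the saved page is cluttered, the IP can be pulled out with a short script instead of reading the HTML by eye. This is only a sketch under the assumption that the page prints the IP as a plain IPv4 address; find_ips is a hypothetical helper, and the pattern is deliberately loose (it does not check that each number is at most 255).

```python
import re

# Loose IPv4 pattern: four dot-separated groups of 1-3 digits
IPV4_RE = re.compile(rb'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def find_ips(html_bytes):
    """Return the distinct IPv4-looking strings in html_bytes, in order of appearance."""
    seen = []
    for match in IPV4_RE.findall(html_bytes):
        ip = match.decode('ascii')
        if ip not in seen:
            seen.append(ip)
    return seen
```

To use it, open test.html in binary mode ('rb'), pass the contents to find_ips, and compare the result with your proxy's address.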

For cheap / reasonable proxies, try the following websites :

References :!msg/scrapy-users/mX9d05qcZw8/RkjWkqBT-HIJ

The History of the Internet

The “History of the Internet” is an animated documentary explaining all the events and technologies that led to the invention of the Internet. A fascinating watch!