Browsing articles tagged with "scrapy"

Scrapy error “ImportError: No module named spiders”

While I was migrating some Scrapy spiders from one project to another, I got the following error whenever I tried to run the Scrapy shell:


ImportError: No module named spiders

Continue reading »

Scrapy extension to store spider statistics to a PostgreSQL DB

As I am working on a Scrapy project, I wanted to store all spider statistics to a database so that I can access them later, so I wrote the following extension.
Continue reading »
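The full extension is behind the link above, but the storage half can be sketched with nothing more than the standard library. This is an illustrative sketch, not the actual extension: it swaps PostgreSQL for stdlib sqlite3, and the function and table names are my own, assuming we are handed a stats dict like the one Scrapy's stats collector produces.

```python
import json
import sqlite3

def store_spider_stats(db_path, spider_name, stats):
    """Persist one spider run's stats dict as a JSON blob (illustrative sketch)."""
    conn = sqlite3.connect(db_path)
    # One row per finished run; the stats dict is stored as JSON text
    conn.execute("""CREATE TABLE IF NOT EXISTS spider_stats (
                        id INTEGER PRIMARY KEY AUTOINCREMENT,
                        spider TEXT,
                        stats TEXT)""")
    conn.execute("INSERT INTO spider_stats (spider, stats) VALUES (?, ?)",
                 (spider_name, json.dumps(stats, default=str)))
    conn.commit()
    conn.close()

# Example: the kind of dict Scrapy's stats collector returns at spider close
store_spider_stats("stats.db", "example_spider",
                   {"item_scraped_count": 120, "finish_reason": "finished"})
```

In the real extension this function would be wired to Scrapy's spider-closed signal, with psycopg2 in place of sqlite3.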

Super simple and basic scrapyd web interface

As I use Scrapy to crawl site data for b-kam.com, I found it tedious to use the console every time to run or stop a spider. So, I decided to take a couple of hours to write this simple HTML and JavaScript code to manage (start / stop) scrapyd spiders / jobs.

I think it could later be extended to use PHP & MySQL and store some details in a database.

Continue reading »

Using Scrapy with different / many proxies

In the previous post (Using Scrapy with proxies), I showed how to use a SINGLE proxy with Scrapy.

Now, what if you have several proxies? Here are a few simple changes to make it work.

1. Add a new list with your proxies to your settings file as follows :

PROXIES = [{'ip_port': 'xx.xx.xx.xx:xxxx', 'user_pass': 'foo:bar'},
           {'ip_port': 'PROXY2_IP:PORT_NUMBER', 'user_pass': 'username:password'},
           {'ip_port': 'PROXY3_IP:PORT_NUMBER', 'user_pass': ''},]

2. Update your middlewares.py file to the following :

import base64
import random
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random proxy for every request
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # Only send credentials when the proxy actually has some
        # (an empty 'user_pass' string means no authentication)
        if proxy['user_pass']:
            # b64encode, unlike the legacy encodestring, adds no trailing newline
            encoded_user_pass = base64.b64encode(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
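The selection logic above can be exercised outside Scrapy with just the standard library. This sketch builds the same proxy URL and Proxy-Authorization value the middleware would send (the helper name and the sample entries are mine, and the encode/decode calls make it work on Python 3, where b64encode takes bytes):

```python
import base64
import random

PROXIES = [{'ip_port': 'xx.xx.xx.xx:xxxx', 'user_pass': 'foo:bar'},
           {'ip_port': 'PROXY3_IP:PORT_NUMBER', 'user_pass': ''}]

def proxy_settings(proxy):
    """Return the (proxy url, Proxy-Authorization header value) pair for one entry."""
    url = "http://%s" % proxy['ip_port']
    if proxy['user_pass']:
        # b64encode takes/returns bytes on Python 3, hence the encode/decode
        token = base64.b64encode(proxy['user_pass'].encode()).decode()
        return url, 'Basic ' + token
    # No credentials: no Proxy-Authorization header at all
    return url, None

url, auth = proxy_settings(random.choice(PROXIES))
print(url, auth)
```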

That’s it :) !

Using Scrapy with proxies

I’m currently working on scraping some websites for B-kam.com. I used to develop in PHP, but when I searched for the best scraping / crawling tools, I found that Scrapy (written in Python) is the best.

You can read more about it and how to start here : 
http://readthedocs.org/docs/scrapy/en/latest/index.html

I searched a lot for how to use proxies with Scrapy but couldn’t find a simple / straightforward way to do it. Everyone talks about middlewares and the Request object, but not how to actually use them.

So, here are the steps to use Scrapy with proxies :

1 – Create a new file called “middlewares.py”, save it in your Scrapy project, and add the following code to it.

# Import the base64 library because we'll need it ONLY if the proxy requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (b64encode, unlike the legacy encodestring, adds no trailing newline)
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2 – Open your project’s configuration file (./project_name/settings.py) and add the following code

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now, your requests should be passed through this proxy. Simple, isn’t it? (Downloader middlewares process requests in increasing order of their setting value, so ProxyMiddleware at 100 runs before the built-in HttpProxyMiddleware at 110.)

If you want to test it, just create a new spider named test and add the following code :

from scrapy.contrib.spiders import CrawlSpider

class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following url is subject to change, you can get the last updated one from here :
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://automation.whatismyip.com/n09230945.asp"]

    def parse(self, response):
        # Save the response body so we can inspect which IP the request came from
        with open('test.html', 'wb') as f:
            f.write(response.body)

Then cat test.html to see which IP your requests came from.
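If you’d rather check the saved page programmatically, a tiny stdlib helper (hypothetical, not part of the post) can pull the first IPv4-looking address out of it:

```python
import re

def extract_ip(html):
    """Return the first IPv4-looking address found in the page, or None."""
    match = re.search(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', html)
    return match.group(0) if match else None

# The automation page returned just the bare IP as its body
print(extract_ip("93.184.216.34"))  # -> 93.184.216.34
```

If the printed address is the proxy’s rather than your own, the middleware is working.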

For cheap / reasonable proxies, try the following websites :
http://proxymesh.com/pricing/
http://squidproxies.com

References :
http://snippets.scrapy.org/snippets/32/
https://groups.google.com/forum/?fromgroups#!msg/scrapy-users/mX9d05qcZw8/RkjWkqBT-HIJ