Scrapy error “ImportError: No module named spiders”

While I was migrating some Scrapy spiders from one project to another, I kept getting the following error whenever I tried to run the scrapy shell:


ImportError: No module named spiders

Continue reading »
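The actual fix is behind the link; for reference only, my hedged guess at the usual culprit when spiders are copied between projects (not necessarily what the post describes) is that the new project's settings still point at the old package, or the spiders directory is missing its __init__.py. Something like this in settings.py, where new_project is a placeholder name:

# settings.py of the new project -- 'new_project' is a placeholder name
SPIDER_MODULES = ['new_project.spiders']   # must match the real package path
NEWSPIDER_MODULE = 'new_project.spiders'
# also make sure new_project/spiders/__init__.py exists, even if it is empty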

Scrapy extension to store spider statistics to PostgreSQL DB

While working on a Scrapy project, I wanted to store all spider statistics in a database so I could access them later, so I wrote the following extension.
Continue reading »
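The actual extension is behind the link; purely as a sketch of the general idea (the table layout, connection string, and class name below are my own placeholders, not the original code), such an extension can hook the spider_closed signal and dump the collected stats into PostgreSQL with psycopg2:

import json

import psycopg2
from scrapy import signals

class StoreStatsToPostgres(object):
    """Write the crawler stats to a PostgreSQL table when a spider finishes."""

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()  # everything Scrapy collected for this run
        conn = psycopg2.connect("dbname=scrapy user=scrapy")  # placeholder DSN
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO spider_stats (spider, reason, stats) VALUES (%s, %s, %s)",
            (spider.name, reason, json.dumps(stats, default=str)),
        )
        conn.commit()
        conn.close()

It would then be enabled from settings.py through the EXTENSIONS setting, e.g. EXTENSIONS = {'project_name.extensions.StoreStatsToPostgres': 500}.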

Simple PHP script to migrate articles from Radiant (Rails CMS) to Joomla (PHP CMS)

I was working on migrating data from Radiant 0.8.1 (a Ruby on Rails CMS) to Joomla 2.5.6 (a PHP CMS), and it was a somewhat silly but interesting task. So I wrote the following simple PHP script to migrate articles, but you should adjust some variables first.
Continue reading »

Super simple and basic scrapyd web interface

Since I use Scrapy to crawl site data for b-kam.com, I found it tedious to use the console every time I wanted to start or stop a spider. So I decided to take a couple of hours to write this simple HTML and JavaScript code to manage (start / stop) scrapyd spiders / jobs.

I think it should later be extended to use PHP & MySQL and store some details in a database.
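For context, scrapyd already exposes a small JSON API and the interface is essentially a front end over it. A minimal sketch of the same start / stop calls from Python (the project and spider names are placeholders), assuming scrapyd is listening on its default port 6800:

import json
import requests

SCRAPYD = "http://localhost:6800"   # default scrapyd address

# Start (schedule) a spider; scrapyd answers with a job id
resp = requests.post(SCRAPYD + "/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
job_id = json.loads(resp.text)["jobid"]

# List pending / running / finished jobs for the project
jobs = json.loads(requests.get(SCRAPYD + "/listjobs.json",
                               params={"project": "myproject"}).text)

# Stop (cancel) the running job
requests.post(SCRAPYD + "/cancel.json",
              data={"project": "myproject", "job": job_id})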

Continue reading »

Migrating (transferring) data from MySQL to PostgreSQL

I was working on a Rails project and ran into this problem: my development database is MySQL while the production database is PostgreSQL, and I wanted to move some data between them. I found the following two ways:
Continue reading »
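The two ways I used are behind the link; purely as an illustration of the general idea (and not necessarily either of those two ways), here is a rough sketch that copies table contents with SQLAlchemy, assuming the same schema already exists on the PostgreSQL side and that the connection strings are placeholders:

from sqlalchemy import create_engine, MetaData

mysql = create_engine("mysql://user:password@localhost/source_db")
pg = create_engine("postgresql://user:password@localhost/target_db")

# Reflect the MySQL schema so we can iterate over its tables
meta = MetaData()
meta.reflect(bind=mysql)

for table in meta.sorted_tables:              # sorted by foreign-key dependencies
    rows = [dict(row) for row in mysql.execute(table.select())]
    if rows:
        pg.execute(table.insert(), rows)      # bulk insert into the matching table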

Blocking all incoming requests except a specific IP using iptables

If you are using Linux and want to block all incoming requests to a specific port except from a specific IP (your static IP, or localhost in my example), you should first block all incoming requests to this PORT using the following command:

~  iptables -A INPUT -p tcp --dport PORT_NUMBER -j DROP

Then, allow the specific IP by inserting an ACCEPT rule above the DROP rule (iptables matches rules in order, so an ACCEPT rule appended after the DROP rule would never be reached):

~  iptables -I INPUT -p tcp -s THE_IP_YOU_WANT_TO_ALLOW --dport PORT_NUMBER -j ACCEPT
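To double-check the rule order (a generic verification step, not from the original post), list the INPUT chain with rule numbers and make sure the ACCEPT rule for your IP appears before the DROP rule:

~  iptables -L INPUT -n --line-numbers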

Thoughts on the Egyptian elections websites (from the referendum to the presidential election)

After about a year of working with eSpace and the Ministry of Administrative Development on the referendum website, the parliamentary elections website, and finally the presidential elections website, I wanted to talk a bit about what I learned and saw behind the scenes :)

Maybe anyone reading this will say: so what, these are people doing their jobs and getting paid for it. Let me tell you: if the people working on this project were just employees, the best you would have gotten is a website put together in Microsoft Word, and instead of seeing which polling station you are registered at, you would have been downloading a PDF like the ones from 2005 and searching for yourself in it.

At the same time, a lot of people assumed all of this was the electoral committee's effort. Let me tell you: if it were left to the committee, they are already overwhelmed, God help them, and they wanted neither websites nor the headache. If it weren't for the Ministry of Administrative Development team, the risks they took, and the effort they put into convincing those people, we would not have seen anything.

The team working on this website is mostly made up of people who are in a very good place financially and personally (and that matters more than anything). At peak times (the election month) they were, and still are, working more than 18 hours straight for the sake of the citizens and the country. Not one of them cared about who would take the credit! And the team's average age is 27 – 28.

Voting for Egyptians abroad was almost entirely this team's responsibility; they could have saved themselves the trouble and left the whole thing to the Ministry of Foreign Affairs, but they fought to make life easier for the citizens abroad.

The amount of resistance and administrative stupidity these people faced is enough to get them into heaven :)

Everyone with a task that could be done in an hour would take two hours to perfect it and make it the best it could be.

As usual, a great many people kept pontificating, saying things like "any hacker can easily break into this website :D", "add some more servers, it's so slow", "anyone could build this, it's a totally ordinary website", and plenty of other silly talk. The technical work behind this website is very complex, in order to keep the information confidential and the site fast. The site runs on about 5 databases, each holding around 40 million records!
There are about 40 servers dedicated to the website, and more are being added. And please keep in mind that the infrastructure in Egypt is not exactly great :)
The biggest website in Egypt used to be the Thanaweya Amma (high school) results site, which topped out at half a million students; this time it is about 20 million citizens. I swear there was nothing that could have been done that wasn't done :)

There is no rigging, tampering, or any of the nonsense being said about the website. If a mistake happened (and that was once or twice at most), it was an ordinary, unintentional human error that was not in anyone's favour.

Last but not least (although I don't really know the difference), pray for us :) and vote for someone who will fix the country :)

That’s why I am in love with eSpace

  • We have no dress code; actually, I spent most of last summer wearing shorts!
  • Flexi-hours: join whenever you are ready to work!
  • Open Management Meeting: a weekly meeting that gathers the whole company, from the office boy to the CEO, to discuss anything regarding the company!
  • Our office boy is rarely to be found. He's really funny: when you ask him for a drink he tells you “la2 kfaya 3alaik kda el naharda” (“no, that's enough for you today”), or when you ask him “7atet kam ma3la2et sukar” (“how many spoons of sugar did you put in?”) he replies “eshrab we mate2la2sh howa tamam kda” (“just drink it and don't worry, it's fine as it is”), and at the same time he's treated just like any one of us.
  • There's no manager's office; it's the same workspace for all of us.
  • It's never about how old you are or how much experience you have; it's normal to find the CTO listening to the newest junior and telling him “I'll try your way”.
  • Same for the CEO: he listens to everyone in the management meetings and applies the true meaning of teamwork. I can barely remember any decision he took that the others didn't agree with.
  • OPENNESS: our door is open to almost anyone looking for consultation. We have no top secrets :) we always publish our technical tips and tricks.
  • TECH-TALK: a weekly talk presented by different employees about the latest technology trends.
  • There are no specific people for specific tasks; it's normal to find a junior working on a critical part of a critical project. It's about trusting your team.
  • The fun room and PES community :D; the Xbox is almost never turned off! And it's really funny to hear a junior telling one of the co-founders “e7na mal3ebnash el naharda :(” (“we haven't played today :(”).
  • Salaries: they pay almost the best salaries in Alexandria; the owners think less about their own revenue than about keeping the employees satisfied.
  • In hard times, you'll find many of us ready to work more than 14 hours to help other teams.
  • LOL, and of course working from home :) I thought you would have guessed it after everything I mentioned.
  • No one is looking for personal credit; we are all looking for eSpace credit :)
  • It was my first job and I think it will be my last too.
  • Recruitment is not restricted to any specific education degree, gender, or religion.
  • Our clients are our friends; they even join our Xbox PES community :)

Finally, after all this I think you can understand how hard it is to leave such a comfort zone!

Oops, if you don’t know what eSpace is, it’s a technology company based in Alexandria, Egypt, where I work: http://www.espace.com.eg

Using Scrapy with different / many proxies

In the previous post (Using Scrapy with proxies), I mentioned how to use a SINGLE proxy with Scrapy.

Now, what if you have several different proxies? Here are a few simple changes to make that work.

1. Add a new list with your proxies to your settings file (settings.py) as follows:

PROXIES = [{'ip_port': 'xx.xx.xx.xx:xxxx', 'user_pass': 'foo:bar'},
           {'ip_port': 'PROXY2_IP:PORT_NUMBER', 'user_pass': 'username:password'},
           {'ip_port': 'PROXY3_IP:PORT_NUMBER', 'user_pass': ''},]

2. Update your middlewares.py file to the following:

import base64
import random

from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random proxy from the list for every request
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # Only send the Proxy-Authorization header when credentials are set;
        # an empty 'user_pass' means the proxy needs no authentication
        if proxy['user_pass']:
            # .strip() removes the trailing newline that encodestring adds
            encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

That’s it :) !

Using Scrapy with proxies

I’m currently working on scraping some websites for B-kam.com. I used to develop in PHP, but when I searched for the best scraping / crawling framework, I found that Scrapy (written in Python) is the best.

You can read more about it and how to get started here:
http://readthedocs.org/docs/scrapy/en/latest/index.html

I searched a lot for how to use proxies with Scrapy but couldn’t find a simple / straightforward way to do it. Everyone talks about middlewares and the Request object, but not about how to use them.

So, here are the steps to use Scrapy with proxies:

1 – Create a new file called “middlewares.py” in your Scrapy project and add the following code to it.

# Importing the base64 library because we'll need it ONLY if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        # (remove them otherwise)
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (.strip() removes the trailing newline that encodestring adds)
        encoded_user_pass = base64.encodestring(proxy_user_pass).strip()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2 – Open your project’s configuration file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should be passed through this proxy. Simple, isn’t it?

If you want to test it, just create a new spider named test and add the following code:

from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following url is subject to change, you can get the latest one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://automation.whatismyip.com/n09230945.asp"]

    def parse(self, response):
        # Save the response body so we can check which IP the site saw
        open('test.html', 'wb').write(response.body)

Then cat test.html to find the IP.

For cheap / reasonable proxies, try the following websites:
http://proxymesh.com/pricing/
http://squidproxies.com

References:
http://snippets.scrapy.org/snippets/32/
https://groups.google.com/forum/?fromgroups#!msg/scrapy-users/mX9d05qcZw8/RkjWkqBT-HIJ
