Issue
I want to scrape a bunch of pages. Each page feeds a different data pot, and the pots are then matched later on.
[Page1]-Get-PostProcessing-Store-[Pot1]-+
[Page2]-Get-PostProcessing-Store-[Pot2]-+--Match---[ResultPage]-REST-API
[Page3]-Get-PostProcessing-Store-[Pot3]-+
...
Now I want the pipeline for each page to be as independent as possible. Some pages will need JavaScript scraping capabilities, some won't. Sometimes I also need to grab images, sometimes only PDFs.
I did a prototype for one page with Scrapy. I got the structure working, but I don't know how to "split" it up so that the spider and middleware are independent for each page. Beyond that: is lxml enough? How do I handle robots.txt and wait delays to avoid being blocked? Does it make sense to add a message queue?
What is the best way to implement all this? Please be specific! My main problems are the structure for organizing my code and which tools to use.
Solution
Whoa, lots of questions there. =)
It's hard to be specific for such a broad question, especially without knowing how familiar you are with the tool.
If I understood correctly, you have a spider and a middleware. I didn't get exactly what your middleware code is doing, but for a proof of concept I'd start with all the code in one spider (plus perhaps some util functions), leaving you free to use different callbacks for the different extraction techniques.
Once you have that working, then you can look into making a generic middleware if needed (premature abstraction is often just as bad as premature optimization).
Here are a few ideas:
For implementing different extraction code for each response
If you know beforehand which code you want to call for handling each request, just set the appropriate callback for that request:
import scrapy

# Callbacks defined as methods on the spider class:
def parse(self, response):
    yield scrapy.Request('http://example.com/file.pdf', self.handle_pdf)
    yield scrapy.Request('http://example.com/next_page', self.handle_next_page)

def handle_pdf(self, response):
    """Process the response for a PDF request."""

def handle_next_page(self, response):
    """Process the response for the next page."""
If you don't know beforehand, you can implement a callback that dispatches to the appropriate callbacks:
def parse(self, response):
    if self.should_grab_images(response):
        for it in self.grab_images(response):
            yield it
    if self.should_follow_links(response):
        for it in self.follow_links(response):
            yield it
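The should_grab_images / grab_images helpers above aren't Scrapy API, just methods you'd write yourself. Here's a minimal sketch of what they might look like, assuming the decision can be made from the URL and that a hypothetical handle_image callback stores the downloaded images:

def should_grab_images(self, response):
    # e.g. decide based on something in the URL or in the response itself
    return '/gallery/' in response.url

def grab_images(self, response):
    # yield one request per image URL, handled by a dedicated callback
    for src in response.css('img::attr(src)').extract():
        yield scrapy.Request(response.urljoin(src), self.handle_image)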
Is lxml enough?
Probably, yeah. However, it's a good idea to learn XPath, if you haven't already, to take full advantage of it. Here is a good starting point.
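For instance, here is a small sketch of XPath extraction inside a spider callback using Scrapy's selectors (which wrap lxml); the expressions and the handle_pdf callback are just illustrative:

def parse(self, response):
    # response.xpath() runs the expression through an lxml-backed selector
    titles = response.xpath('//h1/text()').extract()
    for href in response.xpath('//a[contains(@href, ".pdf")]/@href').extract():
        yield scrapy.Request(response.urljoin(href), self.handle_pdf)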
Unless you need to execute JavaScript code, in which case you might want to try plugging in Selenium/PhantomJS or Splash.
If you don't need to execute the JavaScript, but do need to parse data that lives inside JS code, you can use js2xml.
How do I handle robots and wait delays to avoid blocking?
To obey robots.txt, set ROBOTSTXT_OBEY to True.
To configure a delay, set DOWNLOAD_DELAY. You may also try out the AutoThrottle extension and look into the concurrent requests settings.
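A minimal settings.py sketch combining these; the numbers are only example values to tune for your targets:

ROBOTSTXT_OBEY = True               # respect robots.txt
DOWNLOAD_DELAY = 2.0                # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # limit parallelism per domain

AUTOTHROTTLE_ENABLED = True         # adjust the delay based on server load
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0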
Does it make sense to add a message queue?
Well, it depends on your use case, really. If you have a really big crawl (hundreds of millions of URLs or more), it might make sense.
But you already get a lot for free with standalone Scrapy, including a disk-based queue for when the available memory isn't enough to hold all pending URLs.
You can also configure the backends the scheduler uses for the memory and disk queues, or even swap out the scheduler entirely for your own version.
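For reference, a sketch of the relevant settings: JOBDIR persists pending requests to disk, and the queue/scheduler classes shown are Scrapy's defaults, which you can replace with your own (the directory name is just an example):

JOBDIR = 'crawls/my-crawl'          # persist the request queue to disk between runs

SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'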
Conclusion
I'd start with Scrapy and a working spider and iterate from that, improving where it's really needed.
I hope this helps.
Answered By - Elias Dorneles