Issue
I'm trying to run a scraper of which the output log ends as follows:
2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
'downloader/request_count': 32902,
'downloader/request_method_count/GET': 32902,
'downloader/response_bytes': 117633316,
'downloader/response_count': 32902,
'downloader/response_status_count/200': 121,
'downloader/response_status_count/429': 32781,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
'log_count/DEBUG': 32903,
'log_count/INFO': 32815,
'request_depth_max': 2,
'response_received_count': 32902,
'scheduler/dequeued': 32902,
'scheduler/dequeued/memory': 32902,
'scheduler/enqueued': 32902,
'scheduler/enqueued/memory': 32902,
'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)
In short, of the 32,902 requests, only 121 are successful (response code 200) whereas the remainder receives 429 for 'too many requests' (cf. https://httpstatuses.com/429).
Are there any recommended ways to get around this? To start with, I'd like to have a look at the details of the 429
response rather than just ignoring it, as it may contain a Retry-After header indicating how long to wait before making a new request.
Also, if the requests are made using Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it may be possible to implement retry middleware which makes Tor change its IP address when this occurs. Are there any public examples of such code?
Solution
Wow, your scraper is going really fast, over 30,000 requests in 30 minutes. That's more than 10 requests per second.
Such a high volume will trigger rate limiting on bigger sites and will completely bring down smaller sites. Don't do that.
Also this might even be too fast for privoxy and tor, so these might also be candidates for those replies with a 429.
Solutions:
Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so you do at max 1 request per second. Then increase these values step by step and see what happens. It might sound paradox, but you might be able to get more items and more 200 response by going slower.
If you are scraping a big site try rotating proxies. The tor network might be a bit heavy handed for this in my experience, so you might try a proxy service like Umair is suggesting
Answered By - Done Data Solutions
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.