Issue
I am using Scrapy and tried to use a proxy pool by creating a custom DownloaderMiddleware. I am having some trouble and would like some help here (I looked at the documentation on the Scrapy website, but there is no code example).
My python code is:
import random

class ProxyRotator(object):
    proxy_pool = ['ip1...', 'ip2...', 'ip3...']

    def process_request(self, request, spider):
        request.meta['proxy'] = "http://" + self.proxy_pool[random.randint(0, len(self.proxy_pool) - 1)] + ":80"
        return request
and in settings.py, I added:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pricecheck_crawler.ProxyMiddleware.ProxyRotator': 100,
}
Right now the crawler doesn't get anything from the site. The log shows:
2016-02-17 11:27:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-17 11:27:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6051
2016-02-17 11:28:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-17 11:29:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Solution
Try this. Remove the return request statement: returning the request sends it back through process_request again, so process_response is never called. Also make sure you use only HTTP or HTTPS proxies:
def process_request(self, request, spider):
    request.meta['proxy'] = self.proxy_pool[random.randint(0, len(self.proxy_pool) - 1)]
You can also change the settings to something like this:
DOWNLOADER_MIDDLEWARES = {
    'pricecheck_crawler.ProxyMiddleware.ProxyRotator': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
Also verify that each pool entry is a full proxy URL, i.e. request.meta['proxy'] ends up as "http://ip:port".
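Putting the pieces together, a minimal sketch of the corrected middleware could look like the following (the proxy addresses are placeholders, and random.choice is used as a slightly more idiomatic equivalent of indexing with random.randint):

```python
import random

# Hypothetical proxy pool -- each entry is a full "http://ip:port" URL,
# as required by Scrapy's HttpProxyMiddleware.
PROXY_POOL = [
    "http://10.0.0.1:80",
    "http://10.0.0.2:80",
    "http://10.0.0.3:80",
]

class ProxyRotator(object):
    """Downloader middleware that assigns a random proxy to each request."""

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy URL to the request.
        request.meta['proxy'] = random.choice(PROXY_POOL)
        # Return None (implicitly): the request then continues through the
        # remaining downloader middlewares. Returning the request here would
        # re-schedule it, and process_response would never be called.
```

Note that process_request deliberately returns nothing, which is how a downloader middleware signals "continue processing this request normally".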
Answered By - Rahul