Issue
I have a rather simple spider that loads URLs from files (that part works) and should then crawl them and archive the HTML responses.
It worked nicely before, and for days now I haven't been able to figure out what I changed to make it stop working. The spider now only crawls the first page of every URL and then stops:
'finish_reason': 'finished',
Spider:
import logging
from urllib.parse import urlparse

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Sourcemanagement and write_html_file are project-specific helpers (not shown here).


class TesterSpider(CrawlSpider):
    name = 'tester'
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(allow=(),
                           deny=(r'.*Zahlung.*', r'.*Cookies.*', r'.*Login.*', r'.*Datenschutz.*',
                                 r'.*Registrieren.*', r'.*Kontaktformular.*')),
             callback='parse_item'),
    )

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)

    def start_requests(self):
        logging.log(logging.INFO, "======== Starting with start_requests")
        self._compile_rules()
        smgt = Sourcemanagement()
        rootdir = smgt.get_root_dir()
        file_list = smgt.list_all_files(rootdir + "/sources")
        links = smgt.get_all_domains()
        links = list(set(links))
        request_list = []
        for link in links:
            o = urlparse(link)
            result = '{uri.netloc}'.format(uri=o)
            self.allowed_domains.append(result)
            request_list.append(Request(url=link, callback=self.parse_item))
        return request_list

    def parse_item(self, response):
        item = {}
        self.write_html_file(response)
        return item
And the settings:
BOT_NAME = 'crawlerscrapy'
SPIDER_MODULES = ['crawlerscrapy.spiders']
NEWSPIDER_MODULE = 'crawlerscrapy.spiders'
USER_AGENT_LIST = "useragents.txt"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 150
DOWNLOAD_DELAY = 43
CONCURRENT_REQUESTS_PER_DOMAIN = 1
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Accept-Language': 'de',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
}
AUTOTHROTTLE_ENABLED = False
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'DEBUG'
DEPTH_LIMIT = 0
DOWNLOAD_TIMEOUT = 15
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
Any idea what I'm doing wrong?
EDIT:
I found the answer:
request_list.append(Request(url=link, callback=self.parse_item))
# to be replaced by:
request_list.append(Request(url=link, callback=self.parse))
But I don't really understand why.
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.parse
So I can return an empty dict in parse_item, but I shouldn't, because it would break the flow of things?
Solution
CrawlSpider.parse is the method that takes care of applying your rules to a response. Only responses you send to CrawlSpider.parse will get your rules applied, generating additional responses.
By yielding a request with a different callback, you are specifying that you don't want rules to be applied to the response to that request.
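In other words, the fix from the edit works because the start requests then flow through CrawlSpider.parse. A minimal sketch of that start_requests, assuming the Sourcemanagement helper and get_all_domains() from the question; leaving the callback out entirely has the same effect, since self.parse is the default callback:
def start_requests(self):
    smgt = Sourcemanagement()  # project helper from the question
    for link in set(smgt.get_all_domains()):
        self.allowed_domains.append(urlparse(link).netloc)
        # No explicit callback: Scrapy falls back to self.parse, which for a
        # CrawlSpider is the method that applies the rules to the response.
        yield Request(url=link)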
The right place to put your parse_item callback when using a CrawlSpider subclass (as opposed to a Spider) is your rules. You already did that.
If what you want is to have responses to your start requests be handled both by rules and by a different callback, you might be better off using a regular spider. CrawlSpider is a very specialized spider with a limited set of use cases; as soon as you need to do something it doesn't support, you need to switch to a regular spider.
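For completeness, a hedged sketch of what such a regular spider could look like, reusing the deny patterns and the (assumed, project-specific) Sourcemanagement and write_html_file helpers from the question:
from urllib.parse import urlparse

from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor


class TesterSpider(Spider):
    name = 'tester'
    allowed_domains = []

    link_extractor = LinkExtractor(
        deny=(r'.*Zahlung.*', r'.*Cookies.*', r'.*Login.*',
              r'.*Datenschutz.*', r'.*Registrieren.*', r'.*Kontaktformular.*'))

    def start_requests(self):
        smgt = Sourcemanagement()  # project helper from the question
        for link in set(smgt.get_all_domains()):
            self.allowed_domains.append(urlparse(link).netloc)
            yield Request(url=link)  # default callback is self.parse

    def parse(self, response):
        # Archive every response, including the ones to the start requests ...
        self.write_html_file(response)
        # ... and extract and follow links manually, since a plain Spider
        # does not apply CrawlSpider rules for you.
        for link in self.link_extractor.extract_links(response):
            yield Request(url=link.url, callback=self.parse)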
Answered By - Gallaecio