Issue
I have a rather simple spider that loads URLs from files (that part works) and should then crawl them and archive the HTML responses.
It worked nicely before, and for days now I haven't been able to figure out what I changed to make it stop working. The spider now only crawls the first page of every URL and then stops:
'finish_reason': 'finished',
Spider:
import logging
from urllib.parse import urlparse

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Sourcemanagement and write_html_file are project-specific helpers (not shown here).


class TesterSpider(CrawlSpider):
    name = 'tester'
    allowed_domains = []
    rules = (
        Rule(LinkExtractor(allow=(),
                           deny=(r'.*Zahlung.*', r'.*Cookies.*', r'.*Login.*', r'.*Datenschutz.*',
                                 r'.*Registrieren.*', r'.*Kontaktformular.*')),
             callback='parse_item'),
    )

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)

    def start_requests(self):
        logging.log(logging.INFO, "======== Starting with start_requests")
        self._compile_rules()
        smgt = Sourcemanagement()
        rootdir = smgt.get_root_dir()
        file_list = smgt.list_all_files(rootdir + "/sources")
        links = smgt.get_all_domains()
        links = list(set(links))
        request_list = []
        for link in links:
            o = urlparse(link)
            result = '{uri.netloc}'.format(uri=o)
            self.allowed_domains.append(result)
            request_list.append(Request(url=link, callback=self.parse_item))
        return request_list

    def parse_item(self, response):
        item = {}
        self.write_html_file(response)
        return item
And the settings:
BOT_NAME = 'crawlerscrapy'
SPIDER_MODULES = ['crawlerscrapy.spiders']
NEWSPIDER_MODULE = 'crawlerscrapy.spiders'
USER_AGENT_LIST = "useragents.txt"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 150
DOWNLOAD_DELAY = 43
CONCURRENT_REQUESTS_PER_DOMAIN = 1
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Accept-Language': 'de',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
}
AUTOTHROTTLE_ENABLED = False
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'DEBUG'
DEPTH_LIMIT = 0
DOWNLOAD_TIMEOUT = 15
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
Any idea what I'm doing wrong?
EDIT:
I found the answer:
request_list.append(Request(url=link, callback=self.parse_item))
# to be replaced by:
request_list.append(Request(url=link, callback=self.parse))
But I don't really understand why.
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.parse
So I can return an empty dict in parse_item, but I shouldn't, because it would break the flow of things?
Solution
CrawlSpider.parse is the method that takes care of applying your rules to a response. Only responses you send to CrawlSpider.parse will get your rules applied, generating additional responses.
By yielding a request with a different callback, you are specifying that you don't want rules to be applied to the response to that request.
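In other words, the fix from the edit works because the start requests then flow through CrawlSpider.parse. A minimal sketch of that start_requests, assuming the Sourcemanagement helper and get_all_domains() from the question; leaving the callback out entirely has the same effect, since self.parse is the default callback:
def start_requests(self):
    smgt = Sourcemanagement()  # project helper from the question
    for link in set(smgt.get_all_domains()):
        self.allowed_domains.append(urlparse(link).netloc)
        # No explicit callback: Scrapy falls back to self.parse, which for a
        # CrawlSpider is the method that applies the rules to the response.
        yield Request(url=link)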
The right place to put your parse_item callback when using a CrawlSpider subclass (as opposed to a Spider) is your rules. You already did that.
If what you want is to have responses to your start requests be handled both by rules and by a different callback, you might be better off using a regular spider. CrawlSpider is a very specialized spider with a limited set of use cases; as soon as you need to do something it doesn't support, you need to switch to a regular spider.
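For completeness, a hedged sketch of what such a regular spider could look like, reusing the deny patterns and the (assumed, project-specific) Sourcemanagement and write_html_file helpers from the question:
from urllib.parse import urlparse

from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor


class TesterSpider(Spider):
    name = 'tester'
    allowed_domains = []

    link_extractor = LinkExtractor(
        deny=(r'.*Zahlung.*', r'.*Cookies.*', r'.*Login.*',
              r'.*Datenschutz.*', r'.*Registrieren.*', r'.*Kontaktformular.*'))

    def start_requests(self):
        smgt = Sourcemanagement()  # project helper from the question
        for link in set(smgt.get_all_domains()):
            self.allowed_domains.append(urlparse(link).netloc)
            yield Request(url=link)  # default callback is self.parse

    def parse(self, response):
        # Archive every response, including the ones to the start requests ...
        self.write_html_file(response)
        # ... and extract and follow links manually, since a plain Spider
        # does not apply CrawlSpider rules for you.
        for link in self.link_extractor.extract_links(response):
            yield Request(url=link.url, callback=self.parse)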
Answered By - Gallaecio