Issue
So my issue is that I have this CrawlSpider:

    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']

    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }

    start_urls = [
        'https://www.industrialnetworking.com/Manufacturers/Hirschmann'
    ]

    rules = (
        Rule(LinkExtractor(restrict_css='div.catCell a::attr(href)'), follow=True),
        Rule(LinkExtractor(allow=r"/Manufacturers/Hirschmann*"), callback='parse_new_item'),
    )
I am trying to hit the product pages of all "Hirschmann" products. I understand that my error is in the second Rule, where I allow anything matching Hirschmann*, but I am unsure how to pass a response.css/response.xpath expression as an argument to allow.
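(As an aside, the allow argument is matched as a regular expression against the URL, not as a shell-style glob: the trailing * binds only to the final "n", so the pattern matches category and product URLs alike, which is part of why that Rule cannot discriminate between them. A quick check using the example URLs from this question:)

```python
import re

# In a regex, "*" applies only to the preceding character, so this
# matches "Hirschman" followed by zero or more "n" characters.
pattern = r"/Manufacturers/Hirschmann*"

category = "https://www.industrialnetworking.com/Manufacturers/Hirschmann-Rail-Switches"
product = ("https://www.industrialnetworking.com/Manufacturers/"
           "Hirschmann-SPIDER-III-Rail-Switches/Hirschmann-SSL20-5TX-Rail-Switch-942-132-001")

# Both kinds of URL match, so the rule fires for both.
print(bool(re.search(pattern, category)))  # True
print(bool(re.search(pattern, product)))   # True
```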
Ideally, the crawler would follow all "div.catCell a::attr(href)" links and recurse through them until it detects "response.css('td.cellDesc h2 a::attr(href)')", at which point it would send that link to my "parse_new_item". If no such link is found, it should continue following all "div.catCell a::attr(href)" links.
Example URL travel path ->
StartURL: https://www.industrialnetworking.com/Manufacturers/Hirschmann
Category: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Rail-Switches
SubCategory: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Switches-Unmanaged
Series: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-Family-Rail-Switches
END GOAL ->
Product: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-III-Rail-Switches/Hirschmann-SSL20-5TX-Rail-Switch-942-132-001
EDIT - The reason I am targeting the xpath/css path is that the links do not have any obvious pattern I can use to target the URLs.
Thanks everyone!
Solution
I personally am not a big fan of the CrawlSpider. There are some cases when it is convenient, but I think in your situation sticking to crawling the links manually might be an easier approach.
Since you have multiple pages with the same format, what you can do is feed each of the links back into the main parse method until it finds links that match the td/h2/a pattern, at which point it can assign a different callback that parses the final product page using your parse_new_item method.
For example:
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'recursiveSpider'
        allowed_domains = ['industrialnetworking.com']
        start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

        def parse(self, response):
            # Feed category/subcategory links back into this same method.
            for url in response.xpath("//div[@class='catCell']/a/@href").getall():
                yield scrapy.Request(response.urljoin(url), callback=self.parse)
            # Hand product links off to parse_new_item.
            for url in response.xpath("//td[@class='cellDesc']/h2/a/@href").getall():
                yield scrapy.Request(response.urljoin(url), callback=self.parse_new_item)

        def parse_new_item(self, response):
            print(response)
            item_name = response.xpath("//div[@id='itmNam']/h1/text()").get()
            item = {"name": item_name}
            yield item
The output is really long so I just put the final tally below.
OUTPUT
<200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
2022-09-14 13:32:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
{'name': 'GPS1-KSY9HH Power Supply'}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-14 13:32:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 380892,
'downloader/request_count': 483,
'downloader/request_method_count/GET': 483,
'downloader/response_bytes': 9139340,
'downloader/response_count': 483,
'downloader/response_status_count/200': 471,
'downloader/response_status_count/429': 12,
'elapsed_time_seconds': 22.988552,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 9, 14, 20, 32, 12, 287889),
'httpcompression/response_bytes': 41356802,
'httpcompression/response_count': 471,
'httperror/response_ignored_count': 4,
'httperror/response_ignored_status_count/429': 4,
'item_scraped_count': 401,
'log_count/DEBUG': 889,
'log_count/ERROR': 4,
'log_count/INFO': 14,
'request_depth_max': 5,
'response_received_count': 475,
'retry/count': 8,
'retry/max_reached': 4,
'retry/reason_count/429 Unknown Status': 8,
'scheduler/dequeued': 483,
'scheduler/dequeued/memory': 483,
'scheduler/enqueued': 483,
'scheduler/enqueued/memory': 483,
'start_time': datetime.datetime(2022, 9, 14, 20, 31, 49, 299337)}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Spider closed (finished)
Answered By - Alexander