Issue
So my issue is that I have this CrawlSpider:

    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']

    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }

    start_urls = [
        'https://www.industrialnetworking.com/Manufacturers/Hirschmann'
    ]

    rules = (
        Rule(LinkExtractor(restrict_css='div.catCell a::attr(href)'), follow=True),
        Rule(LinkExtractor(allow=r"/Manufacturers/Hirschmann*"), callback='parse_new_item'),
    )
I am trying to hit the product pages of all "Hirschmann" products. I understand that my error is in the second Rule, where I allow anything matching Hirschmann*, but I am unsure how to pass a response.css/response.xpath expression as an argument to allow.
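(As an aside, the allow argument is matched as a regular expression against the URL, not as a shell-style glob: the trailing * binds only to the final "n", so the pattern matches category and product URLs alike, which is part of why that Rule cannot discriminate between them. A quick check using the example URLs from this question:)

```python
import re

# In a regex, "*" applies only to the preceding character, so this
# matches "Hirschman" followed by zero or more "n" characters.
pattern = r"/Manufacturers/Hirschmann*"

category = "https://www.industrialnetworking.com/Manufacturers/Hirschmann-Rail-Switches"
product = ("https://www.industrialnetworking.com/Manufacturers/"
           "Hirschmann-SPIDER-III-Rail-Switches/Hirschmann-SSL20-5TX-Rail-Switch-942-132-001")

# Both kinds of URL match, so the rule fires for both.
print(bool(re.search(pattern, category)))  # True
print(bool(re.search(pattern, product)))   # True
```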
Ideally, the crawler would follow all "div.catCell a::attr(href)" links and recurse through them until it detects "response.css('td.cellDesc h2 a::attr(href)')", at which point it would send that link to my "parse_new_item". If no such link is found, it should continue following all "div.catCell a::attr(href)" links.
Example URL travel path ->
StartURL: https://www.industrialnetworking.com/Manufacturers/Hirschmann
Category: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Rail-Switches
SubCategory: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Switches-Unmanaged
Series: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-Family-Rail-Switches
END GOAL ->
Product: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-III-Rail-Switches/Hirschmann-SSL20-5TX-Rail-Switch-942-132-001
EDIT - The reason I am targeting the xpath/css path is that the links do not have any obvious pattern I can use to target the URLs.
Thanks everyone!
Solution
I personally am not a big fan of the CrawlSpider. There are some cases when it is convenient, but I think in your situation sticking to crawling the links manually might be an easier approach.
Since you have multiple pages with the same format, what you can do is feed each of the links back into the main parse method until it finds links that match the td/h2/a pattern, at which point it can assign a different callback that parses the final product page using your parse_new_item method.
For example:
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'recursiveSpider'
        allowed_domains = ['industrialnetworking.com']
        start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

        def parse(self, response):
            # Feed category/subcategory links back into this same method.
            for url in response.xpath("//div[@class='catCell']/a/@href").getall():
                yield scrapy.Request(response.urljoin(url), callback=self.parse)
            # Hand product links off to parse_new_item.
            for url in response.xpath("//td[@class='cellDesc']/h2/a/@href").getall():
                yield scrapy.Request(response.urljoin(url), callback=self.parse_new_item)

        def parse_new_item(self, response):
            print(response)
            item_name = response.xpath("//div[@id='itmNam']/h1/text()").get()
            item = {"name": item_name}
            yield item
The output is really long so I just put the final tally below.
OUTPUT
<200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
2022-09-14 13:32:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
{'name': 'GPS1-KSY9HH Power Supply'}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-14 13:32:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 380892,
'downloader/request_count': 483,
'downloader/request_method_count/GET': 483,
'downloader/response_bytes': 9139340,
'downloader/response_count': 483,
'downloader/response_status_count/200': 471,
'downloader/response_status_count/429': 12,
'elapsed_time_seconds': 22.988552,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 9, 14, 20, 32, 12, 287889),
'httpcompression/response_bytes': 41356802,
'httpcompression/response_count': 471,
'httperror/response_ignored_count': 4,
'httperror/response_ignored_status_count/429': 4,
'item_scraped_count': 401,
'log_count/DEBUG': 889,
'log_count/ERROR': 4,
'log_count/INFO': 14,
'request_depth_max': 5,
'response_received_count': 475,
'retry/count': 8,
'retry/max_reached': 4,
'retry/reason_count/429 Unknown Status': 8,
'scheduler/dequeued': 483,
'scheduler/dequeued/memory': 483,
'scheduler/enqueued': 483,
'scheduler/enqueued/memory': 483,
'start_time': datetime.datetime(2022, 9, 14, 20, 31, 49, 299337)}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Spider closed (finished)
Answered By - Alexander