Issue
Below is the code with which I am trying to extract 3 values (UPC, Price & Availability) from this website: https://books.toscrape.com/. I am using the Scrapy CrawlSpider, but it is returning None for the extracted values. What I am trying to achieve with this code: go inside every book on the 1st page and extract the three values mentioned above. Code is below:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (Rule(LinkExtractor(restrict_xpaths='//h3/a'), callback='parse_item', follow=True))

    def parse_item(self, response):
        product_info = response.xpath('//table[@class="table table-striped"]')
        upc = product_info.xpath('(./tbody/tr/td)[1]/text()').get()
        price = product_info.xpath('(./tbody/tr/td)[3]/text()').get()
        availability = product_info.xpath('(./tbody/tr/td)[6]/text()').get()
        yield {'UPC': upc, 'Price': price, 'Availability': availability}
        # print(response.url)
Solution
Your program actually raises TypeError: 'Rule' object is not iterable rather than returning None; see this answer: https://stackoverflow.com/a/53343029/18857676. The cause is that rules = (Rule(...)) is not a tuple at all: in Python, a one-element tuple requires a trailing comma.
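The trailing-comma rule can be demonstrated in plain Python, no Scrapy needed (the string below is just a stand-in for a Rule object):

```python
rule = "Rule"          # stand-in for a scrapy Rule object

not_a_tuple = (rule)   # parentheses alone are just grouping: still the object itself
a_tuple = (rule,)      # the trailing comma is what makes a one-element tuple

print(type(not_a_tuple).__name__)  # str
print(type(a_tuple).__name__)      # tuple
```

CrawlSpider iterates over its rules attribute, so the first form blows up with the TypeError above as soon as the spider starts.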
My modified and optimized code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3/a"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Grab the text of every td in the product-information table directly
        product_info = response.xpath('//table[@class="table table-striped"]//td/text()')
        temp = [i.extract().strip() for i in product_info]
        upc = temp[0]
        price = temp[2]
        availability = temp[5]
        return {'UPC': upc, 'Price': price, 'Availability': availability}
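One caveat with indexing into the flattened td list: it silently returns the wrong field if the table layout ever changes. A more defensive sketch (table_to_dict is a hypothetical helper, not part of Scrapy) pairs each row's header text with its value, assuming the usual one-th-one-td rows of the product-information table:

```python
def table_to_dict(headers, values):
    """Pair each table header (th text) with its td value, stripping whitespace."""
    return {h.strip(): v.strip() for h, v in zip(headers, values)}

# Example with text as it appears in a books.toscrape.com product table
headers = ["UPC", "Product Type", "Price (excl. tax)"]
values = ["a897fe39b1053632", "Books", "£51.77"]
info = table_to_dict(headers, values)
print(info["UPC"])  # a897fe39b1053632
```

Inside parse_item this would be fed with response.xpath('//table[@class="table table-striped"]//th/text()').getall() and the matching //td/text() list, and the item is then built by header name instead of by position.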
Answered By - Sepu Ling