Issue
Below is the code with which I am trying to extract 3 values (UPC, Price & Availability) from this website: https://books.toscrape.com/. I am using the Scrapy CrawlSpider, but it is returning None for the extracted values. What I am trying to achieve with this code: go inside every book on the 1st page and extract the three values mentioned above. Code is below:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (Rule(LinkExtractor(restrict_xpaths='//h3/a'), callback='parse_item', follow=True))

    def parse_item(self, response):
        product_info = response.xpath('//table[@class="table table-striped"]')
        upc = product_info.xpath('(./tbody/tr/td)[1]/text()').get()
        price = product_info.xpath('(./tbody/tr/td)[3]/text()').get()
        availability = product_info.xpath('(./tbody/tr/td)[6]/text()').get()
        yield {'UPC': upc, 'Price': price, 'Availability': availability}
        # print(response.url)
Solution
Your program actually raises TypeError: 'Rule' object is not iterable rather than returning None; see this answer: https://stackoverflow.com/a/53343029/18857676. The cause is that rules = (Rule(...)) is not a tuple at all: in Python, a one-element tuple requires a trailing comma.
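The trailing-comma rule can be demonstrated in plain Python, no Scrapy needed (the string below is just a stand-in for a Rule object):

```python
rule = "Rule"          # stand-in for a scrapy Rule object

not_a_tuple = (rule)   # parentheses alone are just grouping: still the object itself
a_tuple = (rule,)      # the trailing comma is what makes a one-element tuple

print(type(not_a_tuple).__name__)  # str
print(type(a_tuple).__name__)      # tuple
```

CrawlSpider iterates over its rules attribute, so the first form blows up with the TypeError above as soon as the spider starts.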
My modified and optimized code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksSpider(CrawlSpider):
    name = "bookscraper"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//h3/a"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Grab the text of every td in the product-information table directly
        product_info = response.xpath('//table[@class="table table-striped"]//td/text()')
        temp = [i.extract().strip() for i in product_info]
        upc = temp[0]
        price = temp[2]
        availability = temp[5]
        return {'UPC': upc, 'Price': price, 'Availability': availability}
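One caveat with indexing into the flattened td list: it silently returns the wrong field if the table layout ever changes. A more defensive sketch (table_to_dict is a hypothetical helper, not part of Scrapy) pairs each row's header text with its value, assuming the usual one-th-one-td rows of the product-information table:

```python
def table_to_dict(headers, values):
    """Pair each table header (th text) with its td value, stripping whitespace."""
    return {h.strip(): v.strip() for h, v in zip(headers, values)}

# Example with text as it appears in a books.toscrape.com product table
headers = ["UPC", "Product Type", "Price (excl. tax)"]
values = ["a897fe39b1053632", "Books", "£51.77"]
info = table_to_dict(headers, values)
print(info["UPC"])  # a897fe39b1053632
```

Inside parse_item this would be fed with response.xpath('//table[@class="table table-striped"]//th/text()').getall() and the matching //td/text() list, and the item is then built by header name instead of by position.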
Answered By - Sepu Ling