Thursday, November 11, 2021

[FIXED] Populating data with scrapy's item loader works in shell but not in spider


Issue

I have the following simple spider composed of three files. My goal is to use an item loader correctly to populate the data I'm currently scraping. pipeline.py is a simple JSON file writer, as explained in the Scrapy documentation.

items.py

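import scrapy
from itemloaders.processors import Identity, TakeFirst
from scrapy.loader import ItemLoader

class FoodItem(scrapy.Item):
    # Each field keeps the first extracted value on input and passes it through on output.
    brand = scrapy.Field(
        input_processor=TakeFirst(),
        output_processor=Identity()
    )
    name = scrapy.Field(
        input_processor=TakeFirst(),
        output_processor=Identity()
    )

    description = scrapy.Field(
        input_processor=TakeFirst(),
        output_processor=Identity()
    )

    # Declared because the spider populates it with add_value('url', ...).
    url = scrapy.Field()

    last_updated = scrapy.Field()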

spider.py

from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

# FoodItem comes from items.py above; import it from your project's items module.

class MySpider(CrawlSpider):
    name = 'Test'
    allowed_domains = ['zooplus.fr']
    start_urls = [
        'https://www.zooplus.fr/shop/chats/aliments_specifiques_therapeutiques_chat/problemes_urinaires_renaux_chat/croquettes_therapeutiques_chat/595867',
    ]

    def parse_item(self, response):
        l = ItemLoader(item=FoodItem(), response=response)
        l.add_xpath('brand', '//*[@id="js-breadcrumb"]/li[4]/a/span/text()')
        l.add_xpath('name', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/h1/text()')
        l.add_xpath('description', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/div[1]/meta/@content')
        l.add_value('url', response.url)
        l.add_value('last_updated', 'today')

        return l.load_item()

If I run exactly the same code manually in the Scrapy shell, the item is populated as expected. The XPaths are certainly correct, because they come from an existing, working spider with hardcoded extraction that I now want to refine using pipelines and an item loader. I can't see where the mistake is, although it looks straightforward. Any ideas are welcome.

Solution

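You are using CrawlSpider incorrectly. A CrawlSpider only calls a callback such as parse_item when one of its rules names it; your spider defines no rules, so parse_item is never invoked and no item is produced.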

If you only want to scrape a single product page, stick to the plain Spider base class, whose default callback is parse. Changes are marked with ^:

from scrapy import Spider
from scrapy.loader import ItemLoader

class MySpider(Spider):
    #          ^^^^^^
    name = 'zooplus'
    allowed_domains = ['zooplus.fr']
    start_urls = [
        'https://www.zooplus.fr/shop/chats/aliments_specifiques_therapeutiques_chat/problemes_urinaires_renaux_chat/croquettes_therapeutiques_chat/595867',
    ]

    def parse(self, response):
    #   ^^^^^
        l = ItemLoader(item=dict(), response=response)
        l.add_xpath('brand', '//*[@id="js-breadcrumb"]/li[4]/a/span/text()')
        l.add_xpath('name', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/h1/text()')
        l.add_xpath('description', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/div[1]/meta/@content')
        l.add_value('url', response.url)
        l.add_value('last_updated', 'today')
        return l.load_item()

Answered By - Granitosaurus

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
