Issue
I have the following simple spider composed of three files.
My goal is to use an item loader correctly to populate the data I'm currently scraping.
pipeline.py is a simple JSON file writer, as explained in the Scrapy documentation.
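The pipeline file itself is not reproduced here. A minimal sketch of such a JSON-writing pipeline, modeled on the example in the Scrapy documentation, might look like the following. The class name, the `path` parameter, and the default file name `items.jl` are assumptions made for illustration, not the asker's actual code.

```python
import json


class JsonWriterPipeline:
    """Write each scraped item to a file as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # `path` is a hypothetical parameter; Scrapy instantiates the
        # pipeline without arguments, so the default file name applies.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict() accepts plain dicts and scrapy.Item instances alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline like this only runs once it is listed in the project's `ITEM_PIPELINES` setting.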
items.py
from scrapy.loader import ItemLoader


class FoodItem(scrapy.Item):
    brand = scrapy.Field(
        input_processor=TakeFirst(),
        output_processor=Identity()
    )
    name = scrapy.Field(
        input_processor=TakeFirst(),
        output_processor=Identity()
    )
    description = scrapy.Field(
        input_processor=TakeFirst(),
        output_processor=Identity()
    )
    last_updated = scrapy.Field()
spider.py
class MySpider(CrawlSpider):
    name = 'Test'
    allowed_domains = ['zooplus.fr']
    start_urls = [
        'https://www.zooplus.fr/shop/chats/aliments_specifiques_therapeutiques_chat/problemes_urinaires_renaux_chat/croquettes_therapeutiques_chat/595867',
    ]

    def parse_item(self, response):
        l = ItemLoader(item=PetfoodItem(), response=response)
        l.add_xpath('brand', '//*[@id="js-breadcrumb"]/li[4]/a/span/text()')
        l.add_xpath('name', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/h1/text()')
        l.add_xpath('description', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/div[1]/meta/@content')
        l.add_value('url', response.url)
        l.add_value('last_updated', 'today')
        l.load_item()
        return l.load_item()
If I copy the spider's code into the shell and run it manually, I get exactly the data I want. The XPath expressions are certainly correct, because they come from a hardcoded, working spider that I now want to refine with pipelines and an item loader. I can't see where the mistake is, although it looks pretty straightforward. Any ideas are welcome.
Solution
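You are using CrawlSpider incorrectly. CrawlSpider is meant for following links through its rules attribute, and it only invokes a callback such as parse_item for links matched by a rule. Your spider defines no rules, so parse_item is never called.
If you want to crawl a single product, just stick to the original Spider base class, whose default callback for start_urls is parse.
Changes are marked with ^: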
from scrapy import Spider
from scrapy.loader import ItemLoader


class MySpider(Spider):
#              ^^^^^^
    name = 'zooplus'
    allowed_domains = ['zooplus.fr']
    start_urls = [
        'https://www.zooplus.fr/shop/chats/aliments_specifiques_therapeutiques_chat/problemes_urinaires_renaux_chat/croquettes_therapeutiques_chat/595867',
    ]

    def parse(self, response):
    #   ^^^^^
        # A plain dict serves as the item here, so no field processors apply.
        l = ItemLoader(item=dict(), response=response)
        l.add_xpath('brand', '//*[@id="js-breadcrumb"]/li[4]/a/span/text()')
        l.add_xpath('name', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/h1/text()')
        l.add_xpath('description', '//*[@id="js-product__detail"]/div[1]/div[2]/div[1]/div[1]/meta/@content')
        l.add_value('url', response.url)
        l.add_value('last_updated', 'today')
        return l.load_item()
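To illustrate why the original parse_item was never reached, here is a deliberately simplified, Scrapy-free model of callback dispatch. The classes `ToySpider` and `ToyCrawlSpider`, the `crawl` method, and the shape of `rules` are all hypothetical stand-ins invented for this sketch. They are not Scrapy's implementation and only mimic the relevant behavior.

```python
class ToySpider:
    """Toy stand-in for a plain spider: every start URL goes to parse()."""

    start_urls = []

    def crawl(self):
        items = []
        for url in self.start_urls:
            items.extend(self.parse(url))
        return items

    def parse(self, url):
        return []


class ToyCrawlSpider(ToySpider):
    """Toy stand-in for a rule-driven spider: parse() only handles rules."""

    rules = []  # hypothetical shape: (link, callback_name) pairs

    def parse(self, url):
        # Only links listed in a rule are passed to the named callback.
        items = []
        for link, callback_name in self.rules:
            items.extend(getattr(self, callback_name)(link))
        return items


class BrokenSpider(ToyCrawlSpider):
    # Mirrors the question: a parse_item callback, but no rule points at it.
    start_urls = ["https://example.com/product"]

    def parse_item(self, url):
        return [{"url": url}]


class FixedSpider(ToySpider):
    # Mirrors the answer: the default parse() callback does the work.
    start_urls = ["https://example.com/product"]

    def parse(self, url):
        return [{"url": url}]


print(BrokenSpider().crawl())  # prints []
print(FixedSpider().crawl())   # prints [{'url': 'https://example.com/product'}]
```

In this model, as in the question, the broken spider yields nothing because no rule ever routes a response to parse_item.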
Answered By - Granitosaurus