Issue
I have a problem with the code flow. I'm trying to:
- On page 1, iterate through a table and collect some info about every row, including the link to page 2 for each item.
- Go to each page 2 captured before and iterate through its table to get, for every row, the link to page 3, which contains the valuable info for that row.
This is the code:
def parse(self, response):
    row = response.xpath("//*[@id='lista-table']//tr")
    # ignore first element
    for sel in row[1:]:
        l = ItemLoader(item=ComicscraperItem(), selector=sel)
        l.add_xpath('titleEdition', './td[1]//a/text()')
        l.add_xpath('linkEdition', './/td[1]//a/@href')
        l.add_xpath('year', './td[2]/text()')
        l.add_xpath('numbers', './td[3]/text()')
        l.add_xpath('publisher', './td[5]/text()')
        link_2page = 'https://site//' + sel.xpath('./td[1]//a/@href').get()
        yield scrapy.Request(link_2page, callback=self.parse_2page, meta={'l': l})

def parse_2page(self, response):
    l = response.meta['l']
    l.add_xpath('detailsEdition', "./div[@class='dettagli_testo']/text()")
    row2page = response.xpath("//*[@id='lista-table']//tr")
    for sel2page in row2page[1:]:
        n3page = sel2page.xpath('.//td[1]//a/@href').get()
        link_3page = 'https://site//' + n3page
        yield scrapy.Request(link_3page, callback=self.parse_3page, meta={'l': l})

def parse_3page(self, response):
    l = response.meta['l']
    l.add_value('issueTitle', response.xpath("//div[@id='intestazione']//h1/text()").get())
    return l.load_item()
I get an output like this:
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t"]},
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t"]},
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t", "Zona X # 41\t\t\t\t"]},
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t", "Zona X # 41\t\t\t\t", "Zona X # 43\t\t\t\t"]},...
Instead, I'm trying to achieve an output like this:
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t", "Zona X # 41\t\t\t\t", "Zona X # 43\t\t\t\t"]},...
How can I do that? Can someone explain the correct Scrapy workflow to me?
Solution
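Instead of reusing the same item loader object (l) in parse_2page, you need to instantiate a new item loader object for each row in the for loop, similar to what you did in the initial parse function. Because every request yielded from parse_2page carries the same loader, each parse_3page call appends its issueTitle to the same list, which is why the output keeps growing.
You could use Python’s copy.deepcopy for this. So, at the end of parse_2page, switch meta={'l': l} to meta={'l': deepcopy(l)} (this needs from copy import deepcopy at the top of the module).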
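The difference between sharing one loader and copying it per row can be sketched without Scrapy. This is only an illustration under assumptions: FakeLoader and crawl are hypothetical stand-ins invented here (they are not part of Scrapy, and FakeLoader only mimics the way a loader accumulates values per field), and the titles are taken from the output in the question.

```python
from copy import deepcopy


class FakeLoader:
    """Hypothetical stand-in for an item loader: it only collects values per field."""

    def __init__(self):
        self.values = {}

    def add_value(self, field, value):
        # Values for the same field accumulate in a list.
        self.values.setdefault(field, []).append(value)

    def load_item(self):
        # Return a snapshot of everything collected so far.
        return {field: list(vals) for field, vals in self.values.items()}


def crawl(issue_titles, copy_loader):
    """Simulate parse_2page handing a loader to one parse_3page call per row."""
    base = FakeLoader()
    base.add_value("titleEdition", "ZONA X")

    # "parse_2page": schedule one request per row, each carrying a loader.
    pending = []
    for title in issue_titles:
        loader = deepcopy(base) if copy_loader else base
        pending.append((loader, title))

    # "parse_3page": each callback adds its own title and emits an item.
    items = []
    for loader, title in pending:
        loader.add_value("issueTitle", title)
        items.append(loader.load_item())
    return items


titles = ["Zona X # 45", "Zona X # 44"]
shared = crawl(titles, copy_loader=False)  # same object reused: titles pile up
copied = crawl(titles, copy_loader=True)   # independent copies: one title each
```

In this sketch the shared run emits items whose issueTitle list grows with every callback, as in the output shown in the question, while the copied run emits one item per row with a single issueTitle each.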
Answered By - Gallaecio