Thursday, November 11, 2021

[FIXED] Save Item in Scrapy only at the end of the iterations

Tags: python, scrapy, web-scraping

Issue

I have a problem with the code flow. I'm trying to:

- On page 1, iterate through a table and collect some information about each row's item, including the link to page 2 for that item.

- Go to the page 2 link captured before and iterate through its table to get, for every row, the link to page 3, which contains the valuable information for that row.

This is the code:

def parse(self, response):
    row = response.xpath("//*[@id='lista-table']//tr")
    # skip the first element (the header row)
    for sel in row[1:]:
        l = ItemLoader(item=ComicscraperItem(), selector=sel)
        l.add_xpath('titleEdition', './td[1]//a/text()')
        l.add_xpath('linkEdition', './/td[1]//a/@href')
        l.add_xpath('year', './td[2]/text()')
        l.add_xpath('numbers', './td[3]/text()')
        l.add_xpath('publisher', './td[5]/text()')

        link_2page = 'https://site//' + sel.xpath('./td[1]//a/@href').get()
        yield scrapy.Request(link_2page, callback=self.parse_2page, meta={'l': l})

def parse_2page(self, response):
    l = response.meta['l']
    l.add_xpath('detailsEdition', "./div[@class='dettagli_testo']/text()")

    row2page = response.xpath("//*[@id='lista-table']//tr")
    for sel2page in row2page[1:]:
        n3page = sel2page.xpath('.//td[1]//a/@href').get()

        link_3page = 'https://site//' + n3page
        yield scrapy.Request(link_3page, callback=self.parse_3page, meta={'l': l})

def parse_3page(self, response):
    l = response.meta['l']
    l.add_value('issueTitle', response.xpath("//div[@id='intestazione']//h1/text()").get())
    return l.load_item()

I get an output like this:

{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t"]},
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t"]},
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t", "Zona X # 41\t\t\t\t"]},
{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t", "Zona X # 41\t\t\t\t", "Zona X # 43\t\t\t\t"]},...

Instead, I'm trying to achieve an output like this:

{"titleEdizione": ["ZONA X"], "linkEdizione": ["serie/ZONAX"], "year": ["1992"], "numbers": ["45"], "publisher": ["Sergio Bonelli Editore"], "issueTitle": ["Zona X # 45\t\t\t\t", "Zona X # 44\t\t\t\t", "Zona X # 42\t\t\t\t", "Zona X # 41\t\t\t\t", "Zona X # 43\t\t\t\t"]},...

How can I do that? Can someone explain the correct Scrapy workflow to me?

Solution

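Every request yielded in parse_2page carries the same item loader object (l), so each parse_3page callback appends its issueTitle to that one shared loader. That is why the issueTitle list keeps growing from one output item to the next. Instead of reusing the same object, you need to instantiate a new item loader object for each row in the for loop, similar to what you did in the initial parse function.

You could use Python’s copy.deepcopy for this. Add from copy import deepcopy at the top of the module and, at the end of parse_2page, switch meta={'l': l} to meta={'l': deepcopy(l)}.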
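
The difference between sharing one loader and copying it per request can be seen without running a crawl. The following is only a minimal sketch in plain Python: FakeLoader and crawl are hypothetical stand-ins invented for illustration (they are not part of Scrapy), and the loop that "runs" the callbacks only imitates how Scrapy would call parse_3page once per scheduled request.

```python
from copy import deepcopy


class FakeLoader:
    """Hypothetical stand-in for an item loader: collects values per field."""

    def __init__(self):
        self.values = {}

    def add_value(self, field, value):
        self.values.setdefault(field, []).append(value)

    def load_item(self):
        # Return a snapshot of everything collected so far.
        return {field: list(vals) for field, vals in self.values.items()}


def crawl(issue_titles, copy_loader):
    """Imitate parse_2page scheduling one parse_3page callback per row."""
    loader = FakeLoader()
    loader.add_value("titleEdition", "ZONA X")

    # "Schedule" one request per row, each carrying a loader in its meta.
    pending = []
    for title in issue_titles:
        carried = deepcopy(loader) if copy_loader else loader
        pending.append((title, carried))

    # "Run" the callbacks afterwards, one per scheduled request.
    items = []
    for title, carried in pending:
        carried.add_value("issueTitle", title)
        items.append(carried.load_item())
    return items


titles = ["Zona X # 45", "Zona X # 44", "Zona X # 42"]
shared = crawl(titles, copy_loader=False)
copied = crawl(titles, copy_loader=True)

print([len(item["issueTitle"]) for item in shared])  # [1, 2, 3]
print([len(item["issueTitle"]) for item in copied])  # [1, 1, 1]
```

With the shared loader, the issueTitle list grows with every callback, which matches the output shown in the question. With a deep copy per request, each callback works on its own loader, so in this toy model every resulting item holds the edition fields plus exactly one issueTitle.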

Answered By - Gallaecio

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
