Issue
I want to get all bullet points from an Amazon product page with scrapy (e.g. Amazon link); however, their number varies. I end up using something like this:
def parse(self, response):
    t = response
    url = t.request.url
    yield {
        'bullets_no': len(t.xpath('//div[@id="feature-bullets"]//li/span/text()')),
        'bullet_1': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[0].get().strip(),
        'bullet_2': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[1].get().strip(),
        'bullet_3': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[2].get().strip(),
        'bullet_4': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[3].get().strip(),
        'bullet_5': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[4].get().strip(),
        ...
    }
However, in plain Python I would simply do something like this, and it would adjust automatically:
bullets = t.xpath('//div[@id="feature-bullets"]//li/span/text()')
for i, bullet in enumerate(bullets):
    row[f'Bullet_{i+1}'] = bullet.strip()
Is it possible to create yielded fields like this in scrapy?
Solution
Yes, this is covered in detail in the scrapy tutorial, which I highly suggest reading.
The return type of both the response.css and response.xpath calls is a SelectorList object, which you can iterate just like a regular Python list. From the scrapy tutorial:
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
So using your example you could do something like this:
def parse(self, response):
    item = {'url': response.url}
    bullets = response.xpath('//div[@id="feature-bullets"]//li/span/text()')
    for i, bullet in enumerate(bullets, start=1):
        item[f'bullet_{i}'] = bullet.get().strip()
    item['bullet_no'] = len(bullets)
    yield item
As mentioned in a previous answer, there is also the getall method that you can call on a selector list:
The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all.
I suggest giving the Extracting Data and Extracting Quotes and Authors sections of the scrapy docs tutorial a read to find out more.
Answered By - Alexander