Friday, August 19, 2022

[FIXED] how to scrape link from a previous function with scrapy

Labels: python, scrapy

Issue

I have the code below to scrape a website. The parse method extracts the full link to each news article, and parse_item returns three fields from that article page: the date, the title, and the full text.

How can I also save the link found in parse, so that each item contains four fields: date, title, text, and link?

Here is the code:

import scrapy
from scrapy.crawler import CrawlerProcess


class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(0,351)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="article_title"]//p/span[1]/text()').extract(),
            'title': response.xpath('//*[@id="article_headline"]/text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()

Any help would be appreciated. Thanks in advance.

Solution

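Add response.url to the item yielded by parse_item. Inside parse_item, response is the page fetched for the link that parse followed, so response.url is that article's URL.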

For example:

import scrapy
from scrapy.crawler import CrawlerProcess


class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(0,351)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="article_title"]//p/span[1]/text()').extract(),
            'title': response.xpath('//*[@id="article_headline"]/text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')]),
            'link': response.url   # <-- added this
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
