Issue
I have this code to scrape a website. The parse function yields the full link to each news article, and parse_item returns three items: the date, the title, and the full text from that article's URL.
How can I also scrape and save the link found in parse? The code would then return four items: date, title, text, and link.
Here is the code:
import scrapy
from scrapy.crawler import CrawlerProcess


class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(0,351)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="article_title"]//p/span[1]/text()').extract(),
            'title': response.xpath('//*[@id="article_headline"]/text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
Any help would be appreciated. Thanks in advance.
Solution
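Just add response.url to the yielded item. Inside parse_item, response is the article page that parse requested, so response.url is that article's link.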
For example:
import scrapy
from scrapy.crawler import CrawlerProcess


class weeklymining(scrapy.Spider):
    name = 'weeklymining'
    start_urls = ['https://www.miningweekly.com/page/coal/page:'+str(x) for x in range(0,351)]

    def parse(self, response):
        for link in response.xpath('//*[@class="en-serif"]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="article_title"]//p/span[1]/text()').extract(),
            'title': response.xpath('//*[@id="article_headline"]/text()').get(),
            'text':''.join([x.get().strip() for x in response.xpath('//*[@id="article_content_container"]//p//text()')]),
            'link': response.url  # <-- added this
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(weeklymining)
    process.start()
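
One thing to keep in mind: the value stored under 'link' is the absolute URL of the article page, which may not be character-for-character the same as the href matched in parse. response.follow resolves a relative href against the listing page before requesting it, and if the site redirects, response.url is the final URL. The sketch below illustrates only the resolution step, using the standard library rather than Scrapy itself. The href values and the absolute_links helper are made up for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin

# Hypothetical listing page and href values, for illustration only.
listing_url = "https://www.miningweekly.com/page/coal/page:0"
hrefs = [
    "/article/example-story-2021-01-01",           # root-relative href
    "https://www.miningweekly.com/article/other",  # already absolute
]


def absolute_links(page_url, hrefs):
    """Resolve each href against the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


links = absolute_links(listing_url, hrefs)
for link in links:
    print(link)
```

If the exact href from the listing page is needed instead of the resolved URL, Scrapy requests accept a cb_kwargs dictionary that can carry it from parse into parse_item.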
Answered By - Alexander