Issue
I am trying to use SitemapSpider to parse a sitemap; see the code below. How can I access additional information from the sitemap inside the parse function? For example, the sitemap already contains news:keywords and news:stock_tickers. How do I get that data and pass it to the parse function?
from scrapy.spiders import SitemapSpider


class ReutersSpider(SitemapSpider):
    name = 'reuters'
    sitemap_urls = ['https://www.reuters.com/sitemap_news_index1.xml']

    def parse(self, response):
        # How can I get data like news:stock_tickers from the sitemap for this item?
        # I only have the URL from the sitemap here.
        yield {
            'title': response.css("title ::text").extract_first(),
            'url': response.url
        }
Sitemap item example
<url>
  <loc>
    https://www.reuters.com/article/micron-tech-results/update-6-micron-sales-profit-miss-estimates-as-chip-glut-hurts-prices-idUSL3N1YN50N
  </loc>
  <news:news>
    <news:publication>
      <news:name>Reuters</news:name>
      <news:language>eng</news:language>
    </news:publication>
    <news:publication_date>2018-12-19T03:50:10+00:00</news:publication_date>
    <news:title>
      UPDATE 6-Micron sales, profit miss estimates as chip glut hurts prices
    </news:title>
    <news:keywords>Headlines,Industrial Conglomerates</news:keywords>
    <news:stock_tickers>
      SEO:000660,SEO:005930,TYO:6502,NASDAQ:AAPL,NASDAQ:AMZN
    </news:stock_tickers>
  </news:news>
</url>
Solution
SitemapSpider is specialized for extracting links and nothing else, so it doesn't provide a means of extracting additional data from a sitemap.
You could override its _parse_sitemap method to pass the data in the generated requests' meta.
However, if your sitemap is simple enough, it might be easier to just do your own sitemap parsing.
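As a rough illustration of the "do your own sitemap parsing" route, here is a minimal sketch that uses only the standard library. The function name parse_news_sitemap and the helper _text are made up for this example, and the two namespace URIs are an assumption: they are the usual ones for sitemaps and the Google News extension, but the sample above does not show the declarations, so check them against the real file.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the usual namespace URIs for sitemaps and the Google News
# extension; the sample in the question does not show the real declarations.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def _text(element, path):
    """Return the stripped text of the first match for path, or None."""
    found = element.find(path, NS)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def _split(value):
    """Split a comma-separated string into a list of stripped items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",")]


def parse_news_sitemap(xml_data):
    """Yield one dict per <url> entry (hypothetical helper, not a Scrapy API)."""
    root = ET.fromstring(xml_data)
    for url in root.iter("{%s}url" % NS["sm"]):
        yield {
            "url": _text(url, "sm:loc"),
            "keywords": _split(_text(url, "news:news/news:keywords")),
            "stock_tickers": _split(_text(url, "news:news/news:stock_tickers")),
        }
```

In a plain scrapy.Spider whose first callback receives the sitemap response, each entry could then be attached to the request it produces, for example Request(entry["url"], callback=self.parse, meta={"sitemap_entry": entry}), and read back in parse as response.meta["sitemap_entry"]. The key name sitemap_entry is arbitrary. A sitemap index or a gzipped sitemap would need extra handling that this sketch leaves out.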
Answered By - stranac