Issue
>>> print(response.text)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://cargadgetss.com/sitemap-product.xml</loc>
</sitemap>
<sitemap>
<loc>https://cargadgetss.com/sitemap-category.xml</loc>
</sitemap>
<sitemap>
<loc>https://cargadgetss.com/sitemap-page.xml</loc>
</sitemap>
</sitemapindex>
>>> response.xpath('//loc')
[]
>>> Selector(text=response.text).xpath('//loc')[0].extract()
'<loc>https://cargadgetss.com/sitemap-product.xml</loc>'
>>>
I would like to extract the tag content from this XML text. I have only just started learning how to extract data with Scrapy, and I always use response.xpath in my code, but this time it doesn't work. So I tried Selector and, luckily, got the data I need. But I still don't understand: why can the data be extracted with Selector, but not with response.xpath alone?
Solution
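That's because of the XML namespace (xmlns) declared on the root element. The response is parsed as XML, so every element belongs to that namespace and an unprefixed //loc, which only matches elements in no namespace, finds nothing. Selector(text=...) defaults to the HTML parser, which ignores namespaces, and that is why it worked. Another way to extract those URLs is to assign a prefix to the namespace and use it in the XPath expression.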
For example:
>>> response.xpath("//x:loc/text()", namespaces={"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}).getall()
['https://cargadgetss.com/sitemap-product.xml',
'https://cargadgetss.com/sitemap-category.xml',
'https://cargadgetss.com/sitemap-page.xml']
(See the parsel documentation on XML namespaces for more information.)
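The same namespace behaviour can be reproduced outside Scrapy. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree rather than parsel, so the query syntax differs slightly from response.xpath; the prefix "x" and the variable names are arbitrary choices made for illustration.

```python
import xml.etree.ElementTree as ET

# The sitemap index from the question, kept as a string for the demo.
SITEMAP_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://cargadgetss.com/sitemap-product.xml</loc></sitemap>
<sitemap><loc>https://cargadgetss.com/sitemap-category.xml</loc></sitemap>
<sitemap><loc>https://cargadgetss.com/sitemap-page.xml</loc></sitemap>
</sitemapindex>"""

root = ET.fromstring(SITEMAP_XML.encode("utf-8"))

# Map an arbitrary prefix to the namespace URI declared by xmlns.
NS = {"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Unprefixed name: looks for <loc> in no namespace, so nothing matches.
unqualified = root.findall(".//loc")

# Prefixed name: looks for <loc> in the sitemap namespace.
qualified = [el.text for el in root.findall(".//x:loc", NS)]

print(unqualified)
print(qualified)
```

The prefix itself is irrelevant; only the namespace URI it maps to has to match the one in the document.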
However, if you want to extract links from a sitemap, I advise you to use Scrapy's SitemapSpider. For example:
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass  # ... scrape product ...

    def parse_category(self, response):
        pass  # ... scrape category ...
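To make the role of sitemap_rules more concrete: each rule pairs a regular expression with a callback name, and a URL found in the sitemap is handed to the callback of the first rule whose pattern matches it. The sketch below only illustrates that idea with the standard library; it is not Scrapy's actual implementation, and pick_callback and SITEMAP_RULES are hypothetical names.

```python
import re

# Hypothetical stand-in for a spider's sitemap_rules:
# (regular expression, callback name) pairs, checked in order.
SITEMAP_RULES = [
    ("/product/", "parse_product"),
    ("/category/", "parse_category"),
]


def pick_callback(url, rules=SITEMAP_RULES):
    """Return the callback name of the first rule matching url, or None."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None


print(pick_callback("http://www.example.com/product/42"))
print(pick_callback("http://www.example.com/about"))
```

Because the first matching rule wins, more specific patterns should come before more general ones.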
Answered By - Thiago Curvelo