Issue
I am new to web scraping, Python, and Scrapy, so pardon me if there is some fundamental flaw in my understanding. I come from a Java/R background. I am trying to scrape www.amazon.in for book details. I built the required XPaths using Chrome's XPath finder, but when I run the same XPath query in the Scrapy shell, a different form of the URL is returned.
For example, for the following XPath query: //ul[@id='ref_976390031']/li[23]/a[@href]/@href
in the XPath finder I get
www.amazon.in/s/ref=lp_976389031_nr_n_21?fst=as%3Aoff&rh=n%3A976389031%2Cn%3A%21976390031%2Cn%3A1318203031&bbn=976390031&ie=UTF8&qid=1418660681&rnid=976390031
But when I run it against the response variable in the Scrapy shell, as response.xpath("//ul[@id='ref_976390031']/li[23]/a[@href]/@href").extract()
I get
http://www.amazon.in/b?ie=UTF8&node=1318203031
What's more interesting is that the scraped link, when keyed into a browser, lands on a different page from the one it is supposed to land on. (The same thing happens during scraping: the request ends up on a different page.)
One more thing I have observed: although the scraped links differ from the links rendered in the browser, most of them are directed or redirected properly, while some links, like this one, are not.
As a result, my scraper works on some links, while others are not scraped at all.
Any help/explanation for this behaviour will be greatly appreciated. Thanks in advance.
Solution
Kyle K and warvariuc were right: the site was rendering different URLs for different user agents.
Adding the following parameter in settings.py fixed the issue:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
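For anyone who wants to confirm the same diagnosis on another site, here is a minimal sketch using only the Python standard library rather than Scrapy. It fetches a page with two different User-Agent headers and lists the links that only one of them receives. The helper names (build_request, fetch_body, extract_hrefs, hrefs_only_seen_by) and the BOT_UA string are my own placeholders and are not part of Scrapy or of the original question.

```python
import urllib.request
from html.parser import HTMLParser

# The browser-like user agent from the settings.py line above.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
              "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1")
# Hypothetical crawler-style user agent; this is NOT Scrapy's actual default string.
BOT_UA = "example-bot/0.1"


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all anchor hrefs found in an HTML string, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def build_request(url, user_agent):
    """Return a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_body(url, user_agent, timeout=10):
    """Download the page as text with the given User-Agent (needs network access)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def hrefs_only_seen_by(url, ua_a, ua_b):
    """Return the sorted hrefs served to ua_a that are absent from the page served to ua_b."""
    seen_a = set(extract_hrefs(fetch_body(url, ua_a)))
    seen_b = set(extract_hrefs(fetch_body(url, ua_b)))
    return sorted(seen_a - seen_b)
```

Calling hrefs_only_seen_by(page_url, BROWSER_UA, BOT_UA) on the listing page should show whether the browser-style agent is given links that the crawler-style agent never sees. Some differences (session tokens, timestamps in query strings) are expected on any dynamic site, so treat the output as a hint rather than proof.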
Thank you everyone for taking the time to reply.
Answered By - Gopi