Issue
I am new to Scrapy. I want to scrape data from alibaba.com, but I'm getting None. I don't know where the problem is. Here is my code:
class IndiaSpider(scrapy.Spider):
    name = 'india'
    allowed_domains = ['indiamart.com']
    # search_value = 'car'
    start_urls = [f'https://dir.indiamart.com/search.mp?ss=laptop&prdsrc=1&res=RC4']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def request_header(self):
        yield scrapy.Request(url=self.start_urls, callback=self.parse, headers={'User-Agent':self.user_agent})

    def parse(self, response):
        title = response.xpath("//span[@class='elps elps2 p10b0 fs14 tac mListNme']/a/text()").get()
        related_link = response.xpath("//span[@class='elps elps2 p10b0 fs14 tac mListNme']/a/@href").get()
        yield {
            'titling': title,
            'rel_link': related_link
        }
And I am getting
2023-02-14 15:20:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dir.indiamart.com/search.mp?ss=car&prdsrc=1&res=RC4>
{'titling': None, 'rel_link': None, 'images': []}
2023-02-14 15:20:34 [scrapy.core.engine] INFO: Closing spider (finished)
I was getting results yesterday and it was working well, but today it returns None. It is not a JavaScript-based website. I tried more than once, but it returns the same result.
Solution
As @SuperUser told you, the spider gets None because the site uses JavaScript to render the product information. If you disable JavaScript in your browser and reload the page, you will see that the products are not displayed.
However, you can get the information from one of the <script> tags.
import scrapy
import json


class AlibabaSpider(scrapy.Spider):
    name = "alibaba"
    allowed_domains = ["alibaba.com"]
    search_value = "laptop"
    start_urls = [f"https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&tab=all&SearchText={search_value}"]

    def parse(self, response):
        raw_data = response.xpath("//script[contains(., 'window.__page__data__config')]/text()").extract_first()
        raw_data = raw_data.replace("window.__page__data__config = ", "").replace("window.__page__data = window.__page__data__config.props", "")
        data = json.loads(raw_data)
        title = data["props"]["offerResultData"]["offerList"][0]["information"]["puretitle"]
        yield {"title": title}  # Laptops Laptop Cheapest OEM Core I5...
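The two chained replace() calls above only work while the script body has exactly that shape, and the spider raises an exception if the script tag or any key is missing. A slightly more defensive variant is sketched below, using only the standard library. The helper names extract_page_config and first_title are hypothetical, and the sample string is made-up data that imitates the script layout described in the answer; the real page may differ.

```python
import json

# Variable name taken from the answer above; everything else here is illustrative.
MARKER = "window.__page__data__config"


def extract_page_config(script_text):
    """Return the JSON object assigned to MARKER, or None if it cannot be found."""
    if not script_text:
        return None
    marker_pos = script_text.find(MARKER)
    if marker_pos == -1:
        return None
    # Assumption: the JSON object starts at the first "{" after the marker.
    start = script_text.find("{", marker_pos)
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it,
        # so the trailing "window.__page__data = ..." line needs no stripping.
        obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    except json.JSONDecodeError:
        return None
    return obj


def first_title(config):
    """Follow the key path used in the answer; return None if any step is missing."""
    try:
        return config["props"]["offerResultData"]["offerList"][0]["information"]["puretitle"]
    except (KeyError, IndexError, TypeError):
        return None


# Made-up sample that mimics the script body, not real site data.
sample = (
    'window.__page__data__config = {"props": {"offerResultData": {"offerList": '
    '[{"information": {"puretitle": "Example laptop"}}]}}}\n'
    "window.__page__data = window.__page__data__config.props"
)
print(first_title(extract_page_config(sample)))
```

Inside parse, the text returned by the XPath query could be passed to extract_page_config, and the spider could skip the yield when the result is None instead of crashing.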
Answered By - Jalil SA