Issue
I can parse the JSON data in a script, but I am not able to target specific keys.
If it were a "normal" script type like "application/ld+json", it would be pretty easy to collect what I need. But I cannot address the script by name, because it has none 😅 So I copied the XPath selector for the script from the dev tools.
I then read the JSON data in a Scrapy shell from a product link:
scrapy shell 'https://www.electronic4you.de/makita-dmp180z-akku-kompressor-189153.html' ...
...
>>> response.xpath('//*[@id="root-wrapper"]/div/script[5]/text()').get()
And I get all the data in the script, of course. The result is:
['new E4uTrack("view_item", {"currency":"EUR","value":"49","page_type":"product","title":"Makita DMP180Z Akku-Kompressor","items":[{"item_name":"Makita DMP180Z Akku-Kompressor","item_id":"189153","id":"189153","item_mpn":"DMP180Z","item_gtin":"0088381898263","item_brand":"MAKITA","google_business_vertical":"retail","price":"49","currency":"EUR","regular_price":109.9501,"screensize":"","vogels_fsf":false}]})']
Normally I would target a value, e.g. the value of "item_name", with:
scrapy shell 'https://www.electronic4you.de/makita-dmp180z-akku-kompressor-189153.html' ...
...
>>> script_tag = response.xpath('//*[@id="root-wrapper"]/div/script[5]/text()').get()
>>> import json
>>> json.loads(script_tag)["items"]["item_name"]
But the output is:
  File "", line 1
    json.loads(script_tag)["item_name"]
IndentationError: unexpected indent
I have two questions.
Am I addressing the script correctly with the XPath selector? And how can I extract only the keys I need from this script?
Solution
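The script text is not plain JSON but a JavaScript call, new E4uTrack("view_item", {...}), so json.loads cannot parse it directly. Extract the JSON object with a regular expression first, parse it, and then iterate over the "items" list, which holds one dict per product. (The IndentationError itself only means the line was entered with leading whitespace at the >>> prompt.)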
import json

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.electronic4you.de/makita-dmp180z-akku-kompressor-189153.html']

    def parse(self, response):
        # Capture only the JSON argument of the E4uTrack("view_item", ...) call
        script_tag = response.xpath(
            '(//*[@id="root-wrapper"]/div/script)[5]/text()'
        ).re_first(r'E4uTrack\("view_item",(.+?)\)$')
        json_data = json.loads(script_tag)
        # "items" is a list, so iterate over it instead of indexing by key
        for item in json_data["items"]:
            yield {
                'Name': item["item_name"]
            }
Output:
{'Name': 'Makita DMP180Z Akku-Kompressor'}
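To try the extraction step without running a spider, here is a minimal standalone sketch using only the standard library. It assumes the script text looks like the string shown in the question (shortened here to a few fields). The helper name extract_payload is hypothetical. Instead of the anchored regex above, it uses json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever follows it, so a trailing ")" or ";" should not matter.

```python
import json
import re

# Script text as shown in the question, shortened to a few fields.
script_text = (
    'new E4uTrack("view_item", {"currency":"EUR","value":"49",'
    '"title":"Makita DMP180Z Akku-Kompressor",'
    '"items":[{"item_name":"Makita DMP180Z Akku-Kompressor",'
    '"item_id":"189153","item_brand":"MAKITA","price":"49"}]})'
)


def extract_payload(text):
    """Return the JSON object passed as the second argument to E4uTrack(...)."""
    # Find the start of the call and skip any whitespace before the "{".
    match = re.search(r'E4uTrack\("view_item",\s*', text)
    if match is None:
        raise ValueError("E4uTrack call not found")
    # raw_decode parses a single JSON value starting at the given index
    # and returns it together with the index where parsing stopped.
    payload, _end = json.JSONDecoder().raw_decode(text, match.end())
    return payload


payload = extract_payload(script_text)
# "items" is a list of dicts, so collect the key from each entry.
names = [item["item_name"] for item in payload["items"]]
print(names)
```

In a spider, the same helper could be applied to the string returned by response.xpath(...).get() in place of re_first.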
Answered By - Fazlul