Issue
I have a script, written in a Jupyter Notebook, that scrapes a URL and should save the result to a JSON file. It doesn't, even though the log says it does. I am saving the files to Google Drive, which is mounted correctly.
Here is the code. I think the problem may be in FEEDS, because the log shows that the spider selected all the rows it was supposed to capture.
Thank you very much for your help.
import scrapy
from scrapy.crawler import CrawlerProcess

class DatosSpider(scrapy.Spider):
    name = 'spider_datos'
    start_urls = ['URL']
    custom_settings = {
        'FEEDS': {'data.json': {'format': 'json', 'overwrite': True}}
    }

    def parse(self, response):
        events = response.xpath('//*[@id="PageContent"]/div[3]/table/tbody/tr')
        for event in events:
            dato1 = event.xpath('.//td[1]/text()').get()
            dato2 = event.xpath('.//td[2]/text()').get()
            datos = {
                'Dato 1': dato1.strip() if dato1 else None,
                'Dato 2': dato2.strip() if dato2 else None,
            }
            yield datos

process = CrawlerProcess()
process.crawl(DatosSpider)
process.start()
Solution
The following code is tested and works (although why use Scrapy for a single piece of data on that page?):
import scrapy
from scrapy.crawler import CrawlerProcess

class DatosSpider(scrapy.Spider):
    name = 'spider_datos'
    start_urls = ['https://geoinfo.nmt.edu/nmtso/events/home.cfml']
    custom_settings = {
        'FEEDS': {'data.json': {'format': 'json', 'encoding': 'utf-8', 'overwrite': True}}
    }

    def parse(self, response):
        events = response.xpath('//*[@id="PageContent"]/div[3]/table/tbody/tr')
        for event in events:
            dato1 = event.xpath('.//td[1]/text()').get()
            dato2 = event.xpath('.//td[2]/text()').get()
            datos = {
                'Dato 1': dato1.strip() if dato1 else None,
                'Dato 2': dato2.strip() if dato2 else None,
            }
            yield datos

process = CrawlerProcess()
process.crawl(DatosSpider)
process.start()
The result is a JSON file looking like this:
[
{"Dato 1": "2022-11-30 15:40:32.0", "Dato 2": "32.640"}
]
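If the file still does not show up in Google Drive, one thing worth checking is where 'data.json' actually lands: a relative name in FEEDS is resolved against the current working directory, which in Colab is usually /content rather than the Drive folder. The sketch below is only an illustration of building the FEEDS mapping with an absolute path. The helper name build_feeds and the mount point /content/drive/MyDrive are assumptions, not something taken from the question.

```python
from pathlib import Path


def build_feeds(output_dir, filename="data.json"):
    """Return a Scrapy FEEDS mapping keyed by an absolute output path.

    build_feeds is a hypothetical helper for this example, not a Scrapy API.
    """
    # resolve() turns a relative folder into an absolute one, so the feed
    # no longer depends on the notebook's current working directory.
    target = Path(output_dir).expanduser().resolve() / filename
    return {
        str(target): {"format": "json", "encoding": "utf-8", "overwrite": True}
    }


# Assumption: Drive is mounted at Colab's usual location.
DRIVE_DIR = "/content/drive/MyDrive"

# Inside the spider this would be used as:
#     custom_settings = {"FEEDS": build_feeds(DRIVE_DIR)}
```

Printing the key of the returned mapping before running the crawl is a quick way to confirm the exact file the feed exporter will write to.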
As this looks too much like an X-Y Problem, and you may in fact be after the data in that table, why not scrape it with three lines of code?
import pandas as pd
df = pd.read_html('https://geoinfo.nmt.edu/nmtso/events/home.cfml')[0]
print(df)
Result in terminal:
           Date+Time (UTC)  Latitude  Longitude (WGS84)  Depth (km)  Magnitude   RMS  STD (km)  #Stations  Unnamed: 8
0    2023-11-16 13:35:53.0    36.843           -104.925        5.00       2.52  0.63      2.89          8         NaN
1    2023-11-13 20:57:06.0    35.599           -107.487        5.00       2.13  0.49      4.77          8         NaN
2    2023-11-13 16:40:57.0    34.565           -106.833        5.00       2.53  0.45      4.03         11         NaN
3    2023-11-12 11:31:58.0    32.264           -104.468        6.75       2.34  0.46      2.22         20         NaN
4    2023-11-11 14:01:21.0    32.304           -104.497        7.43       2.54  0.46      1.86         26         NaN
...                    ...       ...                ...         ...        ...   ...       ...        ...         ...
170  2022-12-04 08:08:57.0    33.990           -106.880        5.00       2.40  0.40      1.41         11         NaN
171  2022-12-01 07:50:24.0    34.010           -106.920        5.00       3.50  0.50      2.24         16         NaN
172  2022-12-01 07:41:50.0    34.000           -106.920        5.00       2.90  0.60      1.41         18         NaN
173  2022-11-30 16:34:43.0    32.640           -104.420        5.00       2.10  0.40      1.41         19         NaN
174  2022-11-30 15:40:32.0    32.640           -104.440        5.00       2.10  0.50      1.41         16         NaN
175 rows × 9 columns
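Since the original goal was a JSON file, the DataFrame can be written out directly. This is a minimal sketch, assuming pandas is installed. The helper name save_table_as_json and the temporary output path are made up for illustration, and the two sample rows (with only three of the columns) are copied from the output above so the example runs without a network call.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def save_table_as_json(df, path):
    """Write the table as a JSON array with one object per row."""
    df.to_json(path, orient="records", indent=2)
    return Path(path)


# Stand-in for pd.read_html(...)[0]: two rows taken from the printed result.
sample = pd.DataFrame(
    {
        "Date+Time (UTC)": ["2023-11-16 13:35:53.0", "2023-11-13 20:57:06.0"],
        "Latitude": [36.843, 35.599],
        "Magnitude": [2.52, 2.13],
    }
)

# Hypothetical output location; in Colab this could be a folder on the mounted Drive.
out_path = save_table_as_json(sample, Path(tempfile.gettempdir()) / "events_sample.json")

# Read the file back to confirm that it was really written.
records = json.loads(out_path.read_text(encoding="utf-8"))
print(len(records), "records written to", out_path)
```

With the real table, passing df from read_html in place of sample should produce one JSON object per earthquake event.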
Answered By - Barry the Platipus