Issue
I'm new to all this. I managed to crawl through 3,600+ items on a page and extract data such as name, address, phone and e-mail, all of which I wrote to a .csv file.
My excitement was cut short when I discovered that some of the distributors had missing information (information that is present on the website) and had been written incorrectly to the .csv. Furthermore, some blank columns (like 'B') were created.
Also, I couldn't find a way to keep the square brackets and apostrophes from being written, although I can easily erase them all with LibreOffice Calc.
(In my code I pasted only a few of the 3,600+ URLs, including the ones in the attached picture that show the problem.)
import scrapy
import requests
import csv


class QuotesSpider(scrapy.Spider):
    name = "final"

    def start_requests(self):
        urls = [
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01586/zarate/bodelon-edgardo-aristides/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01778/zarate/cesario-mariano-rodrigo/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00140/zarate/de-vicenzi-elio-mario-g.-rosana-sh/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01941/zarate/de-vincenzi-elio-mario-y-rosana-sh/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla02168/zarate/ferreterias-indufer-s.a./?countrySelectorCode=AR',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        marca = []
        names = []
        direcc = []
        locali = []
        telef = []
        mail = []
        site = []
        for item in response.css('div.item-content'):
            marca.append('Bosch')
            names.append(item.css('p.item-name::text').extract())
            lista_direcc = item.css('p.item-address::text').extract()
            direcc.append(lista_direcc[0].strip())
            locali.append(lista_direcc[1].strip())
            telef.append(item.css('a.btn-phone.trackingElement.trackingTeaser::text').extract())
            mail.append(item.css('a.btn-email.trackingElement.trackingTeaser::text').extract())
            site.append(item.css('a.btn-website.trackingElement.trackingTeaser::text').extract())
        with open('base.csv', 'a') as csvFile:
            fieldnames = ['Empresa', 'Nombres', 'Dirección', 'Localidad', 'Teléfono', 'Mail', 'Sitio Web']
            writer = csv.DictWriter(csvFile, fieldnames=fieldnames)
            writer.writerow({'Empresa': marca, 'Nombres': names, 'Dirección': direcc, 'Localidad': locali, 'Teléfono': telef, 'Mail': mail, 'Sitio Web': site})
        csvFile.close()
You can see an example of what I'm talking about in the picture: the program created several extra columns and in some cases shifted the data one column to the left.
I assume the solution is quite simple, as it has been for all my previous questions, but it's puzzling me.
So thanks a lot for any help, and for tolerating my poor English. Cheers!
Solution
Firstly, use the built-in CSV feed exporter rather than your own CSV writer method. In other words, yield the items and let Scrapy handle the CSV.
Secondly, don't write lists to the CSV. That is why you get [[ and [ in the output, and it is most likely also the cause of the extra columns: the lists introduce extra commas into the output.
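The bracket problem is easy to reproduce with just the standard csv module, independent of Scrapy (a minimal sketch with a made-up name):

```python
import csv
import io

# What happens when a list (as returned by extract()) is written to a
# CSV cell, compared with writing a plain string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['Nombres'])
writer.writerow({'Nombres': ['Bodelon Edgardo']})  # list value, as from extract()
writer.writerow({'Nombres': 'Bodelon Edgardo'})    # plain string
print(buf.getvalue())
# The first row comes out as ['Bodelon Edgardo'] -- square brackets and
# apostrophes included -- while the second row is just the clean name.
```

The csv module calls str() on any non-string value, so a whole list gets stringified into one cell, brackets, quotes, commas and all.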
Another point: you do not need to implement start_requests() at all. You can simply list your URLs in a start_urls property.
Here is an example:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "final"

    start_urls = [
        # ...
    ]

    def parse(self, response):
        for item in response.css('div.item-content'):
            lista_direcc = item.css('p.item-address::text').getall()
            yield {
                'Empresa': 'Bosch',
                'Nombres': item.css('p.item-name::text').get(),
                'Dirección': lista_direcc[0].strip(),
                'Localidad': lista_direcc[1].strip(),
                'Teléfono': item.css('a.btn-phone.trackingElement.trackingTeaser::text').get(),
                'Mail': item.css('a.btn-email.trackingElement.trackingTeaser::text').get(),
                'Sitio Web': item.css('a.btn-website.trackingElement.trackingTeaser::text').get(),
            }
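One caveat with the snippet above: if a distributor page is missing an address line, lista_direcc[1] raises an IndexError, which can lose or shift items, and that matches the shifted columns you describe. A small defensive helper (hypothetical, not part of Scrapy) keeps a missing value as an empty cell instead:

```python
def safe_part(parts, index, default=None):
    """Return parts[index] stripped of whitespace, or default when the
    list is shorter than expected (e.g. a missing address line)."""
    if len(parts) > index:
        return parts[index].strip()
    return default

# A complete address yields both parts; an incomplete one yields None
# for the missing field instead of raising IndexError.
print(safe_part(['Av. Mitre 123 ', ' Zarate '], 1))  # 'Zarate'
print(safe_part(['Av. Mitre 123'], 1))               # None
```

In parse() you would then write, for example, 'Localidad': safe_part(lista_direcc, 1).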
As @Gallaecio mentioned in the comments below, it is better to use get() instead of extract() when you expect a single item (it is also the preferred usage nowadays). Read more here: https://docs.scrapy.org/en/latest/topics/selectors.html#extract-and-extract-first
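The difference matters for exactly the missing-data problem above. Roughly, and sketched here with plain lists rather than Scrapy's actual selector code: extract()/getall() always return a list of every match, while get() returns the first match or None when there is none:

```python
# Plain-Python sketch of the selector semantics (not Scrapy's real code):
def getall(results):
    # extract()/getall(): every match, always a list
    return results

def get(results):
    # extract_first()/get(): first match, or None -- never an exception
    return results[0] if results else None

found = ['0123-456789']   # pretend the CSS selector matched one phone number
missing = []              # pretend nothing matched

print(getall(found))   # ['0123-456789'] -- a list; written as-is it adds brackets
print(get(found))      # '0123-456789'
print(get(missing))    # None -- becomes an empty CSV cell, not a crash
```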
To get the CSV you can run:
scrapy runspider spidername.py -o output.csv
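Alternatively, on Scrapy 2.1 or later, the feed can be declared inside the spider itself through the FEEDS setting, so no -o flag is needed (a sketch of the settings fragment; the filename is an assumption, and you should check the feed-exports documentation for your version):

```python
# Hypothetical custom_settings dict for the spider (Scrapy >= 2.1);
# equivalent to passing "-o output.csv" on the command line.
custom_settings = {
    'FEEDS': {
        'output.csv': {
            'format': 'csv',      # use the built-in CSV exporter
            'encoding': 'utf8',   # keep accented characters intact
        },
    },
}
```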
Answered By - malberts