Issue
I'm new to all this. I managed to crawl through 3,600+ items on a page and extract data such as name, address, phone and e-mail, all of which I wrote to a .csv file.
My excitement was cut short when I discovered that some of the distributors had missing information (information that is present on the website) and had been written incorrectly to the .csv. Furthermore, some blank columns (like 'B') were created.
Also, I couldn't find a way to keep the square brackets and apostrophes from being written, although I can easily erase them all with LibreOffice Calc.
(In my code I pasted only a few of the 3,600+ URLs, including the ones in the attached picture that show the problem.)
import scrapy
import requests
import csv


class QuotesSpider(scrapy.Spider):
    name = "final"

    def start_requests(self):
        urls = [
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01586/zarate/bodelon-edgardo-aristides/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01778/zarate/cesario-mariano-rodrigo/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla00140/zarate/de-vicenzi-elio-mario-g.-rosana-sh/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla01941/zarate/de-vincenzi-elio-mario-y-rosana-sh/?countrySelectorCode=AR',
            'https://www.bosch-professional.com/ar/es/dl/localizador-de-distribuidores/localizador-de-distribuidores/distribuidor/boschla02168/zarate/ferreterias-indufer-s.a./?countrySelectorCode=AR',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        marca = []
        names = []
        direcc = []
        locali = []
        telef = []
        mail = []
        site = []
        for item in response.css('div.item-content'):
            marca.append('Bosch')
            names.append(item.css('p.item-name::text').extract())
            lista_direcc = item.css('p.item-address::text').extract()
            direcc.append(lista_direcc[0].strip())
            locali.append(lista_direcc[1].strip())
            telef.append(item.css('a.btn-phone.trackingElement.trackingTeaser::text').extract())
            mail.append(item.css('a.btn-email.trackingElement.trackingTeaser::text').extract())
            site.append(item.css('a.btn-website.trackingElement.trackingTeaser::text').extract())
        with open('base.csv', 'a') as csvFile:
            fieldnames = ['Empresa', 'Nombres', 'Dirección', 'Localidad', 'Teléfono', 'Mail', 'Sitio Web']
            writer = csv.DictWriter(csvFile, fieldnames=fieldnames)
            writer.writerow({'Empresa': marca, 'Nombres': names, 'Dirección': direcc, 'Localidad': locali, 'Teléfono': telef, 'Mail': mail, 'Sitio Web': site})
        csvFile.close()
You can see an example of what I'm talking about in the picture: the program created several extra columns and in some cases shifted the data one column to the left.
I assume the solution is quite simple, as it has been for all my previous questions, but it's puzzling me.
So thanks a lot for any help, and for tolerating my poor English. Cheers!
Solution
Firstly, use the built-in CSV feed exporter rather than your own CSV writer method. In other words, yield the items and let Scrapy handle the CSV.
Secondly, don't write lists to the CSV. That is why you get [[ and [ in the output, and it is most likely also the cause of the extra columns: the lists introduce extra commas into the output.
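The bracket problem is easy to reproduce with just the standard csv module, independent of Scrapy (a minimal sketch with a made-up name):

```python
import csv
import io

# What happens when a list (as returned by extract()) is written to a
# CSV cell, compared with writing a plain string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['Nombres'])
writer.writerow({'Nombres': ['Bodelon Edgardo']})  # list value, as from extract()
writer.writerow({'Nombres': 'Bodelon Edgardo'})    # plain string
print(buf.getvalue())
# The first row comes out as ['Bodelon Edgardo'] -- square brackets and
# apostrophes included -- while the second row is just the clean name.
```

The csv module calls str() on any non-string value, so a whole list gets stringified into one cell, brackets, quotes, commas and all.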
Another point: you do not need to implement start_requests() at all. You can simply list your URLs in a start_urls property.
Here is an example:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "final"

    start_urls = [
        # ...
    ]

    def parse(self, response):
        for item in response.css('div.item-content'):
            lista_direcc = item.css('p.item-address::text').getall()
            yield {
                'Empresa': 'Bosch',
                'Nombres': item.css('p.item-name::text').get(),
                'Dirección': lista_direcc[0].strip(),
                'Localidad': lista_direcc[1].strip(),
                'Teléfono': item.css('a.btn-phone.trackingElement.trackingTeaser::text').get(),
                'Mail': item.css('a.btn-email.trackingElement.trackingTeaser::text').get(),
                'Sitio Web': item.css('a.btn-website.trackingElement.trackingTeaser::text').get(),
            }
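One caveat with the snippet above: if a distributor page is missing an address line, lista_direcc[1] raises an IndexError, which can lose or shift items, and that matches the shifted columns you describe. A small defensive helper (hypothetical, not part of Scrapy) keeps a missing value as an empty cell instead:

```python
def safe_part(parts, index, default=None):
    """Return parts[index] stripped of whitespace, or default when the
    list is shorter than expected (e.g. a missing address line)."""
    if len(parts) > index:
        return parts[index].strip()
    return default

# A complete address yields both parts; an incomplete one yields None
# for the missing field instead of raising IndexError.
print(safe_part(['Av. Mitre 123 ', ' Zarate '], 1))  # 'Zarate'
print(safe_part(['Av. Mitre 123'], 1))               # None
```

In parse() you would then write, for example, 'Localidad': safe_part(lista_direcc, 1).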
As @Gallaecio mentioned in the comments below, it is better to use get() instead of extract() when you expect a single item (it is also the preferred usage nowadays). Read more here: https://docs.scrapy.org/en/latest/topics/selectors.html#extract-and-extract-first
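The difference matters for exactly the missing-data problem above. Roughly, and sketched here with plain lists rather than Scrapy's actual selector code: extract()/getall() always return a list of every match, while get() returns the first match or None when there is none:

```python
# Plain-Python sketch of the selector semantics (not Scrapy's real code):
def getall(results):
    # extract()/getall(): every match, always a list
    return results

def get(results):
    # extract_first()/get(): first match, or None -- never an exception
    return results[0] if results else None

found = ['0123-456789']   # pretend the CSS selector matched one phone number
missing = []              # pretend nothing matched

print(getall(found))   # ['0123-456789'] -- a list; written as-is it adds brackets
print(get(found))      # '0123-456789'
print(get(missing))    # None -- becomes an empty CSV cell, not a crash
```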
To get the CSV you can run:
scrapy runspider spidername.py -o output.csv
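Alternatively, on Scrapy 2.1 or later, the feed can be declared inside the spider itself through the FEEDS setting, so no -o flag is needed (a sketch of the settings fragment; the filename is an assumption, and you should check the feed-exports documentation for your version):

```python
# Hypothetical custom_settings dict for the spider (Scrapy >= 2.1);
# equivalent to passing "-o output.csv" on the command line.
custom_settings = {
    'FEEDS': {
        'output.csv': {
            'format': 'csv',      # use the built-in CSV exporter
            'encoding': 'utf8',   # keep accented characters intact
        },
    },
}
```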
Answered By - malberts