Sunday, December 31, 2023

[FIXED] Data repeats itself when trying to crawl html data into csv with python

Labels: csv, python, web-crawler

Issue

I am trying to crawl data from a table that is split across multiple webpages and write it to a CSV file. The CSV must contain every entry under the columns "Palavra" and "Divisão Silábica", separated by commas. The website is http://www.portaldalinguaportuguesa.org/index.php?action=syllables&act=list&letter=a&start=0.

Following the query-string structure the site uses for its links, my code adds 100 (the maximum number of rows a single page shows) to start after each page it crawls, and when a letter is exhausted it moves on to the next letter.

abc = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r",
"s","t","u","v","w","x","y","z"]

from bs4 import BeautifulSoup
import requests
import time

with open("divisao_silabica.csv", "w", encoding = "utf-8") as file:
    file.write("palavra,tipo,divisao_silabica\n")

    for letra in abc:
        hundred = 0
        seguintes = True

        while seguintes:
            r = requests.get('http://www.portaldalinguaportuguesa.org/index.php?action=syllables&act=list&letter='+letra+"&start="+str(hundred))
            data = r.content
            soup = BeautifulSoup(data, "html.parser")
            rows = soup.find(id="rollovertable").find_all("tr")

            for row in rows:
                cells = row.find_all("td")
                if cells:
                    palavra = cells[0].find('b').find('a').get_text()
                    divisaosilabica = cells[1].get_text()
                    file.write(f"{palavra}, {divisaosilabica}\n")

            hundred += 100
            seguintes = "seguintes" in soup.get_text()

When I checked the output, the separator commas did not seem to be there. Looking more closely, I noticed that the commas do appear, but each one is followed by the data of the whole table. Between the row "a-pedido" and the row "a-propósito", for instance, there is a bunch of junk, which turns out to be data from the rest of the table.

a-pedido, a-pe·di·do

a-propósito (nome masculino) a-pro·pó·si·to a-tempo (nome masculino) a-tem·po à-toa (adjetivo) à-to·a à-toinha (adjetivo) à-to·i·nha à-vontade (nome masculino) à-von·ta·de aa (nome feminino) a·a aabora (nome feminino) a·a·bo·ra aacheniano (nome masculino) aa·che·ni·a·no aacheniano (adjetivo) aa·che·ni·a·no aal (nome masculino) a·al aaleniano (adjetivo) a·a·le·ni·a·no aaleniense (adjetivo) aa·le·ni·en·se aaleniense (nome masculino) aa·le·ni·en·se aaleniense (nome feminino) aa·le·ni·en·se aalênio (nome masculino) a·a·lê·ni·o aalénio (nome masculino) a·a·lé·ni·o

aarônico (adjetivo) a·a·rô·ni·co aarónico (adjetivo) a·a·ró·ni·co

aarônida (nome masculino) a·a·rô·ni·da aarônida (adjetivo) a·a·rô·ni·da aarônida (nome feminino) a·a·rô·ni·da aarónida (adjetivo) a·a·ró·ni·da

aarónida (nome masculino) a·a·ró·ni·da

aarónida (nome feminino) a·a·ró·ni·da

aaronita (nome masculino) a·a·ro·ni·ta

ab-reação (nome feminino) ab-re·a·ção ab-reativo (adjetivo) ab-re·a·ti·vo ab-repticiamente (advérbio) ab-rep·ti·ci·a·men·te ab-reptício (adjetivo) ab-rep·tí·ci·o ab-rogação (nome feminino) ab-ro·ga·ção ab-rogado (adjetivo) ab-ro·ga·do ab-rogador (adjetivo) ab-ro·ga·dor ab-rogador (nome masculino) ab-ro·ga·dor ab-rogamento (nome masculino) ab-ro·ga·men·to

ab-rogante (nome masculino) ab-ro·gan·te

ab-rogante (nome feminino) ab-ro·gan·te

ab-rogante (adjetivo) ab-ro·gan·te

ab-rogar (verbo) ab-ro·gar ab-rogativo (adjetivo) ab-ro·ga·ti·vo ab-rogatório (adjetivo) ab-ro·ga·tó·ri·o ab-rogável (adjetivo) ab-ro·gá·vel aba (nome masculino) a·ba aba (nome feminino) a·ba ababá (nome masculino) a·ba·bá ababá (nome feminino) a·ba·bá ababá (adjetivo) a·ba·bá ababalhado (adjetivo) a·ba·ba·lha·do ababalhar (verbo) a·ba·ba·lhar ababelado (adjetivo) a·ba·be·la·do ababelar (verbo) a·ba·be·lar ababosado (adjetivo) a·ba·bo·sa·do ababosar (verbo) a·ba·bo·sar abacá (nome masculino) a·ba·cá abacalhoadamente (advérbio) a·ba·ca·lho·a·da·men·te

abacalhoado (adjetivo) a·ba·ca·lho·a·do abacalhoar (verbo) a·ba·ca·lho·ar abacamartado (adjetivo) a·ba·ca·mar·ta·do abaçanado (adjetivo) a·ba·ça·na·do abaçanar (verbo) a·ba·ça·nar abacanto (nome masculino) a·ba·can·to

abacate (adjetivo) a·ba·ca·te abacate (nome masculino) a·ba·ca·te abacate (nome feminino) a·ba·ca·te abacateira (nome feminino) a·ba·ca·tei·ra abacateiro (nome masculino) a·ba·ca·tei·ro abacaxi (adjetivo) a·ba·ca·xi abacaxi (nome masculino) a·ba·ca·xi abacaxi (nome feminino) a·ba·ca·xi abacelado (adjetivo) a·ba·ce·la·do abacelamento (nome masculino) a·ba·ce·la·men·to abacelar (verbo) a·ba·ce·lar abacelável (adjetivo) a·ba·ce·lá·vel abacenino (adjetivo) a·ba·ce·ni·no abacenino (nome masculino) a·ba·ce·ni·no abacense (adjetivo) a·ba·cen·se abacense (nome masculino) a·ba·cen·se abacense (nome feminino) a·ba·cen·se abacharelado (adjetivo) a·ba·cha·re·la·do abacharelar (verbo) a·ba·cha·re·lar abacial (adjetivo) a·ba·ci·al abacinado (adjetivo) a·ba·ci·na·do

abacinamento (nome masculino) a·ba·ci·na·men·to abacinar (verbo) a·ba·ci·nar abacisco (nome masculino) a·ba·cis·co

abacista (nome masculino) a·ba·cis·ta abacista (nome feminino) a·ba·cis·ta ábaco (nome masculino) á·ba·co abacomitato (nome masculino) a·ba·co·mi·ta·to

abacómite (nome masculino) a·ba·có·mi·te

abacômite (nome masculino) a·ba·cô·mi·te abacto (nome masculino) a·bac·to abactor (nome masculino) a·bac·tor abáculo (nome masculino) a·bá·cu·lo abada (nome feminino) a·ba·da abadado (nome masculino) a·ba·da·do

a-propósito, a-pro·pó·si·to

It seems like I got the HTML structure right? Inside the table (the element with id "rollovertable") there are multiple tr rows, each containing td cells. If that is right, I guess there is something wrong with how I iterate over them.

Solution

The problem is that the page's HTML is not well formed, and html.parser, the standard-library parser you pass to BeautifulSoup, cannot recover from how badly broken it is.

Consider this snippet:

soup = BeautifulSoup(data, "html.parser")
table = soup.find(id="rollovertable")

rows = table.find_all("tr")
print(f"rows: {len(rows)}")

for i, row in enumerate(rows, start=1):
    cells = row.find_all("td")
    print(f"row {i}: cells: {len(cells)}")

and you get something like:

rows: 101
row 1: cells: 200
row 2: cells: 200
row 3: cells: 198
...
row 97: cells: 10
row 98: cells: 8
row 99: cells: 6
row 100: cells: 4
row 101: cells: 2

Probably not what you were expecting. If you install html5lib (pip install html5lib) and pass it to BeautifulSoup as the parser:

soup = BeautifulSoup(data, "html5lib")

you get what you probably thought you were going to get:

rows: 101
row 1: cells: 0
row 2: cells: 2
row 3: cells: 2
...
row 97: cells: 2
row 98: cells: 2
row 99: cells: 2
row 100: cells: 2
row 101: cells: 2

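I explored the parser options because I looked at the raw HTML coming back from the server and could see that nothing was terminated: all the cells were essentially nested inside each other. That is why, with html.parser, the first rows report 200 cells and the count shrinks by two per row: each row ends up containing every cell that comes after it.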
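
One way to confirm that kind of diagnosis without reading the markup by eye is to count opening and closing tags. The following is a minimal sketch using only the standard library. The TagBalance class, the unclosed helper, and the sample fragment are illustrative and were not taken from the site, whose real markup may differ.

```python
from collections import Counter
from html.parser import HTMLParser


class TagBalance(HTMLParser):
    """Counts start and end tags without building a tree."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        self.opened[tag] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1


def unclosed(html, tags=("tr", "td")):
    """Return, per tag, how many start tags have no matching end tag."""
    counter = TagBalance()
    counter.feed(html)
    counter.close()
    return {tag: counter.opened[tag] - counter.closed[tag] for tag in tags}


# Hypothetical fragment that mimics a table whose rows and cells are never closed.
sample = "<table><tr><td>aba<td>a·ba<tr><td>abada<td>a·ba·da</table>"
print(unclosed(sample))  # {'tr': 2, 'td': 4}
```

A result of all zeros would suggest the table is well formed and that the problem lies elsewhere; large positive numbers point to unterminated tags, which is the case where a more lenient parser tends to help.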

You can compare the pros and cons of each parser in the parser comparison table of the Beautiful Soup documentation. What caught my eye about html5lib was "Creates valid HTML5".
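
Putting the fix together, here is a hedged sketch of the extraction step with html5lib, paired with csv.writer so that separators and quoting are handled by the standard library. The function names extract_rows and write_csv are hypothetical. The sketch assumes the two-cell row layout and the link inside the first cell described in the question, and it assumes html5lib is installed.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_rows(html, parser="html5lib"):
    """Return (palavra, divisao_silabica) pairs from the 'rollovertable' table."""
    soup = BeautifulSoup(html, parser)
    table = soup.find(id="rollovertable")
    if table is None:
        return []
    pairs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # header row, or a row the parser could not separate
        link = cells[0].find("a")
        if link is None:
            continue
        pairs.append((link.get_text(strip=True), cells[1].get_text(strip=True)))
    return pairs


def write_csv(pairs, file):
    """Write a header line and one CSV line per pair to an open text file."""
    writer = csv.writer(file)
    writer.writerow(["palavra", "divisao_silabica"])
    writer.writerows(pairs)
```

In the crawling loop from the question, the pairs returned for each page could be collected into one list and written once at the end. Skipping rows that do not have exactly two cells also makes a parser problem visible as missing rows rather than as a dump of the whole table.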

Answered By - Zach Young

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
