Thursday, December 2, 2021

[FIXED] How can I webscrape a Wikipedia table with lists of data instead of rows?

Tags: beautifulsoup, dataframe, python, web-scraping

Issue

I am trying to get data from the Localities table on the Wikipedia page https://en.wikipedia.org/wiki/Districts_of_Warsaw.

I would like to collect this data and put it into a dataframe with two columns ["Districts"] and ["Neighbourhoods"].

My code so far looks like this:

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html")

table = soup.find_all('table')[2]

A=[]
B=[]

for row in table.findAll('tr'):
    cells=row.findAll('td')
    if len(cells)==2:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))

df=pd.DataFrame(A,columns=['Neighbourhood'])
df['District']=B
print(df)

This gives the following dataframe:

[Image: screenshot of the resulting dataframe]

The Neighbourhood column is clearly scraped incorrectly, since the neighbourhoods are contained in lists, but I don't know how it should be done, so I would be glad for any tips.

In addition, I would appreciate any hints as to why scraping gives me only 10 districts instead of 18.

Solution

Are you sure that you are scraping the right table? As I understand it, you need the second table, which has 18 districts with their neighbourhoods listed.
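One way to check which table index you need is to print a short summary of every table on the page. The sketch below is only an illustration: the HTML string is a hypothetical stand-in for the downloaded page, and summarize_tables is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables, the second one
# holding district headers and neighbourhood lists (sample values only).
html = """
<table><tr><td>a</td><td>b</td></tr></table>
<table>
  <tr><th>Bemowo</th><th>Wola</th></tr>
  <tr>
    <td><ul><li>Jelonki</li></ul></td>
    <td><ul><li>Koło</li><li>Mirów</li></ul></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def summarize_tables(soup):
    """Return (index, rows, header cells, list items) for each table."""
    summary = []
    for i, table in enumerate(soup.find_all("table")):
        summary.append((
            i,
            len(table.find_all("tr")),
            len(table.find_all("th")),
            len(table.find_all("li")),
        ))
    return summary


for line in summarize_tables(soup):
    print(line)
```

The table with many header cells and list items is most likely the one you are after; on the real page the counts will of course differ from this sample.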

Also, I'm not sure how you want the districts and neighbourhoods arranged in a DataFrame, so I've set districts as columns and neighbourhoods as rows. You can change that as you like.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

table = soup.find_all("table")[1]

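def process_list(tr):
    # Collect the <li> texts of every cell in the row, one list per cell
    result = []
    for td in tr.find_all("td"):
        result.append([x.string for x in td.find_all("li")])
    return result

districts = []
neighbourhoods = []
for row in table.find_all("tr"):
    if row.find("ul"):
        # Row of neighbourhood lists: one <ul> per district column
        neighbourhoods.extend(process_list(row))
    else:
        # Header row: the district names
        districts.extend([x.string.strip() for x in row.find_all("th")])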

# Check and arrange as you wish
for i in range(len(districts)):
    print(f'District {districts[i]} has neighbourhoods: {", ".join(neighbourhoods[i])}')

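# Build the frame in one step: assigning columns one by one to an empty
# DataFrame would cut longer lists down to the length of the first column
df = pd.DataFrame({d: pd.Series(n) for d, n in zip(districts, neighbourhoods)})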
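If you would rather have the two columns mentioned in the question, the same two lists can be flattened into one row per district and neighbourhood pair. This is just a sketch: the sample values below are illustrative stand-ins for what the scraping loop collects, and long_df is a name chosen for this example.

```python
import pandas as pd

# Illustrative sample in the same shape as the scraped lists
# (hypothetical values, not taken from the live page).
districts = ["Bemowo", "Wola"]
neighbourhoods = [["Jelonki", "Bemowo-Lotnisko"], ["Koło", "Mirów", "Ulrychów"]]

# One row per (district, neighbourhood) pair
rows = [
    (district, neighbourhood)
    for district, names in zip(districts, neighbourhoods)
    for neighbourhood in names
]
long_df = pd.DataFrame(rows, columns=["District", "Neighbourhood"])
print(long_df)
```

This layout avoids the NaN padding that the wide layout needs when districts have different numbers of neighbourhoods.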

Some tips:

Use element.string to get the text from an element. Note that it returns None when the element has more than one child; element.get_text() covers that case.
Use string.strip() to remove leading and trailing characters, i.e. to clean the text (whitespace is removed by default when no characters are passed).
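A small illustration of both tips, using a made-up HTML fragment rather than the real page. It also shows a case where .string gives None and get_text() may be the safer choice:

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first item is plain text with stray spaces,
# the second mixes a link with trailing text.
html = "<ul><li> Jelonki </li><li><a href='#'>Koło</a> (part)</li></ul>"
first, second = BeautifulSoup(html, "html.parser").find_all("li")

plain = first.string                      # the raw text, spaces included
cleaned = first.string.strip()            # leading/trailing whitespace removed
mixed = second.string                     # None: the <li> has two children
joined = second.get_text(" ", strip=True) # all text pieces, stripped and joined

print(repr(plain), repr(cleaned), mixed, joined)
```

If some list items on a page contain links plus extra text, swapping x.string for x.get_text(" ", strip=True) in the scraping code is one possible way to avoid None values.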

Answered By - Vlad Siv

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
