Issue
I am trying to get data from the Localities table on the Wikipedia page https://en.wikipedia.org/wiki/Districts_of_Warsaw.
I would like to collect this data and put it into a dataframe with two columns ["Districts"] and ["Neighbourhoods"].
My code so far looks like this:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html")
table = soup.find_all('table')[2]
A = []
B = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 2:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
df = pd.DataFrame(A, columns=['Neighbourhood'])
df['District'] = B
print(df)
This gives the following dataframe:
I know the Neighbourhood column is scraped incorrectly, since the neighbourhoods are contained in lists, but I don't know how it should be done, so I would be glad for any tips.
I would also appreciate any hints as to why scraping gives me only 10 districts instead of 18.
Solution
Are you sure that you are scraping the right table? As I understand it, you need the second table, which lists the 18 districts and their neighbourhoods.
Also, I'm not sure how you want the districts and neighbourhoods arranged in the DataFrame, so I've set districts as columns and neighbourhoods as rows. You can change this as you like.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find_all("table")[1]

def process_list(tr):
    result = []
    for td in tr.findAll("td"):
        result.append([x.string for x in td.findAll("li")])
    return result

districts = []
neighbourhoods = []
for row in table.findAll("tr"):
    if row.find("ul"):
        neighbourhoods.extend(process_list(row))
    else:
        districts.extend([x.string.strip() for x in row.findAll("th")])

# Check and arrange as you wish
for i in range(len(districts)):
    print(f'District {districts[i]} has neighbourhoods: {", ".join(neighbourhoods[i])}')

df = pd.DataFrame()
for i in range(len(districts)):
    df[districts[i]] = pd.Series(neighbourhoods[i])
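If you would rather have the two-column layout from the question (one row per neighbourhood), something like the following might work. It is only a sketch: the districts and neighbourhoods values below are made-up stand-ins for the lists built above, and long_df is just an illustrative name.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped lists; the real values come from the page.
districts = ["District A", "District B"]
neighbourhoods = [["N1", "N2"], ["N3"]]

# Pair each district with each of its neighbourhoods, one pair per row.
rows = [
    (district, neighbourhood)
    for district, names in zip(districts, neighbourhoods)
    for neighbourhood in names
]

long_df = pd.DataFrame(rows, columns=["Districts", "Neighbourhoods"])
print(long_df)
```

This assumes districts and neighbourhoods have the same length and are in the same order, which is worth checking before relying on the result.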
Some tips:
- Use element.string to get the text from an element
- Use string.strip() to remove leading and trailing characters from a string (whitespace by default), i.e. to clean the text
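One caveat with element.string, shown here on a made-up HTML fragment (not taken from the Wikipedia page): it returns None when an element has more than one child, for example a link followed by plain text. In that case get_text() followed by strip() may be the safer choice. A small sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
html = "<ul><li> Plain </li><li><a href='#'>Linked</a> (note)</li></ul>"
items = BeautifulSoup(html, "html.parser").find_all("li")

plain = items[0].string   # single text child: the text, whitespace included
mixed = items[1].string   # several children: None

# get_text() concatenates all nested text; strip() trims the ends.
cleaned = [li.get_text().strip() for li in items]
print(repr(plain), mixed, cleaned)
```

If any list item on the real page contains nested markup, a None from element.string would also break the ", ".join(...) call above, so it may be worth swapping in get_text() there.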
Answered By - Vlad Siv