Issue
At the moment I am gathering data on the dioceses of the world. My approach uses bs4 and pandas, and I am currently working on the scraping logic.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.catholic-hierarchy.org/"

# Send a GET request to the website
response = requests.get(url)

# My approach to parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find the relevant elements containing diocese information
diocese_elements = soup.find_all("div", class_="diocesan")

# Initialize empty lists to store the data
dioceses = []
addresses = []

# Extract the data from each diocese element
for diocese_element in diocese_elements:
    # Example: extracting the diocese name
    diocese_name = diocese_element.find("a").text.strip()
    dioceses.append(diocese_name)

    # Example: extracting the address
    address = diocese_element.find("div", class_="address").text.strip()
    addresses.append(address)

# To save the whole data set, we create a DataFrame using pandas
data = {'Diocese': dioceses, 'Address': addresses}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
At the moment I get some odd results in PyCharm, and I am trying to find a way to gather the whole data set with the pandas approach.
Solution
This example could get you started: it parses all diocese index pages for diocese names and URLs and stores them in a pandas DataFrame.
You can then iterate over these URLs to gather whatever further information you need.
import pandas as pd
import requests
from bs4 import BeautifulSoup

chars = "abcdefghijklmnopqrstuvwxyz"
url = "http://www.catholic-hierarchy.org/diocese/la{char}.html"

all_data = []
for char in chars:
    u = url.format(char=char)
    while True:
        print(f"Parsing {u}")
        soup = BeautifulSoup(requests.get(u).content, "html.parser")

        # Collect every diocese link on the current index page
        for a in soup.select("li a[href^=d]"):
            all_data.append(
                {
                    "Name": a.text,
                    "URL": "http://www.catholic-hierarchy.org/diocese/" + a["href"],
                }
            )

        # Follow the "[Next Page]" link until there are no more pages
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if not next_page:
            break
        u = "http://www.catholic-hierarchy.org/diocese/" + next_page["href"]

df = pd.DataFrame(all_data).drop_duplicates()
print(df.head(10))
Prints:
...
Parsing http://www.catholic-hierarchy.org/diocese/lax.html
Parsing http://www.catholic-hierarchy.org/diocese/lay.html
Parsing http://www.catholic-hierarchy.org/diocese/laz.html
Name URL
0 Holy See http://www.catholic-hierarchy.org/diocese/droma.html
1 Diocese of Rome http://www.catholic-hierarchy.org/diocese/droma.html
2 Aachen http://www.catholic-hierarchy.org/diocese/da549.html
3 Aachen http://www.catholic-hierarchy.org/diocese/daach.html
4 Aarhus (Århus) http://www.catholic-hierarchy.org/diocese/da566.html
5 Aba http://www.catholic-hierarchy.org/diocese/dabaa.html
6 Abaetetuba http://www.catholic-hierarchy.org/diocese/dabae.html
8 Abakaliki http://www.catholic-hierarchy.org/diocese/dabak.html
9 Abancay http://www.catholic-hierarchy.org/diocese/daban.html
10 Abaradira http://www.catholic-hierarchy.org/diocese/d2a01.html
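To then pull details from the individual diocese pages, you could request each stored URL and parse out the fields you need. The helper below is only a sketch: it assumes (unverified) that each diocese page carries its main heading in an h1 tag, and the parsing is split into its own function so it can be tested without network access. Add a polite delay when looping over thousands of URLs.

```python
import time

import requests
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the first <h1> on a page, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else None


def fetch_diocese_title(url):
    """Download one diocese page and extract its heading (sketch only)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    time.sleep(1)  # be polite to the server between requests
    return extract_title(response.text)


# Example: enrich a few rows of the DataFrame built above
# df["Title"] = [fetch_diocese_title(u) for u in df["URL"].head(5)]
```

The same pattern extends to any other field: add one small parsing function per piece of information, then apply it over the URL column.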
Answered By - Andrej Kesely