Issue
At the moment I am gathering data on the dioceses of the world. My approach uses bs4 and pandas, and I am currently working on the scraping logic.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.catholic-hierarchy.org/"

# Send a GET request to the website
response = requests.get(url)

# My approach to parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find the relevant elements containing diocese information
diocese_elements = soup.find_all("div", class_="diocesan")

# Initialize empty lists to store the data
dioceses = []
addresses = []

# Extract the data from each diocese element
for diocese_element in diocese_elements:
    # Example: extracting the diocese name
    diocese_name = diocese_element.find("a").text.strip()
    dioceses.append(diocese_name)

    # Example: extracting the address
    address = diocese_element.find("div", class_="address").text.strip()
    addresses.append(address)

# To save the whole data set, we create a DataFrame using pandas
data = {'Diocese': dioceses, 'Address': addresses}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
At the moment I get some odd results in PyCharm, and I am trying to find a way to gather the whole data set with the pandas approach.
Solution
This example could get you started: it parses all diocese index pages for diocese names and URLs and stores them in a pandas DataFrame.
You can then iterate over these URLs to gather whatever further information you need.
import pandas as pd
import requests
from bs4 import BeautifulSoup

chars = "abcdefghijklmnopqrstuvwxyz"
url = "http://www.catholic-hierarchy.org/diocese/la{char}.html"

all_data = []
for char in chars:
    u = url.format(char=char)
    while True:
        print(f"Parsing {u}")
        soup = BeautifulSoup(requests.get(u).content, "html.parser")

        # Collect every diocese link on the current index page
        for a in soup.select("li a[href^=d]"):
            all_data.append(
                {
                    "Name": a.text,
                    "URL": "http://www.catholic-hierarchy.org/diocese/" + a["href"],
                }
            )

        # Follow the "[Next Page]" link until there are no more pages
        next_page = soup.select_one('a:has(img[alt="[Next Page]"])')
        if not next_page:
            break
        u = "http://www.catholic-hierarchy.org/diocese/" + next_page["href"]

df = pd.DataFrame(all_data).drop_duplicates()
print(df.head(10))
Prints:
...
Parsing http://www.catholic-hierarchy.org/diocese/lax.html
Parsing http://www.catholic-hierarchy.org/diocese/lay.html
Parsing http://www.catholic-hierarchy.org/diocese/laz.html
Name URL
0 Holy See http://www.catholic-hierarchy.org/diocese/droma.html
1 Diocese of Rome http://www.catholic-hierarchy.org/diocese/droma.html
2 Aachen http://www.catholic-hierarchy.org/diocese/da549.html
3 Aachen http://www.catholic-hierarchy.org/diocese/daach.html
4 Aarhus (Århus) http://www.catholic-hierarchy.org/diocese/da566.html
5 Aba http://www.catholic-hierarchy.org/diocese/dabaa.html
6 Abaetetuba http://www.catholic-hierarchy.org/diocese/dabae.html
8 Abakaliki http://www.catholic-hierarchy.org/diocese/dabak.html
9 Abancay http://www.catholic-hierarchy.org/diocese/daban.html
10 Abaradira http://www.catholic-hierarchy.org/diocese/d2a01.html
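To then pull details from the individual diocese pages, you could request each stored URL and parse out the fields you need. The helper below is only a sketch: it assumes (unverified) that each diocese page carries its main heading in an h1 tag, and the parsing is split into its own function so it can be tested without network access. Add a polite delay when looping over thousands of URLs.

```python
import time

import requests
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the first <h1> on a page, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else None


def fetch_diocese_title(url):
    """Download one diocese page and extract its heading (sketch only)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    time.sleep(1)  # be polite to the server between requests
    return extract_title(response.text)


# Example: enrich a few rows of the DataFrame built above
# df["Title"] = [fetch_diocese_title(u) for u in df["URL"].head(5)]
```

The same pattern extends to any other field: add one small parsing function per piece of information, then apply it over the URL column.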
Answered By - Andrej Kesely