Issue
I have a list with around 50 entries for centers in Germany. These centers are publicly funded institutions with close ties to industry. I want to create a list of all the centers with these categories:
Industry sectors:
Location:
Contact person:
The data can be found on the overview page:
https://www.mittelstand-digital.de/MD/Redaktion/DE/artikel/Mittelstand-4-0/mittelstand-40-unternehmen.html
The idea is to write a scraper in Python with Beautiful Soup and then export the data to a Calc spreadsheet via pandas.
So I tried the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the overview page
url = "https://www.mittelstand-digital.de/MD/Redaktion/DE/Artikel/Mittelstand-4-0/mittelstand-40-kompetenzzentren.html"

# Fetch the page content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Create empty lists for the data
themen_list = []
branchen_list = []
ort_list = []
ansprechpartner_list = []

# Extract each center's data and append it to the lists
for center in soup.find_all('div', class_='linkblock'):
    themen_list.append(center.find('h3').text.strip())
    branchen_list.append(center.find('p', class_='text').text.strip())
    ort_list.append(center.find('span', class_='ort').text.strip())
    ansprechpartner_list.append(center.find('span', class_='ansprechpartner').text.strip())

# Convert the data into a pandas DataFrame
data = {
    'Themen': themen_list,
    'Branchen': branchen_list,
    'Ort': ort_list,
    'Ansprechpartner': ansprechpartner_list
}
df = pd.DataFrame(data)
But this does not work at the moment; I only get back empty lists.
Solution
I think you missed scraping the links to all the centers first, so that you can then iterate over them one by one and extract your information from each center's page. Also try to avoid the bunch of parallel lists; instead use a single list of structured dicts, which keeps each center's fields together and converts directly into a DataFrame.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the overview page
url = "https://www.mittelstand-digital.de/MD/Redaktion/DE/Artikel/Mittelstand-4-0/mittelstand-40-kompetenzzentren.html"

# Fetch the overview page
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = []

# Follow the link to each center's page and extract the data there
for center in soup.select('.content a.link.ExtBasepage:has(span)'):
    response = requests.get(center.get('href'))
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each .document-info-item holds a label/value pair (e.g. 'Themen:' and its
    # text), so dict() over the stripped strings builds the row in one step
    d = dict(e.stripped_strings for e in soup.select('.document-info-item'))
    d.update({'Ansprechpartner': soup.find(class_='card-contactdata').get_text('|', strip=True)})
    data.append(d)

pd.DataFrame(data)
Output
   Themen:   Branchen:   Ort:   Ansprechpartner
0 Assistenzsysteme, Digitale Bildung, Digita... Dienstleistungen, Energie, Handwerk und Baue... Denninger Str. 84, 81925 München © Mittelstand-Digital Zentrum Augsburg|Ansprec...
1 Vernetzte Produktion, Assistenzsysteme, Cl... Energie, Handwerk und Bauen, Dienstleistungen Fraunhoferstraße 10, 83626 Valley © Mittelstand-Digital Zentrum Bau|Ansprechpart...
2 Assistenzsysteme, Digitale Bildung, Künstl... Dienstleistungen, Handel (Binnenhandel, Gast... Potsdamer Straße 7, 10785 Berlin © Mittelstand-Digital Zentrum Berlin|Ansprechp...
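Since the stated goal was to get the data into a Calc spreadsheet, the resulting DataFrame can be written to an ODS file that Calc opens natively. A minimal sketch, assuming the odfpy package is installed (pandas uses it as its engine for .ods files); the filename zentren.ods is just an illustration, not from the original post:

import pandas as pd

df = pd.DataFrame(data)  # 'data' is the list of dicts built in the example above

# Write the table to an ODS spreadsheet for LibreOffice Calc.
# Requires odfpy (pip install odfpy); 'zentren.ods' is an example filename.
df.to_excel("zentren.ods", engine="odf", index=False)

Alternatively, df.to_csv() produces a CSV file that Calc can import just as easily.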
Answered By - HedgeHog