Issue
I have a list with around 50 entries for centers in Germany. These centers are publicly funded institutions with close ties to industry. I want to create a list of all the centers with these categories:
Industry sectors:
Location:
Contact person:
The data can be found on the overview page:
https://www.mittelstand-digital.de/MD/Redaktion/DE/artikel/Mittelstand-4-0/mittelstand-40-unternehmen.html
The idea is to write a scraper in Python with Beautiful Soup and then export the data to a Calc spreadsheet via pandas.
So I tried the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the overview page
url = "https://www.mittelstand-digital.de/MD/Redaktion/DE/Artikel/Mittelstand-4-0/mittelstand-40-kompetenzzentren.html"

# Fetch the page content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Create empty lists for the data
themen_list = []
branchen_list = []
ort_list = []
ansprechpartner_list = []

# Extract each center's data and append it to the lists
for center in soup.find_all('div', class_='linkblock'):
    themen_list.append(center.find('h3').text.strip())
    branchen_list.append(center.find('p', class_='text').text.strip())
    ort_list.append(center.find('span', class_='ort').text.strip())
    ansprechpartner_list.append(center.find('span', class_='ansprechpartner').text.strip())

# Convert the data into a pandas DataFrame
data = {
    'Themen': themen_list,
    'Branchen': branchen_list,
    'Ort': ort_list,
    'Ansprechpartner': ansprechpartner_list
}
df = pd.DataFrame(data)
But this does not work at the moment; I only get back empty lists.
Solution
I think you missed scraping the links to all the centers first, so that you can then iterate over them one by one and extract your information from each center's page. Also try to avoid the bunch of parallel lists; instead use a single list of structured dicts, which keeps each center's fields together and converts directly into a DataFrame.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the overview page
url = "https://www.mittelstand-digital.de/MD/Redaktion/DE/Artikel/Mittelstand-4-0/mittelstand-40-kompetenzzentren.html"

# Fetch the overview page
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = []

# Follow the link to each center's page and extract the data there
for center in soup.select('.content a.link.ExtBasepage:has(span)'):
    response = requests.get(center.get('href'))
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each .document-info-item holds a label/value pair (e.g. 'Themen:' and its
    # text), so dict() over the stripped strings builds the row in one step
    d = dict(e.stripped_strings for e in soup.select('.document-info-item'))
    d.update({'Ansprechpartner': soup.find(class_='card-contactdata').get_text('|', strip=True)})
    data.append(d)

pd.DataFrame(data)
Output
   Themen:   Branchen:   Ort:   Ansprechpartner
0 Assistenzsysteme, Digitale Bildung, Digita... Dienstleistungen, Energie, Handwerk und Baue... Denninger Str. 84, 81925 München © Mittelstand-Digital Zentrum Augsburg|Ansprec...
1 Vernetzte Produktion, Assistenzsysteme, Cl... Energie, Handwerk und Bauen, Dienstleistungen Fraunhoferstraße 10, 83626 Valley © Mittelstand-Digital Zentrum Bau|Ansprechpart...
2 Assistenzsysteme, Digitale Bildung, Künstl... Dienstleistungen, Handel (Binnenhandel, Gast... Potsdamer Straße 7, 10785 Berlin © Mittelstand-Digital Zentrum Berlin|Ansprechp...
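Since the stated goal was to get the data into a Calc spreadsheet, the resulting DataFrame can be written to an ODS file that Calc opens natively. A minimal sketch, assuming the odfpy package is installed (pandas uses it as its engine for .ods files); the filename zentren.ods is just an illustration, not from the original post:

import pandas as pd

df = pd.DataFrame(data)  # 'data' is the list of dicts built in the example above

# Write the table to an ODS spreadsheet for LibreOffice Calc.
# Requires odfpy (pip install odfpy); 'zentren.ods' is an example filename.
df.to_excel("zentren.ods", engine="odf", index=False)

Alternatively, df.to_csv() produces a CSV file that Calc can import just as easily.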
Answered By - HedgeHog