Issue
I have an HTML file with the structure below. As you can see, the headers and the matches under each header are not grouped in their own divs.
<div class="basketball">
<div class="header">
<span class="event_title">Playoff</span>
</div>
<div class="match">
<div class="home">Bakken</div>
<div class="away">Akken</div>
<div class="home score">90</div>
<div class="away score">70</div>
</div>
<div class="match">
<div class="home">Monaco</div>
<div class="away">Strasbourg</div>
<div class="home score">80</div>
<div class="away score">65</div>
</div>
<div class="header">
<span class="event_title">Semi Finals</span>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="header">
<span class="event_title">Quarter Finals</span>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="header">
<span class="event_title">Normal Season Matches</span>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
<div class="match">
<div class="home">Randers</div>
<div class="away">Celtics</div>
<div class="home score">60</div>
<div class="away score">90</div>
</div>
</div>
I considered wrapping each header and the matches below it in a separate div, but that is impractical because the original file has more than 1000 lines of HTML markup.
I need to extract the data so that the output looks like this:
data = {"Playoff": [["Bakken", "Akken", 90, 70], ["Monaco", "Strasbourg", 80, 65]],
        "Semi Finals": [["Randers", "Celtics", 60, 90], [...]],
        "Quarter Finals": [...],
        "Normal Season Matches": [...]}
The first part I did:
data = {}
for i in soup.find_all("div", class_="header"):
    title = i.find("span", class_="event_title").get_text()
    data[title] = []
data
# output
{'Playoff': [],
'Semi Finals': [],
'Quarter Finals': [],
'Normal Season Matches': []}
I am unable to figure out how to fill in the lists with the correct matches. Any help will be highly appreciated.
Solution
If html_text contains the HTML from the question, you can do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")

out = {}
for m in soup.select("div.match"):
    # text of the four inner divs: home, away, home score, away score
    data = [div.text for div in m.select("div")]
    # the closest header that precedes this match in the document
    header = m.find_previous(class_="header").text.strip()
    out.setdefault(header, []).append(data)

print(out)
Prints (note that the scores come back as strings):
{
"Playoff": [["Bakken", "Akken", "90", "70"], ["Monaco", "Strasbourg", "80", "65"]],
"Semi Finals": [
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
],
"Quarter Finals": [
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
],
"Normal Season Matches": [
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
["Randers", "Celtics", "60", "90"],
],
}
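The question asks for the scores as integers, while the output above contains strings. A minimal sketch of one way to convert them, assuming every score sits in a div that carries the score class and contains only digits. The shortened html_text sample and the row variable are illustrative and not part of the original answer:

```python
from bs4 import BeautifulSoup

# Shortened sample with the same layout as the HTML in the question.
html_text = """
<div class="basketball">
  <div class="header"><span class="event_title">Playoff</span></div>
  <div class="match">
    <div class="home">Bakken</div>
    <div class="away">Akken</div>
    <div class="home score">90</div>
    <div class="away score">70</div>
  </div>
  <div class="header"><span class="event_title">Semi Finals</span></div>
  <div class="match">
    <div class="home">Randers</div>
    <div class="away">Celtics</div>
    <div class="home score">60</div>
    <div class="away score">90</div>
  </div>
</div>
"""

soup = BeautifulSoup(html_text, "html.parser")

out = {}
for m in soup.select("div.match"):
    row = []
    for div in m.select("div"):
        text = div.get_text(strip=True)
        # Assumption: divs with the "score" class hold plain integers;
        # all other divs are kept as text.
        is_score = "score" in div.get("class", [])
        row.append(int(text) if is_score else text)
    # The closest header that precedes this match in the document.
    header = m.find_previous(class_="header").get_text(strip=True)
    out.setdefault(header, []).append(row)

print(out)
```

If a score div can be empty or hold non-numeric text (for example a postponed match), int() raises ValueError, so real data may need a guard around the conversion.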
Answered By - Andrej Kesely