Issue
I'm in search of some help. I would like to scrape quantitative information from SofaScore (https://www.sofascore.com/) about Serie A teams, specifically the starting lineups, the ratings assigned by the website, and possibly some more advanced statistics. However, my knowledge of HTML and web scraping is limited, and I'm struggling to extract this information from the site.
Currently, I'm attempting to extract this data for a single game, but I'm unsure how to generalize the code to collect information for all the rounds and teams.
Below is the code I've written so far, but it seems that the part with BeautifulSoup's find method is not targeting the correct section of the website.
import bs4
from bs4 import BeautifulSoup as bs
import requests
import webbrowser
link='https://www.sofascore.com/sassuolo-atalanta/LdbsTfb'
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.691'
response=requests.get(link, headers={'user-agent':user_agent})
response.raise_for_status()
soup=bs(response.text, 'html.parser')
div_voti=soup.find('div', class_="sc-fqkvVR eeeBnr sc-d8bc48b6-2 cUcAWg")
print(div_voti)
I understand this might be a basic question, but I'm feeling a bit lost. Thank you to anyone who can provide assistance!
Solution
The data you see on the page is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it in the HTML that requests downloads. To simulate these requests, you can use this example:
from itertools import zip_longest
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0"
}
url = "https://www.sofascore.com/sassuolo-atalanta/LdbsTfb"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
id_ = soup.select_one('link[href*="android-app:"]')["href"].split("/")[-1]
lineups_url = f"https://api.sofascore.com/api/v1/event/{id_}/lineups"
# for goals, substitutions etc use this url:
# incidents_url = "https://api.sofascore.com/api/v1/event/11407341/incidents"
lineups = requests.get(lineups_url, headers=headers).json()
for h, a in zip_longest(lineups["home"]["players"], lineups["away"]["players"]):
    if h:
        h = h["player"]["name"] + f" ({h['player']['position']})"
    else:
        h = "-"
    if a:
        a = a["player"]["name"] + f" ({a['player']['position']})"
    else:
        a = "-"
    print(f"{h:<50} {a:<50}")
Prints:
Andrea Consigli (G)                                Juan Musso (G)
Jeremy Toljan (D)                                  Berat Djimsiti (D)
Martin Erlić (D)                                   Giorgio Scalvini (D)
Mattia Viti (D)                                    Sead Kolašinac (D)
Matías Viña (D)                                    Davide Zappacosta (M)
Matheus Henrique (M)                               Marten de Roon (M)
Maxime López (M)                                   Teun Koopmeiners (M)
Grégoire Defrel (F)                                Mario Pašalić (F)
Nedim Bajrami (M)                                  Matteo Ruggeri (M)
Armand Laurienté (F)                               Ademola Lookman (F)
Andrea Pinamonti (F)                               Duván Zapata (F)
Filippo Missori (D)                                Éderson (M)
Kristian Thorstvedt (M)                            Charles De Ketelaere (M)
Kevin Miranda (D)                                  Gianluca Scamacca (F)
Cristian Volpato (M)                               Nadir Zortea (D)
Samuele Mulattieri (F)                             Michel Ndary Adopo (M)
Gianluca Pegolo (G)                                Francesco Rossi (G)
Alessio Cragno (G)                                 Marco Carnesecchi (G)
Yeferson Paz (M)                                   Rafael Tolói (D)
Luca Lipani (M)                                    Caleb Okoli (D)
Daniel Boloca (M)                                  Mitchel Bakker (M)
Emil Konradsen Ceide (F)                           Luis Muriel (F)
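The output above mixes starters and bench players and does not show ratings, which the question also asked about. The following is a minimal sketch of how the same lineups JSON might be filtered down to the starting eleven with ratings, and how the request could be repeated for several matches. It assumes that each player entry carries a boolean "substitute" flag and an optional "statistics" dict with a "rating" key; these field names are not confirmed by the example above, so inspect the real response before relying on them. The helper names starters_with_ratings, collect_lineups and fetch_json are hypothetical.

```python
import time
from typing import Callable, Iterable, Optional

# Same lineups endpoint as in the example above, with the event id left open.
LINEUPS_URL = "https://api.sofascore.com/api/v1/event/{event_id}/lineups"


def starters_with_ratings(side: dict) -> list[tuple[str, str, Optional[float]]]:
    """Return (name, position, rating) for every starter on one side.

    ASSUMPTION: entries have a boolean "substitute" flag and an optional
    "statistics" dict containing "rating". Missing ratings become None.
    """
    rows = []
    for entry in side.get("players", []):
        if entry.get("substitute"):  # skip bench players
            continue
        player = entry["player"]
        rating = entry.get("statistics", {}).get("rating")
        rows.append((player["name"], player.get("position", "?"), rating))
    return rows


def collect_lineups(
    event_ids: Iterable,
    fetch_json: Callable[[str], dict],
    pause: float = 1.0,
) -> dict:
    """Fetch home/away starters for several matches.

    fetch_json is any callable that maps a URL to parsed JSON, so the
    HTTP layer stays swappable. pause is the delay in seconds between
    requests, to avoid hammering the server.
    """
    result = {}
    for i, event_id in enumerate(event_ids):
        if i:
            time.sleep(pause)
        data = fetch_json(LINEUPS_URL.format(event_id=event_id))
        result[event_id] = {
            side: starters_with_ratings(data.get(side, {}))
            for side in ("home", "away")
        }
    return result
```

With the names from the example above, a call could look like collect_lineups([id_], lambda url: requests.get(url, headers=headers).json()). The list of event ids for a whole round still has to come from somewhere, for example by extracting id_ from each match page URL as shown earlier.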
Answered By - Andrej Kesely