Issue
I'm working on scraping winners and prize amounts from this page: https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1 using Beautiful Soup. I found a similar Stack Overflow post, but it solves for a different page with different elements, so that solution does not work for my task.
I've inspected the HTML and tried countless possible tags and IDs. Does anyone have advice on accessing the actual table and returning a DataFrame with the prize, date, and location for each winning ticket? Thanks in advance!!
Here's my code:
from bs4 import BeautifulSoup as bs
import requests
import urllib.request
import json
import pandas as pd
from datetime import datetime as dt
website = "https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1"
result = requests.get(website)
content = result.text
soup = bs(content, 'lxml')
htmltable = soup.find('table', {'class': 'multi-col-stacking-table'})
#print(htmltable.prettify())
table = soup.find('table', attrs={'data-title': 'Prize '})
for tr in table.tbody.find_all('tr'):
    print(tr.text)
I tried the above code and variations of it, but I keep getting None or blank output.
Solution
You cannot scrape this page with BeautifulSoup alone, because the table is rendered by JavaScript. BeautifulSoup only parses the static HTML that requests downloads; it cannot execute scripts, so the dynamically loaded table never appears in the parsed document.
To handle JavaScript, you need a more sophisticated tool like Selenium, which controls a web browser and is able to execute JavaScript, allowing you to interact with the dynamically loaded content.
If you have not used Selenium before, you can pip install it like you do for any other major package (pip install selenium), and can look up the driver setup. It is very simple. I have adjusted your code to use Selenium below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

driver = webdriver.Chrome()
driver.get("https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page=1")
time.sleep(10)  # wait for the JavaScript-rendered table to load

table_element = driver.find_element(By.CLASS_NAME, 'multi-col-stacking-table')  # or any other method to locate your table
rows = table_element.find_elements(By.TAG_NAME, "tr")

data = []
for row in rows:
    cells = row.find_elements(By.TAG_NAME, "td")
    row_data = [cell.text for cell in cells]
    data.append(row_data)

columns = ['Date', 'Amount', 'Game', 'Location']
# The header row uses <th> cells, so its td-based list comes back empty; drop it
if not data[0]:
    data.pop(0)

df = pd.DataFrame(data, columns=columns)
print(df)
This will create a DataFrame with each winner's date, prize amount, game, and location.
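As a hedged follow-up, if you later want numeric prize amounts (for sorting or summing), the Amount strings can be cleaned with pandas. This is a minimal sketch assuming values formatted like $1,000,000; the sample data here is hypothetical:

```python
import pandas as pd

# Hypothetical Amount values shaped like the site's prize strings
df = pd.DataFrame({"Amount": ["$1,000,000", "$10,000"]})

# Strip the dollar sign and thousands separators, then cast to int
df["Amount"] = df["Amount"].str.replace(r"[$,]", "", regex=True).astype(int)
print(df["Amount"].tolist())  # [1000000, 10000]
```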
If you would like to iterate through more pages, you could write a for loop over the URL's page parameter. You would need to repeat each part of the process, including the sleep, to allow each page to load. Here is the for loop section, iterating through 10 pages with an f-string:
for i in range(1, 11):
    driver.get(f"https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page={i}")
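Since the page parameter on this tool starts at 1, the ten page URLs can also be generated up front; a minimal sketch of just that step:

```python
# Build the ten paginated URLs with an f-string; page numbering starts at 1
base = "https://www.masslottery.com/tools/winners?games=billion-dollar-extravaganza-2023&page="
urls = [f"{base}{i}" for i in range(1, 11)]

print(len(urls))  # 10
print(urls[0])    # first page's URL, ending in page=1
```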
Answered By - samsupertaco