Issue
I’m trying to write a simple web scraper, practicing on this site, which has dynamic content.
My strategy is to use Selenium to get the page source so I have all the dynamic content, then scrape it with Beautiful Soup. Basically, exactly the strategy here.
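In other words, the pipeline I have in mind is roughly this (just a sketch of the intent; the real driver setup is in the script below):
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://weighttraining.guide/exercises/standing-dumbbell-overhead-triceps-extension/"

driver = webdriver.Chrome()                   # Step 1: let Chrome execute the JavaScript
driver.get(url)
html = driver.page_source                     # rendered HTML, dynamic content included
driver.quit()

soup = BeautifulSoup(html, "html.parser")     # Step 2: parse the rendered HTML with Beautiful Soup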
I’m stuck on Step 1, however: I can’t get Selenium to even get the page. The following script gets through the ‘driver loaded’ print statement and then freezes:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537"
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f"user-agent={user_agent}")
try:
    driver = webdriver.Chrome(executable_path='/opt/homebrew/bin/chromedriver', options=chrome_options)
    print("Driver loaded")
    driver.get("https://weighttraining.guide/exercises/standing-dumbbell-overhead-triceps-extension/")
    print("Selenium get successful")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "entry-content"))
    )
    print("Page loaded")
    driver.quit()
except Exception as e:
    print(f"An error occurred: {e}")
When I run this script I see Chrome open the website and load it successfully, but then the script freezes. I’ve tried this with and without each of the chrome_options, user_agent, and WebDriverWait portions. Nothing seems to work.
Please help!
Solution
I figured out how to do this by modifying the headers sent by the requests.get call and then using Beautiful Soup for the scraping. The point of Selenium is to run the JavaScript that loads all the dynamic content, so that we end up with plain HTML to parse with Beautiful Soup, but for this site and many others Python's requests library can mimic the browser well enough to do the same thing using headers alone. It's also materially faster than Selenium. Here's how it works:
import requests
from bs4 import BeautifulSoup

site = "https://weighttraining.guide/exercises/standing-dumbbell-overhead-triceps-extension/"

# Browser-like headers (courtesy of ScrapeOps) so the site serves the full page
headers = {
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'sec-ch-ua': 'Google Chrome;v="84", "Chromium";v="84", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': 'Windows',
    'sec-fetch-site': 'none',
    'sec-fetch-mod': '',
    'sec-fetch-user': '?1',
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'en-US,en;q=0.9,es;q=0.5'
}

response = requests.get(site, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
Note: use those headers in that order to look most like a real browser. Thank you ScrapeOps for the headers.
Now we can use the soup object as expected:
in: soup.find('h1', {'class': 'entry-title'}).text
out: 'Standing dumbbell overhead triceps extension'
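To double-check that the requests response really contains the dynamic content (rather than an empty JavaScript shell), you can look for the same entry-content element the original Selenium WebDriverWait was targeting. A minimal sanity check, assuming that class name from the question's script:
# Sanity check: the element the Selenium wait was looking for
# should now be present in the statically fetched HTML.
print(response.status_code)                    # expect 200
entry = soup.find(class_="entry-content")
print(entry is not None)                       # True if the dynamic content came through
if entry is not None:
    print(len(entry.get_text(strip=True)))     # non-zero length means it has real text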
JavaScript defeated!
Answered By - BLimitless