Issue
I’m trying to write a simple web scraper, practicing on this site, which has dynamic content.
My strategy is to use Selenium to get the page source so I have all the dynamic content, then scrape it with Beautiful Soup. Basically, exactly the strategy here.
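In other words, the pipeline I have in mind is roughly this (just a sketch of the intent; the real driver setup is in the script below):
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://weighttraining.guide/exercises/standing-dumbbell-overhead-triceps-extension/"

driver = webdriver.Chrome()                   # Step 1: let Chrome execute the JavaScript
driver.get(url)
html = driver.page_source                     # rendered HTML, dynamic content included
driver.quit()

soup = BeautifulSoup(html, "html.parser")     # Step 2: parse the rendered HTML with Beautiful Soup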
I’m stuck on Step 1, however: I can’t get Selenium to even get the page. The following script gets through the ‘driver loaded’ print statement and then freezes:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537"
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f"user-agent={user_agent}")
try:
    driver = webdriver.Chrome(executable_path='/opt/homebrew/bin/chromedriver', options=chrome_options)
    print("Driver loaded")
    driver.get("https://weighttraining.guide/exercises/standing-dumbbell-overhead-triceps-extension/")
    print("Selenium get successful")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "entry-content"))
    )
    print("Page loaded")
    driver.quit()
except Exception as e:
    print(f"An error occurred: {e}")
When I run this script I see Chrome open the website and load it successfully, but then the script freezes. I’ve tried this with and without each of the chrome_options, user_agent, and WebDriverWait portions. Nothing seems to work.
Please help!
Solution
I figured out how to do this by modifying the headers sent by the requests.get call and then using Beautiful Soup for the scraping. The point of Selenium is to run the JavaScript that loads all the dynamic content, so that we end up with plain HTML to parse with Beautiful Soup, but for this site and many others Python's requests library can mimic the browser well enough to do the same thing using headers alone. It's also materially faster than Selenium. Here's how it works:
import requests
from bs4 import BeautifulSoup

site = "https://weighttraining.guide/exercises/standing-dumbbell-overhead-triceps-extension/"

# Browser-like headers (courtesy of ScrapeOps) so the site serves the full page
headers = {
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'sec-ch-ua': 'Google Chrome;v="84", "Chromium";v="84", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': 'Windows',
    'sec-fetch-site': 'none',
    'sec-fetch-mod': '',
    'sec-fetch-user': '?1',
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'en-US,en;q=0.9,es;q=0.5'
}

response = requests.get(site, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
Note: use those headers in that order to look most like a real browser. Thank you ScrapeOps for the headers.
Now we can use the soup object as expected:
in: soup.find('h1', {'class': 'entry-title'}).text
out: 'Standing dumbbell overhead triceps extension'
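To double-check that the requests response really contains the dynamic content (rather than an empty JavaScript shell), you can look for the same entry-content element the original Selenium WebDriverWait was targeting. A minimal sanity check, assuming that class name from the question's script:
# Sanity check: the element the Selenium wait was looking for
# should now be present in the statically fetched HTML.
print(response.status_code)                    # expect 200
entry = soup.find(class_="entry-content")
print(entry is not None)                       # True if the dynamic content came through
if entry is not None:
    print(len(entry.get_text(strip=True)))     # non-zero length means it has real text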
JavaScript defeated!
Answered By - BLimitless