Issue
I am building a web scraper in Python to pull scheduling data from a website. Our company has full access to this site and its data via login credentials. Because the site requires dynamic navigation, I am using Selenium to drive the browser and BeautifulSoup to parse the HTML structure. With all variables defined, I have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
opt = Options()
opt.headless = True
driver = webdriver.Chrome(options=opt, executable_path=<path to chromedriver.exe>)
driver.get('<website login page URL>?username=' + username + '&password=' + password)
driver.get('<url of website page with data>?start_date=' + start_date + '&end_date=' + end_date + '&type=Excel')
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
The result of the print(soup) is as follows:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
</head>
<body> ... irrelevant ... </body></html>
Before anything else: I do not have much knowledge of bot detection or HTTP requests. My questions are:
When I run a headless driver as above, the scrape is blocked by the site's bot detection. When I run a regular, non-headless driver, where an automated browser window opens, the scrape succeeds. Why is this the case?
What is the best method to get around this? The scrape is legal and non-exploitative, as we effectively have full access to the data we are scraping (we are a registered client). Will using the requests library solve this problem? Are there other ways of running a headless web driver that won't get blocked? Is there some parameter I can change that prevents the block?
How do I see the robots.txt file of a website?
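(For context on that last question: a site's robots.txt is just a text file served at the site root, so it can be viewed by opening https://<site>/robots.txt in any browser. Below is a minimal standard-library sketch of reading one programmatically; it parses an inline example body rather than fetching a real site, and the paths shown are made up for illustration.)

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body (normally you would fetch https://<site>/robots.txt)
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/schedule"))   # allowed path
print(rp.can_fetch("*", "https://example.com/private/x"))  # disallowed path
```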
Solution
You can use the following code to hide the `navigator.webdriver` property that bot-detection scripts check:
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
Also, add these to your Chrome options:
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option('useAutomationExtension', False)
Answered By - CatChMeIfUCan