Issue
I want to get the href links from this website: https://www.dataprivacyframework.gov/s/participant-search The problem is that no href links are visible, even when I inspect the page.
If I use the manual 'copy link' method, the link does get copied.
How is that possible?
import re
import time
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
]
driver_path = '/users/tosh/downloads/chromedriver'
user_agent = random.choice(user_agents)
chrome_options = Options()
chrome_options.add_argument(f"user-agent={user_agent}")
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
url = 'https://www.dataprivacyframework.gov/s/participant-search'
driver.get(url)
time.sleep(5)
inactive_tab = driver.find_element(By.XPATH, '//*[@id="Inactive__item"]')
inactive_tab.click()
time.sleep(5)
link_list = []
# Both class names must be chained with dots; a space would look for an <lgorg> descendant tag
links = driver.find_elements(By.CSS_SELECTOR, 'a.slds-text-heading_small.lgorg')
for link in links:
    href = link.get_attribute('href')
    link_list.append(href)
print(link_list)
driver.quit()
Solution
"The problem is there are no visible href links even when i inspect the page."
That statement is exactly why you cannot scrape the href: the attribute simply is not there.
Having said that, there is a (not so straightforward) way to get all the URLs.
Explanation: Since the anchor tag (<a>) has no href attribute, get_attribute('href') returns None for it, so it cannot be used here.
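As a rough illustration of what "no href attribute" means, the sketch below parses a hand-written anchor tag with Python's standard html.parser module. The markup is hypothetical: it only borrows the class names from the question and is not copied from the live page. AnchorCollector is likewise an illustrative name. Presumably the real page handles the click in JavaScript, which would explain why 'copy link' still works in the browser.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the attribute dict of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


# Hypothetical markup modeled on the class names used in the question
sample_html = '<a class="slds-text-heading_small lgorg">Example Participant</a>'

parser = AnchorCollector()
parser.feed(sample_html)

anchor = parser.anchors[0]
href = anchor.get("href")  # no href key exists, so this is None

print(anchor)  # {'class': 'slds-text-heading_small lgorg'}
print(href)    # None
```

The anchor is found and its class is readable, but there is no href to read, which mirrors what Selenium reports for these links.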
To capture the URL behind each link, the code below clicks each link in a loop, reads the resulting address from driver.current_url (a property, not a method), and then navigates back to the list.
Code:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get('https://www.dataprivacyframework.gov/s/participant-search')

# Click on the Inactive tab and give the results time to render
wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='Inactive__item']"))).click()
time.sleep(5)

link_list = []
# Used only to know how many links there are; the elements go stale after navigation
links = driver.find_elements(By.XPATH, "//lightning-tab[@id='tab-2']//a[@class='slds-text-heading_small lgorg']")
count = 1
for link in links:
    countStr = str(count)
    # Re-locate the link by its position on every pass, then click it
    wait.until(EC.element_to_be_clickable((By.XPATH, "(//lightning-tab[@id='tab-2']//a[@class='slds-text-heading_small lgorg'])[" + countStr + "]"))).click()
    # Wait until the detail page has loaded before reading the URL
    wait.until(EC.url_contains("participant-detail"))
    url = driver.current_url
    link_list.append(url)
    # Go back to the list and reopen the Inactive tab
    wait.until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Data Privacy Framework List']"))).click()
    wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='Inactive__item']"))).click()
    time.sleep(1)
    count = count + 1

print(link_list)
driver.quit()
Console result:
['https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000PFClAAO&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000PJNjAAO&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt00000008WQ8AAM&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000GpEPAA0&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt00000008hXZAAY&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt00000008VqiAAE&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt00000008UzDAAU&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000TOKjAAO&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000eF01AAE&status=Inactive', 'https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000PFpOAAW&status=Inactive']
Process finished with exit code 0
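If only the record ID is needed rather than the full URL, it can be read from the query string of each captured address. This is a minimal sketch using the standard urllib.parse module; participant_info is a hypothetical helper name, and the two URLs are taken from the console result above.

```python
from urllib.parse import urlparse, parse_qs


def participant_info(url):
    """Return the 'id' and 'status' query parameters of a detail-page URL."""
    query = parse_qs(urlparse(url).query)
    return {
        "id": query.get("id", [None])[0],
        "status": query.get("status", [None])[0],
    }


# Two of the URLs from the console result above
link_list = [
    "https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000PFClAAO&status=Inactive",
    "https://www.dataprivacyframework.gov/s/participant-search/participant-detail?id=a2zt0000000PJNjAAO&status=Inactive",
]

records = [participant_info(u) for u in link_list]
print(records)
```

Each entry in records is a small dict, which is easier to store or deduplicate than the raw URL string.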
Answered By - Shawn