Issue
I'm trying to scrape tables from a website using Selenium in Python (in a Jupyter Notebook). For this, I'm using a loop to get the data for all countries in a dropdown list. This works fine until around the 10th iteration/country.
See my code below for what I tried:
ChemName = []
Category = []
Country = []
Response = []
Decision = []
Date = []
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://www.pic.int/Procedures/ImportResponses/Database/tabid/1370/language/en-US/Default.aspx")
# Click on the tab with text "Import Responses by Party"
party_tab = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, "//*[@aria-controls = 'tabstrip_ICR-2']")))
party_tab.click()
# Wait for the table rows to load (you might need to adjust the wait time as needed)
driver.implicitly_wait(10)
wait = WebDriverWait(driver, 5)
content = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "byParty.k-content.k-state-active")))
# Locate the dropdown element by its class name
country_dropdown = driver.find_element(By.CLASS_NAME, "k-icon.k-i-arrow-s")
country_dropdown.click()
# Locate the parent element of the dropdown
dropdown_parent = driver.find_element(By.XPATH, "//*[@id = 'ddlParty_listbox']")
dropdown_options = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.k-animation-container div#ddlParty-list ul#ddlParty_listbox li.k-item")))
time.sleep(2)
#Iterate through the dropdown options and extract text
for option in itertools.islice(dropdown_options,0,12):
    country_name = option.text.strip()
    print(country_name)
    a = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@aria-owns="ddlParty_listbox"]')))
    a.click()
    a.send_keys(Keys.CONTROL+'a')
    time.sleep(1)
    a.send_keys(Keys.DELETE)
    a.send_keys(country_name)
    time.sleep(1)
    a.send_keys(Keys.ENTER)
    try:
        if wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div#windowEU"))):
            WebDriverWait(driver, 3).until(EC.element_to_be_clickable((By.CLASS_NAME, "confirm_no.k-button"))).click()
            continue
    except:
        print('b')
        pass
    time.sleep(1)
    # Get the updated page source after selecting the country
    page_source = driver.page_source
    # Parse the updated page source using BeautifulSoup
    soup = BeautifulSoup(page_source, "html.parser")
    relevant_section = soup.find("table", id="IRview")
    # Find all rows in the table with IDs "row1" and "row2"
    rows = relevant_section.find_all("tr", 'class'==["row1", "row2"])
    if len(rows) > 0:
        for row in rows:
            columns = row.find_all("td")
            if len(columns) >= 4: # Make sure the row has enough columns
                chemical_name = columns[0].text.strip()
                category = columns[1].text.strip()
                party = columns[2].text.strip()
                resp = columns[3].text.strip()
                dec = columns[4].text.strip()
                dat = columns[5].text.strip()
                ChemName.append(chemical_name)
                Category.append(category)
                Country.append(party)
                Response.append(resp)
                Decision.append(dec)
                Date.append(dat)
    time.sleep(1)
I'm very new to this, so I'm not sure where the issue is, and I might have some bad practices. For the first ~10 countries (usually up to Bahrain) it goes fine (although a bit slowly), but after that it no longer prints the country_name and prints an empty string instead. I think this is due to the pop-up that shows up for the 9th country, which I close by clicking on "confirm_no.k-button". After clicking this, that iteration of the loop still goes fine, but the next one does not.
I also tried this code by clicking on the relevant country in the dropdown, but this didn't work either hence why I switched to the .send_keys(country_name) method.
Hope someone is able to explain what goes wrong. Thanks!
Solution
You shouldn't use Selenium for this. First of all, the site is buggy: after about the tenth selection it no longer clears the input when you select again, so the typed text ends up as something like 'AlbaniaBahrein'.
Instead of driving the frontend, you can send requests straight to the backend and parse the response. That is a much quicker solution and stable enough.
(I haven't touched your code before the loop.)
from selenium import webdriver
import time
import requests
import pycountry
from selenium.webdriver.common.by import By
from urllib.parse import quote
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
ChemName = []
Category = []
Country = []
Response = []
Decision = []
Date = []
driver = webdriver.Chrome()
driver.get("https://www.pic.int/Procedures/ImportResponses/Database/tabid/1370/language/en-US/Default.aspx")
# Click on the tab with text "Import Responses by Party"
party_tab = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.XPATH, "//*[@aria-controls = 'tabstrip_ICR-2']")))
party_tab.click()
# Wait for the table rows to load (you might need to adjust the wait time as needed)
driver.implicitly_wait(10)
wait = WebDriverWait(driver, 5)
content = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "byParty.k-content.k-state-active")))
# Locate the dropdown element by its class name
country_dropdown = driver.find_element(By.CLASS_NAME, "k-icon.k-i-arrow-s")
country_dropdown.click()
# Locate the parent element of the dropdown
dropdown_parent = driver.find_element(By.XPATH, "//*[@id = 'ddlParty_listbox']")
dropdown_options = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "div.k-animation-container div#ddlParty-list ul#ddlParty_listbox li.k-item")))
time.sleep(2)
# Query the backend once per country instead of re-selecting it in the dropdown
for iteration in range(len(dropdown_options)):
    # Keep only the part of the option text before any " (" suffix
    country_text = dropdown_options[iteration].text.split(' (')[0]
    print(country_text)
    # Look the country up by its exact name to get the two-letter code
    country = pycountry.countries.get(name=country_text)
    if country:
        country_param = f"country eq '{country.alpha_2}'"
        encoded_param = quote(country_param)
        link = f"https://informea.pops.int/asbCountryProfiles/asbRcIrFra.svc/asbGetRcIR?$select=*&$filter={encoded_param}+and+listed_id+eq+5"
        response = requests.get(link).json()
        if len(response) > 0 and 'value' in response:
            # Each item in 'value' is one import response for this country
            for data in response['value']:
                ChemName.append(data['ChemicalName_en'])
                Category.append(data['Category_en'])
                Country.append(data['CountryName_en'])
                Response.append(data['DecisionType_en'])
                Decision.append(data['Decision_en'])
                Date.append(data['PIC_circular_date'])
    else:
        # No match for this name, so it needs a manual mapping
        print('Country skipped: ' + country_text)
The code above builds a request that returns all the needed data. The request filters by the country's two-letter code, so I used pycountry to convert each country name into its two-letter equivalent.
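To make the request format concrete, here is a minimal sketch of how the filter ends up in the URL. The endpoint and the `listed_id eq 5` clause are copied from the loop above and I haven't checked them against any API documentation; the helper name `build_link` is hypothetical and only there for illustration.

```python
from urllib.parse import quote

# Endpoint copied from the answer's loop; treat it as an assumption, not a documented API.
BASE_URL = "https://informea.pops.int/asbCountryProfiles/asbRcIrFra.svc/asbGetRcIR"


def build_link(alpha_2: str) -> str:
    """Build the request URL for one country (hypothetical helper)."""
    # quote() percent-encodes the spaces and the single quotes in the filter.
    encoded_param = quote(f"country eq '{alpha_2}'")
    return f"{BASE_URL}?$select=*&$filter={encoded_param}+and+listed_id+eq+5"


print(build_link("AL"))
```

Printing the link for one country and opening it in a browser is a quick way to see which fields the backend returns before running the whole loop.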
Be aware that several countries can't be mapped this way, so you need to provide an additional mapping for them. (They are printed to the console.)
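One possible shape for that additional mapping is a small override table that is consulted when the normal lookup finds nothing. This is only a sketch: the names in `MANUAL_ALPHA_2` are assumed examples (the codes themselves are the ISO 3166-1 ones), and `to_alpha_2` and `demo_lookup` are hypothetical helpers. The lookup is passed in as a function so the sketch runs without pycountry installed.

```python
from typing import Callable, Optional

# Assumed example names; replace them with whatever your run prints as "Country skipped".
MANUAL_ALPHA_2 = {
    "Bolivia": "BO",
    "Iran": "IR",
    "Venezuela": "VE",
}


def to_alpha_2(name: str, lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return a two-letter code, trying `lookup` first and the manual table second."""
    code = lookup(name)
    if code:
        return code
    return MANUAL_ALPHA_2.get(name)


def demo_lookup(name: str) -> Optional[str]:
    # Stand-in for pycountry so the sketch is self-contained.
    return {"Albania": "AL", "Bahrain": "BH"}.get(name)


# With pycountry, the lookup could be (assuming get() returns None on no match):
#   lookup = lambda n: getattr(pycountry.countries.get(name=n), "alpha_2", None)

for name in ["Albania", "Bolivia", "Atlantis"]:
    print(name, to_alpha_2(name, demo_lookup))
```

Anything that still comes back as `None` is a name that needs another entry in the table.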
Then you simply read the properties you need from each item in the response's 'value' list, and that's it. :)
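If you would rather keep each record together than fill six parallel lists, the same extraction could be sketched as below. The key names are the ones used in the loop above; collecting rows as dicts is my own variation, `rows_from_response` is a hypothetical helper, and the sample payload is made up purely to show the shape.

```python
# Output column name -> key in each item of the response's 'value' list
FIELDS = {
    "ChemName": "ChemicalName_en",
    "Category": "Category_en",
    "Country": "CountryName_en",
    "Response": "DecisionType_en",
    "Decision": "Decision_en",
    "Date": "PIC_circular_date",
}


def rows_from_response(payload: dict) -> list:
    """Turn one decoded JSON response into a list of flat row dicts."""
    rows = []
    for item in payload.get("value", []):
        # .get() yields None instead of raising when a key is absent.
        rows.append({column: item.get(key) for column, key in FIELDS.items()})
    return rows


# Made-up payload with placeholder values, not real data from the site.
sample = {
    "value": [
        {
            "ChemicalName_en": "Chemical A",
            "Category_en": "Category X",
            "CountryName_en": "Country Y",
            "DecisionType_en": "Type Z",
            "Decision_en": "Some decision",
        }
    ]
}

for row in rows_from_response(sample):
    print(row)
```

A list of dicts like this can be passed directly to `pandas.DataFrame(...)` or written out with `csv.DictWriter`.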
Answered By - Yaroslavm