Saturday, January 6, 2024

[FIXED] Python loop for web scraping using Selenium stops working after a number of iterations

Labels: jupyter-notebook, python, selenium-webdriver, web-scraping

Issue

I'm trying to scrape tables from a website using Selenium in Python (in a Jupyter Notebook). To do this, I loop over all countries in a dropdown list and collect the data for each one. This works fine until around the 10th iteration/country.

See my code below for what I tried:

ChemName = []
Category = []
Country = []
Response = []
Decision = []
Date = []

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
driver.get("https://www.pic.int/Procedures/ImportResponses/Database/tabid/1370/language/en-US/Default.aspx")    

# Click on the tab with text "Import Responses by Party"
party_tab = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, "//*[@aria-controls = 'tabstrip_ICR-2']")))
party_tab.click()

# Wait for the table rows to load (you might need to adjust the wait time as needed)
driver.implicitly_wait(10)

wait = WebDriverWait(driver, 5)

content = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "byParty.k-content.k-state-active")))

# Locate the dropdown element by its class name
country_dropdown = driver.find_element(By.CLASS_NAME, "k-icon.k-i-arrow-s")
country_dropdown.click()

# Locate the parent element of the dropdown
dropdown_parent = driver.find_element(By.XPATH, "//*[@id = 'ddlParty_listbox']")
dropdown_options = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.k-animation-container div#ddlParty-list ul#ddlParty_listbox li.k-item")))

time.sleep(2)

#Iterate through the dropdown options and extract text
for option in itertools.islice(dropdown_options,0,12):
    country_name = option.text.strip()
    print(country_name)

    a = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@aria-owns="ddlParty_listbox"]')))
    a.click()
    a.send_keys(Keys.CONTROL+'a')
    time.sleep(1)
    a.send_keys(Keys.DELETE)

    a.send_keys(country_name)
    time.sleep(1)
    a.send_keys(Keys.ENTER)
    
    try:
        if wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div#windowEU"))):
            
            WebDriverWait(driver, 3).until(EC.element_to_be_clickable((By.CLASS_NAME, "confirm_no.k-button"))).click()
            continue

    except:
        print('b')
        pass

    time.sleep(1)
    

    # Get the updated page source after selecting the country
    page_source = driver.page_source

    # Parse the updated page source using BeautifulSoup
    soup = BeautifulSoup(page_source, "html.parser")
    relevant_section = soup.find("table", id="IRview")

    # Find all rows in the table with IDs "row1" and "row2"
    rows = relevant_section.find_all("tr", 'class'==["row1", "row2"])

    
    if len(rows) > 0:
        for row in rows:
            columns = row.find_all("td")
            if len(columns) >= 4:  # Make sure the row has enough columns
                chemical_name = columns[0].text.strip()
                category = columns[1].text.strip()
                party = columns[2].text.strip()
                resp = columns[3].text.strip()
                dec = columns[4].text.strip()
                dat = columns[5].text.strip()

                ChemName.append(chemical_name)
                Category.append(category)
                Country.append(party)
                Response.append(resp)
                Decision.append(dec)
                Date.append(dat)

    time.sleep(1)

I'm very new to this, so I'm not sure where the issue is, and my code might contain some bad practices. For the first ~10 countries (usually up to Bahrain) it works fine, although a bit slowly. After that, it no longer prints the country_name and prints an empty string instead. I think this is due to the pop-up that appears for the 9th country, which I close by clicking on "confirm_no.k-button". After clicking it, that iteration still completes fine, but the next one does not.

I also tried clicking on the relevant country in the dropdown, but this didn't work either, which is why I switched to the .send_keys(country_name) method.

I hope someone can explain what goes wrong. Thanks!

Solution

You shouldn't use Selenium for this purpose. First of all, the site is buggy: after the tenth response it doesn't clear the input on re-selection, so you get a concatenated value such as 'AlbaniaBahrain'.

Instead of driving the frontend, you can send requests directly to the backend and parse the response. This is a much quicker solution and stable enough.

(I haven't touched your code before the loop.)

import time
from urllib.parse import quote

import pycountry
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

ChemName = []
Category = []
Country = []
Response = []
Decision = []
Date = []

driver = webdriver.Chrome()
driver.get("https://www.pic.int/Procedures/ImportResponses/Database/tabid/1370/language/en-US/Default.aspx")

# Click on the tab with text "Import Responses by Party"
party_tab = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.XPATH, "//*[@aria-controls = 'tabstrip_ICR-2']")))
party_tab.click()

# Wait for the table rows to load (you might need to adjust the wait time as needed)
driver.implicitly_wait(10)

wait = WebDriverWait(driver, 5)

content = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "byParty.k-content.k-state-active")))

# Locate the dropdown element by its class name
country_dropdown = driver.find_element(By.CLASS_NAME, "k-icon.k-i-arrow-s")
country_dropdown.click()

# Locate the parent element of the dropdown
dropdown_parent = driver.find_element(By.XPATH, "//*[@id = 'ddlParty_listbox']")
dropdown_options = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "div.k-animation-container div#ddlParty-list ul#ddlParty_listbox li.k-item")))

time.sleep(2)

# Query the backend once per country listed in the dropdown
for option in dropdown_options:
    # Keep only the part of the label before any " (" suffix
    country_text = option.text.split(' (')[0]
    print(country_text)
    # A falsy result means pycountry found no match; see the else branch
    country = pycountry.countries.get(name=country_text)
    if country:
        country_param = f"country eq '{country.alpha_2}'"
        encoded_param = quote(country_param)
        link = f"https://informea.pops.int/asbCountryProfiles/asbRcIrFra.svc/asbGetRcIR?$select=*&$filter={encoded_param}+and+listed_id+eq+5"
        # A timeout keeps the loop from hanging on a stalled request
        response = requests.get(link, timeout=30).json()
        if 'value' in response:
            for data in response['value']:
                ChemName.append(data['ChemicalName_en'])
                Category.append(data['Category_en'])
                Country.append(data['CountryName_en'])
                Response.append(data['DecisionType_en'])
                Decision.append(data['Decision_en'])
                Date.append(data['PIC_circular_date'])
    else:
        print('Country skipped: ' + country_text)

In the code above, I construct a request that returns all the needed data. The request can be filtered by the country's two-letter code, so I used pycountry to convert each country name into its two-letter equivalent.
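To look at the URL construction on its own, here is a minimal sketch that builds the same request string as the loop above without sending anything. The helper name build_import_response_url is hypothetical; the endpoint, the filter syntax, and the listed_id value of 5 are copied from the answer's code rather than from any documentation.

```python
from urllib.parse import quote

# Endpoint copied from the answer's code, not from official documentation.
BASE_URL = "https://informea.pops.int/asbCountryProfiles/asbRcIrFra.svc/asbGetRcIR"


def build_import_response_url(alpha_2: str, listed_id: int = 5) -> str:
    """Return the query URL for one country (hypothetical helper name)."""
    # Percent-encode the country clause so its spaces and quotes are URL-safe.
    encoded_country = quote(f"country eq '{alpha_2}'")
    # The second clause is appended with literal '+' signs, as in the answer.
    return f"{BASE_URL}?$select=*&$filter={encoded_country}+and+listed_id+eq+{listed_id}"


print(build_import_response_url("AL"))
```

Printing the URL for one country and opening it in a browser is a quick way to check the response shape before running the full loop.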

Be aware that several countries can't be mapped this way, so you need to provide an additional mapping for them. (They are printed to the console.)
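One possible shape for that additional mapping is a small override table that is consulted only when the library lookup fails. This is a sketch under assumptions: resolve_alpha_2, MANUAL_ALPHA_2, and demo_lookup are hypothetical names, and the display names in the table are examples, not a verified list of what the site's dropdown actually shows. The lookup is passed in as a function so the snippet runs without pycountry installed.

```python
from typing import Callable, Optional

# Hypothetical overrides: the names on the left are examples only.
# The ISO 3166-1 alpha-2 codes on the right are the standard ones.
MANUAL_ALPHA_2 = {
    "Bolivia": "BO",
    "Iran": "IR",
    "Venezuela": "VE",
}


def resolve_alpha_2(name: str, lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the library lookup first, then fall back to the manual table."""
    code = lookup(name)
    if code is not None:
        return code
    # Returns None when the name is in neither source.
    return MANUAL_ALPHA_2.get(name)


# Stand-in lookup for demonstration. With pycountry installed it could be:
#   lambda n: getattr(pycountry.countries.get(name=n), "alpha_2", None)
def demo_lookup(name: str) -> Optional[str]:
    return {"Albania": "AL", "Bahrain": "BH"}.get(name)


for country_name in ["Albania", "Iran", "Atlantis"]:
    print(country_name, resolve_alpha_2(country_name, demo_lookup))
```

Names that still resolve to None are the ones to add to the table after a first run.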

Then you simply read the needed properties from each entry of 'value' in the response, and that's it. :)
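As an illustration of that last step, the sketch below turns one decoded response into a list of row dictionaries instead of six parallel lists. The field names are the ones read in the answer's loop; the extract_rows helper and the sample payload are hypothetical, and the sample values are made-up placeholders, not real data from the service.

```python
from typing import Any

# Output column -> response field, using the field names from the answer's code.
FIELDS = {
    "ChemName": "ChemicalName_en",
    "Category": "Category_en",
    "Country": "CountryName_en",
    "Response": "DecisionType_en",
    "Decision": "Decision_en",
    "Date": "PIC_circular_date",
}


def extract_rows(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect the wanted fields from every record under 'value'."""
    rows = []
    for record in payload.get("value", []):
        # .get() leaves a missing field as None instead of raising KeyError.
        rows.append({column: record.get(key) for column, key in FIELDS.items()})
    return rows


# Placeholder payload for demonstration only; the values are invented.
sample = {
    "value": [
        {
            "ChemicalName_en": "Example chemical",
            "Category_en": "Example category",
            "CountryName_en": "Example country",
            "DecisionType_en": "Example response",
            "Decision_en": "Example decision",
        }
    ]
}

for row in extract_rows(sample):
    print(row)
```

A list of dictionaries like this can be passed straight to pandas.DataFrame if a table is the end goal.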

Answered By - Yaroslavm

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
