Issue
I'm trying to scrape some info on tennis matches from a JavaScript site using Scrapy and Selenium. The starting URLs are for pages that contain all the matches on a given date. The first task on each page is to make all the matches visible from behind some horizontal tabs - I've got this covered. The second task is to scrape the match pages that sit behind links that aren't present on the starting URL pages - a specific tag needs to be clicked.
I've found all these tags no problem and have written a loop that uses Selenium to click each tag and yield a Request after each iteration. The issue is that each time I click through on a link, the page changes and my lovely list of elements detaches itself from the DOM, giving me a StaleElementReferenceException. I understand why this happens but I'm struggling to come up with a solution.
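To show what I mean, here's a stripped-down repro outside Scrapy (just Selenium, reusing the same match XPath; assumes geckodriver is on the PATH):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://scores24.live/en/tennis/2015-01-06")
matches = driver.find_elements_by_xpath("//span[@class='link sc-10gv6xe-4 eEAcym pointer']")
driver.execute_script("arguments[0].click();", matches[0])  # navigates to the match page
driver.back()  # the DOM is rebuilt, so every reference held in `matches` is now stale
matches[1].click()  # raises StaleElementReferenceException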
Here's my code so far:
import datetime as dt

from dateutil.rrule import DAILY, rrule
from scrapy import Spider, Request
from scrapy.crawler import CrawlerProcess
from scrapy.http.response import Response
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

MATCHES_XPATH = "//span[@class='link sc-10gv6xe-4 eEAcym pointer']"
ELEMENT_TEST = "//span[@class='link sc-15d69aw-2 hhbGos']"


class ScraperS24(Spider):
    name = "scores24_scraper"
    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {
            'scraper.polgara.middlewares.SeleniumMiddleware': 543,
        },
    }
    httperror_allowed_codes = [301]

    def __init__(self):
        dates = list(rrule(DAILY, dtstart=dt.datetime(2015, 1, 6), until=dt.datetime(2015, 1, 21)))
        self.start_urls = [f"https://scores24.live/en/tennis/{d.strftime('%Y-%m-%d')}" for d in dates]
        super().__init__()

    def parse(self, response: Response):
        print(f"Parsing date - {response.url}")
        driver = response.request.meta["driver"]
        # expose all the matches hidden behind the horizontal tabs
        tabs = driver.find_elements_by_xpath("//button[@class='hjlkds-7 khtDIT']")
        for t in tabs:
            driver.execute_script("arguments[0].click();", t)
        matches = driver.find_elements_by_xpath(MATCHES_XPATH)
        wait = WebDriverWait(driver, 20)
        for m in matches:
            # clicking navigates away, which is what invalidates the `matches` references
            driver.execute_script("arguments[0].click();", m)
            try:
                wait.until(EC.presence_of_element_located((By.XPATH, ELEMENT_TEST)))
            except TimeoutException:
                driver.back()
            else:
                url = str(driver.current_url)
                driver.back()
                yield Request(url, callback=self._parse_match)

    def _parse_match(self, response):
        print(f"Parsing match - {response.url}")


process = CrawlerProcess()
process.crawl(ScraperS24)
process.start()
And the Selenium middleware:
import logging

from scrapy import Request, Spider, signals
from scrapy.http import HtmlResponse
from selenium import webdriver

logger = logging.getLogger(__name__)


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request: Request, spider: Spider):
        logger.debug(f"Selenium processing request - {request.url}")
        self.driver.get(request.url)
        request.meta.update({'driver': self.driver})
        return HtmlResponse(
            request.url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(
            options=options,
            executable_path=cn.GECKODRIVER_PATH,
        )

    def spider_closed(self, spider):
        self.driver.quit()
I've tried adapting the loop using the answer here:
for idx, _ in enumerate(matches):
    # re-locate the elements on every iteration so the references are fresh
    matches = driver.find_elements_by_xpath(MATCHES_XPATH)
    driver.execute_script("arguments[0].click();", matches[idx])
    try:
        wait.until(EC.presence_of_element_located((By.XPATH, ELEMENT_TEST)))
    except TimeoutException:
        driver.back()
    else:
        url = str(driver.current_url)
        driver.back()
        yield Request(url, callback=self._parse_match)
However, because Scrapy scrapes in parallel, the driver is very quickly used to load up a new starting URL page before the loop has finished, and so the list indexing on matches gets screwed up.
Any ideas on where I could go from here? Is there a way for me to use the adapted loop and force Scrapy to finish scraping all the match tags on a starting URL page before moving onto the next URL? There are answers on how to do this, but they rely on storing a list of URLs on each page - something I don't have the luxury of...
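For reference, this is the pattern those answers use - pull everything you need out of the driver first, then yield the Requests afterwards - which only works when the links are already present on the page (the selector below is made up):

def parse(self, response):
    driver = response.request.meta["driver"]
    # collect the hrefs up front so the driver is free to move on afterwards
    links = driver.find_elements_by_xpath("//a[@class='match-link']")  # hypothetical selector
    urls = [link.get_attribute("href") for link in links]
    for url in urls:
        yield Request(url, callback=self._parse_match)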
Solution
You don't need Selenium for this page, because the data is already contained in the page's HTML, embedded in a window.__APP__STATE__=JSON.parse(...) script tag.
Most of the information on a match is available in that matches JSON object, so depending on what information you want to obtain you may not need to follow the links to the individual match pages at all.
The code below shows how to parse the match data directly from the HTML.
import scrapy
import datetime as dt
from dateutil.rrule import DAILY, rrule
import json
from scrapy.crawler import CrawlerProcess


class ScraperS24(scrapy.Spider):
    name = "scores24_scraper"
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36",
    }

    def __init__(self):
        dates = list(rrule(DAILY, dtstart=dt.datetime(2015, 1, 6), until=dt.datetime(2015, 1, 6)))
        self.start_urls = [f"https://scores24.live/en/tennis/{d.strftime('%Y-%m-%d')}" for d in dates]
        super().__init__()

    def parse(self, response):
        # obtain the json state object embedded in the page
        data = response.xpath("//script[contains(text(),'window.__APP__STATE__=JSON.parse')]/text()").re_first(r"window\.__APP__STATE__=JSON\.parse\((.*)\);")
        # the captured JSON.parse argument is itself a JSON string literal,
        # so it has to be decoded twice: once to get the inner string,
        # once to get the actual object
        data_json = json.loads(data)
        data_json = json.loads(data_json)
        tennis_data = data_json["matchFeed"]["_feed"]["tennis"]  # this is a list of tournaments
        for tournament in tennis_data:
            for match in tournament["matches"]:
                match_url = f"m-{match['slug']}"
                yield match  # outputs the summary data on each match. Comment this line out if following links to each match
                # to follow the links to each match, uncomment the line below
                # yield response.follow(match_url, callback=self.parse_match)

    def parse_match(self, response):
        self.logger.info("Parsing match on url: " + response.url)
        # parse the match details here


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(ScraperS24)
    process.start()
If you run the scraper, you will obtain the summary data for each match.
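If you want to persist those summary items, one option (a sketch - the file name is arbitrary, and the FEEDS setting needs Scrapy 2.1 or newer) is to pass feed-export settings to the CrawlerProcess:

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        "FEEDS": {
            "matches.json": {"format": "json"},  # write every yielded match dict to this file
        },
    })
    process.crawl(ScraperS24)
    process.start()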
Answered By - msenior_