Issue
I am building a web scraper that downloads CSV files from a website. I have to log in to multiple user accounts to download all the files, and for each account I have to navigate through several hrefs to reach the files. I've decided to use Scrapy spiders to complete this task. Here's the code I have so far:
I store the username and password info in a dictionary.
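For reference, login_info is just a plain mapping of account usernames to passwords, something like this (the entries below are placeholders, not real credentials):

login_info = {
    'account_one': 'password_one',  # placeholder username/password pairs
    'account_two': 'password_two',
}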
def start_requests(self):
    yield scrapy.Request(url = "https://external.lacare.org/provportal/", callback = self.login)

def login(self, response):
    for uname, upass in login_info.items():
        yield scrapy.FormRequest.from_response(
            response,
            formdata = {'username': uname,
                        'password': upass,
                        },
            dont_filter = True,
            callback = self.after_login
            )
I then navigate through the web pages by finding all href links in each response.
def after_login(self, response):
    hxs = scrapy.Selector(response)
    all_links = hxs.xpath('*//a/@href').extract()
    for link in all_links:
        if 'listReports' in link:
            url_join = response.urljoin(link)
            return scrapy.Request(
                url = url_join,
                dont_filter = True,
                callback = self.reports
                )
    return
def reports(self, response):
    hxs = scrapy.Selector(response)
    all_links = hxs.xpath('*//a/@href').extract()
    for link in all_links:
        url_join = response.urljoin(link)
        yield scrapy.Request(
            url = url_join,
            dont_filter = True,
            callback = self.select_year
            )
    return
I then crawl through each href on the page and check the response to see if I can keep going. This portion of the code seems excessive to me, but I am not sure how else to approach it.
def select_year(self, response):
    if '>2017' in str(response.body):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url = url_join,
                dont_filter = True,
                callback = self.select_elist
                )
    return
def select_elist(self, response):
    if '>Elists' in str(response.body):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url = url_join,
                dont_filter = True,
                callback = self.select_company
                )
Everything works fine, but as I said, it does seem excessive to crawl through every href on the page. I wrote a script for this website in Selenium and was able to select the correct links using the find_element_by_partial_link_text() method. I've searched for something comparable in Scrapy, but it seems like Scrapy navigation is based strictly on XPath and CSS selectors.
Is this how Scrapy is meant to be used in this scenario? Is there anything I can do to make the scraping process less redundant?
This is my first working scrapy spider, so go easy on me!
Solution
If you need to extract only the links with a certain substring in their link text, you can use LinkExtractor with the following XPath:
LinkExtractor(restrict_xpaths='//a[contains(text(), "substring to find")]').extract_links(response)
as LinkExtractor is the proper way to extract and process links in Scrapy.
Docs: https://doc.scrapy.org/en/latest/topics/link-extractors.html
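As a rough sketch of how that could slot into one of the callbacks from the question, assuming the year and list names actually appear in the visible link text on those pages (an assumption on my part), select_year could become something like:

import scrapy
from scrapy.linkextractors import LinkExtractor

def select_year(self, response):
    # Follow only the links whose visible text contains "2017",
    # instead of yielding a request for every href on the page.
    year_links = LinkExtractor(
        restrict_xpaths='//a[contains(text(), "2017")]'
    ).extract_links(response)
    for link in year_links:
        yield scrapy.Request(
            url = link.url,
            dont_filter = True,
            callback = self.select_elist
            )

The same pattern would apply to select_elist, with "Elists" as the substring to match.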
Answered By - mizhgun