Issue
I have a loop that creates links I want to scrape:
from datetime import date, timedelta

start_date = date(2020, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []

# Generate the links
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
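For reference, str() on a date object gives the ISO format (YYYY-MM-DD), so each generated link ends in a date string. A minimal standalone check of the first few links (using the same base URL, shortened to three days for illustration):

```python
from datetime import date, timedelta

start_date = date(2020, 1, 1)
links = [
    f"https://www.racingpost.com/results/{start_date + timedelta(days=n)}"
    for n in range(3)
]
print(links[0])  # https://www.racingpost.com/results/2020-01-01
```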
If I print "links", it works fine and I get the URLs I want.
Then I have a spider that scrapes the site just as well if I put in a URL manually. Now I tried to pass the "links" variable containing the URLs I want to scrape, as below, but I get an "undefined variable" error back.
import scrapy
from scrapy_splash import SplashRequest

class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
    function main(splash, args)
        url = args.url
        assert(splash:go(url))
        return splash:html()
    end
    '''

    def start_requests(self):
        yield SplashRequest(url=links, callback=self.parse, endpoint='execute',
                            args={
                                'lua_source': self.script
                            })

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
How do I pass the generated links into SplashRequest(url=links, ...)?
Thanks so much for helping me out - I am still new to this and making small steps - most of them backward...
Solution
From my comment above (I'm not quite sure this works because I'm unfamiliar with Scrapy): the obvious problem is that there is no reference to the links variable inside the RpresultSpider class, so Python reports it as undefined. Moving the loop that generates the URLs into start_requests fixes the scoping issue. Note also that SplashRequest expects a single URL string, not a list, so you should yield one request per link.
from datetime import date, timedelta

import scrapy
from scrapy_splash import SplashRequest

class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
    function main(splash, args)
        url = args.url
        assert(splash:go(url))
        return splash:html()
    end
    '''

    def start_requests(self):
        start_date = date(2020, 1, 1)
        end_date = date.today()
        crawl_date = start_date
        base_url = "https://www.racingpost.com/results/"
        links = []
        # Generate the links
        while crawl_date <= end_date:
            links.append(base_url + str(crawl_date))
            crawl_date += timedelta(days=1)
        # Yield one SplashRequest per generated link
        for link in links:
            yield SplashRequest(url=link, callback=self.parse, endpoint='execute',
                                args={
                                    'lua_source': self.script
                                })

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
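As a side note, the link generation can be factored into a small helper that is easy to test on its own, independent of Scrapy. This is a sketch under the same assumptions as the spider above (the function name generate_result_links is mine, not from the original post):

```python
from datetime import date, timedelta

def generate_result_links(start, end, base_url="https://www.racingpost.com/results/"):
    """Yield one results URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield base_url + current.isoformat()
        current += timedelta(days=1)

urls = list(generate_result_links(date(2020, 1, 1), date(2020, 1, 3)))
print(len(urls))  # 3
```

In start_requests you would then write `for link in generate_result_links(...): yield SplashRequest(url=link, ...)`.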
Answered By - dvr