Wednesday, January 24, 2024

[FIXED] How can I make Scrapy follow the links in order

Labels: python, scrapy, web-scraping

Issue

I'm working on a small scraping project and everything works fine, except for the order of the links, since Scrapy is asynchronous. rankings["Men's Pound-for-Pound"] is a list of links that I expect to be followed in order, so that the output is in order as well.

Here's my code:

import scrapy
from scrapy import Selector


class FighterSpiderSpider(scrapy.Spider):
    name = 'fighter_spider'
    allowed_domains = ['www.ufc.com.br']
    start_urls = ['https://www.ufc.com.br/rankings']

    def parse(self, response):
        all_rankings = response.css('div.view-grouping').getall()  # --> list of all rankings
        # Ranking title -> first link in the group (the champion)
        champions = {Selector(text=x).css('div.view-grouping div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').get() for x in all_rankings}
        # Ranking title -> all links in the group, in page order
        rankings = {Selector(text=x).css('div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').getall() for x in all_rankings}

        if self.ranking == "p4p male":
            for link in rankings["Men's Pound-for-Pound"]:
                yield response.follow(link, callback=self.parse_date)

Solution

Scrapy gives no guarantee that responses, and therefore your output, will be processed in a particular order. You can set a priority on each request, which influences the order in which the engine dispatches requests, but responses can still come back and be processed in a different order.

To set the priority, pass the priority parameter to your Request or response.follow calls.

links = rankings["Men's Pound-for-Pound"]
for i, link in enumerate(links):
    # Earlier links get higher priority values
    yield response.follow(link, callback=self.parse_date, priority=len(links) - i)

The higher the priority value, the sooner the request is scheduled.
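To see which priorities that loop assigns, here is a small standalone sketch that needs no Scrapy. The helper name descending_priorities and the example links are hypothetical and only illustrate the arithmetic.

```python
def descending_priorities(links):
    """Pair each link with a priority so that earlier links get higher values."""
    total = len(links)
    return [(link, total - i) for i, link in enumerate(links)]


# Hypothetical links, used only for illustration.
links = ["/athlete/a", "/athlete/b", "/athlete/c"]
for link, priority in descending_priorities(links):
    print(priority, link)
```

The first link gets the largest value and the last link gets 1, so the scheduler prefers the links in their original order.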

Since this still doesn't guarantee the output order, I would suggest passing the rank to the callback as a keyword argument with each request, and then sorting the output in a pipeline or a post-processing step.

For example:

import scrapy
from scrapy import Selector


class FighterSpiderSpider(scrapy.Spider):
    name = 'fighter_spider'
    allowed_domains = ['www.ufc.com.br']
    start_urls = ['https://www.ufc.com.br/rankings']

    def parse(self, response):
        all_rankings = response.css('div.view-grouping').getall()  # --> list of all rankings
        champions = {Selector(text=x).css('div.view-grouping div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').get() for x in all_rankings}
        rankings = {Selector(text=x).css('div.info h4::text').get().strip(): Selector(text=x).css('a::attr(href)').getall() for x in all_rankings}

        if self.ranking == "p4p male":
            for i, link in enumerate(rankings["Men's Pound-for-Pound"]):
                # Carry the 1-based position to the callback via cb_kwargs
                yield response.follow(link, callback=self.parse_date, cb_kwargs={"rank": i + 1})

    def parse_date(self, response, rank):
        ...
        ...
        yield {'rank': rank}  # ... plus the other scraped fields

Then you can sort the output into the correct order in a pipeline or in post-processing.
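One way to do that sorting is an item pipeline that buffers the items and writes them out by rank once the spider closes. The sketch below is only one possible structure: the class name SortByRankPipeline and the output file name are hypothetical, and it assumes every item carries a 'rank' key. It relies on Scrapy's process_item and close_spider pipeline hooks.

```python
import json


class SortByRankPipeline:
    """Buffer every item, then write them sorted by 'rank' when the spider closes."""

    def __init__(self, path="fighters_sorted.json"):  # hypothetical output path
        self.path = path
        self.items = []

    def process_item(self, item, spider):
        # Keep a plain-dict copy; returning the item lets later pipelines see it.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Every response has been handled by now, so arrival order no longer matters.
        self.items.sort(key=lambda entry: entry["rank"])
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

To try it, you would register the class under ITEM_PIPELINES in your project settings, using whatever module path your project has. Buffering everything in memory is fine for a short ranking list; for large crawls, sorting the exported file afterwards is probably the safer choice.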

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
