Friday, January 26, 2024

[FIXED] How to allow scrapy to follow redirects?

Labels: python, redirect, scrapy

Issue

I am trying to scrape data from historical versions of web pages as backed up by the Wayback Machine.

I have thousands of pages to scrape, and I don't want to go to the trouble of finding the exact date and time of every available backup for each of them. I just want weekly historical data, or the nearest available capture.

What I know is that if I put a date in a link here:

https://web.archive.org/web/<some_date>/<some_url>

then the Wayback Machine will automatically redirect to the closest available capture. This works fine for my scenario.
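As a rough illustration of the URL pattern above, weekly request URLs could be generated along these lines. This is only a sketch: the function name weekly_wayback_urls and the example.com address are placeholders rather than anything from the original question, and the YYYYMMDD date format is taken from the URL that appears in the log output further down.

```python
from datetime import date, timedelta


def weekly_wayback_urls(page_url, start, end):
    """Yield one Wayback Machine URL per week from start to end (inclusive)."""
    current = start
    while current <= end:
        # Date goes into the path as YYYYMMDD; the Wayback Machine is
        # expected to redirect to the closest capture it actually holds.
        yield f"https://web.archive.org/web/{current:%Y%m%d}/{page_url}"
        current += timedelta(weeks=1)


# Hypothetical page and date range, for illustration only
urls = list(
    weekly_wayback_urls("https://example.com/", date(2005, 3, 13), date(2005, 4, 3))
)
```

The resulting list can be handed to a spider as its start URLs; each entry relies on the redirect being followed to reach a real capture.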

I have a Scrapy spider that extracts the data. I have already used it successfully on the current versions of the pages, so I know it works and produces the correct output. But when I run it on the backed-up versions, I get the following output, which reports that the page is redirecting, and no data is returned:

2023-05-04 20:18:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-04 20:18:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-04 20:18:33 [scrapy.core.engine] INFO: Spider opened
2023-05-04 20:18:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-04 20:18:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-04 20:18:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://web.archive.org/web/20200204105913/<some_url>> from <GET https://web.archive.org/web/20050313/<some_url>>

I've looked at other questions of a similar nature, and I understand that I need to do something with the middleware, but those questions were about preventing redirects, while I want the exact opposite.

How do I allow scrapy to follow redirects?

Solution

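Following redirects is the job of the RedirectMiddleware, described in the documentation link @beer provided. It is enabled by default, so it only needs re-enabling if your settings turn it off (for example with REDIRECT_ENABLED = False).

However, note this passage from the documentation: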

For example, if you want the redirect middleware to ignore 301 and 302 responses (and pass them through to your spider) you can do this:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [301, 302]

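The handle_httpstatus_list attribute makes the RedirectMiddleware skip the listed HTTP statuses and pass those responses straight to your spider, unredirected. If your spider sets it, try running without it (or at least without 301 and 302 in the list) so that the middleware can follow the redirect.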

Answered By - Pierre couy

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
