Issue
I am trying to scrape data from historical versions of web pages as backed up by the Wayback Machine.
I have thousands of pages that need scraping, and I don't want to go to the trouble of finding the exact dates and times of the available backups for each of them. I just want weekly historical data, or the nearest available capture.
What I know is that if I put a date in a link here:
https://web.archive.org/web/<some_date>/<some_url>
then Wayback Machine will automatically redirect to the closest available capture. This will work fine in my scenario.
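As a rough sketch of how the weekly URLs could be generated, assuming the date part is written as YYYYMMDD (as in the log further down), something like the following might work. The function name weekly_wayback_urls and the example.com URL are placeholders for illustration, not part of the original setup.

```python
from datetime import date, timedelta

# URL prefix taken from the pattern in the question.
WAYBACK_PREFIX = "https://web.archive.org/web"


def weekly_wayback_urls(page_url, start, end):
    """Yield one Wayback Machine URL per week from start to end (inclusive).

    Hypothetical helper: each URL asks for the capture closest to that date.
    """
    current = start
    while current <= end:
        # Partial timestamp in YYYYMMDD form, e.g. 20050313.
        yield f"{WAYBACK_PREFIX}/{current:%Y%m%d}/{page_url}"
        current += timedelta(weeks=1)


# Example: three weekly requests for a placeholder page.
urls = list(
    weekly_wayback_urls("http://example.com/", date(2005, 3, 13), date(2005, 3, 27))
)
for url in urls:
    print(url)
```

The resulting list could then be used as the spider's start URLs, leaving it to the Wayback Machine to redirect each request to the nearest capture.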
I have a scrapy spider that extracts the data. I have already used it successfully on the current versions of the web pages, so I know that it works and produces the correct output. But when I run scrapy on the backed-up versions of the pages, I get the following output, which reports that the page is redirecting, and no data is returned:
2023-05-04 20:18:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-05-04 20:18:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-05-04 20:18:33 [scrapy.core.engine] INFO: Spider opened
2023-05-04 20:18:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-05-04 20:18:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-05-04 20:18:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://web.archive.org/web/20200204105913/<some_url>> from <GET https://web.archive.org/web/20050313/<some_url>>
I've looked at other questions of similar nature and I understand I need to do something with the middleware, but those other questions were more about not allowing redirects, while I want the exact opposite.
How do I allow scrapy to follow redirects?
Solution
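From the documentation link @beer provided, you need the RedirectMiddleware to be enabled (it is one of Scrapy's default downloader middlewares, so it is active unless it has been switched off).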
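A minimal settings.py fragment along these lines should keep the middleware active. REDIRECT_ENABLED and REDIRECT_MAX_TIMES are standard Scrapy settings, and the values shown are their defaults, so this only matters if the project has overridden them somewhere.

```python
# settings.py (fragment)

# Keep the redirect middleware switched on (True is Scrapy's default).
REDIRECT_ENABLED = True

# Maximum number of redirects followed for a single request (default is 20).
REDIRECT_MAX_TIMES = 20
```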
However, note this passage from the documentation:
For example, if you want the redirect middleware to ignore 301 and 302 responses (and pass them through to your spider) you can do this:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [301, 302]
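This attribute makes the RedirectMiddleware skip the listed HTTP statuses: those responses are handed to your spider as-is instead of being followed. If your spider sets handle_httpstatus_list, try removing it so that the middleware can follow the 302 redirect.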
Answered By - Pierre couy