Issue
I am using the latest Scrapy version, v1.3.
I crawl a website page by page by following the URLs in its pagination. On some pages the website detects that I am a bot and returns an error inside the HTML. Since the request itself is successful, the page gets cached, and when I run the spider again I get the same error back from the cache.
What I need is a way to prevent such a page from getting into the cache. Or, if I cannot do that, I need to remove it from the cache after I detect the error in the parse method; then I can retry and get the correct page.
I have a partial solution: I yield all requests with "dont_cache": False in meta so I make sure they use the cache, and where I detect the error and retry the request, I add dont_filter=True along with "dont_cache": True to make sure I get a fresh copy of the erroneous URL.
from scrapy import Request
from scrapy.selector import Selector

def parse(self, response):
    page = response.meta["page"] + 1
    html = Selector(response)
    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        # bot-detection page: retry the same URL, this time bypassing the cache
        page = page - 1
        yield Request(url=response.url, callback=self.parse,
                      meta={"page": page, "dont_cache": True}, dont_filter=True)
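For reference, this is roughly how I yield the pagination requests in the first place; the next-page selector below is a placeholder, not the one from my real spider:

# inside parse(), after the check above -- the CSS selector is a placeholder
next_url = response.css('li.next a::attr(href)').extract_first()
if next_url is not None:
    # "dont_cache": False -> let the HTTP cache store and reuse normal pages
    yield Request(url=response.urljoin(next_url), callback=self.parse,
                  meta={"page": page, "dont_cache": False})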
I also tried a custom retry middleware, where I managed to get it running before the cache, but I couldn't read response.body successfully. I suspect it is still compressed somehow, as it is binary data.
import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.selector import Selector

logger = logging.getLogger(__name__)


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        # dump the raw body for inspection -- this is where I only see binary data
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        html = Selector(text=response.body)
        url = response.url
        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            logger.info("Automated process error: %s", url)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response
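I enable the middleware in settings.py roughly like this; the module path and order number below are illustrative, not copied from my project. As far as I can tell, any order number above HttpCompressionMiddleware (590) means process_response sees the body before it has been decompressed, which would explain the binary data:

# settings.py -- path and order number are examples only
DOWNLOADER_MIDDLEWARES = {
    # > 900 puts process_response ahead of HttpCacheMiddleware (900),
    # but also ahead of HttpCompressionMiddleware (590), so the body is still gzipped here
    'myproject.middlewares.CustomRetryMiddleware': 1000,
}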
Any suggestion is appreciated.
Thanks
Mehmet
Solution
Thanks to mizhgun, I managed to develop a solution using a custom cache policy.
Here is what I did:
from scrapy.utils.httpobj import urlparse_cached


class CustomPolicy(object):

    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        # force a re-download when the spider asks for a refresh
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        # same check when revalidating, so a fresh copy replaces the cached one
        if "refresh_cache" in request.meta:
            return False
        return True
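To make Scrapy pick up this policy, point HTTPCACHE_POLICY at the class in settings.py; the module path below is just where I keep it in my project, so adjust it to yours:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.policies.CustomPolicy'  # adjust the path to your project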
And this is where I catch the error (after caching has already occurred, of course):
def parse(self, response):
    html = Selector(response)
    counttext = html.css('selector').extract_first()
    if counttext is None:
        yield Request(url=response.url, callback=self.parse,
                      meta={"refresh_cache": True}, dont_filter=True)
When you add refresh_cache to meta, it can be caught in the custom policy class.
Don't forget to add dont_filter=True, otherwise the second request will be filtered out as a duplicate.
Answered By - Mehmet Kurtipek