Issue
I tried running this code several times using PyCharm, but it just won't work. The Scrapy Request callback is never called and nothing gets printed out. Does anyone have an idea what's causing the bug?
import scrapy


class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    allowed_domains = ["hemnet.se/"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    def parse(self, response):
        for links in response.css('ul.normal-results > li.normal-results__hit > a::attr("href")'):
            yield scrapy.Request(url=links.get(), callback=self.parseInnerPage)

    def parseInnerPage(self, response):
        print(response.text)
Solution
The issue is caused by the value in your allowed_domains attribute. This is apparent from the log output Scrapy produces while running your spider. For example, when I run your spider it shows:
2023-04-29 20:17:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hemnet.se/bostader?location_ids%5B%5D=17759> (referer: None)
2023-04-29 20:17:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hemnet.se': <GET https://www.hemnet.se/bostad/lagenhet-3rum-stoten-malung-salens-kommun-nordklint-57-18902155>
2023-04-29 20:17:37 [scrapy.core.engine] INFO: Closing spider (finished)
What the above is saying is that the initial page was crawled successfully, but every request yielded for the collected links was filtered out as offsite, because none of those URLs match any of the domains listed in your allowed_domains attribute. The entry "hemnet.se/" has a trailing slash, so it is not a valid domain name and can never match the request host www.hemnet.se.

This can be solved by either removing the allowed_domains attribute entirely, or editing it to ["www.hemnet.se"]. For example:
class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    allowed_domains = ["www.hemnet.se"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    ...
After making the above change and running the spider, the output prints the full HTML of each listing page, as expected.
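As a side note, here is a minimal sketch of an equivalent spider using response.follow instead of constructing scrapy.Request objects by hand; response.follow resolves relative URLs against the current page, so it also works if the href values are not absolute. The selector is carried over from the question, and the snake_case method name parse_inner_page is my own renaming, not from the original code.

import scrapy


class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    # A bare domain (no trailing slash) also matches subdomains such as www.hemnet.se
    allowed_domains = ["hemnet.se"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    def parse(self, response):
        for link in response.css('ul.normal-results > li.normal-results__hit > a::attr("href")'):
            # response.follow accepts relative or absolute URLs and builds the Request for us
            yield response.follow(link.get(), callback=self.parse_inner_page)

    def parse_inner_page(self, response):
        print(response.text)

Either spelling of allowed_domains works here; the important part is that each entry is a plain domain name, not a URL fragment with a trailing slash.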
Answered By - Alexander