Issue
I tried running this code several times using PyCharm, but it just won't work. The Scrapy Request callback is never called and nothing gets printed out. Does anyone have an idea what's causing the bug?
import scrapy


class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    allowed_domains = ["hemnet.se/"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    def parse(self, response):
        for links in response.css('ul.normal-results > li.normal-results__hit > a::attr("href")'):
            yield scrapy.Request(url=links.get(), callback=self.parseInnerPage)

    def parseInnerPage(self, response):
        print(response.text)
Solution
The issue is caused by the value in your allowed_domains attribute. This is apparent from the log output Scrapy produces while running your spider. For example, when I run your spider it shows:
2023-04-29 20:17:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.hemnet.se/bostader?location_ids%5B%5D=17759> (referer: None)
2023-04-29 20:17:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hemnet.se': <GET https://www.hemnet.se/bostad/lagenhet-3rum-stoten-malung-salens-kommun-nordklint-57-18902155>
2023-04-29 20:17:37 [scrapy.core.engine] INFO: Closing spider (finished)
What the above is saying is that the initial page was crawled successfully, but every request yielded for the collected links was filtered out as offsite, because none of those URLs match any of the domains listed in your allowed_domains attribute. The entry "hemnet.se/" has a trailing slash, so it is not a valid domain name and can never match the request host www.hemnet.se.

This can be solved by either removing the allowed_domains attribute entirely, or editing it to ["www.hemnet.se"]. For example:
class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    allowed_domains = ["www.hemnet.se"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    ...
After making the above change and running the spider, the output prints the full HTML of each listing page, as expected.
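As a side note, here is a minimal sketch of an equivalent spider using response.follow instead of constructing scrapy.Request objects by hand; response.follow resolves relative URLs against the current page, so it also works if the href values are not absolute. The selector is carried over from the question, and the snake_case method name parse_inner_page is my own renaming, not from the original code.

import scrapy


class HemnetSpider(scrapy.Spider):
    name = "hemnet"
    # A bare domain (no trailing slash) also matches subdomains such as www.hemnet.se
    allowed_domains = ["hemnet.se"]
    start_urls = ["https://www.hemnet.se/bostader?location_ids%5B%5D=17759"]

    def parse(self, response):
        for link in response.css('ul.normal-results > li.normal-results__hit > a::attr("href")'):
            # response.follow accepts relative or absolute URLs and builds the Request for us
            yield response.follow(link.get(), callback=self.parse_inner_page)

    def parse_inner_page(self, response):
        print(response.text)

Either spelling of allowed_domains works here; the important part is that each entry is a plain domain name, not a URL fragment with a trailing slash.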
Answered By - Alexander