Friday, January 26, 2024

[FIXED] attribute error during recursive scraping with scrapy

Labels: python, scrapy, web-scraping

Issue

I have a Scrapy spider that works well as long as I give it a page containing the links to the pages it should scrape. Now, instead of giving it every category page, I want to give it the single page that links to all categories. I thought I could simply add another parse function to achieve this.

But the console output gives me an attribute error:

AttributeError: 'zaubersonder' object has no attribute 'parsedetails'

This tells me that some attribute reference is not working correctly. I am new to object orientation, but I thought Scrapy calls parse, which calls parse_level2, which in turn calls parse_details, and that this should work fine.

Below is my effort so far.

import scrapy


class zaubersonder(scrapy.Spider):
    name = 'zaubersonder'
    allowed_domains = ['abc.de']
    start_urls = ['http://www.abc.de/index.php/rgergegregre.html']

    def parse(self, response):
        # Level 1: the start page links to the categories.
        urls = response.css('a.ulSubMenu::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)  # make relative links absolute
            yield scrapy.Request(url=url, callback=self.parse_level2)

    def parse_level2(self, response):
        # Level 2: each category page links to the entries.
        urls2 = response.css('a.ulSubMenu::attr(href)').extract()
        for url2 in urls2:
            url2 = response.urljoin(url2)
            yield scrapy.Request(url=url2, callback=self.parse_details)

    def parse_details(self, response):
        # Level 3: extract the title and content of a single entry.
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.block').extract()
                       + response.css('div.ce_text.last.block').extract(),
        }

Edit: I fixed the code above in case someone searches for it.

Solution

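There is a typo in the code. The callback in parse_level2 is self.parsedetails, but the method is named parse_details. Python looks up self.parsedetails as soon as the Request is built, finds no attribute with that name on the spider, and raises the AttributeError.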

Just change the yield in parse_level2 to:

yield scrapy.Request(url=url2,callback=self.parse_details)

...and it should work.
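
To see why the misspelled name fails before any page is fetched, here is a minimal sketch in plain Python without Scrapy. DemoSpider and its dictionary "requests" are hypothetical stand-ins invented for this illustration; only the method names parse_level2 and parse_details and the misspelling parsedetails come from the question.

```python
class DemoSpider:
    """Hypothetical stand-in for the spider; this is not Scrapy's base class."""

    def parse_level2_typo(self, url):
        # self.parsedetails is looked up right here, while the callback
        # argument is being evaluated, so the error is raised immediately.
        return {"url": url, "callback": self.parsedetails}

    def parse_level2(self, url):
        # Correct spelling: the bound method exists and can be stored.
        return {"url": url, "callback": self.parse_details}

    def parse_details(self, response):
        return {"Titel": response}


spider = DemoSpider()

try:
    spider.parse_level2_typo("http://www.abc.de/")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# With the name spelled correctly, the stored callback can be called later.
request = spider.parse_level2("http://www.abc.de/")
result = request["callback"]("example response")
print(result)
```

The lookup fails inside the method that builds the request, which is why the traceback points at parse_level2 even though the missing name refers to parse_details.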

Answered By - Tor Stava

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
