Issue
I need all internal links from all pages of the website for analysis. I have searched and found many similar questions, and this code by Mithu gives the closest answer. However, it does not collect links beyond the second level of page depth: the generated CSV file has only 676 records, while the website has 1,000.
import csv  # csv.writer with newline='' avoids blank-line gaps in the output file
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from eylinks.items import LinkscrawlItem

# Module-level CSV output shared by all spider callbacks.
outfile = open("data.csv", "w", newline='')
writer = csv.writer(outfile)


class ToscrapeSpider(scrapy.Spider):
    """Crawl books.toscrape.com and record every product on every page.

    NOTE(review): the original declared a ``rules`` attribute, but ``rules``
    is only honoured by ``CrawlSpider`` — on a plain ``scrapy.Spider`` it is
    silently ignored, so the crawl stopped after one level of links (hence
    676 of 1000 records). Instead, links are now followed recursively from
    ``collect_data``; Scrapy's built-in duplicate filter prevents loops.
    """

    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    # Built once at class level; restricts the crawl to the target domain.
    link_extractor = LinkExtractor(allow_domains='toscrape.com')

    def parse(self, response):
        # Follow every in-domain link found on this page.
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.collect_data)

    def collect_data(self, response):
        # Scrape every product listed on this page.
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first()
            value = item.css('.price_color::text').extract_first()
            lnk = response.url
            stats = response.status
            print(lnk)
            yield {'Name': product, 'Price': value, "URL": lnk, "Status": stats}
            writer.writerow([product, value, lnk, stats])
        # Recurse: also follow links found on this page so that pages at
        # every depth are reached, not just those linked from the start URL.
        yield from self.parse(response)
Solution
To extract all the links, try this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import csv

# Module-level CSV output; newline='' prevents blank rows on Windows.
outfile = open("data.csv", "w", newline='')
writer = csv.writer(outfile)


class BooksScrapySpider(scrapy.Spider):
    """Scrape title/price/url/status for every book on books.toscrape.com."""

    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        """Follow each book-detail link, then paginate via the 'next' link."""
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)
        next_page_url = response.xpath(
            "//a[text()='next']/@href").extract_first()
        # On the last catalogue page there is no 'next' link and
        # extract_first() returns None; urljoin(None) would just re-yield
        # the current page as a duplicate request, so guard against it.
        if next_page_url:
            yield Request(response.urljoin(next_page_url))

    def parse_book(self, response):
        """Extract one book's fields, yield them, and append them to the CSV."""
        title = response.css("h1::text").extract_first()
        price = response.xpath(
            "//*[@class='price_color']/text()").extract_first()
        url = response.request.url
        yield {'title': title,
               'price': price,
               'url': url,
               'status': response.status}
        writer.writerow([title, price, url, response.status])
Answered By - zafiron
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.