Issue
Is it possible for Scrapy to print results in real time? I'm planning to crawl large sites and fear that if my VPN connection cuts off, the crawl effort will be wasted because nothing will have been printed yet.
I'm currently using a VPN with rotating user agents. I know it would be ideal to use rotating proxies instead of a VPN, but that will be a future upgrade to the script.
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

results = open('results.csv', 'w')

class TestSpider(CrawlSpider):
    name = "test"

    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            print(response.url, '>', pattern, '>', len(result), file=results)
Many thanks in advance.
Updates
The script from harada works perfectly without any changes at all, apart from the save file. All I needed to do was make the modifications below to my existing files in order for everything to work.
spider - defined items
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"

    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            # create a fresh item for each pattern; reusing one instance would
            # make every copy buffered by the pipeline point at the same object
            item = TestItem()
            item['url'] = response.url
            item['pattern'] = pattern
            item['count'] = len(result)
            yield item
items.py - added items as fields
import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()
settings.py - uncommented ITEM_PIPELINES
ITEM_PIPELINES = {
    'test.pipelines.TestPipeline': 300,
}
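pipelines.py - the update doesn't show this file, but the setting above assumes a TestPipeline class exists in test/pipelines.py; presumably it is harada's pipeline from the Solution below, renamed to match. For reference, a minimal per-item sketch (an assumption, not necessarily the exact file the update used) that writes each item to results.csv as soon as it arrives could look like this:

import csv

class TestPipeline:
    def open_spider(self, spider):
        # append mode so rows already on disk survive a restart mid-crawl
        self.file = open('results.csv', 'a', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['url'], item['pattern'], item['count']])
        self.file.flush()  # push each row to disk immediately
        return item

    def close_spider(self, spider):
        self.file.close()

Flushing after every row trades a little speed for durability, which matches the original goal of not losing results when the connection drops.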
Solution
You can add logic to your pipeline that saves the data you have at that time to a file. Add a counter to the pipeline as an instance variable, and when it reaches a certain threshold (say, every 1000 items yielded), write to a file. The code would look something like this; I tried to make it as general as possible.
class MyPipeline:
    def __init__(self):
        # variable that keeps track of the total number of items yielded
        self.total_count = 0
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # write to your file of choice....
            # I'm not sure how your data is stored throughout the crawling process
            # If it's a variable of the pipeline like self.data,
            # then just write that to the file
            with open("test.txt", "w") as myfile:
                myfile.write(f'{self.data}')
        return item
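Two caveats with this sketch: opening the file with mode "w" rewrites the entire self.data list on every periodic write, and anything buffered after the last multiple of 1000 is lost if the crawl dies before the next write. A variant that appends only the new items and flushes the remainder when the spider closes (file name and threshold kept from above) might look like this:

class MyPipeline:
    def __init__(self):
        self.total_count = 0  # total number of items yielded so far
        self.data = []        # items buffered since the last write

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # append only the buffered items, then clear the buffer
            # so nothing is written twice and the file only grows
            with open("test.txt", "a") as myfile:
                for buffered in self.data:
                    myfile.write(f'{buffered}\n')
            self.data = []
        return item

    def close_spider(self, spider):
        # flush whatever is still buffered when the crawl ends
        with open("test.txt", "a") as myfile:
            for buffered in self.data:
                myfile.write(f'{buffered}\n')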
Answered By - harada