Issue
Is it possible for Scrapy to print results in real time? I'm planning to crawl large sites and fear that if my VPN connection cuts off, the crawl effort will be wasted because nothing will have been printed yet.
I'm currently using a VPN with rotating user agents. I know it would be ideal to use rotating proxies instead of a VPN, but that will be a future upgrade to the script.
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

results = open('results.csv', 'w')

class TestSpider(CrawlSpider):
    name = "test"

    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            print(response.url, '>', pattern, '>', len(result), file=results)
Many thanks in advance.
Updates
The script from harada works perfectly without any changes at all, apart from the save file. All I needed to do was make the modifications below to my existing files in order for everything to work.
spider - defined items
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"

    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9', '10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)
            # create a fresh item for each pattern; reusing one instance would
            # make every copy buffered by the pipeline point at the same object
            item = TestItem()
            item['url'] = response.url
            item['pattern'] = pattern
            item['count'] = len(result)
            yield item
items.py - added items as fields
import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()
settings.py - uncommented ITEM_PIPELINES
ITEM_PIPELINES = {
    'test.pipelines.TestPipeline': 300,
}
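pipelines.py - the update doesn't show this file, but the setting above assumes a TestPipeline class exists in test/pipelines.py; presumably it is harada's pipeline from the Solution below, renamed to match. For reference, a minimal per-item sketch (an assumption, not necessarily the exact file the update used) that writes each item to results.csv as soon as it arrives could look like this:

import csv

class TestPipeline:
    def open_spider(self, spider):
        # append mode so rows already on disk survive a restart mid-crawl
        self.file = open('results.csv', 'a', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['url'], item['pattern'], item['count']])
        self.file.flush()  # push each row to disk immediately
        return item

    def close_spider(self, spider):
        self.file.close()

Flushing after every row trades a little speed for durability, which matches the original goal of not losing results when the connection drops.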
Solution
You can add logic to your pipeline that saves the data you have at that time to a file. Add a counter to the pipeline as an instance variable, and when it reaches a certain threshold (say, every 1000 items yielded), write to a file. The code would look something like this; I tried to make it as general as possible.
class MyPipeline:
    def __init__(self):
        # variable that keeps track of the total number of items yielded
        self.total_count = 0
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # write to your file of choice....
            # I'm not sure how your data is stored throughout the crawling process
            # If it's a variable of the pipeline like self.data,
            # then just write that to the file
            with open("test.txt", "w") as myfile:
                myfile.write(f'{self.data}')
        return item
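Two caveats with this sketch: opening the file with mode "w" rewrites the entire self.data list on every periodic write, and anything buffered after the last multiple of 1000 is lost if the crawl dies before the next write. A variant that appends only the new items and flushes the remainder when the spider closes (file name and threshold kept from above) might look like this:

class MyPipeline:
    def __init__(self):
        self.total_count = 0  # total number of items yielded so far
        self.data = []        # items buffered since the last write

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # append only the buffered items, then clear the buffer
            # so nothing is written twice and the file only grows
            with open("test.txt", "a") as myfile:
                for buffered in self.data:
                    myfile.write(f'{buffered}\n')
            self.data = []
        return item

    def close_spider(self, spider):
        # flush whatever is still buffered when the crawl ends
        with open("test.txt", "a") as myfile:
            for buffered in self.data:
                myfile.write(f'{buffered}\n')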
Answered By - harada