Saturday, April 9, 2022

[FIXED] Keeping streams of data separate using one Scrapy spider

Labels: python, scrapy, web-scraping

Issue

I want to scrape data from three different categories of contracts: goods, services, and construction.

Because each type of contract can be parsed with the same method, my goal is to use a single spider, start it on three different URLs, and then extract the data in three distinct streams that can be saved to different places.

My understanding is that simply listing all three URLs in start_urls will produce one combined output.

My spider inherits from Scrapy's CrawlSpider class.

Let me know if you need further information.

Solution

I would suggest tackling this problem from another angle. Scrapy lets you pass arguments to a spider from the command line with the -a option, like so:

scrapy crawl CanCrawler -a contract=goods

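Each -a name=value pair is passed to the spider's initializer as a keyword argument, so you just need to declare the arguments you reference there:

import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Keep the value passed with -a so the parsing code can refer to it.
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...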
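As a minimal sketch of how the contract argument could drive the crawl, the helpers below pick a start URL for one category and label each scraped item with it. The `contract` query parameter, the `start_urls_for` and `tag_item` names, and the item fields are all hypothetical placeholders for illustration; the real site may select categories in a different way.

```python
from urllib.parse import urlencode

# Base URL taken from the answer above.
BASE_URL = "https://buyandsell.gc.ca/procurement-data/search/site"

# The three categories named in the question.
CONTRACT_TYPES = ("goods", "services", "construction")


def start_urls_for(contract):
    """Return the start URL list for one contract category.

    ASSUMPTION: the 'contract' query parameter is a hypothetical placeholder,
    not a documented parameter of the site.
    """
    if contract not in CONTRACT_TYPES:
        raise ValueError(f"unknown contract type: {contract!r}")
    return [f"{BASE_URL}?{urlencode({'contract': contract})}"]


def tag_item(item, contract):
    """Return a copy of a scraped item labelled with its contract category."""
    return {**item, "contract": contract}
```

Inside the initializer you could then write `self.start_urls = start_urls_for(contract)`, and a parse callback could yield `tag_item(item, self.contract)` so that every record carries the category it came from.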

You might also consider accepting several arguments, so that the spider can start on a website's homepage and use the arguments to navigate to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, for example, you could have two command line arguments:

    scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods

so you'd get

import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, procure='', contract='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Keep both -a values so the callbacks can decide where to navigate.
        self.procure = procure
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...

and then, depending on which arguments you passed, your crawler can follow the corresponding options on the website to reach the data that you want to crawl. Please also see here. I hope this helps!
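To tie this back to the question's three separate streams, one option is to run the spider once per category and give each run its own output file. The sketch below only builds the command lines; it assumes the spider is named CanCrawler as above, that it is run from inside the Scrapy project directory, and that JSON Lines files named after each category are acceptable (the file names and the `build_commands` and `run_all` helpers are hypothetical).

```python
import subprocess

# The three categories named in the question.
CONTRACT_TYPES = ("goods", "services", "construction")


def build_commands(spider_name="CanCrawler", contracts=CONTRACT_TYPES):
    """Build one 'scrapy crawl' command per contract category.

    Each command passes the category with -a and writes to its own
    JSON Lines file with -o, so the three outputs never mix.
    """
    commands = []
    for contract in contracts:
        commands.append([
            "scrapy", "crawl", spider_name,
            "-a", f"contract={contract}",
            "-o", f"{contract}.jl",
        ])
    return commands


def run_all(commands):
    """Run each crawl in turn, stopping if one exits with an error."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_all(build_commands())` would then start three crawls one after another. Note that -o appends to an existing file, which suits the JSON Lines format but means old output should be removed before a fresh run.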

Answered By - Некто

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
