Issue
I want to scrape data from three different categories of contracts: goods, services, and construction.
Because each type of contract can be parsed with the same method, my goal is to use a single spider, start the spider on three different urls, and then extract data in three distinct streams that can be saved to different places.
My understanding is that just listing all three urls as start_urls will lead to one combined output of data.
My spider inherits from Scrapy's CrawlSpider class.
Let me know if you need further information.
Solution
I would suggest that you tackle this problem from another angle. In Scrapy it is possible to pass arguments to the spider from the command line using the -a option, like so:
scrapy crawl CanCrawler -a contract=goods
You just need to accept the arguments you reference in your class initializer and store them on the spider:
import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        # Keep the value passed with -a contract=... for use in callbacks
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
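To get the three separate outputs the question asks for, one option is to run the spider once per contract type and give each run its own output file with Scrapy's -o option. The following is only a minimal sketch: the helper names build_crawl_command and run_all and the <contract>.json file naming are assumptions, not something from the original answer.

```python
import subprocess

# The three categories named in the question
CONTRACT_TYPES = ["goods", "services", "construction"]


def build_crawl_command(contract, spider="CanCrawler"):
    """Build the argument list for one crawl.

    Items from this run are written to <contract>.json (assumed naming).
    """
    return [
        "scrapy", "crawl", spider,
        "-a", f"contract={contract}",
        "-o", f"{contract}.json",
    ]


def run_all(contracts=CONTRACT_TYPES):
    """Run one crawl per contract type, one after another.

    Needs the scrapy command on PATH and must be called from
    inside the Scrapy project directory.
    """
    for contract in contracts:
        subprocess.run(build_crawl_command(contract), check=True)

# Usage (inside a Scrapy project): run_all()
```

Each run then only sees its own contract value, so the three data streams never mix.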
You might also consider adding multiple arguments, so that the spider starts on the homepage of a website and uses the arguments to navigate to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, you could for example have two command line arguments:
scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods
so you'd get
class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        # Keep both -a values so callbacks can decide where to go
        self.procure = procure
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
and then, depending on which arguments you passed, your crawler can follow the links for those options on the website to reach the data that you want to crawl. I hope this helps!
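Since spider arguments arrive as plain strings, it may be worth checking them before the crawl starts. Here is a small sketch of such a check; the function name normalize_contract is hypothetical, and the allowed values are simply the three categories from the question.

```python
# Categories from the question; extend as needed
ALLOWED_CONTRACTS = {"goods", "services", "construction"}


def normalize_contract(value):
    """Trim and lower-case a -a contract=... value, rejecting unknown ones."""
    cleaned = (value or "").strip().lower()
    if cleaned not in ALLOWED_CONTRACTS:
        raise ValueError(f"unknown contract type: {value!r}")
    return cleaned
```

Calling this from __init__ (for example self.contract = normalize_contract(contract)) would make a mistyped argument fail immediately instead of producing an empty crawl.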
Answered By - Некто