Issue
I have a loop that creates links I want to scrape:
from datetime import date, timedelta

start_date = date(2020, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []

# Generate the links
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
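For reference, str() on a date object gives the ISO format (YYYY-MM-DD), so each generated link ends in a date string. A minimal standalone check of the first few links (using the same base URL, shortened to three days for illustration):

```python
from datetime import date, timedelta

start_date = date(2020, 1, 1)
links = [
    f"https://www.racingpost.com/results/{start_date + timedelta(days=n)}"
    for n in range(3)
]
print(links[0])  # https://www.racingpost.com/results/2020-01-01
```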
If I print "links", it works fine and I get the URLs I want.
Then I have a spider that scrapes the site just as well if I put in a URL manually. Now I tried to pass the "links" variable containing the URLs I want to scrape, as below, but I get an "undefined variable" error back.
import scrapy
from scrapy_splash import SplashRequest

class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
    function main(splash, args)
        url = args.url
        assert(splash:go(url))
        return splash:html()
    end
    '''

    def start_requests(self):
        yield SplashRequest(url=links, callback=self.parse, endpoint='execute',
                            args={
                                'lua_source': self.script
                            })

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
How do I pass the generated links into SplashRequest(url=links, ...)?
Thanks so much for helping me out - I am still new to this and making small steps - most of them backward...
Solution
From my comment above (I'm not quite sure this works because I'm unfamiliar with Scrapy): the obvious problem is that there is no reference to the links variable inside the RpresultSpider class, so Python reports it as undefined. Moving the loop that generates the URLs into start_requests fixes the scoping issue. Note also that SplashRequest expects a single URL string, not a list, so you should yield one request per link.
from datetime import date, timedelta

import scrapy
from scrapy_splash import SplashRequest

class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
    function main(splash, args)
        url = args.url
        assert(splash:go(url))
        return splash:html()
    end
    '''

    def start_requests(self):
        start_date = date(2020, 1, 1)
        end_date = date.today()
        crawl_date = start_date
        base_url = "https://www.racingpost.com/results/"
        links = []
        # Generate the links
        while crawl_date <= end_date:
            links.append(base_url + str(crawl_date))
            crawl_date += timedelta(days=1)
        # Yield one SplashRequest per generated link
        for link in links:
            yield SplashRequest(url=link, callback=self.parse, endpoint='execute',
                                args={
                                    'lua_source': self.script
                                })

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
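As a side note, the link generation can be factored into a small helper that is easy to test on its own, independent of Scrapy. This is a sketch under the same assumptions as the spider above (the function name generate_result_links is mine, not from the original post):

```python
from datetime import date, timedelta

def generate_result_links(start, end, base_url="https://www.racingpost.com/results/"):
    """Yield one results URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield base_url + current.isoformat()
        current += timedelta(days=1)

urls = list(generate_result_links(date(2020, 1, 1), date(2020, 1, 3)))
print(len(urls))  # 3
```

In start_requests you would then write `for link in generate_result_links(...): yield SplashRequest(url=link, ...)`.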
Answered By - dvr