Issue
I am trying to crawl a COVID-19 statistics website whose root page has a bunch of links to pages with the statistics for individual countries. The links all share a class name ('mt_a'), which makes them easy to select with CSS selectors. There is no continuity between the countries: if you are on the page for one of them, there is no link to the next country. I am a complete beginner with Scrapy, and I'm not sure what I should do if my goal is to scrape all of the (200 or so) links listed on the root page for the same few pieces of information. Any guidance on what I should be trying to do would be appreciated.
The link I'm trying to scrape: https://www.worldometers.info/coronavirus/ (scroll down to see country links)
Solution
What I would do is create two spiders. The first would parse the home page and extract the href of every anchor tag that links to a country page, e.g. href="country/us/", and then build full URLs from these relative links so that you get a proper URL like https://www.worldometers.info/coronavirus/country/us/.
The second spider is then given the list of all country URLs, crawls each individual page, and extracts the information from it.
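As a rough sketch of the first step (collect the relative links, then turn them into full URLs), here is a standard-library version that runs without any network access. The sample HTML fragment and the names CountryLinkParser and country_urls are made up for illustration; the real page markup may differ. Inside a Scrapy callback the same idea would normally be written with response.css('a.mt_a::attr(href)').getall() and response.urljoin(href).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ROOT_URL = "https://www.worldometers.info/coronavirus/"


class CountryLinkParser(HTMLParser):
    """Collects href values from <a> tags that carry the 'mt_a' class."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "mt_a" in classes and href:
            self.hrefs.append(href)


def country_urls(html, base_url=ROOT_URL):
    """Return absolute country URLs, de-duplicated, in page order."""
    parser = CountryLinkParser()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # relative link -> full URL
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical fragment standing in for the root page (assumption, not the real markup).
SAMPLE_HTML = """
<table>
  <tr><td><a class="mt_a" href="country/us/">USA</a></td></tr>
  <tr><td><a class="mt_a" href="country/russia/">Russia</a></td></tr>
  <tr><td><a href="/about/">About</a></td></tr>
</table>
"""

urls = country_urls(SAMPLE_HTML)
print(urls)
```

Note that urljoin only appends the relative path this way because the base URL ends with a slash; without it, the last path segment of the base would be replaced.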
For example, you get a list of urls from the first spider:
urls = ['https://www.worldometers.info/coronavirus/country/us/',
'https://www.worldometers.info/coronavirus/country/russia/']
Then, in the second spider, you assign that list to the start_urls attribute.
Answered By - NotAName