Issue
I have this webpage (https://academic.oup.com/plphys/search-results?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1) from which I want to extract information such as the title, name, DOI, etc. I can do this easily for the first page, but I am not able to crawl through the remaining pages. The code I have is:
import scrapy


class PhotosynSpiderSpider(scrapy.Spider):
    name = 'photosyn_spider'
    allowed_domains = ['https://academic.oup.com/plphys']
    start_urls = ['https://academic.oup.com/plphys/search-results?q=photosynthesis&allJournals=1&fl_SiteID=6323']

    def parse(self, response):
        # Step 1: Locate the first page in div class 'pageNumbers al-pageNumbers'
        page_numbers = response.css('div.pageNumbers.al-pageNumbers')
        current_page = page_numbers.css('span.current-page::text').get()
        total_pages = page_numbers.css('span.total-pages::text').get()
        # Step 2: Locate link in a class 'al-citation-list', and extract all the href for doi in the element 'a'
        citation_list = response.css('a.al-citation-list')
        dois = citation_list.css('a::attr(href)').getall()
        for doi in dois:
            yield {'doi': doi}
        # Step 3: Open url for the next page in the element 'a' and class 'sr-nav-next al-nav-next' and repeat step 2
        if current_page != total_pages:
            next_page_url = response.css('a.sr-nav-next.al-nav-next::attr(href)').get()
            yield scrapy.Request(next_page_url, callback=self.parse)
I am trying to dump the result into a JSON file. However, the result is empty. Can anyone help me with this? Thanks.
Solution
If you inspect the Next button element, you will see that its href attribute isn't an actual URL:
<a role="button" aria-label="Next" href="javascript:;" class="sr-nav-next al-nav-next" data-url="q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2" data-google-interstitial="false">
Next
</a>
This is because clicking the Next button doesn't actually take you to a new page. Instead, JavaScript makes an AJAX call and swaps out the contents of the articles section.
By requesting the URL used in that AJAX call directly and following its pattern (only the page parameter changes), we can get the results from all subsequent pages.
For example:
import scrapy


class PhotosynSpiderSpider(scrapy.Spider):
    name = 'photosyn_spider'

    def start_requests(self):
        # URL requested by the AJAX call; only the trailing page number changes.
        ajax_url = 'https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page='
        # range(1, 50) requests pages 1 through 49; adjust the upper bound as needed.
        for i in range(1, 50):
            yield scrapy.Request(ajax_url + str(i))

    def parse(self, response):
        # Each search result sits in its own article box.
        for row in response.css("div.sr-list.al-article-box.al-normal.clearfix"):
            doi = row.xpath(".//div[@class='al-citation-list']//a/@href").get()
            yield {"doi": doi}
OUTPUT for pages 1-2:
{'doi': 'https://doi.org/10.1093/plphys/kiac484'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa026'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa032'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.120.2.599'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.109.139378'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.085167'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.085886'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.106.090449'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.119.2.553'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.015479'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.97.1.415'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.2.283'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.2.228'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.6.728'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.50.1.149'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.29.1.64'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1104/pp.16.4.721'}
2023-05-09 23:07:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=2>
{'doi': 'https://doi.org/10.1093/plphys/kiaa119'}
2023-05-09 23:07:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1> (referer: None)
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.73.4.1002'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.59.5.868'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.75.1.82'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.68.4.894'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.81.4.1115'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.59.5.859'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.93.4.1466'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.95.4.1270'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.48.6.712'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.89.2.409'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.89.4.1231'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.26.3.581'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.100.2.947'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.71.4.855'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.62.1.127'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.72.1.16'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.61.2.150'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1104/pp.20.00264'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1093/plphys/kiac602'}
2023-05-09 23:07:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults?q=photosynthesis&allJournals=1&fl_SiteID=6323&page=1>
{'doi': 'https://doi.org/10.1093/plphys/kiad183'}
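As an alternative to the fixed range(1, 50) in the spider above, the next request could be built from the Next button's data-url attribute. This is only a rough sketch: it assumes that data-url keeps holding a bare query string, as in the HTML snippet above, and that the AJAX response itself contains a Next button, which I have not verified. The names AJAX_ENDPOINT and next_ajax_url are made up for illustration.

```python
from typing import Optional

# Endpoint taken from the ajax_url in the spider above, without its query string.
AJAX_ENDPOINT = "https://academic.oup.com/plphys/Solr/SolrSearch/OUP_SearchResults"


def next_ajax_url(data_url: Optional[str]) -> Optional[str]:
    """Build the next page's AJAX URL from the Next button's data-url value.

    Returns None when the value is missing or empty, e.g. when a page has
    no Next button.
    """
    if not data_url:
        return None
    # Assumption: data-url is a query string such as "q=...&page=2".
    # A leading "?" is dropped so the result never contains "??".
    return AJAX_ENDPOINT + "?" + data_url.lstrip("?")
```

Inside parse, this could be used by reading the attribute with response.css('a.sr-nav-next.al-nav-next::attr(data-url)').get(), passing the result to next_ajax_url, and yielding scrapy.Request(url, callback=self.parse) only when the result is not None.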
Note: While I was writing this answer, the site put up a CAPTCHA. If you are scraping the site while that CAPTCHA is active, copy the cookies from your browser and include them in each of the requests in the start_requests method.
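One way to do this, as a minimal sketch: copy the Cookie request header from the browser's developer tools and convert it into the dict that scrapy.Request accepts through its cookies argument. The helper name cookie_header_to_dict is hypothetical, and the cookie names in the usage note are placeholders, not the site's real ones.

```python
def cookie_header_to_dict(header: str) -> dict:
    """Convert a copied 'Cookie' header value into a name -> value dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only, since cookie values may contain "=".
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

For example, with browser_cookies = cookie_header_to_dict("name1=value1; name2=value2"), each request in start_requests would become scrapy.Request(ajax_url + str(i), cookies=browser_cookies).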
Answered By - Alexander