Issue
I am trying to scrape all articles on a website to get the full text, date, and title. I am using XPath to capture the information I need. I tried to be careful when writing the XPath expressions, but when I run my code it returns nothing.
The error message:
result = xpathev(query, namespaces=nsp,
File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
As I understand it, the message means something is wrong with the XPath.
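The failure can be reproduced outside Scrapy with lxml (the library Scrapy uses underneath): the expression is rejected at compile time, before any document is even evaluated. The markup below is a minimal hypothetical stand-in:

```python
from lxml import etree

# Minimal stand-in document; its content is irrelevant because the
# expression fails to compile before it is evaluated.
root = etree.fromstring('<div class="stories-list"><a href="/x">link</a></div>')

err = None
try:
    # The [@class=[...]] bracket is never closed, and a bracketed literal
    # is not valid XPath syntax, so compilation fails.
    root.xpath('//*[@class=["story clearfix "]/a/@href')
except etree.XPathEvalError as exc:
    err = exc

print(err)  # Invalid expression
```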
Here is the code I have created:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//*[@class="stories-list"]//*[@class=["story clearfix "]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="field__item"]/time/text()').extract(),
            'title': response.xpath('//*[@class="article-header-wrapper"]//h1//text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@class="article-content ng-binding ng-scope"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()
How should I write the XPath expressions in order to capture all the information I need for this web scraping?
Thank you very much for any help.
Solution
After some minor changes to your initial XPath expression I was able to get all of the links from the first page. However, the inner article pages are rendered differently, possibly with Angular, so for those I ended up using the scrapy-selenium extension.
With this configuration I was able to get the results.
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from scrapy_selenium import SeleniumRequest


class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 10,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        # scrapy-selenium settings: chromedriver.exe must exist at this path.
        'SELENIUM_DRIVER_NAME': "chrome",
        'SELENIUM_DRIVER_EXECUTABLE_PATH': "chromedriver.exe",
        'SELENIUM_DRIVER_ARGUMENTS': [],
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_selenium.SeleniumMiddleware': 800
        }
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # The listing page is plain HTML, so a regular request is enough.
        # contains() matches the class value without tripping over the
        # trailing space in class="story clearfix ".
        sections = response.xpath('//div[contains(@class,"story clearfix ")]')
        for section in sections:
            link = section.xpath('.//a[contains(@class,"story-link")]/@href').get()
            # The article pages are rendered client-side, so fetch them
            # through Selenium and give the page time to render.
            yield SeleniumRequest(url=link, callback=self.parse_item, wait_time=10)

    def parse_item(self, response):
        item = {
            'date': response.xpath('//div[@class="article-meta"]/span[contains(@class,"article-published")]/text()').get().strip(),
            'title': response.xpath('//h1[contains(@class,"article-title")]/text()').get().strip(),
            'text': ''.join([x.get().strip() for x in response.xpath('//div[contains(@class,"article-content")]//p/text()')])
        }
        yield item


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()
Answered By - Alexander