Issue
I am trying to scrape all articles on a website to get the full text, date, and title. I am using XPath to capture the information I need. I tried to be careful when writing the XPath expressions, but when I run my code it returns nothing.
The error message:
result = xpathev(query, namespaces=nsp,
File "src/lxml/etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression
As I understand it, the message means something is wrong with the XPath.
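The failure can be reproduced outside Scrapy with lxml (the library Scrapy uses underneath): the expression is rejected at compile time, before any document is even evaluated. The markup below is a minimal hypothetical stand-in:

```python
from lxml import etree

# Minimal stand-in document; its content is irrelevant because the
# expression fails to compile before it is evaluated.
root = etree.fromstring('<div class="stories-list"><a href="/x">link</a></div>')

err = None
try:
    # The [@class=[...]] bracket is never closed, and a bracketed literal
    # is not valid XPath syntax, so compilation fails.
    root.xpath('//*[@class=["story clearfix "]/a/@href')
except etree.XPathEvalError as exc:
    err = exc

print(err)  # Invalid expression
```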
Here is the code I have created:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//*[@class="stories-list"]//*[@class=["story clearfix "]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@class="field__item"]/time/text()').extract(),
            'title': response.xpath('//*[@class="article-header-wrapper"]//h1//text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@class="article-content ng-binding ng-scope"]//p//text()')])
        }


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()
How should I write the XPath expressions in order to capture all the information I need for this web scraping?
Thank you very much for any help.
Solution
After some minor changes to your initial XPath expression I was able to get all of the links from the first page. However, the inner article pages are rendered differently, possibly with Angular, so for those I ended up using the scrapy-selenium extension.
With this configuration I was able to get the results.
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from scrapy_selenium import SeleniumRequest


class barchart(scrapy.Spider):
    name = 'barchart'
    start_urls = ['https://www.barchart.com/news/commodities/energy']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 10,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        # scrapy-selenium settings: chromedriver.exe must exist at this path.
        'SELENIUM_DRIVER_NAME': "chrome",
        'SELENIUM_DRIVER_EXECUTABLE_PATH': "chromedriver.exe",
        'SELENIUM_DRIVER_ARGUMENTS': [],
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_selenium.SeleniumMiddleware': 800
        }
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # The listing page is plain HTML, so a regular request is enough.
        # contains() matches the class value without tripping over the
        # trailing space in class="story clearfix ".
        sections = response.xpath('//div[contains(@class,"story clearfix ")]')
        for section in sections:
            link = section.xpath('.//a[contains(@class,"story-link")]/@href').get()
            # The article pages are rendered client-side, so fetch them
            # through Selenium and give the page time to render.
            yield SeleniumRequest(url=link, callback=self.parse_item, wait_time=10)

    def parse_item(self, response):
        item = {
            'date': response.xpath('//div[@class="article-meta"]/span[contains(@class,"article-published")]/text()').get().strip(),
            'title': response.xpath('//h1[contains(@class,"article-title")]/text()').get().strip(),
            'text': ''.join([x.get().strip() for x in response.xpath('//div[contains(@class,"article-content")]//p/text()')])
        }
        yield item


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(barchart)
    process.start()
Answered By - Alexander