Issue
I'm scraping to .csv and I'm getting many extra spaces in the .csv file that are not on the actual web page. I'm able to remove tabs and line breaks using .replace(), but the spaces don't get removed the same way. Even if there were something unusual in the formatting of the web page, it should be removed by the .replace(). What am I missing?
import re

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DhqSpider(CrawlSpider):
    name = 'dhq1'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/index.html']

    rules = (
        Rule(LinkExtractor(allow='index.html')),
        Rule(LinkExtractor(allow='vol'), callback='parse_article'),
    )

    def parse_article(self, response):
        yield {
            'title': response.css('h1.articleTitle::text').get().replace('\n', '').replace('\t', '').replace('\s+', ' '),
            'author1': response.css('div.author a::text').getall(),
            'year': response.css('div#pubInfo::text')[0].get(),
            'volume': response.css('div#pubInfo::text')[1].get(),
            'xmllink': response.urljoin(response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
Piece of the csv: https://pastebin.com/7GvZT3b9
One of the pages included in the .csv: http://www.digitalhumanities.org/dhq/vol/16/3/000629/000629.html
Solution
First, what you're missing: str.replace() matches literal substrings, not regular expressions, so .replace('\s+', ' ') searches for the two characters \s followed by + and never finds them. (A regex version using re.sub() is shown at the end of this answer.)

For the whitespace itself, you can use XPath's normalize-space() function, which strips leading and trailing whitespace and collapses internal runs of whitespace into single spaces. Either replace the CSS selectors with XPath selectors outright, or keep the CSS selector, drop the ::text from it, and chain .xpath('normalize-space(text())') after it, as shown in the example below.
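One caveat: XPath 1.0 treats only the space, tab, carriage return, and newline characters as whitespace, so normalize-space() leaves non-breaking spaces (\xa0) alone; that is what unidecode handles in the example. A minimal sketch with a made-up HTML snippet (not taken from the page in question) illustrates both behaviors:

from scrapy.selector import Selector

html = '<h1 class="articleTitle">\n\tEthical   and  Effective\xa0Visualization\n</h1>'
sel = Selector(text=html)

# ::text returns the raw node text, whitespace and all
print(repr(sel.css('h1.articleTitle::text').get()))
# '\n\tEthical   and  Effective\xa0Visualization\n'

# normalize-space() trims the ends and collapses runs of space/tab/newline,
# but the non-breaking space \xa0 survives
print(repr(sel.css('h1.articleTitle').xpath('normalize-space(text())').get()))
# 'Ethical and Effective\xa0Visualization'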
Example:
import scrapy
import unidecode  # to remove "\xa0" from the strings


class DhqSpider(scrapy.Spider):
    name = 'dhq1'
    allowed_domains = ['digitalhumanities.org']
    start_urls = ['http://www.digitalhumanities.org/dhq/vol/16/3/000629/000629.html']

    def parse(self, response):
        item = {
            'title': response.css('h1.articleTitle').xpath('normalize-space(text())').get(default='').strip(),
            'author1': response.css('div.author a').xpath('normalize-space(text())').getall(),
            'year': unidecode.unidecode(response.css('div#pubInfo::text')[0].get()),
            'volume': unidecode.unidecode(response.css('div#pubInfo::text')[1].get()),
            'xmllink': response.urljoin(response.xpath('(//div[@class="toolbar"]/a[contains(@href, ".xml")]/@href)[1]').get()),
        }
        item['author1'] = [unidecode.unidecode(i) for i in item['author1']]
        yield item
Output:
{'title': 'Ethical and Effective Visualization of Knowledge Networks', 'author1': ['Chelsea Canon', 'canon_at_nevada_dot_unr_dot_edu', ' https://orcid.org/0000-0002-0431-343X', 'Douglas Boyle', 'douglasb_at_unr_dot_edu', ' https://orcid.org/0000-0002-3301-3997', 'K. J. Hepworth', 'katherine_dot_hepworth_at_unisa_dot_edu_dot_au', ' https://orcid.org/0000-0003-1059-567X'], 'year': '2022', 'volume': 'Volume 16 Number 3', 'xmllink': 'http://www.digitalhumanities.org/dhq/vol/16/3/000629.xml'}
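And for completeness, here is why the original .replace('\s+', ' ') did nothing, plus the regex equivalent via re.sub() if you'd rather clean the strings in Python instead of in the selector. A minimal sketch (the sample string is made up):

import re

raw = '\n\tEthical   and  Effective\xa0Visualization\n'

# str.replace() searches for the literal two-character substring '\s'
# followed by '+', which never occurs in the text, so nothing changes
print(raw.replace(r'\s+', ' ') == raw)  # True

# re.sub() interprets r'\s+' as a regex; in Python 3, \s also matches
# the non-breaking space \xa0, so this collapses everything at once
print(re.sub(r'\s+', ' ', raw).strip())
# Ethical and Effective Visualization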
Answered By - SuperUser