Issue
I am trying to download PDF documents with a spider written in Scrapy. The spider finds every document I need on a page, but the files are saved as encoded text files rather than as PDFs.
The anchor tags that I am downloading from look like this:
<a href="/utils/view?id=37a074754f8d7d7302e0a32d9b049054" target="_blank" title="Download/View Attachment_1_PandemicFlu.pdf" class="file" id="yui-gen6">Attachment_1_Pandemi...</a>
where the relative URL in the href resolves to https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054.
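The step from the relative href to the absolute URL can be reproduced with the standard library's `urljoin`. This is a minimal sketch that uses the listing page from this question as the base; the variable names are illustrative only.

```python
from urllib.parse import urljoin

# Page the link was found on (the spider's start URL in this question).
page_url = "https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"
# Relative href copied from the anchor tag above.
href = "/utils/view?id=37a074754f8d7d7302e0a32d9b049054"

# The leading slash makes the href root-relative, so only the scheme
# and host of the page URL are kept.
absolute_url = urljoin(page_url, href)
print(absolute_url)
```

Inside a Scrapy callback, `response.urljoin(href)` resolves a link the same way, relative to the page being parsed.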
The problem seems to be that the href does not contain .pdf. I tried appending the suffix in my program (and in the browser), but that URL does not exist and nothing was downloaded.
Any help would be appreciated!
My code is below
import scrapy
from scrapy.loader import ItemLoader
from FBOSpider.items import FbospiderItem


class fbo_spider(scrapy.Spider):
    name = "fbospider"
    start_urls = ["https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"]

    def parse(self, response):
        base_url = "https://www.fbo.gov"
        for link in response.xpath("//*[@class='pkglist']/dd/a"):
            loader = ItemLoader(item=FbospiderItem(), selector=link)
            relative_url = link.xpath(".//@href").extract_first()
            absolute_url = base_url + relative_url  # this is where I tried to add: + '.pdf'
            loader.add_value('file_urls', absolute_url)
            yield loader.load_item()
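Files stored without a `.pdf` extension may still be valid PDFs. A PDF file begins with the bytes `%PDF-`, so one way to see what was actually saved is to inspect the start of a downloaded file. The sketch below makes no assumption about where the files were stored; `looks_like_pdf` is a hypothetical helper name.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        return f.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE
```

If this returns True for a downloaded file, the content is probably a PDF already and only the file name is missing its suffix.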
UPDATE: Got it working with the help of an answer below. Here is my solution. Hope it helps.
import scrapy
import requests


class fbo_spider(scrapy.Spider):
    name = "fbospider"
    start_urls = ["https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"]

    def parse(self, response):
        base_url = "https://www.fbo.gov"  # base URL used to build the full URL from the href
        i = 1
        # XPath that selects the part of the HTML which holds the document links
        for link in response.xpath("//*[@class='pkglist']/dd/a"):
            relative_url = link.xpath(".//@href").extract_first()
            # ex: https://www.fbo.gov/utils/view?id=921ca3f6f2ae471ab579075b8dc37afb
            absolute_url = base_url + relative_url
            # fetch the PDF document using the absolute URL
            r = requests.get(absolute_url)
            # write the raw bytes in binary mode so the PDF is not altered
            with open("file%s.pdf" % i, 'wb') as f:
                f.write(r.content)
            i += 1
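The `title` attribute of the link shown in the question carries the original file name after a `Download/View ` prefix, so the numbered `file%s.pdf` names could be replaced with the real ones. This is a minimal sketch that assumes every title follows that pattern (only one example appears in the question); `filename_from_title` is a hypothetical helper.

```python
import os

TITLE_PREFIX = "Download/View "


def filename_from_title(title, fallback):
    """Derive a local file name from a link's title attribute."""
    if not title or not title.startswith(TITLE_PREFIX):
        return fallback
    name = title[len(TITLE_PREFIX):].strip()
    # Keep only the last path component so a title cannot point
    # outside the directory the files are written to.
    name = os.path.basename(name)
    return name or fallback


print(filename_from_title("Download/View Attachment_1_PandemicFlu.pdf", "file1.pdf"))
```

In the spider, the title could be read with `link.xpath("./@title").extract_first()` and the numbered name passed as the fallback.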
Solution
Use the requests library to download the file and write the response body to disk in binary mode:
import requests


def download(url):
    print('Beginning file download with requests')
    r = requests.get(url)
    with open('some_name.pdf', 'wb') as f:
        f.write(r.content)

    # Retrieve HTTP meta-data
    print(r.status_code)
    print(r.headers['content-type'])
    print(r.encoding)


download('https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054')
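Servers that hand out files through an ID-based URL often send the real file name in a `Content-Disposition` response header. Whether this server does so is not stated here, so the sketch below treats the header as optional and falls back to a default name. `filename_from_headers` is a hypothetical helper, and the header value in the example is made up for illustration.

```python
import os
from email.message import Message


def filename_from_headers(headers, default="download.pdf"):
    """Return the filename parameter of Content-Disposition, or `default`."""
    value = headers.get("Content-Disposition")
    if not value:
        return default
    # Reuse the standard library's header parser instead of splitting by hand.
    msg = Message()
    msg["Content-Disposition"] = value
    name = msg.get_filename()
    # Strip any directory part so the header cannot choose the target folder.
    return os.path.basename(name) if name else default


# Illustrative header value only; not taken from a real response.
example = {"Content-Disposition": 'attachment; filename="Attachment_1_PandemicFlu.pdf"'}
print(filename_from_headers(example))
print(filename_from_headers({}))
```

With requests, `r.headers` can be passed to the helper directly, and the result used in place of the hard-coded `'some_name.pdf'`.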
Answered By - Vishnudev