Friday, December 3, 2021

[FIXED] Downloading PDF documents using Scrapy

Issue

I am trying to download PDF documents using a spider written with Scrapy. I am able to get all the documents that I need from a page, but instead of being saved as PDF files, they are being saved as encoded text files.

The anchor tags that I am downloading from look like this:

<a href="/utils/view?id=37a074754f8d7d7302e0a32d9b049054" target="_blank" title="Download/View Attachment_1_PandemicFlu.pdf" class="file" id="yui-gen6">Attachment_1_Pandemi...</a>

where the relative url points to https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054.

It seems that the issue is that the href link does not contain .pdf. I tried appending the suffix in my program (and in a browser), but that link doesn't exist and nothing was downloaded.
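
Although the href carries no extension, the title attribute of the anchor quoted above does contain the original file name. A minimal sketch of recovering it follows; it assumes every title starts with the prefix "Download/View " as in that single example, and the helper name filename_from_title and the fallback name are hypothetical.

```python
from typing import Optional

# Assumption: every title attribute starts with this prefix, as in the one
# anchor tag quoted in the question.
TITLE_PREFIX = "Download/View "


def filename_from_title(title: Optional[str], fallback: str = "download.pdf") -> str:
    """Return the file name embedded in an anchor's title attribute."""
    if not title:
        return fallback
    name = title.strip()
    if name.startswith(TITLE_PREFIX):
        name = name[len(TITLE_PREFIX):]
    # Keep only the last path component so the name cannot point outside
    # the directory the files are written to.
    name = name.replace("\\", "/").rsplit("/", 1)[-1].strip()
    return name or fallback


print(filename_from_title("Download/View Attachment_1_PandemicFlu.pdf"))
# prints: Attachment_1_PandemicFlu.pdf
```

In the spider, the title could be read from the same selector as the href, for example with link.xpath("./@title").extract_first(), and the result used as the output file name.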

Any help would be appreciated!

My code is below:

import scrapy
from scrapy.loader import ItemLoader
from FBOSpider.items import FbospiderItem

class fbo_spider(scrapy.Spider):
    name = "fbospider"

    start_urls = ["https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"]

    def parse(self, response):
        base_url = "https://www.fbo.gov"
        for link in response.xpath("//*[@class='pkglist']/dd/a"):
            loader = ItemLoader(item=FbospiderItem(), selector=link)
            relative_url = link.xpath(".//@href").extract_first()
            absolute_url = base_url + relative_url # this is where I tried to add: + '.pdf'
            loader.add_value('file_urls', absolute_url)
            yield loader.load_item()
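
As a side note on the base_url + relative_url concatenation above: the standard library can resolve the href against the page URL, which avoids hard-coding the host. This is only a sketch using the two URLs from the question; the variable names are illustrative.

```python
from urllib.parse import urljoin

# Both values are taken from the question.
page_url = "https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"
relative_url = "/utils/view?id=37a074754f8d7d7302e0a32d9b049054"

# A leading slash in the href replaces the whole path of the page URL.
absolute_url = urljoin(page_url, relative_url)
print(absolute_url)
# prints: https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054
```

Inside a Scrapy callback, response.urljoin(relative_url) performs the same resolution against the URL of the response.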

UPDATE: Got it working with the help of an answer below. Here is my solution. Hope it helps.

import scrapy
import requests

class fbo_spider(scrapy.Spider):
    name = "fbospider"

    start_urls = ["https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"]

    def parse(self, response):

        base_url = "https://www.fbo.gov"  # base url used to build the absolute url from the href link
        i = 1

        # xpath to retrieve the part of html which holds documents
        for link in response.xpath("//*[@class='pkglist']/dd/a"):
            relative_url = link.xpath(".//@href").extract_first()

            # ex: https://www.fbo.gov/utils/view?id=921ca3f6f2ae471ab579075b8dc37afb
            absolute_url = base_url + relative_url 

            # request to fetch pdf documents using absolute url
            r = requests.get(absolute_url)
            with open("file%s.pdf" % i, 'wb') as f:
                f.write(r.content)
            i += 1

Solution

Use the requests library to download the file:

import requests

def download(url):
    print('Beginning file download with requests')

    r = requests.get(url)

    with open('some_name.pdf', 'wb') as f:
        f.write(r.content)

    # Retrieve HTTP meta-data
    print(r.status_code)
    print(r.headers['content-type'])
    print(r.encoding)

download('https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054')
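
Since the function above writes whatever the server returns into a .pdf file, it may be worth checking the payload before saving it. The sketch below uses only the standard library and is not part of the original answer. It tests for the "%PDF-" signature that PDF files normally start with, and reads a file name from a Content-Disposition header if one is present. Whether this particular server sends that header is an assumption, and the helper names looks_like_pdf and filename_from_headers are hypothetical.

```python
from email.message import Message
from typing import Mapping, Optional

# PDF files normally begin with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(body: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    return body.startswith(PDF_MAGIC)


def filename_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the filename parameter of Content-Disposition, or None."""
    value = None
    for key, val in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if key.lower() == "content-disposition":
            value = val
            break
    if value is None:
        return None
    # Reuse the stdlib email header parser for the parameter syntax.
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


headers = {"Content-Disposition": 'attachment; filename="Attachment_1_PandemicFlu.pdf"'}
print(looks_like_pdf(b"%PDF-1.4\n"), filename_from_headers(headers))
```

With requests, r.content and r.headers from the download function can be passed to these helpers directly, so a file would only be written under a .pdf name when the bytes look like a PDF.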

Answered By - Vishnudev

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.