Tuesday, October 12, 2021

[FIXED] Using pdfminer python to extract information from PDF file


Issue

I ran into a problem while using pdfminer to extract certain information from a PDF file in Spyder. Following the official pdfminer documentation, I first defined an extraction function:

# Define a pdf-to-txt function
def pdftotxt(path, new_name):
    # Create a pdf parser
    parser = PDFParser(path)
    # Create an object storing information
    document = PDFDocument(parser)
    # Evaluate if extractable
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to store shared resources
        resmag = PDFResourceManager()
        # Set a parameter for analysis
        laparams = LAParams()
        # Create a PDF object
        # device = PDFDevice(resmag)
        device = PDFPageAggregator(resmag,laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(resmag, device)
        # Analyzing each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Assign LTPage of this page
            layout = device.get_result()
            for y in layout:
                if(isinstance(y,LTTextBoxHorizontal)):
                    with open("%s"%(new_name),'a',encoding="utf-8") as f:
                        f.write(y.get_text()+"\n")  

# Open a PDF file to test
path = open('/keep_2.pdf')
pdftotxt(path, "pdfminer.txt")

But it raises the following error:

File "<ipython-input-2-11f054ad4321>", line 31, in <module>
    pdftotxt(path, "pdfminer.txt")

  File "<ipython-input-2-11f054ad4321>", line 5, in pdftotxt
    document = PDFDocument(parser)

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 557, in __init__
    pos = self.find_xref(parser)

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 759, in find_xref
    for line in parser.revreadlines():

  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/psparser.py", line 268, in revreadlines
    n = max(s.rfind(b'\r'), s.rfind(b'\n'))

TypeError: must be str, not bytes

Can anyone help me solve this error? I searched for it online, but I could not find any similar problems reported for pdfminer. Thank you in advance for your help.

Solution

Posting my comment as an answer so this doesn't look like an unanswered question to people scrolling through:

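Instead of open('/keep_2.pdf'), use open('/keep_2.pdf', 'rb') to open the file in binary mode. By default, open() uses text mode, so reads return str. The parser searches the data with bytes arguments (s.rfind(b'\r') in the traceback), which is why it fails with "TypeError: must be str, not bytes".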
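The difference between the two modes can be reproduced without pdfminer. The following is only a minimal sketch: the temporary file stands in for a real PDF, and last_newline is a hypothetical helper that imitates the failing line from the traceback; it is not part of pdfminer.

```python
import os
import tempfile

# Hypothetical stand-in for a PDF: a few bytes written to a temporary file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\nstartxref\n0\n%%EOF\n")
    sample_path = tmp.name


def last_newline(fp):
    """Imitate the failing line in the traceback: search the data using bytes."""
    s = fp.read()
    return max(s.rfind(b"\r"), s.rfind(b"\n"))


# Text mode (the default): read() returns str, so rfind(bytes) raises TypeError.
with open(sample_path) as fp:
    try:
        last_newline(fp)
        text_mode_error = None
    except TypeError as exc:
        text_mode_error = exc

# Binary mode: read() returns bytes, so the same search succeeds.
with open(sample_path, "rb") as fp:
    binary_mode_result = last_newline(fp)

os.remove(sample_path)
print("text mode:", type(text_mode_error).__name__)
print("binary mode:", binary_mode_result)
```

Applied to the code in the question, this would mean opening the file with something like "with open('/keep_2.pdf', 'rb') as path:" and calling pdftotxt(path, "pdfminer.txt") inside that block, so the file is also closed once extraction finishes.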

Answered By - jdaz

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
