Issue
I ran into a problem when using pdfminer to extract information from a PDF file in Spyder.
Following the official pdfminer documentation, I first defined an extraction function:
# Define a pdf-to-txt function
def pdftotxt(path, new_name):
    # Create a pdf parser
    parser = PDFParser(path)
    # Create an object storing information
    document = PDFDocument(parser)
    # Evaluate if extractable
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource management to restore resource
        resmag = PDFResourceManager()
        # Set a parameter for analysis
        laparams = LAParams()
        # Create a PDF object
        # device = PDFDevice(resmag)
        device = PDFPageAggregator(resmag,laparams=laparams)
        # Create a PDF interpreter
        interpreter = PDFPageInterpreter(resmag, device)
        # Analyzing each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Assign LTPage of this page
            layout = device.get_result()
            for y in layout:
                if(isinstance(y,LTTextBoxHorizontal)):
                    with open("%s"%(new_name),'a',encoding="utf-8") as f:
                        f.write(y.get_text()+"\n")
# Get a PDF's directory to test
path = open('/keep_2.pdf')
pdftotxt(path, "pdfminer.txt")
But running it raises an error:
  File "<ipython-input-2-11f054ad4321>", line 31, in <module>
    pdftotxt(path, "pdfminer.txt")
  File "<ipython-input-2-11f054ad4321>", line 5, in pdftotxt
    document = PDFDocument(parser)
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 557, in __init__
    pos = self.find_xref(parser)
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdocument.py", line 759, in find_xref
    for line in parser.revreadlines():
  File "/Users/WQY/opt/anaconda3/lib/python3.7/site-packages/pdfminer/psparser.py", line 268, in revreadlines
    n = max(s.rfind(b'\r'), s.rfind(b'\n'))
TypeError: must be str, not bytes
Can anyone help me solve this error? I searched for it, but I could not find any similar problem reported for pdfminer.
Thank you in advance for the help.
Solution
Posting my comment as an answer so this doesn't look like an unanswered question to people scrolling through:
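Instead of open('/keep_2.pdf'), use open('/keep_2.pdf', 'rb') to open the file in binary mode. In the default text mode the file yields str, while pdfminer's parser searches the data with bytes patterns such as b'\r' and b'\n' (see the last frame of the traceback), which is what raises the TypeError.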
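To see why the mode matters, here is a minimal sketch that uses only the standard library, so it runs without pdfminer. It assumes a throwaway file created with tempfile as a stand-in for a real PDF, and the helper name last_newline_index is hypothetical; it only mirrors the s.rfind(b'\r') / s.rfind(b'\n') expression shown in the psparser.py frame of the traceback.

```python
import os
import tempfile


def last_newline_index(chunk):
    """Hypothetical helper mirroring the expression from the traceback:
    search the data for byte line endings."""
    return max(chunk.rfind(b"\r"), chunk.rfind(b"\n"))


# Create a small throwaway file (a stand-in, not a valid PDF).
fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(b"%PDF-1.4\nstartxref\n0\n%%EOF\n")

try:
    # Text mode (the default): read() returns str, so a bytes search fails.
    with open(tmp_path, encoding="latin-1") as text_file:
        text_chunk = text_file.read()
    try:
        last_newline_index(text_chunk)
        text_mode_failed = False
    except TypeError:
        text_mode_failed = True

    # Binary mode: read() returns bytes, so the same search works.
    with open(tmp_path, "rb") as binary_file:
        binary_chunk = binary_file.read()
    binary_index = last_newline_index(binary_chunk)
finally:
    os.remove(tmp_path)

print(type(text_chunk).__name__, text_mode_failed)
print(type(binary_chunk).__name__, binary_index)
```

In the question's code, the corresponding change is to pass open('/keep_2.pdf', 'rb') to pdftotxt, ideally inside a with block so the file is closed afterwards.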
Answered By - jdaz