Issue
Given an image of a scanned document, I want to check that it is not an empty page. I know I can send it to AWS Textract, but that would cost money for nothing.
I know I can use pytesseract, but maybe there is a simpler, more elegant solution? Or, given a .html file that represents the text of the image, how can I check whether it shows a blank page?
Solution
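PyMuPDF is another option if you want to avoid the hassle of going through pytesseract. Here is an example of how you could extract text from a scanned or cleanly formatted PDF. Keep in mind that the text extraction below reads text embedded in the file, so a scan that has no text layer will come back with no text: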
import fitz  # PyMuPDF

input_file = 'path/to/your/file'
doc = fitz.open(input_file)  # open the PDF using the fitz bindings
# Recent PyMuPDF releases use snake_case names; older releases spelled
# these pageCount, loadPage and getText.
no_of_pages = doc.page_count  # here is how you get the number of pages
for page_no in range(no_of_pages):
    page = doc.load_page(page_no)  # load a single page
    blocks = page.get_text("blocks")
    blocks.sort(key=lambda block: block[3])  # sort by 'y1' values
    for block in blocks:
        print(block[4])  # print the lines of this block or do your check here
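The "do your check here" step in the loop above could look something like the following. This is only a sketch and not part of the original answer: the helper names blocks_are_blank and page_is_blank and the min_chars threshold are made up for illustration, and it assumes a recent PyMuPDF where the method is called get_text (getText in older releases). It only sees text embedded in the file, so a scan with no text layer will look blank even if writing is visible in the image.

```python
def blocks_are_blank(blocks, min_chars=1):
    """Return True if the blocks hold fewer than min_chars visible characters.

    Each block is assumed to be a tuple shaped like PyMuPDF's "blocks"
    output: (x0, y0, x1, y1, text, block_no, block_type), where
    block_type is 0 for text and 1 for images.
    """
    total = 0
    for block in blocks:
        if block[6] != 0:
            # Image blocks carry a description, not page text, so skip them.
            continue
        total += sum(1 for ch in block[4] if not ch.isspace())
    return total < min_chars


def page_is_blank(page, min_chars=1):
    """Hypothetical wrapper: apply the check to a PyMuPDF page object."""
    return blocks_are_blank(page.get_text("blocks"), min_chars)
```

Raising min_chars (an assumed knob) lets you treat pages with a few stray characters, such as a page number, as blank too.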
page.get_text(option) (spelled getText in older PyMuPDF releases) is probably your best bet; option is a string that controls the output type. You can choose among plain text, single words with position info, HTML or XML string output, the complete page content in Python dict format, and more.
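For the HTML case raised in the question, one possible approach is to strip the markup and see whether any visible text is left. The sketch below uses only the standard library; the names _VisibleText and html_is_blank are made up for illustration, and it assumes the HTML string comes either from a .html file or from PyMuPDF's HTML output. Note that it treats a page containing only an embedded image as blank, which may not be what you want for a scan.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script>, <style> or <title>."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_is_blank(html):
    """Return True if the HTML string contains no visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # strip() also removes non-breaking spaces produced by &nbsp;
    return not "".join(parser.chunks).strip()
```

For a file on disk you would read it first, for example with open(path, encoding="utf-8").read(), and pass the resulting string to html_is_blank.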
EDIT:
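A quick way to work with a jpg is to open it with fitz.open and convert it to PDF (convert_to_pdf is spelled convertToPDF in older PyMuPDF releases):
pdfbytes = doc.convert_to_pdf()  # doc is the image opened with fitz.open
pdf = fitz.open('pdf', pdfbytes)  # reopen the bytes as a PDF document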
If you don't want to convert it to PDF, use page.get_text (getText in older PyMuPDF releases) with the "dict" parameter. The following builds a list of all images on a page:
d = page.get_text("dict")
blocks = d["blocks"]
imgblocks = [b for b in blocks if b["type"] == 1]  # type 1 = image block
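Building on the "dict" output above, the block types can hint at whether a page is truly empty or just an image without text. The following is a rough sketch under assumptions: classify_blocks and its three labels are made up for illustration, and the input is assumed to be the d["blocks"] list, where text blocks have type 0 with nested "lines" and "spans", and image blocks have type 1.

```python
def classify_blocks(blocks):
    """Return "text", "image-only" or "empty" for a list of block dicts."""
    has_text = any(
        span.get("text", "").strip()
        for b in blocks
        if b.get("type") == 0            # text blocks
        for line in b.get("lines", [])
        for span in line.get("spans", [])
    )
    if has_text:
        return "text"
    if any(b.get("type") == 1 for b in blocks):  # image blocks
        return "image-only"
    return "empty"
```

An "image-only" result does not say whether the image itself is blank; it only tells you there is no embedded text, so the pixels would still need to be inspected or OCR'd.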
If neither of these meets your needs, the PIL library might be your next option. If you need more information, see the official PyMuPDF documentation, which is also worth checking for HTML, since you mentioned it in other threads.
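If you do go the PIL route, one simple heuristic is to measure how much of the page is dark ink. This is a sketch, not something from the original answer: dark_fraction and image_is_blank are hypothetical names, and the threshold of 128 and the 0.5% cut-off are assumed values that would need tuning for noisy or grey scans.

```python
def dark_fraction(histogram, threshold=128):
    """Fraction of pixels darker than threshold in a 256-bin grayscale histogram."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return sum(histogram[:threshold]) / total


def image_is_blank(path, threshold=128, max_dark=0.005):
    """Return True if at most max_dark of the pixels are dark (assumed cut-off)."""
    from PIL import Image  # Pillow, imported here so dark_fraction works without it

    with Image.open(path) as img:
        histogram = img.convert("L").histogram()  # 256 grayscale bins
    return dark_fraction(histogram, threshold) <= max_dark
```

The check is cheap because it never runs OCR, but it cannot tell text from smudges or scanner shadows, so treat it as a first-pass filter.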
Answered By - liamsuma