Sunday, March 27, 2022

[FIXED] Sorting Multiple Choice Questions using OPENCV and PYTESSERACT

Labels: opencv, python-3.x, python-tesseract

Issue

I am trying to compile a multiple-choice quiz from questions in different books and other sources so that I can answer them digitally. Typing them in one by one would be a hassle and take a lot of time, so I took pictures of the questions and fed them to a script that uses OpenCV for image processing and pytesseract to convert them to text. A Python module then exports the text to Excel, which acts as a "database" for my questions.

My problem is that I am having trouble assigning each choice to its corresponding letter.

Here is an image of the choices

Multiple Choices

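and here is my code, which splits the choices on newlines:

    import re

    import cv2
    import pytesseract

    # Read the cropped choices region as a grayscale image.
    choices = cv2.imread("ROI_2.png", 0)
    custom_config = r'--oem 3 --psm 6'
    c = pytesseract.image_to_string(choices, config=custom_config, lang='eng')

    # Collapse double newlines, then split the repr() of the text on the literal "\n".
    x = re.sub(r'\n{2}', '\n', c)
    text = repr(x)
    print(text)
    newtext = text.split("\\n")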
It works well when the choices are short, but it fails when a choice wraps across multiple lines:

Choices having multiple new lines

I'm trying to find a way to sort these choices efficiently by their corresponding letters. I was thinking that delimiters might work, or combining the converted text into a single line, or maybe the fix belongs in the image processing. I have ideas on how to solve my problem, but I don't know how to proceed. I'm still fairly new to Python and rely heavily on tutorials and previously answered questions on Stack Overflow.

Solution

Your images seem to be noise-free, so it was easy to extract the text.

code:

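    import cv2
    import pytesseract
    from pytesseract import Output

    # Load as grayscale; keep a BGR copy for drawing the detected word boxes.
    img = cv2.imread("options.png", 0)
    img_copy = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
    # Otsu's method picks the binarization threshold automatically.
    otsu = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    custom_oem_psm_config = r'--oem 3 --psm 6'
    ocr = pytesseract.image_to_data(otsu, output_type=Output.DICT, config=custom_oem_psm_config, lang='eng')
    boxes = len(ocr['text'])
    texts = []

    for i in range(boxes):
        # Skip entries whose confidence is -1; float() also accepts
        # fractional confidence strings, which int() would reject.
        if float(ocr['conf'][i]) != -1:
            (x, y, w, h) = (ocr['left'][i], ocr['top'][i], ocr['width'][i], ocr['height'][i])
            cv2.rectangle(img_copy, (x, y), (x + w, y + h), (255, 0, 0), 2)
            texts.append(ocr['text'][i])

    def list_to_string(words):
        # Join the recognized words with single spaces.
        return " ".join(words)

    string = list_to_string(texts)
    print("String: ", string)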

Output

String:  A. A sound used to indicate when a transmission is complete. B. A sound used to identify the repeater. C. A sound used to indicate that a message is waiting for someone. D. A sound used to activate a receiver in case of severe weather.
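The question also raised the idea of combining the converted text into a single line. For text that comes from `image_to_string`, that can be done by collapsing every run of whitespace, newlines included, into one space. The following is only a minimal sketch: `collapse_whitespace` is a hypothetical helper name, and the sample string is made up to imitate OCR output with wrapped lines.

```python
def collapse_whitespace(ocr_text):
    """Return ocr_text with every run of whitespace replaced by one space."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs,
    # newlines) and discards empty pieces, so joining gives a single line.
    return " ".join(ocr_text.split())


# Made-up sample imitating OCR output where option A wraps onto a second line.
sample = "A. A sound used to indicate when a\ntransmission is complete.\n\nB. A sound used to identify the repeater.\n"
single_line = collapse_whitespace(sample)
print(single_line)
```

This only normalizes the layout; the option labels still have to be separated afterwards.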

But now all the options are joined in one string. To split the string by option, I used the split function:

    # Split on each label in turn; the text before the next label
    # is the body of the current option.
    a = string.split("A.")
    b = a[1].split("B.")
    c = b[1].split("C.")
    d = c[1].split("D.")

    option_A = b[0]
    option_B = c[0]
    option_C = d[0]
    option_D = d[1]

    print("only options RHS")
    print(option_A)
    print(option_B)
    print(option_C)
    print(option_D)

Output:

only options RHS
 A sound used to indicate when a transmission is complete.
 A sound used to identify the repeater.
 A sound used to indicate that a message is waiting for someone.
 A sound used to activate a receiver in case of severe weather.
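The chained `split` calls assume exactly four options labelled `A.` to `D.`. A more general variant could locate the labels with a regular expression and cut the string between them. This is a sketch under stated assumptions, not part of the original answer: `OPTION_LABEL` and `split_options` are hypothetical names, and a label is assumed to be a single capital letter followed by `.` or `)` and whitespace. Option text that itself contains such a pattern (for example "vitamin C. ") would be split in the wrong place.

```python
import re

# Assumed label format: one capital letter, then "." or ")", then whitespace,
# located at the start of the string or right after whitespace.
OPTION_LABEL = re.compile(r'(?:^|\s)([A-Z])[.)]\s+')


def split_options(text):
    """Return a dict mapping each option letter to its text."""
    matches = list(OPTION_LABEL.finditer(text))
    options = {}
    # Pair each label with the next one; the last label is paired with None.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        options[current.group(1)] = text[current.end():end].strip()
    return options


string = ("A. A sound used to indicate when a transmission is complete. "
          "B. A sound used to identify the repeater. "
          "C. A sound used to indicate that a message is waiting for someone. "
          "D. A sound used to activate a receiver in case of severe weather.")

for letter, option in split_options(string).items():
    print(letter, "->", option)
```

Returning a dictionary keyed by letter keeps each choice attached to its label, which may be convenient when writing one column per letter to a spreadsheet.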

There you go, all the options. Hope this solves the problem.

Answered By - Tarun Chakitha

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
