amazing_ang

Multiple image OCR


Hi, I'm new to AutoIt and I'm absolutely loving it! :) I need a script that can read the text from multiple image files and load the data into a different repository. I have gone through the various OCR posts in the AutoIt Example Scripts, but it would be great if an AutoIt expert could help me get started in the right direction. Thanks in advance.

#2 ·  Posted (edited)

Hello amazing_ang and welcome to the forum.

First of all, you need to have Tesseract installed on your computer.
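
A quick way to confirm that from AutoIt is to check that the executable exists before doing anything else. This is only a minimal sketch: the path assumes Tesseract was copied into a 'Data' folder next to the script, so change it to wherever Tesseract lives on your machine. The variable name $sTesseract is just for illustration.

```autoit
; Assumed location of the Tesseract executable: a 'Data' folder next to the script.
; Adjust this path to your own installation.
Local $sTesseract = @ScriptDir & "\Data\Tesseract-OCR\Tesseract.exe"

If FileExists($sTesseract) Then
    ConsoleWrite("Tesseract found at: " & $sTesseract & @CRLF)
Else
    ; 16 is the value of $MB_ICONERROR (stop-sign icon)
    MsgBox(16, "ERROR", "Tesseract was not found at:" & @CRLF & $sTesseract)
    Exit
EndIf
```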

Take a look at this simple script while I prepare another one that improves the quality of the OCR results:

 

#include <MsgBoxConstants.au3>
#include <Array.au3>
#include <File.au3>


Global $imagesfolder ; folder where the images are stored.


; ===============================================================================================================================
; We also need to point to the Tesseract executable file. Put your own Tesseract executable path here.
; If you want to use your script on another computer, you can just copy the whole Tesseract installation folder to the same folder
; from where you are running your script and it will work even if the other computer doesn't have Tesseract installed on it.
; I usually copy the third-party programs that I'm going to need into a folder called 'Data' in the script dir
; ===============================================================================================================================
Const $TesseractExePath = @ScriptDir & "\Data\Tesseract-OCR\Tesseract.exe"


; ===============================================================================================================================
; This piece of code lets us select the folder where the images are stored. See the help file entry for FileSelectFolder.
; ===============================================================================================================================
Local $sFileSelectFolder = FileSelectFolder("Choose the folder with the images you want to OCR", @DesktopDir)
If @error Then ; If you don't choose any folder then you'll get an error and the script will Exit
    MsgBox($MB_SYSTEMMODAL, "ERROR", "I can't do OCR if I don't have any folder with images")
    Exit
Else ; But if you choose a folder, then the $imagesfolder variable will get this folder's path as value.
    $imagesfolder = $sFileSelectFolder
    ConsoleWrite("Images folder: " & $imagesfolder & @CRLF) ; I like to use ConsoleWrite calls to see what the script is doing all the time
EndIf


; ===============================================================================================================================
; If you selected a folder with images, then you'll need a list of all the images stored inside it.
; You can store the path of every image inside the folder with this function: _FileListToArray
;
; You can take a look at the help file to learn about the parameters of this function.
; Basically you have to call the function _FileListToArray(Folder, Filter, Flag, Return Path)
;
; When _FileListToArray has finished we should have an array with all the files in the folder.
; ===============================================================================================================================
Global $imagelist = _FileListToArray($imagesfolder, "*.*", 1, True)


; ===============================================================================================================================
; To be sure that it has worked we will check that there haven't been any errors. We are going to dismiss error 1 because the
; folder with the images must exist (we selected it just a few milliseconds ago). We will check that the folder wasn't empty
; by looking for error 4. In case of doubt, always remember to take a look at the help file.
; ===============================================================================================================================
If @error = 4 Then
    MsgBox($MB_SYSTEMMODAL, "", "No file(s) were found.")
    Exit
Else
    ; Display the results returned by _FileListToArray.
    _ArrayDisplay($imagelist, "Images found")
EndIf


; ===============================================================================================================================
; Now you have an array with all the images found in the images folder, and the number of images is stored in the first
; element of the array: $imagelist[0] = number of images. Now we can do a simple loop to run Tesseract on each image.
; ===============================================================================================================================
For $i = 1 To $imagelist[0] ; From a starting value of 1 to a finish value equal to the number of images in the folder...
    Local $input = $imagelist[$i] ; our input for Tesseract will be the $i element of the $imagelist array
    Local $txtoutput = StringRegExpReplace($imagelist[$i], "\.[^.\\]*$", "") ; the output will have the same path and filename, minus the dot and the extension.
                                                          ; The regular expression removes everything from the last dot onwards, so it works for
                                                          ; extensions of any length (.jpg, .jpeg, .tiff). Tesseract appends .txt to this name itself.


    ; ===============================================================================================================================
    ; Now comes the part where we call Tesseract to try to read the text inside the images. We run Tesseract just as we would
    ; from a command line: the Tesseract executable path followed by the parameters. In this case the parameters are the image
    ; file to OCR (the $input variable) and the output text file, each wrapped in double quotes and separated by one space.
    ; I have added the @SW_HIDE parameter, but you can use @SW_MAXIMIZE if you want to see Tesseract working.
    ;
    ; I like to check for any StdoutRead or StderrRead message from Tesseract. Those functions need the process ID that Run
    ; returns (ShellExecuteWait only returns the exit code), so Run is used here.
    ; The last parameter, 6, is $STDOUT_CHILD (2) + $STDERR_CHILD (4) from AutoItConstants.au3.
    ; ===============================================================================================================================
    ConsoleWrite("Doing OCR at file: " & $input & @CRLF)
    Local $OCR = Run('"' & $TesseractExePath & '" "' & $input & '" "' & $txtoutput & '"', "", @SW_HIDE, 6)

    ; Read both streams in the same loop until both are closed, which happens once Tesseract has finished.
    While 1
        Local $line = StdoutRead($OCR)
        Local $outdone = (@error <> 0)
        Local $errline = StderrRead($OCR)
        Local $errdone = (@error <> 0)
        If $line <> "" Then ConsoleWrite("STDOUT ocr " & $line & @CRLF)
        If $errline <> "" Then ConsoleWrite("ERROR ocr " & $errline & @CRLF)
        If $outdone And $errdone Then ExitLoop
        Sleep(10) ; don't hog the CPU while waiting
    WEnd
    Sleep(100)

Next
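
Since the original question was about loading the recognised text somewhere else, here is a minimal sketch of the next step: reading the text files back into AutoIt. Tesseract writes one .txt file per image, next to the image. The names $sFolder, $aTxt and $sText are only for illustration, and what you do with each string (database, CSV, etc.) is up to you.

```autoit
#include <File.au3>

; Assumption: this is the folder that was already OCR'd, so it holds one .txt file per image.
Local $sFolder = FileSelectFolder("Choose the folder with the OCR results", @DesktopDir)
If @error Then Exit

; 1 = return files only, True = return full paths
Local $aTxt = _FileListToArray($sFolder, "*.txt", 1, True)
If @error Then
    ConsoleWrite("No text files found in " & $sFolder & @CRLF)
    Exit
EndIf

For $i = 1 To $aTxt[0]
    Local $sText = FileRead($aTxt[$i]) ; the whole file as one string
    ConsoleWrite($aTxt[$i] & " -> " & StringLen($sText) & " characters" & @CRLF)
    ; Hand $sText to whatever stores your data from here.
Next
```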

You can use these random images:

12907273_532170663658240_70846492_n.jpg

 

diet.jpg

 

image1.png

 

As you can see, these are really perfect images for doing OCR. I once had to write a script to do the same with scanned documentation, and it was a total disaster.

I finally ended up inserting QR codes in the documentation that was going to be scanned because, even though I could parse the output in the Tesseract text files, it was not as accurate as I needed. I was always looking for the same kind of strings, to rename my scanned files with the data stored inside the documentation, and it used to confuse 5 with 6, or 8 with B.
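
To give an idea of what looking for the same kind of strings can look like, here is a rough sketch using StringRegExp. The sample text and the reference format ("INV-" followed by a year and five digits) are invented for the example, so replace them with your own. A strict pattern like this at least rejects a reference where a digit was read as a letter (an 8 read as B no longer matches), but it can't tell you when a 5 was read as a 6.

```autoit
; Invented sample text, standing in for the contents of a Tesseract output file.
Local $sText = "Customer copy" & @CRLF & "Ref: INV-2016-00123" & @CRLF & "Total 58.60"

; Hypothetical reference format: INV-YYYY-NNNNN.
; Flag 1 returns an array of matches and sets @error when nothing matches.
Local $aMatch = StringRegExp($sText, "INV-\d{4}-\d{5}", 1)
If @error Then
    ConsoleWrite("No valid reference found, check this file by hand" & @CRLF)
Else
    ConsoleWrite("Reference: " & $aMatch[0] & @CRLF)
EndIf
```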

 

I also tried to improve the quality of the images to make them easier for Tesseract to read.

Even though there are great GDI+ examples on the forum, I used an external program to do a little image conversion. (Sorry UEZ :sweating:).

(I wanted to use ImageMagick, but I got stuck on the object creation, and the program I used was portable and easy to use on any computer: you just call it from a command line without having to install anything.)

 

The program is nconvert, from the same people who created XnView. One image processing step that improved the OCR results for me was this:

Const $NconvertExePath = @ScriptDir & "\Data\nconvert.exe"
Local $optionsconvert = " -out png -rtype lanczos -resize 100% 140% -gauss 5"

For $i = 1 To $imagelist[0]
    Local $iReturn = ShellExecuteWait($NconvertExePath, $optionsconvert & ' "' & $imagelist[$i] & '" "' & $imagelist[$i] & '"', @ScriptDir, "", @SW_HIDE)
    ConsoleWrite("Doing some image processing to " & $imagelist[$i] & @CRLF)
Next

 

So you'll have to put the Nconvert executable inside the Data folder.

The final script with the nconvert image processing would be:

 

#include <MsgBoxConstants.au3>
#include <Array.au3>
#include <File.au3>


Global $imagesfolder ; folder where the images are stored.


; ===============================================================================================================================
; We also need to point to the Tesseract executable file. Put your own Tesseract executable path here.
; If you want to use your script on another computer, you can just copy the whole Tesseract installation folder to the same folder
; from where you are running your script and it will work even if the other computer doesn't have Tesseract installed on it.
; I usually copy the third-party programs that I'm going to need into a folder called 'Data' in the script dir
; ===============================================================================================================================
Const $TesseractExePath = @ScriptDir & "\Data\Tesseract-OCR\Tesseract.exe"


; ===============================================================================================================================
; This piece of code lets us select the folder where the images are stored. See the help file entry for FileSelectFolder.
; ===============================================================================================================================
Local $sFileSelectFolder = FileSelectFolder("Choose the folder with the images you want to OCR", @DesktopDir)
If @error Then ; If you don't choose any folder then you'll get an error and the script will Exit
    MsgBox($MB_SYSTEMMODAL, "ERROR", "I can't do OCR if I don't have any folder with images")
    Exit
Else ; But if you choose a folder, then the $imagesfolder variable will get this folder's path as value.
    $imagesfolder = $sFileSelectFolder
    ConsoleWrite("Images folder: " & $imagesfolder & @CRLF) ; I like to use ConsoleWrite calls to see what the script is doing all the time
EndIf


; ===============================================================================================================================
; If you selected a folder with images, then you'll need a list of all the images stored inside it.
; You can store the path of every image inside the folder with this function: _FileListToArray
;
; You can take a look at the help file to learn about the parameters of this function.
; Basically you have to call the function _FileListToArray(Folder, Filter, Flag, Return Path)
;
; When _FileListToArray has finished we should have an array with all the files in the folder.
; ===============================================================================================================================
Global $imagelist = _FileListToArray($imagesfolder, "*.*", 1, True)


; ===============================================================================================================================
; To be sure that it has worked we will check that there haven't been any errors. We are going to dismiss error 1 because the
; folder with the images must exist (we selected it just a few milliseconds ago). We will check that the folder wasn't empty
; by looking for error 4. In case of doubt, always remember to take a look at the help file.
; ===============================================================================================================================
If @error = 4 Then
    MsgBox($MB_SYSTEMMODAL, "", "No file(s) were found.")
    Exit
Else
    ; Display the results returned by _FileListToArray.
    _ArrayDisplay($imagelist, "Images found")
EndIf


; ===============================================================================================================================
; We are going to perform some image processing to try to improve the OCR performance
; ===============================================================================================================================
Const $NconvertExePath = @ScriptDir & "\Data\nconvert.exe"
Local $optionsconvert = " -out png -rtype lanczos -resize 100% 140% -gauss 5"

For $i = 1 To $imagelist[0]
    Local $iReturn = ShellExecuteWait($NconvertExePath, $optionsconvert & ' "' & $imagelist[$i] & '" "' & $imagelist[$i] & '"', @ScriptDir, "", @SW_HIDE)
    ConsoleWrite("Doing some image processing to " & $imagelist[$i] & @CRLF)
Next




; ===============================================================================================================================
; Now you have an array with all the images found in the images folder, and the number of images is stored in the first
; element of the array: $imagelist[0] = number of images. Now we can do a simple loop to run Tesseract on each image.
; ===============================================================================================================================
For $i = 1 To $imagelist[0] ; From a starting value of 1 to a finish value equal to the number of images in the folder...
    Local $input = $imagelist[$i] ; our input for Tesseract will be the $i element of the $imagelist array
    Local $txtoutput = StringRegExpReplace($imagelist[$i], "\.[^.\\]*$", "") ; the output will have the same path and filename, minus the dot and the extension.
                                                          ; The regular expression removes everything from the last dot onwards, so it works for
                                                          ; extensions of any length (.jpg, .jpeg, .tiff). Tesseract appends .txt to this name itself.


    ; ===============================================================================================================================
    ; Now comes the part where we call Tesseract to try to read the text inside the images. We run Tesseract just as we would
    ; from a command line: the Tesseract executable path followed by the parameters. In this case the parameters are the image
    ; file to OCR (the $input variable) and the output text file, each wrapped in double quotes and separated by one space.
    ; I have added the @SW_HIDE parameter, but you can use @SW_MAXIMIZE if you want to see Tesseract working.
    ;
    ; I like to check for any StdoutRead or StderrRead message from Tesseract. Those functions need the process ID that Run
    ; returns (ShellExecuteWait only returns the exit code), so Run is used here.
    ; The last parameter, 6, is $STDOUT_CHILD (2) + $STDERR_CHILD (4) from AutoItConstants.au3.
    ; ===============================================================================================================================
    ConsoleWrite("Doing OCR at file: " & $input & @CRLF)
    Local $OCR = Run('"' & $TesseractExePath & '" "' & $input & '" "' & $txtoutput & '"', "", @SW_HIDE, 6)

    ; Read both streams in the same loop until both are closed, which happens once Tesseract has finished.
    While 1
        Local $line = StdoutRead($OCR)
        Local $outdone = (@error <> 0)
        Local $errline = StderrRead($OCR)
        Local $errdone = (@error <> 0)
        If $line <> "" Then ConsoleWrite("STDOUT ocr " & $line & @CRLF)
        If $errline <> "" Then ConsoleWrite("ERROR ocr " & $errline & @CRLF)
        If $outdone And $errdone Then ExitLoop
        Sleep(10) ; don't hog the CPU while waiting
    WEnd
    Sleep(100)

Next
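
One thing to keep in mind: the script lists every file in the folder with "*.*", and Tesseract writes its .txt results into that same folder, so a second run would also pick up the text files as input. A possible way around it, as a sketch only, is to list just the image extensions with _FileListToArrayRec. The extension list here is an assumption, so extend it to whatever formats you actually have; $sFolder and $aImages are illustrative names.

```autoit
#include <File.au3> ; also brings in the $FLTAR_* constants from FileConstants.au3

Local $sFolder = FileSelectFolder("Choose the folder with the images you want to OCR", @DesktopDir)
If @error Then Exit

; Assumed image extensions; several masks are separated with ";".
; Files only, no recursion into subfolders, no sorting, full paths returned.
Local $aImages = _FileListToArrayRec($sFolder, "*.jpg;*.jpeg;*.png;*.tif", $FLTAR_FILES, $FLTAR_NORECUR, $FLTAR_NOSORT, $FLTAR_FULLPATH)
If @error Then
    ConsoleWrite("No images found in " & $sFolder & @CRLF)
    Exit
EndIf

; $aImages[0] holds the count, just like the array from _FileListToArray.
For $i = 1 To $aImages[0]
    ConsoleWrite($aImages[$i] & @CRLF)
Next
```

The resulting array has the same layout as $imagelist, so it could be dropped into the loops above in its place.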

 

Apologies for any grammar errors. I'm sure that the great coders around here can improve on my example script, and I would love to see other approaches to learn from.

 

Greets from Barcelona

Edited by Qwerty212
