boogieoompa

API Call for Tesseract


Several years back someone wrote an API for Tesseract

As great as that UDF is, it is just a screen-scraping UDF (and I'm pretty sure it has memory leaks). I have very high-resolution images, which I hope will increase accuracy, and I would like to read them directly from a file. Effectively, my program will save a .tiff (or some other format; I will experiment to see which gives higher accuracy), run Tesseract on it, and read the .txt file that Tesseract creates.

I believe you can call Tesseract directly with the image path as a parameter. Tesseract then creates a .txt file in the same directory (with the same name). I looked over the UDF and this seems to be what it does, but I am unable to get it to run.
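For what it's worth, here is a minimal sketch of how that parameter string could be put together. It assumes the command line has the form "tesseract imagename outputbase" and that Tesseract appends ".txt" to the output base by itself. _OutputBaseFor is a made-up helper name for illustration, not part of any UDF.

```autoit
; Hypothetical helper: strip the final extension from an image path,
; e.g. "C:\OCRTEST.TIF" -> "C:\OCRTEST", so the text file lands next to the image.
Func _OutputBaseFor($sImagePath)
    ; Remove a trailing ".ext" (a dot followed by characters that are not dots or slashes)
    Return StringRegExpReplace($sImagePath, "\.[^.\\/]+$", "")
EndFunc   ;==>_OutputBaseFor

Local $sImage = "C:\OCRTEST.TIF"
; Quote each path separately so paths containing spaces stay intact
Local $sArgs = '"' & $sImage & '" "' & _OutputBaseFor($sImage) & '"'
ConsoleWrite($sArgs & @CRLF)
```

The resulting string is what would go into the second parameter of ShellExecuteWait.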

I have never really used parameters with executables like this and was hoping someone could lend me a hand figuring it out. I was hoping it would be as easy as...

ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", "C:\OCRTEST.TIF C:\OCRTEST")

But alas it is not.

Thanks!

So after more research I feel more confident that my initial assumption about the shell call was right. However, when I run this script, no text file is created. The UDF does create one, which implies Tesseract is up and running, but nothing is produced when I use my own shell command on my own file.

This was posted by JohnOne last year; I adjusted it slightly to fit my needs. I can tell that the Tesseract executable is being called, but the text file is never created.

Any suggestions?

$s_Image_InputFile = @ScriptDir & "\OCRTEST.tif"
$s_OCR_OutputFile = @ScriptDir & "\in"
$result = _TessOcr($s_Image_InputFile, $s_OCR_OutputFile)
MsgBox(0,"Result",$result)
Func _TessOcr($in_image, $out_file)
    Local $Read
    ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", '"' & $in_image & '" "' & $out_file & '"', Default, Default, @SW_HIDE)
    If @error Then
        MsgBox(0,"Error","ShellExecuteWait Error")
        Exit
    EndIf
    If FileExists($out_file & ".txt") Then
        $Read = FileRead($out_file & ".txt")
        FileDelete($out_file & ".txt")
    Else
        $Read = "No file created"
    EndIf
    Return $Read
EndFunc   ;==>_TessOcr


Thanks for the input, but in the time between posts I basically scrapped that old UDF. I downloaded the newest version of Tesseract, played around with it, and found some cool stuff with page formatting. Here is a snippet I used as a proof of concept.

;http://code.google.com/p/tesseract-ocr/wiki/FAQ
#include <Array.au3>
$s_Image_InputFile = "C:\temp\test.tif"
$s_OCR_OutputFile = "C:\temp\in" ; output base name; Tesseract appends ".txt" itself
$result = _TessOcr($s_Image_InputFile, $s_OCR_OutputFile)
; Drop carriage returns first so CRLF line endings do not produce empty elements
$array = StringSplit(StringStripCR($result), @LF)
_ArrayDisplay($array)
;MsgBox(0,"Result",$result)
Func _TessOcr($in_image, $out_file)
    Local $Read
    ; Quote only the two paths; each option is passed as its own unquoted token.
    ; -psm is the Tesseract 3.x spelling; 4.x and later expect --psm.
    ShellExecuteWait(@ProgramFilesDir & "\Tesseract-OCR\tesseract.exe", '"' & $in_image & '" "' & $out_file & '" -l eng -psm 6')
    If @error Then
        MsgBox(0,"Error","ShellExecuteWait Error")
        Exit
    EndIf
    If FileExists($out_file & ".txt") Then
        $Read = FileRead($out_file & ".txt")
        ;FileDelete($out_file & ".txt")
    Else
        $Read = "No file created"
    EndIf
    Return $Read
EndFunc   ;==>_TessOcr

Thanks for the reply!


Hi, Boogieoompa

Great idea to simplify calling Tesseract from AutoIt! It works perfectly with my sample texts in English. But I've encountered a strange problem while trying it with my Russian-language texts.

Sample input in Russian (Russian cubes for Tesseract are installed; the test image is a black-and-white TIF with no noise and a simple layout, just a couple of paragraphs) produces garbage with a lot of incorrectly recognized and/or capitalized symbols. The strange part is that if I recognize exactly the same image the usual way, i.e. running Tesseract from the command prompt (with the same parameters, -l rus and -psm 6, as in the AutoIt script), the result is nearly perfect. Any ideas how this could be explained?
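Two things might be worth checking here, though neither is confirmed in this thread. First, whether -l rus really reaches Tesseract as separate, unquoted tokens (if the language option is mangled, Tesseract may fall back to its default language). Second, whether the output file is decoded as UTF-8: Tesseract writes UTF-8 text without a BOM, and depending on the AutoIt version a plain FileRead may treat such a file as ANSI. Below is a rough sketch under those assumptions. _TessOcrUtf8 is a made-up name, and the install path and -psm 6 are simply carried over from the script above.

```autoit
#include <FileConstants.au3>

; Hypothetical helper: run Tesseract and return the recognized text,
; explicitly decoding the output file as UTF-8 without BOM.
Func _TessOcrUtf8($sImage, $sOutBase, $sLang = "rus")
    ; Only the two paths are quoted; each option is its own token
    Local $sArgs = '"' & $sImage & '" "' & $sOutBase & '" -l ' & $sLang & ' -psm 6'
    ShellExecuteWait(@ProgramFilesDir & "\Tesseract-OCR\tesseract.exe", $sArgs, Default, Default, @SW_HIDE)
    If @error Then Return SetError(1, 0, "")

    ; Tesseract appends ".txt" to the output base name
    Local $hFile = FileOpen($sOutBase & ".txt", $FO_READ + $FO_UTF8_NOBOM)
    If $hFile = -1 Then Return SetError(2, 0, "")
    Local $sText = FileRead($hFile)
    FileClose($hFile)
    Return $sText
EndFunc   ;==>_TessOcrUtf8
```

Comparing the raw .txt file in an editor that understands UTF-8 with what the script displays should show whether the recognition itself or only the reading step differs from the command-prompt run.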
