boogieoompa Posted February 22, 2013

Several years back someone wrote an API for Tesseract. As great as that UDF is, it is just a screen-scraping UDF (and I'm pretty sure it has memory leaks). I have very high-resolution images (which I hope will increase the accuracy) that I would like to read directly from a file. Effectively, my program will save a .tiff (or some other format; I will experiment to see which gives higher accuracy), run Tesseract, and read the .txt file that it creates.

I believe you can call Tesseract with the image path as a parameter. Tesseract then creates a .txt file in the same directory (with the same name). I looked over the UDF and this seems to be what it does, but I am unable to get it to run. I have never really used parameters with executables like this and was hoping someone could lend me a hand figuring it out. I was hoping it would be as easy as:

    ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", "C:\OCRTEST.TIF C:\OCRTEST")

But alas, it is not. Thanks!
boogieoompa (Author) Posted February 22, 2013

After more research I feel more confident that my initial assumption about the shell call was right. However, when I run this script no text file is created. The UDF does create one, which implies that Tesseract is installed and working, but nothing is produced when I use my own shell command on my own file.

The code below was posted by JohnOne last year; I adjusted it slightly to fit my needs. I can tell that the Tesseract executable is being called, but the text file is never created. Any suggestions?

    $s_Image_InputFile = @ScriptDir & "\OCRTEST.tif"
    $s_OCR_OutputFile = @ScriptDir & "\in"
    $result = _TessOcr($s_Image_InputFile, $s_OCR_OutputFile)
    MsgBox(0, "Result", $result)

    Func _TessOcr($in_image, $out_file)
        Local $Read
        ; Both paths are quoted so that spaces in them do not split the arguments.
        ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", '"' & $in_image & '" "' & $out_file & '"', Default, Default, @SW_HIDE)
        If @error Then
            MsgBox(0, "Error", "ShellExecuteWait Error")
            Exit
        EndIf
        ; Tesseract appends ".txt" to the output base name.
        If FileExists($out_file & ".txt") Then
            $Read = FileRead($out_file & ".txt")
            FileDelete($out_file & ".txt")
        Else
            $Read = "No file created"
        EndIf
        Return $Read
    EndFunc   ;==>_TessOcr
JohnOne Posted February 22, 2013

You might need #RequireAdmin to run in that location.

AutoIt Absolute Beginners Require a serial Pause Script Video Tutorials by Morthawt ipify
Monkey's are, like, natures humans.
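When the executable starts but no text file appears, Tesseract's own console messages may say more than the missing file does. Below is a minimal sketch, assuming the install path and file names from the post above. `_TessRunCapture` is a hypothetical helper, not part of the thread or of any UDF, and it swaps ShellExecuteWait for Run so that the console output can be read back.

```autoit
#include <Constants.au3> ; provides $STDERR_MERGED

; Hypothetical helper (not from the thread): runs a console program hidden
; and returns everything it wrote to STDOUT and STDERR.
Func _TessRunCapture($sExe, $sArgs)
    Local $iPid = Run('"' & $sExe & '" ' & $sArgs, "", @SW_HIDE, $STDERR_MERGED)
    If @error Then Return SetError(1, 0, "") ; the program could not be started
    Local $sOut = "", $sChunk
    While 1
        $sChunk = StdoutRead($iPid)
        If @error Then ExitLoop ; stream closed, nothing more to read
        $sOut &= $sChunk
        Sleep(10)
    WEnd
    Return SetError(0, 0, $sOut)
EndFunc   ;==>_TessRunCapture

; The paths below are assumptions taken from the earlier post.
Local $sExe = @ProgramFilesDir & "\tesseract\tesseract.exe"
Local $sArgs = '"' & @ScriptDir & '\OCRTEST.tif" "' & @ScriptDir & '\in"'
Local $sLog = _TessRunCapture($sExe, $sArgs)
If @error Then
    MsgBox(0, "Error", "Could not start " & $sExe)
Else
    MsgBox(0, "Tesseract output", $sLog)
EndIf
```

If the message box shows a complaint about the image or about a folder that cannot be written to, that narrows the problem down before trying #RequireAdmin.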
boogieoompa (Author) Posted February 25, 2013

Thanks for the input, but in the time between posts I basically scrapped that old UDF file. I downloaded the newest version of Tesseract, played around with it, and found some cool stuff with page formatting. Here is a snippet I used as a proof of concept.

    ;http://code.google.com/p/tesseract-ocr/wiki/FAQ
    #include <Array.au3>

    $s_Image_InputFile = "C:\temp\test.tif"
    $s_OCR_OutputFile = "C:\temp\in" ; output base name; Tesseract appends ".txt"
    $result = _TessOcr($s_Image_InputFile, $s_OCR_OutputFile)
    $array = StringSplit($result, @CRLF)
    _ArrayDisplay($array)
    ;MsgBox(0,"Result",$result)

    Func _TessOcr($in_image, $out_file)
        Local $Read
        ; Quote only the two paths; -l and -psm are passed as separate, unquoted tokens.
        ShellExecuteWait(@ProgramFilesDir & "\Tesseract-OCR\tesseract.exe", '"' & $in_image & '" "' & $out_file & '" -l eng -psm 6')
        If @error Then
            MsgBox(0, "Error", "ShellExecuteWait Error")
            Exit
        EndIf
        If FileExists($out_file & ".txt") Then
            $Read = FileRead($out_file & ".txt")
            ;FileDelete($out_file & ".txt")
        Else
            $Read = "No file created"
        EndIf
        Return $Read
    EndFunc   ;==>_TessOcr

Thanks for the reply!
pchun Posted October 7, 2013

Hi, Boogieoompa.

Great idea to simplify calling Tesseract from AutoIt! It works perfectly with my sample texts in English, but I've encountered a strange problem while trying it with my Russian texts. A sample input in Russian produces garbage with a lot of incorrectly recognized and/or capitalized symbols. (The Russian cubes for Tesseract are installed, and the test image is a black-and-white TIF with no noise and a simple layout, just a couple of paragraphs.)

The strange part is that if I recognize exactly the same image the usual way, i.e. by running Tesseract from the command prompt with the same parameters as in the AutoIt script (-l rus and -psm 6), the result is nearly perfect. Any ideas how this could be explained?
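One thing worth ruling out when the script and the command prompt disagree is the argument string itself. `-l`, `rus`, `-psm` and `6` need to reach tesseract.exe as separate tokens, because anything wrapped in one pair of double quotes (such as "-l rus") is passed as a single argument. Whether that is the cause here is only a guess. Below is a minimal sketch, with `_TessBuildArgs` as a hypothetical helper that is not from the thread; printing its result makes it easy to compare with what was typed at the command prompt.

```autoit
; Hypothetical helper (not from the thread): builds the Tesseract argument string.
; Only the two paths are quoted; every option and its value stay separate tokens.
Func _TessBuildArgs($sImage, $sOutBase, $sLang = "eng", $iPsm = 6)
    Return '"' & $sImage & '" "' & $sOutBase & '" -l ' & $sLang & ' -psm ' & $iPsm
EndFunc   ;==>_TessBuildArgs

; Example paths are assumptions; adjust them to the real image and output location.
Local $sArgs = _TessBuildArgs("C:\temp\test.tif", "C:\temp\in", "rus", 6)
ConsoleWrite($sArgs & @CRLF)
; prints: "C:\temp\test.tif" "C:\temp\in" -l rus -psm 6
```

The printed string can be passed as the second parameter of ShellExecuteWait, or pasted after tesseract.exe at the command prompt to check that both routes give the same result.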
JohnOne Posted October 8, 2013

Free bump. Is there an official standalone Tesseract.exe that can be used in the fashion above, or only an installer?