dm83737 Posted May 29, 2009 (edited)

Maybe this is a known issue that I could not find in any forum posts, but I figure I had better check to make sure. When I run Tesseract to return the text of a Java window, it brings back the text (with a few issues, but that is normal according to what I have been reading about the various OCR solutions available so far). The issue comes when I send the result to a Notepad window just to see how it will look: there are 4 carriage returns on the end of it:

REJECTED¶
¶
¶
¶

I am not sure why this is happening, and as mentioned I could not find anything about it in any other post. I have tried different scales and many combinations of the other settings, to no avail. Any help will be welcome!

The code comes from the example script supplied with SimpleTesseract:

    #include <SimpleTesseract.au3>
    #include <ClipBoard.au3>

    Global $sRead

    ; Clears the Clipboard
    _ClipBoard_Open(0)
    _ClipBoard_Empty()
    _ClipBoard_Close()
    Sleep(1000)

    $tempfile = "test.tif"
    If FileExists($tempfile) Then
        FileDelete($tempfile)
    EndIf

    $sRead = _TesseractScreenCapture(0, "", 1, 5, 4, 428, 53, 441, 1)

    ; Opens Notepad
    Run("notepad.exe")
    WinWaitActive("Untitled - Notepad")
    $Note = WinGetHandle("Untitled - Notepad")
    Send($sRead)

Edited June 10, 2009 by dm83737
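One way to see exactly which characters sit at the end of the OCR result is to print their character codes. The following is only a minimal sketch: the sample string is a hypothetical stand-in for the value returned by _TesseractScreenCapture(), and the assumption that the trailing characters are CR/LF pairs is a guess, not something the post confirms.

```autoit
; Hypothetical sample standing in for the OCR result; in the real script
; this value would come from _TesseractScreenCapture().
Global $sRead = "REJECTED" & @CRLF & @CRLF

; Take the last few characters and list their character codes.
Global $sTail = StringRight($sRead, 6)
Global $sCodes = ""
For $i = 1 To StringLen($sTail)
    $sCodes &= Asc(StringMid($sTail, $i, 1)) & " "
Next

; Code 13 is a carriage return and code 10 is a line feed.
ConsoleWrite("Trailing character codes: " & $sCodes & @CRLF)
```

Knowing whether the tail is made of carriage returns, line feeds, or spaces helps when choosing how to strip it.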
dm83737 Posted June 10, 2009 Author (edited)

Just in case anyone actually cares about this, I thought I would share what I have found to be the solution. I used a version of the SimpleTesseract.au3 script that already has the color-changing properties added (it turns the scanned items black and white for clearer OCR reading) and added the StringStripWS() function to strip all whitespace, including the carriage returns, from the Tesseract output. Basically, this is all I needed:

    #include <SimpleTesseractColor.au3> ; <------- Customized SimpleTesseract.au3 with color info added
    $iRead = _TesseractScreenCapture(0, "", 1, 6, 690, 600, 762, 616, 1, 185)
    $iReadClean = StringStripWS($iRead, 8)

$iReadClean gives the result without carriage returns and other unnecessary spacers. The exact description of the StringStripWS() function from the help file is:

StringStripWS
--------------------------------------------------------------------------------
Strips the white space in a string.

StringStripWS ( "string", flag )

Parameters
string    The string to strip.
flag      Flag to indicate the type of stripping that should be performed (add the flags together for multiple operations):
          1 = strip leading white space
          2 = strip trailing white space
          4 = strip double (or more) spaces between words
          8 = strip all spaces (over-rides all other flags)

Return Value
Returns the new string stripped of the requested white space.

Remarks
Whitespace includes Chr(9) thru Chr(13) which are HorizontalTab, LineFeed, VerticalTab, FormFeed, and CarriageReturn. Whitespace also includes the null string ( Chr(0) ) and the standard space ( Chr(32) ). To strip single spaces between words, use the function StringReplace.

My script looks like this:

    ; -------------------------------------------------------------------------
    ;
    ; AutoIt Version: 3.3.0.0
    ; Author: Dan Maxwell
    ;
    ; Script Function:
    ;   Shows what the screen capture looks like as well as what Tesseract OCR
    ;   reads for a specific set of coordinates
    ; Time Needed: -
    ;
    ; -------------------------------------------------------------------------

    ; -------------------------------------------------------------------------
    ; DECLARATIONS / VARIABLES

    ; Includes
    #include <SimpleTesseractColor.au3>

    ; Options
    Opt("WinTitleMatchMode", 2) ; Matches partial window titles (i.e. "Untitled" or "pad" for "Untitled - Notepad")

    ; Variable List
    ; NOTE: $Admin is never assigned in this listing; set it to the title or
    ; handle of The ADMINISTRATOR window before running the script.
    Global $Admin, $Note
    Global $iRead, $iReadClean
    ; -------------------------------------------------------------------------

    ; -------------------------------------------------------------------------
    ; SCRIPT

    ; Opens Notepad
    Run("notepad.exe")
    WinWaitActive("Untitled - Notepad")
    $Note = WinGetHandle("Untitled - Notepad")

    ; Stops script to allow test condition to be met
    ; (i.e. getting The ADMINISTRATOR to show the text you want to scan)
    MsgBox(0, "", "")

    ; Activates The ADMINISTRATOR application window
    WinActivate($Admin)
    WinMove($Admin, "", 0, 0)

    ; Scans and cleans the image/text from The ADMINISTRATOR
    $iRead = _TesseractScreenCapture(0, "", 1, 6, 690, 600, 762, 616, 1, 185)
    $iReadClean = StringStripWS($iRead, 8)

    ; Places what it read in Notepad for reference
    WinActivate($Note)
    Send($iReadClean)

    ; Exits the script
    Exit

I am also attaching the SimpleTesseractColor script. It is a modified version of the original SimpleTesseract.au3 script with the color portion added, plus the Melba23 component, which looks for luminosity instead of specific colors.

Attached file: SimpleTesseractColor.au3

Edited June 10, 2009 by dm83737
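The help text quoted above says the flags can be added together. A minimal sketch of what that looks like, using a made-up sample string (not real Tesseract output) with leading spaces, a run of spaces between words, and trailing line breaks:

```autoit
; Hypothetical OCR-like text: leading spaces, a run of three spaces
; between the words, and trailing line breaks.
Global $sRaw = "  ORDER   REJECTED" & @CRLF & @CRLF

; 1 + 2 = 3: strip leading and trailing whitespace only.
Global $sTrimmed = StringStripWS($sRaw, 3)

; 1 + 2 + 4 = 7: also reduce runs of spaces between words.
Global $sCollapsed = StringStripWS($sRaw, 7)

; Brackets make any leftover whitespace visible in the console.
ConsoleWrite("[" & $sTrimmed & "]" & @CRLF)
ConsoleWrite("[" & $sCollapsed & "]" & @CRLF)
```

Flag 8 is not combined here because, per the help text, it overrides the other flags.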
longfields Posted June 21, 2009 (edited)

Many thanks for this update to SimpleTesseract - at last, this is delivering reasonable accuracy! Please note that:

    $iReadClean = StringStripWS($iRead, 2)

will remove the trailing carriage returns and leave the rest of the text intact.

Edited June 21, 2009 by longfields
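The difference between flag 8 and flag 2 only shows up when the scanned text has more than one word. A small sketch, assuming a hypothetical two-word OCR result followed by line breaks:

```autoit
; Hypothetical two-word OCR result with trailing line breaks.
Global $sRead = "ORDER REJECTED" & @CRLF & @CRLF

; Flag 8 strips every whitespace character, including the space between words.
Global $sAllStripped = StringStripWS($sRead, 8)

; Flag 2 strips only the trailing whitespace and keeps the inner space.
Global $sTrailingStripped = StringStripWS($sRead, 2)

ConsoleWrite("[" & $sAllStripped & "]" & @CRLF)
ConsoleWrite("[" & $sTrailingStripped & "]" & @CRLF)
```

For a single-word field such as REJECTED the two flags give the same result, which may be why flag 8 worked in the earlier script.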
inter Posted January 18, 2011

Apologies for bumping an old thread, but I have to thank you for this as I ran into the exact same problem and this solved things for me.