Tesseract OCR oddity [SOLVED]


This may be a known issue that I could not find in any forum posts, but I figured I had better check.

When I run Tesseract to return the text of a Java window, it brings back the text (with a few errors, but that seems normal from what I have been reading about the various OCR solutions available so far). The problem appears when I send the result to a Notepad window just to see how it looks: there are 4 carriage returns on the end of it:

REJECTED¶

I am not sure why this is happening and, as mentioned, could not find anything about it in other posts. I have tried different scales and many combinations of the other settings, to no avail.

Any help will be welcome!

The code comes from the example script that ships with SimpleTesseract:

#include <SimpleTesseract.au3>
#include <ClipBoard.au3>

Global $sRead

; Clears the Clipboard
_ClipBoard_Open(0)
_ClipBoard_Empty()
_ClipBoard_Close()

sleep(1000)
$tempfile = "test.tif"

if FileExists($tempfile) Then
    FileDelete($tempfile)
EndIf

$sRead = _TesseractScreenCapture(0, "", 1, 5, 4, 428, 53, 441, 1)

; Opens Notepad
Run("notepad.exe")
WinWaitActive("Untitled - Notepad")
$Note = WinGetHandle("Untitled - Notepad")
Send($sRead)
Edited by dm83737

  • 2 weeks later...

In case anyone else runs into this, here is the solution I found.

I used a version of the SimpleTesseract.au3 script that already has the color-changing properties added (it turns the scanned items black and white for clearer OCR reading) and added the StringStripWS() function to strip the white space out of the Tesseract output.

Basically, this is all I needed:

#include <SimpleTesseractColor.au3> ; <------- Customized SimpleTesseract.au3 with color info added
$iRead = _TesseractScreenCapture(0, "", 1, 6, 690, 600, 762, 616, 1, 185)
$iReadClean = StringStripWS($iRead, 8)

$iReadClean gives the result without carriage returns and other unnecessary white space (with flag 8, that includes any spaces between words). The exact description of the StringStripWS() function from the help file is:

StringStripWS

--------------------------------------------------------------------------------

Strips the white space in a string.

StringStripWS ( "string", flag )

Parameters

string The string to strip.

flag Flag to indicate the type of stripping that should be performed (add the flags together for multiple operations):

1 = strip leading white space

2 = strip trailing white space

4 = strip double (or more) spaces between words

8 = strip all spaces (over-rides all other flags)

Return Value

Returns the new string stripped of the requested white space.

Remarks

Whitespace includes Chr(9) thru Chr(13) which are HorizontalTab, LineFeed, VerticalTab, FormFeed, and CarriageReturn. Whitespace also includes the null string ( Chr(0) ) and the standard space ( Chr(32) ).

To strip single spaces between words, use the function StringReplace.
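As a quick illustration of how the flags in the help text behave, here is a minimal sketch. The sample string is a made-up stand-in for Tesseract output, not something from my actual scan, and the brackets are only there to make the stripped edges visible in the console.

```autoit
; Hypothetical sample standing in for Tesseract output:
; two leading spaces, spaces between words, and two trailing CR/LF pairs
Local $sSample = "  ORDER 123 REJECTED" & @CRLF & @CRLF

; 1 + 2 = 3: strip leading and trailing white space, keep the spaces between words
ConsoleWrite("[" & StringStripWS($sSample, 3) & "]" & @CRLF)

; 8: strip all white space, including the spaces between words
ConsoleWrite("[" & StringStripWS($sSample, 8) & "]" & @CRLF)
```

With this sample, the first call should leave "ORDER 123 REJECTED" and the second "ORDER123REJECTED", so flag 8 only suits output where the spaces between words do not matter.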

My script looks like this:

; -------------------------------------------------------------------------------------------------------------------------------------------
;
;   AutoIt Version: 3.3.0.0
;   Author:      Dan Maxwell
;
;   Script Function:
;   Shows what the screen capture looks like as well as what Tesseract OCR reads for a specific set of coordinates
;   Time Needed: -
;
; -------------------------------------------------------------------------------------------------------------------------------------------

; -------------------------------------------------------------------------------------------------------------------------------------------
; DECLARATIONS / VARIABLES

; Includes
#include <SimpleTesseractColor.au3>

; Options
Opt("WinTitleMatchMode", 2) ; Matches partial window titles (e.g. "Untitled" or "pad" for "Untitled - Notepad")

; Variable List
Global $Admin, $Note
Global $iRead, $iReadClean

; -------------------------------------------------------------------------------------------------------------------------------------------

; -------------------------------------------------------------------------------------------------------------------------------------------
; SCRIPT

; Opens Notepad
Run("notepad.exe")
WinWaitActive("Untitled - Notepad")
$Note = WinGetHandle("Untitled - Notepad")

; Stops script to allow test condition to be met
;  (i.e. - Getting The ADMINISTRATOR to show the text you want to scan)
MsgBox(0, "", "")

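; Activates The ADMINISTRATOR application window
; NOTE: $Admin was never assigned in the script as posted. The partial title below is an
;       assumed placeholder; change it to match your application's window title.
$Admin = WinGetHandle("ADMINISTRATOR")
WinActivate($Admin)
WinMove($Admin, "", 0, 0)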

; Scans and cleans the image/text from The ADMINISTRATOR
$iRead = _TesseractScreenCapture(0, "", 1, 6, 690, 600, 762, 616, 1, 185)
$iReadClean = StringStripWS($iRead, 8)

; Places what it read in Notepad for reference
WinActivate($Note)
Send($iReadClean)

; Exits the script
Exit

I am also attaching the SimpleTesseractColor script. It is a modified version of the original SimpleTesseract.au3 that adds the color portion as well as the Melba23 component, which looks at luminosity instead of specific colors.

SimpleTesseractColor.au3

Edited by dm83737

  • 2 weeks later...

Many thanks for this update to SimpleTesseract - at last, this is delivering reasonable accuracy!

Please note that:

$iReadClean = StringStripWS($iRead, 2)

Will remove the trailing carriage returns and leave the rest of the text intact.
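One related point when sending the cleaned text to Notepad: by default Send() treats characters such as !, +, ^, # and { as special keys, so OCR output that happens to contain them may not be typed literally. A minimal sketch, assuming a made-up sample string in place of the real Tesseract result, combines flag 2 with the raw mode of Send() (second parameter set to 1):

```autoit
; Hypothetical stand-in for the string returned by _TesseractScreenCapture()
Local $sRead = "REJECTED #42!" & @CRLF & @CRLF

; Flag 2 strips only the trailing white space; the space between words is kept
Local $sClean = StringStripWS($sRead, 2)

; Opens Notepad and types the text
Run("notepad.exe")
WinWaitActive("Untitled - Notepad")

; Second parameter 1 = raw mode, so "#" and "!" are typed as-is
; instead of being interpreted as the Win and Alt modifier keys
Send($sClean, 1)
```

Whether raw mode is needed depends on what the scanned text can contain; for plain letters and digits the default Send() behaves the same.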

Edited by longfields
