
OCR a webpage



I don't know where to start looking for how to do this. Basically, I want to enter a website address and run OCR on the whole page. It would have to scroll down the page as well.

I've experimented with the _IE functions, which can get me the page text, but they don't get the text inside the pictures.

Any ideas where I can start looking? Thanks


In the Tesseract UDF there is a function called CaptureToTIFF(). Mix that with Ward's Web Screenshot UDF to create a TIFF image of a webpage.

Then copy the _TesseractScreenCapture() function and look where CaptureToTIFF() kicks in. Replace the capture part with the download part as described above and tweak the rest as needed.


Getting the following errors in my tesseract txt doc in my output dir:

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Documents\Scriptsz\test.tif

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/E:\Documents

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/and

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Settings\My

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Documents\Scriptsz\test

Could not open file, Settings\My


And where is the code producing this error :unsure: ?

You are making me feel dumb :> The problem with those errors is that those folders don't even exist in the Tesseract program files directory.

I grabbed his ScreenCapture function and changed the following (I'm on a different computer now):

$capture_filename = ("C:\Documents and Settings\ewh2qxk\My Documents\Scripts\Test1.tif")

$ocr_filename = StringLeft($capture_filename, StringLen($capture_filename) - 4)

$ocr_filename_and_ext = $ocr_filename & ".txt"

;CaptureToTIFF("", "", "", $capture_filename, $scale, $left_indent, $top_indent, $right_indent, $bottom_indent)

ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", $capture_filename & " " & $ocr_filename)

And called it.

$text1 = TSC(0,"",0,"",1,2,1,2,1)
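A likely cause of the read_variables_file errors above is that the path contains spaces and is passed to tesseract.exe unquoted, so each space-separated fragment arrives as its own argument and the extra ones are looked up as config files. The following is a minimal sketch of the same call with both paths quoted. It assumes Tesseract is installed under @ProgramFilesDir & "\tesseract" as in the post; the FileExists/FileRead check at the end is only an illustration, not part of the original UDF.

```autoit
; Sketch: run Tesseract on an existing TIFF and read back the recognized text.
; Assumption: tesseract.exe lives in @ProgramFilesDir & "\tesseract", as in the post above.
Local $capture_filename = "C:\Documents and Settings\ewh2qxk\My Documents\Scripts\Test1.tif"

; Strip the ".tif" extension to get the output base name (Tesseract appends ".txt").
Local $ocr_filename = StringLeft($capture_filename, StringLen($capture_filename) - 4)
Local $ocr_filename_and_ext = $ocr_filename & ".txt"

; Wrap each path in double quotes so it reaches Tesseract as a single argument.
Local $params = '"' & $capture_filename & '" "' & $ocr_filename & '"'
ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", $params, "", "", @SW_HIDE)

; Illustrative check only: print the OCR result if Tesseract produced one.
If FileExists($ocr_filename_and_ext) Then
    ConsoleWrite(FileRead($ocr_filename_and_ext) & @CRLF)
Else
    ConsoleWrite("No OCR output was produced; check the Tesseract log." & @CRLF)
EndIf
```

Moving the image to a path without spaces should sidestep the same problem.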


Tesseract Open Source OCR Engine

read_tif_image:Error:Illegal image format:Compression

Tessedit:Error:Read of file failed:C:\Test1.tif

Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3

I'm getting that now. I'll have to look into somehow getting a different picture.


Sweet, I got it working, sort of. It really has a problem doing the OCR on the TIFF: very few of the values are correct, and it's just basic black-on-white text.

Do you think it's possible that, because this TIFF is so big, Tesseract is scaling it down before trying to do the OCR?


I think Tesseract is working with what you're feeding it. At the top of the UDF there's this comment:

If text accuracy is still low, increase the $scale parameter.  In general, the higher
the scale the clearer the font and the more accurate the text recognition.

There's a $scale parameter in the CaptureToTIFF() function I previously mentioned. It defaults to 1 in the function header (though it defaults to 2 in the respective calls :unsure: ). Try setting it to 2 or 3.

Additionally there's this comment:

Use the default values for first time use.  If the text recognition accuracy is low,
I suggest setting $show_capture to 1 and rerunning.  If the screenshot of the
window or control includes borders or erroneous pixels that may interfere with
the text recognition process, then use $left_indent, $top_indent, $right_indent and
$bottom_indent to adjust the portion of the screen being captured, to
exclude these non-textural elements.

If feasible I would recommend cropping the output picture for better recognition results too.
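As a rough sketch of the advice above, a call with a higher $scale and with $show_capture enabled might look like the following. The include file name (Tesseract.au3) and the parameter order are assumptions inferred from the parameter names in the UDF comments and the nine-argument call earlier in the thread, so check them against the Func header in your copy of the UDF before relying on this.

```autoit
; Assumption: the UDF file is named Tesseract.au3 and is on the include path.
#include <Tesseract.au3>

; Assumed parameter order; verify against the Func header in your copy of the UDF:
; _TesseractScreenCapture($get_last_capture, $delimiter, $cleanup, $scale,
;     $left_indent, $top_indent, $right_indent, $bottom_indent, $show_capture)
Local $scale = 3          ; larger than the default, to enlarge the glyphs
Local $show_capture = 1   ; show the captured image so the indents can be tuned

Local $text = _TesseractScreenCapture(0, "", 1, $scale, 0, 0, 0, 0, $show_capture)
ConsoleWrite($text & @CRLF)
```

Once the capture preview shows borders or other non-text pixels, the four indent arguments can be raised from 0 to crop them out.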



Yeah, I'm going to try to capture just the element I specifically need from the website. As for the $scale parameter in CaptureToTIFF(), I wasn't using that functionality at all; I only used the function to save the TIFF in the correct format.
