OCR a webpage

HSBen · April 21, 2011

I don't know where to start looking on how to do this. I am basically looking to put in a website address and do OCR on the whole page. It would have to scroll down the page itself as well.

I've messed with the _IE functions, which can get me the page text, but they don't get the text inside the pictures.

Any ideas where I can start looking? Thanks

KaFu · April 21, 2011

A combo of

Ward's excellent Web Screenshot UDF

and

Seangriffin's Tesseract (Screen OCR) UDF

should fit your needs.

HSBen · April 24, 2011

How do I get Tesseract to OCR the TIF file I created? All I can see is that it can do OCR on a screen capture.

HSBen · April 25, 2011

Anyone? I can't seem to find anything on this.

KaFu · April 25, 2011

In the Tesseract UDF there is a function called CaptureToTIFF(). Mix that with Ward's Web Screenshot UDF to create a TIFF image of a webpage.

Then copy the _TesseractScreenCapture() function and look where CaptureToTIFF() kicks in. Replace the capture part with the download part as described above and tweak the rest as needed.

HSBen · April 25, 2011

Sweet, I'll take a look at it. I didn't think about grabbing part of his code; instead I was trying to use its functions. Thanks

HSBen · April 26, 2011

I'm getting the following errors in the Tesseract txt file in my output directory:

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Documents\Scriptsz\test.tif

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/E:\Documents

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/and

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Settings\My

read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Documents\Scriptsz\test

Could not open file, Settings\My

KaFu · April 26, 2011

And where is the code producing this error?

HSBen · April 26, 2011

You're making me feel dumb. The problem with those errors is that those folders don't even exist in the Tesseract Program Files directory.

I grabbed his ScreenCapture function and changed the following (I'm on a different computer now):

$capture_filename = ("C:\Documents and Settings\ewh2qxk\My Documents\Scripts\Test1.tif")

$ocr_filename = StringLeft($capture_filename, StringLen($capture_filename) - 4)

$ocr_filename_and_ext = $ocr_filename & ".txt"

;CaptureToTIFF("", "", "", $capture_filename, $scale, $left_indent, $top_indent, $right_indent, $bottom_indent)

ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", $capture_filename & " " & $ocr_filename)

And called it.

$text1 = TSC(0,"",0,"",1,2,1,2,1)
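The errors above break the path apart at its spaces ("E:\Documents", "and", "Settings\My"), which is consistent with tesseract.exe receiving the unquoted path as several separate arguments. A minimal sketch of one possible fix, assuming the same install location and the file path from the post: wrap each path in double quotes before passing it on. The $params variable and the FileExists() check are illustrative additions, not part of the UDF.

```autoit
; Sketch only: quote both paths so tesseract.exe sees each one as a single argument.
; The path is the one from the post above; adjust it to your own file.
Local $capture_filename = "C:\Documents and Settings\ewh2qxk\My Documents\Scripts\Test1.tif"

; Tesseract appends ".txt" to the output base name itself, so only strip ".tif" here.
Local $ocr_filename = StringTrimRight($capture_filename, 4)
Local $ocr_filename_and_ext = $ocr_filename & ".txt"

; Hypothetical helper variable: the two quoted arguments separated by a space.
Local $params = '"' & $capture_filename & '" "' & $ocr_filename & '"'
ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", $params, "", "", @SW_HIDE)

; Print the recognized text if Tesseract produced an output file.
If FileExists($ocr_filename_and_ext) Then
    ConsoleWrite(FileRead($ocr_filename_and_ext) & @CRLF)
Else
    ConsoleWrite("No OCR output was written." & @CRLF)
EndIf
```

Moving the image to a path without spaces would sidestep the problem as well, but quoting keeps the script working wherever the file lives.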

HSBen · April 26, 2011

Tesseract Open Source OCR Engine

read_tif_image:Error:Illegal image format:Compression

Tessedit:Error:Read of file failed:C:\Test1.tif

Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3

I'm getting that now. I'll have to look into somehow getting the picture in a different format.

KaFu · April 26, 2011

Take a look at the CaptureToTIFF() function in the original UDF for how to create a TIFF in the right format.
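For reference, a minimal sketch of re-saving an existing image as an uncompressed TIFF with the GDIPlus UDF that ships with AutoIt. This is not the code from CaptureToTIFF() itself. It assumes the "Illegal image format:Compression" error means this Tesseract build only accepts uncompressed TIFFs; the file paths and the 24-bit color depth are assumptions for illustration.

```autoit
#include <GDIPlus.au3>

; Hypothetical paths for illustration only.
Local $sSource = "C:\Test1.png"
Local $sTarget = "C:\Test1.tif"

_GDIPlus_Startup()
Local $hImage = _GDIPlus_ImageLoadFromFile($sSource)

; Encoder parameters: 24-bit color depth (an assumption) and no compression.
Local $tData = DllStructCreate("int ColorDepth;int Compression")
DllStructSetData($tData, "ColorDepth", 24)
DllStructSetData($tData, "Compression", $GDIP_EVTCOMPRESSIONNONE)

Local $tParams = _GDIPlus_ParamInit(2)
_GDIPlus_ParamAdd($tParams, $GDIP_EPGCOLORDEPTH, 1, $GDIP_EPTLONG, DllStructGetPtr($tData, "ColorDepth"))
_GDIPlus_ParamAdd($tParams, $GDIP_EPGCOMPRESSION, 1, $GDIP_EPTLONG, DllStructGetPtr($tData, "Compression"))

; Save with the TIFF encoder and the parameters above, then clean up.
Local $sCLSID = _GDIPlus_EncodersGetCLSID("TIF")
_GDIPlus_ImageSaveToFileEx($hImage, $sTarget, $sCLSID, DllStructGetPtr($tParams))

_GDIPlus_ImageDispose($hImage)
_GDIPlus_Shutdown()
```

The resulting file can then be passed to tesseract.exe in place of the compressed one.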

HSBen · April 27, 2011

Sweet, I got it working, sort of. It really has a problem doing the OCR on the TIF: very few of the values are correct, and it's just basic black-on-white text.

Do you think it's possible that, since this TIF is so big, Tesseract is scaling it down and then trying to do the OCR?

KaFu · April 28, 2011

I think Tesseract is working with what you're feeding it. At the top of the UDF there's this comment:

If text accuracy is still low, increase the $scale parameter.  In general, the higher
the scale the clearer the font and the more accurate the text recognition.

There's a $scale parameter in the CaptureToTIFF() function I previously mentioned. It defaults to 1 in the function header (though it's set to 2 in the respective calls). Try setting it to 2 or 3.

Additionally there's this comment:

Use the default values for first time use.  If the text recognition accuracy is low,
I suggest setting $show_capture to 1 and rerunning.  If the screenshot of the
window or control includes borders or erroneous pixels that may interfere with
the text recognition process, then use $left_indent, $top_indent, $right_indent and
$bottom_indent to adjust the portion of the screen being captured, to
exclude these non-textural elements.

If feasible, I would also recommend cropping the output picture for better recognition results.

HSBen · April 28, 2011

Yeah, I am going to work on getting just the element that I specifically need off the website. As for the $scale parameter in CaptureToTIFF(), I wasn't even using that functionality, because all I used the function for was saving the TIFF in the correct format.
