HSBen Posted April 21, 2011 I don't know where to start looking for how to do this. I basically want to enter a website address and run OCR on the whole page. It would have to scroll down the page as well. I've tried the _IE functions, which can get me the page text, but they don't get the text inside the pictures. Any ideas where I can start looking? Thanks
KaFu Posted April 21, 2011 A combo of Ward's excellent Web Screenshot UDF and Seangriffin's Tesseract (Screen OCR) UDF should fit your needs ... OS: Win10-22H2 - 64bit - German, AutoIt Version: 3.3.16.1, AutoIt Editor: SciTE, Website: https://funk.eu AMT - Auto-Movie-Thumbnailer (2024-Oct-13) BIC - Batch-Image-Cropper (2023-Apr-01) COP - Color Picker (2009-May-21) DCS - Dynamic Cursor Selector (2024-Oct-13) HMW - Hide my Windows (2024-Oct-19) HRC - HotKey Resolution Changer (2012-May-16) ICU - Icon Configuration Utility (2018-Sep-16) SMF - Search my Files (2025-May-18) - THE file info and duplicates search tool SSD - Set Sound Device (2017-Sep-16)
HSBen Posted April 24, 2011 How do I get Tesseract to OCR the TIF file I created? All I can see is that it runs OCR on a screen capture.
HSBen Posted April 25, 2011 Anyone? I can't seem to find anything on this.
KaFu Posted April 25, 2011 In the Tesseract UDF there is a function called CaptureToTIFF(). Combine that with Ward's Web Screenshot UDF to create a TIFF image of a webpage. Then copy the _TesseractScreenCapture() function and look at where CaptureToTIFF() kicks in. Replace the capture part with the download part as described above and tweak the rest as needed.
HSBen Posted April 25, 2011 Sweet, I'll take a look at it. I didn't think about grabbing part of his code; instead I was trying to use its functions. Thanks
HSBen Posted April 26, 2011 Getting the following errors in my Tesseract txt doc in my output dir:
    read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Documents\Scriptsz\test.tif
    read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/E:\Documents
    read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/and
    read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Settings\My
    read_variables_file:Can't open C:/Program Files/tesseract/tessdata/configs/Documents\Scriptsz\test
    Could not open file, Settings\My
KaFu Posted April 26, 2011 And where is the code producing this error?
HSBen Posted April 26, 2011 KaFu said: "And where is the code producing this error?"
You are making me feel stupid. The problem with those errors is that those folders don't even exist in the Tesseract program files directory. I grabbed his ScreenCapture function and changed the following (on a different computer now):
    $capture_filename = ("C:\Documents and Settings\ewh2qxk\My Documents\Scripts\Test1.tif")
    $ocr_filename = StringLeft($capture_filename, StringLen($capture_filename) - 4)
    $ocr_filename_and_ext = $ocr_filename & ".txt"
    ;CaptureToTIFF("", "", "", $capture_filename, $scale, $left_indent, $top_indent, $right_indent, $bottom_indent)
    ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", $capture_filename & " " & $ocr_filename)
And called it:
    $text1 = TSC(0,"",0,"",1,2,1,2,1)
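The error list in the earlier post breaks the path apart at every space ("E:\Documents", "and", "Settings\My", ...), which suggests tesseract.exe is receiving the unquoted path as several separate arguments and treating the extra ones as config file names. A minimal sketch of one possible fix, assuming that is the cause: wrap each path in double quotes before handing it to ShellExecuteWait(). The variable $params and the final FileRead() are illustrative additions, not part of the original UDF.

```autoit
; Sketch only: quote both paths so tesseract.exe sees exactly two arguments.
Local $capture_filename = "C:\Documents and Settings\ewh2qxk\My Documents\Scripts\Test1.tif"

; Drop the ".tif" extension; Tesseract appends ".txt" to the output base name itself.
Local $ocr_filename = StringTrimRight($capture_filename, 4)

; $params is a hypothetical helper variable holding the quoted argument string.
Local $params = '"' & $capture_filename & '" "' & $ocr_filename & '"'
ShellExecuteWait(@ProgramFilesDir & "\tesseract\tesseract.exe", $params)

; Read back whatever Tesseract wrote.
Local $text = FileRead($ocr_filename & ".txt")
```

Saving the image to a path without spaces (for example C:\Test1.tif) would sidestep the problem as well.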
HSBen Posted April 26, 2011
    Tesseract Open Source OCR Engine
    read_tif_image:Error:Illegal image format:Compression
    Tessedit:Error:Read of file failed:C:\Test1.tif
    Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
Getting that now. I'll have to look into getting a different picture somehow.
KaFu Posted April 26, 2011 Take a look at the CaptureToTIFF() function in the original UDF for how to create a TIFF in the right format.
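The "Illegal image format:Compression" message suggests this Tesseract build cannot read a compressed TIFF. The thread does not show what CaptureToTIFF() does internally, so the following is only a sketch of one way to re-save an existing image as an uncompressed TIFF with AutoIt's standard GDIPlus.au3 UDF. Both file paths are hypothetical, and the UDF may apply further settings (such as color depth) that are not reproduced here.

```autoit
#include <GDIPlus.au3>

; Hypothetical paths, for illustration only.
Local $sSource = "C:\webpage.png"
Local $sTarget = "C:\Test1.tif"

_GDIPlus_Startup()
Local $hImage = _GDIPlus_ImageLoadFromFile($sSource)

; Build an encoder parameter list that asks for no compression.
Local $tData = DllStructCreate("int Compression")
DllStructSetData($tData, "Compression", $GDIP_EVTCOMPRESSIONNONE)
Local $tParams = _GDIPlus_ParamInit(1)
_GDIPlus_ParamAdd($tParams, $GDIP_EPGCOMPRESSION, 1, $GDIP_EPTLONG, DllStructGetPtr($tData))

; Save through the TIFF encoder using those parameters.
Local $sCLSID = _GDIPlus_EncodersGetCLSID("TIF")
_GDIPlus_ImageSaveToFileEx($hImage, $sTarget, $sCLSID, DllStructGetPtr($tParams))

_GDIPlus_ImageDispose($hImage)
_GDIPlus_Shutdown()
```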
HSBen Posted April 27, 2011 Sweet, I got it working, sort of. It really has a problem doing the OCR on the TIF: very few of the values are correct, and it's just basic black-on-white text. Do you think it's possible that, since this TIF is so big, Tesseract is scaling it down before trying to do the OCR?
KaFu Posted April 28, 2011 I think Tesseract is working with what you're feeding it. At the top of the UDF there's this comment: "If text accuracy is still low, increase the $scale parameter. In general, the higher the scale the clearer the font and the more accurate the text recognition." There's a $scale parameter in the CaptureToTIFF() function I previously mentioned. It defaults to 1 in the function header (though the functions that call it default it to 2). Try setting it to 2 or 3. Additionally there's this comment: "Use the default values for first time use. If the text recognition accuracy is low, I suggest setting $show_capture to 1 and rerunning. If the screenshot of the window or control includes borders or erroneous pixels that may interfere with the text recognition process, then use $left_indent, $top_indent, $right_indent and $bottom_indent to adjust the portion of the screen being captured, to exclude these non-textural elements." If feasible, I would also recommend cropping the output picture for better recognition results.
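As a rough illustration of the $scale advice, a call might look like the sketch below. The argument order is copied from the CaptureToTIFF() call quoted earlier in this thread. The include file name, the output path, the meaning of the three empty strings, and the zero indents are assumptions, so check them against the UDF header before relying on this.

```autoit
#include <Tesseract.au3> ; assumed file name of the Tesseract UDF

; Hypothetical output path, for illustration only.
Local $capture_filename = @TempDir & "\capture.tif"

; Raise the scale from the header default of 1, as the UDF comment suggests.
Local $scale = 3

; The three empty strings are assumed to mean "no specific window or control";
; the four zeros are assumed to mean "no indent on any side".
CaptureToTIFF("", "", "", $capture_filename, $scale, 0, 0, 0, 0)
```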
HSBen Posted April 28, 2011 Ya, I am going to work on getting just the element that I specifically need off the website. As for the $scale parameter in CaptureToTIFF(), I wasn't even using that functionality, because all I used the function for was saving the TIFF in the correct format.