Posted

Hello All -

This is my first post.  By way of introduction I am a research engineer at the University of Texas - Center for Transportation Research.  I work with Masters and PhD students conducting research and though they are whizzes at creating computer programs - I want to learn more about this myself.  This will enable me to converse with them on more of an equal footing plus I want to learn something new. 

I am old school, first did key punch cards and ran programs on IBM mainframes in school.  I've also done Basic and Fortran coding - a long time ago.   I've also done some SQL coding for use in a Sybase database - probably 15 years ago.....which I still use by the way.

I asked the UT Civil Engineering IT department for their recommendation of a programming tool I could use to create various types of programs (like the one I will describe below) and was told to download AutoIT.  

I've read through a number of the different Forum posts and viewed the pre-written code examples but haven't yet seen an example of the type of program I want to write first (or maybe don't recognize an example when I see it).

I am currently conducting research for the Texas Department of Transportation and have access to several thousand scanned PDF files - I want to extract information contained in each file and import the data into an Excel spreadsheet to create a database. These are scanned files, so the fields in each form are not directly accessible. Thus, though I can see a number I want to import, such as a truck axle load of 10,330, this is not actually a 'number' per se that I can read using any type of Excel macro.

I've examined these files and though Acrobat has a feature to export files to Excel this is disabled in the files I've been provided. 

It would be helpful if someone could point me to an example script that I could study to understand the AutoIt functions and start working to create AutoIT code for the purpose described.

I would appreciate any help you can give.

Thanks very much,

Mike

Posted (edited)

Are these PDF files searchable?

EDIT: Welcome to AutoIt and the forum.

Edited by mLipok

Signature beginning:
Please remember: "AutoIt"..... *  Wondering who uses AutoIt and what it can be used for ? * Forum Rules *
ADO.au3 UDF * POP3.au3 UDF * XML.au3 UDF * IE on Windows 11 * How to ask ChatGPT for AutoIt Code

Signature last update: 2023-04-24

Posted (edited)

Welcome to AutoIt and the forum!

I started with punch cards and IBM mainframes (360/25) myself many years ago. And now I'm using AutoIt to solve most of my computer "problems".

To start using AutoIt you have chosen quite a complex task :)

The task can be split into two parts:

  • Reading and translating the scanned PDF file
  • Extracting the data and writing it to Excel

For part one I suggest you search for "Tesseract", an OCR (Optical Character Recognition) program. I'm not sure it can translate the scanned bitmap to text, but it is worth a try.

Part two can be solved with the built-in Excel UDF.

The hardest part will be to translate the scanned PDF file.
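
To make the two-part split concrete, here is a minimal sketch of part two only, assuming part one (the OCR step) has already produced plain text for each file. It is written in Python rather than AutoIt purely for illustration, and it writes CSV (which Excel can open) instead of calling the Excel UDF. The field label "Axle Load", the column names, and both function names are hypothetical; the real forms will need their own patterns.

```python
import csv
import io
import re

# Hypothetical label: the real forms will use their own field names.
# Matches e.g. "Axle Load: 10,330" and captures the digits with commas.
AXLE_LOAD = re.compile(r"Axle\s+Load\s*[:\-]?\s*(\d[\d,]*)", re.IGNORECASE)


def extract_axle_loads(ocr_text):
    """Return every axle load found in the OCR text as an integer."""
    return [int(m.replace(",", "")) for m in AXLE_LOAD.findall(ocr_text)]


def rows_to_csv(rows):
    """Serialize (file name, axle load) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["file", "axle_load"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    sample = "Axle Load: 10,330\nAxle load 9,800"  # stand-in for OCR output
    loads = extract_axle_loads(sample)
    print(rows_to_csv([("sample.pdf", load) for load in loads]))
```

Keeping the extraction separate from the output step means the same patterns can be reused whichever OCR program or spreadsheet interface ends up being chosen.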

Edited by water

Posted (edited)

I think the Texas Department of Transportation may have purchased software such as FineReader Corporate or the server version. It could be very handy.

Sorry for the many edits; I am currently using a smartphone set to Polish.

Edited by mLipok

Posted

Or search Google for "translate scanned files to text tesseract".

You will find something like this.

Posted (edited)

When you look at the PDF files, are you using full Adobe Acrobat or just the reader? If you are using the reader, then you will not be able to convert them.

Another option, if you have Office 2010 or newer, is to try opening the PDF in Office. Office 2010 and newer supports PDF, and you may be able to convert the file to an Excel file.

Edited by DarthCookieMonster
Posted

As I understand it, MikeMurphy is talking about sheets of paper being scanned and stored as PDF files. The scanned content is embedded in the PDF file as TIF.

Posted

Not necessarily only a TIF image.
Some scanning devices have built-in OCR technology.

That's why I asked the question:

  On 9/20/2014 at 6:40 PM, mLipok said:

Are these PDF files searchable?

Posted
  Quote

 

These are scanned files so the fields in each form are not directly accessible

This makes me think that he doesn't use OCR software.

Posted

Let's wait to see what he (Mike) would say.

Posted

Hello All -

First, I would like to apologize for not responding to your messages - I am a member of other web forums and have gotten used to an email advising me that someone has responded to a new post I made. Since I hadn't heard from AutoIT, I assumed that no one had replied to my original post.

mLipok, thank you for contacting me ;-)   The photo you found on the UT-CTR website is about 7 years old so I have less hair and it is whiter now....I may have also lost some weight ;-)

But to the problem at hand.....

Actually, we have tried optical character readers for other applications that involved extracting information from handwritten law enforcement crash records. This did not work that well. However, I've not tried to extract data from a typewritten PDF using OCR - I somehow thought that I would be able to read the PDF image and extract the data directly, but apparently this is not possible. We have created an Excel tool for another application that extracts data from a truck data website and places it into an Excel database - I had hoped for something similar for this application.

To describe the process by which these files are initially created, a user accesses a website to create the document - the document is then stored in a database within the website as a PDF file which can be accessed by TxDOT but cannot be changed (it is password protected). The document creator can download a copy of the document for their use. However, these documents are password protected to prevent the person who first created the document (or anyone else for that matter) from later altering it. I have been given a large number of these files for use in our project - the files are downloaded directly from the web database - so they might not actually be scanned images, but rather electronic copies (PDF files) of the document as it was originally created.

I am using Adobe Acrobat X Pro to open the images - the document properties indicate they are in PDF version 1.5 (Adobe 6.x).   

The PDF files are not searchable.

I will try using an OCR (Tesseract) to read the PDF and convert it to a text or similar searchable file.   I assume that AutoIT does not include an OCR function else a separate program would not be necessary.

If I am making ignorant statements (for example, assuming that OCR could be an AutoIT function), please keep in mind that I am in the learning phase.

Thanks very much for your comments, I'll check in on the forum to follow up.

Mike

Posted
  On 9/25/2014 at 2:03 PM, MikeMurphy said:

Hello All -

First, I would like to apologize for not responding to your messages - I am a member of other web forums and have gotten used to an email advising me that someone has responded to a new post I made. Since I hadn't heard from AutoIT, I assumed that no one had replied to my original post.

 

Mike

Take a look here:

http://www.autoitscript.com/forum/index.php?app=core&module=usercp&tab=core&area=notifications

and set up your profile exactly how you like.

Posted (edited)

You can change your profile to get email notifications:

Click on your user name in the upper right corner of this page. Click settings and then click notification options.

Edit:

Not fast enough :

Edited by water

Posted

I'm still not 100% sure what kind of PDF files we are talking about.

Would it be possible to post one of these files here or send it to me by PM?

Posted
  On 9/25/2014 at 2:03 PM, MikeMurphy said:

Actually, we have tried optical character readers for other applications that involved extracting information from handwritten law enforcement crash records. This did not work that well. However, I've not tried to extract data from a typewritten PDF using OCR - I somehow thought that I would be able to read the PDF image and extract the data directly, but apparently this is not possible. We have created an Excel tool for another application that extracts data from a truck data website and places it into an Excel database - I had hoped for something similar for this application.

To describe the process by which these files are initially created, a user accesses a website to create the document - the document is then stored in a database within the website as a PDF file which can be accessed by TxDOT but cannot be changed (it is password protected). The document creator can download a copy of the document for their use. However, these documents are password protected to prevent the person who first created the document (or anyone else for that matter) from later altering it. I have been given a large number of these files for use in our project - the files are downloaded directly from the web database - so they might not actually be scanned images, but rather electronic copies (PDF files) of the document as it was originally created.

.......

The PDF files are not searchable.

I will try using an OCR (Tesseract) to read the PDF and convert it to a text or similar searchable file.

 

"a user accesses a website to create the document - the document is then stored in a database within the website as a PDF file which can be accessed by TxDOT but cannot be changed (it is password protected)."

Is the document filled in entirely via the website? Or is it filled in by hand (in pencil) after being stored, and then scanned?

Posted

Mike, you may be interested in looking here: Real OCR

Posted

The PDF files I am interested in reading / extracting data from are filled in on a website by the website user. These documents can be accessed by the DOT for their purposes, but as mentioned, are password locked to prevent changes. I have been given a sample of these documents, which can be downloaded from the website as PDF files.

I have viewed the Tesseract OCR information referenced in your responses, and though the documents are all completed on the web and thus are typewritten (not handwritten), I am realizing the difficulties that can occur in reading the files and populating a database. For example, some individuals provide distances between truck axles as feet and inches, while others provide this information as feet and decimals of a foot. For example, 4' 3" or 4.25 ft.
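
One way to handle the mixed formats is to normalize every axle spacing to decimal feet before it goes into the database. The sketch below is only an illustration in Python, not AutoIt, and it assumes the OCR output contains strings shaped like the two examples above (4' 3" or 4.25 ft). The function name and the exact patterns are hypothetical and would need adjusting to what the forms actually contain.

```python
import re

# Feet and optional inches, e.g. 4' 3"  or  4'
_FEET_INCHES = re.compile(r"""^\s*(\d+)\s*'\s*(?:(\d+(?:\.\d+)?)\s*")?\s*$""")
# Decimal feet with an optional unit, e.g. 4.25 ft  or  4.25
_DECIMAL_FEET = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:ft|feet)?\.?\s*$", re.IGNORECASE)


def spacing_to_feet(text):
    """Convert an axle spacing string to decimal feet.

    Raises ValueError for anything that matches neither format, so that
    unexpected entries are flagged for review instead of silently guessed.
    """
    match = _FEET_INCHES.match(text)
    if match:
        feet = int(match.group(1))
        inches = float(match.group(2)) if match.group(2) else 0.0
        return feet + inches / 12.0
    match = _DECIMAL_FEET.match(text)
    if match:
        return float(match.group(1))
    raise ValueError("unrecognized axle spacing: %r" % text)


if __name__ == "__main__":
    for raw in ["4' 3\"", "4.25 ft"]:
        print(raw, "->", spacing_to_feet(raw))
```

Both sample inputs come out as 4.25, so the two styles end up in the same column with the same unit.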

I can't post an example of the document on AutoIT since these documents contain some information considered confidential by the Texas State Attorney General - I am able to view these documents only because they have been given to me by TxDOT since I'm working on a project for them.

I will continue to study the information provided thus far. I will also change my profile so I receive Email updates.

Thank you very much,

Mike
