Need Guidance - convert a scanned PDF to excel format

MikeMurphy · September 20, 2014

Hello All -

This is my first post. By way of introduction I am a research engineer at the University of Texas - Center for Transportation Research. I work with Masters and PhD students conducting research and though they are whizzes at creating computer programs - I want to learn more about this myself. This will enable me to converse with them on more of an equal footing plus I want to learn something new.

I am old school, first did key punch cards and ran programs on IBM mainframes in school. I've also done Basic and Fortran coding - a long time ago. I've also done some SQL coding for use in a Sybase database - probably 15 years ago.....which I still use by the way.

I asked the UT Civil Engineering IT department for their recommendation of a programming tool I could use to create various types of programs (like the one I will describe below) and was told to download AutoIT.

I've read through a number of the different Forum posts and viewed the pre-written code examples but haven't yet seen an example of the type of program I want to write first (or maybe don't recognize an example when I see it).

I am currently conducting research for the Texas Department of Transportation and have access to several thousand scanned PDF files - I want to extract information contained in each file and import the data into an Excel spreadsheet to create a database. These are scanned files so the fields in each form are not directly accessible. Thus, though I can see a number I want to import, such as a truck axle load of 10,330, this is not actually a 'number' per se that I can read using any type of Excel macro.

I've examined these files and though Acrobat has a feature to export files to Excel this is disabled in the files I've been provided.

It would be helpful if someone could point me to an example script that I could study to understand the AutoIt functions and start working to create AutoIT code for the purpose described.

I would appreciate any help you can give.

Thanks very much,

Mike

mLipok · September 20, 2014

Are these PDF files searchable?

EDIT: Welcome to AutoIt and the forum.

Edited September 20, 2014 by mLipok

water · September 20, 2014

Welcome to AutoIt and the forum!

I started with punch cards and IBM mainframes (360/25) myself many years ago. And now I'm using AutoIt to solve most of my computer "problems".

To start using AutoIt you have chosen quite a complex task.

The task can be split into two parts:

Reading and translating the scanned PDF file
Extracting the data and writing it to Excel

For part one I suggest you search for "Tesseract", an OCR (Optical Character Recognition) program. I'm not sure it can translate the scanned bitmap to text, but it is worth a try.

Part two can be solved with the built-in Excel UDF.

The hardest part will be to translate the scanned PDF file.
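
As a rough illustration of part two, here is a minimal Python sketch (not AutoIt, and not something posted in this thread) of turning OCR output into rows a spreadsheet can open. It assumes the OCR step has already produced plain text and that the form prints values as "Label: number" lines; the labels "Axle load" and "Gross weight" and the function names are hypothetical. It writes CSV, which Excel opens directly, rather than a native workbook.

```python
import csv
import io
import re

# Assumed layout: a label, a colon, then a number that may use thousands separators.
FIELD_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>\d[\d,]*(?:\.\d+)?)\s*$"
)


def extract_fields(ocr_text):
    """Return {label: number} for every 'Label: 1,234' line in the OCR text."""
    fields = {}
    for line in ocr_text.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not look like a labelled number
        raw = match.group("value").replace(",", "")
        fields[match.group("label")] = float(raw) if "." in raw else int(raw)
    return fields


def write_csv(rows, columns, out):
    """Write one CSV row per document; missing columns are left empty."""
    writer = csv.DictWriter(out, fieldnames=columns, restval="", extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical OCR output for one form.
sample_text = "Permit form\nAxle load: 10,330\nGross weight: 80,000\n"
buffer = io.StringIO()
write_csv([extract_fields(sample_text)], ["Axle load", "Gross weight"], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a string buffer, open it with `newline=""` so the csv module controls the line endings.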

Edited September 20, 2014 by water

mLipok · September 20, 2014

I think the Texas Department of Transportation may have purchased software such as FineReader, in the Corporate or server version. It could be very handy.

Sorry for the many edits; I am currently using a smartphone set to Polish.

Edited September 20, 2014 by mLipok

water · September 20, 2014

Or search Google for "translate scanned files to text tesseract".

You will find something like this.

Bert · September 20, 2014

When you look at the PDF files, are you using full Adobe Acrobat or just the reader? If you are using the reader, then you will not be able to convert them.

Another option, if you have Office 2013 or newer, is to try opening the PDF in Word. Word 2013 and newer can open PDF files, and you may be able to get the content into an Excel file from there.

Edited September 20, 2014 by DarthCookieMonster

water · September 20, 2014

As I understand it, MikeMurphy is talking about sheets of paper being scanned and stored as PDF files. The scanned content is embedded in the PDF file as TIF.

mLipok · September 20, 2014

Not necessarily only a TIF image.
Some scanning devices have built-in OCR technology.

That's why I asked the question:

Are these PDF files searchable?

water · September 20, 2014

These are scanned files so the fields in each form are not directly accessible

This makes me think that he doesn't use OCR software.

mLipok · September 20, 2014

Let's wait and see what Mike says.

MikeMurphy · September 25, 2014

Hello All -

First, I would like to apologize for not responding to your messages - I am a member of other web forums and have gotten used to an email advising me that someone has responded to a new post I made. Since I hadn't heard from AutoIT, I assumed that no one had replied to my original post.

mLipok, thank you for contacting me ;-) The photo you found on the UT-CTR website is about 7 years old so I have less hair and it is whiter now....I may have also lost some weight ;-)

But to the problem at hand.....

Actually, we have tried optical character readers for other applications that involved extracting information from handwritten law enforcement crash records. This did not work that well. However, I've not tried to extract data from a typewritten PDF using OCR - I somehow thought that I would be able to read the PDF image and extract the data directly, but apparently this is not possible. We have created an Excel tool for another application that extracts data from a truck data website and places it into an Excel database - I had hoped for something similar for this application.

To describe the process by which these files are initially created, a user accesses a website to create the document - the document is then stored in a database within the website as a PDF file which can be accessed by TxDOT but cannot be changed (it is password-protected). The document creator can download a copy of the document for their use. However, these documents are password-protected to prevent the person who first created the document from later altering it (or anyone else for that matter). I have been given a large number of these files for use in our project - the files are downloaded directly from the web database - so they might not actually be scanned images, but rather electronic copies (PDF files) of the document as it was originally created.

I am using Adobe Acrobat X Pro to open the images - the document properties indicate they are in PDF version 1.5 (Adobe 6.x).

The PDF files are not searchable.

I will try using an OCR (Tesseract) to read the PDF and convert it to a text or similar searchable file. I assume that AutoIT does not include an OCR function else a separate program would not be necessary.

If I am making ignorant statements (for example, assuming that OCR could be an AutoIT function), please keep in mind that I am in the learning phase.

Thanks very much for your comments, I'll check in on the forum to follow up.

Mike

mLipok · September 25, 2014

Hello All -

First, I would like to apologize for not responding to your messages - I am a member of other web forums and have gotten used to an email advising me that someone has responded to a new post I made. Since I hadn't heard from AutoIT, I assumed that no one had replied to my original post.

Mike

take a look here:

http://www.autoitscript.com/forum/index.php?app=core&module=usercp&tab=core&area=notifications

and set up your profile exactly how you like.

water · September 25, 2014

You can change your profile to get email notifications:

Click on your user name in the upper right corner of this page. Click settings and then click notification options.

Edit:

Not fast enough

Edited September 25, 2014 by water

water · September 25, 2014

I'm still not 100% sure what kind of PDF files we are talking about.

Would it be possible to post one of these files here or send it to me by PM?

mLipok · September 25, 2014

"a user accesses a website to create the document - the document is then stored in a database within the website as a PDF file which can be accessed by TxDOT but cannot be changed (it is password-protected)."

Is the document filled in entirely via the website? Or is it, after being stored, filled in by hand with a pencil and then scanned?

mLipok · September 26, 2014

Mike, you may be interested in taking a look here: Real OCR

MikeMurphy · September 27, 2014

The PDF files I am interested in reading / extracting data from are filled in on a website by the website user. These documents can be accessed by the DOT for their purposes, but as mentioned, are password locked to prevent changes. I have been given a sample of these documents, which can be downloaded from the website as PDF files.

I have viewed the Tesseract OCR information referenced in your responses and though the documents are all completed on the web and thus are typewritten (not handwritten) I am realizing the difficulties that can occur in reading the files and populating a database. For example, some individuals provide distances between truck axles as feet and inches, others provide this information as feet and decimals of a foot. For example, 4' 3" or 4.25 ft.
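
The mixed axle-spacing formats can be brought to a single unit before the values go into the database. The following is a minimal Python sketch under stated assumptions, not a tested solution for the actual forms: it handles only the two notations mentioned above (feet and inches such as 4' 3", and decimal feet such as 4.25 ft), treats a bare number as feet, and uses a hypothetical function name, `to_decimal_feet`.

```python
import re

# Feet with optional inches, e.g. 4' 3" or 4'
FEET_INCHES = re.compile(
    r"""^\s*(?P<feet>\d+)\s*'\s*(?:(?P<inches>\d+(?:\.\d+)?)\s*")?\s*$"""
)
# Decimal feet with an optional unit, e.g. 4.25 ft or 4.25 (bare number assumed to be feet)
DECIMAL_FEET = re.compile(
    r"^\s*(?P<feet>\d+(?:\.\d+)?)\s*(?:ft|feet)?\.?\s*$", re.IGNORECASE
)


def to_decimal_feet(text):
    """Convert an axle spacing written in either notation to decimal feet."""
    match = FEET_INCHES.match(text)
    if match:
        inches = float(match.group("inches") or 0)
        return int(match.group("feet")) + inches / 12.0
    match = DECIMAL_FEET.match(text)
    if match:
        return float(match.group("feet"))
    # Anything else is flagged for manual review instead of being guessed.
    raise ValueError(f"unrecognized distance: {text!r}")


for entry in ["4' 3\"", "4.25 ft", "4'"]:
    print(entry, "->", to_decimal_feet(entry))
```

Raising an error for unrecognized input is a deliberate choice here: with several thousand files, it seems safer to collect the odd entries for a manual check than to store a wrong distance silently.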

I can't post an example of the document on AutoIT since these documents contain some information considered confidential by the Texas State Attorney General - I am able to view these documents only because they have been given to me by TxDOT since I'm working on a project for them.

I will continue to study the information provided thus far. I will also change my profile so I receive Email updates.

Thank you very much,

Mike
