MikeMurphy

Need Guidance - convert a scanned PDF to excel format

MikeMurphy replied to MikeMurphy's topic in AutoIt General Help and Support

The PDF files I want to read and extract data from are filled in on a website by the website user. These documents can be accessed by the DOT for their purposes but, as mentioned, are password locked to prevent changes. I have been given a sample of these documents, which can be downloaded from the website as PDF files. I have looked at the Tesseract OCR information referenced in your responses, and although the documents are all completed on the web and are therefore typewritten (not handwritten), I am realizing the difficulties that can occur in reading the files and populating a database. For example, some individuals provide distances between truck axles in feet and inches, while others provide them in feet and decimals of a foot: 4' 3" or 4.25 ft. I can't post an example of the document on the AutoIt forum since these documents contain some information considered confidential by the Texas State Attorney General - I am able to view them only because they have been given to me by TxDOT for a project I'm working on for them. I will continue to study the information provided thus far. I will also change my profile so I receive email updates. Thank you very much, Mike
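The mixed axle-spacing formats mentioned above (4' 3" versus 4.25 ft) could be normalized to decimal feet before the values go into a database. The following is only a minimal sketch in Python, not AutoIt, and the function name `spacing_to_feet` is hypothetical. It assumes the OCR output uses plain ASCII quote marks and that a bare number with no unit means decimal feet; real documents may need more patterns.

```python
import re

# Feet and optional inches, e.g. 4' 3"  or  4'  (inch mark may be " or two ' marks)
_FEET_INCHES = re.compile(r"""^\s*(\d+(?:\.\d+)?)\s*'\s*(?:(\d+(?:\.\d+)?)\s*(?:"|''))?\s*$""")
# Decimal feet, e.g. 4.25 ft  or  4.25 (assumption: a bare number is already in feet)
_DECIMAL_FEET = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:ft\.?|feet|foot)?\s*$", re.IGNORECASE)


def spacing_to_feet(text):
    """Return the spacing in decimal feet, or None if the format is not recognized."""
    match = _FEET_INCHES.match(text)
    if match:
        feet = float(match.group(1))
        inches = float(match.group(2)) if match.group(2) else 0.0
        return feet + inches / 12.0
    match = _DECIMAL_FEET.match(text)
    if match:
        return float(match.group(1))
    return None  # leave unrecognized values for manual review


if __name__ == "__main__":
    for sample in ["4' 3\"", "4.25 ft", "unknown"]:
        print(repr(sample), "->", spacing_to_feet(sample))
```

Returning None for anything unrecognized, rather than guessing, keeps questionable values out of the database until someone has looked at them.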

Need Guidance - convert a scanned PDF to excel format

MikeMurphy replied to MikeMurphy's topic in AutoIt General Help and Support

Hello All - First, I would like to apologize for not responding to your messages - I am a member of other web forums and have gotten used to an email advising me that someone has responded to a new post I made. Since I hadn't heard from AutoIt, I assumed that no one had replied to my original post. mLipok, thank you for contacting me ;-) The photo you found on the UT-CTR website is about 7 years old so I have less hair and it is whiter now....I may have also lost some weight ;-) But to the problem at hand..... Actually, we have tried optical character readers for other applications that involved extracting information from handwritten law enforcement crash records. This did not work that well. However, I've not tried to extract data from a typewritten PDF using OCR - I somehow thought that I would be able to read the PDF image and extract the data directly, but apparently this is not possible. We have created an Excel tool for another application that extracts data from a truck data website and places it into an Excel database - I had hoped for something similar for this application. To describe the process by which these files are initially created: a user accesses a website to create the document, and the document is then stored in a database within the website as a PDF file which can be accessed by TxDOT but cannot be changed. The document creator can download a copy of the document for their own use. However, these documents are password protected to prevent the person who first created the document (or anyone else, for that matter) from later altering it. I have been given a large number of these files for use in our project - the files are downloaded directly from the web database - so they might not actually be scanned images, but rather electronic copies (PDF files) of the document as it was originally created.
I am using Adobe Acrobat X Pro to open the files - the document properties indicate they are in PDF version 1.5 (Adobe 6.x). The PDF files are not searchable. I will try using OCR (Tesseract) to read the PDF and convert it to a text or similar searchable file. I assume that AutoIt does not include an OCR function, or else a separate program would not be necessary. If I am making ignorant statements (for example, assuming that OCR could be an AutoIt function), please keep in mind that I am in the learning phase. Thanks very much for your comments, I'll check in on the forum to follow up. Mike
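For the plan above (run Tesseract on a non-searchable PDF to get text), one possible workflow is to render each page to an image and then OCR the images. The sketch below is in Python rather than AutoIt and makes several assumptions that are not from this thread: that the Poppler `pdftoppm` tool and the `tesseract` command-line program are both installed and on the PATH, and that the permission settings on these files allow the pages to be rendered at all, which would need to be checked. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    # pdftoppm renders each page to a PNG named <out_prefix>-<page>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path):
    # Using "stdout" as the output base makes tesseract print the recognized text
    return ["tesseract", str(image_path), "stdout"]


def ocr_pdf(pdf_path, work_dir):
    """Render a PDF to page images, OCR each image, and return a list of page texts."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    subprocess.run(pdftoppm_command(pdf_path, work_dir / stem), check=True)

    pages = []
    # Page images are processed in file-name order
    for image in sorted(work_dir.glob(stem + "-*.png")):
        result = subprocess.run(
            tesseract_command(image), check=True, capture_output=True, text=True
        )
        pages.append(result.stdout)
    return pages
```

Calling `ocr_pdf("sample.pdf", "ocr_work")` (both names are placeholders) would return one text string per page, which could then be searched for the fields of interest. A script in another language could drive the same two command-line tools in the same order.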

mLipok reacted to a post in a topic: Need Guidance - convert a scanned PDF to excel format September 20, 2014

Need Guidance - convert a scanned PDF to excel format

MikeMurphy posted a topic in AutoIt General Help and Support

Hello All - This is my first post. By way of introduction, I am a research engineer at the University of Texas - Center for Transportation Research. I work with Masters and PhD students conducting research, and though they are whizzes at creating computer programs, I want to learn more about this myself. This will enable me to converse with them on more of an equal footing, plus I want to learn something new. I am old school: I first did key punch cards and ran programs on IBM mainframes in school. I've also done Basic and Fortran coding - a long time ago. I've also done some SQL coding for use in a Sybase database - probably 15 years ago.....which I still use by the way. I asked the UT Civil Engineering IT department for their recommendation of a programming tool I could use to create various types of programs (like the one I will describe below) and was told to download AutoIt. I've read through a number of the different Forum posts and viewed the pre-written code examples but haven't yet seen an example of the type of program I want to write first (or maybe don't recognize an example when I see it). I am currently conducting research for the Texas Department of Transportation and have access to several thousand scanned PDF files - I want to extract information contained in each file and import the data into an Excel spreadsheet to create a database. These are scanned files, so the fields in each form are not directly accessible. Thus, though I can see a number I want to import, such as a truck axle load of 10,330, this is not actually a 'number' per se that I can read using any type of Excel macro. I've examined these files, and though Acrobat has a feature to export files to Excel, this is disabled in the files I've been provided. It would be helpful if someone could point me to an example script that I could study to understand the AutoIt functions and start working to create AutoIt code for the purpose described. I would appreciate any help you can give.
Thanks very much, Mike
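Once a value such as the axle load 10,330 has been read out of a file as text, it still has to be turned into a real number before Excel can treat it as one. The sketch below shows one way this step might look, in Python rather than AutoIt; it writes a CSV file, which Excel can open directly. The function names and the column names `file` and `axle_load` are hypothetical, and the sketch assumes loads are whole numbers with optional comma thousands separators.

```python
import csv
import io
import re

# Whole numbers, either plain (10330) or with comma thousands separators (10,330)
_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})*|\d+")


def parse_load(text):
    """Convert text such as '10,330' to an int, or return None if it is not a number."""
    cleaned = text.strip()
    if _NUMBER.fullmatch(cleaned):
        return int(cleaned.replace(",", ""))
    return None


def write_rows(rows, handle):
    """Write (file name, raw load text) pairs as CSV rows with numeric loads."""
    writer = csv.writer(handle)
    writer.writerow(["file", "axle_load"])  # hypothetical column names
    for name, raw in rows:
        writer.writerow([name, parse_load(raw)])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows([("sample.pdf", "10,330")], buffer)
    print(buffer.getvalue())
```

Values that do not parse are written as empty cells, so they stand out in the spreadsheet instead of being silently converted to something wrong.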
