Sign in to follow this  
Followers 0
Sorak99

PDF to Text to Excel

28 posts in this topic

I am stuck, I have an idea of how i would like to proceed but no idea how to get there. I love to read the forums and have tried to find and figure this out on my own, but I would appreciate any input.

I'm a mortgage specialist and in the morning I go thru a lot of rate sheets trying to find the best price. Basically, I get alot of PDF documents in the morning and I am trying to convert them to excel.

I have been trying various methods to achieve my goal.

I use Autoit to run pdftotext and excel and have it imported.

The major problem I am running into is when the rate sheets carry a special announcement or a full page announcement it throws off lining up the text to yesterdays standard.

I have tried to change the pdfs to text then elimate all the words so they are not a factor by using find and replace then importing to excel. (this didnt work because the tables with the rate content would not always be the same dimensions)

I tried to numerous variations on the same theme, trying to get the format the same each time, with little success.

So is there a way to have a "reader" program, that reads the lines of a txt file and if the first entry is #.### or ?#.###? or somethng along those lines? Or something that just takes out numerical data.

George

Share this post


Link to post
Share on other sites



I am stuck, I have an idea of how i would like to proceed but no idea how to get there. I love to read the forums and have tried to find and figure this out on my own, but I would appreciate any input.

I'm a mortgage specialist and in the morning I go thru a lot of rate sheets trying to find the best price. Basically, I get alot of PDF documents in the morning and I am trying to convert them to excel.

I have been trying various methods to achieve my goal.

I use Autoit to run pdftotext and excel and have it imported.

The major problem I am running into is when the rate sheets carry a special announcement or a full page announcement it throws off lining up the text to yesterdays standard.

I have tried to change the pdfs to text then elimate all the words so they are not a factor by using find and replace then importing to excel. (this didnt work because the tables with the rate content would not always be the same dimensions)

I tried to numerous variations on the same theme, trying to get the format the same each time, with little success.

So is there a way to have a "reader" program, that reads the lines of a txt file and if the first entry is #.### or ?#.###? or somethng along those lines? Or something that just takes out numerical data.

George

Because you didnt post examples of these pdf\textfiles I cant give you a ready made script. However, this is very possible.

You could FileReadLine each line, and parse it with StringRegExp to see if it fits the format you mentioned above. Then, also using StringRegExp you can pull out the numerical information and easily write it to an excel spreadsheet using the ExcelCOM UDF.

The more items you attach/post the more I can help you.

-Simucal


AutoIt Scripts:Aimbot: Proof of Concept - PixelSearching Aimbot with several search/autoshoot/lock-on techniques.Sliding Toolbar - Add a nice Sliding Toolbar to your next script. Click the link to see an animation of it in action!FontInfo UDF - Get list of system fonts, or search to see if a particular font is installed.Get Extended Property UDF - Retrieve a files extended properties (e.g., video/image dimensions, file version, bitrate of song/video, etc)

Share this post


Link to post
Share on other sites

Because you didnt post examples of these pdf\textfiles I cant give you a ready made script. However, this is very possible.

You could FileReadLine each line, and parse it with StringRegExp to see if it fits the format you mentioned above. Then, also using StringRegExp you can pull out the numerical information and easily write it to an excel spreadsheet using the ExcelCOM UDF.

The more items you attach/post the more I can help you.

-Simucal

I have read the Autoit Help File, and have never come across StringRegExp... until now and realized that I assumed that the beta help file was identical to the regular help file. Lots more reading to do.

Auroranew.pdf

For the most part things remain the same except overtime they shift slighty and seperate when transformed into text. Currently i use xpdf from http://www.foolabs.com/xpdf/home.html using C:/pdftotext.exe -layout -f 0 -l 1 C:/Aurora.pdf C:/Aurora1.txt eventually getting each page seperate to help minimize the changes from the begining of the document from affecting the latter.

Writing data to the excel isnt that bad, but keeping it in the same place is hard. So, if i recognize a pattern on text line 5 and import it into excel line 1, then how do recognize text line 6 (old) that is now text line 7 (new) into excel line 2. Am i getting ahead of myself, or am I just way off track?

George

Share this post


Link to post
Share on other sites

Other Attachment: Auroraold.pdf

Dunno how to put them on the same post.

George

Share this post


Link to post
Share on other sites

I am stuck

Is this the data you are after?

8.375 (3.875) (3.750) (3.625) (3.500) (3.250)

8.250 (3.625) (3.500) (3.375) (3.250) (3.000)

8.125 (3.500) (3.375) (3.250) (3.125) (2.875)

8.000 (3.250) (3.125) (3.000) (2.875) (2.625)

7.875 (3.000) (2.875) (2.750) (2.625) (2.375)

7.750 (2.750) (2.625) (2.500) (2.375) (2.125)

7.625 (2.500) (2.375) (2.250) (2.125) (1.875)

7.500 (2.250) (2.125) (2.000) (1.875) (1.625)

7.375 (2.000) (1.875) (1.750) (1.625) (1.375)

7.250 (1.750) (1.625) (1.500) (1.375) (1.125)

7.125 (1.500) (1.375) (1.250) (1.125) (0.875)

7.000 (1.250) (1.125) (1.000) (0.875) (0.625)

6.875 (1.000) (0.875) (0.750) (0.625) (0.375)

6.750 (0.625) (0.500) (0.375) (0.250) 0.000

6.625 (0.250) (0.125) 0.000 0.125 0.375

6.500 0.125 0.250 0.375 0.500 0.750

6.375 0.500 0.625 0.750 0.875 1.125

6.250 1.000 1.125 1.250 1.375 1.625

6.125 1.625 1.750 1.875 2.000 2.250


...by the way, it's pronounced: "JIF"... Bob Berry --- inventor of the GIF format

Share this post


Link to post
Share on other sites

Is this the data you are after?

For the most part, yes. Impressive, but does it take into account when extra pages are added and the movement of the data? Like I said, i can get it into excel, but i cant keep it in the same place.

George

Share this post


Link to post
Share on other sites

You could FileReadLine each line, and parse it with StringRegExp to see if it fits the format you mentioned above. Then, also using StringRegExp you can pull out the numerical information and easily write it to an excel spreadsheet using the ExcelCOM UDF.

When you say fits the format, is it possible to have a format of ?#.###? so that the ? are a space or a () and the # are numbers? Or something to that effect?

George

Share this post


Link to post
Share on other sites

#8 ·  Posted (edited)

Give this a try and see how it does for what you want.

#include <File.au3>
#include <Array.au3>
Opt ( "TrayIconDebug", 1 )

Dim $array
_FileReadToArray ( "Aurora1.txt", $array )

FileDelete ( "test.csv" )

For $x = 1 TO $array[0] - 1
    ToolTip ( $x, 0, 0 )
    If StringInStr ( $array[$x], "(" ) And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 )) Then

        $SpaceFix = StringStripWS ( $array[$x], 1 )
        $SpaceFix = StringReplace ( StringStripWS ( StringStripCR ( $SpaceFix ), 4 ), " ", "," )
        $Split = StringSplit ( $SpaceFix, "," )

        If $Split[0] > 12 Then
            $SplitSize = $Split[0]
            For $y = 1 to $Split[0]
                If StringInStr ( $Split[1], "." ) Then
                    If $y > $SplitSize Then ExitLoop
                    If Not StringInStr ( $Split[$y], "." ) Then
                        _ArrayDelete ( $Split, $y )
                        $SplitSize = $SplitSize - 1
                        $y = $y - 1
                    EndIf
                EndIf
            Next
        EndIf
        
        $SplitString = _ArrayToString ( $Split, ",", 1 )
        $SplitString = StringReplace ( $SplitString, "(", "" )
        $SplitString = StringReplace ( $SplitString, ")", "" )
        
        If StringInStr ( $SpaceFix, "yr" ) Then
            FileWriteLine ("test.csv", @CRLF & $SplitString )
        ElseIf StringInStr ( $Split[1], "." ) Then
            FileWriteLine ("test.csv", $SplitString )
        EndIf
    EndIf
Next
Edited by exodius

Share this post


Link to post
Share on other sites

#9 ·  Posted (edited)

Give this a try and see how it does for what you want.

#include <File.au3>
#include <Array.au3>
Opt ( "TrayIconDebug", 1 )

Dim $array
_FileReadToArray ( "Aurora1.txt", $array )

FileDelete ( "test.csv" )

For $x = 1 TO $array[0] - 1
    ToolTip ( $x, 0, 0 )
    If StringInStr ( $array[$x], "(" ) And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 )) Then

        $SpaceFix = StringStripWS ( $array[$x], 1 )
        $SpaceFix = StringReplace ( StringStripWS ( StringStripCR ( $SpaceFix ), 4 ), " ", "," )
        $Split = StringSplit ( $SpaceFix, "," )

        If $Split[0] > 12 Then
            $SplitSize = $Split[0]
            For $y = 1 to $Split[0]
                If StringInStr ( $Split[1], "." ) Then
                    If $y > $SplitSize Then ExitLoop
                    If Not StringInStr ( $Split[$y], "." ) Then
                        _ArrayDelete ( $Split, $y )
                        $SplitSize = $SplitSize - 1
                        $y = $y - 1
                    EndIf
                EndIf
            Next
        EndIf
        
        $SplitString = _ArrayToString ( $Split, ",", 1 )
        $SplitString = StringReplace ( $SplitString, "(", "" )
        $SplitString = StringReplace ( $SplitString, ")", "" )
        
        If StringInStr ( $SpaceFix, "yr" ) Then
            FileWriteLine ("test.csv", @CRLF & $SplitString )
        ElseIf StringInStr ( $Split[1], "." ) Then
            FileWriteLine ("test.csv", $SplitString )
        EndIf
    EndIf
Next

Believe it or not, I have gotten to this point. Not with nearly as beautiful of code, mine is about 8k lines of gibberish, I had it first use pdftotext then remove the words in word so i had nothing but numbers, then import into an excel sheet. I just automated the keyboard strokes for most of it. But i couldnt get the data to stay within the same cells from day to day. An extra announcement would throw it off.

So i tried to search and find a key word on each, similar to "yr" that you used, but for each program. Then it could copy and import into excel every 50 spaces so that A1 would always be the lowest rate for that program and a50 would be the lowest rate for another program. But i couldnt find a keyword that worked.

Basically thats my dilemma. I put a bunch of the .txt on the back of our website http://www.bluecoasthomeloans.com/George/ (still a work in progress) and if anyone needs a loan! (Small advertisement). I cant get it to work on a one such as: TaylorBean&Whitaker.txt but I think thats because of And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 ))

I appreciate the code you have given me so far, it is taking me a while to figure it out, but it is definitely helping me learn.

Edited by Sorak99

Share this post


Link to post
Share on other sites

#10 ·  Posted (edited)

Believe it or not, I have gotten to this point. Not with nearly as beautiful of code, mine is about 8k lines of gibberish, I had it first use pdftotext then remove the words in word so i had nothing but numbers, then import into an excel sheet. I just automated the keyboard strokes for most of it. But i couldn’t get the data to stay within the same cells from day to day. An extra announcement would throw it off.

So i tried to search and find a key word on each, similar to "yr" that you used, but for each program. Then it could copy and import into excel every 50 spaces so that A1 would always be the lowest rate for that program and a50 would be the lowest rate for another program. But i couldn’t find a keyword that worked.

Basically that’s my dilemma. I put a bunch of the .txt on the back of our website http://www.bluecoasthomeloans.com/George/ (still a work in progress) and if anyone needs a loan! (Small advertisement). I cant get it to work on a one such as: TaylorBean&Whitaker.txt but I think that’s because of And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 ))

I appreciate the code you have given me so far, it is taking me a while to figure it out, but it is definitely helping me learn.

So based on what the code that I gave you generates, what needs to be different about the output?

Edit: So a big problem I'm seeing that you're going to have is each of these seem to have just slightly different enough of a format to throw off the possibility of having just one script (or at least one function in a script) to do them all. You will probably need to customize what I gave you for each vendor, assuming that they're fairly standard in their formatting.

Edited by exodius

Share this post


Link to post
Share on other sites

#11 ·  Posted (edited)

Basically, as time goes on things on these sheets change. The output from the one changed 45 days later.

The major problem I am running into is when the rate sheets carry a special announcement or a full page announcement it throws off lining up the text to yesterdays standard.

i put an example on the back end of the website again. One is from about 45 days ago, the other text is from today. The Outputs are slightly off and are exacerbated as the entire sheet is unraveled. I planned on minimizing that by creating each page output separately, so that it doesnt affect the data beneath it. But problem still remains.

http://www.bluecoasthomeloans.com/George/Exodius/

Edit: I dont mind customizing each script, its just that a customized script for an individual pdf/txt will not stay affective as time goes on. As shown by the example I uploaded. I can make a script for a single day, things usually shift within a week.

Essentially this:

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x. xx

x.xx x.xx x.xx x. xx

Turns into this

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

Or with the .csv created by the code there is an extra space in the beginning

Does that make sense? What i eventually did was make an excel book with each investor then a page that showed the same program for everyone on the same sheet. So i wouldnt have to go through the 60 or so investors that we have everyday. But when they move my formulas dont work.

George

Edited by Sorak99

Share this post


Link to post
Share on other sites

Basically, as time goes on things on these sheets change. The output from the one changed 45 days later.

i put an example on the back end of the website again. One is from about 45 days ago, the other text is from today. The Outputs are slightly off and are exacerbated as the entire sheet is unraveled. I planned on minimizing that by creating each page output separately, so that it doesnt affect the data beneath it. But problem still remains.

http://www.bluecoasthomeloans.com/George/Exodius/

Edit: I dont mind customizing each script, its just that a customized script for an individual pdf/txt will not stay affective as time goes on. As shown by the example I uploaded. I can make a script for a single day, things usually shift within a week.

Essentially this:

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x. xx

x.xx x.xx x.xx x. xx

Turns into this

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

x.xx x.xx x.xx x.xx

Or with the .csv created by the code there is an extra space in the beginning

Does that make sense? What i eventually did was make an excel book with each investor then a page that showed the same program for everyone on the same sheet. So i wouldnt have to go through the 60 or so investors that we have everyday. But when they move my formulas dont work.

George

I had a thought. If i can read the first line and check if its higher than the next line until its less than then with a few rolling counters i can seperate the material into sections, so i can space the begining of each section about 50 lines.

8.5 -3.875 -3.75 -3.625

8.375 -3.75 -3.625 -3.5

8.25 -3.5 -3.375 -3.25

8.125 -3.375 -3.25 -3.125

8 -3.125 -3 -2.875

7.875 -2.875 -2.75 -2.625

7.75 -2.625 -2.5 -2.375

7.625 -2.375 -2.25 -2.125

7.5 -2.125 -2 -1.875

7.375 -1.875 -1.75 -1.625

7.25 -1.625 -1.5 -1.375

7.125 -1.375 -1.25 -1.125

7 -1 -0.875 -0.75

6.875 -0.75 -0.625 -0.5

6.75 -0.375 -0.25 -0.125

6.625 0 0.125 0.25

8.125 -4.25 -4.125 -4

8 -4 -3.875 -3.75

7.875 -3.875 -3.75 -3.625

7.75 -3.625 -3.5 -3.375

7.625 -3.5 -3.375 -3.25

7.5 -3.125 -3 -2.875

7.375 -2.875 -2.75 -2.625

7.25 -2.5 -2.375 -2.25

7.125 -2.25 -2.125 -2

7 -1.875 -1.75 -1.625

6.875 -1.5 -1.375 -1.25

6.75 -1.125 -1 -0.875

6.625 -0.75 -0.625 -0.5

6.5 -0.375 -0.25 -0.125

8.5 -3.75 -3.625 -3.5

8.375 -3.5 -3.375 -3.25

8.25 -3.375 -3.25 -3.125

8.125 -3.125 -3 -2.875

8 -3 -2.875 -2.75

7.875 -2.75 -2.625 -2.5

7.75 -2.625 -2.5 -2.375

7.625 -2.375 -2.25 -2.125

7.5 -2.25 -2.125 -2

7.375 -2 -1.875 -1.75

7.25 -1.625 -1.5 -1.375

7.125 -1.25 -1.125 -1

7 -0.875 -0.75 -0.625

6.875 -0.5 -0.375 -0.25

6.75 0 0.125 0.25

I appreciate your help Exodius, I learned a great deal studying your code.

George

Share this post


Link to post
Share on other sites

It can be divided into two sub-function:

1. PDF to TXT transformation

you can search the XPDF library. It's a free DOS command-line tool.

2. TXT to EXCEL transtormation

First, you need to parse the TXT structure.

Language that supports Regular Expression can do this job well, like perl or something else.

Then, you need to put the data you like into the Excel file.

Through language ODBC support, you can do it.

In perl, you can refer to DBI library.

:D

Share this post


Link to post
Share on other sites

1. PDF to TXT transformation

you can search the XPDF library. It's a free DOS command-line tool.

This part is true. These tools might as well be utilized and in fact he actually's using one already (like he said above).

2. TXT to EXCEL transtormation

First, you need to parse the TXT structure.

Language that supports Regular Expression can do this job well, like perl or something else.

Then, you need to put the data you like into the Excel file.

Through language ODBC support, you can do it.

In perl, you can refer to DBI library.

AutoIt is capable of both of these things itself. The changing nature of the documents he's working with just make it difficult to nail down a standard format. So I'm left wondering why you're trying to explain how to do things in any way other than AutoIt, on the AutoIt support forum. :D

Share this post


Link to post
Share on other sites

I do appreciate all the help.

I think I will try to strip all information away except for the numbers and periods, then attempt to reorganize it into the tables that it came from. How? No idea. :">

George

Share this post


Link to post
Share on other sites

k, Here is what i came up with.

#include <File.au3>
#include <Array.au3>
#include<string.au3>

Opt ( "TrayIconDebug", 1 )

Dim $array
_FileReadToArray ( "C:\citi.txt", $array )

FileDelete ( "C:\test.csv" )

For $x = 1 TO $array[0] - 1
    ToolTip ( $x, 0, 0 )

        $SpaceFix = StringStripWS ( $array[$x], 1 )
        $SpaceFix = stringreplace($SpaceFix, "...", " ")
        $SpaceFix = StringReplace ( StringStripWS ( StringStripCR ( $SpaceFix ), 4 ), " ", "," )
        $Split = StringSplit ( $SpaceFix, "," )
        
    for $y = 0 to $Split[0] - 1
        ToolTip ( $y, 20, 20 )
        $z = StringReplace($split[$y], ".", "")
        $z = StringReplace($z, ")", "")
        $z = StringReplace($z, "(", "")
    
        if stringinstr( $split[$y], ".") > 0 and stringisdigit($z) = 1 Then

            FileWriteline ("C:\test.csv",$Split[$y] & "," & $y & "," & $x )
            
        EndIf
    Next
     
   
Next

With the attachment citi.txt , i kept in the $x and $y because i figured they would help me rebuild a table. The only difference between this code, and exodius's code is that this only has the items that are necessary. So this is how far I am so far on trying to rebuild the tables. It doesnt work, but advice on either code would be appreciated.

#include <Array.au3>

Dim $parts
Dim $firstentry
Dim $prevpartsline

$a = 0
$file = FileOpen("C:\test.csv", 0)

FileDelete ( "C:\test2.csv" )

; Check if file opened for reading OK
If $file = -1 Then
    MsgBox(0, "Error", "Unable to open file.")
    Exit
EndIf

; Read in lines of text until the EOF is reached
While 1
    $line = FileReadLine($file)
    If @error = -1 Then ExitLoop
    
    $parts = stringsplit($line, ",")
    
    $a = $a + 1
    if $a = 1 Then
        $parts[1] = $firstentry and $parts[3] = $prevpartsline
        ContinueLoop
        
        EndIf
    if $prevpartsline = $parts[3] then
        _stringinsert("," & $parts[1], $firstentry, StringLen($firstentry))

       $prevpartsline = $parts[3]
       
       FileWriteline ("C:\test2.csv",$firstentry)
       
       endif
Wend

FileClose($file)

Please be kind. I have never written anything that loops before... in fact, I think that this is the first time i used IF THEN statements.. I would love all the help i can get, but im afraid of becomin a pest, I cant seem find examples of things i would like to do with Outlook, and IE (Unfortunately the guide for IE, doesn't help I learn from programs examples)

Btw, does _filewritetoline work when the line is a variable?

Thanks again,

George

Share this post


Link to post
Share on other sites

#17 ·  Posted (edited)

I'll look at this when I get some time George, I'm not sure if that's going to be tonight though.

Edit: So I'm looking at it George, does the code that you put up after your alteration of mine actually work as you posted it?

Edited by exodius

Share this post


Link to post
Share on other sites

I'll look at this when I get some time George, I'm not sure if that's going to be tonight though.

Edit: So I'm looking at it George, does the code that you put up after your alteration of mine actually work as you posted it?

It does, I get something like this:

7.625 1 26

-3.968 2 26

-3.84 3 26

-3.809 4 26

-3.679 5 26

7.625 6 26

-3.493 7 26

-3.364 8 26

-3.314 9 26

-3.184 10 26

7.125 11 26

-3.552 12 26

-3.446 13 26

-3.339 14 26

7.5 1 27

-3.694 2 27

-3.57 3 27

-3.57 4 27

-3.445 5 27

7.5 6 27

-3.151 7 27

-3.027 8 27

-2.972 9 27

-2.847 10 27

7 11 27

-3.281 12 27

-3.18 13 27

-3.078 14 27

7.375 1 28

-3.348 2 28

-3.23 3 28

-3.225 4 28

-3.106 5 28

7.375 6 28

-3.259 7 28

-3.141 8 28

-3.08 9 28

-2.961 10 28

6.875 11 28

-2.898 12 28

-2.803 13 28

-2.706 14 28

7.25 1 29

-3.238 2 29

-3.125 3 29

-3.08 4 29

-2.966 5 29

7.25 6 29

-3.032 7 29

-2.919 8 29

-2.853 9 29

-2.739 10 29

6.75 11 29

-2.61 12 29

-2.519 13 29

-2.427 14 29

7.125 1 30

-3.026 2 30

-2.918 3 30

-2.878 4 30

-2.769 5 30

7.125 6 30

-2.763 7 30

-2.655 8 30

ect.

ect.

ect.

George

Share this post


Link to post
Share on other sites

So you want the data separated by a single space?

Share this post


Link to post
Share on other sites

So you want the data separated by a single space?

Well, the actual data is with commas (see attachment), i just copied it off the sheet.

Just trying to rebuild the tables, I figure if i only take out the important info, and while im rebuilding the tables i can move them apart the same distance in a new or same file, so avoid the changing nature of the document.

George

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now
Sign in to follow this  
Followers 0