Sorak99 Posted June 6, 2006 Share Posted June 6, 2006 I am stuck, I have an idea of how i would like to proceed but no idea how to get there. I love to read the forums and have tried to find and figure this out on my own, but I would appreciate any input. I'm a mortgage specialist and in the morning I go thru a lot of rate sheets trying to find the best price. Basically, I get alot of PDF documents in the morning and I am trying to convert them to excel. I have been trying various methods to achieve my goal. I use Autoit to run pdftotext and excel and have it imported. The major problem I am running into is when the rate sheets carry a special announcement or a full page announcement it throws off lining up the text to yesterdays standard. I have tried to change the pdfs to text then elimate all the words so they are not a factor by using find and replace then importing to excel. (this didnt work because the tables with the rate content would not always be the same dimensions) I tried to numerous variations on the same theme, trying to get the format the same each time, with little success. So is there a way to have a "reader" program, that reads the lines of a txt file and if the first entry is #.### or ?#.###? or somethng along those lines? Or something that just takes out numerical data. George Link to comment Share on other sites More sharing options...
Simucal Posted June 6, 2006 Share Posted June 6, 2006 I am stuck, I have an idea of how i would like to proceed but no idea how to get there. I love to read the forums and have tried to find and figure this out on my own, but I would appreciate any input. I'm a mortgage specialist and in the morning I go thru a lot of rate sheets trying to find the best price. Basically, I get alot of PDF documents in the morning and I am trying to convert them to excel. I have been trying various methods to achieve my goal. I use Autoit to run pdftotext and excel and have it imported.The major problem I am running into is when the rate sheets carry a special announcement or a full page announcement it throws off lining up the text to yesterdays standard. I have tried to change the pdfs to text then elimate all the words so they are not a factor by using find and replace then importing to excel. (this didnt work because the tables with the rate content would not always be the same dimensions)I tried to numerous variations on the same theme, trying to get the format the same each time, with little success.So is there a way to have a "reader" program, that reads the lines of a txt file and if the first entry is #.### or ?#.###? or somethng along those lines? Or something that just takes out numerical data.GeorgeBecause you didnt post examples of these pdf\textfiles I cant give you a ready made script. However, this is very possible.You could FileReadLine each line, and parse it with StringRegExp to see if it fits the format you mentioned above. Then, also using StringRegExp you can pull out the numerical information and easily write it to an excel spreadsheet using the ExcelCOM UDF.The more items you attach/post the more I can help you.-Simucal AutoIt Scripts:Aimbot: Proof of Concept - PixelSearching Aimbot with several search/autoshoot/lock-on techniques.Sliding Toolbar - Add a nice Sliding Toolbar to your next script. Click the link to see an animation of it in action!FontInfo UDF - Get list of system fonts, or search to see if a particular font is installed.Get Extended Property UDF - Retrieve a files extended properties (e.g., video/image dimensions, file version, bitrate of song/video, etc) Link to comment Share on other sites More sharing options...
Sorak99 Posted June 7, 2006 Author Share Posted June 7, 2006 Because you didnt post examples of these pdf\textfiles I cant give you a ready made script. However, this is very possible. You could FileReadLine each line, and parse it with StringRegExp to see if it fits the format you mentioned above. Then, also using StringRegExp you can pull out the numerical information and easily write it to an excel spreadsheet using the ExcelCOM UDF. The more items you attach/post the more I can help you. -Simucal I have read the Autoit Help File, and have never come across StringRegExp... until now and realized that I assumed that the beta help file was identical to the regular help file. Lots more reading to do. Auroranew.pdf For the most part things remain the same except overtime they shift slighty and seperate when transformed into text. Currently i use xpdf from http://www.foolabs.com/xpdf/home.html using C:/pdftotext.exe -layout -f 0 -l 1 C:/Aurora.pdf C:/Aurora1.txt eventually getting each page seperate to help minimize the changes from the begining of the document from affecting the latter. Writing data to the excel isnt that bad, but keeping it in the same place is hard. So, if i recognize a pattern on text line 5 and import it into excel line 1, then how do recognize text line 6 (old) that is now text line 7 (new) into excel line 2. Am i getting ahead of myself, or am I just way off track? George Link to comment Share on other sites More sharing options...
Sorak99 Posted June 7, 2006 Author Share Posted June 7, 2006 Other Attachment: Auroraold.pdf Dunno how to put them on the same post. George Link to comment Share on other sites More sharing options...
jefhal Posted June 7, 2006 Share Posted June 7, 2006 I am stuckIs this the data you are after?8.375 (3.875) (3.750) (3.625) (3.500) (3.250) 8.250 (3.625) (3.500) (3.375) (3.250) (3.000) 8.125 (3.500) (3.375) (3.250) (3.125) (2.875) 8.000 (3.250) (3.125) (3.000) (2.875) (2.625) 7.875 (3.000) (2.875) (2.750) (2.625) (2.375) 7.750 (2.750) (2.625) (2.500) (2.375) (2.125) 7.625 (2.500) (2.375) (2.250) (2.125) (1.875) 7.500 (2.250) (2.125) (2.000) (1.875) (1.625) 7.375 (2.000) (1.875) (1.750) (1.625) (1.375) 7.250 (1.750) (1.625) (1.500) (1.375) (1.125) 7.125 (1.500) (1.375) (1.250) (1.125) (0.875) 7.000 (1.250) (1.125) (1.000) (0.875) (0.625) 6.875 (1.000) (0.875) (0.750) (0.625) (0.375) 6.750 (0.625) (0.500) (0.375) (0.250) 0.000 6.625 (0.250) (0.125) 0.000 0.125 0.375 6.500 0.125 0.250 0.375 0.500 0.750 6.375 0.500 0.625 0.750 0.875 1.125 6.250 1.000 1.125 1.250 1.375 1.625 6.125 1.625 1.750 1.875 2.000 2.250 ...by the way, it's pronounced: "JIF"... Bob Berry --- inventor of the GIF format Link to comment Share on other sites More sharing options...
Sorak99 Posted June 7, 2006 Author Share Posted June 7, 2006 Is this the data you are after?For the most part, yes. Impressive, but does it take into account when extra pages are added and the movement of the data? Like I said, i can get it into excel, but i cant keep it in the same place. George Link to comment Share on other sites More sharing options...
Sorak99 Posted June 7, 2006 Author Share Posted June 7, 2006 You could FileReadLine each line, and parse it with StringRegExp to see if it fits the format you mentioned above. Then, also using StringRegExp you can pull out the numerical information and easily write it to an excel spreadsheet using the ExcelCOM UDF.When you say fits the format, is it possible to have a format of ?#.###? so that the ? are a space or a () and the # are numbers? Or something to that effect?George Link to comment Share on other sites More sharing options...
exodius Posted June 7, 2006 Share Posted June 7, 2006 (edited) Give this a try and see how it does for what you want. expandcollapse popup#include <File.au3> #include <Array.au3> Opt ( "TrayIconDebug", 1 ) Dim $array _FileReadToArray ( "Aurora1.txt", $array ) FileDelete ( "test.csv" ) For $x = 1 TO $array[0] - 1 ToolTip ( $x, 0, 0 ) If StringInStr ( $array[$x], "(" ) And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 )) Then $SpaceFix = StringStripWS ( $array[$x], 1 ) $SpaceFix = StringReplace ( StringStripWS ( StringStripCR ( $SpaceFix ), 4 ), " ", "," ) $Split = StringSplit ( $SpaceFix, "," ) If $Split[0] > 12 Then $SplitSize = $Split[0] For $y = 1 to $Split[0] If StringInStr ( $Split[1], "." ) Then If $y > $SplitSize Then ExitLoop If Not StringInStr ( $Split[$y], "." ) Then _ArrayDelete ( $Split, $y ) $SplitSize = $SplitSize - 1 $y = $y - 1 EndIf EndIf Next EndIf $SplitString = _ArrayToString ( $Split, ",", 1 ) $SplitString = StringReplace ( $SplitString, "(", "" ) $SplitString = StringReplace ( $SplitString, ")", "" ) If StringInStr ( $SpaceFix, "yr" ) Then FileWriteLine ("test.csv", @CRLF & $SplitString ) ElseIf StringInStr ( $Split[1], "." ) Then FileWriteLine ("test.csv", $SplitString ) EndIf EndIf Next Edited June 7, 2006 by exodius Link to comment Share on other sites More sharing options...
Sorak99 Posted June 8, 2006 Author Share Posted June 8, 2006 (edited) Give this a try and see how it does for what you want. expandcollapse popup#include <File.au3> #include <Array.au3> Opt ( "TrayIconDebug", 1 ) Dim $array _FileReadToArray ( "Aurora1.txt", $array ) FileDelete ( "test.csv" ) For $x = 1 TO $array[0] - 1 ToolTip ( $x, 0, 0 ) If StringInStr ( $array[$x], "(" ) And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 )) Then $SpaceFix = StringStripWS ( $array[$x], 1 ) $SpaceFix = StringReplace ( StringStripWS ( StringStripCR ( $SpaceFix ), 4 ), " ", "," ) $Split = StringSplit ( $SpaceFix, "," ) If $Split[0] > 12 Then $SplitSize = $Split[0] For $y = 1 to $Split[0] If StringInStr ( $Split[1], "." ) Then If $y > $SplitSize Then ExitLoop If Not StringInStr ( $Split[$y], "." ) Then _ArrayDelete ( $Split, $y ) $SplitSize = $SplitSize - 1 $y = $y - 1 EndIf EndIf Next EndIf $SplitString = _ArrayToString ( $Split, ",", 1 ) $SplitString = StringReplace ( $SplitString, "(", "" ) $SplitString = StringReplace ( $SplitString, ")", "" ) If StringInStr ( $SpaceFix, "yr" ) Then FileWriteLine ("test.csv", @CRLF & $SplitString ) ElseIf StringInStr ( $Split[1], "." ) Then FileWriteLine ("test.csv", $SplitString ) EndIf EndIf Next Believe it or not, I have gotten to this point. Not with nearly as beautiful of code, mine is about 8k lines of gibberish, I had it first use pdftotext then remove the words in word so i had nothing but numbers, then import into an excel sheet. I just automated the keyboard strokes for most of it. But i couldnt get the data to stay within the same cells from day to day. An extra announcement would throw it off. So i tried to search and find a key word on each, similar to "yr" that you used, but for each program. Then it could copy and import into excel every 50 spaces so that A1 would always be the lowest rate for that program and a50 would be the lowest rate for another program. But i couldnt find a keyword that worked. Basically thats my dilemma. I put a bunch of the .txt on the back of our website http://www.bluecoasthomeloans.com/George/ (still a work in progress) and if anyone needs a loan! (Small advertisement). I cant get it to work on a one such as: TaylorBean&Whitaker.txt but I think thats because of And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 )) I appreciate the code you have given me so far, it is taking me a while to figure it out, but it is definitely helping me learn. Edited June 8, 2006 by Sorak99 Link to comment Share on other sites More sharing options...
exodius Posted June 8, 2006 Share Posted June 8, 2006 (edited) Believe it or not, I have gotten to this point. Not with nearly as beautiful of code, mine is about 8k lines of gibberish, I had it first use pdftotext then remove the words in word so i had nothing but numbers, then import into an excel sheet. I just automated the keyboard strokes for most of it. But i couldn’t get the data to stay within the same cells from day to day. An extra announcement would throw it off. So i tried to search and find a key word on each, similar to "yr" that you used, but for each program. Then it could copy and import into excel every 50 spaces so that A1 would always be the lowest rate for that program and a50 would be the lowest rate for another program. But i couldn’t find a keyword that worked. Basically that’s my dilemma. I put a bunch of the .txt on the back of our website http://www.bluecoasthomeloans.com/George/ (still a work in progress) and if anyone needs a loan! (Small advertisement). I cant get it to work on a one such as: TaylorBean&Whitaker.txt but I think that’s because of And StringIsDigit (StringLeft ( StringStripWS($array[$x], 1), 1 )) I appreciate the code you have given me so far, it is taking me a while to figure it out, but it is definitely helping me learn. So based on what the code that I gave you generates, what needs to be different about the output? Edit: So a big problem I'm seeing that you're going to have is each of these seem to have just slightly different enough of a format to throw off the possibility of having just one script (or at least one function in a script) to do them all. You will probably need to customize what I gave you for each vendor, assuming that they're fairly standard in their formatting. Edited June 8, 2006 by exodius Link to comment Share on other sites More sharing options...
Sorak99 Posted June 8, 2006 Author Share Posted June 8, 2006 (edited) Basically, as time goes on things on these sheets change. The output from the one changed 45 days later. The major problem I am running into is when the rate sheets carry a special announcement or a full page announcement it throws off lining up the text to yesterdays standard.i put an example on the back end of the website again. One is from about 45 days ago, the other text is from today. The Outputs are slightly off and are exacerbated as the entire sheet is unraveled. I planned on minimizing that by creating each page output separately, so that it doesnt affect the data beneath it. But problem still remains. http://www.bluecoasthomeloans.com/George/Exodius/Edit: I dont mind customizing each script, its just that a customized script for an individual pdf/txt will not stay affective as time goes on. As shown by the example I uploaded. I can make a script for a single day, things usually shift within a week.Essentially this:x.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xxx.xx x.xx x.xx x. xxx.xx x.xx x.xx x. xxTurns into thisx.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xx Or with the .csv created by the code there is an extra space in the beginningDoes that make sense? What i eventually did was make an excel book with each investor then a page that showed the same program for everyone on the same sheet. So i wouldnt have to go through the 60 or so investors that we have everyday. But when they move my formulas dont work.George Edited June 8, 2006 by Sorak99 Link to comment Share on other sites More sharing options...
Sorak99 Posted June 8, 2006 Author Share Posted June 8, 2006 Basically, as time goes on things on these sheets change. The output from the one changed 45 days later. i put an example on the back end of the website again. One is from about 45 days ago, the other text is from today. The Outputs are slightly off and are exacerbated as the entire sheet is unraveled. I planned on minimizing that by creating each page output separately, so that it doesnt affect the data beneath it. But problem still remains. http://www.bluecoasthomeloans.com/George/Exodius/Edit: I dont mind customizing each script, its just that a customized script for an individual pdf/txt will not stay affective as time goes on. As shown by the example I uploaded. I can make a script for a single day, things usually shift within a week.Essentially this:x.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xxx.xx x.xx x.xx x. xxx.xx x.xx x.xx x. xxTurns into thisx.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xxx.xx x.xx x.xx x.xx Or with the .csv created by the code there is an extra space in the beginningDoes that make sense? What i eventually did was make an excel book with each investor then a page that showed the same program for everyone on the same sheet. So i wouldnt have to go through the 60 or so investors that we have everyday. But when they move my formulas dont work.GeorgeI had a thought. If i can read the first line and check if its higher than the next line until its less than then with a few rolling counters i can seperate the material into sections, so i can space the begining of each section about 50 lines.8.5 -3.875 -3.75 -3.6258.375 -3.75 -3.625 -3.58.25 -3.5 -3.375 -3.258.125 -3.375 -3.25 -3.1258 -3.125 -3 -2.8757.875 -2.875 -2.75 -2.6257.75 -2.625 -2.5 -2.3757.625 -2.375 -2.25 -2.1257.5 -2.125 -2 -1.8757.375 -1.875 -1.75 -1.6257.25 -1.625 -1.5 -1.3757.125 -1.375 -1.25 -1.1257 -1 -0.875 -0.756.875 -0.75 -0.625 -0.56.75 -0.375 -0.25 -0.1256.625 0 0.125 0.258.125 -4.25 -4.125 -48 -4 -3.875 -3.757.875 -3.875 -3.75 -3.6257.75 -3.625 -3.5 -3.3757.625 -3.5 -3.375 -3.257.5 -3.125 -3 -2.8757.375 -2.875 -2.75 -2.6257.25 -2.5 -2.375 -2.257.125 -2.25 -2.125 -27 -1.875 -1.75 -1.6256.875 -1.5 -1.375 -1.256.75 -1.125 -1 -0.8756.625 -0.75 -0.625 -0.56.5 -0.375 -0.25 -0.1258.5 -3.75 -3.625 -3.58.375 -3.5 -3.375 -3.258.25 -3.375 -3.25 -3.1258.125 -3.125 -3 -2.8758 -3 -2.875 -2.757.875 -2.75 -2.625 -2.57.75 -2.625 -2.5 -2.3757.625 -2.375 -2.25 -2.1257.5 -2.25 -2.125 -27.375 -2 -1.875 -1.757.25 -1.625 -1.5 -1.3757.125 -1.25 -1.125 -17 -0.875 -0.75 -0.6256.875 -0.5 -0.375 -0.256.75 0 0.125 0.25I appreciate your help Exodius, I learned a great deal studying your code.George Link to comment Share on other sites More sharing options...
poje Posted June 9, 2006 Share Posted June 9, 2006 It can be divided into two sub-function:1. PDF to TXT transformationyou can search the XPDF library. It's a free DOS command-line tool.2. TXT to EXCEL transtormationFirst, you need to parse the TXT structure. Language that supports Regular Expression can do this job well, like perl or something else.Then, you need to put the data you like into the Excel file.Through language ODBC support, you can do it. In perl, you can refer to DBI library. Link to comment Share on other sites More sharing options...
exodius Posted June 9, 2006 Share Posted June 9, 2006 1. PDF to TXT transformationyou can search the XPDF library. It's a free DOS command-line tool.This part is true. These tools might as well be utilized and in fact he actually's using one already (like he said above).2. TXT to EXCEL transtormationFirst, you need to parse the TXT structure. Language that supports Regular Expression can do this job well, like perl or something else.Then, you need to put the data you like into the Excel file.Through language ODBC support, you can do it. In perl, you can refer to DBI library.AutoIt is capable of both of these things itself. The changing nature of the documents he's working with just make it difficult to nail down a standard format. So I'm left wondering why you're trying to explain how to do things in any way other than AutoIt, on the AutoIt support forum. Link to comment Share on other sites More sharing options...
Sorak99 Posted June 9, 2006 Author Share Posted June 9, 2006 I do appreciate all the help. I think I will try to strip all information away except for the numbers and periods, then attempt to reorganize it into the tables that it came from. How? No idea. :"> George Link to comment Share on other sites More sharing options...
Sorak99 Posted June 12, 2006 Author Share Posted June 12, 2006 k, Here is what i came up with. #include <File.au3> #include <Array.au3> #include<string.au3> Opt ( "TrayIconDebug", 1 ) Dim $array _FileReadToArray ( "C:\citi.txt", $array ) FileDelete ( "C:\test.csv" ) For $x = 1 TO $array[0] - 1 ToolTip ( $x, 0, 0 ) $SpaceFix = StringStripWS ( $array[$x], 1 ) $SpaceFix = stringreplace($SpaceFix, "...", " ") $SpaceFix = StringReplace ( StringStripWS ( StringStripCR ( $SpaceFix ), 4 ), " ", "," ) $Split = StringSplit ( $SpaceFix, "," ) for $y = 0 to $Split[0] - 1 ToolTip ( $y, 20, 20 ) $z = StringReplace($split[$y], ".", "") $z = StringReplace($z, ")", "") $z = StringReplace($z, "(", "") if stringinstr( $split[$y], ".") > 0 and stringisdigit($z) = 1 Then FileWriteline ("C:\test.csv",$Split[$y] & "," & $y & "," & $x ) EndIf Next Next With the attachment citi.txt , i kept in the $x and $y because i figured they would help me rebuild a table. The only difference between this code, and exodius's code is that this only has the items that are necessary. So this is how far I am so far on trying to rebuild the tables. It doesnt work, but advice on either code would be appreciated. expandcollapse popup#include <Array.au3> Dim $parts Dim $firstentry Dim $prevpartsline $a = 0 $file = FileOpen("C:\test.csv", 0) FileDelete ( "C:\test2.csv" ) ; Check if file opened for reading OK If $file = -1 Then MsgBox(0, "Error", "Unable to open file.") Exit EndIf ; Read in lines of text until the EOF is reached While 1 $line = FileReadLine($file) If @error = -1 Then ExitLoop $parts = stringsplit($line, ",") $a = $a + 1 if $a = 1 Then $parts[1] = $firstentry and $parts[3] = $prevpartsline ContinueLoop EndIf if $prevpartsline = $parts[3] then _stringinsert("," & $parts[1], $firstentry, StringLen($firstentry)) $prevpartsline = $parts[3] FileWriteline ("C:\test2.csv",$firstentry) endif Wend FileClose($file) Please be kind. I have never written anything that loops before... in fact, I think that this is the first time i used IF THEN statements.. I would love all the help i can get, but im afraid of becomin a pest, I cant seem find examples of things i would like to do with Outlook, and IE (Unfortunately the guide for IE, doesn't help I learn from programs examples) Btw, does _filewritetoline work when the line is a variable? Thanks again, George Link to comment Share on other sites More sharing options...
exodius Posted June 12, 2006 Share Posted June 12, 2006 (edited) I'll look at this when I get some time George, I'm not sure if that's going to be tonight though. Edit: So I'm looking at it George, does the code that you put up after your alteration of mine actually work as you posted it? Edited June 12, 2006 by exodius Link to comment Share on other sites More sharing options...
Sorak99 Posted June 12, 2006 Author Share Posted June 12, 2006 I'll look at this when I get some time George, I'm not sure if that's going to be tonight though.Edit: So I'm looking at it George, does the code that you put up after your alteration of mine actually work as you posted it?It does, I get something like this:7.625 1 26-3.968 2 26-3.84 3 26-3.809 4 26-3.679 5 267.625 6 26-3.493 7 26-3.364 8 26-3.314 9 26-3.184 10 267.125 11 26-3.552 12 26-3.446 13 26-3.339 14 267.5 1 27-3.694 2 27-3.57 3 27-3.57 4 27-3.445 5 277.5 6 27-3.151 7 27-3.027 8 27-2.972 9 27-2.847 10 277 11 27-3.281 12 27-3.18 13 27-3.078 14 277.375 1 28-3.348 2 28-3.23 3 28-3.225 4 28-3.106 5 287.375 6 28-3.259 7 28-3.141 8 28-3.08 9 28-2.961 10 286.875 11 28-2.898 12 28-2.803 13 28-2.706 14 287.25 1 29-3.238 2 29-3.125 3 29-3.08 4 29-2.966 5 297.25 6 29-3.032 7 29-2.919 8 29-2.853 9 29-2.739 10 296.75 11 29-2.61 12 29-2.519 13 29-2.427 14 297.125 1 30-3.026 2 30-2.918 3 30-2.878 4 30-2.769 5 307.125 6 30-2.763 7 30-2.655 8 30ect. ect.ect.George Link to comment Share on other sites More sharing options...
exodius Posted June 12, 2006 Share Posted June 12, 2006 So you want the data separated by a single space? Link to comment Share on other sites More sharing options...
Sorak99 Posted June 12, 2006 Author Share Posted June 12, 2006 So you want the data separated by a single space?Well, the actual data is with commas (see attachment), i just copied it off the sheet. Just trying to rebuild the tables, I figure if i only take out the important info, and while im rebuilding the tables i can move them apart the same distance in a new or same file, so avoid the changing nature of the document.George Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now