Jump to content

AutoIT as a data extraction tool


Recommended Posts

Hello all,

Ive been using AutoIT for a few years now and Ive run into a project where Im not sure AutoIT is the right solution. It could be that Im not using AutoIT as effectively as I possible. Anyway, let me tell you a bit about what Im trying to do and how Ive approached it with AutoIT:

I work at a hospital and my project involves a daily report from our patient information system. A Long Time Ago, this report was printed on a line printer. Now, of course, the report must be handled electronically. If you open the report (its a .txt file) it each page is delineated by an ASCII form feed. Each page has a nice header with the patients information.

My mission is to take the original report and divide it up into two files based on a piece of information called Financial Class, which is included in the header.

Heres how Ive approached it with AutoIT: I have a loop that (using FileReadLine) looks for the ASCII form feeds. I use the location of the form feeds to determine a page. Once I get a range of lines defined (e.g. lines 59 thru 117=one page), I pass that information to a function. This function (another loop using FileReadLine) finds the header line within the page and using StringMid, extracts the Financial Class. Then using a case statement, I output lines 59 through 117 using FileWriteLine to the appropriate file. Then its back to the first loop to find the next page and so on.

Ive successfully written a script that does this, but its performance is not ideal. Many of these reports I need to process are lengthy the one Im working with for testing is 350 pages. It takes my script an hour to process it.

My questions are:

1) Am I using AutoIT optimally here in other words, can I write a better script that doesnt have the performance issues (maybe one that doesnt need to read every single line)?

2) Should I be looking at another programming language (perl maybe?) or program to do this?

Thanks in advance,

Kate

Link to comment
Share on other sites

  • Moderators

If you can provide a "dummy" txt file, and show the exact output you are looking for, myself/someone will more than likely show you how to effectively do it with StringRegExp().

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Link to comment
Share on other sites

If you can provide a "dummy" txt file, and show the exact output you are looking for, myself/someone will more than likely show you how to effectively do it with StringRegExp().

Sure...at the bottom of this post are two dummy pages of the original - the only change I made was to label the form feeds since they didn't show up with the paste.

As for output? The output formatting will be the same. What I need my script to do is look at the header:

PT NAME            PATIENT NO   PT FC REP RC REASON  PRODUCED HOSP PAGE

...and look at the number under FC (in my code example below, the FC on the first page is 7 and on the second page it's 6.) Any page with FC=7 will need to be output to one file and any page with FC=6 will need to be output to another file. Again, these reports can be lengthly, so the script will need to append to the output files as it goes. Does that make sense?

My script does this already, but it just isn't very efficient. :|

Thanks,

Kate

PT NAME            PATIENT NO   PT FC REP RC REASON  PRODUCED HOSP PAGE
XXXXXXXXXXXXXXXXXXXXXXXXX 99999999999-9 X  7 999 X XXXXXXXX 99/99/99 XXXX 9999
ADM  99/99/99  MR# XXXXXXXXXXXX  BD 99/99/99  SEX X  SS# 999-99-9999 DISP XXX
DSCH 99/99/99  DR XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX MAR ST XXXXXXXXXXXX
_______________________________________________________________________________
GUARANTOR# 9999999999 PH1 99999999 999-9999 X99999 PH2 99999999 999-9999 X99999
XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XX  99999-9999  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
_______________________________________________________________________________
EMP NAME XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX     PHONE 99999999 999-9999 X99999
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XX  99999-9999  CONTACT XXXXXXXXXXXXXXXXXXXXXXX
_______________________________________________________________________________
FBL 99/99/99 CR X CA X  DIST XXX  STMTS 999  ST CD X  CTRCT AMT 9999.00  PER X
   XXXXXXXXXXXXXXX   XXXXXXXXXXXXXXX   XXXXXXXXXXXXXXX  XXXXXXXXXXXXXXX
     9999,999.00       9999,999.00     9999,999.00      9999,999.00
XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXX
  9999,999.00    9999,999.00     9999,999.00     9999,999.00     999,999.00
_______________________________________________________________________________
XX UNBL9999,999.00  XX UNBL9999,999.00  XX UNBL9999,999.00  XX UNBL9999,999.00
            PT UNBL9999,999.00  TOT CHGS UNBILLED 9999,999.00   CURRENT BALANCE
            PT CURR BALANCE                                     9999,999.00
I1  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
I2  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
I3  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
I4  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
                                          TOT CURR ACCOUNT BAL: 9999,999.00
_______________________________________________________________________________
POSTDATE SVC DATE SVC CD      DESCRIPTION                AMOUNT   BALANCE
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
<THERE IS A FORM FEED HERE>



    PT NAME            PATIENT NO   PT FC REP RC REASON  PRODUCED HOSP PAGE
XXXXXXXXXXXXXXXXXXXXXXXXX 99999999999-9 X  6 999 X XXXXXXXX 99/99/99 XXXX 9999
ADM  99/99/99  MR# XXXXXXXXXXXX  BD 99/99/99  SEX X  SS# 999-99-9999 DISP XXX
DSCH 99/99/99  DR XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX MAR ST XXXXXXXXXXXX
_______________________________________________________________________________
GUARANTOR# 9999999999 PH1 99999999 999-9999 X99999 PH2 99999999 999-9999 X99999
XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XX  99999-9999  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
_______________________________________________________________________________
EMP NAME XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX     PHONE 99999999 999-9999 X99999
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX  XX  99999-9999  CONTACT XXXXXXXXXXXXXXXXXXXXXXX
_______________________________________________________________________________
FBL 99/99/99 CR X CA X  DIST XXX  STMTS 999  ST CD X  CTRCT AMT 9999.00  PER X
   XXXXXXXXXXXXXXX   XXXXXXXXXXXXXXX   XXXXXXXXXXXXXXX  XXXXXXXXXXXXXXX
     9999,999.00       9999,999.00     9999,999.00      9999,999.00
XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXX
  9999,999.00    9999,999.00     9999,999.00     9999,999.00     999,999.00
_______________________________________________________________________________
XX UNBL9999,999.00  XX UNBL9999,999.00  XX UNBL9999,999.00  XX UNBL9999,999.00
            PT UNBL9999,999.00  TOT CHGS UNBILLED 9999,999.00   CURRENT BALANCE
            PT CURR BALANCE                                     9999,999.00
I1  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
I2  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
I3  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
I4  X  XXX  XXXXXXXXXXXXXXXXXX  XXXXXXXXXXXXXXXXXXXX   XXXXXXXX 9999,999.00
                                          TOT CURR ACCOUNT BAL: 9999,999.00
_______________________________________________________________________________
POSTDATE SVC DATE SVC CD      DESCRIPTION                AMOUNT   BALANCE
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 9999999 XXXXXXXXXXXXXXXXXXXXXXXXXX  9999,999.00  9999,999.00
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
99/99/99 99/99/99 *    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 999999
<THERE IS A FORM FEED HERE>
Link to comment
Share on other sites

  • Moderators

Fun times:

;Param 1 = File to read and parse for info
;Param 2 = Output file if 6 is found
;Param 3 = Output file if 7 is found
_OutPutFiles('MainFileToRead.log', 'If_6_Is_Found_Out_FileWrite_To.log', 'If_7_Is_Found_Out_FileWrite_To.log')

Func _OutPutFiles($sString, $sSixOutFile, $sSevenOutFile)
    $sString = FileRead($sString)
    ;Parse Main File with Form Feed
    Local $aGroupHeaders = StringRegExp($sString, '(?s)(?i)(\s{4}PT NAME\s+PATIENT NO.*?)\f', 3)
    For $iCC = 0 To UBound($aGroupHeaders) -1
        $aFC = StringRegExp($aGroupHeaders[$iCC], '(?s)(?i)\s+PT NAME\s+PATIENT NO.*?\r\n.*?\s\d+-*.*?\s{2}(\d+)\s', 3)
        If IsArray($aFC) Then
            If $aFC[0] = '6' Then
                FileWrite($sSixOutFile, $aGroupHeaders[$iCC] & @CRLF & Chr(12))
            ElseIf $aFC[0] = '7' Then
                FileWrite($sSevenOutFile, $aGroupHeaders[$iCC] & @CRLF & Chr(12))
            Else
                FileWrite('OutFileErrorLog.log', '--------- Start Error (Wrong Number Found) ----------- ' & _
                    @MON & '/' & @MDAY & '/' & @YEAR & ':' & @HOUR & ':' & @MIN & ':' & @SEC & @CRLF & _
                    $aGroupHeaders[$iCC] & @CRLF & '--------- End Error -----------' & @CRLF)
                EndIf
        Else
            FileWrite('OutFileErrorLog.log', '--------- Start Error (No Number Found) ----------- ' & _
                @MON & '/' & @MDAY & '/' & @YEAR & ':' & @HOUR & ':' & @MIN & ':' & @SEC & @CRLF & _
                $aGroupHeaders[$iCC] & @CRLF & '--------- End Error -----------' & @CRLF)
        EndIf
    Next
    Return ''
EndFunc

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Link to comment
Share on other sites

Wow, that's awesome, thank you. I've only just begun reading about regular expressions and am quite glad to have help using them with this file...it's not safe for regular expression beginners...

If I may summarize what your script does to make sure I understand:

1. The input and two output files are defined and passed to a function.

2. The function begins by storing the whole input file as one big string in a variable ($sString)

3. Then StringRegExp goes hunting for the headers in $sString, storing the array of matches in the $aGroupHeaders variable.

4. The For loop begins processing all the entries in the $aGroupHeaders array one at a time

5. StringRegExp then goes hunting for the FC number and stores the result in $aFC

6. An If/Then statement handles output to each file depending on the value of $aFC and error handling ensues.

The only part I don't understand is how does the script know to output the entire page and not just the header line?

Thank you again,

Kate

Link to comment
Share on other sites

  • Moderators

Wow, that's awesome, thank you. I've only just begun reading about regular expressions and am quite glad to have help using them with this file...it's not safe for regular expression beginners...

If I may summarize what your script does to make sure I understand:

1. The input and two output files are defined and passed to a function.

2. The function begins by storing the whole input file as one big string in a variable ($sString)

3. Then StringRegExp goes hunting for the headers in $sString, storing the array of matches in the $aGroupHeaders variable.

4. The For loop begins processing all the entries in the $aGroupHeaders array one at a time

5. StringRegExp then goes hunting for the FC number and stores the result in $aFC

6. An If/Then statement handles output to each file depending on the value of $aFC and error handling ensues.

The only part I don't understand is how does the script know to output the entire page and not just the header line?

Thank you again,

Kate

$aGroupHeaders doesn't just store the header, it stores from the header to the feed. If you look at when I write the group to it's chosen file, I add Chr(12) back in it (the feed).

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...