Complex Text search and deletion help needed

ThomasPowers · October 12, 2012

Hello All,

I have an interesting one that I hope you guys may have some insight on.

I have text files that are generated automatically by one of our systems.

I need to write a script that looks for a line in the text file that starts with a certain sequence...namely 799R

Once found, I need to delete the whole line, the 2 lines above it, and the line below it.

Now...the lines above it will start with a 5 for two lines above, and a 6 for the line directly above

The line below will start with an 8.

So the sample is as such:

5225HTY 3383693141WEBINSTALLATION 12073100010759011651425785

6261220000123456789 0000001747N/A TOUCH 1075901135123456

799R1012345678910123456789 50523569874587859

822512546589754879658965874587458 666999874587445878584

The Script would find the 799R in the 3rd line, and delete the lines above it and the one below it.

Now...here's the catch....

The lines below it may repeat lines 6X and 7X....and the final 8 line may be a couple lines below like this:

5225HTY 3383693141WEBINSTALLATION 12073100010759011651425785

6261220000123456789 0000001747N/A TOUCH 1075901135123456

799R1012345678910123456789 50523569874587859

6261220000155555555 0000001747N/A TOUCH 1075901135125555

799R101234567891012345555 50523569874585555

822512546589754879658965874587458 666999874587445878584

So...the goal is this:

1. Find the 799R line

2. Delete the 2 lines above (starting with 5 and 6)

3. Delete the line or lines below (could be 6, 7, or 8 to start with) Final line must NOT start with a 5 (this denotes a new record entry)

All help is appreciated...and I am willing to PayPal $ as payment for help with this issue.

Thanks

Tom P

jchd · October 13, 2012

Your specs are a little contradictory.

You say first: "Once found, I need to delete the whole line, the 2 lines above it, and the line below it."

Then, you go on explaining that a record consists of a 5* line, a series of {6*, 7*} lines and a 8* line that you all delete if we refer to the sentence above. Or do you keep the 799R* line?

Also, we have no control over what the "final line" will be: only the program producing the data has. So "Final line must NOT start with a 5" seems irrelevant.

Can you post a sample of all possible "record" configurations and say what you want to keep? Only the line headers matter. Do you have 7* lines (records) which are not 799R*?

Are multiple series of 6* and 7* in a record always in ordered pairs (e.g. 5* then one or more groups of {6* then 7*} then 8*), or can we have 5*, 6*, 6*, 6*, 7*, 6*, 6*, 7*, 8*?

I'm wild guessing that a suitable regular expression could do the job in one shot but we have to settle on very precise rules for that.

EDIT: forgot to mention that forum rules preclude offering money for code.

Edited October 13, 2012 by jchd

ThomasPowers · October 13, 2012

Sorry about that...didn't mean to violate the rules of the forum. I apologize for not being better at knowing the rules before posting.

To answer the questions.....

"Then, you go on explaining that a record consists of a 5* line, a series of {6*, 7*} lines and a 8* line that you all delete if we refer to the sentence above. Or do you keep the 799R* line?"

- Each record in this file that we will be concerned with starts with a line with a 5*. We wish to keep all records intact unless they have a 3rd line of 799R in the beginning. If the record has that line, then we wish to delete all lines associated with that record (from the 5* that started the record to the ending 8* line below the 799R line, including the 799R line)....we want all of the record gone.

"Only the line headers matter. Do you have 7* lines (records) which are not 799R*?" - Yes...there may be lines in the file that start with 7, but are not 799R lines...so we would want those to stay.

"Are multiple series of 6* and 7* in a record always in ordered pairs (e.g. 5* then one or more group of {6* then 7*} then 8*, or can we have 5*, 6*, 6*, 6*, 7*, 6*, 6*, 7*, 8* ?" - The order would be 5* 6* 799r* then it could be 6*, 7* or 8*... so we could see 6* and 7* a few times before the final line of the record being an 8*, but we will see the beginning of the record starting with a 5* line, followed by a 6* line, then the 799R line, then any combination of 6* or 7*, then ending with the 8* line.

I concur with your idea that a pretty wild exp will be the answer....it's just that I am in the dark as to the specifics to make it work, especially against the whole file.

All help is greatly appreciated.

If a sample of the file would be helpful, I could take parts of one and paste them up here. I will edit one up tonight and copy it into the forum post.

Thank you for all your insight and help.

TP

Spiff59 · October 13, 2012

An old-school method (untested):

#include <Array.au3>
#include <File.au3>

Global $file = "logs.txt" ; assumed input path; not given in the original post
Global $array, $recordstart, $recordend, $deleteflag

_FileReadToArray($file, $array) ; $array[0] holds the line count
_ArrayDisplay($array)

Local $idx = 0 ; index of the last line kept so far
For $x = 1 To $array[0]
    Switch StringLeft($array[$x], 1)
        Case "5" ; start of a record
            $recordstart = $x
        Case "7"
            If StringLeft($array[$x], 4) = "799R" Then $deleteflag = 1
        Case "8" ; end of a record
            $recordend = $x
    EndSwitch
    If $recordend Then
        If Not $deleteflag Then
            ; Keep the record: copy it down over any deleted lines
            For $y = $recordstart To $recordend
                $idx += 1
                $array[$idx] = $array[$y]
            Next
        Else
            $deleteflag = 0
        EndIf
        $recordend = 0
    EndIf
Next
ReDim $array[$idx + 1]
$array[0] = $idx
_ArrayDisplay($array)

Edited October 13, 2012 by Spiff59

jchd · October 13, 2012

Much clearer now.

Does this work as you want in all cases?

Local $s = FileRead('logs.txt')
Local $t = StringRegExpReplace($s, "(?im)(^5.*\R6.*\R799R.*\R(?:[67].*\R)*8.*\R)", "")
ConsoleWrite($t)

EDIT before I forget: if you're unsure that the last line of a 799R* record ends with a line break, append @CRLF to the data read or make the final \R optional, e.g. \R?
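
The second option in the edit note above (making the last line break optional) might look like the sketch below. The sample data and the variable name $sPattern are invented for illustration; only the pattern itself comes from the post, with the final \R changed to \R? so that a 799R record sitting at the very end of the data, without a trailing line break, is still matched.

```autoit
; Sketch only: the sample lines below are made up, not real records.
; The final \R? makes the line break after the 8* line optional.
Local $sPattern = "(?im)^5.*\R6.*\R799R.*\R(?:[67].*\R)*8.*\R?"

; First record has a 7* line that is not 799R, so it should be kept.
; Second record has a 799R third line and no trailing line break.
Local $s = "5AAA" & @CRLF & "6BBB" & @CRLF & "7CCC" & @CRLF & "8DDD" & @CRLF & _
        "5EEE" & @CRLF & "6FFF" & @CRLF & "799RGGG" & @CRLF & "8HHH"

Local $t = StringRegExpReplace($s, $sPattern, "")
ConsoleWrite($t)
```

With this sample, only the first record (the 5AAA to 8DDD lines) should remain in $t.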

Edited October 13, 2012 by jchd

ThomasPowers · October 15, 2012

This appears to be working as we hoped, and I appreciate all your insight. I was able to get my regexp to do the simple 4 line ones...but was stumped on the whole extra possible 6 and 7 lines after the 799R.

We are testing this on multiple records and files now.

What would I add to this to get the records that we are removing from the file dumped off to another file, so I could easily see what is being removed?

Right now it's a line by line comparison of the old file to the new file...which can get cumbersome as some of these files have 10000 lines.

You have really saved us on this and I cannot thank you enough

TP

Edited October 15, 2012 by ThomasPowers

ThomasPowers · October 15, 2012

OK...while reading through some other forum posts...I have gotten closer to the finished product. Many thanks to JCHD.

My goal has shifted a bit now as we have a regexp string that works.

I am looking to load all matches of the RegExp

"(?im)(^5.*\R6.*\R799R.*\R(?:[67].*\R)*8.*\R)"

into an array and save them off to another file, then use the StringRegExpReplace that jchd created to output to a different file (leaving the parent file untouched).

So Far...I have accomplished that with this code

Local $s = FileRead('c:\testfile.txt')
Local $array = StringRegExp($s, "(?im)(^5.*\R6.*\R799R.*\R(?:[67].*\R)*8.*\R)", 3)
For $i = 0 To UBound($array) - 1
    FileWrite('c:\output\output1.txt', $array[$i])
Next
Local $t = StringRegExpReplace($s, "(?im)(^5.*\R6.*\R799R.*\R(?:[67].*\R)*8.*\R)", "")
FileWrite('c:\output\output2.txt', $t)

This is working great...except for one small thing.

The Filewrite command for output2.txt is appending a line at the end of the file and putting a 1 in it.

An example is here:

The last lines of the adjusted file in output2.txt should be:

5271ABCDEFDD 0075901134OUTPUTNOTREAL 1512101500010759011111111111

6220712324543234455555 000001234567SAMPLETEXT 0075901157898733

But instead the output2 file has this:

5271ABCDEFDD 0075901134OUTPUTNOTREAL 1512101500010759011111111111

6220712324543234455555 000001234567SAMPLETEXT 0075901157898733

1

All help is appreciated....we are almost there!!

TP

kylomas · October 15, 2012

ThomasPowers,

One thing to be aware of is that when you use filewrite in this fashion it APPENDS whatever you are writing to the end of the file. To make sure that you are getting what you expect from the regexpreplace use a consolewrite at the end of your code like this

consolewrite('+> value of $t = ' & $t & @lf)

kylomas

ThomasPowers · October 15, 2012

Kylomas,

Thank you for pointing that out...I can see where I could run into trouble in the future on that. I'll make sure to use a fileopen command with a 2 parameter to denote overwrite.
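
A minimal sketch of that overwrite approach, assuming a throwaway file in the temp folder: the path, the variable names and the sample text are all hypothetical. FileOpen with mode 2 erases any previous contents, so running the script twice leaves one copy of the data rather than two.

```autoit
; Sketch only: hypothetical output path and sample text.
Local $sOutPath = @TempDir & "\output2_sketch.txt"
Local $t = "5AAA" & @CRLF & "6BBB" & @CRLF

; Mode 2 = write mode that erases the previous contents of the file.
Local $hFile = FileOpen($sOutPath, 2)
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sOutPath & @CRLF)
    Exit 1
EndIf

FileWrite($hFile, $t) ; write through the handle, not the file name
FileClose($hFile)
```

Passing the file name straight to FileWrite, as in the earlier code, appends to whatever is already in the file, which is why stale output from a previous run can pile up.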

Following your suggestion...

When we put the

ConsoleWrite($t)

at the end of the code (instead of the Filewrite) it shows an output of the last 2 lines of

5271ABCDEFDD 0075901134OUTPUTNOTREAL 1512101500010759011111111111

6220712324543234455555 000001234567SAMPLETEXT 0075901157898733

1>Exit code: 0 Time: 0.230

Since it puts the Exit Code right after the 1, I didn't see that at first.

So it's the StringRegExpReplace line doing the addition of the 1 on the last line.

Any idea why?

TP

kylomas · October 15, 2012

ThomasPowers,

As I suspected; however, prolonged exposure to regular expressions may cause dizziness or vomiting, so I cannot help you there. You will receive help soon, be patient!

Good Luck,

kylomas

jchd · October 16, 2012

I can't reproduce this behavior with a dummy sample. Can you post a short sample input text that behaves the way you mention (adding a line containing '1')?

ThomasPowers · October 16, 2012

You can disregard the line adding 1 to the end of the parse...turns out that the file we were using was corrupted in testing. When we used fresh files, everything worked great. When we rerun the file that we got the 1 on the end, it works great as well...it's just the original test copy of the file we were using was bad. The program we generate these from couldn't read our test file anymore either...so the file was the problem.

Everyone here has been great and we really appreciate the help. We will continue to put this through its paces this week and if anything comes up...we'll post back.

Thanks again!!

TP
