mdwerne

Search and Replace Engine?

17 posts in this topic

Morning all,

I have some ~20 meg text files that I need to sanitize before sending to a vendor. What I'm wondering is what the most expedient method for doing this would be.

I plan to feed the engine the name of a text file or folder of files and then in an ini have the strings I'm looking for (IP addresses, etc) and then replace the strings with the word "SANITIZED".

Would it make sense to loop through each log file multiple times using StringRegExpReplace for each string I'm looking for, or is there another method that may make more sense?

Because the log files are so large, I'm looking for a method that will execute with as little overhead as possible.

Thanks for your suggestions,

-Mike


I can tell you from personal experience that the search and replace feature takes some time on larger files, and you will want to do several passes to ensure everything is caught. Below is what I use to select a file, prompt for input, and then delete either all lines WITH that text or all lines WITHOUT it. You could easily substitute "Sanitized" for "" in the loop. It also lets you know when it has completed and how long it took. Typically the files I encounter are about 6 MB and take on average 6 minutes on my machine (5 passes):

#include <File.au3>
#include <GUIConstantsEx.au3>

; 1 = the selected file must exist
Global $file_name = FileOpenDialog("Select file", @ScriptDir, "All files (*.*)", 1)
Global $line_text_input = InputBox("Text", "Text to search for")
Global $msg = 0, $begin, $seconds

GUICreate("Find and Replace", 200, 150)
Global $add1 = GUICtrlCreateButton("Delete Lines Containing Text", 10, 20, 175, 30)
Global $add2 = GUICtrlCreateButton("Delete Lines NOT Containing Text", 10, 60, 175, 30)
Global $add3 = GUICtrlCreateButton("Exit", 66, 110, 66, 30)
GUISetState(@SW_SHOW)

While $msg <> $GUI_EVENT_CLOSE
    $msg = GUIGetMsg()
    Select
        Case $msg = $add1
            $begin = TimerInit()
            Loop1()
            $seconds = TimerDiff($begin) / 1000
            MsgBox(0, "Complete", 'Search completed in ' & $seconds & ' seconds.')
            Exit

        Case $msg = $add2
            $begin = TimerInit()
            Loop2()
            $seconds = TimerDiff($begin) / 1000
            MsgBox(0, "Complete", 'Search completed in ' & $seconds & ' seconds.')
            Exit

        Case $msg = $add3
            Exit
    EndSelect
WEnd

; Delete every line that contains the search text.
Func Loop1()
    Local $file_count_lines = _FileCountLines($file_name)
    ; File lines are numbered from 1. Deleting a line shifts the following
    ; lines up, so the line right after a deleted one is skipped on this pass.
    For $i = 1 To $file_count_lines
        Local $Lines_text_output = FileReadLine($file_name, $i)
        If StringInStr($Lines_text_output, $line_text_input) Then
            _FileWriteToLine($file_name, $i, "", 1) ; empty text + overwrite deletes the line
        EndIf
    Next
EndFunc

; Delete every line that does NOT contain the search text.
Func Loop2()
    Local $file_count_lines = _FileCountLines($file_name)
    For $i = 1 To $file_count_lines
        Local $Lines_text_output = FileReadLine($file_name, $i)
        If Not StringInStr($Lines_text_output, $line_text_input) Then
            _FileWriteToLine($file_name, $i, "", 1)
        EndIf
    Next
EndFunc

√-1 2^3 ∑ π, and it was delicious!


StringRegExpReplace should work as efficiently as possible. Read each file as one string $s and loop through your replacements with $s = StringRegExpReplace($s, ...), then write $s to the output file.

If you are looking toward routine, automated use, AutoIt's PCRE will work just fine. 20 MB files aren't that huge.
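
A minimal sketch of the read-once, replace-in-memory, write-once approach described above. The helper name _SanitizeFile, the file names, and the two sample patterns are assumptions made up for illustration; only the built-in calls (FileRead, StringRegExpReplace, FileOpen, FileWrite, FileClose) are real AutoIt functions.

```autoit
; Hypothetical helper: read the file once, run every pattern over the
; in-memory string, then write the result once.
Func _SanitizeFile($sInPath, $sOutPath, $aPatterns)
    Local $sText = FileRead($sInPath)
    If @error Then Return SetError(1, 0, 0)          ; could not read input

    Local $iTotal = 0, $sNew
    For $i = 0 To UBound($aPatterns) - 1
        $sNew = StringRegExpReplace($sText, $aPatterns[$i], "SANITIZED")
        If @error Then Return SetError(2, $i, 0)     ; invalid pattern at index $i
        $iTotal += @extended                         ; replacements made by this pattern
        $sText = $sNew
    Next

    Local $hOut = FileOpen($sOutPath, 2)             ; 2 = write mode, erase previous contents
    If $hOut = -1 Then Return SetError(3, 0, 0)
    FileWrite($hOut, $sText)
    FileClose($hOut)
    Return $iTotal
EndFunc

; Example use (file names and patterns are placeholders).
Global $aPatterns[2] = ["\b\d{1,3}(?:\.\d{1,3}){3}\b", "(?i)\bserver\d+\b"]
Global $iCount = _SanitizeFile("input.log", "sanitized.log", $aPatterns)
If @error Then
    ConsoleWrite("Sanitizing failed, @error = " & @error & @CRLF)
Else
    ConsoleWrite($iCount & " replacement(s) made" & @CRLF)
EndIf
```

The result of each replacement goes into a temporary variable first, so an invalid pattern cannot wipe out the text already processed.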


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


JLogan3o13,

Don't do that!

This is a terrible implementation since you read the very same data multiple times in an utterly inefficient way.

1/ Don't use _FileCountLines, as it needs to read the whole baby and really count lines, then discard the baby. You don't need the line count at all.

2/ Never ever read a text file line by line this way: to access the Nth line, the function has to open the file (again) and read up to line N, then close file...

3/ If I were your PC, I'd go on strike for not agreeing to routinely perform useless tasks :graduated:
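
For comparison, here is a rough single-pass sketch of the same "delete lines containing text" job that avoids both problems: it needs no line count and reads sequentially through a file handle. The function name _DeleteMatchingLines and the file names are hypothetical; it writes to a separate output file rather than editing in place, which is an assumption about what is acceptable.

```autoit
; Hypothetical helper: copy every line that does NOT contain $sNeedle
; from the input file to the output file, in one sequential pass.
Func _DeleteMatchingLines($sInPath, $sOutPath, $sNeedle)
    Local $hIn = FileOpen($sInPath, 0)               ; 0 = read mode
    If $hIn = -1 Then Return SetError(1, 0, 0)
    Local $hOut = FileOpen($sOutPath, 2)             ; 2 = write mode, erase previous contents
    If $hOut = -1 Then
        FileClose($hIn)
        Return SetError(2, 0, 0)
    EndIf

    Local $iDropped = 0, $sLine
    While 1
        $sLine = FileReadLine($hIn)                  ; with a handle: reads the next line
        If @error Then ExitLoop                      ; end of file (or read error)
        If StringInStr($sLine, $sNeedle) Then
            $iDropped += 1                           ; skip this line
        Else
            FileWriteLine($hOut, $sLine)
        EndIf
    WEnd

    FileClose($hIn)
    FileClose($hOut)
    Return $iDropped
EndFunc

; Example use (placeholder names).
Global $iDropped = _DeleteMatchingLines("input.log", "filtered.log", "10.1.2.3")
ConsoleWrite($iDropped & " line(s) removed" & @CRLF)
```

Because the file is opened once and read front to back, each line is touched exactly once and a single pass is enough.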



Thanks for the tip. I didn't say it was my best effort :( You are correct that I should be using something like StringRegExpReplace.



You are correct that I should be using something like StringRegExpReplace.

and you'll see your runtime decrease dramatically (and keep sane and friendly relationship with your PC).


Thank you both for your suggestions, this is a good place to start.

JLogan3o13, I'm confused as to why you need to run through the target file multiple times?

Sounds like you both agree that StringRegExpReplace is the way to go, so I'll start there.

THANKS!!

-Mike


PS: Try to use an RE string that contains multiple search strings.

(Just test it first, to see if it really makes a difference.)

(There is a limit here somewhere; try to stay below a 4000 or 5000 character RE string. I don't remember the exact value.)

$RE_pattern_example_string = "string1|string2|string3"

But if you can't load or process an entire file in one go ... :graduated:
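
As a rough illustration of building such a combined pattern from a list of literal strings: the helper name _BuildAlternation and the sample strings are made up, and wrapping each literal in \Q...\E (PCRE's literal quoting) is an assumption that the strings should match verbatim and never contain "\E" themselves.

```autoit
; Hypothetical helper: join literal search strings into one alternation.
; \Q...\E makes PCRE treat the enclosed text literally, so the dots in
; an IP address do not act as wildcards.
Func _BuildAlternation($aLiterals)
    Local $sPattern = ""
    For $i = 0 To UBound($aLiterals) - 1
        If $i > 0 Then $sPattern &= "|"
        $sPattern &= "\Q" & $aLiterals[$i] & "\E"
    Next
    Return "(?:" & $sPattern & ")"
EndFunc

; Example use with placeholder strings.
Global $aLiterals[3] = ["10.1.2.3", "fileserver01", "jdoe@example.com"]
Global $sPattern = _BuildAlternation($aLiterals)
Global $sIn = "jdoe@example.com logged in to fileserver01 from 10.1.2.3"
ConsoleWrite(StringRegExpReplace($sIn, $sPattern, "SANITIZED") & @CRLF)
```

With a long list, checking StringLen($sPattern) before using it would be one way to stay under whatever the pattern length limit turns out to be.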


"Straight_and_Crooked_Thinking" : A "classic guide to ferreting out untruths, half-truths, and other distortions of facts in political and social discussions."
"The Secrets of Quantum Physics" : New and excellent 2 part documentary on Quantum Physics by Jim Al-Khalili. (Dec 2014)

"Believing what you know ain't so" ...

Knock Knock ...
 


This will give you a start. The expression isn't good as an IP validator but it should be fine for your purposes.

$sStr = "This is some string with the ip address 255.0.0.145 which will be replaced." ;; For your use the string will be replaced with $sStr = FileRead("SomeFile.txt")
$sStr = StringRegExpReplace($sStr, "([0-2]\d{0,3}\.[0-2]\d{0,3}\.[0-2]\d{0,3}\.[0-2]\d{0,3})", "CENSORED")
If @Extended Then
    $hFile = FileOpen(@DesktopDir & "\Result.txt", 2)
    FileWrite($hFile, $sStr)
    FileClose($hFile)
    ShellExecute(@DesktopDir & "\Result.txt")
EndIf

All you have to do is work out the proper expression for anything else you want, for example an email address.

See the PCRE Toolkit in my signature if you need a tool for testing PCRE expressions as used in AutoIt.


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


$sStr = StringRegExpReplace($sStr, "([0-2]\d{0,3}\.[0-2]\d{0,3}\.[0-2]\d{0,3}\.[0-2]\d{0,3})", "CENSORED")

See the PCRE Toolkit in my signature if you need a tool for testing PCRE expressions as used in AutoIt.

This regex won't find an IP address in the string correctly. It will only find an IP address if the first digit in every octet is 2 or less; anything above 2 causes it to fail. For example, 255.35.255.255 will fail.

This might be a better regex for those cases:

$sStr = StringRegExpReplace($sStr, "^([0-2]?\d{0,3}\.[0-2]?\d{0,3}\.[0-2]?\d{0,3}\.[0-2]?\d{0,3})$", "CENSORED")

Thanks to the PCRE toolkit, I finally started to understand how regex works so I could figure out why it was failing in the tests I did using the code you posted. I know practically nothing about RegEx, but I'm trying to learn as I go along.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator


Hey BrewManNH, how many IP addresses did your test string contain? I figure only one? :x

(Reread the RE help on '^' and '$'.)
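
A small sketch of what the hint about '^' and '$' is getting at, using a deliberately simple octet pattern and a made-up test line with two addresses (both are assumptions for the demo, not anything posted earlier in the thread).

```autoit
; Simple four-octet pattern, used only to show the effect of the anchors.
Global $sOctets = "\d{1,3}(?:\.\d{1,3}){3}"
Global $sLine = "from 10.0.0.1 to 192.168.1.20"

; Anchored: the whole subject string would have to be one address,
; so nothing in this line matches.
Global $sAnchored = StringRegExpReplace($sLine, "^" & $sOctets & "$", "SANITIZED")
Global $iAnchored = @extended
ConsoleWrite("anchored:   " & $iAnchored & " replacement(s): " & $sAnchored & @CRLF)

; Unanchored: every address anywhere in the line is replaced.
Global $sFree = StringRegExpReplace($sLine, $sOctets, "SANITIZED")
Global $iFree = @extended
ConsoleWrite("unanchored: " & $iFree & " replacement(s): " & $sFree & @CRLF)
```

The anchored call reports 0 replacements and returns the line unchanged; the unanchored call reports 2 and returns "from SANITIZED to SANITIZED".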

This seems to work:

$sStr = StringRegExpReplace($sStr, "\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b", "SANITIZED")

It's not mine...found it on the web.
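
For readability, the same pattern can be assembled from one octet sub-pattern instead of writing it out four times. This is only a sketch: the variable names and the test line are invented, the capturing groups are swapped for non-capturing ones, and the alternation itself is the one from the pattern above.

```autoit
; One octet (0-255), written once and reused.
Global $sOctet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
Global $sIPv4 = "\b" & $sOctet & "(?:\." & $sOctet & "){3}\b"

; Made-up test line: one valid address, one out-of-range look-alike.
Global $sLine = "ok 192.168.0.1, bad 956.842.45.305"
Global $sOut = StringRegExpReplace($sLine, $sIPv4, "SANITIZED")
Global $iCount = @extended
ConsoleWrite($iCount & " replacement(s): " & $sOut & @CRLF)
```

On this sample only 192.168.0.1 should be replaced; the out-of-range sequence is left alone, which may or may not be what you want when sanitizing.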

Thanks to everyone for all the help!!

-Mike


Hey BrewManNH, how many IP addresses did your test string contain? I figure only one? :P

(Reread the RE help on '^' and '$'.)

Yeah, I forgot about that part when I posted it, mea culpa :x

But still, would it work without the "^" and the "$" in it? Just wondering because I'm still trying to learn regex and it seemed to work for IP addresses with any start digit higher than 2 for me before I added those.



Trying to simplify things for a practical situation: are your typical input files likely to actually contain sequences that look like IP addresses but fall outside the valid IP address range? Things like 956.842.45.305 or 0.0.0.0?

If the answer is "NO, as I know what the file contents look like and such sequences are never found", then a much simpler RE will do: "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" is such a simple pattern. It will allow illegal IP addresses, but does that matter in your case?



But still, would it work without the "^" and the "$" in it?

The '^' and '$' only affect where the RE will look for (or find) the given search pattern (the IP). So they limit the range where the IPs will be found. Without the '^' and '$', the rest of your RE pattern will work fine too.

Just wondering because I'm still trying to learn regex and it seemed to work for IP addresses with any start digit higher than 2 for me before I added those.

That would be a side effect of the way the rest of the pattern acts. I don't know the details, but it is probably related to the fact that the pattern is not very strong(*) when it comes to finding IP addresses, as yours will also see "..." as an IP address (too much '?' use; '?' means {0,1}, i.e. none or one).

*)

This will give you a start. The expression isn't good as an IP validator but it should be fine for your purposes.

- - -

Might as well drop in a more readable version of the pattern mdwerne posted.

"(?x)  \b  (?:  (?:25[0-5] | 2[0-4][0-9] | [01]?[0-9][0-9]?)  \.  ){3}  (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)  \b"

Trying to simplify things for a practical situation: are your typical input files likely to actually contain sequences that look like IP addresses but fall outside the valid IP address range? Things like 956.842.45.305 or 0.0.0.0?

If the answer is "NO, as I know what the file contents look like and such sequences are never found", then a much simpler RE will do: "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}" is such a simple pattern. It will allow illegal IP addresses, but does that matter in your case?

No, all the "IP's" in our log fit the range...no illegals.

I can see that your RegEx is much simpler... I wonder how that (the simpler RegEx) would translate to time savings on a file with 100,000 rows or a million rows to change? In the future, this utility may be used on files that may or may not contain illegal IP strings.

Extending my question a bit... if I'm searching the logs for multiple criteria, would it make more sense to search the entire file for one criterion and then loop through for the second, etc.? Or would it be better/faster to search each line (the buffer) for all the criteria at once? Hope that makes sense.

Thanks for the followup,

-Mike


The problem with replacing several kinds of sensitive data with "NOT FOR YOU" at once is that, as I can imagine, your various fields may have very different structures, like IP, names, login:pswd.

A complex regexp (and regexps can get _really_ complex if you need the most advanced PCRE features exposed here and there) can be difficult to craft and prove a real headache to maintain. In this view, choosing a reasonable tradeoff is desirable.

While alternation, for instance, is easy to understand (and even automate), as in "(this string|that sentence)", possibly with a slightly more complex construct, I would not accept a "one pattern fits all" solution proposed by my programmer using too much "cleverness", as it's much more likely to break down at the first change in input format. From the professional point of view, an obfuscated RE contest is to be banned at any rate. Never forget that at least 2/3 of the time spent in most software activity is maintenance.

So unless runtime is a premium criterion, I'd split things down into manageable chunks. What makes a manageable chunk obviously varies between coders.

Use AutoIt as a RAD tool and try various patterns on significant test files to get a better idea of the runtime needed by each approach, then make your decision based on that rather than on an a priori guess.

Best of luck anyway.
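
One way to run the kind of comparison suggested above is a small timing harness around TimerInit and TimerDiff. Everything specific here is an assumption for the sake of the sketch: the synthetic test data (a stand-in for FileRead of a real log), the two sample patterns, and the variable names. Timings will vary by machine and data, so treat the numbers as indicative only.

```autoit
; Build a synthetic test string in memory (stand-in for a real log file).
Global $sText = ""
For $i = 1 To 20000
    $sText &= "host 10.0.0." & Mod($i, 255) & " user jdoe@example.com line " & $i & @CRLF
Next

; Two sample patterns: a simple IP matcher and a loose e-mail matcher.
Global $sIP = "\b\d{1,3}(?:\.\d{1,3}){3}\b"
Global $sMail = "[\w.+-]+@[\w.-]+\.\w+"

; Approach 1: one pass over the string per pattern.
Global $hTimer = TimerInit()
Global $sOut1 = StringRegExpReplace($sText, $sIP, "SANITIZED")
$sOut1 = StringRegExpReplace($sOut1, $sMail, "SANITIZED")
Global $fSeparate = TimerDiff($hTimer)

; Approach 2: a single pass with both patterns joined by alternation.
$hTimer = TimerInit()
Global $sOut2 = StringRegExpReplace($sText, "(?:" & $sIP & ")|(?:" & $sMail & ")", "SANITIZED")
Global $fCombined = TimerDiff($hTimer)

ConsoleWrite(StringFormat("separate passes: %.1f ms, combined pattern: %.1f ms", $fSeparate, $fCombined) & @CRLF)
```

Swapping the synthetic string for FileRead on a representative log gives numbers that reflect the real input.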

