new to regular expressions..

sulfurious · December 15, 2008

Hi. I have been putting off learning regular expressions for some reason lol.

I have looked at code snippets and the help file, but I don't completely understand it. Actually, I am totally lost.

A simple learning project then. Take some strings from a file, like this:

--2008-12-15 11:35:30-- Event 1
--2008-12-15 11:36:30-- Event 2
--2008-12-15 11:37:30-- Event 3
--2008-12-15 11:38:30-- Event 4

Now, this seems just the thing to use a regexp with. But the log file may get truncated from time to time for some reason, like this

--2008-12-15 11:35:30-- Event 1
--2008-12-15 11:36:30-- Event 2 --2008-12-15 11:37:30-- Event 3
--2008-12-15 11:38:30-- Event 4

So now I want to parse the log, and a simple _FileReadToArray() works for that. Stepping through the array, I would use StringTrimLeft() to strip everything up to the last --, like this:

StringTrimLeft($arLOG[$x],StringInStr($arLOG[$x],'--',0,-1)+3)

But when truncation happens, this no longer works. So I try this:

StringInStr($arLOG[$x],StringRegExp($arLOG[$x],"(?i)--\d+++-\d+-\d+\s\d+:\d+:\d+--",0))

But obviously, to those who know about regexp, this fails.

Can anyone explain how this works and where my errors are?

Thank you.

Sul.

andybiochem · December 15, 2008

Hi. I have been putting off learning regular expressions for some reason lol.

That's because they were invented by the devil.

SmokeN will be able to give you a nice and concise RegExp solution to this.

The best I can do is beat it to death with String functions.

Here's a solution:

#include <Array.au3>

Global $arLOG[4]
$arLOG[1] = "--2008-12-15 11:35:30-- Event 1"
$arLOG[2] = "--2008-12-15 11:36:30-- Event 2 --2008-12-15 11:37:30-- Event 3"
$arLOG[3] = "--2008-12-15 11:38:30-- Event 4"

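$string = ""
For $i = 1 to (UBound($arLOG) - 1)
    ; StringReplace() is used only as a counter: @extended holds the number of "--2008" hits
    StringReplace($arLOG[$i],"--2008","")
    If @extended = 2 Then
        ; two stamps on one line: split the line at the last stamp
        $StartPos = StringInStr($arLOG[$i],"--2008",0,-1)
        $string &= StringMid($arLOG[$i],1,$StartPos - 1) & "|"
        $string &= StringMid($arLOG[$i],$StartPos) & "|"
    Else
        $string &= $arLOG[$i] & "|"
    EndIf
Next

; rebuild the array from the "|"-delimited string (element 0 holds the count)
$arLOG = StringSplit(StringTrimRight($string,1),"|")
If $string = "" Then Dim $arLOG[1]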

_ArrayDisplay($arLOG)

Good luck

PsaltyDS · December 15, 2008

But when truncation happens, this no longer works. So I try this:

StringInStr($arLOG[$x],StringRegExp($arLOG[$x],"(?i)--\d+++-\d+-\d+\s\d+:\d+:\d+--",0))

But obviously, to those who know about regexp, this fails.

Can anyone explain how this works and where my errors are?

Thank you.
Sul.

You set a flag of zero on your StringRegExp(), which means it returns just 0 or 1 depending on whether any match was found. If you set the flag to anything other than 0, it returns an array. So it doesn't make sense to nest the StringRegExp() inside the StringInStr() function.
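
To make the flag difference concrete, here is a minimal sketch. The variable names and the sample line (including the example.com URL) are made up for illustration. With flag 0 the function only answers yes or no; with flag 1 it returns an array of the captured groups, so wrapping the whole pattern in parentheses hands back the matched text, which StringInStr() can then locate.

```autoit
; Sketch only: $sLine and $sPattern are illustrative names, not from the original script.
Local $sLine = "3072K ........--2008-12-15 11:35:40--  http://example.com/file.exe"
Local $sPattern = "(--\d{4}-\d+-\d+\s+\d+:\d+:\d+--)"

; Flag 0: returns 1 if the pattern matches anywhere in the string, 0 if not.
Local $iFound = StringRegExp($sLine, $sPattern, 0)
ConsoleWrite("Match found: " & $iFound & @CRLF)

; Flag 1: returns an array of the captured groups from the first match.
Local $aMatch = StringRegExp($sLine, $sPattern, 1)
If @error = 0 Then
    ; $aMatch[0] holds the text matched by the first (and only) group.
    Local $iPos = StringInStr($sLine, $aMatch[0])
    ConsoleWrite("Matched text: " & $aMatch[0] & @CRLF)
    ConsoleWrite("Position in line: " & $iPos & @CRLF)
EndIf
```

The position comes from StringInStr(), not from the regexp itself; the regexp only supplies the exact text to look for.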

oMBRa · December 15, 2008

try my solution:

#include <Array.au3>
 $string = '--2008-12-15 11:35:30-- Event 1' & _
'--2008-12-15 11:36:30-- Event 2 --2008-12-15 11:37:30-- Event 3' & _
'--2008-12-15 11:38:30-- Event 4'

$troncated = StringRegExp($string, '--.*?Event..', 3)
_ArrayDisplay($troncated)

sulfurious · December 15, 2008

#include <Array.au3>
 $string = '--2008-12-15 11:35:30-- Event 1' & _
'--2008-12-15 11:36:30-- Event 2 --2008-12-15 11:37:30-- Event 3' & _
'--2008-12-15 11:38:30-- Event 4'

$troncated = StringRegExp($string, '--.*?Event..', 3)
_ArrayDisplay($troncated)

Okay, that works to get each truncation into correct array sequence. I was not looking for that, but that is very nice to know.

You set a flag of zero on your StringRegExp(), which means it returns just 0 or 1 depending on whether any match was found. If you set the flag to anything other than 0, it returns an array. So it doesn't make sense to nest the StringRegExp() inside the StringInStr() function.

I don't understand. StringInStr() returns the index position of the instance found doesn't it? So I used a flag of 0 in the stringregexp, because I did not think I wanted an array. However, I am not sure exactly what to do with a 1 or a 0 that the regexp does return. How would it make sense to return an array with StringRegExp for StringInStr???

@andybiochem, thanks for the routine. Not exactly the way I would do it, but very similar. I would gladly like for SmokeN to comment on this. I could do this with string manipulation, but I thought this would be a good time to put the regexp to work.

No one has commented on this, though: is the syntax of the regexp correct? For instance, when you have a date, like 2008, would the correct syntax be \d+++ ? What exactly do you do with a regexp anyway? Do you test it, or return what it finds and then handle the array it returns, for instance? Are there no functions that return the index using a regexp, like a StringInStrRegExp or something?

Thanks a lot for the replies !

Sul.

SmOke_N · December 15, 2008

I found it kind of hard to understand exactly what you were trying to do.

Can you show an exact output you'd like to receive (Just a one line example)?

Edit:

This is how I understood what you wanted... if you read the entire file with FileRead() (which would be $s_string in my example) rather than doing _FileReadToArray().

#include <Array.au3>
Local $s_string = '--2008-12-15 11:35:30-- Event 1' & _
'--2008-12-15 11:36:30-- Event 2 --2008-12-15 11:37:30-- Event 3' & _
'--2008-12-15 11:38:30-- Event 4'

Local $s_pattern = "--(\d{4}-\d+-\d+\s+\d+:\d+:\d+)--"
Local $a_sre = StringRegExp($s_string, $s_pattern, 3)
_ArrayDisplay($a_sre)

That is, if you only wanted the dates and times... You surround the part of the query you want returned with parentheses.
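
As a small extension of the same idea, the sketch below uses two sets of parentheses so that each match returns both the timestamp and the event label. The pattern and variable names here are illustrative assumptions, not part of the post above; the sample text mirrors the Event lines from the first post.

```autoit
; Sketch only: sample text mirrors the Event lines from the first post.
Local $sText = "--2008-12-15 11:35:30-- Event 1" & @CRLF & _
        "--2008-12-15 11:36:30-- Event 2 --2008-12-15 11:37:30-- Event 3" & @CRLF & _
        "--2008-12-15 11:38:30-- Event 4"

; Group 1 captures the timestamp, group 2 captures the event label.
Local $sPattern = "--(\d{4}-\d+-\d+\s+\d+:\d+:\d+)--\s*(Event\s+\d+)"

; Flag 3 returns every match in one flat array: stamp, label, stamp, label, ...
Local $aPairs = StringRegExp($sText, $sPattern, 3)
If @error = 0 Then
    For $i = 0 To UBound($aPairs) - 1 Step 2
        ConsoleWrite($aPairs[$i] & " => " & $aPairs[$i + 1] & @CRLF)
    Next
EndIf
```

Anything matched outside the parentheses (the dashes, the spacing) is used to find the match but is not returned.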

Edited December 15, 2008 by SmOke_N

sulfurious · December 15, 2008

Gladly.

I have a log file for wGet, written in append mode. To view the log as a composite, I want to parse out which files have been started and, if a download was interrupted, what percentage it had reached. Here is an example of the wGet log file:

--2008-12-15 11:35:30--  http://largedownloads.ea.com/pub/patches/BF2142/1.05/BF2142_Patch_v1.05.exe
Resolving largedownloads.ea.com... 159.153.197.74, 159.153.197.99
Connecting to largedownloads.ea.com|159.153.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22822709 (22M) [application/x-msdownload]
Saving to: `c:/Documents and Settings/Sul/My Documents/My Downloads/BF2142_Patch_v1.05.exe'

     0K ........ ........ ........ ........ ........ ........ 13%  369K 52s
  3072K ........--2008-12-15 11:35:40--  http://largedownloads.ea.com/pub/patches/BF2142/1.05/BF_2142_Server.exe
Resolving largedownloads.ea.com... 159.153.197.74, 159.153.197.99
Connecting to largedownloads.ea.com|159.153.197.74|:80... connected.
HTTP request sent, awaiting response...  ..200 OK
Length: 89459943 (85M) [application/x-msdownload]
Saving to: `c:/Documents and Settings/Sul/My Documents/My Downloads/BF_2142_Server.exe'
.....
     0K ... ............. ... ............. ... .............. . ........ ........ ........  3%  224K 6m16s
  3072K .......--2008-12-15 11:36:13--  http://largedownloads.ea.com/pub/patches/BF2142/1.05/ReadmeServer.txt
Resolving largedownloads.ea.com... 159.153.197.74, 159.153.197.99
Connecting to largedownloads.ea.com|159.153.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32077 (31K) [text/plain]
Saving to: `c:/Documents and Settings/Sul/My Documents/My Downloads/ReadmeServer.txt'

     0K                                                   100%  166K=0.2s

2008-12-15 11:36:13 (166 KB/s) - `c:/Documents and Settings/Sul/My Documents/My Downloads/ReadmeServer.txt' saved [32077/32077]

--2008-12-15 11:37:10--  http://largedownloads.ea.com/pub/patches/battlefield_vietnam_server_incremental_patch_v1.2_to_v1.21.exe
Resolving largedownloads.ea.com... 159.153.197.74, 159.153.197.99
Connecting to largedownloads.ea.com|159.153.197.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4267663 (4.1M) [application/x-msdownload]
Saving to: `c:/Documents and Settings/Sul/My Documents/My Downloads/battlefield_vietnam_server_incremental_patch_v1.2_to_v1.21.exe'

     0K ........ ........ ........ ........ ........ ........ 73%  370K 3s
  3072K ........ ........ .                               100%  375K=11s

2008-12-15 11:37:22 (371 KB/s) - `c:/Documents and Settings/Sul/My Documents/My Downloads/battlefield_vietnam_server_incremental_patch_v1.2_to_v1.21.exe' saved [4267663/4267663]

--2008-12-15 11:42:36--  http://largedownloads.ea.com/pub/patches/BF2142/1.05/BF_2142_Server.exe
Resolving largedownloads.ea.com... 159.153.197.74, 159.153.197.99
Connecting to largedownloads.ea.com|159.153.197.74|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 89459943 (85M), 85791259 (82M) remaining [application/x-msdownload]
Saving to: `c:/Documents and Settings/Sul/My Documents/My Downloads/BF_2142_Server.exe'

        [ skipping 3072K ]
  3072K ,,,,,,,. ........ ........ ........ ........ ........  7%  443K 3m40s
  6144K .......

First, you can see that sometimes the date gets truncated into the line above it. Second, you can see that on resuming a download, there is no appending sequentially, but only to the end of the log.

So I was thinking of reading the log to an array, then stepping through it searching for the dates. I am unsure if the syntax of the log will remain the same for other types of transfers or options with wGet, so I thought maybe using regular expressions on the date would be a good way to future-proof the script.

I was not worrying about the date so much, as it is just a marker for the file being downloaded. I also know it marks the line beneath it, the address resolution, which I want to keep. Capturing the file names then allows me to go back through the array and look for completion markers. Etc etc.

I did not get very far with the code because I realized I should try some regular expressions on this to avoid future re-coding.

Here is what I started with up to that point:

#include <file.au3>
#include <array.au3>

$logDir = "C:\Documents and Settings\Sul\My Documents\My Downloads"
$logFile = '\wGet_log.txt'

Dim $arLOG
If FileExists($logDir & $logFile) Then; open log file
    $fhLOG = FileOpen($logDir & $logFile,0)
    _FileReadToArray($logDir & $logFile,$arLOG)
Else
    Exit
EndIf

; get into array the file name and ip address resolved
For $x = 1 To $arLOG[0]
    If StringInStr($arLOG[$x],'--200') Then
        $ll = StringInStr($arLOG[$x],StringRegExp($arLOG[$x],"(?i)--\d+++-\d+-\d+\s\d+:\d+:\d+--",0))
          MsgBox(0,'position',$ll)
;~      $file2get = StringTrimLeft($arLOG[$x],StringInStr($arLOG[$x],'--',0,-1)+3)
    EndIf
Next

_ArrayDisplay($arFiles)

Does this make sense?

Thanks.

Sul.

SmOke_N · December 15, 2008

Did this make sense? I'm sure for someone ... Maybe we should approach this in a different manner.

1. Let's fix the missing carriage returns and line feeds:

Local $s_string = FileRead($logDir & $logFile)

Local $s_pattern = "((?![\r\n\A]))(--\d{4}-\d+-\d+\s+\d+:\d+:\d+--)"
Local $s_sre = StringRegExpReplace($s_string, $s_pattern, "\1" & @CRLF & "\2")

That should fix your truncated lines.

Now you can do whatever it was you were doing... as $s_sre should be properly setup now.
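
One caveat for anyone running the pattern above: (?!...) is a lookahead. It inspects the character that follows the current position and consumes nothing, and \A is not treated as an anchor inside square brackets. Since the character that follows is always the first "-" of the date stamp, the test always passes, group 1 comes back empty, and a CRLF is inserted in front of every stamp, including stamps that already start a line. A possible variant, offered as a sketch under those assumptions rather than as the method posted above, captures the preceding character directly. The sample text and file names are made up.

```autoit
; Sketch only: the sample text is made up to show one run-together stamp.
Local $sLog = "--2008-12-15 11:35:30--  first.exe" & @CRLF & _
        "  3072K ........--2008-12-15 11:35:40--  second.exe" & @CRLF

; Group 1: one character that is neither CR nor LF, so stamps that already
; start a line (or start the file) are left alone.
; Group 2: the date stamp itself.
Local $sPattern = "([^\r\n])(--\d{4}-\d+-\d+\s+\d+:\d+:\d+--)"

; Put the captured character back, add a line break, then the stamp.
Local $sFixed = StringRegExpReplace($sLog, $sPattern, "\1" & @CRLF & "\2")
ConsoleWrite($sFixed)
```

Because group 1 must match a real character, a stamp at the very start of the file has nothing in front of it to capture and is skipped, which is the desired behaviour here.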

sulfurious · December 15, 2008

Ok. So that is a proper setup to acquire the date expression. lol, nowhere near what I had.

I can use some string manips to parse it out. But can the regexp not help me more? I know that the beginning of a download always appends the --date-- structure. Is it possible for StringRegExp to be used to find the index of its match in a string? I now have the truncation sorted out, but can regexp be used now to help 'trim' the date structure off the url?

So that, StringTrimLeft(<string>, <some form of regexp to return the index of the match>)

I know how to use StringTrimLeft(<string>,StringInStr(<string>,<search string>))

but I was hoping there was something like

StringTrimLeft(<string>,StringInStr(<string>,StringRegExp(<string>,<match>)))

or

StringTrimLeft(<string>,StringRegExp(<string>,<match>))

it would not seem so, but I know diddly about regexp's.

Thanks for your time btw.

Sul.

Szhlopp · December 15, 2008

First: I have a regex tester in my sig you could use to learn SRE.

Second: Give me an example of what it is you want done.

"but can regexp be used now to help 'trim' the date structure off the url?"

Like this?

Before: --2008-12-15 11:35:30-- http://largedownloads.ea.com/pub/patches/BF2142/1.

After: http://largedownloads.ea.com/pub/patches/BF2142/1.

SREReplace -

--[0-9]{1,}-[0-9]{1,}-[0-9]{1,}\s?\d?\d?:\d?\d?:\d?\d?--\s?\s?

You can pretty much get anything you want with SRE...

Resolving text

Resolving\s?(.*)

Three-element array of the connecting text:

(?i)connecting to\s?(.*?)\|(.*?)\|(?:.*?)\.\.\.\s?(.*)

HTTP response:

awaiting response\.\.\.\s?(.*)

Length, two-element array (bytes / shortened):

(?i)length:\s?(\d*)\s?\((.*?)\)
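
A minimal sketch of how such patterns might be put to work in a script. The two sample lines are copied from the wget log earlier in the thread; the variable names, and the use of ^ and \d{4} in the date pattern, are illustrative choices rather than part of the patterns listed above.

```autoit
; Sketch only: sample lines copied from the wget log earlier in the thread.
Local $sLine = "--2008-12-15 11:36:13--  http://largedownloads.ea.com/pub/patches/BF2142/1.05/ReadmeServer.txt"
Local $sLength = "Length: 32077 (31K) [text/plain]"

; Replace the leading date stamp (and any spaces after it) with nothing.
Local $sUrl = StringRegExpReplace($sLine, "^--\d{4}-\d+-\d+\s+\d+:\d+:\d+--\s*", "")
ConsoleWrite("URL: " & $sUrl & @CRLF)

; Flag 1 returns the captured groups: [0] = byte count, [1] = short form.
Local $aLen = StringRegExp($sLength, "(?i)length:\s?(\d*)\s?\((.*?)\)", 1)
If @error = 0 Then
    ConsoleWrite("Bytes: " & $aLen[0] & ", short: " & $aLen[1] & @CRLF)
EndIf
```

StringRegExpReplace() does the trimming in one step, so no index is needed: whatever the pattern matches is replaced by the empty string and the rest of the line is returned.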

sulfurious · December 15, 2008

First: I have a regex tester in my sig you could use to learn SRE.

Second: Give me an example of what it is you want done.

"but can regexp be used now to help 'trim' the date structure off the url?"

Like this?

Before: --2008-12-15 11:35:30-- http://largedownloads.ea.com/pub/patches/BF2142/1.
After: http://largedownloads.ea.com/pub/patches/BF2142/1.

SREReplace -
--[0-9]{1,}-[0-9]{1,}-[0-9]{1,}\s?\d?\d?:\d?\d?:\d?\d?--\s?\s?

Yes, this is pretty much it. In terms of trimming, it would be trimming off the date structure, just like the After: above. But how does the code above do this? I have been looking at ^ for not matching, but could not get it to do what you did.

Basically, I just want to learn how to make something that I am apt to find again be handled by regexp. I know regexp is very powerful, but did not know where to start with it.

I don't have an example per se, so much as I am looking for how to use it in a StringTrimLeft() function. You know, how to trim the part that matches the regexp off the line, while in an array loop. I guess it is more a question of how to find and manipulate a sequence of data that may not always have specific characters to search for. This seems to make more sense than recoding when the unknown appears, provided of course that the mask for the regexp is correct for the new data.

Sorry if I don't get it, but, I really don't get it.

@SmokeN, what does the ?! do? And why would you have to nest that part twice, while the remainder is nested only once, and not even nested together? Why would you use \d{4} for the 4-digit part of the date, and only \d+ for the 2-digit parts? Because + means immediately following same character, and +++ would mean different?

Again, I can't say how much I appreciate the help.

Sul.

SmOke_N · December 16, 2008

"(?!...)" - Is basically like "Not". I grouped together carriage return, line feed, or the beginning of the string as a "Not" statement.

In other words, I told the expression to find any irregularities: any char before my search pattern that wasn't a CR, an LF, or the beginning of the string. If it finds one, I then replace it with itself followed by a CRLF so that I have a proper structure to work with.

So I didn't nest the search, I simply grouped my query and made it a conditional statement with (?!).

I used \d{4} because we have a "definite" search pattern. The first thing after the "--" is the year, which is in 4-digit format. I don't want to continue my expression if it can fail quickly.

+ means there is at least one of the object we are looking for (in this case any digit); however, it could be infinite. I could have done \d\d or \d{2}, but both of those are more chars than + ... so I opted for the lazy way, feeling that if it got to that point (past the 4 digits of the year) it wasn't going to fail... Obviously, if there was some change and it could, I'd replace it with \d\d or \d{2}.
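
The difference between a fixed count and + can be tried out with a short sketch like the one below. The test strings are made up, and the ^ and $ anchors are added here only so that each string has to match in full. The last two lines run the pattern from the first post with its stacked quantifier and print @error, which StringRegExp() sets to 2 when a pattern cannot be compiled; what it prints is left for the reader to check.

```autoit
; Sketch only: test strings are made up.
Local $aTests[3] = ["--2008-12-15", "--08-12-15", "--2008-1-5"]

For $i = 0 To UBound($aTests) - 1
    ; Strict: exactly four digits for the year and exactly two for month and day.
    Local $iStrict = StringRegExp($aTests[$i], "^--\d{4}-\d{2}-\d{2}$", 0)
    ; Loose: four digits for the year, then one or more digits for month and day.
    Local $iLoose = StringRegExp($aTests[$i], "^--\d{4}-\d+-\d+$", 0)
    ConsoleWrite($aTests[$i] & "  strict=" & $iStrict & "  loose=" & $iLoose & @CRLF)
Next

; The pattern from the first post stacks quantifiers (\d+++).
; @error = 2 after the call would mean the pattern was rejected as invalid.
StringRegExp("--2008-12-15 11:35:30--", "--\d+++-\d+-\d+\s\d+:\d+:\d+--", 0)
ConsoleWrite("@error for \d+++ pattern: " & @error & @CRLF)
```

The string "--2008-1-5" is the interesting case: it passes the loose pattern but not the strict one, which is the trade-off described above.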

Edit:

You can learn a lot using regex tools, but they have their limitations: they might give you an expression that works, but one that is poorly put together.

Nothing beats practice, and if you're going to be doing string manipulation, RegEx is a must-have in your arsenal. I use it often, but not for everything; each thing has its place... (although you can do with expressions just about everything you can do with any regular function).

As I said, if you're going to do RegEx consistently, it's to your benefit to drop every project right now for the next 7 to 10 days and become as much of an expert as you can.

The hours you put in will be returned to you tenfold over the next few months, saving you time and headaches.

Edited December 16, 2008 by SmOke_N

sulfurious · December 17, 2008

"(?!...)" - Is basically like "Not". I grouped together carriage return, line feed, or the beginning of the string as a "Not" statement.

If I follow this correctly then, the braces give a set of chars to test. Are items in braces allowed to be tested out of order? Group () states that the items are treated in order. So you group () the braces to ensure that specific order of the braces remains intact? Why would you not use (?!\r\n\A) instead?

In other words, I told the expression to find any irregularities: any char before my search pattern that wasn't a CR, an LF, or the beginning of the string. If it finds one, I then replace it with itself followed by a CRLF so that I have a proper structure to work with.

So I didn't nest the search, I simply grouped my query and made it a conditional statement with (?!).

Then the first nesting ((?![...])) is not a nest, but a complete expression. And the second expression (--\d...) is a second expression. So that you refer to it then as \1 and \2 ?

Local $s_pattern = "((?![\r\n\A]))(--\d{4}-\d+-\d+\s+\d+:\d+:\d+--)"
Local $s_sre = StringRegExpReplace($s_string, $s_pattern, "\1" & @CRLF & "\2")

If you check for the pattern, in 2 searches, one (()) and one (). I did not realize that you could perform them like that. So each () is a group, and you can test multiple groups in the same expression. And the ?! is a logic test for that group. Does this mean that groups are processed in hierarchical order then? Left to right, match and exit? Or is the order strictly for within the group?

Why is it that if you replace using all tokens 0-9 with CRLF between them, that always token 1 is blank, and whatever the last token is can be pushed to 9 (or the highest token you use)

I used \d{4} because we have a "definite" search pattern. The first thing after the "--" is the year, which is in 4-digit format. I don't want to continue my expression if it can fail quickly.

+ means there is at least one of the object we are looking for (in this case any digit); however, it could be infinite. I could have done \d\d or \d{2}, but both of those are more chars than + ... so I opted for the lazy way, feeling that if it got to that point (past the 4 digits of the year) it wasn't going to fail... Obviously, if there was some change and it could, I'd replace it with \d\d or \d{2}.

Yes, this most definitely requires in-depth study. I can already see many uses I could have put it to, with much cleaner code and without the massive string manipulation.

Thanks for the lesson SmokeN !

Sul.

SmOke_N · December 17, 2008

If I follow this correctly then, the braces give a set of chars to test. Are items in braces allowed to be tested out of order? Group () states that the items are treated in order. So you group () the braces to ensure that specific order of the braces remains intact? Why would you not use (?!\r\n\A) instead?

If you did (?!\r\n\A) the statement can never be true, because as you later stated, the hierarchy is left to right. What you'd be telling the search to do is look for CRLF and the Beginning of the line, well if there were CRLF then there couldn't also be the beginning of the line.

Then the first nesting ((?![...])) is not a nest, but a complete expression. And the second expression (--\d...) is a second expression. So that you refer to it then as \1 and \2 ?

It is an expression, I'm telling it to find any char that is not a CR or LF or Beginning of the line. Surrounding my Query in square brackets means to look for any of those characters in no specific order.

You are correct: ((?![..])) is the first full expression we are looking for, and (--\d...) is the second. So when we use back-referencing, $1 or \1 represents ((?![..])) and $2 or \2 represents (--\d..).

If you check for the pattern, in 2 searches, one (()) and one (). I did not realize that you could perform them like that. So each () is a group, and you can test multiple groups in the same expression. And the ?! is a logic test for that group. Does this mean that groups are processed in hierarchical order then? Left to right, match and exit? Or is the order strictly for within the group?

I answered this above along with the exception with square brackets.

Why is it that if you replace using all tokens 0-9 with CRLF between them, that always token 1 is blank, and whatever the last token is can be pushed to 9 (or the highest token you use)

I didn't actually replace anything, the following expression:

, "\1" & @CRLF & "\2")

actually told it to replace the first match found with itself, then insert a CRLF after that then replace the 2nd match found with itself... (So all I did was place a CRLF between the two matches found, I didn't actually replace anything).
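
Back-references in the replacement string are easier to see when the groups are put back in a different order. The following is only an illustrative sketch with a made-up purpose (rearranging the date); the stamp follows the log format discussed above.

```autoit
; Sketch only: the sample stamp follows the log format discussed above.
Local $sStamp = "--2008-12-15 11:35:30--"

; Three groups: year, month, day. The time and the dashes are matched but not captured.
Local $sPattern = "--(\d{4})-(\d+)-(\d+)\s+\d+:\d+:\d+--"

; \1, \2 and \3 in the replacement stand for whatever each group captured.
Local $sDate = StringRegExpReplace($sStamp, $sPattern, "\3/\2/\1")
ConsoleWrite($sDate & @CRLF)
```

This prints 15/12/2008: everything the pattern matched is replaced, and only the parts named by \1, \2 and \3 are written back, in whatever order the replacement string asks for.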

Yes, this most definitely requires in-depth study. I can already see many uses I could have put it to, with much cleaner code and without the massive string manipulation.

Thanks for the lesson SmokeN !

Sul.

Good luck with it... if you have the patience (or in my case determination) you'll get it...
