StringRegExp

florisch

I have tried a lot, but I don't get it.

I have a text file looking like this:

Date 04.06.2007 13:20
Start XXX
blabla
lotsoftext
in lots of lines

Date 04.06.2007 13:22
Start YYY
blabla
lotsoftext
in lots of lines

Date 04.06.2007 16:22
Start ZZZ
blabla
lotsoftext
in lots of lines

Date 04.06.2007 17:23
Start XXX
blabla
lotsoftext
in lots of lines

I am searching for the text block(s) containing "XXX" or "YYY", grabbing everything from "Date" to the next "Date" or the end of the file.

To solve this, I would search with StringInStr() for "XXX" and cut the text in two pieces at that position. In the first part I search for "Date" from the end, in the second part from the beginning. All of that is repeated until "XXX" is not found anymore.

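Local $sLogText = FileRead("RegExpTest.txt")
Local $sSearchText = "XXX"
Local $i = 1 ; which occurrence of the search text to look for
Local $pos, $startpos, $temp, $endpos

While 1
    ; Position of the $i-th occurrence of the search text
    $pos = StringInStr($sLogText, $sSearchText, 0, $i)
    If $pos = 0 Then ExitLoop
    ; The last "Date" before the hit marks the start of the block
    $startpos = StringInStr(StringLeft($sLogText, $pos), "Date", 0, -1)
    ; The first "Date" after the hit marks the end of the block
    $temp = StringInStr(StringTrimLeft($sLogText, $pos), "Date")
    If $temp = 0 Then
        $endpos = StringLen($sLogText) + 1 ; no further block: include the last character
    Else
        $endpos = $pos + $temp
    EndIf
    $i += 1
    MsgBox(0, "Entry found", StringMid($sLogText, $startpos, $endpos - $startpos))
WEnd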

How can I do this in a better/shorter/simpler way with StringRegExp? I fooled around with regular expressions, but could not get it to work. By the way, it always complains when I try to use [:ascii:].

Anybody here for this little challenge? I promise to learn from the StringRegExp solution :-)

Sokko

Better/shorter/simpler is what regular expressions are all about. I tested this code on the data you posted so it should work fine, assuming the entire file looks like that. Any questions?

$aRegex = StringRegExp($sLogText, "Date (.*?)\r\nStart " & $sSearchText & "\r\n((?:.|\r\n)*?)\r\n\r\n", 2)

If a matching block is found, $aRegex is an array:

$aRegex[0] contains the entire matching block (from "Date" up to and including the blank line)

$aRegex[1] contains the date of the block ("04.06.2007 13:20")

$aRegex[2] contains the text of the block (everything after the "Start" line, up to the blank line)

In addition, @extended contains the offset of the character in $sLogText that immediately follows the matching block, so you can trim the string and try again to get another match.

If a matching block is not found, @error is set to 1 and $aRegex is NOT an array (so make sure you check that first).
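A minimal sketch of that "match, then try again" loop. Instead of trimming the string, it assumes the value from @extended is handed back to StringRegExp through its optional fourth (offset) parameter. The inline sample text and the ConsoleWrite output are only there for illustration.

```autoit
; Sample data standing in for the log file (illustration only)
Local $sLogText = "Date 04.06.2007 13:20" & @CRLF & "Start XXX" & @CRLF & "blabla" & @CRLF & @CRLF & _
        "Date 04.06.2007 13:22" & @CRLF & "Start YYY" & @CRLF & "blabla" & @CRLF & @CRLF & _
        "Date 04.06.2007 17:23" & @CRLF & "Start XXX" & @CRLF & "blabla" & @CRLF & @CRLF
Local $sSearchText = "XXX"
Local $sPattern = "Date (.*?)\r\nStart " & $sSearchText & "\r\n((?:.|\r\n)*?)\r\n\r\n"
Local $iOffset = 1 ; string positions are 1-based
Local $aRegex

While 1
    $aRegex = StringRegExp($sLogText, $sPattern, 2, $iOffset)
    If @error Then ExitLoop ; no further match, $aRegex is not an array
    $iOffset = @extended ; position right after this match
    ConsoleWrite("Date: " & $aRegex[1] & @CRLF & "Text: " & $aRegex[2] & @CRLF)
WEnd
```

Note that $sSearchText is pasted into the pattern as-is, so a search text containing regex metacharacters such as "." or "(" would need escaping first.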

Sokko

Did my code work for the file you have? This has quickly dropped down to page 5 so I figured you might have missed the reply.

florisch

Sokko wrote: "Did my code work for the file you have? This has quickly dropped down to page 5 so I figured you might have missed the reply."

Thanks a lot for your help. I left early last week and was on vacation until now. Sorry for the delay.

Your regex looks fine and I will try to use it, but (there is always a "but" :-) ...

The regex finds the first occurrence, but the last block does not have an empty line at the end. And: "lots of lines" may contain empty lines.

Therefore I changed the regex to

$aRegex = StringRegExp($sLogText, "Date (.*?)\r\nStart " & $sSearchText & "\r\n((?:.|\r\n)*?)(?:\r\nDate|\z)", 2)

Sokko

florisch wrote: "I ran out of ideas. How do I suppress the last group?"

Reverse your thinking. Instead of trying to remove the last group, capture everything except the last group with another group.

$aRegex = StringRegExp($sLogText, "(Date (.*?)\r\nStart " & $sSearchText & "\r\n((?:.|\r\n)*?))(?:\r\nDate|\z)", 2)

This shifts the return values around a bit: $aRegex now has indexes from 0 to 3 (the new captured group is index 1).
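A small sketch of the new index layout, assuming a made-up two-block sample whose last block has no trailing blank line and contains an empty line in its text. The sample text and the ConsoleWrite calls are illustrative only.

```autoit
; Sample data: the matching block is the last one and ends at end of string
Local $sLogText = "Date 04.06.2007 13:20" & @CRLF & "Start YYY" & @CRLF & "blabla" & @CRLF & @CRLF & _
        "Date 04.06.2007 17:23" & @CRLF & "Start XXX" & @CRLF & "blabla" & @CRLF & @CRLF & "more text"
Local $sSearchText = "XXX"
Local $aRegex = StringRegExp($sLogText, "(Date (.*?)\r\nStart " & $sSearchText & "\r\n((?:.|\r\n)*?))(?:\r\nDate|\z)", 2)

If @error Then
    ConsoleWrite("No matching block" & @CRLF)
Else
    ConsoleWrite("[1] whole block: " & $aRegex[1] & @CRLF) ; new outer group
    ConsoleWrite("[2] date:        " & $aRegex[2] & @CRLF)
    ConsoleWrite("[3] block text:  " & $aRegex[3] & @CRLF)
EndIf
```

$aRegex[0] is still the full match and can end with the "Date" of the following block, which is why $aRegex[1] is the one to use. For the same reason, a loop driven by @extended would start searching after that consumed "Date", so the start offset may need adjusting when two matching blocks follow each other directly.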

With a pattern that consumes the terminator like this one, the trailing Date always ends up in the full match. Your problem is that the sequence you're trying to capture has no definite terminator that resides within the block. The only way to see whether you've reached the end of the sequence is to step outside it. Blame the designer of the file format. :)

florisch

Sokko wrote: "Reverse your thinking. Instead of trying to remove the last group, capture everything except the last group with another group. [...] Blame the designer of the file format. :)"

Thanks for showing new ways. And well, for the file format I have to blame myself :-)
