RegExp - has anyone seen this library before?

Valik · October 4, 2006

Hmm. Good point. Maybe I should download Perl and try it directly. I just want to know if it's a bug in PCRE or not.

Edit: If I understand the output of grep correctly, then the equivalent pattern in grep does work. Here's some sample grep data. I'll mark the lines I typed with a star. It looks to me that if the pattern matches, grep will echo back what I typed. If the pattern does not match, then grep will not echo it back (I'm using stdin of grep, obviously).

~ # grep "DataStart[[:space:]]*\([[:alpha:]]*[[:space:]]*\)*DataEnd"
DataStart DataA DataB DataEnd *
DataStart DataA DataB DataEnd
DataStart *
DataStart DataA DataEnd *
DataStart DataA DataEnd
DataStart DataA DataB DataC DataEnd *
DataStart DataA DataB DataC DataEnd
DataStart DataA DataB DataEn d *

As can be seen, if I type DataStart followed by zero (not shown) or more blocks of data followed by DataEnd, the pattern matches. If there is no DataEnd or it's mal-formed, no match happens. Same thing happens when I don't provide DataStart. So if grep is any indication, then it should be working. I guess it really is time to test Perl.
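For anyone who wants to repeat the grep check without a shell, here is a rough sketch in Python. It assumes that `[[:space:]]` maps to `\s` and `[[:alpha:]]` to `[A-Za-z]`, and it uses Python's `re` module, which is not grep and not PCRE, so treat it as an approximation of the test above.

```python
import re

# Assumed translation of the grep (BRE) pattern from the post:
#   [[:space:]] -> \s      [[:alpha:]] -> [A-Za-z]      \( \) -> ( )
PATTERN = re.compile(r"DataStart\s*([A-Za-z]*\s*)*DataEnd")

# The lines typed into grep in the session above.
lines = [
    "DataStart DataA DataB DataEnd",
    "DataStart",
    "DataStart DataA DataEnd",
    "DataStart DataA DataB DataC DataEnd",
    "DataStart DataA DataB DataEn d",
]

# grep echoes a line when the pattern matches anywhere in it,
# which corresponds to re.search rather than re.fullmatch.
matched = [line for line in lines if PATTERN.search(line)]

for line in lines:
    print("echoed " if line in matched else "dropped", line)
```

The lines reported as echoed should be the same ones grep echoed back in the session.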

Edited October 4, 2006 by Valik

/dev/null · October 4, 2006

Hmm. Good point. Maybe I should download Perl and try it directly. I just want to know if it's a bug in PCRE or not.

same result with perl.... !???! Maybe our knowledge of regexp is not yet sufficient .... ;-)

I guess the backreference mechanism only works if there are distinct parentheses for each pattern. That's why /DataStart\s*(?:(\w+)\s*)(?:(\w+)\s*)DataEnd/ works in Perl.

Then $1 == "DataA" and $2 == "DataB".

While /DataStart\s*(?:(\w+)\s*){2}DataEnd/ returns $1 == "DataB" and $2 == undef.

As PCRE is the Perl Compatible Regular Expressions implementation, it's likely that its behaviour is the same as Perl's.

EDIT: Perl Code

$s1 = "DataStart DataA DataB DataEnd";
print join(":", $1,$2,$3) if $s1 =~ /DataStart\s*(?:(\w+)\s*)(?:(\w+)\s*)DataEnd/;
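The same comparison can be sketched in Python. Its `re` module is a different engine from Perl and PCRE, but as far as I know it follows the same rule for a repeated capturing group, so it serves as a stand-in here. The names `long_form` and `short_form` are just labels for this example.

```python
import re

s = "DataStart DataA DataB DataEnd"

# Long form: one pair of capturing parentheses per data block.
long_form = re.compile(r"DataStart\s*(?:(\w+)\s*)(?:(\w+)\s*)DataEnd")

# Short form: a single capturing group that is repeated twice.
short_form = re.compile(r"DataStart\s*(?:(\w+)\s*){2}DataEnd")

long_groups = long_form.match(s).groups()
short_groups = short_form.match(s).groups()

print(long_groups)   # two groups, one per data block
print(short_groups)  # one group, holding only the last repetition
```

Both patterns match the whole string. The short form just has one capture slot, and each repetition writes into that same slot.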

Cheers

Kurt

Edited October 4, 2006 by /dev/null

/dev/null · October 4, 2006

I guess the backreference mechanism only works if there are distinct parentheses for each pattern.

my assumption seems to be correct. http://www.regular-expressions.info/brackets.html (search for Repetition and Backreferences).

So, it's no problem with StringRegExp(), it's just the way RegExps work (at least with perl and PCRE).

Kurt

Edited October 4, 2006 by /dev/null

Valik · October 4, 2006

I've just tried with Perl as well and see that it is the same. The thing that confuses me is why with PCRE the pattern is expanded and looks almost right but not quite. PCRE nearly expands the simple version to the long form. I have a slightly different theory. I think internally PCRE (and Perl, too, seemingly) finds DataA, stuffs it into $1, then finds DataB and stuffs it into $1 over top of DataA. I think that because, looking at the debug output from pcretest, the number appearing after "Bra" seems to be the index it will go into. In the simple, non-working example, the pattern is expanded but the second, auto-generated occurrence of (?:(\w+)\s*) gets the index 1, whereas in the manually generated version it gets 2. I'll confirm this with a debugger (maybe) later.

Also, my description explains another problem. How does the pattern match at all if it only matches DataB? Well, the answer is, it matches DataA, too, but it overwrites it with DataB making it look like it skips DataA. I think with a debugger, I can confirm that DataA is there for a time but that it gets overwritten by DataB.

So now my question becomes something like this: The behavior is consistent with Perl but is Perl's behavior right? I expected the repetition operator to be short-hand for the longer form. The Perl docs clearly state that is the point of repetition. Also, the long form can only support a finite number of matches since eventually you'll get tired of typing them out. The short-hand form supports far more matches since it's only limited by resources or whatever else governs the recursion level of repetitions.

At any rate, I'm about 85% sure about what I'm seeing. What I'm seeing does make sense from the standpoint as to why it's not working. It also means that I haven't totally lost my mind. What I don't understand is why this limitation is in place. I can't really believe I'm the first person to ever try this. But I can't think of a rational reason for imposing this limitation, either. So I guess what I need to do is find a Perl/PCRE master and see what they know about the subject.

Edit: Okay, so your link tells me it doesn't work and it confirms what I said. It doesn't really tell me why it doesn't work. It even mentions it's a common problem. So if it's so common, why not support it?

Edited October 4, 2006 by Valik

/dev/null · October 4, 2006

Edit: Okay, so your link tells me it doesn't work and it confirms what I said. It doesn't really tell me why it doesn't work. It even mentions it's a common problem. So if it's so common, why not support it?

Well, it's not saying it's a common problem. It's just describing how it works.

When we use the pattern /DataStart\s*(?:(\w+)\s*){2}DataEnd/ there is only one backreference pointing to (\w+), as that's the only one we defined.

That now seems correct to me. If it were not like that, how would you ever be able to tell which part of your string will end up in $1, $2, $3 etc. if you use this pattern:

(?:(\w+)\s*)*(?:(\w+)\s*)*

With that it would be impossible to do anything with the result, except to get all matching substrings, which is what you are interested in.

Cheers

Kurt

Edited October 4, 2006 by /dev/null

Valik · October 4, 2006

I suppose. But I think it's an artificial limitation. If it worked, could you write patterns where it would be impossible to make sense of the captures? Sure. But it's also possible to write the pattern in a way where you can make sense of the captures. In fact, that's trivial to enforce by using + instead of * so that things are not optional.

Basically I feel like they are saying, "You're too stupid to write a pattern that can return useful data with this feature so we aren't going to allow it to work at all". But I can write a pattern to return useful data. With the real pattern I wanted to use, I knew that all the even elements (starting with 0) would contain one capture and all the odd elements would contain the second capture for that particular line. I know the format of the data so I know my captures will work (or nothing will be captured which is good enough). I know how to find the start and end points where the data is in the larger text block, too. But I have to make 2 calls to do what I could write in one if I wasn't intentionally being limited. And people say regular expressions are powerful? Am I too powerful for regular expressions that not even they can meet my needs?
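For reference, the two-call workaround mentioned above could look roughly like this in Python. This is only a sketch: `data_blocks` is a hypothetical helper name, the DataStart/DataEnd markers come from the earlier examples, and Python's `re` stands in for StringRegExp.

```python
import re

def data_blocks(text):
    """Return every word found between DataStart and DataEnd.

    Call 1 isolates the region between the two markers.
    Call 2 collects all words inside that region, however many there are.
    """
    region = re.search(r"DataStart\b(.*?)\bDataEnd", text, re.S)
    if region is None:
        return []  # missing or malformed markers: nothing is captured
    return re.findall(r"\w+", region.group(1))

print(data_blocks("DataStart DataA DataB DataC DataEnd"))
print(data_blocks("DataStart DataA DataB DataEn d"))
```

The second call returns a plain list, so the number of blocks does not have to be known when the pattern is written.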

/dev/null · October 4, 2006

write in one if I wasn't intentionally being limited. And people say regular expressions are powerful? Am I too powerful for regular expressions that not even they can meet my needs?

I guess "they" had to decide if they want to make it work for one case or the other. Let's use a more general example that will lead to a problem to identify the matched terms if it would work like you want it to.

$s1 = "DataStart DataA DataB DataC User: Test DataEnd";

$s2 = "DataStart DataA DataB User: Test DataEnd";

$p = 'DataStart\s*(?:(\w+)\s*)+\s*User:\s(\w+)\sDataEnd';

Now, for $s1 the regexp pattern would create (if implemented like you want)

$1 = DataA

$2 = DataB

$3 = DataC

$4 = Test

and for $s2 it would create

$1 = DataA

$2 = DataB

$3 = Test

So, if the input text is variable, how are you going to figure out if the "user match" is in $3 or $4? Also remember that the matched substrings are not only referenced by the "external" variables $1, $2, etc. but also by \1, \2, etc. WITHIN the regexp itself. That would not work either if you needed such a backreference.
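A small sketch of the in-pattern case, again in Python as a stand-in for Perl/PCRE. The pattern and the sample strings are made up for illustration. It shows that \1 after a repeated group refers to whatever the group captured on its last repetition.

```python
import re

# One capturing group, repeated, followed by a backreference to it.
# The input must end with a repeat of the word captured last.
pattern = re.compile(r"(?:(\w+) )+\1$")

ends_with_last = pattern.match("DataA DataB DataB")
ends_with_first = pattern.match("DataA DataB DataA")

print(ends_with_last is not None)   # the trailing word repeats the last capture
print(ends_with_first is not None)  # the trailing word repeats an earlier capture
```

If every repetition got its own number, the meaning of \1 here would depend on how many times the group happened to repeat.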

It would only work for your purpose if the matched strings were returned in an array. But that's obviously not how regexps work, at least not the known ones :-)

BTW: Here's a link where Larry Wall is talking about Pattern Matching and RegExps in Perl. He is also not too happy with the way certain things are implemented.

http://dev.perl.org/perl6/doc/design/apo/A05.html

There are also some references to Backreferences and Captures (esp. the problems with the current implementation). Also see RFC 360, that's basically what you want. Some of those might be changed in Perl6...

Cheers

Kurt

Edited October 4, 2006 by /dev/null

thomasl · October 4, 2006

I have not followed this subthread in any detail (I slept :lmao: ), but perhaps it helps to bear in mind a couple of points:

PCRE is trying very hard to be Perl compatible. But there are some unavoidable differences (see the doc); what's more, Perl itself is a moving target. There is no final authority that decides which RE engine is right and which is wrong.

Many problems (more than most people realise) are solvable by throwing a RE or two at them. But not all: some people still try to write patterns that match valid HTML.

REs are just a tool and like any tool it pays to know when and how to use it (and perhaps more importantly, when not to use it).

trids · October 4, 2006

..
REs are just a tool and like any tool it pays to know when and how to use it (and perhaps more importantly, when not to use it).

.. and it's also good to remember that there is rarely only one way to specify a matching pattern: most times there are several ways to achieve the same result. So don't give up too soon :lmao:

Jon · October 5, 2006

No posts for a day on the topic - good sign. The next beta (tomorrow) will include this new code and the reg exp docs will be back.

The docs may need tweaks as they are mostly based on the previous version, so let me know if there is anything wrong in there.

jpm · October 6, 2006

Does somebody understand why the following returns 3 items, the second being empty?

$pattern="\(([0-9]+)\)|: ==> (.*).:"
$string='crash fatalerror.au3 (13) : ==> Unable to parse line.: '
$array = StringRegExp($string, $pattern, 3)

for $i = 0 to UBound($array) - 1
    msgbox(0, "Option 3 - " & $i, $array[$i])
Next

Thanks for the help or the correction or the fix ...

Valik · October 6, 2006

Does somebody understand why the following returns 3 items, the second being empty?
$pattern="\(([0-9]+)\)|: ==> (.*).:"
$string='crash fatalerror.au3 (13) : ==> Unable to parse line.: '
$array = StringRegExp($string, $pattern, 3)

for $i = 0 to UBound($array) - 1
    msgbox(0, "Option 3 - " & $i, $array[$i])
Next
Thanks for the help or the correction or the fix ...

$pattern="\(([0-9]+)\)|: ==> (.*).:"

Is that pipe supposed to be there? It means OR so I imagine that's what's doing it. When I change your pattern to this, I get 2 results: $pattern="\(([0-9]+)\) : ==> (.*)\.:"

I also corrected another mistake where you were using a dot character when you really wanted to use a literal dot (Remember, it has to be escaped). In this case, it's not too important to escape it since a non-escaped dot will still match since it matches any character.

Edit: Fixed typo and added a sentence.

Edited October 6, 2006 by Valik

jpm · October 6, 2006

$pattern="\(([0-9]+)\)|: ==> (.*).:"
Is that pipe supposed to be there? It means OR so I imagine that's what's doing it. When I change your pattern to this, I get 2 results: $pattern="\(([0-9]+)\) : ==> (.*)\.:"
I also corrected another mistake where you were using a dot character when you really wanted to use a literal dot (Remember, it has to be escaped). In this case, it's not too important to escape it since a non-escaped dot will still match since it matches any character.

Edit: Fixed typo and added a sentence.

Thanks Valik,

Do I understand correctly that OR no longer needs a | in PCRE?

thomasl · October 6, 2006

Do I understand correctly that OR no longer needs a | in PCRE?

The | meant, means and will for the foreseeable future mean OR in REs.

The effect you see is another case of RTFM. You don't need a | in your context. In fact, it's plain wrong here and the very reason why you got this empty string back: you asked to match EITHER a number OR something else. During the first matching attempt the number is duly matched and returned in the first group. The second group remains empty because it is never initialised: "A set of alternatives matches a string if ANY of the alternatives match [...]. It tries the alternatives left to right AND STOPS ON THE FIRST MATCH THAT ALLOWS SUCCESSFUL COMPLETION OF THE ENTIRE REGULAR EXPRESSION." (Quoted from Programming Perl, my emphasis.)

Think short-circuit evaluation in a classical if: if f1() or f2() then ...

If f1() returns true f2() is never called. That's exactly what happened here.
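To see the effect outside AutoIt, here is a rough Python sketch using the string from the question. `with_pipe` is the original pattern. `without_pipe` is my assumption of what a corrected pattern looks like (no pipe, escaped parentheses and dot). Python's `re.findall` returns one tuple per match rather than the flat array StringRegExp gives, so only the empty slots are comparable.

```python
import re

s = 'crash fatalerror.au3 (13) : ==> Unable to parse line.: '

# Original pattern: EITHER "(digits)" OR ": ==> text.:"
with_pipe = r"\(([0-9]+)\)|: ==> (.*).:"

# Assumed corrected pattern: both parts must match together.
without_pipe = r"\(([0-9]+)\) : ==> (.*)\.:"

pipe_result = re.findall(with_pipe, s)
joined_result = re.findall(without_pipe, s)

print(pipe_result)    # two separate matches, each with one group left empty
print(joined_result)  # one match with both groups filled
```

With the pipe, each match succeeds through one alternative only, so the group belonging to the other alternative stays empty for that match.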

REs are like a programming language: you get back what you asked for and not what you believe you asked.

this-is-me · October 6, 2006

For those wanting to learn more about regular expressions, I have found an online copy of Mastering Regular Expressions, 2nd Edition. I am reading it myself in an attempt to better understand regular expressions.

SlimShady · October 6, 2006

There's a great tool I found some days ago.

It's freeware and called The Regex Coach.

Download from this page.

JSThePatriot · October 6, 2006

Thanks to the both of you for these two teaching tools. I would like to know more about RE's due to my constant text parsing.

JS

thomasl · October 6, 2006

There's a great tool I found some days ago.
It's freeware and called The Regex Coach.
Download from this page.

YES! This is a very, very useful tool to learn REs and to understand how they work (or why they don't). I use it myself when I am sure I am right but the blasted expression doesn't do what it's supposed to do.

It is, however, not based on PCRE, so there are some subtle differences (mostly of a pretty esoteric nature).

Jon · October 6, 2006

The parameter doc for StringRegExp is generated from this text. (tabs are important). If someone can ensure it is correct for PCRE I would be grateful.

sohfeyr · October 6, 2006

The parameter doc for StringRegExp is generated from this text. (tabs are important). If someone can ensure it is correct for PCRE I would be grateful.

No named capturing groups then? Too bad; maybe some day... I'll just be thrilled to finally see RegExps well-supported.

:lmao:
