
Regex selects all at start.



I'd like some help with a regex, please. I want to select blocks of text, and the pattern I have does that almost correctly. The problem is that the first match also includes everything that precedes it in the source.

#include <Debug.au3>
$sStr = "aBa¬aCa¬aCa¬aCa¬aCa¬aCa"
$sPatn = "(?U)a.*C.*¬"
$sAry = StringRegExp($sStr, $sPatn, 3)
_DebugArrayDisplay($sAry)

Output:

aBa¬aCa¬
aCa¬
aCa¬
aCa¬

I thought (?U) made the pattern ungreedy, so it shouldn't do that? (The ¬ characters stand in for the @CRLF sequences I get when reading a file; I could use the original if that's easier.)

 

 

 

 


I cannot reproduce the output you are reporting. I tested with the script below. The only changes are the use of @CRLF and a check for @error.

#include <Debug.au3>
$sStr = "aBa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa"
$sPatn = "(?U)a.*C.*\r\n"
$sAry = StringRegExp($sStr, $sPatn, 3)
if @error then MsgBox(Default, "ERROR", "@error:" & @error)
_DebugArrayDisplay($sAry)

 


The output of your original snippet is perfectly correct and expected: PCRE does exactly what you asked it to do. Let's see (I insert a bar | to denote where we are inside the subject and the pattern):

|aBa¬aCa¬aCa¬aCa¬aCa¬aCa
|(?U)a.*C.*¬

First the option is parsed and memorized
|aBa¬aCa¬aCa¬aCa¬aCa¬aCa
(?U)|a.*C.*¬

Then:
a|Ba¬aCa¬aCa¬aCa¬aCa¬aCa
(?U)a|.*C.*¬

aBa¬a|Ca¬aCa¬aCa¬aCa¬aCa
(?U)a.*|C.*¬

aBa¬aC|a¬aCa¬aCa¬aCa¬aCa
(?U)a.*C|.*¬

aBa¬aCa|¬aCa¬aCa¬aCa¬aCa
(?U)a.*C.*|¬

aBa¬aCa¬|aCa¬aCa¬aCa¬aCa
(?U)a.*C.*¬|

First match found: aBa¬aCa¬

Remember that . (dot) doesn't match a line break by default, which is why the example posted as an answer above behaves differently. In your own example, however, the dot does match ¬.

 

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Note: before posting what follows, I just saw that @jchd answered while I was preparing my long post. No offence, jchd: I'm posting my answer as I wrote it, then I'll read your post, I promise! My apologies if any of my comments below are wrong :D

Hi everybody,
I'm a newbie at RegEx, but let's try a few comments and dig deeper into the preceding posts:

@RichardL If you change your  pattern from...

$sPatn = "(?U)a.*C.*¬"

...to

$sPatn = "(?U)a[^¬]*C.*¬"

...then the output should be correct, because it returns matches made of an "a", followed by any number of characters other than ¬, followed by "C", and so on, all of it ungreedy. So changing "a.*" to "a[^¬]*" should give the correct output.
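
A minimal sketch of that change, run against the OP's sample string. The array of patterns and the ConsoleWrite reporting are only illustrative choices, not part of the original posts. It assumes the script file is saved in an encoding AutoIt reads correctly, so that ¬ is a single character. Following the reasoning above, the first match of the original pattern should span two blocks, while the first match of the modified pattern should be a single aCa¬ block.

```autoit
; Sketch: compare the OP's pattern with the negated-class variant.
Local $sStr = "aBa¬aCa¬aCa¬aCa¬aCa¬aCa"
Local $aPatterns[2] = ["(?U)a.*C.*¬", "(?U)a[^¬]*C.*¬"]
Local $aHits

For $sPatn In $aPatterns
    $aHits = StringRegExp($sStr, $sPatn, 3) ; flag 3 = array of all global matches
    If @error Then
        ConsoleWrite($sPatn & " -> no match" & @CRLF)
        ContinueLoop
    EndIf
    ; Report the length of the first match and the number of matches found.
    ConsoleWrite($sPatn & " -> first match length: " & StringLen($aHits[0]) & _
            ", matches: " & UBound($aHits) & @CRLF)
Next
```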

@OJBakker glad you made it!
It's interesting to experiment with your pattern and force it to produce exactly the same issue as OP's, by changing this...

$sPatn = "(?U)a.*C.*\r\n"

...to that :

$sPatn = "(?Us)a.*C.*\r\n"

Now it returns exactly the result OP indicated, because (?s) "Single-line or DotAll" was added !
From AutoIt help file :

By default, DotAll is off hence . does not match a newline sequence. 

That's why your pattern worked: when "a.*" reached the first line break in the string, the dot could not match it (and no "C" had been found yet), so that attempt failed and the engine went on to look for the first match further along, after the line break.

As (?s) changes this behavior, . now matches a newline sequence and the output is the same as OP's... who doesn't want this output at all. Let's hope our RegEx gurus will add some nice comments, as they usually do :)
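
A small sketch of the comparison described above, reusing the subject and the two patterns from this thread. Replacing @CRLF with a visible marker is just a display convenience added here. Going by the posts above, the first match should be a single aCa block without (?s), and should start at aBa with it.

```autoit
; Sketch: same subject, with and without the (?s) DotAll option.
Local $sStr = "aBa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa"
Local $aPatterns[2] = ["(?U)a.*C.*\r\n", "(?Us)a.*C.*\r\n"]
Local $aHits

For $sPatn In $aPatterns
    $aHits = StringRegExp($sStr, $sPatn, 3) ; flag 3 = array of all global matches
    If @error Then
        ConsoleWrite($sPatn & " -> @error: " & @error & @CRLF)
        ContinueLoop
    EndIf
    ; Make the line breaks visible so the extent of the first match is obvious.
    ConsoleWrite($sPatn & " -> first match: " & StringReplace($aHits[0], @CRLF, "<CRLF>") & @CRLF)
Next
```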

By the way, I read this in AutoIt help file, topic StringRegExp :

Quantifiers (or repetition specifiers) specify how many of the preceding character, class, reference or group are expected to match.

As I didn't understand what "reference" meant in this sentence, I looked it up and found this MS page: Quantifiers in Regular Expressions, where we can read:

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

They don't mention references. So what does this "reference" mean in AutoIt helpfile, when applied to quantifiers ?
Thanks... and now let's immediately read jchd's post :)


16 hours ago, pixelsearch said:

So what does this "reference" mean in AutoIt helpfile, when applied to quantifiers ?

Reference means back reference or subroutine call
So says the bible :)

Repetition is specified by quantifiers, which can follow any of the following items:

  (...)
  a back reference (see next section)
  a parenthesized subpattern (including assertions)
  a subroutine call to a subpattern (recursive or otherwise)
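
To make the two cases concrete, here is a hedged sketch with test strings chosen only for illustration. A quantified back reference repeats the text that was captured, whereas a quantified subroutine call repeats the sub-pattern itself.

```autoit
; Sketch: quantifiers applied to a back reference and to a subroutine call.
; With the default flag 0, StringRegExp returns 1 on a match and 0 otherwise.

; \1{2}: the text captured by group 1 must appear two more times.
ConsoleWrite(StringRegExp("ababab", "^(ab)\1{2}$") & @CRLF) ; 1 (ab + ab + ab)
ConsoleWrite(StringRegExp("abab", "^(ab)\1{2}$") & @CRLF)   ; 0 (only one repeat)

; (?1){2}: the sub-pattern of group 1, here \d, is applied two more times,
; so the digits do not have to be identical.
ConsoleWrite(StringRegExp("123", "^(\d)(?1){2}$") & @CRLF)  ; 1

; With a back reference instead, the same digit must repeat.
ConsoleWrite(StringRegExp("111", "^(\d)\1{2}$") & @CRLF)    ; 1
ConsoleWrite(StringRegExp("123", "^(\d)\1{2}$") & @CRLF)    ; 0
```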

 

Edited by mikell

Thanks mikell, it's an interesting link :)

By the way, it seems to me that OP had perhaps something else in mind, when he wrote :

On 10/23/2022 at 2:25 PM, RichardL said:

I thought (?U) made it not greedy, so it shouldn't do that?

Look at this simple example :

$sStr = "1a2a3c4c5c6"

$sPatn = "a.*c" ; greedy (default) returns a2a3c4c5c
$sAry = StringRegExp($sStr, $sPatn, 3)
_DebugArrayDisplay($sAry, "greedy")

$sPatn = "(?U)a.*c" ; ungreedy (aka lazy) returns a2a3c
$sAry = StringRegExp($sStr, $sPatn, 3)
_DebugArrayDisplay($sAry, "ungreedy")

[Screenshot: results of the greedy and ungreedy patterns]

If a user expects to match "a3c" with the ungreedy pattern of this example, then it doesn't work.

"(?U)a.*c" doesn't mean "As we are ungreedy, then anchor to the last 'a' found before 'c' and grab everything between this last 'a' and the 1st 'c' following it."

With this kind of pattern, whatever the greediness (on or off), the match always starts at the first 'a' found in the string; the length of the match then depends on the greediness (longer when on, shorter when off).

Please be kind enough to correct this explanation if it's wrong or obscure, thanks.

Edit: I got a pattern that returns "a3c" in this last example :

$sPatn = "(?U)a[^a]*c" ; ungreedy returns a3c (yes !)

In other words, this pattern does behave like: "As we are ungreedy, then anchor to the last 'a' found before 'c' and grab everything between this last 'a' and the 1st 'c' following it."

Edited by pixelsearch

This isn't exactly an anchor question, but simply a question of satisfying the pattern, backtracking and restarting the pattern after a failure.

|1a2a3c4c5c6
|(?U)a[^a]*c

Option is parsed once

|1a2a3c4c5c6
(?U)|a[^a]*c

1 doesn't match a in pattern
1|a2a3c4c5c6
(?U)|a[^a]*c

a matches
1a|2a3c4c5c6
(?U)a|[^a]*c

2 matches [^a]*
1a2|a3c4c5c6
(?U)a[^a]*|c

not followed by c in subject : pattern failed
backtrack to 2 and restart pattern from there
1a|2a3c4c5c6
(?U)|a[^a]*c

2 doesn't match a
1a2|a3c4c5c6
(?U)|a[^a]*c

a3c matches a[^a]*c : success
1a2a3c|4c5c6
(?U)a[^a]*c|
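
The restart-after-failure behaviour traced above can be imitated by hand. The sketch below is only an emulation, not how PCRE is implemented: it anchors the pattern with ^ and tries it against the remainder of the subject from each successive start position, stopping at the first success. The loop and variable names are illustrative.

```autoit
; Sketch: emulate the engine restarting the pattern at each start position.
Local $sStr = "1a2a3c4c5c6"
Local $sPatn = "(?U)^a[^a]*c" ; ^ forces the attempt to begin exactly at the start of $sRest
Local $sRest, $aHit

For $i = 1 To StringLen($sStr)
    $sRest = StringMid($sStr, $i)             ; subject from position $i onwards
    $aHit = StringRegExp($sRest, $sPatn, 1)   ; flag 1 = array with the first match
    If @error Then
        ConsoleWrite("start " & $i & ": pattern failed, move on" & @CRLF)
    Else
        ConsoleWrite("start " & $i & ": matched " & $aHit[0] & @CRLF)
        ExitLoop
    EndIf
Next
```

The attempts at positions 1 to 3 fail and the attempt at position 4 succeeds with a3c, which mirrors the steps shown with the vertical bars.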

 


@jchd these last explanations about "satisfying the pattern, backtracking and restarting the [whole ?] pattern after a failure." were very interesting. The word "anchor" I used was surely inappropriate (as it's a "RegEx word") but you certainly understood what I was trying to explain :)

Let's add some more details at the end of your explanations (just to look at the place of the vertical bars and prepare my question to come) :

...

1a2|a3c4c5c6
(?U)|a[^a]*c

a matches
1a2a|3c4c5c6
(?U)a|[^a]*c

3 matches
1a2a3|c4c5c6
(?U)a[^a]*|c

c matches, so a3c matches a[^a]*c : success
1a2a3c|4c5c6
(?U)a[^a]*c|

Now let's try this on another example, with a different subject ("33" instead of "3") and a different quantifier ({2} instead of *):

#include <Debug.au3>

$sStr = "1a2a33c4c5c6"
$sPatn = "(?U)a[^a]{2}c" ; matches a33c (greedy or not)
$sAry = StringRegExp($sStr, $sPatn, 3)
If @error then MsgBox(0, "StringRegExp", "@error:" & @error) ; error 1 = no matches
_DebugArrayDisplay($sAry, "Result")
...

1a2|a33c4c5c6
(?U)|a[^a]{2}c

a matches
1a2a|33c4c5c6
(?U)a|[^a]{2}c

3 matches
1a2a3|3c4c5c6
(?U)a[^a]{2}|c

What now ?

Since I moved the vertical bar in both pattern and subject at each step (as you did with the * quantifier), how does the engine continue now that the quantifier is {2}?

I mean: if we move the vertical bar after each character match when the quantifier is *, as in [^a]*, should it be any different when the quantifier is {2}, as in [^a]{2}?

Sorry if the question looks too simple :)


No problem.

When the subject and pattern reach this state, everything before being the same as previously:

1a2a|33c4c5c6
(?U)a|[^a]{2}c

the pattern expects two characters other than a, and 33 satisfies that expectation.

1a2a33|c4c5c6
(?U)a[^a]{2}|c

then the pattern requires c and we have a match as well with the string a33c.
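
A short sketch to check this on the same subject. The list of patterns is chosen here for illustration. It runs the fixed {2} quantifier with and without (?U), plus the ungreedy * variant for comparison. Each should report a33c, since a fixed count leaves nothing for greediness to decide.

```autoit
; Sketch: a fixed quantifier gives the same match whether greedy or ungreedy.
Local $sStr = "1a2a33c4c5c6"
Local $aPatterns[3] = ["a[^a]{2}c", "(?U)a[^a]{2}c", "(?U)a[^a]*c"]
Local $aHits

For $sPatn In $aPatterns
    $aHits = StringRegExp($sStr, $sPatn, 3) ; flag 3 = array of all global matches
    If @error Then
        ConsoleWrite($sPatn & " -> @error: " & @error & @CRLF)
        ContinueLoop
    EndIf
    ConsoleWrite($sPatn & " -> " & $aHits[0] & @CRLF)
Next
```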


Thanks jchd
So finally, it would be the same behavior as you just explained, when "33" and * are used.

When the subject and pattern reach this state, everything before being the same as previously: 

1a2a|33c4c5c6
(?U)a|[^a]*c 

"The pattern expects 0 or more characters (not a) followed by c" and 33 rightly match that expectation :

1a2a33|c4c5c6
(?U)a[^a]*|c

No vertical bar between 3's lol...
... or why not, something like that during the possible "multicharacter checking phase", moving one vertical bar to the right (in subject) but not the other vertical bar (in pattern) until the checking phase ends :

1a2a3|3c4c5c6
(?U)a|[^a]*c

Glad we have you here :thumbsup:

Edited by pixelsearch
modified comment

That's because, once the pattern is compiled to PCRE's low-level internal primitives, a sequence like [^a]{2}c detects a partial match or a failure immediately (or nearly so), as a quasi-block operation. That's why it isn't always possible to follow the progression through the subject and pattern with bars, as we can in simple examples.

Also PCRE uses by default a number of optimizations which in most use cases cut down the number of pointless backtracking steps. Read the bible for details and the source for more gory details!
