Mithrandir

Problem with regexp: It matches the pattern OK at the beginning but not at the end

5 posts in this topic

#1 ·  Posted (edited)

I am doing some tests on regular expressions using this string:

&element1&element2&element11&

And when using this pattern (\x26 refers to a character by its hexadecimal ASCII code, in this case 0x26, decimal 38, which is &):

\A\x26[^\x26]+?

It correctly matches '&e'

But when using this pattern:

[^\x26]+?\x26\z

It matches 'element11&' and not '1&' although I told it not to be greedy with the '?' after [^\x26]+

What is happening? :)
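
For anyone who wants to reproduce the two results outside AutoIt, here is a minimal sketch in Python rather than AutoIt. It assumes Python's re module behaves like AutoIt's PCRE engine for these two patterns, and it uses \Z because that is how Python spells the end-of-string anchor written \z above.

```python
import re

s = "&element1&element2&element11&"

# Assumption: Python's \Z plays the role of PCRE's \z (end of string only).
start_match = re.search(r"\A\x26[^\x26]+?", s)
end_match = re.search(r"[^\x26]+?\x26\Z", s)

print(start_match.group())  # &e
print(end_match.group())    # element11&
```

Both engines agree here: the lazy quantifier stops after one character when nothing follows it in the pattern, but it has to keep growing when an anchor follows.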

Edited by Mithrandir

#2 ·  Posted (edited)

Hi.

What result are you expecting?

Regards, Rudi.

Edited by rudi

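#3 ·  Posted (edited)

"A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match." You used an anchor and a negated class. Laziness only controls how far the match grows once it has started; it does not move the start of the match to the right. The engine returns the first starting position from which the whole pattern can succeed, and since \z forces the match to end at the final "&", that position is the "e" of "element11". The simplest way to picture the result is to read from right to left: match the "&", then match characters that are not "&", as few as possible at first, gradually expanding the match.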

From Right to Left:

& - a match for \x26 (token #2)

1 - a match for a character other than \x26 (token #1)

1 - a match for a character other than \x26 (token #1)

t - a match for a character other than \x26 (token #1)

n - a match for a character other than \x26 (token #1)

e - a match for a character other than \x26 (token #1)

m - a match for a character other than \x26 (token #1)

e - a match for a character other than \x26 (token #1)

l - a match for a character other than \x26 (token #1)

e - a match for a character other than \x26 (token #1)

& - Not a match for a character other than \x26 (token #1)

Return element11&
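
The trace above can be checked from the other direction with a small Python sketch (Python rather than AutoIt, with \Z standing in for \z). The helper first_matching_start is a hypothetical name made up for this illustration; it tries each starting position from left to right, which is roughly what a search does internally, and reports the first one where the whole pattern succeeds.

```python
import re

s = "&element1&element2&element11&"
pattern = re.compile(r"[^\x26]+?\x26\Z")

def first_matching_start(pattern, text):
    """Try each start position from left to right; return the first hit."""
    for pos in range(len(text) + 1):
        m = pattern.match(text, pos)  # match attempt anchored at pos
        if m:
            return pos, m.group()
    return None

result = first_matching_start(pattern, s)
early = pattern.match(s, 1)            # starts inside "element1"
late = pattern.match(s, 27).group()    # starts at the last "1"

print(result)  # (19, 'element11&')
print(early)   # None
print(late)    # 1&
```

So '1&' is a perfectly valid match of the pattern, but the attempt starting at the "e" of "element11" succeeds first, and the engine never gets as far as position 27.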

Edited by Varian

#4 ·  Posted (edited)

I was expecting the last regexp to return 1& since I believed I told it to match an ampersand ("&") at the end and then not to be greedy when matching elements that were not an ampersand.

Great explanation! So, in order to match the last non-ampersand character and the final ampersand, I used this pattern: [^\x26]{1}\x26\z and it worked.
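
A quick Python check of that fix (a sketch in Python rather than AutoIt, with \Z standing in for \z). It also shows that the {1} is optional, since a character class without a quantifier already matches exactly one character.

```python
import re

s = "&element1&element2&element11&"

# The pattern from the post, and the same pattern without the {1}.
with_count = re.search(r"[^\x26]{1}\x26\Z", s).group()
without_count = re.search(r"[^\x26]\x26\Z", s).group()

print(with_count)     # 1&
print(without_count)  # 1&
```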

But does the regexp engine always read from right to left, or only when the pattern contains \z? If it always does, then why did the pattern

\A\x26[^\x26]+?

match '&e'? If it were reading from right to left, shouldn't it do the following? (I skipped the parsing of '&element2&element11&' because those attempts would end as soon as they reach an '&' that is not at the beginning of the string.)

From Right to Left:

1 - a match for a character other than \x26 (token #2)

t - a match for a character other than \x26 (token #2)

n - a match for a character other than \x26 (token #2)

e - a match for a character other than \x26 (token #2)

m - a match for a character other than \x26 (token #2)

e - a match for a character other than \x26 (token #2)

l - a match for a character other than \x26 (token #2)

e - a match for a character other than \x26 (token #2)

& - a match for an ampersand at the beginning of the string (token #1)

Return &element1

On the other hand, if it reads from left to right, it would match the ampersand at the beginning of the string and then one non-ampersand character, and stop there. So is this 'right to left' reading method always used, or only when using \z? Thanks for your help!

Edited by Mithrandir

#5 ·  Posted (edited)

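Sorry for the late reply, but you are correct: the engine always scans from left to right, which is why \A\x26[^\x26]+? stops at '&e'. Reading from right to left was only a way to picture what an end anchor does. \z matches only at the very end of the string; $ is nearly the same, except that in PCRE (which AutoIt uses) it also matches just before a final newline by default. Either way, your other tokens have to match right up to the end of the line (or string). A good example of this is extracting the file name or the containing path from an FQ (fully qualified) file or directory path. For example:

If the FQ path is "C:\Windows\System32\regedit.exe", a RegExp of "[^\\]*$" matches the run of characters after the last "\", up to the end of the string. So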

StringRegExp("C:\Windows\System32\regedit.exe", "[^\\]*$", 1)
will return an array whose first element is the match, regedit.exe

and

StringRegExpReplace("C:\Windows\System32\regedit.exe", "[^\\]*$", "")
will replace that match with an empty string, so it will return C:\Windows\System32\
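
The same two calls can be sketched in Python for comparison (Python rather than AutoIt; re.search and re.sub are used here as rough stand-ins for StringRegExp and StringRegExpReplace, which is an assumption about equivalence for this one pattern only).

```python
import re

path = r"C:\Windows\System32\regedit.exe"
tail = re.compile(r"[^\\]*$")  # same pattern as in the AutoIt calls

file_name = tail.search(path).group()  # text after the last backslash
parent = tail.sub("", path)            # path with that text removed

print(file_name)  # regedit.exe
print(parent)     # C:\Windows\System32\
```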

Hope this helps

Edited by Varian
