martin

StringRegExp excluding characters from match

8 posts in this topic

From the help for StringRegExp

(?: ... ) Non-capturing group. Behaves just like a normal group, but does not record the matching characters in the array nor can the matched text be used for back-referencing.

'... does not record the matching characters in the array' made me expect that those characters would not be in the array returned but

$test = 'abcd123mmm'
$ans = StringRegExp($test, '(?:abcd123)\S*', 2) ; $test might have a number of lines
If @error Then
    ConsoleWrite(@error & @CRLF)
Else
    ConsoleWrite($ans[0] & @CRLF)
EndIf

gives a result which includes the characters I don't want to be captured.

How should I do it?

It looks like the description in the help for StringRegExp actually only applies to StringRegExpReplace because this

$test2 = 'fffabcd123kkkk'
$r = StringRegExpReplace($test2,'(?:abcd123).*','qqqq')
ConsoleWrite($r & @CRLF)

gives

fffqqqq

as expected (hoped?)

But then the help says that by default '.' matches any character except new line but this

$test2 = 'fffabcd123kkkk' & @LF & '**********'
$r = StringRegExpReplace($test2,'(?:abcd123).*','qqqq')
ConsoleWrite($r & @CRLF)

returns a string which includes the @LF and the following characters. If I change @LF for @CR it makes no difference.

If instead of '.*' I use '\S*' to match any non-whitespace character then I still get the @LF or @CR included but I think these are whitespace characters.

I am not understanding some very basic things here.
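A possible explanation for the line-break observation above, as a minimal sketch: the replacement only touches the text the pattern actually matched. If '.' (or '\S') stops at the line break, everything from the line break onwards is never matched, so it is left in the result untouched. The variable names here are illustrative, and the sketch assumes the PCRE inline option (?s) is available in the AutoIt build being used.

```autoit
; Sketch: why @LF and the asterisks survive the replacement.
Local $sTest = 'fffabcd123kkkk' & @LF & '**********'

; By default '.' stops at the line break, so only 'abcd123kkkk' is
; matched and replaced; the @LF and the asterisks are left as they were.
Local $sDefault = StringRegExpReplace($sTest, '(?:abcd123).*', 'qqqq')
ConsoleWrite(StringReplace($sDefault, @LF, '<LF>') & @CRLF)

; With (?s) the dot also matches line breaks, so '.*' consumes the
; rest of the string and the whole tail is replaced.
Local $sDotAll = StringRegExpReplace($sTest, '(?s)(?:abcd123).*', 'qqqq')
ConsoleWrite($sDotAll & @CRLF)
```

The same reasoning applies to '\S*': @LF and @CR are whitespace, so '\S*' stops in front of them and they stay in the output because they were never part of the match.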


#2 ·  Posted (edited)

Hey,

This might explain some of it.

A non-capturing group is still part of the overall match; it just does not add an extra element to the returned array.

#include <Array.au3>
$test = 'abcd123mmm'

; this will return 1 match (the full string)
$ans = StringRegExp($test, '(\w+(?:123).*)', 3)
If @error Then
    ConsoleWrite("Error: " & @error & @CRLF)
Else
    _ArrayDisplay($ans)
    ConsoleWrite($ans[0] & @CRLF)
EndIf

; but if we were to use a capturing group it would return 2 matches (full string and 123)
$ans = StringRegExp($test, '(\w+(123).*)', 3)
If @error Then
    ConsoleWrite("Error: " & @error & @CRLF)
Else
    _ArrayDisplay($ans)
    ConsoleWrite($ans[0] & @CRLF)
EndIf

$ans = StringRegExp($test, '(?:abcd123)(.*)', 3)
If @error Then
    ConsoleWrite("Error: " & @error & @CRLF)
Else
    ConsoleWrite($ans[0] & @CRLF)
EndIf

I got it now I think.

Edited by Robjong


I think you want return type = 3, and you need parens around the group you want "(\S*)". Like this:

#include <Array.au3>

$test = 'abcd123mmm' & @CRLF & 'xyzmmm' & 'abcd123yyy'
For $n = 1 To 3
    $ans = StringRegExp($test, '(?:abcd123)(\S*)', $n); $test might have a number of lines
    If @error Then
        ConsoleWrite(@error & @CRLF)
    Else
        _ArrayDisplay($ans, "Type = " & $n)
    EndIf
Next

:P


Thank you PsaltyDS :P

But I'm not at all happy (comfortable) with all this.

It seems that the important thing is that if you have a non-capturing group with StringRegExp then you must have at least one other group, or it does not behave as non-capturing. I didn't think of that, obviously, and it is certainly not something I would have guessed from reading the help. Plus, as one of my examples showed, it isn't the case with StringRegExpReplace, so in my opinion the behaviour is wrong. (But then in my opinion most of the world is wrong :D )

Also if I only want the first match then the flag of 1 is ok, although the help says it returns an array of matches but it doesn't.

If I need all the matches then I must use 3 for the flag which correctly (IMO) causes the non-capturing group to be omitted. Again it is not at all obvious to me from the help and I suspect it is not obvious to many more people.

The flag of 2 should give an array of matches including the full match, but like the flag of 1 it doesn't and only returns the first match. I don't know whether this is because the help is misleading or that StringRegExp is faulty or that I still don't get it.

Anyway, thanks again PsaltyDS for finding a solution for me.
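To make the difference between the three flags discussed above concrete, here is a small sketch that runs the same pattern with flags 1, 2 and 3 and prints each result on one line. The variable names are illustrative; the pattern is the one from the earlier reply.

```autoit
#include <Array.au3>

Local $sTest = 'abcd123mmm' & @CRLF & 'abcd123yyy'
Local $sPattern = '(?:abcd123)(\S*)'

; Flag 1: captured groups of the first match only.
Local $aFirst = StringRegExp($sTest, $sPattern, 1)
; Flag 2: first match only; element [0] is the full match, then the groups.
Local $aFull = StringRegExp($sTest, $sPattern, 2)
; Flag 3: captured groups of every match in the string.
Local $aAll = StringRegExp($sTest, $sPattern, 3)

ConsoleWrite('Flag 1: ' & _ArrayToString($aFirst, ' | ') & @CRLF) ; mmm
ConsoleWrite('Flag 2: ' & _ArrayToString($aFull, ' | ') & @CRLF)  ; abcd123mmm | mmm
ConsoleWrite('Flag 3: ' & _ArrayToString($aAll, ' | ') & @CRLF)   ; mmm | yyy
```

Read this way, "array of matches" for flags 1 and 2 means the groups of a single match, and only flag 3 walks through the whole string.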



It does return an array of matches as expected, but it does not go through the rest of the string globally. If you have 10 capturing parentheses and the overall match succeeds, you can expect a 10-element array, but with option 3 you don't know the array size in advance. The text matched by non-capturing parentheses is returned as well if there is nothing else to capture. They are called non-capturing because in a case like this:

/(?:t|to)op/

you may just want to group alternatives but not capture them, I believe it's quite clear.
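As a rough illustration of that point, the sketch below runs the alternation once with a non-capturing group and once with a capturing group, both with flag 3. The test string and variable names are made up for the example.

```autoit
#include <Array.au3>

Local $sText = 'top toop stop'

; Non-capturing: the group only scopes the alternation, so with no
; capturing groups flag 3 returns the full matches.
Local $aNonCapturing = StringRegExp($sText, '(?:t|to)op', 3)
; Capturing: flag 3 returns just the text matched inside the parentheses.
Local $aCapturing = StringRegExp($sText, '(t|to)op', 3)

ConsoleWrite(_ArrayToString($aNonCapturing, ' | ') & @CRLF) ; top | toop | top
ConsoleWrite(_ArrayToString($aCapturing, ' | ') & @CRLF)    ; t | to | t
```

In both cases the group takes part in matching; the only difference is whether its text gets its own slot in the result.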


I always use 'capturing' parentheses when I need to capture a specific portion of a match, or get an array of information; otherwise it will return the full match. I think that should pretty much be your practice, martin. The only case where I would say you don't need parentheses is when using the match/no-match option (0) of StringRegExp, although sometimes you'll need to do 'OR's with parentheses.

Take part of one of my more fun explorations into PCRE's, grabbing Registry Key data from a RegEdit file:

#include <array.au3>

$sRegFileStr="Windows Registry Editor Version 5.00"&@CRLF& _
@CRLF& _
"[HKEY_CURRENT_USER\Software]"&@CRLF& _
'@=""'&@CRLF& _
@CRLF& _
"[HKEY_CURRENT_USER\Software\akey]"&@CRLF& _
@CRLF& _
"[HKEY_CURRENT_USER\Software\akey\subkey]"&@CRLF& _
@CRLF& _
"[HKEY_CURRENT_USER\Software\7-zip]"&@CRLF& _
'"Lang"="-"'&@CRLF& _
'"Path"="C:\\Program Files\\7-Zip"'&@CRLF& _
@CRLF& _
"[HKEY_CURRENT_USER\Software\7-zip\Compression]"

$aRegKeys=StringRegExp($sRegFileStr,"\[(HKEY_(?:CURRENT_USER|LOCAL_MACHINE|USERS)\\.*)\]",3)

_ArrayDisplay($aRegKeys)

Note how I make sure it *starts* with [, but I don't want to capture it - so I can either leave it outside of 'capturing' parentheses, or put it as a non-capturing group (?:\[).

Then inside you'll notice I capture the whole key *inside* brackets, but I have to explicitly tell the PCRE engine that I don't want to capture 'CURRENT_USER|LOCAL_MACHINE|USERS' separately, though I *do* want to capture them as part of the 'grander' capture. (Otherwise, you'll wind up with a 2nd capture for the inner group).

Hrmm.. hope you can understand that..?


Quote: "Non-capturing parentheses are returned as well if there was no other thing to capture ... you may just want to group alternatives but not capture them, I believe it's quite clear."

No, I can confidently say it's not at all clear. That just sounds like nonsense to me.


Quote: "I always use 'capturing' parentheses when I need to capture a specific portion of a match, or get an array of information, otherwise it will return the full match. ..."

That's a lot more helpful, your example makes perfect sense to me and so does your explanation.

I know the help says that (..) will return the text matched in the group, but I expected that if a capturing group returns the text then a non-capturing group would not return it, even when there is no other capturing group. I accept that I just have to learn that this is the way it is.

I will take your advice and use the brackets for capturing groups wherever possible.

Thanks ascendant.
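For completeness, one other way to keep a prefix out of the result without adding a capturing group is a lookbehind assertion, which requires the prefix to be present but does not make it part of the match. This is only a sketch and assumes the PCRE lookbehind syntax (?<=...) is supported by the AutoIt version in use; the variable names are illustrative.

```autoit
Local $sTest = 'abcd123mmm'

; (?<=abcd123) asserts that 'abcd123' precedes the match without consuming it,
; so even the "full match" in element [0] excludes the prefix.
Local $aMatch = StringRegExp($sTest, '(?<=abcd123)\S*', 2)
If @error Then
    ConsoleWrite('No match, @error = ' & @error & @CRLF)
Else
    ConsoleWrite($aMatch[0] & @CRLF) ; mmm
EndIf

; In a replacement the prefix is likewise left in place.
ConsoleWrite(StringRegExpReplace('fffabcd123kkkk', '(?<=abcd123).*', 'qqqq') & @CRLF) ; fffabcd123qqqq
```

Note that this behaves differently from (?:abcd123) in StringRegExpReplace, where the prefix is part of the match and is therefore replaced along with the rest.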

