Tutorial - Regular Expression

Here's a smallish guide on unraveling the seeming mysteries of StringRegExp().

StringRegExp( "test" , "pattern" [, flag ] )

"test" = The string to search through for matches.
"pattern" = A string consisting of certain key characters that let the function know PRECISELY what you want to match. No ifs, ands, or buts.. it's a match or it isn't.
flag [optional] = Tells the function if you just want to know if the "pattern" is found, or if you want it to return the first match, or if you want it to return all the matches in the "test" string.

The Very Basics

As you may have figured out, the "pattern" string is the only difficult part of calling a StringRegExp() (forthwith: SRE). I find it best to think of patterns as telling the function to match a string character by character. There are different ways to find a certain character: If you want to match the string "test", that should be simple enough. You want to tell SRE to first search the string for a "t". If it finds one, then it assumes it has a match, and the rest of the pattern is used to try to prove that what it's found is not a match. So, if the next character is an "e", it could still be a match. Let's say the next letter is an "x". SRE knows immediately that it hasn't found a match because the third character you tell it to look for is an "s".


Example 1

#include <MsgBoxConstants.au3>

MsgBox($MB_OK, "SRE Example 1 Result", StringRegExp("text", 'test'))

In this example, the message box should read "0", which means the pattern "test" was not found in the test string "text". I know this seems pretty simple, but now you know why it wasn't found.

The next way of specifying a pattern is by using a set ("[ ... ]"). You can equate a set to the logic function "OR". Let's use the previous Example. We want to find either the string "test" or the string "text". So, the way I start looking for a pattern is to think like SRE would think: The first character I want to match is "t", then the letter "e", this is the same for both strings we want to match. Now we want to match "s" OR "x", so we can use a set as a substitute: "[sx]" means match an "s" or an "x". Then the last letter is a "t" again.


Example 2

#include <MsgBoxConstants.au3>

MsgBox($MB_OK, "SRE Example 2 Result", StringRegExp("text", 'te[sx]t'))
MsgBox($MB_OK, "SRE Example 2 Result", StringRegExp("test", 'te[sx]t'))

These should both provide the result "1", because the pattern should match both "test" and "text".

You can also specify how many times to match each character by using "{number of matches}" or you can specify a range by using "{min, max}". The first example below is redundant, but shows what I mean:


Example 3

#include <MsgBoxConstants.au3>

MsgBox($MB_OK, "SRE Example 3 Result", StringRegExp("text", 't{1}e{1}[sx]{1}t{1}'))
MsgBox($MB_OK, "SRE Example 3 Result", StringRegExp("aaaabbbbcccc", 'b{4}'))

The Not-So Basics

Right now you're probably thinking "Isn't this just a glorified StringInStr() function?". Well, using a "flag" value of 0, most of the time you're right. But SRE is much more powerful than that. As you use SRE's more and more, you'll find you might know less and less about the type of pattern you are looking for. There are ways to be less and less specific about each character you wish to specify in the pattern. Take, for example, a line from an inventory log file: "There were 18 sheets left in the ream of paper." You want to find the number of remaining sheets. Well, you can't use StringInStr() because you aren't looking for "18", you're looking for "????", where ? could be any digit.



Here's how I would assemble this pattern. Look at what you do and do not know about what you want to find:
1) You know that it will ALWAYS contain nothing but digits.
2) You know that it will SOMETIMES be 2 characters long.
2a) You know that a full ream of paper is 500 sheets.
2b) You know that the minimum number of sheets is 0.
3) You know that it will ALWAYS be between 1 and 3 characters long.
4) You know that there are no other digits in the test string.

At this point, I'd like to introduce the FLAG value of "1" and the grouping characters "()". The flag value of "1" means that SRE will not only match your pattern, but also return an array, with each element of the array consisting of a captured "group" of characters. So without veering off course too much, take this example:



Example 4

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Local $aResult = StringRegExp("This is a test example", '(test)', $STR_REGEXPARRAYMATCH)
If Not @error Then
    MsgBox($MB_OK, "SRE Example 4 Result", $aResult[0])
EndIf
$aResult = StringRegExp("This is a test example", '(te)(st)', $STR_REGEXPARRAYMATCH)
If Not @error Then
    MsgBox($MB_OK, "SRE Example 4 Result", $aResult[0] & "," & $aResult[1])
EndIf

So, first the pattern must match somewhere in the test string. If it does, then SRE is told to "capture" any groups ("()") and store them in the return array. You can use multiple captures, as demonstrated by the second piece of code in Example 4.

Ok, back to the log file. Now that we know how to "capture" text, let's construct our pattern: Since you know what you're looking for is digits, there are 3 ways to specify "match any digit": "[:digit:]", "[0-9]", and "\d". The first is probably the easiest to understand. There are a few classes (digit, alnum, space, etc. Check the help file for a full list) you can use to specify sets of characters, one of them being digit. "[0-9]" just specifies a range of all the digits 0 through 9. "\d" is just a special character that means the same as the first two. There is no difference between the three, and with all SRE's there are usually at least a couple ways to construct any pattern.

So, first we know we want to capture the digits, so indicate that with the opening parentheses "(". Next, we know we want to capture between 1 and 3 characters, all consisting of digits, so our pattern now looks like "([0-9]{1,3}". And finally close it off with the closing parentheses to indicate the end of our group: "([0-9]{1,3})". Let's try it:


Example 5

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Local $aResult = StringRegExp("There were 18 sheets left in the ream of paper.", _
        '([0-9]{1,3})', $STR_REGEXPARRAYMATCH)
If Not @error Then
    MsgBox($MB_OK, "SRE Example 5 Result", $aResult[0])
EndIf

There you go, the message box correctly displays "18".

Next we need to cover non-capturing groups. The way you indicate these groups is by opening the group with "(?:" instead of just "(". Let's say your log file says "You used 36 of 279 pages." Now if you run Example 5's SRE on this, you'll come up with "36" instead of "279". Now what I like to do here is just determine what's different between the numbers. One that jumps out at me is that the second number is always followed by a space and then the word "pages". We could just modify our previous pattern to be "([0-9]{1,3} pages)", but what if our script is just looking for the starting amount of pages, without " pages" tacked onto the end of the number? Here's where you can use a non-capturing group to accomplish this.

Example 6

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Local $aResult = StringRegExp("You used 36 of 279 pages.", '([0-9]{1,3})(?: pages)', $STR_REGEXPARRAYMATCH)
If Not @error Then
    MsgBox($MB_OK, "SRE Example 6 Result", $aResult[0])
EndIf

This could get lengthy, but mostly I just wanted to lay out the foundation for how regular expressions work, and mainly how SRE "thinks". A few things to keep in mind:
- Remember to think about the pattern one character at a time
- The StringRegExp() function finds the first character in the pattern, then it's your job to provide enough
evidence to "prove" whether or not it truly is a match. Example 6 is a good display of this.
- Remember [ ... ] means OR ([xyz] match an "x", a "y", OR a "z")
If you have any other questions, consult the help file first! It explains in detail all of the nitty gritty syntax that comes along with SRE's. One thing to look at in particular is the section on "Repeating Characters". It can make your pattern more readable by substituting certain characters for ranges. For example: "*" is equivalent to {0,} or the range from 0 to any number of characters.

Good luck, Regular Expressions can greatly decrease the length of your code, and make it easier to modify later. Corrections and feedback are welcome!

Resources

Wikipedia Article - Regular Expressions - Thanks blindwig.

The 30 Minute Regex Tutorial - by Jim Hollenhorst.


GUI for testing various StringRegExp() patterns - Thanks steve8tch. Credit: w0uter

 

 

Thanks to neogia for this tutorial.