Regular Expression Testing


Nutster
I am now in (I hope) final testing of the regular expression routines. I hope to have it done by oh, Monday, maybe the weekend. I have also done:

  • Binary search on the function list. Approx 20%-25% speed increase.
  • StringJoin
  • StringSplit takes whole string for delimiter
  • Added @CTimer (the number of seconds since midnight, Jan 1, 1970 UTC), @PI, @E (exp(1))
  • Adding way more comments than I do when I am writing for myself. ;) I am putting in more comments than I usually do for my classroom examples, because I will not be there to explain what the F*@< I am doing.
  • Couple more optimizations that I do not remember right now. :)
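The binary search on the function list can be pictured with a small sketch. This is a Python illustration only, not the AutoIt source: the table contents, the upper-casing of names, and the `find_function` helper are assumptions made up for the example.

```python
from bisect import bisect_left

# Hypothetical table of built-in function names, kept in sorted order.
FUNCTIONS = sorted(["FILEOPEN", "REGEXP", "REGEXPCLOSE", "REGEXPSET",
                    "STRINGINSTR", "STRINGJOIN", "STRINGSPLIT"])

def find_function(name):
    """Return the index of name in FUNCTIONS, or -1 if it is not present."""
    key = name.upper()  # assumption: lookups ignore case
    i = bisect_left(FUNCTIONS, key)  # O(log n) probe instead of a linear scan
    if i < len(FUNCTIONS) and FUNCTIONS[i] == key:
        return i
    return -1
```

A linear scan compares against every entry in the worst case; the sorted table lets each lookup halve the remaining range, which is the kind of change that could plausibly give the speed-up mentioned.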
Just thought I would keep people up to date on what I have been doing, other than working for money :) Need that roof over the head and food on the table (well, in the belly is better, but usually I can quickly change one into the other.)

Edited by Nutster

David Nuttall
Nuttall Computer Consulting

An Aquarius born during the Age of Aquarius

AutoIt allows me to re-invent the wheel so much faster.

I'm off to write a wizard, a wonderful wizard of odd...

Ok, for the Regular Expressions as implemented (intended):

$x = RegExp($line, $pattern [,"Array"])

Perform a case-sensitive comparison of $line against the given pattern. Both are to be strings.

The pattern is defined using the following symbols:

  • "abc" - all regular characters that match themselves. i.e. match abc somewhere in the string.
  • "[abc]" - set: match one character that is a, b or c somewhere in the string
  • "[^abc]" - negated set: match one character other than a, b or c.
  • "b*" - matches 0 or more b's.
  • "b+" - matches 1 or more b's.
  • "b?" - optional: matches b if it is there, but does not have to be.
  • "^abc" - abc must appear at the beginning of the string.
  • "abc$" - abc must appear at the end of the string.
  • "(abc)" - group: treat the pattern inside the brackets as a unit. e.g. "(ab)+" matches "ab", "abab", "ababab", etc. The text that matches a group will be stored in the array if it is named in the RegExp call.
I am also including a whole bunch of class definitions: \s for any one whitespace, \d for any one digit, \a for any one alphabetic character, \A for any one alphanumeric character, \p for any punctuation character, \w for any word character (alphabetic or underline), \u for any upper-case character, \l (lower L) for any lower-case character. I think I have them all.
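For readers who want to try the symbols listed above, here is a rough sketch using Python's `re` module as a stand-in (it is not the AutoIt RegExp routine). The basic symbols behave the same way there, but the classes \a, \A, \p, \u and \l described in this post are not Python classes (in Python, \a is the bell character and \A is a start-of-string anchor), so explicit sets are used in their place.

```python
import re

# Each entry: (pattern, text, whether a match is expected somewhere in text).
CASES = [
    (r"abc",     "xxabcxx", True),   # literal characters match themselves
    (r"[abc]",   "xyzb",    True),   # set: one of a, b or c
    (r"[^abc]",  "abc",     False),  # negated set: needs a char other than a, b, c
    (r"ab*c",    "ac",      True),   # b* allows zero b's
    (r"ab+c",    "ac",      False),  # b+ needs at least one b
    (r"ab?c",    "abc",     True),   # b? is optional
    (r"^abc",    "xabc",    False),  # abc must start the string
    (r"abc$",    "xabc",    True),   # abc must end the string
    (r"^(ab)+$", "ababab",  True),   # group repeated as a unit
    (r"\s\d",    "a 1",     True),   # one whitespace, then one digit
    (r"[A-Za-z][A-Z][a-z]", "xQz", True),  # explicit stand-ins for \a, \u, \l
]

def check(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

results = [check(p, t) == expected for p, t, expected in CASES]
```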

That is what the docs are for. Man, was writing the docs a pain. Oh well, one of the necessary evils of writing a program is documenting it.

I have also written functions to keep track of regular expressions that you use repeatedly: RegExpSet and RegExpClose. RegExpSet interprets and stores the regular expression and returns a handle, similar to FileOpen. Do not mix up FileOpen handles and RegExpSet handles! They are not the same! You can replace the $pattern in the above function call with a handle given by RegExpSet. RegExpClose releases the memory that was used by the stored regular expression.

Using RegExpSet will speed up calls of RegExp because the pattern does not have to be interpreted each time, but only once. I only made 4 spaces to store regular expressions. Do people think that is enough?
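The handle idea can be sketched as a small fixed-size table of pre-compiled patterns. This is a Python sketch under assumptions, not the actual implementation: the function names `regexp_set`, `regexp_close` and `regexp` only mirror the names in the post, handles are assumed to be plain slot numbers, and running out of slots is assumed to raise an error.

```python
import re

MAX_SLOTS = 4  # the post mentions four storage slots
_slots = [None] * MAX_SLOTS

def regexp_set(pattern):
    """Compile pattern once, store it in a free slot, return the slot number."""
    compiled = re.compile(pattern)  # raises re.error if the pattern is bad
    for handle, slot in enumerate(_slots):
        if slot is None:
            _slots[handle] = compiled
            return handle
    raise RuntimeError("no free regular expression slots")

def regexp_close(handle):
    """Release the slot so it can be reused."""
    _slots[handle] = None

def regexp(line, pattern_or_handle):
    """Return 1 on a match, 0 otherwise; accepts a pattern string or a handle."""
    if isinstance(pattern_or_handle, int):
        if not 0 <= pattern_or_handle < MAX_SLOTS:
            raise ValueError("handle out of range")
        compiled = _slots[pattern_or_handle]
        if compiled is None:
            raise ValueError("handle is not open")
    else:
        compiled = re.compile(pattern_or_handle)  # interpreted on every call
    return 1 if compiled.search(line) else 0
```

With a handle, the pattern is interpreted once in `regexp_set` and only looked up afterwards, which is where the saving for repeated calls comes from.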

Edit: Fix spelling errors.

Edited by Nutster


It looks like a good, solid implementation. Great job, Nutster! I don't know if it would be possible, but just to throw it out there: VB does RegEx through an operator named "Like". This makes the functionality inline with the rest of the code without having to make any explicit function calls. Would something similar be plausible in this situation?

*** Matt @ MPCS


  • Administrators

Just thought I would keep people up to date on what I have been doing, other than working for money :) Need that roof over the head and food on the table (well in the belly is better, but usually I can quickly change one into the other.)

It would be sweet to be able to work on hobby code for a living. :) But then I suppose it would become work and we'd end up moaning anyway!

I noticed that one of the best features of regex is missing: Backreference

For example you can have a regex: ^([hH][eE][lL][lL][oO]) and \1$

This would match the following lines:

Hello and Hello

hello and hello

hELLo and hELLo

...

but would not match the following lines:

Hello and hello

helLO and HELLO

So basically \1 references back to the match in the first brackets (works usually up to \9 which equals the ninth brackets).

This is a very powerful feature if you know how to use it.
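The example above can be checked with any engine that supports backreferences. Here is a short sketch using Python's `re` module (a stand-in, since the RegExp routine discussed in this thread does not have the feature yet):

```python
import re

# \1 must repeat exactly the text captured by the first group.
pattern = re.compile(r"^([hH][eE][lL][lL][oO]) and \1$")

matching = ["Hello and Hello", "hello and hello", "hELLo and hELLo"]
non_matching = ["Hello and hello", "helLO and HELLO"]

matched = [pattern.search(s) is not None for s in matching]
rejected = [pattern.search(s) is None for s in non_matching]
```

Every entry in `matched` and `rejected` comes out true: the group itself accepts any mix of case, but \1 only accepts the same mix that was captured.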

There's a complete description of Regex available here (yes, I know it's old. But regex hasn't been changed much since 1992).

Edited by sugi

Ok, here we go.

  • Matt @ MPCS: Not practical at the moment. Adding a new operator is a little bit of a pain because of the way it is implemented.
  • CyberSlug: I knew I forgot something. "." (dot) matches any character. This is what I get for writing this from memory.
  • Jon: Why would I be moaning about writing AutoIt? When my cousins asked me what I was doing at a family reunion I said that I was part of an international programming project to write the Windows automation tool called AutoIt. Oh, and I am writing this big database for a company.
  • sugi: Back-reference. Hmm, now there's an idea. I have looked through the reference you suggested and got a lot of ideas for the next version of RegExp. Let me get what I have working before adding more stuff to it.
Oh, well back to work. I will see what I can do to finish this on the weekend and submit to Jon on Monday.


Ok, for the Regular Expressions as implemented (intended):

$x = RegExp($line, $pattern [,"Array"])

Perform a case-sensitive comparison of $line against the given pattern.  Both are to be strings.

Nutster: could you be a bit more specific about what the function returns, and about what is returned in the "Array" (assuming anything is)?

For instance, say I had a string "You were charged $49.57.", and I wanted to get at the amount. Now, I'm not very good at REs, but let's say

$line = "You were charged $49.57."
$re = "\d+\.\d\d"
$array = 0
$x = RegExp($line, $re, $array)

What would the value of $x be? And the value of $array? And how could I extract the "49.57" from the string after having used your RegExp function? I am assuming right now that $array[0] would equal 19 (the position within the string where the RE match starts), and $array[1] would equal 23 (the position where the RE match ends), but I am most likely incorrect. So, could you clarify?

TIA

My Projects:DebugIt - Debug your AutoIt scripts with DebugIt!


The function returns 1 for success and 0 for failure, and sets @Error if the regular expression is bad. The Array gets the contents of each group, in a single-dimensioned array, blowing away its old contents. If the array is given and there is no match, then the array is replaced with an empty string. Right now, I am looking at passing the name, not the variable itself. I think Valik did something to compare variables that are passed, but I do not remember how to use it. So change your code to:

$line = "You were charged $49.57."
$re = "(\d+\.\d\d)"
$array = 0
$x = RegExp($line, $re, "array")

Otherwise, everything is correct. "49.57" would be in $array[0]. The position of matches is not stored. You can use StringInStr to find where $array[0] is in $line.
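The behaviour described here (a 1/0 result, group contents in the array, positions looked up afterwards) can be mimicked in Python for experimentation. This is only a sketch: the `regexp` helper below is hypothetical, returns the array instead of filling a named variable, and uses Python's `re` rather than the routine under test.

```python
import re

def regexp(line, pattern):
    """Return (1, list of group contents) on a match, or (0, "") otherwise."""
    m = re.search(pattern, line)
    if m is None:
        return 0, ""  # no match: the array is replaced with an empty string
    return 1, list(m.groups())

line = "You were charged $49.57."
ok, array = regexp(line, r"(\d+\.\d\d)")

# str.find is 0-based; add 1 to get the 1-based position StringInStr reports.
position = line.find(array[0]) + 1
```

Here `array[0]` holds "49.57" and `position` is 19, which agrees with the start position guessed in the question.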

I guess I could add (next version) \# to store the current position in the array.

Edited by Nutster


Just wondering if you have included some of the features you mentioned previously:

  • \t - tab character
  • \n - newline
  • \w - a word = any contiguous set of alphanumeric chars including "_", but excluding whitespace chars.
  • \* - an actual *, similarly for the other special characters (+ . \ ? etc.)
  • [^A-Z] - Exclusion set = anything other than the characters specified.
:)

Edited by trids


Let's see. BTW, still testing. It is so easy for little bugs to get in the system and screw critical things up. Ok, I guess I will not be submitting today.
  • Tab. Not yet. Can add tonight.
  • Newline. Implemented. Need to test.
  • \w - Implemented as a single character. Do \w+ to get the whole word.
  • \* - well any special character. Implemented, but still need to test fully.
  • Exclusion set - implemented, but (guess what?!) still needs testing.
Added \# over the weekend, but still need to test along with the other stuff about storing values in the array.
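The difference between \w and \w+, and the escaping of special characters, can be seen in a quick Python sketch. One caveat: Python's \w also matches digits, whereas the \w described earlier in this thread covers only alphabetic characters and the underline, so the whole-word result may differ in the AutoIt routine.

```python
import re

text = "rate_1 * 2.5"

one_char = re.search(r"\w", text).group()     # a single word character
whole_word = re.search(r"\w+", text).group()  # the longest run of word characters
star = re.search(r"\*", text).group()         # escaped *: a literal asterisk
number = re.search(r"\d\.\d", text).group()   # escaped dot: a literal "."
has_tab = re.search(r"\t", "a\tb") is not None  # \t matches a tab character
```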

Oh well, the docs will contain a list of everything I have implemented. Hey docs team, would you be willing to clean up what I submit? I have used a different approach for explaining regular expressions than the stuff I saw in my Perl docs or the link that sugi gave me. I think it is an easier way of understanding, but I will see what you guys think when I get around to submitting it. I am using the source from Oct 27 as the base.

Edited by Nutster


I almost forgot: the OR character is very handy too .. usually a "pipe" (|)

$sTarget = "(jan)|(feb)"    ;finds either "jan" or "feb"


Next version. It was turning out to be a bit of a pain to implement, so I dropped it for now. Definitely on the TO DO list. Should not need the brackets.

$sTarget = "jan|feb"    ; finds "jan" or "feb"
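Since the OR operator is not in this version, here is how the two spellings behave in Python's `re`, purely as an illustration of the intended semantics: both find either word, and the brackets only change what ends up in the captured groups.

```python
import re

with_groups = re.compile(r"(jan)|(feb)")
without_groups = re.compile(r"jan|feb")

m1 = with_groups.search("due in feb")
m2 = without_groups.search("due in feb")

found_1 = m1.group()      # text matched by the whole pattern
groups_1 = m1.groups()    # one entry per bracket pair; unused ones are None
found_2 = m2.group()
groups_2 = m2.groups()    # no brackets, so no captured groups
```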


As an alternative method to implementing the OR that way, you could always invert the code you use for Exclusion sets ("[^a-zA-Z]") to make an INclusion set. Just a thought. I know I have seen it implemented like that before somewhere.

*** Matt @ MPCS



All I do is check if the character did match and invert the value.

// C++
// If the set is negated, flip the result of the membership test.
if (m_type & at_not)
    bFound = ! bFound;
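The same idea, written out as a self-contained Python sketch: one membership test serves both inclusion and exclusion sets, and a flag decides whether to invert the result. The flag value `AT_NOT` and the `set_matches` helper are hypothetical; only the name at_not appears in the C++ fragment.

```python
AT_NOT = 0x01  # hypothetical bit value for the "negated set" flag

def set_matches(ch, members, flags=0):
    """Return True if ch satisfies the set, honouring the negation flag."""
    found = ch in members   # plain inclusion test, as for [abc]
    if flags & AT_NOT:
        found = not found   # same test, inverted, as for [^abc]
    return found
```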
Edited by Nutster

