StringRegExp

KJohn · August 29, 2007

How would I craft a StringRegExp to obtain the following behavior? (Case insensitive throughout)

Assume the string to be evaluated is: Galapagos

The substring should return a non-zero only if it matches from the first character. For example: Matches would include G, Ga, Gal, Gala, gAlaP, gaLApa, etc. But the following would not be matches: ala, Pago, lapagos, etc.

I had done this once a long time back but this single comparison in StringRegExp proved to be much slower than all the multiple comparisons by StringInStr (SiS performs with multiple starting characters till a match is found or end of string is reached). Maybe it had something to do with the way I had written it. Maybe not. Please help.

Speed is my primary concern. I'm latching on to the single comparison benefit of SRE. If you can think of an even better way to accomplish the above, I would be really grateful.

Edited August 29, 2007 by Koshy John

randallc · August 29, 2007

Hi,

Here's one way; but if you are looking in a fileread, you will see in my RegExp func you need things like (?m) to get every line too..

#include<array.au3>
$s_FileRead="Galapagos"
$s_Searches="gAlAp"
;~ $s_Searches=StringReplace(StringReplace($s_Searches, "|", " | "), ".", "\.")
$patternReg='^(?i)'&$s_Searches&'.*$'
Local $asList = StringRegExp($s_FileRead, $patternReg, 3)
if IsArray($asList) then _ArrayDisplay($asList," Matches for "&$s_Searches&" in " &$s_FileRead)
if not IsArray($asList) then  MsgBox(0,"","No Match for "&$s_Searches&" in " &$s_FileRead)
;=============================================
$s_Searches="AlAp"
;~ $s_Searches=StringReplace(StringReplace($s_Searches, "|", " | "), ".", "\.")
$patternReg='^(?i)'&$s_Searches&'.*$'
Local $asList = StringRegExp($s_FileRead, $patternReg, 3)
if IsArray($asList) then _ArrayDisplay($asList," Matches for "&$s_Searches&" in " &$s_FileRead)
if not IsArray($asList) then  MsgBox(0,"","No Match for "&$s_Searches&" in " &$s_FileRead)

Best, Randall

PS: if you are looking at speed in large RegExp, you need to compile with ANSI, not Unicode (unless that has been fixed too? I haven't tested it lately, but there has been a huge speed difference).

Edited August 29, 2007 by randallc

KJohn · August 29, 2007

The one line I was looking for was: '^(?i)'&$s_Searches&'.*$'

Could you explain that to me? This is what I understand:

(?i) - case insensitive

^ - but why is this there? (to match any character not in the set?)

'.*$' - and what does this stand for?

There are a few parts of AutoIt that don't yet have full Unicode support. These are:

Send and ControlSend - Instead, Use ControlSetText or the Clipboard functions.
Regular expressions - To reduce the size of AutoIt, the regular expression engine is currently compiled in ANSI mode.
Console operations are converted to ANSI.

These limits will be addressed in future versions if possible.

Technically that means the RegExp engine is compiled in ANSI in both versions, so does it really make a difference? From a performance point of view, would it make sense to compile scripts in ANSI mode (assuming the script will only run on English-language systems)?

Edited August 29, 2007 by Koshy John

KJohn · August 29, 2007

Ah.. forget it... Regular expressions are slower whether the stub is ANSI or Unicode... ANSI compilation is a little faster but the difference is negligible...

randallc · August 29, 2007

Could you explain that to me? This is what I understand:
(?i) - case insensitive
^ - but why is this there? (to match any character not in the set?)
'.*$' - and what does this stand for?

"^" is marker for beginning of line.

$s_Search is search string.

"$" is marker for end of line.

So '.*' is "." any character, any number of times, even zero, after the search string.

then '$' the end of line
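A minimal sketch of the anchors described above, applied to the original Galapagos question. The function name `_StartsWith` is hypothetical, and the `\Q...\E` quoting is an addition that is not in the thread; it assumes the search text does not itself contain `\E`. For a plain yes/no test the trailing `.*$` is not needed, so it is left out here.

```autoit
; Sketch only: anchored, case-insensitive prefix test.
; _StartsWith is a hypothetical helper name, not part of AutoIt.
Func _StartsWith($sSubject, $sPrefix)
    ; (?i)    case-insensitive
    ; ^       anchor at the start of the subject
    ; \Q..\E  treat the prefix literally, so "." or "|" are not regex operators
    ; With the default flag (0), StringRegExp returns 1 on a match, 0 otherwise.
    Return StringRegExp($sSubject, '(?i)^\Q' & $sPrefix & '\E') = 1
EndFunc

If _StartsWith("Galapagos", "gAlaP") Then ConsoleWrite("gAlaP: match" & @CRLF)
If Not _StartsWith("Galapagos", "ala") Then ConsoleWrite("ala: no match" & @CRLF)
```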

Best, randall

(When I last checked, the speed difference on a large result (say 30% match) of a huge file, e.g. 80 MB, was about 100x in favour of ANSI, but negligible for small results. But I want to do a huge file with one RegExp call, not loop over recurrent calls and slow it down, so this becomes significant.

And I am still puzzling how I was lucky to get such a fast result given all the potential pitfalls with RegExp callbacks; just lucky for a change with my first attempt!

Best, Randall)

KJohn · August 29, 2007

You do realize that doing a RegExp on an entire 80MB file will load the entire 80MB into RAM, right? 80MB of RAM may not be much for many of us, but there are a lot of people in this world still on 256MB, especially in the developing countries...

SmOke_N · August 29, 2007

You do realize that doing a RegExp on an entire 80MB file will load the entire 80MB into RAM, right? 80MB of RAM may not be much for many of us, but there are a lot of people in this world still on 256MB, especially in the developing countries...

Do you know some other function available to us that doesn't?

weaponx · August 29, 2007

Am I missing something or ?? Isn't this a lot easier:

$String = "Galapagos"
$Test = "gala"

If StringLeft(StringUpper($String), StringLen ($Test)) == StringUpper($Test) Then
    MsgBox(0,"","Match found")
EndIf
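
A hedged variation on the snippet above: in AutoIt the `=` operator compares strings case-insensitively (`==` is the case-sensitive form), so the two `StringUpper` calls can probably be dropped. The helper name `_HasPrefix` is hypothetical, and whether this is measurably faster is untested here.

```autoit
; Sketch only: prefix test relying on the case-insensitive "=" operator.
; _HasPrefix is a hypothetical helper name, not part of AutoIt.
Func _HasPrefix($sString, $sTest)
    ; Take as many leading characters as the test string has,
    ; then compare them without regard to case.
    Return StringLeft($sString, StringLen($sTest)) = $sTest
EndFunc

If _HasPrefix("Galapagos", "gala") Then ConsoleWrite("gala: match" & @CRLF)
If Not _HasPrefix("Galapagos", "ala") Then ConsoleWrite("ala: no match" & @CRLF)
```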

Edited August 29, 2007 by weaponx

SmOke_N · August 29, 2007

$s_FileRead ="Galapagos"
$s_Searches ="gAlAp"
_StringMatch($s_FileRead, $s_Searches)
If @error Then
    MsgBox(16, "Error", "Match not found")
Else
    MsgBox(64, "Success", "Match found")
EndIf

Func _StringMatch($sInStr, $sVerify)
    If StringRegExp($sInStr, "(?s)(?i)(?m:^|[\s|,|\.|\?\:])" & $sVerify) Then Return 1
    Return SetError(1, 0, 0)
EndFunc

Are you trying to do something like this?

MrCreatoR · August 29, 2007

"^" is marker for beginning of line.

Where can I read about it?

As far as I know, to match the beginning of the string you use \A, and to match the end of the string (not the line) you use \z. But ^ is for matching anything other than the characters that follow it.

~~About the matches... Koshy John: why can you not just use StringInStr()?~~

Sorry, it seems that i misunderstood the request.

Edited August 29, 2007 by MsCreatoR

randallc · August 29, 2007

Where can I read about it?
As far as I know, to match the beginning of the string you use \A, and to match the end of the string (not the line) you use \z. But ^ is for matching anything other than the characters that follow it.

~~About the matches... Koshy John: why can you not just use StringInStr()?~~
Sorry, it seems that i misunderstood the request.

Hi,

Start with Wikipedia:

[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.
^ Matches the starting position within the string. In multiline mode, it matches the starting position of any line.

(from the Wikipedia article on regular expressions), or see the tutorial on the forum (link in my sig).

Best, randall

randallc · August 29, 2007

80MB of RAM may not be much for many of us but there are a lot of people in this world on 256MB still... especially in the developing countries...

The good news for an index on a smaller machine is that it would be likely to be less than 40gig HD, so <10Mb index file!

Best,Randall

KJohn · September 1, 2007

Do you know some other function available to us that doesn't?

What I meant was not to load the whole file completely... reading line by line would be a better option...

KJohn · September 1, 2007

Am I missing something or ?? Isn't this a lot easier:

$String = "Galapagos"
$Test = "gala"

If StringLeft(StringUpper($String), StringLen ($Test)) == StringUpper($Test) Then
    MsgBox(0,"","Match found")
EndIf

What you said is indeed easier (I've tried that before manually), but it's much slower than StringInStr by itself, even with all its excess comparisons. The whole point of attempting to do the comparison only once was to make it faster.

randallc · September 1, 2007

What I meant was not to load the whole file completely... reading line by line would be a better option...

Hi, I haven't tested lately, but I always thought reading line by line would have to mean a slower looping function than reading all at once surely?

Best, Randall

KJohn · September 1, 2007

Hi, I haven't tested lately, but I always thought reading line by line would have to mean a slower looping function than reading all at once surely?
Best, Randall

Reading line by line is slightly slower.. But it all depends on how you will be processing the file...

But reading the whole file can be slower on low mem systems if there is hard disk thrashing (page file swapping)...
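
A rough sketch of the line-by-line option being discussed, so that only one line is held in memory at a time. The function name `_CountPrefixLines` and the file name in the usage line are hypothetical, and no timing claim is made here; it only illustrates the shape of the loop.

```autoit
; Sketch only: count lines that start with a given prefix, reading one line at a time.
; _CountPrefixLines is a hypothetical helper name, not part of AutoIt.
Func _CountPrefixLines($sPath, $sPrefix)
    Local $hFile = FileOpen($sPath, 0) ; 0 = open for reading
    If $hFile = -1 Then Return SetError(1, 0, -1) ; file could not be opened

    Local $iHits = 0, $sLine
    While 1
        $sLine = FileReadLine($hFile)
        If @error Then ExitLoop ; end of file (or read error)
        ; "=" compares strings case-insensitively in AutoIt
        If StringLeft($sLine, StringLen($sPrefix)) = $sPrefix Then $iHits += 1
    WEnd

    FileClose($hFile)
    Return $iHits
EndFunc

; Usage with a hypothetical file name:
ConsoleWrite(_CountPrefixLines("islands.txt", "gala") & @CRLF)
```

Reading the whole file with FileRead and running a single multi-line regular expression is the alternative randallc describes; which one wins will depend on file size and available memory.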

P.S. I'm readying a test script to bring out the differences in speed...

KJohn · September 1, 2007

http://www.autoitscript.com/forum/index.php?showtopic=52253

This is being discussed in the AutoIt Feature Requests forum...
