StringRegExp


KJohn

How would I craft a StringRegExp to obtain the following behavior? (Case insensitive throughout)

Assume the string to be evaluated is: Galapagos

The substring should return a non-zero only if it matches from the first character. For example: Matches would include G, Ga, Gal, Gala, gAlaP, gaLApa, etc. But the following would not be matches: ala, Pago, lapagos, etc.

I had done this once a long time back but this single comparison in StringRegExp proved to be much slower than all the multiple comparisons by StringInStr (SiS performs with multiple starting characters till a match is found or end of string is reached). Maybe it had something to do with the way I had written it. Maybe not. Please help.

Speed is my primary concern. I'm latching on to the single comparison benefit of SRE. If you can think of an even better way to accomplish the above, I would be really grateful.

Edited by Koshy John

Hi,

Here's one way; but if you are looking in a fileread, you will see in my RegExp func you need things like (?m) to get every line too..

#include<array.au3>
$s_FileRead="Galapagos"
$s_Searches="gAlAp"
;~ $s_Searches=StringReplace(StringReplace($s_Searches, "|", " | "), ".", "\.")
$patternReg='^(?i)'&$s_Searches&'.*$'
Local $asList = StringRegExp($s_FileRead, $patternReg, 3)
if IsArray($asList) then _ArrayDisplay($asList," Matches for "&$s_Searches&" in " &$s_FileRead)
if not IsArray($asList) then  MsgBox(0,"","No Match for "&$s_Searches&" in " &$s_FileRead)
;=============================================
$s_Searches="AlAp"
;~ $s_Searches=StringReplace(StringReplace($s_Searches, "|", " | "), ".", "\.")
$patternReg='^(?i)'&$s_Searches&'.*$'
Local $asList = StringRegExp($s_FileRead, $patternReg, 3)
if IsArray($asList) then _ArrayDisplay($asList," Matches for "&$s_Searches&" in " &$s_FileRead)
if not IsArray($asList) then  MsgBox(0,"","No Match for "&$s_Searches&" in " &$s_FileRead)
Best, Randall

PS: if you are looking at speed in a large RegExp, you need to compile with ANSI, not Unicode (unless that has been fixed too? I haven't tested it lately, but there has been a huge speed difference).

Edited by randallc

The one line I was looking for was: '^(?i)'&$s_Searches&'.*$'

Could you explain that to me? This is what I understand:

(?i) - case insensitive

^ - but why is this there? (Isn't it used to match any character not in the set?)

'.*$' - and what does this stand for?

There are a few parts of AutoIt that don't yet have full Unicode support. These are:

Send and ControlSend - Instead, Use ControlSetText or the Clipboard functions.

Regular expressions - To reduce the size of AutoIt, the regular expression engine is currently compiled in ANSI mode.

Console operations are converted to ANSI.

These limits will be addressed in future versions if possible.

Technically, that means the RegExp engine is compiled in ANSI in both versions, so does it really make a difference? From a performance point of view, would it make sense to compile scripts in ANSI mode (assuming the script will only run on English-language systems)?

Edited by Koshy John

Ah.. forget it... Regular expressions are slower whether the stub is ANSI or Unicode... ANSI compilation is a little faster but the difference is negligible...

"^" is marker for beginning of line.

$s_Searches is the search string.

"$" is marker for end of line.

So '.*' is "." any character, any number of times, even zero, after the search string.

then '$' the end of line
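
For a plain yes/no prefix test, the pieces described above can be put together roughly as in the sketch below. It is only an illustration: the helper name _IsPrefix is made up here, the trailing '.*$' is left out because it does not change whether the pattern matches, and \Q...\E is added on the assumption that the search text should be taken literally rather than as a pattern.

```autoit
; Hypothetical helper (not from the thread): returns 1 if $sPrefix matches
; $sString from its first character, ignoring case, otherwise 0.
Func _IsPrefix($sString, $sPrefix)
    ; (?i)     case-insensitive matching
    ; ^        anchor the match at the start of the subject
    ; \Q..\E   treat the prefix literally (assumes it does not contain "\E")
    Return StringRegExp($sString, "(?i)^\Q" & $sPrefix & "\E")
EndFunc

ConsoleWrite(_IsPrefix("Galapagos", "gAlaP") & @CRLF) ; 1
ConsoleWrite(_IsPrefix("Galapagos", "Pago") & @CRLF)  ; 0
```

With the default flag, StringRegExp only reports whether the pattern matched, so no result array is built.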

Best, randall

(When I last checked, the speed difference on a large result (say a 30% match) from a huge file, e.g. 80 MB, was about 100x in favor of ANSI, but negligible for small results. I want to process a huge file with one RegExp call rather than slow it down by looping over repeated calls, so this becomes significant.

And I am still puzzling how I was lucky to get such a fast result given all the potential pitfalls with RegExp callbacks; just lucky for a change with my first attempt!

Best, Randall)

You do realize that doing a RegExp on an entire 80MB file will load the entire 80MB into RAM, right? 80MB of RAM may not be much for many of us, but there are a lot of people in this world still on 256MB, especially in developing countries.

Do you know some other function available to us that doesn't?

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


Am I missing something or ?? Isn't this a lot easier:

$String = "Galapagos"
$Test = "gala"

If StringLeft(StringUpper($String), StringLen ($Test)) == StringUpper($Test) Then
    MsgBox(0,"","Match found")
EndIf

; Shorter variant: the = operator compares strings case-insensitively
$String = "Galapagos"
$Test = "gala"

If StringLeft($String, StringLen($Test)) = $Test Then MsgBox(0, "", "Match found")
Edited by weaponx

$s_FileRead ="Galapagos"
$s_Searches ="gAlAp"
_StringMatch($s_FileRead, $s_Searches)
If @error Then
    MsgBox(16, "Error", "Match not found")
Else
    MsgBox(64, "Success", "Match found")
EndIf

Func _StringMatch($sInStr, $sVerify)
    If StringRegExp($sInStr, "(?s)(?i)(?m:^|[\s|,|\.|\?\:])" & $sVerify) Then Return 1
    Return SetError(1, 0, 0)
EndFunc
Are you trying to do something like this?


"^" is marker for beginning of line.

Where can I read about it? :)

As far as I know, to match the beginning of the string you use \A, and to match the end of the string (not the line) you use \z. But ^ is for matching anything other than the characters that follow it.

About the matches... Koshy John: why can't you just use StringInStr()?

Sorry, it seems that i misunderstood the request.

Edited by MsCreatoR

 

Hi,

Start with Wikipedia:

[^ ] Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.

^ Matches the starting position within the string. In multiline mode, it matches the starting position of any line.

(from Wikipedia, regexp), or the tutorial on the forum (link in my sig).

Best, randall
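
The two meanings of "^" quoted above can be checked with a few one-liners. This is just a small sketch using the thread's sample string; the patterns are chosen for illustration and are case-sensitive because no (?i) is given.

```autoit
; "^" outside brackets anchors the match at the start of the subject.
ConsoleWrite(StringRegExp("Galapagos", "^Gal") & @CRLF)   ; 1
ConsoleWrite(StringRegExp("Galapagos", "^ala") & @CRLF)   ; 0

; \A also anchors at the start of the subject.
ConsoleWrite(StringRegExp("Galapagos", "\AGal") & @CRLF)  ; 1

; "^" as the first character inside brackets negates the set.
ConsoleWrite(StringRegExp("Galapagos", "[^a-z]") & @CRLF) ; 1, "G" is not a lowercase letter
ConsoleWrite(StringRegExp("galapagos", "[^a-z]") & @CRLF) ; 0, every character is lowercase
```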


80MB of RAM may not be much for many of us but there are a lot of people in this world on 256MB still... especially in the developing countries...

The good news for an index on a smaller machine is that it would be likely to be less than 40gig HD, so <10Mb index file!

Best,Randall

What you said is indeed easier (I've tried that before manually), but it's much slower than StringInStr by itself, even with all its excess comparisons. The whole point of attempting to do the comparison only once was to make it faster.
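
Speed claims like these are easy to measure directly. The following is a rough timing sketch, not a rigorous benchmark: the loop count is arbitrary, the variable names are made up for this example, and the numbers will vary with the AutoIt version and the machine. It times the three approaches discussed in the thread for a case-insensitive prefix test.

```autoit
Local Const $sString = "Galapagos"
Local Const $sTest = "gAlaP"
Local Const $iLoops = 100000 ; arbitrary repeat count, adjust as needed
Local $hTimer, $bHit

; 1) StringInStr (case-insensitive by default); a prefix match means position 1
$hTimer = TimerInit()
For $i = 1 To $iLoops
    $bHit = (StringInStr($sString, $sTest) = 1)
Next
ConsoleWrite("StringInStr: " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)

; 2) StringLeft with the case-insensitive = operator
$hTimer = TimerInit()
For $i = 1 To $iLoops
    $bHit = (StringLeft($sString, StringLen($sTest)) = $sTest)
Next
ConsoleWrite("StringLeft:  " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)

; 3) StringRegExp anchored at the start; \Q..\E keeps the test text literal
$hTimer = TimerInit()
For $i = 1 To $iLoops
    $bHit = (StringRegExp($sString, "(?i)^\Q" & $sTest & "\E") = 1)
Next
ConsoleWrite("StringRegExp: " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)
```

Trying both a matching and a non-matching test string, and a much longer subject string, would show whether the extra comparisons made by StringInStr matter in practice.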

What I meant was not to load the whole file completely... reading line by line would be a better option...

Hi, I haven't tested lately, but I always thought reading line by line would have to mean a slower looping function than reading all at once surely?

Best, Randall

Reading line by line is slightly slower.. But it all depends on how you will be processing the file...

But reading the whole file can be slower on low mem systems if there is hard disk thrashing (page file swapping)...

P.S. I'm readying a test script to bring out the differences in speed...
