New option for StringInStr?

44 posts in this topic

Posted

StringInStr currently checks for the substring at every possible starting position in the string being evaluated (0 to StringLen(string) - StringLen(substring), to put it in pseudocode).

This is inefficient if the user only wants to check with the starting character position as 0 (when there is no match, AutoIt will internally perform StringLen(string) - StringLen(substring) more comparisons than required).

There are many ways of accomplishing what I've stated in AutoIt:

1. If StringLeft("string", StringLen("substring")) = "substring" Then Return 1

2. If StringRegExp("string", '^(?i)' & "substring" & '.*$') Then Return 1

and others...
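
For reference, the prefix checks listed above can be written as runnable AutoIt roughly as follows. This is only a sketch: `_StartsWith` is a hypothetical helper name, not a built-in, and the `\Q...\E` quoting in the pattern is an addition to stop characters in the query from being read as regex syntax.

```autoit
; Hypothetical helper: True when $sString begins with $sPrefix.
; The = operator compares strings case-insensitively in AutoIt.
Func _StartsWith($sString, $sPrefix)
    Return StringLeft($sString, StringLen($sPrefix)) = $sPrefix
EndFunc

ConsoleWrite(_StartsWith("substring search", "sub") & @CRLF) ; True
ConsoleWrite(_StartsWith("a substring", "sub") & @CRLF)      ; False

; The same test through StringInStr: a result of 1 means a prefix match,
; but a miss still scans the whole string before returning 0.
ConsoleWrite((StringInStr("substring search", "sub") = 1) & @CRLF) ; True

; Regex variant: anchored at the start, case-insensitive, query quoted literally.
ConsoleWrite(StringRegExp("substring search", "(?i)^\Qsub\E") & @CRLF) ; 1
```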

But surprisingly they are much slower than StringInStr even with all its 'excess' comparisons (since AutoIt is an interpreted language). The only way this could be made faster is if StringInStr were given another option/parameter that lets the user specify a range of starting characters to check against (or only the first one). This change can only be made internally, in the AutoIt C++ source. The new parameters can be made optional and placed at the end of the current syntax so as not to affect existing scripts.

The performance benefits of this would be tremendous. And the way I see it, this inclusion will not increase the size of the distributable since only 2 additional variables are incorporated and maybe an additional line of C++ code or two (for condition checking).

StringInStr ( "string", "substring" [, casesense [, occurrence]] )

Parameters

string      The string to evaluate.
substring   The substring to search for.
casesense   [optional] Flag to indicate if the operations should be case sensitive.
              0 = not case sensitive, using the user's locale (default)
              1 = case sensitive
              2 = not case sensitive, using a basic/faster comparison
occurrence  [optional] Which occurrence of the substring to find in the string. Use a negative occurrence to search from the right side. The default value is 1 (finds first occurrence).

This is the existing syntax. What I'm proposing is:

StringInStr ( "string", "substring" [, casesense [, occurrence[, start [,limit]]]] )

Parameters

string      The string to evaluate.
substring   The substring to search for.
casesense   [optional] Flag to indicate if the operations should be case sensitive.
              0 = not case sensitive, using the user's locale (default)
              1 = case sensitive
              2 = not case sensitive, using a basic/faster comparison
occurrence  [optional] Which occurrence of the substring to find in the string. Use a negative occurrence to search from the right side. The default value is 1 (finds first occurrence).
start       [optional] Which character in the main string to check from.
              1 (default)
limit       [optional] How many comparisons to perform from the starting character.
              0 (default) = as many as possible

If what I'm trying to convey is not clear enough, please do let me know...


Posted

Nice, and what you said was right:

StringInStr is faster than StringRegExp.


Posted (edited)

So let's have some test scripts and see if the regexp experts can show why you could still be wrong!?

And I don't mean me; I'll puzzle away, but some guys are good at this..

Randall


Posted

limit [optional] How many comparisons to perform from the starting character

0 (default) = as many as possible

Can you please explain a little about this param? What do we need it for? Please show an example of using these new params.


Posted

There is a limit to the number of a function's parameters, and the discussion about the string functions has been continuing since 2003.

Welcome to 2003


Posted

Can you please explain a little about this param? What do we need it for? Please show an example of using these new params.

Here goes:

StringInStr ( "string", "substring" [, casesense [, occurrence[, start [,limit]]]] )

By default, the current implementation of StringInStr has the following values for start and limit (as per AutoIt; subtract 1 from both for corresponding C++ values):

start = 1

limit = length of "string" - length of "substring" + 1

There will be 'limit' number of comparisons in this case.

EG: "check" against "ec" would go:

start = 1

limit = 5 - 2 + 1 = 4

for(i=start-1; i<limit; i++) //i takes all values of the starting character

- ch vs. ec

- he vs. ec

- ec vs. ec (+ve match)

- ck vs. ec

Let's look at non-default values of 'start' first. 'start' can take values between 1 and the 'limit'. AutoIt must replace invalid values with the default, 1.

That is, 1 <= start <= limit; let start = 2.

- he vs. ec

- ec vs. ec (+ve match)

- ck vs. ec

There was a saving of one comparison.

Now let's look at non-default values of 'limit'. 'limit' can take values between (and including) 'start' and the 'hard limit' of length of "string" - length of "substring" + 1. AutoIt must replace invalid values with the default, the 'hard limit'.

Let start = 2 and limit = 3:

- he vs. ec

- ec vs. ec (+ve match)

There was a saving of two comparisons.
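
To make the start/limit semantics concrete, here is a minimal sketch that emulates them in script code. `_StringInStrRange` is a hypothetical helper, not part of AutoIt. It assumes 'limit' means the last allowed starting position (as in this post) and treats 0 or any other invalid value as the 'hard limit'. Because it runs in the interpreter it will not deliver the speed-up being requested; it only shows what results the proposed parameters would give.

```autoit
; Hypothetical helper (not a built-in): emulates the proposed 'start' and
; 'limit' arguments by searching only a window of the original string.
Func _StringInStrRange($sString, $sSub, $iCase = 0, $iStart = 1, $iLimit = 0)
    Local $iSubLen = StringLen($sSub)
    ; 'Hard limit' from the post: last position where the substring still fits.
    Local $iHard = StringLen($sString) - $iSubLen + 1
    If $iSubLen = 0 Or $iHard < 1 Then Return 0
    ; Invalid values fall back to the defaults, as the post suggests.
    If $iStart < 1 Or $iStart > $iHard Then $iStart = 1
    If $iLimit < $iStart Or $iLimit > $iHard Then $iLimit = $iHard
    ; Characters $iStart .. $iLimit + $iSubLen - 1 cover every allowed start.
    Local $sWindow = StringMid($sString, $iStart, $iLimit - $iStart + $iSubLen)
    Local $iPos = StringInStr($sWindow, $sSub, $iCase)
    If $iPos = 0 Then Return 0
    Return $iPos + $iStart - 1 ; translate back to a position in the full string
EndFunc

ConsoleWrite(_StringInStrRange("check", "ec") & @CRLF)          ; 3 (defaults)
ConsoleWrite(_StringInStrRange("check", "ec", 0, 2, 3) & @CRLF) ; 3 (start = 2, limit = 3)
ConsoleWrite(_StringInStrRange("check", "ec", 0, 1, 2) & @CRLF) ; 0 (only "ch" and "he" are tried)
```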

Personally, I have scripts that do close to 5,000 StringInStr calls, sometimes with single-character substrings, on words with an average length of 6 (where my requirement corresponds to start = 0, limit = 0 in C++ terms). By a rough calculation, using the new syntax I can save 5 comparisons on each word. Over a total of 5,000 words, I can save 25,000 comparisons: the script will be doing 5,000 comparisons instead of 30,000. These are significant numbers.
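
The rough calculation can be reproduced with a few lines of arithmetic. This sketch assumes the worst case, where no match is found and every starting position is tried; a search that hits early would do fewer comparisons. The variable names are made up for illustration.

```autoit
Local $iWords = 5000   ; number of StringInStr calls, from the post
Local $iWordLen = 6    ; average word length, from the post
Local $iSubLen = 1     ; single-character substring

; Worst case: one comparison per possible starting position.
Local $iPerWordNow = $iWordLen - $iSubLen + 1
; With the proposal, only the first position is checked.
Local $iPerWordNew = 1

ConsoleWrite("Current:  " & ($iWords * $iPerWordNow) & " comparisons" & @CRLF)
ConsoleWrite("Proposed: " & ($iWords * $iPerWordNew) & " comparisons" & @CRLF)
ConsoleWrite("Saved:    " & ($iWords * ($iPerWordNow - $iPerWordNew)) & " comparisons" & @CRLF)
```

With these inputs the three lines report 30000, 5000 and 25000 comparisons respectively.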

Like I've said, these can be accomplished by StringRegExp too. But instead of improving performance, I've recorded an indisputable reduction in performance, irrespective of whether I've compiled the script in ANSI or Unicode (ANSI is faster, but only just).

There are likely to be slight inconsistencies in the calculations I've provided when compared to the first post. The calculations in this post are the accurate ones. The first post was only a quick and dirty feature request.


Posted

There is a limit to the number of a function's parameters, and the discussion about the string functions has been continuing since 2003.

Welcome to 2003

Wow.. I've never seen the old community before... I looked the page up and down once or twice but didn't see anything relating to this particular request... If you could quote the relevant text from that page here, I'd be grateful...


Posted

Hi,

But instead of improving performance, I've recorded an indisputable reduction in performance

let's have some test scripts

Could you post your test script please for me to compare [?different ways] the same thing you are comparing?

thanks, Randall


Posted

Hi,

Could you post your test script please for me to compare [?different ways] the same thing you are comparing?

thanks, Randall

You'll need to keep the following file (simple text) in the same directory as the code: http://koshyjohnuk.googlepages.com/try.txt

Run the following code:

;script by Koshy John
;Purpose: To compare the speed difference between StringInStr and StringRegExp manually
;To run: needs try.txt in the same folder

;Please note that the StringRegExp function query doesn't return the expected number of results.
;Kindly modify it to match the second 

#NoTrayIcon
Local $query = "mp" ;query must start with m as most of the words in the 
Local $word
Local $timetaken = TimerInit ()
Local $dfiles = 0

; EXAMPLE ONE
;STRING IN STRING (ORIGINAL METHOD) - Fastest (even with multiple comparisons) - usually produces the most results
$dfiles = 0
$qfile = FileOpen ("try.txt",0)
If $qfile = -1 Then Exit

$timetaken = TimerInit ()
While 1
    $word = FileReadLine ($qfile)
    If @error Then ExitLoop
    
    If StringInStr ($word, $query, 2) Then $dfiles = $dfiles + 1
WEnd
$timetaken = Round(TimerDiff($timetaken) / 1000,4)
FileClose($qfile)
MsgBox (0,"StringInStr",$timetaken &" secs, found: "& $dfiles)


; EXAMPLE TWO
;STRING IN STRING (STARTING CHARACTER SINGLE CHECK METHOD; bending over backwards)
;this has varied results, in this example, its marginally faster but I've also seen cases where its marginally slower...
;but more or less its the same speed. There's no major earth shattering performance differences that you would expect.

; THE AIM IS TO REPLICATE THE BEHAVIOR OF THE FOLLOWING CODE USING STRINGREGEXP ****************************************************

$dfiles = 0
Local $querylength = StringLen ($query)
$qfile = FileOpen ("try.txt",0)
If $qfile = -1 Then Exit

$timetaken = TimerInit ()
While 1
    $word = FileReadLine ($qfile)
    If @error Then ExitLoop
    
    If StringLeft ($word, $querylength) = $query Then $dfiles = $dfiles + 1
WEnd
$timetaken = Round(TimerDiff($timetaken) / 1000,4)
FileClose($qfile)
MsgBox (0,"StringInStr (single compare)",$timetaken &" secs, found: "& $dfiles)

; EXAMPLE THREE
;STRINGREGEXP -RANDALLC's code (try bettering it to match the results of EXAMPLE TWO in lesser time
$qfile = FileOpen ("m.txt",0)
If $qfile = -1 Then Exit

$timetaken = TimerInit ()
$dfiles = 0
While 1
    $word = FileReadLine ($qfile)
    If @error Then ExitLoop
    
    ;If StringRegExp($word, "(?s)(?i)(?m:^|[\s|,|\.|\?\:])" & $query) Then $dfiles = $dfiles + 1 ;SMOKE_N's method - I've no idea what this does
    If StringRegExp($word, '^(?i)'&$query&'.*$') Then $dfiles = $dfiles + 1 
WEnd
FileClose($qfile)
$timetaken = Round(TimerDiff($timetaken) / 1000,4)
MsgBox (0,"StringRegExp",$timetaken &"secs, matches: "& $dfiles)


;TO TEST FOR A BETTER SOLUTION, PLEASE MODIFY ONLY THE STRINGREGEXP LINE.


Posted

Here are my results for the above code and sample file:

StringInStr: 0.255 seconds, found 4650

StringInStr (single compare): 0.2624 seconds, found 4650

StringRegExp: 1.4713 seconds, found 3514 (the number of results returned is wrong too)

VERY VERY IMPORTANT: Please wait a few seconds and switch between a couple of windows before clicking OK on the MsgBox showing the results of each stage. This is to clear out the CPU's L2 cache (noticeable difference running on my Core 2 Duo 2 GHz with 4 MB L2 cache).
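
Another way to make such timings less sensitive to caching and disk access is to read the file once, outside the timed section, and average several runs. The following is only a sketch of that idea, not the script used for the numbers above. It assumes try.txt sits in the script's folder, and the repeat count of 5 is arbitrary.

```autoit
Local Const $REPEATS = 5 ; arbitrary number of runs to average (assumption)
Local $sQuery = "mp"

; Read the whole word list once so that file I/O is not part of the timing.
Local $sText = FileRead("try.txt")
If @error Then Exit 1
Local $aWords = StringSplit(StringStripCR($sText), @LF) ; $aWords[0] holds the count

Local $iLen = StringLen($sQuery), $fTotal = 0, $iFound = 0
For $r = 1 To $REPEATS
    $iFound = 0
    Local $hTimer = TimerInit()
    For $i = 1 To $aWords[0]
        ; Swap this line for the StringInStr or StringRegExp test to compare methods.
        If StringLeft($aWords[$i], $iLen) = $sQuery Then $iFound += 1
    Next
    $fTotal += TimerDiff($hTimer) ; milliseconds for this run
Next

ConsoleWrite("Average: " & Round($fTotal / $REPEATS / 1000, 4) & " secs, found: " & $iFound & @CRLF)
```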


Posted (edited)

I don't know if Jon is using his own InStr method or the standard one; if it's his own, I'd imagine it wouldn't be too difficult to implement, and it might even be a nice feature...

I've noticed the regexp has really been lagging lately myself... almost to a halt.

Anyway...

$timetaken = TimerInit ()
$qfile = FileRead("try.txt")
StringReplace($qfile, $query, "")
$dfiles = @extended
$timetaken = Round(TimerDiff($timetaken) / 1000,4)
MsgBox (0,"StringReplace",$timetaken &"secs, matches: "& $dfiles)
Do I win?


Posted

I don't know if Jon is using his own InStr method or the standard one; if it's his own, I'd imagine it wouldn't be too difficult to implement, and it might even be a nice feature...

I've noticed the regexp has really been lagging lately myself... almost to a halt.

Anyway...

$timetaken = TimerInit ()
$qfile = FileRead("try.txt")
StringReplace($qfile, $query, "")
$dfiles = @extended
$timetaken = Round(TimerDiff($timetaken) / 1000,4)
MsgBox (0,"StringReplace",$timetaken &"secs, matches: "& $dfiles)
Do I win?
Lol.. yeah.. you do win (by a very considerable margin)... but then that implementation is only useful for counting the number of matches... (this was just a demonstration script, it's not feasible in my application; and I'm not a huge fan of increasing the peak working memory of my scripts by loading huge files...)

But it does prove that there's a lot of scope for improvement in performance... What you've done exploits the core string replace code that's hard coded into AutoIt, giving it huge performance leaps...
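
One caveat with the StringReplace trick is that @extended counts every occurrence anywhere in the text, not the number of lines that begin with the query. The sketch below uses a made-up three-line sample (not try.txt) to show the two counts diverging. The anchored multi-line pattern is one possible way to count line starts in a single call.

```autoit
; Made-up sample data: "mp" occurs four times, but only two lines start with it.
Local $sText = "mpeg" & @LF & "lamp" & @LF & "mpmp" & @LF

; StringReplace reports the number of replacements in @extended.
StringReplace($sText, "mp", "")
Local $iOccurrences = @extended
ConsoleWrite("Occurrences anywhere:   " & $iOccurrences & @CRLF) ; 4

; (?m) makes ^ and $ match at line boundaries, (?i) ignores case;
; flag 3 returns an array of all matches.
Local $aLines = StringRegExp($sText, "(?mi)^mp.*$", 3)
ConsoleWrite("Lines starting with mp: " & UBound($aLines) & @CRLF) ; 2
```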

I'm just surprised that no one from the dev team is replying in this thread...


Posted (edited)

Is it like this?

Func _StringInStr($test1, $test2, $test3, $test4, $test5, $test6)
    $test11 = StringTrimLeft($test1, $test5)
    $test21 = StringTrimLeft($test2, $test5)
    $test12 = StringLeft($test11, $test6)
    $test22 = StringLeft($test21, $test6)
    Return StringInStr($test12, $test22, $test3, $test4)
EndFunc

Posted (edited)

Hey,

that only just wins;

I don't understand why your count should be wrong?...

1StringInStr ;0.1782 secs, found: 4650

22StringInStr (single compare);0.1623 secs, found: 4650

3StringRegExp ;0.2143 secs, found: 4650

4StringRegExp one;0.0146 secs, found: 4650

5StringRep one;0.0062 secs, found: 4650

; StringRegExpKoshyJohn.au3
;script by Koshy John
;Purpose: To compare the speed difference between StringInStr and StringRegExp manually
;To run: needs try.txt in the same folder

;Please note that the StringRegExp function query doesn't return the expected number of results.
;Kindly modify it to match the second

#NoTrayIcon
Local $query = "mp" ;query must start with m as most of the words in the
Local $word
Local $timetaken = TimerInit()
Local $dfiles = 0

; EXAMPLE ONE
;STRING IN STRING (ORIGINAL METHOD) - Fastest (even with multiple comparisons) - usually produces the most results
$dfiles = 0
$qfile = FileOpen("try.txt", 0)
If $qfile = -1 Then Exit

$timetaken = TimerInit()
While 1
    $word = FileReadLine($qfile)
    If @error Then ExitLoop

    If StringInStr($word, $query, 2) Then $dfiles = $dfiles + 1
WEnd
$timetaken = Round(TimerDiff($timetaken) / 1000, 4)
FileClose($qfile)
;~ MsgBox(0, "1StringInStr", $timetaken & " secs, found: " & $dfiles)
ConsoleWrite("1StringInStr ;"& $timetaken & " secs, found: " & $dfiles&@LF)


; EXAMPLE TWO
;STRING IN STRING (STARTING CHARACTER SINGLE CHECK METHOD; bending over backwards)
;this has varied results, in this example, its marginally faster but I've also seen cases where its marginally slower...
;but more or less its the same speed. There's no major earth shattering performance differences that you would expect.

; THE AIM IS TO REPLICATE THE BEHAVIOR OF THE FOLLOWING CODE USING STRINGREGEXP ****************************************************

$dfiles = 0
Local $querylength = StringLen($query)
$qfile = FileOpen("try.txt", 0)
If $qfile = -1 Then Exit

$timetaken = TimerInit()
While 1
    $word = FileReadLine($qfile)
    If @error Then ExitLoop

    If StringLeft($word, $querylength) = $query Then $dfiles = $dfiles + 1
WEnd
$timetaken = Round(TimerDiff($timetaken) / 1000, 4)
FileClose($qfile)
;~ MsgBox(0, "22StringInStr (single compare)", $timetaken & " secs, found: " & $dfiles)
ConsoleWrite("22StringInStr (single compare);"& $timetaken & " secs, found: " & $dfiles&@LF)

; EXAMPLE THREE
;STRINGREGEXP -RANDALLC's code (try bettering it to match the results of EXAMPLE TWO in lesser time
$qfile = FileOpen("try.txt", 0)
If $qfile = -1 Then Exit

$timetaken = TimerInit()
$dfiles = 0
While 1
    $word = FileReadLine($qfile)
    If @error Then ExitLoop

    ;If StringRegExp($word, "(?s)(?i)(?m:^|[\s|,|\.|\?\:])" & $query) Then $dfiles = $dfiles + 1 ;SMOKE_N's method - I've no idea what this does
    If StringRegExp($word, '^(?i)' & $query & '.*$') Then $dfiles = $dfiles + 1
WEnd
FileClose($qfile)
$timetaken = Round(TimerDiff($timetaken) / 1000, 4)
;~ MsgBox(0, "3StringRegExp", $timetaken & "secs, matches: " & $dfiles)
ConsoleWrite("3StringRegExp ;"& $timetaken & " secs, found: " & $dfiles&@LF)

; EXAMPLE 4
;STRINGREGEXP -RANDALLC's code (try bettering it to match the results of EXAMPLE TWO in lesser time
$qfile = FileOpen("try.txt", 0)
If $qfile = -1 Then Exit

$timetaken = TimerInit()
$dfiles = 0
$word=FileRead($qfile)
;~ While 1
;~  $word = FileReadLine($qfile)
;~  If @error Then ExitLoop
;~  ;If StringRegExp($word, "(?s)(?i)(?m:^|[\s|,|\.|\?\:])" & $query) Then $dfiles = $dfiles + 1 ;SMOKE_N's method - I've no idea what this does
;~ WEnd
FileClose($qfile)
local   $ar_files= StringRegExp($word, '(?m)(^(?i)' & $query & '.*$)',3), $dfiles=UBound($ar_files); Then $dfiles = $dfiles + 1
$timetaken = Round(TimerDiff($timetaken) / 1000, 4)
;~ MsgBox(0, "4StringRegExp", $timetaken & "secs, matches: " & $dfiles)
ConsoleWrite("4StringRegExp one;"& $timetaken & " secs, found: " & $dfiles&@LF)

$timetaken = TimerInit ()
$qfile = FileRead("try.txt")
StringReplace($qfile, $query, "")
$dfiles = @extended
$timetaken = Round(TimerDiff($timetaken) / 1000,4)
;~ MsgBox (0,"StringReplace",$timetaken &"secs, matches: "& $dfiles)
ConsoleWrite("5StringRep one;"& $timetaken & " secs, found: " & $dfiles&@LF)
;TO TEST FOR A BETTER SOLUTION, PLEASE MODIFY ONLY THE STRINGREGEXP LINE.
Randall

[EDIT; PS your script used the wrong file for the third test!]


Posted (edited)

Hey,

that only just wins;

I don't understand why your count should be wrong?...

I double-checked and I still get the same value with the code I've put up... As for your new code, I don't see why you like ConsoleWrite better... The console writes will prevent you from making a delay between tests, won't they? Just asking...


Posted

Anyway... Now that there is conclusive proof that StringRegExp is not up to the task... And there are people in favor of the new syntax... Nobody against (so far)...

So will anyone from the dev team at least tell us if this is feasible?


Posted

I double-checked and I still get the same value with the code I've put up... As for your new code, I don't see why you like ConsoleWrite better... The console writes will prevent you from making a delay between tests, won't they? Just asking...

Yes to the timing and consolewrite;

but I checked on my machine and the order made no difference to the timing, so it should have been OK?..

What is going on with the incorrect results, though?; it may explain the discrepancy in the timing too..

Randall


Posted

StringInStr ( "string", "substring" [, casesense [, occurrence[, start [,limit]]]] )

New optional parameters: start, limit

It seems to be a good extension with 100% backward compatibility.

I like your idea.

Only, better param names would be: start, end


Posted

Yes to the timing and consolewrite;

but i checked on my machine and it made no difference to timing which order, so should have been OK?..

What is going on with the incorrect results, though?; it may explain the discrepancy in the timing too..

Randall

I'm trying to figure that out...

New optional parameters: start, limit

It seems to be a good extension with 100% backward compatibility.

I like your idea.

Only, better param names would be: start, end

Thanks for the vote of confidence! I guess 'end' would be a more intuitive name for that parameter... I just put up the syntax names for indicative purposes... Jon usually decides what to call it in the end, so I didn't bother..


Posted

It's a decent suggestion, I'll add it to my list.
