Jump to content

array search optimization


frank10
 Share

Recommended Posts

I have an unordered list of more than 30000 ID in a file.txt (separated by "|"), such as:

msw-3k-031114-a029_c026_1110gt

A009_C009_062709

jkh-4k-novsowia-028614-a094_c071_09206t

9170-22

UT_BCNP_Arils_25

WHtD_Fo_2592

102510-17

I need to check if a new ID is inside this list and if not there, update it.

I tried loading the file in a variable and check with StringInStr() but it's  slow (as I must check them at blocks of 80, so at 0.4s it results 32s waiting time...), so I would improve the search.

StringInStr($myListFiles, "04-060112-17" )
;0.4 s

I thought subdividing the list in several arrays checking the first letter of each ID, so I will have 9 array for numeric ID and 21 array for chars. This StringInStr() should be faster.

But it depends of how many ID starts with let's say "A": if they are the majority, the speed won't be a lot faster (their names aren't equal subdivided).

I tried also using  Scripting.Dictionary but it's a lot slower than StringInStr() (??)

$oDict.Item("04-060112-17")
;7s (??)

Other suggestions?

Edited by frank10
Link to comment
Share on other sites

Well,  it's in my full list "file.txt".

Here I posted only some example IDs to understand how they are made.

In the test speed I searched one ID in the last positions to compare speed searches in the worst cases.

My 0,4s refers of a search of more than 30000 IDs, not on only the 7 I posted.

Link to comment
Share on other sites

Instead of an array you might try it like this...

local $st = timerinit()
local $fileid = stringregexp(fileread(@scriptdir & '\test.txt'),'UT_BCNP_Arils_25',3)[0]
ConsoleWrite($fileid & ' found in ' & timerdiff($st)/1000 & ' seconds.' & @lf)

kylomas

Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Link to comment
Share on other sites

I don't know if this is of any use to you. I wrote it a while ago after some inspiration from kylomas.

I think there may be better functions doing something similar.

Edit

You need to have at least two arrays (read from file I imagine) to pass to the function. It will test both arrays for duplicates and concatenate the unique entries - so it's only useful if you have a lot of new items to add in one batch process.

There is an example of use in the first post in that thread (I should write a simpler example sometime). Because you only have approximately 30,000 IDs to test, it should be reasonably fast. Any problems getting it to work, then post the code you tried and I'll see if I can help.

Edited by czardas
Link to comment
Share on other sites

Instead of an array you might try it like this...

local $st = timerinit()
local $fileid = stringregexp(fileread(@scriptdir & '\test.txt'),'UT_BCNP_Arils_25',3)[0]
ConsoleWrite($fileid & ' found in ' & timerdiff($st)/1000 & ' seconds.' & @lf)

kylomas

 

Thank you a lot kylomas!

That stringregexp was really an improvement! I didn't suspect it was that faster vs StringInStr().

I previously used:

$timer = TimerInit()
$result = StringInStr($myListFiles, "04-060112-17")
ConsoleWrite("StringSearch sec:" & TimerDiff($timer)/1000 &  " result:" & $result & @CRLF)

The stringregexp() makes it in 0,004s vs the 0,4s of StringInStr() !!

That is a huge improvement of 100x !! Quite unbelieavable...

Link to comment
Share on other sites

There is something strange.

1)

i.e. if I search:

"A011_C018_0804LF" I get regExp: 0.0029s vs 3.2s ( ! )

but if I search:

"sloquid-082212-05"  I get regExp: 0.003s vs 0.037s

What's so different in the first string with StringInStr()?

2)

And another test reducing the size of file to only 7 IDs:

if I search:

"A011_C018_0804LF"  I get regExp: 5.377s ( ! ) vs 1.31s

and

"sloquid-082212-05"  I get   regExp: 6.11s ( ! ) vs 1.76s

So with very short variables regExp is worse !?

 

PS:

I had to change your line, because if I search for an ID not in the list I get error:

local $aFileid = stringregexp($myListFiles,"sid-082212-05s" ,3) 
if IsArray($aFileid) Then msgBox(0,''","ID found")
Edited by frank10
Link to comment
Share on other sites

Regexp has various characters which are actually commands to the regular expression engine. These cannot be searched directly without an escape backslash being placed before the character. For example the string "sid-082212-05s" would need to be changed to "sid-082212-05s".

I assume the odd runtime discrepancies will be associated with the complexity of parsing Regular Expressions. This is possibly the reason RegExp seems slower on your test strings. The code should give the wrong results in this instance. Also consider that the the expression (as it is written) will allow substring matches - which is probably not what you want.

Edited by czardas
Link to comment
Share on other sites

Regexp has various characters which are actually commands to the regular expression engine. These cannot be searched directly without an escape backslash being placed before the character. For example the string "sid-082212-05s" would need to be changed to "sid-082212-05s".

I don't think that's the problem, because using the "A011_C018_0804LF" string that has no special RegExp char, inside a short variable, I get very slow speed, whilst in a huge var I get very fast speed.

More, If I put "" before "-" I get no improved speed, but also if I don't put "" it gets the correct result...

Edited by frank10
Link to comment
Share on other sites

If you get the correct results without escaping the minus sign, then it's plain luck. Having said that I don't exactly understand how that works. The regular expression engine may be doing various internal tests before searching - making it apparently slower on small strings. This is a guess. It's also a case sensitive search ATM.

Edit

Got it! - the minus sign should be escaped if used inside a set.

Edit 2

Without a full description of all the possible ID variants (or rules to generate them) it is never possible to give a regular expression solution. Here the only rule seems to be that there is a delimiter string "|".

Edited by czardas
Link to comment
Share on other sites

I found out what it was happening!

Sometimes the search was so fast it was less than 0.00001, so

Consolewrite("StringRegExp search sec:" & Timerdiff($st)/1000 & "seconds.")

printed this:

StringRegExp search sec:7.92299668260434e-005 seconds.

I read only the first numbers.... but there was also the e-005 at the end!

So, ok, now it makes sense: with shorter var it is lightning fast!

EDIT:

The same was for StringInStr(): that string was at the first positions of the var, so it was a lot fast also!

If the matching string is near the beginning of the var, StringInStr is even slightly faster than RegExp.

How could I transform the search string to make it safe with all RegExp special chars?

Edited by frank10
Link to comment
Share on other sites

How could I transform the search string to make it safe with all RegExp special chars?

 

Code needs testing for bugs (it seems to be working though).

Local $sDelimString = "msw-3k-031114-a029_c026_1110gt|A009_C009_062709|jkh-4k-novsowia-028614-a094_c071_09206t|9170-22|UT_BCNP_Arils_25|WHtD_Fo_2592|102510-17"

Local $sSearchStr = "jkh-4k-novsowia-028614-a094_c071_09206t"

Local $iMatch = StringRegExp("|" & $sDelimString & "|", "\|" & StringRegExpReplace($sSearchStr, "[\W]", "\\$0") & "\|")

ConsoleWrite($iMatch & @LF) ; 1 = match found

:

This basically tests for anything between the delimiters "|" by including them in the search. This prevents substring matches.

If you want case insensitive matches also then you can try this:

Local $iMatch = StringRegExp("|" & $sDelimString & "|", "(?i)\|" & StringRegExpReplace($sSearchStr, "[\W]", "\\$0") & "\|")
Edited by czardas
Link to comment
Share on other sites

frank10,

EDIT:

The same was for StringInStr(): that string was at the first positions of the var, so it was a lot fast also!

If the matching string is near the beginning of the var, StringInStr is even slightly faster than RegExp.

 

This was bothering me when I got pulled away.  If you are doing a char by char match stringinstr should be just as fast as SRE (I think...).

Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...