flyingboz Posted March 4, 2004 (edited)

Problem: The code posted below degrades dramatically in performance the longer the loop continues. Values are averages of multiple runs, and performance continues to worsen as the number of records increases:

Records 1 - 1000: 4.7 seconds
Records 1001 - 2000: 17.1 seconds
Records 2001 - 3000: 28.6 seconds

History: I had posted a theoretical question in the bugs area (thanks to all who responded), but I consider this issue likely to be me rather than the tool. I have also used FileWriteLine and FileReadLine instead of FileWrite and FileRead (to try to ensure that arrays/memory usage were not growing without bound), but similar performance was noted.

Code Overview: This code is designed (and debugged, and works) to store a big file in a variable $filecontents, read its @CRLF-delimited lines into an array $lines, then break the lines out into records using @TAB as a delimiter, do "stuff useful to me" with certain fields, and then write the output.
$statusfile and the If block dealing with $thousands were inserted only for obtaining performance information.

    $inputfilename = FileOpenDialog("Open FILE", "c:\", "(*.*)", 1)
    $outputfilename = $inputfilename & ".out"
    $statusfilename = $outputfilename & ".status"
    $hInputFile = FileOpen($inputfilename, 0)
    $inputfilesize = FileGetSize($inputfilename)
    $filecontents = FileRead($hInputFile, $inputfilesize)
    FileClose($hInputFile)
    $statusfile = FileOpen($statusfilename, 1)
    $lines = StringSplit($filecontents, @LF)
    Dim $time[Int($lines[0] / 1000) + 1] ; +1 keeps index $thousands in bounds
    ;ProgressOn("Progress", "Parsing Lines")
    $thousands = 0
    $begin = TimerStart()
    For $i = 1 To $lines[0]
        $record = StringSplit($lines[$i], @TAB)
        If $record[0] > 37 Then
            $ichg_dos = $record[1]
            $ichg_dtrpt = $record[2]
            $ichg_chgcode = $record[3]
            $ichg_patname = $record[13]
            $ichg_units = $record[18]
            $ichg_acctnum = $record[35]
            $ichg_ichg = $record[37]
            $lineout = $ichg_ichg & "," & $ichg_acctnum & "," & $ichg_dos & "," & $ichg_chgcode & "," & $ichg_patname & "," & $ichg_units
            $fileout = $fileout & $lineout & @LF
        EndIf
        If Int($i / 1000) = ($i / 1000) Then
            $thousands = $thousands + 1
            $time[$thousands] = TimerStop($begin)
            $begin = TimerStart()
            $statuslineout = $thousands & "," & $time[$thousands] & @LF
            FileWriteLine($statusfile, $statuslineout)
            TrayTip($i, $time[$thousands], 3)
            ;ProgressOff()
            ;MsgBox(4096, $thousands, $statuslineout, 3)
        EndIf
    Next
    $outputfile = FileOpen($outputfilename, 2)
    FileWrite($outputfile, $fileout)
    FileClose($outputfile)
    FileClose($statusfile)

Anyone have any ideas?

Edited March 4, 2004 by flyingboz

Reading the help file before you post... Not only will it make you look smarter, it will make you smarter.
CyberSlug Posted March 4, 2004

I'd expect StringSplit to have O(n) time complexity; in other words, StringSplit should take longer to execute on longer strings.

    $record = StringSplit($lines[$i], @TAB)

How long does this statement take to execute each time? Maybe it's the main problem?

Use Mozilla | Take a look at My Disorganized AutoIt stuff | Very very old: AutoBuilder 11 Jan 2005 prototype | I need to update my sig!
flyingboz (Author) Posted March 4, 2004

The strings are all the same length; in this case, each element of the array $lines is approximately 2000 characters.
Jos (Developers) Posted March 4, 2004

Could it be because $fileout gets bigger every time through the loop? Would it help if you wrote each line out as soon as it is formatted?

    $inputfilename = FileOpenDialog("Open FILE", "c:\", "(*.*)", 1)
    $outputfilename = $inputfilename & ".out"
    $statusfilename = $outputfilename & ".status"
    $hInputFile = FileOpen($inputfilename, 0)
    $inputfilesize = FileGetSize($inputfilename)
    $filecontents = FileRead($hInputFile, $inputfilesize)
    FileClose($hInputFile)
    $statusfile = FileOpen($statusfilename, 1)
    $lines = StringSplit($filecontents, @LF)
    Dim $time[Int($lines[0] / 1000) + 1] ; +1 keeps index $thousands in bounds
    ;ProgressOn("Progress", "Parsing Lines")
    $thousands = 0
    $begin = TimerStart()
    $outputfile = FileOpen($outputfilename, 2)
    For $i = 1 To $lines[0]
        $record = StringSplit($lines[$i], @TAB)
        If $record[0] > 37 Then
            $ichg_dos = $record[1]
            $ichg_dtrpt = $record[2]
            $ichg_chgcode = $record[3]
            $ichg_patname = $record[13]
            $ichg_units = $record[18]
            $ichg_acctnum = $record[35]
            $ichg_ichg = $record[37]
            $lineout = $ichg_ichg & "," & $ichg_acctnum & "," & $ichg_dos & "," & $ichg_chgcode & "," & $ichg_patname & "," & $ichg_units
            FileWriteLine($outputfile, $lineout)
        EndIf
        If Int($i / 1000) = ($i / 1000) Then
            $thousands = $thousands + 1
            $time[$thousands] = TimerStop($begin)
            $begin = TimerStart()
            $statuslineout = $thousands & "," & $time[$thousands] & @LF
            FileWriteLine($statusfile, $statuslineout)
            TrayTip($i, $time[$thousands], 3)
            ;ProgressOff()
            ;msgbox(4096, $thousands, $statuslineout, 3)
        EndIf
    Next
    FileClose($outputfile)
    FileClose($statusfile)

SciTE4AutoIt3 Full installer Download page - Beta files | Read before posting | How to post script source | Forum etiquette | Forum Rules | Live for the present, Dream of the future, Learn from the past.
flyingboz (Author) Posted March 4, 2004 (edited)

My timing is performed outside of the FileWrite command (except for the $statusfile being used to log performance data), so the FileWrite of a huge variable is irrelevant. However, building the large $fileout variable appears to be the culprit: I switched to using FileWriteLine within the loop, and my times through the For loop are now down to a consistent 700 msec per 1000 records iterated. Thanks to all who read and offered comments.

For extra credit: I don't know why growing that variable would affect performance so much. Does it make sense that appending bytes to an existing variable would take MORE time than writing to a file handle? The machine in question never showed memory utilization greater than 200 MB on a 1 GB, 2 GHz PC, no paging going on, etc.

Just so we know: at completion with my test data, the length of the variable would have been 112 chars per line (including @LF) * 15635 lines = 1,751,120 one-byte chars. That seems like a huge performance hit for less than 2 MB of memory, particularly when compared with the fact that I'm now performing 15635 FileWriteLine operations instead of 1 FileWrite.

Edited March 4, 2004 by flyingboz
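The "extra credit" question has a general answer: appending to a string that is reallocated on every append costs time proportional to everything written so far, whereas a file handle only appends at its current end. A minimal sketch in Python (chosen only because it is easy to run; the 112-byte lines mirror this thread, and the always-reallocate behavior being modelled is the thread's diagnosis, not a claim about Python itself):

```python
import time

LINE = "x" * 111  # plus "\n" gives 112 chars per record, as in the thread

def build_by_concat(n):
    # Each pass may re-copy the whole buffer built so far: O(n^2) bytes total.
    out = ""
    for _ in range(n):
        out = out + LINE + "\n"
    return out

def build_by_parts(n):
    # Collect pieces and copy each byte roughly once: O(n) bytes total.
    parts = []
    for _ in range(n):
        parts.append(LINE + "\n")
    return "".join(parts)

n = 15635
t0 = time.perf_counter(); a = build_by_concat(n); t1 = time.perf_counter()
b = build_by_parts(n); t2 = time.perf_counter()
assert a == b and len(a) == n * 112
print(f"concat {t1 - t0:.3f}s, join {t2 - t1:.3f}s")
```

(CPython happens to optimise the concat case in place when nothing else references the string, so the timing gap there is smaller than in an interpreter that always reallocates.)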
GrahamS Posted March 4, 2004

One of the problems could be the line

    $fileout = $fileout & $lineout & @LF

My understanding of the way strings work is that if there isn't enough room in the destination buffer for the new string, the implementation deletes the old storage, allocates enough new storage for the new string, and then copies the existing string across.

So let us assume that each line adds exactly 112 bytes and that there are 15635 lines. Initially $fileout is empty (allocated size is 0). The first time round the loop, $fileout is too small to hold 112 bytes, so we allocate 112 bytes of memory and copy 112 bytes. Next time round the loop we want to put 224 bytes into $fileout; it is too small, so we delete its current memory, allocate 224 bytes, and copy 224 bytes. Next loop, we delete 224 bytes of memory, allocate 336 bytes, and copy 336 bytes. So after 15635 loops we have:

1. Performed 15634 memory deletions (total memory deleted is 112 * (1 + 2 + ... + 15634) = 13,688,505,040 bytes, i.e. roughly 12.7 GB de-allocated)
2. Performed 15635 memory allocations (total memory allocated is 112 * (1 + 2 + ... + 15635) = 13,690,256,160 bytes, again roughly 12.7 GB)
3. Copied 112 + 2*112 + 3*112 + ... + 15635*112 bytes, i.e. 13,690,256,160 bytes, i.e. roughly 12.7 GB.

Technical aside: 1 + 2 + ... + (n-1) + n = n*(n+1)/2.

Now, memory allocations and deletions, especially of large blocks, can be slow (especially since we're doing a lot of allocations interspersed with de-allocations, which can fragment the heap, a well-known performance bottleneck). Copying 12+ GB of data is also not a good performance enhancer.

Aside to developers (especially Jon): this analysis is based on the assumption that the line $a = $a & $b uses the AString::assign function.

There are two technical improvements that could be made:

1. If we could preallocate the storage for a string, we could get rid of the memory allocations/de-allocations.
2. When AString::assign needs to grow memory, it is more efficient to grow the buffer by a fixed factor.
There is research (which I could probably track down if necessary) stating that the fixed factor should be 1.5. The downside is some wasted memory (i.e. memory that is allocated but doesn't contain any data).

This could explain why performing a file write of each line improves performance. File writes maintain a write position, which means they don't need to re-copy what is already written to the file (not quite true in the presence of file caches, but then the actual writing to disk will use DMA or similar techniques, which will not tie up the CPU).

I welcome any comments from developers as to why this analysis is flawed.

P.S. I've just had another thought. $a = $a & $b probably creates a temporary variable of size sizeof($a) + sizeof($b), copies $a and $b into this temporary and then, after growing the memory, copies the temporary into the new memory, finally de-allocating the temporary. This would double the memory allocations, the de-allocations, and the size of the data copied (24+ GB copied - aaagggghhhhh!)

GrahamS
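The totals in that analysis can be checked mechanically. The sketch below (Python, purely illustrative; the grow-to-exact-size policy is the one described above, and 15635 lines of 112 bytes are this thread's figures) counts bytes copied under that policy versus a grow-by-1.5 policy:

```python
N, LINE_LEN = 15635, 112  # lines and bytes per line, from the thread

def copied_exact(n, line_len):
    # Grow the buffer to exactly the needed size on every append:
    # append i re-copies all i*line_len bytes held so far.
    return sum(i * line_len for i in range(1, n + 1))

def copied_geometric(n, line_len, factor=1.5):
    # Grow capacity by a fixed factor; existing bytes are re-copied
    # only when a reallocation actually happens.
    cap = size = copied = 0
    for _ in range(n):
        need = size + line_len
        if need > cap:
            copied += size                     # move old contents to new block
            cap = max(need, int(cap * factor) + 1)
        size = need
    return copied + n * line_len               # plus writing each byte once

print(copied_exact(N, LINE_LEN))      # 13690256160 bytes, i.e. ~12.7 GB
print(copied_geometric(N, LINE_LEN))  # a few MB: linear, not quadratic
```

The quadratic policy copies about 12.7 GB to build a 1.7 MB string; geometric growth copies only a small constant multiple of the final size.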
flyingboz (Author) Posted March 4, 2004 (edited)

Graham, thanks so much. Even if you're off in some detail as to exactly which method is being used, what you've stated matches the results well. I was seeing memory usage going up and down, but didn't put the pieces together. I knew there was some reason those computer science guys didn't all want to be Electrical Engineers like me.

This also brings up the question: how would I dimension a string variable beforehand (in those cases where it was possible to calculate), or assign a "huge enough to never be worried about it" size to the variable, and then write to it? Would it just be the following (and if so, thank god and greyhound for line continuation)?

    Dim $sVar = "xxxxxxxxxxxxxx.........to insane number of x?"

If I understand your analysis correctly, the issue I ran into here is not the size of the variable, but having to constantly "redim" it. Hmm... the docs appear to be mute on how to specify the amount of memory to be allocated (dimensioned) by a non-array variable, though the number of elements in an array is covered pretty nicely.

Edited March 4, 2004 by flyingboz
Valik Posted March 4, 2004

Funny, I "instinctively" use the 1.5 grow thing whenever I do reallocation, although I'd never heard that it's a "good" number to use. To me, it's worth wasting a little space to avoid having to realloc every time you want to add something.
GrahamS Posted March 4, 2004 Share Posted March 4, 2004 This also brings up the question: How would I dimension a string variable beforehand (in those cases where it was possible to calculate) or to assign a "huge enough to never be worried about it" size to the variable, and then write to it? Don't think that it can be done just now. However, the proposed redim command (which is intended to redimension arrays), could also reallocate the size of string variables. Would it just be - (and if so , thank god and greyhound for line continuation)? Dim $sVar = "xxxxxxxxxxxxxx.........to insane number of x?"That would work, but I would hate to type 2 million x's, even with cut and paste If I understand your analysis correctly, the issue I ran into here is not the size of the variable, but having to constantly "redim" it.That would be a big component, but don't forget the 2GB of byte copying. Heh, just had an amazing thought. Add an optimisation phase to AutoIt It should be possible to recognise constructs of the type $a = $a & $b and, providing that there is enough room in $a, just copy $b into $a at the correct place, eliding the temporary. Wow GrahamS Link to comment Share on other sites More sharing options...
GrahamS Posted March 4, 2004 Share Posted March 4, 2004 Funny, I "instinctively" use the 1.5 grow thing whenever I do reallocation, although I've never heard that's a "good" number to use. To me, its worth it to waste just a little space than have to constantly realloc every time you want to add something.OK, a bit of googling for the 1.5 recommendation throws up:Discussion on comp.lang.c++.moderated - which appears to conclude that the correct number is actually the golden ratio, i.e.about 1·61803and perhaps more importantly the exact article that I remember which is Herb Sutter's More Effective C++ (Item 13). He quotes Andrew Koenig's September 1998 column in the Journal of Object-Oriented Programming as containing the analysis. This item is available online at guru of the week (gotw), which contains a good discussion on the correct growth stratgey. By the way, gotw is an excellent resource in generalAccording to a message on boost, the Koenig article is not available on line GrahamS Link to comment Share on other sites More sharing options...
flyingboz (Author) Posted March 5, 2004 (edited)

"That would work, but I would hate to type 2 million x's, even with cut and paste"

You don't have to (tongue firmly in cheek):

    DimString("$var", 10000)

    Func DimString($StringVariable, $intReallyBig)
        Opt("SendKeyDelay", 1) ; with thanks to Valik :)
        $lines_needed = 1 + Int($intReallyBig / 4000) ; AutoIt has a line max of 4096
        Send("{ENTER 2}")
        Send("Dim ")
        Send($StringVariable)
        Send('="')
        For $i = 1 To $lines_needed
            Send("{x 4000}")
            If $i <> $lines_needed Then
                Send(" _ {ENTER}")
            EndIf
        Next
        Send("{ENTER 2}")
        Return
    EndFunc

Running for cover!!!

Edited March 5, 2004 by flyingboz
Valik Posted March 5, 2004

I would STRONGLY suggest using Opt("SendKeyDelay", 1) with that...
trids Posted March 5, 2004

"This also brings up the question: How would I dimension a string variable beforehand (in those cases where it was possible to calculate) or assign a 'huge enough to never be worried about it' size to the variable, and then write to it?"

Instead of a special Dim statement, wouldn't something like this work?

    $sVar = StringRepeat(" ", 1000)

Of course, it means a new intrinsic StringRepeat function, which also has other benefits, by the way (padding and formatting; populating arrays in conjunction with StringSplit; etc.)
Administrators Jon Posted March 5, 2004 Administrators Share Posted March 5, 2004 D'oh. The strings used to work in a similar way but it got removed during some bug hunt. i.e. if they needed to grow then instead of just adding a few bytes they doubled the amount of space. I'll add that code back in tonight for both the Variant strings and the AString strings. So 1.6 is the magic number eh? I think I used to deallocate the memory if less than half of the string memory was used too (so that variables that were allocated a massive amount of memory didn't stay massive). Deployment Blog: https://www.autoitconsulting.com/site/blog/ SCCM SDK Programming: https://www.autoitconsulting.com/site/sccm-sdk/ Link to comment Share on other sites More sharing options...
Administrators Jon Posted March 5, 2004 Administrators Share Posted March 5, 2004 Heh, just had an amazing thought. Add an optimisation phase to AutoIt That's on my list. The lexer could be speeded up _loads_ at the expensive of a lot of memory so I'm trying to think of a nice balance. Deployment Blog: https://www.autoitconsulting.com/site/blog/ SCCM SDK Programming: https://www.autoitconsulting.com/site/sccm-sdk/ Link to comment Share on other sites More sharing options...
GrahamS Posted March 5, 2004

One addition that would help in this case and be relatively easy to implement is a string append operator:

    $a &= $b

This could call an append function in the string classes. No temporaries required.
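An append operator like that maps onto any growable buffer with amortised O(1) append. A Python bytearray behaves this way (it over-allocates its capacity internally), so it can stand in as a sketch of what a capacity-backed &= buys over rebuilding the string each time:

```python
N, LINE_LEN = 15635, 112
LINE = b"x" * (LINE_LEN - 1) + b"\n"

buf = bytearray()
for _ in range(N):
    buf += LINE   # like $a &= $b: appends in place, growing capacity geometrically

assert len(buf) == N * LINE_LEN
```

No temporary the size of the whole buffer is created per append, and re-copies happen only on the occasional capacity growth, so total work stays linear in the output size.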