rudi

How to clean up a larger array *fast*?

18 posts in this topic

Hello,

I have an array of files to be processed (only certain files get sorted into specific folders): ~20k filenames, which I read into an array using <_FileListToArrayFaster1e.au3>.

Before starting the sort process I need to remove all entries from the array that do not need to be sorted.

#include <_FileListToArrayFaster1e.au3>
#include <array.au3>

Const $sPath="C:\DropSourceFilesHere"

$MyArray = _FileListToArray3($sPath, "*", 1, 1, 1, "", 0) ; read file names only, recursively:
;       $sFilter = "*", $iFlag = 1 (FilesOnly), $iRecurse = 1, $iBaseDir = 1, $sExclude = "", $i_deleteduplicate = 1

ConsoleWrite($MyArray[0] & " Filenames have to be processed" & @LF)

CleanupArray($MyArray)

Func CleanupArray(ByRef $MyArray)
    Local $Clean

    For $Clean = $MyArray[0] To 1 Step -1 ; going upwards would make it necessary to check the same value of $Clean again after a delete.
        If Not CheckFile($MyArray[$Clean]) Then
            _ArrayDelete($MyArray, $Clean) ; this file is not going to be sorted later -> take it out of $MyArray
        EndIf
    Next
    $MyArray[0] = UBound($MyArray) - 1 ; set the element count correctly again after the cleanup.
EndFunc   ;==>CleanupArray


Func CheckFile($FullPath)
    Local $OK = True
    If stringinstring($FullPath) Then $OK = False
    ; ... several other testing done here.
    Return $OK
EndFunc   ;==>CheckFile

Maybe creating a separate array and copying over valid values would be the better approach?

Regards, Rudi.


Earth is flat, pigs can fly, and Nuclear Power is SAFE!




#2 ·  Posted (edited)

Maybe creating a separate array and copying over valid values would be the better approach?

Correct.

Untested String Method:

Func CleanupArray(ByRef $MyArray)
    Local $Clean, $sTemp = ""

    For $Clean = 1 To $MyArray[0] Step 1
        If CheckFile($MyArray[$Clean]) Then
            $sTemp &= $MyArray[$Clean] & "|"
        EndIf
    Next
    
    $MyArray = StringSplit(StringTrimRight($sTemp,1),"|") ; StringTrimRight removes the last pipe delimiter
EndFunc   ;==>CleanupArray

Untested Array Method:

Func CleanupArray(ByRef $MyArray)
    Local $Clean, $asTemp[1]

    For $Clean = 1 To $MyArray[0] Step 1
        If CheckFile($MyArray[$Clean]) Then
            Redim $asTemp[UBound($asTemp)+1]
            $asTemp[UBound($asTemp)-1] = $MyArray[$Clean]
        EndIf
    Next
    
    $asTemp[0] = UBound($asTemp)-1
    $MyArray = $asTemp
EndFunc   ;==>CleanupArray

The array method will probably be faster.

- The Kandie Man ;-)

Edited by The Kandie Man

"So man has sown the wind and reaped the world. Perhaps in the next few hours there will no remembrance of the past and no hope for the future that might have been." & _"All the works of man will be consumed in the great fire after which he was created." & _"And if there is a future for man, insensitive as he is, proud and defiant in his pursuit of power, let him resolve to live it lovingly, for he knows well how to do so." & _"Then he may say once more, 'Truly the light is sweet, and what a pleasant thing it is for the eyes to see the sun.'" - The Day the Earth Caught Fire


#3 ·  Posted (edited)

Thanks for your reply.

As the string will become extremely long this way, I tried a different approach and came across some surprising behavior of _FileWriteFromArray().

When you execute the following code you will see that the created TXT file has a leading blank line where none should be:

#include <file.au3>
#include <array.au3>

$MyArray = StringSplit("1,2,3,4,5,6,7,8,9,10,11,12", ",")

Dim $TempArray[$MyArray[0] + 1]
Dim $t = 1


_ArrayDisplay($MyArray)

For $i = 1 To $MyArray[0]
    If Valid($MyArray[$i]) Then
        $TempArray[$t] = $MyArray[$i]
        $t = $t + 1
    EndIf
Next

$TempArray[0] = UBound($TempArray) - 1


_FileWriteFromArray("C:\Foo-Bar.txt", $TempArray, 1, $t - 1)
RunWait("notepad C:\Foo-Bar.txt")
_FileReadToArray("C:\Foo-Bar.txt", $MyArray)
_ArrayDisplay($MyArray)

Func Valid($VString)
    If $VString = "5"  Then Return False
    If $VString = "7"  Then Return False
    Return True
EndFunc   ;==>Valid

So it looks to me as if _FileWriteFromArray() generally puts a "leading blank line" into its output file, or what am I missing :) ?

Regards, Rudi.

Edited by rudi


#4 ·  Posted (edited)

So it looks to me as if _FileWriteFromArray() generally puts a "leading blank line" into its output file, or what am I missing

It's a bug, which has been noted many times, and fixed in the latest beta.

Anyway, there's no reason to use _FileWriteFromArray, because it's pretty inefficient (calling FileWrite that often is a really bad design idea), and anyone past "total newbie" level should be able to write a simple loop to do that without trouble.

Edited by Siao

"be smart, drink your wine"


I would use a linked list approach. It should be considerably faster (10-1000 times) than the code you have provided.

It would go something like this:

1: Use a two dimensional array.

2: One field for the string and one for an identifier.

3: The first entry in the array[index=0][0] holds the starting point of the linked list, and [index=0][1] is the starting point of a linked list of free slots.

4: The last item array[index=UBound(array)-1][0] could also be given a special meaning. Last item in list. Items in list or similar.

5: Each entry array[index=n][0] is the index of the next item in the list (this applies whether we are pointing at the data or at the free slots).

6: Make functions for listadd, listremove, initlist and so on. Those functions just alter array[index=n][0] and [index=0][1].

I don't have the time to provide the code at the moment. Maybe someone else has the time, or you could find samples in the example scripts. I'm sure I have seen samples there..:)

Best of luck..


I would recommend altering the code that creates the array in the first place, so that the first array created contains only files that match your criteria.
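One way to read this suggestion, as a minimal sketch: walk the directory tree yourself and only keep the names that pass the test, so nothing has to be deleted afterwards. `_ListFiltered` is a made-up helper name and the "criteria" test in `CheckFile` is a placeholder; this is an assumption for illustration, not how `_FileListToArray3` works internally.

```autoit
; Sketch only: _ListFiltered is a hypothetical helper, not part of any UDF.
Func CheckFile($sFullPath)
    ; Placeholder test (assumption): reject paths containing "criteria".
    Return Not StringInStr($sFullPath, "criteria")
EndFunc   ;==>CheckFile

Func _ListFiltered($sDir, ByRef $sOut)
    Local $hSearch = FileFindFirstFile($sDir & "\*")
    If $hSearch = -1 Then Return ; empty or unreadable folder
    Local $sName, $sFull
    While 1
        $sName = FileFindNextFile($hSearch)
        If @error Then ExitLoop ; no more entries
        If $sName = "." Or $sName = ".." Then ContinueLoop
        $sFull = $sDir & "\" & $sName
        If StringInStr(FileGetAttrib($sFull), "D") Then
            _ListFiltered($sFull, $sOut) ; recurse into subfolder
        ElseIf CheckFile($sFull) Then
            $sOut &= $sFull & "|" ; "|" cannot occur in a Windows path
        EndIf
    WEnd
    FileClose($hSearch)
EndFunc   ;==>_ListFiltered

Global $sList = ""
_ListFiltered("C:\DropSourceFilesHere", $sList)
If $sList <> "" Then
    Global $aFiles = StringSplit(StringTrimRight($sList, 1), "|")
    ConsoleWrite($aFiles[0] & " files have to be processed" & @LF)
EndIf
```

Whether this beats "list everything, then filter" depends on how expensive the test is; it mainly avoids holding the unwanted names in memory at all.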


I always thought dogs laid eggs, and I learned something today.


It's a bug, which has been noted many times, and fixed in the latest beta.

Ah. Good to know, as it will then be fixed in the next production release too. I'll try to keep that in mind.

Currently I do an _ArrayDelete($MyArray,1) after reading the file back in, to get rid of that blank line.

Anyway, there's no reason to use _FileWriteFromArray anyway, because it's pretty ineffective (calling FileWrite that often is a really bad design idea), and anyone past "total newbie" level should be able to write a simple loop to do that without trouble anyway.

? Why do you write "calling ... that often"?

I just use one write and one read to get rid of the empty array elements at its end. It was much faster than multiple _ArrayDelete calls.

Well, it might be even faster to ReDim $MyArray to the required number of values and then copy these over from $TempArray.

Thanks, Rudi.



I would recommend altering the code that creates the array in the first place, so that the first array created contains only files that match your criteria.

How?

Currently I recursively read in all files starting from a certain directory (some 19000 files).

Then I pick out from this list the ones I will need to touch later on.

Regards, Rudi.



I would use a linked list approach. It should be considerably faster (10-1000 times) than the code you have provided.

It would go something like this:

1: Use a two dimensional array.

2: One field for the string and one for an identifier.

3: The first entry in the array[index=0][0] holds the starting point of the linked list, and [index=0][1] is the starting point of a linked list of free slots.

4: The last item array[index=UBound(array)-1][0] could also be given a special meaning. Last item in list. Items in list or similar.

5: Each entry array[index=n][0] is the index of the next item in the list (this applies whether we are pointing at the data or at the free slots).

6: Make functions for listadd, listremove, initlist and so on. Those functions just alter array[index=n][0] and [index=0][1].

I don't have the time to provide the code at the moment. Maybe someone else has the time, or you could find samples in the example scripts. I'm sure I have seen samples there..:)

Best of luck..

Interesting advice. But why should building a linked list be so much faster than searching $MyArray for valid values and copying these over to a second array, my $TempArray? In that case it's just one write to the second array for every valid entry in the first one. With a linked list it's two of them: the pointer from the last found valid entry to the "now" found valid entry, and [index=0][1] pointing to the "now" found valid entry?

Probably I misunderstood something... :)

Is this what you had in mind?

$MyArray[19001][2]
; fill in the data...

0:  19000   1
1:  fault   0
2:  Valid   0
3:  fault   0
4:  fault   0
5:  Valid   0
6:  fault   0
7:  Valid   0
8:  valid   0
9:  fault   0
...
18995:  fault   0
18996:  valid   0
18997:  fault   0
18998:  valid   0
18999:  fault   0
19000:  fault   0

; would be after doing the linked list processing:

0:  2   18998
1:  fault   0
2:  Valid   5
3:  fault   0
4:  fault   0
5:  Valid   7
6:  fault   0
7:  Valid   8
8:  valid   <next valid>
9:  fault   0
...
18995:  fault   0
18996:  valid   18998
18997:  fault   0
18998:  valid   <EOV>
18999:  fault   0
19000:  fault   0
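The layout in the table above can be sketched in a few lines. This is only one possible reading of the suggestion: column 0 holds the file name, column 1 the index of the next valid entry (0 marks the end), and row 0 holds the first and last valid index. `_LinkValid` and the placeholder test in `CheckFile` are hypothetical names, not code from this thread.

```autoit
; Sketch only. Assumed layout (taken from the table above):
;   $aList[n][0] = file name, $aList[n][1] = index of next valid entry (0 = end)
;   $aList[0][0] = first valid index, $aList[0][1] = last valid index
Func CheckFile($sFullPath)
    Return Not StringInStr($sFullPath, "criteria") ; placeholder test (assumption)
EndFunc   ;==>CheckFile

Func _LinkValid(ByRef $aList) ; hypothetical helper name
    Local $iLast = 0
    $aList[0][0] = 0
    For $i = 1 To UBound($aList) - 1
        $aList[$i][1] = 0
        If CheckFile($aList[$i][0]) Then
            If $iLast = 0 Then
                $aList[0][0] = $i ; first valid entry
            Else
                $aList[$iLast][1] = $i ; chain previous valid entry to this one
            EndIf
            $iLast = $i
        EndIf
    Next
    $aList[0][1] = $iLast
EndFunc   ;==>_LinkValid

; Demo data: entries 2 and 4 fail the placeholder test.
Global $aNames = StringSplit("a.txt|criteria1.txt|b.txt|criteria2.txt|c.txt", "|")
Global $aList[$aNames[0] + 1][2]
For $i = 1 To $aNames[0]
    $aList[$i][0] = $aNames[$i]
Next
_LinkValid($aList)

; Walk only the valid entries.
Global $iNode = $aList[0][0]
While $iNode <> 0
    ConsoleWrite($aList[$iNode][0] & @LF)
    $iNode = $aList[$iNode][1]
WEnd
```

Nothing is moved or resized here, which is where the saving over repeated _ArrayDelete calls would come from; the price is the extra column and having to follow the links instead of using a plain For loop.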

Thanks, Rudi.



? Why do you write "calling ... that often"?

I just use one write and one read to get rid of the empty array parts at it's end. It was much faster than multiple ArrayDelete calls?

I thought I made it pretty clear I was talking about _FileWriteFromArray. It calls FileWrite for each array element, which is bad programming practice no matter how you look at it, and makes it pretty damn slow for big (or even decent-sized) arrays. That could easily be avoided by concatenating the string in memory and writing to the output file just once (or in big chunks, if it has to handle really huge arrays).
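A minimal sketch of the "concatenate in memory, write once" idea, assuming a 1-based array with the count in element 0 as used elsewhere in this thread. `_WriteArrayOnce` is a made-up name, not a function from file.au3.

```autoit
; Sketch only: _WriteArrayOnce is a hypothetical helper, not part of file.au3.
Func _WriteArrayOnce($sFile, ByRef $aList, $iStart = 1)
    Local $sBuf = ""
    For $i = $iStart To UBound($aList) - 1
        $sBuf &= $aList[$i] & @CRLF ; build the whole output in memory
    Next
    Local $hFile = FileOpen($sFile, 2) ; 2 = write mode, erase previous contents
    If $hFile = -1 Then Return SetError(1, 0, 0)
    FileWrite($hFile, $sBuf) ; a single write call
    FileClose($hFile)
    Return 1
EndFunc   ;==>_WriteArrayOnce

Global $aDemo = StringSplit("1,2,3,4,6,8,9", ",")
_WriteArrayOnce(@TempDir & "\Foo-Bar.txt", $aDemo)
```

For really huge arrays the loop could flush $sBuf every few thousand elements instead of only at the end.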

"be smart, drink your wine"


Did you look at my methods?

This one was the winner:

Func CleanupArray(ByRef $MyArray)
    Local $Clean, $sTemp = ""

    For $Clean = 1 To $MyArray[0] Step 1
        If CheckFile($MyArray[$Clean]) Then
            $sTemp &= $MyArray[$Clean] & "|"
        EndIf
    Next
    
    $MyArray = StringSplit(StringTrimRight($sTemp,1),"|") ; StringTrimRight removes the last pipe delimiter
EndFunc   ;==>CleanupArray

It sorted through an array of 85,810 file path elements in 1-3 seconds (an entire hard drive of mine).

With arrays of only several thousand elements, it sorted through and removed elements in less than a second.
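To check numbers like these on your own machine, a small timing harness with TimerInit/TimerDiff is enough. The sketch below uses synthetic paths and a placeholder `CheckFile`; both are assumptions for illustration, and the timing will of course differ per machine and per test.

```autoit
Func CheckFile($sFullPath)
    Return Not StringInStr($sFullPath, "criteria") ; placeholder test (assumption)
EndFunc   ;==>CheckFile

Func CleanupArray(ByRef $MyArray) ; string method from this thread
    Local $sTemp = ""
    For $i = 1 To $MyArray[0]
        If CheckFile($MyArray[$i]) Then $sTemp &= $MyArray[$i] & "|"
    Next
    $MyArray = StringSplit(StringTrimRight($sTemp, 1), "|")
EndFunc   ;==>CleanupArray

; Synthetic test data: 20000 made-up paths, every third one is rejected.
Global $aTest[20001]
$aTest[0] = 20000
For $i = 1 To 20000
    If Mod($i, 3) = 0 Then
        $aTest[$i] = "C:\Test\criteria" & $i & ".txt"
    Else
        $aTest[$i] = "C:\Test\file" & $i & ".txt"
    EndIf
Next

Global $hTimer = TimerInit()
CleanupArray($aTest)
ConsoleWrite($aTest[0] & " entries kept in " & Round(TimerDiff($hTimer)) & " ms" & @LF)
```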

- The Kandie Man ;-)


@rudi,

Could be I misunderstood the task.

The reason other approaches are faster is that each time you call _ArrayDelete, a new array is created that replaces the one you passed to _ArrayDelete (just take a look at the source in the include file..:) ). Creating arrays is expensive in every language I know of.

The string approach suggested is probably fast enough if you have the power (memory and CPU). I tend to use low-end computers where memory is an issue. Therefore I suggested a linked-list-like approach.

Happy scripting


Func CheckFile($FullPath)
    Local $OK = True
    If stringinstring($FullPath) Then $OK = False
    ; ... several other testing done here.
    Return $OK
EndFunc   ;==>CheckFile
Regards, Rudi.

Hi,

I agree with above.

Your "CheckFile" is confusing, though!

1. You seem to be eliminating -all- the array items, because they will all have "$FullPath)" in them, won't they? - and if so, you are deleting them?

2. You can use multiple filters in the filter section, although I should add the latest update to post #1 of the thread, so I will do that soon.

3. Although it is not widely tested, the "Exclude" criterion should work well too, I think with multiple filters; in any case, I would be interested to hear.

Best, Randall


@rudi,

Could be I misunderstood the task.

The reason other approaches are faster is that each time you call _ArrayDelete, a new array is created that replaces the one you passed to _ArrayDelete (just take a look at the source in the include file..:) ). Creating arrays is expensive in every language I know of.

The string approach suggested is probably fast enough if you have the power (memory and CPU). I tend to use low-end computers where memory is an issue. Therefore I suggested a linked-list-like approach.

Happy scripting

Hi,

You may be right about that, but _ArrayDelete is now all byref in 3.2.11.0;

Func _ArrayDelete(ByRef $avArray, $iElement)
    If Not IsArray($avArray) Then Return SetError(1, 0, 0)

    Local $iUBound = UBound($avArray, 1) - 1

    If Not $iUBound Then
        $avArray = ""
        Return 0
    EndIf

    ; Bounds checking
    If $iElement < 0 Then $iElement = 0
    If $iElement > $iUBound Then $iElement = $iUBound

    ; Move items after $iElement up by 1
    Switch UBound($avArray, 0)
        Case 1
            For $i = $iElement To $iUBound - 1
                $avArray[$i] = $avArray[$i + 1]
            Next
            ReDim $avArray[$iUBound]
        Case 2
            Local $iSubMax = UBound($avArray, 2) - 1
            For $i = $iElement To $iUBound - 1
                For $j = 0 To $iSubMax
                    $avArray[$i][$j] = $avArray[$i + 1][$j]
                Next
            Next
            ReDim $avArray[$iUBound][$iSubMax + 1]
        Case Else
            Return SetError(3, 0, 0)
    EndSwitch

    Return $iUBound
EndFunc   ;==>_ArrayDelete

Best, Randall


#15 ·  Posted (edited)

Thanks for pointing that out, Randall. I think I checked against 3.2.9.1. It's hard to keep up with the progress..:)

It still has to shuffle all elements beyond the element removed. Obviously, if you always remove from the end of the array (the last item), it will not have any impact at all. In this case a linked list is probably faster, but probably harder to implement. So the question is how much time to spend on coding versus waiting for the job to be done (with the current implementation or one of the other suggestions). Linked list samples are easy to locate in the examples forum (nuttster's associative array sample looks like it's worth studying).

Edited by Uten


I thought I made it pretty clear I was talking about _FileWriteFromArray. It calls FileWrite for each array element, which is bad programming practice no matter how you look at it, and makes it pretty damn slow for big (or even decent-sized) arrays. That could easily be avoided by concatenating the string in memory and writing to the output file just once (or in big chunks, if it has to handle really huge arrays).

Ah. I (wrongly) expected that _FileWriteFromArray would do that.

You mean that this function just loops over the value count, doing many, many FileWriteLine($Array[$i]) calls?? <argh>

When concatenating the array's values in RAM, how do I avoid a "string length overflow"? (AutoIt3 help for "String": Maximum string length is 2147483647 characters, but keep in mind that no line in an AutoIt script can exceed 4095 characters.)

_FileReadToArray is OK?

What is the fastest way to trim the value count of an array?

After doing some testing, in this case I copied ~7000 valid values out of ~19000 from ArrayA to ArrayB. As in the beginning I only know that ArrayB <= ArrayA, I have to "kick out" the values beyond the last valid one in ArrayB.

_ArrayDelete seems to be quite slow; that was the reason why I gave _FileWriteFromArray / _FileReadToArray a try....
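For the trimming question above, one option is to compact the array in place and call ReDim a single time at the end, relying on ReDim keeping the existing values when an array is resized. A minimal sketch, with `CleanupArrayInPlace` as a made-up name and a placeholder `CheckFile`:

```autoit
Func CheckFile($sFullPath)
    Return Not StringInStr($sFullPath, "criteria") ; placeholder test (assumption)
EndFunc   ;==>CheckFile

; Sketch only: hypothetical helper, expects the element count in $aList[0].
Func CleanupArrayInPlace(ByRef $aList)
    Local $iKeep = 0
    For $i = 1 To $aList[0]
        If CheckFile($aList[$i]) Then
            $iKeep += 1
            $aList[$iKeep] = $aList[$i] ; move valid entry down to the next free slot
        EndIf
    Next
    ReDim $aList[$iKeep + 1] ; cut off everything beyond the last valid entry, once
    $aList[0] = $iKeep
EndFunc   ;==>CleanupArrayInPlace

Global $aDemo = StringSplit("a.txt,criteria.txt,b.txt,c.txt", ",")
CleanupArrayInPlace($aDemo)
ConsoleWrite($aDemo[0] & " entries left" & @LF)
```

Each element is touched at most once and there is only one resize, so no second array and no file round trip is needed.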

Thanks, Rudi.



Did you look at my methods?

This one was the winner:

Func CleanupArray(ByRef $MyArray)
    Local $Clean, $sTemp = ""

    For $Clean = 1 To $MyArray[0] Step 1
        If CheckFile($MyArray[$Clean]) Then
            $sTemp &= $MyArray[$Clean] & "|"
        EndIf
    Next
    
    $MyArray = StringSplit(StringTrimRight($sTemp,1),"|") ; StringTrimRight removes the last pipe delimiter
EndFunc   ;==>CleanupArray

Yes, that's really fast! I'll change my code. :) And thanks especially for that comment on StringTrimRight!

Regards, Rudi.



#18 ·  Posted (edited)

Hi,

I agree with above.

Your "CheckFile" is confusing, though!

This CheckFile wasn't my issue, so I simplified the code very much and introduced a mistake. It should look like this:

Func CheckFile($FullPath)
    Local $OK = True
    If StringInStr($FullPath, "criteria") Then $OK = False
    ; ... several other testing done here.
    Return $OK
EndFunc   ;==>CheckFile

2. You can use multiple filters in the filter section, although I should add the latest update to post #1 of the thread, so I will do that soon.

I cannot follow this sentence at all, :) sorry... :)

3. Although it is not widely tested, the "Exclude" criterion should work well too, I think with multiple filters; in any case, I would be interested to hear.

Once more, I completely miss what you want to tell me... (my English... :) )

Thanks, Rudi.

Edited by rudi
