Jump to content

FileReadToArray vs FileRead & StringSplit - unexpected results


Recommended Posts

I've noticed that for the task of "reading a file row by row into an array" FileReadToArray is consistently slower than a FileRead + StringSplit.

The following script:

Spoiler
#include <String.au3>
#include <WinAPIFiles.au3>

; size steps:
Global $aN[] = [1024, 10240, 102400, 1048576, 10485760, 104857600]
Global Const $sFILEPATH = @ScriptDir & "\Testfile.txt"
Global $nBytes = 0, $nLineLength, $iTimer, $nTimerDiff, $aLines

For $nFileSize In $aN
    ; create test file
    Local $hFile = FileOpen($sFILEPATH, 2 + 512)
    Do
        $nLineLength = Random(1, 20, 1)  ; random line length 
        FileWriteLine($hFile, _StringRepeat("a", $nLineLength))

        $nBytes += $nLineLength + 2 ; "+2" for @crlf
    Until $nBytes >= $nFileSize
    FileFlush($hFile)
    FileClose($hFile)

    ConsoleWrite("Filesize: " & _WinAPI_StrFormatByteSize($nFileSize) & @CRLF)

    ; todo: truncate file to exact size with _WinAPI_SetFilePointer and _WinAPI_SetEndOfFile

; test StringSplit()
    $iTimer = TimerInit()
    $aLines = StringSplit(FileRead($sFILEPATH), @CRLF, 3)
    $nTimerDiff = TimerDiff($iTimer)
    $aLines = "" ; clean up
    ConsoleWrite(StringFormat("\t% 26s: % 8.2f ms (%s PeakWorkingSetSize)\n", "StringSplit", $nTimerDiff, _WinAPI_StrFormatByteSize(ProcessGetStats(@AutoItPID, 0)[1]) ))

; test FileReadToArray()
    $iTimer = TimerInit()
    $aLines = FileReadToArray($sFILEPATH)
    $nTimerDiff = TimerDiff($iTimer)
    $aLines = "" ; clean up
    ConsoleWrite(StringFormat("\t% 26s: % 8.2f ms (%s PeakWorkingSetSize)\n\n", "FileReadToArray", $nTimerDiff, _WinAPI_StrFormatByteSize(ProcessGetStats(@AutoItPID, 0)[1]) ))

; clean up
    FileDelete($sFILEPATH)
Next

 

Results in the following output for me:

Spoiler
Filesize: 1,00 KB
        StringSplit:     0.16 ms (13,5 MB PeakWorkingSetSize)
    FileReadToArray:     0.15 ms (13,7 MB PeakWorkingSetSize)

Filesize: 10,0 KB
        StringSplit:     0.45 ms (13,8 MB PeakWorkingSetSize)
    FileReadToArray:     0.58 ms (13,8 MB PeakWorkingSetSize)

Filesize: 100 KB
        StringSplit:     3.16 ms (14,5 MB PeakWorkingSetSize)
    FileReadToArray:     4.90 ms (14,8 MB PeakWorkingSetSize)

Filesize: 1,00 MB
        StringSplit:    32.73 ms (26,1 MB PeakWorkingSetSize)
    FileReadToArray:    47.17 ms (28,4 MB PeakWorkingSetSize)

Filesize: 10,0 MB
        StringSplit:   315.15 ms (140 MB PeakWorkingSetSize)
    FileReadToArray:   485.55 ms (163 MB PeakWorkingSetSize)

Filesize: 100 MB
        StringSplit:  3191.81 ms (1,25 GB PeakWorkingSetSize)
    FileReadToArray:  4635.66 ms (1,47 GB PeakWorkingSetSize)

 

And the output with switched call sequence:

Spoiler
Filesize: 1,00 KB
    FileReadToArray:     0.15 ms (13,5 MB PeakWorkingSetSize)
        StringSplit:     0.26 ms (13,7 MB PeakWorkingSetSize)

Filesize: 10,0 KB
    FileReadToArray:     0.58 ms (13,9 MB PeakWorkingSetSize)
        StringSplit:     0.37 ms (14,0 MB PeakWorkingSetSize)

Filesize: 100 KB
    FileReadToArray:     4.99 ms (14,9 MB PeakWorkingSetSize)
        StringSplit:     3.16 ms (14,9 MB PeakWorkingSetSize)

Filesize: 1,00 MB
    FileReadToArray:    49.30 ms (28,5 MB PeakWorkingSetSize)
        StringSplit:    31.74 ms (28,5 MB PeakWorkingSetSize)

Filesize: 10,0 MB
    FileReadToArray:   480.50 ms (163 MB PeakWorkingSetSize)
        StringSplit:   323.10 ms (163 MB PeakWorkingSetSize)

Filesize: 100 MB
    FileReadToArray:  5050.74 ms (1,47 GB PeakWorkingSetSize)
        StringSplit:  3202.90 ms (1,47 GB PeakWorkingSetSize)

 

This surprises me, since I first assume that the native function offers more optimization possibilities than having to create an array in AutoIt via an intermediate string.

So I thought that maybe FileReadToArray is more economical with memory. But again FileReadToArray seems to consume more memory than the StringSplit variant. For this I got the PeakWorkingSetSize after the call of the respective function.
After FileReadToArray the Size is always higher than with StringSplit. Since this is not the case if one changes the order of the two functions (see above), one can assume that FileReadToArray actually has this difference. The difference is mostly the double file size. Which would make sense if the file is completely saved internally as a UTF-16 string.

Can anyone explain to me why FileReadToArray performs so unexpectedly poor - or am I missing something?

Edited by AspirinJunkie
Link to comment
Share on other sites

2 hours ago, AspirinJunkie said:

Can anyone explain to me why FileReadToArray performs so unexpectedly poor - or am I missing something?

Since I have no access to source code, I cannot give you a true explanation.  But I can give you my first take on the difference :

#include <Array.au3>
#include <Constants.au3>

$str = "abcAbcAbcaBc"
$arr = StringSplit($str, "A")

_ArrayDisplay($arr, "String Split")

FileWrite ("Test.txt", "abc" & @CR & "bc" & @LF & "bcAbc" & @CRLF & "test")
$arr = FileReadToArray("Test.txt")

_ArrayDisplay($arr, "File Read")
FileDelete("Test.txt")

1) StringSplit is case-sensitive.  While FileReadToArray should also be, but I cannot prove it.  This is one "possible" reason.

2) There is a tad more logic in FileReadToArray, as it searches for different types (and combinations) of end-of-line characters.  More logic implies longer execution.  You cannot achieve this with StringSplit BTW.

Link to comment
Share on other sites

Hi,

the difference comes from the fact that the FileReadToArray is reading line by line as opposed to FileRead which read the whole file.

Later on the splitting will work on the whole read buffer  and act on it allocating entries of the array more quicker as the delimiter is just the @CRLF.

In the case of FileReadToArray the delimiter can be @CR Or @LF or @CRLF which done char by char. That certainly the main overhead.

If we want the same speed FileReadToArray will need a new parameter defining the delimiter.

Cheers

 

Link to comment
Share on other sites

1 hour ago, Nine said:

There is a tad more logic in FileReadToArray, as it searches for different types (and combinations) of end-of-line characters.

Shure - that might be one reason.

6 minutes ago, jpm said:

the difference comes from the fact that the FileReadToArray is reading line by line as opposed to FileRead which read the whole file

O.k. - the difference in performance.
But the difference in memory consumption?

Link to comment
Share on other sites

On 1/8/2021 at 4:06 PM, Bilgus said:

My guess is that an array is instantiated by a power of 2 to limit resizing and you are probably seeing the slack in your ram usage

Hm the difference in memory consumption is quite noticeably the same as the string size (internally represented as UTF-16).
If it's really the difference because of the dynamic array resizing, then the difference should be in worst case [number of lines] * 8 Byte (or 4 Byte at 32 Bit).
But the difference we see here is much greater.

Link to comment
Share on other sites

and then rounded to the next power of 2..

Typically in schemes like this you would start with some low number of indices (power of 2)

1024 for instance if that wasn't enough then you might double it to 2048 or again to 4096 at some point the work will fit or you will run out of memory.

in this case lets say we had a need for 2049 indices and we now have 4096 provisioned thats 50% more than we needed!

 

Generally resizing arrays take time and processor so you typically see the array get overprovisioned to cut down on the resizes needed.

Not sure how Autoit cleans up the ram but my guess is that it is never cleaned up while the array is in scope..

Edited by Bilgus
Splainin
Link to comment
Share on other sites

Exact - in worst case this would be double of the number of lines in the file.

If you have 1000 lines the array would be 1024 - difference of 24 elements.
Worst case instead: If the file would have 1024 (more precise 1025) lines then the array would have 2048 elements - diffence of 1024 lines.

Just as i wrote.
 

Edited by AspirinJunkie
Link to comment
Share on other sites

you assume its lockstep to the size but maybe they quadruple each resize or double for the first 3 resizes and x10 for the next no clue but if you really want to know go fill the array and have a look at the memory I believe sysinternals has a virtual ram viewer

Link to comment
Share on other sites

49 minutes ago, Bilgus said:

you assume its lockstep to the size but maybe they quadruple each resize or double for the first 3 resizes and x10 for the next no clue

This would be very unlikely. The most common factors are 2 (C++ Arraylist, Python) and 1.5 (Java Arraylist).

But well - we can check this empirically with a concrete example:

FileSize (ANSI encoded):                104,857,600 Bytes
Number of lines:                        7,549,361 
Difference in memory consumption:       239,345,664 Bytes
Stringsize in memory (UTF-16 encoded):  209,715,200 Bytes
    
Difference explainable through array resize factor  
1.5  28,063,771 Bytes
2    6,713,976 Bytes
3    54,396,368 Bytes
4    73,822,840 Bytes
5    17,730,112 Bytes
6    20,226,680 Bytes
7    262,433,968 Bytes
8    73,822,840 Bytes
9    283,978,880 Bytes
10   19,605,112 Bytes

I think we can safely disregard factors other than these.
As can be seen, the differences that can be explained by the resize factor are significantly smaller than the actual difference. Or they are significantly larger than this and therefore cannot be used as a reason either.

The array resize factor therefore does not seem to be the reason for the difference.  

Edited by AspirinJunkie
Link to comment
Share on other sites

I'm not sure what your numbers are trying to show there I'm gonna need a whole lot more

explanation of your methodology before these

numbers even start to make sense to safely disregard

Are you saying you tested this in a different language then?

Next are you taking in account the variant nature of the backend of autoit variables

and that each array spot could be holding a string that is also power of 2 growing?

 

Link to comment
Share on other sites

1 minute ago, Bilgus said:

I'm not sure what your numbers are trying to show there I'm gonna need a whole lot more

explanation of your methodology before these

What to explain? I explained it already. Take the number of lines. Determine the next power of the factor. Calc the difference and multiply by 8 Bytes. (Arrays in AutoIt are pointer arrays because every element points to a variant structure. This could be easily checked)
This would be the difference in memory consumption if these are due to the dynamic array enlargement.

Why don't you explain why you suspect the dynamic array so much?
That was your assumption - but so far this suspicion has not been backed up with anything concrete.

My assumption is, that it has something to do with the file string in memory.
I also have a concrete hint for this:
The relation of the string size to the whole memory difference is stable independent of the size of the file:
At 1 MB it is 87% of the total difference, at 10 MB it is also 87%, at 100 MB it is 88%.

If the differences were really due to the array resizing, the differences would have to vary more significantly, since completely different factor-powers would be hit.

 

Link to comment
Share on other sites

I figure over provisioning is the most likely difference since the arrays need to have enough room to search for the line endings without any hints unlike the stringsplit which I figure is more optimized to the workload and has an upperbound from the start

Link to comment
Share on other sites

This is an theory (a good one, by the way, in my opinion). But for now, this is just the first step. In the next step, a theory must be verified against the reality.
In the practical examination of this, two aspects emerge that speak against this hypothesis:

  1. We can calculate the expected effect of overprovisioning for a concrete resize factor and the results showed that none of these factors is suitable to explain the memory difference because the actual difference is far too different from the expected once.
  2. The relation of the string size to the whole memory difference is remarkable stable even over completely different file sizes. If the reason is really overprovisioning then we should not be able to observe this effect. The ratio of file size/string size to memory consumption difference should result in a staircase function when overprovisioning is the reason. In fact, however, we observe a linear dependency.

Both empirical observations argue against the theory that overprovisioning is the main reason for the differences.
So far, I have not seen an empirical aspect that speaks in favor of the theory.

However, my problem remains that this is a bit like reading from a crystal ball. The methodology for determining the memory difference is extremely shaky.
However, I do not have a better approach so far and therefore try to make deductions about the inner mode of functioning on this basis.


In the end, it is about giving the devs hints where something might be wrong so that they can look more specifically if there is really a potential for improvement.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...