AspirinJunkie

FileReadToArray vs FileRead & StringSplit - unexpected results


Posted (edited)

I've noticed that for the task of reading a file line by line into an array, FileReadToArray is consistently slower than FileRead + StringSplit.

The following script:

#include <String.au3>
#include <WinAPIFiles.au3>

; size steps:
Global $aN[] = [1024, 10240, 102400, 1048576, 10485760, 104857600]
Global Const $sFILEPATH = @ScriptDir & "\Testfile.txt"
Global $nBytes = 0, $nLineLength, $iTimer, $nTimerDiff, $aLines

For $nFileSize In $aN
    ; create test file
    $nBytes = 0 ; reset the byte counter, otherwise it carries over from the previous file size
    Local $hFile = FileOpen($sFILEPATH, 2 + 512)
    Do
        $nLineLength = Random(1, 20, 1)  ; random line length 
        FileWriteLine($hFile, _StringRepeat("a", $nLineLength))

        $nBytes += $nLineLength + 2 ; "+2" for @crlf
    Until $nBytes >= $nFileSize
    FileFlush($hFile)
    FileClose($hFile)

    ConsoleWrite("Filesize: " & _WinAPI_StrFormatByteSize($nFileSize) & @CRLF)

    ; todo: truncate file to exact size with _WinAPI_SetFilePointer and _WinAPI_SetEndOfFile

; test StringSplit()
    $iTimer = TimerInit()
    $aLines = StringSplit(FileRead($sFILEPATH), @CRLF, 3)
    $nTimerDiff = TimerDiff($iTimer)
    $aLines = "" ; clean up
    ConsoleWrite(StringFormat("\t% 26s: % 8.2f ms (%s PeakWorkingSetSize)\n", "StringSplit", $nTimerDiff, _WinAPI_StrFormatByteSize(ProcessGetStats(@AutoItPID, 0)[1]) ))

; test FileReadToArray()
    $iTimer = TimerInit()
    $aLines = FileReadToArray($sFILEPATH)
    $nTimerDiff = TimerDiff($iTimer)
    $aLines = "" ; clean up
    ConsoleWrite(StringFormat("\t% 26s: % 8.2f ms (%s PeakWorkingSetSize)\n\n", "FileReadToArray", $nTimerDiff, _WinAPI_StrFormatByteSize(ProcessGetStats(@AutoItPID, 0)[1]) ))

; clean up
    FileDelete($sFILEPATH)
Next

 

Results in the following output for me:

Filesize: 1,00 KB
        StringSplit:     0.16 ms (13,5 MB PeakWorkingSetSize)
    FileReadToArray:     0.15 ms (13,7 MB PeakWorkingSetSize)

Filesize: 10,0 KB
        StringSplit:     0.45 ms (13,8 MB PeakWorkingSetSize)
    FileReadToArray:     0.58 ms (13,8 MB PeakWorkingSetSize)

Filesize: 100 KB
        StringSplit:     3.16 ms (14,5 MB PeakWorkingSetSize)
    FileReadToArray:     4.90 ms (14,8 MB PeakWorkingSetSize)

Filesize: 1,00 MB
        StringSplit:    32.73 ms (26,1 MB PeakWorkingSetSize)
    FileReadToArray:    47.17 ms (28,4 MB PeakWorkingSetSize)

Filesize: 10,0 MB
        StringSplit:   315.15 ms (140 MB PeakWorkingSetSize)
    FileReadToArray:   485.55 ms (163 MB PeakWorkingSetSize)

Filesize: 100 MB
        StringSplit:  3191.81 ms (1,25 GB PeakWorkingSetSize)
    FileReadToArray:  4635.66 ms (1,47 GB PeakWorkingSetSize)

 

And the output with switched call sequence:

Filesize: 1,00 KB
    FileReadToArray:     0.15 ms (13,5 MB PeakWorkingSetSize)
        StringSplit:     0.26 ms (13,7 MB PeakWorkingSetSize)

Filesize: 10,0 KB
    FileReadToArray:     0.58 ms (13,9 MB PeakWorkingSetSize)
        StringSplit:     0.37 ms (14,0 MB PeakWorkingSetSize)

Filesize: 100 KB
    FileReadToArray:     4.99 ms (14,9 MB PeakWorkingSetSize)
        StringSplit:     3.16 ms (14,9 MB PeakWorkingSetSize)

Filesize: 1,00 MB
    FileReadToArray:    49.30 ms (28,5 MB PeakWorkingSetSize)
        StringSplit:    31.74 ms (28,5 MB PeakWorkingSetSize)

Filesize: 10,0 MB
    FileReadToArray:   480.50 ms (163 MB PeakWorkingSetSize)
        StringSplit:   323.10 ms (163 MB PeakWorkingSetSize)

Filesize: 100 MB
    FileReadToArray:  5050.74 ms (1,47 GB PeakWorkingSetSize)
        StringSplit:  3202.90 ms (1,47 GB PeakWorkingSetSize)

 

This surprises me, since I would have assumed that the native function offers more optimization possibilities than building the array in AutoIt via an intermediate string.

So I thought that maybe FileReadToArray is at least more economical with memory. But again, FileReadToArray seems to consume more memory than the StringSplit variant. To measure this, I read the PeakWorkingSetSize after the call of the respective function.
After FileReadToArray the size is always higher than after StringSplit. Since this is not the case when the order of the two calls is swapped (see above), the difference can be attributed to FileReadToArray itself. The difference is mostly double the file size, which would make sense if the file is held internally, in full, as a UTF-16 string.
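The arithmetic behind that suspicion is simple enough to write down (Python used here purely as a calculator; the 239,345,664-byte difference is the measured value for the 100 MB case that comes up again later in this thread):

```python
# An ANSI file stores one byte per character, while AutoIt strings are
# UTF-16 (two bytes per character). If FileReadToArray keeps the whole
# file content as an internal string, its extra footprint should be
# roughly twice the file size.
file_size = 104_857_600                 # 100 MB test file
utf16_string_size = file_size * 2       # expected internal string size

observed_diff = 239_345_664             # measured peak-working-set delta

print(utf16_string_size)                # 209715200
print(round(utf16_string_size / observed_diff, 2))  # 0.88
```

That roughly 88% match is what makes the internal UTF-16 copy of the file the prime suspect here.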

Can anyone explain to me why FileReadToArray performs so unexpectedly poorly - or am I missing something?

Edited by AspirinJunkie

2 hours ago, AspirinJunkie said:

Can anyone explain to me why FileReadToArray performs so unexpectedly poorly - or am I missing something?

Since I have no access to the source code, I cannot give you a definitive explanation.  But I can give you my first take on the difference:

#include <Array.au3>
#include <Constants.au3>

; StringSplit splits only on the exact, case-sensitive delimiter:
$str = "abcAbcAbcaBc"
$arr = StringSplit($str, "A")

_ArrayDisplay($arr, "String Split")

; FileReadToArray treats @CR, @LF and @CRLF alike as line endings:
FileWrite("Test.txt", "abc" & @CR & "bc" & @LF & "bcAbc" & @CRLF & "test")
$arr = FileReadToArray("Test.txt")

_ArrayDisplay($arr, "File Read")
FileDelete("Test.txt")

1) StringSplit is case-sensitive. FileReadToArray should be as well, but I cannot prove it. This is one "possible" reason.

2) There is a tad more logic in FileReadToArray, as it searches for different types (and combinations) of end-of-line characters. More logic implies longer execution. You cannot achieve this with StringSplit, by the way.
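The cost of that extra end-of-line logic can be illustrated outside AutoIt (Python here, purely for illustration), contrasting a fixed-delimiter split, which is what StringSplit with @CRLF amounts to, with a split that has to recognize @CR, @LF and @CRLF alike:

```python
text = "abc\rbc\nbcAbc\r\ntest"  # same mix of @CR, @LF and @CRLF as above

# Splitting on one fixed delimiter is a plain substring search, but it
# misses the other line endings:
print(text.split("\r\n"))   # ['abc\rbc\nbcAbc', 'test']

# A line-ending-aware split must inspect the input character by
# character for \r, \n and \r\n, which is the extra work described above:
print(text.splitlines())    # ['abc', 'bc', 'bcAbc', 'test']
```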


Hi,

the difference comes from the fact that FileReadToArray reads line by line, as opposed to FileRead, which reads the whole file at once.

The subsequent splitting then works on the whole read buffer and can allocate the array entries much more quickly, since the delimiter is just @CRLF.

In the case of FileReadToArray, the delimiter can be @CR or @LF or @CRLF, which has to be detected character by character. That is certainly the main overhead.

If we wanted the same speed, FileReadToArray would need a new parameter defining the delimiter.

Cheers

 

1 hour ago, Nine said:

There is a tad more logic in FileReadToArray, as it searches for different types (and combinations) of end-of-line characters.

Sure - that might be one reason.

6 minutes ago, jpm said:

the difference comes from the fact that FileReadToArray reads line by line, as opposed to FileRead, which reads the whole file

O.k. - that explains the difference in performance.
But what about the difference in memory consumption?


My guess is that the array is instantiated by powers of 2 to limit resizing, and you are probably seeing the slack in your RAM usage.

On 1/8/2021 at 4:06 PM, Bilgus said:

My guess is that the array is instantiated by powers of 2 to limit resizing, and you are probably seeing the slack in your RAM usage.

Hm, the difference in memory consumption is quite noticeably the same as the string size (internally represented as UTF-16).
If the difference were really due to dynamic array resizing, then it should be at worst [number of lines] × 8 bytes (or 4 bytes on 32-bit).
But the difference we see here is much greater.


... and then rounded up to the next power of 2.

Typically in schemes like this you start with some low number of indices (a power of 2), 1024 for instance. If that isn't enough, you double it to 2048, then again to 4096; at some point the data fits or you run out of memory.

In this case, say we needed 2049 indices and now have 4096 provisioned: almost double what we needed!

Generally, resizing arrays takes time and processor, so you typically see the array over-provisioned to cut down on the number of resizes needed.

Not sure how AutoIt cleans up the RAM, but my guess is that it is never reclaimed while the array is in scope.
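A sketch of such a scheme (Python, purely illustrative; the start capacity of 1024 and the doubling factor are the assumptions from the example above):

```python
def provisioned(n, start=1024, factor=2):
    """Capacity after geometric growth: multiply by `factor`
    until at least `n` slots fit (hypothetical scheme)."""
    cap = start
    while cap < n:
        cap *= factor
    return cap

n = 2049
cap = provisioned(n)   # 4096: almost double what was needed
slack = cap - n        # 2047 unused slots
print(cap, slack)
```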

Edited by Bilgus
Splainin


Exactly - in the worst case this would be double the number of lines in the file.

If the file has 1000 lines, the array would have 1024 elements: a difference of 24 elements.
Worst case instead: if the file had 1025 lines, the array would have 2048 elements: a difference of 1023 elements.

Just as I wrote.
 

Edited by AspirinJunkie


You assume it's in lockstep with the size, but maybe they quadruple on each resize, or double for the first 3 resizes and ×10 after that. No clue. If you really want to know, go fill the array and have a look at the memory; I believe Sysinternals has a virtual RAM viewer.

49 minutes ago, Bilgus said:

You assume it's in lockstep with the size, but maybe they quadruple on each resize, or double for the first 3 resizes and ×10 after that. No clue.

This would be very unlikely. The most common growth factors are 2 (e.g. C++ std::vector) and 1.5 (Java ArrayList).

But well - we can check this empirically with a concrete example:

FileSize (ANSI encoded):                  104,857,600 bytes
Number of lines:                            7,549,361
Difference in memory consumption:         239,345,664 bytes
String size in memory (UTF-16 encoded):   209,715,200 bytes

Difference explainable by an array resize factor of:
 1.5:    28,063,771 bytes
 2:       6,713,976 bytes
 3:      54,396,368 bytes
 4:      73,822,840 bytes
 5:      17,730,112 bytes
 6:      20,226,680 bytes
 7:     262,433,968 bytes
 8:      73,822,840 bytes
 9:     283,978,880 bytes
 10:     19,605,112 bytes

I think we can safely disregard factors beyond these.
As can be seen, the differences that can be explained by a resize factor are either significantly smaller than the actual difference, or significantly larger; in neither case can they account for it.

The array resize factor therefore does not seem to be the reason for the difference.
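For anyone who wants to reproduce the table above, a sketch of the calculation in Python (used as a calculator; the start capacity of 1024 and the integer growth scheme are my assumptions, but the factor-2 value does not depend on the start capacity and reproduces the table exactly):

```python
def slack_bytes(lines, factor, start=1024, ptr_size=8):
    """Unused slots after geometric growth, times 8 bytes per slot
    (AutoIt arrays hold pointers to variant structures; 8 bytes on 64-bit)."""
    cap = start
    while cap < lines:
        cap *= factor
    return int((cap - lines) * ptr_size)

lines = 7_549_361              # line count of the 100 MB test file
print(slack_bytes(lines, 2))   # matches the table entry for factor 2
```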

Edited by AspirinJunkie


I'm not sure what your numbers are trying to show there. I'm going to need a whole lot more explanation of your methodology before these numbers make enough sense to "safely disregard" anything.

Are you saying you tested this in a different language, then?

Next, are you taking into account the variant nature of the backend of AutoIt variables, and that each array slot could be holding a string that also grows by powers of 2?

 

1 minute ago, Bilgus said:

I'm not sure what your numbers are trying to show there. I'm going to need a whole lot more explanation of your methodology before these

What is there to explain? I explained it already: take the number of lines, determine the next power of the factor, calculate the difference, and multiply by 8 bytes. (Arrays in AutoIt are pointer arrays, because every element points to a variant structure. This is easy to check.)
This would be the difference in memory consumption if it were due to dynamic array growth.

Why don't you explain why you suspect the dynamic array so strongly?
That was your assumption, but so far this suspicion has not been backed up with anything concrete.

My assumption is that it has something to do with the file string in memory.
I also have a concrete hint for this:
the ratio of the string size to the total memory difference is stable regardless of the file size:
at 1 MB it is 87% of the total difference, at 10 MB it is also 87%, at 100 MB it is 88%.

If the differences were really due to array resizing, they would have to vary much more significantly, since completely different powers of the factor would be hit.

 


I figure over-provisioning is the most likely cause, since the array needs enough room to search for the line endings without any hints, unlike StringSplit, which I figure is more optimized for the workload and has an upper bound from the start.


This is a theory (a good one, by the way, in my opinion). But for now this is just the first step; in the next step, a theory must be verified against reality.
The practical examination reveals two aspects that speak against this hypothesis:

  1. We can calculate the expected effect of over-provisioning for a concrete resize factor, and the results showed that none of these factors is suitable to explain the memory difference, because the actual difference is far from the expected one in every case.
  2. The ratio of the string size to the total memory difference is remarkably stable, even across completely different file sizes. If over-provisioning were really the reason, we should not observe this effect: the ratio of file size (string size) to memory-consumption difference would follow a staircase function. In fact, however, we observe a linear dependency.

Both empirical observations argue against the theory that over-provisioning is the main reason for the differences.
So far, I have not seen an empirical aspect that speaks in favor of the theory.
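The staircase argument can be made concrete with a small sketch (Python, illustrative; doubling scheme assumed): under over-provisioning the slack is a sawtooth function of the line count, so the ratio could not stay at a constant ~88% across 1 MB, 10 MB and 100 MB files.

```python
def capacity(n, start=1024, factor=2):
    # hypothetical doubling scheme as discussed above
    cap = start
    while cap < n:
        cap *= factor
    return cap

# Zero slack when the line count hits a power of 2 exactly ...
print(capacity(2**20) - 2**20)            # 0
# ... but almost 100% slack just past it:
print(capacity(2**20 + 1) - (2**20 + 1))  # 1048575
```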

However, my problem remains that this is all a bit like reading from a crystal ball: the methodology for determining the memory difference is extremely shaky.
I do not have a better approach so far, however, and therefore try to draw conclusions about the inner workings on this basis.


In the end, it is about giving the devs hints about where something might be wrong, so that they can check more specifically whether there is really potential for improvement.

