Jump to content
Sign in to follow this  
junkew

Processing large gigabyte csv files

Recommended Posts

junkew

Is there any specific advice to be given on processing multiple gigabyte files ranging in size from 1 gigabyte to 15 gigabyte with autoit (for the moment i do not want to switch to perl, awk, sed etc)

Within the files I need to replace the spaces and leading zero's with the empty string

csv line is split by ; 

Whats more efficient 

1. reading line by line

2. reading chunks with fileread

using stringregexpreplace line by line or multiple megabytes at once

I have the basics reading the file in blocks of 60 megabytes but as soon as I start using regexpreplace on this i get a processing speed of about 10 megabyte a second (whereas just plain reading has a speed of about 80-100 megabyte a second) 

edit

fileopen with parameter 16 (read bytes) speeds reading to 300 megabytes a second (factor 4-5 compared to mode 1)

this seems to be helping to speed up replace in binary strings

'?do=embed' frameborder='0' data-embedContent>>

Just to give full background on the problem to tackle
Every month they get about 11 files varying in size from 1 to 25 gigabytes in total about 50-60 gigabyte a month with detailled data where they want to create managementinformation from.

They load this into a SQL database to create summary per day, week, month, quarter, year in many pivot ways a lot of SELECT GROUP BY stuff

I have people given the feedback that probably directly processing the text files is much faster (and cheaper as no expense hardware for SQL database is needed)

So far I have proven to be right (we process stuff now in several hours having the answer by calculating directly insetad of processing days in an SQL database

So something like

a.Read the files

process each file

     process each line

           split each line in their columns

                 Determine the rowkey:    rowkey=column15.value & ";" & column1.value & & ";" & column3.value

                        dictionary(rowkey)=dictionary(rowkey)+1

b. Write the dictionary to file

c. Open the output in Excel and make the relevant pivottable

Edited by junkew

Share this post


Link to post
Share on other sites
Chimp

Hi junkew

perhaps StringReplace is faster than stringregexpreplace(?)
I just did a> test on a> similar case


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Share this post


Link to post
Share on other sites
junkew

could be but simple stringreplace will not do the full trick

as filelines looks like (about 300 columns)

20130901;ABCD      ;1;0;0000012345;1900        ;;;;;                       ;;                      ;               ;

output should become

20130901;ABCD;1;0;12345;1900;;;;;;;;;

the ;0; should stay ;0;

the 1900 should stay 1900

and within the file I get also sometimes terrible values in a column containing linefeeds (so I have to repair / remove that to construct a proper line)

Share this post


Link to post
Share on other sites
junkew

Code so far to measure some time on this
 
fileopen with value 16 makes a difference of about a factor 5 in reading speed (plain ascii data reading it in binary)
As soon as I put a stringreplace space with nothing in speed drops by a factor 100 
so logic to remove leading 0 and some other stuff is hitting me hard in leadtime
 
Any suggestions are welcome to do this in AutoIT or just the advice to switch to c/c++ (I have a solution written in vbscript but for about 54 gigabytes of data it gives me a 2-3 hour processing time)
 
 edit  

changed example

added function createbigfile to demonstrate

use 2 drives / change the paths to get it working

local $strPathInput="C:javatestinput"

local $strPathOutput="j:javatestoutput"

#AutoIt3Wrapper_UseX64=Y
#include "binary.au3"
;~ const $tblocksize=1024
;~ const $tblocksize=2048
;~ const $tblocksize=4096
;~ const $tblocksize=8192
;~ const $tblocksize=16384
const $tblocksize=24576 ;~seems to be most efficient blocksize
;~ const $tblocksize=32768
;~ const $tblocksize=67108864
;~ const $tblocksize=1  ;~ This is a terrible blocksize
 
;~ to show progress every so many megabytes
const $updateEveryMB=67108864 ;~ (64*1024*1024)
;~ const $updateEveryMB=33554432 ;~ (32*1024*1024)
 
;~ Play with these parameters and see the speed difference on huge files
const $blockmode=true ;~ otherwise use readline (4-5 times slower)
const $onlyCopy=false  ;~ Do only a plain copy to have a baseline without string processing
 
;~ createBigFile()
stripCharacters()
 
func stripCharacters()
Local $begin
;~  local $strPathInput=@ScriptDir
local $strPathInput="C:\javatest\input"
;~  local $strPathInput=@ScriptDir
local $strPathOutput="j:\javatest\output"
;~  local $strPathOutput="R:"
 
;~  Local $FileSearch = FileFindFirstFile($strPathInput & "\" & "*.*")
  Local $FileSearch = FileFindFirstFile($strPathInput & "\" & "*.*")
local $fileIn
local $fileOut
 
local $iProcessed
local $tData
    local $j
local $i
 
;~  $tData=dllstructcreate("char myChar[" & $tBlockSize & "]")
 
; Check if the search was successful
If $FileSearch = -1 Then
MsgBox(0, "Error", "No files/directories matched the search pattern")
Exit
EndIf
 
;~ Process all files
while 1
$strFileNameIn=FileFindNextFile($FileSearch)
If @error Then ExitLoop
consolewrite($strFileNameIn & @CRLF)
$fileIn = FileOpen($strPathInput & "\" & $strFileNameIn, 16)  ;~Only 8 bits characters so process binary
;~  $fileOut = fileopen($strPathOutput & "\" & $strFileNameIn,16 + 2)
$fileOut = fileopen($strPathOutput & "\" & $strFileNameIn,16+2) ;~ If binary written per byte
 
$begin = TimerInit()
$iProcessed=0
$i=1
 
;~ While there is data to process
while 1
 
if $blockMode=true Then
$strBlock=fileread($fileIn, $tBlocksize)
If @error = -1 Then ExitLoop
if @Extended=0 then ExitLoop
$bytesRead=@Extended
$iProcessed=$iProcessed+$bytesRead  ;~ +$tBlockSize
 
;~   $strBlockNew=_binaryreplace($strBlock," ","")
;~ $strBlockNew=stringregexpreplace($strBlock,"(;[^a-zA-Z0-9\r\n;]*)|( *;)",";")
;~  $strBlock=binarytostring($strBlock,1)
;~  $strBlock=stringreplace(binarytostring($strBlock,1),chr(255),"")
 
if $onlyCopy=true Then
;~  ;do nothing
Else
$strBlock=stringreplace($strBlock,chr(255),"")
$strBlock=stringregexpreplace(binarytostring($strBlock,1),"( +;)|(; +)|(;00+)",";")
;~ $strBlock=stringregexpreplace(binarytostring($strBlock,1),";00+",";")
endIf
filewrite($fileOut,$strBlock)
Else
  $strBlock=filereadline($fileIn)
If @error = -1 Then ExitLoop
$iProcessed=$iProcessed+stringlen($strblock)
$strBlockNew=stringstripws($strBlock,8)
  $strBlockNew=stringregexpreplace($strBlockNew,";00+",";")
filewrite($fileOut,$strBlockNew)
EndIf
 
;~  $strBlockNew=stringregexpreplace($strBlock,"(;[^a-zA-Z0-9\r\n;]*)|( *;)",";")
;~  $strBlockNew=stringregexpreplace($strBlock," *","")
;~  $strBlockNew=stringreplace(binarytostring($strBlock,1)," ","")
;~  $strBlockNew=stringreplace($strBlock," ","")
 
;~   filewrite($fileOut,$strBlock)
;~  $strBlockNew=stringregexpreplace(binarytostring($strBlock,1),"(; +)|(;00+)",";")
 
 
;~ Do it using new string
;~   $newField=true
;~  $strBlockNew=""
;~  for $j=1 to $bytesRead
;~  $myChar=binarymid($strBlock,$j,1)
;~  $newField=false
;~  if $myChar=";" Then
;~  $newField=True
;~  EndIf
;~  if (($myChar=" ") or ($myChar="0")) and $newField=true Then
;~ ; do nothing
;~  Else
;~  $strBlockNew=$strBlocknew & $myChar
;~  EndIf
;~  Next
 
;~ ;Do it using writing per character
;~   $newField=true
;~  ; $strBlockNew=""
;~  for $j=1 to $bytesRead
;~  $myChar=binarymid($strBlock,$j,1)
;~  $newField=false
;~  if $myChar=";" Then
;~  $newField=True
;~  filewrite($fileout,$myChar)
;~  EndIf
;~  if (($myChar=" ") or ($myChar="0")) and $newField=true Then
;~ ; do nothing
;~  Else
;~  filewrite($fileout,$myChar)
;~  EndIf
;~  Next
 
 
;~ Do it with a structure
;~  $dataPos=1
;~   $newField=true
;~  $tData=dllstructcreate("char myChar[" & $bytesread & "]")
;~  for $j=1 to $bytesRead
;~  $myChar=dllstructgetdata($tData,1, $j)
;~  if $myChar=";" Then
;~  $newField=True
;~  EndIf
;~  if (($myChar=" ") or ($myChar="0")) and $newField=true Then
;~  do nothing
;~  Else
;~  dllstructsetdata($tdata,1,$myChar,$datapos)
;~  dllstructsetdata($tdata,1," ", $j) ;~'Make it a space and kill spaces when writing is needed
;~  EndIf
;~  Next
 
;~  update every 64 megabyte showing progress
if $iProcessed>($i*$updateEveryMB) Then
$i=$i+1
consolewrite((($iProcessed/1024)/1024) / (timerdiff($begin)/1000) & @CRLF)
endif
WEnd
 
fileclose($fileIn)
consolewrite("Average speed in MB/Sec :" & ((filegetsize($strPathInput & "\" & $strFileNameIn) / 1024)/1024) / (timerdiff($begin)/1000) & @CRLF)
consolewrite(((filegetsize($strPathInput & "\" & $strFileNameIn) / 1024)/1024) & " within " & timerdiff($begin)/1000 & " secondes" & @CRLF)
wend
EndFunc
 
func createBigFile()
$str="20120816;WP;OLOI;001;19000;EUR;1;1;000000000;000111;07;0123456789;00000000;CRC;EUR;689748;99118254;20;0103;000000;00;122;1111154105001;   ;00000000000019000;0987654321;CRC;689748;0103    ;20;0103;104;1112154105001;   ;00000000000019000;                                  ;                                  ;                                  ;                                  ;           ;           ;           ;           ;000000;          ; ;00000000;                                        ;    ;                                   ;   ; ;           ;00000000;   ; ;5;WP ;000000;000000;          ;          ; ;1;1;  ;  ;0;0; ;00;        ; ;20130731"
$iRows=ceiling((256*1024*1024)/stringlen($str))
$myFileO=fileopen("input\hugefile.txt",2)
for $i=1 to $iRows
filewriteline($myFileO,$str)
Next
fileclose($myFileO)
EndFunc
Edited by junkew

Share this post


Link to post
Share on other sites
Chimp

Hi junkew

to remove spaces and @cr etc you could use StringStripWS() with flag 8

and to remove leading 0 from numbers you can multiply any number * 1

an sketched example

#include <array.au3>
Local $text = StringStripWS("20130901;ABCDG      ;1;0;0000012345;1900        ;;;;;                       ;;                      ;               ;", 8)
$new = StringSplit($text, ";", 2)
For $i = 0 To UBound($new) - 1
    If StringIsInt($new[$i]) Then $new[$i] *= 1 ; if numbers are only integers or as below in other cases
    ; If StringIsInt($new[$i]) Or StringIsFloat($new[$i]) Then $new[$i] *= 1
Next
_ArrayDisplay($new)

EDIT:

$new[$i] = $new[$i] * 1 is much slower than $new[$i] *= 1

corrected in listing

Edited by PincoPanco

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Share this post


Link to post
Share on other sites
junkew

@PincoPanco: I am aware of different functions but the speed/slowness surprised me when working with larger sizes of files

Just plain reading about 300 megabytes

Just plain copy about 180-200 megabytes

With stringprocessing it immediately drops to below 20 megabytes per second

Updated 2nd post with the source code as I tried so far different scenarios.

With this UDF I came to speeds of about 30 megabytes per second (unfortunately it seems to be killing CR and LF)

'?do=embed' frameborder='0' data-embedContent>>

Will try to write it in C++ to see if this will speed up tremendously

Share this post


Link to post
Share on other sites
junkew

Solution now in C++ and speed is between 100-130 Megabyte per second so will stick to C++ for this.

Observation in AutoIT:  

1.reading files is really just as fast as in C++.

2. C++ beats the replace functions of AutoIT or otherwise said. Replacing in megabytes strings is not fast in AutoIT especially when replacing multiple characters.

Share this post


Link to post
Share on other sites
guinness

Like most languages, they each have their own purpose.


UDF List:

 
_AdapterConnections()_AlwaysRun()_AppMon()_AppMonEx()_ArrayFilter/_ArrayReduce_BinaryBin()_CheckMsgBox()_CmdLineRaw()_ContextMenu()_ConvertLHWebColor()/_ConvertSHWebColor()_DesktopDimensions()_DisplayPassword()_DotNet_Load()/_DotNet_Unload()_Fibonacci()_FileCompare()_FileCompareContents()_FileNameByHandle()_FilePrefix/SRE()_FindInFile()_GetBackgroundColor()/_SetBackgroundColor()_GetConrolID()_GetCtrlClass()_GetDirectoryFormat()_GetDriveMediaType()_GetFilename()/_GetFilenameExt()_GetHardwareID()_GetIP()_GetIP_Country()_GetOSLanguage()_GetSavedSource()_GetStringSize()_GetSystemPaths()_GetURLImage()_GIFImage()_GoogleWeather()_GUICtrlCreateGroup()_GUICtrlListBox_CreateArray()_GUICtrlListView_CreateArray()_GUICtrlListView_SaveCSV()_GUICtrlListView_SaveHTML()_GUICtrlListView_SaveTxt()_GUICtrlListView_SaveXML()_GUICtrlMenu_Recent()_GUICtrlMenu_SetItemImage()_GUICtrlTreeView_CreateArray()_GUIDisable()_GUIImageList_SetIconFromHandle()_GUIRegisterMsg()_GUISetIcon()_Icon_Clear()/_Icon_Set()_IdleTime()_InetGet()_InetGetGUI()_InetGetProgress()_IPDetails()_IsFileOlder()_IsGUID()_IsHex()_IsPalindrome()_IsRegKey()_IsStringRegExp()_IsSystemDrive()_IsUPX()_IsValidType()_IsWebColor()_Language()_Log()_MicrosoftInternetConnectivity()_MSDNDataType()_PathFull/GetRelative/Split()_PathSplitEx()_PrintFromArray()_ProgressSetMarquee()_ReDim()_RockPaperScissors()/_RockPaperScissorsLizardSpock()_ScrollingCredits_SelfDelete()_SelfRename()_SelfUpdate()_SendTo()_ShellAll()_ShellFile()_ShellFolder()_SingletonHWID()_SingletonPID()_Startup()_StringCompact()_StringIsValid()_StringRegExpMetaCharacters()_StringReplaceWholeWord()_StringStripChars()_Temperature()_TrialPeriod()_UKToUSDate()/_USToUKDate()_WinAPI_Create_CTL_CODE()_WinAPI_CreateGUID()_WMIDateStringToDate()/_DateToWMIDateString()Au3 script parsingAutoIt SearchAutoIt3 PortableAutoIt3WrapperToPragmaAutoItWinGetTitle()/AutoItWinSetTitle()CodingDirToHTML5FileInstallrFileReadLastChars()GeoIP databaseGUI - Only Close ButtonGUI ExamplesGUICtrlDeleteImage()GUICtrlGetBkColor()GUICtrlGetStyle()GUIEventsGUIGetBkColor()Int_Parse() & Int_TryParse()IsISBN()LockFile()Mapping CtrlIDsOOP in AutoItParseHeadersToSciTE()PasswordValidPasteBinPosts Per DayPreExpandProtect GlobalsQueue()Resource UpdateResourcesExSciTE JumpSettings INISHELLHOOKShunting-YardSignature CreatorStack()Stopwatch()StringAddLF()/StringStripLF()StringEOLToCRLF()VSCROLLWM_COPYDATAMore Examples...

Updated: 22/04/2018

Share this post


Link to post
Share on other sites
junkew

@Guinnes: I expected stringStripWS and stringRegExpReplace to be much faster in AutoIt especially when you replace with an string of length 0 or 1 (as then no frequent relocation of memory needed).

I missed a binaryreplace function in AutoIT

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

×