junkew Posted September 7, 2013 Share Posted September 7, 2013 (edited) Is there any specific advice to be given on processing multiple gigabyte files ranging in size from 1 gigabyte to 15 gigabyte with autoit (for the moment i do not want to switch to perl, awk, sed etc) Within the files I need to replace the spaces and leading zero's with the empty string csv line is split by ; Whats more efficient 1. reading line by line 2. reading chunks with fileread using stringregexpreplace line by line or multiple megabytes at once I have the basics reading the file in blocks of 60 megabytes but as soon as I start using regexpreplace on this i get a processing speed of about 10 megabyte a second (whereas just plain reading has a speed of about 80-100 megabyte a second) edit fileopen with parameter 16 (read bytes) speeds reading to 300 megabytes a second (factor 4-5 compared to mode 1) this seems to be helping to speed up replace in binary strings '?do=embed' frameborder='0' data-embedContent>> Just to give full background on the problem to tackle Every month they get about 11 files varying in size from 1 to 25 gigabytes in total about 50-60 gigabyte a month with detailled data where they want to create managementinformation from. They load this into a SQL database to create summary per day, week, month, quarter, year in many pivot ways a lot of SELECT GROUP BY stuff I have people given the feedback that probably directly processing the text files is much faster (and cheaper as no expense hardware for SQL database is needed) So far I have proven to be right (we process stuff now in several hours having the answer by calculating directly insetad of processing days in an SQL database So something like a.Read the files process each file process each line split each line in their columns Determine the rowkey: rowkey=column15.value & ";" & column1.value & & ";" & column3.value dictionary(rowkey)=dictionary(rowkey)+1 b. Write the dictionary to file c. Open the output in Excel and make the relevant pivottable Edited September 7, 2013 by junkew FAQ 31 How to click some elements, FAQ 40 Test automation with AutoIt, Multithreading CLR .NET Powershell CMDLets Link to comment Share on other sites More sharing options...
Gianni Posted September 7, 2013 Share Posted September 7, 2013 Hi junkew perhaps StringReplace is faster than stringregexpreplace(?)I just did a> test on a> similar case Chimp small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt.... Link to comment Share on other sites More sharing options...
junkew Posted September 7, 2013 Author Share Posted September 7, 2013 could be but simple stringreplace will not do the full trick as filelines looks like (about 300 columns) 20130901;ABCD ;1;0;0000012345;1900 ;;;;; ;; ; ; output should become 20130901;ABCD;1;0;12345;1900;;;;;;;;; the ;0; should stay ;0; the 1900 should stay 1900 and within the file I get also sometimes terrible values in a column containing linefeeds (so I have to repair / remove that to construct a proper line) FAQ 31 How to click some elements, FAQ 40 Test automation with AutoIt, Multithreading CLR .NET Powershell CMDLets Link to comment Share on other sites More sharing options...
junkew Posted September 7, 2013 Author Share Posted September 7, 2013 (edited) Code so far to measure some time on this fileopen with value 16 makes a difference of about a factor 5 in reading speed (plain ascii data reading it in binary)As soon as I put a stringreplace space with nothing in speed drops by a factor 100 so logic to remove leading 0 and some other stuff is hitting me hard in leadtime Any suggestions are welcome to do this in AutoIT or just the advice to switch to c/c++ (I have a solution written in vbscript but for about 54 gigabytes of data it gives me a 2-3 hour processing time) edit changed example added function createbigfile to demonstrate use 2 drives / change the paths to get it working local $strPathInput="C:javatestinput" local $strPathOutput="j:javatestoutput" expandcollapse popup#AutoIt3Wrapper_UseX64=Y #include "binary.au3" ;~ const $tblocksize=1024 ;~ const $tblocksize=2048 ;~ const $tblocksize=4096 ;~ const $tblocksize=8192 ;~ const $tblocksize=16384 const $tblocksize=24576 ;~seems to be most efficient blocksize ;~ const $tblocksize=32768 ;~ const $tblocksize=67108864 ;~ const $tblocksize=1 ;~ This is a terrible blocksize ;~ to show progress every so many megabytes const $updateEveryMB=67108864 ;~ (64*1024*1024) ;~ const $updateEveryMB=33554432 ;~ (32*1024*1024) ;~ Play with these parameters and see the speed difference on huge files const $blockmode=true ;~ otherwise use readline (4-5 times slower) const $onlyCopy=false ;~ Do only a plain copy to have a baseline without string processing ;~ createBigFile() stripCharacters() func stripCharacters() Local $begin ;~ local $strPathInput=@ScriptDir local $strPathInput="C:\javatest\input" ;~ local $strPathInput=@ScriptDir local $strPathOutput="j:\javatest\output" ;~ local $strPathOutput="R:" ;~ Local $FileSearch = FileFindFirstFile($strPathInput & "\" & "*.*") Local $FileSearch = FileFindFirstFile($strPathInput & "\" & "*.*") local $fileIn local $fileOut local $iProcessed local $tData local $j local $i ;~ $tData=dllstructcreate("char myChar[" & $tBlockSize & "]") ; Check if the search was successful If $FileSearch = -1 Then MsgBox(0, "Error", "No files/directories matched the search pattern") Exit EndIf ;~ Process all files while 1 $strFileNameIn=FileFindNextFile($FileSearch) If @error Then ExitLoop consolewrite($strFileNameIn & @CRLF) $fileIn = FileOpen($strPathInput & "\" & $strFileNameIn, 16) ;~Only 8 bits characters so process binary ;~ $fileOut = fileopen($strPathOutput & "\" & $strFileNameIn,16 + 2) $fileOut = fileopen($strPathOutput & "\" & $strFileNameIn,16+2) ;~ If binary written per byte $begin = TimerInit() $iProcessed=0 $i=1 ;~ While there is data to process while 1 if $blockMode=true Then $strBlock=fileread($fileIn, $tBlocksize) If @error = -1 Then ExitLoop if @Extended=0 then ExitLoop $bytesRead=@Extended $iProcessed=$iProcessed+$bytesRead ;~ +$tBlockSize ;~ $strBlockNew=_binaryreplace($strBlock," ","") ;~ $strBlockNew=stringregexpreplace($strBlock,"(;[^a-zA-Z0-9\r\n;]*)|( *;)",";") ;~ $strBlock=binarytostring($strBlock,1) ;~ $strBlock=stringreplace(binarytostring($strBlock,1),chr(255),"") if $onlyCopy=true Then ;~ ;do nothing Else $strBlock=stringreplace($strBlock,chr(255),"") $strBlock=stringregexpreplace(binarytostring($strBlock,1),"( +;)|(; +)|(;00+)",";") ;~ $strBlock=stringregexpreplace(binarytostring($strBlock,1),";00+",";") endIf filewrite($fileOut,$strBlock) Else $strBlock=filereadline($fileIn) If @error = -1 Then ExitLoop $iProcessed=$iProcessed+stringlen($strblock) $strBlockNew=stringstripws($strBlock,8) $strBlockNew=stringregexpreplace($strBlockNew,";00+",";") filewrite($fileOut,$strBlockNew) EndIf ;~ $strBlockNew=stringregexpreplace($strBlock,"(;[^a-zA-Z0-9\r\n;]*)|( *;)",";") ;~ $strBlockNew=stringregexpreplace($strBlock," *","") ;~ $strBlockNew=stringreplace(binarytostring($strBlock,1)," ","") ;~ $strBlockNew=stringreplace($strBlock," ","") ;~ filewrite($fileOut,$strBlock) ;~ $strBlockNew=stringregexpreplace(binarytostring($strBlock,1),"(; +)|(;00+)",";") ;~ Do it using new string ;~ $newField=true ;~ $strBlockNew="" ;~ for $j=1 to $bytesRead ;~ $myChar=binarymid($strBlock,$j,1) ;~ $newField=false ;~ if $myChar=";" Then ;~ $newField=True ;~ EndIf ;~ if (($myChar=" ") or ($myChar="0")) and $newField=true Then ;~ ; do nothing ;~ Else ;~ $strBlockNew=$strBlocknew & $myChar ;~ EndIf ;~ Next ;~ ;Do it using writing per character ;~ $newField=true ;~ ; $strBlockNew="" ;~ for $j=1 to $bytesRead ;~ $myChar=binarymid($strBlock,$j,1) ;~ $newField=false ;~ if $myChar=";" Then ;~ $newField=True ;~ filewrite($fileout,$myChar) ;~ EndIf ;~ if (($myChar=" ") or ($myChar="0")) and $newField=true Then ;~ ; do nothing ;~ Else ;~ filewrite($fileout,$myChar) ;~ EndIf ;~ Next ;~ Do it with a structure ;~ $dataPos=1 ;~ $newField=true ;~ $tData=dllstructcreate("char myChar[" & $bytesread & "]") ;~ for $j=1 to $bytesRead ;~ $myChar=dllstructgetdata($tData,1, $j) ;~ if $myChar=";" Then ;~ $newField=True ;~ EndIf ;~ if (($myChar=" ") or ($myChar="0")) and $newField=true Then ;~ do nothing ;~ Else ;~ dllstructsetdata($tdata,1,$myChar,$datapos) ;~ dllstructsetdata($tdata,1," ", $j) ;~'Make it a space and kill spaces when writing is needed ;~ EndIf ;~ Next ;~ update every 64 megabyte showing progress if $iProcessed>($i*$updateEveryMB) Then $i=$i+1 consolewrite((($iProcessed/1024)/1024) / (timerdiff($begin)/1000) & @CRLF) endif WEnd fileclose($fileIn) consolewrite("Average speed in MB/Sec :" & ((filegetsize($strPathInput & "\" & $strFileNameIn) / 1024)/1024) / (timerdiff($begin)/1000) & @CRLF) consolewrite(((filegetsize($strPathInput & "\" & $strFileNameIn) / 1024)/1024) & " within " & timerdiff($begin)/1000 & " secondes" & @CRLF) wend EndFunc func createBigFile() $str="20120816;WP;OLOI;001;19000;EUR;1;1;000000000;000111;07;0123456789;00000000;CRC;EUR;689748;99118254;20;0103;000000;00;122;1111154105001; ;00000000000019000;0987654321;CRC;689748;0103 ;20;0103;104;1112154105001; ;00000000000019000; ; ; ; ; ; ; ; ;000000; ; ;00000000; ; ; ; ; ; ;00000000; ; ;5;WP ;000000;000000; ; ; ;1;1; ; ;0;0; ;00; ; ;20130731" $iRows=ceiling((256*1024*1024)/stringlen($str)) $myFileO=fileopen("input\hugefile.txt",2) for $i=1 to $iRows filewriteline($myFileO,$str) Next fileclose($myFileO) EndFunc Edited September 7, 2013 by junkew FAQ 31 How to click some elements, FAQ 40 Test automation with AutoIt, Multithreading CLR .NET Powershell CMDLets Link to comment Share on other sites More sharing options...
Gianni Posted September 7, 2013 Share Posted September 7, 2013 (edited) Hi junkew to remove spaces and @cr etc you could use StringStripWS() with flag 8 and to remove leading 0 from numbers you can multiply any number * 1 an sketched example #include <array.au3> Local $text = StringStripWS("20130901;ABCDG ;1;0;0000012345;1900 ;;;;; ;; ; ;", 8) $new = StringSplit($text, ";", 2) For $i = 0 To UBound($new) - 1 If StringIsInt($new[$i]) Then $new[$i] *= 1 ; if numbers are only integers or as below in other cases ; If StringIsInt($new[$i]) Or StringIsFloat($new[$i]) Then $new[$i] *= 1 Next _ArrayDisplay($new) EDIT: $new[$i] = $new[$i] * 1 is much slower than $new[$i] *= 1 corrected in listing Edited September 7, 2013 by PincoPanco Chimp small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt.... Link to comment Share on other sites More sharing options...
junkew Posted September 8, 2013 Author Share Posted September 8, 2013 @PincoPanco: I am aware of different functions but the speed/slowness surprised me when working with larger sizes of filesJust plain reading about 300 megabytesJust plain copy about 180-200 megabytesWith stringprocessing it immediately drops to below 20 megabytes per secondUpdated 2nd post with the source code as I tried so far different scenarios.With this UDF I came to speeds of about 30 megabytes per second (unfortunately it seems to be killing CR and LF)'?do=embed' frameborder='0' data-embedContent>>Will try to write it in C++ to see if this will speed up tremendously FAQ 31 How to click some elements, FAQ 40 Test automation with AutoIt, Multithreading CLR .NET Powershell CMDLets Link to comment Share on other sites More sharing options...
Solution junkew Posted September 8, 2013 Author Solution Share Posted September 8, 2013 Solution now in C++ and speed is between 100-130 Megabyte per second so will stick to C++ for this. Observation in AutoIT: 1.reading files is really just as fast as in C++. 2. C++ beats the replace functions of AutoIT or otherwise said. Replacing in megabytes strings is not fast in AutoIT especially when replacing multiple characters. FAQ 31 How to click some elements, FAQ 40 Test automation with AutoIt, Multithreading CLR .NET Powershell CMDLets Link to comment Share on other sites More sharing options...
guinness Posted September 9, 2013 Share Posted September 9, 2013 Like most languages, they each have their own purpose. UDF List: _AdapterConnections() • _AlwaysRun() • _AppMon() • _AppMonEx() • _ArrayFilter/_ArrayReduce • _BinaryBin() • _CheckMsgBox() • _CmdLineRaw() • _ContextMenu() • _ConvertLHWebColor()/_ConvertSHWebColor() • _DesktopDimensions() • _DisplayPassword() • _DotNet_Load()/_DotNet_Unload() • _Fibonacci() • _FileCompare() • _FileCompareContents() • _FileNameByHandle() • _FilePrefix/SRE() • _FindInFile() • _GetBackgroundColor()/_SetBackgroundColor() • _GetConrolID() • _GetCtrlClass() • _GetDirectoryFormat() • _GetDriveMediaType() • _GetFilename()/_GetFilenameExt() • _GetHardwareID() • _GetIP() • _GetIP_Country() • _GetOSLanguage() • _GetSavedSource() • _GetStringSize() • _GetSystemPaths() • _GetURLImage() • _GIFImage() • _GoogleWeather() • _GUICtrlCreateGroup() • _GUICtrlListBox_CreateArray() • _GUICtrlListView_CreateArray() • _GUICtrlListView_SaveCSV() • _GUICtrlListView_SaveHTML() • _GUICtrlListView_SaveTxt() • _GUICtrlListView_SaveXML() • _GUICtrlMenu_Recent() • _GUICtrlMenu_SetItemImage() • _GUICtrlTreeView_CreateArray() • _GUIDisable() • _GUIImageList_SetIconFromHandle() • _GUIRegisterMsg() • _GUISetIcon() • _Icon_Clear()/_Icon_Set() • _IdleTime() • _InetGet() • _InetGetGUI() • _InetGetProgress() • _IPDetails() • _IsFileOlder() • _IsGUID() • _IsHex() • _IsPalindrome() • _IsRegKey() • _IsStringRegExp() • _IsSystemDrive() • _IsUPX() • _IsValidType() • _IsWebColor() • _Language() • _Log() • _MicrosoftInternetConnectivity() • _MSDNDataType() • _PathFull/GetRelative/Split() • _PathSplitEx() • _PrintFromArray() • _ProgressSetMarquee() • _ReDim() • _RockPaperScissors()/_RockPaperScissorsLizardSpock() • _ScrollingCredits • _SelfDelete() • _SelfRename() • _SelfUpdate() • _SendTo() • _ShellAll() • _ShellFile() • _ShellFolder() • _SingletonHWID() • _SingletonPID() • _Startup() • _StringCompact() • _StringIsValid() • _StringRegExpMetaCharacters() • _StringReplaceWholeWord() • _StringStripChars() • _Temperature() • _TrialPeriod() • _UKToUSDate()/_USToUKDate() • _WinAPI_Create_CTL_CODE() • _WinAPI_CreateGUID() • _WMIDateStringToDate()/_DateToWMIDateString() • Au3 script parsing • AutoIt Search • AutoIt3 Portable • AutoIt3WrapperToPragma • AutoItWinGetTitle()/AutoItWinSetTitle() • Coding • DirToHTML5 • FileInstallr • FileReadLastChars() • GeoIP database • GUI - Only Close Button • GUI Examples • GUICtrlDeleteImage() • GUICtrlGetBkColor() • GUICtrlGetStyle() • GUIEvents • GUIGetBkColor() • Int_Parse() & Int_TryParse() • IsISBN() • LockFile() • Mapping CtrlIDs • OOP in AutoIt • ParseHeadersToSciTE() • PasswordValid • PasteBin • Posts Per Day • PreExpand • Protect Globals • Queue() • Resource Update • ResourcesEx • SciTE Jump • Settings INI • SHELLHOOK • Shunting-Yard • Signature Creator • Stack() • Stopwatch() • StringAddLF()/StringStripLF() • StringEOLToCRLF() • VSCROLL • WM_COPYDATA • More Examples... Updated: 22/04/2018 Link to comment Share on other sites More sharing options...
junkew Posted September 9, 2013 Author Share Posted September 9, 2013 @Guinnes: I expected stringStripWS and stringRegExpReplace to be much faster in AutoIt especially when you replace with an string of length 0 or 1 (as then no frequent relocation of memory needed). I missed a binaryreplace function in AutoIT FAQ 31 How to click some elements, FAQ 40 Test automation with AutoIt, Multithreading CLR .NET Powershell CMDLets Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now