fsquirrelgpr Posted October 14, 2005

Here is code I use for getting values out of a .CSV file (I am basing it on the .csv files made by OpenOffice.org's Calc). It works but is rather slow (processing thousands of entries can take some time). I would appreciate any advice on making it work faster. I am also trying to use the UDF naming convention, so please point out any errors in that regard as well.

Func ParseCSV($sCommaSeparatedValues, $iResultComma)
    Local $iCurrentComma = 0
    Local $iLeft = 0
    Local $sResult
    Local $iRight = StringLen($sCommaSeparatedValues)
    Local $iPosition
    Local $cSeperator = ","
    Local $cQuote = '"'
    Local $iWithinQuotes = 0

    ; This section finds the left bound that we actually want
    $iPosition = 1
    While $iCurrentComma < $iResultComma
        If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cQuote Then
            If $iWithinQuotes = 0 Then
                $iWithinQuotes = 1
            Else
                $iWithinQuotes = 0
            EndIf
        EndIf
        If $iWithinQuotes = 0 Then
            If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cSeperator Then
                $iCurrentComma = $iCurrentComma + 1
            EndIf
        EndIf
        $iPosition = $iPosition + 1
        If $iPosition > $iRight Then
            Return "NULL"
        EndIf
    WEnd

    ; This section gets the result
    $sResult = ""
    $iWithinQuotes = 0
    While $iPosition < $iRight + 1
        If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cQuote Then
            $iPosition = $iPosition + 1
            If $iWithinQuotes = 0 Then
                $iWithinQuotes = 1
            Else
                $iWithinQuotes = 0
            EndIf
        EndIf
        If $iWithinQuotes = 0 Then
            If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cSeperator Then
                $iPosition = $iRight + 1
            Else
                $sResult = $sResult & StringMid($sCommaSeparatedValues, $iPosition, 1)
                $iPosition = $iPosition + 1
            EndIf
        Else
            $sResult = $sResult & StringMid($sCommaSeparatedValues, $iPosition, 1)
            $iPosition = $iPosition + 1
        EndIf
    WEnd
    $sResult = StringStripWS($sResult, 3)
    Return $sResult
EndFunc
HardCopy Posted October 15, 2005 (edited)

Hey there. These are just my 2 cents' worth. Looking at your code, you are doing a fair bit of string manipulation. Firstly, when working with CSV files I would recommend exporting with a TAB delimiter, unless of course your data contains tabs, which is pretty unlikely. Secondly, I would use the StringSplit command (look in the help file). Quick example:

$Line = StringSplit($Record, @TAB)

$Line is now an array with each field as an element. To parse the lines, extract the attributes/elements you need from the array and save them to another file. Try that, which will also require less code. Post back if you need further help; I'm sure you will get additional advice from others.

HardCopy
Edited October 15, 2005 by HardCopy
flyingboz Posted October 15, 2005

Quote: ...stuff about what delimiter to use...

Use a delimiter that works for your data. Ideally it is a single byte that you can guarantee will never, ever, be used as anything other than a delimiter, or you can generate fixed-field-length inputs and use StringMid().

Re: parsing files quickly:
1) If feasible, read the file into memory in one go. If the file consumes too much memory to read in easily, at least make sure you read lines via a file handle rather than by filename.
2) When possible, parse by records or fields rather than by character. You've got a high-level language with great string capability, so don't limit yourself to COBOL techniques.
3) StringSplit by line into records.
4) For each record, StringSplit by delimiter into fields, then perform whatever operations/manipulations you require.
5) Avoid string concatenation in large loops; this forces the variable's size to be constantly redimmed internally (though AutoIt has improved this since v2).
6) Avoid the _Array* UDFs unless you know how they work and are confident that they are efficient for your purpose.
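The read-once, split-by-record-then-by-field pattern in the steps above can be sketched in Python (not AutoIt); the sample data, function name, and tab delimiter here are invented purely for illustration:

```python
def parse_delimited(text, delimiter="\t"):
    # Split the whole in-memory string into records (lines),
    # then split each record into fields on the delimiter,
    # instead of re-scanning the string character by character.
    records = []
    for line in text.splitlines():
        if line:  # skip blank lines
            records.append(line.split(delimiter))
    return records

sample = "TOTAL_FIELD:\t16.35\nDUE_DATE_FIELD:\t08/6/2004"
print(parse_delimited(sample))
# [['TOTAL_FIELD:', '16.35'], ['DUE_DATE_FIELD:', '08/6/2004']]
```

The point is that the file is scanned exactly once; every later lookup is a cheap array index rather than a fresh walk over the string.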
bluebearr Posted October 16, 2005

I've been trying to wrap my brain around what you are doing, but without a lot of success. I know CSV is tricky to parse and has to be done character by character, because you have to know the history of what came before, though I found an example that uses RegEx (see this page: http://www.xbeat.net/vbspeed/c_ParseCSV.php).

Are you calling your function repeatedly to get each value? Something like:

Dim $ResultsArray[1]
$ResultsArray[0] = 0
$iResultComma = 0
$sCommaSeparatedValues = FileRead("CSVFile", FileGetSize("CSVFile"))
While $ResultsArray[$iResultComma] <> "NULL"
    $iResultComma = $iResultComma + 1
    ReDim $ResultsArray[$iResultComma + 1]
    $ResultsArray[$iResultComma] = ParseCSV($sCommaSeparatedValues, $iResultComma)
WEnd

If so, it looks like you are starting from the beginning of the string each time you grab a new value. Thus, if you are reading a string with 1000 records, it's (roughly) like reading half the string length 500 times to finally parse it. To speed it up, try to parse the string while reading it only once.

Also, my understanding is that parsing a CSV file should return a two-dimensional array, with a carriage return at the end of a value signifying a new record, though if you are inside quotes at the time it is considered part of the value. Maybe I missed it, but I don't see how your function handles this situation.
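The quote-aware rule described above (remember whether you are inside quotes while scanning) is exactly what a real CSV parser applies. A sketch in Python (not AutoIt) using its standard csv module shows why a plain comma split is not enough; the embedded-comma amount "1,635.00" is invented here just to show the failure mode:

```python
import csv
import io

# A quoted field may contain the delimiter itself: a plain split
# on "," would break "1,635.00" into two fields, but a quote-aware
# parser keeps it as one.
line = '"TAX_CHARGE:","CCF (GAS)","1,635.00","_GAS_"'
fields = next(csv.reader(io.StringIO(line)))
print(fields)
# ['TAX_CHARGE:', 'CCF (GAS)', '1,635.00', '_GAS_']
```

If none of your fields can ever contain the delimiter, the naive split is safe and faster; otherwise something quote-aware like this is needed.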
fsquirrelgpr Posted October 17, 2005 (Author)

Thanks for the advice. I definitely see a couple of ways now that I should be able to speed up what I am doing significantly. To clarify what I had been doing, here is an example of an EDI-style file I have created (just a little piece):

"DUE_DATE_FIELD:","08/6/2004"
"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"
"TOTAL_FIELD:","16.35"
"METER_READING_LINE:","CCF (GAS)","70330","0","0","1","_GAS_"
"ENTRY_DATE:","NULL"

This is actually just a format I created to standardize data from a format that is more difficult and complex to read. So I would set up a loop like:

$aFile = _FileReadToArray...
For $i = 1 To $aFile[0]
    $sEDICode = ParseCSV($aFile[$i], 0)
    If $sEDICode = "Tax_Charge" Then
        ; do whatever
        $total = $total + ParseCSV($aFile[$i], 3)
    EndIf
    If $sEDICode = "Invoice_End" Then
        ; do something else...
    EndIf
Next

and so on. Hope that explains it. I will try to implement your suggestions to speed things up (currently it takes about 15-25 minutes to process and rewrite a 20,000-line file using this method). Thanks again for the help.
SumTingWong Posted October 17, 2005 (edited)

As flyingboz already mentioned, I think you will get better performance splitting your CSV line into an array in one go, rather than extracting the values one by one. Example code:

#include <Array.au3>

Dim $aLineValues
$aLineValues = _SplitCSVLine('"DUE_DATE_FIELD:","08/6/2004"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"TOTAL_FIELD:","16.35"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"METER_READING_LINE:","CCF (GAS)","70330","0","0","1","_GAS_"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"ENTRY_DATE:","NULL"')
_ArrayDisplay($aLineValues, "Values")

; Note: a plain comma split assumes no quoted field contains an embedded comma.
Func _SplitCSVLine($sLine)
    Local $aValues
    Local $n
    $aValues = StringSplit($sLine, ",")
    For $n = 1 To $aValues[0]
        $aValues[$n] = StringReplace($aValues[$n], '"', "")
    Next
    Return $aValues
EndFunc

Edited October 17, 2005 by SumTingWong
seandisanti Posted October 17, 2005 (edited)

fsquirrelgpr said: Here is code I use for getting values out of a .CSV file... I would appreciate any advice as to making it work faster.

Here's one that I use to "commanate" (don't bother checking dictionaries, I think I just made that word up) files at work. The print spools I'm working with don't give me any nice tab separation or anything; they just output with variable numbers of spaces filling the columns. I know that doesn't make any sense, but that's just the way it is. This one could probably be optimized too, but I'm just putting it up here in case it helps. I'll clock it and yours and post the results.

FileDelete("C:\output.txt")
$input = FileOpen("c:\sourceo.txt", 0)
$output = FileOpen("c:\output.txt", 1)
$counter = 0
While 1
    $line = FileReadLine($input)
    $newline = ""
    If @error = -1 Then ExitLoop
    For $x = 1 To StringLen($line)
        $achar = StringMid($line, $x, 1)
        If $achar = " " Then
            If StringRight($newline, 1) = "," Then
                ContinueLoop
            Else
                $newline = $newline & ","
            EndIf
        Else
            $newline = $newline & $achar
        EndIf
    Next
    FileWriteLine($output, $newline)
    $counter = $counter + 1
    ToolTip($counter)
WEnd
FileClose($input)
FileClose($output)

***edit*** Scratch that, not going to clock them. I just gave yours a closer look and I'm confident that this one will shave some time for you as-is. If you want to speed it up some, you could definitely take out the counter. Going to play with it for a minute and try to really speed it up.

Edited October 17, 2005 by cameronsdad
seandisanti Posted October 17, 2005

Here we go...
Way, way, way faster:

FileDelete("C:\output.txt")
Dim $contents
$blah = FileRead("C:\sourceo.txt", FileGetSize("C:\sourceo.txt"))
$output = FileOpen("C:\output.txt", 1)
$contents = StringSplit($blah, @CRLF)
For $z = 1 To $contents[0]
    $line = $contents[$z]
    $newline = ""
    For $x = 1 To StringLen($line)
        $achar = StringMid($line, $x, 1)
        If $achar = " " Then
            If StringRight($newline, 1) = "," Then
                ContinueLoop
            Else
                $newline = $newline & ","
            EndIf
        Else
            $newline = $newline & $achar
        EndIf
    Next
    If StringLen($newline) Then FileWriteLine($output, $newline)
Next
FileClose($output)

I left a check in there to omit zero-length lines: StringSplit with @CRLF splits on each of the two characters separately, so empty entries show up between lines.
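The inner loop above collapses every run of spaces into a single comma. The same transformation can be written as one regex substitution; here is a sketch in Python (not AutoIt, and the function name and sample line are invented) just to show the equivalence:

```python
import re

def commanate(line):
    # Replace each run of one or more spaces with a single comma,
    # matching what the character-by-character loop produces.
    return re.sub(r" +", ",", line)

print(commanate("METER_READING   70330   0   1"))
# METER_READING,70330,0,1
```

Like the loop, this cannot distinguish padding spaces from spaces that belong inside a field value, so it only works when no field contains a space.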