fsquirrelgpr Posted October 14, 2005

Here is code I use for getting values out of a .CSV file (I am basing it on the .csv files made by OpenOffice.org's Calc). It works but is rather slow (processing thousands of entries can take some time). I would appreciate any advice on making it work faster. I am also trying to use the UDF naming convention, so please point out any errors in that regard as well.

Func ParseCSV($sCommaSeparatedValues, $iResultComma)
    Local $iCurrentComma = 0
    Local $iLeft = 0
    Local $sResult
    Local $iRight = StringLen($sCommaSeparatedValues)
    Local $iPosition
    Local $cSeperator = ","
    Local $cQuote = '"'
    Local $iWithinQuotes = 0

    ; This section finds the left bound that we actually want
    $iPosition = 1
    While $iCurrentComma < $iResultComma
        If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cQuote Then
            If $iWithinQuotes = 0 Then
                $iWithinQuotes = 1
            Else
                $iWithinQuotes = 0
            EndIf
        EndIf
        If $iWithinQuotes = 0 Then
            If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cSeperator Then
                $iCurrentComma = $iCurrentComma + 1
            EndIf
        EndIf
        $iPosition = $iPosition + 1
        If $iPosition > $iRight Then
            Return "NULL"
        EndIf
    WEnd

    ; This section gets the result
    $sResult = ""
    $iWithinQuotes = 0
    While $iPosition < $iRight + 1
        If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cQuote Then
            $iPosition = $iPosition + 1
            If $iWithinQuotes = 0 Then
                $iWithinQuotes = 1
            Else
                $iWithinQuotes = 0
            EndIf
        EndIf
        If $iWithinQuotes = 0 Then
            If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cSeperator Then
                $iPosition = $iRight + 1
            Else
                $sResult = $sResult & StringMid($sCommaSeparatedValues, $iPosition, 1)
                $iPosition = $iPosition + 1
            EndIf
        Else
            $sResult = $sResult & StringMid($sCommaSeparatedValues, $iPosition, 1)
            $iPosition = $iPosition + 1
        EndIf
    WEnd
    $sResult = StringStripWS($sResult, 3)
    Return $sResult
EndFunc
HardCopy Posted October 15, 2005 (edited)

Hey there. These are just my 2 cents' worth. Looking at your code, you are doing a fair bit of string manipulation. Firstly, when working with CSV files I would recommend exporting with a TAB delimiter, unless of course your data contains tabs, which is pretty unlikely. Secondly, I would use the StringSplit command (look in the help file). Quick example:

$Line = StringSplit($Record, @TAB)

$Line is now an array with each field as an element. To parse the lines, extract the attributes/elements you need from the array and save them to another file. Try that, which will also require less code. Post back if you need further help; I'm sure you will get additional advice from others.

HardCopy
Edited October 15, 2005 by HardCopy
flyingboz Posted October 15, 2005

Quote: ...stuff about what delimiter to use...

Use a delimiter that works for your data. Ideally it is a single byte that you can guarantee will never, ever, be used as anything other than a delimiter, or you can generate fixed-field-length inputs and use StringMid().

Re: parsing files quickly:
1) If feasible, read the file into memory in one go. If the file consumes too much memory to read in easily, at least make sure you read lines via a file handle rather than by filename.
2) When possible, parse by records or fields rather than by character. You've got a high-level language with great string capability, so don't limit yourself to COBOL techniques.
3) StringSplit by line into records.
4) For each record, StringSplit by delimiter into fields, then perform whatever operations/manipulations you require.
5) Avoid string concatenation in large loops; this forces the variable's size to be constantly redimmed internally (though AutoIt has improved this since v2).
6) Avoid the _Array* UDFs unless you know how they work and are confident that they are efficient for your purpose.
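The read-once, split-by-record-then-by-field pattern in the steps above can be sketched in Python (not AutoIt); the sample data, function name, and tab delimiter here are invented purely for illustration:

```python
def parse_delimited(text, delimiter="\t"):
    # Split the whole in-memory string into records (lines),
    # then split each record into fields on the delimiter,
    # instead of re-scanning the string character by character.
    records = []
    for line in text.splitlines():
        if line:  # skip blank lines
            records.append(line.split(delimiter))
    return records

sample = "TOTAL_FIELD:\t16.35\nDUE_DATE_FIELD:\t08/6/2004"
print(parse_delimited(sample))
# [['TOTAL_FIELD:', '16.35'], ['DUE_DATE_FIELD:', '08/6/2004']]
```

The point is that the file is scanned exactly once; every later lookup is a cheap array index rather than a fresh walk over the string.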
bluebearr Posted October 16, 2005

I've been trying to wrap my brain around what you are doing, but without a lot of success. I know CSV is tricky to parse and has to be done character by character, because you have to know the history of what came before, though I found an example that uses RegEx (see this page: http://www.xbeat.net/vbspeed/c_ParseCSV.php).

Are you calling your function repeatedly to get each value? Something like:

Dim $ResultsArray[1]
$ResultsArray[0] = 0
$iResultComma = 0
$sCommaSeparatedValues = FileRead("CSVFile", FileGetSize("CSVFile"))
While $ResultsArray[$iResultComma] <> "NULL"
    $iResultComma = $iResultComma + 1
    ReDim $ResultsArray[$iResultComma + 1]
    $ResultsArray[$iResultComma] = ParseCSV($sCommaSeparatedValues, $iResultComma)
WEnd

If so, it looks like you are starting from the beginning of the string each time you grab a new value. Thus, if you are reading a string with 1000 records, it's (roughly) like reading half the string length 500 times to finally parse it. To speed it up, try to parse the string while reading it only once.

Also, my understanding is that parsing a CSV file should return a two-dimensional array, with a carriage return at the end of a value signifying a new record, though if you are inside quotes at the time it is considered part of the value. Maybe I missed it, but I don't see how your function handles this situation.
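The quote-aware rule described above (remember whether you are inside quotes while scanning) is exactly what a real CSV parser applies. A sketch in Python (not AutoIt) using its standard csv module shows why a plain comma split is not enough; the embedded-comma amount "1,635.00" is invented here just to show the failure mode:

```python
import csv
import io

# A quoted field may contain the delimiter itself: a plain split
# on "," would break "1,635.00" into two fields, but a quote-aware
# parser keeps it as one.
line = '"TAX_CHARGE:","CCF (GAS)","1,635.00","_GAS_"'
fields = next(csv.reader(io.StringIO(line)))
print(fields)
# ['TAX_CHARGE:', 'CCF (GAS)', '1,635.00', '_GAS_']
```

If none of your fields can ever contain the delimiter, the naive split is safe and faster; otherwise something quote-aware like this is needed.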
fsquirrelgpr Posted October 17, 2005 (Author)

Thanks for the advice. I definitely see a couple of ways now that I should be able to speed up what I am doing significantly. To clarify what I had been doing, here is an example of an EDI-style file I have created (just a little piece):

"DUE_DATE_FIELD:","08/6/2004"
"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"
"TOTAL_FIELD:","16.35"
"METER_READING_LINE:","CCF (GAS)","70330","0","0","1","_GAS_"
"ENTRY_DATE:","NULL"

This is actually just a format I created to standardize data from a format that is more difficult and complex to read. So I would set up a loop like:

$aFile = _FileReadToArray...
For $i = 1 To $aFile[0]
    $sEDICode = ParseCSV($aFile[$i], 0)
    If $sEDICode = "Tax_Charge" Then
        ; do whatever
        $total = $total + ParseCSV($aFile[$i], 3)
    EndIf
    If $sEDICode = "Invoice_End" Then
        ; do something else...
    EndIf
Next

and so on. Hope that explains it. I will try to implement your suggestions to speed things up (currently it takes about 15-25 minutes to process and rewrite a 20,000-line file using this method). Thanks again for the help.
SumTingWong Posted October 17, 2005 (edited)

As flyingboz already mentioned, I think you will get better performance splitting your CSV line into an array in one go, rather than extracting the values one by one. Example code:

#include <Array.au3>

Dim $aLineValues
$aLineValues = _SplitCSVLine('"DUE_DATE_FIELD:","08/6/2004"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"TOTAL_FIELD:","16.35"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"METER_READING_LINE:","CCF (GAS)","70330","0","0","1","_GAS_"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"ENTRY_DATE:","NULL"')
_ArrayDisplay($aLineValues, "Values")

; Note: a plain comma split assumes no quoted field contains an embedded comma.
Func _SplitCSVLine($sLine)
    Local $aValues
    Local $n
    $aValues = StringSplit($sLine, ",")
    For $n = 1 To $aValues[0]
        $aValues[$n] = StringReplace($aValues[$n], '"', "")
    Next
    Return $aValues
EndFunc

Edited October 17, 2005 by SumTingWong
seandisanti Posted October 17, 2005 (edited)

fsquirrelgpr said: Here is code I use for getting values out of a .CSV file... I would appreciate any advice as to making it work faster.

Here's one that I use to "commanate" (don't bother checking dictionaries, I think I just made that word up) files at work. The print spools I'm working with don't give me any nice tab separation or anything; they just output with variable numbers of spaces filling the columns. I know that doesn't make any sense, but that's just the way it is. This one could probably be optimized too, but I'm just putting it up here in case it helps. I'll clock it and yours and post the results.

FileDelete("C:\output.txt")
$input = FileOpen("c:\sourceo.txt", 0)
$output = FileOpen("c:\output.txt", 1)
$counter = 0
While 1
    $line = FileReadLine($input)
    $newline = ""
    If @error = -1 Then ExitLoop
    For $x = 1 To StringLen($line)
        $achar = StringMid($line, $x, 1)
        If $achar = " " Then
            If StringRight($newline, 1) = "," Then
                ContinueLoop
            Else
                $newline = $newline & ","
            EndIf
        Else
            $newline = $newline & $achar
        EndIf
    Next
    FileWriteLine($output, $newline)
    $counter = $counter + 1
    ToolTip($counter)
WEnd
FileClose($input)
FileClose($output)

***edit*** Scratch that, not going to clock them. I just gave yours a closer look and I'm confident that this one will shave some time for you as-is. If you want to speed it up some, you could definitely take out the counter. Going to play with it for a minute and try to really speed it up.

Edited October 17, 2005 by cameronsdad
seandisanti Posted October 17, 2005

Here we go...
Way, way, way faster:

FileDelete("C:\output.txt")
Dim $contents
$blah = FileRead("C:\sourceo.txt", FileGetSize("C:\sourceo.txt"))
$output = FileOpen("C:\output.txt", 1)
$contents = StringSplit($blah, @CRLF)
For $z = 1 To $contents[0]
    $line = $contents[$z]
    $newline = ""
    For $x = 1 To StringLen($line)
        $achar = StringMid($line, $x, 1)
        If $achar = " " Then
            If StringRight($newline, 1) = "," Then
                ContinueLoop
            Else
                $newline = $newline & ","
            EndIf
        Else
            $newline = $newline & $achar
        EndIf
    Next
    If StringLen($newline) Then FileWriteLine($output, $newline)
Next
FileClose($output)

I left a check in there to omit zero-length lines: StringSplit with @CRLF splits on each of the two characters separately, so empty entries show up between lines.
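The inner loop above collapses every run of spaces into a single comma. The same transformation can be written as one regex substitution; here is a sketch in Python (not AutoIt, and the function name and sample line are invented) just to show the equivalence:

```python
import re

def commanate(line):
    # Replace each run of one or more spaces with a single comma,
    # matching what the character-by-character loop produces.
    return re.sub(r" +", ",", line)

print(commanate("METER_READING   70330   0   1"))
# METER_READING,70330,0,1
```

Like the loop, this cannot distinguish padding spaces from spaces that belong inside a field value, so it only works when no field contains a space.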