
Speeding up CSV File Processing



Here is code I use for getting values out of a .csv file (I am basing it on the .csv files made by OpenOffice.org's Calc).

It works, but it is rather slow (processing thousands of entries can take some time).

I would appreciate any advice on making it work faster. I am also trying to use the UDF naming convention, so please point out any errors in that regard as well.

Func ParseCSV($sCommaSeparatedValues, $iResultComma)
    Local $iCurrentComma = 0
    Local $sResult = ""
    Local $iRight = StringLen($sCommaSeparatedValues)
    Local $iPosition = 1
    Local $cSeparator = ","
    Local $cQuote = '"'
    Local $iWithinQuotes = 0

    ; This section finds the left bound of the field we actually want
    While $iCurrentComma < $iResultComma
        If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cQuote Then
            $iWithinQuotes = Not $iWithinQuotes ; toggle the in-quotes state
        EndIf
        If Not $iWithinQuotes And StringMid($sCommaSeparatedValues, $iPosition, 1) = $cSeparator Then
            $iCurrentComma = $iCurrentComma + 1
        EndIf
        $iPosition = $iPosition + 1
        If $iPosition > $iRight Then Return "NULL" ; ran off the end: no such field
    WEnd

    ; This section collects the result up to the next unquoted separator
    $iWithinQuotes = 0
    While $iPosition <= $iRight
        If StringMid($sCommaSeparatedValues, $iPosition, 1) = $cQuote Then
            $iPosition = $iPosition + 1 ; skip the quote itself
            $iWithinQuotes = Not $iWithinQuotes
        EndIf
        If Not $iWithinQuotes And StringMid($sCommaSeparatedValues, $iPosition, 1) = $cSeparator Then
            ExitLoop
        Else
            $sResult = $sResult & StringMid($sCommaSeparatedValues, $iPosition, 1)
            $iPosition = $iPosition + 1
        EndIf
    WEnd
    Return StringStripWS($sResult, 3) ; strip leading and trailing whitespace
EndFunc
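
If you want to measure where the time goes, here is a minimal timing sketch using TimerInit()/TimerDiff(), the standard AutoIt timing functions (the file name "test.csv" is just a placeholder):

#include <File.au3>

Dim $aLines
_FileReadToArray("test.csv", $aLines)

$hTimer = TimerInit()
For $i = 1 To $aLines[0]
    ParseCSV($aLines[$i], 0) ; fetch the first field of every line
Next
MsgBox(0, "Timing", "Elapsed: " & Round(TimerDiff($hTimer)) & " ms")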

Hey there,

These are just my 2 cents' worth: looking at your code, you are doing a fair bit of string manipulation.

First, when working with CSV files I would recommend exporting with a TAB delimiter, unless of course your data contains tabs, which is pretty unlikely.

Second, I would use the StringSplit command (see the help file).

Quick example: $Line = StringSplit($Record, @TAB)

$Line is now an array with each field as an element.

To parse the lines, extract the attributes/elements you need from the array and save them to another file; a sketch follows below.
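
A minimal sketch of that approach (the file names and field positions are placeholders):

$input = FileOpen("data.txt", 0)
$output = FileOpen("extract.txt", 1)
While 1
    $Record = FileReadLine($input)
    If @error = -1 Then ExitLoop
    $Line = StringSplit($Record, @TAB)
    ; e.g. keep the first and third field of each record
    If $Line[0] >= 3 Then FileWriteLine($output, $Line[1] & @TAB & $Line[3])
WEnd
FileClose($input)
FileClose($output)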

Try that; it will require less code as well.

Post back if you need further help.

I'm sure you will get additional advice from others.

HardCopy

Edited by HardCopy



...stuff about what delimiter to use...

Use a delimiter that works for your data. Ideally it is a single byte that you can guarantee will never, ever, be used as anything other than a delimiter; alternatively, you can generate fixed-field-length input and use StringMid().
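
A tiny illustration of the fixed-field-length idea (the column widths here are invented):

$sRecord = "Smith     01/02/04"
$sName = StringStripWS(StringMid($sRecord, 1, 10), 3) ; columns 1-10, padding stripped
$sDate = StringMid($sRecord, 11, 8)                   ; columns 11-18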

Re: parsing files quickly

1) If feasible, read the file into memory in one go. If the file consumes too much memory to read in easily, at least ensure that you use file handles rather than file names when reading lines.

2) Whenever possible, parse by records or fields rather than by character. You've got a high-level language with great string handling, so don't limit yourself to COBOL techniques.

3) StringSplit by line into records.

4) For each record, StringSplit by delimiter into fields, then perform whatever operations/manipulations you require (see the sketch after this list).

5) Avoid string concatenation in large loops; it forces the variable to be constantly re-allocated internally (though AutoIt3 has improved this since v2).

6) Avoid the _Array* UDFs unless you know how they work and are confident that they are efficient for your purpose.
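
A minimal sketch of points 1, 3, and 4 together (the file name is an assumption):

$sAll = FileRead("data.csv", FileGetSize("data.csv"))
$aRecords = StringSplit($sAll, @CRLF, 1) ; flag 1: split on the whole CRLF pair, not each character
For $i = 1 To $aRecords[0]
    If $aRecords[$i] = "" Then ContinueLoop ; skip blank lines
    $aFields = StringSplit($aRecords[$i], ",")
    ; ...work with $aFields[1] .. $aFields[$aFields[0]] here...
Next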

Reading the help file before you post... Not only will it make you look smarter, it will make you smarter.


I've been trying to wrap my brain around what you are doing, but not having a lot of success.

I know CSV is tricky to parse and has to be done character by character, because you have to know the history of what came before, though I found an example that uses RegEx (see this page: http://www.xbeat.net/vbspeed/c_ParseCSV.php).

Are you calling your function repeatedly to get each value? Something like:

Dim $ResultsArray[1]
$iResultComma = 0
$sCommaSeparatedValues = FileRead("CSVFile", FileGetSize("CSVFile"))
While 1
    $sValue = ParseCSV($sCommaSeparatedValues, $iResultComma)
    If $sValue = "NULL" Then ExitLoop
    ReDim $ResultsArray[$iResultComma + 1]
    $ResultsArray[$iResultComma] = $sValue
    $iResultComma = $iResultComma + 1
WEnd

If so, I think it looks like you are starting from the beginning of the string each time you grab a new value. Thus, if you are reading a string with 1000 records, it's (roughly) like reading the whole string 500 times before it is finally parsed. To speed it up, try to parse the string while reading it only once.

Also, my understanding is that parsing a CSV file should return a two-dimensional array, with a line break at the end of a value signifying a new record; a line break inside quotes, though, is considered part of the value. Maybe I missed it, but I don't see how your function handles this situation.
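
For what it's worth, here is a minimal single-pass sketch of that idea: walk the text once, toggle an in-quotes flag, and treat unquoted commas and line breaks as field and record boundaries. It just writes the pieces to the console to show the state machine; a real version would store them in a two-dimensional array (and handle doubled quotes inside fields).

Func _ScanCSV($sText)
    Local $bInQuotes = 0, $sField = "", $c
    For $i = 1 To StringLen($sText)
        $c = StringMid($sText, $i, 1)
        If $c = '"' Then
            $bInQuotes = Not $bInQuotes
        ElseIf $c = "," And Not $bInQuotes Then
            ConsoleWrite("Field: " & $sField & @LF)
            $sField = ""
        ElseIf $c = @LF And Not $bInQuotes Then
            ConsoleWrite("Field: " & $sField & @LF & "--- end of record ---" & @LF)
            $sField = ""
        ElseIf $c <> @CR Or $bInQuotes Then
            $sField = $sField & $c
        EndIf
    Next
    If $sField <> "" Then ConsoleWrite("Field: " & $sField & @LF)
EndFunc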

BlueBearr
Oddly enough, this is what I do for fun.

Thanks for the advice; I now see a couple of ways I should be able to speed up what I am doing significantly.

To clarify what I had been doing, here is an example (just a little piece) of an EDI-style file I have created:

"DUE_DATE_FIELD:","08/6/2004"

"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"

"TOTAL_FIELD:","16.35"

"METER_READING_LINE:","CCF (GAS)","70330","0","0","1","_GAS_"

"ENTRY_DATE:","NULL"

This is actually just a format I created to standardize data from a format that was more difficult and complex to read. So I would set up a loop like:

Dim $aFile
_FileReadToArray("...", $aFile)

For $i = 1 To $aFile[0]
    $sEDICode = ParseCSV($aFile[$i], 0)
    If $sEDICode = "Tax_Charge" Then ; do whatever, e.g.:
        $total = $total + ParseCSV($aFile[$i], 3)
    EndIf
    If $sEDICode = "Invoice_End" Then
        ; do something else...
    EndIf
Next

and so on. Hope that explains it. I will try to implement your suggestions to speed things up (currently it takes about 15-25 minutes for me to process and rewrite a 20,000-line file using this method).

Thanks again for the help.


As flyingboz already mentioned, I think you will get better performance by splitting your CSV line into an array in one go, rather than extracting the values one by one.

Example code:

#include <Array.au3>

Dim $aLineValues

$aLineValues = _SplitCSVLine('"DUE_DATE_FIELD:","08/6/2004"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"TOTAL_FIELD:","16.35"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"METER_READING_LINE:","CCF (GAS)","70330","0","0","1","_GAS_"')
_ArrayDisplay($aLineValues, "Values")
$aLineValues = _SplitCSVLine('"ENTRY_DATE:","NULL"')
_ArrayDisplay($aLineValues, "Values")

Func _SplitCSVLine($sLine)
    Local $aValues
    Local $n

    $aValues = StringSplit($sLine, ",")
    For $n = 1 To $aValues[0]
        $aValues[$n] = StringReplace($aValues[$n], '"', "") ; drop the surrounding quotes
    Next
    Return $aValues
EndFunc
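
One caveat: StringSplit will also split on a comma that sits inside a quoted value. Since every field in your sample data is quoted, one possible variant (a sketch for that specific format, not a general CSV parser) is to trim the outer quotes and split on the literal three-character sequence "," using flag 1, which treats the whole string as a single delimiter:

Func _SplitQuotedCSVLine($sLine)
    ; strip the leading and trailing quote, then split on ","
    $sLine = StringTrimLeft(StringTrimRight($sLine, 1), 1)
    Return StringSplit($sLine, '","', 1)
EndFunc

$aValues = _SplitQuotedCSVLine('"TAX_CHARGE:","CCF (GAS)","16.35","_GAS_"')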
Edited by SumTingWong


Here's one that I use to "commanate" (don't bother checking dictionaries; I think I just made up that word) files at work. The print spools I'm working with don't give me any nice tab separation or anything; they just output with variable numbers of spaces filling the columns. I know that doesn't make much sense, but that's just the way it is. This one could probably be optimized too, but I'm putting it up here in case it helps. I'll clock it and yours and post the results.

FileDelete("C:\output.txt")
$input = FileOpen("c:\sourceo.txt",0)
$output = FileOpen("c:\output.txt",1)
$counter = 0
while 1
    $line = FileReadLine($input)
    $newline = ""
    if @error = -1 Then ExitLoop
        for $x = 1 to StringLen($line)
            $achar = StringMid($line,$x,1)
            if $achar = " " Then
                if StringRight($newline,1) = "," Then 
                    ContinueLoop
                Else
                    $newline = $newline & ","
                EndIf
            Else
                $newline = $newline & $achar
            EndIf
        Next
        FileWriteLine($output,$newline)
        $counter = $counter + 1
        ToolTip($counter)
WEnd
FileClose($input)
FileClose($output)

***edit***

Scratch that; I'm not going to clock them. I just gave yours a closer look, and I'm confident that this one will shave some time for you as-is. If you want to speed it up a bit more, you could definitely take out the counter. I'm going to play with it for a minute and try to really speed it up.

Edited by cameronsdad


Here we go... way, way faster:

FileDelete("C:\output.txt")
dim $contents
$blah = FileRead("C:\sourceo.txt",filegetsize("C:\sourceo.txt"))
$output = FileOpen("C:\output.txt",1)
$contents = StringSplit($blah,@CRLF)
for $z = 1 to $contents[0]
$line = $contents[$z]
$newline = ""
        for $x = 1 to StringLen($line)
            $achar = StringMid($line,$x,1)
            if $achar = " " Then
                if StringRight($newline,1) = "," Then 
                    ContinueLoop
                Else
                    $newline = $newline & ","
                EndIf
            Else
                $newline = $newline & $achar
            EndIf
        Next
if StringLen($newline) then FileWriteLine($output,$newline)
Next
FileClose($output)

I left a check in there to omit zero-length lines: by default, StringSplit treats each character of @CRLF as a separate delimiter, so every line break produces an empty element.
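
If you'd rather skip that check, StringSplit's optional third parameter can do it: flag 1 makes the whole delimiter string count as a single delimiter, so @CRLF no longer produces empty elements.

$contents = StringSplit($blah, @CRLF, 1) ; flag 1: @CRLF is one two-character delimiter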

