simy8891

Compare a number of csv files and their values


Hi guys,

I'm having some issues with timing: it's taking ages :(

What I have:

  • circa 46 .csv files (the number may vary)
  • each of those has 6100 rows

What I have to do:

  • compare a certain column (a number) and, if it's higher than the current one, update it

I'm obviously doing this with two nested For loops, but it's really slow: it takes about 65-70 seconds per 1000 records. Is there something else I can try to speed up the process?

This is the For part I have:

For $i = 1 To $ListOfCSVs[0]

    Local $tempArray = 0
    _FileReadToArray($tmpFiles & '\' & $ListOfCSVs[$i], $tempArray)
    ;MsgBox(0, $i, $tmpFiles & '\' & $ListOfCSVs[$i], $tempArray)

    For $d = 1 To $CompleteArray[0][0]
        ;SplashTextOn($i, "ID:" & $d, -1, -1, -1, -1, 4, "", 24)
        $searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1)
        If $searchSAM > 0 Then

            ;If found, $searchSAM holds the position of the item just found
            $tmpItem = StringSplit($tempArray[$searchSAM], ",")

            ;We're interested in the last field of $tmpItem >> $tmpItem[$tmpItem[0]]
            If $tmpItem[$tmpItem[0]] <> '' Then
                If $tmpItem[$tmpItem[0]] > $CompleteArray[$d][2] Then
                    $CompleteArray[$d][2] = $tmpItem[$tmpItem[0]]
                EndIf
            EndIf
        EndIf
        ;SplashOff()
    Next
Next

I used SplashTextOn to help me check the speed of the loop.
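The usual cure for this pattern is to drop the inner linear search entirely: load the keys into a lookup table once, so each record costs a single hash lookup instead of an _ArraySearch over the whole array. A minimal sketch of the idea, in Python for easy testing (the data, names, and CSV layout are invented, not the actual files):

```python
import csv
import io

# Toy stand-in for $CompleteArray: rows keyed by the first field,
# with the value to maximise stored against each key.
complete = {"alice": 3, "bob": 7}

# One "CSV file": first field is the key, last field is the candidate value.
csv_text = "alice,x,9\nbob,y,2\ncarol,z,5\n"

# One pass per file: O(rows) dict lookups replace an O(rows)
# linear search per record.
for row in csv.reader(io.StringIO(csv_text)):
    key, last = row[0], row[-1]
    if last != "" and key in complete and int(last) > complete[key]:
        complete[key] = int(last)

print(complete)  # {'alice': 9, 'bob': 7}
```

In AutoIt the same trick is typically done with ObjCreate("Scripting.Dictionary"), whose Exists/Item methods give the same constant-time lookups.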

Cheers




I'd do that by loading everything into an SQLite DB and working from there, but there are plenty of details that need specifying before useful guidance can be given at this stage.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


You could use Excel to do all the processing. It shouldn't be too hard with the rewritten Excel UDF that comes with the latest AutoIt beta version.


My UDFs and Tutorials:


UDFs:
Active Directory (NEW 2017-04-18 - Version 1.4.8.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX (NEW 2017-02-27 - Version 1.3.1.0) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2015-04-01 - Version 0.4.0.0) - Download - General Help & Support - Example Scripts
Excel - Example Scripts - Wiki
Word - Wiki
PowerPoint (2015-06-06 - Version 0.0.5.0) - Download - General Help & Support

Tutorials:
ADO - Wiki

 


Thanks guys. I'll try with Excel but I can't use SQL.

I'll also try to store everything in an array, it's gonna be huge, but I hope that it's gonna be faster since it'll run through each name once.


#5 ·  Posted (edited)

Hi simy8891,

I'm not sure I've fully understood the file contents, but maybe this could be faster (?). Not tested, just an idea:

in short, instead of scanning all the lines of the temp files one by one, do a single StringInStr on the whole file.

P.S.

Is $CompleteArray bigger or smaller than the temp files?

For $i = 1 To $ListOfCSVs[0]

    ; _FileReadToArray($tmpFiles & '\' & $ListOfCSVs[$i], $tempArray) ; instead of reading into an array...
    Local $hTempFile = FileOpen($tmpFiles & '\' & $ListOfCSVs[$i])
    Local $sTempFile = FileRead($hTempFile) ; ...read the whole file in one shot, to use with StringInStr later
    FileClose($hTempFile)

    ;MsgBox(0, $i, $tmpFiles & '\' & $ListOfCSVs[$i], $sTempFile)

    For $d = 1 To $CompleteArray[0][0]
        ;SplashTextOn($i, "ID:" & $d, -1, -1, -1, -1, 4, "", 24)
        ; If it was found, it's because $CompleteArray[$d][0] equals a line in the temp file. Is that right?
        ; $searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1)
        ; If $searchSAM > 0 Then

        ; All we want to know is whether that record (string) is or is not within the temp file.
        If StringInStr($sTempFile, $CompleteArray[$d][0]) Then ; this should be faster..... (?)

            ;If found, $searchSAM holds the position of the item just found
            ; $tmpItem = StringSplit($tempArray[$searchSAM], ",")

            ; If I've understood correctly, the first field of $CompleteArray can contain the same value
            ; as a record in the temp file, so it's OK to use that record instead of the one in the temp file.
            $tmpItem = StringSplit($CompleteArray[$d][0], ",")

            ;We're interested in the last field of $tmpItem >> $tmpItem[$tmpItem[0]]
            If $tmpItem[$tmpItem[0]] <> '' Then
                If $tmpItem[$tmpItem[0]] > $CompleteArray[$d][2] Then
                    $CompleteArray[$d][2] = $tmpItem[$tmpItem[0]]
                EndIf
            EndIf
        EndIf
        ;SplashOff()
    Next
Next
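One caveat with a whole-file StringInStr: a plain substring test can also match part of a longer line (for example a key that is a prefix of another key). A Python sketch of the pitfall and a newline-anchored check, with invented data:

```python
csv_text = "alice,x,9\nalicia,x,2\n"

# A bare substring test can false-positive on a partial match:
# "alic" is found even though no line equals "alic".
assert "alic" in csv_text

def has_line(text, record):
    # Pad with newlines so the record can only match a whole line,
    # never a fragment of a longer one.
    padded = "\n" + text + ("" if text.endswith("\n") else "\n")
    return "\n" + record + "\n" in padded

print(has_line(csv_text, "alice,x,9"))  # True  (exact line)
print(has_line(csv_text, "alic"))       # False (only a fragment)
```

The AutoIt equivalent would be to search for @CRLF & $record & @CRLF rather than the bare record.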
Edited by Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


#6 ·  Posted (edited)

Hi Chimp,

thanks for the suggestion. In theory, $CompleteArray and $tempArray are the same size; there might be a mismatch of one or two lines, which is why I used _ArraySearch() directly, since in 99.9% of cases the string will be found. Sorry for not specifying this: obviously it's difficult to judge without knowing what data is in there. However, I believe your suggestion, applied to this specific case, will make it slower, since it basically runs a double check.

One thing though:

Most of the time $tmpItem[$tmpItem[0]] will be empty. Is there a way to clean up the results in advance and get a smaller csv file? For sure each .csv would end up with at least 60% fewer items after this check! But I'm stuck on the logic of it :P

UPDATE:

Well, I'm feeling pretty stupid... the command generating these .csv files actually includes an option to omit empty rows for that kind of object. I tested a single .csv export and it went down from 6100 rows to circa 1200. I hope it'll speed up the process.
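For the record, if the export command had no such option, dropping the rows whose last field is empty is a cheap one-pass pre-filter. A sketch in Python with made-up data:

```python
import csv
import io

# Four sample rows; two have an empty last field.
csv_text = "alice,x,9\nbob,y,\ncarol,z,5\ndave,w,\n"

# Keep only rows whose last field is non-empty: one linear pass,
# and the main comparison loop then runs over a much smaller set.
rows = [r for r in csv.reader(io.StringIO(csv_text)) if r[-1] != ""]

print(rows)  # [['alice', 'x', '9'], ['carol', 'z', '5']]
```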

Thanks

Edited by simy8891


Hi Chimp,

....., applied to the specific case, will make it slower since it'll run a double check basically......

 

where is the double check?



Hey, I've updated my previous post. Anyway, regarding this case, the double check happens here:

If StringInStr($sTempFile, $CompleteArray[$d][0]) Then

Since I know that in 99.9% of cases the string already exists in the target array, it's kind of useless. Again, only for this case!

Also, since I shrank the csv files, I'm now running through the temporary array instead of the giant one, so I'm only checking items actually present in the csv files. In 7 minutes I'm at the halfway mark; last night it took 4 hours instead :P


Thanks guys. I'll try with Excel but I can't use SQL.

I'll also try to store everything in an array, it's gonna be huge, but I hope that it's gonna be faster since it'll run through each name once.

 

Definitely use an in-memory SQLite database. It will take only a few seconds!

The only extra thing you need is to place the SQLite3.dll file in the script directory; no installation of any other software.

http://www.autoitscript.com/autoit3/files/beta/autoit/archive/sqlite/

Look here how simple and fast it is:
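A minimal sketch of the in-memory idea, written with Python's built-in sqlite3 for easy testing; the AutoIt _SQLite_* UDFs follow the same open/exec/query pattern. Table names and data are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # pure in-memory database, nothing on disk
con.execute("CREATE TABLE master (name TEXT PRIMARY KEY, value INTEGER)")
con.execute("CREATE TABLE incoming (name TEXT, value INTEGER)")

# Rows that would come from the master list and from the imported CSVs.
con.executemany("INSERT INTO master VALUES (?, ?)",
                [("alice", 3), ("bob", 7)])
con.executemany("INSERT INTO incoming VALUES (?, ?)",
                [("alice", 9), ("bob", 2), ("carol", 5)])

# One UPDATE replaces the nested loops: keep the larger of the two values.
con.execute("""
    UPDATE master
    SET value = (SELECT MAX(i.value) FROM incoming i
                 WHERE i.name = master.name)
    WHERE EXISTS (SELECT 1 FROM incoming i
                  WHERE i.name = master.name AND i.value > master.value)
""")

print(dict(con.execute("SELECT name, value FROM master")))
# {'alice': 9, 'bob': 7}
```

The SQL engine does the matching via the primary-key index, so the quadratic scan disappears entirely.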


Zedna raises good points. Yet another, simpler SQLite-based way: use the sqlite3.exe command-line tool (freeware) with the .import command to load a database with all your input data, then use SQLite Expert Personal Edition (also freeware) to query/massage it the way you want. Then you have zero lines of code to develop, yet obtain the wanted result easily.



Excuse me simy8891,

when the statement $searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1) is executed (see the snippet below)

and an element is found, it's because $tempArray[$searchSAM] and $CompleteArray[$d][0] are equal (I mean that the element found in $tempArray is equal to the one searched for, $CompleteArray[$d][0]).

Is that right?

If so, then using this statement, $tmpItem = StringSplit($CompleteArray[$d][0], ","),

or this one, $tmpItem = StringSplit($tempArray[$searchSAM], ","), should be equivalent.

Am I wrong?

$searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1) ; <-- this finds something only if $CompleteArray[$d][0] is inside $tempArray
If $searchSAM > 0 Then

    ;If found, $searchSAM holds the position of the item just found
    $tmpItem = StringSplit($tempArray[$searchSAM], ",") ;                   <-- so here you can use $tempArray[$searchSAM]
    ;                                                                           or $CompleteArray[$d][0], because both are the same
    ;                                                                           is that right ???


I'm sorry Chimp, I just got what you meant.

I can try it on the new script and see if there's any improvement. I should really try it on the old one though: at least over a 4-hour run we'd be able to see an improvement, if there is one. Over 15 minutes I think the difference is going to be tiny.

Anyway, I finally got what I wanted from the whole script! Thanks guys :)


