Jump to content

Compare a number of csv files and their values


Recommended Posts

Hi guys,

I'm getting some issues with timing: it's taking ages :(

What I have:

  • 46 .csv files circa (number may vary)
  • Each of those has 6100 rows

What I have to do:

  • Compare a certain column (number) and if it's higher than the current one, update it

 

I'm obviously doing this with 2 For cycles, one into each other but it's really slow. It takes about 65/70 seconds every 1000 records. Is there something else I can try to speed up the process?

This is the For part I have:

For $i=1 To $ListOfCSVs[0]

        Local $tempArray= 0
        _FileReadToArray($tmpFiles&'\'&$ListOfCSVs[$i],$tempArray)

        ;MsgBox(0,$i,$tmpFiles&'\'&$ListOfCSVs[$i],$tempArray)

        $d=0

        For $d=1 To $CompleteArray[0][0]
            ;SplashTextOn($i, "ID:"&$d, -1, -1, -1, -1, 4, "", 24)
            $searchSAM=_ArraySearch($tempArray, $CompleteArray[$d][0],2,0,0,1)
            If  $searchSAM > 0 Then

                ;If found $searchSAM holds the position of the item just found
                $tmpItem=StringSplit($tempArray[$searchSAM],",")

                ;We're interested on the last row of $tmpItem >> $tmpItem[$tmpItem[0]]
                If $tmpItem[$tmpItem[0]] <> '' Then 

                    If $tmpItem[$tmpItem[0]] > $CompleteArray[$d][2] Then 

                        $CompleteArray[$d][2] = $tmpItem[$tmpItem[0]]

                    EndIf
                EndIf
            EndIf
            ;Splashoff()
        Next


    Next

I used the SplashTextOn to help me checking the speed of the cycle..

Cheers

Link to comment
Share on other sites

I'd do that by loading everything into an SQLite DB and work from there but there are plenty of details which need specification to provide a useful guidance at this stage.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

You could use Excel to do all the processing. Shouldn't be too hard with the rewritten Excel UDF that comes with the latest AutoIt beta version.

My UDFs and Tutorials:

Spoiler

UDFs:
Active Directory (NEW 2022-02-19 - Version 1.6.1.0) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2017-07-21 - Version 0.4.0.1) - Download - General Help & Support - Example Scripts
OutlookEX (2021-11-16 - Version 1.7.0.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX_GUI (2021-04-13 - Version 1.4.0.0) - Download
Outlook Tools (2019-07-22 - Version 0.6.0.0) - Download - General Help & Support - Wiki
PowerPoint (2021-08-31 - Version 1.5.0.0) - Download - General Help & Support - Example Scripts - Wiki
Task Scheduler (NEW 2022-07-28 - Version 1.6.0.1) - Download - General Help & Support - Wiki

Standard UDFs:
Excel - Example Scripts - Wiki
Word - Wiki

Tutorials:
ADO - Wiki
WebDriver - Wiki

 

Link to comment
Share on other sites

Hi simy88912

not sure to well understood the files content, anyway maybe this could be faster (?) not tested, just an idea

in short, instead of scanning all lines of tempfiles one by one, just do a StringInstr in the whole file

p.s.

is CompleteArray bigger or smaller than TempFiles?

For $i = 1 To $ListOfCSVs[0]

    Local $tempArray = 0
    ; _FileReadToArray($tmpFiles&'\'&$ListOfCSVs[$i],$tempArray) ; instead of read to array
    Local $hTempFile = FileOpen($tmpFiles & '\' & $ListOfCSVs[$i])
    Local $sTempFile = FileRead($hFileOpen) ;   read the whole file in one shot to use with StringInStr later

    ;MsgBox(0,$i,$tmpFiles&'\'&$ListOfCSVs[$i],$tempArray)

    $d = 0

    For $d = 1 To $CompleteArray[0][0]
        ;SplashTextOn($i, "ID:"&$d, -1, -1, -1, -1, 4, "", 24)
        ; if is found is because $CompleteArray[$d][0] is equal to a line in the TempFile. is it right?
        ; $searchSAM=_ArraySearch($tempArray, $CompleteArray[$d][0],2,0,0,1)
        ; If  $searchSAM > 0 Then

        ; what we want to know is only if that record (string) is or is not within the temp file.
        If StringInStr($sTempFile, $CompleteArray[$d][0]) Then ; this should be faster..... (?)

            ;If found $searchSAM holds the position of the item just found
            ; $tmpItem=StringSplit($tempArray[$searchSAM],",")

            ; if i've well understand first field of CompleteArray could contain the same value
            ; of a record in the TempFile so is ok to use that record instead of the one in TempFile

            $tmpItem = StringSplit($CompleteArray[$d][0], ",")

            ;We're interested on the last row of $tmpItem >> $tmpItem[$tmpItem[0]]
            If $tmpItem[$tmpItem[0]] <> '' Then

                If $tmpItem[$tmpItem[0]] > $CompleteArray[$d][2] Then

                    $CompleteArray[$d][2] = $tmpItem[$tmpItem[0]]

                EndIf
            EndIf
        EndIf
        ;Splashoff()
    Next
Next
Edited by Chimp

 

image.jpeg.9f1a974c98e9f77d824b358729b089b0.jpeg Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Link to comment
Share on other sites

Hi Chimp,

thanks for the suggestion. In theory, CompleteArray and TempArray are the same size. There might be a mismatch of 1 or 2 lines so I thought I'd use directly _ArraySearch() since in 99.9% of the cases that string will be found. Sorry for not specifying this, obviously if you don't know what data's in there it is a bit difficult, however I believe your suggestion, applied to the specific case, will make it slower since it'll run a double check basically.

One thing though:

Most of the times $tmpItem[$tmpItem[0]] will be empty, is there a way to clean up the results in advance and have a smaller csv file? Count that for sure each .csv will have 60% items less with this check! But I'm stuck in the logical thinking of it :P

UPDATE:

Well, I'm feeling pretty stupid.. but the command generating these .csv files actually includes the option to omit empty rows for that kind of object. I tested a single .csv export and it went down from 6100 rows to 1200 circa. I hope it'll speed up the process.

Thanks

Edited by simy8891
Link to comment
Share on other sites

Hi Chimp,

....., applied to the specific case, will make it slower since it'll run a double check basically......

 

where is the double check?

 

image.jpeg.9f1a974c98e9f77d824b358729b089b0.jpeg Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Link to comment
Share on other sites

Hey, I've updated my previous post. Anyway regarding this case, there will be a double check here:

If StringInStr($sTempFile, $CompleteArray[$d][0]) Then

Since I know that in 99.9% of the case the string already exists in the target array, it's kind of useless. Again, for the case!

Also, since I did shrink the csv files, now I'm running across the temporary array instead of the giant one, so I'm only checking items present in the csv files. In 7 minutes I'm on the half.. Last night it took 4 hours instead :P

Link to comment
Share on other sites

Thanks guys. I'll try with Excel but I can't use SQL.

I'll also try to store everything in an array, it's gonna be huge, but I hope that it's gonna be faster since it'll run through each name once.

 

Definitely use memory SQLite database. It will take only few seconds!

The only extra stuff you need is place SQLite3.dll file to script directory, no installation of any other software.

http://www.autoitscript.com/autoit3/files/beta/autoit/archive/sqlite/

Look here how simple and fast it is:

Link to comment
Share on other sites

Zedna raises good points. Yet another simpler SQLite-based way: use sqlite3.exe command-line tool (freeware) with .import command to load a database with all your input data, then use SQLite Expert personal edition (also freeware) to query/massage it the way you want. Then you have zero line of code to develop, yet obtain the wanted result easily.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

excuse me simy8891,

when the statement $searchSAM=_ArraySearch($tempArray, $CompleteArray[$d][0],2,0,0,1) is executed (see line 10 below)

and an element is found, is because $tempArray[$searchSAM] and $CompleteArray[$d][0] are equals (I mean that the found element in $tempArray is equal to the one searched $CompleteArray[$d][0])

is it right?

if so then, using this statement $tmpItem = StringSplit($CompleteArray[$d][0], ",") (see line 13 below)

or this $tmpItem=StringSplit($tempArray[$searchSAM],",") should be equivalent.

am I wrong?

$searchSAM=_ArraySearch($tempArray, $CompleteArray[$d][0],2,0,0,1) ;    <-- this is found only if $CompleteArray[$d][0] is inside $tempArray
            If  $searchSAM > 0 Then

                ;If found $searchSAM holds the position of the item just found
                $tmpItem=StringSplit($tempArray[$searchSAM],",") ;                  <-- so here you can use $tempArray[$searchSAM]
                ;                                                                       or $CompleteArray[$d][0] becaude both are the same
                ;                                                                       is it right ???

 

image.jpeg.9f1a974c98e9f77d824b358729b089b0.jpeg Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Link to comment
Share on other sites

I'm sorry Chimp, I just got what you meant.

I can try it on the new script and see if there's any improvement. I should try on the old one though, at least on a 4 hours run we'll be able to see an improvement, if there's any. With 15 minutes I think the difference's gonna be tiny.

Anyway, I finally got what I wanted from the whole script! Thanks guys :)

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...