simy8891

Compare a number of csv files and their values


Hi guys,

I'm having some issues with timing: it's taking ages :(

What I have:

  • circa 46 .csv files (the number may vary)
  • each of those has 6100 rows

What I have to do:

  • compare a certain column (a number) and, if it's higher than the current one, update it

I'm obviously doing this with two nested For loops, but it's really slow: it takes about 65-70 seconds per 1000 records. Is there something else I can try to speed up the process?

This is the For part I have:

For $i = 1 To $ListOfCSVs[0]

    Local $tempArray = 0
    _FileReadToArray($tmpFiles & '\' & $ListOfCSVs[$i], $tempArray)
    ;MsgBox(0, $i, $tmpFiles & '\' & $ListOfCSVs[$i], $tempArray)

    For $d = 1 To $CompleteArray[0][0]
        ;SplashTextOn($i, "ID:" & $d, -1, -1, -1, -1, 4, "", 24)
        $searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1)
        If $searchSAM > 0 Then

            ;If found, $searchSAM holds the position of the item just found
            $tmpItem = StringSplit($tempArray[$searchSAM], ",")

            ;We're interested in the last field of $tmpItem >> $tmpItem[$tmpItem[0]]
            If $tmpItem[$tmpItem[0]] <> '' Then
                If $tmpItem[$tmpItem[0]] > $CompleteArray[$d][2] Then
                    $CompleteArray[$d][2] = $tmpItem[$tmpItem[0]]
                EndIf
            EndIf
        EndIf
        ;SplashOff()
    Next
Next

I used SplashTextOn to help me check the speed of the loop.
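The usual cure for this pattern is to drop the inner linear search entirely: load the keys into a lookup table once, so each record costs a single hash lookup instead of an _ArraySearch over the whole array. A minimal sketch of the idea, in Python for easy testing (the data, names, and CSV layout are invented, not the actual files):

```python
import csv
import io

# Toy stand-in for $CompleteArray: rows keyed by the first field,
# with the value to maximise stored against each key.
complete = {"alice": 3, "bob": 7}

# One "CSV file": first field is the key, last field is the candidate value.
csv_text = "alice,x,9\nbob,y,2\ncarol,z,5\n"

# One pass per file: O(rows) dict lookups replace an O(rows)
# linear search per record.
for row in csv.reader(io.StringIO(csv_text)):
    key, last = row[0], row[-1]
    if last != "" and key in complete and int(last) > complete[key]:
        complete[key] = int(last)

print(complete)  # {'alice': 9, 'bob': 7}
```

In AutoIt the same trick is typically done with ObjCreate("Scripting.Dictionary"), whose Exists/Item methods give the same constant-time lookups.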

Cheers




I'd do that by loading everything into an SQLite DB and working from there, but there are plenty of details that need specifying before useful guidance can be given at this stage.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


You could use Excel to do all the processing. It shouldn't be too hard with the rewritten Excel UDF that comes with the latest AutoIt beta version.


My UDFs and Tutorials:


UDFs:
Active Directory (NEW 2017-04-18 - Version 1.4.8.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX (NEW 2017-02-27 - Version 1.3.1.0) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2015-04-01 - Version 0.4.0.0) - Download - General Help & Support - Example Scripts
Excel - Example Scripts - Wiki
Word - Wiki
PowerPoint (2015-06-06 - Version 0.0.5.0) - Download - General Help & Support

Tutorials:
ADO - Wiki

 


Thanks guys. I'll try with Excel but I can't use SQL.

I'll also try to store everything in an array, it's gonna be huge, but I hope that it's gonna be faster since it'll run through each name once.


#5 ·  Posted (edited)

Hi simy8891,

I'm not sure I've fully understood the file contents, but maybe this could be faster (?). Not tested, just an idea:

in short, instead of scanning all the lines of the temp files one by one, do a single StringInStr on the whole file.

P.S.

Is $CompleteArray bigger or smaller than the temp files?

For $i = 1 To $ListOfCSVs[0]

    ; _FileReadToArray($tmpFiles & '\' & $ListOfCSVs[$i], $tempArray) ; instead of reading into an array...
    Local $hTempFile = FileOpen($tmpFiles & '\' & $ListOfCSVs[$i])
    Local $sTempFile = FileRead($hTempFile) ; ...read the whole file in one shot, to use with StringInStr later
    FileClose($hTempFile)

    ;MsgBox(0, $i, $tmpFiles & '\' & $ListOfCSVs[$i], $sTempFile)

    For $d = 1 To $CompleteArray[0][0]
        ;SplashTextOn($i, "ID:" & $d, -1, -1, -1, -1, 4, "", 24)
        ; If it was found, it's because $CompleteArray[$d][0] equals a line in the temp file. Is that right?
        ; $searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1)
        ; If $searchSAM > 0 Then

        ; All we want to know is whether that record (string) is or is not within the temp file.
        If StringInStr($sTempFile, $CompleteArray[$d][0]) Then ; this should be faster..... (?)

            ;If found, $searchSAM holds the position of the item just found
            ; $tmpItem = StringSplit($tempArray[$searchSAM], ",")

            ; If I've understood correctly, the first field of $CompleteArray can contain the same value
            ; as a record in the temp file, so it's OK to use that record instead of the one in the temp file.
            $tmpItem = StringSplit($CompleteArray[$d][0], ",")

            ;We're interested in the last field of $tmpItem >> $tmpItem[$tmpItem[0]]
            If $tmpItem[$tmpItem[0]] <> '' Then
                If $tmpItem[$tmpItem[0]] > $CompleteArray[$d][2] Then
                    $CompleteArray[$d][2] = $tmpItem[$tmpItem[0]]
                EndIf
            EndIf
        EndIf
        ;SplashOff()
    Next
Next
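One caveat with a whole-file StringInStr: a plain substring test can also match part of a longer line (for example a key that is a prefix of another key). A Python sketch of the pitfall and a newline-anchored check, with invented data:

```python
csv_text = "alice,x,9\nalicia,x,2\n"

# A bare substring test can false-positive on a partial match:
# "alic" is found even though no line equals "alic".
assert "alic" in csv_text

def has_line(text, record):
    # Pad with newlines so the record can only match a whole line,
    # never a fragment of a longer one.
    padded = "\n" + text + ("" if text.endswith("\n") else "\n")
    return "\n" + record + "\n" in padded

print(has_line(csv_text, "alice,x,9"))  # True  (exact line)
print(has_line(csv_text, "alic"))       # False (only a fragment)
```

The AutoIt equivalent would be to search for @CRLF & $record & @CRLF rather than the bare record.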
Edited by Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


#6 ·  Posted (edited)

Hi Chimp,

thanks for the suggestion. In theory, $CompleteArray and $tempArray are the same size; there might be a mismatch of one or two lines, which is why I used _ArraySearch() directly, since in 99.9% of cases the string will be found. Sorry for not specifying this: obviously it's difficult to judge without knowing what data is in there. However, I believe your suggestion, applied to this specific case, will make it slower, since it basically runs a double check.

One thing though:

Most of the time $tmpItem[$tmpItem[0]] will be empty. Is there a way to clean up the results in advance and get a smaller csv file? For sure each .csv would end up with at least 60% fewer items after this check! But I'm stuck on the logic of it :P

UPDATE:

Well, I'm feeling pretty stupid... the command generating these .csv files actually includes an option to omit empty rows for that kind of object. I tested a single .csv export and it went down from 6100 rows to circa 1200. I hope it'll speed up the process.
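For the record, if the export command had no such option, dropping the rows whose last field is empty is a cheap one-pass pre-filter. A sketch in Python with made-up data:

```python
import csv
import io

# Four sample rows; two have an empty last field.
csv_text = "alice,x,9\nbob,y,\ncarol,z,5\ndave,w,\n"

# Keep only rows whose last field is non-empty: one linear pass,
# and the main comparison loop then runs over a much smaller set.
rows = [r for r in csv.reader(io.StringIO(csv_text)) if r[-1] != ""]

print(rows)  # [['alice', 'x', '9'], ['carol', 'z', '5']]
```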

Thanks

Edited by simy8891


Hi Chimp,

....., applied to the specific case, will make it slower since it'll run a double check basically......

 

where is the double check?



Hey, I've updated my previous post. Anyway, regarding this case, the double check happens here:

If StringInStr($sTempFile, $CompleteArray[$d][0]) Then

Since I know that in 99.9% of cases the string already exists in the target array, it's kind of useless. Again, only for this case!

Also, since I shrank the csv files, I'm now running through the temporary array instead of the giant one, so I'm only checking items actually present in the csv files. In 7 minutes I'm at the halfway mark; last night it took 4 hours instead :P


Thanks guys. I'll try with Excel but I can't use SQL.

I'll also try to store everything in an array, it's gonna be huge, but I hope that it's gonna be faster since it'll run through each name once.

 

Definitely use an in-memory SQLite database. It will take only a few seconds!

The only extra thing you need is to place the SQLite3.dll file in the script directory; no installation of any other software.

http://www.autoitscript.com/autoit3/files/beta/autoit/archive/sqlite/

Look here how simple and fast it is:
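A minimal sketch of the in-memory idea, written with Python's built-in sqlite3 for easy testing; the AutoIt _SQLite_* UDFs follow the same open/exec/query pattern. Table names and data are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # pure in-memory database, nothing on disk
con.execute("CREATE TABLE master (name TEXT PRIMARY KEY, value INTEGER)")
con.execute("CREATE TABLE incoming (name TEXT, value INTEGER)")

# Rows that would come from the master list and from the imported CSVs.
con.executemany("INSERT INTO master VALUES (?, ?)",
                [("alice", 3), ("bob", 7)])
con.executemany("INSERT INTO incoming VALUES (?, ?)",
                [("alice", 9), ("bob", 2), ("carol", 5)])

# One UPDATE replaces the nested loops: keep the larger of the two values.
con.execute("""
    UPDATE master
    SET value = (SELECT MAX(i.value) FROM incoming i
                 WHERE i.name = master.name)
    WHERE EXISTS (SELECT 1 FROM incoming i
                  WHERE i.name = master.name AND i.value > master.value)
""")

print(dict(con.execute("SELECT name, value FROM master")))
# {'alice': 9, 'bob': 7}
```

The SQL engine does the matching via the primary-key index, so the quadratic scan disappears entirely.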


Zedna raises good points. Yet another, simpler SQLite-based way: use the sqlite3.exe command-line tool (freeware) with the .import command to load a database with all your input data, then use SQLite Expert Personal Edition (also freeware) to query/massage it the way you want. Then you have zero lines of code to develop, yet obtain the wanted result easily.



Excuse me simy8891,

when the statement $searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1) is executed (see the snippet below)

and an element is found, it's because $tempArray[$searchSAM] and $CompleteArray[$d][0] are equal (I mean that the element found in $tempArray is equal to the one searched for, $CompleteArray[$d][0]).

Is that right?

If so, then using this statement, $tmpItem = StringSplit($CompleteArray[$d][0], ","),

or this one, $tmpItem = StringSplit($tempArray[$searchSAM], ","), should be equivalent.

Am I wrong?

$searchSAM = _ArraySearch($tempArray, $CompleteArray[$d][0], 2, 0, 0, 1) ; <-- this finds something only if $CompleteArray[$d][0] is inside $tempArray
If $searchSAM > 0 Then

    ;If found, $searchSAM holds the position of the item just found
    $tmpItem = StringSplit($tempArray[$searchSAM], ",") ;                   <-- so here you can use $tempArray[$searchSAM]
    ;                                                                           or $CompleteArray[$d][0], because both are the same
    ;                                                                           is that right ???


I'm sorry Chimp, I just got what you meant.

I can try it on the new script and see if there's any improvement. I should really try it on the old one though: at least over a 4-hour run we'd be able to see an improvement, if there is one. Over 15 minutes I think the difference is going to be tiny.

Anyway, I finally got what I wanted from the whole script! Thanks guys :)


