Rishav

Comparing two csv files

13 posts in this topic

Hi all

In my work, I often need to compare two CSV files. The simple code that I have written for it is:

$File1 = "e:\File1.csv"
$File2 = "e:\File3.csv"

If FileGetSize($File1) = FileGetSize($File2) Then
    If FileRead($File1) = FileRead($File2) Then
        MsgBox(4096, "", "File content same.")
    Else
        MsgBox(4096, "", "File content different.")
    EndIf
Else
    MsgBox(4096, "", "File size different.")
EndIf

However, I want to make it more robust and include MD5 checksum functionality as well. All the examples or scripts that I found here are too complex; most require a third-party COM object or DLLs. Can't I generate a simple MD5 checksum using only AutoIt functions? Can anyone direct me to a simpler way of doing this?

Also, is this approach good enough for comparing CSV files? Will FileRead, FileGetSize and then an MD5 checksum ensure that the file comparison is free of errors? Any input or advice will be most helpful.

regards

Rishav


#2 ·  Posted (edited)

Hi,

What do you want to achieve?

Just compare the size. If the sizes are not equal, the files are different. If the sizes are equal, you can then compare byte by byte.

Or just use a tool :-)

Mega

Edited by Xenobiologist


Hi xeno.

Actually, I am working on an automation framework for my company's product. The product gives its output as CSV files, which may have tens of thousands of lines with minute changes. If the files happen to be different, it means my test case has failed.

So basically, every time I run my test case, I get two CSV files. If the files are the same, the test case passed; if they are different, it failed.

I don't want to use any external tool like Beyond Compare; I want to use only AutoIt for this.

Btw, I thought that my code does just that. First it compares the file size, and then it does a byte-by-byte comparison. Am I right on this?

Lastly, my main worry is that, these being CSV files, will FileRead still be able to read all the bytes properly? Or should I use _FileReadToArray?


You may be able to generate a CRC hash for comparison. There are some examples of this in the forum.


If all you need is to prove that the two files are identical, then what you have will work fine. If you are going to have to do this many times, then computing a hash for a file and comparing it with a previously computed value will save having to open two files.

Have a look at this thread for some very quick hash routines by Ward that do not use any external COM or DLL other than the standard Windows API.

Pure AutoIt MD5 hash
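
As an alternative to the routines in the linked thread, the hash can also be sketched with the Crypt.au3 UDF, which wraps the Windows CryptoAPI and so needs no third-party COM object or DLL. This assumes your AutoIt installation includes the standard Crypt.au3 include; `_FileMD5` is a hypothetical helper name, and the paths are the ones from the original post.

```autoit
#include <Crypt.au3>

; Hypothetical helper (not part of Crypt.au3): returns the MD5 of a file
; as a hex string ("0x..."), or "" with @error set if hashing failed.
Func _FileMD5($sPath)
    Local $dHash = _Crypt_HashFile($sPath, $CALG_MD5)
    If @error Then Return SetError(1, 0, "")
    Return SetError(0, 0, String($dHash))
EndFunc   ;==>_FileMD5

Global $sFile1 = "e:\File1.csv"
Global $sFile2 = "e:\File3.csv"

_Crypt_Startup() ; initialise the crypto library once up front
Global $sHash1 = _FileMD5($sFile1)
Global $sHash2 = _FileMD5($sFile2)
_Crypt_Shutdown()

If $sHash1 = "" Or $sHash2 = "" Then
    MsgBox(4096, "", "Could not hash one of the files.")
ElseIf $sHash1 == $sHash2 Then
    MsgBox(4096, "", "File content same.")
Else
    MsgBox(4096, "", "File content different.")
EndIf
```

Storing the hash of a known-good file means only the new file has to be read on later runs.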


"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning."- Rick Cook


#6 ·  Posted (edited)

Hello Rishav,

As a fellow tester and writer of "corporate automation frameworks", let me ask some clarifying questions:

0 ) Bearing in mind your definition of a failing test, what would make the MD5 approach more "robust" than a text-based comparison?

1 ) What experience(s) have you had that give you the sense that you need something more robust than the example script you provided? Did you have a particular context where it failed? If so, what was that context?

2 ) Is the performance of this comparison activity of vital concern? If so, then implementing MD5/CRC/SHA hashes in a UDF could be slower than using the native FileRead functions.

3 ) How would you prove to us/yourself that FileRead would read a CSV file differently than any other file, such that a comparison activity would fail?

My own 2 cents: I like your example script. Simple. Concise. Appears to work in theory, but some force is compelling you towards additional complexity in the name of "robustness". Management/Specification/Contextual failures notwithstanding, these forces could be either perceived external motivators ( i.e., "QTP and other COTS frameworks have <functionality X>, ergo I MUST have it in order to consider mine 'complete'" ) or an internal one ( i.e., "I've never implemented <functionality X> in AutoIt before. That would be fun!" ).

I'm really interested to hear your answers!

Zach...

edit: changed question 2 language: "would most likely" to "could"; experiences to answers.

Edited by zfisherdrums


Thanks a lot Bowmore. I will check that code, even though most of it just flies over my head. :o

Hi Zfisher.

0) I don't really know. I was thinking of getting rid of the FileRead that I am using now altogether and going for hash checks. Some CSV files that I get are over 10 MB in size, and though I have no data points to back myself up, it felt that FileRead may not be the best approach for large files.

1) No experience really. I suppose using hashes looked cooler than using FileRead. :)

2) Actually, I was thinking the opposite. Thanks for setting me straight on that.

3) To be honest, I simply don't know much about FileRead at all. To me it seems like it was made to work for text files, and I am not confident of it working for huge CSV files, though from my spiking it seems to work so far.

In summary, I suppose this would define me best here: "I've never implemented <functionality X> in AutoIt before. That would be fun!"

Usually the CSV files that we get contain tens of thousands of rows of data, and there might be only a few characters of difference between two files.

Can you advise me on:

a. Will it be better to open the file in binary mode using FileOpen with the binary flag and then read it with FileRead?

b. Will it be better to use _FileReadToArray?
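
For option (a), a minimal sketch of a binary-mode read might look like the following. `_ReadFileBinary` is a hypothetical helper name, the paths are the ones from the first post, and each file is read into memory in one go, which is an assumption that suits files of the sizes mentioned here but not arbitrarily large ones.

```autoit
; Hypothetical helper: reads a whole file as raw bytes.
; Returns the binary data, or sets @error to 1 if the file cannot be opened.
Func _ReadFileBinary($sPath)
    Local $hFile = FileOpen($sPath, 16) ; 16 = binary (byte) read mode
    If $hFile = -1 Then Return SetError(1, 0, Binary(""))
    Local $dData = FileRead($hFile)
    FileClose($hFile)
    Return SetError(0, 0, $dData)
EndFunc   ;==>_ReadFileBinary

Global $dData1 = _ReadFileBinary("e:\File1.csv")
Global $bFailed = (@error <> 0)
Global $dData2 = _ReadFileBinary("e:\File3.csv")
If @error Then $bFailed = True

If $bFailed Then
    MsgBox(4096, "", "Could not open one of the files.")
ElseIf BinaryLen($dData1) <> BinaryLen($dData2) Then
    MsgBox(4096, "", "File size different.")
ElseIf String($dData1) == String($dData2) Then
    ; String() on binary data yields its hex form, so this compares every byte.
    MsgBox(4096, "", "File content same.")
Else
    MsgBox(4096, "", "File content different.")
EndIf
```

Reading in binary mode sidesteps any text decoding, at the cost of holding both files and their hex strings in memory during the comparison.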


I compare CSV files several times a day, but they are pretty small and I can manually read/digest the output of the "DOS" command named "fc":

fc file1.csv file2.csv

I press Ctrl-Alt-D to bring up a "DOS" window

(which I always have positioned at 0,0)

I type fc and a space in that window

I drag/drop file1.csv into the "DOS" window

I type a space

I drag/drop file2.csv into the "DOS" window and hit enter

The output says no difference if the files are the same or it points out the differences.

If you wrap fc using AutoIt STDIO, I think that you will find it to be pretty fast even for large files.
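
One way to wrap fc for a pass/fail check is sketched below. It is a simplification: instead of capturing fc's output over STDIO, it uses RunWait and relies on fc's exit code (0 for identical files, 1 for different files). `_FilesMatchViaFc` is a hypothetical helper name, and the `/b` switch (binary comparison) is an assumption about what the test needs.

```autoit
; Hypothetical helper: True if fc reports the files as identical, False otherwise.
; @error is 1 if the command could not be launched, 2 if fc reported a problem
; (for example a missing file); @extended then holds fc's exit code.
Func _FilesMatchViaFc($sFile1, $sFile2)
    Local $sCmd = @ComSpec & ' /c fc /b "' & $sFile1 & '" "' & $sFile2 & '" > nul'
    Local $iExit = RunWait($sCmd, "", @SW_HIDE)
    If @error Then Return SetError(1, 0, False)
    ; fc exit codes: 0 = identical, 1 = different, anything else = failure
    If $iExit <> 0 And $iExit <> 1 Then Return SetError(2, $iExit, False)
    Return SetError(0, 0, ($iExit = 0))
EndFunc   ;==>_FilesMatchViaFc

Global $bSame = _FilesMatchViaFc("e:\File1.csv", "e:\File3.csv")
If @error Then
    MsgBox(4096, "", "fc could not compare the files.")
ElseIf $bSame Then
    MsgBox(4096, "", "File content same.")
Else
    MsgBox(4096, "", "File content different.")
EndIf
```

If the actual differences are needed, the same command can be run without `> nul` and its output captured instead.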



Rishav,

This has been a very enlightening day for me, and I have you to thank for it!

I did some experiments reading in a VMWare snapshot that was 17.2 MB in size.

On my machine, FileRead took 1044 ms to read in the file the first time. It averaged 395 ms on subsequent consecutive runs.

Using the MD5 UDFs that Bowmore linked to, the average time to generate an MD5 hash was 185 ms!

This is precisely why I said that it "could" be slower obtaining the hash code. The MD5 UDFs are implemented very well, it seems, and would have the advantage in this context. I was not aware of this before today. You were on the right track. So it is YOU who are putting me on the right track.

All that to say that the FileRead options are now a moot point; I would go with the MD5 route. I do, however, absolutely love herewasplato's old-new-thing approach.

Zach...


... I do, however, absolutely love herewasplato's old-new-thing approach. ...

I was taught to avoid hash :-)

As long as you don't need the differences, then using the MD5 hash will be better/faster ..... addictive.



Thanks all.

You have given me a lot of brain fodder to ruminate over.

Since time isn't a concern at the moment, I'll probably end up using all the comparison methods in the script together as nested loops. :)

Talk about overkill.


#12 ·  Posted (edited)

Hi all

In my work, I often need to compare two CSV files. The simple code that I have written for it is:

$File1 = "e:\File1.csv"
$File2 = "e:\File3.csv"

If FileGetSize($File1) = FileGetSize($File2) Then
    If FileRead($File1) = FileRead($File2) Then
        MsgBox(4096, "", "File content same.")
    Else
        MsgBox(4096, "", "File content different.")
    EndIf
Else
    MsgBox(4096, "", "File size different.")
EndIf

I just spotted an issue with your original post that needs pointing out.

This line is doing a case-insensitive comparison:

If FileRead($File1) = FileRead($File2) Then

which will not, of course, prove that the two files are identical.

Changing it to

If FileRead($File1) == FileRead($File2) Then

will do a case-sensitive comparison and achieve the goal of proving the two files are truly identical.
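
A small illustration of the difference between the two operators, using made-up CSV content (the strings below are assumptions for demonstration only). Both strings have the same length, so a FileGetSize check alone would not tell such files apart.

```autoit
; Two made-up CSV snippets that differ only in letter case.
Global $sA = "id,Name" & @CRLF & "1,Total"
Global $sB = "ID,NAME" & @CRLF & "1,TOTAL"

ConsoleWrite("=  : " & ($sA = $sB) & @CRLF)  ; case-insensitive: reports True
ConsoleWrite("== : " & ($sA == $sB) & @CRLF) ; case-sensitive: reports False
```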

Edited by Bowmore

"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning."- Rick Cook


Sorry for the late reply but thank you Bowmore. That was an awesome suggestion.

