xdp22

Comparison of two files

18 posts in this topic

Hello :graduated: First of all, sorry for my bad English (I will use a translator).

I tried to make a script that compares two files and deletes every line that appears in both of them, but I don't know how to go about it. Here is an example of how it should work:

We have 2 files.

File1.txt

a
b
lol
c
d

File2.txt

b
a
lol2
d
c

Lines "a", "b", "c" and "d" exist in both File1.txt and File2.txt, so I want to make a program that deletes those lines. After running the program, the files should look like this:

File1.txt

lol

File2.txt

lol2

Thank you guys

I was really trying, but my code doesn't work ^^ Here it is as proof that I was trying :(

func lol1()
    $line1 = 0
    $line2 = 0
    $file1 = "test1.txt"
    $file2 = "test2.txt"
    $test1 = FileReadLine($file1, $line1 + 1)
    $test2 = FileReadLine($file2, $line2 + 1)
    _FileReadToArray($test1, $test2)
    If StringInStr($test1, $test2) Then _FileWriteToLine($test1, "", "", $line1 + 1)
    lol1()
EndFunc
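For comparison, the behaviour the post asks for can be sketched in Python (an illustration only, not AutoIt; being set-based, it treats repeated lines within one file as a single line):

```python
def remove_common(lines1, lines2):
    # Lines present in both files are dropped from each; set lookups
    # make each membership test O(1) on average.
    common = set(lines1) & set(lines2)
    keep1 = [ln for ln in lines1 if ln not in common]
    keep2 = [ln for ln in lines2 if ln not in common]
    return keep1, keep2

# The example from the post:
file1 = ["a", "b", "lol", "c", "d"]
file2 = ["b", "a", "lol2", "d", "c"]
print(remove_common(file1, file2))  # (['lol'], ['lol2'])
```

Reading and rewriting the actual files is then just a matter of splitlines() and join().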




#2 ·  Posted (edited)

If only you had Linux (or Cygwin), you could do this a whole lot more easily :graduated:

But this is a Windows, Windows world...

#include <Array.au3>

Dim $array1[3]
$array1[0]="a"
$array1[1]="b"
$array1[2]="c"

Dim $array2[3]
$array2[0]="b"
$array2[1]="c"
$array2[2]="d"

deduplicate($array1,$array2)

_ArrayDisplay($array1)
_ArrayDisplay($array2)

Exit

Func deduplicate(ByRef $ar1, ByRef $ar2)
    $posInArray2 = 0
    For $i = 0 To UBound($ar1)-1
        ; look for $ar1[$i] in $ar2
        $posInArray2 = _ArraySearch($ar2,$ar1[$i])
        If $posInArray2 > -1 Then
            ; if found, delete line from both arrays
            _ArrayDelete($ar1,$i)
            _ArrayDelete($ar2,$posInArray2)
            ; ... we need to search the same element again, because the next element just became the current element :)
            $i -= 1
        EndIf
        ; $ar1 could be smaller than before, so the loop might go out of bounds for $ar1. If so, quit loop, we're done.
        If $i >= UBound($ar1)-1 Then Return
    Next
EndFunc

EDIT: This would only work if the files contain unique lines :D So don't use this for something serious! :(
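The non-unique case flagged in that edit can be handled with a multiset, so each copy of a line cancels at most one copy in the other list. A Python sketch using collections.Counter as the multiset:

```python
from collections import Counter

def dedup_multiset(a, b):
    # Multiset intersection: how many copies of each line both lists share.
    common = Counter(a) & Counter(b)
    budget_a, budget_b = Counter(common), Counter(common)
    out_a, out_b = [], []
    for x in a:
        if budget_a[x] > 0:
            budget_a[x] -= 1   # this copy is cancelled by a copy in b
        else:
            out_a.append(x)
    for x in b:
        if budget_b[x] > 0:
            budget_b[x] -= 1
        else:
            out_b.append(x)
    return out_a, out_b
```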

Edited by SadBunny



I don't know how big these files are or how often the function will be called. This example works, but it is not very fast, because it opens and closes each file twice and uses _ArrayDelete.

It can be made faster, but it would get a little more complicated.

#include<file.au3>
#include<array.au3>

_RemoveDuplicateLines("test1.txt", "test2.txt")
Func _RemoveDuplicateLines($sFilePath1, $sFilePath2)
    Local $aFile1, $aFile2, $hFile1, $hFile2
    If Not _FileReadToArray($sFilePath1, $aFile1) Then Return SetError(1,1,0)
    If Not _FileReadToArray($sFilePath2, $aFile2) Then Return SetError(1,2,0)
    For $i = $aFile1[0] To 1 Step -1
        For $j = $aFile2[0] To 1 Step -1
            If $aFile1[$i] = $aFile2[$j] Then
                _ArrayDelete($aFile1,$i)
                _ArrayDelete($aFile2,$j)
                $aFile1[0] -= 1
                $aFile2[0] -= 1
                If Not $aFile1[0] Then ExitLoop 2 ;no point checking further once one file is completely empty
                If Not $aFile2[0] Then ExitLoop 2
                ExitLoop
            EndIf
        Next
    Next
    $aFile1[0] = ""
    $aFile2[0] = ""
    _FileWriteFromArray($sFilePath1,$aFile1,1)
    _FileWriteFromArray($sFilePath2,$aFile2,1)
EndFunc

Example is case-insensitive.


#4 ·  Posted (edited)

Thanks for both replies.

I will test these two versions, thanks very much :graduated:

@UP

These files are big, I think at least 1000 lines. Will that be okay?

Edit: Tvern's version works, but not always (thanks anyway); now I will test SadBunny's version.

Edit: Thanks very much. Could it work a little faster? xD (It's not bad at all, I'm just asking :()

Edited by xdp22


If by "not always" you mean it doesn't remove duplicate lines in $aFile2, remove the ExitLoop.


#7 ·  Posted (edited)

Sorry to be off-topic, but is there any chance you could give me a link to the translator you use?

It seems really good.

It was just Google Translate; here you are: http://translate.google.com

But it's not good :graduated: really, trust me.

@Tvern

Can you delete that for me? This code is too advanced for me; if I delete something, it will stop working :( Thank you ^^

Edited by xdp22


I realised that just removing that line would not work anyway. Try this:

I've commented it, so hopefully you'll understand how it works.

#include <File.au3>
#include <Array.au3>

_RemoveDuplicateLines("test1.txt", "test2.txt")
Func _RemoveDuplicateLines($sFilePath1, $sFilePath2)
    Local $aFile1, $aFile2, $fFound ;declare vars
    If Not _FileReadToArray($sFilePath1, $aFile1) Then Return SetError(1, 1, 0) ;only continue if the first file can be read
    If Not _FileReadToArray($sFilePath2, $aFile2) Then Return SetError(1, 2, 0) ;only continue if the second file can be read
    For $i = $aFile1[0] To 1 Step -1 ;loop through the first file array (backwards is best when using _ArrayDelete)
        $fFound = False
        For $j = $aFile2[0] To 1 Step -1 ;loop through the second array
            If $aFile1[$i] = $aFile2[$j] Then ;if an entry from the first array matches one from the second...
                $fFound = True ;set the found flag, so the entry can be deleted from the first array later
                _ArrayDelete($aFile2, $j) ;delete from the second array
                $aFile2[0] -= 1 ;reduce count by 1
                If Not $aFile2[0] Then ExitLoop ;second array is empty, stop scanning it (but still delete the match from the first array below)
            EndIf
        Next
        If $fFound Then ;if a match was found...
            _ArrayDelete($aFile1, $i) ;delete from first array
            $aFile1[0] -= 1 ;reduce count
            If Not $aFile1[0] Or Not $aFile2[0] Then ExitLoop ;done once either file is empty
        EndIf
    Next
    $aFile1[0] = ""
    $aFile2[0] = ""
    _FileWriteFromArray($sFilePath1, $aFile1, 1)
    _FileWriteFromArray($sFilePath2, $aFile2, 1)
EndFunc



Seriously, it must be good if your posts have been run through it.

Are you using that translator or not?




I believe there is a crossover curve in file sizes S1, S2 beyond which the naïve approach (arrays and nested loops: complexity in S1*S2 comparisons, plus an unknown number of _ArrayDelete calls) is slower than a dictionary or even an SQLite implementation. Using either a hash table or a B-tree brings the complexity into some vicinity of S1*log(S2) + S2*log(S1).

Also try to avoid repeated calls to _ArrayDelete when common strings are likely, since it is a rather lengthy function. A better way is either to make the array 2D and use a "found elsewhere" marker, or (probably faster) to empty the found strings in place and avoid copying them on output. Also use == instead of = where possible, to avoid invoking lengthy underlying comparison code.

Sorry, I don't have time to make examples of any of those alternative implementations.
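As a rough sketch of those two suggestions combined (hash-table lookups, and marking entries "dead" in place instead of deleting them), here is the idea in Python; the dict plays the role of the hash table and the dead flags replace _ArrayDelete:

```python
def dedup_mark_in_place(a, b):
    # Hash table: line text -> list of positions in b where it occurs.
    in_b = {}
    for j, x in enumerate(b):
        in_b.setdefault(x, []).append(j)
    dead_b = [False] * len(b)   # "found elsewhere" markers for b
    out_a = []
    for x in a:
        stack = in_b.get(x)
        if stack:
            dead_b[stack.pop()] = True   # cancel one copy in b, O(1)
        else:
            out_a.append(x)
    # Copy survivors only at output time -- no per-match array deletes.
    out_b = [x for j, x in enumerate(b) if not dead_b[j]]
    return out_a, out_b
```

Each line of either input is touched a constant number of times, so the whole pass is roughly O(S1 + S2) instead of O(S1*S2).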


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


#11 ·  Posted (edited)

whatever Edited by MvGulik

"Straight_and_Crooked_Thinking" : A "classic guide to ferreting out untruths, half-truths, and other distortions of facts in political and social discussions."
"The Secrets of Quantum Physics" : New and excellent 2 part documentary on Quantum Physics by Jim Al-Khalili. (Dec 2014)

"Believing what you know ain't so" ...

Knock Knock ...
 


Out of mere curiosity, I ran the following test:

input file F1 = 7815 lines, average 631 chars, total size 4934777 chars

input file F2 = same as F1 but with only 11 lines modified (so 7804 lines found in F1)

input file F3 = same as F1 but with only 3 lines unchanged (leaving 7812 unique lines)

Run time from SciTE: 16.891 s on a stock PC, which I feel isn't as bad as it may seem given the task at hand. There is still ample room for optimization.

This code will happily process any number of input files with complexity T * log(T), where T = total number of input lines across all files.

#include <SQLite.au3>
#include <SQLite.dll.au3>
#include <File.au3> ; needed for _FileListToArray, _FileReadToArray, _FileWriteFromArray
#include <Array.au3>

Main()

; removes every occurrence of a same text line (without respect to lower-ASCII case, see below) found elsewhere in a group of text files
Func Main()

    ; init SQLite
    _SQLite_Startup()

    ; create a :memory: DB
    Local $hDB = _SQLite_Open()

    ; create a single table, with an index on text and a trigger to delete strings "found elsewhere" right after insert
    ; doing so will minimize the number of comparisons, and those compares are fast low-level code
    ;
    ; WARNING: this will work as intended for lower ASCII without respect to case
    ;       Unicode compares *-with-* respect to case can be done efficiently by using COLLATE BINARY instead of NOCASE
    ;       universal Unicode compares without respect to case need a bit more complex setup (but can still be called efficient)
    _SQLite_Exec($hDB,  "CREATE TABLE Strings (LineNum INTEGER, Source INTEGER, String CHAR COLLATE NOCASE, " & _
                            "PRIMARY KEY (LineNum, Source));" & _
                        "CREATE INDEX ixString ON Strings (String COLLATE NOCASE);" & _
                        "CREATE TRIGGER trInsString AFTER INSERT ON Strings FOR EACH ROW " & _
                            "WHEN exists (select 1 from Strings where String = new.String and Source != new.Source) " & _
                                "BEGIN " & _
                                    "delete from Strings where String = new.String;" & _
                                "END;")

    ; get the list of input files (may process any number of files in the same run)
    Local $files = _FileListToArray(@ScriptDir & "\", '*.inputtxt', 1)
    If @error Then Return

    ; process input files
    Local $txtstr
    For $i = 1 to $files[0]
        _FileReadToArray($files[$i], $txtstr)
        ; process input lines
        If Not @error Then
            For $j = 1 To $txtstr[0]
                _SQLite_Exec($hDB, "insert into Strings (Linenum, Source, String) values (" & $j & "," & $i & "," & _SQLite_Escape($txtstr[$j]) & ");")
            Next
        EndIf
    Next

    ; store remaining data in output files
    Local $nrows, $ncols
    For $i = 1 to $files[0]
        ; select relevant strings left
        _SQLite_GetTable($hDB, "select String from Strings where Source = " & $i & ";", $txtstr, $nrows, $ncols)

        ; write to input filename + extra extension .uniq
        _FileWriteFromArray($files[$i] & '.uniq', $txtstr, 2)
    Next

EndFunc



Apologies for replying to myself.

I ran a simpler version, which is about as fast as I believe an SQLite implementation can get.

Results from a run over a set of 17 XML files totalling 2 696 614 lines and 81 518 277 bytes; details below.

D:\somepath>wc *.xml
   Lines     Words     Bytes
     500       597     15195  F2008-08rpleb.xml
    4077      4870    124056  F2008-09pckbx.xml
  114008    135374   3441076  F2008-10dykxv.xml
  161906    192098   4905017  F2008-11oowcq.xml
  180883    214616   5481306  F2008-12wrnvy.xml
  140594    166830   4261162  F2009-01wtiyu.xml
  152628    181129   4603496  F2009-02baozh.xml
  198300    234930   5993839  F2009-03mfbbb.xml
  199248    235940   6011069  F2009-04vykln.xml
  215067    255011   6497965  F2009-05fphdx.xml
  204878    242848   6198099  F2009-06ndtvq.xml
   93342    110597   2828989  F2009-07hqnms.xml
  180302    213717   5445789  F2009-08pgriq.xml
  221067    262122   6674143  F2009-09phjch.xml
  205817    244028   6222960  F2009-10jqmsg.xml
  227406    269189   6872148  F2009-11sclnn.xml
  196591    232751   5941968  F2009-12lgzkm.xml
 2696614   3196647  81518277  total

Input sizes (bytes):

    15 195  F2008-08rpleb.xml
   124 056  F2008-09pckbx.xml
 3 441 076  F2008-10dykxv.xml
 4 905 017  F2008-11oowcq.xml
 5 481 306  F2008-12wrnvy.xml
 4 261 162  F2009-01wtiyu.xml
 4 603 496  F2009-02baozh.xml
 5 993 839  F2009-03mfbbb.xml
 6 011 069  F2009-04vykln.xml
 6 497 965  F2009-05fphdx.xml
 6 198 099  F2009-06ndtvq.xml
 2 828 989  F2009-07hqnms.xml
 5 445 789  F2009-08pgriq.xml
 6 674 143  F2009-09phjch.xml
 6 222 960  F2009-10jqmsg.xml
 6 872 148  F2009-11sclnn.xml
 5 941 968  F2009-12lgzkm.xml

Output sizes (bytes):

       846  F2008-08rpleb.xml.uniq
     5 886  F2008-09pckbx.xml.uniq
   108 093  F2008-10dykxv.xml.uniq
   150 090  F2008-11oowcq.xml.uniq
   166 560  F2008-12wrnvy.xml.uniq
   124 052  F2009-01wtiyu.xml.uniq
   119 988  F2009-02baozh.xml.uniq
   157 815  F2009-03mfbbb.xml.uniq
   158 642  F2009-04vykln.xml.uniq
   169 914  F2009-05fphdx.xml.uniq
   167 236  F2009-06ndtvq.xml.uniq
    75 764  F2009-07hqnms.xml.uniq
   151 001  F2009-08pgriq.xml.uniq
   181 401  F2009-09phjch.xml.uniq
   176 872  F2009-10jqmsg.xml.uniq
   188 160  F2009-11sclnn.xml.uniq
   178 632  F2009-12lgzkm.xml.uniq

In this set there are a very large number of duplicate lines, both within the same file and across files.

This version makes no attempt to delete duplicate lines during insert; instead it extracts, at the output stage, those lines which have no copy elsewhere.

That turned out to make insertion about twice as fast (compared to the previous version using an insertion trigger) while only slowing down output a little (thanks to a good indexing choice). It also demonstrates that the ON CONFLICT IGNORE clause on the primary key lets you silently drop a row being inserted when it already exists in the DB.

>Exit code: 0 Time: 1241.077

I challenge anyone to make this significantly faster using only vanilla AutoIt-provided resources/UDFs, especially when run over a set of large files like the one exemplified above.

I cheated in making the index use a (faster) binary compare rather than a case-insensitive one, but :graduated:

Note that the code is rather simple and straightforward, and it naturally copes with as many files as needed.

#include <SQLite.au3>
#include <SQLite.dll.au3>
#include <File.au3> ; needed for _FileListToArray, _FileReadToArray, _FileWriteFromArray
#include <Array.au3>

Main()

; removes every occurrence of an exact same text line (with respect to case, see below) found elsewhere in a group of text files
Func Main()

    ; init SQLite
    _SQLite_Startup()

    ; create a :memory: DB
    Local $hDB = _SQLite_Open()

    ; create a single table, with an index on text
    ; doing so will minimize the number of comparisons, and those compares are fast low-level code
    ;
    ; WARNING: this will work as intended, for ASCII or Unicode, with respect to case
    ;       lower ASCII compares *-without-* respect to case can still be done efficiently by using COLLATE NOCASE
    ;       universal Unicode compares without respect to case need a bit more complex setup (but can still be called efficient)
    _SQLite_Exec($hDB,  "CREATE TABLE Strings (String CHAR, Source INTEGER, PRIMARY KEY (String, Source) ON CONFLICT IGNORE);")

    ; get the list of input files (may process any number of files in the same run)
    Local $dir = "your input path" ; with trailing backslash, e.g. @ScriptDir & "\"
    Local $files = _FileListToArray($dir, '*.inputtxt', 1)
    If @error Then Return

    ; process input files
    Local $txtstr
    For $i = 1 to $files[0]
        ConsoleWrite("Processing file " & $dir & $files[$i] & @LF)
        _FileReadToArray($dir & $files[$i], $txtstr)

        ; process input lines
        _SQLite_Exec($hDB, "begin;")
        If Not @error Then
            For $j = 1 To $txtstr[0]
                _SQLite_Exec($hDB, "insert into Strings (Source, String) values (" & $i & "," & _SQLite_Escape($txtstr[$j]) & ");")
            Next
        EndIf
        _SQLite_Exec($hDB, "commit;")
    Next

    ; store remaining data in output files
    Local $nrows, $ncols
    ConsoleWrite("Creating output files" & @LF)
    For $i = 1 to $files[0]
        ; select relevant strings left
        _SQLite_GetTable($hDB, "select String from Strings X where " & _
                                    "Source = " & $i & " and " & _
                                        "not exists (select 1 from Strings Y where Y.String = X.String and Y.Source != X.Source);", _
                                $txtstr, $nrows, $ncols)

        ; write to input filename + extra extension .uniq
        _FileWriteFromArray($dir & $files[$i] & '.uniq', $txtstr, 2)
    Next

EndFunc
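For readers without AutoIt at hand, the same schema trick (ON CONFLICT IGNORE at insert time, a NOT EXISTS filter at output time) can be tried with Python's built-in sqlite3 module; the toy data here is the two-file example from the opening post:

```python
import sqlite3

# PRIMARY KEY (String, Source) ON CONFLICT IGNORE silently drops
# within-file duplicate lines at insert time; cross-file duplicates
# are filtered out by the NOT EXISTS subquery at output time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Strings (String TEXT, Source INTEGER, "
            "PRIMARY KEY (String, Source) ON CONFLICT IGNORE)")

files = {1: ["a", "b", "lol", "c", "d"],
         2: ["b", "a", "lol2", "d", "c"]}
for src, lines in files.items():
    con.executemany("INSERT INTO Strings (String, Source) VALUES (?, ?)",
                    [(ln, src) for ln in lines])

uniq = {}
for src in files:
    rows = con.execute(
        "SELECT String FROM Strings X WHERE Source = ? AND NOT EXISTS "
        "(SELECT 1 FROM Strings Y WHERE Y.String = X.String "
        "AND Y.Source != X.Source)", (src,)).fetchall()
    uniq[src] = [r[0] for r in rows]

print(uniq)  # {1: ['lol'], 2: ['lol2']}
```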



#14 ·  Posted (edited)

whatever Edited by MvGulik



It's an example of how simple and powerful a low-level but efficient DB like SQLite can be in an otherwise non-DB application.

Add error checking as required; the code was made in a rush.





I know this thread is old, and I wasn't sure whether to start a new thread referencing this post or to just reply at the bottom of this one. Apologies in advance if this offends anyone!

The code by jchd is really nice; I have tried adapting it slightly, but to no avail. I am going around in circles, so I have posted here in the hope that someone more knowledgeable than me will help out.

What I would like to do is remove the duplicates, but not ALL of the duplicated instances: leave one of the duplicates intact. Is it possible to write the SQLite query so that it does that, and if so, what would it be?

Thanks!



It's usually better to start a new thread and link to the threads that might be relevant.

What you ask for sounds like the way the examples already work: they ensure that each entry is unique, and that the result contains all unique entries.

If you mean you want to allow one or more duplicates, then I think my example would be easier to adjust; I suspect the SQLite example would become a great deal slower if you found a way to make it work (but I am not that familiar with SQLite, and there might yet be an effective way to do it).

If you want to adjust my example, you should look at changing $fFound from a boolean to an int, increasing its value for each match found, and then deleting values once the number reaches the upper limit you want to allow.

I'm going to bed now, but I'll see if I can have a look tomorrow. (I was going to look into a more efficient _ArrayDelete anyway, so this is a good reason to do that.)
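That counting idea (a flag upgraded to a counter with an upper limit) can be sketched in Python; this hypothetical helper keeps at most n copies of each line within one list:

```python
from collections import Counter

def dedup_keep_n(lines, n=1):
    # Keep the first n occurrences of each line, drop the rest.
    seen = Counter()
    out = []
    for ln in lines:
        if seen[ln] < n:
            out.append(ln)
            seen[ln] += 1
    return out
```

In SQL terms, keeping exactly one copy per distinct line is simply SELECT String FROM Strings GROUP BY String.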


@ade

Next time, start a new thread so you have a better chance of attracting eyes.

Anyway, I would find it easier if you could restate your own distinct problem in your own words, preferably with a short example of the sample inputs (a few lines) and the intended result, covering all your practical cases.



