quickest remove duplicate lines method?

ade · April 3, 2011

Hello,

I am looking to find the quickest way to remove duplicate lines (but leaving 1 instance intact) from a set of text files.

I have gleaned and modified code I found on here, but the current result is too slow. I tested with a text file of 100,000 lines, which takes 40 seconds to complete. I then tested with a file of 5 million lines; when it didn't crash with an "out of memory" error it used 1 GB of memory and 100% of the processor, and I had to stop it after 15 minutes as I had no idea when, or even if, it was going to finish.

The code in this post by jchd reads into an array then assigns the lines to a SQLite db to do the removing of duplicates and writes out the results to a text file. It removed all instances of duplicates so I had to modify it. My modified code is:

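#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_outfile=removedups.exe
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****
#include <SQLite.au3>
#include <SQLite.Dll.au3>
#include <Array.au3>
#include <File.au3> ; _FileListToArray, _FileReadToArray, _FileWriteFromArray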

Main()

; keeps a single occurrence of each exact same (with respect to case, see below) text line found in a group of text files
Func Main()

    ; init SQLite
    _SQLite_Startup()

    ; create a :memory: DB
    Local $hDB = _SQLite_Open()

    ; WARNING: this will work as intended, for ASCII or Unicode, with respect to case
    ;       lower ASCII compares *-without-* respect to case can still be done efficiently by using COLLATE NOCASE
    ;       universal Unicode compares without respect to case need a bit more complex setup (but can still be called efficient)
    _SQLite_Exec($hDB,  "CREATE TABLE Strings (String CHAR, Row INTEGER, PRIMARY KEY (String, Row) ON CONFLICT IGNORE);")

    ; get the list of input files (may process any number of files in the same run)
    Local $dir = "input path"
    Local $files = _FileListToArray($dir & "\", '*.txt', 1)
    If @error Then Return

    ; process input files
    Local $txtstr
    For $i = 1 to $files[0]
        ConsoleWrite("Processing file " & $dir & $files[$i] & @LF)
        _FileReadToArray($dir & $files[$i], $txtstr)
        If @error Then ContinueLoop ; skip files that could not be read
        _SQLite_Exec($hDB, "begin;")
        If Not @error Then
            For $j = 1 To $txtstr[0]
                _SQLite_Exec($hDB, "insert into Strings (String, Row) values (" & _SQLite_Escape($txtstr[$j]) & "," & $j & ");")
            Next
        EndIf
        _SQLite_Exec($hDB, "commit;")
    Next
    ; store remaining data in output files
    Local $nrows, $ncols
    ConsoleWrite("Creating output files" & @LF)
        _SQLite_GetTable($hDB, "select String from Strings where not exists (select 1 from Strings Y where Strings.String = Y.String and Strings.Row > Y.Row);",$txtstr, $nrows, $ncols)
        _FileWriteFromArray($dir & 'output.txt', $txtstr, 2)

EndFunc

Now this works (as stated above, 100,000 lines in 40 seconds), but I am sure it could be improved to make it quicker, as my modifications are probably not the best way of doing it. I also tried testing without an array (because of the memory issue), using FileOpen and FileReadLine to insert each line into the SQLite db instead of writing to an array, but it was a lot slower.

When using a huge text file (5 million lines), the array method requires a LOT of memory, too much, so I am thinking that a method that processes the lines as it reads them would be better?

I have seen a .net program complete the file of 5 million lines in about 30 seconds (pretty much the time it takes my AutoIt code to complete 100,000), so it must be possible. It is just a matter of knowing the correct method and implementing it correctly, which I am sure is where I am going wrong.

I thought of a couple of methods, but I am not convinced they would be quicker:

sort the text file then only write the current line to the output file if current line != previous line

populate an array, sort the array and write the current element to the output file if current element != previous element
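Both ideas above come down to the same technique: sort, then keep a line only when it differs from the line before it. A minimal sketch follows, written in Python rather than AutoIt purely for illustration; the function name `dedupe_sorted` is made up for this example. Note that the result comes out in sorted order, not in the original file order.

```python
def dedupe_sorted(lines):
    """Sort the lines, then keep each line only if it differs from its predecessor."""
    unique = []
    previous = None  # nothing has been seen yet
    for line in sorted(lines):
        if line != previous:
            unique.append(line)
        previous = line
    return unique
```

The sort dominates the cost, so this approach is only as quick as the sort routine available.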

As you can see, I have no idea what functions, methods are quick to execute and which are no gos. So, what method would allow huge text files but remove the duplicates quickly without having excessive memory requirements?

Can someone please point me in the right direction, or tell me where I am going wrong, so I can get a remove-duplicates function as quick (or near enough) as the .net one I mentioned above?

Thanks!

MvGulik · April 3, 2011

Don't try to read and process the whole data set in one go. (Text lines from a single big file in this case, but it's a general principle ... Data Caching.)

Or, read and process the data in chunks: 1000, 1k or 10k lines per chunk, or chunks of ~X bytes/characters if the general line size is unknown (without clipping lines, of course).
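As a rough illustration of chunked reading, here is a sketch in Python rather than AutoIt. The names `iter_chunks` and `dedupe_in_chunks` and the default chunk size are assumptions made for this example. Only whole lines are read, so no line is clipped; the set of lines already seen still grows with the number of unique lines.

```python
from itertools import islice


def iter_chunks(handle, chunk_lines=10_000):
    """Yield lists of at most chunk_lines whole lines from an open text file."""
    while True:
        chunk = list(islice(handle, chunk_lines))
        if not chunk:
            return
        yield chunk


def dedupe_in_chunks(handle, chunk_lines=10_000):
    """Return the first occurrence of every line, reading the input chunk by chunk."""
    seen = set()
    unique = []
    for chunk in iter_chunks(handle, chunk_lines):
        for line in chunk:
            text = line.rstrip("\r\n")  # compare lines without their line endings
            if text not in seen:
                seen.add(text)
                unique.append(text)
    return unique
```

With a real file this would be called as `dedupe_in_chunks(open("test.txt", encoding="utf-8"), 10_000)`; trying a few chunk sizes is the way to find the speed versus memory balance.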

---

so I am thinking that a method that processes the lines as it reads them would be better?

Probably not. Or, would probably be very memory lean, but at the cost of its speed. Balance things out by trying to find a compromise between the two. Speed versus memory use <-> Different Chunk size.

---

last:

When comparing something where the item size might be a problem, like images or text paragraphs, you could opt to not store/use the raw data in your compare, but a hash value of the data, like a CRC value. The additional processing step will have its own speed penalty, but also has its own upsides (and possible limitations: a hash compare is not case insensitive for text, for example).

Edited April 3, 2011 by iEvKI3gv9Wrkd41u

jchd · April 4, 2011

ade,

Don't compare a compiled language speed with an interpreter like AutoIt, they are not playing the same game.

You can probably speed up things depending on the actual contents of your text data. If lines are long and similar enough, comparing them needs some time and also requires a fair amount of memory. In this case you might find it better to store some hash (MD5 will be OK) instead of the string itself. In this case, declare the column as blob and store the binary hash.

Use the search feature to locate fast MD5 UDFs.
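The hash idea can be sketched like this, in Python rather than AutoIt and with a made-up function name `dedupe_by_hash`. It keeps a 16-byte MD5 digest per unique line instead of the line itself, which only pays off when lines are long. It assumes that two different lines never share a digest, which is extremely unlikely with MD5 but much less safe with a 32-bit CRC over millions of lines.

```python
import hashlib


def dedupe_by_hash(lines):
    """Keep the first occurrence of each line, remembering only MD5 digests."""
    seen_digests = set()
    unique = []
    for line in lines:
        digest = hashlib.md5(line.encode("utf-8")).digest()  # 16 raw bytes
        if digest not in seen_digests:
            seen_digests.add(digest)
            unique.append(line)
    return unique
```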

ade · April 4, 2011

Probably not. Or, would probably be very memory lean, but at the cost of its speed. Balance things out by trying to find a compromise between the two. Speed versus memory use <-> Different Chunk size.

Yeah, that is a good point. I will have to experiment with the chunk sizes.

When comparing something where the items size might be a problem, like images or text paragraphs, you could opt to not store/use the raw data of it in your compare, but a hash value of the data. like a crc value. Additional processing step will have its own speed penalty, but also has its own up sides. (and possible limitations, as not case insensitive for text for example.)

I like the idea, but my text lines are only 50 or so characters long, so I don't think the extra processing step would be worthwhile in this case.

Thanks for your input iEvKI3gv9Wrkd41u that has definitely cleared up the memory requirement issue, but the issue remains of making the whole process quicker than my code above. Can anyone offer some advice in regards to this aspect?

Thanks!

KaFu · April 4, 2011

Using the Windows 'Scripting.Dictionary' also seems to be quite a fast method to remove duplicate rows. Of course the attached example needs some extra polishing regarding reading and processing the input file in chunks; it's just a quick proof of concept.

#include <file.au3>
Dim $aRecords

$timer = TimerInit()
Global $oDict = ObjCreate('Scripting.Dictionary')
$oDict.CompareMode = 1 ; 1 = text compare (case-insensitive); 0 = binary compare (case-sensitive)

If Not _FileReadToArray("test.txt", $aRecords) Then
    MsgBox(4096, "Error", " Error reading log to Array     error:" & @error)
    Exit
EndIf
ConsoleWrite("Records ORG: " & $aRecords[0] & @CRLF)
ConsoleWrite(TimerDiff($timer) & @CRLF & @CRLF)
For $x = 1 To $aRecords[0]
    If Not $oDict.Exists($aRecords[$x]) Then
        $oDict.Add($aRecords[$x], 1)
    EndIf
Next
ConsoleWrite("Records Dup Removed: " & $oDict.Count() & @CRLF)
ConsoleWrite(TimerDiff($timer) & @CRLF & @CRLF)

; Non-duplicate rows
For $i In $oDict.Keys()
    ConsoleWrite($i & @crlf)
Next
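For comparison, the same dictionary technique can be sketched in Python (not AutoIt; the function name `dedupe_dict` and its `ignore_case` parameter are invented for this example). With `CompareMode = 1` the Scripting.Dictionary compares keys as text, so "Apple" and "apple" count as the same line; the sketch mimics that by lower-casing the key while keeping the first spelling seen.

```python
def dedupe_dict(lines, ignore_case=True):
    """Keep the first occurrence of each line, using a dictionary keyed by the line."""
    first_seen = {}
    for line in lines:
        key = line.lower() if ignore_case else line
        if key not in first_seen:
            first_seen[key] = line  # remember the original spelling
    # dictionaries preserve insertion order (Python 3.7+), so file order is kept
    return list(first_seen.values())
```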

ade · April 4, 2011

ade,
Don't compare a compiled language speed with an interpreter like AutoIt, they are not playing the same game.
You can probably speed up things depending on the actual contents of your text data. If lines are long and similar enough, comparing them needs some time and also requires a fair amount of memory. In this case you might find it better to store some hash (MD5 will be OK) instead of the string itself. In this case, declare the column as blob and store the binary hash.
Use the search feature to locate fast MD5 UDFs.

So what do you think realistically would be the quickest I could remove dups from a text file of 5 million lines in an interpreter language, taking into account the .net program can do it in about 30 seconds (don't know method employed in .net program)?

I will look into the hashes method but a quick question beforehand, would the calculation of the hash and then comparing the hashes be quicker than just comparing the lines?

Thanks for your help

jchd · April 4, 2011

So what do you think realistically would be the quickest I could remove dups from a text file of 5 million lines in an interpreter language, taking into account the .net program can do it in about 30 seconds (don't know method employed in .net program)?

5M lines * 50 chars is something like 250 MB, quite feasible with stock PCs. Say 300 MB with some extra space needed for indexing (whatever method used) and some structure to store data, and we're still under what a browser needs for a decent 20-50 tabs opened in daily operations.

Now about speed, you can try another approach: remove indexing from the table declaration (inserts are slower when indexes are declared) and query with an aggregate, like:

select * from strings group by string having count(*) = 1; -- return rows whose strings are unique
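To see what the aggregate query returns, here is a small check using Python's built-in sqlite3 module (Python instead of AutoIt, with helper names invented for this example). `HAVING count(*) = 1` returns only strings that were never duplicated, while dropping the `HAVING` clause returns one row per distinct string.

```python
import sqlite3


def load(lines):
    """Create an in-memory table shaped like the Strings table and fill it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Strings (String CHAR)")
    conn.executemany("INSERT INTO Strings (String) VALUES (?)",
                     [(line,) for line in lines])
    return conn


def strings_occurring_once(conn):
    """Strings that appear exactly once; duplicated strings vanish entirely."""
    query = ("SELECT String FROM Strings GROUP BY String "
             "HAVING count(*) = 1 ORDER BY String")
    return [row[0] for row in conn.execute(query)]


def one_of_each_string(conn):
    """One row per distinct string, duplicated or not."""
    query = "SELECT String FROM Strings GROUP BY String ORDER BY String"
    return [row[0] for row in conn.execute(query)]
```

For the input `a, b, a, c` the first query gives `b, c` and the second gives `a, b, c`.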

OTOH, can you try running the script as it is with TimerDiff calls to show where the program is spending time: is it for storing strings or getting the result? There are two issues here: first, we are bound to using DllCalls and this is far slower than direct function invocation in a compiled language. Second, fetching the results from a _SQLite_GetTable* is kind of slow as well. There is a possibility that using an ODBC driver and fetching the result as a native array at once would be really quicker (or at least less time intensive). Only experiments in real world situations can tell.

As already explained both in this thread and in previous threads on very similar subjects, a Scripting.Dictionary can do some good runtime-wise, but is less flexible as to what you can store and retrieve.

Then (another road!) is this for routine or rare use? If you only have to do it rarely you can still rely on extra software.

I will look into the hashes method but a quick question beforehand, would the calculation of the hash and then comparing the hashes be quicker than just comparing the lines?

That all depends on the actual contents of your data. As you say your typical input is circa 50 chars, I highly doubt running the hash beforehand would be faster. Longer lines differing only near their end are likely to cause much longer runtime than the hash method, but the limit depends on a number of factors linked to your actual samples.

ade · April 4, 2011

Using the Windows 'Scripting.Dictionary' also seems to be quite a fast method to remove duplicate rows. Of course the attached example needs some extra polishing regarding reading and processing the input file in chunks; it's just a quick proof of concept.

Thanks a lot for the example script. I really appreciate it as that would have taken me ages to write, get errors, modify, read more, modify, get errors, etc.

As you provided a working example I was able to complete a test with the same file of 100,000 lines and this is the result I got:

SQlite code in original post: 53 seconds

KaFu's Scripting example: 84 seconds

I am guessing that the SQLite would be faster than the Windows Scripting regardless of input file size, although please do correct me if I am wrong.

Thanks KaFu!

KaFu · April 4, 2011

KaFu's Scripting example: 84 seconds

I wonder how you parsed the output. I changed the last few lines; please re-test the attached script.

#include <file.au3>
Dim $aRecords, $sOut

$timer = TimerInit()

Global $oDict = ObjCreate('Scripting.Dictionary')
$oDict.CompareMode = 1 ; 1 = text compare (case-insensitive); 0 = binary compare (case-sensitive)

If Not _FileReadToArray("test.txt", $aRecords) Then
    MsgBox(4096, "Error", " Error reading log to Array     error:" & @error)
    Exit
EndIf

ConsoleWrite("Records ORG: " & $aRecords[0] & @CRLF)
ConsoleWrite(TimerDiff($timer) & @CRLF & @CRLF)

For $x = 1 To $aRecords[0]
    If Not $oDict.Exists($aRecords[$x]) Then
        $oDict.Add($aRecords[$x], 1)
        $sOut &= $aRecords[$x] & @CRLF
    EndIf
Next

ConsoleWrite("Records Dup Removed: " & $oDict.Count() & @CRLF)
ConsoleWrite(TimerDiff($timer) & @CRLF & @CRLF)

; Non-duplicate rows
FileWrite("test_OUT.txt", $sOut)

Edit: Directly appending non-duplicate lines to the output file should be even faster.
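A sketch of that streaming variant, in Python rather than AutoIt and with an invented function name `dedupe_stream`: each line is checked as it is read and written out immediately, so the whole file never has to sit in an array. Memory use is then bounded by the set of unique lines rather than by the file size.

```python
def dedupe_stream(source, sink):
    """Copy lines from source to sink, skipping any line already written.

    source and sink are open text files (or file-like objects).
    Returns the number of lines written.
    """
    seen = set()
    written = 0
    for line in source:
        text = line.rstrip("\r\n")  # compare lines without their line endings
        if text not in seen:
            seen.add(text)
            sink.write(text + "\n")
            written += 1
    return written
```

Typical use would be `with open("test.txt") as src, open("test_OUT.txt", "w") as dst: dedupe_stream(src, dst)`.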

Edited April 4, 2011 by KaFu

ade · April 4, 2011

OTOH, can you try running the script as it is with TimerDiff calls to show where the program is spending time: is it for storing strings or getting the result? There are two issues here: first, we are bound to using DllCalls and this is far slower than direct function invocation in a compiled language. Second, fetching the results from a _SQLite_GetTable* is kind of slow as well. There is a possibility that using an ODBC driver and fetching the result as a native array at once would be really quicker (or at least less time intensive). Only experiments in real world situations can tell.

I modified the code as you suggested and put in timer diffs. This is the result:

#Region ;**** Directives created by AutoIt3Wrapper_GUI ****
#AutoIt3Wrapper_outfile=removedups.exe
#EndRegion ;**** Directives created by AutoIt3Wrapper_GUI ****
#include <SQLite.au3>
#include <SQLite.Dll.au3>
#include <Array.au3>

Main()

; removes every occurrence of exact same (with respect to case, see below) text line found elsewhere in a group of text files
Func Main()

    $timer = TimerInit()

    ; init SQLite
    _SQLite_Startup()

    ; create a :memory: DB
    Local $hDB = _SQLite_Open()

    ; create a single table, without an index this time
    ; (inserts are slower when indexes are declared; duplicates are handled by the GROUP BY query below)
    ;
    ; WARNING: this will work as intended, for ASCII or Unicode, with respect to case
    ;       lower ASCII compares *-without-* respect to case can still be done efficiently by using COLLATE NOCASE
    ;       universal Unicode compares without respect to case need a bit more complex setup (but can still be called efficient)
    _SQLite_Exec($hDB,  "CREATE TABLE Strings (String CHAR);")

    Local $dir = "input path"
    Local $files = _FileListToArray($dir & "\", '*.txt', 1)
    If @error Then Return

    ; process input files
    Local $txtstr
    For $i = 1 to $files[0]
        ConsoleWrite("Processing file " & $dir & $files[$i] & @LF)
        _FileReadToArray($dir & $files[$i], $txtstr)
        If @error Then ContinueLoop ; skip files that could not be read
        ConsoleWrite("Finished storing text file in array = " & TimerDiff($timer) & @CRLF & @CRLF)
        ; process input lines
        _SQLite_Exec($hDB, "begin;")
        If Not @error Then
            For $j = 1 To $txtstr[0]
                _SQLite_Exec($hDB, "insert into Strings (String) values (" & _SQLite_Escape($txtstr[$j]) & ");")
            Next
        EndIf
        _SQLite_Exec($hDB, "commit;")
    Next
    ; store remaining data in output files
    ConsoleWrite("Finished storing array in = " & TimerDiff($timer) & @CRLF & @CRLF)
    Local $nrows, $ncols
    ConsoleWrite("Creating output files" & @LF)
    _SQLite_GetTable($hDB, "select * from Strings group by string having count(*)=1",$txtstr, $nrows, $ncols)
    ConsoleWrite("Finished writing sqlite_gettable = " & TimerDiff($timer) & @CRLF & @CRLF)
    _FileWriteFromArray($dir & 'output.txt', $txtstr, 2)
    ConsoleWrite("Finished writing array to file = " & TimerDiff($timer) & @CRLF & @CRLF)
EndFunc

Finished storing text file in array = 477.650930495356

Finished storing array in = 28131.3714452535

Creating output files

Finished writing sqlite_gettable = 41666.9411894528

Finished writing array to file = 42417.3807006198

+>15:16:40 AutoIT3.exe ended.rc:0

>Exit code: 0 Time: 43.836

What do you think jchd?

Then (another road!) is this for routine or rare use? If you only have to do it rarely you can still rely on extra software.

I would use it routinely, and would love to have a quick way of manipulating text files as I am sure I could adapt the code to use in a variety of tasks, but alas I may have to resort to a separate program that I already have that removes duplicates very quickly. I just prefer having my own source code I can adapt rather than an out of the box solution that I cannot modify.

jchd · April 4, 2011

After having done some experiments, I believe this is close to the best you can get using AutoIt and its native UDFs.

;; includes standards
#include <SQLite.au3>
#include <SQLite.Dll.au3>
#include <Array.au3>


Main()


; leaves only a single occurrence of each exact same (with respect to case, see below) text line found in a group of text files
Func Main()

$timer = TimerInit()

    Local $dir = @ScriptDir & "\"

    ; init SQLite
    _SQLite_Startup()

    ; create a :memory: DB
    Local $hDB = _SQLite_Open()

    ; create a single table, with an index on text
    ; doing so will minimize the number of comparisons, and those compares are fast low-level code
    ;
    ; WARNING: this will work as intended, for ASCII or Unicode, with respect to case
    ;       lower ASCII compares *-without-* respect to case can still be done efficiently by using COLLATE NOCASE
    ;       universal Unicode compares without respect to case need a bit more complex setup (but can still be called efficient)
    _SQLite_Exec($hDB,  "CREATE TABLE Strings (id integer primary key autoincrement, String CHAR CONSTRAINT ksString1 UNIQUE ON CONFLICT IGNORE);")

    ; get the list of input files (may process any number of files in the same run)
    Local $files = _FileListToArray($dir, '*.txt', 1)
    If @error Then Return



    ; process input files
    Local $txtstr
    For $i = 1 to $files[0]
        ConsoleWrite("Processing file " & $dir & $files[$i] & @LF)
        _FileReadToArray($dir & $files[$i], $txtstr)
        If @error Then ContinueLoop ; skip files that could not be read
ConsoleWrite("Finished reading text file in array = " & TimerDiff($timer) & @CRLF & @CRLF)
$timer = TimerInit()

        ; process input lines
        _SQLite_Exec($hDB, "begin;")
        If Not @error Then
            For $j = 1 To $txtstr[0]
                _SQLite_Exec($hDB, "insert into Strings (String) values ('" & StringReplace($txtstr[$j], "'", "''", 0, 1) & "');")
            Next
        EndIf
        _SQLite_Exec($hDB, "commit;")
    Next
ConsoleWrite("Finished storing array in DB = " & TimerDiff($timer) & @CRLF & @CRLF)
$timer = TimerInit()
    ; free large array
    $txtstr = 0

    ; store remaining data in output file
    Local $nrows, $ncols
    ConsoleWrite("Creating output file" & @LF)
        _SQLite_GetTable($hDB, "select String from Strings;",$txtstr, $nrows, $ncols)
ConsoleWrite("Finished fetching unique lines from DB = " & TimerDiff($timer) & @CRLF & @CRLF)
$timer = TimerInit()
        _FileWriteFromArray($dir & 'output.uniq', $txtstr, 2)
ConsoleWrite("Finished writing array to file = " & TimerDiff($timer) & @CRLF & @CRLF)
$timer = TimerInit()

EndFunc

Note that now times are deltas (not cumulative).
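The key change in this version is the `UNIQUE ON CONFLICT IGNORE` constraint: SQLite silently drops any insert whose string already exists, so a plain `select` afterwards returns one copy of each line. The sketch below checks that behaviour with Python's built-in sqlite3 module (Python instead of AutoIt; the function name is invented, and `ORDER BY id` is added here to make the first-seen order explicit).

```python
import sqlite3


def dedupe_with_unique_constraint(lines):
    """Insert every line into a table that ignores duplicates, then read the survivors."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Strings ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "String CHAR CONSTRAINT ksString1 UNIQUE ON CONFLICT IGNORE)"
    )
    with conn:  # one transaction around all inserts, like begin; ... commit;
        conn.executemany("INSERT INTO Strings (String) VALUES (?)",
                         [(line,) for line in lines])
    rows = conn.execute("SELECT String FROM Strings ORDER BY id").fetchall()
    conn.close()
    return [row[0] for row in rows]
```

The comparison is case-sensitive with SQLite's default collation, matching the warning in the script's comments.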

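Of course, using a fast specialized tool (e.g. GNU coreutils uniq) gives far superior results. Note that uniq only collapses adjacent duplicate lines, so unsorted input has to be sorted first; sort -u does both in one step and is really fast too.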

EDIT: BTW you do have a really fast PC!

Edited April 5, 2011 by jchd

ade · April 5, 2011

Sorry for the delay in responding. I didn't get time to test and post yesterday.

I have now tested KaFu's and jchd's last scripts and both are very good.

I tested them with the same text file containing 1 million lines. There were hardly any duplicates so most of the lines had to be output to the file.

KaFu's script

Completed in 180 seconds. Hats off, very, very quick!

I also tried the following modification to test

$hOut = FileOpen("output.txt", 1) ; 1 = append mode

For $x = 1 To $aRecords[0]
    If Not $oDict.Exists($aRecords[$x]) Then
        $oDict.Add($aRecords[$x], 1)
        FileWrite($hOut, $aRecords[$x] & @CRLF) ; write each new line with its line ending
    EndIf
Next
FileClose($hOut)

This completed in 185 seconds, so 5 seconds slower than KaFu's version.

jchd's script

A big improvement here in the latest version. It completed the 1 million line text file in 236 seconds (the last version I posted took 500 seconds).

Although not as fast as the compiled language version (which it never will be as jchd pointed out), for me both methods are definitely now in the usable range. Which to use depends on the requirements. For pure duplicate removal it would have to be KaFu's windows scripting method as it is the fastest of the 2. If more flexible data manipulation is required then jchd's SQLite method would win the day (although I am also going to research windows scripting to see what other data manipulation tasks it is good for).

I would just like to say thank you very much for all of your help, especially jchd and KaFu! From going round in circles getting nowhere fast I now have 2 really good methods for text file manipulation to choose from, which I would never have had without your help.

All the best,

Ade

jchd · April 5, 2011

Glad to see you have some solutions working for you. For the record, I tried a file of 1M lines of 50 chars (random in a-z) and it took about 4 s with "uniq in.txt > out.txt".

KaFu · April 5, 2011

Nice to hear.

Although not as fast as the compiled language version (which it never will be as jchd pointed out), for me both methods are definitely now in the usable range.

Speaking of which... shortening the code will also increase the speed (for many iterations). Try something like this:

$c = @crlf
For $x = 1 To $a[0]
    If Not $o.Exists($a[$x]) Then
        $o.Add($a[$x], 1)
        $s &= $a[$x] & $c
    EndIf
Next

or, like I do for all my scripts, use the following Obfuscator parameters:

#Obfuscator_Parameters=/sf /sv /om /cs=0 /cn=0

Edited April 5, 2011 by KaFu

ade · April 5, 2011

Glad to see you have some solutions working for you. For the record, I tried a file of 1M lines of 50 chars (random in a-z) and it took about 4 s with "uniq in.txt > out.txt".

4 seconds! That is amazing! Was that on unix or windows? Would there be much difference between the 2?

Where portability doesn't matter I can see the advantage in installing GNU coreutils for windows and using the uniq command. I'm gonna go and install it right now to have a play around.

ade · April 5, 2011

Nice to hear.

Speaking of which... shortening the code will also increase the speed (for many iterations). Try something like this:
$c = @crlf
For $x = 1 To $a[0]
    If Not $o.Exists($a[$x]) Then
        $o.Add($a[$x], 1)
        $s &= $a[$x] & $c
    EndIf
Next
or, like I do for all my scripts, use the following Obfuscator parameters:
#Obfuscator_Parameters=/sf /sv /om /cs=0 /cn=0

I didn't even think about variable length making a difference! I will test the obfuscator parameters as well.

Thanks for the tips!

jchd · April 5, 2011

Yep, GNU coreutils for Win32. Install, run, no fuss.

MvGulik · April 5, 2011

Interesting outcome.

Zedna · June 18, 2011

In SQLite scripts, _SQLite_Exec() can be called once with several commands separated by ;

This way it will be much faster.

Here are links to some of my SQLite examples where it's used:

EDIT: in the SQLite version I recommend these optimizations:

- don't use primary key and CONSTRAINT in table declaration

- _SQLite_Exec($hDB, "begin;") before main loop

- _SQLite_Exec($hDB, "commit;") after main loop

EDIT2:

- use index on table if you use SELECT query with WHERE
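The batching idea can be sketched as follows, using Python's built-in sqlite3 module instead of AutoIt. The helper names `quote_sql` and `insert_batched` and the batch size are assumptions for this example. Many INSERT statements are joined with ; and sent in one call, wrapped in a single begin/commit pair, and single quotes are doubled the same way the StringReplace call in the script above does it. (In Python itself, parameterized queries would normally be preferred; the string building here only mirrors the _SQLite_Exec approach.)

```python
import sqlite3


def quote_sql(text):
    """Return text as an SQL string literal, doubling any single quotes."""
    return "'" + text.replace("'", "''") + "'"


def insert_batched(conn, lines, batch_size=500):
    """Insert lines into Strings, sending batch_size statements per call."""
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        statements = "".join(
            "INSERT INTO Strings (String) VALUES (" + quote_sql(line) + ");"
            for line in batch
        )
        # one call, one transaction, many statements separated by ;
        conn.executescript("BEGIN;" + statements + "COMMIT;")
```

A table without primary key or constraint, as recommended above, would be created first with `conn.execute("CREATE TABLE Strings (String CHAR)")`.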

Edited June 18, 2011 by Zedna

superbadguy · July 13, 2011

Has anyone tried Smart Duplicate Finder (http://smartduplicatefinder.com)? This thing puts the others to shame. Not only does it have CRC32 and bit-by-bit compare, it also supports music meta tag comparison.

You can specify individual folders or entire drives anywhere connected to your system.
