the fastest way to find duplicate lines [solved]

fenhanxue · November 21, 2017

I have a text file that contains 1,000,000 lines.

I want to delete the duplicate lines.

I tried the following code, but it runs very slowly:

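#include <Array.au3>
#include <File.au3>

Local $array
_FileReadToArray('test.txt', $array) ; by default $array[0] holds the line count
Local $aArrayUnique = _ArrayUnique($array, 0, 1) ; start at index 1 so the count is not treated as a line
_FileWriteFromArray('test_unique.txt', $aArrayUnique, 1) ; skip the count stored in element [0]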

Can anyone suggest a faster way? (I want the code to produce the result in no more than 20 seconds.)

Edited November 25, 2017 by fenhanxue
solved

jchd · November 21, 2017

You can probably somehow speed up de-dup of that many lines, but the main point is that a flat text file isn't suitable for a routine task like that. You'd benefit hugely from converting to a database file. FYI SQLite is pretty easy to use and well supported from AutoIt.
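
A minimal sketch of what the SQLite route might look like from AutoIt, not a tuned or benchmarked solution. It assumes the standard SQLite.au3 UDF is installed and that _SQLite_Startup() can find sqlite3.dll on the machine. The file names source.txt and output.txt and the table name lines are placeholders. A UNIQUE column combined with INSERT OR IGNORE drops repeated lines, and reading back by rowid keeps the first occurrence of each line in its original order.

```autoit
#include <SQLite.au3>

; Assumption: sqlite3.dll is available to _SQLite_Startup() on this machine.
_SQLite_Startup()
If @error Then Exit MsgBox(16, "SQLite", "sqlite3.dll could not be loaded")

_SQLite_Open() ; no file name given: uses an in-memory database
_SQLite_Exec(-1, "CREATE TABLE lines (line TEXT UNIQUE);")

; Load every line; duplicates are silently ignored by the UNIQUE constraint.
Local $hIn = FileOpen("source.txt") ; placeholder input file
_SQLite_Exec(-1, "BEGIN;") ; one transaction for all inserts
While 1
    Local $sLine = FileReadLine($hIn)
    If @error Then ExitLoop
    _SQLite_Exec(-1, "INSERT OR IGNORE INTO lines VALUES (" & _SQLite_FastEscape($sLine) & ");")
WEnd
_SQLite_Exec(-1, "COMMIT;")
FileClose($hIn)

; Write the surviving lines back out in insertion order.
Local $hQuery, $aRow
Local $hOut = FileOpen("output.txt", 2) ; placeholder output file, 2 = overwrite
_SQLite_Query(-1, "SELECT line FROM lines ORDER BY rowid;", $hQuery)
While _SQLite_FetchData($hQuery, $aRow) = $SQLITE_OK
    FileWriteLine($hOut, $aRow[0])
WEnd
FileClose($hOut)

_SQLite_Close()
_SQLite_Shutdown()
```

Wrapping the inserts in a single transaction matters here: without it, every INSERT is committed on its own, which is usually far slower for this many rows.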

SlackerAl · November 21, 2017

Whilst I completely agree with jchd, if you need a quick, one-off solution to this, you will find (quick Google) that most of the more robust text editors, e.g. Notepad++ (free) and UltraEdit (commercial license), have ready-made solutions for this problem.

KaFu · November 21, 2017

$oBuffer = ObjCreate('Scripting.Dictionary')
$h_File_Source = FileOpen("source.txt")
$h_File_Output = FileOpen("output.txt", 2)
While 1
    $sLine = FileReadLine($h_File_Source)
    If @error Then ExitLoop
    ; if in check buffer skip line
    If $oBuffer.Exists($sLine) Then ContinueLoop
    ; write line to output file
    FileWriteLine($h_File_Output, $sLine)
    ; Add to duplicate check buffer
    $oBuffer.Item($sLine) = 1
WEnd
FileClose($h_File_Source)
FileClose($h_File_Output)
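
One detail worth knowing about the loop above: Scripting.Dictionary compares keys in binary mode by default, so "Hello" and "HELLO" are kept as two different lines. If lines that differ only in case should count as duplicates, the CompareMode property can be switched to text comparison before any key is added. A small sketch follows; the key strings are only illustrative.

```autoit
Local $oBuffer = ObjCreate('Scripting.Dictionary')

; 0 = binary compare (default, case-sensitive), 1 = text compare (case-insensitive).
; CompareMode can only be changed while the dictionary is still empty.
$oBuffer.CompareMode = 1

$oBuffer.Item("Hello") = 1

; With text compare, a key differing only in case is reported as already present.
ConsoleWrite("HELLO already seen: " & $oBuffer.Exists("HELLO") & @CRLF)
```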

junkew · November 21, 2017

Let PowerShell do the job:

https://stackoverflow.com/questions/32385611/sort-very-large-text-file-in-powershell

Malkey · November 22, 2017

Here is a PowerShell method of removing duplicate lines from an unsorted file, from within AutoIt.

It may be faster on big files. On small files, KaFu's example is faster.

; Remove Duplicate Rows From A Text File Using Powershell... unsorted file, where order is important.
; Command from:  http://www.secretgeek.net/ps_duplicates

Local $hTimer = TimerInit()
local $sFileIn =  @ScriptDir & '\temp-6.txt'
local $sFileOut = @ScriptDir & '\newlist.txt'
Local $sCmd = '$hash = @{}' & @CRLF & _
        'gc ' & $sFileIn & '| % {if ($hash.$_ -eq $null) {$_} $hash.$_ = 1;} > ' & $sFileOut

RunWait('"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" ' & $sCmd, "", @SW_HIDE, 2 + 4) ; Run command in PowerShell.
ConsoleWrite("Time Taken: " & round(TimerDiff($hTimer)/1000,4) & "Secs" & @CRLF)
ShellExecute($sFileOut) ; See unique file.

fenhanxue · November 22, 2017

Thank you for your help, @KaFu @Malkey.

I do not know much about Scripting.Dictionary.

I wonder whether the code ( ObjCreate('Scripting.Dictionary') ) will work on every computer?

KaFu · November 22, 2017

41 minutes ago, fenhanxue said:

I wonder whether the code ( ObjCreate('Scripting.Dictionary') ) will work on every computer?

AFAIK it's part of WSH and should be available by default from XP SP3 onwards:

https://en.wikipedia.org/wiki/Windows_Script_Host
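
If availability on a given machine is in doubt, the script can check the result of ObjCreate before using the object rather than assuming it worked. A minimal sketch; the message text and exit code are arbitrary choices, not part of any posted solution.

```autoit
Local $oBuffer = ObjCreate('Scripting.Dictionary')

; ObjCreate sets @error and does not return an object when the COM class is unavailable.
If @error Or Not IsObj($oBuffer) Then
    MsgBox(16, "Error", "Scripting.Dictionary is not available on this computer.")
    Exit 1
EndIf

; From here on the dictionary can be used as in the de-duplication loop.
$oBuffer.Item("example") = 1
ConsoleWrite("Dictionary is usable, items stored: " & $oBuffer.Count & @CRLF)
```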

fenhanxue · November 22, 2017

Thank you very much for your help
