Removing duplicates from file

covaks

Is there a quicker way to do this? I want to read one file (1.8MB in size, 253k words, one per line) and remove all the words that are found in another file (which has about 12,000 words).

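#include <File.au3>
#include <Array.au3>

Global $In, $Bad

; _FileReadToArray sizes the arrays itself and stores the line count in element [0]
_FileReadToArray("C:\Badwords.txt", $Bad)
_FileReadToArray("C:\wordlist.txt", $In)

Dim $Out[$In[0] + 1]
$Out[0] = 0 ; number of words kept so far

For $y = 1 To $In[0]
    ; Check each word against the whole bad list before deciding to keep it
    $found = False
    For $x = 1 To $Bad[0]
        If $In[$y] = $Bad[$x] Then
            $found = True
            ExitLoop
        EndIf
    Next
    If Not $found Then
        $Out[0] += 1
        $Out[$Out[0]] = $In[$y]
    EndIf
Next

; Trim the unused tail and write from index 1 so the count is not written out
ReDim $Out[$Out[0] + 1]
_FileWriteFromArray("C:\clean.txt", $Out, 1)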
Edited by covaks

Siao

I wouldn't use an array for the output unless you need that data in an array for something other than writing it to a file.

Concatenate the output into a single string and FileWrite it in one go after the loop. That should be more efficient.

Also, replacing the inner loop with StringRegExpReplace (or, if your data is strictly one word per line as you say, StringInStr + StringReplace) could be much faster than iterating through the whole of the larger array for each bad word. That means ditching the $In array too and reading the word list into one string.
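
As a rough illustration of the first suggestion only (build one output string, write it once), here is a minimal sketch. It still loops over arrays, so it is not the faster string-replace approach. The helper name _IsBad and the variables $aIn and $sOut are made up for this example, and the paths are the ones from the original post.

```autoit
#include <File.au3>

Global $aIn, $aBad, $sOut = ""

_FileReadToArray("C:\Badwords.txt", $aBad)
_FileReadToArray("C:\wordlist.txt", $aIn)

For $y = 1 To $aIn[0]
    ; Append each kept word to one string instead of filling an output array
    If Not _IsBad($aIn[$y], $aBad) Then $sOut &= $aIn[$y] & @CRLF
Next

; Single write after the loop
FileWrite("C:\clean.txt", $sOut)

; Hypothetical helper: True if $sWord appears in the list (case-insensitive "=")
Func _IsBad($sWord, ByRef $aList)
    For $i = 1 To $aList[0]
        If $sWord = $aList[$i] Then Return True
    Next
    Return False
EndFunc
```

Note that FileWrite given a file name appends to an existing file, so delete clean.txt first if you rerun this.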

Edited by Siao

"be smart, drink your wine"

Siao

Which would be something like this:

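#include <File.au3>

Global $aBad, $sIn

; Assumes wordlist.txt uses CRLF line endings
$sIn = FileRead("wordlist.txt")
_FileReadToArray("badwords.txt", $aBad)

For $x = 1 to $aBad[0]
    ; \Q...\E makes the word literal, in case it contains regex metacharacters
    $sIn = StringRegExpReplace($sIn, '(\A|\n)\Q' & $aBad[$x] & '\E(\r|\z)', '')
Next

FileWrite("clean.txt", $sIn)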

or

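#include <File.au3>

Global $aBad, $sIn

; Wrap the text in line feeds so every word is delimited on both sides
$sIn = @LF & StringStripCR(FileRead("wordlist.txt")) & @LF
_FileReadToArray("badwords.txt", $aBad)

For $x = 1 to $aBad[0]
    ; Match whole lines only, so "cat" does not also clip the end of "bobcat"
    $sIn = StringReplace($sIn, @LF & $aBad[$x] & @LF, @LF)
Next

; Drop the delimiters added above, then restore Windows line endings
$sIn = StringTrimLeft($sIn, 1)
If StringRight($sIn, 1) = @LF Then $sIn = StringTrimRight($sIn, 1)
$sIn = StringReplace($sIn, @LF, @CRLF)

FileWrite("clean.txt", $sIn)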
Edited by Siao

"be smart, drink your wine"

covaks

Thank you very much. :-)

glasglow

$strings = StringReplace($strings, @CRLF & @CRLF, @CRLF)
