Remove (all) duplicate entries from a file


Hi,

Firstly, I'm new to AutoIt, so if I come across as a total novice I apologise in advance.

A brief description of what I want to do:

I have two text files that contain lists of objects. I want to remove everything that exists in file2 from file1 and output what is left to file3.

So if file1 contains:

cheese

cabbage

lego

truck

And file2 contains:

truck

cabbage

plank

Then I want to output a file that contains:

cheese

lego

I could have explained that better, but I hope it makes sense. I looked at ways to do this by manipulating both file1 and file2, but after much forum trawling and head scratching I thought it might be easier to combine the files and then remove the duplicates from the result. So far I have a script that combines the two files by appending the second file to the first, and I am using the following _ArrayRemoveDuplicates by nitro322 and SmOke_N, which successfully removes one copy of each duplicate. What I really need is to remove both copies.

;==================================================================
; Function Name:  _ArrayRemoveDuplicates()
;
; Description   :  Removes duplicate elements from an Array
; Parameter(s)   :  $avArray
;                  $iBase
;                  $iCaseSense
;                  $sDelimter
; Requirement(s) :  None
; Return Value(s):  On Success - Returns 1 and the cleaned up Array is set
;                  On Failure - Returns -1 and sets @Error
;                       @Error=1 $avArray is not an array
;                       @Error=2 $iBase is different from 0 or 1
;                       @Error=3 $iCaseSense is different from 0 or 1
; Author         :  uteotw, but ALL the credits go to nitro322 and SmOke_N, see link below
; Note(s)       :  None
; Link        ;  http://www.autoitscript.com/forum/index.php?showtopic=7821
; Example      ;  Yes
;==================================================================
Func _ArrayRemoveDuplicates(ByRef $avArray, $iBase = 0, $iCaseSense = 0, $sDelimter = "")
    Local $sHold = "", $avNewArray
   
    If Not IsArray($avArray) Then
        SetError(1)
        Return -1
    EndIf
    If Not ($iBase = 0 Or $iBase = 1) Then
        SetError(2)
        Return -1
    EndIf
    If $iBase = 1 AND $avArray[0] = 0 Then
        SetError(0)
        Return 0
    EndIf
    If Not ($iCaseSense = 0 Or $iCaseSense = 1) Then
        SetError(3)
        Return -1
    EndIf
    If $sDelimter = "" Then
        $sDelimter = Chr(01) & Chr(01)
    EndIf
 
    If $iBase = 0 Then
        For $i = $iBase To UBound($avArray) - 1
            If Not StringInStr($sDelimter & $sHold, $sDelimter & $avArray[$i] & $sDelimter, $iCaseSense) Then
                $sHold &= $avArray[$i] & $sDelimter
            EndIf
        Next
        $avNewArray = StringSplit(StringTrimRight($sHold, StringLen($sDelimter)), $sDelimter, 1)
        ReDim $avArray[$avNewArray[0]]
        For $i = 1 to $avNewArray[0]
            $avArray[$i-1] = $avNewArray[$i]
        Next
    ElseIf $iBase = 1 Then
        For $i= $iBase To UBound($avArray) - 1
            If Not StringInStr($sDelimter & $sHold, $sDelimter & $avArray[$i] & $sDelimter, $iCaseSense) Then
                $sHold &= $avArray[$i] & $sDelimter
            EndIf
        Next
        $avArray = StringSplit(StringTrimRight($sHold, StringLen($sDelimter)), $sDelimter, 1)
    EndIf

    Return 1
EndFunc

So my questions are: am I barking up the wrong tree entirely by looking at combining these files, and is there a better way for me to do this? I've been through the code above to see if there was some way to get it to delete both entries, but I am really new to this and am still trying to understand most of the code in there. If anyone could point me in the right direction I'd be very grateful.

Many Thanks


Here is a start of a thought process:

1) read the files into arrays, _FileReadToArray

2) run a "For...In...Next" loop on the second array, getting each element into a variable

3) In the loop, Use _ArraySearch to search for that element in the first array.

4) If it exists, delete it from the first array, _ArrayDelete

5) Export the first array, which should have only unique items, _FileWriteFromArray
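
The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming the current Array.au3 and File.au3 UDFs; the file names are placeholders, and a counted For loop is used instead of For...In so that the line count stored in element [0] is skipped.

```autoit
#include <Array.au3>
#include <File.au3>

; Placeholder file names, adjust to your own files
Local $aFile1, $aFile2

; 1) Read both files into arrays. By default element [0] holds the line count.
If Not _FileReadToArray("file1.txt", $aFile1) Then Exit 1
If Not _FileReadToArray("file2.txt", $aFile2) Then Exit 1

; 2) Walk through every line of the second file
For $i = 1 To $aFile2[0]
    ; 3) Search the first array, starting at index 1 so the count in [0] is never matched
    Local $iFound = _ArraySearch($aFile1, $aFile2[$i], 1)
    While $iFound <> -1
        ; 4) Delete every occurrence, not just the first one
        _ArrayDelete($aFile1, $iFound)
        $iFound = _ArraySearch($aFile1, $aFile2[$i], 1)
    WEnd
Next

; 5) Write what is left, starting at index 1.
;    The count in [0] is stale after the deletions, so it is skipped rather than used.
_FileWriteFromArray("file3.txt", $aFile1, 1)
```

Note that _ArraySearch compares case-insensitively by default, and that nothing is written if every line of the first file was removed.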

Hope this helps and Good Luck!

Bob

Edited by YellowLab

You can't see a rainbow without first experiencing the rain.

That's great, thanks for the response :D

Link to comment
Share on other sites

I programmed this while YellowLab was typing, but it is probably still worth something. And it works without the Array include! (It is there, but commented out.)

It probably isn't the best programming, but it gets you where you want to go, I think.

#cs ----------------------------------------------------------------------------

 AutoIt Version: 3.2.10.0
 Author:         Triblade

 Script Function:
    Adds two files together without duplicates in $openfile3

#ce ----------------------------------------------------------------------------

; Script Start - Add your code below here

;#include <Array.au3>

$openfile1 = FileOpen("1.txt", 0)
$openfile2 = FileOpen("2.txt", 0)
$openfile3 = FileOpen("3.txt", 2)

$file1 = StringReplace(FileRead($openfile1), @CRLF, "*!*")
$file2 = StringReplace(FileRead($openfile2), @CRLF, "*!*")

FileClose($openfile1)
FileClose($openfile2)

$array1 = StringSplit($file1, "*!*", 1) ; Flag 1: treat "*!*" as one separator, not three separate characters
$array2 = StringSplit($file2, "*!*", 1)

For $i = 1 To $array1[0]
    For $j = 1 To $array2[0]
        If StringLower($array1[$i]) = StringLower($array2[$j]) Then
            $array1[$i] = ""
            $array2[$j] = ""
            ExitLoop
        EndIf
    Next
Next

;_ArrayDisplay($array1)
;_ArrayDisplay($array2)

For $i = 1 To $array1[0]
    If $array1[$i] <> "" Then FileWriteLine($openfile3, $array1[$i])
Next

For $i = 1 To $array2[0]
    If $array2[$i] <> "" Then FileWriteLine($openfile3, $array2[$i])
Next

FileClose($openfile3)

Exit

My active project(s): A-maze-ing generator (generates a maze)

My archived project(s): Pong3 (Multi-pinger)


I think based on the OP, this loop is not necessary. If I am not mistaken, this will add unique entries from file 2 to file 3 which I don't think is what is intended.

SNIP...

For $i = 1 To $array2[0]
    If $array2[$i] <> "" Then FileWriteLine($openfile3, $array2[$i])
Next

$openfile1 = FileOpen("1.txt", 0)
$openfile2 = FileOpen("2.txt", 0)
$openfile3 = FileOpen("3.txt", 2)

$file1 = StringReplace(FileRead($openfile1), @CRLF, "*!*")
$file2 = StringReplace(FileRead($openfile2), @CRLF, "*!*")

FileClose($openfile1)
FileClose($openfile2)

$array2 = StringSplit($file2, "*!*", 1) ; Flag 1: treat "*!*" as one separator
For $i = 1 To $array2[0]
    ; StringReplace returns the changed string, so assign it back; skip blank entries
    ; Note: an entry only matches when a separator follows it, so 1.txt must end with a line break,
    ; and the tail of a longer entry (e.g. "truck" in "firetruck") matches as well
    If $array2[$i] <> "" Then $file1 = StringReplace($file1, $array2[$i] & "*!*", "", 0, 0)
Next

$file1 = StringReplace($file1, "*!*", @CRLF)
FileWrite($openfile3, $file1)
FileClose($openfile3)

This should be faster still.

Edited by YellowLab


Triblade, that works perfectly after I commented out the lines mentioned by YellowLab. Thanks to both of you for the assistance. Now I can go through everything and try to figure out what it's all doing!! :D


The script below is the same as my first one.

The only difference is that it is now commented.

I hope that makes it easier to understand...

;#include <Array.au3>

$openfile1 = FileOpen("1.txt", 0) ; Opens the first file for reading and stores its handle in the variable
$openfile2 = FileOpen("2.txt", 0)
$openfile3 = FileOpen("3.txt", 2)

$file1 = StringReplace(FileRead($openfile1), @CRLF, "*!*") ; Reads the first file and replaces every @CRLF (carriage return + line feed, i.e. a line break) with *!* so that the separator between lines is easy to track later
$file2 = StringReplace(FileRead($openfile2), @CRLF, "*!*")

FileClose($openfile1) ; Closes the first file. It has been read and is not needed anymore
FileClose($openfile2)

$array1 = StringSplit($file1, "*!*", 1) ; Splits the string at every *!* and puts the pieces in an array (flag 1 treats *!* as one separator rather than three separate characters)
$array2 = StringSplit($file2, "*!*", 1)

For $i = 1 To $array1[0] ; Loop, adding 1 to $i every turn, until $i reaches the value of $array1[0] (the number of items)
    For $j = 1 To $array2[0] ; A loop within a loop: for every item of the first file, go through every item of the second file
        If StringLower($array1[$i]) = StringLower($array2[$j]) Then ; If the lowercase version of $array1[$i] is the same as the lowercase version of $array2[$j] then...
            $array1[$i] = "" ; The values are the same, so empty the array item
            $array2[$j] = "" ; Do the same for the other array (= the other file)
            ExitLoop
        EndIf
    Next
Next

;_ArrayDisplay($array1)
;_ArrayDisplay($array2)

For $i = 1 To $array1[0] ; Loop through the array
    If $array1[$i] <> "" Then FileWriteLine($openfile3, $array1[$i]) ; If the array-item is not empty("") then write a line to file3
Next

For $i = 1 To $array2[0]
    If $array2[$i] <> "" Then FileWriteLine($openfile3, $array2[$i])
Next

FileClose($openfile3) ; Close the last file

Exit ; Exit the program. Not needed, but for nice and clean coding!


  • Moderators

To split the files into an array, use something like: StringSplit(StringStripCR(FileRead("FileName.ext")), @LF). You'll find it a bit faster... I believe _FileReadToArray() does the same thing.

Also, rather than opening and closing the file multiple times with FileWriteLine(), you'll find it much faster to store each line in a string... eg (pseudo code):

Local $sMyHoldVar = ""
For $i = 1 To $aSplit[0]
    $sMyHoldVar &= $aSplit[$i] & @CRLF
Next

FileWrite("FileName", StringTrimRight($sMyHoldVar, 2)) ; Note we got rid of the CR and LF on the end of the string before writing

This way you only open the file once for writing, and you'll see (especially with bigger files) speeds hundreds of times faster than the aforementioned method.
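
Putting both suggestions together for the original task might look something like the sketch below. It is only an illustration under a few assumptions: the file names are placeholders, lines are compared case-insensitively (the StringInStr default), and blank lines are dropped.

```autoit
; Placeholder file names, adjust to your own files
; Split file1 into lines: strip the CRs, then split on LF
Local $aFile1 = StringSplit(StringStripCR(FileRead("file1.txt")), @LF)

; Keep file2 as one string, wrapped in LFs so that only whole lines can match
Local $sFile2 = @LF & StringStripCR(FileRead("file2.txt")) & @LF

Local $sOut = ""
For $i = 1 To $aFile1[0]
    If $aFile1[$i] = "" Then ContinueLoop ; skip blank lines
    ; Keep the line only when it does not appear as a whole line in file2
    If Not StringInStr($sFile2, @LF & $aFile1[$i] & @LF) Then $sOut &= $aFile1[$i] & @CRLF
Next

; One single write at the end, with the trailing CR and LF removed
Local $hOut = FileOpen("file3.txt", 2)
FileWrite($hOut, StringTrimRight($sOut, 2))
FileClose($hOut)
```

Wrapping each line in @LF before searching avoids partial matches, so "truck" in file2 would not remove "firetruck" from file1.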

You may also find this interesting:

http://www.autoitscript.com/forum/index.ph...st&p=245675

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Ahh, interesting. I didn't know the inner workings of FileWriteLine. I didn't think that it opened and closed the file many times; I thought it held the file open until the FileClose.

The reason I didn't use _FileReadToArray() is that I don't like to use UDFs when they're not needed.

LOL, I didn't know about the StringStripCR (and StringStripWS) commands...

I can use the WS version in my Pong3 program as well :D

(Well, I looked for them, but in Visual Basic 6 they're called something different, so I didn't find them.)

So, thanks! :D


  • Moderators

I didn't notice you used FileOpen, so you aren't opening and closing the file each time. You are still passing a stream from your app to the file on every call, though, so storing the information in a variable first would be faster in any language than sending out the stream piece by piece :) .
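
If you want to check this on your own machine, a rough timing sketch along these lines should do. The file names and the line count are arbitrary placeholders, and the numbers will vary with disk and AutoIt version, so treat it as an experiment rather than a benchmark.

```autoit
Local Const $iLines = 10000 ; arbitrary test size

; Variant 1: one FileWriteLine call per line on an open handle
Local $hTimer = TimerInit()
Local $hFile = FileOpen("test_lines.txt", 2)
For $i = 1 To $iLines
    FileWriteLine($hFile, "line " & $i)
Next
FileClose($hFile)
ConsoleWrite("FileWriteLine per line: " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)

; Variant 2: build the whole text in a variable, then write it once
$hTimer = TimerInit()
Local $sHold = ""
For $i = 1 To $iLines
    $sHold &= "line " & $i & @CRLF
Next
$hFile = FileOpen("test_string.txt", 2)
FileWrite($hFile, $sHold)
FileClose($hFile)
ConsoleWrite("Single FileWrite: " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)
```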
