
delete text files with same text content


way1000

I have to delete many files that have the same text content, even when the order of the text lines is different in each file.

I need some help creating a solution.

eg:

file1.txt:

text line a
text line c
text line b

deleted:

file2.txt:

text line a
text line b
text line c

same text content exactly but different text line order


It needs an option to browse for a folder and batch delete the files whose text content duplicates another file's, regardless of line order.



Does file2.txt, with the same content but a different line order, stay or get deleted?

Sorry, re-read your post.

Edited by JLogan3o13



Sure. Read both files into their own arrays, array 1 and array 2.

Check that both arrays are the same size with UBound.

Join a1 and a2 with _ArrayConcatenate to make a3.

Cut down the new a3 using _ArrayUnique.

Check with UBound that a3 is the same size as a1 and a2.

If it's different, the files are not the same.

If it's the same, they are.

Edit: forgot the last bit: delete one of the files.

Edited by gruntydatsun
forgot the last part of this guys question

slow day here so here's one way:
 

#include <File.au3>
#include <array.au3>

Dim $aFile1, $aFile2, $aUnique

_FileReadToArray(@Scriptdir & "\file1.txt",$aFile1,0)   ;read file 1 into zero based array $aFile1
_FileReadToArray(@Scriptdir & "\file2.txt",$aFile2,0)   ;read file 2 into zero based array $aFile2

if UBound($aFile1) = UBound($aFile2) Then               ;if both arrays have same size
    _ArrayConcatenate($aFile1,$aFile2)                  ;merge a2 into a1
    $aUnique = _ArrayUnique($aFile1,0,0,0,0)            ;generate zero based array of unique lines in A1
EndIf

msgbox(1,"RESULT",(UBound($aFile2) = UBound($aUnique)) ? "Files are the same" : "Files are different")
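
One caveat with counting unique lines: it only works when no line repeats inside a file. For example, "a, b, c" against "a, a, b" would be reported as the same, and two identical files that each contain a repeated line would be reported as different. A minimal alternative sketch, assuming the standard File.au3 and Array.au3 UDFs, is to sort both arrays and compare them line by line. `_SameLinesAnyOrder` is a hypothetical helper name, not part of any library, and the comparison here is case-insensitive, which is an assumption about what counts as "the same".

```autoit
#include <File.au3>
#include <Array.au3>

; Hypothetical helper: returns True if both files hold the same lines,
; in any order. Sets @error (and returns False) if a file cannot be read.
Func _SameLinesAnyOrder($sPath1, $sPath2)
    Local $aLines1, $aLines2

    ; Flag 0 = zero-based array without a count element, as in the script above
    If Not _FileReadToArray($sPath1, $aLines1, 0) Then Return SetError(1, 0, False)
    If Not _FileReadToArray($sPath2, $aLines2, 0) Then Return SetError(2, 0, False)

    ; A different number of lines can never match
    If UBound($aLines1) <> UBound($aLines2) Then Return False

    ; Sort both arrays in place so matching content lines up index by index
    _ArraySort($aLines1)
    _ArraySort($aLines2)

    For $i = 0 To UBound($aLines1) - 1
        ; StringCompare is case-insensitive by default (assumption: case does not matter)
        If StringCompare($aLines1[$i], $aLines2[$i]) <> 0 Then Return False
    Next
    Return True
EndFunc   ;==>_SameLinesAnyOrder
```

It could be called as `_SameLinesAnyOrder(@ScriptDir & "\file1.txt", @ScriptDir & "\file2.txt")` in place of the UBound comparison in the MsgBox line.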

For the rest, you could get the folder path and read a file listing of that folder into two separate arrays. Then loop through both in a nested loop, checking file 1 against every other file, then file 2 against every other file, and so on until you hit the end.

For $x = 0 to Ubound($array1)-1
    For $y = 0 to Ubound($array2)-1
        ;is $array1[$x] a match with $array2[$y]
        ;if yes store path of file in $array2[$y] in $array3 to delete at the end
        ;or delete $array2[$y] as you go and check for errors when reading files in
    Next
Next

;delete all the files in $array3

 


 

Now I have this script, but it says the files are different even when they are the same.

#include <File.au3>
#include <array.au3>

Dim $aFile1, $aFile2, $aUnique, $array1, $array2

_FileReadToArray(@Scriptdir & "\file1.txt",$aFile1,0)   ;read file 1 into zero based array $aFile1
_FileReadToArray(@Scriptdir & "\file2.txt",$aFile2,0)   ;read file 2 into zero based array $aFile2

if UBound($aFile1) = UBound($aFile2) Then               ;if both arrays have same size
    _ArrayConcatenate($aFile1,$aFile2)                  ;merge a2 into a1
    $aUnique = _ArrayUnique($aFile1,0,0,0,0)            ;generate zero based array of unique lines in A1
EndIf

msgbox(1,"RESULT",(UBound($aFile2) = UBound($aUnique)) ? "Files are the same" : "Files are different")


For $x = 0 to Ubound($array1)-1
    For $y = 0 to Ubound($array2)-1
        ;is $array1[$x] a match with $array2[$y]
        ;if yes store path of file in $array2[$y] in $array3 to delete at the end
        ;or delete $array2[$y] as you go and check for errors when reading files in
    Next
Next

;delete all the files in $array3

How do I fix it?

Edited by way1000

Your problem will most likely be that you have trailing spaces, an additional blank line, different line terminators, incorrect encoding or a control character.

You can easily test your logic by replacing the file read with:

;Dim $aFile1, $aFile2, $aUnique, $array1, $array2
Dim $aUnique, $array1, $array2

;_FileReadToArray(@Scriptdir & "\file1.txt",$aFile1,0)   ;read file 1 into zero based array $aFile1
;_FileReadToArray(@Scriptdir & "\file2.txt",$aFile2,0)   ;read file 2 into zero based array $aFile2

Local $aFile1[3] = ["test1", "test2", "test3"]
Local $aFile2[3] = ["test3", "test2", "test1"]

Take small steps with the above, manually changing the arrays until you find your data error. You probably need to consider pre-processing your files to prevent such problems (e.g. strip trailing spaces etc.).
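
The pre-processing mentioned above could look something like this minimal sketch. `_CleanLines` is a hypothetical helper name, not a standard UDF. It assumes a zero-based array of lines (what `_FileReadToArray` fills when called with flag 0), strips trailing whitespace from each line, and drops any line that ends up empty. Whether blank lines should really be ignored is an assumption that depends on your files.

```autoit
#include <StringConstants.au3>

; Hypothetical helper: returns a cleaned copy of a zero-based 1D array of lines.
; Trailing whitespace (spaces, tabs, stray @CR) is stripped and empty lines are dropped.
; Sets @error to 1 if the input has no elements, 2 if nothing is left after cleaning.
Func _CleanLines(Const ByRef $aLines)
    Local $iCount = UBound($aLines)
    If $iCount = 0 Then Return SetError(1, 0, 0)

    Local $aClean[$iCount]
    Local $iKept = 0
    For $i = 0 To $iCount - 1
        Local $sLine = StringStripWS($aLines[$i], $STR_STRIPTRAILING)
        If StringLen($sLine) > 0 Then
            $aClean[$iKept] = $sLine
            $iKept += 1
        EndIf
    Next

    If $iKept = 0 Then Return SetError(2, 0, 0)
    ReDim $aClean[$iKept] ; shrink to the number of lines actually kept
    Return $aClean
EndFunc   ;==>_CleanLines
```

Running both arrays through a step like this before comparing them takes trailing spaces and extra blank lines out of the picture, so any remaining mismatch is more likely to be encoding or a control character inside a line.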

 

Edit: Sorry Grunty, I thought you might have gone to sleep :-), I'll get out of the way.

Edited by SlackerAl



That sounds like good advice, SlackerAl.

Find out why the match is failing, then write something to clean up the input files, or fix the process that generates them, to deal with that.

Load your input files in Notepad++ and turn on Show All Characters. It'll be pretty obvious what's different.


2 hours ago, gruntydatsun said:

attach a few sample files that make the problem happen and i'll have a look.   you're in luck... i have no life :)

 

 

How can I make it run on a folder with thousands of files, comparing them and deleting the duplicates?

Screenshot_1.png

Screenshot_2.png

Edited by way1000

Hi, sorry, I meant for you to attach the two files themselves to a post in this thread.

To do a folder with thousands of files, use FileSelectFolder to let the user pick the path, then get a directory listing of the files into two arrays with _FileListToArray. Loop through both in a nested loop, checking file 1 against every other file, then file 2 against every other file, and so on until you hit the end.
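
With thousands of files, a nested loop re-reads every file many times. As a hedged alternative to the nested loop (not what was described above), the sketch below reads each file once, builds an order-independent "signature" from its sorted lines, and remembers signatures in a Scripting.Dictionary COM object. `_LineSignature` and `_RemoveDuplicateTextFiles` are hypothetical names, the "*.txt" filter is an assumption, and the comparison is case-insensitive. It defaults to a dry run that only reports duplicates.

```autoit
#include <File.au3>
#include <Array.au3>

; Hypothetical helper: order-independent, case-insensitive signature of a file's lines.
Func _LineSignature($sPath)
    Local $aLines
    If Not _FileReadToArray($sPath, $aLines, 0) Then Return SetError(1, 0, "")
    _ArraySort($aLines)
    ; Lowercase the joined text so lines differing only by case give the same key
    Return StringLower(_ArrayToString($aLines, @LF))
EndFunc   ;==>_LineSignature

; Hypothetical helper: keeps the first file seen for each signature and deletes
; (or, in a dry run, only reports) every later file with the same signature.
; Returns the number of duplicates found.
Func _RemoveDuplicateTextFiles($sFolder, $bDryRun = True)
    ; $FLTA_FILES = files only, True = return full paths; element [0] holds the count
    Local $aFiles = _FileListToArray($sFolder, "*.txt", $FLTA_FILES, True)
    If @error Then Return SetError(1, 0, 0)

    Local $oSeen = ObjCreate("Scripting.Dictionary")
    If Not IsObj($oSeen) Then Return SetError(2, 0, 0)

    Local $iDupes = 0
    For $i = 1 To $aFiles[0]
        Local $sSig = _LineSignature($aFiles[$i])
        If @error Then ContinueLoop ; unreadable or empty file: leave it alone
        If $oSeen.Exists($sSig) Then
            $iDupes += 1
            ConsoleWrite("Duplicate of " & $oSeen.Item($sSig) & ": " & $aFiles[$i] & @CRLF)
            If Not $bDryRun Then FileDelete($aFiles[$i])
        Else
            $oSeen.Add($sSig, $aFiles[$i])
        EndIf
    Next
    Return $iDupes
EndFunc   ;==>_RemoveDuplicateTextFiles

; Example usage (commented out so nothing is deleted by accident):
; Local $sFolder = FileSelectFolder("Select the folder to scan", "")
; If Not @error Then MsgBox(0, "Result", _RemoveDuplicateTextFiles($sFolder, True) & " duplicate(s) found")
```

The whole signature is kept in memory for every distinct file, which should be fine for small text files but is worth keeping in mind for large ones. Running with the dry-run default first, and on a copy of the folder, is the safer way to try it.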

  • Download Notepad++, install and open those text files in it.
  • Select View > Show Symbol > Show All Characters
  • This will show the non-printing characters that are probably stopping your matching (carriage returns, tabs etc)
     
