Find duplicate lines in several files


masvil
I'm trying to find duplicate lines across several files (all files in a specified directory and its subdirectories) and store them in a new file, together with a reference to the files they were found in. Criterion: only lines that begin with a specified string count.

Example:

c:\dir\hello.txt content:

hi, wou are you?

specifiedstring have a nice trip

I really like it

c:\dir\goodbye.txt content:

specifiedstring you are beautiful

specifiedstring what's up I wish you stay well

hi, my dear

specifiedstring have a nice trip

I really like it

c:\dir\subdir\adios.txt content:

specifiedstring what's up I wish you stay well

where are you going?

specifiedstring the pen is on the table

After processing those files, I need result.txt to contain:

"specifiedstring have a nice trip" found in hello.txt and goodbye.txt

"specifiedstring what's up I wish you stay well" found in goodbye.txt and adios.txt

Any help, please?

Edited by masvil

Use the UDF below to list the files, then go through each file: get its number of lines with _FileCountLines, loop over that many lines with a For loop, read each one with FileReadLine, and use StringInStr to check whether the string is there. If it is, save the line to an array with _ArrayAdd, or to some other text buffer that you can use later, once all files have been checked. The UDF below returns every file in the directory and in all of its subdirectories.

NOTE: THIS UDF WAS WRITTEN BY SMOKE_N, NOT BY ME.

Func _FileListToArrayEx($sPath, $sFilter = '*.*', $iFlag = 0, $sExclude = '', $iRecurse = False)
    If Not FileExists($sPath) Then Return SetError(1, 1, '')
    If $sFilter = -1 Or $sFilter = Default Then $sFilter = '*.*'
    If $iFlag = -1 Or $iFlag = Default Then $iFlag = 0
    If $sExclude = -1 Or $sExclude = Default Then $sExclude = ''
    Local $aBadChar[6] = ['\', '/', ':', '>', '<', '|']
    $sFilter = StringRegExpReplace($sFilter, '\s*;\s*', ';')
    If StringRight($sPath, 1) <> '\' Then $sPath &= '\'
    For $iCC = 0 To 5
        If StringInStr($sFilter, $aBadChar[$iCC]) Or _
            StringInStr($sExclude, $aBadChar[$iCC]) Then Return SetError(2, 2, '')
    Next
    If StringStripWS($sFilter, 8) = '' Then Return SetError(2, 2, '')
    If Not ($iFlag = 0 Or $iFlag = 1 Or $iFlag = 2) Then Return SetError(3, 3, '')
    Local $oFSO = ObjCreate("Scripting.FileSystemObject"), $sTFolder
    $sTFolder = $oFSO.GetSpecialFolder(2)
    Local $hOutFile = @TempDir & '\' & $oFSO.GetTempName ; @TempDir has no trailing backslash
    If Not StringInStr($sFilter, ';') Then $sFilter &= ';'
    Local $aSplit = StringSplit(StringStripWS($sFilter, 8), ';'), $sRead, $sHoldSplit
    For $iCC = 1 To $aSplit[0]
        If StringStripWS($aSplit[$iCC],8) = '' Then ContinueLoop
        If StringLeft($aSplit[$iCC], 1) = '.' And _
            UBound(StringSplit($aSplit[$iCC], '.')) - 2 = 1 Then $aSplit[$iCC] = '*' & $aSplit[$iCC]
        $sHoldSplit &= '"' & $sPath & $aSplit[$iCC] & '" '
    Next
    $sHoldSplit = StringTrimRight($sHoldSplit, 1)
    If $iRecurse Then
        RunWait(@Comspec & ' /c dir /b /s /a ' & $sHoldSplit & ' > "' & $hOutFile & '"', '', @SW_HIDE)
    Else
        RunWait(@ComSpec & ' /c dir /b /a ' & $sHoldSplit & ' /o-e /od > "' & $hOutFile & '"', '', @SW_HIDE)
    EndIf
    $sRead &= FileRead($hOutFile)
    If Not FileExists($hOutFile) Then Return SetError(4, 4, '')
    FileDelete($hOutFile)
    If StringStripWS($sRead, 8) = '' Then Return SetError(4, 4, '')
    Local $aFSplit = StringSplit(StringTrimRight(StringStripCR($sRead), 1), @LF)
    Local $sHold
    For $iCC = 1 To $aFSplit[0]
        If $sExclude And StringLeft($aFSplit[$iCC], _
            StringLen(StringReplace($sExclude, '*', ''))) = StringReplace($sExclude, '*', '') Then ContinueLoop
        Switch $iFlag
            Case 0
                If StringRegExp($aFSplit[$iCC], '\w:\\') = 0 Then
                    $sHold &= $sPath & $aFSplit[$iCC] & Chr(1)
                Else
                    $sHold &= $aFSplit[$iCC] & Chr(1)
                EndIf
            Case 1
                If StringInStr(FileGetAttrib($sPath & '\' & $aFSplit[$iCC]), 'd') = 0 And _
                    StringInStr(FileGetAttrib($aFSplit[$iCC]), 'd') = 0 Then
                    If StringRegExp($aFSplit[$iCC], '\w:\\') = 0 Then
                        $sHold &= $sPath & $aFSplit[$iCC] & Chr(1)
                    Else
                        $sHold &= $aFSplit[$iCC] & Chr(1)
                    EndIf
                EndIf
            Case 2
                If StringInStr(FileGetAttrib($sPath & '\' & $aFSplit[$iCC]), 'd') Or _
                    StringInStr(FileGetAttrib($aFSplit[$iCC]), 'd') Then
                    If StringRegExp($aFSplit[$iCC], '\w:\\') = 0 Then
                        $sHold &= $sPath & $aFSplit[$iCC] & Chr(1)
                    Else
                        $sHold &= $aFSplit[$iCC] & Chr(1)
                    EndIf
                EndIf
        EndSwitch
    Next
    If StringTrimRight($sHold, 1) Then Return StringSplit(StringTrimRight($sHold, 1), Chr(1))
    Return SetError(4, 4, '')
EndFunc
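
The steps described above could be sketched roughly like this. It is only a sketch under a few assumptions: `_CollectPrefixedLines` is a made-up helper name, it reads each file with `_FileReadToArray` instead of `_FileCountLines` plus `FileReadLine`, and it uses a `Scripting.Dictionary` rather than `_ArrayAdd` as the buffer. It relies on the `_FileListToArrayEx` UDF posted above, and it writes full paths to result.txt rather than bare file names.

```autoit
#include <File.au3>

; Hypothetical helper (not part of any UDF library): maps every line that
; starts with $sPrefix to a "|"-separated list of the files containing it.
; $aFiles is a 1-based array with the count in element [0].
Func _CollectPrefixedLines(ByRef $aFiles, $sPrefix)
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    Local $aLines, $sLine
    For $i = 1 To $aFiles[0]
        ; Skip files that cannot be read (or are empty)
        If Not _FileReadToArray($aFiles[$i], $aLines) Then ContinueLoop
        For $j = 1 To $aLines[0]
            $sLine = $aLines[$j]
            ; Keep only lines that begin with the prefix
            If StringLeft($sLine, StringLen($sPrefix)) <> $sPrefix Then ContinueLoop
            If Not $oSeen.Exists($sLine) Then
                $oSeen.Add($sLine, $aFiles[$i])
            ElseIf Not StringInStr("|" & $oSeen.Item($sLine) & "|", "|" & $aFiles[$i] & "|") Then
                ; Same line, but in a file not recorded yet
                $oSeen.Item($sLine) = $oSeen.Item($sLine) & "|" & $aFiles[$i]
            EndIf
        Next
    Next
    Return $oSeen
EndFunc

; Flag 1 = files only, last parameter True = include subdirectories
Local $aFiles = _FileListToArrayEx("c:\dir", "*.txt", 1, "", True)
If IsArray($aFiles) Then
    Local $oSeen = _CollectPrefixedLines($aFiles, "specifiedstring")
    Local $aKeys = $oSeen.Keys()
    Local $hOut = FileOpen(@ScriptDir & "\result.txt", 2) ; 2 = overwrite
    For $i = 0 To UBound($aKeys) - 1
        ; A "|" in the stored value means the line occurs in at least two files
        If StringInStr($oSeen.Item($aKeys[$i]), "|") Then
            FileWriteLine($hOut, '"' & $aKeys[$i] & '" found in ' & _
                    StringReplace($oSeen.Item($aKeys[$i]), "|", " and "))
        EndIf
    Next
    FileClose($hOut)
EndIf
```

Because every file is read only once and each line is looked up in the dictionary, a line shared by three files ends up on a single result line listing all three.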

Well, this does the job ^-^!

But it writes each match to "log.txt" twice. I mean, if a line is found in "file1" and in "file2", it writes

"the line" found in "file1" and "file2"

and

"the line" found in "file2" and "file1"

Anyway, here it is ^-^!

#include <file.au3>

$NumberOfFiles = 3

Dim $file, $temp_file
Dim $files[$NumberOfFiles]

$files[0] = @ScriptDir & "\file1.TXT"
$files[1] = @ScriptDir & "\file2.TXT"
$files[2] = @ScriptDir & "\file3.TXT"

$thefile = FileOpen(@ScriptDir & "\log.txt",2)
For $i = 0 to $NumberOfFiles - 1
    If Not _FileReadToArray($files[$i], $file) Then
;error opening file
        MsgBox(0,"","error!")
    Else
;the lines in $files[$i] are in $file
        For $j = 1 to $file[0]
            For $k = 0 to $NumberOfFiles - 1
                If $k = $i Then
            ;is the same file!
                Else
                    If not _FileReadToArray($files[$k], $temp_file) Then
                ;error?
                        MsgBox(0,"","error!")
                    Else
                        For $l = 1 to $temp_file[0]
                            If $temp_file[$l] = $file[$j] Then
                                FileWrite($thefile, $file[$j] & " found in " & $files[$i] & " and " & $files[$k] & @CRLF)
                            EndIf
                        Next
                    EndIf
                EndIf
            Next
        Next
    EndIf
Next
FileClose($thefile)
MsgBox(0,"","done!")

Edit: hope this helps you! ^-^!

Regards! =P!

Edited by AnythinG
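
One possible way to avoid the double entries mentioned above is to compare each pair of files only once, by starting the inner file loop at `$i + 1`. The sketch below assumes the same three files as the script above; the variable names (`$aPaths`, `$aA`, `$aB`, `$hLog`) are chosen here for illustration. Like the original script, it does not filter on a prefix and compares lines case-insensitively.

```autoit
#include <File.au3>

; Same three example files as in the script above
Local $aPaths[3] = [@ScriptDir & "\file1.TXT", @ScriptDir & "\file2.TXT", @ScriptDir & "\file3.TXT"]
Local $aA, $aB
Local $hLog = FileOpen(@ScriptDir & "\log.txt", 2) ; 2 = overwrite

For $i = 0 To UBound($aPaths) - 2
    If Not _FileReadToArray($aPaths[$i], $aA) Then ContinueLoop
    ; Start at $i + 1 so that each pair of files is compared only once
    For $k = $i + 1 To UBound($aPaths) - 1
        If Not _FileReadToArray($aPaths[$k], $aB) Then ContinueLoop
        For $j = 1 To $aA[0]
            For $l = 1 To $aB[0]
                If $aA[$j] = $aB[$l] Then
                    FileWrite($hLog, $aA[$j] & " found in " & $aPaths[$i] & " and " & $aPaths[$k] & @CRLF)
                    ExitLoop ; report this line once per pair of files
                EndIf
            Next
        Next
    Next
Next
FileClose($hLog)
```

Each file is also read once per pair here instead of once per line, which should matter on larger files.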

Hi,

Here's the logic:

; _FileCompareDupes.au3
#include<array.au3>
#include<file.au3>
Local $arFiles[3] = [@ScriptDir & "\hello.txt", @ScriptDir & "\goodbye.txt", @ScriptDir & "\adios.txt"  ]
Local $arArrays[3], $sCumul, $sResults = @ScriptDir & "\resultDupes", $c = FileDelete($sResults);$arTmp[1], $arTmp2[1],

; get all the files to arrays which only contain matching lines [for speed, change to use RegExp on large fileread files]
_FilesReadArrays($arFiles, $arArrays, "specifiedstring")

; loop through 2-file comparison for dupes to Delimited string
For $i = 0 To UBound($arFiles) - 2  ;the last file will already have been checked against all others
    For $j = $i + 1 To UBound($arFiles) - 1     ;start at $i+1 so as not to repeat comparison of any 2 files
        Local $sStr = @TAB & ":Found in " & $arFiles[$i] & " and " & $arFiles[$j] & @CRLF   ;document which files have the dupes
        
        ;compare 2 files, return the array of dupe lines
        Local $ArrayCompare = _ArrayCompare($arArrays[$i], $arArrays[$j], 1, 1)     ;If @OSTYPE = "WIN32_WINDOWS"  Then Return 0 ;not Win 9x
        If $ArrayCompare[0] <> "" Then $sCumul &= _ArrayToString($ArrayCompare, $sStr) & $sStr
    Next
Next
FileWrite($sResults, $sCumul)
Run("notepad " & $sResults)

Attached is the full script with the functions.

Best, Randall

Edited by randallc

Thanks for your effort, you're saving me from a serious problem at work!

@AnythinG and randallc: your scripts work great! I have to ask for one last bit of help: I'm not good with arrays and, since I have to solve this problem urgently, I have no time to learn about them. I need to check all files in @ScriptDir; could you add that, please?

Edited by masvil

Hi,

In mine, just change the "$arFiles[3]" line to:

$arFiles=_FileListToArray(@scriptdir)
_ArrayDelete($arFiles,0)
Best, Randall

[PS: If you need subfolders, use the newer version, _FileListToArrayNew, from the links in my sig.]

Edited by randallc
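
For anyone following along: `_FileListToArray` puts the number of entries in element [0] and, in the form used above, returns file names without their folder. A minimal sketch of turning that into a 0-based array of full paths, assuming only .txt files in @ScriptDir are wanted (the names `$aNames` and `$aPaths` are illustrative), might look like this:

```autoit
#include <File.au3>

; List only files (flag 1) matching *.txt in the script directory
Local $aNames = _FileListToArray(@ScriptDir, "*.txt", 1)
If @error Then
    MsgBox(0, "", "No matching files found")
    Exit
EndIf

; $aNames[0] holds the count; build a 0-based array of full paths
Local $aPaths[$aNames[0]]
For $i = 1 To $aNames[0]
    $aPaths[$i - 1] = @ScriptDir & "\" & $aNames[$i]
Next
```

Whether the rest of a given script needs full paths or bare names depends on how it opens the files, so treat this as an optional step.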

In mine, just change the "$arFiles[3]" line to:

$arFiles=_FileListToArray(@scripdir)
_ArrayDelete($arFiles,0)
Done, but I get:

C:\temp\_FileCompareDupes\_FileCompareDupes.au3 (38) : ==> Array variable subscript badly formatted.: 
ReDim $arTmp2[$k] 
ReDim $arTmp2[^ ERROR

PS: please also change "@scripdir" to "@scriptdir" in your last post.

Edited by masvil

Hi,

Yes, some compensating changes would be needed!

Local $arFiles=_FileListToArray(@ScriptDir,"*.txt",1)
_ArrayDelete($arFiles,0)
_ArrayDisplay($arFiles)
;~ Local $arFiles[3] = [@ScriptDir & "\hello.txt", @ScriptDir & "\goodbye.txt", @ScriptDir & "\adios.txt"  ]
Local $arArrays[UBound($arFiles)], $sCumul, $sResults = @ScriptDir & "\resultDupes", $c = FileDelete($sResults);$arTmp[1], $arTmp2[1],
Best, Randall