
Regex: remove all duplicate lines from a sorted file


Hello everyone

I have a text file with lines of text, and some lines occur more than once.  I would like to delete all lines that occur more than once (i.e. including all instances of that particular line).  The lines are already sorted alphabetically.  The problem is that I can't quite figure out how to write the regular expression.

#include <Array.au3>

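$sFile = "testfile.txt"
$fo = FileOpen ($sFile, 128) ; 128 = read mode, UTF-8
$fwo = FileOpen ($sFile & "_output.txt", 129) ; 129 = append mode, UTF-8; the name is built from the file name, not from the handle in $fo
$fr = FileRead ($fo)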

$freg = StringRegExpReplace ($fr, '(.+?\R){2,}', '')
; I also tried:
; $freg = StringRegExpReplace ($fr, '(.+?\R)+', '')
; $freg = StringRegExpReplace ($fr, '(.+?' & @CRLF & '){2,}', '')
; $freg = StringRegExpReplace ($fr, '^(.+?\R){2,}', '')
; $freg = StringRegExpReplace ($fr, '(?m)^(.+?\R){2,}', '')

FileWrite ($fwo, $freg)

In the test file attached, only one line ("The quick brown fox.") should remain in the file.

Thanks

Samuel

PS. The alternative approach is to use an array and compare array items with each other in a series of loops, but I'm hoping the regex solution is viable.

testfile.txt

Edited by leuce

<snip>

Sorry, I misread the original request. :doh:

 

 

Edited by TheXman
Changed to use the text file data as-is
Just now, TheXman said:

Is that not what you get?

Sure - what's the problem with this?
That's what leuce wanted:

34 minutes ago, leuce said:

In the test file attached, only one line ("The quick brown fox.") should remain in the file.


But your result could also be achieved with a slightly shorter pattern:

$sNew = StringRegExpReplace(FileRead("testfile.txt"), '(?ms)^(\V+)$\R(?=.*\1)', '')

ConsoleWrite($sNew)

 

44 minutes ago, AspirinJunkie said:

Maybe a slightly simpler pattern:

$sNew = StringRegExpReplace(FileRead("testfile.txt"), '(?ms)^(\V+)$.*\1\R', '')

ConsoleWrite($sNew)

But it only works if the rows are already sorted - as you wrote.

Thanks, that regex works for the short test file and it works for slightly longer test files too, but then on one specific longer test file it fails near the middle of the file for no immediately apparent reason (larger test file attached). By "fail" I mean it deletes about 60 lines that are not duplicates.

(Added: I see that both the first and the last line in the group of erroneously deleted lines end with the same word.)

Ideally the script should rather delete too few than too many lines (i.e. if there are any lines that are not properly sorted, then those lines should just be ignored).

I'll see if I can figure out what happens with the long test file.  But thanks again for the regex help -- it definitely is not my strong point.

Samuel

testfile.txt

Edited by leuce

In the meantime, I figured out how to do this using array item comparisons and loops instead of regular expressions.

I know this is not relevant to regex, but I thought I'd post it since this was my second option for solving my overall problem.

#include <Array.au3>

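$sFile = "testfile.txt"
$fo = FileOpen ($sFile, 128) ; 128 = read mode, UTF-8
$fws = FileOpen ($sFile & "_sorted.txt", 129) ; 129 = append mode, UTF-8
$fwo = FileOpen ($sFile & "_output.txt", 129) ; one can then compare these two files in e.g. WinMerge
$fr = FileRead ($fo)

$farr = StringSplit ($fr, @CRLF, 1) ; flag 1: split on the whole @CRLF; $farr[0] holds the line count
$count = $farr[0]
_ArraySort ($farr, 0, 1, 0, 0, 1) ; start at index 1 so the count in $farr[0] is not sorted in with the lines
$fstr = _ArrayToString ($farr, @CRLF, 1) ; start at index 1 to leave the count out of the output
FileWrite ($fws, $fstr)

$j = 1
While $j < $count
    If $farr[$j] = $farr[$j+1] Then
        $farr[$j] = "x" ; mark a duplicated line
        $j = $j + 1
        If $j = $count Then
            $farr[$j] = "x" ; the run ends at the last line, so mark that one too
        ElseIf $farr[$j] <> $farr[$j+1] Then
            $farr[$j] = "x" ; mark the last line of the run
        EndIf
    Else
        $j = $j + 1
    EndIf
WEnd

$fstr2 = _ArrayToString ($farr, @CRLF, 1)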

FileWrite ($fwo, $fstr2)

 

Edited by leuce

My philosophy is to avoid RegEx at all costs. Probably unfounded. This stems from a bitter disagreement with colleagues some years ago.

Anyway, this is how I would solve this problem. It works, and it gets rid of the blank line.

#include <Array.au3>
#include <File.au3>

Dim $aText
_FileReadToArray("C:\Users\user\Downloads\testfile.txt", $aText)
_ArrayDisplay($aText)

_ArraySort($aText, Default, 1) ; optional, as apparently the text is already sorted
_ArrayDisplay($aText)

$i = 1
While $i <= $aText[0] - 1
  If $aText[$i] = $aText[$i + 1] Or $aText[$i] = "" Then
    _ArrayDelete($aText, $i)
    $aText[0] -= 1
  Else
    $i += 1
  EndIf
WEnd
_ArrayDisplay($aText)

_FileWriteFromArray("C:\Users\user\Downloads\testfileout.txt", $aText, 1)

 

Phil Seakins


@leuce !

If you prefer arrays, then you could also consider _ArrayUnique :

#include <File.au3>
#include <Array.au3>
Global $aArrSource, $aArrUnique
If Not _FileReadToArray(@ScriptDir & "\testfile.txt", $aArrSource, $FRTA_NOCOUNT) Then Exit MsgBox(BitOR(4096, 16), "Message : ", "Error : _FileReadToArray")
_ArraySort($aArrSource, Default, 0) ; optional : sort array
$aArrUnique = _ArrayUnique($aArrSource, 0, 0, 1, $ARRAYUNIQUE_NOCOUNT)
_FileWriteFromArray(@ScriptDir & "\testfile_unique.txt", $aArrUnique)

 

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move."

9 hours ago, leuce said:

Thanks, that regex works for the short test file and it works for slightly longer test files too, but then on one specific longer test file it fails near the middle of the file for no immediately apparent reason (larger test file attached). By "fail" I mean it deletes about 60 lines that are not duplicates.

Yes, this happens because the text of that line appears again later as part of another line. You can fix this by writing a ^ in front of the \1 in the pattern.

However, the following pattern is much more efficient (FactFinder's pattern also gives the correct result, but it is expensive):

$sNew = StringRegExpReplace(FileRead("testfile.txt"), "(?m)^(.+\R)\1+", '')

ConsoleWrite($sNew)
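
To see what this pattern does, here is a minimal sketch that runs it on an inline sample instead of the attached file. The sample text and the variable names are made up for illustration. `(.+\R)` captures one whole line including its line break, and `\1+` requires one or more identical lines directly after it, so the entire run is replaced with nothing. A line that merely starts with the same text ("fox terrier") is left alone.

```autoit
; Hypothetical sample data, already sorted; every line ends with @CRLF.
Local $sText = "fox" & @CRLF & _
        "fox" & @CRLF & _
        "fox" & @CRLF & _
        "fox terrier" & @CRLF & _
        "zebra" & @CRLF

; (?m)    lets ^ match at the start of every line.
; (.+\R)  captures one complete line together with its line break.
; \1+     demands at least one directly following identical line.
Local $sNew = StringRegExpReplace($sText, "(?m)^(.+\R)\1+", "")

; Only "fox terrier" and "zebra" are left; all three "fox" lines are gone.
ConsoleWrite($sNew)
```

Because the back-reference has to follow the captured line immediately, this only works on sorted input, and unsorted duplicates are simply left in place rather than taking other lines with them.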

@pseakins, @Musashi

He wanted to completely delete the lines that appear several times - not only the doubles.
In your solutions only the duplicates are removed - one copy of each line still remains.

Edited by AspirinJunkie
36 minutes ago, AspirinJunkie said:

He wanted to completely delete the lines that appear several times - not only the doubles.
In your solutions only the duplicates are removed - one copy of each line still remains.

Regarding this point, according to his description(s), I was not quite sure about it anyway. The title reads "... remove all duplicate lines...". The post states "I would like to delete all lines that occur more than once...". You're probably right, though.

This would therefore mean that :

Element A
Element B
Element B
Element C

becomes :

Element A
Element C

and not :

Element A
Element B
Element C
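
For those who prefer arrays, the first reading (drop every member of a duplicate run) could be sketched roughly as follows. This is only an illustration under assumptions: the array is hard-coded with the sample above, it is 0-based with no count element, it is already sorted, and the variable names are invented here.

```autoit
; Hypothetical sample matching the example above; the array must already be sorted.
Local $aLines[4] = ["Element A", "Element B", "Element B", "Element C"]

Local $sOut = ""
Local $i = 0, $iRunEnd
While $i < UBound($aLines)
    ; Find the end of the run of lines identical to $aLines[$i].
    $iRunEnd = $i
    While $iRunEnd + 1 < UBound($aLines)
        ; == compares strings case-sensitively, like the regex does.
        If Not ($aLines[$iRunEnd + 1] == $aLines[$i]) Then ExitLoop
        $iRunEnd += 1
    WEnd
    ; Keep the line only if its run has length 1.
    If $iRunEnd = $i Then $sOut &= $aLines[$i] & @CRLF
    ; Skip the whole run, whether it was kept or dropped.
    $i = $iRunEnd + 1
WEnd

; Prints "Element A" and "Element C".
ConsoleWrite($sOut)
```

Building a new string avoids deleting array elements while looping over them, which is where index bookkeeping tends to go wrong.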

 

2 hours ago, AspirinJunkie said:

He wanted to completely delete the lines that appear several times - not only the doubles.

Yes, I totally missed that requirement. Here's my second attempt. *** It doesn't work, please ignore ***

#include <Array.au3>
#include <File.au3>

Dim $aText
$sPrevLine = "xyzplugh"
_FileReadToArray("C:\Users\user\Downloads\testfile.txt", $aText)
_ArrayDisplay($aText)

_ArraySort($aText, Default, 1)
_ArrayDisplay($aText)

$i = 1
While $i <= $aText[0] - 1
  ; enable one or the other of the next two lines depending if you want to delete null lines
  ; If $aText[$i] = $aText[$i + 1] Or $aText[$i] = $sPrevLine Or $aText[$i] = "" Then
  If $aText[$i] = $aText[$i + 1] Or $aText[$i] = $sPrevLine Then
    $sPrevLine = $aText[$i]
    _ArrayDelete($aText, $i)
    $aText[0] -= 1
  Else
    $i += 1
  EndIf
WEnd
_ArrayDisplay($aText)

_FileWriteFromArray("C:\Users\user\Downloads\testfileout.txt", $aText, 1)

 

Ignore this last post of mine - it is rubbish - my code does not work correctly.

Edited by pseakins
irrelevant post

Phil Seakins


@leuce :

If you want to remove all lines that appear multiple times, then a regular expression (see examples from @Factfinder or @AspirinJunkie) is probably most suitable :

#include <WinAPIFiles.au3>
#include <FileConstants.au3>
Global $hSourceFile = FileOpen(@ScriptDir & "\testfile.txt", BitOR($FO_READ, $FO_UTF8))
If $hSourceFile = -1 Then Exit MsgBox(BitOR(4096, 16), "Message : ", "Error : reading the file")
Global $hTargetFile = FileOpen(@ScriptDir & "\testfile_target.txt", BitOR($FO_OVERWRITE, $FO_UTF8))
If $hTargetFile = -1 Then Exit MsgBox(BitOR(4096, 16), "Message : ", "Error : writing the file")
FileWrite($hTargetFile, StringRegExpReplace(FileRead($hSourceFile), "(?m)^(.+\R)\1+", ''))
FileClose($hTargetFile)
FileClose($hSourceFile)

 


Just be careful with the last line.  If it does not have any \R (newline sequence) at the end, it will be kept even if there are multiple occurrences of that line before it.
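
A small sketch of that caveat, using a made-up three-line sample and the `(?m)^(.+\R)\1+` pattern from earlier in the thread. The workaround shown (appending a line break when the text does not end with one) is just one possible approach, not something taken from the posts above.

```autoit
; Hypothetical sample: the duplicated last line has no trailing line break.
Local $sText = "apple" & @CRLF & "pear" & @CRLF & "pear"
Local $sPattern = "(?m)^(.+\R)\1+"

; The final "pear" lacks a line break, so \1 cannot match it and nothing is removed.
ConsoleWrite(StringRegExpReplace($sText, $sPattern, "") & @CRLF)

; Possible workaround: make sure the text ends with a line break first.
If Not StringRegExp($sText, "\R\z") Then $sText &= @CRLF

; Now both "pear" lines are removed and only "apple" remains.
ConsoleWrite(StringRegExpReplace($sText, $sPattern, ""))
```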

28 minutes ago, Nine said:

Just be careful with the last line.  If it does not have any \R (newline sequence) at the end, it will be kept even if there are multiple occurrences of that line before it.

Good point. This should take care of that:

$freg = StringRegExpReplace (FileRead("testfile1.txt"), '(?s)(.+?)\R(\1(\R|$))+', '')

 


Yes it does :).  Tested speed with :

#include <Constants.au3>

$hTimer = TimerInit()
$freg = StringRegExpReplace (FileRead("testfile.txt"), '(?s)(.+?)\R(\1(\R|$))+', '')
ConsoleWrite (TimerDiff($hTimer) & @CRLF)

$hTimer = TimerInit()
$sNew = StringRegExpReplace(FileRead("testfile.txt"), "(?m)^(.+?)\R(\1(\R|$))+", '')
ConsoleWrite (TimerDiff($hTimer) & @CRLF)

MsgBox ($MB_SYSTEMMODAL, "", $freg = $sNew)

+>Setting Hotkeys...--> Press Ctrl+Alt+Break to Restart or Ctrl+BREAK to Stop.
833.19996119682
1.2710359091648
+>08:07:28 AutoIt3.exe ended.rc:0

Both provide the same result.
