Counting and sorting duplicate lines



Hey y'all,

So I'm in a bit of a pickle here.  I have a text file, file1.txt, with a few hundred lines; some of these lines are exact duplicates.

What I'm trying to do is pick out the duplicate lines, sort them by number of occurrences, and write the 10 most frequently seen duplicate lines to another file.  So the "top 10" duplicate lines from file1.txt would be output to file2.txt.

I've looked around at the _Array functions but I only see a way to remove duplicates and make unique arrays, which isn't exactly what I need...

Any help?


Here it is; it's a list of Twitter trends:
 

#SheNeverLeft
#1Dbigannouncement
"Diwali in India"
"Between Two Ferns"
#OnTheRoadAgain1D
"Jeanie Buss"
#OttawaShooting
#FastFoodSlogans
"Darnell Coles"
"Avengers 2"
#1Dbigannouncement
#MostClutch
"Between Two Ferns"
#OnTheRoadAgainTour
"Happy Mole Day"
#OTRATour
#AYASummit
"Jeanie Buss"
"Avengers 2"
"Lisa Ann"
#StartSitESPN
"Happy Mole Day"
#OttawaShooting
"Between Two Ferns"
"Jeanie Buss"
#BryantAndNashNewVideo
#LameApocalypses
"Kevin Vickers"
#poptech
"Avengers 2"
"Happy Mole Day"
#LameApocalypses
#OttawaShooting
#AvengersAgeOfUltron
#AYASummit
#Engage2014
Halloween
Canada
Christmas
Scorpio
"Happy Mole Day"
#AvengersAgeOfUltron
#OttawaShooting
#LameApocalypses
#indysm
#HappyBirthdayGrandpaGrande
"Avengers 2"
Halloween
Canada
Christmas
#HappyBirthdayGrandpaGrande
"Happy Mole Day"
#LameApocalypses
#OttawaShooting
#AvengersAgeOfUltron
#StealMyGIF
"Avengers 2"
Canada
Halloween
Christmas
#LameApocalypses
"Happy Mole Day"
#AvengersAgeOfUltron
#ICryAtRavesWhen
#OttawaShooting
#AgeofUltron
"Avengers 2"
Canada
Halloween
"White House"
#ICryAtRavesWhen
#LameApocalypses
#AvengersAgeOfUltron
#PandaFunkFamily
#OttawaShooting
"Happy Diwali"
"Avengers 2"
Halloween
"Frank Ocean"
Canada
#ICryAtRavesWhen
#PandaFunkFamily
#LameApocalypses
#AvengersAgeOfUltron
#ZachGrandtourage
"Happy Diwali"
Ottawa
"Kim Possible"
"Lizzie McGuire"
Halloween
#ICryAtRavesWhen
#LameApocalypses
#AvengersAgeOfUltron
#AgeofUltron
"Happy Diwali"
#Paperwork
"Even Stevens"
Halloween
"Jessica Lange"
"That's So Raven"
#ICryAtRavesWhen
#LameApocalypses
#AvengersAgeOfUltron
#AgeofUltron
"Thinking About You - Frank Ocean"
"Gods & Monsters"
"Edward Mordrake"
"Happy Diwali"
Halloween
"Jessica Lange"
#LameApocalypses
#AvengersAgeOfUltron
#ICryAtRavesWhen
"Lurie Poston"
#thankyouvessel
Viscant
#OttawaShooting
"Happy Diwali"
"Lizzie McGuire"
"Gods and Monsters"
#LameApocalypses
#thankyouvessel
#AvengersAgeOfUltron
#AgeofUltron
#WorldSeriesGame2
"Gods & Monsters"
"S Club 7"
"Mark Jackson"
"Happy Diwali"
"Legally Blonde"
#LameApocalypses
#WorldSeriesGame2
#AvengersAgeOfUltron
#thankyouvessel
#DontAskBeau
"Nick Swisher"
"Zach Mettenberger"
"Teaser Trail"
PrincAss
"Edward Mordrake"
#WorldSeriesGame2
#AskBeau
Strickland
#VoightsRage
#tiannaQA
#AgeofUltron
Dora
Patti
"Zach Mettenberger"
"Teaser Trail"
#ReplaceAnAnimeTitleWithAss
"One in 5,000"
#CrawfordsNewVideo
#100Things
#AvengersAgeOfUltron
"My Cinnamon Twist"
#WorldSeriesGame2
"Thinking About You - Frank Ocean"
Kunitz
"Paranormal Activity 3"
#ReplaceAnAnimeTitleWithAss
#BabyDaddyChat
#WorldSeriesGame2
Ultron
#OttawaShooting
#NYGovDebate
"Joe Torre"
"Key & Peele"
"Watching Casper"
"Oliver and Thea"
#willmakesushappy
#ReplaceAnAnimeTitleWithAss
#AvengersAgeOfUltron
#ignitethegrind
#AskSierraDallas
"James Spader"
"White House"
Drumline
Canada
"Young Thug"
#BryantAndNashNewVideo
#ASKLOHANTHONY
#OttawaShooting
#Z100Rules
"Nathan Cirillo"
"Michael Zehaf-Bibeau"
Canada
Halloween
Drumline
"Jersey Shore"
#BryantAndNashNewVideo
#ListenToGhostOnYouTube
#OttawaShooting
#5SOSAmnesiaLyrics
#BigTimeLyrics
"Nathan Cirillo"
"Michael Zehaf-Bibeau"
"Ben Bradlee"
Makonnen
Inbox
#5SOSAmnesiaLyrics
#ANDvAFC
#OttawaShooting
#yesboo
#SELFIEFORSEB
"Liverpool 0-3 Real Madrid"
Poldi
Podolski
"WHY IS FOOD SO GOOD"
Olympiacos
#AskZachAttack
#LiverpoolVsRealMadrid
#OttawaShooting
#BryantAndNashNewVideo
#IfICouldTimeTravel
Reus
"Google Inbox"
Coutinho
"Ben Bradlee"
Mignolet
#AskZachAttack
#LiverpoolVsRealMadrid
#OttawaShooting
"You'll Never Walk Alone"
#IfICouldTimeTravel
#BryantAndNashNewVideo
"David J. Stern Sports Scholarship"
"Google Inbox"
Reds
Parliament
#PrayForOttawa
#OttawaShooting
#IfICouldTimeTravel
#StaySafeOttawa
"Canadian Parliament"
#twitterflight
"Friday After Next"
"S Club 7"
"Google Inbox"
Crowder

 


How about putting each line into an array element, then making a copy of that array and using _ArrayUnique to filter out the duplicates? Then, in a For loop, use StringReplace to search through the original array (converted with _ArrayToString and a delimiter); the number of exact matches it found is available in the @extended macro.

You can make a 2D array and put the string you searched for in the first column and the @extended count in the second, like so: Local $array[1][2] = [['sample string', 5]]
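
For anyone following along, a minimal sketch of that idea might look like the following. The sample array and the "|" delimiter are assumptions for illustration only (the real lines would come from file1.txt, for example via FileReadToArray), and the delimiter must be something that never occurs in the data. Wrapping every entry in its own pair of delimiters is there so a short line such as Ottawa is not counted inside a longer one such as #OttawaShooting.

```autoit
#include <Array.au3>

; Hypothetical sample data standing in for the lines of file1.txt
Local $aLines[6] = ["Canada", "Halloween", "Canada", "Ottawa", "#OttawaShooting", "Canada"]

; Assumption: "|" never appears in the data
Local $sDelim = "|"

; Give every entry its own leading and trailing delimiter:
; |Canada||Halloween||Canada||Ottawa||#OttawaShooting||Canada|
Local $sAll = $sDelim & _ArrayToString($aLines, $sDelim & $sDelim) & $sDelim

; With default parameters, element [0] holds the number of unique entries
Local $aUnique = _ArrayUnique($aLines)

; Column 0 = line text, column 1 = number of occurrences
Local $aCounts[$aUnique[0]][2]

For $i = 1 To $aUnique[0]
    ; The return value is not needed; the call itself sets @extended
    StringReplace($sAll, $sDelim & $aUnique[$i] & $sDelim, "")
    $aCounts[$i - 1][1] = @extended
    $aCounts[$i - 1][0] = $aUnique[$i]
Next

; Print each unique line with its count
For $i = 0 To UBound($aCounts) - 1
    ConsoleWrite($aCounts[$i][0] & " : " & $aCounts[$i][1] & @CRLF)
Next
```

Both _ArrayUnique and StringReplace are case-insensitive by default, so the two steps stay consistent with each other here.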

Please ask if you have any questions. Give it a try and post script and I'm sure all of us would be happy to help. :)

Edited by MikahS


Show me what you have accomplished and I'd be happy to take a look. :)


To keep it short, what I have is basically:
 

for $l = 1 to ubound($unique_array) - 1
global $replace = StringReplace($trend_array_string,$unique_array[$l],"ReplacementString")
global $numreplacements = @extended
ConsoleWrite("The number of replacements done was : " & $unique_array[$l] & " : " & $numreplacements & @CRLF)
Next

And the output

The number of replacements done was : #SheNeverLeft : 2
The number of replacements done was : #LL2014 : 1
The number of replacements done was : "Between Two Ferns" : 4
The number of replacements done was : #askgrizfolk : 1
The number of replacements done was : #LameApocalypses : 13
The number of replacements done was : "Gold Glove" : 1
The number of replacements done was : "Diwali in India" : 2
The number of replacements done was : "Jeanie Buss" : 4
The number of replacements done was : #1Dbigannouncement : 3
The number of replacements done was : "Avengers 2" : 8
The number of replacements done was : #OnTheRoadAgain1D : 1
The number of replacements done was : #OttawaShooting : 14
The number of replacements done was : #FastFoodSlogans : 1
The number of replacements done was : "Darnell Coles" : 1
The number of replacements done was : #MostClutch : 1
The number of replacements done was : #OnTheRoadAgainTour : 1
The number of replacements done was : "Happy Mole Day" : 6
The number of replacements done was : #OTRATour : 1
The number of replacements done was : #AYASummit : 2
The number of replacements done was : "Lisa Ann" : 1
The number of replacements done was : #StartSitESPN : 1
The number of replacements done was : #BryantAndNashNewVideo : 5
The number of replacements done was : "Kevin Vickers" : 1
The number of replacements done was : #poptech : 1
The number of replacements done was : #AvengersAgeOfUltron : 13
The number of replacements done was : #Engage2014 : 1
The number of replacements done was : Halloween : 9
The number of replacements done was : Canada : 7
The number of replacements done was : Christmas : 3
The number of replacements done was : Scorpio : 1
The number of replacements done was : #indysm : 1
The number of replacements done was : #HappyBirthdayGrandpaGrande : 2
The number of replacements done was : #StealMyGIF : 1
The number of replacements done was : #ICryAtRavesWhen : 6
The number of replacements done was : #AgeofUltron : 5
The number of replacements done was : "White House" : 2
The number of replacements done was : #PandaFunkFamily : 2
The number of replacements done was : "Happy Diwali" : 6
The number of replacements done was : "Frank Ocean" : 1
The number of replacements done was : #ZachGrandtourage : 1
The number of replacements done was : Ottawa : 15
The number of replacements done was : "Kim Possible" : 1
The number of replacements done was : "Lizzie McGuire" : 2
The number of replacements done was : #Paperwork : 1
The number of replacements done was : "Even Stevens" : 1
The number of replacements done was : "Jessica Lange" : 2
The number of replacements done was : "That's So Raven" : 1
The number of replacements done was : "Thinking About You - Frank Ocean" : 2
The number of replacements done was : "Gods & Monsters" : 2
The number of replacements done was : "Edward Mordrake" : 2
The number of replacements done was : "Lurie Poston" : 1
The number of replacements done was : #thankyouvessel : 3
The number of replacements done was : Viscant : 1
The number of replacements done was : "Gods and Monsters" : 1
The number of replacements done was : #WorldSeriesGame2 : 5
The number of replacements done was : "S Club 7" : 1
The number of replacements done was : "Mark Jackson" : 1
The number of replacements done was : "Legally Blonde" : 1
The number of replacements done was : #DontAskBeau : 1
The number of replacements done was : "Nick Swisher" : 1
The number of replacements done was : "Zach Mettenberger" : 2
The number of replacements done was : "Teaser Trail" : 2
The number of replacements done was : PrincAss : 1
The number of replacements done was : #AskBeau : 1
The number of replacements done was : Strickland : 1
The number of replacements done was : #VoightsRage : 1
The number of replacements done was : #tiannaQA : 1
The number of replacements done was : Dora : 1
The number of replacements done was : Patti : 1
The number of replacements done was : #ReplaceAnAnimeTitleWithAss : 3
The number of replacements done was : "One in 5,000" : 1
The number of replacements done was : #CrawfordsNewVideo : 1
The number of replacements done was : #100Things : 1
The number of replacements done was : "My Cinnamon Twist" : 1
The number of replacements done was : Kunitz : 1
The number of replacements done was : "Paranormal Activity 3" : 1
The number of replacements done was : #BabyDaddyChat : 1
The number of replacements done was : Ultron : 19
The number of replacements done was : #NYGovDebate : 1
The number of replacements done was : "Joe Torre" : 1
The number of replacements done was : "Key & Peele" : 1
The number of replacements done was : "Watching Casper" : 1
The number of replacements done was : "Oliver and Thea" : 1
The number of replacements done was : #willmakesushappy : 1
The number of replacements done was : #ignitethegrind : 1
The number of replacements done was : #AskSierraDallas : 1
The number of replacements done was : "James Spader" : 1
The number of replacements done was : Drumline : 2
The number of replacements done was : "Young Thug" : 1
The number of replacements done was : #ASKLOHANTHONY : 1
The number of replacements done was : #Z100Rules : 1
The number of replacements done was : "Nathan Cirillo" : 2
The number of replacements done was : "Michael Zehaf-Bibeau" : 2
The number of replacements done was : "Jersey Shore" : 1
The number of replacements done was : #ListenToGhostOnYouTube : 1
The number of replacements done was : #5SOSAmnesiaLyrics : 2
The number of replacements done was : #BigTimeLyrics : 1
The number of replacements done was : "Ben Bradlee" : 2
The number of replacements done was : Makonnen : 1
The number of replacements done was : Inbox : 3
The number of replacements done was : #ANDvAFC : 1
The number of replacements done was : #yesboo : 1
The number of replacements done was : #SELFIEFORSEB : 1
The number of replacements done was : "Liverpool 0-3 Real Madrid" : 1
The number of replacements done was : Poldi : 1
The number of replacements done was : Podolski : 1
The number of replacements done was : "WHY IS FOOD SO GOOD" : 1
The number of replacements done was : Olympiacos : 1
The number of replacements done was : #AskZachAttack : 2
The number of replacements done was : #LiverpoolVsRealMadrid : 2
The number of replacements done was : #IfICouldTimeTravel : 2
The number of replacements done was : Reus : 1
The number of replacements done was : "Google Inbox" : 2
The number of replacements done was : Coutinho : 1
The number of replacements done was : Mignolet : 1
The number of replacements done was : "You'll Never Walk Alone" : 1
The number of replacements done was : "David J. Stern Sports Scholarship" : 1
The number of replacements done was : Reds : 1
The number of replacements done was : Parliament : 1

So now I have the unique list, with the corresponding number of occurrences.  How would I extract the top X lines?
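
One possible way to get the top X lines, as a rough sketch: put the line/count pairs in a 2D array, sort it descending on the count column with _ArraySort, and write the first X rows to file2.txt. The sample pairs and the $iTop variable below are hypothetical stand-ins, not taken from a real script. Note also that the counts listed above for Ottawa and Ultron look inflated, presumably because StringReplace also finds them inside longer lines such as #OttawaShooting and #AvengersAgeOfUltron, so the counting step would need delimiters around each line before the ranking can be trusted.

```autoit
#include <Array.au3>
#include <FileConstants.au3>

; Hypothetical [line, count] pairs standing in for the real results
Local $aCounts[4][2] = [["#OttawaShooting", 14], ["Halloween", 9], ["#poptech", 1], ["Canada", 7]]

; Assumption: how many lines to keep (10 in the original question)
Local $iTop = 2

; Sort descending (second parameter = 1) on column 1, the count
_ArraySort($aCounts, 1, 0, 0, 1)

; Never ask for more rows than exist
If $iTop > UBound($aCounts) Then $iTop = UBound($aCounts)

Local $hFile = FileOpen("file2.txt", $FO_OVERWRITE)
If $hFile = -1 Then Exit MsgBox(0, "Error", "Could not open file2.txt")

For $i = 0 To $iTop - 1
    ; Stop once the remaining lines occur only once (not duplicates)
    If $aCounts[$i][1] < 2 Then ExitLoop
    FileWriteLine($hFile, $aCounts[$i][0] & " : " & $aCounts[$i][1])
Next

FileClose($hFile)
```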

Edited by phatzilla

phatzilla,

"To keep it short"

No need to keep it short; show your whole script.  The script that you posted cannot work.

The solution that MikahS posted is about 13 lines long...

kylomas

edit: comment struck out

edit2: About the code you posted:

  1. You shouldn't declare variables in a loop.
  2. It is not necessary to populate a variable to get StringReplace to set @extended.
Edited by kylomas
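
Applied to the posted loop, the two points above could look roughly like this. The sample values are hypothetical stand-ins so the snippet runs on its own; only the shape of the loop matters.

```autoit
; Hypothetical stand-ins for the variables in the posted snippet
Local $trend_array_string = "|Canada||Halloween||Canada|"
; Element [0] holds the count, as _ArrayUnique returns it by default
Local $unique_array[3] = [2, "|Canada|", "|Halloween|"]

; Point 1: declare once, outside the loop
Local $numreplacements

For $l = 1 To UBound($unique_array) - 1
    ; Point 2: no assignment needed, the call alone sets @extended
    StringReplace($trend_array_string, $unique_array[$l], "")
    ; Read @extended right away, before another function call resets it
    $numreplacements = @extended
    ConsoleWrite($unique_array[$l] & " : " & $numreplacements & @CRLF)
Next
```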


test.txt is a file with your data from post #3.

#include <array.au3>
$data = FileReadToArray('test.txt')
_ArraySort($data)
_ArrayDisplay($data)
$search = ''
Global $dup[0]
For $i = UBound($data)-1 To 0 step - 1
   If $data[$i] <> $search Then
      $search = $data[$i]
   Else
      _ArrayAdd($dup,$data[$i])
      _ArrayDelete($data,$i)
   EndIf
Next
_ArraySort($dup)
_ArrayDisplay($dup)
_ArrayDisplay($data)

Here is a way:

#Include <Array.au3>

Local $iCount
Local $sData = FileRead("data.txt")

; Uniq lines
Local $aDuplicates = StringRegExp($sData, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\1)", 3)

Local $aResult[ UBound($aDuplicates)][2]

For $i = 0 To UBound($aDuplicates) - 1
    $iCount = UBound(  StringRegExp($sData, "(?:\A|\R)\Q" & $aDuplicates[$i] & "\E(?=\R|\Z)", 3)  )

    $aResult[$i][0] = $aDuplicates[$i]
    $aResult[$i][1] = $iCount
    
Next

_ArraySort($aResult, 1, 0, 0, 1)

; Delete uniq rows ########################
For $i = UBound($aResult) - 1 To 0 Step -1
    If $aResult[$i][1] > 1 Then ExitLoop
Next
Redim $aResult[$i + 1][2]
; #########################################



_ArrayDisplay($aResult)
Edited by jguinch

Another way:

#include <array.au3>
Global $aData = FileReadToArray('File1.txt'), $aDuplicates[0][2], $iIndex = 0
_ArrayInsert($aData, 0)                            ; make array 1-based
_ArraySort($aData)                                ; sort data in input
For $i = 1 To UBound($aData) - 1                ; loop all elements of sorted data
    $aCount = _ArrayFindAll($aData, $aData[$i])    ; for each element count how many there are
    $nCount = UBound($aCount)
    If $nCount > 1 Then                            ; if there are more than 1
        ReDim $aDuplicates[UBound($aDuplicates) + 1][2] ; make room in output array
        $aDuplicates[$iIndex][0] = $aData[$i]    ; insert its value in output array
        $aDuplicates[$iIndex][1] = $nCount        ; and how many there are
        $iIndex += 1                            ; point to next free output element
        $i += $nCount - 1                        ; skip the remaining same elements
    EndIf
Next
If UBound($aDuplicates) Then
    _ArraySort($aDuplicates, 1, 0, 0, 1)
    _ArrayDisplay($aDuplicates)
Else
    MsgBox(0, "Result", "There are no duplicates")
EndIf

edit:

removed previous listing with a bug

added comments

Edited by Chimp

 


 

Here is a way :

#Include <Array.au3>

Local $iCount
Local $sData = FileRead("data.txt")

; Uniq lines
Local $aDuplicates = StringRegExp($sData, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\1)", 3)

Local $aResult[ UBound($aDuplicates)][2]

For $i = 0 To UBound($aDuplicates) - 1
    $iCount = UBound(  StringRegExp($sData, "(?:\A|\R)\Q" & $aDuplicates[$i] & "\E(?:\R|\Z)", 3)  )

    $aResult[$i][0] = $aDuplicates[$i]
    $aResult[$i][1] = $iCount
    
Next

_ArraySort($aResult, 1, 0, 0, 1)

; Delete uniq rows ########################
For $i = UBound($aResult) - 1 To 0 Step -1
    If $aResult[$i][1] > 1 Then ExitLoop
Next
Redim $aResult[$i + 1][2]
; #########################################



_ArrayDisplay($aResult)

 

If the input file contains only one value repeated several times, your function fails.

Try with an input file like this, for example:

one
one

or like this:

123
123
123
123

 


This seems to work...

#include <array.au3>

local $str = fileread(@desktopdir & '\test2.txt')                   ;   create string var from file
$str = stringregexpreplace($str,'(.+)(\R|$)','`\1`' & @crlf)        ;   delimit string for stringreplace (else "aa" would match "aaa" and "aaaa")
local $aStr1 = stringsplit($str,@crlf,3)                            ;   create array from string var
$aStr1 = _arrayunique($aStr1,0,0,0,0)                               ;   eliminate duplicate entries
local $aStr2[ubound($aStr1)][2]                                     ;   create 2D array sized to 1st array

for $1 = 0 to ubound($aStr1) - 1                                    ;   loop thru array
    stringreplace($str,$aStr1[$1],'')                               ;   get # of occurrences from string
    $aStr2[$1][1] = @extended                                       ;   populate count
    $aStr2[$1][0] = stringregexpreplace($aStr1[$1],'`(.*)`','\1')   ;   populate string
Next

_arraysort($aStr2,1,0,0,1)                                          ;   sort on count column
redim $aStr2[10][2]                                                 ;   cut array down to 10 entries
_arraydisplay($aStr2)                                               ;   voilà

@jguinch - I think I'm going to love your use of regexp, if I ever figure it out...

Edited by kylomas


You are right, Chimp. I edited my code: I replaced (?:) with (?=) in the 2nd regex (it was an oversight o:) )
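
To illustrate why that one change matters, here is a small sketch with a hypothetical two-line input of the same shape as Chimp's example. With the non-capturing group, the line break after the first "one" is consumed by the first match, so the second "one" has no line break left for its leading (?:\A|\R). With the lookahead, the line break is only checked and stays available. I would expect the first count to come out as 1 and the second as 2.

```autoit
; Hypothetical input: the same line twice, as in the failing example
Local $sData = "one" & @CRLF & "one"

; Non-capturing group: the trailing line break becomes part of the match
Local $aConsumed = StringRegExp($sData, "(?:\A|\R)\Qone\E(?:\R|\Z)", 3)

; Lookahead: the trailing line break is tested but not consumed
Local $aLookahead = StringRegExp($sData, "(?:\A|\R)\Qone\E(?=\R|\Z)", 3)

; UBound of the returned array is the number of matches found
ConsoleWrite("(?:...) matches: " & UBound($aConsumed) & @CRLF)
ConsoleWrite("(?=...) matches: " & UBound($aLookahead) & @CRLF)
```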

Thanks kylomas.

Similar code, but with the non-duplicate lines removed at the beginning:

#Include <Array.au3>

Local $iCount
Local $sData = FileRead("data.txt")

; Eliminate non Duplicates
Local $sDuplicates = StringRegExpReplace($sData, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\R\1)", "")
; Duplicates in an Array (uniq rows)
Local $aDuplicates = StringRegExp($sDuplicates, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\1)", 3)

Local $aResult[ UBound($aDuplicates)][2]

For $i = 0 To UBound($aDuplicates) - 1
    $aResult[$i][0] = $aDuplicates[$i]
    $aResult[$i][1] = UBound(  StringRegExp($sData, "(?:\A|\R)\Q" & $aDuplicates[$i] & "\E(?=\R|\Z)", 3)  )
Next

_ArraySort($aResult, 1, 0, 0, 1)
_ArrayDisplay($aResult)
