Remove duplicated strings in a file


after reading each line in the file to an array.

aka _FileReadToArray()

EDIT: Note that arrays can hold roughly 16 million elements, so you're good =D

Edited by mechaflash213

Comparing two ways of removing duplicate strings from a file: the array method and the regular expression method.

Using a 1,011-line file:

Array method: approx. 2.9 secs.

Reg. Exp. method: approx. 0.4 secs.

If the RunWait Sort call is not needed, that is, if the file is already sorted, the Reg. Exp. method completes in approx. 0.01 secs.

The array method:

#include <File.au3>
#include <Array.au3>

Local $sDupString = _
        "ACM" & @CRLF & _
        "ACM" & @CRLF & _
        "ACM" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "ACM" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "ALMA" & @CRLF & _
        "ALMA" & @CRLF
Local $sDup
For $i = 1 To 500
    $sDup = Chr(Random(66, 90, 1)) & Chr(Random(66, 90, 1)) & Chr(Random(66, 90, 1)) & Chr(Random(66, 90, 1)) & @CRLF
    $sDupString &= $sDup & $sDup
Next

Local $sFileName = "DupFile.txt"
If FileExists($sFileName) Then FileDelete($sFileName)
FileWrite($sFileName, $sDupString)

Local $begin = TimerInit()
Local $aDupArray
_FileReadToArray($sFileName, $aDupArray)
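;_ArrayDisplay($aDupArray)

; $aDupArray[0] holds the line count. _ArrayUnique() (default base 0) treats it as data,
; so the result has its own count in [0] and the old line count in [1].
Local $aModArray = _ArrayUnique($aDupArray)

; Start at index 2 to skip both counters.
Local $sModString = _ArrayToString($aModArray, @CRLF, 2)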

ConsoleWrite("Time: " & TimerDiff($begin) & @LF) ; Time: 2917.63390849968
MsgBox(0, "Results", StringStripWS($sModString, 2))
;_ArrayDisplay($aModArray)

FileDelete($sFileName)

The Regular Expression method:

Local $sDupString = _
        "ACM" & @CRLF & _
        "ACM" & @CRLF & _
        "ACM" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "ACM" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "ALMA" & @CRLF & _
        "ALMA" & @CRLF
Local $sDup
For $i = 1 To 500
    $sDup = Chr(Random(66, 90, 1)) & Chr(Random(66, 90, 1)) & Chr(Random(66, 90, 1)) & Chr(Random(66, 90, 1)) & @CRLF
    $sDupString &= $sDup & $sDup
Next
Local $sFileName = "DupFile.txt"

If FileExists($sFileName) Then FileDelete($sFileName)
FileWrite($sFileName, $sDupString)
;ShellExecute($sFileName)
Local $begin = TimerInit()

RunWait(@ComSpec & " /c Sort " & $sFileName & " /O " & $sFileName, "", @SW_HIDE) ; Sort file

Local $sModString = StringRegExpReplace(FileRead($sFileName), "([^\v]*)(\v+)(\1\2*)*", "$1$2")
; Or
;Local $sModString = StringRegExpReplace(FileRead($sFileName) & @CRLF, "([^\v]*\v+)(\1*)", "\1")

ConsoleWrite("Time: " & TimerDiff($begin) & @LF) ; Time: 370.465418593286
MsgBox(0, "Results", StringStripWS($sModString, 2))

FileDelete($sFileName)
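
One caveat that may be worth checking before using the first pattern on arbitrary data: because \2* is allowed to match nothing, the repeated group (\1\2*) can apparently match just the beginning of a following line that starts with the previous line. The commented-out alternative keeps the line break inside the captured group, so only whole repeated lines should match. A minimal sketch to compare the two, using made-up sample lines that are not from the test file above:

```autoit
; Made-up sample: "AB" is a prefix of the next line, "ABC".
Local $sSorted = "AB" & @CRLF & "ABC" & @CRLF & "ABC" & @CRLF

; First pattern: \2* may match nothing, so (\1\2*) can match only the "AB" at the start of "ABC".
Local $sFirst = StringRegExpReplace($sSorted, "([^\v]*)(\v+)(\1\2*)*", "$1$2")

; Alternative pattern: the line break is part of group 1, so a repeat must be a whole line.
Local $sSecond = StringRegExpReplace($sSorted, "([^\v]*\v+)(\1*)", "\1")

ConsoleWrite("First pattern:" & @CRLF & $sFirst)
ConsoleWrite("Alternative pattern:" & @CRLF & $sSecond)
```

If the two outputs differ for your data, the alternative pattern is probably the safer choice. The sample lists in this thread contain no such prefix pairs, so they would not show the difference.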

Func _CreateFile($Path_In)
    Local $hFile, $i, $s = ''
    For $i = 0 To 1000000
        $s &= Random(1, 1000, 1) & @CRLF
    Next
    $hFile = FileOpen($Path_In, 2) ; 2 = write mode, erase previous contents
    FileWrite($hFile, $s)
    FileClose($hFile)
EndFunc

$Path_In = @ScriptDir & '\test_in.txt'
_CreateFile($Path_In)
$Path_Out = @ScriptDir & '\test_Out.txt'
$sText = FileRead($Path_In)
$err = 0
$timer = TimerInit()
$aText_Out = _StringUnique($sText)
If @error Then $err = @error
$timer = Round(TimerDiff($timer) / 1000, 2)
If $err Then
    MsgBox(0, 'error', 'not found' & @CRLF & 'Time = ' & $timer & ' sec')
    Exit
Else
    $hFile = FileOpen($Path_Out, 2)
    FileWrite($hFile, $aText_Out)
    FileClose($hFile)
EndIf

MsgBox(0, "Time", 'Time = ' & $timer & ' sec')

; not case sensitive, String = StRiNg = STRING
Func _StringUnique($sText, $sep = @CRLF)
    Local $i, $k, $aText, $s, $Trg = 0, $LenSep

    ; If the text contains '[', swap it for a character found in neither the text nor the separator
    If StringInStr($sText, '[') And $sep <> '[' Then
        For $i = 0 To 255
            $s = Chr($i)
            If Not StringInStr($sText, $s) Then
                If StringInStr($sep, $s) Then ContinueLoop
                $sText = StringReplace($sText, '[', $s)
                $Trg = 1
                ExitLoop
            EndIf
        Next
        If Not $Trg Then Return SetError(1, 0, '') ; no free character available
    EndIf

    $LenSep = StringLen($sep)
    $aText = StringSplit($sText, $sep, 1) ; 1 = use the whole separator string
    Assign('/', 2, 1) ; pre-declare the name an empty line would produce, so empty lines are skipped
    $k = 0
    $sText = ''
    For $i = 1 To $aText[0]
        ; A local variable named after each line marks that line as already seen
        If Not IsDeclared($aText[$i] & '/') Then
            Assign($aText[$i] & '/', 0, 1)
            $sText &= $aText[$i] & $sep
            $k += 1
        EndIf
    Next
    If $k = 0 Then Return SetError(2, 0, '')
    If $Trg Then $sText = StringReplace($sText, $s, '[') ; restore '['
    Return StringTrimRight($sText, $LenSep)
EndFunc

Edited by AZJIO
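
The function above uses the local variable table as a lookup of lines already seen, which is why it needs no sorting and keeps the original order. The same idea can be sketched with a Scripting.Dictionary COM object instead of Assign/IsDeclared. This is a different technique from the one posted, and _StringUniqueDict is a hypothetical name chosen here for illustration; it assumes the Windows Scripting.Dictionary object is available.

```autoit
; Sketch only: _StringUniqueDict is a hypothetical helper, not part of the thread's code.
Func _StringUniqueDict($sText, $sSep = @CRLF)
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    If Not IsObj($oSeen) Then Return SetError(1, 0, '')
    $oSeen.CompareMode = 1 ; 1 = text compare, so String = StRiNg = STRING

    Local $aLines = StringSplit($sText, $sSep, 1) ; 1 = use the whole separator string
    Local $sOut = ''
    For $i = 1 To $aLines[0]
        ; Skip empty lines and lines that were already added
        If StringLen($aLines[$i]) = 0 Or $oSeen.Exists($aLines[$i]) Then ContinueLoop
        $oSeen.Add($aLines[$i], 0)
        $sOut &= $aLines[$i] & $sSep
    Next
    Return StringTrimRight($sOut, StringLen($sSep))
EndFunc

; Usage: "acm" is dropped as a case-insensitive duplicate of "ACM".
ConsoleWrite(_StringUniqueDict("ACM" & @CRLF & "AJE" & @CRLF & "acm" & @CRLF & "ALMA") & @CRLF)
```

Since the dictionary accepts any string as a key, the '[' replacement step should not be needed here. Leaving CompareMode at its default of 0 would make the comparison case-sensitive.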

Malkey,

Nice regular expression.


Without writing to a file... (thanks to Malkey for the SRE.)

#include <Constants.au3>

Local $sString = _
        "ACM" & @CRLF & _
        "ACM" & @CRLF & _
        "ACM" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "ACM" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "AJE" & @CRLF & _
        "ALMA" & @CRLF & _
        "ALMA" & @CRLF

$sString = _Sort($sString)
$sString = _RemoveDuplicates($sString)
ConsoleWrite($sString & @CRLF)

Func _RemoveDuplicates($sData)
;~  Return StringRegExpReplace($sData, '([^\v]*)(\v+)(\1\2*)*', '$1$2') ; By Malkey
;~  Return StringRegExpReplace($sData, '([^\v]*\v+)(\1*)', '\1') ; By Malkey
    Return StringRegExpReplace($sData, '([^\R]+?)(\R+)(\1\R+)+', '\1' & @CRLF) ; By AZJIO
EndFunc   ;==>_RemoveDuplicates

Func _Sort($sSortList)
    Local $iPID = Run('sort.exe', @SystemDir, @SW_HIDE, $STDIN_CHILD + $STDOUT_CHILD), $sOutput = ''

    StdinWrite($iPID, $sSortList)
    StdinWrite($iPID)

    While 1
        $sOutput &= StdoutRead($iPID)
        If @error Then
            ExitLoop
        EndIf
    WEnd
    Return $sOutput
EndFunc   ;==>_Sort
Edited by guinness


AZJIO,

OK, point taken.


guinness

#include <Constants.au3>
Local $sString = _
     "ACM" & @CRLF & _
     "ACM" & @CRLF & _
     "AcM" & @CRLF & _
     "ACM" & @CRLF & _
     "AJE" & @CRLF & _
     "AJE" & @CRLF & _
     "AjE" & @CRLF & _
     "AJE" & @CRLF & _
     "AJE" & @CRLF & _
     "AJE" & @CRLF & _
     "ALMA" & @CRLF & _
     "ALMA" & @CRLF & _
     "aLMA" & @CRLF & _
     "ALMA" & @CRLF
$sString = _Sort($sString)
$sString = _RemoveDuplicates($sString)
MsgBox(0, 'True? Yes?', $sString)
Func _RemoveDuplicates($sData)
    Return StringRegExpReplace($sData, '([^\v]*)(\v+)(\1\2*)*', '$1$2') ; By Malkey
EndFunc   ;==>_RemoveDuplicates

Func _Sort($sSortList)
    Local $iPID = Run('sort.exe', @SystemDir, @SW_HIDE, $STDIN_CHILD + $STDOUT_CHILD), $sOutput = ''
    StdinWrite($iPID, $sSortList)
    StdinWrite($iPID)
    While 1
        $sOutput &= StdoutRead($iPID)
        If @error Then
            ExitLoop
        EndIf
    WEnd
    Return $sOutput
EndFunc   ;==>_Sort

The result is incorrect:

ACM
AcM
ACM
AJE
AjE
AJE
ALMA
aLMA
ALMA

Edited by AZJIO

Understood. I've changed my post to use your SRE.


The problem is not in the SRE.

The sort is case-insensitive, but a backreference to a group in the regular expression requires an exact match. Changing to a different SRE gives the same result. For a case-insensitive match, add (?i):

(?i)([^\R]+?)(\R+)(\1\R+)+

(?i) does not work for Cyrillic.

If the last line has no trailing LF, the regular expression needs to change:

(?i)([^\R]+?)(\R+)(\1(\R+|\z))+

Edited by AZJIO