Remove duplicate line entries from huge files


Simucal

I have a file that looks like this:

<?xml version="1.0"?>
<GliderConfig>
  <WaypointCloseness>5.0</WaypointCloseness>
  <WebNotifyCredentials>
  </WebNotifyCredentials>
  <WebNotifyEnabled>False</WebNotifyEnabled>
  <WebNotifyURL>
  </WebNotifyURL>
  <WindowPos>-1244,346</WindowPos>
  <LoadClasses>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>TreeLock 1.24.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
    <SourceFile>Pogue 2.33.cs</SourceFile>
  </LoadClasses>
</GliderConfig>

Except you need to multiply the number of duplicate <SourceFile> entries by a few hundred thousand. The file is almost 8 MB of nothing but duplicate SourceFile entries, and I need to remove the dupes quickly and efficiently.

Presently, I'm using _FileReadToArray and Smoke_N's _ArrayUnique function, but this is obviously not going to work well in my current situation. Does anyone have any suggestions for a faster method?

#Include <Array.au3>
#Include <File.au3>

Local $sFilePath = "C:\test.txt"
Local $aStartText
Local $aEndText
Local $sEndText

_RemoveDuplicateSourceLines()

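Func _RemoveDuplicateSourceLines()
    ; _FileReadToArray stores the line count in element [0], so start at index 1
    _FileReadToArray($sFilePath, $aStartText)
    $aEndText = _ArrayUnique($aStartText, '', 1)
    ; StringSplit also puts a count in [0]; skip it when joining
    $sEndText = _ArrayToString($aEndText, @CRLF, 1)
    FileWrite("C:\test_result.txt", $sEndText)
EndFunc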

Func _ArrayUnique(ByRef $aArray, $vDelim = '', $iBase = 1, $iUnique = 1)
    If $vDelim = '' Then $vDelim = Chr(01)
    Local $sHold
    For $iCC = $iBase To UBound($aArray) - 1
        If Not StringInStr($vDelim & $sHold, $vDelim & $aArray[$iCC] & $vDelim, $iUnique) Then _
            $sHold &= $aArray[$iCC] & $vDelim
    Next
    Return StringSplit(StringTrimRight($sHold, StringLen($vDelim)), $vDelim)
EndFunc
AutoIt Scripts:Aimbot: Proof of Concept - PixelSearching Aimbot with several search/autoshoot/lock-on techniques.Sliding Toolbar - Add a nice Sliding Toolbar to your next script. Click the link to see an animation of it in action!FontInfo UDF - Get list of system fonts, or search to see if a particular font is installed.Get Extended Property UDF - Retrieve a files extended properties (e.g., video/image dimensions, file version, bitrate of song/video, etc)

try reading each line one by one and doing a check to see if the current line is unique.

Edit: try this:

#Include <Array.au3>
#Include <File.au3>

Local $sFilePath = "test.txt"
Local $sUniqueStrings = "" ;contains all unique strings in format: *STRING*STRING*

$file = FileOpen($sFilePath, 0)

; Check if file opened for reading OK
If $file = -1 Then
    MsgBox(0, "Error", "Unable to open file.")
    Exit
EndIf

; Read in lines of text until the EOF is reached
While 1
    $line = FileReadLine($file)
    If @error = -1 Then ExitLoop
    
    ;add to unique string if line is unique; wrap in "*" so only whole entries match, not substrings
    If StringLen($line) > 0 And Not StringInStr("*" & $sUniqueStrings, "*" & $line & "*") Then $sUniqueStrings &= $line & "*"
        
Wend
FileClose($file)

;trim trailing *
$sUniqueStrings = StringTrimRight($sUniqueStrings, 1)

_ArrayDisplay(StringSplit($sUniqueStrings, "*"))

took 1 second with the test file provided
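
A note on the uniqueness check: a plain StringInStr search treats a line as already seen whenever it is a substring of an earlier entry. The following is a minimal sketch of that pitfall and of the delimiter-wrapped check. The sample strings are hypothetical and only for illustration.

```autoit
; Hypothetical accumulated list in a *-delimited format
Local $sSeen = "*TreeLock 1.24.cs*"
; A different line that happens to be a substring of the stored entry
Local $sLine = "Lock 1.24.cs"

; Plain substring search: returns a non-zero position even though
; this exact line was never stored
ConsoleWrite("Plain search position:   " & StringInStr($sSeen, $sLine) & @CRLF)

; Delimiter-wrapped search: only whole entries can match, so this returns 0
ConsoleWrite("Wrapped search position: " & StringInStr($sSeen, "*" & $sLine & "*") & @CRLF)
```

This assumes the delimiter character never appears inside a line. For data where it might, a character such as Chr(1) is a safer choice.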

Edited by ame1011
[font="Impact"] I always thought dogs laid eggs, and I learned something today. [/font]

Hello Simucal,

Well, I didn't solve the problem, but in case you needed some metrics.

Method 1 = Original function in OP's code

Method 2 = Function using dictionary object ( see below )

Func _RemoveDuplicateSourceLines()
    Local $dctFinalText = ObjCreate( "Scripting.Dictionary" )   
    _FileReadtoArray($sFilePath, $aStartText)
    For $i = 1 to $aStartText[0]
        If Not $dctFinalText.Exists( $aStartText[$i] ) Then _
            $dctFinalText.Add( $aStartText[$i], "" )
    Next
    
    Local $hFile = FileOpen( "C:\test_result.txt", 2 ) ; 2 = overwrite, so reruns do not append to old results
    Local $aItems = $dctFinalText.Keys
    For $j = 0 to ( $dctFinalText.Count - 1 )
        FileWrite( $hFile, $aItems[$j] & @CRLF) 
    Next 
    FileClose( $hFile )
EndFunc

Results:

File: Original text file with LoadClasses cut-and-pasted repeatedly
System: Intel(R) Core(TM)2 CPU T7600  @ 2.33GHz, 1.99 GB of RAM
( 4 runs averaged )
All times in milliseconds

Text file with 239370 lines ( 10,686 KB )
            Avg/File       Avg/Line
Method 1    3322.08877     0.01388
Method 2    2917.31510     0.01219

Text file with 1436160 lines ( 64,115 KB )
            Avg/File       Avg/Line
Method 1    19963.86229    0.01390
Method 2    17712.96006    0.01233

Zach...

Edited by zfisherdrums

took 1 second with the test file provided

The only problem is that you need to increase the size of the file I provided by at least 100 times. I've seen them range from 8 MB to 100 MB, full of duplicates. The help file even says this is a slow method.

Edited by Simucal

What do you guys think about reading the whole file into a string and stripping the duplicates out? Ugh, I don't know. I'll look into the XML UDFs and see what kind of speed I get with that.

Hi,

Is post #4 too slow as well? 10Mb in 19secs seems pretty good to me...? [correction, 64Mb in 19secs...]

Are you expecting something quicker?

Best, randall

Edited by randallc

Thanks for taking the time to take a look at this Zach, I appreciate it. That is a pretty big improvement in speed and I'm using it as my current method.

Hi,

Is post #4 too slow as well? 10Mb in 19secs seems pretty good to me...?

Are you expecting something quicker?

Best, randall

Honestly, yeah. If I can get something faster I could really use it. 10 MB per 19 seconds doesn't sound too bad, but I've just started running into 200 MB files. At that rate, that is 6.3 minutes.

Just a clarification: using Method 2, the 62.6 MB file was parsed in about 17.7 seconds. The 10 MB file averages just under 3 seconds.

Unless my calculations are incorrect, a 200 MB file should be parsed in about 56 seconds.
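
For anyone who wants to check that estimate, here is a small sketch of the arithmetic. It assumes the time per KB stays constant as the file grows, which the nearly identical per-line averages at both file sizes suggest but do not guarantee. The 200 MB target is taken as 200 * 1024 KB.

```autoit
; Figures copied from the Method 2 row of the 64,115 KB benchmark
Local $fMeasuredMs = 17712.96006
Local $iMeasuredKB = 64115

; Target size: 200 MB expressed in KB (assumption: 1 MB = 1024 KB)
Local $iTargetKB = 200 * 1024

; Linear extrapolation: time per KB multiplied by the target size
Local $fEstimateSec = $fMeasuredMs / $iMeasuredKB * $iTargetKB / 1000
ConsoleWrite("Estimated time for 200 MB: " & Round($fEstimateSec, 1) & " seconds" & @CRLF)
```

This works out to roughly 56 to 57 seconds. Actual times on very large files may be worse if the machine starts paging, since the whole file is held in memory.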

Either way the second method achieves only a slight gain in speed compared to Method 1; but it is consistent.

Finally, this is assuming that the machine is at least an Intel® Core2 CPU T7600 @ 2.33GHz, 1.99 GB of RAM.

Zach...

Edited by zfisherdrums

Could I get you to try this against the same test file to see how it times?

$sSrcFile = "C:\Temp\Test.txt"
$sResultFile = "C:\Temp\Results.txt"
$sFileData = FileRead($sSrcFile)
$sResults = ""
While 1
    $avSearch = StringRegExp($sFileData, "<SourceFile>.+</SourceFile>", 1)
    If @error Then
        ExitLoop
    Else
        $sResults &= $avSearch[0] & @CRLF
        $sFileData = StringReplace($sFileData, $avSearch[0], "")
    EndIf
WEnd

$sResults = StringTrimRight($sResults, 2) ; remove trailing @CRLF (two characters)
$sResults = StringRegExpReplace($sResults, "</{0,}SourceFile>", "")
$hResultFile = FileOpen($sResultFile, 2)
FileWrite($hResultFile, $sResults)
FileClose($hResultFile)

Thanks.

:)

Edited by PsaltyDS
Valuater's AutoIt 1-2-3, Class... Is now in Session!For those who want somebody to write the script for them: RentACoder"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law

Your version is by far the fastest ( no surprise there ). It appears to average just north of 7 seconds using the 62.6 mb file. The only issue I saw was the results file only contained the line items that were duplicates instead of ( Original File - duplicate lines ); but the proof is in the pudding. Thanks for showing me how to utilize RegExs for this kind of task - very helpful!

You could use the XML UDF and use an XSLT transformation to remove duplicates (See #5):

This looks fun as well. Thanks for posting the link!

Zach...

Edited by zfisherdrums

Your version is by far the fastest ( no surprise there ). It appears to average just north of 7 seconds using the 62.6 mb file.

:)^_^ W00t! Party time! :);)

The only issue I saw was the results file only contained the line items that were duplicates instead of ( Original File - duplicate lines ); but the proof is in the pudding.

I thought that was the task at hand -- to just list each unique source file, didn't make any attempt to preserve the rest of the XML. I might have misunderstood the specs... ;)

Thanks for showing me how to utilize RegExs for this kind of task - very helpful!

Show you! I'm a RegExp dummy from way back. That was an hour's worth of Googling and reading to figure out how to do it. I was showing ME!

:D


Show you! I'm a RegExp dummy from way back. That was an hour's worth of Googling and reading to figure out how to do it. I was showing ME!

Hi,

thanks for all your efforts! Only an hour.. I gotta speed up.

I tried to do this for removing dupes, but got nowhere; you are nearly there already.

I thought that was the task at hand -- to just list each unique source file, didn't make any attempt to preserve the rest of the XML. I might have misunderstood the specs...

"unique source file" - do you mean "unique source line in file"?

I still would like to get a UDF going to remove Dupes faster than ScriptDirectory.com; can you show me how using this as a starting point? - maybe even just explanation comments of your script would guide me enough?; or should I just loop through your dupes result array and do string replace lines with blanks? [then add one copy?]

Best, randall

Edited by randallc

After perhaps misunderstanding what Simucal was trying to do, I coded this (newly commented):

$sSrcFile = "C:\Temp\Test.txt" ; input file
$sResultFile = "C:\Temp\Results.txt" ; output file
$sFileData = FileRead($sSrcFile) ; read input file as a single string
$sResults = "" ; String for assembling results

; Main loop
While 1
    ; This RegExp returns only the first match found each time, 
    ;   i.e.  $avSearch[0] = "<SourceFile>TreeLock 1.24.cs</SourceFile>"
    $avSearch = StringRegExp($sFileData, "<SourceFile>.+</SourceFile>", 1)
    If @error Then
        ; @error = no matches left = done
        ExitLoop
    Else
        ; Add current match to the results string
        $sResults &= $avSearch[0] & @CRLF 
        ; Remove any other identical fields from input string, so no duplicates are left
        $sFileData = StringReplace($sFileData, $avSearch[0], "") 
    EndIf
WEnd

; remove trailing @CRLF (two characters, so trim 2)
$sResults = StringTrimRight($sResults, 2)
; Remove field tags, leaving only filenames, i.e. "TreeLock 1.24.cs"
$sResults = StringRegExpReplace($sResults, "</{0,}SourceFile>", "") 
; Write results to output file
$hResultFile = FileOpen($sResultFile, 2)
FileWrite($hResultFile, $sResults)
FileClose($hResultFile)

The idea was simply to read it all in as one string (AutoIt strings can be up to 2GB, I believe), use a RegExp to find the desired field, and delete each found field from the rest of the string before continuing to prevent duplicate finds.

This has the additional advantage of making the remaining string smaller after each find, so the RegExp times get faster as it progresses.

I thought there might be some kind of uber-geeky RegExp backreference to the global matches so far. Then you could use a pattern that said "matches THIS, AND NOT already in the global match list". A pattern like that could be run just once and return the global matches. Alas, couldn't find such a thing.
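
One possible single-pass alternative, offered only as a sketch and not timed against the files discussed in this thread: let StringRegExp return every match at once (flag 3), then use a Scripting.Dictionary to keep the first occurrence of each filename. The sample data and variable names below are hypothetical.

```autoit
; Hypothetical sample data standing in for the real file contents
Local $sFileData = "<SourceFile>TreeLock 1.24.cs</SourceFile>" & @CRLF & _
        "<SourceFile>TreeLock 1.24.cs</SourceFile>" & @CRLF & _
        "<SourceFile>Pogue 2.33.cs</SourceFile>" & @CRLF

; Flag 3 returns an array of all global matches (the captured filenames)
Local $aMatches = StringRegExp($sFileData, "<SourceFile>(.+?)</SourceFile>", 3)
If @error Then Exit ; no matches found

; The dictionary records which filenames have already been seen
Local $oSeen = ObjCreate("Scripting.Dictionary")
Local $sResults = ""
For $i = 0 To UBound($aMatches) - 1
    If Not $oSeen.Exists($aMatches[$i]) Then
        $oSeen.Add($aMatches[$i], "")
        $sResults &= $aMatches[$i] & @CRLF
    EndIf
Next

; Prints each unique filename once, in order of first appearance
ConsoleWrite($sResults)
```

The trade-off is memory: the match array holds every duplicate before filtering, so on very large inputs this may use considerably more RAM than the shrinking-string loop.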

:)

Edited by PsaltyDS

Since nobody responded to my post I wrote a pure XML version.

XML UDF:

#include <_XMLDomWrapper.au3>

Dim $sFile = "test.xml"

$XMLOBJECT = _XMLFileOpen ($sFile)

;Transform if stylesheet exists
If FileExists ( "config.xsl" ) Then
    _XMLTransform ( $XMLOBJECT, "config.xsl","out.xml" )
EndIf

config.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
<xsl:template match="@*|node()">
  <xsl:if test="not(node()) or not(preceding-sibling::node()[.=string(current())])">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:if>
</xsl:template>
</xsl:stylesheet>

This will read in test.xml and generate out.xml


Since nobody responded to my post I wrote a pure XML version.

File: Original text file with LoadClasses cut-and-pasted repeatedly

System: Intel® Core2 CPU T7600 @ 2.33GHz, 1.99 GB of RAM

Text file: 1436160 lines ( 64,115 KB )

All times in milliseconds

9917.816772 <--- Run 1

10016.26623 <--- Run 2

9995.64463 <--- Run 3

9941.373793 <--- Run 4

9967.775357 <--- Average

Even if weaponx's .xsl technique is a little slower, it still wins if it does what Simucal actually wanted and mine doesn't.

:)

Hi,

Try this for the Regexp? [btw I am presuming Simucal only wanted adjacent dupes trimmed to one copy, not dupes throughout the whole file; is that true?]

Gold/ Silver/ Bronze?

8.8 secs for me, 4.4 of those spent in FileRead

Local $sSrcFile2, $sSrcFile = @ScriptDir & "\Test.txt"
ConsoleWrite("FileGetSize($sSrcFile)="&FileGetSize($sSrcFile)&@LF)
$sResultFile = @ScriptDir & "\Results.txt"
FileDelete($sResultFile)
$timerstamp1 = TimerInit()
$sFileData = FileRead($sSrcFile) ; read input file as a single string
ConsoleWrite("Read Time= " & Round(TimerDiff($timerstamp1)) & "" & @TAB & " msec" & @LF)
$sFileData1 = StringRegExpReplace($sFileData, "(?m)^(.*\r\n)\1+", "\1")
FileWrite($sResultFile, $sFileData1)
ConsoleWrite("Total Time= " & Round(TimerDiff($timerstamp1)) & "" & @TAB & " msec" & @LF)
Best, Randall

Edited by randallc
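
For readers unfamiliar with the pattern, here is a minimal sketch of what "(?m)^(.*\r\n)\1+" does, run on a tiny hypothetical sample instead of the real file. The group captures one whole line including its CRLF, and \1+ matches one or more immediately following copies of that same line; the replacement keeps a single copy.

```autoit
; Hypothetical sample: runs of identical adjacent lines, CRLF-terminated
Local $sSample = "a" & @CRLF & "a" & @CRLF & _
        "b" & @CRLF & "b" & @CRLF & "b" & @CRLF & _
        "a" & @CRLF

; Collapse each run of identical adjacent lines down to one line
Local $sCollapsed = StringRegExpReplace($sSample, "(?m)^(.*\r\n)\1+", "\1")

; Prints a, b, a on separate lines: the final "a" is kept because it is
; not adjacent to the first run of "a" lines
ConsoleWrite($sCollapsed)
```

Two things to keep in mind: only adjacent duplicates are removed, and a final line with no trailing CRLF will not be merged with the line before it. The pattern also assumes CRLF line endings, so files with bare LF endings would need \n in place of \r\n.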

Very cool. That's more like what I was trying to figure out how to do.

Unfortunately Simucal seems to have migrated south for the winter; we may not know if it's what he wanted until he migrates north again in the spring!

:)


Sorry it took me a day to get back to this post. I work at UPS and as you can imagine.. we get so swamped at this time of year that I barely have time to collapse on my bed when I get off work.

Anyway, to PsaltyDS and randallc... WOW! OMG! SSOOOOOOOO FAST! I had a feeling the regexp's might be a decent solution but wasn't sure how I would go about using it in this manner. 2.8 seconds for a 60mb file for me!

The output that Randallc guessed I wanted is correct. Thank you VERY much.. this makes those 200mb files fall to their knees!

Also to weaponx: your method definitely gets the award for the most interesting and unique approach. It made me read up on the XML UDFs myself, and there seem to be quite a few gems that I will try to remember. Thank you for your working example and hard work!
