Jump to content

Optimizing array difference search speed (or other solutions welcome!)


Recommended Posts

Hi gang,

I'm in a bit of a pickle here, the gist of it is that i have one "main" file with ~200k lines, and literally hundreds of other files each ranging from 1k - 100k lines.

I need to go through each of the "other" files and (separately) compare them to the "main" file, and save the differences between the two (no duplicates)

 

The issue is that each comparison (especially if the "other" file has 50+ k lines) takes over a minute each... 

Is there anyway to cut this time down? As far as i know im using the most optimized array difference script


Here's a rough mockup of the script im currently using

note: the main file has all unique lines, and the "other" files wont ever have any lines that *DO NOT* appear in the main file, if that helps

#include <array.au3>
#include <file.au3>



global $info_file
global $compare_file
global $Differece

$info_file = FileReadToArray("info-file(200k lines).txt")




$compare_file = FileReadToArray("compare-file(100k lines).txt")




$Differece = _Diff($info_file, $compare_file, 0) ; get the difference between 2 arrays, NO duplicates
_ArrayDisplay($Differece)







;=================================================
; Function Name:   _Diff($Set1, $Set2 [, $GetAll=0 [, $Delim=Default]])
; Description::    Find values in $Set1 that do not occur in $Set2
; Parameter(s):    $Set1    set 1 (1D-array or delimited string)
;                  $Set2    set 2 (1D-array or delimited string)
;      optional:   $GetAll  0 - only one occurence of every difference are shown (Default)
;                           1 - all differences are shown, allowing duplicates
;      optional:   $Delim   Delimiter for strings (Default use the separator character set by Opt("GUIDataSeparatorChar") )
; Return Value(s): Succes   1D-array of values from $Set1 that do not occur in $Set2
;                  Failure  -1  @error  set, that was given as array, isn't 1D-array
; Note:            Comparison is case-sensitive! - i.e. Number 9 is different to string '9'!
; Author(s):       BugFix (bugfix@autoit.de) Modified by ParoXsitiC for Faster _Diff (Formally _GetIntersection)
;=================================================
Func _Diff(ByRef $Set1, ByRef $Set2, $GetAll = 0, $Delim = Default)
    Local $o1 = ObjCreate("System.Collections.ArrayList")
    Local $o2 = ObjCreate("System.Collections.ArrayList")
    Local $oDiff1 = ObjCreate("System.Collections.ArrayList")
    Local $tmp, $i
    If $GetAll <> 1 Then $GetAll = 0
    If $Delim = Default Then $Delim = Opt("GUIDataSeparatorChar")
    If Not IsArray($Set1) Then
        If Not StringInStr($Set1, $Delim) Then
            $o1.Add($Set1)
        Else
            $tmp = StringSplit($Set1, $Delim, 1)
            For $i = 1 To UBound($tmp) - 1
                $o1.Add($tmp[$i])
            Next
        EndIf
    Else
        If UBound($Set1, 0) > 1 Then Return SetError(1, 0, -1)
        For $i = 0 To UBound($Set1) - 1
            $o1.Add($Set1[$i])
        Next
    EndIf

    If Not IsArray($Set2) Then
        If Not StringInStr($Set2, $Delim) Then
            $o2.Add($Set2)
        Else
            $tmp = StringSplit($Set2, $Delim, 1)
            For $i = 1 To UBound($tmp) - 1
                $o2.Add($tmp[$i])
            Next
        EndIf
    Else
        If UBound($Set2, 0) > 1 Then Return SetError(1, 0, -1)
        For $i = 0 To UBound($Set2) - 1
            $o2.Add($Set2[$i])
        Next
    EndIf

    For $tmp In $o1
        If Not $o2.Contains($tmp) And ($GetAll Or Not $oDiff1.Contains($tmp)) Then $oDiff1.Add($tmp)
    Next

    If $oDiff1.Count <= 0 Then Return 0

    Local $aOut[$oDiff1.Count]
    $i = 0
    For $tmp In $oDiff1
        $aOut[$i] = $tmp
        $i += 1
    Next
    Return $aOut
EndFunc   ;==>_Diff

 

Edited by phatzilla
Link to comment
Share on other sites

Merge the 2 arrays into one, then run that merged array through _ArrayUnique to get the array with no repeats.

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Link to comment
Share on other sites

_ArrayUnique uses a scripting dictionary

 

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Link to comment
Share on other sites

BrewManNH,

Will _arrayunique remove ALL duplicate entries?   If i recall, it simply keeps one of the entries.

If you run:

1
3
3
2

through _arrayunique, you'll get

1
3
2

I need it to be
1
2
 

 

 

What i need is a new array that's the difference between "main file" and "other file".

So if main file array is

2
1
4
5
3

and other file array is

2
3
1

new array should be

4
5

 

In other words, i dont want to preserve any instances of the duplicate lines.  the _Diff function achieves this, but i need to get it a bit faster, if possible...

Or, failing that, a new solution.

 

Edited by phatzilla
Link to comment
Share on other sites

What's in the files that you're comparing between the 2? What do some of the entries look like? I personally would sort the array to be searched and then use _ArrayBinarySearch to quickly search the array for any entries that are duplicates. I'd need a sample of a couple of files you're comparing, even generic ones that the format is similar to what you're working with, I can probably create something that will do this very fast. Using objects on arrays probably isn't very efficient.

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Link to comment
Share on other sites

How do lines differ, typically? Are they 99% the same or much more random? Are the differences located near the head, near the tail or can they differ anywhere in between?

This can dramatically change the speed of comparisons. Have you tried comparing the hash of the strings (MD5 will do here): comparing MD5s is pretty fast.

Depending on actual contents, SQLite might prove useful for your use case.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Actually, i just adjusted my script based on mikells suggestion....



 

global $a = FileReadToArray("info-file(200k lines).txt")
global $b = FileReadToArray("compare-file(100k lines).txt")

Local $sda = ObjCreate("Scripting.Dictionary")
Local $sdb = ObjCreate("Scripting.Dictionary")
Local $sdc = ObjCreate("Scripting.Dictionary")

For $i In $a
    $sda.Item($i)
Next
For $i In $b
    $sdb.Item($i)
Next

For $i In $a
    If $sdb.Exists($i) Then $sdc.Item($i)
Next
$asd3 = $sdc.Keys()

For $i In $asd3
    If $sda.Exists($i) Then $sda.Remove($i)
    If $sdb.Exists($i) Then $sdb.Remove($i)
Next
$asd1 = $sda.Keys()
$asd2 = $sdb.Keys()

_ArrayDisplay($asd1, "$asd1")
_ArrayDisplay($asd2, "$asd2")
_ArrayDisplay($asd3, "$asd3")


Now this is much faster... $ASD1 is the only array that i need as it contains the lines that are *ONLY* found in "info" file (because the "other" compare file will never contain any lines that aren't found in info file).  Is there any way to speed this script up?  I dont really need the other 2 new arrays asd2,asd3.  asd1 contains all the information i need.

Edited by phatzilla
Link to comment
Share on other sites

I have a  small doubt. You people see the above code ?. Is this safe to use the same variable name twice or thrice in the same script ?. I mean "$i". 

Edit - Never mind i got the answer. 

Edited by kcvinu
Spoiler

My Contributions

Glance GUI Library - A gui library based on Windows api functions. Written in Nim programming language.

UDF Link Viewer   --- A tool to visit the links of some most important UDFs 

 Includer_2  ----- A tool to type the #include statement automatically 

 Digits To Date  ----- date from 3 integer values

PrintList ----- prints arrays into console for testing.

 Alert  ------ An alternative for MsgBox 

 MousePosition ------- A simple tooltip display of mouse position

GRM Helper -------- A littile tool to help writing code with GUIRegisterMsg function

Access_UDF  -------- An UDF for working with access database files. (.*accdb only)

 

Link to comment
Share on other sites

  • Moderators

kcvinu,

To answer your question for any later readers: Yes, it is quite safe to use the same variable name in several For...Next loops as long as each loop is entirely separate and has terminated before the next one starts. If there is any overlap or nesting then the loop count variables must be differentiated.

M23

Public_Domain.png.2d871819fcb9957cf44f4514551a2935.png Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Link to comment
Share on other sites

In this case you can spare significant steps:

Local $a = FileReadToArray("info-file(200k lines).txt")
Local $sda = ObjCreate("Scripting.Dictionary")

For $i In $a
    $sda.Item($i)
Next

; for each "compare-file" do
; see _FileListToArray and friends

    ; also $a content is no more needed,
    ; we reuse it for all target files
    $a = FileReadToArray("compare-file(100k lines).txt")

    For $i In $a
        If $sda.Exists($i) Then $sda.Remove($i)
    Next
; next file

Local $asd1 = $sda.Keys()
$sda = 0
_ArrayDisplay($asd1, "$asd1")

 

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

@Melba23 Thank you for the clarification. 

Spoiler

My Contributions

Glance GUI Library - A gui library based on Windows api functions. Written in Nim programming language.

UDF Link Viewer   --- A tool to visit the links of some most important UDFs 

 Includer_2  ----- A tool to type the #include statement automatically 

 Digits To Date  ----- date from 3 integer values

PrintList ----- prints arrays into console for testing.

 Alert  ------ An alternative for MsgBox 

 MousePosition ------- A simple tooltip display of mouse position

GRM Helper -------- A littile tool to help writing code with GUIRegisterMsg function

Access_UDF  -------- An UDF for working with access database files. (.*accdb only)

 

Link to comment
Share on other sites

phatzilla,

Just for grins can you run your largest files through the following.  I'm interested in comparing the results to the dictionary method...

#include <FileConstants.au3>
#include <sqlite.au3>
#include <array.au3>

Local $sSQLiteMod = 'C:\Program Files (x86)\AutoIt3\install\Extras\SQLite\sqlite3.exe'

; do NOT use AutoIt macros for the next two file names

Local $sMasterFile = "fully.qualified.filename" ; your file with ALL entries
Local $sDetailFile = "fully.qualified.filename" ; your file with SOME entries

$st = TimerInit()

Local $sIn = 'create table if not exists mstr (c1);' & _
        'create table if not exists det (c1);' & _
        @CRLF & '.import ' & $sMasterFile & ' MSTR' & _
        @CRLF & '.import ' & $sDetailFile & ' DET' & _
        @CRLF & 'SELECT DISTINCT c1 FROM mstr WHERE c1 Not IN (SELECT DISTINCT c1 FROM det);'

Local $Out = _TempFile()
Local $in = _TempFile()

FileWrite($in, $sIn)

Local $sCmd = @ComSpec & ' /c "' & $sSQLiteMod & '" < ' & $in & ' > ' & $Out
Local $pid = RunWait($sCmd, @WorkingDir, @SW_HIDE)
ProcessWaitClose($pid)
ConsoleWrite('Time to find entries unique to mstr = ' & Round(TimerDiff($st) / 1000, 3) & ' seconds' & @CRLF)
ConsoleWrite(StdoutRead($pid) & @CRLF)
ConsoleWrite('! out ' & @CRLF & FileRead($Out) & @CRLF)

FileDelete($Out)
FileDelete($in)

kylomas

Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Link to comment
Share on other sites

For this kind of handling, which requires to be fast enough, I use an external program from the CoreUtils for Windows (http://gnuwin32.sourceforge.net/packages/coreutils.htm) : uniq.exe. This tool can quickly extract/remove duplicate lines in a text file. In your case, you just have to put the 2 file contents in 1 file and remove the duplicates (it requires the file to be sorted first)

So, using the same idea than BrewManNH with _ArrayUniq, you can do something like this :

#Include <Array.au3>

Local $sMainFile = @ScriptDir & "\mainFile.txt"
Local $sPartialFile = @ScriptDir & "\secondFile.txt"

Local $sTempFile = @ScriptDir & "\tempFile.txt", $sResultFile = @ScriptDir & "\resultFile.txt"

; Concatenate the 2 files in 1
Local $sfile = FileWrite($sTempFile, FileRead($sMainFile) & FileRead($sPartialFile) )

; Sort the file and then remove all duplicate lines
RunWait(@ComSpec & ' /c sort "' & $sTempFile & '" | uniq -u > "' & $sResultFile & '"', "C:\Program Files (x86)\GnuWin32\bin", @SW_HIDE)
FileDelete($sTempFile)

Local $aResult = FileReadToArray($sResultFile)
_ArrayDisplay($aResult)

I know the problem is solved, but maybe someone else could be interested by this method...

Edited by jguinch
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...