phatzilla

Optimizing array difference search speed (or other solutions welcome!)

19 posts in this topic

#1 ·  Posted (edited)

Hi gang,

I'm in a bit of a pickle here, the gist of it is that i have one "main" file with ~200k lines, and literally hundreds of other files each ranging from 1k - 100k lines.

I need to go through each of the "other" files and (separately) compare them to the "main" file, and save the differences between the two (no duplicates)

 

The issue is that each comparison (especially if the "other" file has 50+ k lines) takes over a minute each... 

Is there anyway to cut this time down? As far as i know im using the most optimized array difference script


Here's a rough mockup of the script im currently using

note: the main file has all unique lines, and the "other" files wont ever have any lines that *DO NOT* appear in the main file, if that helps

#include <array.au3>
#include <file.au3>



global $info_file
global $compare_file
global $Differece

$info_file = FileReadToArray("info-file(200k lines).txt")




$compare_file = FileReadToArray("compare-file(100k lines).txt")




$Differece = _Diff($info_file, $compare_file, 0) ; get the difference between 2 arrays, NO duplicates
_ArrayDisplay($Differece)







;=================================================
; Function Name:   _Diff($Set1, $Set2 [, $GetAll=0 [, $Delim=Default]])
; Description::    Find values in $Set1 that do not occur in $Set2
; Parameter(s):    $Set1    set 1 (1D-array or delimited string)
;                  $Set2    set 2 (1D-array or delimited string)
;      optional:   $GetAll  0 - only one occurence of every difference are shown (Default)
;                           1 - all differences are shown, allowing duplicates
;      optional:   $Delim   Delimiter for strings (Default use the separator character set by Opt("GUIDataSeparatorChar") )
; Return Value(s): Succes   1D-array of values from $Set1 that do not occur in $Set2
;                  Failure  -1  @error  set, that was given as array, isn't 1D-array
; Note:            Comparison is case-sensitive! - i.e. Number 9 is different to string '9'!
; Author(s):       BugFix (bugfix@autoit.de) Modified by ParoXsitiC for Faster _Diff (Formally _GetIntersection)
;=================================================
Func _Diff(ByRef $Set1, ByRef $Set2, $GetAll = 0, $Delim = Default)
    Local $o1 = ObjCreate("System.Collections.ArrayList")
    Local $o2 = ObjCreate("System.Collections.ArrayList")
    Local $oDiff1 = ObjCreate("System.Collections.ArrayList")
    Local $tmp, $i
    If $GetAll <> 1 Then $GetAll = 0
    If $Delim = Default Then $Delim = Opt("GUIDataSeparatorChar")
    If Not IsArray($Set1) Then
        If Not StringInStr($Set1, $Delim) Then
            $o1.Add($Set1)
        Else
            $tmp = StringSplit($Set1, $Delim, 1)
            For $i = 1 To UBound($tmp) - 1
                $o1.Add($tmp[$i])
            Next
        EndIf
    Else
        If UBound($Set1, 0) > 1 Then Return SetError(1, 0, -1)
        For $i = 0 To UBound($Set1) - 1
            $o1.Add($Set1[$i])
        Next
    EndIf

    If Not IsArray($Set2) Then
        If Not StringInStr($Set2, $Delim) Then
            $o2.Add($Set2)
        Else
            $tmp = StringSplit($Set2, $Delim, 1)
            For $i = 1 To UBound($tmp) - 1
                $o2.Add($tmp[$i])
            Next
        EndIf
    Else
        If UBound($Set2, 0) > 1 Then Return SetError(1, 0, -1)
        For $i = 0 To UBound($Set2) - 1
            $o2.Add($Set2[$i])
        Next
    EndIf

    For $tmp In $o1
        If Not $o2.Contains($tmp) And ($GetAll Or Not $oDiff1.Contains($tmp)) Then $oDiff1.Add($tmp)
    Next

    If $oDiff1.Count <= 0 Then Return 0

    Local $aOut[$oDiff1.Count]
    $i = 0
    For $tmp In $oDiff1
        $aOut[$i] = $tmp
        $i += 1
    Next
    Return $aOut
EndFunc   ;==>_Diff

 

Edited by phatzilla

Share this post


Link to post
Share on other sites

Merge the 2 arrays into one, then run that merged array through _ArrayUnique to get the array with no repeats.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites

You could directly try Scripting.Dictionary , see here
 

1 person likes this

Share this post


Link to post
Share on other sites

_ArrayUnique uses a scripting dictionary

 


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites

#5 ·  Posted (edited)

BrewManNH,

Will _arrayunique remove ALL duplicate entries?   If i recall, it simply keeps one of the entries.

If you run:

1
3
3
2

through _arrayunique, you'll get

1
3
2

I need it to be
1
2
 

 

 

What i need is a new array that's the difference between "main file" and "other file".

So if main file array is

2
1
4
5
3

and other file array is

2
3
1

new array should be

4
5

 

In other words, i dont want to preserve any instances of the duplicate lines.  the _Diff function achieves this, but i need to get it a bit faster, if possible...

Or, failing that, a new solution.

 

Edited by phatzilla

Share this post


Link to post
Share on other sites

What's in the files that you're comparing between the 2? What do some of the entries look like? I personally would sort the array to be searched and then use _ArrayBinarySearch to quickly search the array for any entries that are duplicates. I'd need a sample of a couple of files you're comparing, even generic ones that the format is similar to what you're working with, I can probably create something that will do this very fast. Using objects on arrays probably isn't very efficient.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites

#7 ·  Posted (edited)

They are text files.

Each line is about 100-120 chars in length alphanumeric

Edited by phatzilla

Share this post


Link to post
Share on other sites

#8 ·  Posted (edited)

_ArrayUnique uses a scripting dictionary

I know, I said directly because the OP asked for "save the differences between the two"  and _ArrayUnique doesn't do that

Edited by mikell

Share this post


Link to post
Share on other sites

How do lines differ, typically? Are they 99% the same or much more random? Are the differences located near the head, near the tail or can they differ anywhere in between?

This can dramatically change the speed of comparisons. Have you tried comparing the hash of the strings (MD5 will do here): comparing MD5s is pretty fast.

Depending on actual contents, SQLite might prove useful for your use case.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

#10 ·  Posted (edited)

Actually, i just adjusted my script based on mikells suggestion....



 

global $a = FileReadToArray("info-file(200k lines).txt")
global $b = FileReadToArray("compare-file(100k lines).txt")

Local $sda = ObjCreate("Scripting.Dictionary")
Local $sdb = ObjCreate("Scripting.Dictionary")
Local $sdc = ObjCreate("Scripting.Dictionary")

For $i In $a
    $sda.Item($i)
Next
For $i In $b
    $sdb.Item($i)
Next

For $i In $a
    If $sdb.Exists($i) Then $sdc.Item($i)
Next
$asd3 = $sdc.Keys()

For $i In $asd3
    If $sda.Exists($i) Then $sda.Remove($i)
    If $sdb.Exists($i) Then $sdb.Remove($i)
Next
$asd1 = $sda.Keys()
$asd2 = $sdb.Keys()

_ArrayDisplay($asd1, "$asd1")
_ArrayDisplay($asd2, "$asd2")
_ArrayDisplay($asd3, "$asd3")


Now this is much faster... $ASD1 is the only array that i need as it contains the lines that are *ONLY* found in "info" file (because the "other" compare file will never contain any lines that aren't found in info file).  Is there any way to speed this script up?  I dont really need the other 2 new arrays asd2,asd3.  asd1 contains all the information i need.

Edited by phatzilla

Share this post


Link to post
Share on other sites

#11 ·  Posted (edited)

I have a  small doubt. You people see the above code ?. Is this safe to use the same variable name twice or thrice in the same script ?. I mean "$i". 

Edit - Never mind i got the answer. 

Edited by kcvinu

My Contributions

UDF Link Viewer   --- A tool to visit the links of some most important UDFs 

 Includer_2  ----- A tool to type the #include statement automatically 

 Digits To Date  ----- date from 3 integer values

PrintList ----- prints arrays into console for testing.

 Alert  ------ An alternative for MsgBox 

 MousePosition ------- A simple tooltip display of mouse position

GRM Helper -------- A littile tool to help writing code with GUIRegisterMsg function

Access_UDF  -------- An UDF for working with access database files. (.*accdb only)

 

Share this post


Link to post
Share on other sites

kcvinu,

To answer your question for any later readers: Yes, it is quite safe to use the same variable name in several For...Next loops as long as each loop is entirely separate and has terminated before the next one starts. If there is any overlap or nesting then the loop count variables must be differentiated.

M23

1 person likes this

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites

In this case you can spare significant steps:

Local $a = FileReadToArray("info-file(200k lines).txt")
Local $sda = ObjCreate("Scripting.Dictionary")

For $i In $a
    $sda.Item($i)
Next

; for each "compare-file" do
; see _FileListToArray and friends

    ; also $a content is no more needed,
    ; we reuse it for all target files
    $a = FileReadToArray("compare-file(100k lines).txt")

    For $i In $a
        If $sda.Exists($i) Then $sda.Remove($i)
    Next
; next file

Local $asd1 = $sda.Keys()
$sda = 0
_ArrayDisplay($asd1, "$asd1")

 


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

@Melba23 Thank you for the clarification. 


My Contributions

UDF Link Viewer   --- A tool to visit the links of some most important UDFs 

 Includer_2  ----- A tool to type the #include statement automatically 

 Digits To Date  ----- date from 3 integer values

PrintList ----- prints arrays into console for testing.

 Alert  ------ An alternative for MsgBox 

 MousePosition ------- A simple tooltip display of mouse position

GRM Helper -------- A littile tool to help writing code with GUIRegisterMsg function

Access_UDF  -------- An UDF for working with access database files. (.*accdb only)

 

Share this post


Link to post
Share on other sites

kcvinu,
Note also that the $i is declared as Local in the For loop (Thus Saith The Help File) but keeps its last value as long as it is not re-used

For $i = 1 to 10
   If $i = 5 Then Exitloop
Next
;...
Msgbox(0,"", $i)

:)

Share this post


Link to post
Share on other sites

Problem is solved!(for now).  Thank you for everyones help....  Got the time down to ~10 seconds max for the biggest files

Share this post


Link to post
Share on other sites

phatzilla,

Just for grins can you run your largest files through the following.  I'm interested in comparing the results to the dictionary method...

#include <FileConstants.au3>
#include <sqlite.au3>
#include <array.au3>

Local $sSQLiteMod = 'C:\Program Files (x86)\AutoIt3\install\Extras\SQLite\sqlite3.exe'

; do NOT use AutoIt macros for the next two file names

Local $sMasterFile = "fully.qualified.filename" ; your file with ALL entries
Local $sDetailFile = "fully.qualified.filename" ; your file with SOME entries

$st = TimerInit()

Local $sIn = 'create table if not exists mstr (c1);' & _
        'create table if not exists det (c1);' & _
        @CRLF & '.import ' & $sMasterFile & ' MSTR' & _
        @CRLF & '.import ' & $sDetailFile & ' DET' & _
        @CRLF & 'SELECT DISTINCT c1 FROM mstr WHERE c1 Not IN (SELECT DISTINCT c1 FROM det);'

Local $Out = _TempFile()
Local $in = _TempFile()

FileWrite($in, $sIn)

Local $sCmd = @ComSpec & ' /c "' & $sSQLiteMod & '" < ' & $in & ' > ' & $Out
Local $pid = RunWait($sCmd, @WorkingDir, @SW_HIDE)
ProcessWaitClose($pid)
ConsoleWrite('Time to find entries unique to mstr = ' & Round(TimerDiff($st) / 1000, 3) & ' seconds' & @CRLF)
ConsoleWrite(StdoutRead($pid) & @CRLF)
ConsoleWrite('! out ' & @CRLF & FileRead($Out) & @CRLF)

FileDelete($Out)
FileDelete($in)

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Share this post


Link to post
Share on other sites

Hmm, i dont seem to have sqlite3.exe

Share this post


Link to post
Share on other sites

#19 ·  Posted (edited)

For this kind of handling, which requires to be fast enough, I use an external program from the CoreUtils for Windows (http://gnuwin32.sourceforge.net/packages/coreutils.htm) : uniq.exe. This tool can quickly extract/remove duplicate lines in a text file. In your case, you just have to put the 2 file contents in 1 file and remove the duplicates (it requires the file to be sorted first)

So, using the same idea than BrewManNH with _ArrayUniq, you can do something like this :

#Include <Array.au3>

Local $sMainFile = @ScriptDir & "\mainFile.txt"
Local $sPartialFile = @ScriptDir & "\secondFile.txt"

Local $sTempFile = @ScriptDir & "\tempFile.txt", $sResultFile = @ScriptDir & "\resultFile.txt"

; Concatenate the 2 files in 1
Local $sfile = FileWrite($sTempFile, FileRead($sMainFile) & FileRead($sPartialFile) )

; Sort the file and then remove all duplicate lines
RunWait(@ComSpec & ' /c sort "' & $sTempFile & '" | uniq -u > "' & $sResultFile & '"', "C:\Program Files (x86)\GnuWin32\bin", @SW_HIDE)
FileDelete($sTempFile)

Local $aResult = FileReadToArray($sResultFile)
_ArrayDisplay($aResult)

I know the problem is solved, but maybe someone else could be interested by this method...

Edited by jguinch
1 person likes this

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now

  • Similar Content

    • ronmage
      By ronmage
      So I have a loop that keeps reading data from an array and searching it for the same value. If the value is no there it does work then adds the value to the array to prevent it from doing the same work.
      If _ArraySearch($ID,$filearray[$i]) = -1 Then Work.... _ArrayAdd($ID,$filearray[$i]) EndIf This is in a for loop hence $i
      So what is happening is the code works great for several hours. After a period of time _ArraySearch($ID,$filearray[$i]) will result in -1 even if $ID = $filearray. So it ready as if there is no data in the array. Anyone have this problem? 
       
      Also I am just running in using F5 not compiling it and running it if that makes a difference.
       
    • Ascer
      By Ascer
      1. Description.
      Udf working with MSDN System.Collections.ArrayList. Allow you to make fast operations on huge arrays, speed is even x10 better than basic _ArrayAdd.  Not prefered for small arrays < 600 items. 2. Requirements
      .NET Framework 1.1+ System Windows 3. Possibilities.
      ;=============================================================================================================== ; UDF Name: List.au3 ; ; Date: 2018-02-17, 10:52 ; Description: Simple udf to create System Collections as ArrayList and make multiple actions on them. ; ; Function(s): _ListCreate -> Creates a new list ; _ListCapacity -> Gets a list size in bytes ; _ListCount -> Gets items count in list ; _ListIsFixedSize -> Get bool if list if fixed size ; _ListIsReadOnly -> Get bool if list is read only ; _ListIsSynchronized -> Get bool if list is synchronized ; _ListGetItem -> Get item on index ; _ListSetItem -> Set item on index ; ; _ListAdd -> Add item at end of list ; _ListClear -> Remove all list items ; _ListClone -> Duplicate list in new var ; _ListContains -> Get bool if item is in list ; _ListGetHashCode -> Get hash code for list ; _ListGetRange -> Get list with items between indexs ; _ListIndexOf -> Get index of item ; _ListInsert -> Insert a new item on index ; _ListInsertRange -> Insert list into list on index ; _ListLastIndexOf -> Get index last of item ; _ListRemove -> Remove first found item ; _ListRemoveAt -> Remove item in index ; _ListRemoveRange -> Remove items between indexs ; _ListReverse -> Reverse all items in list ; _ListSetRange -> Set new value for items in range ; _ListSort -> Sort items in list (speed of reading) ; _ListToString -> Get list object name ; _ListTrimToSize -> Remove unused space in list ; ; Author(s): Ascer ;=============================================================================================================== 4. Downloads
      List.au3 5. Examples
      SpeedTest _ArrayAdd vs ListAdd SpeedTest ArraySearch vs ListIndexOf Basic usage - crating guild with members  
    • FMS
      By FMS
      Hello,
      I'm trying to read a binary file to an array but couln't get it to work.
      Also I coul not find any help in the forum around this subject whish was helpfull.
      Is there any way it could be done?
      I tried a lot of ways but maybe somebody know's the right way?
      #AutoIt3Wrapper_Au3Check_Parameters=-d -w 1 -w 2 -w 3 -w 4 -w 5 -w 6 -w 7 #include <File.au3> #include <Array.au3> #include <AutoItConstants.au3> Local $in=FileOpen("TEST_labels.idx1-ubyte",16) ; 16+0=Read binary Local $data = FileRead($in) Local $FileArray = BinaryToString($data,4) ;~ $FileArray = StringSplit($BinarydData, @CRLF, 1+2) ;~ Local $FileArray = StringRegExp($BinarydData, "[^\r\n]+", 3) FileClose($in) _ArrayDisplay($FileArray,"$FileArray","",32) MsgBox($MB_SYSTEMMODAL, "", "$FileArray = " & $FileArray )  
      TEST_labels.idx1-ubyte
    • dadalt95
      By dadalt95
      Perform a simple google search!
      The script below works fine until fill the google form!
      What I can't find is how to submit the form, tried a couple of ways and none of them worked.

       
      #include <IE.au3> $oIE = _IECreate ("www.google.com") $o_form = _IEFormGetObjByName ($oIE, "f") $o_login = _IEFormElementGetObjByName ($o_form, "q") $username = "80251369" _IEFormElementSetValue ($o_login, $username) $o_numer = _IEGetObjByName($o_form, "btnK") _IEAction ($o_numer, "click")  
      The code runs without any problem.
      I don't know how to proceed!
      Thanks in advance!
    • nacerbaaziz
      By nacerbaaziz
      Hi dears
      how are you? I hope You fine
      I have a question please
      I've created a listView
      It has several columns
      Is there any way  to search for text in an element of this list with text in all columns
      for example
      list view with 2 column
      the first is the file name and the second is the file path
      and i want to search for the item witch Containt the name and the path toGether
      I searched a lot but could not find what I was looking for
      If you do not understand the idea that I'm looking for, I can put an example
      Thanks in advance