Jump to content
phatzilla

Optimizing array difference search speed (or other solutions welcome!)

Recommended Posts

phatzilla

Hi gang,

I'm in a bit of a pickle here, the gist of it is that i have one "main" file with ~200k lines, and literally hundreds of other files each ranging from 1k - 100k lines.

I need to go through each of the "other" files and (separately) compare them to the "main" file, and save the differences between the two (no duplicates)

 

The issue is that each comparison (especially if the "other" file has 50+ k lines) takes over a minute each... 

Is there anyway to cut this time down? As far as i know im using the most optimized array difference script


Here's a rough mockup of the script im currently using

note: the main file has all unique lines, and the "other" files wont ever have any lines that *DO NOT* appear in the main file, if that helps

#include <array.au3>
#include <file.au3>



global $info_file
global $compare_file
global $Differece

$info_file = FileReadToArray("info-file(200k lines).txt")




$compare_file = FileReadToArray("compare-file(100k lines).txt")




$Differece = _Diff($info_file, $compare_file, 0) ; get the difference between 2 arrays, NO duplicates
_ArrayDisplay($Differece)







;=================================================
; Function Name:   _Diff($Set1, $Set2 [, $GetAll=0 [, $Delim=Default]])
; Description::    Find values in $Set1 that do not occur in $Set2
; Parameter(s):    $Set1    set 1 (1D-array or delimited string)
;                  $Set2    set 2 (1D-array or delimited string)
;      optional:   $GetAll  0 - only one occurence of every difference are shown (Default)
;                           1 - all differences are shown, allowing duplicates
;      optional:   $Delim   Delimiter for strings (Default use the separator character set by Opt("GUIDataSeparatorChar") )
; Return Value(s): Succes   1D-array of values from $Set1 that do not occur in $Set2
;                  Failure  -1  @error  set, that was given as array, isn't 1D-array
; Note:            Comparison is case-sensitive! - i.e. Number 9 is different to string '9'!
; Author(s):       BugFix (bugfix@autoit.de) Modified by ParoXsitiC for Faster _Diff (Formally _GetIntersection)
;=================================================
Func _Diff(ByRef $Set1, ByRef $Set2, $GetAll = 0, $Delim = Default)
    Local $o1 = ObjCreate("System.Collections.ArrayList")
    Local $o2 = ObjCreate("System.Collections.ArrayList")
    Local $oDiff1 = ObjCreate("System.Collections.ArrayList")
    Local $tmp, $i
    If $GetAll <> 1 Then $GetAll = 0
    If $Delim = Default Then $Delim = Opt("GUIDataSeparatorChar")
    If Not IsArray($Set1) Then
        If Not StringInStr($Set1, $Delim) Then
            $o1.Add($Set1)
        Else
            $tmp = StringSplit($Set1, $Delim, 1)
            For $i = 1 To UBound($tmp) - 1
                $o1.Add($tmp[$i])
            Next
        EndIf
    Else
        If UBound($Set1, 0) > 1 Then Return SetError(1, 0, -1)
        For $i = 0 To UBound($Set1) - 1
            $o1.Add($Set1[$i])
        Next
    EndIf

    If Not IsArray($Set2) Then
        If Not StringInStr($Set2, $Delim) Then
            $o2.Add($Set2)
        Else
            $tmp = StringSplit($Set2, $Delim, 1)
            For $i = 1 To UBound($tmp) - 1
                $o2.Add($tmp[$i])
            Next
        EndIf
    Else
        If UBound($Set2, 0) > 1 Then Return SetError(1, 0, -1)
        For $i = 0 To UBound($Set2) - 1
            $o2.Add($Set2[$i])
        Next
    EndIf

    For $tmp In $o1
        If Not $o2.Contains($tmp) And ($GetAll Or Not $oDiff1.Contains($tmp)) Then $oDiff1.Add($tmp)
    Next

    If $oDiff1.Count <= 0 Then Return 0

    Local $aOut[$oDiff1.Count]
    $i = 0
    For $tmp In $oDiff1
        $aOut[$i] = $tmp
        $i += 1
    Next
    Return $aOut
EndFunc   ;==>_Diff

 

Edited by phatzilla

Share this post


Link to post
Share on other sites
BrewManNH

Merge the 2 arrays into one, then run that merged array through _ArrayUnique to get the array with no repeats.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites
mikell

You could directly try Scripting.Dictionary , see here
 

  • Like 1

Share this post


Link to post
Share on other sites
BrewManNH

_ArrayUnique uses a scripting dictionary

 


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites
phatzilla

BrewManNH,

Will _arrayunique remove ALL duplicate entries?   If i recall, it simply keeps one of the entries.

If you run:

1
3
3
2

through _arrayunique, you'll get

1
3
2

I need it to be
1
2
 

 

 

What i need is a new array that's the difference between "main file" and "other file".

So if main file array is

2
1
4
5
3

and other file array is

2
3
1

new array should be

4
5

 

In other words, i dont want to preserve any instances of the duplicate lines.  the _Diff function achieves this, but i need to get it a bit faster, if possible...

Or, failing that, a new solution.

 

Edited by phatzilla

Share this post


Link to post
Share on other sites
BrewManNH

What's in the files that you're comparing between the 2? What do some of the entries look like? I personally would sort the array to be searched and then use _ArrayBinarySearch to quickly search the array for any entries that are duplicates. I'd need a sample of a couple of files you're comparing, even generic ones that the format is similar to what you're working with, I can probably create something that will do this very fast. Using objects on arrays probably isn't very efficient.


If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Share this post


Link to post
Share on other sites
phatzilla

They are text files.

Each line is about 100-120 chars in length alphanumeric

Edited by phatzilla

Share this post


Link to post
Share on other sites
mikell

_ArrayUnique uses a scripting dictionary

I know, I said directly because the OP asked for "save the differences between the two"  and _ArrayUnique doesn't do that

Edited by mikell

Share this post


Link to post
Share on other sites
jchd

How do lines differ, typically? Are they 99% the same or much more random? Are the differences located near the head, near the tail or can they differ anywhere in between?

This can dramatically change the speed of comparisons. Have you tried comparing the hash of the strings (MD5 will do here): comparing MD5s is pretty fast.

Depending on actual contents, SQLite might prove useful for your use case.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
phatzilla

Actually, i just adjusted my script based on mikells suggestion....



 

global $a = FileReadToArray("info-file(200k lines).txt")
global $b = FileReadToArray("compare-file(100k lines).txt")

Local $sda = ObjCreate("Scripting.Dictionary")
Local $sdb = ObjCreate("Scripting.Dictionary")
Local $sdc = ObjCreate("Scripting.Dictionary")

For $i In $a
    $sda.Item($i)
Next
For $i In $b
    $sdb.Item($i)
Next

For $i In $a
    If $sdb.Exists($i) Then $sdc.Item($i)
Next
$asd3 = $sdc.Keys()

For $i In $asd3
    If $sda.Exists($i) Then $sda.Remove($i)
    If $sdb.Exists($i) Then $sdb.Remove($i)
Next
$asd1 = $sda.Keys()
$asd2 = $sdb.Keys()

_ArrayDisplay($asd1, "$asd1")
_ArrayDisplay($asd2, "$asd2")
_ArrayDisplay($asd3, "$asd3")


Now this is much faster... $ASD1 is the only array that i need as it contains the lines that are *ONLY* found in "info" file (because the "other" compare file will never contain any lines that aren't found in info file).  Is there any way to speed this script up?  I dont really need the other 2 new arrays asd2,asd3.  asd1 contains all the information i need.

Edited by phatzilla

Share this post


Link to post
Share on other sites
kcvinu

I have a  small doubt. You people see the above code ?. Is this safe to use the same variable name twice or thrice in the same script ?. I mean "$i". 

Edit - Never mind i got the answer. 

Edited by kcvinu

My Contributions

UDF Link Viewer   --- A tool to visit the links of some most important UDFs 

 Includer_2  ----- A tool to type the #include statement automatically 

 Digits To Date  ----- date from 3 integer values

PrintList ----- prints arrays into console for testing.

 Alert  ------ An alternative for MsgBox 

 MousePosition ------- A simple tooltip display of mouse position

GRM Helper -------- A littile tool to help writing code with GUIRegisterMsg function

Access_UDF  -------- An UDF for working with access database files. (.*accdb only)

 

Share this post


Link to post
Share on other sites
Melba23

kcvinu,

To answer your question for any later readers: Yes, it is quite safe to use the same variable name in several For...Next loops as long as each loop is entirely separate and has terminated before the next one starts. If there is any overlap or nesting then the loop count variables must be differentiated.

M23

  • Like 1

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind._______My UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites
jchd

In this case you can spare significant steps:

Local $a = FileReadToArray("info-file(200k lines).txt")
Local $sda = ObjCreate("Scripting.Dictionary")

For $i In $a
    $sda.Item($i)
Next

; for each "compare-file" do
; see _FileListToArray and friends

    ; also $a content is no more needed,
    ; we reuse it for all target files
    $a = FileReadToArray("compare-file(100k lines).txt")

    For $i In $a
        If $sda.Exists($i) Then $sda.Remove($i)
    Next
; next file

Local $asd1 = $sda.Keys()
$sda = 0
_ArrayDisplay($asd1, "$asd1")

 


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
kcvinu

@Melba23 Thank you for the clarification. 


My Contributions

UDF Link Viewer   --- A tool to visit the links of some most important UDFs 

 Includer_2  ----- A tool to type the #include statement automatically 

 Digits To Date  ----- date from 3 integer values

PrintList ----- prints arrays into console for testing.

 Alert  ------ An alternative for MsgBox 

 MousePosition ------- A simple tooltip display of mouse position

GRM Helper -------- A littile tool to help writing code with GUIRegisterMsg function

Access_UDF  -------- An UDF for working with access database files. (.*accdb only)

 

Share this post


Link to post
Share on other sites
mikell

kcvinu,
Note also that the $i is declared as Local in the For loop (Thus Saith The Help File) but keeps its last value as long as it is not re-used

For $i = 1 to 10
   If $i = 5 Then Exitloop
Next
;...
Msgbox(0,"", $i)

:)

Share this post


Link to post
Share on other sites
phatzilla

Problem is solved!(for now).  Thank you for everyones help....  Got the time down to ~10 seconds max for the biggest files

Share this post


Link to post
Share on other sites
kylomas

phatzilla,

Just for grins can you run your largest files through the following.  I'm interested in comparing the results to the dictionary method...

#include <FileConstants.au3>
#include <sqlite.au3>
#include <array.au3>

Local $sSQLiteMod = 'C:\Program Files (x86)\AutoIt3\install\Extras\SQLite\sqlite3.exe'

; do NOT use AutoIt macros for the next two file names

Local $sMasterFile = "fully.qualified.filename" ; your file with ALL entries
Local $sDetailFile = "fully.qualified.filename" ; your file with SOME entries

$st = TimerInit()

Local $sIn = 'create table if not exists mstr (c1);' & _
        'create table if not exists det (c1);' & _
        @CRLF & '.import ' & $sMasterFile & ' MSTR' & _
        @CRLF & '.import ' & $sDetailFile & ' DET' & _
        @CRLF & 'SELECT DISTINCT c1 FROM mstr WHERE c1 Not IN (SELECT DISTINCT c1 FROM det);'

Local $Out = _TempFile()
Local $in = _TempFile()

FileWrite($in, $sIn)

Local $sCmd = @ComSpec & ' /c "' & $sSQLiteMod & '" < ' & $in & ' > ' & $Out
Local $pid = RunWait($sCmd, @WorkingDir, @SW_HIDE)
ProcessWaitClose($pid)
ConsoleWrite('Time to find entries unique to mstr = ' & Round(TimerDiff($st) / 1000, 3) & ' seconds' & @CRLF)
ConsoleWrite(StdoutRead($pid) & @CRLF)
ConsoleWrite('! out ' & @CRLF & FileRead($Out) & @CRLF)

FileDelete($Out)
FileDelete($in)

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

Share this post


Link to post
Share on other sites
phatzilla

Hmm, i dont seem to have sqlite3.exe

Share this post


Link to post
Share on other sites
jguinch

For this kind of handling, which requires to be fast enough, I use an external program from the CoreUtils for Windows (http://gnuwin32.sourceforge.net/packages/coreutils.htm) : uniq.exe. This tool can quickly extract/remove duplicate lines in a text file. In your case, you just have to put the 2 file contents in 1 file and remove the duplicates (it requires the file to be sorted first)

So, using the same idea than BrewManNH with _ArrayUniq, you can do something like this :

#Include <Array.au3>

Local $sMainFile = @ScriptDir & "\mainFile.txt"
Local $sPartialFile = @ScriptDir & "\secondFile.txt"

Local $sTempFile = @ScriptDir & "\tempFile.txt", $sResultFile = @ScriptDir & "\resultFile.txt"

; Concatenate the 2 files in 1
Local $sfile = FileWrite($sTempFile, FileRead($sMainFile) & FileRead($sPartialFile) )

; Sort the file and then remove all duplicate lines
RunWait(@ComSpec & ' /c sort "' & $sTempFile & '" | uniq -u > "' & $sResultFile & '"', "C:\Program Files (x86)\GnuWin32\bin", @SW_HIDE)
FileDelete($sTempFile)

Local $aResult = FileReadToArray($sResultFile)
_ArrayDisplay($aResult)

I know the problem is solved, but maybe someone else could be interested by this method...

Edited by jguinch
  • Like 1

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Similar Content

    • Skeletor
      By Skeletor
      Hi Virtual People,
      My array works perfectly fine. However, what is the best practice if the line in the array doesn't have the correct amount of columns and if I can add a placeholder?

       
      For $count = 1 To _FileCountLines($FileRead1) Step 1 $string = FileReadLine($FileRead1, $count) $input = StringSplit($string, ",", 1) $value1 = $input[1] $value2 = $input[2] $value3 = $input[3] _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value2, "A1") _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value1, "B1") _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value3, "C1") Next  
    • MrCheese
      By MrCheese
      hi all,
      reviewing the forum, this thread is applicable: 
       
       
      I wanted to know if there is now a better way to do this?
      In essence, I load a tab delimited txt file into an array (works well). I used tab, as some fields in the original csv contains commas.
      However, I needed autoit to manipulate this array, and output it as a csv.
      IF my array contains items with a comma, without double quotes around the field, then how best do I get a csv out of this?
      My current workaround is to filewritefromarray tab delimited, then open it in excel and save as a csv. I will need to check this to see how the address fields behave that contain a comma.
       
      Any thoughts would be appreciated.
       
    • faustf
      By faustf
      hi guys how is possible set a speed of  all function _Ie ?? 
    • Skeletor
      By Skeletor
      Hi All,

      I would like to know how you would take a FileLineRead and insert it into an array which then inserts it into Excel?
      One thing to know is the files content is broken up, so I only use half of the content within $FileRead1.
      So its imperative that the $value1, $value2, etc variables be used. 
      Code below:
      $FileRead1 = FileReadLine("C:\temp\sample.txt",1) For $count = 1 To _FileCountLines($FileRead1) Step 1 $string = FileReadLine($FileRead1, $count) $input = StringSplit($string, ",", 1) $value1 = $input[1] $value2 = $input[2] $value3 = $input[3] $value4 = $input[4] _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value1, "A1") _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value2, "B1") _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value3, "C1") _Excel_RangeWrite($oWorkbook, $oWorkbook.Activesheet, $value4, "D1") Next  
    • AnonymousX
      By AnonymousX
      Hello,
      I'm trying to write a script that moves copies excel cells into an array. I'll than manipulate the values and send array into another program. 
      I don't want range to be specific to a workbook, or sheet, or set of cells.
      I want user to be able to highlight desired cells and to copy either normally ("Ctrl+C") or by a hotkey ("Alt+C"). 
      Could someone help me with this?
      Thank you,
      I've tried to write the framework: (edited)
      #include <MsgBoxConstants.au3> #include <Array.au3> #include <Excel.au3> HotKeySet("!v", "Pastedata") While True Sleep(1000) WEnd func Makearray() local $bArray ;User has cells already copied ;Convert clipboard into an array ;I don;t know how excel stores data to clipboard so don;t know how to bring it into array _Arraydisplay($bArray) MsgBox(0,0,$bArray) return $bArray endfunc func Pastedata() Local $aArray MsgBox(0,0,"wait",1) ;make array based on assumption user has already copied a range to clipboard $aArray = Makearray() ;paste code ;don;t worry about this I got the rest endfunc  
×