leuce

Two questions about merging arrays

4 posts in this topic

#1 ·  Posted (edited)

G'day everyone

I just need some advice on how to process two sets of data.

Question 1

I have two lists (list 1 and list 2) that I want to convert to three other lists (lists A, B and C). The data, by the way, is simply file names (one file name per line). List 1 is an old version of the data and list 2 is a new version of the data.

I want to create three lists: one that contains only the data unique to list 1 (call it list A), one that contains only the data unique to list 2 (call it list B), and one that contains only the data that occurs in both list 1 and list 2 (call it list C).

I must confess that I'm pretty much a kindergarten programmer, so the way I would do it is one step at a time: i.e. create list A by comparing each line from list 1 with each line from list 2, one line at a time. I can then do the same thing all over again for list B, and all over again for list C.

; $list1 and $list2 are the two original lists, i.e. lines of data in a string
; $listA is the newly created list with data that is unique to $list1

$listA = ""
$list1array = StringSplit($list1, @CRLF, 1)

; Wrap list 2 in @CRLF so that only whole lines match, not parts of longer names
$haystack = @CRLF & $list2 & @CRLF

For $i = 1 To $list1array[0]
    If Not StringInStr($haystack, @CRLF & $list1array[$i] & @CRLF) Then
        $listA = $listA & @CRLF & $list1array[$i]
    EndIf
Next

However, I expect these lists to be very large (millions of lines), and I want to know if there is a quicker way of doing it. Specifically, I want to know if there is a way to create all three lists A, B and C in a single step.

So, the first question is: Will it be significantly faster to use a process that eats list 1 and 2 and spits out list A, B and C in one step? And if so, can you tell me what that method may be (or point me in some direction), please?
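For what it's worth, here is a minimal sketch of the "eat two lists, spit out three" idea. It is written in Python rather than AutoIt purely for illustration, and the function name `split_lists` and the sample file names are made up. It assumes exact, case-sensitive matching and collapses duplicate lines. The point is that a hash-based lookup (a set here; in AutoIt a Scripting.Dictionary object might play the same role, though that is untested) answers "is this line in the other list?" without rescanning the whole other list each time.

```python
def split_lists(old_lines, new_lines):
    """Return (only_old, only_new, both), each in first-seen order."""
    old_set = set(old_lines)   # hash lookup instead of scanning a big string
    new_set = set(new_lines)

    only_old, both = [], []
    seen = set()
    for name in old_lines:
        if name in seen:       # skip duplicates within list 1
            continue
        seen.add(name)
        if name in new_set:
            both.append(name)
        else:
            only_old.append(name)

    only_new = []
    seen = set()
    for name in new_lines:
        if name in seen:       # skip duplicates within list 2
            continue
        seen.add(name)
        if name not in old_set:
            only_new.append(name)

    return only_old, only_new, both


list1 = "a.txt\nb.txt\nc.txt".splitlines()
list2 = "b.txt\nc.txt\nd.txt".splitlines()
list_a, list_b, list_c = split_lists(list1, list2)
print(list_a, list_b, list_c)
```

Each list is walked once, so the work grows roughly in proportion to the number of lines rather than with the product of the two list sizes. List C falls out of the same pass as list A, so there is no need for a separate run to get it.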

Question 2

If it turns out that creating the three lists one at a time is going to be no slower than some magical method of creating all three lists at the same time, my next question is how to create list C (the one that contains only data that occur in both list 1 and list 2).

One method I can think of is to simply merge both list 1 and list 2 into a temporary list (say, list D), and then create another temporary list (list E) in which the duplicates from list D were removed (using _ArrayUnique), and then use StringReplace with each line on list E to count whether it occurs once or twice in list D, and if twice, write it to list C... one line at a time.

#include <Array.au3> ; provides _ArrayUnique

; $list1 and $list2 are the two original lists, i.e. lines of data in a string
; $listC will be the final list with only data that occurs in both $list1 and $list2
; $listD is the temporary list with all entries in it
; $listE is the temporary list with only unique entries from $listD

$listC = ""
$listD = $list1 & @CRLF & $list2

; I expect this would be necessary:
; While StringInStr($listD, @CRLF & @CRLF)
;     $listD = StringReplace($listD, @CRLF & @CRLF, @CRLF)
; WEnd

$listDarray = StringSplit($listD, @CRLF, 1)
$listEarray = _ArrayUnique($listDarray)
; $listE = _ArrayToString ($listEarray, @CRLF)

For $i = 1 To $listEarray[0]
    ; Replacing an entry with itself changes nothing, but @extended then holds the number of matches
    $j = StringReplace($listD, $listEarray[$i], $listEarray[$i])
    If @extended = 2 Then
        $listC = $listC & @CRLF & $listEarray[$i]
    EndIf
Next

I'm sure this must look really silly -- there must be a simpler, more magical way.

So, the second question is: Do you know of a simple way to compare two arrays and create a third array that contains only data that occurs in both those arrays?

Thanks

Samuel

Edited by leuce


As you expect millions of entries, your best bet is to build an SQLite database and put SQL to work.

Building on code posted in you can easily derive three tables containing what you want, yet have even more possibilities.

If you're not SQL[ite] savvy, just ask and I'll refine this code to fit your needs later today.
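To give a rough idea of what the SQL route could look like, here is a small sketch. It is not the code from the thread referred to above, and it uses Python's built-in sqlite3 module instead of AutoIt's SQLite UDF, only to keep the example self-contained. The table names `list1` and `list2`, the column `name`, and the sample file names are all invented for illustration. The queries rely on SQLite's standard compound operators `EXCEPT` and `INTERSECT`.

```python
import sqlite3

# In-memory database for the demo; a file path would be used for real data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE list1 (name TEXT)")
con.execute("CREATE TABLE list2 (name TEXT)")

old_names = ["a.txt", "b.txt", "c.txt"]
new_names = ["b.txt", "c.txt", "d.txt"]
con.executemany("INSERT INTO list1 VALUES (?)", [(n,) for n in old_names])
con.executemany("INSERT INTO list2 VALUES (?)", [(n,) for n in new_names])


def names(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in con.execute(sql)]


# List A: only in list 1. List B: only in list 2. List C: in both.
list_a = names("SELECT name FROM list1 EXCEPT SELECT name FROM list2 ORDER BY name")
list_b = names("SELECT name FROM list2 EXCEPT SELECT name FROM list1 ORDER BY name")
list_c = names("SELECT name FROM list1 INTERSECT SELECT name FROM list2 ORDER BY name")

print(list_a, list_b, list_c)
con.close()
```

Both `EXCEPT` and `INTERSECT` return distinct rows, so duplicate lines within one input list collapse to a single entry. Whether that is what you want is a separate question.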


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


If you're not SQL[ite] savvy, just ask and I'll refine this code to fit your needs later today.

Thanks for the offer, but I may have found a software solution that does the thing that I'm trying to script, so please don't make an effort. However, I will look at the thread you linked to, for I have long thought that sooner or later I'm going to have to start using databases anyway.

The thread you mention also shows examples of how to find duplicates in multiple arrays, which I'll have a look at.

Thanks

Samuel


#4 ·  Posted (edited)

You won't get rid of me that easily ;)

Could you tell me how you intend to handle duplicates in a given list? For instance:

FileA
Joe
Greg
Bob
Mike
Bob
Greg

FileB
Elisa
Joe
Samantha
Joe
Bob
Joe

What exactly should ListA, ListB & ListC be?

Or is it 100% guaranteed that duplicates never ever occur in a given input list? Bet your life on that?

Edit: I forgot to ask if casing is meaningful (Joe =?= joE) and which range of characters we're talking about: "English" letters a-z, Latin with diacritics (Joël =?= joel), German (Fussball =?= fußball, München =?= Muenchen), Cyrillic, Thai, whatever?
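To show why the answers matter, here is a small sketch using the FileA/FileB names above. It is in Python for illustration only, and the variable names are made up. It compares two possible readings of "duplicates": treat each list as a set (a name counts once) or as a multiset (every occurrence counts). It then shows what Unicode case folding does and does not consider equal.

```python
from collections import Counter

file_a = ["Joe", "Greg", "Bob", "Mike", "Bob", "Greg"]
file_b = ["Elisa", "Joe", "Samantha", "Joe", "Bob", "Joe"]

# Reading 1: set semantics, duplicates within a list are ignored.
set_a = sorted(set(file_a) - set(file_b))   # only in FileA
set_b = sorted(set(file_b) - set(file_a))   # only in FileB
set_c = sorted(set(file_a) & set(file_b))   # in both

# Reading 2: multiset semantics, occurrences are paired off one by one.
count_a, count_b = Counter(file_a), Counter(file_b)
multi_a = sorted((count_a - count_b).elements())  # left over in FileA
multi_b = sorted((count_b - count_a).elements())  # left over in FileB
multi_c = sorted((count_a & count_b).elements())  # matched pairs

print(set_a, set_b, set_c)
print(multi_a, multi_b, multi_c)

# Case folding handles Joe/joE and Fussball/fußball, but it does not strip
# diacritics or apply transliteration rules such as ü -> ue.
print("Joe".casefold() == "joE".casefold())
print("Fussball".casefold() == "fußball".casefold())
print("Joël".casefold() == "joel".casefold())
print("München".casefold() == "Muenchen".casefold())
```

With set semantics Bob is simply "in both". With multiset semantics one Bob is matched and the second Bob from FileA is left over as unique to FileA. So the two readings give different lists A and B for the same input.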

Edited by jchd
