groucho

help on comparing 2 arrays

8 posts in this topic

Hi,

I need to compare a large CSV file with a previous CSV file to produce an input file for a database:

If the same record already exists, write the original record to the input file.

If the record is new, write it to the input file with a new unique ID.

In a month's time, the input file should become the "previous CSV file".

I understand _FileReadToArray, but the comparison should be on the combination of the first and third columns (first file) and the combination of the second and third columns (second file).

I was thinking of:

_FileReadToArray (CSV)

Split the CSV-Array (CSV)

_FileReadToArray (previousCSV)

Split the CSV-Array (previousCSV)

Search (previousCSV) for each record in (CSV)

It would be great simply to know which functions to use and in what order; the glue in between I can work out myself.

Tnx in advance




tnx,

I will combine this with some multi-dimensional array stuff


Quote: "This may be useful to you: http://www.autoitscript.com/forum/index.php?showtopic=12710"

I wouldn't use that if I were you. It's very slow: first, it uses multiple binary searches instead of just iterating through the arrays (they're both sorted anyway), and on top of that it calls _ArrayDelete on each item.

Try this instead:

http://www.autoitscript.com/forum/index.php?showtopic=12822

It's pretty fast, and I'm working on combining it with my binary tree UDF to see if I can make it even faster.

But that's only if you want to read your entire files into arrays. For what you're doing, you could very easily use my binary trees UDF to index and check the columns in your CSV files, and output as needed. Much faster, less memory used, more efficient.

Here's my Binary Tree UDFs:

http://www.autoitscript.com/forum/index.php?showtopic=13114
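
The point above about "just iterating through the arrays" when both are already sorted can be sketched as a single merge-style pass. This is only an illustration, written in Python rather than AutoIt so it can be run and checked anywhere; the function name compare_sorted is made up for this sketch and it is not the code from any of the linked topics. It assumes both inputs are sorted and contain no duplicates.

```python
def compare_sorted(a, b):
    """Walk two sorted, duplicate-free lists once and split their items
    into (in_both, only_in_a, only_in_b)."""
    in_both, only_a, only_b = [], [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same item in both lists: record it and advance both cursors.
            in_both.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] cannot appear later in b, because b is sorted.
            only_a.append(a[i])
            i += 1
        else:
            only_b.append(b[j])
            j += 1
    # Whatever is left in either list has no partner in the other.
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return in_both, only_a, only_b
```

Each item is looked at once, so the work grows with the combined length of the two arrays, and nothing is deleted from either array along the way.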


I have considered your tables too. However, I am an infrastructure guy with moderate knowledge of VB and AutoIt. The trouble for me with the tables, even with the definitions given in another topic, is that the concept is pretty hard. I can see their potential, but it is very hard to translate it to common tasks. I will nevertheless give it a try.

The bottom line of both my question and your solutions (array search and tables) is that we are doing database procedures on flat files.

Tnx anyway


Grab my BTree UDF and try this pseudo-code:

;assumes the BTree UDF is #included; the _BTree* calls are used exactly as named in this post
;$sFirstCsv, $sSecondCsv and $sOutput are placeholder paths you set yourself
;the "," delimiter is an assumption, change it to match your CSV
$btRecords = _BTreeCreate()
$hFile = FileOpen($sFirstCsv, 0);your first CSV, read mode
While 1
    $Line = FileReadLine($hFile)
    If @error Then ExitLoop;end of file reached
    $aItem = StringSplit($Line, ",");$aItem[0] holds the field count
    _BTreeSet($btRecords, $aItem[1] & $aItem[3], $Line);index on columns 1 & 3
WEnd
;finally, close the file:
FileClose($hFile)

;Optimize the tree, just to speed things up
_BTreeOptimize($btRecords)

;now read the second file, looking for duplicates
$hFile = FileOpen($sSecondCsv, 0);your second CSV, read mode
While 1
    $Line = FileReadLine($hFile)
    If @error Then ExitLoop;end of file reached
    $aItem = StringSplit($Line, ",")
    _BTreeGet($btRecords, $aItem[2] & $aItem[3]);look up columns 2 & 3
    If @error Then _BTreeSet($btRecords, $aItem[2] & $aItem[3], $Line);add only if the key doesn't already exist
WEnd
;finally, close the file:
FileClose($hFile)

;now write the uniquely indexed lines to a file:
$hFile = FileOpen($sOutput, 2);your output file, write mode (overwrites)
For $i = 1 To $btRecords[0][0]
    FileWriteLine($hFile, $btRecords[$i][3])
Next
FileClose($hFile)
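
If the tree concept is hard to picture, the same index-then-check flow can be sketched with a plain dictionary. This is a minimal Python sketch, not the BTree UDF: the names make_key and merge_records, the comma delimiter and the separator character inside the key are all assumptions made for illustration, and plain split() does not handle quoted CSV fields.

```python
def make_key(fields, cols):
    """Build a lookup key from the given 1-based column numbers."""
    # "\x1f" (unit separator) is assumed not to occur in the data.
    return "\x1f".join(fields[c - 1].strip() for c in cols)


def merge_records(first_lines, second_lines, sep=","):
    """Return every line of the first file, plus those lines of the
    second file whose key is not already indexed."""
    index = {}  # key -> original line; insertion order is preserved
    for line in first_lines:
        line = line.rstrip("\r\n")
        if line:
            # First file: index on columns 1 and 3.
            index.setdefault(make_key(line.split(sep), (1, 3)), line)
    for line in second_lines:
        line = line.rstrip("\r\n")
        if line:
            # Second file: look up columns 2 and 3, add only if missing.
            index.setdefault(make_key(line.split(sep), (2, 3)), line)
    return list(index.values())
```

Joining the key parts with a separator keeps "ab" + "c" and "a" + "bc" apart, which plain concatenation would treat as the same key. The function accepts any iterable of lines, so an open file object can be passed directly instead of reading the whole file into an array.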


OK, some results:

1. I used the comparison of 2 arrays in _Array1PullCommon on a file with 400,000 lines

(copy1 without 1000 lines in the middle; copy2 without 1000 lines somewhere at the end)

with output in three files (in_both.txt, only_in1.txt, only_in2.txt).

A colleague ran a similar procedure a month ago in VBS: 3 hours at 99% CPU.

Mine took 10 min at 99% CPU, then 50 min at 65%.

So: pretty good!

2. To get some "uniqueness" (removing duplicate lines) in these and subsequent files, I rearranged _arraysearch a bit, also for a "quick" search for a particular entry:

http://www.autoitscript.com/forum/index.ph...213entry90213

On the whole:

tnx for helping me on the way



You shouldn't need to use _ArraySort() because my _Array1PullCommon() function already returns sorted arrays. Also, if you're searching sorted arrays, use _ArrayBinarySearch; it'll be much faster!
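
To show why a binary search pays off on an already sorted array, here is a small Python sketch using the standard bisect module. It illustrates the idea only and is not the implementation of _ArrayBinarySearch; the function name contains_sorted is made up for this example.

```python
from bisect import bisect_left


def contains_sorted(sorted_items, target):
    """Report whether target occurs in an ascending sorted list."""
    # bisect_left returns the leftmost position where target could be
    # inserted while keeping the list sorted.
    pos = bisect_left(sorted_items, target)
    return pos < len(sorted_items) and sorted_items[pos] == target
```

Each lookup halves the remaining range, so a 400,000-line array needs about 19 comparisons per search instead of up to 400,000 for a linear scan. The result is only meaningful if the array really is sorted.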
