Jump to content

Comparison of two very larges arrays : how to (efficiently) ?


Go to solution Solved by SOLVE-SMART,

Recommended Posts

Posted

Hello,

In a very near future, I will have to  compare one 1.5M lines file with a 9M lines file (both CSV files), in order to see which of the first exists in the second.

Usually, I tend to use loops for checking value per value from the first array with the second (by escaping my verification loop when the value is found), but considering the excessive amount of combinations here, I'm asking for advices.

(to be cristal clear, it's about references who need to be deactivated: the main file of 9M is the total of ref in the main system, and there's only 1.5M refs in the second system, who need to be cleaned up by deactivation items who are not sold in store anymore)

Thanks for the help ! 

Posted

Hi @Ghost_Line,

how your CSV structure (columns) look like? Which values do you want to compare?
Without further information I simply suggest this:

  • Bulk the data in a database (like SQLite).
  • Do your comparison based on your criterias with SQL.

Best regards
Sven

==> AutoIt related: 🔗 Organization AutoIt Community🔗 GitHub, 🔗 Discord Server, 🔗 Cheat Sheet🔗 autoit-webdriver-boilerplate

Spoiler

🌍 Au3Forums

🎲 AutoIt (en) Cheat Sheet

📊 AutoIt limits/defaults

💎 Code Katas: [...] (comming soon)

🎭 Collection of GitHub users with AutoIt projects

🐞 False-Positives

🔮 Me on GitHub

💬 Opinion about new forum sub category

📑 UDF wiki list

✂ VSCode-AutoItSnippets

📑 WebDriver FAQs

👨‍🏫 WebDriver Tutorial (coming soon)

Posted
Just now, argumentum said:

I'd put all that in a SQLite DB and let the DB engine do the heavy lifting. Make a query and that should make it manageable.
Maybe make a 512 hash of each line will simplify the DB query. Or to be able to compare 1D arrays. My 2 cents 

Hi Argumentum,

Tahnks for this quick answer ! By any chance, maybe you've a comprehensive example somewhere (one who is up to date with the state of the library as it exists now) ?

Posted
2 minutes ago, argumentum said:

I'd put all that in a SQLite DB and let the DB engine do the heavy lifting

Oh 😁 , same idea, but I am a bit too late @argumentum.

Best regards
Sven

==> AutoIt related: 🔗 Organization AutoIt Community🔗 GitHub, 🔗 Discord Server, 🔗 Cheat Sheet🔗 autoit-webdriver-boilerplate

Spoiler

🌍 Au3Forums

🎲 AutoIt (en) Cheat Sheet

📊 AutoIt limits/defaults

💎 Code Katas: [...] (comming soon)

🎭 Collection of GitHub users with AutoIt projects

🐞 False-Positives

🔮 Me on GitHub

💬 Opinion about new forum sub category

📑 UDF wiki list

✂ VSCode-AutoItSnippets

📑 WebDriver FAQs

👨‍🏫 WebDriver Tutorial (coming soon)

  • Solution
Posted
2 minutes ago, Ghost_Line said:

By any chance, maybe you've a comprehensive example somewhere (one who is up to date with the state of the library as it exists now) ?

There are many examples regarding SQLite handling. Maybe this example could help as a start.

Best regards
Sven

==> AutoIt related: 🔗 Organization AutoIt Community🔗 GitHub, 🔗 Discord Server, 🔗 Cheat Sheet🔗 autoit-webdriver-boilerplate

Spoiler

🌍 Au3Forums

🎲 AutoIt (en) Cheat Sheet

📊 AutoIt limits/defaults

💎 Code Katas: [...] (comming soon)

🎭 Collection of GitHub users with AutoIt projects

🐞 False-Positives

🔮 Me on GitHub

💬 Opinion about new forum sub category

📑 UDF wiki list

✂ VSCode-AutoItSnippets

📑 WebDriver FAQs

👨‍🏫 WebDriver Tutorial (coming soon)

Posted (edited)
3 minutes ago, SOLVE-SMART said:

Hi @Ghost_Line,

how your CSV structure (columns) look like? Which values do you want to compare?
Without further information I simply suggest this:

  • Bulk the data in a database (like SQLite).
  • Do your comparison based on your criterias with SQL.

Best regards
Sven

Hi Sven,

It will probably looks like something like that (shown in Excel for readability):

image.png.37e8c34652f251f3a02bd5d473c8d672.png

(the first row is the ID of the item, the second one the color, the third one the size and the last one the barcode)

I plan to compare on a combination of the 3 first rows (because of some internal issue on the barcode value in some cases).

 

Edited by Ghost_Line
Posted

Off-topic:

Spoiler
1 minute ago, argumentum said:

@SOLVE-SMART has a good head, follow his advice :) 

Thanks for the 💐 . You deserve the same praise, because you had the idea a few seconds before me 😅 .

Topic:

2 minutes ago, Nine said:

Another solution is to have a small ASM injection code into the script.  That way you should perform the comparaison in less than 1 sec.

Wow, I could not do that - no experience so far with such a approach. But very interesting. May I ask for resources/references to learn about this @Nine?

Best regards
Sven

==> AutoIt related: 🔗 Organization AutoIt Community🔗 GitHub, 🔗 Discord Server, 🔗 Cheat Sheet🔗 autoit-webdriver-boilerplate

Spoiler

🌍 Au3Forums

🎲 AutoIt (en) Cheat Sheet

📊 AutoIt limits/defaults

💎 Code Katas: [...] (comming soon)

🎭 Collection of GitHub users with AutoIt projects

🐞 False-Positives

🔮 Me on GitHub

💬 Opinion about new forum sub category

📑 UDF wiki list

✂ VSCode-AutoItSnippets

📑 WebDriver FAQs

👨‍🏫 WebDriver Tutorial (coming soon)

Posted
2 minutes ago, SOLVE-SMART said:

May I ask for resources/references to learn about this

This is what I use for creating the code :

https://defuse.ca/online-x86-assembler.htm

The instruction reference :

https://www.felixcloutier.com/x86/

You will need to search for a more comprehensive tutorial about coding in assembler.

There is quite a number of examples made here in the forum, that could help you get started.

Good luck.

Posted (edited)
2 hours ago, Nine said:

Another solution is to have a small ASM injection code into the script.  That way you should perform the comparaison in less than 1 sec.

Salut Nine,

Merci pour la proposition, mais je suis bcp trop c*n pour comprendre comment faire une chose pareille 😅

(Thanks for the proposal, but I'm far too d*mb to understand how to do such a thing) 

Edited by Ghost_Line
Posted

@Ghost_Line

It's always fun to try doing assembler when the comparaison is not too hefty.  The good thing about this solution is that it does not require any external support out of the script.  Give us an example of the two files, the precise rules and expected output.  Someone here may decide to jump into it.  Once you got the basic code, it will be quite easy for you to modify it if the rules change over the time.

Posted

If you insist :) 

I will arrange the compared data in order to have what I have in this example on both side:

image.png.063898e253a5feda4d623740e1bdaaeb.png

I was thinking of concatenating part (or all) the rows into one unique value, to have something more easy to compare.

Posted

As others have said, you can use SQLite to perform such task. Since both files are .CSV, you don't even have to write code at all: use the SQLite command-line tool to load each .CSV in its own table and perform the comparison. Something like:

SELECT GA_CODEARTICLE1, GA_DESIGNATION1, GA_CODEDIM1 FROM <SMALLER FILE> INTERSECT SELECT GA_CODEARTICLE1, GA_DESIGNATION1, GA_CODEDIM1 FROM <LARGER FILE>

It woiuld greatly help to create on index on those 3 column on each table before launching the comparison.

Various CLI options let you work with memory- or disk-based DBs. You can as well obtain the output in .CSV in other format (e.g. HTML, JSON, ...).

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Posted (edited)
38 minutes ago, jchd said:

As others have said, you can use SQLite to perform such task. Since both files are .CSV, you don't even have to write code at all: use the SQLite command-line tool to load each .CSV in its own table and perform the comparison. Something like:

SELECT GA_CODEARTICLE1, GA_DESIGNATION1, GA_CODEDIM1 FROM <SMALLER FILE> INTERSECT SELECT GA_CODEARTICLE1, GA_DESIGNATION1, GA_CODEDIM1 FROM <LARGER FILE>

It woiuld greatly help to create on index on those 3 column on each table before launching the comparison.

Various CLI options let you work with memory- or disk-based DBs. You can as well obtain the output in .CSV in other format (e.g. HTML, JSON, ...).

Merci pour le coup de main, je vais creuser ça :)

(thanks for your help, I'll explore this)

Edited by Ghost_Line
Posted
3 minutes ago, argumentum said:

oh French. The language of love ... 🤔
Careful how you look into it ? :P 

Note to self: learn French :D 

😄

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...