
[ASK] Compare and delete duplicate lines in 2 txt files


Posted (edited)

Hi,

I have a text file named a.txt which consists of hundreds/thousands of lines.

I have another text file named b.txt which consists of hundreds of thousands (and growing to millions) of lines.

I would like to scan every single line of a.txt and compare it with every single line in b.txt.

If a line is also present in b.txt, delete that line from b.txt,

so that there are no duplicate lines between a.txt and b.txt.

What is the best way to do this, given that b.txt contains hundreds of thousands of lines? I'm afraid that if I don't do it the right way, it will take forever to process :)

Thanks a lot :)

Edited by michaelslamet




Posted

What have you tried so far? Have you got anything currently in place?


Posted

Hi Mallie,

Thanks for your reply :)

I haven't coded anything yet. I think if I do it "conventionally" it will be simple: just read every single line in a.txt and compare it with every single line in b.txt. But I think it will take forever to finish. Am I wrong? I'm new to AutoIt :)


Posted

You're also pretty much limited by how much you can load into memory at any one time.

Are these text files in any kind of order? Can they be sorted? What kind of data do they contain?


Posted

If I use _FileReadToArray, can it handle/store up to a hundred thousand or a million lines in an array?


Posted

Hi Mallie,

Those lines are not sorted; let's assume they're not sorted in any way. Both files consist mostly of numbers, and each line is no more than 40-50 characters.

Thanks Mallie :)


Posted

Of that I have no idea. In a perfect world your only limitation will be your available memory.

You should be able to read the data into the array in chunks. For example:

If $FirstBitOfLineInFileB = $FirstBitOfLineInFileA Then $aArray[$x] = $LineInFileB

Then you'd have a smaller dataset; you could use something like a bubble sort or _ArrayNaturalSort() to organise it, which SHOULD make the process more bite-sized. However, you'll have a lot of file access on your hands.
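To make that concrete, a very rough, runnable sketch of the pre-filter idea could look like the one below. The file names and the 5-character prefix rule are only placeholder assumptions for whatever "first bit" fits the real data; it simply counts the b.txt lines that share a prefix with some a.txt line, which would become the smaller dataset to sort and compare.

#include <File.au3>
#include <Array.au3>

Local $aLinesA
If Not _FileReadToArray("a.txt", $aLinesA) Then Exit MsgBox(16, "Error", "Could not read a.txt")

; build an array holding the first 5 characters of every a.txt line
Local $aPrefixA[$aLinesA[0] + 1]
$aPrefixA[0] = $aLinesA[0]
For $i = 1 To $aLinesA[0]
    $aPrefixA[$i] = StringLeft($aLinesA[$i], 5)
Next

; read b.txt line by line and keep only the lines whose prefix also occurs in a.txt
Local $hFileB = FileOpen("b.txt", 0) ; 0 = read mode
Local $sLineB, $iCandidates = 0
While 1
    $sLineB = FileReadLine($hFileB)
    If @error Then ExitLoop ; end of file
    If _ArraySearch($aPrefixA, StringLeft($sLineB, 5), 1) <> -1 Then $iCandidates += 1
WEnd
FileClose($hFileB)
ConsoleWrite($iCandidates & " candidate lines found" & @CRLF)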


Posted


First thing for you to do, then, is to try loading your million-and-one line file into an array. Next will be to apply some sort of sorting mechanism to it while it's in memory. This will give you an idea of the time and processor impact this action will have. I'd recommend your script take a "Start Time" and an "End Time" so you can accurately measure the time taken. Keep an eye on Task Manager while it's running for the memory and processor impact.

If this works, then you're a step closer to a solution.
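A minimal timing test along those lines could look like this (the file name is just an example; _FileReadToArray puts the line count in element [0]):

#include <File.au3>
#include <Array.au3>

Local $hTimer = TimerInit()
Local $aLinesB
If Not _FileReadToArray("b.txt", $aLinesB) Then Exit MsgBox(16, "Error", "Could not read b.txt")
ConsoleWrite("Loaded " & $aLinesB[0] & " lines in " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)

$hTimer = TimerInit()
_ArraySort($aLinesB, 0, 1) ; ascending sort, starting at index 1 to skip the count in [0]
ConsoleWrite("Sorted in " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)

Watching Task Manager while it runs covers the memory side of the picture.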


Posted (edited)

Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. And then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt; that's 1,000,000,000 comparisons. And so on. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.

No, no, no scrap all of that.

Ok, are we to assume that a.txt has a fixed number of lines? If so, then, considering who will be using your program (I assume it's just you), will you have enough memory to load a.txt into an array? If b.txt grows in size then you'll have to read it off of disk. Those are a few questions that will have to be answered before you can continue.

Edited by LaCastiglione


Posted


Lol up early or late?

Don't forget, with sorted datasets you only need to compare up to a point... For example, "Garage" comes after "Basement" in the dictionary. You can ExitLoop or ContinueLoop when $Val_A > $Val_B.
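To illustrate, assuming both files have already been loaded and sorted into $aA and $aB (1-based, count in element [0], the way _FileReadToArray returns them), the early-exit walk might look roughly like this sketch:

Local $iA = 1, $iB = 1, $iCmp
While $iA <= $aA[0] And $iB <= $aB[0]
    $iCmp = StringCompare($aB[$iB], $aA[$iA])
    If $iCmp = 0 Then
        $aB[$iB] = "" ; duplicate: mark the b.txt line for removal when rewriting the file
        $iB += 1
    ElseIf $iCmp < 0 Then
        $iB += 1 ; the b.txt line sorts before the current a.txt line: keep walking B
    Else
        $iA += 1 ; the a.txt line sorts before the B line: no match possible, advance A
    EndIf
WEnd

Because both arrays are sorted, this is a single pass over each file instead of comparing every a.txt line against every b.txt line.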


Posted

By the looks of it, it will currently check all the files in a directory and create a new file with the entries minus duplicates, but it won't be hard to make it accept paths to two or more files directly.

If you're having trouble converting the function to your needs, let us know.


Posted


Thanks Mallie, I realized my error.


Posted (edited)

What I would do is along these lines:

1) Open the larger file and store some hash (e.g. MD5) of each line in a :memory: SQLite table created with (md5 blob, source int), placing a constant in source (e.g. 1).

2) Create a unique index on the hash column.

3) Open the smaller file and, for each line, get its MD5 hash and try an insert (with source = 2). If the insert fails due to a duplicate, you know the line at hand is in the first file as well.

4) Optionally, delete from the table the rows with source = 2 if the distinct lines don't have to become part of the larger file.

5) Optionally, use the SQLite backup to obtain a disk-based database in case you can reuse it next time.

Why do it this way? If you have an index built on the "reference" MD5s, SQLite will use an O(K log(N)) algorithm to check for dups, _way_ better than the naïve O(K N) linear compare, where K = number of lines in the small file and N = number of lines in the large file.

I bet the time saved while checking for dups this way largely offsets the burden of computing the hash.

I also bet that computing the hash is still better than storing the raw text contents (but that may depend on how long/similar your lines are).
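For what it's worth, a rough AutoIt sketch of steps 1-3 could look like the code below. It assumes the standard SQLite and Crypt UDFs, stores the MD5 as a hex string rather than a blob, and hard-codes the file names, so treat it as an outline to adapt rather than a finished script:

#include <SQLite.au3>
#include <Crypt.au3>

_SQLite_Startup()
Local $hDB = _SQLite_Open() ; no filename = :memory: database
_SQLite_Exec($hDB, "CREATE TABLE lines (md5 TEXT, source INT);")
_SQLite_Exec($hDB, "CREATE UNIQUE INDEX idx_md5 ON lines (md5);")

; 1) + 2) hash every line of the large file into the indexed table
Local $hB = FileOpen("b.txt", 0), $sLine
_SQLite_Exec($hDB, "BEGIN;") ; one transaction keeps the bulk insert fast
While 1
    $sLine = FileReadLine($hB)
    If @error Then ExitLoop
    _SQLite_Exec($hDB, "INSERT INTO lines VALUES ('" & Hex(_Crypt_HashData($sLine, $CALG_MD5)) & "', 1);")
WEnd
_SQLite_Exec($hDB, "COMMIT;")
FileClose($hB)

; 3) try to insert each line of the small file; a failed insert means the line is in both files
Local $hA = FileOpen("a.txt", 0)
While 1
    $sLine = FileReadLine($hA)
    If @error Then ExitLoop
    If _SQLite_Exec($hDB, "INSERT INTO lines VALUES ('" & Hex(_Crypt_HashData($sLine, $CALG_MD5)) & "', 2);") <> $SQLITE_OK Then
        ConsoleWrite("Duplicate line: " & $sLine & @CRLF)
    EndIf
WEnd
FileClose($hA)

_SQLite_Close($hDB)
_SQLite_Shutdown()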

Edit:

Depending on the details of your actual problem statement, using a Scripting.Dictionary should also work, but I don't know what limits it can hold.
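If the dictionary route fits, a small sketch (file names assumed, and writing the result to a new file instead of deleting in place) could be:

Local $oDict = ObjCreate("Scripting.Dictionary")

; remember every line of the small file
Local $hA = FileOpen("a.txt", 0), $sLine
While 1
    $sLine = FileReadLine($hA)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then $oDict.Add($sLine, 0)
WEnd
FileClose($hA)

; copy b.txt, skipping any line that also appears in a.txt
Local $hB = FileOpen("b.txt", 0)
Local $hOut = FileOpen("b_clean.txt", 2) ; 2 = open for writing, erase previous contents
While 1
    $sLine = FileReadLine($hB)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then FileWriteLine($hOut, $sLine)
WEnd
FileClose($hB)
FileClose($hOut)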

Edit2:

@Tvern

:) I really didn't remember at first that I wrote that some time ago! The idea is the same anyway.

Edited by jchd


Posted

Mallie, LaCastiglione, Tvern and jchd.

Thanks to all of you for your replies.

I think this is too complicated for me to code by myself :) I'm going to ask an expert on RentACoder/vworker to code it for me.

Many thanks for the input :)


Posted

Don't throw money away on that. Take the code Tvern pointed out (I completely forgot what I wrote a few months ago: that's age!) and, instead of looping through all the files in a directory, just do it with your two files.

