
[ASK] compare and delete duplicate lines in 2 txt files



Hi,

I have a text file named a.txt which consists of hundreds or thousands of lines.

I have another text file named b.txt which consists of hundreds of thousands of lines (and growing to millions).

I would like to scan every single line of a.txt and compare it with every single line in b.txt.

If the line is present in b.txt, delete that line from b.txt,

so there will be no duplicate lines between a.txt and b.txt.

What is the best way to do this, since b.txt consists of hundreds of thousands of lines? I'm afraid that if I don't do it the right way, it will take forever to process :)

Thanks a lot :)

Edited by michaelslamet

You're also pretty much limited by how much you can load into memory at any one time.

Are these text files in any kind of order? Can they be sorted? What kind of data do they contain?



Of that I have no idea. In a perfect world your only limitation would be your available memory.

You should be able to read the data into the array in chunks. For example:

If $FirstBitOfLineInFileB = $FirstBitOfLineInFileA Then $aArray[$x] = $LineInFileB

Then you have a smaller dataset; you can use something similar to a bubble sort or _ArrayNaturalSort() to organise it, and that SHOULD make the process more "bite-sized". However, you'll have a lot of file access on your hands.
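Something like this rough sketch, for instance (untested; the file names, the 4-character prefix length, and the use of _ArraySort() instead of a hand-rolled bubble sort are all placeholders of mine):

#include <Array.au3>
#include <File.au3>

Local $aFileA, $aFileB
If Not _FileReadToArray("a.txt", $aFileA) Then Exit ; element [0] holds the line count
If Not _FileReadToArray("b.txt", $aFileB) Then Exit

; Keep only the lines of b.txt whose first characters match a line in a.txt,
; so the full comparison runs on a much smaller dataset.
Local $aCandidates[$aFileB[0] + 1], $iCount = 0
For $i = 1 To $aFileB[0]
    For $j = 1 To $aFileA[0]
        If StringLeft($aFileB[$i], 4) = StringLeft($aFileA[$j], 4) Then
            $iCount += 1
            $aCandidates[$iCount] = $aFileB[$i]
            ExitLoop ; one prefix match is enough to keep the line
        EndIf
    Next
Next
ReDim $aCandidates[$iCount + 1]
$aCandidates[0] = $iCount

_ArraySort($aCandidates, 0, 1) ; organise the reduced set, skipping the count in [0]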



Hi Mallie,

Those lines are not sorted; let's assume they're not sorted in any way. Both files consist mostly of numbers, and each line is no more than 40-50 characters.

Thanks Mallie :)

The first thing for you to do, then, is to try loading your million-and-one-line file into an array. Next will be to apply some sort of sorting mechanism to it while it's in memory. This will give you an idea of the time and processor impact this action will have. I'd recommend your script take a "Start Time" and an "End Time" so you can accurately measure the time taken. Keep an eye on TaskMan while it's running for the memory and processor impact.
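A rough sketch of that measurement (untested; "b.txt" is standing in for your big file):

#include <Array.au3>
#include <File.au3>

Local $aLines, $hTimer

$hTimer = TimerInit() ; "Start Time" for the load
If Not _FileReadToArray("b.txt", $aLines) Then Exit
ConsoleWrite("Load: " & Round(TimerDiff($hTimer)) & " ms for " & $aLines[0] & " lines" & @CRLF)

$hTimer = TimerInit() ; "Start Time" for the sort
_ArraySort($aLines, 0, 1) ; sort in memory, skipping the count in [0]
ConsoleWrite("Sort: " & Round(TimerDiff($hTimer)) & " ms" & @CRLF) ; "End Time"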

If this works then you're a step closer to getting this working.



Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. Then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's another million; by the time you've worked through all 1,000 lines of A.txt, that's 1,000 x 1,000,000 = a billion comparisons. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.

No, no, no scrap all of that.

Ok, are we to assume that a.txt has a fixed number of lines? If so, then after we consider who will be using your program (I assume it's just you), will you have enough memory to load a.txt into an array? If b.txt grows in size then you'll have to read it off of disk. Those are a few questions that will have to be answered before you can continue.
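As a sketch of that split, you could hold a.txt in memory and stream b.txt off disk line by line (untested; the file names and the binary-search shortcut are my assumptions, and a.txt must be sorted first):

#include <Array.au3>
#include <File.au3>

Local $aFileA, $sLine
If Not _FileReadToArray("a.txt", $aFileA) Then Exit ; the small file fits in memory
_ArraySort($aFileA, 0, 1) ; binary search needs a sorted array

Local $hIn = FileOpen("b.txt", 0)        ; 0 = read: stream the big file off disk
Local $hOut = FileOpen("b_clean.txt", 2) ; 2 = write: fresh output file
While 1
    $sLine = FileReadLine($hIn)
    If @error Then ExitLoop ; end of file
    ; Keep the line only if it is NOT found in a.txt
    If _ArrayBinarySearch($aFileA, $sLine, 1) = -1 Then FileWriteLine($hOut, $sLine)
WEnd
FileClose($hIn)
FileClose($hOut)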

Edited by LaCastiglione

Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. Then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's another million; by the time you've worked through all 1,000 lines of A.txt, that's 1,000 x 1,000,000 = a billion comparisons. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.

Lol up early or late?

Don't forget: with sorted datasets you only need to compare up to a point. For example, "Garage" comes after "Basement" in the dictionary, so you can ExitLoop or ContinueLoop as soon as $Val_A > $Val_B.
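A sketch of that early-exit idea as a single merge-style pass (untested; it assumes both $aFileA and $aFileB have already been loaded and sorted as above, with counts in element [0]):

; Because both arrays are sorted, one pass suffices: advance whichever
; side is "behind" and never re-scan lines already passed.
Local $i = 1, $j = 1, $iCmp
While $i <= $aFileA[0] And $j <= $aFileB[0]
    $iCmp = StringCompare($aFileA[$i], $aFileB[$j]) ; explicit string comparison
    If $iCmp = 0 Then
        $aFileB[$j] = "" ; blank the duplicate; skip blanks when rewriting b.txt
        $j += 1
    ElseIf $iCmp < 0 Then
        $i += 1 ; $Val_A is behind: advance a
    Else
        $j += 1 ; $Val_B is behind: advance b
    EndIf
WEnd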



By the looks of it, it will currently check all the files in a directory and create a new file with the entries minus duplicates, but it won't be hard to make it accept paths to two or more files directly.

If you're having trouble converting the function to your needs let us know.


What I would do is along these lines:

1) Open the larger file and store some hash (e.g. MD5) of each line in an SQLite :memory: table created with (md5 BLOB, source INT), placing a constant in source (e.g. 1).

2) Create a unique index on the hash column.

3) Open the smaller file and, for each line, get its MD5 hash and try to insert it (with source = 2). If the insert fails due to a duplicate, you know the line at hand is in the first file as well.

4) Optionally delete from the table the rows with source = 2 if the distinct lines don't have to become part of the larger file.

5) Optionally use the SQLite backup API to obtain a disk-based database in case you can reuse it next time.

Why do it this way? If you have an index built on the "reference" MD5s, SQLite will use an O(K log N) algorithm to check for dups, _way_ better than the naïve O(K N) linear compare, where K = small file # lines, N = large file # lines. With K = 1,000 and N = 1,000,000, that's on the order of 1,000 x 20 = 20,000 indexed lookups instead of a billion comparisons.

I bet the time saved checking for dups this way largely offsets the burden of computing the hashes.

I also bet that computing the hash is still better than storing the raw text contents (but that may depend on how long/similar your lines are).
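A rough, untested sketch of steps 1) to 3), using the stock SQLite.au3 and Crypt.au3 UDFs (the file names are placeholders):

#include <Crypt.au3>
#include <SQLite.au3>

_SQLite_Startup()
_Crypt_Startup() ; keep the crypt provider open across many hash calls
Local $hDB = _SQLite_Open() ; no filename = :memory: database
_SQLite_Exec($hDB, "CREATE TABLE lines (md5 BLOB, source INT);")
_SQLite_Exec($hDB, "CREATE UNIQUE INDEX idx_md5 ON lines (md5);")

Local $sLine, $sHex

; Steps 1) and 2): hash every line of the large file into the indexed table.
Local $hBig = FileOpen("b.txt", 0)
While 1
    $sLine = FileReadLine($hBig)
    If @error Then ExitLoop
    $sHex = StringTrimLeft(_Crypt_HashData($sLine, $CALG_MD5), 2) ; strip the "0x" prefix
    _SQLite_Exec($hDB, "INSERT INTO lines VALUES (X'" & $sHex & "', 1);")
WEnd
FileClose($hBig)

; Step 3): try inserting each line of the small file; a failed insert means
; the unique index rejected it, i.e. the line exists in the large file too.
Local $hSmall = FileOpen("a.txt", 0)
While 1
    $sLine = FileReadLine($hSmall)
    If @error Then ExitLoop
    $sHex = StringTrimLeft(_Crypt_HashData($sLine, $CALG_MD5), 2)
    If _SQLite_Exec($hDB, "INSERT INTO lines VALUES (X'" & $sHex & "', 2);") <> $SQLITE_OK Then
        ConsoleWrite("Duplicate: " & $sLine & @CRLF)
    EndIf
WEnd
FileClose($hSmall)

_SQLite_Close($hDB)
_Crypt_Shutdown()
_SQLite_Shutdown()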

Edit:

Depending on the details of your actual problem statement, using a Scripting.Dictionary should also work, but I don't know the limits on what it can hold.
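For what it's worth, the dictionary variant could look like this (untested sketch; file names are placeholders, and it keeps every a.txt line in memory as a key):

; Use the COM Scripting.Dictionary as a hash set of a.txt's lines,
; then stream b.txt and keep only the lines that are not in the set.
Local $oDict = ObjCreate("Scripting.Dictionary")
Local $sLine

Local $hA = FileOpen("a.txt", 0)
While 1
    $sLine = FileReadLine($hA)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then $oDict.Add($sLine, 0)
WEnd
FileClose($hA)

Local $hIn = FileOpen("b.txt", 0), $hOut = FileOpen("b_clean.txt", 2)
While 1
    $sLine = FileReadLine($hIn)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then FileWriteLine($hOut, $sLine)
WEnd
FileClose($hIn)
FileClose($hOut)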

Edit2:

@Tvern

:) I really didn't remember right away that I wrote that some time ago! The idea is the same anyway.

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Don't throw money at that. Take the code Tvern pointed out (I completely forgot what I wrote a few months ago: that's age!) and, instead of looping through all the files in a directory, just do it with your two files.

