
[ASK] compare and delete duplicate lines in 2 txt files



Hi,

I have a text file named a.txt which consists of hundreds or thousands of lines.

I have another text file named b.txt which consists of hundreds of thousands of lines (and growing to millions).

I would like to scan every single line of a.txt and compare it with every single line in b.txt.

If the line is present in b.txt, delete that line from b.txt,

so there will be no duplicate lines between a.txt and b.txt.

What is the best way to do this, since b.txt consists of hundreds of thousands of lines? I'm afraid that if I don't do it the right way, it will take forever to process :)

Thanks a lot :)

Edited by michaelslamet

You're also pretty much limited by how much you can load into memory at any one time.

Are these text files in any kind of order? Can they be sorted? What kind of data do they contain?



Of that I have no idea. In a perfect world your only limitation would be your available memory.

You should be able to read the data into the array in chunks. For example:

If $FirstBitOfLineInFileB = $FirstBitOfLineInFileA Then $aArray[$x] = $LineInFileB

Then you have a smaller dataset; you can use something similar to a bubble sort or _ArrayNaturalSort() to organise it, and that SHOULD make the process more "bite-sized". However, you'll have a lot of file access on your hands.
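Something like this rough sketch, for instance (untested; the file names, the 4-character prefix length, and the use of _ArraySort() instead of a hand-rolled bubble sort are all placeholders of mine):

#include <Array.au3>
#include <File.au3>

Local $aFileA, $aFileB
If Not _FileReadToArray("a.txt", $aFileA) Then Exit ; element [0] holds the line count
If Not _FileReadToArray("b.txt", $aFileB) Then Exit

; Keep only the lines of b.txt whose first characters match a line in a.txt,
; so the full comparison runs on a much smaller dataset.
Local $aCandidates[$aFileB[0] + 1], $iCount = 0
For $i = 1 To $aFileB[0]
    For $j = 1 To $aFileA[0]
        If StringLeft($aFileB[$i], 4) = StringLeft($aFileA[$j], 4) Then
            $iCount += 1
            $aCandidates[$iCount] = $aFileB[$i]
            ExitLoop ; one prefix match is enough to keep the line
        EndIf
    Next
Next
ReDim $aCandidates[$iCount + 1]
$aCandidates[0] = $iCount

_ArraySort($aCandidates, 0, 1) ; organise the reduced set, skipping the count in [0]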



Hi Mallie,

Those lines are not sorted; let's assume they're not sorted in any way. Both files consist mostly of numbers, and each line is no more than 40-50 characters.

Thanks Mallie :)

The first thing for you to do, then, is to try loading your million-and-one-line file into an array. Next will be to apply some sort of sorting mechanism to it while it's in memory. This will give you an idea of the time and processor impact this action will have. I'd recommend your script take a "Start Time" and an "End Time" so you can accurately measure the time taken. Keep an eye on TaskMan while it's running for the memory and processor impact.
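A rough sketch of that measurement (untested; "b.txt" is standing in for your big file):

#include <Array.au3>
#include <File.au3>

Local $aLines, $hTimer

$hTimer = TimerInit() ; "Start Time" for the load
If Not _FileReadToArray("b.txt", $aLines) Then Exit
ConsoleWrite("Load: " & Round(TimerDiff($hTimer)) & " ms for " & $aLines[0] & " lines" & @CRLF)

$hTimer = TimerInit() ; "Start Time" for the sort
_ArraySort($aLines, 0, 1) ; sort in memory, skipping the count in [0]
ConsoleWrite("Sort: " & Round(TimerDiff($hTimer)) & " ms" & @CRLF) ; "End Time"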

If this works then you're a step closer to getting this working.



Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. Then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's another million; by the time you've worked through all 1,000 lines of A.txt, that's 1,000 x 1,000,000 = a billion comparisons. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.

No, no, no scrap all of that.

Ok, are we to assume that a.txt has a fixed number of lines? If so, then after we consider who will be using your program (I assume it's just you), will you have enough memory to load a.txt into an array? If b.txt grows in size then you'll have to read it off of disk. Those are a few questions that will have to be answered before you can continue.
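As a sketch of that split, you could hold a.txt in memory and stream b.txt off disk line by line (untested; the file names and the binary-search shortcut are my assumptions, and a.txt must be sorted first):

#include <Array.au3>
#include <File.au3>

Local $aFileA, $sLine
If Not _FileReadToArray("a.txt", $aFileA) Then Exit ; the small file fits in memory
_ArraySort($aFileA, 0, 1) ; binary search needs a sorted array

Local $hIn = FileOpen("b.txt", 0)        ; 0 = read: stream the big file off disk
Local $hOut = FileOpen("b_clean.txt", 2) ; 2 = write: fresh output file
While 1
    $sLine = FileReadLine($hIn)
    If @error Then ExitLoop ; end of file
    ; Keep the line only if it is NOT found in a.txt
    If _ArrayBinarySearch($aFileA, $sLine, 1) = -1 Then FileWriteLine($hOut, $sLine)
WEnd
FileClose($hIn)
FileClose($hOut)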

Edited by LaCastiglione

Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. Then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's another million; by the time you've worked through all 1,000 lines of A.txt, that's 1,000 x 1,000,000 = a billion comparisons. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.

Lol up early or late?

Don't forget: with sorted datasets you only need to compare up to a point. For example, "Garage" comes after "Basement" in the dictionary, so you can ExitLoop or ContinueLoop as soon as $Val_A > $Val_B.
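A sketch of that early-exit idea as a single merge-style pass (untested; it assumes both $aFileA and $aFileB have already been loaded and sorted as above, with counts in element [0]):

; Because both arrays are sorted, one pass suffices: advance whichever
; side is "behind" and never re-scan lines already passed.
Local $i = 1, $j = 1, $iCmp
While $i <= $aFileA[0] And $j <= $aFileB[0]
    $iCmp = StringCompare($aFileA[$i], $aFileB[$j]) ; explicit string comparison
    If $iCmp = 0 Then
        $aFileB[$j] = "" ; blank the duplicate; skip blanks when rewriting b.txt
        $j += 1
    ElseIf $iCmp < 0 Then
        $i += 1 ; $Val_A is behind: advance a
    Else
        $j += 1 ; $Val_B is behind: advance b
    EndIf
WEnd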



By the looks of it, it will currently check all the files in a directory and create a new file with the entries minus duplicates, but it won't be hard to make it accept paths to two or more files directly.

If you're having trouble converting the function to your needs let us know.


What I would do is along these lines:

1) Open the larger file and store some hash (e.g. MD5) of each line in an SQLite :memory: table created with (md5 BLOB, source INT), placing a constant in source (e.g. 1).

2) Create a unique index on the hash column.

3) Open the smaller file and, for each line, get its MD5 hash and try to insert it (with source = 2). If the insert fails due to a duplicate, you know the line at hand is in the first file as well.

4) Optionally delete from the table the rows with source = 2 if the distinct lines don't have to become part of the larger file.

5) Optionally use the SQLite backup API to obtain a disk-based database in case you can reuse it next time.

Why do it this way? If you have an index built on the "reference" MD5s, SQLite will use an O(K log N) algorithm to check for dups, _way_ better than the naïve O(K N) linear compare, where K = small file # lines, N = large file # lines. With K = 1,000 and N = 1,000,000, that's on the order of 1,000 x 20 = 20,000 indexed lookups instead of a billion comparisons.

I bet the time saved checking for dups this way largely offsets the burden of computing the hashes.

I also bet that computing the hash is still better than storing the raw text contents (but that may depend on how long/similar your lines are).
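A rough, untested sketch of steps 1) to 3), using the stock SQLite.au3 and Crypt.au3 UDFs (the file names are placeholders):

#include <Crypt.au3>
#include <SQLite.au3>

_SQLite_Startup()
_Crypt_Startup() ; keep the crypt provider open across many hash calls
Local $hDB = _SQLite_Open() ; no filename = :memory: database
_SQLite_Exec($hDB, "CREATE TABLE lines (md5 BLOB, source INT);")
_SQLite_Exec($hDB, "CREATE UNIQUE INDEX idx_md5 ON lines (md5);")

Local $sLine, $sHex

; Steps 1) and 2): hash every line of the large file into the indexed table.
Local $hBig = FileOpen("b.txt", 0)
While 1
    $sLine = FileReadLine($hBig)
    If @error Then ExitLoop
    $sHex = StringTrimLeft(_Crypt_HashData($sLine, $CALG_MD5), 2) ; strip the "0x" prefix
    _SQLite_Exec($hDB, "INSERT INTO lines VALUES (X'" & $sHex & "', 1);")
WEnd
FileClose($hBig)

; Step 3): try inserting each line of the small file; a failed insert means
; the unique index rejected it, i.e. the line exists in the large file too.
Local $hSmall = FileOpen("a.txt", 0)
While 1
    $sLine = FileReadLine($hSmall)
    If @error Then ExitLoop
    $sHex = StringTrimLeft(_Crypt_HashData($sLine, $CALG_MD5), 2)
    If _SQLite_Exec($hDB, "INSERT INTO lines VALUES (X'" & $sHex & "', 2);") <> $SQLITE_OK Then
        ConsoleWrite("Duplicate: " & $sLine & @CRLF)
    EndIf
WEnd
FileClose($hSmall)

_SQLite_Close($hDB)
_Crypt_Shutdown()
_SQLite_Shutdown()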

Edit:

Depending on the details of your actual problem statement, using a Scripting.Dictionary should also work, but I don't know the limits on what it can hold.
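For what it's worth, the dictionary variant could look like this (untested sketch; file names are placeholders, and it keeps every a.txt line in memory as a key):

; Use the COM Scripting.Dictionary as a hash set of a.txt's lines,
; then stream b.txt and keep only the lines that are not in the set.
Local $oDict = ObjCreate("Scripting.Dictionary")
Local $sLine

Local $hA = FileOpen("a.txt", 0)
While 1
    $sLine = FileReadLine($hA)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then $oDict.Add($sLine, 0)
WEnd
FileClose($hA)

Local $hIn = FileOpen("b.txt", 0), $hOut = FileOpen("b_clean.txt", 2)
While 1
    $sLine = FileReadLine($hIn)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then FileWriteLine($hOut, $sLine)
WEnd
FileClose($hIn)
FileClose($hOut)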

Edit2:

@Tvern

:) I really didn't remember right away that I wrote that some time ago! The idea is the same anyway.

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Don't throw money at that. Take the code Tvern pointed out (I completely forgot what I wrote a few months ago: that's age!) and, instead of looping through all the files in a directory, just do it with your two files.

