
[ASK] compare and delete duplicate line in 2 txt files


14 replies to this topic

#1 michaelslamet

    Universalist

  • Active Members
  • 883 posts

Posted 24 March 2011 - 07:27 AM

Hi,
I have a text file named a.txt which consists of hundreds or thousands of lines.
I have another text file named b.txt which consists of hundreds of thousands (and growing to millions) of lines.

I would like to scan every single line of a.txt and compare it with every single line in b.txt.
If a line exists in b.txt, delete that line from b.txt,
so there will be no duplicate lines between a.txt and b.txt.

What is the best way to do this, since b.txt consists of hundreds of thousands of lines? I'm afraid that if I don't do it the right way, it will take forever to process :)

Thanks a lot :)

Edited by michaelslamet, 24 March 2011 - 07:29 AM.

#2 Mallie99

    Wayfarer

  • Active Members
  • 67 posts

Posted 24 March 2011 - 07:33 AM

What have you tried so far? Have you got anything currently in place?

#3 michaelslamet

    Universalist

  • Active Members
  • 883 posts

Posted 24 March 2011 - 07:36 AM

Hi Mallie,
Thanks for your reply :)

I haven't coded anything yet. I think if I do it "conventionally" it will be simple: just read every single line in a.txt and compare it with every single line in b.txt. But I think it will take forever to finish. Am I wrong? I'm new to AutoIt :)
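For illustration, the "conventional" approach I mean would look something like this (an untested sketch, assuming both files fit in memory):

#include <File.au3>

; Naive approach: compare every line of b.txt against every line of
; a.txt. That is N*M comparisons - fine for small files, far too slow
; once b.txt reaches hundreds of thousands of lines.
Local $aA, $aB
_FileReadToArray("a.txt", $aA) ; element 0 holds the line count
_FileReadToArray("b.txt", $aB)

Local $sKeep = "", $bDup
For $i = 1 To $aB[0]
    $bDup = False
    For $j = 1 To $aA[0]
        If $aB[$i] == $aA[$j] Then
            $bDup = True
            ExitLoop
        EndIf
    Next
    If Not $bDup Then $sKeep &= $aB[$i] & @CRLF
Next
FileDelete("b.txt")
FileWrite("b.txt", $sKeep)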

#4 Mallie99

    Wayfarer

  • Active Members
  • 67 posts

Posted 24 March 2011 - 07:38 AM

You're also pretty much limited by how much you can load into memory at any one time.

Are these text files in any kind of order? Can they be sorted? What kind of data do they contain?

#5 michaelslamet

    Universalist

  • Active Members
  • 883 posts

Posted 24 March 2011 - 07:41 AM

If I use _FileReadToArray, can it handle/store up to a hundred thousand or a million lines in an array?

#6 michaelslamet

    Universalist

  • Active Members
  • 883 posts

Posted 24 March 2011 - 07:44 AM

Hi Mallie,
Those lines are not sorted; let's assume they're not sorted in any way. Both files consist mostly of numbers, and each line is no more than 40-50 characters.

Thanks Mallie :)

#7 Mallie99

    Wayfarer

  • Active Members
  • 67 posts

Posted 24 March 2011 - 07:48 AM

Of that I have no idea. In a perfect world your only limitation would be your available memory.

You should be able to read the data into the array in chunks. For example:

If $FirstBitOfLineInFileB = $FirstBitOfLineInFileA Then $aArray[$x] = $LineInFileB

Then you have a smaller dataset; you can use something like a bubble sort or _ArrayNaturalSort() to organise it, and that SHOULD make the process more "bite-sized". However, you'll have a lot of file access on your hands.
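Something like this is what I mean (untested sketch; the 4-character prefix length and the variable names are arbitrary choices for illustration):

#include <File.au3>

; Pre-filter: collect the first 4 characters of every a.txt line, then
; stream b.txt line by line (instead of loading it whole) and keep only
; the lines sharing such a prefix. The survivors form a much smaller
; candidate set for the full comparison.
Local $aA
_FileReadToArray("a.txt", $aA)

Local $sPrefixes = "|"
For $j = 1 To $aA[0]
    $sPrefixes &= StringLeft($aA[$j], 4) & "|"
Next

Local $hB = FileOpen("b.txt", 0) ; 0 = read mode
Local $aCandidates[1] = [0], $sLine
While 1
    $sLine = FileReadLine($hB)
    If @error Then ExitLoop ; end of file
    If StringInStr($sPrefixes, "|" & StringLeft($sLine, 4) & "|") Then
        $aCandidates[0] += 1
        ReDim $aCandidates[$aCandidates[0] + 1] ; grown one-by-one; a sketch, not tuned
        $aCandidates[$aCandidates[0]] = $sLine
    EndIf
WEnd
FileClose($hB)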

#8 Mallie99

    Wayfarer

  • Active Members
  • 67 posts

Posted 24 March 2011 - 07:52 AM

Hi Mallie,
Those lines are not sorted; let's assume they're not sorted in any way. Both files consist mostly of numbers, and each line is no more than 40-50 characters.

Thanks Mallie :)


The first thing for you to do, then, is to try loading your million-and-one-line file into an array. Next, apply some sort of sorting mechanism to it while it's in memory. This will give you an idea of the time and processor impact this action will have. I'd recommend your script take a "Start Time" and an "End Time" so you can accurately measure the time taken. Keep an eye on TaskMan while it's running for the memory and processor impact.

If this works then you're a step closer to getting this working.
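Something like this (untested) would give you the numbers, using the TimerInit()/TimerDiff() pair for the timestamps and the stock _ArraySort() UDF for the sort step:

#include <File.au3>
#include <Array.au3>

; Time the load-and-sort step: TimerInit()/TimerDiff() return elapsed
; milliseconds, standing in for the "Start Time"/"End Time" idea above.
Local $hTimer = TimerInit()

Local $aB
_FileReadToArray("b.txt", $aB)
_ArraySort($aB, 0, 1) ; ascending, starting at index 1 to skip the count

ConsoleWrite("Loaded and sorted " & $aB[0] & " lines in " & _
        Round(TimerDiff($hTimer)) & " ms" & @CRLF)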

#9 jaberwacky

    RegExp("\m/")

  • Active Members
  • 3,237 posts

Posted 24 March 2011 - 08:26 AM

Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. And then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's a billion comparisons. And so on. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.



No, no, no, scrap all of that.



Ok, are we to assume that a.txt has a fixed number of lines? If so, then after we consider who will be using your program (I assume it's just you), will you have enough memory to load a.txt into an array? If b.txt grows in size then you'll have to read it off disk. Those are a few questions that will have to be answered before you can continue.

Edited by LaCastiglione, 24 March 2011 - 08:35 AM.


#10 Mallie99

    Wayfarer

  • Active Members
  • 67 posts

Posted 24 March 2011 - 08:32 AM

Assume that you have two text files: A.txt & B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. And then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's a billion comparisons. And so on. It is 4:30 in the morning but I think my reasoning is sound. I did stay at a Howard Johnson's.

Lol up early or late?

Don't forget, with sorted datasets you only need to compare up to a point... For example, Garage comes after Basement in the dictionary. You can ExitLoop or ContinueLoop when $Val_A > $Val_B.
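A rough sketch of that early-exit idea (untested; assumes $aA and $aB hold the two files' lines sorted ascending, with counts in element 0, as _FileReadToArray and _ArraySort would leave them). A single merge-style pass visits each line once instead of comparing every pair:

; Walk both sorted arrays in step; StringCompare() returns <0, 0 or >0.
Local $i = 1, $j = 1, $sKeep = "", $iCmp
While $j <= $aB[0]
    If $i > $aA[0] Then
        $sKeep &= $aB[$j] & @CRLF ; a.txt exhausted: keep the rest of b.txt
        $j += 1
    Else
        $iCmp = StringCompare($aB[$j], $aA[$i])
        If $iCmp = 0 Then
            $j += 1 ; duplicate: drop the b.txt line
        ElseIf $iCmp < 0 Then
            $sKeep &= $aB[$j] & @CRLF ; no match possible any more: keep it
            $j += 1
        Else
            $i += 1 ; $Val_B > $Val_A: advance in a.txt, the ContinueLoop moment
        EndIf
    EndIf
WEnd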

#11 Tvern

    Universalist

  • Active Members
  • 972 posts

Posted 24 March 2011 - 08:37 AM

Apparently this is a very fast way to do what you want.
By the looks of it, it will currently check all the files in a directory and create a new file with the entries minus the duplicates, but it won't be hard to make it accept paths to two or more files directly.
If you're having trouble converting the function to your needs, let us know.
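The linked function isn't reproduced here, so the following is only a guess at what a two-file version might look like, using a Scripting.Dictionary for constant-time lookups (untested; the output name b_clean.txt is made up for this sketch):

#include <File.au3>

; Load a.txt's lines as dictionary keys, then stream b.txt and keep
; only the lines the dictionary doesn't contain.
Local $oDict = ObjCreate("Scripting.Dictionary")
$oDict.CompareMode = 0 ; 0 = binary (case-sensitive) key comparison

Local $aA
_FileReadToArray("a.txt", $aA)
For $i = 1 To $aA[0]
    If Not $oDict.Exists($aA[$i]) Then $oDict.Add($aA[$i], 1)
Next

Local $hIn = FileOpen("b.txt", 0), $hOut = FileOpen("b_clean.txt", 2) ; 2 = overwrite
Local $sLine
While 1
    $sLine = FileReadLine($hIn)
    If @error Then ExitLoop
    If Not $oDict.Exists($sLine) Then FileWriteLine($hOut, $sLine)
WEnd
FileClose($hIn)
FileClose($hOut)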

#12 jaberwacky

    RegExp("\m/")

  • Active Members
  • 3,237 posts

Posted 24 March 2011 - 08:37 AM

Lol up early or late?


Thanks Mallie, I realized my error.

#13 jchd

    Whatever your capacity, resistance is futile.

  • MVPs
  • 5,378 posts

Posted 24 March 2011 - 08:44 AM

What I would do is along these lines:
1) Open the larger file and store some hash (e.g. MD5) of each line in a :memory: table of SQLite created with (md5 blob, source int), placing a constant in source (e.g. 1).
2) Create a unique index on the hash column.
3) Open the smaller file and, for each line, get its MD5 hash and try an insert on it (with source = 2). If the insert fails due to a duplicate, you know you have the line at hand in the first file as well.
4) Optionally delete from the table the rows with source = 2 if the distinct lines don't have to become part of the larger file.
5) Optionally use the SQLite backup API to obtain a disk-based database in case you can reuse it next time.

Why do it this way? If you have an index built on the "reference" MD5s, SQLite will use an O(K log N) algorithm to check for dups, _way_ better than the naïve O(K N) linear compare, where K = number of lines in the small file and N = number of lines in the large file.
I bet the time saved while checking for dups this way largely offsets the burden of computing the hash.
I also bet that computing the hash is still better than storing the raw text contents (but that may depend on how long/similar your lines are).
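An untested sketch of steps 1-3 with AutoIt's SQLite and Crypt UDFs (table and column names follow the outline above; transaction batching is omitted for brevity):

#include <SQLite.au3>
#include <Crypt.au3>

_SQLite_Startup()
Local $hDB = _SQLite_Open() ; no filename = :memory: database
_SQLite_Exec($hDB, "CREATE TABLE lines (md5 BLOB, source INT);")

; Step 1: hash every line of the larger file, source = 1
Local $hB = FileOpen("b.txt", 0), $sLine
While 1
    $sLine = FileReadLine($hB)
    If @error Then ExitLoop
    _SQLite_Exec($hDB, "INSERT INTO lines VALUES (X'" & _
            Hex(_Crypt_HashData($sLine, $CALG_MD5)) & "', 1);")
WEnd
FileClose($hB)

; Step 2: unique index, so duplicate hashes are rejected on insert
; (assumes b.txt has no internal duplicate lines; otherwise build the
; index first and let the duplicate inserts fail)
_SQLite_Exec($hDB, "CREATE UNIQUE INDEX idx_md5 ON lines (md5);")

; Step 3: a failed insert flags a line already present in the larger file
Local $hA = FileOpen("a.txt", 0)
While 1
    $sLine = FileReadLine($hA)
    If @error Then ExitLoop
    If _SQLite_Exec($hDB, "INSERT INTO lines VALUES (X'" & _
            Hex(_Crypt_HashData($sLine, $CALG_MD5)) & "', 2);") <> $SQLITE_OK Then
        ConsoleWrite("Duplicate line: " & $sLine & @CRLF)
    EndIf
WEnd
FileClose($hA)

_SQLite_Close($hDB)
_SQLite_Shutdown()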

Edit:
Depending on your actual problem statement details, using a Scripting.Dictionary should also work, but I don't know the limits on what it can hold.

Edit2:
@Tvern
:) I really didn't remember at first that I wrote that some time ago! The idea is the same anyway.

Edited by jchd, 24 March 2011 - 09:26 AM.


#14 michaelslamet

    Universalist

  • Active Members
  • 883 posts

Posted 25 March 2011 - 05:48 AM

Mallie, LaCastiglione, Tvern and jchd,
Thanks to all of you for your replies.

I think this is too complicated for me to code by myself :) I'm going to ask somebody expert on RentACoder/vworker to code it for me.

Many thanks for the input :)

#15 jchd

    Whatever your capacity, resistance is futile.

  • MVPs
  • 5,378 posts

Posted 25 March 2011 - 07:50 AM

Don't throw money away on that. Take the code Tvern pointed out (I had completely forgotten what I wrote a few months ago: that's age!) and, instead of looping through all the files in a directory, just do it with your two files.





