michaelslamet Posted March 24, 2011 (edited)
Hi, I have a text file named a.txt which consists of hundreds or thousands of lines. I have another text file named b.txt which consists of hundreds of thousands (and growing to millions) of lines. I would like to scan every single line of a.txt and compare it with every single line in b.txt. If the line is present in b.txt, delete that line from b.txt, so that no line is duplicated between a.txt and b.txt. What is the best way to do this, given that b.txt contains hundreds of thousands of lines? I'm afraid that if I don't do it the right way, it will take forever to process. Thanks a lot
Edited March 24, 2011 by michaelslamet
Mallie99 Posted March 24, 2011
What have you tried so far? Have you got anything currently in place?
michaelslamet Posted March 24, 2011
Hi Mallie, thanks for your reply. I haven't coded anything yet. I think if I do it "conventionally" it will be simple: just read every single line in a.txt and compare it with every single line in b.txt. But I think it will take forever to finish. Am I wrong? I'm new to AutoIt.
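[Editor's note: for scale, the "conventional" nested-loop approach described here might look like the sketch below. This is illustrative only, not code from the thread; variable names are invented, and with K lines in a.txt and N lines in b.txt it performs on the order of K x N string comparisons, which is exactly why the posters worry about runtime.]

#include <File.au3>

Local $aA, $aB
_FileReadToArray("a.txt", $aA) ; element [0] holds the line count
_FileReadToArray("b.txt", $aB)

Local $sKeep = "", $bDup
For $i = 1 To $aB[0]
    $bDup = False
    For $j = 1 To $aA[0]
        If $aB[$i] == $aA[$j] Then ; == forces a case-sensitive string compare
            $bDup = True
            ExitLoop
        EndIf
    Next
    If Not $bDup Then $sKeep &= $aB[$i] & @CRLF ; keep only lines absent from a.txt
Next

FileDelete("b.txt")
FileWrite("b.txt", $sKeep) ; rewrite b.txt without the shared lines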
Mallie99 Posted March 24, 2011
You're also pretty much limited by how much you can load into memory at any one time. Are these text files in any kind of order? Can they be sorted? What kind of data do they contain?
michaelslamet Posted March 24, 2011
If I use _FileReadToArray, can it handle/store up to a hundred thousand or a million lines in an array?
michaelslamet Posted March 24, 2011
Hi Mallie, those lines are not sorted; let's assume they're not sorted in any way. Both files consist mostly of numbers, and each line is no more than 40-50 characters. Thanks Mallie.
Mallie99 Posted March 24, 2011
Of that I have no idea. In a perfect world your only limitation will be your available memory. You should be able to read the data into the array in chunks. For example:

If $FirstBitOfLineInFileB = $FirstBitOfLineInFileA Then $aArray[$x] = $LineInFileB

Then you have a smaller dataset; you can use something similar to a bubble sort or _ArrayNaturalSort() to organise it, and that SHOULD make the process more "bite-sized". However, you'll have a lot of file access on your hands.
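[Editor's note: one way to read that prefiltering idea is sketched below. It is not Mallie99's actual code; the one-character "first bit" test and the variable names are purely illustrative. The point is to shrink b.txt to a candidate set before doing any expensive sorting or comparing.]

#include <File.au3>

; Collect the set of leading characters that occur in a.txt...
Local $aA, $sPrefixes = ""
_FileReadToArray("a.txt", $aA)
For $i = 1 To $aA[0]
    If Not StringInStr($sPrefixes, StringLeft($aA[$i], 1), 1) Then $sPrefixes &= StringLeft($aA[$i], 1)
Next

; ...then keep only the b.txt lines that start with one of them,
; giving a much smaller dataset to sort and compare properly.
Local $aB, $sCandidates = ""
_FileReadToArray("b.txt", $aB)
For $i = 1 To $aB[0]
    If StringInStr($sPrefixes, StringLeft($aB[$i], 1), 1) Then $sCandidates &= $aB[$i] & @CRLF
Next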
Mallie99 Posted March 24, 2011
(Quoting michaelslamet: "Those lines are not sorted... Both files consist mostly of numbers, each line no more than 40-50 characters.")
First thing for you to do, then, is to try loading your million-and-one-line file into an array. Next will be to apply some sort of sorting mechanism to it while it's in memory. This will give you an idea of the time and processor impact this action will have. I'd recommend your script take a "Start Time" and an "End Time" so you can accurately measure the time taken. Keep an eye on Task Manager while it's running for the memory and processor impact. If this works then you're a step closer to getting this working.
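[Editor's note: the timing part can be as simple as AutoIt's TimerInit()/TimerDiff() pair, as in this minimal sketch. _ArraySort is the stock Array.au3 sort; the file name comes from the thread.]

#include <File.au3>
#include <Array.au3>

Local $hTimer = TimerInit() ; "Start Time"

Local $aLines
_FileReadToArray("b.txt", $aLines)
_ArraySort($aLines, 0, 1) ; sort ascending in place, skipping the count in [0]

; "End Time": TimerDiff returns elapsed milliseconds since TimerInit
ConsoleWrite("Loaded and sorted " & $aLines[0] & " lines in " & Round(TimerDiff($hTimer) / 1000, 2) & " s" & @CRLF)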
jaberwacky Posted March 24, 2011 (edited)
Assume that you have two text files: A.txt and B.txt. A.txt has 1,000 lines and B.txt has 1,000,000 lines. If you compare the first line in A.txt to all 1,000,000 lines in B.txt, well, that's a million comparisons. And then you take line two from A.txt and compare that to all 1,000,000 lines in B.txt, and that's 1,000,000,000 comparisons. And so on. It is 4:30 in the morning, but I think my reasoning is sound. I did stay at a Howard Johnson's. No, no, no, scrap all of that. OK, are we to assume that a.txt has a fixed number of lines? If so, then after we consider who will be using your program (I assume it's just you), will you have enough memory to load a.txt into an array? If b.txt grows in size then you'll have to read it off disk. Those are a few questions that will have to be answered before you can continue.
Edited March 24, 2011 by LaCastiglione
Mallie99 Posted March 24, 2011
(Quoting jaberwacky: "It is 4:30 in the morning but I think my reasoning is sound.")
Lol, up early or late? Don't forget, with sorted datasets you only need to compare up to a point. For example, "Garage" comes after "Basement" in the dictionary, so you can ExitLoop or ContinueLoop as soon as $Val_A > $Val_B.
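[Editor's note: that early-exit observation can be taken further: with both files sorted, a single merge-style pass finds every duplicate without ever rescanning. The sketch below is an editorial illustration, not code from the thread; it assumes each file has no internal duplicates, and StringCompare requires a reasonably recent AutoIt (with older versions the < and > operators may compare strings numerically).]

#include <File.au3>
#include <Array.au3>

Local $aA, $aB
_FileReadToArray("a.txt", $aA)
_FileReadToArray("b.txt", $aB)
_ArraySort($aA, 0, 1) ; both arrays sorted ascending, skipping the count in [0]
_ArraySort($aB, 0, 1)

; Walk both sorted arrays once: whichever side is "behind" advances,
; so each line of each file is examined exactly once.
Local $i = 1, $j = 1, $iCmp
While $i <= $aA[0] And $j <= $aB[0]
    $iCmp = StringCompare($aA[$i], $aB[$j], 1) ; case-sensitive compare
    If $iCmp = 0 Then
        ConsoleWrite("Duplicate: " & $aA[$i] & @CRLF)
        $i += 1
        $j += 1
    ElseIf $iCmp < 0 Then
        $i += 1 ; $aA[$i] has been passed in b.txt; no later line can match it
    Else
        $j += 1
    EndIf
WEnd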
Tvern Posted March 24, 2011
By the looks of it, it will currently check all files in a directory and create a new file with the entries minus duplicates, but it won't be hard to make it accept a path to two, or more, files directly. If you're having trouble converting the function to your needs, let us know.
jaberwacky Posted March 24, 2011
(Quoting Mallie99: "Lol, up early or late?")
Thanks Mallie, I realized my error.
jchd Posted March 24, 2011 (edited)
What I would do is along these lines:
1) Open the larger file and store some hash (e.g. MD5) of each line in a :memory: table of SQLite created with (md5 blob, source int), placing a constant in source (e.g. 1).
2) Create a unique index on the hash column.
3) Open the smaller file and, for each line, get its MD5 hash and try an insert on it (with source = 2). If the insert fails due to a duplicate, you know you have the line at hand in the first file as well.
4) Optionally delete from the table the rows with source = 2 if the distinct lines don't have to become part of the larger file.
5) Optionally use SQLite backup to obtain a disk-based database, in case you can reuse it next time.
Why do it this way? If you have an index built on the "reference" MD5s, SQLite will use an O(K log N) algorithm to check for dups, _way_ better than the naive O(K N) linear compare, where K = small file # lines and N = large file # lines. I bet the time saved while checking for dups this way largely offsets the burden of computing the hash. I also bet that computing the hash is still better than storing raw text contents (but that may depend on how long/similar your lines are).
Edit: Depending on your actual problem statement details, using a scripting dictionary should also work, but I don't know the limits that can hold.
Edit2: @Tvern I really didn't remember at once that I wrote that some time ago! The idea is the same anyway.
Edited March 24, 2011 by jchd
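[Editor's note: a rough sketch of those five steps follows. It is not jchd's actual code; variable names are invented, it stores the hash as a hex text string rather than a blob for simplicity, and it assumes the SQLite and Crypt UDFs that ship with AutoIt.]

#include <File.au3>
#include <Crypt.au3>
#include <SQLite.au3>

_SQLite_Startup()
_Crypt_Startup()
Local $hDB = _SQLite_Open() ; a ":memory:" database by default
_SQLite_Exec($hDB, "CREATE TABLE t (md5 TEXT, source INT);")
_SQLite_Exec($hDB, "CREATE UNIQUE INDEX idx ON t (md5);") ; step 2: unique index

; Steps 1-2: hash every line of the larger file, tagged source = 1
Local $aB, $sHash
_FileReadToArray("b.txt", $aB)
For $i = 1 To $aB[0]
    $sHash = String(_Crypt_HashData($aB[$i], $CALG_MD5)) ; hex string like "0x9E10..."
    _SQLite_Exec($hDB, "INSERT INTO t VALUES (" & _SQLite_FastEscape($sHash) & ", 1);")
Next

; Step 3: try inserting each line of the smaller file with source = 2;
; a failed insert means the unique index rejected a duplicate hash,
; i.e. the line exists in b.txt as well.
Local $aA
_FileReadToArray("a.txt", $aA)
For $i = 1 To $aA[0]
    $sHash = String(_Crypt_HashData($aA[$i], $CALG_MD5))
    If _SQLite_Exec($hDB, "INSERT INTO t VALUES (" & _SQLite_FastEscape($sHash) & ", 2);") <> $SQLITE_OK Then
        ConsoleWrite("Present in both files: " & $aA[$i] & @CRLF)
    EndIf
Next

; Step 4 (optional): _SQLite_Exec($hDB, "DELETE FROM t WHERE source = 2;")
_SQLite_Close($hDB)
_SQLite_Shutdown()
_Crypt_Shutdown()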
michaelslamet Posted March 25, 2011
Mallie, LaCastiglione, Tvern and jchd: thanks to all of you for your replies. I think this is too complicated for me to code by myself, so I'm going to ask somebody expert on RentACoder/vworker to code it for me. Many thanks for the input.
jchd Posted March 25, 2011
Don't throw money away on that. Take the code Tvern pointed out (I completely forgot what I wrote a few months ago: that's age!) and, instead of looping through all the files in a directory, just do it with your two files.