What I would do is along these lines (a rough sketch in code follows the list):
1) open the larger file and store a hash (e.g. MD5) of each line in a SQLite :memory: table created as (md5 blob, source int), putting a constant in source (e.g. 1)
2) create a unique index on the hash column
3) open the smaller file and, for each line, compute its MD5 hash and try to insert it (with source = 2). If the insert fails as a duplicate, you know the line at hand also exists in the first file.
4) optionally delete the rows with source = 2 from the table if the distinct lines don't have to become part of the larger file.
5) optionally use the SQLite backup API to obtain a disk-based copy of the database in case you can reuse it next time.
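Here is a minimal sketch of those steps in Python, using only the standard sqlite3 and hashlib modules; the file names large.txt and small.txt are just placeholders for your two files:

import hashlib
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hashes (md5 BLOB, source INT)")
con.execute("CREATE UNIQUE INDEX idx_md5 ON hashes (md5)")

# step 1: hash every line of the larger file, tagged with source = 1
with open("large.txt", "rb") as f:
    for line in f:
        con.execute("INSERT OR IGNORE INTO hashes VALUES (?, 1)",
                    (hashlib.md5(line).digest(),))

# step 3: try to insert each line of the smaller file with source = 2;
# a uniqueness violation means the line already exists in the larger file
duplicates = []
with open("small.txt", "rb") as f:
    for line in f:
        try:
            con.execute("INSERT INTO hashes VALUES (?, 2)",
                        (hashlib.md5(line).digest(),))
        except sqlite3.IntegrityError:
            duplicates.append(line)

# step 4 (optional): drop the smaller file's rows again
con.execute("DELETE FROM hashes WHERE source = 2")

# step 5 (optional): persist the in-memory database to disk for reuse
disk = sqlite3.connect("hashes.db")
con.backup(disk)
disk.close()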
Why do it this way? With an index built on the "reference" MD5s, SQLite checks for dups in O(K log N), _way_ better than the naïve O(K·N) linear compare, where K = number of lines in the small file and N = number of lines in the large file.
I bet the time saved checking for dups this way largely offsets the cost of computing the hashes.
I also bet that hashing is still better than storing the raw text contents (though that may depend on how long/similar your lines are).
Depending on the details of your actual problem, using a scripting dictionary should also work, but I don't know how many entries it can hold; a rough equivalent is sketched below.
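For comparison, the dictionary idea would look roughly like this (again with placeholder file names), keeping every hash from the larger file in process memory:

import hashlib

# hash every line of the larger file into an in-memory set
with open("large.txt", "rb") as f:
    seen = {hashlib.md5(line).digest() for line in f}

# any line of the smaller file whose hash is already in the set is a dup
with open("small.txt", "rb") as f:
    duplicates = [line for line in f if hashlib.md5(line).digest() in seen]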
I honestly didn't remember at first that I wrote that some time ago! The idea is the same anyway.