saracoth Posted March 24, 2005 Share Posted March 24, 2005 I know there is no builtin function or pre-packaged UDF to do this, but does anyone know of any DLLs or other programs which can compare semi-long strings and return a percentage of similarity between the two? Or is there example code, maybe in another language, I could examine and convert to AutoIt? If you want info as to why, the long version is below. The short version? Comparing sections of webpages to archives of sections from the same website. I'm working on a script that will grab my often viewed webpages in the background, use tags and other things to grab sections out of it, and then give me a single local page with all the new content when I press a hotkey. I'm not having many problems with that so far, though I need to implement a number of other modes to catch all my webpages. Right now it can only grab things within specific tags, like tags starting with table width="100%" cellpadding="5" on LiveJournal sites. I'm testing it on the blogs I regularly visit and I already feel like it was worth writing those hundreds of lines of code. However, there are minor issues outside my program. I get re-runs of old entries just because "add comment" has changed to "1 comments" or something. Plus, oddly enough, one friend's blog had four different posts ending in a <3 on one pass, but the final <3 of each was gone the next check (without her editing anything and despite my having downloaded a complete page each time!). To avoid little things like that, I'm hoping to be able to compare sections of text and get a percentage of similarity. Give the ability to specify a percentage threshold at which sections are considered "new." I could just compare line-by-line and get a percentage of lines that are the same and in the same order (3rd line in A same as 3rd line in B, etc.), but I know functions like this exist in other languages. I'm already liking the script so far. Even if I have to skip a few re-runs every couple of days, I'm still saving time since I'm on dialup and very few old entries are not getting repeated anyway. Link to comment Share on other sites More sharing options...
jdickens Posted March 28, 2005 Share Posted March 28, 2005 This is not a trivial algorithm. There are several ways of interpreting similarity. J If I am too verbose, just say so. You don't need to run on and on. Link to comment Share on other sites More sharing options...
Blue_Drache Posted March 28, 2005 Share Posted March 28, 2005 This is not a trivial algorithm. There are several ways of interpreting similarity.J<{POST_SNAPBACK}>I think RegExp will have to be the way to go. Lofting the cyberwinds on teknoleather wings, I am...The Blue Drache Link to comment Share on other sites More sharing options...
saracoth Posted March 28, 2005 Author Share Posted March 28, 2005 I'm open to new experiences. If it'd only take a day's work to implement something, I won't think twice about doing it.Anything more.... Well, I'd still look into it for sake of understanding http://isp.imm.dtu.dk/thor/projects/multim...ning/node5.html and http://pi0959.kub.nl/Paai/Onderw/V-I/Conte...milarities.html are an interesting reads, for instance.Also, I'm not sure how regular expressions would help. At least I didn't come up with anything on a Google search. I'm not sure how to use regular expressions to compare two arbitrary pieces of text without prior preparation for each comparison For now, I'm working on the <3 problem by just using replace functions to delete them all. The comments thing is bearable, but I hate writing large-scale programs half-arsed Thanks for your replies, though! They both led me to some interesting times Link to comment Share on other sites More sharing options...
steveR Posted March 28, 2005 Share Posted March 28, 2005 (edited) Here is a Perl module:http://search.cpan.org/~tareka/String-Trigram-0.1/ Edited March 28, 2005 by steveR AutoIt3 online docs Use it... Know it... Live it...MSDN libraryglobal Help and SupportWindows: Just another pane in the glass. Link to comment Share on other sites More sharing options...
saracoth Posted March 29, 2005 Author Share Posted March 29, 2005 Here is a Perl module:http://search.cpan.org/~tareka/String-Trigram-0.1/<{POST_SNAPBACK}>That's looking pretty interesting, actually. Thanks!cpan was down, but I found a copy at http://mirrors.develooper.com/perl/backpan...id/T/TA/TAREKA/ to look at. Not sure what kind of performance I'll be looking at, but I'm happy with turning the similarity feature off by default and on for individual sites via .ini entry.Now I just have the other 10 things on that script's to do list to worry about.... Link to comment Share on other sites More sharing options...
Guest checkist Posted March 30, 2005 Share Posted March 30, 2005 (edited) Fortunately, I attended 'Information Retrieval' class last semester. so I may help (or not)Let's start with very simple (yet not appliable) solution. Compare each string's character-by-character and count the difference for example. the end of the century the end of the centuri differs only last character. so similarity is 21/22 * 100Problems? many. let us consider following example the end of the century th end of the century In our view (Human's view), it only differs one. (second string misses 'e' at 3rd place)but with our very simplified algorithm, it returns 2/22 * 100 similarity.(Because we compared the strings character-by-character, only first 2 character matches)So, let us define following differences:insertion : target string has extra character at position i.deletion : source string has extra character at position i.replace : target string's ith character and source string's ith character differsExamples : (suppose i = 1~...) the end of the century tshe end of the century = insertion[2]= Similarity = 21/22 (1 error) the end of the century tsheend of the century = insertion[2], deletion[4]= Similarity = 20/22 (2 error) the end of the century the end of the centuri = replace[22]= Similarity = 21/22 (1 error)Now things seem to be pretty reasonable. but how do we implement it?1. Brute-force! Enumerate all possible combination (ex. 'insertion[1]-match-match-deletion[0]' 'match-deletion[2]-match-insertion[4]' ......) and find the least error. (Easy to implement. may consume resources RAPIDLY)2. Use dynamic-programming method More elegant method than above. which I forgot Anyway, this method uses only (length of source string * length of target string) size array, and fast. Search the net for dynamic-programming, or ask another people (sorry) Edited March 30, 2005 by checkist Link to comment Share on other sites More sharing options...
saracoth Posted March 31, 2005 Author Share Posted March 31, 2005 Fortunately, I attended 'Information Retrieval' class last semester. so I may help (or not)You did, actually. Very much so. I'd thought about the first method myself and also decided it would be too resource intensive. But the second.... Wow. I'm getting a lot of good info on dynamic programming from searches!For a math/logic freak like me, reading this is almost as exciting as pr0n! Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now