Jump to content
Sign in to follow this  
saracoth

String Similarity Percentage?

Recommended Posts

saracoth

I know there is no builtin function or pre-packaged UDF to do this, but does anyone know of any DLLs or other programs which can compare semi-long strings and return a percentage of similarity between the two? Or is there example code, maybe in another language, I could examine and convert to AutoIt?

If you want info as to why, the long version is below. The short version? Comparing sections of webpages to archives of sections from the same website.

I'm working on a script that will grab my often viewed webpages in the background, use tags and other things to grab sections out of it, and then give me a single local page with all the new content when I press a hotkey.

I'm not having many problems with that so far, though I need to implement a number of other modes to catch all my webpages. Right now it can only grab things within specific tags, like tags starting with table width="100%" cellpadding="5" on LiveJournal sites. I'm testing it on the blogs I regularly visit and I already feel like it was worth writing those hundreds of lines of code.

However, there are minor issues outside my program. I get re-runs of old entries just because "add comment" has changed to "1 comments" or something. Plus, oddly enough, one friend's blog had four different posts ending in a <3 on one pass, but the final <3 of each was gone the next check (without her editing anything and despite my having downloaded a complete page each time!).

To avoid little things like that, I'm hoping to be able to compare sections of text and get a percentage of similarity. Give the ability to specify a percentage threshold at which sections are considered "new." I could just compare line-by-line and get a percentage of lines that are the same and in the same order (3rd line in A same as 3rd line in B, etc.), but I know functions like this exist in other languages.

I'm already liking the script so far. Even if I have to skip a few re-runs every couple of days, I'm still saving time since I'm on dialup and very few old entries are not getting repeated anyway.

Share this post


Link to post
Share on other sites
jdickens

This is not a trivial algorithm. There are several ways of interpreting similarity.

J


If I am too verbose, just say so. You don't need to run on and on.

Share this post


Link to post
Share on other sites
Blue_Drache

This is not a trivial algorithm.  There are several ways of interpreting similarity.

J

<{POST_SNAPBACK}>

I think RegExp will have to be the way to go.

Lofting the cyberwinds on teknoleather wings, I am...The Blue Drache

Share this post


Link to post
Share on other sites
saracoth

I'm open to new experiences. If it'd only take a day's work to implement something, I won't think twice about doing it.

Anything more.... Well, I'd still look into it for sake of understanding :)

http://isp.imm.dtu.dk/thor/projects/multim...ning/node5.html and http://pi0959.kub.nl/Paai/Onderw/V-I/Conte...milarities.html are an interesting reads, for instance.

Also, I'm not sure how regular expressions would help. At least I didn't come up with anything on a Google search. I'm not sure how to use regular expressions to compare two arbitrary pieces of text without prior preparation for each comparison :(

For now, I'm working on the <3 problem by just using replace functions to delete them all. The comments thing is bearable, but I hate writing large-scale programs half-arsed :(

Thanks for your replies, though! They both led me to some interesting times :)

Share this post


Link to post
Share on other sites
saracoth

Here is a Perl module:

http://search.cpan.org/~tareka/String-Trigram-0.1/

<{POST_SNAPBACK}>

That's looking pretty interesting, actually. Thanks!

cpan was down, but I found a copy at http://mirrors.develooper.com/perl/backpan...id/T/TA/TAREKA/ to look at. Not sure what kind of performance I'll be looking at, but I'm happy with turning the similarity feature off by default and on for individual sites via .ini entry.

Now I just have the other 10 things on that script's to do list to worry about.... :)

Share this post


Link to post
Share on other sites
Guest checkist

Fortunately, I attended 'Information Retrieval' class last semester.

so I may help (or not)

Let's start with very simple (yet not appliable) solution.

Compare each string's character-by-character and count the difference

for example.

the end of the century

the end of the centuri

differs only last character. so similarity is 21/22 * 100

Problems? many. let us consider following example

the end of the century

th end of the century

In our view (Human's view), it only differs one. (second string misses 'e' at 3rd place)

but with our very simplified algorithm, it returns 2/22 * 100 similarity.

(Because we compared the strings character-by-character, only first 2 character matches)

So, let us define following differences:

insertion : target string has extra character at position i.

deletion : source string has extra character at position i.

replace : target string's ith character and source string's ith character differs

Examples : (suppose i = 1~...)

the end of the century

tshe end of the century

= insertion[2]

= Similarity = 21/22 (1 error)

the end of the century

tsheend of the century

= insertion[2], deletion[4]

= Similarity = 20/22 (2 error)

the end of the century

the end of the centuri

= replace[22]

= Similarity = 21/22 (1 error)

Now things seem to be pretty reasonable. but how do we implement it?

1. Brute-force!

Enumerate all possible combination

(ex. 'insertion[1]-match-match-deletion[0]'

'match-deletion[2]-match-insertion[4]'

......)

and find the least error.

(Easy to implement. may consume resources RAPIDLY)

2. Use dynamic-programming method

More elegant method than above. which I forgot :)

Anyway, this method uses only (length of source string * length of target string)

size array, and fast.

Search the net for dynamic-programming, or ask another people (sorry)

Edited by checkist

Share this post


Link to post
Share on other sites
saracoth

Fortunately, I attended 'Information Retrieval' class last semester.

so I may help (or not)

You did, actually. Very much so. I'd thought about the first method myself and also decided it would be too resource intensive. But the second.... Wow. I'm getting a lot of good info on dynamic programming from searches!

For a math/logic freak like me, reading this is almost as exciting as pr0n! :)

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

×