[SOLVED] RegEx for Unique Words in source

zackrspv · April 3, 2008

Hey,

Is this possible:

Creating a regex that strips out any words that are repeated in the source it is run against.

For example:

$source="this is the end of the end for the end of the end is the beginning of the end with all hopes for the start of a new end"

After the regex would give me:

$result = "this is the end of for beginning with all hopes start a new"

Can regexes do that?

Edited April 3, 2008 by zackrspv

zackrspv · April 3, 2008

Oh, I may have it:

/\b(\w+)(\s+\1)+\b/i

hehe

zackrspv · April 3, 2008

Well, that wasn't it.

I did find this one:

\b(\S+)\b(\s+\1\b)+

It collapses back-to-back repeats, as in "This is is the end end of the the end end", but it leaves repeats that are not adjacent, so it's still not quite what I need. The search continues...
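
For illustration, a small AutoIt sketch of what that pattern does when used with StringRegExpReplace. The variable names and the "$1" replacement string are assumptions; the post only gives the pattern itself.

```autoit
; Collapse each run of back-to-back repeats down to its first word
$sText = "This is is the end end of the the end end"
$sOut = StringRegExpReplace($sText, "\b(\S+)\b(\s+\1\b)+", "$1")
ConsoleWrite($sOut & @CRLF) ; This is the end of the end
```

"the" and "end" each still appear twice in the output, because the \1 backreference placed right after \s+ only catches a repeat that immediately follows the word.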

weaponx · April 3, 2008

We need a better description. Do you want to strip out any word AND its duplicates, any duplicates of a word, or just any repeating pattern?

The middle option seems the most logical: it leaves the first occurrence intact and removes all others.

zackrspv · April 3, 2008

Yep; I suppose I should have stated it better: I want to find only the unique words in a given source. No repeats whatsoever.

I've been looking all around the net, as you can see above, and still haven't figured it out; I've gotten close, haha, but no cigar it seems.

sherkas · April 3, 2008

No idea how you want to do this, but can't you just do string manipulation?

Do a StringSplit on " ", then append each element to a new result string followed by " ", skipping any word that already exists in it.
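
A rough sketch of this split-and-skip idea in AutoIt. It assumes Windows' Scripting.Dictionary COM object is available to remember which words have already been seen (the post does not mention it), and _UniqueWords is a hypothetical helper name.

```autoit
; Sketch: order-preserving de-duplication using a dictionary of seen words.
; Assumes the Scripting.Dictionary COM object (Windows) can be created.
Func _UniqueWords($sText)
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    $oSeen.CompareMode = 1 ; 1 = text compare, so "The" and "the" count as the same word
    Local $aWords = StringSplit($sText, " ")
    Local $sResult = ""
    For $i = 1 To $aWords[0]
        If $aWords[$i] = "" Then ContinueLoop ; skip gaps left by repeated spaces
        If Not $oSeen.Exists($aWords[$i]) Then
            $oSeen.Add($aWords[$i], 1) ; remember the word
            $sResult &= $aWords[$i] & " " ; keep its first occurrence
        EndIf
    Next
    Return StringStripWS($sResult, 2) ; 2 = strip trailing whitespace
EndFunc

ConsoleWrite(_UniqueWords("this is the end of the end for the end of the end is the beginning of the end with all hopes for the start of a new end") & @CRLF)
```

With the sample string from the first post, this should print "this is the end of for beginning with all hopes start a new". The dictionary lookup avoids rescanning the result string for every word.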

zackrspv · April 3, 2008

While that would be doable, it would be very intensive. My sources may have 100 words or more, and repeatedly looping over those words would be... well, intensive.

I've seen regular expressions do this in JavaScript, for example, but I can't translate that to what AutoIt supports; at least I don't know how to. And a regex would be much simpler.

The whole point is to take an entire article source and remove all duplicate words, leaving just the unique words, as a way of forming keywords automatically. I'd also remove simple words like 'and', 'for', 'is', 'the', etc. It's the best method I could think of to build keywords for a full-text search that I'm adding to one of my applications.

zackrspv · April 3, 2008

So, I used the idea mentioned above and confirmed it is very slow: it takes almost 5x longer than without that procedure. I have to wonder whether a regex would simply be faster than splitting the string, etc. But here's what I have so far:

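#include <Array.au3> ; needed for _ArraySearch, _ArrayAdd and _ArrayToString
; $sHTML, $flvalue, $dbfile and $id are assumed to be set elsewhere in the script
$nOffset = 1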
    While 1
        $array = StringRegExp($sHTML, '(?s)(?i)(?:<TD class=article_text>\<SPAN id=_ctl0_ArticleRepeater__ctl1_ArticleText>)(.*?)(?:\</span>\</td>)', 1, $nOffset)
        If @error = 0 Then
            $nOffset = @extended
        Else
            ExitLoop
        EndIf
        For $i = 0 To UBound($array) - 1
            $str = $array[$i]
            $str = StringRegExpReplace($str, "&#(.*?);", "")
            $str = StringRegExpReplace($str, "<(.*?)>", "")
            $str = StringRegExpReplace($str, "</(.*?)>", "")
            $str = StringRegExpReplace($str, "&(.*?);", "")
            $str = StringRegExpReplace($str, "\s", " ")
            $str = StringReplace($str, ",", " ")

            $remove = StringSplit("'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't,as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't,don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his,how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like,likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've,shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd,they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what,what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd,who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll,you're,you've,your", ",", 1)
            for $k = 1 to UBound($remove) - 1
                $str = StringRegExpReplace($str, "(?i)\b("&$remove[$k]&")\s", "")
            Next

        Next
        
            ; Split on spaces/commas, then keep only the first copy of each word.
            ; $keywords starts out with empty slots, so a real match is always at an index > 0.
            $unique = StringSplit($str, ", ", 0)
            Dim $keywords[UBound($unique) - 1]
            For $dl = 1 To UBound($unique) - 1
                $ustr = $unique[$dl]
                If $ustr <> "" And $ustr <> " " Then
                    If Not (_ArraySearch($keywords, $ustr) > 0) Then _ArrayAdd($keywords, $ustr)
                EndIf
            Next
        $title = _ArrayToString($keywords, " ")
        ; Drop the run of spaces left by the empty leading slots
        $title = StringRegExpReplace($title, "\s\s+", "")
        ProgressSet($flvalue, "Completed: " & $flvalue & "%" & @CRLF & "Updated: " & $title)
        $file = FileOpen($dbfile, 9)
        FileWrite($file, $id & ", " & $title & @CRLF)
        FileClose($file)

    WEnd

Edited April 3, 2008 by zackrspv

weaponx · April 3, 2008

$string = "This this is is a a test test message message"
$result = StringRegExpReplace($string, "(?i)\b(\w+)(?:(\s+)\1\b)+", "$1")
ConsoleWrite($result & @CRLF)

Edited April 3, 2008 by weaponx

zackrspv · April 3, 2008

Well, that helped. It won't fix something like 'This is the end of the end is the end of the end', so it's not quite what I was looking for, but you did point me in the right direction:

#include <Array.au3>
$string = "This is the end of the end for the end of the end is the beginning of that start of the fourth end."
$keywords = StringSplit($string, " ", 0)
_ArraySort($keywords, 0, 1) ; sort from index 1 so the count in $keywords[0] stays put and duplicates end up adjacent
$string = _ArrayToString($keywords, " ", 1) ; start at $keywords[1] to avoid the count in the string
$result = StringRegExpReplace($string, "(?i)\b(\w+)(?:(\s+)\1\b)+", "$1") ; collapse each run of repeats to one word
ConsoleWrite($result & @CRLF) ; unique words, in sorted order rather than the original order

It's much faster than the example I posted before, so I'm happy for the most part.
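
Sorting loses the original word order. If order matters, one possible alternative (a sketch, not something from this thread) is a lookahead that deletes a word whenever the same word shows up again later in the text. Note that it keeps the last copy of each word rather than the first, and the lookahead rescans the rest of the string for every word, so it may be slow on long articles.

```autoit
; Sketch: remove a word (plus trailing whitespace) if the same word occurs again later.
; (?i) = case-insensitive, (?s) = let "." cross line breaks.
$sText = "This is the end of the end is the end of the end"
$sOut = StringRegExpReplace($sText, "(?is)\b(\w+)\b\s*(?=.*\b\1\b)", "")
ConsoleWrite($sOut & @CRLF) ; This is of the end
```

Because only the final occurrence of each word survives, the output order follows the last appearances, not the first ones.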

Edited April 3, 2008 by zackrspv
