[SOLVED] RegEx for Unique Words in source


Hey,

Is this possible?

Creating a regex that strips out any words that repeat within the source it is run against.

For example:

$source="this is the end of the end for the end of the end is the beginning of the end with all hopes for the start of a new end"

After the regex runs, it would give me:

$result = "this is the end of for beginning with all hopes start a new"

Can regexes do that?

Edited by zackrspv

-_-------__--_-_-____---_-_--_-__-__-_ ^^€ñ†®øÞÿ ë×阮§ wï†høµ† ƒë@®, wï†høµ† †ïmë, @ñd wï†høµ† @ †ïmïdï†ÿ ƒø® !ïƒë. €×阮 ñø†, bµ† ïñ§†ë@d wï†hïñ, ñ@ÿ, †h®øµghøµ† †hë 맧ëñ§ë øƒ !ïƒë.

Oh, I may have it:

/\b(\w+)(\s+\1)+\b/i

hehe

Well, that wasn't it.

I did find this one:

\b(\S+)\b(\s+\1\b)+

It matches back-to-back repeats, as in "This is is the end end of the the end end", but that's still not quite what I need. The search continues...
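
For what it's worth, a lookahead can reach repeats that are not adjacent, at the cost of keeping the last occurrence of each word rather than the first. A minimal sketch, assuming AutoIt's PCRE-based regex engine and a short input; $sText and the pattern are illustrative and not taken from this thread:

```autoit
; Hypothetical sample input
Local $sText = "the end of the end is the beginning"
; Remove a word (and the whitespace after it) whenever the same whole word
; occurs again later in the text.
; (?i) = case-insensitive, (?s) = let . span line breaks.
Local $sResult = StringRegExpReplace($sText, "(?is)\b(\w+)\b\s*(?=.*\b\1\b)", "")
ConsoleWrite($sResult & @CRLF) ; prints "of end is the beginning"
```

The lookahead rescans the rest of the text for every word, so this is probably only reasonable for short inputs.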


We need a better description. Do you want to strip out any word AND its duplicates, only the duplicates of a word, or just any repeating pattern?

The middle option seems the most logical: it leaves the first occurrence intact and removes all the others.

Yep; I suppose I should have stated it better: I want to find only the unique words in a given source. No repeats whatsoever.

I've been looking all around the net, as you can see above, and still haven't figured it out. I've gotten close, haha, but no cigar, it seems.


No idea how you're wanting to do this, but can't you just do string manipulation?

Do a StringSplit on " " and then, for every element, append it plus a " " to a new result string; if the word already exists in the result, skip it.

While that would be doable, it would be very intensive. My sources may have 100 words or more, and repeatedly looping over those words would be... well, intensive.

I've seen regular expressions do this work before in JavaScript, for example, but I can't translate that over to what AutoIt supports; at least, I don't know how to. And a regex would be much simpler.

The whole point behind this is to take an entire article source and remove all duplicate words, leaving just the unique words, as a way of forming keywords automatically. I'd also remove simple words like 'and', 'for', 'is', 'the', etc. It's the best method I could think of to form the keywords for a full-text search that I'm adding to one of my applications.
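
The split-and-skip idea quoted above does not have to rescan the result string for every word. A minimal sketch, assuming the Windows Scripting.Dictionary COM object is available for fast "seen before?" checks; _UniqueWords is a hypothetical helper name, not a built-in or UDF function:

```autoit
; _UniqueWords is a hypothetical helper name used only for this sketch.
Func _UniqueWords($sText)
    ; Scripting.Dictionary is a Windows COM object that offers fast key lookups.
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    $oSeen.CompareMode = 1 ; 1 = text compare, so "End" and "end" count as the same word
    Local $aWords = StringSplit($sText, " ") ; $aWords[0] holds the number of pieces
    Local $sResult = ""
    For $i = 1 To $aWords[0]
        If $aWords[$i] = "" Then ContinueLoop ; skip empty pieces left by doubled spaces
        If Not $oSeen.Exists($aWords[$i]) Then
            $oSeen.Add($aWords[$i], 1) ; remember the word
            $sResult &= $aWords[$i] & " " ; first occurrence: keep it, in original order
        EndIf
    Next
    Return StringStripWS($sResult, 2) ; 2 = strip trailing whitespace
EndFunc

ConsoleWrite(_UniqueWords("this is the end of the end for the end of the end is the beginning of the end with all hopes for the start of a new end") & @CRLF)
; prints "this is the end of for beginning with all hopes start a new"
```

Unlike a sort-based approach, this keeps the words in their original order, which matches the example in the first post.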

So, I used the idea mentioned above and confirmed it is very slow: it takes almost 5x longer than without that procedure. I have to wonder whether a regex would just be faster than splitting the string, etc. But here's what I have so far:

$nOffset = 1
    While 1
        $array = StringRegExp($sHTML, '(?s)(?i)(?:<TD class=article_text>\<SPAN id=_ctl0_ArticleRepeater__ctl1_ArticleText>)(.*?)(?:\</span>\</td>)', 1, $nOffset)
        If @error = 0 Then
            $nOffset = @extended
        Else
            ExitLoop
        EndIf
        For $i = 0 To UBound($array) - 1
            $str = $array[$i]
            $str = StringRegExpReplace($str, "&#(.*?);", "")
            $str = StringRegExpReplace($str, "<(.*?)>", "")
            $str = StringRegExpReplace($str, "</(.*?)>", "")
            $str = StringRegExpReplace($str, "&(.*?);", "")
            $str = StringRegExpReplace($str, "\s", " ")
            $str = StringReplace($str, ",", " ")

            $remove = StringSplit("'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't,as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't,don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his,how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like,likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've,shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd,they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what,what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd,who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll,you're,you've,your", ",", 1)
            for $k = 1 to UBound($remove) - 1
                $str = StringRegExpReplace($str, "(?i)\b("&$remove[$k]&")\s", "")
            Next

        Next
        
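        ; Split the cleaned text on commas and spaces, then collect each word once.
        $unique = StringSplit($str, ", ", 0)
        ; Empty placeholder slots; _ArrayAdd appends real words after them, so a hit is always at index > 0.
        Dim $keywords[UBound($unique) - 1]
        For $dl = 1 To UBound($unique) - 1
            $ustr = $unique[$dl]
            If $ustr <> "" And $ustr <> " " Then
                ; _ArraySearch returns -1 when the word has not been stored yet.
                If _ArraySearch($keywords, $ustr) <= 0 Then _ArrayAdd($keywords, $ustr)
            EndIf
        Next
        $title = _ArrayToString($keywords, " ")
        ; Drop the run of spaces produced by the empty placeholder slots.
        $title = StringRegExpReplace($title, "\s\s+", "")
        ProgressSet($flvalue, "Completed: " & $flvalue & "%" & @CRLF & "Updated: " & $title)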
        $file = FileOpen($dbfile, 9)
        FileWrite($file, $id & ", " & $title & @CRLF)
        FileClose($file)

    WEnd
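
The stop-word loop above runs one regex replace per listed word. A single alternation might do the same job in one pass. A minimal sketch, assuming a shortened, illustrative stop-word list ($sStopWords, $sText and $sClean are hypothetical names); it relies on the entries containing only letters and apostrophes, so nothing needs regex escaping:

```autoit
; Short illustrative stop-word list; the full list from the script above would go here.
Local $sStopWords = "a,an,and,for,is,of,the"
; Turn the comma-separated list into a regex alternation: a|an|and|...
Local $sAlternation = StringReplace($sStopWords, ",", "|")
Local $sText = "the start of a new end"
; One pass: remove any listed word that is followed by whitespace or the end of the text.
Local $sClean = StringRegExpReplace($sText, "(?i)\b(?:" & $sAlternation & ")(?:\s+|$)", "")
ConsoleWrite($sClean & @CRLF) ; prints "start new end"
```

Requiring whitespace or end-of-text after the word, as the original loop does with \s, avoids stripping "can" out of "can't". Entries that begin with an apostrophe, such as 'tis, would still need separate handling because \b does not match in front of them at the start of a word.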
Edited by zackrspv


$string = "This this is is a a test test message message"
$result = StringRegExpReplace($string, "(?i)\b(\w+)(?:(\s+)\1\b)+", "$1")
ConsoleWrite($result & @CRLF)

Edited by weaponx

Well, that helped. But since it won't fix something like 'This is the end of the end is the end of the end', where the repeats are not adjacent, it's not quite what I was looking for. You did point me in the right direction, though:

#include <Array.au3>
$string = "This is the end of the end for the end of the end is the beginning of that start of the fourth end."
$keywords = StringSplit($string, " ", 0)
; Sort from index 1 so the element count in $keywords[0] stays in place
_ArraySort($keywords, 0, 1)
$string = _ArrayToString($keywords, " ", 1) ; start at $keywords[1] to keep the count out of the string
; Duplicates are now adjacent, so the adjacent-repeat regex collapses them (original word order is lost)
$result = StringRegExpReplace($string, "(?i)\b(\w+)(?:(\s+)\1\b)+", "$1")
ConsoleWrite($result & @CRLF)

It's much faster than the example I posted before, so I'm happy for the most part :)

Edited by zackrspv
