zackrspv Posted April 3, 2008 (edited)

Hey, is this possible: creating a regex that strips out any words that repeat themselves in the source it is run against?

For example:

$source = "this is the end of the end for the end of the end is the beginning of the end with all hopes for the start of a new end"

After the regex, this would give me:

$result = "this is the end of for beginning with all hopes start a new"

Can regexes do that?

Edited April 3, 2008 by zackrspv

-_-------__--_-_-____---_-_--_-__-__-_ ^^€ñ†®øÞÿ ë×阮§ wï†høµ† ƒë@®, wï†høµ† †ïmë, @ñd wï†høµ† @ †ïmïdï†ÿ ƒø® !ïƒë. €×阮 ñø†, bµ† ïñ§†ë@d wï†hïñ, ñ@ÿ, †h®øµghøµ† †hë 맧ëñ§ë øƒ !ïƒë.

zackrspv (Author) Posted April 3, 2008

Oh, I may have it:

/\b(\w+)(\s+\1)+\b/i

hehe

zackrspv (Author) Posted April 3, 2008

Well, that wasn't it. I did find this one:

\b(\S+)\b(\s+\1\b)+

It will clean up things like "This is is the end end of the the end end", but that's still not quite what I need. The search continues...

weaponx Posted April 3, 2008

We need a better description. Do you want to strip out any word AND its duplicates, only the duplicates of a word, or just any repeating pattern?

The middle option seems the most logical: it leaves the first occurrence intact and destroys all others.

zackrspv (Author) Posted April 3, 2008

Yep; I suppose I should have stated it better: I want to find only the unique words in a given source. No repeats whatsoever.

I've been looking all around the net, as you can see above, and still haven't figured it out. I've gotten close, haha, but no cigar, it seems.

sherkas Posted April 3, 2008

No idea how you want to do this, but can't you just use string manipulation?

Do a StringSplit on " ", then append each element to a new result string followed by " ", skipping any word that already exists in the result.
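
The split-and-skip idea above can be sketched roughly as follows. This is only a sketch, assuming words are separated by spaces and that matching should ignore case. The function name _UniqueWords is made up for illustration, and the Scripting.Dictionary COM object (a Windows scripting component, not part of AutoIt itself) is an assumption swapped in so that each lookup does not rescan the result string.

```autoit
; Sketch: keep the first occurrence of each word and drop later repeats.
; _UniqueWords is a hypothetical helper name, not an AutoIt built-in.
Func _UniqueWords($sText)
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    $oSeen.CompareMode = 1 ; 1 = text compare, so "The" and "the" count as the same word

    ; Flag 7 = strip leading, trailing, and doubled whitespace before splitting
    Local $aWords = StringSplit(StringStripWS($sText, 7), " ")
    Local $sResult = ""

    For $i = 1 To $aWords[0] ; element 0 holds the number of pieces
        If $aWords[$i] = "" Then ContinueLoop
        If Not $oSeen.Exists($aWords[$i]) Then
            $oSeen.Add($aWords[$i], 1) ; remember the word
            $sResult &= $aWords[$i] & " "
        EndIf
    Next

    Return StringStripWS($sResult, 2) ; flag 2 = strip the trailing space
EndFunc

$source = "this is the end of the end for the end of the end is the beginning of the end with all hopes for the start of a new end"
ConsoleWrite(_UniqueWords($source) & @CRLF)
```

Because it keeps the first occurrence of each word and preserves the original order, this should print the $result string asked for in the first post. Each word is checked once against the dictionary, so the cost grows with the number of words rather than with the square of it.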

zackrspv (Author) Posted April 3, 2008

While that would be doable, it would be very intensive. I may have sources with over 100 words, and repeatedly checking each word against all the others would be... well, intensive.

I've seen regular expressions do this work before in JavaScript, for example, but I can't translate that to what AutoIt supports; at least I don't know how to. And a regex would be much simpler.

The whole point is to take an entire article source and remove all duplicate words, leaving just the unique words: a way of forming keywords automatically. I'd also remove simple words like 'and', 'for', 'is', 'the', etc., but it's the best method I could think of to form the keywords for a full-text search that I'm adding to one of my applications.

zackrspv (Author) Posted April 3, 2008 (edited)

So, I used the idea mentioned above and confirmed it is very slow: it takes almost 5x longer than without that procedure. I have to wonder if a regex would just be faster than splitting the string, etc.
But here's what I have so far:

$nOffset = 1
While 1
    ; Grab the next article body from the page source, resuming at $nOffset
    $array = StringRegExp($sHTML, '(?s)(?i)(?:<TD class=article_text>\<SPAN id=_ctl0_ArticleRepeater__ctl1_ArticleText>)(.*?)(?:\</span>\</td>)', 1, $nOffset)
    If @error = 0 Then
        $nOffset = @extended
    Else
        ExitLoop
    EndIf
    For $i = 0 To UBound($array) - 1
        $str = $array[$i]
        ; Strip HTML entities and tags, then normalise whitespace and commas
        $str = StringRegExpReplace($str, "&#(.*?);", "")
        $str = StringRegExpReplace($str, "<(.*?)>", "")
        $str = StringRegExpReplace($str, "</(.*?)>", "")
        $str = StringRegExpReplace($str, "&(.*?);", "")
        $str = StringRegExpReplace($str, "\s", " ")
        $str = StringReplace($str, ",", " ")
        ; Remove common words, one regex replace per word
        $remove = StringSplit("'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't,as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't,don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his,how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like,likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've,shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd,they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what,what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd,who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll,you're,you've,your", ",", 1)
        for $k = 1 to UBound($remove) - 1
            $str = StringRegExpReplace($str, "(?i)\b("&$remove[$k]&")\s", "")
        Next
    Next
    ; Split into words and collect each one only once
    $unique = StringSplit($str, ", ", 0)
    dim $keywords[UBound($unique)-1]
    for $dl = 1 to UBound($unique) - 1
        $ustr = $unique[$dl]
        if $ustr = "" or $ustr = " " Then
        Else
            if _ArraySearch($keywords, $unique[$dl]) > 0 Then
            Else
                _ArrayAdd($keywords, $unique[$dl])
            EndIf
        EndIf
    Next
    $title = _ArrayToString($keywords, " ")
    $title = StringRegExpReplace($title, "\s\s+", "")
    ProgressSet($flvalue, "Completed: " & $flvalue & "%" & @CRLF & "Updated: " & $title)
    ; Append the keyword line to the database file
    $file = FileOpen($dbfile, 9)
    FileWrite($file, $id & ", " & $title & @CRLF)
    FileClose($file)
WEnd

Edited April 3, 2008 by zackrspv
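
The stop-word loop in the script above runs one StringRegExpReplace per word, which is a plausible contributor to the slowness. One possible alternative, sketched here under assumptions, is to join the words into a single alternation and remove them in one pass. The shortened word list, the sample sentence, and the variable names are all made up for illustration. The pattern also differs slightly from the one in the post: it uses \b on both sides and an optional trailing \s*.

```autoit
; Hypothetical, shortened stop-word list; a longer list could be joined with "|" the same way
Local $sStopWords = "a|an|and|for|is|of|the|with"
Local $sText = "this is the end of the road for a new start"

; One pass: remove every whole-word match plus any whitespace that follows it
Local $sKept = StringRegExpReplace($sText, "(?i)\b(?:" & $sStopWords & ")\b\s*", "")
ConsoleWrite($sKept & @CRLF) ; this end road new start
```

One caveat: entries that begin with an apostrophe, such as 'tis, would not match after a space with this pattern, because \b needs a word character on one side of the boundary. Those would need separate handling.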

weaponx Posted April 3, 2008 (edited)

$string = "This this is is a a test test message message"
$result = StringRegExpReplace($string, "(?i)\b(\w+)(?:(\s+)\1\b)+", "$1")
ConsoleWrite($result & @CRLF)

Edited April 3, 2008 by weaponx
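
A quick way to see what this pattern does and does not cover: the back-reference \1 has to come right after the captured word, separated only by whitespace, so only adjacent repeats collapse. The test strings below are made up for illustration.

```autoit
Local $sPattern = "(?i)\b(\w+)(?:\s+\1\b)+"

; Adjacent repeats collapse to a single word
ConsoleWrite(StringRegExpReplace("the the end end", $sPattern, "$1") & @CRLF) ; the end

; Repeats separated by other words are left untouched
ConsoleWrite(StringRegExpReplace("the end of the end", $sPattern, "$1") & @CRLF) ; the end of the end
```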

zackrspv (Author) Posted April 3, 2008 (edited)

Well, that helped. But since it won't fix something like 'This is the end of the end is the end of the end', it's not quite what I was looking for. You did point me in the right direction, though:

#include <array.au3>
$string = "This is the end of the end for the end of the end is the begnning of that start of the fourth end."
$keywords = StringSplit($string, " ", 0)
_ArraySort($keywords, 0, 1) ;- sort from index 1 so the Count in $keywords[0] stays in place
$string = _ArrayToString($keywords, " ", 1) ;- start at $keywords[1] to avoid the Count in the string
$result = StringRegExpReplace($string, "(?i)\b(\w+)(?:(\s+)\1\b)+", "$1")
ConsoleWrite($result & @CRLF)

Sorting puts identical words next to each other, so the adjacent-repeat regex can then collapse them. It's much faster than the example I posted before, so I'm happy for the most part.

Edited April 3, 2008 by zackrspv
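
Note that sorting first means the result comes back in sorted order rather than in the original word order. If order matters and a pure-regex answer is still wanted, one possible sketch is a lookahead that deletes any word that appears again later in the text. It assumes AutoIt's PCRE-based engine, and it keeps the last occurrence of each word rather than the first. The sample string is made up for illustration.

```autoit
Local $sText = "the end of the end"

; Delete a word (and the whitespace after it) when the same word occurs again further on.
; (?i) ignores case; (?s) lets .* in the lookahead cross line breaks.
Local $sResult = StringRegExpReplace($sText, "(?is)\b(\w+)\b\s*(?=.*\b\1\b)", "")
ConsoleWrite($sResult & @CRLF) ; of the end
```

The lookahead rescans the rest of the string for every word, so on long articles this may well be slower than a split-based approach; it is shown only to answer the original "can a regex do that?" question.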