czardas Posted April 25, 2016 (edited)

I am looking for a way to define a word boundary in any language. The standard regular expression '\b' is not very useful for this because it doesn't define a word boundary in any natural language that I am aware of. It treats some non-alphabetic characters as part of a word and hardly recognizes letters from any alphabet other than English. What is needed is something more substantial but also efficient. So far I've considered testing for spaces, punctuation and delimiters. I'm just wondering if there might be an easier way, or some trick I don't know about. Testing all non-alphabetic Unicode ranges would be rather slow and the code would be humongous. Here are the code points for punctuation in the Basic Multilingual Plane to give you an impression of how this might be done.

    $sPunctuation = _
        '\x{21}-\x{23}\x{25}-\x{2A}\x{2C}-\x{2F}\x{3A}\x{3B}\x{3F}\x{40}\x{5B}-\x{5D}\x{5F}\x{7B}\x{7D}\x{A1}\x{A7}\x{AB}\x{B6}\x{B7}\x{BB}\x{BF}\x{037E}\x{0387}' & _
        '\x{055A}-\x{055F}\x{0589}\x{058A}\x{05BE}\x{05C0}\x{05C3}\x{05C6}\x{05F3}\x{05F4}\x{0609}\x{060A}\x{060C}\x{060D}\x{061B}\x{061E}\x{061F}\x{066A}-\x{066D}' & _
        '\x{06D4}\x{0700}-\x{070D}\x{07F7}-\x{07F9}\x{0830}-\x{083E}\x{085E}\x{0964}\x{0965}\x{0970}\x{0AF0}\x{0DF4}\x{0E4F}\x{0E5A}\x{0E5B}\x{0F04}-\x{0F12}\x{0F14}' & _
        '\x{0F3A}-\x{0F3D}\x{0F85}\x{0FD0}-\x{0FD4}\x{0FD9}\x{0FDA}\x{104A}-\x{104F}\x{10FB}\x{1360}-\x{1368}\x{1400}\x{166D}\x{166E}\x{169B}\x{169C}\x{16EB}-\x{16ED}' & _
        '\x{1735}\x{1736}\x{17D4}-\x{17D6}\x{17D8}-\x{17DA}\x{1800}-\x{180A}\x{1944}\x{1945}\x{1A1E}\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B60}' & _
        '\x{1BFC}-\x{1BFF}\x{1C3B}-\x{1C3F}\x{1C7E}\x{1C7F}\x{1CC0}-\x{1CC7}\x{1CD3}\x{2010}-\x{2027}\x{2030}-\x{2043}\x{2045}-\x{2051}\x{2053}-\x{205E}\x{207D}' & _
        '\x{207E}\x{208D}\x{208E}\x{2329}\x{232A}\x{2768}-\x{2775}\x{27C5}\x{27C6}\x{27E6}-\x{27EF}\x{2983}-\x{2998}\x{29D8}-\x{29DB}\x{29FC}\x{29FD}\x{2CF9}-\x{2CFC}' & _
        '\x{2CFE}\x{2CFF}\x{2D70}\x{2E00}-\x{2E2E}\x{2E30}-\x{2E3B}\x{3001}-\x{3003}\x{3008}-\x{3011}\x{3014}-\x{301F}\x{3030}\x{303D}\x{30A0}\x{30FB}\x{A4FE}\x{A4FF}' & _
        '\x{A60D}-\x{A60F}\x{A673}\x{A67E}\x{A6F2}-\x{A6F7}\x{A874}-\x{A877}\x{A8CE}\x{A8CF}\x{A8F8}-\x{A8FA}\x{A92E}\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}\x{A9DF}' & _
        '\x{AA5C}-\x{AA5F}\x{AADE}\x{AADF}\x{AAF0}\x{AAF1}\x{ABEB}\x{FD3E}\x{FD3F}\x{FE10}-\x{FE19}\x{FE30}-\x{FE52}\x{FE54}-\x{FE61}\x{FE63}\x{FE68}\x{FE6A}\x{FE6B}' & _
        '\x{FF01}-\x{FF03}\x{FF05}-\x{FF0A}\x{FF0C}-\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1F}\x{FF20}\x{FF3B}-\x{FF3D}\x{FF3F}\x{FF5B}\x{FF5D}\x{FF5F}-\x{FF65}'

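A possible shortcut for the hand-built list above, offered as a sketch rather than a tested answer: PCRE has Unicode property classes, so `\p{P}` (punctuation) and `\p{Z}` (separators) may be able to stand in for the enumerated code points. This assumes the PCRE build behind StringRegExp() supports those classes; the sample text and variable names are made up for illustration.

```autoit
; Sketch only: assumes StringRegExp() accepts PCRE Unicode property classes.
; ChrW() builds the sample so the script file needs no special encoding.
Local $sText = 'green' & ChrW(0x3001) & 'blue' & ChrW(0x00BB) & ' red!'

; \p{P} = any punctuation character, \p{Z} = any separator (spaces included).
; Flag 3 returns every match in an array.
Local $aDelims = StringRegExp($sText, '[\p{P}\p{Z}]', 3)
If Not @error Then ConsoleWrite('Delimiters found: ' & UBound($aDelims) & @CRLF)
```

Note that currency signs such as `$` belong to the symbol category (`\p{S}`), not punctuation, so a class built only from `\p{P}` and `\p{Z}` would still miss them.
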
iamtheky Posted April 25, 2016

I think space is a pretty reliable word boundary. Why the precision?

czardas Posted April 25, 2016 (edited)

If I want to find the exact word 'green' in this very sentence, I'm already stuck. I don't speak all languages and can't assume that a word may be preceded by the dollar sign in AutoIt (although that's not punctuation).

iamtheky Posted April 25, 2016 (edited)

Speaking them is easy, you can leave the punctuation if you are just going to speak them. But I'm guessing you need the polished version for something else...

    $sStr = "If I want to find the ,exact (word) 'green' in this very sentence, I'm already stuck"
    ; Split on spaces; flag 2 returns a 0-based array without the count element
    $aStr = StringSplit($sStr, " ", 2)
    ; Elements 6 to 8 are ",exact", "(word)" and "'green'", punctuation included
    $s_text = $aStr[6] & " " & $aStr[7] & " " & $aStr[8]
    $o_speech = ObjCreate("SAPI.SpVoice")
    $o_speech.Speak($s_text)

czardas Posted April 25, 2016

Hmm right. A recursive loop may test millions of words.

iamtheky Posted April 25, 2016 (edited)

How are you eliminating all your known words first? Would that be easier, or worthwhile, if it leaves a smaller group to test for weird boundaries?

czardas Posted April 25, 2016 (edited)

The user knows/inputs the word. When searching for the word, the code is meant to prevent a false positive match when the word appears within a larger word. So a definition of word boundary is needed.

iamtheky Posted April 25, 2016 (edited)

"When searching for a word"...

- within a string of unknown delimiters? (Also, is there a language that does not use spaces in digital text?)
- within an array of words that could potentially have punctuation attached?
- Can it be either?

czardas Posted April 25, 2016 (edited)

Searching (within a string) for a word in an unknown language (or searching for a random string), with unknown delimiters which are not letters of some kind. I define such a delimiter as a word boundary, whether the search term is a word or not.

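The definition above ("delimiters which are not letters of some kind") can be written almost literally with lookarounds and the Unicode letter class `\p{L}`. The following is a minimal sketch, not something posted in the thread: `_IsWholeWord` is a hypothetical helper name, and it assumes StringRegExp() supports `\p{L}` and that the search term contains no `\E` sequence.

```autoit
; Hypothetical helper, not from the thread: True if $sFind occurs in $sText
; with no Unicode letter directly before or after it.
Func _IsWholeWord($sText, $sFind)
    ; (?<!\p{L}) = not preceded by a letter, (?!\p{L}) = not followed by a letter.
    ; \Q...\E quotes the term literally (assumes $sFind contains no '\E').
    Local $sPattern = '(?<!\p{L})\Q' & $sFind & '\E(?!\p{L})'
    Return StringRegExp($sText, $sPattern) = 1
EndFunc

ConsoleWrite(_IsWholeWord('traa la la $green_', 'green') & @CRLF)
ConsoleWrite(_IsWholeWord('greenhouse', 'green') & @CRLF)
```

Under those assumptions the first call should report a match and the second should not. Because the test is "not a letter" rather than "is punctuation", the dollar sign, digits and the underscore all count as delimiters without being listed.
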
iamtheky Posted April 25, 2016 (edited)

That's what post #1 looked like, I just didn't think you were serious. If it's random, then word boundaries wouldn't exist. Are you going to try and learn the boundaries, and then check for them if they are unknown?

czardas Posted April 25, 2016

The expression should look something like this: [symbol](Anything)[symbol]

iamtheky Posted April 25, 2016

Sure, but if your symbol is a character in another language then you have to check how it's being used.

czardas Posted April 25, 2016 (edited)

That's where some compromise is going to be needed, and the reason I targeted punctuation, spaces and delimiters. It doesn't have to be perfect. If a colony of king penguins say that's not how we write 'penguin language', then too bad.

iamtheky Posted April 25, 2016

Maybe just find the anomalous characters, the ones with the lowest code points. Finding a way to look at the top end as well might be tough...

    #include <Array.au3>

    $sStr = "If I want to find the exact word 'green'"
    ;~ $sStr = "If I want to find the exact *word* green"
    ;~ $sStr = "If I want to find the (exact) word green"

    ; Strip all whitespace (flag 8), then convert to an array of code points
    $aStr = StringToASCIIArray(StringStripWS($sStr, 8))
    ; Index of the lowest code point, then of the lowest one after it
    $min = _ArrayMinIndex($aStr, 0)
    $min2 = _ArrayMinIndex($aStr, 0, $min + 1)
    ; Rebuild the text between the two lowest code points
    $sOut = ""
    For $i = $min + 1 To $min2 - 1
        $sOut &= ChrW($aStr[$i])
    Next
    MsgBox(0, '', $sOut)

czardas Posted April 25, 2016 (edited)

It has to be as efficient as possible, so a number of OR conditions starting with the most likely scenario, StringIsASCII() etc... I wish I was better at RegExp. Here's a quick mock-up of the approach I am considering. It needs more work.

    Local $sString = 'traa la la $green_'
    Local $sFind = 'green'
    ; Group 2 is the ASCII non-letter class; (?2) reuses it after the search term
    MsgBox(0, "", StringRegExp($sString, '(\A.*)([\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}])(' & $sFind & ')(?2)(.*\z)'))

czardas Posted April 25, 2016 (edited: added \Q ... \E to the regexp)

Actually this is better. Remove the delimiter and 'greenhouse' fails.

    Local $sString = 'green|house'
    Local $sFind = 'green'
    $sString = StringReplace($sString, '\E', ChrW(57344), 0, 1) ; U+E000
    $sFind = StringReplace($sFind, '\E', ChrW(57344), 0, 1) ; ditto
    MsgBox(0, "", StringRegExp($sString, '(\A|[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]|\z)(\Q' & $sFind & '\E)(?1)'))

A further check is still needed in case the escape sequence '\E' occurs within $sFind. [Added]

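An alternative to swapping '\E' for a private-use character, sketched here as an assumption rather than as the approach used in the post: close the quoted section at each '\E', emit the two characters escaped, and reopen the quote. `_QuoteLiteral` is a hypothetical helper name; the subject string then needs no preprocessing.

```autoit
; Hypothetical helper: wrap a search term in \Q...\E so PCRE treats it literally,
; even when the term itself contains the two characters '\E'.
Func _QuoteLiteral($sFind)
    ; Each '\E' in the term becomes: \E (close quote), \\E (literal backslash
    ; followed by E), \Q (reopen quote). The final argument 1 makes the
    ; replacement case-sensitive, so '\e' is left alone.
    Return '\Q' & StringReplace($sFind, '\E', '\E\\E\Q', 0, 1) & '\E'
EndFunc

; The term 'a\Eb' becomes the pattern fragment \Qa\E\\E\Qb\E
ConsoleWrite(_QuoteLiteral('a\Eb') & @CRLF)
```

The returned fragment can be dropped into a larger pattern in place of `'\Q' & $sFind & '\E'`.
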
alien4u Posted April 25, 2016 (edited: fixing example code)

This is not the most efficient approach, and it is not good code on my part, but maybe it gives you another perspective or idea for reaching your final goal:

    #include <Array.au3>

    Local $sString = 'traa la la skjalkjlk lasjdlkJDLKjlj9023840928309482093jalkfjlakjflk ____kkjfkw wejflqwjrkl _ $green___ ___ 2390802983092u3roijalksfna'
    Local $sFind = 'green'

    ; Zero-based offset of the first occurrence of $sFind
    $isthere = StringInStr($sString, $sFind) - 1
    $totalStringSize = StringLen($sString)
    $sFindSize = StringLen($sFind)

    If $isthere < $totalStringSize Then
        ; Number of characters from the start of the match to the end of the string
        $restchars = $totalStringSize - $isthere
        If $sFindSize < $restchars Then
            ; Cut off everything after the match, then print the last $sFindSize characters
            $norightchars = StringTrimRight($sString, $restchars - $sFindSize)
            $isthereagain = StringInStr($norightchars, $sFind) - 1
            ConsoleWrite(StringTrimLeft($norightchars, StringLen($norightchars) - $sFindSize) & @CRLF)
        ElseIf $sFindSize == $restchars Then
            ; The match sits at the very end of the string
            ConsoleWrite(StringTrimLeft($sString, StringLen($sString) - $sFindSize) & @CRLF)
        EndIf
    EndIf

Regards
Alien.

czardas Posted April 25, 2016

@alien4u Thanks - all suggestions are welcome, especially because there's a lot I don't know about languages and this is a multilingual community. It may be a good idea to do a preliminary test with StringInStr() before looking for some kind of word boundary.

alien4u Posted April 25, 2016

Hi @czardas

My code is really bad and it does not work properly, but my point is that with StringInStr() you will find where the substring is no matter what delimiter is there, and based on that you could extract the word.

Regards
Alien.

czardas Posted April 25, 2016 (edited)

StringRegExp() does all these steps internally, and speed comparisons will likely show this to be the best method. However, as @iamtheky pointed out, spaces are the most likely delimiter, and more complex routines can be used when faster comparisons fail. I've spent a couple of days trying to think of how best to approach this. Thanks for the suggestions.

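The layered idea described above (cheap tests first, regular expression only when they are inconclusive) could look roughly like the sketch below. It is untimed and makes several assumptions: `_FindWord` is a hypothetical name, the match is case-sensitive, `\p{L}` is available in StringRegExp(), and the search term contains no `\E` sequence.

```autoit
; Hypothetical helper: whole-word test that tries the cheapest checks first.
Func _FindWord($sText, $sFind)
    ; Step 1: plain substring test (third argument 1 = case-sensitive).
    ; If the term is absent, no boundary check is needed at all.
    If Not StringInStr($sText, $sFind, 1) Then Return False

    ; Step 2: the most likely delimiter, a space on both sides.
    ; Padding the text lets the first and last words match too.
    If StringInStr(' ' & $sText & ' ', ' ' & $sFind & ' ', 1) Then Return True

    ; Step 3: fall back to the regular expression for any other delimiter.
    Return StringRegExp($sText, '(?<!\p{L})\Q' & $sFind & '\E(?!\p{L})') = 1
EndFunc

ConsoleWrite(_FindWord('the green house', 'green') & @CRLF)
ConsoleWrite(_FindWord('a (green) door', 'green') & @CRLF)
ConsoleWrite(_FindWord('greenhouse', 'green') & @CRLF)
```

Whether the first two steps actually save time over calling StringRegExp() directly would need measuring on real data.
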