Define Word Boundary - SOLVED

czardas · April 25, 2016

I am looking for a way to define a word boundary in any language. The standard regular expression '\b' is not very useful for this because it doesn't correspond to a word boundary in any natural language that I am aware of. It ignores several non-alphabetic characters and hardly recognizes letters from any alphabet other than English. What is needed is something more substantial but also efficient. So far I've considered testing for spaces, punctuation and delimiters. I'm just wondering if there might be an easier way, or some trick I don't know about. Testing all non-alphabetic Unicode ranges would be rather slow and the code would be humongous. Here are the code points for punctuation in the Basic Multilingual Plane, to give an impression of how this might be done.

$sPunctuation = _
'\x{21}-\x{23}\x{25}-\x{2A}\x{2C}-\x{2F}\x{3A}\x{3B}\x{3F}\x{40}\x{5B}-\x{5D}\x{5F}\x{7B}\x{7D}\x{A1}\x{A7}\x{AB}\x{B6}\x{B7}\x{BB}\x{BF}\x{037E}\x{0387}' & _
'\x{055A}-\x{055F}\x{0589}\x{058A}\x{05BE}\x{05C0}\x{05C3}\x{05C6}\x{05F3}\x{05F4}\x{0609}\x{060A}\x{060C}\x{060D}\x{061B}\x{061E}\x{061F}\x{066A}-\x{066D}' & _
'\x{06D4}\x{0700}-\x{070D}\x{07F7}-\x{07F9}\x{0830}-\x{083E}\x{085E}\x{0964}\x{0965}\x{0970}\x{0AF0}\x{0DF4}\x{0E4F}\x{0E5A}\x{0E5B}\x{0F04}-\x{0F12}\x{0F14}' & _
'\x{0F3A}-\x{0F3D}\x{0F85}\x{0FD0}-\x{0FD4}\x{0FD9}\x{0FDA}\x{104A}-\x{104F}\x{10FB}\x{1360}-\x{1368}\x{1400}\x{166D}\x{166E}\x{169B}\x{169C}\x{16EB}-\x{16ED}' & _
'\x{1735}\x{1736}\x{17D4}-\x{17D6}\x{17D8}-\x{17DA}\x{1800}-\x{180A}\x{1944}\x{1945}\x{1A1E}\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B60}' & _
'\x{1BFC}-\x{1BFF}\x{1C3B}-\x{1C3F}\x{1C7E}\x{1C7F}\x{1CC0}-\x{1CC7}\x{1CD3}\x{2010}-\x{2027}\x{2030}-\x{2043}\x{2045}-\x{2051}\x{2053}-\x{205E}\x{207D}' & _
'\x{207E}\x{208D}\x{208E}\x{2329}\x{232A}\x{2768}-\x{2775}\x{27C5}\x{27C6}\x{27E6}-\x{27EF}\x{2983}-\x{2998}\x{29D8}-\x{29DB}\x{29FC}\x{29FD}\x{2CF9}-\x{2CFC}' & _
'\x{2CFE}\x{2CFF}\x{2D70}\x{2E00}-\x{2E2E}\x{2E30}-\x{2E3B}\x{3001}-\x{3003}\x{3008}-\x{3011}\x{3014}-\x{301F}\x{3030}\x{303D}\x{30A0}\x{30FB}\x{A4FE}\x{A4FF}' & _
'\x{A60D}-\x{A60F}\x{A673}\x{A67E}\x{A6F2}-\x{A6F7}\x{A874}-\x{A877}\x{A8CE}\x{A8CF}\x{A8F8}-\x{A8FA}\x{A92E}\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}\x{A9DF}' & _
'\x{AA5C}-\x{AA5F}\x{AADE}\x{AADF}\x{AAF0}\x{AAF1}\x{ABEB}\x{FD3E}\x{FD3F}\x{FE10}-\x{FE19}\x{FE30}-\x{FE52}\x{FE54}-\x{FE61}\x{FE63}\x{FE68}\x{FE6A}\x{FE6B}' & _
'\x{FF01}-\x{FF03}\x{FF05}-\x{FF0A}\x{FF0C}-\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1F}\x{FF20}\x{FF3B}-\x{FF3D}\x{FF3F}\x{FF5B}\x{FF5D}\x{FF5F}-\x{FF65}'

Edited April 25, 2016 by czardas
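
A possible shortcut for the table in the post above: the PCRE engine behind StringRegExp() also understands Unicode property classes, and the hand-built ranges appear to correspond closely to the general category Punctuation, \p{P}. What follows is a minimal sketch, assuming \p{P} is an acceptable stand-in for those ranges. The sample text is made up for illustration and is built with ChrW() so the script does not depend on the file's encoding.

```autoit
; Sketch only: \p{P} matches any character in the Unicode "Punctuation" category.
; Hypothetical sample text: a comma, U+3001 (ideographic comma) and U+00BF (inverted question mark).
Local $sText = 'one, two' & ChrW(0x3001) & 'three' & ChrW(0x00BF)

; Flag 3 returns an array holding every match in the string
Local $aPunct = StringRegExp($sText, '\p{P}', 3)

If @error Then
    ConsoleWrite('No punctuation found' & @CRLF)
Else
    ConsoleWrite('Punctuation characters found: ' & UBound($aPunct) & @CRLF)
EndIf
```

Other categories work the same way, for example \p{Z} for separators or \p{S} for symbols such as the dollar sign, so a delimiter class can be assembled from a few properties rather than from several hundred code points.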

iamtheky · April 25, 2016

I think space is a pretty reliable word boundary. Why the precision?

czardas · April 25, 2016

If I want to find the exact word 'green' in this very sentence, I'm already stuck. I don't speak all languages and can't assume that a word may be preceded by the dollar sign in AutoIt (although that's not punctuation :think: ).

Edited April 25, 2016 by czardas

iamtheky · April 25, 2016

Speaking them is easy, you can leave the punctuation if you are just going to speak them. But, I'm guessing you need the polished version for something else...

$sStr = "If I want to find the ,exact (word) 'green' in this very sentence, I'm already stuck"

$aStr = StringSplit($sStr, " ", 2) ; flag 2: zero-based array, no count in element 0

$s_text = $aStr[6] & " " & $aStr[7] & " " & $aStr[8]
$o_speech = ObjCreate("SAPI.SpVoice")
$o_speech.Speak($s_text)

Edited April 25, 2016 by iamtheky

czardas · April 25, 2016

Hmm right. A recursive loop may test millions of words.

iamtheky · April 25, 2016

How are you eliminating all your known words first? Is that easier, or worthwhile, leaving a smaller group to test weird boundaries?

Edited April 25, 2016 by iamtheky

czardas · April 25, 2016

The user knows/inputs the word. When searching for the word, the code is meant to prevent a false positive match when the word appears within a larger word. So a definition of word boundary is needed.

Edited April 25, 2016 by czardas

iamtheky · April 25, 2016

"When searching for a word"... within a string with unknown delimiters? (Also, is there a language that does not use spaces in digital text?) Or within an array of words that could potentially have punctuation attached? Can it be either?

Edited April 25, 2016 by iamtheky

czardas · April 25, 2016

Searching (within a string) for a word in an unknown language (or searching for a random string), with unknown delimiters which are not letters of some kind. I define this as a word boundary whether the search term is a word or not.

Edited April 25, 2016 by czardas

iamtheky · April 25, 2016

That's what post #1 looked like, I just didn't think you were serious.

If it's random, then word boundaries wouldn't exist. Are you going to try and learn the boundaries, and then check for them if they are unknown?

Edited April 25, 2016 by iamtheky

czardas · April 25, 2016

The expression should look something like this: [symbol](Anything)[symbol]

iamtheky · April 25, 2016

Sure, but if your symbol is a character in another language then you have to check how it's being used.

czardas · April 25, 2016

That's where some compromise is going to be needed, and the reason I targeted punctuation, spaces and delimiters. It doesn't have to be perfect. If a colony of king penguins say that's not how we write 'penguin language', then too bad.

Edited April 25, 2016 by czardas

iamtheky · April 25, 2016

Maybe just find the anomalous characters; finding a way to look at the top end as well might be tough...

#include<array.au3>

$sStr = "If I want to find the exact word 'green'"
;~ $sStr = "If I want to find the exact *word* green"
;~ $sStr = "If I want to find the (exact) word green"

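; Strip all whitespace (flag 8), then convert each character to its code point
$aStr = StringToASCIIArray(StringStripWS($sStr, 8))

; Index of the lowest code point (here the first quote), then of the lowest one in the rest of the array
$min = _ArrayMinIndex($aStr, 0)
$min2 = _ArrayMinIndex($aStr, 0, $min + 1)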
$sOut = ""

for $i = $min + 1 to $min2 - 1
    $sOut &= ChrW($aStr[$i])
Next

msgbox(0, '' , $sOut)

czardas · April 25, 2016

It has to be as efficient as possible, so I'm thinking of a series of OR conditions starting with the most likely scenario: StringIsASCII(), etc.

I wish I was better at RegExp. Here's a quick mock up of the approach I am considering. It needs more work.

Local $sString = 'traa la la $green_'
Local $sFind = 'green'
MsgBox(0, "", StringRegExp($sString, '(\A.*)([\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}])(' & $sFind & ')(?2)(.*\z)'))

Edited April 25, 2016 by czardas

czardas · April 25, 2016

Actually this is better. Remove the delimiter and 'greenhouse' fails.

Local $sString = 'green|house'
Local $sFind = 'green'
$sString = StringReplace($sString, '\E', ChrW(57344), 0, 1) ; U+E000
$sFind = StringReplace($sFind, '\E', ChrW(57344), 0, 1) ; ditto

MsgBox(0, "", StringRegExp($sString, '(\A|[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]|\z)(\Q' & $sFind & '\E)(?1)'))

~~A further check is still needed: in case the escape sequence '\E' occurs within $sFind.~~ [Added]

Edited April 25, 2016 by czardas
Added \Q ... \E to the regexp
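
One way the two loose ends in the post above might be tied up, offered as a sketch and not as the thread's final answer: quote the search term so that an embedded '\E' cannot end the \Q...\E block early, and test the neighbouring characters with Unicode properties instead of ASCII ranges. The helper name _IsWholeWord is hypothetical, and the choice of \p{L} (letters) plus \p{M} (combining marks) as "word characters" is an assumption based on the "delimiters which are not letters" definition given earlier.

```autoit
; Hypothetical helper (not from the thread): True when $sFind occurs in $sString
; with no letter or combining mark directly before or after it.
Func _IsWholeWord($sString, $sFind)
    ; A literal \E inside the term would end the \Q...\E quoting early, so at each
    ; occurrence close the quote, match an escaped \E, and reopen the quote.
    Local $sQuoted = '\Q' & StringReplace($sFind, '\E', '\E\\E\Q', 0, 1) & '\E'

    ; \p{L} = any Unicode letter, \p{M} = any combining mark.
    ; The lookarounds inspect the neighbours without consuming them.
    Local $sPattern = '(?<![\p{L}\p{M}])' & $sQuoted & '(?![\p{L}\p{M}])'

    Return StringRegExp($sString, $sPattern) = 1
EndFunc

ConsoleWrite(_IsWholeWord('traa la la $green_', 'green') & @CRLF)   ; True
ConsoleWrite(_IsWholeWord('greenhouse', 'green') & @CRLF)           ; False
ConsoleWrite(_IsWholeWord(ChrW(0x00E9) & 'green', 'green') & @CRLF) ; False: U+00E9 is a letter
```

Digits and the underscore count as boundaries here, which mirrors the delimiter ranges in the posted pattern. Adding \p{N} or _ to both character classes would treat them as part of a word instead.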

alien4u · April 25, 2016

This is not the most efficient approach, and it is not good code on my part, but maybe it will give you another perspective or an idea for reaching your final goal:

#include<array.au3>
Local $sString = 'traa la la skjalkjlk lasjdlkJDLKjlj9023840928309482093jalkfjlakjflk ____kkjfkw wejflqwjrkl _ $green___ ___ 2390802983092u3roijalksfna'
Local $sFind = 'green'

$isthere = StringInStr($sString,$sFind) - 1
$totalStringSize = StringLen($sString)
$sFindSize = StringLen($sFind)

If $isthere < $totalStringSize Then
    $restchars = $totalStringSize - $isthere
    If $sFindSize < $restchars Then
        $norightchars = StringTrimRight($sString, $restchars - $sFindSize)
        $isthereagain = StringInStr($norightchars,$sFind) -1
        ConsoleWrite(StringTrimLeft($norightchars,StringLen($norightchars) - $sFindSize)&@CRLF)
    ElseIf $sFindSize == $restchars Then
        ConsoleWrite(StringTrimLeft($sString,StringLen($sString) - $sFindSize)&@CRLF)
    EndIf
EndIf

Regards
Alien.

Edited April 25, 2016 by alien4u
Fixing example code

czardas · April 25, 2016

@alien4u Thanks - all suggestions are welcome, especially because there's a lot I don't know about languages and this is a multilingual community. It may be a good idea to do a preliminary test with StringInStr() before looking for some kind of word boundary.

alien4u · April 25, 2016

Hi @czardas
My code is really bad and it does not work properly, but my point is that with StringInStr() you will find where the substring is, no matter what delimiter is there, and based on that you could extract the word.

Regards
Alien.

czardas · April 25, 2016

StringRegExp() does all these steps internally and speed comparisons will likely show this to be the best method. However as @iamtheky pointed out, spaces are the most likely delimiter and more complex routines can be used when faster comparisons fail. I've spent a couple of days trying to think of how best to approach this. Thanks for the suggestions.

Edited April 25, 2016 by czardas
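
The tiered idea in the post above (cheap comparisons first, the regular expression only when they fail) could look roughly like this. It is a sketch under assumptions: the function name _FindWordTiered is hypothetical, the comparisons are case-sensitive, and the last tier simply reuses the ASCII delimiter class from the earlier mock-up, so any stricter pattern could be dropped in there.

```autoit
; Hypothetical helper: cheapest checks first, regular expression last.
Func _FindWordTiered($sString, $sFind)
    ; Tier 1: plain case-sensitive substring test. If this fails, nothing else can match.
    If Not StringInStr($sString, $sFind, 1) Then Return False

    ; Tier 2: the most likely case, the word delimited by spaces.
    ; Padding both strings lets the start and end of $sString count as boundaries.
    If StringInStr(' ' & $sString & ' ', ' ' & $sFind & ' ', 1) Then Return True

    ; Tier 3: fall back to the delimiter class from the mock-up earlier in the thread.
    ; The StringReplace() call keeps a literal \E in the term from ending the quoting early.
    Local $sDelim = '[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]'
    Local $sQuoted = '\Q' & StringReplace($sFind, '\E', '\E\\E\Q', 0, 1) & '\E'
    Return StringRegExp($sString, '(\A|' & $sDelim & '|\z)' & $sQuoted & '(?1)') = 1
EndFunc

ConsoleWrite(_FindWordTiered('a green b', 'green') & @CRLF)  ; True, decided in tier 2
ConsoleWrite(_FindWordTiered('$green_', 'green') & @CRLF)    ; True, decided in tier 3
ConsoleWrite(_FindWordTiered('greenhouse', 'green') & @CRLF) ; False
```

Whether the extra tiers actually beat a single StringRegExp() call would need timing on real data, since the regular expression engine already performs a similar scan internally.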
