czardas

Define Word Boundary - SOLVED

26 posts in this topic

#1 ·  Posted (edited)

I am looking for a way to define a word boundary in any language. The standard regular expression '\b' is not very useful for this because it doesn't define a word boundary in any language that I am aware of. By default it treats digits and the underscore as word characters and hardly recognizes letters from any alphabet other than English. What is needed is something more substantial but also efficient. So far I've considered testing for spaces, punctuation and delimiters. I'm just wondering if there might be an easier way, or some trick I don't know about. Testing all non-alphabetic Unicode ranges would be rather slow and the code would be humongous. Here are the code points for punctuation in the Basic Multilingual Plane to give you an impression of how this might be done.
 

$sPunctuation = _
'\x{21}-\x{23}\x{25}-\x{2A}\x{2C}-\x{2F}\x{3A}\x{3B}\x{3F}\x{40}\x{5B}-\x{5D}\x{5F}\x{7B}\x{7D}\x{A1}\x{A7}\x{AB}\x{B6}\x{B7}\x{BB}\x{BF}\x{037E}\x{0387}' & _
'\x{055A}-\x{055F}\x{0589}\x{058A}\x{05BE}\x{05C0}\x{05C3}\x{05C6}\x{05F3}\x{05F4}\x{0609}\x{060A}\x{060C}\x{060D}\x{061B}\x{061E}\x{061F}\x{066A}-\x{066D}' & _
'\x{06D4}\x{0700}-\x{070D}\x{07F7}-\x{07F9}\x{0830}-\x{083E}\x{085E}\x{0964}\x{0965}\x{0970}\x{0AF0}\x{0DF4}\x{0E4F}\x{0E5A}\x{0E5B}\x{0F04}-\x{0F12}\x{0F14}' & _
'\x{0F3A}-\x{0F3D}\x{0F85}\x{0FD0}-\x{0FD4}\x{0FD9}\x{0FDA}\x{104A}-\x{104F}\x{10FB}\x{1360}-\x{1368}\x{1400}\x{166D}\x{166E}\x{169B}\x{169C}\x{16EB}-\x{16ED}' & _
'\x{1735}\x{1736}\x{17D4}-\x{17D6}\x{17D8}-\x{17DA}\x{1800}-\x{180A}\x{1944}\x{1945}\x{1A1E}\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B60}' & _
'\x{1BFC}-\x{1BFF}\x{1C3B}-\x{1C3F}\x{1C7E}\x{1C7F}\x{1CC0}-\x{1CC7}\x{1CD3}\x{2010}-\x{2027}\x{2030}-\x{2043}\x{2045}-\x{2051}\x{2053}-\x{205E}\x{207D}' & _
'\x{207E}\x{208D}\x{208E}\x{2329}\x{232A}\x{2768}-\x{2775}\x{27C5}\x{27C6}\x{27E6}-\x{27EF}\x{2983}-\x{2998}\x{29D8}-\x{29DB}\x{29FC}\x{29FD}\x{2CF9}-\x{2CFC}' & _
'\x{2CFE}\x{2CFF}\x{2D70}\x{2E00}-\x{2E2E}\x{2E30}-\x{2E3B}\x{3001}-\x{3003}\x{3008}-\x{3011}\x{3014}-\x{301F}\x{3030}\x{303D}\x{30A0}\x{30FB}\x{A4FE}\x{A4FF}' & _
'\x{A60D}-\x{A60F}\x{A673}\x{A67E}\x{A6F2}-\x{A6F7}\x{A874}-\x{A877}\x{A8CE}\x{A8CF}\x{A8F8}-\x{A8FA}\x{A92E}\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}\x{A9DF}' & _
'\x{AA5C}-\x{AA5F}\x{AADE}\x{AADF}\x{AAF0}\x{AAF1}\x{ABEB}\x{FD3E}\x{FD3F}\x{FE10}-\x{FE19}\x{FE30}-\x{FE52}\x{FE54}-\x{FE61}\x{FE63}\x{FE68}\x{FE6A}\x{FE6B}' & _
'\x{FF01}-\x{FF03}\x{FF05}-\x{FF0A}\x{FF0C}-\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1F}\x{FF20}\x{FF3B}-\x{FF3D}\x{FF3F}\x{FF5B}\x{FF5D}\x{FF5F}-\x{FF65}'
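
As a possible shortcut, the ranges above look like Unicode general category P (punctuation), which PCRE-style engines can usually match with a property class instead of an explicit list. A minimal sketch, assuming StringRegExp() accepts (*UCP) at the start of a pattern and supports \p{P} and \p{Z}; the sample string and variable names are made up for illustration:

```autoit
; Assumption: (*UCP) switches on Unicode property support in the regex engine.
; Sample text: 'green', U+3001 (ideographic comma), 'house', U+00BB, space, 'red'
Local $sSample = 'green' & ChrW(0x3001) & 'house' & ChrW(0x00BB) & ' red'

; \p{P} = any punctuation character, \p{Z} = any separator (spaces included).
; Each run of such characters is collapsed into a single '|'.
Local $sSplit = StringRegExpReplace($sSample, '(*UCP)[\p{P}\p{Z}]+', '|')

ConsoleWrite($sSplit & @CRLF)
```

Like the list above, \p{P} does not cover symbols such as the dollar sign or the plus sign, so those would still need separate handling.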

 

Edited by czardas


I think space is a pretty reliable word boundary.  Why the precision?


,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)


#3 ·  Posted (edited)

If I want to find the exact word 'green' in this very sentence, I'm already stuck, because it is delimited by quotes rather than spaces. I don't speak all languages, and I can't rule out other delimiters either: in AutoIt, for example, a word may be preceded by the dollar sign (although that's not punctuation :think:).

Edited by czardas


#4 ·  Posted (edited)

Speaking them is easy: you can leave the punctuation in if you are just going to speak them. But I'm guessing you need the polished version for something else...

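Local $sStr = "If I want to find the ,exact (word) 'green' in this very sentence, I'm already stuck"

; Flag 2 returns a 0-based array without the element count in [0]
Local $aStr = StringSplit($sStr, " ", 2)

; Elements 6 to 8 are ",exact", "(word)" and "'green'", punctuation still attached
Local $s_text = $aStr[6] & " " & $aStr[7] & " " & $aStr[8]
Local $o_speech = ObjCreate("SAPI.SpVoice")
$o_speech.Speak($s_text)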

 

Edited by iamtheky


#6 ·  Posted (edited)

How are you eliminating all your known words first? Would that be easier, or at least worthwhile, so that only a smaller group is left to test for weird boundaries?

Edited by iamtheky


#7 ·  Posted (edited)

The user knows/inputs the word. When searching for the word, the code is meant to prevent a false positive match when the word appears within a larger word. So a definition of word boundary is needed.

Edited by czardas


#8 ·  Posted (edited)

"When searching for the word"... within a string of unknown delimiters? (Also, is there a language that does not use spaces in digital text?) Or within an array of words that could potentially have punctuation attached? Can it be either?

Edited by iamtheky


#9 ·  Posted (edited)

Searching (within a string) for a word in an unknown language (or searching for a random string), with unknown delimiters which are not letters of some kind. I treat any such non-letter delimiter as a word boundary, whether the search term is a word or not.
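
A rough sketch of that definition, where a boundary is simply "no letter on either side". It assumes StringRegExp() accepts (*UCP) and the Unicode property \p{L} (any letter), and that the search term contains no '\E'; the variable names are only illustrative:

```autoit
; Assumption: (*UCP) enables Unicode property classes such as \p{L}.
Local $sString = 'traa la la $green_'
Local $sFind = 'green'

; (?<!\p{L}) = not preceded by a letter, (?!\p{L}) = not followed by a letter.
; \Q...\E treats the search term literally (assumes it contains no '\E').
Local $sPattern = '(*UCP)(?<!\p{L})\Q' & $sFind & '\E(?!\p{L})'

; With the default flag, StringRegExp() returns 1 for a match and 0 otherwise
Local $iFound = StringRegExp($sString, $sPattern)
Local $iInside = StringRegExp('greenhouse', $sPattern)
ConsoleWrite($iFound & ' ' & $iInside & @CRLF)
```

Whether digits should also count as part of a word is a design choice; adding \p{N} to both lookarounds would treat them that way.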

Edited by czardas


#10 ·  Posted (edited)

That's what post #1 looked like; I just didn't think you were serious.

If it's random, then word boundaries wouldn't exist. Are you going to try to learn the boundaries, and then check for them if they are unknown?

Edited by iamtheky


The expression should look something like this: [symbol](Anything)[symbol]


Sure, but if your symbol is a character in another language, then you have to check how it's being used.



#13 ·  Posted (edited)

That's where some compromise is going to be needed, and the reason I targeted punctuation, spaces and delimiters. It doesn't have to be perfect. If a colony of king penguins say that's not how we write 'penguin language', then too bad. :lol:

Edited by czardas


Maybe just look for the anomalous characters, here the ones with the lowest code points. Finding a way to look at the top end as well might be tough...

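#include <Array.au3>

Local $sStr = "If I want to find the exact word 'green'"
;~ $sStr = "If I want to find the exact *word* green"
;~ $sStr = "If I want to find the (exact) word green"

; Strip all whitespace (flag 8), then convert the text to an array of character codes
Local $aStr = StringToASCIIArray(StringStripWS($sStr, 8))

; The lowest code is taken as the opening delimiter,
; and the lowest code after it as the closing delimiter
Local $min = _ArrayMinIndex($aStr, 0)
Local $min2 = _ArrayMinIndex($aStr, 0, $min + 1)
Local $sOut = ""

; Rebuild the text between the two delimiters
For $i = $min + 1 To $min2 - 1
    $sOut &= ChrW($aStr[$i])
Next

MsgBox(0, '', $sOut)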

 



#15 ·  Posted (edited)

It has to be as efficient as possible, so I'm thinking of a number of OR conditions starting with the most likely scenario: StringIsASCII() etc.

I wish I was better at RegExp. Here's a quick mock-up of the approach I am considering. It needs more work.

Local $sString = 'traa la la $green_'
Local $sFind = 'green'
MsgBox(0, "", StringRegExp($sString, '(\A.*)([\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}])(' & $sFind & ')(?2)(.*\z)'))

 

Edited by czardas


#16 ·  Posted (edited)

Actually this is better. Remove the delimiter and 'greenhouse' fails.

Local $sString = 'green|house'
Local $sFind = 'green'
$sString = StringReplace($sString, '\E', ChrW(57344), 0, 1) ; U+E000
$sFind = StringReplace($sFind, '\E', ChrW(57344), 0, 1) ; ditto

MsgBox(0, "", StringRegExp($sString, '(\A|[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]|\z)(\Q' & $sFind & '\E)(?1)'))

A further check is still needed: in case the escape sequence '\E' occurs within $sFind. [Added]
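
One common alternative to substituting a private-use character is to split the quoted run around each literal '\E'. The sketch below is only an illustration: _QuoteForRegExp() is a hypothetical helper name, not something from the thread, and the sample strings are made up. The delimiter class is the one used above.

```autoit
; Hypothetical helper: wraps $sText in \Q...\E so it is matched literally.
; Every literal '\E' inside the text closes the quote ('\E'), is matched as an
; escaped backslash followed by E ('\\E'), and then the quote is reopened ('\Q').
Func _QuoteForRegExp($sText)
    Return '\Q' & StringReplace($sText, '\E', '\E\\E\Q', 0, 1) & '\E'
EndFunc

Local $sString = 'path|C:\Extras|end'
Local $sFind = 'C:\Extras' ; contains the sequence '\E'
Local $sDelim = '[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]'

; Same shape as the pattern above: delimiter or string edge on both sides
Local $sPattern = '(\A|' & $sDelim & '|\z)' & _QuoteForRegExp($sFind) & '(?1)'
Local $iFound = StringRegExp($sString, $sPattern)
ConsoleWrite($iFound & @CRLF)
```

This leaves both the subject string and the search term unmodified, so a genuine U+E000 in the input cannot be confused with a replaced '\E'.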

Edited by czardas
Added \Q ... \E to the regexp


#17 ·  Posted (edited)

This is not the most efficient approach, and not good code on my part, but maybe it will give you another perspective or idea for reaching your final goal:

Local $sString = 'traa la la skjalkjlk lasjdlkJDLKjlj9023840928309482093jalkfjlakjflk ____kkjfkw wejflqwjrkl _ $green___ ___ 2390802983092u3roijalksfna'
Local $sFind = 'green'

; 0-based offset of the first match; -1 when $sFind is absent
$isthere = StringInStr($sString, $sFind) - 1
$totalStringSize = StringLen($sString)
$sFindSize = StringLen($sFind)

If $isthere >= 0 Then
    ; Number of characters from the start of the match to the end of the string
    $restchars = $totalStringSize - $isthere
    If $sFindSize < $restchars Then
        ; Cut off everything after the match, then print its last $sFindSize characters
        $norightchars = StringTrimRight($sString, $restchars - $sFindSize)
        ConsoleWrite(StringTrimLeft($norightchars, StringLen($norightchars) - $sFindSize) & @CRLF)
    ElseIf $sFindSize == $restchars Then
        ; The match sits at the very end of the string
        ConsoleWrite(StringTrimLeft($sString, StringLen($sString) - $sFindSize) & @CRLF)
    EndIf
EndIf

Regards
Alien.

Edited by alien4u
Fixing example code


@alien4u Thanks - all suggestions are welcome, especially because there's a lot I don't know about languages and this is a multilingual community. It may be a good idea to do a preliminary test with StringInStr() before looking for some kind of word boundary.


Hi @czardas
My code is really bad and it does not work properly, but my point is that with StringInStr() you can find where the substring is no matter what delimiter surrounds it, and based on that you could extract the word.

Regards
Alien.


#20 ·  Posted (edited)

StringRegExp() does all these steps internally and speed comparisons will likely show this to be the best method. However as @iamtheky pointed out, spaces are the most likely delimiter and more complex routines can be used when faster comparisons fail. I've spent a couple of days trying to think of how best to approach this. Thanks for the suggestions.
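
The tiered idea (cheap tests first, regular expression last) could be sketched roughly as follows. _HasWholeWord() is a hypothetical name, the tiers are one possible ordering rather than a measured optimum, and the search term is assumed not to contain '\E':

```autoit
; Hypothetical helper: True if $sFind occurs in $sString with a delimiter
; (or a string edge) on both sides. All tests are case-sensitive.
Func _HasWholeWord($sString, $sFind)
    ; Tier 1: plain substring test; if the term is absent there is nothing to check
    If Not StringInStr($sString, $sFind, 1) Then Return False

    ; Tier 2: the most likely case, the term is delimited by spaces or string edges
    If StringInStr(' ' & $sString & ' ', ' ' & $sFind & ' ', 1) Then Return True

    ; Tier 3: fall back on the ASCII delimiter class from the earlier posts
    ; (assumes $sFind contains no '\E')
    Local $sDelim = '[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]'
    Return StringRegExp($sString, '(\A|' & $sDelim & '|\z)\Q' & $sFind & '\E(?1)') = 1
EndFunc

ConsoleWrite(_HasWholeWord('a green house', 'green') & @CRLF)
ConsoleWrite(_HasWholeWord('a $green_ house', 'green') & @CRLF)
ConsoleWrite(_HasWholeWord('a greenhouse', 'green') & @CRLF)
```

Only timing on real data would show whether the two StringInStr() tiers actually save anything over calling StringRegExp() directly.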

Edited by czardas
