Define Word Boundary - SOLVED



I am looking for a way to define a word boundary in any language. The standard regular expression '\b' is not very useful for this because, by default, it doesn't match a word boundary as any natural language I am aware of would define one. It treats some non-alphabetic characters (digits and the underscore) as part of a word and hardly recognizes letters from any alphabet other than English. What is needed is something more substantial but also efficient. So far I've considered testing for spaces, punctuation and delimiters. I'm just wondering if there might be an easier way, or some trick I don't know about. Testing all non-alphabetic Unicode ranges would be rather slow and the code would be humongous. Here are the code points for punctuation in the Basic Multilingual Plane to give you an impression of how this might be done.
 

$sPunctuation = _
'\x{21}-\x{23}\x{25}-\x{2A}\x{2C}-\x{2F}\x{3A}\x{3B}\x{3F}\x{40}\x{5B}-\x{5D}\x{5F}\x{7B}\x{7D}\x{A1}\x{A7}\x{AB}\x{B6}\x{B7}\x{BB}\x{BF}\x{037E}\x{0387}' & _
'\x{055A}-\x{055F}\x{0589}\x{058A}\x{05BE}\x{05C0}\x{05C3}\x{05C6}\x{05F3}\x{05F4}\x{0609}\x{060A}\x{060C}\x{060D}\x{061B}\x{061E}\x{061F}\x{066A}-\x{066D}' & _
'\x{06D4}\x{0700}-\x{070D}\x{07F7}-\x{07F9}\x{0830}-\x{083E}\x{085E}\x{0964}\x{0965}\x{0970}\x{0AF0}\x{0DF4}\x{0E4F}\x{0E5A}\x{0E5B}\x{0F04}-\x{0F12}\x{0F14}' & _
'\x{0F3A}-\x{0F3D}\x{0F85}\x{0FD0}-\x{0FD4}\x{0FD9}\x{0FDA}\x{104A}-\x{104F}\x{10FB}\x{1360}-\x{1368}\x{1400}\x{166D}\x{166E}\x{169B}\x{169C}\x{16EB}-\x{16ED}' & _
'\x{1735}\x{1736}\x{17D4}-\x{17D6}\x{17D8}-\x{17DA}\x{1800}-\x{180A}\x{1944}\x{1945}\x{1A1E}\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B60}' & _
'\x{1BFC}-\x{1BFF}\x{1C3B}-\x{1C3F}\x{1C7E}\x{1C7F}\x{1CC0}-\x{1CC7}\x{1CD3}\x{2010}-\x{2027}\x{2030}-\x{2043}\x{2045}-\x{2051}\x{2053}-\x{205E}\x{207D}' & _
'\x{207E}\x{208D}\x{208E}\x{2329}\x{232A}\x{2768}-\x{2775}\x{27C5}\x{27C6}\x{27E6}-\x{27EF}\x{2983}-\x{2998}\x{29D8}-\x{29DB}\x{29FC}\x{29FD}\x{2CF9}-\x{2CFC}' & _
'\x{2CFE}\x{2CFF}\x{2D70}\x{2E00}-\x{2E2E}\x{2E30}-\x{2E3B}\x{3001}-\x{3003}\x{3008}-\x{3011}\x{3014}-\x{301F}\x{3030}\x{303D}\x{30A0}\x{30FB}\x{A4FE}\x{A4FF}' & _
'\x{A60D}-\x{A60F}\x{A673}\x{A67E}\x{A6F2}-\x{A6F7}\x{A874}-\x{A877}\x{A8CE}\x{A8CF}\x{A8F8}-\x{A8FA}\x{A92E}\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}\x{A9DF}' & _
'\x{AA5C}-\x{AA5F}\x{AADE}\x{AADF}\x{AAF0}\x{AAF1}\x{ABEB}\x{FD3E}\x{FD3F}\x{FE10}-\x{FE19}\x{FE30}-\x{FE52}\x{FE54}-\x{FE61}\x{FE63}\x{FE68}\x{FE6A}\x{FE6B}' & _
'\x{FF01}-\x{FF03}\x{FF05}-\x{FF0A}\x{FF0C}-\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1F}\x{FF20}\x{FF3B}-\x{FF3D}\x{FF3F}\x{FF5B}\x{FF5D}\x{FF5F}-\x{FF65}'

 

Edited by czardas
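
For comparison, here is a minimal sketch of the same idea using a Unicode property class rather than hand-listed ranges. It assumes the PCRE engine behind StringRegExp() accepts the (*UCP) option and the \p{P} (punctuation) property, which is worth confirming in the help file for your AutoIt version. The sample string is made up for illustration.

```autoit
; Sketch only: assumes (*UCP) and \p{P} are supported by this AutoIt build.
; U+00AB, U+00BB (guillemets) and U+3001 (ideographic comma) all appear in the list above.
Local $sText = ChrW(0x00AB) & "green" & ChrW(0x00BB) & ChrW(0x3001) & "house"

; Replace every character in the Unicode "Punctuation" category with a space.
Local $sClean = StringRegExpReplace($sText, '(*UCP)\p{P}', ' ')

ConsoleWrite($sClean & @CRLF)
```

\p{P} is not limited to the Basic Multilingual Plane, so it is not guaranteed to match exactly the same set of characters as the list above.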

I think space is a pretty reliable word boundary.  Why the precision?

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)


Speaking them is easy; you can leave the punctuation in if you are just going to speak them. But I'm guessing you need the polished version for something else...

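$sStr = "If I want to find the ,exact (word) 'green' in this very sentence, I'm already stuck"

; Flag 2 ($STR_NOCOUNT): return a 0-based array without the count in element 0
$aStr = StringSplit($sStr, " ", 2)

; Elements 6 to 8 still carry their punctuation: ,exact (word) 'green'
$s_text = $aStr[6] & " " & $aStr[7] & " " & $aStr[8]
$o_speech = ObjCreate("SAPI.SpVoice")
$o_speech.Speak($s_text)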

 

Edited by iamtheky


How are you eliminating all your known words first?  Is that easier, or worthwhile, leaving a smaller group to test weird boundaries?

Edited by iamtheky


"When searching for a word"... within a string of unknown delimiters? (Also, is there a language that does not use spaces in digital text?) Or within an array of words that could potentially have punctuation attached? Can it be either?

Edited by iamtheky


Searching (within a string) for a word in an unknown language (or searching for a random string), with unknown delimiters which are not letters of some kind. I define this as a word boundary whether the search term is a word or not.

Edited by czardas

That's what post #1 looked like, I just didn't think you were serious.

If it's random, then word boundaries wouldn't exist.  Are you going to try and learn the boundaries, and then check for them if they are unknown?

Edited by iamtheky


Sure, but if your symbol is a character in another language then you have to check how it's being used.


That's where some compromise is going to be needed, and the reason I targeted punctuation, spaces and delimiters. It doesn't have to be perfect. If a colony of king penguins say that's not how we write 'penguin language', then too bad. :lol:

Edited by czardas

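Maybe just find the anomalous characters: with the whitespace stripped, the lowest code points in the string are usually the punctuation. Finding a way to look at the top end of the range as well might be tough...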

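#include <Array.au3>

$sStr = "If I want to find the exact word 'green'"
;~ $sStr = "If I want to find the exact *word* green"
;~ $sStr = "If I want to find the (exact) word green"

; Strip all whitespace (flag 8), then convert to an array of code points
$aStr = StringToASCIIArray(StringStripWS($sStr, 8))

; Index of the lowest code point, then of the lowest one after it (1 = compare numerically)
$min = _ArrayMinIndex($aStr, 1)
$min2 = _ArrayMinIndex($aStr, 1, $min + 1)
$sOut = ""

; Collect the characters between the two
For $i = $min + 1 To $min2 - 1
    $sOut &= ChrW($aStr[$i])
Next

MsgBox(0, '', $sOut)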

 


It has to be as efficient as possible, so I'm thinking of a series of OR conditions starting with the most likely scenario: StringIsASCII() etc...

I wish I were better at RegExp. Here's a quick mock-up of the approach I am considering. It needs more work.

Local $sString = 'traa la la $green_'
Local $sFind = 'green'
MsgBox(0, "", StringRegExp($sString, '(\A.*)([\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}])(' & $sFind & ')(?2)(.*\z)'))

 

Edited by czardas
Link to comment
Share on other sites

Actually, this is better: remove the '|' delimiter from the string and 'greenhouse' correctly fails to match.

Local $sString = 'green|house'
Local $sFind = 'green'
$sString = StringReplace($sString, '\E', ChrW(57344), 0, 1) ; U+E000
$sFind = StringReplace($sFind, '\E', ChrW(57344), 0, 1) ; ditto

MsgBox(0, "", StringRegExp($sString, '(\A|[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]|\z)(\Q' & $sFind & '\E)(?1)'))

A further check is still needed: in case the escape sequence '\E' occurs within $sFind. [Added]

Edited by czardas
Added \Q ... \E to the regexp
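
A possible variation on the pattern above, sketched under assumptions rather than tested across AutoIt versions: instead of listing ASCII delimiters, it uses lookarounds that reject a match when a Unicode letter (\p{L}) or combining mark (\p{M}) sits directly before or after the search term. It assumes StringRegExp() supports (*UCP) and \p{..}. The function name _FindWholeWord is hypothetical, and the \E handling is a different technique from the U+E000 substitution used above (it closes the quote, matches a literal \E, and reopens the quote).

```autoit
; Hypothetical helper, not part of AutoIt: True if $sFind occurs in $sString
; with no Unicode letter or combining mark directly before or after it.
Func _FindWholeWord($sString, $sFind)
    ; Keep a literal \E inside $sFind from ending the \Q...\E quote early.
    Local $sQuoted = '\Q' & StringReplace($sFind, '\E', '\E\\E\Q', 0, 1) & '\E'
    ; Assumption: (*UCP) and \p{L}, \p{M} are available in this PCRE build.
    Local $sPattern = '(*UCP)(?<![\p{L}\p{M}])' & $sQuoted & '(?![\p{L}\p{M}])'
    Return StringRegExp($sString, $sPattern) = 1
EndFunc

ConsoleWrite(_FindWholeWord('green|house', 'green') & @CRLF) ; delimiter present
ConsoleWrite(_FindWholeWord('greenhouse', 'green') & @CRLF)  ; no delimiter
```

As in the pattern above, digits and the underscore count as delimiters here; adding \p{N} to both classes would treat digits as part of a word instead.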

This is not the most efficient approach, nor my best code, but it may give you another perspective or an idea for reaching your final goal:

Local $sString = 'traa la la skjalkjlk lasjdlkJDLKjlj9023840928309482093jalkfjlakjflk ____kkjfkw wejflqwjrkl _ $green___ ___ 2390802983092u3roijalksfna'
Local $sFind = 'green'

; 0-based offset of the first match, or -1 if $sFind is not present
$isthere = StringInStr($sString, $sFind) - 1
$totalStringSize = StringLen($sString)
$sFindSize = StringLen($sFind)

If $isthere >= 0 Then
    ; Number of characters from the start of the match to the end of the string
    $restchars = $totalStringSize - $isthere
    If $sFindSize < $restchars Then
        ; Cut everything after the match, then print the last $sFindSize characters
        $norightchars = StringTrimRight($sString, $restchars - $sFindSize)
        ConsoleWrite(StringTrimLeft($norightchars, StringLen($norightchars) - $sFindSize) & @CRLF)
    ElseIf $sFindSize = $restchars Then
        ; The match ends the string
        ConsoleWrite(StringTrimLeft($sString, StringLen($sString) - $sFindSize) & @CRLF)
    EndIf
EndIf

Regards
Alien.

Edited by alien4u
Fixing example code

StringRegExp() does all these steps internally, and speed comparisons will likely show it to be the best method. However, as @iamtheky pointed out, spaces are the most likely delimiter, and more complex routines can be used when faster comparisons fail. I've spent a couple of days trying to think of how best to approach this. Thanks for the suggestions.

Edited by czardas
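
One way the "fast comparisons first, regex as a fallback" idea might be laid out, as a rough sketch only. The function name _HasWord is hypothetical; the delimiter class and the U+E000 substitution for '\E' are taken from the pattern posted earlier in the thread, and the comparison is case-sensitive throughout.

```autoit
; Hypothetical helper: cheapest tests first, regex only when they are inconclusive.
Func _HasWord($sString, $sFind)
    ; 1. If the term is not present at all, stop immediately.
    If Not StringInStr($sString, $sFind, 1) Then Return False

    ; 2. Most likely case: the term is surrounded by plain spaces (or string ends).
    If StringInStr(' ' & $sString & ' ', ' ' & $sFind & ' ', 1) Then Return True

    ; 3. Fallback: ASCII delimiter class from the earlier post.
    $sString = StringReplace($sString, '\E', ChrW(57344), 0, 1) ; U+E000
    $sFind = StringReplace($sFind, '\E', ChrW(57344), 0, 1) ; ditto
    Local $sDelim = '[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]'
    Local $sPattern = '(?:\A|' & $sDelim & ')\Q' & $sFind & '\E(?:\z|' & $sDelim & ')'
    Return StringRegExp($sString, $sPattern) = 1
EndFunc

ConsoleWrite(_HasWord('the green house', 'green') & @CRLF)    ; settled at step 2
ConsoleWrite(_HasWord('traa la la $green_', 'green') & @CRLF) ; needs step 3
ConsoleWrite(_HasWord('greenhouse', 'green') & @CRLF)         ; rejected at step 3
```

Whether the extra StringInStr() calls actually save time over calling StringRegExp() directly would need measuring on real data.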