Posted (edited)

I am looking for a way to define a word boundary in any language. The standard regular expression '\b' is not very useful for this, because it doesn't correspond to a word boundary in any natural language that I am aware of. By default it treats some non-alphabetic characters (digits and the underscore) as part of a word, and it hardly recognizes letters from any alphabet other than English. What is needed is something more substantial but also efficient. So far I've considered testing for spaces, punctuation and delimiters. I'm just wondering if there might be an easier way, or some trick I don't know about. Testing all non-alphabetic Unicode ranges would be rather slow and the code would be humongous. Here are the code points for punctuation in the Basic Multilingual Plane (plane 0) to give you an impression of how this might be done.
 

$sPunctuation = _
'\x{21}-\x{23}\x{25}-\x{2A}\x{2C}-\x{2F}\x{3A}\x{3B}\x{3F}\x{40}\x{5B}-\x{5D}\x{5F}\x{7B}\x{7D}\x{A1}\x{A7}\x{AB}\x{B6}\x{B7}\x{BB}\x{BF}\x{037E}\x{0387}' & _
'\x{055A}-\x{055F}\x{0589}\x{058A}\x{05BE}\x{05C0}\x{05C3}\x{05C6}\x{05F3}\x{05F4}\x{0609}\x{060A}\x{060C}\x{060D}\x{061B}\x{061E}\x{061F}\x{066A}-\x{066D}' & _
'\x{06D4}\x{0700}-\x{070D}\x{07F7}-\x{07F9}\x{0830}-\x{083E}\x{085E}\x{0964}\x{0965}\x{0970}\x{0AF0}\x{0DF4}\x{0E4F}\x{0E5A}\x{0E5B}\x{0F04}-\x{0F12}\x{0F14}' & _
'\x{0F3A}-\x{0F3D}\x{0F85}\x{0FD0}-\x{0FD4}\x{0FD9}\x{0FDA}\x{104A}-\x{104F}\x{10FB}\x{1360}-\x{1368}\x{1400}\x{166D}\x{166E}\x{169B}\x{169C}\x{16EB}-\x{16ED}' & _
'\x{1735}\x{1736}\x{17D4}-\x{17D6}\x{17D8}-\x{17DA}\x{1800}-\x{180A}\x{1944}\x{1945}\x{1A1E}\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B60}' & _
'\x{1BFC}-\x{1BFF}\x{1C3B}-\x{1C3F}\x{1C7E}\x{1C7F}\x{1CC0}-\x{1CC7}\x{1CD3}\x{2010}-\x{2027}\x{2030}-\x{2043}\x{2045}-\x{2051}\x{2053}-\x{205E}\x{207D}' & _
'\x{207E}\x{208D}\x{208E}\x{2329}\x{232A}\x{2768}-\x{2775}\x{27C5}\x{27C6}\x{27E6}-\x{27EF}\x{2983}-\x{2998}\x{29D8}-\x{29DB}\x{29FC}\x{29FD}\x{2CF9}-\x{2CFC}' & _
'\x{2CFE}\x{2CFF}\x{2D70}\x{2E00}-\x{2E2E}\x{2E30}-\x{2E3B}\x{3001}-\x{3003}\x{3008}-\x{3011}\x{3014}-\x{301F}\x{3030}\x{303D}\x{30A0}\x{30FB}\x{A4FE}\x{A4FF}' & _
'\x{A60D}-\x{A60F}\x{A673}\x{A67E}\x{A6F2}-\x{A6F7}\x{A874}-\x{A877}\x{A8CE}\x{A8CF}\x{A8F8}-\x{A8FA}\x{A92E}\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}\x{A9DF}' & _
'\x{AA5C}-\x{AA5F}\x{AADE}\x{AADF}\x{AAF0}\x{AAF1}\x{ABEB}\x{FD3E}\x{FD3F}\x{FE10}-\x{FE19}\x{FE30}-\x{FE52}\x{FE54}-\x{FE61}\x{FE63}\x{FE68}\x{FE6A}\x{FE6B}' & _
'\x{FF01}-\x{FF03}\x{FF05}-\x{FF0A}\x{FF0C}-\x{FF0F}\x{FF1A}\x{FF1B}\x{FF1F}\x{FF20}\x{FF3B}-\x{FF3D}\x{FF3F}\x{FF5B}\x{FF5D}\x{FF5F}-\x{FF65}'
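One possible shortcut, offered as a sketch rather than a verified drop-in: the PCRE engine behind StringRegExp() understands Unicode property classes, so the general category \p{P} (punctuation) could stand in for a hand-maintained range list. The exact set of code points covered depends on the Unicode tables of the PCRE version bundled with a given AutoIt release, so it may not match the list above character for character. The sample characters below are chosen purely for illustration.

```autoit
; Sketch: test single characters against the Unicode punctuation category.
; Assumption: the bundled PCRE build has Unicode property support enabled.
; Samples: '!', ideographic comma U+3001, Latin 'a', Cyrillic Zhe U+0416
Local $aSamples[4] = ['!', ChrW(0x3001), 'a', ChrW(0x0416)]

For $i = 0 To UBound($aSamples) - 1
    ; Mode 0 (the default) returns 1 on a match and 0 otherwise
    ConsoleWrite('U+' & Hex(AscW($aSamples[$i]), 4) & ' punctuation: ' & _
            StringRegExp($aSamples[$i], '^\p{P}$') & @CRLF)
Next
```

The StringRegExp() documentation also describes a (*UCP) pattern prefix that makes \b and \w Unicode-aware, although \w still counts digits and the underscore as word characters.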

 

Edited by czardas
Posted

I think space is a pretty reliable word boundary.  Why the precision?

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)

Posted (edited)

If I want to find the exact word 'green' in this very sentence, I'm already stuck. I don't speak all languages and can't assume that a word may be preceded by the dollar sign in AutoIt (although that's not punctuation :think:).

Edited by czardas
Posted (edited)

Speaking them is easy, you can leave the punctuation if you are just going to speak them. But, I'm guessing you need the polished version for something else...

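$sStr = "If I want to find the ,exact (word) 'green' in this very sentence, I'm already stuck"

; Flag 2: return a zero-based array without the count in element 0
$aStr = StringSplit($sStr, " ", 2)

; Elements 6 to 8 are ",exact", "(word)" and "'green'", punctuation still attached
$s_text = $aStr[6] & " " & $aStr[7] & " " & $aStr[8]
$o_speech = ObjCreate("SAPI.SpVoice")
$o_speech.Speak($s_text)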

 

Edited by iamtheky

Posted (edited)

How are you eliminating all your known words first?  Is that easier, or worthwhile, leaving a smaller group to test weird boundaries?

Edited by iamtheky

Posted (edited)

The user knows/inputs the word. When searching for the word, the code is meant to prevent a false positive match when the word appears within a larger word. So a definition of word boundary is needed.

Edited by czardas
Posted (edited)

"When searching for a word"... within a string of unknown delimiters? (Also, is there a language that does not use spaces in digital text?) Or within an array of words that could potentially have punctuation attached? Can it be either?

Edited by iamtheky

Posted (edited)

Searching (within a string) for a word in an unknown language (or searching for a random string), with unknown delimiters which are not letters of some kind. I define this as a word boundary whether the search term is a word or not.

Edited by czardas
Posted (edited)

That's what post #1 looked like, I just didn't think you were serious.

If it's random, then word boundaries wouldn't exist.  Are you going to try to learn the boundaries, and then check for them if they are unknown?

Edited by iamtheky

Posted

Sure, but if your symbol is a character in another language, then you have to check how it's being used.

Posted (edited)

That's where some compromise is going to be needed, and the reason I targeted punctuation, spaces and delimiters. It doesn't have to be perfect. If a colony of king penguins say that's not how we write 'penguin language', then too bad. :lol:

Edited by czardas
Posted

Maybe just finding the anomalous characters (the lowest character codes); finding a way to look at the top end of the range as well might be tough...

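#include <Array.au3>

$sStr = "If I want to find the exact word 'green'"
;~ $sStr = "If I want to find the exact *word* green"
;~ $sStr = "If I want to find the (exact) word green"

; Strip all whitespace (flag 8), then convert to an array of character codes
$aStr = StringToASCIIArray(StringStripWS($sStr, 8))

; Index of the lowest character code, then of the lowest code in the remainder
$min = _ArrayMinIndex($aStr, 0)
$min2 = _ArrayMinIndex($aStr, 0, $min + 1)
$sOut = ""

; Rebuild the text enclosed between those two positions
For $i = $min + 1 To $min2 - 1
    $sOut &= ChrW($aStr[$i])
Next

MsgBox(0, '', $sOut)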

 

Posted (edited)

It has to be as efficient as possible, so I'm thinking of a number of OR conditions starting with the most likely scenario: StringIsASCII() etc...

I wish I was better at RegExp. Here's a quick mock-up of the approach I am considering. It needs more work.

Local $sString = 'traa la la $green_'
Local $sFind = 'green'
MsgBox(0, "", StringRegExp($sString, '(\A.*)([\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}])(' & $sFind & ')(?2)(.*\z)'))

 

Edited by czardas
Posted (edited)

Actually this is better. Remove the delimiter and 'greenhouse' fails.

Local $sString = 'green|house'
Local $sFind = 'green'
$sString = StringReplace($sString, '\E', ChrW(57344), 0, 1) ; U+E000
$sFind = StringReplace($sFind, '\E', ChrW(57344), 0, 1) ; ditto

MsgBox(0, "", StringRegExp($sString, '(\A|[\x{00}-\x{40}\x{5B}-\x{60}\x{7B}-\x{7E}]|\z)(\Q' & $sFind & '\E)(?1)'))

A further check is still needed: in case the escape sequence '\E' occurs within $sFind. [Added]
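One common way to handle a literal '\E' inside the search term is to close the quote, emit an escaped backslash followed by E, and reopen the quote. The sketch below combines that with lookarounds on Unicode categories instead of an ASCII range. It is an illustration under assumptions, not the approach settled on in this thread: `_IsWholeWord` is a hypothetical helper name, and treating \p{L} (letters) and \p{M} (combining marks) as the only "word" characters is one possible choice, which means digits and underscores count as boundaries.

```autoit
; Hypothetical helper: True if $sFind occurs in $sString with no letter
; (or combining mark) immediately before or after it.
Func _IsWholeWord($sString, $sFind)
    If StringLen($sFind) = 0 Then Return False
    ; Keep a literal "\E" in the search term from closing the \Q...\E quote early
    Local $sQuoted = '\Q' & StringReplace($sFind, '\E', '\E\\E\Q', 0, 1) & '\E'
    ; The lookarounds also succeed at the very start and end of the string
    Local $sPattern = '(?<![\p{L}\p{M}])' & $sQuoted & '(?![\p{L}\p{M}])'
    Return StringRegExp($sString, $sPattern) = 1
EndFunc   ;==>_IsWholeWord

ConsoleWrite(_IsWholeWord('traa la la $green_', 'green') & @CRLF) ; term delimited by $ and _
ConsoleWrite(_IsWholeWord('greenhouse', 'green') & @CRLF) ; term followed by a letter
```

Because the lookarounds consume no characters, no alternation with \A and \z is needed, and the search term can be the whole string.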

Edited by czardas
Added \Q ... \E to the regexp
Posted (edited)

This is not the most efficient approach, and it's not good code on my part, but maybe it will give you another perspective or idea for reaching your final goal:

#include<array.au3>
Local $sString = 'traa la la skjalkjlk lasjdlkJDLKjlj9023840928309482093jalkfjlakjflk ____kkjfkw wejflqwjrkl _ $green___ ___ 2390802983092u3roijalksfna'
Local $sFind = 'green'

$isthere = StringInStr($sString,$sFind) - 1
$totalStringSize = StringLen($sString)
$sFindSize = StringLen($sFind)

If $isthere < $totalStringSize Then
    $restchars = $totalStringSize - $isthere
    If $sFindSize < $restchars Then
        $norightchars = StringTrimRight($sString, $restchars - $sFindSize)
        $isthereagain = StringInStr($norightchars,$sFind) -1
        ConsoleWrite(StringTrimLeft($norightchars,StringLen($norightchars) - $sFindSize)&@CRLF)
    ElseIf $sFindSize == $restchars Then
        ConsoleWrite(StringTrimLeft($sString,StringLen($sString) - $sFindSize)&@CRLF)
    EndIf
EndIf

Regards
Alien.

Edited by alien4u
Fixing example code
Posted

@alien4u Thanks - all suggestions are welcome, especially because there's a lot I don't know about languages and this is a multilingual community. It may be a good idea to do a preliminary test with StringInStr() before looking for some kind of word boundary.

Posted

Hi @czardas
My code is really bad and does not work properly, but my point is that StringInStr() will find where the substring is no matter what delimiter surrounds it, and based on that you could extract the word.

Regards
Alien.

Posted (edited)

StringRegExp() does all these steps internally, and speed comparisons will likely show this to be the best method. However, as @iamtheky pointed out, spaces are the most likely delimiter, and more complex routines can be used when faster comparisons fail. I've spent a couple of days trying to think of how best to approach this. Thanks for the suggestions.
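The tiered idea described here (a cheap StringInStr() test first, then the space-delimited case, then a slower general check) might look like the sketch below. `_FindWordTiered` is a hypothetical name, and using the Unicode categories \p{L} and \p{M} to decide what counts as part of a word is an assumption rather than something agreed in the thread. No timing claim is made; the relative speed of the tiers would need measuring.

```autoit
; Hypothetical helper sketching a tiered search: cheapest tests first.
Func _FindWordTiered($sString, $sFind)
    Local $iLen = StringLen($sFind)
    If $iLen = 0 Then Return False

    ; Tier 1: the term does not occur at all (case-sensitive search)
    Local $iPos = StringInStr($sString, $sFind, 1)
    If $iPos = 0 Then Return False

    ; Tier 2: the most likely case, the term is delimited by spaces or string ends
    If StringInStr(' ' & $sString & ' ', ' ' & $sFind & ' ', 1) Then Return True

    ; Tier 3: inspect the characters next to every occurrence
    Local $sNeighbours
    While $iPos > 0
        $sNeighbours = ''
        If $iPos > 1 Then $sNeighbours &= StringMid($sString, $iPos - 1, 1)
        If $iPos + $iLen <= StringLen($sString) Then $sNeighbours &= StringMid($sString, $iPos + $iLen, 1)
        ; Accept the occurrence when neither neighbour is a letter or combining mark
        If Not StringRegExp($sNeighbours, '[\p{L}\p{M}]') Then Return True
        ; Otherwise continue from the next character
        $iPos = StringInStr($sString, $sFind, 1, 1, $iPos + 1)
    WEnd
    Return False
EndFunc   ;==>_FindWordTiered

ConsoleWrite(_FindWordTiered('the green house', 'green') & @CRLF) ; settled by tier 2
ConsoleWrite(_FindWordTiered('green|house', 'green') & @CRLF) ; needs tier 3
ConsoleWrite(_FindWordTiered('greenhouse', 'green') & @CRLF) ; every occurrence is inside a larger word
```

Since the regular expression only ever sees the one or two neighbouring characters, the search term itself never needs escaping in this variant.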

Edited by czardas
