
Specify characters to be interpreted literally in StringRegExp


leuce

Hello

I'm trying to get a script to search a string for a search query on a whole-words-only basis.

This means that I would use something like StringRegExp($string, "\b" & $query & "\b"). However, I have no control over what $query will be: it may contain characters that "mean" something in regular expressions. For example, it may contain a backslash or a full stop, but I don't want the backslash or full stop to mean what they usually mean in regular expressions.

Is there a way to use regular expression search while specifying that a certain portion of it should always be read literally?  Or is my only solution to make a list of potential special characters and then escape them?

Here's some sample code, in case my explanation above is insufficiently clear:

$string1 = "asdf bcd asdf"
$string2 = "asdf .c. asdf"
$query = ".c."
$foo1 = StringRegExp ($string1, "\b" & $query & "\b", 1)
MsgBox (0, "", $foo1[0], 0) ; I want it to fail, but it returns "bcd"
$foo2 = StringRegExp ($string2, "\b" & $query & "\b", 1)
MsgBox (0, "", $foo2[0], 0) ; I want it to return ".c.", but it fails

If my only solution is to escape characters, do you happen to know of a ready list of characters that must be escaped?  For the moment it appears to me to be "[](){}?.\^$*+|".
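For the escaping route, one possible sketch is to backslash-escape every metacharacter in the query with StringRegExpReplace before concatenating it into the pattern. The helper name _EscapeRegex is made up for illustration, and the character class is an assumption based on the list above (plus the hyphen), not a vetted, exhaustive set.

```autoit
; Hypothetical helper: prefix each regex metacharacter in $sText with a backslash.
; The character class mirrors the list above plus "-"; treat it as an assumption.
Func _EscapeRegex($sText)
    ; In the replacement string, "\\" stands for one literal backslash
    ; and $1 is the captured metacharacter.
    Return StringRegExpReplace($sText, "([\\\[\](){}?.^$*+|-])", "\\$1")
EndFunc

Local $sQuery = ".c."
; The escaped query should now have a backslash in front of each dot.
ConsoleWrite(_EscapeRegex($sQuery) & @CRLF)
```

The escaped result can then be placed between whatever boundary assertions are chosen.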

Thanks

Samuel

Edited by leuce

Look at \Q .... \E in the help file (turn off special characters between these instructions). If these exact sequences appear in the pattern to be tested, then you might need to make some replacements in the test string first, and change the pattern accordingly. You will have to experiment and see what suits your requirements.

BTW, '.c.' begins and ends with a non-word character and is surrounded by spaces, so '\b' (which needs a word character on one side) never matches at its edges in this example.

Local $string2 = "asdf .c. asdf"
Local $query = ".c."
Local $foo2 = StringRegExp($string2, "(\Q" & $query & "\E)", 1)
MsgBox(0, "", $foo2[0], 0) ; with \Q...\E this returns ".c."

Edited by czardas

Why is StringInStr insufficient for this task?

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)


Thanks for all your replies.

The script that I'm writing will perform searches in paragraphs from files. The user specifies the search query, along with options such as whether the search is case-sensitive and whether only whole words should be matched. If the user were to specify "match whole words only", then StringInStr can't be used.

For example, in the string "the overt rover is over", a StringInStr query for "over" will always match "overt", "rover" and "over". But if the user wants to match only "over" (i.e. whole words only, and not "overt" and "rover", which contain "over" but aren't "over" by themselves), then StringInStr can't be used (AFAIK).
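The difference can be sketched as follows, using the example sentence above. Variable names are arbitrary, and this is only a minimal illustration of \b on a plain alphabetic query.

```autoit
Local $sText = "the overt rover is over"

; StringInStr reports the first occurrence, which sits inside "overt" (position 5).
ConsoleWrite(StringInStr($sText, "over") & @CRLF)

; \b needs a word/non-word transition on both sides,
; so only the standalone "over" at the end qualifies.
Local $aHits = StringRegExp($sText, "\bover\b", 3) ; flag 3: array of all matches
If IsArray($aHits) Then ConsoleWrite(UBound($aHits) & " whole-word match(es)" & @CRLF)
```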


; Whole words

$Frmt_UserInput = " " & $userstring & " "

; pad the source too, so a match at the very start or end of $string still counts
StringInStr(" " & $string & " ", $Frmt_UserInput)

But regex works too.


The part about word boundaries needs a clear definition. If the search string begins or ends with a symbol, then you need to define what counts as a boundary yourself. You might want to define spaces, or the start and end of the source string, as boundaries.
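One way to make such a definition explicit is to replace \b with lookarounds. The sketch below treats "not next to a word character" as the boundary, which also covers the start and end of the source string. The helper name _MatchWhole is hypothetical, and the code assumes the query does not itself contain '\E'. If only whitespace should count as a boundary, \S could be used in place of \w.

```autoit
; Hypothetical helper: whole-"word" test where a boundary is anything that is
; not a word character, or the start/end of the source string.
; Assumes $sQuery does not contain "\E".
Func _MatchWhole($sSource, $sQuery)
    ; (?<!\w) = not preceded by a word character; (?!\w) = not followed by one.
    Local $sPattern = "(?<!\w)\Q" & $sQuery & "\E(?!\w)"
    Return StringRegExp($sSource, $sPattern) ; flag 0: 1 on match, 0 otherwise
EndFunc

ConsoleWrite(_MatchWhole("asdf .c. asdf", ".c.") & @CRLF) ; 1
ConsoleWrite(_MatchWhole("asdf bcd asdf", ".c.") & @CRLF) ; 0
ConsoleWrite(_MatchWhole("the overt rover is over", "over") & @CRLF) ; 1
```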

Edited by czardas

Especially with literals. Without decent knowledge of the target we can edge-case it for days, especially if it's not all English.


Well, I thought that since AutoIt does contain the concept of "word boundaries" in StringRegExp, I might as well make use of it.  But if it turns out to be unreliable, then my other option would be (as iamtheky and czardas appear to point to) to specify word boundaries myself.  These would be spaces, tabs, line breaks, all punctuation marks, and probably hyphens too.

And obviously things would get extra complicated once you get to non-Latin scripts, but (and perhaps I should have said so, sorry) my intended user uses a language that uses a Latin script.  ATM I'm the only user :-p

I'm calling it a night, but here (attached) is what I have at this stage (not yet taking into account any of your comments... that's for tomorrow).

WFTM delete segs.zip


Our help file indicates which characters count as word characters for \b. Also, by using (*UCP) you can significantly extend what \b means.

I also wanted to add that \b is in no way "unreliable". It just may not be the criterion you need, but as czardas points out, there are ways to overcome this.
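A small sketch of the (*UCP) effect, assuming the default behaviour where \w covers only ASCII letters, digits and underscore. The sample text is invented, and ChrW is used so the result does not depend on the script file's encoding.

```autoit
; ChrW(233) is "é".
Local $sText = "caf" & ChrW(233) & "s are open"
Local $sWord = "caf" & ChrW(233)

; Without (*UCP), "é" is not a word character, so the pattern is expected
; to see a boundary between "é" and "s" and report a whole-word hit.
ConsoleWrite(StringRegExp($sText, "\b" & $sWord & "\b") & @CRLF)

; With (*UCP), \w and \b follow Unicode properties, so "é" counts as a word
; character and the same query is expected not to match inside the longer word.
ConsoleWrite(StringRegExp($sText, "(*UCP)\b" & $sWord & "\b") & @CRLF)
```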

Edited by jchd


Also if you say the safe word regexp will stop whatever it is doing to you :)


  • 3 weeks later...
On 12/19/2016 at 11:45 PM, iamtheky said:

; Whole words

$Frmt_UserInput = " " & $userstring & " "

; pad the source too, so a match at the very start or end of $string still counts
StringInStr(" " & $string & " ", $Frmt_UserInput)

But regex works too.

Thanks for the reply, but your example assumes that the only word boundary character is a space. Words can also be bounded by punctuation :-)


On 12/19/2016 at 11:03 PM, czardas said:

Look at \Q .... \E in the help file (turn off special characters between these instructions).

Local $string2 = "asdf .c. asdf"
Local $query = ".c."
Local $foo2 = StringRegExp($string2, "(\Q" & $query & "\E)", 1)
MsgBox(0, "", $foo2[0], 0) ; with \Q...\E this returns ".c."

Thanks, I did not realise that a variable's content is treated as part of the pattern when it is used in a regular expression. I had thought that I could literalise characters by placing them in a variable. Your tip to use \Q and \E to literalise characters in the regular expression is helpful.


If the search criteria contain the sequence '\E', then you will first need to make replacements using StringReplace(). A good replacement character would be one from the Unicode Private Use Area. In ArrayWorkshop, I used ChrW(57344) [= U+E000] as a replacement character. Of course, you might have to undo the replacements afterwards. This depends on how your code is written.

Edited by czardas

46 minutes ago, czardas said:

If the search criteria contains the sequence '\E', then you will need to first make replacements using StringReplace().

In the case where the search criteria contain \E, why not just replace \E with \E\\E\Q in the pattern? For a query consisting only of \E, the pattern would then look like \Q\E\\E\Q\E. (I'm not sure I understand what you are saying.)


Maybe, because there's a very small chance that U+E000 is also within the source, or the search string. Although I can't think of many legitimate reasons for private range Unicode characters to be part of a search pattern. There's still a chance that the sequence '\E\\E\Q' already exists, and things can get kind of messy. That's why I prefer my imperfect solution.

Edit: After thinking about it, I believe your solution should work regardless.
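A minimal sketch of that replacement approach, for reference. The helper name _QuoteLiteral and the sample strings are made up for illustration.

```autoit
; Hypothetical helper: wrap arbitrary text in \Q...\E so it is matched literally,
; even if the text itself contains "\E".
Func _QuoteLiteral($sText)
    ; Close the quoted run, emit an escaped backslash plus "E", then reopen quoting.
    ; Last argument 1 = case-sensitive, so a lowercase "\e" is left alone.
    Local $sSafe = StringReplace($sText, "\E", "\E\\E\Q", 0, 1)
    Return "\Q" & $sSafe & "\E"
EndFunc

Local $sQuery = "a\Eb.c"
Local $sPattern = _QuoteLiteral($sQuery) ; \Qa\E\\E\Qb.c\E
ConsoleWrite(StringRegExp("xx a\Eb.c yy", $sPattern) & @CRLF) ; 1
ConsoleWrite(StringRegExp("xx a\Ebxc yy", $sPattern) & @CRLF) ; 0
```

Because nothing is substituted into the source string, there are no placeholder characters to undo afterwards.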

Edited by czardas