leuce

How to redefine "word boundary" in regex, or define a new character type

Hello everyone

I'm processing text files using AutoIt scripts, and in one such process I want to replace all standalone numbers with "00".  In other words, where a number is a "word", I want the number replaced with "00".  

I'm using the regex \b\d+\b.  However, this matches at places that I do not consider "word boundaries".  For example, the date 12-Jan-15 is not "three words" in my view of what a "word" is, but AutoIt thinks it is.  Similarly, although "x; 123" and "y, 234" are two "words" each in my view, "x;123" and "y,234" are only one word each, yet AutoIt treats "x;123" and "y,234" as two "words" each.

I think the solution to my problem would be to redefine the meaning of "\b", or to define some other class that uses a different letter in regex.  Is there a way to do that?  It would simplify the regular expressions later in my script if I could define that once, earlier in the script.

Thanks

Samuel

1 hour ago, leuce said:

I'm processing text files using AutoIt scripts, and in one such process I want to replace all standalone numbers with "00".  In other words, where a number is a "word", I want the number replaced with "00".  

Your statement is contradictory.

 

Should the result look like this?

"This is a test created on  12-Jan-15 with 1234 chars." -> "This is a test created on  12-Jan-15 with 00 chars."

 

If not, please give an example.



leuce,

It's up to you to define precisely what you consider a "word boundary". In PCRE (AutoIt's regex engine), \b has very precise semantics, and I understand that they don't fit your need. Ask yourself what your "word boundary" really means and try it on the following test, then on any other text you can come up with:

"Pi is a transcendental value close to 3.1415926 that is as simple as 1+2=3 or rather 1 + 2 = 3."

Is a "stand-alone number" necessarily preceded and followed by a space? Or by some other set of characters, or does it depend on other rules, as with the final 3 above, which is followed by a full stop?
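To see where \b actually finds boundaries, the original pattern can be run on the test sentence above and on the examples from the first post. This is only a minimal sketch; the sample strings other than the test sentence are shortened from the first post for illustration.

```autoit
; \b matches between a word character (letter, digit, underscore) and a
; non-word character, so ".", "+", "=", "-" and ";" all create boundaries.
Local $sPattern = "\b\d+\b"

ConsoleWrite(StringRegExpReplace("Pi is a transcendental value close to 3.1415926 that is as simple as 1+2=3 or rather 1 + 2 = 3.", $sPattern, "00") & @LF)
; -> Pi is a transcendental value close to 00.00 that is as simple as 00+00=00 or rather 00 + 00 = 00.

ConsoleWrite(StringRegExpReplace("created on 12-Jan-15", $sPattern, "00") & @LF)
; -> created on 00-Jan-00

ConsoleWrite(StringRegExpReplace("x;123 and y, 234", $sPattern, "00") & @LF)
; -> x;00 and y, 00
```

Every digit run is replaced, including both halves of the decimal and both ends of the date, which is exactly the behaviour the first post objects to.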

Formalize your need and the translation into regex will certainly follow.



Your problem is with regex, not with AutoIt.  I agree with the sages above: your example is not clear.  Perhaps post a sample of the source text together with the outcome you would EXPECT after the replacements?

Also, practice your Regex here:

https://regex101.com/


Skysnake


Here is my solution, based on my understanding of the problem.

Local $sTest = "334 This is a 1 test created on 12-Jan-15 with 1234 characters (3.456) and x;123.321 and y: 456 and z,789."

ConsoleWrite(StringRegExpReplace($sTest, "(?<=\s|^|\(|;|,|-)\d+(\.\d+)?(?=\s|$|\.|,|\)|-)", "00") & @LF)
#cs
This reg. exp. pattern matches "\d+", an integer, or "\d+(\.\d+)?", a decimal number, only if that number is preceded by:-
    "\s" a whitespace character (space, tab, linefeed), or "^" the beginning of the string, or "\(" an opening bracket, ";" a semicolon, "," a comma, or "-" a dash.
And, the matching number must also be followed by:-
    "\s" a whitespace character, or "$" the end of the string, or "\." a dot, "," a comma, "\)" a closing bracket, or "-" a dash.

Returns:-
00 This is a 00 test created on 00-Jan-00 with 00 characters (00) and x;00 and y: 00 and z,00.
#ce
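The same rule can be sketched more compactly with negated character classes inside negative lookarounds, so that each side is one set of allowed neighbours instead of a chain of alternatives. This is an assumed rewrite for illustration, not part of the post above; on this test string it gives the same result.

```autoit
Local $sTest = "334 This is a 1 test created on 12-Jan-15 with 1234 characters (3.456) and x;123.321 and y: 456 and z,789."

; (?<![^\s(;,-])  the character before the number, if any, must be whitespace or one of ( ; , -
; (?![^\s.,)-])   the character after the number, if any, must be whitespace or one of . , ) -
; Start and end of string pass automatically because there is no character to reject.
Local $sPattern = "(?<![^\s(;,-])\d+(?:\.\d+)?(?![^\s.,)-])"

ConsoleWrite(StringRegExpReplace($sTest, $sPattern, "00") & @LF)
; -> 00 This is a 00 test created on 00-Jan-00 with 00 characters (00) and x;00 and y: 00 and z,00.
```

Changing what counts as a boundary then only means adding or removing characters inside the two classes.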

 

On Tuesday, January 26, 2016 at 10:11 AM, leuce said:

For example, a date 12-Jan-15 is not "three words", in my view of what a "word" is, but AutoIt thinks so.

Malkey, your code doesn't fit the "explanations" in post #1.
It should probably be *something* like this (but the requirements are really imprecise):

$s = "Pi is a transcendental value close to 3.1415926 that is on 12-Jan-15 as simple as 1+2=3 or rather 10 + 20 = 30."

ConsoleWrite(StringRegExpReplace($s, '(?<=^|\s)\d*\.?\d+(?=[.,;\s]|$)', "00") )
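As for the original question of defining the boundary once near the top of the script: PCRE does not let a script change what \b means, but the lookarounds can be stored in variables and concatenated into later patterns. The sketch below assumes the strictest reading of post #1, where a "word" is any run of non-whitespace characters; the names $WB_LEFT and $WB_RIGHT are made up for this example and are not AutoIt or PCRE built-ins.

```autoit
; Hypothetical names: define the custom "word boundary" once.
Local Const $WB_LEFT = "(?<!\S)"   ; start of string, or preceded by whitespace
Local Const $WB_RIGHT = "(?!\S)"   ; end of string, or followed by whitespace

Local $sText = "x;123 and y, 234 on 12-Jan-15 with 1234 chars"

; Reuse the fragments wherever \b would have been written.
ConsoleWrite(StringRegExpReplace($sText, $WB_LEFT & "\d+" & $WB_RIGHT, "00") & @LF)
; -> x;123 and y, 00 on 12-Jan-15 with 00 chars
```

Under this assumption a number followed directly by punctuation, such as the final "3." in the test sentence, is not replaced; widening $WB_RIGHT (for example to also accept a trailing dot or comma) is a matter of editing one line.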


 
