Guy_

Regex 'headaches' + best browser for tester & more...

19 posts in this topic

#1 ·  Posted (edited)

1) I was testing some code that adds a full stop after a sentence when it thinks one is missing. My simplified example uses just a small range of characters I want to exclude. However, although my pattern does what I want in the tester, my AutoIt version doesn't respect my exclusions and adds full stops after the other characters too.

fAps6Z3.png
 

#include <Array.au3>

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'

MsgBox(0,"original text", $text)

$_aFull_Stop_missing = StringRegExp( $text, "(.[^.…!\""]\s{2,}.{3})", 3 )

If Not @error Then

    _ArrayDisplay($_aFull_Stop_missing)

    For $i = 0 To UBound($_aFull_Stop_missing)-1
        $found = $_aFull_Stop_missing[$i]
        $fix = StringLeft( $found, 1) & "." & StringRight( $found, StringLen($found) -1 )
        $text = StringReplace( $text, $found, $fix, 1)
    Next
EndIf

MsgBox(0,"processed text", $text)

 

2) I'm pretty sure I've seen (maybe older) info on how StringRegExpReplace can uppercase/lowercase a result, but I can't get any of what I saw to work... Neither of these works here:

$text = "Http://site.com, Www.domain.org"

$result = StringRegExpReplace($text, "(?i)(https?|www)", StringLower("$1") )

ToolTip($result)

Sleep(2500)

$result = StringRegExpReplace($text, "(?i)(https?|www)", "\L$1" )

ToolTip($result)

Sleep(2500)


3) I've had the impression the https://regex101.com tester can give different results depending on the browser? If so, is there a preferred browser?

4) General question:

I see both "If Not @error" and "If @error = 0" being used.

"If Not @error" reads best to me. Is there a reason to not always use that variation?

5) If I am mostly processing text from the web or pdf, do I have to use a UTF setting in my regex everywhere/anywhere?  So far I have not had that impression at all, but the first example made me start thinking if character encoding could be involved... (although adding UTF instructions in the regex didn't help)

TIA!! :sweating:

Edited by Guy_




1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

2)

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
ConsoleWrite($result)

 


I'm not good at regexp, but I like pretending.

#include <Array.au3>

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'

MsgBox(0,"original text", $text)

$_aFull_Stop_missing = StringRegExp($text, '([^\!\\"\s]\w\s{2,})', 3 )

If Not @error Then

    _ArrayDisplay($_aFull_Stop_missing)

    For $i = 0 To UBound($_aFull_Stop_missing)-1
        $found = $_aFull_Stop_missing[$i]
        $fix = StringStripWS($found, 8) & "." & @CRLF & @CRLF
        $text = StringReplace( $text, $found, $fix, 1)
    Next
EndIf

MsgBox(0,"processed text", $text)

 

And for URLs, just StringLower the whole thing.


#4 ·  Posted (edited)

I'm not good at regexp, but I like pretending.

Wow, thanks! You fooled me good! ;)   I will study that till I "get it."

But it seems basically a cool workaround, so I'm still wondering why the tester does what I want and my AutoIt regex selects more than that...
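
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the AutoIt sample separates lines with @CRLF, while an online tester's text box most likely holds plain LF line breaks (that part is an assumption about the tester). Since \s matches both @CR and @LF, and the negated class [^...] also accepts @CR, the pattern can start on the punctuation mark itself whenever the text uses CRLF. The snippet below uses a shortened, ASCII-only variant of the pattern from post #1 (so the script's file encoding does not matter) and counts matches for both line-ending styles. The helper name _CountMatches and the variable names are hypothetical.

```autoit
; Shortened variant of the pattern from post #1 (only "." and "!" excluded).
Local $sPattern = "(.[^.!]\s{2,}.{3})"

; Only the middle line lacks end punctuation.
Local $sCRLF = "Stop!" & @CRLF & @CRLF & "No stop" & @CRLF & @CRLF & "Last."
Local $sLF = StringReplace($sCRLF, @CRLF, @LF)

ConsoleWrite("CRLF text: " & _CountMatches($sCRLF, $sPattern) & " match(es)" & @LF) ; 2
ConsoleWrite("LF text:   " & _CountMatches($sLF, $sPattern) & " match(es)" & @LF)   ; 1

; Hypothetical helper: number of matches returned by StringRegExp, 0 if none.
Func _CountMatches($sText, $sRegex)
    Local $aHits = StringRegExp($sText, $sRegex, 3)
    If @error Then Return 0
    Return UBound($aHits)
EndFunc
```

With CRLF, "!" is taken by the leading dot, @CR satisfies the negated class, and the remaining @LF & @CRLF satisfies \s{2,}, so "Stop!" is picked up as well. Normalising the text to LF before matching, or anchoring on the line breaks with a lookbehind as in the later replies, avoids this.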

Edited by Guy_


#5 ·  Posted (edited)

Is that what you want?

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)
$fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
MsgBox(0,"processed text", $fixed)

2/ AutoIt != Perl

To change case of results, you need Execute:

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"')
ConsoleWrite($result & @LF)

3/ I'm not aware of any such dependency.

4/ If Not @error is always fine.

5/ PCRE as compiled into AutoIt is UTF-aware (fortunately, since AutoIt strings are UTF-16!). What you may need is (*UCP) at the start of the pattern, if you want \w, \d, \b, ... to cover the wider Unicode range.
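
As a small illustration of point 5, here is a sketch of what (*UCP) changes for \w. The test word is built with ChrW so the result does not depend on how the script file is saved; the variable name is illustrative.

```autoit
; "café", with the accented letter built from its code point (U+00E9).
Local $sWord = "caf" & ChrW(0x00E9)

; Without (*UCP), \w only covers ASCII letters, digits and underscore,
; so this should report 0 (no match).
ConsoleWrite("plain \w : " & StringRegExp($sWord, "^\w+$") & @LF)

; With (*UCP) at the very start of the pattern, \w uses Unicode properties,
; so this should report 1 (match).
ConsoleWrite("(*UCP) \w: " & StringRegExp($sWord, "(*UCP)^\w+$") & @LF)
```

For patterns that only deal with punctuation and line breaks, as in question 1, this setting should make no difference.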

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


#6 ·  Posted (edited)

1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

The code by boththose demonstrates it well.

My own thinking was to select a hopefully unique part of the text including a last character of a sentence *if* that sentence ending excludes certain characters that may be indicating it does *not* need a full stop (and there have to be at least 2 returns).

I need to understand why AutoIt behaves differently from the tester here...

 

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
ConsoleWrite($result)

Thanks, that does work but is very confusing to me.

Is this demonstrating a needed workaround and could we do this more easily in earlier times? (at least, I Googled simpler examples like I provided that I guess used to work once...)

Is this more demonstrating "the ultimate failsafe pro way" and can it be done simpler? (ok, just saw jchd's answer too)

I will take note of the principle of course, but if this is what I actually needed, the code seems over the top and I'd better use(?) ... (I realize it's not exactly the same)

$text = StringReplace($text, "Http", "http")
$text = StringReplace($text, "Www.", "www.")

 

Edited by Guy_


Guy, perhaps the first ConsoleWrite below will help you see what's needed to achieve the result.

$text = "Http://site.com, Www.domain.org"
$result = '"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"'
ConsoleWrite($result & @LF)
$result = Execute($result)
ConsoleWrite($result & @LF)
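
For text that comes from the web, a variant without Execute may be worth considering, since Execute evaluates whatever ends up in the generated string (which is why the earlier reply doubles the quotes first). The following is only a sketch of that alternative, reusing the StringRegExp plus StringReplace loop from post #1; the variable names are illustrative.

```autoit
Local $sText = "Http://site.com, Www.domain.org"

; Collect every occurrence of http, https or www, whatever its case.
Local $aHits = StringRegExp($sText, "(?i)https?|www", 3)

If Not @error Then
    For $i = 0 To UBound($aHits) - 1
        ; Replace each hit with its lower-case form.
        ; Last parameter 1 = case-sensitive search, so only the exact spelling found is touched.
        $sText = StringReplace($sText, $aHits[$i], StringLower($aHits[$i]), 0, 1)
    Next
EndIf

ConsoleWrite($sText & @LF) ; http://site.com, www.domain.org
```

This is slower than a single StringRegExpReplace on large inputs, and it replaces every identical spelling anywhere in the text rather than only the matched positions, but nothing from the input is ever executed as code.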

 


For the first question, using jchd's way, here is a way to store each end of sentence character in an array :

; Possible end of sentences
Local $aEndChars[] = ['.', '!', '?', '...', '…', '"']

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)

$sExpr = "(?<!"
For $i = 0 To UBound($aEndChars) - 1
    $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
Next
$sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

$fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
MsgBox(0,"processed text", $fixed)

 


Thanks a lot everyone!  I'm busy studying all of this further now :)


Fine, come back if you still have questions.


Is that what you want?

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)
$fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
MsgBox(0,"processed text", $fixed)

That was a breakthrough in helping me understand lookbehind. Beautiful :)

However, neither in this one nor in jguinch's variation can I get it to work for a comma in the lookbehind... The usual \ escaping doesn't seem to work and I can't find anything about it... (Maybe it doesn't work with some other characters either; I haven't checked all of them yet. But the comma stood out.)


Well, with my code you would have to add ',' in the $aEndChars array :

Local $aEndChars[] = ['.', '!', '?', '...', '…', '"', ',']

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'With comma sentence,' & @CRLF & @CRLF & "Last sentence."
MsgBox(0,"original text", $text)

$sExpr = "(?<!"
For $i = 0 To UBound($aEndChars) - 1
    $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
Next
$sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

$fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
MsgBox(0,"processed text", $fixed)

Doesn't that work?

With jchd's code, [.…!",] should work.
 


I don't see [.…!",] failing, nor why on Earth it would fail either.



Except if there is a remaining space after the comma  :)


Of course, but that is independent of the "stop-char" being a dot, a comma, an ellipsis, whatever. It's trivial to get rid of extra whitespace between the stop-char (or absence of one) and the two line terminations.
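
A minimal sketch of that clean-up step, assuming the stray whitespace consists of spaces or tabs only: strip it in a first pass, then apply the lookbehind replacement from the earlier reply (shown here with ASCII stop characters and the comma added). The sample text and variable names are illustrative.

```autoit
Local $sText = "With comma sentence, " & @CRLF & @CRLF & "No stop here  " & @CRLF & @CRLF & "Last sentence."

; Pass 1: remove horizontal whitespace that sits directly before a line break.
Local $sTrimmed = StringRegExpReplace($sText, "\h+(?=\R)", "")

; Pass 2: add a full stop before a blank line unless a stop character is already there.
Local $sFixed = StringRegExpReplace($sTrimmed, '(?<![.!",])(\r\n\r\n)', '.$1')

; Show the line breaks explicitly so the result is easy to check in the console.
ConsoleWrite(StringReplace($sFixed, @CRLF, "<CRLF>") & @LF)
; With comma sentence,<CRLF><CRLF>No stop here.<CRLF><CRLF>Last sentence.
```

Doing the trim as a separate pass keeps the lookbehind simple, since a lookbehind has to be fixed-length and cannot skip a variable amount of whitespace.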



#16 ·  Posted (edited)

It indeed works in jchd's example, but for some reason I'm still not sure of, I managed to make it fail in a very simple example of my own. For a moment it seemed '\r\n\r\n' was asking for 4 returns when my source was the clipboard, and I got it working by making that '\r\r'. Then I got it working normally after all, but only after hours of fiddling to discover which characters I had to double up or escape.

Space after the comma was never the issue and I made sure of that.

Qyn0U2s.png

Lastly, I now remain stumped by how to make it work when there are more than 2 returns... Both of these can lead to bad results:

$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2,})', '.$1')
$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2}\R*)', '.$1')

 

Edited by Guy_


Guy,

PCRE \R by default translates into this atomic group:

(?>\r\n|\n|\x0b|\f|\r|\x85|\x{2028}|\x{2029})

Hence, if \R finds \r\n first, it will find a match. But it was decided to compile AutoIt PCRE with the option PCRE_BSR_ANYCRLF, which changes \R into the equivalent of:

(?>\r\n|\n|\r)

The default behavior (matching 0x0B, \f and 0x85 and the two other codepoints) can be restored in a pattern by placing (*BSR_UNICODE) at its head.

But anyway, \R{2,} will definitely match all combinations of two or more line terminations using CR and/or LF. Note that "abc" & @CR & @LF counts for only one line termination (this is @CRLF).

$text = "a" & @CRLF & @CRLF & @CRLF & "b" & @CR & @CR & @CR & "c" & @LF & @LF & @LF & "d" & @CRLF & @LF & @CRLF & "e" & @LF & @CRLF & @CRLF & "f" & @LF & @CR & "g" & @CR & @CRLF
$result = StringRegExp($text, '(.*)(\R{2,})', 3)
For $i = 0 To UBound($result) - 1 Step 2
    ConsoleWrite($result[$i] & ' -> ' & Binary($result[$i + 1]) & @LF)
Next

 


[.?…!;,:+=&*"“”„''‘’>] should be sufficient

Indeed, it now is... *sigh*  ;)  Thank you.

I guess I was trying double quotes around the whole thing at the time...

-
I wanted to apply my newly learned knowledge to add a full stop in similar circumstances, but only when the line immediately after does not start with a capital... This again does not work after some 30 variations, and it is adding full stops when there is more than one return...

$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*"“”„''‘’>])(\R)(?=[A-Z])', '.$1')

I'm ready for a good cry... Giving up? Maybe tomorrow  ;)
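
A possible cause, sketched under the assumption that the text uses @CRLF line breaks: when there are two returns in a row, the second one is preceded by @LF, which is not in the excluded set, so the lookbehind succeeds there and the full stop lands between the two returns. Adding \r and \n to the lookbehind class rules that position out. The example uses ASCII stop characters only, and the sample text and variable names are illustrative. Note also that (?=[A-Z]) as posted requires the next line to start with a capital; if the opposite is meant, (?![A-Z]) would be the lookahead to try.

```autoit
Local $sText = "first line" & @CRLF & "Second line" & @CRLF & @CRLF & "Third line."

; Pattern shape as posted: a full stop is also inserted inside the blank line.
Local $sAsPosted = StringRegExpReplace($sText, '(?<![.?!;,:"])(\R)(?=[A-Z])', '.$1')

; Same pattern with \r and \n added to the lookbehind class.
Local $sAdjusted = StringRegExpReplace($sText, '(?<![.?!;,:"\r\n])(\R)(?=[A-Z])', '.$1')

; Show the line breaks explicitly so both results are easy to compare.
ConsoleWrite(StringReplace($sAsPosted, @CRLF, "<CRLF>") & @LF)
; first line.<CRLF>Second line<CRLF>.<CRLF>Third line.
ConsoleWrite(StringReplace($sAdjusted, @CRLF, "<CRLF>") & @LF)
; first line.<CRLF>Second line<CRLF><CRLF>Third line.
```

In the adjusted version only a single return directly followed by a capital gets a full stop; runs of two or more returns are left alone.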
