Regex 'headaches' + best browser for tester & more...

Guy_

1) I was testing some code that adds a full stop after a sentence when it thinks one is needed. My simplified example excludes just a small range of characters. However, although my pattern does what I want in the tester, my AutoIt version doesn't respect my exclusions and adds full stops after the other characters too.

fAps6Z3.png
 

#include <Array.au3>

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'

MsgBox(0,"original text", $text)

$_aFull_Stop_missing = StringRegExp( $text, "(.[^.…!\""]\s{2,}.{3})", 3 )

If Not @error Then
    _ArrayDisplay($_aFull_Stop_missing)
    For $i = 0 To UBound($_aFull_Stop_missing) - 1
        $found = $_aFull_Stop_missing[$i]
        $fix = StringLeft($found, 1) & "." & StringRight($found, StringLen($found) - 1)
        $text = StringReplace($text, $found, $fix, 1)
    Next
EndIf

MsgBox(0,"processed text", $text)

 

2) I'm pretty sure I've seen (maybe older) info on how StringRegExpReplace can uppercase/lowercase a result, but I can't get any of what I saw to work. Neither of these works here:

$text = "Http://site.com, Www.domain.org"

$result = StringRegExpReplace($text, "(?i)(https?|www)", StringLower("$1") )

ToolTip($result)

Sleep(2500)

$result = StringRegExpReplace($text, "(?i)(https?|www)", "\L$1" )

ToolTip($result)

Sleep(2500)


3) I've had the impression the https://regex101.com tester can give different results depending on the browser? If so, is there a preferred browser?

4) General question:

I see both "If Not @error" and "If @error = 0" being used.

"If Not @error" reads best to me. Is there a reason to not always use that variation?

5) If I am mostly processing text from the web or pdf, do I have to use a UTF setting in my regex everywhere/anywhere?  So far I have not had that impression at all, but the first example made me start thinking if character encoding could be involved... (although adding UTF instructions in the regex didn't help)

TIA!! :sweating:

jguinch

1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

2)

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
ConsoleWrite($result)

 

iamtheky

I'm not good at regexp, but I like pretending.

#include <Array.au3>

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'

MsgBox(0,"original text", $text)

$_aFull_Stop_missing = StringRegExp($text, '([^\!\\"\s]\w\s{2,})', 3 )

If Not @error Then
    _ArrayDisplay($_aFull_Stop_missing)
    For $i = 0 To UBound($_aFull_Stop_missing) - 1
        $found = $_aFull_Stop_missing[$i]
        $fix = StringStripWS($found, 8) & "." & @CRLF & @CRLF
        $text = StringReplace($text, $found, $fix, 1)
    Next
EndIf

MsgBox(0,"processed text", $text)

 

And for URLs, just StringLower the whole thing.

Guy_

I'm not good at regexp, but I like pretending.

Wow, thanks! You fooled me good! ;)   I will study that till I "get it."

But it seems basically a cool workaround, so I'm still wondering why the tester does what I want and my AutoIt regex selects more than that...
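One possible explanation, offered as an assumption rather than something confirmed in this thread: the sample text uses @CRLF, so the negated class is free to match the @CR itself, with the preceding . taking the punctuation mark. A tester that stores each line break as a single @LF has no spare character for that. A minimal sketch (variable names are illustrative, and … and " are left out of the class to keep the script encoding-neutral):

```autoit
; Hypothetical check: can the negated class consume the @CR of a @CRLF pair?
Local $sPattern = '(.[^.!]\s{2,}.{3})'
Local $sCrLf = 'sentence!' & @CRLF & @CRLF & 'Next'
Local $sLf = 'sentence!' & @LF & @LF & 'Next'

; StringRegExp with the default flag returns 1 on a match and 0 otherwise
ConsoleWrite('@CRLF text: ' & StringRegExp($sCrLf, $sPattern) & @LF)
ConsoleWrite('@LF text:   ' & StringRegExp($sLf, $sPattern) & @LF)
```

If that is what happens, I'd expect 1 for the @CRLF text and 0 for the @LF text. Adding \r\n to the negated class would be one way to stop the class from matching a line terminator.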

jchd

Is that what you want?

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)
$fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
MsgBox(0,"processed text", $fixed)

2/ AutoIt != Perl

To change case of results, you need Execute:

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"')
ConsoleWrite($result & @LF)

3/ I'm not aware of any such dependency.

4/ If Not @error is always fine.

5/ PCRE as compiled into AutoIt is UTF-aware (fortunately, since AutoIt strings are UTF-16!). What you may need is (*UCP), if you expect to benefit from the wider range of \w, \d, \b, ...
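As a rough illustration of the (*UCP) point, here is a small sketch that counts how many characters \w matches with and without the option. The sample string and variable names are made up for the example; the count comes from @extended, which StringRegExpReplace sets to the number of replacements.

```autoit
; ChrW(0x0436) is a Cyrillic letter, used here only as a sample non-ASCII letter
Local $sWord = "abc" & ChrW(0x0436)

StringRegExpReplace($sWord, '\w', '_')
Local $iPlain = @extended ; replacements made without (*UCP)

StringRegExpReplace($sWord, '(*UCP)\w', '_')
Local $iUcp = @extended ; replacements made with (*UCP)

ConsoleWrite("\w without (*UCP): " & $iPlain & " of " & StringLen($sWord) & " characters" & @LF)
ConsoleWrite("\w with (*UCP):    " & $iUcp & " of " & StringLen($sWord) & " characters" & @LF)
```

I'd expect the first count to stop at the three ASCII letters and the second to cover all four characters.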


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Guy_

1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

The code by boththose demonstrates it well.

My own thinking was to select a hopefully unique part of the text including a last character of a sentence *if* that sentence ending excludes certain characters that may be indicating it does *not* need a full stop (and there have to be at least 2 returns).

I need to understand why AutoIt behaves differently from the tester here...

 

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
ConsoleWrite($result)

Thanks, that does work but is very confusing to me.

Is this demonstrating a needed workaround and could we do this more easily in earlier times? (at least, I Googled simpler examples like I provided that I guess used to work once...)

Is this more demonstrating "the ultimate failsafe pro way" and can it be done simpler? (ok, just saw jchd's answer too)

I will take note of the principle of course, but if this is what I actually needed, the code seems over the top and I'd better use(?) ... (I realize it's not exactly the same)

$text = StringReplace($text, "Http", "http")
$text = StringReplace($text, "Www.", "www.")

 

jchd

Guy, perhaps the first ConsoleWrite below will help you see what's needed to achieve the result.

$text = "Http://site.com, Www.domain.org"
$result = '"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"'
ConsoleWrite($result & @LF)
$result = Execute($result)
ConsoleWrite($result & @LF)
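For comparison, here is a sketch of a different technique that avoids Execute altogether: collect the matches with StringRegExp, then replace each one with its lowercased form. This is not what was posted above, just an alternative under the assumption that each matched spelling may safely be replaced wherever it occurs in the text; the variable names are illustrative.

```autoit
Local $sText = "Http://site.com, Www.domain.org"

; Flag 3 returns an array of all matches (the full match, since there is no capturing group)
Local $aHits = StringRegExp($sText, '(?i)https?|www', 3)
If Not @error Then
    For $i = 0 To UBound($aHits) - 1
        ; Occurrence 0 = replace all, casesense 1 = only this exact spelling
        $sText = StringReplace($sText, $aHits[$i], StringLower($aHits[$i]), 0, 1)
    Next
EndIf

ConsoleWrite($sText & @LF) ; http://site.com, www.domain.org
```

The trade-off is that the replacement is no longer tied to the match position, so the Execute approach remains the closer equivalent of a true case-changing replace.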

 

jguinch

For the first question, using jchd's approach, here is a way to keep each end-of-sentence character in an array:

; Possible end of sentences
Local $aEndChars[] = ['.', '!', '?', '...', '…', '"']

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)

$sExpr = "(?<!"
For $i = 0 To UBound($aEndChars) - 1
    $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
Next
$sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

$fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
MsgBox(0,"processed text", $fixed)

 

Guy_

Thanks a lot everyone!  I'm busy studying all of this further now :)

jchd

Fine, come back if you still have questions.

Guy_

Is that what you want?

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)
$fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
MsgBox(0,"processed text", $fixed)

That was a breakthrough in helping me understand lookbehind. Beautiful :)

However, neither in this one nor in jguinch's variation can I get it to work for a comma in the lookbehind... The usual \escaping doesn't seem to work and I can't find anything about it... (Maybe it doesn't work with some other characters either; I haven't checked all of them yet. But the comma stood out.)

jguinch

Well, with my code you would have to add ',' to the $aEndChars array:

Local $aEndChars[] = ['.', '!', '?', '...', '…', '"', ',']

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'With comma sentence,' & @CRLF & @CRLF & "Last sentence."
MsgBox(0,"original text", $text)

$sExpr = "(?<!"
For $i = 0 To UBound($aEndChars) - 1
    $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
Next
$sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

$fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
MsgBox(0,"processed text", $fixed)

Is that not good?

With jchd's code, [.…!",] should work.
 

jchd

I don't see [.…!",] failing, nor why on Earth it would fail either.


mikell

Except if there is a remaining space after the comma  :)

jchd

Of course, but that is independent of the "stop-char" being a dot, a comma, an ellipsis, whatever. It's trivial to get rid of extra whitespace between the stop-char (or its absence) and the two line terminations.
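A minimal sketch of that clean-up, assuming a two-step approach is acceptable: first strip horizontal whitespace that sits right before a line termination, then apply the same lookbehind idea as earlier in the thread. The sample text and variable names are illustrative, and the ellipsis is left out of the class only to keep the script encoding-neutral.

```autoit
Local $sText = 'With comma sentence, ' & @CRLF & @CRLF & 'No stop here  ' & @CRLF & @CRLF & 'Last sentence.'

; Step 1: drop spaces/tabs that come directly before a line termination
$sText = StringRegExpReplace($sText, '\h+(?=\R)', '')

; Step 2: add a full stop before a double @CRLF unless a stop-char is already there
$sText = StringRegExpReplace($sText, '(?<![.!",])(\r\n\r\n)', '.$1')

ConsoleWrite($sText & @LF)
```

With this input, the line ending in a comma should be left alone and only "No stop here" should gain a full stop.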


Guy_

It indeed works in jchd's example, but for some reason I'm still not sure of, I managed to make it fail in a very simple example of my own. For a moment it seemed '\r\n\r\n' was asking for 4 returns if my source was the clipboard, and I got it working by making that '\r\r'. Then I got it working normally after all, but only after hours of fiddling to discover which characters I had to double up or escape.

Space after the comma was never the issue and I made sure of that.

Qyn0U2s.png

Lastly, I now remain stumped by how to make it work when there are more than 2 returns... Both of these can lead to bad results:

$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2,})', '.$1')
$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2}\R*)', '.$1')

 

jchd

Guy,

PCRE \R by default translates into this atomic group:

(?>\r\n|\n|\x0b|\f|\r|\x85|\x{2028}|\x{2029})

Hence, if \R finds \r\n first, it will find a match. But it was decided to compile AutoIt PCRE with the option PCRE_BSR_ANYCRLF, which changes \R into the equivalent of:

(?>\r\n|\n|\r)

The default behavior (matching 0x0B, \f and 0x85 and the two other codepoints) can be restored in a pattern by placing (*BSR_UNICODE) at its head.

But anyway, \R{2,} will definitely match all combinations of two or more line terminations using CR and/or LF. Note that "abc" & @CR & @LF counts as only one line termination (this is @CRLF).

$text = "a" & @CRLF & @CRLF & @CRLF & "b" & @CR & @CR & @CR & "c" & @LF & @LF & @LF & "d" & @CRLF & @LF & @CRLF & "e" & @LF & @CRLF & @CRLF & "f" & @LF & @CR & "g" & @CR & @CRLF
$result = StringRegExp($text, '(.*)(\R{2,})', 3)
For $i = 0 To UBound($result) - 1 Step 2
    ConsoleWrite($result[$i] & ' -> ' & Binary($result[$i + 1]) & @LF)
Next
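To see the effect of (*BSR_UNICODE) described above, a small sketch can test \R against a form feed, which belongs to the wider set but not to the CR/LF-only set. The sample string is made up for the example.

```autoit
Local $sText = "a" & Chr(12) & "b" ; Chr(12) is a form feed (\f)

; StringRegExp with the default flag returns 1 on a match and 0 otherwise
ConsoleWrite("default \R:        " & StringRegExp($sText, 'a\Rb') & @LF)
ConsoleWrite("(*BSR_UNICODE) \R: " & StringRegExp($sText, '(*BSR_UNICODE)a\Rb') & @LF)
```

Going by the description above, the first line should report 0 and the second 1.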

 

Guy_

[.?…!;,:+=&*"“”„''‘’>] should be sufficient

Indeed, it now is... *sigh*  ;)  Thank you.

I guess I was trying double quotes around the whole thing at the time...

-
Wanted to apply my newly learned knowledge to have a full stop in similar circumstances but only when the line immediately after does not start with a capital... This again does not work after like 30 variations and is adding full stops when there is more than one return...

$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*"“”„''‘’>])(\R)(?=[A-Z])', '.$1')

I'm ready for a good cry... Giving up? Maybe tomorrow  ;)
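A possible cause of the stray full stops, stated as an assumption since the thread ends here: with two returns in a row, the second \R is preceded by the @LF of the first one, and @LF is not in the excluded class, so the lookbehind is satisfied. One way to sketch a fix is to add \r and \n to that class. The class is shortened below to keep the script encoding-neutral, and the sample text is made up.

```autoit
Local $sText = 'First line' & @CRLF & 'Second line' & @CRLF & @CRLF & 'Third line.'

; \r\n inside the negated class: a line terminator never counts as "missing stop"
Local $sFixed = StringRegExpReplace($sText, '(?<![.?!;,:"\r\n])(\R)(?=[A-Z])', '.$1')

ConsoleWrite($sFixed & @LF)
```

Here only "First line" should gain a full stop; the double return after "Second line" is left untouched by this pattern and can be handled by the \R{2,} version.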
