
Regex 'headaches' + best browser for tester & more...


Guy_

1) I was testing some code that adds a full stop after a sentence when it thinks one is appropriate. My simplified example has just a small range of characters I want to exclude. However, although my code does what I want in the tester, my AutoIt version doesn't respect my exclusions and adds full stops after the other characters too.

fAps6Z3.png
 

#include <Array.au3>

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'

MsgBox(0,"original text", $text)

$_aFull_Stop_missing = StringRegExp( $text, "(.[^.…!\""]\s{2,}.{3})", 3 )

If Not @error Then

    _ArrayDisplay($_aFull_Stop_missing)

    For $i = 0 To UBound($_aFull_Stop_missing)-1
        $found = $_aFull_Stop_missing[$i]
        $fix = StringLeft( $found, 1) & "." & StringRight( $found, StringLen($found) -1 )
        $text = StringReplace( $text, $found, $fix, 1)
    Next
EndIf

MsgBox(0,"processed text", $text)

 

2) I'm pretty sure I've seen (maybe older) info on how StringRegExpReplace can uppercase/lowercase a result. But I can't get any of what I saw to work... Neither of these works here:

$text = "Http://site.com, Www.domain.org"

$result = StringRegExpReplace($text, "(?i)(https?|www)", StringLower("$1") )

ToolTip($result)

Sleep(2500)

$result = StringRegExpReplace($text, "(?i)(https?|www)", "\L$1" )

ToolTip($result)

Sleep(2500)


3) I've had the impression the https://regex101.com tester can give different results depending on the browser? If so, is there a preferred browser?

4) General question:

I see both "If Not @error" and "If @error = 0" being used.

"If Not @error" reads best to me. Is there a reason to not always use that variation?

5) If I am mostly processing text from the web or pdf, do I have to use a UTF setting in my regex everywhere/anywhere?  So far I have not had that impression at all, but the first example made me start thinking if character encoding could be involved... (although adding UTF instructions in the regex didn't help)

TIA!! :sweating:

Edited by Guy_

1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

2)

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
ConsoleWrite($result)

 


I'm not good at regexp, but I like pretending.

#include <Array.au3>

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'

MsgBox(0,"original text", $text)

$_aFull_Stop_missing = StringRegExp($text, '([^\!\\"\s]\w\s{2,})', 3 )

If Not @error Then

    _ArrayDisplay($_aFull_Stop_missing)

    For $i = 0 To UBound($_aFull_Stop_missing)-1
        $found = $_aFull_Stop_missing[$i]
        $fix = StringStripWS($found , 8) & "." & @CRLF & @CRLF
        $text = StringReplace( $text, $found, $fix, 1)
    Next
EndIf

MsgBox(0,"processed text", $text)

 

And for URLs, just StringLower the whole thing.



I'm not good at regexp, but I like pretending.

Wow, thanks! You fooled me good! ;)   I will study that till I "get it."

But it seems basically a cool workaround, so I'm still wondering why the tester does what I want and my AutoIt regex selects more than that...

Edited by Guy_

Is that what you want?

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)
$fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
MsgBox(0,"processed text", $fixed)

2/ AutoIt != Perl

To change case of results, you need Execute:

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"')
ConsoleWrite($result & @LF)

3/ I'm not aware of such a dependency.

4/ If Not @error is always fine.

5/ PCRE as compiled into AutoIt is UTF-aware (fortunately, since AutoIt strings are UTF-16!). What you may need is (*UCP), if you expect to benefit from the wider Unicode range of \w, \d, \b, ...
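
To illustrate the (*UCP) remark, here is a minimal sketch, not a definitive test. The sample word is an assumption, and it is built with ChrW(233) so the result does not depend on how the script file is encoded. Without (*UCP), \w is expected to cover only ASCII word characters; with it, the accented letter should count as a word character too.

```autoit
; Sketch only: "caf" plus e-acute (U+00E9), built with ChrW to avoid script-encoding issues
Local $sWord = "caf" & ChrW(233)

; Without (*UCP), \w is expected to match the ASCII letters only
ConsoleWrite(StringRegExpReplace($sWord, "\w", "*") & @LF)

; With (*UCP) at the head of the pattern, \w should also match the accented letter
ConsoleWrite(StringRegExpReplace($sWord, "(*UCP)\w", "*") & @LF)
```

If the two lines printed differ, the text being processed contains characters that only the (*UCP) variant treats as word characters.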

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

The code by boththose demonstrates it well.

My own thinking was to select a hopefully unique part of the text including a last character of a sentence *if* that sentence ending excludes certain characters that may be indicating it does *not* need a full stop (and there have to be at least 2 returns).

I need to understand why AutoIt behaves differently from the tester here...
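
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the AutoIt test string uses @CRLF, while a browser-based tester most likely sees LF-only line ends. In the original pattern, the negated class [^.…!"] can itself match @CR, so the preceding "." is free to land on an excluded character such as "!". The sketch below uses a made-up sample string and a hypothetical helper, _ShowEOL, that only makes the line-end characters visible.

```autoit
; Same pattern as in the first post; the ellipsis is built with ChrW(8230)
; so the script does not depend on the file's encoding
Local $sPattern = '(.[^.' & ChrW(8230) & '!"]\s{2,}.{3})'
Local $sText = 'Ends with a bang!' & @CRLF & @CRLF & 'Next line'

; CRLF line ends: "." can take "!", the class takes @CR, \s{2,} takes the rest
Local $aHit = StringRegExp($sText, $sPattern, 3)
If Not @error Then ConsoleWrite("CRLF text: " & _ShowEOL($aHit[0]) & @LF)

; LF-only line ends (assumed to be what the online tester works on)
$aHit = StringRegExp(StringReplace($sText, @CRLF, @LF), $sPattern, 3)
If @error Then ConsoleWrite("LF-only text: no match" & @LF)

; Hypothetical helper: replaces CR and LF with visible markers
Func _ShowEOL($s)
    Return StringReplace(StringReplace($s, @CR, "<CR>"), @LF, "<LF>")
EndFunc
```

If the first call reports a match that starts with "!" and the second reports none, the difference comes from the line-end convention, not from the exclusion list.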

 

$text = "Http://site.com, Www.domain.org"
$result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
ConsoleWrite($result)

Thanks, that does work but is very confusing to me.

Is this demonstrating a needed workaround and could we do this more easily in earlier times? (at least, I Googled simpler examples like I provided that I guess used to work once...)

Is this more demonstrating "the ultimate failsafe pro way" and can it be done simpler? (ok, just saw jchd's answer too)

I will take note of the principle of course, but if this is what I actually needed, the code seems over the top and I'd better use(?) ... (I realize it's not exactly the same)

$text = StringReplace($text, "Http", "http")
$text = StringReplace($text, "Www.", "www.")

 

Edited by Guy_

Guy, perhaps the first ConsoleWrite below will help you see what's needed to achieve the result.

$text = "Http://site.com, Www.domain.org"
$result = '"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"'
ConsoleWrite($result & @LF)
$result = Execute($result)
ConsoleWrite($result & @LF)
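
The earlier variant with StringReplace($text, '"', '""') matters once the text itself contains double quotes, because the whole text ends up inside a quoted string literal that Execute has to parse. A small sketch of that case, with a made-up sample string:

```autoit
; Sample text containing double quotes (an assumption for illustration)
Local $sText = 'See "Http://site.com" and Www.domain.org'

; Double every quote so it survives being embedded in a "..." literal
Local $sSafe = StringReplace($sText, '"', '""')

; Each match becomes:  " & StringLower('match') & "
Local $sExpr = '"' & StringRegExpReplace($sSafe, "(?i)(https?|www)", '" & StringLower(''$1'') & "') & '"'

ConsoleWrite($sExpr & @LF)          ; the expression Execute will evaluate
ConsoleWrite(Execute($sExpr) & @LF) ; the text with the matches lowercased
```

Note that Execute evaluates whatever expression the string contains, so this approach is best kept for text whose content you control.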

 


For the first question, building on jchd's approach, here is a way to store each end-of-sentence character in an array:

; Possible end of sentences
Local $aEndChars[] = ['.', '!', '?', '...', '…', '"']

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)

$sExpr = "(?<!"
For $i = 0 To UBound($aEndChars) - 1
    $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
Next
$sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

$fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
MsgBox(0,"processed text", $fixed)

 


Fine, come back if you still have questions.


Is that what you want?

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'Last sentence.'
MsgBox(0,"original text", $text)
$fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
MsgBox(0,"processed text", $fixed)

That was a breakthrough in helping me understand lookbehind. Beautiful :)

However, neither in this one nor in jguinch's variation can I get it to work for a comma in the lookbehind... The usual \escaping doesn't seem to work and I can't find anything about it... (Maybe it doesn't work with some other characters either; I haven't checked all of them yet. But the comma stood out.)


Well, with my code you would have to add ',' to the $aEndChars array:

Local $aEndChars[] = ['.', '!', '?', '...', '…', '"', ',']

$text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!'  & @CRLF & @CRLF & 'With comma sentence,' & @CRLF & @CRLF & "Last sentence."
MsgBox(0,"original text", $text)

$sExpr = "(?<!"
For $i = 0 To UBound($aEndChars) - 1
    $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
Next
$sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

$fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
MsgBox(0,"processed text", $fixed)

Does that not work?

With jchd's code, [.…!",] should work.
 


I don't see [.…!",] failing, nor why on Earth it would fail either.


Of course, but that is independent of the "stop-char" being a dot, a comma, an ellipsis, whatever. It's trivial to get rid of extra whitespace between the stop-char (or its absence) and the two line terminations.


It indeed works in jchd's example, but for some reason I'm still not sure of, I managed to make it fail in a very simple example of my own. For a moment it seemed '\r\n\r\n' was asking for 4 returns if my source was the clipboard, and I got it working by making that '\r\r'. Then I got it working normally after all, but only after hours of fiddling to discover which characters I had to double up or escape.

Space after the comma was never the issue and I made sure of that.

Qyn0U2s.png

Lastly, I now remain stumped by how to make it work when there are more than 2 returns... Both of these can lead to bad results:

$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2,})', '.$1')
$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2}\R*)', '.$1')

 

Edited by Guy_

Guy,

PCRE \R by default translates into this atomic group:

(?>\r\n|\n|\x0b|\f|\r|\x85|\x{2028}|\x{2029})

Hence, if \R finds \r\n first, it will find a match. But it was decided to compile AutoIt PCRE with the option PCRE_BSR_ANYCRLF, which changes \R into the equivalent of:

(?>\r\n|\n|\r)

The default behavior (matching 0x0B, \f and 0x85 and the two other codepoints) can be restored in a pattern by placing (*BSR_UNICODE) at its head.

But anyway, \R{2,} will definitely match all combinations of two or more line terminations using CR and/or LF. Note that "abc" & @CR & @LF counts for only one line termination (this is @CRLF).

$text = "a" & @CRLF & @CRLF & @CRLF & "b" & @CR & @CR & @CR & "c" & @LF & @LF & @LF & "d" & @CRLF & @LF & @CRLF & "e" & @LF & @CRLF & @CRLF & "f" & @LF & @CR & "g" & @CR & @CRLF
$result = StringRegExp($text, '(.*)(\R{2,})', 3)
For $i = 0 To UBound($result) - 1 Step 2
    ConsoleWrite($result[$i] & ' -> ' & Binary($result[$i + 1]) & @LF)
Next
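
As a small companion sketch for the (*BSR_UNICODE) remark above (the sample string is an assumption): U+2028 is one of the code points that \R only matches in its default PCRE form, so the two calls below are expected to disagree under the build described here.

```autoit
; Sketch only: U+2028 (LINE SEPARATOR) sits between "a" and "b"
Local $sText = "a" & ChrW(0x2028) & "b"

; Flag 0 (the default) returns 1 for a match and 0 for no match
ConsoleWrite("plain \R:           " & StringRegExp($sText, "a\Rb") & @LF)
ConsoleWrite("(*BSR_UNICODE) \R:  " & StringRegExp($sText, "(*BSR_UNICODE)a\Rb") & @LF)
```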

 


[.?…!;,:+=&*"“”„''‘’>] should be sufficient

Indeed, it now is... *sigh*  ;)  Thank you.

I guess I was trying double quotes around the whole thing at the time...

-
Wanted to apply my newly learned knowledge to have a full stop in similar circumstances but only when the line immediately after does not start with a capital... This again does not work after like 30 variations and is adding full stops when there is more than one return...

$fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*"“”„''‘’>])(\R)(?=[A-Z])', '.$1')

I'm ready for a good cry... Giving up? Maybe tomorrow  ;)
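
A possible cause, stated as an assumption since the thread does not confirm it: with several returns in a row, the regex can also start matching at the second line break. The character before that position is @LF (or @CR), which is not in the lookbehind's exclusion list, so the lookbehind passes and a full stop lands between the returns. One way to sketch a fix is to add \r and \n to the excluded characters and let the group take all consecutive line breaks. The exclusion list is shortened here for readability, and the sample text is made up.

```autoit
; Sample: two returns before a capital, then a line that already ends with a full stop
Local $sText = "no stop here" & @CRLF & @CRLF & "Next sentence." & @CRLF & "already ended"

; \r\n inside the lookbehind class keeps a match from starting in the middle of a run of line breaks;
; (\R+) then consumes the whole run before the lookahead checks for a capital letter
Local $sFixed = StringRegExpReplace($sText, '(?<![.?!;,:"\r\n])(\R+)(?=[A-Z])', '.$1')

ConsoleWrite($sFixed & @LF)
```

The same addition to the lookbehind class may also help with the (\R{2,}) variants shown earlier, where three or more returns gave bad results.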
