Default regexp options

jchd

I'd like to submit two points for your consideration, especially Jon's, since both concern PCRE compile options, or canned options set internally or prepended to every pattern submitted.

First, it seems to me that the current default newline convention is LF only.

#include <Array.au3>

Local $sData =  "12 abc." & @CRLF & "13 def;" & @CRLF & "14 ghi."
Local $aRes

; Say we want an array of lines that start with a number and end with a dot.

; Using the default built-in convention, we miss "12 abc."
$aRes = StringRegExp($sData, "(?m)(^\d+.*\.$)", 3)
_ArrayDisplay($aRes, "Valid lines (*LF)")

; Forcing the newline convention to @CRLF works as expected
$aRes = StringRegExp($sData, "(*CRLF)(?m)(^\d+.*\.$)", 3)
_ArrayDisplay($aRes, "Valid lines (*CRLF)")

While this default works well under Unix-like OSes using @LF only, it brings issues with $ under Windows.

EDIT: much simplified example follows in subsequent post.

In multiline mode (?m), $ asserts true at the end of the subject or just before a newline (as defined by the current newline convention). In the first example above, the literal dot (.) is not the character just before $, since there is a @CR between the dot and the @LF, and it is only before the @LF that $ is true.

Using the sequence (*CRLF) at start of a pattern, we force the newline convention to be @CRLF as a whole, which is the most common situation in the Windows world. You can see the difference with the second example above.

That's why I'd recommend using the --enable-newline-is-crlf PCRE library build-time option. This is equivalent to prepending (*CRLF) to every pattern submitted. People can still override this default setting when they need to process text using another convention. For the record, the available conventions (to be used once, at the very start of a pattern) are:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences

Note to Jon: equivalently, the PCRE_NEWLINE_CRLF option bit can be passed at pattern compile-time to pcre_compile[2]().
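
To see how the choice of convention changes what $ and . do, here is a minimal sketch that runs the same pattern and subject as the first example with each of the five prefixes listed above. It only uses StringRegExp and _ArrayToString from Array.au3; the variable names are mine. No output is listed on purpose, since it depends on how each convention treats the @CR and @LF characters.

```autoit
#include <Array.au3>

; Same subject as in the first example: three lines separated by @CRLF.
Local $sData = "12 abc." & @CRLF & "13 def;" & @CRLF & "14 ghi."

; The five newline convention prefixes listed above.
Local $aConventions[5] = ["(*CR)", "(*LF)", "(*CRLF)", "(*ANYCRLF)", "(*ANY)"]
Local $aRes

For $i = 0 To UBound($aConventions) - 1
    ; Prepend the convention to the unchanged multiline pattern.
    $aRes = StringRegExp($sData, $aConventions[$i] & "(?m)(^\d+.*\.$)", 3)
    If @error Then
        ConsoleWrite($aConventions[$i] & ": no match" & @LF)
    Else
        ConsoleWrite($aConventions[$i] & ": " & UBound($aRes) & " match(es): " & _ArrayToString($aRes, " | ") & @LF)
    EndIf
Next
```

Since the prefix is given explicitly on every call, the comparison does not depend on whichever default the build applies.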

_________________________________________________________

The second point offers more room for debate. The question is: should we force the UCP option internally or should we leave it to the users to specify it when they actually need it?

You all know I've been a strong lobbyist for compiling PCRE with full Unicode support (UCP option). It lets users of non-English scripts (= written languages) see casing apply to their fancy letters and use category properties, and this is very important.

The issue is that the UCP option is currently forced ON internally. Not only does it slow down most pattern matching to a great extent, it also prevents users from resetting the option. The consequence is that many common features like \w or \b change their meaning and extend to the full Unicode plane 0 (the AutoIt charset). That may not be what users want, but they have no way to revert to the non-UCP behavior.

If we leave UCP support in (that is the --enable-unicode-properties library compile-time option) but do not force it at pattern compile-time (by not setting the PCRE_UCP option bit passed to pcre_compile[2]()), I feel we have the best of all worlds. By default, pattern matching will run at the best speed and if/when people know or suspect they will have to match non-english letters, non-ASCII punctuation and the like, they can always prepend (*UCP) right at the start of their patterns, which is the pattern option to enable that feature.
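
As a rough illustration of the opt-in approach, the sketch below runs the same \w+ pattern with and without the (*UCP) prefix. It assumes a build where UCP is compiled in but not forced on; on a build that forces UCP, both lines should print the same thing. The subject is built with ChrW() so the script does not depend on the encoding of the source file.

```autoit
#include <Array.au3>

; "café 123", with the accented letter built from its codepoint.
Local $sSubject = "caf" & ChrW(0xE9) & " 123"

; Pattern as written: \w follows whatever default the build applies.
Local $aPlain = StringRegExp($sSubject, "(\w+)", 3)

; Same pattern with UCP requested explicitly at the start of the pattern.
Local $aUcp = StringRegExp($sSubject, "(*UCP)(\w+)", 3)

ConsoleWrite("default: " & _ArrayToString($aPlain, " | ") & @LF)
ConsoleWrite("(*UCP) : " & _ArrayToString($aUcp, " | ") & @LF)
```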

I'm sorry if this sounds complicated, but it really isn't. In any case, the outcome affects the regexp summary in the StringRegExp help file.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

kylomas

jchd,

Wouldn't (*ANYCRLF) be the best default?

kylomas


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

jchd

I don't think so.

Say you parse a CSV file where CRLF terminates lines, but where text fields are allowed to contain a lone CR or LF to denote internal line or paragraph breaks. That is not a vaporware invention; such files do exist.

My idea is that the default (whether set library-wise or as a compile-time option passed to the compiling function) should match the default, most common newline convention in use under Windows. This is the least surprising behavior users can expect.

The upcoming release is going to be the most script-breaking in a long time (v2 --> v3 was something different). I feel it's the right time to fix a not-so-good choice, for the very reason my example script shows.

To be fair, I must admit that this detail escaped me for a long time, although it bit me several times when I was unable to spare the time to check the root cause.

Laziness always bites you (and me), sooner or later.

As a simple demonstration that we currently do unexpected things:

Local $aRes = StringRegExp("abc" & @CRLF, "(^.*$)", 1)
ConsoleWrite("Length of captured line '" & $aRes[0] & "' is " & StringLen($aRes[0]) & @LF)

I find this elementary result unduly troublesome, in the Windows world that is.
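
For comparison, here is the same demonstration with the convention spelled out explicitly, so the result does not depend on the built-in default. Going by PCRE's documented rules, with (*LF) the dot is free to match the @CR, so I'd expect a length of 4; with (*CRLF) the @CR belongs to the newline, so I'd expect 3. The variable names are only for illustration.

```autoit
Local $sSubject = "abc" & @CRLF

; LF-only convention: @CR is an ordinary character that . can match.
Local $aLF = StringRegExp($sSubject, "(*LF)(^.*$)", 1)

; CRLF convention: the @CR & @LF pair as a whole is the newline.
Local $aCRLF = StringRegExp($sSubject, "(*CRLF)(^.*$)", 1)

ConsoleWrite("(*LF)   captured length: " & StringLen($aLF[0]) & @LF)
ConsoleWrite("(*CRLF) captured length: " & StringLen($aCRLF[0]) & @LF)
```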

kylomas

jchd wrote: "My idea is that the default (library-wise or pattern compile-time option passed to the compiling function) should match the default, most common newline convention in use under Windows. This is the least surprising behavior users can expect."

Yes, thank you...I did not realize that CR and/or LF were used for other than EOL.

edit:  your SRE example perfectly illustrates the need for this!

jchd

I really should have spotted the newline convention "misfeature" well before, and also anticipated, while lobbying for UCP support, that UCP ON by default would not be the best option. But then who's perfect around here? Certainly not me!

Changing the newline convention now is only marginally script-breaking and should bring more good than tears in the short or medium term.

Before the release is the right time to adjust the UCP behavior, unless strong arguments against that pop up here.

Now it's up to you to finally decide but as you say, users can always select more appropriate options at run-time for the job at hand.


jchd

Great, thanks for listening.


jchd

Not at all. UTF support is built-in and enabled. UCP support is built-in and will be disabled by default in the next version.

If you need UCP features, just prepend "(*UCP)" to your pattern.


Iczer

Why not do it automatically (internally)? That would give the best speed for any pattern without extra code. I mean, it's not always known for sure what a pattern may contain:

If StringRegExp($infoString, $complexPatternAR[$x], 0) Then

jchd

As I explained in the first post, there are good reasons to leave UCP disabled by default. First, it doesn't break compatibility with the existing regexp base. Second, users who need UCP can still enable it very easily. Third, operations are much faster.

If you want you can always use:

#include <Array.au3>

Func _StringRegExpUCP($sSubject, $sPattern, $iFlag = 0, $iOffset = 1)
    Return StringRegExp($sSubject, "(*UCP)" & $sPattern, $iFlag, $iOffset)
EndFunc

Local $a = _StringRegExpUCP("ðăĈŹƶƺ ɱɵɸʍ ξςάώϝ ຍຈຂກნჰ fiﺨﺱﺞﺌﺢﻋﻕﺦﺂ", "(\w+)", 3)
_ArrayDisplay($a)

I fail to see where the problem resides.


Iczer

What I mean is patterns like this:

speed
スピード
ઝડપ
velocitatem
accélérer

That is "speed" in different languages. If in most cases (50%) it's English, then regexp would get the best overall speed if UCP were enabled automatically by the regexp function itself.

Maybe I don't understand something, though.

Jon

How would the function know to enable the option or not? Scan through the string looking for surrogates or something? What if it was a 2MB string that didn't have any surrogates in it? That would mean it would end up scanning a 2MB string, deciding it doesn't need UCP support and then doing a regexp on top of that processing.

jchd

Iczer,

What you showed is a subject, not a pattern.

Anyway once UCP is enabled by prepending the string "(*UCP)" at the head of the pattern (just like in my _StringRegExpUCP function), it will be using UCP support for the entire pattern. The added time needed by the PCRE pattern compiler to parse "(*UCP)" ahead of the pattern and raise its internal UCP flag is close to zero.

Once UCP is ON, either by internal option or through the "(*UCP)" option in the pattern, matching speed for many metacharacters and escapes slows down considerably, irrespective of the range of codepoints in the subject. For instance, casing, \d, \D, \b, \B, \w, \W, ..., many POSIX classes, etc. see their character range vastly extended when UCP is on. UCP causes the engine to use Unicode tries to check character properties for every codepoint in the subject (irrespective of which script or category it belongs to: Punctuation, Math symbol, Currency, Greek, Cyrillic, Thai, Latin, Bopomofo, ...), and that is what makes it significantly slower than with UCP OFF.
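
Anyone who wants to measure the difference on their own machine could use something like the rough sketch below. _TimePattern is a hypothetical helper, not part of AutoIt; it relies only on TimerInit, TimerDiff and StringRegExp. The subject is pure ASCII on purpose, to check the claim that the slowdown does not depend on the codepoints actually present. It assumes a build where UCP is not forced on; otherwise both timings should come out about the same. No figures are given here because they vary with the build and the hardware.

```autoit
; Hypothetical helper: average time in milliseconds for one global match.
Func _TimePattern($sSubject, $sPattern, $iRuns = 20)
    Local $hTimer = TimerInit()
    For $i = 1 To $iRuns
        StringRegExp($sSubject, $sPattern, 3)
    Next
    Return TimerDiff($hTimer) / $iRuns
EndFunc

; Build a plain-ASCII subject made of many short words.
Local $sSubject = ""
For $i = 1 To 20000
    $sSubject &= "word" & $i & " "
Next

; Same pattern, once as written and once with UCP requested explicitly.
ConsoleWrite("without (*UCP): " & _TimePattern($sSubject, "\b\w+\b") & " ms" & @LF)
ConsoleWrite("with (*UCP)   : " & _TimePattern($sSubject, "(*UCP)\b\w+\b") & " ms" & @LF)
```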

I don't get your point.

Jon,

I don't think Iczer is contemplating support for Unicode planes other than the BMP, plane 0, where most common languages live. Planes > 0 contain extinct languages, musical and math symbols, a few very rare live languages and, yes, a whole new range of Asian ideographs, but the use of these codepoints is fairly uncommon. So no, we certainly won't look for UTF-16 surrogates!

Iczer

I understand the reasons.

On a side note, what about adding an option to set the UCP flag externally, something like this:

StringRegExp ( "test", "pattern" [, flag = 0 [, offset = 1 [, UCPisOff = 1 ]]] )

so I don't need to change patterns and it can be used like this:

For $i = 1 To $patternsArray[0]
    If StringRegExp ( "test", $patternsArray[$i], 0, 1, StringIsASCII ( $patternsArray[$i] ) ) Then
        ;...
    EndIf
Next
guinness

I'm happy to use jchd's workaround.


jchd

Given that the newly introduced UCP support changes the semantics of so many (sub)patterns, I believe an option is not that useful. In the general case, existing patterns will have to be examined for correctness in case the extension to Unicode causes unexpected results. In my view, options are useful for adjusting the behavior of a function when this adjustment is not possible otherwise. Here (*UCP) in the pattern string will do what your option proposes.

New patterns using UCP features actively can use a custom function like above or prepend (*UCP) anyway at no cost.

Inserting or concatenating a 6-char string is not a chore as I see it. You can do the same as your option this way:

#include <Array.au3>

Local $aSubject = "ðăĈŹƶƺ ɱɵɸʍ ξςάώϝ ຍຈຂກნჰ fiﺨﺱﺞﺌﺢﻋﻕﺦﺂ", $sPattern = "(\w+)", $aResult

Func _StringRegExp($sSubject, $sPattern, $iFlag = 0, $iOffset = 1)
    Return StringRegExp($sSubject, (StringIsASCII($sSubject) ? "" : "(*UCP)") & $sPattern, $iFlag, $iOffset)
EndFunc

$aResult = _StringRegExp($aSubject, $sPattern, 3)
_ArrayDisplay($aResult)

BTW, scanning the whole string to determine whether it's exclusively ASCII, just for the sake of saving a few cycles during the subsequent matching process, seems a non-optimization to me.


Iczer

But I hear the ternary operator is slooow :) (about 4 times slower). I think I'd better get by with some pattern-preparing subfunction.

jchd

You're kidding, right?

