
All the characters in regular expressions are case sensitive



To get the full range of

$timer = TimerInit()
$sRange = _GetRangeSPE()
MsgBox(0, "Timer", Round(TimerDiff($timer) / 1000, 2) & ' sec')

$hFile = FileOpen(@ScriptDir & '\Range.txt', 2)
FileWrite($hFile, $sRange)
FileClose($hFile)

Func _GetRangeSPE()
    Local $Lower, $Upper, $s, $sRange, $tmp, $trg1 = 0, $trg2 = 0

    For $i = 0x80 To 0xFFFF
        $s = ChrW($i)
        $Upper = StringUpper($s)
        $Lower = StringLower($s)
        If Not ($Upper == $Lower) Then
            $trg1 += 1
            $tmp = $i
        Else
            $trg1 = 0
        EndIf

        Switch $trg1
            Case 1
                $sRange &= '\x{' & Hex($i, 4) & '}'
            Case 2
                $trg2 = 1
            Case 3
                $sRange &= '-'
            Case 0
                If $trg2 Then
                    $trg2 = 0
                    $sRange &= '\x{' & Hex($tmp, 4) & '}'
                EndIf
        EndSwitch
    Next
    Return $sRange
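EndFunc   ;==>_GetRangeSPE

; Separate script: wrap each character of the mask that falls in the precomputed range in an [Xx] class
$timer = TimerInit()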
$sRes = __FO_UserLocale2('Она может задать диапазон не обязательно для русского языка. Check if a string fits a given regular expression pattern.', '\x{00C0}-\x{00D6}\x{00D8}-\x{00DE}\x{00E0}-\x{00F6}\x{00F8}-\x{012F}\x{0132}-\x{0137}\x{0139}-\x{0148}\x{014A}-\x{017E}\x{0181}-\x{018C}\x{018E}-\x{0194}\x{0196}-\x{0199}\x{019C}\x{019D}\x{019F}-\x{01A5}\x{01A7}-\x{01A9}\x{01AC}-\x{01B9}\x{01BC}\x{01BD}\x{01C4}\x{01C6}\x{01C7}\x{01C9}\x{01CA}\x{01CC}-\x{01EF}\x{01F1}\x{01F3}-\x{01F5}\x{01FA}-\x{0217}\x{0253}\x{0254}\x{0256}\x{0257}\x{0259}\x{025B}\x{0260}\x{0263}\x{0268}\x{0269}\x{026F}\x{0272}\x{0275}\x{0283}\x{0288}\x{028A}\x{028B}\x{0292}\x{0386}\x{0388}-\x{038A}\x{038C}\x{038E}\x{038F}\x{0391}-\x{03A1}\x{03A3}-\x{03AF}\x{03B1}-\x{03CE}\x{03E2}-\x{03EF}\x{0401}-\x{040C}\x{040E}-\x{044F}\x{0451}-\x{045C}\x{045E}-\x{0481}\x{0490}-\x{04BF}\x{04C1}-\x{04C4}\x{04C7}\x{04C8}\x{04CB}\x{04CC}\x{04D0}-\x{04EB}\x{04EE}-\x{04F5}\x{04F8}\x{04F9}\x{0531}-\x{0556}\x{0561}-\x{0586}\x{10A0}-\x{10C5}\x{1E00}-\x{1E95}\x{1EA0}-\x{1EF9}\x{1F00}-\x{1F15}\x{1F18}-\x{1F1D}\x{1F20}-\x{1F45}\x{1F48}-\x{1F4D}\x{1F51}\x{1F53}\x{1F55}\x{1F57}\x{1F59}\x{1F5B}\x{1F5D}\x{1F5F}-\x{1F7D}\x{1FB0}\x{1FB1}\x{1FB8}-\x{1FBB}\x{1FC8}-\x{1FCB}\x{1FD0}\x{1FD1}\x{1FD8}-\x{1FDB}\x{1FE0}\x{1FE1}\x{1FE5}\x{1FE8}-\x{1FEC}\x{1FF8}-\x{1FFB}\x{2160}-\x{217F}\x{24B6}-\x{24E9}\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}')
MsgBox(0, "Timer", Round(TimerDiff($timer), 2) & ' msec' & @LF & $sRes)

Func __FO_UserLocale2($sMask, $sLocale)
    Local $s, $tmp
    $sLocale = StringRegExpReplace($sMask, '[^' & $sLocale & ']', '')
    $tmp = StringLen($sLocale)
    For $i = 1 To $tmp
        $s = StringMid($sLocale, $i, 1)
        If $s Then
            If StringInStr($sLocale, $s, 0, 2, $i) Then
                $sLocale = $s & StringReplace($sLocale, $s, '')
            EndIf
        Else
            ExitLoop
        EndIf
    Next
    If $sLocale Then
        $tmp = StringSplit($sLocale, '')
        For $i = 1 To $tmp[0]
            $sMask = StringReplace($sMask, $tmp[$i], '[' & StringUpper($tmp[$i]) & StringLower($tmp[$i]) & ']')
        Next
    EndIf
    Return $sMask
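EndFunc   ;==>__FO_UserLocale2

; Separate script: same idea with the full range \x{80}-\x{ffff}; characters without a case distinction are skipped
$timer = TimerInit()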
$sRes = __FO_UserLocale('Она может задать диапазон не обязательно для русского языка. Check if a string fits a given regular expression pattern.', '\x{80}-\x{ffff}')
MsgBox(0, "Timer", Round(TimerDiff($timer), 2) & ' msec' & @LF & $sRes)

Func __FO_UserLocale($sMask, $sLocale)
    Local $s, $tmp
    $sLocale = StringRegExpReplace($sMask, '[^' & $sLocale & ']', '')
    $tmp = StringLen($sLocale)
    For $i = 1 To $tmp
        $s = StringMid($sLocale, $i, 1)
        If $s Then
            If StringInStr($sLocale, $s, 0, 2, $i) Then
                $sLocale = $s & StringReplace($sLocale, $s, '')
            EndIf
        Else
            ExitLoop
        EndIf
    Next
    If $sLocale Then
        Local $Upper, $Lower
        $tmp = StringSplit($sLocale, '')
        For $i = 1 To $tmp[0]
            $Upper = StringUpper($tmp[$i])
            $Lower = StringLower($tmp[$i])
            If Not ($Upper == $Lower) Then $sMask = StringReplace($sMask, $tmp[$i], '[' & $Upper & $Lower & ']')
        Next
    EndIf
    Return $sMask
EndFunc   ;==>__FO_UserLocale

I want to search for files whose names are sensitive to the register on any system, i.e. to letter case. This is how I build the regular expression.

Am I on the right track?
 


I'm sorry, but I don't quite understand your need. The part I don't get is "names are sensitive to the register on any system".

Anyway, changing the casing of Unicode codepoints is non-trivial. There are a number of problematic codepoints, like the German Eszett, title case ligatures, Turkish dotted vs. dotless i and more.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


For Cyrillic I need a pattern similar to [Ww][Oo][Rr][Dd]. At the moment the range is set manually, but I would like it to be determined automatically for every language.

 

There are a number of problematic codepoints, like the German Eszett, title case ligatures, Turkish dotted vs. dotless i and more.

Are you saying that the AutoIt3 functions (StringLower, StringUpper) will get these wrong too?
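The [Ww][Oo][Rr][Dd] idea can be sketched in a few lines. This is a minimal illustration, not code from the thread: the helper name _CaseClassPattern is hypothetical, it relies only on StringSplit, StringUpper and StringLower, and it assumes the input contains no regex metacharacters (they are not escaped here).

```autoit
; Hypothetical helper (not from the thread): wrap each cased character of $sWord
; in a two-character class so that a case-sensitive regex accepts either case.
; Assumption: $sWord contains no regex metacharacters; nothing is escaped.
Func _CaseClassPattern($sWord)
    Local $sOut = '', $sUp, $sLo
    Local $aCh = StringSplit($sWord, '') ; empty delimiter: one element per character
    For $i = 1 To $aCh[0]
        $sUp = StringUpper($aCh[$i])
        $sLo = StringLower($aCh[$i])
        If $sUp == $sLo Then
            $sOut &= $aCh[$i] ; no case distinction (digit, punctuation): keep as-is
        Else
            $sOut &= '[' & $sUp & $sLo & ']'
        EndIf
    Next
    Return $sOut
EndFunc   ;==>_CaseClassPattern

ConsoleWrite(_CaseClassPattern('Word') & @CRLF) ; [Ww][Oo][Rr][Dd]
```

For non-Latin input the result depends entirely on what StringUpper and StringLower return for each character, which is exactly the point under discussion.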

Edited by AZJIO

That's not straightforward, even if our PCRE library were compiled with UCP support (which is sorely lacking).

Basic and extended Cyrillic are handled fine by ToUpper/ToLower but, as I said, some codepoints are difficult to handle.

For instance, a "westerner" would say [Ii] (that is, [\x49\x69]) is OK, but a Turkish speaker would need both [Iı] (that is, [\x49\x{0131}]) and [İi] (that is, [\x{0130}\x69]).

The same issue arises with dotted vs. dotless J, with the German Eszett ß ⇄ SS (the newly introduced uppercase Eszett makes that even worse), and with several codepoints that have distinct uppercase, titlecase and lowercase forms (e.g. DŽ, Dž, dž)...
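To make the Turkish case concrete, here is a small sketch (an illustration only, not code from the thread). The classes are built with ChrW() so the script does not depend on how the source file is encoded; the variable names are made up for the example.

```autoit
; Illustration only: explicit character classes for the two Turkish i pairs.
Local $sDotlessClass = '[I' & ChrW(0x0131) & ']' ; I / dotless i (U+0131)
Local $sDottedClass = '[' & ChrW(0x0130) & 'i]'  ; dotted capital I (U+0130) / i
Local $sWord = 'K' & ChrW(0x0131) & 'z'          ; a word spelled with dotless i

; StringRegExp returns 1 on a match and 0 otherwise (default flag).
ConsoleWrite('Turkish class : ' & StringRegExp($sWord, '^K' & $sDotlessClass & 'z$') & @CRLF)
ConsoleWrite('Western class : ' & StringRegExp($sWord, '^K[Ii]z$') & @CRLF)
ConsoleWrite('Dotted class  : ' & StringRegExp($sWord, '^K' & $sDottedClass & 'z$') & @CRLF)
```

Only the first pattern lists U+0131, so it is the only one that can match this word; which pairing is "right" depends on the language of the text, not on the codepoint alone.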

Edited by jchd


As a last note (from me) in this thread, observe that a number of codepoints don't round-trip correctly, and there are also exceptional cases like the Greek sigma (one capital letter but two distinct lowercase letters, depending on whether or not the letter is in final position in a word).

Local $s = 'ß'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'DŽ'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'SS'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'Dž'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'dž'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'Σ'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'σ'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
$s = 'ς'
MsgBox(0, '', $s & @LF & StringUpper($s) & @LF & StringLower($s))
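Rather than testing characters one by one, the round-trip behaviour described above could be surveyed over the whole range used earlier in the thread. The following is a rough sketch under that assumption, not a definitive test: it only uses ChrW, StringUpper, StringLower and Hex, and its result will depend on the AutoIt version and the system.

```autoit
; Sketch: list code points in 0x80..0xFFFF whose case mapping does not round-trip
; through StringUpper/StringLower. Characters without a case distinction are skipped.
Local $iBad = 0, $s, $sUp, $sLo
For $i = 0x80 To 0xFFFF
    $s = ChrW($i)
    $sUp = StringUpper($s)
    $sLo = StringLower($s)
    If $sUp == $sLo Then ContinueLoop ; caseless as far as AutoIt is concerned

    ; Round-trip holds when the character is one of its own two forms
    ; and the two forms map back onto each other.
    If Not (($s == $sUp Or $s == $sLo) And (StringLower($sUp) == $sLo) And (StringUpper($sLo) == $sUp)) Then
        $iBad += 1
        ConsoleWrite('U+' & Hex($i, 4) & @CRLF)
    EndIf
Next
ConsoleWrite($iBad & ' code points do not round-trip' & @CRLF)
```

Multi-character mappings such as ß ⇄ SS cannot be detected this way, since the loop only looks at single code units.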

More subtleties are detailed in this document.

Also PCRE included limited support for codepoints having more than one "other cases" like the Greek sigma in version 8.32 (2012/12/30) and 8.33 (released today), but we're far behind that (and the goodness of many other new features like the JIT option).

Edited by jchd


There are two options: either not use it at all, or use it and accept a few exceptions that are difficult to process. I think it will still be convenient for many people, even if the implementation only covers 99% of cases.

Let those who actually have this problem speak up.

I am choosing between the full range \x{80}-\x{ffff} and a partial one such as \x{014A}-\x{017E}\x{0181}-\x{018C}, etc. The second option is faster. I hope the Unicode ranges will not change.


Also PCRE included limited support for codepoints having more than one "other cases" like the Greek sigma in version 8.32 (2012/12/30) and 8.33 (released today), but we're far behind that (and the goodness of many other new features like the JIT option).

I see AutoIt v3.3.8.1 is using PCRE v8.12.


AZJIO

I've developed a small SQLite extension to handle such issues "mostly gracefully". You could download the source and adapt it to your needs. Search for unifuzz on the forum.

guinness

Yeah, we're still using a prehistoric version. That's a pity, since a large number of very useful features have been introduced since 8.12. The first bonus is native support for UTF-16 (and UTF-32 as well), which would avoid going back and forth with UTF-8 and would speed up and simplify the code greatly. Then many dark corners have been cleared up, and finally the JIT engine is MUCH faster in most uses. Of course we also badly need callbacks, UCP, ...
