Problem with regexp after updating autoit

PClough · April 11, 2018

Hi everyone!

After updating autoit, I tried to run an old program using complex regexp's. It did not work. Eventually I broke the problem down to this example:

#include <Array.au3>

$buf = "First title" & @CRLF & "Tom" & Chr(0x92) & "s sleepwalking" & @CRLF & "Last | line" & @CRLF

$items = StringRegExp($buf, '([\x20-\xff]+)\x0d\x0a', 3)

_ArrayDisplay($items,'')

And this is the result I get when running it:

Row 0 First title

Row 1 s sleepwalking

Row 2 Last | line

In other words, PCRE considers that Chr(0x92) is not within the [\x20-\xff] range. Whereas if I replace Chr(0x92) with Chr(0x27), it works fine. Can't figure out how this can be. Would anyone have a clue?

Thanks!

boomingranny · April 12, 2018

The same thing happened on my pc when I tested your code.

Taking it a step further:

For $i = 0x20 To 0xFF
    $buf = "First title" & @CRLF & "Tom" & Chr($i) & "s sleepwalking" & @CRLF & "Last | line" & @CRLF
    $items = StringRegExp($buf, '([\x20-\xff]+)\x0d\x0a', 3)
    If StringLen($items[1]) <= 14 Then ConsoleWrite(Hex($i) & " " & Chr($i) & " " & $items[1] & @CRLF)
Next

shows that the following characters appear to be incorrectly excluded from the regular expression rule:

hex        char  $items[1]

00000080   €   s sleepwalking
00000082   ‚   s sleepwalking
00000083   ƒ   s sleepwalking
00000084   „   s sleepwalking
00000085   …   s sleepwalking
00000086   †   s sleepwalking
00000087   ‡   s sleepwalking
00000088   ˆ   s sleepwalking
00000089   ‰   s sleepwalking
0000008A   Š   s sleepwalking
0000008B   ‹   s sleepwalking
0000008C   Œ   s sleepwalking
0000008E   Ž   s sleepwalking
00000091   ‘   s sleepwalking
00000092   ’   s sleepwalking
00000093   “   s sleepwalking
00000094   ”   s sleepwalking
00000095   •   s sleepwalking
00000096   –   s sleepwalking
00000097   —   s sleepwalking
00000098   ˜   s sleepwalking
00000099   ™   s sleepwalking
0000009A   š   s sleepwalking
0000009B   ›   s sleepwalking
0000009C   œ   s sleepwalking
0000009E   ž   s sleepwalking
0000009F   Ÿ   s sleepwalking

Edited April 12, 2018 by boomingranny

Andreik · April 12, 2018

Are you running the latest version of AutoIt?

mikell · April 12, 2018

May I suggest

$items = StringRegExp($buf, '([[:^cntrl:]]+)', 3)

PClough · April 12, 2018

Thanks Mikell, your suggestion works fine. But why is it that the original code doesn't? I have lots of regexps using constructs such as "([\x20-\xff])" with all kinds of boundaries which cannot be simulated using POSIX classes. Do you know if there is any way of knowing when they will work and when they won't?

jchd · April 12, 2018

Guys, you mix ANSI and Unicode codepoints. AutoIt strings are native Unicode (well almost).

Indeed, Chr(0x92) translates into the character ~~Prime~~ RIGHT SINGLE QUOTATION MARK, hex Unicode codepoint 0x2019 (using the western latin codepage):

ConsoleWrite(Hex(AscW(Chr(0x92))) & @LF)

It's the same issue with a number of "extended ANSI" characters, which depend on the ANSI codepage in force. That's the whole point of inventing Unicode, first published in 1991.

PCRE in AutoIt has handled Unicode strings since version 3.3.10.0 (23rd December, 2013) (Release), as the changelog says.

EDIT: actually the official name of Unicode codepoint U+2019 is "RIGHT SINGLE QUOTATION MARK" and its glyph looks like this: < ’ >. The name "PRIME" refers to codepoint U+2032 and its glyph looks like this: < ′ >, pretty much the same visually.

Edited April 12, 2018 by jchd
Precision

jchd · April 12, 2018

See my answer in the other post.

JLogan3o13 · April 12, 2018

@PClough making multiple posts on the same question will not get it answered more quickly. Stick to one in the future. Threads merged.

mikell · April 12, 2018

@jchd
So please how should the OP's initial pattern be correctly written - if possible ?
I'm a little confused as the doc says
Non-printing characters
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern (...). In an ASCII or Unicode environment, these escapes are as follows:
\xhh character with hex code hh

jchd · April 12, 2018

If the subject string actually contains extended ASCII characters of some ANSI codepage that are not mapped identically in Unicode, then they can only be filtered in/out using their Unicode codepoint value or directly using the literal character itself. Re-using the OP code both examples below work:

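#include <Array.au3>
$buf = "First title" & @CRLF & "Tom" & Chr(0x92) & "s sleepwalking" & @CRLF & "Last | line" & @CRLF
; Codepoints above 0xFF need the braced form \x{hhhh}; a bare \x2019 is read as \x20 followed by "19"
$items = StringRegExp($buf, '([\x20-\xff\x{2019}]+)\x0d\x0a', 3)
_ArrayDisplay($items,'')
; Same thing using the literal character in the class
$items = StringRegExp($buf, '([\x20-\xff’]+)\x0d\x0a', 3)
_ArrayDisplay($items,'')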

Of course, if the goal is to split the input on line breaks, using a regexp like this is both overkill and prone to failure. StringSplit would do the job fine, as well as

$items = StringRegExp($buf, '(.*)\R', 3)

Last note: given that many distinct Unicode single or double quotes, apostrophes and primes (and accents and diacritics!) look very similar despite having different codepoints, I recommend being very careful not to confuse them.

To wit (and I didn't put them all): '´ʹʼʽʻʾʿ̀`ՙ՝᾿῀´῾‘’‛“”‟′″‴‵‶‷
Have fun telling which is which just by looking!

PClough · April 12, 2018

Thanks for the clarification concerning Unicode versus extended ANSI. But this is making things even more confusing in the following case, which is actually where I initially got into trouble, leading to my first post (by the way, sorry to JLogan3o13 for having made a double post; partly my mistake, partly due to the fact that the timeout of the forum session is a bit short, and partly due to the fact that once I realised I had made this double posting, I found no way of deleting the incomplete one).

So, going back to the original problem: the program in question does some analysis of large text documents. Some of these text documents are saved using the export function of various word processors, others are copied and pasted from a word processor into a text editor. In any case, as a result, these documents are ANSI texts, but often include some of these weird characters mentioned by jchd. In this particular case, the programme was failing on one of the "RIGHT SINGLE QUOTATION MARK" characters mentioned by jchd, but also on other similar extended ANSI characters, like curly double quotes etc.

So, jchd suggests that these weird characters should be filtered in by explicitly listing them in the regex range, which is obviously not practical given that there are a lot of them. Isn't there another way which would make it possible to use the [\xmm-\xMM] range definition? Otherwise, this means that this functionality of PCRE is not actually usable with AutoIt's Unicode strings. Any suggestion?

PS - Of course I realise that a StringSplit on CRLF would do the job perfectly well here. But this was just a simple way of explaining my problem.

jchd · April 12, 2018

OK so let's try to go back to the root problem. Indeed, word processors have a tendency to replace ASCII single quotes with curly apostrophes (e.g. Word), and these aren't ASCII characters anymore. Furthermore, the output you get may seem to be ANSI but may actually prove to be UTF8.

So to help make progress, tell us on which criterion you intend to split or otherwise process your inputs. Posting a reasonably short real-world example would help.

FYI, current implementation of PCRE in AutoIt includes the support for UCP (Unicode character properties), which allows very complex patterns.
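
As a rough illustration of what property-based matching could look like on the sample string from this thread, the sketch below matches runs of anything that is not a control character, using the Unicode general category Cc instead of a numeric \x.. range. It assumes that the \p{..} / \P{..} property escapes are enabled in your AutoIt build's PCRE (check the StringRegExp help page for your version), and it builds the test string with ChrW(0x2019) so that it does not depend on the ANSI codepage. The window title is arbitrary.

```autoit
#include <Array.au3>

; Build the sample with ChrW() so the curly apostrophe is U+2019 on every machine.
Local $buf = "First title" & @CRLF & "Tom" & ChrW(0x2019) & "s sleepwalking" & @CRLF & "Last | line" & @CRLF

; \P{Cc} = any character whose general category is NOT "control".
; CR and LF are control characters, so each run stops at the line break.
; Assumption: Unicode property escapes are available in this PCRE build.
Local $items = StringRegExp($buf, '(\P{Cc}+)', 3)

_ArrayDisplay($items, 'Non-control runs')
```

If the property escapes are available, each of the three lines should come back whole, curly apostrophe included, without having to list any codepoints.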

PClough · April 13, 2018

Thanks jchd for trying to help with this particular case, but what has been said so far shows that I've missed some rather fundamental changes in Autoit, namely the fact that, as you wrote, "AutoIt strings are native Unicode (well almost)". So, this raises some questions:

(1) Sorry for being dumb, but I'm not sure what you mean by "native Unicode", nor by "well almost".

(2) Since I have a lot of existing code which works on the assumption that Chr(0xmm) is the old ANSI character coded as 0xmm, for 0 <= mm <= 0xff, I need to understand the Unicode implementation used by AutoIt, in case I need to change and recompile some of my code. For instance: since Chr(0x92) = U+2019, is this transcoding into Unicode systematically done for all 0xmm >= 0x80? More generally, is there any document explaining AutoIt's Unicode implementation?

(3) If Autoit automatically transcodes every string into Unicode, doesn't this imply that PCRE should always be used in UCP mode?

(4) What is the impact of the Unicode encoding on non regex functions, especially StringInStr, given that this is one of the most used functions (in my code, in any case)? In particular, since StringInStr returns an offset in the source string, can we safely assume that this offset is correct whatever the content of the source string (i.e. including if it contains multi-bytes codepoints)?

PS - Sorry, I feel like I'm asking for a full-length lecture and I don't want to impose this on anyone. So, once again, if there's any document or part of the source code that would give me an idea of how this works, let me know and I'll manage to find my way. Thanks for your help in any case!

jchd · April 13, 2018

Ok, let's go.

First, notice the difference between three distinct things:
a character set (assignment of a numeric value [a codepoint] to a character name),
the representation of those characters in memory or in transit on a network (the encoding)
and the rendering of a given character (its visual aspect, or final rendered glyph) in a given font.

AutoIt currently uses character set UCS2, a subset of the full Unicode standard, where every character is represented by a single 16-bit coding unit. This encoding gives access to what is called the Unicode BMP (Basic Multilingual Plane) of 64K characters, enough for most purposes in common practice. Full Unicode can use several encodings:

UTF8 (multibyte encoding, coding unit = one byte, some characters need 4 bytes for representing)
UTF16-LE (multi-word encoding, coding unit = one 16-bit word in little endian representation, some characters need 2 encoding units)
UTF16-BE (multi-word encoding, coding unit = one 16-bit word in big endian representation, some characters need 2 encoding units)
UTF32-LE (fixed-width encoding, coding unit = one 32-bit word in little endian representation, all characters need 1 encoding unit)
UTF32-BE (fixed-width encoding, coding unit = one 32-bit word in big endian representation, all characters need 1 encoding unit).

Native AutoIt strings use UCS2 in UTF16-LE encoding (limited to one single 16-bit word per character). Windows has been using the full Unicode range in UTF16-LE encoding natively for a long time. Yet characters above the BMP (whose codepoints are greater than 0xFFFF) are rather rare and difficult to display, as very few fonts are capable of doing so and no known font covers the full Unicode range (far from it). Of course most users don't need to display Egyptian hieroglyphs, dominoes, mahjong tiles or ancient musical symbols on a daily basis.
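
To make the "one 16-bit word per character" point concrete, here is a small sketch (the sample string is made up for illustration). String functions such as StringLen and StringInStr count 16-bit units, not bytes; a byte count only appears once the string is converted with StringToBinary, where flag 4 requests UTF-8 and flag 2 requests UTF-16 LE.

```autoit
; ChrW(0x2019) is the curly apostrophe: one 16-bit unit in an AutoIt string.
Local $s = "Tom" & ChrW(0x2019) & "s"

ConsoleWrite("StringLen:    " & StringLen($s) & @LF)                      ; number of 16-bit units
ConsoleWrite("StringInStr:  " & StringInStr($s, "s") & @LF)               ; 1-based position of "s"
ConsoleWrite("UTF-8 bytes:  " & BinaryLen(StringToBinary($s, 4)) & @LF)   ; flag 4 = UTF-8
ConsoleWrite("UTF-16 bytes: " & BinaryLen(StringToBinary($s, 2)) & @LF)   ; flag 2 = UTF-16 LE
```

The first two lines should both report 5, while the UTF-8 form should take 7 bytes because U+2019 needs three bytes there. So offsets returned by StringInStr stay consistent with StringLen and StringMid as long as the text is within the BMP.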

Refer to https://en.wikipedia.org/wiki/Plane_(Unicode) to get a helicopter view of what is in the various Unicode planes. You'll see that characters in the BMP alone already cover a large range of use cases worldwide. That's why, even with the limitation introduced by UCS2 (limited to the BMP), you can process a wide variety of texts. This is why I said "well, almost". That "almost" is good enough for 99% of users. A very good site centered on Unicode is https://r12a.github.io/scripts/tutorial/index by one of the early Unicode design engineers. Don't miss anything: there are lots of things to learn and discover about human languages and cultures. The apps tab is very useful too.

Sidenote: even if UCS2 limits us to the BMP, we can still create strings in AutoIt that contain characters outside the BMP. We can for instance represent the Unicode character U+2F834 ("CJK COMPATIBILITY IDEOGRAPH-2F834"), which encodes as 0xD87E 0xDC34 in UTF16-LE, by coding it as ChrW(0xD87E) & ChrW(0xDC34), but string functions will AFAIK consider that as a string of two (completely unrelated) characters.
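
For the curious, the two 16-bit values above come from the standard UTF-16 surrogate arithmetic, which can be sketched as follows. _SurrogatePair is a hypothetical helper name chosen for this example, not an AutoIt built-in.

```autoit
; Split a codepoint above 0xFFFF into its UTF-16 surrogate pair.
; _SurrogatePair is a hypothetical helper, not part of AutoIt.
Func _SurrogatePair($iCodepoint)
    Local $v = $iCodepoint - 0x10000
    Local $hi = 0xD800 + BitShift($v, 10)   ; top 10 bits (a positive shift moves bits right)
    Local $lo = 0xDC00 + BitAND($v, 0x3FF)  ; low 10 bits
    Return ChrW($hi) & ChrW($lo)
EndFunc

Local $s = _SurrogatePair(0x2F834)

; Print both 16-bit units in hex, then the length as AutoIt sees it.
ConsoleWrite(Hex(AscW(StringMid($s, 1, 1)), 4) & " " & Hex(AscW(StringMid($s, 2, 1)), 4) & @LF)
ConsoleWrite("StringLen = " & StringLen($s) & @LF)
```

This should print D87E DC34 followed by a length of 2: two 16-bit units, which AutoIt treats as two characters rather than one.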

Still there?

Now what happens with $s = "Joe" & Chr(0x92) & " garage" is a good question. Frankly ... I don't know exactly!
Run this code:

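#include <Array.au3>

Local $c = Chr(0x92)
; Show the variable type and the raw bytes of the single character
ConsoleWrite(VarGetType($c) & @TAB & Binary($c) & @LF)
$c = "A" & $c & "B"
; Split the string into individual characters, then show their codepoints
Local $a = StringRegExp($c, ".", 3)
_ArrayDisplay($a)
_ArrayDisplay(StringToASCIIArray($c))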

You'll see that the last _ArrayDisplay tells us that Chr(0x92) is now the Unicode codepoint 8217 (decimal) or 0x2019. That's because the ANSI character 0x92 in my Latin1 extended ASCII codepage is mapped to the Unicode character U+2019. Unicode codepoint 0x92 is a control character named "PRIVATE USE TWO". So in fact AutoIt does indeed do a good job of not changing the character in the string. It's only that extended ANSI codes do not always map to the same Unicode codepoint. Insiders (Devs, @trancexx, mediums) can jump in here to explain in gory detail how strings work in AutoIt, but let me offer an "out of thin air" explanation: the string $c is kept in non-concatenated form (part Unicode, part ANSI) until it's passed to some functions (which?), at which time the non-Unicode part(s) are converted into their Unicode mapping.

Conclusion: you see that while Unicode has solved the portability issue of texts in all human scripts ever used (and more: there is even a Klingon group) it also introduces some difficulties, and we've only scratched a very thin layer off a large surface here.

To see the differences between Unicode and the specific ANSI codepage you use, go to https://r12a.github.io/uniview/ and select the Latin, Basic & Latin1 supplement block, then compare the content to the Windows Charset applet in your codepage mode. There you can click on a given character to view its Unicode properties and definitions. Latin1 ANSI ("western ANSI") differs only in the range 0x80-0x9F but your codepage may differ more.
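
The same comparison can also be sketched from inside AutoIt, reusing the AscW(Chr()) trick shown earlier in this thread. The loop below simply lists every single-byte code whose Unicode codepoint differs from its ANSI value under whatever codepage is in effect on the machine running it; the output format is arbitrary.

```autoit
; List the ANSI codes that do not map one-to-one onto Unicode
; under the codepage currently in effect on this machine.
For $i = 0x80 To 0xFF
    Local $iUni = AscW(Chr($i))
    If $iUni <> $i Then ConsoleWrite("Chr(0x" & Hex($i, 2) & ") -> U+" & Hex($iUni, 4) & @LF)
Next
```

On a western (Windows-1252) machine this should list the same codes as boomingranny's table above, which is exactly the set that falls outside [\x20-\xff] once the string is Unicode. On another codepage the list will differ.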

I hope this answers some of your questions and I apologize for having lost dozens of readers. Let them RIP.

PClough · April 13, 2018

Well, thanks so much for the effort jchd! You did a good job of showing the problems one should expect. To find solutions and workaround, I'm going to have to do some experimentation, but at least I now know in which direction I should look.

There's only one remaining question that I can't resolve easily (at least not without experimenting on different computers using different codepages, which I don't have at hand). Do Autoit's internal conversion routines use the codepage of the host computer on which the script is running? I hope not, otherwise, this would mean that scripts would not necessarily be portable across continents - in the internet age, that would be a strange choice!

jchd · April 14, 2018

WRT codepages, AutoIt is in no way responsible for the non-portability issues. The portability issue was the tradeoff implied by the proliferation of diverging codepages. Each of them solved a local charset problem by making it possible to represent the characters of a given set of human scripts, but that made the overall picture worse. Babel story again: https://msdn.microsoft.com/fr-fr/library/windows/desktop/dd317756(v=vs.85).aspx

Whether you code in C, AutoIt, Python or any other language of your choice, having to process texts from various sources and codepages is going to be a nightmare. That is, unless every string comes along with its codepage metadata, something you don't have standard tools for.

Things are both simple and, as just said, difficult to deal with. If your source code uses a precise ANSI codepage, the resulting literal strings in the compiled .EXE will contain the Unicode equivalent only if this codepage was in effect when the script was compiled. The compiler can't guess that your source uses a different codepage from the one it uses itself. The same thing occurs when you directly interpret an .AU3 file without compiling.

Let me show a simple example. Suppose you are Turkish and create a local (Turkish) ANSI source file where a literal string contains the Turkish dotless i (character ı, Unicode name LATIN SMALL LETTER DOTLESS I, Unicode codepoint U+0131, Turkish ANSI code 0xFD). If the ANSI codepage in effect when the script is compiled is also Turkish ANSI, then the 0xFD will correctly translate into the Unicode codepoint U+0131 inside the string. But if you send this ANSI source file to your Vietnamese friend and he compiles it locally (with his Windows Viet ANSI codepage in effect), the character 0xFD will translate into the character ư, Unicode name LATIN SMALL LETTER U WITH HORN, Unicode codepoint U+01B0, ANSI Viet code 0xFD. If at compile time the ANSI codepage is Windows western (Latin1), then 0xFD will translate into the character ý, Unicode name LATIN SMALL LETTER Y WITH ACUTE, Unicode codepoint U+00FD, Latin1 ANSI code 0xFD, thanks to the one-to-one mapping of the range 0xA0-0xFF between Latin1 ANSI and Unicode.

Hence we arrive at a relatively peaceful point: if your source file is Unicode (UTF8 encoding) then it will be portable. Chr() isn't portable (depends on codepage in effect at compile time, or run time if you launch from the source) while ChrW() is.
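
Applied to the OP's snippet, a codepage-independent variant might look like the sketch below. It builds the apostrophe with ChrW() and uses the braced \x{hhhh} form inside the class. The range \x{2018}-\x{201D} (the curly single and double quotes) is only an illustrative choice showing that ranges work with Unicode codepoints too; it is not taken from the original code, and real data may need other ranges.

```autoit
#include <Array.au3>

; ChrW() takes a Unicode codepoint, so this string is the same on every machine.
Local $buf = "First title" & @CRLF & "Tom" & ChrW(0x2019) & "s sleepwalking" & @CRLF & "Last | line" & @CRLF

; \x20-\xff keeps the original range; \x{2018}-\x{201D} adds a range of
; codepoints above 0xFF (curly quotes), chosen here purely as an example.
Local $items = StringRegExp($buf, '([\x20-\xff\x{2018}-\x{201D}]+)\x0d\x0a', 3)

_ArrayDisplay($items, 'Portable version')
```

Since nothing in it depends on Chr() or on ANSI literals, the result should not change with the codepage of the machine running or compiling the script.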

All of what I wrote here is equally true for all programming languages I know of.

Note that there are other issues than character representation. Unicode does a great job for portability of text but it also silently hides implied metadata. If you specify that some source or data file uses, say, ANSI Hebrew, then you can assume that a standard Hebrew sort will work to sort an array of strings since you can assume that by using Hebrew codepage, all strings are indeed Hebrew. You may want to sort lexicographically, use a more sophisticated collation so that if you're French, a, A, ä, à, ... Ä come in some expected order before B, b; or use a natural sort (so "10ABC" comes after "9ABC").

But when everything is Unicode (making it easier to mix strings in German, Hebrew, Turkish, Vietnamese and French), which kind of sort (collation) should you use? Again it's a generic, difficult programming problem and the programming language used is irrelevant. This is material for yet another long story that I currently have no time to tell. Instead of reading text walls like this one, have a deep look at how Unicode recommends collating (comparing) strings: http://unicode.org/reports/tr10/ and hope to survive.

PClough · April 14, 2018

Ok and thanks for all this jchd! You've convinced me. I'm going to seriously delve into the arcana of Unicode...

jchd · April 14, 2018

Fasten your belt and have a nice ride!
