Correct regex syntax for hex characters

czardas · March 18, 2013

Hmm, it seems strange. The regexp pattern seems to be working though.

If StringRegExp(ChrW(0x1A),"[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]") Then MsgBox(0, "Error", "_broken segment_")

leuce · March 18, 2013

I don't think it is a memory problem either, because if I reduce the TMX file to just the few TUs surrounding the invalid character (i.e. total file size less than 1000 characters), the script doesn't catch the invalid character either.

trancexx · March 18, 2013

The \x{FFFD} character will get you in trouble with regexp. Go one below. And don't tell anyone I told you that. :shhh:

jchd · March 19, 2013

Codepoint U+00FFFD is perfectly valid by itself (it means "Invalid character got replaced by this special codepoint") provided that:

1) you don't object that the file contains this codepoint by itself

2) you don't object that invalid sequences in the input file get converted into this codepoint

3) the implementation of PCRE support doesn't do anything surprising

Anyway, I'm surprised you say that the pattern doesn't work to detect 0x1A:

Local $m, $s = "abc" & Chr(0x1A) & "def"

If StringRegExp($s, "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]") Then
    $m = "Failed"
Else
    $m = "Passed"
EndIf
ConsoleWrite($m & @LF)

@trancexx,

Can you elaborate further?

Edited March 19, 2013 by jchd

trancexx · March 19, 2013

The problem is for code points U+D800 to U+DFFF. Your pattern (the \x{FFFD} part) will cause wrong results here. Even though the behavior is explainable (you actually explained it), it may be seen as unexpected.

You see, \x{FFFD} matches both U+FFFD and the whole range from U+D800 to U+DFFF. That's what your pattern explicitly tries not to do. So either use \x20-\x{FFFD} or make the last hex \x{FFFC}.

Edited March 19, 2013 by trancexx

jchd · March 19, 2013

That's true if the input file is UTF-16 encoded and contains codepoints > U+FFFF (those which use surrogates).

Since the OP said he reads UTF-8 text, there should be no surrogate in the input file.

Yet a question remains, hinted at by my points 1) and 2): should invalid UTF-8 combinations not already converted to U+FFFD inside the input stream be considered charset errors? If not, then merging the ranges as trancexx did is fine; otherwise we need to parse the UTF-8 ourselves, byte by byte, and check that condition. Anyway, "native" U+FFFD in the input should NOT be excluded from the valid XML charset range, since it is explicitly allowed as a handy placeholder.

Also note that any other Unicode-conformant program reading a Unicode text file containing invalid UTF-* sequences will typically replace them with U+FFFD instance(s), so merging the ranges into \x20-\x{FFFD} is probably the simplest way to behave.
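
For comparison, here is a minimal sketch of the same validity check written against the XML 1.0 Char production. It is in Python rather than AutoIt, only because that makes it easy to run and verify; Python's `re` module writes code points as `\uXXXX` instead of PCRE's `\x{XXXX}`. The helper name `find_invalid_xml_chars` is hypothetical. Unlike the AutoIt pattern, this version also accepts U+10000 to U+10FFFF, on the assumption that the text is held as whole code points rather than UTF-16 units.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, 'U+XXXX') pairs for every character XML 1.0 forbids."""
    return [
        (m.start(), "U+%04X" % ord(m.group()))
        for m in INVALID_XML_CHAR.finditer(text)
    ]


if __name__ == "__main__":
    # U+001A is rejected; U+201A and a native U+FFFD are both accepted.
    sample = "abc\x1adef\u201a\ufffd"
    print(find_invalid_xml_chars(sample))  # [(3, 'U+001A')]
```

Reporting offsets instead of a plain true/false makes it easier to see which character a downstream tool is really complaining about.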

czardas · March 19, 2013

I don't follow the logic here at all. Why would so many ill-formed variants (U+D800 - U+DFFF) be needed? Why substitute them at all? What kind of logic is there to this? To me this seems like a waste of resources (perhaps because I don't understand it, or maybe I'm just not ready to understand it).

Edited March 19, 2013 by czardas

leuce · March 19, 2013

Okay, confession time: I investigated my non-matching x1A problem and it now appears that the program that I used to test the TMX file interpreted x201A as x001A. Well, that is my guess, based on the error message I get from that program (it complains about an unexpected file end at that position). When I remove all instances of x201A from the file, it stops complaining about it.

jchd · March 19, 2013

czardas,

This range is for surrogates, i.e. 16-bit values that are reserved for encoding codepoints from upper planes (> U+FFFF) when using UTF-16 encoding.

These values are "non-characters" by themselves, since they must be associated in pairs to encode a codepoint outside plane 0.

When a text file uses UTF-8 these values shouldn't occur, as UTF-8 provides its own means to encode codepoints. Hence a conforming conversion from UTF-16 to UTF-8 will never produce a codepoint in this range. Pathological programs can however let non-characters appear in their output stream.

The U+FFFD codepoint (the so-called "replacement character") is the default codepoint to indicate an invalid character during a conversion: you can see it in some ill-formed web pages as a white question mark on a black diamond. By itself it is not an error but merely a trace that an earlier error produced something that couldn't be represented.

In short, a valid Unicode text may contain occurrences of U+FFFD and these are not particularly toxic for subsequent processing.

When a conforming program reads Unicode text and discovers invalid sequences, there are two common ways to handle the situation: either halt with an error OR replace every invalid sequence by a replacement character. Thus there are two sources of U+FFFD: replacement already done at an earlier point and actual errors in the encoding of the text stream. Both read as U+FFFD to a conforming program (my points 1) and 2) above).

Note that there are more complex conditions which make a particular sequence invalid, for instance overlong sequences.
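
The two ways of handling invalid input, and the two sources of U+FFFD, can be seen in a small sketch. It uses Python's built-in UTF-8 decoder as a stand-in for "a conforming program"; the wrapper name `decode_utf8` and the sample byte strings are made up for illustration.

```python
def decode_utf8(raw, replace=True):
    """Decode bytes as UTF-8, either substituting U+FFFD or raising."""
    return raw.decode("utf-8", errors="replace" if replace else "strict")


if __name__ == "__main__":
    native = "ok \ufffd".encode("utf-8")  # U+FFFD already present in the text
    broken = b"ok \xff"                   # 0xFF can never occur in UTF-8
    overlong = b"\xc0\x80"                # overlong encoding of U+0000
    surrogate = b"\xed\xa0\x80"           # UTF-8-style encoding of U+D800

    # A native U+FFFD is valid input and decodes cleanly even in strict mode.
    print(repr(decode_utf8(native, replace=False)))

    for raw in (broken, overlong, surrogate):
        # Lenient mode: invalid bytes are replaced by U+FFFD.
        print(repr(decode_utf8(raw)))
        # Strict mode: the same input is rejected with an error.
        try:
            decode_utf8(raw, replace=False)
        except UnicodeDecodeError as err:
            print("rejected:", err.reason)
```

After lenient decoding, the native U+FFFD and the substituted ones are indistinguishable, which is why a check that must tell them apart has to look at the raw bytes.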

jchd · March 19, 2013

leuce,

U+201A is a low quote style character. Instead of suppressing it you should perhaps replace it by something similar and meaningful. That or replace the program that fails on this codepoint.

czardas · March 19, 2013

Thanks jchd. It's a good explanation of what is happening. I'm still not sure of the need for such a large range. I would have thought one character would be sufficient. I need to read more about it. I read somewhere that some of these surrogates can be used in programming (for whatever purpose the programmer decides). I'm not sure about it, but I'll read up on it. Much appreciated.

Update to post

I found the answer to my question about surrogates here: http://en.wikipedia.org/wiki/Plane_%28Unicode%29

Edited March 20, 2013 by czardas
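
On the question of why the surrogate range is so large: the arithmetic below is a sketch of how UTF-16 splits a codepoint above U+FFFF into a pair. The 1024 high surrogates times the 1024 low surrogates give exactly the 1,048,576 codepoints of the upper planes. The function names are hypothetical, and Python is used only to make the arithmetic checkable.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # low 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print("U+1F600 -> %04X %04X" % (high, low))  # U+1F600 -> D83D DE00
    print("U+%X" % from_surrogates(high, low))   # U+1F600
    # 1024 high values x 1024 low values cover every upper-plane code point.
    print(1024 * 1024 == 0x10FFFF - 0x10000 + 1)  # True
```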

leuce · March 19, 2013

U+201A is a low quote style character. Instead of suppressing it you should perhaps replace it by something similar and meaningful.

That is exactly what I'm going to try to do. The program also flounders on U+221A but it accepts the entity √. Do you know if one can replace all characters that end on 1A with similarly named entities?

What would the regex be for that, I wonder... something like this, perhaps?

$tmxfileread = StringRegExpReplace ($tmxfileread, "\x{([a-f0-9][a-f0-9])1a}", "&#x$11a;")

Edited March 19, 2013 by leuce

jchd · March 19, 2013

I can't succeed in making $n (or \n, standing for the n^th capture replacement) interpolate in the replacement pattern.

Testing an even simpler match pattern like \x{00[0-9]9} doesn't match anything in abc999def.

So I doubt that those expressions work in PCRE. About \x{hhh...}, the PCRE doc says:

If characters other than hexadecimal digits appear between \x{ and }, or if there is no terminating }, this form of escape is not recognized. Instead, the initial \x will be interpreted as a basic hexadecimal escape, with no following digits, giving a character whose value is zero.

You may want to select the smallest subset of these codepoints which are worth expanding as hex entities:

CodePoint CharacterName                                     GeneralCategory
--------- ------------------------------------------------- ---------------
001A      <control>                                         Cc
011A      LATIN CAPITAL LETTER E WITH CARON                 Lu
021A      LATIN CAPITAL LETTER T WITH COMMA BELOW           Lu
031A      COMBINING LEFT ANGLE ABOVE                        Mn
041A      CYRILLIC CAPITAL LETTER KA                        Lu
051A      CYRILLIC CAPITAL LETTER QA                        Lu
061A      ARABIC SMALL KASRA                                Mn
071A      SYRIAC LETTER HETH                                Lo
091A      DEVANAGARI LETTER CA                              Lo
0A1A      GURMUKHI LETTER CA                                Lo
0B1A      ORIYA LETTER CA                                   Lo
0C1A      TELUGU LETTER CA                                  Lo
0D1A      MALAYALAM LETTER CA                               Lo
0E1A      THAI CHARACTER BO BAIMAI                          Lo
0F1A      TIBETAN SIGN RDEL DKAR GCIG                       So
101A      MYANMAR LETTER YA                                 Lo
111A      HANGUL CHOSEONG RIEUL-HIEUH                       Lo
121A      ETHIOPIC SYLLABLE MI                              Lo
131A      ETHIOPIC SYLLABLE GGI                             Lo
141A      CANADIAN SYLLABICS WEST-CREE WAA                  Lo
151A      CANADIAN SYLLABICS WEST-CREE SHWI                 Lo
161A      CANADIAN SYLLABICS SAYISI JI                      Lo
191A      LIMBU LETTER SSA                                  Lo
1A1A      BUGINESE VOWEL SIGN O                             Mc
1B1A      BALINESE LETTER JA                                Lo
1C1A      LEPCHA LETTER YA                                  Lo
1D1A      LATIN LETTER SMALL CAPITAL TURNED R               Ll
1E1A      LATIN CAPITAL LETTER E WITH TILDE BELOW           Lu
1F1A      GREEK CAPITAL LETTER EPSILON WITH PSILI AND VARIA Lu
201A      SINGLE LOW-9 QUOTATION MARK                       Ps
211A      DOUBLE-STRUCK CAPITAL Q                           Lu
221A      SQUARE ROOT                                       Sm
231A      WATCH                                             So
241A      SYMBOL FOR SUBSTITUTE                             So
251A      BOX DRAWINGS UP HEAVY AND LEFT LIGHT              So
261A      BLACK LEFT POINTING INDEX                         So
271A      HEAVY GREEK CROSS                                 So
281A      BRAILLE PATTERN DOTS-245                          So
291A      RIGHTWARDS ARROW-TAIL                             Sm
2A1A      INTEGRAL WITH UNION                               Sm
2B1A      DOTTED SQUARE                                     So
2C1A      GLAGOLITIC CAPITAL LETTER PE                      Lu
2D1A      GEORGIAN SMALL LETTER CAN                         Ll
2E1A      HYPHEN WITH DIAERESIS                             Pd
2F1A      KANGXI RADICAL CLIFF                              So
301A      LEFT WHITE SQUARE BRACKET                         Ps
311A      BOPOMOFO LETTER A                                 Lo
321A      PARENTHESIZED HANGUL PHIEUPH A                    So
331A      SQUARE KURUZEIRO                                  So
A01A      YI SYLLABLE BIET                                  Lo
A11A      YI SYLLABLE TIT                                   Lo
A21A      YI SYLLABLE GGAT                                  Lo
A31A      YI SYLLABLE SOP                                   Lo
A41A      YI SYLLABLE JJI                                   Lo
A51A      VAI SYLLABLE CEE                                  Lo
A61A      VAI SYMBOL DANG                                   Lo
A71A      MODIFIER LETTER LOWER RIGHT CORNER ANGLE          Lm
A81A      SYLOTI NAGRI LETTER PHO                           Lo
A91A      KAYAH LI LETTER RA                                Lo
AA1A      CHAM LETTER PA                                    Lo
F91A      CJK COMPATIBILITY IDEOGRAPH-F91A                  Lo
FA1A      CJK COMPATIBILITY IDEOGRAPH-FA1A                  Lo
FC1A      ARABIC LIGATURE KHAH WITH HAH ISOLATED FORM       Lo
FD1A      ARABIC LIGATURE SHEEN WITH YEH FINAL FORM         Lo
FF1A      FULLWIDTH COLON                                   Po
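
A listing like the one above can be regenerated rather than typed by hand. The sketch below uses Python's standard `unicodedata` module; the function name is hypothetical. Its output depends on the Unicode version bundled with the interpreter, and it also includes the CJK ideographs and Hangul syllables ending in 1A, which the table above leaves out.

```python
import unicodedata


def code_points_ending_in_1a():
    """Yield (code point, name, general category) for each BMP U+xx1A."""
    for high_byte in range(0x100):
        cp = (high_byte << 8) | 0x1A
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category in ("Cn", "Cs", "Co"):
            continue  # skip unassigned, surrogate and private-use code points
        # Control characters have no formal name, so supply a placeholder.
        fallback = "<control>" if category == "Cc" else "<unnamed>"
        yield cp, unicodedata.name(ch, fallback), category


if __name__ == "__main__":
    for cp, name, category in code_points_ending_in_1a():
        print("%04X      %-49s %s" % (cp, name, category))
```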

leuce · March 20, 2013

I can't succeed in making $n (or \n, standing for n^th capture replacement) interpolate in replacement pattern. ... You may want to select the smallest subset of these codepoints which are worth expanding as hex entities...

Well, the script I'm writing is something that users will use when they know that there is something wrong with their file (the script is a file fixer), so I suppose they won't mind waiting a bit for it to complete.

I'm probably going to have to match all of these characters individually, and not just a presumed useful subset of them, because in the file that I tested this week there were many characters that were completely unexpected for the language combination (I suspect the source text was OCR'ed -- for example, I saw the word "iPad" in it in which the "i" looks like an "i" to the human eye but it is really something completely different).

Anyway, I tried this:

$o = "x"
$n = 0
$m = FileRead (FileOpen ("test.tmx", 32))

MsgBox (0, "", @extended, 0)

$arr = StringSplit ("\x{011A}|\x{021A}|\x{031A}|\x{041A}|\x{051A}|\x{061A}|\x{071A}|\x{091A}|\x{0A1A}|\x{0B1A}|\x{0C1A}|\x{0D1A}|\x{0E1A}|\x{0F1A}|\x{101A}|\x{111A}|\x{121A}|\x{131A}|\x{141A}|\x{151A}|\x{161A}|\x{1C1A}|\x{1D1A}|\x{1E1A}|\x{1F1A}|\x{201A}|\x{211A}|\x{221A}|\x{231A}|\x{241A}|\x{251A}|\x{261A}|\x{271A}|\x{281A}|\x{291A}|\x{2A1A}|\x{2B1A}|\x{2C1A}|\x{2D1A}|\x{2E1A}|\x{2F1A}|\x{301A}|\x{311A}|\x{321A}|\x{331A}|\x{A01A}|\x{A11A}|\x{A21A}|\x{A31A}|\x{A41A}|\x{A51A}|\x{A61A}|\x{A71A}|\x{A81A}|\x{A91A}|\x{AA1A}|\x{F91A}|\x{FA1A}|\x{FC1A}|\x{FD1A}|\x{FF1A}", "|", 1)

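For $i = 1 To $arr[0]
    If StringRegExp ($m, $arr[$i]) Then
        ; Take the hex digits between "{" and "}" to build the entity
        $a = StringSplit ($arr[$i], "{", 1)
        $b = StringSplit ($a[2], "}", 1)
        $c = "&#x" & $b[1] & ";"

        $m = StringRegExpReplace ($m, $arr[$i], $c)

        $o = $o & "|" & $c ; remember which entities were substituted
    Else
        $n = $n + 1 ; count the patterns that did not occur in the file
    EndIf
Next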

MsgBox (0, "", $n & " __ " & $o, 0)

FileWrite (FileOpen ("test2.tmx", 34), $m)

...and it works. On my computer it takes the script 9 seconds to read a 350 MB TMX file, and the rest of the script takes less than a minute, making 6 replacements of one character and 10 replacements of another character.

Thanks again for all your help, guys!

Samuel

czardas · March 20, 2013

Good thread, good questions - all very informative.

jchd · March 20, 2013

Another way:

Local $s = ChrW(0x1A) & 'abc' & ChrW(0x1A) & 'def' & ChrW(0x221A) & 'ghi' & ChrW(0x331A) & 'jkl' & ChrW(0x331A)
Local $t = Execute("'" & StringRegExpReplace($s, _
    "(?x)" & _
    "([" & _
        "\x{001A}\x{011A}\x{021A}\x{031A}\x{041A}\x{051A}\x{061A}\x{071A}\x{091A}\x{0A1A}\x{0B1A}\x{0C1A}\x{0D1A}\x{0E1A}\x{0F1A}" & _
        "\x{101A}\x{111A}\x{121A}\x{131A}\x{141A}\x{151A}\x{161A}\x{191A}\x{1A1A}\x{1B1A}\x{1C1A}\x{1D1A}\x{1E1A}\x{1F1A}" & _
        "\x{201A}\x{211A}\x{221A}\x{231A}\x{241A}\x{251A}\x{261A}\x{271A}\x{281A}\x{291A}\x{2A1A}\x{2B1A}\x{2C1A}\x{2D1A}\x{2E1A}\x{2F1A}" & _
        "\x{301A}\x{311A}\x{321A}\x{331A}" & _
        "\x{A01A}\x{A11A}\x{A21A}\x{A31A}\x{A41A}\x{A51A}\x{A61A}\x{A71A}\x{A81A}\x{A91A}\x{AA1A}" & _
        "\x{F91A}\x{FA1A}\x{FC1A}\x{FD1A}\x{FF1A}" & _
    "])", _
    "&#x' & Hex(AscW('$1'), 4) & ';") & "'")
ConsoleWrite($t & @LF)
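
The same idea, a single pass over a character class with the entity computed per match, can be sketched with a replacement callback. It is shown in Python only because that is easy to run; `escape_1a_characters` is a hypothetical name. As an assumption of this sketch, U+001A is left out of the class, since XML 1.0 does not allow that character even as a character reference.

```python
import re

# Every BMP code point U+xx1A except U+001A and the surrogate block
# U+D800..U+DFFF, collected into one character class.
_TARGETS = "".join(
    chr((high_byte << 8) | 0x1A)
    for high_byte in range(0x01, 0x100)
    if not 0xD8 <= high_byte <= 0xDF
)
_PATTERN = re.compile("[" + _TARGETS + "]")


def escape_1a_characters(text):
    """Replace each U+xx1A character with a hexadecimal character reference."""
    return _PATTERN.sub(lambda m: "&#x%04X;" % ord(m.group()), text)


if __name__ == "__main__":
    sample = "abc\u221aghi\u331ajkl\u201a"
    print(escape_1a_characters(sample))  # abc&#x221A;ghi&#x331A;jkl&#x201A;
```

Because the class is built from the code point arithmetic, there is no hand-typed list of escapes in which a brace can go missing.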
