Correct regex syntax for hex characters


I don't think it is a memory problem either, because if I reduce the TMX file to just the few TUs surrounding the invalid character (i.e. total file size less than 1000 characters), the script doesn't catch the invalid character either.


Codepoint U+FFFD is perfectly valid by itself (it means "an invalid character was replaced by this special codepoint"), provided that:

1) you don't object to the file containing this codepoint by itself

2) you don't object to invalid sequences in the input file being converted into this codepoint

3) the implementation of PCRE support doesn't do anything surprising

Anyway, I'm surprised you say that the pattern doesn't work to detect 0x1A:

Local $m, $s = "abc" & Chr(0x1A) & "def"

; The negated class matches any character outside the valid XML range
If StringRegExp($s, "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]") Then
    $m = "Failed"
Else
    $m = "Passed"
EndIf
ConsoleWrite($m & @LF)

@trancexx,

Can you elaborate further?

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


The problem is with code points U+D800 to U+DFFF. Your pattern (the \x{FFFD} part) will cause wrong results here. Even though the behavior is explainable (you actually explained it), it may be seen as unexpected.

You see, \x{FFFD} matches both U+FFFD and the whole range from U+D800 to U+DFFF. That's what your pattern explicitly tries not to do. So either use \x20-\x{FFFD}, or make the last hex value \x{FFFC}.

Edited by trancexx


That's true if the input file is UTF-16 encoded and contains codepoints > U+FFFF (those which use surrogates).

Since the OP said he reads UTF-8 text, there should be no surrogates in the input file.

Yet a question remains, hinted at by my points 1) and 2): should invalid UTF-8 combinations not already converted to U+FFFD inside the input stream be considered charset errors? If not, then merging the ranges as trancexx did is fine; otherwise we need to parse the UTF-8 ourselves, byte by byte, and check that condition. Anyway, a "native" U+FFFD in the input should NOT be excluded from the valid XML charset range, since it is explicitly allowed as a handy placeholder.

Also note that any other Unicode-conformant program reading a Unicode text file containing invalid UTF-* sequences will actually replace them with U+FFFD instance(s), so merging the ranges into \x20-\x{FFFD} is probably the simplest way to behave.
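
For illustration, here is a minimal sketch of the merged-range check that also reports which characters were rejected. The sample string and variable names are made up for this example, and it assumes the text is already in memory as an AutoIt string; it is not taken from the OP's script.

```autoit
; Hypothetical sample: one control character (invalid in XML) and one U+FFFD (allowed)
Local $sText = "abc" & Chr(0x1A) & "def" & ChrW(0xFFFD) & "ghi"

; Merged range as discussed above: tab, LF, CR, then everything from space up to U+FFFD
Local $sInvalid = "[^\x09\x0A\x0D\x20-\x{FFFD}]"

; Flag 3 returns an array of all matches; @error is non-zero when nothing matches
Local $aHits = StringRegExp($sText, $sInvalid, 3)
If @error Then
    ConsoleWrite("No invalid characters found" & @LF)
Else
    For $i = 0 To UBound($aHits) - 1
        ConsoleWrite("Invalid character U+" & Hex(AscW($aHits[$i]), 4) & @LF)
    Next
EndIf
```

Reporting the code point rather than just pass/fail makes it easier to tell a real control character from a misreport by some other tool.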


I don't follow the logic here at all. Why would so many ill-formed variants (U+D800 - U+DFFF) be needed? Why substitute them at all? What kind of logic is there to this? To me this seems like a waste of resources (perhaps because I don't understand it, or maybe I'm just not ready to understand it).

Edited by czardas

Okay, confession time: I investigated my non-matching 0x1A problem, and it now appears that the program I used to test the TMX file interpreted U+201A as U+001A. Well, that is my guess, based on the error message I get from that program (it complains about an unexpected end of file at that position). When I remove all instances of U+201A from the file, it stops complaining about it.


czardas,

This range is for surrogates, i.e. 16-bit values that are reserved for encoding codepoints from upper planes ( > U+FFFF) when using UTF-16 encoding.

These values are "non-characters" by themselves, since they must be associated in pairs to encode a codepoint outside plane 0.

When a text file uses UTF-8 these values shouldn't occur, as UTF-8 provides its own means to encode codepoints. Hence a conforming conversion from UTF-16 to UTF-8 will never produce a codepoint in this range. Pathological programs can however let non-characters appear in their output stream.

The U+FFFD codepoint (the so-called "replacement character") is the default codepoint to indicate an invalid character during a conversion: you can see it in some ill-formed web pages as a white question mark on a black diamond background. By itself it is not an error but merely a trace that an earlier error produced something that couldn't be represented.

In short, a valid Unicode text may contain occurrences of U+FFFD and these are not particularly toxic for subsequent processing.

When a conforming program reads Unicode text and discovers invalid sequences, there are two common ways to handle the situation: either halt with an error OR replace every invalid sequence by a replacement character. Thus there are two sources of U+FFFD: replacement already done at an earlier point and actual errors in the encoding of the text stream. Both read as U+FFFD to a conforming program (my points 1) and 2) above).

Note that there are more complex conditions which make a particular sequence invalid, for instance overlong sequences.
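
To make the surrogate mechanism concrete, here is a small worked example of how UTF-16 splits a codepoint above U+FFFF into a pair of values from the U+D800 to U+DFFF range. The codepoint U+1F600 is an arbitrary choice for this illustration.

```autoit
; Arbitrary example codepoint outside plane 0
Local $iCodePoint = 0x1F600

; Subtract 0x10000 to get a 20-bit offset
Local $iOffset = $iCodePoint - 0x10000

; High (lead) surrogate carries the top 10 bits, low (trail) surrogate the bottom 10 bits
Local $iHigh = 0xD800 + BitShift($iOffset, 10) ; positive shift count = shift right
Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF)

; Prints: U+1F600 -> D83D DE00
ConsoleWrite("U+" & Hex($iCodePoint, 5) & " -> " & Hex($iHigh, 4) & " " & Hex($iLow, 4) & @LF)
```

Each half carries 10 bits, which is why the reserved block needs 2 x 1024 values rather than a single codepoint.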


leuce,

U+201A is a low quote style character. Instead of suppressing it you should perhaps replace it by something similar and meaningful. That or replace the program that fails on this codepoint.


Thanks jchd. It's a good explanation of what is happening. I'm still not sure of the need for such a large range. I would have thought one character would be sufficient. I need to read more about it. I read somewhere that some of these surrogates can be used in programming (for whatever purpose the programmer decides). I'm not sure about it, but I'll read up on it. Much appreciated. :)

Update to post

I found the answer to my question about surrogates here: http://en.wikipedia.org/wiki/Plane_%28Unicode%29

Edited by czardas

jchd wrote: "U+201A is a low quote style character. Instead of suppressing it you should perhaps replace it by something similar and meaningful."

That is exactly what I'm going to try to do. The program also flounders on U+221A, but it accepts the entity &#x221A;. Do you know if one can replace all characters that end in 1A with similarly named entities?

What would the regex be for that, I wonder... something like this, perhaps?

$tmxfileread = StringRegExpReplace ($tmxfileread, "\x{([a-f0-9][a-f0-9])1a}", "&#x$11a;")
Edited by leuce

I can't succeed in making $n (or \n, standing for the nth capture replacement) interpolate in the replacement pattern.

Even a simpler match pattern like \x{00[0-9]9} doesn't match anything in abc999def.

So I doubt that such expressions work in PCRE. About \x{hhh...}, the PCRE doc says:

"If characters other than hexadecimal digits appear between \x{ and }, or if there is no terminating }, this form of escape is not recognized. Instead, the initial \x will be interpreted as a basic hexadecimal escape, with no following digits, giving a character whose value is zero."

You may want to select the smallest subset of these codepoints which are worth expanding as hex entities:

CodePoint CharacterName                                     GeneralCategory
--------- ------------------------------------------------- ---------------
001A      <control>                                         Cc
011A      LATIN CAPITAL LETTER E WITH CARON                 Lu
021A      LATIN CAPITAL LETTER T WITH COMMA BELOW           Lu
031A      COMBINING LEFT ANGLE ABOVE                        Mn
041A      CYRILLIC CAPITAL LETTER KA                        Lu
051A      CYRILLIC CAPITAL LETTER QA                        Lu
061A      ARABIC SMALL KASRA                                Mn
071A      SYRIAC LETTER HETH                                Lo
091A      DEVANAGARI LETTER CA                              Lo
0A1A      GURMUKHI LETTER CA                                Lo
0B1A      ORIYA LETTER CA                                   Lo
0C1A      TELUGU LETTER CA                                  Lo
0D1A      MALAYALAM LETTER CA                               Lo
0E1A      THAI CHARACTER BO BAIMAI                          Lo
0F1A      TIBETAN SIGN RDEL DKAR GCIG                       So
101A      MYANMAR LETTER YA                                 Lo
111A      HANGUL CHOSEONG RIEUL-HIEUH                       Lo
121A      ETHIOPIC SYLLABLE MI                              Lo
131A      ETHIOPIC SYLLABLE GGI                             Lo
141A      CANADIAN SYLLABICS WEST-CREE WAA                  Lo
151A      CANADIAN SYLLABICS WEST-CREE SHWI                 Lo
161A      CANADIAN SYLLABICS SAYISI JI                      Lo
191A      LIMBU LETTER SSA                                  Lo
1A1A      BUGINESE VOWEL SIGN O                             Mc
1B1A      BALINESE LETTER JA                                Lo
1C1A      LEPCHA LETTER YA                                  Lo
1D1A      LATIN LETTER SMALL CAPITAL TURNED R               Ll
1E1A      LATIN CAPITAL LETTER E WITH TILDE BELOW           Lu
1F1A      GREEK CAPITAL LETTER EPSILON WITH PSILI AND VARIA Lu
201A      SINGLE LOW-9 QUOTATION MARK                       Ps
211A      DOUBLE-STRUCK CAPITAL Q                           Lu
221A      SQUARE ROOT                                       Sm
231A      WATCH                                             So
241A      SYMBOL FOR SUBSTITUTE                             So
251A      BOX DRAWINGS UP HEAVY AND LEFT LIGHT              So
261A      BLACK LEFT POINTING INDEX                         So
271A      HEAVY GREEK CROSS                                 So
281A      BRAILLE PATTERN DOTS-245                          So
291A      RIGHTWARDS ARROW-TAIL                             Sm
2A1A      INTEGRAL WITH UNION                               Sm
2B1A      DOTTED SQUARE                                     So
2C1A      GLAGOLITIC CAPITAL LETTER PE                      Lu
2D1A      GEORGIAN SMALL LETTER CAN                         Ll
2E1A      HYPHEN WITH DIAERESIS                             Pd
2F1A      KANGXI RADICAL CLIFF                              So
301A      LEFT WHITE SQUARE BRACKET                         Ps
311A      BOPOMOFO LETTER A                                 Lo
321A      PARENTHESIZED HANGUL PHIEUPH A                    So
331A      SQUARE KURUZEIRO                                  So
A01A      YI SYLLABLE BIET                                  Lo
A11A      YI SYLLABLE TIT                                   Lo
A21A      YI SYLLABLE GGAT                                  Lo
A31A      YI SYLLABLE SOP                                   Lo
A41A      YI SYLLABLE JJI                                   Lo
A51A      VAI SYLLABLE CEE                                  Lo
A61A      VAI SYMBOL DANG                                   Lo
A71A      MODIFIER LETTER LOWER RIGHT CORNER ANGLE          Lm
A81A      SYLOTI NAGRI LETTER PHO                           Lo
A91A      KAYAH LI LETTER RA                                Lo
AA1A      CHAM LETTER PA                                    Lo
F91A      CJK COMPATIBILITY IDEOGRAPH-F91A                  Lo
FA1A      CJK COMPATIBILITY IDEOGRAPH-FA1A                  Lo
FC1A      ARABIC LIGATURE KHAH WITH HAH ISOLATED FORM       Lo
FD1A      ARABIC LIGATURE SHEEN WITH YEH FINAL FORM         Lo
FF1A      FULLWIDTH COLON                                   Po
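
Since a character class cannot be placed inside \x{...}, one alternative is to skip the regex and test each code point arithmetically. The following is only a sketch under stated assumptions: the sample string and variable names are hypothetical, U+001A itself is deliberately left alone, and surrogates are skipped so that valid UTF-16 pairs are not broken apart.

```autoit
; Hypothetical sample containing U+201A and U+221A
Local $sText = "a" & ChrW(0x201A) & "b" & ChrW(0x221A) & "c"
Local $sOut = "", $sChar, $iCode

For $i = 1 To StringLen($sText)
    $sChar = StringMid($sText, $i, 1)
    $iCode = AscW($sChar)
    ; Low byte is 0x1A, code point is above U+00FF, and it is not a surrogate
    If $iCode > 0xFF And BitAND($iCode, 0xFF) = 0x1A And ($iCode < 0xD800 Or $iCode > 0xDFFF) Then
        $sOut &= "&#x" & Hex($iCode, 4) & ";"
    Else
        $sOut &= $sChar
    EndIf
Next

; Prints: a&#x201A;b&#x221A;c
ConsoleWrite($sOut & @LF)
```

Walking a very large file one character at a time is likely to be much slower than a single regex pass, so this fits small inputs or spot checks better than bulk conversion.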


jchd wrote: "I can't succeed in making $n (or \n, standing for the nth capture replacement) interpolate in the replacement pattern. ... You may want to select the smallest subset of these codepoints which are worth expanding as hex entities..."

Well, the script I'm writing is something that users will use when they know that there is something wrong with their file (the script is a file fixer), so I suppose they won't mind waiting a bit for it to complete.

I'm probably going to have to match all of these characters individually, and not just a presumed useful subset of them, because in the file that I tested this week there were many characters that were completely unexpected for the language combination (I suspect the source text was OCR'ed -- for example, I saw the word "iPad" in it in which the "i" looks like an "i" to the human eye but it is really something completely different).

Anyway, I tried this:

$o = "x"   ; collects the entities that were actually substituted
$n = 0     ; counts the patterns that matched nothing
$m = FileRead (FileOpen ("test.tmx", 32))   ; 32 = UTF-16 LE mode

MsgBox (0, "", @extended, 0)   ; @extended = number of characters read

$arr = StringSplit ("\x{011A}|\x{021A}|\x{031A}|\x{041A}|\x{051A}|\x{061A}|\x{071A}|\x{091A}|\x{0A1A}|\x{0B1A}|\x{0C1A}|\x{0D1A}|\x{0E1A}|\x{0F1A}|\x{101A}|\x{111A}|\x{121A}|\x{131A}|\x{141A}|\x{151A}|\x{161A}|\x{1C1A}|\x{1D1A}|\x{1E1A}|\x{1F1A}|\x{201A}|\x{211A}|\x{221A}|\x{231A}|\x{241A}|\x{251A}|\x{261A}|\x{271A}|\x{281A}|\x{291A}|\x{2A1A}|\x{2B1A}|\x{2C1A}|\x{2D1A}|\x{2E1A}|\x{2F1A}|\x{301A}|\x{311A}|\x{321A}|\x{331A}|\x{A01A}|\x{A11A}|\x{A21A}|\x{A31A}|\x{A41A}|\x{A51A}|\x{A61A}|\x{A71A}|\x{A81A}|\x{A91A}|\x{AA1A}|\x{F91A}|\x{FA1A}|\x{FC1A}|\x{FD1A}|\x{FF1A}", "|", 1)

For $i = 1 to $arr[0]
    If StringRegExp ($m, $arr[$i]) Then
        ; Pull the hex digits out from between { and } to build the entity
        $a = StringSplit ($arr[$i], "{", 1)
        $b = StringSplit ($a[2], "}", 1)
        $c = "&#x" & $b[1] & ";"

        $m = StringRegExpReplace ($m, $arr[$i], $c)

        $o = $o & "|" & $c
    Else
        $n = $n + 1
    EndIf
Next

MsgBox (0, "", $n & " __ " & $o, 0)

FileWrite (FileOpen ("test2.tmx", 34), $m)   ; 34 = 2 (overwrite) + 32 (UTF-16 LE)

...and it works. On my computer it takes the script 9 seconds to read a 350 MB TMX file, and the rest of the script takes less than a minute, making 6 replacements of one character and 10 replacements of another character.

Thanks again for all your help, guys!

Samuel
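
A hand-typed list like the one above is easy to get wrong (a single missing brace changes what the pattern means), so the escapes could instead be generated in a loop. This is only a sketch with made-up variable names: it builds one character class covering every xx1A code point from U+011A to U+FF1A, skips the surrogate block, and, unlike the table earlier in the thread, also includes unassigned and private-use code points.

```autoit
Local $sClass = ""
For $iHi = 0x01 To 0xFF
    ; Skip the surrogate block U+D800..U+DFFF
    If $iHi >= 0xD8 And $iHi <= 0xDF Then ContinueLoop
    $sClass &= "\x{" & Hex($iHi * 0x100 + 0x1A, 4) & "}"
Next
Local $sPattern = "[" & $sClass & "]"

; Quick check against a hypothetical sample containing U+221A
Local $sSample = "x" & ChrW(0x221A) & "y"
ConsoleWrite("Match found: " & StringRegExp($sSample, $sPattern) & @LF)
```

A single class also lets the whole file be scanned in one pass instead of once per code point.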


Another way:

Local $s = ChrW(0x1A) & 'abc' & ChrW(0x1A) & 'def' & ChrW(0x221A) & 'ghi' & ChrW(0x331A) & 'jkl' & ChrW(0x331A)
; The regex turns each matched character into an AutoIt expression fragment,
; then Execute() evaluates the resulting string to compute the hex entities.
; Single quotes in the input are doubled first so they cannot break the expression.
Local $t = Execute("'" & StringRegExpReplace(StringReplace($s, "'", "''"), _
    "(?x)" & _
    "([" & _
        "\x{001A}\x{011A}\x{021A}\x{031A}\x{041A}\x{051A}\x{061A}\x{071A}\x{091A}\x{0A1A}\x{0B1A}\x{0C1A}\x{0D1A}\x{0E1A}\x{0F1A}" & _
        "\x{101A}\x{111A}\x{121A}\x{131A}\x{141A}\x{151A}\x{161A}\x{191A}\x{1A1A}\x{1B1A}\x{1C1A}\x{1D1A}\x{1E1A}\x{1F1A}" & _
        "\x{201A}\x{211A}\x{221A}\x{231A}\x{241A}\x{251A}\x{261A}\x{271A}\x{281A}\x{291A}\x{2A1A}\x{2B1A}\x{2C1A}\x{2D1A}\x{2E1A}\x{2F1A}" & _
        "\x{301A}\x{311A}\x{321A}\x{331A}" & _
        "\x{A01A}\x{A11A}\x{A21A}\x{A31A}\x{A41A}\x{A51A}\x{A61A}\x{A71A}\x{A81A}\x{A91A}\x{AA1A}" & _
        "\x{F91A}\x{FA1A}\x{FC1A}\x{FD1A}\x{FF1A}" & _
    "])", _
    "&#x' & Hex(AscW('$1'), 4) & ';") & "'")
ConsoleWrite($t & @LF)
