leuce

Correct regex syntax for hex characters

leuce

G'day everyone

I'm trying to check whether strings contain only certain characters, which I specify in hexadecimal format. However, I have no idea how to write the regular expression, and all my tinkering produces the wrong results.

What I'm ultimately trying to accomplish is to test if a string contains only valid XML 1.0 characters.

The valid characters that I'm trying to evaluate are:

\x0009

\x000A

\x000D

\x0020-\xD7FF

\xE000-\xFFFD

\x10000-\x10FFFF

I want to specify them all as a single variable ($sValid), which I will then include in the regular expression, as follows:

If StringRegExp($aArray[$i], "\A[" & $sValid & "]*\Z") Then
; Then $aArray[$i] contains only valid characters
EndIf

The file that I read the input from is UTF8 (but does that matter?).

How would I have to write the variable $sValid to let the regular expression work?

Thanks

Samuel

czardas

I take it the last three sets include all values within the range (x0020-xD7FF ==> from x0020 to xD7FF).

Edited by czardas

leuce

I take it the last three sets include all values within the range (x0020-xD7FF ==> from x0020 to xD7FF).

If I understand correctly, the last two sets are outside the range of the third last set. Or... what don't I understand? :-)

czardas

I'm asking if that expression indicates one match or multiple matches within the hexadecimal range. What does the minus symbol indicate? Edit: Perhaps a silly question.

Edited by czardas

Melba23

leuce,

Can you please post an example string so that we can see the format. :)

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

leuce

I'm asking if that expression indicates one match or multiple matches within the hexadecimal range. What does the minus symbol indicate? Edit: Perhaps a silly question.

The minus is my attempt at indicating a range. Ah, now I understand your original question: yes, the minus means "to". In the same way as one might have [a-zA-Z] in a regular expression, where the minus means "to".

I need to find out if the string I'm evaluating contains any character that is not any of those valid characters.

leuce

Can you please post an example string so that we can see the format. :)

Sure, but I don't know if it will help. It is a TMX file, which is XML. The original file is actually UTF16LE, and the XML is 1.0. But I resaved it as UTF8 because I thought that that was required for AutoIt's regex.

Here is one string that will be evaluated:

<tu>
<tuv xml:lang="NL-NL">
<seg>&lt;cf font="Arial" symboltypeface="Arial" ansitypeface="Arial" color="0xFFFFCC"&gt;&lt;cf size="32"&gt;Van bachelor naar master&lt;/cf&gt;&lt;cf size="28"&gt; &lt;br indentation="0" leftmargin="0" alignment="ppAlignLeft" spacewithin="1" spacebefore="0.5" spaceafter="0" mask="0"/&gt;</seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg>&lt;cf font="Arial" symboltypeface="Arial" ansitypeface="Arial" color="0xFFFFCC"&gt;&lt;cf size="32"&gt;From Bachelor's to Master's&lt;/cf&gt;&lt;cf size="28"&gt; &lt;br indentation="0" leftmargin="0" alignment="ppAlignLeft" spacewithin="1" spacebefore="0.5" spaceafter="0" mask="0"/&gt;</seg>
</tuv>
</tu>

The file I'm trying to process is an XML file with invalid characters in it. I want to save as much of the file as possible while removing the invalid characters. Perhaps there is a freeware program somewhere on the internet that can do it too.

Added: Just in case the example is confusing, let me say that the hexadecimal colour codes in that text are not the characters that I'm trying to match. I'm trying to match individual characters. The above text contains only valid characters, but some of the strings may contain invalid ones.

Edited by leuce

AZJIO

leuce,

[a-zA-Z] ???

[a-fA-F] Yes?

[a-fA-F0-9]+

(?i)[0-9A-F]+

x10000-x10FFFF ???

x[0-9A-Fa-f]+ Yes?

Melba23

leuce,

This appears to do what you want - but I am not totally confident as I have never used the multiple digit Hex pattern before. :wacko:

Note that I have reversed the logic - if you get a match then there is an unwanted character in the string; no match means that the string is good. That should allow you to use it as a RegExpReplace pattern to strip the unwanted characters if you decide to go that way: ;)

(?i)(\x{000[12345678bcef]}|\x{001\d}|\x{d[89abcdef]\d\d}|\x{fff[ef]})

(?i)            - Case insensitive
( | | | | | )        - Look for any of these alternatives, which are:

\x{000[12345678bcef]}   - Any 000# character other than 0009, 000A, 000D
\x{001\d}        - Any 001# character
\x{d[89abcdef]\d\d}    - Any D8## to DF## character
\x{fff[ef]}        - FFFE and FFFF

Give it a try and see how you get on. :)

M23


leuce

Note that I have reversed the logic - if you get a match then there is an unwanted character in the string; no match means that the string is good.

I was actually thinking the same thing -- then I don't have to match the entire string, only one character in the string.

The curly brackets are what I was after -- I did not know exactly how to write the hex characters in a regular expression. I had thought that AutoIt would treat a hexadecimal character as a single unit in regular expressions, so I'm surprised to learn that something like "\x{fff[ef]}" is possible.

Do you know if these regular expressions work on all files, or only on UTF8 files?

Thanks

Samuel

jchd

leuce,

FYI, whatever the text encoding of the file, once you load text into AutoIt strings the data is converted to a subset of UTF-16LE called UCS-2. It is (roughly) the restriction of Unicode to its plane 0, i.e. codepoints that fit into a single 16-bit unit. Hence the extra range x10000-x10FFFF is irrelevant to AutoIt (and, honestly, I seriously doubt you would encounter such codepoints in real-world data). Also, surrogates will end up being remapped to an invalid character in the conversion process occurring during file read.

The following pattern will match any invalid or excluded codepoint:

[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]
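
As a rough illustration of how that pattern could be used, here is a minimal sketch. The variable names and sample strings are made up for the example; only the pattern itself comes from the post above. It shows the negated search, the positive "whole string is valid" form from the original question, and stripping with StringRegExpReplace.

```autoit
; Sketch only: the sample strings and variable names are hypothetical.
Local $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"
; Same ranges without the negation, for the positive form asked about originally.
Local $sValid = "\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}"

Local $sGood = "<seg>Van bachelor naar master</seg>"
Local $sBad = "<seg>Van bachelor" & ChrW(0x1A) & " naar master</seg>"

; Negated form: a match means at least one invalid character is present.
ConsoleWrite("Good string has invalid char: " & StringRegExp($sGood, $sInvalid) & @CRLF)
ConsoleWrite("Bad string has invalid char:  " & StringRegExp($sBad, $sInvalid) & @CRLF)

; Positive form: a match means every character of the string is valid.
ConsoleWrite("Bad string is fully valid:    " & StringRegExp($sBad, "\A[" & $sValid & "]*\Z") & @CRLF)

; Stripping: replace every invalid character with nothing.
Local $sClean = StringRegExpReplace($sBad, $sInvalid, "")
ConsoleWrite("Cleaned length: " & StringLen($sClean) & " (was " & StringLen($sBad) & ")" & @CRLF)
```

The negated form only has to find one offending character, so it can stop early, whereas the positive form has to walk the whole string.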

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

leuce

x10000-x10FFFF ???

The reason for that notation is that I had thought that AutoIt would treat the hexadecimal character as a unit within regex... in other words that it would treat "x10000" as a single entity and "x10FFFF" as a single entity, and that something like [x10000-x10FFFF] would be the same as something like [a-z].

Melba23

leuce,

I'm surprised to learn that something like "\x{fff[ef]}" is possible

To be honest, so was I! :D

The curly brackets are described on the Help file page for StringRegExp. :)

You will need to ask someone like jchd about the encoding - well above my hobbyist level that. :(

M23

Edit: I see he has already responded and given you a much more compact pattern. ;)


jchd

That's correct, but you need to use the hex syntax that PCRE expects: either \xhh (one or two hex digits) or \x{hhhh} (hex digits inside curly brackets).
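
A small sketch of the difference between the two forms, using made-up test strings. It also hints at why a notation like \x0020 does not behave as a single unit: without braces only the first two hex digits belong to the escape, and the rest are ordinary literal characters.

```autoit
; Sketch only: the test strings are hypothetical examples.

; Two-digit form: \x41 is U+0041 ("A").
ConsoleWrite(StringRegExp("A", "\x41") & @CRLF)

; Braced form: \x{20AC} is U+20AC (euro sign), built here with ChrW.
ConsoleWrite(StringRegExp(ChrW(0x20AC), "\x{20AC}") & @CRLF)

; Without braces, "\x20AC" is read as \x20 (a space) followed by the letters "AC",
; so it matches the three-character text " AC" rather than the euro sign.
ConsoleWrite(StringRegExp(" AC", "\x20AC") & @CRLF)
ConsoleWrite(StringRegExp(ChrW(0x20AC), "\x20AC") & @CRLF)
```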


jchd

Sorry to add to myself.

Note that while "\x{fff[ef]}" is indeed possible, either by itself or within an alternation (like "abc|def|\x{fff[ef]}"), the syntax won't work inside a character class (inside square brackets). Also, alternation is much slower than a character class: alternation works on complete sub-expressions, while character classes work on individual characters.
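
To make the alternation versus character class point concrete, here is a minimal sketch with made-up sample text. Both patterns look for the same three control characters; the class does it in one step per character, while the alternation tries each branch in turn.

```autoit
; Sketch only: the sample text and variable names are hypothetical.
Local $sAlternation = "\x09|\x0A|\x0D"   ; three single-character branches
Local $sClass = "[\x09\x0A\x0D]"         ; one character class with the same members

Local $sText = "line one" & @CRLF & "line two" & @TAB & "end"

; Flag 3 returns an array with all matches found in the string.
Local $aAlt = StringRegExp($sText, $sAlternation, 3)
Local $aClass = StringRegExp($sText, $sClass, 3)

; Both forms should report the same number of matches.
ConsoleWrite("Alternation: " & UBound($aAlt) & ", class: " & UBound($aClass) & @CRLF)
```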

Edited by jchd

Melba23

jchd,

Nice pattern - I did not realise that you could do this:

\x20-\x{D7FF}

Learning point for today. :)

M23


leuce

Thanks, jchd, for the expression.

Perhaps one of you can tell me what is wrong with my script, because I know for a fact that there is at least one x1A character in it (possibly more, or other non-valid characters), but the script doesn't catch it. I know that there is an x1A character because I've seen it (it took a while to track it down in my 350 MB XML file, but it is there).

My script is this:

#cs

TMX Fixer (per-segment)

1. Read input file (TMX), then split by </tu>.
2. For each array item:
2.1 If it contains an invalid character, write it to an error file (TXT, one file per error).
2.2 If it does not contain an invalid character, write it to the output file (TMX).

Note: splitting by </tu> means that the head and the first TU are both in array item 1, but we assume that there are no invalid characters in the head or in the first TU.

#ce

MsgBox (0, "TMX Fixer (per-segment)", "The per-segment version of TMX Fixer examines every TU (aka translation unit, aka segment) individually and saves only segments that contain no invalid characters to the output TMX file, and saves removed segments with invalid characters to separate error files.", 0)

$pathtotmxfile = FileOpenDialog ("Select TMX file", @ScriptDir, "TMX (*.tmx)|All files (*.*)")

; Mode 32 = read, UTF-16 LE
$tmxfileopen = FileOpen ($pathtotmxfile, 32)
$tmxfileread = FileRead ($tmxfileopen)

; @extended holds the number of characters returned by FileRead
MsgBox (0, "TMX file read", @extended & " characters were read.", 0)
FileClose ($tmxfileopen)

; Flag 1 = split on the whole delimiter string, not on each of its characters
$tmxfilearray = StringSplit ($tmxfileread, "</tu>", 1)

MsgBox (0, "TMX file split", "TMX was split into approximately " & $tmxfilearray[0] - 1 & " translation units.", 0)

; Mode 34 = 2 (write, erase previous contents) + 32 (UTF-16 LE)
$outputtmxfileopen = FileOpen ($pathtotmxfile & "_output.tmx", 34)

Global $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"
; Global $sInvalid = "[\x00-\x08\x0B\x0C\x0E-\x1F\x{FFFE}\x{FFFF}]"

For $i = 1 to $tmxfilearray[0]

    If StringRegExp ($tmxfilearray[$i], $sInvalid) Then
        ; $0 is the whole match (the pattern has no capturing group, so $1 would be empty)
        ; $tmxfilearray[$i] = StringRegExpReplace ($tmxfilearray[$i], $sInvalid, "!!!$0!!!") ; when the rest works
        $roguefileopen = FileOpen ($pathtotmxfile & "_broken segment_" & $i, 34)
        FileWrite ($roguefileopen, $tmxfilearray[$i] & "</tu>")
        FileClose ($roguefileopen)
    Else
        FileWrite ($outputtmxfileopen, $tmxfilearray[$i] & "</tu>")
    EndIf

    ; Progress indicator every 10000 units
    If IsInt ($i/10000) Then
        ToolTip ("Currently at unit " & $i)
    EndIf

Next

FileClose ($outputtmxfileopen)
Edited by leuce

czardas

Are you sure you are reading all of the file? 350MB is big and I'm not sure if StringSplit can handle this. Although I would expect some out of memory message.

leuce

Are you sure you are reading all of the file? 350MB is big and I'm not sure if StringSplit can handle this. Although I would expect some out of memory message.

It reads it all (that's why the script tells the user how many characters are read, etc). The TMX file has about 650 000 </tu> tags in it, and the script reports that number to the user. The script reports about 185 million characters for both a UTF16LE file (350 MB) and a UTF8 file (180 MB), which sounds about right.

Running the whole script takes about a minute (I'm not sure how much the fact that I have 6 GB RAM on a quad core 64-bit computer has to do with it). Anyway, the first x1A character occurs at string number 115003, so the script should write an error file by then.

Could it be that the script is running too fast and therefore "misses" the match? That would be very odd...
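
One way to narrow this down without involving the regular expression at all would be to walk a suspect string character by character and report the first code point outside the valid ranges. The helper below is a hypothetical sketch (the function name and the sample string are not part of the script above); it uses only AscW and StringMid, so it shows what AutoIt actually holds in memory after the file read.

```autoit
; Hypothetical diagnostic helper, not part of the original script.
; Returns the 1-based position of the first character outside the
; XML 1.0 ranges that fit in 16 bits, or 0 if every character is valid.
Func _FirstInvalidPos($sText)
    Local $iCode
    For $i = 1 To StringLen($sText)
        $iCode = AscW(StringMid($sText, $i, 1))
        If $iCode = 0x09 Or $iCode = 0x0A Or $iCode = 0x0D Then ContinueLoop
        If $iCode >= 0x20 And $iCode <= 0xD7FF Then ContinueLoop
        If $iCode >= 0xE000 And $iCode <= 0xFFFD Then ContinueLoop
        Return $i
    Next
    Return 0
EndFunc

; Made-up sample containing a U+001A character.
Local $sSample = "abc" & ChrW(0x1A) & "def"
Local $iPos = _FirstInvalidPos($sSample)
If $iPos Then
    ConsoleWrite("Invalid U+" & Hex(AscW(StringMid($sSample, $iPos, 1)), 4) & " at position " & $iPos & @CRLF)
Else
    ConsoleWrite("No invalid characters found" & @CRLF)
EndIf
```

If this helper finds nothing in the unit where the x1A character is expected, the character was probably already changed or lost before the regular expression ran, for instance while the file was being read or split.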

Edited by leuce
