How to check string for XML legality


G'day everyone

I need to check a string to ensure that it does not contain any characters that are invalid in XML. I'm not talking about entity or tag errors (although I will want to check those as well, in a separate process), but about the fact that some characters may not appear in valid XML at all. Is there an existing function that will check a string for XML validity?

Thanks

Samuel


The easiest way would be to make a list of valid characters and use StringRegExp.

Here is an example where I only allow 'printing' characters (including spaces). Chr(15) is not valid and so it returns 0 for the second test.

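; Two test strings: the second contains Chr(15), a control character.
Local $asTests[2] = ["<test>Valid xml</test>", "<test2>Invalid xml" & Chr(15) & "</test2>"]

; POSIX class of printable characters; it is wrapped in [...] below.
Local $sValid = "[:print:]"

For $i = 0 To UBound($asTests) - 1
    ; StringRegExp returns 1 if the whole string consists of valid characters, else 0.
    MsgBox(0, $asTests[$i], StringRegExp($asTests[$i], "\A[" & $sValid & "]*\Z"))
Next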

The easiest way would be to make a list of valid characters and use StringRegExp.

Theoretically, the set of legal XML characters is very large (nearly the entire Unicode character set), while the set of illegal XML characters is small by comparison, and only a handful of those commonly occur in the types of files that I want to check. So I suspected that it would be faster to check for the illegal characters than to check for all the legal ones.

I found a list of the illegal characters in someone else's Perl/Python script:

http://www.proz.com/forum/cat_tools_technical_help/200111-tmx_fixer.html#1747147

s/[\x00-\x08]|\x0B|\x0C|[\x0E-\x1F]//g;

...so I'll see if I can figure out how to convert that to AutoIt regex syntax and use that.
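
For what it's worth, the Perl character ranges above should carry over to StringRegExp more or less unchanged, since both use Perl-style `\xhh` escapes. Here is a rough sketch, assuming the goal is only to catch the control characters below 0x20 other than tab, LF and CR. The function names `_HasNoIllegalXmlChars` and `_StripIllegalXmlChars` are made up for this example. It does not cover the other code points that XML 1.0 excludes (U+FFFE, U+FFFF and unpaired surrogates).

```autoit
; Hypothetical helper: True when $sText contains none of the control
; characters matched by the Perl expression (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F).
Func _HasNoIllegalXmlChars($sText)
    ; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
    Return Not StringRegExp($sText, "[\x00-\x08\x0B\x0C\x0E-\x1F]")
EndFunc

; Hypothetical helper: the counterpart of the Perl s/...//g, removing the characters.
Func _StripIllegalXmlChars($sText)
    Return StringRegExpReplace($sText, "[\x00-\x08\x0B\x0C\x0E-\x1F]", "")
EndFunc

Local $asTests[2] = ["<test>Valid xml</test>", "<test2>Invalid xml" & Chr(15) & "</test2>"]
For $i = 0 To UBound($asTests) - 1
    ConsoleWrite(_StripIllegalXmlChars($asTests[$i]) & " -> " & _HasNoIllegalXmlChars($asTests[$i]) & @CRLF)
Next
```

Chr(15) is 0x0F, which falls in the `\x0E-\x1F` range, so the second test string is the one that gets flagged. Unlike the `[:print:]` whitelist, this blacklist leaves tabs, line breaks and non-ASCII text alone.") <> True Then Exit 1
If _HasNoIllegalXmlChars("<test2>Invalid xml" & Chr(15) & "</test2>") <> False Then Exit 1
If _StripIllegalXmlChars("a" & Chr(15) & "b") <> "ab" Then Exit 1
If _StripIllegalXmlChars("a" & @TAB & "b") <> "a" & @TAB & "b" Then Exit 1
</test>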

Local $asTests[2] = ["<test>Valid xml</test>", "<test2>Invalid xml" & Chr(15) & "</test2>"]
$sValid = "[:print:]"
For $i = 0 To UBound($asTests) - 1
    MsgBox(0, $asTests[$i], StringRegExp($asTests[$i], "\A[" & $sValid & "]*\Z"))
Next
Thanks for the code snippet.

Samuel


What characters are valid in a given XML file depends on the encoding specified in the header. That will make it very complicated if you want to cover all encoding cases.

How about just loading the XML string with _XMLLoadXML() and checking the function for errors?
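
A rough sketch of that idea follows. Since the exact parameters of _XMLLoadXML() are not shown in this thread, the sketch calls the MSXML COM object directly instead of going through the UDF. It assumes MSXML is installed and registered under the "Microsoft.XMLDOM" ProgID, and the function name `_XmlParseError` is made up for this example.

```autoit
; Hypothetical helper: returns "" when $sXml parses as well-formed XML,
; otherwise the parser's description of the first error it found.
Func _XmlParseError($sXml)
    Local $oDoc = ObjCreate("Microsoft.XMLDOM")
    If Not IsObj($oDoc) Then Return SetError(1, 0, "MSXML is not available")
    $oDoc.async = False ; parse synchronously so the result is ready on return
    ; loadXML returns True when the string was parsed successfully.
    If $oDoc.loadXML($sXml) Then Return ""
    ; parseError.reason holds a human-readable message; trim surrounding whitespace.
    Return StringStripWS($oDoc.parseError.reason, 3)
EndFunc

Local $sError = _XmlParseError("<test2>Invalid xml" & Chr(15) & "</test2>")
If $sError = "" Then
    ConsoleWrite("Parsed OK" & @CRLF)
Else
    ConsoleWrite("Parse error: " & $sError & @CRLF)
EndIf
```

Keep in mind that this checks the whole string for well-formedness (single root element, matching tags, entities), not just for illegal characters, so it only fits if the string is meant to be a complete XML document.") <> "" Then Exit 1
If _XmlParseError("<test>unclosed") = "" Then Exit 1
</test>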


