lgwapnitsky

Unicode and StringRegExpReplace issues


I also tried the literal (\xe2\x84\xa2) which did not work either.

Just got it working...I entered something "slightly" wrong...I love and hate regexp at the same time.

edit: nevermind...the replacement isn't working just yet...this code is not successful right now:

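$teststring = ChrW(Dec(2122)) ; Dec() reads 2122 as hex, so this is ChrW(8482), i.e. U+2122
$teststring = StringRegExpReplace($teststring, "[\x99]", "TM") ; pattern targets \x99; the string comes back unchanged
ConsoleWrite($teststring & @CRLF)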
Edited by lgwapnitsky

I'm thinking that PCRE isn't picking up certain characters. There have been two updates to that engine since it was last updated for AutoIt, both involving bug fixes, so I'll try to find out whether any of those fixes address this issue. I think I know of a test I can run to see whether any other characters are not being picked up.

And consider that comment about "it should be \x99" to be a brain dead moment. That's what you had written. Caffeine straight, double up please. I might have to start taking my morning coffee by injection.


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


Interestingly enough this works

$s = Chr(153)
Local $i = StringRegExp($s, "™")

ConsoleWrite(($i <> 0) & @CRLF)

Which would seem to indicate a problem with the hex escapes. Now, is that problem in AutoIt or PCRE? I suspect PCRE.


George

I found the same thing and am still puzzling it over...

I think I'll go ahead and report it as a bug although I still say it's in PCRE and not AutoIt. I'll also recommend in the report that the engine be updated from the current 8.0.0 to 8.0.2 which is the latest.

George


Hi George,

Beware that \x99 is the Latin1 ANSI assignment for ™ (the Unicode code point \U2122).

Charmap.exe shows it nicely, once you check the "advanced view" checkbox (I don't know how it gets called in non frenchy Windows). You can switch the displayed charset between Unicode, Ansi (select your codepage), OEM, ...
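
To make the ANSI versus Unicode point concrete, here is a minimal sketch. It assumes the system ANSI code page is Windows-1252; on other code pages the numbers may differ. It only uses Chr, ChrW, AscW and Hex to show which code point the string really holds after Chr(0x99).

```autoit
; Sketch: compare the ANSI value 0x99 with the Unicode code point it becomes.
; Assumption: the system ANSI code page is Windows-1252.
Local $sAnsi = Chr(0x99)      ; ANSI value 153, stored as a Unicode string
Local $sWide = ChrW(0x2122)   ; U+2122 TRADE MARK SIGN

; Show the code point actually held in the string built from the ANSI value.
ConsoleWrite("Chr(0x99) is U+" & Hex(AscW($sAnsi), 4) & @CRLF)

; Case-sensitive comparison of the two one-character strings.
ConsoleWrite("Same as ChrW(0x2122): " & ($sAnsi == $sWide) & @CRLF)
```

If the first line reports something other than U+0099, a pattern written as \x99 is looking for a different character than the one in the string.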

Sorry I missed your replies earlier.

You're correct but I'm thinking that it should also pick up on the Hex value which is \x99 and that seems to be where it fails. We can't check \U2122 because the Unicode support doesn't work and I would still like to know why. It's been a pain since day 1.

Edit: In English it's called "Advanced View", not sure what it's called in the frenchy windows though.

Edited by GEOSoft

George


Here is a complete list of the characters that fail.

\x80 \x82 \x83 \x84 \x85 \x86 \x87 \x88 \x89 \x8A \x8B \x8C \x8E
\x91 \x92 \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9A \x9B \x9C \x9E \x9F

Now for the good news. I finally got my SRE mentor involved in this and it looks like he has a work around. Just one short function which I'm testing right now. as soon as it's tested I'll come back and post it.
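
For anyone wanting to reproduce the list, a rough sketch follows. It assumes a Windows-1252 ANSI code page and simply prints every value from 0x80 to 0x9F whose Unicode code point differs from the byte value. The five values absent from the list above (81, 8D, 8F, 90, 9D) are unassigned in Windows-1252, which would be consistent with the failures coming from code page conversion rather than from the regex engine, though that is an interpretation and not something confirmed in this thread.

```autoit
; Sketch: list ANSI values 0x80-0x9F whose Unicode code point differs
; from the byte value. Assumption: ANSI code page is Windows-1252.
Local $iCode
For $i = 0x80 To 0x9F
    $iCode = AscW(Chr($i))   ; code point stored for this ANSI value
    If $iCode <> $i Then
        ConsoleWrite("\x" & Hex($i, 2) & " -> U+" & Hex($iCode, 4) & @CRLF)
    EndIf
Next
```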


George

Can you PM me as well as reply if you get it? SPAM is killing me today no matter how many times I correct it...


@lgwapnitsky I'll PM you a copy of this as well.

Well, again thanks to SmOke_N, we have a solution that works for MOST of those on the list. It's still a bit buggy on the few that remain but I'll work away at that. I'm posting the test code along with the function.

Local $s_str = "€agZ˜™berƒ„…†‡‰Š‹ŒŽ‘’“”"
Local $s_pattern = "(\w" & _SRE_HexToChar("[\x98-\x99]+") & ")"
Local $a_sre = StringRegExp($s_str, $s_pattern, 1)
If IsArray($a_sre) Then ConsoleWrite("1 is good" & @CRLF)

$s_pattern = "(\w" & _SRE_HexToChar("\x80-FF") & ")"
$a_sre = StringRegExp($s_str, $s_pattern, 1)
If IsArray($a_sre) Then ConsoleWrite("2 is good" & @CRLF)

$s_pattern ="(\w" & _SRE_HexToChar("80-FF") & ")"
$a_sre = StringRegExp($s_str, $s_pattern, 1)
If IsArray($a_sre) Then ConsoleWrite("3 is good" & @CRLF)

$s_pattern ="(\w" & _SRE_HexToChar("81-\xFF") & ")"
$a_sre = StringRegExp($s_str, $s_pattern, 1)
If IsArray($a_sre) Then ConsoleWrite("4 is good" & @CRLF)

$s_pattern ="(\w" & _SRE_HexToChar("80-FF*") & ")"
$a_sre = StringRegExp($s_str, $s_pattern, 1)
If IsArray($a_sre) Then ConsoleWrite("5 is good" & @CRLF)

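Func _SRE_HexToChar($s_str)
    ; Expands a hex range such as "[\x98-\x99]+" or "80-FF", or a single
    ; value such as "99", into literal characters for use in a pattern.
    Local $s_ret_str = ""
    ; Optional "[", optional "\x", first hex value, "-", optional "\x",
    ; second hex value, optional "]", optional quantifier (*, + or ?).
    Local $s_pattern = "(?:\[)?(?:\\x)?([[:xdigit:]]+)-" & _
        "(?:\\x)?([[:xdigit:]]+)(?:\])?((?:\*|\+|\?))?"
    Local $a_range = StringRegExp($s_str, $s_pattern, 1)
    If Not @error Then
        ; Range form: build a non-capturing alternation of every character.
        $s_ret_str = "(?:"
        For $i = Dec($a_range[0]) To Dec($a_range[1])
            $s_ret_str &= Chr($i) & "|"
        Next
        $s_ret_str = StringTrimRight($s_ret_str, 1) ; drop the trailing "|"
        If Not $s_ret_str Then Return ""
        ; Re-attach the quantifier if one was captured.
        If UBound($a_range) = 3 Then Return $s_ret_str & ")" & $a_range[2]
        Return $s_ret_str & ")"
    EndIf

    ; Single-value form: return that one character.
    Local $a_hex = StringRegExp($s_str, "[[:xdigit:]]+", 1)
    If @error Then Return ""
    Return Chr(Dec($a_hex[0]))
EndFunc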

It also lets you follow a range with *, + or ?.

You can write the values as [\x##], \x## or ##.


George

That is simultaneously beautiful and ugly...I'll test it on my stuff and report back.

Thanks for all the hard work.

Are you talking about SmOke_N or the function?

Don't thank me, thank SmOke. The hardest thing I had to do was IM him and point out a simple error in the function, which he promptly fixed.

George


One more question...for now...

\x2122 doesn't seem to be matching ™ (which I'm also still trying to figure out why I can't type it on my keyboard). This is for a different portion of the utility.

It does match if you wrap anything greater than 2 hex digits in curly braces like this:

"\x{2122}"

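A short sketch of the braces form applied to the original replacement problem. The test string is made up for illustration; the only pattern feature used is the \x{hhhh} escape mentioned above.

```autoit
; Sketch: match and replace the trade mark sign by its Unicode code point.
; The test string is an invented example.
Local $sTest = "AutoIt" & ChrW(0x2122)

; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
If StringRegExp($sTest, "\x{2122}") Then ConsoleWrite("matched" & @CRLF)

; Replace every U+2122 with plain ASCII text.
ConsoleWrite(StringRegExpReplace($sTest, "\x{2122}", "(TM)") & @CRLF)
```

The same escape can be used inside a character class, for example "[\x{2122}\x{00AE}]".
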

The _SRE_HexToChar() func can be called as such for your hex query string:

_SRE_HexToChar("[\x##-\x##]")
_SRE_HexToChar("[\x##-##]")
_SRE_HexToChar("[##-\x##]")
_SRE_HexToChar("[##-##]")
_SRE_HexToChar("[##]")
_SRE_HexToChar("##")

Where ## represents the 2 character hex value.

Also anything in square brackets can be followed by a "*", "+", or "?".

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Found something odd. This does not work in finding my "TM" symbol:

$array = StringRegExp($teststring, "[[:alnum:]\-\s\x27\x9e\x9f" & _SRE_HexToChar("c9-ff") & "]+", 3)

but this does:

$array = StringRegExp($teststring, "[[:alnum:]\-\s\x27\x9e\x9f" & _SRE_HexToChar("c9-ff") & _SRE_HexToChar("99") & "]+", 3)

The former does not show my "TM"s, but the latter does.

That doesn't surprise me too much. \x is only supposed to work with a maximum of 2 hex characters. SmOke_N and I agreed this morning that the documentation is in error on that point as well: it says 2 digits, and that clearly is not correct. Hex is not always just digits. It should say xdigits, as in [a-fA-F0-9].

George


Here's a quick little converter for getting Unicode 4-character code values. Anything with 00 at the end can be put as a simple \x##, or its ANSI equivalent.

*edit: see fixed version in -> this post

Edited by Ascend4nt


Here's a quick little converter for getting Unicode 4-character code values. Anything with 00 at the end can be put as a simple \x##, or its ANSI equivalent.

$sInputStr=InputBox("Unicode-Hex converter","Enter Unicode character(s) in box to see Hexadecimal equivalents","","",360,160)
If $sInputStr<>"" Then MsgBox(0,"Hex equivalent","Original string:"&@CRLF&$sInputStr&@CRLF&"Hexadecimal equivalents:"&@CRLF& _
    StringRegExpReplace(StringTrimLeft(StringToBinary($sInputStr,2),2),"(.{4})","$1,"))
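
A variation on the converter above, offered only as a sketch and not as the fixed version the earlier post refers to. With flag 2 StringToBinary produces UTF-16 little endian, so each group of four hex digits has its two bytes swapped (hence the "00 at the end" remark). Asking for big endian with flag 3 should give the digits in code point order, ready to drop into \x{....}. The sample input is hard-coded for illustration.

```autoit
; Sketch: print each UTF-16 code unit of a string as U+hhhh.
; Flag 3 = UTF-16 big endian, so digits appear in code point order.
Local $sInput = "A" & ChrW(0x2122)   ; invented sample input

; String() turns the binary into "0x...."; trim the leading "0x".
Local $sHex = StringTrimLeft(String(StringToBinary($sInput, 3)), 2)

; Split into groups of four hex digits, one group per code unit.
ConsoleWrite(StringRegExpReplace($sHex, "(.{4})", "U+$1 ") & @CRLF)
```

Characters outside the Basic Multilingual Plane would show up as two surrogate code units here, not as a single code point.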

Now there is another nice toy. Thank you.

George

No idea what your test string is, or what you feel the issue is. Clearly hex 99 is less than hex c9, so I'm left to guess that there's another issue?

Edit:

Remember:

##-## is a range from least to greatest, your "TM" symbol ( hex 99 ), is not even in that first pattern.
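
One more thing that may matter for the patterns quoted above: in its range form _SRE_HexToChar returns a group like (?:a|b|c), so when that result is spliced inside [...] the parentheses, "?", ":" and "|" become literal members of the class. Below is a minimal sketch of an alternative, using a hypothetical helper named _AnsiRangeToEscapes (not part of the thread's code) that emits only \x{hhhh} escapes, which are safe inside a class. It assumes the values are ANSI bytes to be translated through the current code page.

```autoit
; Hypothetical helper, not from the thread: turn a range of ANSI values
; into \x{hhhh} escapes suitable for use inside a character class.
Func _AnsiRangeToEscapes($iFirst, $iLast)
    Local $sOut = ""
    For $i = $iFirst To $iLast
        ; Chr() converts through the ANSI code page; AscW() reads back
        ; the resulting Unicode code point.
        $sOut &= "\x{" & Hex(AscW(Chr($i)), 4) & "}"
    Next
    Return $sOut
EndFunc

; Usage: 0xC9-0xFF plus the single value 0x99, all inside one class.
Local $sClass = "[[:alnum:]\-\s\x27" & _AnsiRangeToEscapes(0xC9, 0xFF) & _
        _AnsiRangeToEscapes(0x99, 0x99) & "]+"
Local $aHits = StringRegExp("Widget" & ChrW(0x2122) & " caf" & Chr(0xE9), $sClass, 3)
If IsArray($aHits) Then ConsoleWrite(UBound($aHits) & " match(es)" & @CRLF)
```

As with the original pattern, 0x99 has to be added explicitly because it lies outside C9-FF.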

Edited by SmOke_N

You're right...I've been working at this too long. Regexp has fried my brain today.
