czardas

StringRegExp problem with Hex [Solved]

czardas

This may be nothing more than a help file issue. The help file states that \x represents ASCII codes. Let's test this assumption.

Local $sTestString = ""
For $i = 0 To 255
    $sTestString &= Chr($i)
Next
; $sTestString has a string Length of 255 characters

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[\x00-\xFF]", "")
; Now $sTestString has a string Length of 27 characters

; What went wrong?
For $i = 1 To StringLen($sTestString)
    ; The following 27 characters were not replaced
    ConsoleWrite(Asc(StringMid($sTestString, $i, 1)) & @LF)
Next

In conclusion, either regexp is broken, my machine is broken or the help file is wrong about what \x actually does.

Edited by czardas

czardas

Here's a workaround if anyone needs it.

Local $sTestString = ""
For $i = 0 To 255
    $sTestString &= Chr($i)
Next
; $sTestString has a string Length of 255 characters

Local $sSRE = ""
For $i = 128 To 255
    $sSRE &= Chr($i)
Next

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[" & $sSRE & "\x00-\x7F]", "")
; Now $sTestString has a string Length of 0 characters
MsgBox(0, "", StringLen($sTestString))
Edited by czardas

DXRW4E

Hi czardas, it will have to be:

$sTestString = StringRegExpReplace($sTestString, "[[:ascii:]\x80-\xff]+", "")

Ciao.

Edited by DXRW4E

czardas

What's the difference?

That's inconsistent with the function Chr(), which will sometimes return other characters. It's only a help file description issue. For \x, the help file says:

Match the ascii character whose code is given in hexadecimal.

\x80-\xFF is not consistent with the table of ASCII characters in my help file: ASCII code page win-1252.

Edited by czardas

DXRW4E

Look here for more http://www.autoitscript.com/autoit3/pcrepattern.html

Ciao.


DXRW4E

I don't think so. Would this example be wrong then, given that French is not Unicode? http://www.autoitscript.com/autoit3/pcrepattern.html

"... if character tables for a French locale are in use, [\xc8-\xcb] matches accented E characters in both cases ..."

Ciao.

Edited by DXRW4E

jchd

All of our strings are Unicode; the issue isn't there. Indeed, the help file should say "US-ASCII", which refers to the range \x00-\x7F.

Now remember that the current implementation of PCRE in AutoIt painfully converts strings (subject and patterns) to UTF-8 before submitting them to the engine. This is no problem with US-ASCII, since this range is common to Unicode and all code pages.

The issue arises with codepoints > 0x7F, as you can see:

For $i = 128 To 255
    ; Flag 4 asks StringToBinary() for the UTF-8 encoding of the character
    ConsoleWrite(Hex($i, 2) & ' = ' & StringToBinary(Chr($i), 4) & @LF)
Next

None of those characters is represented by a single byte in UTF-8.

In the pattern, \x00-\xFF is taken literally and compiled into the engine verbatim. EDIT: that's untrue

Are things clearer now?

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

czardas

Yes it clarifies what is happening, thanks. :) I think it's easy to take things you read at face value. ASCII code pages include an extended range. Perhaps the term ASCII is sometimes used too freely, and the small change you suggest would hint that there's something more going on. Looking at my ASCII code page of characters is going to be misleading.

jchd

That's why ASCII (originally 7-bit) ⊊ ANSI.

⊊ means "is a subset of but not equal to"

trancexx

jchd, where did you get that information about the UTF-8 conversion from? Or from whom?

edit:

Oh I see, it's in the help file. Never mind.

Edited by trancexx

trancexx

Shit, I just read every line of the help file regarding regexp and everything looks fine.

czardas, which help file are you talking about? Could you check which AutoIt version that help file was written for?

Edited by trancexx

jchd

I finally got a few minutes to dig further. In fact I was talking bullshit (but I was not the only one).

Run this simple test and you'll see what "fiat lux" means:

Local $sTestString = ""
For $i = 0 To 255
    $sTestString &= Chr($i)
Next
; $sTestString has a string Length of 255 characters

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[\x00-\xFF]", "")
; Now $sTestString has a string Length of 27 characters

; What went wrong?
ConsoleWrite("Using Chr($i)" & @LF)
For $i = 1 To StringLen($sTestString)
    ; The following 27 characters were not replaced
    ConsoleWrite(Asc(StringMid($sTestString, $i, 1)) &@LF)
Next

$sTestString = ""
For $i = 0 To 255
    $sTestString &= ChrW($i)    ; this is where the difference lies (pun intended)
Next
; $sTestString has a string Length of 255 characters

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[\x00-\xFF]", "")

ConsoleWrite("Using ChrW($i)" & @LF)
For $i = 1 To StringLen($sTestString)
    ConsoleWrite(Asc(StringMid($sTestString, $i, 1)) &@LF)
Next

trancexx

You didn't say much wrong. The issue is indeed conversion between encodings. But the confusion is created by the Chr() function.

I'm not sure what help file issues you both are referring to.
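
To make the Chr() point concrete, here is a minimal sketch (not taken from the help file) that prints the Unicode code point behind each character Chr() returns for the values 128 to 159. It assumes Chr() goes through the system ANSI code page, as discussed above, so the output depends on the machine. On a Windows-1252 system, 0x80 should come out as U+20AC, and the 27 values in this range that win-1252 assigns all land above U+00FF, which would be consistent with the 27 characters the class [\x00-\xFF] left behind.

```autoit
; Print the code point that Chr() actually yields for each value 128-159.
For $i = 128 To 159
    Local $sChar = Chr($i) ; converted through the system ANSI code page
    ; AscW() reports the Unicode code point of the resulting character
    ConsoleWrite(Hex($i, 2) & " -> U+" & Hex(AscW($sChar), 4) & @LF)
Next
```

With ChrW($i) in place of Chr($i), every line would show the same number on both sides, which is why the ChrW() test string is removed completely.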


czardas

Lol you're right, it's different. I was right - it needed changing, but it has been done already. I feel quite embarrassed. The current help file was in storage. :whistle:

I didn't intend to waste anyone's time. I got a lot of good help today.

@JCHD I stumbled upon the same thing testing with ChrW, but I got the string length wrong. I wrote 255 instead of 256 in the comments. :D

Edited by czardas

czardas

The implications of this are beginning to dawn on me. Any function that uses ANSI, such as _HexToString(), is liable to fail under certain circumstances. For example: hard-coded hex strings converted using an inappropriate code page. Ouch!
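
A small sketch of that risk, using BinaryToString() directly rather than _HexToString() (whose internals I have not checked). Flag 1 decodes bytes as ANSI, so the result depends on the system code page; flag 4 decodes them as UTF-8 and should give the same character everywhere. The byte values are just illustrative picks.

```autoit
; The same hard-coded byte decoded as ANSI: the result varies with the code page.
Local $dAnsi = Binary("0xE9")
Local $sAnsi = BinaryToString($dAnsi, 1) ; flag 1 = ANSI (system code page)
ConsoleWrite("ANSI  0xE9   -> U+" & Hex(AscW($sAnsi), 4) & @LF)

; The UTF-8 bytes for U+00E9, decoded explicitly as UTF-8.
Local $dUtf8 = Binary("0xC3A9")
Local $sUtf8 = BinaryToString($dUtf8, 4) ; flag 4 = UTF-8
ConsoleWrite("UTF-8 0xC3A9 -> U+" & Hex(AscW($sUtf8), 4) & @LF)
```

On win-1252 both lines should report U+00E9, but on another ANSI code page the first line can point at a different character altogether.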

jchd

Correct. That's precisely what motivated Unicode: getting out of the hell of ANSI and other incomplete code pages. Unicode isn't free of difficulties, but it brings too many advantages to be dismissed for drawbacks that only show up in dark corners.


czardas

Now I think I need to redesign one or two things. To begin with - my win-1252 keyboard could be made to work with any code page. I never thought to use Unicode to represent the ANSI, but it's an intriguing idea. At the moment it requires win-1252 to be the default code page. It's also possible to design a number of similar extended ASCII code page keyboards which will work on any Windows machine. In some ways it might seem a strange thing to do, but I quite like the idea. :)
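
A rough sketch of how that could look: a lookup that turns a win-1252 byte value into a character through ChrW(), so the system code page is never consulted. The function name _Cp1252ToChar is made up for illustration, and falling back to the raw value for the five bytes win-1252 leaves unassigned is my own assumption.

```autoit
; Hypothetical helper: map a Windows-1252 byte value (0-255) to a character
; without relying on the system ANSI code page.
Func _Cp1252ToChar($iByte)
    ; Unicode code points for bytes 0x80-0x9F; 0 marks an unassigned byte.
    Local $aMap[32] = [ _
            0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, _
            0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0, _
            0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, _
            0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178]
    If $iByte >= 0x80 And $iByte <= 0x9F Then
        Local $iCode = $aMap[$iByte - 0x80]
        ; Assumption: unassigned bytes fall back to the same numeric code point
        If $iCode = 0 Then $iCode = $iByte
        Return ChrW($iCode)
    EndIf
    ; 0x00-0x7F and 0xA0-0xFF have the same values in win-1252 and Unicode
    Return ChrW($iByte)
EndFunc

ConsoleWrite("0x80 -> U+" & Hex(AscW(_Cp1252ToChar(0x80)), 4) & @LF)
```

Other extended ASCII code pages would need their own tables, but the calling code could stay the same.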

Edited by czardas
