StringRegExp problem with Hex [Solved]


This may be nothing more than a help file issue. The help file states that \x represents ascii codes. Let's test this assumption.

Local $sTestString = ""
For $i = 0 To 255
    $sTestString &= Chr($i)
Next
; $sTestString has a string Length of 255 characters

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[\x00-\xFF]", "")
; Now $sTestString has a string Length of 27 characters

; What went wrong?
For $i = 1 To StringLen($sTestString)
    ; The following 27 characters were not replaced
    ConsoleWrite(Asc(StringMid($sTestString, $i, 1)) & @LF)
Next

In conclusion, either regexp is broken, my machine is broken or the help file is wrong about what \x actually does.

Edited by czardas

Here's a workaround if anyone needs it.

Local $sTestString = ""
For $i = 0 To 255
    $sTestString &= Chr($i)
Next
; $sTestString has a string Length of 255 characters

Local $sSRE = ""
For $i = 128 To 255
    $sSRE &= Chr($i)
Next

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[" & $sSRE & "\x00-\x7F]", "")
; Now $sTestString has a string Length of 0 characters
MsgBox(0, "", StringLen($sTestString))
Edited by czardas

Hi czardas, it will have to be:

$sTestString = StringRegExpReplace($sTestString, "[[:ascii:]\x80-\xff]+", "")

Ciao.

What's the difference?

That's inconsistent with the function Chr(), which will sometimes return other characters. It's only a help file description issue. For \x, the help file says:

Match the ascii character whose code is given in hexadecimal.

\x80-\xFF is not consistent with the table of ASCII characters in my help file (ASCII code page win-1252).
Edited by czardas

All of our strings are Unicode, the issue isn't there. Indeed, the help file should say "US-ASCII", which refers to the range \x00-\x7F.

Now remember that current implementation of PCRE in AutoIt painfully converts strings (subject and patterns) to UTF-8 before submitting them to the engine. This is no problem with US-ASCII since this range is common to Unicode and all codepages.

The issue arises with codepoints > 0x7F as you can see:

For $i = 128 To 255
    ConsoleWrite(Hex($i, 2) & ' = ' & StringToBinary(Chr($i), 4) & @LF)
Next

None of those characters is represented by a single byte, owing to the UTF-8 representation.

In the pattern, \x00-\xFF is taken literally and compiled into the engine verbatim. EDIT: that's untrue

Are things clearer now?

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Yes it clarifies what is happening, thanks. :) I think it's easy to take things you read at face value. ASCII code pages include an extended range. Perhaps the term ASCII is sometimes used too freely, and the small change you suggest would hint that there's something more going on. Looking at my ASCII code page of characters is going to be misleading.


That's why ASCII (originally 7-bit) ⊊ ANSI.

⊊ means "is a subset of but not equal to"


Where did you get that information from? Or from whom?

edit:

Oh I see, it's in the help file. Never mind.

Edited by trancexx

♡♡♡

.

eMyvnE


Shit, I just read every line of the help file regarding regexp and everything looks fine.

czardas, which help file are you talking about? Could you check which AutoIt version that help file was written for?

Edited by trancexx


I finally got a few minutes to dig further. In fact I was talking bullshit (but I was not the only one).

Run this simple test and you'll see what "fiat lux" means:

Local $sTestString = ""
For $i = 0 To 255
    $sTestString &= Chr($i)
Next
; $sTestString has a string Length of 255 characters

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[\x00-\xFF]", "")
; Now $sTestString has a string Length of 27 characters

; What went wrong?
ConsoleWrite("Using Chr($i)" & @LF)
For $i = 1 To StringLen($sTestString)
    ; The following 27 characters were not replaced
    ConsoleWrite(Asc(StringMid($sTestString, $i, 1)) &@LF)
Next

$sTestString = ""
For $i = 0 To 255
    $sTestString &= ChrW($i)    ; this is where the difference lies (pun intended)
Next
; $sTestString has a string Length of 255 characters

; Remove all characters
$sTestString = StringRegExpReplace($sTestString, "[\x00-\xFF]", "")

ConsoleWrite("Using ChrW($i)" & @LF)
For $i = 1 To StringLen($sTestString)
    ConsoleWrite(Asc(StringMid($sTestString, $i, 1)) &@LF)
Next
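
To see which characters survive the first replacement, one can print the Unicode code point that Chr() actually yields for each value from 128 to 255. This is only a minimal sketch, assuming the system's default ANSI code page is Windows-1252; the variable names $iCodePoint and $iAbove are illustrative and not from the posts above.

```autoit
; Sketch: list the values for which Chr($i) yields a code point above 0xFF.
; Assumption: the default ANSI code page is Windows-1252 (results differ on other code pages).
Local $iCodePoint, $iAbove = 0
For $i = 128 To 255
    ; AscW() returns the Unicode code point, whereas Asc() returns the ANSI code
    $iCodePoint = AscW(Chr($i))
    If $iCodePoint > 0xFF Then
        $iAbove += 1
        ConsoleWrite("Chr(0x" & Hex($i, 2) & ") -> U+" & Hex($iCodePoint, 4) & @LF)
    EndIf
Next
ConsoleWrite($iAbove & " character(s) fall outside the range \x00-\xFF" & @LF)
```

Any character listed here has a code point that the class [\x00-\xFF] does not cover, which would explain why it is left in the string even though Asc() still reports a value below 256 for it.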


You didn't say much wrong. The issue is indeed conversion between encodings. But the confusion is created by the Chr() function.

I'm not sure what help file issues you both are referring to.


Lol you're right, it's different. I was right - it needed changing, but it has been done already. I feel quite embarrassed. The current help file was in storage. :whistle:

I didn't intend to waste anyone's time. I got a lot of good help today.

@JCHD I stumbled upon the same thing testing with ChrW, but I got the string length wrong. I wrote 255 instead of 256 in the comments. :D

Edited by czardas

Correct. That's precisely what motivated Unicode: getting out of the hell of ANSI and other incomplete code pages. Unicode isn't free of difficulties, but its advantages far outweigh drawbacks that only show up in dark corners.


Now I think I need to redesign one or two things. To begin with - my win-1252 keyboard could be made to work with any code page. I never thought to use Unicode to represent the ANSI, but it's an intriguing idea. At the moment it requires win-1252 to be the default code page. It's also possible to design a number of similar extended ASCII code page keyboards which will work on any Windows machine. In some ways it might seem a strange thing to do, but I quite like the idea. :)
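
One way the idea of representing the ANSI characters through Unicode might look is a fixed lookup table, so the result no longer depends on the machine's default code page. The following is a rough sketch under that assumption, not the actual keyboard code; the function name _Cp1252Char and the array $aHigh are hypothetical. The table holds the Windows-1252 assignments for bytes 0x80 to 0x9F, with 0 marking the five bytes that code page leaves undefined.

```autoit
; Hypothetical helper: return the Unicode character for a Windows-1252 byte value (0..255)
; without consulting the system's default ANSI code page.
Func _Cp1252Char($iByte)
    ; Code points for bytes 0x80..0x9F; 0 marks a byte that Windows-1252 leaves undefined
    Local $aHigh[32] = [0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, _
            0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0, _
            0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, _
            0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178]
    If $iByte >= 0x80 And $iByte <= 0x9F Then
        If $aHigh[$iByte - 0x80] <> 0 Then Return ChrW($aHigh[$iByte - 0x80])
    EndIf
    ; Every other byte (and the undefined ones) keeps its own value as code point
    Return ChrW($iByte)
EndFunc

; Usage: print the code point chosen for byte 0x80 (the euro sign in Windows-1252)
ConsoleWrite("U+" & Hex(AscW(_Cp1252Char(0x80)), 4) & @LF)
```

Similar tables could be written for other extended ASCII code pages; only the 32-entry array would change for those that differ from Latin-1 solely in the 0x80 to 0x9F range.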

Edited by czardas
