
Regex and character help, please (AutoIt loses no break spaces)


leuce

Hello everyone

I'm trying to perform a regex find/replace on a piece of text, but I encounter two problems (possibly related to each other).  The first problem is that my regular expression may be incorrect.  The second problem is that during AutoIt's processing of the text, some characters are changed that should not be changed.

Since no-break spaces and zero width non joiners can't be shown in a forum post, I've attached a file containing the text that I copy to the clipboard, as well as what the result should look like after the regex replacement.

The text contains, among others, one or more series consisting of a no-break space, a number, a no-break space, and a zero width non joiner.  If you view the attached file in Word with "non-printing characters" enabled (i.e. so that you can see spaces and line breaks), you should see the zero width non joiner as a little box.

However, when I run this script, and paste the text that is added to the clipboard, it appears that the no-break spaces were converted to normal spaces by AutoIt, which may (or may not) explain why the regex replacement does not work.

$zerowidthnonjoiner = BinaryToString ("0x0C20", 2)
$nobreakspace = BinaryToString ("0xA000", 2)

$grabbedtext = ClipGet ()
Sleep ("1000")

$grabbedtext2 = StringRegExpReplace ($grabbedtext, '(' & $nobreakspace & ')([0-9]+?)(' & $nobreakspace & $zerowidthnonjoiner & ')', '{$2}')

MsgBox (0, "", $grabbedtext & @CRLF & @CRLF & $grabbedtext2, 0)
$toput = $grabbedtext & @CRLF & @CRLF & $grabbedtext2
ClipPut ($toput)

I originally tried:

$grabbedtext2 = StringRegExpReplace ($grabbedtext, $nobreakspace & '([0-9]+?)' & $nobreakspace & $zerowidthnonjoiner, '{$1}')
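One way to rule out the two character constants as the source of the problem is to print their code points. This is only a sketch: the variable names are illustrative, and it assumes that flag 2 of BinaryToString means UTF-16 little-endian, so the bytes 0C 20 and A0 00 should decode to U+200C and U+00A0 respectively.

```autoit
; Build the two characters the same way the script above does
Local $sZwnj = BinaryToString("0x0C20", 2) ; bytes 0C 20 read as UTF-16 LE
Local $sNbsp = BinaryToString("0xA000", 2) ; bytes A0 00 read as UTF-16 LE

; Print the code point of each character as four hex digits
ConsoleWrite("ZWNJ code point: U+" & Hex(AscW($sZwnj), 4) & @CRLF)
ConsoleWrite("NBSP code point: U+" & Hex(AscW($sNbsp), 4) & @CRLF)

; Compare (case-sensitively) against the more direct ChrW() spelling
ConsoleWrite("Matches ChrW(): " & ($sZwnj == ChrW(0x200C) And $sNbsp == ChrW(0xA0)) & @CRLF)
```

If both code points come out as expected, the constants are fine and the problem lies in either the pattern or the clipboard contents.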

 

I have tried splitting up this problem into two separate problems, but I could not.

Firstly, can you tell me if my regex syntax is correct?  And secondly, do you know where the problem occurs with the no-break spaces being converted to normal spaces, and how I can avoid that?

Thanks

Samuel

 

document with text.doc

Edited by leuce

The code below shows that there is no emasculation of Unicode strings:

#include <Array.au3> ; provides _ArrayDisplay

$zerowidthnonjoiner = ChrW(0x200C)
$nonbreakspace = ChrW(0xA0)

$grabbedtext = _
    $nonbreakspace & "111" & $nonbreakspace & $zerowidthnonjoiner & "blah1..." & _
    $nonbreakspace & "222" & $nonbreakspace & $zerowidthnonjoiner & "blah2..." & _
    $nonbreakspace & "333" & $nonbreakspace & $zerowidthnonjoiner & "blah3..." & _
    $nonbreakspace & "444" & $nonbreakspace & $zerowidthnonjoiner & "blah4..." & _
    $nonbreakspace & "555" & $nonbreakspace & $zerowidthnonjoiner & "blah5..."

$grabbedtext2 = StringRegExpReplace($grabbedtext, '(?<=' & $nonbreakspace & ')([0-9]+?)(?=' & $nonbreakspace & $zerowidthnonjoiner & ')', '{$1}')

Local $aChrW = StringToASCIIArray($grabbedtext)
_NameIt($aChrW)
_ArrayDisplay($aChrW, "Before")
Local $aChrW2 = StringToASCIIArray($grabbedtext2)
_NameIt($aChrW2)
_ArrayDisplay($aChrW2, "After")


Func _NameIt(ByRef $a)
    For $i = 0 To UBound($a) - 1
        Switch $a[$i]
            Case 0xA0
                $a[$i] = "NBS"
            Case 0x200B
                $a[$i] = "ZWS"
            Case 0x200C
                $a[$i] = "ZWNJ"
            Case 0x200D
                $a[$i] = "ZWJ"
            Case 0xFEFF
                $a[$i] = "ZWNBS"
            Case Else
                $a[$i] = ChrW($a[$i])
        EndSwitch
    Next
EndFunc

Your regex was indeed not suited to the job.

Then basic Clip* functions might convert (= emasculate) Unicode codepoints to Windows charset. You should be happier using _ClipBoard_{Get|Set}Data with $CF_UNICODETEXT explicitly.

EDIT: after checking, it appears that ClipGet & ClipPut don't change the offending codepoints, so your other issue is elsewhere.
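For reference, reading and writing the clipboard with an explicit Unicode format could look roughly like the sketch below. It assumes the standard Clipboard.au3 UDF is available and that the clipboard currently holds text; the variable names and the error message are illustrative only.

```autoit
#include <Clipboard.au3> ; _ClipBoard_GetData, _ClipBoard_SetData, $CF_UNICODETEXT

Local $sNbsp = ChrW(0xA0)
Local $sZwnj = ChrW(0x200C)

; Ask for the UTF-16 text format explicitly instead of relying on ClipGet()
Local $sText = _ClipBoard_GetData($CF_UNICODETEXT)
If @error Then
    ConsoleWrite("No Unicode text on the clipboard, @error = " & @error & @CRLF)
    Exit
EndIf

; Replace NBSP + digits + NBSP + ZWNJ with the digits wrapped in braces
Local $sResult = StringRegExpReplace($sText, $sNbsp & '(\d+)' & $sNbsp & $sZwnj, '{$1}')

; Put the result back, again as Unicode text
_ClipBoard_SetData($sResult, $CF_UNICODETEXT)
```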

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


2 hours ago, jchd said:

The code below shows that there is no emasculation of Unicode strings:...

Then basic Clip* functions might convert (= emasculate) Unicode codepoints to Windows charset.

Isn’t Unicode Unisex by nature?
Therefore is emasculation thru conversion therapy even possible?

:)

Code hard, but don’t hard code...


5 hours ago, jchd said:

Your regex was indeed not suited to the job.

Thanks very much for the code snippet.  However, my explanation was not up to scratch either (-: because your regex retains the no-break space and the zero width non joiner, and I want them removed as well.  Oh, well, I could just StringReplace to remove them 🙂 that's good enough for me.

For the record, I wanted this:
[some text] [no break space] [number] [no break space] [zero width non joiner] [some more text]

...to be replaced with this:
[some text] [left curly bracket] [number] [right curly bracket] [some more text]
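The StringReplace clean-up mentioned above could look roughly like this. It is a minimal sketch that assumes the look-around regex has already wrapped each number in braces while leaving the no-break spaces and the zero width non joiner in place; the variable names and the sample string are made up for illustration.

```autoit
Local $sNbsp = ChrW(0xA0)
Local $sZwnj = ChrW(0x200C)

; Hypothetical intermediate result: braces added, invisible characters still present
Local $sText = "some text" & $sNbsp & "{111}" & $sNbsp & $sZwnj & "some more text"

; Drop the no-break space directly before an opening brace
$sText = StringReplace($sText, $sNbsp & "{", "{")

; Drop the no-break space + ZWNJ pair directly after a closing brace
$sText = StringReplace($sText, "}" & $sNbsp & $sZwnj, "}")

ConsoleWrite($sText & @CRLF)
```

Anchoring the replacements on the braces keeps any other no-break spaces in the text intact, although it would also touch braces that were already in the source text next to those characters.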


Ah, then:

#include <Array.au3> ; provides _ArrayDisplay

$zerowidthnonjoiner = ChrW(0x200C)
$nonbreakspace = ChrW(0xA0)

$grabbedtext = _
    $nonbreakspace & "111" & $nonbreakspace & $zerowidthnonjoiner & "blah1..." & _
    $nonbreakspace & "222" & $nonbreakspace & $zerowidthnonjoiner & "blah2..." & _
    $nonbreakspace & "333" & $nonbreakspace & $zerowidthnonjoiner & "blah3..." & _
    $nonbreakspace & "444" & $nonbreakspace & $zerowidthnonjoiner & "blah4..." & _
    $nonbreakspace & "555" & $nonbreakspace & $zerowidthnonjoiner & "blah5..."

;~ $grabbedtext2 = StringRegExpReplace($grabbedtext, '(' & $nonbreakspace & ')(\d+)(' & $nonbreakspace & $zerowidthnonjoiner & ')', '{$2}')
; less verbose
$grabbedtext2 = StringRegExpReplace($grabbedtext, '(\xA0)(\d+)(\xA0\x{200C})', '{$2}')

Local $aChrW = StringToASCIIArray($grabbedtext)
_NameIt($aChrW)
_ArrayDisplay($aChrW, "Before")
Local $aChrW2 = StringToASCIIArray($grabbedtext2)
_NameIt($aChrW2)
_ArrayDisplay($aChrW2, "After")


Func _NameIt(ByRef $a)
    For $i = 0 To UBound($a) - 1
        Switch $a[$i]
            Case 0xA0
                $a[$i] = "NBS"
            Case 0x200B
                $a[$i] = "ZWS"
            Case 0x200C
                $a[$i] = "ZWNJ"
            Case 0x200D
                $a[$i] = "ZWJ"
            Case 0xFEFF
                $a[$i] = "ZWNBS"
            Case Else
                $a[$i] = ChrW($a[$i])
        EndSwitch
    Next
EndFunc

In fact, I misunderstood your requirements; your initial pattern was fine, AFAICT.

It remains that the content of your grabbed text may not contain what you expect. My snippet demonstrates that: 1) both NBSs and ZWNJ are correctly detected in an input string; 2) the regex correctly matches them as requested.
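To check what the grabbed text really contains, the code points of the clipboard text can be listed directly. The sketch below is an illustration rather than part of the snippet above: it assumes the text is read with ClipGet and simply reports every character outside printable ASCII, so a no-break space shows up as U+00A0 and a normal space does not show up at all.

```autoit
Local $sText = ClipGet()
If @error Then
    ConsoleWrite("Clipboard is empty or does not hold text, @error = " & @error & @CRLF)
    Exit
EndIf

; One array element per character, holding its code point
Local $aCodes = StringToASCIIArray($sText)

For $i = 0 To UBound($aCodes) - 1
    ; Report anything that is not a printable ASCII character
    If $aCodes[$i] < 0x20 Or $aCodes[$i] > 0x7E Then
        ConsoleWrite("index " & $i & ": U+" & Hex($aCodes[$i], 4) & @CRLF)
    EndIf
Next
```

If U+00A0 or U+200C never appears in the output, the characters were already lost before the text reached the script, and no pattern will match them.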


2 hours ago, jchd said:

It remains that the content of your grabbed text may not contain what you expect.

I'm beginning to suspect that you're right.

By the way, I'm using this script to process text that is copied from a form on a web site. I'm very fortunate in that the web site developer chose to put tags around these characters (on the HTML clipboard), so I'm going to rewrite my script to read the HTML clipboard instead and do the regex find/replace using the tags.  Hopefully it then won't matter whether there are NBSP and ZWNJ characters in between.
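Reading the HTML clipboard could be sketched as follows. Several things here are assumptions: the Clipboard.au3 UDF is used, the browser places the standard "HTML Format" on the clipboard (which is UTF-8 encoded), and the tag shown in the pattern is purely hypothetical, since the thread does not say which tags the site uses. The pattern also assumes the tag content holds no digits other than the number itself (for example, no numeric character entities).

```autoit
#include <Clipboard.au3> ; _ClipBoard_RegisterFormat, _ClipBoard_GetData

; Look up the ID of the registered "HTML Format" clipboard format
Local $iFormat = _ClipBoard_RegisterFormat("HTML Format")
If $iFormat = 0 Then
    ConsoleWrite("Could not register the HTML clipboard format" & @CRLF)
    Exit
EndIf

; Non-text formats come back as binary data
Local $dHtml = _ClipBoard_GetData($iFormat)
If @error Then
    ConsoleWrite("No HTML data on the clipboard, @error = " & @error & @CRLF)
    Exit
EndIf

; The HTML clipboard payload is UTF-8, so decode it with flag 4
Local $sHtml = BinaryToString($dHtml, 4)

; Hypothetical tag: replace the whole element with the number in braces
Local $sResult = StringRegExpReplace($sHtml, '<span class="placeholder">\D*(\d+)\D*</span>', '{$1}')
ConsoleWrite($sResult & @CRLF)
```

The decoded string still includes the clipboard header lines (Version, StartHTML and so on) and the remaining markup, so further clean-up would be needed before pasting the result as plain text.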
