Regex and character help, please (AutoIt loses no break spaces)

leuce · November 23, 2020

Hello everyone

I'm trying to perform a regex find/replace on a piece of text, but I encounter two problems (possibly related to each other). The first problem is that my regular expression may be incorrect. The second problem is that during AutoIt's processing of the text, some characters are changed that should not be changed.

Since no-break spaces and zero width non joiners can't be shown in a forum post, I've added an attachment with the text that I copy (to the clipboard), as well as what the result should look like after the regex replacement.

The text contains, among others, one or more series consisting of a no-break space, a number, a no-break space, and a zero width non joiner. If you view the attached file in Word with "non-printing characters" enabled (i.e. so that you can see spaces and line breaks), you should see the zero width non joiner as a little box.

However, when I run this script, and paste the text that is added to the clipboard, it appears that the no-break spaces were converted to normal spaces by AutoIt, which may (or may not) explain why the regex replacement does not work.

$zerowidthnonjoiner = BinaryToString ("0x0C20", 2)
$nobreakspace = BinaryToString ("0xA000", 2)

$grabbedtext = ClipGet ()
Sleep ("1000")

$grabbedtext2 = StringRegExpReplace ($grabbedtext, '(' & $nobreakspace & ')([0-9]+?)(' & $nobreakspace & $zerowidthnonjoiner & ')', '{$2}')

MsgBox (0, "", $grabbedtext & @CRLF & @CRLF & $grabbedtext2, 0)
$toput = $grabbedtext & @CRLF & @CRLF & $grabbedtext2
ClipPut ($toput)

I originally tried:

$grabbedtext2 = StringRegExpReplace ($grabbedtext, $nobreakspace & '([0-9]+?)' & $nobreakspace & $zerowidthnonjoiner, '{$1}')

I have tried splitting up this problem into two separate problems, but I could not.

Firstly, can you tell me if my regex syntax is correct? And secondly, do you know where the problem occurs with the no-break spaces being converted to normal spaces, and how I can avoid that?

Thanks

Samuel

document with text.doc

Edited November 23, 2020 by leuce

jchd · November 23, 2020

The code below shows that there is no emasculation of Unicode strings:

#include <Array.au3> ; needed for _ArrayDisplay

$zerowidthnonjoiner = ChrW(0x200C)
$nonbreakspace = ChrW(0xA0)

$grabbedtext = _
    $nonbreakspace & "111" & $nonbreakspace & $zerowidthnonjoiner & "blah1..." & _
    $nonbreakspace & "222" & $nonbreakspace & $zerowidthnonjoiner & "blah2..." & _
    $nonbreakspace & "333" & $nonbreakspace & $zerowidthnonjoiner & "blah3..." & _
    $nonbreakspace & "444" & $nonbreakspace & $zerowidthnonjoiner & "blah4..." & _
    $nonbreakspace & "555" & $nonbreakspace & $zerowidthnonjoiner & "blah5..."

$grabbedtext2 = StringRegExpReplace($grabbedtext, '(?<=' & $nonbreakspace & ')([0-9]+?)(?=' & $nonbreakspace & $zerowidthnonjoiner & ')', '{$1}')

Local $aChrW = StringToASCIIArray($grabbedtext)
_NameIt($aChrW)
_ArrayDisplay($aChrW, "Before")
Local $aChrW2 = StringToASCIIArray($grabbedtext2)
_NameIt($aChrW2)
_ArrayDisplay($aChrW2, "After")


Func _NameIt(ByRef $a)
    For $i = 0 To UBound($a) - 1
        Switch $a[$i]
            Case 0xA0
                $a[$i] = "NBS"
            Case 0x200B
                $a[$i] = "ZWS"
            Case 0x200C
                $a[$i] = "ZWNJ"
            Case 0x200D
                $a[$i] = "ZWJ"
            Case 0xFEFF
                $a[$i] = "ZWNBS"
            Case Else
                $a[$i] = ChrW($a[$i])
        EndSwitch
    Next
EndFunc

Your regex was indeed not suited to the job.

Then basic Clip* functions might convert (= emasculate) Unicode codepoints to Windows charset. You should be happier with _ClipBoard_{Get|Set}Data using $CF_UNICODETEXT explicitly.

EDIT: after checking, it appears that ClipGet & ClipPut don't change the offending codepoints, so your other issue is elsewhere.
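
For reference, a minimal sketch of the explicit-format route mentioned above. It assumes the standard Clipboard.au3 UDF that ships with AutoIt (which provides $CF_UNICODETEXT); the variable name is only illustrative. Per the EDIT, this should not be needed for these particular codepoints, but it removes any doubt about which clipboard format is being read.

```autoit
#include <Clipboard.au3>

; Ask for the UTF-16 clipboard format explicitly instead of relying on ClipGet.
Local $sGrabbed = _ClipBoard_GetData($CF_UNICODETEXT)
If @error Then
    MsgBox(0, "Clipboard", "No Unicode text available on the clipboard.")
    Exit
EndIf

; ... process $sGrabbed here ...

; Write the text back, again naming the Unicode format explicitly.
_ClipBoard_SetData($sGrabbed, $CF_UNICODETEXT)
```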

Edited November 23, 2020 by jchd

JockoDundee · November 23, 2020

2 hours ago, jchd said:

The code below shows that there is no emasculation of Unicode strings:...

Then basic Clip* functions might convert (= emasculate) Unicode codepoints to Windows charset.

Isn’t Unicode Unisex by nature?
Therefore is emasculation thru conversion therapy even possible?

leuce · November 23, 2020

5 hours ago, jchd said:

Your regex was indeed not suited to the job.

Thanks very much for the code snippet. However, my explanation was not up to scratch either (-: because your regex retains the no-break space and the zero width non joiner, and I want them removed as well. Oh, well, I could just StringReplace to remove them 🙂 that's good enough for me.

For the record, I wanted this:
[some text] [no break space] [number] [no break space] [zero width non joiner] [some more text]

...to be replaced with this:
[some text] [left curly bracket] [number] [right curly bracket] [some more text]

jchd · November 23, 2020

Ah, then:

#include <Array.au3> ; needed for _ArrayDisplay

$zerowidthnonjoiner = ChrW(0x200C)
$nonbreakspace = ChrW(0xA0)

$grabbedtext = _
    $nonbreakspace & "111" & $nonbreakspace & $zerowidthnonjoiner & "blah1..." & _
    $nonbreakspace & "222" & $nonbreakspace & $zerowidthnonjoiner & "blah2..." & _
    $nonbreakspace & "333" & $nonbreakspace & $zerowidthnonjoiner & "blah3..." & _
    $nonbreakspace & "444" & $nonbreakspace & $zerowidthnonjoiner & "blah4..." & _
    $nonbreakspace & "555" & $nonbreakspace & $zerowidthnonjoiner & "blah5..."

;~ $grabbedtext2 = StringRegExpReplace($grabbedtext, '(' & $nonbreakspace & ')(\d+)(' & $nonbreakspace & $zerowidthnonjoiner & ')', '{$2}')
; less verbose
$grabbedtext2 = StringRegExpReplace($grabbedtext, '(\xA0)(\d+)(\xA0\x{200C})', '{$2}')

Local $aChrW = StringToASCIIArray($grabbedtext)
_NameIt($aChrW)
_ArrayDisplay($aChrW, "Before")
Local $aChrW2 = StringToASCIIArray($grabbedtext2)
_NameIt($aChrW2)
_ArrayDisplay($aChrW2, "After")


Func _NameIt(ByRef $a)
    For $i = 0 To UBound($a) - 1
        Switch $a[$i]
            Case 0xA0
                $a[$i] = "NBS"
            Case 0x200B
                $a[$i] = "ZWS"
            Case 0x200C
                $a[$i] = "ZWNJ"
            Case 0x200D
                $a[$i] = "ZWJ"
            Case 0xFEFF
                $a[$i] = "ZWNBS"
            Case Else
                $a[$i] = ChrW($a[$i])
        EndSwitch
    Next
EndFunc

In fact, I misunderstood your requirements; your initial pattern was fine, AFAICT.

It remains that the content of your grabbed text may not contain what you expect. My snippet demonstrates that: 1) both NBSs and ZWNJ are correctly detected in an input string; 2) the regex correctly matches them as requested.
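
One way to check what the grabbed text really contains is to dump its code points right after ClipGet. The following is only a sketch: the variable names are illustrative, and the output goes to the console (the SciTE output pane when run from the editor).

```autoit
; Copy the text from the source application first, then run this.
Local $sGrabbed = ClipGet()
Local $iErr = @error
If $iErr Then
    ConsoleWrite("ClipGet failed, @error = " & $iErr & @CRLF)
    Exit
EndIf

; One array element per UTF-16 code unit of the grabbed text.
Local $aCodes = StringToASCIIArray($sGrabbed)
Local $sDump = ""
For $i = 0 To UBound($aCodes) - 1
    $sDump &= "U+" & Hex($aCodes[$i], 4) & " "
Next
ConsoleWrite($sDump & @CRLF)
```

If the no-break spaces survived the copy, they show up as U+00A0; ordinary spaces are U+0020 and the zero width non joiner is U+200C.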

leuce · November 23, 2020

2 hours ago, jchd said:

It remains that the content of your grabbed text may not contain what you expect.

I'm beginning to suspect that you're right.

By the way, I'm using this script to process text that is copied from a form on a web site. I'm very fortunate in that the web site developer chose to put tags around these characters (on the HTML clipboard), so I'm going to rewrite my script to read the HTML clipboard instead and do the regex find/replace using the tags. Hopefully it then won't matter whether there are NBSP and ZWNJ characters in between.
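
A rough sketch of reading the HTML clipboard, for anyone trying the same thing. It assumes the Clipboard.au3 UDF, that the browser publishes the usual "HTML Format" clipboard format, and that _ClipBoard_GetData returns binary data for non-text formats; the actual tags depend on the web site, so no tag pattern is shown.

```autoit
#include <Clipboard.au3>

; "HTML Format" is a registered (not predefined) clipboard format, so look up its ID.
Local $iHtmlFormat = _ClipBoard_RegisterFormat("HTML Format")
If $iHtmlFormat = 0 Then Exit

; Assumed to come back as binary because this is not one of the text formats.
Local $bData = _ClipBoard_GetData($iHtmlFormat)
If @error Then Exit

; The HTML payload is UTF-8; flag 4 makes BinaryToString decode it as UTF-8.
Local $sHtml = BinaryToString($bData, 4)
ConsoleWrite($sHtml & @CRLF)
```

The dumped text starts with a short header (Version, StartHTML and so on) before the markup itself, so any regex on the tags should target the part after that header.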
