RegEx

BinaryBrother · September 6, 2010

How would I go about removing ALL escape sequences, non-printable characters, and basically everything non alphanumeric... But not basic punctuation etc ?

I've already tried stuff like... StringRegExpReplace($Buffer, "([^a-z]|[^A-Z]|\?|!)", "") But I almost know for certain that's not the right way to go about it... And my somewhat working way is VERY tedious...

To be honest, I've tried almost a hundred different regular expressions, and ended up with about 10 lines of "StringRegExpReplace" to do what I need done... By finding each and every possible escape sequence slipping through the Buffer, and creating a RegExp for EACH one... It's very tedious since it seems like there are a hundred...

Summary of request:

A RegEx pattern to remove all non-alphanumeric characters, except for a few others: period (.), comma (,), left and right brackets ([ ]), exclamation mark (!), and question mark (?).

I need the punctuation because some of my current running RegExp relies on it.

What it's for:

Telnet server communications/automation. It's putting a bunch of garbage through (escape sequences, non-printable characters, compression negotiation, etc.)

Thanks for all the help guys.

Varian · September 6, 2010

Would this work?

StringRegExpReplace($Buffer, "[^\w\[\],?!.]", "")

^ first character in bracket = not

\w = Shorthand class for Word Characters (letters, digits, and the underscore)

\[ = literal Left bracket

\] = literal Right bracket

,?!. = literal characters

So if a character is NOT

a Word Character

a Left bracket

a Right bracket

, or ? or ! or .

then replace it with nothing (remove it)

You don't need (?m) before the brackets here: it only changes how the ^ and $ anchors match (line by line instead of whole string), and this pattern uses neither of them.

(If this doesn't work post some of the text if possible to help debug)
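As a quick check of what that character class keeps and drops, here is a minimal sketch. The sample buffer is made up for illustration (it is not from the original Telnet session), with Chr(27) standing in for the ESC character.

```autoit
; Hypothetical sample: an ESC cursor-positioning sequence in front of some text.
Local $sBuffer = Chr(27) & "[2;1HLogin: [admin], ready?" & @CRLF

; Remove every character that is not a word character, bracket, comma, ?, ! or period.
Local $sClean = StringRegExpReplace($sBuffer, "[^\w\[\],?!.]", "")

ConsoleWrite($sClean & @CRLF) ; prints: [21HLogin[admin],ready?
```

Two things stand out: the printable remainder of the escape sequence ([21H) survives, and spaces and line breaks are removed. So it is probably better to strip whole escape sequences first, and to add a space (and \r\n if needed) to the class.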

Edited September 6, 2010 by Varian

enaiman · September 7, 2010

@BinaryBrother

I can feel your pain, mate. It happens that I need something very similar at this very moment. Well, for me it is COM1. Everything is fine and I have no issues with escape sequences until I open an editor and send some commands. Until I close the editor I keep getting these damn sequences.

Here you can see a sample:

[24;1H[0KI sw_prep.xsf [Modified] 2/35 5%[2;1H
tftp get 1[2;11H[2;11H0.8.199.1[2;20H[2;20H6 vr "VR-M[2;30H[2;30Hgmt" snmp_[2;40H[2;40Haccess.po[2;49H[2;49Hl force-ov[2;59H[24;1H
[0KI sw_prep.xsf [Modified] 3/36 8%[2;59H[2;59Herwrite

What I tried so far was:

$txt = StringRegExpReplace($txt, "\e\[\d+;\d+H", "")
$txt = StringRegExpReplace($txt, "\e\[0K", "")

But still there are some slipping through.

After a bit of reading I found that there aren't actually so many escape codes, but even so, dealing successfully with them is not an easy task.

Here is the list of escape codes (CSI, the Control Sequence Introducer, marks the beginning of the sequence: ESC followed by [):

CSI n A  CUU – Cursor Up: Moves the cursor n (default 1) cells up. If the cursor is already at the edge of the screen, this has no effect.
CSI n B  CUD – Cursor Down: Same as CUU, but moves down.
CSI n C  CUF – Cursor Forward: Same as CUU, but moves forward.
CSI n D  CUB – Cursor Back: Same as CUU, but moves back.
CSI n E  CNL – Cursor Next Line: Moves the cursor to the beginning of the line n (default 1) lines down.
CSI n F  CPL – Cursor Previous Line: Moves the cursor to the beginning of the line n (default 1) lines up.
CSI n G  CHA – Cursor Horizontal Absolute: Moves the cursor to column n.
CSI n ; m H  CUP – Cursor Position: Moves the cursor to row n, column m. The values are 1-based, and default to 1 (top left corner) if omitted. CSI ;5H is a synonym for CSI 1;5H, and CSI 17;H is the same as CSI 17H and CSI 17;1H.
CSI n J  ED – Erase Data: Clears part of the screen. If n is zero (or missing), clears from the cursor to the end of the screen. If n is one, clears from the cursor to the beginning of the screen. If n is two, clears the entire screen (and moves the cursor to the upper left on MS-DOS ANSI.SYS).
CSI n K  EL – Erase in Line: Erases part of the line. If n is zero (or missing), clears from the cursor to the end of the line. If n is one, clears from the cursor to the beginning of the line. If n is two, clears the entire line. The cursor position does not change.
CSI n S  SU – Scroll Up: Scrolls the whole page up by n (default 1) lines. New lines are added at the bottom. (Not ANSI.SYS.)
CSI n T  SD – Scroll Down: Scrolls the whole page down by n (default 1) lines. New lines are added at the top. (Not ANSI.SYS.)
CSI n ; m f  HVP – Horizontal and Vertical Position: Moves the cursor to row n, column m. Both default to 1 if omitted. Same as CUP.
CSI n [;k] m  SGR – Select Graphic Rendition: Sets SGR parameters. After CSI there can be zero or more parameters separated with ;. With no parameters, CSI m is treated as CSI 0 m (reset / normal), which is typical of most of the ANSI escape sequences.
CSI 6 n  DSR – Device Status Report: Reports the cursor position to the application (as though typed at the keyboard) as ESC[n;mR, where n is the row and m is the column. (May not work on MS-DOS.)
CSI s  SCP – Save Cursor Position: Saves the cursor position.
CSI u  RCP – Restore Cursor Position: Restores the cursor position.
CSI ?25l  DECTCEM: Hides the cursor.
CSI ?25h  DECTCEM: Shows the cursor.
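Since every entry in the list has the same shape (ESC, then [, then optional digits, semicolons or a question mark, then one final letter), a single pattern could probably catch all of them instead of one pattern per code. This is only a sketch under that assumption: _StripCsi is a made-up helper name, and the sample string is rebuilt from the capture above with the ESC characters put back via Chr(27).

```autoit
; Hypothetical helper: removes sequences of the form ESC [ <digits ; ?> <letter>.
Func _StripCsi($sText)
    Return StringRegExpReplace($sText, "\e\[[0-9;?]*[A-Za-z]", "")
EndFunc

; Sample rebuilt from the capture above, with ESC restored as Chr(27).
Local $sSample = Chr(27) & "[24;1H" & Chr(27) & "[0KI sw_prep.xsf [Modified] 2/35 5%" & Chr(27) & "[2;1H"

ConsoleWrite(_StripCsi($sSample) & @CRLF) ; prints: I sw_prep.xsf [Modified] 2/35 5%
```

Because the pattern requires the ESC character in front, ordinary bracketed text such as [Modified] is left alone. Sequences outside the listed shape (for example ones that do not start with ESC [) would still slip through.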

Ascend4nt · September 7, 2010

This would replace all unprintable characters:

$sData=StringRegExpReplace($sData,'[^[:print:]]','')

For keeping alphanumerics there's the '[:alnum:]' class as well. What you really want is to put everything you want to keep into the [^..] part of the pattern, so that anything else gets matched and removed. In other words, no '|' is needed ('[^ast\?]' will look for anything that's not a, s, t, or '?'). You could also use ranges. For example, if you want to keep a range of ASCII characters, you could use something like '[^\x20-\x7e]', which matches everything outside the printable ASCII range.
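To illustrate the range approach, here is a small sketch with an invented sample string. One thing to watch for: carriage returns and line feeds are outside \x20-\x7E, so they get removed too unless they are added to the class.

```autoit
; Hypothetical sample: two lines with a BEL (7) and an ESC (0x1B) mixed in.
Local $sData = "Line one" & @CRLF & Chr(7) & "Line two" & Chr(0x1B)

; Keep printable ASCII only; this also removes the line break.
Local $sNoCtl = StringRegExpReplace($sData, "[^\x20-\x7E]", "")

; Keep printable ASCII plus CR and LF, so line structure survives.
Local $sKeepEol = StringRegExpReplace($sData, "[^\x20-\x7E\r\n]", "")

ConsoleWrite($sNoCtl & @CRLF)   ; prints: Line oneLine two
ConsoleWrite($sKeepEol & @CRLF) ; prints "Line one" and "Line two" on separate lines
```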

BinaryBrother · September 7, 2010

This is what I have so far... That appears to strip 90% of the junk.

(ÿ|ý|||û|\[ÿ|Fÿ|\]ÿ|'ÿýÈ|Vÿ|ú||ð|Uú||ù|\[\dz|<\w+>|</\w+>)

Some of the escape sequences didn't paste properly... But it's not just escape sequences that I'm trying to remove, it's all of the above junk as well... Which I'm now stripping per character (seems to be pretty fast as well).

I'll give the

$sData=StringRegExpReplace($sData,'[^[:print:]]','')

A look...
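The ÿ, ý, û, ú and ð characters in that pattern look like Telnet negotiation bytes (0xFF is IAC, 0xFB to 0xFE are WILL/WONT/DO/DONT, 0xFA and 0xF0 are SB and SE). If that is what they are, stripping whole negotiation commands before the printable-character filter might catch more than matching them one character at a time. A rough sketch, assuming the received bytes were converted one-to-one into characters (so byte 0xFF shows up as ÿ); _StripTelnetNegotiation is a made-up name, and a doubled IAC used as a literal data byte is not handled.

```autoit
; Hypothetical helper: removes Telnet IAC negotiation commands from a received string.
Func _StripTelnetNegotiation($sData)
    ; Subnegotiation blocks: IAC SB ... IAC SE (0xFF 0xFA ... 0xFF 0xF0).
    ; (?s) lets . match any character, including line breaks.
    $sData = StringRegExpReplace($sData, "(?s)\xFF\xFA.*?\xFF\xF0", "")
    ; Three-byte option commands: IAC, then WILL/WONT/DO/DONT, then one option byte.
    $sData = StringRegExpReplace($sData, "(?s)\xFF[\xFB-\xFE].", "")
    Return $sData
EndFunc

; Invented sample: IAC DO <option 24>, then normal text.
Local $sRaw = ChrW(0xFF) & ChrW(0xFD) & ChrW(24) & "login:"
ConsoleWrite(_StripTelnetNegotiation($sRaw) & @CRLF) ; prints: login:
```

The subnegotiation pattern runs first so that the bytes inside an SB ... SE block are not partly eaten by the three-byte pattern.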

Edited September 7, 2010 by BinaryBrother
