ASCII 127 character

saywell · September 13, 2010

Hi again!

The next problem in my programming has popped up.

It's probably dead easy but it's late and I've been puzzling for an hour - time to seek expert help!!

I'm capturing text from a Word document and writing it into an HTML document.

Some of the Word docs contain what look like ASCII 127 characters [presumably left over from Word formatting, despite using Range.Text to acquire it].

It should be easy to do a StringReplace and swap them for either a single white space or [better] a null string, but how do I specify the 127 characters as the search substring?

I've tried

$sImport_Content = StringRegExpReplace ($sImport_Content, "\x7F", " ")

which I'd hoped would do the trick, but to no avail!

Regards,

William

PsaltyDS · September 13, 2010

Are you sure that's not Unicode? What are you doing with the text that requires you to worry about "127 characters" at all?

saywell · September 13, 2010

Thanks.

No, I'm not sure at all!!

I get the text using

Local $sImport_Content = $oWordApp.Activedocument.Range.Text

and when I add it to an HTML document [after replacing vertical white spaces with <br>] I get the text with some characters that look like:

()()

()7

in a miniature box. Looking at an ASCII chart on the web I thought it was like the image next to the 127 character. But looking again, they are a bit different.

That is how they appear in the web browser and when I 'view source' in Firefox. But I think this may just be a symbol for something non-alphanumeric.

I don't want to do anything with them - just to get rid of them!!

If it's any clue, they appear where there is a table in the Word doc.

Any help appreciated!

[edit] I just viewed this in IE and the character shows as a black dot, not the symbol described, which shows in Firefox!

Edited September 13, 2010 by saywell
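One way to settle what the stray characters really are is to print their character codes instead of judging them by the glyph a browser draws. The following is a minimal sketch, not code from the thread: `_DumpOddChars` is a hypothetical helper name and the sample string is made up. Word table text is commonly reported to carry an end-of-cell marker of Chr(13) followed by Chr(7), which would fit a box glyph reading 0007, but that is an assumption about this particular document.

```autoit
; Hypothetical helper (not from the thread): report every character whose
; code falls outside the printable ASCII range 32..126.
Func _DumpOddChars($sText)
    Local $sReport = "", $iCode
    For $i = 1 To StringLen($sText)
        $iCode = AscW(StringMid($sText, $i, 1))
        If $iCode < 32 Or $iCode > 126 Then
            ; Position in the string and the code as four hex digits
            $sReport &= "pos " & $i & ": 0x" & Hex($iCode, 4) & @CRLF
        EndIf
    Next
    Return $sReport
EndFunc

; Made-up sample: "A", BEL (7), "B", DEL (127)
ConsoleWrite(_DumpOddChars("A" & Chr(7) & "B" & Chr(127)))
; Prints:
; pos 2: 0x0007
; pos 4: 0x007F
```

Running the same helper on the real `$sImport_Content` would show whether the characters are 0x7F, 0x07, or something above 0x7F, which in turn decides which character class can match them.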

Varian · September 13, 2010

Have you tried

$sImport_Content = StringRegExpReplace ($sImport_Content, "[^[:ascii:]]", "")

PsaltyDS · September 13, 2010

I think you will have to select your .Range to be only one cell of the table at a time to get the text without the proprietary formatting. Another option might be to use export tools in Word to export to HTML in the first place.

Edit: I like Varian's idea too.

Edited September 13, 2010 by PsaltyDS

saywell · September 14, 2010

Thanks, Guys.

I'm importing external docs into a rudimentary management/record system, so the docs may contain all sorts of formatting out of my control and knowledge. Thus parsing bits separately isn't really an option.

The odd random character isn't too big a deal, but it offends my sense of decency!!

I'd thought along the lines of

$sImport_Content = StringRegExpReplace ($sImport_Content, "[^[:ascii:]]", "")

but was put off by the helpfile entry:

[^:class:] Match any character not in the class, but only if the first character.

as these characters pop up all through the bit of text that came from the table, not just the first character of the string. Or am I reading it wrongly, and it really means only the first occurrence? If so I could loop until none are left.

I'll give it a try tonight. [i'm in UK and looking at your replies before going to work].

Thanks again.

William

[edit] It looks like the helpfile is correct! Neither :ascii: nor :cntrl: has any effect. I guess I'll have to live with them!

Edited September 14, 2010 by saywell

Ascend4nt · September 14, 2010

saywell, you'll notice if you look closer that Varian is using '[^[:ascii:]]', NOT '[[^:ascii:]]'. There's a difference, though it might not be clear at first. One is inside the class, one is inside the surrounding brackets.

Why not give '[^[:print:]]' a try, as Varian's method will keep non-printing characters as well.

~~But are you certain that the text is not UTF-8 encoded? (This would be clear at the top of HTML text). If it is UTF-8 encoded, you'll want to do a BinaryToString() conversion first.~~

*edit: oops, ignore the strike-through text, I thought I saw HTML being used as a source.

Edited September 14, 2010 by Ascend4nt
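The difference between the two negated classes can be seen on a small sample. This is a sketch under assumptions: the sample string is made up and only contains ASCII text plus a few control characters, since the thread does not establish what the Word text actually contains.

```autoit
; Made-up sample: printable text, BEL (0x07), DEL (0x7F) and a CR/LF pair.
Local $sSample = "ab" & Chr(7) & "cd" & Chr(127) & @CRLF & "ef"

; [^[:ascii:]] only matches codes above 0x7F, so BEL, DEL and CR/LF all survive.
Local $sAscii = StringRegExpReplace($sSample, "[^[:ascii:]]", "")

; [^[:print:]] matches anything that is not printable, which includes
; BEL and DEL but also the CR/LF line break.
Local $sPrint = StringRegExpReplace($sSample, "[^[:print:]]", "")

ConsoleWrite("original:     " & StringLen($sSample) & @CRLF) ; 10
ConsoleWrite("[^[:ascii:]]: " & StringLen($sAscii) & @CRLF)  ; 10, nothing removed
ConsoleWrite("[^[:print:]]: " & StringLen($sPrint) & @CRLF)  ; 6, "abcdef"
```

If the unwanted characters are control codes below 0x80, this would explain why the `[:ascii:]` version appears to do nothing.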

saywell · September 14, 2010

Thanks, Ascend4nt.

Some progress made!

I had used Varian's code correctly in the program, as I copied and pasted from his post.

If I use your suggestion of [^[:print:]] it certainly takes out the errant characters, but unfortunately removes all the vertical tab characters so I get no line breaks.

What I need is a regex to specify removal of the set of characters that are in :print but not in :ascii - but I don't think my regex-ing is up to that! In fact I'm struggling to understand the syntax of the brackets.

Or alternatively the set of :print excluding white spaces - \s.

Another, less elegant, way might be first to swap the vertical tabs for <br> (which has to be done anyway), then the whitespace set for a temporary string that's unlikely to occur 'naturally', then strip the rubbish with :print, then restore the whitespaces from the temporary string. A bit long-winded but I'll give it a try!

William

[edit] Update - I've implemented my inelegant kludge as above, and it seems to be doing the necessary. No more little space invaders!

Edited September 14, 2010 by saywell

Ascend4nt · September 14, 2010

I wasn't aware that [:print:] would get rid of carriage return/linefeeds, but you're right. That seemed odd at first, until I realized they aren't actually 'printable' characters - just commands basically.

If it's a standard ASCII character set you want to keep, you could try '[^\x20-\x7E]', which won't ~~replace carriage returns or~~ spaces.

*edit: oops, just realized you added \s to the pattern, effectively making it '[^[:print:]\s]'. I don't know why I didn't think of that. I was starting to lean towards \v, but \s covers those characters.

(also, oops, I did replace carriage returns lol. I need some caffeine..)

Edited September 14, 2010 by Ascend4nt

saywell · September 14, 2010

That's great - SORTED!!

I used '[^[:print:]\s]' as you suggested - much more elegant than my way.

Many thanks.

William
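For anyone landing here later, the accepted fix can be wrapped up roughly as follows. This is only a sketch: `_StripNonPrintable` is a hypothetical name and the sample text is invented. Only the pattern '[^[:print:]\s]' comes from the thread.

```autoit
; Hypothetical wrapper around the pattern settled on in the thread.
Func _StripNonPrintable($sText)
    ; Remove anything that is neither a printable character nor whitespace (\s),
    ; so line breaks stay in place for the later <br> conversion.
    Return StringRegExpReplace($sText, "[^[:print:]\s]", "")
EndFunc

; Invented sample: two lines with a BEL (7) and a DEL (127) mixed in.
Local $sDirty = "cell 1" & Chr(7) & @CRLF & "cell 2" & Chr(127)
Local $sClean = _StripNonPrintable($sDirty)

ConsoleWrite(StringReplace($sClean, @CRLF, "<br>") & @CRLF)
; Prints: cell 1<br>cell 2
```

Whether accented or other non-ASCII letters survive this pattern is not established in the thread, so it is worth testing against real documents before relying on it.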
