ASCII 127 character


saywell

Hi again!

The next problem in my programming has popped up.

It's probably dead easy but it's late and I've been puzzling for an hour - time to seek expert help!!

I'm capturing text from a Word document and writing it into an HTML document.

Some of the Word docs contain what look like ASCII 127 characters [presumably left over from Word formatting, despite using Range.Text to acquire it].

It should be easy to do a StringReplace and swap them for either a single white space or [better] a null string, but how do I specify the 127 characters as the search substring?

I've tried

$sImport_Content = StringRegExpReplace ($sImport_Content, "\x7F", " ")

which I'd hoped would do the trick, but to no avail!
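For the record, a literal DEL character can be written in AutoIt as Chr(127), so a plain StringReplace is enough when that really is the offending code. A minimal sketch with made-up sample text ($iCount is only there for illustration):

```autoit
; Made-up sample: two words with a DEL (code 127) between them.
Local $sImport_Content = "before" & Chr(127) & "after"

; Swap every Chr(127) for an empty string.
$sImport_Content = StringReplace($sImport_Content, Chr(127), "")

; StringReplace reports the number of replacements in @extended.
Local $iCount = @extended
ConsoleWrite($sImport_Content & " (" & $iCount & " replaced)" & @CRLF)
```

If the count comes back as zero on the real text, the characters are probably not code 127 at all, which would also explain why the "\x7F" pattern had no effect.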

Regards,

William


Are you sure that's not Unicode? What are you doing with the text that requires you to worry about "127 characters" at all?

;)

Valuater's AutoIt 1-2-3, Class... Is now in Session! For those who want somebody to write the script for them: RentACoder. "Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law

Thanks.

No, I'm not sure at all!!

I get the text using

Local $sImport_Content = $oWordApp.Activedocument.Range.Text

and when I add it to an HTML document [after replacing vertical whitespace with <br>] I get the text with some characters that look like:

()()

()7

in a miniature box. Looking at an ascii chart on the web I thought it was like the image next to the 127 character. But looking again, they are a bit different.

That is how they appear in the web browser and when I 'view source' in Firefox. But I think this may just be a symbol for something non-alphanumeric.

I don't want to do anything with them - just to get rid of them!!

if it's any clue, they appear where there is a table in the word doc.

Any help appreciated!

[edit] I just viewed this in IE and the character shows as a black dot, not the symbol described, which shows in Firefox!
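Rather than matching the glyph against a chart, one option is to dump the actual character codes. A minimal sketch; _ListOddChars is a hypothetical helper name, and the sample string is made up. The Chr(7) in it is an assumption based on Word generally being reported to mark the end of a table cell with Chr(13) & Chr(7), which would fit the boxes appearing only where the doc has a table.

```autoit
; Hypothetical helper: report the position and code point of every
; character outside the printable ASCII range (32 to 126).
Func _ListOddChars($sText)
    Local $sReport = ""
    For $i = 1 To StringLen($sText)
        Local $iCode = AscW(StringMid($sText, $i, 1))
        If $iCode < 32 Or $iCode > 126 Then
            $sReport &= "pos " & $i & ": U+" & Hex($iCode, 4) & @CRLF
        EndIf
    Next
    Return $sReport
EndFunc

; Made-up sample containing a BEL (7) and a DEL (127).
ConsoleWrite(_ListOddChars("A" & Chr(7) & "B" & Chr(127)))
```

Running it over the real $sImport_Content would show which codes are actually present, including ordinary ones such as carriage returns, so the right characters can be targeted.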

Edited by saywell

I think you will have to select your .Range to be only one cell of the table at a time to get the text without the proprietary formatting. Another option might be to use export tools in Word to export to HTML in the first place.

;)

Edit: I like Varian's idea too.

Edited by PsaltyDS

Thanks, Guys.

I'm importing external docs into a rudimentary management/record system, so the docs may contain all sorts of formatting out of my control and knowledge. Thus parsing bits separately isn't really an option.

The odd random character isn't too big a deal, but it offends my sense of decency!!

I'd thought along the lines of

$sImport_Content = StringRegExpReplace ($sImport_Content, "[^[:ascii:]]", "")

but was put off by the helpfile entry:

[^:class:] Match any character not in the class, but only if the first character.

as these characters pop up all through the bit of text that came from the table, not just the first character of the string. Or am I reading it wrongly, and it really means only the first occurrence? If so I could loop until none are left.

I'll give it a try tonight. [I'm in the UK and looking at your replies before going to work].

Thanks again.

William

[edit] It looks like the helpfile is correct! Neither :ascii: nor :cntrl: has any effect. I guess I'll have to live with them!

Edited by saywell

saywell, you'll notice if you look closer that Varian is using '[^[:ascii:]]', NOT '[[^:ascii:]]'. There's a difference, though it might not be clear at first. One is inside the class, one is inside the surrounding brackets.

Why not give '[^[:print:]]' a try, as Varian's method will keep non-printing characters as well.

But are you certain that the text is not UTF-8 encoded? (This would be clear at the top of HTML text). If it is UTF-8 encoded, you'll want to do a BinaryToString() conversion first.

*edit: oops, ignore the struck-through paragraph above about UTF-8; I thought I saw HTML being used as a source.
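To make the bracket business concrete: the outer [ ] is the character set, a ^ directly after the opening bracket negates the whole set, and [:ascii:] or [:print:] is a named class that is only valid inside that outer set. A rough sketch comparing the two patterns on made-up text; the sample characters are assumptions, and it relies on the default matching mode, where these classes cover ASCII only.

```autoit
; Made-up sample: BEL (7), a Unicode bullet (U+2022) and a carriage return
; mixed in with the letters A to D.
Local $sSample = "A" & Chr(7) & "B" & ChrW(0x2022) & "C" & @CR & "D"

; Remove everything that is NOT ASCII: intended to drop the bullet
; but leave control characters such as Chr(7) and @CR in place.
Local $sAsciiOnly = StringRegExpReplace($sSample, "[^[:ascii:]]", "")

; Remove everything that is NOT printable: intended to drop Chr(7),
; the bullet and also @CR, which is why the line breaks vanish.
Local $sPrintOnly = StringRegExpReplace($sSample, "[^[:print:]]", "")

ConsoleWrite("original:  " & StringLen($sSample) & " chars" & @CRLF)
ConsoleWrite("non-ascii removed: " & StringLen($sAsciiOnly) & " chars" & @CRLF)
ConsoleWrite("non-print removed: " & StringLen($sPrintOnly) & " chars" & @CRLF)
```

Comparing the lengths is just a quick way to see that the two patterns remove different things without having to print control characters to the console.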

Edited by Ascend4nt

Thanks, Ascend4nt.

Some progress made!

I had used Varian's code correctly in the program, as I copied and pasted from his post.

If I use your suggestion of [^[:print:]] it certainly takes out the errant characters, but unfortunately it also removes all the vertical tab characters, so I get no line breaks.

What I need is a regex to specify removal of the set of characters that are in :print: but not in :ascii: - but I don't think my regex-ing is up to that! In fact I'm struggling to understand the syntax of the brackets.

Or alternatively the set of :print excluding white spaces - \s.

Another, less elegant, way might be first to swap the vertical tabs for <br> (which has to be done anyway), then swap the whitespace set for a temporary string that's unlikely to occur 'naturally', then strip the rubbish with :print:, then restore the whitespace from the temporary string. A bit long-winded but I'll give it a try!

William

[edit] Update - I've implemented my inelegant kludge as above, and it seems to be doing the necessary. No more little space invaders!

Edited by saywell

I wasn't aware that [:print:] would get rid of carriage return/linefeeds, but you're right. That seemed odd at first, until I realized they aren't actually 'printable' characters - just commands basically.

If it's a standard ASCII character set you want to keep, you could try '[^\x20-\x7E]', which won't replace carriage returns or spaces.

*edit: oops, just realized you added \s to the pattern, effectively making it '[^[:print:]\s]'. I don't know why I didn't think of that. I was starting to lean towards \v, but \s covers those characters.

(also, oops, I did replace carriage returns lol. I need some caffeine..)

Edited by Ascend4nt

That's great - SORTED!!

I used '[^[:print:]\s]' as you suggested - much more elegant than my way.
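For anyone landing here later, a sketch of that pattern on made-up sample data. Chr(7) and Chr(127) stand in for the table debris, which is an assumption about what Word left behind, and the <br> swap mirrors the step mentioned earlier in the thread.

```autoit
; Made-up sample: two "cells" separated by a carriage return, with a
; BEL (7) and a DEL (127) standing in for the unwanted characters.
Local $sImport_Content = "Cell one" & Chr(7) & @CR & "Cell two" & Chr(127) & " end"

; Strip anything that is neither printable nor whitespace.
; Spaces, tabs and carriage returns are covered by \s and so survive.
$sImport_Content = StringRegExpReplace($sImport_Content, "[^[:print:]\s]", "")

; Turn the surviving carriage returns into HTML line breaks.
$sImport_Content = StringReplace($sImport_Content, @CR, "<br>")

ConsoleWrite($sImport_Content & @CRLF)
```

The point of adding \s inside the negated set is that whitespace is excluded from removal in one pass, so there is no need to park it in a temporary placeholder first.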

Many thanks.

William
