saywell Posted September 13, 2010

Hi again! The next problem in my programming has popped up. It's probably dead easy, but it's late and I've been puzzling for an hour - time to seek expert help!!

I'm capturing text from a Word document and writing it into an HTML document. Some of the Word docs contain what look like ASCII 127 characters [presumably left over from Word formatting, despite using range.text to acquire it]. It should be easy to do a StringReplace and swap them for either a single white space or [better] a null string, but how do I specify the 127 characters as the search substring?

I've tried

$sImport_Content = StringRegExpReplace ($sImport_Content, "\x7F", " ")

which I'd hoped would do the trick, but to no avail!

Regards, William
PsaltyDS Posted September 13, 2010

Are you sure that's not Unicode? What are you doing with the text that requires you to worry about "127 characters" at all?

Valuater's AutoIt 1-2-3, Class... Is now in Session!
For those who want somebody to write the script for them: RentACoder
"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law
saywell Posted September 13, 2010 (edited)

Thanks. No, I'm not sure at all!! I get the text using

Local $sImport_Content = $oWordApp.Activedocument.Range.Text

and when I add it to an HTML document [after replacing vertical white spaces with <br>] I get the text with some characters that look like ()() ()7 in a miniature box. Looking at an ASCII chart on the web, I thought it was like the image next to the 127 character. But looking again, they are a bit different. That is how they appear in the web browser and when I 'view source' in Firefox. But I think this may just be a symbol for something non-alphanumeric.

I don't want to do anything with them - just to get rid of them!! If it's any clue, they appear where there is a table in the Word doc. Any help appreciated!

[edit] I just viewed this in IE and the character shows as a black dot, not the symbol described, which shows in Firefox!

Edited September 13, 2010 by saywell
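One way to settle the "is it 127 or is it Unicode?" question is to print the code point of every character outside printable ASCII. The sketch below is not from the thread: _DumpOddChars is a hypothetical helper name, and the sample string containing Chr(7) is an assumption. Word is commonly reported to end each table cell with Chr(13) followed by Chr(7), and a small box reading 0007 in Firefox would fit that, but nobody in the thread confirms it.

```autoit
; Hypothetical helper (not from the thread): report the position and code
; point of every character that falls outside printable ASCII (32..126).
Func _DumpOddChars($sText)
    Local $sReport = "", $iCode
    For $i = 1 To StringLen($sText)
        $iCode = AscW(StringMid($sText, $i, 1))
        If $iCode < 32 Or $iCode > 126 Then
            $sReport &= "pos " & $i & ": U+" & Hex($iCode, 4) & @CRLF
        EndIf
    Next
    Return $sReport
EndFunc

; Sample data standing in for the Word text; Chr(7) is an assumption.
ConsoleWrite(_DumpOddChars("Cell A" & @CR & Chr(7) & "Cell B"))
```

Running the helper on the real $sImport_Content would show whether the stray characters are control codes (below U+0020), DEL (U+007F), or genuine non-ASCII text, which in turn decides which character class is worth trying.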
Varian Posted September 13, 2010

Have you tried

$sImport_Content = StringRegExpReplace ($sImport_Content, "[^[:ascii:]]", "")
PsaltyDS Posted September 13, 2010 (edited)

I think you will have to select your .Range to be only one cell of the table at a time to get the text without the proprietary formatting. Another option might be to use export tools in Word to export to HTML in the first place.

Edit: I like Varian's idea too.

Edited September 13, 2010 by PsaltyDS
saywell Posted September 14, 2010 (edited)

Thanks, Guys. I'm importing external docs into a rudimentary management/record system, so the docs may contain all sorts of formatting out of my control and knowledge. Thus parsing bits separately isn't really an option. The odd random character isn't too big a deal, but it offends my sense of decency!!

I'd thought along the lines of

$sImport_Content = StringRegExpReplace ($sImport_Content, "[^[:ascii:]]", "")

but was put off by the helpfile entry:

[^:class:] Match any character not in the class, but only if the first character.

as these characters pop up all through the bit of text that came from the table, not just the first character of the string. Or am I reading it wrongly, and it really means only the first occurrence? If so I could loop until none are left. I'll give it a try tonight. [I'm in the UK and looking at your replies before going to work.]

Thanks again. William

[edit] It looks like the helpfile is correct! Neither :ascii: nor :cntrl: has any effect. I guess I'll have to live with them!

Edited September 14, 2010 by saywell
Ascend4nt Posted September 14, 2010 (edited)

saywell, you'll notice if you look closer that Varian is using '[^[:ascii:]]', NOT '[[^:ascii:]]'. There's a difference, though it might not be clear at first. One is inside the class, one is inside the surrounding brackets.

Why not give '[^[:print:]]' a try, as Varian's method will keep non-printing characters as well.

But are you certain that the text is not UTF-8 encoded? (This would be clear at the top of HTML text). If it is UTF-8 encoded, you'll want to do a BinaryToString() conversion first. [struck through in the original post]

*edit: oops, ignore the strike-through text, I thought I saw HTML being used as a source.

Edited September 14, 2010 by Ascend4nt

My contributions: Performance Counters in Windows - Measure CPU, Disk, Network etc Performance | Network Interface Info, Statistics, and Traffic | CPU Multi-Processor Usage w/o Performance Counters | Disk and Device Read/Write Statistics | Atom Table Functions | Process, Thread, & DLL Functions UDFs | Process CPU Usage Trackers | PE File Overlay Extraction | A3X Script Extract | File + Process Imports/Exports Information | Windows Desktop Dimmer Shade | Spotlight + Focus GUI - Highlight and Dim for Eyestrain Relief | CrossHairs (FullScreen) | Rubber-Band Boxes using GUI's (_GUIBox) | GUI Fun! | IE Embedded Control Versioning (use IE9+ and HTML5 in a GUI) | Magnifier (Vista+) Functions UDF | _DLLStructDisplay (Debug!) 
| _EnumChildWindows (controls etc) | _FileFindEx | _ClipGetHTML | _ClipPutHTML + ClipPutHyperlink | _FileGetShortcutEx | _FilePropertiesDialog | I/O Port Functions | File(s) Drag & Drop | _RunWithReducedPrivileges | _ShellExecuteWithReducedPrivileges | _WinAPI_GetSystemInfo | dotNETGetVersions | Drive(s) Power Status | _WinGetDesktopHandle | _StringParseParameters | Screensaver, Sleep, Desktop Lock Disable | Full-Screen Crash Recovery Wrappers/Modifications of others' contributions: _DOSWildcardsToPCRegEx (original code: RobSaunder's) | WinGetAltTabWinList (original: Authenticity) UDF's added support/programming to: _ExplorerWinGetSelectedItems | MIDIEx UDF (original code: eynstyne) (All personal code/wrappers centrally located at Ascend4nt's AutoIT Code) Link to comment Share on other sites More sharing options...
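The bracket nesting in these patterns is easy to misread, so here is a small sketch of the two negated classes side by side. It is an illustration only, not code from the thread, and the sample string with Chr(7) is an assumption about what the stray character is.

```autoit
; Sample text only: Chr(7) stands in for the stray character (an assumption).
Local $sSample = "Cell A" & Chr(7) & "Cell B"

; The POSIX class name sits in its own [: :] pair, which in turn sits inside
; the outer [ ]. A ^ placed directly after the outer [ negates the whole set.
Local $sAsciiOnly = StringRegExpReplace($sSample, "[^[:ascii:]]", "") ; drops non-ASCII characters
Local $sPrintOnly = StringRegExpReplace($sSample, "[^[:print:]]", "") ; drops non-printable characters

; Compare the lengths to see which pattern actually removed something.
ConsoleWrite(StringLen($sSample) & " / " & StringLen($sAsciiOnly) & " / " & StringLen($sPrintOnly) & @CRLF)
```

If the stray characters really are control codes such as Chr(7), they are still ASCII, so the [:ascii:] variant should leave the sample unchanged while the [:print:] variant shortens it by one. That would be consistent with the report that [:ascii:] had no visible effect.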
saywell Posted September 14, 2010 (edited)

Thanks, Ascend4nt. Some progress made! I had used Varian's code correctly in the program, as I copied and pasted from his post.

If I use your suggestion of [^[:print:]] it certainly takes out the errant characters, but unfortunately it also removes all the vertical tab characters, so I get no line breaks. What I need is a regex to specify removal of the set of characters that are in :print but not in :ascii - but I don't think my regex-ing is up to that! In fact I'm struggling to understand the syntax of the brackets. Or alternatively the set of :print excluding white spaces - \s.

Another, less elegant, way might be first to swap the vertical tabs for <br> (which has to be done anyway), then the whitespace set for a temporary string that's unlikely to occur 'naturally', then strip the rubbish with :print, then restore the whitespaces from the temporary string. A bit long-winded, but I'll give it a try!

William

[edit] Update - I've implemented my inelegant kludge as above, and it seems to be doing the necessary. No more little space invaders!

Edited September 14, 2010 by saywell
Ascend4nt Posted September 14, 2010 (edited)

I wasn't aware that [:print:] would get rid of carriage return/linefeeds, but you're right. That seemed odd at first, until I realized they aren't actually 'printable' characters - just commands basically.

If it's a standard ASCII character set you want to keep, you could try '[^\x20-\x7E]', which won't replace carriage returns or spaces.

*edit: oops, just realized you added \s to the pattern, effectively making it '[^[:print:]\s]'. I don't know why I didn't think of that. I was starting to lean towards \v, but \s covers those characters.

(also, oops, I did replace carriage returns lol. I need some caffeine..)

Edited September 14, 2010 by Ascend4nt
saywell Posted September 14, 2010

That's great - SORTED!! I used '[^[:print:]\s]' as you suggested - much more elegant than my way.

Many thanks. William
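For anyone landing here later, the pattern the thread settled on can be wrapped up roughly as follows. This is a minimal sketch under assumptions, not saywell's actual code: _CleanWordText is a hypothetical name, the <br> conversion only handles CR and LF line endings, and the sample input is made up.

```autoit
; Hypothetical wrapper (not from the thread) around the agreed pattern.
Func _CleanWordText($sText)
    ; Remove every character that is neither printable nor whitespace.
    $sText = StringRegExpReplace($sText, "[^[:print:]\s]", "")
    ; Convert line endings to <br> for the HTML output (CR/LF only here).
    Return StringRegExpReplace($sText, "\r\n|\r|\n", "<br>")
EndFunc

; Made-up sample input; Chr(7) is an assumed stray character.
Local $sImport_Content = "Row 1" & @CR & Chr(7) & "Row 2" & @CRLF
ConsoleWrite(_CleanWordText($sImport_Content) & @CRLF)
```

One thing worth checking on real documents: as far as I know, [:print:] only covers printable ASCII by default, so accented letters or curly quotes coming from Word may be stripped along with the rubbish.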