Guy_ Posted August 10, 2014 (edited)

I often copy text from a website or PDF into a variable, and once in a while pasting it back into WordPad gives weird results. It used to happen more often with longer Facebook texts or YouTube comments. In one example from a PDF, the bullets were changed into a corner-like character.

I assume many of these could be control characters? What is the best way to filter them out, please?

From reading the manual, my only guess was something like the following, but it seems to do nothing (I'm not sure, though, and it isn't easy for me to test):

    $text = StringRegExpReplace($text, '[[:cntrl:]]', "")

Or is it something with [:print:]? (Meaning, "give me only the characters that would normally print"?)

I don't mind if your solution removes Returns too (though ideally not), because I usually remove those myself.

Thank you for any pointers!

Edited August 10, 2014 by Guy_

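For reference, a minimal sketch of what the [[:cntrl:]] attempt covers. It assumes the stray characters are ASCII control codes (0-31 and 127); if the pattern appears to do nothing, one possible reason is that the text contains no such codes and the odd characters are something else. The function name _StripAsciiControls is made up for this illustration. The explicit ranges skip Tab, LF and CR, so Returns survive.

```autoit
; Hypothetical helper (not from the thread): remove ASCII control characters
; but keep Tab (0x09), LF (0x0A) and CR (0x0D).
Func _StripAsciiControls($sText)
    Return StringRegExpReplace($sText, '[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', "")
EndFunc

; Sample with a BEL (7) and a DEL (127) mixed into ordinary text.
Local $sSample = "A" & Chr(7) & "B" & @TAB & "C" & @CRLF & "D" & Chr(127)
ConsoleWrite(_StripAsciiControls($sSample) & @CRLF)
```

Replacing the explicit ranges with '[[:cntrl:]]' would also remove the Tab and the CR/LF pair.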
computergroove Posted August 10, 2014

This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though.

Get Scite to add a popup when you use a 3rd party UDF -> http://www.autoitscript.com/autoit3/scite/docs/SciTE4AutoIt3/user-calltip-manager.html

Guy_ (Author) Posted August 10, 2014

computergroove wrote:
"This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though."

Not necessarily. You can probably do that sort of thing with StringRegExpReplace. For example, to replace everything that is NOT a-z, A-Z or 0-9 in your text with "":

    $text = StringRegExpReplace($text, '[^[:alnum:]]', "")

You can then add back any other characters you are still missing, but that may need a lot of escape characters and will look a mess. I would also be afraid of missing a few characters, so I am hoping the reverse approach exists too and gives neater (and/or faster) code.

jchd Posted August 10, 2014

There are several options open, but something is unclear:

"One example from a pdf is where bullets were changed into a corner like character"

That seems to mean that text in some ANSI codepage XYZ was blindly transferred to ANSI codepage ABC. Neither bullets nor framing symbols are control characters.

Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. Either that, or post the offending PDF if it's publicly available.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Guy_ (Author) Posted August 10, 2014

jchd wrote:
"Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. Either that, or post the offending PDF if it's publicly available."

I tried that in the first message, but the "corner" character wouldn't display. I was prepared for something like your explanation anyway, and it's the lesser of my worries. Weird stuff can happen or be manipulated with PDF files, it seems. I think I even have a PDF that displays normal readable text, but copying from it gives a garbled mess of characters, probably on purpose.

Since I believe I usually have horizontal spacing problems in my output, for now I've put in these lines and I'll see how that goes:

    $text = StringRegExpReplace($text, '\h', " ")
    $text = StringRegExpReplace($text, '[ ]{2,}', " ")

I'm hoping that turns any amount of horizontal spacing into one space, which is perfectly fine for me.

I had one example on YouTube from a while ago, but at the moment it no longer shows the problem I was getting. I'll dig this thread up again if I run across an example later.

And I'm still hoping other people have needed this, and that there is an elegant solution that gives me all displayable characters (plus space) without any control characters and the like.

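The two replacements above can be folded into one call, since \h+ matches a whole run of horizontal whitespace at once. A small sketch with a made-up sample string:

```autoit
; \h+ matches one or more horizontal whitespace characters (space, tab, ...),
; so a single call both normalises and collapses them.
Local $sText = "one" & @TAB & @TAB & "two   three"
$sText = StringRegExpReplace($sText, '\h+', " ")
ConsoleWrite($sText & @CRLF) ; one two three
```

Line breaks are left alone, because \h does not match CR or LF.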
jchd Posted August 10, 2014 (edited)

Read the doc of StringRegExp. There you'll see that by enabling Unicode category properties you get access to a whole new world of character classes. Discussing this in detail would have made our help file too complex for newcomers, but you'll find it explained in full in the official PCRE documentation (linked in my signature) under pcrepattern.

For instance, you can detect all Unicode symbols in a string with the class "(*UCP)[\pS]".

Edited August 10, 2014 by jchd

Guy_ (Author) Posted August 11, 2014 (edited)

Thanks for the pointers, jchd!

I do find some clues there, but I may need to study RegEx thoroughly before I can do anything with it, as something like this (although I need the reverse) doesn't seem to do anything:

    $text = StringRegExpReplace($text, '(*UCP)[\pS]', "")

Maybe I need to activate that PCRE somewhere first. I may look into it further later. At the moment, I also don't know whether ending up with Unicode only would filter out control codes.

In the meantime, I did some random YouTube tests, and one example is in the comments on http://www.youtube.com/all_comments?v=qTdOxn9MoPg

If you carefully select the line "Trust what you see after you catch bed bugs into a glass jar." and no more, and then paste it somewhere, you'll get an extra kind of space at the end. I don't even know if that's a control character, but you get it a lot on some websites if you accidentally select a little more than the exact word or line.

If I look at the HTML source, I don't really get a clue from it. It looks clean:

    [...] Trust what you see after you catch bed bugs into a glass jar.</div>

This stuff confuses my program, and I'd love to know what kind of code is causing it so that I can filter for it. Even though in this case it looks like some kind of space, even this code (just as a test) didn't filter it out:

    $text = StringRegExpReplace($text, '\h', "")

Edited August 11, 2014 by Guy_

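One way to find out what the extra character actually is would be to list the code point of every character in the pasted text. The following is only a sketch; _DumpCodePoints is a made-up helper name, and the output shows UTF-16 code units, so a character outside the Basic Multilingual Plane would appear as two surrogate values.

```autoit
; Hypothetical diagnostic helper: list the code point of every character
; so that invisible ones (for example U+00A0 or U+200B) become visible.
Func _DumpCodePoints($sText)
    Local $sOut = ""
    For $i = 1 To StringLen($sText)
        ; AscW returns the UTF-16 code unit; Hex(..., 4) pads it to 4 hex digits.
        $sOut &= "U+" & Hex(AscW(StringMid($sText, $i, 1)), 4) & " "
    Next
    Return StringStripWS($sOut, 2) ; 2 = strip trailing whitespace
EndFunc

; Copy the suspicious selection in the browser first, then run:
ConsoleWrite(_DumpCodePoints(ClipGet()) & @CRLF)
```

Once the code point is known, it can be looked up and added to (or excluded from) the filter pattern explicitly.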
jchd Posted August 11, 2014

Your example doesn't paste gibberish for me, but that depends heavily on how far the end of the highlight goes and on how your browser deals with things.

Anyway, if you want to remove everything except Unicode letters and digits (whatever the language), whitespace, punctuation and currency symbols (for example), then you can try this:

    ; Sample text mixing Latin, Cyrillic and Chinese with bullets and currency symbols
    Local $text = "Abç dêf" & @TAB & "123456.789 - 123000 = 456.789 € (convert to £, ₯ or $ as needed!)" & @CRLF & _
            @TAB & "• First bullet" & @CRLF & _
            @TAB & "‣ Second bullet" & @CRLF & _
            @TAB & "• русский текст" & @CRLF & _
            @TAB & "• 中國文字" & @CRLF & _
            "end of test…" & @TAB & "¿Does that work for you?"
    MsgBox(0, "Input text", $text)
    ; Keep letters (\pL), currency symbols (\pSc), decimal digits (\pNd),
    ; space separators (\pZs) and whitespace (\s); also drop the two bullet characters
    Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\pSc\pNd\pZs\s]|[•‣]", "")
    MsgBox(0, "Filtered text", $str)

Of course this is only a sketch, which you'll need to adjust to your own needs.

Guy_ (Author) Posted August 12, 2014 (edited)

jchd wrote:
"Your example doesn't paste gibberish for me, but that depends heavily on how far the end of the highlight goes and on how your browser deals with things."

You are right. It seems I *did* select too much there. You are also right that it depends on the browser: if I select too far, Firefox gives me an extra kind of space, while IE gives me some kind of newline.

However, your new code pointer is already filtering this out! So in the first minutes, it looks very promising. Thank you very much!

I'll still have to figure out how to include important stuff like ".,;:/?)!'"&[](){}*@#", because it seems to filter all of these out (and probably more). That makes me wonder what else I'll be missing.

And again, the PDF stuff is the least of my worries. I'd rather keep the bullets for other situations (and that seems an easy fix):

    Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\pSc\pNd\pZs\s]", "")

I'm now hoping the characters still missing form a simple "class", or do I have to add them back in manually in some way?

At first glance, adding in [:punct:] seems to be a working fix:

    Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\pSc\pNd\pZs\s[:punct:]]", "")

Edited August 12, 2014 by Guy_

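An alternative to listing everything worth keeping is the reverse approach asked about earlier in the thread: remove only the invisible characters. The sketch below assumes that the unwanted characters fall into the Unicode categories Cc (control) and Cf (format, which includes the zero-width space U+200B); _StripInvisible is a made-up name, and whether this catches the trailing character from the YouTube example is untested.

```autoit
; Hypothetical helper: drop control (Cc) and format (Cf) characters,
; except Tab, CR and LF, and leave everything else untouched.
Func _StripInvisible($sText)
    ; (?![\t\r\n]) is a lookahead that exempts Tab, CR and LF from removal.
    Return StringRegExpReplace($sText, '(*UCP)(?![\t\r\n])[\p{Cc}\p{Cf}]', "")
EndFunc

; ChrW() is used so the sample does not depend on the script file's encoding:
; U+200B zero-width space, U+2022 bullet, U+20AC euro sign.
Local $sSample = "A" & ChrW(0x200B) & "B" & Chr(7) & @CRLF & ChrW(0x2022) & " 5 " & ChrW(0x20AC) & " (ok?)"
MsgBox(0, "Filtered text", _StripInvisible($sSample))
```

Punctuation, bullets and currency symbols pass through unchanged because they are never matched. One caveat: Cf also contains characters such as the zero-width joiner, which some scripts and emoji sequences rely on, so the class may need narrowing for that kind of text.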
JeffAllenNJ Posted July 28, 2020

    StringRegExpReplace($text, '[^[:print:]]', '')

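Whether [:print:] keeps accented letters, bullets or other non-ASCII characters depends on the PCRE version built into AutoIt and on whether (*UCP) is set, so it is worth checking on real input before relying on it. A rough comparison sketch follows; the sample string and the choice of patterns are assumptions made for illustration, and no particular result is claimed.

```autoit
; Hypothetical comparison harness: apply each candidate pattern from the
; thread to the same sample and show what survives.
Local $aPatterns[3] = [ _
        '[^[:print:]]', _
        '(*UCP)[^[:print:]]', _
        '(*UCP)[^\pL\pSc\pNd\pZs\s[:punct:]]']

; Sample: accented letter (U+00EA), Tab, bullet (U+2022), euro sign (U+20AC),
; ASCII punctuation and a trailing zero-width space (U+200B).
Local $sSample = "Abc d" & ChrW(0xEA) & "f" & @TAB & ChrW(0x2022) & " 1.5 " & ChrW(0x20AC) & " (ok?)" & ChrW(0x200B)

Local $sReport = ""
For $i = 0 To UBound($aPatterns) - 1
    ; Brackets around the result make leading or trailing leftovers visible.
    $sReport &= $aPatterns[$i] & @CRLF & "  -> [" & StringRegExpReplace($sSample, $aPatterns[$i], "") & "]" & @CRLF
Next
MsgBox(0, "Pattern comparison", $sReport)
```

Replacing $sSample with ClipGet() runs the same comparison on whatever text is currently causing trouble.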
water Posted July 28, 2020

You noticed that this thread is 6 years old

JeffAllenNJ Posted August 26, 2020 (edited)

Yeah, but it still pops up at the top of a Google search, so I thought I'd supply the answer for anyone else searching. Sorry it took me a month to reply!

Edited August 26, 2020 by jaja714