Guy_

Filtering out control characters from copied text

I often copy text from a website or PDF into a variable, and once in a while pasting it back into WordPad gives weird results.

It used to happen more often with longer Facebook texts or YouTube comments.

One example from a PDF is where bullets were changed into a corner-like character.

I assume many of these could be control characters?

What is the best way to filter them out, please?

From reading in the manual, my only guess was something like the following, but it seems to do nothing (not sure though, and less easy to test for me...).

$text = StringRegExpReplace ( $text, '[[:cntrl:]]', "" )

Or is it something with [:print:] ?  (meaning, "give me only the characters that would normally print?")

I don't mind if your solution removes Returns too (though ideally not), because I usually remove those myself.

Thank You for any pointers! :)

Edited by Guy_
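
For reference, `[[:cntrl:]]` does match the ASCII control characters (codes 0 to 31 and 127), so if it seems to do nothing, the odd characters are probably not ASCII control characters at all. Here is a minimal sketch, using a made-up sample string, that strips ASCII control characters but keeps tab, line feed and carriage return:

```autoit
; Hypothetical sample: BEL (7) and ESC (27) hidden inside ordinary text.
Local $sText = "Line one" & Chr(7) & @CRLF & "Line" & Chr(27) & " two" & @TAB & "end"

; Remove ASCII control characters except TAB (09), LF (0A) and CR (0D).
Local $sClean = StringRegExpReplace($sText, '[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]', '')

ConsoleWrite("Before: " & StringLen($sText) & " chars, after: " & StringLen($sClean) & " chars" & @CRLF)
```

Comparing the two lengths is a quick way to see whether the pattern removed anything from a given piece of text.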


This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though.
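
A rough sketch of that character-by-character idea could look like the following. The helper name `_KeepOnly` and the allowed list are made up for illustration, and this is just one way to write it:

```autoit
; _KeepOnly is a hypothetical helper name, not an AutoIt built-in.
Func _KeepOnly($sText, $sAllowed)
    Local $sOut = "", $sChar
    For $i = 1 To StringLen($sText)
        $sChar = StringMid($sText, $i, 1)
        ; Third parameter 1 = case-sensitive search in the allowed list.
        If StringInStr($sAllowed, $sChar, 1) Then $sOut &= $sChar
    Next
    Return $sOut
EndFunc   ;==>_KeepOnly

; Example allowed list: letters, digits, space and a few punctuation marks.
Local $sAllowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?"
ConsoleWrite(_KeepOnly("Hello" & Chr(7) & ", world!", $sAllowed) & @CRLF)
```

The obvious drawback is that every character you want to keep has to be listed by hand, including accented letters.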



This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though.

 

Not necessarily. You can do that sort of thing with StringRegExpReplace probably.

For example, to replace everything that is NOT a-z, A-Z or 0-9 in your text with "" ...

$text = StringRegExpReplace ( $text,  '[^[:alnum:]]', "" )

You can then add any other characters you still want to keep, but that may need a lot of escape characters and will look a mess...

I would be afraid to miss out on a few characters too, so I am hoping the other way round exists too and is neater code (and/or faster).
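
As a sketch of what such an extended "keep list" might look like: inside a character class most punctuation is literal, so fewer escapes are needed than one might fear. The sample text and the choice of kept characters below are only an illustration:

```autoit
; Hypothetical sample with some punctuation, a percent sign and a BEL character.
Local $sText = 'Hello, "world" (test) [1]; a-b: 100%?' & Chr(7)

; Keep letters, digits, whitespace and a hand-picked punctuation list.
; Here [ and ] are escaped, the hyphen is placed last so it is literal,
; and '' is how a single quote is written inside an AutoIt single-quoted string.
Local $sClean = StringRegExpReplace($sText, '[^[:alnum:]\s.,;:!?"''()\[\]-]', '')

ConsoleWrite($sClean & @CRLF)
```

Anything not in the list (here the percent sign and the BEL character) is dropped, which is exactly the risk mentioned above: characters you forgot to list disappear silently.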


There are several options open, but something is unclear: "One example from a pdf is where bullets were changed into a corner like character"

That seems to mean some ANSI codepage XYZ was blindly transferred to ANSI codepage ABC.

Neither bullets nor framing symbols are control characters.

Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. That, or post the offending PDF if it's publicly available.
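
If the character won't display in a post, a small diagnostic like this sketch can list the code of every character currently on the clipboard, which makes it easier to say what the offending character really is. It assumes the problem text has just been copied:

```autoit
; Copy the problem text first; ClipGet() reads it back from the clipboard.
Local $sClip = ClipGet()
If @error Then
    ConsoleWrite("Clipboard is empty or does not contain text" & @CRLF)
    Exit
EndIf

Local $sChar
For $i = 1 To StringLen($sClip)
    $sChar = StringMid($sClip, $i, 1)
    ; AscW returns the character's Unicode code; Hex(..., 4) formats it as four hex digits.
    ConsoleWrite($i & @TAB & "U+" & Hex(AscW($sChar), 4) & @CRLF)
Next
```

The codes can then be looked up in any Unicode chart to see whether a character is a control code, a special space or something else.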


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. That, or post the offending PDF if it's publicly available.

I've tried that in the first message, but the "corner" character wouldn't display.

I was prepared for something like your explanation anyway and it's the lesser of my worries.

Weird things can happen with PDF files, or be done to them on purpose, it seems.

I think I even have a PDF that displays normal readable text, but if you copy from it you get a garbled mess of characters, probably on purpose.

-

Since I believe I usually have horizontal spacing problems in my output, for now I've put in these lines and I'll see how that goes...

$text = StringRegExpReplace( $text, '\h', " " )
$text = StringRegExpReplace( $text, '[ ]{2,}', " " )

I'm hoping that should make any amount of horizontal spacing into one space, which I'm very ok with.
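
For what it's worth, the two replacements can probably be folded into one, since `\h+` matches a whole run of horizontal whitespace at once. A small sketch with a made-up sample string:

```autoit
; Hypothetical sample: two tabs in a row and a run of three spaces.
Local $sText = "one" & @TAB & @TAB & "two   three"

; \h+ matches a run of horizontal whitespace, so one pass replaces each run with a single space.
Local $sClean = StringRegExpReplace($sText, '\h+', ' ')

ConsoleWrite($sClean & @CRLF)
```

This does not touch line breaks, because `\h` only covers horizontal whitespace.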

I had one example on YouTube from a while ago, but at the moment it doesn't show the problem I was getting anymore...

I'll dig this thread up again if I run across an example later.

I'm still hoping other people have needed this too, and that there is an elegant solution that keeps all displayable characters (plus space) and drops control characters and the like.


Read the documentation for StringRegExp. There you'll see that by enabling Unicode category properties you have access to a whole new world of character classes. Discussing this in detail would have made our help file too complex for newcomers, but you'll find the details explained in full in the official PCRE documentation (link below) under pcrepattern.

For instance, you can detect all Unicode symbols in a string with the pattern "(*UCP)[\pS]"

Edited by jchd
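
As a small illustration of the Unicode "Symbol" category, this sketch lists every symbol found in a made-up sample string. The sample is built with ChrW so that the encoding of the script file does not matter:

```autoit
#include <Array.au3>

; Hypothetical sample: euro sign, plus, pound sign, equals sign and copyright sign.
Local $sText = "10 " & ChrW(0x20AC) & " + 5 " & ChrW(0x00A3) & " = ? " & ChrW(0x00A9)

; Flag 3 returns an array with every match; \pS is the Unicode "Symbol" category.
Local $aSymbols = StringRegExp($sText, '(*UCP)\pS', 3)
If @error Then
    ConsoleWrite("No symbols found" & @CRLF)
Else
    _ArrayDisplay($aSymbols, "Symbols")
EndIf
```

Listing matches this way, rather than replacing them straight away, makes it easier to check what a category actually covers.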

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.


Thanks for the pointers, jchd!

I do find some clues there, but it may need a total study of RegEx before I can do anything with it, as something like this (although I need the reverse) doesn't seem to do anything:

$text = StringRegExpReplace( $text, '(*UCP)[\pS]', "" )

Maybe I need to activate that PCRE somewhere first. I may look into it further later.

At the moment, I also don't know whether keeping only Unicode characters would filter out control codes.
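
On that last question: control codes are themselves part of Unicode (general category Cc), so "Unicode only" would not remove them. They have to be targeted explicitly. A hedged sketch that removes control characters and invisible format characters (category Cf, such as the zero-width space) while sparing line breaks and tabs, using a made-up sample:

```autoit
; Hypothetical sample: zero-width space (U+200B), BOM (U+FEFF) and BEL (7) mixed into text.
Local $sText = "Trust" & ChrW(0x200B) & " what you see." & ChrW(0xFEFF) & Chr(7) & @CRLF & "Next line"

; \p{Cc} = control characters, \p{Cf} = invisible format characters.
; The lookahead (?![\r\n\t]) spares CR, LF and TAB, which are also in \p{Cc}.
Local $sClean = StringRegExpReplace($sText, '(*UCP)(?![\r\n\t])[\p{Cc}\p{Cf}]', '')

ConsoleWrite(StringLen($sText) & " -> " & StringLen($sClean) & @CRLF)
```

Whether this catches the characters from a particular website or PDF depends on what they really are, so it is worth checking their codes first.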

-

In the meantime, I did some random YouTube tests and one example is in the comments on http://www.youtube.com/all_comments?v=qTdOxn9MoPg

If you carefully select the line "Trust what you see after you catch bed bugs into a glass jar." and no more, and then paste it somewhere, you'll get an extra kind of space at the end.

I don't even know if that's a control character, but you get it a lot if you accidentally select a little more than the exact word or line in some websites.

If I look at the html source, I don't really get a clue from it... It looks clean.

[...] Trust what you see after you catch bed bugs into a glass jar.</div>

This stuff confuses my program and I'd love to know what kind of character is causing it, so that I can filter for it.

Even though in this case it looks to be some kind of space, even this code (just as a test) didn't filter it out:

$text = StringRegExpReplace( $text, '\h', "" )
Edited by Guy_
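
Without knowing which character it is, one possible workaround is to trim anything whitespace-like or invisible from the end of the string. This is only a sketch: the two trailing characters in the sample are guesses for illustration, not what the browser really appends.

```autoit
; Hypothetical sample: a sentence followed by a no-break space and a zero-width space.
Local $sText = "Trust what you see after you catch bed bugs into a glass jar." & ChrW(0x00A0) & ChrW(0x200B)

; Remove a trailing run of whitespace (\s), separators (\p{Z}) or "other" characters (\p{C})
; that reaches the very end of the string (\z).
Local $sTrimmed = StringRegExpReplace($sText, '(*UCP)[\s\p{Z}\p{C}]+\z', '')

ConsoleWrite("[" & $sTrimmed & "]" & @CRLF)
```

The brackets in the output make it easy to see whether anything is still hanging on after the final full stop.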


Your example doesn't paste gibberish for me, but that depends heavily on how far the end of the highlight goes and how your browser deals with things.

Anyway, if you want to remove everything except Unicode letters and digits (whatever the language), whitespace and currency symbols (for example), then you can try this:

Local $text = "Abç dêf" & @TAB & "123456.789 - 123000 = 456.789 € (convert to £, ₯ or $ as needed!)" & @CRLF & _
                @TAB & "• First bullet" & @CRLF & _
                @TAB & "‣ Second bullet" & @CRLF & _
                @TAB & "• русский текст" & @CRLF & _
                @TAB & "• 中國文字" & @CRLF & _
                "end of test…" & @TAB & "¿Does that work for you?"
MsgBox(0, "Input text", $text)
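; Two-letter Unicode categories need braces: Sc = currency symbol, Nd = decimal digit, Zs = space separator.
; Without braces, \pSc is read as \pS (any symbol) followed by a literal "c".
Local $str = StringRegExpReplace($text, "(*UCP)[^\p{L}\p{Sc}\p{Nd}\p{Zs}\s]|[•‣]", "")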
MsgBox(0, "Filtered text", $str)

Of course this is only a sketch which you'll need to adjust to your own needs.



Your example doesn't paste gibberish for me, but that depends heavily on how far the end of the highlight goes and how your browser deals with things.

You are right. It seems I *did* select too much there...

You are also right it depends on the browser. If I select too far, Firefox gives me an extra kind of space, IE gives me some kind of newline...

However, your new code is already filtering this out!

So in the first minutes, it looks very promising.

Thank You Very Much  :)

However, I'll still have to figure out how to include important stuff like ".,;:/?)!'"&[](){}*@#" because it seems to filter all of these out (and probably more) ...?

That makes me wonder what else I'll be missing.
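
One way to find out what goes missing is to list the matches instead of deleting them. The sketch below uses a pattern of the same shape (written with braces for the two-letter categories) and a made-up sample string:

```autoit
#include <Array.au3>

; Hypothetical sample containing a euro sign (U+20AC) and a bullet (U+2022).
Local $sText = "Price: 10.50 " & ChrW(0x20AC) & " (approx.) " & ChrW(0x2022) & " done!"
Local $sPattern = '(*UCP)[^\p{L}\p{Sc}\p{Nd}\p{Zs}\s]'

; Flag 3 collects every character the pattern would remove.
Local $aRemoved = StringRegExp($sText, $sPattern, 3)
If @error Then
    ConsoleWrite("Nothing would be removed" & @CRLF)
Else
    _ArrayDisplay($aRemoved, "Removed characters")
EndIf
```

Running this on a few real samples shows which characters need to be added back to the class.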

And again, the pdf stuff is the least of my worries. I'd rather keep the bullets for other situations (and that seems an easy fix).

Local $str = StringRegExpReplace($text, "(*UCP)[^\p{L}\p{Sc}\p{Nd}\p{Zs}\s]", "")

I'm now hoping the chars still missing are a simple "class" or do I have to add them back in manually in some way?

At first glance adding in [:punct:] seems a working fix:

Local $str = StringRegExpReplace($text, "(*UCP)[^\p{L}\p{Sc}\p{Nd}\p{Zs}\s[:punct:]]", "")
Edited by Guy_
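
An alternative worth trying is the Unicode punctuation category `\p{P}`, which also covers non-ASCII punctuation such as bullets and the ellipsis character. This is a sketch with a made-up sample, not a tested replacement for the line above:

```autoit
; Hypothetical sample: bullet (U+2022), ellipsis (U+2026), euro sign (U+20AC) and a BEL character.
Local $sText = "Well, " & ChrW(0x2022) & " it works" & ChrW(0x2026) & " (mostly)! 5 " & ChrW(0x20AC) & Chr(7)

; \p{P} = any Unicode punctuation, \p{N} = any number, \p{Sc} = currency symbols.
Local $sClean = StringRegExpReplace($sText, '(*UCP)[^\p{L}\p{N}\p{P}\p{Sc}\p{Zs}\s]', '')

ConsoleWrite(StringLen($sText) & " -> " & StringLen($sClean) & @CRLF)
```

Note that characters like + = < > are symbols (category S) rather than punctuation, so they would still be removed unless `\p{Sm}` or `\p{S}` is added to the class.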
