
Filtering out control characters from copied text

Guy_

I often copy text from a website or pdf into a variable and once in a while pasting it back into WordPad gives weird results.

It used to happen more often with longer Facebook posts or YouTube comments.

One example from a PDF is where bullets were changed into a corner-like character, etc.

I assume many of these could be control characters?

What is the best way to filter them out, please?

From reading in the manual, my only guess was something like the following, but it seems to do nothing (not sure though, and less easy to test for me...).

$text = StringRegExpReplace ( $text, '[[:cntrl:]]', "" )

Or is it something with [:print:] ?  (meaning, "give me only the characters that would normally print?")

I don't mind if your solution removes Returns too (though ideally not), because I usually remove those myself.

Thank You for any pointers! :)
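
A minimal sketch of the question's first idea, assuming the goal is to drop ASCII control codes while keeping Tab and Returns. The sample string is made up for illustration. Note that '[[:cntrl:]]' does match control characters (including CR, LF and Tab), so if it appears to do nothing, the odd characters are probably not control characters at all.

```autoit
; Sketch: strip ASCII control characters but keep TAB, LF and CR.
; The sample text is hypothetical: it contains a BEL (7) and an ESC (0x1B).
Local $sText = "Line one" & Chr(7) & @CRLF & "Line" & Chr(0x1B) & " two" & @TAB & "end"

; \x00-\x08, \x0B, \x0C, \x0E-\x1F and \x7F are control codes.
; \x09 (TAB), \x0A (LF) and \x0D (CR) are deliberately left out of the class.
Local $sClean = StringRegExpReplace($sText, '[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', "")

ConsoleWrite($sClean & @CRLF)
```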

computergroove

This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though.
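
For reference, the character-by-character approach could look roughly like the sketch below. The allow-list in $sAllowed and the sample text are hypothetical, chosen only to show the loop.

```autoit
; Sketch of the character-by-character idea; $sAllowed is a hypothetical allow-list.
Local $sAllowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,;:!?'-"
Local $sText = "Hello," & Chr(7) & " world!"
Local $sOut = "", $sChar

For $i = 1 To StringLen($sText)
    $sChar = StringMid($sText, $i, 1)
    ; Flag 1 makes StringInStr case-sensitive; a non-zero result means the character is allowed.
    If StringInStr($sAllowed, $sChar, 1) Then $sOut &= $sChar
Next

ConsoleWrite($sOut & @CRLF)
```

It works, but the allow-list has to be maintained by hand, which is the main drawback compared with a character class in a regular expression.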


Get Scite to add a popup when you use a 3rd party UDF -> http://www.autoitscript.com/autoit3/scite/docs/SciTE4AutoIt3/user-calltip-manager.html

Guy_

This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though.

 

Not necessarily. You can do that sort of thing with StringRegExpReplace probably.

For example, to replace everything that is NOT a-z, A-Z or 0-9 in your text with "" ...

$text = StringRegExpReplace ( $text,  '[^[:alnum:]]', "" )

And then you can add other characters to it that you are still missing, but may need a lot of escape characters and will look a mess...

I would be afraid to miss out on a few characters too, so I am hoping the other way round exists too and is neater code (and/or faster).

jchd

There are several options open but there is something unclear: "One example from a pdf is where bullets were changed into a corner like character"

That seems to mean this is text in some ANSI codepage XYZ blindly transferred to ANSI codepage ABC.

Neither bullets nor framing symbols are control characters.

Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. That, or post the offending PDF if it's publicly available.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Guy_

Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. That, or post the offending PDF if it's publicly available.

I've tried that in the first message, but the "corner" character wouldn't display.

I was prepared for something like your explanation anyway and it's the lesser of my worries.

Weird stuff can happen or be manipulated with pdf files it seems.

I think I even have a pdf that displays normal readable text, but if you copy from it it's a garbled mess of characters, probably on purpose.

-

Since I believe I usually have horizontal spacing problems in my output, for now I've put in these lines and I'll see how that goes...

$text = StringRegExpReplace( $text, '\h', " " )
$text = StringRegExpReplace( $text, '[ ]{2,}', " " )

I'm hoping that should make any amount of horizontal spacing into one space, which I'm very ok with.
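
The same effect can probably be had in one pass: a '\h+' quantifier replaces each run of horizontal whitespace with a single space. A small sketch with a made-up sample string:

```autoit
; Sketch: collapse any run of horizontal whitespace (spaces, tabs, ...) into one space.
Local $sText = "too    many" & @TAB & @TAB & "gaps"
Local $sTidy = StringRegExpReplace($sText, '\h+', " ")
ConsoleWrite($sTidy & @CRLF) ; prints "too many gaps"
```

Line breaks are not horizontal whitespace, so they are left alone by this pattern.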

I had one example on YouTube from a while ago, but at the moment it doesn't show the problem I was getting anymore...

I'll dig this thread up again if I run across an example later.

And I'm still hoping other people have needed this and for an elegant solution to give me all displaying characters (+ space) without any control chars & stuff.

jchd

Read the doc of StringRegExp. There you'll see that by enabling Unicode category properties you have access to a whole new world of character classes. The discussion of this in detail would have rendered our help file too complex for newcomers but you'll find details explained in full in the official PCRE documentation (link below) under pcrepattern.

For instance you can detect all Unicode symbols of a string with the class "(*UCP)[\pS]"
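
As an illustration of such a property class, the sketch below lists every character of the Unicode "Symbol" category found in a string. The sample text is invented, and ChrW() is used for the non-ASCII characters so the result does not depend on how the script file is encoded.

```autoit
; Sketch: list every character that belongs to the Unicode "Symbol" category.
Local $sText = "Price: 10 " & ChrW(0x20AC) & " + 5 " & ChrW(0x00A3) & " = ?"

; Flag 3 returns an array of all matches; @error is set when nothing matches.
Local $aSymbols = StringRegExp($sText, '(*UCP)\p{S}', 3)
If Not @error Then
    For $sSymbol In $aSymbols
        ConsoleWrite("U+" & Hex(AscW($sSymbol), 4) & @CRLF)
    Next
EndIf
```

Printing code points rather than the characters themselves sidesteps any console encoding issues.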

Guy_

Thanks for the pointers, jchd!

I do find some clues there, but it may need a total study of RegEx before I can do anything with it, as something like this (although I need the reverse) doesn't seem to do anything:

$text = StringRegExpReplace( $text, '(*UCP)[\pS]', "" )

Maybe I need to activate that PCRE somewhere first. I may look into it further later.

At the moment, I also don't know if ending up with Unicode only would filter out control codes?

-

In the meantime, I did some random YouTube tests and one example is in the comments on http://www.youtube.com/all_comments?v=qTdOxn9MoPg

If you carefully select the line "Trust what you see after you catch bed bugs into a glass jar." and no more, and then paste it somewhere, you'll get an extra kind of space at the end.

I don't even know if that's a control character, but you get it a lot if you accidentally select a little more than the exact word or line in some websites.

If I look at the html source, I don't really get a clue from it... It looks clean.

[...] Trust what you see after you catch bed bugs into a glass jar.</div>

This stuff confuses my program, and I'd love to know what kind of character is causing it so that I can filter for it.

Even though in this case it looks to be some kind of space, even this code (just as a test) didn't filter it out:

$text = StringRegExpReplace( $text, '\h', "" )
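
One way to find out what the stray character actually is would be to dump the code point of every character in the copied text. The sketch below assumes the problem text is still on the clipboard. If the extra character turns out to be a line break (U+000D or U+000A), that would explain why '\h', which only covers horizontal whitespace, leaves it in place.

```autoit
; Diagnostic sketch: print position and code point of each character on the clipboard.
Local $sText = ClipGet()
Local $sChar

For $i = 1 To StringLen($sText)
    $sChar = StringMid($sText, $i, 1)
    ; AscW returns the character's UTF-16 code unit; Hex(..., 4) formats it as four hex digits.
    ConsoleWrite($i & ": U+" & Hex(AscW($sChar), 4) & @CRLF)
Next
```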
jchd

Your example doesn't paste gibberish for me, but that heavily depends on how far the end of the highlight goes and how your browser deals with things.

Anyway, if you want to remove everything except Unicode letters and digits (whatever the language), whitespace and currency symbols (for example), then you can try this:

Local $text = "Abç dêf" & @TAB & "123456.789 - 123000 = 456.789 € (convert to £, ₯ or $ as needed!)" & @CRLF & _
                @TAB & "• First bullet" & @CRLF & _
                @TAB & "‣ Second bullet" & @CRLF & _
                @TAB & "• русский текст" & @CRLF & _
                @TAB & "• 中國文字" & @CRLF & _
                "end of test…" & @TAB & "¿Does that work for you?"
MsgBox(0, "Input text", $text)
; Two-letter properties need braces: PCRE reads \pSc as \pS followed by a literal "c".
Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s]|[•‣]", "")
MsgBox(0, "Filtered text", $str)

Of course this is only a sketch which you'll need to adjust to your own needs.
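
When adapting a pattern like this, one PCRE detail is easy to trip over: the brace-less \pX shorthand takes a single letter only, so two-letter categories such as Sc (currency symbols) have to be written \p{Sc}. The sketch below contrasts the two spellings on a made-up string.

```autoit
; Sketch: compare \pSc (any symbol, plus a literal "c") with \p{Sc} (currency symbols only).
Local $sText = "2 + 2 = 4 " & ChrW(0x20AC)

; Negated class: everything that is NOT in the class is removed.
Local $sLoose = StringRegExpReplace($sText, '[^\pSc]', "")    ; keeps all symbols, e.g. + and = as well
Local $sStrict = StringRegExpReplace($sText, '[^\p{Sc}]', "") ; keeps currency symbols only

ConsoleWrite("Characters kept with \pSc:   " & StringLen($sLoose) & @CRLF)
ConsoleWrite("Characters kept with \p{Sc}: " & StringLen($sStrict) & @CRLF)
```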


Guy_

Your example doesn't paste gibberish for me, but that heavily depends on how far the end of the highlight goes and how your browser deals with things.

You are right. It seems I *did* select too much there...

You are also right it depends on the browser. If I select too far, Firefox gives me an extra kind of space, IE gives me some kind of newline...

However, your new code pointer is already filtering this out!

So in the first minutes, it looks very promising.

Thank You Very Much  :)

However, I'll still have to figure out how to include important stuff like ".,;:/?)!'"&[](){}*@#" because it seems to filter all of these out (and more probably) ...?

That makes me wonder what else I'll be missing.

And again, the pdf stuff is the least of my worries. I'd rather keep the bullets for other situations (and that seems an easy fix).

Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s]", "")

I'm now hoping the chars still missing are a simple "class" or do I have to add them back in manually in some way?

At first glance adding in [:punct:] seems a working fix:

Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s[:punct:]]", "")
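
Pulling the pieces of this thread together, a reusable cleanup routine might look like the sketch below. _CleanCopiedText is a hypothetical helper name, not part of AutoIt or any UDF, and the choice of categories to keep (letters, numbers, punctuation, symbols, space separators, Tab and Returns) is an assumption to adjust as needed. It uses \p{P} for punctuation rather than [:punct:].

```autoit
; Hypothetical helper: keep visible characters, spaces and line breaks; drop everything else.
Func _CleanCopiedText($sText)
    ; Remove anything that is not a letter, number, punctuation mark, symbol,
    ; space separator, CR, LF or TAB (this covers control and format characters).
    $sText = StringRegExpReplace($sText, '(*UCP)[^\p{L}\p{N}\p{P}\p{S}\p{Zs}\r\n\t]', "")
    ; Collapse runs of horizontal whitespace into a single space.
    Return StringRegExpReplace($sText, '\h+', " ")
EndFunc   ;==>_CleanCopiedText

; Usage: clean whatever is currently on the clipboard and put it back.
ClipPut(_CleanCopiedText(ClipGet()))
```

Because the class is an allow-list of whole Unicode categories, bullets and other punctuation survive, while characters outside those categories are removed.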
