
Short question, difficult answer - maybe (It's difficult to explain)



In short: I copied some text from a PDF, and this is what I got:

1205653024_imagescreenshot.png.a0f19484255098e719fba29f1f15ba5a.png

As you can see, there are many odd symbols in it, especially this one (I can't copy the symbol itself, so see the screenshot):

596877452_unvisiblesymbol.png.b7e8afb6ef0710b0fce502a6ce9940bf.png

The symbol is not selectable. If I copy those sentences from MS Word into this forum, they look like this:

Needless to say, after that he collected and relished everything he could find in the Gos[1]vami’s books that would enhance his bhajana and he became very deeply engrossed in fol[1]lowing the raganuga-marga. He remained absorbed in bhajana almost all the time. His extraor[1]dinary immersion in prema became obvious when he was engaged in sravana-kirtana. Dur[1]ing lila-sravana-kirtana, tears, mucous and saliva would stream from his face, and two

Vaisnavas wiping them away could not stop the flow. Once, while sitting on the bank of the

Manasa-ganga in deep trance, he fell into the water and remained there for three days. On

the fourth day he floated to the surface. When his followers found him and pulled him out of

the water, they saw that he was still alive. After they loudly sang nama-kirtana for a long time,

he finally returned to external consciousness. From this time on, he was known as “Siddha[1]baba.”

Interestingly, the AutoIt forum renders the symbol as "[1]" and Word shows it as that symbol, but Notepad (plain text) cannot display it.

I have attached the MS Word file. If you try to copy the symbol on its own, nothing is copied except a space, and pasting into Notepad also gives only a space.

My question is: how can I find that symbol? I am working on transliteration, and I need to find the symbol in order to remove it.

Below is my current script, which replaces some Romanian letters. I just need a way to detect that symbol and remove it without leaving any space behind, so that the text from the image above ends up looking clean.

 

#include <File.au3>
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>


Func convert($file)
    Local $srch = "¸õΩ•@~µ∫˙√†‰∂ˇî®ßÃĀåḌḤĪïùàḶḸṂṆñìṄṚṜṢŚṬāḍḥīḷḹṃṁṇṅṛṝṣśṭūäÇéüöëò"
    Local $repl = "Sns-aamnhntrdtirsnaadhinhmllmnnnnrrsstadhiiimmnnrrsstuasiutnd"

    Local $check = FileGetAttrib($file)
    If StringInStr($check, "D") Then
        ConsoleWrite("Skipping the directory " & $file & @CRLF)
        Return
    Else
        ConsoleWrite("Parsing file: " & $file & @CRLF)
    EndIf

    ; load file content into memory
    Local $filereader = FileOpen($file, $FO_READ)
    Local $content = FileRead($filereader)
    FileClose($filereader)

    ; change all characters in memory
    For $i = 1 To StringLen($srch)
        $content = StringReplace($content, StringMid($srch, $i, 1), StringMid($repl, $i, 1))
    Next

    ; write back file to disk ($FO_OVERWRITE erases the previous content)
    Local $filewriter = FileOpen($file, $FO_OVERWRITE)
    FileWrite($filewriter, $content)
    FileClose($filewriter)
EndFunc   ;==>convert

; Display an open dialog to select a list of file(s).
Local $sFileOpenDialog = FileOpenDialog("Hold down Ctrl or Shift to choose multiple files.", @ScriptDir & "\", "Au3 (*.au3)", BitOR($FD_FILEMUSTEXIST, $FD_MULTISELECT))

If @error Then Exit MsgBox($MB_SYSTEMMODAL, "", "No file(s) were selected.")
; split up the selected files into an array
$sFileOpenDialog = StringSplit($sFileOpenDialog, "|")

; walk through the array, convert one file after another
For $file = 1 To $sFileOpenDialog[0]
    convert($sFileOpenDialog[$file])
Next

 
That's my question. Please help, guys.

example word file.docx


Copy the text from the PDF file and run this script. Then paste wherever after running the script.

#include <MsgBoxConstants.au3>

Global $g_sClipboard
Global $g_sReplaceWith = '[1]'

$g_sClipboard = ClipGet() ; Retrieve text from the clipboard
$g_sReplacedString = StringReplace($g_sClipboard, Chr(2), $g_sReplaceWith) ; Replace the symbol with $g_sReplaceWith

MsgBox($MB_OK, @ScriptName, $g_sReplacedString) ; Display the new text

ClipPut($g_sReplacedString) ; Write the new text to the clipboard

You can change $g_sReplaceWith with whatever you want.


This is what happens when I use your script above:

pertama.jpg.78615132e2980bf1e075316cf14cf809.jpg

Then, after running the script, the clipboard content is modified:

71947644_gambarkeedua.jpg.2649fb089ac040c2938dad960e893a86.jpg

Is the script meant to change the symbol to [1]? It seems to remove the symbol but leave a space behind, as you can see.
Removing it is fine, but why is there still a space?


If the special characters only need to get removed, try adding this line after line 26 (below the for loop which changes all characters in memory):

$content = StringRegExpReplace($content, '[^[:print:]]', '')

Just a guess, but it should simply kill all non-printable characters out of the text.

Any of my own codes posted on the forum are free for use by others without any restriction of any kind. (WTFPL)


I would suggest displaying the content of the clipboard in binary. That way you can see exactly which value is used to represent that character (or sequence of characters).

You can use the Binary() function for this. Once you know the value, just replace it with "".

 

 

1 hour ago, Marc said:

If the special characters only need to get removed, try adding this line after line 26 (below the for loop which changes all characters in memory):

$content = StringRegExpReplace($content, '[^[:print:]]', '')

Just a guess, but it should simply kill all non-printable characters out of the text.

This removes all the formatting as well: the paragraph breaks (the gap you get between blocks of text when pressing Enter) are gone too. Removing the symbol is right, but this goes too far; the paragraph breaks should be kept.
Here is the result:

Needless to say, after that he collected and relished everything he could find in the Gosvamis books that would enhance his bhajana and he became very deeply engrossed in following the raganuga-marga. He remained absorbed in bhajana almost all the time. His extraordinary immersion in prema became obvious when he was engaged in sravana-kirtana. During lila-sravana-kirtana, tears, mucous and saliva would stream from his face, and twoVaisnavas wiping them away could not stop the flow. Once, while sitting on the bank of theManasa-ganga in deep trance, he fell into the water and remained there for three days. Onthe fourth day he floated to the surface. When his followers found him and pulled him out ofthe water, they saw that he was still alive. After they loudly sang nama-kirtana for a long time,he finally returned to external consciousness. From this time on, he was known as Siddhababa.

What do you think?
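
One possible way to keep the paragraph breaks, offered as a sketch rather than a tested fix: strip only the low ASCII control characters and leave tab, line feed and carriage return alone. Which characters count as unwanted is an assumption here, and the sample string is made up for illustration.

; Hypothetical sample: a word broken by Chr(2), followed by a paragraph break.
Local $content = "Gos" & Chr(2) & "vami" & @CRLF & @CRLF & "Next paragraph"

; Remove control characters 0x01-0x1F except tab (0x09), LF (0x0A) and CR (0x0D).
$content = StringRegExpReplace($content, '[\x01-\x08\x0B\x0C\x0E-\x1F]', '')

ConsoleWrite($content & @CRLF)

Because the character class lists only control characters, punctuation such as the curly apostrophe and quotation marks is left untouched.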


I guess it is something like this:

 

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Example()

Func Example()
    ; Define the string that will be converted later.
    ; NOTE: This string may show up as ?? in the help file and even in some editors.
    ; This example is saved as UTF-8 with BOM.  It should display correctly in editors
    ; which support changing code pages based on BOMs.
    Local Const $sString = "Hello - 你好"

    ; Temporary variables used to store conversion results.  $dBinary will hold
    ; the original string in binary form and $sConverted will hold the result
    ; after it's been transformed back to the original format.
    Local $dBinary = Binary(""), $sConverted = ""

    ; Convert the original UTF-8 string to an ANSI compatible binary string.
    $dBinary = StringToBinary($sString)

    ; Convert the ANSI compatible binary string back into a string.
    $sConverted = BinaryToString($dBinary)

    ; Display the results.  Note that the last two characters will appear
    ; as ?? since they cannot be represented in ANSI.
    DisplayResults($sString, $dBinary, $sConverted, "ANSI")

    ; Convert the original UTF-8 string to an UTF16-LE binary string.
    $dBinary = StringToBinary($sString, $SB_UTF16LE)

    ; Convert the UTF16-LE binary string back into a string.
    $sConverted = BinaryToString($dBinary, $SB_UTF16LE)

    ; Display the results.
    DisplayResults($sString, $dBinary, $sConverted, "UTF16-LE")

    ; Convert the original UTF-8 string to an UTF16-BE binary string.
    $dBinary = StringToBinary($sString, $SB_UTF16BE)

    ; Convert the UTF16-BE binary string back into a string.
    $sConverted = BinaryToString($dBinary, $SB_UTF16BE)

    ; Display the results.
    DisplayResults($sString, $dBinary, $sConverted, "UTF16-BE")

    ; Convert the original UTF-8 string to an UTF-8 binary string.
    $dBinary = StringToBinary($sString, $SB_UTF8)

    ; Convert the UTF8 binary string back into a string.
    $sConverted = BinaryToString($dBinary, $SB_UTF8)

    ; Display the results.
    DisplayResults($sString, $dBinary, $sConverted, "UTF8")
EndFunc   ;==>Example

; Helper function which formats the message for display.  It takes the following parameters:
; $sOriginal - The original string before conversions.
; $dBinary - The original string after it has been converted to binary.
; $sConverted- The string after it has been converted to binary and then back to a string.
; $sConversionType - A human friendly name for the encoding type used for the conversion.
Func DisplayResults($sOriginal, $dBinary, $sConverted, $sConversionType)
    MsgBox($MB_SYSTEMMODAL, "", "Original:" & @CRLF & $sOriginal & @CRLF & @CRLF & "Binary:" & @CRLF & $dBinary & @CRLF & @CRLF & $sConversionType & ":" & @CRLF & $sConverted)
EndFunc   ;==>DisplayResults

I see that it converts a string to binary, but how can I tell which part of the binary is the symbol and which part is the other words? It just appears as one long run of digits.

Could you give an example, please? I want to detect any symbol in a sentence, remove it, and then convert the result back to a string.
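
One way to tell the symbol apart from the ordinary letters is to print each character next to its own code instead of looking at the whole text as one run of digits. This is only a sketch, and it assumes the text to inspect has already been copied to the clipboard.

; Assumption: the text to inspect is on the clipboard.
Local $sText = ClipGet()

; Print every character with its position and its code point in hex,
; so an invisible character stands out as a line with an unusual code.
For $i = 1 To StringLen($sText)
    Local $sChar = StringMid($sText, $i, 1)
    ConsoleWrite($i & ": [" & $sChar & "] 0x" & Hex(AscW($sChar), 4) & @CRLF)
Next

Ordinary letters show familiar codes (for example 0x0061 for "a"), so any line with a code below 0x0020 or otherwise out of place is a candidate for removal.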

 


Let's make it simple. Copy a single word containing the symbol you want to get rid of to the clipboard, then run the following script:

ConsoleWrite(ClipGet() & @CRLF)
Local $dData = Binary(ClipGet())
ConsoleWrite($dData & @CRLF)

What do you get in the console?


Actually, every time I try to test ConsoleWrite, it fails.

Please see this:

1582979526_sarannine.thumb.png.30de8488b7fdbd33f836f360e6715a72.png

As you can see, I selected the symbol in Notepad and pressed Ctrl+C, then went to SciTE and pressed F5, but an error message said that it couldn't open the input file.


OK, problem solved.

The symbol is 0x02, which is what @Luke94 meant with this:

Quote
On 7/2/2021 at 5:42 PM, Luke94 said:

$g_sReplacedString = StringReplace($g_sClipboard, Chr(2), $g_sReplaceWith) ; Replace the symbol with $g_sReplaceWith

So I just changed it to this:

$g_sReplacedString = StringReplace($g_sClipboard, Chr(2), '') ; Remove the symbol by replacing it with an empty string

and added it to my script above. Problem solved. Haha, I just realized I was being too lazy.

Thanks, Nine, I also learned about Binary(). Thanks to Marc as well, and especially to Luke.


I am facing a new problem.
2015771682_gambar1.jpg.41f06a9f653dbd1048891737f23b2a49.jpg
As you can see, there is a gap between "ki" and "ora". The word should be written ki"s"ora, so the letter "s" is missing.
Using Binary(), I found that the character there is 0x8D, which is 141 in decimal, so it is Chr(141).
Once it is detected, I can remove it or replace it with any other letter.

The problem is that when I copy the text in the picture above to Notepad or to this AutoIt forum, that Chr(141) disappears:
"of nava Yugala-kiora"
As you can see, it is gone without even leaving a space. How can I detect that character and replace it with another one?

It seems that only Microsoft Word can show that character (as a blank space); Notepad and this forum cannot. So how can I find that character and replace it using StringReplace? So far my code works on text from Notepad. Is there another way?

 

 


Not sure I fully understand your issue, but I believe the problem is that the target you are pasting into uses UTF-8 encoding, where any character above Chr(127) requires at least two bytes.  You need to set up the receiving side as ANSI to see the Chr(141).
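
To see the difference between the two encodings, the same character can be converted to bytes both ways. This is a small sketch that assumes the problem character is code point 141 (0x8D), as found earlier in the thread; the exact ANSI result depends on the system code page.

#include <StringConstants.au3>

; Assumption for illustration: the problem character is code point 141 (0x8D).
Local $sChar = ChrW(141)

; Encode the same character as ANSI and as UTF-8 and print the raw bytes.
ConsoleWrite("ANSI : " & StringToBinary($sChar, $SB_ANSI) & @CRLF)
ConsoleWrite("UTF-8: " & StringToBinary($sChar, $SB_UTF8) & @CRLF)

In UTF-8 the code point U+008D is stored as two bytes, so a search for the single byte 0x8D in UTF-8 encoded data would not line up with the character.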


OK, I managed to solve it.
I now replace the characters in memory via the clipboard; I don't use Notepad.

 

#include <MsgBoxConstants.au3>

Global $g_sFind, $g_sReplaceWith, $g_sReplacedString

; Read the clipboard and turn it into its hex representation
Global $g_sClipboard = ClipGet() ; Retrieve text from the clipboard
Global $dBinary = Binary($g_sClipboard)
Global $string = String($dBinary)

; Replace each sequence separately
; First sequence: A0 0A becomes 74 ("t")
$g_sFind = "A00A"
$g_sReplaceWith = "74"
$g_sReplacedString = StringReplace($string, $g_sFind, $g_sReplaceWith)

; Second sequence: 8D becomes 73 ("s")
$g_sFind = "8D"
$g_sReplaceWith = "73"
$g_sReplacedString = StringReplace($g_sReplacedString, $g_sFind, $g_sReplaceWith)

; Third sequence: 94 20 becomes 69 20 ("i" followed by a space)
$g_sFind = "9420"
$g_sReplaceWith = "6920"
$g_sReplacedString = StringReplace($g_sReplacedString, $g_sFind, $g_sReplaceWith)

; Convert the hex representation back to a string
$g_sReplacedString = BinaryToString($g_sReplacedString)

; Display the result and write it back to the clipboard
MsgBox($MB_OK, @ScriptName, $g_sReplacedString) ; Display the new text
ClipPut($g_sReplacedString) ; Write the new text to the clipboard

But I don't know how to make it efficient; I just repeat the same lines again and again.
Is it possible to do it like this?
 

Local $srch = "Ÿ§¨‚Œ¸õΩ•@~µ∫˙√†‰∂ˇî®ßÃĀåḌḤĪïùàḶḸṂṆñìṄṚṜṢŚṬāḍḥīḷḹṃṁṇṅṛṝṣśṭūäÇéüöëò"
Local $repl = "usrSaSns-aamnhntrdtirsnaadhinhmllmnnnnrrsstadhiiimmnnrrsstuasiutnd"

 $content = StringReplace($content, StringMid($srch, $i, 1), StringMid($repl, $i, 1))

That is, I would write all the binary values I want to search for on one line and all the replacement values on another line, something like that.
What confuses me a little is that some of the binary values are 4 characters long and some are only 2.
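
One possible way to avoid repeating the same lines is a small lookup table with one row per replacement. Because each row holds whole strings, it does not matter that some entries are one character long and others two. This is only a sketch: it assumes that the byte values found above (A0 0A, 8D, 94 20) correspond to the Chr() values shown, and it works on the text itself rather than on its hex representation, which also avoids accidental matches that straddle two bytes in the hex string.

#include <StringConstants.au3>

; Each row: [text to find, replacement]. The values are taken from the script
; above; mapping them to Chr() codes is an assumption, not a verified fact.
Local $aMap[3][2] = [ _
        [Chr(0xA0) & Chr(0x0A), "t"], _
        [Chr(0x8D), "s"], _
        [Chr(0x94) & " ", "i "]]

Local $sText = ClipGet() ; Retrieve text from the clipboard

; Apply every row of the table in turn.
For $i = 0 To UBound($aMap) - 1
    $sText = StringReplace($sText, $aMap[$i][0], $aMap[$i][1], 0, $STR_CASESENSE)
Next

ClipPut($sText) ; Write the new text to the clipboard

Adding another replacement then only means adding one row to $aMap and increasing its first dimension.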
