
Method for extracting text from RTF file


I've worked with RTF files and rich edit controls for years. But recently I needed to simply extract the text from an RTF file (i.e., without any of the formatting).

I found a few suggested methods, but none were as simple as what I would like.

A short investigation revealed that although all the necessary pieces are in the standard Au3 function set, there is no single ready-made "extract text" function.

What I'm providing below is a working tool you can use to track down and tune any subtle aspects of whatever processing you need.

THE IMPORTANT THING TO KNOW IS THIS: the file's ENCODING is key to everything.

If you're certain of the file's encoding, then just specify it in your _GUICtrlRichEdit_StreamFromFile() call.

If you don't know it, use FileGetEncoding() and use the result in your StreamFrom call.

BUT HERE'S THE CAVEAT: determining a file's encoding is tricky. A wide range of programs write RTFs, and the specification(s) for a file's encoding can be rather loosely implemented. As a result, there's a note for FileGetEncoding() that it will return Binary (code = 16) if the encoding isn't clear. But you can never specify Binary in your StreamFrom call, or you will get gibberish.

A fairly reliable "rule" is that if you don't get a clear encoding indication (like UTF-8 or UTF-16), then it's pretty safe to assume ANSI for an RTF file on a Windows PC. So replace any code=16 with code=512.
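
The fallback rule above can be sketched as a small helper. This is only an illustration: the function name _ResolveRtfEncoding is hypothetical (it is not part of the standard UDFs), and it uses the named constants $FO_BINARY (16) and $FO_ANSI (512) from FileConstants.au3 in place of the raw numbers.

```autoit
#include <FileConstants.au3>

; Hypothetical helper, not a standard Au3 function.
; Returns an encoding value suitable for the StreamFrom call,
; or -1 with @error = 1 if FileGetEncoding() itself failed.
Func _ResolveRtfEncoding($sPath)
    Local $iEncoding = FileGetEncoding($sPath)
    If $iEncoding = -1 Then Return SetError(1, 0, -1)        ; file could not be examined
    If $iEncoding = $FO_BINARY Then $iEncoding = $FO_ANSI    ; unclear result: assume ANSI
    Return $iEncoding
EndFunc   ;==>_ResolveRtfEncoding
```

Checking for -1 separately keeps a missing or unreadable file from being silently treated as ANSI.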

Feel free to suggest alternatives ... or ways to make the process more robust.

For anyone who's interested, I found this related discussion on StackExchange: link

#include <GUIConstantsEx.au3>
#include <GuiRichEdit.au3>
#include <WindowsConstants.au3>
#include <WinAPISysWin.au3>
#include <String.au3>
#include <Array.au3>
#include <File.au3>

Global $watch = "C:\path\to\file.rtf"                 ; path to RTF file

$hGui = GUICreate("Extract text from RTF", 660, 320, -1, -1)
$lblMask = GUICtrlCreateLabel("", 10, 10, 300, 220)
$hRichEdit = _GUICtrlRichEdit_Create($hGui, "This is a test.", 10, 20, 300, 220, BitOR($ES_MULTILINE, $WS_VSCROLL))
$normal = GUICtrlCreateEdit("initial text", 330, 20, 320, 240)
$cButton = GUICtrlCreateButton("Process the file", 80, 270, 180, 30)
$eButton = GUICtrlCreateButton("Examine first 500", 400, 270, 180, 30)
GUICtrlSetState($cButton, $GUI_FOCUS)
GUISetState(@SW_SHOW)                                 ; make the window visible

While True
    $iMsg = GUIGetMsg()
    Select
        Case $iMsg = $GUI_EVENT_CLOSE
            _GUICtrlRichEdit_Destroy($hRichEdit) ; needed unless script crashes
            Exit
        Case $iMsg = $cButton
            ; detect the encoding; treat Binary (16) as ANSI (512)
            $encoding = FileGetEncoding($watch)
            If $encoding = 16 Then $encoding = 512
;            MsgBox(0, "Encoding is ", $encoding)
            ; load the RTF into the rich edit, then copy its plain text to the normal edit
            _GUICtrlRichEdit_StreamFromFile($hRichEdit, $watch, $encoding)
            GUICtrlSetData($normal, _GUICtrlRichEdit_GetText($hRichEdit, True))
            ConsoleWrite("Processed" & @CRLF)
        Case $iMsg = $eButton
            ; show the first 500 characters as text and as hex
            $readText = StringLeft(GUICtrlRead($normal), 500)
            MsgBox(0, "2: ", $readText & @CRLF & _StringRepeat("-", 80) & @CRLF & _StringToHex($readText))
    EndSelect
WEnd
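
If you only want the text and not the test GUI, the same steps can be wrapped into one call. This is a minimal sketch, not a tested library routine: the name _RtfExtractText is hypothetical, and it assumes a rich edit control on a GUI that is never shown behaves the same as a visible one.

```autoit
#include <GuiRichEdit.au3>
#include <FileConstants.au3>

; Hypothetical wrapper: returns the plain text of an RTF file.
; On failure returns "" and sets @error (1 = encoding check failed, 2 = streaming failed).
Func _RtfExtractText($sPath)
    Local $iEncoding = FileGetEncoding($sPath)
    If $iEncoding = -1 Then Return SetError(1, 0, "")
    If $iEncoding = $FO_BINARY Then $iEncoding = $FO_ANSI    ; same fallback rule as above

    Local $hHidden = GUICreate("", 100, 100)                 ; created but never shown
    Local $hRich = _GUICtrlRichEdit_Create($hHidden, "", 0, 0, 100, 100)

    _GUICtrlRichEdit_StreamFromFile($hRich, $sPath, $iEncoding)
    Local $iErr = @error                                     ; save before the next call resets it
    Local $sText = _GUICtrlRichEdit_GetText($hRich, True)    ; True = convert CR to CRLF

    _GUICtrlRichEdit_Destroy($hRich)
    GUIDelete($hHidden)

    If $iErr Then Return SetError(2, $iErr, "")
    Return $sText
EndFunc   ;==>_RtfExtractText
```

A call would then look like `$sText = _RtfExtractText($watch)`, followed by a check of @error before using the result.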

