Split an RTF at Delimiter


Solved by ioa747

I am a bit of a novice with AutoIt, but I have exhaustively searched the libraries and threads here without finding anything similar to what I am trying to do.

I would like to split an RTF at a specific line in the file, which contains the text "Official". So, if there are 2 pages and 1 mention of "Official", I would like 2 RTFs.

I am already successfully using GUI streaming to open an RTF, determine encoding, and output the plain text lines to an array. However, I would like to split the RTF into multiple files before extracting the plain text.

Could someone provide a very high-level example so that I have somewhere to start?

Thanks a ton!


Thank you for your time, ioa. I took a look at that thread and the FriendlyHide function, which gave me some ideas. I would be grateful for continued assistance.

I am outputting the start of the document through the 2nd occurrence of the delimiter like this:

_GUICtrlRichEdit_StreamFromFile($hRichEdit, @ScriptDir & "\test.RTF")

$delimeter = "Official:"
$delimLen = StringLen($delimeter)
$occurrence = 1

; find 2nd occurrence of delimeter
$finding = StringInStr(_GUICtrlRichEdit_GetText($hRichEdit), $delimeter, 0, $occurrence + 1)

; look for multiple found in this rtf
If $finding > 0 Then
    _GUICtrlRichEdit_SetSel($hRichEdit, 0, $finding)
EndIf

_GUICtrlRichEdit_StreamToFile($hRichEdit, @ScriptDir & "\new.RTF")

However, the output stops about 200 characters before the 2nd occurrence instead of at the 2nd occurrence. Any ideas?


I tried it and it works. The only changes I made are:

_GUICtrlRichEdit_StreamFromFile($hRichEdit, @ScriptDir & "\test.RTF", $FO_UTF8_NOBOM)

_GUICtrlRichEdit_SetSel($hRichEdit, 0, $finding - 1)

 

Edit:
Maybe it's caused by the .rtf itself. Can you upload an example .rtf?

Edited by ioa747

I know that I know nothing


I tried $FO_UTF8_NOBOM and got the same result. I wonder if the missing characters may be due to these first couple of lines?

$hGui = GUICreate("Extract text from RTF", -1, -1)
$hRichEdit = _GUICtrlRichEdit_Create($hGui, "", -1, -1)
_GUICtrlRichEdit_StreamFromFile($hRichEdit, @ScriptDir & "\test.RTF", $FO_UTF8_NOBOM)

I can share an example RTF, but it may take me a bit. Until then, I thought something might jump out as the issue after looking at these.


I've attached a sample RTF here:

sample.rtf

I've modified my code a great deal and am happy with the results except for the mentioned issue of output cutting off about 200 characters short for each iteration. It also cuts off the right margin. I'm not super worried about that but if this thread results in a solution to that problem as well that would be fantastic.

Edited by rmock

  • Solution

With sample.rtf, I get the same behavior as you: the output ends about 200 characters before the 2nd occurrence.

I only succeeded by treating the file as plain text:

; https://www.autoitscript.com/forum/topic/211403-split-an-rtf-at-delimiter
#AutoIt3Wrapper_Au3Check_Parameters=-d -w 1 -w 2 -w 3 -w 4 -w 5 -w 6 -w 7

#include <MsgBoxConstants.au3>
#include <FileConstants.au3>

SplitIt(@ScriptDir & "\sample.rtf")

;----------------------------------------------------------------------------------------
Func SplitIt($sFilePath)

    ; Open the file for reading and store the handle to a variable.
    Local $hFileOpen = FileOpen($sFilePath, $FO_READ)
    If $hFileOpen = -1 Then
        MsgBox($MB_SYSTEMMODAL, "", "An error occurred when reading the file.")
        Return False
    EndIf

    ; Read the contents of the file using the handle returned by FileOpen.
    Local $sFileTxt = FileRead($hFileOpen)
    ; Close the handle returned by FileOpen.
    FileClose($hFileOpen)

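    Local $sDelim, $aSplit, $sNewRtf

    ; The raw RTF group that renders the formatted 'Official:' text (copied from sample.rtf)
    $sDelim = "{\rtlch\fcs1 \ab\af0\afs20 \ltrch\fcs0 \b\f0\fs20\kerning0\insrsid1062128 \hich\af0\dbch\af31505\loch\f0 Official\hich\af0\dbch\af31505\loch\f0 : }"
    ; Flag 1 = treat the whole delimiter string as a single separator
    $aSplit = StringSplit($sFileTxt, $sDelim, 1)

    ; Two occurrences (three pieces) are needed to cut out the first section
    If $aSplit[0] < 3 Then Return False
    $sNewRtf = $aSplit[1] & $sDelim & $aSplit[2]

    ; Find the last page break, searching from the right
    Local $iPageBr = StringInStr($sNewRtf, "\page", 0, -1) - 1

    ; Remove the trailing page break (if any) and close the open RTF groups
    If $iPageBr > 0 Then $sNewRtf = StringLeft($sNewRtf, $iPageBr)
    $sNewRtf &= "}}"

    ; Save the first section as a new .rtf next to the source file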
    Local $sNewFilePath = StringTrimRight($sFilePath, 4)  & "_NEW.rtf"

    If FileExists($sNewFilePath) Then FileDelete($sNewFilePath)

    FileWrite($sNewFilePath, $sNewRtf)

EndFunc   ;==>SplitIt

https://www.arcdev.hu/manuals/standard/rtf/rtfspeci.pdf

 

Edited by ioa747
fixed it.


When I try to open sample_NEW.rtf I get: "Word was unable to read this document. It may be corrupt."

I'll continue to play around with your script to see if I can get a different result.


Beautiful! Is there documentation somewhere to explain the syntax used in the $sDelim variable? I would like to change the delimiter to include a space and a 2nd word.
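For readers with the same question: the delimiter is raw RTF. It is a group in braces that holds control words (a backslash, a name, and an optional number, such as \b for bold, \f0 for font 0, or \fs20 for a font size in half-points) followed by the visible text. Values such as \insrsid1062128 are written by Word and differ between documents, so the string has to be copied from the actual file. Below is a minimal sketch of one way to look the group up instead of copying it by hand. FindRtfGroup is a hypothetical helper (not part of any UDF), the sample string is made up, and the sketch assumes the visible word sits in a single group with no nested braces, as in the delimiter above.

```autoit
; Hypothetical helper: return the first {...} group that contains $sWord,
; or "" if there is none. Assumes the group holds no nested braces.
Func FindRtfGroup($sRtf, $sWord)
    ; \{ and \} match literal braces; [^{}]* matches any run without braces.
    ; \Q...\E makes the regex engine treat $sWord literally.
    Local $aMatch = StringRegExp($sRtf, "\{[^{}]*\Q" & $sWord & "\E[^{}]*\}", 1)
    If @error Then Return ""
    Return $aMatch[0]
EndFunc   ;==>FindRtfGroup

; Shortened, made-up stand-in for the contents of an .rtf file
Local $sRtf = "{\rtf1{\b\f0\fs20 Name: }{\f0 Smith\par}{\b\f0\fs20 Official Transcript: }{\f0 text}}"

; Print the group so it can be pasted into $sDelim
ConsoleWrite(FindRtfGroup($sRtf, "Official") & @CRLF)
```

If the word is split across several groups or runs (Word often does this), the pattern will not match, and the delimiter still has to be taken from the file in a text editor.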


Thanks, ioa747! I was able to create a new delimiter using this method. However, when I attempt to create a new RTF for each occurrence of the delimiter, I again get the "Word was unable to read this document. It may be corrupt." error. I imagine it has to do with how I am using the page breaks within the loop.

; open rtf in Notepad++ and find formatting for delimiter line
    $sDelim = "{\rtlch\fcs1 \ab\af0\afs20 \ltrch\fcs0 \b\f0\fs20\kerning0\insrsid1062128 \hich\af0\dbch\af31505\loch\f0 Official Transcript: }"
    $aSplit = StringSplit($sFileTxt, $sDelim, 1)

    $loop = 1

    For $i = 0 To UBound($aSplit) - 1

        If $i <> 0 Then

            If UBound($aSplit) > 2 Then
                $sNewRtf = $aSplit[$i]
            EndIf

            ; find last page break
            Global $iPageBr = StringInStr($sNewRtf, "\page", 0, -1) -1

            ; remove page break and close
            $sNewRtf = StringLeft($sNewRtf, $iPageBr) & "}}"

            ; save  $StringA and $StringB as rtf
            Local $sNewFilePath = StringTrimRight($sFilePath, 4)  & "_NEW_" & $loop & ".rtf"

            If FileExists($sNewFilePath) Then FileDelete($sNewFilePath)

            FileWrite($sNewFilePath, $sNewRtf)
        EndIf

        $loop += 1
    Next

 

Edited by rmock

Each piece needs a header.

Edit:
So you need to split it into:
<header> + <official1>
<header> + <official2>

Edit:
Open the file with WordPad (not Word), because WordPad writes simpler formatting.
Select the section you want to separate and paste it into a new WordPad document.
Then open that new file as text to see the formatting, the point at which the document is divided,
and what WordPad uses for a header.
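The header-plus-section idea above can be sketched on an in-memory string. This is only an illustration under stated assumptions: BuildPieces is a hypothetical helper, the RTF string is a made-up document in the simplified WordPad style, and the closing string passed as $sClose depends on how many groups are still open at the cut point in the real file (one brace here, two in the Word-generated sample earlier in the thread).

```autoit
; Hypothetical helper: return an array of standalone RTF strings, one per
; section that starts with $sDelim. Element [0] holds the number of pieces.
Func BuildPieces($sRtf, $sDelim, $sClose = "}")
    ; Flag 1 = use the whole delimiter string as one separator.
    ; $aParts[0] is the count, $aParts[1] the header, the rest are sections.
    Local $aParts = StringSplit($sRtf, $sDelim, 1)
    Local $aPieces[$aParts[0]]
    $aPieces[0] = $aParts[0] - 1
    For $i = 2 To $aParts[0]
        ; Every piece starts with the shared header
        Local $sPiece = $aParts[1] & $sDelim & $aParts[$i]
        ; Cut just before the last page break, then close the open group(s)
        Local $iPage = StringInStr($sPiece, "\page", 0, -1)
        If $iPage > 0 Then $sPiece = StringLeft($sPiece, $iPage - 1) & $sClose
        $aPieces[$i - 1] = $sPiece
    Next
    Return $aPieces
EndFunc   ;==>BuildPieces

; Made-up document: one header, two sections separated by a page break
Local $sDelim = "\pard\b Official: "
Local $sRtf = "{\rtf1\ansi{\fonttbl{\f0 Arial;}}" & $sDelim & "\b0 first\par\page " & $sDelim & "\b0 second\par}"

Local $aPieces = BuildPieces($sRtf, $sDelim)
For $i = 1 To $aPieces[0]
    ConsoleWrite($aPieces[$i] & @CRLF)
Next
```

The last section contains no page break, so it is left unchanged and keeps the document's own closing brace.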

 

Edited by ioa747


Thank you for your diligent assistance, ioa747! I am over the hump with this script. Here is what is currently working for me:

; split RTF
;------------------------------------------------------------
#include <FileConstants.au3>   ; $FO_READ
#include <MsgBoxConstants.au3> ; $MB_SYSTEMMODAL

SplitIt(@ScriptDir & "\test.rtf")

Func SplitIt($sFilePath)

    ; Open the file for reading and store the handle to a variable
    Local $hFileOpen = FileOpen($sFilePath, $FO_READ)
    If $hFileOpen = -1 Then
        MsgBox($MB_SYSTEMMODAL, "", "An error occurred when reading the file.")
        Return False
    EndIf

    ; Read the contents of the file using the handle returned by FileOpen
    Local $sFileTxt = FileRead($hFileOpen)
    ; Close the handle returned by FileOpen.
    FileClose($hFileOpen)

    Local $sDelim, $aSplit, $sNewRtf, $iPageBr, $header

    ; Open the RTF in Notepad++ to find the raw formatting of the delimiter line
    $sDelim = "\pard\qc\b Official: "
    ; Flag 1 = treat the whole delimiter string as a single separator
    $aSplit = StringSplit($sFileTxt, $sDelim, 1)
    ; Everything before the first delimiter is the header shared by every piece
    $header = $aSplit[1]

    ; $aSplit[0] holds the piece count; the loop is skipped if the delimiter was not found
    For $i = 2 To $aSplit[0]

        $sNewRtf = $header & $sDelim & $aSplit[$i]

        ; Find the last page break, searching from the right
        $iPageBr = StringInStr($sNewRtf, "\page", 0, -1)

        ; No page break means this is the last piece, which is written unchanged.
        ; Otherwise cut just before "\page" (hence the - 1, so the backslash
        ; is not kept) and close the open groups.
        If $iPageBr > 0 Then
            $sNewRtf = StringLeft($sNewRtf, $iPageBr - 1) & "}}"
        EndIf

        Local $sNewFilePath = StringTrimRight($sFilePath, 4) & "_" & $i & ".rtf"
        If FileExists($sNewFilePath) Then FileDelete($sNewFilePath)
        FileWrite($sNewFilePath, $sNewRtf)

    Next

EndFunc

 
