Extracting text out of .doc file

trancexx · September 17, 2008

Extracting text out of a Word doc file is easily done if you have MS Word installed. This is not about that.

This is about reading and analysing the structure of a doc file. If someone is interested in the how and what, click here (the location might change in time, and... it's generally not advisable to read that much).

Extraction is very fast considering AutoIt's limitations.

Here's the script:

Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
If @error Then Exit

Dim $TxT = DOCtoTXT($file)

If Not @error Then
    MsgBox(0, "Extracted text", $TxT)
Else
    MsgBox(16, "Error", "Error reading file: error " & @error)
EndIf


Func DOCtoTXT($docfile)
    
    Local $extension = StringSplit($docfile, ".", 1)
    $extension = $extension[$extension[0]]
    
    Local $hwnd = FileOpen($docfile, 16)
    Local $content = FileRead($hwnd)
    FileClose($hwnd)
    
    Local $contentdoc = BinaryMid($content, 513, 2) ;    0xECA5 - for .doc file of our interest
    
    Select
        Case $extension <> "doc" And $contentdoc <> "0xECA5"
            Return SetError(1) ; not doc file or quasi doc file with wrong extension
        Case $extension <> "doc" And $contentdoc = "0xECA5"
            Return SetError(2) ; extension incorrect, header indicates doc file
        Case $extension = "doc" And $contentdoc <> "0xECA5"
            Return SetError(3) ; extension incorrect or quasi doc file (extracting code required)
    EndSelect
    
    Local $complex_bin = BinaryMid($content, 523, 2) ; little endian
    Local $complex
    For $a = 1 To 2
        ; little endian -> big endian
        $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
    Next
    $complex = Dec($complex)
    If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4) ; complex doc file (extracting code required)
    
    Local $start_bin = BinaryMid($content, 537, 4) ; little endian  
    Local $start
    For $i = 1 To 4
        ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start) ; text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4) ; little endian
    Local $end
    For $i = 1 To 4
        ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end) ; text ends here

    If $start > $end Then Return SetError(5) ; corrupted header
    
    Local $content1 = BinaryMid($content, 513 + $start, $end - $start)
    
    Local $text
    
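    ; drop the null bytes
    $text = StringReplace(BinaryToString($content1), Chr(0), "")
    ; turn the 0x14 that follows a 0x13 into 0x15, so the field code ends there and the visible text stays outside
    $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
    ; drop everything from 0x13 to the next 0x15, inclusive
    $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
    ; drop whatever is neither whitespace nor printable
    $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")
    ; every vertical whitespace character becomes a line break
    $text = StringRegExpReplace($text, "\v", @CRLF)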

    Return $text

EndFunc

If you look closer you will see that the byte order of some of the data we use is little endian, and that it is converted to big endian inside a loop... that is only a suggestion (reading backwards is another solution, or maybe shifting).
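
For comparison, the same header reads can be sketched in Python (used here only for illustration, not AutoIt, and not the script's own method). It assumes the positions the script uses: flags at 1-based position 523, text start at 537 and text end at 541, which are 0-based offsets 10, 24 and 28 past byte 512. The function name and the synthetic header bytes are made up for the demonstration.

```python
import struct

HEADER_BASE = 512  # the script reads the header from 1-based position 513


def read_header_fields(content: bytes):
    """Return (magic, is_complex, start, end) read from raw .doc bytes."""
    # "<H" is a little-endian unsigned 16-bit value, "<I" a 32-bit one,
    # so no manual byte reversal is needed.
    magic, = struct.unpack_from("<H", content, HEADER_BASE)
    flags, = struct.unpack_from("<H", content, HEADER_BASE + 10)
    start, end = struct.unpack_from("<II", content, HEADER_BASE + 24)
    # Bit 2 of the flags, the same test as Mod(Floor($complex / 4), 2)
    is_complex = bool(flags & 0x0004)
    return magic, is_complex, start, end


# Synthetic header for demonstration only (not a real Word file)
fake = bytearray(HEADER_BASE + 32)
fake[HEADER_BASE:HEADER_BASE + 2] = b"\xEC\xA5"
struct.pack_into("<H", fake, HEADER_BASE + 10, 0x0004)
struct.pack_into("<II", fake, HEADER_BASE + 24, 0x600, 0x6A0)
print(read_header_fields(bytes(fake)))
```

Note that the bytes EC A5 read as a little-endian 16-bit number give 0xA5EC, which is why the script compares the raw binary "0xECA5" instead of a converted number.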

Extracted text is "contaminated", so a series of replacements is needed to get us only the text. After that the function returns.

edit:

Updated 2nd January 2009

-further improved the 'replacements' part of the script (this also led to a speed improvement)

(previously updated on GEOSoft's suggestion)

Edited January 2, 2009 by trancexx

monoceres · September 17, 2008

Really good work

Could prove useful in the future >_<

KaFu · September 17, 2008

Nice UDF trancexx, I think I've got a use for this too.

Best Regards

Andreik · September 17, 2008

Really nice and useful script.

torels · September 17, 2008

this is wonderful

just today my Italian teacher (I live in Italy...) gave me tons of files written in doc format which include many "musts" of Italian literature from 1600 up to today, and I was thinking of a way to take all the text out and put it in a txt... and maybe then group everything in an exe... to manage the files >_<

thanks for this udf

keep up the good work

Szhlopp · September 17, 2008

Nice!

Now all you need to do is make it work with .docx files :idiot: (DocX = MS Office Word 2007)

I just tested it and it errors out on me with an error of 1.

Overall nice work!

Golbez · October 15, 2008

<3

ty so much this is a great code!!

YourSpace · October 16, 2008

i keep getting error 3...

trancexx · October 16, 2008

That is documented.

Your file is not a "true" doc. Try saving it as a doc (with a different name or whatever) after you open it with Word. Or find a way to extract text from that kind of file.

There is a flaw in the DOCtoTXT() function, now that I look at it.

Correct replacements should be (StringRegExpReplace() part):

("..." represents anything)

...Chr(19)...Chr(20)...Chr(21)... --> drop Chr(19)...Chr(20) part, leave the rest

...Chr(19)...Chr(20)... --> drop Chr(19)...Chr(20) part but not if Chr(21) in between, leave the rest

...Chr(19)...Chr(21)... --> drop Chr(19)...Chr(21) part but not if Chr(20) in between, leave the rest

...Chr(8)...Chr(25)... --> drop Chr(8)...Chr(25) part, leave the rest

Any idea?

ptrex · October 16, 2008

@trancexx

Really nice and fast indeed!!

Regards

ptrex

BrettF · October 16, 2008

Hi,

This is very nice work!

Any chance of TXT to DOC?

That would be great!

Cheers,

Brett

GEOSoft · December 22, 2008

This is very good work.

I do have a suggestion though. Add the following line at the bottom of your StringReplace calls (the last item before you return $text):

$text = StringReplace($text, Chr(11), @CRLF)

If you put it in any sooner you will double space the document lines.

trancexx · December 24, 2008

Thanks

How about this to be all of the replacements:

$text = BinaryToString($content1)
$text = StringReplace(StringReplace($text, Chr(0), ""), Chr(12), "")
$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), "")
$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), "")
$text = StringRegExpReplace($text, "[^[:space:][:print:]]", "")
$text = StringReplace(StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF), Chr(11), @CRLF)

That wouldn't solve the main problem though. And that one is posted a few posts above yours.

When I wrote that code I knew virtually nothing about AutoIt structure functions. Since that has changed lately, the code in the first post looks kind of "primitive" (...I read one of your posts mentioning lost eras). Besides that, it can be done with less memory usage if we read the header first and later read the text at the calculated offset. That is good when there are lots of pictures inside the doc file, for example.

All in all, this could be done in a more advanced manner.

I know (yes I do) that you are very skilled with regexp; please find a way to remove the garbage properly from the output (this post describes the problem).
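
The "read the header first, then read the text at the calculated offset" idea can be sketched as follows. This is a minimal Python illustration (not AutoIt, and not code from this thread); it assumes the same offsets the first post uses, with the text start and end stored as two little-endian 32-bit values 24 bytes past byte 512. The function name and the in-memory test data are hypothetical.

```python
import io
import struct

HEADER_BASE = 512  # same base the script uses (1-based position 513)


def read_text_bytes(f):
    """Read two header fields, then only the text range, from a binary file object."""
    f.seek(HEADER_BASE + 24)
    start, end = struct.unpack("<II", f.read(8))
    if start > end:
        raise ValueError("corrupted header")
    # Jump straight to the text; pictures and other data are never loaded
    f.seek(HEADER_BASE + start)
    return f.read(end - start)


# Synthetic data for demonstration only (not a real Word file)
buf = bytearray(HEADER_BASE + 0x40) + b"hello" + bytes(1000)
struct.pack_into("<II", buf, HEADER_BASE + 24, 0x40, 0x45)
print(read_text_bytes(io.BytesIO(bytes(buf))))
```

With a real file the same function would be called on `open(path, "rb")`; only 8 header bytes plus the text range are read, however large the file is.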

GEOSoft · December 24, 2008

I think I tried that code change and I think for some reason it gave a problem so I settled for just adding the extra line. I'll look at it again. If you want I can send you the file I used for testing.

As for the other problem you are having, I'll look at that as well but it won't be for a couple of days. I have a hunch that Chr(11) may have been part of that "garbage". The big trick is in just finding out what the extra garbage is, then it's easy to handle.

Edit: Try changing that line

$text = StringReplace(StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF), Chr(11), @CRLF)

To

$text = StringRegExpReplace($text,"\r\n|\r|\n|\x0b", @CRLF)

I have not tested it but it looks about right. \x0b is the Hex for Chr(11). You could also use the Octal as \013.

The full list of ASCII octal codes is in my On-line Help >> Appendix>>ASCII Characters page. The link to my online help is in my sig.

EDIT 2: Stupid, stupid, stupid. Chr(11) is a vertical tab, and I think that when you are converting to text you will want to replace other vertical characters as well, things like form feeds. If that's the case then change what I just gave you to:

$text = StringRegExpReplace($text,"\r\n|\r|\n|\v", @CRLF)

Edited December 24, 2008 by GEOSoft

GEOSoft · December 24, 2008

That is documented.
Your file is not a "true" doc. Try saving it as a doc (with a different name or whatever) after you open it with Word. Or find a way to extract text from that kind of file.

There is a flaw in the DOCtoTXT() function, now that I look at it.

Correct replacements should be (StringRegExpReplace() part):
("..." represents anything)

...Chr(19)...Chr(20)...Chr(21)... --> drop Chr(19)...Chr(20) part, leave the rest
...Chr(19)...Chr(20)... --> drop Chr(19)...Chr(20) part but not if Chr(21) in between, leave the rest
...Chr(19)...Chr(21)... --> drop Chr(19)...Chr(21) part but not if Chr(20) in between, leave the rest
...Chr(8)...Chr(25)... --> drop Chr(8)...Chr(25) part, leave the rest

Any idea?

Can you send me a file or files that show these problems?

The last one looks like it should just be

$Text = StringRegExpReplace($Text, "\x08|\x19", "") ; remove all Chr(8) and Chr(25)

or

$Text = StringRegExpReplace($Text, "\x08(.*)\x19", "$1") ; only matches if there are characters between Chr(8) and Chr(25)

The last one may need a bit of work depending on whether or not they will be on the same line.

Edited December 24, 2008 by GEOSoft

trancexx · December 24, 2008

I can create a "problematic" file to show you the problem (will update this post with an attachment).

The problem with doc files is this...

Hyperlinks, references and similar stuff are stored between 0x13 and 0x14, and the corresponding text is between 0x14 and 0x15. Like this:

0x13www.google.com0x14visit me0x15

Inside doc file visible part is only what is between 0x14 and 0x15:

visit me

So everything between 0x13 and 0x14 must be dropped (that is not visible text).

The problem is that some other stuff is stored directly between 0x13 and 0x15. In these cases there is no occurrence of 0x14. All that needs to be done is to remove 0x13 and 0x15 and everything in between. This collides with 0x13...0x14...0x15 (the Google link, for example).

I will attach a doc file that literally demonstrates the problem.

When you get this as the extracted text, we've got it:

The basic syntax classes used in the grammar are:
<int32> is either a decimal number or "0x" followed by a hexadecimal number, and must be represented in 32 bits.
<int64> is either a decimal number or "0x" followed by a hexadecimal number, and must be represented in 64 bits.
<hexbyte> is a hexadecimal number that fits into one byte.

A command line debugger is included with the tools. It is called cordbg. A description and the usage of the debugger can be found in the cordbg documentation.
The source code of the command line debugger is included with .NET SDK package.

edit:

I hope I was clear enough.

Edited December 24, 2008 by trancexx

GEOSoft · December 24, 2008

Okay, I'll work on it but in the meantime replace the line

$text = StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF); Unix and Mac

with

$text = StringRegExpReplace($text, "\v", @CRLF)

Then you can forget about the Chr(11) thing too. This is tested and replaces all vertical whitespace, including @CRLF, @CR, @LF and Chr(11).
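
One thing to watch, as far as I know: in PCRE "\v" matches a single vertical whitespace character, so an existing CR+LF pair would be replaced twice and come out as two line breaks. Text taken from a doc file normally uses a lone CR for paragraph ends, so this may never show up in practice. The sketch below is Python (not AutoIt, and Python's regex engine has no "\v" class, so the characters are listed explicitly) and shows the ordering idea: try the pair first, then the single characters. The function name is made up.

```python
import re

# CRLF is listed first so it is consumed as one unit, not as CR followed by LF.
# \x0b is the vertical tab (Chr(11)), \x0c the form feed (Chr(12)).
_LINE_BREAK = re.compile(r"\r\n|[\r\n\x0b\x0c]")


def normalize_breaks(text: str) -> str:
    """Turn every kind of line break into a single CRLF."""
    return _LINE_BREAK.sub("\r\n", text)


print(repr(normalize_breaks("a\rb\nc\r\nd\x0be")))
```

The same ordering is what the earlier "\r\n|\r|\n|\x0b" pattern in this thread does.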

trancexx · December 24, 2008

Will do.

Just a little more about replacements.

That attached file looks like this before replacements:

The basic syntax classes0x13 XE "Syntax:Syntax classes" 0x15 used in the grammar are:
<int320x13 XE "int32" 0x15> is either a decimal number or 0x followed by a hexadecimal number, and must be represented in 32 bits.
<int640x13 XE "int64" 0x15> is either a decimal number or 0x followed by a hexadecimal number, and must be represented in 64 bits.
<hexbyte0x13 XE "hexbyte" 0x15> is a hexadecimal number that fits into one byte.
A command line debugger0x13 XE "Debugger" 0x15 is included with the tools. It is called cordbg0x13 XE "cordbg" 0x15. A description and the usage of the debugger can be found in the 0x13HYPERLINK "cordbgShell.doc"0x14cordbg documentation0x15 .
The source code of the command line debugger is included with .NET SDK package

I cannot make it more obvious
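
The two colliding cases (0x13...0x15 with nothing to keep, and 0x13...0x14...0x15 where the part after 0x14 is the visible text) can be handled by one pattern with an optional separator part. The following is a hedged Python sketch (not AutoIt, and not the solution adopted in this thread): the function name is hypothetical, the sample is a shortened version of the line shown above, and the loop for fields nested inside other fields is an assumption about how such files may look.

```python
import re

# 0x13 starts a field, 0x14 separates code from visible text, 0x15 ends it.
# The separator part is optional; group 1 is the visible text when present.
_FIELD = re.compile(r"\x13[^\x13\x14\x15]*(?:\x14([^\x13\x14\x15]*))?\x15")


def strip_fields(text: str) -> str:
    """Drop field codes, keeping only the visible text of each field."""
    previous = None
    # Repeat so that a field nested inside another field is resolved first
    while previous != text:
        previous = text
        text = _FIELD.sub(lambda m: m.group(1) or "", text)
    return text


sample = ('It is called cordbg\x13 XE "cordbg" \x15. See the '
          '\x13HYPERLINK "cordbgShell.doc"\x14cordbg documentation\x15 .')
print(strip_fields(sample))
```

Because the character classes exclude all three markers, a match can never run from one field into the next, which is the collision described above.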

Confuzzled · December 24, 2008

Your file is not a "true" doc. Try saving it as a doc (with a different name or whatever) after you open it with Word. Or find a way to extract text from that kind of file.
Any idea?

Not trying to rain on your parade, but I thought the entire purpose of the exercise was not to use Word. If you have Word, why not use the built-in save as text function, which has already been extensively tested and performs what you need?

Does your script cater for the various flavors of document structure that are out there? Word 97, 2000, XP, 2007, etc? All of them, or just the ones you have available?

Suggestion: As well as the Microsoft documentation, have a look at the comments and algorithms in some of the OpenOffice open source code for ideas of how they solve this problem.

Yours is truly a noble coding challenge, worthy of admiration.

Seasons Greetings.

GEOSoft · December 24, 2008

Give this a try and pay attention to my comments in the comment block.

Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
If @error Then Exit

Dim $TxT = DOCtoTXT($file)

If Not @error Then
    MsgBox(0, "Extracted text", $TxT)
Else
    MsgBox(16, "Error", "Error reading file: error " & @error)
EndIf


Func DOCtoTXT($docfile)
    
    Local $extension = StringSplit($docfile, ".", 1)
    $extension = $extension[$extension[0]]
    
    Local $hwnd = FileOpen($docfile, 16)
    Local $content = FileRead($hwnd)
    FileClose($hwnd)
    
    Local $contentdoc = BinaryMid($content, 513, 2);    0xECA5 - for .doc file of our interest
    
    Select
        Case $extension <> "doc" And $contentdoc <> "0xECA5"
            Return SetError(1); not doc file or quasi doc file with wrong extension
        Case $extension <> "doc" And $contentdoc = "0xECA5"
            Return SetError(2); extension incorrect, header indicates doc file
        Case $extension = "doc" And $contentdoc <> "0xECA5"
            Return SetError(3); extension incorrect or quasi doc file (extracting code required)
    EndSelect
    
    Local $complex_bin = BinaryMid($content, 523, 2); little endian
    Local $complex
    For $a = 1 To 2
      ; little endian -> big endian
        $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
    Next
    $complex = Dec($complex)
    If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4); complex doc file (extracting code required)
    
    Local $start_bin = BinaryMid($content, 537, 4); little endian  
    Local $start
    For $i = 1 To 4
      ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start); text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4); little endian
    Local $end
    For $i = 1 To 4
      ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end); text ends here

    If $start > $end Then Return SetError(5); corrupted header
    
    Local $content1 = BinaryMid($content, 513 + $start, $end - $start)
    
    Local $text = BinaryToString($content1)
    $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(1), "")
    $text = StringReplace(StringReplace($text, Chr(2), ""), Chr(3), "")
    $text = StringReplace(StringReplace($text, Chr(4), ""), Chr(5), "")
    $text = StringReplace(StringReplace($text, Chr(6), ""), Chr(7), "")
    Local $sHold
    Local $aRep = StringRegExp($text, "\x13.+?\x15", 3)

    If Not @error Then
        For $i = 0 To UBound($aRep) - 1
            $sHold = $aRep[$i]
            If StringInStr($sHold, Chr(20)) Then
                $sHold = StringRegExpReplace($sHold, "\x13.+?\x14(.*?)\x15(.*)", "$1$2")
            Else
                $sHold = ""
            EndIf
            $text = StringReplace($text, $aRep[$i], $sHold)
        Next
    EndIf
    #cs
  ;; Not sure what all you are attempting to do here so I just commented it out
  ;; I got the right results from your test file and a couple of mine without it.
  ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), ""); dropping everything between "0x13" and "0x14", inclusive
  ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), ""); same with "0x08" and "0x19" - couldn't find any docummentation on this (this is my impression)
  ;$text = StringReplace($text, Chr(21), "")
    #ce
    $text = StringRegExpReplace($text, "\v", @CRLF)
      
    Return $text

EndFunc

Edit: Major fix. It worked on some and not on others. Should be better now. Be sure to test it on a wide range of files. I'm a bit concerned about what might happen if the .doc file contains a table.

Edited December 25, 2008 by GEOSoft
