Extracting text out of .doc file


trancexx
Extracting text from a Word doc file is easily done if you have MS Word installed. This is not about that.

This is about reading and analysing the structure of a doc file. If someone is interested in how, what or something else, click here (the location might change in time, and... it's generally not advisable to read that much :)).

Extraction is very fast considering AutoIt's limitations.

Here's the script:

Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
If @error Then Exit

Dim $TxT = DOCtoTXT($file)

If Not @error Then
    MsgBox(0, "Extracted text", $TxT)
Else
    MsgBox(16, "Error", "Error reading file: error " & @error)
EndIf


Func DOCtoTXT($docfile)
    
    Local $extension = StringSplit($docfile, ".", 1)
    $extension = $extension[$extension[0]]
    
    Local $hwnd = FileOpen($docfile, 16)
    Local $content = FileRead($hwnd)
    FileClose($hwnd)
    
    Local $contentdoc = BinaryMid($content, 513, 2) ;    0xECA5 - for .doc file of our interest
    
    Select
        Case $extension <> "doc" And $contentdoc <> "0xECA5"
            Return SetError(1) ; not doc file or quasi doc file with wrong extension
        Case $extension <> "doc" And $contentdoc = "0xECA5"
            Return SetError(2) ; extension incorrect, header indicates doc file
        Case $extension = "doc" And $contentdoc <> "0xECA5"
            Return SetError(3) ; extension incorrect or quasi doc file (extracting code required)
    EndSelect
    
    Local $complex_bin = BinaryMid($content, 523, 2) ; little endian
    Local $complex
    For $a = 1 To 2
        ; little endian -> big endian
        $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
    Next
    $complex = Dec($complex)
    If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4) ; complex doc file (extracting code required)
    
    Local $start_bin = BinaryMid($content, 537, 4) ; little endian  
    Local $start
    For $i = 1 To 4
        ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start) ; text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4) ; little endian
    Local $end
    For $i = 1 To 4
        ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end) ; text ends here

    If $start > $end Then Return SetError(5) ; corrupted header
    
    Local $content1 = BinaryMid($content, 513 + $start, $end - $start)
    
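    ; drop the null bytes
    Local $text = StringReplace(BinaryToString($content1), Chr(0), "")
    ; turn the 0x14 separator after a 0x13 into 0x15, so the hidden field code sits between 0x13 and 0x15
    $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
    ; drop every 0x13...0x15 span
    $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
    ; drop whatever is neither whitespace nor printable
    $text = StringRegExpReplace($text, "[^[:space:][:print:]]", "")
    ; replace each vertical whitespace character with @CRLF
    $text = StringRegExpReplace($text, "\v", @CRLF)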

    Return $text

EndFunc

If you look closely you will see that some of the data we use is stored little endian and is converted to big endian inside a loop. That is only one way of doing it (reading the bytes backwards is another solution, or maybe shifting).
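The byte-reversing loop can also be avoided with the structure functions. This is only a minimal sketch, assuming the script runs on a little-endian machine (x86/x64 Windows); the helper name _LE_UInt32 is made up for this example and is not part of the script above:

```autoit
; Hypothetical helper: interpret 4 bytes of binary data as a little-endian unsigned 32-bit number.
Func _LE_UInt32($bFourBytes)
    ; copy the raw bytes into a 4-byte buffer...
    Local $tBytes = DllStructCreate("byte[4]")
    DllStructSetData($tBytes, 1, $bFourBytes)
    ; ...and look at the same memory as one dword (a little-endian CPU reads it the way the file stores it)
    Local $tDword = DllStructCreate("dword", DllStructGetPtr($tBytes))
    Return DllStructGetData($tDword, 1)
EndFunc

; Possible usage, with the same offset the script uses for the start of the text:
; $start = _LE_UInt32(BinaryMid($content, 537, 4))
```

The loop with Hex() and Dec() stays the safer choice if you ever need the result to be independent of the machine's own byte order.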

The extracted text is "contaminated", so a series of replacements is needed to leave only the text. After that the function returns.

edit:

Updated 2nd January 2009

- further improved the 'replacements' part of the script (this also led to a speed improvement)

(previously updated on GEOSoft's suggestion)

Edited by trancexx

♡♡♡

.

eMyvnE

this is wonderful :)

just today my Italian teacher (I live in Italy...) gave me tons of files written in doc format which include many "musts" of Italian literature from 1600 up to today, and I was thinking of a way to take all the text out and put it in a txt... and maybe then group everything in an exe... to manage the files >_<

thanks for this udf

keep up the good work

Some Projects:[list][*]ZIP UDF using no external files[*]iPod Music Transfer [*]iTunes UDF - fully integrate iTunes with au3[*]iTunes info (taskbar player hover)[*]Instant Run - run scripts without saving them before :)[*]Get Tube - YouTube Downloader[*]Lyric Finder 2 - Find Lyrics to any of your song[*]DeskBox - A Desktop Extension Tool[/list]indifference will ruin the world, but in the end... WHO CARES :P---------------http://torels.altervista.org

Nice!

Now all you need to do is make it work with .DocX files :idiot: (DocX = MS Office Word 2007)

I just tested it and it errors out on me with an error of 1.

Overall nice work!

  • 4 weeks later...

i keep getting error 3...

That is documented.

Your file is not a "true" doc. Try saving it as a doc (under a different name or whatever) after you open it with Word. Or find a way to extract text from that kind of file.
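To see what kind of file is hiding behind the .doc extension, one option is to look at its first bytes. A rough sketch, not part of DOCtoTXT(); the function name _GuessContainer and its return strings are invented for this example, and only three well-known signatures are checked:

```autoit
; Hypothetical helper: guess the container type from the first bytes of a file.
Func _GuessContainer($sFile)
    Local $hFile = FileOpen($sFile, 16) ; 16 = binary mode
    If $hFile = -1 Then Return SetError(1, 0, "")
    Local $sHex = Hex(FileRead($hFile, 8)) ; first 8 bytes as a hex string
    FileClose($hFile)

    Select
        Case StringLeft($sHex, 16) = "D0CF11E0A1B11AE1"
            Return "ole" ; OLE compound file, the container Word 97-2003 uses
        Case StringLeft($sHex, 10) = "7B5C727466"
            Return "rtf" ; the file starts with {\rtf
        Case StringLeft($sHex, 4) = "504B"
            Return "zip" ; starts with PK, e.g. a docx
        Case Else
            Return "unknown"
    EndSelect
EndFunc
```

Note that "ole" only says the container is right. Other Office formats use the same container, and DOCtoTXT() can still return error 3 if the Word header is not found at byte 513.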

There is a flaw in the DOCtoTXT() function; I'm looking at it now.

Correct replacements should be (StringRegExpReplace() part):

("..." represents anything)

...Chr(19)...Chr(20)...Chr(21)... --> drop Chr(19)...Chr(20) part, leave the rest

...Chr(19)...Chr(20)... --> drop Chr(19)...Chr(20) part but not if Chr(21) in between, leave the rest

...Chr(19)...Chr(21)... --> drop Chr(19)...Chr(21) part but not if Chr(20) in between, leave the rest

...Chr(8)...Chr(25)... --> drop Chr(8)...Chr(25) part, leave the rest

Any idea?

  • 2 months later...

This is very good work.

I do have a suggestion though. Add the following line at the bottom of your StringReplace calls (the last item before you return $text):

$text = StringReplace($text, Chr(11), @CRLF)

If you put it in any sooner you will double space the document lines.

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Thanks

How about this to be all of the replacements:

$text = BinaryToString($content1)
    $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(12), "")
    $text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), "") 
    $text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), "") 
    $text = StringRegExpReplace($text, "[^[:space:][:print:]]", "")
    $text = StringReplace(StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF), Chr(11), @CRLF)

That wouldn't solve the main problem though. And that one is posted a few posts above yours.

When I wrote that code I knew virtually nothing about AutoIt's structure functions. Since that has changed lately, the code in the first post looks kind of "primitive" (...I read one of your posts mentioning lost eras). Besides that, it can be done with less memory usage if we read the header first and later read the text at the calculated offset. That is good when there are lots of pictures inside the doc file, for example.
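A minimal sketch of the "header first, text later" idea, assuming the same offsets as DOCtoTXT() (Word header at byte 513, start and end of text at bytes 537 and 541). The function name _ReadDocTextBlock, the structure field names and the error numbers 1 and 2 are invented for this example; the signature and "complex" checks are left out to keep it short:

```autoit
; Hypothetical helper: read only the text block instead of the whole file.
Func _ReadDocTextBlock($sDocFile)
    Local $hFile = FileOpen($sDocFile, 16) ; 16 = binary mode
    If $hFile = -1 Then Return SetError(1, 0, "")

    ; the 512 bytes in front of the Word header plus the first 32 bytes of that header
    Local $bHead = FileRead($hFile, 544)
    If BinaryLen($bHead) < 544 Then
        FileClose($hFile)
        Return SetError(2, 0, "")
    EndIf

    ; bytes 537..544 hold the two little-endian dwords the script calls $start and $end
    Local $tRaw = DllStructCreate("byte[8]")
    DllStructSetData($tRaw, 1, BinaryMid($bHead, 537, 8))
    Local $tPos = DllStructCreate("dword iStart;dword iEnd", DllStructGetPtr($tRaw))
    Local $iStart = DllStructGetData($tPos, "iStart")
    Local $iEnd = DllStructGetData($tPos, "iEnd")
    If $iStart >= $iEnd Then
        FileClose($hFile)
        Return SetError(5, 0, "") ; corrupted header, or no text at all
    EndIf

    ; jump to the text (0 = offset counted from the beginning of the file) and read just that part
    FileSetPos($hFile, 512 + $iStart, 0)
    Local $bText = FileRead($hFile, $iEnd - $iStart)
    FileClose($hFile)
    Return $bText ; binary data, ready for BinaryToString() and the replacements
EndFunc
```

This way only 544 bytes plus the text itself are held in memory, no matter how many pictures the file carries.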

All in all, this could be done in a more advanced manner.

I know (yes I do) that you are very skilled with regexp; please find a way to remove the garbage properly from the output (this post describes the problem).

I think I tried that code change and I think for some reason it gave a problem so I settled for just adding the extra line. I'll look at it again. If you want I can send you the file I used for testing.

As for the other problem you are having, I'll look at that as well but it won't be for a couple of days. I have a hunch that Chr(11) may have been part of that "garbage". The big trick is in just finding out what the extra garbage is, then it's easy to handle.

Edit: Try changing that line

$text = StringReplace(StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF), Chr(11), @CRLF)

To

$text = StringRegExpReplace($text,"\r\n|\r|\n|\x0b", @CRLF)

I have not tested it but it looks about right. \x0b is the Hex for Chr(11). You could also use the Octal as \013.

The full list of ASCII octal codes is in my On-line Help >> Appendix>>ASCII Characters page. The link to my online help is in my sig.

EDIT 2: Stupid, stupid, stupid. Chr(11) is a vertical tab and I think that when you are converting to text you will want to replace other vertical characters as well, things like form feeds. If that's the case then change what I just gave you to:

$text = StringRegExpReplace($text,"\r\n|\r|\n|\v", @CRLF)
Edited by GEOSoft

George

Can you send me a file or files that show these problems?

The last one (Chr(8)...Chr(25)) looks like it should just be

$Text = StringRegExpReplace($Text, "\x08|\x19", "") ; remove all Chr(8) and Chr(25)

or

$Text = StringRegExpReplace($Text, "\x08(.*)\x19", "$1") ; only matches when a Chr(25) follows Chr(8) on the same line; keeps whatever is between them

The last one may need a bit of work depending on whether or not they will be on the same line.

Edited by GEOSoft

George

I can create a "problematic" file to show you the problem (I will update this post with an attachment).

The problem with doc files is this...

Hyperlinks, references and similar stuff are stored between 0x13 and 0x14, and the corresponding text is between 0x14 and 0x15. Like this:

0x13www.google.com0x14visit me0x15

Inside the doc file the visible part is only what is between 0x14 and 0x15:

visit me

So everything between 0x13 and 0x14 must be dropped (that is not visible text).

The problem is that some other stuff is stored directly between 0x13 and 0x15. In these cases there is no occurrence of 0x14. All that needs to be done there is to remove 0x13, 0x15 and everything in between. This collides with the 0x13...0x14...0x15 case (the google link, for example).
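One possible way around the collision is a single pattern that is not allowed to run across another field's markers. This is only a sketch under the rules described above, not the code from the first post; the function name _StripFieldCodes is invented, and it has not been tried on a wide range of files:

```autoit
; Hypothetical helper: strip field codes, keep the visible field results.
Func _StripFieldCodes($sText)
    ; 0x13, the hidden code, optionally 0x14 followed by the visible result, then 0x15.
    ; The negated classes stop a match from running across another field's markers.
    Local $sPattern = "\x13[^\x13\x14\x15]*\x14?([^\x13\x14\x15]*)\x15"
    Local $iCount
    Do
        $sText = StringRegExpReplace($sText, $sPattern, "$1")
        $iCount = @extended ; number of replacements made in this pass
    Until $iCount = 0 ; repeat so that a field stored inside another field is handled too
    Return $sText
EndFunc

; Usage with a line built like the ones in the problematic file:
Local $sRaw = "A debugger" & Chr(19) & ' XE "Debugger" ' & Chr(21) & " is described in the " & _
        Chr(19) & 'HYPERLINK "cordbgShell.doc"' & Chr(20) & "cordbg documentation" & Chr(21) & "."
ConsoleWrite(_StripFieldCodes($sRaw) & @CRLF)
```

For a 0x13...0x15 span without 0x14 the captured group is empty, so the whole span disappears; for 0x13...0x14...0x15 only the part after 0x14 survives.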

I will attach a doc file that literally demonstrates the problem.

When this is what you get as extracted text, we've got it:

The basic syntax classes used in the grammar are:
<int32> is either a decimal number or "0x" followed by a hexadecimal number, and must be represented in 32 bits.
<int64> is either a decimal number or "0x" followed by a hexadecimal number, and must be represented in 64 bits.
<hexbyte> is a hexadecimal number that fits into one byte.

A command line debugger is included with the tools. It is called cordbg. A description and the usage of the debugger can be found in the cordbg documentation.
The source code of the command line debugger is included with .NET SDK package.

edit:

I hope I was clear enough.

Edited by trancexx

Okay, I'll work on it, but in the meantime replace the line

$text = StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF); Unix and Mac

with

$text = StringRegExpReplace($text, "\v", @CRLF)

Then you can forget about the Chr(11) thing too. This is tested and replaces all vertical whitespace, including @CRLF, @CR, @LF and Chr(11).
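One thing to watch with "\v": in the regexp engine of current AutoIt versions it matches a single character, so an existing @CRLF pair is, as far as I can tell, replaced twice. A small sketch to compare the two behaviours; the sample string is made up, and putting "\r\n" in front of "\v" is just one way to keep the pair together:

```autoit
; Made-up sample with a CRLF pair, a lone CR and a vertical tab
Local $sSample = "one" & @CRLF & "two" & @CR & "three" & Chr(11) & "four"

; every vertical whitespace character is replaced on its own
Local $sEach = StringRegExpReplace($sSample, "\v", @CRLF)

; the alternation tries the CRLF pair first, so it is replaced as one unit
Local $sUnit = StringRegExpReplace($sSample, "\r\n|\v", @CRLF)

; compare the lengths to see whether the pair was doubled
ConsoleWrite(StringLen($sEach) & " vs " & StringLen($sUnit) & @CRLF)
```

If the text coming out of the doc file only ever contains lone @CR characters, both patterns give the same result.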

George

Will do.

Just a little more about replacements.

That attached file looks like this before replacements:

The basic syntax classes0x13 XE "Syntax:Syntax classes" 0x15 used in the grammar are:

<int320x13 XE "int32" 0x15> is either a decimal number or 0x followed by a hexadecimal number, and must be represented in 32 bits.

<int640x13 XE "int64" 0x15> is either a decimal number or 0x followed by a hexadecimal number, and must be represented in 64 bits.

<hexbyte0x13 XE "hexbyte" 0x15> is a hexadecimal number that fits into one byte.

A command line debugger0x13 XE "Debugger" 0x15 is included with the tools. It is called cordbg0x13 XE "cordbg" 0x15. A description and the usage of the debugger can be found in the 0x13HYPERLINK "cordbgShell.doc"0x14cordbg documentation0x15 .

The source code of the command line debugger is included with .NET SDK package

I cannot make it more obvious :)

Your file is not a "true" doc. Try saving it as a doc (under a different name or whatever) after you open it with Word. Or find a way to extract text from that kind of file.

Any idea?

Not trying to rain on your parade, but I thought the entire purpose of the exercise was not to use Word. If you have Word, why not use the built-in save as text function, which has already been extensively tested and performs what you need?

Does your script cater for the various flavors of document structure that are out there? Word 97, 2000, XP, 2007, etc? All of them, or just the ones you have available?

Suggestion: As well as the Microsoft documentation, have a look at the comments and algorithms in some of the OpenOffice open source code for ideas of how they solve this problem.

Yours is truly a noble coding challenge, worthy of admiration. :)

Seasons Greetings.

Give this a try and pay attention to my comments in the comment block.

Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
If @error Then Exit

Dim $TxT = DOCtoTXT($file)

If Not @error Then
    MsgBox(0, "Extracted text", $TxT)
Else
    MsgBox(16, "Error", "Error reading file: error " & @error)
EndIf


Func DOCtoTXT($docfile)
    
    Local $extension = StringSplit($docfile, ".", 1)
    $extension = $extension[$extension[0]]
    
    Local $hwnd = FileOpen($docfile, 16)
    Local $content = FileRead($hwnd)
    FileClose($hwnd)
    
    Local $contentdoc = BinaryMid($content, 513, 2);    0xECA5 - for .doc file of our interest
    
    Select
        Case $extension <> "doc" And $contentdoc <> "0xECA5"
            Return SetError(1); not doc file or quasi doc file with wrong extension
        Case $extension <> "doc" And $contentdoc = "0xECA5"
            Return SetError(2); extension incorrect, header indicates doc file
        Case $extension = "doc" And $contentdoc <> "0xECA5"
            Return SetError(3); extension incorrect or quasi doc file (extracting code required)
    EndSelect
    
    Local $complex_bin = BinaryMid($content, 523, 2); little endian
    Local $complex
    For $a = 1 To 2
      ; little endian -> big endian
        $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
    Next
    $complex = Dec($complex)
    If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4); complex doc file (extracting code required)
    
    Local $start_bin = BinaryMid($content, 537, 4); little endian  
    Local $start
    For $i = 1 To 4
      ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start); text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4); little endian
    Local $end
    For $i = 1 To 4
      ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end); text ends here

    If $start > $end Then Return SetError(5); corrupted header
    
    Local $content1 = BinaryMid($content, 513 + $start, $end - $start)
    
    Local $text = BinaryToString($content1)
    $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(1), "")
    $text = StringReplace(StringReplace($text, Chr(2), ""), Chr(3), "")
    $text = StringReplace(StringReplace($text, Chr(4), ""), Chr(5), "")
    $text = StringReplace(StringReplace($text, Chr(6), ""), Chr(7), "")
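    Local $sHold
    ; collect every 0x13...0x15 span (field code plus, if present, its visible result)
    Local $aRep = StringRegExp($text, "\x13.+?\x15", 3)

    If Not @error Then
        For $i = 0 To UBound($aRep) - 1
            $sHold = $aRep[$i]
            If StringInStr($sHold, Chr(20)) Then
                ; 0x13 code 0x14 result 0x15 -> keep only the visible result
                $sHold = StringRegExpReplace($sHold, "\x13.+?\x14(.*?)\x15(.*)", "$1$2")
            Else
                ; 0x13 code 0x15 -> nothing visible, drop the whole span
                $sHold = ""
            EndIf
            $text = StringReplace($text, $aRep[$i], $sHold)
        Next
    EndIf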
    #cs
  ;; Not sure what all you are attempting to do here so I just commented it out
  ;; I got the right results from your test file and a couple of mine without it.
  ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), ""); dropping everything between "0x13" and "0x14", inclusive
  ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), ""); same with "0x08" and "0x19" - couldn't find any docummentation on this (this is my impression)
  ;$text = StringReplace($text, Chr(21), "")
    #ce
    $text = StringRegExpReplace($text, "\v", @CRLF)
      
    Return $text

EndFunc

Edit: Major fix. It worked on some and not on others. Should be better now. Be sure to test it on a wide range of files. I'm a bit concerned about what might happen if the .doc file contains a table.
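On the table concern: as far as I know, Word ends every table cell (and every table row) with Chr(7), and the function above strips that character together with Chr(0) to Chr(6), so the cells would run together. A tiny sketch of one way to keep them apart, assuming that meaning of Chr(7); the sample string is made up:

```autoit
; Assumption: Chr(7) marks the end of a table cell and of a table row in the raw text.
; Made-up sample: two cells followed by the row mark.
Local $sRaw = "Name" & Chr(7) & "Size" & Chr(7) & Chr(7)

; turn the marks into tabs so the cell contents stay separated
Local $sText = StringReplace($sRaw, Chr(7), @TAB)
ConsoleWrite($sText & @CRLF)
```

Inside DOCtoTXT() such a line would have to run before the StringReplace() call that removes Chr(6) and Chr(7). Telling a row end apart from an empty cell needs more than this, since both show up as two marks in a row.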

Edited by GEOSoft

George
