Extracting text out of .doc file

trancexx · December 26, 2008

Give this a try and pay attention to my comments in the comment block.

Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
If @error Then Exit

Dim $TxT = DOCtoTXT($file)

If Not @error Then
    MsgBox(0, "Extracted text", $TxT)
Else
    MsgBox(16, "Error", "Error reading file: error " & @error)
EndIf


Func DOCtoTXT($docfile)
    
    Local $extension = StringSplit($docfile, ".", 1)
    $extension = $extension[$extension[0]]
    
    Local $hwnd = FileOpen($docfile, 16)
    Local $content = FileRead($hwnd)
    FileClose($hwnd)
    
    Local $contentdoc = BinaryMid($content, 513, 2);    0xECA5 - for .doc file of our interest
    
    Select
        Case $extension <> "doc" And $contentdoc <> "0xECA5"
            Return SetError(1); not doc file or quasi doc file with wrong extension
        Case $extension <> "doc" And $contentdoc = "0xECA5"
            Return SetError(2); extension incorrect, header indicates doc file
        Case $extension = "doc" And $contentdoc <> "0xECA5"
            Return SetError(3); extension incorrect or quasi doc file (extracting code required)
    EndSelect
    
    Local $complex_bin = BinaryMid($content, 523, 2); little endian
    Local $complex
    For $a = 1 To 2
     ; little endian -> big endian
        $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
    Next
    $complex = Dec($complex)
    If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4); complex doc file (extracting code required)
    
    Local $start_bin = BinaryMid($content, 537, 4); little endian  
    Local $start
    For $i = 1 To 4
     ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start); text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4); little endian
    Local $end
    For $i = 1 To 4
     ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end); text ends here

    If $start > $end Then Return SetError(5); corrupted header
    
    Local $content1 = BinaryMid($content, 513 + $start, $end - $start)
    
    Local $text = BinaryToString($content1)
    $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(1), "")
    $text = StringReplace(StringReplace($text, Chr(2), ""), Chr(3), "")
    $text = StringReplace(StringReplace($text, Chr(4), ""), Chr(5), "")
    $text = StringReplace(StringReplace($text, Chr(6), ""), Chr(7), "")
    Local $sHold
    Local $aRep = StringRegExp($text, "\x13.+?\x15", 3)
    
    If NOT @Error Then
    
      For $i = 0 To Ubound($aRep) -1
         $sHold = $aRep[$i]
         If StringInStr($sHold, Chr(20)) Then
            $sHold = StringRegExpReplace($sHold, "\x13.+?\x14(.*?)\x15(.*)", "$1$2")
         Else
            $sHold = ""
         EndIf
         $text = StringReplace($text, $aRep[$i], $sHold)
      Next
    Else
    
    EndIf  
    #cs
 ;; Not sure what all you are attempting to do here so I just commented it out
 ;; I got the right results from your test file and a couple of mine without it.
 ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), ""); dropping everything between "0x13" and "0x14", inclusive
 ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), ""); same with "0x08" and "0x19" - couldn't find any docummentation on this (this is my impression)
 ;$text = StringReplace($text, Chr(21), "")
    #ce
    $text = StringRegExpReplace($text, "\v", @CRLF)
      
    Return $text

EndFunc

Edit: Major fix. It worked on some and not on others. Should be better now. Be sure to test it on a wide range of files. I'm a bit concerned about what might happen if the .doc file contains a table.

No!

I'm sorry to say but that's going like few steps and more the wrong way.

There must be no loops on this level in order for this to have sense. That is incredibly slow. Solution must be found by some other means (regular expression). That code will take almost 5 minutes to complete taking 100% CPU usage for this file (2,4MB) that I'm testing on... and still won't do the job right.

On the other hand it can be done in less than 2 sec like this:

$text = StringReplace(BinaryToString($content1), Chr(0), "") 
    $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
    $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
    $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")
    $text = StringRegExpReplace($text, "\v", @CRLF)

And that is almost right. I guess a spoon of salt is all it needs to be perfect.

Salt is wanted!

trancexx · December 26, 2008

Not trying to rain on your parade, but I though the entire purpose of the exercise was not to use Word. If you have Word, why not use the built-in save as text function which has already been extensively tested and performs what you need.
Does your script cater for the various flavors of document structure that are out there? Word 97, 2000, XP, 2007, etc? All of them, or just the ones you have available?
Suggestion: As well as the Microsoft documentation, have a look at the comments and algorithms in some of the OpenOffice open source code for ideas of how they solve this problem.
Yours is truly a noble coding challenge, worthy of admiration.
Seasons Greetings.

Really don't know how to respond to this. I'm ambivalent.

You are practicing few logical fallacies there.

I will stop now (otherwise you would probably hate me).

GEOSoft · December 26, 2008

very close

Quafboy · January 18, 2009

No!
I'm sorry to say but that's going like few steps and more the wrong way.
There must be no loops on this level in order for this to have sense. That is incredibly slow. Solution must be found by some other means (regular expression). That code will take almost 5 minutes to complete taking 100% CPU usage for this file (2,4MB) that I'm testing on... and still won't do the job right.
On the other hand it can be done in less than 2 sec like this:
$text = StringReplace(BinaryToString($content1), Chr(0), "") 
    $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
    $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
    $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")
    $text = StringRegExpReplace($text, "\v", @CRLF)
And that is almost right. I guess a spoon of salt is all it needs to be perfect.
Salt is wanted!

Hello everyone, not to bring up an old post or anything but I just wanted to throw my two cents in.

The code above works WAY better than the huge block a couple posts up.

CODE

$content1 = FileRead("C:\227953.doc", 99999)

$text = StringReplace(BinaryToString($content1), Chr(0), "")

$text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")

$text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")

$text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")

$text = StringRegExpReplace($text, "\v", @CRLF)

FileWrite("C:\output.txt", $text)

I haven't really changed anything yet but I'll post the updated code when I get it just how it should be. PS I'm talking about tables in the test sample. I'm writing to a txt document to check the end result.

Also I just wanted to say that I have been floating around these forums for a while but felt compelled to join the party after this post. I love AutoIT and I continue to be fascinated by the great work you all do and the amount of effort put into your code, especially since I am a beginner and need all the help I can get.

Thanks to you all.

Edited January 18, 2009 by Quafboy

dmob · February 13, 2009

@trancexx

This is excellent work.... saved the day for one of my customers who had a big and

seemingly corrupt Word document that would not open in Word.

Using your utility I was able to ClipPut the document and paste into a new document.

Saved her hours of retyping and made a hero of me!

Very well done. Five stars.

trancexx · February 13, 2009

@trancexx
This is excellent work.... saved the day for one of my customers who had a big and
seemingly corrupt Word document that would not open in Word.
Using your utility I was able to ClipPut the document and paste into a new document.
Saved her hours of retyping and made a hero of me!
Very well done. Five stars.

If you are he then I hope you have given her a chance to say thanks.

If you are she then... pretty much the same.

AvidLearner · March 16, 2009

Is tis possible fr .docx extension files..??

Szhlopp · March 17, 2009

Is tis possible fr .docx extension files..??

This was majorly slammed and bashed in my original post so I'll just post the code here.

(If this doesn't work, you can modify it to work. Here is the basic process though)

ClipPut(_ReadDocXContent(@ScriptDir & '\DocXtest.docx'))


Func _ReadDocXContent($ReadLocation)
$Name = @ScriptDir & "\TempDoc.zip"
$UnZipName = @ScriptDir & '\DocXdoc'
FileCopy($ReadLocation, $Name, 1)
_Zip_Unzip($Name, "word\document.xml", $UnZipName, 16)
If @error Then Return 0
Sleep(200)
$Text = FileRead(@ScriptDir & '\DocXdoc\document.xml')
$RegEx = StringRegExp($Text, "<w:body>(.*?)<w:sect", 3)
$RegEx = StringRegExpReplace($Regex[0], "<w:br/>", @CRLF, 0)
$RegEx = StringRegExpReplace($Regex, "</w:p>", @CRLF & @CRLF, 0)
$RegEx = StringRegExpReplace($Regex, "<w:tab/>", @TAB, 0)
$RegEx = StringRegExpReplace($Regex, "<(.*?)>", "", 0)
$RegEx = StringRegExpReplace($Regex, "&lt;", "<", 0)
$RegEx = StringRegExpReplace($Regex, "&gt;", ">", 0)
$RegEx = StringRegExpReplace($Regex, "&amp;", "&", 0)
$RegEx = StringRegExpReplace($Regex, "â", '"', 0)
$RegEx = StringRegExpReplace($Regex, "â", '"', 0)
$RegEx = StringRegExpReplace($Regex, "â", "'", 0)
$RegEx = StringRegExpReplace($Regex, "â", "'", 0)
$RegEx = StringRegExpReplace($Regex, "â", "-", 0)
$RegEx = StringRegExpReplace($Regex, "â¦", "...", 0)
FileDelete($Name)
DirRemove($UnZipName, 1)
$RegEx = StringTrimRight($RegEx, 4)
Return $RegEx

EndFunc

Func _Zip_Unzip($hZipFile, $hFilename, $hDestPath, $flag = 4)
    Local $DLLChk = _Zip_DllChk()
    If $DLLChk <> 0 Then Return SetError($DLLChk, 0, 0) ;no dll
    If Not FileExists($hZipFile) Then Return SetError(1, 0, 0) ;no zip file
    
    If Not FileExists($hDestPath) Then DirCreate($hDestPath)
    
    $oApp = ObjCreate("Shell.Application")
    $hFolderitem = $oApp.NameSpace($hZipFile).Parsename($hFilename)
    
    $oApp.NameSpace($hDestPath).Copyhere($hFolderitem, $flag)
    
    
EndFunc   ;==>_Zip_Unzip

Func _Zip_DllChk()
    If Not FileExists(@SystemDir & "\zipfldr.dll") Then Return 2
    If Not RegRead("HKEY_CLASSES_ROOT\CLSID\{E88DCCE0-B7B3-11d1-A9F0-00AA0060FA31}", "") Then Return 3
    Return 0
EndFunc   ;==>_Zip_DllChk

ldub · March 21, 2009

Accents are not taken into account!

Éèëçàù characters, for example, disappear.

Thanks a lot,

Spiff59 · March 23, 2009

I tried sucking the returned string from this function into an array for processing and didn't get a very good match.

What winword reports as a 310-line file, generated a 641-line array.

Changing the last SRER to use @CR instead of @CRLF cut my array down to 321 lines.

Does anyone see anything wrong with changing the last SRER in the UDF to use just @CR instead of @CRLF (the original Word doc has just the single OD character as line terminators)?

The other problem causing the "mismatch" appears to be "extra" characters tacked onto the end of the file, beyond what the Word GUI shows as the last character. The file mentioned above, for example, has the characters "0D 03 0D 0D 04 0D 0D 03 0D 0D 04 0D 0D 0D 0D" tacked onto the end, which causes the extra 11 elements in the arrray. I can stick:

$text = StringLeft($text,StringInStr($text, Chr(13) & Chr(3)) - 1)

into the function and get a 310-line array that matches my 310-line file, but I'm sure that StringLeft statement would not be valid for all word docs. Has anyone else encountered garbage characters at the end of thier string, and found the correct method of truncating them?

Edited March 23, 2009 by Spiff59

Spiff59 · March 24, 2009

Do you think replacing:

Local $start_bin = BinaryMid($content, 537, 4); little endian  
    Local $start
    For $i = 1 To 4
     ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start); text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4); little endian
    Local $end
    For $i = 1 To 4
     ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end); text ends here

With:

Local $start_end_bin = BinaryMid($content, 537, 8)
    Local $start, $end
    For $i = 4 To 1 Step -1; little endian -> big endian
        $start &= Hex(BinaryMid($start_end_bin, $i, 1))
    Next
    $start = Dec($start); text starts here
    For $i = 8 To 5 Step -1; little endian -> big endian
        $end &= Hex(BinaryMid($start_end_bin, $i, 1))
    Next
    $end = Dec($end); text ends here

Might be a little cleaner/faster?

(It gets rid of one BinaryMid command, eight arithmetic operations, and a variable)

Edit: PS - It looks like that "end of text" address includes header, footer, and other miscellaneous paragraphs, which is causing the "extra" bytes at the end of the returned string that I referred to in my prior post. Am still looking into the right way to process it. Ideas anyone?

Edited March 25, 2009 by Spiff59

trancexx · March 27, 2009

I will update first post when you say: "Change that, that and that, because it should be like this, this and this".

ldub · June 22, 2009

The images disappear from the doc file!

GEOSoft · June 22, 2009

The images disappear from the doc file!

Well I, for one, simply can't understand why you don't get the images when you use a function made for extracting TEXT from a .doc file.

trancexx · June 23, 2009

Well I, for one, simply can't understand why you don't get the images when you use a function made for extracting TEXT from a .doc file.

That's because you're Dutch

GEOSoft · June 23, 2009

That's because you're Dutch

As it happens I'm not but I hope you're ready for the fallout that you may have just caused.

trancexx · June 23, 2009

As it happens I'm not but I hope you're ready for the fallout that you may have just caused.

I fear no Dutch!

... and hope Jos wouldn't read this.

GEOSoft · June 23, 2009

I fear no Dutch!
... and hope Jos wouldn't read this.

Him and some others. Now back to our regularly scheduled topic.

ManusX · June 24, 2011

I found this page searching with Google, after I read today the manual of AutoIt.

My question is: there is any chance to take from a DOC file the text with diacritics? I need to take the text from Word Doc files to convert it after in a particular HTML format, but I need also the romanian special characters... I was very hapy when I found this code here, but, when I have for instance the word: "câmp" which in romanian means "field", in the results the word is "cmp".

I tried to use the flag 4 in the function BinaryToString to get UTF8 encoding, but this didn't help me.

Please, any help would be good.

Thanks

mesale0077 · June 25, 2011

hi

ş,ü,ğ,ö,ç letters dont read

why?

thank you now

Extracting text out of .doc file

Recommended Posts

Link to comment

Share on other sites

Top Posters In This Topic

Top Posters In This Topic

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members