Jump to content

Extracting text out of .doc file


trancexx
 Share

Recommended Posts

Give this a try and pay attention to my comments in the comment block.

Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
If @error Then Exit

Dim $TxT = DOCtoTXT($file)

If Not @error Then
    MsgBox(0, "Extracted text", $TxT)
Else
    MsgBox(16, "Error", "Error reading file: error " & @error)
EndIf


Func DOCtoTXT($docfile)
    
    Local $extension = StringSplit($docfile, ".", 1)
    $extension = $extension[$extension[0]]
    
    Local $hwnd = FileOpen($docfile, 16)
    Local $content = FileRead($hwnd)
    FileClose($hwnd)
    
    Local $contentdoc = BinaryMid($content, 513, 2);    0xECA5 - for .doc file of our interest
    
    Select
        Case $extension <> "doc" And $contentdoc <> "0xECA5"
            Return SetError(1); not doc file or quasi doc file with wrong extension
        Case $extension <> "doc" And $contentdoc = "0xECA5"
            Return SetError(2); extension incorrect, header indicates doc file
        Case $extension = "doc" And $contentdoc <> "0xECA5"
            Return SetError(3); extension incorrect or quasi doc file (extracting code required)
    EndSelect
    
    Local $complex_bin = BinaryMid($content, 523, 2); little endian
    Local $complex
    For $a = 1 To 2
     ; little endian -> big endian
        $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
    Next
    $complex = Dec($complex)
    If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4); complex doc file (extracting code required)
    
    Local $start_bin = BinaryMid($content, 537, 4); little endian  
    Local $start
    For $i = 1 To 4
     ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start); text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4); little endian
    Local $end
    For $i = 1 To 4
     ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end); text ends here

    If $start > $end Then Return SetError(5); corrupted header
    
    Local $content1 = BinaryMid($content, 513 + $start, $end - $start)
    
    Local $text = BinaryToString($content1)
    $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(1), "")
    $text = StringReplace(StringReplace($text, Chr(2), ""), Chr(3), "")
    $text = StringReplace(StringReplace($text, Chr(4), ""), Chr(5), "")
    $text = StringReplace(StringReplace($text, Chr(6), ""), Chr(7), "")
    Local $sHold
    Local $aRep = StringRegExp($text, "\x13.+?\x15", 3)
    
    If NOT @Error Then
    
      For $i = 0 To Ubound($aRep) -1
         $sHold = $aRep[$i]
         If StringInStr($sHold, Chr(20)) Then
            $sHold = StringRegExpReplace($sHold, "\x13.+?\x14(.*?)\x15(.*)", "$1$2")
         Else
            $sHold = ""
         EndIf
         $text = StringReplace($text, $aRep[$i], $sHold)
      Next
    Else
    
    EndIf  
    #cs
 ;; Not sure what all you are attempting to do here so I just commented it out
 ;; I got the right results from your test file and a couple of mine without it.
 ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), ""); dropping everything between "0x13" and "0x14", inclusive
 ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), ""); same with "0x08" and "0x19" - couldn't find any docummentation on this (this is my impression)
 ;$text = StringReplace($text, Chr(21), "")
    #ce
    $text = StringRegExpReplace($text, "\v", @CRLF)
      
    Return $text

EndFunc

Edit: Major fix. It worked on some and not on others. Should be better now. Be sure to test it on a wide range of files. I'm a bit concerned about what might happen if the .doc file contains a table.

No!

I'm sorry to say but that's going like few steps and more the wrong way.

There must be no loops on this level in order for this to have sense. That is incredibly slow. Solution must be found by some other means (regular expression). That code will take almost 5 minutes to complete taking 100% CPU usage for this file (2,4MB) that I'm testing on... and still won't do the job right.

On the other hand it can be done in less than 2 sec like this:

$text = StringReplace(BinaryToString($content1), Chr(0), "") 
    $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
    $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
    $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")
    $text = StringRegExpReplace($text, "\v", @CRLF)

And that is almost right. I guess a spoon of salt is all it needs to be perfect.

Salt is wanted!

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

  • Replies 42
  • Created
  • Last Reply

Top Posters In This Topic

Top Posters In This Topic

Not trying to rain on your parade, but I though the entire purpose of the exercise was not to use Word. If you have Word, why not use the built-in save as text function which has already been extensively tested and performs what you need.

Does your script cater for the various flavors of document structure that are out there? Word 97, 2000, XP, 2007, etc? All of them, or just the ones you have available?

Suggestion: As well as the Microsoft documentation, have a look at the comments and algorithms in some of the OpenOffice open source code for ideas of how they solve this problem.

Yours is truly a noble coding challenge, worthy of admiration. :)

Seasons Greetings.

Really don't know how to respond to this. I'm ambivalent.

You are practicing few logical fallacies there.

I will stop now (otherwise you would probably hate me).

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

very close

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Link to comment
Share on other sites

  • 4 weeks later...

No!

I'm sorry to say but that's going like few steps and more the wrong way.

There must be no loops on this level in order for this to have sense. That is incredibly slow. Solution must be found by some other means (regular expression). That code will take almost 5 minutes to complete taking 100% CPU usage for this file (2,4MB) that I'm testing on... and still won't do the job right.

On the other hand it can be done in less than 2 sec like this:

$text = StringReplace(BinaryToString($content1), Chr(0), "") 
    $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
    $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
    $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")
    $text = StringRegExpReplace($text, "\v", @CRLF)

And that is almost right. I guess a spoon of salt is all it needs to be perfect.

Salt is wanted!

Hello everyone, not to bring up an old post or anything but I just wanted to throw my two cents in.

The code above works WAY better than the huge block a couple posts up.

CODE
$content1 = FileRead("C:\227953.doc", 99999)

$text = StringReplace(BinaryToString($content1), Chr(0), "")

$text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")

$text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")

$text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "")

$text = StringRegExpReplace($text, "\v", @CRLF)

FileWrite("C:\output.txt", $text)

I haven't really changed anything yet but I'll post the updated code when I get it just how it should be. PS I'm talking about tables in the test sample. I'm writing to a txt document to check the end result.

Also I just wanted to say that I have been floating around these forums for a while but felt compelled to join the party after this post. I love AutoIT and I continue to be fascinated by the great work you all do and the amount of effort put into your code, especially since I am a beginner and need all the help I can get.

Thanks to you all.

Edited by Quafboy
Link to comment
Share on other sites

  • 4 weeks later...

@trancexx

This is excellent work.... saved the day for one of my customers who had a big and

seemingly corrupt Word document that would not open in Word.

Using your utility I was able to ClipPut the document and paste into a new document.

Saved her hours of retyping and made a hero of me!

Very well done. Five stars.

Link to comment
Share on other sites

@trancexx

This is excellent work.... saved the day for one of my customers who had a big and

seemingly corrupt Word document that would not open in Word.

Using your utility I was able to ClipPut the document and paste into a new document.

Saved her hours of retyping and made a hero of me!

Very well done. Five stars.

If you are he then I hope you have given her a chance to say thanks.

If you are she then... pretty much the same.

♡♡♡

.

eMyvnE

Link to comment
Share on other sites

  • 1 month later...

Is tis possible fr .docx extension files..??

This was majorly slammed and bashed in my original post so I'll just post the code here.

(If this doesn't work, you can modify it to work. Here is the basic process though)

ClipPut(_ReadDocXContent(@ScriptDir & '\DocXtest.docx'))


Func _ReadDocXContent($ReadLocation)
$Name = @ScriptDir & "\TempDoc.zip"
$UnZipName = @ScriptDir & '\DocXdoc'
FileCopy($ReadLocation, $Name, 1)
_Zip_Unzip($Name, "word\document.xml", $UnZipName, 16)
If @error Then Return 0
Sleep(200)
$Text = FileRead(@ScriptDir & '\DocXdoc\document.xml')
$RegEx = StringRegExp($Text, "<w:body>(.*?)<w:sect", 3)
$RegEx = StringRegExpReplace($Regex[0], "<w:br/>", @CRLF, 0)
$RegEx = StringRegExpReplace($Regex, "</w:p>", @CRLF & @CRLF, 0)
$RegEx = StringRegExpReplace($Regex, "<w:tab/>", @TAB, 0)
$RegEx = StringRegExpReplace($Regex, "<(.*?)>", "", 0)
$RegEx = StringRegExpReplace($Regex, "&lt;", "<", 0)
$RegEx = StringRegExpReplace($Regex, "&gt;", ">", 0)
$RegEx = StringRegExpReplace($Regex, "&amp;", "&", 0)
$RegEx = StringRegExpReplace($Regex, "â", '"', 0)
$RegEx = StringRegExpReplace($Regex, "â", '"', 0)
$RegEx = StringRegExpReplace($Regex, "â", "'", 0)
$RegEx = StringRegExpReplace($Regex, "â", "'", 0)
$RegEx = StringRegExpReplace($Regex, "â", "-", 0)
$RegEx = StringRegExpReplace($Regex, "â¦", "...", 0)
FileDelete($Name)
DirRemove($UnZipName, 1)
$RegEx = StringTrimRight($RegEx, 4)
Return $RegEx

EndFunc

Func _Zip_Unzip($hZipFile, $hFilename, $hDestPath, $flag = 4)
    Local $DLLChk = _Zip_DllChk()
    If $DLLChk <> 0 Then Return SetError($DLLChk, 0, 0) ;no dll
    If Not FileExists($hZipFile) Then Return SetError(1, 0, 0) ;no zip file
    
    If Not FileExists($hDestPath) Then DirCreate($hDestPath)
    
    $oApp = ObjCreate("Shell.Application")
    $hFolderitem = $oApp.NameSpace($hZipFile).Parsename($hFilename)
    
    $oApp.NameSpace($hDestPath).Copyhere($hFolderitem, $flag)
    
    
EndFunc   ;==>_Zip_Unzip

Func _Zip_DllChk()
    If Not FileExists(@SystemDir & "\zipfldr.dll") Then Return 2
    If Not RegRead("HKEY_CLASSES_ROOT\CLSID\{E88DCCE0-B7B3-11d1-A9F0-00AA0060FA31}", "") Then Return 3
    Return 0
EndFunc   ;==>_Zip_DllChk
Link to comment
Share on other sites

I tried sucking the returned string from this function into an array for processing and didn't get a very good match.

What winword reports as a 310-line file, generated a 641-line array.

Changing the last SRER to use @CR instead of @CRLF cut my array down to 321 lines.

Does anyone see anything wrong with changing the last SRER in the UDF to use just @CR instead of @CRLF (the original Word doc has just the single OD character as line terminators)?

The other problem causing the "mismatch" appears to be "extra" characters tacked onto the end of the file, beyond what the Word GUI shows as the last character. The file mentioned above, for example, has the characters "0D 03 0D 0D 04 0D 0D 03 0D 0D 04 0D 0D 0D 0D" tacked onto the end, which causes the extra 11 elements in the arrray. I can stick:

$text = StringLeft($text,StringInStr($text, Chr(13) & Chr(3)) - 1)

into the function and get a 310-line array that matches my 310-line file, but I'm sure that StringLeft statement would not be valid for all word docs. Has anyone else encountered garbage characters at the end of thier string, and found the correct method of truncating them?

Edited by Spiff59
Link to comment
Share on other sites

Do you think replacing:

Local $start_bin = BinaryMid($content, 537, 4); little endian  
    Local $start
    For $i = 1 To 4
     ; little endian -> big endian
        $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
    Next
    $start = Dec($start); text starts here
    
    Local $end_bin = BinaryMid($content, 541, 4); little endian
    Local $end
    For $i = 1 To 4
     ; little endian -> big endian
        $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
    Next
    $end = Dec($end); text ends here

With:

Local $start_end_bin = BinaryMid($content, 537, 8)
    Local $start, $end
    For $i = 4 To 1 Step -1; little endian -> big endian
        $start &= Hex(BinaryMid($start_end_bin, $i, 1))
    Next
    $start = Dec($start); text starts here
    For $i = 8 To 5 Step -1; little endian -> big endian
        $end &= Hex(BinaryMid($start_end_bin, $i, 1))
    Next
    $end = Dec($end); text ends here

Might be a little cleaner/faster?

(It gets rid of one BinaryMid command, eight arithmetic operations, and a variable)

Edit: PS - It looks like that "end of text" address includes header, footer, and other miscellaneous paragraphs, which is causing the "extra" bytes at the end of the returned string that I referred to in my prior post. Am still looking into the right way to process it. Ideas anyone?

Edited by Spiff59
Link to comment
Share on other sites

  • 2 months later...

The images disappear from the doc file!

Well I, for one, simply can't understand why you don't get the images when you use a function made for extracting TEXT from a .doc file.

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Link to comment
Share on other sites

That's because you're Dutch

:D

As it happens I'm not but I hope you're ready for the fallout that you may have just caused.

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Link to comment
Share on other sites

I fear no Dutch! :D

... and hope Jos wouldn't read this. :D

Him and some others. Now back to our regularly scheduled topic.

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Link to comment
Share on other sites

  • 2 years later...

I found this page searching with Google, after I read today the manual of AutoIt.

My question is: there is any chance to take from a DOC file the text with diacritics? I need to take the text from Word Doc files to convert it after in a particular HTML format, but I need also the romanian special characters... I was very hapy when I found this code here, but, when I have for instance the word: "câmp" which in romanian means "field", in the results the word is "cmp".

I tried to use the flag 4 in the function BinaryToString to get UTF8 encoding, but this didn't help me.

Please, any help would be good.

Thanks

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...