trancexx Posted December 26, 2008 Author Posted December 26, 2008 Give this a try and pay attention to my comments in the comment block. expandcollapse popupDim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1) If @error Then Exit Dim $TxT = DOCtoTXT($file) If Not @error Then MsgBox(0, "Extracted text", $TxT) Else MsgBox(16, "Error", "Error reading file: error " & @error) EndIf Func DOCtoTXT($docfile) Local $extension = StringSplit($docfile, ".", 1) $extension = $extension[$extension[0]] Local $hwnd = FileOpen($docfile, 16) Local $content = FileRead($hwnd) FileClose($hwnd) Local $contentdoc = BinaryMid($content, 513, 2); 0xECA5 - for .doc file of our interest Select Case $extension <> "doc" And $contentdoc <> "0xECA5" Return SetError(1); not doc file or quasi doc file with wrong extension Case $extension <> "doc" And $contentdoc = "0xECA5" Return SetError(2); extension incorrect, header indicates doc file Case $extension = "doc" And $contentdoc <> "0xECA5" Return SetError(3); extension incorrect or quasi doc file (extracting code required) EndSelect Local $complex_bin = BinaryMid($content, 523, 2); little endian Local $complex For $a = 1 To 2 ; little endian -> big endian $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1)) Next $complex = Dec($complex) If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4); complex doc file (extracting code required) Local $start_bin = BinaryMid($content, 537, 4); little endian Local $start For $i = 1 To 4 ; little endian -> big endian $start &= Hex(BinaryMid($start_bin, 5 - $i, 1)) Next $start = Dec($start); text starts here Local $end_bin = BinaryMid($content, 541, 4); little endian Local $end For $i = 1 To 4 ; little endian -> big endian $end &= Hex(BinaryMid($end_bin, 5 - $i, 1)) Next $end = Dec($end); text ends here If $start > $end Then Return SetError(5); corrupted header Local $content1 = BinaryMid($content, 513 + $start, $end - $start) Local $text = BinaryToString($content1) $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(1), "") $text = StringReplace(StringReplace($text, Chr(2), ""), Chr(3), "") $text = StringReplace(StringReplace($text, Chr(4), ""), Chr(5), "") $text = StringReplace(StringReplace($text, Chr(6), ""), Chr(7), "") Local $sHold Local $aRep = StringRegExp($text, "\x13.+?\x15", 3) If NOT @Error Then For $i = 0 To Ubound($aRep) -1 $sHold = $aRep[$i] If StringInStr($sHold, Chr(20)) Then $sHold = StringRegExpReplace($sHold, "\x13.+?\x14(.*?)\x15(.*)", "$1$2") Else $sHold = "" EndIf $text = StringReplace($text, $aRep[$i], $sHold) Next Else EndIf #cs ;; Not sure what all you are attempting to do here so I just commented it out ;; I got the right results from your test file and a couple of mine without it. ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), ""); dropping everything between "0x13" and "0x14", inclusive ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), ""); same with "0x08" and "0x19" - couldn't find any docummentation on this (this is my impression) ;$text = StringReplace($text, Chr(21), "") #ce $text = StringRegExpReplace($text, "\v", @CRLF) Return $text EndFunc Edit: Major fix. It worked on some and not on others. Should be better now. Be sure to test it on a wide range of files. I'm a bit concerned about what might happen if the .doc file contains a table.No! I'm sorry to say but that's going like few steps and more the wrong way. There must be no loops on this level in order for this to have sense. That is incredibly slow. Solution must be found by some other means (regular expression). That code will take almost 5 minutes to complete taking 100% CPU usage for this file (2,4MB) that I'm testing on... and still won't do the job right. On the other hand it can be done in less than 2 sec like this: $text = StringReplace(BinaryToString($content1), Chr(0), "") $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2") $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "") $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "") $text = StringRegExpReplace($text, "\v", @CRLF) And that is almost right. I guess a spoon of salt is all it needs to be perfect. Salt is wanted! ♡♡♡ . eMyvnE
trancexx Posted December 26, 2008 Author Posted December 26, 2008 Not trying to rain on your parade, but I though the entire purpose of the exercise was not to use Word. If you have Word, why not use the built-in save as text function which has already been extensively tested and performs what you need.Does your script cater for the various flavors of document structure that are out there? Word 97, 2000, XP, 2007, etc? All of them, or just the ones you have available?Suggestion: As well as the Microsoft documentation, have a look at the comments and algorithms in some of the OpenOffice open source code for ideas of how they solve this problem.Yours is truly a noble coding challenge, worthy of admiration. Seasons Greetings.Really don't know how to respond to this. I'm ambivalent.You are practicing few logical fallacies there. I will stop now (otherwise you would probably hate me). ♡♡♡ . eMyvnE
GEOSoft Posted December 26, 2008 Posted December 26, 2008 very close George Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.*** The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number. Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else. "Old age and treachery will always overcome youth and skill!"
Quafboy Posted January 18, 2009 Posted January 18, 2009 (edited) No! I'm sorry to say but that's going like few steps and more the wrong way. There must be no loops on this level in order for this to have sense. That is incredibly slow. Solution must be found by some other means (regular expression). That code will take almost 5 minutes to complete taking 100% CPU usage for this file (2,4MB) that I'm testing on... and still won't do the job right. On the other hand it can be done in less than 2 sec like this: $text = StringReplace(BinaryToString($content1), Chr(0), "") $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2") $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "") $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "") $text = StringRegExpReplace($text, "\v", @CRLF) And that is almost right. I guess a spoon of salt is all it needs to be perfect. Salt is wanted! Hello everyone, not to bring up an old post or anything but I just wanted to throw my two cents in. The code above works WAY better than the huge block a couple posts up. CODE$content1 = FileRead("C:\227953.doc", 99999) $text = StringReplace(BinaryToString($content1), Chr(0), "") $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2") $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "") $text = StringRegExpReplace($text, "[^[:space:]|[:print:]]", "") $text = StringRegExpReplace($text, "\v", @CRLF) FileWrite("C:\output.txt", $text) I haven't really changed anything yet but I'll post the updated code when I get it just how it should be. PS I'm talking about tables in the test sample. I'm writing to a txt document to check the end result. Also I just wanted to say that I have been floating around these forums for a while but felt compelled to join the party after this post. I love AutoIT and I continue to be fascinated by the great work you all do and the amount of effort put into your code, especially since I am a beginner and need all the help I can get. Thanks to you all. Edited January 18, 2009 by Quafboy
dmob Posted February 13, 2009 Posted February 13, 2009 @trancexx This is excellent work.... saved the day for one of my customers who had a big and seemingly corrupt Word document that would not open in Word. Using your utility I was able to ClipPut the document and paste into a new document. Saved her hours of retyping and made a hero of me! Very well done. Five stars.
trancexx Posted February 13, 2009 Author Posted February 13, 2009 @trancexxThis is excellent work.... saved the day for one of my customers who had a big and seemingly corrupt Word document that would not open in Word. Using your utility I was able to ClipPut the document and paste into a new document. Saved her hours of retyping and made a hero of me!Very well done. Five stars.If you are he then I hope you have given her a chance to say thanks. If you are she then... pretty much the same. ♡♡♡ . eMyvnE
Szhlopp Posted March 17, 2009 Posted March 17, 2009 Is tis possible fr .docx extension files..?? This was majorly slammed and bashed in my original post so I'll just post the code here. (If this doesn't work, you can modify it to work. Here is the basic process though) expandcollapse popupClipPut(_ReadDocXContent(@ScriptDir & '\DocXtest.docx')) Func _ReadDocXContent($ReadLocation) $Name = @ScriptDir & "\TempDoc.zip" $UnZipName = @ScriptDir & '\DocXdoc' FileCopy($ReadLocation, $Name, 1) _Zip_Unzip($Name, "word\document.xml", $UnZipName, 16) If @error Then Return 0 Sleep(200) $Text = FileRead(@ScriptDir & '\DocXdoc\document.xml') $RegEx = StringRegExp($Text, "<w:body>(.*?)<w:sect", 3) $RegEx = StringRegExpReplace($Regex[0], "<w:br/>", @CRLF, 0) $RegEx = StringRegExpReplace($Regex, "</w:p>", @CRLF & @CRLF, 0) $RegEx = StringRegExpReplace($Regex, "<w:tab/>", @TAB, 0) $RegEx = StringRegExpReplace($Regex, "<(.*?)>", "", 0) $RegEx = StringRegExpReplace($Regex, "<", "<", 0) $RegEx = StringRegExpReplace($Regex, ">", ">", 0) $RegEx = StringRegExpReplace($Regex, "&", "&", 0) $RegEx = StringRegExpReplace($Regex, "â", '"', 0) $RegEx = StringRegExpReplace($Regex, "â", '"', 0) $RegEx = StringRegExpReplace($Regex, "â", "'", 0) $RegEx = StringRegExpReplace($Regex, "â", "'", 0) $RegEx = StringRegExpReplace($Regex, "â", "-", 0) $RegEx = StringRegExpReplace($Regex, "â¦", "...", 0) FileDelete($Name) DirRemove($UnZipName, 1) $RegEx = StringTrimRight($RegEx, 4) Return $RegEx EndFunc Func _Zip_Unzip($hZipFile, $hFilename, $hDestPath, $flag = 4) Local $DLLChk = _Zip_DllChk() If $DLLChk <> 0 Then Return SetError($DLLChk, 0, 0) ;no dll If Not FileExists($hZipFile) Then Return SetError(1, 0, 0) ;no zip file If Not FileExists($hDestPath) Then DirCreate($hDestPath) $oApp = ObjCreate("Shell.Application") $hFolderitem = $oApp.NameSpace($hZipFile).Parsename($hFilename) $oApp.NameSpace($hDestPath).Copyhere($hFolderitem, $flag) EndFunc ;==>_Zip_Unzip Func _Zip_DllChk() If Not FileExists(@SystemDir & "\zipfldr.dll") Then Return 2 If Not RegRead("HKEY_CLASSES_ROOT\CLSID\{E88DCCE0-B7B3-11d1-A9F0-00AA0060FA31}", "") Then Return 3 Return 0 EndFunc ;==>_Zip_DllChk RegEx/RegExRep Tester!Nerd Olympics - Community App!Login UDFMemory UDF - "Game.exe+753EC" - CE pointer to AU3Password Manager W/ SourceDataFiler - Include files in your au3!--- Was I helpful? Click the little green '+'
ldub Posted March 21, 2009 Posted March 21, 2009 Accents are not taken into account! Éèëçàù characters, for example, disappear. Thanks a lot,
Spiff59 Posted March 23, 2009 Posted March 23, 2009 (edited) I tried sucking the returned string from this function into an array for processing and didn't get a very good match. What winword reports as a 310-line file, generated a 641-line array. Changing the last SRER to use @CR instead of @CRLF cut my array down to 321 lines. Does anyone see anything wrong with changing the last SRER in the UDF to use just @CR instead of @CRLF (the original Word doc has just the single OD character as line terminators)? The other problem causing the "mismatch" appears to be "extra" characters tacked onto the end of the file, beyond what the Word GUI shows as the last character. The file mentioned above, for example, has the characters "0D 03 0D 0D 04 0D 0D 03 0D 0D 04 0D 0D 0D 0D" tacked onto the end, which causes the extra 11 elements in the arrray. I can stick: $text = StringLeft($text,StringInStr($text, Chr(13) & Chr(3)) - 1) into the function and get a 310-line array that matches my 310-line file, but I'm sure that StringLeft statement would not be valid for all word docs. Has anyone else encountered garbage characters at the end of thier string, and found the correct method of truncating them? Edited March 23, 2009 by Spiff59
Spiff59 Posted March 24, 2009 Posted March 24, 2009 (edited) Do you think replacing: Local $start_bin = BinaryMid($content, 537, 4); little endian Local $start For $i = 1 To 4 ; little endian -> big endian $start &= Hex(BinaryMid($start_bin, 5 - $i, 1)) Next $start = Dec($start); text starts here Local $end_bin = BinaryMid($content, 541, 4); little endian Local $end For $i = 1 To 4 ; little endian -> big endian $end &= Hex(BinaryMid($end_bin, 5 - $i, 1)) Next $end = Dec($end); text ends here With: Local $start_end_bin = BinaryMid($content, 537, 8) Local $start, $end For $i = 4 To 1 Step -1; little endian -> big endian $start &= Hex(BinaryMid($start_end_bin, $i, 1)) Next $start = Dec($start); text starts here For $i = 8 To 5 Step -1; little endian -> big endian $end &= Hex(BinaryMid($start_end_bin, $i, 1)) Next $end = Dec($end); text ends here Might be a little cleaner/faster? (It gets rid of one BinaryMid command, eight arithmetic operations, and a variable) Edit: PS - It looks like that "end of text" address includes header, footer, and other miscellaneous paragraphs, which is causing the "extra" bytes at the end of the returned string that I referred to in my prior post. Am still looking into the right way to process it. Ideas anyone? Edited March 25, 2009 by Spiff59
trancexx Posted March 27, 2009 Author Posted March 27, 2009 I will update first post when you say: "Change that, that and that, because it should be like this, this and this". ♡♡♡ . eMyvnE
GEOSoft Posted June 22, 2009 Posted June 22, 2009 The images disappear from the doc file!Well I, for one, simply can't understand why you don't get the images when you use a function made for extracting TEXT from a .doc file. George Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.*** The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number. Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else. "Old age and treachery will always overcome youth and skill!"
trancexx Posted June 23, 2009 Author Posted June 23, 2009 Well I, for one, simply can't understand why you don't get the images when you use a function made for extracting TEXT from a .doc file.That's because you're Dutch ♡♡♡ . eMyvnE
GEOSoft Posted June 23, 2009 Posted June 23, 2009 That's because you're Dutch As it happens I'm not but I hope you're ready for the fallout that you may have just caused. George Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.*** The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number. Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else. "Old age and treachery will always overcome youth and skill!"
trancexx Posted June 23, 2009 Author Posted June 23, 2009 As it happens I'm not but I hope you're ready for the fallout that you may have just caused.I fear no Dutch! ... and hope Jos wouldn't read this. ♡♡♡ . eMyvnE
GEOSoft Posted June 23, 2009 Posted June 23, 2009 I fear no Dutch! ... and hope Jos wouldn't read this. Him and some others. Now back to our regularly scheduled topic. George Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.*** The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number. Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else. "Old age and treachery will always overcome youth and skill!"
ManusX Posted June 24, 2011 Posted June 24, 2011 I found this page searching with Google, after I read today the manual of AutoIt. My question is: there is any chance to take from a DOC file the text with diacritics? I need to take the text from Word Doc files to convert it after in a particular HTML format, but I need also the romanian special characters... I was very hapy when I found this code here, but, when I have for instance the word: "câmp" which in romanian means "field", in the results the word is "cmp". I tried to use the flag 4 in the function BinaryToString to get UTF8 encoding, but this didn't help me. Please, any help would be good. Thanks
mesale0077 Posted June 25, 2011 Posted June 25, 2011 hi ş,ü,ğ,ö,ç letters dont read why? thank you now
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now