trancexx, posted September 17, 2008 (edited January 2, 2009)

Extracting text from a Word doc file is easily done if you have MS Word installed. This is not about that.

This is about reading and analysing the structure of a doc file. If someone is interested in how, what or something else, click here (location might change in time, and... it's generally not advisable to read that much).

Extraction is very fast considering AutoIt's limitations.

Here's the script:

    Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
    If @error Then Exit

    Dim $TxT = DOCtoTXT($file)
    If Not @error Then
        MsgBox(0, "Extracted text", $TxT)
    Else
        MsgBox(16, "Error", "Error reading file: error " & @error)
    EndIf

    Func DOCtoTXT($docfile)

        Local $extension = StringSplit($docfile, ".", 1)
        $extension = $extension[$extension[0]]

        Local $hwnd = FileOpen($docfile, 16) ; 16 = binary mode
        Local $content = FileRead($hwnd)
        FileClose($hwnd)

        Local $contentdoc = BinaryMid($content, 513, 2) ; 0xECA5 - for .doc file of our interest

        Select
            Case $extension <> "doc" And $contentdoc <> "0xECA5"
                Return SetError(1) ; not doc file or quasi doc file with wrong extension
            Case $extension <> "doc" And $contentdoc = "0xECA5"
                Return SetError(2) ; extension incorrect, header indicates doc file
            Case $extension = "doc" And $contentdoc <> "0xECA5"
                Return SetError(3) ; extension incorrect or quasi doc file (extracting code required)
        EndSelect

        Local $complex_bin = BinaryMid($content, 523, 2) ; little endian
        Local $complex
        For $a = 1 To 2 ; little endian -> big endian
            $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
        Next
        $complex = Dec($complex)

        If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4) ; complex doc file (extracting code required)

        Local $start_bin = BinaryMid($content, 537, 4) ; little endian
        Local $start
        For $i = 1 To 4 ; little endian -> big endian
            $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
        Next
        $start = Dec($start) ; text starts here

        Local $end_bin = BinaryMid($content, 541, 4) ; little endian
        Local $end
        For $i = 1 To 4 ; little endian -> big endian
            $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
        Next
        $end = Dec($end) ; text ends here

        If $start > $end Then Return SetError(5) ; corrupted header

        Local $content1 = BinaryMid($content, 513 + $start, $end - $start)

        Local $text
        $text = StringReplace(BinaryToString($content1), Chr(0), "")
        ; turn the 0x14 separator of a field into 0x15 ...
        $text = StringRegExpReplace($text, "(?s)(\x13.+?)\x14(.*?)\x15?", "$1" & Chr(21) & "$2")
        ; ... so that every 0x13...0x15 run (the invisible part) can be dropped
        $text = StringRegExpReplace($text, '(?s)\x13(.*?)\x15', "")
        ; remove whatever is neither whitespace nor printable
        $text = StringRegExpReplace($text, "[^[:space:][:print:]]", "")
        $text = StringRegExpReplace($text, "\v", @CRLF)

        Return $text

    EndFunc

If you look closer you will see that the byte order of some of the data we use is little endian, and that it is converted to big endian inside a loop... that is only a suggestion (reading backwards is another solution, or maybe shifting).

The extracted text is "contaminated", so a series of replacements is needed to get only the text. After that the function returns.

edit: Updated 2nd January 2009 - further improved the 'replacements' part of the script (this also led to a speed improvement). (Previously updated on GEOSoft's suggestion.)

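The "reading backwards" alternative mentioned in the post can be sketched as a small helper that builds the number directly instead of assembling a hex string first. This is only an illustration: _ReadLittleEndian is a hypothetical name that does not appear in the original script, and the offset is 1-based to match BinaryMid().

```autoit
; Hypothetical helper (not part of the original script): read an unsigned
; little-endian integer of $iBytes bytes at the 1-based offset $iOffset.
Func _ReadLittleEndian($bData, $iOffset, $iBytes)
    Local $iValue = 0
    ; Walk from the most significant (last) byte down to the first one
    For $i = $iBytes - 1 To 0 Step -1
        $iValue = $iValue * 256 + Dec(Hex(BinaryMid($bData, $iOffset + $i, 1)))
    Next
    Return $iValue
EndFunc

; Bytes 78 56 34 12 in little-endian order are the number 0x12345678
ConsoleWrite(_ReadLittleEndian(Binary("0x78563412"), 1, 4) & @CRLF) ; prints 305419896
```

With such a helper, the start of the text in DOCtoTXT() could be read as _ReadLittleEndian($content, 537, 4) and the end as _ReadLittleEndian($content, 541, 4).
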
monoceres, posted September 17, 2008

Really good work. Could prove useful in the future.

KaFu, posted September 17, 2008

Nice UDF trancexx, I think I've got a usage for this too.

Best Regards

Andreik, posted September 17, 2008

Really nice and useful script.

torels, posted September 17, 2008

This is wonderful. Just today my Italian teacher (I live in Italy...) gave me tons of files in doc format, which include many "musts" of Italian literature from 1600 up to today, and I was thinking of a way to take all the text out and put it in a txt... and maybe then group everything in an exe... to manage the files.

Thanks for this UDF. Keep up the good work.

Szhlopp, posted September 17, 2008 (in reply to torels)

Nice! Now all you need to do is make it work with .docx files (DocX = MS Office Word 2007). I just tested it and it errors out on me with an error of 1.

Overall nice work!

Golbez, posted October 15, 2008

<3 ty so much, this is great code!!

YourSpace, posted October 16, 2008

I keep getting error 3...

trancexx, posted October 16, 2008

YourSpace wrote: "i keep getting error 3..."

That is documented. Your file is not a "true" doc. Try saving it as a doc (with a different name or whatever) after you open it with Word. Or find a way to extract text from that kind of file.

There is a flaw in the DOCtoTXT() function; I'm looking at it now. The correct replacements (the StringRegExpReplace() part) should be as follows ("..." represents anything):

...Chr(19)...Chr(20)...Chr(21)... --> drop the Chr(19)...Chr(20) part, leave the rest
...Chr(19)...Chr(20)... --> drop the Chr(19)...Chr(20) part, but not if Chr(21) is in between; leave the rest
...Chr(19)...Chr(21)... --> drop the Chr(19)...Chr(21) part, but not if Chr(20) is in between; leave the rest
...Chr(8)...Chr(25)... --> drop the Chr(8)...Chr(25) part, leave the rest

Any idea?

ptrex, posted October 16, 2008

@trancexx

Really nice and fast indeed!!

Regards,
ptrex

BrettF, posted October 16, 2008

Hi,

This is very nice work! Any chance of TXT to DOC? That would be great!

Cheers,
Brett

GEOSoft, posted December 22, 2008

This is very good work. I do have a suggestion though. Add the following line at the bottom of your StringReplace calls (the last item before you return $text):

    $text = StringReplace($text, Chr(11), @CRLF)

If you put it in any sooner you will double space the document lines.

George

trancexx, posted December 24, 2008 (in reply to GEOSoft's Chr(11) suggestion)

Thanks. How about this to be all of the replacements:

    $text = BinaryToString($content1)
    $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(12), "")
    $text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), "")
    $text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), "")
    $text = StringRegExpReplace($text, "[^[:space:][:print:]]", "")
    $text = StringReplace(StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF), Chr(11), @CRLF)

That wouldn't solve the main problem though, which is posted a few posts above yours.

When I wrote that code I knew virtually nothing about AutoIt's structure functions. Since that has changed lately, the code in the first post looks kind of "primitive" (...I read one of your posts mentioning lost eras). Besides that, it can be done with less memory usage if we read the header first and then read the text at the calculated offset. That is good when there are lots of pictures inside the doc file, for example. All in all, this could be done in a more advanced manner.

I know (yes I do) that you are very skilled with regexp; please find a way to remove the garbage properly from the output (this post describes the problem).

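The lower-memory idea from this post (read the header first, then only the text range) might look roughly like the sketch below. It relies on the same fixed offsets that DOCtoTXT() uses in the first post and on an AutoIt version that provides FileSetPos(). The function name _DocReadTextRange and the struct field names (ident, flags, fcMin, fcMac) are labels chosen for this illustration, not something taken from the thread.

```autoit
; Hypothetical helper: returns the raw text range of a simple .doc file as
; binary, reading only the header and the text bytes instead of the whole file.
; Assumes the same fixed offsets that DOCtoTXT() uses.
Func _DocReadTextRange($sFile)
    Local $hFile = FileOpen($sFile, 16) ; 16 = binary read mode
    If $hFile = -1 Then Return SetError(1, 0, Binary(""))

    ; 512-byte container header + the first 32 bytes of the Word header
    Local $bHead = FileRead($hFile, 544)
    If BinaryLen($bHead) < 544 Then
        FileClose($hFile)
        Return SetError(2, 0, Binary(""))
    EndIf

    ; Copy the bytes into a buffer and overlay named little-endian fields on it
    Local $tRaw = DllStructCreate("byte[544]")
    DllStructSetData($tRaw, 1, $bHead)
    Local $tHead = DllStructCreate("align 1;byte[512];word ident;byte[8];word flags;" & _
            "byte[12];dword fcMin;dword fcMac", DllStructGetPtr($tRaw))

    Local $iStart = DllStructGetData($tHead, "fcMin") ; text starts here
    Local $iEnd = DllStructGetData($tHead, "fcMac")   ; text ends here

    ; Bytes EC A5 read as a little-endian word give 0xA5EC
    If DllStructGetData($tHead, "ident") <> 0xA5EC Or $iStart >= $iEnd Then
        FileClose($hFile)
        Return SetError(3, 0, Binary("")) ; not a doc header, or empty/corrupted range
    EndIf
    If BitAND(DllStructGetData($tHead, "flags"), 4) Then
        FileClose($hFile)
        Return SetError(4, 0, Binary("")) ; complex doc file, same test as error 4 above
    EndIf

    FileSetPos($hFile, 512 + $iStart, 0) ; 0 = offset counted from the start of the file
    Local $bText = FileRead($hFile, $iEnd - $iStart)
    FileClose($hFile)
    Return $bText
EndFunc
```

The returned binary would then go through BinaryToString() and the same replacements as before. The offset passed to FileSetPos() is 0-based, which is why it is 512 + $iStart rather than the 513 + $start used with BinaryMid().
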
GEOSoft, posted December 24, 2008 (edited; in reply to trancexx's proposed replacement list)

I think I tried that code change and for some reason it gave a problem, so I settled for just adding the extra line. I'll look at it again. If you want I can send you the file I used for testing. As for the other problem you are having, I'll look at that as well, but it won't be for a couple of days. I have a hunch that Chr(11) may have been part of that "garbage". The big trick is just finding out what the extra garbage is; then it's easy to handle.

Edit: Try changing that line

    $text = StringReplace(StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF), Chr(11), @CRLF)

to

    $text = StringRegExpReplace($text,"\r\n|\r|\n|\x0b", @CRLF)

I have not tested it but it looks about right. \x0b is the hex for Chr(11). You could also use the octal, \013. The full list of ASCII octal codes is in my on-line help >> Appendix >> ASCII Characters page. The link to my online help is in my sig.

EDIT 2: Stupid, stupid, stupid. Chr(11) is a vertical tab, and I think that when you are converting to text you will want to replace other vertical characters as well, things like form feeds. If that's the case then change what I just gave you to

    $text = StringRegExpReplace($text,"\r\n|\r|\n|\v", @CRLF)

George

GEOSoft, posted December 24, 2008 (edited; in reply to trancexx's list of required replacements from October 16)

Can you send me a file or files that show these problems? The last one looks like it should just be

    $Text = StringRegExpReplace($Text, "\x08|\x19", "") ; remove all Chr(8) and Chr(25)

or

    $Text = StringRegExpReplace($Text, "\x08(.*)\x19", "$1") ; only matches when Chr(8) is followed by Chr(25)

The last one may need a bit of work depending on whether or not they will be on the same line.

George

trancexx, posted December 24, 2008 (edited; in reply to GEOSoft's request for sample files)

I can create a "problematic" file to show you the problem (will update this post with an attachment).

The problem with doc files is this... Hyperlinks, references and similar stuff are stored between 0x13 and 0x14, and the corresponding text is between 0x14 and 0x15. Like this:

    0x13www.google.com0x14visit me0x15

Inside the doc file the visible part is only what is between 0x14 and 0x15:

    visit me

So everything between 0x13 and 0x14 must be dropped (that is not visible text). The problem is that some other stuff is stored directly between 0x13 and 0x15. In these cases there is no occurrence of 0x14. All that needs to be done is to remove 0x13 and 0x15 and everything in between. This collides with 0x13...0x14...0x15 (the Google link, for example).

I will attach a doc file that literally demonstrates the problem. When you get this as the extracted text, we've got it:

The basic syntax classes used in the grammar are:
<int32> is either a decimal number or "0x" followed by a hexadecimal number, and must be represented in 32 bits.
<int64> is either a decimal number or "0x" followed by a hexadecimal number, and must be represented in 64 bits.
<hexbyte> is a hexadecimal number that fits into one byte.
A command line debugger is included with the tools. It is called cordbg. A description and the usage of the debugger can be found in the cordbg documentation.
The source code of the command line debugger is included with .NET SDK package.

edit: I hope I was clear enough.

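The two layouts described in this post (0x13 code 0x14 visible text 0x15, and 0x13 code 0x15 with no 0x14) can be told apart by forbidding the marker bytes inside each part of the pattern, so that a match can never run from one field into the next. The following is a minimal sketch under the assumption that fields are not nested; _StripFieldCodes is a hypothetical name, not a function from the thread.

```autoit
; Hypothetical helper illustrating the two field layouts described in the post.
; Assumes that fields are not nested inside each other.
Func _StripFieldCodes($sText)
    ; 0x13 code 0x14 visible 0x15  ->  keep only the visible part
    $sText = StringRegExpReplace($sText, "\x13[^\x13\x14\x15]*\x14([^\x13\x15]*)\x15", "$1")
    ; 0x13 code 0x15 (no 0x14)     ->  drop the whole field
    $sText = StringRegExpReplace($sText, "\x13[^\x13\x15]*\x15", "")
    Return $sText
EndFunc

; One index entry (dropped completely) and one hyperlink (visible text kept)
Local $sSample = "debugger" & Chr(19) & ' XE "Debugger" ' & Chr(21) & " see the " & _
        Chr(19) & 'HYPERLINK "cordbgShell.doc"' & Chr(20) & "cordbg documentation" & Chr(21) & "."
ConsoleWrite(_StripFieldCodes($sSample) & @CRLF) ; prints: debugger see the cordbg documentation.
```

Because the first pattern refuses to cross a 0x15, an index entry that has no 0x14 cannot be paired by mistake with the 0x14 of a later hyperlink.
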
GEOSoft, posted December 24, 2008

Okay, I'll work on it, but in the meantime replace the line

    $text = StringReplace(StringReplace($text, @LF, @CRLF), @CR, @CRLF) ; Unix and Mac

with

    $text = StringRegExpReplace($text, "\v", @CRLF)

Then you can forget about the Chr(11) thing too. This is tested and replaces all vertical whitespace, including @CRLF, @CR, @LF and Chr(11).

George

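One thing to keep in mind with a bare "\v": it matches a single vertical-whitespace character at a time, so an existing @CRLF pair in the input would be replaced twice and come out as two line breaks. Text taken from a doc file normally carries single characters as breaks, so this may never show up in practice, but a slightly more defensive variant is easy to sketch. _NormalizeLineBreaks is a hypothetical name used only for this illustration.

```autoit
; Hypothetical helper: turn every kind of line break into a single @CRLF.
Func _NormalizeLineBreaks($sText)
    ; "\r\n" is tried first, so an existing CRLF pair counts as one break;
    ; "\v" then covers a lone CR or LF, the vertical tab Chr(11) and the form feed
    Return StringRegExpReplace($sText, "\r\n|\v", @CRLF)
EndFunc

Local $sMixed = "one" & @CR & "two" & @CRLF & "three" & Chr(11) & "four"
ConsoleWrite(_NormalizeLineBreaks($sMixed) & @CRLF)
```

Each of the three different breaks in $sMixed ends up as exactly one @CRLF.
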
trancexx, posted December 24, 2008

Will do.

Just a little more about replacements. That attached file looks like this before replacements:

The basic syntax classes0x13 XE "Syntax:Syntax classes" 0x15 used in the grammar are:
<int320x13 XE "int32" 0x15> is either a decimal number or 0x followed by a hexadecimal number, and must be represented in 32 bits.
<int640x13 XE "int64" 0x15> is either a decimal number or 0x followed by a hexadecimal number, and must be represented in 64 bits.
<hexbyte0x13 XE "hexbyte" 0x15> is a hexadecimal number that fits into one byte.
A command line debugger0x13 XE "Debugger" 0x15 is included with the tools. It is called cordbg0x13 XE "cordbg" 0x15. A description and the usage of the debugger can be found in the 0x13HYPERLINK "cordbgShell.doc"0x14cordbg documentation0x15 .
The source code of the command line debugger is included with .NET SDK package

I cannot make it more obvious.

Confuzzled, posted December 24, 2008

trancexx wrote: "Your file is not "true" doc. Try saving it as a doc (with different name or whatever) after you open it with word. Or find a way to extract text for that kind of files. Any idea?"

Not trying to rain on your parade, but I thought the entire purpose of the exercise was not to use Word. If you have Word, why not use the built-in save-as-text function, which has already been extensively tested and does what you need?

Does your script cater for the various flavors of document structure that are out there? Word 97, 2000, XP, 2007, etc.? All of them, or just the ones you have available?

Suggestion: as well as the Microsoft documentation, have a look at the comments and algorithms in some of the OpenOffice open source code for ideas of how they solve this problem.

Yours is truly a noble coding challenge, worthy of admiration. Season's Greetings.

GEOSoft, posted December 24, 2008 (edited December 25, 2008)

Give this a try and pay attention to my comments in the comment block.

    Dim $file = FileOpenDialog("Choose .doc file", @DesktopDir, "Word doc file (*.doc)", 1)
    If @error Then Exit

    Dim $TxT = DOCtoTXT($file)
    If Not @error Then
        MsgBox(0, "Extracted text", $TxT)
    Else
        MsgBox(16, "Error", "Error reading file: error " & @error)
    EndIf

    Func DOCtoTXT($docfile)

        Local $extension = StringSplit($docfile, ".", 1)
        $extension = $extension[$extension[0]]

        Local $hwnd = FileOpen($docfile, 16)
        Local $content = FileRead($hwnd)
        FileClose($hwnd)

        Local $contentdoc = BinaryMid($content, 513, 2) ; 0xECA5 - for .doc file of our interest

        Select
            Case $extension <> "doc" And $contentdoc <> "0xECA5"
                Return SetError(1) ; not doc file or quasi doc file with wrong extension
            Case $extension <> "doc" And $contentdoc = "0xECA5"
                Return SetError(2) ; extension incorrect, header indicates doc file
            Case $extension = "doc" And $contentdoc <> "0xECA5"
                Return SetError(3) ; extension incorrect or quasi doc file (extracting code required)
        EndSelect

        Local $complex_bin = BinaryMid($content, 523, 2) ; little endian
        Local $complex
        For $a = 1 To 2 ; little endian -> big endian
            $complex &= Hex(BinaryMid($complex_bin, 3 - $a, 1))
        Next
        $complex = Dec($complex)

        If Mod(Floor($complex / 4), 2) <> 0 Then Return SetError(4) ; complex doc file (extracting code required)

        Local $start_bin = BinaryMid($content, 537, 4) ; little endian
        Local $start
        For $i = 1 To 4 ; little endian -> big endian
            $start &= Hex(BinaryMid($start_bin, 5 - $i, 1))
        Next
        $start = Dec($start) ; text starts here

        Local $end_bin = BinaryMid($content, 541, 4) ; little endian
        Local $end
        For $i = 1 To 4 ; little endian -> big endian
            $end &= Hex(BinaryMid($end_bin, 5 - $i, 1))
        Next
        $end = Dec($end) ; text ends here

        If $start > $end Then Return SetError(5) ; corrupted header

        Local $content1 = BinaryMid($content, 513 + $start, $end - $start)

        Local $text = BinaryToString($content1)
        ; strip control characters 0x00 to 0x07
        $text = StringReplace(StringReplace($text, Chr(0), ""), Chr(1), "")
        $text = StringReplace(StringReplace($text, Chr(2), ""), Chr(3), "")
        $text = StringReplace(StringReplace($text, Chr(4), ""), Chr(5), "")
        $text = StringReplace(StringReplace($text, Chr(6), ""), Chr(7), "")

        Local $sHold
        ; collect every 0x13...0x15 run and decide per run what to keep
        Local $aRep = StringRegExp($text, "\x13.+?\x15", 3)
        If Not @error Then
            For $i = 0 To UBound($aRep) - 1
                $sHold = $aRep[$i]
                If StringInStr($sHold, Chr(20)) Then
                    ; field with a 0x14 separator: keep the visible part
                    $sHold = StringRegExpReplace($sHold, "\x13.+?\x14(.*?)\x15(.*)", "$1$2")
                Else
                    ; no 0x14: nothing visible, drop the whole run
                    $sHold = ""
                EndIf
                $text = StringReplace($text, $aRep[$i], $sHold)
            Next
        EndIf

        #cs
        ;; Not sure what all you are attempting to do here so I just commented it out
        ;; I got the right results from your test file and a couple of mine without it.
        ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(19) & '(.*?)' & Chr(20), "") ; dropping everything between "0x13" and "0x14", inclusive
        ;$text = StringRegExpReplace($text, '(?s)(?i)' & Chr(8) & '(.*?)' & Chr(25), "") ; same with "0x08" and "0x19" - couldn't find any documentation on this (this is my impression)
        ;$text = StringReplace($text, Chr(21), "")
        #ce

        $text = StringRegExpReplace($text, "\v", @CRLF)

        Return $text

    EndFunc

Edit: Major fix. It worked on some and not on others. Should be better now. Be sure to test it on a wide range of files. I'm a bit concerned about what might happen if the .doc file contains a table.

George