Fenzik

Xdoc2txt: extracting text from advanced document formats using the DLL


Posted (edited)

Hello!

I wrote this function as an alternative to using the COM object or the command-line version of this project, which was also discussed earlier on this forum.

Project site - http://ebstudio.info/home/xdoc2txt.html

The advantage of this implementation is that you do not need to register the COM DLL using regsvr32.

But you still need the project DLL (xd2txlib.dll).

Enjoy!

; #FUNCTION# ====================================================================================================================
; Name ..........: _ExtractText
; Description ...: Extracts text from advanced document formats (DOC, DOCX, ODT, XLS, ...)
; Syntax ........: _ExtractText($sFilename[, $bProperties = False[, $hDll = 0]])
; Parameters ....: $sFilename           - a string value. Path to the document.
;                  $bProperties         - [optional] a boolean value. Default is False. If True, document properties are returned instead of the text.
;                  $hDll                - [optional] a handle value. Default is 0. Handle to a previously opened xd2txlib.dll. By default, xd2txlib.dll (expected in @ScriptDir) is opened and closed during the function call.
; Return value .: Success - String containing the text or the document properties.
;                 Failure - Empty string, and @error is set as follows:
;                 1 - The file does not exist.
;                 2 - Error opening xd2txlib.dll.
;                 3 - No text returned.
; Author ........: Fenzik
; Modified ......:
; Remarks .......: Project site - http://ebstudio.info/home/xdoc2txt.html
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func _ExtractText($sFilename, $bProperties = False, $hDll = 0)
    If Not FileExists($sFilename) Then Return SetError(1, 0, "")
    Local $bLoaded = False
    If $hDll = 0 Then
        $hDll = DllOpen(@ScriptDir & "\xd2txlib.dll")
        If $hDll = -1 Then Return SetError(2, 0, "")
        $bLoaded = True
    EndIf
    Local $aResult = DllCall($hDll, "int:cdecl", "ExtractText", "WSTR", $sFilename, "BOOL", $bProperties, "WSTR*", "")
    Local $iError = @error
    ; Close the DLL before any return, but only if this function opened it.
    If $bLoaded Then DllClose($hDll)
    ; A failed DllCall and a zero return value both mean that no text was returned.
    If $iError Or $aResult[0] = 0 Then Return SetError(3, 0, "")
    Return $aResult[3]
EndFunc   ;==>_ExtractText
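
A minimal usage sketch, assuming _ExtractText from above is in the same script and that a file named sample.docx (a hypothetical name) sits next to it. It branches on the @error codes listed in the function header:

```autoit
; Hypothetical file name; replace it with a real document path.
Local $sFile = @ScriptDir & "\sample.docx"
Local $sText = _ExtractText($sFile)
Local $iError = @error ; save @error before any other call resets it

Switch $iError
    Case 0
        ConsoleWrite($sText & @CRLF)
    Case 1
        ConsoleWrite("File not found: " & $sFile & @CRLF)
    Case 2
        ConsoleWrite("Could not open xd2txlib.dll in " & @ScriptDir & @CRLF)
    Case 3
        ConsoleWrite("No text was returned for: " & $sFile & @CRLF)
EndSwitch
```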

 

 

xd2txlib-example.zip

Edited by Fenzik
Attached the example instead of original project.

Posted (edited)

Here is a simple example, including a sample document.

 

Edited by Fenzik
Example moved to the first post.


Hi Fenzik,

I wrote this small piece of code for testing and got the following error: !>10:25:20 AutoIt3.exe ended.rc:-1073740940.

Could you please point out what I'm doing wrong? Thanks in advance.

 

Global $bProperties, $hDll
Global $sMessage = "Hold down Ctrl or Shift to choose multiple files."
Global $sFileOpenDialog = FileOpenDialog($sMessage, @ScriptDir & "\", "Text files (*.docx;*.doc;*.rtf;*.wri;*.txt)|Excel (*.xls;*.xlsx)", BitOR($FD_FILEMUSTEXIST, $FD_MULTISELECT))
If @error Then
    MsgBox($MB_SYSTEMMODAL, "", "No file(s) were selected.")
    Exit
Else
    FileChangeDir(@ScriptDir)
    $sFileOpenDialog = StringReplace($sFileOpenDialog, "|", @CRLF)
    MsgBox($MB_SYSTEMMODAL, "", "You chose the following files:" & @CRLF & $sFileOpenDialog)
EndIf

Global $FoundText = _ExtractText($sFileOpenDialog, $bProperties, $hDll) ; $bProperties = False --------------------------------
If Not @error Then ConsoleWrite($FoundText)
Exit

; #FUNCTION# ====================================================================================================================
; Name ..........: _ExtractText
; Description ...: Extracts text from advanced document formats (Doc, Docx, ODT, XLS, ...)
; Syntax ........: _ExtractText($sFilename[, $bProperties = False[, $hDll = 0]])
; Parameters ....: $sFilename           - a string value.
;                  $bProperties         - [optional] a boolean value. Default is False. If True, document properties will be returned instead of the text.
;                  $hDll                - [optional] a handle value. Default is 0. Optional handle to previously opened xd2txlib.dll. By default the xd2txlib.dll (Expected in @scriptdir) will be opened and closed during the function call.
; Return value .: String, containing the text or document properties or empty string and Error as follows:
;1 - The file does not exist.
;2 - Error during opening xd2txlib.dll.
;3 - No text returned.
; Author ........: Fenzik
; Modified ......:
; Remarks .......: Project site - http://ebstudio.info/home/xdoc2txt.html
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func _ExtractText($sFilename, $bProperties = False, $hDll = 0)
    If Not FileExists($sFilename) Then Return SetError(1, "", "")
    Local $bLoaded = False
    If $hDll = 0 Then
        $hDll = DllOpen(@ScriptDir & "\xd2txlib.dll")
        If $hDll = -1 Then Return SetError(2, "", "")
        $bLoaded = True
    EndIf
    Local $aResult = DllCall($hDll, "int:cdecl", "ExtractText", "WSTR", $sFilename, "BOOL", $bProperties, "WSTR*", "")
    Local $iError = @error
    ; Close the DLL before any return, but only if this function opened it.
    If $bLoaded Then DllClose($hDll)
    If $iError Or $aResult[0] = 0 Then Return SetError(3, 0, "")
    Return $aResult[3]
EndFunc   ;==>_ExtractText

 

Posted (edited)

OK, I made a few corrections to your example and added comments.

So here it is:

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

Global $properties = False
Global $foundtext = ""
Global $sMessage = "Hold down Ctrl or Shift to choose multiple files."
Global $sFileOpenDialog = FileOpenDialog($sMessage, @ScriptDir & "\", "Text files (*.docx;*.doc;*.rtf;*.wri;*.txt)|Excel (*.xls;*.xlsx)", BitOR($FD_FILEMUSTEXIST, $FD_MULTISELECT))
If @error Then
  MsgBox($MB_SYSTEMMODAL, "", "No file(s) were selected.")
  Exit
EndIf
;FileChangeDir(@ScriptDir)
;It is not necessary to change the working directory here.
;First we must find out whether the user selected one file or several.
If Not StringInStr($sFileOpenDialog, "|") Then
  ;Only one file selected
  MsgBox($MB_SYSTEMMODAL, "", "You chose the following file:" & @CRLF & $sFileOpenDialog)
  ;so convert it right here.
  $foundtext = _ExtractText($sFileOpenDialog)
  ;In this case the DLL is opened and closed during the function call.
  ;$bProperties is False by default, so there is no need to pass it if you want it False.
  ;Show the result
  If Not @error Then ConsoleWrite($foundtext)
Else
  Global $files = StringSplit($sFileOpenDialog, "|") ;Multiple files
  ;The directory is in $files[1] and the file names follow, so join them into full paths.
  $sFileOpenDialog = ""
  For $i = 2 To $files[0]
    $files[$i] = $files[1] & "\" & $files[$i]
    $sFileOpenDialog &= $files[$i] & @CRLF
  Next
  ;Now you have the full paths, separated by @CRLF, and you can show them in a message box.
  MsgBox($MB_SYSTEMMODAL, "", "You chose the following files:" & @CRLF & $sFileOpenDialog)
  ;And here was the problem: you passed the whole set of files, separated by @CRLF, with the directory path only on the first line.
  ;So convert and show them one by one, passing the handle of a DLL that is opened only once.
  Global $hXd2tx = DllOpen(@ScriptDir & "\xd2txlib.dll")
  If $hXd2tx = -1 Then
    MsgBox($MB_SYSTEMMODAL, "", "Could not open xd2txlib.dll.")
    Exit
  EndIf
  For $i = 2 To $files[0]
    $foundtext = _ExtractText($files[$i], $properties, $hXd2tx)
    If Not @error Then ConsoleWrite($foundtext)
  Next
  ;Close the DLL
  DllClose($hXd2tx)
EndIf

; #FUNCTION# ====================================================================================================================
; Name ..........: _ExtractText
; Description ...: Extracts text from advanced document formats (Doc, Docx, ODT, XLS, ...)
; Syntax ........: _ExtractText($sFilename[, $bProperties = False[, $hDll = 0]])
; Parameters ....: $sFilename           - a string value.
;                  $bProperties         - [optional] a boolean value. Default is False. If True, document properties will be returned instead of the text.
;                  $hDll                - [optional] a handle value. Default is 0. Optional handle to previously opened xd2txlib.dll. By default the xd2txlib.dll (Expected in @scriptdir) will be opened and closed during the function call.
; Return value .: String, containing the text or document properties or empty string and Error as follows:
;1 - The file does not exist.
;2 - Error during opening xd2txlib.dll.
;3 - No text returned.
; Author ........: Fenzik
; Modified ......:
; Remarks .......: Project site - http://ebstudio.info/home/xdoc2txt.html
; Related .......:
; Link ..........:
; Example .......: No
; ===============================================================================================================================
Func _ExtractText($sFilename, $bProperties = False, $hDll = 0)
  If Not FileExists($sFilename) Then Return SetError(1, "", "")
  Local $bLoaded = False
  If $hDll = 0 Then
    $hDll = DllOpen(@ScriptDir & "\xd2txlib.dll")
    If $hDll = -1 Then Return SetError(2, "", "")
    $bLoaded = True
  EndIf
  Local $aResult = DllCall($hDll, "int:cdecl", "ExtractText", "WSTR", $sFilename, "BOOL", $bProperties, "WSTR*", "")
  Local $iError = @error
  ; Close the DLL before any return, but only if this function opened it.
  If $bLoaded Then DllClose($hDll)
  If $iError Or $aResult[0] = 0 Then Return SetError(3, 0, "")
  Return $aResult[3]
EndFunc   ;==>_ExtractText

 

Edited by Fenzik


And close the DLL after converting multiple files.

I forgot that, and I am having trouble editing my previous post...


Fenzik,

Thanks a lot for your reply. I ran the code as you sent it and got the same error: !>14:44:16 AutoIt3.exe ended.rc:-1073740940 when selecting .docx, .xlsx and .pdf files.

I have Win 8.11 on a 64-bit PC.

 


Strange!

Are you using the latest versions of AutoIt and SciTE?

I have the latest versions of both and everything runs OK.

So try updating to the latest versions.

And do you have the file xd2txlib.dll in @ScriptDir?


I have the latest versions of:

- SciTE version 3.6.0
- AutoIt 3.3.14.5
- xdoc2txt 32-bit (x86) version (Windows OS is 32bit / 64bit) in the script directory.

I also tested it with both the xdoc2txt 32-bit (x86) and 64-bit (x64) versions, and I installed the "Microsoft Visual C++ 2010 Redistributable Package (x86)" and the "Microsoft Visual C++ 2010 Redistributable Package (x64)" to avoid any dependency issues.

Even so, it keeps showing the same error: !>17:08:44 AutoIt3.exe ended.rc:-1073740940
+>17:08:44 AutoIt3Wrapper Finished.
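
For reference, the exit code in that log is easier to look up as an unsigned 32-bit hex value. The sketch below is only a diagnostic aid and is not part of the original scripts. -1073740940 corresponds to 0xC0000374, the Windows NTSTATUS code STATUS_HEAP_CORRUPTION, which suggests the process crashed inside native code rather than in the script logic itself. What exactly corrupts the heap is not established in this thread.

```autoit
; Exit code copied from the SciTE output above.
Local $iExitCode = -1073740940
; Show the low 32 bits as 8 hex digits so the value can be looked up as an NTSTATUS code.
ConsoleWrite("0x" & Hex($iExitCode, 8) & @CRLF)
```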

 


OK, so let's start from the beginning.

When you unpack and run the simple example from the first post of this topic, does it work?

Posted (edited)

Friends, you can also use this code (it uses COM without registering the DLL):

MsgBox(0, 'Result', _ExtractTextCOM("sample.docx", False))

Func _ExtractTextCOM($sFilename, $bProperties = False)
    Local $hXd2tx = DllOpen(@ScriptDir & "\xd2txcom" & (@AutoItX64 ? '_64' : '') & ".dll")
    ; DllOpen reports failure by returning -1.
    If $hXd2tx = -1 Then Return SetError(1, 0, '')
    Local $oXd2tx = ObjCreate('{4ECE8E8A-BCC2-4709-BCAE-264210DF321B}', '{EB26F494-4E90-4432-9BA6-C6D9CDEE25C4}', $hXd2tx)
    ; Stop here if the object could not be created.
    If Not IsObj($oXd2tx) Then
        DllClose($hXd2tx)
        Return SetError(2, 0, '')
    EndIf
    Return $oXd2tx.ExtractText($sFilename, $bProperties)
EndFunc

 

Edited by moimon
Wrong Code


@jcpetu:

Unfortunately, I have no idea.

It works perfectly on my side.

What about the bitness of AutoIt.exe?

The DLL is x86.

So the script should be run with the x86 version of AutoIt or compiled as x86.

What about your SciTE settings? Is it set to prefer x64 AutoIt?

Good luck! Unfortunately, I have no other ideas for making it work in your environment.
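
One way to rule out a bitness mismatch is to check it at the top of the script. This is a minimal sketch, assuming the script is started from SciTE with AutoIt3Wrapper (which reads the #AutoIt3Wrapper_UseX64 directive) and that the x86 build of xd2txlib.dll is the one in @ScriptDir:

```autoit
#AutoIt3Wrapper_UseX64=n ; ask AutoIt3Wrapper to run and compile the script as 32-bit

; @AutoItX64 is 1 when the script runs under the 64-bit interpreter.
If @AutoItX64 Then
    MsgBox(16, "xd2txlib", "This script is running as 64-bit AutoIt, but the x86 xd2txlib.dll needs a 32-bit process.")
    Exit 1
EndIf
```

The directive only takes effect when the script is launched through AutoIt3Wrapper; the @AutoItX64 check also catches the case where the script is started some other way.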


@jcpetu:
I think I have solved your problem. :)
You probably do not have the Microsoft Visual C++ 2010 Redistributable Package for x86 installed, which the whole Xdoc2Txt project requires.
............
So try installing it, for example from Here.
Then the x86 version of the DLL and the script should work, I hope.
