Posted (edited)

Really? Does it keep the table delimiters (cells) intact?

I tried out a freeware one, and although tables still look like tables, it's just a bunch of space characters padding everything. I suppose you can split wherever there are multiple spaces (of course the forum editor removes them all, ha):

Property Name Property Type Access Type Description
Hostname string GET/PUT Retrieves and sets the name of a server, where Hostname is
the server’s hostname or IP address. If Hostname is not given
or undefined, the authentication is performed on the local
Port integer GET/PUT Retrieves and sets the TCP port to use when connecting to the
server. Its default value is 0 (zero), indicating the default port
number should be used. Otherwise, enter the correct port
number.
A port number set to a negative value is treated as an incorrect
value and the default port number is used instead.
Note: The default port number for ESX Server 3.x is 443; the
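
The idea of splitting on runs of spaces could look roughly like this in AutoIt. It is only a sketch: the sample line is made up (the real padding was lost when posting), and it assumes that no cell contains two consecutive spaces or a "|" character.

```autoit
; Hypothetical sample line: columns padded with runs of spaces, as a converter might emit them.
Local $sLine = "Port          integer       GET/PUT       Retrieves and sets the TCP port"

; Collapse every run of two or more spaces into a single "|" delimiter, then split on it.
Local $aCells = StringSplit(StringRegExpReplace($sLine, " {2,}", "|"), "|")

; With the default flag, $aCells[0] holds the number of cells and the cells start at index 1.
For $i = 1 To $aCells[0]
    ConsoleWrite($i & ": " & $aCells[$i] & @CRLF)
Next
```

Wrapped description lines would begin with padding and so produce an empty first cell, which is one way to recognise them and append them to the previous row.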

Edited by jdelaney
IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.
Posted

  On 4/12/2013 at 2:56 PM, water said:

The input PDF remains unchanged. The tool extracts the text from the PDF and writes it to an output file. You can then easily process the output file, or many of them.

That's the plan anyway. But I don't see an alternative to click and drag for acquiring my data, unless you could suggest anything else that might be easier?

Posted

Don't know. Give it a try and see how a table is converted to text.

Or use one of the online PDF converters to create a Word document. Then process the Word document using the Word UDF or my WordEX UDF.
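
As a rough illustration of what processing the Word document means, the sketch below reads the first table through Word's COM object model directly rather than through either UDF. The file path is hypothetical, Word must be installed, and the table is assumed to have no merged cells.

```autoit
; Hypothetical path to a Word document produced by a PDF converter.
Local $sDocPath = "C:\path\to\converted.docx"

Local $oWord = ObjCreate("Word.Application")
If Not IsObj($oWord) Then
    ConsoleWrite("Could not start Word." & @CRLF)
    Exit
EndIf

; Open read-only: Open(FileName, ConfirmConversions, ReadOnly).
Local $oDoc = $oWord.Documents.Open($sDocPath, False, True)
If Not IsObj($oDoc) Then
    ConsoleWrite("Could not open the document." & @CRLF)
    $oWord.Quit()
    Exit
EndIf

If $oDoc.Tables.Count > 0 Then
    Local $oTable = $oDoc.Tables(1)
    Local $sCell
    For $r = 1 To $oTable.Rows.Count
        For $c = 1 To $oTable.Columns.Count
            ; Cell text ends with a paragraph mark and an end-of-cell marker; strip both.
            $sCell = StringRegExpReplace($oTable.Cell($r, $c).Range.Text, "[\r\x07]+$", "")
            ConsoleWrite($sCell & @TAB)
        Next
        ConsoleWrite(@CRLF)
    Next
EndIf

$oDoc.Close(0) ; 0 = wdDoNotSaveChanges
$oWord.Quit()
```

Because each cell is addressed by row and column, no guessing about space padding is needed.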


Posted

  On 4/12/2013 at 3:00 PM, water said:

Don't know. Give it a try and see how a table is converted to text.

Or use one of the online PDF converters to create a Word document. Then process the Word document using the Word UDF or my WordEX UDF.

I wish I could, but I'm not allowed to :(

Posted

Well, if anyone is looking at my code, could someone have a look at, say, the first function?

When I copy my values, I send them to the clipboard and store that data in a variable called $row1.

After that I need to clear the clipboard, but when I call ClipPut(""), the next Ctrl+C is no longer recognised.
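
One possible cause (an assumption, nothing in the thread confirms it) is that the clipboard is read before the target application has finished copying. A minimal sketch of a clear-copy-wait helper follows; the name _CopySelection and the timeout value are made up for illustration.

```autoit
; Hypothetical helper: clear the clipboard, send Ctrl+C, and poll until text arrives.
; Returns the copied text, or "" with @error set to 1 if nothing arrives in time.
Func _CopySelection($iTimeoutMs = 2000)
    ClipPut("") ; start from an empty clipboard
    Send("^c") ; copy the current selection in the active window
    Local $hTimer = TimerInit()
    While ClipGet() = ""
        If TimerDiff($hTimer) > $iTimeoutMs Then Return SetError(1, 0, "")
        Sleep(50) ; give the application time to fill the clipboard
    WEnd
    Return ClipGet()
EndFunc

Local $row1 = _CopySelection()
If @error Then ConsoleWrite("Nothing was copied within the timeout." & @CRLF)
```

Clearing first and then waiting for non-empty content means a stale value from the previous copy can never be mistaken for the new one.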

Posted (edited)

Not allowed to what, access the internet?

If that's not the case, I just found this site, and you can automate it, since it doesn't require you to type in the mangled words (human verification):

http://www.pdfonline.com/pdf-to-word-converter/

It's actually a really good one. I'm going to pass it along to my team, since it's much easier to loop through table objects.

Edit: you are wasting your time with the copy buffer.

Edited by jdelaney
Posted

@jdelaney if he's got sensitive data in the PDFs, an online converter is a big no-no. I'm working on something for him that's a little more... conventional... to say the least.


Posted (edited)

Well... try this on for size... Make sure that SumatraPDF is set as your default PDF reader first.

And replace $sReader, $dirPDF, and $sPDF with the correct values, if you would.

Tested on WinXP

_PDFSearch()

; Opens a PDF in SumatraPDF, copies all of its text via the clipboard, and prints it to the console.
Func _PDFSearch()
    Local $sReader, $dirPDF, $sPDF, $pidReader, $hReader, $hTimer, $x, $sStringData

    $sReader = "C:\Program Files\SumatraPDF\SumatraPDF.exe"
    $dirPDF = "C:\path\to\pdfs\"
    $sPDF = "pdfname.pdf"

    ; Launch the reader with the PDF as a parameter; both paths are quoted in case they contain spaces.
    $pidReader = Run('"' & $sReader & '" "' & $dirPDF & $sPDF & '"')
    If Not $pidReader Then
        MsgBox(16 + 262144, @AutoItExe, "Failed to open PDF file.")
        Exit
    EndIf

    ; Wait for the reader window, then bring it to the front.
    WinWait($sPDF & " - SumatraPDF")
    $hReader = WinGetHandle($sPDF & " - SumatraPDF")
    WinActivate($hReader)
    If Not WinWaitActive($hReader, "", 5) Then
        MsgBox(16 + 262144, @AutoItExe, "Timed out waiting for application window to gain focus.")
        Exit
    EndIf
    ControlFocus($hReader, "", "[CLASS:SUMATRA_PDF_CANVAS; INSTANCE:1]")
    Sleep(500)

    ; Clear the clipboard, then retry select-all and copy with a growing delay until text arrives.
    ClipPut("")
    $hTimer = TimerInit()
    $x = 500
    While Not ClipGet()
        If TimerDiff($hTimer) > 5000 Then
            MsgBox(16 + 262144, @AutoItExe, "Timed out attempting to get text from document.")
            Exit
        EndIf
        ControlSend($hReader, "", "[CLASS:SUMATRA_PDF_CANVAS; INSTANCE:1]", "^a")
        Sleep($x)
        ControlSend($hReader, "", "[CLASS:SUMATRA_PDF_CANVAS; INSTANCE:1]", "^c")
        Sleep($x)
        $x += 500
    WEnd
    $sStringData = ClipGet()

    ; Close the reader; kill the process if the window refuses to close.
    If Not WinClose($hReader) Then ProcessClose($pidReader)
    ConsoleWrite($sStringData & @CRLF)
    MsgBox(0, "", "Check your console output. Now you have everything stored in $sStringData for you to manipulate")
EndFunc

EDIT: I just pray you're not dealing with a PDF with multiple columns XD

EDIT2: Replaced ShellExecute() of the file with Run() of the application, passing the file as a parameter.

Edited by Mechaflash

Posted

  On 4/12/2013 at 3:34 PM, Mechaflash said:

Well... try this on for size... Make sure that SumatraPDF is set as your default PDF reader first.

Thanks for your help, mate. I still can't get it to do what I want without click and drag.

Posted

So did $sStringData not contain the text?


Posted

At what point is it failing? I put in quite a few error checks...


Posted

  On 4/15/2013 at 3:35 PM, Mechaflash said:

At what point is it failing? I put in quite a few error checks...

It wouldn't run at all :( and I'm pretty sure I configured my code to work with yours correctly.

Posted

You're going to have to post the code you used. I tested it here and it worked very well. If you never ran into any of my error boxes, then I suspect the problem may be with the way you've rewritten it.
