Kiran_L

Read data from PDF files

5 posts in this topic

Hi guys,

I am trying to read a PDF file that contains unstructured data, but I don't know how to handle PDFs in AutoIt.

Can you point me to a UDF that can open a PDF and read the document?

Thanks for your time.


Hi @Kiran_L,

Try checking this old thread and this link; they might give you some ideas. Otherwise, please post the code you have written so far so that others can help more easily.


You can use MuPDF:

https://mupdf.com/downloads/

or other commercial solutions such as QuickPDF (look in my signature for the QuickPDF.au3 UDF).

 


My contribution (my own projects): * Debenu Quick PDF Library - UDF * Debenu PDF Viewer SDK - UDF * Acrobat Reader - ActiveX Viewer * UDF for PDFCreator v1.x.x * XZip - UDF * AppCompatFlags UDF * CrowdinAPI UDF * _WinMergeCompare2Files() * _JavaExceptionAdd() * _IsBeta() * Writing DPI Awareness App - workaround * _AutoIt_RequiredVersion() * Chilkatsoft.au3 UDF * TeamViewer.au3 UDF * JavaManagement UDF * VIES over SOAP * WinSCP UDF * GHAPI UDF - modest begining - comunication with GitHub REST API *


I would recommend using Xpdf. Just search the forum for Xpdf; someone posted it earlier. It will let you read the data from a PDF as text.

 


Using the Xpdf tools:

; #FUNCTION# ====================================================================================================================
; Name...........: _XFDF_Info
; Description....: Retrieves information from a PDF file
; Syntax.........: _XFDF_Info ( "File" [, "Info"] )
; Parameters.....: File    - PDF file.
;                  Info    - The information to retrieve
; Return values..: Success - If the Info parameter is not empty, returns the value for the specified Info label
;                            (an empty string if that label is not present)
;                          - If the Info parameter is empty, returns an array with all available information
;                  Failure - 0, and sets @error to :
;                   1 - PDF file not found
;                   2 - Unable to find the external program (pdfinfo.exe)
;                   3 - Unexpected output from the external program
; Remarks........: The array returned is two-dimensional and zero-based, and is made up as follows:
;                   $array[0][0] = Label of the first item (title, author, pages...)
;                   $array[0][1] = Value of the first item
;                   ...
; ===============================================================================================================================
Func _XFDF_Info($sPDFFile, $sInfo = "")
    Local $sXPDFInfo = @ScriptDir & "\pdfinfo.exe"

    If NOT FileExists($sPDFFile) Then Return SetError(1, 0, 0)
    If NOT FileExists($sXPDFInfo) Then Return SetError(2, 0, 0)
    
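    ; Run pdfinfo directly (no cmd.exe wrapper) and quote both paths so that spaces are handled; 2 = $STDOUT_CHILD
    Local $iPid = Run('"' & $sXPDFInfo & '" "' & $sPDFFile & '"', @ScriptDir, @SW_HIDE, 2)

    Local $sResult = ""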
    While 1
        $sResult &= StdoutRead($iPid)
        If @error Then ExitLoop
    WEnd
    
    Local $aInfos = StringRegExp($sResult, "(?m)^(.*?): +(.*)$", 3)
    If Mod( UBound($aInfos, 1), 2) = 1 Then Return SetError(3, 0, 0)
    
    Local $aResult [ UBound($aInfos, 1) / 2][2]
    
    For $i = 0 To UBound($aInfos) - 1 Step 2
        If $sInfo <> "" AND $aInfos[$i] = $sInfo Then Return $aInfos[$i + 1]
        $aResult[$i / 2][0] = $aInfos[$i]
        $aResult[$i / 2][1] = $aInfos[$i + 1]
    Next
    
    If $sInfo <> "" Then Return ""

    Return $aResult
EndFunc ; ---> _XFDF_Info



; #FUNCTION# ====================================================================================================================
; Name...........: _XPDF_Search
; Description....: Searches for a string in the text of a PDF file
; Syntax.........: _XPDF_Search ( "File" , "String" [, Case = 0 [, Flag = 0 [, FirstPage = 1 [, LastPage = 0]]]] )
; Parameters.....: File       - PDF file.
;                  String     - String to search for
;                  Case       - If set to 1, search is case sensitive (default is 0)
;                  Flag       - A number to indicate how the function behaves. See below for details. The default is 0.
;                  FirstPage  - First page to search (default is 1)
;                  LastPage   - Last page to search (default is 0 = last page of the document)
; Return values..: Success -
;                   Flag = 0 - Returns 1 if the search string was found, or 0 if not
;                   Flag = 1 - Returns the number of occurrences found in the whole PDF file
;                   Flag = 2 - Returns an array containing the number of occurrences found for each page
;                              (only pages containing the search string are returned)
;                              $array[0][0] - Number of matching pages
;                              $array[0][1] - Number of occurrences found in the whole PDF file
;                              $array[n][0] - Page number
;                              $array[n][1] - Number of occurrences found on that page
;                  Failure - 0, and sets @error to :
;                   1 - PDF file not found
;                   2 - Unable to find the external program (pdftotext.exe)
; ===============================================================================================================================
Func _XPDF_Search($sPDFFile, $sSearch, $iCase = 0, $iFlag = 0, $iStart = 1, $iEnd = 0)
    Local $sXPDFToText = @ScriptDir & "\pdftotext.exe"
    Local $sOptions = " -layout -f " & $iStart
    Local $iCount = 0, $aResult[1][2] = [[0, 0]], $aSearch, $sContent, $iPageOccCount
    
    If NOT FileExists($sPDFFile) Then Return SetError(1, 0, 0)
    If NOT FileExists($sXPDFToText) Then Return SetError(2, 0, 0)
    
    If $iEnd > 0 Then $sOptions &= " -l " & $iEnd
    
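    ; Quote the executable path so that a script directory containing spaces still works; 2 = $STDOUT_CHILD
    Local $iPid = Run('"' & $sXPDFToText & '"' & $sOptions & ' "' & $sPDFFile & '" -', @ScriptDir, @SW_HIDE, 2)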
    While 1
        $sContent &= StdoutRead($iPid)
        If @error Then ExitLoop
    WEnd
    
    
    Local $aPages = StringSplit($sContent, chr(12) )
    
    For $i = 1 To $aPages[0]
        $iPageOccCount = 0
        While StringInStr($aPages[$i], $sSearch, $iCase, $iPageOccCount + 1)
            If $iFlag <> 1 AND $iFlag <> 2 Then
                $aResult[0][1] = 1
                ExitLoop
            EndIf
            $iPageOccCount += 1
        WEnd

        If $iPageOccCount Then
            Redim $aResult[ UBound($aResult, 1) + 1][2]
            $aResult[0][1] += $iPageOccCount
            $aResult[0][0] = UBound($aResult) - 1
            $aResult[ UBound($aResult, 1) - 1 ][0] = $i + $iStart - 1
            $aResult[ UBound($aResult, 1) - 1 ][1] = $iPageOccCount
        EndIf
    Next
    
    If $iFlag = 2 Then Return $aResult
    Return $aResult[0][1]
    
EndFunc ; ---> _XPDF_Search



; #FUNCTION# ====================================================================================================================
; Name...........: _XPDF_ToText
; Description....: Converts a PDF file to plain text.
; Syntax.........: _XPDF_ToText ( "PDFFile" , "TxtFile" [ , FirstPage [, LastPage [, Layout ]]] )
; Parameters.....: PDFFile    - PDF input file.
;                  TxtFile    - Plain text file to convert to
;                  FirstPage  - First page to convert (default is 1)
;                  LastPage   - Last page to convert (default is 0 = last page of the document)
;                  Layout     - If true, maintains (as best as possible) the original physical layout of the text
;                               If false, the behavior is to 'undo' physical layout (columns, hyphenation, etc.)
;                                 and output the text in reading order.
;                               Default is True
; Return values..: Success - 1
;                  Failure - 0, and sets @error to :
;                   1 - PDF File not found
;                   2 - Unable to find the external program
; ===============================================================================================================================
Func _XPDF_ToText($sPDFFile, $sTXTFile, $iFirstPage = 1, $iLastPage = 0, $bLayout = True)
    Local $sXPDFToText = @ScriptDir & "\pdftotext.exe"
    Local $sOptions 
    
    If NOT FileExists($sPDFFile) Then Return SetError(1, 0, 0)
    If NOT FileExists($sXPDFToText) Then Return SetError(2, 0, 0)
    
    If $iFirstPage <> 1 Then $sOptions &= " -f " & $iFirstPage
    If $iLastPage <> 0 Then $sOptions &= " -l " & $iLastPage
    If $bLayout = True Then $sOptions &= " -layout"
    
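    ; ShellExecuteWait also returns 0 when the launch itself fails, so check @error as well as the exit code
    Local $iReturn = ShellExecuteWait($sXPDFToText, $sOptions & ' "' & $sPDFFile & '" "' & $sTXTFile & '"', @ScriptDir, "", @SW_HIDE)
    If Not @error And $iReturn = 0 Then Return 1

    Return 0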
    
EndFunc ; ---> _XPDF_ToText
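
Since the original question was simply how to get the text out of a PDF, here is a minimal sketch of reading it straight into a string instead of going through a text file. It assumes pdftotext.exe from the Xpdf tools sits next to the script, as the functions above do. The names _PDF_ReadText, _PDF_SplitPages and sample.pdf are made up for this example and are not part of the UDF above.

```autoit
#include <AutoItConstants.au3>

; Hypothetical helper (not part of the UDF above): returns the text of a PDF as one string.
; Assumption: pdftotext.exe from the Xpdf tools is in the script directory.
; @error: 1 = PDF not found, 2 = pdftotext.exe not found, 3 = could not start pdftotext
Func _PDF_ReadText($sPDFFile)
    Local $sTool = @ScriptDir & "\pdftotext.exe"
    If Not FileExists($sPDFFile) Then Return SetError(1, 0, "")
    If Not FileExists($sTool) Then Return SetError(2, 0, "")

    ; "-" as the output file name makes pdftotext write the text to stdout
    Local $iPid = Run('"' & $sTool & '" -layout "' & $sPDFFile & '" -', @ScriptDir, @SW_HIDE, $STDOUT_CHILD)
    If @error Then Return SetError(3, 0, "")

    Local $sText = ""
    While 1
        $sText &= StdoutRead($iPid)
        If @error Then ExitLoop ; stream closed, the process has finished
        Sleep(10)
    WEnd
    Return $sText
EndFunc

; Hypothetical helper: splits the text into pages on the form feed, Chr(12),
; the same separator _XPDF_Search relies on.
; Returns a StringSplit array: [0] = number of pages, [1..n] = page text.
Func _PDF_SplitPages($sText)
    ; Drop a trailing form feed so the last page is not followed by an empty entry
    If StringRight($sText, 1) = Chr(12) Then $sText = StringTrimRight($sText, 1)
    Return StringSplit($sText, Chr(12))
EndFunc

; Usage: sample.pdf is a placeholder file name
Local $sText = _PDF_ReadText(@ScriptDir & "\sample.pdf")
If @error Then
    ConsoleWrite("Could not read the PDF, @error = " & @error & @CRLF)
Else
    Local $aPages = _PDF_SplitPages($sText)
    ConsoleWrite("Pages: " & $aPages[0] & @CRLF)
    ConsoleWrite($aPages[1] & @CRLF) ; text of the first page
EndIf
```

From there the usual string functions (StringInStr, StringRegExp and so on) can be used to pick the values you need out of the unstructured text.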

 
