Read and manipulate PDF file


Hi all,

I'm new to AutoIt and have a question about working with PDF files. Say I have a PDF file (please see the attached example). I need to read the file line by line and highlight a line if a condition is met, e.g. if the score is 90 or above. Can this be achieved with AutoIt? Any guidance is very welcome. Thank you all.

Score.pdf


If you have Word, this will read the entire document and split it into an array of lines; however, the ID column is cut off. I assume it's because the PDF file "contains interactive features".

#include <Array.au3>
#include <Word.au3>

_Func('...\Score.pdf')

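Func _Func($sFile)
    ; Start Word (or attach to a running instance)
    Local $oWord = _Word_Create()
    If @error Then
        ConsoleWrite('Error: _Word_Create' & @CRLF)
        Exit
    EndIf
    ; Open the PDF file as a Word document
    Local $oDoc = _Word_DocOpen($oWord, $sFile)
    If @error Then
        ; _Word_Quit expects the Word application object, not the document
        _Word_Quit($oWord)
        ConsoleWrite('Error: _Word_DocOpen' & @CRLF)
        Exit
    EndIf
    ; Read the whole document text
    Local $oRange = $oDoc.Range
    Local $sText = $oRange.Text
    ConsoleWrite($sText)
    ; With the default flag, each character of @CRLF is a separate delimiter,
    ; so the text is split on @CR as well as on @LF
    Local $aLines = StringSplit($sText, @CRLF)
    _ArrayDisplay($aLines)
    _Word_Quit($oWord)
EndFunc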

 


Hi, thanks for the reply. I came up with a general plan, as follows.

1. Use Xpdf to export the PDF file to text file

2. Use FileOpen to open the text file

3. Use _FileCountLines to get the number of lines

4. Loop each line, use FileReadLine to read the line

5. Use StringRegExp to check whether the line matches the record format, and extract the values, e.g. the Score value

6. Determine if extracted values meet criteria. E.g. score >= 90

7. If yes, highlight the line in PDF file (PDF comment/highlight button)

8. Save the PDF and delete the text file

Did I miss anything, or get anything wrong here? I'm also concerned about step 7: how can it be done? And if the PDF file contains pictures or charts, will the line numbers in the text file and in the PDF file tally?
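Steps 5 and 6 could be sketched roughly as below. This is only a sketch under assumptions: the helper name _FilterHighScores and the sample lines are made up for illustration, and each record is assumed to look like "ID Name Score" separated by whitespace, which may not match the real layout of the exported text.

```autoit
#include <Array.au3>

; Hypothetical helper (not part of any UDF): returns the lines whose score is >= $iMinScore.
; Assumption: a record line contains "<digits> <letters> <digits>" separated by whitespace.
Func _FilterHighScores(Const ByRef $aLines, $iMinScore = 90)
    If UBound($aLines) = 0 Then Return SetError(1, 0, 0)
    Local $aHits[UBound($aLines)]
    Local $iHits = 0
    For $i = 0 To UBound($aLines) - 1
        ; Flag 1 returns an array holding only the captured group (the score)
        Local $aMatch = StringRegExp($aLines[$i], '[0-9]+\s+[A-Za-z]+\s+([0-9]+)', 1)
        If @error Then ContinueLoop ; not a record line (header, blank line, ...)
        If Number($aMatch[0]) >= $iMinScore Then
            $aHits[$iHits] = $aLines[$i]
            $iHits += 1
        EndIf
    Next
    If $iHits = 0 Then Return SetError(2, 0, 0) ; no line met the criterion
    ReDim $aHits[$iHits]
    Return $aHits
EndFunc

; Made-up sample data standing in for the lines of the exported text file
Local $aSample[4] = ["ID    Name    Score", "1     Alice   95", "2     Bob     72", "3     Carol   90"]
Local $aHigh = _FilterHighScores($aSample)
If Not @error Then ConsoleWrite(_ArrayToString($aHigh, @CRLF) & @CRLF)
```

For a real file, the array could come from FileReadToArray, which reads all lines in one call, so steps 2 to 4 collapse into a single statement.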

 

 


OK, let me try to clarify. Basically, I have an input file, which is a PDF.

Currently, the user needs to open the file manually, check through the records, and highlight the records that have Score >= 90. The example I attached is the OUTPUT. The INPUT is the same file but without the records highlighted.

What I want to achieve is to get this process done automatically using AutoIt.

And yes, the PDF file allows the user to edit it.

 


Do you have access to Word & Excel?

If so, does the function from my earlier post read all the columns of the input file (the one where no records have been highlighted)?


 

What I'm thinking is to read the PDF file with the function above, move the data into Excel, highlight the records as requested and save the result as a PDF file. It might be a long-ass way about it, but it would get you what you want. I would have thought there's an easier solution; I just don't know of one.
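The Excel half of that idea might look something like the sketch below. It is untested against the real file and rests on assumptions: the sample rows are made up, the output file name is hypothetical, and the data is assumed to have already been parsed into ID/Name/Score columns.

```autoit
#include <Excel.au3>

; Made-up sample rows; in practice these would be parsed from the text read out of the PDF
Local $aRows[4][3] = [["ID", "Name", "Score"], [1, "Alice", 95], [2, "Bob", 72], [3, "Carol", 90]]

; Start a hidden Excel instance
Local $oExcel = _Excel_Open(False)
If @error Then
    ConsoleWrite('Error: _Excel_Open' & @CRLF)
    Exit
EndIf

; Create a workbook with one sheet and write the rows starting at A1
Local $oBook = _Excel_BookNew($oExcel, 1)
If @error Then
    ConsoleWrite('Error: _Excel_BookNew' & @CRLF)
    _Excel_Close($oExcel, False)
    Exit
EndIf
_Excel_RangeWrite($oBook, Default, $aRows, "A1")

; Highlight every data row whose score is 90 or above (row 0 of the array is the header)
Local $oSheet = $oBook.ActiveSheet
For $i = 1 To UBound($aRows) - 1
    If $aRows[$i][2] >= 90 Then
        ; Excel colours are BGR values: 0x00FFFF is yellow
        $oSheet.Range("A" & ($i + 1) & ":C" & ($i + 1)).Interior.Color = 0x00FFFF
    EndIf
Next

; Export the workbook as PDF (hypothetical output name), then close without saving the workbook
_Excel_Export($oExcel, $oBook, @ScriptDir & "\Score_highlighted.pdf")
If @error Then ConsoleWrite('Error: _Excel_Export' & @CRLF)
_Excel_BookClose($oBook, False)
_Excel_Close($oExcel, False)
```

Note that the exported PDF is a new document built from the extracted data, so it will not look like the original PDF; whether that is acceptable depends on what the output is used for.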


OK, I wrote some code, shown below.

Func _XPDF_ToText($sPDFFile, $sTXTFile, $iFirstPage = 1, $iLastPage = 0, $bLayout = True)
    Local $sXPDFToText = @ScriptDir & "\pdftotext.exe"
    Local $sOptions

    If NOT FileExists($sPDFFile) Then Return SetError(1, 0, 0)
    If NOT FileExists($sXPDFToText) Then Return SetError(2, 0, 0)

    If $iFirstPage <> 1 Then $sOptions &= " -f " & $iFirstPage
    If $iLastPage <> 0 Then $sOptions &= " -l " & $iLastPage
    If $bLayout = True Then $sOptions &= " -layout"

    Local $iReturn = ShellExecuteWait ( $sXPDFToText , $sOptions & ' "' & $sPDFFile & '" "' & $sTXTFile & '"', @ScriptDir, "", @SW_HIDE)
    If $iReturn = 0 Then Return 1

    Return 0

EndFunc
#include <MsgBoxConstants.au3>
#include <File.au3>

_XPDF_ToText("C:\Users\Duc Phu\Desktop\Score.pdf","C:\Users\Duc Phu\Desktop\temp.txt",1,0,true)

; Open the temporary text file
Local $hFileOpen = FileOpen("C:\Users\Duc Phu\Desktop\temp.txt",0)
; Read the first line of the file using the handle returned by FileOpen
Local $sFileRead = FileReadLine($hFileOpen, 1)
; Retrieve the number of lines in the temp file
Local $iCountLines = _FileCountLines($hFileOpen)
Local $ReadLine[$iCountLines]
Local $ReadLineFull[$iCountLines]
Local $ReadLineScore[$iCountLines]

For $i = 1 to $iCountLines
$ReadLine[$i-1] = FileReadLine("C:\Users\Duc Phu\Desktop\temp.txt",$i)
Local $RegResult = StringRegExp($ReadLine[$i-1],'[0-9]+\s+[A-Za-z]+\s+([0-9]+)',2)
If Not @error Then
    ;If regex found matches
    $ReadLineScore[$i-1] = $RegResult[1]
    ;If score >=90 then write the match to $ReadLineFull array. We need this array for PDF searching and highlighting later on
    If $ReadLineScore[$i-1] >= 90 Then
        $ReadLineFull[$i-1] = $RegResult[0]
    Else
        $ReadLineFull[$i-1] = "-"
    EndIf
Else
    ; If not
    $ReadLineFull[$i-1] = "-"
    $ReadLineScore[$i-1] = "-"
EndIf
Next

; Close the handle returned by FileOpen.
FileClose($hFileOpen)

; At this point the $ReadLineFull array is filled. Loop through it and, for each value <> "-":
; 1. Open the PDF file
; 2. Send Ctrl+F to open the Acrobat Reader search box
; 3. Send Ctrl+V to paste the value into the search box
; 4. Wait a second to make sure the search result is returned
; 5. Click on the full matched result; the line containing the result should be selected
; 6. Send a control click to the highlight button to highlight the line

 

Do you think the idea described in the comments at the end of the script is doable?


 


I managed to complete this project. Below is the full code.

#include <MsgBoxConstants.au3>
#include <FileConstants.au3>
#include <File.au3>

Func _XPDF_ToText($sPDFFile, $sTXTFile, $iFirstPage = 1, $iLastPage = 0, $bLayout = True)

    Local $sXPDFToText = @ScriptDir & "\pdftotext.exe"
    Local $sOptions

    If NOT FileExists($sPDFFile) Then Return SetError(1, 0, 0)
    If NOT FileExists($sXPDFToText) Then Return SetError(2, 0, 0)

    If $iFirstPage <> 1 Then $sOptions &= " -f " & $iFirstPage
    If $iLastPage <> 0 Then $sOptions &= " -l " & $iLastPage
    If $bLayout = True Then $sOptions &= " -layout"

    Local $iReturn = ShellExecuteWait ( $sXPDFToText , $sOptions & ' "' & $sPDFFile & '" "' & $sTXTFile & '"', @ScriptDir, "", @SW_HIDE)
    If $iReturn = 0 Then Return 1

    Return 0

EndFunc

Func FileSelection()
    ; Display an open dialog to select a list of file(s).
    Global $sFileOpenDialog = FileOpenDialog("Select file(s)", @DesktopDir & "\", "Adobe PDF Files (*.pdf)", BitOR($FD_FILEMUSTEXIST, $FD_MULTISELECT))
    If @error Then
        ; Display the error message.
        MsgBox(0, "", "No file(s) were selected.")
        Exit
        ; Change the working directory (@WorkingDir) back to the location of the script directory as FileOpenDialog sets it to the last accessed folder.
        ;FileChangeDir(@ScriptDir)
    Else
        ; Change the working directory (@WorkingDir) back to the location of the script directory as FileOpenDialog sets it to the last accessed folder.
        ;FileChangeDir(@ScriptDir)

        ; Replace instances of "|" with @CRLF in the string returned by FileOpenDialog.
        ;$sFileOpenDialog = StringReplace($sFileOpenDialog, "|", @CRLF)

        ; Display the list of selected files.
        ;MsgBox(0, "", "You chose the following files:" & @CRLF & $sFileOpenDialog)
    EndIf
EndFunc


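FileSelection()
; For a multiple selection FileOpenDialog returns "directory|file1|file2|...",
; but for a single selected file it returns just the full path.
Local $FilesArr = StringSplit($sFileOpenDialog, "|")
Local $bSingle = ($FilesArr[0] = 1)
Local $iFileCount = $bSingle ? 1 : $FilesArr[0] - 1
Local $Dir = $FilesArr[1]
Local $File[$iFileCount]

For $iFile = 0 To $iFileCount - 1
    $File[$iFile] = $bSingle ? $FilesArr[1] : $FilesArr[$iFile + 2]
    Local $CurrentFile = $bSingle ? $File[$iFile] : $Dir & "\" & $File[$iFile]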

    _XPDF_ToText($CurrentFile,@ScriptDir & "\temp.txt",1,0,true)

    ; Open temp text file
    Local $hFileOpen = FileOpen(@ScriptDir & "\temp.txt",0)
    ; Retrieve the number of lines in the temp file
    Local $iCountLines = _FileCountLines($hFileOpen)
    Local $ReadLine[$iCountLines]
    Local $ReadLineFull[$iCountLines]
    Local $ReadLineScore[$iCountLines]

    For $iLine = 1 to $iCountLines
        $ReadLine[$iLine-1] = FileReadLine(@ScriptDir & "\temp.txt",$iLine)
        Local $RegResult = StringRegExp($ReadLine[$iLine-1],'[0-9]+\s+[A-Za-z]+\s+([0-9]+)',2)
        If Not @error Then
            ; The regex matched: element 1 holds the captured score
            $ReadLineScore[$iLine-1] = $RegResult[1]
            ; If score >= 90, store the whole line in $ReadLineFull. This array is needed for PDF searching and highlighting later on
            If $ReadLineScore[$iLine-1] >= 90 Then
                $ReadLineFull[$iLine-1] = $ReadLine[$iLine-1]
            Else
                $ReadLineFull[$iLine-1] = "-"
            EndIf
        Else
            ; No match: mark the line as not relevant
            $ReadLineFull[$iLine-1] = "-"
            $ReadLineScore[$iLine-1] = "-"
        EndIf
    Next

    ; Close the handle returned by FileOpen.
    FileClose($hFileOpen)

    ; At this point the $ReadLineFull array is filled. For each value <> "-":
    ; - Send Ctrl+F to open the Acrobat Reader search box
    ; - Send Ctrl+V to paste the value into the search box
    ; - Wait a second to make sure the search result is returned
    ; - Send the Enter key
    ; - Send a control click to the highlight button to highlight the line

    ; First, open the PDF file
    ShellExecute($CurrentFile,"","","",@SW_MAXIMIZE)

    ; Wait up to 5 seconds for the Acrobat window to become active
    $WinActive = WinWaitActive("[CLASS:AcrobatSDIWindow]", "", 5)

    If $WinActive = 0 Then
        MsgBox(0,"Error", "No Acrobat Reader window")
        Exit
    Else
        For $iLine = 1 to $iCountLines
            If $ReadLineFull[$iLine-1] <> "-" Then
                ; Search for the line text via the clipboard
                ClipPut($ReadLineFull[$iLine-1])
                Send("^f")
                Sleep(1000)
                Send("^v")
                Sleep(1000)
                Send("{ENTER}")
                Sleep(1000)
                ; Click the highlight button; the control instance and the click offset
                ; were found on my machine and may differ on another setup
                ControlFocus("[CLASS:AcrobatSDIWindow]","","[CLASS:AVL_AVView; INSTANCE:38]")
                Sleep(1000)
                ControlClick("[CLASS:AcrobatSDIWindow]","","[CLASS:AVL_AVView; INSTANCE:38]", "left", 1, 65, 10)
                Sleep(1000)
            EndIf
        Next
        Sleep(1000)
        Send("^s")
        Sleep(1000)
        WinClose("[CLASS:AcrobatSDIWindow]", "")
        Sleep(5000)
    EndIf

Next

MsgBox(0,"AutoProcess","Done. File(s) saved and closed.",5)

 


On 6/10/2021 at 11:39 AM, ducphu said:

Acrobat reader

No.

Reader is not an editor.

Have you thought about Acrobat Professional?

If so, take a look here:

 

 

Edited by mLipok


  • 2 weeks later...

I've been out for a week (family vacation).

It looks like this was printed from Word. Could you edit it in Word before printing it to PDF? (The document properties say "Application: Acrobat PDFMaker 20 for Word".)
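If the Word source is available, highlighting there before exporting could be sketched as follows. This is an untested sketch under assumptions: the file names are hypothetical, each record is assumed to be a single paragraph of the form "ID Name Score" (if the records sit in a Word table, every cell is its own paragraph and the loop would have to walk the table rows instead), and the regex is the one used earlier in this thread.

```autoit
#include <Word.au3>

; Hypothetical file names for the Word source and the PDF output
Local $sDocx = @ScriptDir & "\Score.docx"
Local $sPdf = @ScriptDir & "\Score_highlighted.pdf"

Local $oWord = _Word_Create(False) ; hidden Word instance
If @error Then
    ConsoleWrite('Error: _Word_Create' & @CRLF)
    Exit
EndIf
Local $oDoc = _Word_DocOpen($oWord, $sDocx)
If @error Then
    _Word_Quit($oWord)
    ConsoleWrite('Error: _Word_DocOpen' & @CRLF)
    Exit
EndIf

; Highlight every paragraph that looks like a record with a score of 90 or above
For $oPara In $oDoc.Paragraphs
    Local $aMatch = StringRegExp($oPara.Range.Text, '[0-9]+\s+[A-Za-z]+\s+([0-9]+)', 1)
    If @error Then ContinueLoop ; not a record paragraph
    If Number($aMatch[0]) >= 90 Then $oPara.Range.HighlightColorIndex = 7 ; 7 = wdYellow
Next

; Save a PDF copy (17 = wdFormatPDF), then close the document without saving the highlights into it
_Word_DocSaveAs($oDoc, $sPdf, 17)
If @error Then ConsoleWrite('Error: _Word_DocSaveAs' & @CRLF)
_Word_DocClose($oDoc, $WdDoNotSaveChanges)
_Word_Quit($oWord)
```

This avoids driving the Acrobat window with Send and ControlClick, but it only helps when the original Word document can be obtained.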
