yclee99

PDF Search


Dear All,

I have a large number of PDF files, and I need to check whether each of them contains certain strings (equipment tag names). For example, I have 500 PDF files and I want to know which equipment tag names each of them contains. There are 100 equipment tag names in total. In the past, we did this manually by opening each PDF file and searching for the tag names. We don't really care which page a tag name appears on; what is important to us is which tag names each PDF file contains. This process is really time-consuming. I wonder if there is a way to do it automatically.

My original idea is to convert each PDF to a text file (Xpdf's pdftotext) and search the text for the equipment tag names. Is there a better way to deal with this?

P.S.: I just found out that pdftotext is not free for commercial use, which is my case, so I am trying to avoid it.
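
Whichever tool ends up producing the text, the search step itself is small. Below is a minimal sketch of it, assuming the PDF text is already available as a string. The helper name _TagsPresent, the tag names, and the sample text are all made up for illustration and are not from this thread.

```autoit
; Hypothetical helper (not from the thread): return a 0-based array of 1/0 flags,
; one per tag, telling whether each tag occurs anywhere in $sText.
Func _TagsPresent($sText, $aTags)
    Local $aFound[UBound($aTags)]
    For $i = 0 To UBound($aTags) - 1
        ; StringInStr returns the 1-based position of the first match, or 0 if there is none.
        ; By default the comparison is not case-sensitive.
        If StringInStr($sText, $aTags[$i]) > 0 Then
            $aFound[$i] = 1
        Else
            $aFound[$i] = 0
        EndIf
    Next
    Return $aFound
EndFunc

; Example with made-up tag names and text
Local $aTags[3] = ["P-101", "V-205", "E-300"]
Local $aFound = _TagsPresent("Pump P-101 feeds vessel V-205.", $aTags)
For $i = 0 To UBound($aTags) - 1
    ConsoleWrite($aTags[$i] & ": " & $aFound[$i] & @CRLF)
Next
```

Note that this is a plain substring match, so a tag such as "P-10" would also be reported for a PDF that only contains "P-101". If that matters for your tag list, the match would need to be tightened.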


From my understanding, the Xpdf command-line tools are open source and free to use. It's XpdfReader that isn't free, and you don't need that.

Edit: I read closer, and if you're going to be selling or distributing your app, then yes, you need a license.

;====================================================================================================================
;   Get the text out of a PDF file and return it as a String value
;   On failure @error is set:
;       1 = pdftotext.exe was not found next to the script (returns 0)
;       2 = pdftotext.exe wrote to StdErr (returns that error text; @extended can only hold an integer)
;   $bMaintainLayout:       True = (Default) This will try to keep the spacing as it shows in the PDF file
;                           False = This will just display the text without any layout
;====================================================================================================================
Func _XPDF_GetText($sPDFFile, $bMaintainLayout = True)
    Local $sXpdftotext = @ScriptDir & "\pdftotext.exe"
    If Not FileExists($sXpdftotext) Then Return SetError(1, 0, 0)

    Local $sLayout = " "                                        ;Declared Local so it does not leak into the global scope
    If $bMaintainLayout Then $sLayout = " -layout "

    ;Run the converter, writing to "-" (StdOut), and capture StdOut "2" and StdErr "4"
    Local $iPid = Run('"' & $sXpdftotext & '"' & $sLayout & '"' & $sPDFFile & '" "-"', "", @SW_HIDE, 2 + 4)
    ProcessWaitClose($iPid)                                     ;Wait for it to finish before reading StdOut and StdErr

    Local $sResult = ""
    While 1                                                     ;Collect all the available text from the PDF file
        $sResult &= StdoutRead($iPid)
        If @error Then ExitLoop                                 ;@error is set once the end of the output is reached
    WEnd

    Local $sErrOutput = ""
    While 1                                                     ;Collect anything pdftotext reported on StdErr
        $sErrOutput &= StderrRead($iPid)
        If @error Then ExitLoop
    WEnd
    If $sErrOutput <> "" Then Return SetError(2, 0, $sErrOutput) ;Anything on StdErr is treated as a failure

    Return $sResult                                             ;Return the contents of the PDF as a string

EndFunc

 

Edited by BigDaddyO

hmm... I guess I have to have a signature...


Another approach would be to use the filtdump.exe utility (from the Windows SDK) in conjunction with a PDF IFilter. Both are free and very easy to use.

An IFilter allows Windows Search (formerly Windows Indexing Service) to parse text from non-textual files for indexing and searching purposes. FYI, the IFilter for MS Office files (and OpenOffice files) is installed by default with MS Office (and is also provided as a standalone installer), which is why you can search for text inside Office files. The IFilter for PDF is a free third-party component provided by Adobe (other vendors, like Foxit, also provide a PDF IFilter, but that one is paid).

Once the PDF IFilter is installed, download the Windows SDK and get the command-line utility filtdump.exe (I use the Windows 7 SDK, but I see it exists in the Windows 10 SDK as well). filtdump.exe accepts an input file as a parameter, calls the appropriate IFilter to parse the text from that file, and then outputs the text to a new txt file. You can then search in that file.


Thanks, guys.

I managed to develop the function I was looking for. Special thanks to BigDaddyO, as I am using the _XPDF_GetText function to achieve the functionality.

My next task is to improve the tool's performance (speed). Personally, I think line 3 (the first _Excel_RangeWrite call) should be taken out of the loop so that the whole range is written with a single _Excel_RangeWrite, but I am struggling to find a way. Any advice?

 

For $i = 1 To UBound($aDatasheet) - 1
    $readPDF = _XPDF_GetText($sDatasheetDir & $aDatasheet[$i])
    _Excel_RangeWrite($oNewWorkbook, "Sheet1", $sDatasheetDir & $aDatasheet[$i], "A" & $i + 1)
    For $j = 1 To UBound($aEquipment) - 1
        $readPDFoutput = StringReplace($readPDF, $aEquipment[$j], $aEquipment[$j])
        $iReplacedCount = @Extended
        If $iReplacedCount Then
            $iCol = $j + 1
            $sLetter = _Excel_ColumnToLetter($iCol)
            _Excel_RangeWrite($oNewWorkbook, "Sheet1", $iReplacedCount, $sLetter & $i + 1)
        EndIf
    Next
    _Excel_BookSave($oNewWorkbook)
Next

 

 


Do your initial _Excel_RangeRead and pull the entire spreadsheet into a 2D array.

Then, as you go through, update the array with your new data.

Once all of the checking is finished, you can write out the entire 2D array to a new spreadsheet in a single _Excel_RangeWrite.

$aDataSheet[0][0] = A1, $aDataSheet[99][9] = J100, etc...
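
To make the index mapping concrete, here is a small sketch. It assumes the array was read starting at A1, so both indices are 0-based while Excel rows and columns are 1-based. The helper name _IndexToCell is hypothetical and is not part of the Excel UDF.

```autoit
; Hypothetical helper: convert 0-based [row][col] array indices to an A1-style cell address,
; assuming array element [0][0] corresponds to cell A1.
Func _IndexToCell($iRow, $iCol)
    Local $sLetters = ""
    Local $iRem
    Local $iN = $iCol + 1                       ; 1-based column number
    While $iN > 0
        $iRem = Mod($iN - 1, 26)                ; 0..25 maps to A..Z
        $sLetters = Chr(65 + $iRem) & $sLetters
        $iN = Int(($iN - 1 - $iRem) / 26)       ; move on to the next "digit" to the left
    WEnd
    Return $sLetters & ($iRow + 1)              ; rows are 1-based in Excel
EndFunc

ConsoleWrite(_IndexToCell(0, 0) & @CRLF)        ; A1
ConsoleWrite(_IndexToCell(99, 9) & @CRLF)       ; J100
```

So while looping over PDFs and tags, you only touch $aDataSheet[$iRow][$iCol] in memory and never need a cell address until the final write.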


On 4/12/2019 at 8:16 PM, BigDaddyO said:

Do your initial _Excel_RangeRead and pull in the entire spreadsheet into a 2D array.

Then as you go through, you update the array with your new data.

Once all of the checking is finished, you can write out the entire 2D array to a new spreadsheet in a single _Excel_RangeWrite.

$aDataSheet[0][0] = A1, $aDataSheet[99][9] = J100, etc...

My current code calls _Excel_RangeWrite when a match is found and skips it when there is no match.

I will try to modify my code as you suggest above. The problem is that I don't know how to write a 2D array starting from a certain cell (e.g. from B2). I was not able to find an example in the forum.
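
For what it's worth, here is a minimal sketch of one way this might be done. As far as I understand the Excel UDF, when _Excel_RangeWrite is given an array and a single cell as the range, it extends the range from that cell to fit the whole array. The values in $aResults are made up, and the sketch needs Excel installed to run.

```autoit
#include <Excel.au3>

; Made-up results: 2 PDFs (rows) x 3 equipment tags (columns)
Local $aResults[2][3] = [[1, 0, 2], [0, 4, 0]]

Local $oExcel = _Excel_Open()
If @error Then Exit ConsoleWrite("Could not start Excel, @error = " & @error & @CRLF)

Local $oWorkbook = _Excel_BookNew($oExcel)
If @error Then Exit ConsoleWrite("Could not create a workbook, @error = " & @error & @CRLF)

; "B2" is a single cell: the UDF is expected to resize the target range to the
; array's dimensions, so the data should land in B2:D3 of the active sheet.
_Excel_RangeWrite($oWorkbook, Default, $aResults, "B2")
If @error Then ConsoleWrite("_Excel_RangeWrite failed, @error = " & @error & @CRLF)
```

With this pattern, the inner loop only fills $aResults, and the single _Excel_RangeWrite (plus one _Excel_BookSave) happens after the loop finishes.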
