zfisherdrums

PDF Comparison Helper


At work, I needed a way to compare PDFs. I stumbled across the XPDF toolset some time ago. What I wanted was a utility that I could pass two PDFs into and have them rendered as text for a text-based comparison in WinMerge.

Now, WinMerge does have a plugin that provides PDF comparison, but I am not thrilled with what it does to the layout. At least pdftotext.exe maintains a semblance of the layout (provided you use the appropriate args).

That said, here is the script that I'm using. It takes two PDF files, passed in on the command line or named in Config.ini. For each one it renders a text file using a PDF-to-text utility of your choosing, applies regex masking, and writes a new, transformed file suitable for text-based comparison.
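To make the intermediate files concrete, here is a minimal sketch of how the file names are derived for one input. It assumes the converter writes its output next to the PDF with a .txt extension, which is what pdftotext.exe does when no output name is given. The variable names are illustrative only.

```autoit
; Sketch only: $sPdf, $sTxt and $sMasked are made-up names for illustration.
Local $sPdf = "A.pdf"

; Step 1: the PDF-to-text tool is expected to produce A.txt
Local $sTxt = StringReplace($sPdf, ".pdf", ".txt")

; Step 2: the regex-masked copy is written alongside it as A_RegEx.txt
Local $sMasked = StringReplace($sTxt, ".txt", "_RegEx.txt")

ConsoleWrite($sPdf & " -> " & $sTxt & " -> " & $sMasked & @CRLF)
```

The two _RegEx.txt files are the ones that end up in the comparison tool.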

The Config.ini also holds the command lines for the PDF conversion tool (pdftotext.exe in my case) and the comparison tool (WinMerge in my case). Because the command lines for these tools live in the config file, you can swap them out to suit your needs. I just like compiling once and letting the config file handle the user preference stuff.
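As a rough illustration of how such a command-line template gets filled in, the sketch below runs the PDFCommandLine template from the Config.ini further down through StringFormat. The file names are made up. StringFormat also treats \\ in the template as an escaped backslash, which is presumably why the WinMerge path in the config doubles them.

```autoit
; Sketch only: file names are placeholders, not part of the original script.
Local $sTemplate = 'pdftotext.exe -layout "%s"'

; Each %s in the template is replaced by the next argument.
ConsoleWrite(StringFormat($sTemplate, "A.pdf") & @CRLF)

; In the template itself, \\ collapses to a single backslash.
Local $sCompare = 'C:\\Program Files\\WinMerge\\WinMerge.exe "%s" "%s"'
ConsoleWrite(StringFormat($sCompare, "A_RegEx.txt", "B_RegEx.txt") & @CRLF)
```

Only the template is processed for escapes; backslashes inside the file paths passed as arguments are left alone.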

Finally, regular expressions provide a level of masking to prevent false alarms. Each key in the [RegEx] section of the config file is the text that replaces whatever matches the search pattern given as its value. Placing a star (*) before the key removes the matched text altogether, and placing a hash (#) before the key skips that entry. For example, the following entry replaces any long-format US date with <<<LongDate>>>:

<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d
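Here is a small sketch of what that entry does to a line of text, using StringRegExpReplace the same way the script does. The sample sentence is invented for illustration.

```autoit
; Sketch only: the sample text is made up.
Local $sPattern = "(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d"
Local $sText = "Statement generated on March 3, 2009 for account 42."

; Normal key: the key text replaces every match.
ConsoleWrite(StringRegExpReplace($sText, $sPattern, "<<<LongDate>>>") & @CRLF)

; Key prefixed with *: the match is replaced with an empty string, i.e. removed.
ConsoleWrite(StringRegExpReplace($sText, $sPattern, "") & @CRLF)
```

With the mask in place, two PDFs that differ only in their print date produce identical text at that spot, so the diff stays quiet.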

So, download the attached zip, then the XPDF toolset from the link shown earlier and drop the pdftotext.exe in the folder to run the example. I won't distribute that exe here for obvious reasons.

I'm attaching the code I wrote here:

PDFComparisonHelper.zip

For all you copy-and-pasters out there:

#include <File.au3>

Global $RegExpressions, $PDF_A, $PDF_B, $PDFCommandLine, $CompareCommandLine
Global $ConfigPath = @ScriptDir & "\Config.INI"

Func DoAllConversions( $file )
    ; Runs one PDF through both steps and returns the path of the masked text file
    If Not FileExists( $file ) Then Die( "File cannot be found: " & $file )
    Local $TextConvertedFile = ConvertPDFtoText( $file )
    If $TextConvertedFile = "" Then Die( "File could not be converted: " & $file )
    Return ApplyRegExTranformations( $TextConvertedFile )
EndFunc

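Func ConvertPDFtoText( $file )
    Local $cmdLine = StringFormat( $PDFCommandLine, $file )
    ConsoleWrite( "Converting PDF to Text ---> " & $cmdLine & @CRLF)
    Local $exit = RunWait( $cmdLine, @ScriptDir )

    ; RunWait returns 0 and sets @error when the tool cannot be started,
    ; so check @error as well as the exit code
    If @error Or $exit <> 0 Then Return ""

    ; The converter is expected to write a .txt file next to the PDF
    Return StringReplace( $file, ".pdf", ".txt" )
EndFunc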

Func ApplyRegExTranformations( $filename )
    Local $FileRaw = FileRead( $filename )

    ; $RegExpressions[$i][0] is the key, $RegExpressions[$i][1] is the pattern
    For $i = 1 To $RegExpressions[0][0]
        Switch StringMid( $RegExpressions[$i][0], 1, 1)
            Case "*"
                ; Remove the matched text altogether
                ReplaceRegEx( $FileRaw, $RegExpressions[$i][1] )
            Case "#"
                ; SKIP THIS ONE
            Case Else
                ; Replace the matched text with the key
                ReplaceRegEx( $FileRaw, $RegExpressions[$i][1], $RegExpressions[$i][0] )
        EndSwitch
        ConsoleWrite($RegExpressions[$i][0] & @CRLF)
    Next

    Local $NewFile = StringReplace( $filename, ".txt", "_RegEx.txt" )
    FileDelete( $NewFile )
    FileWrite( $NewFile, $FileRaw )

    ;~ ConsoleWrite( $FileRaw & @CRLF )
    Return $NewFile
EndFunc

Func ReplaceRegEx( ByRef $text, $pattern, $replace = "" )
    $text = StringRegExpReplace( $text, $pattern, $replace )
EndFunc

Func DoComparisons( $fileA, $fileB)
    Local $commandLine = StringFormat( $CompareCommandLine, $fileA, $fileB )
    ConsoleWrite( $commandLine & @CRLF)
    RunWait( $commandLine, @ScriptDir ) 
EndFunc

Func Die( $Message )
    MsgBox(0, @ScriptName, $Message )
    Exit
EndFunc

; //////////////////////////////////////////////////////////////////////////////////////////////////
; //////////////////////////////////////////////////////////////////////////////////////////////////
;                                           START HERE
; //////////////////////////////////////////////////////////////////////////////////////////////////
; //////////////////////////////////////////////////////////////////////////////////////////////////

; Determine if Config file exists
If Not FileExists( $ConfigPath ) Then Die( "Config file cannot be found" )

; Define Regular Expressions
$RegExpressions = IniReadSection( $ConfigPath, "RegEx" )
If @error Then Die( "RegEx section not found in Config file" )

; Define PDF Command Line
$PDFCommandLine = IniRead( $ConfigPath, "Paths", "PDFCommandLine", "" )
If $PDFCommandLine = "" Then Die( "PDF Command Line not found in Config file" )

; Define Compare Tool Command Line
$CompareCommandLine = IniRead( $ConfigPath, "Paths", "CompareCommandLine", "" )
If $CompareCommandLine = "" Then Die( "Compare Command Line not found in Config file" )
    
; Read in File A and File B
If $CmdLine[0] >= 2 Then 
    $PDF_A = $CmdLine[1]
    $PDF_B = $CmdLine[2]
Else
    $PDF_A = IniRead( $ConfigPath, "Paths", "LeftPath", "" )
    If $PDF_A = "" Then Die( "No File A path provided" )

    $PDF_B = IniRead( $ConfigPath, "Paths", "RightPath", "" )
    If $PDF_B = "" Then Die( "No File B path provided" )
EndIf 

; Compare the two text files
DoComparisons( DoAllConversions( $PDF_A ), DoAllConversions( $PDF_B ))

Config.ini

[Paths]
LeftPath=A.pdf
RightPath=B.pdf
PDFCommandLine=pdftotext.exe -layout "%s"
CompareCommandLine=""C:\\Program Files\\WinMerge\\WinMerge.exe" "%s" "%s"

[RegEx]
<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d
<<<Money>>>=\$\d\d?\d?,?\d?\d?\d?,?\d?\d?\d?\.?\d?\d?\*?\*?
<<<NumericValue>>>=\d\d?\d?,?\d?\d?\d?,?\d?\d?\d?\.?\d?\d?\*?\*?


[Sandbox]
#<<<ExtraBlankLine>>>=(\r\n){2,}

Let me know if you have any questions,

Zach...
