PDF Comparison Helper


At work, I needed a way to compare PDFs. I stumbled across the XPDF toolset some time ago. What I wanted was a utility that I could pass two PDFs into and have them rendered as text for a text-based comparison in WinMerge.

Now, WinMerge does have a plugin that provides PDF comparison, but I am not thrilled with what it does to the layout. At least pdftotext.exe maintains a semblance of the layout (provided you use the appropriate args).

That said, here is the script that I'm using. It takes two PDF files, passed in via the command line or the Config.ini. For each one it renders a text file using a PDF-to-text utility of your choosing, applies regex masking, and writes a new, transformed file suitable for text-based comparison.

The Config.ini also holds the command lines for the PDF conversion tool (pdftotext.exe in my case) and the comparison tool (WinMerge in my case). Because the command lines for these tools live in the config file, you can swap them out to suit your needs. I just like compiling once and letting the config file handle the user preference stuff.
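To make the template idea concrete, here is a minimal sketch of how a command line template gets expanded with StringFormat. The PDF path is a made-up example, not something from my setup. As far as I know, StringFormat treats a backslash in the template itself as an escape character, which is why backslashes in the configured tool path are doubled, while backslashes in the substituted file name pass through as they are.

```autoit
; Template as it would be read from Config.ini (same value as PDFCommandLine).
Local $sTemplate = 'pdftotext.exe -layout "%s"'

; Hypothetical input file, for illustration only.
Local $sPdf = "C:\Reports\invoice.pdf"

; %s is replaced with the file path; the quotes protect paths with spaces.
Local $sCommand = StringFormat($sTemplate, $sPdf)
ConsoleWrite($sCommand & @CRLF)
```

Swapping in a different converter should then only be a matter of changing the template in the config file, provided the new tool also writes a .txt file next to the PDF.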

Finally, regular expressions provide a level of masking to prevent false alarms. The key used in the Config file is the text that replaces whatever matches the search pattern defined in the value. Placing an asterisk (*) before the key removes the matching text altogether. For example, this entry replaces any long-format US date with <<<LongDate>>>:

<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d
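Here is a small sketch of what that rule does to two lines that differ only by date. The sample sentences are invented for illustration; the pattern and the replacement text are the ones from the entry above.

```autoit
; Long-date pattern, copied from the Config.ini entry.
Local $sPattern = "(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d"

; Hypothetical lines from the "left" and "right" documents.
Local $sLeft = "Statement issued March 3, 2009"
Local $sRight = "Statement issued April 14, 2009"

; After masking, both lines read the same, so the diff tool has nothing to flag.
ConsoleWrite(StringRegExpReplace($sLeft, $sPattern, "<<<LongDate>>>") & @CRLF)
ConsoleWrite(StringRegExpReplace($sRight, $sPattern, "<<<LongDate>>>") & @CRLF)
```

Both calls should print "Statement issued <<<LongDate>>>", which is the whole point: dates that are expected to change no longer show up as differences.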

So, download the attached zip, then the XPDF toolset from the link shown earlier and drop the pdftotext.exe in the folder to run the example. I won't distribute that exe here for obvious reasons.

I'm attaching the code I wrote here:


For all you copy-and-pasters out there:

#include <File.au3>

Global $RegExpressions, $PDF_A, $PDF_B, $PDFCommandLine, $CompareCommandLine
Global $ConfigPath = @ScriptDir & "\Config.INI"

; Convert one PDF to text, mask it, and return the path of the masked file
Func DoAllConversions( $file )
    If Not FileExists( $file ) Then Die( "File cannot be found: " & $file )
    Local $TextConvertedFile = ConvertPDFtoText( $file )
    If $TextConvertedFile = "" Then Die( "File could not be converted: " & $file )
    Local $RegExConvertedFile = ApplyRegExTranformations( $TextConvertedFile )
    Return $RegExConvertedFile
EndFunc

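; Run the configured converter; return the .txt path, or "" on failure
Func ConvertPDFtoText( $file )
    ; Named $convertCommand so it does not shadow the reserved $CmdLine array
    Local $convertCommand = StringFormat( $PDFCommandLine, $file )
    ConsoleWrite( "Converting PDF to Text ---> " & $convertCommand & @CRLF)
    Local $exit = RunWait( $convertCommand, @ScriptDir )
    If $exit = 0 Then
        ; pdftotext writes its output next to the PDF with a .txt extension
        Return StringReplace( $file, ".pdf", ".txt" )
    Else
        Return ""
    EndIf
EndFunc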

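; Apply every rule in the [RegEx] section and write the result to *_RegEx.txt
Func ApplyRegExTranformations( $filename )
    Local $FileRaw = FileRead( $filename )
    For $i = 1 To $RegExpressions[0][0]
        ; The first character of the key selects the action
        Switch StringMid( $RegExpressions[$i][0], 1, 1)
            Case "*"
                ; Remove the matching text altogether
                ReplaceRegEx( $FileRaw, $RegExpressions[$i][1] )
            Case "#"
                ; Rule is disabled; do nothing
            Case Else
                ; Replace the matching text with the key
                ReplaceRegEx( $FileRaw, $RegExpressions[$i][1], $RegExpressions[$i][0] )
        EndSwitch
        ConsoleWrite($RegExpressions[$i][0] & @CRLF)
    Next
    Local $NewFile = StringReplace( $filename, ".txt", "_RegEx.txt" )
    FileDelete( $NewFile )
    FileWrite( $NewFile, $FileRaw )
;~ ConsoleWrite( $FileRaw & @CRLF )
    Return $NewFile
EndFunc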

; Replace every match of $pattern in $text (modified in place via ByRef)
Func ReplaceRegEx( ByRef $text, $pattern, $replace = "" )
    $text = StringRegExpReplace( $text, $pattern, $replace )
EndFunc

; Launch the configured comparison tool on the two masked text files
Func DoComparisons( $fileA, $fileB)
    Local $commandLine = StringFormat( $CompareCommandLine, $fileA, $fileB )
    ConsoleWrite( $commandLine & @CRLF)
    RunWait( $commandLine, @ScriptDir )
EndFunc

; Show an error message and stop the script
Func Die( $Message )
    MsgBox(0, @ScriptName, $Message )
    Exit
EndFunc

; //////////////////////////////////////////////////////////////////////////////////////////////////
; //////////////////////////////////////////////////////////////////////////////////////////////////
;                                           START HERE
; //////////////////////////////////////////////////////////////////////////////////////////////////
; //////////////////////////////////////////////////////////////////////////////////////////////////

; Determine if Config file exists
If Not FileExists( $ConfigPath ) Then Die( "Config file cannot be found" )

; Define Regular Expressions
$RegExpressions = IniReadSection( $ConfigPath, "RegEx" )
If @error Then Die( "RegEx section not found in Config file" )

; Define PDF Command Line
$PDFCommandLine = IniRead( $ConfigPath, "Paths", "PDFCommandLine", "" )
If $PDFCommandLine = "" Then Die( "PDF Command Line not found in Config file" )

; Define Compare Tool Command Line
$CompareCommandLine = IniRead( $ConfigPath, "Paths", "CompareCommandLine", "" )
If $CompareCommandLine = "" Then Die( "Compare Command Line not found in Config file" )
; Read in File A and File B: command line first, Config.ini as the fallback
If $CmdLine[0] >= 2 Then
    $PDF_A = $CmdLine[1]
    $PDF_B = $CmdLine[2]
Else
    $PDF_A = IniRead( $ConfigPath, "Paths", "LeftPath", "" )
    If $PDF_A = "" Then Die( "No File A path provided" )

    $PDF_B = IniRead( $ConfigPath, "Paths", "RightPath", "" )
    If $PDF_B = "" Then Die( "No File B path provided" )
EndIf

; Compare the two text files
DoComparisons( DoAllConversions( $PDF_A ), DoAllConversions( $PDF_B ))


Config.ini:

[Paths]
PDFCommandLine=pdftotext.exe -layout "%s"
CompareCommandLine=""C:\\Program Files\\WinMerge\\WinMerge.exe" "%s" "%s"

[RegEx]
<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d
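
For reference, here is a sketch of a few more entries the script would understand. The section and key names (Paths, RegEx, LeftPath, RightPath) come from the IniRead and IniReadSection calls in the script; the file paths and the two extra patterns are invented for illustration, so adjust them to your own documents. It also assumes keys starting with * or # are read back unchanged, which is what the script relies on.

```ini
[Paths]
; Fallback inputs, used only when no files are passed on the command line.
; Hypothetical paths; single backslashes, since these are not format templates.
LeftPath=C:\Reports\old.pdf
RightPath=C:\Reports\new.pdf

[RegEx]
; Leading * : matching text is removed altogether (hypothetical pattern).
*PageFooter=Page \d+ of \d+
; Leading # : the script skips the rule, a handy way to switch one off.
#<<<Time>>>=\d\d?:\d\d
```

With LeftPath and RightPath filled in, you can double-click the compiled script instead of passing the two PDFs as arguments.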


Let me know if you have any questions,


Edited by zfisherdrums
