Finding differences in PDF documents

The portable document format, or PDF, has become indispensable as a way to exchange data across various platforms and operating systems. This is especially true for documents that should be readable but not easy to modify.

In this article, I will examine how to determine whether two PDF documents are identical and, if they are not, how to find the differences in their content and appearance. In particular, I will look at five programs: Md5sum [1], Pdftotext [2], Pdfdiff [3], Comparepdf [4], and DiffPDF [5], all of which can be found in the repositories of Ubuntu and most other distributions.

Comparing Files

Md5sum can be found on practically every Linux system. In Debian GNU/Linux and Ubuntu, it is part of the coreutils [6] package. The primary purpose of Md5sum is to generate 128-bit hash values using the MD5 algorithm. In simplified terms, such a hash value serves as the digital fingerprint of a data set.

Hence, you can use Md5sum to generate a hash value for each of two PDF documents and then compare the results. Right away, it will be clear whether the documents are identical. If they are, the two hash values match, as they do for Debian-20150207.pdf and Debian-20150209.pdf in Listing 1.

Listing 1

Compare Hash Values

$ md5sum Debian-20150207.pdf Debian-20150208.pdf Debian-20150209.pdf
6d997a79b970eb8526f0d1662f740b45  Debian-20150207.pdf
5f91ffc412d95e3436faceb2e772e0e1  Debian-20150208.pdf
6d997a79b970eb8526f0d1662f740b45  Debian-20150209.pdf

This method tells you whether files differ, but not how they differ. In the example, you cannot tell what distinguishes Debian-20150208.pdf from the other two documents.
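If you want to script the check from Listing 1 instead of comparing the hash values by eye, something along the following lines should work. This is only a minimal sketch: the function name same_md5 and its exit status convention are my own assumptions for illustration and are not part of Md5sum.

```shell
#!/bin/sh
# same_md5 FILE1 FILE2   (hypothetical helper, not part of coreutils)
# Exit status: 0 = hashes match, 1 = hashes differ, 2 = a file is unreadable.
same_md5() {
    # Refuse to compare if either file cannot be read.
    [ -r "$1" ] && [ -r "$2" ] || return 2

    # Reading from standard input makes md5sum print "<hash>  -";
    # cut keeps only the hash itself.
    sum1=$(md5sum < "$1" | cut -d ' ' -f 1)
    sum2=$(md5sum < "$2" | cut -d ' ' -f 1)

    [ "$sum1" = "$sum2" ]
}

# Example call (file names taken from Listing 1):
#   same_md5 Debian-20150207.pdf Debian-20150209.pdf && echo "identical"
```

Because the function reports its result through the exit status, it can be used directly in if statements or in && and || chains.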

The tools Pdftotext and KDiff3 can help you answer this question. You will find Pdftotext in the Debian packages as part of poppler-utils. KDiff3 [7] belongs to the KDE suite.

Pdftotext lets you extract the content from a PDF document, which technically means the program extracts text but disregards graphical elements. Unless you specify otherwise, Pdftotext names the output file after the original file, replacing the .pdf suffix with .txt.

The extracts from two documents can then be compared using KDiff3, which neatly displays any differences in highlighted form alongside one another. Listing 2 summarizes the procedure with all three invocations together.

Listing 2

Compare Extracts

$ pdftotext file1.pdf
$ pdftotext file2.pdf
$ kdiff3 file1.txt file2.txt

After you have invoked KDiff3, giving the text files to be compared as parameters, you will see that content present only in the first file appears in green print, and that content present only in the second file appears in blue.

Identical content appears in black print on a white background (Figure 1). The bar found on the right edge of the window is very useful. It identifies the sections in which the differences appear. Clicking on the bar takes you to the corresponding location in the text.

Figure 1: An example of using KDiff3 to perform a direct comparison of two text files.
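If you work without a graphical desktop, the same idea can be sketched for the terminal. The following example is an illustration under my own assumptions rather than part of the workflow described above: it swaps KDiff3 for the classic diff command, the function name pdf_text_diff is hypothetical, and it relies on mktemp from coreutils. Passing - as the output name makes Pdftotext write to standard output.

```shell
#!/bin/sh
# pdf_text_diff FILE1.pdf FILE2.pdf   (hypothetical helper)
# Extracts the text of both PDFs and prints a unified diff.
# Exit status: 0 = same text, 1 = text differs, 2 = extraction or diff failed.
pdf_text_diff() {
    tmpdir=$(mktemp -d) || return 2
    status=0

    # "-" as the output file makes pdftotext write to standard output.
    if pdftotext "$1" - > "$tmpdir/first.txt" &&
       pdftotext "$2" - > "$tmpdir/second.txt"; then
        # diff itself returns 0 (same), 1 (different), or 2 (trouble).
        diff -u "$tmpdir/first.txt" "$tmpdir/second.txt" || status=$?
    else
        status=2
    fi

    rm -rf "$tmpdir"
    return "$status"
}

# Example call (file names taken from Listing 2):
#   pdf_text_diff file1.pdf file2.pdf
```

The temporary directory keeps the extracts out of your working directory and is removed again as soon as the comparison is finished.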

If the three invocations in Listing 2 prove too cumbersome for what you have in mind, then you might consider using Pdfdiff and Comparepdf. Both tools combine these individual steps. To compare content for any differences, Pdfdiff utilizes the first diff program that it finds on your system, which depends on the distribution and desktop. So, for example, it might find KDiff3 or Meld [8].
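Comparepdf reports its result mainly through its exit status, which makes it convenient in scripts. The sketch below assumes the options and exit codes documented in the Comparepdf man page as I understand it (--compare=text, --verbose=0, and the values 0, 10, 13, and 15); check man comparepdf on your system before relying on them. The function name describe_pdf_diff is hypothetical.

```shell
#!/bin/sh
# describe_pdf_diff FILE1.pdf FILE2.pdf   (hypothetical helper)
# Runs comparepdf quietly in text mode and translates its exit status
# into a short message. The exit codes below are assumptions taken from
# the comparepdf man page; verify them for your installed version.
describe_pdf_diff() {
    status=0
    comparepdf --verbose=0 --compare=text "$1" "$2" || status=$?

    case "$status" in
        0)  echo "no differences detected" ;;
        10) echo "appearance differs" ;;
        13) echo "text differs" ;;
        15) echo "page counts differ" ;;
        *)  echo "comparepdf reported an error (exit status $status)" ;;
    esac

    # Pass the original exit status on to the caller.
    return "$status"
}

# Example call:
#   describe_pdf_diff file1.pdf file2.pdf
```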

DiffPDF

DiffPDF, found as the diffpdf package on Debian, is the graphical version of Comparepdf; both come from the same development team. The tool is based on the Qt toolkit and the Poppler library and has a convenient and fairly well-designed user interface (Figure 2).

Figure 2: DiffPDF is the graphical version of Comparepdf and offers a well-designed user interface.

The documents sit in the left and the middle columns for purposes of comparison. DiffPDF color-codes all text fragments that have been changed or were moved to a different place on the same page. The program compares the documents page by page. You will also see a colored bar on the left margin of the document that visually marks the difference.

The user options for this bar include intensity, width, and hue – all of which you can tailor to your liking via the Options button.

Two buttons sitting above the page view are used for selecting files. The entry field next to the buttons is used to specify which pages DiffPDF should compare. In Figure 2, the page numbers shown in the field range from 1 to 460. If the two files show a different number of total pages, then DiffPDF will usually take the smaller value as the upper limit.

You will be able to see the number of pages that contain discrepancies in the output field of the right-hand column. Figure 2 shows that discrepancies occur in 200 out of 460 pages compared.

The right-hand column of the user interface contains several other controls for navigation and comparison. The comparison modes are word by word (the default), character by character, and visual. The visual mode performs an optical comparison that also includes illustrations.

The view button is used to switch back and forth between pages that contain differences. The page number for each of the pages within the respective document as well as the number of discrepancies that occur on the page are listed in the view mode. The arrow buttons scroll forwards and backwards through the pages.

By using the entry field enlargement, you can control the presentation of the pages you are comparing. This option is especially helpful with smaller display screens when the user wants to quickly find out what the comparison looks like.

The six buttons at the bottom of the right-hand column let you initiate the comparison, specify options for display, show program status information, open the integrated help, and close DiffPDF.

The Save as button lets you create a useful summary of the changes (Figure 3). The resulting output document contains all of the differing pages together with highlighted sections. This saves you the trouble of going through documents page by page to locate the pertinent differences.

Figure 3: DiffPDF collects all of the differences it discovers together in a report.

Conclusion

The tools presented here make modifications and differences in PDF documents more apparent and easier to access. Note that using these tools to compare PDF documents is typically successful only when the data in the documents is presented in text form. Otherwise, you will need to compare the documents visually, which may mean that some details are missed.
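To find out in advance whether a document contains extractable text at all, you can run Pdftotext and check whether anything readable comes out. The following is a rough sketch under my own assumptions: the function name has_extractable_text is hypothetical, and treating any letter or digit in the output as proof of a text layer is a simple heuristic, not a reliable test.

```shell
#!/bin/sh
# has_extractable_text FILE.pdf   (hypothetical helper)
# Succeeds if pdftotext produces at least one letter or digit.
# A scanned document without a text layer typically yields only
# whitespace and form feeds, so the function fails for it.
has_extractable_text() {
    # "-" sends the extracted text to standard output; grep -q only
    # reports through its exit status whether a match was found.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example call:
#   has_extractable_text file1.pdf || echo "file1.pdf needs a visual comparison"
```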