Use Paperwork to digitize and archive documents

The idea behind Paperwork originated with the desire to have a paperless office. Letters, bills, and loose pages are loaded into a scanner, which spits out PDF and JPEG files into the in tray. Then, the contents of these files are converted with OCR into digital form.

This is where Paperwork [1] comes into play. The application collects image data and text, overlays them, and then saves them as a PDF. Paperwork creates a summary of the text content for the prepared documents as a searchable index. However, there are some pitfalls inherent to this process that you should watch out for. Scans and photographs need a resolution high enough for the software to recognize the text properly, which means that a good scanner with at least 600 DPI resolution is a must.

At startup, Paperwork first looks for Tesseract [2]. If it cannot find this powerful OCR engine, the program falls back to CuneiForm. In most cases, you will get the best results with Tesseract. You can install it with:

sudo apt install tesseract-ocr tesseract-ocr-eng
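To check which engine would be picked up on your own system, a lookup in the same order can be sketched in a few lines of Python. This is not Paperwork's actual detection code; the executable names tesseract and cuneiform and the helper pick_ocr_engine are assumptions made for illustration.

```python
import shutil


def pick_ocr_engine(candidates=("tesseract", "cuneiform"), which=shutil.which):
    """Return the first OCR executable found on PATH, or None.

    The candidate names are assumptions about how the engines are
    installed; `which` can be swapped out for testing.
    """
    for name in candidates:
        if which(name):
            return name
    return None


if __name__ == "__main__":
    print("OCR engine found:", pick_ocr_engine())
```

If the script prints None, neither engine is on your PATH and the installation command above is the first thing to try.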

Incidentally, while parsing a page, Paperwork quite literally takes an interesting turn here. If it cannot determine the orientation of a scanned page, it simply processes the page four times, rotating it by a further 90 degrees on each pass, and then keeps the best result.
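The "try all four orientations and keep the best" idea can be illustrated with a toy model. In the sketch below, a page is a grid of characters rather than an image, and the score is simply the number of rows that match a known word. Both are stand-ins chosen for illustration: the real program rotates images and judges the OCR output, and its scoring method is not described here.

```python
def rotate90(grid):
    """Rotate a list of equal-length strings 90 degrees clockwise."""
    return ["".join(row[i] for row in reversed(grid)) for i in range(len(grid[0]))]


def score(grid, vocabulary):
    """Count rows that are known words (stand-in for an OCR quality score)."""
    return sum(1 for row in grid if row.strip().lower() in vocabulary)


def best_orientation(grid, vocabulary):
    """Try 0, 90, 180 and 270 degrees; return (angle, grid) with the top score."""
    best_angle, best_score, best_grid = 0, -1, grid
    current = grid
    for angle in (0, 90, 180, 270):
        current_score = score(current, vocabulary)
        if current_score > best_score:
            best_angle, best_score, best_grid = angle, current_score, current
        current = rotate90(current)
    return best_angle, best_grid


if __name__ == "__main__":
    sideways = rotate90(["scan", "page"])  # simulate a page fed in sideways
    print(best_orientation(sideways, {"scan", "page"}))
```

The price of this approach is obvious from the loop: a page with unknown orientation costs four recognition passes instead of one.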

Installation

You will find general installation information in Paperwork's Git repository [3]. The most recent version, 0.32, is not presently in the repositories of Ubuntu and its derivatives. However, a PPA exists, and the information about installation procedures is online [4].

To install from source (which worked fine for us), first install some Python dependencies:

sudo apt install python3-pip python3-setuptools python3-dev python3-pil libenchant-dev python3-whoosh

This installs, among other things, pip, the Python utility for installing and updating Python-based applications and libraries directly from the Python Package Index (PyPI).

With pip, you can now install Paperwork proper:

pip3 install paperwork

The final step is to check for, and install, any further dependencies Paperwork may need. To do that, use the commands:

paperwork-shell chkdeps paperwork_backend

and

paperwork-shell chkdeps paperwork

and follow the on-screen instructions.

Architecture

The application itself is primarily based on four components. Paperwork uses SANE for scanning documents; you can install it with:

sudo apt install sane xsane

It processes the documents with Tesseract or CuneiForm. Whoosh [5] is used to index the text extracted by OCR. In addition, this tool automatically generates recommendations for keywords. In the process, it reduces each word to its word stem in order to arrive at meaningful results. A graphical interface developed with GTK/Glade ties all of the components together.
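To see why stemming matters for keyword suggestions, consider the following minimal sketch. It does not use Whoosh and does not reproduce its stemming algorithm; the three-entry suffix list and the helper names stem and suggest_keywords are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Deliberately tiny suffix list; a real stemmer handles far more cases.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def suggest_keywords(text, limit=3):
    """Return the most frequent stems longer than three characters."""
    stems = (stem(w) for w in re.findall(r"[A-Za-z]+", text))
    counts = Counter(s for s in stems if len(s) > 3)
    return [s for s, _ in counts.most_common(limit)]


if __name__ == "__main__":
    print(suggest_keywords("Bills and billing: the bill was billed twice."))
```

Without the stemming step, "bills", "billing", "bill", and "billed" would count as four different words, and none of them would stand out as a keyword.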

Tesseract, the preferred OCR engine, originally comes from Hewlett Packard. Google uses the open source library for things like digitizing books. The software distinguishes itself with high recognition rates and extensive automation.

Tesseract is sensitive to the quality of its input (early versions accepted only uncompressed TIFF files), so it is necessary to first prepare the scanned pages. Based on experience, this is a demanding task that can only partially be automated. Herein lies one of the weaknesses of Paperwork. Programs like gscan2pdf [6] offer more possibilities.

Paperwork creates a searchable PDF text file from the prepared pages. Currently, the software supports direct scanning, reading in PDF documents, and analysis of scanned images. Modern image formats like JPEG 2000 are not supported, but the software does support the classic JPEG and PNG formats.

Even so, it is not possible to load multiple image files into a project at once, nor is it possible to load an entire folder of scanned images. This makes it tedious to continue processing existing scans.

Settings

To date, you can configure only a few parameters. You can specify a folder for saving information, the scanner, and the OCR engine (Figure 1). The language settings control the spell check, but this did not work in our test.

Figure 1: Paperwork has been designed with just a few settings. Even so, this does not necessarily make the software easier to use.

There is a user-specific configuration file, ~/.config/paperwork.conf. This is where you can check to see whether the program is actually using the correct language and spell check for the OCR engine. Note that this feature does not work accurately in all versions.

If Paperwork does not recognize words that contain a language-specific character, then this is an indication of an error in the language setting. To resolve this problem, first close the application, because Paperwork writes the configuration file back when it exits and would otherwise overwrite your change. Next, use an editor to enter en as the value for the corresponding lang variable in the relevant section. After switching the language, restart the text recognition.
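If you prefer to script this change, Python's configparser module can do the edit. The sketch below assumes that paperwork.conf is a plain INI-style file; the section name OCR used in the comments is hypothetical, so check your own file for the section that actually holds the lang variable. Note that configparser drops comments when it rewrites a file.

```python
import configparser
from pathlib import Path


def set_language(conf_path, section, lang):
    """Set `lang` in `section` of an INI-style file and write it back."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(conf_path, encoding="utf-8")
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "lang", lang)
    with open(conf_path, "w", encoding="utf-8") as handle:
        parser.write(handle)


if __name__ == "__main__":
    conf = Path.home() / ".config" / "paperwork.conf"
    # Hypothetical call, left commented out on purpose:
    # set_language(conf, "OCR", "en")
    print(f"Close Paperwork before editing {conf}")
```

Keep a copy of the original file before you run anything like this against your real configuration.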

If you operate a number of different scanners, then it is important to make sure that you select the appropriate scanner. You should take care to set the resolution correctly. Testing did not reveal why the application recommends only 300 DPI with some devices that have a physical resolution of 600 DPI.
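The difference between the two settings is easy to quantify. The short calculation below converts a page size in millimeters to pixel dimensions at a given resolution; the A4 page size (210 x 297 mm) is used as an example, and the helper name scan_pixels is made up for this sketch.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm, height_mm, dpi):
    """Return the (width, height) of a scan in pixels at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


if __name__ == "__main__":
    for dpi in (300, 600):
        width, height = scan_pixels(210, 297, dpi)
        megapixels = width * height / 1e6
        print(f"A4 at {dpi} DPI: {width} x {height} pixels ({megapixels:.1f} megapixels)")
```

Doubling the resolution quadruples the number of pixels, so small print that is barely legible at 300 DPI gets four times as many pixels per character at 600 DPI, at the cost of larger files and longer processing.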

Practical Example

Once started, Paperwork displays a clean-looking interface that contains two sections (Figure 2). To the left, you will see a database in which the current document is highlighted (in Figure 2, the new document is currently empty). To the right, the program displays a preview of the pages of this document that have been scanned and processed.

Figure 2: Paperwork's functions are hidden behind plain and ordinary looking buttons at the top of the main pane.

A single mouse click on one of the pages shown in the preview causes an original-sized version to be displayed. This lets you review the details of what the program has recognized during the OCR process (Figure 3).

Figure 3: After the OCR process is complete, the software will display what it has recognized when you mouse over a word. In this example, the software has incorrectly split the word LOOKING into LOOK and NG, as it was unable to correctly identify the I.

There are a number of buttons, some very inconspicuous, located at the top edge of the screen. They are used for calling program functions. The arrows in Figure 2 point to the location of the buttons. The leftmost button controls document viewing. Options for reading in data are found under Scan. The rightmost button is used to manage special settings.

The software collects scanned images as projects and then exports them as a PDF file. In the process, it creates a subdirectory for each project in the database directory. The name of the subdirectory follows the convention <Date>_<ID-Number>. However, only the date appears in the front end, which makes it difficult to find the project you are looking for (Figure 4).

Figure 4: Paperwork keeps its tidy appearance even after a series of documents have been read in. The software displays the selected document as a preview.

By default, Paperwork uses the ~/papers/ folder as a working directory. The software always sets up several files in the project directory. You will find JPEGs of the scanned pages under paper.<Number>.jpg. Text extracted by the OCR engine is found under paper.<Number>.words.

A labels file contains the labels if any have been manually assigned (Figure 5). The PDF file created in this directory always has the same name, doc.pdf.
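Because the layout is this regular, a project directory is easy to inspect from a script. The following sketch lists which page files are present and whether the labels file and doc.pdf exist. It assumes the file names are exactly as described above (including a file literally named labels); the function name describe_project is made up for this example.

```python
import re
from pathlib import Path

# Matches paper.<Number>.jpg and paper.<Number>.words
PAGE_FILE = re.compile(r"^paper\.(\d+)\.(jpg|words)$")


def describe_project(project_dir):
    """Summarize the page files, labels file and PDF in one project directory."""
    project_dir = Path(project_dir)
    pages = {}
    for entry in project_dir.iterdir():
        match = PAGE_FILE.match(entry.name)
        if match:
            number, kind = int(match.group(1)), match.group(2)
            pages.setdefault(number, set()).add(kind)
    return {
        "pages": {n: sorted(kinds) for n, kinds in sorted(pages.items())},
        "has_labels": (project_dir / "labels").is_file(),
        "has_pdf": (project_dir / "doc.pdf").is_file(),
    }
```

Called on a project directory below ~/papers/, the function returns a small dictionary; a page that has a .jpg entry but no .words entry has not been through text recognition yet.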

Figure 5: If necessary, you can provide additional information about selected documents like the labels "document" and "PDFs" shown in the example.

The extracted texts are located in the project directories, but they do not exist as simple text files. Instead they exist as special XML files in hOCR format [7]. In this format, markings indicating the position in the original document are shown alongside the pure text that the files contain. This makes it possible to be precise when overlaying the text onto the image files.
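To give an idea of what such a file contains, the sketch below pulls each word and its bounding box out of a small hOCR fragment using only Python's standard library. The sample markup follows the general hOCR conventions (word elements with the class ocrx_word and a bbox entry in the title attribute); whether Paperwork's .words files look exactly like this is an assumption, and the coordinates in the sample are invented.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bounding box of the word span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(value) for value in match.groups())

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Invented sample; the coordinates are for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 1000 1400'>
<span class='ocr_line' title='bbox 30 90 400 120'>
<span class='ocrx_word' title='bbox 36 92 120 116'>LOOK</span>
<span class='ocrx_word' title='bbox 150 92 200 116'>NG</span>
</span></div>"""

if __name__ == "__main__":
    for text, box in extract_words(SAMPLE):
        print(text, box)
```

The four numbers are the left, top, right, and bottom edges of the word in pixels, which is exactly the information a viewer needs to place invisible text over the right spot in the scanned image.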

After a document has been generated as a project, you can assign additional labels even if these labels do not occur in the document and the OCR process does not recognize them. Assignments are made by means of a button that appears on the selected document.

In addition, you can choose an existing label or add new labels with the corresponding button. The label editor will appear, which allows you to link colored markings to the document (Figure 6). These abstract markings are conceptually interesting but turn out to be of only limited value, because the software does not let you sort or search according to these criteria.

Figure 6: You can use labels, colored markings, and keywords to add more information to a document.

The application makes it possible to influence the results of a scan. To do this, you should go to the Settings dialog and start the Scan function by pressing the button. You should also specify the areas in which the software will find the relevant text.

The software will show a preview after it scans each page. You can choose to have the program highlight each word by clicking on the button shown in Figure 7 and then toggling Highlight words to ON.

Figure 7: After scanning, the program shows the last page that has been read in, and you can choose to see the words it has identified.

Problems

Paperwork displays significant shortcomings in terms of the recognition rate. The program commits an array of errors even when dealing with good scans and PDF text documents. The spell check is not of much use (Figure 8).

Figure 8: Conspicuous problems occur with the OCR process. In spite of the fact that Tesseract is a good and powerful OCR engine, Paperwork still manages to create error-ridden texts from PDF documents.

When creating the index, the application collects words that have been extracted from the texts. The user cannot influence what goes into the index and what does not. Asserting your control over the index process would require that you learn Whoosh, which is quite an undertaking. The indexer is fairly efficient, but, currently, Paperwork does not make optimal use of this tool.

Paperwork saves the index in the ~/.local/share/paperwork/index/ directory, thus independently from the database. This causes some headaches when you are looking for portable management of a large number of documents. You will find a number of files containing the extracted keywords in the directory, but the files are not necessarily directly legible.

To date, the program does not produce document titles that are meaningful. Nor is it possible to change a title manually. This process would be helpful, especially if you are dealing with numerous documents all at once. Manual tagging with additional keywords is a tedious alternative. In addition, the program automatically assigns currently used labels to all of the subsequent documents.

Searching is also not easy. Paperwork treats all text alike, not distinguishing among the content, label, or title of a document. The only way to achieve at least a minimum of influence over the search results is to use Advanced Search (Figure 9).

Figure 9: Advanced Search makes it possible to search for multiple linked keywords.

Advanced Search lets you add certain keywords and exclude others, but this approach is laborious. Metadata from PDF documents still means nothing to Paperwork. Distinctions in terms of content, for example the difference between a title and the author, are beyond the current capabilities of the program.
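What such an include/exclude search does behind the scenes can be sketched with a toy inverted index. The code below is not Paperwork's or Whoosh's implementation and does not reflect their query syntax; the document IDs, texts, and function names are invented for illustration. Because the index maps every word to a set of documents regardless of where the word came from, it also shows why content, label, and title cannot be told apart in the results.

```python
def build_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, include=(), exclude=()):
    """Return IDs of documents with all `include` words and no `exclude` words."""
    include = [word.lower() for word in include]
    if not include:
        return set()
    hits = set.intersection(*(index.get(word, set()) for word in include))
    for word in exclude:
        hits -= index.get(word.lower(), set())
    return hits


if __name__ == "__main__":
    docs = {
        "a": "invoice from the power company",
        "b": "invoice for scanner repair",
        "c": "letter about scanner warranty",
    }
    index = build_index(docs)
    print(search(index, include=["invoice"], exclude=["scanner"]))
```

Telling a title apart from body text would require one index per field, which is the kind of control over indexing that the front end does not expose.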

Conclusion

At the present time, Paperwork does not offer an integrated solution. There are simply too many features still lacking, and there are significant problems with the functions that do exist. Supporting only JPEG and PDF formats is no longer adequate to modern needs. (See the "Alternatives to Paperwork" box.)

Alternatives to Paperwork

By now, there are a number of possibilities for providing scanned documents with a text layer. Gscan2pdf is well known and proven in practice. Other alternatives are xsane2djvu and OCRmyPDF [8], which also often achieve good results. The advantage of gscan2pdf is that it permits extensive preparation of the pages before they are sent to the OCR engine.

Recoll [9] is a good choice when you want access to text contained in various formats but without explicit OCR processing. This program does not use a document database at all alongside the index. Instead, it recursively monitors and processes a set of directories that you have specified beforehand. Recoll recognizes numerous different formats and provides the option of controlling the search function.

Text PDF documents require almost no preparation for loading into a database. Programs like Calibre [10] do this easily and mostly error free. Although it was developed for managing ebooks, the program loads and displays text, HTML, ODT, and PDF documents.

However, Calibre cannot handle a full text search unless certain precautions are taken. Meanwhile, several plugins are available as retrofits. Some of these are based on Recoll. The combination of Calibre with Recoll proves to be a good solution for saving and administering documents in various formats. Gscan2pdf can be added to the mix and used for the OCR processing.

It is also not clear why Paperwork still runs OCR on PDFs that already contain text. The same goes for the lack of any way to retrieve documents by their title or to analyze their metadata.

Infos

  1. Paperwork: https://github.com/jflesch/paperwork/#readme
  2. Tesseract OCR: https://github.com/tesseract-ocr/
  3. Installation: https://github.com/jflesch/paperwork/wiki/Update
  4. Paperwork under Ubuntu: https://github.com/openpaperwork/paperwork/blob/unstable/doc/install.debian.markdown
  5. Whoosh: http://whoosh.readthedocs.org/en/latest/quickstart.html
  6. "Electronic Document Archives with Gscan2pdf" by Vincze-Aron Szabo, Linux Magazine, Issue 85: http://www.linux-magazine.com/Issues/2007/85/Gscan2pdf/
  7. hOCR files: https://en.wikipedia.org/wiki/HOCR
  8. OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF
  9. "Finding Files with Recoll" by Tim Schürmann, Linux Magazine, Issue 79: http://www.linux-magazine.com/Issues/2007/79/Recoll/
  10. "Organizing and Reading Ebooks with Calibre" by Dr. Karl Sarnow, Ubuntu User, Issue 21: http://www.ubuntu-user.com/Magazine/Archive/2014/21/Organizing-and-reading-e-books-with-Calibre/