Use Paperwork to digitize and archive documents


Problems

Paperwork's OCR recognition rate is a significant weakness. The program makes numerous errors even when dealing with good scans and text-based PDF documents, and the spellcheck is of little help (Figure 8).

Figure 8: Conspicuous problems occur with the OCR process. Although Tesseract is a good and powerful OCR engine, Paperwork still manages to create error-ridden texts from PDF documents.

When creating the index, the application collects words that have been extracted from the texts. The user cannot influence what goes into the index and what does not. Taking control of the indexing process would require learning Whoosh, which is quite an undertaking. The indexer itself is fairly efficient, but Paperwork currently does not make optimal use of it.
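To illustrate what control over indexing could look like, here is a minimal sketch in plain Python. It is neither Whoosh nor Paperwork's own code; the function name `build_index`, the stop-word list, and the sample documents are assumptions made for this example. The sketch builds a small inverted index and lets the user decide which words stay out.

```python
import re
from collections import defaultdict

# Hypothetical stop words; in practice the user would supply this list.
STOP_WORDS = {"the", "and", "of", "to", "a", "in"}

def build_index(documents, stop_words=STOP_WORDS, min_length=3):
    """Map each accepted word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Simplification: only ASCII letters and digits count as word characters.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word in stop_words or len(word) < min_length:
                continue  # the user-defined filter keeps this word out of the index
            index[word].add(doc_id)
    return dict(index)

# Made-up sample documents
docs = {
    "invoice-01": "Invoice for the repair of a scanner",
    "letter-02": "Letter to the tax office",
}
index = build_index(docs)
print(sorted(index))
```

Changing `stop_words` or `min_length` changes what ends up in the index, which is exactly the kind of influence Paperwork does not offer.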

Paperwork saves the index in the ~/.local/share/paperwork/index/ directory and thus independently of the database. This causes some headaches if you want to manage a large number of documents in a portable way. The directory contains a number of files with the extracted keywords, but these files are not necessarily human-readable.
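If you want to see what actually ends up in that directory, a short script can list the files and their sizes. This is only a sketch: the path is the one named above, while the function name `list_index_files` is an assumption of this example, and the script merely inspects the directory without interpreting the index format.

```python
from pathlib import Path

# Index location as given in the article; adjust it if your setup differs.
INDEX_DIR = Path.home() / ".local" / "share" / "paperwork" / "index"

def list_index_files(index_dir):
    """Return (relative path, size in bytes) for every file below index_dir."""
    index_dir = Path(index_dir)
    if not index_dir.is_dir():
        return []  # nothing to report if the directory does not exist
    return sorted(
        (str(p.relative_to(index_dir)), p.stat().st_size)
        for p in index_dir.rglob("*")
        if p.is_file()
    )

for name, size in list_index_files(INDEX_DIR):
    print(f"{size:>10}  {name}")
```

A listing like this also makes it easier to see what would have to be copied along with the documents when moving an archive to another machine.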

To date, the program does not produce meaningful document titles, nor is it possible to change a title manually. Being able to do so would be helpful, especially if you are dealing with numerous documents at once. Manual tagging with additional keywords is a tedious alternative. In addition, the program automatically assigns the labels currently in use to all subsequent documents.

Searching is not easy either. Paperwork treats all text alike and does not distinguish among the content, label, or title of a document. The only way to exert even minimal influence over the search results is to use Advanced Search (Figure 9).

Figure 9: Advanced Search makes it possible to search for multiple linked keywords.

Advanced Search lets you add certain keywords and exclude others, but this approach is laborious. Metadata from PDF documents still means nothing to Paperwork. Distinctions between kinds of content, for example between a title and an author, are beyond the current capabilities of the program.
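The difference between treating all text alike and a field-aware search can be shown with a small Python sketch. It does not reproduce Paperwork's search; the `search` function, the field names, and the sample documents are assumptions chosen for illustration.

```python
def search(documents, include=(), exclude=(), field=None):
    """Return IDs of documents with all `include` words and no `exclude` word.

    If `field` is given (for example "title"), only that field is searched;
    otherwise all fields are lumped together.
    """
    hits = []
    for doc_id, fields in documents.items():
        if field is None:
            text = " ".join(fields.values())
        else:
            text = fields.get(field, "")
        words = set(text.lower().split())
        has_all = all(w.lower() in words for w in include)
        has_excluded = any(w.lower() in words for w in exclude)
        if has_all and not has_excluded:
            hits.append(doc_id)
    return hits

# Made-up sample documents with separate title, label, and content fields
docs = {
    "doc1": {"title": "Tax return", "label": "finance",
             "content": "income and insurance"},
    "doc2": {"title": "Insurance policy", "label": "finance",
             "content": "annual tax notice"},
}
print(search(docs, include=["tax"]))                      # all fields alike
print(search(docs, include=["tax"], field="title"))       # title only
print(search(docs, include=["tax"], exclude=["policy"]))  # include and exclude
```

Without a field, both sample documents match "tax"; restricting the search to the title or excluding a keyword narrows the result to the document you are actually looking for.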

Conclusion

At the present time, Paperwork does not offer an integrated solution. There are simply too many features still lacking, and there are significant problems with the functions that do exist. Supporting only the JPEG and PDF formats is no longer adequate for modern needs. (See the "Alternatives to Paperwork" box.)

Alternatives to Paperwork

In the meantime, there are a number of other ways to provide scanned documents with a text layer. Gscan2pdf is well known and proven in practice. Other alternatives are xsane2djvu and OCRmyPDF [8], which also often achieve good results. The advantage of gscan2pdf is that it permits extensive preparation of the pages before they are sent to the OCR engine.
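As a rough sketch of how OCRmyPDF could be driven from a script, the following Python code builds a command line and runs it only if the tool is installed. The helper name `ocrmypdf_command` and the file names scan.pdf and scan-ocr.pdf are assumptions for this example; the `-l` (language) and `--skip-text` options belong to OCRmyPDF's command-line interface.

```python
import shutil
import subprocess
from pathlib import Path

def ocrmypdf_command(source, target, language="eng", skip_text=True):
    """Build an OCRmyPDF command line that adds a text layer to a scanned PDF."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # Leave pages that already carry text untouched instead of running OCR on them.
        cmd.append("--skip-text")
    cmd += [str(source), str(target)]
    return cmd

# Hypothetical file names
cmd = ocrmypdf_command("scan.pdf", "scan-ocr.pdf")
print(" ".join(cmd))

# Run the command only if OCRmyPDF is installed and the input file exists.
if shutil.which("ocrmypdf") and Path("scan.pdf").is_file():
    subprocess.run(cmd, check=True)
```

The language code must match a Tesseract language pack installed on the system.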

Recoll [9] is a good choice when you want access to text contained in various formats but without explicit OCR processing. The program does not store the documents in a database alongside its index. Instead, it recursively monitors and processes a set of directories that you have specified beforehand. Recoll recognizes numerous formats and provides options for controlling the search function.

Text-based PDF documents require almost no preparation for loading into a database. Programs like Calibre [10] do this easily and mostly error-free. Although it was developed for managing ebooks, the program loads and displays text, HTML, ODT, and PDF documents.

However, Calibre cannot handle a full text search unless certain precautions are taken. Meanwhile, several plugins are available as retrofits. Some of these are based on Recoll. The combination of Calibre with Recoll proves to be a good solution for saving and administering documents in various formats. Gscan2pdf can be added to the mix and used for the OCR processing.

It is also not clear why Paperwork still runs its own OCR on PDFs that already contain text. The same goes for the inability to retrieve documents by title or to analyze their metadata.
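Checking whether a PDF already contains text is not difficult in principle. The following Python sketch shows one possible approach; it is not how Paperwork works. It assumes that Poppler's pdftotext tool is installed, and the function names and the threshold of 20 characters are arbitrary choices for this example.

```python
import subprocess

def has_text_layer(extracted_text, min_chars=20):
    """Treat a PDF as text-based if extraction yields enough visible characters."""
    visible = "".join(extracted_text.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars

def extract_text(pdf_path):
    """Extract text with Poppler's pdftotext; "-" sends the result to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# The decision itself can be tried out without any PDF at hand:
print(has_text_layer("Dear Sir or Madam, please find the invoice attached."))
print(has_text_layer("\f"))  # a page that yields nothing but a form feed
```

A call such as `has_text_layer(extract_text("document.pdf"))`, with a hypothetical file name, would then decide whether a document needs to go through OCR at all.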

Infos

  1. Paperwork: https://github.com/jflesch/paperwork/#readme
  2. Tesseract OCR: https://github.com/tesseract-ocr/
  3. Installation: https://github.com/jflesch/paperwork/wiki/Update
  4. Paperwork under Ubuntu: https://github.com/openpaperwork/paperwork/blob/unstable/doc/install.debian.markdown
  5. Whoosh: http://whoosh.readthedocs.org/en/latest/quickstart.html
  6. "Electronic Document Archives with Gscan2pdf" by Vincze-Aron Szabo, Linux Magazine, Issue 85: http://www.linux-magazine.com/Issues/2007/85/Gscan2pdf/
  7. hOCR-Files: https://en.wikipedia.org/wiki/HOCR
  8. OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF
  9. "Finding Files with Recoll" by Tim Schürmann, Linux Magazine, Issue 79: http://www.linux-magazine.com/Issues/2007/79/Recoll/
  10. "Organizing and Reading Ebooks with Calibre" by Dr. Karl Sarnow, Ubuntu User, Issue 21: http://www.ubuntu-user.com/Magazine/Archive/2014/21/Organizing-and-reading-e-books-with-Calibre/
