Use Paperwork to digitize and archive documents

Yuriy Klochan, 123RF

Paper Trail

Digital archives do away with the need for traditional filing cabinets, and Paperwork tries to make building one easier.

The idea behind Paperwork originated with the desire for a paperless office. Letters, bills, and loose pages are loaded into a scanner, which spits out PDF and JPEG files into the in tray. OCR then converts the contents of these files into machine-readable text.

This is where Paperwork [1] comes into play. The application collects the image data and the recognized text, overlays them, and saves the result as a PDF. Paperwork also builds a searchable index from the text content of the prepared documents. However, the process has some pitfalls to watch out for. Scans and photographs need the highest possible resolution so that the software can recognize the text properly, which means that a good scanner with a resolution of at least 600 DPI is a must.

At startup, Paperwork first looks for Tesseract [2]. If it cannot find this powerful OCR engine, the program falls back to CuneiForm. In most cases, you will get the best results with Tesseract. You can install it with:

sudo apt install tesseract-ocr tesseract-ocr-eng
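
The lookup order described above can be mimicked from the command line. The following is only a sketch: the helper name first_available is made up for illustration, and it assumes the engines are installed under the binary names tesseract and cuneiform. It does not reproduce Paperwork's own detection code.

```shell
# Hypothetical helper (not part of Paperwork): print the first command
# from the argument list that is found on the PATH.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # none of the candidates is installed
}

# Assumed binary names: tesseract first, cuneiform as the fallback.
engine=$(first_available tesseract cuneiform) || engine="none"
echo "OCR engine found: $engine"
```

If the last line reports "none", neither engine is installed and OCR will not work until you install one of them.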

Incidentally, while parsing a page, Paperwork quite literally takes an interesting turn. If it cannot determine the orientation of a scanned page, it simply processes the page four times, rotating it by 90 degrees each time, and then uses the best result.
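
The four-rotation approach can be sketched in a few lines of shell. This is an illustration under assumptions, not Paperwork's actual code: the function names are hypothetical, the sketch relies on ImageMagick's convert command for rotating, and it scores each orientation simply by counting the words Tesseract returns. The article does not say how Paperwork itself judges the "best" result.

```shell
# Hypothetical scorer (an assumption, not Paperwork's method): rotate the
# image with ImageMagick, run Tesseract, and count the recognized words.
count_ocr_words() {
    image=$1
    angle=$2
    tmp=$(mktemp "${TMPDIR:-/tmp}/rotated.XXXXXX") || return 1
    convert "$image" -rotate "$angle" "png:$tmp" &&
        tesseract "$tmp" stdout 2>/dev/null | wc -w | tr -d ' '
    rm -f "$tmp"
}

# Try all four orientations and print the angle with the highest score.
# The scorer is passed in by name so that it can be swapped out.
best_rotation() {
    image=$1
    scorer=${2:-count_ocr_words}
    best_angle=0
    best_score=-1
    for angle in 0 90 180 270; do
        score=$("$scorer" "$image" "$angle")
        score=${score:-0}   # treat a failed run as "no words found"
        if [ "$score" -gt "$best_score" ]; then
            best_score=$score
            best_angle=$angle
        fi
    done
    printf '%s\n' "$best_angle"
}
```

Calling best_rotation scan.png would print 0, 90, 180, or 270. A raw word count is a crude measure, because upside-down text can still yield plenty of garbage "words", so a more careful scorer would probably check the output against a dictionary.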

Installation

You will find general installation information in Paperwork's Git repository [3]. The latest version, 0.32, is not currently in the repositories of Ubuntu and its derivatives. However, a PPA exists, and information about the installation procedure is online [4].

To install from source (which worked fine for us), first install some Python dependencies:

sudo apt install python3-pip python3-setuptools python3-dev python3-pil libenchant-dev python3-whoosh

This will install, among other things, pip, the Python utility for installing and updating Python-based apps and libraries directly from the Python Package Index (PyPI).

With pip, you can now install Paperwork proper:

pip3 install paperwork

The final step is to check for, and install, any further dependencies Paperwork may need. To do that, use the commands:

paperwork-shell chkdeps paperwork_backend

and

paperwork-shell chkdeps paperwork

and follow the on-screen instructions.

Architecture

The application itself is primarily based on four components. Paperwork uses SANE for scanning documents, which you can install with:

sudo apt install sane xsane

It processes the documents with Tesseract or CuneiForm. Whoosh [5] is used to index the text that OCR has extracted. It also automatically generates keyword suggestions, reducing each word to its stem in order to arrive at meaningful results. Finally, a graphical interface developed with GTK/Glade ties all of the components together.
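
To see why reducing words to their stems helps with keywords, consider the toy example below. It is a deliberately naive sketch: the function toy_stem is invented for this illustration and only strips a few English suffixes, whereas a real stemmer applies far more careful rules.

```shell
# Toy stemmer for illustration only: lowercase the word and strip a few
# common English suffixes. Real stemmers (such as the Porter algorithm)
# handle many more cases.
toy_stem() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |
        sed -E 's/(ing|ed|s)$//'
}

for word in Bill bills billed billing; do
    printf '%s -> %s\n' "$word" "$(toy_stem "$word")"
done
```

All four forms reduce to "bill", so a search or keyword suggestion for one of them can match documents that contain any of the others.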

Tesseract, the preferred OCR engine, was originally developed by Hewlett-Packard. Google uses the open source library for tasks such as digitizing books. The software distinguishes itself with high recognition rates and extensive automation.

Since Tesseract works exclusively with uncompressed TIFF files, it is necessary to first prepare the scanned pages. Based on experience, this is a demanding task that can only partially be automated. Herein lies one of the weaknesses of Paperwork. Programs like gscan2pdf [6] offer more possibilities.
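
If you want to prepare pages by hand, one possible approach is to convert each scan to an uncompressed grayscale TIFF before handing it to the OCR engine. The sketch below assumes that ImageMagick's convert command is installed; the function names are hypothetical, and real-world scans will usually need further cleanup (deskewing, cropping, contrast adjustment) that this example does not attempt.

```shell
# Hypothetical helper: derive the TIFF file name from the scan's name,
# for example scans/bill.jpg -> scans/bill.tif
tiff_path() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one scan to an uncompressed grayscale TIFF with ImageMagick.
# Assumes the ImageMagick "convert" command is available.
prepare_page() {
    convert "$1" -colorspace Gray -compress None "$(tiff_path "$1")"
}
```

A loop such as for f in scans/*.jpg; do prepare_page "$f"; done would then process a whole folder, writing each TIFF next to its source file.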

Paperwork creates a searchable PDF file from the prepared pages. Currently, the software supports direct scanning, importing PDF documents, and analyzing scanned images. Newer image formats like JPEG 2000 are not supported, but the software does support the classic JPEG and PNG formats.

However, it is not possible to load multiple image files into a project at once, nor is it possible to load an entire folder of scanned images. This makes continuing to work with existing scans inconvenient and tedious.
