Use Paperwork to digitize and archive documents


Settings

To date, only a few parameters can be configured. You can specify the folder for saving documents, the scanner, and the OCR engine (Figure 1). The language setting is also supposed to control the spell check, but this did not work in testing.

Figure 1: Paperwork has been designed with just a few settings. Even so, this does not necessarily make the software easier to use.

Paperwork keeps a user-specific configuration file, ~/.config/paperwork.conf. Here you can check whether the program is actually using the correct language for the OCR engine and the spell check. Note that this feature does not work reliably in all versions.

If Paperwork does not recognize words that contain a language-specific character, this points to an error in the language setting. To resolve the problem, first close the application, because Paperwork writes the configuration file back when it exits and would otherwise overwrite your change. Next, use an editor to enter en as the value of the corresponding lang variable in the relevant section. After switching the language, restart the text recognition.
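Before editing the file by hand, it can help to see which language values it currently holds. The following Python sketch assumes that paperwork.conf is an INI-style file with named sections; it lists every option whose name contains "lang" without presupposing a particular section name. The function name find_language_options is made up for this illustration and is not part of Paperwork.

```python
import configparser
from pathlib import Path


def find_language_options(conf_path):
    """Return (section, option, value) for every option whose name contains 'lang'.

    Assumption: the file is INI-style. A file without section headers
    makes configparser raise MissingSectionHeaderError.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(conf_path, encoding="utf-8")  # a missing file is silently skipped
    hits = []
    for section in parser.sections():
        for option, value in parser.items(section):
            if "lang" in option.lower():
                hits.append((section, option, value))
    return hits


if __name__ == "__main__":
    conf = Path.home() / ".config" / "paperwork.conf"
    for section, option, value in find_language_options(conf):
        print(f"[{section}] {option} = {value}")
```

Run the script while Paperwork is closed, so that the values shown are the ones the program last wrote to disk.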

If you operate several different scanners, make sure that you select the appropriate one. You should also take care to set the resolution correctly. Testing did not reveal why the application recommends only 300 DPI with some devices that have a physical resolution of 600 DPI.
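To get a feel for what the resolution setting means in practice, the following Python sketch converts a page size into pixel dimensions. The A4 page size (210 x 297 mm) is an assumption chosen for illustration; the article does not say which paper size was scanned, and the function pixel_size is hypothetical.

```python
MM_PER_INCH = 25.4


def pixel_size(width_mm, height_mm, dpi):
    """Return the (width, height) in pixels of a page scanned at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


if __name__ == "__main__":
    # Assumption: an A4 page, 210 mm x 297 mm.
    for dpi in (300, 600):
        width, height = pixel_size(210, 297, dpi)
        print(f"{dpi} DPI: {width} x {height} pixels")
```

Doubling the resolution doubles both dimensions, so a 600 DPI scan holds roughly four times as many pixels as a 300 DPI scan of the same page.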

Practical Example

Once started, Paperwork displays a clean-looking interface that contains two sections (Figure 2). To the left you will see the database, in which the current document is highlighted (in Figure 2, the new document is still empty). The program displays the scanned and processed pages of this document in a corresponding preview.

Figure 2: Paperwork's functions are hidden behind plain and ordinary looking buttons at the top of the main pane.

A single mouse click on one of the pages shown in the preview causes an original-sized version to be displayed. This lets you review the details of what the program has recognized during the OCR process (Figure 3).

Figure 3: After the OCR process is complete, the software will display what it has recognized when you mouse over a word. In this example, the software has incorrectly split the word LOOKING into LOOK and NG, as it was unable to correctly identify the I.

There are a number of buttons, some very inconspicuous, located at the top edge of the screen. They are used for calling program functions. The arrows in Figure 2 point to the location of the buttons. The leftmost button controls document viewing. Options for reading in data are found under Scan. The rightmost button is used to manage special settings.

The software collects scanned images as projects and then exports them as a PDF file. For each project, it sets up a subdirectory in the database directory. The name of the subdirectory follows the convention <Date>_<ID-Number>. However, only the date appears in the front end, which makes it difficult to find the project you are looking for (Figure 4).

Figure 4: Paperwork keeps its tidy appearance even after a series of documents have been read in. The software displays the selected document as a preview.

By default, Paperwork uses the ~/papers/ folder as a working directory. The software always sets up several files in the project directory. You will find JPEGs of the scanned pages under paper.<Number>.jpg. Text extracted by the OCR engine is found under paper.<Number>.words.

A labels file contains the labels if any have been manually assigned (Figure 5). The PDF file that is created in this directory is always named doc.pdf.

Figure 5: If necessary, you can provide additional information about selected documents like the labels "document" and "PDFs" shown in the example.
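The file layout described above can be checked with a few lines of Python. This is a minimal sketch, assuming that <Number> is a plain decimal page number and that the labels file is literally named labels; the function inventory is invented for this example and is not part of Paperwork.

```python
import re
from pathlib import Path

# Matches paper.<Number>.jpg and paper.<Number>.words (assumed: decimal numbers).
PAGE_FILE = re.compile(r"^paper\.(\d+)\.(jpg|words)$")


def inventory(project_dir):
    """Summarize which page files, labels file, and PDF a project directory holds."""
    project = Path(project_dir)
    pages = {}
    for entry in project.iterdir():
        match = PAGE_FILE.match(entry.name)
        if match:
            number, kind = int(match.group(1)), match.group(2)
            pages.setdefault(number, set()).add(kind)
    return {
        "pages": {n: sorted(kinds) for n, kinds in sorted(pages.items())},
        "has_labels": (project / "labels").is_file(),
        "has_pdf": (project / "doc.pdf").is_file(),
    }
```

Calling inventory() on one of the subdirectories below ~/papers/ shows at a glance whether every scanned page has a matching .words file, that is, whether the OCR run covered the whole document.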

The extracted texts are located in the project directories, but they do not exist as simple text files. Instead they exist as special XML files in hOCR format [7]. In this format, markings indicating the position in the original document are shown alongside the pure text that the files contain. This makes it possible to be precise when overlaying the text onto the image files.
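Because hOCR is an HTML-based format, the words and their positions can be read back with the Python standard library. The sketch below assumes that the OCR engine marks each word as an element with the class ocrx_word and stores the coordinates as "bbox x0 y0 x1 y1" in the title attribute, as the hOCR convention describes; the actual output depends on the OCR engine. The names HocrWords and extract_words are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1; ..."
BBOX = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word being read, if any
        self._text = []
        self._depth = 0     # nesting depth of tags inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> inside a word
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text, self._bbox))
        self._bbox = None


def extract_words(hocr_text):
    """Return all recognized words with their bounding boxes."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Given the text of one paper.<Number>.words file, extract_words() returns a list of words with pixel coordinates. This is the information a viewer needs to overlay the recognized text onto the page image.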

After a document has been generated as a project, you can assign additional labels even if these terms are not found in the document or the OCR process does not recognize them. Assignments are made by means of a button that appears on the selected document.

In addition, you can choose an existing label or add new labels with the corresponding button. The label editor will appear, which allows you to link colored markings to the document (Figure 6). These abstract markings are conceptually interesting but turn out to be of limited value, because the software does not let you sort or search according to these criteria.

Figure 6: You can use labels, colored markings, and keywords to add more information to a document.

The application makes it possible to influence the results of a scan. To do this, you should go to the Settings dialog and start the Scan function by pressing the button. You should also specify the areas in which the software will find the relevant text.

The software will show a preview after it scans each page. You can choose to have the program highlight each word by clicking on the button shown in Figure 7 and then toggling Highlight words to ON.

Figure 7: After scanning, the program shows the last page that has been read in, and you can choose to see the words it has identified.
