Paper Trail
Digital archives do away with the need for traditional filing cabinet storage. Even so, Paperwork tries to make your life easier.
The idea behind Paperwork originated with the desire to have a paperless office. Letters, bills, and loose pages are loaded into a scanner, which delivers PDF and JPEG files to an inbox folder. Then, OCR converts the contents of these files into machine-readable text.
This is where Paperwork [1] comes into play. The application collects image data and text, overlays them, and then saves them as a PDF. Paperwork also builds a searchable index from the text content of the prepared documents. However, you should watch out for some pitfalls inherent to this process. The scans and photographs need the highest possible resolution so that the software can recognize the text properly. This means that a good scanner with a resolution of at least 600 DPI is a must.
At startup, Paperwork first looks for Tesseract [2]. If it cannot find this powerful OCR engine, the program falls back to CuneiForm. In most cases, you will get the best results with Tesseract. You can install it with:
sudo apt install tesseract-ocr tesseract-ocr-eng
Incidentally, while parsing a page, Paperwork quite literally takes an interesting turn here. If it cannot determine the orientation for a scanned page, it will simply process the page four times by turning it 90 degrees with each version and then use the best results.
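The four-way rotation trick can be sketched in a few lines of Python. This is only an illustration of the idea, not Paperwork's actual code: the page is modeled as a small grid of characters instead of pixel data, and the score function is a hypothetical stand-in for whatever quality measure the OCR engine provides.

```python
from typing import Callable, List, Tuple

Page = List[str]  # rows of characters; a stand-in for real pixel data


def rotate_90(page: Page) -> Page:
    """Rotate a rectangular grid 90 degrees clockwise."""
    if not page:
        return []
    width = len(page[0])
    return ["".join(row[col] for row in reversed(page)) for col in range(width)]


def best_orientation(page: Page, score: Callable[[Page], float]) -> Tuple[int, Page]:
    """Try 0, 90, 180, and 270 degrees; return the best-scoring angle and page.

    `score` is a hypothetical quality measure (higher is better), for example
    the number of dictionary words the OCR engine found in that orientation.
    """
    best_angle, best_page, best_score = 0, page, score(page)
    current = page
    for angle in (90, 180, 270):
        current = rotate_90(current)
        current_score = score(current)
        if current_score > best_score:
            best_angle, best_page, best_score = angle, current, current_score
    return best_angle, best_page
```

The price of this approach is that each page of unknown orientation costs four OCR passes instead of one.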
You will find general installation information in Paperwork's Git repository [3]. The most recent version, 0.32, is not currently in the repositories of Ubuntu and its derivatives. However, a PPA exists, and installation instructions are available online [4].
To install from source (which worked fine for us), first install some Python dependencies:
sudo apt install python3-pip python3-setuptools python3-dev python3-pil libenchant-dev python3-whoosh
This will install, among other things, pip, the Python utility for installing and updating Python-based apps and libraries directly from Python's repositories.
With pip, you can now install Paperwork proper:
pip3 install paperwork
The final step is to check for, and install, any further dependencies Paperwork may need. To do that, use the commands:
paperwork-shell chkdeps paperwork_backend
and
paperwork-shell chkdeps paperwork
and follow the on-screen instructions.
The application itself is primarily based on four components. Paperwork uses SANE for scanning documents, which you can install with:
sudo apt install sane xsane
It processes the documents with Tesseract or CuneiForm. Whoosh [5] indexes the text that OCR has extracted. In addition, this tool automatically generates recommendations for keywords. In the process, it reduces each word to its stem in order to arrive at meaningful results. A graphical interface developed with GTK/Glade ties all of the components together.
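To illustrate what stemming does for an index, here is a minimal sketch that uses only the Python standard library. It is not Whoosh and not Paperwork's code: the suffix list is a deliberately crude assumption, far simpler than the stemmers a real indexer uses, and the function names are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Set

SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, assumed for illustration


def stem(word: str) -> str:
    """Lowercase a word and strip at most one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word stem to the IDs of the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.split():
            token = raw.strip(".,;:!?")
            if token:
                index[stem(token)].add(doc_id)
    return dict(index)
```

Because "Printed" and "printing" both reduce to the same stem, a search for either form finds both documents.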
Tesseract, the preferred OCR engine, originally comes from Hewlett Packard. Google uses the open source library for things like digitizing books. The software distinguishes itself with high recognition rates and extensive automation.
Since Tesseract works exclusively with uncompressed TIFF files, it is necessary to first prepare the scanned pages. Based on experience, this is a demanding task that can only partially be automated. Herein lies one of the weaknesses of Paperwork. Programs like gscan2pdf [6] offer more possibilities.
Paperwork creates a searchable PDF text file from the prepared pages. Currently, the software supports direct scanning, reading in PDF documents, and analysis of scanned images. Modern image formats like JPEG 2000 are not supported, but the software does support the classic JPEG and PNG formats.
Even so, it is not possible to load multiple image files into a project at once. Nor is it possible to load an entire folder of scanned images. This makes processing existing scans inconvenient and tedious.
To date, only a few parameters can be configured. You can specify a folder for saving information, the scanner, and the OCR engine (Figure 1). The language settings control the spell check, but this was not successful in the test.
There is a user-specific configuration file, ~/.config/paperwork.conf. This is where you can check whether the program is actually using the correct language for the OCR engine and the spell check. Note that this feature does not work accurately in all versions.
If Paperwork does not recognize words that contain a language-specific character, this indicates an error in the language setting. To resolve this problem, first close the application, because it writes the configuration file back on exit. Next, use an editor to enter en as the value of the lang variable in the corresponding section. After switching the language, restart the text recognition.
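If you prefer to check the setting from a script, the following sketch lists every language-related option in the file. It assumes that paperwork.conf is in INI format and therefore readable with Python's configparser. Because the section name is not given here, the code searches all sections instead of guessing one; the function name is made up for this example.

```python
import configparser
from typing import List, Tuple


def find_lang_settings(conf_text: str) -> List[Tuple[str, str, str]]:
    """Return (section, option, value) for every option whose name contains 'lang'.

    Assumes INI syntax. Interpolation is disabled so that values containing
    a percent sign are returned unchanged.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    hits = []
    for section in parser.sections():
        for option, value in parser.items(section):
            if "lang" in option:
                hits.append((section, option, value))
    return hits
```

Pass it the text of ~/.config/paperwork.conf, for example with pathlib.Path.home().joinpath(".config", "paperwork.conf").read_text(). The script only reads the file, so it is safe to run while you decide what to change in the editor.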
If you operate a number of different scanners, then it is important to make sure that you select the appropriate scanner. You should take care to set the resolution correctly. Testing did not reveal why the application recommends only 300 DPI with some devices that have a physical resolution of 600 DPI.
Once started, Paperwork displays a clean looking interface that contains two sections (Figure 2). To the left you will see a database in which the current document has been highlighted (in Figure 2, the new document is currently empty). The program displays the pages of this document that have been scanned and processed in a corresponding preview.
A single mouse click on one of the pages shown in the preview causes an original-sized version to be displayed. This lets you review the details of what the program has recognized during the OCR process (Figure 3).
There are a number of buttons, some very inconspicuous, located at the top edge of the screen. They are used for calling program functions. The arrows in Figure 2 point to the location of the buttons. The leftmost button controls document viewing. Options for reading in data are found under Scan . The rightmost button is used to manage special settings.
The software collects scanned images as projects and then exports them as a PDF file. In the process, it sets up a subdirectory for each project in the database directory. The name of the subdirectory follows the convention <Date>_<ID-Number>. However, only the date appears in the front end, which makes it difficult to find a particular project (Figure 4).
By default, Paperwork uses the ~/papers/ folder as a working directory. The software always sets up several files in the project directory. You will find JPEGs of the scanned pages under paper.<Number>.jpg. Text extracted by the OCR engine is found under paper.<Number>.words.
A labels file contains headings if they have been manually assigned (Figure 5). The PDF file that is created always has the same name, doc.pdf, in this directory.
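The naming scheme makes it easy to take stock of a project directory from a script. The sketch below groups file names by page number; it assumes that <Number> is a plain decimal integer, and the function name is hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Matches paper.<Number>.jpg and paper.<Number>.words (assumes a decimal number)
PAGE_FILE = re.compile(r"^paper\.(\d+)\.(jpg|words)$")


def group_page_files(names: Iterable[str]) -> Dict[int, List[str]]:
    """Map each page number to the sorted list of file kinds found for it."""
    pages: Dict[int, List[str]] = {}
    for name in names:
        match = PAGE_FILE.match(name)
        if match:
            pages.setdefault(int(match.group(1)), []).append(match.group(2))
    return {number: sorted(kinds) for number, kinds in sorted(pages.items())}
```

Feeding it the names from a project directory (for instance, entry.name for each entry of pathlib.Path.iterdir()) shows at a glance which pages have an image but no OCR text yet. Other files, such as labels and doc.pdf, are ignored.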
The extracted texts are located in the project directories, but they do not exist as simple text files. Instead they exist as special XML files in hOCR format [7]. In this format, markings indicating the position in the original document are shown alongside the pure text that the files contain. This makes it possible to be precise when overlaying the text onto the image files.
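To give an idea of what such a file contains, the following sketch pulls the words and their bounding boxes out of hOCR markup with Python's standard HTML parser. It assumes that words are marked up as span elements with the class ocrx_word and a title attribute of the form "bbox x0 y0 x1 y1", as the hOCR format describes; whether Paperwork's .words files follow this layout in every detail is an assumption, and the class and function names are made up for this example.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, bounding box) pairs from hOCR word spans."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX.search(attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:  # only keep text inside a word span
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def extract_words(hocr: str) -> List[Tuple[str, BBox]]:
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The four numbers are pixel coordinates of the upper left and lower right corners of each word, which is exactly the information needed to place invisible text over the scanned image.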
After a document has been generated as a project, you can assign additional headings even if these headings are not found in the document and the OCR process does not recognize them. Assignments are made by means of a button that appears on the selected document.
In addition, you can choose an existing label or add new labels with the corresponding button. The heading editor will appear, which allows you to link colored markings to the document (Figure 6). These abstract markings prove to be conceptually interesting, but turn out to be only of limited value. This is because the software does not let you sort or search according to these criteria.
The application makes it possible to influence the results of a scan. To do this, you should go to the Settings dialog and start the Scan function by pressing the button. You should also specify the areas in which the software will find the relevant text.
The software will show a preview after it scans each page. You can choose to have the program highlight each word by clicking on the button shown in Figure 7 and then toggling Highlight words to ON .
Paperwork displays significant shortcomings in terms of the OCR recognition rate. The program commits an array of errors even when dealing with good scans and PDF text documents. The spellcheck is not of much use (Figure 8).
When creating the index, the application collects words that have been extracted from the texts. The user cannot influence what goes into the index and what does not. Asserting your control over the index process would require that you learn Whoosh, which is quite an undertaking. The indexer is fairly efficient, but, currently, Paperwork does not make optimal use of this tool.
Paperwork saves the index in the ~/.local/share/paperwork/index/ directory, thus independently from the database. This causes some headaches when you are looking for portable management of a large number of documents. You will find a number of files containing the extracted keywords in the directory, but the files are not necessarily directly legible.
To date, the program does not produce document titles that are meaningful. Nor is it possible to change a title manually. This process would be helpful, especially if you are dealing with numerous documents all at once. Manual tagging with additional keywords is a tedious alternative. In addition, the program automatically assigns currently used labels to all of the subsequent documents.
Searching is also not easy. Paperwork treats all text alike, not distinguishing among the content, label, or title of a document. The only way to achieve at least a minimum of influence over the search results is to use Advanced Search (Figure 9).
Advanced Search lets you require certain keywords and exclude others, but this approach is laborious. Paperwork still ignores the metadata in PDF documents. Distinctions in terms of content, for example the difference between a title and the author, are beyond the current capabilities of the program.
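The include/exclude principle behind such a search is simple to express. The following sketch filters plain text documents by required and forbidden keywords; it only illustrates the logic and does not reproduce Paperwork's query syntax or its Whoosh back end. All names are hypothetical.

```python
from typing import Dict, Iterable, List


def matches(text: str, include: Iterable[str], exclude: Iterable[str]) -> bool:
    """True if the text contains every included word and no excluded word."""
    words = set(text.lower().split())
    has_all = all(word.lower() in words for word in include)
    has_none = not any(word.lower() in words for word in exclude)
    return has_all and has_none


def search(docs: Dict[str, str], include: Iterable[str],
           exclude: Iterable[str] = ()) -> List[str]:
    """Return the sorted IDs of all documents that match the filter."""
    include, exclude = list(include), list(exclude)
    return sorted(doc_id for doc_id, text in docs.items()
                  if matches(text, include, exclude))
```

Because every word is treated alike, a filter like this cannot tell whether a hit comes from the body text, a label, or a title, which is precisely the limitation described above.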
At the present time, Paperwork does not offer an integrated solution. There are simply too many features still lacking, and there are significant problems with the functions that do exist. Supporting only JPEG and PDF formats is no longer adequate to modern needs. (See the "Alternatives to Paperwork" box.)
Alternatives to Paperwork
In the meantime, there are a number of possibilities for providing scanned documents with a text layer. Gscan2pdf is well-known and also proven in practice. Other alternatives would be xsane2djvu and OCRmyPDF [8], which also often achieve good results. The advantage to gscan2pdf is that it permits extensive preparation of the pages before they are sent to the OCR engine.
Recoll [9] is a good choice when you want access to text contained in various formats but without explicit OCR processing. This program does not keep the documents in a database of its own. Instead, it recursively monitors and indexes a set of directories that you have specified beforehand. Recoll recognizes numerous different formats and provides the option of controlling the search function.
Text PDF documents require almost no preparation for loading into a database. Programs like Calibre [10] do this easily and mostly error free. Although it was developed for managing ebooks, the program loads and displays text, HTML, ODT, and PDF documents.
However, Calibre cannot handle a full text search unless certain precautions are taken. Meanwhile, several plugins are available as retrofits. Some of these are based on Recoll. The combination of Calibre with Recoll proves to be a good solution for saving and administering documents in various formats. Gscan2pdf can be added to the mix and used for the OCR processing.
It is also not clear why Paperwork still runs its own OCR on PDFs that already contain text. The same goes for the lack of options for retrieving documents by title and analyzing their metadata.
Infos