Paperwork in the battle against paper stacks

Browsing Scanned Documents

Paperwork does more than process documents. A paperless office also needs a search function, and Paperwork provides one. The program stores the recognized text in an index that cannot be accessed from outside the program, and you search that index from within Paperwork using keywords. The corresponding input box is at the upper left, below the toolbar. Paperwork lists the matching documents and highlights the hits in the document shown on the right (Figure 5). A tool tip explains how to limit the search to a specific date or use Boolean operators.

Figure 5: With the search function, you can quickly find the relevant passages inside indexed documents.

Besides the automatically generated keywords, you can assign additional keywords ("labels") to a document, even if they don't appear anywhere in its text. To search for them, you use the syntax label:<term>, which can be combined with a regular keyword search. Paperwork keeps these labels in a file named labels in the document's directory. You can also mark additional keywords in the current document with the pencil button at the upper left of the toolbar. Paperwork saves this data in the extra.txt file.
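To make the combined search syntax more concrete, the following Python sketch shows one way a query that mixes label:<term> filters with regular keywords could be split and matched. It is an illustration only, not Paperwork's own code: the names ParsedQuery, parse_query, and matches are hypothetical, and Boolean operators and date limits are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

LABEL_PREFIX = "label:"


@dataclass
class ParsedQuery:
    """Hypothetical container for the two kinds of search terms."""
    labels: List[str] = field(default_factory=list)
    terms: List[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Split a query into label filters and plain keywords.

    Assumption: tokens are separated by whitespace and a label filter
    is written as label:<term> with no space after the colon.
    """
    parsed = ParsedQuery()
    for token in query.split():
        is_label = token.lower().startswith(LABEL_PREFIX)
        if is_label and len(token) > len(LABEL_PREFIX):
            parsed.labels.append(token[len(LABEL_PREFIX):])
        else:
            parsed.terms.append(token)
    return parsed


def matches(parsed: ParsedQuery, doc_labels: Iterable[str], doc_text: str) -> bool:
    """Return True if the document carries every requested label
    and its text contains every plain keyword (case-insensitive)."""
    labels = {label.lower() for label in doc_labels}
    text = doc_text.lower()
    has_labels = all(label.lower() in labels for label in parsed.labels)
    has_terms = all(term.lower() in text for term in parsed.terms)
    return has_labels and has_terms


if __name__ == "__main__":
    query = parse_query("label:invoice telephone")
    print(query)
    print(matches(query, ["Invoice", "2014"], "Your telephone bill for March"))
```

In this sketch, a label is matched only against the labels assigned to a document, whereas a plain keyword is matched only against the recognized text. This mirrors the point that a label does not have to appear in the document itself.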

On request, Paperwork exports the finished documents as PDFs. Output in DjVu format is supposed to be possible as well, but it did not work in this test. Other options are pdf2hocr [10] or pdfsandwich [11]. Paperwork also provides a printing function for archived documents, which, of course, defeats the purpose of a truly paperless office.

Conclusion

Despite some interesting features, Paperwork is still too immature to handle the flood of paper in the office. The program should be of particular interest to Python programmers, who can reuse the modules it implements in larger projects.

If you're looking for a good scanning program with an integrated OCR function, GScan2PDF [7] might be a better choice, because it is more stable and offers more features. It also gives you significantly more ways to prepare scanned documents for OCR processing.

The unique selling point of Paperwork, an index of the scanned documents that accumulate over time, can just as easily be implemented with Recoll [12]. This desktop search engine indexes not only PDF documents but office document formats as well.

Infos