Paperwork in the battle against paper stacks
Browsing Scanned Documents
Paperwork doesn't only process documents. The paperless office also needs a search function, which Paperwork provides. The program saves the recognized text in an index that cannot be accessed from outside the program, and you can search this index in Paperwork using keywords. The input box is at the upper left, below the toolbar. Paperwork displays the matching document and highlights the hits in the document on the right (Figure 5). A tool tip explains how to limit the search to a specific date or use Boolean operators.
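To illustrate the idea behind a keyword index, here is a minimal sketch of an inverted index with a Boolean AND search in plain Python. It is not Paperwork's code, and the article does not describe how Paperwork's index is laid out internally; the function names and the sample texts are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map every word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents that contain every word of the query (AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output for three scanned documents (assumption).
documents = {
    "invoice_2014": "Invoice for printer paper and toner",
    "contract": "Rental contract for the office printer",
    "letter": "Letter from the tax office",
}
index = build_index(documents)
```

With this sample data, a search for "printer" finds the invoice and the contract, whereas "office printer" narrows the result to the contract alone, because both words must occur in the same document.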
Besides the automatically generated keywords, you can assign additional keywords ("labels") to a document, even if they don't appear in its text. In the search, you use the syntax label:<term>, which can also be combined with a regular search. Paperwork keeps these labels in a file named labels in the document directory. You can also add extra keywords to the current document with the pencil button in the upper left of the toolbar. Paperwork saves this data in the extra.txt file.
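The following sketch shows one way a query that mixes label:<term> filters with regular keywords could be evaluated. How Paperwork parses such queries internally is not covered here, so the functions `split_query` and `matches`, the dictionary layout of a document, and the restriction to single-word labels are all assumptions made for illustration.

```python
PREFIX = "label:"


def split_query(query):
    """Separate label:<term> filters from ordinary keywords."""
    labels, keywords = [], []
    for token in query.split():
        if token.lower().startswith(PREFIX) and len(token) > len(PREFIX):
            labels.append(token[len(PREFIX):])
        else:
            keywords.append(token)
    return labels, keywords


def matches(document, query):
    """Check whether a document carries all labels and contains all keywords."""
    labels, keywords = split_query(query)
    doc_labels = {label.lower() for label in document["labels"]}
    text = document["text"].lower()
    has_labels = all(label.lower() in doc_labels for label in labels)
    has_keywords = all(keyword.lower() in text for keyword in keywords)
    return has_labels and has_keywords


# Hypothetical document record (assumption, not Paperwork's storage format).
document = {"labels": ["Tax"], "text": "Invoice for 2014"}
```

In this sketch, the query "label:tax invoice" matches the sample document, because the label is assigned and the keyword occurs in the text, while "label:bank invoice" does not match.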
On request, Paperwork exports the finished documents as PDFs. Output in DjVu format should also be possible, but that didn't work in this test. Alternatives are pdf2hocr [10] or pdfsandwich [11]. Paperwork also provides a printing function for archived documents – which, of course, defeats the purpose of a truly paperless office.
Conclusion
Despite some interesting features, Paperwork is still too immature to handle the flood of paper in the office. The program should be of particular interest to Python programmers, who can reuse the modules it implements in larger projects.
If you're looking for a good scan program with an integrated OCR function, GScan2PDF [7] might be a better choice, because it is more stable and offers more features. It also gives you significantly more ways to prepare scanned documents for OCR processing.
The unique selling point of Paperwork – an index of the scanned documents that accumulate over time – can just as easily be implemented with Recoll [12]. This desktop search engine indexes not only PDF documents but office document formats as well.
Infos
- Paperwork: https://github.com/jflesch/paperwork/
- Tesseract OCR: http://code.google.com/p/tesseract-ocr/
- Installing Paperwork from sources: https://github.com/jflesch/paperwork/blob/unstable/doc/install.debian.markdown
- General installation instructions: https://github.com/jflesch/paperwork/wiki/Update
- Whoosh: http://whoosh.readthedocs.org/en/latest/quickstart.html
- Scanned books: http://book.google.com
- "Clear the Clutter" by Vincze-Aron Szabo, Linux Magazine, Issue 85, 2007: http://www.linux-magazine.com/Issues/2007/85/Gscan2pdf/%28language%29/eng-US
- hOCR files: https://en.wikipedia.org/wiki/HOCR
- DjVu: https://en.wikipedia.org/wiki/DjVu, http://DjVu.org
- PDF2hocr: https://github.com/KarolS/pdf2hocr
- Pdfsandwich: http://www.tobias-elze.de/pdfsandwich/index.html
- "Digging In" by Tim Schürmann, Linux Magazine, Issue 79, 2007: http://www.linux-magazine.com/Issues/2007/79/Recoll/%28language%29/eng-US