April 1, 2011

Weekend Project: Create a Paperless Linux Office


The paperless office: whether to combat clutter or save the forests, it has been the dream of many a computer user ever since the first electronic record of, well, probably anything. But it remains elusive, in no small part because whatever your personal intentions, you just cannot control the actions of other people, and many businesses today still insist on sending you printed bills and receipts. You can at least dispense with the filing cabinets, however, by scanning in the documents you need as searchable, full-text PDFs. Fire up the scanner and the weekend.

Scanning with gscan2pdf

Clearly, you could just scan everything and save your documents as TIFF or JPEG files. Linux has solid support for USB desktop scanners (even all-in-one printer/fax/scan devices and those with sheet-feeders or other attachments) thanks to the SANE project. There is also no shortage of quality scan applications, like Kooka, XSane, or Simple Scan. But with images alone you lose the ability to search the text content of your documents. And remember, you can not only search within a particular document, but also use GNU utilities to search your entire document collection.

That's where optical character recognition (OCR) comes in. OCR recognizes letterforms in the scanned document image and outputs actual text, which is precisely what we're after. But rather than run a command-line OCR program on every scanned image and produce a .txt file, it's better to combine the two into a single document, and hopefully a single step. That's the purpose of gscan2pdf, a lightweight GUI application that has a built-in SANE scanner interface, an OCR engine, and the ability to write PDF documents that embed the OCRed text and use the scanned image as a background for improved legibility.
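To make the "search your entire collection" idea concrete, here is a minimal Python sketch of what such a search could look like once your scans are saved as text-bearing PDFs. It assumes poppler's pdftotext command is installed to pull the text layer out of each file; the function names and the "scans" folder are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path


def pdf_text(path):
    """Extract a PDF's text layer with pdftotext (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def search_collection(root, term, extract=pdf_text):
    """Return (path, line) pairs for each line containing term, ignoring case."""
    needle = term.lower()
    hits = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        for line in extract(pdf).splitlines():
            if needle in line.lower():
                hits.append((pdf, line.strip()))
    return hits

# Example use, with "scans" as a hypothetical folder of saved PDFs:
#   for path, line in search_collection("scans", "invoice"):
#       print(f"{path}: {line}")
```

The extractor is passed in as a parameter so that you could swap in a different tool (for DjVu files, say) without touching the search loop.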

Installing it, Firing it Up

You can grab the latest gscan2pdf build from the project's web site, including tar archives and RPM packages. If you use a Debian-based distro (including Ubuntu), it is already available through the package manager (and it is probably worth checking for in other desktop distributions as well). If you do install from source, you will need the SANE libraries, Perl, ImageMagick, and a few other common packages. The only dependencies you might not already have installed are the OCR engines (GOCR, Tesseract, and CuneiForm), the support package Unpaper, and the DjVu image format library. Unpaper is a post-processor that cleans up scans for better OCR performance, and DjVu is an alternative image output format that, like PDF, preserves both text and images.
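If you want a quick inventory of which helpers are already on your system, a few lines of Python will do it. This is only a sketch: the command names below are my assumptions about how these packages are typically installed, and your distribution may name or split them differently.

```python
import shutil

# Command names are assumptions about typical packaging; adjust for your distro.
TOOLS = {
    "gscan2pdf": "the application itself",
    "scanimage": "SANE command-line front end",
    "convert": "ImageMagick",
    "tesseract": "Tesseract OCR engine",
    "gocr": "GOCR engine",
    "cuneiform": "CuneiForm OCR engine",
    "unpaper": "Unpaper scan clean-up",
    "cjb2": "DjVuLibre encoder",
}


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the command names that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]

# Example use:
#   for name in missing_tools():
#       print(f"not found: {name} ({TOOLS[name]})")
```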

When you launch gscan2pdf, most of the main window is taken up by a preview panel with two tabs: "Image" and "OCR Output." Down the left is a thumbnail pane that will allow you to browse between individual pages as you scan them. The basic workflow is simple: Click on the "scan" icon in the toolbar, which opens up the floating scanner window. If you happen to have multiple scanners attached, choose the right one from the Device drop-down selector. There are seven tabs of controls and options you can configure, but at any point you can simply punch the "Scan" button at the bottom, and gscan2pdf will scan in the current page, run OCR on it, and slide it into the list of pages in the thumbnail browser.

The scan window stays open, so once you get your settings right, you can scan page after page without the interface disappearing or otherwise getting in your way. By contrast, if you have a lot of pages to scan in an app like XSane, you have to pause and save each one in order to continue.

Most of the mental energy you'll expend during this process is in getting the scanner settings right. The "Scan Mode" tab exposes every scanner feature your hardware supports, but in my experience it doesn't pick particularly useful defaults right out of the starting block. The scan resolution, for example, should probably be set to 200 or higher, and you will almost certainly want either color or grayscale image mode. Whether brightness, sharpness, and the various color-correction options improve your results is a toss-up; you can certainly get nicer looking images by fiddling with the settings, but unless the document you're scanning needs to be pristine and archival-quality for some legal reason, the real goal is getting better OCR results (and in the legal case, you should probably save the paper original anyway...).
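To get a feel for what the resolution and color-mode choices cost you, it helps to run the numbers. The sketch below computes pixel dimensions and uncompressed image size for a given page; the US Letter dimensions and the bytes-per-pixel figures (1 for 8-bit grayscale, 3 for 24-bit color) are assumptions for illustration, and real files will be smaller once compressed.

```python
def scan_size(width_in, height_in, dpi, bytes_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel

# US Letter (8.5 x 11 inches), 8-bit grayscale at 200 dpi:
#   scan_size(8.5, 11, 200, 1)  ->  (1700, 2200, 3740000)
# The same page in 24-bit color at 300 dpi:
#   scan_size(8.5, 11, 300, 3)  ->  (2550, 3300, 25245000)
```

In other words, going from 200 dpi grayscale to 300 dpi color multiplies the raw data per page by almost seven, which is worth keeping in mind before you scan a 40-page contract.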

I had trouble getting the "Preview" tab to work. The "Optional equipment" tab gives you access to film and transparency scanner features, which is not the use case we're describing here, but it also lets you select an external document feeder if you have one. Primarily, however, good results are going to come from getting a quality scan and choosing the right OCR engine.

Using OCR with gscan2pdf

You select the OCR engine from the "Post-processing" section of the "Page Options" tab. Among the recommended engines, I found Tesseract to give the best results, but it is worth running several samples to be sure: different types of document are likely to fare better under different OCR algorithms. The "Clean up images" option lets you set Unpaper features, such as filtering out solid blocks of color, ignoring borders, and automatically de-skewing the image. De-skewing can have a dramatic effect, but in my tests Unpaper could only correct a certain amount of skew, beyond which it got confused, and it did not cope well with pages using a lot of indentation. Best advice: try to have a steady hand when you close the scanner lid.
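If you would rather put a number on "best results" than eyeball it, one rough approach is to type out a short sample passage by hand, save each engine's OCR text for the same page, and score how closely each matches. The sketch below does this with Python's standard difflib module. It is not how gscan2pdf evaluates anything, just one hedged way to compare; the function names are my own.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace so line-wrapping differences are ignored."""
    return " ".join(text.split())


def ocr_similarity(reference, ocr_output):
    """Return a 0.0-1.0 similarity between hand-typed text and OCR output."""
    matcher = SequenceMatcher(
        None, normalize(reference), normalize(ocr_output), autojunk=False
    )
    return matcher.ratio()


def rank_engines(reference, outputs):
    """Given {engine name: OCR text}, return (name, score) pairs, best first."""
    scores = {name: ocr_similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A score of 1.0 means the OCR text matched your reference exactly after whitespace is normalized; anything lower means there is correcting to do. Try it on a couple of different document types before settling on an engine.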

Makin' Copies

OCR is never perfect, and you can open a built-in editor for the text in any page's "OCR Output" tab. When you're happy with the content, hit "Save" from the toolbar or the File menu. This brings up an intermediary file-format dialog rather than the normal GNOME "Save File" window, which can be confusing at first. Although you can save your scan as a flat image file (TIFF, PNG, etc.), the only output formats that preserve the OCR text and the scanned image are PDF and DjVu. You can also save your work as a "gscan2pdf session" for later resumption.

DjVu, for those unfamiliar, is an open format dedicated to scanned documents. It maintains separate text and background image layers, and can achieve a high compression level at good quality thanks to some clever encoding techniques. PDF is a more widely-supported format among applications, though, so it is the preferred choice. Oddly enough, gscan2pdf will export a scanned image to PostScript file format, but in doing so it embeds the image as a TIFF file yet does not appear to include the OCR text. I still haven't figured out what's going on there.

Arguably the best feature of gscan2pdf's document-saving process is that you can use it to create multi-page PDFs. When scanning pages, the "Page Options" tab has a feature that numbers each successive scan as the next page in a compound document. In the same tab you can mark pages as double-sided. These features together allow gscan2pdf to write multi-page PDFs that a modern PDF reader like Evince can display, search, and page-step through just like the output of a high-end DTP application.
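The page-numbering idea is easier to see with a worked example. Suppose you scan the front of every sheet in a stack, then flip the whole stack over and scan the backs, which now come through in reverse order. The sketch below shows the numbering that puts everything back in reading order. It illustrates the arithmetic only, under that assumed workflow, and is not taken from gscan2pdf's own code.

```python
def duplex_page_numbers(sheet_count):
    """Return (front_numbers, back_numbers), each listed in scan order.

    Fronts are scanned first (1, 3, 5, ...); the flipped stack then
    delivers the backs last-sheet-first (..., 6, 4, 2).
    """
    fronts = list(range(1, 2 * sheet_count, 2))
    backs = list(range(2 * sheet_count, 0, -2))
    return fronts, backs


def assemble(front_scans, back_scans):
    """Interleave the two scan passes into reading order."""
    fronts, backs = duplex_page_numbers(len(front_scans))
    numbered = dict(zip(fronts, front_scans))
    numbered.update(zip(backs, back_scans))
    return [numbered[n] for n in sorted(numbered)]

# Three sheets: fronts are numbered 1, 3, 5 and backs 6, 4, 2.
```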

Scanner Beware

Gscan2pdf does have its quirks. A friend of mine lamented recently that he could not get the application to produce quality scan images, noting that every page was visually skewed, like a parallelogram. Unfortunately, that sort of error is most likely to be a SANE problem, and since SANE uses different back-ends for different scanner families, it could be hard to track down.

You also need to go into the process fully aware of what OCR can and cannot produce. It is an imperfect science, and you will have to read through every page of text to correct character-recognition errors. Gscan2pdf does not have a built-in spellchecker, and it's a toss-up whether one would actually help. Yes, it might be nice to have a spell-checking option (such as to catch your own typos when making corrections), but the truth is that OCR produces a completely different set of character-substitution mistakes than humans make when typing, so an automatic spellchecker based on ispell, aspell, or other open source engines would be more aggravation than assistance. It would flag dozens and dozens of errors, but would be no help making suggestions, because the mistakes stem from the visual similarity of characters, not keyboarding or spelling trouble.
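To illustrate why look-alike errors call for a different kind of suggestion than typing errors do, here is a small sketch that proposes corrections by undoing visual confusions rather than keyboard slips. The confusion pairs are illustrative guesses on my part (not measurements from any particular OCR engine), and the word list is whatever vocabulary you hand it.

```python
# Illustrative look-alike pairs (misread -> intended); not an exhaustive
# or measured table for any real OCR engine.
CONFUSIONS = [("rn", "m"), ("0", "o"), ("1", "l"), ("5", "s"), ("vv", "w"), ("cl", "d")]


def ocr_suggestions(word, vocabulary, confusions=CONFUSIONS):
    """Return vocabulary words reachable by undoing ONE look-alike substitution."""
    lower = word.lower()
    found = set()
    for wrong, right in confusions:
        start = lower.find(wrong)
        while start != -1:
            candidate = lower[:start] + right + lower[start + len(wrong):]
            if candidate in vocabulary:
                found.add(candidate)
            start = lower.find(wrong, start + 1)
    return sorted(found)

# Example use:
#   ocr_suggestions("modern", {"modem", "due"})  ->  ["modem"]
#   ocr_suggestions("clue", {"modem", "due"})    ->  ["due"]
```

Note that the first example "corrects" a perfectly good word: "modern" is valid English, so a dictionary-based checker would never flag it even if the page actually said "modem." This sketch also undoes only one substitution per word, so something like "bi11" would slip past it.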

You have other options for converting paper documents to OCR'ed digital files, too. Kooka, for instance, lets you perform OCR in addition to scanning, but it lacks the automated output and PDF-generation features that make gscan2pdf so simple to use. Scan on.
