January 3, 2006

Optical character recognition is an uphill battle for open source

Author: Nathan Willis

If you use Linux, or another free operating system, and need optical character recognition (OCR) software, be prepared for a challenge. OCR is a tricky problem on any computing platform -- both because it is conceptually hard, and because the task does not lend itself to simple, easy-to-use interfaces.

OCR is the use of visual pattern matching to extract text from an image -- usually a scanned paper document, but it could be a digital photo, a frame of video, or a screenshot just as easily.

Proprietary platforms like Mac OS X and Windows have a few commercial choices, but even the best of them can be headache-inducing. A few years ago, while working for a non-profit organization, I had to scan in and convert old documents using OCR -- out-of-print books, magazine articles, the occasional printout of a lost digital original. We tried several different products, and found that even the best of them required thorough proofreading, numerous corrections, manual tweaking of the input images, and a whole lot of time. So it was with low expectations that I approached the task of finding a suitable open source OCR program.

Extracting text with Kooka

The best of the lot available today is Kooka, the KDE environment's default scanning application. Kooka uses the GNU project's Ocrad as its OCR engine. It supports a wide variety of scanners via the SANE library, for when you are acquiring images as you work.

Scanning with Kooka - click to enlarge

Its documentation suggests that Kooka can use other OCR engines in addition to Ocrad, but Kooka offers only one other option through its preferences, GOCR. Despite selecting GOCR and restarting Kooka as directed, the application persisted in using Ocrad. I did verify that GOCR was installed, functional, and located in /usr/bin as Kooka expected by testing its GTK front end, gtk-ocr. Kooka just would not use it.

Kooka's controls give you a lot of leeway over the resolution, brightness, and contrast of the document you are scanning. For the best results, you want to tweak the settings to give maximum contrast, eliminating if possible specks of dust and shadows on the paper. The downside is that excessive contrast in the scan can wash out the serifs, dots, and thin strokes of the letters, making them harder for OCR software to distinguish.

OCR with Kooka - click to enlarge

In my case, the images I needed to OCR were already-digitized scans of the user manual for an old camera system. I chose previously digitized images in order to test all of the OCR apps I found (some of which do not perform scanning) on an equal footing. The images were of good quality, and the text was perfectly readable to start with. To adjust brightness and contrast in these images I had to open them in an external image editor, which Kooka conveniently lets you do through the Image menu.

You initiate an OCR job with OCR Image from the Image menu. Kooka presents a dialog from which you can tweak several settings, including whether the program should attempt to guess the layout of the document, and whether to enable spell-checking on the result.

Listing 1 is the result of the first pass with Kooka's OCR, in which all settings were left at their defaults. Clearly there is much to be desired; among other things a plethora of stray pixels were picked up and interpreted as punctuation marks. Listing 2 is the result of the second pass, after having cleaned up the original image considerably with the GIMP. Listing 3 began with the original text from Listing 2, corrected by the spell checker.

Correcting spelling with Kooka

Spell checking in Kooka is a manual process; the program cycles through the OCRed text and (much as in any office app) asks you about every unrecognized word, suggesting replacements when possible. Kooka uses existing spell-checking engines, such as ispell, aspell, and hspell (for Hebrew).

The problem with this approach is that none of these spell-checking engines are aware that the checked text is from an OCR source. Word processing spelling errors are very different from OCR spelling errors; the former are the results of mistyping, the latter are the results of machine vision problems.

OCR spell-checking needs to take into account that the mistakes it is looking for stem from similarities in letter shape, confusion between capital and lowercase letters, and misplaced word breaks. Ispell and Aspell can suggest replacement corrections based on dictionary words, but they are of little help.

Another feature often found in commercial OCR applications is the ability to mark or exclude areas of the page from scanning. In some cases, it is possible to indicate that two or more blocks of text are connected, which can be very helpful when scanning in magazine layouts. Kooka has the beginnings of automatic layout detection via the Ocrad engine -- you can tell it to interpret the document as multi-column or to guess at more complex layouts. But, in most cases, specifying the layout yourself would be much faster.

Using Clara - click to enlarge

Everything else

By my count, Kooka was able to correctly identify just 123 of 270 words, or 45.56 percent -- not an encouraging number. But the pickings are slim for Linux users. I did examine other applications: gtk-ocr and Clara. Both are functioning GUI programs, but neither has close to the feature set of Kooka.

Gtk-ocr is sparse in its interface and controls. The command-line client may be more functional, but for a graphically intensive task such as this I would not recommend it. Clara has a more promising feature list than gtk-ocr, but the last update to the program was a long time ago, and it uses Xlib for its interface -- a fact that may frighten away younger users, but elicit feelings of nostalgia in others with more experience.

Using gtk-ocr - click to enlarge

A number of other projects turn up on a Freshmeat or SourceForge search for "OCR," but most are academic in nature, and not suited for end users.

Given the inherent complexity of tasks involving natural language processing (OCR, speech recognition, and machine translation, to name a few), it should come as no surprise that (a) the available tools are just marginally useful, and (b) research into the subject continues.

I hope that some of the research being conducted will find its way into a usable open source OCR application for Linux. Right now, the best alternatives for those needing to extract text from an image all involve spending money. There are a few proprietary solutions advertising Linux support (VueScan and OCRShop, for example) and proprietary OCR software for proprietary OSes -- but don't underestimate the value of paying a typist to transcribe your text the old-fashioned way.