Google’s Tesseract OCR engine is a quantum leap forward

Author: Nathan Willis

The open source optical character recognition (OCR) landscape got dramatically better recently when Google released the Tesseract OCR engine as open source software.

The Tesseract code was written at Hewlett-Packard in the 1980s and ’90s. In 1995, it was one of the top-tier performers at UNLV’s OCR competition, but when HP withdrew from the OCR software marketplace, the code languished. Then in 2005, HP handed off the code to UNLV’s Information Science Research Institute (ISRI), an academic center doing ongoing research into OCR and related topics. ISRI discovered that original Tesseract developer Ray Smith was now an employee at Google, and asked the search engine giant if it was interested in the code. Google spent a few months updating the code to compile on modern operating systems, and released it on SourceForge.net.

You can download the latest tarball, a bugfix release numbered 1.0.1, from the Tesseract OCR project page. The only compilation instructions are those in the release notes section of the SourceForge.net download page, which covers Windows, Mac OS X, and Linux builds of the same source code. Compilation under Linux is straightforward (run ./configure followed by make), but there is no make install step. Instead, you must move the resulting tesseract binary into its parent directory, where it expects to find a support directory called tessdata. Make sure the directory is writable, because Tesseract generates temporary files there while processing an image.
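Because there is no install step, it is easy to end up with the binary and tessdata in different places. The following Python sketch checks for the layout described above. The function name is my own, and I am assuming that tessdata is the directory that needs to be writable, so treat it as a rough sanity check rather than anything the project ships.

```python
import os

def check_tesseract_layout(base_dir):
    """Return a list of problems with the expected layout (empty if none).

    Assumption: the tesseract binary and the tessdata directory should sit
    side by side in base_dir, and tessdata must be writable.
    """
    problems = []
    binary = os.path.join(base_dir, "tesseract")
    tessdata = os.path.join(base_dir, "tessdata")

    if not os.path.isfile(binary):
        problems.append("tesseract binary not found in " + base_dir)
    if not os.path.isdir(tessdata):
        problems.append("tessdata directory not found in " + base_dir)
    elif not os.access(tessdata, os.W_OK):
        problems.append("tessdata directory is not writable")
    return problems

# Check the current directory and report anything that looks wrong.
for problem in check_tesseract_layout("."):
    print(problem)
```

Run it from the directory into which you moved the binary. No output means the layout looks right.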

The usage instructions are concise — Tesseract has no switches and does exactly one thing. Just execute tesseract example.tiff outputfilename, and Tesseract will generate an ASCII file named outputfilename.txt containing the text recognized from example.tiff.
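If you want to call Tesseract from a script, the two-argument command line is easy to wrap. Here is a minimal Python sketch. The function names and the ./tesseract default path are my own assumptions, and it relies only on the behavior described above: two arguments in, a .txt file out.

```python
import subprocess

def build_command(image_path, output_base, binary="./tesseract"):
    """Build the argument list: tesseract <image> <output name without .txt>."""
    return [binary, image_path, output_base]

def run_tesseract(image_path, output_base, binary="./tesseract"):
    """Run Tesseract on a TIFF file and return the recognized text."""
    subprocess.run(build_command(image_path, output_base, binary), check=True)
    # Tesseract appends .txt to the output name itself.
    with open(output_base + ".txt", encoding="ascii", errors="replace") as f:
        return f.read()

# Show the command that would be executed for the example in the text.
print(" ".join(build_command("example.tiff", "outputfilename")))
```

Calling run_tesseract("example.tiff", "outputfilename") would then return the contents of outputfilename.txt as a string.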

Currently, Tesseract recognizes only English and works only on TIFF files (black and white, 8-bit greyscale, and 24-bit color; no compression). Also, it can generate output only in the US-ASCII character set, so glyphs with accent marks or other unsupported attributes will probably be reproduced incorrectly.

Proof text

In January, I wrote an overview of free software OCR engines, concluding pessimistically that if you needed to OCR text of any length, you should consider paying a typist to transcribe it. Now, the question is: how well does Tesseract compare to the open source competition?

The difference is night and day. This is an OCR engine that actually works. For the sake of comparison, this listing is Tesseract’s output on the exact same file with which I tested Kooka in the earlier review. I count seven mistakes (five spelling, one capitalization, and one punctuation) in 266 words. Discarding the punctuation mistake, because I did not count punctuation in my previous review, leaves six mistakes, which means Tesseract correctly recognized 260 of 266 words, or 97.74% of the text.
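The arithmetic behind that percentage is simple enough to check. This short Python sketch assumes each counted mistake spoils exactly one word, which is how I scored the sample. The function name is just for illustration.

```python
def accuracy_percent(total_words, mistakes):
    """Word-level accuracy, assuming each mistake affects exactly one word."""
    return 100.0 * (total_words - mistakes) / total_words

# Six mistakes (punctuation excluded) in 266 words.
print(f"{accuracy_percent(266, 6):.2f}%")  # prints 97.74%
```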

Of course, unlike Kooka, Tesseract does not recognize page layout (e.g., multi-column text), so it combined the two columns into one. Nor does it share Kooka’s ability to mask out non-text parts of an image, so I had to remove an illustration from the page with the GIMP.

All things considered, though, it is the success rate of the text-recognition engine that matters most. The rest is just gravy. Even without a GUI, Tesseract is more useful today than Kooka.

Google has confirmed its intention to continue developing the Tesseract code, although it does not have concrete plans. Currently the Tesseract OCR project on SourceForge has only two members, the Google engineers who work on the project part-time. I spoke with Google’s Luc Vincent about the future of the project, and he listed the known shortcomings (limited file-type support, no languages beyond English, and no page-layout analysis) as the targets for future development.

But, he said, the direction that the project takes will largely depend on the interest shown by outside programmers. Google does not have plans to develop Tesseract into a full-fledged application like Picasa or Google Earth.

Read ’em and weep

Google clearly deals in areas where OCR is important (such as the Book Search and Image Search programs), but it doesn’t need a GTK or Qt app for them. Where Tesseract goes for Linux users will depend on who gets involved and actually works on the code.

Luckily, some activity has already begun. For instance, a couple of enterprising early adopters have worked up a simple script that uses ImageMagick to seamlessly convert other image formats and pass them to Tesseract, overcoming one of the software’s big limitations.
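To give a sense of how little glue this takes, here is a Python sketch of the same idea. It is not the script mentioned above, and the function names and the ./tesseract path are my own. It assumes ImageMagick's convert command is on your PATH (newer ImageMagick releases call it magick) and uses the -compress None and -depth 8 options to produce the uncompressed 8-bit-per-channel TIFF that Tesseract expects.

```python
import os
import subprocess
import tempfile

def convert_command(src, dst_tiff, convert_bin="convert"):
    """ImageMagick arguments for an uncompressed, 8-bit-per-channel TIFF."""
    return [convert_bin, src, "-compress", "None", "-depth", "8", dst_tiff]

def ocr_any_image(src, output_base, tesseract_bin="./tesseract"):
    """Convert src to TIFF in a temporary directory, then run Tesseract on it."""
    with tempfile.TemporaryDirectory() as tmp:
        tiff = os.path.join(tmp, "page.tiff")
        subprocess.run(convert_command(src, tiff), check=True)
        subprocess.run([tesseract_bin, tiff, output_base], check=True)
    # Tesseract writes its result to <output_base>.txt.
    with open(output_base + ".txt", encoding="ascii", errors="replace") as f:
        return f.read()

# Show the conversion command for a hypothetical PNG scan.
print(" ".join(convert_command("scan.png", "page.tiff")))
```

The temporary TIFF is deleted once Tesseract finishes, so only the text output is left behind.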

The Tesseract code is released under the Apache 2.0 License, which the Free Software Foundation considers incompatible with the GPL, though the Apache Software Foundation does not share that view. On the licensing front, it is worth noting that the tesseract-1.0.1 tarball contains a subdirectory with third-party code called Aspirin/MIGRAINES. This code is licensed separately (as the README and other documentation make clear), under a non-free software license, but it is not actually used by the current version of Tesseract.

Even if Tesseract 1.0.1 were to be the only release ever made from this project, it would still have changed the landscape of OCR for free software dramatically. I’m confident it won’t be the only release, because the quality is simply that high. Do yourself a favor and check it out: it builds quickly on Linux, and it actually works as advertised. It is lucky for us that the best GUI OCR program (Kooka) uses pluggable OCR engines, so the Tesseract code could join the current arsenal (GOCR and OCRAD) in short order, and provide free software users with an all-in-one solution.