ChangeLog: Google launches OCRopus OCR project


Author: Nathan Willis

Google announced this week a “technology preview” of its new OCRopus optical character recognition (OCR) project. OCRopus is an open source system incorporating existing code from Google’s Tesseract OCR engine, with plans to build new components where required. The company will deploy OCRopus to aid in scanning material for its book search service.

Google will be working with the Image Understanding and Pattern Recognition (IUPR) research group at the German Research Center for Artificial Intelligence (DFKI), and the project will employ three graduate students or postdocs. IUPR’s Thomas Breuel will head up the project.

The announcement enumerates several specific software components: the Tesseract engine, the hOCR HTML-based output format, the Recognition by Adaptive Subdivision of Transformation Space (RAST) page layout analysis algorithm, a handwriting-recognition library, and an English language recognition module. Although English is the only language supported at this time, the project’s FAQ affirms its intent to eventually support any language, script, and character set.
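To give a rough sense of what an HTML-based OCR output format looks like in practice, here is a minimal Python sketch that pulls text lines and their bounding boxes out of an hOCR-style fragment. It assumes the common hOCR convention of marking lines with class="ocr_line" and storing coordinates in the title attribute as "bbox x0 y0 x1 y1". The sample markup is hand-written for illustration and is not actual OCRopus output; the names LineExtractor, parse_bbox, and extract_lines are hypothetical helpers, and the sketch assumes no elements are nested inside a line.

```python
from html.parser import HTMLParser

# Hand-written hOCR-style fragment for illustration only;
# this is NOT actual OCRopus output.
SAMPLE_HOCR = """
<div class="ocr_page" title="bbox 0 0 800 1200">
  <span class="ocr_line" title="bbox 50 40 700 80">Hello OCR world</span>
  <span class="ocr_line" title="bbox 50 100 640 140">Second line of text</span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title like 'bbox 50 40 700 80', or None."""
    # A title may hold several properties separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collects (bbox, text) pairs for every element with class 'ocr_line'.

    Assumption: lines contain plain text only, with no nested elements.
    """

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_line = False
        self._line_tag = None
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self._in_line = True
            self._line_tag = tag
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_line:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._in_line and tag == self._line_tag:
            self.lines.append((self._bbox, "".join(self._text).strip()))
            self._in_line = False


def extract_lines(hocr):
    """Parse an hOCR-style string and return a list of (bbox, text) tuples."""
    parser = LineExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE_HOCR):
        print(bbox, text)
```

Because the layout information travels inside ordinary HTML attributes, a file like this can be opened in a browser as plain text or processed by any HTML parser, which is presumably part of the format's appeal.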

We reviewed Tesseract in September, and found it to be vastly superior to other open source character recognition libraries. Despite its capabilities, though, Tesseract in its present form is extremely limited: command-line only, requiring a very specific input image format, restricted to a small, Latin-alphabet character set, and unable to recognize page layout. Many of these shortcomings appear to be addressed in the OCRopus project’s plans.

Given Google’s goal of using OCRopus in its book scanning project, it remains to be seen whether the resulting code will be a GUI-driven, interactive application like Kooka or something more suited for automated scanning of large volumes. The project road map indicates an alpha-quality release scheduled for the third quarter of 2007, and a 1.0 release in 2008.

The current code base is available through anonymous Subversion checkout. It is C++ and Python code targeted at x86 Ubuntu Linux systems. The project’s wiki has a Getting Started page to assist interested users in setting up the proper build environment, compiling the code, and running simple tests.

If building from source proves problematic, online demos are available in the meantime through the IUPR’s OCRopus Web site. As of today, the best place to read more about the project is its Google Code page. The documentation and announcement both link to an off-site project home page at ocropus.org, but that site appears to be down, at least temporarily.