Linux.com

Feature: Tools & Utilities

How to scan and OCR like a pro with open source tools

By Mathis Dirksen-Thedens on June 24, 2008 (7:00:00 PM)

With optical character recognition (OCR), you can scan the contents of a document into a single file of editable text. This article, which focuses on scanning books, describes the steps you need to take to prepare pages for optimal OCR results, and compares various free OCR tools to determine which is the best at extracting the text.

First, fire up your distribution's package manager to fetch a few packages and dependencies. In Debian, the required packages are sane, sane-utils, imagemagick, unpaper, tesseract-ocr, and tesseract-ocr-eng. You may also install other language packs for Tesseract -- for example, I installed tesseract-ocr-deu for German text.
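Before starting a long scanning session, it can help to confirm that the commands used in this article are actually on your PATH. The following is a minimal sketch, not part of the original workflow; the function name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command in the argument
# list that cannot be found on PATH. Prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands used in this article and the Debian packages that ship them:
#   scanimage (sane-utils), unpaper (unpaper),
#   convert (imagemagick), tesseract (tesseract-ocr)
# Example call:
#   missing_tools scanimage unpaper convert tesseract
```

If the call prints nothing, everything is installed; otherwise it lists the commands that still need a package.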

Scanning the pages

Before you can translate images into text, you have to scan the pages. If you want to scan a book, you can't use an automatic feed for your scanner. The following small bash/ksh script scans pages one at a time and writes each to a separate file in portable anymap (PNM) format, named scan-001.pnm, scan-002.pnm, and so on:

for i in $(seq --format=%003.f 1 150); do
    echo "Prepare page $i and press Enter"
    read
    scanimage --device 'brother2:bus1;dev1' --format=pnm --mode 'True Gray' \
        --resolution 300 -l 90 -t 0 -x 210 -y 200 \
        --brightness -20 --contrast 15 >scan-$i.pnm
done

Adjust the parameters of the scanimage command according to your scanner model (find out which device names you can use with scanimage -L and look up device-specific options with scanimage --help --device yourdevice). Also, adjust the settings that define the scan area: -l (how much to discard on the left), -t (how much to discard on the top), and -x and -y (the width and height of the area to scan, measured from that top left corner). These values are typically given in millimeters. Try to position the book in a way that makes it possible to use these parameters to define a rectangle that contains only the text, not the binding or the border. Don't worry about the page number; you can cut it out later with little effort.
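As a sanity check on the geometry options, you can estimate how large each scanned image should be. This sketch assumes, as is usual for SANE backends, that -x and -y give the width and height of the scan area in millimeters; the helper name mm_to_px is hypothetical, and a real scanner may round slightly differently.

```shell
#!/bin/sh
# Hypothetical helper: convert a length in millimeters to pixels at a
# given resolution in dpi (1 inch = 25.4 mm), rounded to the nearest pixel.
mm_to_px() {
    awk -v mm="$1" -v dpi="$2" 'BEGIN { printf "%d\n", mm / 25.4 * dpi + 0.5 }'
}

mm_to_px 210 300   # width for -x 210 at --resolution 300
mm_to_px 200 300   # height for -y 200 at --resolution 300
```

With the values from the script above this prints 2480 and 2362, so each page should come out at roughly 2480 by 2362 pixels. A file with very different dimensions suggests that the scanner ignored or clipped one of the options.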

Your scans may not be positioned consistently, or they may have shadows in the corners. If you feed these images into an OCR program, you won't get accurate results no matter how good the OCR engine might be. However, you can use the unpaper command to preprocess the images before applying the OCR magic, and thus get the text recognized more accurately. If you scanned the pages in the right orientation -- that is, right side up -- you can use the default settings with unpaper; otherwise, you can use some of the utility's many options. For example, --pre-rotate -90 rotates the image counterclockwise. You can also tell unpaper that two pages are scanned in one image. See the manual page for detailed information. The following unpaper script prepares the scanned images for optimal OCR performance:

for i in $(seq --format=%003.f 1 150); do
    echo "preparing page $i"
    unpaper scan-$i.pnm unpapered-$i.pnm
    convert unpapered-$i.pnm prepared-$i.tif && rm unpapered-$i.pnm
done

You need to convert the scans from .pnm files because the best OCR tool I have found requires the TIFF input format.
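The loops in this article hard-code the range 1 to 150. If your book has a different number of pages, one alternative is to iterate over the scan files that actually exist. The sketch below only derives and prints the file names; prepared_name and list_pending are hypothetical helpers, and the naming scheme is the scan-NNN.pnm and prepared-NNN.tif pattern used above.

```shell
#!/bin/sh
# Hypothetical helper: derive the output name for one scan,
# e.g. scan-007.pnm -> prepared-007.tif.
prepared_name() {
    page=${1#scan-}      # strip the "scan-" prefix
    page=${page%.pnm}    # strip the ".pnm" suffix
    echo "prepared-$page.tif"
}

# Hypothetical helper: print the work list for whatever scans exist
# in the current directory.
list_pending() {
    for f in scan-*.pnm; do
        [ -e "$f" ] || continue   # no scans: the glob stays unexpanded
        echo "$f -> $(prepared_name "$f")"
    done
}
```

Calling list_pending in the scan directory shows which conversions would run. The same for f in scan-*.pnm pattern could replace the seq loop around the unpaper and convert commands.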

Comparing OCR tools

Now comes the most important part: the automated optical character recognition. Many open source tools are available for this job, but I tested a selection and found that most didn't produce satisfactory results. This is not a representative survey, but it is clear that some open source tools perform far better than others.

To illustrate, I have prepared a small example from a German book written by my wife's grandfather. The figure to the right shows the original text. It's a smaller version of the original 300 DPI scan that I fed to the OCR programs.

GOCR produced the following results:

Ja, wer _einer __leute hat ihn njcht jn
_3meg. Menc_al fra_e 3ch _jch, wa-
_ gerade der Maulbeerba_ es 3st. Ejne
Aprikose ist doc.h eine vjel edlere m3cht.

Ocrad provided the following:

Ia, Her meiner _leute hat ihn nicht in
_iMe_. Mònchmal fragte ich mich, Na-
nm gerade der Maulbeerbaum es ist. Eine
Rpyik_e ist doch eine viel edlere nvcht.

I used the -l deu option with Tesseract-OCR to select the German word library, which resulted in the following:

Ja. wer meiner Landsleute hat ihn nicht in
Erinnerung. Manchmal fragte ich mich, wa-
rum gerade der Maulbeerbaum es ist. Eine
Aprikose ist doch eine viel edlere Frucht.

Of the three, Tesseract-OCR worked the best, making only one mistake: it interpreted the comma in the first line as a period. Therefore, I made Tesseract-OCR my tool of choice. This simple script uses that application to apply OCR to every scanned page:

for i in $(seq --format=%003.f 1 150); do
    echo "doing OCR on page $i"
    tesseract prepared-$i.tif tesseract-$i -l eng
done

The result of that process is a bunch of text files that each represent the contents of one page.
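A reader comment on this article reports that Tesseract occasionally hangs on particularly nasty pages. One way to guard a long batch run against that is to wrap each call in a time limit. This is a sketch that assumes the timeout command from GNU coreutils is installed; the function name run_with_limit and the 120-second limit in the example are arbitrary choices, not recommendations from the article.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit in seconds.
# timeout(1) from GNU coreutils exits with status 124 when it has to
# stop the command because the limit was reached.
run_with_limit() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Example call inside the OCR loop:
#   run_with_limit 120 tesseract prepared-$i.tif tesseract-$i -l eng
```

A page that hits the limit is reported on standard error and the loop moves on, so you can revisit those pages by hand afterwards.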

Putting it all together

Before you create a consolidated document, you'll want to remove any page numbers that still exist in your text files. If they're located above the text, you can strip the first line from every text file that Tesseract-OCR produced:

for i in $(seq --format=%003.f 1 150); do
    tail -n +2 tesseract-$i.txt >text-$i.txt
done

If they are below the text, just use head -n -1 in the above script instead of tail -n +2. This causes the script to remove the last line and not the first.
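Both tail -n +2 and head -n -1 remove a line unconditionally, so a page without a recognized page number loses a line of real text. A more cautious variant, sketched here under the assumption that page numbers are plain digits on a line of their own, removes the first line only when it looks like a number. The function name strip_leading_page_number is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: delete the first line of the input only if it
# consists of digits, optionally surrounded by whitespace. All other
# input passes through unchanged.
strip_leading_page_number() {
    sed '1{/^[[:space:]]*[0-9][0-9]*[[:space:]]*$/d;}'
}

# Example call inside the loop:
#   strip_leading_page_number <tesseract-$i.txt >text-$i.txt
```

This will miss page numbers that the OCR engine misread as letters, so a quick look at the first lines of the output files is still worthwhile.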

Finally, use cat text-*.txt >complete.txt to create one big file containing your whole book. Edit the resulting file and unhyphenate the whole text by replacing each combined occurrence of a hyphen and a line feed with an empty string. You can also remove unnecessary line feeds. In gedit, you can define your own tools and make them available via a keyboard shortcut. I defined the following tool to work on the current selection:

#!/bin/sh
# newlines to spaces, then only one space character at a time
tr '\n' ' ' | sed 's/[[:blank:]]\{2,\}/ /g'

With this, you can select some lines and press your defined shortcut. The whole selection becomes one line.
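The unhyphenation step described above, replacing each hyphen that is directly followed by a line feed with an empty string, can also be done from the command line instead of an editor. The following is a minimal sketch using awk; the function name unhyphenate is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: join words that were split across lines by
# dropping a trailing hyphen together with the line feed after it.
# Lines that do not end in a hyphen are printed unchanged.
unhyphenate() {
    awk '{ if (sub(/-$/, "")) printf "%s", $0; else print }'
}

# Example call:
#   unhyphenate <complete.txt >joined.txt
```

Applied to the sample above, "wa-" followed by "rum gerade" becomes "warum gerade". Note that this also joins words with a genuine hyphen that happens to fall at the end of a line, so those need to be restored by hand. The output file name joined.txt is only an example.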

You now have one large document that represents the contents of the book. Consider reading the whole file again to find any typos that may be left, then move on to LaTeX to create a professional-looking Portable Document Format file from your scanned text.

Mathis Dirksen-Thedens has a degree in mathematics and computer science and works in the IT department of a big German power and gas supplier.

Comments

on How to scan and OCR like a pro with open source tools

Note: Comments are owned by the poster. We are not responsible for their content.

Posted by: Anonymous [ip: 141.123.223.100] on June 24, 2008 09:42 PM
Fantastic and very informative article. I've tried to find a good open source OCR solution, but the websites of the first two look like they were made in 1994. That's usually a good indicator that the author doesn't have a clue of current technology (I hear you: yay, it's lynx-friendly! Get a clue, luddite). I'll have to take a gander at Tesseract. Thanks!

Posted by: Anonymous [ip: 71.234.246.22] on June 25, 2008 04:01 AM
Tesseract is the OCR package to use. I've had good success using it for a health-information application. It does a really good and accurate job on numbers (like dates and record numbers) even with a variety of fonts.

You may find it helpful to put an alarm around your tesseract runs. I've found some particularly nasty pages on which it hangs.

Posted by: Anonymous [ip: 64.0.193.178] on June 25, 2008 06:41 PM
You don't actually need sed in the last script (I have an irrational dislike of sed):

tr -s '[:blank:]\n' ' '

Posted by: Anonymous [ip: 206.176.231.19] on June 25, 2008 11:24 PM
reinvention of wheel? Because old wheel is not round enough?

gscan2pdf

Posted by: Anonymous [ip: 88.162.36.171] on June 26, 2008 03:13 AM
For a nice interface to xscanimage, unpaper, and tesseract-ocr, check out gscan2pdf. It has a lot of great features and really simplifies the process.
http://gscan2pdf.sourceforge.net/

Re: gscan2pdf

Posted by: Mathis Dirksen-Thedens on June 28, 2008 07:59 AM
I agree, gscan2pdf has a lot of great features, but it does a different thing. If I want to have the plain text of the scanned original afterwards, gscan2pdf cannot help me, as it only produces a PDF with the recognized text in a hidden layer behind the image. The sole purpose of this is to make the PDF indexable by desktop search helpers like Beagle.

Re(1): gscan2pdf

Posted by: Anonymous [ip: 88.162.36.171] on June 29, 2008 01:30 AM
It's completely possible to cut or copy the OCR text from the output window in gscan2pdf and then paste it into whatever text manipulation application you prefer. This works fine for smaller projects, although I agree that it's probably not the best choice for dealing with a whole book.

There's also the xsane2tess script, a wrapper for tesseract-ocr that can be used with xsane's scan to text feature. See: http://doc.ubuntu-fr.org/xsane2tess (in French, but try Google translation).

Posted by: Anonymous [ip: 149.149.120.95] on June 27, 2008 05:09 PM
So no GUI options here? I'm prone to typing mistakes. Would spend all day chasing errant punctuation or spelling mistakes...

Posted by: Anonymous [ip: 76.181.177.61] on June 28, 2008 12:10 PM
First of all, I do appreciate the article. I am not sure I am going to get ANYBODY, not a geek, to do this, including para-legals, my sister, or my grandma. It is like that programmer that said we should teach the whole marketing department, SQL, so they can update the database. I asked if he could build a screen, and he said yes, but it would take him a day. I asked how many days and how much support it would take to teach 14 non technical people, SQL. He finally built the screen.

Posted by: Anonymous [ip: 196.219.138.54] on June 29, 2008 02:52 AM
or as our friend put it in the post previous to mine, don't teach a man to fish, just give him a damn dolphin, he'll be happy and so will the rest of his family.

Posted by: Anonymous [ip: 87.8.122.243] on June 30, 2008 01:36 PM
ocr in txt format: and what kind of "professional" is in it? A solution of 1995??? And u want to compare with Finereader? or Omnipage? Come on be serious!!!
