
How to scan and OCR like a pro with open source tools


With optical character recognition (OCR), you can scan the contents of a document into a single file of editable text. This article, which focuses on scanning books, describes the steps you need to take to prepare pages for optimal OCR results, and compares various free OCR tools to determine which is the best at extracting the text.



First, fire up your distribution's package manager to fetch a few packages and dependencies. In Debian, the required packages are sane, sane-utils, imagemagick, unpaper, tesseract-ocr, and tesseract-ocr-eng. You may also install other language packs for Tesseract -- for example, I installed tesseract-ocr-deu for German text.
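On a Debian-based system, the whole toolchain can be installed in one go (package names as listed above; exact names may differ on other distributions):

```shell
# Install scanner support, image tools, and the OCR engine (Debian/Ubuntu)
sudo apt-get install sane sane-utils imagemagick unpaper tesseract-ocr tesseract-ocr-eng
# Optional: an additional Tesseract language pack, e.g. German
sudo apt-get install tesseract-ocr-deu
```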

Scanning the pages

Before you can translate images into text, you have to scan the pages. If you want to scan a book, you can't use your scanner's automatic document feeder. The following small bash/ksh script scans pages one at a time and writes each to a separate file in portable anymap (PNM) format named scan-n.pnm, where n is the page number:

for i in $(seq --format=%003.f 1 150); do
  echo "Prepare page $i and press Enter"
  read
  scanimage --device 'brother2:bus1;dev1' --format=pnm --mode 'True Gray' --resolution 300 -l 90 -t 0 -x 210 -y 200 --brightness -20 --contrast 15 >scan-$i.pnm
done


Adjust the parameters of the scanimage command to suit your scanner model (find out which device names you can use with scanimage -L, and look up device-specific options with scanimage --help --device yourdevice). Also adjust the settings for the parameters -l (the left offset of the scan area), -t (the top offset), and -x and -y (the width and height of the scan area, in millimeters). Try to position the book so that these parameters define a rectangle that contains only the text, not the binding or the border. Don't worry about the page number; you can cut it out later with little effort.
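Before committing to a 150-page run, it pays to probe the scanner once by hand. A sketch, where the device name is only an example and yours will differ:

```shell
# Discover attached scanners and their SANE device names
scanimage -L
# List the options your particular device supports (example device name)
scanimage --help --device 'brother2:bus1;dev1'
# Test the geometry settings on a single page before scanning the whole book
scanimage --device 'brother2:bus1;dev1' --format=pnm --resolution 300 \
  -l 90 -t 0 -x 210 -y 200 >test-page.pnm
```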


Your scans may not be positioned consistently or have shadows in the corners. If you feed these images into an OCR program, you won't get accurate results no matter how good the OCR engine might be. However, you can use the unpaper command before applying the OCR magic to preprocess the image and thus get the text recognized more accurately. If you scanned the pages in the right orientation -- that is, right side up -- you can use the default settings with unpaper; otherwise, you can use some of the utility's many options. For example, --pre-rotate -90 rotates the image counterclockwise. You can also tell unpaper that two pages are scanned in one image. See the manual page for detailed information. The following unpaper script prepares the scanned images for optimal OCR performance:


for i in $(seq --format=%003.f 1 150); do
  echo "Preparing page $i"
  unpaper scan-$i.pnm unpapered-$i.pnm
  convert unpapered-$i.pnm prepared-$i.tif && rm unpapered-$i.pnm
done


You need to convert the scans from PNM to TIFF because the best OCR tool I have found requires TIFF input.


Comparing OCR tools


Now comes the most important part: the automated optical character recognition. Many open source tools are available for this job, but I tested a selection and found that most didn't produce satisfactory results. This is not a representative survey, but it is clear that some open source tools perform far better than others.


To illustrate, I have prepared a small example from a German book written by my wife's grandfather. The figure to the right shows the original text. It's a smaller version of the original 300 DPI scan that I fed to the OCR programs.

GOCR produced the following results:


Ja, wer _einer __leute hat ihn njcht jn _3meg. Menc_al fra_e 3ch _jch, wa- _ gerade der Maulbeerba_ es 3st. Ejne Aprikose ist doc.h eine vjel edlere m3cht.


Ocrad provided the following:


Ia, Her meiner _leute hat ihn nicht in _iMe_. Mònchmal fragte ich mich, Na- nm gerade der Maulbeerbaum es ist. Eine Rpyik_e ist doch eine viel edlere nvcht.


I used the -l deu option with Tesseract-OCR to select the German language data, which resulted in the following:


Ja. wer meiner Landsleute hat ihn nicht in Erinnerung. Manchmal fragte ich mich, wa- rum gerade der Maulbeerbaum es ist. Eine Aprikose ist doch eine viel edlere Frucht.


Of the three, Tesseract-OCR worked the best, making only one mistake: it interpreted the comma in the first line as a period. Therefore, I made Tesseract-OCR my tool of choice. This simple script uses that application to apply OCR to every scanned page:


for i in $(seq --format=%003.f 1 150); do
  echo "Doing OCR on page $i"
  tesseract prepared-$i.tif tesseract-$i -l eng   # use -l deu for German text
done


The result of that process is a bunch of text files that each represent the contents of one page.


Putting it all together


Before you create a consolidated document, you'll want to remove any page numbers that still exist in your text files. If they're located above the text, you can strip the first line from every text file that Tesseract-OCR produced:


for i in $(seq --format=%003.f 1 150); do
  tail -n +2 tesseract-$i.txt >text-$i.txt
done


If they are below the text, just use head -n -1 in the above script instead of tail -n +2. This causes the script to remove the last line and not the first.
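Both variants can be checked quickly on a throwaway file. A small demonstration with hypothetical page contents:

```shell
# Page number printed above the text: drop the first line
printf '17\nText of the page.\n' > /tmp/number-above.txt
tail -n +2 /tmp/number-above.txt    # prints: Text of the page.

# Page number printed below the text: drop the last line (GNU head)
printf 'Text of the page.\n17\n' > /tmp/number-below.txt
head -n -1 /tmp/number-below.txt    # prints: Text of the page.
```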


Finally, use cat text-*.txt >complete.txt to create one big file containing your whole book. Edit the resulting file and unhyphenate the whole text by replacing each combined occurrence of a hyphen and a line feed with an empty string. You can also remove unnecessary line feeds. In gedit, you can define your own tools and make them available via a keyboard shortcut. I defined the following tool to work on the current selection:


#!/bin/sh
# newlines to spaces
tr '\n' ' ' |
# collapse runs of whitespace into a single space
sed -E 's/[[:blank:]]{2,}/ /g'


With this, you can select some lines and press your defined shortcut. The whole selection becomes one line.
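The unhyphenation step can likewise be scripted instead of done by hand in the editor. A sketch using GNU sed to delete every hyphen that sits directly before a line feed; note that this also joins genuinely hyphenated words that happen to break at the line end, so skim the result afterwards:

```shell
# Join words split across line breaks: "Fru-" + newline + "cht" becomes "Frucht"
printf 'Eine viel edlere Fru-\ncht.\n' | sed ':a;N;$!ba;s/-\n//g'
# prints: Eine viel edlere Frucht.
```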


You now have one large document that represents the contents of the book. Consider reading the whole file again to find any typos that may be left, then move on to LaTeX to create a professional-looking Portable Document Format file from your scanned text.
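As a starting point for the typesetting step, a minimal LaTeX skeleton that pulls in the consolidated file (this assumes the text contains no characters LaTeX treats specially, such as %, &, or _; escape those first if present):

```latex
\documentclass[11pt]{book}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[german]{babel}  % for a German text; drop or change for other languages
\begin{document}
\input{complete.txt}        % the file produced by cat text-*.txt >complete.txt
\end{document}
```

Compile it with pdflatex to get the finished PDF.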



  • jeremy best Said:

    Can you please tell me what packages I need to scan to text in openSUSE 11.4? Tesseract doesn't seem to be available. Many thanks.

  • Nalin.x.Linux Said:

    Linux-intelligent-ocr-solution (Lios) is free and open source software for converting print into text using either a scanner or a camera. It can also produce text from scanned images from other sources, such as a PDF, an image, or a folder containing images. The program is fully accessible to the visually impaired. Lios is written in Python, and we release it under the GPL3 license. Lios works with Debian-based operating systems. There are a great many possibilities for this program, and feedback is the key to it. Expecting your feedback.

  • Bruce Martin Said:

    I have tried Tesseract and YAGF in Fedora 20 64-bit, and as soon as I ask it to load an image from anywhere, the image screen turns dusty black and YAGF closes. Both Cuneiform and Tesseract are installed.

  • BJH Said:

    What about documents that contain mainly text but also some image figures?

  • Bruce Martin Said:

    Documents that contain diverse contents must be handled with comparable diversity! Examples:

    1) Raster graphics: If the document contains raster graphics, you might capture these and paste them into GIMP, then post-process as needed, finally exporting the .PNG.

    2) Vectorial graphics, including CAD: If the document contains vectorial graphics, you need to export, or copy and paste, into a vectorial drawing application that accepts the format of the original graphic. Note: some of these formats are proprietary (such as AutoCAD .DWG) and cannot be accessed without a working copy of the proprietary software. In such cases one might have to take a screen grab of the graphic (assuming it is visible on the screen) and then redraw the item in an open format, such as .ODG using LibreOffice Draw, or LibreCAD. If the item is a 3D view, it might be necessary to approach this in a 2D application in the same manner as was used for 3D mechanical drawing representation before electronic CAD existed. If the item is skewed by way of its original shape, this may require doing double auxiliary projections in the 2D application.

    3) The finished document must then be reconstructed from these diverse parts, either by importation or by copy and paste, depending on the nature and size of each piece. If the end result is to be a presentation such as .ODP (LibreOffice Impress), the parts need to be recomposed in that application; however, if the content is to have vectorial groups to control the relative positioning of sub-groups of drawing objects, the grouping should first be done in LibreOffice Draw before exporting to .ODP or to an .ODT text document.

    4) Additionally, for .ODP presentations: LibreOffice Impress has a special dual-screen capability to support public speakers. Each slide has a corresponding field for comments, effectively functioning as a teleprompter. These text comments need to be pasted into those fields as appropriate to the sequence and timing of the intended performance. On stage, the presenter has his laptop so he sees that screen, and a video feed from the laptop (in one of several formats) sends the video to the projector(s); in large auditoriums, where the projection is built in and controlled by an operator at the side or in the back, the feed from the laptop goes to the operator, who takes it from there. If there is audio content to emanate from the presentation itself, that feed comes from the line output of the laptop (if colour coded, the colour will be pale green) and goes either to a separate sound system or to the operator in the case of a large auditorium. If the event is to have professional simultaneous translation, the feed will also have to be sent to the translation facility so the translators can hear it without distortion from sounds in the audience. If multilingual audio content is to be part of the presentation and subtitles are not used, that content would need to be in the .ODP, though I am not sure whether that kind of multitrack audio can be embedded in an .ODP. For text in a right-to-left alphabet (such as Arabic, Farsi, and other languages derived from ancient Aramaic), there is an add-on to LibreOffice that accommodates these scripts; following its installation, the appropriate fonts for that specific combination of alphabet and style will need to be loaded. Multiple fonts for a single right-to-left alphabet do exist, to be used as appropriate.
