October 12, 2011

Project Gutenberg and You: Using Open Source to Contribute to PG

Michael S. Hart passed away in early September — you might not know his name, but you certainly know his work. Hart founded Project Gutenberg, the oldest and arguably the best free e-book library in the world, home to tens of thousands of titles in dozens of languages, all contributed by volunteers. Project Gutenberg is no doubt going to continue to thrive, but as a tribute to Hart, let's take a look at the free software tools you can use to participate, from scanning in an old book to helping the Project's ongoing proofreading and formatting work — and sharing PG books with others.

As a refresher, Project Gutenberg's (PG) ebook library consists of works that are in the public domain in the United States (where PG is based). With a few peculiar exceptions, public domain status means the work was published long ago, and PG requires that new ebook entries be "cleared," usually by verifying the date on an actual old copy of the book in question.

The practical upshot is that PG titles all originate as scans of old, printed volumes, which are run through optical character recognition (OCR) to convert the images to text, followed by human proofreading, formatting, markup, and, on some titles, translation. As you can imagine, that takes a lot of software.

OCR: Teaching the Computer to Read

OCR is always a tricky proposition: it requires computer vision, natural language processing, and a host of other disciplines working in tandem to pick out letters from an image file with any degree of reliability. There are several high-quality open OCR engines, most of which are designed to function as either libraries or CLI tools. In practice, you will want to scan images from a book using a GUI tool (to correct exposure, alignment, and everything else you can identify visually) that can call on one of the CLI OCR applications to perform the text conversion.

The current best-of-breed OCR engine is Tesseract, which was created at HP, released as open source in 2005, and has been developed with Google's sponsorship since 2006. Recent builds have added more scripts and languages to the corpus of what Tesseract can recognize, and better support for more image file formats. Other engines you may want to install include CuneiForm, Ocrad, and GOCR. It is probably a good idea to install all of the engines provided by your distro's package management system, because they may vary in performance from text to text (or font to font). If you have them all installed, you can process a couple of test pages with each before proceeding to batch-convert large sections of a book.
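One way to run that test-page comparison is to type a short reference passage by hand, run each engine on the same page image, and score each output against the reference. The sketch below is a minimal illustration in Python, not part of any PG tool. The function names are hypothetical, and the only engine invocation shown is the basic `tesseract IMAGE stdout` form accepted by recent Tesseract releases; the other engines would need their own command lines.

```python
import difflib
import shutil
import subprocess


def similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 similarity ratio, ignoring differences in whitespace runs."""
    ref = " ".join(reference.split())
    cand = " ".join(candidate.split())
    return difflib.SequenceMatcher(None, ref, cand).ratio()


def run_tesseract(image_path: str) -> str:
    """OCR one page image with the tesseract CLI, if it is installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def rank_outputs(reference: str, outputs: dict) -> list:
    """Sort (engine_name, score) pairs from best to worst match."""
    scores = {name: similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Collect each engine's text for the test page into a dictionary keyed by engine name, then call `rank_outputs(reference, outputs)` to see which engine copes best with that particular book's typeface. A character-level ratio is a crude measure, but it is usually enough to separate a good fit from a poor one.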

As for the GUI front-end and scanning software, OCRFeeder is the place to start. A GTK+-based application, it can scan directly from a scanner, or import images you scanned in another app (such as Skanlite, XSane, or Simple Scan). OCRFeeder can use any of the OCR back ends mentioned above, and can even do document layout recognition, which is necessary for multi-column texts and books with illustrations.

Once The Text Is In, Making Sense Of It

Even if your OCR output is 99% accurate, a human being must still proofread the electronic copy of the text to find where that troublesome one percent is hiding, and fix it. PG has two main approaches: the go-it-alone method and the curated Distributed Proofreaders (DP) project. DP is definitely the more rigorous, and accounts for a large percentage of PG's new ebooks, but for the sake of comparison, let's consider both. The proofreading and markup steps could prove important to any ebook project you work on, PG-bound or private.

If you take the go-it-alone route, you will need to proof your text for spelling and spacing problems (at least those of the kind spawned by OCR; you should not attempt to modernize the spelling of a very old book). Pulling the text into a word processor (or text editor) with spell-check for the language in question is a good idea, but PG fans have also developed an OCR-specific checking program called Gutcheck that may be a superior choice for your first pass. OCR tends to make different mistakes than human typists do, so word processors' spell-checkers often do not catch them.
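To make the difference between OCR mistakes and typing mistakes concrete, here is a minimal Python sketch of the kind of pattern-based check an OCR-aware tool can run. It is not Gutcheck's actual rule set: the patterns, the short list of misread words, and the function name are all assumptions chosen for illustration, and they will produce false positives (for example on "1st").

```python
import re

# Hypothetical checks for common OCR slips; NOT Gutcheck's real rules.
CHECKS = [
    # A digit read in place of a letter, as in "w0rst" or "qu1ck".
    ("letters mixed with digits",
     re.compile(r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*[0-9])[A-Za-z0-9]+\b")),
    # Stray space before punctuation, as in "times ,".
    ("space before punctuation", re.compile(r"[ \t][,;:!?]")),
    # Runs of spaces between words.
    ("repeated spaces", re.compile(r"(?<=\S) {2,}(?=\S)")),
    # A tiny illustrative list of misreadings that pass as letter strings.
    ("known scanno", re.compile(r"\b(?:tbe|tlie|wbich)\b")),
]


def find_suspects(text):
    """Yield (line_number, label, matched_text) for each suspicious spot."""
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            for match in pattern.finditer(line):
                yield number, label, match.group(0)
```

Running `find_suspects` over a page gives a list of places to inspect by eye. It only flags candidates and never rewrites the text, which keeps the decision about an old book's original spelling with the human proofreader.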

One of PG's principles is that books can best be preserved by using the simplest, most compatible file formats available, so all titles are made available as plain ASCII text (or in a suitably simple encoding that captures the correct accent marks if ASCII does not). But most readers prefer to use HTML. Many text editors can output simply-formatted text as HTML with zero effort, but if the peculiar formatting of your book makes that a problem, you can use an ebook editor like Sigil to clean up the output.
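For a book with plain formatting, the text-to-HTML step can be as small as wrapping each blank-line-separated paragraph in a paragraph tag. The Python sketch below assumes exactly that input convention. The function name and the page skeleton are illustrative choices, and PG's own HTML requirements are not reproduced here, so treat the output as a starting point for cleanup and validation.

```python
import html
import re


def text_to_html(text: str, title: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> tags inside a minimal page."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Rejoin hard-wrapped lines, then escape &, <, > and quotes.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join([
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        '<meta charset="utf-8">',
        "<title>" + html.escape(title) + "</title>",
        "</head>",
        "<body>",
        *paragraphs,
        "</body>",
        "</html>",
    ])
```

Headings, poetry, footnotes, and illustrations all need more than this, which is where an editor such as Sigil comes in.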

PG also recommends that all HTML be passed through the official W3C HTML Validator to find errors. You might also want to use GutenMark (a CLI app) or its graphical front-end GUItenMark to convert your ebook from plain text to nicely-formatted HTML.

These steps give you a rough idea of what it takes to polish a text for inclusion in the PG library, but a far better approach than doing it all yourself is to participate in the Distributed Proofreaders process, which leverages a large community of dedicated volunteers and has all of the kinks in the process smoothed out. DP breaks the proofreading and correction process into discrete "rounds," and offers a slick web-based tool for volunteers to do proofing one page at a time.

A volunteer manager oversees each book project, ensuring consistency. DP has proofing and formatting guides for the volunteers. The DP site has detailed instructions on how to get started; you can get a good feel for the workflow by reading the FAQ and even start flagging mistakes as an unregistered "smooth reader."

But Where Do the Books Come From?

All technical issues aside, getting a new book approved for inclusion in the PG library or DP project is an important part of the process that cannot be hurried through. It is important to both projects that quality controls be followed to verify the copyright status of book projects, and to make sure that two people do not start identical projects to digitize the same title.

Both projects provide guidelines to help. You can easily search for a book title and make sure that it is not already in the library, but if it is not, you should still contact the project to begin the copyright (and de-duplication) clearance process before you begin. PG explains its rules and provides contact information on its Copyright How-to page, and links to an online tool to help you verify a book's copyright status. DP also has a good landing page outlining how you should proceed to propose a new book project; it is called the Content Provider's FAQ.

David's In-Progress List keeps track of the ongoing book projects, which makes it easy to check your proposal against works that are not-quite-yet in the library. DP also maintains a list of partially-finished books that are missing pages — if you have a copy and can provide a scan or text of the missing page, you can help out a great deal. Finally, DP runs a web discussion forum about content sourcing, which is a great place to get the most up-to-date information and pick up tips on how to proceed.

But Wait, There's More!

Project Gutenberg has amassed an astonishing collection of literature, a lot of which is increasingly hard-to-find in print. But its influence goes further than that, with ripples that have created other projects also promoting literacy and open access to content. You might find one of them worth volunteering at as well.

PG has an official effort to burn and distribute periodic ebook collections on CD and DVD media, to help those who do not have constant Internet access. There is a separate PG effort underway to digitize sheet music — mostly classical chamber music, but a variety of styles and composers. There is a lot of overlap with PG ebook methods, but sheet music has its own special challenges, so volunteers with musical expertise are in great demand.

A close cousin to PG's ebook library is its audio books effort. There are several sources for audio recordings, most notably the LibriVox project, where human volunteers record works in their own voices. For a while PG was also adding computer-generated recordings, but due to a variety of technical problems, they are rarely as good as a human voice. The collection of PG audio books is far smaller than the electronic text library, so help is appreciated.

Finally, it is important to remember that PG is based in the United States, and focuses its efforts on works that are public domain in that jurisdiction. Other countries have different copyright laws, making the copyright status of recent internationally-published works difficult to verify. There are now PG-affiliated projects in several other countries, including Canada, Australia, Germany, and Norway. The national affiliate projects often focus on works in the native language(s) of the region, in addition to adhering to the appropriate copyright terms.

In a wider sense, Project Gutenberg (which was founded in 1971) is something of a precursor to many of the open data projects that are popular today, such as Wikipedia, the Internet Archive, and OpenStreetMap. They all take the crowd-sourced, volunteer-driven model that Hart made popular, and use it to provide free access to information, for everyone. That's a pretty good legacy.
