Speak to me, Linux


Author: JT Smith

Voice control is the next step in human interaction with computers. Voice recognition, and its flip side, speech synthesis, can help you streamline your day-to-day work and organize your Linux desktop in a better way.

To begin conversing with your Linux desktop, download the Sphinx-2 speech recognition engine and the Festival text-to-speech application. Although the CMU Sphinx Group provides several versions of Sphinx (Sphinx-2, -3, and -4), I use only Sphinx-2, as it is the fastest. It is not as accurate as Sphinx-3 or Sphinx-4, but it runs in real time, and therefore works well with live applications.

Installing Sphinx-2 and Festival should be trivial; most distributions already have binaries, and even compiling from source should not be difficult. Debian users might find Festival a little tricky to install if they have an onboard sound card with an AC97 codec. (The symptom is speech that sounds twice as fast as it should, no matter what speed you set; unfortunately I couldn’t find any solution except changing the sound card.)

Happily, the normal desktop user will not have to learn Festival’s command-line interface, as great applications such as KDE Text-to-Speech System (KTTS) and Perlbox Voice fill this gap. KDE 3.4 will talk to you via Festival, Festival Lite (flite), or FreeTTS (another free speech synthesizer, written in Java), in a multitude of languages and accents. If you want to use KTTS with your present KDE desktop, sources as well as binaries for Debian, SUSE, and Mandrake are available at KTTS’s home page.
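If you do want to script Festival yourself, the following is a minimal sketch in Python, assuming the festival binary is on your PATH. The helper names festival_command and speak are hypothetical and chosen for illustration; the only part that belongs to Festival is the --tts option, which makes it read text from standard input and speak it.

```python
import shutil
import subprocess


def festival_command():
    """Return the argument list that runs Festival in text-to-speech mode."""
    return ["festival", "--tts"]


def speak(text):
    """Pipe text to Festival.

    Returns True if Festival was run, False if the binary is not installed.
    """
    if shutil.which("festival") is None:
        return False
    # Festival reads the text to speak from standard input.
    subprocess.run(festival_command(), input=text.encode("utf-8"), check=True)
    return True
```

Calling speak("Hello from Linux") should then say the sentence aloud on a machine where Festival and a working sound card are present.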

The KTTS Interface

KTTS works by sending the text to be spoken via DCOP to the KTTS daemon. KTTS can read aloud pop-ups from KNotify, Web sites, or any other text. You can open Konqueror, navigate to a Web site, select Tools, and choose Speak Text. If you haven’t highlighted anything, KTTS will read the whole page.

In KDE Text-to-Speech Manager (kttsmgr) you can manage your languages, your speech engine, and what your computer reads for you. Your computer can act as your private secretary and read your email messages while you work in other applications. You can use multiple languages, which is useful if, for example, you are a native German speaker but need to have an English Web page read to you. If you need voices for languages other than English, take a look at the MBROLA Project. MBROLA tries to collect speech synthesis voices for as many languages as possible, and provides them free for non-commercial applications. You can use MBROLA voices with Festival.

By adding Sphinx-2 and Perlbox Voice to Festival, you can make your computer listen to what you tell it and act accordingly. Perlbox Voice provides a transparent interface to several open source speech systems, but it is mainly an easy-to-use front end to Sphinx-2, built with Perl and Tk; you’ll need to have both installed before you install Perlbox. Perlbox comes with a Perl installation script, which copies the files to the right locations. When that completes, fire up the application with the perlbox-voice command.

Perlbox Interface

To get your Linux desktop to perform an action, write in one Perlbox box the magic words to invoke the action, and in another what the computer is expected to do when it hears them. For example, in the “When you say” box you write “Web” and in the “Computer does” box, Konqueror. Then start Perlbox’s listener via the Control tab and say “Web”; Konqueror will start.
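The idea behind the two boxes can be sketched in a few lines of Python. This is an illustration of the phrase-to-command mapping only, not Perlbox’s actual code (Perlbox is written in Perl); the names COMMANDS, command_for, and handle are hypothetical, and the recognized phrase is assumed to arrive as plain text from the speech engine.

```python
import shlex
import subprocess

# Hypothetical table: "When you say" phrase -> "Computer does" command.
COMMANDS = {
    "web": "konqueror",
}


def command_for(phrase, table=COMMANDS):
    """Return the argument list mapped to a phrase, or None if it is unknown."""
    command = table.get(phrase.strip().lower())
    return shlex.split(command) if command else None


def handle(phrase, table=COMMANDS):
    """Launch the program mapped to the phrase; return True if one was started."""
    argv = command_for(phrase, table)
    if argv is None:
        return False
    # Start the program without waiting for it to exit.
    subprocess.Popen(argv)
    return True
```

With this table, handle("Web") would try to start Konqueror, while an unknown phrase such as "mail" is simply ignored.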

Perlbox comes with a KDE plug-in that allows you to change from one desktop to another, invoke the K Menu, refresh the desktop, and more, all via voice commands. You can extend the plug-in with more commands by modifying a single file. If you use another desktop environment, Perlbox comes with good documentation on how to build your own plug-ins.

In noisy environments, you can set a magic word that must be spoken to activate Perlbox, which keeps the application from taking actions when it mishears random words.
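A magic word amounts to a small gate in front of the command handler. The sketch below, in Python, shows one way such a gate could work; it is not taken from Perlbox, the class name MagicWordGate is hypothetical, and the choice to allow exactly one command per activation is an assumption made for the example.

```python
class MagicWordGate:
    """Pass phrases through only after the magic word has been heard."""

    def __init__(self, magic_word):
        self.magic_word = magic_word.strip().lower()
        self.armed = False

    def filter(self, phrase):
        """Return the phrase if it should be acted on, otherwise None."""
        phrase = phrase.strip().lower()
        if not self.armed:
            # While idle, ignore everything except the magic word itself.
            self.armed = phrase == self.magic_word
            return None
        # Assumption: one command per activation, so disarm again here.
        self.armed = False
        return phrase
```

With the magic word set to "computer", a stray "web" picked up from background noise is dropped, while "computer" followed by "web" lets the second phrase through.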

The ease of using these speech engines and speech recognition systems could make Linux the preferred OS for the visually impaired. Open licenses allow rapid improvement and development. And it’s just a lot of fun to talk to your closest co-worker: your computer.