Speech Recognition HOWTO
Stephen Cook
Revision History
| Revision | Date | Revised by | Changes |
|---|---|---|---|
| v2.0 | April 19, 2002 | scc | Changed license information (now GFDL) and added a new publication. |
| v1.2 | February 5, 2002 | scc | Added more commercial software listings (sent by Mayur Patel). |
| v1.1 | October 5, 2001 | scc | Added info for Vocalis Speechware. Fixed/Updated various other items. |
| v1.0 | November 20, 2000 | scc | Added info on L&H and HTK. |
| v0.5 | September 13, 2000 | scc | Initial HOWTO Submission. |
- Table of Contents
- 1. Legal Notices
-
- 1.1. Copyright/License
- 1.2. Disclaimer
- 1.3. Trademarks
- 2. Foreword
-
- 2.1. About This Document
- 2.2. Acknowledgements
- 2.3. Comments/Updates/Feedback
- 2.4. ToDo
- 2.5. Revision History
- 3. Introduction
- 4. Hardware
-
- 4.1. Sound Cards
- 4.2. Microphones
- 4.3. Computers/Processors
- 5. Speech Recognition Software
-
- 5.1. Free Software
- 5.2. Commercial Software
- 6. Inside Speech Recognition
-
- 6.1. How Recognizers Work
- 6.2. Digital Audio Basics
- 7. Publications
1. Legal Notices
1.1. Copyright/License
This document is made available under the terms of the GNU Free Documentation License (GFDL), which is hereby incorporated by reference.
2. Foreword
2.1. About This Document
I started this document when I began researching what speech recognition software and development libraries were available for Linux. Automated Speech Recognition (ASR or just SR) on Linux is just starting to come into its own, and I hope this document gives it a push in the right direction - by supporting both users and developers of ASR technology.
I have left a variety of SR techniques out of this document, and instead I have focused on the "HOWTO" aspect (since this is a howto...). I have included a Publications section so the interested reader can find books and articles on anything not covered here. This is not meant to be a definitive statement of ASR on Linux.
For the most recent version of this document, check the LDP archive, or go to: http://www.gear21.com/speech/index.html.
2.2. Acknowledgements
I would like to thank the following people for their help with, review of, and support for this document:
-
Jessica Perry Hekman
-
Geoff Wexler
2.3. Comments/Updates/Feedback
If you have any comments, suggestions, revisions, updates, or just want to chat about ASR, please send me an email.
2.4. ToDo
The following things are left "to do":
-
Add descriptions in the Publications section.
-
Add more books to the Publications section.
-
Add more links with descriptions.
-
Enhance the description of the ASR system steps
-
Include descriptions of FFTs and Filters.
-
Include descriptions of DSP principles.
3. Introduction
3.1. Speech Recognition Basics
The following definitions are the basics needed for understanding speech recognition technology.
- Utterance
-
An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.
- Speaker Dependence
-
Speaker dependent systems are designed around a specific speaker. They generally are more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Speaker independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker independent systems and utilize training techniques to adapt to the speaker to increase their recognition accuracy.
- Vocabularies
-
Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word. Entries can be as long as a sentence or two. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e.g. "Wake Up"), while very large vocabularies can have a hundred thousand or more!
- Accuracy
-
The ability of a recognizer can be examined by measuring its accuracy - or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying if the spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more! The acceptable accuracy of a system really depends on the application.
- Training
-
Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy.
Training can also be used by speakers that have difficulty speaking, or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.
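Returning to the accuracy definition above, the following is a minimal sketch of how accuracy might be scored on a small test set, counting a correct rejection of an out-of-vocabulary utterance as a success. The function name `recognition_accuracy` and the use of `None` to mean "not in the vocabulary" or "rejected" are assumptions made for this illustration; they do not come from any particular ASR package.

```python
def recognition_accuracy(expected, recognized):
    """Return the fraction of test utterances that were handled correctly.

    expected[i] is the utterance actually spoken, or None if it was
    outside the vocabulary; recognized[i] is what the recognizer
    reported, or None if it rejected the input.
    """
    if len(expected) != len(recognized):
        raise ValueError("expected and recognized must have the same length")
    if not expected:
        raise ValueError("need at least one test utterance")
    # A match counts whether it is a correct recognition or a correct rejection.
    correct = sum(1 for e, r in zip(expected, recognized) if e == r)
    return correct / len(expected)


spoken = ["wake up", "open netscape", None, "call home"]
heard = ["wake up", "open netscape", None, "call work"]
print(recognition_accuracy(spoken, heard))  # 0.75
```

Here three of the four test utterances are handled correctly (including the rejected one), so the score is 0.75. Whether that is acceptable depends, as noted above, on the application.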
3.2. Types of Speech Recognition
- Isolated Words
-
Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. This doesn't mean that they accept only single words, but they do require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class.
- Connected Words
-
Connected word systems (or more correctly 'connected utterances') are similar to isolated word systems, but allow separate utterances to be 'run together' with a minimal pause between them.
- Continuous Speech
-
Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it's computer dictation.
- Spontaneous Speech
-
There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.
- Voice Verification/Identification
-
Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.
3.3. Uses and Applications
- Dictation
-
Dictation is the most common use for ASR systems today. This includes medical transcriptions, legal and business dictation, as well as general word processing. In some cases special vocabularies are used to increase the accuracy of the system.
- Command and Control
-
ASR systems that are designed to perform functions and actions on the system are defined as Command and Control systems. Utterances like "Open Netscape" and "Start a new xterm" will do just that.
- Telephony
-
Some PBX/Voice Mail systems allow callers to speak commands instead of pressing buttons to send specific tones.
- Wearables
-
Because inputs are limited for wearable devices, speaking is a natural possibility.
- Medical/Disabilities
-
Many people have difficulty typing due to physical limitations such as repetitive strain injuries (RSI), muscular dystrophy, and many others; ASR lets them enter text and commands by voice. It can help in other ways as well: for example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.
- Embedded Applications
-
Some newer cellular phones include C&C (Command and Control) speech recognition that allows utterances such as "Call Home". This could be a major factor in the future of ASR and Linux. Why can't I talk to my television yet?
4. Hardware
4.1. Sound Cards
Sound cards with the 'cleanest' A/D (analog-to-digital) conversion are recommended, but the clarity of the digital sample usually depends more on the microphone quality and even more on environmental noise. Electrical "noise" from monitors, PCI slots, hard drives, etc. is usually nothing compared to audible noise from computer fans, squeaking chairs, or heavy breathing.
Some ASR software packages may require a specific sound card. It's usually a good idea to stay away from specific hardware requirements, because it limits many of your possible future options and decisions. You'll have to weigh the benefits and costs if you are considering packages that require specific hardware to function properly.
4.3. Computers/Processors
Using a cluster (Beowulf or otherwise) to perform massive recognition efforts hasn't yet been undertaken. If you know of any project underway or in development, please send me a note!
5. Speech Recognition Software
5.1. Free Software
5.1.1. XVoice
This software is primarily for users. An RPM is available.
HomePage: http://www.compapp.dcu.ie/~tdoris/Xvoice/ http://www.zachary.com/creemer/xvoice.html
Project: http://xvoice.sourceforge.net
Community: http://www.onelist.com/community/xvoice
5.1.2. CVoiceControl/kVoiceControl
This software is primarily for users.
Homepage: http://www.kiecza.de/daniel/linux/index.html
Documents: http://www.kiecza.de/daniel/linux/cvoicecontrol/index.html
5.1.4. GVoice
This software is primarily for developers.
Homepage: http://www.cse.ogi.edu/~omega/gnome/gvoice/
5.1.6. CMU Sphinx
This software is primarily for developers.
Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html
Source: http://download.sourceforge.net/cmusphinx/sphinx2-0.1a.tar.gz
5.1.7. Ears
This software is primarily for developers.
FTP site: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/
5.1.8. NICO ANN Toolkit
This software is primarily for developers.
Its homepage: http://www.speech.kth.se/NICO/index.html
5.1.9. Myers' Hidden Markov Model Software
This software is primarily for developers.
Information is available at: http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html
5.1.10. Jialong He's Speech Recognition Research Tool
This software is primarily for developers.
More information is available at: http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html
5.1.11. More Free Software?
If you know of free software that isn't included in the above list, please send me a note. If you're in the mood, you can also tell me where to get a copy of the software and share any impressions you have of it. Thanks!
5.2. Commercial Software
5.2.1. IBM ViaVoice
IBM's commercial (non-free) product, IBM ViaVoice Dictation for Linux (available at http://www-4.ibm.com/software/speech/linux/dictation.html), performs very well, but has some sizeable system requirements compared to the more basic ASR systems (64M RAM and 233MHz Pentium). For the $59.95US price tag you also get an Andrea NC-8 microphone. It also allows multiple users (but I haven't tried it with multiple users, so if anyone has any experience please give me a shout). The package includes: documentation (PDF), Trainer, dictation system, and installation scripts. Support for additional Linux distributions based on 2.2 kernels is also available in the latest release.
The ASR SDK is available for free, and includes IBM's SMAPI, grammar API, documentation, and a variety of sample programs. The ViaVoice Run Time Kit provides an ASR engine and data files for dictation functions, and user utilities. The ViaVoice Command & Control Run Time Kit includes the ASR engine and data files for command and control functions, and user utilities. The SDK and Kits require 128M RAM and a Linux 2.2 or better kernel.
The SDKs and Kits are available for free at: http://www-4.ibm.com/software/speech/dev/sdk_linux.html
5.2.2. Vocalis Speechware
More information on Vocalis and Vocalis Speechware is available at: http://www.vocalisspeechware.com and http://www.vocalis.com.
5.2.7. Entropic
K.K. Chin advised me that the original developers of the HTK (the Speech, Vision and Robotics Group at Cambridge) are still providing support for it. There is also a "free" version available at: http://htk.eng.cam.ac.uk. Also note that Microsoft still owns the copyright to the current HTK code.
5.2.8. More Commercial Products
There are rumors of more commercial ASR products becoming available in the near future (including L&H). I talked with a couple of L&H representatives at Comdex 2000 (Vegas) and none of them could give me any information on a Linux release, or even say whether they planned on releasing any products for Linux. If you have any further information, please send me the details.
6. Inside Speech Recognition
6.1. How Recognizers Work
Most recognizers can be broken down into the following steps:
-
Audio recording and Utterance detection
-
Pre-Filtering (pre-emphasis, normalization, banding, etc.)
-
Framing and Windowing (chopping the data into a usable format)
-
Filtering (further filtering of each window/frame/freq. band)
-
Comparison and Matching (recognizing the utterance)
-
Action (Perform function associated with the recognized pattern)
Although each step seems simple, each one can involve a multitude of different (and sometimes completely opposite) techniques.
(1) Audio/Utterance Recording can be accomplished in a number of ways. Starting points can be found by comparing ambient audio levels (acoustic energy in some cases) with the sample just recorded. Endpoint detection is harder because speakers tend to leave "artifacts" including breathing/sighing, teeth chatters, and echoes.
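As a rough sketch of the energy comparison described for step (1), the code below treats the first few frames of a recording as ambient noise and marks a frame as speech when its energy rises well above that level. The function names, the frame length of 160 samples, and the threshold factor of 4.0 are all assumptions chosen for illustration. This naive approach makes no attempt to deal with breathing, echoes, or the other endpoint artifacts mentioned above.

```python
def frame_energy(samples):
    """Mean squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def detect_utterance(samples, frame_len=160, noise_frames=5, factor=4.0):
    """Return (start, end) sample indices of the detected utterance, or None.

    The first `noise_frames` frames are assumed to contain only ambient
    noise; a frame counts as speech when its energy exceeds `factor`
    times that ambient level.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if len(frames) <= noise_frames:
        return None
    ambient = sum(frame_energy(f) for f in frames[:noise_frames]) / noise_frames
    # Guard against a perfectly silent lead-in producing a zero threshold.
    threshold = max(ambient * factor, 1e-12)
    active = [i for i, f in enumerate(frames) if frame_energy(f) > threshold]
    if not active:
        return None
    # Start of the first loud frame, end of the last loud frame.
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

A real recognizer would typically smooth the decision over several frames and pad the detected region slightly so that quiet leading and trailing sounds are not clipped.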
(2) Pre-Filtering: is accomplished in a variety of ways, depending on other features of the recognition system. The most common methods are the "Bank-of-Filters" method which utilizes a series of audio filters to prepare the sample, and the Linear Predictive Coding method which uses a prediction function to calculate differences (errors). Different forms of spectral analysis are also used.
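Two of the simpler pre-filtering operations named in the step list, pre-emphasis and normalization, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the pre-emphasis coefficient of 0.97 is a commonly seen value that is assumed here rather than taken from any specific recognizer. The Bank-of-Filters and Linear Predictive Coding methods are considerably more involved and are not shown.

```python
def pre_emphasize(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1], which boosts high frequencies."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def normalize_peak(samples):
    """Scale samples so the largest absolute value is 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    return [s / peak for s in samples]
```

Normalizing the level means that the later comparison step is less sensitive to how loudly the speaker talked or how far they were from the microphone.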
(3) Framing/Windowing involves separating the sample data into specific sizes. This is often rolled into step 2 or step 4. This step also involves preparing the sample boundaries for analysis (removing edge clicks, etc.).
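One common way to do the framing and boundary preparation described in step (3) is to cut the signal into overlapping frames and multiply each frame by a tapering window, so that the abrupt frame edges do not show up as clicks in later analysis. The sketch below uses a Hamming window; the function names and the default sizes (400-sample frames every 160 samples, which would be 25 ms frames every 10 ms at an assumed 16 kHz sampling rate) are illustrative choices, not requirements.

```python
import math


def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, frame_len=400, hop=160):
    """Split samples into overlapping frames and taper each frame's edges."""
    window = hamming(frame_len)
    frames = []
    # Only full-length frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

Because consecutive frames overlap, samples that are attenuated near the edge of one frame fall near the middle of a neighboring frame, so little information is lost to the taper.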
(4) Additional Filtering is not always present. It is the final preparation for each window before comparison and matching. Often this consists of time alignment and normalization.
There are a huge number of techniques available for (5), Comparison and Matching. Most involve comparing the current window with known samples. There are methods that use Hidden Markov Models (HMM), frequency analysis, differential analysis, linear algebra techniques/shortcuts, spectral distortion, and time distortion methods. All these methods are used to generate a probability and accuracy match.
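To make the idea of comparing a sample with known samples concrete, here is a minimal sketch of one time distortion method, dynamic time warping (DTW), used for simple template matching. It is a deliberately simplified illustration: each frame is represented by a single number rather than a feature vector, the function names and the `templates` dictionary are hypothetical, and the result is a distance (lower is better) rather than a probability. It is not how any of the packages listed in this document are known to work, and HMM-based matching is not shown.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]


def best_match(sample, templates):
    """Return (label, distance) of the stored template closest to sample."""
    return min(((label, dtw_distance(sample, t)) for label, t in templates.items()),
               key=lambda pair: pair[1])
```

For example, with `templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}`, the slower rendition `[1, 1, 3, 4, 4, 3, 1]` is matched to "yes", because warping lets the repeated frames line up with the same template frame at no extra cost.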
(6) Actions can be just about anything the developer wants. *GRIN*
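For a Command and Control style system, step (6) often amounts to looking up the recognized utterance in a table of handlers. The sketch below shows that idea only; the handler functions are hypothetical placeholders that return a string instead of actually launching a program, and the `ACTIONS` table is an assumption made for this example.

```python
def open_browser():
    """Placeholder handler; a real system might launch a browser here."""
    return "starting browser"


def new_terminal():
    """Placeholder handler; a real system might launch an xterm here."""
    return "starting terminal"


# Maps each recognized utterance (lower case) to the function that handles it.
ACTIONS = {
    "open netscape": open_browser,
    "start a new xterm": new_terminal,
}


def perform_action(utterance):
    """Run the handler for a recognized utterance; return None if there is none."""
    handler = ACTIONS.get(utterance.strip().lower())
    if handler is None:
        return None
    return handler()
```

Keeping the mapping in a table means new commands can be added without touching the recognition code itself.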
7. Publications
If there is a publication that is not on this list that you think should be, please send me the information.
7.1. Books
-
"Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993. ISBN: 0130151572.
-
"How to Build a Speech Recognition Application". B. Balentine, D. Morgan, and W. Meisel. 1999. ISBN: 0967127815.
-
"Speech Recognition : Theory and C++ Implementation". C. Becchetti and L.P. Ricotti. 1999. ISBN: 0471977306.
-
"Applied Speech Technology". A. Syrdal, R. Bennett, S. Greenspan. 1994. ISBN: 0849394562.
-
"Speech Recognition : The Complete Practical Reference Guide". P. Foster, T. Schalk. 1993. ISBN: 0936648392.
-
"Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition". D. Jurafsky, J. Martin. 2000. ISBN: 0130950696.
-
"Discrete-Time Processing of Speech Signals (IEEE Press Classic Reissue)". J. Deller, J. Hansen, J. Proakis. 1999. ISBN: 0780353862.
-
"Statistical Methods for Speech Recognition (Language, Speech, and Communication)". F. Jelinek. 1999. ISBN: 0262100665.
-
"Digital Processing of Speech Signals" L. Rabiner, R. Schafer. 1978. ISBN: 0132136031
-
"Foundations of Statistical Natural Language Processing". C. Manning, H. Schutze. 1999. ISBN: 0262133601.
-
"Designing Effective Speech Interfaces". S. Weinschenk, D. T. Barker. 2000. ISBN: 0471375454.
For a very LARGE online bibliography, check the Institut Fur Phonetik: http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
7.2. Internet
- news:comp.speech
-
Newsgroup dedicated to computer and speech.
-
US: http://www.speech.cs.cmu.edu/comp.speech/
-
UK: http://svr-www.eng.cam.ac.uk/comp.speech/
-
Aus: http://www.speech.su.oz.au/comp.speech/
-
- news:comp.speech.users
-
Newsgroup dedicated to users of speech software.
-
http://www.speechtechnology.com/users/comp.speech.users.html
-
- news:comp.speech.research
-
Newsgroup dedicated to speech software and hardware research.
- news:comp.dsp
-
Newsgroup dedicated to digital signal processing.
- news:alt.sci.physics.acoustics
-
Newsgroup dedicated to the physics of sound.
- DDLinux Email List
-
Speech Recognition on Linux Mailing List.
-
Homepage: http://leb.net/ddlinux/
-
Archives: http://leb.net/pipermail/ddlinux/
-
- Linux Software Repository for speech applications
-
http://sunsite.uio.no/pub/linux/sound/apps/speech/
- Russ Wilcox's List of Speech Recognition Links
-
(excellent) http://www.tiac.net/users/rwilcox/speech.html
- Online Bibliography
-
Online Bibliography of Phonetics and Speech Technology Publications. http://www.informatik.uni-frankfurt.de/~ifb/bib_engl.html
- MIT's Spoken Language Systems Homepage
-
http://www.sls.lcs.mit.edu/sls/
- Oregon Graduate Institute
-
Center for Spoken Language Understanding at Oregon Graduate Institute. An excellent location for developers and researchers. http://cslu.cse.ogi.edu/
- IBM's ViaVoice Linux SDK
-
http://www-4.ibm.com/software/speech/dev/sdk_linux.html
- Mississippi State
-
Mississippi State Institute for Signal and Information Processing homepage with a large amount of useful information for developers. http://www.isip.msstate.edu/projects/speech/
- Speech Technology
-
ASR software and accessories. http://www.speechtechnology.com
- Speech Control
-
Speech Controlled Computer Systems. Microphones, headsets, and wireless products for ASR. http://www.speechcontrol.com
- Microphones.com
-
Microphones and accessories for ASR. http://www.microphones.com
- 21st Century Eloquence
-
"Speech Recognition Specialists." http://voicerecognition.com
- Computing Out Loud
-
Primarily for Windows users, but good info. http://www.out-loud.com
- Say I Can.com
-
"The Speech Recognition Information Source." http://www.sayican.com




