This month we will look at the technologies of speech input, speech output, and Natural Language Processing (NLP). These are very large topics, and each could easily expand to fill several columns, but I must be extra brief this month as job changes and a personal crisis have greatly reduced my available time. I apologize to my readers and will endeavor to return to more in-depth reporting in future columns.
An excellent source of information on speech technology is the comp.speech newsgroup. Several web sites mirror the comp.speech FAQ, although many are a year or two old (http://www.speech.cs.cmu.edu/comp.speech). The University of California Santa Cruz (UCSC) Perceptual Science Laboratory (PSL) web pages offer links to these and to many other good speech research sites (http://mambo.ucsc.edu/psl/speech.html). A 1996 "Survey of the State of the Art in Human Language Technology" provides excellent background and is available online at http://cslu.cse.ogi.edu/HLTsurvey. Speech recognition and synthesis are two important ‘assistive’ technologies, heavily used by persons with disabilities. Many sites dedicated to this noble cause provide a good collection of links and other information (http://www.ataccess.org). TechTALKS is a Miller Freeman exposition and conference on speech and linguistic technologies that will be held June 8-9 in Boston, MA, USA (http://www.techtalks.com).
Computer recognition of speech has been commercially available for many years. I recall a system used to control a model train at the GE Special Purpose Computer Center where I worked while an undergraduate in 1978. The basic approach is to digitize the audio signal and then attempt to match it against some sort of dictionary. A system with a small vocabulary, such as a trivial command and control (C2) system, does not need to be very sophisticated in its parsing of the input signal: each silence-separated utterance can be checked against the dictionary in an exhaustive search. Systems with larger vocabularies need to break the input signal into smaller chunks and attempt to match phonemes as well as words. More sophisticated search techniques develop candidate matches based on the probabilities of phonemes in the vocabulary and on matches for other phonemes in the input stream. Higher-level natural language techniques can also be used to prune the search tree using the dialog context. C2 systems can generally be made speaker independent more easily than continuous speech (dictation) systems. The latter are not as germane to virtual environments, but I certainly would like to try one out for writing these columns! I understand that with training, the dictation systems can be fairly good. There are several commercial toolkits available for C2 speech recognition; the IBM ViaVoice toolkit seems to be one of the most common in the VR environments I have seen.
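The exhaustive dictionary search described above can be sketched in a few lines. In this toy example, invented one-dimensional "feature" sequences stand in for real acoustic frames, and each dictionary template is scored against the utterance with dynamic time warping (DTW), a classic technique for small-vocabulary template matching:

```python
# Toy exhaustive-search recognition for a small C2 vocabulary.
# Feature values here are invented; a real system would compare
# multi-dimensional acoustic features, not single numbers.

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of a
                                 cost[i][j - 1],      # skip a frame of b
                                 cost[i - 1][j - 1])  # match frames
    return cost[len(a)][len(b)]

def recognize(utterance, templates):
    """Exhaustive search: return the dictionary word with the best score."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Toy templates standing in for stored reference utterances.
templates = {
    "stop": [1.0, 3.0, 1.0],
    "go":   [5.0, 5.0, 2.0],
}
print(recognize([1.1, 2.9, 1.2, 1.0], templates))  # closer to "stop"
```

Because every utterance is scored against every template, this only scales to small vocabularies, which is exactly why larger systems move to phoneme-level matching and pruned searches.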
The quality of the original input signal is one of the most variable aspects of speech recognition, and it is critical to good results. A good microphone is essential, and several vendors make headset mikes specifically for recognition systems. Some consumer and business speech recognition products include decent headset microphones. Andrea Electronics Corporation (http://www.andreaelectronics.com) is one of the leaders in this market. Their patented Digital Super Directional Array (DSDA) provides a "far field" microphone system for speech recognition; they even have one model designed for use in automobiles.
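Andrea does not publish the internals of DSDA, but the classic starting point for any far-field microphone array is delay-and-sum beamforming: delay each microphone's signal so that sound from the desired direction lines up, then average. The two-channel signals and one-sample delay below are invented for illustration:

```python
# Sketch of delay-and-sum beamforming. A pulse arriving from the
# steered direction adds coherently across microphones, while
# uncorrelated noise averages down.

def delay_and_sum(channels, delays):
    """Average the channels after shifting each by its steering delay."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(n)]

# Two mics hear the same pulse one sample apart, plus uncorrelated noise.
mic1 = [0.0, 1.0, 0.0, 0.1]
mic2 = [0.1, 0.0, 1.0, 0.0]
print(delay_and_sum([mic1, mic2], delays=[0, 1]))  # pulse reinforced at index 1
```

A real array adapts its delays (and more elaborate filters) to steer toward the talker, but the coherent-addition idea is the same.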
Unfortunately, even the best microphone cannot help in a very noisy environment. Several government and commercial field trials of speech recognition (battle exercises, windy border patrols) have failed because of environmental noise and stress-induced changes in voices. One interesting solution to this is to use a multi-modal (sensor fusion) recognition system: augment the audio stream with video lip reading. Michael Chan of the Rockwell Science Center recently published several papers on this topic (http://hci.rsc.rockwell.com/BiSpeech); some pictures of his system accompany this article. Other researchers have been working on lip reading systems as well, and the UCSC PSL folks maintain a web links page on "Speech Reading" at http://mambo.ucsc.edu/psl/lipr.html. The Auditory-Visual Speech Processing conference will be held August 7-9, 1999 at the University of California Santa Cruz, USA (http://mambo.ucsc.edu/avsp99).
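One common way to fuse the two streams is at the decision level: combine per-word scores from the audio recognizer and the visual (lip reading) recognizer as a weighted sum of log-probabilities, so a strong visual cue can override noisy audio. The words, probabilities, and weight below are all invented; real fusion systems also adapt the weight to the measured noise level:

```python
# Toy decision-level (late) sensor fusion of audio and visual scores.
import math

def fuse(audio_probs, visual_probs, audio_weight=0.7):
    """Weighted log-linear combination of two modality scores."""
    fused = {}
    for word in audio_probs:
        fused[word] = (audio_weight * math.log(audio_probs[word]) +
                       (1 - audio_weight) * math.log(visual_probs[word]))
    return max(fused, key=fused.get)

audio = {"yes": 0.4, "no": 0.6}    # noisy audio slightly favors "no"
visual = {"yes": 0.9, "no": 0.1}   # lip shape strongly favors "yes"
print(fuse(audio, visual))         # visual evidence wins: "yes"
```

With the weight set to 1.0 (audio only) the noisy channel would pick "no"; the visual channel rescues the decision.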
UPDATE January 2000: The CMU Sphinx speech recognition tool has been released as open source (http://www.speech.cs.cmu.edu/sphinx/Sphinx.html).
Speech synthesis has also been around for quite a while, although I don’t know that there have been such large leaps in quality or cost. There are two main approaches to computer generated speech. The older approach, typified by Compaq’s DECtalk hardware/software, uses synthetic models of the human vocal tract augmented by rules derived from phonetic theories and acoustic analyses. This approach can be tuned to provide high quality, intelligible speech that sounds like a man, woman, or child. The other, simpler approach is to record and play back human speech. Many systems have been built using recorded words, phonemes, and diphones. More advanced systems provide special bridging software that interpolates between the smaller samples to produce smoother sounding speech. These recorded sound synthesis systems can sound really good during demonstrations because the text is carefully chosen to match the recorded vocabulary; they are also much harder to tune for different characterizations (male, female, etc.). Although the intelligibility of both approaches may be good, they rarely sound ‘natural’. This is because the tonality, rhythm, emphasis, and general expressiveness of human speech are missing. It is possible to improve this by providing a higher level context derived either from human direction and script annotation or from an advanced natural language system that monitors the overall discourse.
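The "bridging software" idea can be sketched as a simple crossfade between recorded units. The unit sample values and overlap length below are invented, and a real system would interpolate pitch and spectral parameters rather than raw samples, but the smoothing principle is the same:

```python
# Toy concatenative synthesis: join recorded units with a short
# linear crossfade so the splice points are less audible.
# Units must be longer than the overlap for this sketch to work.

def crossfade_concat(units, overlap=2):
    """Join units, linearly interpolating across `overlap` samples."""
    out = list(units[0])
    for unit in units[1:]:
        for k in range(overlap):
            w = (k + 1) / (overlap + 1)  # fade-in weight for the new unit
            out[-overlap + k] = (1 - w) * out[-overlap + k] + w * unit[k]
        out.extend(unit[overlap:])
    return out

# Toy "recorded" units standing in for diphone waveforms.
result = crossfade_concat([[0.1, 0.3, 0.5], [0.4, 0.6, 0.2]])
print(result)
```

Hard concatenation would simply append the second unit, producing an audible discontinuity; the crossfade blends the last samples of one unit into the first samples of the next.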
Speech technology has seen major leaps of improvement in the last few years in both quality and cost of systems. Much of this is driven by the new uses in computer telephony, the next generation after "Press One for Frustration, Two for…" A few years ago some leading edge companies such as AT&T started using speech recognition to allow "Press or Say One for…". This has progressed to the current commercial offerings of complete dialog and discourse management for automated call centers. These couple speech recognition and generation with natural language processing for fairly sophisticated and effective applications within a limited scope. Natural Language Processing (NLP) for speech systems generally divides into two related areas. Dialog management concentrates on pairs of input and response phrases; it can provide local context for recognition and for controlling the naturalness of generation. Discourse management provides an extended history of a conversation, allowing for the use of pronouns and more elaborate recognition/generation modeling. Stephanie S. Everett at the US Naval Research Laboratory (NRL) has been applying this technology directly to virtual environments (http://www.aic.nrl.navy.mil/~severett).
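The dialog/discourse split can be illustrated with a toy manager that keeps one piece of conversational history, the last entity mentioned, so that a pronoun in the next utterance can be resolved. All of the patterns, verbs, and responses below are invented for illustration; real systems use far richer dialog models:

```python
# Toy dialog manager with a one-slot discourse history.

class DiscourseManager:
    VERBS = {"open": "Opening", "close": "Closing"}
    ENTITIES = ("door", "window")

    def __init__(self):
        self.last_entity = None  # discourse history: last object mentioned

    def respond(self, utterance):
        words = utterance.lower().split()
        # Resolve the pronoun "it" against the discourse history.
        words = [self.last_entity if w == "it" and self.last_entity else w
                 for w in words]
        verb = next((w for w in words if w in self.VERBS), None)
        entity = next((w for w in words if w in self.ENTITIES), None)
        if verb and entity:
            self.last_entity = entity  # remember for later pronouns
            return f"{self.VERBS[verb]} the {entity}."
        return "Please rephrase."

dm = DiscourseManager()
print(dm.respond("open the door"))  # "Opening the door."
print(dm.respond("close it"))       # "it" resolves to the door
```

Without the stored history, "close it" would be unresolvable; with it, the manager can carry context across turns, which is the essential difference between per-exchange dialog management and conversation-long discourse management.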
The Microsoft Speech API (SAPI) (http://microsoft.com/iit) provides a C++, COM or ActiveX interface to recognition and generation engines on Microsoft platforms. Version 4.0a was released in February of this year. Although Microsoft’s Intelligent Interface Technology group provides text-to-speech and continuous speech recognition engines in SAPI 4.0, third party engines are also supported.
Sun has developed the Java Speech API to provide a cross-platform interface to command and control recognizers, dictation systems, and speech synthesizers. They have also developed the Java Speech Grammar Format (JSGF) to provide cross-platform control of speech recognizers, and are developing the Java Speech Markup Language (JSML) to annotate text input to Java Speech API speech synthesizers (http://java.sun.com/products/java-media/speech).
IBM has developed SpeechML, an XML based markup language "for building network-based conversational applications. A conversational application is an application that interacts with the user through spoken input and output. A network-based application refers to one in which the elements of the conversation that define spoken output and input - Speech Markup Language documents - may be obtained over the network." (http://www.alphaWorks.ibm.com/tech/sml) The available implementation builds on both JSML and JSGF.
SpeechWorks (http://www.speechworks.com) provides a recognition engine and a sophisticated discourse management system, primarily for telephone based transactional services such as reservations, stock quotes, or order entry.
Nuance Communications (http://www.nuance.com) is another provider of speech recognition software for call center automation.