Talking to computers

Toby Howard

This article first appeared in Personal Computer World magazine, November 1997.

IT'S BEEN CALLED the ultimate human-computer interface. No fiddly peripherals, no mice, no typing - just you and your computer, both talking.

Natural speech communication is a hot research topic, and the last few years have seen immense progress in voice recognition and speech synthesis. On the one hand, software speech recognition systems like Dragon Dictate and IBM's VoiceType are now cheap and reliable; on the other, speech synthesis is getting better than ever.

If there's a cultural icon for natural speech communication, it's HAL from Stanley Kubrick's classic 2001: A Space Odyssey. Anyone who has seen 2001 is unlikely to forget HAL, with his urbane -- but slightly creepy -- voice. (It actually belonged to Canadian actor Douglas Rain, who recorded his dialogue in a single 10-hour session. That Rain's voice alone could fix the personality of HAL is all the more remarkable when you know that Kubrick showed him neither a complete script nor a single frame of the film.)

What made HAL special was that he understood what was said to him, and replied in conversational English. Although artificial machine intelligence of the kind demonstrated by HAL is still way beyond our reach, good synthetic speech has been with us for quite a while. To date, its main application has been as an assistive technology for people with disabilities, but as computing power continues to increase, there are signs that speech synthesis is moving into mainstream computing.

Talking machines were around long before computers. One of the earliest was built in 1779 by Christian Kratzenstein, in response to a challenge issued by the Imperial Academy of St Petersburg. Kratzenstein's machine, which only spoke vowels, used a set of resonating chambers activated by a vibrating reed. In 1791, Wolfgang von Kempelen of Vienna demonstrated a more sophisticated machine, which could produce vowels and consonants. Unfortunately, von Kempelen wasn't taken seriously, because of an earlier indiscretion involving wild claims for a chess-playing automaton of his invention. This mechanical marvel actually concealed a legless Polish chess expert, squeezed into a cabinet along with a bogus mechanism. Nevertheless, von Kempelen's speaking machine was reconstructed by the English physicist Sir Charles Wheatstone, and subsequently inspired Alexander Graham Bell to create an artificial speaking head.

The first attempts at electronic speech synthesis began with Homer Dudley's Voder, which was demonstrated at the 1939 and 1940 World's Fairs. This was an analogue machine, modelling the voice using oscillators and banks of filters. Although progress continued to be made with analogue devices, it was digital technology that revolutionised the field.

Today, there are two main approaches: formant synthesis and concatenative synthesis. Formant synthesis models the human vocal tract from scratch, generating the resonant frequencies, or formants, identified by analysing real speech. Concatenative synthesis constructs artificial speech by stringing together recorded samples of the phonemes, the basic sounds which make up speech.
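
To make the concatenative idea concrete, here is a minimal sketch in Python. It assumes one pre-recorded WAV file per phoneme, all in the same format; the filenames and the word being assembled are invented for illustration, and a real synthesiser would work with larger units such as diphones and smooth the joins rather than simply butting raw samples together.

    import wave

    # Concatenative synthesis in miniature: join pre-recorded phoneme
    # samples end to end to form a word. The phoneme files (e.g.
    # "k.wav", "ae.wav", "t.wav" for "cat") are hypothetical and must
    # all share the same sample rate and sample format.
    def concatenate_phonemes(phoneme_files, output_file):
        params = None
        frames = []
        for path in phoneme_files:
            with wave.open(path, "rb") as sample:
                if params is None:
                    params = sample.getparams()   # copy the format of the first sample
                frames.append(sample.readframes(sample.getnframes()))
        with wave.open(output_file, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    # A crude rendering of the word "cat":
    concatenate_phonemes(["k.wav", "ae.wav", "t.wav"], "cat.wav")

The result is intelligible at best: with nothing done to blend the joins, the seams between samples are clearly audible, which is part of why the output of naive systems sounds so robotic.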

Speech synthesis is a very hard problem, and it's astonishing that so much progress has been made so quickly, and found its way into common PC software. TextAssist, for example, which comes bundled with Creative Labs soundcards, uses Digital's DECtalk synthesis engine. Although New Yorker magazine once referred to an early version of DECtalk as having "the unmistakable tones of an inebriated Swede", its modern incarnation is much better. You can find samples of this and other systems at www.speechtoys.com/spchtoys/spsyn.html.

However, where all speech synthesis systems currently fall down is in their simulation of prosody -- the changes in rhythm, intonation and stress as we speak. Without taking prosody into account, even the best synthesised voice will sound like a Dalek.
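
As a rough illustration of what a prosody model has to supply, each word needs at least a pitch, a duration and a stress mark before it reaches the synthesiser. The figures below are invented for illustration and not taken from any real system; the point is the contrast between a flat, Dalek-like rendition and a more natural one.

    # Invented prosodic annotations for "the dog ran down the road":
    # (word, pitch in Hz, duration in ms, stress mark)
    flat_prosody = [
        ("the", 120, 150, 0), ("dog", 120, 150, 0), ("ran", 120, 150, 0),
        ("down", 120, 150, 0), ("the", 120, 150, 0), ("road", 120, 150, 0),
    ]

    natural_prosody = [
        ("the", 110, 100, 0),
        ("dog", 145, 260, 1),    # new information: higher pitch, longer, stressed
        ("ran", 125, 200, 0),
        ("down", 115, 150, 0),
        ("the", 105, 90, 0),
        ("road", 100, 280, 1),   # falling pitch and lengthening mark the end of the sentence
    ]

    for word, pitch, duration, stress in natural_prosody:
        print(f"{word:5s} {pitch:3d} Hz {duration:3d} ms {'stressed' if stress else ''}")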

For a system to automatically generate realistic prosody, it must have some understanding of the meaning of an utterance. According to John Local, a leading UK speech researcher and designer of the experimental YorkTalk system, this is the fundamental outstanding problem. "People's expectations of synthesised speech are too high", he says. "If it is to talk at all, they expect a machine to talk with the expression and understanding of another human".

Most current systems which read out arbitrary text, for example, have no innate understanding of what they are saying. According to Local, for speech synthesis to truly succeed, we must start with meaning. Rather than concentrating on text-to-speech synthesis, the future lies with concept-to-speech.

Consider a sentence which expresses a simple idea, like "The dog ran down the road". If this were spoken by a system which "understood" what it was saying, and which had access to a database of knowledge about the world, we might expect it to give different answers to different questions:

Q: What did the dog do?
A: He ran down the road.

Q: Was it a cat that ran down the road? 
A: No, it was the dog that ran down the road.

Q: Where did the dog run?
A: Down the road.

Q: Did the dog run across the park?
A: No, he ran down the road.

As well as understanding the concepts behind these assemblies of words, a concept-to-speech system must make the answers sound plausible, by understanding the correct intonations and emphasis for the words. No system can yet do this.
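
To see what "making the answers sound plausible" might involve, here is a purely hypothetical sketch in Python. It stores the fact behind "The dog ran down the road" as a handful of labelled slots and emphasises whichever slot a question challenges, with upper case standing in for prosodic stress. The slot names and the whole representation are invented for illustration; no working concept-to-speech system is anywhere near this simple.

    # A toy "fact" about the world, broken into labelled slots.
    FACT = {"agent": "the dog", "action": "ran", "path": "down the road"}

    def answer_with_focus(questioned_slot):
        """Build the answer, stressing the slot the question calls into doubt."""
        words = []
        for slot in ("agent", "action", "path"):
            value = FACT[slot]
            if slot == questioned_slot:
                words.append(value.upper())   # upper case stands in for emphasis
            else:
                words.append(value)
        return " ".join(words)

    # "Was it a cat that ran down the road?" challenges the agent:
    print(answer_with_focus("agent"))   # THE DOG ran down the road
    # "Did the dog run across the park?" challenges the path:
    print(answer_with_focus("path"))    # the dog ran DOWN THE ROAD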

Traditionally, most speech synthesis researchers have been engineers, not linguists. But the problem of successfully synthesising human speech is no longer one of engineering. The mechanics of speech production are well understood, and systems to simulate them work well. The problem now is to understand the structure of language itself -- a far taller order, and an exciting challenge which may yet bring us the ultimate human-computer interface.

Toby Howard teaches at the University of Manchester.