I arrived in Pittsburgh late one night, after 24 hours of travel from Hong Kong, to visit my daughter. We needed a stiff drink, I told her as she picked me up from the airport.
She told her iPhone: "Find liquor stores near Pittsburgh Airport." Within seconds, a map appeared on the screen with a list of liquor stores nearest the airport.
Of course I had talked to machines before - calling directory enquiries and phone banking - but the ability of her smartphone to meet my need for alcohol made me realise that machines which recognise and respond to speech will only get smarter. It was something of a techno-epiphany.
In the rapidly developing technology of Automatic Speech Recognition (ASR), machines are "hearing" and understanding spoken language, and performing actions on verbal commands.
ASR actually predates the invention of the computer by 50 years: in the 1870s, Alexander Graham Bell, whose wife was deaf, experimented with transmitting speech. He had hoped to create a device that would transform a spoken word into a picture that a deaf person could see; that line of research eventually led to his invention of the telephone.
Conceptually, ASR is simply the machine-matching of sounds with words. By using models of the sounds of a language to build a library of words, speech can theoretically be matched with words. If the words can be fitted into a certain set of grammatical and syntactical rules, they can then be arranged into sentences, a process known as rule-based pattern recognition.
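The idea can be made concrete with a toy sketch. Here, a tiny hypothetical "library" maps phoneme sequences to words, and the programme greedily matches runs of sounds against it; the phoneme spellings and vocabulary are invented for illustration, not taken from any real ASR system.

```python
# Hypothetical phoneme-sequence library: sound chains mapped to words
LEXICON = {
    ("f", "ay", "n", "d"): "find",
    ("s", "t", "ao", "r", "z"): "stores",
    ("n", "ih", "r"): "near",
}

def match_words(phoneme_stream):
    """Greedily match the longest runs of phonemes against the lexicon."""
    words, i = [], 0
    while i < len(phoneme_stream):
        # Try the longest possible chain first, then shorter ones
        for length in range(len(phoneme_stream) - i, 0, -1):
            candidate = tuple(phoneme_stream[i:i + length])
            if candidate in LEXICON:
                words.append(LEXICON[candidate])
                i += length
                break
        else:
            i += 1  # no chain matched: skip this phoneme

    return words

print(match_words(["f", "ay", "n", "d", "s", "t", "ao", "r", "z"]))
# prints ['find', 'stores']
```

A real rule-based system would then check the matched words against grammatical rules before accepting the sentence; this sketch stops at the word-matching step.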
However, human language has infinite variety in the way sounds are made and strung together to form words. In any single language, accents, dialects and mannerisms vary from region to region and across social and economic groups, and can vastly change the way certain words or phrases are spoken. To encompass such numerous variations, state-of-the-art ASR systems are based on complex statistical methods known as Hidden Markov Models and neural networks.
These systems use probability and mathematical functions to determine which words you just spoke, based on the speech models in their inventory. The speech signal is broken down into segments called phonemes, some as short as a few hundredths of a second. The smallest elements of a language, phonemes are representations of the sounds we make and put together to form meaningful expressions. For example, the word "potato" is formed by a string of sounds: p / oh / t / ay / t / oh; for each sound, hundreds and sometimes thousands of phoneme models are constructed. An ASR program examines each phoneme and compares it with the phonemes before and after it. Each phoneme is like a link in a chain, and the completed chain is a word. The program attempts to match the speaker's segmented digital sounds with the phonemes that most likely surround them, drawing on a library of phonemes amassed from speech samples recorded from thousands of people. The system then determines what the user was probably saying, outputs it as the most likely text, and performs the action requested by the speaker.
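The probabilistic matching described above can be sketched in miniature. In this toy example, each word in a two-word vocabulary is a chain of phonemes, and a crude confusion model scores how likely each observed (possibly misheard) sound is under each chain; the recogniser picks the word with the highest total score. The phoneme spellings, vocabulary and probabilities are all invented for illustration, and a real system would use full Hidden Markov Models rather than this fixed-length comparison.

```python
import math

# Hypothetical vocabulary: words as chains of phonemes
WORDS = {
    "potato": ["p", "oh", "t", "ay", "t", "oh"],
    "tomato": ["t", "oh", "m", "ay", "t", "oh"],
}

def phoneme_likelihood(observed, true):
    """Crude confusion model: P(observed sound | true sound)."""
    return 0.8 if observed == true else 0.05

def score(observed_phonemes, word):
    """Log-probability of the observed sounds under one word's chain."""
    chain = WORDS[word]
    if len(chain) != len(observed_phonemes):
        return float("-inf")
    # Summing log-probabilities is the same as multiplying probabilities
    return sum(math.log(phoneme_likelihood(o, t))
               for o, t in zip(observed_phonemes, chain))

def recognise(observed_phonemes):
    """Return the vocabulary word with the highest score."""
    return max(WORDS, key=lambda w: score(observed_phonemes, w))

# A noisy utterance of "potato" with one phoneme misheard as "d"
print(recognise(["p", "oh", "t", "ay", "d", "oh"]))  # prints potato
```

Even with one sound misheard, the chain for "potato" remains the more probable explanation, which is exactly how a statistical recogniser tolerates noise that would defeat exact rule-based matching.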
While speech recognition was used initially in simple query-response customer service transactions, its applications are multiplying fast. Your smartphone, if you have the newest model, may have ASR, since speech-to-text dictation and voice-activated dialling are becoming standard features. In mobile phones, voice is already the primary medium of communication, so using voice to input commands, rather than the tiny keyboard, is the logical next step. In the early applications, memory was limited, so most of these phones could recognise only around 10 names at a time. As memory and processing power increased, so did recognition capability.
ASR is proving especially useful in situations where it is difficult to input information by typing on a keyboard because your hands are otherwise occupied. For instance, a surgeon who is operating on a patient can use a medical ASR system to make a text record of his remarks as he performs the surgery. Speech recognition in fighter aircraft reduces the workload of pilots so that they can focus on other mission-critical functions. Such programs have been operating in American, French and European fighter jets. The applications include setting radio frequencies, commanding the autopilot system, setting steer-point co-ordinates and weapons release parameters, and controlling flight display.
Even more complex is the recognition and translation of speech from one language to another. The US Defence Advanced Research Projects Agency, where the internet was developed, has researchers working on Global Autonomous Language Exploitation, a program that will take in streams of information from foreign news broadcasts and newspapers, process the speech using foreign-language ASR, and translate the speech into English. It aims to create software that can instantly translate between two languages with at least 90 per cent accuracy.
It is highly unlikely that any ASR system will achieve full accuracy. It would have to overcome obstacles like slang, accents, background noise, sniffles from a cold and a myriad other potential disruptions. The different grammatical structures in languages also pose a problem. One key area of research is to improve the robustness of speech recognition so that extraneous noise and other disturbances do not affect performance.
In the future, speech recognition may become speech understanding. The statistical models that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words and attain true artificial intelligence. We can talk to our computers today. Tomorrow, they may well talk back and go their own way, as the computer HAL did in the movie 2001: A Space Odyssey. But unlike HAL, which ignored commands and killed its human masters, our computers, we hope, will still do what we tell them to do.
Tom Yam is a Hong Kong-based management consultant. He holds an electrical engineering doctorate and an MBA from the Wharton School of the University of Pennsylvania.