Stroll through Sanlitun, a bustling neighbourhood in Beijing filled with tourists, karaoke bars and luxury shops, and you'll see plenty of people using the latest smartphones from Apple, Samsung and Xiaomi. Look closely, however, and you might notice some of them ignoring the touch screens on these devices in favour of something much more efficient and intuitive: their voice.
A growing number of China's 691 million smartphone users now regularly dispense with swipes, taps and tiny keyboards when looking things up on the country's most popular search engine, Baidu. China is an ideal place for voice interfaces to take off, because Chinese characters were hardly designed with tiny touch screens in mind. But people everywhere should benefit as Baidu advances speech technology and makes voice interfaces more practical and useful. That could make it easier for anyone to communicate with the machines around them.
"I see speech approaching a point where it could become so reliable that you can just use it and not even think about it," says Andrew Ng Yan-tak, Baidu's chief scientist and an associate professor at Stanford University, in the United States. "The best technology is often invisible and, as speech recognition becomes more reliable, I hope it will disappear into the background."
Voice interfaces have been a dream of technologists (not to mention science-fiction writers) for many decades. But in recent years, thanks to some impressive advances in machine learning, voice control has become a lot more practical.
No longer limited to just a small set of predetermined commands, it now works even in a noisy environment, such as the streets of Beijing or when you're speaking across a room. Voice-operated virtual assistants such as Apple's Siri, Microsoft's Cortana and Google Now come bundled with most smartphones, and newer devices, such as Amazon's Alexa, offer a simple way to look up information, cue up songs and build shopping lists with your voice. These systems are not perfect, sometimes mishearing and misinterpreting commands in comedic fashion, but they are improving steadily, and they offer a glimpse of a graceful future in which there's less need to learn a new interface for every new device.
Baidu is making particularly impressive progress, especially with the accuracy of its voice recognition, and it has the scale to advance conversational interfaces even further. The company - founded in 2000 as China's answer to Google, which is currently blocked in China - dominates the domestic search market, with 70 per cent of all queries. And it has evolved into a purveyor of many services, from music and movie streaming to banking and insurance.
A more efficient mobile interface would come as a big help in China. Smartphones are far more common than desktops or laptops, and yet browsing the web, sending messages and doing other tasks can be painfully slow and frustrating. There are thousands of Chinese characters, and although Pinyin allows them to be generated phonetically from Latin ones, many people (especially those aged over 50) do not know the system. It's also common in China to use messaging apps such as WeChat to perform all sorts of tasks, such as paying restaurant bills. And yet in many of the poorer regions, where there is perhaps more opportunity for the internet to have big social and economic effects, literacy levels are still low.
"It is a challenge and an opportunity," says Ng, who has been honoured for his work in artificial intelligence and robotics at Stanford University in the United States. "Rather than having to train people used to desktop computers in new behaviours appropriate for cellphones, many of them can learn the best ways to use a mobile device from the start."
Ng believes that voice may soon be reliable enough to be used for interacting with all sorts of devices. Robots or home appliances, for example, could be easier to deal with if you could simply talk to them. Baidu has research teams at its headquarters in Beijing and in California's Silicon Valley that are dedicated to advancing the accuracy of speech recognition and working to make computers better at parsing the meaning of sentences.
Jim Glass, a senior research scientist at the Massachusetts Institute of Technology who has been working on voice technology for the past few decades, agrees that the timing may finally be right for voice control.
"Speech has reached a tipping point in our society," he says. "In my experience, when people can talk to a device rather than via a remote control, they want to do that."
Last November, Baidu reached an important landmark with its voice technology, announcing that its Silicon Valley lab had developed a powerful speech recognition engine called Deep Speech 2. It consists of a very large, or "deep", neural network that learns to associate sounds with words and phrases as it is fed millions of examples of transcribed speech. Deep Speech 2 can recognise spoken words with stunning accuracy. In fact, the researchers found that it can sometimes transcribe snippets of Putonghua more accurately than a person.
Baidu's progress is all the more impressive because Putonghua is phonetically complex and uses tones that transform the meaning of a word. Deep Speech 2 is also striking because few of the researchers in the California lab where the technology was developed speak Putonghua, Cantonese or any other variant of Chinese. The engine essentially works as a universal speech system, learning English just as well when fed enough examples.
Most of the voice commands that Baidu's search engine hears today are simple queries - concerning tomorrow's weather or pollution levels, for example. For these, the system is usually impressively accurate. Increasingly, however, users are asking more complicated questions. To take them on, last year the company launched its own voice assistant, called Duer, as part of its main mobile app. Duer can help users find cinema screening times or book a table at a restaurant.
The big challenge for Baidu will be teaching its AI systems to understand and respond intelligently to more complicated spoken phrases. Eventually, Baidu would like for Duer to take part in a meaningful back-and-forth conversation, incorporating changing information into the discussion. To get there, a research group at Baidu's Beijing offices is devoted to improving the system that interprets users' queries. This involves using the kind of neural-network technology that Baidu has applied in voice recognition, but it also requires other tricks. And Baidu has hired a team to analyse the queries fed to Duer and correct mistakes, thus gradually training the system to perform better.
"In the future, I would love for us to be able to talk to all of our devices and have them understand us," Ng says. "I hope to someday have grandchildren who are mystified at how, back in 2016, if you were to say 'Hi' to your microwave oven, it would rudely sit there and ignore you."
MIT Technology Review
Technologies that will change our world in 2016
The MIT Technology Review magazine recently identified 10 breakthrough technologies - those that are most likely to solve a big problem and open up new opportunities - for 2016. One is conversational interfaces, as discussed in the Baidu article. Below are the other nine:
Immune engineering: the technology to produce and edit T cells, which play a crucial role in determining how effective a person's immune system is, is coming on in leaps and bounds.
Precise gene editing in plants: a new gene-editing method is providing a precise way to modify crops in the hope of increasing yields and making them more drought- and disease-resistant. The technology is known as CRISPR and already a lab in China has used it to create a fungus-resistant wheat; several groups in the mainland are using the technique on rice in efforts to boost yields; and a group in Britain has used it to tweak a gene in barley that helps govern seed germination, which could aid efforts to produce drought-resistant varieties.
Reusable rockets: rockets typically are destroyed on their maiden voyage, but now they can make an upright landing and be refuelled for another trip, setting the stage for a new era in spaceflight.
Robots that teach each other: this will be an important step in preparing robots to perform many of those arduous or dangerous tasks we'd like not to do ourselves, such as packing items in warehouses, assisting bedridden patients or aiding soldiers on the front lines.
DNA App store: United States company Helix is attempting to create the first "app store" for genetic information. Helix's idea is to collect a spit sample from anyone who buys a DNA app, sequence and analyse the customer's genes, and then digitise the findings. This will make genetic information available to consumers at an unprecedentedly low price.
Improved solar panels: at a time when conventional silicon-based solar panels from China have never been cheaper, the US is fighting back. American company SolarCity will soon begin producing solar panels with a technology that combines a standard crystalline-silicon solar cell with elements of a thin-film cell, along with a layer of a semiconductor oxide, making affordable panels that could have more than 22 per cent efficiency. Today's commodity silicon-based solar panels have efficiencies of between 16 and 18 per cent.
Slack: the intra-office messaging system, often described as the fastest-growing workplace software the world has ever seen, gives you a centralised place to communicate with colleagues through instant messages and in chat rooms. Slack funnels messages into streams that everyone who works together can see, allowing you to "overhear" what is going on in an organisation or group.
The Tesla autopilot: an "autopilot"-enabled Tesla car can manage its speed, steer within a lane, change lanes and park itself.
Power from the air: technology that lets gadgets work and communicate using only energy harvested from nearby television, radio, cellphone or Wi-Fi signals is fast becoming a commercial reality.