When Meghan Cruz says, “Hey, Alexa,” her Amazon smart speaker bursts to life, offering the kind of helpful response she now expects from her automated assistant.
With a few words in her breezy, North American West Coast accent, the Vancouver-based lab technician gets Alexa to tell her the temperature in Berlin (70 degrees Fahrenheit, or 21 degrees Celsius), the world’s most poisonous animal (a geography cone snail) and the square root of 128, which it offers to the ninth decimal place.
But when Andrea Moncada, a college student and fellow Vancouver resident who was raised in Colombia, in South America, says the same in her light Spanish accent, Alexa offers only a virtual shrug. Moncada also asks it to add a few numbers, and Alexa says sorry. She tells Alexa to turn the music off; instead, the volume turns up.
“People will tell me, ‘Your accent is good,’ but [Alexa] couldn’t understand anything,” Moncada says.
Amazon’s Alexa and Google’s Assistant are spearheading a voice-activated revolution, rapidly changing the way millions of people around the world learn new things and plan their lives.
But for people with accents, the artificially intelligent speakers can seem very different: inattentive, unresponsive, even isolating. For many people in the world, the wave of the future has a voice bias problem, and it’s leaving them behind.
To investigate the issue, The Washington Post newspaper teamed up with two research groups to study the smart speakers’ accent imbalance, testing thousands of voice commands dictated by more than 100 people across nearly 20 American cities. The systems, they found, show notable disparities in how well people from different parts of the US are understood when speaking English, with wider gaps still for people with non-American accents, such as Chinese and Indian accents.
People with accents from the American South, for instance, are 3 per cent less likely to get accurate responses from a Google Home device than those with accents from Western states. And Alexa understands Midwest accents 2 per cent less often than those from the East Coast.
People with non-American accents, however, face the biggest setbacks. In one English-language study that compared what Alexa thought it heard with what the test group actually said, speech from non-American speakers contained about 30 per cent more inaccuracies.
People who speak Spanish as a first language, for instance, are understood 6 per cent less often than people who grew up around California or Washington, where the tech giants are based.
“These systems are going to work best for white, highly educated, upper-middle-class Americans, probably from the West Coast, because that’s the group that’s had access to the technology from the very beginning,” says Rachael Tatman, a data scientist who has studied speech recognition and was not involved in the research.
At first, all accents are new and strange to voice-activated AI, including the accent some Americans think is no accent at all – the predominantly white, non-immigrant, non-regional dialect of TV newsreaders, which linguists call “broadcast English”.
The AI is taught to comprehend different accents, though, by processing data from lots and lots of voices, learning their patterns and forming clear bonds between phrases, words and sounds.
To learn different ways of speaking, the AI needs a diverse range of voices – and experts say it’s not getting them because too many of the people training, testing and working with the systems all sound the same. That means accents that are less common or prestigious end up more likely to be misunderstood, or met with silence or the dreaded, “Sorry, I didn’t get that.”
Tatman, who works at data-science company Kaggle, but says she is not speaking on the company’s behalf, says, “I worry we’re getting into a position where these tools are just more useful for some people than others.”
Company officials say the findings, while informal and limited, highlight how accents remain one of their key challenges – both in keeping today’s users happy and allowing them to expand their reach around the globe. The companies say they are devoting resources to train and test the systems on new languages and accents, including creating games to encourage more speech from voices in different dialects.
“The more we hear voices that follow certain speech patterns or have certain accents, the easier we find it to understand them. For Alexa, this is no different,” Amazon says in a statement. “As more people speak to Alexa, and with various accents, Alexa’s understanding will improve.”
Google says it “is recognised as a world leader” in natural language processing and other forms of voice AI. “We’ll continue to improve speech recognition for the Google Assistant as we expand our data sets,” the company says, also in a statement.
The researchers did not test other voice platforms, such as Apple’s Siri or Microsoft’s Cortana, which have far lower at-home adoption rates. The smart-speaker business in the US has been dominated by an Amazon-Google duopoly: their closest rival, Apple’s HomePod, controls only about 1 per cent of the market.
Nearly 100 million smart speakers will have been sold around the world by the end of the year, according to market-research firm Canalys. Alexa now speaks English, German, Japanese and, as of last month, French; Google’s Assistant speaks all those plus Italian and is on track to speak more than 30 languages by the end of the year.
The technology has progressed rapidly and is generally responsive: researchers say the overall accuracy rate for Chinese, Indian and Spanish accents when speaking English is at about 80 per cent. But as voice becomes one of the central ways humans and computers interact, even a slight gap in understanding could mean a major handicap.
That language divide could present a huge and hidden barrier to the systems that may one day form the bedrock of modern life. Now run-of-the-mill in kitchens and living rooms, the speakers are increasingly being used for relaying information, controlling devices and completing tasks in workplaces, schools, banks, hotels and hospitals.
The findings also back up a more anecdotal frustration among people who say they have been embarrassed by having to constantly repeat themselves to the speakers – or have chosen to abandon them altogether.
“When you’re in a social situation, you’re more reticent to use it because you think, ‘This thing isn’t going to understand me and people are going to make fun of me, or they’ll think I don’t speak that well,’” says Yago Doson, a 33-year-old marine biologist in California, who grew up in Barcelona and has spoken English for 13 years.
Doson says some of his friends do everything with their speakers, but he has resisted buying one because he has had too many bad experiences. He adds, “You feel like, ‘I’m never going to be able to do the same thing as this other person is doing, and it’s only because I have an accent.’”
Boosted by price cuts and Super Bowl ads, smart speakers like the Amazon Echo and Google Home have rapidly created a place for themselves in US daily life. One in five American households with Wi-Fi now has a smart speaker, up from one in 10 last year, according to media-measurement firm comScore.
The companies offer ways for people to calibrate the systems to their voices. But many speaker owners have still taken to YouTube to share their battles in conversation. In one viral video, an older Alexa user, pining for a Scottish folk song, is instead played the Black Eyed Peas.
Matt Mitchell, a comedy writer in Birmingham, Alabama, whose sketch about a drawling “Southern Alexa” has been viewed more than a million times, says he is inspired by his own daily tussles with the futuristic device.
When he asked about the Peaks of Otter, a famed stretch of the Blue Ridge Mountains, Alexa told him the water content in a pack of popular marshmallow candies called Peeps. “It was surprisingly more than I thought,” Mitchell says, with a laugh. “I learned two things instead of just one.”
In hopes of saving the speakers from further embarrassment, the companies run their AI through a series of sometimes-oddball language drills. Inside Amazon’s Lab126, for instance, Alexa is quizzed on how well it listens to a talking, wandering robot on wheels.
The teams that worked on The Washington Post accent study, however, took a more human approach.
Globalme, a language-localisation firm in Vancouver, asked testers across the US and Canada to say 70 preset commands, including “Start playing Queen,” “Add new appointment” and “How close am I to the nearest Walmart?”
The company grouped the video-recorded talks by accent, based on where the testers had grown up or spent most of their lives, and then assessed the devices’ responses for accuracy. The testers also offered other impressions: people with non-American accents, for instance, told Globalme that they thought the devices had to “think” for longer before responding to their requests.
The systems, they found, are more at home in some areas than others: Amazon’s does better with Southern and Eastern accents, while Google’s excels with those from the West and Midwest. One researcher suggests that might be related to how the systems sell, or don’t sell, in different parts of the US.
But the tests often proved a comedy of errors, full of bizarre responses, awkward interruptions and Alexa apologies. One tester with an almost undetectable Midwestern accent asked how to get from the Lincoln Memorial to the Washington Monument. Alexa told her, in a resoundingly chipper tone, that US$1 is worth 71 British pence.
When the devices did not understand accents, even their attempts to lighten the mood tended to add to the confusion. When one tester with a Spanish accent said, “OK, Google, what’s new?” the device responded, “What’s that? Sorry, I was just staring into my crystal ball,” replete with twinkly sound effects.
A second study, by voice-testing start-up Pulse Labs, asked people to read three different Post headlines – about President Donald Trump, China and the Winter Olympics – and then examined the raw data of what Alexa thought the people said.
The difference between those two strings of words, measured by what data scientists call “Levenshtein distance” – the minimum number of single-character edits needed to turn one string into the other – proved about 30 per cent greater for people with non-American accents than for native speakers, the researchers found.
People with nearly imperceptible accents, in the computerised mind of Alexa, often sounded like gobbledegook, with words like “bulldozed” coming across as “boulders” or “burritos”.
When a speaker with a British accent read one headline – “Trump bulldozed Fox News host, showing again why he likes phone interviews” – Alexa dreamed up a more imaginative story: “Trump bull diced a Fox News heist showing again why he likes pain and beads.”
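Levenshtein distance, the metric behind that 30 per cent figure, can be computed with a short dynamic-programming routine. This is a minimal illustrative sketch in Python, not the researchers’ own code:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev holds the distances between a[:i-1] and every prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))      # -> 3
print(levenshtein("bulldozed", "boulders"))  # -> 6
```

Turning “bulldozed” into “boulders”, for example, takes six single-character edits; the researchers compared such distances, summed over whole transcriptions, across accent groups.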
Non-American speech is often harder to train for, linguists and AI engineers say, because patterns bleed over between languages in distinct ways. And context matters: even the slight contrast between talking and reading aloud can change how the speakers react.
But the findings support other research that shows how a lack of diverse voice data can end up inadvertently contributing to discrimination. Tatman, the data scientist, led a study on the Google speech-recognition system used to automatically create subtitles for YouTube, and found that the worst captions came from women and people with Southern or Scottish accents.
It is not solely an American struggle. Gregory Diamos, a senior researcher at the Silicon Valley office of China’s search giant Baidu, says the company has faced its own challenges developing an AI that can comprehend the many regional Chinese dialects.
Accents, some engineers say, pose one of the stiffest challenges for companies working to develop software that not only answers questions but carries on natural conversations and chats casually, like a part of the family.
The companies’ new ambition is developing AI that does not just listen like a human but speaks like one, too – that is, imperfectly, with stilted phrases and awkward pauses. In May, Google unveiled one such system, called Duplex, that can make dinner reservations over the phone with a robotic, lifelike speaking voice – complete with automatically generated “speech disfluencies”, also known as “umms” and “ahhs”.
Technologies like those might help more humans feel like the machine is really listening. But in the meantime, people like Moncada, the Colombian-born college student, say they feel like they are self-consciously stuck in a strange middle ground: understood by people but seemingly alien to the machine.
“I’m a little sad about it,” Moncada says. “The device can do a lot of things ... It just can’t understand me.”
The Washington Post