Chinese researchers say they’ve developed an AI text censor that is 91 per cent accurate
- They claim it could be useful to ‘identify and filter sensitive information from online news media’
- China’s internet is tightly controlled and the government relies on a huge army of censors to vet content
Traditional machine censors rely mainly on keyword matching to vet content and struggle to exceed 70 per cent accuracy, while AI technology – which needs to be trained by humans – has raised that to about 80 per cent in recent years.
The team from Shenyang Ligong University and the Chinese Academy of Sciences say their AI technology does not need to be trained by humans and “outperforms other approaches” to achieve more than 91 per cent accuracy.
It would be particularly useful to “identify and filter sensitive information from online news media”, lead researcher Li Shu and her colleagues wrote in a paper published in the Journal of Chinese Computer Systems on Monday.
But identifying sensitive words is a challenge for computers. Chinese is one of the most complex languages in the world, with nearly 10,000 characters. A sensitive word – gun, for example – can appear in a harmless context and trigger a false alarm, while illegal information can be posted online without using any sensitive words at all.
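The false-alarm problem described above can be seen in a minimal sketch of context-blind keyword matching. The word list and sample sentences below are hypothetical illustrations, not from the paper:

```python
SENSITIVE_WORDS = {"gun"}  # hypothetical keyword list

def keyword_censor(text: str) -> bool:
    """Flag text if any listed keyword appears, ignoring context entirely."""
    words = text.lower().split()
    return any(w.strip(".,!?") in SENSITIVE_WORDS for w in words)

# A harmless sentence still trips the filter because "gun" matches:
print(keyword_censor("The museum's antique gun collection opens today."))  # True
# And problematic text using no listed keyword slips through:
print(keyword_censor("Meet at the usual place to trade the items."))  # False
```

Both failure modes – the false positive and the miss – follow directly from matching surface strings with no notion of meaning.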
The Chinese government and internet companies have instead relied on a huge army of censors to manually vet online content, but it is too costly and inefficient to keep pace with the growth of information on China’s internet and social media.
Li, an associate professor of computer science at Shenyang Ligong University, said the technology developed by her team could keep up with the fast-evolving language used online in China, with a powerful dictionary containing not only sensitive words but their changing forms.
She said it could also read between the lines when searching for illegal content hidden in a different context, improving its ability to identify text written to bypass machine censors. Many internet users in China avoid sensitive words, instead using homophones or adding hyphens between characters to confound the censors.
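One simple defence against the hyphen trick is to normalise text before matching. This is a minimal sketch, not the team's published method; the separator set and the tiny variant map are hypothetical illustrations of how a dictionary of "changing forms" might be applied:

```python
import re

# Hypothetical map from evasive variant spellings to canonical forms.
VARIANT_MAP = {"r1fle": "rifle"}

def normalise(text: str) -> str:
    """Strip separators inserted between characters, then fold known variants."""
    text = text.lower()
    # Remove hyphens and similar separators used to break up a sensitive word.
    text = re.sub(r"[-_*.·]+", "", text)
    for variant, canonical in VARIANT_MAP.items():
        text = text.replace(variant, canonical)
    return text

print(normalise("g-u-n"))       # "gun" — now matches a plain keyword list
print(normalise("r1fle sale"))  # "rifle sale"
```

Homophone evasion is harder than this: a static map cannot keep up with newly coined substitutes, which is where the article's claim of a continually updated dictionary of changing forms comes in.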
Part of the team’s text censor technology came from Google, Li said. In 2018, Google released an open-source language model known as bidirectional encoder representations from transformers, or BERT, to help its search engine better understand users’ search terms. BERT can read a word in different contexts – such as “running a business” versus “running a marathon” – as a result of reading huge text databases including the entire Wikipedia site.
But BERT is not a censor by design and cannot process text longer than 512 tokens. To work around this, Li’s machine breaks a long text into segments, lets BERT read the shorter parts, and uses another AI tool to combine the results and assess them against the most up-to-date dictionary.
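The segment-then-combine step can be sketched as follows. This is an assumption-laden illustration of the general long-text workaround, not the team's actual pipeline: whitespace splitting stands in for BERT's WordPiece tokeniser, overlapping windows are one common way to avoid cutting a sensitive phrase in half, and max-pooling stands in for the second AI tool that combines per-segment results:

```python
MAX_TOKENS = 512  # BERT's input limit
STRIDE = 256      # overlap windows so no phrase is split across a boundary

def split_into_windows(tokens, max_tokens=MAX_TOKENS, stride=STRIDE):
    """Yield overlapping token windows, each short enough for the model."""
    for start in range(0, max(1, len(tokens) - stride), stride):
        yield tokens[start:start + max_tokens]

def censor_score(text, score_window):
    """Score every window with a model and keep the worst (highest) score."""
    tokens = text.split()  # stand-in for WordPiece tokenisation
    return max(score_window(w) for w in split_into_windows(tokens))

# Usage with a dummy scorer standing in for BERT:
flag = lambda window: 1.0 if "forbidden" in window else 0.0
long_text = " ".join(["word"] * 1000) + " forbidden"
print(censor_score(long_text, flag))  # 1.0 — caught despite being past token 512
```

Max-pooling is the natural combiner here: a text is sensitive if any of its segments is, so the highest per-window score decides the verdict.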
Google did not respond to a request for comment.
China is investing heavily in artificial intelligence and the technology is increasingly becoming part of everyday life in China – from e-commerce to public spaces where surveillance cameras are equipped with facial recognition, to military uses.