Chinese Covid-19 genetic data in US archive was removed in June 2020, virologist finds
- The sequences, now recovered, add to the limited genetic data available from the early phase of the pandemic
- Data had been removed at the request of the scientist who submitted it, but details of the sequences are included in a published paper by Wuhan team
The US National Institutes of Health (NIH) confirmed on Wednesday that its staff had removed the sequences in June 2020, three months after they had been uploaded to the online US government-run genetic sequence archive. Archive rules allow researchers to ask to withdraw their submissions.
It made the disclosure after virologist Jesse Bloom, of the Fred Hutchinson Cancer Research Centre in Seattle, said he recovered genetic sequences that were collected by a team from Wuhan University in January and February 2020 but had been removed.
But the details of the sequences, including their exact mutations, were included in a paper published by the Wuhan University researchers last June, which remains online.
Bloom first noticed the sequences in a research paper published last May that drew from the US archive and cited Wuhan University, but was surprised that he could not locate them, he said in a non-peer-reviewed paper uploaded this week to the preprint server bioRxiv.
A hunch prompted him to search the archive’s cloud server by guessing the URL to see whether the deleted sequences were still there, away from public view.
“This strategy was successful,” Bloom wrote, adding that the recovered data allowed him to reconstruct 13 partial virus sequences from early in the outbreak.
“The approach taken here hints it may be possible to advance understanding of Sars-CoV-2’s origins or early spread even without further on-the-ground studies, such as by more deeply probing data archived by the NIH and other entities,” he wrote.
“There is no plausible scientific reason for the deletion,” Bloom wrote, reasoning that there were “no corrections to the paper, the paper states human subjects’ approval was obtained, and the sequencing shows no evidence of [contamination]. It therefore seems likely the sequences were deleted to obscure their existence.”
Details of the sequences were included in a table in a paper by the Wuhan team published in the journal Small last June, which remains online.
Some scientists, citing Small, questioned Bloom’s assertion of obscuring data. Bloom tweeted in response that the information was less usable “as a table of mutations” in a little-known paper than in genetic sequences available in a database.
In their request to the NIH, the scientist who submitted the data said it had been updated and was being submitted to another database, and should therefore be removed to avoid contradiction, the US agency said.
Two lead authors on the Wuhan paper did not respond to an emailed request from the South China Morning Post for comment about the removal.
Chinese researchers operating with national grant funding, such as this Wuhan team, are understood to need approval to release data to external public databases. But Beijing has also moved to control publications about the virus, last year adding an approval process for related research, according to documents obtained by Associated Press.
Maciej Boni, an associate professor at Pennsylvania State University, said the genetic diversity added by the recovered data confirmed the generally accepted timeline for the virus’ emergence, and that Wuhan’s Huanan market was not its original location.
“It’s further confirmation that the date of origin was in the mid-October to mid-November range,” said Boni, who stressed the importance of data sharing in epidemic response. “Does it change the overall picture? No. But is the data valuable in confirming the picture? Yes.”
He said the lack of genetic data from the pandemic’s early stages “has been the key problem with better understanding its origins”.
Bloom also said the sequences could help infer what the early ancestor of the virus in humans may have looked like, using the data to project that it would have had three mutations that are absent from the viruses sampled at the market.
Sudhir Kumar, director of the Institute for Genomics and Evolutionary Medicine at Temple University, said it was difficult to make such hypotheses from a small data set, but agreed that Bloom’s “sleuthing” gave scientists something they had long wanted more of: original data to analyse.
“Would it be helpful to have more data from December 2019 as well as January 2020 from China? Yes,” he said. “It would tell us truly about the diversity of the coronavirus that persisted there.
“A similar kind of data excavation needs to be done in other countries as well,” he said, adding that the virus may have been spreading for some time globally before it was identified.
It remains unclear whether China will welcome a follow-up mission of international researchers. Beijing has denied accusations that it has been less than fully transparent.