What is DNA sequencing about? Let's take the panda as an example.
An organism's genome can contain billions of base pairs - pairs of nucleotides that face each other on the two strands of a DNA molecule. DNA sequencing uses biochemical methods to determine the order of these nucleotide bases, of which there are only four types, known by their initial letters: A, T, G and C.
To assemble the genome of Jingjing, the three-year-old female Beijing Olympics mascot panda, BGI scientists had to map 21 pairs of large linear nuclear chromosomes, each pair organised as two double-helical DNA molecules that encode many genes. The 21 chromosomes consist of some 2.4 billion base pairs. The genome was published on the cover of Nature in December last year.
Using next-generation DNA sequencing technology, BGI's bioinformaticians first broke large DNA molecules into small pieces with high-pressure nitrogen. They were then able to read massive numbers of those pieces, each fewer than 100 bases long, simultaneously.
That was an advance over traditional sequencing technology, which allowed researchers to sequence only one much bigger piece of DNA, containing 500 to 1,000 base pairs, at a time. That took longer, like a tailor making a whole suit himself instead of dividing the material into small pieces and having 100 tailors work on them at the same time. After sequencing, bioinformaticians needed to use supercomputers to assemble the tens of millions of tiny DNA pieces, putting each fragment in the right place - one of the world's most difficult puzzles.
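The puzzle can be pictured with a toy example. The Python sketch below is only an illustration, not the software BGI used: it assumes error-free reads and a made-up 15-base "genome", and it uses a simple greedy strategy that keeps joining the two pieces with the longest matching overlap. The function names `overlap` and `greedy_assemble` are invented for this sketch.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of pieces with the largest overlap.

    Stops when no two pieces overlap by at least `min_len` bases and
    returns whatever pieces (contigs) remain.
    """
    pieces = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return pieces
        # Join the two pieces, writing the shared bases only once.
        merged = pieces[best_i] + pieces[best_j][best_k:]
        pieces = [p for idx, p in enumerate(pieces) if idx not in (best_i, best_j)]
        pieces.append(merged)


if __name__ == "__main__":
    # Four shuffled 6-base reads taken from the made-up sequence ATGGCGTACGTTAGC.
    shuffled_reads = ["TACGTT", "ATGGCG", "GTTAGC", "GCGTAC"]
    print(greedy_assemble(shuffled_reads))  # ['ATGGCGTACGTTAGC']
```

With four reads the search is instant. With tens of millions of reads, comparing every piece with every other piece becomes enormously expensive, which is why the real task called for supercomputers.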
About 36 per cent of the DNA pieces from the panda were repeated - like a book of 24 million words in which 'Hello' and 'Goodbye' appear over and over. Just as scholars reconstructing the book after it had been shredded would need to find the right page for each 'Hello', the researchers had to work out where each repeated DNA piece belonged.
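Why repeats cause trouble can be shown in a few lines of Python. This is a minimal sketch on a made-up 14-base sequence, not panda data, and the helper names are invented for illustration. A fragment that occurs once has only one possible home, while a repeated fragment fits in several places equally well.

```python
from collections import Counter


def candidate_positions(sequence: str, fragment: str) -> list:
    """All start positions where `fragment` matches `sequence` exactly."""
    positions = []
    start = sequence.find(fragment)
    while start != -1:
        positions.append(start)
        start = sequence.find(fragment, start + 1)  # allow overlapping matches
    return positions


def repeated_kmer_fraction(sequence: str, k: int) -> float:
    """Fraction of k-base windows whose content occurs more than once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)


if __name__ == "__main__":
    toy = "ACGTTTGACGTTCA"  # made-up sequence containing ACGTT twice
    print(candidate_positions(toy, "TTGA"))   # unique fragment: one position
    print(candidate_positions(toy, "ACGTT"))  # repeated fragment: two positions
    print(repeated_kmer_fraction(toy, 5))
```

In the toy sequence the fragment ACGTT fits at two different positions, so on its own it cannot say which spot it came from. Only the neighbouring bases can settle that.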
BGI finally assembled 2.25 billion base pairs of DNA, or 94 per cent of the panda's whole genome.
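The 94 per cent figure follows from the two numbers given in this article, as a quick check in Python shows. The variable and function names are made up for illustration, and both inputs are the rounded figures quoted above.

```python
ASSEMBLED_BASE_PAIRS = 2_250_000_000  # base pairs assembled, as quoted above
GENOME_BASE_PAIRS = 2_400_000_000     # approximate genome size, as quoted above


def assembled_share(assembled: int, total: int) -> float:
    """Share of the genome that was assembled, as a fraction between 0 and 1."""
    return assembled / total


if __name__ == "__main__":
    share = assembled_share(ASSEMBLED_BASE_PAIRS, GENOME_BASE_PAIRS)
    print(f"{share * 100:.2f} per cent")  # 93.75, which rounds to 94
```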
It was the first reported de novo assembly - the piecing together of a previously unknown sequence - of a large mammalian genome by means of the latest methods.