The authors encode 750 KB as DNA. The data took 2 days to write at a cost of $12,400 per MB and 15 days to read at a cost of $2,200 per MB ($220 per MB for larger files). Information density is 2.2 PB per gram, or about 1,500 kg to store all of the world's 3 ZB of data. The data would be stable for 10,000 years. (We have recovered DNA from Neandertals and mammoths, but dinosaurs are long gone.) DNA synthesis and sequencing costs are dropping at Moore's Law rates, but are still about 10^8 times higher than disk.
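The mass figure is easy to verify by arithmetic; the only inputs are the 2.2 PB/gram density and the 3 ZB of world data quoted above:

```python
# Back-of-envelope check of the quoted density figures (decimal units assumed).
PB = 10**15  # bytes
ZB = 10**21  # bytes

density_bytes_per_gram = 2.2 * PB   # 2.2 PB per gram, from the paper
world_data_bytes = 3 * ZB           # ~3 ZB of world data

grams = world_data_bytes / density_bytes_per_gram
print(f"{grams / 1000:.0f} kg")     # ~1,364 kg, consistent with "about 1,500 kg"
```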
During reading, a 50-byte segment of one file was lost to an unanticipated coding error. The Huffman code of 5 or 6 bases per byte mapped a run of 0xFF bytes to a repeating, self-complementary DNA string that folded back on itself during sequencing. The problem could be avoided by compressing or encrypting the data first.

They use 5-6 bases per byte instead of 4 because the writing process is simpler when no base is ever repeated, which requires a base-3 differential encoding: each trit selects one of the three bases that differ from the previous base. The encoder is a machine like an inkjet printer that paints a single base per pass onto millions of pixels, each containing a different DNA strand.

Each strand is 117 bases of data (including length and index fields and a parity base) plus a 33-base promoter on each end that is identical for all strands. Reading is like normal paired-end sequencing, except that the step of fragmenting the DNA into short strands is skipped and custom software recovers the data from the reads. The reads are 96 base pairs, which is sufficient when reading from both ends with some overlap.
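The base-3 differential encoding can be sketched in a few lines. This is not the paper's actual code table: the real scheme uses a variable-length Huffman code of 5-6 trits per byte, while the sketch below substitutes a fixed 6-trit expansion (3^6 = 729 > 256) for simplicity. The differential step is the same idea: each trit picks one of the three bases that differ from the previously written base, so no base ever repeats.

```python
BASES = "ACGT"

def byte_to_trits(b):
    """Fixed 6-trit base-3 expansion of one byte (stand-in for the real
    5-6 trit Huffman code)."""
    trits = []
    for _ in range(6):
        trits.append(b % 3)
        b //= 3
    return trits[::-1]

def encode(data, prev="A"):
    """Map bytes to a DNA string with no repeated adjacent bases."""
    out = []
    for b in data:
        for t in byte_to_trits(b):
            # The three bases other than prev, in fixed order; the trit
            # selects among them, so out[i] != out[i+1] by construction.
            choices = [c for c in BASES if c != prev]
            prev = choices[t]
            out.append(prev)
    return "".join(out)

print(encode(b"hi"))  # → CGATCTAGATGA (12 bases, no adjacent repeats)
```

Note that a run of identical bytes (like the 0xFF run that caused the failure) still avoids repeated bases, but produces a short periodic pattern, which is what allowed the self-complementary fold-back.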
DNA weighs about 10^-21 g per base, so the theoretical capacity should be about 10,000 times higher than what the authors achieved. The gap is redundancy: they used thousands of copies of each DNA strand, with strands overlapping so that each base is covered 4 times, to reduce the error rate.
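Running the numbers (per-base weight and the 5-6 bases/byte code are from the post; everything else is arithmetic), the single-copy gap comes out around 10^5, the same order of magnitude as the ~10^4 quoted given the roughness of the inputs:

```python
# Rough capacity arithmetic from the figures in the post.
grams_per_base = 1e-21        # quoted weight of one DNA base
bases_per_byte = 5.5          # midpoint of the 5-6 base Huffman code

bases_per_gram = 1 / grams_per_base            # 1e21 bases per gram
theoretical = bases_per_gram / bases_per_byte  # single-copy bytes per gram
achieved = 2.2e15                              # 2.2 PB per gram, as achieved

# The gap is roughly the redundancy: thousands of strand copies times
# the 4x coverage from overlapping strands.
print(f"theoretical: {theoretical:.1e} bytes/g, gap: {theoretical/achieved:.0f}x")
```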
Last edited by Matt Mahoney; 25th January 2013 at 18:26.
Is it only me who recalled Johnny Mnemonic?