Just five years ago, in 2005, the state-of-the-art technology for sequencing genomes was Sanger sequencing, the same basic technology biologists had been using for decades, although the sequencers of 2005 were the product of decades of refinement of the basic technique. Five years later, in 2010, the newest state-of-the-art sequencer is the HiSeq 2000 from Illumina (at least until the Pacific Biosciences sequencers become available later this year… ::drool::). What difference do five years make? It would take more than thirty thousand of the latest and greatest Sanger sequencers from 2005 (right before the first next-generation sequencer, a 454 machine built by Roche, was released) to produce as much DNA sequence data as a single new HiSeq 2000 produces.*
PolITgenomics has a great, detailed post up about the new HiSeq 2000 sequencing machines from Illumina. Illumina sequencers (still often called Solexa, at least by me) generate lots and lots of short DNA sequences all at once. Assembling a genome out of those sequences can be hard,** but the latest generation of Illumina sequencers, including the HiSeq 2000, generates sequences almost twice as long as those used to assemble the panda genome, which was published late last year (100 bp long vs. ~50). The biggest difference between the HiSeq 2000 and existing Illumina sequencers is the increase in sequence data generated per machine per day.
The Genome Analyzer IIx, previously Illumina’s top-of-the-line model, generates 25 gigabases of sequence per run and takes five days to run. That works out to 5 gigabases of DNA sequence per day (a little over 1.5 times the amount of sequence in the human reference genome). Keep in mind, though, that it all comes in the form of short hundred-base-pair fragments that then have to be either pieced together like a giant jigsaw puzzle or matched onto an existing genome to reveal the unique mutations present in the newly sequenced individual or species.*** Either way, it’s already a huge amount of data.
The new machine generates sequences of the same length, but it finishes a run a little faster (four days) and generates a LOT more of them (100 gigabases of sequence per run). A single machine generates an average of 25 gigabases of sequence per DAY. Illumina had to redesign parts of their analysis software because the new instrument generates sequence faster than it could be saved to a hard drive!
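For anyone who wants to check the per-day figures, here is a small Python sketch of the arithmetic. It only uses the run sizes and run lengths quoted above; the variable names are mine.

```python
# Figures quoted in this post: (gigabases per run, days per run)
instruments = {
    "Genome Analyzer IIx": (25, 5),
    "HiSeq 2000": (100, 4),
}

# Average daily output is simply run size divided by run length
per_day = {name: gb / days for name, (gb, days) in instruments.items()}

for name, rate in per_day.items():
    print(f"{name}: {rate:g} gigabases per day")

# How much more sequence one new machine produces per day than one old one
speedup = per_day["HiSeq 2000"] / per_day["Genome Analyzer IIx"]
print(f"Per-machine daily output ratio: {speedup:g}x")
```

So by these numbers a single HiSeq 2000 does the daily work of five Genome Analyzer IIx machines.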
And to top it all off, according to Omics! Omics!, the Beijing Genomics Institute has already ordered 128 of the new sequencers. That’s enough sequencing machinery to generate 3200 gigabases (about 1000 times the size of the human genome) of sequence data per day. In 2009 the Joint Genome Institute, which is part of the Department of Energy and a sequencing behemoth in its own right, generated just over 1000 gigabases of high quality base calls. Once these machines are up and running, the Beijing Genomics Institute will be producing roughly twelve times the total 2009 sequence output of JGI every four days (on average they’ll produce as much sequence as JGI did in 2009 about every eight hours!).**** I don’t know what they intend to do with all that sequencing power, but I expect that it’ll be something incredible… actually they’ll probably end up doing lots of different things.
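Here is the back-of-the-envelope math for the BGI order as a short Python sketch. The machine count, per-machine output, and JGI total come from the paragraph above; the 3.2 gigabase human genome size is my own round-number assumption, and JGI’s "just over 1000 gigabases" is rounded down to 1000.

```python
MACHINES = 128                 # HiSeq 2000s ordered by BGI
GB_PER_MACHINE_PER_DAY = 25    # average output of one HiSeq 2000
JGI_2009_GB = 1000             # "just over 1000 gigabases", rounded down
HUMAN_GENOME_GB = 3.2          # assumption: approximate human genome size

fleet_per_day = MACHINES * GB_PER_MACHINE_PER_DAY        # 3200 gigabases
human_genomes_per_day = fleet_per_day / HUMAN_GENOME_GB  # about 1000
jgi_years_per_four_days = fleet_per_day * 4 / JGI_2009_GB
hours_to_match_jgi = JGI_2009_GB / fleet_per_day * 24

print(f"Fleet output: {fleet_per_day} gigabases per day")
print(f"Human genome equivalents per day: {human_genomes_per_day:.0f}")
print(f"JGI 2009 outputs every four days: {jgi_years_per_four_days:.1f}")
print(f"Hours to match JGI's 2009 output: {hours_to_match_jgi:.1f}")
```

The exact values come out slightly different from the rounded ones in the text (a bit more than twelve times, a bit less than eight hours), but the picture is the same.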
Where the number 30,000 comes from:
In a previous post I calculated that a top-of-the-market Sanger sequencer from 2005 (just before the first next-generation sequencing technologies came on the market), with 96 capillary tubes that could be run eight times per day, would generate 750 kilobases of sequence per day. The new machine from Illumina generates 25 gigabases per day (or 25,000,000 kilobases, to keep my units consistent). 25,000,000/750 = 33,333 (rounded to the closest whole machine).
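The same calculation as a few lines of Python, in case you want to plug in your own numbers. The inputs are the ones quoted above; the read length is not stated in this post, so the sketch only reports the length implied by 750 kilobases spread over 96 × 8 reads.

```python
CAPILLARIES = 96            # capillary tubes per Sanger machine
RUNS_PER_DAY = 8            # runs per day
SANGER_KB_PER_DAY = 750     # kilobases per day, from the earlier post
HISEQ_GB_PER_DAY = 25       # gigabases per day for one HiSeq 2000
KB_PER_GB = 1_000_000       # kilobases in a gigabase

reads_per_day = CAPILLARIES * RUNS_PER_DAY                         # 768 reads
implied_read_length_bp = SANGER_KB_PER_DAY * 1000 / reads_per_day  # derived, not quoted

hiseq_kb_per_day = HISEQ_GB_PER_DAY * KB_PER_GB                    # 25,000,000 kb
sanger_equivalents = round(hiseq_kb_per_day / SANGER_KB_PER_DAY)   # 33333

print(f"Sanger reads per day: {reads_per_day}")
print(f"Implied read length: {implied_read_length_bp:.0f} bp")
print(f"Sanger machines per HiSeq 2000: {sanger_equivalents:,}")
```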
*This overstates the improvement, as Sanger sequencers still generate longer sequences that are easier to piece together. But that is mostly only an issue when sequencing entirely new genomes, and a lot of the research being done today is in organisms where there’s already a reference genome to map sequence data onto (humans, for example), which makes the short lengths of the sequences generated by Illumina sequencing a much smaller issue.
**In my post on the panda genome I compared it to trying to piece together a novel 500 times as long as War and Peace using phrases this short:
nglish ambassador’s? Today is Wednesday. I must put
adron of varicolored horsemen. Two of them rode side
***If you’re someone who, like me, is geeky enough to want to sequence your own genome, but who has a lot more money to throw around ($48,000-68,000 last I read about it), this is exactly the way your genome would be sequenced. The Illumina sequencing machines generate gigabases of short DNA sequences that are then mapped by computers onto the reference human genome, which was produced at a cost of $2.7 billion over 13 years (under budget and ahead of schedule, believe it or not!). By mapping imperfectly matched sequences, this technique determines which base pairs are different in your genome compared to the reference sequence (i.e. you have an A where the reference human genome has a G). It’s like putting a puzzle together on top of another, already completed, puzzle with almost exactly the same picture.
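To make the puzzle-on-top-of-a-puzzle idea concrete, here is a toy Python sketch. It is emphatically not how real mapping software works (real tools index the genome and handle insertions, deletions, and quality scores); it just slides each read along a made-up reference, keeps the placement with the fewest mismatches, and reports where the reads disagree with the reference. The sequences and function names are invented for illustration.

```python
def best_position(read, reference, max_mismatches=1):
    """Return (offset, mismatches) for the best ungapped placement, or None."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


def call_differences(reads, reference, max_mismatches=1):
    """Map each read, then collect positions where it differs from the reference."""
    differences = {}
    for read in reads:
        hit = best_position(read, reference, max_mismatches)
        if hit is None:
            continue  # read could not be placed within the mismatch limit
        offset, _ = hit
        for i, base in enumerate(read):
            if base != reference[offset + i]:
                differences.setdefault(offset + i, set()).add(base)
    return differences


# Made-up reference and reads; the "individual" has an A where the reference has a G
reference = "ACGTTGCAAGGCTTACGATCGGA"
reads = ["TTGCAAGG", "GCTTACAATC", "ACAATCGG"]

for position, bases in sorted(call_differences(reads, reference).items()):
    print(f"position {position}: reference {reference[position]}, reads {sorted(bases)}")
```

Two of the three reads overlap the changed base and both place cleanly with a single mismatch, which is the kind of agreement that makes a difference believable rather than a sequencing error.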
****To be fair, JGI is continuing to tool up as well. The amount of sequence they generated increased by a factor of eight between 2008 and 2009.