Cacao seed pod. Source: H. Zell, Wikimedia

Today’s addition to the list of available plant genome sequences has been on my mental list of missing-in-action genome projects for quite some time: Theobroma cacao, the tree whose seeds provide the world with chocolate. The sequencing of the cacao genome was announced with great fanfare back in 2008. I distinctly remember discussing the announcement with several grad students while on a tour to see Iowa State’s recently acquired Illumina sequencer. (That’s right, I hadn’t even started grad school out here on the coast when this genome project was announced.)

The Genome:

The cacao tree has ten chromosomes. The genome project was able to assemble a little more than 92% of the genome into pseudomolecules representing those ten chromosomes.* The remainder of the genome was assembled into smaller pieces of sequence that have not yet been accurately assigned to a chromosome. The genome already has a preliminary set of genes annotated onto it (~35,000 gene models). The entire genome is estimated to be ~400 megabases, which is quite reasonably sized as genomes go. The press release mentions that the genome has been sequenced to 200-fold coverage (can this be right? It seems absurdly high) using a mixture of 454 and Illumina sequencing. At the opposite extreme from technical stats like that, David Kuhn, one of the people involved in the genome project, was quoted in the Washington Post describing it as “a very well-behaved genome,” and I’m intrigued to find out what he means by that.
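For a sense of scale, here is a back-of-the-envelope sketch of what those numbers imply, taking the ~400 megabase estimate, the 92% figure, and the press release's 200-fold coverage at face value. The function name is made up for illustration, and this is plain arithmetic rather than anything from the genome project itself.

```python
def bases_needed(genome_size_bp, fold_coverage):
    """Total sequenced bases implied by an average fold coverage."""
    return genome_size_bp * fold_coverage


genome_size_bp = 400_000_000   # ~400 megabases, the estimate quoted above
fold_coverage = 200            # figure from the press release
placed_fraction = 0.92         # "a little more than 92%" in pseudomolecules

total_bp = bases_needed(genome_size_bp, fold_coverage)
placed_bp = genome_size_bp * placed_fraction

print(f"Raw sequence implied: {total_bp / 1e9:.0f} gigabases")
print(f"Assigned to chromosomes: roughly {placed_bp / 1e6:.0f} megabases")
```

If the 200-fold number is right, that works out to about 80 gigabases of raw reads for a 400 megabase genome, which is why the figure looks so high.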

The genome is already released, and can be downloaded from this website. There appear to be no Fort Lauderdale restrictions, which means people can begin asking questions of the genome and publishing the answers they discover today!
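If you do download the sequence, one quick sanity check is to see what fraction of the assembled bases sit in the ten pseudomolecules. The sketch below is a minimal FASTA tally under some loud assumptions: that the download is a plain FASTA file, and that the pseudomolecule records have names starting with `super_contig` (the footnote below says they are labeled super_contigs 1-10, but the exact header format is a guess). The function name and the demo records are hypothetical.

```python
def fraction_in_pseudomolecules(fasta_lines, prefix="super_contig"):
    """Fraction of all bases found in records whose name starts with prefix.

    The prefix is an assumption about how the pseudomolecules are named.
    """
    total = 0
    placed = 0
    in_placed = False
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the ">" marker.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            in_placed = name.startswith(prefix)
        else:
            total += len(line)
            if in_placed:
                placed += len(line)
    return placed / total if total else 0.0


# Tiny made-up example: 12 of 16 bases are in a "super_contig" record.
demo = [">super_contig1", "ACGTACGT", "ACGT", ">scaffold_42", "ACGT"]
print(fraction_in_pseudomolecules(demo))
```

With a real download you would pass an open file handle instead of the demo list. Note that this measures the fraction of the assembly, which is not necessarily the same denominator as the 92% quoted above.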

The Plant:

Cacao seeds. Photo: EverJean, Flickr

Cacao was originally domesticated in South America, but after production there was drastically reduced by a fungus called Moniliophthora perniciosa, which causes a disease known as Witch’s Broom, the centers of production shifted to western Africa. Around 40% of the world’s supply is grown in Côte d’Ivoire alone, with perhaps another 20% coming from neighboring Ghana. The production process is quite long and complicated, the farms are threatened by disease, and many of the cacao trees are quite old. All of this has contributed to the stories that have recently claimed “THE WORLD IS RUNNING OUT OF CHOCOLATE.”

Random plant relationships trivia: cacao is in the same family of plants as cotton, kola nuts and okra.

The Rest:

The genome of the cacao tree was sequenced in a collaboration between Mars**, IBM, the USDA and a number of universities. The interest of Mars in the chocolate genome is obvious, since an expensive and unpredictable supply of chocolate is a problem for a company that makes its money selling candy. IBM got to show off some of their supercomputer hardware in the assembly of all the tiny pieces of sequence data generated using 454 and Illumina sequencing technology into pseudomolecules millions of letters long. In some ways I’m most surprised at the involvement of the US Department of Agriculture, since cacao production within the US is effectively zero, but given how much the US spends on buying chocolate from the countries where it is grown, I suppose it makes sense that we’d have an interest in assuring a plentiful and reliable supply.

It’s notable that there was apparently another genome sequencing project being worked on by Pennsylvania State University and CIRAD in France with funding from (you guessed it) Hershey. According to several articles, they may already have a scientific paper being reviewed for publication (something the Mars-IBM-USDA group hasn’t done yet). Publishing first is generally the standard used for who gets credit for a scientific discovery, but in this case, with the Mars data already available to researchers around the world, it may end up not mattering whose paper appears first. That would really be unfortunate for the researchers involved, but if competing genome projects mean data starts getting released to the rest of the scientific community faster, I am all for it. Genome projects with normal government funding generally don’t have any direct competition***, which may be why some of them tend to keep a tight hold on their data for what feels like forever.****

If you’d like more details on the cacao genome project, written by professional reporters, not a grad student stealing time between Python programming, seminars, a server migration, and revising the paper we just got back from peer review:

The Washington Post: With DNA of chocolate nearly decoded by scientists, could sweeter treats await?

The New York Times: DNA of Cacao Bean Tree Sequenced by Mars and Hershey (requires a login)

And of course, there’s the Cacao Genome project’s website itself.

Normally the payoff when a company sequences a genome is an advantage over their competition, but Mars has released this dataset to the public, which includes their competitors (although obviously one of their biggest competitors, Hershey, was already going to have its own version of the genome one way or another). Normally the payoff when public sector scientists sequence a genome comes in the form of publications*****, but it’s not clear if that will happen either in this case, given the competing genome version. So to everyone involved in the cacao genome project, thank you.

*If you’re downloading the sequence data, these are labeled as super_contigs 1-10.

**Mars apparently pledged to spend $10 million sequencing the chocolate genome over five years. Since the genome was released after only two years, I’m not sure how much of that money was actually spent.

***Although that could be changing. For example, the JGI Cucumber genome project appears to have been scooped when China’s BGI published their version of the genome first. Edit (February 2012): This isn’t actually what happened. I had the chance to talk with someone from JGI, who explained that their cucumber genome assembly was carried out using donated data and was never intended for publication.

****The genome projects out there that constantly miss promised release dates for their data know who they are….

*****And with publications come wealth, fame, and adoration by members of the opposite sex… I wish.


