First of all, apparently I’ve been spelling pigeonpea wrong. The random “d” I was inserting comes from “pidgin.” But fortunately that’s not what the scientific feud is about.
Instead, it turns out that there were actually two independent assemblies of the pigeonpea genome published in separate journals within a couple of weeks of each other.
One, which I’ve been talking about, was published in Nature Biotechnology (impact factor 31) on November 6th. This pigeonpea project was run by ICRISAT (one of the CGIAR centers) with much of the actual sequencing and informatics contracted out to BGI (the Beijing Genomics Institute).
But what I didn’t know was that a second pigeonpea assembly had been published back on October 25th in the Journal of Plant Biochemistry and Biotechnology, a journal whose title far fewer scientists will recognize (impact factor 0.41). This genome was put together by a group at the Indian Council of Agricultural Research (ICAR).
Confused by all the acronyms yet?
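Alright, I’ll drop all of them for the remainder of this post.
The group that published in the Journal of Plant Biochemistry and Biotechnology:
- Has a longer history of working on pigeonpea genomics
- Published first (if just barely)
The group that published in Nature Biotechnology:
- Published in a much higher impact journal
- Has more total assembled sequence*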
Seems simple enough, and just the sort of thing that is bound to happen more and more as the cost of genome sequencing continues to drop. Unfortunately there seems to be a lot of bad blood between the research groups, and all sorts of stuff (like who is more of a real Indian) is getting dragged in. Check out the comments section on this article.
I really only have two thoughts on the subject:
- The research community really will benefit if these two research groups can make peace with each other and merge their data into a combined assembly for version 2 of the pigeonpea genome. Neither assembly is all that great yet, and the two genomes were sequenced using different technologies: 454 sequencing (fewer, longer reads) and Illumina sequencing (more, shorter reads). A merged assembly could capture more of the total genome and (more importantly, in my opinion) maybe get a larger fraction of the current sequence placed and oriented within the pseudomolecules. Right now fewer than 250 megabases of the 600-odd megabases of assembled sequence are placed within chromosomes. The other 350 megabases are present as over 100,000 floating contigs, many as small as 103 bp of sequence.
- These situations are terrible for everyone directly involved, all of whom are probably afraid they will not get the credit they expected for the incredibly hard work of sequencing a genome. But the same mess is good for both the broader scientific community and the world as a whole. Incidents like the current one with pigeonpea, or the competing chocolate genome projects and the three cucumber genomes, drive home one message: genome projects can be scooped. You can’t afford to sit on your data for months or years… you need to write up your findings and make your data available fast. As a result, science moves faster (good for us scientists trying to publish), and in the long term important discoveries with the potential to help people are made faster (good for the whole world).
*In the article linked above the author states: “The major difference between the two sequencing projects is that while the ICRISAT-led team has assembled 605.78 Mb out of the 833.07 Mb (about 72.5%) of the genome, the ICAR team has captured 511 Mb (about 61%).” I haven’t had a chance to look at the ICAR assembly yet, but to get to 605 Mb the ICRISAT assembly includes contigs as small as 103 base pairs, which is a uselessly small piece of DNA from a genomics or genetics perspective. So comparing raw “sequence assembled” numbers without asking “how many pieces, and how big were they?” clearly will not give the whole picture.
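To make the "how many pieces, and how big" question concrete, here is a minimal Python sketch of N50, a common contiguity summary. The contig lengths below are invented toy numbers, not taken from either pigeonpea assembly, and the function name `n50` is my own choice for illustration.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together hold at least half of the assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy assemblies (made-up lengths, in base pairs).
few_large = [400_000, 300_000, 200_000, 100_000]
many_small = [200_000] + [103] * 8_000  # one decent contig plus 103 bp crumbs

for name, contigs in [("few_large", few_large), ("many_small", many_small)]:
    print(name, "total:", sum(contigs), "pieces:", len(contigs), "N50:", n50(contigs))
```

In this toy case the second assembly has slightly more total sequence, yet its N50 is only 103 bp because most of its bases sit in tiny fragments. That is the kind of gap a raw "megabases assembled" comparison hides.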