James and the Giant Corn

December 5th, 2010:

Gene annotation: Man vs. The Machine (Updating the classical maize gene list)

I don’t have it up on the website yet, but you can download the latest version of the maize classical gene list as an OpenOffice spreadsheet, an Excel file, or a comma-separated file. If you have any questions about the methodology behind the list, need information from the list in some other format, or can think of other maize genes which should be included on the list, don’t hesitate to contact me.

Late last week the people at maizesequence.org (the folks currently responsible for updating the official gene models of corn) published their own list that uses the same term, “classical maize genes.” Despite using the same name, their list was produced using a very different methodology and has no association with the list produced by my lab. (Their list is now called the “named maize gene list.”) Here are the key differences between our two resources:

  • Their gene assignments are automated, with no human proofing. (Ours are proofed by hand.)
  • Their list places no restrictions on what it takes to be a classical maize gene. (Our list requires at least 3 citations, an interesting mutant phenotype, or someone in the community caring enough to suggest the gene.)

The publication of these two lists gives us the opportunity to check how human-proofed annotations stand up against well-designed algorithms.

Of the 468 genes included on my new list (which reflects both the addition of new genes at the request of members of the maize community, and the removal of genes with poor sequence evidence), 152 were also identified by maizesequence.org. In 140 cases we identified the same gene models. I rechecked the remaining 12 cases. In nine of them, after a more thorough check, I remain convinced I made the right assignment. In two others maizesequence.org’s assignment was clearly the correct one and I’ve updated my own list accordingly. (In the last case we may both be wrong, but I can’t be certain.)
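For anyone who wants to run the same kind of comparison on their own lists, here is a minimal sketch of the idea. It is not the script I used. It assumes each list has been loaded into a dictionary mapping gene name to gene model ID, and the gene names and model IDs below are made-up placeholders rather than real assignments.

```python
def compare_assignments(mine, theirs):
    """Split genes present in both lists into agreements and disagreements.

    Both arguments are dicts mapping gene name -> gene model ID.
    """
    shared = sorted(set(mine) & set(theirs))          # genes on both lists
    agree = [g for g in shared if mine[g] == theirs[g]]
    disagree = [g for g in shared if mine[g] != theirs[g]]
    return shared, agree, disagree


# Placeholder data for illustration only (hypothetical names and IDs).
mine = {"geneA": "model_001", "geneB": "model_002", "geneC": "model_003"}
theirs = {"geneA": "model_001", "geneB": "model_999", "geneD": "model_004"}

shared, agree, disagree = compare_assignments(mine, theirs)
print(len(shared), "shared;", len(agree), "agree; recheck by hand:", disagree)
```

The disagreements are the only genes that need a second look by hand, which is what made rechecking 12 cases out of 152 manageable.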

So how does the man (me) compare to the machine (the computers at Cold Spring Harbor that generated maizesequence.org’s list)?

  • Accuracy: slight advantage man. Out of 152 overlapping genes I made clear mistakes in 3 (a 2% error rate). The algorithm used by the folks at Cold Spring Harbor was clearly mistaken in 10 of 152 cases (a 6.6% error rate). Yes, I’m counting the gene where we both seem to have got it wrong.
  • Speed: huge advantage machine. The new set of gene models was released the day before Thanksgiving. It’s taken me until tonight (11 days) to complete all the updates to my list.
  • Comprehensiveness: advantage man. While the complete maizesequence.org list contains >500 genes, this was accomplished by including genes like IDP1611 which is known solely for having an insertion/deletion polymorphism, while skipping genes like p1 (a gene that helps control the synthesis of purple pigment in corn plants) and shrunken1 (a gene involved in making sweet corn sweet). Not really a relevant comparison given the name change.
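The accuracy figures follow directly from the recheck counts given earlier, and the arithmetic is easy to reproduce. A quick sketch, using only the counts from this post (the variable names are just mine for illustration), with the gene where we may both be wrong counted against both sides:

```python
# Counts from the comparison of the 152 overlapping genes.
overlap = 152
agreed = 140          # same gene model on both lists
i_was_right = 9       # disagreements where my assignment held up
they_were_right = 2   # disagreements where maizesequence.org was right
both_wrong = 1        # the case where we may both be wrong

# Sanity check: every overlapping gene is accounted for.
assert agreed + i_was_right + they_were_right + both_wrong == overlap

my_errors = they_were_right + both_wrong    # 3
machine_errors = i_was_right + both_wrong   # 10


def error_rate(errors, total):
    """Return the error rate as a percentage."""
    return 100.0 * errors / total


print(f"man:     {my_errors}/{overlap} = {error_rate(my_errors, overlap):.1f}%")
print(f"machine: {machine_errors}/{overlap} = {error_rate(machine_errors, overlap):.1f}%")
```

That works out to roughly 2.0% for me and 6.6% for the algorithm.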

Obviously there are advantages to both approaches. I don’t want to think about the number of hours I’ve put into the maize gene list over the past year (usually in my “free time” 😉). So even though I was able to shave a few percentage points off the error rate of a computer, in most situations (absent a grad student who grew up hearing stories of famous maize geneticists his whole life, plus a PI who was understanding about the grad student undertaking side projects in free time that could have been devoted to thesis research), I’d certainly be comfortable trusting the results of a well-designed computational algorithm for this sort of work.

Previous posts about the classical genes of maize genetics:

How many maize genes have actually been studied?

The most studied genes of maize genetics and why we love kernel phenotypes.

Two classical maize genes and the mystery of the missing gene.