Gene annotation: Man vs. The Machine (Updating the classical maize gene list)

December 5, 2010

Filed under: Uncategorized — James @ 6:36 pm

I don’t have it up on the website yet, but you can download the latest version of the maize classical gene list as an OpenOffice spreadsheet, an Excel file, or a comma-separated file. If you have any questions about the methodology behind the list, need information from the list in some other format, or can think of other maize genes which should be included on the list, don’t hesitate to contact me.

Late last week the people at maizesequence.org (the folks currently responsible for updating the official gene models of corn) published their own list ~~that uses the same term “classical maize genes.” Despite using the same name their list was produced using a very different methodology and has no association with the list produced by my lab.~~ called the “named maize gene list”. Here are the key differences between our two resources:

Their list uses automated gene assignments, with no human proofing.
Their list places no restrictions on what it takes to be a classical maize gene. (Our list requires at least 3 citations, an interesting mutant phenotype, or someone in the community caring enough to suggest the gene.)

The publication of these two lists gives us the opportunity to check how human-proofed annotations stand up against well-designed algorithms.

Of the 468 genes included on my new list (which reflects both the addition of new genes at the request of members of the maize community, and the removal of genes with poor sequence evidence), 152 were also identified by maizesequence.org. In 140 cases we identified the same gene models. I rechecked the remaining 12 cases. In nine cases after a more thorough check I remain convinced I made the right assignment. In two others maizesequence.org’s assignment was clearly the correct one and I’ve updated my own list accordingly. (In the last case we may both be wrong but I can’t be certain).
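For anyone who wants to repeat this kind of cross-check on their own lists, here is a minimal sketch of the comparison step. It assumes each list can be loaded as a mapping from gene name to gene model ID; the gene names and model IDs below are placeholders I made up for illustration, not entries from either real list.

```python
# Hypothetical gene -> gene-model assignments. The names and IDs are
# placeholders, not entries from either real list.
manual = {"geneA": "model_001", "geneB": "model_002", "geneC": "model_003"}
automated = {"geneB": "model_002", "geneC": "model_999", "geneD": "model_004"}


def compare_assignments(a, b):
    """Split the genes shared by two lists into agreements and conflicts."""
    shared = sorted(a.keys() & b.keys())  # genes present on both lists
    agree = [g for g in shared if a[g] == b[g]]
    conflict = [g for g in shared if a[g] != b[g]]
    return shared, agree, conflict


shared, agree, conflict = compare_assignments(manual, automated)
print(f"{len(shared)} shared; {len(agree)} agree; {len(conflict)} to recheck by hand")
```

The conflicts are the only part that needs human attention, which is why the recheck covered 12 genes rather than all 152.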

So how does the man (me) compare to the machine (the computers at Cold Spring Harbor that generated maizesequence.org’s list)?

Accuracy: slight advantage man. Out of 152 overlapping genes I made clear mistakes in 3 (2% error rate). The algorithm used by the folks at Cold Spring Harbor was clearly mistaken in 10 of 152 cases (6.6% error rate). Yes, I’m counting the gene where we both seem to have got it wrong.
Speed: huge advantage machine. The new set of gene models was released the day before Thanksgiving. It’s taken me until tonight (11 days) to complete all the updates to my list.
Comprehensiveness: advantage man. While the complete maizesequence.org list contains >500 genes, this was accomplished by including genes like IDP1611 which is known solely for having an insertion/deletion polymorphism, while skipping genes like p1 (a gene that helps control the synthesis of purple pigment in corn plants) and shrunken1 (a gene involved in making sweet corn sweet). Not really a relevant comparison given the name change.
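The accuracy figures above can be rechecked from the counts given in the post. This is just arithmetic; the variable names are mine, and the one gene where both assignments look wrong is counted as an error for both sides, as described above.

```python
# Counts taken from the post.
overlap = 152        # genes on both lists
agreed = 140         # same gene model chosen by both
man_right = 9        # rechecked: manual assignment stands
machine_right = 2    # rechecked: automated assignment was correct
both_wrong = 1       # rechecked: possibly neither is correct

# The four outcomes should account for every overlapping gene.
assert agreed + man_right + machine_right + both_wrong == overlap

# The shared doubtful case counts against both sides.
man_errors = machine_right + both_wrong
machine_errors = man_right + both_wrong

man_rate = 100 * man_errors / overlap
machine_rate = 100 * machine_errors / overlap

print(f"man: {man_errors}/{overlap} = {man_rate:.1f}%")
print(f"machine: {machine_errors}/{overlap} = {machine_rate:.1f}%")
```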

Obviously there are advantages to both approaches. I don’t want to think about the number of hours I’ve put into the maize gene list over the past year (usually in my “free time” 😉). So even though I was able to shave a few percentage points off the error rate of a computer, in most situations (absent a grad student who grew up hearing stories of famous maize geneticists his whole life, plus a PI who was understanding about the grad student undertaking side projects in free time that could have been devoted to thesis research), I’d certainly be comfortable trusting the results of a well-designed computational algorithm for this sort of work.

Previous posts about the classical genes of maize genetics:

How many maize genes have actually been studied?

The most studied genes of maize genetics and why we love kernel phenotypes.

Two classical maize genes and the mystery of the missing gene.

4 Comments »

One thing I’ve been unclear about is whether their annotation included the maize full-length cDNAs compiled by our group and a couple of others (http://www.maizecdna.org/).
I’m going to add them into the annotation set on our site, anyway.

Comment by Will Nelson — December 6, 2010 @ 9:02 am
Are you asking about their classical maize gene list or the whole version 2 gene annotations?

Your guess is as good as mine what sequence data went into building the maize gene models. I just know it’s not everything in GenBank because when I was proofing the classical gene list I came across some cases where there was good cDNA evidence for 5′ or 3′ UTRs but the gene model was just an ab initio CDS.

If you’re asking about their classical gene list, my understanding is they used a list of GenBank sequences associated with genes that had already been compiled by MaizeGDB, which (if it’s anything like the one I built for my own list) will be weighted much more heavily towards the sequence records submitted when these named genes were originally cloned (even though those are often incomplete CDSs).

Comment by James — December 6, 2010 @ 9:23 am
Your analysis does not properly credit the value of the many genes “man” identified that “machine” did not.

Comment by The Beekeeper — December 7, 2010 @ 8:00 am


James and the Giant Corn Genetics: Studying the Source Code of Nature
