On the perils of automated annotation

One of the common questions for a biologist to ask when given a list of genes is: “What kinds of genes are overrepresented on this list?” Are these genes involved in protecting the plant from disease? Genes involved in photosynthesis? Or what?

To answer that question, it’s important to be able to make guesses about the function of all the genes in the genome of an organism. As you can imagine, if half the genes in the whole genome of a plant are involved in photosynthesis (not true), having a list of 20 genes where half are involved in photosynthesis isn’t very interesting. If only 1% of all the genes in the genome are involved in photosynthesis (also not true), the fact that half the genes on your list are involved in photosynthesis becomes a lot more exciting.
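
To put rough numbers on that intuition, here is a minimal sketch of one common way to score overrepresentation: a one-sided hypergeometric test. This is not necessarily what any particular annotation tool does under the hood. The function name and the genome size of 30,000 genes are assumptions chosen purely for illustration, and the 50% and 1% figures are the same made-up ones used above.

```python
from math import comb


def enrichment_p_value(genome_size, genome_hits, list_size, list_hits):
    """Chance of seeing at least `list_hits` genes from a category when
    `list_size` genes are drawn at random (without replacement) from a
    genome of `genome_size` genes, `genome_hits` of which are in the category.
    (Hypothetical helper: a one-sided hypergeometric test.)"""
    total = comb(genome_size, list_size)          # all possible gene lists
    most_possible = min(genome_hits, list_size)   # cannot draw more hits than exist
    tail = sum(
        comb(genome_hits, k) * comb(genome_size - genome_hits, list_size - k)
        for k in range(list_hits, most_possible + 1)
    )
    return tail / total


GENOME = 30_000  # assumed genome size, for illustration only

# Half the genome is "photosynthesis" (not true): 10 of 20 on the list is unremarkable.
p_half = enrichment_p_value(GENOME, GENOME // 2, 20, 10)

# Only 1% of the genome is "photosynthesis" (also not true): 10 of 20 is striking.
p_rare = enrichment_p_value(GENOME, GENOME // 100, 20, 10)

print(f"half the genome in category: p = {p_half:.3g}")
print(f"1% of the genome in category: p = {p_rare:.3g}")
```

The first p-value comes out well above any usual significance cutoff, while the second is vanishingly small. Either way, the answer is only as good as the category labels that were fed in.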

So the ability to analyze lists in this fashion requires making guesses about what every gene in the genome is doing. There are ways to do just that using comparisons to genes in other species whose functions are known, and that’s exactly what I did this week with the maize genome. It’s important to keep in mind though that these guesses are just that. Guesses. Exhibit A:

The list of genes I’m working with is highly enriched in genes involved in “neurological system process”. That’s right, my list of plant genes, from corn, has many more genes involved in brain activity than might be expected given the overall representation of those genes in the maize genome. The list also contains unexpectedly high numbers of genes involved specifically in “cognition”.

There’s no substitute for a sanity check from a real live human being*, but unfortunately, it’s hard to scare up enough trained scientists to manually annotate the possible jobs of 30,000+ genes, so until someone gets around to doing that for corn, automated prediction programs such as this one will have to suffice.

*Or maybe a sanity check from a highly trained corn plant? Now that we have genetic evidence corn plants are thinking creatures… 😉


