James and the Giant Corn

Gene Annotation Won’t Kill You

I just got out of a meeting about how we’re going to improve the annotations of the maize genome. It’s got me riled up.

But first, what is gene annotation (in twenty words or less):

Gene annotation is the information on which parts of chromosomes are part of each gene and sometimes information on what that gene does.

Today, the first annotations of genes on a chromosome are done by computer, but the computers still make a lot of mistakes, and the only way to get better annotations is with humans.

Humans can improve two things. They can catch mistakes in gene models that a computer would miss, and they can do a better job of predicting the functions of individual genes.

It’s not the most glamorous part of science, but it is, nevertheless, ESSENTIAL to the creation of the most useful genome sequences. Shouldn’t corn/maize, as one of the three most important species that feed the world*, have a first-rate genome sequence (especially now that we’ve already invested 30 million dollars in generating and assembling the sequence data itself)?

I sure think we should.

But it’s not going to happen until we get over two problems:

Making the perfect the enemy of the good:

I’ve done annotation several times in my life. Improving the gene models themselves is quick, 10-15 minutes per gene if you have a well-designed system. The only trick is that you need to agree on rules in advance for what people should do when they don’t have enough evidence to be sure of the model.

Determining function can take as little time (or as much) as you like. At the shortest end, it is simply feeding the gene sequence into NCBI’s BLAST and making a guess based on the most similar gene with a known function. More complex (and somewhat more accurate) approaches take into account everything from domains to confidence of sequencing calls, data on where the gene is expressed, and what percentage of different genes match up with each other at all. I got the chance to talk with a bright guy who trained people to annotate genes, and he says they average ~45 minutes per gene for functional annotations. But you don’t have to do all of that. If you get a set of annotations that’s only 95% accurate instead of 98%, it’s still a WHOLE lot better than no annotations at all. If people only want to put in 15 minutes per gene, they can still create a much more trustworthy dataset than the automated annotations. And you can always go back and improve them more later.
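For anyone who hasn’t seen the quick-and-dirty version, here is a rough Python sketch of the “most similar gene with a known function” guess. It assumes you have already run BLAST and saved the hits in its default tabular format (-outfmt 6). The function name best_hits, the 1e-10 e-value cutoff, and all of the gene IDs and function labels in the example are made up for illustration; this is not a recommended pipeline.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_tsv, known_functions, max_evalue=1e-10):
    """Guess a function for each query gene from its best-scoring hit.

    blast_tsv       -- text of a BLAST tabular (-outfmt 6) report
    known_functions -- dict: subject ID -> function (hypothetical lookup table)
    max_evalue      -- arbitrary cutoff chosen for this sketch
    """
    best = {}  # query ID -> (bit score, subject ID)
    reader = csv.DictReader(
        io.StringIO(blast_tsv), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    for row in reader:
        subject = row["sseqid"]
        if subject not in known_functions:
            continue  # a hit with no known function tells us nothing
        if float(row["evalue"]) > max_evalue:
            continue  # too weak a match to base a guess on
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][0]:
            best[query] = (score, subject)
    return {q: (s, known_functions[s]) for q, (_, s) in best.items()}


# Entirely made-up example hits, only to show the shape of the data.
rows = [
    ("geneA", "subj1", "92.5", "300", "20", "2", "1", "300", "5", "304", "1e-80", "310"),
    ("geneA", "subj2", "60.0", "280", "100", "5", "10", "290", "1", "280", "1e-20", "95"),
    ("geneB", "subj3", "35.0", "50", "30", "2", "1", "50", "1", "50", "0.5", "22"),
]
example_tsv = "\n".join("\t".join(r) for r in rows)
example_functions = {
    "subj1": "sucrose synthase",
    "subj2": "glycosyltransferase",
    "subj3": "kinase",
}

guesses = best_hits(example_tsv, example_functions)
print(guesses)
```

In this toy example geneA gets a guess from its strongest hit, while geneB gets none because its only hit is too weak. Genes like geneB are exactly the ones where a human looking at domains and expression data earns their keep.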

Making the problem bigger in your mind:

The great thing about gene annotations is that if you set the system up correctly, annotating the first 10% of the genome is useful by itself. If one in ten genes has an annotation, it’s still better than no annotations at all. You keep annotating genes and it keeps getting more useful, but the way some people talk about it, they don’t seem to think our work will do any good until every last gene is annotated. And boy do they think it is going to take a long time. It was sounding like some people expected to grow old and die working on the never-ending project of annotating the maize genome, with nothing to show for our efforts.
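To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The per-gene times (15 and 45 minutes) are the ones mentioned above. The gene count of 30,000 is a round placeholder, not a figure from this post, and the number of annotators and their hours per week are hypothetical too. Swap in your own numbers.

```python
def person_hours(n_genes, minutes_per_gene):
    """Total annotation effort in person-hours."""
    return n_genes * minutes_per_gene / 60


def weeks_needed(n_genes, minutes_per_gene, n_annotators, hours_per_week):
    """Calendar weeks if the work is split evenly among volunteers."""
    total = person_hours(n_genes, minutes_per_gene)
    return total / (n_annotators * hours_per_week)


N_GENES = 30_000             # placeholder gene count, an assumption
first_tenth = N_GENES // 10  # the "first 10%" that is already useful

quick_hours = person_hours(first_tenth, 15)     # quick pass: 750.0 hours
thorough_hours = person_hours(first_tenth, 45)  # full functional pass: 2250.0 hours

# Hypothetical community: 25 volunteers giving 2 hours a week each.
quick_weeks = weeks_needed(first_tenth, 15, n_annotators=25, hours_per_week=2)

print(quick_hours, thorough_hours, quick_weeks)  # 750.0 2250.0 15.0
```

Under those made-up assumptions, a quick pass over the first tenth of the genes is a matter of months for a modest group of volunteers, not a lifetime.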

I’m pretty sure these are the two main barriers to getting the maize genetics community fired up about doing gene annotations. Now I just have to figure out how to neutralize them.

*Rice, another of the three, does have a first-class genome.

2 Comments

  1. EJ says:

    James, which annotation system did you use? Manatee perhaps? There are a lot of annotation systems for prokaryotes out there, but for eukaryotes things become more difficult. I’ve heard the Ensembl annotation is one of the best but really difficult to implement.

    1. James says:

      The work I did with soybean genes was probably the best example of using a prebuilt annotation system. That was done with yrGATE, a tool built by Volker Brendel’s group. ( http://www.plantgdb.org/prj/yrGATE/ )
