These are the official annotations for version two of the maize genome assembly. For the official release I need to point you to a text file on an FTP site (sorry). You can download all the new genomic information here.
In version one, the maize sequencing consortium released two separate sets of gene models: a list of 32,400 high-confidence gene models known as the maize “filtered gene set”, and a list of over 110,000 gene models known as the maize “working gene set.” There were very good reasons to do this at the time, but it was a pain for a lot of people, since depending on what questions you studied you might use one set of annotations or the other.
In version two the maize genome folks have released a single set of 110,028 genes. Intimidated? So was I. Fortunately it’s not quite as intimidating a pile of data as it sounds. The first thing we can do is say we’re going to ignore transposon-related genes. Remember, transposons are selfish genes that encode proteins* that they use to propagate themselves, but generally don’t contribute to the plant (or other organism, if you’re into that sort of thing) as a whole. If you’re interested in studying transposon biology, you’re interested in these genes, but the rest of the community can safely ignore them. So subtracting out the transposon-related genes (handily annotated by the folks at Cold Spring Harbor as “transposable_element” in the gene information file) removes 29,082 genes from our total.**
So now we’re down to only 80,891 genes. But wait, there’s more! Maize also has a bunch of broken pieces of genes lying around the genome. Some of these are copied scraps of genes picked up by transposons as they run wild through the maize genome. Others are the decaying corpses of redundant genes from the whole genome duplication that happened only 5-12 million years ago*** in the ancestor of modern maize. It’s important to know about these for a bunch of reasons. For one thing, it’s very hard to be sure a gene is dead. (I’m still trying to come up with a genomics equivalent of the old “poke it with a stick, see if it moves” test.) For another, just because there’s only a fragment of the gene left in B73 (the line of corn that was sequenced) doesn’t mean the gene might not still be intact and fully functional in other corn plants. Remember, at least 3,410 of the high-confidence filtered genes found in B73 (more than 10%!) have been lost in other varieties of corn****. But for now let’s get rid of the genes that look pretty dead. They’re marked as “pseudogenes” in the gene information file for this latest release, along with information about why they look dead (really useful to have!). That lets us trim off another 17,615 genes from the grand total.
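If you want to do this kind of trimming yourself, here is a minimal Python sketch of the idea. It assumes the gene information file is tab-delimited with a header row, and that each gene's category sits in a column I'm calling `class`. The column names, the `pseudogene` and `protein_coding` labels, and the tiny sample table are all made up for illustration (only “transposable_element” comes from the release itself), so check the real file's header before trusting any of it.

```python
import csv
import io
from collections import Counter

# Labels to drop. "transposable_element" is the label mentioned above;
# "pseudogene" is an assumed spelling, so check it against the real file.
EXCLUDED_CLASSES = {"transposable_element", "pseudogene"}


def split_by_class(handle, class_column="class"):
    """Read a tab-delimited gene table with a header row.

    Returns (kept_rows, counts_per_class), where kept_rows are the rows
    whose class is not in EXCLUDED_CLASSES. The column name "class" is a
    hypothetical placeholder.
    """
    kept = []
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        gene_class = row[class_column]
        counts[gene_class] += 1
        if gene_class not in EXCLUDED_CLASSES:
            kept.append(row)
    return kept, counts


# Tiny invented table, only to show the shape of the input.
SAMPLE = (
    "gene_id\tclass\n"
    "gene_A\tprotein_coding\n"
    "gene_B\ttransposable_element\n"
    "gene_C\tpseudogene\n"
    "gene_D\tprotein_coding\n"
)

kept, counts = split_by_class(io.StringIO(SAMPLE))
print("kept:", [row["gene_id"] for row in kept])
print("per class:", dict(counts))
```

To run it on a real download, you would swap `io.StringIO(SAMPLE)` for an open file handle and adjust the column name and labels to match whatever the file actually uses.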
What are we left with? 63,276 genes that show every sign of being “the real thing.” These genes can be split into two categories:
- Genes that are clearly related to other genes found in other species (44,946 genes)
- Genes that are unlike anything previously seen (18,330 genes)
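For anyone who likes to check the bookkeeping, the tally can be written out as a few lines of Python using only the counts quoted in this post (the variable names are just for illustration). One wrinkle: 110,028 minus 29,082 is 80,946, which is 55 more than the 80,891 quoted above, so one of those three figures is probably slightly off. The sketch starts from 80,891, since that is the figure the later numbers agree with.

```python
# Counts as quoted in the post; variable names are illustrative only.
TOTAL_RELEASED = 110_028
TRANSPOSON_RELATED = 29_082
AFTER_TRANSPOSONS = 80_891   # figure quoted after removing transposon genes
PSEUDOGENES = 17_615
RELATED_TO_KNOWN = 44_946    # clearly related to genes in other species
UNLIKE_ANYTHING = 18_330     # no recognizable relatives

# Removing pseudogenes from the quoted post-transposon figure.
remaining = AFTER_TRANSPOSONS - PSEUDOGENES
assert remaining == 63_276

# The two categories account for every remaining gene.
assert RELATED_TO_KNOWN + UNLIKE_ANYTHING == remaining

# The first subtraction does not quite match the quoted figure.
gap = (TOTAL_RELEASED - TRANSPOSON_RELATED) - AFTER_TRANSPOSONS
print("remaining genes:", remaining)
print("unexplained gap in first step:", gap)
```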
Now, some of those “unknown” genes will turn out to be mistakes. Others will turn out to be members of fast-evolving gene families (like disease resistance genes) that change too fast to keep track of. A few will turn out to be completely real, and without precedent.
All in all, this new release of data is a sign the maize genome annotation is growing up. The documentation was easier to follow, and the folks who were in charge of this set of annotations did a much better job of (for lack of a better term) “showing their work.” Genes aren’t just annotated as a pseudogene or transposon; there are notes explaining the evidence that resulted in the gene being put in that category.
And when someone asks me how many genes corn has, I can tell them “somewhere between 45,000 and 65,000”. Which is still a big range, but a huge improvement from “More than 32,400 and less than 108,754.”
*I’m simplifying, of course. Many transposons manage to do their thing without needing to make any of their own proteins, but that’s a story for another day.
**That’s right, the corn genome contains more transposon-related genes than all the “real” genes in Arabidopsis combined (28,775 genes: 27,416 protein-coding genes + 1,359 non-coding RNA genes). Which reminds me, version 10 of the Arabidopsis gene annotations came out last Wednesday. Check it out! I may mock Arabidopsis as a plant, but TAIR (the folks in charge of managing the Arabidopsis gene annotations and genome information) have done an outstanding job over the past decade!
***I’m completely serious when I say “only” here. On the scale of many of the evolutionary events we can still see evidence of in plant genomes, a few million years isn’t much at all.
****Don’t believe me? Check out this paper.