James and the Giant Corn


Missing Genes on a Massive Scale

Edit: stripped out all the numbers as they clearly applied to an earlier version of the data and I don’t know if the new ones are intended for public release yet.

Last November, when the maize genome was published, one of the companion papers looked at genes where a different number of copies were found in different breeds of maize (this is called Copy Number Variation) and genes found in B73 (the variety of maize that was sequenced) but completely missing from the genomes of other varieties. There’s a great post on that paper written up by Mary at OpenHelix.

A few months later, it sounds like this dataset has grown substantially. Over XXXX B73 genes (that’s X% of the filtered B73 gene set!) appear to be lost (or have sequences so different they no longer register) in at least some varieties of maize. And because the new dataset incorporates data from XX different maize breeds and XX different teosinte* lines, they’re able to identify some of the losses as older because they’re found in multiple comparisons, while others appear to be lost in only a single breed and might represent more recent losses.
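The older-versus-recent reasoning can be sketched in a few lines of Python. This is only an illustration built on a made-up presence/absence table: the gene names, line names, and the simple counting rule are all hypothetical, and nothing here is meant to describe the method used in the actual study.

```python
# Hypothetical presence/absence calls: for each B73 gene, the set of
# lines in which the gene appears to be missing. All names are made up.
missing_in = {
    "geneA": {"line1", "line2", "teosinte1"},
    "geneB": {"line3"},
    "geneC": set(),
}

def classify_loss(lines_missing):
    """Label a gene by how many lines appear to lack it."""
    if not lines_missing:
        return "present everywhere"
    if len(lines_missing) == 1:
        return "possibly recent loss"
    # Missing in several comparisons: the loss may predate their split.
    return "possibly older loss"

labels = {gene: classify_loss(lines) for gene, lines in missing_in.items()}
for gene, label in sorted(labels.items()):
    print(gene, "->", label)
```

A gene missing from several lines gets the "older" label and a gene missing from just one gets the "recent" label. A real analysis would also need to weigh how the lines are related to each other.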

Sit back and think about that for a second. At least X% of the genes in corn sometimes go missing. This could have implications for everything from inbreeding depression and hybrid vigor to the kind of basic research I’m actually working on myself.

As you can imagine, I’d love to get my hands on this dataset myself, but the next best thing will be to take furious notes when Nathan Springer talks about the project on Friday morning**, and to be sure to swing by Steven Eichten’s poster to soak in the awesomeness.

Ruth A. Swanson-Wagner et al. “Combined Analysis of genomic structural variation and gene expression variation between maize and teosinte populations” Talk #1 2010 Maize Meeting (Presented by Nathan Springer)

Steven R. Eichten et al. “Extensive Copy Number Variation Among Maize Lines” Poster #139 2010 Maize Meeting

*Teosinte is the wild species from which maize/corn was domesticated.

**And he’s talking at 8:30 AM on a day when I still plan on being heavily jet lagged.

The long genome drought

Grad students graduating with PhDs right now probably entered grad school at point A, or earlier. People at the end of their first post-doc (and potentially in a position to apply for faculty positions and start training new graduate students themselves) would have entered grad school at point B or earlier.

Today there are a mere 10 published plant genomes out of the more than quarter million named plant species in the world. But even ten genomes is a huge amount of data to deal with for a plant genomics community that largely came of age during the long genome drought of 2002-2006. What is the genome drought? The rice genome was published in April 2002. It was only the second plant genome to be sequenced, and the last plant genome to be published until the poplar genome came out in September of 2006, a gap of more than four years. Two genomes, especially of species as distantly related as arabidopsis and rice, don’t make for a lot of compelling comparative genomics (although there was certainly some really cool stuff being discovered in this time period).

Does that matter? Probably not, but it’s important to remember that the people earning their PhDs today probably entered grad school (and chose a lab and field of study) during that two-plant-genome era (see point A), and less opportunity for exciting research means less grant money and less ability to attract grad students. The youngest people applying for faculty positions today (assuming they only did one, quite successful, post-doc) also entered grad school in the two-genome era (see point B), if not the previous single-genome era.

I’m talking about this mostly to make the point that I think comparative genomics as a field of study is getting a lot more exciting as more genomes become available, which is likely to attract more graduate students in that key first year when they join a lab and begin to specialize. That means that as we move farther away from the time of the long genome drought, we will hopefully* start to see a lot more well-trained people doing plant genomics.

Which is a good thing because the other point this graph should make (not that I think many people need to be reminded of it), is that the pace of sequencing plant genomes is accelerating, and SOMEONE needs to analyze the huge quantities of data that are already starting to flow through the plant biology community.

This should be the last plant genome themed post for a while, but please continue to let me know if you know/hear about more plant genome projects. jcs98 (@) jamesandthegiantcorn.com

*If the hypothetical end to the also-hypothetical-and-possibly-the-result-of-wishful-thinking-on-my-part shortage of plant comparative genomicists could hold off long enough for me to be really in demand when/if I finish grad school, that would be great. 😉

And the list gets better!

Based on e-mails and responses to my previous post I’ve made the following additions to the sequenced plant genomes page:

  • Added an entry on Columbine, a member of an early diverging group of eudicots. As far as I can tell this sequence is currently unreleased, but from the JGI website it looks like the initial assembly is already complete, so if you know of a way for people to get ahold of that let me know.
  • Added an entry on the Castor Bean. The sequencing group has released a 4x coverage genome assembly. (The castor bean is the source of the deadly toxin ricin, and is not grown in the US; we import our castor oil from other countries.)
  • Split the entry on Arabidopsis into “Arabidopsis species and allies”. This gives Arabidopsis lyrata its own heading, and will be important since there are another 7 species from the Arabidopsis genus and its close relatives in the JGI sequencing pipeline.
  • Added an entry on v2 of the date palm genome generated by Weill Cornell Medical College in Qatar. This definitely should still be considered an “in progress” genome, but at least until the banana genome comes out it’s the best non-grass monocot genome available.
  • Added an entry on the genome of Physcomitrella patens, which, as a moss, is the descendant of an evolutionary lineage that split from all the other genomes I’ve listed on the page around 450 million years ago.
  • Added the recently announced sunflower genome project to the list of planned, in-progress, and private genome efforts. (Apparently the genome of the cultivated sunflower is more than 3 gigabases. Bigger than corn!) That’ll be a cool genome to see when it comes out.
  • Added information on the woodland strawberry genome project, which aims to have an assembled genome of Fragaria vesca by sometime this year. You may remember the woodland strawberry genome from the mix up back in January.
  • Added the various groups that have announced they have private genome sequences of the oil palm genome to the same section.

Particular thanks to Greg, Jeff, and Eric, whose suggestions were behind most of these additions.

Completely unrelated, you may have noticed I switched the RSS feed back to full length entries. I recently tried out Google Reader (I’m way behind the times I know), and it is SO MUCH nicer to see the full entries there than have to click through from a brief summary. The downside, as I know from previous experiences, is that when I send out the full entries by RSS I get a lot less traffic.

I don’t earn any income from traffic to this site, but it is a nice feeling to know people are reading and enjoying something I wrote, and I have no way of tracking how many people (if any) read an article from the RSS feed.

Anyway I’ll keep sending out full entries for at least the next week (I expect I’ll be too busy to worry about ego stroking traffic statistics until at least a week from Tuesday.)

Sequenced Plant Genomes

Libe Slope in Ithaca, NY. Behind you are student dorms. At the top of the hill, campus starts. Photo: foreverdigital, flickr

When I was an undergraduate, there were exactly two sequenced plant genomes, rice and arabidopsis. And sure, maybe I didn’t have to walk “ten miles to school, barefoot, in the snow, uphill, both ways”*, but the one way I did have to walk uphill (sometimes in the snow, but always with shoes) was very uphill. But where was I?

Oh yeah, plant genome sequences. Kids getting into plant genomics these days don’t realize how easy they’ve got it. By my count (which may be low but I’m getting to that) there are ten published plant genomes, with several more unpublished genomes that are available in various states of completion, and lots more on the way.

Which brings me to what I was doing yesterday instead of writing an update for this website: trying to document the published plant genomes, the unpublished genomes that are available, and which new genomes we can expect to see published in the near future.

Please, if you find mistakes or know of additional flowering plant genomes I should mention, let me know! jcs98 (@) jamesandthegiantcorn.com.

If you don’t work in biology, it might be interesting to see which plants have sequenced genomes and how they’re related to each other.

*An explanation of this phrase.


Who could have predicted maize geneticists would be so interested in maize genes? The entry I posted last night on Purple plant1 and Colored aleurone1 easily received more traffic in its first day on the site than any entry since the heady days of the maize genome release back in November (though it’s still got a long way to go before it catches long-term readership attractors like water chestnuts and the NIPGR tomatoes).

The relationships of the four grass species with sequenced genomes. The branches are NOT to scale with how long ago the species split apart. Green stars represent whole genome duplications. The most important one to notice is the one in the ancestry of maize/corn. That duplication means that every region in sorghum, rice, or brachypodium is equivalent to two different places in the maize genome, one descended from each of the two copies of the genome that existed after the duplication.

And this morning the dataset I drew that example from, 464 classical maize genes mapped onto the maize genome assembly plus syntenic orthologs in up to four grass species: sorghum, rice, brachypodium, and the other region of the maize genome created by the maize whole genome duplication (technically syntenic homeologs since we started in maize to begin with, but the principle is the same), went out to the maize genetics community (thank you MaizeGDB!).
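To make the two-for-one relationship concrete, here is a toy sketch in Python. The region names are invented placeholders rather than real coordinates, and the table layout is just an assumption for illustration, not how CoGe or MaizeGDB actually store this dataset.

```python
# Toy example: each outgroup region (here, sorghum) corresponds to two
# maize regions, one from each copy made by the whole genome duplication.
# All region names below are hypothetical placeholders.
sorghum_to_maize = {
    "sorghum_region_1": ("maize_region_1a", "maize_region_1b"),
    "sorghum_region_2": ("maize_region_2a", "maize_region_2b"),
}

def homeolog_of(maize_region):
    """Return the other maize region descended from the same ancestral region."""
    for first, second in sorghum_to_maize.values():
        if maize_region == first:
            return second
        if maize_region == second:
            return first
    return None  # region is not in the table

print(homeolog_of("maize_region_1a"))
```

Starting from one maize region, the lookup returns its duplicate partner. That is the homeolog relationship: two places in maize that both trace back to a single region in a grass without the duplication.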

A postdoc in our lab tells me more people have visited CoGe today than any day on record (and we hit that mark before noon!).

Anyway, thank you guys, it’s great to feel appreciated!

Two classical maize genes, synteny, and the mystery of the missing gene

Colored aleurone1 and Purple plant1 are both genes with long histories in maize research and are involved in the regulation of anthocyanin biosynthesis. The mutant version of purple plant1 does exactly what it sounds like. In the proper genetic background, it has plants producing anthocyanin (a purple plant pigment) everywhere, resulting in purple plants. The mutant form of colored aleurone1 was identified from a mutant that changed the color of individual corn kernels. Guess which of these two classic maize mutants made it into the top 15 most published-on genes in maize, and which fell barely short.

Ears segregating for the colored aleurone mutant phenotype. Image courtesy of MG Neuffer via MaizeGDB.

Purple plant1's phenotype is highly variable depending on the genetic background the mutant is in. Images courtesy of MG Neuffer via MaizeGDB.

The two genes are also duplicates (homeologs) resulting from the maize whole genome duplication. From the picture below you can also see that both genes, and the regions they are in, match up to single regions in rice and sorghum, two grasses that haven’t gone through a whole genome duplication since the great radiation of grass species that took place an estimated 50 million years ago (well after dinosaurs stopped walking the earth).

Ion Torrent Sequencing

I know absolutely nothing about their technology (they’ve been playing things much closer to their chests than Pacific Biosciences), but they just announced they’d start delivering their machines by the end of this year and that they’ll reveal the principles of their new technology in a talk on Saturday.

Marco Island, Florida (where the Advances in Genome Biology and Technology conference is being held) is certainly the place to be this week.

Greg Baute’s optimistic predictions about the year 2010 in sequencing may prove more accurate than my own pessimism yet.

Map of the Places That Get the First PacBio Sequencers

In all honesty, I don’t know how big a difference Pacific Biosciences’ technology* will make to genomics. I doubt anyone can until the machines are actually in use by sequencing centers and people can start to make judgements about how they behave under real-life conditions. How much sequence can actually be produced per day or per dollar? How long will the reads actually get? What sort of sequencing errors are most common with the technology and how common are they?

But now I know the people who will be the first to find out the answers to these questions. Today (yesterday by the time this is scheduled to publish) Pacific Biosciences announced where the first ten of their new sequencing machines will be going:

[Embedded map: Pacific Biosciences sequencers. Markers on the original map name the institutions receiving them.]