James and the Giant Corn Genetics: Studying the Source Code of Nature

December 27, 2010

Views on Science in High Places

Filed under: biofortified — James @ 4:17 pm

“[Mr. X] told the assembled groups that science itself is subjective, and that he could have three different groups bring him three different supposedly scientific opinions.”

Any guesses on the identity of Mr. X? Could he be a creationist arguing for the inclusion of intelligent design alongside science in the classroom? A new-age radical arguing that alternative medicines are just as scientifically effective as … well, medicine? Maybe the most likely bet would be a sceptic of global warming; they’ve been in the press a lot lately, what with temperatures falling across the northern hemisphere (it’s apparently winter, you see).

Unfortunately, the person in question is (according to an article posted in the Wall Street Journal) US Secretary of Agriculture Tom Vilsack.

I Am So Excited About P.A.G.

Filed under: Uncategorized — James @ 7:22 am

From a behind-the-scenes explanation of how the strawberry genome (and the strawberry genome paper, which is quite a different kind of project) came to be:

Dr. Tom Davis from University of New Hampshire submitted the letter of intent for the January 13, 2006 deadline, days before the Plant Animal Genome meeting in San Diego. At the Rosaceae Executive Committee annual meeting on Jan. 15 Tom announced that he had submitted the letter.  … Tom was convinced to withdraw his strawberry sequencing proposal, differing to the eventual DOE-JGI support of peach.

At PAG in San Diego, 2008, Vladimir Shulaev and Richard Veilleux from Virginia Tech attended the Rosaceae Executive Committee meeting.   Vladimir announced that Virginia Tech had thrown support behind the idea of genome sequencing- both some financial support as well as technical and facility support.  This was the seed that was needed.  The discussion had a pure “pass-the-hat” flavor to it.  Nobody had big funds, but Vlad and Richard had a big idea.  That proved to eliminate the first major barrier to a complete sequence.

Luckily some enthusiasm was found at PAG 2010.  Vladimir presented the genome work at a major symposium.  It looked awfully sweet on the big screen and many of us felt a sense of prime time.  It was energizing. Todd Mockler, Todd Michael and Tom Davis met.  Later that night we had a strategy-n-pizza meeting with all of those present in the consortium.

The whole thing is a fascinating read, especially if you’ve never had a chance to hear about the politics and organization that go into sequencing a plant genome. But the point I’m making with the quotes I’ve pulled from the article is that P.A.G. seems to be the meeting for people from across the community to come together to plan and work on these large collaborations.

This year I’m mostly going to be there to take in the sights and find out what approaches other people are using, since there aren’t a lot of other labs at my home institution that carry out related kinds of research. But it’s exciting to think that someday I might get to be involved in the back-door wheeling and dealing that makes large collaborative science work!

Also, reading how the strawberry genome took 10 months from its first rejection in February to its final publication yesterday helps put my own self-pity about a (not nearly as data-rich) manuscript I’ve been trying to get published since June into perspective. 😉

December 26, 2010

Woodland Strawberry Genome Published (For Real This Time)

Filed under: biofortified,Uncategorized — James @ 10:00 am

Hi all, hope you’re enjoying the holiday break. I’m back with news of a new plant genome publication!

Today’s plant is the woodland strawberry (Fragaria vesca). These aren’t the strawberries you probably see at your local grocery store; those are garden strawberries (Fragaria x ananassa). Woodland strawberries were the predominant strawberries grown throughout Europe until around 250 years ago, when they were displaced by the new garden strawberries. Those were created when a strawberry species brought from North America crossed with another species from Chile while the two were grown next to each other in France. The new hybrid species bore larger fruit than the woodland strawberry.

Wild strawberry (left) and domesticated strawberry (right). I'm not sure which species these are; that's the downside of having to hunt down public domain photos.

Sequencing the genome of the garden strawberry directly would be a real mess, as the genome of that species is made up of four closely related genome-copies*. With modern DNA-sequencing technology, generating the raw sequence data that makes up a genome is, relatively speaking, cheap and easy, but afterwards you are left with a lot of small pieces of DNA sequence, and putting those pieces together (like putting together a puzzle with millions of pieces) remains challenging. Mix in pieces from four closely related puzzles with no way to tell them apart and the project becomes even more challenging.
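To make the puzzle analogy a little more concrete, here is a toy sketch of greedy overlap assembly in Python. It is only an illustration of the general idea: real assemblers are far more sophisticated, and nothing here is taken from the strawberry paper. The function names and the three short reads are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences that share the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases and
    returns whatever pieces are left.
    """
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return seqs
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)


# Hypothetical reads covering one short stretch of sequence.
reads = ["ATGGCGTAC", "CGTACGTTA", "CGTTAGC"]
pieces = greedy_assemble(reads)
print(pieces)
```

With reads from several nearly identical genome copies, many different pairs would share equally good overlaps, and a greedy merge like this one would have no way of knowing which copy a read came from.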

Fortunately the woodland strawberry side-steps that problem, being a normal diploid plant without any of the whole genome duplications that would make sequencing garden strawberries such a terrible mess. It also has a pleasingly small genome: 206 million base pairs spread over seven chromosomes, making it only slightly larger than the genome of the first plant to be sequenced (Arabidopsis, with 157 million base pairs and five chromosomes). Small genomes are easier to put together, with fewer total pieces to go around.

The research consortium that sequenced and assembled the strawberry genome first assembled overlapping pieces of sequenced DNA into larger pieces called contigs and then used genetic map data to line those contigs up into seven pseudomolecules, each of which represents a whole strawberry chromosome. The strawberry genome itself wasn’t released prior to the publication of the paper, so I haven’t had a chance to look at it myself, but both the fact that they’ve been able to assemble all the way to the chromosome level and the fact that they developed and used genetic map data argue for a well-done assembly.
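As a rough sketch of what "lining contigs up with genetic map data" means, the Python below groups contigs by linkage group and sorts them by map position. The contig names, linkage groups, and centimorgan positions are invented for illustration, and I'm assuming each contig carries a single mapped marker, which is a big simplification of what the consortium actually had to do.

```python
from collections import defaultdict

# Hypothetical anchoring data: contig name -> (linkage group, map position in cM).
anchors = {
    "contig_A": (1, 42.0),
    "contig_B": (1, 3.5),
    "contig_C": (2, 10.1),
    "contig_D": (1, 17.8),
}


def build_pseudomolecules(anchors):
    """Return {linkage group: [contigs ordered by genetic map position]}."""
    groups = defaultdict(list)
    for contig, (linkage_group, position_cm) in anchors.items():
        groups[linkage_group].append((position_cm, contig))
    return {
        linkage_group: [contig for _, contig in sorted(members)]
        for linkage_group, members in sorted(groups.items())
    }


pseudomolecules = build_pseudomolecules(anchors)
for linkage_group, contigs in pseudomolecules.items():
    print(linkage_group, contigs)
```

Contigs with no mapped marker would simply be left out of a scheme like this, which is one reason a dense genetic map matters for a chromosome-level assembly.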

Speaking of assembly, here are all the vital genome stats that I normally would have to hunt around for after reading a “new genome sequenced!” story in the popular press (some of these I’ve already mentioned above):

  • Strawberries have a haploid number of 7, and a genome size of 206 Mb
  • The average base pair in the strawberry genome was sequenced 39 times using second generation technology (a label that includes Illumina, 454, and SOLiD sequencers; in this case a mixture of all three technologies was employed)
  • 34,809 predicted genes were identified across the strawberry genome.
  • The authors found no evidence of the whole genome duplications found in other rosids (I’m assuming this means the most recent whole genome duplication in the ancestors of strawberries was the pre-rosid hexaploidy.)
  • The paper describing the genome will shortly be available from Nature Genetics. The title is “The genome of the woodland strawberry (Fragaria vesca)” and the last name of the first author is Shulaev. UPDATE: Here’s the link to the genome paper.
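For a sense of scale, the numbers in the list above can be combined with some back-of-the-envelope arithmetic. The Python below is just that arithmetic, using the 206 million base pair genome size, 39-fold average coverage, and 34,809 predicted genes quoted in this post; the derived figures are my own rough estimates and do not come from the paper.

```python
GENOME_SIZE_BP = 206_000_000  # genome size quoted in this post
MEAN_COVERAGE = 39            # average times each base pair was sequenced
PREDICTED_GENES = 34_809      # predicted gene count quoted in this post

# Rough total of raw sequence needed to reach that coverage.
total_sequenced_bp = GENOME_SIZE_BP * MEAN_COVERAGE

# Average gene density across the genome.
genes_per_mb = PREDICTED_GENES / (GENOME_SIZE_BP / 1_000_000)

print(f"Raw sequence: about {total_sequenced_bp / 1e9:.1f} billion base pairs")
print(f"Gene density: about {genes_per_mb:.0f} genes per million base pairs")
```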

Strawberry genome browser.

Aside from the enjoyment I always feel when a new genome goes live, I’m particularly happy to see the strawberry genome come out for two reasons.

The first is that there was no “strawberry genome” grant. Funding for sequencing the genome came from a number of sources. I take this as a sign that in addition to the rapidly declining cost of sequencing itself, the cost and difficulty of assembling and annotating the genome of a new plant species are also continuing to decline at a rapid pace.

The second reason is that I once before announced the sequencing of the strawberry genome on this site. It was almost a year ago, after a reporter misunderstood a presentation at PAG and posted a “new genome sequenced” story online that was rapidly picked up by a number of websites including my own. It was a bad break for the folks working hard to sequence, assemble, and annotate the real strawberry genome, and I’m very glad to see them get the moment in the spotlight they so richly deserve. The people who sequence genomes make the work of so many other researchers, including myself, possible.

Links to other coverage (updated as I find them):

*Either the result of duplicated copies of a single genome that have since evolved independently (autotetraploidy), or hybrids that merged the genomes of closely related strawberry species together in a single plant (allotetraploidy).

December 13, 2010

How weird are grasses?

Filed under: Uncategorized — James @ 9:28 am

Weird enough that suggestions like this make it into the scientific literature:

Given the large number of new genes with different GC structures in grasses, perhaps the lineage was initiated by a wide hybridization event with another species that had genes with a high GC content, followed by selective gene retention and loss to create today’s Poaceae. The wide hybridization, while most likely to have involved a plant species, could have been prokaryotic or algal and, a prokaryotic origin could explain the higher proportion of intronless Poaceae-specific genes (1).

The above text comes from a paper published just two years ago (one year before the publication of the maize genome) describing the sequence of thousands of maize genes. It’s a wild claim, the part about hybridizing with a prokaryote (bacteria) or an alga rather than the wide cross itself, and one I personally don’t think is very likely. But the reason people can make it with a straight face is that grasses are so different from everything that came before them, and so successful.

Grasses look and act almost nothing like the species they’re related to: the grass family belongs to the same order (the next organizational step up) as pineapples. They’re flowering plants but have either discarded things as basic to flowers as petals or modified them into something completely unrecognizable (the petals and sepals of other flowering plants might have become the lemmas, paleas, and/or lodicules of grasses). Grasses have also reversed course from most flowering plants that depend on birds or insects to carry pollen from one plant to another and gone back to good old wind pollination. Yet, without the necessity of appealing to different pollinators to drive speciation, the grass family still encompasses more than ten thousand species. And in the 50 million years or so since the major grass families split from each other, they have reshaped a whole quarter of the earth’s surface (2), creating whole new ecosystems (grasslands) that never existed before.

  1. Nickolai N Alexandrov et al., “Insights into corn genes derived from large-scale cDNA sequencing,” Plant Molecular Biology 69, no. 1 (January 2009): 179-194.
  2. H. L. Shantz, “The Place of Grasslands in the Earth’s Cover,” Ecology 35, no. 2 (April 1954): 143-145.

December 5, 2010

Gene annotation: Man vs. The Machine (Updating the classical maize gene list)

Filed under: Uncategorized — James @ 6:36 pm

I don’t have it up on the website yet, but you can download the latest version of the maize classical gene list as either an OpenOffice spreadsheet, an Excel file, or a comma-separated file. If you have any questions about the methodology behind the list, need information from the list in some other format, or can think of other maize genes which should be included on the list, don’t hesitate to contact me.

Late last week the people at maizesequence.org (the folks currently responsible for updating the official gene models of corn) published their own list that uses the same term, “classical maize genes.” Despite using the same name, their list was produced using a very different methodology and has no association with the list produced by my lab. (Update: their list is now called the “named maize gene list”.) Here are the key differences between our two resources:

  • Automated gene assignments/No human proofing
  • No restrictions on what it takes to be a classical maize gene. (Our list requires at least 3 citations, an interesting mutant phenotype, or someone in the community caring enough to suggest the gene)

The publication of these two lists gives us the opportunity to check how human-proofed annotations stand up against well-designed algorithms.

Of the 468 genes included on my new list (which reflects both the addition of new genes at the request of members of the maize community, and the removal of genes with poor sequence evidence), 152 were also identified by maizesequence.org. In 140 cases we identified the same gene models. I rechecked the remaining 12 cases. In nine cases after a more thorough check I remain convinced I made the right assignment. In two others maizesequence.org’s assignment was clearly the correct one and I’ve updated my own list accordingly. (In the last case we may both be wrong but I can’t be certain).

So how does the man (me) compare to the machine (the computers at Cold Spring Harbor that generated maizesequence.org’s list)?

  • Accuracy: slight advantage man. Out of 152 overlapping genes I made clear mistakes in 3 (2% error rate). The algorithm used by the folks at Cold Spring Harbor was clearly mistaken in 10 of 152 cases (6.6% error rate). Yes, I’m counting the gene where we both seem to have got it wrong.
  • Speed: huge advantage machine. The new set of gene models was released the day before Thanksgiving. It’s taken me until tonight (11 days) to complete all the updates to my list.
  • Comprehensiveness: advantage man. While the complete maizesequence.org list contains >500 genes, this was accomplished by including genes like IDP1611 which is known solely for having an insertion/deletion polymorphism, while skipping genes like p1 (a gene that helps control the synthesis of purple pigment in corn plants) and shrunken1 (a gene involved in making sweet corn sweet). Not really a relevant comparison given the name change.
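For anyone who wants to check my arithmetic, here is a short Python sketch that recomputes the error rates from the counts given above: 152 overlapping genes, 140 agreements, and 12 disagreements split into nine where I stand by my call, two where maizesequence.org was right, and one where we may both be wrong. The variable names are mine, and counting the shared uncertain case against both sides is the same choice I made in the list.

```python
def error_rate(mistakes, total):
    """Percentage of gene assignments that were wrong."""
    return 100 * mistakes / total


OVERLAPPING_GENES = 152
AGREEMENTS = 140
HUMAN_RIGHT = 9     # disagreements where my assignment held up on recheck
MACHINE_RIGHT = 2   # disagreements where the automated assignment was correct
BOTH_WRONG = 1      # the case where we may both be wrong

# Sanity check: the categories should account for every overlapping gene.
assert AGREEMENTS + HUMAN_RIGHT + MACHINE_RIGHT + BOTH_WRONG == OVERLAPPING_GENES

human_mistakes = MACHINE_RIGHT + BOTH_WRONG
machine_mistakes = HUMAN_RIGHT + BOTH_WRONG

print(f"Man:     {human_mistakes} of {OVERLAPPING_GENES} "
      f"({error_rate(human_mistakes, OVERLAPPING_GENES):.1f}%)")
print(f"Machine: {machine_mistakes} of {OVERLAPPING_GENES} "
      f"({error_rate(machine_mistakes, OVERLAPPING_GENES):.1f}%)")
```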

Obviously there are advantages to both approaches. I don’t want to think about the number of hours I’ve put into the maize gene list over the past year (usually in my “free time” 😉). So even though I was able to shave a few percentage points off the error rate of a computer, in most situations (absent a grad student who grew up hearing stories of famous maize geneticists his whole life, plus a PI who was understanding about the grad student undertaking side projects in free time that could have been devoted to thesis research), I’d certainly be comfortable trusting the results of a well-designed computational algorithm for this sort of work.

Previous posts about the classical genes of maize genetics:

How many maize genes have actually been studied?

The most studied genes of maize genetics and why we love kernel phenotypes.

Two classical maize genes and the mystery of the missing gene.
