James and the Giant Corn Genetics: Studying the Source Code of Nature

December 27, 2010

Views on Science in High Places

Filed under: biofortified — James @ 4:17 pm

“[Mr. X] told the assembled groups that science itself is subjective, and that he could have three different groups bring him three different supposedly scientific opinions.”

Any guesses on the identity of Mr. X? Could he be a creationist arguing for the inclusion of intelligent design alongside science in the classroom? A new-age radical arguing that alternative medicines are just as scientifically effective as … well, medicine? Maybe the most likely bet would be a sceptic of global warming. They’ve been in the press a lot lately, what with temperatures falling across the northern hemisphere (it’s apparently winter, you see).

Unfortunately, the person in question is (according to an article posted in the Wall Street Journal) US Secretary of Agriculture Tom Vilsack. (more…)

I Am So Excited About P.A.G.

Filed under: Uncategorized — James @ 7:22 am

From a behind-the-scenes explanation of how the strawberry genome (and the strawberry genome paper, which is quite a different kind of project) came to be:

Dr. Tom Davis from University of New Hampshire submitted the letter of intent for the January 13, 2006 deadline, days before the Plant Animal Genome meeting in San Diego. At the Rosaceae Executive Committee annual meeting on Jan. 15 Tom announced that he had submitted the letter.  … Tom was convinced to withdraw his strawberry sequencing proposal, differing to the eventual DOE-JGI support of peach.

At PAG in San Diego, 2008, Vladimir Shulaev and Richard Veilleux from Virginia Tech attended the Rosaceae Executive Committee meeting.   Vladimir announced that Virginia Tech had thrown support behind the idea of genome sequencing- both some financial support as well as technical and facility support.  This was the seed that was needed.  The discussion had a pure “pass-the-hat” flavor to it.  Nobody had big funds, but Vlad and Richard had a big idea.  That proved to eliminate the first major barrier to a complete sequence.

Luckily some enthusiasm was found at PAG 2010.  Vladimir presented the genome work at a major symposium.  It looked awfully sweet on the big screen and many of us felt a sense of prime time.  It was energizing. Todd Mockler, Todd Michael and Tom Davis met.  Later that night we had a strategy-n-pizza meeting with all of those present in the consortium.

The whole thing is a fascinating read, especially if you’ve never had a chance to hear about the politics and organization that go into sequencing a plant genome. But the point I’m making with the quotes I’ve pulled from the article is that P.A.G. seems to be the meeting for people from across the community to come together to plan and work on these large collaborations.

This year I’m mostly going to be there to take in the sights and find out what approaches other people are using, since there aren’t a lot of other labs at my home institution that carry out related kinds of research. But it’s exciting to think that someday I might get to be involved in the backroom wheeling and dealing that makes large collaborative science work!

Also, reading how the strawberry genome took 10 months from its first rejection in February to its final publication yesterday helps put my own self-pity about a (not nearly as data-rich) manuscript I’ve been trying to get published since June into perspective. 😉

December 26, 2010

Woodland Strawberry Genome Published (For Real This Time)

Filed under: biofortified,Uncategorized — James @ 10:00 am

Hi all, hope you’re enjoying the holiday break. I’m back with news of a new plant genome publication!

Today’s plant is the woodland strawberry (Fragaria vesca). These aren’t the strawberries you probably see at your local grocery store; those are garden strawberries (Fragaria x ananassa). Woodland strawberries were the predominant strawberries grown throughout Europe until around 250 years ago, when they were displaced by the new garden strawberries, created when a strawberry species brought from North America crossed with another species from Chile while the two were grown next to each other in France. The new hybrid species bore larger fruit than the woodland strawberry.

Wild strawberry (left) and domesticated strawberry (right). I'm not sure which species these are; that's the downside of having to hunt down public domain photos.

Sequencing the genome of the garden strawberry directly would be a real mess, as the genome of that species is made up of four closely related genome-copies*. With modern DNA-sequencing technology, generating the raw sequence data that makes up a genome is (relatively) cheap and easy, but afterwards you are left with a lot of small pieces of DNA sequence, and putting those pieces together (like putting together a puzzle with millions of pieces) remains challenging. Mix pieces from four closely related puzzles together with no way to tell them apart and the project becomes even more challenging.

Fortunately the woodland strawberry side-steps that problem, being a normal diploid plant without any of the whole genome duplications that would make sequencing garden strawberries such a terrible mess. It also has a pleasingly small genome of 206 million base pairs spread over seven chromosomes, making it only slightly larger than the genome of the first plant to be sequenced (Arabidopsis, with 157 million base pairs and five chromosomes). Small genomes are easier to put together, with fewer total pieces to go around.

The research consortium that sequenced and assembled the strawberry genome first assembled overlapping pieces of sequenced DNA into larger pieces called contigs and then used genetic map data to line those contigs up into seven pseudomolecules, each of which represents a whole strawberry chromosome. The strawberry genome itself wasn’t released prior to the publication of the paper, so I haven’t had a chance to look at it myself, but both the fact that they’ve been able to assemble all the way to the chromosome level and the fact that they developed and used genetic map data argue for a well-done assembly.
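If you'd like to see the "overlapping pieces into contigs" idea in code, here is a toy sketch in Python. It is a simple greedy overlap merger, not the consortium's actual assembly software (which this post doesn't describe), and the function names, the example reads, and the minimum overlap of 3 bases are all made up for illustration. Real reads also contain errors and repeats, which this sketch ignores.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Returns 0 when the best match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases;
    whatever is left is the list of contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_n:
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads: the first three overlap end to end, the fourth does not.
example_reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "GGGAAAC"]
example_contigs = greedy_assemble(example_reads)
print(example_contigs)
```

The three overlapping reads collapse into a single contig while the unrelated read stays on its own, which is the small-scale version of ending up with many contigs that then need genetic map data to be ordered along chromosomes.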

Speaking of assembly, here are all the vital genome stats that I normally would have to hunt around for after reading a “new genome sequenced!” story in the popular press (some of these I’ve already mentioned above):

  • Strawberries have a haploid number of 7, and a genome size of 206 Mb
  • The average base pair in the strawberry genome was sequenced 39 times using second generation technology (a label that includes Illumina, 454, and SOLiD sequencers; in this case a mixture of all three technologies was employed)
  • 34,809 predicted genes were identified across the strawberry genome.
  • The authors found no evidence of the whole genome duplications found in other rosids (I’m assuming this means the most recent whole genome duplication in the ancestors of strawberries was the pre-rosid hexaploidy.)
  • The paper describing the genome will shortly be available from Nature Genetics. The title is “The genome of the woodland strawberry (Fragaria vesca)” and the last name of the first author is Shulaev. UPDATE: Here’s the link to the genome paper.
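To put the coverage figure in the list above in perspective, here is a small back-of-the-envelope sketch in Python. It only multiplies the two numbers quoted in this post (genome size and average coverage); the variable and function names are my own, and the result is a rough estimate of raw sequence, not a figure taken from the paper.

```python
GENOME_SIZE_BP = 206_000_000  # genome size quoted above (206 million base pairs)
MEAN_COVERAGE = 39            # average number of times each base pair was sequenced


def total_sequenced_bases(genome_size_bp: int, coverage: int) -> int:
    """Approximate raw bases sequenced: coverage = total bases / genome size."""
    return genome_size_bp * coverage


total_bases = total_sequenced_bases(GENOME_SIZE_BP, MEAN_COVERAGE)
print(f"Roughly {total_bases / 1e9:.1f} billion bases of raw sequence")
```

In other words, assembling a 206 million base pair genome at this depth means sorting through billions of bases of raw reads.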

Strawberry genome browser.

Aside from the enjoyment I always feel when a new genome goes live, I’m particularly happy to see the strawberry genome come out for two reasons.

The first is that there was no “strawberry genome” grant. Funding for sequencing the genome came from a number of sources. I take this as a sign that, in addition to the rapidly declining cost of sequencing itself, the cost and difficulty of assembling and annotating the genome of a new plant species are also continuing to decline at a rapid pace.

The second reason is that I once before announced the sequencing of the strawberry genome on this site. It was almost a year ago, after a reporter misunderstood a presentation at PAG and posted a “new genome sequenced” story online that was rapidly picked up by a number of websites, including my own. It was a bad break for the folks working hard to sequence, assemble, and annotate the real strawberry genome, and I’m very glad to see them get the moment in the spotlight they so richly deserve. The people who sequence genomes make the work of so many other researchers, including myself, possible.

Links to other coverage (updated as I find them):

*Either the result of duplicated copies of a single genome that have since evolved independently (autopolyploidy), or hybrids that merged the genomes of closely related strawberry species together in a single plant (allopolyploidy).

December 13, 2010

How weird are grasses?

Filed under: Uncategorized — James @ 9:28 am

Weird enough that suggestions like this make it into the scientific literature:

Given the large number of new genes with different GC structures in grasses, perhaps the lineage was initiated by a wide hybridization event with another species that had genes with a high GC content, followed by selective gene retention and loss to create today’s Poaceae. The wide hybridization, while most likely to have involved a plant species, could have been prokaryotic or algal and, a prokaryotic origin could explain the higher proportion of intronless Poaceae-specific genes (1).

The above text comes from a paper published just two years ago (one year before the publication of the maize genome) describing the sequence of thousands of maize genes. It’s a wild claim, at least the part about hybridizing with a prokaryote (a bacterium) or an alga rather than the wide cross itself, and one I personally don’t think is very likely. But the reason people can make it with a straight face is that grasses are so different from everything that came before them, and so successful.

Grasses look and act almost nothing like the species they’re related to: the grass family belongs to the same order (the next organizational step up) as pineapples. They’re flowering plants but have either discarded things as basic to flowers as petals or modified them into something completely unrecognizable (the petals and sepals of other flowering plants might have become the lemmas, paleas, and/or lodicules of grasses). Grasses have also reversed course from most flowering plants that depend on birds or insects to carry pollen from one plant to another and gone back to good old wind pollination. Yet, without the necessity of appealing to different pollinators to drive speciation, the grass family still encompasses more than ten thousand species. And in the 50 million years or so since the major grass families split from each other, they have reshaped a whole quarter of the earth’s surface (2), creating whole new ecosystems (grasslands) that never existed before.

  1. Nickolai N Alexandrov et al., “Insights into corn genes derived from large-scale cDNA sequencing,” Plant Molecular Biology 69, no. 1 (January 2009): 179-194.
  2. H. L. Shantz, “The Place of Grasslands in the Earth’s Cover,” Ecology 35, no. 2 (April 1954): 143-145.

December 5, 2010

Gene annotation: Man vs. The Machine (Updating the classical maize gene list)

Filed under: Uncategorized — James @ 6:36 pm

I don’t have it up on the website yet, but you can download the latest version of the maize classical gene list as an OpenOffice spreadsheet, an Excel file, or a comma-separated file. If you have any questions about the methodology behind the list, need information from the list in some other format, or can think of other maize genes which should be included on the list, don’t hesitate to contact me.

Late last week the people at maizesequence.org (the folks currently responsible for updating the official gene models of corn) published their own list that used the same term, “classical maize genes” (it is now called the “named maize gene list”). Despite originally using the same name, their list was produced using a very different methodology and has no association with the list produced by my lab. Here are the key differences between our two resources:

  • Automated gene assignments/No human proofing
  • No restrictions on what it takes to be a classical maize gene. (Our list requires at least 3 citations, an interesting mutant phenotype, or someone in the community caring enough to suggest the gene)

The publication of these two lists gives us the opportunity to check how human-proofed annotations stand up against well-designed algorithms.

Of the 468 genes included on my new list (which reflects both the addition of new genes at the request of members of the maize community, and the removal of genes with poor sequence evidence), 152 were also identified by maizesequence.org. In 140 cases we identified the same gene models. I rechecked the remaining 12 cases. In nine cases after a more thorough check I remain convinced I made the right assignment. In two others maizesequence.org’s assignment was clearly the correct one and I’ve updated my own list accordingly. (In the last case we may both be wrong but I can’t be certain).

So how does the man (me) compare to the machine (the computers at Cold Spring Harbor that generated maizesequence.org’s list)?

  • Accuracy: slight advantage man. Out of 152 overlapping genes I made clear mistakes in 3 (2% error rate). The algorithm used by the folks at Cold Spring Harbor was clearly mistaken in 10 of 152 cases (6.6% error rate). Yes, I’m counting the gene where we both seem to have got it wrong.
  • Speed: huge advantage machine. The new set of gene models was released the day before Thanksgiving. It’s taken me until tonight (11 days) to complete all the updates to my list.
  • Comprehensiveness: advantage man. While the complete maizesequence.org list contains >500 genes, this was accomplished by including genes like IDP1611 which is known solely for having an insertion/deletion polymorphism, while skipping genes like p1 (a gene that helps control the synthesis of purple pigment in corn plants) and shrunken1 (a gene involved in making sweet corn sweet). Not really a relevant comparison given the name change.
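For anyone who wants to check the accuracy arithmetic, here is a minimal Python sketch that recomputes the error rates from the counts given in this post. The variable names are my own, and it assumes (as the comparison above implicitly does) that the 140 cases where both lists agree are correct, and that the one case where both may be wrong counts against both sides.

```python
OVERLAPPING_GENES = 152  # genes present on both lists
SAME_GENE_MODEL = 140    # cases where both lists chose the same gene model
HUMAN_CORRECT = 9        # disagreements where the manual assignment held up
MACHINE_CORRECT = 2      # disagreements where maizesequence.org was right
BOTH_WRONG = 1           # the case where both may be wrong

disagreements = OVERLAPPING_GENES - SAME_GENE_MODEL

# Each side is charged with the cases the other got right plus the shared miss.
human_errors = MACHINE_CORRECT + BOTH_WRONG
machine_errors = HUMAN_CORRECT + BOTH_WRONG


def error_rate(errors: int, total: int) -> float:
    """Error rate as a percentage of the overlapping genes."""
    return 100 * errors / total


print(f"Disagreements rechecked: {disagreements}")
print(f"Human: {human_errors} errors, {error_rate(human_errors, OVERLAPPING_GENES):.1f}%")
print(f"Machine: {machine_errors} errors, {error_rate(machine_errors, OVERLAPPING_GENES):.1f}%")
```

Keep in mind this only measures accuracy on genes that appear on both lists, so it says nothing about genes one list includes and the other skips.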

Obviously there are advantages to both approaches. I don’t want to think about the number of hours I’ve put into the maize gene list over the past year (usually in my “free time” 😉). So even though I was able to shave a few percentage points off the error rate of a computer, in most situations (absent a grad student who grew up hearing stories of famous maize geneticists his whole life plus a PI who was understanding about the grad student undertaking side projects in free time that could have been devoted to thesis research), I’d certainly be comfortable trusting the results of a well-designed computational algorithm for this sort of work.

Previous posts about the classical genes of maize genetics:

How many maize genes have actually been studied?

The most studied genes of maize genetics and why we love kernel phenotypes.

Two classical maize genes and the mystery of the missing gene.

November 30, 2010

Recruiting an Undergrad

Filed under: Uncategorized — James @ 5:08 pm

If you are an undergraduate at UC-Berkeley interested in gaining research experience or know someone who is, check out our lab entry at the Undergraduate Research Apprenticeship Program (URAP). The project description is a huge mouthful, and personally I think it sounds much more intimidating than it needs to be. So below is my argument why, if you’re interested in a research career or graduate school, you should consider working in my lab, even if you aren’t specifically interested in genome evolution or computational biology. (more…)

November 28, 2010

Crops of the Americas

Filed under: Uncategorized — James @ 9:38 am

But the Native Americans weren’t without any useful proto-domesticated plants. Tobacco and chocolate are two that come to mind.

Wait…. what? Take a minute and think about that statement. When you think about crops that were a lot more than proto-domesticated long before Europeans made it over to the western hemisphere, what should be the first thing that comes to mind?

It’s in the URL of my website! Corn. (more…)

November 27, 2010

Hybrid vigor and missing genes

Filed under: Genetics,genomics — James @ 5:51 pm

Thinking about defining the number of genes present in the maize genome reminded me of an old* story about the trouble of defining what truly represents a gene and how really awesome ideas can sometimes come years before the data needed to support them.

The year is 2002. The first complete version of the human genome is still a year away. The genomes of two plant species have already been published (rice and Arabidopsis), but in terms of sheer genome size, both species are a drop in the bucket compared to the human genome, or other plant genomes like corn or wheat. But none of this is particularly important except to set the stage.

Two researchers at Rutgers University were sequencing a tiny piece of the maize genome (~0.01%) that surrounded a single gene called bronze1 (the fifth most studied gene in maize) when they found something unexpected.

They had previously identified 10 genes in a single 32-kb stretch of the maize genome. (A similar gene density throughout the remainder of the maize genome would have resulted in a maize genome containing more than 700,000 genes!) However, it was already known that the maize genome was split between small gene-rich islands and vast desolate expanses of transposons (referred to as transposon nests**), and in fact the same study identified a couple of these nests of transposons on either side of their gene-rich island (see part A of the second picture in this post).
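The "more than 700,000 genes" figure is a simple extrapolation, and a quick Python sketch shows where it comes from. The gene count and region length are the ones quoted above; the total maize genome size of roughly 2.3 billion base pairs is my assumption for this sketch and is not stated in this post, and the names are made up for illustration.

```python
GENES_IN_REGION = 10     # genes found in the sequenced stretch
REGION_LENGTH_BP = 32_000  # the 32-kb stretch around bronze1

# Assumption for illustration only: a maize genome of about 2.3 billion bp.
ASSUMED_MAIZE_GENOME_BP = 2_300_000_000


def extrapolate_gene_count(genes: int, region_bp: int, genome_bp: int) -> float:
    """Scale a local gene density up to a whole genome of `genome_bp` bases."""
    return genes * genome_bp / region_bp


estimate = extrapolate_gene_count(
    GENES_IN_REGION, REGION_LENGTH_BP, ASSUMED_MAIZE_GENOME_BP
)
print(f"Extrapolated gene count: {estimate:,.0f}")
```

The estimate is absurdly high precisely because the region is a gene-rich island; most of the genome is transposon nests with few or no genes.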

Below I'll use cartoons, but here's a real and to scale example of a gene rich island I picked at random from maize chromosome 3. Genes and intergenic spaces are to scale. Base image generated with GenomeViewer, part of the CoGe toolkit. http://www.genomevolution.org/CoGe/

Their initial sequencing used DNA from a breed of corn called McC, which I must admit I’ve only ever read about in this particular paper. However, when they decided to sequence the same region from the genome of B73***, they made three discoveries, which I’ve listed in increasing order of strangeness: (more…)

November 23, 2010

New maize gene models are out!

Filed under: Uncategorized — James @ 7:18 pm

These are the official annotations for version two of the maize genome assembly. For the official release I need to point you to a text file on an FTP site (sorry). You can download all the new genomic information here.

In version one the maize sequencing consortium released two separate sets of gene models: a list of 32,400 high confidence gene models known as the maize “filtered gene set”, and a list of over 110,000 gene models known as the maize “working gene set.” There were very good reasons to do this at the time, but it was a pain for a lot of people, since depending on what questions you studied you might use one set of annotations or the other.

In version two the maize genome folks have released a single set of 110,028 genes. Intimidated? So was I. Fortunately it’s not quite as intimidating a pile of data as it sounds. (more…)

November 9, 2010

Columbine and Foxtail Millet

Filed under: Uncategorized — James @ 4:36 am

The latest version of phytozome (v6) just came out and it includes two genomes people in my lab have been waiting on for a LONG time.

The first is Aquilegia coerulea or Blue Rock Columbine. From a comparative genomics standpoint columbine is important because it’s a representative of a lineage of eudicots that split off from the two biggest families in that group (the asterids and rosids) very early on. It’s also only the second plant species to be sequenced that’s known for its flowers (the first was Monkey Flower or Mimulus). Most of the ~300 megabase genome is contained within 156 large scaffolds. So not as big as whole chromosomes, but hopefully large enough for some useful syntenic analysis. The sequenced cultivar of Columbine is “Goldsmith.” You can read more details and download the genomic data here.

The second new genome is even more exciting as far as I’m concerned, but that’s because I’m a grass guy. Foxtail millet (Setaria italica) is an important crop in its own right and comes from a tribe of grasses called the Paniceae that was previously unrepresented by any sequenced genome. The Paniceae are relatives of the Andropogoneae, the tribe that contains both maize/corn and sorghum (as well as cool unsequenced species like sugarcane), and the addition of Setaria will do a lot to help balance out grass phylogenetic comparisons.* A second species of Setaria (Setaria viridis) is also being sequenced (likely using this species as a road map to assemble the genome). One of my old bosses is pushing Setaria viridis as a good genetic model to study the C4 photosynthesis that makes species in both the Paniceae and Andropogoneae so much more efficient at photosynthesis than species that still do plain old vanilla** C3 photosynthesis. Almost all of the Setaria genome is contained within 24 large scaffolds, which should definitely be big enough for the syntenic analysis my lab likes to do. More details and download links here.

Both genomes are still Fort Lauderdaled, but that only means you can’t PUBLISH on them yet; it can’t stop you from looking for cool stuff. That’s what I plan to do myself after lab meeting this morning. Putting this data up on the website also means the clock has started ticking on that restriction. The version of the Fort Lauderdale accord that JGI uses gives the researchers involved in sequencing the genome 12 months to get their paper published. If they don’t manage to do that, the genome becomes fair game for anyone. Which is very useful for people like me planning future research projects. One way or another, I know that by November of 2011 I will be able to publish papers that use insight gained from the Setaria and/or Columbine genomes.***

Normally I’d dig up some pictures and interesting trivia about these species, but it’s 4:40 in the morning out here on the west coast, so I’m going to wrap this post up and wish you folks on the east coast good morning.

*I hope to explain what I mean by this in a followup post, but I’ve learned to be careful about promising ANY updates in advance.

**I don’t know what kind of photosynthesis vanilla plants carry out.

***I’m sure both groups will beat the deadline for getting their genomes out the door, but I can think of at least one example of a research group sitting on a completed genome indefinitely, which is why having early data release and expiration dates on exclusivity is so important.
