James and the Giant Corn


I’ve been saying it for nearly a decade: pineapples really are awesome

With all these new third generation sequencing technologies coming out in 2010, hopefully someone will sequence the pineapple genome. If not, maybe the cost of sequencing will drop enough while I’m in grad school that I can sequence the genome myself (a guy can dream).

An incredibly overused graph, but the reason it’s so overused is that it really is a remarkably useful dataset. Source: https://www.genome.gov/sequencingcostsdata/

Admittedly, I was a bit overly optimistic back in 2010 about how fast the cost of sequencing (and, critically, assembling) genomes would decline. Back then we were all talking about sequencing prices dropping 10x every 1-2 years. This turned out to be a quick burst of innovation brought about by second generation sequencing technologies (primarily 454 at first, then Solexa, which later became Illumina). As with many technologies, there was a lot more low hanging fruit for optimization early on, and the cost of sequencing essentially plateaued from 2011 to 2015.

Of course, now we’re finally starting to get those economically viable 3rd generation sequencing technologies I thought were right around the corner in 2010 (PacBio and Oxford Nanopore being the two most successful ones at the moment). And they still have so much headroom for optimization that maybe in another 6-7 years grad students really will be able to generate genome assemblies on a whim.

In the meantime, hey, we did get a pretty cool pineapple genome assembly a couple of years ago.

Ming R., VanBuren R., Wai C. M., Tang H., Schatz M. C., et al., 2015 The pineapple genome and the evolution of CAM photosynthesis. Nat Genet 47: 1435–1442.
Also, here’s a fun video of a 3D scan of the internal structure of a pineapple:

Evidence of my ongoing obsession with pineapples.

Science is fun.

Editor’s note: Robert VanBuren, second author on the pineapple genome and first author on at least one of the dozen or so published grass genome sequences, got his own research group out at MSU working on CAM photosynthesis and drought. Check it out!

Dichanthelium oligosanthes (One in a Thousand Series)

Inflorescence of Dichanthelium oligosanthes. Accession “Kellogg 1175”

Out of the ~12,000 known grass species, the genomes of fewer than one in one thousand have been sequenced. The “One in a Thousand” series focuses on these rare grass species.

Dichanthelium oligosanthes is a wild grass that grows in forest glades throughout the American midwest. It is a small plant. Doesn’t grow particularly fast. Its flowers aren’t particularly striking. And it has enough issues with seed dormancy that growing it in captivity is a major pain. Dichanthelium is a one in one-thousand grass with a sequenced reference genome.*

The reason folks are interested in Dichanthelium isn’t because of what it is, but who it’s related to. Dichanthelium occupies a spot on the grass family tree between a tribe** of grasses that includes foxtail millet and switchgrass, each one in a thousand species themselves, and another tribe of grasses that includes corn and sorghum, two more one in a thousand species. The relationship looks something like this:

Phylogenetic relationship of Dichanthelium oligosanthes to related grasses with sequenced genomes.


STAG-CNS: Finding smaller conserved promoter regions by throwing more genomes at the problem

Functionless DNA changes more rapidly, functional DNA more slowly. This is one of the fundamental principles of comparative genomics. It’s why people look at the ratio of synonymous nucleotide changes to nonsynonymous nucleotide changes within the coding sequence of genes. It’s why the exons of two related genes will still have strikingly similar sequences after the sequences of the introns have diverged to the point where it’s impossible to even detect homology. It’s also a way to identify which parts of the noncoding sequence surrounding a set of exons are functionally constrained. The bits of noncoding sequence that determine where, when, and how much a gene is expressed are, by definition, functional, and should diverge more slowly, even between related species, than the big soup of functionless noncoding sequence that the functional bits of a genome float in. These conserved, functional, noncoding sequences are called, unimaginatively, conserved noncoding sequences (CNS).*

Comparison of a single syntenic orthologous gene pair in the genomes of peach and chocolate. Coding sequence marked in yellow, introns in gray, annotated UTRs in blue. Red boxes are regions of detectably similar sequence between the same genomic region in these two species. Taken from CoGePedia.

I’ve been playing with CNS since I first opened a command line window back as a first year grad student. The smallest CNS we’d consider “real” were 15 base pair exact matches between the same gene in two species. On the one hand, this seemed a bit too big, because I knew lots of transcription factors bind to motifs as short as 6-10 base pairs. On the other hand, this seemed a bit too short, because I’d see 15 base pair exact matches that couldn’t be real a bit too often (for example, a match between a sequence in the intron of one gene and the sequence after the 3′ UTR of another).

15 bp represented a compromise between the two concerns pushing in opposite directions. Then, in the fall of 2014, a computer science PhD student walked into my office and asked if I had any interesting bioinformatics problems he could work on. The result was a new algorithm (STAG-CNS) which was both more stringent at identifying conserved noncoding sequences and able to identify shorter conserved sequences than was previously possible. It achieved both of these goals through the expedient of throwing genomes from more and more species at the problem.
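To make the "more genomes, shorter matches" idea concrete, here is a minimal Python sketch. It is not the STAG-CNS algorithm itself, just a toy: the function names (kmers, shared_kmers, expected_chance_matches) are made up for illustration, and the back-of-the-envelope estimate assumes random sequence with all four bases equally likely, ignoring reverse complements and gene order.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """Return the k-mers that occur as exact matches in every sequence."""
    shared = kmers(seqs[0], k)
    for seq in seqs[1:]:
        shared &= kmers(seq, k)
    return shared


def expected_chance_matches(length, k, n_species):
    """Expected number of exact k-mer matches shared by chance alone.

    Assumption (not from the post): each species contributes one random
    sequence of the same length with equally frequent bases. Every tuple
    of start positions (one per species) matches with probability
    (1 / 4**k) ** (n_species - 1).
    """
    positions = length - k + 1
    return positions ** n_species / 4 ** (k * (n_species - 1))


if __name__ == "__main__":
    toy_promoters = ["AAACGTACGTTT", "GGACGTACGCC", "TACGTACGA"]
    print(sorted(shared_kmers(toy_promoters, 6)))
    for n in (2, 3, 4):
        print(n, expected_chance_matches(1000, 8, n))
```

Under those assumptions, with 1,000 bp of sequence per species, an 8 bp exact match is expected by chance about 15 times between two species, but well under once across four. Each extra species multiplies the chance expectation by a factor smaller than one, which is the intuition for why shorter matches become believable as more genomes are added.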


Correcting genotyping errors when constructing genetic maps from genotyping by sequencing — GBS — data.

When doing anything even vaguely related to quantitative genetics I would choose more missing data over more genotyping errors any day of the week. There are lots of approaches to making missing data less of a pain. The most straightforward of these is called imputation. Imputation essentially means using the genetic markers where you do have information to guess what the most likely genotypes would be at the markers where you don’t have any direct information on what the genotype is. This is possible because of a phenomenon known as linkage disequilibrium or “LD.” Both imputation and LD deserve their own entire write ups and they are on the list of potential topics for when I have another slow Sunday afternoon. For now the only thing you have to know about them is that, when information on a specific genetic marker is missing, it is often possible to guess with fairly high accuracy what that missing information SHOULD be. But when the information on a specific genetic marker is WRONG… well it’s usually a bit more of a mess (but I think the software solutions for this are getting better! Details at the end of the post.)
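As a rough illustration of the simplest kind of imputation, here is a small Python sketch for a single recombinant inbred line. It is a toy under stated assumptions, not what any real imputation package does: genotypes are written as a string of "A" and "B" calls in marker order along one chromosome, "-" marks missing data, and the function name impute_flanking is made up for this example. A run of missing calls is filled in only when the nearest real calls on both sides agree.

```python
def impute_flanking(calls, missing="-"):
    """Fill runs of missing calls when both flanking calls agree.

    calls: string of genotype calls in marker order, e.g. "AA-A--BB".
    Runs at the chromosome ends, or runs flanked by two different
    genotypes (a likely crossover), are left as missing.
    """
    imputed = list(calls)
    n = len(imputed)
    i = 0
    while i < n:
        if imputed[i] != missing:
            i += 1
            continue
        # Find the end of this run of missing calls.
        j = i
        while j < n and imputed[j] == missing:
            j += 1
        left = imputed[i - 1] if i > 0 else None
        right = imputed[j] if j < n else None
        if left is not None and left == right:
            imputed[i:j] = [left] * (j - i)
        i = j
    return "".join(imputed)


if __name__ == "__main__":
    print(impute_flanking("AA-A--BB-B"))
```

The same logic shows why a wrong call hurts more than a missing one: in "AAAABAAAA" nothing is missing, so nothing gets fixed, and the lone "B" looks like two crossovers right next to each other.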

Figure 1: Genotype calls along chromosome 1 for six recombinant inbred lines (RILs).


Hybrid vigor and missing genes

Thinking about defining the number of genes present in the maize genome reminded me of an old* story about the trouble of defining what truly represents a gene and how really awesome ideas can sometimes come years before the data needed to support them.

The year is 2002. The first complete version of the human genome is still a year away. The genomes of two plant species have already been published (rice and Arabidopsis) but in terms of sheer genome size, both species are a drop in the bucket compared to the human genome, or other plant genomes like corn or wheat. But none of this is particularly important except to set the stage.

Two researchers at Rutgers University were sequencing a tiny piece of the maize genome (~0.01%) that surrounded a single gene called bronze1 (the fifth most studied gene in maize) when they found something unexpected.

They had previously identified 10 genes in a single 32-kb stretch of the maize genome. (A similar gene density throughout the remainder of the maize genome would have resulted in a maize genome containing more than 700,000 genes!) However, it was already known that the maize genome was split between small gene-rich islands and vast desolate expanses of transposons (referred to as transposon nests**), and in fact the same study identified a couple of these nests of transposons on either side of their gene rich island (see part A of the second picture in this post).
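The 700,000 figure is easy to check with a couple of lines of Python. The gene count and region size come from the paragraph above; the genome size of roughly 2.3 billion base pairs is my assumption for this sketch, not a number from the study, and the function name is made up.

```python
GENES_IN_REGION = 10                 # genes found in the sequenced region
REGION_BP = 32_000                   # 32 kb
ASSUMED_GENOME_BP = 2_300_000_000    # assumption: maize genome of ~2.3 Gb


def extrapolated_gene_count(genes, region_bp, genome_bp):
    """Scale a local gene density up to a whole genome."""
    return genes * genome_bp / region_bp


if __name__ == "__main__":
    print(extrapolated_gene_count(GENES_IN_REGION, REGION_BP, ASSUMED_GENOME_BP))
```

With those inputs the extrapolation comes out a bit above 700,000 genes, which is exactly why the density of a single gene-rich island can't be typical of the genome as a whole.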

Below I'll use cartoons, but here's a real and to scale example of a gene rich island I picked at random from maize chromosome 3. Genes and intergenic spaces are to scale. Base image generated with GenomeViewer, part of the CoGe toolkit. http://www.genomevolution.org/CoGe/

Their initial sequencing used DNA from a breed of corn called McC, which I must admit I’ve only ever read about in this particular paper. However, when they decided to sequence the same region from the genome of B73*** they made three discoveries which I’ve listed in increasing order of strangeness: (more…)

Transposon Mutagenesis

In yesterday’s Transposon Week post, I discussed how transposons can spread through a species without providing any benefit to the animals, plants, fungi, or micro-organisms that host them.

Adding a little extra useless DNA doesn’t help an organism survive, but it also doesn’t cause serious harm. But in yesterday’s post I completely avoided one serious question:

When new copies of a transposon get inserted across the genome, what happens to the DNA they land in? For that matter, what kind of DNA do transposons land in in the first place?

The answer to the second question is that different kinds of transposons each have their favorite places to land in the genome. Some transposons like to land in centromeres. Some transposons like to land in other transposons. Some transposons like to land near genes.

Then there are transposons like Mutator. Mutator is a maize/corn transposon that really likes to insert itself into genes. Transposons that usually land in other parts of the genome are also sometimes found in genes.

When a transposon lands in a gene, whether because that’s where it likes to insert or simply by accident, the gene stops working. Depending on which gene has the misfortune to be interrupted by a transposon, the effects can range from so-subtle-we-can’t-even-detect-them to so lethal the organism dies before we get a chance to study it. In between are a whole range of effects, from severe developmental mutants, to gorgeous and apparently random streaks of color in flowers, to the spotted corn kernels which were my first introduction to the world of transposons.* (**)

Transposons are always breaking genes. The deadliest mutations disappear from the population as quickly as they appear. More subtle mutations can linger on for generations, giving rise to all sorts of genetic disorders. And keep in mind many genes can be broken with no visible effect at all. Anything you eat from asparagus to zucchini has the potential to contain genes broken by transposons. And depending on the gene, you’d probably never even know it.

Sorry to put this post up so late (it’s technically already Thursday) and in such a poor shape. I had some craziness in lab today and was waiting (unfortunately without any luck) to hear back about some more interesting stories I could tell about the Mutator transposon.

*To be fair, the last two are actually caused by transposons jumping OUT of genes, allowing them to resume their normal function. The original mutations caused by transposons inserting into genes broke the biochemical pathways used by dahlias to make red pigment in their petals and by corn to produce purple pigment (anthocyanin) in its kernels.

**I really wish there was a good source of freely usable pictures of things like transposon sectors in flowers and corn kernels. I can usually find pictures of normal plants on Flickr with Creative Commons licenses. But I really want to be able to show you guys the cool mutants that make genetics so exciting.

Transposons: The Difference Between Junk DNA and Selfish DNA

Transposons are one of those really cool features of genomes that never really seem to make the jump into the public eye. Most people at least have some conception of what a gene is. It’s a piece of DNA that contains the instructions for making a protein that plays some role in the cell. A lot of other people can recall hearing an off-hand statistic that only some tiny fraction of the human genome is made up of genes, with the rest being “junk DNA”. The question of why most of our genomes have no apparent function is why there’s a slow trickle of scientific research that gets picked up in the popular press as “scientists discover junk DNA not junk after all!”.

But the reason most genetics and genomics people aren’t in a huge rush to discover the hidden function behind most of this “junk DNA” is that we KNOW what most of it does and where it comes from. It’s not junk, it’s selfish DNA. <– although there’s certainly lots of cool stuff remaining to be discovered in the much smaller fractions of genomes we can’t classify at all. (more…)

Welcome to transposon week here at James and the Giant Corn!

I’m just about wrapped up with the big project I’ve been working on recently. Hope to be able to say more about it in the not-too-distant future. Having to be secretive in science sucks.

But there’s a lot to be happy about! I’m done teaching for a long time. As much as I enjoyed working with the kids in my class, the other responsibilities of teaching (grading, sitting through lectures without the chance to break in for the discussions and arguments that make academia so fun, grading, designing assignments, grading) were really starting to wear me down.

And I’m only three weeks (June 22nd) from either passing my qualifying exam or becoming a beaten and broken shell of a man. For three hours four professors will question me on everything I’ve learned (or should have learned but didn’t) in my education up to this point, and everything I propose to spend the next few years of my life doing. This may not sound like a good thing, but it is. Because my qualifying exam has been hanging over my head all semester,

The lab has a new paper in press, having run the sequential gauntlets of Peer Review, Editorial Evaluation, and finally (and perhaps most dreaded) Your-Figures-Aren’t-High-Resolution-Enough e-mails from the journal’s publication department. But more on the details of that whenever the paper actually shows up.

But what was the point of this entry again? Oh yeah. Transposons. I have a soft spot for transposons (I’m guessing most people who work with maize genetics do). Today we may know that transposons are found in practically every genome under the sun, but they were discovered first in maize using old school genetics (breeding plants together and counting traits in the offspring), before DNA sequencing was a gleam in its inventor’s eye.

And on top of that, some delightfully high-copy number transposons are in the middle of proving a major scientific point for me, so I figured the least I could do was devote a week to them here on the site.

If you’re not a geneticist, should you still care about transposons? Absolutely! Transposons are one of the best arguments, not for why genetic engineering is safe, but for why anyone worried about hypothetical unintended consequences of genetic engineering should be worried about any food with DNA in it (and as far as I know, that’s all food). To paraphrase a Seinfeld character: “No food for you!”

The week’s schedule: (more…)

The Peach Genome Is Out

1.1 pound peach from the Berkeley Farmer's market.

Here. I had no idea anyone was even considering sequencing the peach genome until I heard a single off-hand comment at the maize meeting last month, and all of a sudden here it is. And in better shape in its first release than some genomes are even after they’re published.

This is a pre-publication release, so the Fort Lauderdale Convention is still in effect,* but the peach genome looks really great from the quick and dirty analysis I have already run. They’ve already got the genome assembled into pseudomolecules (chromosomes), unlike some genomes I could mention that have already been published, and marked the locations and structures of genes in the genome (there was a weird period last summer when there were pre-release versions of the maize genome organized into chromosomes, and pre-release versions with the genes marked, but none that had both).

*In short, you or I can download the peach genome, play around and study it to our hearts’ content, but we can’t publish anything on it until the people who actually sequenced the peach genome publish a paper describing their work.

The two genomes of maize

I recently got back from the maize meeting. I mentioned before that a big part of the reason to do poster presentations is to get comfortable discussing one’s research with people who haven’t specialized in the exact same subject. In my case, my poster got a fair bit of interest, which was great. (Although I was surprised which parts people were most interested in.) But there were also a couple of concepts I had a lot of trouble getting across.

It’s too late to do me any good at the maize meeting, but I have created the figure I think I needed to explain those ideas. Maybe I can squeeze it into my qualifying exam proposal. Or maybe the next time I get a chance to give a talk on campus. Let’s just not get into how much of my morning I spent putting this together, and pretend it was a good investment of my time, OK? (more…)