James and the Giant Corn

The Last Genome I’ll (Probably) Ever Publish: Proso millet (Panicum miliaceum)

Proso millet growing in western Nebraska.

I’ve now been a part of the publication of three genomes, all grasses. One as a grad student (Brachypodium distachyon). One as a postdoc (Dichanthelium oligosanthes). And now one as a PI (Panicum miliaceum). Each species had a different motivation: Brachypodium was intended to be a genetic model, selected because it belonged to the same part of the grass family as wheat, barley, rye, and oats, but had a genome that was 1-2 orders of magnitude smaller. Dichanthelium was a comparative-grade genome, picked because it stood between two groups of C4 grasses with sequenced genomes (maize and sorghum on one side, foxtail millet and pearl millet on the other) yet still used C3 photosynthesis, the ancestral state. Panicum miliaceum (proso millet or broomcorn millet) was sequenced because it’s an actual crop people grow on some of the driest cultivated land in the world (like Inner Mongolia and western Nebraska), and having a reference genome sequence really does help with things like genomic selection, marker assisted selection, and QTL mapping. And each was sequenced using completely different technologies: Sanger sequencing (Brachypodium), Illumina short reads and mate pairs, “next gen sequencing” (Dichanthelium), and PacBio long reads combined with HiC, “third gen sequencing” (proso millet). PacBio assemblies are SO MUCH BETTER than what we could manage with Illumina + mate pairs (I realize this is not news to most of you, but it’s one thing to hear it, it’s another to see it for yourself).

Different colors of proso millet grain, all sourced from the USDA NPGS’s amazing germplasm collections.

If I’ve learned one thing from these three experiences, it is that it makes sense to work together with a whole team of people to put together a genome. On the Dichanthelium genome project I was mostly working with a single other postdoc who also thought the potential for comparative genomics/biology of the species was cool, and in retrospect we bit off way more than we could chew, and were lucky to make it across the finish line to a paper. For both proso millet and Brachypodium, I had the joy of working with big teams of people, including folks whose whole job was genome assembly and annotation, and they were really REALLY good at it.

A single proso millet spikelet flowering.

So what can I tell you about proso millet? It produces grain more efficiently per unit of water transpired than any other grain crop studied. It can produce grain in fewer days than any other crop I’ve worked with (some varieties are ready for harvest 50-60 days after planting!). It’s an allotetraploid, although so far we’ve only found a diploid lineage related to one of its subgenomes, not the other. One early approach we tried (see Ott et al. below) was to take a technology designed to separate and phase the haplotypes of a diploid human and use it to separate and phase the two subgenomes of an inbred tetraploid individual of proso millet. I’ve actually met farmers in both China and the USA who grow the crop, which is a really nice feeling. With one of my private sector hats on, I’ll get to use this genome to try to make higher yielding varieties of proso millet for those exact farmers. With my main public sector hat on, I’m excited to have a model for NAD-ME C4 photosynthesis that is easier to germinate, grow, and propagate than Panicum hallii or Panicum virgatum. There is nothing like working with wild grasses to make you appreciate the work all of our ancestors did to select against seed dormancy and photoperiod sensitivity while they were domesticating crops from wild species over dozens and hundreds of generations.

Zou C, Miki D, Li D, Tang Q, Xiao L, Rajput S, Deng P, Peng L, Huang R, Zhang M, Sun Y, Hu J, Fu X, Schnable PS, Li F, Zhang H, Feng B, Zhu X, Liu R, Schnable JC, Zhu JK, Zhang H. (2019) “The genome of broomcorn millet.” Nature Communications doi: 10.1038/s41467-019-08409-5

Ott A, Schnable JC, Yeh CT, Wu L, Liu C, Hu HC, Dolgard CL, Sarkar S, Schnable PS. (2018) “Linked read technology for assembling large complex and polyploid genomes.” BMC Genomics doi: 10.1186/s12864-018-5040-z

Repost: Correcting genotyping errors when constructing genetic maps from genotyping by sequencing — GBS — data.

Editor’s note: this is a repost of an article which originally ran on James and the Giant Corn on March 26th, 2017. I’m choosing to post this new, slightly amended version a little more than a year later to mark the publication of the paper describing Genotype Corrector. All told, it took approximately 18 months from initial submission to final publication. However, to be fair, a lot of that time was spent waiting for a single round of peer review at a different journal from the one in which the paper finally appeared.

When doing anything even vaguely related to quantitative genetics I would choose more missing data over more genotyping errors any day of the week. There are lots of approaches to making missing data less of a pain. The most straightforward of these is called imputation. Imputation essentially means using the genetic markers where you do have information to guess what the most likely genotypes would be at the markers where you don’t have any direct information on what the genotype is. This is possible because of a phenomenon known as linkage disequilibrium or “LD.” Both imputation and LD deserve their own entire write-ups, and they are on the list of potential topics for when I have another slow Sunday afternoon. For now the only thing you have to know about them is that, when information on a specific genetic marker is missing, it is often possible to guess with fairly high accuracy what that missing information SHOULD be. But when the information on a specific genetic marker is WRONG… well, it’s usually a bit more of a mess (but I think the software solutions for this are getting better! Details at the end of the post.)
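To make the idea of imputation a bit more concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any real imputation software (or Genotype Corrector) works: it looks at the ordered genotype calls along one chromosome of a single inbred line and fills in a missing call only when the nearest observed markers on both sides agree. The function name `impute_from_flanks` and the example calls are made up for illustration.

```python
from typing import List, Optional


def impute_from_flanks(calls: List[Optional[str]]) -> List[Optional[str]]:
    """Fill missing calls (None) when the nearest observed markers on both sides agree.

    `calls` is an ordered list of genotype calls along one chromosome,
    e.g. "A" or "B" for the two parents, with None for missing data.
    """
    imputed = list(calls)
    n = len(calls)
    for i, call in enumerate(calls):
        if call is not None:
            continue
        # Nearest observed call to the left and to the right (None if there is none).
        left = next((calls[j] for j in range(i - 1, -1, -1) if calls[j] is not None), None)
        right = next((calls[j] for j in range(i + 1, n) if calls[j] is not None), None)
        # Only guess when both flanks exist and tell the same story.
        if left is not None and left == right:
            imputed[i] = left
    return imputed


# Hypothetical calls for one line: parent A, parent B, or missing (None).
ril = ["A", "A", None, "A", None, "B", "B", None, "B"]
print(impute_from_flanks(ril))
```

Note what the sketch refuses to do: the missing call sitting between an "A" and a "B" stays missing, because a recombination breakpoint could fall on either side of it. Note also that a WRONG call sitting between two agreeing flanks is left untouched, since this function only ever looks at missing data.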

Figure 1: Genotype calls along chromosome 1 for six recombinant inbred lines (RILs).

Play to your strengths!

A friend tries her own hand at chopping open coconuts on Hainan Island.

I have a confession to make here. I suck at organic chemistry. Chemistry in general (general chemistry, organic chemistry, biochemistry) was by far my weakest subject in college (Cs the whole way). I even managed to fail organic chemistry lab one semester, which brought me down below a full course load, and I had to organize an appeal to avoid being involuntarily suspended the following semester. It’s always fun to tell this story to new undergrads and/or grad students and watch their eyes get wider and wider as the tale goes on.

The reason I tell them that story (besides trying to help put things into perspective when a kid is worried about getting their first B and that their own imagined future is crumbling before their eyes) is to make the point that it’s okay to be really really good at some things, and suck terribly at others. That’s why we come together as a society. If I’m really good at climbing trees to harvest coconuts, but suck at spearing fish, and you have the opposite skill set, one solution would be for me to spend all my time practicing fish spearing, and you to spend all your time practicing tree climbing. Or I could trade you some of my coconuts for some of your fish, and we’d both have a lot more to eat when we sit down to a delicious feast on the beach as the waves roll in.

I have no idea what these even are, let alone how to make them, but I remember them being really delicious (Beijing 2014).

There is also such a thing as over-specialization. If I’m so focused on harvesting a particular type of coconut that I develop my whole own coconut-focused vocabulary, to the point I cannot even communicate with people who spear fish, or farm taro, I’m going to have a bad time of it out in our hypothetical island world.

Thus ends this fable/analogy/whatever it is.

….also I’ve sucked at spelling since I first learned to write.

I may have a slightly skewed idea of what normal work habits are

I’ve now worked at four different scientific institutions in some capacity or another, and I’m always surprised how empty buildings are when I come in on Saturdays or Sundays. To be clear, I’m certainly not at work every weekend day myself, and I don’t expect the students or collaborators to work weekends.* I’m just realizing that, after 13 years of thinking “wow, people at University X really have a more relaxed approach to research than most places,” maybe my idea of how many hours it is normal for a researcher to log in a week might be a tiny bit skewed.**

Although, to be fair, 9:15 AM on Easter Sunday might be the MOST representative time point. 😉

*I always say that my mentoring style is to focus on productivity, not hours worked in lab. I’m still working out what that means in practice. For an entertaining glimpse of what the opposite sounds like (entertaining as long as the person writing the e-mail isn’t your boss), be sure to read this classic e-mail from 2002.

**Growing up, I thought every family had dinner around 8 pm once everyone got home from the office, and that once you got a real job, “weekend” actually meant “Sunday morning.”

A new chapter

Whatever anyone tells you, remember to play to your greatest strengths, not your weaknesses.

I’ve been saying it for nearly a decade: pineapples really are awesome

With all these new third generation sequencing technologies coming out in 2010, hopefully someone will sequence the pineapple genome. If not, maybe the cost of sequencing will drop enough while I’m in grad school that I can sequence the genome myself (a guy can dream).

An incredibly overused graph, but the reason it’s so overused is that it really is a remarkably useful dataset. Source: https://www.genome.gov/sequencingcostsdata/

I was a bit overly optimistic back in 2010, though, about how fast the cost of sequencing (and, critically, assembling) genomes would decline. Back then we were all talking about sequencing prices dropping 10x every 1-2 years. This turned out to be a quick burst of innovation brought about by second generation sequencing technologies (primarily 454 at first, then Solexa, which became Illumina later on). As with many technologies, there was a lot more low hanging fruit for optimization early on, and the cost of sequencing essentially plateaued from 2011 to 2015.

Of course, now we’re finally starting to get those economically viable third generation sequencing technologies I thought were right around the corner in 2010 (PacBio and Oxford Nanopore being the two most successful ones at the moment). And they still have so much headroom for optimization that maybe in another 6-7 years grad students really will be able to generate genome assemblies on a whim.

In the meantime, hey, we did get a pretty cool pineapple genome assembly a couple of years ago.

Ming R., VanBuren R., Wai C. M., Tang H., Schatz M. C., et al., 2015 The pineapple genome and the evolution of CAM photosynthesis. Nat Genet 47: 1435–1442.

Also, here’s a fun video of a 3D scan of the internal structure of a pineapple:

Evidence of my ongoing obsession with pineapples.

Science is fun.

Editor’s note: Robert VanBuren, second author on the pineapple genome and first author on at least one of the dozen or so published grass genome sequences, got his own research group out at MSU working on CAM photosynthesis and drought. Check it out!

Dichanthelium oligosanthes (One in a Thousand Series)

Inflorescence of Dichanthelium oligosanthes. Accession “Kellogg 1175”

Out of the ~12,000 known grass species, the genomes of less than one in one thousand have been sequenced. The “One in a Thousand” series focuses on these rare grass species.

Dichanthelium oligosanthes is a wild grass that grows in forest glades throughout the American midwest. It is a small plant. Doesn’t grow particularly fast. Its flowers aren’t particularly striking. And it has enough issues with seed dormancy that growing it in captivity is a major pain. Dichanthelium is a one in one-thousand grass with a sequenced reference genome.*

The reason folks are interested in Dichanthelium isn’t because of what it is, but who it’s related to. Dichanthelium occupies a spot on the grass family tree between a tribe** of grasses that includes foxtail millet and switchgrass, each one in a thousand species themselves, and another tribe of grasses that includes corn and sorghum, two more one in a thousand species. The relationship looks something like this:

Phylogenetic relationship of Dichanthelium oligosanthes to related grasses with sequenced genomes.

STAG-CNS: Finding smaller conserved promoter regions by throwing more genomes at the problem

Functionless DNA changes more rapidly, functional DNA more slowly. This is one of the fundamental principles of comparative genomics. It’s why people look at the ratio of synonymous nucleotide changes to nonsynonymous nucleotide changes within the coding sequence of genes. It’s why the exons of two related genes will still have strikingly similar sequences after the sequences of the introns have diverged to the point where it’s impossible to even detect homology. It’s also a way to identify which parts of the noncoding sequence surrounding a set of exons are functionally constrained. The bits of noncoding sequence that determine where, and when, and how much, a gene is expressed are, by definition, functional, and should diverge more slowly, even between related species, than the big soup of functionless noncoding sequence that the functional bits of a genome float in. These conserved, functional, noncoding sequences are called, unimaginatively, conserved noncoding sequences (CNS).*

Comparison of a single syntenic orthologous gene pair in the genomes of peach and chocolate. Coding sequence marked in yellow, introns in gray, annotated UTRs in blue. Red boxes are regions of detectably similar sequence between the same genomic region in these two species. Taken from CoGePedia.

I’ve been playing with CNS since I first opened a command line window back as a first year grad student. The smallest CNS we’d consider “real” were 15 base pair exact matches between the same gene in two species. On the one hand, this seemed a bit too big, because I knew lots of transcription factors bind to motifs as short as 6-10 base pairs. On the other hand, this seemed a bit too short, because I’d see 15 base pair exact matches that couldn’t be real a bit too often (for example, a match between a sequence in the intron of one gene and the sequence after the 3′ UTR of another).
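A quick back-of-the-envelope calculation shows why the length cutoff matters so much. The sketch below is only an illustration under simplifying assumptions that are mine, not anything from the original analysis: both sequences are random, all four bases are equally likely, only one strand is considered, and the two regions being compared are each 10,000 bp long. Under those assumptions the expected number of exact k-mer matches that arise purely by chance is roughly the number of position pairs times (1/4)^k.

```python
def expected_chance_matches(len_a: int, len_b: int, k: int) -> float:
    """Expected number of exact k-mer matches between two random sequences.

    Assumes independent, uniformly distributed bases and a single strand,
    so any given pair of positions matches over k bases with probability 0.25**k.
    """
    positions_a = max(len_a - k + 1, 0)  # possible k-mer start sites in sequence A
    positions_b = max(len_b - k + 1, 0)  # possible k-mer start sites in sequence B
    return positions_a * positions_b * 0.25 ** k


# Assumed region size of 10 kb per species, purely for illustration.
for k in (6, 10, 15):
    print(k, expected_chance_matches(10_000, 10_000, k))
```

With these made-up region sizes, chance alone gives roughly 24,000 matches for 6-mers, roughly 95 for 10-mers, and about 0.09 for 15-mers. Real genomes are not random (think simple repeats and low-complexity sequence), so spurious 15 bp matches should turn up more often than this idealized number suggests, which fits with seeing them "a bit too often."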

15 bp represented a compromise between the two concerns pushing in opposite directions. Then, in the fall of 2014, a computer science PhD student walked into my office and asked if I had any interesting bioinformatics problems he could work on. The result was a new algorithm (STAG-CNS) which was both more stringent at identifying conserved noncoding sequences and able to identify shorter conserved sequences than was previously possible. It achieved both of these goals through the expedient of throwing genomes from more and more species at the problem.
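The intuition behind "more genomes" can be shown with a toy example. To be clear, this is not the STAG-CNS algorithm, whose details aren't covered in this post; it is just a minimal sketch of the underlying idea that a short exact match is more believable when it is shared by every species rather than by a single pair. The function names and the three short "promoter" sequences are invented for illustration.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs: Iterable[str], k: int) -> Set[str]:
    """Return the k-mers that occur in every one of the input sequences."""
    sets = [kmers(s, k) for s in seqs]
    if not sets:
        return set()
    return set.intersection(*sets)


# Invented toy sequences standing in for the same promoter in three species.
promoters = {
    "species_1": "TTGACGTGGCAAATCC",
    "species_2": "ACCACGTGGCATTGAC",
    "species_3": "GGCACGTGGCTTTTAA",
}

# Two species share two different 7-mers...
print(sorted(shared_kmers([promoters["species_1"], promoters["species_2"]], 7)))
# ...but only one of them survives once a third species is required to agree.
print(sorted(shared_kmers(promoters.values(), 7)))
```

Each added species is another chance for a coincidental match to fall apart, so the length threshold can be lowered without drowning in false positives. A real tool also has to worry about things this sketch ignores, such as the order and position of matches along the gene, both strands, and scaling to whole genomes.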

McClintock Prize

Last night my major professor received the McClintock Prize in Maize Genetics.

His acceptance talk was really exciting and full of his newest ideas about the big problems of biology and evolution. However, looking back at his history, one of the amazing things about his career is that he’s reinvented himself entirely, switching from a research program focused on transposons and developmental biology to an entirely different career focused on bringing rigorous hypothesis development and hypothesis testing to the world of comparative plant genomics (and he started when there was exactly one sequenced plant genome, so being able to do comparative work at the time was quite something).

In many ways it makes me nostalgic for my time in the lab. In grad school you are essentially paid to think, while it often feels like as a faculty member you are paid mostly to attend meetings, fill out forms, and spend four hours a day answering e-mails. 😉

But this post isn’t about me. Congratulations Mike! He really is one of the fathers of modern maize genetics.

A partial sample of the 76 people who either received their PhDs or Postdoc’d in the Freeling Lab at UC Berkeley

Complete list of lab alumni here.