What NOT to do with your fresh RNA-seq dataset (a rant)

July 13, 2011

Filed under: Uncategorized — James @ 1:11 pm

Under absolutely no circumstances should you take your hard drive full of data, walk into lab and drop it on the desk of some new grad student who decided to go to grad school because he loves plants (or whatever your favorite model organism is) and was a wiz at PCRs in his undergrad lab and tell him he’s now in charge of figuring out how to turn it into a paper.

Crunching gigabases of sequence data doesn’t require a lot of scientific acumen, but it does require a set of skills and abilities that very few incoming genetics or biology grad students have. Not only that, the traditional method of grad student learning (pestering post-docs, technicians, older grad students, even undergrads who have been in the lab for a while) is generally completely closed off. After all, if anyone in the lab already knew how to do the sorts of analyses you’re asking for, you’d have given the project to them instead.

-A few grad students are going to take to computational biology like a fish to water. Before you know it you’ll have new analyses and paper drafts flooding your inbox, along with requests for lots more expensive new datasets to follow up on the hints of cool things they’re finding in this one. However, these are rare exceptions, and it is difficult if not impossible to spot in advance which grad students will react this way.

-A larger number will eventually drop out. Staring at a command line and struggling through introductory books on scripting languages (or worse yet, the documentation files for command line sequence analysis tools) isn’t how they pictured spending their time in grad school. Now, a fair number of grad students don’t make it through under any circumstances, but a lot of the people lost to sequence analysis had the potential to become good or even amazing scientists. They just weren’t computer people.

-But by far the most common fate will be a grad student who struggles along, getting advice on the exact parameters to use for specific programs or the meaning of specific error messages. Because his project isn’t very productive or exciting he’ll spend lots of time on side projects that let him handle actual plants and use his mad pipetting skills. And a few years down the road you’ll have a not-very-imaginative analysis of gene expression (on a dataset anyone could replicate for a fraction of the price) and have spent a lot of money (grad student stipends aren’t generous but over the years they add up) training an indifferent bioinformatician with little or no experience at the fundamental business of science: thinking up new ideas and creative ways to see if they’re right.

So what’s the alternative?

Collaborate: Yes this will mean sharing credit. But more worryingly, picking a good collaborator is hard. Lots of pure computational labs are more interested in programming creative new tools than running existing ones. Sure they could get your entire analysis done in less than a week, but you’ll often spend month after month pestering them to actually get around to it. The key is to find a person or group interested in the answers to the same biological questions you’re looking into so they’ll be in a hurry to finish the analysis and see what the answers are.
Hire someone who is already a computer person: this could be either a freshly minted post-doc from a computational lab or a professional programmer. This is the most expensive option, but it also gives you the most freedom. Once your programmer/post-doc is set up (which could include buying a pricey computer server or two), they’ll be able to crank through analyses in days or weeks that might take a biology grad student months or years, if at all. But that means you’re going to need to keep new data coming in or watch your expensive computational support spin their wheels and get into all sorts of trouble.
If none of those options are appealing or feasible: hire a company to do the analysis for you! Yes it cuts against the grain to spend money on analysis, but it’s still cheaper than paying a grad student’s tuition and stipend for months, and you’ll get the results fast enough to publish them before they’re scooped by someone who waited a year and paid 1/10th the cost to generate an identical dataset with whatever the NEXT next-generation sequencing technology turns out to be.*

*I recently did the math on a PLoS Genetics paper published in late 2009 based on a single in-depth analysis of RNA-seq comparisons of mutant and non-mutant siblings. Today we could generate the same dataset, with twice the depth of sequencing, for less than $1000 (INCLUDING reagent costs). The takeaway lesson here: just because your dataset was expensive to generate doesn’t mean you don’t have to worry about the competition stealing the glory if you take more than a year to publish.

5 Comments »

As a professional programmer working for a purely computational bioinformatics lab, I can fully agree with almost everything you’ve said. I would add a couple of things. First, the programmers working in bioinformatics are forgoing much larger salaries which they could earn elsewhere, and often by doing work which is more interesting in pure programming terms. In other words the programmers are probably there to do science, not because they just couldn’t find any better programming job, so they should be treated as part of the scientific team.

This includes not just crunching datasets on command, but actively participating in the design of experiments and planning what data to take and how, since the programmer has the best understanding of how the data can ultimately be processed. Moreover, the needs of the computer often coincide with better and more systematic experimental design. Even something as simple as computer-friendly naming schemes can save serious money, not to mention embarrassing errors, and is often overlooked by PIs unversed in programming.

And my second and related point is that the trained programmer is in by far the best position to creatively analyze and get results from these large datasets. Too often biologists working with programmers will essentially keep the programmers in the dark, not telling them what they’re really aiming for or why. Often the programmer is not included in the day-to-day scientific discussions, and does not receive necessary feedback on how various tools are being used – or misused. If the goal is to obtain better papers, at lower expense, do not do this.

Comment by William Nelson — July 13, 2011 @ 1:56 pm
I almost brought up the point about involving the people who are going to be analyzing your data at the design stage instead of after all the money has already been spent and you’re stuck with whatever experimental design seemed like a good idea at the time.

Watching computer programmers and biologists try to talk with each other can be incredibly frustrating. Very smart people on both sides, but a completely different vocabulary and backgrounds. The one really top notch programmer we used to have in lab would spend months on concepts my PI (no computational background) was convinced should be incredibly simple to implement, and at other times would take care of tasks assumed to take months in the course of an afternoon.

He finally left lab several months ago and is now making a LOT more money working in biomedical research than we could have afforded. But you’re right: lots of biologists assume programmers with the skill set, interest, and background grow on trees, and are in for a rude surprise when they realize how hard those positions are to fill.

Comment by James — July 13, 2011 @ 8:52 pm
I’ll add in that (contrary to the expectation of many biologists) most “computer people” are NOT simultaneous experts in programming, bioinformatics, statistics, mathematics and data analysis – I’ve seen a lot of very derivative and unproductive analyses done by excellent bioinformaticians and programmers who just didn’t have the necessary biological/statistical background or collaboration.

I also think that a biologist would be foolish to aspire to be an expert programmer, statistician etc. unless their field is tied VERY tightly to these skills. Most biologists would be better off developing a deep knowledge of their subject so that they can ask the right questions (while developing just enough of an understanding of programming and data analysis to be able to ask for (and understand) help from other experts). Too many young PhD biologists have hitched their wagon to having great “hands” on whatever the trendy technique or machine was at the time – only to have it go obsolete or realize they’ve spent their time cultivating technician skills instead of PI skills (chromosome walking, anyone?).

That being said, I do think that any biology grad student in 2010 should aim to have at least some familiarity with both R and Perl/Python 😉

Comment by Matt — July 13, 2011 @ 8:23 pm
Yup, programming and being an expert on statistics are really different skill sets. The computer can do the hard work of statistical tests for you (especially with R!), but what it _can’t_ do is tell you what statistical test is appropriate to your dataset. And as William said above about seeking computational advice, it’s better to consult statistical experts when you’re designing the experiment, not when all your data has already been generated.

I’d estimate less than half the grad students in my department can write so much as a “hello world!” script in any language.

Comment by James — July 13, 2011 @ 9:01 pm
Very true, stats is a separate expertise. Truly, biology is an interdisciplinary field these days! I would agree that biology grad students should focus on their PI skills, but also learn enough programming to know how to design data-intensive projects, and how to collaborate with programmers. Also I think the CS departments should expand their hiring to include people who don’t pursue really “novel” work in a CS sense, but collaborate actively and in a useful way with non-CS groups.
There’s little incentive now in the CS departments to push an idea to the point where it’s actually usable in practice by someone else.

Comment by William Nelson — July 14, 2011 @ 9:33 am