Internet Scale Data

October 12, 2009


Filed under: Computers and Coding — James @ 9:12 am

I’m pretty proud of my eight-core* workstation. But then I picked up the New York Times this morning** and found out I’m in fact behind the times:

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

“If they imprint on these small systems, that becomes their frame of reference and what they’re always thinking about.”

Analyzing huge data sets is only going to get more important in the future, especially in biology. We had a taste of it with studying microarrays. Now the problems of microarray analysis are being replaced by those of RNA-seq analysis. And five years from now, sequence data is going to be even more out of control.

So if you’re studying biology, or plan to in the future, learn to program. Going by the NYTimes article, ideally you’d want to learn how to program for a supercomputer. If you’re feeling that ambitious, ask around; you’d be surprised how many campuses have one or more. Some even have one that’s been superseded by a newer, faster supercomputer, so it’s actually possible to get some time on the older machine.

But even if you don’t feel like learning supercomputers, still learn some basic programming. Before you even start, get comfortable with the command line. Then try out a scripting language like Python or Perl***. Maybe you’ll get hooked like I did, and even if you don’t, a little bit of programming can help with huge datasets. Being able to scan through DNA for the locations of a specific sequence, or a not entirely specific sequence (e.g. ATG, then GT or CC, then ACT 0, 1, or 2 times, then CGT), has come in handy plenty of times. So has pulling data out of multiple spreadsheets (the ones in our lab are tens of thousands of lines long) and putting the data that matches your requirements together in a new spreadsheet.
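To make the "not entirely specific sequence" idea concrete, here is a minimal Python sketch of that kind of search using a regular expression. The function name `find_motif` and the example DNA string are made up for illustration; this is not the script I actually use in the lab, just one plausible way to write it.

```python
import re

# The motif described above: ATG, then GT or CC,
# then ACT repeated 0, 1, or 2 times, then CGT.
MOTIF = re.compile(r"ATG(?:GT|CC)(?:ACT){0,2}CGT")


def find_motif(dna):
    """Return (start, matched_text) for each non-overlapping match.

    Positions are 0-based. The input is upper-cased first so that
    lower-case sequence still matches.
    """
    return [(m.start(), m.group()) for m in MOTIF.finditer(dna.upper())]


# Invented sequence, for illustration only.
example = "ttATGGTCGTaaATGCCACTACTCGTggATGCCACTCGT"

for start, hit in find_motif(example):
    print(start, hit)
```

Note that `finditer` only reports non-overlapping matches on one strand; checking the reverse complement, or catching overlapping hits, would need a little more code.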

*Technically two CPUs with four cores each

**Well not actually picked up. Saw it on the website. About the only time I lay hands on a physical newspaper is when I visit my folks.

***Or Ruby? There’s one lab here on campus that uses Ruby exclusively, but I don’t have any personal experience to judge it one way or the other.


James and the Giant Corn Genetics: Studying the Source Code of Nature
