Thursday 26 August 2010

Version Control for Scientists

So, here I am, working very hard, writing code, making figures, assembling presentations, writing papers, thinking about three different things at once, calibrating my data, recording observations and generally trying to do science. The result of three years' work will be three or four artifacts: a dataset of observations, a couple of papers with a definite view on what that means, and the rest of my thoughts bundled into a big book of a PhD.

Now, the data I collect is hopefully fixed. The ice was 1.25 meters thick. The voltage measured by my logger was 0.027 V; the temperature of the air was -25 °C. Once I've collected the data and entered it into a handy digital format it shouldn't change - I won't be able to collect it again. It's important then for this data to be well structured, to be annotated with how I collected it, and for it to be safely stored forever. This means it must be backed up safely and in lots of places, and that any processing I do doesn't alter any of the raw data - all I can do is read it in for further processing.
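One way to check that the raw data really hasn't changed is to record a checksum for every file once, and compare against it later. This is only a minimal sketch: it assumes the raw files sit together under one directory, and the function names (`sha256_of`, `build_manifest`, `changed_files`) are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Map each file under raw_dir (by relative path) to its checksum."""
    raw_dir = Path(raw_dir)
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(raw_dir, manifest):
    """List files that were added, removed or altered since the manifest was built."""
    current = build_manifest(raw_dir)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

Build the manifest once when the data is entered, keep it somewhere safe, and run `changed_files` before any processing; an empty list means the raw data is exactly as it was collected.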

That data needs to be processed though. Raw readings need to be turned into temperatures through some calibration code, and calculations then turn these temperatures into ice growth rates and heat fluxes. Eventually I want an output of pictures showing what this means, and hard numbers to compare with other people's. I also want to be able to prove to someone that I arrived at these hard numbers through an honest method, and to be able to recreate any figures I make today at some point in the future.

I achieve this by using code to transform my data and to produce all figures, and can in principle do this directly from the raw data (usually I save intermediate stages of the calculations to save time, but I could make these again too). This means I can easily redo a figure if someone points out a mistake, or if it might tell a better story if coupled with other information, or if it just needs to have a different format because last time I put it in a presentation and this time it's going into a paper.

Now, it's all well and good having the code to make the figure today, but what if I modify or add to my code in the future? How can I know exactly what I did today, if I might have added something to my code for something else tomorrow? Partly we can avoid this being a problem by designing the code well, so that it always operates as it used to even as we improve it (but that's a whole different story). Mostly we need to make sure we can somehow get back to older versions of the code we're using, or the paper we're writing. This is the only way we can prove to someone else exactly what we did to produce our final, polished and published paper.

I think it's essential for a scientist to be using a version control system, not just because it offers a way to safely make and remove changes when developing code, but because it offers proof of our methods, even in the future. While I don't expect my work to generate the controversy of Climategate, by using version control it should always be both possible and very easy to defend work I've done, even once I've moved on.

And how do I do this? I use Mercurial with TortoiseHg and Kiln, for which there is a very good tutorial, but more adventurous people might like Git with GitHub, which also fits the bill very nicely. As well as a fix of honesty, I get an easy way to back up my work, to recover deleted paragraphs I realise I liked after all, and a simple way to synchronize my work between three different computers and an online backup (thoughtfully provided by Kiln).
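To tie a figure back to the exact code that made it, one option is to note the repository's changeset id whenever a figure is generated. The sketch below assumes the `hg` command is on the path and that the script runs inside a Mercurial repository; `current_changeset` and `provenance_line` are hypothetical helper names, not part of Mercurial itself. `hg id -i` prints the current changeset id, with a trailing `+` if there are uncommitted changes.

```python
import subprocess
from datetime import datetime, timezone


def current_changeset(repo_dir="."):
    """Ask Mercurial for the working directory's changeset id (`hg id -i`)."""
    result = subprocess.run(
        ["hg", "id", "-i"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def provenance_line(changeset, script_name, when=None):
    """Build a one-line note saying which code version produced an output."""
    when = when or datetime.now(timezone.utc)
    # A trailing "+" means the working copy differs from the committed code.
    note = " (uncommitted changes!)" if changeset.endswith("+") else ""
    return f"{script_name} @ {changeset}{note}, generated {when:%Y-%m-%d %H:%M} UTC"
```

Writing the result of `provenance_line(current_changeset(), "make_figure.py")` into a text file beside each figure (the script name here is just an example) means that `hg update` to that changeset should bring back the code as it was on the day.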

(This post was spurred on by an interesting discussion on the merits or not (I'm with not, if you didn't notice above) of version control of data at http://sunlightlabs.com/blog/2010/we-dont-need-a-github-for-data/.)

