coding screens

I work in a neuroscience lab. Yet, during most daytime hours (and some night hours), I sit in front of a computer and write code. I guess you could group this work into three domains:

  • Study Preparation: For example, writing protocols and scripts to run experiments, or low-level code to interface with stimulus presentation and measurement devices.
  • Data Analysis: For example, writing code for digital signal processing, statistical tests and visualization.
  • Publications: For example, writing a draft for a presentation, a grant application, or a manuscript for publication.

Sure, quality is important in all three domains. And even in academia, informal reviews occur all the time. Yet, I have the impression that quality is checked formally almost exclusively for the last domain. Most scientists send their manuscript drafts to their co-authors to check for spelling, style or grammar errors. Few scientists do the same for their code. It seems as if colleagues, funding agencies and peer reviewers are somehow expected to trust that the code has no errors.

It’s like trusting your car dealer’s promise that the car meets your requirements: No need to inspect the engine or take it for a test drive!

Why we miss errors

We all know that code is rarely free of errors. Missed errors will cost more time and effort, not to mention prestige and ego, when they have to be fixed at a later stage.

I’ll just mention two public cases. Edward Awh and his co-authors had to retract a paper because they trusted a function’s output and performed the Hilbert transformation along the wrong dimension.

Specifically, there was a matrix transposition error in the code […]. The data matrix was oriented correctly for the call to eegfilt, but the output of the call to eegfilt was not correctly transposed in the standard Matlab format before passing into the built-in Matlab ‘hilbert’ function. (1)

Dan Bolnick talks openly about how he had to retract a paper because he trusted a function’s output instead of his visual inspection of the data:

I clearly thought that the Mantel test function in R […] reported the cumulative probability rather than the extreme tail probability […] Didn’t look significant to my eye, but it was a dense cloud with many points so I trusted the statistics and inserted the caveat “weak” into the title. I should have trusted my ‘eye-test’. (2)

In my experience most errors stay undetected because

  • your code runs without runtime errors
  • you trust a function from a toolbox
  • the outcome of the code is a statistically significant finding.

Almost any code can contain errors. If we keep not testing our code, we are likely to miss the severe errors that force a retraction.
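The first point, code that runs without runtime errors and is still wrong, is easy to reproduce. The following is a minimal sketch in plain Python, not the code from either retracted paper: the helper names (demean_rows, transpose, all_zero) and the tiny mock recording are made up for illustration. It applies the same per-row operation to a correctly and a wrongly oriented matrix. Neither call raises an exception, and only the known-answer check reveals the difference.

```python
def demean_rows(matrix):
    """Subtract each row's mean from that row (rows are assumed to be channels)."""
    return [[value - sum(row) / len(row) for value in row] for row in matrix]


def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*matrix)]


def all_zero(matrix):
    """Return True if every entry of the matrix is exactly zero."""
    return all(value == 0.0 for row in matrix for value in row)


# Hypothetical mock recording: 3 channels x 4 samples, each channel constant.
# Demeaning every channel over time must therefore return only zeros.
channels_by_samples = [[1.0, 1.0, 1.0, 1.0],
                       [2.0, 2.0, 2.0, 2.0],
                       [3.0, 3.0, 3.0, 3.0]]

# Correct orientation: the operation runs along the time dimension.
correct = demean_rows(channels_by_samples)

# Wrong orientation: the same operation runs across channels instead.
# No exception is raised, but the numbers are wrong.
wrong = transpose(demean_rows(transpose(channels_by_samples)))

print(all_zero(correct))  # True
print(all_zero(wrong))    # False
```

The mock data is deliberately asymmetric (constant over time, different across channels), so that mixing up the two dimensions cannot go unnoticed.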

How can we find errors?

A natural strategy would be to adopt techniques and best practices proven in software engineering. Probably the most widely used approaches are code review and unit tests. But:

The most interesting facts that this data reveals is that the modal rates don’t rise above 75 percent for any single technique and that the techniques average about 40 percent. […] The strong implication is that if project developers are striving for a higher defect-detection rate, they need to use a combination of techniques. (3).

And consider that a lot of scientific coding is done under very different requirements compared to industrial or traditional software engineering:

The whole “get data, extract interesting insights, publish, move on” mindset that permeates science make it really hard to justify the time needed for a properly engineered piece of code. (4)

In my opinion, it is therefore important to focus not only on which methods work best in software engineering, but also on how we can apply them in science labs.

Statistical hypotheses

I believe the key difference between software engineering and science is that the latter has statistical hypotheses instead of detailed requirements. The gist of testing code is to run it with a defined input and check whether what it returns meets the requirements. But if one only has a more or less vague statistical hypothesis, it is not so clear when outcome and requirement disagree. In fact, we rarely test whether the output meets the requirements, but rather whether it fits our hypotheses.

One solution is therefore to use mock-up data, where hypothesis and requirements converge. That means we have to formalize the hypothesis we want to test and create mock-up data. Think about what your code should return given this data. Consider including a non-significant dataset. Explore whether your code behaves according to these requirements.

Yet even then, we only test from data to analysis outcome. Treating the whole analysis code as one big black box means that errors within the code might roughly cancel each other out or hide behind the randomness inherent in statistical hypotheses, so we might not be able to spot them. Running tests on smaller parts of the code is tedious, yet often worthwhile. Still, it does not prevent errors in how the small units are stitched together. In my experience, asking a colleague to look at your code and then tell you what they think it is supposed to do is a nice technique for spotting such stitching errors.
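The mock-up data idea can be sketched in a few lines. This is only an illustration under stated assumptions: the analysis is a simple two-sided permutation test on a difference of means, the function name permutation_p_value is hypothetical, and all data values are made up. One dataset has a large built-in effect, the other has nearly identical group means, so we know beforehand what a correct analysis should report.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed keeps the test repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Mock dataset with a built-in effect: the groups are fully separated.
effect_a = [float(i) for i in range(20)]           # 0 .. 19
effect_b = [float(i) + 100.0 for i in range(20)]   # 100 .. 119

# Mock dataset without an effect: interleaved values, nearly equal means.
null_a = [float(i) for i in range(0, 40, 2)]       # even numbers
null_b = [float(i) for i in range(1, 40, 2)]       # odd numbers

p_effect = permutation_p_value(effect_a, effect_b)
p_null = permutation_p_value(null_a, null_b)

# Here hypothesis and requirement converge, so we can assert the outcome.
assert p_effect < 0.05, "analysis misses an effect that was built into the data"
assert p_null > 0.05, "analysis reports an effect that is not in the data"
```

The same pattern applies to smaller units: helpers like mean can get their own known-answer checks, which narrows down where an error sits when the end-to-end check fails.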

Analytical flexibility

Science also has a (very bad) tradition of analytical flexibility. Much has been written already on this topic (5). It means that you reanalyze your data until something is significant, then you publish. The tradition is stubborn and hard to kill, especially if you believe data is costly and publications are the gatekeeper to tenure (6). P-hacking is certainly bad for statistical inference. It will also make your code more prone to errors, for two reasons. First, you are not actively looking for possible errors once the outcome is fine. Second, rewriting the script over and over again can introduce subtle errors, like overwritten variables, false initialization, and steps executed in the wrong order.

The solution sounds simple: Reduce the amount of analytical flexibility. Use version control, and make your code public, e.g. on GitHub. If you feel it helps, preregister your study. At the very least, attempt to write the complete analysis scripts before you run them on the data. You can do this by using mock data, or by running them first on only a single subject of the study. At this point, you do not yet care about the outcome. This means you can keep the focus on finding errors.

Reduce Dependency

Things like formal code review and unit tests are, in my experience, underused in science. There is a tendency to rely on external toolboxes and built-in functions, especially if they create beautiful figures. This also gives a false sense of safety. By citing the toolbox in the paper, we feel justified in our assumption that it runs correctly, and this can be doubly problematic if we do not really understand what the software does. Additionally, if we have too many dependencies, sharing code becomes difficult. It is worse if the software we used is proprietary. And if one cannot share the code, how can somebody else review and test it?

I try to make my code as deployable as possible. That means I aim to use open languages, e.g. Python, and to write scripts and modules that are as self-contained as possible. A first step in this direction is to run scripts not interactively, but standalone. If you use Python, try running python your_script.py from your terminal. If you are using Matlab, try matlab -nodesktop -nosplash -r script (omit the .m for your script.m). If it runs without runtime errors, at least you know that it does not depend on variables that were initialized somewhere else in an interactive session. The goal is to write a script that still works if it is run on another computer, and maybe half a year later. This also pays off in the long run, for example when you have to fulfill a reviewer’s requests.
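A self-contained script could look roughly like the sketch below. The file name your_script.py, the constants and the toy analysis are all hypothetical. The point is only the structure: every input is defined inside the file, the work happens in functions, and nothing depends on variables left over from an interactive session.

```python
"""your_script.py: hypothetical, self-contained toy analysis."""
import statistics

# All parameters and inputs are defined here, inside the script.
THRESHOLD = 1.0
VALUES = [-1.0, 0.5, 1.5, 2.0]


def analyze(values, threshold):
    """Return the mean of all values at or above the threshold."""
    kept = [value for value in values if value >= threshold]
    return statistics.mean(kept)


def main():
    """Run the complete analysis from start to finish."""
    result = analyze(VALUES, THRESHOLD)
    print(f"mean of values >= {THRESHOLD}: {result}")
    return result


# Executed when the file is run from a terminal, not when it is imported.
if __name__ == "__main__":
    main()
```

Because the logic lives in functions, a colleague can also import the file and test analyze with their own mock data without triggering the whole analysis.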

Readable Code

Science inherits the dilemma inherent in writing any code: You have to choose between writing code leaning more towards readability, towards encapsulation, towards maintainability or towards modularity. In other words: should we write code to ease peer review or to make it run faster? Should it be possible to verify the code by reading it, or is it OK to use complex code for a performance gain? How should we comment code? Consider furthermore the environment in which science will be conducted in the future. Sharing code and data might soon become an expectation. It might even go so far that notebooks and live scripts become publications in their own right (citable and with a DOI). There is no good answer to these questions, as they require striking a balance between opposing requirements.

Conclusion

Before I conclude, let me add some remarks, which I developed in a conversation with Laurence Billingham. Not all PhD students stay forever. If they leave a lab, you inherit the code. If you enforce that all code is readable and tested, you might prevent some embarrassing problems later. Consider also that scientists love good reasons. Frame testing as a kind of scientific method for reasoning about the code. Remind them that they already accept having their papers reviewed. And code is just another kind of text. Also, not everyone will stay in science. Learning some software engineering skills might make it easier to transition into industry.

In short: Small errors can wreck a study and force the retraction of a paper. Explore best practices used in software engineering and adapt them to the scientific environment to control and reduce the error rate.