I work in a neuroscience lab. Yet for most of the day (and some of the night), I sit in front of a computer and write code. I guess you could group this code into three domains:
- Study Preparation: This can be workspace definitions, protocols and scripts to run experiments, or low-level code to interface with measurement devices.
- Data Analysis: This can be code for, e.g., digital signal processing, statistical tests and visualization. It also includes proper data organization.
- Publications: This can be a draft for a presentation, a grant application, or a manuscript for publication.
Quality is essential in all three domains. And granted, in academia, informal reviews occur all the time. Yet, in my experience, quality is checked formally almost exclusively in the last domain. Most scientists circulate their manuscript draft among the authors to check for spelling, style or grammar errors. Few scientists do the same for their code. Colleagues, funding agencies and peer reviewers are somehow expected to trust what’s written in the publications, and to trust that the code does what the text promises. We all know this is often not the case.
It’s like trusting your car dealer’s promise that the car meets your requirements: No need to inspect the engine or take it for a test drive!
Why we miss errors
But this approach can cause big errors! Even if you’re relaxed about prestige and ego, they’ll cost you time and money.
Consider that Awh had to retract a paper because they trusted a function’s output and performed the Hilbert transformation along the wrong dimension:
Specifically, there was a matrix transposition error in the code […]. The data matrix was oriented correctly for the call to eegfilt, but the output of the call to eegfilt was not correctly transposed in the standard Matlab format before passing into the built-in Matlab ‘hilbert’ function. (1)
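A toy illustration of this kind of orientation error, and of how a known-answer test exposes it, could look like the sketch below. It is not the code from the retracted paper. It is written in Python instead of Matlab, uses a slow hand-written DFT as a stand-in for a library Hilbert function, and all function names and the mock data are my own assumptions. The idea: a unit-amplitude sinusoid must have an envelope of 1, so any other result means the transform ran along the wrong dimension.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a naive DFT (slow, but dependency-free)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist for even n), double positive frequencies, drop negative ones.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
    for k in range(1, (n + 1) // 2):
        weights[k] = 2.0
    spectrum = [s * w for s, w in zip(spectrum, weights)]
    # Inverse DFT back to the time domain.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def envelope_per_row(matrix):
    """Amplitude envelope, computed separately along each row of the matrix."""
    return [[abs(z) for z in analytic_signal(row)] for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def max_deviation_from_one(envelope):
    return max(abs(value - 1.0) for row in envelope for value in row)


# Mock data (channels x samples): unit-amplitude sinusoids with a whole number
# of cycles per window, so the true envelope is exactly 1 everywhere.
n_channels, n_samples = 4, 64
data = [
    [math.cos(2 * math.pi * (ch + 2) * t / n_samples + ch) for t in range(n_samples)]
    for ch in range(n_channels)
]

correct = envelope_per_row(data)           # transform runs along time
wrong = envelope_per_row(transpose(data))  # transform runs across channels

print("max deviation, correct orientation:", max_deviation_from_one(correct))
print("max deviation, wrong orientation:  ", max_deviation_from_one(wrong))
```

Both calls run without any runtime error, which is exactly the problem. Only the comparison against the known envelope tells the two orientations apart.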
Bolnick talks openly about how he had to retract a paper because he trusted a function’s output instead of his visual inspection of the data:
I clearly thought that the Mantel test function in R […] reported the cumulative probability rather than the extreme tail probability […] Didn’t look significant to my eye, but it was a dense cloud with many points so I trusted the statistics and inserted the caveat “weak” into the title. I should have trusted my ‘eye-test’. (2)
And recently I had a similar experience, luckily before submission:
I upgraded my machine from Win7 to Win10, which meant upgrading Matlab and (an ancient) Fieldtrip to their most recent versions. This made a significant cluster analysis result in an EEG source reconstruction study disappear. The same code did not reproduce the result anymore. Not keeping my dependencies up to date cost me a lot of time.
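One cheap safeguard against this kind of surprise is to store a snapshot of the software environment next to every result, so that a changed result can be traced to a changed dependency. The following is a minimal sketch in Python, not what I actually ran in Matlab. The function names `snapshot_environment` and `changed_packages` and the package list are assumptions for illustration.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages):
    """Collect interpreter, OS and package versions to store with a result."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def changed_packages(old, new):
    """Map each package whose version differs to an (old, new) tuple."""
    names = set(old["packages"]) | set(new["packages"])
    return {
        name: (old["packages"].get(name), new["packages"].get(name))
        for name in sorted(names)
        if old["packages"].get(name) != new["packages"].get(name)
    }


# Hypothetical package list; replace it with whatever your analysis imports.
snapshot = snapshot_environment(["numpy", "scipy"])
print(json.dumps(snapshot, indent=2))
```

Comparing the stored snapshot with the current one via `changed_packages` before rerunning an old analysis at least tells you which upgrade to suspect first.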
In my experience, most undetected errors occur because:
- the code runs without runtime errors
- you trust a function’s output
- the output is significant
The conclusion: Any code can contain errors. If you test your code using only these three criteria, you are likely to miss the severe errors that might force you to retract.
How can we find errors?
The obvious strategy seems to be to adopt proven techniques and best practices from software engineering. Whole books have been written on how to find and remove errors in code. The most widely used approaches are code review and unit tests, but:
The most interesting facts that this data reveals is that the modal rates don’t rise above 75 percent for any single technique and that the techniques average about 40 percent. […] The strong implication is that if project developers are striving for a higher defect-detection rate, they need to use a combination of techniques. (3).
Now consider that scientific coding is done under very different requirements compared to industrial or traditional software engineering:
The whole “get data, extract interesting insights, publish, move on” mindset that permeates science make it really hard to justify the time needed for a properly engineered piece of code. (4)
In my opinion, it is therefore important to focus not only on which methods work best in software engineering, but also on how we can apply them in science labs.
I believe the key difference between the environments is that science tests statistical hypotheses. The key approach in testing your code is to run it with a defined input and check whether the output matches your expectations. But if you write your code and test it only on your recorded data, you cannot really test whether its output meets the requirements, because you do not know the true result for that data. This means that to properly test your code, you have to use mock-up data. Therefore, we should think hard about the hypothesis we want to test, and create mock-up data for several configurations. Think about what your code should return for these different datasets. Do not forget to include a non-significant dataset. Explore whether your code behaves according to these expectations.
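To make this concrete, here is a minimal sketch of such a mock-data test in Python. The analysis under test is a simple two-sided permutation test on the difference of group means, which I chose only as a stand-in for your real analysis. The function names, group sizes, effect size and thresholds are all assumptions. One dataset has a large built-in effect, the other reuses the control values so that the true difference is exactly zero.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction: a permutation p-value should never be exactly zero.
    return (hits + 1) / (n_permutations + 1)


def make_mock_groups(effect, n=20, seed=1):
    """Mock data with a known effect: 'treated' is 'control' shifted by `effect`."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [value + effect for value in control]
    return control, treated


# Dataset 1: a huge, built-in effect. The test must come out clearly significant.
control, treated = make_mock_groups(effect=5.0)
p_effect = permutation_p_value(control, treated)

# Dataset 2: no effect at all. The test must not come out significant.
control, treated = make_mock_groups(effect=0.0)
p_null = permutation_p_value(control, treated)

print("p with built-in effect:", p_effect)
print("p without effect:      ", p_null)
assert p_effect < 0.01, "analysis misses an effect that is known to be there"
assert p_null > 0.5, "analysis reports an effect that is known to be absent"
```

A check like this also reveals what a p-value function actually reports. A function that returned the cumulative instead of the tail probability would fail the first assertion immediately.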
Science also has a (very bad) tradition of analytical flexibility. Much has already been written on this topic (5). It means that you reanalyze your data until something is significant, and then you publish. The tradition is stubborn and hard to kill, especially if you believe data is costly and publications are the gatekeeper to tenure (6). P-hacking is certainly bad for statistical inference. It will also make your code more prone to errors, because you are not actively looking for possible errors (no proper code review) and it is difficult to create the mock data for your fancy analysis (no proper unit test). The solution is simple: Do not indulge in analytical flexibility. If it helps, preregister your study. Enforce version control. Attempt to code the analysis scripts before you actually record any data. After all, this forces you to think well ahead and to test with mock data, where you know the expected result.
Things like formal code review and unit tests are, in my experience, underused in science. There is a tendency to rely on external toolboxes and built-in functions, especially if they create beautiful figures. This also gives a false sense of safety. By citing the toolbox in the paper, you might feel allowed to assume it runs correctly. This can be doubly problematic if you do not really understand what the software does (which might be more of an issue for student researchers running GUI-based analyses, as in SPSS). Additionally, if you have too many dependencies, sharing code becomes difficult. This would detrimentally affect the reviewability of the code. Currently I have no really good solution to this problem. Maybe sharing the toolboxes? Switching from closed to open software? But have you ever tried signal processing in R? Using virtualization?
Additionally, science inherits the internal dilemma between writing code for readability and writing it for encapsulation and maintainability. In other words: should we write code for peer review or for the machine? Should it be possible to verify the code by reading the code at any time, or is it better to outsource script chunks to readable functions? How should we comment code? Consider furthermore the environment in which science will be conducted in the future. Sharing code and data might soon become an expectation. It might even go so far that notebooks and live scripts become publications in their own right (citable and with a DOI).
Before I conclude, let me add some remarks, which I developed in a conversation with Laurence Billingham. Not all PhD students stay forever. If they leave a lab, you inherit their code. If you enforce that all code is readable and tested, you might prevent some embarrassing problems later. Consider also that scientists love good reasons. Frame testing as a kind of scientific method for reasoning about the code. Remind them that they already accept having their papers reviewed. And code is just another kind of text. Also, not everyone will stay in science. Learning some software engineering skills might lower the barrier for PhD students with coding experience to transition into industry.
Well, my labmates and I agreed that we will try to implement some of the best practices and see how they work out for a science lab. I believe we’ll start with more formal code review sessions and unit tests and see where the process leads us from there. I will keep you updated!
In short: Small errors can wreck your study and force you to retract a manuscript. Explore best practices used in software engineering and adapt them for the scientific environment to control and reduce the error rate.