Abstract

Programming for data wrangling and statistical analysis is an essential technical tool of modern epidemiology, yet many epidemiologists receive limited formal training in strategies to optimize the quality of our code. In complex projects, coding mistakes are easy to make, even for skilled practitioners. Such mistakes can lead to invalid research claims that reduce the credibility of the field. Code review is a straightforward technique used by the software industry to reduce the likelihood of coding bugs. The systematic implementation of code review in epidemiologic research projects could not only improve science but also decrease stress, accelerate learning, contribute to team building, and codify best practices. In the present article, we argue for the importance of code review and provide some recommendations for successful implementation for 1) the research laboratory, 2) the code author (the initial programmer), and 3) the code reviewer. We outline one feasible strategy for implementing code review; other processes, including additional practices to improve code quality, can be adapted to the resources and workflows of different research groups. Code review isn’t always glamorous, but it is critically important for science and reproducibility. Humans are fallible; that’s why we need code review.

Editor’s note: An invited commentary on this article appears on page 2178, and the authors’ response is published on page 2180.

Scientific mistakes are costly. In a famous example, economists Reinhart and Rogoff published “Growth in a Time of Debt” in 2010, in which they argued that when national debts approached 90% of gross domestic product, economic growth dropped off sharply (1). This paper was cited by conservative lawmakers across the western world, including Paul Ryan in the United States, to justify austerity policies in response to the Great Recession of 2008 (2, 3). However, Reinhart and Rogoff’s results couldn’t be replicated. The culprit? Data omissions, unconventional weighting procedures, and a coding error (4). (Reinhart and Rogoff responded to these critiques in the New York Times, agreeing there was a coding error but disputing the other 2 points (5).) Bloomberg Businessweek went so far as to call this incident “The Excel Error That Changed History” (6).

Although few individual studies have as much policy impact as “Growth in a Time of Debt,” coding errors are extremely common and contribute to the reproducibility crisis in science (7–10). Even outstanding and highly experienced statistical programmers make mistakes, and much research in the health sciences is led by analysts with limited coding experience. Simple studies can rely on hundreds of lines of code, and more typical research will invoke thousands of lines of code for data management and analysis. The chances of a mistake somewhere in that code are very high. Sometimes the mistakes are conceptual issues, such as merging on the wrong identifiers or using incorrect weights; more often, mistakes are tiny slips, like setting missing values to zero or using a similarly named but incorrect variable. Simple coding mistakes can corrupt results enough to require paper retraction, as in a recent example in which reverse coding the exposure led to the reversal of the effect estimate (11).

Improving the quality of code for data management and analysis should be a high priority for our field. We recommend that research groups mitigate the risk of bugs by adopting a technique from the software industry: the code review. Code review entails a thorough examination of the data cleaning and analysis methods, including incorporation of explicit tests, by a programmer who was not involved in the initial coding. Although code review is just one strategy to enhance the quality of code, it merits special focus because it is a relatively straightforward strategy that most research teams could implement immediately without additional training or tools. In professional software development settings, code review is a near-universal standard and is sometimes mandatory (12, 13). To our knowledge, no formal evaluation of the extent to which code review has been adopted in epidemiology research exists; we conducted an informal Twitter poll, in which 60% of 315 respondents indicated their code is reviewed “never” or “rarely” (Figure 1).

Figure 1. Results from an informal Twitter poll (17) asking, “How often is your analysis code for a paper reviewed by someone else before submission (that is, code in SAS, R, Stata, etc.)?” (n = 315). In our poll, 14% of respondents said their code was always or almost always reviewed, 26% said code was sometimes reviewed, 33% said rarely reviewed, and 27% said never reviewed.

Code review also has secondary benefits. The specter of easy mistakes puts undue pressure on analysts to achieve impossible perfection. This can be especially anxiety-producing for new researchers working on their first few projects. Adopting a standard code review protocol acknowledges that mistakes are an inevitable aspect of writing code and provides a safety net against such mistakes. Code review is a feasible solution to reduce stress and errors.

Studies have shown the code author (that is, the initial programmer) is more thorough when they know their code will be reviewed, and the process increases confidence for both the code author and the code reviewer (14). Additional benefits include better documentation of the code to facilitate future applications, as well as training opportunities, because the code author can learn new coding techniques from the reviewer. Similarly, the code reviewer is exposed to new coding techniques, potentially strengthening their coding skills. In this way, code review can accelerate learning for both the code author and the code reviewer. Code review can also facilitate collaborations and collegiality between research group members. These benefits have been acknowledged in industry (12, 13). For research group leaders, implementing a standard code-review protocol has the potential to improve science, reduce stress on trainees and analysts, accelerate learning, contribute to team building, and, if a style guide is implemented (discussed below), codify best practices; the costs of code review are relatively small compared with the reduction in trainee stress and the improvement in research rigor.

RECOMMENDATIONS FOR SUCCESSFUL IMPLEMENTATION OF CODE REVIEW

There are many approaches to code review, and the best approach must balance the time and personnel effort required to complete the review with the benefits gained from the review. On the basis of the experiences of the 3 authors in various settings, we provide some recommendations for the implementation of code review at 3 levels: 1) for the research laboratory, 2) for the code author, and 3) for the code reviewer. These guidelines are not meant to be prescriptive; they represent what we consider best practices and have evolved over the course of writing this paper.

The research laboratory

For code review to be most effective, we believe it should be implemented at the research group/principal investigator level and should be normative for all papers arising from the research group. For example, following the previously noted paper retraction, new standard operating procedures were put into place that included code review by a second biostatistician or analyst (11). For code review to become normative in epidemiology, the change must be structural: Principal investigators/research group leaders/senior authors need to make code review normative in their laboratory. We also encourage principal investigators to frame code review as a group problem-solving exercise rather than an error-finding exercise (13). Code review procedures are more likely to succeed if they are routinely adopted across all papers and supported by the group leader (12). If code review is adopted only for some papers, it can be perceived as punitive or as indicating distrust of a specific programmer. To the contrary, code review should be recognized as a benefit to the code author and a strategy to enhance reproducibility. It may also be helpful for the principal investigator to observe some review sessions in which the reviewer gives feedback to the code author. This is especially important if there is a potential for bad feelings to emerge.

Research laboratories may consider creating a “style guide” in which they detail the overall structure and format of code and conventions that should be followed within the laboratory (for a sample style guide in academia, see the Appendix after the references; for a sample style guide in industry, see reference 15). A style guide can help by codifying best practices and can facilitate faster reviews by ensuring that everyone follows the same coding format. A style guide can also provide a reference point for reviewers to return code to the code author if the code does not adhere to the established style.

We additionally recommend that authors explicitly note if code has been reviewed in the methods section of the paper (e.g., after noting the software used) to both increase readers’ confidence in the results and help make code review processes normative.

The code author

If laboratories create a style guide, the code author should adhere to the style guide as they write their code. It is also good practice to build tests/checks into the code periodically, as is standard in industry (14). This not only benefits the development process by ensuring incremental correctness of code but also eases the burden for the code reviewer by demonstrating that the code works properly. For example, when recoding a variable, the programmer should perform a crosstab of the old variable and the new variable to confirm that the variable was created properly. Merges and data-set transformations (e.g., wide to long) should be explicitly checked in the code to ensure the sample size is as anticipated, and some individual observations should be tracked through the transformation to confirm the final result matches the goal. The need for checks of this sort can be included in the style guide (e.g., see “Variable Cleaning Conventions” in the Appendix), although defining appropriate tests will depend to a large extent on the research project. Additional checks for especially tricky sections of code, such as loops or data-reshaping steps, should also be incorporated into the code. We acknowledge that some style guide recommendations are arbitrary and are intended to improve consistency rather than reflecting intrinsic superiority of one approach over another (e.g., see standardized naming conventions in the style guide shown in the Appendix).
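
For example, the following is a minimal sketch of such in-code checks in R; the data set names, variable names, and values are entirely hypothetical, and analogous checks can be written in SAS or Stata.

  # Hypothetical data; names and values are illustrative only.
  exposure_dat <- data.frame(id = 1:5, momedu_yrs = c(8, 11, 12, 14, 16))
  outcome_dat  <- data.frame(id = 1:5, outcome = c(0, 1, 0, 1, 1))

  # Recode years of education into three categories, keeping the original variable.
  exposure_dat$momedu_3cat <- cut(exposure_dat$momedu_yrs,
                                  breaks = c(-Inf, 11, 12, Inf),
                                  labels = c("<HS", "HS", ">HS"))

  # Crosstab of the old and new variables to confirm the recode behaved as intended.
  print(table(exposure_dat$momedu_yrs, exposure_dat$momedu_3cat, useNA = "ifany"))

  # Merge the exposure and outcome files and confirm the sample size is as anticipated.
  analytic_dat <- merge(exposure_dat, outcome_dat, by = "id", all.x = TRUE)
  stopifnot(nrow(analytic_dat) == nrow(exposure_dat))

Because the stopifnot() call halts the script with an error if the merge changes the number of rows, the check is repeated automatically every time the code is run.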

It is important to discuss the selection of the code reviewer with paper co-authors. A few criteria can be used to select an appropriate code reviewer for each project: 1) familiarity with the software packages used in the project; 2) familiarity with the data set(s); 3) interest in the research question and in potentially co-authoring the manuscript; and 4) familiarity with the methodologic approach. Typically, the methods and results sections of the paper should be nearly finalized when the code is sent for review so that the reviewer does not have to review multiple iterations. Because code review can be a substantial time commitment, authors may prefer to schedule time to review in advance. Review can typically occur in parallel with the co-authors finalizing the text; however, if major coding errors are identified, the text will presumably change.

The code author should make all relevant data sets and documents available to the code reviewer; this includes premerge data sets so the reviewer can ensure the merges are happening properly, as well as a draft of the methods and results so the reviewer can understand what the code is intended to accomplish. Preparing code for review is an important exercise that reduces the burden on the code reviewer by ensuring the code is streamlined and redundancies are removed. Thorough documentation is essential for smooth code review, although best practices for documentation depend on the coding language and group workflow (12, 14–16). The process of preparing code for review helps identify mistakes even before the review process begins (12).

The code reviewer

The reviewer will often have multiple phases of questions for the code author. It is sometimes expeditious to sit with the code author first for a “tour” of the code before beginning a detailed review. Best practice is for the code reviewer to first confirm that the code author followed the style guide, returning the code to the author for reformatting if not, before continuing with the detailed review.

The reviewer should read the text of the manuscript draft and carefully review each line of code, perform a line-by-line execution, confirm that the implementation matches the methods and results of the paper as reported in the text, and manually code in checks/tests to ensure that the code is performing as intended. The code author should provide examples of tests, but the reviewer may also identify additional tests to ensure the code is operating as expected.
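
Continuing the hypothetical sketch from the previous section, a reviewer might add checks such as the following (the expected sample size would come from the manuscript text; the number and variable name here are illustrative):

  # Hypothetical reviewer-added checks, written in R.
  stopifnot(nrow(analytic_dat) == 5)                # N matches the N reported in the text
  stopifnot(!any(is.na(analytic_dat$momedu_3cat)))  # no unintended missingness after recoding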

When the code does not adhere to the style guide or the documentation is unclear, the reviewer should request edits from the code author. If these are small things, the changes can be requested as a collection at the conclusion of code review. If they appear to be fundamental, it is acceptable to ask for edits before the code review is completed. In general, a collaborative approach between the code author and reviewer is beneficial. It is often difficult for the code author to recognize when important ideas have been omitted from the documentation; requests from the code reviewer can be invaluable and represent an important teaching moment.

Receiving requests to edit programming code can be frustrating. Attention to both the tone of feedback and the power dynamics between the code author and the code reviewer is important. For example, analysis has shown that feedback delivered with a negative tone is rated as less useful than feedback with a positive tone (13).

Because of the substantial work for the code reviewer, our group adopted the norm of inviting reviewers to participate as co-authors on the manuscript; very often the second author role is deemed the most appropriate based on the level of work and other contributions the code reviewer has made to the paper. Although co-authoring entails some additional work, this opportunity can often be valuable for the code reviewer and may increase incentives to do a more thorough review. Ideally, the code reviewer should be an invested co-owner of the quality of the results.

DISCUSSION

We have outlined just one possible approach to code review; many other successful implementations are possible that could improve science and enhance reproducibility, including pair programming, standardized strategies to “test” code, improved coding style, code walkthrough, etc. (12, 14, 15). In code walkthroughs, the code author walks the lead investigator through the code, explaining what is happening in each step, including how the data structure or variable coding has changed at that step. Code walkthrough may be an especially feasible approach for small research teams with only one programmer. The gold-standard code review for reproducibility in academic sciences would involve giving the methods section of the paper to one or more researchers not involved with the project to see if they can reproduce the results exactly. Unfortunately, in applied academic settings, this gold-standard approach is often too costly in terms of time and personnel to be feasible. The gold-standard approach may also provide fewer secondary benefits, such as acceleration of learning and improved collegiality.

Recent work has noted that at Google, code review processes can still break down despite years of refinement (12); although the guidelines we propose may be helpful, they represent a starting point for future iterations. The optimal structure may evolve as research and research teams change with respect to the level of integration and complexity of different projects.

In industry, software tools are often used to facilitate review; these tools can ensure that new code has been reviewed and approved before it is added to large projects, and they facilitate attaching comments to specific lines of code and tracking differences across multiple rounds of review. Sadowski et al. (12) detailed different tools that are used by various companies. These authors also noted that it can be useful to provide context for proposed changes, for example, noting whether the proposed change is essential for the integrity of the analysis, needed for consistency with the style guide, or merely preferred for reasons of clarity or efficiency.

The best practice in industry is to send small, incremental units of code for review, with multiple reviews to complete a major project (12). The industry approach is in contrast to our suggestion of reviewing code when the project is near completion. The industry approach has advantages, such as reduced burden on the code reviewer for each review; however, in the context of an academic manuscript, we typically need to have all the details of the data sets and code in mind at once to understand what the code is doing. Therefore, in academic settings, we suspect it is typically better to perform code review once, at the end of the project.

Although code review has substantial benefits, it also has costs that can hamper widespread adoption. It involves extra work for the code author, who must adopt more rigorous code hygiene and submit to additional process and scrutiny. Code review will also take substantial reviewer time; however, adherence to the style guide by the code author can ensure that best practices are implemented, reducing review time and decreasing the burden on the code reviewer. The process can lead to increased frustration for both the code author and the reviewer, especially if the process breaks down. Clarifying and codifying code-review practices at the research group level through a style guide can help to mitigate some of these challenges. Another potential barrier to widespread adoption is unclear code review practices; we try to alleviate this challenge by providing some guidance on the implementation of code review in this paper. We recognize that if all programmers produced correct code on the first try, code review would be mostly cost and little benefit, although benefits could still accrue in the form of reduced anxiety for the code author, better documentation, and improved collegiality. In some cases, code review will not find errors; this could be either because the code is correct or because the reviewer failed to identify errors. In an empirical analysis of error detection in industry, one company found that a median of 3 errors were detected per review, whereas at another company, 87% of reviews found no errors and only 7% of reviews found more than 2 errors (16). Code review is not a cure-all, but it can still be beneficial, especially if implemented systematically; weighing its costs and benefits requires recognizing that all programmers write bugs and that bugs can be expensive in time, cost, and reputation.

Even as the importance of reproducible research gains increasing attention, training on best practices for reproducible research is limited. We consider code review an invaluable and relatively easily implemented technique that requires no additional training or tools and that has many benefits beyond the potential to improve science, such as reducing stress on trainees and analysts, accelerating learning, contributing to team building, and, if a style guide is created, providing a way to codify best practices. All the advantages of the most cutting-edge statistical methods can be undone by a tiny mistake, such as a typo in a merge statement. Code review isn’t always glamorous, but as the Reinhart and Rogoff episode demonstrates, it is critically important for science and reproducibility. Humans are fallible: If your research group uses code written by humans, code review is a must.

ACKNOWLEDGMENTS

Author affiliations: Department of Family and Community Medicine, University of California San Francisco, San Francisco, California, United States (Anusha M. Vable); Google Inc., Mountain View, California, United States (Scott F. Diehl); and Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, California, United States (M. Maria Glymour).

We thank the following collaborators for feedback on the creation of the style guide and for an invigorating discussion on best coding practices: Scott C. Zimmerman, Dr. Sarah Ackley, Dr. Kate A. Duchowny, Chloe W. Eng, Dr. Megha Mehrotra, Dr. Ellicott C. Matthay, Kristina V. Dang, and Dr. Elizabeth Rose Mayeda.

Conflict of interest: none declared.

REFERENCES

1. Reinhart CM, Rogoff KS. Growth in a time of debt. Am Econ Rev. 2010;100(2):573–578.

2. Krugman P. The Excel depression. New York Times. https://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html. Published April 18, 2013. Accessed June 5, 2019.

3. Cassidy J. The Reinhart and Rogoff controversy: a summing up. New Yorker. https://www.newyorker.com/news/john-cassidy/the-reinhart-and-rogoff-controversy-a-summing-up. Published April 2013. Accessed June 5, 2019.

4. Herndon T, Ash M, Pollin R. Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge J Econ. 2014;38(2):257–279.

5. Reinhart CM, Rogoff KS. Reinhart and Rogoff: responding to our critics. New York Times. https://www.nytimes.com/2013/04/26/opinion/reinhart-and-rogoff-responding-to-our-critics.html. Published April 25, 2013. Accessed June 5, 2019.

6. Coy P. FAQ: Reinhart, Rogoff, and the Excel error that changed history. Bloomberg Businessweek. https://www.bloomberg.com/news/articles/2013-04-18/faq-reinhart-rogoff-and-the-excel-error-that-changed-history. Published April 18, 2013. Accessed June 5, 2019.

7. Ioannidis JPA. How to make more published research true. PLoS Med. 2014;11(10):e1001747.

8. Collins FS, Tabak LA. NIH plans to enhance reproducibility. Nature. 2014;505(7485):612–613.

9. Lash TL, Collin LJ, VanDyke ME. The replication crisis in epidemiology: snowball, snow job, or winter solstice? Curr Epidemiol Rep. 2018;5:175–183.

10. Lash TL. The harm done to reproducibility by the culture of null hypothesis significance testing. Am J Epidemiol. 2017;186(6):627–635.

11. Aboumatar H, Wise RA. Notice of Retraction. Aboumatar et al. Effect of a program combining transitional care and long-term self-management support on outcomes of hospitalized patients with chronic obstructive pulmonary disease: a randomized clinical trial. JAMA. 2018;320(22):2. JAMA. 2019;322(14):1417–1418.

12. Sadowski C, Söderberg E, Church L, et al. Modern code review: a case study at Google. In: Proceedings of the International Conference on Software Engineering, Software Engineering in Practice. Gothenburg, Sweden: ICSE; 2018.

13. Bosu A, Greiler M, Bird C. Characteristics of useful code reviews: an empirical study at Microsoft. In: Proceedings of the International Conference on Mining Software Repositories; 2015.

14. MacLeod L, Greiler M, Storey M-A, et al. Code reviewing in the trenches. IEEE Softw. 2018;35(4):34–42.

15. Google. Google’s R style guide. https://google.github.io/styleguide/Rguide.html. Accessed January 6, 2020.

16. Rigby PC, Bird C. Convergent contemporary software peer review practices. In: Proceedings of the European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Saint Petersburg, Russia: ESEC; 2013:202–212.

17. @AnushaVable. Dear Epidemiologists & #EpiTwitter, @MariaGlymour and I are wondering: How often is your analysis code for a paper reviewed by someone else before submission (that is, code in SAS, R, Stata, etc.)? https://twitter.com/AnushaVable/status/1224460049830928384. Posted February 3, 2020. Accessed July 9, 2021.

A. APPENDIX

A.1. Example Style Guide for Coding in Academia to Facilitate Code Review

This example style guide is for a relatively straightforward project of data cleaning and analysis. This style guide may not be generalizable to other settings, for example, simulation analyses or very complex analyses. These guidelines are not meant to be prescriptive; they represent what we consider best practices and have evolved over the course of writing this paper.

A.2. Overall code format

  1. Code should have a header detailing the project, the code author(s), the code reviewer, what happens in the program (ideally with approximate line numbers), and areas to which the code reviewer should pay particular attention. Code should be well explained so that the code reviewer never needs to wonder what a section of code is doing. Code should include section headers to facilitate review (e.g., merge data sets, clean variables, create analytic sample, main paper analyses, appendix analyses).

  2. Code should be ordered in the same way it is presented in the manuscript to facilitate review of both the code and the manuscript; all of the sections below should have section headers so the reviewer can find them easily. Following this format will help ensure that code is streamlined and that redundancies are removed; there should be no redundancies in the code that is sent for review. One approach is to do the data cleaning and create the analytic data set in one file and then perform analyses in another file; this approach is fine as long as the order of the files is clear to the code reviewer. Code should generally be ordered as follows, although we acknowledge that this ordering is not always appropriate. When there are deviations from this order, they should be noted and explained.

    • Clear your workspace at the start of the script.

    • Import and merge relevant data sets.

    • Clean the variables in the same order they are presented in the methods section of the manuscript, typically:

      • Exposure

      • Outcome

      • Effect modifiers

      • Other covariates (ideally cleaned in the order they are listed in the manuscript). An approach that can facilitate review is to include all the covariates in a local (Stata), macro (SAS), etc., so the code reviewer does not need to ensure all the confounders are included in each of the analytic models. This approach can also help avoid inconsistencies if the adjustment variables change over the course of the analysis (an R analogue appears in the sketch after this list).

    • Define the final analytic sample, detailing who was excluded from the analytic sample and why. If data are clustered (e.g., repeated measures on the same person), describe the entire data structure. A useful approach for defining the analytic sample can be to create a variable called “eligible” that is 0 for people who are not eligible and 1 for people who are eligible. This facilitates analysis of how the analytic sample differs from the overall sample, which can be useful to include in appendix tables.

    • Conduct the data analysis in the order the tables and figures are presented in the results section. The code author should confirm that every analysis includes the same number of observations, unless deviations are specifically justified and noted, to facilitate review.

    • Conduct the appendix analyses in the order they are presented in the appendix.

  3. Checks to ensure the code is working as anticipated should be included in the code script so they can be rechecked every time the code is run. This will also facilitate review by demonstrating to the code reviewer that the code is running properly. For example, when reshaping data from long to wide or vice versa, ensure that the number of unique respondents is as expected, and that the number of observations per respondent is within the expected range (see the sketch after this list).

  4. Remove old or experimental code that is no longer relevant before review. All code that is reviewed should be relevant for the project.

  5. Code should run completely from top to bottom without errors or intervention from the programmer.
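
The following short script (in R; all project details, variable names, and data values are hypothetical) sketches several of the conventions above: a header, clearing the workspace, a reshape with an explicit in-code check, an explicit eligibility indicator, and a covariate list that is defined once and reused across models.

  ## --------------------------------------------------------------------
  ## Project:   Hypothetical example project
  ## Author:    [code author];  Reviewer: [code reviewer]
  ## Contents:  create and reshape data; define analytic sample; main analysis
  ## Review notes: please pay particular attention to the reshape step
  ## --------------------------------------------------------------------

  rm(list = ls())  # clear the workspace at the start of the script

  # Hypothetical long-format data: two waves per respondent.
  long_dat <- data.frame(id       = rep(1:8, each = 2),
                         wave     = rep(1:2, times = 8),
                         age      = c(70, 72, 65, 67, 80, 82, 75, 77,
                                      68, 70, 90, 92, 72, 74, 66, 68),
                         exposure = rep(c(0, 1, 0, 1, 1, 0, 1, 0), each = 2),
                         outcome  = c(0, 0, 0, 1, 1, 1, 0, 1,
                                      0, 0, 1, 1, 1, 0, 0, 0))

  # Reshape long to wide; confirm one row per unique respondent.
  wide_dat <- reshape(long_dat, idvar = "id", timevar = "wave",
                      v.names = c("age", "exposure", "outcome"), direction = "wide")
  stopifnot(nrow(wide_dat) == length(unique(long_dat$id)))

  # Define the analytic sample with an explicit 0/1 eligibility indicator.
  wide_dat$eligible <- as.integer(wide_dat$age.1 >= 70)
  analytic_dat <- wide_dat[wide_dat$eligible == 1, ]

  # Keep the adjustment set in one place so every model uses the same covariates.
  covariates <- c("age.1")  # in a real project, the full confounder list goes here once
  fit <- lm(reformulate(c("exposure.1", covariates), response = "outcome.2"),
            data = analytic_dat)  # model shown only to illustrate reusing the covariate list
  print(summary(fit))

Because the checks are written into the script, they are rerun every time the code runs, and a reviewer can confirm at a glance that the reshape and the analytic sample definition behaved as intended.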

A.3. Variable cleaning conventions

  1. Use intuitive, descriptive variable names whenever possible.

  2. Dichotomous variables should be named so that 0 = no and 1 = yes; that is, create a variable named “male” or a variable named “female” instead of a variable named “sex.”

  3. For transformation of variables, attach a suffix specifying the transformation. Some possibilities are as follows:

    • To center age at 70, the centered variable could be age_c70.

    • To z score the Center for Epidemiological Studies-Depression measure, the standardized variable could be cesd_std.

    • To trichotomize mother’s educational level (assuming the variable is named momedu_yrs), the trichotomized variable could be momedu_3cat.

    • To take the natural log of income, the logged variable could be income_ln.

    • To rescale age in years to age in decades, the rescaled variable could be age_decades.

  4. Use standardized naming conventions in longitudinal data to indicate assessment wave or year collected; for example, for age data collected in 1998, the variable should be named age_1998 or, if 1998 was the year of the second wave of data collection, age_w2.

  5. When recoding a variable, the following approach should be used:

    • Do not recode over original variables; always generate a new variable.

    • Name the new variable something descriptive.

    • After the variable is recoded, examine the cross-tab with the old variable or list a few observations showing both the old and new variables to ensure the recoding has been done properly. This should be included in the code script to demonstrate to the code reviewer that the recode has been done properly.

    • Apply a descriptive label to the variable (e.g., if income was inflation-adjusted to 2016 dollars, the label may be “income, inflation-adjusted to 2016 dollars”).
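
A brief sketch of these recoding conventions in R (the variable names, values, and inflation factor are hypothetical):

  # Hypothetical data; the inflation adjustment factor below is illustrative only.
  dat <- data.frame(id = 1:5, income = c(30000, 52000, 75000, 41000, 98000))

  # Generate a new, descriptively named variable rather than recoding over the original.
  dat$income_2016 <- dat$income * 1.03

  # List a few observations showing both the old and new variables.
  print(head(dat[, c("id", "income", "income_2016")]))

  # Apply a descriptive label (stored here as a variable attribute).
  attr(dat$income_2016, "label") <- "income, inflation-adjusted to 2016 dollars"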
