Computational reproducibility of Jupyter notebooks from biomedical publications

Abstract

Background
Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications.

Approach
We address computational reproducibility at 2 levels: (i) using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks associated with publications indexed in the biomedical literature repository PubMed Central. We identified such notebooks by mining the article's full text, trying to locate them on GitHub, and attempting to rerun them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. (ii) This study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over the course of 2 years, during which the corpus of Jupyter notebooks from articles indexed in PubMed Central has grown in a highly dynamic fashion.

Results
Out of 27,271 Jupyter notebooks from 2,660 GitHub repositories associated with 3,467 publications, 22,578 notebooks were written in Python, including 15,817 that had their dependencies declared in standard requirement files and that we attempted to rerun automatically. For 10,388 of these, all declared dependencies could be installed successfully, and we reran them to assess reproducibility. Of these, 1,203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions.

Conclusions
We zoom in on common problems and practices, highlight trends, and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.


Introduction
Many factors contribute to the progress of scientific research, including the precision, scale, and speed at which research can be performed and shared and the degree to which research processes and their outcomes can be trusted (Siebert et al., 2015; Contera, 2021). This trust, in turn, and the credibility that comes with it, are a social construct that depends on past experience or proxies to it (Gray et al., 2012; Kroeger et al., 2018; Jamieson et al., 2019). A good proxy here is reproducibility, at least in principle (Hsieh et al., 2018): if a study addressing a particular research question can be re-analyzed and that analysis leads to the same conclusions as the original study, then these conclusions can generally be more trusted than if the conclusions differ between the original and the replication study.

Reproducibility issues in contemporary research
Over recent years, the practical replicability of published research has come into focus and turned into a research area in and of itself (Peng, 2015; Samuel and König-Ries, 2021). As a result, systematic issues with reproducibility have been the subject of many publications in various research fields as well as prominent mentions in the mass media (The Economist, 2013). These research fields range from psychology (Simmons et al., 2011) to cell culture (Hussain et al., 2013; Bairoch, 2018) to ecology (Kelly, 2019), geosciences (Ledermann and Gartner, 2021) and beyond and include software-affine domains such as health informatics (Coiera et al., 2018), human-computer interactions (Hinsen, 2018), artificial intelligence (Hutson, 2018), software engineering (Shepperd et al., 2018) and research software (Crick et al., 2017). This is often framed in terms of a "reproducibility crisis" (Baker, 2016), though that may not necessarily be the most productive approach to addressing the underlying issues (Hunter, 2017; Fanelli, 2018; Guttinger, 2020). In more practical terms, Näpflin et al. (2019) observe that "appropriate workflow documentation is essential".

Terminology
Within this broader context, distinctions between replicability, reproducibility, and repeatability are often important or even necessary (Meng, 2020) but not consistently made in the literature (Plesser, 2017). A potential solution to this confusion is the proposed distinction (Goodman et al., 2016) between Methods reproducibility (providing enough detail about the original study that the procedures and data can be repeated exactly), Results reproducibility (obtaining the same results when matching the original procedures and data as closely as possible) and Inferential reproducibility (leading to the same scientific conclusions as the original study, either by reanalysis or by independent replication).
In the following, we will concentrate on "Methods reproducibility in computational research", i.e. using the same code on the same data source. For this, we will use the shorthand "Computational reproducibility". In doing so, we are conscious that the "same code" can yield different results depending on the execution environment and that the "same data source" might actually mean different data if the data source is dynamic or if the code involves manipulating the data in a way that changes over time. We are also aware that the shorthand "Computational reproducibility" can also be applied, e.g., to "Results reproducibility in computational research" in cases where the algorithm described for the original study was re-implemented in a follow-up study. For instance, Burlingame et al. (2021) were striving for Results reproducibility when they re-implemented the PhenoGraph algorithm, which originally only ran on CPUs, such that it could be run on GPUs and thus at higher speed. However, Results reproducibility is not the focus of our study.

Computational reproducibility in biomedical research
In light of the reproducibility issues outlined above, there have been calls for better standardization of biomedical research software; see Russell et al. (2018) for an example. In line with such standardization calls, a number of guidelines or principles to achieve methods reproducibility in several computational research contexts have been proposed. For instance, Sandve et al. (2013), Gil et al. (2016) and Willcox (2021) laid out principles for reproducible computational research in general. In a similar vein, Grüning et al. (2018) [...] Jupyter notebooks, a popular file format for documenting and sharing code. While most of these are language agnostic, language-specific approaches to computational reproducibility have also been outlined, e.g. for Python (Halchenko et al., 2021).
However, compliance with such standards and guidelines is not a given (Russell et al., 2018; Rule et al., 2018; Pimentel et al., 2021), so we set out to measure it specifically for Jupyter notebooks in the life sciences and to explore options to bridge the gap between recommended and actual practice. In order to do so, we mined a popular repository of biomedical full texts (PubMed Central) for mentions of Jupyter notebooks alongside mentions of a popular repository for open-source software (GitHub).

PubMed Central
PubMed Central (PMC) is a literature repository containing full texts of biomedical articles. At the time of writing, it contained about 7.5 million articles. Founded in the context of the Open Access mandate issued by the National Institutes of Health (NIH) in the United States (Roberts, 2001), PMC is operated by the National Center for Biotechnology Information (NCBI), a branch of the National Library of Medicine (NLM), which is part of the NIH. PMC hosts the articles using the Journal Article Tagging Suite (JATS), an XML standard, and makes them available for manual and programmatic access in various ways, of which we used the Entrez API (Sayers, 2010).

GitHub
GitHub is a website that combines git-based version control with support for collaboration and automation. It is a popular place for sharing software and developing it collaboratively, including for Jupyter notebooks (Rule et al., 2018) and for code associated with research articles available through PubMed Central (Russell et al., 2018).

Jupyter
Jupyter notebooks (Kluyver et al., 2016; Granger and Perez, 2021) are a computing environment in which code, code documentation, and output of the code can be explored interactively. They have become a popular mechanism to share computational workflows in a variety of fields (Kluyver et al., 2016), including astronomy (Randles et al., 2017; Wofford et al., 2019) and biosciences (Schröder et al., 2019). Here, we build on past studies of the reproducibility of Jupyter notebooks (Rule et al., 2018; Pimentel et al., 2019) and analyze Jupyter notebooks available through GitHub repositories associated with publications available through the biomedical literature repository PubMed Central.

Jupyter and reproducibility
Jupyter notebooks can, in principle, be used to enhance reproducibility, and they are often presented as such, yet using them does not automatically confer reproducibility to the code they contain. Several studies have been conducted in recent years to explore the reproducibility of Jupyter notebooks. A recent one investigated the reproducibility of Jupyter notebooks associated with five publications from the PubMed Central database (Schröder et al., 2019). In their reproducibility analysis, the authors looked for the presence of notebooks, source code artifacts, documentation of the software requirements, and whether the notebooks could be re-executed with the same results. They successfully reproduced only three of 22 notebooks from five publications. Rule et al. (2018) explored 1 million notebooks available on GitHub. In their study, they explored repositories, language, packages, notebook length, and execution order, focusing on the structure and formatting of computational notebooks. As a result, they provided ten best practices to follow when writing and sharing computational analyses in Jupyter notebooks (Rule et al., 2019). Another study (Pimentel et al., 2021) focused on the reproducibility of 1.4 million notebooks collected from GitHub. It provides an extensive analysis of the factors that impact reproducibility based on Jupyter notebooks. Chattopadhyay et al. (2020) reported on the results of a survey conducted among 156 data scientists on the difficulties when working with notebooks. Other studies focus on best practices on writing and sharing Jupyter notebooks (Rule et al., 2019; Pimentel et al., 2021; Willis et al., 2020; Wang et al., 2020b). As a result, tools have been developed to support provenance and reproducibility in Jupyter notebooks (Chirigati et al., 2013; Boettiger, 2015; Samuel and König-Ries, 2018; Project Jupyter et al., 2018). Cases where Jupyter notebooks have played a key role in some actual replication attempts have also begun to appear in the literature. For instance, Baker et al. (2019) assembled a Jupyter notebook as part of a published correction. Shortly after we had created our corpus, a paper was published with a Jupyter notebook that enabled others to reproduce the computational workflows, ultimately leading to the retraction of the original work, as detailed in Meyerowitz-Katz et al. (2021).

Wikidata
Wikidata is a cross-disciplinary and multilingual database through which a global community curates FAIR and open data to serve as general reference information (Waagmeester et al., 2020; Rutz et al., 2022). This includes information about key elements of the research ecosystem, from researchers to research fields and research organizations, from methods to datasets, software and publications. This information can then be explored in various ways, e.g. through the visualization tool Scholia (Nielsen et al., 2017), which provides profiles for different types of entities or relationships. For entities of the type Jupyter notebook (known to Wikidata as Q70357595), the most relevant profile types are those for a topic, a software or a resource used.

Environmental footprint
Computations ultimately require physical resources, and both the production and the use of these resources can have a considerable environmental footprint (Lannelongue et al., 2021b). The more reproducible some workflows become, the more accurately their environmental footprint can be assessed (Taddeo et al., 2021). This can then lead to an optimization of the environmental footprint, especially since it often correlates with the financial footprint of using computational resources (Schwartz et al., 2020). One of our aims in this study is thus to get an overview of the contribution of Jupyter-based workflows to the environmental footprint of biomedical research involving computation. This is in line with the recommendation in Lannelongue et al. (2021a) to integrate routine environmental footprint assessment into research practice.

Pipeline
In this section, we describe the key steps of the pipeline we used for assessing the reproducibility of Jupyter notebooks associated with publications extracted from PubMed Central. The driver file for running the workflow is documented in r0_main.py and the driver notebook for the analysis of the collected data is documented in the notebook named "Index.ipynb". Figure 1 provides an overview of the workflow used in this study. We used the esearch function to search PMC for Jupyter notebooks on 24th February 2021. We looked for publications that mentioned GitHub together with either the string "Jupyter" or some closely associated ones, namely "ipynb" (the file ending/extension of Jupyter notebooks) or "iPython" (the name of a precursor to Jupyter). The search query used was "(ipynb OR jupyter OR ipython) AND github". Based on the primary PMC IDs received from the esearch utility, we retrieved records in the XML format using the efetch function and collected the publication metadata from PMC (Roberts, 2001) using NCBI Entrez utilities via Biopython (Cock et al., 2009).
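The search step can be sketched with the Python standard library alone. The study itself used Biopython's Entrez wrappers; the sketch below instead builds the underlying E-utilities request URL and parses an esearch XML response. The helper names (`build_esearch_url`, `parse_esearch_ids`), the `retmax` value, and the sample response are illustrative assumptions and not part of the original pipeline.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
QUERY = "(ipynb OR jupyter OR ipython) AND github"  # query reported in the text


def build_esearch_url(term: str, retmax: int = 10000) -> str:
    """Build an esearch request against the PMC database (hypothetical helper)."""
    params = {"db": "pmc", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch_ids(xml_text: str) -> list:
    """Extract the record IDs listed in an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Made-up response for illustration; real responses carry further fields.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>1234567</Id><Id>7654321</Id></IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(build_esearch_url(QUERY))
    print(parse_esearch_ids(SAMPLE_RESPONSE))
```

The returned IDs could then be passed to the efetch endpoint of the same service to retrieve the article XML; that step is omitted here to keep the sketch free of network access.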
In the next step, we processed the XML fetched from PMC. We used an SQLite database for storing all the data related to our pipeline. We collected information on journals and articles. We first extracted information about the journal. For this, we created a database table for the journal and extracted the ISSN (International Standard Serial Number), the journal title, the NLM's (National Library of Medicine) abbreviated journal title, and the ISO (International Organization for Standardization) abbreviation.
We then created a database table for the articles and populated it with article metadata. The metadata includes the article name, PubMed ID, PMC ID, publisher ID and name, DOI, subject, the dates when the article was received, accepted, and published, the license, the copyright statement, keywords, and the GitHub repositories mentioned in the publication. For each article, we also extracted the Medical Subject Headings (MeSH terms) to get the subject area of the article.
To extract the GitHub repositories mentioned in each article, we looked for mentions of GitHub links anywhere in the article, including the abstract, the article body, data availability statement, and supplementary information. GitHub links were available in different formats. We normalized them to the standard format 'https://github.com/{username}/{repositoryname}'. For example, we extracted the GitHub repository from nbviewer links and transformed its representation to the standard format. We excluded 692 GitHub links that mentioned only the username or organization name or GitHub Pages and not a specific repository name. After preprocessing and extracting GitHub links from each article, we added the GitHub repositories to the database table for the corresponding articles. Likewise, we linked the article's entry in the table to the journal where it was published. We also collected information on the authors of the article in a separate database table: we created an author database table, extracted the first and last name, ORCID, email, and connected these data to the corresponding entries in the article table.
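The link normalization described above can be illustrated as follows. This is a minimal sketch under stated assumptions: the regular expressions and the function name `normalize_github_link` are ours, and the original pipeline may recognize further link variants.

```python
import re
from typing import Optional

# Direct repository links, e.g. https://github.com/user/repo/tree/main
_GITHUB_RE = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)
# nbviewer links that wrap a GitHub path, e.g. nbviewer.jupyter.org/github/user/repo/...
_NBVIEWER_RE = re.compile(
    r"(?:https?://)?nbviewer\.(?:jupyter|ipython)\.org/github/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def normalize_github_link(link: str) -> Optional[str]:
    """Return 'https://github.com/{username}/{repositoryname}', or None.

    None is returned for links that name only a user or organization and for
    GitHub Pages links, mirroring the exclusions described in the text.
    """
    for pattern in (_NBVIEWER_RE, _GITHUB_RE):
        match = pattern.search(link)
        if not match:
            continue
        user, repo = match.group(1), match.group(2)
        repo = repo.rstrip(".")  # drop sentence punctuation glued to the link
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        if repo:
            return f"https://github.com/{user}/{repo}"
    return None
```

For example, an nbviewer link such as `https://nbviewer.jupyter.org/github/alice/demo/blob/master/a.ipynb` (a made-up address) would be mapped to `https://github.com/alice/demo`.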
Based on the GitHub repository name collected from the article, we checked whether these repositories were available at the original link or not. If the repository existed, we cloned it (ignoring branches, i.e. just taking the base one, which is usually called "main") and collected information about the repositories using the GitHub REST API. On that basis, we created a repository database table. For each GitHub repository, an entry was created in the table and connected to the article where it is mentioned. We collected the execution environment information by looking into the dependency information declared in the repositories in terms of files like requirements.txt, setup.py and pipfile. Additional information for each repository was also collected from the GitHub API. This includes the dates of the creation, updates, or pushes to the repository, and the programming languages used in each repository. Further information includes the number of subscribers, forks, issues, downloads, license name and type, total releases, and total commits after the respective dates for when the article was published, accepted, and received.
After collecting and creating these data tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories. The code for the pipeline is adapted from Pimentel et al. (2019) and Samuel and König-Ries (2021). Hence, the method to reproduce the notebooks in this study is similar to that of Pimentel et al. (2019). For each notebook, we collected information on the name, nbformat, kernel, language, number of different types of cells, and the maximum execution count number. We extracted the source and output of each cell for further analysis. Using the Python Abstract Syntax Tree (AST), the pipeline extracted information on the use of modules, functions, classes, and imports.
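To make the per-notebook metadata concrete, the following sketch reads the fields named above from a notebook's JSON. It assumes the nbformat 4 layout (a top-level `cells` list and `kernelspec`/`language_info` entries under `metadata`); older notebooks use a different layout and are not handled. The function names are hypothetical and not taken from the original pipeline.

```python
import json
from collections import Counter


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON on disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_notebook(nb: dict) -> dict:
    """Collect basic metadata from a parsed notebook (nbformat 4 assumed)."""
    metadata = nb.get("metadata", {})
    cells = nb.get("cells", [])
    cell_types = Counter(cell.get("cell_type") for cell in cells)
    # Execution counts are null for code cells that were never run.
    counts = [
        cell["execution_count"]
        for cell in cells
        if cell.get("cell_type") == "code"
        and cell.get("execution_count") is not None
    ]
    return {
        "nbformat": nb.get("nbformat"),
        "kernel": metadata.get("kernelspec", {}).get("name"),
        "language": metadata.get("language_info", {}).get("name"),
        "code_cells": cell_types.get("code", 0),
        "markdown_cells": cell_types.get("markdown", 0),
        "raw_cells": cell_types.get("raw", 0),
        "max_execution_count": max(counts, default=None),
    }
```

A notebook whose metadata lacks `language_info` yields `language: None` in this sketch, which corresponds to the notebooks that do not declare a programming language.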
After collecting all the required information for the execution of Python notebooks from the repositories, we prepared a Conda environment based on the Python version declared in the notebook. Conda is an open source package and environment management system which helps users to easily find and install packages and create, save, load and switch between environments. The pipeline then installed all the dependencies collected from the corresponding files like requirements.txt, setup.py and pipfile inside the Conda environment. For the repositories that did not provide any dependencies using the above mentioned files, the pipeline executed the notebooks by installing all the Anaconda dependencies. Anaconda is a Python and R distribution which provides data science packages including scikit-learn, numpy, matplotlib, and pandas.
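The environment preparation can be pictured as a short list of shell commands per repository. The sketch below only assembles such commands and does not run them; the exact commands used in the study are not stated in the text, so the choice of `conda run ... pip install` and the function name `build_env_commands` are assumptions. Handling of pipfile-based dependencies is left out.

```python
from typing import List, Optional


def build_env_commands(
    env_name: str,
    python_version: str,
    requirements_file: Optional[str] = None,
    setup_dir: Optional[str] = None,
) -> List[List[str]]:
    """Assemble commands for one isolated environment (hypothetical helper)."""
    commands = [
        # Create the environment with the Python version declared in the notebook.
        ["conda", "create", "--yes", "--name", env_name, f"python={python_version}"],
    ]
    if requirements_file:
        commands.append(
            ["conda", "run", "--name", env_name, "pip", "install", "-r", requirements_file]
        )
    if setup_dir:
        # Installing the repository directory makes pip evaluate its setup.py.
        commands.append(["conda", "run", "--name", env_name, "pip", "install", setup_dir])
    if not requirements_file and not setup_dir:
        # No declared dependencies: fall back to the Anaconda metapackage.
        commands.append(["conda", "install", "--yes", "--name", env_name, "anaconda"])
    return commands
```

Each returned command is a list of arguments that could be handed to `subprocess.run`; keeping construction and execution separate makes the plan easy to log and inspect before anything is installed.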
We used the nbdime library from Project Jupyter to compute diffs of the notebooks. Our tooling is adapted from Pimentel et al. (2019) and Samuel and König-Ries (2021). The code from Pimentel et al. (2019) provides a basis for reproducing Jupyter notebooks from GitHub repositories. ReproduceMeGit (Samuel and König-Ries, 2021), which extends the code from Pimentel et al. (2019), is a visualization tool for analyzing the reproducibility of Jupyter notebooks, along with provenance information of the execution. ReproduceMeGit provides the difference between the results of the executions of notebooks using the nbdime library. These two tools provide the basis for our code for the reproducibility study.
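The idea of comparing an original notebook with its re-executed counterpart can be illustrated without nbdime. The following is a deliberately simplified stand-in, not the nbdime algorithm: it reduces each code cell to its textual outputs and reports the positions at which the two runs disagree. All function names are hypothetical.

```python
from itertools import zip_longest


def _as_text(value) -> str:
    """nbformat stores multi-line strings either as a str or as a list of str."""
    return "".join(value) if isinstance(value, list) else str(value)


def cell_outputs(nb: dict) -> list:
    """Return, for each code cell, a simplified list of its textual outputs."""
    per_cell = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        texts = []
        for out in cell.get("outputs", []):
            kind = out.get("output_type")
            if kind == "stream":
                texts.append(_as_text(out.get("text", "")))
            elif kind in ("execute_result", "display_data"):
                texts.append(_as_text(out.get("data", {}).get("text/plain", "")))
            elif kind == "error":
                texts.append("error:" + str(out.get("ename", "")))
        per_cell.append(texts)
    return per_cell


def differing_code_cells(original: dict, rerun: dict) -> list:
    """Indices of code cells whose simplified outputs differ between two runs."""
    pairs = zip_longest(cell_outputs(original), cell_outputs(rerun))
    return [index for index, (a, b) in enumerate(pairs) if a != b]
```

Unlike a real notebook diff, this sketch ignores images, cell sources, and metadata, so an empty result only means that the plain-text outputs match.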
After collecting the notebooks, we also ran a Python code styling check using the flakenb library on the notebooks, since code styling consistency is a potential indicator for the extent of care that went into a given piece of software. The flakenb library is a tool for code style guide enforcement for notebooks. It helps to check code against some of the style conventions in PEP 8, a style guide for Python code. The flakenb library provides an ignore flag to ignore some specified errors. In this study, we did not use this flag and collected all errors detected by the library. For the styling of notebooks, we collected the error code and description of each detected style violation.

Reproduction
The complete pipeline was run on the Friedrich Schiller University Ara Cluster. The computational experiments were performed on a Skylake Standard Node (2x Intel Xeon Gold 6140, 18 cores each, 2.3 GHz, 192 GB RAM). This node has two CPUs, each with 18 cores, and 192 GB RAM in total. The complete pipeline ran in 117 hours and 52 minutes from 24th to 28th February 2021. We then used the Green Algorithms calculator (https://green-algorithms.org, v2.2; Lannelongue et al., 2021b) to estimate that the pipeline run drew 47.38 kWh. For a computation located in Germany, this corresponds to a carbon footprint of 16.05 kg CO2e, which is equivalent to 17.51 tree-months.
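As a plausibility check, the reported figures can be related to each other with a few lines of arithmetic. The sketch below uses only the numbers stated above and derives the conversion factors they imply; it does not reproduce the calculator's internal model, and the variable names are ours.

```python
# Figures reported in the text.
RUNTIME_HOURS = 117 + 52 / 60   # 117 hours and 52 minutes
ENERGY_KWH = 47.38
CARBON_KG_CO2E = 16.05
TREE_MONTHS = 17.51

# Quantities implied by those figures (derived here, not taken from the calculator).
mean_power_watts = ENERGY_KWH / RUNTIME_HOURS * 1000
implied_intensity_kg_per_kwh = CARBON_KG_CO2E / ENERGY_KWH
implied_kg_per_tree_year = CARBON_KG_CO2E / TREE_MONTHS * 12

if __name__ == "__main__":
    print(f"mean power draw: {mean_power_watts:.0f} W")
    print(f"implied carbon intensity: {implied_intensity_kg_per_kwh:.3f} kg CO2e/kWh")
    print(f"implied sequestration: {implied_kg_per_tree_year:.1f} kg CO2e per tree-year")
```

The figures are mutually consistent: they imply a mean draw of roughly 400 W for the node, a grid carbon intensity of about 0.34 kg CO2e per kWh, and a sequestration rate of about 11 kg CO2e per tree per year.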

Results
In this section, we present the results of our study on analyzing the computational reproducibility of Jupyter notebooks from biomedical publications. We extracted metadata from 1,419 publications from PubMed Central. These articles had been published in 373 journals and had 2,398 mentions of GitHub repository links. At the time of data collection, 49 GitHub repositories mentioned in the articles were not accessible, returning a "page not found" error instead. Out of 2,177 unique and valid GitHub repositories cloned, only 1,117 had one or more Jupyter notebooks. From these repositories, a total of 9,625 Jupyter notebooks were downloaded for further reproducibility analysis. In the text, we use the full journal name in its current styling, e.g. "PLOS Computational Biology". Figure 2 shows the top ten journals with the highest number of articles that had a valid GitHub repository with at least one Jupyter notebook. Figure 3 shows the journals by the number of GitHub repositories and repositories with Jupyter notebooks. The journal eLife topped the list in both rankings, which is why we chose to submit our manuscript there. It was followed by PLOS ONE and PLOS Computational Biology. The ratio of notebooks per GitHub repository varies across journals, ranging from 3.4:1 in GigaScience through 2:1 in Nature Communications to 1.5:1 in Scientific Reports. Of the 1,117 repositories with Jupyter notebooks, 290 (25.9%) had one Jupyter notebook, 462 (41.4%) had two notebooks, and 249 (22.3%) had ten or more notebooks. 6,782 (70.4%) of the notebooks belonged to repositories with ten or more notebooks.

General statistics of our study
Figure 4 shows the maximum number of notebooks for articles published in the respective journal. Among the top ten journals with notebooks, Nature Communications had the maximum number of notebooks; however, it ranked fourth in terms of journals with repositories with notebooks. Figure 5 shows the timeline of the articles by the number of GitHub repositories with at least one Jupyter notebook. This indicates a growing trend of articles with notebooks. In parallel to exploring trends related to Jupyter notebooks, we analyzed the uptake of ORCID identifiers over time in the collected journal articles with notebooks (Figure 6). ORCID provides a persistent digital identifier to uniquely identify authors and contributors of scholarly articles. While iPython notebooks go back to 2001, the Jupyter notebooks with kernels for multiple languages became available in 2014, whereas ORCID was launched in 2012. Hence, both are relatively recent innovations in the scholarly communications ecosystem, and their respective uptake processes occur in parallel.
There are in total 11,594 authors in the 1,419 publications. We have not performed any author disambiguation to distinguish unique authors in our corpus. However, such disambiguation is taking place at scale in Wikidata (see Discussion). There are 2,720 (23.46%) authors with ORCID

Programming languages
Figure 7 and Figure 8 show analyses of the programming languages used in the Jupyter notebooks present in the collected publications. Figure 7 presents (using a log scale) the most common programming languages used in the notebooks. Python (84.8%) is the most common programming language, followed by unknown (7.5%) and R (4.8%). Unknown notebooks are those which do not declare the programming language or its version in the notebook. A total of 720 notebooks do not declare a programming language. From the figure, we can see that the Jupyter ecosystem is not just Python anymore, but Python is most prominent, and none of the other languages have overtaken the "Unknown" group, which is primarily due to early notebooks in which Python was hardcoded, or the language stated in some other non-standard fashion. Jupyter notebooks are also used for other languages like Bash, Matlab, and Java. Figure 8 shows the top programming [...]
Figure 11 shows the statistics on the structure of notebooks. Notebooks have a median of 20 cells and 13 code cells. The average number of cells with outputs in notebooks found in our study is three, with a minimum of zero (Figure 11b). The maximum number of cells, code cells, and cells with output seen in a notebook are 95, 431, and 163, respectively. The maximum number of raw and empty cells seen in a notebook is 49 and 31, respectively. Raw cells let the users write output directly and the kernel does not evaluate them.
The average number of markdown cells in notebooks is six, with the maximum being 383. 6,311 (65.77%) of the notebooks have markdown cells, while 3,284 (34.23%) notebooks do not. 96.58% of notebooks use English in the markdown cells, while 46.27% of notebooks use only English in the markdown cells. In addition to English, French (11.76%) and Danish (3.96%) are the other popular natural languages used in the markdown. In 1,909 (30.25%) notebooks, we could not detect the language in the markdown cells. Further analysis of markdown cells shows that the average number of lines and words seen in markdown cells are 20 and 145, respectively. Paragraphs and headers, the most commonly seen markdown elements, appear in 92.65% and 81.81% of notebooks, respectively.
1,449 (17.76%) notebooks do not have execution numbers and 6,710 (82.24%) notebooks have execution numbers. The maximum execution count seen in a notebook is 2,076 (Figure 11d). Figure 12 shows the most frequently used titles in notebooks from our collected data. "Untitled", "programming" and "index" are the three most common notebook names. There are 63 (0.65%) notebooks whose title is or starts with "Untitled". There are 21 (0.22%) notebooks that contain the name 'Copy'. We also see many notebooks with the string 'test' in their names (Figure 13). 1,070 (11.12%) notebooks have names that are not recommended by the POSIX fully portable filenames guide (Pimentel et al., 2019). Only four notebooks have names that are disallowed in Windows. There are no notebooks without a title (i.e., notebooks with just a '.ipynb' extension). Figure 14 shows the distribution of the length of notebook titles. The average length of the notebook title is 18 characters, with a maximum of 123 characters and a minimum of 2.
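The filename checks mentioned above can be approximated as follows. This sketch assumes the POSIX portable filename character set (letters, digits, period, underscore, hyphen, with no leading hyphen) and the commonly documented Windows restrictions (forbidden characters, reserved device names, trailing space or period). The precise criteria applied in the study, following Pimentel et al. (2019), may differ, and the function names are hypothetical.

```python
import re

_POSIX_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")
_WINDOWS_FORBIDDEN = set('<>:"/\\|?*')
_WINDOWS_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def is_posix_portable(filename: str) -> bool:
    """True if the name uses only portable characters and has no leading hyphen."""
    return bool(_POSIX_PORTABLE.fullmatch(filename)) and not filename.startswith("-")


def is_windows_allowed(filename: str) -> bool:
    """True if the name avoids characters and device names that Windows rejects."""
    if any(ch in _WINDOWS_FORBIDDEN or ord(ch) < 32 for ch in filename):
        return False
    if filename.endswith((" ", ".")):
        return False
    stem = filename.split(".")[0].upper()
    return stem not in _WINDOWS_RESERVED
```

Under these assumptions, a name containing a space, such as `my notebook.ipynb`, is allowed on Windows but is not POSIX portable, which illustrates why the first category is much larger than the second.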

Notebook modules
Figure 15 shows the analysis of modules declared in notebooks. Using AST, we analyzed the valid Python notebooks. 5,248 (69.06%) notebooks had imports, of which 714 (9.40%) had local imports, while 5,216 (68.64%) had external modules. Local imports denote the import of modules defined in the notebook repository's directory. There are 1,035 local and 38,229 external modules declared in the collected Python notebooks. Figure 15 shows the top ten commonly used Python modules declared in the notebooks. The most used modules are numpy (3,255), pandas (2,428), and matplotlib.pyplot (2,411). These are widely used modules for data manipulation, analytics, and visualizations.
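The distinction between local and external imports can be sketched with Python's built-in `ast` module. The helper names below are hypothetical, and the rule used to call an import "local" (a matching file or package directory in the repository root, or a relative import) is an assumption that may be narrower than the one used in the study.

```python
import ast
import os


def extract_imports(source: str) -> set:
    """Return the dotted module names imported by a piece of Python source."""
    # Drop IPython magics and shell escapes, which are not valid Python syntax.
    # This line-based filter is crude and can misfire inside multi-line strings.
    lines = [
        line for line in source.splitlines()
        if not line.lstrip().startswith(("%", "!"))
    ]
    tree = ast.parse("\n".join(lines))
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.level > 0:  # relative import such as "from . import utils"
                modules.add("." * node.level + (node.module or ""))
            elif node.module:
                modules.add(node.module)
    return modules


def is_local(module: str, repo_dir: str) -> bool:
    """Treat a module as local if it resolves inside the repository directory."""
    if module.startswith("."):
        return True
    top_level = module.split(".")[0]
    return (
        os.path.isfile(os.path.join(repo_dir, top_level + ".py"))
        or os.path.isdir(os.path.join(repo_dir, top_level))
    )
```

Source written for Python 2 can raise `SyntaxError` in `ast.parse` under Python 3, so a caller would need to catch that case; this is one reason why only valid Python notebooks can be analyzed this way.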

Notebook dependencies
Figure 17 shows the analysis of the declared dependencies of GitHub repositories and notebooks. 4,650 (48.31%) of notebooks belong to repositories which have declared dependencies using setup.py, requirements.txt, or pipfile. There are 492 repositories with declared dependencies (Figure 17b).

Notebook Reproducibility
In our reproducibility study, we executed 4,169 (43.45%) Python notebooks. The dependencies of the notebooks, as mentioned in their respective repositories, were installed in conda environments. However, the dependencies of 1,485 (35.62%) notebooks failed to install. None of the dependency files were malformed, i.e. had wrong syntax or conflicting dependencies. Nor did we find requirement files that referenced other, unavailable requirement files or that needed external tools. Hence, the reason for the failed installations is unknown. We attempted to execute 2,684 (64.38%) notebooks for the reproducibility study after successfully installing all the requirements. However, many notebooks failed to execute even after installing all the requirements successfully.

Exceptions
2,265 (84.39%) notebooks resulted in exceptions due to several reasons. Figure 18 shows the top ten exceptions that occurred while executing the notebooks. ModuleNotFoundError, ImportError, and FileNotFoundError are the most common reasons for execution failures in notebooks. 1,362 (32.67%) of the executions failed because of ModuleNotFoundError and ImportError exceptions. A ModuleNotFoundError exception occurs when a Python module used by the notebook could not be found. An ImportError exception occurs when a Python module used by the notebook could not be imported. These two errors occur mainly due to missing dependencies. 132 (3.17%) notebooks have a NameError, which occurs when a name used in the notebook is not defined. 374 (8.97%) notebooks have a FileNotFoundError or IOError. These exceptions occur when absolute paths are used to access data or when the data files are not included in the repository.
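A tally of exception types like the one reported here can be derived from executed notebooks, assuming each re-executed notebook was saved together with the error output of the cell at which it stopped. In nbformat 4, such an output has `output_type` set to `error` and carries the exception class in `ename`. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def first_exception(nb: dict) -> Optional[str]:
    """Name of the first exception recorded in a notebook's outputs, if any."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for out in cell.get("outputs", []):
            if out.get("output_type") == "error":
                return out.get("ename")
    return None


def tally_exceptions(notebooks: Iterable[dict]) -> Counter:
    """Count how many notebooks stopped on each exception type."""
    names = (first_exception(nb) for nb in notebooks)
    return Counter(name for name in names if name is not None)
```

Calling `tally_exceptions(...).most_common(10)` on such a collection would give a ranking comparable in spirit to a top-ten list of exceptions; notebooks that ran without error simply do not contribute to the count.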
Figure 19 shows how the three most common exceptions, ModuleNotFoundError, ImportError, and FileNotFoundError, change with the year in which the article was published. We see an increase in ModuleNotFoundError through the years. In the years before 2019, ImportError outnumbered ModuleNotFoundError.
Figure 20 shows the trend of exceptions by the year of publication, normalized by the number of notebooks. In 2020, we observed the highest number of exceptions and notebooks. Figure 21 shows the exceptions by the type of the article. Research articles have the largest number of exceptions. Figure 22 shows the exceptions by journal, normalized by the number of notebooks. The journal eLife has the largest number of notebooks and the largest number of exceptions, followed by PLOS ONE.

Successful replications
396 (9.50%) of the notebooks in our corpus finished their execution successfully without any errors. However, for 151 notebooks (3.62%), our execution generated results that differed from the original ones, while 245 notebooks (5.88%) produced the same results in our execution as documented for the original notebooks. Table 1 zooms in on the successfully executed notebooks and compares those that did not yield the same results as the original ones (different group) with those that did (identical group). A clear difference between both groups is that many of the notebooks in the identical group had their dependencies specified via either setup.py or requirements.txt or both, in contrast to none of the notebooks in the different group. Since notebooks with no dependency declarations were run using the default conda dependencies, the fact that they successfully finished means that all dependencies were covered. However, as the version of the dependencies used in the original notebook was not documented, it may have differed from the version that was provided in our respective Conda environment. Besides versioning of dependencies, there could be a number of other reasons as to why an error-free execution might yield different results. For instance, random functions may be invoked, or code cells in the original might have been executed multiple times or in a different order than in our execution, which ran every code cell just once, from top to bottom. However, we would not expect the invocation of random functions or an inconsistent execution order to correlate so strongly with whether the dependencies had been explicitly declared or not.
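Whether an original notebook was itself run once from top to bottom can be estimated from the execution counts stored with its code cells. The following sketch, with hypothetical function names, checks two conditions: that the counts increase down the notebook, and, more strictly, that they run 1, 2, 3, ... as they would after a single pass in a fresh kernel.

```python
def _execution_counts(nb: dict) -> list:
    """Execution counts of the executed code cells, in notebook order."""
    return [
        cell["execution_count"]
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
        and cell.get("execution_count") is not None
    ]


def executed_in_order(nb: dict) -> bool:
    """True if the execution counts strictly increase from top to bottom."""
    counts = _execution_counts(nb)
    return all(earlier < later for earlier, later in zip(counts, counts[1:]))


def executed_once_from_fresh_kernel(nb: dict) -> bool:
    """True if the counts are exactly 1, 2, 3, ... without gaps or repeats."""
    counts = _execution_counts(nb)
    return counts == list(range(1, len(counts) + 1))
```

A notebook that fails the first check was saved in a state that a single top-to-bottom run cannot have produced, which is one way in which an error-free re-execution can still yield different outputs.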
In contrast to the dependency declarations, other features in Table 1 show more gradual differences between the two groups, and they largely fit with intuition. For instance, it is understandable that notebooks with more code cells take longer to execute and that code whose execution per code cell takes longer is somewhat more complex, thus raising the probability of different outcomes. It is also not surprising that, while the total number of cells per notebook is nearly the same in both groups, notebooks in the identical group show a higher ratio of Markdown versus code cells, since that ratio is indicative of documentation efforts, and better documentation would be expected to go with better reproducibility.
The average number of differences observed per notebook (or even per code cell) is not easy to interpret on its own, as it includes differences in output cells, cell counter values or in output files, and a difference early in a notebook can lead to further differences later.
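To make the kinds of differences concrete, the sketch below counts, for two versions of a notebook, the code cells whose outputs differ and those whose cell counters differ. It is a deliberately simplified stand-in and not the nbdime-based comparison used in our pipeline; it assumes notebooks given as parsed nbformat-4 dictionaries with the same cell sequence, and the helper names and example cells are hypothetical.

```python
def code_cells(notebook):
    """Return the code cells of a notebook given as a parsed nbformat-4 dict."""
    return [cell for cell in notebook["cells"] if cell["cell_type"] == "code"]


def count_differences(original, rerun):
    """Count code cells whose outputs or execution counters differ."""
    diffs = {"outputs": 0, "execution_count": 0}
    for old, new in zip(code_cells(original), code_cells(rerun)):
        if old.get("outputs") != new.get("outputs"):
            diffs["outputs"] += 1
        if old.get("execution_count") != new.get("execution_count"):
            diffs["execution_count"] += 1
    return diffs


def make_code_cell(count, text):
    """Build a minimal code cell with a single stdout stream output."""
    return {
        "cell_type": "code",
        "execution_count": count,
        "source": "",
        "outputs": [{"output_type": "stream", "name": "stdout", "text": text}],
    }


markdown_cell = {"cell_type": "markdown", "source": "Some documentation"}

# Hypothetical original: cells were run repeatedly, hence high counters.
original = {"cells": [make_code_cell(3, "0.51\n"), markdown_cell,
                      make_code_cell(7, "done\n")]}
# Hypothetical rerun: every code cell executed once, from top to bottom.
rerun = {"cells": [make_code_cell(1, "0.49\n"), markdown_cell,
                   make_code_cell(2, "done\n")]}

diffs = count_differences(original, rerun)
print(diffs)
```

Here both counters differ although only one output changed, which illustrates why a raw count of differences overstates how far the results themselves diverge.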
Table 2 illustrates how different Python versions performed in terms of successful executions: amongst the top 5 versions for notebooks yielding different results, there were three 2.7 versions, whereas there were three 3.7 versions in the group that yielded identical results, and 3.6.9 and 3.6.5 were represented roughly equally in both groups. Other parameters that we considered but did not include in the analysis of the finished notebooks were the number of dependencies (the more there are, the more likely replicability is reduced; see also section Notebook dependencies and in particular Fig. 17), the type of dependencies (e.g. local code or environment, Python package, local or remote file or service, each of which could complicate replication; see also Fig. 15 and 16), the recency (cf. Fig. and 19 and 20) of the notebooks (more recent ones would be expected to be more replicable) or notebook titles (cf. Fig. 12, 13 and 14) containing strings like "tutorial" or "demo" (which might be indicative of expected reuse, thus perhaps triggering more careful documentation) or "untitled" (which is the default title and may thus indicate a lack of attention to documentation and, consequently, a higher likelihood for replication attempts to fail).
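Grouping notebooks by title strings, as considered above, could be sketched as follows. This is an illustration rather than part of our analysis; the function name, the category labels and the example file names are hypothetical.

```python
from collections import Counter


def classify_title(title):
    """Assign a notebook title to a coarse category based on substrings."""
    lowered = title.lower()
    if "untitled" in lowered:
        return "untitled"          # default Jupyter title left unchanged
    if "tutorial" in lowered or "demo" in lowered:
        return "tutorial/demo"     # may signal expected reuse
    return "other"


# Hypothetical file names, for illustration only.
titles = [
    "Untitled3.ipynb",
    "Tutorial_01_intro.ipynb",
    "demo-analysis.ipynb",
    "figure2.ipynb",
]

counts = Counter(classify_title(title) for title in titles)
print(dict(counts))
```

Such categories could then be cross-tabulated against the different and identical groups of Table 1.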

Notebook Styling
In addition to the common exceptions in the notebooks, we also checked the notebook code for styling errors. Figure 23 shows the most common Python code warnings and style errors found in our study.
Table 3 lists the codes of the Python warnings and style errors found in our study. E231 is the most common coding style error, followed by E225 and E265, respectively. Besides styling errors, there are also some common content errors, such as F403 and F405, which relate to variable and module definitions. The W601 and W606 warnings relate to the use of deprecated features and reserved keywords.
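For orientation, the sketch below pairs the codes mentioned above with short descriptions and, for the three most frequent style errors, with a line that would be flagged and a compliant rewrite. The descriptions paraphrase the messages of the pycodestyle (E and W codes) and pyflakes (F codes) checkers as we understand them; the example lines and the helper `run` are hypothetical and not taken from the corpus.

```python
# Code -> short description (paraphrased from pycodestyle and pyflakes).
STYLE_CODES = {
    "E231": "missing whitespace after ','",
    "E225": "missing whitespace around operator",
    "E265": "block comment should start with '# '",
    "F403": "'from module import *' used; unable to detect undefined names",
    "F405": "name may be undefined, or defined from star imports",
    "W601": ".has_key() is deprecated, use 'in'",
    "W606": "'async' and 'await' are reserved keywords starting with Python 3.7",
}

# Code -> (line that would be flagged, compliant rewrite); illustrative only.
EXAMPLES = {
    "E231": ("values = [1,2,3]", "values = [1, 2, 3]"),
    "E225": ("total=1 + 2", "total = 1 + 2"),
    "E265": ("#comment\nflag = True", "# comment\nflag = True"),
}


def run(snippet):
    """Execute a snippet and return the names it defined."""
    namespace = {}
    exec(snippet, namespace)
    namespace.pop("__builtins__", None)
    return namespace


same_result = {}
for code, (flagged, compliant) in EXAMPLES.items():
    same_result[code] = run(flagged) == run(compliant)
    print(f"{code} ({STYLE_CODES[code]}): same result = {same_result[code]}")
```

The pure style errors leave the computed values unchanged, whereas the F and W codes point at constructs that can affect whether code runs at all, for instance under a newer Python version.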

Discussion
In this study, we have analyzed the Method reproducibility - in the sense of Goodman et al. (2016) - of Jupyter notebooks written in Python and publicly hosted on GitHub that are mentioned in publications whose full text was available via PubMed Central by the reference period, i.e. the time when our reproducibility pipeline was run (24-28 February 2021). We will now discuss the limitations of the study and then its implications, again primarily for Method reproducibility of Jupyter notebooks associated with biomedical publications.

Limitations
The present study does not address Inferential reproducibility and only briefly touches upon Results reproducibility. Furthermore, we made no attempt to re-run computational notebooks that met any of the following exclusion criteria during the reference period: (a) they did not use Jupyter (or its precursor, IPython), (b) they were not written in Python, (c) they were not publicly available on GitHub, (d) they were not mentioned in publications available from PubMed Central, (e) they were not on the base branch of their GitHub repository (which is the only branch we looked at). Our reproducibility workflow is based on that by Pimentel et al. (2019), with some changes to include GitHub repositories from publications and to use the nbdime library (Project Jupyter, 2021) instead of string matching for finding differences in the notebook outputs. The approach uses conda (Conda community, 2017) environments. We did not use any Docker images (Docker, 2013) for the execution environment, even in cases when they were available. As this workflow is fully automated, we did not spend any manual effort on fixing any of the errors that came up for an individual notebook - see Woodbridge (2017) for a report of an attempt to do so, which also provided the foundation for a prototypical validation tool that makes use of GitLab Actions. For a good number of the reported problems (especially the missing software or data dependencies, as per Fig. 18), it is often straightforward to fix them manually for individual notebooks, yet undertaking manual fixes systematically was not practical at the scale of the thousands of notebooks rerun here. If the original code had specified dependencies without referring to a specific version, our rerun would use the most recent conda-installable version of that library. Finally, in estimating the environmental footprint of this study, we only included the footprint due to running the full pipeline once - we did not include the efforts involved in preparing the pipeline, analyzing the data or writing the manuscript.
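The distinction between dependencies declared with and without a specific version can be made explicit with a small check on a requirements file. The sketch below is an illustration and not part of our pipeline; it treats only exact `==` pins as versioned, ignores comments and option lines, and uses a hypothetical function name and file content.

```python
import re

# Matches "name==1.2.3" (optionally with extras such as "name[extra]").
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^\s;]+")


def split_requirements(text):
    """Split requirement lines into exactly pinned and not exactly pinned."""
    pinned, unpinned = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):  # skip blanks and options
            continue
        (pinned if PINNED.match(line) else unpinned).append(line)
    return pinned, unpinned


# Hypothetical requirements.txt content, for illustration only.
example = """\
# analysis dependencies
numpy==1.19.5
pandas>=1.0
matplotlib
scipy == 1.5.4  # exact pin written with spaces
"""

pinned, unpinned = split_requirements(example)
print("pinned:", pinned)
print("unpinned:", unpinned)
```

For the entries reported as unpinned, a rerun resolves to whatever version is newest at the time, which is one way in which an error-free execution can still diverge from the original.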

Implications
There are several implications of this study. First, on a general level, the low degree of reproducibility that we documented here for Jupyter notebooks associated with biomedical publications is consistent with similarly low levels of reproducibility that were found in earlier domain-generic studies, both for Python (Rule et al., 2018; Pimentel et al., 2021) and R (Trisovic et al., 2022). Second, considering that the notebooks we explored here were associated with peer-reviewed publications, it is clear that the review processes currently in place at journals within our corpus do not generally pay much attention to the reproducibility of the notebooks. This clearly needs to change, and we need systemic approaches to that rather than just adding this to the list of things the reviewers are expected to attend to. As our study demonstrates, a basic level of reproducibility assessment can be achieved in a fully automated fashion, so it would probably be beneficial in terms of research quality to include such automated basic checks - for notebooks and other software - in standard review procedures. Ideally, this would be done in a way that works across publishers as well as for a variety of technology stacks and programming languages.
Third, while there is a large variety in the types of errors affecting reproducibility, some of the most common errors concentrate around dependencies (cf. Fig. 15, 16, 17 and 18), so efforts aimed at systemic improvements of dependency handling - e.g. as per Zhu et al. (2021) - have the potential to increase reproducibility considerably. Here, programming language-specific efforts regarding code dependencies can be combined with efforts targeted at improving the automated handling of data dependencies, which would be beneficial irrespective of the specific programming language. Fourth, zooming in on Python specifically, wider adoption of existing workflows for code dependency management (such as requirements.txt) as well as associated checks during the publishing process would help. Researchers attempting to publish research with associated notebooks should not have to do this all by themselves - research infrastructures as well as publishers and funders can all help facilitate establishing best practice here and engaging communities around them.
Fifth, the few notebooks that actually did reproduce (cf. Successful replications) are not equally distributed. This means that reproducibility could probably be strengthened by enhancing or highlighting the features that correlate with it. For instance, Jupyter notebooks with more emphasis on documentation scored better than others, and there is merit in the idea of making Jupyter notebooks or similar computational notebooks a publication type of their own. This is already the case in some places, as exemplified by Constantine et al. (2016) or Garg et al. (2022) in the Journal of Open Source Software.
Sixth, the ongoing diversification of the Jupyter ecosystem - e.g. in terms of programming languages, deployment frameworks or cloud infrastructure - is increasingly reflected, albeit with delay, in the biomedical literature. In parallel, while GitHub remains hugely popular, alternatives like GitLab, Gitee or Codeberg are growing too. Future assessments of Jupyter reproducibility will thus need to take this increasing complexity into account, and ideally present some systematic approach to it.
Seventh, the delays that come with current publishing practices also mean that Jupyter notebooks associated with freshly published papers are using software versions near or even beyond their respective support window (which is 42 months in much of the Python ecosystem). For instance, the oldest Python version still officially supported in 2021 was 3.6 (which was itself retired by the end of 2021, when 3.10 was released), yet as shown in Figure 9, over a thousand Python notebooks in our corpus whose last commit was in 2021 still featured earlier Python versions, mainly 2.7 (phased out in 2020) but also 3.4 (2019), 3.5 (2020) and some for which the version could not be determined. This contributes to reproducibility issues. A similar issue exists with the versions of the libraries called from any given notebook, though the effects might differ as a function of whether they have been invoked with or without the version being specified. If the version had been specified, its official end of life might go back even further. If the version was not specified, the newest available version would be invoked, which may not be compatible with the way the library had been used in the original notebook. Similar issues can arise with the versioning of APIs, datasets, ontologies or other standards used in the notebook, all of which can contribute to reduced reproducibility. To some extent, these version delay issues can be reduced by preprints: since they are (essentially by definition, but not always in practice) published before the final version of the associated manuscript, their delays should be shorter, with lower reductions in reproducibility, though we did not investigate that in detail.
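Whether a notebook's declared Python version had already reached its end of life at the time of the last commit can be checked with a simple lookup. The sketch below uses only the end-of-life years stated in the paragraph above; the function names are hypothetical, and the comparison is deliberately conservative in that a commit in the end-of-life year itself is not flagged.

```python
# End-of-life years as stated in the text above.
END_OF_LIFE_YEAR = {"2.7": 2020, "3.4": 2019, "3.5": 2020, "3.6": 2021}


def minor_version(version):
    """Reduce a version string such as '3.6.9' to its minor series '3.6'."""
    return ".".join(version.split(".")[:2])


def past_end_of_life(version, last_commit_year):
    """Return True or False if it can be determined, else None.

    None means the series is missing from the table above (for example
    because it was still supported, or the version is unknown).
    """
    eol_year = END_OF_LIFE_YEAR.get(minor_version(version))
    if eol_year is None:
        return None
    return last_commit_year > eol_year


# Hypothetical notebooks: (declared Python version, year of last commit).
for version, year in [("2.7.15", 2021), ("3.4.3", 2021),
                      ("3.6.9", 2021), ("3.8.5", 2021)]:
    print(version, year, past_end_of_life(version, year))
```

The same lookup could be applied to library versions where end-of-life dates are published, though these are less uniformly documented than for Python itself.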
Eighth, the variety and scale of issues encountered in the notebooks analyzed here provide ample opportunities for use in educational contexts - including instructed, self-guided or group learning - since fixing real-life bugs can be more motivating than working primarily with textbook examples. To do this effectively would require some mapping of the strengths and weaknesses of the notebooks to learning objectives, which may range from understanding programming paradigms, software engineering principles or data integration workflows to developing an appreciation for documentation and other aspects of good scientific practice. Given the continuously expanding breadth of publications that use Jupyter notebooks, it is also steadily becoming easier to find publications where they have been used in research meeting specific criteria. These could be a particular topic - e.g. natural products research (Mayr et al., 2020) or invasion biology (Bors et al., 2019) - or workflows involving a particular experimental methodology like single-cell RNA sequencing (Vargo and Gilbert, 2020) or other software tools like ImageJ (Bryson et al., 2020).

Conclusions
On the basis of re-running 4169 Jupyter notebooks associated with 1419 publications whose full text is available via PubMed Central, we conclude that such notebooks are becoming more and more popular for sharing code associated with biomedical publications, that the range of programming languages or journals they cover is continuously expanding and that their reproducibility is low but improving, consistent with earlier studies on Jupyter notebooks shared in other contexts. The main issues are related to dependencies - both code and data - which means that reproducibility could likely be improved considerably if the code - and dependencies in particular - were better documented. Further improvements could be expected if some basic and automated reproducibility checks of the kind performed here were to be systematically included in the peer review process or if computational notebooks - Jupyter or otherwise - were combined with additional approaches that address reproducibility from other angles, e.g. registered reports.
and Brito et al. (2020) looked at specifics of computational reproducibility in the life sciences, Nüst et al. (2020) explored the use of Docker -a containerization tool -in reproducibility contexts, and Trisovic et al. (2022) looked at the reproducibility of R scripts archived in an institutional repository, while Rule et al. (2019), Pimentel et al. (2019) as well as Wang et al. (2020a); Willis et al. (2020) and Wang et al. (2020b) zoomed in on

Figure 2. Journals with the highest number of articles that had a valid GitHub repository and at least one Jupyter notebook. In the figures, journal names are styled as in the XML files we parsed, e.g. "PLoS Comput Biol". In the text, we use the full name in its current styling, e.g. "PLOS Computational Biology".

Axis labels (journals): PLoS Comput Biol, PLoS One, Nat Commun, Gigascience, BMC Bioinformatics, Sci Data, Front Neuroinform, Bioinformatics.

Figure 3. Journals by the number of GitHub repositories and by the number of GitHub repositories with at least one Jupyter notebook.

Figure 4. Journals by number of GitHub repositories with Jupyter notebooks. For each journal, the notebook count gives the maximum number of notebooks within a repository associated with an article published in the journal.

Figure 5. Articles by number of GitHub repositories with at least one Jupyter notebook by year.

Figure 6. ORCID usage in our collection. Bars indicate the total number of ORCIDs found each year for authors of articles in our collection. Colors indicate the number of articles that year with Jupyter notebooks. Note that data for 2021 is incomplete, as only articles published by mid-February have been included.

Figure 8. Relative proportion of the most frequent programming languages used in the notebooks per year.

Figure 9 shows the Python version of notebooks based on the year in which the repository was last updated. 2471 notebooks have Python version 3.6, followed by 2031 notebooks with Python version 3.7. Python versions 3.6 and 3.7 are commonly used in recent years, followed by version
Figure 10. Distribution of the maximum execution count across notebooks in our corpus.

Figure 11. Analysis of the notebook structure.

Figure 13. Notebooks with the string test.
Figure 14. Notebooks with dependencies.

Figure 17. Dependencies of Jupyter notebooks and GitHub repositories. In (a), the notebooks depending on external modules (green) are plotted against notebooks depending on local modules (red) and notebooks that had both (brown). In (b) and (c), GitHub repositories and Jupyter notebooks are shown as to whether they declared their dependencies via any combination of setup.py (red), requirements.txt (green) or a pipfile (pink).

Figure 20.Exceptions by year of publication.

Figure 23. Frequent notebook code style errors as per the Python code style guide.
Fully automated workflow used for assessing the reproducibility of Jupyter notebooks from publications indexed in PubMed Central: the PMC search query resulted in a list of article identifiers that were then used to retrieve the full-text XML, from which publication metadata and GitHub links were extracted and entered into an SQLite database. If the links pointed to valid GitHub repositories containing valid Jupyter notebooks, then metadata about these were gathered, and the Python-based notebooks were run with all identifiable dependencies, and their results analyzed with respect to the originally reported ones.

Table 1. Comparison of notebooks that were successfully executed without errors, grouped by whether their results were different from or identical to the results documented for the original notebook. For features listed in italics, the mean values per notebook are indicated, otherwise totals across all notebooks per group.

Table 2. Comparison of most frequent Python versions declared for notebooks that were successfully executed without errors, grouped by whether their results were different from or identical to the results documented for the original notebook. Versions listed in italics occur in both top-5 groups, versions listed in bold in only one. The absolute columns give the total number of notebooks per version and group, while the relative columns normalize the absolute values as a percentage of the total number of notebooks per group, i.e. 151 for different and 245 for identical, as per Table 1. In both groups, the top-5 versions account for slightly over half of the notebooks.

Table 3. Common Python notebook code warnings and style errors found in our study.