Datastorr: a workflow and package for delivering successive versions of 'evolving data' directly into R

Abstract The sharing and re-use of data has become a cornerstone of modern science. Multiple platforms now allow easy publication of datasets. So far, however, platforms for data sharing offer limited functions for distributing and interacting with evolving datasets - those that continue to grow with time as more records are added, errors fixed, and new data structures are created. In this article, we describe a workflow for maintaining and distributing successive versions of an evolving dataset, allowing users to retrieve and load different versions directly into the R platform. Our workflow utilizes tools and platforms used for development and distribution of successive versions of an open source software program, including version control, GitHub, and semantic versioning, and applies these to the analogous process of developing successive versions of an open source dataset. Moreover, we argue that this model allows individual research groups to achieve a dynamic and versioned model of data delivery at no cost.


Introduction
Sharing of a quality dataset - a collection of measurements, stored in one or several files - is now considered a first-class scientific output. Increasingly, funding bodies, publishers, and scientific social norms are recognizing the value of sharing datasets, including as standalone products without any accompanying analyses [1,2,3,4,5]. Evidence of this trend is seen in the increasing numbers of standalone "Data papers" appearing in both standard domain-level journals and specialised data journals. Yet, while the last decade has witnessed a rapid and exciting change in attitudes towards data sharing, the scientific community is still grappling with how to effectively maintain and distribute open-source datasets [1,6,7,8,9,4,10,11,12]. In particular, in some areas, such as our own area of ecology and evolution, we are only starting to grapple with the fact that some high-quality datasets may be evolving entities [12].
An evolving (or 'living') dataset is one that is subject to occasional or recurrent change. Typical changes may include improving the quality of existing data, adding new data, re-structuring the dataset content, or integrating with other datasets. For example, a dataset on biological organisms might be expanded through the addition of new records or improved through the correction of spelling mistakes in taxonomic names. In some cases, datasets may be expected to continue to evolve over extended periods [e.g. 13]. Evolving datasets are never "finished", and as such there is no "master" or "canonical" version. Rather, as research around a data product grows, there might be many valid versions produced. Even datasets that are not initially envisioned as evolving may become so as minor errors are identified and corrected during use. In either case, the most recent version of the dataset will typically contain the best-available information, but there are still reasons to go back to previous versions: to replicate previous analyses or to work on a stable version for downstream analyses or visualisation.

Key Points
• Evolving datasets are those that are often being expanded and improved. Users of evolving datasets require easy ways to access successive versions.
• Using techniques adopted from software development, we present a workflow for maintaining and distributing versions of an evolving dataset. A new package datastorr enables fetching and loading successive versions directly in R.
• Using semantic versioning to label versions of a dataset helps users identify the type of change that has occurred.
A common approach taken by those maintaining an evolving dataset is to release sequential versions of the dataset, each containing a snapshot of the dataset at the time of release [e.g. 14,15,12,16]. Ideally, the latest versions of an evolving dataset would be immediately available to all users across the globe, along with notes describing the changes when compared to previous versions. For the sake of reproducibility, previous versions of the dataset should be archived and remain available. In the recent past, small research groups have solved the issue of versioning data internally and informally, for example by emailing around the latest version. However, as science grows and moves towards more systematic sharing of data, scalable solutions are needed to distribute dataset versions to a wider variety of users.
One approach taken by large research consortia has been to create dedicated web servers for archiving and delivering data. Projects such as the Sloan Digital Sky Survey (https://www.sdss.org) have sophisticated infrastructure and processes for distributing successive versions of very large datasets [16]. The issue of updating data has also been addressed in some centralised repositories, such as GenBank for genetic sequences, where new data can be added and errors in existing records can be corrected. Yet these web databases require a level of funding and technological infrastructure that is beyond most research groups.
The vast majority of research projects are smaller and these currently rely on more generic data repositories for distributing data. A common approach for distributing a dataset is to release it under a Digital Object Identifier (DOI) in a standalone data repository, such as DataDryad, Figshare, and Zenodo. While these platforms did not all initially support versioning of datasets, they now support multiple versions of a dataset, either under a single or different DOIs [17]. Yet, while these new features in principle allow users to access multiple versions of a dataset, the release, discovery, and access to multiple versions of a dataset is not always straightforward.
We believe more can be done to streamline the distribution of a potentially large number of dataset versions to users, especially for small research teams with limited budgets. There are at least three challenges. First, dataset developers need a cheap - ideally free - and reliable system to create and distribute versions of an evolving dataset with low technical overhead. Second, users need an easy mechanism to discover the existence of new (or all) versions of an evolving dataset. Third, users need a mechanism to retrieve specific versions. For those using a computational language such as R [18], all versions of an evolving dataset would ideally be both accessible and discoverable directly from within R.
In this article we outline how emerging technologies from software development (Table 1) can be used to address these challenges, enabling small research groups to create and maintain a stream of versions for small-to-medium sized datasets (up to 2Gb), and distribute these directly into the R computational environment for a potentially unlimited number of users at zero financial cost and minimal technical overhead. To achieve this we developed a new R package called datastorr, which together with the other technologies allows for easy and scalable delivery of successive versions of an evolving dataset directly into R. At the time of publishing this article, this workflow was being used to distribute versions of datasets across a wide range of topics (Table 2), suggesting a potentially wide domain of application.

A lightweight, cheap, and scalable workflow for delivering versions of an evolving dataset into R
In brief, the workflow we present here borrows best practices for software development [19] and applies them to the challenge of maintaining and distributing versions of an evolving dataset. Our approach envisions multiple parties involved in the creation and/or use of a versioned dataset, including developers, contributors and users (Fig. 1). Each of these will likely have different goals and requirements (see Table 3). When building a piece of software, developers maintain a core set of code which produces the binary executable file that is eventually installed on a user's local computer. Analogously, developers of an evolving dataset maintain a core set of files (the "code"), which produces an organised dataset that can be "installed" (i.e., loaded) on a user's local computer. In the development of either software or data, successive versions - called "releases" - are distributed as snapshots of the generated product at a particular point in time.
The similarity in workflow between software and data allows us to re-purpose some of the same technological platforms that are used to maintain and distribute versions of a software product to maintain and distribute versions of an evolving dataset (Table 1). Importantly, these tools are available free of charge for open source projects and are already well developed - ensuring high-level performance and stability. Moreover, the combination of technologies allows us to address the goals and requirements of the different parties involved in the creation and use of a versioned dataset (see Table 3).
An overview of the proposed system is as follows.
• Raw data files are stored under version control in a git repository - a free and leading version control system used in software development - by the dataset developers. All the files that go together to build a single dataset are stored in the repository, together with any code used to manipulate these files to create the dataset that is ultimately distributed.
• Changes to the raw data files and code are tracked by the developers using git's ability to make "commits" - granular and annotated snapshots of the source files over time.
• The git repository is hosted on GitHub - a leading platform for hosting git repositories - enabling multiple developers or other contributors to work collaboratively on improving a dataset (Fig. 1).
• Developers use the files in the repository to make a release of the dataset - a snapshot of the generated data product at a particular commit - and upload these to GitHub, where they are hosted alongside the raw files and (optionally) labelled using "semantic versioning". The version labels indicate both the ordering of versions and the magnitude of change expected between different versions.
• Using the datastorr package, users can retrieve a list of all available versions of the dataset, retrieve particular versions on demand, and load them directly into R.
• Those not using R can also access versions from GitHub.

Table 1. Overview of technologies used to maintain, store, and distribute versions of an evolving dataset as described in this paper.

git: Open source version control system used for tracking progressive changes in a set of text files, typically computer code but also data.
GitHub: A commercial web platform available at github.com for sharing, visualising, and managing git repositories. Includes the ability to browse the 'history', 'issue' tracking, and the ability to host 'releases'. Also has a well developed Application Programming Interface (API) enabling programmatic access to dataset releases.
R: Widely-used and open-source language for data processing and statistical analysis.
datastorr: A package in R used to fetch releases of an evolving dataset hosted on GitHub.
semantic versioning: The process of assigning unique version numbers in a particular format to successive versions of a digital product; traditionally applied to software but here applied to an evolving dataset.
Below we elaborate on each of the different technologies.

Version control
Version control, primarily via an open-source tool called git, has become widespread in software development. In practice, version control tracks line-by-line changes in text files and creates and maintains a history of those changes. Increasingly, version control has been applied to scientific code and also to data management, especially for small-to-medium sized datasets [20,9,8]. git is attractive for data management because it tracks all changes in monitored files, provided these are saved in text format (e.g., ".csv", ".tsv", ".txt"; with some tricks git can also indicate changes in some other file types such as ".xlsx"). It allows users to annotate commits with informative messages detailing the rationale for those changes. The "history" of commits is also visible to anyone interacting with the repository. In its present form, git can handle individual data files at least up to 100MB, which covers a large fraction of scientific cases.
As a general strategy for tracking a dataset under version control with git, we recommend:
• Developers establish a separate git repository for each dataset to be distributed.
• Saving data in their rawest form. Some datasets might consist of only a single file. Others may have many files that get manipulated or combined in some way to produce a unified product.
• Where possible, saving all files as plain text, so that git can identify line-by-line changes. For example, save tabular data as a "csv". While this approach works well for small-to-intermediate sized files, those with larger files may prefer to use a compressed format to reduce repository size and bandwidth.
• Including any code used to manipulate or compile the raw data files into the final dataset. For example, you might combine many independent datasets into one unified dataset.
• Documenting any changes in the dataset by making a commit in the git repository, with an informative message outlining why the change was made.
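The recommendations above can be sketched with a handful of git commands. The repository name, file names, and commit messages below are hypothetical, chosen only to illustrate the pattern:

```shell
# Create a repository dedicated to a single dataset (names are illustrative)
mkdir -p treeheights && cd treeheights
git init -q
git config user.name "Data Developer"      # identity needed for committing in a fresh setup
git config user.email "dev@example.com"

# Save the raw data as plain text so git can track line-by-line changes
printf 'species,height_m\nEucalyptus regnans,90\n' > height.csv
git add height.csv
git commit -q -m "Add initial tree height records"

# Correct an error, documenting the rationale in the commit message
printf 'species,height_m\nEucalyptus regnans,100\n' > height.csv
git commit -q -am "Correct E. regnans height; earlier value was a transcription error"

git log --oneline   # the annotated history, visible to all contributors
```

The same pattern extends to repositories holding many raw files plus the code that compiles them into the distributed product.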

Hosting and distributing versions of an evolving dataset
Datasets stored under version control via git reach their real potential when hosted at a suitable internet hosting service [20,9]. Here we focus on the platform GitHub (Table 1). Hosting of a git repository enables dataset developers to connect with other potential contributors and also users (Fig. 1). These platforms are designed to work with git repositories, and thus offer many helpful features, such as the ability to record issues, host documentation, or review edits over time.
Another notable feature of GitHub is the ability to host a stream of releases of the dataset, alongside the git repository containing all the raw files. Each release is linked to a specific commit in the git repository history and occurs at a point where the dataset developer decided to generate a new version of the data for distribution. While users could in principle download the entire git repository, most of the time what they want are the releases.
Deciding when to make a new release is at the discretion of the dataset developer. In practice, one makes fewer releases than one makes commits into the git repository, though there is nothing stopping developers from releasing a new version for every commit. The flexibility here allows developers to do internal work between releases and only release the data to users when the revision represents a clear improvement on the previous release.
Another important consideration is that websites like GitHub naturally cater for two types of users accessing the data: those that interact with the data via point-and-click downloading, and those that use programmatic interaction (Fig. 1, Table 3). Specifically, GitHub releases can be downloaded directly by users or accessed programmatically via the GitHub API.

Semantic versioning
To realize the full benefits of a version-controlled dataset, users should be able to easily intuit the types of changes that have occurred among versions. Since software development has effectively already dealt with a similar problem in the labelling of software releases, we suggest there is benefit in adopting best practices from that field.
Specifically, we suggest adapting the process of semantic versioning, developed for labelling successive releases of software (see semver.org), to the labelling of successive releases of an evolving dataset (Fig 2). In semantic versioning of software, a tri-digit label of the form "X.Y.Z" is applied to each version, where X (the major version), Y (the minor version), and Z (the patch version) are non-negative integers - for example, version "2.1.2". Although everyday practice may differ, the guidelines at semver.org suggest labels are incremented in a particular way, determined by changes in the public API for the software.
Though the analogy to software is not perfect, datasets can also be thought of as having an "Interface", determined by the structure of the dataset, which dictates how users interact with the resource. For example, in tabular data the structure is determined by the names of different files, the column labels within each, and the presence of different subgroups within the table (as indicated by labels in particular columns). Successive versions of a dataset can then be labelled in a manner analogous to that of software, determined by the structure of the dataset and changes in that structure.
Drawing inspiration from the guidelines for semantic versioning of software at semver.org, we suggest the following guidelines for labelling a dataset with semantic versioning:
• Clearly communicate the structure of the dataset in the metadata or landing page. This includes file types, data types, and element names.
• Use versions beginning with "0.Y.Z" to indicate products where the structure is still in development.
• Version "1.0.0" defines the structure.
• Once defined, increment version numbers to communicate any changes to the structure.
• Increment the "Major" version (X) when you make changes to the structure that are likely incompatible with any code written to work with previous versions. Such changes may include revising the file names, the structure of the dataset, or changing element names (e.g. column headers). Substantial additions of data might also be considered a major change to structure, especially where they add new subgroups to the dataset.
• Increment the "Minor" version (Y) to communicate any changes to the structure that are likely to be compatible with any code written to work with the previous versions (i.e. allows code to run without error). Such changes might involve adding new data within the existing structure, so that the previous dataset version exists as a subset of the new version. For tabular data, this includes adding columns or rows. On the other hand, removing data should constitute a major version, as records previously relied on may no longer exist.
• Increment the "Patch" version (Z) to communicate correction of errors in the actual data, without any changes to the structure. Such changes are unlikely to break, or change in a substantial way, analyses written with the previous version.
• Once a dataset version has been released, do not modify it. Further modifications are released under a new version number.
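In practice, a developer can record these labels as git tags and then attach the generated data product to the corresponding GitHub release. The repository and tag names below are illustrative; the final step assumes the GitHub CLI (`gh`) and an authenticated session, so it is shown commented out:

```shell
# Illustrative repository whose first release defines the dataset structure
mkdir -p demo-data && cd demo-data && git init -q
git config user.name "Data Developer" && git config user.email "dev@example.com"

printf 'species,count\n' > records.csv
git add records.csv && git commit -q -m "Define dataset structure"
git tag v1.0.0          # major version 1: structure is now defined

# Add new rows within the existing structure: a minor release
printf 'Acacia acuminata,12\n' >> records.csv
git commit -q -am "Add first field records"
git tag v1.1.0          # minor: data added, structure unchanged

git tag --list 'v*'     # lists the ordered version labels

# Attach the generated product to a GitHub release (requires the gh CLI):
# gh release create v1.1.0 records.csv --notes "Minor release: new records added"
```

Tagging each release commit keeps the version labels discoverable in the repository history itself, independent of the hosting platform.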
While it is hoped the guidelines above help users understand the types of changes that have occurred between successive versions of a dataset, any change in a dataset may alter the results of a user's analysis in non-trivial ways. Unlike developers of software, developers of a dataset cannot guarantee full backwards-compatibility, i.e. that certain results will remain unchanged in updated versions. We suggest that responsibility for verifying how different versions of an evolving dataset influence a particular analysis or use thus always remains with the user, even if simply applying a so-called "patch". While further work - and likely experience - is needed to refine the process of semantic versioning for datasets, and to further develop shared understanding between data developers and data users of what different changes imply, semantic versioning still provides a more nuanced way for developers to communicate to users the types of change they can expect.

Loading data versions directly into R using the datastorr package
For efficient usage and to aid reproducibility, many users will want programmatic access to all versions of any particular dataset (Table 3). Code to access a stream of GitHub releases could be written individually by each user, but this creates an unnecessary technological hurdle. To make it easier for users to access versioned data via code, we developed a new package for the R platform, one of the most prominent platforms for data science [18].
Our package, called datastorr (github.com/ropenscilabs/datastorr), facilitates access to releases of any evolving dataset hosted on GitHub (Fig. 1). Specifically, the datastorr package: 1) contains the main code needed to interact with the GitHub API to retrieve versions of the dataset; and 2) enables users to construct the shell of a second, dataset-specific R package, which can be distributed and used to access releases for a specific repository stored on GitHub. Using datastorr, a researcher can create and distribute a custom R package that facilitates access to their data with (very) minimal computational skills.
For example, datastorr has been used to build several packages (Table 2), including taxonlookup (github.com/traitecoevo/taxonlookup), which hosts data on the taxonomy of the world's land plants [15]. The R package taxonlookup consists of only a few simple functions and associated help files that were automatically generated with datastorr. For a user, accessing a version of the data is as simple as typing a single line of code (Fig. 1). Accessing a different version of the data involves changing only the version number. From the user's perspective, the existence of the taxonlookup and datastorr packages makes it possible to reproduce analyses using specific versions of the data [e.g. 15,23].
Using datastorr, dataset developers can set up their own R package to deliver versions of an evolving dataset simply by providing: i. a GitHub repository name (e.g., "traitecoevo/taxonlookup") where releases are stored; ii. the filename in the release that contains the data; iii. the function used to load the data file into R.
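As a sketch of what this looks like in R, the calls below follow the pattern used in the taxonlookup example; the exact function names and arguments are assumptions that should be checked against the current datastorr documentation:

```r
# Developer side: generate the shell of a dataset-specific package
# (repository, reader function, and data name are the three pieces listed above)
datastorr::autogenerate(repo = "traitecoevo/taxonlookup",
                        read = "readRDS",
                        name = "plant_lookup")

# User side: fetch the latest release (downloaded once, then cached locally)
d <- taxonlookup::plant_lookup()

# List all available versions, or load a specific one for reproducibility
taxonlookup::plant_lookup_versions()
d_old <- taxonlookup::plant_lookup(version = "1.1.0")
```

The generated package thus exposes the dataset itself as a function call, with the version number as an ordinary argument.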
Then as the dataset grows over time, the developers update the git repository and create a GitHub release with a new version number. All the releases are simultaneously available to any user, both point-and-click and programmatically.
The dataset-specific packages created by datastorr are designed to be computationally efficient and to also work offline. Packages created by datastorr contain no actual data, only the rules for fetching the data. As such, the basic package structure is quick to install and takes up virtually no space on the user's hard-drive. The package functions by fetching each data version once (the first time it is requested), and then caching these files locally for future reuse. Moreover, users can store several versions of an evolving dataset on their computer and unambiguously access different versions with a single function call.

Discussion
The key issue we are dealing with in this article may be familiar to many readers: many datasets are constantly evolving and, despite tremendous advances in data sharing and associated technologies over the last decade, there is as yet little consensus about how to maintain and distribute multiple versions of an evolving dataset, especially for small research teams. While such teams could in principle create their own dynamic web interface, the technological hurdles, cost and maintenance required are discouraging. Moreover, existing platforms for distributing data offer a limited set of features for the delivery of successive versions of an evolving dataset. This suggests there is a need for an easy, cheap, and scalable solution for maintaining and distributing successive versions of an evolving dataset. By adopting open-source and scalable practices from software development, the workflow we present here offers one such solution.

Towards an ecosystem for evolving data
Our contribution here connects with a growing number of recommendations and technologies supporting the sharing and reuse of evolving data. Such contributions include community guidance on good practice in data curation [6,8], data citation [7] and the FAIR principles for making datasets Findable, Accessible, Interoperable, and Reusable by both machines and humans [11]. In our proposed system, information about appropriate attribution for any dataset (whatever that is determined to be) should be made readily available, either on the landing page within GitHub, or even better distributed as part of the versioned dataset itself. Similarly, datasets can be structured to make them follow the FAIR principles, to the extent possible. Notably, our workflow with the datastorr package provides machine access to datasets - a core focus of the FAIR principles. While our proposed workflow does not currently enhance discoverability of new datasets, this is a broad challenge faced by all data platforms and researchers. While our package datastorr offers an easy way for users of the R ecosystem to directly access dataset versions, users of other languages can also access the datasets. Moreover, packages similar to datastorr would ideally be developed for other languages, to make accessing dataset versions as easy there as it is in R.
Within the R ecosystem, the datastorr package complements other approaches for creating and delivering datasets. One common approach used within R is to embed data directly within an R package, which can then be distributed via the Comprehensive R Archive Network (cran.r-project.org). Moreover, dedicated packages are being developed to assist dataset developers in creating data packages [24]. An advantage of this approach, compared to ours, is that the data are immediately available in the package (whereas our packages contain only instructions for fetching the data). This advantage, however, also brings limitations. Notably, datasets must be under 5MB, and only one version of a dataset package can be installed on any given machine at any one time. datastorr offers a viable approach for overcoming these limitations.
There are also many emerging or alternative technologies that offer other possible ways to implement a system for storing and distributing versions of an evolving dataset. Our solution currently emphasises the platform GitHub, but similar functions could be achieved via other git hosts such as bitbucket.org and gitlab.com. Git repositories can also be extended to accommodate larger files using features like Git Large File Storage or git-annex. More fundamentally, there are emerging alternatives for version control specifically designed for data, such as datproject.org, and other new platforms for distributing data, such as the Comprehensive Knowledge Archive Network (CKAN) and Open Knowledge International (OKFN).
The key here is not the specific technology, but rather the concept of creating, maintaining, and distributing versions of an evolving dataset, which may be achieved with all of these approaches. Indeed, as with every technology the best available approach is certain to evolve, especially as emerging technologies facilitate even better delivery of data in the future.

Further advantages and extensions
A central feature of the proposed system is that data are maintained on the web. This has two main benefits: first, it provides a platform for multiple data contributors to sync their files and correspond about changes in the dataset, and second, it allows for hosting of a stream of data releases for distribution (Fig. 1). Web platforms thus act as a central point for the collection, curation, and distribution of the data. Additionally, one of the greatest benefits of using web platforms like GitHub for development of both software and data has been the way they encourage contributions from multiple individuals working simultaneously - including from people outside the initial group of project participants [25,9]. Multiple developers can make changes to different parts of the code (or, in our case, data) and the git system will integrate these together or, when needed, flag where there are conflicts that need to be resolved. The proposed system of data delivery thus has the added benefit of facilitating seamless and transparent collaboration among research groups in the construction and maintenance of datasets.
In the long term, scientists want their datasets, software, and papers to be preserved and remain accessible. While our proposed system for data delivery does not guarantee long-term preservation, users can also choose to automatically archive data versions released on GitHub in one of several traditional data archives, with a DOI (Digital Object Identifier) minted for each release. Currently, both Zenodo and FigShare integrate with GitHub for archiving of material hosted there. Ideally, tools like datastorr would be developed to pull versions from these archives too.

Ethical Approval
Not applicable.

Consent for publication
Not applicable.

Competing Interests
The author(s) declare that they have no competing interests.