*-DCC: A platform to collect, annotate, and explore a large variety of sequencing experiments

Abstract

Background
Over the past few years, the variety of experimental designs and protocols for sequencing experiments has increased greatly. To ensure the wide usability of the produced data beyond an individual project, rich and systematic annotation of the underlying experiments is crucial.

Findings
We first developed an annotation structure that captures the overall experimental design as well as the relevant details of the steps from the biological sample to the library preparation, the sequencing procedure, and the sequencing and processed files. Through various design features, such as controlled vocabularies and different field requirements, we ensured a high annotation quality, comparability, and ease of annotation. The structure can be easily adapted to a large variety of species. We then implemented the annotation strategy in a user-hosted web platform with data import, query, and export functionality.

Conclusions
We present here an annotation structure and user-hosted platform for sequencing experiment data, suitable for lab-internal documentation, collaborations, and large-scale annotation efforts.

to great opportunities for addressing and complementing research questions with already available sequencing data. A crucial aspect here is to be able first to find the appropriate data and then to use them in a way that is consistent with the underlying biological experiments [2]. The systematic description of the available sequencing data, together with the description of the underlying biological experiments and sample details, is a critical prerequisite.
The open science concept requires publication of the sequencing data alongside the scientific results [3]. Sequencing databases such as the Sequence Read Archive (SRA) [4] or the Gene Expression Omnibus (GEO) [5] collect raw or processed sequencing data, make them available to the community, and provide identifiers to connect data to scientific publications. The sequencing data are accompanied by an often minimalistic high-level description of the experiments, samples and technologies employed [6].
Genome annotation projects including ENCODE [7], ModENCODE [8] and FANTOM [9] describe experimental aspects more systematically and in greater detail. Together with the sophisticated query and export functionalities they provide, this enables consistent processing of sequencing data and allows direct comparison between all data within the projects. At the same time, significant human resources are required for such data annotation and curation [10]. Moreover, the underlying technical solutions were specific to each of these projects and were not designed to be generalizable to other contexts, owing to limited access to the source code, limited documentation, and the design approach of the code.
Here, we present a strategy to systematically annotate sequencing data together with their corresponding biological experiments. We implemented the strategy as a webserver-based platform with a user-friendly interface allowing data collection and decentralized data annotation. This Data Coordination Center (*-DCC) constitutes a generic and flexible framework designed to be adaptable to data of various types and from various species. The user interface for uploading data was inspired by the SRA Submission Portal Wizard [11]. The query and export interface was designed to resemble the ENCODE DCC data interface [12]. The *-DCC presented here is suitable for large-scale annotation efforts such as the DANIO-CODE genome annotation project [13]. The *-DCC can also facilitate sequencing data management for a single lab, with the added benefit of allowing selected data to be shared with other labs.

Annotation Structure
The description of data is guided overall by the design of the conducted experiments and the corresponding experimental workflow (Figure 1).
Figure 1. Overview of the *-DCC annotation structure. The *-DCC structure was designed to capture all study steps necessary for downstream analysis; it groups information into sections that parallel the steps of a generic sequencing experiment.
All experiments of one study targeting the same research question are collected under one common series object, which also contains the description of the overall purpose of the experiments. As an example, a case-control study that inspects gene expression and histone marks in a number of animals carrying a genetic mutation and in their corresponding wild-type controls would constitute a typical series. The next level in the annotation is the description of the biosample, for example the age or developmental stage of the animals, the genetic background or the anatomical origin of the samples. This includes labelling biosamples as biological controls or biological replicates. The assay level captures the type of assay and the library protocol details, such as RNA-Seq, ChIP-Seq or the immunoprecipitation targets used. It is independent of the above-mentioned levels, so that the same instance of an assay can be used in different series instances. In practice, the assays often correspond to sequencing library preparation kits; applying them to the biological samples results in applied assay objects. On this level, technical controls and replicates are labelled as such. Following the experimental workflow, the applied assays are sequenced using a specific platform and instrument with corresponding settings, all of which are captured in the sequencing level. Finally, the sequencing files are the immediate results of the sequencing process, together with corresponding files resulting from data processing. These files are described on the data level, which can also include additional information, for example about the genome version or the processing pipeline employed.
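To make the hierarchy more concrete, the levels described above can be sketched as plain Python dataclasses. This is an illustration only and not the platform's actual Django models: all class names, attribute names and example values are assumptions chosen to mirror the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Series:
    """All experiments of one study that target the same research question."""
    title: str
    purpose: str


@dataclass
class Biosample:
    """Biological material, e.g. stage, genetic background, anatomical origin."""
    series: Series
    developmental_stage: str
    genetic_background: str
    is_biological_control: bool = False


@dataclass
class Assay:
    """Assay type and library protocol; deliberately holds no series reference."""
    assay_type: str                 # e.g. "RNA-Seq" or "ChIP-Seq"
    target: Optional[str] = None    # immunoprecipitation target, if any


@dataclass
class AppliedAssay:
    """An assay applied to one biosample; technical replicates are labelled here."""
    biosample: Biosample
    assay: Assay
    technical_replicate: int = 1
    is_technical_control: bool = False


@dataclass
class Sequencing:
    """Platform, instrument and settings used to sequence an applied assay."""
    applied_assay: AppliedAssay
    platform: str
    instrument: str


@dataclass
class DataFile:
    """A raw or processed file, with optional processing details."""
    sequencing: Sequencing
    path: str
    genome_version: Optional[str] = None
    pipeline: Optional[str] = None


# Hypothetical example: one Assay instance reused in two different series.
chip = Assay(assay_type="ChIP-Seq", target="H3K4me3")
series_a = Series("Mutant vs wild type", "case-control study")
series_b = Series("Developmental time course", "survey")
sample_a = Biosample(series_a, "24 hpf", "AB")
sample_b = Biosample(series_b, "48 hpf", "TU")
applied_a = AppliedAssay(sample_a, chip)
applied_b = AppliedAssay(sample_b, chip)
```

Because `Assay` carries no reference to a series, the link between the two only arises through `AppliedAssay`, which is what lets one assay description serve several series.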
Where applicable, we limit the annotation to a set of predefined terms. This unifies the metadata and, at the same time, guides annotators to the most appropriate terms, ensuring a high level of annotation consistency. The controlled vocabulary constitutes the most species-specific part of our platform and might require adaptation to the species of interest.
Our annotation strategy requires certain terms to be provided by the annotator during the annotation process, for example whether the experimental design is based on a case-control or a survey layout. Other terms are required only under certain circumstances; for example, the assay target has to be provided only for ChIP-seq and other immunoprecipitation assays. A third category comprises the optional fields, which allow further information to be entered and queried in a structured way, for example the maximal read length of a sequence.
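The three field categories (always required, conditionally required, optional) can be expressed as one rule per field. The following is a minimal sketch under stated assumptions: the field names, the set of immunoprecipitation assays and the function `missing_fields` are hypothetical and are not taken from the *-DCC code base.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]
# A rule receives the whole record and reports whether its field is required.
Rule = Callable[[Record], bool]

# Assumed set of assays that need a target; extend as appropriate.
IMMUNOPRECIPITATION_ASSAYS = {"ChIP-seq"}

FIELD_RULES: Dict[str, Rule] = {
    # Always required.
    "experimental_design": lambda record: True,
    "assay_type": lambda record: True,
    # Conditionally required: only for immunoprecipitation assays.
    "assay_target": lambda record: record.get("assay_type") in IMMUNOPRECIPITATION_ASSAYS,
    # Optional: never reported as missing.
    "max_read_length": lambda record: False,
}


def missing_fields(record: Record) -> List[str]:
    """Return the names of required fields that are absent or empty."""
    missing = []
    for name, is_required in FIELD_RULES.items():
        value = record.get(name)
        if is_required(record) and (value is None or value == ""):
            missing.append(name)
    return missing


rna = {"experimental_design": "case-control", "assay_type": "RNA-seq"}
chip = {"experimental_design": "survey", "assay_type": "ChIP-seq"}
print(missing_fields(rna))   # []
print(missing_fields(chip))  # ['assay_target']
```

Treating "required" as a predicate over the whole record keeps all three categories in one table, so adding a new conditional field does not require a new code path.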

Upload of data and annotations
To give better insight into the specifics of uploading data to *-DCC, we compared the upload workflows of SRA and *-DCC. SRA provides an interactive annotation platform called Submission Portal Wizard and uses Microsoft Excel files or web forms for data input. Similarly, *-DCC provides a CSV-based and a web form-based submission option. To compare the two platforms adequately, we went through the two form-based approaches using a typical zebrafish sequencing experiment as an example. Compared to *-DCC, the SRA covers a wider scope of experiments and data sources, e.g. metagenome studies and pathogen studies. Therefore, we discuss only the relevant matching options in SRA.
After login, the SRA Submission Portal Wizard starts by requesting information about the submitter. In the *-DCC, this information is entered indirectly by specifying a lab for the Biosample, Assay and Sequencing sections and through the information that the currently logged-in user provided during registration. On *-DCC, being logged in is not sufficient: only users with the annotator role have permission to upload annotations and data. The General Info step on the SRA platform asks for already created bioProject and bioSample instances related to this upload as well as a publication date on which the uploaded data go public. In the *-DCC form, a PubMed and a GEO ID can be provided for similar purposes. In addition, the series can be set as public, that is, visible to every user of the platform, or as visible only to the currently operating user. Such private datasets can later be opened to the public.
The next step collects the biological details on both platforms. In contrast to the SRA, the majority of these fields are connected to a controlled vocabulary on *-DCC. On both platforms, some annotation fields are required while others are optional. A detailed comparison of the different terms is provided in Figure 2C.
The SRA Metadata step of the Submission Portal Wizard corresponds to four distinct steps in *-DCC: Assay, Applied Assay, Sequencing and Data. These steps contain details about the library preparation, the sequencing instrument and the data files. Both platforms provide a controlled vocabulary for several fields of this section via drop-down menus. The *-DCC can hold additional information about the sequencing settings and allows the same assay to be used in a different series entry. In the final two steps of the Submission Portal Wizard, the uploader provides the files either locally, via FTP preloads or via Amazon S3 buckets. Afterwards, the whole annotation is submitted. On the *-DCC, the file upload takes place in the Data section, by providing either a URL to a web-accessible file or, for previously uploaded files, the file path on the DCC server. We recommend that users stay on the upload page until a confirmation of the successful upload appears, although the upload will continue even if the page is closed. Depending on the file sizes and the internet connection of both the *-DCC server and the annotator, the file upload can last from a few minutes up to hours (see also Methods).
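Since the Data section accepts either a URL or a path on the server, an entry has to be told apart before it is processed. The sketch below shows one way to do this with the Python standard library. It is not the platform's implementation: the function name `classify_file_source`, the restriction to `http`/`https`, and the requirement that server paths be absolute are all assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of URL schemes that count as "web-accessible".
WEB_SCHEMES = {"http", "https"}


def classify_file_source(source: str) -> str:
    """Return "url" or "server_path" for a file reference, else raise ValueError."""
    parsed = urlparse(source)
    if parsed.scheme in WEB_SCHEMES and parsed.netloc:
        return "url"
    # Assumption: preloaded files are referenced by absolute POSIX paths.
    if parsed.scheme == "" and PurePosixPath(source).is_absolute():
        return "server_path"
    raise ValueError(f"neither a web URL nor an absolute server path: {source!r}")


print(classify_file_source("https://example.org/reads/sample1.fastq.gz"))  # url
print(classify_file_source("/data/preloaded/sample1.fastq.gz"))            # server_path
```

Rejecting anything that fits neither form, such as a bare file name, gives the annotator an immediate error rather than a failed upload later.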

Query and export of data and annotations
To query and export data and their annotations, *-DCC provides a table view with filter options (Figure 3) as well as an interactive heat map, similar to the matrix view in the ENCODE DCC. Together with the annotation structure, these data views allow different series to be pooled based on a combination of shared annotation terms, for example the same assay or the same developmental stage, by clicking on the relevant terms in the left sidebar. Similar to the ENCODE DCC, the *-DCC sidebar indicates the number of occurrences of each term in the current table. This enables quick identification of complementary data sets for later integrative analysis. The *-DCC allows the download of sequencing files as well as their accompanying annotations. The annotation file can then be used with data processing pipelines to select processing parameters based on the annotations. This might make *-DCC a suitable platform for consortia or large-scale studies. The DANIO-CODE consortium uses *-DCC to collect and annotate zebrafish sequencing data (available at danio-code.zfin.org).

Limitations
*-DCC was not designed as a laboratory information management system (LIMS) and therefore was not built to capture every detail of an experiment. We limited the platform to the aspects necessary for downstream analysis and integrative studies. For the same reason, *-DCC does not provide an API for automated annotation and data uploads. Furthermore, the main goal of *-DCC is to capture genomics lab experiments; as a result, it was not designed to capture, for example, the collection locations of metagenomics studies.

Methods
*-DCC was implemented as a Django 1.11 app with an underlying PostgreSQL database and a JavaScript-supported frontend. We rely on Django's user management framework with four different roles: guest, viewer, annotator and admin. Guests are users who are not logged in and have restricted access to pages on the platform.
Logged-in users, termed viewers, can be given additional access, for example to data sets that have not been set to public (not activated in the demo setup). Annotators are given wider access, including to the tools for uploading annotations and data. The admin can access the Django admin page to make changes to the database, fix broken uploads and manage user roles.
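The four roles can be pictured as an ordered hierarchy in which each role includes the rights of the one below it. The sketch below is plain Python rather than the platform's Django permission code; the strict ordering, the action names and the helper `is_allowed` are assumptions made for illustration.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles in ascending order of access (ordering is an assumption)."""
    GUEST = 0      # not logged in
    VIEWER = 1     # logged in
    ANNOTATOR = 2  # may upload annotations and data
    ADMIN = 3      # may use the admin page


# Hypothetical actions mapped to the lowest role that may perform them.
MINIMUM_ROLE = {
    "view_public_series": Role.GUEST,
    "view_non_public_series": Role.VIEWER,
    "upload_annotations": Role.ANNOTATOR,
    "fix_broken_upload": Role.ADMIN,
    "manage_user_roles": Role.ADMIN,
}


def is_allowed(role: Role, action: str) -> bool:
    """Check whether a role reaches the minimum level required for an action."""
    return role >= MINIMUM_ROLE[action]


print(is_allowed(Role.VIEWER, "upload_annotations"))     # False
print(is_allowed(Role.ANNOTATOR, "upload_annotations"))  # True
```

In a Django deployment the same idea would more typically be expressed through groups and permissions; the enum form is used here only because it is self-contained.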
File upload occurs asynchronously using the Django file interface and AJAX calls. The platform can therefore handle multiple uploads at the same time. Depending on the file sizes and the internet connection of both the *-DCC server and the annotator, the file upload can last from a few minutes up to hours. For larger files (>10 GB), we recommend preloading them on the server to speed up the process. Broken uploads have to be resolved manually by the admin.
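A rough calculation illustrates why upload times range from minutes to hours and why preloading helps for large files. The bandwidth values below are assumptions chosen for illustration, the helper names are hypothetical, and the estimate ignores protocol overhead and server-side processing.

```python
# Threshold above which preloading is recommended (10 GB, decimal units assumed).
PRELOAD_THRESHOLD_BYTES = 10 * 1000**3


def estimated_upload_seconds(size_bytes: int, bandwidth_mbit_per_s: float) -> float:
    """Idealised transfer time: size in bits divided by bandwidth in bit/s."""
    if bandwidth_mbit_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / (bandwidth_mbit_per_s * 1_000_000)


def should_preload(size_bytes: int) -> bool:
    """Apply the >10 GB recommendation."""
    return size_bytes > PRELOAD_THRESHOLD_BYTES


ten_gb = 10 * 1000**3
print(estimated_upload_seconds(ten_gb, 100) / 60)   # about 13 minutes at 100 Mbit/s
print(estimated_upload_seconds(ten_gb, 10) / 3600)  # about 2.2 hours at 10 Mbit/s
print(should_preload(25 * 1000**3))                 # True
```

A tenfold drop in bandwidth moves the same 10 GB file from minutes to hours, which is consistent with the range stated above.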
Unit tests based on Django's test framework as well as end-to-end tests via Cypress are available. The source code is available under the MIT license at https://gitlab.com/daniocode/public/dcc.
A Docker container is available in the repository for testing and deployment; see dcc.readthedocs.io for installation instructions and code documentation.
A demo implementation is running at http://dcc-demo.daublab.org/; log in with username "annotator" and password "annotator" to obtain annotator user rights.

Detailed answers to reviewer reports
Reviewer #1: The authors present a software platform to collect and annotate sequencing metadata and data.
Sadly, the authors fail to convince on the software they implemented for various reasons.

General style / formatting / rigor
The authors do not seem to respect the structure requested by the journal for a submission. For instance, the Findings section is empty. All content is under Background.
We follow the suggestions given by the journal, which state that Background can be a subsection of the Findings. We are open to changing the structure in case the reviewer or the editor finds it more appropriate.

Content
There are very few references to back any statement made in the article.
We revisited our statements and provided additional references in the Background section.
Furthermore, there is no comparison to existing tools or platforms out there.
We added a detailed comparison with the SRA upload procedure in the "Upload of data and annotations" section. We further added a comparison to ENCODE in the "Query and export of data and annotations" section. We chose these two tools because they are well established and, to our understanding, commonly used by the research community.
The idea appears similar to a LIMS for sequencing data, of which there are many.
In our understanding, the purposes of a LIMS and a DCC are different. While the main goal of a LIMS is the documentation of lab procedures including experiments, the goal of a DCC is to facilitate data access with a focus on integration of data from different experiments.
Figure 1 and the Supplementary Information do not match (some terms are mandatory but shouldn't be, some terms are listed but not explained).
We corrected the mismatches.
Reviewer #2: The manuscript "*-DCC: A platform to collect, annotate and explore a large variety of sequencing experiments" describes an annotation structure for sequencing experiment data. The authors provide a very high-level overview of DCC and point towards an instance of its practical use by a research community (the zebrafish DANIO-CODE consortium). The manuscript is very light on detail and would benefit from some kind of comparative analysis with other similar tools to highlight the advantages of DCC.
The platforms of other projects such as ENCODE are mentioned but dismissed as requiring too much manual intervention for most projects; however, this is not fully explained. An example detailing the differences between adding an experiment to DCC compared to ENCODE or other similar platforms would be very useful.
We added further discussion and details, including a step-by-step comparison between the SRA and the *-DCC upload interfaces, in the "Upload of data and annotations" section.
There is a lack of consistency between the required fields in Figure 1 and the description in the supplementary data, e.g. in Fig. 1 there is no reference to optional terms 'DOI' or 'GEO ID' in 'Series', and 'number of biological replicates' and 'developmental state' are required in Figure 1 but conditional in the supplementary data.
We corrected the inconsistencies.
There is a lack of detail in the methods section and, while Figure 2 shows a user interface, there is no text, or links to documentation, worked examples or other user support provided either in the manuscript, on the GitLab pages or on a dedicated DCC website. These would be of significant benefit to any prospective users of the tool.
We added detailed online documentation at dcc.readthedocs.io and also set up a demo instance at dcc-demo.daublab.org.
The implementation of the DCC platform at the DANIO-CODE consortium does demonstrate the functionality of DCC and could be considered as proving the authors' assertions of utility. While setting up an instance of DCC from scratch would be expected to be well within the technical capacity of a large consortium, if the platform is, as the authors envision, to be used on a much smaller scale for data management in an individual lab or to share data between collaborators, then more documentation and user support material is essential.
We improved the support material, including more detailed installation and customization instructions, and refined the Docker setup to make it easier and faster to change and build.
Overall, this manuscript flags the DCC platform and the DANIO-CODE instance demonstrates its use to the practical benefit of a community but to be of broader utility there needs to be additional information both in the manuscript and some dedicated user support documentation on both deployment and use.