Keeping it light: (re)analyzing community-wide datasets without major infrastructure

Abstract DNA sequencing technology has revolutionized the field of biology, shifting it from a data-limited to a data-rich state. Central to the interpretation of sequencing data are the computational tools and approaches that convert raw data into biologically meaningful information. Both the tools and the generation of data are actively evolving, yet the practice of re-analyzing previously generated data with new tools is not commonplace. Re-analysis of existing data provides an affordable means of generating new information and will likely become more routine within biology, yet it necessitates a new set of considerations for best practices and resource development. Here, we discuss several practices that we believe to be broadly applicable when re-analyzing data, especially when done by small research groups.

occurring). We do believe, however, that it is a streamlined and accessible approach. Therefore, we have changed the wording to "streamlined and reproducible assembly framework" (line 40).
I have a similar objection here: "The Github-Zenodo framework presented here represents a relatively low cost way for small research groups (i.e. a graduate student) to perform large-scale re-analysis projects in a publicly accessible way." I would rephrase this, as GitHub and Zenodo are only for holding the code and the results. As the authors describe after that paragraph, the project required a vast amount of computational power to conduct the actual analysis, and for this another type of infrastructure was needed. So "to perform large-scale re-analysis projects" the GitHub/Zenodo combo is not sufficient. * This is a very good point. We agree that the framework's utility lies mainly in the hosting, not in the actual computational power required. We have changed the language to reflect that on line 139: "The GitHub-Zenodo framework presented here represents an efficient way for small research groups (i.e. a graduate student) to host and link both the code and results from large-scale re-analysis projects in a publicly accessible way."
GitHub is sometimes misspelled (Github - "H" not capitalized).
* Thank you for catching the misspelling.
* The figure was mainly designed to convey the idea of versioned pipelines being linked to versioned DOIs assigned to the outputs of the workflow (thus allowing the publication of multiple different datasets that are linked to alterations in the workflow). Given the feedback from both Reviewer 1 and Reviewer 2, we have decided to remove the figure.
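The versioned-pipeline/versioned-DOI cycle described above can be sketched with ordinary git commands. This is a minimal sketch using a throwaway local repository; the tag name and commit message are hypothetical, and it assumes the repository has already been connected to Zenodo's GitHub integration, so that each GitHub release made from a tag is archived and assigned its own version DOI.

```shell
# Minimal sketch of the GitHub-Zenodo versioning cycle (names are hypothetical).
# Uses a throwaway local repository so the steps can be run anywhere.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# 1. Commit the pipeline change that produced the new round of outputs.
git -c user.email=demo@example.org -c user.name=demo \
    commit -q --allow-empty -m "Update assembly pipeline"

# 2. Tag the exact pipeline state used to generate those outputs;
#    this tag is what becomes the GitHub release Zenodo snapshots.
git -c user.email=demo@example.org -c user.name=demo \
    tag -a v2.1.0 -m "Pipeline used for re-assembly, round 2"

git tag --list   # → v2.1.0

# 3. In a real project: git push origin main v2.1.0, then draft a GitHub
#    release from the tag. Zenodo archives the snapshot and mints a new
#    version DOI, linking this pipeline version to its results.
```

Each subsequent re-analysis repeats the cycle with a new tag, so every published set of outputs points back to the exact pipeline version that produced it.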
Reviewer #2: I rather like this opinion piece. It makes a well-reasoned argument about problems facing all of genomics, and more generally, all of "big data" science. I have just a few minor quips.
Thank you for your comments and thoughts; we have addressed your quips below. I'm not sure about the general utility of the MMETSP dataset as a testbed. This is because: 1a: The reads are too short. The vast majority are 50bp PE, which is not particularly representative of the read length most people are choosing for their de novo assembly projects, today, in 2018. How assemblers function with 50bp is likely different than how they function with 100bp. * Yes, we agree with the reviewer that 50bp PE reads are far shorter than what most assemblers will deal with now and in the future. As such, the dataset is limited in that capacity. However, we do feel that the great diversity of life that is surveyed makes it an important dataset. The point about short reads (relative to the current norm) was raised in the review of the accompanying Johnson et al. paper and has been addressed in more detail in the discussion section there.
1b: The dataset is too big. 700 transcriptomes will challenge even the most computationally advanced labs. I do imagine a defined subset as being a good test-set. Maybe the authors could propose a subset that captures the taxonomic breadth and other dimensions of the dataset? * This is a fantastic point. Yes, the dataset is far too big to be a simple test dataset. We decided to address this point in the main text of the associated paper by Johnson et al. Taking the reviewer's advice, we have identified and listed 12 'High' and 15 'Low' performing assemblies that cover a cross-section of diversity from the MMETSP.

These secondary data products of sequencing, such as annotated assemblies, should be viewed as hypotheses generated from the underlying biology, rather than some immutable "truth". As such, these secondary data products can continue to be improved as new tools are developed.

Through this process, we developed several practices that we believe to be broadly applicable when re-analyzing data, especially when done by small research groups.

From the perspective of our MMETSP re-analysis, we argue the community needs more than a place to put the primary and secondary data products associated with a single publication. Ideally, the results of each re-analysis would be deposited in a discoverable location, but would have a coherent archival procedure that is lab-independent, easily searchable, and "for-

Competing interests
The authors declare that they have no competing interests.