Crowdsourcing genomic analyses of ash and ash dieback – power to the people

Ash dieback is a devastating fungal disease of ash trees that has swept across Europe and recently reached the UK. This emergent pathogen has received little study in the past and its effect threatens to overwhelm the ash population. In response to this we have produced some initial genomics datasets and taken the unusual step of releasing them to the scientific community for analysis without first performing our own. In this manner we hope to ‘crowdsource’ analyses and bring the expertise of the community to bear on this problem as quickly as possible. Our data has been released through our website at oadb.tsl.ac.uk and a public GitHub repository.

Ash dieback is a devastating disease of ash trees caused by the aggressive fungal pathogen Chalara fraxinea. This fungus emerged in the early 1990s in Poland and has since spread west across Europe, reaching native forests in the UK late last year. The emergence of Chalara in the UK, where up to 90% of the more than 80 million ash trees are thought to be under threat, caused public outcry. The disease, which is a newcomer to Britain, was first reported in the natural environment in October 2012 and has since been recorded in native woodland throughout the UK. There is no known treatment for ash dieback; current control measures include burning infected trees to try to prevent spread [1], and the implications for the UK environment and the economy remain stark.
To kick-start genomic analyses of the pathogen and host, we took the unconventional step of rapidly generating and releasing genomic sequence data. We released the data through our new ash and ash dieback website, oadb.tsl.ac.uk, which we launched in December 2012. Speed is essential when responding to rapidly appearing and threatening diseases. With this initiative we aim to make it possible for experts from around the world to access and analyse the data immediately, speeding up the process of discovery. We hope that by providing data as soon as possible we will stimulate crowdsourcing and open community engagement to tackle this devastating pathogen.

The transcriptomics and genomics data we have released so far
We have generated and released Illumina sequence data for both the transcriptome and genome of Chalara and for the transcriptome of infected and uninfected ash trees. We took the unusual first step of directly sequencing the "interaction transcriptome" [2] of a lesion dissected from an infected ash twig collected in the field. This enabled us to respond quickly, generating useful information without time-consuming standard laboratory culturing: the shortest route from the wood to the sequencer to the computer.
The Chalara transcriptome data, generated at The Sainsbury Laboratory (TSL, Norwich, UK), were derived from two infected ash samples collected at Ashwellthorpe Lower Wood, near Norwich, the location of the first confirmed case of ash dieback in the wild in the UK. We extracted RNA from branches of two infected ash trees, prepared cDNA libraries from each and sequenced these to create 76 nt paired-end reads on our Illumina GAII.
In parallel to the transcriptome data, genome sequence data were produced in a coordinated effort between The John Innes Centre (JIC), TSL and The Genome Analysis Centre (TGAC) in Norwich. A single C. fraxinea isolate was cultured from infected tissue found in Kenninghall Wood. Genomic DNA libraries were constructed and sequenced on an Illumina MiSeq sequencer as 150 nt and 250 nt paired-end libraries.
As soon as these datasets were generated we released them through oadb.tsl.ac.uk. We took the unusual step of releasing the data before any preliminary analysis had been undertaken so that we might take advantage of the huge range of expertise and knowledge available outside our groups, and thereby make the best of the data as quickly as possible via a crowdsourcing approach.

Crowdsourcing: bringing the power of many, marshalling expertise and democratising genomics
Crowdsourcing is a form of massively parallel collaboration, the main distinguishing features of which are the low barrier to participation and the low level of investment required from each participant. The power is in the sheer number of people interested in seeing the goals of the project achieved. Scientists have not been slow to adopt these models to carry out work that cannot be automated successfully and requires human intelligence and expertise. Recently, genomic scientists have made inroads into leveraging the power of crowds by annotating and assembling the genome sequence of a novel strain of Escherichia coli O104:H4 that caused a serious outbreak of foodborne illness in northern Germany in spring 2011. These scientists were able to link up quickly with others across the world with similar skills to rapidly analyse the novel pathogenic strain [3]. Most importantly, crowdsourcing allows for a new and potentially effective form of live peer review: many sets of eyes interrogating and reviewing data and analyses mean that unusual results are quickly highlighted and can be assessed and dealt with appropriately. Whether these results are eventually found to be inconsistencies in analysis or, more excitingly, genuine new discoveries, the end product is brought to the scientific community many times faster than through the usual peer review by a small number of reviewers and, crucially, it all happens out in the open with maximum transparency. The cornerstone of our crowdsourcing is our repository on GitHub [4], a versioning system designed for collaboration in software development that automatically maintains attribution of contributions, meaning that whoever contributes will get full credit for the difference that they made.
We are certain that the data will prove useful to anyone who wishes to be involved in the fightback against ash dieback and that concerted, early data-sharing and open analysis is a crucial step in a productive and timely response to emergent pathogen threats.

The future of our data and our initiative
To date, genome analysis of emerging plant pathogens has not been implemented as rapidly as is now routine for human pathogens [5,6]. Worse, the data (when available) is not immediately released into the public domain. We hope our openness will encourage the scientific community to engage in this proactive and collaborative model of working when faced with pressing challenges. Already we are seeing a significant amount of work from external groups. Transcriptome assemblies, protein domain annotations, phylogenetic trees and BLAST searches for specific gene family members have been contributed by groups across the world.

Credit where credit is due
We absolutely understand the need for scientists to be credited for what they do and we intend to make sure that everyone who contributes receives full attribution. The GitHub repository ensures this, and we are committed to the principle for all other potential results from this initiative. The altmetrics movement is making it possible and acceptable for scientists to cite the varied products of science [7], rather than simply the papers they write, and we intend to make it as easy as possible for contributors to cite what they did via commit number and potentially DOIs.
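As an illustration of what citing a contribution by commit number could look like in practice, the following Python sketch builds a citation string from a commit hash. It is only a sketch under stated assumptions: the repository address, the Contribution record and the citation layout are hypothetical and are not a format prescribed by this initiative. The only external convention it relies on is that GitHub serves an individual commit at the repository address followed by /commit/ and the commit hash.

```python
import re
from dataclasses import dataclass

# Hypothetical placeholder address; substitute the real repository URL.
EXAMPLE_REPO_URL = "https://github.com/example-org/example-repo"


@dataclass(frozen=True)
class Contribution:
    """A single contributed analysis, identified by its Git commit (illustrative record)."""
    author: str
    description: str
    commit: str  # full 40-character hexadecimal commit hash
    year: int


def commit_url(repo_url: str, commit: str) -> str:
    """Return the permanent GitHub address of one commit."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("expected a full 40-character hexadecimal commit hash")
    return f"{repo_url.rstrip('/')}/commit/{commit}"


def format_citation(c: Contribution, repo_url: str = EXAMPLE_REPO_URL) -> str:
    """Format an illustrative citation: author, year, description, short hash, address."""
    return (
        f"{c.author} ({c.year}). {c.description}. "
        f"Commit {c.commit[:7]}. {commit_url(repo_url, c.commit)}"
    )


if __name__ == "__main__":
    example = Contribution(
        author="A. Contributor",
        description="Transcriptome assembly",
        commit="0123456789abcdef0123456789abcdef01234567",
        year=2013,
    )
    print(format_citation(example))
```

Because a commit hash identifies one exact state of the repository, a citation built this way points to precisely the files a contributor added, even after later changes by others.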

Towards a rapid response for food and ecosystem security
A pathogenic threat to our forests and ecosystems is a threat to our ability to live on the planet sustainably, just as a threat to our crops is a threat to our ability to feed ourselves. In these situations it is vital to respond as quickly as possible, so we must embrace the evolution of a new digital immune system [8]. Our initiative is an early step towards developing this crucial function of the digital immune system for responding to plant pathogens. The one thing we cannot upload to a repository is the people with the expertise and the will to contribute, and that is why we need the scientific community to download our data and provide analyses.