Towards a global cancer knowledge network: dissecting the current international cancer genomic sequencing landscape

Background: While next generation sequencing has enhanced our understanding of the biological basis of malignancy, current knowledge on global practices for sequencing cancer samples is limited. To address this deficiency, we developed a survey to provide a snapshot of current sequencing activities globally, identify barriers to data sharing and use this information to develop sustainable solutions for the cancer research community.
Methods: A multi-item survey was conducted assessing demographics, clinical data collection, genomic platforms, privacy/ethics concerns, funding sources and data sharing barriers for sequencing initiatives globally. Additionally, respondents were asked to provide the primary intent of their initiative (clinical diagnostic, research or combination).
Results: Of 107 initiatives invited to participate, 59 responded (response rate = 55%). Whole exome sequencing (P = 0.03) and whole genome sequencing (P = 0.01) were utilized less frequently in clinical diagnostic than in research initiatives. Procedures to identify cancer-specific variants were heterogeneous, with bioinformatics pipelines employing different mutation calling/variant annotation algorithms. Measurement of treatment efficacy varied amongst initiatives, with time on treatment (57%) and RECIST (53%) being the most common; however, other parameters were also employed. Whilst 72% of initiatives indicated data sharing, its scope varied, with a number of restrictions in place (e.g. transfer of raw data). The largest perceived barriers to data harmonization were the lack of financial support (P < 0.01) and bioinformatics concerns (e.g. lack of interoperability) (P = 0.02). Capturing clinical data was more likely to be perceived as a barrier to data sharing by larger initiatives than by smaller initiatives (P = 0.01).
Conclusions: These results identify the main barriers, as perceived by the cancer sequencing community, to effective sharing of cancer genomic and clinical data. They highlight the need for greater harmonization of technical, ethical and data capture processes in cancer sample sequencing worldwide, in order to support effective and responsible data sharing for the benefit of patients.


Introduction
In the emerging era of precision medicine, genomic analysis has become an integral component of the diagnostic work-up of cancer patients. Whereas DNA sequencing approaches initially tested individual cancer 'hotspot' loci (e.g. KRAS mutational status in colorectal cancer; EGFR mutational status in lung cancer), a more precise understanding of the biological basis of malignancy subsequently led to identification and deployment of specific 'cancer gene panels' as prognostic or treatment prediction tools. Additionally, the increased interrogative capacity afforded by next generation sequencing (NGS), allied to its decreasing cost, has empowered many institutions worldwide to perform whole exome sequencing (WES) or whole genome sequencing (WGS) on significant numbers of tumour samples. Primary data outputs from these initiatives are increasing exponentially, thus challenging scientific and clinical communities to develop workable solutions for optimal analysis, usage and storage of these datasets. Further complexity is introduced by the need to integrate these genomic data with associated clinical information.
Previously, on behalf of the Clinical Working Group of the Global Alliance for Genomics and Health (GA4GH) (a coalition of researchers, clinicians, patient advocates and life sciences/information technology industries dedicated to implementing worldwide data sharing solutions), we have highlighted the data challenges in cancer genomics [1], emphasized the currently siloed nature of the clinical, pathological and genomic datasets and proposed a blueprint solution that is predicated on a culture of responsible data sharing [2]. However, there is a lack of collective intelligence on current practices in cancer clinical sample sequencing initiatives worldwide. There is a paucity of information on the types of technical NGS platforms/parameters employed and choice of bioinformatics algorithms for analysis. Uniform approaches for collecting matched clinical and genomic data on outcomes and treatment toxicities are lacking [3]. Information is limited on both institutional enthusiasm for sharing their data and the technical ability to facilitate a data sharing culture. Costs and resources required to establish multi-institutional/international data sharing programs are considerable. From ethical and legal perspectives, data protection legislation/privacy concerns are also challenging, particularly as they can vary significantly according to geographic region [4]. These issues pose significant challenges for effective data harmonization and sharing. Thus, a detailed assessment of the current global cancer clinical sample sequencing landscape is required to inform and enhance present and future data sharing efforts.
Recognizing these information deficits, we performed an international survey of cancer clinical sample sequencing initiatives. This survey was designed to provide an informative snapshot of current activities worldwide and identify potential barriers that may limit data sharing activities, thus informing creation of a global informatics ecosystem that facilitates the sharing of clinical and genomic cancer data at scale.

Recruitment of respondents, survey development and data collection
Methodology for respondent recruitment, survey development and data collection is outlined in the supplementary Appendices S1 and S2, available at Annals of Oncology online.

Statistical analysis
Data collected from Google Forms were exported to the R statistical package for analysis. Descriptive statistics were used to summarize survey responses. All analyses were performed using χ² testing unless otherwise indicated. Likert scales were used to capture the extent of perceived barriers to data sharing (1 = minor barrier, 6 = major barrier). Given that not all questions were mandatory, sample size varied according to the particular question; thus responses have been displayed with the numerator (n) and denominator (N) (largest possible number of available responses). The denominator is reported for each section once, unless it changes.
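As a minimal sketch of the reporting convention and test described above, the following Python snippet formats an n/N response as a whole-number percentage and runs a one-degree-of-freedom χ² goodness-of-fit test. The helper names (`percent`, `chi_square_equal_split`), the 50/50 null hypothesis and the example counts are assumptions for illustration; the analysis itself was carried out in R, and the null used for each reported P value is not stated.

```python
import math

def percent(n, N):
    """Return n/N as a whole-number percentage (e.g. n = 28, N = 51 gives 55)."""
    return round(100 * n / N)

def chi_square_equal_split(yes, no):
    """Chi-square goodness-of-fit test (1 df) against a 50/50 null.

    Assumption: the equal-split null is illustrative only; it is not
    necessarily the null behind any P value reported in the text.
    """
    expected = (yes + no) / 2
    chi2 = (yes - expected) ** 2 / expected + (no - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, not taken from the survey.
chi2, p_value = chi_square_equal_split(30, 10)
```

The closed-form tail probability only holds for one degree of freedom; comparisons across more than two categories would need a general χ² distribution function.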

Results
The survey collected responses from July to October 2015. Out of the 107 initiatives invited, 59 responses were received (response rate = 55%). Of the non-responders, 9 initiatives indicated that their activities did not match the survey's scope or had already been captured in our survey, thus giving a true response rate of 60%. Of the remaining non-responders, the majority resided in the US (n = 23) and Australia (n = 8). None of the Chinese initiatives responded (n = 3). Survey completion rates varied across sections, ranging from 81% [Privacy and Ethics (n = 48, N = 59)] to 88% [Barriers (n = 52, N = 59)]. Completed surveys included respondents from diverse locations and initiative size, with the majority coming from North America and Europe (supplementary Tables S1, S3 and Figure S1, available at Annals of Oncology online). The primary intentions of the initiatives were: research [37% (n = 22)], clinical diagnostic [15% (n = 9)], combination [34% (n = 20)] and unknown [14% (n = 8)] as self-nominated by the individual initiative (supplementary Table S1, available at Annals of Oncology online). Relevant inter-institutional initiatives in Eastern Europe, Africa or India were not identified.
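The two response rates quoted above can be reproduced with a short calculation. This sketch assumes that the "true" rate is obtained by removing the 9 out-of-scope or already-captured initiatives from the denominator, which is consistent with the reported 60% but is not spelled out in the text; the variable names are illustrative.

```python
invited = 107
responded = 59
out_of_scope = 9  # non-responders outside the survey's scope or already captured

# Raw rate: all invited initiatives in the denominator.
raw_rate = responded / invited

# Adjusted rate (assumption): out-of-scope initiatives removed from the denominator.
adjusted_rate = responded / (invited - out_of_scope)

raw_percent = round(100 * raw_rate)
adjusted_percent = round(100 * adjusted_rate)
```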

Sequencing
Platforms. Wide variation in the type of sequencing platforms employed was observed. WES was the most frequently used (n = 28, N = 51, 55%) while WGS was also employed in a high number of initiatives (n = 22, 43%). A total of 35% (n = 18) used both WES/WGS, whereas 37% (n = 19) employed neither platform.
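The platform counts above are mutually consistent, which can be checked by inclusion-exclusion. The sketch below uses only the figures reported in this paragraph; the derived WES-only and WGS-only counts are not stated in the text and follow from the arithmetic alone.

```python
N = 51       # initiatives answering the platform question
wes = 28     # used WES
wgs = 22     # used WGS
both = 18    # used both WES and WGS

# Inclusion-exclusion: initiatives using at least one of the two platforms.
either = wes + wgs - both
neither = N - either

# Derived counts (not reported in the text).
wes_only = wes - both
wgs_only = wgs - both

neither_percent = round(100 * neither / N)
```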

Bioinformatics tools
Mutation calling. The most commonly reported bioinformatics tools were GATK (n = 27, N = 51, 53%), Samtools (n = 25, 49%), VarScan2 (n = 23, 46%), and Mutect (n = 20, 39%) (supplementary Appendix S3, available at Annals of Oncology online). The frequency with which these tools were employed is shown in Figure 1A. Logistic regression analysis addressing the type of data employed (prospective, retrospective, combination) indicated that GATK is less likely to be used in a prospective study (P = 0.02). No other significant relationships concerning data type, geographic location or size were identified.
Copy number alterations. Of the respondents, 85% (n = 44, N = 52) indicated that they estimated copy number alterations (CNAs) from their sequencing data, while 10% indicated not doing so. One initiative reported inference of CNA from targeted panel data.
Versioning of pipelines. The majority of initiatives (n = 44, N = 52, 85%) indicated that they kept records of which version of their computational procedures (also referred to as software pipelines) was employed to analyse sequencing data. Seven initiatives (13%) indicated that they were unsure as to whether the versions of their pipelines were tracked and one initiative (2%) did not track pipeline versions.

Clinical parameters
Merged clinical and genomic data. Of responding initiatives, 47 of 51 (92%, P < 0.01) attempted to link clinical information to genomic data. No association between the initiatives' intent (clinical diagnostic versus research versus combination) and linking of clinical and genomic data was identified (supplementary Table S3, available at Annals of Oncology online). Exclusively manual extraction of records was the most commonly utilized approach to data extraction (n = 23, 45%). Direct deposition of electronic health records (n = 9, 18%, versus manual extraction P = 0.01), a combination of manual and direct deposition (n = 9, 18%, versus manual extraction P = 0.01) and other approaches (e.g. in-house direct hospital data feeds) were less frequently utilized (n = 10, 20%, versus manual extraction P = 0.02) (supplementary Table S3, available at Annals of Oncology online). The majority of initiatives used a customized case report form for data collection (n = 34, N = 47, 72%).

Privacy and ethics
Written consent was obtained in 34 initiatives (N = 48, 71%); seven initiatives had implied consent/consent waivers (15%). The majority (n = 36, 75%) of initiatives allowed re-contacting of patients for follow-up information. A protocol for communicating somatic genetic results was in place in the majority of initiatives (n = 32, 67%) and a trend towards association with the initiative's intent (clinical diagnostic, research or combination) was identified (P = 0.06). A policy for incidental germline findings was in place in 23 initiatives (48%), but no association was identified with the initiative's intent (P = 0.55).

Data sharing
The majority of respondents (n = 36, N = 50, 72%, P < 0.01) indicated that they allow sharing of their data (supplementary Figure S2, available at Annals of Oncology online). Fourteen percent indicated not intending to share, while another 14% indicated that they are in the process of developing data sharing policies. No association was identified between data sharing and purpose of the initiative (P = 0.14). Data sharing typically came with a varied set of restrictions such as regional legislation (e.g. European data that cannot leave the Eurozone), intellectual property (IP) concerns and material transfer agreement restrictions. Certain initiatives remarked that there were significant limitations in transferring raw data between institutions.

Perceived barriers
The greatest barriers identified (defined as responses >4 on the Likert scale, N = 52) were: financial support for data sharing (77%, P < 0.01), bioinformatics concerns such as lack of conformity and interoperability of bioinformatics pipelines (69%, P = 0.02), and clinical data capture (60%, P = 0.19) (supplementary Figure S2, available at Annals of Oncology online). Initiatives with 1000 or more patients were more likely to perceive clinical data capture as a barrier compared with smaller initiatives (P < 0.01). Lack of expertise in the context of rapidly evolving technology (50%) and legal issues (37%) were also raised as potential barriers, whereas privacy/ethics issues (35%) and international legislation (33%) were not considered significant barriers. Of note, bioinformatics and financial concerns did not differ between size of initiative or whether the initiative was clinical diagnostic, research or combination. The free-text commentary of perceived barriers is shown in supplementary Table S4, available at Annals of Oncology online.
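The ">4 on the Likert scale" definition used above can be sketched as follows. The function name `share_major_barrier`, the handling of unanswered items as `None` and the example scores are hypothetical; only the 1 to 6 scale and the cutoff of 4 come from the text.

```python
def share_major_barrier(scores, cutoff=4):
    """Fraction of answered Likert items (1 to 6) rated above the cutoff.

    Unanswered items are passed as None and excluded from the denominator,
    since not all survey questions were mandatory.
    """
    answered = [s for s in scores if s is not None]
    if any(s < 1 or s > 6 for s in answered):
        raise ValueError("Likert scores must lie between 1 and 6")
    if not answered:
        return 0.0
    return sum(1 for s in answered if s > cutoff) / len(answered)

# Hypothetical ratings for one barrier; None marks a skipped question.
scores = [6, 5, 5, 3, 2, 6, 4, 1, 5, None]
share = share_major_barrier(scores)
```

With this cutoff a rating of exactly 4 does not count as a major barrier, so only ratings of 5 and 6 contribute to the reported percentages.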

Discussion
Molecular technologies such as NGS have revolutionized cancer biology discovery. Their successful clinical application depends on the sequencing platform and its robustness, the associated bioinformatics pipeline(s), and the availability of clinically annotated data from patients undergoing therapeutic interventions. Linking clinical and genomic data can justify molecular stratification of patients to specific interventions, but there is a realization that matched data must be available from sufficient numbers of patients to allow statistically robust, clinically meaningful conclusions to be drawn. Collaborative sharing of this information between initiatives increases the value and relevance of the data, for the scientist, the pharmaceutical industry, the clinician, the payer (insurance/taxpayer) and ultimately for the patient. However, effective data sharing is challenging, from technical, clinical, ethical, logistical and regulatory perspectives.
From a technical perspective, respondents to the survey employed a number of sequencing platforms and methodologies. Of these platforms, WES (P = 0.03) and WGS (P < 0.01) were more relevant to research application, with low adoption rates in clinical diagnostic initiatives. Conversely, clinical diagnostic initiatives employed greater sequencing depths than research initiatives (P = 0.012). Surprisingly, nearly 40% of initiatives surveyed did not have clinical molecular diagnostic laboratory certification/accreditation, highlighting a deficiency that must be addressed in order for NGS to be routinely incorporated into mainstream clinical diagnostics.
A key finding was the heterogeneity in variant/mutation calling and variant-annotation tools. Use of a single variant caller was rare and tended to involve products from the sequencing vendor or bespoke in-house algorithms. However, the employment of a suite of variant callers was the preferred approach. This heterogeneity in pipelines compromises the ability to compare results between different clinical sample sequencing initiatives [5]. Efforts to address this lack of harmonization are ongoing, through initiatives such as NCI's Genomic Data Commons (GDC) [6] and the Somatic Mutation Calling Challenge (SMCC) [7]. The recent development of the GA4GH Application Programming Interface (API) [8] provides an easy-to-use web-based algorithm for improved identification of mutations and rearrangements in sequencing data, and is gaining traction in the translational bioinformatics community.
Over 90% of respondents indicated that they had mechanisms in place to capture linked clinical and genomic data. However, uniformity was lacking for the collection and aggregation of this information. While the majority of institutes employed electronic case report forms, nearly half of the initiatives surveyed were manually extracting clinical data. In order to address this, initiatives such as ASCO's CancerLinQ project are developing custom-built electronic feeds from community oncology practices to maximize collection of clinical data [9].
A second challenge relates to the quality of the clinical data collected. Incomplete data sets reduce the value of the information collected, while lack of a cancer-specific ontology compromises the ability to aggregate and compare clinical and genomic data from different sources. Building a cancer-specific Human Phenotype Ontology (which has been an invaluable asset to the rare diseases community) [10] would significantly enhance phenotype-genotype correlations in the study of malignancy.
This survey also highlighted that longitudinal outcome/toxicity data are not captured through a standardized approach outside of clinical trials. Facilitating routine access to these data (e.g. through development of a minimum dataset) is necessary, in order to maximize the collective learning that can be achieved by aggregating clinical/genomic data, especially when analysing rare variants. It was encouraging that 75% of initiatives indicated that their protocol included the permission to re-contact patients, emphasizing the importance that clinical cancer sample sequencing initiatives place on the capture of follow-up patient outcome and toxicity data.
Over 70% of initiatives were in favor of sharing clinical and genomic data. However, a more detailed evaluation of both quantitative and qualitative responses revealed a number of barriers that exist and must be addressed. Lack of dedicated funding was perceived as the most significant barrier to data sharing activities. Collection of even a minimum clinical dataset has major human and technical resource requirements, leading to significant costs. Dedicated funding streams that actively promote data sharing should be encouraged. In this regard, the recent launch of the Innovative Medicines Initiative Big Data for Better Outcomes [11] incentivizes both the scientific and pharmaceutical communities to work together in large-scale data sharing activities. The second most commonly highlighted perceived barrier was lack of interoperability of bioinformatics pipelines, and we have already highlighted how initiatives/activities such as GDC [6], SMCC [7] and the GA4GH API [8] are addressing this challenge.
Issues with consent and data privacy were also raised in the free text narrative, with concerns relating to data protection legislation barriers in particular regions (e.g. Europe) and harmonization of consent procedures. It is hoped that the recent decision of the European Commission on the EU-US Privacy Shield will help address inter-continental data privacy issues [12]. These regulatory challenges limit the effectiveness of global cancer knowledge networking. From an ethics perspective, ethics harmonization has been a key theme of GA4GH's Framework for Responsible Sharing of Genomic and Health-Related Data [13], which we suggest should serve as an overarching ethical framework for clinical and genomic data sharing. Additionally, introduction of a federated data sharing approach, where data do not leave the particular legal jurisdiction but can be mined efficiently in situ, represents a potential solution for regions that are sensitive to primary data transfer. Concerns were also raised in relation to how data sharing might adversely affect publications and IP issues. Improving the quality of publications through effective data sharing, and developing micro-attribution based rewards where the work of data providers is acknowledged [14], should help allay these fears.
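The federated approach mentioned above can be illustrated with a toy sketch in which each site computes aggregate counts locally and only those aggregates are combined. Everything here is hypothetical (the site names, the record layout and the functions `local_variant_counts` and `federated_counts`); a real deployment would run the local step on each site's own infrastructure and add authentication, auditing and disclosure controls, none of which is modelled.

```python
from collections import Counter

def local_variant_counts(records):
    """Run within one jurisdiction: return per-gene counts only.

    Patient-level records never leave this function's caller.
    """
    return Counter(record["gene"] for record in records)

def federated_counts(sites):
    """Combine the aggregate counts returned by each site."""
    total = Counter()
    for records in sites.values():
        total += local_variant_counts(records)
    return total

# Hypothetical patient-level records held at two sites.
sites = {
    "site_eu": [{"gene": "KRAS"}, {"gene": "EGFR"}],
    "site_us": [{"gene": "KRAS"}, {"gene": "KRAS"}, {"gene": "TP53"}],
}
combined = federated_counts(sites)
```

In this single-process toy both sites are visible to one program; the point is only that the quantity crossing the site boundary is a count per gene rather than the underlying records.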
The benefits of data sharing become increasingly relevant as we collectively realize that our current catalogue of actionable cancer mutations is limited, and even there, consensus is lacking. Molecular stratification approaches have identified distinct disease subtypes, some of which may be relatively rare. Thus, a collective approach employing information from data repositories worldwide is increasingly required to identify/verify relevant mutations that can inform improved diagnosis or identify novel targets. Such an approach has already been employed by GA4GH in the BRCA Challenge [15], which convened BRCA experts from around the world to work together to share BRCA variants publicly, thus allowing expert review of variant interpretations to determine the pathogenicity of an increased number of variants in the BRCA1/BRCA2 genes. This work has resulted in BRCA Exchange [16], a curated web portal that allows the BRCA community to query the current evidence of any BRCA1/2 variant present in the aggregated dataset. Extending the BRCA Challenge approach to other genes and cancers would allow a more granular understanding of variant relevance, thereby informing clinical actionability.
We acknowledge that this study has several limitations. By its nature, it is a snapshot at a particular moment in time, in a rapidly advancing field. While our aspiration was to capture responses from cancer sample sequencing initiatives worldwide, there is an enrichment towards initiatives in North America and Europe, due to a combination of an inability to identify cancer clinical sample sequencing collaborative initiatives and/or a lack of response from such initiatives in particular countries/regions (e.g. India, China). Nonetheless, this survey is a first attempt to catalogue cancer clinical sample sequencing activity worldwide and represents a useful benchmark to inform cancer data sharing activities going forward.

Conclusions
This is the first comprehensive global survey of cancer clinical sample sequencing initiatives. It provides an evidence-based perspective informed by responses from experts worldwide concerning the key barriers to data sharing. It emphasizes the need to break down individual data silos and underscores the requirement to provide robust approaches for clinical and genomic data collection and analysis. It highlights how limited dedicated funding, a dearth of standardized methodologies and a lack of thoughtful integration are hampering clinically relevant data sharing efforts. Developing a bioinformatics ecosystem that delivers open-source interoperable solutions to overcome the barriers we highlight would maximize the potential for responsible but effective sharing of clinical and genomic data for the benefit of cancer patients globally.