Data Resource Profile: The COloRECTal cancer data repository (CORECT-R)

Amy Downing,* Peter Hall, Rebecca Birch, Elizabeth Lemmon, Paul Affleck, Hannah Rossington, Emily Boldison, Paul Ewart and Eva JA Morris
Cancer Epidemiology Group, Leeds Institute of Medical Research at St James’s and Leeds Institute for Data Analytics, University of Leeds, Leeds, UK, Edinburgh Cancer Research Centre, University of Edinburgh, Western General Hospital, Edinburgh, UK, Edinburgh Health Economics, University of Edinburgh, NINE BioQuarter, Edinburgh, UK, and Nuffield Department of Population Health, Big Data Institute, University of Oxford, Oxford, UK


Data resource basics
Why was the resource established?
Colorectal cancer is a major public health problem in the UK. Each year in the UK around 41 000 people are diagnosed with the disease, 16 000 die from it 1 and international comparisons indicate that survival rates are lower than those attained by our economic neighbours. 2,3 It is estimated that detecting and managing the illness costs the National Health Service (NHS) in excess of £1.1 billion annually. 4,5 Despite this outlay, there remain major variations in diagnosis, treatment and outcomes. [6][7][8] In parallel, the research community invests significant resource and effort into understanding the aetiology of the disease and developing more effective and efficient methods of detecting and managing it.
High-quality data are essential to improving outcomes. Good intelligence underpins patient choice, helping individuals reduce the risk of disease and access the best care. It identifies and quantifies inequalities, improves the cost-effectiveness and quality of services and supports cancer research. Unfortunately, although there are numerous individual datasets containing important information about the disease and its management, both across the UK nations and internationally, the availability of high-quality cancer intelligence has been limited due to the challenges researchers face in gaining access to them. 9 The UK Colorectal Cancer Intelligence Hub has sought to address these data access challenges by creating a single UK colorectal cancer research data system as a model for data-driven research into all cancer types. Working in partnership with the public, patients, carers, data providers and data users, we have created a single research repository, known as the COloRECTal cancer data Repository (CORECT-R). This resource seeks to house all the colorectal cancer data described in Figure 1 and make them more readily accessible to researchers, while ensuring the security of the data to protect patient confidentiality. In this way, the UK Colorectal Cancer Intelligence Hub seeks to generate the intelligence needed to promote early diagnosis, optimize treatment, support clinical research and, ultimately, improve the UK's colorectal cancer outcomes.
At present, the resource contains information on all individuals diagnosed with colorectal and anal cancer in England between 1997 and 2018. This includes information on more than 600 000 cases of the disease. Much more remains to be done, however, and by incorporating more colorectal datasets from across the UK, and potentially beyond, the UK Colorectal Cancer Intelligence Hub aims to provide an annually updated and richer resource dedicated to research. This will enable far more pertinent analyses than have previously been possible.
The CORECT-R resource, and analyses based upon the data within it, has received approval from the South West-Central Bristol research ethics committee (18/SW/0134).

Data collected
In the UK, large population datasets already exist which contain detailed information on all cancer patients. For example, in England the National Cancer Registration and Analysis Service (NCRAS) curates high-quality national cancer registration datasets that contain information on every tumour diagnosed in the country. 10 Via linkage, or potential linkage, of these data to other administrative health datasets, such as inpatient and outpatient activity (Hospital Episode Statistics-HES), 11 radiotherapy data (the National Radiotherapy Dataset-RTDS), cancer medicines prescribing (the Systemic Anti-Cancer Therapy-SACT), 12 mortality data (Office for National Statistics) and many other similar datasets, there already exists a rich linked cancer data resource. Public health organizations have also built similar high-quality resources in Wales, 13 Scotland 14 and Northern Ireland. 15 What has not happened routinely, however, is curation of these datasets at a disease level by academic and disease experts to realize the full potential of the information within them. Furthermore, it has often been extremely difficult for such experts to access the datasets, not least as the data are so detailed that access must be restricted to protect the confidentiality of the people the data pertain to. 16 The CORECT-R model seeks to address these challenges by enabling dedicated data managers to work with the linked datasets within the secure digital environments, known as 'Safe Havens', of the respective jurisdictions. 14,15,17,18 In this setting, the data managers act as colorectal cancer data experts and focus on the processing and management of the relevant datasets to enhance their utility and availability. This can be by, for example, deriving new variables from linked datasets and resolving quality assurance problems such as conflicting versions of the same data item arising from different data sources. 
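As an illustration of the kind of quality-assurance step described above, the following minimal sketch shows one possible way to reconcile conflicting versions of the same data item. It is not the CORECT-R method: the source names, the priority order and the function `resolve_item` are all hypothetical, and the rule used (majority value, with ties broken by an assumed source priority) is only one of many reasonable choices.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Hypothetical source names and priority order, chosen for illustration only.
SOURCE_PRIORITY = ["registry", "hospital", "audit"]


def resolve_item(values_by_source: Dict[str, Optional[str]]) -> Tuple[Optional[str], bool]:
    """Return (chosen value, conflict flag) for a single data item.

    The most frequently reported value wins; ties are broken by the
    assumed SOURCE_PRIORITY order. The conflict flag records whether the
    sources disagreed, so that the case can be reviewed later.
    """
    # Ignore sources that did not record the item at all.
    present = {s: v for s, v in values_by_source.items() if v is not None}
    if not present:
        return None, False

    conflict = len(set(present.values())) > 1
    counts = Counter(present.values())
    top_count = max(counts.values())
    candidates = {v for v, c in counts.items() if c == top_count}

    # Break ties using the first prioritised source holding a candidate value.
    for source in SOURCE_PRIORITY:
        if present.get(source) in candidates:
            return present[source], conflict

    # No prioritised source reported a candidate: fall back deterministically.
    return sorted(candidates)[0], conflict
```

Keeping the conflict flag alongside the chosen value means the derived 'research ready' variable stays traceable to the disagreement that produced it.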
Having dedicated data managers working within approved secure environments limits access to the most sensitive data items, by enabling the production of de-identified 'research ready' datasets for use by the cancer intelligence community.
The types of data hosted and/or linked in CORECT-R can be categorized into four main groups: routine cancer datasets, routine non-cancer datasets, consented datasets and biological samples. These groups will be described briefly here, but further details of the datasets available can be found in the CORECT-R data catalogue [https://www.ndph.ox.ac.uk/corectr/corect-r].
The first group, routine cancer datasets (such as cancer registry, patient discharge and treatment datasets), are already held within CORECT-R. These datasets provide high-quality information on many aspects of the patient pathway, including initial cancer diagnosis, treatments received and subsequent outcomes, and are providing strong intelligence to inform and improve NHS colorectal cancer services. 7,[19][20][21]
The second group CORECT-R hosts is routine administrative datasets that contain information on people without a diagnosis of cancer. Examples of such 'non-cancer' data would be information on individuals participating in a screening programme who are not identified as having cancer, or prescription information from a population of people without cancer who are comparable in terms of age and sex to those with cancer. 22 These data are important, as many of the studies using CORECT-R will be strengthened by information on a comparable population without cancer. For example, studies looking to generate evidence to strengthen diagnostic pathways will require information on all who undergo the relevant diagnostic tests and not simply those who go on to get cancer. Similarly, to study the full impact of colorectal cancer on a population it is important to make comparisons with those without colorectal cancer. Linkage to non-cancer datasets is, therefore, essential to gain the intelligence needed to significantly improve colorectal cancer outcomes. The non-cancer datasets will contain no direct patient identifiers (being incorporated and linked via a secure pseudonymized process 23 ), meaning that members of the CORECT-R team and users of the resource will be unable to ascertain the identity of those individuals without cancer.
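The details of the pseudonymized linkage process are given in the cited reference rather than here. Purely to illustrate the general idea, the sketch below uses a keyed hash (HMAC-SHA256), a common generic technique that is assumed for this example and is not claimed to be the process used for CORECT-R. The function names, the `pid` field and the normalization rule are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash.

    Anyone without the secret key cannot recompute or reverse the
    pseudonym, but the same identifier always maps to the same value,
    so records can still be linked.
    """
    # Assumed normalization: strip spaces and upper-case before hashing.
    normalized = identifier.replace(" ", "").upper()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_on_pseudonym(cancer_records: List[Dict], other_records: List[Dict]) -> List[Dict]:
    """Join two lists of records on their shared, hypothetical 'pid' field."""
    by_pid = {row["pid"]: row for row in other_records}
    return [
        {**row, **by_pid[row["pid"]]}
        for row in cancer_records
        if row["pid"] in by_pid
    ]
```

In a design of this kind, the key would be held only by the party performing the pseudonymization, so that analysts working with the linked output never see the original identifier.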
The third group is consented datasets. A huge number of colorectal cancer research studies are undertaken in which individuals have consented to participate, including randomized clinical trials, cohort studies and surveys. Such studies often capture very detailed data (including study outcome measures, patient-reported outcomes and genetic information) that go beyond what is available in routine datasets, and so their inclusion has the potential to strongly enhance elements of the data. A specific example of such data already incorporated into CORECT-R is the patient-reported outcomes data taken from a national survey of colorectal cancer patients. 24,25 CORECT-R does not always hold the additional data or samples from these consented studies, but linkage facilitates access to the relevant components of the CORECT-R data and these associated studies. Where feasible and the necessary approvals allow, individuals within CORECT-R who have consented to such studies are flagged. Over time, linkage to more consented datasets will be facilitated, but this is dependent on the approvals under which study data were collected.
Finally, CORECT-R also enables linkage to samples. Access to biological samples increases the scope of colorectal cancer research by enabling phenotype to be related to management and outcomes. CORECT-R does not host these samples directly but flags cases where samples are available. This enables identification of relevant cases and access for research projects (again, where all relevant ethical and information governance approvals are in place). An example is the pathology specimens being collected through the Yorkshire Cancer Research Bowel Cancer Improvement Programme, where digital images of tumour sections are linked to the routine cancer datasets held in CORECT-R.
Researchers can seek to access all the datasets within CORECT-R in their standard or linked format. In addition, however, key pieces of information will be extracted from these component datasets to form national colorectal and anal cancer datasets. These are intended to be 'research ready', pseudonymized population-based datasets that researchers can easily access. They avoid the need for users to independently request broader extracts of data and replicate linkages to obtain commonly used key information (for example stage at diagnosis, comorbidity scores or treatment information). These datasets are continually developing and full details are available on the Hub website [https://www.ndph.ox.ac.uk/corectr]. Tables 1 and 2 summarize their current content.
The CORECT-R system also provides a secure Trusted Research Environment (TRE) via which researchers can access data. The CORECT-R TRE is a secure analytical area that, following both project and user approval, is accessed via two-factor authentication and a virtual desktop. It contains database and statistical analysis software and users can also bring their own software into the environment once approved. The TRE infrastructure is provided by the company AIMES and hosted in partnership with Cancer Research UK.

Data resource use
The potential of the data in CORECT-R is huge, enabling research into all aspects of the disease from its aetiology to its management and outcome. Ethical approval to establish a Research Database was obtained (UK NHS Health Research Authority 18/SW/0134) and this supports the use of its contents for projects that will promote early diagnosis, help quantify and address inequalities, support cancer research and improve outcomes. The Scottish Public Benefit Privacy Panel has also approved an initial phase of the Programme (PBPP: 1718-0026), with data accessible to the current project team. If a researcher has a project beyond these approved uses, the Hub will actively support applications to extend the use of the resource.
An example of how the resource has been used is the investigation of post-colonoscopy colorectal cancers (PCCRCs). These can occur when the main diagnostic test used to identify the disease, colonoscopy, fails to detect the tumour or the precursor lesion. Via linkage and analysis of cancer registry, hospital and screening data (combined in CORECT-R), work has been undertaken to identify PCCRCs and understand their occurrence across the English NHS. 7,26 These studies have helped to identify groups of individuals at greater risk of developing a PCCRC and also colonoscopy providers with significantly higher, and lower, PCCRC rates. 7 To try and ensure this intelligence is used to inform efforts to reduce rates, the results of the study were disseminated in the peer-reviewed literature and to all colonoscopy providers via the Getting it Right First Time (GIRFT) programme. 27 In addition, all providers with outlying rates, in both a positive and negative direction, were directly informed of their results. This led many to seek to review their cases to understand why they arose. The CORECT-R team supported and facilitated their applications to the Office for Data Release at Public Health England to identify their PCCRC cases and to audit their colonoscopy services.
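To make the linkage logic behind identifying a PCCRC concrete, the sketch below flags a cancer as post-colonoscopy when a colonoscopy preceded the diagnosis within a look-back window. The window used here (roughly 6 to 36 months, expressed in days) is an assumption made for illustration, as the definition applied in the cited studies is not stated in this profile; the function name is likewise hypothetical.

```python
from datetime import date
from typing import List

# Assumed look-back window in days (about 6 to 36 months before diagnosis).
# This definition is illustrative and is not taken from the text above.
MIN_DAYS = 183
MAX_DAYS = 1095


def is_pccrc(diagnosis: date, colonoscopies: List[date]) -> bool:
    """Return True if any colonoscopy falls inside the assumed window.

    A colonoscopy shortly before diagnosis is treated as the procedure
    that detected the cancer, so it does not count towards a PCCRC.
    """
    return any(
        MIN_DAYS <= (diagnosis - performed).days <= MAX_DAYS
        for performed in colonoscopies
    )
```

For example, under these assumptions `is_pccrc(date(2015, 6, 1), [date(2014, 1, 10)])` is true, whereas a colonoscopy performed 12 days before diagnosis would not be flagged.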
The resource is routinely used to quantify variations in colorectal cancer care across the NHS. For example, the Yorkshire Cancer Research Bowel Cancer Improvement Programme 28 has used the data to produce annual reports quantifying variation in the patient (sex, age, socioeconomic status and comorbidity) and tumour (stage at diagnosis, site of tumour and mode of presentation) characteristics of the colorectal cancer populations treated by each of Yorkshire's multidisciplinary teams, as well as variation in management and outcome. Treatments examined include use and type of major resection as well as the approach to surgery, use of neoadjuvant radiotherapy 19 and adjuvant chemotherapy. 21 Significant work has also been undertaken looking at the management of metastatic disease. 20 The Programme team regularly provide these reports to Yorkshire's multidisciplinary teams giving them information on how their cases compare with those managed by other teams in the region, as well as England as a whole. Clinical engagement is then sought to try to understand and explain why any variation exists which, in turn, leads to initiatives aimed at minimizing it. There are numerous other examples of how the data within CORECT-R can be used and more details are available on the Hub's website, alongside a summary of all the projects currently under way, or completed, using the resource.

Strengths and weaknesses
The main strength of CORECT-R is that it streamlines the extremely resource-intensive processes that researchers go through to access population-based datasets relevant to colorectal cancer. Previously, many research and service groups all worked in parallel to seek their own permissions to obtain extracts of data. They then undertook bespoke linkages and used different methods to analyse these data and, in consequence, often obtained slightly different findings. 29 This results in significant duplication of effort and resource, as well as confusion for those wishing to make use of the resulting intelligence. In addition, multiple data transfers increase the risk of data breaches that may threaten patient confidentiality and public trust. CORECT-R offers an alternative and more efficient route to the data and this collaborative approach, in turn, enhances the abilities of the cancer intelligence community to produce the evidence needed to drive improvements in colorectal cancer care.
Although the collaborative approach of CORECT-R is a strong model, it does pose challenges. These often arise because the resource aims to align datasets owned by multiple organizations, each with its own policies and procedures relating to data access, and these sometimes conflict. In addition, when different datasets containing the same information for individuals (for example date of diagnosis, type of surgery or site of tumour) are aligned, they can disagree, and this leads to significant challenges in quality-assuring the information. Again, however, via the collaborative nature of the Hub and the transparent data management and processing pathways it adopts, these challenges can be overcome.
CORECT-R aims to contain detailed information about all colorectal cancer patients in the UK and it is vital to respect the interests of those people to whom these data pertain and, in a very real sense, belong. Another strength of the UK Colorectal Cancer Intelligence Hub and its CORECT-R resource is its involvement of patients and the public. As a direct result, the concerns about the security and use of patient data are fully appreciated by the Hub team, and we have sought to design a system that will minimize these risks and anxieties while also maximizing the benefit the analysis of such population data can have. The Hub's Patient-Public Group is extremely active and involved in the management of CORECT-R and its outputs.

Data resource access
Researchers who wish to make use of the CORECT-R resource should contact the UK Colorectal Cancer Intelligence Hub team and discuss their requirements. The application process will differ depending on the details of the proposed project and the source of the data required. The application process centres on the development of a project protocol that describes the information required, justifies its use and sets out the study objectives. The Hub team will support the applicant through the process. For more details, see [https://www.ndph.ox.ac.uk/corectr].
The Hub website also hosts information on the data available within CORECT-R. The data catalogue contains detailed information on the 'research ready' national colorectal and anal cancer datasets. In addition, it also provides information on all the individual component datasets within CORECT-R and where to find more information about their content.
Cancer Research UK, the funders of the resource, and the UK Colorectal Cancer Intelligence Hub itself, are keen to ensure that the data within CORECT-R are used for the maximum benefit of colorectal cancer patients and, as such, have resourced the project to enable free access to academic researchers.
The publications arising from any projects that include CORECT-R data have to acknowledge the resource and its funding as well as the people who contribute data into the resource, with the following attributions: This project involves data that have been provided by, or derived from, patients and collected by the NHS as part of their care and support. This work was supported by Cancer Research UK (C23434/A23706), which underpinned data access via the UK Colorectal Cancer Intelligence Hub.
Finally, CORECT-R contains many different datasets and, if used in a particular project, the organizations who have contributed data need to be acknowledged. Users will be made aware of relevant attributions on a project-by-project basis.

Funding
The CORECT-R resource is supported by Cancer Research UK (grant C23434/A23706).