An extensible big data software architecture managing a research resource of real-world clinical radiology data linked to other health data from the whole Scottish population

Abstract Aim To enable a world-leading research dataset of routinely collected clinical images linked to other routinely collected data from the whole Scottish national population. This includes more than 30 million different radiological examinations from a population of 5.4 million and >2 PB of data collected since 2010. Methods Scotland has a central archive of radiological data used to directly provide clinical care to patients. We have developed an architecture and platform to securely extract a copy of those data, link it to other clinical or social datasets, remove personal data to protect privacy, and make the resulting data available to researchers in a controlled Safe Haven environment. Results An extensive software platform has been developed to host, extract, and link data from cohorts to answer research questions. The platform has been tested on 5 different test cases and is currently being further enhanced to support 3 exemplar research projects. Conclusions The data available are from a range of radiological modalities and scanner types and were collected under different environmental conditions. These real-world, heterogenous data are valuable for training algorithms to support clinical decision making, especially for deep learning where large data volumes are required. The resource is now available for international research access. The platform and data can support new health research using artificial intelligence and machine learning technologies, as well as enabling discovery science.

Introduction

Clinical images, especially when linked to other routinely collected health data, are extremely useful for many types of research: examining early/preclinical diagnosis [1], disease progression [2,3], genotype-phenotype associations [4], development of risk profiles [5,6], computer vision methods for biomarker extraction [7][8][9], machine learning approaches [10][11][12][13], and discovery and classification of disease types [14]. The emerging field of Radiomics has the potential to bridge the gap between medical imaging and personalised medicine [15]. However, collecting images for specific research projects is expensive and constrains the scale of many studies. Research cohorts usually comprise a narrow subset of people with a specific condition, which can make both generalising findings and repurposing images for research problematic. Use of routinely collected images, in contrast, opens up the potential for very large-scale studies, which not only efficiently and effectively complement smaller disease-based cohorts of patients but are also extremely flexible when linked to extensive electronic medical records, allowing a wide range of disease areas to be examined. However, whereas research images are typically collected using specific image acquisition protocols under ideal conditions, routinely collected clinical images are much more heterogeneous.
Using clinical images for research and linking them to other routinely collected clinical data is challenging because: 1) Existing software used to query/search for images from PACS (Picture Archiving and Communication System) is designed for clinical care rather than research. Such systems make it easy to find all images for a particular patient, but they are not designed to facilitate searching for all images with particular characteristics (such as body part, slice thickness, scanning protocol, contrast agent or patient medication) or linking to other EHR (Electronic Health Record) datasets (such as outcome data or prescription data).
2) Reuse of clinical images for research requires de-identification, yet identifiable data can be present in many areas of the associated DICOM (Digital Imaging and Communications in Medicine; RRID:SCR_018878)[16] file metadata and/or may be present within the pixel data itself, 'burned on' to the actual image.
3) Anonymisation of images can reduce the ability to perform linkage to other datasets, e.g. demography, prescribing, hospital admissions, etc. 4) Reuse often requires approval from multiple Data Controllers, and the complexity of de-identification increases the risk of rejection of applications for research, given the amount of work the Data Controller may have to do to ensure that no identifiable data is released. 5) The substantial development and testing effort needed to extract and de-identify images at the scale required for deep learning can be uneconomical for any single research project.
Scotland is well placed to address these challenges. It has a single unique patient identifier (the Community Health Index [CHI] number) that is also increasingly seeded in data in other sectors such as social care. A National Health Service (NHS) Scotland service, called the electronic Data Research and Innovation Service (eDRIS) [17], provides a National Safe Haven environment (hosted by the University of Edinburgh) to support research access to anonymised extracts of linked data from different Data Controllers to answer specific approved research questions. The linkable phenotypic data include a range of national datasets, for example prescribing, death data and hospital admissions.
Subject to robust pseudonymisation safeguards and approval by the Public Benefit and Privacy Panel for Health and Social Care, individual patient consent is not required in Scotland. This project assembles a library of imaging data then generates thoroughly redacted subsets for research projects, with multiple safeguards, but those subsets are themselves only released to approved research projects within the controlled environment of the National Safe Haven computers; any extraction of data beyond that is subject to further controls to protect the privacy of patients.
The technical safeguards described in this paper and implemented in this project are not perfect and are not the sole protection: rare medical conditions or identifying features may still be present within the research extracts generated. Contractual and administrative precautions manage these risks: researchers are both contractually prohibited from attempting to re-identify patients or link against unapproved datasets and prevented from exporting the raw data beyond the confines of the Safe Haven, since any such export could enable such an attempt to be made.
Incorporating clinical imaging data into the wealth of available datasets for research

A research copy of the data held within the Scottish National Clinical PACS system has been created to enable the clinical imaging data to be linked with the other routinely collected datasets and be made accessible for research (given appropriate data governance approvals). The research copy of the Clinical PACS system is called the Scottish Medical Imaging (SMI) Database and the data is held in the non-proprietary DICOM format.
The management of imaging data for research presents a substantial set of challenges beyond those encountered in the management of purely text-based records. Some of these are variations on familiar challenges, such as de-identification, whilst others are novel and intrinsic to this type of dataset, such as size and compute requirements for big data processing. This paper describes the architectural solution and software platform developed to support hosting, extracting and linking the SMI data which addresses the challenges identified of using routinely collected imaging data for research listed above.
We first describe the project approach and a very high-level summary of the requirements, then our architectural solution, explaining why this solution met the requirements and how it differs from other open-source solutions for the large-scale hosting of imaging data. We also explain how the architecture enables feedback-driven enhancements from other sources.
We then describe our progress towards implementing the architecture and the use cases we have tested.

Project Approach
There have been 4 phases to the project to date:

Requirements Gathering:
An initial requirements gathering exercise was undertaken at the project inception, eliciting requirements from the research community who will use the data extracts provided by the platform, the National Health Service Data Governance representatives as the Data Controllers of the data, and the National Safe Haven staff who will use the platform to build cohorts and provision relevant data extracts to researchers for analysis. We also investigated other open source and freely available platforms for hosting and/or anonymising imaging data to see if any of these could be used entirely or in part within our solution. We researched both functional and non-functional requirements of the solution.

Development of the Architecture:
We developed a range of option appraisals and designed an architectural solution to meet the requirements.

Development of Prototype:
We developed prototype software to run in a Regional Safe Haven environment managed by the University of Dundee, whilst the SMI data transfer project was taking place in parallel. This prototype supported two consented research projects, predicting dementia from CT and MRI images. We then expanded the prototype to run in the National Safe Haven.

Testing and Case Studies on Sample Data:
The software was then tested on a 180 TB subset of the full dataset comprising ~3 million studies (which were loaded into the Document Store; see below). These were images generated across Scotland during the same 2-week period in February for each year in a 7-year period. Linkage, extraction and anonymisation were performed for a range of case studies, along with performance and functional testing. (Case studies are listed in Appendix D, and included linking against the Scottish Cancer Registry's SMR06 diagnosis data and Radiology Information System records as well as the DICOM metadata.) The full set of historical data was still in the process of being decrypted from the proprietary PACS vendor format and could therefore not be used for complete testing at this stage.

Platform Architecture
A list of high-level platform requirements is provided in Appendix A.

Architecture Overview
The high-level platform architecture is shown in Figure 1. The Discussion section describes our future plans.
The SMI Data Repository is divided into identifiable and de-identified zones. The SMI Analytic Platform is the Safe Haven environment where researchers can access their relevant data extracts.  The extraction process uses the Inventory Tables to locate and anonymise the DICOM files held in the Data Store. These anonymised files are then provided to the researcher in the Safe Haven Environment.
A summary of the expected functionality of the data stores and processes within the architecture is provided in Tables 1 and 2.

Architectural support for feedback and enhancement from other sources

The platform uses a microservices architecture. Individual components (microservices) can be turned on/off as needed and support multi-process execution for linear scaling. The architecture has been designed to support iterative enhancement based on feedback from research outputs generated in the Researcher Safe Haven Environment or directly from external sources (e.g. clinical experts).
Such enhancements could be:
• New datasets, such as clinical mark-up (capturing ground truth data which has been generated by a radiologist marking up a set of images)
• New processes to improve cohort generation or data set preparation (e.g. software which runs over pixel data and returns the size of the airways shown in CT scans), i.e. derived datasets
• Algorithms which could run over source images or textual data (e.g. software which uses natural language processing on imaging metadata to find images which show signs of dementia).
There are several key benefits:
• This is an opportunity to incrementally improve the quality and value of data sets from SMI.
• Research projects could add expertise at a scale which will never be available within a single development team.
• It can improve collaboration and sharing across projects.
• It supports active engagement by the user community and increases support for the service.

A feed from the National PACS to retrieve the data from Oct 2018 onwards is in the process of being commissioned.
The first system test was conducted using ~3 million studies (~10 million series and ~300 million images). This data (all the scans taken during the same 2-week period in February for 7 consecutive years) has been used as test data for software development.
The implementation has enabled extraction of images based on cohorts built from data captured in DICOM tags and linked to data from other sources, as illustrated in the "Use Cases" section. At this stage, operation still involves some manual intervention that we intend to automate as development progresses, and only an initial subset of DICOM tags (recommended by a domain expert) is promoted, but the system is designed to facilitate enhancement and extension in future.

Core platform
The platform has been implemented by building upon the open source Research Data Management Platform (RDMP) [19]. RDMP stores, manages, cleans, de-identifies and processes data to create reproducible, auditable data extracts for research; in the last 5 years it has been used to support over 500 projects, generating over 2,000 data extracts of mainly phenotypic text-based data for epidemiological research projects and clinical trials. RDMP already provides, in a platform-agnostic way, many of the core components required for populating the relational database (such as auditing, logging, deduplication and anonymisation) as well as linkage and extraction; it was therefore efficient to build upon this platform to handle imaging data as well (creating the 'imaging RDMP', or iRDMP).

Choice of Architecture
A microservice architecture using the RabbitMQ message broker [18] simplifies development, testing and refinement of components in isolation, minimising and containing the side-effects of changes. A microservice architecture is one which decomposes a monolithic application into a set of smaller, loosely-coupled services communicating over well-defined interfaces [20]. The advantages of a microservice architecture over a monolithic approach have been demonstrated in the IT industry in recent years (e.g. Amazon [21], Netflix [22]) and recently for health data [23].
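To illustrate this loose coupling, the sketch below shows the general shape of a queue-consuming microservice using the Python RabbitMQ client (pika); the queue name and message fields are invented for illustration and do not reflect the actual SMI services, which are implemented separately.

```python
# Minimal sketch of a RabbitMQ-driven microservice (hypothetical queue and message fields).
import json
import pika

def on_message(channel, method, properties, body):
    """Handle one message describing a DICOM file to process, then acknowledge it."""
    message = json.loads(body)
    print(f"Processing DICOM file: {message['filePath']}")
    # ... e.g. read the file and write its tag metadata to the document store ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="dicom.load", durable=True)  # hypothetical queue name
channel.basic_consume(queue="dicom.load", on_message_callback=on_message)
channel.start_consuming()  # each consumer can be started, stopped or scaled independently
```

Because each service depends only on the message contract, instances can be added or removed to scale a single stage without touching the rest of the pipeline.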

Non-structured database solution for identifiable metadata
The data in the DICOM tags is largely unstructured and deeply hierarchical. This is challenging to represent in a relational store. Moreover, its structure may change over time (e.g. new tags).
Consequently, the use of a document-oriented, flexible and dynamic data store was deemed necessary. MongoDB [24] was selected as the NoSQL database technology because the hierarchy of DICOM tags can be mapped directly to a JSON document, then indexed and queried efficiently. This facilitated the transfer of images across database collections and the use of queries against DICOM tags to select and control the promotion of data to later stages in the process.
Within the architecture the MongoDB database provides a middle ground within the ETL data flow: it allows mappings to the relational database schemas to be quickly modified and tested, and data can be reloaded and re-processed from MongoDB rather than via the slow process of going back to the DICOM files, reducing the petabytes of raw DICOM files to terabytes of queryable data.
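The sketch below (using pymongo, with invented tag values and collection names rather than the actual SMI schema) illustrates how a DICOM instance's tags can be held as a single JSON document and then indexed and queried by tag:

```python
# Illustrative only: DICOM tag metadata as JSON documents in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["dicom"]["image_metadata"]  # hypothetical database/collection names

# One DICOM instance as a document; nested sequences become sub-documents without schema changes.
collection.insert_one({
    "PatientID": "1234567890",
    "StudyDescription": "CT Head",
    "Modality": "CT",
    "SliceThickness": 1.25,
    "ContrastBolusAgent": "Iodine",
})

# Index and query directly on tag values when deciding what to promote to later stages.
collection.create_index([("Modality", 1), ("StudyDescription", 1)])
for doc in collection.find({"Modality": "CT", "SliceThickness": {"$lte": 2.0}}):
    print(doc["PatientID"], doc["StudyDescription"])
```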

Structured database solution for de-identifiable metadata
Many DICOM servers and APIs have a way of representing DICOM in a relational schema, e.g. dcm4chee [25]. We have used our own (dynamic) cut-down schema for several reasons (a simplified sketch of such a schema follows this list):
• To present data analysts with something as simple (without requiring DICOM expertise) as the other linkable datasets hosted on the National Safe Haven.
• To optimise for linkage, i.e. to limit the number of table joins needed and to create efficient query-orientated indexes, e.g. PatientId+ImageType+StudyDescription.
• The ability to adjust this schema and regenerate the data as future development requires.
• The ability to add additional curated fields from external sources or transformed columns, such as results from expert mark-up as ground truth data.
• The ability to store (and therefore expose) a limited set of tags (those we understand will not contain identifiable data).
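A minimal sketch of what such a cut-down, query-oriented schema might look like is shown below; it uses SQLite via Python purely for illustration, and the table, column and index names are hypothetical rather than the actual SMI schema.

```python
# Hypothetical, simplified flavour of the de-identified relational schema and a composite index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ImageTable (
    PatientId              TEXT,     -- EUPI, not the original CHI
    StudyInstanceUID       TEXT,
    SOPInstanceUID         TEXT PRIMARY KEY,
    ImageType              TEXT,
    StudyDescription       TEXT,
    Modality               TEXT,
    PatientAge             INTEGER,  -- cleaned numeric age rather than the raw DICOM age string
    RelativeFileArchiveURI TEXT      -- path used by the extraction services; no pixel data exposed
);

-- A query-oriented composite index of the kind mentioned above.
CREATE INDEX ix_patient_type_desc ON ImageTable (PatientId, ImageType, StudyDescription);
""")

# Cohort-style query with few joins, e.g. all CT images of patients aged 70 or over.
rows = conn.execute(
    "SELECT PatientId, StudyDescription FROM ImageTable WHERE Modality = ? AND PatientAge >= ?",
    ("CT", 70),
).fetchall()
```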

Anonymisation Tools
There are many different software programs which can be used to de-identify imaging data. We tested the feasibility of 3 different widely used programs (DICOM Confidential [26], XNAT [27,28], CTP [29]) in deciding which to adopt as part of the pipeline. A summary of each is provided in Appendix B.
For a meaningful comparison of the tools, a set of criteria was devised, and each de-identification program was examined in turn against these criteria using a rating of 1-5 (where 5 is the best). We grouped the results into 3 different categories: core functionality, user friendliness and support. Table 4 shows a summary of the scores for each category, with the detailed analysis provided in supplementary material A.
In summary, DICOM Confidential was ruled out due to the quality of the documentation and the lack of first-party or community support. We found that some of the images produced by DICOM Confidential were corrupted and chose not to investigate any further as the functionality of the other two tools appeared superior.
There was little difference in the functionality of CTP and XNAT. They are both well-supported tools which could perform the required tasks. The overall score of CTP was higher than that of XNAT. We thought that the XNAT image "bundling" for applying rules to subsets of images would be a useful capability which CTP does not provide. However, the pixel-level anonymisation capability appeared to be much better supported and more straightforward in CTP, and this is very important for this project. For these reasons we chose CTP.

NIFTI as a method of de-identification

NIFTI (Neuroimaging Informatics Technology Initiative) is an alternative to DICOM as a medical image storage file format. Originally created for neuroimaging, NIFTI stores image data as a single 3D image (.nii file), whereas DICOM stores a separate image file for each slice of the scan. In addition, the NIFTI format only stores pixel data and metadata related to the image itself, not any patient or study information as you would find in a DICOM image. This makes NIFTI a possible method to "anonymise" DICOM images. However, not all image modalities and compression methods are supported, and conversion tools require extensions to interpret the private tags that some image scanners write into the DICOM files to describe the pixel data. Therefore, NIFTI was not chosen.
NIFTI has become popular in some machine learning applications and is preferred over DICOM due to the ease of dealing with only 1 file representing the whole 3D scan. The images for each research project can be provided in a range of formats (including NIFTI), but conversion to NIFTI was not adopted as a method for de-identification.

Pixel data and anonymisation
Primary original CT scans were found to contain no "burned-in" text i.e. no text within the pixel data.
For MRI and other images we integrated a text-detection tool into our extraction pipeline, running each image through the Tesseract open-source OCR tool (originally developed by Hewlett-Packard) to detect the presence of any readable text.
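A minimal sketch of this kind of check is shown below, assuming pydicom, Pillow and pytesseract are available; the production pipeline's tooling, thresholds and windowing are more sophisticated, so this is illustrative only.

```python
# Sketch: flag DICOM files whose pixel data appears to contain readable (possibly identifying) text.
import numpy as np
import pydicom
import pytesseract
from PIL import Image

def contains_burned_in_text(dicom_path: str) -> bool:
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    # Rescale to 8-bit greyscale so the OCR engine can read it.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels = pixels / pixels.max() * 255.0
    text = pytesseract.image_to_string(Image.fromarray(pixels.astype(np.uint8)))
    return bool(text.strip())  # any readable text triggers review/redaction rules

if contains_burned_in_text("example.dcm"):  # hypothetical file name
    print("Potential burned-in text detected; route image for review.")
```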
In some cases, particularly head scans, the pixel data itself may be inherently identifiable [30][31][32]. Exploiting this would require special-purpose software, however: the pixel data is never exported beyond the Safe Haven environment, the use or installation of such software within it would not be permitted, and any research project seeking the data for this purpose would be denied access. In a very few cases something could be identifiable by unaided inspection - for example, a distinctive injury or piece of jewellery - and this is an issue which needs further consideration.

Software deployed in the Safe Haven Analytical Environment
We investigated several tools to deploy into the Safe Haven for managing, viewing and manually annotating images by research teams. We chose MicroDICOM (a simple DICOM viewer) [33] as the first example to use. Over time it is expected that the number of tools available as part of the pre-installed VM will increase and that researchers will have the capability to install their own preferred software tools.

Comparisons to other existing systems
Given the different imaging platforms in active development to support research projects, we investigated alternative platforms so that we did not re-invent the wheel. In general terms, other solutions have concentrated on consented cohorts of researcher-collected images rather than much larger unconsented collections of routinely collected "real world" images. The architectural solution developed by others is generally a large anonymised database (sometimes distributed) containing all the images, with permissions to see, extract and run pipelines on the imaging data configured for each research group. The metadata provided is limited and relatively clean in comparison to routinely collected data. The architectural challenges and solutions are therefore very different: for example, a key functionality of our platform is the efficient and effective selection of anonymised cohorts from petabytes of noisy and heterogeneous identifiable data.
If the requirements were to store a de-identified, clean, homogenised copy of all of the pixel and metadata within the de-identifiable zone, we could have employed one of the many excellent open source platforms for managing large volumes of imaging data, such as OMERO [34,35], XNAT [27,28] or ClearCanvas [36]. There are several reasons why we did not choose this approach and therefore did not use such platforms to manage the core data repository:
• We envisage that the methods to de-identify data will change over time as our understanding increases and technological solutions improve. It is impractical to re-create >2 petabytes of de-identified images each time our methodology improves.
• It is unnecessary to undertake the effort to validate any de-identification method on all DICOM tags when only a small fraction of these will be required by research teams, and it is unknown upfront which ones will be required.
• A proportion of the images will never be extracted/released for research projects, as they will not meet any cohort requirements. De-identifying imaging data reactively, only when it is required for a specific project, removes the need to carry out a time-consuming and computationally expensive de-identification process on images which are never required. (Conversely, a given image may be de-identified multiple times, once for each project. This is an issue which we plan to resolve in the next stage of development; we expand on this in the Discussion section.)
• It is risky to test a specific de-identification tool on sample data and trust that it will therefore also be successful for variations of routinely collected data from multiple sources and vendors. The architecture was designed to reduce this risk: by default all data is blacklisted until proven otherwise, at which point the metadata and/or image is "promoted" to a white list.
• The data is currently >2 petabytes and expected to grow at ~400 TB per year. There is a significant cost to maintaining two copies of the data, both in terms of hardware and the maintenance required to update a duplicate as new data arrives (an identifiable version of the data is required in the identifiable zone to meet requirement I; see Appendix A).
• Hosting duplicate versions of the data introduces additional data security and governance risks.
• Different research projects require different de-identification. For example, the granularity of date and patient age data may change depending on the specific questions posed by a research project; the overarching rule is that the data be de-identified as far as possible while meeting the research requirement.
• Following the data protection principle that individuals should see the minimum data necessary to fulfil their job role, there is no need for Research Co-ordinators to see the pixel-level data to build cohorts; therefore, only text-based metadata is provided for cohort building.
Although existing solutions will not fully meet the requirements of this programme, one of our core principles is to reuse as many applicable, open source or freely available tools as possible i.e. do not try to re-invent the wheel. Therefore, where relevant we have included other software within our architecture.

Testing
The SMI microservices, and the RDMP Framework upon which they rely, have been developed entirely using a Test-Driven Development approach. Continuous Integration (CI) unit and system integration tests ensure code stability. Approximately 1,450 automated tests cover the core RDMP code base, and in excess of 300 tests run on the SMI microservices.
Following development of a baseline version, functional and non-functional manual testing was undertaken. The test cases were planned and documented in advance, following a series of interviews with clinicians, academics and technical staff. While these scenarios were planned, documented and agreed, the approach to executing the tests was deliberately as exploratory as possible, rather than restricted by specific test scripting.

Exemplar-driven use cases
A number of use case scenarios were defined with input from Researchers (listed in Appendix D).
These scenarios were further elaborated by a team from eDRIS, in effect performing a dry run as though these were real research projects. The test cases assessed whether:
• Cohorts can be generated using the metadata repository information
• Images can be returned where the cohort has been generated from another dataset
• Cohorts and images can be identified using a combination of the metadata repository and other data sources

Scalability Testing
Performance was benchmarked at the main processing stages of the end-to-end solution: initial load, population of the relational DB, and extraction/anonymisation.

Extraction and Anonymisation
Tests were run on increasing numbers of images, and the processing time was logged as shown in Table 5. These tests were run using CT scans only (other modalities can have significantly higher numbers of images per series and/or larger file sizes). Each DICOM file is around 0.5 MB; on average each CT series has 325 images, giving a total file size of around 170 MB. The hardware on which this runs is summarised in Appendix C.

Table 5: Scalability of anonymisation processing

Discussion
The system we have developed is not a new tool for managing and viewing images like XNAT, OMERO, MicroDICOM and ClearCanvas. iRDMP is a platform and pipeline for extracting images from a directory of images based upon cohort selection criteria, anonymising them and copying them into a secure location for analysis. Theoretically, a tool/system for managing and viewing images from a single data store could have been configured/enhanced with a permissions layer to restrict access to only the images each research group had the right to see. This model was discounted as it did not meet the requirements, for the reasons set out in the "Comparisons to other existing systems" section above.

Applicability/potential of the architecture and platform to be utilised in other environments/use cases

There are many different platforms in active development to support multiple research projects using clinical imaging data. Our architecture has not just been designed to fulfil Scottish data governance principles and data structures; it has much wider applicability. There are many other Safe Havens nationally and internationally [37,38] where such a solution might be applicable, and there is a trend towards the creation of new Safe Havens. Although within our architecture data extracts are viewed within a Safe Haven Analytical Platform (as part of the Scottish Data Governance requirements), the software platform can extract data to any destination. The software could therefore be utilised by other groups/organisations which do not use Safe Haven environments to manage imaging data and build cohorts for extraction.
We have tested our software on 2 different environments with different hardware and VM tools: a regional Safe Haven and the National Safe Haven. It proved flexible enough to work in both environments.
We have created a Docker-based integration repository [39] which supports automated testing (in Travis CI) of the full stack of microservices with test data generated by BadMedicine.Dicom (RRID:SCR_018879) [40]. This ensures that deployment of the tech stack is simple and reproducible.

Potential impact of enabling this resource
The SMI data, linked to other datasets, along with the secure iRDMP platform we have developed, has the potential to reduce costs and widen access to large quantities of routinely collected de-identified images at scale. It also has the potential to reduce the effort of obtaining governance approval, as a Data Controller approved method for de-identification and access has already been agreed. Increasing the availability of large-scale routinely collected imaging datasets linked to other forms of health data, for both industry and academic use, will hopefully lead to a greater likelihood of achieving results translatable into diagnoses and treatments.

Future Plans
Short term: There are several developments, which we aim to implement in the near future, that will enhance the functionality beyond that already provided. Rather than the limited subset of CT and MRI metadata tags currently promoted, we plan to promote many more of these tags. We would like to trial the use of the wider RDMP tool for cohort building and audit within the de-identifiable zone; this will require training on the tool and slight modifications to existing workflows. We would also like to fully automate the processes once the testing of the components has been completed. We are in the process of loading all historical data into the system, after which we would like to carry out performance testing of the solution to identify and investigate bottlenecks.

Medium term:
As well as enabling other modalities (in addition to CT and MRI), we would like to support complex cohort building:
• Structured Reports are summary information, mainly stored in free-text format, which have been populated by a clinician about the study. They can include patient information such as why the scan was requested in the first place, the condition found and family history. A cohort derived from structured reports might seek to extract all the images where a CT scan was performed because a lung tumour was suspected. Structured reports are challenging to query because they can be highly identifiable, are free text and are sparsely populated. As such, Natural Language Processing methods have been widely utilised to extract information from the reports. We plan on utilising and extending many of these methods within the platform to extract relevant metadata from the reports, which can then be utilised for complex cohort building.
• Pixel Data contains information which could be helpful for building a cohort of relevant images, e.g. extract all x-ray images of the knee where the depth of cartilage is less than 2 mm. This information is not captured in the DICOM metadata and instead would be obtained using an image processing algorithm to extract supporting features. We plan on developing automation processes where potentially relevant images are opened and the algorithm applied to the pixel data, returning the cartilage depth. The cartilage depth can then be used to link with other data.

We plan to develop algorithms for text mining and imaging metadata standardisation to provide summary data (data dimensioning) which can then be logically queried for cohort selection. We plan on investigating unsupervised machine learning techniques to group images into commonly used clusters such as body area.

Long term:
Simply copying pixel data for each research project may not scale for imaging data, where storage could quickly become infeasible as the Safe Haven hosts ever greater numbers of studies, each requiring large imaging datasets. An efficient method of sharing the pixel data between multiple studies may be required. However, each study will have different metadata, e.g. study-specific patient identifiers in the image header, so a solution which combines shared pixel data with study-specific non-pixel metadata is needed. We plan to investigate different solutions such as a Virtual File Server (already developed in prototype), requiring each research group to purchase more disk space should their project require it, pulling images in batches/caching, or another technical solution entirely. Different strategies for serving images may be required, such as a file share for machine learning consumption but a DICOM server when using a DICOM image viewer.
We are fortunate to have received significant funding from the MRC and EPSRC to deliver all these future plans within a 5-year programme grant called PICTURES (InterdisciPlInary Collaboration for efficienT and effective Use of clinical images in big data health care RESearch).
We are very interested in collaborating with other groups working on any of these issues.

Limitations of the architecture
We are aware of some limitations of the current architecture:
• Data quality is an issue inherent in the re-use of routinely collected clinical data (as opposed to data collected specifically for research purposes), for example typographical errors marking an MRI as a "Brian" scan rather than "brain" - something overlooked as irrelevant for clinical usage but needing extra attention here.
• The "unconsented" nature of this data mandates control over the research data provided to projects to guard patient privacy, limiting the options for such projects; research is ongoing to mitigate this.
• Automatic detection and redaction of text is essential at this scale, but still needs manual intervention and tuning to keep redaction at a low enough level to deliver useful data. To date, 869 "special case" rules have been added to the IsIdentifiable tool's dataset: for example, the "Princess Royal" hospital being identified as a name rather than an organisation.
• The huge number of images can make cohort creation cumbersome at the image level. To address this, we are adding support to the infrastructure and relational database to enable research co-ordinators to mostly operate at the study level.
• By not anonymising each image once on initial receipt we introduce additional complexity and increased storage requirements; in addition, we have to repeat the anonymisation task each time an image is used. We trade this off against the ability to apply better anonymisation later. If the repetition proves an issue in real use, it can be mitigated by caching previously processed pixel data.
• Duplicating the image data for each research project will limit future scaling to multiple projects; some potential ways of addressing this are discussed in the previous section.
• The relational database structure is not ideal for some more complex parts of DICOM. We believe that for cohort generation our flattened relational structure is simple and functional, but we may discover cases in the future where it becomes cumbersome for some parts of DICOM. If so, hierarchical data can be incorporated within MySQL via JSON columns.

Summary
We have designed an architecture which meets the requirements of data governance and security, and initial indications suggest that it will manage, and provide extracts of, routinely collected imaging data linked to other relevant datasets for research from the >2 petabytes of SMI data. We have tested the extraction system on 5 use cases based on real exemplar scenarios.
The introduction section of this paper identified 5 challenges to using routinely collected clinical images for research.
The limitations of existing patient-centric image handling, challenge 1, are addressed by extracting and indexing other attributes identified by researchers: instead of retrieving a specific patient's imagery, we can search for images by a combination of parameters such as body part, patient age, or cross-referencing with other datasets, for example "all head/brain MRIs of patients diagnosed with a glioblastoma".
To address challenge 2, the pervasive presence of potentially identifying information within various data fields and the image pixels themselves, the platform uses a Data Controller approved method of de-identification. The original identifiable data is maintained to enable linkage to other datasets (addressing challenge 3, the barrier anonymisation normally presents to such linking) but is not released for cohort building or for research. The use of encrypted numerical patient identifiers (using the same encryption for both image metadata and other eDRIS datasets) also facilitates some linkage without exposing the original identifier: an MRI scan can be matched with a cancer diagnosis or patient admission record without exposing that patient's ID.
Developing the handling, indexing and anonymisation system in a robust, reusable way and incorporating multi-layered safeguards allays Data Controller concerns about the quality of anonymisation (challenge 4), while amortising the substantial development and testing costs across multiple research projects makes it more cost-effective to extract large sets of images for deep learning purposes which might otherwise be uneconomical (challenge 5).
No non-trivial dataset is likely to be perfect, particularly when data gathered for one purpose is being re-purposed, so each research project will need to apply appropriate quality control checks; some will also bring new value, for example an expert analysis of the images for a specific purpose. Each time an issue is identified by one project or new information is added, this has the potential to improve the resources available to future projects. In the absence of a pre-created perfect reference library, any project will have to choose between using a resource such as this, with the need for quality checks, and a much smaller but more carefully curated research-specific set if one is available or could be created.
With CT scans in particular, the radiation dose means that acquiring new scans purely for research is rarely justifiable, making the re-use of existing clinical imagery much more feasible.
If you would like to access the SMI dataset for a research project, please contact eDRIS [17] in the first instance.

Availability of Supporting Data
Other data, including links to additional information, further supporting this work can be found in …

DICOM files are stored unaltered in a file archive. There are many reasons why we wish to keep the identifiable data and store the original DICOM images:
• Any program developed to strip all identifiable data from the DICOM files and tags risks rendering the whole dataset unusable if done incorrectly. Linkage to other datasets would subsequently be either incorrect or impossible.
• It is conceivable that future data de-identification strategies will wish to make use of some identifiable data and removing that data would therefore limit future options.
• The NHS may wish to use the data as a secondary offline Disaster Recovery system or use the data to populate a clinical system from an alternative provider, in which case it needs to be technically feasible to regenerate the data in identifiable form, in a format that is non-proprietary and as close as possible to the DICOM files as they were originally captured.

Identifiable DICOM Tag Data
All tag metadata from the DICOM files are extracted to a MongoDB database in a searchable format. This tag metadata is stored in an identifiable format because De-identification Analysts need to know what the identifiable data is so that they can remove it, e.g.:
• If the patient's name is Mrs Jones then, when searching for identifiable data in the clinical report, the De-identification Analyst will need to know to look for the text "Jones" in order to remove it.
• Or if checking if an image is identifiable, they might need to know the CHI number in order to check this is not burnt into the pixel data.
Inventory Tables
A subset of the data from the identifiable set above is copied here. This is a relational database which contains suitably cleansed and de-identified image metadata (and file paths), i.e. metadata which has been confirmed to be well populated, of high quality and free of identifiable data. It is used by the Research Co-ordinators for cohort creation and extraction to the Safe Haven.
The data is indexed using EUPIs.
For example, DICOM age strings can express the age in years, months or days (e.g. 075Y, 006M or 002D). The cleaned and homogenised metadata will store these in a consistent and easily queried numeric format.
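For illustration, a small helper of the kind implied here (not the actual SMI cleaning code) might normalise DICOM Age Strings to a numeric age in years:

```python
# Sketch: normalise DICOM Age Strings (e.g. "075Y", "006M", "002D") to a numeric age in years.
def age_string_to_years(age_string: str) -> float:
    value, unit = int(age_string[:3]), age_string[3].upper()
    divisor = {"Y": 1.0, "M": 12.0, "W": 52.0, "D": 365.0}[unit]  # DICOM allows Y, M, W and D
    return value / divisor

print(age_string_to_years("075Y"))  # 75.0
print(age_string_to_years("006M"))  # 0.5
```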
Other metadata fields may be a single value summarising data stored in multiple different DICOM tags. For example, by analysing the acquisition position of the images it is possible to identify examinations where the same volume has been acquired repeatedly in a single series; when used in conjunction with tags showing whether contrast was used during the examination, this can be used to disambiguate contrast bolus imaging from other acquisitions that may also use contrast.

Cohort and associated Anonymous Research Extracts
Any research project will start by defining a relevant cohort and obtaining the necessary ethical/administrative approval in consultation between the Researchers and the Research Co-ordinators; this is an out-of-band process outwith the iRDMP system, so it is not shown here.
The Data Analysts then assemble a data set (the Anonymous Research Extract) for that research project by querying the Inventory Table, possibly linked against other data sources via the EUPI (pseudonymised patient ID, explained below), and trigger the Extraction Microservices to export the appropriate subset of columns to the research users. For example, a project might request all available brain MRI scans from patients who have been prescribed Gabapentin, along with the dosage information and patient age; they would be given a set of image data (the scans themselves, passed through the DICOM file anonymiser described later) and a table of associated metadata including the dosage information for each de-identified patient.
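As a sketch of that linkage step, the Gabapentin example might translate into a query along the following lines; the table and column names are hypothetical and do not reflect the actual Safe Haven schema.

```python
# Hypothetical illustration of linking the imaging Inventory Table to a prescribing
# dataset via the shared EUPI when assembling a research extract.
import sqlite3

conn = sqlite3.connect("inventory.db")  # hypothetical database file
cohort = conn.execute("""
    SELECT img.PatientId, img.StudyInstanceUID, img.RelativeFileArchiveURI,
           rx.ApprovedName, rx.Dose, img.PatientAge
    FROM ImageTable AS img
    JOIN Prescribing AS rx ON rx.PatientId = img.PatientId   -- both keyed on the EUPI
    WHERE img.Modality = 'MR'
      AND img.StudyDescription LIKE '%BRAIN%'
      AND rx.ApprovedName = 'GABAPENTIN'
""").fetchall()
# The matching file paths are then passed to the extraction microservices, which
# anonymise the DICOM files before release into the Safe Haven.
```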

Research Data Management Platform
The RDMP manages and monitors the extraction processes. It is not feasible or desirable to proactively analyse the complete Identifiable DICOM Tag Data in order to promote all tags. This is in part due to the difficulty of determining that a tag of a certain type does not contain identifiable information for a) the whole of the current archive and b) future PACS images that will be taken.

A tag can be promoted in two circumstances: 1) it is determined not to contain identifiable information, or 2) the identifiable information it does contain can be de-identified.
Sophisticated techniques such as Natural Language Processing methodologies can be used to establish condition 1 or to find a de-identification solution for condition 2. The solution for condition 2 is known as an anonymisation profile and can be saved for reuse. Once a tag is flagged as safe for promotion it is moved to the Inventory Table. This is an iterative process (future studies with unique requirements will inform which data are prioritised for anonymisation/promotion).

Promotion of image types which are extractable
This process white-lists images which are extractable in the sense that the pixel data can be completely de-identified. Some images, particularly ultrasounds, may have identifiable information such as the patient name or CHI watermarked onto the image. Which images can be de-identified is stored in the metadata catalogue, but the rules governing how images are de-identified are stored in CTP [29] anonymisation scripts that apply to all images. These scripts contain rules such as "if Modality is US and Manufacturer is X and model is Y then blank out the pixels in the rectangle (0,0,1000,200)".
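The effect of such a rule can be sketched as follows, written in Python with pydicom rather than CTP's own script syntax; the manufacturer, model and rectangle are invented, and real images (e.g. compressed or multi-frame) need more careful handling.

```python
# Sketch of applying a pixel-redaction rule of the kind described above (not CTP syntax).
import pydicom

def redact_if_matching(dicom_path: str, out_path: str) -> None:
    ds = pydicom.dcmread(dicom_path)
    if (ds.get("Modality") == "US"
            and ds.get("Manufacturer") == "ExampleVendor"       # hypothetical manufacturer
            and ds.get("ManufacturerModelName") == "ModelY"):   # hypothetical model
        pixels = ds.pixel_array
        pixels[0:200, 0:1000] = 0        # blank the rectangle known to carry the watermark
        ds.PixelData = pixels.tobytes()  # write the redacted pixels back
        ds.save_as(out_path)
```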

Mapping (CHI-EUPI)
This process is called when metadata is promoted to the de-identifiable zone to replace identifiable CHIs with EUPIs. It is an automated process, so no individual can see this mapping.

Cohort Creation Process
A set of software tools (or manual SQL queries if the user prefers) which query the DICOM metadata within the Inventory Tables in order to select relevant cohorts.

Researchers
Can analyse the associated data from study-specific image metadata and pixel data but are not allowed to extract row-level or pixel data. Access to the internet is restricted when analysing the data.

De-identification Analysts
Are responsible for ensuring that as much data as possible is made available to Research Co-ordinators for the creation of cohorts but that no identifiable data reaches the co-ordinators. Much of the de-identification task is automated, but the system needs to be continually monitored and new DICOM tags added to the whitelist (or blacklist) as required.

System Administrators
Are part of the infrastructure team and are responsible for building and maintaining the underpinning infrastructure, security, network separation, monitoring and support of automated processes. Supporting automated processes involves checking, for example, whether there were errors in the data load or data extraction processes; System Administrators have the privileges and expertise to debug and/or restart these processes.

Software Developers
Produce any new software required within any zone of the environment. The software is developed and tested outside of the production environment. Deployment of software updates will be carried out by System Administrators.

Reviewer 1

Comment: Some acronyms lack proper introduction (e.g. EHR).
Response: EHR definition added (p2); all others checked.

Comment: The Authors should also take into consideration another issue: imaging exams may contain identifiable data even in terms of physical characteristics of the patient. For example, a head scan allows easy reconstruction of the subject's face. Making this data public may be problematic if explicit consent is not given by the patient.
Response: Precautions against such abuse now detailed on p10. The legal proscription and technical safeguards against any attempt at re-identification of patients are more clearly specified, along with clarification that the image data is never made public, only disclosed to approved researchers within a controlled research environment with additional safeguards before further disclosure.

Comment: In general, there seems to be no mention of Ethics approval for collecting the data or requesting patient permission to collect data for research purposes. I am not sure of the legal framework in Scotland, but in other European countries informed consent must be given separately for medical procedures and research purposes. Furthermore, consent for research is tied to the aim of the research and should be reissued (or waived by the competent Institutional Ethics Board) on a case-by-case basis. In a table, it is mentioned that researchers who wish to access the data should absolve administrative and ethics requirements. However, there should be an approval (and consent) for collecting the data in the first place. This is a significant issue and should be appropriately addressed in the text.
Response: Approval issue and additional disclosure controls now explained (p3, p10, p17-18).

Comment: No mention is made of quality control of the input data. Public datasets have been shown to have issues, both image datasets in general and medical imaging specifically (e.g. doi: 10.1016/j.acra.2019.10.006). Considering these problems arise in curated datasets, how is the risk of bias due to errors (which unfortunately are not rare in clinical practice) addressed? This situation could severely limit the usefulness of the data. Prospectively collected data in the setting of a clinical study has the advantage of greater control of its quality and reduction of bias sources (and is still not perfect).
Response: Issue acknowledged, and the trade-off of size and diversity of data against smaller, higher-quality research-specific collections added (p18). Some mention of quality control issues added, also noting that in some cases concerns such as radiation dosage (CT) make re-use of scans already performed for clinical reasons the only option.

Reviewer 2
This paper is a useful, readable report upon progress made in the development, design, and delivery of a key component of health informatics infrastructure. The infrastructure in question is intended to facilitate and manage access to routinely-collected image data for research purposes.
The paper begins with a brief explanation of the challenges to be addressed and the approach taken to development. This is followed by a description of the solution architecture, as an extension of an existing research data management platform (RDMP), already reported in this journal, to include support for image data.
The subsequent 'analysis' section comprises:
- a very brief description of the current status (data is being converted into DICOM format, some testing has been performed, and an initial set of tags are supported)
- some justification of technology choices (built upon an existing platform, using a service-based architecture, using a document database and a bespoke relational schema, incorporating an OCR tool)
- some comparison with other platforms (in contrast: metadata-driven, images held in identified form, de-identified versions generated only as required)
- a brief description of the testing strategy (standard software engineering approach, some indications regarding performance and scalability)
- some discussion (an additional remark on the approach to image data management, some assertions regarding re-usability of the software, a general claim regarding potential value)
- short, medium, and long term intentions (plans for more tags, more automation, algorithms for standardisation and classification, shared access to image data)
- some remarks upon limitations of the current architecture (cohort creation still cumbersome, storing images in identified form means more work, relational model may not be ideal)

The final section sets out some brief conclusions: an assertion that the design meets requirements, a report that initial indications suggest that it will work as intended, and a summary of how different aspects of the design are related to the original list of challenges.
The material in the paper is original, and the subject matter is of interest and value. However, some additional work is needed before it is fully ready for publication.
(1) The abstract states that the platform has been tested on five different test cases; it would be good to understand what these test cases are (the brief mention on Page 6 does not convey any real understanding of what was demonstrated).
Response: P5 - some information added, plus explicit reference to Appendix D with full list.

(2) The way in which the challenges listed at the beginning of the paper are revisited later, to explain how they have been addressed, is very welcome; however, it would be good to have a more detailed explanation in each case.
Response: Summary (p17-18) expanded, with each challenge reiterated and addressed specifically; mention of challenge 1 added.

(3) The analysis section should be made *more concise and more rigorous*, focussing upon the distinctive aspects of the architecture (rich, extensible metadata, project-specific de-identification).
Response: Architectural discussion (p7-8) streamlined, with less discussion of off-the-shelf elements.

(4) The existing comments on microservices, message queuing, or even the specific choice of relational database are not particularly informative or useful, and neither are those on test-driven development. It is good to know that engineering principles were applied, but there is not enough information here to inform any evaluation or subsequent application elsewhere; the material could be made more concise, or perhaps simply omitted.
Response: Architectural discussion (p7-8) streamlined.

(5) In contrast, it would be good to have some detailed information regarding performance and scalability, and also regarding the two different environments with different hardware and VM tools mentioned on Page 16; again, there is not enough detail here to add substance to the claim that the software 'proved flexible'.
Response: Appendix C rewritten with more platform and performance information.

(6) The account of the limitations of the architecture is not particularly useful as it stands, and it would be better to address each of these points more clearly as part of the account of the choices made.
Response: Limitations (p16) expanded and refined with mention of mitigation steps and plans.

(7) The 'discussion' and 'future plans' subsections within 'analysis' could be more concise, and might work better as part of the subsequent 'conclusion' section.
Response: Restructured as suggested (p16-18).

Minor points (the page numbers here are those of the reviewer PDF, in which the main document starts with the abstract on Page 3):
- Page 4, Line 10: "with a condition" might be better as "with a specific condition".
- Page 6, Line 6: "elucidating" might be better as "eliciting".
Response: Minor points both resolved. Also corrected "concurrent" to "consecutive".
We hope that you are happy with these modifications.