Strategies and techniques for quality control and semantic enrichment with multimodal data: a case study in colorectal cancer with eHDPrep

Abstract Background Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses. Findings We developed an R package for electronic health data preparation, “eHDPrep,” demonstrated upon a multimodal colorectal cancer dataset (661 patients, 155 variables; Colo-661); a further demonstrator is taken from The Cancer Genome Atlas (459 patients, 94 variables; TCGA-COAD). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative “meta-variables” according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free text, completeness analysis, and user review of modifications to the dataset. Conclusions eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to multimodal colorectal cancer datasets resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN (https://cran.r-project.org/package=eHDPrep) and GitHub (https://github.com/overton-group/eHDPrep).


BACKGROUND
Health data can be challenging to work with, arising from incompleteness, fragmentation, inaccuracies and the presence of unstructured information [1]. Data quality is an essential parameter for productive analysis, widely recognised in the adage 'garbage in, garbage out' [2]. Thus, quality control (QC) procedures, including quality assessment, lay foundations for drawing robust conclusions from health data. The fundamental dimensions of data quality are consistency, accuracy, completeness, record uniqueness, timeliness and validity (syntactic conformity) [3,4]. Applicability is a further important consideration for data quality; encoding data in a numeric and machine-interpretable format is vital for accurate interpretation in advanced analysis workflows [5]. Ontologies provide structured representations of a knowledge domain and can support QC when dataset variables are mapped to ontological entities. For instance, multiple variables may map to the same or semantically similar concepts, suggesting opportunities for merging operations or internal consistency checks [6]. Ontologies also provide computable information on the semantic relationships between terms, which can add value to downstream analysis [7]. The semantic information held in ontologies can be leveraged to generate new variables through aggregation of existing variables during post-QC data preparation, in a process we describe as semantic enrichment.
Several tools are available for health data QC; however, these are typically aimed at single modalities. For example, 'dataquieR' (completeness, consistency, accuracy, validity) and 'mosaicQA' (completeness, validity) focus upon observational health and epidemiological research data [8,9]. Packages such as 'summarytools' offer more generalised functionality to facilitate data exploration through summary descriptive reports (completeness, accuracy) [10]. Other packages support targeted encoding, such as 'genetics', which targets genetic data (i.e., genotypes and haplotypes) [11], while 'quanteda' provides extensive tools for natural language processing [12]. The 'tidyverse' collection builds upon base R's functionality to improve the capability, efficiency, and programmability of data scientists' QC workflows [13,14]. Several R packages calculate semantic similarities [15-17]; however, we are not aware of any that provide the ability to aggregate variables using semantic commonalities in preparation for analysis.
QC may require up to approximately 80% of a data mining project's time [18]. While data quality and encoding issues in multimodal data can currently be tackled by combining multiple existing approaches, each requires time-consuming familiarisation and may require multiple data transformations, potentially adversely impacting data quality [4]. We present a toolkit for electronic Health Data Preparation (eHDPrep), enabling robust programmatic QC and enrichment of semantic content; high-level functions empower general R users to assess, process, and review their dataset with minimal coding, while low-level functions allow advanced R users to specify parameters and workflows as required. We demonstrate the utility of eHDPrep on a multimodal dataset containing 155 variables for 661 colon cancer patients (Colo-661) [19,20]. Colorectal cancer (CRC) has a large disease burden as the third most common malignancy, with an estimated 1.9 million new cases and 915,800 deaths worldwide in 2020 [21]; advances in CRC medicine are urgently needed [22].

FINDINGS

QUALITY CONTROL
Data reliability encompasses completeness, consistency, accuracy, uniqueness, and validity [3,4]; eHDPrep addresses issues in these dimensions through both specific low-level functions and in the high-level functions 'assess_quality', 'apply_quality_ctrl', and 'review_quality_ctrl'. The QC workflow in eHDPrep provides user-friendly methods to evaluate and address data quality issues (Figure 1). We present the application of this workflow to Colo-661 in the sections below in order to enhance data reliability, to enable machine interpretability, and to assess the effects of QC operations upon the dataset.

Semantic Characterisation
Following data import, semantic characterisation is required in order to determine variable classifications (Supplementary Table S1), together with information provided by the user, for example regarding data modalities. The semantic characterisation process includes user review of the automated variable type assignments. Correct semantic information is essential for successful application of downstream steps in eHDPrep.

Natural Language Processing
Analysis workflows typically require data in a standardised format; however, significant health data are contained within free-text clinical notes [24]. eHDPrep includes user-friendly extraction of information from free-text by wrapping Natural Language Processing functionality from quanteda [12] and tm [25] to create variables describing frequently occurring words or groups of nearby words (eHDPrep function 'extract_freetext'). Three free-text variables in Colo-661, containing digitised medical notes, were transformed into eleven new structured variables. Of these, six variables were generated from family members' cancer history (recorded in 21% of patients) following manual correction of observed misspellings, expansion of abbreviations, and standardising cancer names to '[cancer location] cancer' (e.g. "melanoma" to "skin cancer"). Four of the new structured variables identified the occurrence of cancer in close family members (mother, brother, sister, father); two further variables recorded if a family member had lung or breast cancer. Manual review determined that the data extraction for these variables had 89.9% sensitivity and 99.9% specificity across the generated values. False positives and negatives were manually corrected in Colo-661.
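Although eHDPrep implements this extraction in R via 'extract_freetext', the underlying idea of deriving structured variables from frequently occurring terms can be sketched language-agnostically. The Python sketch below is illustrative only: the stopword subset and example notes are hypothetical, and real workflows would use quanteda/tm's fuller stopword lists and tokenisers.

```python
from collections import Counter

# Illustrative stopword subset; tm's 'stopwords' provides a fuller curated list
STOPWORDS = {"a", "in", "if", "of", "the", "and"}

def frequent_terms(notes, min_freq=2):
    """Count non-stopword terms across free-text notes; terms occurring at
    least min_freq times become candidate structured variables."""
    counts = Counter(
        word
        for note in notes
        for word in note.lower().split()
        if word not in STOPWORDS
    )
    return {term: c for term, c in counts.items() if c >= min_freq}

# Hypothetical digitised notes on family cancer history
notes = ["mother breast cancer", "family history of breast cancer", "no history"]
print(frequent_terms(notes))  # {'breast': 2, 'cancer': 2, 'history': 2}
```

In practice each frequent term (or skipgram of nearby words) would then become a presence/absence variable per patient record.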

Encoding Missing Values
Proper representation of missing values is critically important for the correct execution of downstream functions, for example if missing values are to be excluded from calculations. Missing values may be encoded in a variety of ways, including strings (e.g. 'missing', 'unknown') or out-of-range values (e.g. '-1') [26]. Indeed, missing values in Colo-661 were recorded in eight encodings, representing 4.3% of dataset values, which were converted to 'NA' values using eHDPrep.
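The conversion step can be illustrated with a minimal sketch (shown in Python for brevity; eHDPrep performs this in R, and the specific missing-value codes below are hypothetical examples rather than the eight encodings found in Colo-661):

```python
# Hypothetical strings and out-of-range codes representing missingness
MISSING_CODES = {"missing", "unknown", "n/a", "-1", ""}

def standardise_missing(values):
    """Replace recognised missing-value encodings with None (analogous to R's NA)."""
    return [None if str(v).strip().lower() in MISSING_CODES else v
            for v in values]

raw = ["T3", "unknown", "-1", "T2", "missing"]
print(standardise_missing(raw))  # ['T3', None, None, 'T2', None]
```

Standardising all encodings to a single sentinel ensures downstream completeness and entropy calculations treat missingness consistently.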

Completeness
The degree to which a dataset is populated with data, rather than missing values, is a vital early measurement in quality assessment. eHDPrep measures both variable and patient record completeness at a whole-dataset scale, visualised across Colo-661 in Supplementary Figure S1. Patterns of completeness may also be explored with eHDPrep through a binary heatmap; the clusters of missing data in Colo-661 showed good correspondence with different data types, demonstrating non-random missingness (Figure 4a). Variables with zero entropy [27] (Equation 1) have the same value across all records and, for example, cannot be used to stratify the cohort. Zero entropy variables therefore have limited utility, even if fully complete, and are flagged by eHDPrep. Four Colo-661 variables were removed due to zero entropy. These quality assessment procedures are achieved using the functions 'assess_completeness' and 'assess_quality'.
$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$ (Equation 1), where $p(x)$ is the probability of each element $x$ occurring in the input vector $X$.
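The zero-entropy check can be reproduced with a short sketch of Shannon entropy (Python for illustration; the example values are hypothetical):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits (Equation 1) of a discrete variable."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

constant = ["wild-type"] * 10          # identical across all records
varied = ["wild-type", "mutant"] * 2   # can stratify the cohort

assert entropy(constant) == 0.0  # zero entropy: flagged for removal
assert entropy(varied) == 1.0    # one bit of information
```

A fully complete but constant variable thus contributes nothing to stratification, which is why eHDPrep flags it despite its 100% completeness.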

Internal Consistency
In order to enable evaluation of internal inconsistencies, eHDPrep assesses user-supplied semantic dependencies between variable pairs. In such dependencies, a value in one variable limits the logically valid values in the other. We designed forty-nine internal consistency checks for Colo-661 across fifteen variables (with some variables present in multiple pairs; Supplementary Table S2). As expected in real-world data, we found forty instances of internal inconsistency across five variable pairs, demonstrating the value of this automated approach.
There was a conflict between the related variables 'N stage' and 'number of positive lymph nodes' (Figure 2a). One record had a value of 'N2' for the 'N stage' variable; however, the 'number of positive lymph nodes' value was lower than required for assignment of N2 status according to the staging criteria [28]. Similarly, we identified three records where the 'number of lymph nodes examined' was fewer than the 'number of positive lymph nodes' (Figure 2b). Thirty records contained inconsistencies due to a category mislabelling in relation to tumour budding [29] ('high, >10' instead of 'high, >=10'), which was identified when comparing a discretized variable with its corresponding non-discretized variable. Four records stated that patients did not have a personal history of cancer while stating that the patient had non-melanoma skin cancer. Two records stated that the patient had a hereditary form of CRC while stating that the patient had no or an unknown family history of CRC.
Flagging the above inconsistencies focussed further data curation on resolving these conflicts. In the above instances, we removed inconsistent values from one variable in the pair, selected by assessing the reliability of the data source. A more conservative strategy might be required if expert curation is not possible; for example, eliminating all conflicting values or potentially removing the inconsistent variables entirely.
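This style of pairwise dependency check can be sketched as follows (Python for illustration; the N2 threshold of at least four positive lymph nodes follows the staging criteria cited above [28], and the record field names are hypothetical):

```python
def check_record(record):
    """Flag internal inconsistencies between related lymph node variables."""
    issues = []
    # N2 status requires at least four positive lymph nodes
    if record["n_stage"] == "N2" and record["positive_nodes"] < 4:
        issues.append("N2 recorded with fewer than four positive lymph nodes")
    # Positive nodes cannot exceed the number of nodes examined
    if record["positive_nodes"] > record["nodes_examined"]:
        issues.append("more positive nodes than nodes examined")
    return issues

consistent = {"n_stage": "N1", "positive_nodes": 2, "nodes_examined": 12}
conflicting = {"n_stage": "N2", "positive_nodes": 1, "nodes_examined": 0}

assert check_record(consistent) == []
assert len(check_record(conflicting)) == 2
```

Each user-supplied dependency becomes one such rule; running all rules over every record surfaces the inconsistencies for curation rather than silently propagating them into analysis.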

Variable Merging
Merging variables can improve analysis by reducing redundancy and improving storage efficiency. However, inappropriate merging may lead to information loss. Accordingly, we developed functionality in eHDPrep for quantitative evaluation of merging operations using an information theoretic approach. Information Content (IC; Equation 2) is determined from category probabilities for discrete variables, or with variable bandwidth kernel density estimation for continuous variables [30]. The Mutual Information Content (MIC; Equation 3) of each input variable with the merged variable is also calculated [30,31]. Potential information loss during variable merging can be assessed by comparing the MIC of an input variable and the merged variable against the IC of the input variable. If the MIC and IC are identical, the input variable's information is retained within the merged variable.
$IC(X) = -\sum_{x \in X} p(x) \log_2 p(x)$ (Equation 2), where $p(x)$ is the probability of each element $x$ occurring in the input vector $X$.
where $x$ and $y$ are numeric vectors, $I$ is the mutual information, and $n$ is the number of complete cases in both $x$ and $y$.
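For discrete variables, the lossless-merge criterion (MIC equal to IC) can be demonstrated with a small sketch (Python for illustration; the category values are hypothetical, and continuous variables would instead require the kernel density estimation described above):

```python
import math
from collections import Counter

def info_content(x):
    """Shannon entropy in bits: information content of a discrete variable."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())

def mutual_info(x, y):
    """Mutual information I(X;Y) in bits between two discrete variables."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

inp = ["0", "1", "2", "3", "1", "2"]
lossy = ["0", "1-2", "1-2", "3", "1-2", "1-2"]  # '1' and '2' aggregated
lossless = ["0", "1", "1.5", "3", "1", "1.5"]   # distinct categories retained

assert mutual_info(inp, lossy) < info_content(inp)                 # loss
assert abs(mutual_info(inp, lossless) - info_content(inp)) < 1e-9  # no loss
```

The lossy merge collapses two categories into one, so the merged variable can no longer distinguish them; the lossless recoding is a one-to-one relabelling, so the input's full information content is preserved, mirroring the comparison in Figure 3.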
As an example of support for variable merging in eHDPrep, Figure 3 visualises two candidate merging operations applied to Colo-661 variables describing a scoring of Crohn's-like lymphoid reaction in the tumour [32] based on Graham-Appelman criteria [33]. One possible merging strategy aggregates values '1' and '2' from Input 1 to '1-2' in the merged variable, leading to information loss (Figure 3a). A superior strategy takes the values '1-2' from Input 2 as an ordinal category value between '1' and '2' and does not produce any information loss (Figure 3b).

Variables were encoded in a machine-interpretable format (Supplementary Table S1) [5]. Following the above encoding steps, human-interpretable labels in ordinal variables were transferred to a mapping reference table; an assertion at the end of QC confirmed that all variables were numeric, in contrast to 16.8% before processing with eHDPrep.

Quality Review
Understanding the effect of QC operations applied across large health datasets is non-trivial; eHDPrep simplifies this process and concisely records data changes resulting from QC. Firstly, eHDPrep can produce a comparative tally of unique combinations of values in variables before and after a change was implemented.
These tallies can be produced after each QC action, showing the incremental changes to the dataset.
Secondly, eHDPrep facilitates final review of changes to variable count, which measured 207 variables in Colo-661 following QC, or of value-level QC modifications, which are optionally summarised in a bar plot. This plot (Figure 5) highlights differences in the proportion of values modified across Colo-661 that may inform upon the underlying structure of the dataset. Thirdly, eHDPrep's comparative completeness function visualises the distribution of variable or row completeness before and after QC. Figure 4b demonstrates the positive impact of eHDPrep QC on Colo-661's variable completeness, resulting in 62% more variables with >95% completeness. Overall, mean variable completeness was 9.45% higher in Colo-661 following QC when compared with the original dataset (following missing value encoding, described above).

SEMANTIC ENRICHMENT
Ontologies contain valuable curated information about the relationships between domain concepts. Indeed, ontological relationships have been widely employed to support the interpretation of results from high-throughput technologies [34]. Rectangular health datasets (i.e., data frames or matrices) are semantically disorganised. However, ontologies can be utilised to capture semantic relationships between variables during data preparation, in a process termed here 'semantic enrichment'. The structure of the ontology provides for aggregation of values from constituent variables in order to generate 'meta-variables'. The process, as applied using the 'semantic_enrichment' function in eHDPrep, is explained below and summarised in Figure 6, and a worked example is provided (Supplementary Figure S8).

Discovery of the most informative common ancestor terms and semantic aggregation
The IC of each node in the supplied ontology is initially computed to quantify the specificity of nodes by depth and relative number of descendants [35]. Nodes representing variables, manually mapped to the ontology, are added to form an ontology:variable network. Sets of variables which share semantic commonalities through common ontological ancestors are identified. Sets of variables may have multiple common ancestors; therefore, the IC of all common ancestors of a set are compared to identify the Most Informative Common Ancestor (MICA), which labels the set. Min-max normalisation (Equation 4) is applied to each variable prior to semantic aggregation, whereby meta-variables are produced by taking the row-wise sum, minimum, maximum, average, and product of the set for each MICA. Only meta-variables with non-zero entropy (Equation 1) are appended to the dataset.
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ (Equation 4), where $x$ is a numeric vector.
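The normalisation and aggregation steps can be sketched as follows (Python for illustration; the variable values and their grouping under a shared MICA are hypothetical):

```python
def min_max(xs):
    """Min-max normalisation to [0, 1] (Equation 4)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Two hypothetical variables sharing a common ontological ancestor (their MICA)
var_a = [2.0, 4.0, 6.0]
var_b = [10.0, 30.0, 20.0]
norm_a, norm_b = min_max(var_a), min_max(var_b)

# Row-wise aggregations form candidate meta-variables labelled by the MICA
meta_sum = [a + b for a, b in zip(norm_a, norm_b)]
meta_max = [max(a, b) for a, b in zip(norm_a, norm_b)]

assert meta_sum == [0.0, 1.5, 1.5]
assert meta_max == [0.0, 1.0, 1.0]
```

Normalising first places variables of differing magnitudes on a common scale, so no single constituent dominates the aggregated meta-variable; any aggregation with zero entropy would then be discarded before appending to the dataset.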

Ontology Preparation
Variables in Colo-661 were mapped to two medical ontologies (Supplementary Table S1): the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) ontology, which standardises clinical terms for generation of electronic health records, covering >350,000 concepts at present [23]; and the Gene Ontology (GO), a widely used knowledgebase for gene function that currently contains >44,000 terms [36,37]. The Colo-661 variables were mapped to SNOMED CT by manual review assisted by the UK National Health Service Digital SNOMED CT Browser [38]. The UK edition of the SNOMED CT Clinical Edition ontology, version 31.1.0, was downloaded from the UK National Health Service's technology reference data update distribution resource [39]. SNOMED CT was converted from Release Format 2 (RF2) to W3C Web Ontology Language (OWL) format using version 2.9.0 of the official SNOMED CT OWL toolkit [40]. We used ROBOT [41] to convert SNOMED from OWL to comma separated values containing each node's superclasses, enabling generation of an edge table (Supplementary Figure S2) that defined the network graph to which the mapped Colo-661 variables were joined. For GO mapping, variables with gene assignments within the Colo-661 resource were verified and mapped to GO terms using Ensembl release 105 [42,43] via the biomaRt package [44,45]. The ontologyIndex package [15] was used to import the January 2022 GO release into R. The Biological Process (BP) domain was chosen to create a network of Colo-661 variables and the mapped genes with their 'is_a' ontological ancestors.
The GO network was similarly filtered and connected several distinct biological concepts (Supplementary Figure S6), with example GO MICAs shown in Figure 7b and Supplementary Figure S7. In Figure 7a, five comorbidity variables are linked by their semantic commonality as types of heart disease, while three mutation variables are linked by their involvement in drug catabolic process in Figure 7b.

In Colo-661, 212 patients had died from a CRC-specific cause. Subsets of Colo-661 have been used in previous publications and the data had therefore undergone prior manual QC operations [19,20,48-54]. Researcher-defined data modalities span clinical pathology, epidemiology, mutation, and treatment and outcomes.
These modalities originated from several sources: electronic health records, pathology reports, medical charts, the Northern Ireland Clinical Oncology Information System, Northern Ireland Registrar General's Office, tumour image analysis [49], ColoCarta panel [55], and targeted mutation analysis.
Groups of words that often appear near each other are identified using quanteda's 'tokens_skipgrams' function. The 'stopwords' function from tm [25,57] identifies stopwords, such as "a", "in", and "if", for removal during cleaning. Cluster analysis is implemented through the 'dist' and 'hclust' functions in the stats package [56] to calculate Euclidean distance and single-linkage clusters, respectively, to analyse dataset completeness. Networks for semantic enrichment are created, manipulated, and analysed using igraph and tidygraph [58,59].

DATA REPORTING AND VISUALISATION
The specificity and sensitivity of variables generated from free-text in Colo-661, following preprocessing, were assessed through comparative manual review between each value in the generated variables (n=3305) and the corresponding value in the source variable. Two of the 3305 values were false positives (99.9% specificity) and 23 were false negatives (89.9% sensitivity). eHDPrep produces summary tables generated using the tibble package [60], which are optionally formatted using knitr [61,62] and kableExtra [63]. Heatmaps of dataset completeness are visualised with pheatmap [64] and the remaining plots are created with ggplot2 [65]. Network visualisations in this paper were generated using RCy3, Cytoscape, and Inkscape [66-68].
Nodes in Supplementary Figures S3 and S5 were sized by PageRank centrality [69].

DISCUSSION
eHDPrep delivers an accessible set of functions, demonstrated here with suggested workflows applied to a real-world medical dataset (Colo-661), resulting in demonstrable benefit to data quality. Improvements to Colo-661 using eHDPrep included standardisation of eight strings representing missingness to "NA", resolution of forty internal inconsistencies, conversion of free text to eleven new variables, numeric encoding of 123 nominal and ordinal variables, and a 9.45% increase in mean variable completeness. Additionally, we have showcased tools for assessment of data quality in both the input data and the data following QC operations.
eHDPrep provides novel functionality for data preparation in R where meta-variables are created by aggregation using ontological semantic commonalities. The benefit of this semantic enrichment is exemplified in Colo-661 through the creation of 1600 meta-variables. Furthermore, the mean redundancy of the meta-variables with their constituent variables was 11.8%, demonstrating creation of substantial information that was absent from the input dataset. The added non-redundant information in meta-variables may potentially enable discovery of patterns where the disaggregated data would be too heterogeneous or too sparse to identify meaningful results. We also observed a 5.1% higher average completeness of the meta-variables relative to their constituent variables. For patients where some values in constituent variables are missing, the meta-variables will contain semantic aggregations of the non-missing values, affording more comprehensive patient representation in downstream analyses while simultaneously preserving missing values that may be indicative of patient health and background [70,71]. A further benefit of eHDPrep is interoperability, an important consideration in digital healthcare [72,73]. The standardised encoding of the data improves syntactic interoperability and streamlines incorporation into larger databases, while the meta-variables support semantic interoperability; for example, during data linkage in identifying similar variables across resources with differing degrees of data aggregation.
Semantic enrichment may be widely useful in health data analysis due to the availability of multiple rich, comprehensive ontologies; for example the Disease Ontology and the Human Phenotype Ontology [74,75].
The results of semantic enrichment in eHDPrep are critically dependent upon the ontology taken as input, and will likely suffer from a degree of annotation bias [76]. Also, mapping variables to ontology terms can be time-consuming and may require background knowledge of the variables if their labels are not self-explanatory; these issues may be mitigated by fuzzy string matching [77] and software interfaces, for example the UK National Health Service Digital SNOMED CT Browser [38]. Importantly, variables generated through semantic enrichment might not properly represent the quantitative relationship between their constituent variables due to unusual associations, such as J-curves [78]. Careful variable aggregation may be applied to avoid variation in two variables cancelling each other out. The identified semantic relationships may also aid in interpretation of why the variables have a particular association. For example, if opposite J-curves were found for values of the variables in Figure 7b, their common involvement in drug catabolism might help to understand the pattern of association.
The improved data quality, interoperability, and meta-variables generated through semantic enrichment in eHDPrep is expected to provide for greater robustness and added value in downstream analyses of biomedical data, including Colo-661.
USER DOCUMENTATION AND TECHNICAL DETAILS

Availability of data and materials: All data and materials are made available except for the colorectal cancer patient dataset (Colo-661), which is controlled access and available by application to the Northern Ireland Biobank; relevant contact details are given in the manuscript.

Figure 1 :
Figure 1: Overview of eHDPrep workflow. The ordering of steps reflects logical dependencies. Dashed arrows and boxes signify optional steps. Following import, the semantic characteristics of the data are established, missing values are dealt with and a series of operations may be performed. 'Natural Language Processing' is only required if free-text variables are present. 'Merge Variables' is an optional step for user-defined merging operations with functionality to measure information loss. Variables are encoded in a machine interpretable format and a summary report is generated for review by the user. Additionally, functionality to review each step is provided. 'Semantic Enrichment' optionally involves aggregation of variables according to semantic commonalities identified by an ontology such as SNOMED CT [23].

Figure 3 :
Figure 3: Information theoretic evaluation of merging operations. A comparison of two candidate merging approaches for variables pertaining to Crohn's-like lymphoid reaction in the tumour. a) Merging through aggregation of the Input 1 values '1' and '2' to '1-2' to align with Input 2's '1-2' value. The Mutual Information Content (MIC) of Input 2 with the merged variable is equal to the information content (IC) of Input 2; hence all of the information from Input 2 is captured by the merged variable. In contrast, Input 1 has IC higher than its MIC with the merged variable and so information loss has occurred. b) Lossless merging operation where '1-2' values of Input 2 were encoded as an intermediate ordinal category between '1' and '2' from Input 1. All information from both input variables is contained in the merged variable (i.e., IC is equal to MIC with the merged variable). Indeed, the IC of the merged variable in b) is greater than the value shown in a). Therefore, the merging operation shown in b) is advantageous.

Figure 5 :
Figure 5: Uneven distribution of QC value modifications in the Colo-661 cohort. Stacked bar plot presenting modifications during QC as a percentage of total values per patient (y-axis). Patient records are displayed on the x-axis, ordered by y-axis values. The proportion of substitutions shows remarkable consistency across most patient records, due to the standardisation of mutation variables, which altered all present values; some patients had missing mutation values which could not be standardised and therefore do not appear.
Figure 7a also visualises another MICA, 'Ischaemic heart disease', for two variables in the figure. Supplementary Figure S4 visualises the linkage of three variables, describing prescription medications and an adjuvant treatment regimen, as enzyme inhibitor products. The semantic commonalities between thirteen variables describing tumour excision location, other surgery types, and emergency surgery status are shown in Supplementary Figure S5 with the 'Surgical procedure' MICA. This figure contains additional MICAs for subsets of shown variables, for example node D ('Right colectomy'). Supplementary Figure S7 shows commonality in a typical feature of several cancers, 'Negative regulation of programmed cell death' [47], between sixteen variables from multiple modalities.

Figure 6 :
Figure 6: Semantic enrichment workflow in eHDPrep. The ordering of the steps shows logical dependencies. Dashed box and lines signify that 'Normalise Values' is an optional step, only required if variables have differing magnitudes. The 'Map Variables to Ontology Entities' step requires extensive user input. Meta-variables with zero entropy contain no information and so are omitted before the final step of appending meta-variables to the dataset.

Figure 7 :
Figure 7: Exemplar Most Informative Common Ancestors (MICAs) and the semantic relationships with their constituent Colo-661 variables. a) Relationships between comorbidities are identified by semantic enrichment with SNOMED CT. 'Heart disease' is the MICA for all constituent variables (blue) shown in this figure; however, 'Ischaemic heart disease' is also a MICA for the Colo-661 variable set containing only the 'mi' and 'ihd' variables. b) Three single nucleotide polymorphism variables are semantically linked in the GO by the MICA 'drug catabolic process'. Mutation variables in Colo-661 were selected based on prior association with colon cancer and exogenous exposures of interest.

Table 2 : Summary statistics of ontologies applied to Colo-661 in semantic enrichment.
The GO network has a higher mean number of direct annotations per mapped variable than SNOMED CT, which may explain the larger number of meta-variables generated using GO from fewer mapped variables. Of the 207 variables in the encoded Colo-661 dataset following completion of QC, 14 were not mapped to either SNOMED CT or GO. The number of nodes and edges were measured before the addition of mapped variable nodes.
eHDPrep contains short-form documentation for each function, called with ?[function name]. Long-form documentation, known as a vignette, is also provided to demonstrate QC and semantic enrichment functionality with synthetic example data, R code, and explanatory text. The vignette is created when the package is built and reflects the functionality of the current version. Error and warning handling messages have been included to ensure expected inputs are received and to notify if unexpected outcomes are returned. eHDPrep is written in the R programming language with a codebase size of 3943 lines of code and 57 unit tests.