Quantifying risk factors in medical reports with a context-aware linear model

Abstract
Objective: We seek to quantify the mortality risk associated with mentions of medical concepts in textual electronic health records (EHRs). Recognizing mentions of named entities of relevant types (eg, conditions, symptoms, laboratory tests or behaviors) in text is a well-researched task. However, determining the level of risk associated with them is partly dependent on the textual context in which they appear, which may describe severity, temporal aspects, quantity, etc.
Methods: To take into account that a given word appearing in the context of different risk factors (medical concepts) can make different contributions toward risk level, we propose a multitask approach, called context-aware linear modeling, which can be applied using appropriately regularized linear regression. To improve the performance for risk factors unseen in training data (eg, rare diseases), we take into account their distributional similarity to other concepts.
Results: The evaluation is based on a corpus of 531 reports from EHRs with 99 376 risk factors rated manually by experts. While context-aware linear modeling significantly outperforms single-task models, taking into account concept similarity further improves performance, reaching the level of agreement between human annotators.
Conclusion: Our results show that automatic quantification of risk factors in EHRs can achieve performance comparable to human assessment, and taking into account the multitask structure of the problem and the ability to handle rare concepts is crucial for its accuracy.


Introduction
This appendix describes a complete automated approach for risk assessment. It is an end-to-end process, since it takes free-text electronic health records as an input and outputs a list of encountered risk factors with their risk values. The description of the complete process provides context to the risk quantification task, i.e. assigning a score to each of the risk factors found in a medical report, which was the focus of the main body of the article.
Risk assessment is accomplished through a sequence of three stages (see Figure 1 in the main body of the article for illustration):
1. Risk factor recognition,
2. Risk classification,
3. Risk quantification.
Stage 3 is thoroughly discussed in the main body of the article; here, information on stages 1 and 2 is provided, together with evaluation results for the whole process.

Risk factor recognition
Risk factor recognition was approached as a standard Named Entity Recognition (NER) task, where the purpose is to find all the mentions of words and multiword expressions that fall into predefined categories using manually annotated training data. The corpus (see the main body of the article for more information on the annotation process) contains 531 documents with 109,856 entities, which were divided into train and test subsets.
In the first preprocessing step, the documents were split into sentences using the GENIA sentence splitter [10], tokenised using OSCAR4 [5] and POS-tagged using the GENIA tagger [13]. Secondly, the MetaMap Lite tagger [2] was used to recognise UMLS concepts belonging to semantic types corresponding to risk factor categories. Thirdly, word clustering, which has been demonstrated to improve NER performance [1], was prepared: word embedding vectors were computed by applying word2vec [7] to all the MIMIC discharge summaries (unused in the annotated corpus) and clustered using k-means [4] as implemented in the ClusterR package [11] in R [9]. Eight different clusterings were generated by computing word vectors using a context window of 2 or 5 tokens and running k-means with the number of clusters equal to 128, 256, 512 or 1024. Finally, regular expressions were used to mark lines whose formatting is typical of headers.
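The eight clusterings and the header marking described above can be sketched as follows. This is only an illustration: the window sizes and cluster counts come from the text, but the configuration names and the header pattern are hypothetical, since the actual regular expressions are not given here.

```python
import re
from itertools import product

# Values stated in the text: two context windows and four cluster counts.
CONTEXT_WINDOWS = (2, 5)
CLUSTER_COUNTS = (128, 256, 512, 1024)


def clustering_configs():
    """Enumerate every (window, k) pair; the 'name' labels are hypothetical."""
    return [
        {"name": f"w{window}_k{k}", "window": window, "k": k}
        for window, k in product(CONTEXT_WINDOWS, CLUSTER_COUNTS)
    ]


# Hypothetical header pattern (an assumption, not the authors' expression):
# a short line of letters, spaces and a few separators that ends with a colon.
HEADER_PATTERN = re.compile(r"^\s*[A-Za-z][A-Za-z /&-]{1,60}:\s*$")


def is_header(line):
    """Return True if the line looks like a section header."""
    return HEADER_PATTERN.match(line) is not None


configs = clustering_configs()
header_flags = [
    is_header("HISTORY OF PRESENT ILLNESS:"),
    is_header("Patient was admitted with chest pain."),
]
```

Each configuration would correspond to one word2vec run followed by one k-means run, so every token receives one cluster identifier per configuration.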
As a result, the following features were available for each token:
• surface form text,
• word lemma,
• POS tag,
• BIO label of the recognised UMLS concepts with information about semantic type, e.g. B-dsyn,
• label denoting whether the token belongs to a line recognised as a header,
• cluster identifiers in each of the eight clusterings.
The features based on lemmas and UMLS tags were taken into account through unigrams and bigrams in the close neighbourhood of the target token. These features were then used to build a Conditional Random Fields (CRF) [6] model of BIO-encoded mentions of risk factors (with categories). For this purpose, the CRF++ software was used with L1-norm regularisation.
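To illustrate how unigram and bigram features over a token neighbourhood might be assembled, here is a minimal sketch. The radius of 2 tokens, the feature naming scheme and the example sentence are assumptions made for illustration; the text only states that a "close neighbourhood" was used, and the actual feature templates for CRF++ are not reproduced here.

```python
def window_features(values, i, name, radius=2):
    """Unigrams and bigrams of `values` around position i (radius is an assumption)."""
    feats = {}
    n = len(values)
    for off in range(-radius, radius + 1):
        j = i + off
        # Unigram at relative offset `off`, if it falls inside the sentence.
        if 0 <= j < n:
            feats[f"{name}[{off}]"] = values[j]
        # Bigram covering offsets (off, off + 1), if both fall inside the window.
        if off < radius and 0 <= j and j + 1 < n:
            feats[f"{name}[{off}]|{name}[{off + 1}]"] = f"{values[j]}|{values[j + 1]}"
    return feats


def token_features(sentence, i):
    """Combine per-token features with windowed lemma and UMLS features."""
    token = sentence[i]
    feats = {"surface": token["surface"], "pos": token["pos"], "header": token["header"]}
    feats.update(window_features([t["lemma"] for t in sentence], i, "lemma"))
    feats.update(window_features([t["umls"] for t in sentence], i, "umls"))
    return feats


# Hypothetical example sentence: "Patient has severe pneumonia"
sentence = [
    {"surface": "Patient", "lemma": "patient", "pos": "NN", "umls": "O", "header": False},
    {"surface": "has", "lemma": "have", "pos": "VBZ", "umls": "O", "header": False},
    {"surface": "severe", "lemma": "severe", "pos": "JJ", "umls": "O", "header": False},
    {"surface": "pneumonia", "lemma": "pneumonia", "pos": "NN", "umls": "B-dsyn", "header": False},
]
features = token_features(sentence, 3)
```

Cluster identifiers are omitted from the sketch for brevity; they would be added as further per-token entries, one per clustering.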

Risk classification
The purpose of risk classification is to classify a given mention into one of two categories: being associated with the main patient (and subject to risk quantification) or not (and assigned risk value None). The whole corpus contains 109,386 entities with the risk assigned manually, which, depending on whether the risk value is None or not, could serve as negative or positive cases. Being a binary classification task, this was approached through logistic regression.
Firstly, for each mention a feature representation was generated to provide the necessary context for the risk classification. This feature set was the same as the one used for risk quantification (see the main body of the article for details). Secondly, a logistic regression model was built using the glmnet package [3] in R. In order to avoid the detrimental effect of the vast number of features, LASSO regularisation was applied (α set to 1) with λ selected through cross-validation. Based on the experiments with the training set, the threshold for the probability score was set at 0.8, and all mentions assigned a probability below this value were labelled as None.
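The decision rule can be sketched as below. Only the threshold of 0.8 comes from the text; the feature names, weights and bias are hypothetical stand-ins for a fitted sparse model, and the sketch does not reproduce the glmnet fitting itself.

```python
import math

NONE_THRESHOLD = 0.8  # probability threshold reported in the text


def predict_probability(weights, bias, features):
    """Logistic model: sigmoid of bias plus the weighted sum of active features."""
    score = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def classify(probability, threshold=NONE_THRESHOLD):
    """Mentions scoring below the threshold are labelled None, others are quantified."""
    return "quantify" if probability >= threshold else "None"


# Hypothetical sparse weights, as a LASSO fit would leave most weights at zero.
weights = {"ctx=admitted": 1.5, "ctx=father": -3.0}
bias = 1.0

own_condition = classify(predict_probability(weights, bias, {"ctx=admitted": 1.0}))
family_history = classify(predict_probability(weights, bias, {"ctx=father": 1.0}))
```

With these illustrative weights, a mention in a family-history context falls below the threshold and receives None, while a mention tied to the patient's own admission is passed on to quantification.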

Risk quantification
In the last step of the process, all mentions that were not labelled as None are subject to risk quantification. Specifically, a CALM model is applied to compute a risk score, which is then converted into one of the Low, Medium and High categories using predefined thresholds. This task is the topic of the main body of the article.
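The conversion of a continuous score into a category can be sketched as follows. The threshold values are not given in this appendix, so the numbers used in the example (1.5 and 2.5) are purely hypothetical, as is the choice to assign a score lying exactly on a boundary to the higher category.

```python
from bisect import bisect_right

LABELS = ("Low", "Medium", "High")


def score_to_category(score, low_medium, medium_high):
    """Map a risk score to a label using two thresholds supplied by the caller."""
    if low_medium >= medium_high:
        raise ValueError("thresholds must be strictly increasing")
    # bisect_right counts how many thresholds are less than or equal to the score.
    return LABELS[bisect_right([low_medium, medium_high], score)]


# Hypothetical thresholds, for illustration only.
examples = [score_to_category(s, 1.5, 2.5) for s in (1.0, 1.5, 2.49, 2.5)]
```

A score just below a threshold and one just above it receive different labels, which is one way a small numerical error can turn into a categorical one.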

Evaluation
In order to evaluate the end-to-end solution, the models used for the three steps (prepared based on the training data) were run consecutively on the test data documents. Specifically, risk factors recognised at stage 1 were scored at stage 2 and either assigned None or quantified at stage 3. Since every stage can introduce its own errors, it is necessary to evaluate the complete workflow to assess the performance in a practical setting.
To this effect, the manually annotated mentions (true) were compared to those recognised by the system (predicted) on the test documents. Two entities were considered to match if their spans overlap (i.e. relaxed matching). Based on this, the following measures were computed:
• Overall recall: how many of the true mentions were matched by the predictions,
• Overall precision: how many of the predicted mentions matched the true ones (ignoring the category and risk value),
• Category accuracy: how many of the matching predicted mentions have the correct category,
• Risk accuracy: how many of the matching predicted mentions have the correct risk value,
• For each of the possible risk levels (None/Low/Medium/High), how many of the true mentions with such risk are: correct (matched by a predicted mention with the same risk value), incorrect (matched by a predicted mention, but with a different risk value), or missed (not matched by any prediction).
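The first four measures can be sketched as below, assuming half-open character spans and assuming that a predicted mention overlapping several true mentions is compared against the first one it overlaps. Both choices, the `Mention` type and the example data are illustrative assumptions rather than details taken from the evaluation; the per-risk-level breakdown is omitted for brevity.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    category: str
    risk: str


def overlaps(a, b):
    """Relaxed matching: two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end


def first_match(mention, candidates):
    """Return the first candidate overlapping the mention, or None."""
    for candidate in candidates:
        if overlaps(mention, candidate):
            return candidate
    return None


def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def evaluate(true, predicted):
    matched_true = [t for t in true if first_match(t, predicted) is not None]
    pairs = [(p, first_match(p, true)) for p in predicted]
    pairs = [(p, t) for p, t in pairs if t is not None]
    return {
        "recall": ratio(len(matched_true), len(true)),
        "precision": ratio(len(pairs), len(predicted)),
        "category_accuracy": ratio(sum(p.category == t.category for p, t in pairs), len(pairs)),
        "risk_accuracy": ratio(sum(p.risk == t.risk for p, t in pairs), len(pairs)),
    }


# Hypothetical annotations for one document.
true = [
    Mention(0, 9, "Condition", "High"),
    Mention(20, 25, "Symptom", "Low"),
    Mention(40, 48, "Behavior", "None"),
]
predicted = [
    Mention(2, 9, "Condition", "Medium"),
    Mention(20, 25, "Symptom", "Low"),
    Mention(60, 65, "Symptom", "Low"),
]
scores = evaluate(true, predicted)
```

In this toy example, the accuracy measures are computed only over the two predicted mentions that overlap a true mention, so a mention missed at the recognition stage lowers recall without entering the accuracy figures.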
In order to put these values in context, an inter-annotator agreement (IAA) baseline was computed on the double-annotated documents by treating the annotations by one expert as true and those by the other as predicted. The values were aggregated over all annotator pairs and both directions.
Table 1: Evaluation results of the end-to-end solution, including the inter-annotator agreement baseline.
Table 1 includes the results. It shows that, in terms of overall mention recognition, the system's high precision is not matched by equally satisfactory recall. This affects the later stages of the workflow, since a mention that is missed at the NER stage cannot be assessed for correctness.
Looking at the accuracy measures, the categories are predicted very well, but risk values pose a larger challenge. Although the performance of risk quantification in isolation was on par with human agreement (see the main body of the article), in the end-to-end evaluation there is more room for improvement. There are two reasons for this. Firstly, to compute the accuracy (as opposed to mean squared error), the risk scores had to be converted to Low, Medium or High, which could increase the error if a borderline value is converted to the wrong category. Secondly, during testing the risk assessment uses the automatically recognised mentions, which are noisier than the true mentions on which the model was trained.
The results for the accuracy at each risk value confirm the observations made in the main body of the article, with the High-ranked risk factors being the hardest to assess properly. This is demonstrated both by a high ratio of factors with an incorrect value and by a significant number of them being missed entirely. Based on this evaluation, improving the mention recall seems to be an obvious direction of development.
Entity normalisation is one of the elements of the end-to-end solution that plays a part in the problems discussed above. Firstly, the risk factor recognition relies heavily on features generated from MetaMap output, so when MetaMap fails to associate a word with an ontology concept, the mention can be missed unless there are other contextual clues. Secondly, since the concept IDs define the multi-task structure, the normalisation accuracy is crucial for the risk quantification performance. Unfortunately, this problem is challenging in the case of EHRs due to issues such as acronyms, loose syntax, ad-hoc abbreviations and misspellings. Replacing the UMLS MetaMap with a more tailored solution, optimised for processing clinical records [12] or acronym resolution [8], seems to be a promising direction.