Automatically pre-screening patients for the rare disease aromatic l-amino acid decarboxylase deficiency using knowledge engineering, natural language processing, and machine learning on a large EHR population

Abstract
Objectives: Electronic health record (EHR) data may facilitate the identification of rare diseases in patients, such as aromatic l-amino acid decarboxylase deficiency (AADCd), an autosomal recessive disease caused by pathogenic variants in the dopa decarboxylase gene. Deficiency of the AADC enzyme results in combined severe reductions in monoamine neurotransmitters: dopamine, serotonin, epinephrine, and norepinephrine. This leads to widespread neurological complications affecting motor, behavioral, and autonomic function. The goal of this study was to use EHR data to identify previously undiagnosed patients who may have AADCd without available training cases for the disease.
Materials and Methods: A multiple symptom and related disease annotated dataset was created and used to train individual concept classifiers on annotated sentence data. A multistep algorithm was then used to combine concept predictions into a single patient rank value.
Results: Using an 8000-patient dataset that the algorithms had not seen before ranking, the top and bottom 200 ranked patients were manually reviewed for clinical indications of performing an AADCd diagnostic screening test. Of the top-ranked patients, 22.5% were positively assessed for diagnostic screening, compared with 0% of the bottom-ranked patients. This result is statistically significant at P < .0001.
Conclusion: This work validates the approach that large-scale rare-disease screening can be accomplished by combining predictions for relevant individual symptoms and related conditions which are much more common and for which training data is easier to create.


Background and significance
Aromatic L-amino acid decarboxylase deficiency (AADCd) is an autosomal recessive disorder caused by pathogenic variants in the dopa decarboxylase (DDC) gene. This gene encodes the AADC enzyme, which is responsible for catalyzing the chemical reactions that create the neurotransmitters epinephrine, norepinephrine, dopamine, and serotonin. Therefore, the deficiency has widespread neurological effects including hypotonia, movement disorders such as oculogyric crisis and dystonia, dysfunction of the autonomic nervous system, and developmental delay. 1 As is the case with many rare disorders, estimating the prevalence of AADCd is challenging. The true global prevalence is unknown. In the most comprehensive recent study, 348 cases have been described worldwide, with a higher prevalence in Taiwan due to a founder variant. 2 Most patients present in infancy with hypotonia, oculogyric crisis, developmental delay, and feeding issues. Patients with the classic form of AADCd never reach their gross motor developmental milestones. Sleep disorders, gastrointestinal (GI) problems, mood disturbance, and feeding issues are frequent. AADCd has a wide spectrum of phenotypes with cases presenting late in childhood or early in adulthood and remaining undetected, perhaps indefinitely. 3 AADCd often presents as a nonspecific neurodevelopmental disorder, particularly when the distinguishing feature of the oculogyric crisis is not recognized. [6][7][8] Furthermore, the primary diagnostic methodology of CSF neurotransmitter analysis may be underutilized due to the invasiveness of lumbar puncture and the limited availability of the analysis (we are aware of only 2 clinical laboratories providing the testing in the United States). These factors likely lead to AADCd being underdiagnosed. 1,9
According to consensus guidelines, a definitive diagnosis should include positive findings in 2 of the 3 core diagnostic tests 10 : 1) cerebrospinal fluid analysis demonstrating abnormal levels of neurotransmitter metabolites consistent with deficiency of the AADC enzyme; 2) reduced plasma AADC enzyme activity; 3) compound heterozygous or homozygous pathogenic variants in the DDC gene.
Other biochemical tests which can support a diagnosis of AADCd include measurement of 3-O-methyldopa (3-OMD) in dried blood spots or plasma, or urine organic acid analysis. 2 Treatment with dopamine agonists, MAO inhibitors, and pyridoxine/pyridoxal phosphate has shown limited efficacy in some AADCd patients, 10 and gene therapy treatments are currently approved in Europe and the United Kingdom, and in development in the United States and China. Therefore, a systematic approach to identifying patients at an increased risk of AADCd is warranted.
There are as many as 10 000 rare diseases around the world. 11 The time to diagnosis for many of these diseases is lengthened by their rarity as well as under-recognition by evaluating clinicians. This may cause needless suffering for such patients, not only through the stress of lacking a diagnosis for their symptoms but also through delays in treatments, where these exist, that could reduce the sometimes debilitating symptoms of these diseases. One possible way to expedite the diagnosis of rare diseases is through the use of clinical data, particularly data in the electronic health record (EHR) 12 coupled with new advances in machine learning (ML). 13 Many patients with rare diseases see numerous providers, resulting in a corpus of data that can be processed to uncover signals of rare diseases.
Our previous work focused on acute hepatic porphyria (AHP), a rare disease occurring in approximately 1 per 100 000 people. 14 Diagnosis of AHP takes an average of 15 years from the onset of symptoms. 15 Our previous work on a corpus of EHR data from 205 000 patients, with 30 positive cases, found that we could identify the presence of the neuro-visceral symptoms of AHP and no other explanatory diagnoses using a ML approach. 16 While we did not diagnose any new cases among the 7 of 18 people whom our algorithm identified and who were willing to undergo urine porphobilinogen testing, it was clinically appropriate to test such individuals. 17 Others have searched for additional rare diseases in EHR data using ML. These include births of patients with cardiac amyloidosis, 18 systemic sclerosis, 19 lipodystrophy, 20 presence of the KCNA2 gene variant, 21 primary Sjögren's syndrome, 22 Dravet syndrome, 23 Jeune syndrome, 24 systemic lupus erythematosus, 25 renal ciliopathies, 26 Pompe disease, 27 and Fabry disease. 28

Objectives
The goal of this study was to use EHR data to identify patients who may have undiagnosed AADCd and possibly other related disorders of aromatic amino acid and neurotransmitter metabolism that may be coded similarly in the EHR (eg, ICD10 coding E70.81). Patients who have AADCd, but are yet to be diagnosed, will of course not have a structured diagnostic code in the EHR for this disease. In the initial review of 500K patients aged ≤25 years in the Oregon Health & Science University (OHSU) Research Data Warehouse (RDW), one patient was found with this diagnostic code assigned. Clearly, this is a very rare disease, and is potentially underrecognized, even in a tertiary care facility such as OHSU, which is the largest academic medical center in Oregon.
EHR data provide a wealth of information to improve clinical care and facilitate research. Identifying patients for further diagnostic work up of undiagnosed rare diseases by manual chart review is a laborious, time-consuming, and likely impractical task. This research was the first attempt to develop and evaluate an algorithm to identify potential patients with this disease on a large and realistic EHR dataset.
This approach could facilitate more accurate population prevalence assessment as well as provide proof of concept for a tool that can help identify undiagnosed patients who may benefit from earlier treatment as well as eligibility for future clinical trials. As opposed to more traditional projects of this type that rely entirely on manual chart review to identify potential patients, our approach made use of informatics, information retrieval, natural language processing (NLP), and ML techniques to create a more efficient and reusable approach to identifying patients who potentially have AADCd. While prior work exists on detecting cases of rare diseases in EHR data (see Background and Significance), this was the first work on AADCd and, to our knowledge, the first proposed ML-based method that neither used nor had sample cases of the rare disease for training or algorithmic tuning.

Overall approach based on symptoms, not disease cases
The overall strategy for the automated identification of patients who may have undiagnosed AADCd and should be considered for diagnostic screening is shown in Figure 1.
This approach is not based on direct training with positive AADCd cases. Instead, the proposed method is based on recognizing the symptoms and associated conditions that together may indicate an undiagnosed rare disorder of neurotransmitter metabolism. Specifically, the goal is to identify cases where a definitive diagnosis is not present in the chart and diagnostic testing for AADCd may be indicated.
The approach follows a multistep process:
1) Based on the literature review and expert knowledge, identify a set of AADCd-associated symptoms and conditions of interest.
2) Randomly divide the dataset of patient chart data meeting inclusion criteria (described below) into 10 partitions. Use partition 0 for development and testing. Hold out partitions 1-9 as the blinded dataset.
3) Divide clinical note data for partition 0 into training (80%), validation (10%), and testing (10%) sets by randomly assigning patients to 1 of the 3 sets.
4) Create an annotation guide and manually annotate a training set of Pediatric Neurology and EEG exam notes for the associated symptoms and conditions of interest.
5) Divide patient Pediatric Neurology and EEG exam notes into sentences.
6) Score each sentence for the probability of containing an AADCd concept of interest using a trained ML model.
7) Combine all sentence predictions for an individual patient into a single probabilistic prediction of whether that AADCd concept was expressed in a positive manner for the patient in a note.
8) Combine all individual concept predictions into a single rank value using a fitted Poisson regression model.
9) Manually review the top-ranked patient charts for diagnostic testing consideration. Also manually review the bottom-ranked patient charts for comparison.
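Steps 2 and 3 of this process can be illustrated with a minimal sketch. It assumes patients are identified by sortable IDs and uses a seeded shuffle; the function names, seed, and rounding rule are illustrative assumptions, not details reported in this study.

```python
import random

def assign_partitions(patient_ids, n_partitions=10, seed=0):
    """Randomly assign each patient to one of n_partitions groups."""
    ids = sorted(patient_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    return {pid: i % n_partitions for i, pid in enumerate(ids)}

def split_patients(patient_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split patients (not sentences) into training/validation/testing sets."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(round(fractions[0] * len(ids)))
    n_val = int(round(fractions[1] * len(ids)))
    return {
        "train": set(ids[:n_train]),
        "validation": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }

# Hypothetical cohort of 1000 patient IDs.
partitions = assign_partitions(range(1000))
dev_ids = [pid for pid, part in partitions.items() if part == 0]
splits = split_patients(dev_ids)
```

Splitting at the patient level, rather than the note or sentence level, is what keeps text from one patient out of more than one set.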

Creation of annotation schema
After a review of the literature 3,5,10,29-32 and discussion amongst all authors, it was determined that the following 10 concepts were the most important factors in determining the risk of AADCd from the EHR:
1) Autonomic dysfunction
2) Cerebral palsy
3) Developmental delay
4) Epilepsy or seizures
5) Feeding issues
6) Hypotonia
7) Insomnia
8) Mood disturbances
9) Movement disorders
10) Oculogyric crisis
These concepts of interest are a combination of symptoms related to AADCd, and conditions that may co-occur or be differential diagnoses for AADCd. The concepts of interest were also annotated with modifiers designating whether they were negated, not about the patient, or hypothetical. The listed concepts are not all of the same diagnostic or screening importance and are specified in alphabetical order and not ranked order.
For completeness, we also annotated whether the patient chart specifically mentioned AADCd. This was a very rare occurrence in our dataset, and we did not use direct mention of AADCd as a concept in our ML models. The approach is geared toward identifying patients with unrecognized AADCd, and it was reasoned that if AADCd is mentioned in the chart, it is already being considered by the clinicians, and therefore identifying these patients automatically would not add value as far as suggesting diagnostic testing.

Creation of pediatric neurology focused dataset
For all data in this study, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [33][34][35] research database instance was provided by the OHSU Research Data Warehouse (RDW). This research data source includes all OHSU patients represented in a standards-based data model. By basing our work on this data model, the results are intended to be more generalizable and reusable. This study was approved by the OHSU IRB under approval number STUDY00023368.
The initial dataset cohort was generated by creating a subset of the main OMOP database by requiring patients to be ≤25 years old and have at least 2 visits at OHSU. A text search of the OMOP NOTES database table for this age criteria cohort dataset was then performed, identifying patients who had at least one Pediatric Neurology note. Using this process, the study cohort dataset was created, which consisted of 8946 patients who met the following set of selection criteria and had sufficient data to be included in this study:
• ≤25 years of age at the time of the data set creation
• ≥2 visits in the OHSU OMOP database
• ≥1 Pediatric Neurology note
The study cohort dataset was then divided into 10 partitions of approximately 850 patients each. Partition 0 was used for all investigation and training. Partitions 1-9 were used later in the study as test data to test the application of the approach on unseen data.
After reviewing a random selection of notes in partition 0, it was determined that most of the notes did not reference or contain much information that was relevant to the detection of AADCd. To focus the dataset on information relevant to AADCd detection, the notes were further filtered by limiting the final experimental dataset to Pediatric Neurology notes and EEG reports. For this study, other note types were not processed or manually reviewed.
The resulting experimental dataset contained 8946 patients and 520 473 notes overall. Partition 0 contained 921 patients and 54 857 notes.

Annotation of partition 0
To create a training data set for the AADCd-related concepts, an annotator trained in epidemiology (J.K.) reviewed the notes for each patient in partition 0 and selected the most clinically complete appearing early and late pediatric neurology note in the record, as well as the most complete EEG. This was done to maximize the efficiency of manual annotation to get as much data on the concepts of interest from the annotated notes.
The BRAT 36 annotation tool was installed locally and used by the annotator to select text spans and save annotations. For each of the selected notes, the annotator selected the minimum span of text that expressed the complete concept in the annotation schema, including any modifiers. The annotation schema allowed overlap of annotations, if necessary, such as a single negation applying to multiple annotated concepts.
An initial round of 20 patients was first annotated, and these annotations were reviewed by the PI (A.M.C.). After a discussion of how to handle some uncertain edge cases and discovering some inconsistencies, the annotation guide was enhanced to provide additional specific instructions and example phrases. After updating and reviewing the annotation guide, the rest of the selected notes in partition 0 were annotated. The final annotation guide is available as Appendix SA1.

Creation of training, validation, and test datasets
The annotated partition 0 data were converted into a training dataset suitable for ML by a multistep process. The goal of this process was to create a set of sentences, each sentence having an associated binary variable designating whether or not the sentence included the AADCd-related concept, and other associated binary variables for each concept about whether they were negated, not about the patient, or hypothetical.
Each note was parsed into individual sentences using the "en_core_web_trf" model sentence parser in the spaCy (https://spacy.io) Python toolkit. This model was the most complex parsing model and was chosen as having the best documented performance. Custom Python scripts were then written, which used sentence offset and text matching to determine which sentences corresponded to the individual BRAT annotations. If a sentence contained an annotation or part of an annotation, that sentence was marked as true for that annotation, and false otherwise. A database of sentences was created, containing all the information and text about the sentence and the annotations for that sentence. The database was then split 80%/10%/10% into training, validation, and testing datasets. In order to prevent leakage of patient data between the sets, all of the sentences for an individual patient were placed into the same dataset. The split sizes were determined based on having as much training data as possible, given that some of the concepts were rare (highly unbalanced data). At the same time, the validation and testing sets needed to be large enough to perform a meaningful evaluation. Therefore, we made the training set as large as possible, while not making the other sets any smaller than 10% of the data. Counts of sentences and annotations assigned in each dataset are shown in Table 1.
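The offset-matching rule (a sentence is true for an annotation if it contains all or part of it) can be sketched as an interval-overlap test. This is an illustrative reconstruction, not the study's scripts: the sentence offsets are hard-coded here rather than produced by spaCy, and the tuple layout for annotations is an assumption.

```python
def label_sentences(sentence_spans, annotations):
    """Mark each sentence with the concepts whose annotation span overlaps it.

    sentence_spans: list of (start, end) character offsets, end exclusive.
    annotations: list of (start, end, concept) tuples (assumed layout).
    Returns one set of concept names per sentence.
    """
    labels = [set() for _ in sentence_spans]
    for i, (s_start, s_end) in enumerate(sentence_spans):
        for a_start, a_end, concept in annotations:
            # Two half-open intervals overlap when each starts before the other ends.
            if a_start < s_end and s_start < a_end:
                labels[i].add(concept)
    return labels

# Hypothetical note text with hand-computed offsets.
note = "Patient has hypotonia. No seizures were observed."
sentences = [(0, 22), (23, 49)]
annotations = [(12, 21, "Hypotonia"), (26, 34, "Epilepsy or seizures")]
sentence_labels = label_sentences(sentences, annotations)
```

An annotation that crosses a sentence boundary would mark both sentences under this rule, which matches the "annotation or part of an annotation" wording.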

Machine learning approach
Initial experimentation with the training dataset and a linear SVM classifier used unigram and bigram features from the dataset. It was found that there were not enough samples of negated, not patient, and hypothetical annotations to predict these categories separately. Therefore, these were combined into one single "negative qualified" category for each concept. The prediction task for each concept in each sentence was then defined as a 3-class prediction, consisting of the following 3 classes:
1) Negative: the concept is not present in the sentence.
2) Negative qualified: the concept is present, but is negated, not about the patient, and/or hypothetical.
3) Positive: the concept is present, and is not qualified as any of negated, not about the patient, or hypothetical.
All sentence-level classification tasks were then formulated as this 3-class problem.
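As a minimal sketch of this 3-class definition, the per-sentence binary annotation flags can be collapsed into a single label as follows. The integer codes and the function name are illustrative assumptions.

```python
NEGATIVE, NEGATIVE_QUALIFIED, POSITIVE = 0, 1, 2  # assumed integer encoding

def three_class_label(present, negated=False, not_patient=False, hypothetical=False):
    """Collapse the binary annotation flags for one concept into one of 3 classes."""
    if not present:
        return NEGATIVE
    if negated or not_patient or hypothetical:
        return NEGATIVE_QUALIFIED
    return POSITIVE
```

For example, a sentence such as "mother has a history of seizures" would carry the concept with the not-about-the-patient modifier and so fall into the negative qualified class.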

Machine learning concept algorithm optimization
In order to predict the 3-class AADCd concepts most accurately, a variety of alternative ML strategies were evaluated and compared using 5 repetitions of 2-way cross-validation on the training dataset. Three types of features were evaluated: n-gram-based features, embedding vectors based on the pre-trained ClinicalBert model provided by HuggingFace (available at https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT), 37 and autoencoder-based features using various layer widths in a 5-stage encoder-decoder architecture. The n-gram features were obtained by parsing the sentences into n-grams of length one, two, or three tokens using the same spaCy parser model used to divide the dataset into sentences. N-grams were then filtered for overall document frequency, and n-grams occurring in more than an upper threshold of the documents or fewer than a lower threshold of the documents were removed. The autoencoder experiments also used n-grams as input to a denoising autoencoder, which has been successful in prior reported biomedical text classification work. 38,39 Cross-validation experiments using the support vector machine and logistic regression classifiers found that the best set of thresholds removed tokens that occurred in 95% or more of the documents, or in less than 5% of training documents. Combining uni- and bi-grams resulted in improved performance; no performance gain was obtained by adding tri-grams. This resulted in 3952 n-gram-based token features.
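The document-frequency filter can be sketched in plain Python as below. This is an illustration only: it assumes sentences are already tokenized (the study used spaCy), treats each sentence as a "document", and uses hypothetical function names.

```python
from collections import Counter

def ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def build_vocabulary(tokenized_docs, max_n=2, min_df=0.05, max_df=0.95):
    """Keep n-grams whose document frequency is >= min_df and < max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(ngrams(tokens, max_n)))  # count each n-gram once per doc
    return sorted(g for g, df in doc_freq.items() if min_df <= df / n_docs < max_df)

# Toy corpus; min_df is raised to 0.5 so the lower threshold has a visible effect.
docs = [
    ["patient", "has", "hypotonia"],
    ["patient", "has", "seizures"],
    ["patient", "denies", "seizures"],
    ["patient", "has", "insomnia"],
]
vocab = build_vocabulary(docs, max_n=2, min_df=0.5, max_df=0.95)
```

In the toy corpus, "patient" appears in every document and is dropped by the upper threshold, while n-grams seen in only one document are dropped by the lower one.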
Feature embedding vectors based on ClinicalBert were also evaluated. This model creates a feature vector of 768 dimensions from the entire sentence. The n-gram and ClinicalBert feature vectors were then evaluated using cross-validation on the training set, both separately and concatenated into single vectors, resulting in feature vectors of length 3952, 768, and 4720, respectively.
Cross-validation on the training set was again applied using log-loss as the metric of accuracy. It was found that the combination of n-gram and ClinicalBert features consistently outperformed either feature type separately. The combined feature vector of 4720 dimensions consistently performed better than all other feature combinations as evaluated by cross-validation. It was determined that the SVM classifier performed as well as, or in most cases, better than the other classification approaches. The autoencoder-based features did not improve performance over the combination of n-gram and ClinicalBert features and as an individual feature set performed worse than n-grams. See Figure 2 for an example of comparisons that were evaluated.
The combined feature vectors were then used with several different classifier algorithms including SVM with multiple kernel types, random forest, logistic regression, 2- and 3-layer neural networks, and gradient boosting. The SVM-based kernels at default settings performed as well as or better than the other algorithms under cross-validation, so SVM was chosen as the main algorithm, and kernel and parameter settings were optimized.
SVM with the linear kernel and the combined 4720-length feature vector was then used as a base for comparison with other SVM kernels. Kernel parameters were then optimized using grid search, and the combination of kernel and parameter settings with the lowest log-loss was chosen for each AADCd concept. Negative down-sampling was also used to increase the concentration of positive samples in the training set for some concepts, with a downsampling range of 0.05 to 1.0 evaluated in steps of 0.05. See Table 2 for a list of the final ML models, kernels, and parameters chosen. The same vector of 4720 length consisting of the concatenated n-gram and ClinicalBert embedding features was used for all classifiers.
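A minimal sketch of negative down-sampling over the evaluated grid follows. It rests on two assumptions not stated in the text: that the rate is the fraction of negative samples retained, and that only class 0 (Negative) is down-sampled. The function name and seed are illustrative.

```python
import random

def downsample_negatives(labels, keep_fraction, seed=0):
    """Return sorted indices to keep: all non-negative samples plus a random
    fraction of the negative (class 0) samples."""
    rng = random.Random(seed)
    negatives = [i for i, y in enumerate(labels) if y == 0]
    others = [i for i, y in enumerate(labels) if y != 0]
    n_keep = int(round(keep_fraction * len(negatives)))
    return sorted(others + rng.sample(negatives, n_keep))

# Candidate rates: 0.05, 0.10, ..., 1.00
rates = [round(0.05 * k, 2) for k in range(1, 21)]

# Toy labels: 100 negatives and 5 positives.
labels = [0] * 100 + [2] * 5
kept = downsample_negatives(labels, keep_fraction=0.2)
```

At a rate of 1.0 the training set is unchanged, so the grid includes the no-down-sampling case.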
Final predictive concept models were then created using these settings and the full training data set, split into 2/3 + 1/3 portions for model training and calibration with isotonic regression. The result of this step is a separate sentence-level predictive model for each concept giving the predictive probabilities for each of the 3 classes.
Performance was then evaluated on the validation dataset and checked for consistency with that predicted by the training dataset cross-validation selection procedure. The evaluated performance on the validation dataset was found to be consistent and close to the predicted performance. No changes were made to the models after the evaluation of the validation set. The final performance was then evaluated on the combined validation + testing dataset. See Table 2 for the performance of the final trained models on the validation + testing dataset.

Combining concept sentence predictions into patient predictions
Patient-level training and validation datasets were then created for each concept by collapsing the annotated sentences for each patient into a single binary present/absent variable. If a patient had any positive manual annotation for a concept, that patient was assigned positive for that concept, otherwise assigned as negative. In this manner, a patient-concept-level gold standard was programmatically created from the individual sentence annotations.
Several methods were investigated to automatically combine the individual sentence-level predictions for a patient for a given concept into a single patient-level concept prediction. These methods were termed "reduction" functions, since they act like a reduction operation in functional programming, taking in a list of inputs (in this case sentence-level concept predicted probabilities) and outputting a single overall result (in this case, the patient-level concept predicted probability). The reduction functions evaluated included:
• max of positive sentence-level prediction probabilities
• min of positive sentence-level prediction probabilities
• mean of positive sentence-level prediction probabilities
• noisy-or of positive sentence-level prediction probabilities 40
• two-level neural networks trained on positive sentence-level prediction probabilities
• two-level neural networks trained on positive, negative, and negative-qualified sentence-level prediction probabilities
• Linear SVM trained on positive sentence-level prediction probabilities
• Linear SVM trained on positive, negative, and negative-qualified sentence-level prediction probabilities
These methods were evaluated by comparing the algorithmic predictions with the gold standard. The best method for each concept was then chosen based on average precision. Average precision was chosen as the best measure since the overall goal of the project is to rank patients for diagnostic screening for AADCd, and therefore it is reasonable to optimize the patient-level predictions by the ability of the reduction function to rank patients for presence of the concepts of interest in their clinical notes. The best reduction algorithm and the patient concept performance obtained on the validation + testing data after choosing these settings on the training data set are shown in Table 2.
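The four closed-form reduction functions in the list above can be sketched as follows; the trained neural network and SVM reductions are omitted. The noisy-or form shown is the standard one (one minus the product of complements), and the dictionary layout is an illustrative assumption.

```python
import math

def noisy_or(probs):
    """Probability that at least one sentence is positive, treating the
    sentence-level predictions as independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)

REDUCTIONS = {
    "max": max,
    "min": min,
    "mean": lambda probs: sum(probs) / len(probs),
    "noisy_or": noisy_or,
}

# Hypothetical positive-class probabilities for three sentences of one patient.
sentence_probs = [0.5, 0.5, 0.0]
patient_scores = {name: fn(sentence_probs) for name, fn in REDUCTIONS.items()}
```

Noisy-or rewards repeated moderate evidence: two sentences at 0.5 combine to 0.75, whereas max stays at 0.5.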

Ranking patients by combining patient-level concept predictions
Finally, patient-level concept predictions were combined into patient-specific rank values for prioritizing manual review for AADCd screening. Since there is no "gold standard" for ranking patients in this manner, especially since the disease is very rare and there was a lack of appropriate cases to train on, this was done in 2 steps. First, it was postulated that patients having a higher number of AADCd-related symptoms would be more likely to be good screening candidates. Therefore, an overall target rank score was calculated for each patient in the training data based simply on the count of the number of positive symptoms that they had in their gold standard patient-level concept set.
This overall target rank count included all concepts except for Epilepsy or Seizures, which has a complex relationship with AADCd and was handled differently from the other concepts. Epilepsy or Seizures, while a distinct condition in itself, can be related to AADCd in 3 ways: (1) patients with AADCd have isolated seizures as part of the clinical presentation (seizures occur more frequently in AADCd than the general population), (2) oculogyric crises can be misdiagnosed as seizures, and (3) patients with AADCd can have both oculogyric crises and seizures as part of the clinical presentation.
It has been estimated that 4.5% to 8% of AADCd patients also have seizures/epilepsy, 3,10 which is more common than in the general population. While 68% of AADCd patients experience oculogyric crisis, this can be confused for seizures. 41 In the second step, a Poisson regression model was created that took as input the patient-level concept probabilities (all of them, including epilepsy as a predictive variable) and was fit to predict the number of counted AADCd symptoms assigned in the first step. This predicted symptom count was then used as the patient ranking value for manual review.
Initially, with this method, the Poisson regression was performed on the training data, and the fit was compared to the validation dataset. After this step demonstrated a good fit with a D-squared value of 0.63, the Poisson regression was fit on the training + validation data, and this is the final regression ranking model used in our approach on unseen data. The mean symptom count on the training + validation data was 1.615, with a standard deviation of 1.472. Performance of the model fit on the training + validation data and tested on the test data is shown in Figure 3. The final coefficients of the model fit on the training + validation data were as follows:
The methodology above results in a set of models and applicable steps that can be applied to unseen data, and that produce ranking values for each individual patient.
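The goodness-of-fit measure can be illustrated with a short sketch. It assumes that "D-squared" here means the fraction of Poisson deviance explained relative to a mean-only model, which is the quantity scikit-learn's PoissonRegressor reports as its score; the text does not state the exact definition used, and the example values are invented.

```python
import math

def poisson_deviance(y_true, y_pred):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu)), with mu > 0."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        # y * log(y / mu) is taken as 0 when y == 0 (its limiting value).
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - (y - mu))
    return total

def d_squared(y_true, y_pred):
    """Fraction of the null (mean-only) Poisson deviance explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    null_deviance = poisson_deviance(y_true, [mean_y] * len(y_true))
    return 1.0 - poisson_deviance(y_true, y_pred) / null_deviance

# Hypothetical symptom counts and predicted counts for four patients.
counts = [0, 1, 2, 5]
predicted = [0.5, 1.0, 2.5, 4.0]
fit_quality = d_squared(counts, predicted)
```

A value of 1.0 indicates a perfect fit and 0.0 indicates no improvement over predicting the mean count for every patient.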

Evaluation approach
We applied the methodology described in the previous sections to the 8025 patients in the held-out partitions 1-9. These patients comprised all unseen data that neither the investigators nor the algorithms had seen before. The top-ranked and bottom-ranked 200 patients were then identified, randomly ordered, and their clinical notes were manually reviewed. The review was done blinded by the annotation team: while annotating, they had no information about the overall rank score or the individual concept predictions for any patient. For each patient, the reviewer noted whether the patient had AADCd-compatible symptoms, whether these symptoms already had a definitive diagnosis expressed in the chart, and finally, if the first was true and the second false, whether the patient was an appropriate candidate for the next phase of AADCd screening. This last criterion was used as the outcome variable for the evaluation. For this work, a manual chart review of algorithm-identified patients by the epidemiologist annotator was taken as the endpoint of the study. Note that this manual screening of the high- and low-ranked patients sets a higher-utility criterion for the results obtained by the algorithm. The ML approach was not specifically trained to recognize non-AADCd diseases or conditions that could explain the symptoms. It was considered important to base this final evaluation on the end goal of the project, identifying potentially undiagnosed cases of AADCd; setting an evaluation criterion that includes consideration of this overall purpose therefore allows us to evaluate a lower bound on the true performance of the approach.
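The selection of patients for blinded review can be sketched as follows. This is an illustrative sketch, assuming rank values are held in a dictionary keyed by patient ID and that the cohort has at least 2n patients so the two groups cannot overlap; the function name and seed are hypothetical.

```python
import random

def select_for_review(rank_values, n=200, seed=0):
    """Pick the n highest- and n lowest-ranked patients and shuffle them
    together so reviewers cannot infer the group from the review order."""
    ranked = sorted(rank_values, key=rank_values.get, reverse=True)
    top, bottom = ranked[:n], ranked[-n:]
    review_order = top + bottom
    random.Random(seed).shuffle(review_order)
    return set(top), set(bottom), review_order

# Hypothetical rank values for 1000 patients.
rank_values = {f"p{i}": float(i) for i in range(1000)}
top, bottom, review_order = select_for_review(rank_values, n=200)
```

The group membership is kept aside and joined back to the reviewers' assessments only after the review is complete.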

Concept-level scoring on patients
The performance of the individual concept recognizers, combined with the reduction functions, evaluated on the individual patients in the validation set is shown in Table 3. The average precision obtained ranged from 0.11 for insomnia to 0.99 for epilepsy. The lift obtained (average precision divided by prevalence) was above 1.0 for all concepts, and often higher, demonstrating that there is some discriminative value for all concept classifiers, even the ones applied to relatively rare concepts such as oculogyric crisis and movement disorders. While sentence-level prediction performance can be low for some concepts, especially the rarest, the reduction process elevated the patient-level predictions to a more useful level of accuracy.
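Average precision and lift, as defined above, can be computed with a short sketch. The version below averages precision at each rank where a positive patient occurs and does not treat tied scores specially, so it may differ slightly from library implementations when ties are present; the example data are invented.

```python
def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive occurs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def lift(y_true, scores):
    """Average precision divided by the prevalence of positives."""
    prevalence = sum(y_true) / len(y_true)
    return average_precision(y_true, scores) / prevalence

# Hypothetical patient-level gold standard and predicted probabilities.
gold = [1, 0, 1, 0, 0, 0, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ap = average_precision(gold, predicted)
concept_lift = lift(gold, predicted)
```

Because a random ranking has an expected average precision close to the prevalence, a lift above 1.0 indicates ranking better than chance, which is why a low absolute value such as 0.11 can still be informative for a rare concept.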

Patient-level scoring and ranking
The results of applying the concept scoring and Poisson regression rank value calculation to the 8010 patients in our test group are shown as a histogram in Figure 4. The mean predicted rank value was 5.136 with a minimum score of 0.246 and a maximum score of 19.351. The 25%/50%/75% percentile boundaries were at 1.172, 3.962, and 8.066, respectively. The top 200 patients had rank scores of 17.260 or higher.
To investigate the ability of our concept classifiers to separate patients into meaningful categories, and in order to study the clinical profile of the patients ranked in the top 200, spectral clustering was performed on the 8010 patient concept scores, using the "SpectralClustering" class in scikit-learn. 42 Visual inspection of clusterings with 3 through 8 groups showed that 5 clusters gave the best group separation for the smallest number of clusters, with clusters of approximately the same size and no tiny clusters. Two-dimensional principal component analysis (PCA) of the individual concept scores is shown in Figure 5 with the clusters plotted as separate colors. Patients in the top 200 rank scores are plotted as x's in black. All other samples are plotted as circles in black, blue, red, green, or yellow. It is clear from the figure that all the top 200 ranked patients fall into the red cluster, which represents a low score in component 0 and a high score in component 1. This is an interesting validation of the proposed ranking method since the spectral clustering and PCA analysis did not include the Poisson regression rank value as a clustering feature, only the individual predicted concept probabilities.

Manual case review evaluation
As described above, 400 patient records were manually reviewed and assessed for appropriateness for the next phase of diagnostic screening. The results of the assessment are shown in Table 4. A total of 45 of the 200 top-ranked cases were characterized on manual review as requiring clinical screening, which corresponds to a positive predictive value of 22.5%. None of the bottom-ranked patients were characterized as appropriate for diagnostic screening. This difference is statistically significant by both Fisher's exact test and the chi-square test with Yates correction at P < .0001.
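The two significance tests can be reproduced from the reported counts (45 of 200 versus 0 of 200) with a standard-library sketch. The two-sided Fisher's exact test below sums the probabilities of all tables no more likely than the observed one, which is one common convention; the paper does not state which software or convention was used.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(k):
        # Unnormalized hypergeometric probability of k positives in row 1.
        return math.comb(col1, k) * math.comb(n - col1, row1 - k)

    k_min, k_max = max(0, col1 - (c + d)), min(col1, row1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return extreme / math.comb(n, row1)

def chi_square_yates(a, b, c, d):
    """Chi-square statistic with Yates continuity correction and its P value (1 df)."""
    n = a + b + c + d
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = numerator / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return statistic, math.erfc(math.sqrt(statistic / 2))

# Rows: top-ranked and bottom-ranked; columns: appropriate / not appropriate.
p_fisher = fisher_exact_two_sided(45, 155, 0, 200)
chi2_stat, p_chi2 = chi_square_yates(45, 155, 0, 200)
```

Both P values fall far below the .0001 threshold for these counts, consistent with the reported result.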

Discussion
Our results showed that the proposed method worked well to identify patients in this population for whom diagnostic testing for AADCd is potentially indicated. Almost 23% of the top-ranked cases were marked for review by a clinician disease expert and for potential diagnostic testing. None of the lowest-ranked patients were designated for clinician review. This difference was highly statistically significant (P < .0001).
Our algorithm could be enhanced with structured data as a way to filter out patients who already have an explanatory diagnosis for the symptoms of interest. It is possible to define a set of diagnostic codes that could be used to filter out patients who already have sufficient reasons for their symptoms. Expert judgment would be needed to decide which conditions are sufficiently explanatory on their own and which could co-occur with a disease such as AADCd. The method presented here is based on the OMOP data model; in particular, we made heavy use of the OMOP NOTE table as well as the PERSON table for demographic cohort selection. Therefore, this approach and implementation are not dependent upon any particular EHR system, and it should be reasonably straightforward to move our methods to other sites.
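The proposed structured-data filter might look like the sketch below, which uses an in-memory SQLite database with heavily simplified versions of the OMOP PERSON, NOTE, and CONDITION_OCCURRENCE tables. Everything specific in it is an assumption made for illustration: the toy rows, the birth-year cutoff, the function name, and the placeholder concept ID `900001`, which does not correspond to any real vocabulary entry. Choosing the real exclusion list would require the expert judgment described above.

```python
import sqlite3

# Simplified OMOP-style tables holding only the columns this sketch needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE note (note_id INTEGER PRIMARY KEY, person_id INTEGER, note_text TEXT);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
INSERT INTO person VALUES (1, 2015), (2, 2016), (3, 1980);
INSERT INTO note VALUES (10, 1, 'hypotonia noted'),
                        (11, 2, 'developmental delay'),
                        (12, 3, 'routine visit');
INSERT INTO condition_occurrence VALUES (2, 900001);
""")

# Hypothetical placeholder IDs for diagnoses judged sufficiently explanatory.
EXPLANATORY_CONCEPT_IDS = (900001,)

def candidate_person_ids(conn, min_birth_year, excluded_concepts):
    """Patients with notes, in the demographic cohort, lacking excluded codes."""
    placeholders = ",".join("?" for _ in excluded_concepts)
    sql = f"""
        SELECT DISTINCT p.person_id
        FROM person p
        JOIN note n ON n.person_id = p.person_id
        WHERE p.year_of_birth >= ?
          AND NOT EXISTS (
              SELECT 1 FROM condition_occurrence c
              WHERE c.person_id = p.person_id
                AND c.condition_concept_id IN ({placeholders})
          )
        ORDER BY p.person_id
    """
    rows = conn.execute(sql, (min_birth_year, *excluded_concepts)).fetchall()
    return [row[0] for row in rows]

candidates = candidate_person_ids(conn, 2000, EXPLANATORY_CONCEPT_IDS)
print(candidates)
```

In this toy data, patient 2 is dropped because of the explanatory code and patient 3 falls outside the assumed birth-year cohort, leaving only patient 1 for ranking.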
Our approach could be expanded and enhanced to work for other rare and semi-rare diseases, since it is focused on automatically identifying sets of relatively common symptoms and does not require training data consisting of records of patients with known rare diseases. The method should generalize to other conditions where a set of related symptoms and conditions can be defined, where these are usually mentioned in the notes and other free text of the medical record rather than in the coded data, and where cases are too rare in the EHR to allow direct training on diagnosed cases. This could also be a valuable approach for uncommon, not necessarily rare, diseases that have alternate or atypical presentations. For example, celiac disease, while commonly presenting with GI symptoms based on an immunologic response to gluten, can also present with non-GI symptoms in addition to, or in the absence of, GI symptoms. These atypical symptoms, which include headache, peripheral neuropathy, and cerebellar ataxia, may be present in the chart notes and are unlikely to be coded as structured data. 43,44 Furthermore, the approach could be expanded to address sets of rare diseases rather than one rare disease at a time. A larger set of symptoms and related conditions could cover a range of diseases, such as many inborn errors of neurotransmitter metabolism and not just AADCd. Future work will investigate what set of symptoms and conditions would provide good coverage for an expanded set of diseases, and how best to determine which symptoms and conditions would be most useful and reusable across a range of diseases.
Our methods used traditional supervised learning NLP/ML. One question is whether large language models (LLMs) such as GPT4 may have an impact on biomedical text research and future clinical NLP applications. 45 Both large commercial models (such as GPT4) and LLMs small enough to be run locally, such as Llama, 46 Alpaca, 47 and the Hugging Face (https://huggingface.co/) and GPT4All (https://gpt4all.io/index.html) collections of models, have recently become widely available. Running commercial models on clinical text may have some initial barriers, such as requiring a business associate agreement to ensure privacy, security protection, and Health Insurance Portability and Accountability Act (HIPAA) compliance, which would take some time to set up and govern. There are also issues with a lack of user control over changes in commercial models (eg, tuning on one version, with performance changing on the next, with no access to the prior version of the model).
The smaller, localizable LLMs can be run completely inside a healthcare institutional firewall, decreasing the HIPAA concerns. Given the large difference in available resources, it is unlikely that the localizable models will be able to keep up in performance with the ever-growing parameter size and training base of the commercial models. 48 However, local models can be tuned for individual institutional requirements. 3][54] Because local LLMs with good performance in clinical medicine have only recently become generally available (eg, GatorTron 55 ), we have not yet applied these models in our AADCd research, but plan to in the future. In theory, the LLMs could produce probabilistic estimates of patient symptoms on spans of clinical text. Some researchers do not think that LLMs can provide accurate confidence estimates of their output. 56 Breaking down a complex diagnosis into individual symptoms, as has been done here, is likely to produce better near-term results than asking a complex synthesis question such as "Should the patient be screened for AADCd?". 53 While these LLMs do not require annotated training data in the traditional sense, they do require detailed, and sometimes brittle, 57 prompt engineering to coerce the models to produce the correct output. Annotated data are also necessary to guide prompt engineering and to evaluate how well the models perform in comparison to traditionally trained methods. Prompt engineering in particular is a new, emerging field, and it will take some time before the best methods of creating prompts for specific tasks are established. 57 Since LLMs have been shown to have problems with "hallucination" and "confabulation," 58,59 evaluation of the accuracy of their probabilistic confidence estimates for extraction is also needed.
The work presented here has several limitations. First, all the data were sourced from a single health system, and the text in the notes may reflect documentation practices from that site. It is likely that the concept classifiers built here would decrease in performance on data from other sites and may improve with some site-specific training data. Going forward, it is important to replicate these findings on outside datasets. Currently, these are not available to us. The most widely available public clinical datasets, MIMIC II-IV, 60,61 are based primarily on critical-care patients and not on the pediatric neurology population focused on here. Collaborating with another healthcare institution with an OMOP-compliant RDW would be one feasible way to achieve this goal. Second, all the chart review was performed by a trained epidemiologist and not a rare disease expert. While review by clinical experts would be the next step, those resources were not available in this study. In future work, we intend to incorporate clinician expert manual chart review by a pediatric neurologist, and any systematic difference in manual review between the epidemiologist and neurologist will be analyzed. As none of the patients screened had a diagnosis of AADCd in their chart, determining an actual diagnosis of AADCd will require laboratory testing of patients after pediatric neurologist screening.

Conclusion
The work presented here has demonstrated a novel, feasible, generalizable approach for detecting potential undiagnosed cases of rare diseases in large-population EHR systems, applied to the specific rare disease of AADCd. Future work will extend the approach to a wider range of diseases, include structured EHR data for patient filtering, and follow up the current research with a detailed clinician review of selected patients. It is also our goal to collaborate with other institutions to apply our methods to additional populations that have an OMOP-based RDW.

Figure 1. Sentence to patient-ranking process. The overall process, starting with an identified cohort of potential patients, processing notes into sentences, scoring sentences, combining scores, and ranking patients, is shown in this flowchart.

Figure 2. Negative log loss (smaller values are better) ML performance of representative examples of combinations of classifiers and feature sets, evaluated by applying cross-validation on the training set. Abbreviations: NGRAM = uni- and bigram sentence features; BERT = ClinicalBert 768-dimension embedding vector; LOGISTIC_REGRESSION = logistic regression classifier in scikit-learn, default parameters; NN_MLP_1024 = neural network MLPClassifier in scikit-learn with hidden layer of size 1024; RANDOM_FOREST_200 = RandomForest classifier in scikit-learn with 200 estimators, other parameters at defaults; LINEAR_SVC_C_1.0 = support vector machine classifier, scikit-learn SVC implementation with C parameter set to 1.0.

Figure 3. Poisson regression model combining all individual concept predictors and fit to the actual count of the concept occurrences for each patient in the training + validation set. Plotted points are the predicted and actual occurrence counts for the testing dataset.
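A combination step of the kind summarized in Figure 3 can be sketched with scikit-learn's `PoissonRegressor`. This is an assumption-laden illustration rather than the study's implementation: the data are synthetic, the target is simulated as the number of concepts present per patient, and the choice of `PoissonRegressor` with light regularization is ours, since the exact model settings are not given in this section.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)

# Synthetic per-patient concept probabilities (10 concepts per patient).
n_fit, n_test, n_concepts = 400, 100, 10
scores = rng.uniform(0.0, 1.0, size=(n_fit + n_test, n_concepts))

# Simulated "actual" count of concepts present for each patient.
counts = rng.binomial(1, scores).sum(axis=1)

X_fit, y_fit = scores[:n_fit], counts[:n_fit]
X_test, y_test = scores[n_fit:], counts[n_fit:]

# Fit predicted concept scores to observed counts (settings are assumptions).
model = PoissonRegressor(alpha=1e-3, max_iter=300)
model.fit(X_fit, y_fit)

# The predicted expected count serves as the single patient rank value.
rank_value = model.predict(X_test)
ranking = np.argsort(-rank_value)  # highest rank value first
top_k = ranking[:10]
print("top-ranked test patients:", top_k)
```

Plotting `rank_value` against `y_test` would give a predicted-versus-actual scatter of the same kind as the figure.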

Figure 4. Predicted overall Poisson rank score distribution of the final algorithm for all 8010 patients in the blinded dataset. The mean predicted rank value was 5.136, with a minimum score of 0.246 and a maximum score of 19.351. The 25th, 50th, and 75th percentile boundaries were at 1.172, 3.962, and 8.066, respectively. The top 200 patients had rank scores of 17.260 or higher.

Figure 5. Spectral clustering analysis of ML concept predictions and correspondence with top 200 ranked patients in the blinded dataset. Separate clusters are shown as groups of similarly shaded and colored circles and plotted in 2 dimensions based on principal component analysis of the 10 concept prediction values for each patient. Patients ranked in the top 200 by the overall Poisson score are also shown as crosses. All the top-ranked patients fall in the upper left corner of the upper left-most (red) cluster.

Table 4. Manual review of the top 200 and bottom 200 patients ranked by the automated approach. Patients in the top 200 are much more likely to pass this screening and be forwarded for clinician review than those in the bottom 200. Fisher's exact: P < .0001; chi-square with Yates correction: P < .0001.

Table 1. Annotation counts for the training, validation, and testing datasets.

Table 2. Final optimized classifier parameters, down-sampling rate, and reduction method. Sentence-level performance on the validation + testing datasets is shown for AUC (area under the receiver operating characteristic curve) and AP (average precision).

Table 3. Results of applying concept-level classification and patient reduction functions to the individual AADC-related concepts on the validation + testing dataset, showing average precision and lift for each concept.
N_subjects is the number of patients in the validation dataset; n_positive is the number of patients with a positive sentence for that concept. Prevalence is n_positive/n_subjects. Abbreviations: AP = average precision; LIFT = average_precision/prevalence.
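The prevalence and lift definitions above can be made concrete with a small worked example. The labels and scores below are invented for illustration, and the hand-written `average_precision` helper is a simplified version that assumes no tied scores (library implementations such as scikit-learn's treat ties differently).

```python
def average_precision(y_true, y_score):
    """Mean of precision-at-rank over the ranks where a positive occurs."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Invented patient-level labels (1 = concept present) and concept scores.
y_true = [1, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]

n_subjects = len(y_true)
n_positive = sum(y_true)
prevalence = n_positive / n_subjects   # 2 / 8 = 0.25
ap = average_precision(y_true, y_score)  # (1/1 + 2/3) / 2
lift = ap / prevalence

print(f"prevalence={prevalence:.3f}  AP={ap:.3f}  LIFT={lift:.2f}")
```

Because a random ranking has an expected average precision close to the prevalence, a lift well above 1 indicates that the classifier concentrates positive patients near the top of the ranking.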