Sleep apnea phenotyping and relationship to disease in a large clinical biobank

Abstract Objective Sleep apnea is associated with a broad range of pathophysiology. While electronic health record (EHR) information has the potential for revealing relationships between sleep apnea and associated risk factors and outcomes, practical challenges hinder its use. Our objectives were to develop a sleep apnea phenotyping algorithm that improves the precision of EHR case/control information using natural language processing (NLP); identify novel associations between sleep apnea and comorbidities in a large clinical biobank; and investigate the relationship between polysomnography statistics and comorbid disease using NLP phenotyping. Materials and Methods We performed clinical chart reviews on 300 participants putatively diagnosed with sleep apnea and applied International Classification of Sleep Disorders criteria to classify true cases and noncases. We evaluated 2 NLP and diagnosis code-only methods for their abilities to maximize phenotyping precision. The lead algorithm was used to identify incident and cross-sectional associations between sleep apnea and common comorbidities using 4876 NLP-defined sleep apnea cases and 3× matched controls. Results The optimal NLP phenotyping strategy had improved model precision (≥0.943) compared to the use of one diagnosis code (≤0.733). Of the tested diseases, 170 disorders had significant incidence odds ratios (ORs) between cases and controls, 8 of which were confirmed using polysomnography (n = 4544), and 281 disorders had significant prevalence OR between sleep apnea cases versus controls, 41 of which were confirmed using polysomnography data. Discussion and Conclusion An NLP-informed algorithm can improve the accuracy of case-control sleep apnea ascertainment and thus improve the performance of phenome-wide, genetic, and other EHR analyses of a highly prevalent disorder.


Lay Summary
Sleep apnea is a common disease in which breathing partially or completely pauses during sleep, leading to less oxygen in the blood, repeated awakenings, and increased risk of developing multiple diseases. Current studies of sleep apnea often have relatively few participants due to the challenge of performing overnight sleep recordings. Electronic health record (EHR) billing code diagnoses of sleep apnea could be repurposed to increase the size of research studies, but the accuracy of the diagnoses is reduced. We developed a reusable algorithm that improves the accuracy of EHR sleep apnea diagnoses using natural language processing to extract information from clinical notes. As a proof of concept, we used the algorithm to identify hundreds of diseases that are increased among participants with sleep apnea compared to similar patients without sleep apnea. Many of these disease relationships with sleep apnea have not been previously recognized. This improved algorithm will help to accelerate future large-scale investigations of the causes and consequences of sleep apnea.
Large-scale epidemiological research in electronic health record (EHR) biobanks enables multiple opportunities to accelerate research. [19][20][21] Data collected as part of routine sleep clinic visits that would be financially and logistically challenging to collect prospectively may be efficiently repurposed for research questions, such as comprehensively identifying novel relationships with other diseases and improving the power of genetic analyses. [22][23][24][25][26] However, certain challenges must be addressed. Early EHR analyses used International Classification of Disease (ICD) codes for sleep apnea phenotyping. 27,28 ICD data are largely collected for clinical and billing purposes. ICD-derived disease diagnoses are often used when ruling out a given disease through a billed procedure such as a sleep examination whether or not the patient is found to have that condition. Modest diagnosis accuracy has been observed for several diseases, including a 33% positive predictive value for ICD-based rheumatoid arthritis phenotyping. 29 Natural language processing (NLP) in conjunction with medical chart reviews can effectively improve phenotyping accuracy. [29][30][31][32][33] Data from a sample of patients classified as true cases or true controls based on clinician validation (eg by manual record review using prespecified disease definitions) are extracted and linked to ICD-based diagnoses and other clinical information. Data from clinical notes within the EHR that improve classification accuracy in the validated set of charts are used to improve the diagnosis accuracy of other patients with ICD-based diagnoses. Extracting and processing free-text information is often addressed by using standardized vocabularies and medically oriented NLP tools. 34,35 A second improvement has been to group similar ICD diagnoses into broader clinical categories of ≈1800 "PheCodes" to provide larger numbers of cases representing the disease of interest. 36,37 Patients seen in an open healthcare system may seek care at institutions that are not part of the EHR system, and a third improvement is the use of a "data floor" with minimum healthcare utilization criteria to reduce misclassification of cases and controls due to incomplete EHR records. 31 A fourth improvement for sleep apnea may result from extracting key values from clinical notes and available polysomnographic (PSG) summary reports, such as case/control status based on a disease-defining threshold for laboratory diagnosis of sleep apnea: the apnea-hypopnea index (AHI).
Here, we report the development of a validated NLP-informed phenotyping algorithm for sleep apnea in the Mass General Brigham (MGB) Biobank, a resource with over 120 000 participants. 38,39 We compare the accuracy of this phenotyping algorithm to alternative models based on PheCodes and limited NLP, 40 which are useful when medical charts or expert clinician review are not available. We constructed improved NLP phenotyping for comorbid diseases, providing an opportunity to examine the relationships between sleep apnea, PSG statistics, and other diseases. As a proof of concept, we report associations between NLP-derived sleep apnea status and the prospective incidence and prevalence of multiple diseases. Many of these diseases have limited or no previously tested association with sleep apnea.

MATERIALS AND METHODS
Additional details are provided in the Supplemental Methods.

Study sample
Participants contributed EHR and sample data and provided written research consent to the MGB Biobank. 38,39 There were multiple analytical groups (Figure S1, Table 1). "Screen positive" sleep apnea cases were defined by ≥1 sleep apnea coded PheCode diagnoses (described below). We selected a random sample of 300 participants for detailed chart review in order to generate an algorithm to separate bona fide sleep apnea cases from false positive noncases (eg coded for billing purposes or otherwise erroneously). We selected 3× controls without PheCodes for sleep apnea or obstructive sleep apnea, matched to the NLP-defined sleep apnea cases based on age, sex, self-reported race/ethnicity, body mass index (BMI), and healthcare utilization using hospital encounters. 41 We also examined 4544 participants with available polysomnography records irrespective of a sleep apnea diagnosis. The first sleep apnea diagnosis date or the date of the PSG recording was used to calculate the age of a participant. Controls were matched on birthdates relative to cases. The age of the first sleep apnea diagnosis for a given case was used as the age of a matched control. BMI was extracted from structured tables and from unstructured clinical notes using regular expressions. The 2 BMI measurements closest in time to the participant's defined age were averaged together to calculate the participant's defined BMI.
We employed a "data floor" to reduce the number of participants with minimal documentation and hence the likelihood of false negative associations in our open network healthcare setting. 20 The sample was restricted to those with at least 2 clinical notes, 2 encounters associated with ICD diagnoses, and 3 separate PheCode diagnoses for any disease.
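The data floor described above can be sketched as a simple per-participant filter. This is a minimal illustration, not the study's implementation: the `UtilizationSummary` class and its field names are hypothetical, and the reading of "3 separate PheCode diagnoses" as a plain count is an assumption.

```python
from dataclasses import dataclass


@dataclass
class UtilizationSummary:
    """Per-participant EHR utilization counts (hypothetical field names)."""
    n_clinical_notes: int       # number of clinical notes on record
    n_icd_encounters: int       # encounters associated with an ICD diagnosis
    n_phecode_diagnoses: int    # separate PheCode diagnoses for any disease


def passes_data_floor(p: UtilizationSummary) -> bool:
    """Apply the thresholds stated in the text: >=2 notes, >=2 ICD encounters,
    and >=3 PheCode diagnoses. How the counts are derived is an assumption."""
    return (
        p.n_clinical_notes >= 2
        and p.n_icd_encounters >= 2
        and p.n_phecode_diagnoses >= 3
    )


if __name__ == "__main__":
    sparse = UtilizationSummary(n_clinical_notes=1, n_icd_encounters=5, n_phecode_diagnoses=5)
    adequate = UtilizationSummary(n_clinical_notes=2, n_icd_encounters=2, n_phecode_diagnoses=3)
    print(passes_data_floor(sparse), passes_data_floor(adequate))
```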

Clinician chart reviews
We performed clinical chart reviews among the 300 ICD-screen positive participants, in order to create a gold standard set of sleep apnea cases and noncases for algorithm development. Sleep apnea case/noncase (ie false positive) classifications were adjudicated by 2 sleep clinicians (SR and SMH) and informed by ICSD-3 guidelines ( Figure 1). 42 Chart data from 97 screen negative participants were also used to assess predictive value of the negative screen. Sleep apnea case classification categories are marked in green in Figure 1, while noncase classification categories are marked in red. This approach outperformed 2 exploratory sleep apnea disease definition models that assigned participants with central sleep apnea (CSA) or all non-"moderate sleep apnea" classifications as noncases (data not shown). From the 300 chart review set, 180 (60%) of these results were used in the training set, and 120 (40%) were used in the validation set for PheCAP and other methods.

Natural language processing
We extracted NLP terms that mapped to Concept Unique Identifiers (CUIs) from the Unified Medical Language System using cTAKES, and counted the instances of each nonnegated CUI term per note. 34,35 We used 2 NLP-based algorithm development approaches.
(1) PheCAP distinguishes true cases from noncases (ie negative in chart reviews despite one or more ICD codes) based on the presence of common terms extracted from the literature. 31 (2) Multimodal Automated Phenotyping (MAP) omits chart reviews and supplements ICD codes with the count of their exact matches located within clinical notes (eg count of "obstructive sleep apnea" phrases). 40 Sleep apnea candidate CUI terms were obtained using the surrogate-assisted feature extraction (SAFE) method from 7 internet-derived disease review resources (Table S1) 33,43 in order to select NLP concepts commonly recognized with sleep apnea and therefore more likely to generalize to other populations. We also constructed a composite term based on the cumulative count of 6 sleep apnea-related procedures and 2 NLP terms described in Table S2 that we term the "Joint CPAP CUI/Procedure Term."

PheCAP phenotype classification
We used PheCAP to test algorithms to classify sleep apnea case/noncase status in the chart review training and validation sets. 31 We tested 20 separate models to identify the optimal PheCAP settings (Table S3). PheCAP allows for flexible surrogate "silver standards" of a phenotype that aid in classification. We tested multiple surrogate combinations of S ICD (the number of phenotype PheCode diagnoses of a given patient), S NLP (the cumulative number of NLP CUI disease terms (eg "sleep apnea") seen across clinical notes for a given patient), and S ICDNLP (a combined count of the 2 terms). We tested the inclusion and exclusion of demographic (age, sex, and self-reported race/ethnicity) plus BMI, and PheCAP NLP terms. We further tested the final optimized PheCAP model to ask whether forcing case status for participants with diagnostic polysomnography criteria for sleep apnea (AHI ≥ 15) 42 and/or the joint continuous positive airway pressure ventilation (CPAP) CUI/procedure term from clinical notes and/or from PSG reports would improve overall model performance (n = 13 cases, 7 noncases with measures available). The overall level of healthcare utilization has been shown to bias NLP analyses. 31 We therefore adjusted for the number of encounters with an ICD code for each participant in each PheCAP algorithm model.
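The surrogate counts and the "forced case status" rule can be illustrated with a short sketch. The function names are hypothetical, treating the combined surrogate as a simple sum is an assumption, and so is treating any nonzero joint CPAP CUI/procedure count as sufficient evidence; only the AHI ≥ 15 threshold comes from the text.

```python
from typing import Optional

AHI_CASE_THRESHOLD = 15.0  # events per hour; the diagnostic threshold cited in the text


def s_icdnlp(n_phecode_dx: int, n_nlp_mentions: int) -> int:
    """Combined silver-standard surrogate, assumed here to be the sum of the
    PheCode diagnosis count (S ICD) and the NLP mention count (S NLP)."""
    return n_phecode_dx + n_nlp_mentions


def final_case_status(
    model_says_case: bool,
    ahi: Optional[float] = None,
    joint_cpap_term_count: int = 0,
) -> bool:
    """Override the model's classification when separately extracted evidence exists.

    Assumption: one or more joint CPAP CUI/procedure occurrences is enough to
    force case status; the source does not state the exact cutoff.
    """
    if ahi is not None and ahi >= AHI_CASE_THRESHOLD:
        return True
    if joint_cpap_term_count > 0:
        return True
    return model_says_case


if __name__ == "__main__":
    print(s_icdnlp(3, 12))
    print(final_case_status(False, ahi=22.4))
    print(final_case_status(False, ahi=4.0))
```

Participants without a polysomnography value (`ahi=None`) and without the CPAP term simply keep the model's classification in this sketch.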

Statistical analyses
Note: "Screen positive" had one or more PheCode diagnoses for sleep apnea (327.3) or obstructive sleep apnea (327.32). The 300 participants in "Chart Review Set" were obtained from the Screen Positive Group and used to perform PheCAP phenotyping. "PheCAP Cases" were classified by lead PheCAP algorithm (PheCAP S ICDNLP and NLP CUIs in Table 2). Age and BMI are presented as medians (interquartile range). All other fields, apart from sample size, are presented as total size (percentage). Age and BMI data were based on the first sleep apnea diagnosis date for PheCode cases, the last available visit date for PheCode controls, and the first available polysomnographic recording for the polysomnography sample.
Our primary measures of algorithm performance for PheCAP models, compared with PheCAP definitions, and the MAP model were the area under the receiver operator characteristic curve (AUC) and precision (Table 2). Five additional statistics are provided in Table S3.
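For readers less familiar with the two primary performance measures, the following minimal sketch computes precision (positive predictive value) and AUC against chart-review labels. It is illustrative only: the function names are hypothetical, and the pairwise AUC shown here is the standard Mann-Whitney formulation rather than the specific software used in the study.

```python
from typing import Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted cases that are true cases on chart review."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        raise ValueError("no predicted cases; precision is undefined")
    return tp / (tp + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random true case scores higher than a random noncase
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one noncase")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]                 # toy chart-review labels
    predicted = [1, 1, 1, 0]              # toy binary calls
    model_scores = [0.9, 0.4, 0.5, 0.1]   # toy model scores
    print(precision(labels, predicted), auc(labels, model_scores))
```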
Chi-square analyses examined the prevalence and incidence of comorbid PheCodes in PheCAP cases based on the best performing PheCAP algorithm compared to matched controls. We considered 527 PheCodes with a minimum MGB Biobank case prevalence of 1%. An incident diagnosis was defined as the first diagnosis for a comorbidity occurring at least one year after the first sleep apnea diagnosis. Participants with a prior diagnosis were excluded. Analyses considered combined sex and sex-stratified strata. Bonferroni corrections adjusted for the combined count of overall, female, and male analyses.
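The quantities in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the Wald confidence interval is a common choice that the source does not confirm, the function names are hypothetical, "one year" is taken as 365 days, participants diagnosed before that mark are treated as excluded, and the Bonferroni example assumes all 527 PheCodes were tested in each of the three strata.

```python
import math
from datetime import date
from typing import Optional, Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio with a Wald 95% CI from a 2x2 table.

    a = cases with comorbidity, b = cases without,
    c = controls with comorbidity, d = controls without.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def incident_status(first_apnea_dx: date, first_comorbidity_dx: Optional[date]) -> Optional[bool]:
    """True = incident diagnosis, False = never diagnosed, None = excluded.

    Assumption: a comorbidity first diagnosed earlier than 365 days after the
    first sleep apnea diagnosis counts as a prior diagnosis and is excluded.
    """
    if first_comorbidity_dx is None:
        return False
    if (first_comorbidity_dx - first_apnea_dx).days >= 365:
        return True
    return None


if __name__ == "__main__":
    print(odds_ratio_ci(20, 80, 10, 90))        # toy counts, not study data
    print(bonferroni_threshold(0.05, 527 * 3))  # assumed number of tests
    print(incident_status(date(2010, 1, 1), date(2011, 6, 1)))
```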
Logistic regression was used to analyze potential associations between PSG statistics and cross-sectional or incident comorbidities that were significantly associated with PheCAP sleep apnea status by adjusting for age and BMI at the time of the first available PSG recording, sex, and self-reported race/ethnicity. Phenotypes were then rank-normalized to account for any nonnormality in these residual values. We analyzed 2 PSG summary statistics: the AHI using 3% criteria and the percentage of the sleep episode with oxyhemoglobin saturation <88% (Per88). Tests were performed for PheCodes that were significantly associated with sleep apnea PheCAP status in combined-sex analyses. Bonferroni adjustments considered the combined count of AHI and Per88 calculations.
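Rank normalization of a continuous trait can be sketched as a rank-based inverse normal transform. This is a minimal illustration: the `(rank - 0.5) / n` offset and the average-rank handling of ties are assumptions, since the source does not specify which variant was used, and the function name is hypothetical.

```python
from statistics import NormalDist
from typing import List, Sequence


def rank_normalize(values: Sequence[float]) -> List[float]:
    """Map values to standard normal quantiles based on their ranks.

    Ties receive the average of their ranks. The (rank - 0.5) / n offset is
    one common convention and is an assumption here.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    # Toy skewed values, not study data.
    print(rank_normalize([2.0, 55.0, 7.5, 7.5, 120.0]))
```

The transform preserves the ordering of the input while forcing an approximately normal marginal distribution, which is why it is often applied to skewed residuals.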

Sample characteristics
Sample characteristics are listed in Table 1. From the initial sample of 115 124 participants, 108 597 participants were retained after removing children or those without suitable criteria for the data floor. The final sample size was 100 616 after removing participants with unknown age, sex, and/or BMI values. Within this sample, 15 741 participants had ≥1 PheCode diagnoses for sleep apnea or obstructive sleep apnea, yielding 15.6% prevalence. Data from 397 randomly selected participants were used for chart review, including data from 300 participants with ≥1 sleep apnea PheCode diagnoses (and used in the algorithm validations) and data from 97 sleep apnea PheCode controls (to query for false negative PheCode diagnoses). From this sample, data from 180 participants with adjudicated case/control status (60% of those with a positive sleep apnea PheCode diagnosis) were used in training and data from 120 participants with adjudicated case/control status were used in validation. Three of the 97 participants without an ICD diagnosis for sleep apnea were determined to have sleep apnea based on chart reviews.

PheCAP algorithm construction and performance
The 7 articles used for SAFE yielded 1072 nonnegated NLP concepts (CUI terms 34) that were seen in at least one article (eg "PSG (Polysomnography) [Diagnostic Procedure]"). A total of 130 terms were present in a majority of the articles and in at least one clinical note of ≥5% of participants with a sleep apnea PheCode diagnosis and were retained for analysis.
We tested 20 alternative PheCAP models using the 130 CUI terms and demographic and BMI data to identify the optimal tunable parameters (Tables 2 and S3). We present representative algorithms from PheCodes, PheCAP, and MAP in Table 2 based on chart review classification of cases/noncases using ICSD-3 guidelines and including sleep apnea and physician notes supported by prescribed therapy (Figure 1). The lead PheCAP model with the maximum AUC values ("PheCAP S ICDNLP and NLP CUIs" in Table 2) was based on cases and noncases classified as in the Figure 1 guidelines, combined counts of PheCode codings and equivalent PheCode NLP phrases (the S ICDNLP surrogate model), and additional NLP terms. Better AUC performance was observed when demographic and BMI data were excluded. Nevertheless, the average age and BMI and the percentage of males were all higher in the final ascertained PheCAP case sample compared to the PheCode-only screen positive group (Table 1). Final beta coefficients for NLP terms in the tested PheCAP models are provided in Table S4. The lead model included nonzero coefficients for the intercept, the number of clinical encounters, CUI C0199451 (CPAP, initiation and management), and the combined S ICDNLP silver standard surrogate term of sleep apnea PheCode counts, C0037315 (sleep apnea), and C05200679 (obstructive sleep apnea syndrome). The PheCAP algorithm is designed to optimize precision. The lead PheCAP model had improved precision in chart reviews of participants with at least one PheCode-based diagnosis coding date compared to PheCode-only counts (≥0.943 vs 0.733; Table 2). Modest predictive improvements were observed when forcing PheCAP controls with an observed AHI ≥ 15 and/or an observed joint CPAP CUI/procedure term to be PheCAP cases (precision ≥ 0.951).

Associations between sleep apnea and comorbidities
We used the lead PheCAP model (ie PheCAP S ICDNLP and NLP CUIs without the AHI or the joint CPAP CUI/procedure term to increase the generalizability of our findings) to define sleep apnea cases, inform the selection of 3× matched controls, and test the prevalence and incidence of comorbidities. We reused the nonsleep apnea NLP terms generated as a by-product of sleep apnea PheCAP phenotyping to generate NLP-informed case/control phenotyping using MAP and data from all of the MGB Biobank participants with a minimum data floor. We then tested the incidence of new comorbidities, defined by considering a first comorbidity diagnosis that occurred at least a year after the first sleep apnea diagnosis (Tables 3 and S5). Out of 527 tested PheCodes, 170 PheCodes had significant odds ratios (ORs) in combined-sex and/or sex-stratified analyses following Bonferroni correction. Hypersomnia and restless legs syndrome (RLS) had the highest odds ratio point estimates, likely due to participants being followed in sleep clinics. Lead disease associations reflected a range of pathobiology, including hypertensive heart disease, hypoglycemia, dysthymic disorder, and dementias. Diseases with significantly reduced incidence odds ratios included secondary malignancy of bone and non-Hodgkin's lymphoma. In sex-stratified analyses (Table S5), 76 comorbidities had significant odds ratios considering women with and without sleep apnea, while 111 comorbidities had significant odds ratios considering men with and without sleep apnea. While many disorders had relatively similar odds ratio estimates in both analyses, several disorders had higher odds ratio point estimates and/or nonoverlapping odds ratio confidence interval estimates among participants with sleep apnea in one sex versus the other sex. PheCodes with the largest incidence odds ratio differences between women and men for nonsleep disorders are provided in Figure S2.
Notably, the chronic pulmonary heart disease odds ratio was higher in women (OR 4.17, 95% CI, 2.84-6.14) compared to men (OR 1.80, 95% CI, 1.32-2.44; P for sex interaction = 7.15 × 10⁻⁴). Gout also had higher odds ratio estimates in women (OR 3.27, 95% CI, 2.34-4.56) compared to men (OR 1.36, 95% CI, 1.13-1.63; P for sex interaction = 6.61 × 10⁻⁶). Obesity had a higher odds ratio estimate in men (OR 3.05, 95% CI, 2.63-3.53) compared to women (OR 1.50, 95% CI, 1.25-1.81; P for sex interaction = 1.24 × 10⁻⁴).
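One common way to compare two stratum-specific odds ratios is a z-test on the difference of their log odds ratios, with standard errors recovered from the reported confidence intervals. The sketch below illustrates that approach only; the source does not state how the sex interaction P values were computed, so this approximation need not reproduce the reported values exactly, and the function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def se_from_ci(lower: float, upper: float, z: float = 1.96) -> float:
    """Standard error of log(OR) recovered from a 95% confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def interaction_p_value(
    or_a: float, ci_a: Tuple[float, float],
    or_b: float, ci_b: Tuple[float, float],
) -> float:
    """Two-sided P value for a difference between two independent log odds ratios."""
    diff = math.log(or_a) - math.log(or_b)
    se = math.sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z_stat = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


if __name__ == "__main__":
    # Chronic pulmonary heart disease estimates quoted in the text (women vs men).
    p = interaction_p_value(4.17, (2.84, 6.14), 1.80, (1.32, 2.44))
    print(f"approximate P for sex interaction: {p:.2e}")
```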
Note: A total of 300 chart reviews were performed for participants with one or more sleep apnea PheCode codings. Therefore, certain PheCode-only rows lack negative predictive values by definition. Of the 300 chart reviews, 180 (60%) of these results were used in the training set, and 120 (40%) were used in the validation set. Results for the best performing PheCAP model are shown as "PheCAP S ICDNLP and NLP CUIs," along with chart review performance for PheCode-only definitions using a minimum of 1 and 2 PheCode instances to define a case and a more basic NLP algorithm using MAP. The performance of PheCAP surrogate-only models is shown next ("PheCAP S ICD," "PheCAP S NLP," "PheCAP S ICDNLP") and is followed by the predictive performance using demographic parameters exclusively. Reduced performance was observed when including demographics and the lead PheCAP model ("PheCAP S ICDNLP, Demographics, and NLP CUIs"). Additional modest performance gains were obtained by forcing case status for participants with separately extracted AHI and/or continuous positive airway pressure (joint CPAP CUI/procedure term) evidence. Full results for all models are presented in Table S5.
We calculated the cross-sectional prevalence of 527 PheCode diagnoses among sleep apnea PheCAP cases and matched controls. Of these, 281 nonredundant PheCodes had significant odds ratios in combined-sex and/or sex-stratified analyses after Bonferroni adjustment (Tables 4 and S6). Morbid obesity had the highest odds ratio point estimate for any nonsleep disorder, followed by heart failure with preserved ejection fraction. The most significantly enriched PheCodes in cross-sectional analyses included cardiac, pulmonary, and multiple mental health and mood disorders. Secondary malignancy of bone was significantly associated and had a reduced prevalence among sleep apnea cases (OR 0.37, 95% CI, 0.26-0.54).
In sex-stratified analyses (Table S6), 174 disorders had significant ORs considering women with and without sleep apnea; 219 disorders had significant odds ratios considering men with and without sleep apnea. PheCodes with the largest absolute OR point estimate differences between women and men for nonsleep disorders are shown in Figure S3. Three heart conditions had higher, nonoverlapping OR estimates in women compared to men: chronic pulmonary heart disease, congestive heart failure not otherwise specified, and heart failure not otherwise specified (P for sex interaction 4.83 × 10⁻³).

Validation of associated comorbidities using polysomnography
Finally, we performed similar incident and cross-sectional comorbidity analyses in 4544 participants with available polysomnography. We tested the AHI using 3% desaturation criteria and the percentage of the sleep recording with oxyhemoglobin saturation under 88% (Per88). Eight largely cardiopulmonary and circulatory diseases were significantly associated with PSG measures in analyses of incident cases, including hypertensive heart disease (P = 7.62 × 10⁻⁹; Table S7). Forty-one diseases had significant cross-sectional associations after Bonferroni adjustment (Table S8). Several cardiopulmonary diseases were associated, including asphyxia and hypoxemia (P = 2.10 × 10⁻⁴¹) and chronic pulmonary heart disease (P = 1.99 × 10⁻²⁰). The lowest P values for 37 of these PheCodes were observed when analyzing Per88 (Figure S4). Of the 17 diseases that were highly associated with Per88 (P < 1 × 10⁻¹⁰), 10 diseases were not nominally associated with AHI (P > .05).

DISCUSSION
In this study, we constructed an improved sleep apnea phenotyping algorithm that addresses the limitations of ICD codings within the EHR by using NLP and controlling for healthcare utilization to improve precision. This algorithm considered CPAP usage and can be applied to important analyses examining the causes and consequences of sleep apnea. We applied this algorithm as a proof of principle in a phenome-wide analysis that identified multiple disorders with elevated incidence and prevalence in patients with sleep apnea compared to matched controls. The phenotyping of the nonsleep disorders was also improved using NLP, and to our knowledge, most disorders have never previously been examined in the context of sleep apnea. The association between sleep apnea and the incidence and/or prevalence of several of these disorders was confirmed using polysomnography, despite a modest sample size and single point-in-time polysomnography data.
Note: An incident diagnosis was defined as the first diagnosis for a potential comorbidity occurring at least one year after the first diagnosis date for sleep apnea. Otherwise, participants with prior diagnoses were excluded. Sample sizes will therefore vary by PheCode. In total, 527 PheCodes with ≥1% overall prevalence were tested. Controls were matched for age, sex, BMI, population, and healthcare utilization. It was found that 170 nonredundant PheCodes were significantly associated following Bonferroni correction. Lead results are shown here. Complete results, including sex-stratified results and sample sizes, can be found in Table S5.
The PheCAP algorithm was designed to optimize phenotype precision (Tables 2 and S3), which is particularly useful for genetic analyses and prioritizing the selection of true cases with high certainty. The precision of the validation sample improved from 0.733 when using a single diagnosis date (≥1 PheCodes) to 0.943 when applying the PheCAP algorithm. CPAP is the most frequent medical treatment for sleep apnea and, with few exceptions, is used almost exclusively in outpatient settings for treating sleep apnea. The CPAP usage NLP term identified by PheCAP would likely generalize to other healthcare systems. Inclusion of AHI from polysomnography and CPT codes or other structured data signifying the use of CPAP improves phenotyping precision slightly (0.943-0.951). Additional improvements gained using the PheCAP procedure include the use of a "data floor" to exclude participants with sparse EHR documentation and adjustment for healthcare utilization to control for biases, 41 which were likely to have improved the precision of the ≥1 PheCodes and all other algorithms. Putative cases can be restricted to those with multiple ICD diagnosis coding dates to improve precision in situations where access to text and/or procedural data is impossible. We systematically examined the potential relationships between sleep apnea cases, matched controls, and comorbid diseases by leveraging improvements in the diagnostic accuracy of comorbidities using NLP. 20,30,40 The majority of tested diseases (170 incident PheCodes and 281 cross-sectional PheCodes out of 527 tested PheCodes) had significantly different incidence and/or prevalence rates between sleep apnea cases and controls following Bonferroni corrections (Tables S5 and S6). Given the known associations of sleep apnea with multiple metabolic, cardiovascular, and neurocognitive morbidities, 1 this is not surprising. These data highlight the role of sleep apnea as a risk factor for a broad range of diseases. Unexpectedly, patients with sleep apnea were at lower risk for incident diagnoses for non-Hodgkin's lymphoma and secondary malignancy of bone, with similar directionality in the cross-sectional results.
We will attempt to replicate these results in future studies as these could be due to practice patterns in our system. Further work is needed to understand the pathophysiological mechanisms between sleep apnea and these diseases, the relative contributions of sleep apnea compared to competing risk factors for these diseases, and whether certain sleep apnea subtypes and groups of comorbidities have potential statistical relationships, which may aid in improved patient risk stratification and more personalized treatment strategies.
Personalized treatment may involve different gender-specific strategies. A number of comorbid diseases had odds ratio estimates that diverged in sex-stratified analyses (Tables S5 and S6, Figures S1 and S2). There are well-described gender differences in the physiology of sleep apnea, with men generally having more hypoxemia and women having more arousals; 44 these factors may influence propensity for future diseases. A portion of the differential odds ratios between men and women for specific diseases may be due to differences in sleep apnea subtypes. [44][45][46] Notable PheCodes that have higher odds ratios of incidence in women include chronic pulmonary heart disease, gout, and congestive heart failure not otherwise specified.
Multiple sleep disorders are often observed in the same patients. Other sleep disorders, including RLS, had higher odds ratios in cross-sectional analyses (Table 4). The RLS association may be due in part to an increased awareness of sleep clinicians who may screen for other sleep disorders when examining patients suspected of having sleep apnea. RLS prevalence is increased among patients with sleep apnea versus controls, and RLS symptoms are reduced after treatment for sleep apnea. 47,48 We could not completely disentangle the effects of central versus obstructive sleep apnea, as 90% of the participants originally diagnosed with CSA were also diagnosed with obstructive sleep apnea. "Cardiac defibrillator in situ" and "delirium due to conditions classified elsewhere" were significantly associated in a sensitivity analysis considering patients with a CSA diagnosis versus matched controls (P 8.40 × 10⁻⁴). The nonoverlapping odds ratio estimates were higher in the CSA diagnosis group compared to the remainder of the sample without a prior CSA diagnosis. Future work is needed to determine whether these odds ratio differences are due to CSA-specific effects, and whether comorbid sleep disorders have additive effects that may contribute to an increased prevalence and/or incidence of comorbid disease.
Most of the lead associations between polysomnographic traits and comorbidities were based on a measure of low overnight oxygen saturation during sleep (Per88), in contrast to the more commonly used AHI (Tables S9 and S10, Figure S4). This is consistent with prior single disease reports 49-51 but has not been systematically evaluated to our knowledge. Hypoxemia measures have been the basis of our most significant genetic associations with sleep-disordered breathing to date. 24 Ten of the 17 diseases that were highly associated with Per88 (P < 1 × 10⁻¹⁰) in cross-sectional analyses were not nominally associated with AHI (P > .05; Table S8, Figure S4), indicating that a readily available PSG summary measure is more significantly associated with dozens of comorbidities compared to the AHI. Additional associations may be observed in the future using more specific measures such as the hypoxic burden. 52 The AHI (a count of the number of breathing pauses per hour of sleep) is increasingly recognized as a heterogeneous marker, resulting in a wide variety of stresses due to differences in durations and severity of individual breathing pauses that comprise the AHI. 46 Increased AHI was associated with reduced likelihood of cross-sectionally ascertained bariatric surgery, essential hypertension, migraine and, notably, insomnia. The latter association may reflect the common occurrence of comorbid insomnia with sleep apnea 53 or the increased likelihood of sleep disorder recognition once a patient is referred to a sleep specialist. The strength of a disease's association with measures of disrupted sleep versus hypoxemia may provide insights into potential pathophysiological connections for future study.

Strengths and weaknesses
Strengths of this study include applying advanced NLP methods to large-scale sleep phenotyping for the first time, to our knowledge. Careful consideration of comorbidity phenotyping and adjustment for healthcare utilization 41 increases our confidence in the association of sleep apnea with the increased prevalence of hundreds of disorders, using a phenome-wide approach. We validated these associations with several disorders using polysomnography. Measures of hypoxemia may be more sensitive to the risk of certain disorders compared to the AHI.
While our algorithm may conceivably not generalize to other environments, similar portable algorithms have been demonstrated for other phenotypes. 29,54 Moreover, the SAFE algorithm was designed to extract common concepts from background literature, 33 reducing the risk of overfitting. The CPAP term that remained predictive following cross-validated LASSO regression represents a first-line therapy used in clinical sleep laboratories. We will attempt to replicate and extend our findings in other diverse biobanks in future studies.

CONCLUSION
We developed an advanced sleep apnea clinical phenotyping algorithm that was able to increase the precision of EHR data by leveraging NLP and identified several novel cross-sectional and incident associations between sleep apnea and other diseases. Despite their challenges, large-scale EHR analyses have provided important insights into the biology of disease. 55,56 EHR analyses of sleep apnea will be an attractive, pragmatic pathway for advancing our understanding of this important disorder at an unprecedented scale.