Assessment of obstructive sleep apnea-related sleep fragmentation utilizing deep learning-based sleep staging from photoplethysmography

Abstract Study Objectives To assess the relationship between obstructive sleep apnea (OSA) severity and sleep fragmentation, accurate differentiation between sleep and wakefulness is needed. Sleep staging is usually performed manually using electroencephalography (EEG). This is time-consuming due to the complexity of the EEG setup and the amount of manual work involved in scoring. In this study, we aimed to develop an automated deep learning-based solution to assess OSA-related sleep fragmentation based on the photoplethysmography (PPG) signal. Methods A combination of convolutional and recurrent neural networks was used for PPG-based sleep staging. The models were trained using two large clinical datasets from Israel (n = 2149) and Australia (n = 877) and tested separately on three-class (wake/NREM/REM), four-class (wake/N1 + N2/N3/REM), and five-class (wake/N1/N2/N3/REM) classification. The relationship between OSA severity categories and sleep fragmentation was assessed using survival analysis of mean continuous sleep. Overlapping PPG epochs were applied to artificially obtain denser hypnograms for better identification of fragmented sleep. Results Automatic PPG-based sleep staging achieved an accuracy of 83.3% on three-class, 74.1% on four-class, and 68.7% on five-class models. The hazard ratios for decreased mean continuous sleep compared to the non-OSA group obtained with Cox proportional hazards models with 5-s epoch-to-epoch intervals were 1.70, 3.30, and 8.11 for mild, moderate, and severe OSA, respectively. With EEG-based hypnograms scored manually with conventional 30-s epoch-to-epoch intervals, the corresponding hazard ratios were 1.18, 1.78, and 2.90. Conclusions PPG-based automatic sleep staging can be used to differentiate between OSA severity categories based on sleep continuity. The differences between the OSA severity categories become more apparent when a shorter epoch-to-epoch interval is used.


Introduction
Obstructive sleep apnea (OSA) is a sleep disorder characterized by recurrent complete or partial breathing obstructions which heavily affect the sleep architecture [1]. Over 900 million people worldwide are estimated to suffer from OSA [2]. One consequence of OSA is sleep fragmentation due to the arousals induced by the breathing obstructions. Sleep fragmentation is associated with various OSA-related symptoms, including daytime sleepiness and decreased psychomotor vigilance [1]. One proposed method to assess the relationship between OSA and fragmented sleep is to perform survival analysis on the duration of continuous sleep of subjects grouped by OSA severity category [3].
In clinical practice, sleep stages are usually scored manually by visual inspection using the signals recorded during a polysomnography (PSG), including the electroencephalogram (EEG), the electrooculogram (EOG), and the electromyogram (EMG) [4]. Manual scoring of sleep stages is a time-consuming task performed by trained professionals. However, even with years of experience on the task, two scorers are prone to score some of the sleep stages differently [5][6][7][8]. The interrater reliability is especially low for N1, with reported agreement as low as 63% [8]. Cohen's κ value is a widely used metric for interrater agreement [9]. In one study, PSGs of 72 subjects (56 healthy controls, 16 patients with different sleep disorders) from three different hospitals were scored according to the American Academy of Sleep Medicine (AASM) 2007 rules by two independent scorers. A total of seven scorers participated in the study. Overall agreement for five-stage scoring was κ = 0.76, greatly varying between different sleep stages from moderate agreement (κ = 0.46) on N1 sleep to almost perfect agreement (κ = 0.91) on rapid eye movement (REM) sleep [5].
Since manual sleep scoring is laborious, automated solutions for sleep staging have been developed by utilizing a myriad of different approaches [10]. Combinations of nonlinear features such as wavelet transforms, entropy, spectral features, and autoregression coefficients have been used with classifiers such as k-nearest neighbors and random forests [11][12][13][14][15]. In classification to five stages (wake/N1/N2/N3/REM), the reported accuracies of these methods, which involve handcrafted feature engineering and primarily use EEG, have varied from 75% to 83%. Other examples of methods used for automated sleep staging include classification of cardiac features calculated from an ECG recording [16,17], and classification of features derived from bed sensors measuring heart rate and movements [18]. More recently, deep learning-based methods have been utilized for automated EEG-based sleep staging with very good results with reported accuracies varying from 84% to 87% and Cohen's κ values between manual and automated sleep staging varying from 0.77 to 0.82 [19][20][21][22].
While the characterization of sleep stages is often focused on the brain, sleep also affects the autonomous nervous system (ANS) activity [23,24]. The sympathetic nervous system (SNS) activity is decreased in non-REM (NREM) sleep, and there are phasic bursts of SNS activity in REM sleep [24]. Due to the changes in ANS activity, there are also hemodynamic changes during sleep [24,25]. During NREM sleep, mean arterial pressure and cardiac output are reduced. In contrast, during REM sleep the arterial pressure and heart rate are increased [25]. These changes are reflected in the photoplethysmogram (PPG) which measures the changes in blood volume in the microvascular tissue [26]. Thus, PPG can be used to differentiate between wake, NREM sleep, and REM sleep [19].
PPG measurements are simple to set up compared to standard EEG measurements, which makes PPG-based automated sleep staging an interesting alternative to EEG-based staging. A few studies have used PPG for three-stage classification (wake/NREM/REM) using handcrafted features [27,28]. These methods have achieved accuracies varying from 73% to 75%, and Cohen's κ values varying from 0.53 to 0.55. Recently, our research group introduced a sleep staging approach utilizing deep learning with raw PPG as the input. This method achieved an accuracy of 80.1% and Cohen's κ of 0.65 in three-stage classification [19]. However, the wakefulness classification accuracy of this method was 72%, and increasing this accuracy would be highly beneficial for the assessment of sleep fragmentation.
In the present work, we aimed to assess OSA-related sleep fragmentation with survival analysis of sleep continuity estimated using the PPG signal. The first hypothesis was that automatic PPG-based sleep staging can capture the interruptions of sleep at a level of accuracy that allows differentiation between OSA severity groups in terms of sleep continuity. Since the classification of sleep and wake is crucial for the task, an auxiliary objective was to improve the accuracy of previous PPG-based sleep staging methods. The second hypothesis was that hypnograms with higher resolution would better highlight the differences between OSA severity groups based on sleep continuity by capturing the short interruptions of sleep that may be omitted from sleep staging when using traditional 30-s epochs. This was tested by overlapping the PPG epochs during prediction to artificially obtain shorter epoch-to-epoch intervals. Recently, this method has been used to analyze sleep fragmentation with automatic EEG-based sleep staging [29]. The survival analyses were performed separately for hypnograms generated with different amounts of overlap.

Data

All recordings in dataset B were manually scored in accordance with the prevalent AASM rules [31]. A total of 877 recordings were included in the study after leaving out recordings that included corrupted signals or contained less than 1 h of sleep. The Institutional Human Research Ethics Committee at Princess Alexandra Hospital approved the use of dataset B (HREC/16/QPAH/021 and LNR/2019/QMS/54313).
Since dataset B was acquired with more recent hardware and analyzed using the more recent AASM guidelines, it was the main dataset used in the present study, and the only one used in validation and testing. The demographic information (Table 1) and results are reported only for dataset B.
The patients were assigned to separate training, validation, and test sets before training the sleep staging models and performing further analyses. A random sample of 20% of the patients from dataset B was used as the independent test set. After the test set selection, 10% of the remaining data were sampled as the validation set. Details on the training, validation, and test set distributions are presented in Table 1. During training, only the training set was used to adjust the model weights. The validation set was used to monitor the training process and choose the final model. The test set was used for performance assessment of the final model and in the subsequent analysis of sleep fragmentation.
The data underlying this article cannot be shared publicly due to privacy reasons.

Sleep staging
Automatic sleep staging models were trained using only the PPG signal as the input. The raw signals were exported from the PSG software with their original sampling rate of 256 Hz. No additional filtering was performed during the exports. The PPG signal was downsampled from 256 to 32 Hz after applying an order 8 Chebyshev type I antialiasing filter. Then, z-score normalization was applied to the downsampled signals. No further preprocessing was performed. Sleep was separately classified into three classes (wake/NREM/REM), four classes (wake/N1 + N2/N3/REM), and five classes (wake/N1/N2/N3/REM). All models were pre-trained using dataset A, and the best-performing models according to cross-entropy loss on the validation set were used to initialize the weights before fine-tuning using dataset B. Smaller learning rates were used when fine-tuning the pre-trained model with dataset B to avoid destroying the feature representation learned from dataset A.
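The preprocessing described above can be sketched as follows. Conveniently, `scipy.signal.decimate` applies an order-8 Chebyshev type I anti-aliasing filter by default, matching the description; the synthetic input signal is for illustration only.

```python
import numpy as np
from scipy.signal import decimate

def preprocess_ppg(ppg_raw, fs_in=256, fs_out=32):
    """Downsample the raw PPG and z-score normalize it.

    decimate() with ftype="iir" uses an order-8 Chebyshev type I
    anti-aliasing filter, as described in the text.
    """
    factor = fs_in // fs_out                      # 256 / 32 = 8
    ppg = decimate(ppg_raw, factor, ftype="iir", zero_phase=True)
    return (ppg - ppg.mean()) / ppg.std()         # z-score normalization

# Example: 10 minutes of a synthetic PPG-like oscillation at 256 Hz
x = np.sin(2 * np.pi * 1.2 * np.arange(10 * 60 * 256) / 256)
y = preprocess_ppg(x)
```

After this step the signal has a 32 Hz sampling rate, zero mean, and unit variance, so each 30-s staging window contains 960 samples.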
A general architecture consisting of a convolutional neural network (CNN), a recurrent neural network (RNN), and a densely connected classifier was used [19,22]. The model was implemented in Python using TensorFlow 2.3.0 and its Keras API. The deep learning model architecture is described in Table 2. The CNN extracted features from 30-s windows of the raw PPG signal. This representation was aligned with the 30-s epochs used in manual sleep staging. The CNN was based on EfficientNet [32], a state-of-the-art deep learning architecture for image classification. In the present work, the 2D EfficientNet architecture was modified for 1D inputs by substituting the 2D convolutions with 1D convolutions. The Swish activation function [33] was used, as in the original EfficientNet. The output features of the CNN were used as the input for the RNN. A bidirectional RNN was used to capture the sleep state dynamics both backward and forward in time. Long short-term memory (LSTM) cells were chosen over gated recurrent units (GRU) after evaluating both. The bidirectional LSTM output features were then fed to two densely connected layers with rectified linear unit (ReLU) activations. The classifier output was produced by applying the softmax activation function to the final dense layer's output.

Hyperparameter tuning was performed using a disciplined approach [34]. First, a suitable range for learning rates was searched using a learning rate range test [35]. The resulting range was used with a one-cycle learning rate scheduling policy, in which the learning rate was initially set to the minimum of the range. The learning rate was then increased linearly after each network training epoch until the maximum of the range was reached, and afterwards linearly decreased back to the minimum over the same number of training epochs. Finally, the learning rate was exponentially decreased for 20 training epochs until it was two orders of magnitude smaller than the minimum learning rate indicated by the learning rate range test [34].
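A minimal sketch of this one-cycle schedule follows; the cycle length and rate bounds in the example call are hypothetical, not the values used in the study.

```python
def one_cycle_lr(epoch, lr_min, lr_max, cycle_epochs, decay_epochs=20):
    """One-cycle learning rate schedule as described above.

    Linear ramp lr_min -> lr_max over the first half of the cycle,
    linear ramp back down over the second half, then exponential
    decay for `decay_epochs` epochs until the rate is two orders of
    magnitude below lr_min.
    """
    half = cycle_epochs / 2
    if epoch <= half:                        # linear warm-up
        return lr_min + (lr_max - lr_min) * epoch / half
    if epoch <= cycle_epochs:                # linear cool-down
        return lr_max - (lr_max - lr_min) * (epoch - half) / half
    # exponential tail: reaches lr_min * 0.01 after decay_epochs epochs
    k = (epoch - cycle_epochs) / decay_epochs
    return lr_min * (0.01 ** min(k, 1.0))

# Hypothetical 40-epoch cycle between 1e-4 and 1e-3, plus 20 decay epochs
schedule = [one_cycle_lr(e, 1e-4, 1e-3, 40) for e in range(61)]
```

In practice such a function can be wrapped in a Keras `LearningRateScheduler` callback and attached to the training loop.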

Sleep parameters
Total sleep time (TST), sleep efficiency (SE), and wake after sleep onset (WASO) were computed for each patient in the test set using the five-stage model. In addition, the percentage of wake from the total recording time, as well as the percentages of each sleep stage from the TST, were computed. The parameters were computed separately for each OSA severity category. OSA severity was defined using the apnea-hypopnea index (AHI; no OSA: AHI < 5, mild OSA: 5 ≤ AHI < 15, moderate OSA: 15 ≤ AHI < 30, severe OSA: AHI ≥ 30). The AHI values were calculated from the manually scored PSGs. Thirty-second non-overlapping epochs were used both in the manual scoring and when training the automatic sleep staging models. In addition, the automated model was used to produce hypnograms with higher temporal resolution by applying 15- and 25-s overlaps between consecutive 30-s epochs, resulting in 15- and 5-s epoch-to-epoch intervals.

Survival analysis of sleep continuity
Sleep continuity was evaluated using survival analysis techniques introduced by Norman et al. [3]. The mean length of continuous sleep was calculated for each patient. Then, Cox proportional hazards models were fitted with the mean continuous sleep as the time to event, and the one-hot encoded OSA severity categories as the binary covariates. The non-OSA group was used as the reference. The five-stage model was used in the survival analysis. The Cox proportional hazards modeling was performed separately with both manually scored hypnograms and the PPG-based hypnograms having 30-, 15-, and 5-s epoch-to-epoch intervals. In addition to acquiring the hazard ratios for decreased mean continuous sleep using the proportional hazards model, sleep continuity was evaluated visually from Kaplan-Meier plots.
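The per-patient mean continuous sleep used as the time-to-event variable can be computed as the mean length of the uninterrupted sleep runs in a hypnogram. A minimal sketch, again assuming 'W' codes wake:

```python
def mean_continuous_sleep(hypnogram, epoch_sec=30):
    """Mean duration (s) of uninterrupted sleep runs in a hypnogram.

    A sleep run is a maximal stretch of consecutive non-wake epochs.
    """
    runs, current = [], 0
    for stage in hypnogram:
        if stage != "W":
            current += 1          # extend the current sleep run
        elif current:
            runs.append(current)  # a wake epoch ends the run
            current = 0
    if current:                   # close a run ending at the recording's end
        runs.append(current)
    return epoch_sec * sum(runs) / len(runs) if runs else 0.0

hyp = ["W", "N2", "N2", "W", "N2", "N3", "REM", "W", "REM"]
mcs = mean_continuous_sleep(hyp)  # runs of 2, 3, and 1 epochs
```

The resulting per-patient values could then be fitted with a Cox proportional hazards model, for example with the lifelines package's `CoxPHFitter`, using mean continuous sleep as the duration column and the one-hot OSA severity categories as covariates.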

Statistical analysis
All statistical analyses were performed using Python 3.8.5. Overall accuracy, precision, recall, and F1 scores were used for performance assessment of the sleep staging models. In addition, Cohen's κ was used to estimate the agreement between the manually scored PSG-based and the automatically scored PPG-based sleep staging. Medians and interquartile ranges were computed for the sleep parameters. Mean absolute error (MAE) was computed to assess the difference between the manual PSG-based and automated PPG-based sleep parameters. The Wilcoxon signed-rank test was used to test the statistical significance of the differences between the manual PSG-based and automatic PPG-based sleep parameters.
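As a concrete illustration, Cohen's κ can be computed from two epoch-by-epoch stagings in a few lines; the two toy sequences below are fabricated for illustration, and in practice `sklearn.metrics.cohen_kappa_score` gives the same result.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa between two label sequences, e.g. manual vs
    automatic epoch-by-epoch sleep stages."""
    n = len(y_true)
    # observed agreement: fraction of epochs staged identically
    p_obs = sum(a == b for a, b in zip(y_true, y_pred)) / n
    # expected chance agreement from the marginal stage frequencies
    t, p = Counter(y_true), Counter(y_pred)
    p_exp = sum(t[c] * p[c] for c in t) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

manual = ["W", "W",  "N2", "N2", "REM", "REM", "N2", "W"]
auto   = ["W", "N1", "N2", "N2", "REM", "W",   "N2", "W"]
kappa = cohens_kappa(manual, auto)
```

Unlike raw accuracy, κ corrects for the agreement expected by chance, which matters when the stage distribution is imbalanced (e.g. N2-dominated nights).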

Results

Sleep staging
The confusion matrices for three-, four-, and five-class sleep stage classification are shown in Figure 1. A representative example of manual PSG-based and automatic PPG-based hypnograms for one healthy patient from the test set in the five-stage case is shown in Figure 2. Medians and interquartile ranges for wake and the different sleep stage percentages in each of the OSA severity categories, computed from the manual PSG-based and automatic PPG-based hypnograms, are shown in Table 3. The proportions of wake and REM sleep were estimated most consistently across the different combinations of OSA category and epoch-to-epoch interval. The proportion of N1 sleep was estimated the worst, especially in the severe OSA group; the automatic model compensated for this by overestimating the proportion of N2 sleep. When the epoch-to-epoch interval was decreased to 5 s with the automatic hypnograms, the mean absolute errors between the manual and automatic sleep stage percentages were higher, except for the N2 percentage.

Sleep parameters
Medians and interquartile ranges for TST, SE, and WASO computed using manual PSG-based and automatic PPG-based hypnograms in each OSA severity category are shown in Table 4. The medians of TST and SE computed from the automatic PPG-based hypnograms using different epoch-to-epoch intervals were in line with the medians computed from the manual PSG-based hypnograms. On the other hand, the medians of WASO in each OSA severity category were overestimated by the automatic hypnograms. In all scenarios, the mean absolute error of the automatic PPG-based sleep parameters compared to the manual PSG-based sleep parameters increased when the epoch-to-epoch interval was decreased. In Table 4, a statistically significant difference (p < 0.05) between manual and automatic sleep staging is denoted with an asterisk (*), and a statistically significant difference (p < 0.05) between the non-OSA group and each of the OSA groups is denoted with a dagger (†).

Assessment of sleep continuity using survival analysis
In the analysis of sleep continuity, the five-stage model was used since it provided the highest accuracy on the classification of wake (Figure 1). The test set (n = 175) was used for the survival analysis. In the OSA severity grouping, the clinical diagnoses based on manually scored PSGs were used. With both manually and automatically scored hypnograms, the hazard ratios (HRs) for decreased mean continuous sleep compared to the non-OSA group were larger when the OSA severity increased (Table 5). When decreasing the epoch-to-epoch interval, the differences between the HRs of different OSA severity groups increased. The HRs for PSG-based manually scored hypnograms were 1.18, 1.78, and 2.90 for mild, moderate, and severe OSA, respectively. With the PPG-based automatic scoring with 5-s epoch-to-epoch interval, the corresponding HRs were 1.70, 3.30, and 8.11.
Kaplan-Meier plots for each scenario are shown in Figure 3. With the PSG-based manually scored hypnograms, the survival curves for each OSA severity category were clearly distinct. With the automatic PPG-based model with a 30-s epoch-to-epoch interval, the mild OSA patients' survival curve overlapped with the non-OSA curve. In contrast, with the 5-s epoch-to-epoch interval, all OSA severity categories were well separated. In addition, it is evident from the Kaplan-Meier plots that the mean continuous sleep estimated by the deep learning models decreased drastically when the epoch-to-epoch interval was decreased.

Discussion
In the present work, OSA-related sleep fragmentation was assessed with Cox proportional hazards modeling of mean continuous sleep utilizing PPG-based automatic sleep staging. The results were compared with manual PSG-based sleep staging analyses. The hazard ratios for decreased mean continuous sleep increased along with increasing OSA severity with both automatic PPG-based and manual PSG-based analyses. This supports the first hypothesis that the automated PPG-based sleep staging models can be used to differentiate between the OSA severity categories in terms of sleep continuity. Thus, it can be reasoned that the PPG signal captures the sleep fragmentation induced by OSA-related breathing obstructions. The differences between the hazard ratios for decreased mean continuous sleep for mild, moderate, and severe OSA compared to the non-OSA group further increased when shorter epoch-to-epoch intervals were used. This is in line with the second hypothesis of the present work that a denser temporal resolution of the sleep staging would highlight the differences between the OSA severity categories with respect to sleep fragmentation.
The second aim was to improve the accuracy of automatic PPG-based sleep staging. Compared to our previous work [19], the accuracies in three-, four-, and five-stage classification on the test set increased from 80.1%, 68.5%, and 64.1% to 83.3%, 74.1%, and 68.7%, respectively. The performance of the PPG-based automated sleep staging model is remarkable, considering that the PPG signal is not utilized in the manual scoring of the sleep stages. In the five-stage classification, the accuracy of classifying REM sleep (87%) is particularly high compared to our previous PPG-based sleep staging results (69%) [19]. However, the overall performance of the PPG-based five-stage classification is still not on the level of PSG-based sleep staging, especially in the case of N1 and N3 sleep. In the case of N1 sleep, there may not be consistent hemodynamic changes compared to wakefulness and N2 sleep. It should be noted that the interrater agreement for scoring N1 sleep is particularly low also with manual EEG-based scoring [8]. Similarly, the slow wave activity of the brain, which is the main characteristic of N3 sleep, may not be reflected in the PPG signal, leading to misclassification of N3 sleep as N2 sleep. Thus, further studies are required on the application of PPG-based models to the investigation of the overall sleep architecture.
The main contribution in the present work that accounts for the increased accuracy of PPG-based REM sleep classification compared to our previous work [19] was the use of a more sophisticated feature extractor CNN. With CNNs that consist of blocks of consecutive convolutional layers and occasional pooling layers, the number of parameters grows quickly to the extent that computational resources, especially the GPU memory, become a major limiting factor. In addition, when the depth of the network is increased, the gradients of the layer inputs tend to vanish. The mobile inverted bottleneck convolution (MBConv) block [38] attempts to overcome these issues in three ways. First, instead of standard convolutions, computationally more lightweight depthwise separable convolutions are used [38]. Secondly, a linear bottleneck is used at the end of each block to reduce the number of channels passed down to the next block. Thirdly, skip connections are added from the input to the bottleneck output of the MBConv blocks for improved gradient flow. Using the MBConv blocks in the present work allowed us to significantly increase the depth of the feature extractor CNN with the same computational resources compared to our previous work [19].
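The parameter savings from depthwise separable convolutions can be illustrated by counting weights; the channel counts and kernel size below are hypothetical, not the study's actual configuration.

```python
def conv1d_params(c_in, c_out, k):
    """Weights in a standard 1D convolution (bias terms ignored):
    one k-tap filter per (input channel, output channel) pair."""
    return c_in * c_out * k

def separable_conv1d_params(c_in, c_out, k):
    """Depthwise separable 1D convolution: one k-tap depthwise filter
    per input channel, followed by a 1x1 pointwise convolution."""
    return c_in * k + c_in * c_out

# Hypothetical layer: 64 -> 128 channels, kernel size 9
standard = conv1d_params(64, 128, 9)             # 64 * 128 * 9
separable = separable_conv1d_params(64, 128, 9)  # 64 * 9 + 64 * 128
```

For this layer the separable variant uses roughly 8x fewer weights, which is why MBConv-style blocks allow a much deeper feature extractor within the same GPU memory budget.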
The REM sleep classification accuracy is comparable to that of EEG-based automatic sleep staging in our previous study (91%), which is on par with the clinical interrater reliability [22]. Since the accuracy of identifying REM sleep is high, the PPG-based sleep staging model could be used to study REM-related phenomena, such as REM-related OSA and REM sleep fragmentation. The increasingly common home sleep apnea tests (HSATs) do not include EEG, but the PPG signal is recorded. Thus, with the prevalent methodology of EEG-based sleep staging, REM-related OSA cannot be diagnosed with HSATs. Therefore, it would be extremely valuable to accurately identify REM sleep in HSATs using only the PPG signal.
When utilizing supervised machine learning techniques, the quality of the output labels is of paramount importance. The fact that the interrater reliability of PSG-based manual scoring of sleep stages by experienced professionals is generally around 80% to 85% [5] is an issue for the supervised approach in general. In addition to the moderate interrater reliability, manual sleep staging is a very time-consuming task requiring a lot of expertise. This complicates the collection of large, high-quality datasets for supervised learning. To increase the quality of the labels and to speed up the data acquisition, the scoring rules and practices may require a revision to make the scoring process less ambiguous and easier to automate. A further step would be to develop methods for assessing sleep that do not depend on manual scoring at all. For example, unsupervised learning could be used on the PSG signals to derive sleep characteristics that capture the variance in the nocturnal PSG signals more optimally than the current visual inspection-based scoring rules. These features could correlate better with the effects of sleep deprivation such as daytime sleepiness; however, this warrants further study.
Pulse oximetry has a lot of potential for sleep analytics and diagnostics of sleep disorders as it is simple to measure and already used in various monitoring devices and applications. Since PPG measurements are easy to conduct, acquisition of larger PPG signal datasets without the manually scored PSGs is feasible. This opens possibilities to utilize semi-supervised learning with large amounts of unlabeled PPG signals and a smaller number of PPG signals with corresponding manual PSG-based labels. Thus, any dataset which includes PPG signals could be used to increase the amount of training data, regardless of whether the corresponding hypnograms are available. Semi-supervised learning has been performed with good results, for example, using generative adversarial networks (GANs) [39] and ladder networks [40]. In the era of consumer-grade self-tracking wearables such as smart watches, armbands, and rings, the use of deep learning-based semi-supervised methods for tasks such as sleep staging will become increasingly important.
In the prevalent sleep staging methodology, sleep is discretized into arbitrary-length (usually 30-s) epochs, mainly for practical reasons to reduce the amount of work in manual scoring. In the present work, when the epoch-to-epoch interval with automatic sleep staging was artificially decreased, better differentiation was achieved between the OSA severity categories in terms of mean continuous sleep (Table 5). This finding supports the hypothesis that using the 30-s non-overlapping epochs in sleep staging does not fully capture the OSA-related sleep fragmentation. Especially with the severe OSA patients, there may be short periods of wake that are split across two consecutive 30-s epochs such that both epochs are scored as sleep. Using overlapping 30-s epochs with a shorter epoch-to-epoch interval, those short periods of wake spanning two traditional epochs can be detected.
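This effect can be demonstrated with a toy example: a 20-s awakening that straddles the boundary of two non-overlapping 30-s epochs is missed by per-epoch staging at a 30-s epoch-to-epoch interval, but detected at a 5-s interval. The majority-vote staging rule below is a simplification for illustration, not the AASM rules or the model's learned behavior.

```python
def stage_epochs(wake, epoch_len=30, step=30):
    """Score each 30-s window as sleep ('S') or wake ('W') by majority
    vote over a 1-Hz wake indicator (1 = awake), sliding the window
    forward by `step` seconds (step < epoch_len means overlap)."""
    out = []
    for start in range(0, len(wake) - epoch_len + 1, step):
        window = wake[start:start + epoch_len]
        out.append("W" if sum(window) > epoch_len / 2 else "S")
    return out

# 3 minutes of sleep with one 20-s awakening at t = 50..69 s:
# 10 s fall into the 2nd 30-s epoch and 10 s into the 3rd, so
# neither non-overlapping epoch reaches a wake majority.
wake = [1 if 50 <= t < 70 else 0 for t in range(180)]

coarse = stage_epochs(wake, step=30)  # 30-s epoch-to-epoch interval
dense = stage_epochs(wake, step=5)    # 5-s interval via overlapping epochs
```

With the 5-s step, the window starting at t = 45 s contains the full 20-s awakening and is scored as wake, whereas the coarse hypnogram contains no wake at all.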
As seen in Figure 2, the models tend to predict increasingly fragmented sleep when the epoch-to-epoch interval is shortened. This leads to a decreased mean duration of continuous sleep for all OSA severity groups, as also seen in the Kaplan-Meier plots (Figure 3). However, the hazard ratios for decreased mean continuous sleep increased more rapidly with more severe OSA (Table 5). If the models overestimated sleep fragmentation to the same extent in all OSA severity groups, as well as in the healthy subjects, the hazard ratios would not increase, since the comparison is always made against the non-OSA group. To further investigate the overestimation of sleep fragmentation, denser-resolution manual PSG-based scorings would be needed. This underlines the need for new methods to produce higher resolution hypnograms for more detailed assessment of sleep fragmentation related to OSA.
One limitation of the present work is the amount of data. Although the main dataset B used in this study is large in the context of sleep research (n = 877), it is relatively small in the context of deep learning. Especially when the patients are divided into OSA severity groups and only the test set is considered, the sample sizes become small. For example, the number of patients in the non-OSA group in the test set was only 29 (Table 1). This is a limiting issue when analyzing the distributions of patient-wise variables, such as the sleep parameters or mean continuous sleep. With the epoch-based metrics, such as the overall sleep staging accuracies, this is not as problematic since on average there are hundreds of epochs for each patient.
In conclusion, the differences in hazard ratios for decreased mean continuous sleep between the OSA severity categories increased when the epoch-to-epoch interval was decreased (Table 5). The hypnograms with higher temporal resolution were achieved by overlapping the 30-s epochs before classification with the automatic PPG-based sleep staging model. This indicates that using a shorter epoch-to-epoch interval with the automatic hypnograms better captures the OSA-related sleep fragmentation. On the other hand, decreasing the epoch-to-epoch interval increased the mean absolute error between the manual PSG-based and automatic PPG-based sleep parameters and sleep stage percentages (Table 4). Thus, although there are inconsistencies between the manual PSG-based and automatic PPG-based sleep parameters when the epoch-to-epoch interval is decreased, the increased resolution of the hypnograms better reveals the differences in sleep fragmentation between the OSA severity categories.

Disclosure Statements
Financial disclosure