Multi-scored sleep databases: how to exploit the multiple labels in automated sleep scoring

Abstract
Study Objectives: Inter-scorer variability in scoring polysomnograms is a well-known problem. Most of the existing automated sleep scoring systems are trained using labels annotated by a single scorer, whose subjective evaluation is transferred to the model. When annotations from two or more scorers are available, the scoring models are usually trained on the scorer consensus. In this case, the averaged scorers' subjectivity is transferred into the model, losing information about the internal variability among the different scorers. In this study, we aim to insert the multiple-knowledge of the different physicians into the training procedure. The goal is to optimize model training by exploiting the full information that can be extracted from the consensus of a group of scorers.
Methods: We train two lightweight deep learning-based models on three different multi-scored databases. We exploit the label smoothing technique together with a soft-consensus (LSSC) distribution to insert the multiple-knowledge into the training procedure of the models. We introduce the averaged cosine similarity (ACS) metric to quantify the similarity between the hypnodensity-graph generated by the models trained with LSSC and the hypnodensity-graph generated by the scorer consensus.
Results: The performance of the models improves on all the databases when we train them with our LSSC. We found an increase in ACS (up to 6.4%) between the hypnodensity-graph generated by the models trained with LSSC and the hypnodensity-graph generated by the consensus.
Conclusion: Our approach enables a model to better adapt to the consensus of the group of scorers. Future work will focus on further investigation of different scoring architectures and, hopefully, on large-scale heterogeneous multi-scored datasets.


Introduction
Sleep disorders represent a significant public health problem that affects millions of people worldwide [1]. Since the late 1950s, the polysomnography (PSG) exam has been the gold standard to study sleep and to identify sleep disorders. It monitors electrophysiological signals such as the electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG) and electrocardiogram (ECG), from which the physicians visually extract sleep cycle information. The whole-night recording is divided into 30-second epochs, and each epoch is classified into one of five sleep stages (i.e., wakefulness W, stage N1, stage N2, stage N3 and stage REM) according to the AASM guidelines [2]. In the worst case, an eight-hour PSG may require up to two hours of tedious, repetitive and time-consuming work to be scored. In addition, this manual procedure is affected by a low inter-rater scoring agreement (i.e., the agreement between different physicians scoring the same whole-night recording), with values ranging from 70% up to slightly more than 80% [3-5]. In [3], an averaged inter-rater agreement of about 83% results from a study conducted on the AASM inter-scorer reliability dataset, using sleep stages annotated by more than 2,500 sleep scorers. The agreement was higher than 84% for the wake, N2 and REM stages, but it dropped to 63% and 67% for the N1 and N3 stages, respectively. In fact, the inter-rater agreement varies among sleep stages, patients, sleep disorders and across sleep centers [3,6].
Since 1960, many different approaches and algorithms have been proposed to automate this time-consuming scoring procedure. Mainly, two different approaches emerged: sleep scoring algorithms learning from well-defined features extracted using the knowledge of the experts (shallow learning), and sleep scoring algorithms learning directly from the raw data (deep learning). Thorough reviews of both feature-based [7,8] and deep-learning-based [9,10] sleep scoring algorithms can be found in the literature. Although the latter algorithms emerged only about five years ago, their impressive results have never been reached by the previous, conventional feature-based approaches. Autoencoders [11], deep neural networks [12], convolutional neural networks [13-20], recurrent neural networks [21-23] and different combinations of them [24-30] have all been proposed in these last five years. Almost all of the above algorithms have been trained on recordings scored by a single expert physician. The first remarkable exception comes from [27], where the authors consider recordings scored by six different physicians [31]. Their scoring algorithm was trained on the six-scorer consensus (i.e., based on the majority vote weighted by the degree of consensus from each physician). In [23] the Dreem group introduced two publicly available datasets scored by five sleep physicians. Similarly, they used the scorer consensus to train their automated scoring system. It has been shown that the performance of an automated sleep scoring system is on par with the scorer consensus [23,27], and, notably, that their best scoring algorithm outperforms the best human scorer, i.e., the scorer with the highest consensus among all the physicians in the group. Although both studies considered the knowledge from the multiple scorers (by averaging their labels and training their algorithms on the averaged consensus), they still trained the algorithm on a single one-hot encoded label. Indirectly, they are still transferring the best scorer's subjectivity into the model, and they are not explicitly training the model to adapt to the consensus of the group of scorers. In this work, we train two existing lightweight deep learning-based sleep staging algorithms, our DeepSleepNet-Lite (DSN-L) [32] and SimpleSleepNet (SSN) [23], on three open-access multi-scored sleep datasets. First, we assess the performance of both scoring algorithms trained with the labels given by the scorer consensus (i.e., the majority vote among the different scorers) and compare it to the performance of the individual scorer-experts. Then, we propose to exploit label smoothing along with the soft-consensus distribution (base+LS SC) to insert the multiple-knowledge into the training procedure of the models and to better calibrate the scoring architectures. For the first time in sleep scoring, the multiple labels are considered in the training procedure: the annotations of all the scorers are taken into account at the same time. We finally assess the performance and quantify the similarity between the hypnodensity-graph generated by the models (trained with and without label smoothing) and the hypnodensity-graph generated by the scorer consensus.
In the present work we investigate a different approach to exploiting multi-scored database information. In particular: (1) we demonstrate the efficiency of label smoothing along with the soft-consensus distribution in both calibrating and enhancing the performance of both DSN-L and SSN; (2) we show how the model can better resemble the scorer group consensus, leading to an increased similarity between the hypnodensity-graph generated by the model and the hypnodensity-graph generated by the scorer consensus.

Methods
In this section we first present the three publicly available databases used in this study: IS-RC (Inter-scorer Reliability Cohort) [31], DOD-H (Dreem Open Dataset - Healthy) and DOD-O (Dreem Open Dataset - Obstructive) [23]. We then briefly describe the architectures of the two deep learning-based scoring algorithms, DSN-L [32] and SSN [23]. Next, we show how to compute the consensus in a multi-scored dataset, i.e., how to compute the label among multiple scorers, so as to train our baseline algorithms and to evaluate their performance. In the Label smoothing with soft-consensus subsection we describe in detail how to compute the soft-consensus distribution, and how to exploit it along with the label smoothing technique during the training procedure. The aim is to show how to insert the multiple labels of the different scorers into the training procedure of our algorithms. We finally report all the experiments conducted on both the DSN-L and SSN algorithms, i.e., the base, base+LS U and base+LS SC models, and the metrics used to evaluate their performance. In Table 1 we report a summary of the total number and percentage of epochs per sleep stage for the DOD-H, DOD-O and IS-RC datasets.

Deep learning-based scoring architectures
DSN-L [32] is a simplified feed-forward version of the original DeepSleepNet by [24]. Unlike the original network, in [32] we proposed to employ only the first representation learning block, and to simply train it with a sequence-to-epoch learning approach. The architecture receives in input a sequence of 90-second epochs, and it predicts the corresponding target of the central epoch of the sequence, i.e., a many-to-one or sequence-to-epoch classification scheme. The representation learning architecture consists of two parallel convolutional neural network (CNN) branches, with small and large filters at the first layer. The principle is to extract high-time-resolution patterns with the small filters, and high-frequency-resolution patterns with the large ones. This idea comes from the way signal processing experts define the trade-off between temporal and frequency precision in the feature extraction procedure [33]. Each CNN branch consists of four convolutional layers and two max-pooling layers. Each convolutional layer executes three basic operations: a 1-dimensional convolution of the filters with the sequential input; batch normalization [34]; and an element-wise rectified linear unit (ReLU) activation function. The pooling layers are then used to downsample the input. In Figure 1 we report an overview of the architecture, with details about the filter size, the number of filters and the stride size of each convolutional layer. The pooling size and the stride size of each pooling layer are also specified. The models are trained end-to-end via backpropagation, using the mini-batch Adam gradient-based optimizer [35] with a fixed learning rate. The training procedure runs up to a maximum number of iterations (e.g., 100 iterations), unless the early stopping condition is met (i.e., the validation F1-score stops improving for more than a given number of epochs; the model with the best validation F1-score is used at test time). All the training parameters (e.g., the Adam optimizer parameters beta1 and beta2, the mini-batch size, the learning rate, etc.) are set as recommended in [32] and [23].

SSN
In Supplementary Analyses we also report additional mathematical details about both the scoring architectures.

Consensus in multi-scored datasets
Inspired by [23,27], we evaluate the performance of the sleep scoring architectures, as well as the performance of each physician, using the consensus among the five/six different scorers. The majority vote from the scorers is computed, i.e., we assign to each 30-second epoch the most voted sleep stage among the physicians. In case of ties, we consider the label from the most reliable scorer, i.e., the one that is most frequently in agreement with all the others. We use the soft-agreement metric proposed in [23] to rank the reliability of each physician, and thus to define the most reliable scorer.
We denote with S the total number of scorers and with s a single scorer. The one-hot encoded sleep stages given by scorer s are \hat{y}^s \in \{0,1\}^{C \times N}, where C is the number of classes, i.e., sleep stages, and N is the total number of epochs. The probabilistic consensus \hat{y}_i^{-s} among the scorers (s excluded) is computed for each epoch i as:

\hat{y}_i^{-s}[c] = \frac{\sum_{t \ne s} y_i^t[c]}{\max_{c'} \sum_{t \ne s} y_i^t[c']}

where i is the i-th of the N epochs; \hat{y}_i^{-s}[c] is equal to 1 if stage c matches the majority of the remaining scorers or if it is involved in a tie. The soft-agreement of scorer s is then computed across all the epochs as:

\text{SoftAgr}_s = \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i^{-s}\bigl[\arg\max_c y_i^s[c]\bigr]

where \hat{y}_i^{-s}[\cdot] denotes the probabilistic consensus of the sleep stage chosen by scorer s for the i-th epoch.
\text{SoftAgr}_s \in [0, 1], where the zero value is assigned if scorer s systematically scores all the annotations incorrectly compared to the others, whilst 1 is assigned if the scorer is always involved in tie cases or in the majority vote. The soft-agreement is computed for all the scorers, and the values are sorted from the highest (high reliability) to the lowest (low reliability).
The soft-agreement is computed for each patient, i.e., the scorers are ranked for each patient, and in case of a tie the top-1 physician is the one used to resolve it for that patient.
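The consensus and reliability computations described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names (`soft_agreement`, `consensus`) and the array layout (one row of integer stage labels per scorer) are our own assumptions, and the soft-agreement follows the description summarized above (other-scorer vote counts normalized so that the majority stage, or any stage tied for the majority, gets probability 1).

```python
import numpy as np

def soft_agreement(labels, s):
    """Soft-agreement of scorer s with the remaining scorers.

    labels: (S, N) integer array of sleep stages, one row per scorer.
    """
    S, N = labels.shape
    others = np.delete(labels, s, axis=0)                    # (S-1, N)
    C = int(labels.max()) + 1
    # vote counts of the remaining scorers, normalized so the max entry is 1:
    # stages matching the majority (or tied for it) get probability 1
    counts = np.stack([(others == c).sum(axis=0) for c in range(C)])  # (C, N)
    probs = counts / counts.max(axis=0, keepdims=True)
    return probs[labels[s], np.arange(N)].mean()

def consensus(labels):
    """Majority vote per epoch; ties resolved by the most reliable scorer."""
    S, N = labels.shape
    C = int(labels.max()) + 1
    counts = np.stack([(labels == c).sum(axis=0) for c in range(C)])  # (C, N)
    # rank scorers by decreasing soft-agreement; ranking[0] is the most reliable
    ranking = np.argsort([-soft_agreement(labels, s) for s in range(S)])
    best = ranking[0]
    winner = counts.argmax(axis=0)
    tie = (counts == counts.max(axis=0, keepdims=True)).sum(axis=0) > 1
    winner[tie] = labels[best, tie]                          # break ties with their label
    return winner
```

In a real pipeline the tie-break would use the per-patient ranking described above; here a single ranking is computed for the array at hand.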

Label smoothing with soft-consensus
The predicted sleep stage for each 30-second epoch is associated with a probability value p̂, which should mirror its ground-truth correctness likelihood. When this happens, we can state that the model is well calibrated, i.e., that the model provides a calibrated confidence measure along with its prediction [36]. Consider, for example, a model trained to classify images as either containing a dog or not; out of ten test-set images, it outputs the probability of there being a dog as 0.60 for every image. The model is perfectly calibrated if exactly six dog images are present in the test set. Label smoothing [37] has been shown to be a suitable technique to improve the calibration of the model.

By default, the cross-entropy loss function is computed between the prediction and the hard target (i.e., the one-hot encoded sleep stages: 1 for the correct class and 0 for all the other classes).
Whenever a model is trained with the label smoothing technique, the hard target y is usually smoothed with the standard uniform distribution (3). Thus, the cross-entropy loss function (4) is minimized using the weighted mixture of the target:

y^{LS_U} = (1 - \alpha)\, y + \frac{\alpha}{C}    (3)

H(y^{LS_U}, \hat{y}) = -\sum_{c=1}^{C} y^{LS_U}[c] \,\log \hat{y}[c]    (4)

where \alpha is the smoothing parameter, C the number of sleep stages, y^{LS_U} the weighted mixture of the target and \hat{y} the output of the model with the predicted probability values.
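Equations (3) and (4) amount to a few lines of NumPy. The sketch below is illustrative (α = 0.1, C = 5 and the softmax output are assumed values, not taken from the paper):

```python
import numpy as np

C = 5                                    # number of sleep stages (W, N1, N2, N3, REM)
alpha = 0.1                              # smoothing hyperparameter

y = np.eye(C)[2]                         # one-hot hard target, e.g. stage N2
y_ls_u = (1 - alpha) * y + alpha / C     # eq. (3): mixture with the uniform distribution

y_hat = np.array([0.05, 0.10, 0.70, 0.10, 0.05])   # hypothetical softmax output
loss = -np.sum(y_ls_u * np.log(y_hat))             # eq. (4): cross-entropy
```

Note that the smoothed target still sums to one, so the cross-entropy remains a proper loss.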
In our study, we exploit the label smoothing technique to improve the insertion of the knowledge from the multiple scorers into the learning process. We propose to use the soft-consensus distribution (5) as our new distribution to smooth the hard target y:

SC_i[c] = \frac{1}{|O_i|} \sum_{o \in O_i} \mathbf{1}(o = c)    (5)

where O_i is the set of observations (i.e., the annotations given by the different physicians) for the i-th epoch, c is the class index and |O_i| is the cardinality of the set. In simple words, the probability value for each sleep stage is computed as the number of its occurrences divided by the total number of observations.
SC_i is the one-dimensional vector that we use to smooth the hard target (6); we then minimize the cross-entropy loss function (7):

y_i^{LS_{SC}} = (1 - \alpha)\, y_i + \alpha\, SC_i    (6)

H(y_i^{LS_{SC}}, \hat{y}_i) = -\sum_{c=1}^{C} y_i^{LS_{SC}}[c] \,\log \hat{y}_i[c]    (7)

To make it clearer, we report a practical example of how to compute the soft-consensus distribution, and how to exploit it to smooth our labels. Consider a set of observations given by five different physicians for the same i-th epoch.
Suppose, for instance, that three physicians score the epoch as N2 and two as N1. Over the stages (W, N1, N2, N3, REM), applying (5) gives the soft-consensus SC_i = [0, 0.4, 0.6, 0, 0], and the majority vote gives the one-hot target y_i = [0, 0, 1, 0, 0] (stage N2). By applying (6) with \alpha = 0.1, we obtain the smoothed hard-target y_i^{LS_{SC}} = [0, 0.04, 0.96, 0, 0]. We perform a simple grid-search to set the smoothing hyperparameter \alpha. When the model is trained with the labels smoothed by the uniform distribution, the \alpha value ranges strictly between 0 and 1, sampled at a fixed step. The extreme values are not considered: for \alpha = 0 the model is trained using the standard one-hot encoded vector, whilst for \alpha values close or equal to 1 the model would be trained using mainly/only the uniform distribution for each sleep stage. When the model is trained with the labels smoothed by the soft-consensus distribution, the \alpha value ranges over the same grid; in the latter case we also investigate \alpha = 1, to evaluate the full impact of the consensus distribution on the learning procedure.
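The soft-consensus computation in (5) and the smoothing in (6) can be sketched as follows. This is a NumPy illustration with hypothetical annotations (three N2 votes and two N1 votes, matching the five-physician example above); the function name `soft_consensus` is our own:

```python
import numpy as np

def soft_consensus(observations, n_classes=5):
    """Eq. (5): occurrences of each stage divided by the number of observations."""
    counts = np.bincount(observations, minlength=n_classes)
    return counts / len(observations)

# hypothetical annotations from five physicians for one epoch
# stages encoded as W=0, N1=1, N2=2, N3=3, REM=4
obs = np.array([2, 2, 2, 1, 1])          # three N2 votes, two N1 votes
sc = soft_consensus(obs)                 # [0.0, 0.4, 0.6, 0.0, 0.0]

alpha = 0.1
y = np.eye(5)[2]                         # one-hot consensus target (N2 wins the vote)
y_ls_sc = (1 - alpha) * y + alpha * sc   # eq. (6): [0.0, 0.04, 0.96, 0.0, 0.0]
```

Unlike the uniform distribution, the smoothed mass here goes only to stages that at least one physician actually annotated.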

Experimental design
We evaluate DSN-L and SSN using the k-fold cross-validation scheme. We set k equal to 25 for DOD-H (a leave-one-out evaluation procedure over its 25 recordings), and we set k for IS-RC and DOD-O consistently with what was done in [23].
In Table 2 we summarize the data split for each dataset.
The following experiments are conducted on both the DSN-L and SSN models for each dataset:
• base. The models are trained without label smoothing.
• base+LS U. The models are trained with label smoothing using the standard uniform distribution, i.e., the hard targets (scorer consensus) are weighted with the uniform distribution.
• base+LS SC. The models are trained with label smoothing using the proposed soft-consensus, i.e., the hard targets (scorer consensus) are weighted with the soft-consensus distribution.
These models, differently trained, have been evaluated with and without the MC dropout ensemble technique. In Table 4, Table 5 and Table 6 of the Results section we present the results obtained for each experiment on both DSN-L and SSN, evaluated on the IS-RC, DOD-H and DOD-O datasets.

Metrics
Performance.
The per-class F1-score, the overall accuracy (Acc.), the macro-averaged F1-score (MF1), the weighted-averaged F1-score (i.e., the metric is weighted by the number of true instances for each label, so as to account for the high imbalance between the sleep stages) and Cohen's kappa have been computed per subject from the predicted sleep stages across all the folds, to evaluate the performance of our models [38,39].
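These metrics are available in standard libraries (e.g., scikit-learn). As an illustration of the chance-corrected agreement, Cohen's kappa can be computed from the confusion matrix as follows; this is a self-contained NumPy sketch, not the authors' evaluation code:

```python
import numpy as np

def cohens_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                    # build the confusion matrix
    n = cm.sum()
    p_o = np.trace(cm) / n                               # observed agreement (accuracy)
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2 # chance agreement from marginals
    return (p_o - p_e) / (1 - p_e)
```

Kappa equals 1 for perfect agreement and 0 when the agreement is no better than chance, which is why it complements the accuracy on imbalanced sleep-stage distributions.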

Hypnodensity graph.
The hypnodensity-graph is an efficient visualization tool introduced in [27] to plot the probability distribution over the sleep stages for each 30-second epoch over the whole night. Unlike the standard hypnogram visualization of the sleep cycle, the hypnodensity-graph shows the probability of occurrence of each sleep stage for each 30-second epoch; it is therefore not limited to a discrete sleep stage value (see Figure 3).
In our study we have used the hypnodensity-graph to display both the model output, i.e., the predicted probability vectors ŷ, and the multi-scorer soft-consensus probability distributions.

The Averaged Cosine Similarity (ACS) is used to quantify the similarity between the hypnodensity-graph generated by the model and the hypnodensity-graph generated by the soft-consensus. The ACS is computed as follows:

ACS = \frac{1}{N} \sum_{i=1}^{N} \frac{\hat{y}_i \cdot y_i^{SC}}{\lVert \hat{y}_i \rVert \, \lVert y_i^{SC} \rVert}

where N is the number of epochs in the whole night, \lVert \cdot \rVert is the norm, \hat{y}_i is the predicted probability vector and y_i^{SC} the soft-consensus ground-truth vector for the i-th epoch. The cosine similarity is thus averaged across all the epochs to obtain a single similarity score. Since both vectors are non-negative, the values range between 0 (high dissimilarity) and 1 (high similarity).
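The ACS reduces to an averaged normalized dot product over the night. A minimal NumPy sketch (our own illustration; the function name is hypothetical), taking the model hypnodensity and the soft-consensus hypnodensity as (n_epochs, n_stages) arrays:

```python
import numpy as np

def averaged_cosine_similarity(y_hat, y_sc):
    """ACS between two hypnodensity matrices of shape (n_epochs, n_stages)."""
    dots = (y_hat * y_sc).sum(axis=1)                        # per-epoch dot products
    norms = np.linalg.norm(y_hat, axis=1) * np.linalg.norm(y_sc, axis=1)
    return (dots / norms).mean()                             # average over the night
```

Because both hypnodensities are non-negative probability vectors, each per-epoch cosine similarity, and hence the ACS, lies in [0, 1].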

Calibration.
The calibration of the model is evaluated by using the expected calibration error (ECE) metric proposed in [40]. The ECE computes the difference in expectation between the accuracy and the confidence (i.e., the softmax output probabilities). More in detail, the predictions are divided into M equally spaced bins (each of size 1/M); we then compute the accuracy and the average predicted probability value for each bin as follows:

acc(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i), \qquad conf(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i

where y_i is the true label, \hat{y}_i the predicted label and \hat{p}_i the predicted probability value for the i-th 30-second epoch, and B_m is the group of samples whose predicted probability values fall into the m-th bin. Finally, the ECE value is computed as the weighted average of the difference between acc and conf among the bins:

ECE = \sum_{m=1}^{M} \frac{|B_m|}{N} \,\bigl| acc(B_m) - conf(B_m) \bigr|

where N is the total number of samples. Perfectly calibrated models have acc(B_m) = conf(B_m) for all m \in \{1, \dots, M\}, resulting in ECE = 0.
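The binned ECE described above can be sketched as follows (a NumPy illustration with M equally spaced bins; the function signature is our own assumption):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Weighted average of |accuracy - confidence| over equally spaced bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)    # bin membership B_m
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(acc - conf)           # weight = |B_m| / N
    return ece
```

For example, a model that predicts with confidence 0.75 and is correct on three epochs out of four is perfectly calibrated in that bin and contributes zero to the ECE.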

Results
In Table 3 we first report, for all the multi-scored databases IS-RC, DOD-H and DOD-O, the overall scorers' performance and their soft-agreement, i.e., the agreement of each scorer with the consensus among the physicians. On IS-RC we have on average a lower inter-scorer agreement (κ equal to 0.69, with an F1-score of 69.7%) compared to both DOD-H and DOD-O (κ equal to 0.89 and 0.88, with an F1-score of 88.1% and 86.4%, respectively). Consequently, we expect a higher efficiency of our label smoothing with the soft-consensus approach (base+LS SC) in the experiments conducted on the IS-RC database: the lower the inter-scorer agreement, the lower should be the performance of a model trained with the one-hot encoded labels (i.e., the majority vote weighted by the degree of consensus from each physician). In Table 4 and Table 5 we report the overall performance, the calibration measure and the hypnodensity similarity measure of the three different DSN-L and SSN models on the three databases IS-RC, DOD-H and DOD-O. The performance of the DSN-L base models is higher than the performance averaged among the scorers on the IS-RC database, but not on the DOD-H and DOD-O databases. In contrast, the performance of the SSN base models is always higher than the performance averaged among the scorers on all the databases. We highlight that the results we report for SSN on DOD-H and DOD-O are slightly different from the ones reported in [23]. We decided not to compute a weight (from 0 to 1) for each epoch based on how many scorers voted for the consensus, i.e., we do not balance the importance of each epoch when computing the above-mentioned metrics, as we think it is unfair to constrain any metric based on the number of voting physicians. Overall, the results show an improvement in performance on all the databases (i.e., overall accuracy, MF1-score, Cohen's kappa (κ) and per-class F1-score) from the baseline (base) and the label smoothing with the uniform distribution (base+LS U) models, to the ones trained with label smoothing along with the proposed soft-consensus distribution (i.e., base+LS SC).
The ACS is the metric that best quantifies the ability of the model to adapt to the consensus of the group of scorers. A higher ACS value means a higher similarity between the hypnodensity-graph generated by the model and the hypnodensity-graph generated by the soft-consensus (i.e., the model better adapts to the consensus of the group of physicians). As for all the other metrics, the ACS value is computed per subject, but here we report both the mean and the standard deviation across subjects. We found a significant improvement in the ACS value from the base and base+LS U models to the base+LS SC models on all the databases and on both DSN-L (p-values < 0.01) and SSN (p-values < 0.05). Hence, our approach enables both the DSN-L and SSN architectures to significantly adapt to the group consensus on all the multi-scored datasets.
We could easily infer that the SSN architecture is better (i.e., higher performance) compared to our DSN-L architecture. The purpose of our study is not to highlight whether one architecture is better than the other, but we cannot fail to notice the high confidence values (i.e., the average of the softmax output max-probabilities) obtained by the SSN-based models. The high confidence values persist despite smoothing the labels (with both the uniform and the soft-consensus distributions) during the training procedure. The SSN architecture is not highly responsive to the changes we implemented in the probability values of the one-hot encoded labels; it tends to rely on, and overfit to, the probability value given for each epoch, i.e., the consensus among the five/six different scorers. Indeed, on IS-RC, the database with the lowest inter-scorer agreement, the SSN base+LS SC model reaches a higher F1-score (81.6%) compared to our DSN-L base+LS SC model, whilst DSN-L better adapts to the consensus of the group of scorers (i.e., it better encodes the variability among the different physicians). Indeed, in Supplementary Figure S1, on the DSN-L model, we clearly show how the ACS value increases proportionally with the α-hyperparameter only when using the proposed soft-consensus distribution. In Figure 4 we also show, on a patient from the DOD-O dataset, how we achieve a higher ACS value with the proposed base+LS SC model (soft-consensus distribution) compared to the base+LS U model (standard uniform distribution). The graph clearly highlights the differences between the output probabilities predicted by the different models: the probabilities predicted using our base+LS SC approach (d) are closer to the ground truth (a) than the ones predicted by the other models (e.g., refer to min. 300 and to the probabilities associated with sleep stage N3).

Figure 4. Hypnodensity-graphs from the scorers' labels and from the predicted probabilities of the DSN-L models.

Datasets
IS-RC. The dataset contains 70 recordings (0 males and 70 females) from patients with sleep-disordered breathing, aged from 40 to 57. The recordings were collected at the University of Pennsylvania. Each recording includes the EEG derivations C3-M2, C4-M1, O1-M2, O2-M1, one EMG channel, left/right EOG channels, one ECG channel, nasal airway pressure, oronasal thermistor, body position, oxygen saturation and abdominal excursion. The recordings are sampled at 128 Hz. We only consider the single-channel EEG C4-M1 to train our DSN-L architecture, and we use multichannel EEG, EOG, EMG and ECG to train the SSN architecture. A band-pass Chebyshev IIR filter is applied between [0.3, 35] Hz. Each recording is scored by six clinicians from five different sleep centers (i.e., University of Pennsylvania, University of Wisconsin at Madison, St. Luke's Hospital (Chesterfield), Stanford University and Harvard University) according to the AASM rules [2]. Some epochs are not scored by all six physicians, and for some epochs no annotation is available at all (not classified). We decided to remove the epochs left unscored by all the scorers, whilst epochs with fewer than six annotations are equally taken into account, to avoid excessive data loss.
DOD-H. The dataset contains 25 recordings (19 males and 6 females) from healthy adult volunteers aged from 18 to 65 years. The recordings were collected at the French Armed Forces Biomedical Research Institute's (IRBA) Fatigue and Vigilance Unit (Bretigny-Sur-Orge, France). Each recording includes the EEG derivations C3-M2, C4-M1, F3-F4, F3-M2, F3-O1, F4-O2, O1-M2, O2-M1, one EMG channel, left/right EOG channels and one ECG channel. The recordings are sampled at 512 Hz.
DOD-O. The dataset contains 55 recordings (35 males and 20 females) from patients suffering from obstructive sleep apnea (OSA), aged from 39 to 62 years. The recordings were collected at the Stanford Sleep Medicine Center. Each recording includes the EEG derivations C3-M2, C4-M1, F4-M1, F3-F4, F3-M2, F3-O1, F4-O2, FP1-F3, FP1-M2, FP1-O1, FP2-F4, FP2-M1, FP2-O2, one EMG channel, left/right EOG channels and one ECG channel. The recordings are sampled at 250 Hz. We only consider the single-channel EEG C4-M1 to train our DSN-L architecture, and we use all the available channels to train the SSN architecture, on both DOD-H and DOD-O. As in [23], a band-pass Butterworth IIR filter is applied between [0.4, 18] Hz to remove residual PSG noise, and the signals are resampled at 100 Hz. The signals are then clipped and divided by 500 to remove extreme values. The recordings from both DOD-H and DOD-O are scored by five physicians from three different sleep centers according to the AASM rules [2]. Both datasets contain the annotations W, N1, N2, N3, REM and "not classified". All the scorers agree on the not-classified epochs (100% agreement); therefore, all of them are removed from the data. Unlike the IS-RC database, five annotations are always available for each epoch.

Downloaded from https://academic.oup.com/sleep/advance-article/doi/10.1093/sleep/zsad028/7034145 by guest on 12 February 2023