Assessing REM Sleep in Mice Using Video Data

Study Objectives: Assessment of sleep and its substages in mice currently requires implantation of chronic electrodes for measurement of electroencephalogram (EEG) and electromyogram (EMG). This is not ideal for high-throughput screening. To address this deficiency, we present a novel method based on digital video analysis. This methodology extends previous approaches that estimate sleep and wakefulness without EEG/EMG in order to now discriminate rapid eye movement (REM) from non-REM (NREM) sleep. Design: Studies were conducted in 8 male C57BL/6J mice. EEG/EMG were recorded for 24 hours and manually scored in 10-second epochs. Mouse behavior was continuously recorded by digital video at 10 frames/second. Six variables were extracted from the video for each 10-second epoch (i.e., intraepoch mean of velocity, aspect ratio, and area of the mouse and intraepoch standard deviation of the same variables) and used as inputs for our model. Measurements and Results: We focus on estimating features of REM (i.e., time spent in REM, number of bouts, and median bout length) as well as time spent in NREM and WAKE. We also consider the model’s epoch-by-epoch scoring performance relative to several alternative approaches. Our model provides good estimates of these features across the day both when averaged across mice and in individual mice, but the epoch-by-epoch agreement is not as good. Conclusions: There are subtle changes in the area and shape (i.e., aspect ratio) of the mouse as it transitions from NREM to REM, likely due to the atonia of REM, thus allowing our methodology to discriminate these two states. Although REM is relatively rare, our methodology can detect it and assess the amount of REM sleep.

8-pin plastic connector/pedestal (Plastics One, Inc.) and then bonded to the skull with dental acrylic. After the bonding agent cured, the animals were connected to our signal-amplifier system using a connecting cable and swivel contact (Plastics One, Inc.) mounted above each cage. All mice were given 10-14 days for postoperative recovery and habituation before beginning any recording.
EEG and EMG signals were amplified using the Neurodata amplifier system (Model M15, Astro-Med, Inc., West Warwick, RI). Signals were amplified (2000×) and conditioned using the following settings for EEG signals: low cut-off frequency (-6dB), 0.3 Hz and high cut-off frequency (-6dB), 30 Hz; for EMG signals: low cut-off frequency (-6dB), 10 Hz and high cut-off frequency (-6dB), 100 Hz. Signals were digitized at 100 Hz. All data were acquired and analyzed using Gamma software (Astro-Med, Inc.) and converted to European data format (EDF) for manual scoring and analysis in the Somnologica science software (Embla, Inc., Denver, CO). WAKE, NREM, and REM sleep were manually scored using EEG/EMG in 10-second epochs during 24-hour baseline recordings. Sleep stages were determined as follows: epochs were scored as wake when the EMG amplitude ranged from activity slightly higher than baseline during quiet wakefulness to higher-amplitude activity during ambulation. EEG amplitude was low, with frequencies mostly above 10 Hz. NREM was characterized by high-amplitude delta (1-4 Hz). EMG was constant with low-amplitude activity. REM was scored when low-amplitude rhythmic theta waves (6-9 Hz) predominated, with the EMG remaining at baseline levels. Although our goal is to replace this manual scoring with an automated video-based system, these EEG/EMG-based manual scores will be our "gold standard" for comparison because they are currently the most widely accepted method for accurately scoring sleep.
Twenty-four hours of data divided into 10-second epochs implies 8,640 epochs for each of the 8 mice, giving us a total of 69,120 epochs that have been manually scored as REM, NREM, or WAKE. For each of these epochs, we also have video recordings captured at 10 frames per second, giving us 100 frames of sleep in mice based on digital video recordings. This is challenging because REM sleep is a relatively rare state compared with NREM sleep and wake and because episodes of REM sleep are short. We show that the identification of REM vs NREM is possible with reasonable accuracy, and we validate this by comparison with EEG/EMG assessments of REM sleep in C57BL/6J male mice. This new phenotyping strategy will be valuable for studies of molecular change in response to sleep, wake, or sleep deprivation and for screening of the recently created large number of knockout mice 3 to determine if they have altered sleep and wake.

ANIMAL STUDIES
One inbred strain of male mice was used in this study: C57BL/6J (n = 8, age: 10 to 12 weeks, weight: 18 to 23 g), purchased from Jackson Laboratory, Inc. (Bar Harbor, ME). Mice were individually housed in Plexiglas cages (4" wide × 8" long × 12" high) and maintained on a 12-hour light/dark cycle (lights on 0700; 80 lux at the floor of the cage) in a sound-attenuated recording room, temperature 22°C-24°C. Food and water were available ad libitum. Animals were acclimated to these conditions for 10-14 days before beginning any studies. All animal experiments were performed in accordance with the guidelines published in the NIH Guide for the Care and Use of Laboratory Animals and were approved by the University of Pennsylvania Animal Care and Use Committee.
Mice were implanted with EEG/EMG electrodes under deep anesthesia (intraperitoneal injection of ketamine [100 mg/kg] / xylazine [10 mg/kg]). For EEG recordings, 3 stainless-steel miniature screws (0-80 × 1/16, Plastics One, Inc., Roanoke, VA) were placed epidurally in the following locations: (1) right frontal cortex (1.7 mm lateral to midline and 1.5 mm anterior to bregma), (2) right parietal cortex (1.7 mm lateral to midline and 1 mm anterior to lambda), and (3) a reference electrode over the cerebellum (1 mm posterior to lambda on the midline). Two EMG electrodes were sutured onto the dorsal surface of the nuchal muscles immediately posterior to the skull. All leads from the electrodes were connected to an in time). Although accounting for such dependencies will likely substantially improve model performance, the first-order Markov assumption imposes several important restrictions. In particular, it implies that sleep-bout durations are (1) geometrically distributed and (2) do not depend on the previous state (e.g., the model assumes that WAKE bout lengths are distributed the same regardless of whether the previous bout was NREM or REM). Prior literature has found both of these assumptions untenable, 11 and, indeed, the unconditional fits of a geometric distribution to our data were quite poor. Motivated by these observations, we therefore also combined the random forest with a transition-dependent generalized Markov model. This allows the random forest to take account of very general time-dependence structures, including (1) non-local dependence, (2) bout duration distributions that are not geometrically distributed, and (3) bout duration distributions that depend on the previous state.
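The geometric bout-length restriction noted above follows directly from the first-order Markov assumption: if a state is retained from one epoch to the next with a fixed probability, the number of consecutive epochs spent in that state is geometrically distributed. The following is a minimal sketch of that relationship, not the authors' fitted model; the function names and the self-transition probability of 0.9 are hypothetical and chosen only for illustration.

```python
def bout_length_pmf(p_stay, k):
    """P(bout lasts exactly k epochs) under a first-order Markov chain.

    p_stay is the (hypothetical) probability of remaining in the same
    state from one epoch to the next; the bout ends with probability
    1 - p_stay at each step, which gives a geometric distribution.
    """
    return p_stay ** (k - 1) * (1.0 - p_stay)


def mean_bout_length(p_stay):
    """Expected bout length in epochs for the geometric distribution."""
    return 1.0 / (1.0 - p_stay)


# Illustrative value only: with p_stay = 0.9 the expected bout is
# 10 epochs, i.e., 100 seconds with 10-second epochs.
p_stay = 0.9
mean_epochs = mean_bout_length(p_stay)
```

Because the distribution is fixed by a single self-transition probability per state, it cannot depend on the preceding state, which is the second restriction mentioned in the text.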
When fit to data, our model provides an estimate of the probability that a mouse is in a given sleep state at a given epoch. Formally, our model estimates the probabilities $\hat{p}_t^{\,i} = P(Y_t = i \mid X, \hat{\theta})$, where $i$ is one of NREM, REM, or WAKE, $t$ indexes the epochs, $X$ is the full set of video covariates $(X_1, \ldots, X_{8{,}640})$, and $\hat{\theta}$ is an estimate of the model parameters. Our actual prediction for epoch $t$ is taken to be whichever state (NREM, REM, or WAKE) has the largest $\hat{p}_t^{\,i}$ at epoch $t$. We note that our estimates should be superior to the probability estimates produced by the basic random forests algorithm, which ignores the time-series structure and predicts $Y_t$ based only on $X_t$.
While the technical details pertaining to the estimation and computation of the models outlined above are beyond the scope of this manuscript, they can be found elsewhere. 12 Nonetheless, we note that the algorithm is fast, requiring only several seconds to estimate using the entire sequence of datapoints (i.e., all 24 hours worth of data) from a given mouse and only several minutes to predict on the entire sequence of datapoints (i.e., all 24 hours worth of data) from a different mouse. We also note that exploiting time dependencies greatly enhances our ability to detect signal in the data, particularly given the inherently high noise level. We will show that our proposed method is highly advantageous in terms of predicting REM sleep.

Evaluation
We focus our model evaluation on determining how well our model can track (1) the total amount of time spent in REM sleep, (2) the number of REM bouts, and (3) the median REM bout length using the values derived from EEG/EMG manual video data per epoch upon which to build our automated system (see Figure 2 for one such frame).
Tracking software was used to calculate, for each epoch with time index t, six continuous numerical covariates: the within-epoch mean of the velocity, aspect ratio, and size of the mouse and the within-epoch standard deviation of the velocity, aspect ratio, and size of the mouse (where the mouse is approximated by a tracking ellipse as shown in Figure 2). For velocity and size, we used the natural logarithms of the means and standard deviations as covariates. We also had one binary covariate which indicates whether or not the light in the cage was turned on (lights were on from 0700-1900). Henceforth, we denote the vector of our seven covariates for epoch t as X t .
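The covariate construction described above can be sketched as follows. This is an illustrative outline rather than the tracking software's actual output format: the function names, the per-frame input lists, the use of the sample standard deviation, and the small offset that guards against taking the logarithm of zero are all assumptions.

```python
import math
import statistics


def epoch_covariates(velocity, aspect_ratio, area, lights_on):
    """Build the seven covariates for one 10-second epoch.

    velocity, aspect_ratio, and area are lists of per-frame values
    (hypothetical tracking output); lights_on is a bool for the epoch.
    """
    eps = 1e-9  # assumption: avoids log(0) when the mouse is motionless

    def mean_sd(values):
        # Assumption: sample standard deviation within the epoch.
        return statistics.fmean(values), statistics.stdev(values)

    v_mean, v_sd = mean_sd(velocity)
    r_mean, r_sd = mean_sd(aspect_ratio)
    a_mean, a_sd = mean_sd(area)
    return [
        math.log(v_mean + eps), math.log(v_sd + eps),  # log velocity
        r_mean, r_sd,                                  # aspect ratio (raw)
        math.log(a_mean + eps), math.log(a_sd + eps),  # log size
        1.0 if lights_on else 0.0,                     # binary light flag
    ]


def lights_on_at(hour):
    """Lights were on from 0700 to 1900."""
    return 7 <= hour < 19


# Toy two-frame epoch; a real epoch would hold 100 frames.
x_t = epoch_covariates([1.0, 3.0], [2.0, 2.0], [100.0, 100.0], lights_on_at(8))
```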

Model
The sequential classification problem we face (i.e., the automated sleep scoring of mice) can be conceptualized by considering the data as consisting of two components, an "in-sample" component and an "out-of-sample" component. The in-sample component consists of all of the data from a single mouse, namely (1) the sleep states $(Y_1, Y_2, \ldots, Y_{8{,}640})$ where each $Y_t$ is one of NREM, REM, or WAKE and (2) the video-based covariates $(X_1, X_2, \ldots, X_{8{,}640})$. Using this in-sample data, we estimate a model that predicts the collection of $Y_t$ from the collection of $X_t$. The out-of-sample data component, in contrast, comes from a different mouse and consists of only the video-based covariates denoted $(\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_{8{,}640})$. The goal is to predict the corresponding $(\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_{8{,}640})$ using the estimated model and the collection of $\tilde{X}_t$.
Our modeling strategy builds on a statistical technique known as random forests. 7 A random forest is a collection of classification (or decision) trees, 8 each of which is constructed using random subsamples of the data and the covariates. The random forest combines the predictions made by each tree by allowing them each to "vote" on a sleep state; the probability of each sleep state is determined by the fraction of votes it receives, and the predicted state is the one with the most votes.
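The voting step described above can be illustrated with a short sketch. It assumes each tree has already produced a predicted state for one epoch; the function name and the example vote counts are hypothetical, and ties are broken arbitrarily by the order of the state list.

```python
from collections import Counter

STATES = ("NREM", "REM", "WAKE")


def forest_vote(tree_predictions):
    """Combine per-tree predictions for one epoch.

    Returns the fraction of votes for each state (used as the
    probability estimate) and the state with the most votes.
    """
    counts = Counter(tree_predictions)
    n_trees = len(tree_predictions)
    probs = {state: counts.get(state, 0) / n_trees for state in STATES}
    # On a tie, max() keeps the first state in STATES order.
    predicted = max(STATES, key=lambda state: probs[state])
    return probs, predicted


# Hypothetical forest of 10 trees voting on a single epoch.
votes = ["NREM"] * 6 + ["REM"] * 3 + ["WAKE"] * 1
probs, predicted = forest_vote(votes)
```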
Although the random forest algorithm is known to perform well in a wide variety of settings, it ignores a key feature of sleep data: namely, that the Y t and X t form sequences in time. This sequential nature leads to dependencies in the data. For example, if a mouse was awake in the last epoch (i.e., Y t-1 = WAKE), there is a high probability it will be awake this epoch (i.e., Y t = WAKE). It should be possible to modify the basic random forest to account for these dependencies and to thus enhance performance.
To do so, we build on previous work, which combines conventional methods with Markov models. 9,10 The general structure of a Markov model is illustrated in Figure 3. The mouse starts at time t = 1 in sleep state Y 1 (i.e., one of NREM, REM, or WAKE), and we observe video-based covariates X 1 that depend on Y 1 . Next, the mouse transitions to state Y 2 , and the process repeats itself until time t = T (in our case, T = 8,640).
In our modeling of sleep states in mice, we consider two particular Markovian enhancements of the basic random forest. First, we combine the random forest with a first-order Markov model. This enhances the random forest so that it takes account of local time dependencies (i.e., those that are nearby  This yields an expected amount of time spent in REM for each 2-hour block.
However, since the raw probabilities are not fully calibrated, we can improve on by introducing a threshold tuning parameter Ө. In particular, we let . For low values of Ө, the model will tend to underpredict , whereas, for high values, it will tend to overpredict it.
This notion is formalized in panel (a) of Figure 4, which gives the root mean square error (RMSE) between and for various values of Ө; we also look at the RMSE for the first 12 hours (dark) vs the second 12 (light). As can be seen, the optimal value occurs around Ө ≈ 0.31, regardless of whether one looks at light, dark, or all blocks. In panel (b) of Figure  4, we plot averaged across all 8 mice for the optimal value of Ө = 0.31. As can be seen, our video-based model's prediction of the amount of time spent in REM sleep quite accurately tracks that based on manual scoring.
The remaining panels of Figure 4 provide additional results for total time spent in REM sleep. In panel (c), we give the difference between the two methods ± 1 standard deviation; as can be seen, all differences lie less than one standard deviation from zero (for full details, see Table S1 of the supplement). In panels (d) and (e) of Figure 4, rather than averaging across all mice, we look at the algorithm's performance on two individual mice. Not surprisingly, the performance on individual mice is not quite as good as when averaged across all mice. Nonetheless, the curve for the video-based method tracks the contours of the curve for the manual scores. Furthermore, the differences between the two curves for individual mice appear calibrated with respect to the standard deviations in panel (c): 67% of the 96 individual 2-hour blocks (i.e., 8 mice × 12 two-hour blocks) are contained within 1 standard deviation and 94% are contained within 2. Nonetheless, this additional variability should be taken into consideration when our method is applied to individual mice.
A final point worth noting is that the amount of REM sleep is small. Fewer than 10 minutes are spent in REM per 2-hour block on average across all mice and in aggregate only about 5% of the time is spent in REM sleep. Furthermore, no single mouse spends more than about 15 minutes in REM in any 2-hour block.
In panels (a) and (b) of Figure 5, we examine how well the model performs at predicting the total number of REM bouts within a given 2-hour block (for full details, see Table S2 of the supplement). We estimate this quantity using an analogue of the threshold procedure used for the number of minutes spent in REM: (1) when is larger than and , we label epoch t a REM epoch; (2) using these labels for the 720 epochs within each 2-hour block, we can calculate the number of distinct bouts of REM. Although the model underpredicts the number of REM bouts, there appear to be no substantial differences between our video-based estimates and those based on manual scoring for any particular block. This is even more encouraging when one considers the fact that we again used Ө = 0.31, the value of Ө that was optimal for the number of minutes spent in REM. There is no guarantee this value is also optimal for the number of bouts of REM, and, indeed, predictions would likely improve if we were to estimate a different value scoring as the benchmark. In particular, we break our 24 hours worth of data into 12 two-hour blocks, and we examine these three metrics averaged across all mice for each of the blocks. We also examine how well the model performs at predicting total amount of time spent in each of the three states (REM, NREM, and WAKE) individually for each mouse.
A novel aspect of our methodology is that our model includes a threshold-tuning parameter that takes , the probability of REM sleep in epoch t as given by our model, and "converts" it into a REM score for epoch t. This parameter can be set by the user to adjust the specificity and sensitivity of the model's predictions so that the predictions can take account of the relative costs of false positives and negatives (which typically vary from application to application). We discuss this parameter and how to optimally tune it more fully in the Aggregate Measures of REM subsection of the Results section.
Although we focus on the summary statistics discussed above, we also examine how well our model is able to match the gold standard manual scores on an epoch-by-epoch basis.
Given that manual scoring is currently the most widely accepted method for accurately scoring sleep, matching manual scores to a reasonable degree is important. Nonetheless, there are several issues related to epoch-by-epoch matching worthy of mention. First, we anticipate that most applications of our methodology will focus on estimating the summary statistics rather than the epoch-by-epoch scores. While matching manual scores on an epoch-by-epoch basis is a sufficient condition for estimating the summary statistics, it is by no means a necessary one, and accurate estimates of the summary statistics can be obtained from models that are less precise on an epoch-by-epoch basis. Second, epoch-by-epoch manual scores are inconsistent: each of our epochs was scored independently by two different scorers who disagreed on approximately 5% of the epochs, 3 with disagreement rates highest among those epochs in which the sleep stage was transitional (in such cases, an independent third scorer was used to break the tie and to determine the "truth"). Consequently, the maximum possible epoch-by-epoch agreement rate between any model and manual scores will be below 100%.

Aggregate Measures of REM
We focus our model evaluation on aggregate measures of REM sleep. Specifically, we break the 24 hours' worth of data into 12 two-hour blocks and look at how well the model predicts the number of minutes spent in REM, the number of REM bouts, and the median REM bout length, each averaged over all 8 mice. We also examine the performance at predicting the number of minutes spent in REM for individual mice. We use the values of these quantities derived from the EEG/EMG manual scores as our target benchmark.
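We first consider the number of minutes spent in REM during block $j$, whose "true value" derived from manual scores we denote by $R_j$. To estimate $R_j$, we sum the raw probability of REM over each of the 720 epochs that make up a 2-hour block (i.e., 2 hours is 7,200 seconds or 720 epochs). That is, we set $\hat{R}_j = \sum_{t \in \text{block } j} \hat{p}_t^{\,\mathrm{REM}}$, where $\hat{p}_t^{\,\mathrm{REM}}$ is the model estimate of the probability of REM at epoch $t$ (i.e., $\hat{p}_t^{\,\mathrm{REM}} = P(Y_t = \mathrm{REM} \mid X, \hat{\theta})$). The sum counts expected REM epochs; dividing by 6 expresses it in minutes.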
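The summation described above can be sketched in a few lines. This is an illustration only: the function name and the constant probability in the example are hypothetical, and the per-epoch REM probabilities are assumed to be supplied by a fitted model. No threshold tuning is applied here.

```python
EPOCH_SECONDS = 10
EPOCHS_PER_BLOCK = 720  # 2 hours = 7,200 seconds = 720 epochs


def expected_rem_minutes(p_rem):
    """Expected minutes of REM in each 2-hour block.

    p_rem is a list of per-epoch REM probabilities for one mouse
    (8,640 values for 24 hours). Summing probabilities within a block
    gives the expected number of REM epochs; multiplying by the epoch
    length and dividing by 60 converts that count to minutes.
    """
    minutes = []
    for start in range(0, len(p_rem), EPOCHS_PER_BLOCK):
        block = p_rem[start:start + EPOCHS_PER_BLOCK]
        minutes.append(sum(block) * EPOCH_SECONDS / 60.0)
    return minutes


# Hypothetical input: a constant 5% REM probability in every epoch.
per_block = expected_rem_minutes([0.05] * 8640)
```

With a constant probability of 0.05, each block yields about 6 expected minutes of REM, which is of the same order as the roughly 5% of time in REM reported in the text.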
full details, see Table S3 of the supplement). As for number of REM bouts, we (1) label an epoch as REM when is larger than and and (2) calculate the median bout length using these labels for the 720 epochs in a given block.
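Once each epoch in a block carries a label, counting bouts and taking the median bout length reduces to run-length encoding. The sketch below assumes the labels have already been produced (by whatever thresholding rule is in use) and treats a bout as a maximal run of consecutive REM epochs; the function names are hypothetical.

```python
from itertools import groupby
from statistics import median


def rem_bout_lengths(labels, epoch_seconds=10):
    """Length in seconds of each maximal run of consecutive REM epochs."""
    return [
        sum(1 for _ in run) * epoch_seconds
        for state, run in groupby(labels)
        if state == "REM"
    ]


def rem_bout_summary(labels):
    """Return (number of REM bouts, median bout length in seconds)."""
    lengths = rem_bout_lengths(labels)
    return len(lengths), (median(lengths) if lengths else None)


labels = ["NREM", "REM", "REM", "NREM", "REM", "WAKE", "REM", "REM", "REM"]
n_bouts, median_seconds = rem_bout_summary(labels)

# A single missed interruption merges two bouts into one longer bout.
interrupted = ["REM"] * 5 + ["NREM"] + ["REM"] * 5
uninterrupted = ["REM"] * 11
```

The last two sequences differ in only one epoch, yet the first has two 50-second bouts and the second has one 110-second bout, while total REM time changes by just 10 seconds.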
of Ө specifically for the number of bouts of REM. Nonetheless, doing so would also add an extra parameter to the model.
Finally, panels (c) and (d) of Figure 5 show the performance of the model at forecasting the median REM-bout length (for  In panel (B), we give the total time spent in REM sleep averaged across all mice for each two-hour block based on electroencephalography (EEG; black) and video (gray) for the optimal parameter value of 0.31. In panel (C), we give the difference between EEG and video ± 1 standard deviation. In panels (D) and (E), we give total time spent in REM sleep for two individual mice for each two-hour block based on EEG (black) and video (gray). (b), we give the difference between the two methods ± 1 standard deviation. There are no significant differences between our model and the "truth" as given by EEG/EMG data. In panels (c) and (d) of Figure 6, rather than averaging across all mice, we look at the algorithm's performance for the two individual mice considered in the lower panels of Figure 4. As can be seen, the video-based method tracks the manual scores closely with no major divergences when evaluated both in aggregate across all mice and for individual mice. In Figure 7, we provide the same plots but for WAKE (for full details, see Table S5 of the supplement). The model's estimates of time spent awake track the manual scores extremely well again, with no major divergences from the manual scores. This again holds both at the aggregate and individual level.

Epoch-by-Epoch Scoring Evaluation
Though our principal focus is on estimating measures of REM sleep (such as time spent in REM, number of REM bouts, and median REM bout duration), we also examined whether our algorithm could replicate the manual scores on an epoch-by-epoch basis. In particular, our epoch-by-epoch scoring evaluation directly compares the performance of 5 different methods: (1) The model consistently overpredicts the median bout length by about 20-30 seconds (2-3 epochs). This is the mirror image of the model's modest underprediction of number of bouts (since number of bouts times median bout length is roughly equivalent to total time spent REM). Again, we used Ө = 0.31 here, and, although predictions would likely improve if a value of Ө were specifically estimated for the median bout length, doing so would add yet another parameter to the model.

Aggregate Measures of NREM Sleep and Wake
Although our primary focus is how well our model estimates REM sleep, we also provide data on the estimation of both NREM and WAKE amounts. We again do so using the value Ө = 0.31 for our threshold tuning parameter. That is, we set and where and Ө = 0.31 (normalization by ensures that the total time spent in all states sums to the proper value of two hours per block).
In Figure 6, we provide the analogue of Figure 4 but for NREM sleep (for full details, see Table S4 of the supplement). Panel (a) gives the total time spent in NREM sleep averaged across all mice based on EEG (black) and video (gray). In panel  In panel (B), we give the difference between EEG and video ± 1 standard deviation. In panel (C), we give the median REM bout length in minutes averaged across all mice for each 2-hour block based on EEG (black) and video (gray). In panel (D), we give the difference between EEG and video ± 1 standard deviation.

sample so as to minimize the error rate with respect to the gold standard. An additional point worth noting about the 40-second Rule is that it can only distinguish sleep from wakefulness, whereas all other methods considered can distinguish among REM, NREM, and wakefulness. Before trying to discriminate REM from NREM, we first consider the simpler 2-state problem of forecasting SLEEP vs WAKE. The "true" score for an epoch is SLEEP if the manual scorers scored it as REM or NREM, and it is WAKE otherwise (as mentioned earlier, when the two manual scorers disagreed, an independent third scorer was used to break the tie and determine the "truth"). The various classification methods are then trained using this 2-state SLEEP/WAKE score as the response.
We declare an epoch to be in error if a given method classifies the epoch as something other than the "true" score and present the error rates in the second column of Table 1. As can be seen, one can achieve error rates lower than 10%. Although the 40-second Rule performs well, this method can be defeated by models that account for the additional information beyond velocity which is present in the video data. Indeed, the best overall error rate of 8.8% is achieved by our RF+TDGMM method; multinomial logistic regression, (2) random forests, (3) random forests combined with a first-order Markov model (RF+1MM), (4) random forests combined with a transition-dependent generalized Markov model (RF+TDGMM), and (5) the so-called "40-second Rule." 4 The fourth method, RF+TDGMM, is our method, which we have been examining thus far. The second and third represent various simplifications of it. Finally, the first and fifth are more common in the literature.
We also examine the "error rate" for the gold standard of manually scored EEGs. In particular, we declare the gold standard to be in error if the two original scorers scored the same epoch differently.
Before proceeding, we note the error rates for four of the five methods are completely "out of sample" in the sense that the models are tuned and fit for each mouse and then applied and evaluated on different mice. The only exception is the 40-second Rule. This algorithm considers a mouse "inactive" in a given 10-second epoch if the mean intraepoch velocity is less than 3 pixels per second; it then rules a mouse asleep when there are four or more consecutive inactive epochs. A single parameter (i.e., 40 seconds/4 epochs as opposed to some other multiple of 10 seconds/1 epoch) has been optimized in the Our second evaluation considers the 3-state problem (i.e., REM vs NREM vs WAKE), and we present our results in the third through fifth columns of Table 1 (since the 40-second Rule can only discriminate sleep from wakefulness but not REM from NREM, it is listed as NA in these columns). This problem is much more difficult for classification methodologies since they now must choose among three alternatives rather than two. Further complicating this difficulty is the fact that REM occurs only about 5% of the time and looks somewhat similar to NREM in terms of video covariates. Consequently, the overall error rate for each method is higher in the third column vs the second column of the table.
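The 40-second Rule described above is simple enough to sketch directly. The velocity threshold and run length come from the text; the function name is hypothetical, and the choice to label every epoch in a qualifying run as asleep (rather than only the fourth epoch onward) is an assumption, since the text does not specify it.

```python
from itertools import groupby

VELOCITY_THRESHOLD = 3.0  # pixels per second, as stated in the text
MIN_INACTIVE_EPOCHS = 4   # 4 x 10-second epochs = 40 seconds


def forty_second_rule(mean_velocities):
    """Score each epoch as SLEEP or WAKE from mean intraepoch velocity.

    Assumption: all epochs in a run of four or more consecutive
    inactive epochs are scored SLEEP; shorter inactive runs and all
    active epochs are scored WAKE.
    """
    inactive = [v < VELOCITY_THRESHOLD for v in mean_velocities]
    scores = []
    for is_inactive, run in groupby(inactive):
        run_length = sum(1 for _ in run)
        asleep = is_inactive and run_length >= MIN_INACTIVE_EPOCHS
        scores.extend(["SLEEP" if asleep else "WAKE"] * run_length)
    return scores


# Toy velocities: a 3-epoch inactive run (too short), then a 5-epoch run.
velocities = [5, 1, 1, 1, 5, 1, 1, 1, 1, 1, 5]
scores = forty_second_rule(velocities)
```

Because the rule uses only velocity, it can separate sleep from wakefulness but has no information with which to separate REM from NREM.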
In addition to this overall error rate, which is determined as outlined above, we also consider the false positive and false negative rate for REM, which is of special interest. Again, using the manual scores as "truth" (with ties broken by an independent third scorer when necessary), an epoch is classified as a REM false positive if the classification method declares it to be REM but the manual scorer does not; the REM false positive rate is thus the number of such epochs divided by the total number of epochs declared to be other than REM by manual scoring. Similarly, an epoch is declared to be a REM false negative if the manual scorers score it as REM but the classification this compares favorably to the 4.8% disagreement rate among manual scorers.  The first column gives the methodology, the second column gives the overall error rate on the two-state SLEEP/WAKE problem, and the third through fifth columns give, respectively, the overall error rate, the rapid eye movement (REM) false positive rate, and the REM false negative rate on the three-state non-REM (NREM)/REM/WAKE problem. RF+1MM denotes the random forest combined with a first-order Markov model whereas RF+TDGMM denotes the random forest combined with the transition-dependent generalized Markov model. lecular level could be affected by the recent surgery and the insertion of foreign objects into the mouse skull. A second application of our procedure is for high-throughput phenotyping, something that is increasingly important for studying the large number of knockout mice that are now available. 3 Currently, the only other automated approach being applied is assessment of mouse behavior by piezoelectric data. 5 This method measures pressure changes in the floor of the mouse cage produced by movement. There are highly variable signals during wakefulness as the mouse moves around; signals during sleep reflect breathing. 
It is conceivable that the piezo technology could also identify REM sleep because breathing in REM sleep is more irregular than in NREM sleep. 13 At present, however, this possibility has not been assessed.
The sensitivity and specificity of video-based methods to estimate sleep and its substages might be improved if the mouse behavior was observed by video not only from above but also from the side. A 3-dimensional assessment of the mouse using high-resolution video would likely improve assessment of its behavior, including the problem addressed here (i.e., identifying NREM and REM sleep). Such a system would likely be able to determine breathing, as is possible with piezo, as well as the small twitches that occur during REM sleep. Video analysis also provides the opportunity to identify other behaviors, and it is likely that analytic strategies could be developed to study a whole range of mouse behaviors.
In our studies, we used 10-second epochs to score wake and the stages of sleep. We did so because (1) this is the most commonly used epoch length for scoring of behavioral state 4 and (2) the original papers assessing behavioral state by non-EEG/EMG-based approaches used this epoch length. 4,5 Behavioral states of wake and sleep can, however, occur in quite short episodes, and, hence, examining in detail the architecture of sleep (bout length) requires scoring in 4-second epochs. 11,14 It is conceivable that if we had used 4-second epochs in this study, we might have found better agreement with bout lengths, etc. Future studies need to assess the impact of different epoch lengths on agreement between video and EEG analysis.
Epoch lengths aside, there are several differences between the results for time spent in REM sleep on one hand and number of bouts, bout length, and epoch-by-epoch scores on the other hand. First, there was a methodological choice: since our focus was on the time spent in REM, we tuned our parameter Ө to that quantity, whereas for the latter we either fixed it at the value that was optimal for time spent in REM (number of bouts and bout length) or at the default of one (epoch-by-epoch scoring). Future studies focusing on these metrics should consider tuning our method's predictions specifically for them. Second, beyond modeling choices, there are fundamental differences between time spent in REM and the other metrics. The expected time spent in REM does not require the conversion of a model's probabilities for each state at each epoch into a predicted sleep state for that epoch; rather, these probabilities can be summed across all epochs, yielding the expected time in the state. On the other hand, computing epoch-by-epoch scores, number of bouts, and bout length requires the conversion of these probabilities into sleep scores on an epoch-by-epoch basis. This fundamental difference underlies the varying results observed. Finally, there is a third difference that applies method does not; the REM false negative rate is the number of such epochs divided by the total number of epochs scored as REM by the manual scorers.
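The overall error rate and the REM false positive and false negative rates, as defined in the text, can be computed from paired manual and predicted scores as in the following sketch. The function name and the ten-epoch example are hypothetical and are not drawn from the study data.

```python
def rem_error_rates(manual, predicted):
    """Overall error rate plus REM false positive/negative rates.

    False positive rate: epochs predicted REM but not manually scored
    REM, divided by all epochs manually scored as other than REM.
    False negative rate: epochs manually scored REM but not predicted
    REM, divided by all epochs manually scored as REM.
    """
    pairs = list(zip(manual, predicted))
    n_rem = sum(1 for m, _ in pairs if m == "REM")
    n_not_rem = len(pairs) - n_rem
    false_pos = sum(1 for m, p in pairs if m != "REM" and p == "REM")
    false_neg = sum(1 for m, p in pairs if m == "REM" and p != "REM")
    errors = sum(1 for m, p in pairs if m != p)
    nan = float("nan")
    return {
        "overall_error": errors / len(pairs) if pairs else nan,
        "rem_false_positive_rate": false_pos / n_not_rem if n_not_rem else nan,
        "rem_false_negative_rate": false_neg / n_rem if n_rem else nan,
    }


# Hypothetical scores for ten epochs.
manual = ["REM", "REM", "NREM", "NREM", "WAKE",
          "WAKE", "NREM", "WAKE", "NREM", "WAKE"]
predicted = ["REM", "NREM", "REM", "NREM", "WAKE",
             "WAKE", "NREM", "WAKE", "NREM", "NREM"]
rates = rem_error_rates(manual, predicted)
```

Because REM is rare, the false negative denominator is small, so a method that almost never predicts REM can show a low overall error rate alongside a very high REM false negative rate.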
The table reveals what is already known: REM is difficult to classify correctly, with REM false positive and false negative rates that are much higher than the overall error rates. The challenge here is to discover a method with any power to detect REM sleep. That is, there is an inherent trade-off between (1) obtaining a low REM false negative rate accompanied by a higher overall and REM false positive rate or (2) obtaining lower overall and REM false positive rates while having a high REM false negative rate. Since high REM false negative rates mean our models have little or no power to detect REM, we prefer to err on the side of (1) rather than (2).
Indeed, logistic regression and random forests have the best overall error rates and low REM false positive rates, but this is because they largely ignore the REM state (i.e., they very rarely classify an epoch as REM) leading to extremely high REM false negative rates. On the other hand, our RF+TDGMM methodology is able to achieve a good balance relative to other methods: it has a REM false negative rate that is much lower than the competitor models while remaining competitive on both the overall error rate and the REM false positive rate. By accounting for the time dependence of the data, our procedure is able to capture a greater proportion of the REM signal. Furthermore, by retaining a reasonable false positive rate relative to the other methods, our model does not sacrifice specificity in order to gain substantial improvements in sensitivity.
In sum, our RF+TDGMM methodology can detect REM sleep in video data. In achieving a lower REM false negative rate (i.e., actually detecting REM), it does have a commensurately higher overall and REM false positive rate as compared with methods such as logistic regression and random forests which tend to ignore the REM state. Finally, as demonstrated in the previous subsections, the RF+TDGMM can be combined with a threshold-tuning parameter to provide accurate assessments of aggregate measures of sleep and wakefulness such as the amount of time spent in REM sleep over 2-hour blocks.

DISCUSSION
In this study, we demonstrate that there is signal in video recordings of mice that is capable of distinguishing NREM from REM sleep. There are subtle changes in the area and shape of the mouse as it transitions from NREM to REM sleep, likely as a result of the atonia of REM sleep. Although REM sleep is a relatively rare state, as compared with NREM and WAKE, our methodology can provide reasonable estimates of it. This new methodology extends previous approaches 4,5 that do not require EEG/EMG recording to now differentiate REM from NREM as opposed to merely SLEEP from WAKE.
This new method has several immediate applications (i.e., applications that require no further estimation of the model parameters, including the threshold parameter). First, in studies assessing changes in mRNA or protein with sleep and wake, this approach is much more cost-effective for estimating sleep states. EEG/EMG recording requires surgical implantation of electrodes, time to recover from surgery, and labor-intensive manual scoring of EEG/EMG recordings. There is, moreover, a concern that results at the mo-to number of bouts and bout length. When long stretches of REM are briefly interrupted (e.g., an epoch or two of NREM or WAKE surrounded on either side by many epochs of REM), the model's estimates of number of bouts and bout length, assuming it cannot detect these brief interruptions, will be strongly negatively impacted, whereas there will be little impact on time spent in REM. Despite this major difference, our method is still quite competitive at estimating these more difficult quantities.
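The sensitivity of bout statistics to brief interruptions can be illustrated with a small sketch. It assumes that a bout is a maximal run of consecutive epochs scored as REM and that epochs last 10 seconds. The helper names `bout_lengths` and `summarize` and the two score sequences are hypothetical.

```python
from itertools import groupby
from statistics import median


def bout_lengths(scores, state="REM"):
    """Lengths (in epochs) of each maximal run of consecutive `state` epochs."""
    return [sum(1 for _ in run) for label, run in groupby(scores) if label == state]


def summarize(scores, state="REM", epoch_sec=10):
    """Time in state, number of bouts, and median bout length for one recording."""
    lengths = bout_lengths(scores, state)
    return {
        "minutes": sum(lengths) * epoch_sec / 60,
        "n_bouts": len(lengths),
        "median_bout_sec": median(lengths) * epoch_sec if lengths else 0,
    }


# Made-up manual scores: a long REM stretch interrupted by a single NREM epoch.
manual = ["REM"] * 10 + ["NREM"] + ["REM"] * 10
# Made-up model scores: identical, except that the interruption is missed.
model = ["REM"] * 21

print(summarize(manual))
print(summarize(model))
```

Here a single missed epoch changes the time spent in REM by only 10 seconds (200 versus 210 seconds), whereas the number of bouts drops from 2 to 1 and the median bout length rises from 100 to 210 seconds.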
The application of the video methodology will be in uninstrumented mice (i.e., mice without EEG/EMG headstages). It is conceivable that the changes in shape (i.e., aspect ratio) and area as the mouse transitions from NREM to REM sleep are sufficiently different in uninstrumented mice that the model presented here (which was fit to data from instrumented mice) will be inaccurate on uninstrumented mice. We believe that this is unlikely for two reasons. First, the cable connected to the mouse's head is carefully counterbalanced so that the mouse moves freely and there is no excessive force on the head; thus, it seems unlikely that the cable will result in different changes in shape and area as the mouse becomes more atonic in REM sleep. Second, it is the changes in aspect ratio and area that are most important for differentiating NREM from REM sleep; the absolute magnitudes of these variables are of secondary importance, and, indeed, they vary from mouse to mouse. Although it seems that the need for instrumentation will therefore not affect the accuracy of our approach, the question is ultimately unanswerable, since validation requires EEG/EMG recordings; such recordings, in turn, require some form of instrumentation, whether by the methodology used here or by telemetry (which also could potentially alter mouse shape and area).
In conclusion, this study shows that video analysis can distinguish REM from NREM sleep in mice. Future elaborations of this technological approach could lead to further improvements in these estimates. Thus, high-throughput phenotyping of sleep and wake in mice is feasible. It will facilitate studies of the role of specific genes, using the large number of mice with knockout of specific genes that are more available 3, and the investigation of chemical libraries to determine compounds that affect sleep and wake, as has been done in zebrafish. 15

Assessing REM Sleep in Mice Using Video Data-McShane et al

Supplement to "Assessing REM Sleep in Mice Using Video Data"

In this supplement, we present tables detailing the data plotted in Figures 4a and 4b (Table S1), Figures 5a and 5b (Table S2), Figures 5c and 5d (Table S3), Figures 6a and 6b (Table S4), and Figures 7a and 7b (Table S5) of the main text.
In Table S1, we present the mean and standard deviation of the number of minutes spent in REM sleep in each two-hour block (column one) across all eight mice for both EEG/EMG manual scores (columns two and three) and for our video-based model (columns four and five). We also give the mean and standard deviation of the difference between manual scores and the model across all eight mice (columns six and seven).

We give the total time spent in WAKE averaged across all mice for each two-hour block for EEG and video assessments. We also give the standard deviation of each, their difference, and the standard deviation of the difference.
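The summary columns described for Table S1 can be sketched as follows for a single two-hour block. This is an illustration only: the minutes below are made up for three hypothetical mice and are not data from this study. The sketch also assumes that the difference is taken as manual minus model within each mouse and that the standard deviation is the sample standard deviation (n - 1 denominator). The function name `summarize_block` is hypothetical.

```python
from statistics import mean, stdev


def summarize_block(manual_minutes, video_minutes):
    """Across-mouse mean and SD for manual scores, the model, and their difference.

    Assumptions: difference = manual - model per mouse; SD is the sample SD.
    """
    if len(manual_minutes) != len(video_minutes):
        raise ValueError("need one manual and one model value per mouse")
    diffs = [m - v for m, v in zip(manual_minutes, video_minutes)]
    return {
        "manual_mean": mean(manual_minutes),
        "manual_sd": stdev(manual_minutes),
        "video_mean": mean(video_minutes),
        "video_sd": stdev(video_minutes),
        "diff_mean": mean(diffs),
        "diff_sd": stdev(diffs),
    }


# Made-up minutes of REM in one two-hour block for three hypothetical mice.
manual_minutes = [6.0, 8.0, 10.0]
video_minutes = [5.0, 9.0, 10.0]
print(summarize_block(manual_minutes, video_minutes))
```

In this toy block the two methods have the same mean across mice even though they disagree for individual mice, which is why the standard deviation of the per-mouse difference is reported alongside the mean difference.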