Luísa K Pilz, Melissa A B de Oliveira, Eduardo G Steibel, Lucas M Policarpo, Alicia Carissimi, Felipe G Carvalho, Débora B Constantino, André Comiran Tonon, Nicóli B Xavier, Rodrigo da Rosa Righi, Maria Paz Hidalgo, Development and testing of methods for detecting off-wrist in actimetry recordings, Sleep, Volume 45, Issue 8, August 2022, zsac118, https://doi.org/10.1093/sleep/zsac118
Abstract
In field studies using wrist actimetry, not identifying and handling off-wrist intervals may result in their misclassification as immobility/sleep and in biased estimations of rhythmic patterns. By comparing different solutions for detecting off-wrist, our goal was to ascertain how accurately they detect nonwear in different contexts and to identify variables that are useful in the process.
We developed algorithms using heuristic (HA) and machine learning (ML) approaches. Both were tested using data from a protocol followed by 10 subjects, which was devised to mimic contexts of actimeter wear/nonwear in real-life. Self-reported data on usage according to the protocol were considered the gold standard. Additionally, the performance of our algorithms was compared to that of visual inspection (by 2 experienced investigators) and Choi algorithm. Data previously collected in field studies were used for proof-of-concept analyses.
All methods showed similarly good performances. Accuracy was marginally higher for one of the raters (visual inspection) than for the heuristically developed algorithms (HA, Choi). Short intervals (especially < 2 hr) were either not identified or only poorly identified. Consecutive stretches of zeros in activity were considered important indicators of off-wrist (for both HA and ML). It took hours for raters to complete the task, as opposed to the seconds or few minutes taken by the automated methods.
Automated strategies of off-wrist detection are similarly effective to visual inspection, but have the important advantage of being faster, less costly, and independent of raters’ attention/experience. In our study, detecting short intervals was a limitation across methods.
Statement of Significance
In field studies using actimetry, not identifying and handling nonwear may result in its misclassification as immobility/sleep or bias estimations of analyses characterizing rhythmic patterns. When compliance is low and researchers are working with long series and large datasets, detecting missing data by visual inspection becomes a laborious and time-consuming process. By comparing different strategies of off-wrist detection, our goal was to ascertain which ones accurately classify wear/nonwear and to identify variables that are particularly useful in the process. Furthermore, we aimed to assess in which contexts nonwear might be difficult to detect. Our results suggest that automated strategies of off-wrist detection are similarly effective to visual inspection, but have the important advantage of being faster, less costly, and independent of raters’ attention/experience. In our study, detecting short intervals was a limitation across methods, which should be considered especially when investigating naps with actimetry.
Introduction
Actimetry, also known as actigraphy, refers to the monitoring of motor activity, which in humans is mainly obtained by means of devices equipped with accelerometers. These devices are most often worn on the wrist (especially for estimating sleep timing and duration), although they can also be placed elsewhere (e.g. on the ankle or trunk) [1]. They are often equipped with additional sensors capable of measuring parameters other than activity (e.g. light exposure, skin temperature, and heart rate). Their recordings have been used in circadian and sleep research for over four decades [2], and although polysomnography (PSG) is the gold standard for assessing sleep, actimetry is capable of providing information that is impossible to capture in one single night in the laboratory. Sleep timing and duration across days “in real life,” as well as their regularity, cannot be fully assessed in lab studies [3]. Actimetry has also proven to be useful for studying physical activity, mobility, light exposure, and behavior in field and epidemiological studies [4,5]. The most outstanding advantage of actimetry is that it enables continuous recordings, besides being a noninvasive and convenient method that does not interfere with individuals’ normal routines. This is particularly useful for the circadian field to develop a better understanding of human daily patterns in entrained conditions, while individuals follow their usual routines.
Despite all its advantages, actimetry of course has limitations. Considering that sleep periods are inferred solely from movement data, actimetry can misestimate sleep as identified by PSG: previous research suggests that actimetry tends to overestimate sleep and underestimate wake during a sleep episode [6–8]. Furthermore, it is also very common for participants to take off the actimeter during the study, especially devices that are not waterproof. The resulting off-wrist intervals may consist of stretches of distinct durations, at random times, and represent a challenge when it comes to distinguishing missing data due to nonwear from periods of inactivity/rest. Dismissing the importance of identifying and handling nonwear may result in its misclassification as immobility/sleep or bias estimations of analyses characterizing rhythmic patterns.
When compliance is low and researchers are working with long series and large datasets, using visual inspection to detect missing data becomes a laborious and time-consuming process. Nevertheless, this is a key step prior to data analysis. Many devices cannot themselves detect whether they are off-wrist, which can lead to difficulties in scoring [6]. Some devices have proximity sensors to identify nonwear periods, but these are not present in all actimeters, are limited by the nearness of the sensor to the skin (i.e. snug vs. loose wear), and may still have to be improved to detect wear vs. nonwear with better accuracy [9]. Furthermore, currently available algorithms are often hardware-specific or have not been tested across devices. GGIR, for instance, is a widely used open-source software package [10,11] developed for processing and analyzing data from three widely used brands of actimeters (GENEActiv, ActiGraph, and Axivity). It provides an algorithm for identifying nonwear periods which, nevertheless, relies on raw data (i.e. triaxial acceleration). Nonwear is estimated based on the “standard deviation and the value range of each accelerometer axis, calculated for consecutive blocks of 30 minutes” [12], a window later changed to 60 min to decrease incorrect classifications of inactivity as nonwear (false positives; type-I error) [13]. Many actimeters, however, perform onboard signal processing and store only the derived output to reduce battery consumption and memory usage, even if they sample data using triaxial accelerometers.
The Troiano and Choi algorithms are count- and epoch-based: they identify periods of zero counts, computed from the acceleration data, within specified time intervals. Both were initially developed for hip-worn accelerometer data. The Troiano algorithm, originally used to analyze data from the NHANES 2003–2004 cohort, defines nonwear periods as intervals of at least 60 consecutive minutes of no activity counts, accepting up to two consecutive minutes of 1–100 counts without breaking the nonwear classification [14]. In 2011, Choi et al. updated and modified this algorithm, considering periods of consecutive zero activity counts of a certain duration as nonwear chunks, with 90 min as the default minimum duration [15]. They proposed that movement artifacts of up to 2 min within the nonwear bout could be tolerated without breaking the classification, provided windows of 30 min with no counts (zeros) were detected upstream and downstream of the artifact. In another study [16], Choi et al. tested their algorithm in free-living conditions with data collected from ActiGraph GT3X and GT1M devices worn on the wrist and hip by older adults; they also compared the algorithm applied to the vector magnitude derived from the three axes vs. the vertical-axis counts, and windows of 60 vs. 90 min. It classified wear/nonwear better with the 90 min window and the vector magnitude; they also found that the monitor worn at the wrist was more sensitive for detecting wear/nonwear during the waking period than the one worn at the waist. Its performance was also considered superior to that of the Troiano algorithm. Knaier and colleagues recently validated these two algorithms with wrist- and hip-worn devices (ActiGraph GT3X+) [17] in a sample of male athletes. Results indicated the superiority of the Choi algorithm over Troiano’s, although it worked better with hip- than wrist-worn devices.
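To make this epoch-based logic concrete, below is a minimal Python sketch of a Choi-style rule for 1-min count data (90-min zero-count bouts, tolerating up to 2 min of movement artifacts flanked by 30-min zero streams). It illustrates the reasoning only; it is not the reference implementation used in the studies cited above, and the function and parameter names are ours.

```python
import numpy as np

def choi_style_nonwear(counts, window=90, allowance=2, stream=30):
    """Flag 1-min epochs as nonwear: a bout is >= `window` consecutive
    zero-count minutes, tolerating artifacts of <= `allowance` nonzero
    minutes if flanked by `stream` minutes of zeros on both sides."""
    counts = np.asarray(counts)
    n = len(counts)
    nonwear = np.zeros(n, dtype=bool)
    i = 0
    while i < n:
        if counts[i] != 0:
            i += 1
            continue
        j = i  # candidate bout starts at the first zero-count minute
        while j < n:
            if counts[j] == 0:
                j += 1
                continue
            k = j  # nonzero run starting at j = candidate movement artifact
            while k < n and counts[k] != 0:
                k += 1
            if k == n or k - j > allowance:
                break  # artifact too long (or runs to the end): bout ends at j
            # require `stream` zero minutes up- and downstream of the artifact
            if np.any(counts[max(0, j - stream):j] != 0) or np.any(counts[k:k + stream] != 0):
                break
            j = k  # absorb the artifact and continue the bout
        if j - i >= window:
            nonwear[i:j] = True
        i = j + 1
    return nonwear
```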
Algorithms are often tested on existing datasets, using data provided by participants about their use of the devices or comparing the algorithms to visual inspection. The current study aimed to develop and test algorithms to detect off-wrist stretches and thus contribute information and alternatives for detecting nonwear when preprocessing wrist actimetry data stored as “counts.” By comparing different strategies in free-living conditions, our goal was to ascertain which ones accurately detect nonwear and to identify variables that are particularly useful in the process. Furthermore, we aimed to assess in which contexts nonwear might be difficult to detect. We hypothesized that: (1) using conditions and filters based on activity and skin temperature recordings, a heuristic algorithm could be an efficient method to differentiate valid from off-wrist data; (2) algorithms devised using ML would show good performance and indicate activity and temperature as variables to be used when identifying off-wrist; and (3) both algorithms would be superior to visual inspection in detecting off-wrist.
Methods
We developed algorithms using either (1) a heuristic or (2) a machine learning approach (both available at https://github.com/LMicol/offwrist-detection). Both were tested using data from an off-wrist protocol followed by 10 subjects, which was devised to mimic contexts of actimeter wear/nonwear in real life. We then used their self-reported data on off-wrist intervals as the gold standard. The performances of visual inspection by experienced investigators and of the Choi algorithm [15] were also assessed and compared to those of our algorithms. Finally, data previously collected in other epidemiological studies were used for proof-of-concept analyses to test our algorithms against visual inspection (as the gold standard). Below, we describe each step in detail.
Off-wrist protocol
We designed an off-wrist protocol to mimic situations that may interfere with temperature and light data collection, as well as taking off the device. These situations are henceforth referred to as “contexts,” and participants kept a log of which context the actimeter was in and when (e.g. off-wrist in the sun from 02:00 pm to 06:00 pm). We recruited 10 undergraduate/graduate students from our lab to wear actimeters (ActTrust, Condor) on their non-dominant wrist for at least 14 days, until completion of the off-wrist protocol. Recordings were stored in 1 min bins. The subjects could opt for intervals of varied durations (e.g. 2, 4, 6, 12, and 24 hr) for each context and were instructed to wear the actimeter normally for at least 2 hr between them. No restrictions were imposed on how long they could take to complete the protocol. The contexts simulated by our protocol might be identified (correctly or incorrectly) as off-wrist. Figure 1 depicts the reasoning of our study and details all contexts. The total length (sum) and proportion of the entire series that each context represents are shown in Table 1.
Table 1. Total length and proportion of the series represented by each context.

| Context | Subjects (n) | Length (% of series) | Length (min) |
| --- | --- | --- | --- |
| Off-wrist, under the sun | 10 | 6 [3–9] | 1920 [1802–2148] |
| Off-wrist, under electrical light | 10 | 2 [1–3] | 724 [720–1029] |
| On the wrist and covered | 10 | 7 [5–9] | 2728 [2621–3056] |
| On the wrist and over the sleeve | 10 | 7 [4–10] | 2866 [2726–3038] |
| On the wrist, over the sleeve and covered | 10 | 6 [3–7] | 2389 [2070–2551] |
| Off-wrist, before and during sleep | 10 | 3 [2–5] | 1370 [1279–2244] |
| Off-wrist, after wakeup | 10 | 1 [1–2] | 384 [360–825] |
| Off-wrist, device in motion (bag/car) | 9 | < 0.5 [0–0] | 40 [30–70] |
| Others (off-wrist) | 9 | 2 [1–16] | 1106 [736–3926] |

Columns 3 and 4: median [Q1–Q3].

Figure 1. Rationale of the study (a) and contexts mimicked in an off-wrist protocol (b). The off-wrist protocol was developed to test the algorithms in different conditions.
ActTrust samples activity data using a triaxial accelerometer and can store it using three motion-quantifying modalities: PIM, TAT, or ZCM [3,18,19]. The Proportional Integration Mode (PIM), the so-called digital integration, calculates the area under the curve for each epoch. The Time Above Threshold (TAT) reflects in a cumulative manner the amount of time per epoch that the signal is above a set threshold, whereas Zero Crossing Mode (ZCM) counts the number of times per epoch that the signal crosses a threshold (set close to zero). In that sense, although all methods use a count system to measure motor activity, the PIM is a measure of activity level while ZCM is a measure of frequency of movement, and TAT measures time spent in motion in an epoch.
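As a concrete illustration of how the three modalities differ, the sketch below derives PIM-, TAT-, and ZCM-like counts from a raw acceleration trace. This is a conceptual approximation with arbitrary example thresholds, not the ActTrust onboard processing.

```python
import numpy as np

def epoch_counts(signal, fs=32, epoch_s=60, tat_threshold=0.1, zcm_threshold=0.01):
    """Conceptual PIM/TAT/ZCM-like counts per epoch from a 1-D acceleration
    magnitude trace sampled at `fs` Hz (thresholds are illustrative only)."""
    samples = fs * epoch_s
    n_epochs = len(signal) // samples
    pim, tat, zcm = [], [], []
    for e in range(n_epochs):
        chunk = np.abs(np.asarray(signal[e * samples:(e + 1) * samples]))
        # PIM: "digital integration," the area under the rectified curve
        pim.append(chunk.sum())
        # TAT: number of samples (i.e. time) the signal spends above a threshold
        tat.append(int((chunk > tat_threshold).sum()))
        # ZCM: crossings of a near-zero threshold (frequency of movement)
        above = chunk > zcm_threshold
        zcm.append(int((above[1:] != above[:-1]).sum()))
    return np.array(pim), np.array(tat), np.array(zcm)
```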
In our study, all volunteers provided written informed consent and were not compensated for their participation. The study was approved by the Ethics Committee of Hospital de Clínicas de Porto Alegre (#2020-0128-GPPG/HCPA) and was conducted in accordance with the Declaration of Helsinki.
Heuristic algorithm (HA)
The algorithm was initially developed using data collected with actimeters (ActTrust, Condor Instruments) worn for 30–45 days by three participants in a previous study of our group. Recordings were stored in 1 min bins. No data on wear/nonwear were provided by these three participants; however, based on the researchers’ experience in inspecting actograms, patterns of activity and temperature indicative of off-wrist were identified: (1) stability of temperature throughout time and (2) long stretches of zeros in activity (PIM). These patterns, characteristic of off-wrist data, were then translated, by trial and error, into conditions and filters capable of detecting them in our algorithm. Data from 3 of the 10 volunteers in our study were used for tuning the algorithm, at which stage two filters were included. See the Supplementary Material for a detailed description. We ran the algorithm on the entire raw actimetry dataset (before merging with diary data and excluding epochs that were not present in the diary).
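The exact conditions and filters are described in the Supplementary Material; the sketch below only illustrates the core reasoning (a long run of zero PIM counts with a nearly flat temperature trace), with hypothetical thresholds of our own choosing.

```python
import numpy as np

def heuristic_offwrist(pim, temp, min_zero_run=90, temp_window=30, temp_sd_max=0.1):
    """Illustrative heuristic (hypothetical thresholds): flag epochs inside a
    run of >= `min_zero_run` zero PIM counts whose temperature is nearly flat."""
    pim = np.asarray(pim, dtype=float)
    temp = np.asarray(temp, dtype=float)
    n = len(pim)
    offwrist = np.zeros(n, dtype=bool)
    i = 0
    while i < n:
        if pim[i] != 0:
            i += 1
            continue
        j = i
        while j < n and pim[j] == 0:
            j += 1  # extend the run of consecutive zeros
        if j - i >= min_zero_run:
            # temperature stability: SD within consecutive sub-windows of the run
            seg = temp[i:j]
            sds = [np.std(seg[k:k + temp_window]) for k in range(0, len(seg), temp_window)]
            if np.median(sds) < temp_sd_max:
                offwrist[i:j] = True
        i = j
    return offwrist
```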
A machine learning (ML) approach
We also present a machine learning approach to the off-wrist detection problem. We framed off-wrist detection as a classification task and used a Random Forest (RF) classifier. The first step in this approach is to define the input method: how the data will be read and fed into the algorithm. The selected method was to read each line (one epoch) as one input and predict whether that line represents off-wrist. The input line contains nine fields: “date.time,” “temperature,” “light,” “red.light,” “green.light,” “blue.light,” “pim,” “tat,” and “zcm.”
The second step of the ML process is to select features from the data and analyze the correlation between variables. The idea is to extract insights from the available data to assist the algorithm in the prediction process. In addition to the data provided by the actimeter, we included five features (a sketch of their computation follows the list):
- H: time of day
- PIMW: occurrences of zeros in PIM within a time window of 60 min
- PIMS: length of the preceding sequence of consecutive zeros in PIM
- TEMP.DELTA.30: temperature variation over 30 min
- TEMP.DELTA: temperature variation from the previous epoch
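The sketch below shows one plausible way to compute these features with pandas. Column names follow the paper’s naming; the exact window conventions (e.g. trailing vs. centered windows) are assumptions on our part.

```python
import pandas as pd

def add_features(df):
    """Add the five engineered features to a 1-min epoch DataFrame with
    'date.time', 'temperature', and 'pim' columns (assumed conventions)."""
    out = df.copy()
    ts = pd.to_datetime(out["date.time"])
    out["H"] = ts.dt.hour + ts.dt.minute / 60            # time of day, in hours
    zero = (out["pim"] == 0).astype(int)
    # PIMW: zeros in PIM within a (trailing, assumed) 60-min window
    out["PIMW"] = zero.rolling(60, min_periods=1).sum()
    # PIMS: length of the run of consecutive zeros ending at each epoch
    out["PIMS"] = zero.groupby((zero == 0).cumsum()).cumsum()
    out["TEMP.DELTA.30"] = out["temperature"].diff(30)   # variation over 30 min
    out["TEMP.DELTA"] = out["temperature"].diff(1)       # variation from previous epoch
    return out
```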
The actimeter-derived data and these five features were provided as inputs to the RF algorithm. The implementation used the Python scikit-learn library. All values were normalized using min–max normalization. The data were shuffled and divided into two groups, one for training and the other for testing. We also used cross-validation to ensure quality in the training step and to check that the model was not overfitting. The set of training examples (labeled data) did not include stretches shorter than 30 min, since we did not set out to detect those. After executing the algorithm, a filter was applied to ensure that only continuous intervals longer than 30 min were recognized. This precludes recognizing shorter off-wrist periods but prevents discontinuous recognition and broken sequences. For ML training/testing, we used only the clean dataset, since diary data were necessary for training.
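A minimal sketch of this pipeline is shown below, assuming a DataFrame `df` like the one produced by the feature sketch above, plus a binary `offwrist` label column derived from the diary. Hyperparameters and split proportions are illustrative, not those of the published model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler

feature_cols = ["temperature", "light", "red.light", "green.light", "blue.light",
                "pim", "tat", "zcm", "H", "PIMW", "PIMS", "TEMP.DELTA.30", "TEMP.DELTA"]
X = MinMaxScaler().fit_transform(df[feature_cols].fillna(0))  # min-max normalization
y = df["offwrist"].values  # binary label from the diary

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf, X_train, y_train, cv=5))  # guard against overfitting
rf.fit(X_train, y_train)

def keep_long_runs(pred, min_len=30):
    """Post hoc filter: discard predicted off-wrist runs shorter than min_len epochs."""
    pred = np.asarray(pred).copy()
    i = 0
    while i < len(pred):
        if pred[i]:
            j = i
            while j < len(pred) and pred[j]:
                j += 1
            if j - i < min_len:
                pred[i:j] = 0
            i = j
        else:
            i += 1
    return pred

# The continuity filter is applied to predictions over the chronologically
# ordered series (not the shuffled test split).
series_pred = keep_long_runs(rf.predict(X), min_len=30)
```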
Visual inspection
Inferring wear/nonwear from visual representations of activity, light, and temperature data is a common practice among researchers. Indeed, even though visual inspection demands time and relies on personal experience, it is often a method of choice when preprocessing the data (frequently accompanied by the identification of long stretches of zeros). In order to test whether the algorithms could outperform visual inspection, two investigators (a board-certified sleep psychologist and a sleep physician) with research experience in this field (A.C. and F.G.C.) inspected actograms and reported in spreadsheets what they would identify as off-wrist intervals. Both were blinded to protocol details and log data.
Choi algorithm
We also used the Choi algorithm [15] to detect off-wrist data and compare it to our solutions. We performed this classification using the function wearingMarking available in the R package “PhysicalActivity” [20], using a frame (“window 1”) of 90 min, an allowance frame (artifactual movement interval) of 2 min, and a stream frame (“window 2”) of 30 min.
Epoch-by-epoch accuracy tests and methods comparisons
We computed each algorithm’s and visual inspection’s specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and accuracy considering as gold standard the binary series built from the data reported by the subjects in their logs.
Sensitivity and specificity relate the classification of the algorithm to the gold standard: sensitivity represents the proportion of epochs correctly classified as off-wrist (true positives) among all reported in the log as off-wrist (true label); specificity represents the proportion of epochs correctly classified as on-wrist (true negatives) among all reported in the log as on-wrist (true label). PPV and NPV represent the probability that the classification is correct: PPV quantifies the rate of epochs correctly classified as off-wrist among all considered off-wrist; NPV quantifies the rate of epochs correctly classified as on-wrist among all considered on-wrist. Finally, accuracy represents the proportion of epochs correctly classified.
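In terms of epoch-level true/false positives (TP, FP) and true/false negatives (TN, FN), with off-wrist as the positive class, these definitions correspond to:

$$
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{NPV} = \frac{TN}{TN + FN}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$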
Each parameter was computed for each participant and descriptive statistics refer to median [Q1–Q3] across them and by method. Data from all 10 subjects were used. We additionally computed sensitivity and specificity by epoch length and context, regardless of subject.
The performances of the algorithms and of visual inspection were compared using Friedman tests followed by pairwise Wilcoxon signed-rank tests (with Bonferroni adjustment), using the R package “stats” [21].
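For readers working in Python, the sketch below reproduces the same test workflow with SciPy on hypothetical per-subject accuracies (the study itself used R’s “stats” package).

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-subject accuracies (n = 10) for three methods.
rng = np.random.default_rng(0)
acc = {"HA": rng.uniform(0.93, 0.99, 10),
       "ML": rng.uniform(0.94, 0.99, 10),
       "Choi": rng.uniform(0.92, 0.99, 10)}

stat, p = friedmanchisquare(*acc.values())  # omnibus test across methods
if p < 0.05:
    pairs = list(combinations(acc, 2))
    raw = {f"{a} vs {b}": wilcoxon(acc[a], acc[b]).pvalue for a, b in pairs}
    bonf = {k: min(1.0, v * len(raw)) for k, v in raw.items()}  # Bonferroni adjustment
    print(bonf)
```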
Proof-of-concept analyses
We additionally compared our algorithms to visual inspection using data collected in previous studies (ethics approvals: #2020-0128, #2015-0568, #2019-0527). Fifteen series (37 ± 16 years old, 10 women [67%], 5 from rural areas [33%]) were selected based on having off-wrist data detected with the HA. They were visually inspected using the same procedure described in the Visual inspection section. We then assessed the performances of the algorithms (i.e. HA, ML, and Choi), considering visual inspection as the gold standard. Since visual inspection was considered the gold standard, a second pair of researchers (L.K.P. and M.A.B.O.) reviewed the visual inspection for large inconsistencies. All series were 15 days long, except for one that was 14.8 days long. We computed the percentage agreement between raters for each subject and then group descriptive statistics. The performance of HA, ML, and Choi was assessed using the same metrics reported in the Epoch-by-epoch accuracy tests and methods comparisons section.
Results
Characteristics of the time series and of the participants in the validation cohort
We included 10 participants (four undergraduate and six graduate students from our research group) who completed our off-wrist protocol in a median of 25.7 days (Q1–Q3: 19.3–49.4 days). Median age was 27 years (Q1–Q3: 25–28; min–max: 23–35) and seven were women. Median off-wrist duration was 21% of the total series (Q1–Q3: 12–30; min–max: 5–35). Data were collected from September 28, 2020 to April 21, 2021 in Porto Alegre (30.0° S, 51.2° W). Other details pertaining to the characterization of actimeter usage can be found in Table 1.
Performance of a heuristic algorithm (HA)
HA requires no previous data preparation except for reading in the raw files. It took less than 2 min to run it through the whole dataset (which summed up to a total of 500,611 rows [minutes] of data). Accuracy reached a median of 97.1%, being higher than 93% for all subjects. Median sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were 84.7%, 99.7%, 99.1%, and 96.5%, respectively.
Performance of a machine learning (ML) strategy
Figure 2a presents the correlation matrix between the variables collected with the actimeter, the calculated features, and the binary series (off-wrist = 1, on-wrist = 0) derived from participants’ self-reports in the diary. Firstly, it is noticeable that all the incoming data from the actimeter have a slight correlation with the diary notations. More precisely, light exposure, represented by the columns “Light,” “Red.Light,” “Green.Light,” and “Blue.Light,” shows a positive correlation with the diary values. However, temperature and activity data (columns “PIM,” “TAT,” and “ZCM”) show inverse correlations with the binary series. From the calculated features, only those using PIM presented a good correlation. Time of day, as well as temperature variation features, did not correlate with the binary series.
Figure 2. Correlation and feature importance. Correlation coefficients between all the inputs, calculated features, and the gold-standard off-wrist binary series [self-reported NA] (a). Feature importance resulting from the training process of the ML algorithm (b). H: time of day; PIMW: occurrences of zeros in PIM within a time window of 60 min; PIMS: length of the preceding sequence of consecutive zeros in PIM; TEMP.DELTA.30: temperature variation over 30 min; TEMP.DELTA: temperature variation from the previous epoch.
After training the ML algorithm, feature importance could be assessed and compared to the correlation matrix. Figure 2b presents the feature importance calculated after the training step. The feature importance values add up to 1.0, and each feature has a proportional weight based on its relevance to the classification decision. Notably, the correlation coefficients are reflected in the importance of the features. However, they are not always equivalent: for instance, temperature presents a higher correlation with the binary series than the sequential zeros in PIM (column PIMS), but a lower feature importance. The variance explained by the model was 0.871.
Figure 3 shows the ML algorithm’s performance (sensitivity, specificity, NPV, PPV, and accuracy) along with that of HA and other methods. Median accuracy was 98.3%, being higher than 89% for all subjects. Median sensitivity, specificity, PPV, and NPV were 94.0%, 99.7%, 98.4%, and 98.4%, respectively.

Figure 3. Performance of different strategies employed to identify off-wrist (epoch-by-epoch). Heuristic algorithm (HA), machine learning approach (ML), Choi algorithm (Choi), and visual inspection by both raters (rater 1, rater 2). NPV: negative predictive value. PPV: positive predictive value. Metrics were computed for each individual. n = 10. Friedman test, *p < .05, Wilcoxon signed-rank test with Bonferroni correction for multiple comparisons.
Visual inspection performance
The two specialists estimated that it took them ca. 3 and 6.5 hr, respectively, to inspect and document the off-wrist intervals of all files. Using that information also required an additional step on our part: translating the intervals into a binary series of on-wrist (0) and off-wrist (1). We found the median accuracy of visual inspection to be 98.7% and 98.8% for each rater, respectively. All subjects had their data scored with an accuracy higher than 90%. Median sensitivity, specificity, PPV, and NPV were 96.6%, 99.5%, 97.0%, and 98.8% for rater 1 (r1), and 95.5%, 99.3%, 97.6%, and 99.2% for rater 2 (r2).
Choi algorithm performance
It took ca. 15 s to run Choi’s algorithm through all files. We found its median accuracy to be 96.7%, and all accuracy values were higher than 92%. Median sensitivity, specificity, PPV, and NPV were 82.5%, 99.8%, 98.6%, and 96.3%.
Comparison between methods
Supplementary files include example actograms showing the methods’ performance (Supplementary Figures S1–S5). Friedman tests suggested significant differences between methods in sensitivity, NPV, and accuracy (sensitivity: Friedman χ² = 22.6, p < .001; NPV: Friedman χ² = 20.6, p < .001; accuracy: Friedman χ² = 23.2, p < .001). Post hoc comparisons using Wilcoxon signed-rank tests with Bonferroni p-adjustment showed differences between HA/Choi and r1 in NPV, sensitivity, and accuracy. No differences were seen in specificity or PPV (specificity: Friedman χ² = 7.7, p = .10; PPV: Friedman χ² = 7.9, p = .09).
Sensitivity and specificity by context
Figure 4 shows sensitivity and specificity by context. Specificity was high regardless of context, whereas sensitivity was very low for the “device in motion” context. Sensitivity for detecting “under electrical light” and “after wake-up” also seems lower for HA and Choi. However, as shown in Figure 5, each context is composed of different proportions of varied interval lengths, and the lower sensitivity in these contexts is likely explained by their being composed of shorter intervals (see next section).

Figure 4. Sensitivity and specificity by context. Heuristic algorithm (HA), machine learning approach (ML), Choi algorithm (Choi), and visual inspection by both raters (rater 1, rater 2). Data computed for all epochs.

Figure 5. Episode lengths of different contexts. Length of episodes by context (a), with off-wrist and on-wrist contexts color-coded differently (boxplots). Proportion of each context composed of epochs within different off-wrist episode lengths (b).
Sensitivity and specificity by off-wrist episode length
Figure 6 shows sensitivity and specificity by episode length. Sensitivity is higher than 80% for all methods for episodes longer than 6 hr. The sensitivity of HA is very low (<2%) for episodes shorter than 2 hr and reached 47% and 59% for episodes of 2–4 and 4–6 hr, respectively. Similarly, Choi did not detect episodes shorter than 30 min, had low sensitivity for episodes of 30 min–2 hr (12%), and had sensitivities of 65% and 59% for episodes of 2–4 and 4–6 hr, respectively. ML and visual inspection also had sensitivities lower than 50% for episodes shorter than 2 hr, but reached levels higher than 80% for episodes longer than 2 hr.

Figure 6. Sensitivity and specificity by episode length. Heuristic algorithm (HA), machine learning approach (ML), Choi algorithm (Choi), and visual inspection by both raters (rater 1, rater 2). Data computed for all epochs.
Specificity is high regardless of episode length, being above 93% in episodes longer than 2 hr for all methods. Specificity is also higher than 95% in episodes longer than 30 min and shorter than 2 hr, except for one of the raters (82%). For intervals shorter than 30 min, specificity is still higher than 80% (all methods).
Proof-of-concept analyses results
As mentioned above, since visual inspection was to be considered the gold standard in these analyses, large inconsistencies between the two raters were checked by a second pair of researchers. Five episodes were then manually adjusted after agreement between reviewers (their onset or offset was shifted by a day, i.e. raters mistook the onset or offset days). Median percentage agreement between raters after these adjustments was 98.6% [Q1–Q3: 97.7%–99.0%]. Only epochs identified as off-wrist by both raters were considered true labels, except for one large episode detected only by rater 1 and confirmed by the reviewers. Figure 7 shows the algorithms’ performances. All algorithms showed very high median specificity and PPV (higher than 99%); NPV and accuracy medians were all higher than 90%, whereas sensitivity was higher than 70%. Sensitivity, NPV, and accuracy were significantly lower for Choi than for ML. Supplementary files also include example actograms showing the methods’ performance in the proof-of-concept analyses (Supplementary Figures S6–S9). According to visual inspection, 63% of the 147 off-wrist intervals in this sample were ≥ 2 hr long, 36% were longer than 6 hr, and the intervals of ≥ 2 hr represented more than 95% of the entire nonwear time (ca. 1144 hr; Supplementary Figure S10).

Figure 7. Performance of the algorithms in identifying off-wrist (epoch-by-epoch). Visual inspection by two raters was considered the gold standard. Heuristic algorithm (HA), machine learning approach (ML), and Choi algorithm (Choi). NPV: negative predictive value. PPV: positive predictive value. Metrics were computed for each individual. n = 15. Friedman test, *p < .05, Wilcoxon signed-rank test with Bonferroni correction for multiple comparisons.
Discussion
Main results
To the best of our knowledge, this is the first study to devise a long protocol mimicking compliance with actimeter usage through programmed off-wrist periods of different lengths and contexts. Our protocol was designed to yield reliable data, which allowed us to show that three different methods are able to accurately detect off-wrist data. Of relevance, considering that we aimed at low rates of false positives when developing the automated strategies, median specificity was higher than 99% with both algorithms (as well as with visual inspection and Choi’s algorithm [16]). Accuracy was higher than 89% in all subjects for all methods, whereas ML seems to be more sensitive than Choi and HA, especially with shorter intervals (2–4 and 4–6 hr). With this study, we also provide two additional count-based algorithms with sufficiently good performance to detect off-wrist data collected with ActTrust devices.
Features, windows, and short intervals limitation
Variables derived from activity counts and temperature were those used in the heuristic algorithm (HA), as well as the ones with the highest feature importance in the ML algorithm. Using temperature data, however, improved the performance of HA relative to Choi’s only slightly, in some instances, and not consistently. Variables other than activity did not contribute much to the ML solution either: its accuracy was above 94% regardless of whether variables other than activity (PIM) were used. Yet, only by using the entire set of variables did we reach an accuracy of 98%. It remains to be seen whether different strategies making use of temperature (and other data), perhaps also aimed at detecting shorter intervals, can be developed and show improved performance. Previous studies using the Axivity device showed an advantage of using temperature data, for example [22,23], but the use of a fixed threshold to classify off-wrist based on the temperature signal, or on its z-score transformation, did not seem to improve our results with the HA. The same studies [22,23] also used patterns of increase and decrease in the smoothed temperature signal, which we did not test here; we did, though, use epoch-to-epoch differences in temperature as an indicator. An important note is that, in both the protocol and the proof-of-concept data, we had subjects whose temperature occasionally increased when the actimeter was taken off (see Supplementary Figure S1, a and b [protocol], and Supplementary Figure S6a [proof-of-concept]), as opposed to the more usual pattern of a temperature decrease. This may be related to the higher temperatures in Brazil, and may also be more common in rural populations whose work environment is often outdoors; in the off-wrist protocol data, this pattern is of course related to the fact that one of the mimicked contexts was leaving the device under the sun. Future studies may indicate whether, in situations of more variable environmental temperature, the temperature signal remains useful for detecting off-wrist.
Relying solely on consecutive stretches of zeros in activity may result in a high rate of false negatives when detecting off-wrist [24], depending on the device, the resolution with which data have been collected, and how tolerant one can be of misclassifying inactivity as off-wrist. Artifacts within the nonwear period are one potential reason. Yet, zero counts within frames have so far proven to be the best indicator of off-wrist when used along with other strategies: in the case of HA, counts of zeros in activity (considering different interval windows, 90 min and 4 hr) were used; in the ML approach, the occurrence of zeros within 60 min showed the highest feature importance. This, however, means that short intervals remain a challenge in detecting off-wrist. It is important to highlight that the methods we developed were not set up to detect short intervals in the first place. Recent efforts using similar machine learning approaches suggest that they may enable nonwear detection at higher resolutions [25].
Patterns of both activity and temperature (when available) are also the ones investigators usually rely on to detect off-wrist when visually inspecting actograms and the data. Therefore, even though we hypothesized that automated aids could be more successful than visual inspection, it is not surprising that their performance is rather similar and that the human eye may even spot some episodes that rigid sets of rules would not. A combination of automated methods with visual inspection may be the best strategy; the possibility of easily using such a combination provided by the tools available in pyActigraphy [26] and ChronoSapiens [27], for example, may, therefore, offer an advantage.
Contexts
Neither the algorithms nor the investigators could detect “off-wrist, but in motion” episodes. It is important to consider, though, that this context was also mimicked as relatively short episodes, which are overlooked by HA and Choi and, to a lesser extent, by ML and visual inspection. However, considering that activity is an important feature for all algorithms, such a result was probably to be expected. The classification of other contexts as off-wrist/on-wrist showed good and similar performances, with the length of episodes probably being a confounder, making it harder to interpret which off-wrist contexts are indeed more challenging to detect.
Advantages and disadvantages of each method
Even if visual inspection may be very accurate, a few points are worth considering: (1) since it relies heavily on the human factor, it is probably sensitive to raters’ attention levels at the time the inspection is performed; additionally, it may be subject to raters’ experience. The investigators who visually inspected the actograms in our study were aware that their performance would be evaluated, which may have contributed to higher attention levels when performing the task. Furthermore, they each had 7–8 years of experience working with actimetry data. (2) After visual inspection, the additional step of transforming the detected intervals into a format that allows one to replace data with NA, or to cross it with the raw data, takes time, even if automated. (3) Timewise, the whole process takes longer and requires trained researchers’ time and attention.
The seconds to minutes taken by the algorithms, contrasted with the approximately 5 hr taken by the raters to detect and document the off-wrist intervals of 10 subjects, highlight the advantage of the automated alternatives. Even if ours was a sample with more regular patterns, the time taken by the raters was consistent with the time it took to preprocess data in a study of cadets with “erratic” sleep schedules (30 min per participant) [28]. With large datasets, the faster solution provided by the algorithms saves resources and allows experienced staff to invest their time in other relevant activities. The main advantage of Choi and HA is that they are faster than the other methods, despite worse sensitivity, especially in the range of off-wrist episodes of 30 min–6 hr. Although Choi seems to perform better than HA in the range of 2–4 hr, it showed a sensitivity of 82% vs. the 95% of HA for intervals of 12–24 hr. The ML approach is also faster than visual inspection and shows a very similar performance. Among the automated strategies, ML was the one with the highest sensitivity, and it showed good sensitivity for episodes ≥ 2 hr. The sample of our proof-of-concept analyses was chosen based on subjects having off-wrist intervals, and we did not have self-reported data on taking off the device. Yet, illustrating the relevance of the automated methods explored here, a great number of the nonwear episodes detected by visual inspection in this sample consisted of intervals longer than 2 hr. Another important point is that undetected long episodes probably produce greater biases [29].
Strengths and limitations
Our study has the strength of mimicking, experimentally but still in free-living settings, nonwear and the contexts the actimeters were in when they were taken off (e.g. in motion, under natural light). We recruited undergraduate and graduate students from our lab to run this experiment; even if external validity may be lower in such a homogeneous sample, we considered accurate documentation in the logs instrumental. Therefore, although these are still self-reported data, subjects’ adherence to the protocol and documentation of their wear-time compliance should be highly trustworthy. Besides the homogeneity of the sample, another limitation is its small size. Yet, all subjects wore the actimeter for longer than 14 days, and we collected 500,611 min of recordings. Additionally, our proof-of-concept analyses included subjects from rural and urban areas and of different age groups, and showed similar results. Even with our protocol, it was difficult to fully unravel how context and episode length affect the performance of off-wrist detection methods. Still, we show how detecting short episodes was a limitation across methods, and how fully automated strategies may perform similarly to visual inspection and therefore represent a viable and faster alternative. The interpretation of our results, and the algorithms we developed, are naturally limited to the device we used (ActTrust) and the information it collects, but we believe the reasoning behind these algorithms, and the limitations to be considered when using automated solutions, can be transferred to other strategies. Capacitive proximity sensors may further facilitate the detection of off-wrist in actimetry data, but even then, the results of our study may help in (1) devising strategies for actimeters that do not yet incorporate proximity-sensing technologies for reasons of pricing and (2) choosing algorithms and interpreting results derived from datasets collected before these technologies were readily available.
Concluding Remarks
Automated strategies for off-wrist detection perform similarly to visual inspection, but present the important advantage of being faster and less costly. As expected, stretches of zeros in activity were important indicators of off-wrist in both new automated strategies (heuristic and machine learning), whereas temperature improved the performance of the heuristic algorithm only slightly, and only in some instances, when compared to Choi’s. The ML approach showed higher sensitivity than Choi and generally performed better than HA and Choi with intervals shorter than 6 hr. The detection of short intervals was an important challenge regardless of the method of choice in our study, which should be taken into account especially in studies investigating naps with actigraphy or estimating sleep from actigraphy in samples with highly fragmented sleep (as is the case for patients with some sleep disorders and individuals in sleep-disrupting environments). Estimations of other parameters, such as some of those derived from nonparametric circadian rhythm analysis (e.g. interdaily stability and intradaily variability) and cosinor analysis (e.g. acrophase), may be less affected by undetected short episodes, whereas keeping “zeros” in activity when they signify nonwear may have a more substantial influence on measures that depend on magnitude (e.g. M10, the average activity during the 10 most active hours, or amplitude) [29]. Finally, we hope that our study reminds researchers to approach data cleaning thoughtfully, considering their data characteristics, research questions, and devices, and that publications should report the methodologies used.
Acknowledgments
We are thankful to CNPq (E.G.S., M.P.H., M.A.B.O., and R.R.R.) and CAPES (L.K.P., A.C.T., D.B.C., and N.B.X.) for fellowships and to the support of Global Affairs Canada (N.B.X.).
Funding
This study was funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES) Finance Code 001 (CAPES-Epidemias—grant number: 88887.507070/2020-00; PROBRAL—grant number 88887.144127/2017-00), Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul (FAPERGS PPSUS—grant number: 21/2551-0000118-6 and PPSUS-2017 FAPERGS/MS/CNPq/SESRS n. 03/2017—grant number: 17/2551-0001419-7), and FIPE-HCPA.
Disclosure Statement
None declared.
Data Availability
The data underlying this article will be shared upon reasonable request to the corresponding author. Part of the data (off-wrist protocol) is available at: https://github.com/LMicol/offwrist-detection.
References