Physical activity phenotyping with activity bigrams, and their association with BMI

Abstract
Background: Analysis of physical activity usually focuses on a small number of summary statistics derived from accelerometer recordings: average counts per minute and the proportion of time spent in moderate-vigorous physical activity or in sedentary behaviour. We show how bigrams, a concept from the field of text mining, can be used to describe how a person’s activity levels change across (brief) time points. These variables can, for instance, differentiate between two people spending the same time in moderate activity, where one person often stays in moderate activity from one moment to the next and the other does not.
Methods: We use data on 4810 participants of the Avon Longitudinal Study of Parents and Children (ALSPAC). We generate a profile of bigram frequencies for each participant and test the association of each frequency with body mass index (BMI), as an exemplar.
Results: We found several associations between changes in bigram frequencies and BMI. For instance, a one standard deviation decrease in the number of adjacent minutes in sedentary then moderate activity (or vice versa), with a corresponding increase in the number of adjacent minutes in moderate then vigorous activity (or vice versa), was associated with a 2.36 kg/m² lower BMI [95% confidence interval (CI): −3.47, −1.26], after accounting for the time spent in sedentary, low, moderate and vigorous activity.
Conclusions: Activity bigrams are novel variables that capture how a person’s activity changes from one moment to the next. These variables can be used to investigate how sequential activity patterns associate with other traits.

Supplementary section S1: Bags of n-grams
An n-gram is a contiguous sequence of length n. The 1-gram and 2-gram are referred to as unigrams and bigrams, respectively.
A bag of n-grams is the set of n-grams, each with its associated frequency, found in a given sequence. For example, the sequence ABAB has 2 A and 2 B characters and so can be represented by the bag of unigrams {A*2, B*2}. This sequence can also be represented by the bag of bigrams {AB*2, BA}, the bag of 3-grams {ABA, BAB} and the bag of 4-grams {ABAB}.
A bag of unigrams represents every sequence that is a permutation of the elements in the bag. For example, the bag of unigrams {A*2, B*2} can represent the sequences AABB, ABAB, ABBA, and so forth. Where n>1, n-grams overlap such that, for instance, the bigram at position [i, i+1] overlaps with the bigrams at positions [i-1, i] and [i+1, i+2]. This restricts the possible sequences that each bag of n-grams represents. For example, the bag of bigrams {AB*2, BA} represents the sequence ABAB only. The bag of bigrams {AA*2, AB, BB, BA}, in contrast, represents several sequences: AAABBA, AABBAA, ABBAAA, BAAABB and BBAAAB.
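The definitions above can be checked on toy sequences with a short script. This is a minimal sketch, not code from the study: the function names `ngram_bag` and `sequences_with_bag` are illustrative, and the brute-force search is only practical for very short sequences.

```python
from collections import Counter
from itertools import product


def ngram_bag(sequence, n):
    """Return the bag (multiset) of overlapping n-grams in `sequence`."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def sequences_with_bag(bag, n):
    """List every sequence whose bag of n-grams equals `bag`.

    Brute force: tries all strings of the implied length over the
    characters that appear in the bag (illustrative helper only).
    """
    alphabet = sorted({ch for gram in bag for ch in gram})
    # A sequence of length m contains m - n + 1 overlapping n-grams.
    length = sum(bag.values()) + n - 1
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    return sorted(s for s in candidates if ngram_bag(s, n) == bag)


print(ngram_bag("ABAB", 1))   # bag of unigrams
print(ngram_bag("ABAB", 2))   # bag of bigrams
print(sequences_with_bag(Counter({"AB": 2, "BA": 1}), 2))
print(sequences_with_bag(Counter({"AA": 2, "AB": 1, "BB": 1, "BA": 1}), 2))
```

Because the search enumerates all strings rather than permutations of one reference sequence, it also finds sequences that share a bag of bigrams but differ in their unigram counts.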
Supplementary section S2: Potential confounders
Parity was recorded via questionnaire at 18 weeks gestation. We derived a measure of maternal smoking in pregnancy using questionnaire measures recorded at 18 and 32 weeks gestation. At 18 weeks gestation the mother was asked whether she had 1) smoked tobacco in the first 3 months of pregnancy or 2) smoked tobacco in the last 2 weeks. At 32 weeks gestation she was asked how many cigarettes she was currently smoking per day.
At 32 weeks gestation each mother was asked to record her highest education level, as one of: none, CSE (national school exams at age 16), or vocational; O level (national school exams at age 16, higher than CSE); A level (national school exams at age 18); or university degree. We combined the categories 'CSE' and 'vocational' into a single category due to the small number of participants in these groups. The mother also recorded her and her partner's occupations, and these were used to derive their social class groups (I, II, III manual, III non-manual, IV, V) using the 1991 Office of Population Censuses and Surveys classification. The lower of the mother's and her partner's social class was used as a measure of the household social class. The mother was also asked about her and her partner's ethnicity, and this was used to derive the child's ethnicity.
Supplementary section S3: Relating our linear models to real changes in activity sequences
Figure 3 in the main paper shows three example sequence changes that are consistent with particular models in our analyses. In this section we provide further explanation and examples of sequence changes that are consistent with our models.
As discussed in the main paper, bigrams overlap, so changing one epoch pair in a sequence from one bigram to another will often change the number of occurrences of at least one other bigram. Only in a minority of cases does such a change leave the bag of bigrams unchanged. For example, given the sequence SSLS, changing the SL to LS produces the sequence SLSS; both before and after this change, the sequence has the bag of bigrams {SS, SL, LS}. More commonly, the frequency of at least one other bigram changes as well, and which bigrams are affected depends on the particular sequence.
The following two examples both change one occurrence of SS to SM:
Example 1: Sequence SSSLML changes to SSMLML
Example 2: Sequence LMVSSV changes to LMVSMV
In example 1 this change simultaneously adds a second ML bigram and removes the SL bigram, whereas in example 2 this change simultaneously adds a second MV bigram and removes the SV bigram. The actual frequencies of activity states and bigrams are shown in Supplementary table 1.
The sequence changes of examples 1 and 2 are consistent with models 1 and 2 of our analyses. This is because models 1 and 2 correspond to real sequence changes where there is an increase in frequency of the comparison bigram (SM) and an equivalent decrease of the baseline bigram (SS), such that the total remaining number of bigrams in the sequence (1339 − SS − SM) remains constant.
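As a sketch only, and assuming the models take a substitution form in which the baseline bigram is the omitted term, the constraint and the unadjusted model can be written as follows. The symbols $f_{SS}$, $f_{SM}$, $f_{\text{rest}}$ and the coefficients are illustrative notation, not taken from the main paper, and covariates are left out.

```latex
\begin{align}
f_{SS} + f_{SM} + f_{\text{rest}} &= 1339, \\
\text{BMI} &= \beta_0 + \beta_{SM}\, f_{SM} + \beta_{\text{rest}}\, f_{\text{rest}} + \varepsilon .
\end{align}
```

Under this assumed form, $\beta_{SM}$ is the difference in mean BMI for one additional SM bigram with $f_{\text{rest}}$ held fixed, which by the first line implies one fewer SS bigram.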
While we may think of these models as representing a swap from one bigram to another at particular positions in a person's sequence, in fact any increase in frequency of one bigram that is accompanied by an equal decrease in frequency of another bigram is consistent with these models. For instance, swapping SS for SM in example 1 collaterally increases the frequency of ML by 1. Hence this example is also consistent with a model with baseline SS and comparison ML, because the frequency of these bigrams decreases and increases, respectively, by the same amount.
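The bigram changes in examples 1 and 2, and the baseline and comparison pairs they are consistent with under the reasoning above, can be listed with a few lines of Python. This is a minimal sketch; the helper names `bigram_changes` and `consistent_pairs` are illustrative.

```python
from collections import Counter


def bigram_bag(sequence):
    """Count the overlapping bigrams (adjacent epoch pairs) in a sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def bigram_changes(before, after):
    """Return {bigram: change in frequency} for bigrams whose count differs."""
    old, new = bigram_bag(before), bigram_bag(after)
    return {b: new[b] - old[b]
            for b in sorted(set(old) | set(new)) if new[b] != old[b]}


def consistent_pairs(before, after):
    """(baseline, comparison) pairs where the baseline bigram falls by k
    and the comparison bigram rises by the same k."""
    changes = bigram_changes(before, after)
    return sorted((base, comp)
                  for base, d_base in changes.items()
                  for comp, d_comp in changes.items()
                  if d_base < 0 and d_comp == -d_base)


for before, after in [("SSSLML", "SSMLML"), ("LMVSSV", "LMVSMV")]:
    print(before, "->", after)
    print("  changes:", bigram_changes(before, after))
    print("  pairs:  ", consistent_pairs(before, after))
```

For example 1 the listed pairs include both (SS, SM) and (SS, ML), matching the two readings given in the text.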
Models 3 and 4 adjust for the common activity statistics: the frequency of activity states and the average activity, respectively. To be consistent with these models, a change in a sequence must satisfy additional properties.
Model 3 represents a change in a sequence where the frequency of one bigram increases, accompanied by an equal decrease in frequency of another bigram, while the time spent in each activity state (sedentary, low, moderate and vigorous) remains constant. Examples 1 and 2 described above do not keep the time spent in each activity state constant. For instance, in example 1 the time spent in the sedentary state decreases by one minute and the time spent in the moderate state increases by one minute. Hence, these examples are not consistent with model 3. The following examples have the same initial sequences as examples 1 and 2 and show a change of one SS to SM, but are consistent with model 3:
Example 3: Sequence SSSLML changes to SSMLSL
Example 4: Sequence LMVSSV changes to LSVSMV
In order to keep the frequency of each activity state unchanged, a moderate state has additionally been changed to sedentary (see Supplementary table 1 for more details of these examples).
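The two conditions for consistency with model 3 can be checked directly on the four example sequences. This is a minimal sketch with illustrative function names; it tests only the toy sequences given in this section.

```python
from collections import Counter


def bigram_bag(sequence):
    """Count the overlapping bigrams (adjacent epoch pairs) in a sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def swaps_bigram(before, after, baseline, comparison):
    """True if `comparison` rises by exactly the amount `baseline` falls."""
    old, new = bigram_bag(before), bigram_bag(after)
    rise = new[comparison] - old[comparison]
    fall = old[baseline] - new[baseline]
    return rise > 0 and rise == fall


def keeps_state_times(before, after):
    """True if the time spent in each activity state is unchanged."""
    return Counter(before) == Counter(after)


def consistent_with_model_3(before, after, baseline="SS", comparison="SM"):
    """Both conditions: an equal bigram swap and unchanged state frequencies."""
    return (swaps_bigram(before, after, baseline, comparison)
            and keeps_state_times(before, after))


examples = {
    1: ("SSSLML", "SSMLML"),
    2: ("LMVSSV", "LMVSMV"),
    3: ("SSSLML", "SSMLSL"),
    4: ("LMVSSV", "LSVSMV"),
}
for number, (before, after) in examples.items():
    print(f"Example {number}: consistent with model 3 =",
          consistent_with_model_3(before, after))
```

All four examples pass the bigram-swap check, but only examples 3 and 4 also keep the state frequencies unchanged.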
Model 4 represents a change in a sequence where the frequency of one bigram increases, accompanied by an equal decrease in frequency of another bigram, while the average activity level does not change. Hence, a change in the frequency of each activity state is consistent with model 4, as long as the average activity level remains the same (as shown in illustration 3 of Figure 2 in the main paper).
While in this section we have focussed on changes in the frequency of bigrams consistent with our models, the same also applies to unordered bigrams (u-bigrams). Supplementary table 2 shows two example u-bigram changes consistent with models 1 and 2, and two examples consistent with model 3.
Underlined activity states in sequences show the positions at which states change.
Examples 1 and 2 are consistent with models 1 and 2 because the frequency of SM increases by the same amount that SS decreases. Examples 3 and 4 are consistent with model 3 (adjusted for activity states: sedentary, low, moderate and vigorous) because: 1) the frequency of SM increases by the same amount that SS decreases, and 2) the time spent in sedentary, low, moderate and vigorous activity does not change.
Swapping the comparison and baseline bigram gives equivalent estimates of association with BMI (same value but with opposite sign).
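The unordered bigrams (u-bigrams) referred to in this section count a pair of adjacent states regardless of order, so that sedentary then moderate and moderate then sedentary fall under the same variable. A minimal sketch of such a count is given below; labelling each u-bigram with the lower-intensity state first (order S, L, M, V) is an assumption made for illustration.

```python
from collections import Counter

# Assumed labelling convention: lower-intensity state first (S < L < M < V).
INTENSITY_ORDER = "SLMV"


def u_bigram_bag(sequence):
    """Count adjacent state pairs, ignoring the order within each pair."""
    bag = Counter()
    for first, second in zip(sequence, sequence[1:]):
        pair = sorted((first, second), key=INTENSITY_ORDER.index)
        bag["".join(pair)] += 1
    return bag


print(u_bigram_bag("SMSM"))     # SM, MS and SM all count as u-bigram SM
print(u_bigram_bag("LMVSMV"))
```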

Model 1: unadjusted.
Model 2: adjusted for potential confounders (gender, exact age at age 11 clinic, parity, household social class, maternal education, maternal smoking during pregnancy and child ethnicity).
Model 3: adjusted for potential confounders (gender, exact age at age 11 clinic, parity, household social class, maternal education, maternal smoking during pregnancy and child ethnicity) and activity states (time spent in sedentary, low, moderate and vigorous activity).
Model 4: adjusted for potential confounders (gender, exact age at age 11 clinic, parity, household social class, maternal education, maternal smoking during pregnancy and child ethnicity), and mCPM.
Swapping the comparison and baseline u-bigram gives equivalent estimates of association with BMI (same values but with opposite sign).

Supplementary table 7: Difference in means of BMI for a 1 SD decrease of baseline u-bigram, coupled with corresponding increase in comparison u-bigram, where baseline bigram SD is less than comparison bigram SD
Model 1 (circle): unadjusted. Model 2 (cross): adjusted for potential confounders (gender, exact age at age 11 clinic, parity, household social class, maternal education, maternal smoking during pregnancy and child ethnicity).