Trotter and Gleser’s (1958) equations outperform Trotter and Gleser’s (1952) equations in stature estimation of the US White males

Abstract
Trotter and Gleser presented two sets of stature estimation equations for the US White males in their 1952 and 1958 studies. Following Trotter’s suggestion favouring the 1952 equations simply due to the smaller standard errors, the 1958 equations have been seldom used and have gone without additional systematic validation tests. This study aims to assess the performance of the Trotter and Gleser 1952, Trotter and Gleser 1958, and FORDISC equations for the White males in a quantitative and systematic way, particularly when applied to the WWII and Korean War casualties. In sum, 27 equations (7 from the 1952 study, 10 from the 1958 study, and 10 from FORDISC) were applied to the osteometric data of 240 accounted-for White male casualties of the WWII and Korean War. Then, the bias, accuracy, and Bayes factor for each set of stature estimates were calculated. The results show that, overall, Trotter and Gleser’s 1958 equations outperform the 1952 and FORDISC equations in terms of all three measures. Particularly, the equations with higher Bayes factors produced stature estimates where distributions were closer to that of the reported statures than those with lower Bayes factors. When considering Bayes factors, the best performing equation was the “Radius” equation from the 1958 study (BF = 15.34) followed by the “Humerus+Radius” equation from FORDISC (BF = 14.42) and the “Fibula” equation from the 1958 study (BF = 13.82). The results of this study will provide researchers and practitioners applying the Trotter and Gleser stature estimation method with a practical guide for equation selection.

Key Points
The performance of three stature estimation methods was compared quantitatively.
Trotter and Gleser’s (1952, 1958) and FORDISC White male equations were included.
Overall, Trotter and Gleser’s 1958 method outperformed the other methods.
This study provides a practical guide for stature estimation equation selection.


Introduction
Stature has been extensively studied in various fields of anthropology as an important biological property indicative of the health, environmental conditions, and even socioeconomic/political circumstances of an individual and population [1][2][3][4][5][6][7]. As direct measurement of an individual's stature is not always feasible, particularly when they are deceased, extensive effort has been made to devise methods to estimate stature from skeletal elements since the late 19th century [8][9][10][11][12][13][14][15]. Since its introduction in the 1950s, Trotter and Gleser's [14,15] method has become one of the most popular techniques for stature estimation [16][17][18].
"Trotter and Gleser's method" refers to a set of stature estimation equations devised from their studies published in 1952 and 1958. In the 1952 study, the authors presented equations for the White and Black individuals using the World War II (WWII) US service member casualties and Terry Collection samples [14]. In 1958, they provided male equations for the Whites, Blacks, Asians, Puerto Ricans, and Mexicans using US Korean War service member casualties [15]. Trotter [19] explicitly suggested using the 1952 equations for White and Black males over the 1958 equations due to the smaller standard errors in the 1952 equations compared with the 1958 equations. Following Trotter's [19] suggestion, Trotter and Gleser's 1958 equations have been seldom applied to White and Black males and have gone without additional systematic validation tests. In fact, the stature estimation tool built into FORDISC, a forensic anthropological analysis software popular among forensic practitioners, is based on the Trotter and Gleser's 1952 dataset, not their 1958 dataset, to estimate statures of the 20th century (20C) Whites and Blacks [20]. However, a standard error associated with a certain regression equation is not necessarily a measure of its performance as it is not a predictive estimator but simply a descriptive indicator of the overall discrepancy between the actual and estimated values in a dataset used for the equation development [17,21].
Besides Trotter's [19] suggestion of using standard error, a general lack of feasible quantitative methods to compare the performance of different stature estimation methods explains the lack of systematic validation tests of Trotter and Gleser's 1952 and 1958 equations. Comparing the mean of stature estimates obtained by an estimation method to that of known statures has often been used for comparing the relative performance of a method [22,23]. However, as Jeong and colleagues [24] highlight, even identical means do not necessarily indicate that the estimates and known statures follow distributions of the same shape and thus cannot guarantee a good performance of the method. In this regard, Jeong et al. [24] suggest that, given the distributions of known statures and estimated statures, Bayes factors can be used to compare the performance of multiple stature estimation methods quantitatively and objectively.
A Bayes factor is the ratio of the marginal likelihoods of two models [25,26]. In the context of stature estimation, the two models will be the distributions of two sets of stature data (e.g. estimated statures and known statures). The distributions of multiple sets of estimated statures produced by different estimation methods may be compared with the same distribution (i.e. distribution of known statures) so that the relative performance of the methods can be assessed using Bayes factors [24]. In other words, a method with a greater Bayes factor can be concluded to perform better than one with a lower Bayes factor.
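As a generic illustration of a ratio of marginal likelihoods, the sketch below computes BF01 for a textbook case: a point null model for a normal mean against an alternative that places a normal prior on that mean, with a known standard deviation. It is not the calculation of Jeong et al. [24] or of this study; the function names, the prior standard deviation `tau`, and the numbers in the usage lines are assumptions chosen for illustration only.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_01(sample_mean, n, mu0, sigma, tau):
    """BF01 for M0: mu = mu0 versus M1: mu ~ Normal(mu0, tau^2), sigma known.

    The sample mean is sufficient for mu, so the ratio of marginal
    likelihoods reduces to a ratio of two normal densities for that mean.
    """
    var_m0 = sigma ** 2 / n             # sampling variance of the mean under M0
    var_m1 = sigma ** 2 / n + tau ** 2  # the prior on mu widens the spread under M1
    return normal_pdf(sample_mean, mu0, var_m0) / normal_pdf(sample_mean, mu0, var_m1)


# Hypothetical inputs: a sample mean equal to mu0 favours M0 (BF01 > 1),
# whereas a sample mean far from mu0 favours M1 (BF01 < 1).
bf_close = bayes_factor_01(sample_mean=174.8, n=100, mu0=174.8, sigma=7.0, tau=2.1)
bf_far = bayes_factor_01(sample_mean=177.0, n=100, mu0=174.8, sigma=7.0, tau=2.1)
```

The same reading applies to the Bayes factors reported in this study: values above 1 favour the null model and values below 1 favour the alternative.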
The goal of this study is to assess the performance of Trotter and Gleser's 1952 and 1958 White male equations using Bayes factors when applied to the WWII and Korean War casualties. Black males were excluded from this study due to a small sample size. Bayes factors for each equation will be calculated by comparing the distribution of the stature estimates to that of the reported living statures. Thus, the resultant Bayes factors will indicate the relative performance of the equations, which will help researchers select the best equation available to them. To the authors' knowledge, this is the first effort to validate Trotter and Gleser's 1952 and 1958 equations in a systematic and quantitative way. The results of this study are expected to benefit researchers who estimate statures of skeletal remains using Trotter and Gleser's method, particularly those who work on the identification of the WWII and Korean War casualties.

Data
The living stature and long bone measurement data were obtained from 240 accounted-for White male casualties of the WWII and Korean War whose skeletal remains were accessioned into the Defense POW/MIA Accounting Agency Laboratory (DPAA-Lab) and/or its predecessor organisations (JPAC and CILHI) between 1989 and 2017. Every individual used in this study had a documented living stature from their antemortem records and possessed at least one measurable long bone from their upper and/or lower limbs. Living statures originally recorded in inches were converted into centimeters by multiplying by 2.54 and then rounding the values to one decimal place. All bone measurements were taken by certified forensic anthropologists at the DPAA-Lab using the contemporary standards [27][28][29]. When both sides of bones were present, stature estimates were produced using left bones, and right bones were used only when their left counterparts were unavailable.
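The two data-preparation rules described above can be sketched as follows. The function names are hypothetical, ordinary rounding to one decimal place is assumed, and a missing bone is assumed to be represented by `None`.

```python
def inches_to_cm(stature_in):
    """Convert a stature in inches to centimeters, rounded to one decimal place."""
    return round(stature_in * 2.54, 1)


def pick_side(left_cm, right_cm):
    """Prefer the left bone; use the right bone only when the left is unavailable."""
    return left_cm if left_cm is not None else right_cm


# Hypothetical values for illustration only.
stature_cm = inches_to_cm(70)        # 70 in x 2.54 = 177.8 cm
femur_cm = pick_side(None, 47.2)     # left femur missing, so the right is used
```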

Equations to be compared
In their 1952 study, Trotter and Gleser presented seven simple regression equations and three multiple regression equations for White males [14, p.495]. In 1958, they presented another set of 10 equations, all of which are simple regression equations [15, p.120]. In Trotter's [19] study, only the seven simple regression equations from the 1952 study were included in her recommendations to "give satisfactory estimates". Maximum lengths of six individual limb bones (humerus, radius, ulna, femur, tibia, and fibula) as well as the summed length of the "Femur+Tibia" were used to develop the simple regression equations in both the 1952 and 1958 studies and, thus, these seven equations were compared in this study. Although three simple regression equations in the 1958 study using the summed lengths of "Femur+Fibula", "Humerus+Radius", and "Humerus+Ulna" were not presented in the 1952 study, they were included for comparison in this study because FORDISC, which is based on Trotter and Gleser's 1952 data, provides stature estimates using those summed lengths. The three multiple regression equations from the 1952 study could not be compared with any of the 1958 equations and thus were excluded from this study (Table 1). Even though FORDISC is based on Trotter and Gleser's [14] WWII data, it uses slightly different equations from those of Trotter and Gleser's 1952 study. Thus, with the "20th MStat" and "WM" options selected, the performance of FORDISC-generated stature equations was also compared with Trotter and Gleser's 1952 and 1958 equations. A total of 27 sets of stature estimates were produced for comparison: 7 using the 1952 equations, 10 using the 1958 equations, and 10 using FORDISC (Table 1).
It should be noted that the tibial measurements in FORDISC have been adjusted by the developers [20] due to the possible error pointed out by Jantz and colleagues [30,31]. However, when Trotter and Gleser's 1952 and 1958 tibia equations were applied in this study, the maximum tibial lengths (i.e. condylo-malleolar length) were entered into the equations with no correction or adjustment. For the rest of the analyses, a point estimate was regarded as the estimated stature of an individual.
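Each equation compared here is a simple linear regression of stature on a single bone length or on a summed length, and only the point estimate is carried forward. A minimal sketch of that step is given below. The class name is hypothetical, and the slope and intercept values are placeholders, not the published coefficients of any 1952, 1958, or FORDISC equation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleEquation:
    """A simple regression equation: stature = slope * length + intercept (cm)."""
    label: str
    slope: float
    intercept: float

    def estimate(self, length_cm):
        """Return the point estimate of stature for a bone length in cm."""
        return self.slope * length_cm + self.intercept


# Placeholder coefficients for illustration only (NOT published values).
single_bone = SimpleEquation("Femur (placeholder)", slope=2.0, intercept=80.0)
summed = SimpleEquation("Femur+Tibia (placeholder)", slope=1.0, intercept=90.0)

estimate_single = single_bone.estimate(47.0)      # one bone length
estimate_summed = summed.estimate(46.0 + 38.0)    # summed lengths of two bones
```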

Performance comparison among equations
Bayes factors along with associated posterior probabilities were calculated to compare the performance of the equations. Additionally, two frequently used performance measures were calculated for comparability purposes: bias (i.e. mean of differences between the estimated and actual statures) and accuracy (i.e. mean of absolute differences between the estimated and actual statures). Bayes factor calculation requires the type of data distribution to be specified, so Kolmogorov-Smirnov tests were conducted to test for normality for the 27 sets of stature estimates and reported statures (i.e. documented living statures). Additionally, histograms, kurtosis, and skewness were drawn/calculated to confirm that there is no significant departure of the data from a normal distribution. All analyses and visualisation of data were conducted using RStudio version 1.3.959 for Windows [32]. The LearnBayes package and R code provided in Jeong et al. [24] were used to calculate Bayes factors and posterior probabilities.
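The bias and accuracy measures defined above can be written out directly. The sketch below uses hypothetical function names and made-up statures; it follows the definitions in the text (estimated minus actual), so a positive bias indicates overestimation.

```python
def bias(estimated, actual):
    """Mean of the signed differences (estimated - actual)."""
    diffs = [e - a for e, a in zip(estimated, actual)]
    return sum(diffs) / len(diffs)


def accuracy(estimated, actual):
    """Mean of the absolute differences |estimated - actual|."""
    abs_diffs = [abs(e - a) for e, a in zip(estimated, actual)]
    return sum(abs_diffs) / len(abs_diffs)


# Made-up statures in cm, for illustration only.
estimated_cm = [176.0, 172.0, 180.0]
actual_cm = [175.0, 174.0, 178.0]

example_bias = bias(estimated_cm, actual_cm)          # (1 - 2 + 2) / 3
example_accuracy = accuracy(estimated_cm, actual_cm)  # (1 + 2 + 2) / 3
```

Because signed errors cancel, an equation can show a bias near zero while its accuracy value remains large, which is one reason the two measures are reported side by side.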

Results
Of 240 individuals, the numbers of individuals having measurable humerus, radius, ulna, femur, tibia, and fibula were 182, 156, 139, 200, 191, and 155, respectively (Table 2). Approximately 40%-60% of the individuals possessed both the left and right bones with only 11.5%-23.2% of individuals having just right bones (Table 2). Table 3 presents the descriptive statistics of the reported statures as well as the maximum lengths of the left and right bones. The mean stature (174.8 cm) in the current study is slightly greater than those reported in Trotter and Gleser's 1952 and 1958 studies (174.0 cm and 173.95 cm, respectively); however, the difference was not statistically significant (one sample t-test; P > 0.05). The discrepancies in the mean bone lengths between the current study and the previous studies were as small as 0.02-0.85 cm with no statistical significance (one sample t-test; P > 0.05 for all bones) (Table 3).
The results of the Kolmogorov-Smirnov tests indicate that all the 27 sets of stature estimates are normally distributed (P > 0.05) (Table 4). The reported statures yield a P-value of 0.05 with a D statistic of 0.089 (Table 4). This relatively low P-value is likely due to a large sample size (n = 240), which makes kurtosis somewhat sensible [33]. The kurtosis of 2.469 implies a slightly platykurtic distribution of the data with relatively light tails; however, the histogram shows that the data do not depart from a normal distribution significantly (Figure 1). Also, the skewness of 0.197 is close enough to zero, indicating that the data are not markedly skewed in either direction [33]. Thus, it was concluded that all sets of estimated statures and reported statures follow a normal distribution. Table 5 reports the bias, accuracy, Bayes factors, and posterior probabilities for each equation.
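For readers who wish to reproduce shape checks of this kind, the sketch below computes moment-based skewness and kurtosis. It assumes the convention in which a normal distribution has a kurtosis of 3, so that values below 3 are platykurtic. Whether the software used in this study applies exactly this estimator is not stated, so the formulas and function names should be read as one common choice rather than the study's own.

```python
def central_moment(values, order):
    """k-th central moment of a sample (divides by n, not n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based kurtosis: m4 / m2^2 (3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# A small symmetric, flat sample: zero skewness and kurtosis below 3.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_skewness = skewness(sample)
sample_kurtosis = kurtosis(sample)
```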
A similar pattern was observed in terms of the accuracy (i.e. Σ|estimated stature − actual stature|/n). Compared with the other methods, the 1958 equations yielded better or similar accuracies except for the "Ulna" equation, where the FORDISC equation yielded the lowest value (3.52 cm). Overall, prioritising the three methods solely based on the accuracy did not appear practical because they tended to produce very similar values (e.g. "Femur+Tibia" equation) (Table 5). Table 5 also shows that the 1958 equations produced the greatest Bayes factors among the three methods except for three equations ("Ulna", "Humerus+Radius", and "Femur+Tibia" equations). The greatest Bayes factors for the "Ulna" and "Humerus+Radius" equations were obtained from FORDISC (BF = 7.47 and 14.42, respectively) and the 1952 study yielded the greatest Bayes factor for the "Femur+Tibia" equation (BF = 13.26). Out of 27, 7 equations yielded Bayes factors greater than 10, indicating "strong evidence" for the scenario that the stature estimates come from the distribution of the population (i.e. known stature) [24,40]. The seven equations with Bayes factors greater than 10, including the 1958 "Radius" equation, are listed in Table 5.
Posterior probabilities reported in Table 5 indicate how much the equations could improve their predictions compared with the prior conditions. As the prior probabilities were originally set as 0.5, posterior probabilities greater than 0.5 can be understood as an improvement of the equation's performance. As indicated in Table 5, the equations with higher Bayes factors yielded higher posterior probabilities (Table 5). Figure 2 presents graphical comparisons of the estimation methods by overlapping the distributions of the stature estimates with that of the reported statures. Overall, it was visually demonstrated that the equations with high Bayes factors tended to yield distributions of stature estimates which are similar to that of the reported statures.
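The link between a Bayes factor and a posterior probability under a prior of 0.5 can be illustrated with the standard odds conversion sketched below. The function name is hypothetical, and this is the generic conversion rather than a reproduction of the code used for Table 5, so its output should not be assumed to match the tabulated values exactly.

```python
def posterior_probability(bf01, prior=0.5):
    """Posterior probability of the null model given BF01 and its prior probability.

    Posterior odds = BF01 * prior odds; the odds are then converted back
    to a probability.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = bf01 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# With a prior of 0.5 the formula reduces to BF01 / (1 + BF01).
posterior_neutral = posterior_probability(1.0)     # BF01 = 1 leaves the prior unchanged
posterior_highest = posterior_probability(15.34)   # the highest BF reported in this study
```

Under this conversion, any Bayes factor above 1 yields a posterior probability above 0.5, and a larger Bayes factor always yields a larger posterior probability.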

Discussion
Newly devised methods to reconstruct biological profile parameters (e.g. ancestry, sex, age-at-death, and stature) are generally expected to be subjected to rigorous validation processes by peer researchers using different samples and/or methodologies. For example, in a validation study of Pearson's [9] stature estimation equations, Stevenson [34] found that accuracy might vary between populations and stressed the importance of a population-specific method for stature estimation. Also, multiple validation studies [35,36] on Fully's [11] technique led to a new version of the anatomical method using revised osteometric measurements and statistical methods by Raxter and colleagues [13]. As such, validation studies not only enhance the accuracy and applicability of a method but also serve as a basis for the development of new methods.
In general, to select a stature estimation method for a target sample, the similarities of biological (e.g. ancestry and sex), geographical, and temporal backgrounds between the target sample and the reference sample used to devise a method are regarded as important standards to be considered [19,37]. Yet, no clear rule of thumb has been established for a situation where multiple methods meeting these standards are available for a target sample, such as Trotter and Gleser's White and Black male equations from their 1952 and 1958 studies. Although Trotter [19] favoured the 1952 equations, her suggestion was not based on an independent validation test but simply on a comparison of the standard errors associated with each equation. As mentioned previously, because standard errors are not a predictive indicator, they should not be regarded as a proper measure to compare the performance of stature estimation equations. In this regard, Jeong and colleagues [24] suggest that (i) a good estimation method should yield stature estimates whose distribution is similar to that of a population (i.e. known stature) and (ii) the similarity of the two distributions (i.e. distributions of estimated and known statures) can be assessed quantitatively using Bayes factors.
Bayes factors, based on the Bayesian approach, have some practical advantages over a P-value obtained from hypothesis testing in a frequentist approach. First, unlike the P-value, which is used only to decide whether a null hypothesis can be rejected, the Bayes factor (BF₀₁) presents the odds of how much more likely a set of given data would occur under the null model (M₀) than under the alternative model (M₁). For example, given BF₀₁ = 2, the Bayes factor suggests that (i) the given data are twice as likely to occur in the scenario of M₀ compared with M₁ and, at the same time, (ii) the given data are 0.5 times (i.e. 1/BF₀₁) as likely to occur in the scenario of M₁ compared with M₀ [24,38]. Moreover, the Bayes factors calculated from different datasets can be directly compared with each other, which is not possible for a P-value [38]. This property allowed the Bayes factors from the 27 sets of stature estimates and the reported statures to be compared in this study and, eventually, the performance of the equations to be prioritised.
The Bayes factors were calculated in this study in a way that the distribution parameters (mean and standard deviation) of the reported statures and estimated statures were used for the null (M₀) and alternative (M₁) models, respectively. Thus, the higher the Bayes factor (BF₀₁), the more likely the scenario that the set of stature estimates comes from the distribution of the reported statures (i.e. greater similarity between the distributions of the reported and estimated statures). Table 6 presents general guidelines to interpret the Bayes factors established by previous studies [39,40]. In both Jeffreys's [39] and Raftery's [40] guidelines, a Bayes factor of 3 is regarded as a meaningful point beyond which the result can be interpreted as "substantial" or "positive" evidence for the null model (Table 6). Furthermore, Jeffreys [39] specifies that a Bayes factor greater than 10 can be interpreted as "strong" evidence.
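The two thresholds mentioned above (3 and 10) and the reciprocal relationship between BF01 and BF10 can be expressed as a small helper. Only the cut-offs stated in the text are encoded; the remaining categories of Table 6 are not reproduced, and the function names and the wording of the label for values of 3 or less are assumptions.

```python
def describe_bf01(bf01):
    """Label a BF01 value using only the thresholds quoted in the text."""
    if bf01 > 10:
        return "strong evidence for M0 (Jeffreys)"
    if bf01 > 3:
        return "substantial (Jeffreys) or positive (Raftery) evidence for M0"
    return "not above the threshold of 3"


def bf10_from_bf01(bf01):
    """Evidence for M1 over M0 is the reciprocal of the evidence for M0 over M1."""
    return 1.0 / bf01


# Bayes factors quoted in this article, used here as inputs.
label_radius_1958 = describe_bf01(15.34)
label_tibia_1958 = describe_bf01(6.33)
label_tibia_1952 = describe_bf01(0.57)
half_as_likely = bf10_from_bf01(2.0)   # BF01 = 2 corresponds to BF10 = 0.5
```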
When considering Bayes factors, Trotter and Gleser's 1958 study has more high-performing equations than the other methods under comparison. About half of the equations yielding Bayes factors greater than 3 (nine out of 19 equations) were from the 1958 study (Table 7). In fact, all 1958 equations except for the "Ulna" equation yielded Bayes factors greater than 3. Moreover, more than half of the Bayes factors greater than 10 were obtained from the 1958 study (four out of seven) (Table 7). The greatest Bayes factor in this study was also obtained from one of the 1958 equations (i.e. the "Radius" equation yielding BF = 15.34) (Table 5). In addition, except for the "Ulna" and "Femur+Tibia" equations, the least bias and greatest accuracy were obtained from the 1958 equations. Overall, these results indicate that the 1958 equations outperform the 1952 and FORDISC equations.
This study provides researchers and practitioners applying the Trotter and Gleser stature estimation method with a practical guide for equation selection. In other words, based on the results presented in Tables 5 and 7, it is recommended to use the equation with the highest Bayes factor among the available options. It should be noted that the equation with the best Bayes factor is not necessarily associated with the lowest bias or greatest accuracy. For example, there are many equations with a lower bias and/or greater accuracy score than the 1958 "Radius" equation that yielded the greatest Bayes factor (BF = 15.34). Rather, an equation with a higher Bayes factor should be understood to produce stature estimates whose distribution mimics that of the true statures more closely and thus, its overall performance would be greater than that of equations with lower Bayes factors.
Since Jantz and colleagues [30,31] raised the issue of possible mismeasurement of the tibia in Trotter and Gleser's 1952 study, the accuracy of the tibia-related equations has been debated [17,41]. Jantz and colleagues [30,31] speculated that, unlike the description presented in Trotter and Gleser [14], Trotter measured the maximum length of the tibia excluding the malleolus, resulting in the overestimation of statures when the malleolus-included tibial length is plugged into the tibia equation. To address this issue, Jantz and Ousley [20] applied a correction factor to Trotter and Gleser's [14] raw tibia measurement data and generated new tibia equations, which are currently built into FORDISC version 3. A 10-mm correction factor, which should compensate for the missing malleolus length, was intended to be applied; however, the correction factor was applied twice for an unknown reason and thus, the current tibia equation in FORDISC underestimates stature [41]. Trotter's possible mismeasurement of the tibia in the 1952 study and the overcorrection of the tibial length in FORDISC explain the positive bias in the 1952 "Tibia" equation (1.09 cm) and the negative bias in the FORDISC "Tibia" equation (−1.34 cm) as well as their low Bayes factors (BF = 0.57 and 0.08, respectively) (Table 5, Figure 2). On the other hand, the 1958 "Tibia" equation yielded a decent bias (0.54 cm) and Bayes factor (BF = 6.33). This result not only demonstrates that the 1958 "Tibia" equation outperforms the other methods but also supports the argument that there was no measurement issue with the tibia in the 1958 study, because the bones were measured not by Trotter but by technicians following Trotter's descriptions in the 1952 study.
Another somewhat unexpected finding from this study is that the best Bayes factor was obtained from an upper limb equation (the 1958 "Radius" equation, BF = 15.34), as it is generally accepted that lower limb equations yield more accurate estimates compared with upper limb equations [16,37]. The high performance of the upper limb equations does not appear to be an anomaly, considering that Bayes factors greater than 3 were obtained more often from the upper limb equations (10 out of 13 equations (77%)) than from the lower limb equations (nine out of 14 equations (64%)) (Table 7). Moreover, both FORDISC equations yielding Bayes factors greater than 10 were upper limb equations ("Radius" and "Humerus+Radius" equations) (Table 7). As the primary purpose of this study is to report the comparative performance of the methods/equations, exploring the reason for the different performance among equations is beyond the scope of this study and needs to be a topic for future research.
Lastly, the exclusion of the Black male equations from the analysis due to insufficient sample size is another limitation of this study. Thus, a validation test of the Black male equations should be another topic for future research with additional data.