The Hubble Tension Survey: A Statistical Analysis of the 2012-2022 Measurements

In order to investigate the potential Hubble tension, we compile a catalogue of 216 measurements of the Hubble–Lemaître constant H 0 between 2012 and 2022, which includes 109 model-independent measurements and 107 Λ CDM model-based measurements. Statistical analyses of these measurements show that the deviations of the results with respect to the average H 0 are far larger than expected from their error bars if they follow a Gaussian distribution. We find that x σ deviation is indeed equivalent in a Gaussian distribution to x eq σ deviation in the frequency of values, where x eq = 0 . 72 x 0 . 88 . Hence, a tension of 5 σ , estimated between the Cepheid-calibrated type Ia supernovae and cosmic microwave background (CMB) data, is indeed a 3 σ tension in equivalent terms of a Gaussian distribution of frequencies. However, this recalibration should be independent of the data whose tension we want to test. If we adopt the previous analysis of data of 1976-2019, the equivalent tension is reduced to 2 . 25 σ . Covariance terms due to correlations of measurements do not significantly change the results. Nonetheless, the separation of the data into two blocks with H 0 < 71 and H 0 ≥ 71 km s − 1 Mpc − 1 finds compatibility with a Gaussian distribution for each of them without removing any outlier. These statistical results indicate that the underestimation of error bars for H 0 remains prevalent over the past decade, dominated by systematic errors in the methodologies of CMB and local distance ladder analyses.


INTRODUCTION
Few problems in astrophysics have received as much attention in the last years as what is called 'Hubble tension'.Hundreds or thousands of papers dedicated to investigating the observations that originate the tension within the standard cosmological model or to propose alternative scenarios have been produced (see reviews by Di Valentino et al. 2021;Perivolaropoulos & Skara 2022;Abdalla et al. 2022;Hu & Wang 2023;Vagnozzi 2023).The tension was mainly triggered with the claim in 2019 of a Hubble-Lemaître constant H 0 estimated from the local Cepheid-type Ia supernova (SN Ia) distance ladder being at odds with the value extrapolated from Cosmic Microwave Background (CMB) data, assuming the standard ΛCDM cosmological model, 74.0 ± 1.4 (Riess et al. 2019) and 67.4 ± 0.5 km s −1 Mpc −1 (Planck Collaboration et al. 2020), respectively, which gave an incompatibility at the 4.4σ level.This tension was later increased up to 6σ depending on the datasets considered (Di Valentino et al. 2021).Very recently, the latest result from the Cepheid-SN Ia sample is H 0 = 73.04 ± 1.04 km s −1 Mpc −1 (Riess et al. 2022), representing a 5σ tension with that estimated from CMB data.This tension should not be so surprising, given the number of sys-⋆ E-mail: martin@lopez-corredoira.com † E-mail: jjwei@pmo.ac.cn tematic errors that may arise in the measurements.As a matter of fact, there have always been tensions between different measurements in the values of the Hubble-Lemaître constant, which has not received so much attention.Before the 1970s, due to different corrections of errors in the calibration of standard candles, the parameter continuously decreased its value, making incompatible measurements of different epochs (Tully 2023).But even after the 1970s, a tension has always been present.The compilations of values until the beginning of the 2000s showed an error distribution that was strongly non-Gaussian, with significantly larger probability in the tails of the distribution than predicted by a Gaussian distribution: the 95.4% confidence-level (CL) limits are 7.0σ in terms of the quoted errors (Chen et al. 2003).The nature of the possible systematic errors is unknown, and they may dominate over the statistical errors.
Twenty years ago, it was estimated that these systematic errors might be of the order of 5 km s −1 Mpc −1 (95% CL) (Gott et al. 2001).
The common likelihood functions used by astronomers contain the assumption of Gaussian errors (D'Agostini 2005), which is also a requirement of the central limit theorem.Statistical analyses of the measurements of the Hubble-Lemaître constant H 0 between 1976 and 2019 (Faerber & López-Corredoira 2020;López-Corredoira 2022) have also shown that the dispersion of its value is far much larger than what would be expected in a Gaussian distribution given the published error bars.The only solution to understand this dis-persion of values is assuming that most of the statistical error bars associated with the observed parameter measurements have been underestimated, or the systematic errors were not properly taken into account.The fact that the underestimation of error bars for H 0 is so common might explain the apparent discrepancy of values.Indeed, a recalibration of the probabilities with this sample of measurements to make it compatible with a Gaussian distribution of deviations finds that a tension of 4.4σ would be indeed a 2.1σ tension in equivalent terms of a Gaussian distribution of frequencies, and a tension of 6.0σ would be indeed a 2.5σ tension in equivalent terms of a Gaussian distribution of frequencies (López-Corredoira 2022).That is, we should not be surprised to find those 4-6 σ tensions, because they are much more frequent than indicated by the Gaussian statistics, and they stem from underestimation of errors, not from real tensions in the background of physics or cosmology.
In this paper, we want to extend this type of historical statistical analyses focusing only in the years 2012-2022.We know the statistical and systematic errors are much smaller now than some decades ago, but it is still worth to check whether the distribution of these errors follows what is expected in a Gaussian distribution.Clearly, if we focus on the Hubble tension between SN Ia data and CMB data, the answer is negative, but apart from these two types of sources, we want to explore other measurements too and globally evaluate the distribution.
In Sect.2, we give a description of the bibliographical data we used for our statistical analyses and the criteria to select them.Statistical analyses are presented in Sect.3. Recalibration of data in order to comply with a Gaussian distribution is presented in Sect. 4. We are aware that many of the data are correlated, they were measured by the same teams and/or with similar calibrators or methods, so we add in Sect. 5 some considerations on the effect of removing correlated data.The last section summarizes and discusses the results and the possible origin of the underestimated statistical errors and/or 'unknown' systematic errors.

BIBLIOGRAPHICAL DATA
We conduct a comprehensive search for H 0 measurements between the years 2012 and 2022 in the NASA Astrophysics Data System 1 , adhering to the following selection criteria: (i) The H 0 values were reported by published papers.(ii) We only pay attention to the H 0 measurements inferred from model-independent methods and within the context of standard ΛCDM.Note that hundreds or thousands of papers dedicated to investigating the Hubble tension.A substantial fraction of them focuses on proposing alternative cosmological models for measuring H 0 in order to narrow the discrepancy between the CMB and SN Ia observations.Those H 0 measurements obtained under models other than ΛCDM 2 are not included in our sample, since they potentially interfere with the analysis of the Hubble tension.
Based on the aforementioned criteria, we compile a total of 216 H 0 measured values, with 109 values measured from modelindependent methods and 107 values obtained under the ΛCDM model.For the complete catalogue, please refer to  plotted in Figure 1.The size of the scatter dots is inversely proportional to the error.We can find a significant increase in numbers of the measured H 0 after the year 2019, primarily attributed to the renowned Hubble tension problem, while concurrently witnessing an improvement in measurement accuracy.It is essential to reexamine these H 0 measurements from a statistical perspective.Therefore, we employ the statistical analysis approach proposed by López-Corredoira (2022) to investigate the H 0 tension over the past 11 years.

Distributions of Statistical Data
The statistical data of H 0 are divided into three categories: complete, model-independent and ΛCDM model-based measurements.These categorizations can be distinguished based on the statements in the respective articles (see Table A1).To visualize the distributions of these measurements, in Figure 2 we plot histograms for each category.Notably, significant bimodal distributions are observed in all three subplots, which imply the possible tension: the measurements are clustered around 67.4 ± 0.5 km s −1 Mpc −1 from the CMB data Moreover, from the perspective of early and late Universe ob-servations, our statistics also reveal some intriguing situations.The methods of the early Universe observations, mainly including CMB and baryon acoustic oscillation (BAO) data, consistently have results around 68 km s −1 Mpc −1 in ΛCDM model-based measurements, aligning with expectations.However, it was previously thought that the H 0 measured in late Universe was approximately 73 km s −1 Mpc −1 , which does not match our statistics.Almost all of the model-independent measurements in our statistics are derived from the late Universe observations, but a considerable proportion of these measurements deviate from the value of 73 km s −1 Mpc −1 .

Statistical Significance of the Bimodality
In this subsection, our focus is on assessing the statistical significance of the double peaks in the distribution.To achieve this objective, we employ the dip test, a general method for testing multimodality of distributions (Hartigan & Hartigan 1985).The dip test quantifies multimodality in a sample by calculating the discrepancy between the empirical distribution function of the samples and the unimodal distribution function that minimizes the maximum discrepancy.We implement this approach by using the diptest3 package in Python.The p values for the dip test range between 0 and 1, representing the probability of unimodality.The p values less than 0.05 indicate significant multimodality.
In addition to the three categories mentioned in the previous subsection, we further segment the complete H 0 measurements into two blocks: H 0 < 71 km s −1 Mpc −1 (118 data) and H 0 ≥ 71 km s −1 Mpc −1 (98 data) for comparison.The results are summarized in the last column of Table 1.The complete measurements result in p = 0.01 and the measurements for the two blocks result in p = 0.96 and p = 0.98, respectively.This reveals that the measurements of two blocks exhibit unimodality (p ≫ 0.05), whereas complete measurements exhibit significant multimodality (p = 0.01 < 0.05), specifically bimodality.This implies a potential H 0 tension.For the model-independent and ΛCDM model-based measurements, the results of the dip test are p = 0.46 and p = 0.16, respectively, indicating that the evidence of multimodality is not robust, and the unimodality is not statistically significant either.

Statistical Significance of model-independent and ΛCDM model-based measurements
To investigate the causes of the bimodal distributions, we continue to analyse the statistical significance of three categories: complete, model-independent and ΛCDM model-based measurements.
We calculate the weighted average value H 0 to analyze the statistical characteristic, whereby each data value is assigned a weight based on its inverse variance.Then we can obtain χ 2 for H 0 of three categories: where H 0,i and σ i are the measured value and standard deviation, respectively.The weighted average values H 0 , along with their corresponding standard deviations σ, χ 2 and Q values are summarized in Table 1.The index Q is used to measure their statistical significance, representing the probability that sample differences arise solely from chance errors.Typically, the statistical significance is ), and H 0 = 68.94± 0.13 km s −1 Mpc −1 (Q = 4.36 × 10 −12 ), respectively.The χ 2 fittings of two H 0 blocks are also displayed, with Q ∼ 1.However, for all three categories, the Q values are much lower than the threshold 0.05, indicating that it is highly unlikely that the observed trends are due to chance errors.
It is important to note that the implicit assumption in our calculations is that the covariance of each H 0 measurement is ignored.A considerable fraction of the measurements were not independent at all, since there were many duplicate data being used.If the data are treated as independent variables, a conservative limit of the real dispersion of data would be obtained.This is because the 'effective' number of degrees of freedom is smaller than the number assuming independence, leading to an increase of the reduced χ 2 value.These defects also make the cause of the bimodal distributions uncertain, as they may arise from a failure to eliminate the correlated data.More discussion can refer to section 5.

Outliers Screening
Removing sufficient outliers which deviate significantly from the mean would result in Q values exceeding 0.05.It is imperative to investigate how many outliers lead to a loss of statistical significance.Prior to removal, we define the number of σ deviations between the measurements H 0,i and the average H 0 as To ensure a significance level of Q > 0.05, data with x > x min are excluded, and the calculations for H 0 and χ 2 follow the methodology described in the previous subsection.Our results are listed in Table 2, where the weighted averages H 0 are 69.17 ± 0.09, 72.45 ± 0.21 and 68.85 ± 0.10 km s −1 Mpc −1 , respectively.Additionally, the 27 outliers of the complete category are listed in Table 3.Compared to the values obtained in Table 1, several pieces of information can be inferred: (i) The H 0 measurements from model-independent methods perform best.The number of outliers is only one, with a value of 69.13 ± 0.24 km s −1 Mpc −1 , obtained from combining Hubble parameter H(z), Sunyaev-Zel'dovich effect and X-ray data (Huang & Huang 2017).The outlier is due to its remarkably low error.
(ii) The outliers in the complete dataset predominantly arise from the observations of Cepheids+SNe Ia, lensing, CMB and BAO.Due to their small error ranges, the deviations are obvious, especially for the renowned results from Riess et al. (2022) and Planck Collaboration et al. (2020).And most of the outliers were obtained after the year 2019.
(iii) The outliers in the Model-independent and ΛCDM modelbased dataset are less than in the complete one, which implies the possible tension between them.
Based on the available evidence, there are signs of the Hubble tension.However, the degree of tension may be overestimated.Therefore, our subsequent analysis aimed to quantify this potential overestimation.

RECALIBRATION OF PROBABILITIES
In Figure 3, we display the frequency of deviations larger than xσ from the weighted average values H 0 derived using the complete, model-independent and ΛCDM model-based categories.For comparison, the Gaussian cumulative distributions are also given, which enables us to infer the degree to which the sample deviates from the expected Gaussian error distribution with a certain probability.For instance, if P = 0.1, the error is approximately ∼ 1.6σ in a Gaussian error distribution, while ∼ 2.5σ in real samples, indicating an overestimated deviation of ∼ 0.9σ.
To describe this overestimated deviation quantitatively, we calculate the equivalent deviation x eq between the probabilities of Gaussian distributions and the real frequency, which means that there is actually a x eq σ deviation when the deviation is xσ in the real frequency.And we fit them with the power function, x eq = ax b , where a and b are free parameters.As shown in Figure 4, the equivalent deviations x eq in three categories are x eq = (0.719 ± 0.013)x 0.879±0.014(Complete), x eq = (0.983 ± 0.012)x 0.744±0.011(Model-independent), (4) x eq = (0.750 ± 0.007)x 0.818±0.009(ΛCDM model-based).( 5) The baselines (i.e., x eq = x; gray dashed lines) are also plotted in Figure 4, which represent the expected deviations of a Gaussian distribution without underestimating the error bar.Among three categories, the category of model-independent measurements most closely aligns with the Gaussian distribution (especially when x ≤ 1.5), while others exhibit significant deviations.
Considering the current 5σ tension reported in Riess et al. (2022), the practical equivalent tension calculated using Equation 3 is (2.96± 0.12) σ ∼ 3σ.However, in order to calibrate x eq and apply it to present-day Hubble tension values, we should use data before the Hubble tension, because the calibration should be independent of the data whose tension we want to test.If we adopt the function of López-Corredoira (2022) with data of 1976-2019, the equivalent tension is reduced to 2.25σ (x eq = 0.83x 0.62 ), which is a more accurate estimation.It also indicates that the Hubble tension is stronger in the past decade, probably due to the improvement in measurement accuracy and the increase in relevant data.Although these findings suggest an increase in the Hubble tension, underestimation of errors is still common.

THE EFFECTS OF CORRELATED DATA
In a large number of H 0 measurements, a substantial portion of the observational data is reused, which makes the measurements correlated, but we ignore the correlation between them.As we mentioned χ 2 in Equation 1, we treat the H 0 measurements as independent variables, and sum them without considering the covariance matrix.It is important to reiterate that the data we are investigating is not entirely independent, which means that our analysis may be biased.Therefore, it is necessary to account for the effects of correlated data.In this section, we eliminate 64 correlated data, leaving 152 data (71 model-independent measurements and 81 ΛCDM-based measurements).The removal criteria are that: a) if the measurements are obtained from the same data, we remove the earlier one; b) if one person measures H 0 using data A and data B, and the other person uses only data B, we discard the result that uses only data B. We implement the same analysis and find minimal changes in all statistical characteristics as presented in Table 4.
We also examine the statistical properties of double peaks by categorizing the 152 H 0 measurements into two blocks: H 0 < 71 km s −1 Mpc −1 (85 data) and H 0 ≥ 71 km s −1 Mpc −1 (67 data).The calculated results of the dip test are p = 0.12 (complete), p = 0.69 (H 0 < 71 km s −1 Mpc −1 ) and p = 0.98 (H 0 ≥ 71 km s −1 Mpc −1 ), respectively (see Table 4).The bimodality is alleviated but its p value is far less significant than that of the two H 0 blocks.Additionally, the results of χ 2 fittings indicate two H 0 blocks still satisfy the Gaussian distributions (Q ∼ 1).
However, there are still some correlated data in the dataset containing the 152 H 0 measurements, because it is not feasible to elim-Table 4. The results of dip tests, weighted averages and χ 2 fittings in three categories after removing some of the correlated measurements.The Q values measure the statistical significance for χ 2 fittings, and the p values measure the significance of the unimodality for dip tests.We find a significant bimodal distribution in the 216 H 0 measurements, corresponding to the results from the local distance ladder (H 0 = 73.04 ± 1.04 km s −1 Mpc −1 ; Riess et al. 2022) and CMB observations (H 0 = 67.4± 0.5 km s −1 Mpc −1 ; Planck Collaboration et al. 2020), which has not been reported in previous statistical studies yet.In the subsamples of 109 model-independent measurements and 107 ΛCDM model-based measurements, the bimodal distributions still exist.We calculate the weighted averages H 0 and the probabilities Q for complete, model-independent and ΛCDM modelbased measurements, yielding H 0 = 69.35±0.12(Q = 8.85×10 −27 ), H 0 = 70.82± 0.22 (Q = 1.23 × 10 −5 ) and H 0 = 68.94± 0.13 (Q = 4.36 × 10 −12 ) km s −1 Mpc −1 , respectively.Such low Q values (Q ≪ 0.05) indicate that they are lack of statistical significance.
The above results, for instance in Figure 3, show clearly that the deviations of the results with respect to the average H 0 are far larger than expected from their error bars if they follow a Gaussian distribution.The underestimated 5σ tension may actually be 3σ when we recalibrate the frequency of deviations with respect to the average, which is still a significant tension.However, this recalibration should be independent of the data whose tension we want to test.If we adopt the analysis of data of 1976-2019 (López-Corredoira 2022), the equivalent tension is reduced to 2.25σ.
In addition, the separation of the data into two blocks with H 0 < 71 and H 0 ≥ 71 km s −1 Mpc −1 (Sect.3.2) finds values of Q ∼ 1, indicating compatibility with a Gaussian distribution for each of them without removing any outliers.It points out that the possible underestimation of errors is related to own methodology in these two groups of measurements.
At H 0 < 71 km s −1 Mpc −1 , recent measurements of H 0 with low error bars are dominated by CMB measurements.These values are subject to the errors in the cosmological interpretation of CMB with ΛCDM, and it is subject to the many anomalies still pending to be solved in CMB anisotropies (Schwarz et al. 2016).Moreover, Galactic foregrounds are not perfectly removed (Lopez-Corredoira 2007;Axelsson et al. 2015;Creswell & Naselsky 2021), and these are an important source of uncertainties.
At H 0 ≥ 71 km s −1 Mpc −1 , recent measurements of H 0 with low error bars are dominated by SN Ia teams, using calibration with Cepheids.Intrinsic scatters in SN Ia measurements are poorly understood (Wojtak & Hjorth 2022).The usual results are based on the assumption that there is only one hidden (latent) variable behind this scatter, namely the absolute magnitude.With this assumption, one can claim that the error in the distance scale (and consequently in the H 0 measurement) can be reduced by increasing the number of observed supernovae.The problem is that the actual space of latent variables behind the intrinsic scatter is much larger: dust extinction in SN Ia depending on the type of host galaxies (Meldorf et al. 2023), variations of the intrinsic luminosity of SN Ia with the age of the host galaxies (Lee et al. 2022), etc. Ignoring all these latent variables can only lead to underestimated errors and possible biases.As a matter of fact, measurements showing a decrease of H 0 with z in SN Ia measurements (Jia et al. 2023) might indicate precisely the presence of these systematic biases rather than new physics or new cosmology.We must also bear in mind that the value of H 0 is determined without knowing on which scales the radial motion of galaxies and clusters of galaxies relative to us is completely dominated by the Hubble-Lemaître flow.The homogeneity scale may be much larger than expected (Sylos Labini 2011), thus giving important net velocity flows on large scales that are incorrectly attributed to cosmological redshifts.
In conclusion, our statistical analysis indicates that the underestimation of error bars for H 0 remains prevalent over the past decade, dominated by systematic errors in the methodologies of CMB and SN Ia analyses which make the tension become increasingly evident.

Figure 1 .
Figure 1.Histogram and scatter diagram of 216 measurements of the Hubble-Lemaître constant H 0 between 2012 and 2022.The size of the blue scatter points is inversely proportional to the error.Obviously, there is potential for increasingly precise estimation of H 0 .The red points are weighted averages with weighted standard deviations for each year.The results derived from the local Cepheid-supernova distance ladder and CMB data are also included for comparison.

Figure 2 .
Figure 2. Histograms of three categories: (top panel) 216 complete measurements, (middle panel) 109 model-independent measurements and (bottom panel) 107 ΛCDM model-based measurements.Kernel density estimation curves are depicted as red lines in each histogram.For clarity, several extreme values exceeding 80 km s −1 Mpc −1 are not displayed.

Figure 3 .
Figure 3. Probabilities of deviation larger than xσ for three categories: (top panel) 216 complete measurements, (middle panel) 109 model-independent measurements and (bottom panel) 107 ΛCDM model-based measurements.The red dashed lines denote the expected probabilities if the errors are Gaussian.

Figure 4 .
Figure 4. Equivalent number of σ of a Guassian distribution as a function of the number of measured deviation σ with respect to the weighted average value of H 0 .The three panels (from top to bottom) correspond to the cases of 216 complete measurements, 109 model-independent measurements, and 107 ΛCDM model-based measurements, respectively.The red dashed lines represent the best fits, and the gray dashed lines denote the baselines (i.e., x eq = x).
Table A1 provided in the Appendix.The histogram and scatter of these data are 1 https://ui.adsabs.harvard.edu/. 2 Numerous models proposed for solving the H 0 tension are divided into 11 major categories with 123 subcategories by Di Valentino et al. (2021).

Table 1 .
The results of dip tests, weighted averages and χ 2 fittings in three categories: complete, model-independent and ΛCDM model-based measurements.The Q values measure the statistical significance for χ 2 fittings, and the p values measure the significance of the unimodality for dip tests.

Table 2 .
The results of weighted averages and χ 2 fittings in three categories after removing the outliers.The number of outliers and minimal deviations are also displayed.
For example, we can not exclude those H 0 measurements obtained by the same observational data but with different methods.When we encounter the H 0 measurements separately obtained by the combined data A+B and the combined data B+C, although they are correlated, we also can not remove either of them.In any case, if we had a random selection of duplicated points, the reduced χ 2 is similar, although the effective number of degrees of freedom is lower than the number assuming independence.Probability Q might be larger when covariance terms are taken into account, but still very low when Q ≪ 1.

Table A1 :
216 measurements of the Hubble-Lemaître constant H 0 between the years 2012 and 2022.