A practical tool for maximal information coefficient analysis

Abstract

Background: The ability to find complex associations in large omics datasets, to assess their significance, and to prioritize them according to their strength can be of great help in the data exploration phase. Mutual-information-based measures of association are particularly promising, especially after the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures, providing a complete procedure to test their significance, would be extremely useful.

Findings: Here we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure for identifying relationships of various degrees of complexity. MICtools calculates their strength and assesses their statistical significance using a permutation-based strategy. The performance of the proposed approach is assessed by an extensive investigation on synthetic datasets, and an example of a potential application to a metagenomic dataset is also illustrated.

Conclusions: We show that MICtools, by combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.


Introduction
With the growing popularity of high-throughput quantitative technologies, it is now common to characterize living systems by measuring thousands of variables over a wide range of conditions. In these large datasets, the number of potential associations between variables is enormous. Computational and statistical methods should be able to highlight the significant ones (striking a balance between flexibility and statistical robustness) and to prioritize the most relevant for downstream analysis. Traditionally, the presence of a potential relationship between two variables X and Y is assessed on the basis of a certain measure of association, which is often able to reveal specific types of relationships but is blind to others. Then, once the measure is computed, its significance is tested against the null hypothesis of no association. For linear associations, the Pearson correlation coefficient is the natural choice, while Spearman's rank coefficient represents a more flexible alternative for general monotonic relationships. In the exploratory analysis of datasets produced by modern omics technologies this conventional approach shows its limits, because a huge number of potential associations needs to be screened without any a priori information on their form. In these cases, it would be desirable to use a measure of dependence that ranks the relationships according to their strength, regardless of the type of association. A measure with this property has been defined equitable [1], and a consistent mathematical framework for the definition of equitability has been proposed [2,3,4,5,6]. The second challenge faced in the unsupervised screening of large datasets is that the number of associations to be tested is usually huge, and the statistical assessment of significance has to face well-known multiplicity issues [7,8].
Recently, a family of measures based on the concept of mutual information has been proposed, and one of the most popular members of this family, the Maximal Information Coefficient (MIC), has been shown to satisfy the equitability requirement [1]. Unfortunately, MIC suffers from lack of power [9], and its heuristic estimator, APPROX-MIC, is computationally demanding [5]. These two drawbacks have severely hampered the application of MIC to large datasets. In order to overcome these limitations, two new MIC-based measures have been proposed [5]: MICe, a consistent estimator of the MIC population value (MIC*), and the related TICe (total information coefficient) statistic. Both quantities can be calculated more efficiently than APPROX-MIC and have better bias/variance properties [5]. In particular, TICe is characterized by high power, obtained at the expense of equitability, while MICe performs better in terms of equitability but shows reduced power. These two MIC-based measures therefore complement each other, and their combination is extremely promising as a data exploration tool. In particular, a two-step procedure can be applied, in which TICe is used to efficiently perform a high-throughput screening of all the possible pairwise relationships and assess their significance, while MICe is used to rank the subset of significant associations in terms of strength [5]. Despite the potential of this approach, an efficient software implementation of these two measures, and of a statistical procedure to test the significance of each association while controlling multiplicity issues, is still lacking.
Here we present MICtools, an open-source and easy-to-use software providing:

• an efficient implementation of the TICe and MICe estimators [10];
• a permutation-based strategy for estimating TICe empirical p values;
• several methods for multiple testing correction, including Storey's q value to control the false discovery rate (FDR);
• the MICe estimates for each association called significant.

Methods
MICtools implements a multi-step procedure to identify relevant associations amongst a large number of variables, assess their statistical significance, and rank them according to the strength of the relationship. Starting from M variable pairs x_i and y_i measured in n samples, the procedure can be broken into 4 steps (Figure 1): (i) estimating the empirical TICe null distribution by permutations; (ii) computing the TICe statistic and its empirical p value for each variable pair; (iii) applying a multiple testing correction strategy in order to control the family-wise error rate (FWER) or the FDR [11]; (iv) using MICe to estimate the strength of the relationships called significant.
The pipeline can be run as a sequence of subcommands implemented in the main command mictools (Figure 1).

The empirical TICe null distribution
Since TICe depends only on the rank order of the vectors x_i and y_i [1], the empirical null distribution can be estimated, for a given sample size and set of parameters, by performing R permutations of the elements of the vectors y_i and by calculating the set of null TICe statistics t0_1, ..., t0_R. Two parameters control the estimation of the null distribution of TICe: the parameter B, controlling the maximal allowed grid resolution, and the number of permutations R. In the current implementation, B is set to the default value 9, which guarantees good performance in terms of statistical power against independence in most situations [12]. However, different values of B can be chosen: for example, B = 4 for less complex alternative hypotheses, or B = 12 for more complex associations [12]. With regard to the number of permutations, the results obtained on the synthetic datasets (see Additional File 2, Figures A2 and A3 and Additional File 1, Table A2) empirically indicate that 200,000 permutations represent a reasonable choice in most scenarios.
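The permutation scheme above can be sketched in a few lines of Python. Note that the statistic below is only an illustrative stand-in (a rank-based correlation): the actual TICe estimator is provided by the minepy library [10], and the names tic_e_stub and null_distribution, as well as all parameter choices, are hypothetical and used only for illustration.

```python
import numpy as np

def tic_e_stub(x, y):
    """Stand-in dependence statistic for illustration only: like TICe,
    it depends only on the rank order of x and y, but the real TICe
    estimator (provided by minepy) should be plugged in here."""
    rx = np.argsort(np.argsort(x))  # rank-transform x
    ry = np.argsort(np.argsort(y))  # rank-transform y
    return abs(np.corrcoef(rx, ry)[0, 1])

def null_distribution(x, y, R, seed=0):
    """Estimate the empirical null distribution of the statistic by
    permuting the elements of y R times."""
    rng = np.random.RandomState(seed)
    return np.array([tic_e_stub(x, rng.permutation(y)) for _ in range(R)])
```

Because the statistic is rank-based, a single null distribution computed at a given sample size can be reused for all variable pairs of that size, which is what makes the permutation step affordable.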

Computing TICe and its associated empirical p values for each variable pair
The total information coefficient is computed for each (non-permuted) variable pair, obtaining a set of TICe values t_i (with i = 1, ..., M). For each t_i, the p value p_i is estimated as the fraction of values of the empirical null distribution that exceed t_i [13]:

p_i = #{r : t0_r > t_i} / R
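A minimal sketch of this computation, assuming the null statistics have already been collected; the function name is illustrative, and whether MICtools applies the common +1 pseudocount variant is not stated here.

```python
import numpy as np

def empirical_pvalue(t, null_stats):
    """Empirical p value: the fraction of the permutation null
    distribution that exceeds the observed statistic t.  Some
    implementations add a +1 pseudocount to both numerator and
    denominator to avoid p = 0; that variant is not shown here."""
    return float(np.mean(np.asarray(null_stats) > t))
```

For example, `empirical_pvalue(0.5, [0.1, 0.6, 0.7, 0.2])` returns 0.5, since two of the four null values exceed the observed statistic.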

Findings
Two synthetic datasets (SD1 and SD2) were created in order to assess (i) the statistical power (or recall, i.e., the fraction of non-independent relationships that were recovered at a given significance level) and (ii) the ability to control the FDR. The analyses were performed varying the number of samples (SD1) and the effect chance [14], i.e., the percentage of non-independent variable pairs (SD2). Both datasets contain a set of independent variables and a fixed number of variable pairs X and Y related by associations of the form Y = f(X) + η, where f is a function and η is a noise term. To characterize the performance of MICtools in the presence of associations that cannot be described by a function, a series of Madelon datasets [15,16] was also analyzed. Finally, the proposed pipeline was applied to the analysis of an environmental/metagenomic dataset recently made available within the Tara project, a global-scale characterization of plankton using high-throughput metagenomic sequencing [17].

Synthetic datasets
The SD1 dataset contains 60,000 associations between variable pairs X and Y. The effect chance was set to 1%. The relationships between the 600 non-independent variable pairs were randomly chosen among 6 different types of functional associations, namely cubic, exponential (2^x), line, parabola, sigmoid, and spike (see Table S3 in [1]). The noise term η is a random variable with uniform distribution in the range of f(X), multiplied by an intensity factor k_η. Different values of k_η were chosen randomly among 18,000 values obtained by joining the following three sequences: the first ranging from 0.05 to 1 (with steps of 0.0001), the second from 1 to 2 (with steps of 0.0002), and the third from 2 to 9 (with steps of 0.002). Using these values, the coefficient of determination (R^2) between Y and the noiseless function f(X) ranges approximately from 0 to 1. The remaining 99% (59,400) of associations were defined with X and Y randomly generated from a uniform distribution between 0 and 1. To characterize the effect of the sample size, we created 20 replicates of SD1 for an increasing number of samples (n ∈ {25, 50, 100, 250, 1,000}), for a total of 100 datasets. Considering that the fraction of true positive associations was known, this experimental design allowed us to quantify the statistical power and the FDR performance of the proposed pipeline. The results for 2 × 10^5 permutations are summarized in Figure 2 and in Additional File 1, Table A1. The dependence of the power and of the number of false positives (FP) on the number of samples is shown in Figures 2a and 2b. The power increases with the number of samples, reaching 75% for a sample size of 100. As expected, considering that we used Storey's q value as a strategy to control the FDR, the number of false positives also grows with increasing sample size (Figure 2b) to keep the false discovery rate constant (0.05 in this case).
Figure 2c shows the observed FDR, which is almost equal to the expected value of 0.05 for all sample sizes. In Figure 2d we show the values of MICe as a function of the coefficient of determination (R^2) between Y and the noiseless function f(X) for the associations that pass the significance filter (i.e., associations with q values < 0.05). As expected, MICe and R^2 were always linearly correlated, especially for the larger sample sizes [5] (Figure 2d, upper panel). Moreover, we found that, for small sample sizes, only relationships with relatively high values of R^2 passed the significance filter. This effect decreases with increasing number of samples, showing that the pipeline is able to identify noisier relationships, provided that a sufficient number of experimental points is available. This effect is clearly visible in Figure 2e, where we show the statistical power as a function of the strength of the relationships for different sample sizes. While on less noisy associations (having R^2 close to 1) the pipeline shows high power even for smaller sample sizes, a high number of samples is needed to attain high power on very noisy relationships (having R^2 close to 0). Upon closer inspection, panel d of Figure 2 also shows that the power depends on the form of the association. For instance, red points (corresponding to cubic functional forms) are hardly visible for sample sizes smaller than 100, while sigmoidal, linear, and exponential relationships can be identified for all sample sizes, albeit with a power that depends on the amount of noise. This finding can be easily interpreted considering that more complex relationships (e.g., polynomials of higher order) are defined by a higher number of parameters, which makes them more difficult to distinguish from random associations when the number of points is limited. A clearer representation of this phenomenon is included in Additional File 2 (Figure A1).
Moreover, the downward bias in terms of equitability, especially for the more complex relationships (Figures 2d and A1), is a result of the core approximation algorithm EQUICHARCLUMP, which speeds up the computation of MICe [18,5]. The EQUICHARCLUMP parameter c controls the coarseness of the discretization in the grid search phase; by default it is set to 5, providing good performance in most settings [12].
The median values do not show a strong dependence on the number of permutations. Figure A3b indicates that below 100 samples at least 2 × 10^5 permutations are needed to obtain stable values of power, and that their variability is in any case larger for small sample sets. All this evidence, however, supports the choice of 200,000 as the default value for the number of permutations.
The dataset SD2 was generated to characterize how the effect chance, i.e., the fraction of non-random associations, affects the performance of MICtools. Similarly to SD1, SD2 contains a subset of variable pairs X and Y related by associations of the form Y = f(X) + η, where η was defined as in SD1. The number of samples was fixed to n = 100 and the total number of associations was 60,000. For each effect chance value (1%, 2%, 5%, 10%, 20%, and 50%) we generated 20 independent datasets, for a total of 120. The power, number of false positives (FP), and FDR as a function of the effect chance are shown in Figure 3, panels a, b, and c, respectively (see also Additional File 1, Table A3). In Figure 3a, we can observe that the statistical power grows with the effect chance, while the actual FDR remains constant. In fact, an increase of the effect chance corresponds to a decrease of the fraction of relationships for which the null hypothesis is true, π0 (effect chance = 1 − π0). Consequently, an increase of the p-value threshold, and therefore a growth in power, is expected in order to keep the FDR cutoff constant [7,14].
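The construction of the noisy functional pairs used in SD1 and SD2 can be sketched as follows. The exact placement of the uniform noise (here centered on f(X)) is an assumption, and the function name is illustrative; f is assumed to be vectorized over NumPy arrays.

```python
import numpy as np

def make_association(f, n, k_eta, seed=0):
    """Generate one SD1/SD2-style variable pair Y = f(X) + eta, with X
    uniform in [0, 1] and eta uniform noise scaled by the range of
    f(X) times the intensity factor k_eta (a sketch of the
    construction described in the text; details may differ)."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(0.0, 1.0, n)
    fx = f(x)
    span = fx.max() - fx.min()  # range of the noiseless function
    eta = rng.uniform(-0.5, 0.5, n) * span * k_eta
    return x, fx + eta
```

With k_eta = 0 the pair is noiseless (R^2 = 1), while increasing k_eta drives R^2 towards 0, mirroring the noise-intensity sweep described for SD1.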

The Madelon classification dataset
The analysis of the SD1 and SD2 datasets demonstrates that MICtools is able to identify relationships described by analytic functions with additive noise. However, more general forms of non-random association are possible; consider, for instance, the presence of clusters that might indicate the presence of subpopulations. To test the ability of MICtools to identify this type of association, we created 7 datasets with an increasing number of samples n ∈ {50, 250, 500, 1,000, 2,500, 5,000} with a structure similar to the Madelon binary classification dataset [16,15] (http://archive.ics.uci.edu/ml/machine-learning-databases/madelon/Dataset.pdf), using the datasets.make_classification() function available in the scikit-learn library [19]. Each dataset contains 4 clusters (two for each class), placed on the vertices of a five-dimensional hypercube. Each cluster is composed of normally distributed points (σ = 1). The five dimensions defining the hypercube constitute the 5 "informative" features. Another 15 "redundant" features were generated as random linear combinations of the informative features and added to the dataset. Finally, 180 random variables without predictive power were added, for a total of 200. In this setting, the number of associations to be tested was 19,900 = (200 × 199)/2. Among them, 190 are "real" (the relationships between the variables belonging to the "informative" and "redundant" sets). Figure 4 summarizes the results of the analysis. Panel (a) shows the associations called significant (q-value cutoff set to 0.05) on a Hive plot [20] as a function of the number of samples. Each branch of the Hive plot represents a type of variable (informative: 5 variables; redundant: 15; random: 180); the blue lines identify true positives (associations between non-independent variables correctly identified), while false positives (incorrectly identified associations between independent variables) are marked in red.
Figure 4 also shows the effect of n on the FDR, which is always approximately constant and very close to the theoretical value of 0.05.
On the basis of these results, we conclude that, even with a relatively low number of samples, MICtools is able to efficiently identify the non-functional associations typical of cluster structures. It is interesting to note that the associations among the informative variables started to be recovered only when at least 250 samples were considered, while the associations between informative/redundant and redundant/redundant variables were significant even for the lowest number of samples (50). This apparently odd behavior is due to the different nature of the associations among the variables: binary associations among informative variables are characterized by the presence of clusters, while redundant associations are constructed by linear combinations. In accordance with the results discussed for SD1, the statistical power of the procedure depends on the type of association, and with a lower number of samples the results are biased towards less complex association patterns.
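A Madelon-like dataset with the structure described above can be generated with scikit-learn. The feature counts follow the text (5 informative, 15 redundant, 180 random features; 2 classes with 2 clusters each), while n_samples and random_state are arbitrary choices for illustration and other make_classification defaults (e.g., feature shuffling) are left in place.

```python
from sklearn.datasets import make_classification

# Madelon-like dataset: 4 Gaussian clusters (2 per class) on the
# vertices of a 5-dimensional hypercube, plus redundant and random
# features, for 200 features in total.
X, y = make_classification(
    n_samples=250,           # one of the sample sizes tested in the text
    n_features=200,          # total number of features
    n_informative=5,         # dimensions of the hypercube
    n_redundant=15,          # random linear combinations of the informative ones
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,  # 4 clusters in total
    random_state=0,          # illustrative choice
)
```

The 180 remaining features are filled with random noise by make_classification, matching the "random variables without predictive power" in the text.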

Identifying ecological niches: the Tara dataset
The Tara Oceans project is a large multinational effort for the study of plankton at a global scale [17]. Within the project, a large study of the microbiota of ocean water samples characterized using metagenomic techniques has recently been made available. To illustrate the added value of using MICtools to analyze such large datasets, we downloaded the annotated 16S miTag [21] OTU count table of 139 water samples from http://ocean-microbiome.embl.de/companion.html, together with the accompanying metadata on temperature and chemical composition [22]. MICtools was used to identify significant relationships between the environmental variables and the taxonomic composition of the microbiota. The genus relative abundances, the environmental variables, and the sample metadata are available in Additional File 1, Tables A4, A5, and A6, respectively. Using a q-value cutoff of 0.01, we found significant associations between the relative abundances of 279 taxa and water temperature, and of 287 taxa and oxygen (Figure 5, panels b and c, respectively). To highlight the novel information provided by MICtools, Spearman's rank correlation coefficients and their associated p values were also calculated as in [23] (the default for the cor.test() function in the R environment). Using the Spearman's coefficient alone, we could identify only a subset of the relationships found by MICtools: namely, 194 taxa were associated with temperature and 191 taxa with oxygen concentration. Conversely, almost all relationships identified with Spearman's correlation were also identified by MICtools. While the Spearman-based approach identified associations well described by monotonic functions (Figures 5e and 5f), MICtools was able to highlight the presence of more complex relationships between the taxa and the environmental parameters.
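For reference, the Spearman screening used as a baseline can be reproduced in Python with scipy.stats.spearmanr (the analysis in the text used R's cor.test()). The data below are synthetic and purely illustrative, not the Tara measurements.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical example: rank-correlate a taxon's relative abundance
# with water temperature across 139 samples (the Tara sample count);
# both vectors here are simulated, not real data.
rng = np.random.RandomState(0)
temperature = rng.uniform(0, 30, 139)
abundance = 0.1 * temperature + rng.normal(0, 0.5, 139)

rho, pval = spearmanr(temperature, abundance)
```

Because Spearman's rho captures only monotonic trends, non-monotonic patterns such as the sharp abundance transitions discussed below would yield weak rho values even when MICtools calls the association significant.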
As an example, we found a sharp increase of the Alcaligenaceae genus at an oxygen concentration of 200 µmol kg^-1 (Figure 5d) and a slow increase of the Sphingomonadaceae genus as a function of the temperature. In both cases, the samples can be highlighted on the basis of their specific aquatic layer of reference.

Given two datasets, X (of size M × n) and Y (of size K × n), MICtools can:

• evaluate all the pairwise relationships between the variables of the two datasets (for a total of M × K associations);
• evaluate only the row-wise relationships, i.e., the variable pairs x_i and y_i (for i = 1, ..., min(M, K));
• moreover, for each of the analyses listed above, if the sample classes are provided, perform the analysis within each class independently.
For multiple testing correction, MICtools makes available all the strategies implemented in Statsmodels, as well as a Python implementation of Storey's q-value method [7]. The indicative number of relationships tested per second during the empirical null estimation (using TICe) and the strength estimation (using MICe), for an increasing number of samples, is reported in Additional File 2, Figure A3.
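A minimal sketch of Storey's q-value computation, assuming a single fixed λ for the π0 estimate; MICtools' actual implementation may differ (e.g., in how π0 is estimated), and the function name is illustrative.

```python
import numpy as np

def storey_qvalues(pvals, lam=0.5):
    """Sketch of Storey's q-value procedure: estimate pi0 (the
    fraction of true nulls) from the p values above lam, then convert
    rank-adjusted p values into monotone q values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    # pi0 estimate: p values above lam are assumed to come from nulls
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    # Benjamini-Hochberg-style adjustment scaled by pi0
    q = pi0 * m * p[order] / np.arange(1, m + 1)
    # enforce monotonicity from the largest p value down
    q = np.minimum.accumulate(q[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```

Compared with plain Benjamini-Hochberg, scaling by the estimated π0 < 1 yields smaller q values, which is why power grows with the effect chance while the FDR stays at its nominal level.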

Conclusions
MICtools provides several strategies for multiple testing correction, including Storey's q value for controlling the false discovery rate (FDR). In this paper we have described the performance of MICtools in terms of statistical power and FDR control through an extensive set of tests on synthetic data. Finally, we applied the pipeline to a recent, large, publicly available metagenomic dataset to illustrate, in a real-world case, the novel information provided by MICtools in comparison with conventional approaches.
We are convinced that the proposed strategy fills a gap in the currently available set of tools for the exploration and analysis of high-throughput omics datasets.