A new approach for interpreting Random Forest models and its application to the biology of ageing

Abstract Motivation This work uses the Random Forest (RF) classification algorithm to predict if a gene is over-expressed, under-expressed or has no change in expression with age in the brain. RFs have high predictive power, and RF models can be interpreted using a feature (variable) importance measure. However, current feature importance measures evaluate a feature as a whole (all feature values). We show that, for a popular type of biological data (Gene Ontology-based), usually only one value of a feature is particularly important for classification and the interpretation of the RF model. Hence, we propose a new algorithm for identifying the most important and most informative feature values in an RF model. Results The new feature importance measure identified highly relevant Gene Ontology terms for the aforementioned gene classification task, producing a feature ranking that is much more informative to biologists than an alternative, state-of-the-art feature importance measure. Availability and implementation The dataset and source codes used in this paper are available as ‘Supplementary Material’ and the description of the data can be found at: https://fabiofabris.github.io/bioinfo2018/web/. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
In this work, we focus on predicting genes with altered expression with age in the brain. It has been commonly observed that there is an overall decline in neural function with age (Gustavsson et al., 2011), and there is growing evidence that ageing plays a significant role in the development of degenerative diseases (Mattson and Magnus, 2006). The likelihood of developing neurodegenerative diseases such as Parkinson's and Alzheimer's dramatically increases with age (Mattson and Magnus, 2006). This is clearly important, as neurodegenerative diseases have a high social-economic impact, costing 146 billion Euros in 2004 in 28 surveyed European countries (Gustavsson et al., 2011).
To study ageing processes in the brain holistically, we use a Random Forest (RF) classification algorithm (Breiman, 2001) to induce from data a model to predict if a given gene is over-expressed, under-expressed or have no change in expression with age in the brain. The RF algorithm is very popular in machine learning and bioinformatics (Touw et al., 2013) due to its high predictive accuracy and the use of variable importance measures (VIMs). These measures allow us to identify the most important variables for classification in the model (a set of partly random decision trees) built by the RF algorithm. However, current VIMs have an important limitation: they measure the importance of a variable as a whole, using all values taken by the variable. Sometimes, however, it is only one value of the variable (feature) that is important for classification, which requires a fine-grained measure of feature importance. This is the case in the dataset analysed in this work, which has 7490 features taking either a positive or negative value, representing the presence or absence (respectively) of a Gene Ontology (GO) term annotation for a gene (instance to be classified). As discussed in detail later, in general, the positive value of a GO term feature is much more informative and reliable than the negative value of that feature, since negative values represent lack of evidence and do not suggest any particular property for a gene. Hence, we propose a new method for measuring the importance of positive feature values, rather than the importance of a feature as a whole (both positive and negative values).
As related work, Hsing et al. (2008) use GO terms as features and a tree-based classification ensemble algorithm (boosting trees) to predict whether or not a protein is a hub in a network. In addition, Barardo et al. (2017) use GO terms as features and an RF algorithm to predict whether or not a chemical compound will increase Caenorhabditis elegan's lifespan. These works rank features based on a measure of feature importance, but they ignore the difference between positive and negative feature values, which is precisely the limitation that we are addressing here.
It should be noted that GO terms are a very popular type of feature for classification in bioinformatics; and there are also several other types of binary features whose positive values tend to be much more important than negative values, like pathway annotations (e.g. KEGG pathway features), protein-protein interaction features, etc. (Fabris et al., 2017). Hence, the proposed method for positive feature value evaluation has wider applicability in many other classification datasets in bioinformatics. This paper's main contribution is a new measure of feature value importance for RFs. This measure focuses only on positive feature values (ignoring negative values), and it is computed by a new algorithm that measures the predictive accuracy of a positive feature value by its overall predictive accuracy across all rules (root-to-leaf paths) in the RF where that feature value occurs. As a second contribution, we created a new dataset for studying gene expression with age in the brain and interpreted an RF model built from this dataset, based on the biology of ageing literature.
The remainder of this paper is organized as follows: Section 2 presents background on RFs and feature importance measures. Section 3 describes the creation of the dataset used in our experiments. Section 4 introduces the new measure of feature value importance. Section 5 reports the computational results and a biological interpretation of the most important GO terms based on the proposed measure of feature value importance. Finally, Section 6 presents the conclusions and some future work.

Random Forest
The RF algorithm, which is widely used for classification in bioinformatics, builds nTree (a parameter) Random Trees (RT) during its training phase. This involves randomizing the training set in two ways for each RT: first, the training set is re-sampled with replacement, maintaining the original size of the dataset. The new re-sampled training set contains, on average, about 66% of unique instances (genes) from the original dataset. The set of training instances for a given RT is the 'In-Bag' instance set for that RT. The other 33% of the original dataset, which is not used for training, is the Out-Of-Bag (OOB) instance set for that RT.
As a second source of randomness for building an RT, the search for the best feature to split the set of instances at each RT node considers a randomly chosen feature subset of size mtry (a parameter), typically much smaller than the original feature set's size. The instances at the current node are then split into two subsets according to a condition based on the values of the selected feature, creating two child nodes. This split aims to increase the similarity of classes within each instance subset and to decrease class similarity across the subsets. Next, the algorithm recurses in each instance subset until a stopping criterion is met.
In the prediction phase, a testing instance t is presented to each RT. For every RT, the feature values of t are matched against the feature-value conditions in the branches of the RT from the root node downwards, until t is assigned to a leaf node which predicts, for t, the most frequent class in that leaf node. The predictions of all RTs are combined (by voting) to output the RF's final prediction.
RFs are difficult to interpret: they comprise many RTs making, to some extent, conflicting predictions; due to their randomized nature. However, feature importance measures can be used to find the most important features for classification in RF models, as discussed next.
In essence, GVIM calculates each feature's importance by averaging the OOB Gini impurity decrease when using the feature in a split of an RT node. PVIM calculates the average predictive accuracy difference, across all RTs, of the RF before and after permutating a given feature with a randomly selected one. CPVIM works similarly to PVIM, but considers conditional relationships among variables. VarSelRF iteratively removes features from the RF until its predictive accuracy is significantly reduced. Next, it returns the smallest set of features with predictive accuracy statistically equivalent to the best RF. Finally, varSelMD calculates the average depth of the features in the RF, assigning greater importance to features that are closer to the root node of an RT.
In a very recent work (Epifanio, 2017), the Intervention in Prediction Measure (IPM) was proposed and compared against the five above feature importance measures. That work concluded that IPM was superior to identify the most important features. Thus, we use the state-of-the-art IPM measure as a strong baseline measure in our experiments.
The IPM first computes, for each RT and each Out-Of-Bag (OOB) instance, a vector of size J (the number of features) containing in each j-th position the number of times the j-th feature was used to classify the instance. Next, this vector is normalized by dividing the frequency of use in each position by the summation of the frequencies over all J positions. This normalized vector (V n ) contains the relative importance of each feature, i.e. its relative frequency of use to classify the instance. The vector V n is averaged across all OOB instances of interest and across all RTs to return the final IPM value for each feature.
Under the assumption that instances with different classes are classified using different features, these differences are reflected in the features' IPM scores. Features that are important to classify instances of some class but are not so important to classify instances of other classes are of particular interest, since they are good predictors of a given class.
Note also that all the above importance measures evaluate a feature as a whole (i.e. all values of the feature), which is an important limitation in datasets where just one value of a feature has a good predictive power. Actually, in our dataset, one of the two values of each binary feature is much more interesting, as discussed in Section 4.2.

Collection of data about genes and classes
Age-related brain gene expression was collected from GEO and AgeMap. First, in AgeMap, all brain gene expression data was obtained by combining cerebellum, cerebrum, hippocampus and striatum expression datasets into one dataset (Zahn et al., 2007). This gene expression data is already normalized with background subtracted. In total from this resource, gene expression data for 118 brain samples and 6712 mouse genes were extracted. Second, gene expression datasets and series datasets reporting expression levels in different ages or development stages in mammals' brains were identified by searching GEO (Barrett et al., 2006). Unsuitable datasets were removed. For example, custom datasets that examined a single pathway, specific diseases, mutants and treatments were excluded. Within the remaining 28 datasets, only age-related data from healthy, adult and non-treated samples were analysed. For example, in disease studies, we only took the controls at different age groups, and not the diseased state. Since ageing gene expression profiles can be detected early in adult life, all datasets with more than two adult time points were included, even if the oldest animals were middleaged. In summary, 28 ageing-related GEO datasets and series comprising 1212 samples and a differing number of genes per dataset were obtained.
For both the GEO and AgeMap datasets, genes with more than 30% missing gene expression data across all samples were removed. Otherwise, null values were replaced by the probe's average and probes targeting the same gene were averaged. Although we cannot perform a comprehensive evaluation of the quality of each experiment, our aggregation procedure is, in itself, a technique to cope with poor quality data. To identify genes that were consistently over-and under-expressed with age across all 31 datasets (3 from AgeMap and 28 from GEO), we found the genes with the largest number of putatively age-related signals in our multiple datasets, following the method described in (De Magalhães et al., 2009). Human homologs for all mouse and rat protein-coding genes were downloaded from NCBI BioMart v87. High confidence one-to-one orthologs were extracted, and for each gene outputted from regression analysis, orthologs were identified. Finally, the proportion of human protein-coding genes within each class is 2.4%, 0.8% and 96.8% for the classes 'over-expressed', 'under-expressed' and 'no change of expression' with age in the brain, respectively.

GO term-based features
The instances (genes) are described by features representing the presence or absence of a GO term. We use GO term features because they are very well-known and easy to interpret--they use a controlled vocabulary, curated by experts, so the terms have welldefined biological meanings.
To retrieve the list of GO terms associated with our instances (genes), we have used the GO annotations from the XML file exported by the NCBI web page http://www.ncbi.nlm.nih.gov/gene (downloaded on the 18th of April 2017). This XML file was generated by the query: The Gene Ontology definition (retrieved on the 14th of March 2017) was downloaded using the link http://geneontology.org/page/ download-ontology#go-basic.obo.
Since a GO term implies all its ancestors (defining an 'is-a' hierarchy), we have expanded the set of GO terms annotating each instance (gene) to contain all ancestors of those GO terms. Also, we have eliminated GO terms annotating less than 10 instances, to avoid terms with little statistical support. This resulted in a dataset with 17 716 genes (instances) and 7490 GO terms (features). We also added to the dataset a numerical feature whose value is the total number of GO terms annotated for a gene.

Experimental methodology
To measure predictive accuracy we use the popular Area Under the Receiver Operating Characteristic curve (AUROC), which is a plot of a classifier's (here an RF model's) True Positive Rate (TPR) as a function of its False Positive Rate (FPR). These rates are computed for each class by thresholding the class probabilities output by the RF using thresholds in the range [0, 1]. Each threshold produces a TPR and an FPR value, i.e. a point in the ROC curve. To obtain a single accuracy measure from the curve, we calculate the area under the ROC curve (AUROC) (Boyd et al., 2013). The AUROC is calculated considering each class in turn as the 'positive class', and the final AUROC is the weighted average over the three classes, weighted by the number of instances in each class. AUROC values of 1.0, 0.5 and 0 indicate, respectively, a perfect classifier (all instances correctly predicted), a classifier with random guessing performance and the worst possible classifier (all instances wrongly predicted).
The AUROC is computed using the well-known 10-fold crossvalidation procedure (Japkowicz and Shah, 2011). This method first divides the dataset into 10-folds of similar sizes. Next, each fold is temporarily removed from the dataset, one at a time, then the other 9-folds are used for training and the held-out fold used as a testing set for measuring predictive accuracy. The AUROC is the mean accuracy over the 10 testing sets. The AUROC value reported later is the mean over 30 runs of 10-fold cross-validation, to get more stable results, considering the randomized nature of RFs.
In each fold of the (external) cross-validation procedure we have used an internal 5-fold cross-validation procedure (on the training set only) to optimize the two most important parameters of the RF algorithm: mtry (the number of randomly sampled candidate features for selecting a split feature in an RT node) and nTree (the number of RTs in the RF). We have tested all pairwise combinations of the mtry values in the set f ffiffi where J is the number of features in the dataset, and nTree values in the set {100, 200, 300}; and used the pair with highest predictive accuracy on the internal cross-validation as the parameter-value pair for that external crossvalidation fold. Our dataset is highly unbalanced towards the class 'no change in expression (N)' with age, which has many more instances than the classes 'over-expressed (O)' and 'under-expressed (U)' with age. This leads to RFs models that are overly-conservative when predicting the minority classes 'O' and 'U'. To attenuate this, we have performed an under-sampling of instances with classes 'N' and 'O' in the in-bag set (used to train the RF). That is, instances of classes 'N' and 'O' were randomly deleted until the three classes have the same number of instances--i.e. the number of instances in class 'U'. We have performed experiments with and without this under-sampling, as reported later.
In order to measure the importance of features in the RF model, we have used the whole dataset to induce a final RF model using the pair of mtry and ntree values most frequently selected across the 10 external cross-validation folds; which was the pair: nTree ¼ 300, mtry ¼ 43 ( ffiffiffiffiffiffiffiffiffiffiffi 7490 p Ã 0:5). This final model was induced using the above-described under-sampling, since these produced overall better results.

Measuring the importance of positive feature values
To measure the importance of each positive feature value in a RF model, we propose the 'Computing the Predictive Accuracy of Random Tree Rules with Positive (6) Feature Values' (COMPACT þ FV) Algorithm, described below. The main motivation for this algorithm is, when calculating the importance of a feature f, to consider only the IF-THEN rules in the RF that contain 'positive values' of the feature f, as defined below. Note that every root-to-leaf path in an RF forms an IF-THEN rule, where the set of conditions along the path is the IF part and the class predicted by the leaf is the THEN part of the rule.
For the binary features used in this work we define a positive feature value as the value representing the presence (rather than absence) of the biological property linked with the feature. We use Gene Ontology (GO) terms as binary features, so a positive (negative) feature value indicates that an instance (gene) is (is not) annotated with a given GO term.
Considering only positive feature values has two major motivations: (1) Positive feature values tend to have a much higher level of confidence than negative ones. This is because a positive feature value indicates that 'there is evidence' for a given GO term; whilst a negative feature value indicates a 'lack of evidence' for that GO term--note that lack of evidence is different from evidence of absence. (2) Positive feature values are much more informative than negative ones. This is because a positive value tells us an instance (gene) has a certain biological property (GO term); whilst a negative value does not tell us any property possessed by a gene.
Recall that, for each RT in an RF model, each non-leaf node represents a test based on the value of a feature, leading to two child nodes--each of them associated with a condition that an instance must satisfy to reach that node. These two children correspond to the 'positive' and 'negative' values of the feature in the parent node.
COMPACT þ FV, presented in Algorithm 1, iterates over every RT in the RF and for every feature, it extracts every IF-THEN rule (if any) containing the positive value of that feature and uses that rule's statistics to measure the feature's importance. The accuracy statistics of a rule in an RT are calculated using its Out-Of-Bag (OOB) instances, i.e. the instances that were not used to train that RT. Algorithm 1 extracts from the RF two statistics for each feature f and class c: (a) Cov f þc , the OOB coverage, i.e. the total number of OOB instances covered by rules containing the positive value of feature f that predict class c; and (b) Hits f þc , the OOB hits, i.e. the total number of OOB instances correctly classified by rules containing the positive value of feature f that predict class c. Note that our importance measure (and also the IPM importance measure) cannot be calculated by analysing only the structure of RF: they also depend on the OOB instances of each RT.
After Algorithm 1 finishes executing, all importance scores for every feature f and class c (Prec f þc ) in an RF are computed. Recall that Prec f þc is the precision of all rules containing the positive value of feature f that predict class c.

An example of the use of the COMPACT 1 FV algorithm
Next, we show the calculation of Prec GO:0006887þN , i.e. the importance of a positive feature value representing the presence of the GO term 'GO:0006887' (exocytosis) to predict class 'no change of expression' with age in the brain (N). Let us assume that there are three rules in the RF that predict class 'N' using the positive value for feature 'GO:0006887'. Next we present these rules using the following format: each rule contains conditions (RT nodes) involving a feature (where '1' and '0' denote the presence and absence of a GO term annotation, respectively) and in parenthesis the distribution of class frequencies of the OOB (Out Of Bag) instances that satisfied all conditions of the rule. After the OOB class distribution, we present the class predicted by the rule (the most frequent class in the in-bag instances used to build the RT). For pedagogical purposes, Figure 1 shows a fictitious RT that contains the same rules. for each tree t in the Forest do 5: Get all root-to-leaf rules in t with the positive value of f.

6:
for every such rule, r do 7: Get the class that r predicts (class c), the number of OOB instances that r covers (cov) and the number of correctly classified OOB instances (hits). 8: Update the values of the Cov and Hits counters: 9: Hits f þc Hits f þc þ hits 10: Cov f þc Cov f þc þ cov 11: end for 12: end for 13: for every class c do 14: Compute the precision of the positive value of f:

15:
Prec f þc Hits f þc =Cov f þc 16: end for 17: end for 18: end procedure  Table 1 shows the mean AUROC of the RF models per class and for all classes as a whole, with and without the in-bag under-sampling, across 30 runs of the 10-fold cross-validation procedure, as described earlier.

Predictive accuracy results
As shown in Table 1, the RF using under-sampling has better predictive accuracy than the RF without under-sampling. In addition, inducing the RF model with under-sampling takes on average 3.8 h for each cross-validation run, which is much faster than the average 70.4 h to induce the RF model without under-sampling. This is due to the reduced in-bag set size when using under-sampling. Table 2 shows the most important GO terms based on the ranking by the proposed rule-based Precision measure. Table 3 shows the most important GO terms based on the ranking by the Intervention in Prediction measure (Section 2.2), a state-of-the-art measure of feature importance.

Feature importance results
Contrasting the two tables, it is clear that the rule-based Precision and the Intervention score lead to very different sets of top-ranked GO terms. Unfortunately, the intervention-based ranking is not useful for identifying GO terms that are strong predictors of a single class, since the top-ranked GO terms based on that score are very similar for all three classes. This is despite the fact that this score was computed for instances of each class separately. This result is due to the fact that the Intervention score reflects the use of both positive and negative feature values. Actually, for most features in our dataset, the large majority of instances have a negative feature value. Hence, the negative value of a feature tends to contribute more to its Intervention score than its positive value. Since negative feature values are much less informative than positive ones (as discussed earlier), this has the undesirable effect of preventing the identification of positive feature values which are relatively rare but provide much more informative predictions for a given class.
In contrast, the rule-based Precision focuses on rules containing only positive feature values, without being distracted by negative values. As a result, this measure successfully identifies different sets of top-ranked GO terms for predicting different classes. In addition, in general, the GO terms in Table 2 (Precision-based ranking) describe more specific and more informative gene properties than the more generic (often very broad) GO terms in Table 3 (interventionbased ranking).
These results reflect the different biases of the two measures. The Intervention measure rewards mainly the high frequency of use of a feature in an RF, without explicitly rewarding predictive accuracy. This measure implicitly rewards accuracy, since highly accurate features tend to be used to classify more instances. However, since the negative value of a feature is used to classify many more instances than its positive value, the measure is biased towards rewarding features with accurate negative values, rather than accurate positive values. In contrast, the rule-based Precision measure rewards mainly the predictive accuracy of a positive feature value in an RF's rules. The trade-off is that positive feature values have a relatively small frequency of use (see the Rule Hits column in Table 2); but this is overall a good trade-off, since the negative feature values are not very informative, as discussed earlier.
Hence, in the remainder of this section, we focus on the topranked GO terms identified by the rule-based Precision measure (Table 2). This table contains 18 top-ranked GO terms predicting the 'over-expressed' (O) and 'no change with age' (N) classes. There are 26 GO terms whose positive value has the maximum rule-based Precision of 1.0 when predicting the class 'N', we only show the top-18 in the table (sorted by the second criterion, the rule-based coverage). Most of these GO term annotations also have large numbers of rule-based Hits in the Out-of-Bag instances, as shown in the table, since this class has a prior probability (relative frequency) of 96.8%. The top-18 GO terms predicting class 'O' in the table have overall much lower rule-based Precision and Hits in the Out-of-Bag instances since this class has much fewer instances. However, these GO terms still have a rule-based Precision substantially higher than the prior probability of the class 'O', which is just 2.4%. Table 1. Random Forest predictive accuracy results (AUROC) with and without under-sampling for the classes 'Over-expressed (O)', 'Under-expressed (U)' and with 'No change in expression (N)' with age in the brain, and the mean AUROC across classes (All) weighted by their number of instances The top-ranked GO terms predicting the 'under-expressed' class are not shown in this table because they have low rule-based Precision and Hits (this class' prior probability is just 0.8%), so they are not reliable enough for further analysis. As shown in Table 2, positive feature values of GO terms used to predict over-expression included immune response pathways, responses to heavy metal toxicity and endoplasmic reticulum membrane genes.
Over-expression of the immune response (including GO: 2001198, rank 1; GO:0042605, rank 2 and GO:0042611, rank 3) is a commonly seen signature of the ageing transcriptome. Metaanalysis of ageing expression studies shows over-expression of immune response genes to be a consistent signature of ageing (De Magalhães et al., 2009). This includes the over-expression of inflammation genes, representative of an 'inflamm-ageing' phenotype associated with numerous ageing related diseases such as Alzheimer's disease and cancer (Xia et al., 2016). S100 proteins (GO:0044548, rank 13) are also linked to inflammation response, with constitutive expression in neutrophils and interleukin-induced expression in other cells. These proteins have been associated with inflammationrelated diseases and cancer, and possibly have a function in extracellular oxidant scavenging (Goyette and Geczy, 2011). Oxidative damage in the brain increases with age, including lipid peroxidation and protein carbonylation (Head et al., 2002).
GO terms related to cadmium (GO:0071276, rank 7) and zinc ion (GO:0071294, rank 5) response predicting over-expression may be linked, since the toxicity of both metals is oxidative stress based, the former by depletion of thiol-based antioxidants (Cuypers et al., 2010), while the latter causes copper deficiency, reducing the cells' ability to produce copper based antioxidants such as superoxide dismutase (Paynter et al., 1979).
Oxidised proteins may act as an intermediate to protein aggregate clusters, causing a breakdown of normal cellular function (Squier, 2001). The unfolded protein response (UPR), mediated by the endoplasmic reticulum (ER), produces chaperones and upregulates the Table 2. Top-ranked GO terms (ranked by rule-based Precision) used to classify genes as 'over-expressed' and with 'no change in expression' with age in the brain Note: The columns contain: (1) the feature rank, (2) the feature identifier, (3) the feature name, (4) the mean rule-based Precision and (5) the mean rule-based Hits. Rule-based scores are based on the RF's predictions on the Out-of-Bag datasets--not used for building the models. See the main text for definitions of Precision and Hits.
inflammation response to deal with protein aggregation and misfolding (Cao and Kaufman, 2012). This response is driven by transmembrane proteins in the ER and Golgi apparatus, facilitating communication between these organelles and the nucleus, potentially explaining the use of related terms (GO:0071556, rank 6; GO:0012507, rank 17 and GO:0030176, rank 18) to predict overexpression.
Positive feature values of GO terms used to predict unchanged expression included receptor activity (including olfaction), RNA processing and structural genes, however, this is also the largest class and so there were many other categories with high precision. These categories are all very large, including genes involved in a wide variety of functions.
G-protein coupled receptor activity (GO:0004930, rank 1) is closely related to olfaction. Olfactory receptors are a subset of G-protein coupled receptor and several olfaction-related terms co-occur with GO:0004930, for instance 'sensory perception of smell' (GO:0007608, rank 14) 'olfactory receptor activity' (GO:0004984, rank 7) and 'detection of chemical stimulus involved in the sensory perception of smell' (GO:0050911, rank 8) (Binns et al., 2009). Olfactory neurogenesis is reduced in aged mice, as is the ability to distinguish different odours, however, olfactory interneurons are increased (Enwere et al., 2004). Further, ageing-related diseases such as AD are frequently associated with declined olfactory function (Attems et al., 2005). In humans, the olfactory bulb appears to be the main benefactor of neuronal progenitor cells migrating from the lateral ventricle, suggesting it is more capable of neuroregeneration than other areas of the brain (Armstrong and Barker, 2001). Olfactory genes do not just relate to the sense of smell, but also to numerous other chemoreceptor mediated functions. For instance, OR51E2 mediates cytoskeletal remodelling and proliferation in airway smooth muscle cells, in response to shortchain fatty acids (Aisenberg et al., 2016), while OR10J5 mediates angiogenesis and stimulates cellular migration (Kim et al., 2015). RNA processing and its child terms (including GO:0006396, rank 2; GO:0034470, rank 10 and GO:0006397, rank 11) is a huge category containing over 4000 annotations in humans. While there is no evidence that the category changes in expression with age, there is a sex difference in humans with the ageing male brain underexpressing RNA processing GO groups relative to females (Berchtold et al., 2008). Likewise, the various structural GO groups highlighted are large and integral to basic cellular function. Intermediate filaments (GO:0005882, rank 9) play an important structural role in the brain, supporting axons and allowing an increase in axonal diameter (Fuchs and Cleveland, 1998). In addition, intermediate filaments including keratin filaments (GO:0045095, rank 17) have been implicated in numerous diseases, including cancer, and have possible roles in stress resistance and ageing (Hyder et al., 2011).

Conclusion and future work
Existing measures of feature importance for RFs do not differentiate between positive (the presence of a property) and negative feature values (the lack of evidence for a property). This is an important limitation, as for many feature types used in bioinformatics, like the very popular Gene Ontology (GO) terms-based features used in this work, positive feature values are much more informative than negative values. This is because the presence of a property (like a GO term annotation) gives much more useful information about a gene than the absence of a property. In addition, negative feature values are less reliable because they encode absence of evidence, rather than evidence for the property's absence.
For this reason, we have proposed a new feature importance measure that evaluates the precision (predictive accuracy) of only Table 3. Top-ranked GO terms [ranked by the Intervention in Prediction score (Epifanio, 2017)] used to classify genes as 'overexpressed ', 'under-expressed' and with 'no change in expression' with age in the brain the positive feature values in an RF, without being unduly influenced by the negative feature values. This measure works by finding rules (root-to-leaf paths) in the RF that use the positive feature value to predict a class of interest and then measuring the combined precision of these rules.
We have compared the results of using our feature importance measure against a state-of-the-art feature importance measure (the Intervention in Prediction measure), on a dataset created to predict whether or not a gene is 'over-expressed', 'under-expressed' or has 'no change in expression' with age in the human brain, using Gene Ontology (GO) terms as features. We have contrasted the topranked GO terms based on the rankings produced by our rule-based Precision measure and the Intervention in Prediction measure, and have concluded that the most important GO terms based on the Precision measure are more useful (more informative) to study our ageing-related problem. As evidence for this, we presented an interpretation of the biological meaning of the top-ranked GO terms, according to the proposed rule-based Precision measure.
As future work, we plan to apply our feature importance measure to other human tissues, and use other feature types besides GO terms.