TriPOINT: a software tool to prioritize important genes in pathways and their non-coding regulators

Abstract
Summary: Current approaches to pathway analysis focus on mapping gene expression levels onto graph representations of pathways and on testing for pathway enrichment among differentially expressed genes. However, gene expression levels alone do not give the full picture, because non-coding factors play an important role in regulating gene expression. To incorporate these non-coding factors into pathway analyses and to systematically prioritize genes in a pathway, we introduce a new software tool: Triangulation of Perturbation Origins and Identification of Non-Coding Targets (TriPOINT). TriPOINT is a pathway analysis tool, implemented in Java, that identifies the significance of a gene under a condition (e.g. a disease phenotype) by studying graph representations of pathways, analyzing upstream and downstream gene interactions, and integrating non-coding regions that may be regulating gene expression levels.
Availability and implementation: The TriPOINT open-source software is freely available at https://github.uconn.edu/ajt06004/TriPOINT under the GPL v3.0 license.
Supplementary information: Supplementary data are available at Bioinformatics online.

Interactions are annotated as inconsistent (going against activation/inhibition) or supporting (following activation/inhibition), where activation implies a gene is up-regulated, and inhibition implies a gene is down-regulated, as a result of its upstream gene being up-regulated. "Weak" annotations are introduced to separate the complementary cases in which upstream genes are down-regulated and may factor differently in their target's expression.
T_u - Active/upregulated gene threshold: the gene expression value or score at which a gene is considered active/upregulated. It must be set to match the scale of the scores provided. For example, values greater than a small positive number (e.g., 0.05) should be used with log2 fold change data.
T_d - Inactive/downregulated gene threshold: the gene expression value or score at which a gene is considered inactive/downregulated. It must be set to match the scale of the scores provided. For example, values less than a small negative number (e.g., -0.05) should be used with log2 fold change data.
w - Parameter controlling the influence of "weak" interactions (pathway interactions involving down-regulated upstream genes), expressed as a fraction of the influence of strong interactions. For example, a value of 0.5 halves the influence of weak interactions relative to strong interactions.
E_type(g_1, g_2, p_m) - Returns the type of edge between two genes g_1 and g_2 in the pathway p_m: ACTIVATION, INHIBITION, or ASSOCIATION (ignored in score calculation).
d(g_i, g_j, p_m) - The number of edges between g_i and g_j in the pathway p_m.
Us(g_i, p_m) - The set of immediate (edge distance = 1) upstream genes of a gene g_i in pathway p_m.
Ds(g_i, p_m) - The set of downstream genes (of any edge distance) of gene g_i in pathway p_m, derived from the sub-graph in which all genes are consistent (i.e. expression reflects activation/inhibition) with respect to their upstream genes, beginning from the source gene g_i.
r - Controls the rate of exponential decay in the impact score. A value of 0 removes the decay induced by the number of edges between the source gene and the gene contributing to the impact score, while a value of 1 divides each contribution by e^(distance), so the divisor grows exponentially with edge distance.
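The parameters defined above can be illustrated with a minimal Python sketch. It is not TriPOINT's Java implementation: the function names are hypothetical, the strict comparisons against the thresholds are an assumption, and the divisor exp(r * distance) is only one reading of the description of r that reproduces its two stated cases (no decay at r = 0, division by e^(distance) at r = 1).

```python
import math


def classify_gene(score, t_up, t_down):
    """Label a gene from its score using the up/down thresholds.

    Assumption: comparisons are strict; the source does not say whether
    a score exactly equal to a threshold counts as up/down-regulated.
    """
    if score > t_up:
        return "up"
    if score < t_down:
        return "down"
    return "neutral"


def interaction_weight(upstream_is_down, w):
    """Weight of one pathway interaction.

    "Weak" interactions (down-regulated upstream gene) are scaled by w;
    strong interactions keep full weight.
    """
    return w if upstream_is_down else 1.0


def decay_divisor(distance, r):
    """Divisor applied to a downstream contribution at a given edge distance.

    Assumed form exp(r * distance): equals 1 when r = 0 and
    e**distance when r = 1.
    """
    return math.exp(r * distance)
```

With log2 fold change data and thresholds of 0.05 and -0.05, a gene scored 1.2 would be labelled "up", and with w = 0.5 an interaction whose upstream gene is down-regulated would count half as much as a strong one.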
Inconsistency Score:
Consistency Score:
Impact Score:

Triangulation Score (without Non-Coding):
If either the consistency score or the impact score is 0, the triangulation score is 0; otherwise, the following equation is used to calculate the triangulation score:

Triangulation Score (with Non-Coding):
If the consistency score or the impact score is 0, or if the number of non-coding regulators is 0, the triangulation score is 0; otherwise, the following equation is used to calculate the triangulation score:
The minmaxnorm function is the min-max normalization of a score, computed over all scores across all genes/pathways within the same score category. The noncoding function returns the number of non-coding regulators targeting gene g_i.
The triangulation score is composed of two or three main parts, depending on whether non-coding regulators are included. First, the consistency score measures the degree to which the gene follows or goes against the genes regulating it, as suggested by the pathway. Second, the impact score measures how much the gene influences its downstream gene targets. For example, if the gene being measured is upregulated, it will have a greater impact score if its downstream genes follow expression patterns indicative of supporting pathway interactions. Third, when non-coding data are supplied, the number of non-coding regulators targeting the gene is taken into account. Triangulation scores are set to 0 if any of the combined scores is 0, to eliminate scores driven entirely by one metric. Triangulation scores lie in the range -1 to 1, where a score of -1 refers to a gene that is inconsistent with its upstream factors, whereas a score of 1 refers to a gene that is consistent with its upstream factors.
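The min-max normalization and the zero rule described above can be sketched in Python as follows. This illustrates only those two steps, not the triangulation equation itself. The function names are hypothetical, rescaling to the interval [0, 1] is the standard form of min-max normalization and is assumed here, and mapping a constant score list to zeros is a choice made for this sketch to avoid division by zero.

```python
def min_max_norm(scores):
    """Rescale scores to [0, 1] using the minimum and maximum of the list.

    Assumption: the list holds every score of one category across all
    genes/pathways. A constant list maps to all zeros (sketch-only choice).
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def gated_to_zero(consistency, impact, n_noncoding=None):
    """Return True when the triangulation score must be set to 0.

    n_noncoding is only checked when non-coding regulators are part of
    the analysis (pass None for the variant without non-coding data).
    """
    if consistency == 0 or impact == 0:
        return True
    return n_noncoding is not None and n_noncoding == 0
```

For instance, `min_max_norm([2.0, 4.0, 6.0])` yields `[0.0, 0.5, 1.0]`, and a gene with a non-zero consistency and impact score but no non-coding regulators is gated to zero only in the variant that includes non-coding data.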

Non-Coding Regulator P-Value Calculation
TriPOINT provides p-values for the number of non-coding regulators so that its significance can be assessed independently of other metrics (e.g., the triangulation score). This broadens the applicability of TriPOINT to users who are interested strictly in the number of non-coding regulators of genes. Here we assume that the number of non-coding regulators of a gene follows a Poisson distribution with parameter λ. The probability of k non-coding regulators interacting with a gene is calculated as:
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
where λ is the average number of interactions across all genes (i.e., the expected number of interactions for a gene). The p-value reported by TriPOINT is obtained by subtracting from 1 the cumulative probability up to the observed number of interactions k.
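As a worked illustration of this calculation, the following Python sketch estimates λ from per-gene regulator counts and computes the tail probability. The function names are hypothetical, and the cumulative sum is assumed to include the observed count k itself, which is one reading of "up to the observed number of interactions"; TriPOINT's actual implementation may differ on this point.

```python
import math


def estimate_lambda(regulator_counts):
    """Average number of non-coding regulators per gene."""
    return sum(regulator_counts) / len(regulator_counts)


def poisson_pmf(k, lam):
    """Poisson probability of exactly k regulators given rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def noncoding_p_value(k, lam):
    """One minus the cumulative Poisson probability up to k.

    Assumption: the cumulative sum runs from 0 through k inclusive.
    The result is clamped at 0 to absorb floating-point round-off.
    """
    cumulative = sum(poisson_pmf(i, lam) for i in range(k + 1))
    return max(0.0, 1.0 - cumulative)
```

Under these assumptions, a gene with many more regulators than the average λ receives a small p-value, while a gene near or below the average receives a large one.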

FDR Adjusted P-Values
In addition to p-values, TriPOINT provides FDR-adjusted p-values (q-values) obtained from the p.adjust function in R using the Benjamini-Hochberg procedure (the "fdr" method parameter in R). Used together with the reported p-values, the q-values let users select gene/pathway combinations that are significant by p-value while estimating the proportion of those selections expected to be false positives.
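For readers working outside R, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a standalone illustration of the procedure, not the code TriPOINT runs, and the function name is hypothetical. Each p-value is multiplied by n / rank, and a running minimum taken from the largest p-value downward keeps the adjusted values monotone.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(p_values)
    # Indices ordered from the largest p-value to the smallest.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

For the p-values 0.01, 0.04, 0.03 and 0.005, this gives adjusted values of approximately 0.02, 0.04, 0.04 and 0.02.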