- Split View
-
Views
-
Cite
Cite
Jiajie Peng, Junya Lu, Donghee Hoh, Ayesha S Dina, Xuequn Shang, David M Kramer, Jin Chen, Identifying emerging phenomenon in long temporal phenotyping experiments, Bioinformatics, Volume 36, Issue 2, January 2020, Pages 568–577, https://doi.org/10.1093/bioinformatics/btz559
- Share Icon Share
Abstract
The rapid improvement of phenotyping capability, accuracy and throughput have greatly increased the volume and diversity of phenomics data. A remaining challenge is an efficient way to identify phenotypic patterns to improve our understanding of the quantitative variation of complex phenotypes, and to attribute gene functions. To address this challenge, we developed a new algorithm to identify emerging phenomena from large-scale temporal plant phenotyping experiments. An emerging phenomenon is defined as a group of genotypes who exhibit a coherent phenotype pattern during a relatively short time. Emerging phenomena are highly transient and diverse, and are dependent in complex ways on both environmental conditions and development. Identifying emerging phenomena may help biologists to examine potential relationships among phenotypes and genotypes in a genetically diverse population and to associate such relationships with the change of environments or development.
We present an emerging phenomenon identification tool called Temporal Emerging Phenomenon Finder (TEP-Finder). Using large-scale longitudinal phenomics data as input, TEP-Finder first encodes the complicated phenotypic patterns into a dynamic phenotype network. Then, emerging phenomena in different temporal scales are identified from dynamic phenotype network using a maximal clique based approach. Meanwhile, a directed acyclic network of emerging phenomena is composed to model the relationships among the emerging phenomena. The experiment that compares TEP-Finder with two state-of-art algorithms shows that the emerging phenomena identified by TEP-Finder are more functionally specific, robust and biologically significant.
The source code, manual and sample data of TEP-Finder are all available at: http://phenomics.uky.edu/TEP-Finder/.
Supplementary data are available at Bioinformatics online.
1 Introduction
Biomedical studies have been ushered into a new era by the rapid development of large-scale genotyping and phenotyping technologies (Alemany et al., 2018; Cruz et al., 2016; Flood et al., 2016; Gudbjartsson et al., 2015; Kuhlgert et al., 2016). Recent studies demonstrate that by integrating both phenomics and genomics, we can better understand organism behaviors and identify new genes that govern phenotypes and response to the varying environments (Cobb et al., 2013; Emanuel et al., 2017; Sudlow et al., 2015). More specifically, by analyzing large-scale plant photosynthetic phenotype data, researchers can identify complex aggregate phenotypic traits, and explore the processes or genetic components that control a trait and the essential conditions under which the trait emerge (Heid and Winkler, 2017; Visscher et al., 2017).
The main computational challenge in omics data analysis arises from its unsupervised nature. It is generally believed that the emerging phenomena among multiple phenotypes measured across several genotypes (e.g. gene knockouts) reveals, to a great extent, the common regulatory roles of the knocked out genes in the biological system. An emerging phenomenon refers to a phenotypic pattern that multiple genotypes have correlated phenotype values during a serial of continuous time points (Cruz et al., 2016). Given temporal phenotype data where rows are genotypes and columns are time points, an emerging phenomenon is defined as a group of genotypes (rows) that have a similar phenotypic pattern during a relatively long and continuous temporal range (columns). See Section 3 for the mathematical definition of emerging phenomenon. A sample emerging phenomenon in plant photosynthesis phenotype data is shown in Figure 1. The experiment was done under the fluctuating light conditions (between 0 and 1000 μmolm−2s−1). Five selected genotypes (P1…P5) were measured using three photosynthetic phenotypes, namely photosynthetic system II activity , photoprotection (qESV) and photoinhibition (qI). The relative phenotype values were calculated by comparing each genotype with the reference (col-0) using logged fold change. The shadowed area indicates an emerging phenomenon of the five plants between 12:30 and 14:30, during which, all the five genotypes have a similar phenotypic pattern.
Emerging phenomena is universal in phenotyping experiments esp. under dynamic environmental conditions. They are highly transient and diverse, dependent in complex ways on both environmental conditions and development (Cruz et al., 2016). Revealing emerging phenomena is vital toward the identification of meaningful differences in biological function among genotypes, which may help biologists to examine potential phenome–genome relationships in a genetically diverse population and to associate such relationships with the change of environments or development. It is, however, unclear what specific patterns biomedical researchers should look for given the complexity of the biological system and its responses to environmental perturbations. Besides, the large variance in phenotypes, due to the biodiversity and the variance in environmental distribution, adds more challenges to the already difficult task (Cruz et al., 2016; Flood et al., 2016; Yang et al., 2017).
To address this challenge, we propose a new tool called Temporal Emerging Phenomenon Finder (TEP-Finder) as the first approach to capture emerging phenomena with various temporal scales and arbitrary phenotype variation shapes (see Fig. 2). TEP-Finder automatically transforms large-scale phenomics data into emerging phenomenon patterns, thus facilitates the translation of information into knowledge. TEP-Finder has two phases. First, TEP-Finder encodes phenotype-based relationships into a dynamic network using non-parametric clustering and generates seeds. It then identifies all the emerging phenomena in different temporal scales and constructs a directed acyclic network of emerging phenomena. To demonstrate the effectiveness of TEP-Finder, we applied TEP-Finder on a large-scale plant photosynthesis phenotyping experiment, and the results show that TEP-Finder can reliably and accurately identify high-quality emerging phenomena from data. Comparing with the existing models, TEP-Finder has the following advantages:
TEP-Finder is the first approach to capture emerging phenomena with diverse scales systematically;
TEP-Finder constructs a network of emerging phenomena to provides a graph-based representation of the complex hierarchy of emerging phenomena;
TEP-Finder successfully discovers emerging phenomena in an Arabidopsis photosynthetic phenotyping experimental data with high biological significance.
2 Background
An emerging phenomenon is defined as a group of genotypes who has a pattern of correlated phenotypes in a serial of continuous time points (Cruz et al., 2016). In the literature, given a set of predefined patterns, the minimal genotype contributor set can be identified using existing data mining techniques such as association rule mining (Hipp et al., 2000; Weiß, 2014) or subspace trajectory clustering (Agrawal et al., 1998; Soltanolkotabi et al., 2014). However, given the unsupervised nature, most emerging phenomena are not pre-definable. To our knowledge, there is no existing algorithm exactly designed for emerging phenomena identification. Tools, such as DHAC and NPM (Gao et al., 2015; Park and Bader, 2012), may be slightly modified to achieve the goal. Here we discuss two existing approaches with additional steps adopted for emerging phenomenon discovery.
DHAC models show how a network change with time (Park and Bader, 2012). Assuming that the edges in a network are conditionally independent given group membership, DHAC uses a probabilistic model to translate a hierarchical stochastic block to the dynamic domain, thus clustering a time-evolving network based on the observations at several specific time points. The rationale is that any node in a network cluster at a specific time point should be influenced by clusters at nearby time points. DHAC can be employed to group genotypes by matching clusters across multiple time points with additional steps that transform longitudinal phenomics data into a dynamic network (called DHAC+). However, to facilitate dynamic network clustering, DHAC considers global features on all temporal points rather than local features. Subsequently, the DHAC-based method cannot identify emerging phenomenon at different temporal scales.
NPM is a non-parametric clustering method that can simultaneously cluster subjects with arbitrary cluster shapes (Gao et al., 2015). NPM represents the phenotypes of each genotype in a serial of continuous time points as a cloud of points. Each point of the cloud corresponds to a vector in the sequential phenotype measurements taken for the genotype. Two similar shapes of clouds represent that two genotypes have a coherent phenotype pattern in a given time frame. Note that NPM is more advantageous than the Pearson correlation on the identification of a set of genotypes with coherent phenomics data. It is because Pearson correlation requires all the variables to follow a normal distribution, which is not always held for the phenomics data, while NPM does not make any assumption about the underlying data distribution and thus is particularly suitable for phenomics data analysis. NPM can be employed to identify emerging phenomena by applying it repeatedly on every time frame of a longitudinal phenomics dataset (called NPM+). However, it is difficult to pre-define the time scale of emerging phenomena or to identify the relationships between overlapped emerging phenomena. Furthermore, NPM is not a deterministic method so that the results are dependent on the initialization and the selection of anchor points.
The unmet needs to effectively identify high-quality emerging phenomena necessitates the development of tools that can automatically transform large-scale phenomics data into emerging phenomenon patterns, thus facilitate the translation of information into knowledge. To precisely identify emerging phenomena with different temporal scales, we propose TEP-Finder. In our experiment, TEP-Finder has been compared with NPM+ and DHAC+. The results demonstrate that TEP-Finder is better for capturing emerging phenomena and relationships among them.
3 Definition of emerging phenomenon
In a long temporal phenotype dataset , Ti is a time frame associated with experimental environments, treatments and outcomes , and Pj is a genotype to study , e.g. a gene knockout or a inbred line. The phenotype values of genotype Pj in time frame Ti are represented by a set of data points . In the plant photosynthetic phenotyping experiment using DEPI (Cruz et al., 2016), the phenotypes are mainly photosynthetic system II activity , photoprotection (qESV) and photoinhibition (qI). An emerging phenomenon ei is defined as follows (see example in Fig. 1).
Emerging Phenomenon. Given the temporal phenotype data , an emerging phenomenon is a group of genotypes that exhibit coherent phenomena during continuous temporal range , where , and the percentage of significant phenotype values in e is greater than or equal to K3. K1, K2 and K3 are user specified thresholds.
Note that in an emerging phenomenon , certain percentage of phenotype values of should be significantly different from the reference. The definition does not require all the phenotype values of to be significant because, when the environmental conditions vary dynamically, phenotype values often periodically switch between significance and insignificance (see Fig. 1). Hence, it is more reasonable to require a certain portion but not all of the phenotype values to be significantly different from that of the reference. Since a priori does not apply, new algorithms are needed for the identification of emerging phenomenon.
For a large-scale phenotyping experiment, the total number of identified emerging phenomena could be large. To better manage and use them, we construct an emerging phenomenon directed acyclic network (EP-DAG) defined as:
Emerging Phenomenon DAG. An EP-DAG is a DAG, where each node in represents an emerging phenomenon , and node is a descendent of node if and only if and .
The outputted EP-DAG is available in the OBO format. It, once generated from data, can be visualized with Cytoscape (Smoot et al., 2011) or OntoVisT (Srivastava and Sahni, 2011). It automatically supports emerging phenomenon search, phenotype relationship identification and multiple phenotyping experiments comparison, leading to improved computational efficiency and succinct representation. To our knowledge, there is no existing work focused on the construction of EP-DAGs.
4 Materials and Methods
To systematically identify emerging phenomena in long-term phenotyping experiments and to examine the interactions between emerging phenomena and dynamic environments in a genetically diverse population, we introduce a new algorithm called TEP-Finder. TEP-Finder has two phases. First, it identifies seed phenomena in every time frame of a longitudinal phenomics dataset, where a time frame is a predefined minimal temporal range of any emerging phenomena. Second, by expanding each identified seed phenomena to longer time frames, TEP-Finder discovers emerging phenomena that appear and disappear subject to the change of environments or time. Multiple emerging phenomena are then merged, pruned and connected to form a phenomenon network to facilitate phenotype search, comparison and functional analysis. The diagram of the whole process is shown in Fig. 2.
4.1 TEP-Finder Phase 1. Identifying seed phenomenon
An emerging phenomenon e is considered as the phenotypes of multiple genotypes that have similar variation trends in a continuous time period. Biologically, such time period can be transient or can last for a relatively long time. To identify e with varying length, we consider a seed-based approach. Namely, we segment the whole experiment duration into multiple time frames, each being the minimal temporal range of any emerging phenomena. Then, at every time frame, we seek seed phenomena that are potentially extendable to a longer time period. The seed identification phase can be divided into four steps.
4.1.1 Data segmentation and data representation
Given the temporal phenotype data , we segment into separated time frames with a fixed length m using the sliding window approach. Here, the window width is the smallest temporal period of any emerging phenomenon (e.g. 30 min) that users can specify.
We adopt a meta-clustering approach to identify the relationships among all the tested genotypes in each time frame Ti (Caruana et al., 2006; Lingras et al., 2016). In the meta-clustering process, we repeatedly cluster the phenotype values of all the genotypes in Ti using non-parametric clustering with random anchor points (Gao et al., 2015). The center of non-parametric clustering is a cloud-of-points representation. Since all the phenotype values are collected in a relatively short time, we examine the dependence among different phenotypes while ignoring the sequential order among the values and simply characterizing the phenotypes of genotype Pj in time frame Ti by the set of data points , which we refer to as cloud-of-points representation.
4.1.2 Phenotype clustering
This optimization problem can be effectively solved by employing NPM, a non-parametric clustering method for phenomics data analysis (Gao et al., 2015). Based on the Nadaraya–Watson method for kernel density estimation (Parzen, 1962; Rosenblatt, 1956; Sprent, 2009) and following the framework of maximum likelihood estimation, NPM uses optimal density functions and applies a non-parametric clustering technique to group genotypes into the same cluster if their clouds-of-points share similar arbitrary shapes. The non-parametric approach avoids the parametric assumption of the underlying distribution so that NPM is suitable to model the non-linear interactions among multiple phenotypes (Gao et al., 2015). Since the clustering process is dependent on the initialization and the selection of anchor points, we repeat NPM multiple times to obtain all the meta-clustering results.
4.1.3 Dynamic phenotype network construction
In this step, we construct a dynamic phenotype network , where is the set of genotypes, is the set of time frames and represents edges in different time frames. In each time frame Ti, we check whether any two genotypes Pj and Ph are frequently grouped into the same cluster in meta-clustering. If the co-occurrence is greater than a predefined threshold K4, we add edge to Ei. In the dynamic network G, while the nodes are identical, the edges vary over time, indicating emerging phenomena emerge and disappear with the change of time.
A running example is shown in Figure 3. In the example, and . Given the frequency of concurrence of every two genotypes in T1, T2 and T3 (the table on the left), and let K4 be 0.8, we identify all the edges (shaded blocks) and construct the dynamic network in the middle panel of Figure 3.
4.1.4 Seed phenomena identification
We identify the seed phenomena by repeatedly applying a maximal clique-based approach on every time frame of the dynamic network . Clique is a special structure such that any two nodes in it are adjacent, implying a close relationship among all the nodes that belong to the same clique. A maximal clique is a clique that cannot be extended by including one more adjacent node, meaning it is not a subset of a larger clique.
More specifically, we adopt the Bron–Kerbosch algorithm to identify all the maximal cliques (Bron et al., 1972). The basic form of the Bron–Kerbosch algorithm is the recursive backtracking that searches for all maximal cliques in a given network. Its performance has been further improved by defining a pivot vertex set, allowing it to backtrack more quickly in branches of the search that contain no maximal cliques (Cazals and Karande, 2008; Tomita et al., 2004).
Let be the set of all the maximal cliques in the dynamic network , maximal clique defines a seed phenomenon with its genotype set being Pj and its time frame being Ti. A running example is shown in the right panel of Figure 3. The maximal cliques in T1 are .
4.2 TEP-Finder Phase 2. Extending from seeds to emerging phenomenon
After identifying all the seed phenomena in the minimal time frames, we extend them to longer time frames. This phase has four steps.
4.2.1 Emerging phenomenon identification
To identify emerging phenomena, the general idea is to combine seed phenomena in adjacent time frames. More specifically, for , which is the jth seed phenomenon in time frame Ti, we join it with every seed in time frame , resulting in , where represents the intersection of Pj and Pk. Then, we determine whether is a new emerging phenomenon using the following rules developed based on the definition of the emerging phenomenon (see Definition 3.2). The combination process will continue on the followed time frames (i.e. ), until all the seed phenomena are examined.
Discard if , is not an emerging phenomenon;
Replace with if .
Replace with if .
Accept as a new emerging phenomenon if and and .
Following the example of the dynamic phenotype network and maximal cliques in Figure 3, the emerging phenomenon identification procedure starting from T1 is shown in Figure 4. In Figure 4a, one of the seed phenomena that start from T1 or T2 is and , respectively. The join of P2 and P4 is , which, according to Definition 3.1, is saved as an emerging phenomenon in time frame . Similarly, for the other seeds in T1 and T2, we join them pair-wisely and save all the qualified emerging phenomena (see the blue colored notes in Fig. 4a). Next, we join all the emerging phenomena in time frame with the seeds in T3, resulting in the emerging phenomena in time frame . For example, is the result by joining and . Note that P18 replaces P9 since they have the same genotypes and the time frame of P18 contains that of P9. Those who do not qualify the definition of emerging phenomenon are discarded (all the gray notes in Fig. 4a).
4.2.2 Significance test
Given the temporal phenotype data , we compare the phenotype values of every genotype Pi with the reference using logged fold change, resulting in the relative phenotype values. The reference could be the wild-type in mutant experiments, the parental lines in recombinant inbred line experiments, or the average of all the genotypes in population experiments. Without losing generality, all the significant phenomena can be identified using a user given logged fold change threshold or with the computation of the false discovery rate. Other significance tests can also be applied for the same purpose. If the percentage of significant phenotype values of an emerging phenomenon is less than a user given threshold K3, the emerging phenomenon is discarded.
4.2.3 EP-DAG construction
To model the complex relationships among all the emerging phenomena, we construct an EP-DAG . is a DAG with a virtual root node Proot. We first connect all the emerging phenomena found in any individual time frame directly to Proot (see example in Fig. 4b). Next, we add an edge pointing from every emerging phenomenon to another one if the latter is generated by joining the former with other ones and both of them start from the same time frame. For example, in Figure 4b, an edge is pointing from to . Finally, we add an edge pointing from one emerging phenomenon to another one , if , and is not a descendent of . For example, we add edges pointing from P5 to P8, P15 to P23 and P14 to P21 (see the dotted edges in Fig. 5b).
4.2.4 EP-DAG pruning
Finally, to reduce the redundancy of the emerging phenomenon, we merge the highly overlapped emerging phenomena and remove emerging phenomena with insignificant phenotype values. Note that if an emerging phenomenon is discarded because it does not satisfy the user given thresholds (e.g. the percentage of significant values less than K3), its children will be redirect to its patent emerging phenomena. See examples in Figure 4a and b. Mathematically, given two emerging phenomena and in the same time frame, if and , we remove the two emerging phenomena and compose a new one called . Meanwhile, the edges connecting to and are redirected to .
5 Results
5.1 Data description
For performance evaluation, we used the long temporal plant photosynthesis phenomics data in Gao et al. (2015). The phenotyping experiment was carried out by testing 182 chloroplast-targeted single mutant lines (each with at least four biological replicates) of Arabidopsis thaliana under dynamic environmental conditions using DEPI (Cruz et al., 2016). Three kinds of phenotypes, i.e. photosynthetic system II activity (), photoprotection (qESV) and photoinhibition (qI) were collected at 112 time points. See experiment details in Cruz et al. (2016).
TEP-Finder was implemented with Python 2.7. The following parameters for TEP-Finder were used in the experiment: number of genotypes , number of time points , percentage of significant phenomena , number of time points per time frame 10, overlap rate between two adjacent time frames 90%; number of runs of clustering per time frame 100. The final results consist of 4318 emerging phenomena and an EP-DAG with 7789 edges.
5.2 Methods to compare
We compared TEP-Finder with NPM+ and DHAC+. The latter two are the methods modified from NPM and DHAC, respectively (see Section 2). The major difference in these methods locates on the process of seed phenomenon identification. More specifically, NPM+ consists of the following two steps. First, given the phenotype data , we call NPM once at every time frame to obtain the clustering results. Second, the clusters are used as the inputs to TEP-Finder Phase 2 (see Section 4.2). In DHAC+, we first preprocess the phenomics data using the Phase 1 of TEP-Finder, resulting in the dynamic phenotype network G. Second, DHAC is adopted to identify seed phenomena in G instead of searching for the maximal cliques. Finally, the seed phenomena are used as the inputs to TEP-Finder Phase 2 (see Section 4.2). Note that the only difference between NPM+, DHAC+ and TEP-Finder is how the seed phenomena are identified. Comparing NPM+ and DHAC+ with TEP-Finder is critical because it can test whether our meta-clustering followed with maximal clique approach is appropriate to generate seeds, which form the basis for the identification of emerging phenomena.
5.3 Performance evaluation using Gene Ontology enrichment
An emerging phenomenon consists of a list of chloroplast-targeted single mutant lines that exhibit coherent and significant phenomena in a continuous time frame. It is expected that the knockout genes would be involved in the same biological process or have a similar molecular function. Therefore, we tested whether the knockout genes in the same emerging phenomenon are also enriched in Gene Ontology (GO). GO includes three categories: biological process, molecular function and cellular component. Given a set of genes and their GO annotations, GO enrichment analysis identifies the over-represented GO terms. In our experiment, data were downloaded from the GO website in March 2017, and clusterProfiler (Yu et al., 2012) was used for the enrichment test.
Figure 6a shows that the percentage of emerging phenomena at each level of the EP-DAG using GO biological process. Clearly, TEP-Finder is constantly better than DHAC+, esp. at deep levels of the EP-DAG. The high performance on deep levels is important because emerging phenomena at deep levels often represent abnormal photosynthetic behaviors in a relatively more extended time period. TEP-Finder and NPM+ have a similar trend, but TEP-Finder is still better than NPM+ on most of the cases. Specifically, the averaged percentage of the enriched emerging phenomena of TEP-Finder is 0.80, which is 0.70 for NPM+. Similar results are found on the GO enrichment test on the molecular function category. In general, the performance of TEP-Finder is higher than NPM+ and DHAC+ at each level of EP-DAG (Fig. 6b). The averaged percentage of the enriched emerging phenomena of TEP-Finder is 0.66, while the values of NPM+ and DHAC+ are 0.56 and 0.43, respectively.
While the first experiment shows that TEP-Finder has more enriched emerging phenomena than the other two, it is not clear whether the enriched GO terms are at a shallow or deep level of the GO. Therefore, in the second experiment, we compared the distribution of the enriched GO terms among the three methods. Since the first level of the EP-DAG is the virtual root node, the comparison was carried out at the second, third and fourth level. We only tested the first three valid levels of EP-DAG because, in the results of DHAC+, the number of emerging phenomena after three levels are too few to compare. Figure 7 shows the cumulative distribution of the GO biological process terms and the molecular function terms at the first three valid EP-DAG levels. It is constant that there are more deep-level enriched GO terms in TEP-Finder than the other two.
5.4 Performance evaluation using gene association
Given an EP-DAG, gene-to-gene similarity can be calculated based on the topological structure of the DAG. We thereby test the correlation between EP-DAG based gene similarities with the GO molecular function based gene similarities. To calculate the gene-to-gene similarities based on the EP-DAG, we adopted the widely used Resnik method (Resnik, 1995). Specifically, given any two genes gi and gj, we identify their least common ancestor term and calculate the gene–gene similarity using , where N is the total number of genes in the ontology and is the number of genes annotated to the lowest common ancestor. The GO-based gene similarities were calculated using a web service named InteGO2 (Peng et al., 2016). All the similarities were normalized to the range of [0,1].
Figure 8 shows the experimental results on the three networks constructed using TEP-Finder, NPM+ and DHAC+, respectively. In general, there is a strong correlation between the gene–gene similarities based on TEP-Finder and based on the GO. Specifically, the R2 score of TEP-Finder is 0.89, significantly higher than that of the other two methods (i.e. 0.60 for NPM+ and 0.46 for DHAC+). This experiment suggests that the EP-DAG built by TEP-Finder is well organized.
5.5 Biological significance
Although we now have deep knowledge of the core processes of photosynthesis, the ‘ancillary components’ essential for function in living cells under dynamic conditions are largely unexplored (Zhu et al., 2008). Intriguingly, these ancillary components probably evolved as plug-in functional modules to adapt the core processes to different conditions. Understanding their functions may allow us to combine these modules in different organisms, to achieve rapid improvements in the photosynthetic efficiency. The identification of the emerging phenomena of chloroplast-targeted single knockouts may enable systematic analysis of genotype–phenotype connections and provides a clue on the characterization of specific ‘ancillary processes’ that support efficient photosynthesis. Note that many of the ancillary photosynthetic processes down-regulate the capture of light energy, preventing photodamage but at the cost of light-capture efficiency. From an evolutionary perspective, these processes can be viewed as balancing needs for energy and the avoidance of deleterious effects from photosynthesis.
We first analyzed the identified emerging phenomena from the gene evolution perspective. Since essential genes are often slow evolving compared with genes with non-lethal mutant phenotypes, the genes identified only in the emerging phenomena under fluctuating light varying conditions may evolve faster than those in the emerging phenomena under smooth light conditions. The ratio Ka/Ks, which measures the relative rates of synonymous and non-synonymous substitutions at a particular site, is often used for the estimation of evolutionary rates (Peterson and Masel, 2009). In our experiment, the averaged Ka/Ks ratio of the 50 genes appeared only in the emerging phenomena under strong and smooth light conditions is 0.164, while the averaged Ka/Ks ratio of the 45 genes identified uniquely under fluctuating and strong light conditions is 0.192, significantly higher than the former (permutation test, P = 0.013).
We then analyzed the emerging phenomena from the perspective of photosynthetic functionality. Two emerging phenomena (A and B) were categorized under the same strong fluctuating light conditions in the middle of the day (between 500 and 1000 μmolm−2s−1 four times repeated) due to distinctively different photosynthetic phenotypes (Fig. 9 and Supplementary Fig. S2, A, orange; B, blue). The emerging phenomenon A consists of mutant lines AT1G12250, AT1G80030, AT4G24750 and AT5G03455. They are sensitive to fluctuating light, showing large extent of decreases in photosystem II (PSII) activity and decreases in qESV (photoprotection) under high light intensity compared to the low light. Mutant lines in emerging phenomenon B (AT1G14590, AT1G54580, AT2G40400, AT3G10470, AT4G31560, AT5G03455 and AT5G39830) have less extent of decreases in PS II activity with higher qESV indicating less sensitivity to the fluctuating light. As the important genes responsive to dynamic light conditions, sensitivity of mutant would be increased. Thus, for mutant lines that are shown a sensitive phenotype under the conditions, it indicates that the mutated genes are responsible for maintaining robust photosynthesis under the stress conditions. Hence, we hypothesize that the genes in A may contribute to photoprotection in response to natural light dynamics (see the selected samples in Fig. 9). According to the GO, these genes are involved in arsenate reductase activity and the photosynthesis-related biological processes, including arsenate reductase activity and oxidation-reduction process. Most of them are related to cellular redox balance, which are important for regulation photosynthesis, yet mode of function of found genes in this study is still partly remained elusive (Bauer and Papenbrock, 2002; Hall et al., 2010). This analysis may provide new insights and open the new possibility to understand how plants are adapted dynamic conditions. Mutant lines in B stay high qE and minor decrease in PS II activity in the high light indicating those mutants are less sensitive to dynamic light conditions. It shows that the mutate genes in B are less likely responsible to adapt fluctuating light conditions. A functional analysis based on GO shows that these genes are involved in cell cycle, cell division and protein complex oligomerization, and protein folding fatty acid biosynthetic process cytochrome complex assembly. Also, the averaged Ka/Ks ratio of A and B is 0.24 and 0.22, respectively, which is significantly higher than that of randomly selected chloroplast-targeted genes (permutation test, P = 0.024 and 0.013).
The biological analysis demonstrates that accurately identifying emerging phenomena from plant phenotyping data may be valuable toward the characterization of specific ancillary processes that support efficient photosynthesis.
6 Discussion and conclusion
Comprehensive analysis of emerging phenomena is required to improve our understanding of the quantitative variation of complex phenotypes and to attribute gene functions (Flood et al., 2016). However, unlike frequent patterns, emerging phenomena may re-occur frequently or may appear only once during an experimental period, depending on the experimental design. TEP-Finder is the first tool toward capturing the emerging phenomena in large-scale longitudinal phenotyping experiments, leading to the identification of the minimum set of distinct actors needed to produce an undefined, complex aggregate phenotypic trait. Particularly, TEP-Finder can identify emerging phenomena in different temporal scales from the data and also can construct a directed acyclic network (EP-DAG) for better data management. The GO and gene–gene association based performance evaluation show that TEP-Finder is better than the existing tools regarding biological significance.
An important component of TEP-Finder is the meta-clustering that repeatedly calls NPM with random anchor points for kernel density estimation. We tested whether the meta-clustering approach can lead to more robust results. Specifically, given the same input, we ran TEP-Finder and NPM+ three times and calculated the differences between the results of the three runs. Supplementary Figure S1 shows the average number of genes per emerging phenomenon (indicated by circle area) at each level of the EP-DAG. In Supplementary Figure S1a, the three runs of TEP-Finder are similar to each other, indicated by the highly overlapped circles, whereas the three runs of NPM+, as shown in Supplementary Figure S1b, are distinctively different. In summary, the adoption of the meta-clustering approach ensures TEP-Finder to be robust enough for emerging phenomenon mining.
K 1 is the lower bound for the number of genotypes involved in an emerging phenomenon. By changing K1, the requirement for the smallest number of genotypes in an emerging phenomenon is changed. Smaller K1 may result in identifying emerging phenomena with less genotypes and with longer temporal range. Similarly, K2 is the lower bound for the continuous temporal range of an emerging phenomenon. It is related with the experimental duration that the user is interested in. Smaller K2 may result in identifying emerging phenomena with more genotypes and with shorter temporal range. In summary, users can vary K1 and K2 to obtain emerging phenomena with different number of genotypes and different experimental duration. K3 is the minimum percentage of statistically significant phenotype values in the same emerging phenomena. To our knowledge, K3 is difficult to specify. We suggest to fix K1 and K2 and explore the results by varying K3. Supplementary Table S1 indicates that in our experimental data, with the increase of K3, the EP-DAG becomes more concise (shallower and has less amount of nodes), and the majority of the removed nodes are intermediate ones. It suggests that to choose an appropriate K3, users can start with a high value and then gradually reduce it. At the same time, users should check whether the leaf nodes (which are the emerging phenomenon with the longest temporal duration) captures long-term patterns that user desired. As a future work, we will develop new algorithms to automatically optimize the parameters of TEP-Finder.
Although TEP-Finder is originally designed for identifying the emerging phenomenon in plant long temporal phenotyping experiments, it can be applied on any other temporal data as long as the format of the data holds. For example, we may apply TEP-Finder to temporal gene expression data to identify a group of genes who exhibit a coherent gene expression pattern during a continuous period of the experiment. Alternatively, TEP-Finder can be applied on spatial data such as high-throughput phenotype data collected with unmanned aerial vehicles (Shi et al., 2016; Thorp et al., 2018). To this end, TEP-Finder can be utilized to discover coherent patterns during continuous spatial sub-regions.
Funding
This work was supported by US NSF ABI [grant numbers 1458556, 1716340], US DOE BES [grant number DEFG0291ER20021] and NSFC [grant numbers 61702421, U1811262, 61772426]. China Postdoctoral Science Foundation (No. 2017M610651), Fundamental Research Funds for the Central Universities (No. 3102018zy033), Top International University Visiting Program for Outstanding Young scholars of Northwestern Polytechnical University.
References