Analysis, identification and visualization of subgroups in genomics

Abstract

Motivation: Cancer is a complex and heterogeneous disease involving multiple somatic mutations that accumulate during its progression. In recent years, the wide availability of genomic data from patient samples has opened new perspectives in the analysis of gene mutations and alterations. Hence, visualizing and identifying genes mutated in large sets of patients is now a critical task that sheds light on more personalized intervention approaches.

Results: Here, we extensively review existing tools for the visualization and analysis of alteration data. We compare different approaches to study mutual exclusivity and sample coverage in large-scale omics data. We complement our review with the standalone software AVAtar (‘analysis and visualization of alteration data’), which integrates diverse aspects known from different tools into a comprehensive platform. AVAtar supplements customizable alteration plots with a multi-objective evolutionary algorithm for subset identification and provides an innovative and user-friendly interface for the evaluation of concurrent solutions. A use case from personalized medicine demonstrates its unique features in an application to vaccination target selection.

Availability: AVAtar is available at https://github.com/sysbio-bioinf/avatar

Contact: hans.kestler@uni-ulm.de, phone: +49 (0) 731 500 24 500, fax: +49 (0) 731 500 24 502

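Let Γ(g) denote the set of samples that have an alteration in gene g. The coverage of a gene set G is defined as

$$\gamma(G) = \Bigl|\bigcup_{g \in G} \Gamma(g)\Bigr| \tag{1}$$

and the overlap of G is defined as

$$\omega(G) = \sum_{g \in G} |\Gamma(g)| - \gamma(G). \tag{2}$$

This overlap definition accounts for the number of additional genes in G that cover the same sample, in contrast to counting the number of samples covered by more than one gene from G.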
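The difference between the two ways of counting overlap can be checked on a small example. The following Python sketch is illustrative only: the toy alteration data and the function names are hypothetical, and the overlap is assumed to be the total number of alterations in G minus the coverage, which matches the verbal description above.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy data: gene -> set of samples altered in that gene, i.e. Γ(g).
ALTERATIONS: Dict[str, Set[str]] = {
    "A": {"s1", "s2", "s3"},
    "B": {"s3", "s4"},
    "C": {"s1", "s3"},
}


def coverage(genes: Iterable[str], alterations: Dict[str, Set[str]]) -> int:
    """γ(G): number of samples altered in at least one gene of G."""
    covered: Set[str] = set()
    for g in genes:
        covered |= alterations[g]
    return len(covered)


def overlap(genes: Iterable[str], alterations: Dict[str, Set[str]]) -> int:
    """ω(G), assumed form: for every covered sample, count the genes beyond
    the first one that cover it (total alterations minus coverage)."""
    genes = list(genes)
    total_alterations = sum(len(alterations[g]) for g in genes)
    return total_alterations - coverage(genes, alterations)


def multiply_covered_samples(genes: Iterable[str],
                             alterations: Dict[str, Set[str]]) -> int:
    """Alternative count for comparison: samples covered by more than one gene."""
    counts: Dict[str, int] = {}
    for g in genes:
        for s in alterations[g]:
            counts[s] = counts.get(s, 0) + 1
    return sum(1 for c in counts.values() if c > 1)
```

For the toy data, the gene set {A, B, C} covers four samples. Sample s1 is covered by two genes and s3 by three, so the overlap is 3, whereas only two samples are covered more than once.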
The gene selection task consists of finding a subset G ⊆ 𝒢 of the set of all genes 𝒢 such that the number of covered samples γ(G) is maximized and either the overlap ω(G) or the number of genes |G| is minimized. There is no obvious choice for a weighting between the two optimization objectives; a setup using equal weights on unnormalized objectives has been used previously [1]. Hence, the described research questions are formulated as a multi-objective gene selection task that consists of finding a Pareto-optimal set S* ⊆ S of gene subsets within the set of all gene subsets S = {G | G ⊆ 𝒢}. The solutions, i.e. gene subsets, in the Pareto-optimal set S* are the optimal trade-offs between the two optimization objectives [2]. Pareto-optimality is based on the dominance relation ≺: a solution G dominates another solution G′, denoted as G ≺ G′, if and only if G is at least as good as G′ in all objectives and strictly better in at least one. Based on this, the Pareto-optimal set is defined as the set of solutions that are not dominated by any other solution:

$$S^{*} = \{\, G \in S \mid \nexists\, G' \in S : G' \prec G \,\}$$

Figure 1 shows a Pareto-optimal set for a multi-objective optimization task maximizing coverage and minimizing overlap. The solutions dominated by the Pareto-optimal solution G are marked.

Figure 2: Multi-objective evolutionary algorithm for the gene selection task. The steps of an iteration t of the algorithm are shown, starting with the population P_t and resulting in a population P_{t+1} that is processed in the next iteration t + 1, provided that the specified computational budget is not exhausted. Solutions in the populations are represented as bit-strings.
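The dominance relation and the resulting Pareto-optimal set can be illustrated with a brute-force filter. This is a minimal sketch, assuming each solution is summarized by a hypothetical pair (coverage, overlap) with coverage maximized and overlap minimized; the function names are illustrative and the quadratic filter is meant for small examples only.

```python
from typing import List, Tuple

Objectives = Tuple[int, int]  # (coverage to maximize, overlap to minimize)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b in both objectives
    and strictly better in at least one of them."""
    cov_a, ov_a = a
    cov_b, ov_b = b
    at_least_as_good = cov_a >= cov_b and ov_a <= ov_b
    strictly_better = cov_a > cov_b or ov_a < ov_b
    return at_least_as_good and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep the points that are not dominated by any other point (O(n^2))."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective values of five candidate gene subsets.
CANDIDATES: List[Objectives] = [(4, 3), (3, 1), (2, 0), (3, 2), (1, 0)]
```

In this example, (3, 2) is dominated by (3, 1) and (1, 0) is dominated by (2, 0), so the remaining three points form the Pareto-optimal set.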

Multi-Objective Evolutionary Algorithm
We developed an evolutionary algorithm for the multi-objective gene selection task based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [2] as implemented in the jMetal library v5.3 [3]. NSGA-II is a population-based metaheuristic that adapts concepts from the theory of evolution [4]. A set of solutions, called the population, is evolved iteratively by applying recombination and mutation operators to the solutions (see Figure 2).

The algorithm uses the following binary solution encoding. Genes are sorted by sample coverage in decreasing order. The i-th bit of a solution s encodes whether the i-th gene g_i is part of the gene set G encoded by this solution.

Our NSGA-II for multi-objective gene selection is outlined in Algorithm 1. The algorithm starts with an initial population of µ ∈ ℕ randomly created solutions, in which each bit is set to 1 with probability p_sel ∈ (0, 1) and to 0 otherwise. Tournament selection with tournament size τ ∈ ℕ and the crowded-distance comparator is used to select solutions from the parent population for recombination and mutation. The crowded distance d(s_i) of a solution s_i is calculated based on the permutations π_j that result from sorting the solutions by the j-th objective f_j. Following the standard NSGA-II definition [2], with k_j = π_j^{-1}(i) denoting the position of s_i in the order of objective j,

$$d(s_i) = \sum_{j} \frac{f_j\bigl(s_{\pi_j(k_j+1)}\bigr) - f_j\bigl(s_{\pi_j(k_j-1)}\bigr)}{f_j\bigl(s_{\pi_j(\mu)}\bigr) - f_j\bigl(s_{\pi_j(1)}\bigr)}$$

and d(s_{π_j(1)}) = ∞, d(s_{π_j(µ)}) = ∞ for the extrema per objective. The selected solutions are recombined and mutated using a bit-flip mutation with probability p_mut ∈ (0, 1).
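The building blocks named in this description can be sketched in a few lines of Python. This is not the jMetal implementation: the function names are hypothetical, the crowding distance follows the standard normalized NSGA-II form (which NSGA-II computes per non-dominated front), and the tournament is assumed to compare candidates by front rank first and crowding distance second, sampling candidates without replacement.

```python
import math
import random
from typing import List, Sequence


def random_solution(n_genes: int, p_sel: float, rng: random.Random) -> List[int]:
    """Bit-string in which each bit is set to 1 with probability p_sel."""
    return [1 if rng.random() < p_sel else 0 for _ in range(n_genes)]


def bit_flip(solution: Sequence[int], p_mut: float, rng: random.Random) -> List[int]:
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in solution]


def crowding_distance(objectives: Sequence[Sequence[float]]) -> List[float]:
    """Standard NSGA-II crowding distance; extrema per objective get infinity."""
    n = len(objectives)
    distance = [0.0] * n
    if n == 0:
        return distance
    for j in range(len(objectives[0])):
        # order plays the role of the permutation pi_j (sorting by objective j)
        order = sorted(range(n), key=lambda i: objectives[i][j])
        f_min, f_max = objectives[order[0]][j], objectives[order[-1]][j]
        distance[order[0]] = distance[order[-1]] = math.inf
        if f_max == f_min:
            continue  # all values equal: no spread to measure
        for k in range(1, n - 1):
            gap = objectives[order[k + 1]][j] - objectives[order[k - 1]][j]
            distance[order[k]] += gap / (f_max - f_min)
    return distance


def tournament(ranks: Sequence[int], distances: Sequence[float],
               tau: int, rng: random.Random) -> int:
    """Draw tau candidates and return the index of the winner:
    lower front rank wins, ties are broken by larger crowding distance."""
    candidates = rng.sample(range(len(ranks)), tau)
    return min(candidates, key=lambda i: (ranks[i], -distances[i]))
```

For three solutions with objective values (1, 5), (2, 3) and (4, 1), the two outer solutions receive an infinite distance and the middle one a distance of 2.0, so the outer solutions are preferred when ranks are equal.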
The objective values of the resulting offspring solutions are then evaluated. The new population of µ solutions is determined by non-dominated sorting (see Figure 3) of the parent population and their offspring [2]. This is repeated for a specified number of iterations k ∈ ℕ. The non-dominated solutions of the final population are the optimization results.

The evolutionary algorithm tends to be slow in eliminating genes that have become redundant due to the addition of other genes. Hence, an additional redundancy reduction step is included after crossover and mutation to remove genes that only cover samples already covered by other genes. The redundancy of a gene is defined relative to a gene order. A gene g is considered redundant if the previous genes G_prev in the order already cover all the samples Γ(g) that have an alteration in gene g, i.e.

$$\Gamma(g) \subseteq \bigcup_{g' \in G_{\mathrm{prev}}} \Gamma(g').$$
Two gene orders are used to remove redundant genes from a solution: the order used in the bit-string encoding (decreasing sample coverage) and the corresponding reverse order (increasing sample coverage). Applying both orders to an offspring solution yields two reduced solutions. Of these two, the one with the better second objective, i.e. less redundancy, replaces the original offspring.
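The redundancy reduction step can be sketched as follows. This is an illustrative Python sketch rather than AVAtar's code: the toy data and function names are hypothetical, solutions are represented as sets of gene names instead of bit-strings, and the second objective is assumed to be the overlap (total alterations minus coverage).

```python
from typing import Dict, List, Sequence, Set

# Hypothetical toy data: gene -> samples altered in that gene, i.e. Γ(g).
ALTERATIONS: Dict[str, Set[str]] = {
    "A": {"s1", "s2", "s3"},
    "B": {"s3", "s4"},
    "C": {"s1", "s3"},
}


def overlap(genes: Sequence[str], alterations: Dict[str, Set[str]]) -> int:
    """Assumed second objective: total alterations minus covered samples."""
    covered: Set[str] = set()
    for g in genes:
        covered |= alterations[g]
    return sum(len(alterations[g]) for g in genes) - len(covered)


def remove_redundant(order: Sequence[str], selected: Set[str],
                     alterations: Dict[str, Set[str]]) -> List[str]:
    """Walk through the genes in `order` and drop every selected gene whose
    samples are already covered by the selected genes that precede it."""
    covered: Set[str] = set()
    kept: List[str] = []
    for g in order:
        if g not in selected:
            continue
        if alterations[g] <= covered:  # Γ(g) is a subset of the covered samples
            continue
        kept.append(g)
        covered |= alterations[g]
    return kept


def reduce_offspring(order: Sequence[str], selected: Set[str],
                     alterations: Dict[str, Set[str]]) -> List[str]:
    """Apply both gene orders and keep the result with the smaller overlap."""
    forward = remove_redundant(order, selected, alterations)
    backward = remove_redundant(list(reversed(order)), selected, alterations)
    return min((forward, backward), key=lambda genes: overlap(genes, alterations))
```

With the genes ordered by decreasing coverage (A, B, C), the forward pass drops C because s1 and s3 are already covered by A, whereas the reverse pass keeps all three genes. The forward result {A, B} has the smaller overlap and would replace the offspring.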

Objective Alteration Plots via Algorithmic Sorting
AVAtar facilitates the creation of objective and reproducible alteration plots by offering algorithmic sorting of genes and samples. The task of finding a gene order based on the additionally covered samples is formulated as a minimal Set Cover problem, and a modified greedy algorithm [5] for Set Covering is applied (see Algorithm 2). Starting with an initial solution, the greedy algorithm incrementally adds the gene covering the most uncovered samples to its current partial solution, i.e. the gene g that maximizes the objective function f(G, g) = γ(G ∪ {g}). Algorithm 2 treats an empty partial solution (G_{t−1} = ∅) as either the initial selection or a restart after maximal coverage has been reached.

For a gene order based on maximal coverage and maximal exclusivity (i.e. minimal overlap), this objective function is replaced by Equation (6), which includes a weighting α ∈ [0, 1] between coverage and exclusivity. To permit the weighting between both aims, the modified objective function uses the relative coverage and the relative overlap of the gene set G ⊆ 𝒢 compared to the set of all genes 𝒢, defined as

$$\hat{\gamma}(G) = \frac{\gamma(G)}{\gamma(\mathcal{G})}, \qquad \hat{\omega}(G) = \frac{\omega(G)}{\omega(\mathcal{G})}.$$

Similarly, genes can be sorted according to co-occurring alterations and coverage with a correspondingly adapted objective function. Finally, samples are sorted lexicographically according to the gene order.

Figure 4 shows a visualization of genomic data for a subset of 1,540 patients from an AML dataset [6], using AVAtar's ability to additionally group genes by predefined functional categories. In this subgroup of NPM1-mutated patients, most co-mutations occur in genes associated with the functional groups activated signaling and DNA methylation. NPM1 is fixed to the first position and the remaining genes are sorted according to a trade-off between mutual exclusivity and coverage using α = 0.4. The sorted alteration plot highlights the molecular heterogeneity across patients.

Figure 4: Visualization of genomic data for a subset of 1,540 patients from our AML dataset [6]. Genes are grouped into predefined functional categories.
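The greedy gene ordering and the lexicographic sample sorting can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from AVAtar: the weighted objective is assumed to be α · (relative coverage) + (1 − α) · (1 − relative overlap), the restart after reaching maximal coverage is omitted, ties are broken by input order, and altered samples are assumed to sort before unaltered ones. All names and the toy data are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set

Alterations = Dict[str, Set[str]]  # gene -> samples altered in that gene
Score = Callable[[List[str], str], float]

# Hypothetical toy data.
ALTERATIONS: Alterations = {
    "A": {"s1", "s2", "s3"},
    "B": {"s3", "s4"},
    "C": {"s1", "s3"},
}


def coverage(genes: Iterable[str], alt: Alterations) -> int:
    covered: Set[str] = set()
    for g in genes:
        covered |= alt[g]
    return len(covered)


def overlap(genes: Sequence[str], alt: Alterations) -> int:
    return sum(len(alt[g]) for g in genes) - coverage(genes, alt)


def coverage_score(alt: Alterations) -> Score:
    """f(G, g) = coverage of G extended by g."""
    return lambda current, g: coverage(current + [g], alt)


def weighted_score(alpha: float, alt: Alterations) -> Score:
    """ASSUMED form: alpha * rel. coverage + (1 - alpha) * (1 - rel. overlap)."""
    all_genes = list(alt)
    max_cov = coverage(all_genes, alt) or 1  # avoid division by zero
    max_ov = overlap(all_genes, alt) or 1

    def score(current: List[str], g: str) -> float:
        candidate = current + [g]
        rel_cov = coverage(candidate, alt) / max_cov
        rel_ov = overlap(candidate, alt) / max_ov
        return alpha * rel_cov + (1 - alpha) * (1 - rel_ov)

    return score


def greedy_order(alt: Alterations, score: Score,
                 fixed: Sequence[str] = ()) -> List[str]:
    """Repeatedly append the gene with the best score; `fixed` genes come first."""
    order = list(fixed)
    remaining = [g for g in alt if g not in order]
    while remaining:
        best = max(remaining, key=lambda g: score(order, g))
        order.append(best)
        remaining.remove(best)
    return order


def sort_samples(samples: Iterable[str], order: Sequence[str],
                 alt: Alterations) -> List[str]:
    """Lexicographic sort by alteration pattern along the gene order."""
    return sorted(samples, key=lambda s: tuple(0 if s in alt[g] else 1 for g in order))
```

Passing `fixed=["C"]` pins a gene to the first position in the same way that NPM1 is fixed in Figure 4, and `weighted_score(0.4, ALTERATIONS)` mirrors the weighting α = 0.4 under the assumed objective form.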