MutViz 2.0: visual analysis of somatic mutations and the impact of mutational signatures on selected genomic regions

Abstract Patterns of somatic single nucleotide variants observed in human cancers vary widely between different tumor types. They depend not only on the activity of diverse mutational processes, such as exposure to ultraviolet light and the deamination of methylated cytosines, but largely also on the sequence content of different genomic regions on which these processes act. With MutViz (http://gmql.eu/mutviz/), we have presented a user-friendly web tool for the identification of mutation enrichments that offers preloaded mutations from public datasets for a variety of cancer types, well organized within an effective database architecture. Somatic mutation patterns can be visually and statistically analyzed within arbitrary sets of small, user-provided genomic regions, such as promoters or collections of transcription factor binding sites. Here, we present MutViz 2.0, a largely extended and consolidated version of the tool: we took into account the immediate (trinucleotide) sequence context of mutations, improved the representation of clinical annotation of tumor samples and devised a method for signature refitting on limited genomic regions to infer the contribution of individual mutational processes to the mutation patterns observed in these regions. We described both the features of MutViz 2.0, concentrating on the novelties, and the substantial re-engineering of the cloud-based architecture.

Histogram view of mutations in ZNF143 binding sites in melanoma 13 4 Distribution of the number of mutations in promoters and CTCF binding sites in melanoma . . . . . . . . . . . . . . . . . . . . . . 14 5 Explained variance with and without signature adjustment . . . 16 6 Distribution • Front-end (presentation layer): provides all the interactive visualizations described in our paper, implemented using the Angular.js 1 framework and D3.js. 2 • Middle Tier (backend, core layer): made of a REST API backed by the MutViz 2.0 engine. The engine, implemented in Python, performs the main computations, i.e., mutation-region intersections, mutation-region distances and signature refitting. The engine makes results available through a simple REST API. Depending on the type of computation and on the availability of cached results, the engine will choose whether to run an SQL query on the database (using SQLAlchemy 3 ) or start a parallel algorithm implemented on PySpark 4 . The REST API, through which the core layer can be accessed by the front-end, is implemented using the Flask 5 micro-framework.
• Storage (data layer): made of a Postgres database and a file repository. The database includes the mutation The MutViz 2.0 data layer organizes and stores the somatic mutation data, relying on the open-source relational database management system PostgreSQL. Supplementary Fig. 2 represents the logical schema of the database.
The main table, in Supplementary Fig. 2 represented with the name mutation, does actually have two implementations used for the two different visualization modes, trinucleotide mutation for intersection-based visualizations and mutation group for distance-based visualizations: • The table trinucleotide mutation stores mutation data as individual mutations associated to specific donors and tumor types (or better: tumor datasets; not always different tumor types, see Section 5, Supplementary Table 1). For each specific somatic single nucleotide mutation, it associates the donor (patient) identifier, the tumor type/dataset (e.g., BRCA / breast cancer), the chromosomal position, and the associated base change within its trinucleotide sequence context (or "trinucleotide mutation code"). This allows to compute statistics regarding the number of trinucleotide mutations per donor and perform signature refitting on individual tumors.
• The second implementation of mutation, named mutation group, groups mutations by tumor type, and does hence not require the data item donor id. Instead, an additional data item mutation count indicates how many donors have the specific base change at the given chromosomal position. Also, the trinucleotide context is not taken into consideration and the auxiliary table tri mutation code is replaced by the simpler base change representation in mutation code. This aggregation of mutations by tumor dataset speeds up the computation for distance-based visualizations where only total counts are needed and donor-specific details play no role.
In both mutation group and trinucleotide mutation, the tumor type and the base change (mutation code or trinucleotide mutation code, respectively) are encoded as integers. This reduces the size of each row, thereby decreasing the total number of page reads. The decoding of the mutation codes (base changes) and tumor type codes can be performed by a lookup in the satellite tables mutation code, tri mutation code and tumor type.
Clinical features and annotations of individual tumors are contained in the table clinical data and matched to the donor/patient identifier and the tumor type/dataset. Finally, the table temp user region is a temporary table, whose scope and persistence is limited to the single user session. For each region of interest, the table stores the chromosome and the start and stop position of the region.

Processing of user queries
We exploit indexing features and query optimization of PostgreSQL in order to quickly compute the result of user queries. Thanks to our data representation and choice of indexes, typical queries matching input regions with somatic mutations in a window of ±1 kbp around the region center run in a short time interval of a few seconds up to one or two minutes (and only a few milliseconds when previous results have been cached), which is excellent considering the size of the mutation datasets. Queries for larger windows of up to ±5 kbp can take up to several minutes. Still, this makes MutViz 2.0 a mostly fluid and responsive tool from the point of view of the user.

Queries for distance-based visualizations
First, the region set files provided by the user are loaded into the database as a temporary table, then the MutViz 2.0 backend/core layer opens a transaction through the database in order to run the following SQL query:

Queries for intersection-based visualizations
For the new visualization modes, which are based on mutation-region intersections, we implemented an ad-hoc parallel processing algorithm based on PySpark, the Python API of Apache Spark.
It can take the user-provided region set and mutation data from the database described above (data are streamed from the database to Spark) or from CSV files stored on the filesystem.
The main steps of the procedure are: 1. The mutation dataset is split into an arbitrary, configurable number of partitions. Mutations from different chromosomes can belong to the same partition. 5. Finally, the results from the individual mutation partitions (mutations matching to the region set) are aggregated for the specific visualization mode (e.g., by donor for the "mutations per donor" view).
Result: Mutations intersecting user-provided regions.
// current region precedes the current mutation The intersection procedure is implemented in algorithm 1 and is based on classical line-sweep algorithms, adapted to support chromosome matching. Compared to standard algorithms, when a mutation and a region overlap but their chromosomes do not match, the mutation cannot be simply discarded, since it could still match one of the upcoming regions. At the same time, the current region could overlap with the next mutations and cannot be discarded either. This problem is solved by the lookAhead step (lines 14-20, called in line 9) that performs a forward search, looking for regions that could match the current mutation.
This algorithm is efficient under the assumption that the number of regions in the user-provided region set is much smaller than the number of mutations in the database. This is the case even for frequently found transcription factor binding sites.
Filtering of mutation data according to clinical annotation (such as gender, age, therapy, etc.) is done a posteriori by joining the (cached) intersection result with the clinical data table from the database. Filtering the mutation data before computing the intersection instead would be less efficient because the resulting mutation set would be different for every user-defined filter on clinical annotation, and could therefore not be cached for following user queries with modified annotation filters.

User Interface
The user interface is an HTML 5 single-page application based on Angular JS 7 and D3.js 8 . It communicates with the application layer through the aforementioned REST API, retrieving processed data for building visualizations and computing statistical tests.
The interface is organized in eight main sections, namely: 1. MutViz 2.0 (home): provides a description of the tool and the data on which it is built.
2. Workspace: allows users to manage their workspace, containing either uploaded region sets or example region sets from our public repository. The browser's local storage is used to save and recover the users' workspace.
3. Histogram: shows the distribution of mutations around the regions in a single region set.

4.
Region comparison: provides a heatmap for comparing two different region sets for a given mutation dataset.

5.
Tumor comparison: provides a heatmap for comparing two or more mutation datasets (e.g., different tumor types) for a given region set. A statistical test can be computed only when comparing exactly two mutation datasets.

Mutation data and region sets
Supplementary   Fig. 1 in the paper), however, may also lie on the reverse strand. The two peaks of mutations within ZNF143 binding sites correspond to C→T changes-and the corresponding G→A changes on the reverse strand-at positions 3 to 5 of the binding motif (see the sequence logo in Fig. 1 of the paper).

Regions sets and mutations
To evaluate whether signature refitting using only mutations from a limited subset of the genome (regions of interest) can be improved by adjusting mutational signatures according to the nucleotide content (i.e., the nucleotide frequencies) within the region set, we performed additional analyses for CTCF binding sites and promoter regions. The latter was chosen as an example for regions which are still limited in size but orders of magnitude larger than transcription factor binding sites, thus usually containing larger amounts of somatic mutations as shown in Supplementary Fig. 4 for the melanoma (MELA) whole genome sequencing dataset. We identified experimentally validated CTCF binding sites as described in the paper. As promoter regions we took all regions spanning from 2000 bases upstream to 200 bases downstream of transcription start sites (extracted from the R Bioconductor package TxDb.Hsapiens.UCSC.hg19.knownGene).

Signature refitting
We used the R/Bioconductor package decompTumor2Sig [1] and performed signature refitting (tumor genome "decomposition") for each individual donor using three different approaches: 1. Whole-genome refitting: using all somatic mutations from the entire genome with standard, unaltered mutational signatures (COSMIC v3.0 [2]; see our article). Since the obtained exposures are derived from the maximum information (i.e., maximum number of mutations) available for a donor, we took them as a baseline or "goal" for comparison with signature refitting based on limited subsets of regions (and hence subsets of mutations).
2. Region-specific refitting without adjustment: using only somatic mutations from a limited subset of regions (here: promoters or CTCF binding sites) without prior adjustment of mutational signatures according to the trinucleotide frequencies in the regions of interest.
3. Region-specific refitting with adjustment: using only somatic mutations from a limited subset of regions (here: promoters or CTCF binding sites) with prior adjustment of mutational signatures according to the trinucleotide frequencies in the regions of interest (as described in the Section "Signature refitting for limited genomic regions" of our article).
Each of these signature refitting procedures returns an exposure vector containing one exposure (i.e., percent contribution) for each mutational signature. The exposure vectors obtained from region-specific refitting can be compared with those obtained for whole-genome refitting in order to evaluate whether signature adjustment based on nucleotide content improves the results when relying only on a limited subset of mutations.

Explained variance of exposure vectors
To evaluate how well an obtained exposure vector for the set of COSMIC signatures can explain the variance of the somatic mutation patterns of a donor after decomposition (refitting) of the tumor genome, or the respective subset, decompTumor2Sig computes the coefficient of determination R 2 [1]. Supplementary Fig. 5 illustrates that the explained variance strongly depends on the number of mutations which are available for signature refitting. When using mutations from promoter regions, region-specific refitting can eventually explain the observed variance in mutation patterns nearly as well as whole-genome refitting (when the promoter regions of a donor contain about 1000 somatic mutations or more; see upper panel of Supplementary Fig. 5).
CTCF binding sites (lower panel of Supplementary Fig. 5), however, are much smaller and usually contain much less somatic mutations from which to estimate exposures. Thus, signature refitting based on the small number of mutations can explain the variance observed in the subset of mutations less well. However, adjusting mutational signatures according the trinucleotide content of the set of CTCF binding sites prior to signature refitting significantly increases the explained variance. The analysis suggests that variances are explained reasonably well for donors with at least 35 somatic mutations located at CTCF binding sites.

Accuracy of exposure vectors
To evaluate the accuracy of exposure vectors obtained from region-specific signature refitting, i.e., the degree to which they approximate those obtained from whole-genome signature refitting, we used two commonly used measures. First,  we computed the cosine similarities between the exposure vectors, which ideally approach 1 if the vectors point into the same direction of the n-dimensional space (here: n=number of signatures). Then, we computed the mean squared error (MSE) between the exposure vectors, which ideally approaches 0 if there is little difference between the exposures obtained for the single mutational signatures. Supplementary Figures 6 and 7 illustrate that cosine similarities between exposures from region-specific refitting and whole-genome refitting are generally improved when mutational signatures are adjusted to region-specific trinucleotide frequencies. How well exposure vectors from whole-genome refitting are approximated depends on the number of mutations available for signature refitting.
For the larger promoter regions, which contain many more mutations than CTCF binding sites in the analyzed melanomas, we can observe strong similarities between exposure vectors from region-specific and whole-genome refitting, especially when signatures are adjusted for trinucleotide content. But also for CTCF binding sites cosine similarities can be significantly improved when signatures are adjusted prior to refitting.
This finding is confirmed when measuring mean squared errors between exposure vectors from region-specific signature refitting and those from wholegenome signature refitting, as illustrated by Supplementary Figures 8 and 9.   Figure 9: Mean squared errors (as a function of the number of mutations) between exposure vectors obtained from signature refitting for genome-wide mutation data and from signature refitting for mutation data from a subset of regions only (with and without prior adjustment of mutational signatures according to trinucleotide frequencies in CTCF binding sites). Upper panel: mutation data from promoter regions; lower panel: mutation data from CTCF binding sites. The dashed vertical line in the lower panel represents a reasonable cutoff for the number of mutations (n=35) in CTCF binding sites (see Supplementary Fig. 5 The human genome is organized hierarchically into discrete topologically associated domains (TADs), self-interacting sub-megabase segments that are relatively insulated from neighboring domains [3,4] and whose boundaries are associated with particular spatial patterns of CTCF binding sites [5].
Recent studies have shown that TAD disruption is often found in cancer cells and may contribute to oncogenesis through several mechanisms. For instance, the mutation, deletion or epigenetic alteration of a TAD boundary might fuse two adjacent TADs and lead to aberrant gene expression [6,7,8,9]. Other studies have found an enrichment of CTCF binding site mutations at TAD boundaries in various cancer types [9,10,11,12,13]. A recent review [14] summarizes various findings about the increase of somatic mutation frequencies at transcription factor binding sites, and suggests that this increase is shaped by the complex interplay between DNA damage and the activity of DNA repair mechanisms.

Result: enrichment of CTCF binding site mutations at TAD boundaries
Here, we illustrate that MutViz 2.0 can be used to study the impact of mutational processes on active TAD boundaries and the enrichment of CTCF binding site mutations they generate. Indeed, in particular our own study of the subject [10] has spurred the development of MutViz. First, we visualized the enrichment of somatic single nucleotide mutations in active CTCF binding motifs at TAD boundaries and compare them with mutations in active CTCF binding motifs which are not located at TAD boundaries. For this distinction, we exploit a study on the H1 human embryonic stem cell line (H1-hESC) which inferred neighborhoods via ChIA-PET (chromatin interaction analysis by paired-end tag sequencing), using the cohesin subunit SMC1 as target [6].
In detail, we identified potential CTCF binding sites using their position frequency matrix (Jaspar 2018 [15]; accession MA0139.1) and Biostrings [16] with a threshold of 80%. We then classified these CTCF binding sites as "in-boundary" or "off-boundary" according to whether they are located at a previously mapped TAD boundary or not, and as "active" or "inactive" according to whether or not they are located at a peak identified in CTCF ChIP-seq data of human H1-hESC cells (produced by the Bernstein Laboratory and published as part of the ENCODE project [17]; accession ENCSR000AMF for H1-hESC). Supplementary Fig. 10, produced with the MutViz 2.0 histogram view, clearly shows for three different tumor datasets a strong enrichment of mutations at active, in-boundary CTCF binding sites in contrast to also active, but off-boundary CTCF binding sites. Although for both the in-boundary and the off-boundary CTCF sites, the activity (active CTCF binding) in these tumors need not be the same as in H1-hESC, this remarkable result is likely associated with the different biological function of the two sets of binding sites.
Supplementary Figure 10: Somatic mutation enrichment at the 19bp CTCF binding sites (highlighted by a light blue background) as compared to their flanking genomic regions (-1kbp to +1kbp from the center of each binding site) for both active in-boundary (left) and active off-boundary binding sites (right). Results are shown for three WGS tumor datasets (see Supplementary Table 1): breast cancer (BRCA), esophageal adenocarcinoma (ESAD), and liver cancer (LIRI).

Result: types of mutations in CTCF binding sites at TAD boundaries in esophageal adenocarcinoma
Supplementary Figure 11: Trinucleotide mutation distribution of esophageal adenocarcinoma (ESAD) mutations falling in active in-boundary CTCF binding sites.
The trinucleotide mutation view in Supplementary Fig. 11 illustrates which types of mutations (base changes within the context of their flanking bases) fall in active in-boundary CTCF binding sites in esophageal adenocarcinoma (ESAD). As we can observe, most mutations are T→C and T→G, especially regarding triplets with an upstream cytosine, i.e., C[T →(C,G)]N, where N can be any nucleotide.
Finally, Supplementary Fig. 12 shows the "mutations per donor" view with the distributions of the number of specific base changes in CTCF binding sites across the individual donors of the ESAD dataset. In addition, we can see a greater number of transversions than transitions.