Rtpca: an R package for differential thermal proximity coaggregation analysis

Abstract

Summary: Rtpca is an R package implementing methods for inferring protein–protein interactions (PPIs) based on thermal proteome profiling experiments of a single condition or in a differential setting via an approach called thermal proximity coaggregation. It offers user-friendly tools to explore datasets for their PPI predictive performance and easily integrates with available R packages.

Availability and implementation: Rtpca is available from Bioconductor (https://bioconductor.org/packages/Rtpca).

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Thermal proteome profiling (TPP) (Mateus et al., 2020; Savitski et al., 2014) is a mass spectrometry-based, proteome-wide implementation of the cellular thermal shift assay (Molina et al., 2013). It was originally developed to study drug-(off-)target engagement. However, it was later observed that the profiles of interacting protein pairs are more similar than expected by chance, a phenomenon termed 'thermal proximity co-aggregation' (TPCA) (Tan et al., 2018). The R package Rtpca enables analysis of TPP datasets using the TPCA concept for studying protein-protein interactions (PPIs) and protein complexes, and also allows testing for differential PPIs across conditions. Here, we exemplify the analysis based on a dataset by Becher et al. (2018), which provides temperature range TPP (TPP-TR) experiments for synchronized HeLa cells in the G1/S cell cycle stage versus M phase.

Note: The paper by Becher et al. (2018) also includes 2D-TPP (Becher et al., 2016) data, which are in general more sensitive to changes in protein abundance or stability. These data can also be informative on the dynamics of protein-protein interactions based on correlation analysis of 2D-TPP profiles of annotated interactors. However, the advantage of TPP-TR data is that one can test for coaggregation which, if significant, is directly indicative of protein-protein interaction or complex assembly.

Step-by-step walk through the data analysis
First, we need to load the required libraries (these need to be installed as specified in the comments):

    library(dplyr)   # install.packages("dplyr")
    library(readxl)  # install.packages("readxl")
    library(Rtpca)   # BiocManager::install("nkurzaw/Rtpca")
    library(ggplot2) # install.packages("ggplot2")
    library(eulerr)  # install.packages("eulerr")

Then, we download the supplementary data from Becher et al. (2018), which contain the TPP data we will be using:

    # Download the supplementary table only if it is not already present
    if (!file.exists("1-s2.0-S0092867418303854-mmc4.xlsx")) {
      download.file(
        url = "https://ars.els-cdn.com/content/image/1-s2.0-S0092867418303854-mmc4.xlsx",
        destfile = "1-s2.0-S0092867418303854-mmc4.xlsx",
        mode = "wb")
    }

Getting TPP data into a valid import format
The ideal input format for the Rtpca package is a list of ExpressionSets as obtained by importing the data with the TPP package (Franken et al., 2015). We showcase how this can be done in the Rtpca package vignette. However, to facilitate usage of the Rtpca functions when the raw data of a TPP experiment are not available, as in the case we are exemplifying here, a simple list of matrices is also supported as input. If replicates are available, they can simply be incorporated as different list elements. Here we work with the median fold changes for each condition, thus our list objects will only contain one element each. The rows of the matrices should contain the different measured proteins and the columns the relative soluble fraction at different temperatures measured by TMT reporter ions. Since the column names will often represent the measured TMT channels, the Rtpca package additionally expects an attribute named temperature (shown below how to define; not needed if the input is an object imported with the TPP package).
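The matrix input format described above can be sketched in base R as follows. This is a minimal illustration with made-up values: the protein names, TMT channel labels and temperatures are assumptions chosen for the example and are not taken from Becher et al. (2018).

```r
# Toy example with made-up values; not data from Becher et al. (2018).
# Rows: proteins; columns: relative soluble fraction per TMT channel.
temperatures <- c(37, 44, 50, 56, 62, 67)                        # assumed temperatures
tmt_channels <- c("126", "127L", "127H", "128L", "128H", "129L") # assumed labels

toy_mat <- matrix(
  c(1.00, 0.98, 0.85, 0.40, 0.10, 0.02,   # hypothetical protein A
    1.00, 0.97, 0.80, 0.35, 0.08, 0.01,   # hypothetical protein B
    1.00, 0.99, 0.95, 0.90, 0.60, 0.20),  # hypothetical protein C
  nrow = 3, byrow = TRUE,
  dimnames = list(c("PROT_A", "PROT_B", "PROT_C"), tmt_channels)
)

# Attach the temperatures so that each column can be mapped to its temperature
attr(toy_mat, "temperature") <- temperatures

# One list element per replicate; here a single (median) profile per condition
toy_list <- list(toy_mat)
```

The attribute travels with the matrix inside the list, so it can be read back with attr(toy_list[[1]], "temperature").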

Multiple testing burden in testing for differential PPIs
In principle, we could now go ahead and test all possible PPIs for differential coaggregation in the datasets acquired in the two cell cycle phases; in practice, however, this is not feasible. The larger the PPI annotation, the higher our multiple testing burden and the less likely we are to identify true positive PPI changes. Thus, below we suggest two possible strategies that substantially reduce the number of tests compared to, e.g., testing all StringDb (Szklarczyk et al., 2019) annotated PPIs above a certain threshold (although using a very high threshold, 990 or even higher, might also be a viable strategy). The first approach ('complex-centric approach') first tests for coaggregation of protein complexes separately in the different conditions. All PPIs in complexes that coaggregate significantly in any of the conditions are then used in a second step to test for differential coaggregation across the conditions. The second approach ('PPI-centric approach') uses a similar strategy, but tests for significant PPI coaggregation separately in the different conditions and then chooses the significant interactions across both conditions for further testing for differential behavior.
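The effect of the number of tests on sensitivity can be illustrated with a small base R sketch. It assumes Benjamini-Hochberg adjustment via p.adjust (the adjustment method is an assumption made for this example) and uses made-up p-values: one fixed small p-value placed among evenly spaced larger ones.

```r
# Illustrative only: one fixed small p-value among m - 1 evenly spaced
# larger p-values (made-up numbers, not results from a TPCA analysis).
adjust_smallest <- function(m, p_true = 0.001) {
  p_other <- seq(0.05, 1, length.out = m - 1)
  p_adj <- p.adjust(c(p_true, p_other), method = "BH")
  p_adj[1]  # adjusted p-value of the candidate of interest
}

adj_few  <- adjust_smallest(m = 20)   # small, pre-filtered set of tests
adj_many <- adjust_smallest(m = 200)  # ten times as many tests

# Would the same raw p-value pass a 5% threshold in each setting?
passes <- c(few = adj_few < 0.05, many = adj_many < 0.05)
```

In this toy setting the same raw p-value is retained when few hypotheses are tested but not when many are, which is the motivation for pre-selecting PPIs before the differential test.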

Complex-centric analysis
We start by loading an annotation of mammalian complexes by Ori et al. (2016), which comes with the Rtpca package. Its crucial columns are protein, a column using the same identifiers (gene names in this case, UniProt ids or Ensembl protein ids) as the row names of the supplied input matrix or ExpressionSet object, and id, which holds unique protein complex ids.
Then, we perform a TPCA analysis based on complexes only in the G1/S condition. Since it would not be feasible to compare all annotated protein complexes (true positives) with all groups of proteins not annotated as complexes (false positives), the function uses several random permutations of the input complex annotation table as a proxy for the false positives found when ranking groups of proteins by low average Euclidean melting curve distance.

Note: The results of this procedure will differ slightly between runs, since the permutations of the complex annotation table are drawn anew each time. Thus, it is recommended to set a random number generator seed to get reproducible results.
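The idea of using permuted annotations as a null reference can be sketched in base R. The following is a simplified illustration, not the Rtpca implementation: the melting curves are simulated from a toy sigmoid, the annotation table and all object names are hypothetical, and an empirical p-value stands in for the package's ROC-based evaluation.

```r
set.seed(1)  # fix the seed so the permutations are reproducible

# Made-up melting curves: 12 proteins x 6 temperatures
temps <- c(37, 44, 50, 56, 62, 67)
melt <- function(tm) 1 / (1 + exp((temps - tm) / 2))  # toy sigmoid curve
tms <- c(48, 48.2, 47.9, 58, 58.3, 57.8, 44, 52, 55, 61, 50, 64)
curves <- t(sapply(tms, melt))
rownames(curves) <- paste0("P", seq_along(tms))

# Hypothetical complex annotation with the columns 'protein' and 'id'
anno <- data.frame(
  protein = paste0("P", 1:12),
  id = rep(c("complexA", "complexB", "other1", "other2"), each = 3),
  stringsAsFactors = FALSE
)

# Average pairwise Euclidean distance between the curves of each complex
mean_dist_per_complex <- function(anno, curves) {
  sapply(split(anno$protein, anno$id),
         function(members) mean(dist(curves[members, , drop = FALSE])))
}

observed <- mean_dist_per_complex(anno, curves)

# Null reference: shuffle which proteins belong to which complex id
n_perm <- 200
null_dists <- replicate(n_perm, {
  shuffled <- transform(anno, protein = sample(protein))
  mean_dist_per_complex(shuffled, curves)
})

# Empirical p-value per complex against the pooled null distances
p_emp <- sapply(observed, function(d) {
  (sum(null_dists <= d) + 1) / (length(null_dists) + 1)
})
```

Without the call to set.seed, null_dists and therefore p_emp would change slightly from run to run, which mirrors the note above.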
We can now inspect significantly co-melting protein complexes. The same analysis is then run on the M phase dataset. We can see that the predictive performance of this dataset for protein complexes is not quite as good as for the G1/S one, but still pretty decent.

Based on the protein complexes which we find significantly assembled in either condition, we select the protein-protein interactions to test in a differential TPCA: we load the annotation of protein-protein interactions within complexes, which is composed of PPIs from StringDb (Szklarczyk et al., 2019) and the complex annotation by Ori et al. (2016), and filter it for protein complexes that we have seen to coaggregate in the analysis above.
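The filtering step can be sketched in base R as follows. All tables, column names (id, p_adj, x, y) and the 10% threshold are hypothetical placeholders chosen for illustration; they are not the column names of the annotation shipped with Rtpca.

```r
# Hypothetical per-condition complex results (made-up adjusted p-values)
res_g1s <- data.frame(id = c("c1", "c2", "c3", "c4"),
                      p_adj = c(0.01, 0.20, 0.04, 0.60),
                      stringsAsFactors = FALSE)
res_m <- data.frame(id = c("c1", "c2", "c3", "c4"),
                    p_adj = c(0.30, 0.03, 0.50, 0.70),
                    stringsAsFactors = FALSE)

alpha <- 0.1  # assumed significance threshold

# Complexes that coaggregate significantly in either condition
sig_ids <- union(res_g1s$id[res_g1s$p_adj < alpha],
                 res_m$id[res_m$p_adj < alpha])

# Hypothetical annotation of PPIs within complexes
complex_ppis <- data.frame(
  x  = c("A", "A", "D", "F", "H"),
  y  = c("B", "C", "E", "G", "I"),
  id = c("c1", "c1", "c2", "c3", "c4"),
  stringsAsFactors = FALSE
)

# Keep only PPIs that belong to a significantly coaggregating complex
ppis_to_test <- complex_ppis[complex_ppis$id %in% sig_ids, ]
```

Only the PPIs in ppis_to_test would then enter the differential test, which keeps the number of hypotheses small.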

PPI-centric analysis
For the PPI-centric analysis, we first load PPIs annotated by StringDb (Szklarczyk et al., 2019). As for the complex-centric analysis, we get back a tpcaResult object.

Note: The AUC for recovering PPIs using the TPCA approach is usually lower than for recovering protein complexes. This is because it is less likely to find three or more proteins showing similar melting curves by chance (protein complex analysis) than it is for two proteins (PPI-based analysis).

Comparison of the results obtained by both strategies
In order to assess how many differential PPIs can be recovered with either of the approaches, we plot a Venn diagram. It appears that each approach picks up a distinct set of differential PPIs. The PPI-centric approach appears to recover more significant PPI changes, whereas the complex-centric one reveals more changes in intra-complex interactions.
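The counts behind such a Venn diagram can be obtained with base R set operations. The sketch below uses made-up protein pairs; the helper make_pair_key is a hypothetical convenience function that makes pair identifiers independent of the order of the two partners.

```r
# Order-independent identifier for a protein pair (hypothetical helper)
make_pair_key <- function(x, y) paste(pmin(x, y), pmax(x, y), sep = ":")

# Made-up significant differential PPIs from each strategy
complex_centric <- make_pair_key(c("A", "C", "D"), c("B", "A", "E"))
ppi_centric     <- make_pair_key(c("B", "F", "H", "J"), c("A", "G", "I", "K"))

# Region sizes of the two-set Venn diagram
venn_counts <- c(
  complex_only = length(setdiff(complex_centric, ppi_centric)),
  shared       = length(intersect(complex_centric, ppi_centric)),
  ppi_only     = length(setdiff(ppi_centric, complex_centric))
)
```

Here the pair A:B is found by both strategies even though it was supplied as (A, B) in one result and (B, A) in the other.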

Conclusion
Rtpca offers user-friendly exploration of TPP datasets for PPIs and allows assessing significantly changing PPIs across different conditions. We exemplify here how this can be done using the TPP dataset of different phases of the human cell cycle (Becher et al., 2018), from which we recover several differentially coaggregating protein pairs that are known to change during these phases. A remaining challenge in the analysis is the high number of hypothesis tests that have to be performed, which requires multiple testing adjustment and limits sensitivity. In the