ProAct: quantifying the differential activity of biological processes in tissues, cells, and user-defined contexts

Abstract The distinct functions and phenotypes of human tissues and cells derive from the activity of biological processes that varies in a context-dependent manner. Here, we present the Process Activity (ProAct) webserver that estimates the preferential activity of biological processes in tissues, cells, and other contexts. Users can upload a differential gene expression matrix measured across contexts or cells, or use a built-in matrix of differential gene expression in 34 human tissues. Per context, ProAct associates gene ontology (GO) biological processes with estimated preferential activity scores, which are inferred from the input matrix. ProAct visualizes these scores across processes, contexts, and process-associated genes. ProAct also offers potential cell-type annotations for cell subsets, by inferring them from the preferential activity of 2001 cell-type-specific processes. Thus, ProAct output can highlight the distinct functions of tissues and cell types in various contexts, and can enhance cell-type annotation efforts. The ProAct webserver is available at https://netbio.bgu.ac.il/ProAct/.


INTRODUCTION
Huge advancements in the understanding of human phenotypes in health and disease were achieved owing to transcriptomic analyses of human tissues and cells (1,2). However, transcriptomic measurements encompass tens of thousands of distinct molecules, resulting in detailed, comprehensive datasets that are not easy to interpret. This led to the development of module- or systems-based approaches that uncover the inner workings of physiological systems (3,4). A powerful concept that has been used widely is that of a biological process, defined as an ensemble of gene products that executes a specific biological function, such as replication or protein degradation. By connecting subsets of genes with specific functions, biological processes offer a clearer view of the state and functionality of tissues and cells (3,5).
Acknowledging their importance, biological processes have been defined and annotated by various organizations. Prominent resources include the Gene Ontology (GO) (6), Reactome (7), Kyoto Encyclopedia of Genes and Genomes (8), WikiPathways (9) and Human MSigDB collections (10). Each resource contains hundreds to thousands of biological processes and their associated genes. However, information on the activity of processes in different biological contexts is not part of these resources. A typical approach to estimate the activity or relevance of biological processes in different contexts is enrichment analysis of biological processes in transcriptomic profiles (11). It was shown that certain biological processes are active in a few biological contexts, e.g. stress response pathways in challenging environments (12,13) and differentiation processes in development (14). Other methods to estimate the activity of biological processes harnessed knowledge of the molecular interactions between process-associated genes (15,16), or focused on comparing between specific contexts, such as a tumor sample to a reference set of samples (17). Furthermore, several webtools provide process-related information and allow exploration and visualization of processes in tissues and cells (Table 1). The Reactome database (7) and the Signaling Pathways Project (SPP) (18) include representation of the expression levels of process-associated genes in tissues; however, this representation is not comparative between tissues. The HumanBase database (19) is highly informative; however, its queries are limited to genes. The scTPA webserver (20) provides pathway activity profiles inferred from gene set enrichment analyses for user-defined single cell transcriptomics data, yet has limited query options and web functions.
We recently reported a method to identify biological processes that were preferentially active or under-expressed in specific contexts, denoted Tissue Process Activity (TiPA) (21). TiPA was applied to GO biological processes, owing to the large number of annotated processes and genes in GO and its uniform nomenclature across organisms. Per context, TiPA associated a process with the mean differential expression of its genes relative to other contexts. We showed via analysis of 1579 processes in 34 human tissues that TiPA was able to identify tissue-specific processes, and outperformed another method (17). Application of TiPA to transcriptomic profiles of 108 cell subsets (22) revealed that preferentially active processes often matched cell type identities, suggesting that they could facilitate the annotation of uncharacterized cell subsets.
Here we report the ProAct webserver for applying the TiPA method to user-defined transcriptomics data. Given a differential expression matrix inferred from bulk or single cell transcriptomics, the webserver computes the preferential activity of GO biological processes per context. ProAct can be queried by process, context, or gene through a user-friendly interface. Its visualizations of the output allow users to quickly grasp the preferential activity of biological processes in various contexts (Figure 1). ProAct also provides precomputed preferential activities of GO biological processes and MSigDB hallmark gene sets in 34 human tissues, based on transcriptomic data from GTEx (1). Lastly, given a user-defined differential expression matrix of cell subsets, ProAct combines the preferential activity of 2001 cell-type-specific processes to offer likely cell-type annotations. ProAct is freely available at https://netbio.bgu.ac.il/ProAct/.

Biological processes
Data of GO biological process terms and process-associated human genes were downloaded from Ensembl BioMart on 3 October 2020 (GRCh38.p13) ( 23 ). Data of MSigDB hallmark gene sets were downloaded from the 'Human MSigDB Collections' on 24 April 2023 ( 10 ).

ProAct scores
ProAct scores (also denoted TiPA scores) were computed according to the TiPA method (21). The computation relies on (i) an input matrix of the differential expression (typically log2 fold-change values) of genes (rows) per context (columns) and (ii) a list of processes and their associated genes. For each process p and context c, the ProAct score is set to the mean differential expression in c of the genes associated with p. There might be cases where a single or few process-associated genes are dramatically differentially expressed relative to all other process-associated genes. To reduce the impact of such outliers on a ProAct score we trimmed the mean, such that process-associated genes with the 10% most extreme differential expression values were excluded from the computation.
Figure 1D (caption fragment). The five topmost preferentially active processes in liver were related to liver functions: 'triglyceride-rich lipoprotein particle remodeling' (P1), 'negative regulation of very-low-density lipoprotein particle remodeling' (P2), 'monocarboxylic acid metabolic process' (P3), 'monoterpenoid metabolic process' (P4) and 'glyoxylate catabolic process' (P5).
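The trimmed-mean scoring described above can be illustrated with a minimal sketch. It is not the webserver's implementation: the function names, the nested-dictionary input layout and the toy values are hypothetical, and "most extreme" is read here as farthest from the median of the process-associated genes, which is an assumption about how the 10% trimming is applied.

```python
from statistics import mean, median


def trimmed_mean(values, trim_fraction=0.10):
    """Mean after dropping the `trim_fraction` most extreme values.

    Assumption: "extreme" means farthest from the median of `values`.
    """
    # Small epsilon guards against float round-off such as 0.1 * 30 = 2.999...
    n_drop = int(len(values) * trim_fraction + 1e-9)
    if n_drop == 0:
        return mean(values)
    center = median(values)
    by_distance = sorted(values, key=lambda v: abs(v - center))
    return mean(by_distance[:len(values) - n_drop])


def proact_scores(diff_expr, processes, trim_fraction=0.10):
    """Score every process in every context.

    diff_expr: {gene: {context: log2 fold-change}} (complete matrix assumed)
    processes: {process name: [associated genes]}
    Returns {process name: {context: score}}.
    """
    contexts = sorted({c for row in diff_expr.values() for c in row})
    scores = {}
    for process, genes in processes.items():
        measured = [g for g in genes if g in diff_expr]
        if not measured:
            continue  # no associated gene appears in the input matrix
        scores[process] = {
            c: trimmed_mean([diff_expr[g][c] for g in measured], trim_fraction)
            for c in contexts
        }
    return scores


# Toy input (hypothetical values): ten genes, one of them a strong outlier in liver.
genes = [f"g{i}" for i in range(10)]
diff_expr = {g: {"liver": 1.0, "muscle": -0.5} for g in genes}
diff_expr["g9"]["liver"] = 11.0
processes = {"toy process": genes}

scores = proact_scores(diff_expr, processes)
print(scores)
```

In this toy case the untrimmed liver mean would be 2.0, driven by a single gene, whereas the trimmed score stays at the value shared by the other nine genes.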

ProAct scores in human tissues
The ProAct webserver offers precomputed ProAct scores that estimate the preferential activity of GO processes and MSigDB hallmark gene sets in 34 human tissues. Transcriptomic profiles were obtained from GTEx (1). Brain sub-regions were grouped into six anatomically-related tissues denoted brain0-brain5 (Supplementary Table S2) (24). The GO-based precomputed dataset included ProAct scores for 6939 GO processes with 3-100 expressed genes and a z-score-derived P-value (21). P-values were adjusted for multiple hypotheses testing using the Benjamini-Hochberg procedure.
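For readers unfamiliar with the multiple-testing adjustment mentioned above, the following is a minimal sketch of the standard Benjamini-Hochberg procedure. The function name and the example P-values are hypothetical and are not taken from ProAct; how the z-score-derived P-values themselves are obtained is described in (21) and is not reproduced here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values are monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for four processes in one tissue.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each adjusted value is the raw P-value scaled by the number of tests over its rank, capped so that a smaller raw P-value never receives a larger adjusted value than a bigger one.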

ProAct analysis of cell subsets
scRNA-sequencing profiles of fetal human tissues were obtained from (22), and included the number of reads per gene in 172 cell subsets. The differential expression matrix contained the preferential expression of each gene in each cell subset relative to all other cell subsets (25). ProAct scores were computed as described above.
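To make the notion of expression "relative to all other cell subsets" concrete, the sketch below computes a simple log2 ratio of each subset against the mean of the remaining subsets. This is a deliberate simplification for illustration only and is not the method of (25): the function name, the pseudocount, the input layout and the toy values are all assumptions, and a real analysis would use a dedicated differential expression tool on the read counts.

```python
import math


def relative_log2fc(mean_expr, pseudocount=1.0):
    """Log2 ratio of each subset versus the mean of all other subsets.

    mean_expr: {gene: {subset: normalized mean expression}}, with at least
    two subsets per gene. The pseudocount (an assumption) avoids division
    by zero for unexpressed genes.
    """
    result = {}
    for gene, per_subset in mean_expr.items():
        result[gene] = {}
        for subset, value in per_subset.items():
            others = [v for s, v in per_subset.items() if s != subset]
            baseline = sum(others) / len(others)
            result[gene][subset] = math.log2(
                (value + pseudocount) / (baseline + pseudocount)
            )
    return result


# Toy input (hypothetical values): one subset-specific gene, one uniform gene.
mean_expr = {
    "specific_gene": {"subset1": 7.0, "subset2": 1.0, "subset3": 1.0},
    "uniform_gene": {"subset1": 3.0, "subset2": 3.0, "subset3": 3.0},
}
fold_changes = relative_log2fc(mean_expr)
print(fold_changes)
```

A gene expressed uniformly across subsets receives a value of zero everywhere, so it contributes nothing to the score of any process in any subset.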

ProAct subset annotation
To facilitate annotation of cell subsets, we identified cell-type-specific GO biological process terms, as described in (21). Specifically, cell types were matched with processes whose name or description contained cell-type-related words using text-mining. For example, B cells were matched with the process 'marginal zone B cell differentiation'. Altogether, we matched between 2001 GO biological processes and 35 cell types (Supplementary Table S1, also downloadable from the ProAct website). Next, GO terms that were associated with the same cell type were added to a cell-type process group (Figure 2A). For example, 'regulation of fat cell differentiation' and 'fat pad development' were added to a 'fat cells process group'. Given an input matrix of the differential expression of genes in cell subsets, ProAct scores are computed per cell subset as described above. Next, per cell subset, the score of every cell-type process group is set to the mean ProAct score of the processes composing the group.
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W481
Figure 2. ProAct annotation of cell subsets. (A) GO process terms were associated with cell types and grouped accordingly into cell-type process groups. For example, the terms 'Schwann cell differentiation' and 'Schwann cell development' were associated with Schwann cells and grouped into a 'Schwann cell process group'. The terms 'regulation of microglial cell activation' and 'microglial cell migration' were associated with microglia cells and grouped into a 'microglia process group'. (B) ProAct suggested annotations for six cell types (X axis). Each dot represents the mean ProAct score (Y axis) of a distinct cell-type process group. The top-ranking cell-type process groups in each cell type were indicative of cell-type identities: microglia ranked first in cerebellum-microglia cells; oligodendrocytes ranked first in cerebellum-oligodendrocytes; megakaryocytes ranked first in kidney-megakaryocytes; mesangial stem cells ranked second in kidney-mesangial cells; Schwann cells ranked first in muscle-Schwann cells; skeletal muscle cells ranked first in muscle-skeletal muscle cells.
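The grouping and scoring steps described in this section can be sketched as follows. The sketch is a toy approximation: plain substring matching stands in for the text-mining step, the keyword lists, term identifiers (T1 to T6) and scores are hypothetical, and only the process names are taken from the examples in the text.

```python
def build_process_groups(term_names, cell_type_keywords):
    """Group process terms by the cell type whose keywords appear in their name.

    term_names: {term id: process name}
    cell_type_keywords: {cell type: [keywords]} (hypothetical keyword lists)
    """
    groups = {cell_type: [] for cell_type in cell_type_keywords}
    for term_id, name in term_names.items():
        lowered = name.lower()
        for cell_type, keywords in cell_type_keywords.items():
            if any(keyword.lower() in lowered for keyword in keywords):
                groups[cell_type].append(term_id)
    return groups


def group_scores(process_scores, groups):
    """Mean ProAct score of each cell-type process group in one cell subset."""
    scores = {}
    for cell_type, members in groups.items():
        scored = [process_scores[t] for t in members if t in process_scores]
        if scored:
            scores[cell_type] = sum(scored) / len(scored)
    return scores


def rank_cell_types(scores):
    """Cell types ordered from highest to lowest group score."""
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder identifiers; the names are the examples given in the text.
term_names = {
    "T1": "Schwann cell differentiation",
    "T2": "Schwann cell development",
    "T3": "regulation of microglial cell activation",
    "T4": "microglial cell migration",
    "T5": "regulation of fat cell differentiation",
    "T6": "fat pad development",
}
cell_type_keywords = {
    "Schwann cells": ["schwann cell"],
    "microglia": ["microglia"],
    "fat cells": ["fat cell", "fat pad"],
}
groups = build_process_groups(term_names, cell_type_keywords)

# Hypothetical ProAct scores of the six processes in a single cell subset.
subset_scores = {"T1": 2.0, "T2": 1.0, "T3": -0.5, "T4": 0.5, "T5": 0.0, "T6": -1.0}
ranking = rank_cell_types(group_scores(subset_scores, groups))
print(groups)
print(ranking)
```

Averaging over a group means that a single high-scoring process does not determine the suggested cell type on its own, which is the rationale for scoring groups and not individual terms.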

RESULTS
The ProAct webserver estimates the preferential activity of biological processes in different contexts. This is achieved by scoring a process in a given context by the mean differential expression of its genes in that context relative to other contexts (21). Hence, in a given context, positive ProAct scores indicate processes that are preferentially active, whereas negative ProAct scores indicate downregulated processes. Below we describe ProAct and its application to bulk and single cell differential expression data.

ProAct can illuminate context-dependent processes
ProAct analysis starts with a user-defined differential expression matrix, whereby each entry corresponds to the differential expression of a gene in one context relative to other contexts. Next, per context, ProAct calculates the preferential activity of GO biological processes (Methods). ProAct can be queried by process, gene, or context (Figure 1A). The output of process queries is a heatmap representation of the preferential activity of the process across contexts (Figure 1B). The output of gene queries is the preferential activity of gene-related processes per context, also presented as a heatmap (Figure 1C). The output of context queries includes the top preferentially active processes in that context, and is presented as a bar plot (Figure 1D). The different outputs are interactive, and users can run subsequent queries. For example, users can query ProAct by a gene, and then run a subsequent query on one of the gene-related processes.
To demonstrate the different queries, we used a dataset of differential expression of genes in 34 human tissues (21). We first queried this dataset for processes related to skeletal muscle contraction to obtain their ProAct scores across tissues. As expected, top ProAct scores were obtained in skeletal muscle, in which contraction is a key function (Figure 1B). Next, we queried the dataset by the Duchenne muscular dystrophy gene (DMD). The output included ProAct scores of 43 DMD-related processes in each of the 34 tissues (Figure 1C). The top preferentially active process was 'muscle filament sliding' in skeletal muscle, which was shown to be perturbed in the disease (32). Lastly, we queried this dataset by tissue, specifically liver. The five topmost preferentially active processes were all related to liver function, among which were 'triglyceride-rich lipoprotein particle remodeling' and 'negative regulation of very-low-density lipoprotein particle remodeling' (Figure 1D). To facilitate similar analyses, ProAct has a dedicated tissue-analysis interface and a built-in differential gene expression matrix (21).
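The three query types can be thought of as simple lookups over a table of ProAct scores. The sketch below is a hypothetical illustration of that logic and does not describe the webserver's interface or code: the function names, the placeholder genes GENE_A and GENE_B and all score values are invented, and only the process names and the association of DMD with 'muscle filament sliding' come from the text.

```python
def query_process(scores, keyword):
    """Scores across contexts for processes whose name contains `keyword`."""
    return {p: row for p, row in scores.items() if keyword.lower() in p.lower()}


def query_gene(scores, processes, gene):
    """Scores across contexts for processes associated with `gene`."""
    return {
        p: scores[p]
        for p, genes in processes.items()
        if gene in genes and p in scores
    }


def query_context(scores, context, top_n=5):
    """The `top_n` highest-scoring processes in one context."""
    ranked = sorted(scores.items(), key=lambda item: item[1][context], reverse=True)
    return [(process, row[context]) for process, row in ranked[:top_n]]


# Hypothetical score table: {process: {context: ProAct score}}.
scores = {
    "muscle filament sliding": {"liver": -0.8, "skeletal muscle": 3.1},
    "skeletal muscle contraction": {"liver": -0.6, "skeletal muscle": 2.7},
    "glyoxylate catabolic process": {"liver": 2.4, "skeletal muscle": -0.3},
}
# Gene lists are placeholders except for DMD.
processes = {
    "muscle filament sliding": ["DMD", "GENE_A"],
    "skeletal muscle contraction": ["GENE_A"],
    "glyoxylate catabolic process": ["GENE_B"],
}

by_process = query_process(scores, "muscle")
by_gene = query_gene(scores, processes, "DMD")
top_liver = query_context(scores, "liver", top_n=1)
print(by_process, by_gene, top_liver, sep="\n")
```

Process and gene queries return full rows of scores, which is the shape a heatmap needs, whereas the context query returns a ranked list suited to a bar plot.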

ProAct can aid in cell-type annotation
ProAct analysis starts with a user-defined differential expression matrix of genes in different cell subsets, in a format that matches the output of the Seurat toolkit (26). Next, per cell subset, ProAct calculates the preferential activity of GO biological processes (Methods). ProAct can be queried by cell subset to reveal the top preferentially active processes in that subset. Additionally, ProAct can suggest cell-type annotations (21). For this, we associated 2001 GO process terms with 35 matching cell types (Methods). For example, the terms 'Schwann cell differentiation' and 'Schwann cell development' were associated with Schwann cells; the term 'Microglial cell migration' was associated with microglia (Figure 2A). Per cell subset, ProAct associates each cell-type process group with the mean ProAct score of its processes, considered as the cell-type score. These cell-type scores are reported and presented graphically per cell subset (Figure 2B).
To demonstrate these queries, we applied ProAct to a transcriptomic dataset of six cell types from (22) (Methods). We first applied ProAct to reveal the top preferentially active processes in megakaryocytes. Among the top five processes were 'megakaryocyte differentiation', 'platelet formation', and 'megakaryocyte development', all closely related to megakaryocyte functions. Next, we applied ProAct to suggest cell-type annotations (Figure 2B). In all six cell types, the correct cell-type process group ranked either first (five cell types) or second (one cell type). Hence, the top-ranking cell-type process groups of a given cell subset could illuminate its identity.

DISCUSSION
The ProAct webserver estimates the activity of biological processes in user-defined contexts. In contrast to other tools, such as Reactome (7) or GSEA (11), ProAct does not estimate the absolute activity of a process in a given context. Instead, it estimates the activity of a process in a given context relative to all other contexts. By that, it downplays constitutively active 'housekeeping' processes, and highlights processes with context-dependent activity, whether preferential or downregulated. This relative, context-specific view of process activities facilitates the understanding of context-specific functions and phenotypes (21,24).
Context-specific functions are especially important for deciphering cell identities. Single cell transcriptomics revealed that the human body is composed of hundreds of cell types (2), which further divide into functionally distinct cell states (27,28) and subtypes (29). However, the identity of many of the cell subsets identified via single cell transcriptomics is unclear. Their annotation and functional characterization, which used to rely on a limited set of known marker genes, currently involves more elaborate cell signatures (30). Process-based characterization has proved useful (20,21) (Figure 2B), yet a cell subset can show preferential activity of several unrelated processes. ProAct is unique in addressing this complexity by further estimating the activity of 35 cell-type-specific process groups, resulting in more robust estimation of candidate cell types (Figure 2). The current process groups have limitations. For example, the process 'synaptic transmission' is neuron-related, but this process was not associated with any cell type (Methods). Likewise, pneumocytes were associated with two processes, whereas T cells were associated with 161 processes (Supplementary Table S1). ProAct could be extended to include additional and more expansive cell-type-specific process groups by using more sophisticated techniques (e.g. (31)). Despite these limitations, when tested on 108 cell types, ProAct-based ranking of cell-type process groups filtered out irrelevant cell types and nicely captured cell type identities (21).
In summary, the ProAct webserver provides insightful and user-friendly visualizations of preferential process activities in various contexts, and can act as a supportive tool for functional characterization and annotation of newly identified cell types. Its application to additional gene sets would offer users a broader range of biological contexts to explore.

DATA AVAILABILITY
ProAct is freely available at https://netbio.bgu.ac.il/ProAct/. The data underlying this article are available in the article and in its online supplementary material.