PCAT: an integrated portal for genomic and preclinical testing data of pediatric cancer patient-derived xenograft models

Abstract Although cancer is the leading cause of disease-related mortality in children, the relative rarity of pediatric cancers poses a significant challenge for developing novel therapeutics to further improve prognosis. Patient-derived xenograft (PDX) models, which are usually developed from high-risk tumors, are a useful platform to study molecular driver events, identify biomarkers and prioritize therapeutic agents. Here, we develop PDX for Childhood Cancer Therapeutics (PCAT), a new integrated portal for pediatric cancer PDX models. Distinct from previously reported PDX portals, PCAT is focused on pediatric cancer models and provides intuitive interfaces for querying and data mining. The current release comprises 324 models and their associated clinical and genomic data, including gene expression, mutation and copy number alteration. Importantly, PCAT curates preclinical testing results for 68 models and 79 therapeutic agents manually collected from individual agent testing studies published since 2008. To facilitate comparisons of patterns between patient tumors and PDX models, PCAT curates clinical and molecular data of patient tumors from the TARGET project. In addition, PCAT provides access to gene fusions identified in nearly 1000 TARGET samples. PCAT was built using R-shiny and MySQL. The portal can be accessed at http://pcat.zhenglab.info or http://www.pedtranscriptome.org.


INTRODUCTION
Cancer is the leading cause of disease-related mortality in children. Approximately 300 000 children under age of 14 are diagnosed with cancer globally each year (1). In 2019, ∼11 000 new diagnoses were reported in the United States, with ∼1200 disease-caused deaths (2). Over the last five decades, intensive treatments combining surgical resection, radiotherapy and chemotherapy have significantly improved the outcomes of pediatric cancer. For instance, 5-year survival rate has increased from 58% in mid-1970s to 83% in 2014, with the mortality rates declining by 65% from 1970 to 2016 (2). However, prognosis for relapse patients remains poor, and intensive treatments cause longterm health problems such as secondary cancers, cardiovascular diseases, cognitive disabilities for brain tumors, etc. (3). Many pediatric cancers such as Ewing's sarcoma lack targeted therapy. Therefore, continuous efforts on finding new therapeutic targets and developing less toxic treatments for children with cancer are important for further improving prognosis and mitigating long-term health problems for survivors.
Pediatric cancers constitute ∼1% of annual new cancer diagnoses. This small population can be further split into many disease entities; thus, each has only a very small number of cases. This rarity poses a significant challenge for translational research, as collecting and testing agents in patients, especially those of ultra-rare subtypes, is difficult. Patient-derived xenograft (PDX) models have been used for the past four decades to alleviate these difficulties. These models are generated by implanting patient tumors into immune-deficient rodents and have been shown to retain histological and genomic features of the original tumors (4,5). Preclinical testing of these models to therapeutic agents has generated highly valuable insights to guide clinical trials in patients. Moreover, advances in sequencing and other high-throughput technologies now allow comprehensive molecular characterization of these models (6). The resulting genomic profiles provide a repertoire for guiding development of targeted therapies, identifying biomarkers for drug sensitivity and understanding the genetic basis of resistance.
Here, we introduce PDX for Childhood Cancer Therapeutics (PCAT), a new database of pediatric cancer PDX models. PCAT currently stores information of 324 PDX models spanning all major cancer types seen in children, including some very rare subtypes. Of these models, 309 have at least one type of genomic profiling data (somatic mutation, n = 289; expression/fusion, n = 244; copy number, n = 282). Preclinical testing data are available for 68 models across 79 therapeutic agents. To facilitate comparisons of PDXs and patient tumors, PCAT curated clinical and molecular data from TARGET so that patterns learned from PDXs can be easily replicated in patient tumors. User-friendly interfaces were constructed for searching and data mining. PCAT is freely available without the need for registration at http://pcat.zhenglab.info or http: //www.pedtranscriptome.org.
Mutation and copy number data were curated for 289 models from either PPTP (Pediatric Preclinical Testing Program) (7) or PPTC (Pediatric Preclinical Testing Consortium) (8). Gene level copy number changes were obtained by discretizing copy number values into homozygous deletion (−2), heterozygous deletion (−1), neutral (0), gain (1) and amplification (2) using GISTIC2 (9). RNA sequencingbased gene expression and fusions were obtained from PPTC (8). Preclinical testing results of 68 models over 79 therapeutic agents were manually collected from individual agent testing studies published since 2008. Drug responses are categorized into six levels: progressive disease 1, progressive disease 2, stable disease, partial response, complete response and maintained complete response. Detailed explanation of these six drug response levels for solid tumors and blood cancers can be found on the 'documentation' page of the website. A summary of PDX molecular data is shown in Figure 1B. A different visualization (UpSet plot) is provided in Supplementary Figure S1.
In addition to PDX models, PCAT also curated patient tumor data from TARGET (Supplementary Figure S2).
These data enable users to examine patterns observed from PDXs in patient tumors, and to further perform analyses that are not feasible using models such as survival analysis. Clinical, mutation and expression data of the TARGET dataset were downloaded from GDC data portal (Supplementary Table S1). Copy number segmentation files were downloaded from TARGET Data Matrix and were further analyzed by GISTIC2 to ensure compatibility with those of PDXs (9). Only high-confidence mutations called by at least two callers [MuSE (10), MuTect2 (11), So-maticSniper (12) and VarScan2 (13)] were included in our database.
Gene fusions are a very important group of cancer drivers, particularly for childhood cancer. To catalog gene fusions as a community resource, we employed our inhouse fusion caller PRADA (14) and a well-benchmarked tool STAR-Fusion (15) on 943 TARGET samples (Supplementary Table S2). This analysis identified a total of 8912 fusions by the two callers, with 3718 by PRADA, 5980 by STAR-Fusion and 786 called by both callers. We benchmarked our fusion identification using driver fusion events annotated for some ALL and acute myeloid leukemia (AML) patients in the clinical data. Of the 234 driver fusion events, PRADA identified 209 (89%) and STAR-Fusion identified 194 (83%). Combined, these two tools identified 90% of the total 234 fusions ( Figure 2A). We next broke down these fusions by cancer type and sample type. As expected, considerable heterogeneity was observed in fusion loads in each cancer type ( Figure 2B). Interestingly, post-treatment AML samples demonstrated significantly higher number of fusions than primary and recurrent samples (both P-values <0.001, Wilcoxon rank sum test), consistent with the anticipation that cytotoxic chemotherapy causes DNA breaks leading to increased fusion rates. Few post-treatment samples were available for other cancer types and thus were excluded from this analysis.

WEB INTERFACE AND DATA DOWNLOAD
PCAT web interface is organized into a resource and two analysis sections. Each analysis section consists of several functional modules. The resource section is the interface to the major data stored on PCAT, including the 324 PDXs and gene fusions. When searching for PDXs, users can specify histology, mutation, gene fusion and drug treatment as search criteria. Results will be returned in a tabular format divided into clinical information, mutation, fusion and preclinical testing if available. An example of the PDX summary page is shown in Figure 3A. Similarly, fusion search results will be returned also in a tabular format listing fusion and the case ID where this fusion is found. Clicking each fusion links to a page that summarizes the prevalence of the fusion in disease cohorts. The fusion detail page displays technical parameters of the fusion identification, a circos plot illustrating all fusions identified in the sample and the expression of the two partner genes ( Figure 3B).
The analysis modules are designed to enable exploration of genomic and preclinical data of PDXs and patient tumors. In the current release, PCAT is focused on gene expression data analysis because childhood cancers harbor far fewer mutations than their adult counterparts according   to recent large-scale genomic studies (16,17). 'Single gene analysis' modules allow users to correlate expression of the input gene with histopathological parameters, patient prognosis and genomic alterations (copy number and mutation). It also allows users to find genes that share co-expression patterns with the input gene in a selected dataset. 'Multiple gene analysis' modules allow users to visualize expression patterns of the input genes ('visualization' module). The 'single sample gene set enrichment analysis (ssGSEA)' module allows users to aggregate the expression of an input gene set into a single score. Importantly, the module integrates drug response data, thus allowing the correlation of pathway activity with PDX responses to therapeutic agents. The co-expression module returns pairwise expression correlation based on the selected dataset. If users choose to remove lineage effect, PCAT will z-score transform the expression data for each tissue of cancer origin before calculating expression correlation. Finally, PCAT allows correla-

tion of gene expression and mutation with preclinical testing results in the 'preclinical testing' module.
For all modules, PCAT provides download links for returned results so that users can have the opportunity to reproduce and customize figures for publications and other purposes. Fusion results for TARGET can be found in Supplementary Table S2.

ANALYSIS MODULES
We use examples to demonstrate the utility of the analysis modules in generating and testing hypotheses. First, we show expression of PDGFRA, a marker of mesenchymal stem cells, in PDX models ( Figure 4A). As expected, PDGFRA is highly expressed in cancers of mesenchymal origin, including extracranial rhabdoid cancer and osteosarcoma. This pattern is replicated across TARGET cohorts (Supplementary Figure S3). Next, we show the correlation between CDKN2A deletion and expression. CDKN2A is a well-established tumor suppressor and is frequently deleted in cancer. In ALL models, CDKN2A deletion is strongly associated with decreased expression (Figure 4B). The same pattern is observed in osteosarcoma (data not shown), a cancer type with high genomic instability (18). These data collectively show DNA deletion is a common mechanism to inactivate CDKN2A in cancer.
The 'survival analysis module' allows users to perform survival analysis using both clinical parameters and gene expression. PCAT supports univariate and multivariate survival analyses by deploying the R package survminer. When no query gene is input, users can perform survival analysis using clinical parameters. To correlate gene expression with clinical outcome, PCAT provides four methods to divide gene expression into groups, including auto-calculated threshold, mean value, median value and customized cut-off. The auto-calculated threshold is calculated by testing all possible cutoff values between the top 20% and bottom 20% samples based on the expression of the gene and adopts the value that best separates the clinical outcomes of the high and low groups. A Kaplan-Meier plot is generated for visualization. Users may also choose one or several clinical parameters as covariates to conduct multivariate survival analysis. A forest plot will be generated to show the hazard ratio of each variate generated in the analysis. To demonstrate the utility of this module, we use TERT and neuroblastoma as an example. High TERT expression, an indicator of active telomerase, predicts high-risk tumors in neuroblastoma (19). Using either mean or median to split the cohort, PCAT shows high TERT expression is significantly associated with worse overall survival in the TARGET neuroblastoma dataset ( Figure 4C). This correlation holds even when MYCN amplification is added as a covariate to the analysis (Supplementary Figure S4), suggesting TERT expression is an independent prognostic factor.
We use a list of 20 MYCN targets identified by shRNA screening (20) to demonstrate utility of the 'ssGSEA' module. We first ran ssGSEA using these genes in PDXs. The output clearly showed higher scores in neuroblastoma models than others, suggesting this group of genes is upregulated in neuroblastoma (Supplementary Figure S5). We then reran ssGSEA using the TARGET neuroblastoma dataset. We observed that MYCN amplification was strongly enriched in samples with higher ssGSEA scores (hence higher expression of the target genes) (Figure 4D), verifying the positive regulation of these genes by MYCN.
Finally, we use 19D12 (also known as SCH717454) inhibitor to demonstrate the utility of the preclinical testing module. 19D12 is a fully human antibody inhibiting the insulin-like growth factor 1 receptor (IGF1R) (21). Correlating IGF1R expression and response to 19D12 reveals  that maintained complete response demonstrated by two osteosarcoma models (OS-1, OS-9) is associated with high IGF1R expression ( Figure 5A). The same pattern was observed when limiting this analysis to osteosarcoma models only ( Figure 5B). OS-2, another osteosarcoma model with much higher IGF1R expression, shows limited response to 19D12 (PD2). Interestingly, this model has much higher expression of IGF1 and IGF2 than the two sensitive models (Supplementary Figure S6). Despite a small sample size, these observations suggest that high IGF1/2 expression may be an escape mechanism to 19D12 inhibition for osteosarcomas.

SUMMARY AND FUTURE DIRECTIONS
In this work, we describe PCAT, a new resource for childhood cancer PDX models. Previously published por-tals such as PDX Finder (22) have a larger repository than PCAT. However, PCAT is distinct in its collection of childhood cancer PDXs. For instance, searching 'neuroblastoma', a malignancy of the peripheral nervous system commonly seen in children, found one model on PDX Finder, but 40 models on PCAT. Among the PCAT functions/features not provided by PDX Finder are the intuitive interface that has been developed to allow users to explore the genomic and preclinical testing data of these models, as well as facilitation of comparisons between patient tumors and PDX models by integrating TARGET datasets into the portal. The gene fusions of nearly 1000 tumors are a unique resource that allows users to inquire and examine genes and their potential involvement in fusion events.
In the current release, no PDXs have genomic data from matched donor tumors. This limits our ability to investi- gate the similarity between these PDXs and their donor tumors. However, previous studies have shown that PDXs resemble patient tumors in histology, growth characteristics and genomic patterns (4)(5)(6). Preclinical testing results of these models were shown to predict for agent activity in children with the same diagnosis (23,24), supporting their use in drug testing regardless of their phylogenetic relationship with parental tumors. Nevertheless, understanding how genetic distance between donor tumors and the resulting PDXs may be correlated with their responses to anti-tumor drugs may have significant implications for using PDXs as a preclinical testing tool. Another limitation of our PDX collection is the lack of longitudinal samples, particularly before and after treatments. Such samples are rarely biopsied in children affected by cancer, making PDX a viable option for conducting cotrials. Insights from such studies may reveal invaluable resistance mechanisms. The next step for PCAT is to expand its PDX collection. More than 100 new models generated by PPTC and CPRIT GCCRI Core (https://gccri.uthscsa.edu/services/pdx-core/) are currently in the pipeline and will be integrated into PCAT. Most of these models were derived from patients of Hispanic ethnicity, thus reflecting a unique demographic patient group in south Texas. Meanwhile, more preclinical testing data and protocols of preclinical testing experiments, including drug doses and schedule of administration, will be gradually added to the portal.
In addition, more functional modules will be added to facilitate data mining and visualization. For instance, new modules will be added to enhance users' ability to explore mutations. More importantly, we will implement functions that allow users to compare their own samples to our PDX models so that results of PDX preclinical testing can be a reference to predict the query sample's sensitivity to therapeutic agents. We envision these new modules will greatly enhance the usability and translational relevance of this resource.