The Progenetix oncogenomic resource in 2021

Huang, Qingyao; Carrio-Cordo, Paula; Gao, Bo; Paloots, Rahel; Baudis, Michael

doi:10.1093/database/baab043

Abstract

In cancer, copy number aberrations (CNAs) represent a type of nearly ubiquitous and frequently extensive structural genome variations. To disentangle the molecular mechanisms underlying tumorigenesis as well as identify and characterize molecular subtypes, the comparative and meta-analysis of large genomic variant collections can be of immense importance. Over the last decades, cancer genomic profiling projects have resulted in a large amount of somatic genome variation profiles, however segregated in a multitude of individual studies and datasets. The Progenetix project, initiated in 2001, curates individual cancer CNA profiles and associated metadata from published oncogenomic studies and data repositories with the aim to empower integrative analyses spanning all different cancer biologies. During the last few years, the fields of genomics and cancer research have seen significant advancement in terms of molecular genetics technology, disease concepts, data standard harmonization as well as data availability, in an increasingly structured and systematic manner. For the Progenetix resource, continuous data integration, curation and maintenance have resulted in the most comprehensive representation of cancer genome CNA profiling data with 138 663 (including 115 357 tumor) copy number variation (CNV) profiles. In this article, we report a 4.5-fold increase in sample number since 2013, improvements in data quality, ontology representation with a CNV landscape summary over 51 distinctive National Cancer Institute Thesaurus cancer terms as well as updates in database schemas, and data access including new web front-end and programmatic data access.

Database URL: progenetix.org

Introduction

Copy number aberrations (CNAs) are present in the majority of cancer types and exert functional impact in cancer development (1, 2). As understanding cancer biologies remains one of the main challenges in contemporary medical and life sciences, the number of studies addressing genomic alterations in malignant diseases continues to grow. Progenetix is a publicly accessible cancer genome data resource (progenetix.org) that aims to provide a comprehensive representation of genomic variation profiles in cancer, through providing sample-specific CNA profiles and associated metadata as well as services related to data annotation, meta-analysis and visualization. Originally established in 2001 with a focus on data from chromosomal Comparative Genomic Hybridization (CGH) studies (3), the resource has progressively incorporated data from hundreds of publications reporting on genome profiling experiments based on molecular cytogenetics (CGH, genomic arrays) and sequencing (whole-genome or whole-exome sequencing—WGS or WES). Since the last publication dedicated to the Progenetix resource in 2014 (4), changes in content and features of the data repository and its online environment have vastly expanded its scope and utility to the cancer genomics community. For data content, additions include the complete incorporation of the previously separate arrayMap data collection (5, 6) and of datasets from external resources and projects such as The Cancer Genome Atlas (TCGA; (7, 8)) or cBioPortal (9), as well as the recurrent collection and re-processing of array-based data from National Center for Biotechnology Information (NCBI)’s Gene Expression Omnibus (GEO) or European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI)’s ArrayExpress (10, 11). Additionally, data content updates have followed the previous methodology of publication-based data extraction where feasible. 
Beyond the data expansion, a tight integration with projects of the Global Alliance for Genomics and Health (GA4GH (12)) and ELIXIR—such as serving for implementation-driven development of the Beacon application programming interface (API) (13)—has led to an extension of the resource’s features as well as adoption and promotion of emerging open data standards.

Here we present the latest updates on data content, structuring, standardization, access and other modifications made to the Progenetix resource.

Data expansion and new features

Genomic profiling data

Over the last two decades, thousands of cancer genome studies have used GEO (14) for deposition of data from array-based experiments. Data from GEO contribute a substantial fraction of the genomic screening data in the Progenetix collection and have again been expanded in both the number of samples and the represented platforms. Additionally, we systematically included suitable data from three more resources: ArrayExpress (15), cBioPortal (16) and the TCGA project (17). As in the previous database updates, we have also included data directly derived from publication supplements and from collaborative projects. Table 1 shows statistics of samples within the major sources. Table 2 reports the overall data growth and sample counts stratified by cancer loci since the last update (4).

Table 1.

Statistics of samples from various data resources

Data source	GEO	ArrayExpress	cBioPortal	TCGA	Total
No. of studies	898	51	38	33	1939
No. of samples	63 568	4351	19 712	22 142	138 663
Tumor	52 090	3887	19 712	11 090	115 357
Normal	11 478	464	0	11 052	23 306
Classifications
ICD-O (Topography)	100	54	88	157	209
ICD-O (Morphology)	246	908	265	140	491
NCIt	346	148	422	182	788
Collections
Individuals	63 568	4351	19 712	10 995	127 549
Biosamples	63 568	4351	19 712	22 142	138 663
Callsets^a	63 568	4351	19 712	22 376	138 930
Variants	5 514 126	118 4170	1 778 096	2 654 065	10 716 093

^a Set of variants from one genotyping experiment. ICD-O, International Classification of Diseases for Oncology; NCIt, National Cancer Institute Thesaurus.

Table 2.

Data growth by cancer loci

Cancer loci	No. in 2014	No. in 2021
Hematopoietic and reticuloendothelial systems	5269	18 482
Lymph nodes	2345	5988
Breast	2271	15 790
Cerebellum	1439	3465
Brain, NOS	1342	6608
Cerebrum	1201	1712
Liver	1180	3237
Stomach	1155	3176
Skin	1073	3343
Connective, subcutaneous and other soft tissues	1058	2526
Kidney	1018	3617
Colon	1001	5182
Ovary	733	3963
Prostate gland	735	4485
Lung and bronchus	699	10 321
Nervous system, NOS	667	926
Urinary bladder	587	1961
Cervix uteri	529	1331
Peripheral nerves incl. autonomous	523	1479
Esophagus	454	1890
Pancreas	426	1620
Thyroid gland	404	1260
Heart, mediastinum and pleura	383	771
Bones, joints and articular cartilage	350	1205
Spleen	278	636
Other	4522	16 268
Total	31 642	115 359

The ‘ArrayExpress Archive of Functional Genomics Data’, hosted by EMBL-EBI, stores functional genomics data submitted by research groups and projects. In this update, we have used our analysis pipeline to incorporate the cancer-related genomic profiles that do not have corresponding GEO entries. Overall, data from ArrayExpress added 3887 samples from 44 projects, which resolve to 143 distinct cancer types according to the National Cancer Institute Thesaurus (NCIt). Similar to the GEO data acquisition procedure, we have used a combination of text mining methods and expert curation for the annotation of technical metadata and biomedical parameters.

The ‘cBioPortal for Cancer Genomics’ is an open-access resource for cancer genomics data, representing different types of molecular screening data from 19 712 samples, derived from 38 studies and mappable to 422 NCIt cancer types. The largest part of the genomic data is based on WES analyses from the Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT) (18) pipeline, with CNA data accessed directly as segment files in genome version hg19/Genome Reference Consortium Human Build 37 (GRCh37). Data were converted into Genome Reference Consortium Human Build 38 (GRCh38) with the ‘segment-liftover’ tool (19), and oncology classifications as well as relevant clinical data were incorporated into our database.

The TCGA project provides a set of multiomics data with extensive structured metadata annotation for a large collection of cancer types, currently through the National Cancer Institute’s Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov). In this update, we incorporated its copy number variation (CNV) profiling data and transformed the relevant clinical information into our system (Figure 1).

Figure 1.

The currently available CNA data points in Progenetix and TCGA. The Progenetix database contains 115 357 cancer samples, with 92 307 mapped to the 51 defined critical nodes in the NCIt ontology tree and 23 050 samples not mapped to the tree (black), whereas the TCGA repository contains 11 090 samples, with 9103 samples mapped and 1987 samples not mapped to the tree (black). Colors of the stacked bar plot (left) match the branch colors on the NCIt ontology tree (right).

Data processing update

Genomic profiling data in Progenetix originates from a large number of studies, which are based on different molecular-cytogenetics- and sequencing-based technologies. In order to maximize qualitative homogeneity of the final CNA calls, we prefer to download source files with the least amount of pre-processing and apply our in-house data processing pipeline from the arrayMap project (5). Currently, our analysis workflow handles the raw-data-based processing for 13 Affymetrix single nucleotide polymorphism (SNP) array platforms, including nine genome-wide arrays—10K (GPL2641), 50K (Hind240 and Xba240; GPL2004 and GPL2005), 250K (Nsp and Sty; GPL3718 and GPL3720), Genome-wide SNP (5.0 and 6.0; GPL6894 and GPL6801, respectively), CytoScan (750K and HD; GPL18637 and GPL16131) arrays (GPL-prefixed platform coding in brackets according to GEO standard)—as well as the four cancer-specific ‘Oncoscan’ arrays - GPL18602, GPL13270, GPL15793 and GPL21558 (accessible through GitHub repository baudisgroup/a.m._process). Our current model treats the most prevalent copy number as the baseline and derives the relative copy number gain and loss per sample based on the assumption that the relative gene dosage imbalance exerts pathophysiological effects in cancer biology.
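The baseline rule described above can be illustrated with a minimal Python sketch. It is not the arrayMap pipeline: it assumes that "most prevalent" means the copy number covering the most bases in a sample, and the function name and toy segments are hypothetical.

```python
from collections import defaultdict


def call_relative_cn(segments):
    """Label segments as gain, loss or neutral relative to the sample baseline.

    `segments` is a list of (chrom, start, end, copy_number) tuples.
    Assumption: the baseline is the copy number covering the most bases.
    """
    covered = defaultdict(int)
    for _chrom, start, end, cn in segments:
        covered[cn] += end - start
    baseline = max(covered, key=covered.get)

    calls = []
    for chrom, start, end, cn in segments:
        if cn > baseline:
            label = "gain"
        elif cn < baseline:
            label = "loss"
        else:
            label = "neutral"
        calls.append((chrom, start, end, label))
    return baseline, calls


# Toy sample (hypothetical values): most of the genome is at three copies.
segments = [
    ("1", 0, 100_000_000, 3),
    ("1", 100_000_000, 150_000_000, 4),
    ("2", 0, 40_000_000, 2),
    ("3", 0, 120_000_000, 3),
]
baseline, calls = call_relative_cn(segments)
```

In this toy sample the two-copy segment is reported as a relative loss, because the dominant state is three copies; this is the sense in which calls are relative to the sample rather than to a fixed diploid reference.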

Allele-specific copy number variation

For the subset of SNP-array-based experiments, where the status of both alleles can be evaluated separately, we have analyzed allele-specific copy number (ASCN) data and incorporated 35 897 loss of heterozygosity (LOH) profiles into the database. ASCN enables new analyses on the same samples. First, at the probe level it gives an overview of the germline variant landscape, as used in determining the ancestry background. Second, it allows detection of LOH events, including copy-number-neutral LOH (CN-LOH) events, which are commonly observed e.g. in hematological malignancies due to a selective process for duplication of minor disease-prone germline alleles (20, 21). Lastly, it acts as a second reference for CNA to counter the variability caused by known wave artifacts from array technologies (22). For all SNP arrays, we have implemented a pipeline to determine the probe-wise B-allele frequency (BAF) of SNP probes and perform subsequent segmentation (23, 24). Subsequently, we use ASCN to assess the ancestry provenance of the samples (25) and store the LOH regions of the samples in our genomic variants database.
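As a rough illustration of how BAF values relate to LOH calls, the following Python sketch computes a simplified BAF as B / (A + B) and flags a segment as LOH when almost no probes retain both alleles. The thresholds, function names and toy signals are assumptions made for this example; the actual pipeline (23, 24) uses dedicated normalization and segmentation methods.

```python
def baf(a_signal, b_signal):
    """Simplified B-allele frequency from two allele signals."""
    total = a_signal + b_signal
    return b_signal / total if total > 0 else None


def classify_segment(baf_values, copy_number, band=(0.15, 0.85), max_het_fraction=0.05):
    """Classify one segment from its probe-wise BAF values.

    Probes with BAF inside `band` are treated as retaining both alleles.
    Both `band` and `max_het_fraction` are illustrative, untuned values.
    """
    values = [v for v in baf_values if v is not None]
    if not values:
        return "no call"
    het = sum(1 for v in values if band[0] < v < band[1])
    if het / len(values) > max_het_fraction:
        return "both alleles retained"
    return "CN-LOH" if copy_number == 2 else "LOH"


# Toy probe signals (A, B) for two hypothetical segments.
loh_probes = [(100, 0), (0, 120), (98, 2), (3, 110)]
het_probes = [(50, 50), (100, 0), (45, 55), (0, 90)]

loh_call = classify_segment([baf(a, b) for a, b in loh_probes], copy_number=2)
het_call = classify_segment([baf(a, b) for a, b in het_probes], copy_number=2)
```

A segment with two copies but no heterozygous probes is the copy-number-neutral case, which total copy number alone cannot reveal.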

Metadata updates

NCIt ontology mapping

Since its establishment, Progenetix has made use of the ‘International Classification of Diseases for Oncology’, 3rd Edition (ICD-O-3) (26) for cancer sample classification. While the combination of the ICD-O Morphology and Topography coding systems depicts diagnostic entities with high specificity, the current ICD-O is limited in its representation of hierarchical concepts and does not easily translate to modern ontologies. In comparison, NCIt (accessible through http://bioportal.bioontology.org/ontologies/NCIT) is a dynamically developed hierarchical ontology, which empowers layered data aggregation and transfer between classification systems and resources. However, due to its comparatively recent development and ongoing expansion, NCIt terms are rarely used in primary sample annotations. In the recent Progenetix update, we performed a data-driven generation of ICD-O to NCIt mappings and added the derived NCIt codes to all (existing and new) samples (mapping available through the GitHub repository ‘progenetix/ICDOntologies’; manuscript in preparation), to take advantage of NCIt’s hierarchical structure for data retrieval, analysis and exchange (Figure 4B).

Figure 4.

Demonstration of further functionality pages: A. Publication search; B. NCIt hierarchical tree navigation. A: Cancer-genomics-associated publications are recorded with the number of samples stratified by technology used. The publications can be filtered by keywords. B: Part of the sample subsets contained in Progenetix under the hierarchical NCIt classification tree, which allows for selection of sample subsets at different levels. C: Users can upload custom segment files for data visualization.

Data summary based on the NCIt hierarchy tree

All cancer samples in Progenetix have been annotated with an NCIt code, currently resulting in 788 distinct NCIt terms. However, as the definition of increasingly specific NCIt terms outruns their incorporation into the hierarchical tree, so far 98 of these terms are not represented in the tree hierarchy. For better illustration, we define 51 prominent nodes under which we summarize and visualize the data collection (see Supplementary material for the selection procedure). This leaves an additional 324 (60 in TCGA) terms that cannot be mapped to the selected nodes, resulting in 23 050 (1987 for TCGA) samples being excluded from the summary tree counts (black bar in the left panel of Figure 1). For terms with multiple occurrences in the tree, we define the preferred path to the selected node by prioritizing morphology-based separation. The sample collection in Progenetix compared to TCGA is summarized with reference to the NCIt coding system (Figure 1; Supplementary Table S1).
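The roll-up of samples to selected summary nodes can be sketched as follows. This is a toy Python illustration, not the procedure from the Supplementary material: the term labels are placeholders rather than real NCIt codes, and each term is assumed to already have a single preferred parent.

```python
from collections import Counter


def nearest_selected(term, parent_of, selected):
    """Walk up the preferred path until a selected node is reached."""
    seen = set()
    while term is not None and term not in seen:
        if term in selected:
            return term
        seen.add(term)
        term = parent_of.get(term)
    return None  # term is outside the tree or above all selected nodes


def summarize(sample_terms, parent_of, selected):
    """Count samples per selected node and the number left unmapped."""
    counts = Counter()
    unmapped = 0
    for term in sample_terms:
        node = nearest_selected(term, parent_of, selected)
        if node is None:
            unmapped += 1
        else:
            counts[node] += 1
    return counts, unmapped


# Placeholder hierarchy (child -> preferred parent); labels are hypothetical.
parent_of = {
    "ductal_breast_carcinoma": "breast_carcinoma",
    "lobular_breast_carcinoma": "breast_carcinoma",
    "breast_carcinoma": "carcinoma",
    "carcinoma": "neoplasm",
    "glioblastoma": "glioma",
    "glioma": "neoplasm",
}
selected = {"breast_carcinoma", "glioma"}
sample_terms = [
    "ductal_breast_carcinoma",
    "lobular_breast_carcinoma",
    "breast_carcinoma",
    "glioblastoma",
    "carcinoma",    # more general than any selected node
    "orphan_term",  # not represented in the tree
]
counts, unmapped = summarize(sample_terms, parent_of, selected)
```

The two unmapped cases in the toy data mirror the two reasons given in the text: terms absent from the tree, and terms that do not fall under any selected node.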

CNV data content by cancer type

With cancer genomes grouped in the 51 NCIt nodes, we assessed their differences in the CNV landscape. The fraction of the genome with a copy number alteration (CNV fraction) varies widely among the cancer types, with a global median of 0.121 (Figure 2; Supplementary Figure S1). Among the most studied cancer types, breast carcinoma shows a CNV profile consistent with an earlier analysis, with frequent chr1q, 8q, 16p, 17q, 20 gain and 8p, 16q, 17p, 18, 22q loss (27); the CNV patterns in cervical (chr3 gain) and colorectal (chr7, 8q, 13, and 20q gain and 8p, 17p, and 18 loss) carcinoma also correspond with previous observations (28), as do those in T-cell non-Hodgkin lymphoma (29), myelodysplastic syndrome (30) and a number of malignant epithelial tumors (31). In addition, we present the genome-wide LOH profiles in the evaluated NCIt nodes, clustered by their LOH landscape (average LOH profiles of the 42 out of 51 nodes with at least 20 samples are shown in Supplementary Figure S2; (32)). The LOH profile of a cancer genome complements its CNV profile with information on allelic loss. Here we highlight a few prominent patterns that have been previously reported: chr3p and 9 in esophageal squamous cell carcinoma (33, 34); chr18q in colorectal carcinoma (35); and chr13q, 16q and 17p in hepatocellular carcinoma (36).
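The CNV fraction of a single sample can be computed as the number of bases covered by gain or loss calls divided by the genome length. The Python sketch below is one way to do this under stated assumptions: overlapping calls on the same chromosome are merged before counting, and the genome length and segments are toy values.

```python
def cnv_fraction(segments, genome_length):
    """Fraction of the genome covered by CNV calls.

    `segments` is a list of (chrom, start, end) gain or loss calls.
    Overlapping calls on one chromosome are merged so no base counts twice.
    """
    by_chrom = {}
    for chrom, start, end in segments:
        by_chrom.setdefault(chrom, []).append((start, end))

    covered = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps or touches the current run
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered / genome_length


# Toy genome of 1000 bases with two overlapping calls on chromosome 1.
toy_segments = [("1", 0, 100), ("1", 50, 150), ("2", 200, 250)]
fraction = cnv_fraction(toy_segments, genome_length=1000)
```

Here 150 merged bases on chromosome 1 and 50 bases on chromosome 2 give a fraction of 0.2; per-type medians such as those in Figure 2 would then be taken over the per-sample values.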

Figure 2.

The genomic CNV fraction across 51 NCIt umbrella nodes. Each dot represents one sample’s CNV fraction, ranging from 0 to 1, and the red horizontal line indicates the median CNV fraction of the respective cancer type. Each cancer type contains between 104 and 11 804 CNV profiles (median 904; see Supplementary Table S1).

Uberon anatomy ontology

While the ICD-O topography system provides organ- and substructure-specific mapping rooted in traditional clinical and diagnostic aspects of a ‘tumor entity’, ‘UBERON’ is a cross-species anatomical structural ontology system closely aligned with developmental processes (37). Its relationship structure allows integrative queries linking multiple databases (e.g. Gene Ontology (38) and Protein Ontology (39)) and description logic query within the same organism (linking related organs) and between model animals and humans. In this resource update, we have mapped all existing ICD-O T codes to ‘UBERON’ terms and additionally provided those as part of the ‘Monarch’ initiative (40), with our latest mapping table (made available through a GitHub repository ‘progenetix/icdot2uberon’).

Provenance by geography

As part of the curated metadata provided in the sample representation, we have included geographic point coordinates for each individual sample. As this information is often missing from individual sample annotations, we have previously applied a mapping procedure to assign the samples’ approximate geographic origins (41, 42). For samples with the submitter’s contact available from repository entries, a default point location in the corresponding city was used—otherwise that of the corresponding author of the associated publication was used. Associated publications were also explored for more detailed descriptions of sample origin. Point coordinates for each city were obtained using the external geographic database GeoNames (www.geonames.org), as detailed previously.

Provenance by ancestry group

While providing a good approximation for the geographic origin of cancer profiling data, which can e.g. be useful for epistemic validation and decision processes, the geographic location of the studies provides limited specificity regarding individual sample provenance, especially when assessing correlations between genomic variants and ancestral population background. Beyond the scope of high-penetrance variants like mutations in the BRCA1/2 (43, 44) or RetinoBlastoma (RB) genes (45) in cancer predisposition, other studies have asserted an influence of genetic background on tumor development (46–49). Previously we have developed a method for deriving ancestry groups from unmasked germline variants in cancer genomes, based on reference populations studied in the 1000 Genomes Project (25). For samples in Progenetix with accessible SNP data, population groups were assigned based on the reference categories mapped to Human Ancestry Ontology terms (Supplementary Table S2). Where available, the respective data are now represented under the ‘populations provenance’ schema for the corresponding biosample entries.

Updated data access modalities

Since the last release, we have adopted the GA4GH data schema standards and migrated to Phenopackets (50)-formatted response delivery with modified data access points in the user interface. Information about API methods is provided through the documentation pages (https://info.progenetix.org/categories/API).

Data standards

In many genomic repositories, databases are structured around experimental outcomes (e.g. variants from DNA sequencing experiments as collections of VCF files). Recent attempts at evaluating sensible meta-schemas for the representation of genomic variants and related biological or technical metadata, especially with respect to empowering data federation over flexible, networked resources, have led to a set of emerging meta-models and data schemas (51). The data storage and representation models for the Progenetix resource have been designed to comply with concepts developed by the previous GA4GH Data Working Group (12, 52) and subsequent GA4GH work streams, documented e.g. by the ‘SchemaBlocks’ initiative (http://schemablocks.org). One of the core concepts is the ‘individual - biosample(s) - variants’ meta-model, which is applicable to cancer-related analyses with potentially multiple samples representing different stages in the course of disease as well as the underlying genomic background. This hierarchical model provides a solid representation of, and connection between, the physical source of the data and the logical genotyping information, and adapts to various scenarios for data aggregation and analysis.
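The hierarchy of this meta-model can be sketched with plain Python dataclasses. The class and field names below are hypothetical and much simpler than the GA4GH or Progenetix schemas; the callset level follows the definition in Table 1 (the set of variants from one genotyping experiment).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variant:
    chrom: str
    start: int
    end: int
    variant_type: str  # e.g. "DUP" or "DEL"


@dataclass
class Callset:
    callset_id: str  # one genotyping experiment
    variants: List[Variant] = field(default_factory=list)


@dataclass
class Biosample:
    biosample_id: str
    description: str
    callsets: List[Callset] = field(default_factory=list)


@dataclass
class Individual:
    individual_id: str
    biosamples: List[Biosample] = field(default_factory=list)

    def all_variants(self) -> List[Variant]:
        """Collect variants across all samples of this individual."""
        return [v for b in self.biosamples for c in b.callsets for v in c.variants]


# One individual with samples from two disease stages (toy identifiers).
primary = Biosample("bs-1", "primary tumor", [
    Callset("cs-1", [Variant("8", 127_000_000, 128_500_000, "DUP")]),
])
relapse = Biosample("bs-2", "relapse", [
    Callset("cs-2", [
        Variant("8", 127_000_000, 128_500_000, "DUP"),
        Variant("17", 0, 10_000_000, "DEL"),
    ]),
])
patient = Individual("ind-1", [primary, relapse])
```

Keeping the individual separate from its biosamples is what allows several samples from different disease stages to share one underlying genomic background.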

User interface

The completely re-designed user interface provides flexibility and versatility in query parameters and types and optimizes the response delivery. Technically, the query interface for retrieval of sample-specific data is built on top of a forward-looking implementation of the GA4GH Beacon API (13) with features from the upcoming version 2 of this standard.

Figure 3 shows the current web interface to perform a CNA query with start and end position ranges and with filter options for cancer type, tissue location, morphology, cell line or geographic location. The top panel of the result page shows a summary with the number of matched samples, variants, calls and the frequency of alleles containing the CNA (Figure 3E). The ‘Phenopackets’ link returns a JSON document of biosamples with the Phenopackets-formatted response. The ‘UCSC region’ link leads to an external University of California Santa Cruz (UCSC) browser track providing an overview of the genomic elements that map to the region of the observed variants. Customized visualization is also available on the linked page ‘visualization options’, e.g. for selected chromosomal regions and grouping by subsets or studies. The lower panel is organized in four sections: (i) The ‘Result’ tab (Figure 3F) shows the genome-wide CNA by the percentage of samples, with yellow (+) as CN gain and blue (−) as CN loss. Below the CNA plot is a table showing the list of subsets as defined by ICD-O-3 and NCIt ontology terms, sorted by the frequency of matched samples within each subset. (ii) The ‘Biosamples’ tab (Figure 3G) shows information on matched biosamples, i.e. description, classifications and external identifiers. The table can be downloaded in JSON or CSV format. Further details of a biosample can be accessed by clicking the biosample id. (iii) The ‘Biosamples Map’ tab (Figure 3H) shows a world map with the matched geographic locations highlighted. (iv) The ‘Variants’ tab (Figure 3I) shows the variant ‘digest’ (concatenated format with chromosome, start and end position, and type of the CNA) and its corresponding biosample and callset. Likewise, the table can be downloaded in JSON or CSV format.
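A range query of this kind can be expressed as start and end brackets around a target region. The Python sketch below builds such brackets and a URL-encoded query string. The parameter names follow the Beacon v1 style (startMin, startMax, endMin, endMax) and may differ from what the Progenetix API expects; the MYC coordinates are approximate GRCh38 values given as an assumption; no endpoint is contacted.

```python
from urllib.parse import urlencode


def bracket_query(chrom, gene_start, gene_end, max_size,
                  variant_type="DUP", assembly="GRCh38"):
    """Build bracket parameters for variants that fully cover a region."""
    if gene_end - gene_start > max_size:
        raise ValueError("target region is larger than max_size")
    return {
        "assemblyId": assembly,
        "referenceName": chrom,
        "variantType": variant_type,
        "startMin": max(0, gene_end - max_size),
        "startMax": gene_start,
        "endMin": gene_end,
        "endMax": gene_start + max_size,
    }


def matches(variant_start, variant_end, params, max_size):
    """Check a variant against the brackets plus an explicit size limit."""
    in_bracket = (
        params["startMin"] <= variant_start <= params["startMax"]
        and params["endMin"] <= variant_end <= params["endMax"]
    )
    return in_bracket and (variant_end - variant_start) <= max_size


# Approximate MYC location on chromosome 8 (assumed, verify before use).
MYC_START, MYC_END = 127_735_434, 127_742_951
MAX_SIZE = 6_000_000

params = bracket_query("8", MYC_START, MYC_END, MAX_SIZE)
query_string = urlencode(params)
```

The brackets alone guarantee that a matching variant covers the whole region, but a variant can satisfy both brackets and still exceed the size limit, which is why `matches` applies the size check separately.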

Figure 3.

Beacon-style query using fuzzy ranges to identify biosamples with variants matching the CNA range. This example queries for a continuous, focal duplication covering the complete MYC gene’s coding region with ≤6 Mb in size. A: Filter for dataset; B: filter for cancer classification (NCIt and ICD-O-3 ontology terms available); C: additional filter, e.g. Cellosaurus; D: additional filter for geographic location; E: external link to UCSC browser to view the alignment of matched variants; F: cancer type classification sorted by frequency of the matched biosamples present in the subset; G: list of matched biosamples with description, statistics and reference. More detailed biosample information can be viewed through the ‘id’ link to the sample detail page; H: matched variants with reference to biosamples can be downloaded in JSON or CSV format.

Figure 4 shows the additional functional interfaces and services provided by the Progenetix project. Users can search for publications or studies by publication title, author names or the geographic location of the research center. Then, navigation extends to the summary of publications with the number of samples catalogued by technology and availability in database as well as options to visualize the associated samples (Figure 4A). Users can also access samples from the NCIt hierarchical tree or other classification systems (e.g. ICD-O and UBERON) to select a subset of cancer types for summary statistics and visualization (Figure 4B). Alternatively, users can also upload their own data for single or multiple samples to visualize genome-wide CNA (Figure 4C). In addition, a list of studies and cohorts can be selected in the navigation menu, including arrayMap (probe-specific arrays from published studies (5)), diffuse intrinsic pontine glioma cohort (53) and the ‘cancer signature’ cohort (54). All the functionalities and provided services are detailed in the documentation pages at info.progenetix.org, which invite request submission through the GitHub ‘issues’ tracker.

Other improvements

Genome version update

All samples have been updated to GRCh38. The process has been completed in a step-wise manner. Preferably, for samples with available probe-specific array data, either GRCh38 mapped platform data files were used for re-processing of the original files or alternatively, a lift-over of the probe data and subsequent re-segmentation was performed. For those cases where only called CNA data had been collected, we applied our recently published ‘segment-liftover’ tool (19) for the efficient re-mapping of continuous segments. Overall, more than 99.99% of probes and more than 99% of segments could be recovered successfully.

Cell line collection

Cancer cell lines are important models for understanding the molecular mechanisms of malignant diseases and have a prominent role in pharmacological screening procedures. Besides the primary tumor data, the Progenetix data collection also includes genomic profiling experiments using in vitro models. Recently, we introduced a systematic update of cell line annotations based on ‘Cellosaurus’, a comprehensive knowledge resource on cell line data with extensive annotations and mappings to a variety of classifications and ontologies (55). We assigned Cellosaurus ids to the cancer cell line samples, as well as ICD-O morphology and topography codes based on the NCIt term annotated by Cellosaurus. At this time, Progenetix includes a total of 5764 samples corresponding to 2162 different cancer cell lines, representing 259 different cancer types (NCIt). So far, we provide the option to search for cell lines by applying a ‘cellosaurus’ filter either in the web interface (e.g. ‘cellosaurus: CVCL_0030’ for ‘HeLa’ cell line samples) or in the API query; work on a dedicated cell line data access tool is underway.

Conclusion

The Progenetix resource provides an extensive collection of oncogenomic data with a focus on individual genome-wide CNA profiles and the use of modern ontologies and data schemas to render curated biological and technical metadata, as well as thorough references to external repositories and annotation resources. Through aggregation of data from thousands of individual research studies as well as several consortium-derived collections, the Progenetix database currently constitutes, to our knowledge, the largest public, freely accessible resource for pre-computed CNA profiles and associated phenotypic information and additional metadata dedicated to cancer studies. While the application of uniform genomic data formats and a benchmarked data processing pipeline minimizes biases from separate studies, the forward-looking implementation of emerging ontology standards facilitates integrative and comparative analysis across a vast range of cancer types. The tight integration with GA4GH product development and standardization processes supports compatibility with emerging data federation approaches and the widest re-utilization of the resource’s data. For the future, besides the continuous maintenance and expansion of the existing data types, we will work toward enhancing clinical and diagnostic annotation, expanding cross-database references and the types of genomic variant data, as well as active data sharing and integration through networked services and platforms.

Supplementary data

Supplementary data are available at Database Online.

Acknowledgements

We would like to thank Amos Bairoch for support with the cell line annotations. Improvements in data annotation concepts were strongly influenced by the GA4GH community.

Funding

The Progenetix database does not receive dedicated funding support. Work on the Beacon API has been supported through the European life-sciences Infrastructure (ELIXIR) Beacon 2019–2021 implementation study and under the BioMedIT Network project of the Swiss Institute of Bioinformatics (SIB) and the Swiss Personalized Health Network (SPHN). Bo Gao has been the recipient of a fellowship from the China Scholarship Council (CSC).

References

1. Hanahan D. and Weinberg R.A. (2011) Hallmarks of cancer: the next generation. Cell, 144, 646–674. doi: 10.1016/j.cell.2011.02.013

2. Albertson D.G., Collins C., McCormick F. et al. (2003) Chromosome aberrations in solid tumors. Nat. Genet., 34, 369–376.

3. Baudis M. and Cleary M.L. (2001) Progenetix.net: an online repository for molecular cytogenetic aberration data. Bioinformatics, 17, 1228–1229. doi: 10.1093/bioinformatics/17.12.1228

4. Cai H., Kumar N., Ai N. et al. (2014) Progenetix: 12 years of oncogenomic data curation. Nucleic Acids Res., 42, D1055–D1062.

5. Cai H., Kumar N. and Baudis M. (2012) ArrayMap: a reference resource for genomic copy number imbalances in human malignancies. PLoS One, 7, e36944.

6. Cai H., Kumar N., Bagheri H.C. et al. (2014) Chromothripsis-like patterns are recurring but heterogeneously distributed features in a survey of 22,347 cancer genome screens. BMC Genomics, 15, 82.

7. Cancer Genome Atlas Research Network, Weinstein J.N., Collisson E.A. et al. (2013) The cancer genome atlas pan-cancer analysis project. Nat. Genet., 45, 10, 1113–1120.

8. National Cancer Institute. (2013) The Cancer Genome Atlas Program.

9. Gao J., Aksoy B.A., Dogrusoz U. et al. (2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal., 6, pl1. doi: 10.1126/scisignal.2004088

10. National Center for Biotechnology Information (NCBI). (2002) Gene Expression Omnibus.

11. The European Bioinformatics Institute (EMBL-EBI). (2003) ArrayExpress.

12. Global Alliance for Genomics and Health. (2016) GENOMICS. A federated ecosystem for sharing genomic, clinical data. Science, 352, 1278–1280.

13. Fiume M., Cupak M., Keenan S. et al. (2019) Federated discovery and sharing of genomic data using Beacons. Nat. Biotechnol., 37, 220–224.

14. Edgar R., Domrachev M. and Lash A.E. (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.

15. Athar A., Füllgrabe A., George N. et al. (2019) ArrayExpress update–from bulk to single-cell expression data. Nucleic Acids Res., 47, D711–D715.

16. Cerami E., Gao J., Dogrusoz U. et al. (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov., 2, 401–404. doi: 10.1158/2159-8290.CD-12-0095

17. Cancer Genome Atlas Research Network. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455, 1061–1068.

18. Cheng D.T., Mitchell T.N., Zehir A. et al. (2015) Memorial Sloan Kettering-integrated mutation profiling of actionable cancer targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. J. Mol. Diagn., 17, 251–264. doi: 10.1016/j.jmoldx.2014.12.006

19. Gao B., Huang Q. and Baudis M. (2018) segment_liftover: a Python tool to convert segments between genome assemblies [version 1; referees: awaiting peer review]. F1000Research, 7, 319. doi: 10.12688/f1000research.14148.2

20. O’Keefe C., McDevitt M.A. and Maciejewski J.P. (2010) Copy neutral loss of heterozygosity: a novel chromosomal lesion in myeloid malignancies. Blood, 115, 2731–2739. doi: 10.1182/blood-2009-10-201848

21. Mullighan C.G., Goorha S., Radtke I. et al. (2007) Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature, 446, 758–764.

22. Ai N., Cai H., Solovan C. et al. (2016) CNARA: reliability assessment for genomic copy number profiles. BMC Genomics, 17, 799. doi: 10.1186/s12864-016-3074-7

23. Ortiz-Estevez M., Bengtsson H. and Rubio A. (2010) ACNE: a summarization method to estimate allele-specific copy numbers for Affymetrix SNP arrays. Bioinformatics, 26, 1827–1833. doi: 10.1093/bioinformatics/btq300

24. Olshen A.B., Venkatraman E.S., Lucito R. et al. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 5, 557–572. doi: 10.1093/biostatistics/kxh008

25. Huang Q. and Baudis M. (2020) Enabling population assignment from cancer genomes with SNP2pop. Sci. Rep., 10, 1–9.

26. World Health Organization and others (2013) International Classification of Diseases for Oncology (ICD-O), 3rd edition, 1st revision.

27. Cai H., Gupta S., Rath P. et al. (2015) ArrayMap 2014: an updated cancer genome resource. Nucleic Acids Res., 43, D825–D830.

28. Ried T., Hu Y., Difilippantonio M.J. et al. (2012) The consequences of chromosomal aneuploidy on the transcriptome of cancer cells. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 1819, 784–793. doi: 10.1016/j.bbagrm.2012.02.020

29. da Silva Almeida A.C., Abate F., Khiabanian H. et al. (2015) The mutational landscape of cutaneous T cell lymphoma and Sézary syndrome. Nat. Genet., 47, 1465–1470.

30. Xu L., Gu Z.-H., Li Y. et al. (2014) Genomic landscape of CD34+ hematopoietic cells in myelodysplastic syndrome and gene mutation profiles as prognostic markers. Proc. Natl. Acad. Sci., 111, 8589–8594. doi: 10.1073/pnas.1407688111

31. Baudis M. (2007) Genomic imbalances in 5918 malignant epithelial tumors: an explorative meta-analysis of chromosomal CGH data. BMC Cancer, 7, 226. doi: 10.1186/1471-2407-7-226

32. Cordo P.C. and Baudis M. (2021) Copy number variant heterogeneity among cancer types reflects inconsistent concordance with diagnostic classifications. BioRxiv.

33. Tarmin L., Yin J., Zhou X. et al. (1994) Frequent loss of heterozygosity on chromosome 9 in adenocarcinoma and squamous cell carcinoma of the esophagus. Cancer Res., 54, 6094–6096.

34. Kuroki T., Trapasso F., Yendamuri S. et al. (2003) Allele loss and promoter hypermethylation of VHL, RAR-β, RASSF1A, and FHIT tumor suppressor genes on chromosome 3p in esophageal squamous cell carcinoma. Cancer Res., 63, 3724–3728.

35. Armaghany T., Wilson J.D., Chu Q. et al. (2012) Genetic alterations in colorectal cancer. Gastrointestinal Cancer Research: GCR, 5, 19.

36. Nishida N., Fukuda Y., Kokuryu H. et al. (1992) Accumulation of allelic loss on arms of chromosomes 13q, 16q and 17p in the advanced stages of human hepatocellular carcinoma. Int. J. Cancer, 51, 862–868. doi: 10.1002/ijc.2910510605

37. Mungall C.J., Torniai C., Gkoutos G.V. et al. (2012) Uberon, an integrative multi-species anatomy ontology. Genome Biol., 13, R5. doi: 10.1186/gb-2012-13-1-r5

38. Gene Ontology Consortium. (2008) The gene ontology project in 2008. Nucleic Acids Res., 36, D440–D444.

39. Natale D.A., Arighi C.N., Barker W.C. et al. (2007) Framework for a protein ontology. In: BMC Bioinformatics, Springer, S1. doi: 10.1186/1471-2105-8-S9-S1

40. Mungall C.J., McMurry J.A., Köhler S. et al. (2017) The monarch initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res., 45, D712–D722.

41. Carrio-Cordo P. and Baudis M. (2018) Mountains and chasms: surveying the oncogenomic publication landscape. Oncology, 1–12.

42. Carrio-Cordo P., Acheson E., Huang Q. et al. (2020) Geographic assessment of cancer genome profiling studies. Database, 2020. doi: 10.1093/database/baaa009

43. Miki Y., Swensen J., Shattuck-Eidens D. et al. (1994) A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science, 266, 66–71. doi: 10.1126/science.7545954

44. Wooster R., Bignell G., Lancaster J. et al. (1995) Identification of the breast cancer susceptibility gene BRCA2. Nature, 378, 789–792.

45. Friend S.H., Bernards R., Rogelj S. et al. (1986) A human DNA segment with properties of the gene that predisposes to retinoblastoma and osteosarcoma. Nature, 323, 643–646.

46. Amundadottir L.T., Sulem P., Gudmundsson J. et al. (2006) A common variant associated with prostate cancer in European and African populations. Nat. Genet., 38, 652–658.

47. Stacey S.N., Manolescu A., Sulem P. et al. (2007) Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor–positive breast cancer. Nat. Genet., 39, 865–869.

48. Tenesa A., Farrington S.M., Prendergast J.G.D. et al. (2008) Genome-wide association scan identifies a colorectal cancer susceptibility locus on 11q23 and replicates risk loci at 8q24 and 18q21. Nat. Genet., 40, 631–637.

49. Wu C., Hu Z., Yu D. et al. (2009) Genetic variants on chromosome 15q25 associated with lung cancer risk in Chinese populations. Cancer Res., 69, 5065–5072. doi: 10.1158/0008-5472.CAN-09-0081

50. Jacobsen J.O.B., Robinson P.N. and Mungall C.J. (2019) Phenopackets Schema.

51. Wagner A.H., Babb L., Alterovitz G. et al. (2021) The GA4GH variation representation specification (VRS): a computational framework for the precise representation and federated identification of molecular variation. BioRxiv.

52. Lawler M., Siu L.L., Rehm H.L. et al.; Clinical Working Group of the Global Alliance for Genomics and Health (GA4GH) (2015) All the world’s a stage: facilitating discovery science and improved cancer care through the global alliance for genomics and health. Cancer Discov., 5, 1133–1136. doi: 10.1158/2159-8290.CD-15-0821

53. Mackay A., Burford A., Carvalho D. et al. (2017) Integrated molecular meta-analysis of 1,000 pediatric high-grade and diffuse intrinsic pontine glioma. Cancer Cell, 32, 520–e5. doi: 10.1016/j.ccell.2017.08.017

54. Gao B. and Baudis M. (2021) Signatures of discriminative copy number aberrations in 31 cancer subtypes. BioRxiv, 12, 732.

55. Bairoch A. (2018) The Cellosaurus, a cell-line knowledge resource. J. Biomol. Tech.: JBT, 29, 25–38. doi: 10.7171/jbt.18-2902-002

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
