Advances in genomic hepatocellular carcinoma research

Abstract
Background: Hepatocellular carcinoma (HCC) is the cancer with the second highest mortality in the world due to its late presentation and limited treatment options. As such, there is an urgent need to identify novel biomarkers for early diagnosis and to develop novel therapies. The availability of next-generation sequencing (NGS) data from tumors of liver cancer patients provides an invaluable resource for understanding HCC, because data from different sources can be integrated to identify promising biomarkers or therapeutic targets.
Findings: Here, we review key insights gleaned from more than 20 NGS studies of HCC tumor samples, comprising approximately 582 whole genomes and 1,211 whole exomes mainly from the East Asian population. Through consolidation of reported somatic mutations from multiple studies, we identified genes with different types of somatic mutations, including single nucleotide variations, insertions/deletions, structural variations, and copy number alterations, as well as genes with frequent viral integrations. Pathway analysis showed that the genes in this curated list of somatic mutations are critically involved in cancer-related pathways, viral carcinogenesis, and signaling pathways. Lastly, we address the future directions of HCC research as more NGS datasets become available.
Conclusions: Our review is a comprehensive resource for current NGS research in HCC, consolidating published articles, potential gene candidates, and their related biological pathways.


Introduction
Based on GLOBOCAN 2012, liver cancer is the second most common cause of death from cancer worldwide. Liver cancer is the fifth most common cancer in males (554,000 cases) and the ninth most common cancer in females (228,000 cases) [1]. The incidence rate is higher in males than in females, with a male-to-female ratio of 2.4 worldwide, and the mortality-to-incidence ratio is as high as 0.94 and 0.98 for males and females, respectively. Hepatocellular carcinoma (HCC) is the predominant form of primary liver cancer. Geographically, there is a high incidence rate in Africa (northern and western) and Asia (eastern and southeastern), particularly in China, which accounts for 50 percent of all HCC cases [2].
HCC is commonly associated with risk factors such as hepatitis B virus (HBV) and hepatitis C virus (HCV) infection, alcohol, the mycotoxin aflatoxin, obesity, and non-alcoholic fatty liver disease; the risk varies depending on gender, geographic region, and ethnicity [2][3][4]. Early evidence showed the association of HBV and HCV infection with the development of liver cirrhosis and HCC [5,6]. The HBV vaccine has been available since the early 1980s, and the implementation of HBV vaccination programs in 177 of 193 World Health Organization member states has been successful in decreasing HCC incidence rates in children [7,8].
While environmental factors play a role in HCC, multiple recurrent genetic aberrations and the disruption of the host genome due to HBV DNA integration in HBV-associated HCC are reported to cause the dysregulation of genes important for the hallmarks of cancer. Initial studies identified HBV integration sites via HBV DNA probes or polymerase chain reaction assay followed by Sanger sequencing [9][10][11][12][13]. Subsequently, somatic alterations such as mutations, gene copy number changes, and chromosomal rearrangements detected in the HCC-derived cell lines were found to affect the expression of oncogenes and tumor suppressor genes [14,15]. Progress in the mapping of each viral integration site and genetic aberration in HCC patients was ad hoc and slow before the advent of next-generation sequencing (NGS).
NGS technologies, including RNA-sequencing (RNA-seq), whole-exome sequencing (WXS), and whole-genome sequencing (WGS), form the foundation of today's discovery-based genomics research. With the reduced cost of massively parallel sequencing technologies over the last decade [16], there has been an increasing number of genomic liver cancer studies providing new insights about liver cancer. Pioneering NGS studies conducted on patient samples have shown a tremendous increase in our understanding of HBV viral integration patterns [17][18][19] as well as somatic alterations found in liver cancer [20][21][22]. The large amount of sequencing data generated has been archived on data servers worldwide, enabling researchers to perform integrative analyses that will lead to new findings. However, maneuvering through literature and data repositories to locate and access this information remains a tedious process.
Here, we introduce and consolidate all existing NGS-based studies on liver cancer (Fig. 1). Only the most relevant studies conducted using NGS in HCC have been listed in a recent review [23]. Our NGS-based resource is a complete list of data samples comprising approximately 582 whole genomes and 1,211 whole exomes. It summarizes the key research and clinical findings from each article, with direct links to all publicly available WGS/WXS liver cancer datasets, to facilitate access to knowledge and data. The key findings on somatic mutations, HBV integrations, and mutational signatures reported from recent high-throughput studies and related integrative studies are discussed. We highlight key genes reported across multiple studies to have recurrent somatic mutations or HBV integration events. Additionally, we provide a meta-analysis of the pathways that these alterations dysregulate. Finally, we discuss future directions and trends in liver cancer research via the analysis of high-throughput data.

NGS Resources
Raw sequencing data, read alignments, and annotations from NGS platforms can be accessed via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) [24], the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) European Nucleotide Archive [25], and the DNA Data Bank of Japan SRA [26]. The National Cancer Institute's Genomic Data Commons [27] currently hosts genomic data from the Cancer Genome Atlas (TCGA) project, which covers multiple cancer types. There are currently 377 liver hepatocellular carcinoma samples with data from WXS, single nucleotide polymorphism (SNP)-array, methylation, mRNA, and microRNA profiling. GigaDB [28] is a repository for open-access data associated with the GigaScience journal [29], which currently holds an HCC dataset from 88 individuals [30]. The International Cancer Genome Consortium (ICGC) [31] is a global effort to coordinate large-scale cancer genome studies by providing a comprehensive catalogue of somatic mutations across 50 cancer types, with approximately 500 samples generated for each [32]. While primary data files are stored on NCBI and/or EBI, ICGC provides interpreted datasets for somatic mutation calls and incorporates transcriptomic and DNA methylation analyses from the same tumor samples.

Somatic genomic alterations
By comparing matched normal and tumor samples, computational algorithms have identified a number of likely cancer-causing point mutations and insertions/deletions (indels). Somatic alterations such as point mutations, indels, structural variants, and copy number alterations have been identified in one or more of the 85 genes that we have included in Table 2. Recurrent mutations in 12 genes (TP53, CTNNB1, AXIN1, ALB, ARID2, ARID1A, RPS6KA3, APOB, RB1, CDKN2A, LRP1B, and PTEN) were reported in multiple studies. In this section, we discuss five genes (ALB, ARID2, RB1, BRD7, and RPL22) that were reported to show all four types of somatic alterations. To gain further insights into the genes with reported somatic mutations, their gene expression (tumor/normal fold-change) and clinicopathological information (histologic grade and survival) from the TCGA HCC cohort are also presented.
[Table fragment: ... (2/9 cases) were recurrent missense mutations in HCCs. Functional analysis showed that the β-catenin H36P mutant was resistant to protein degradation and promoted HCC cell proliferation. HBV integrations led to increased gene expression of TERT, MLL4, and CCNE1.]

ARID2 is a component of SWI/SNF-related chromatin remodeling complexes and is identified as a tumor suppressor that is frequently mutated in HCC patients [22,42,48]. In addition, gene expression profiling of ARID2-deficient HCC cell lines reveals negative regulation of UV-response gene sets, suggesting that ARID2 may be involved in DNA repair processes [57]. ARID2 is also involved in HCC via the effects of hepatitis B and C infection. In HBV-related HCC, the HBV X protein is reported to suppress ARID2 expression, leading to increased hepatoma tumorigenesis [58]. ARID2 mutations are also significantly associated (P = 0.046) with HCV-related HCC [22]. These findings suggest that ARID2 is a critical tumor suppressor in hepatitis virus-related HCC progression.
Similar to ARID2, BRD7 is a component of the SWI/SNF remodeling machinery and a putative tumor suppressor reported with significant truncating mutations in HCC [55]. Loss-of-function mutations at the BRD7 gene locus are observed (7/268) in HBV-associated HCC patients [33]. BRD7 expression is also reported to be associated with clinical characteristics in HCC (tumor size, tumor stage, and survival) [59]. HCV infections repress BRD7 expression in vitro, resulting in the dysregulation of hepatoma cell proliferation [60]. BRD7 also negatively regulates PI3K signaling by binding to the inter-SH2 (iSH2) domain of p85, leading to the impairment of p85/p110 complex formation [61].
The ALB gene encodes the most abundant plasma protein, albumin, which is synthesized exclusively by hepatocytes [41]. Blood albumin levels that deviate from the normal healthy range often indicate dysregulation of protein production in the liver and other liver-associated issues. Somatic mutations at the ALB gene locus were reported in multiple studies, including genomic rearrangements in 10% (9/88) of Chinese HCC patients [41] as well as point mutation clusters and indels in Japanese HCC patients [33]. ALB is proposed as a liver cancer driver gene because it is significantly enriched with damaging mutations in the European population [50]. Highly expressed genes such as ALB and APOB have been shown to be strongly enriched with indels, which are characteristic of replication slippage errors resulting from conflicts between the replication and transcription machineries [51]. Hence, low albumin levels may contribute to liver cancer progression.
RB1 is a key inhibitor of cell cycle progression that harbors multiple nonsense mutations and genomic deletions in HCC patients [33,42,43,50]. RB1 is found to be predominantly mutated in Asian Americans (10/53 patients) as compared to European Americans (2/101 patients) [62]. The inactivation of the RB pathway in Rb family triple knockout mice resulted in the development of HCC [63]. A study reveals that in 16/40 HCC patients, DNA methylation abnormalities were observed in CpG island 85 (CpG85) located within intron 2 of the RB1 gene, which can potentially regulate the expression of the RB1-E2B alternative transcript [64]. In addition, RB1 mutations are also significantly associated with reduced cancer-specific and recurrence-free survival after resection in HCC patients [43,50]. It is thus worthwhile to further characterize RB1 mutations, as they are reported to have a significantly higher mutation rate in HBV-related HCCs [42,43].
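The difference in RB1 mutation frequency between the two groups quoted above (10/53 versus 2/101) can be checked with a standard contingency-table test. The following is a minimal sketch using a two-sided Fisher's exact test written with the Python standard library; the choice of test and the function name `fisher_exact_two_sided` are our assumptions for illustration and are not taken from the cited study [62].

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x mutated cases in the first group
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Counts quoted in the text, as (mutated, not mutated)
asian_american = (10, 53 - 10)
european_american = (2, 101 - 2)

p_value = fisher_exact_two_sided(*asian_american, *european_american)
print(f"two-sided p = {p_value:.2g}")
```

With such small counts an exact test is generally preferred over a chi-square approximation, although the original authors may have used a different procedure.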
RPL22, another gene that is reported to exhibit all four types of mutations (single-nucleotide variants, indels, structural variations, and copy number variations), encodes a ribosomal 60S subunit protein. It was reported to be significantly mutated in Japanese (5/268 patients) and European (7/242 patients) HCC patients [33,50]. RPL22 was identified through pan-genomic characterization as a driver gene with significant somatic alterations in adrenocortical carcinoma [65]. A study of microsatellite instability-positive gastric cancers also identified RPL22 as a recurrently mutated gene with single base deletions [66]. Therefore, more research is needed to fully determine the functional roles of RPL22 in HCC.

HBV integration
The HBV genome often integrates into the chromosomes of liver cells, resulting in alterations of the host genome. Recent findings have confirmed that the viral transcription/replication initiation site, DR1 (located near the 3′ end of the HBx gene and the beginning of the Precore/Core gene), is the preferred region to be integrated into the host chromosome [11,17,19]. More HBV integration events were identified in tumors than in their matched normal samples [18]. In HCC tumors, studies show that HBV integration was randomly distributed throughout the human genome [17,18,33]. In a group of 48 HCC patients from the Singapore cohort, HBV integrations were significantly enriched in the q arm of chromosome 10 and correlated with poorly differentiated tumors [19].
CCNE1 encodes the cyclin E1 protein, a regulatory subunit of CDK2 involved in the G1/S phase of the cell cycle. CCNE1 amplification has been reported as a mechanism of resistance in ER-positive and HER2-positive breast cancers as well as high-grade serous ovarian cancer [69][70][71][72]. HBV integrations within CCNE1 have been reported in 4 of 76 HBV-positive HCC samples and resulted in significantly increased expression of CCNE1 [18]. The molecular mechanism of CCNE1 mutations in HCC patients has yet to be fully elucidated.
The previously reported recurrent integration site at the TERT promoter was found by several high-throughput genomic studies to be the most frequent site for integration [19,33,73,74]. Disruption of the TERT promoter is likely to cause the dysregulation of the telomerase reverse transcriptase (TERT) expression, which plays important roles in cancer development due to its diverse telomere-independent functions in Wnt pathway signaling, cell proliferation, and DNA-damage repair [75]. Viral sequences may act as enhancers where the closer the HBV is integrated to the transcription start site of TERT, the higher the mRNA expression of TERT [19].
Chimeric HBx/MLL4 fusion transcripts containing the HBx promoter and open reading frame (ORF) fused to exons 4 and 5 of MLL4 were initially detected in 4 of 10 HCC patients [76]; they were subsequently confirmed in later studies and reported to lead to increased MLL4 expression [17,18,67]. In a Chinese cohort, 8 of 44 patients were found to have HBx/MLL4 fusion transcripts, resulting in higher expression of the MLL4 gene [67]. The chimeric transcript lacks the AT-hook DNA-binding domain of MLL4; hence, it may act as a dominant negative allele [17].
CDK15 encodes cyclin-dependent kinase 15, a serine/threonine protein kinase. In one study, CDK15 contributed to tumor necrosis factor-related apoptosis-inducing ligand resistance, possibly by regulating the phosphorylation of survivin (Thr34) [77]. Interestingly, multiple HBV-CDK15 fusion transcripts were detected in an HCC patient, including one in-frame fusion, which caused CDK15 overexpression [37]. However, as with many of the other genes where HBV integrations have been identified, the function of CDK15 in HCC remains unclear. Hence, there is great potential to further investigate HBV integrations in HCC.
It is noteworthy that CCNE1, TERT, and ANGPT1 not only harbor somatic mutations (Table 2) but are also reported to be sites for viral integrations (Table 3). CCNE1 has been reported with structural variant alterations and HBV integrations, while TERT has been reported with point mutations, structural variant alterations, and HBV integrations, suggesting that deregulation of these genes may play important roles in tumorigenesis. ANGPT1 (Angiopoietin-1), a ligand for the Tie2 vascular endothelial-specific receptor tyrosine kinase that is involved in the induction of HCC neovascularization and disease progression [78][79][80], was reported to harbor point mutations and HBV integrations in its intronic regions. ANGPT1 and Angiopoietin-2 (ANGPT2) were overexpressed in 68 and 81 percent of poorly differentiated HCC tumors, respectively [81]. However, high ANGPT2 expression, but not ANGPT1 expression, correlated with the disease-free survival of 60 HCC patients [82]. The role of ANGPT1 in tumor angiogenesis remains unclear.

Pathways of somatic mutated genes and mutation signatures
Pathway analysis based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID v6.8) to identify pathways that were altered by somatic mutations in the TCGA HCC cohort [83,84]. Seventy-nine of the 85 genes in our list of somatic mutations have identifiable DAVID IDs, of which 45 genes can be categorized in KEGG pathways. Fifteen significant pathways were identified (FDR <0.05) from the 45 genes, of which 14 genes are involved in more than one of the pathways (Fig. 2). All 14 genes are involved in pathways in cancer, including other significant cancer types: prostate, endometrial, glioma, melanoma, chronic myeloid leukemia, colorectal, pancreatic, bladder, and non-small cell lung cancer. The association of the genes with the PI3K-Akt signaling pathway and the regulation of pluripotent stem cells also reflects the importance of these somatic mutations. Lastly, the analysis also reported viral-associated pathways such as hepatitis B, viral carcinogenesis, and Human T Lymphotropic Virus Type 1 (HTLV-1) infection, where somatic mutations in genes and viral integration events together give a bigger picture of the overall changes in the biological pathways.
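The logic behind this kind of over-representation analysis can be sketched in a few lines. The example below is a simplified illustration, not a reproduction of DAVID: it uses a plain one-sided hypergeometric test followed by Benjamini-Hochberg correction, whereas DAVID applies its own modified Fisher's exact statistic. The background size, pathway names, pathway sizes, and overlap counts are hypothetical; only the query size of 45 genes comes from the text.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n belong to the pathway."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / comb(M, N)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

background_size = 20000  # assumed number of annotated background genes
query_size = 45          # genes with KEGG annotation (from the text)
pathways = {             # name -> (pathway size, overlap with query); made-up values
    "pathway_A": (400, 14),
    "pathway_B": (150, 2),
    "pathway_C": (300, 1),
}

names = list(pathways)
pvals = [hypergeom_sf(pathways[p][1], background_size, pathways[p][0], query_size)
         for p in names]
fdrs = benjamini_hochberg(pvals)
significant = [name for name, q in zip(names, fdrs) if q < 0.05]

for name, p, q in zip(names, pvals, fdrs):
    print(f"{name}: p = {p:.3g}, FDR = {q:.3g}")
```

A pathway is retained when its adjusted value falls below the chosen threshold, which mirrors the FDR <0.05 cutoff described above.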
Mutational signatures are characteristic patterns of somatic mutations defined by distinct nucleotide substitutions. These signatures are often identified through principal-component analysis of the trinucleotide mutation context, with 96 possible combinations of the substitution type and the bases immediately 5′ and 3′ of each mutated site [33]. There are currently 30 mutational signatures listed in the Catalogue of Somatic Mutations in Cancer (COSMIC), where some of these signatures represent exposure to mutagens, errors in the DNA replication machinery, or defective DNA repair [85]. Fujimoto et al. (2016) identified seven distinct mutational signatures (W1-W7) in HCC patients. Three of the seven signatures (W1, W4, and W5) were found in multiple studies [33,50,55]. These recurrent signatures correspond well to COSMIC Signature 1, Signature 4, and Signature 16, which are proposed to be caused by the spontaneous deamination of 5-methylcytosine, tobacco mutagens, and unknown factors, respectively [85]. Other COSMIC signatures identified include Signature 9, Signature 12, and Signature 19, which are linked to somatic hypermutation, liver cancer, and unknown factors, respectively [86]. Signature [51]. A mutational signature characterized by increased C>A transversions was a major contributor to the driver mutations found in HCC patients exposed to aflatoxin B1 [40]. A high proportion of Taiwanese HCC patients with aristolochic acid (AA) mutagen exposure had T>A mutations that corresponded to COSMIC Signature 22 [47]. The AA signature was also found to be higher in HCC patients from China and Southeast Asia and much lower in Japan, America, and Europe. A prominent mutational signature was also identified after cisplatin treatment in the human liver cancer cell line HepG2 [87]. Mutational signatures not only allow us to appreciate the mechanisms underlying somatic mutations in HCC tumors but may also relate to mutational processes in other cancer types with related etiology.
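The figure of 96 contexts follows from six substitution classes (expressed relative to the pyrimidine of the base pair) multiplied by four possible 5′ bases and four possible 3′ bases. The sketch below enumerates these channels and assigns a single substitution to its channel; the function names and the bracket notation are our own choices for illustration.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Six substitution classes, written with the pyrimidine (C or T) as the reference base
SUBSTITUTIONS = [(ref, alt) for ref in "CT" for alt in BASES if alt != ref]

def all_contexts():
    """Enumerate the 96 channels as strings such as 'A[C>A]G'."""
    return [f"{five}[{ref}>{alt}]{three}"
            for (ref, alt), five, three in product(SUBSTITUTIONS, BASES, BASES)]

def classify(trinucleotide, alt):
    """Map a reference trinucleotide and an alternate base to one of the 96 channels."""
    five, ref, three = trinucleotide.upper()
    alt = alt.upper()
    if ref in "AG":
        # Purine reference: report the same event on the opposite strand
        five, ref, three = COMPLEMENT[three], COMPLEMENT[ref], COMPLEMENT[five]
        alt = COMPLEMENT[alt]
    return f"{five}[{ref}>{alt}]{three}"

contexts = all_contexts()
print(len(contexts))          # 96
print(classify("CTG", "A"))   # C[T>A]G
print(classify("CAG", "T"))   # same channel, seen from the opposite strand
```

Counting the mutations of each tumor across these 96 channels gives the per-sample vectors that signature extraction methods take as input.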
Multi-omics analyses combine results from more than one type of data to give a more comprehensive view of biological profiles. Boyault et al. (2007) conducted an unsupervised transcriptome analysis to identify six subgroups of HCC, G1-G6, where G1-G3 are associated with chromosomal instability, G5-G6 are related to β-catenin mutations, and G4 is a heterogeneous group [88]. The association between the HCC transcriptome subclasses G5-G6, which are involved in Wnt pathway activation, and CTNNB1 mutations has been validated using WXS data in a later study [48,88]. In addition, multi-omics analysis shows that there is a correlation between gene expression profiles from RNA-seq data and allele frequencies of somatic mutations from WGS, highlighting 252 genomic mutations that cause transcriptomic aberrations [37].
With the large number of available NGS-based HCC studies, there is an opportunity to integrate data across studies to provide greater statistical power and elimination of potential biases from a single cohort study. Zhang et al. (2014) collected four datasets containing 99, 88, 10, and 10 HCC samples to identify known and also novel mutated genes and pathways [89]. This study illustrated that larger sample sizes can identify mutations at lower frequencies in HCC than in smaller sample cohorts. As a second example of data integration, using combined liver cancer data from ICGC and TCGA to analyze the association of ancestry to HCC mutational signatures, an increase in T>C substitutions (in the ATA context) in Japanese males and an increase in T>A substitutions (in the CTG context) in US-Asian males and females were also reported [55].
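The gain in statistical power from pooling cohorts can be illustrated with a simple binomial calculation. The sketch below asks how likely it is to observe at least `k` mutated tumors in a cohort of size `n` when the true mutation frequency is `freq`. The cohort sizes are those quoted above for Zhang et al. [89]; the frequency of 2% and the threshold of three mutated samples are assumptions chosen for illustration, and the model ignores background mutation rates and gene length.

```python
from math import comb

def prob_at_least(k, n, freq):
    """P(at least k of n tumors carry a mutation with true frequency freq)."""
    below = sum(comb(n, i) * freq**i * (1 - freq)**(n - i) for i in range(k))
    return 1.0 - below

cohorts = {
    "small cohort": 10,
    "large cohort": 99,
    "pooled cohorts": 99 + 88 + 10 + 10,
}
freq = 0.02  # assumed true mutation frequency
k = 3        # assumed minimum number of mutated samples to call a gene recurrent

for label, n in cohorts.items():
    print(f"{label} (n = {n}): P(>= {k} mutated) = {prob_at_least(k, n, freq):.3f}")
```

Under these assumptions the probability rises steadily with cohort size, which is the intuition behind integrating datasets to detect low-frequency mutations.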

Mutations in the non-coding regulatory regions of the genome
Non-coding DNA makes up more than 98 percent of the human genome and includes crucial transcription factor binding sites that regulate the transcription of RNA. Non-coding RNA includes introns and the 3′ and 5′ UTRs located in pre-mRNAs, as well as microRNAs and long non-coding RNAs (lncRNAs) [90,91]. The functional annotation of non-coding elements from the Encyclopedia of DNA Elements consortium and the US National Institutes of Health Roadmap Epigenomics project has provided support for the study of non-coding regions of human DNA [92,93]. Cancer whole-genome data from TCGA have been intensively analyzed to identify mutations in the non-coding regions. For example, two pan-cancer studies have shown that TERT promoter mutations are present in at least six cancer types, including glioblastoma, bladder, low-grade glioma, melanoma, and lung (and liver, which was analyzed in one of the studies) [68,94].
TERT promoter mutations are detected in 254 of 469 cases of HCC (54%) and are more frequently detected in HCV-positive and non-viral cases than in HBV-positive cases [55]. A more in-depth study reveals other non-coding mutations in NEAT1, MALAT1, the WDR74 promoter, the BCL6 promoter, and the TFPI2 promoter [33]. Analysis of non-coding DNA is challenging because many non-coding mutations occur at low frequencies and at DNA loci with limited functional annotation. Analyzing a larger number of liver cancer whole genomes may overcome the limitations in sample size and statistical power of current patient datasets. Hence, there is potential to better characterize non-coding regions in the future.

AAV2 viral integration events
In addition to HBV integration, recent reports of integration of the wild-type adeno-associated virus 2 (AAV2) in 11 of 193 cases of HCC, detected via deep sequencing [49,95], have sparked a debate regarding the safety of using AAV2 as a gene delivery vector in gene therapy [96][97][98][99]. Coincidentally, the AAV2 integrations were detected in several recurrent mutation sites in HCC, including the TERT promoter, MLL4, CCNE1, CCNA2, and TNFSF10 [49,100].
In an independent study, Fujimoto et al. (2016) detected AAV genome sequences in three liver cancer and three non-cancer liver cases. These three liver cancer cases were also infected with either HBV or HCV, and the AAV2 integration sites were located at MLL4, CCNE1, and an intergenic region of chromosome 5, respectively [33]. HBV integration sites were detected at the CCNA2 locus in one patient in this study as well as in a patient with early, well-differentiated HCC [12]. Given these observations, additional analyses are necessary to evaluate the prevalence and effects of AAV2 integration events in liver cancer and in gene therapy. The breadth of WGS data therefore makes it applicable to the detection of foreign genomic material in the human genome that may influence the development and treatment of liver cancer.

RNA editing
RNA editing by deamination of nucleotide bases in an RNA sequence is catalyzed by nucleotide-specific deaminases. Historically, transgenic mice and rabbits expressing the mRNA editing enzyme APOBEC-1 (C-to-U editing) unexpectedly developed liver dysplasia, with a few of the mice developing HCC [101]. The main form of RNA editing is A-to-I editing, catalyzed by the adenosine deaminase acting on RNA (ADAR) family [102].
A genome-wide study that used both WGS and RNA-seq data reported normal and tumor-specific RNA editing sites in HCC as well as a positive correlation between editing degree ratio and gene expression ratio [39]. Results show that the increased expression of ADAR1 resulted in the over-editing of the AZIN1 gene in HCC tumors, confirming the findings from a previous study [103]. Another genome-wide study showed that, in addition to AZIN1, the BLCAP RNA is over-edited (A-to-I editing) in HCC, and functional analysis suggests that the over-edited BLCAP resulted in enhanced cell proliferation and the activation of the AKT/mTOR signaling pathway [104]. Two pan-cancer studies of A-to-I RNA editing using data from TCGA reported no significant differences between matched normal and tumor samples, although a high Alu editing index in HCC has been significantly associated with poor survival [105,106].
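Because inosine is read as guanosine during sequencing, A-to-I editing appears in RNA-seq data as A-to-G mismatches at adenosine positions. A common way to quantify it, sketched below, is the fraction of G-supporting reads at a site, with a coverage-weighted aggregate over many sites. Whether the cited studies used exactly these definitions is an assumption on our part, and the read counts are hypothetical.

```python
def editing_level(a_reads, g_reads):
    """Fraction of reads supporting editing at one site: G / (A + G)."""
    total = a_reads + g_reads
    if total == 0:
        return None  # site not covered by any informative read
    return g_reads / total

def editing_index(sites):
    """Coverage-weighted editing over many sites: total G / total (A + G)."""
    total_a = sum(a for a, _ in sites)
    total_g = sum(g for _, g in sites)
    return editing_level(total_a, total_g)

# Hypothetical (A reads, G reads) at three adenosine sites in one tumor sample
tumor_sites = [(80, 20), (45, 5), (100, 0)]

per_site = [editing_level(a, g) for a, g in tumor_sites]
overall = editing_index(tumor_sites)
print(per_site)
print(overall)
```

Comparing such values between matched tumor and normal samples gives the kind of editing ratio that can then be correlated with gene expression.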

Expanding the cancer genome database
With rapidly falling costs and newer technologies, the number of whole genomes sequenced in the next 10 years is projected to increase dramatically [107]. Larger sample sizes will provide better statistical power to detect rare variants and subgroups of liver cancer, particularly in HCC. For example, a large-scale whole-genome study conducted on the Icelandic population identified missense SNP variants in ABCB4 that are associated with gallstone disease, liver cancer, liver cirrhosis, and other liver-specific traits [108,109]. There are currently several international collaborations to generate more cancer whole genomes. The Pan-Cancer Analysis of Whole Genomes is an international collaboration between ICGC and TCGA to analyze more than 2,800 whole genomes across different cancer types to identify genetic alterations, beginning with 12 tumor types profiled by TCGA, although HCC was not included [110]. Additionally, the 100,000 Genomes Project by Genomics England in the United Kingdom will include samples from 25,000 cancer patients [111].

Conclusion
In this review, we have discussed the key findings from WGS information (Fig. 1) and future directions of HCC research. WGS is a promising approach that provides genomic information for discovery-based genomic analyses. Hence, it holds great potential for liver cancer research as we seek to understand more about the genetic characteristics of HCC, which are influenced by gender, ethnicity, geolocation, and many risk factors. This review identified genes with somatic mutations (Table 2), many of which are involved in cancer-related pathways (Fig. 2). Many of the mutated genes have yet to be characterized for their molecular function and roles in cancer, presenting a great opportunity for future research in this direction. With improved clinical annotation and the automation of data analysis, more genomic sequences can be translated into valuable biological insights.