VISDB: a manually curated database of viral integration sites in the human genome

Abstract Virus integration into the human genome occurs frequently and represents a key driving event in human disease. Many studies have reported viral integration sites (VISs) proximal to structural or functional regions of the human genome. Here, we systematically collected and manually curated all VISs reported in the literature and publicly available data resources to construct the Viral Integration Site DataBase (VISDB, https://bioinfo.uth.edu/VISDB). Genomic information including target genes, nearby genes, nearest transcription start site, chromosome fragile sites, CpG islands, viral sequences and target sequences was integrated to annotate VISs. We further curated VIS-involved oncogenes and tumor suppressor genes, virus–host interactions involved in non-coding RNA (ncRNA), target gene and microRNA expression in five cancers, among others. Moreover, we developed tools to visualize single integration events, VIS clusters, DNA elements proximal to VISs and virus–host interactions involved in ncRNA. The current version of VISDB contains a total of 77 632 integration sites of five DNA viruses and four RNA retroviruses. VISDB is currently the only active comprehensive VIS database, which provides broad usability for the study of disease, virus-related pathophysiology, virus biology, host–pathogen interactions, sequence motif discovery and pattern recognition, molecular evolution and adaptation, among others.


VISDB Supplementary file
Table S1. Statistics of curated VISs in VISDB
Table S2. The ratios of cytobands harboring VISs
Table S3. The ratios of VISs located at chromosome fragile sites
Table S4. The ratios of VISs located in genes
Table S5. VIS distribution on DNA elements
Table S6. Ratios of oncogenes, tumor suppressor genes and non-classified genes targeted by VIS
Table S7. Average VIS number located in genes
Table S8. Ratios of pathway genes targeted by VIS
Figure S1. Virus integration event model

Tables

Table S1 shows the statistics of VISs in VISDB. A total of 77 632 VISs from 5 DNA oncoviruses (HBV, HPV, EBV, MCV, AAV2) and 4 RNA retroviruses (HIV, MLV, HTLV, XMRV) were curated from 108 publications. All VISs are evaluated by the completeness of the VIS data and further categorized according to the method used to detect the VIS in the original articles. Table S2 shows the statistics of cytobands harboring VISs. Table S3 shows the results of chromosome fragile sites harboring VISs in HBV, HPV, HIV, HTLV-1 and EBV integration. Table S4 shows the frequency of VISs located in genic and intergenic regions. Table S5 shows the frequency of VISs located in exons, introns, promoter regions, etc.

Table S6 shows the target rate of three kinds of genes (oncogenes, tumor suppressor genes and non-classified genes). Table S7 shows the average VIS number located in genes. The top 25 most frequently targeted genes of HBV, HPV, HIV and HTLV-1 integration are shown in Figure S6. Table S8 shows the statistics of pathway genes targeted by VISs.

£ C1: VISs curated with all genomic annotations, junction sequence and site-adjacent sequence; C2: VISs with two precise integration locations, curated with all genomic annotations and target sequence; C3: VISs with one precise integration location, curated with all genomic annotations and site-adjacent sequences; C4: VISs with only a rough integration location. Numbers before and after "/" are VISs detected by wet-lab and non-wet-lab assays, respectively.
* Some articles report multiple kinds of viral integration. We only counted VISs with a precise chromosome location.

Figure S1). The category of basic information includes the sample type and the detection or description methods for the VIS in the original article. In addition, two identifying attributes are added: a status attribute, which shows whether the VIS was detected by an experimental assay, and a completeness attribute, which evaluates the integrity of the VIS record.
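The completeness categories C1 to C4 described in the Table S1 footnote can be read as a simple decision rule. The following is a minimal sketch under stated assumptions: the record fields (`precise_positions`, `junction_sequence`, `adjacent_sequence`, `target_sequence`) and the order of the checks are hypothetical and chosen for illustration; they are not the schema or the curation procedure actually used by VISDB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisRecord:
    # All field names are hypothetical, for illustration only.
    precise_positions: int                    # precisely mapped locations: 0, 1 or 2
    junction_sequence: Optional[str] = None   # human + virus-mixed + human sequence
    adjacent_sequence: Optional[str] = None   # site-adjacent human sequence
    target_sequence: Optional[str] = None     # human sequence between two locations


def completeness(rec: VisRecord) -> str:
    """Assign a completeness label following the Table S1 footnote (assumed rule order)."""
    if rec.precise_positions >= 1 and rec.junction_sequence and rec.adjacent_sequence:
        return "C1"  # junction sequence and site-adjacent sequence available
    if rec.precise_positions >= 2:
        return "C2"  # two precise integration locations
    if rec.precise_positions == 1:
        return "C3"  # one precise integration location
    return "C4"      # only a rough location (e.g. a cytoband)
```

For example, a record with one precise location and no junction sequence would be labeled C3 under these assumptions.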
The junction sequence category plays the most significant role in downstream analysis.
Virus integration into the human genome may follow many different patterns. In the simplest pattern, a segment of the virus sequence is broken off and inserted into the host genome without any other process occurring during the integration event. However, reverse insertions, rearrangements, microhomology and mutations may take place in the process of integration, and the integration event may be complex. Therefore, we consider a virus integrated within a human sequence to have the form "human sequence" + "virus-mixed sequences" + "human sequence". In other words, a junction sequence is composed of a human sequence preceding the integration region, a mixed sequence made up of virus sequences and unknown sequences but no human sequences, and a human sequence following the integration region. Notably, both an overlap between the human sequence and the virus sequence and an unknown sequence between the human sequence and the virus sequence are allowed. However, no human sequence can exist in the mixed sequence; otherwise, the integration event is divided into two events.

Figure S1. Virus integration event model.
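The rule that a human segment inside the mixed region splits one integration event into two can be illustrated in code. This is a minimal sketch, assuming the junction has already been decomposed into ordered, labeled segments; the `Segment` representation, the `IntegrationEvent` type and the `split_events` function are hypothetical names introduced here, and overlaps (microhomology) between adjacent segments are not modeled.

```python
from typing import List, NamedTuple, Optional, Tuple

# (origin, sequence); origin is one of "human", "virus", "unknown"
Segment = Tuple[str, str]


class IntegrationEvent(NamedTuple):
    left_human: str        # human sequence preceding the integration region
    mixed: List[Segment]   # virus and unknown segments only
    right_human: str       # human sequence following the integration region


def split_events(segments: List[Segment]) -> List[IntegrationEvent]:
    """Split ordered segments into human + virus-mixed + human events."""
    events: List[IntegrationEvent] = []
    left: Optional[str] = None   # most recent human flank seen so far
    mixed: List[Segment] = []
    for origin, seq in segments:
        if origin == "human":
            # A human segment closes the current event (if it contains virus
            # sequence) and becomes the left flank of the next one.
            if left is not None and any(o == "virus" for o, _ in mixed):
                events.append(IntegrationEvent(left, mixed, seq))
            left = seq
            mixed = []
        elif origin in ("virus", "unknown"):
            if left is not None:   # ignore segments before the first human flank
                mixed.append((origin, seq))
        else:
            raise ValueError(f"unexpected segment origin: {origin!r}")
    return events
```

With the input human, virus, unknown, human, virus, human, the function returns two events, and the middle human segment serves as the right flank of the first event and the left flank of the second.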
We use the API provided by Cytoscape to visualize the interactions (Figure S2). The numbers of VISs involved in lncRNA-associated and miRNA-associated interactions are 83 and 26 414, respectively.

Figure S2. Visualization of virus-host interactions involved in miRNAs.

Figure S3 shows the chromosomal distribution of VISs. We first calculated the chromosome distribution of VISs using the original data. Then we calculated the density score of each chromosome using the following formula:

Figure S4 is the heatmap of the VIS distribution in CFSs. We normalized the VIS numbers of 123 fragile sites for HBV, HPV, HIV and EBV with the following formula:

where v∈{HBV,HPV,HIV,EBV}, is the CFSi-targeted VIS number of virus v, and Nv is the total number of VISs with a precise location.
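The per-virus normalization behind the heatmap can be sketched as follows. This is a minimal sketch under a loudly stated assumption: the normalized value is taken to be the plain fraction of the CFSi-targeted VIS number over Nv, which is one plausible reading of the definitions above and may differ from the formula actually used. The function name and the counts in the usage example are illustrative only and are not VISDB data.

```python
from typing import Dict


def normalize_cfs_counts(
    counts: Dict[str, Dict[str, int]],
    totals: Dict[str, int],
) -> Dict[str, Dict[str, float]]:
    """Divide each per-site VIS count of virus v by N_v (assumed normalization)."""
    normalized: Dict[str, Dict[str, float]] = {}
    for virus, per_site in counts.items():
        n_v = totals[virus]   # total VISs with a precise location for this virus
        if n_v <= 0:
            raise ValueError(f"total VIS count for {virus} must be positive")
        normalized[virus] = {site: n / n_v for site, n in per_site.items()}
    return normalized


# Illustrative numbers only, not taken from VISDB.
example = normalize_cfs_counts(
    {"HBV": {"FRA3B": 5, "FRA16D": 15}},
    {"HBV": 100},
)
```

Normalizing by Nv makes the rows of the heatmap comparable across viruses whose total VIS numbers differ by orders of magnitude.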