Unmasking the tissue-resident eukaryotic DNA virome in humans

Abstract Little is known about the landscape of viruses that reside within our cells, or about the interplay with the host that is imperative for their persistence. Yet a lifetime of interactions conceivably has an imprint on our physiology and immune phenotype. In this work, we revealed the genetic make-up and unique composition of the known eukaryotic human DNA virome in nine organs (colon, liver, lung, heart, brain, kidney, skin, blood, hair) of 31 Finnish individuals. By integration of quantitative (qPCR) and qualitative (hybrid-capture sequencing) analysis, we identified the DNAs of 17 species, primarily herpes-, parvo-, papilloma- and anello-viruses (>80% prevalence), typically persisting in low copies (mean 540 copies/million cells). We assembled in total 70 viral genomes (>90% breadth coverage), distinct in each of the individuals, and identified high sequence homology across the organs. Moreover, we detected variations in virome composition in two individuals with underlying malignant conditions. Our findings reveal unprecedented prevalences of viral DNAs in human organs and provide a fundamental ground for the investigation of disease correlates. Our results from post-mortem tissues call for investigation of the crosstalk between human DNA viruses, the host, and other microbes, as it predictably has a significant impact on our health.

TGTGTGCCAAAGAAGTGTCCT TCTGTCACCTGTTGGAGCATT CTGATGCTACTACTGAAATTGAA
*All the primers and probes were ordered from Sigma-Aldrich/Merck.
**Locked nucleic acids are inside brackets.
All qPCRs were done in a reaction volume of 25 µl, except for the RNase P and human herpesvirus qPCRs, which were done in 20 µl. The Luminex multiplex PCR was done in a 20 µl reaction.

Supplementary Texts
Supplementary Text S1. Description of hybrid reference and de-novo-based genome reconstruction.
The TRACESPipe pipeline combines alignment-based assembly with de novo assembly to reconstruct genomes with maximum sensitivity and resolution (25). TRACESPipe selects the genome sequence with the highest breadth coverage derived from five different rounds. The first round reconstructs a genome exclusively with an alignment-based approach according to the best reference. The second round uses the consensus generated from the alignments and aligns the de novo scaffolds using BWA (26) while prioritizing the reference-based approach. The third round is similar to round one, but priority is given to the de novo scaffolds. The alignments are produced with very high sensitivity, forcing the output to be more similar to the de novo assembly when the consensus from the alignments is ambiguous or contains gaps. The fourth round finds the scaffolds from the de novo assembly with the highest similarities reported by FALCON-meta (27) and employs them as a candidate genome. The fifth round uses the scaffolds from round three as a reference and aligns the consensus sequence created in round one.
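The selection step can be illustrated with a minimal Python sketch. It is not TRACESPipe code: the function names, the per-base depth representation, and the toy depth values are assumptions made for illustration only. Breadth coverage is taken here as the fraction of reference positions covered by at least one read, and the round with the highest value is kept.

```python
from typing import Dict, List


def breadth_coverage(depth: List[int], min_depth: int = 1) -> float:
    """Fraction of positions covered by at least `min_depth` reads."""
    if not depth:
        return 0.0
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)


def select_best_round(round_depths: Dict[str, List[int]]) -> str:
    """Name of the round whose reconstruction has the highest breadth coverage.

    Ties are resolved in favour of the round listed first (an arbitrary
    choice made for this sketch).
    """
    return max(round_depths, key=lambda name: breadth_coverage(round_depths[name]))


# Hypothetical per-base depth profiles for the five rounds (toy data).
toy_rounds = {
    "round1": [3, 2, 0, 0, 1, 4, 0, 2, 1, 1],
    "round2": [3, 2, 1, 0, 1, 4, 0, 2, 1, 1],
    "round3": [0, 2, 1, 0, 1, 4, 0, 2, 1, 0],
    "round4": [5, 5, 5, 0, 0, 0, 0, 0, 0, 0],
    "round5": [1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
}

best = select_best_round(toy_rounds)
print(best, breadth_coverage(toy_rounds[best]))
```

In practice the depth profiles would come from the alignments of each round, for example from a per-base coverage report, rather than from hand-written lists.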
Supplementary Text S2. Computational controls to prevent irregular patterns, imbalanced representation, and exogenous content.
TRACESPipe includes three main controls: redundancy, database, and exogenous controls. These controls are critical to detect the source of irregular patterns, imbalanced representation, or exogenous content. The redundancy control estimates duplications or low-complexity regions in the sequences.
TRACESPipe enables cross-checking of the information with similar sub-regions. GTO (28) is used to identify low-complexity regions (29), and it includes a DNA compressor that estimates the content along each genome based on GeCo2 (30). This information is crossed with the coverage profiles generated with BEDTools (31) and the data from the exogenous control. Finally, TRACESPipe removes duplicates using the markdup function from Samtools (32).
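As a rough illustration of the compression-based part of the redundancy control, the sketch below profiles sequence complexity along a genome in sliding windows. It uses Python's general-purpose zlib compressor as a stand-in for the GeCo2-based estimator in GTO, which is a substitution made only for illustration. The window size, step, function names, and toy sequence are hypothetical.

```python
import random
import zlib
from typing import List, Tuple


def compression_ratio(seq: str) -> float:
    """Compressed size divided by raw size; lower values suggest more redundancy."""
    data = seq.encode("ascii")
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)


def complexity_profile(
    seq: str, window: int = 100, step: int = 50
) -> List[Tuple[int, int, float]]:
    """Return (start, end, ratio) for each sliding window along `seq`."""
    if not seq:
        return []
    profile = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        profile.append((start, start + len(chunk), compression_ratio(chunk)))
    return profile


# Toy genome: 300 random bases followed by a 200-base homopolymer run.
rng = random.Random(0)
random_part = "".join(rng.choice("ACGT") for _ in range(300))
toy_genome = random_part + "A" * 200

profile = complexity_profile(toy_genome)
lowest = min(profile, key=lambda w: w[2])
print("least complex window:", lowest[0], lowest[1])
```

Windows with unusually low ratios are candidates for low-complexity or repetitive regions. They could then be compared with the coverage profile of the same region to judge whether a coverage peak reflects genuine signal or redundancy.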
The database control addresses viruses that share high similarity with other family members (e.g., Polyomaviridae) or with the human host (e.g., Herpesviridae). Such similarity can result in imbalanced mapping of the reads to various references. When the references are complete genomes, the mapping automatically finds the best reference; however, when partial genomes are also included, the best reference may be attributed to a partial genome in which only conserved regions are present. To mitigate this, we apply FALCON-meta to measure the cross-similarity between the best references.
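The partial-genome problem can be made concrete with a small Python sketch. It does not reproduce FALCON-meta, which relies on compression-based similarity; a simple k-mer containment score is swapped in here as an illustrative stand-in, and all names, sequences, and the value of k are hypothetical.

```python
from itertools import permutations
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: str, target: str, k: int = 4) -> float:
    """Fraction of the k-mers of `query` that also occur in `target`."""
    q = kmers(query, k)
    if not q:
        return 0.0
    return len(q & kmers(target, k)) / len(q)


def cross_similarity(refs: Dict[str, str], k: int = 4) -> Dict[Tuple[str, str], float]:
    """Containment score for every ordered pair of references."""
    return {
        (a, b): containment(refs[a], refs[b], k)
        for a, b in permutations(refs, 2)
    }


# Toy references: a complete genome, a partial record cut from it,
# and an unrelated sequence.
complete = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
toy_refs = {
    "complete": complete,
    "partial": complete[5:20],
    "unrelated": "T" * 16,
}

for pair, score in sorted(cross_similarity(toy_refs).items()):
    print(pair, round(score, 2))
```

A partial reference that is fully contained in a complete one, while the reverse score is much lower, is the pattern to look for: reads from the conserved region map equally well to both, so the partial record can be reported as the best reference even though the complete genome explains the data better.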
Regarding the cross-similarity to human DNA, a small number of reads may be assigned to a reference virus despite being of human origin. We apply FALCON-meta to measure and localize regions of high similarity between the viruses and the human reference genome. Additionally, low-breadth coverage sequences are always manually inspected and confirmed by BLAST.
Fungi, bacteria, plants, archaea, and protozoa (among others) may display low levels of similarity to the viral genomes or mitogenomes (33). TRACESPipe estimates the content of exogenous sequences with FALCON-meta (34) using databases for each respective type. The reference databases are automatically downloaded and built with Entrez (35). The most representative genomes are aligned according to the respective reference to identify their impact on the reconstructed genomes.