Characterization of Posa and Posa-like virus genomes in fecal samples from humans, pigs, rats, and bats collected from a single location in Vietnam

Abstract Porcine stool-associated RNA virus (posavirus), and Human stool-associated RNA virus (husavirus) are viruses in the order Picornavirales recently described in porcine and human fecal samples. The tentative group (Posa and Posa-like viruses: PPLVs) also includes fish stool-associated RNA virus (fisavirus) as well as members detected in insects (Drosophila subobscura and Anopheles sinensis) and parasites (Ascaris suum). As part of an agnostic deep sequencing survey of animal and human viruses in Vietnam, we detected three husaviruses in human fecal samples, two of which share 97–98% amino acid identity to Dutch husavirus strains and one highly divergent husavirus with only 25% amino acid identity to known husaviruses. In addition, the current study found forty-seven complete posavirus genomes from pigs, ten novel rat stool-associated RNA virus genomes (tentatively named rasavirus), and sixteen novel bat stool-associated RNA virus genomes (tentatively named basavirus). The five expected Picornavirales protein domains (helicase, 3C-protease, RNA-dependent RNA polymerase, and two Picornavirus capsid domain) were found to be encoded by all PPLV genomes. In addition, a nucleotide composition analysis revealed that the PPLVs shared compositional properties with arthropod viruses and predicted non-mammalian hosts for all PPLV lineages. The study adds seventy-six genomes to the twenty-nine PPLV genomes currently available and greatly extends our sequence knowledge of this group of viruses within the Picornavirales order.


Introduction
The order Picornavirales includes a wide range of viruses that infect a variety of hosts. According to the latest International Committee on Taxonomy of Viruses (ICTV) classification (ICTV 2017), the order comprises five families: the families Dicistroviridae and Iflaviridae contain members which infect insects (e.g. cripavirus and deformed wing virus), the family Secoviridae members which infect plants (e.g. turnip ringspot virus), the family Picornaviridae members infecting vertebrates (e.g. enteroviruses) and the family Marnaviridae. The latter contains only Heterosigma akashiwo virus for which algae is the natural host (Le Gall et al. 2008).
Although members of the Picornavirales are highly diverse, they share a number of common features, including a single stranded positive-sense RNA genome and co-linear genes encoding a helicase, protease, and RNA-dependent RNA polymerase (RdRP) replication block (Le Gall et al. 2008). The genome lengths for Picornavirales range from 7.2 to 9.8 kb. Typically, the encoded polyprotein is cleaved by virus-encoded proteases (Blom et al. 1996). Generally, members of the Picornavirales are monopartite, although some members of the Secoviridae have genomes with two segments (Le Gall et al. 2008).
Increasing improvements in next-generation sequencing (NGS) has identified a number of divergent members of the order Picornavirales. Porcine stool-associated RNA viruses (posaviruses) were found in the feces of healthy pigs and water collected from swine farms (Shan et al. 2011, Hause et al. 2015, 2016, fish stool-associated RNA virus (fisavirus) was identified in the intestinal content of a healthy carp (Reuter et al. 2015), and human stool-associated RNA virus (husavirus) was identified in the feces of predominantly healthy humans (Oude Munnink et al. 2015). Although structually closely related (based on the genome organisation), these viruses display broad genetic diversity with often less than 40% amino acid identity in specific coding regions thereby suggesting a deep evolutionary history of the virus family.
Although posaviruses can be detected at high frequency in pig fecal samples (21%), a recent study using immunoprecipitation coupled with PCR detection assay showed that posavirus antibodies were infrequently detected (Hause et al. 2016). The possibility that posaviruses may not infect pigs but rather infect gut commensal organisms or have a dietary or environmental origin is supported by blast analysis of posavirus sequences that showed that some posavirus strains have greatest sequence similarity to an RNA sequence from the parasite Ascaris suum (Shan et al. 2011;Wang et al. 2011). Furthermore, a mRNA sequence from the mosquito Anopheles sinensis and a virus recently identified in the fruit fly Drosophila subobscura have been described showing some sequence identity to posaviruses (Webster et al. 2016). Although the viruses have been identifed in samples from different hosts, the true infection hosts for fisavirus, posavirus, and husavirus remain to be determined.
As part of a study to define patterns of viral zoonosis in Vietnam (Rabaa et al. 2015), we performed detailed agnostic (random-primed) whole-genome deep sequencing (Cotten et al. 2014) on fecal samples from bats, humans, pigs, and rats and rectal swabs from humans and pigs. We have analyzed these sequence data for the presence of PPLVs and we describe here a large set of novel virus genomes from human, rat, pig, and bat samples that share homology and protein domain architecture with the previous described posaviruses.

Results
For simplicity, we will use the term Posa and Posa-like viruses (PPLVs) throughout the manuscript. The PPLV category comprises virus and virus sequences that show >30% amino acid homology to the existing posavirus and husavirus genome sequences, do not cluster within the five established Picornavirales families and show a Picornavirales genome organization with the expected five Picornavirales protein domains (see below for further details). A search for PPLV genomes in sequences was performed as follows: short read data (3-4 million 250 nt paired end reads per sample) were de novo assembled into longer sequence contigs and a protein sequence based USEARCH analysis (Edgar 2010) was performed against a database containing all Picornavirales protein entries in GenBank, including all known posaviruses sequences. This search identified three husaviruses, forty-seven posaviruses, ten novel rasaviruses, and sixteen novel basaviruses genomes. The genome lengths of the newly identified PPLV genome sequences varied from 8,262 to 11,318 nucleotides and for all viruses the read coverage across the genome and G þ C content was determined. The results of these analyses and the available demographical data for these samples are summarized in Table 1.
In two human fecal samples, husaviruses (KX673274 and KX673221) showed high level of amino acid sequence identity to the previously described husaviruses KT215901, KT215902, and KT215903 (97-98% amino acid identity). In contrast, an additional husavirus detected in a human rectal swab (KX673248) showed only 25% amino acid identity over the entire polyprotein with other husaviruses.
Posavirus sequences could be detected in thirty-three (of 189) pig rectal swabs (17% frequency) and in eight (of 146) pig fecal samples (5% frequency). In each of four pig rectal swabs (sample IDs 17189_4, 17819_95, 17668_11_2, and 17668_13), two distinct strains of posaviruses were identified, while in one sample (17668_33) three distinct posaviruses were identified. The posavirus sequences identified in this study have the closest sequence identity to variants detected in farmed pigs in the USA (Shan et al. 2011, Hause et al. 2015, 2016. Moreover, novel posalike genome sequences were found in nine (of 45) rat fecal samples [provisionally named rat stool-associated RNA viruses (rasaviruses)], and in thirteen (of 135) bat fecal samples [provisionally named bat stool-associated RNA viruses (basaviruses)]. In one rat (16715_47) and in three bat fecal samples (16715_52, 16715_61, and 16715_71) two distinct rasa/basaviruses were identified.
The low level of shared nucleotide identity between these novel viruses made it difficult to perform phylogenetic analyses at the whole genome level. Therefore, the protein sequence encoding the most conserved region, a provisional RdRP protein, was identified and used for phylogenetic analysis. This analysis supported a conclusion that two husaviruses (KX673221 and KX673274) belonged to a lineage that includes the previously described husaviruses (KT215901, KT215902, and KT215903), while husavirus KX673248 was distant (Fig. 1). Based on this phylogenetic analysis and using a pairwise amino acid identity cutoff of 40%, twenty-two lineages could be identified. While most lineages were found in only a single source type of sample (e.g. all porcine), the Bv_7 lineage comprised basaviruses and a virus isolated from a fruit fly Drosophila subobscura and the two lineages Pv_8 and Pv_9 comprised posaviruses and viral sequences derived from a parasite (Ascaris suum ; Shan et al. 2011) (Fig. 1). For each lineage, a representative virus genome (based on the most complete sequence of the twenty-nine sequences present in the GenBank database and the newly identified sequences) was selected and characterized in more detail (Supplementary Table S1). The length of these genomic sequences ranged from 8,576 nt to 11,318 nt with a G þ C content of 31.0-53.0%. The twenty-two lineages share, on average, only 9-38% amino acid identity across the entire polyprotein (Fig. 2). Phylogenetic analysis was also performed on amino acid sequences encoding the conserved helicase, 3C protease, capsid I and capsid II domains. Due to the high sequence diversity, each set of sequences was trimmed to the most conserved region of each identified conserved domain.
Neighbor-joining (NJ) and maximum-likelihood (ML) trees were constructed individually for each of the conserved domain set of sequences. The NJ tree topology of the RdRp (left panel Supplementary Fig. S1A) was relatively consistent with the ML tree topology of RdRP (Fig. 1). However, this relatively consistency was not observed in the NJ versus ML trees in other domains (left panel compared right panel, Supplementary Fig. 1), probably due to the great sequencing divergence hence challenging proper ML tree inference.
The prevalence of husaviruses among stools samples from Vietnamese individuals was 1.4% (1/71) in healthy human rectal swabs and 0.3% (2/573) in human diarrheal feces. Rasaviruses and basaviruses were detected in 22 and 9% of the rat and bat fecal samples, respectively. Pigs also commonly carry these viruses, with posavirus being found in 17% of the rectal swabs and in 5% of the fecal samples examined in this study. The frequency of virus detection was significantly higher in rectal swabs compared to fecal samples for posaviruses (P value ¼ 0.002; Chi-squared test).
The frequency of husavirus positive samples was too low to draw conclusions about the prevalence in rectal swabs compared to fecal samples (P value ¼ 0.59; Fishers' Exact test).
While members of the Picornavirales typically contain a Hel-Pro-Pol replication block (Le Gall et al. 2008), some of the recently identified posaviruses initially appeared to not encode a recognizable conserved protease domain (Shan et al. 2011;Hause et al. 2015Hause et al. , 2016. A local HMMER search (Eddy 2011) using the complete PFAM library (Marchler-Bauer et al. 2015) failed to identify a recognizable picornavirus 3C protease domain in the majority of genomes. It was unlikely that these viruses completely lacked the protease and we suspected that the failure to detect the protease domain could be due to sequence diversity in the protease domain of RNA viruses (Koonin and Dolja 1993). Accordingly, a refined 3C protease HMM profile was constructed including all newly identified protease domains in posaviruses. A search using this refined protease domain profile identifed a putative protease domain in all of the posaviruses (Supplementary Table S1). In addition, all genomes were found to encode an RNA helicase domain, an RdRP domain and two picornavirus capsid domains [with the exception of Asv_1 since this GenBank entry is only partial and posavirus_3 where no conserved RNA helicase domain could be identified (Fig. 3)].
The G þ C contents for all PPLV sequences were determined but no specific G þ C content pattern was observed in virus sequences from different hosts. The husaviruses showed the highest G þ C content (50.5-53.0%), followed by posaviruses (30.9-51.2%), rasaviruses (40.5-44.0%), and basaviruses (32.2-48.2%) ( Table 1 and Supplementary Table S1). As previously described (Kapoor et al. 2010), nucleotide composition analysis (NCA) can be used to predict the host range of members of Supergroup 1 RNA virus, that includes the Picornavirales. Sequences from 105 PPLV genomes obtained in the current study and from published sources were analyzed using a pre-trained dataset of reference genomes from three categories of hosts (arthropod, plant, and vertebrate; Fig. 4). The analysis revealed that almost all posaviruses as well as all husa-, basa-, rasa-, fisa-, insect-, and nematoda viruses clustered within the arthropod group (Fig. 4). Two basaviruses (lineage Bv_2) cluster within the vertebrate group of the Picornaviridae and one husavirus (lineage Hv_2) clustered within the plant virus group of the Picornaviridae. These observations fell within the 5% error range of the analysis (95% prediction accuracy of the controls, Supplementary  Table S2).

Discussion
Here we report the identification of new Picornavirales members related to sequences previously identified in pig stool (posavirus) and human stools and/or rectal swabs (husavirus). In addition, we describe newly identified bat stool-associated viruses (basaviruses) and rat stool-associated viruses (rasaviruses) which have a similar genomic organization compared to posaviruses. Posaviruses are known to be widely distributed geographically with examples found in the USA (Shan et al. 2011) and in China (Zhang et al. 2014), however, this is the first detection of husavirus in human stools outside the Netherlands. These Posa and Posa-like virus genomes are collectively referred as PPLVs.
The PPLV genomes were identified based on identity to previously identified posaviruses and their lack of close protein homology to any of the Picornavirales families and the presence of a set of five protein functional domains. Using standard phylogenetic analyses, the PPLVs formed lineages which are distinct from the five established Picornavirales families (Dicistroviridae, Iflaviridae, Marnaviridae, Picornaviridae and Secoviridae) and the unassigned Picornavirales genome sequences. However, there is as much diversity between the  PPLVs as there is between the PPLVs and the established Picornavirales families.
We used the USEARCH clustering algorithm in an attempt to determine how close the PPLV genomes are to existing Picornavirales genomes. All 108 PPLV genome sequences were combined with all available Picornavirales full genome sequences from GenBank (5,766 genomes, excluding those with stretches if Ns greater than 20). At various levels of homology (ranging from 60% to 90% nucleotide identities), the PPLV sequence clusters were distinct from the clusters formed from the Picornavirales genomes, i.e. there were no clusters containing both PPLV sequences and genomes classified in one of the Picornavirales families (results not shown). Thus we think it is valid to conclude that none of the PPLVs belong to established Picornavirales families. The PPLV group is however too diverse to be classified as a single virus family. Given the pace at which new virus sequences are becoming available, we believe the best approach is to deposit these sequences with a tentative identification as PPLV and as more detailed sequence data become available a better organization of these virus sequences into well-supported family or families can be made.
Members of PPLV group have now been identified in pigs, humans, fish, rats, bats, insects (Anopheles sinensis and Drosophila subobscura), and parasites (Ascaris suum). Based on phylogenetic analysis of the RdRP domain and pairwise comparisons of the entire polyprotein, we propose that the PPLVs comprise twenty-two phylogenetic lineages. These PPLVs could also be grouped in twenty-two lineages based on the NJ trees constructed from amino acid sequences encoding other conserved domains (putative helicase, protease, capsid I, and II, Supplementary Figs S1A-D).
Consistent with other members of the Picornavirales, most of the newly described PPLV sequences encoded a Hel-Pro-Pol replication block. However, in some of the genome sequences, no recognizable protease domain could be identified using conventional methods with an existing pre-made PFAM domain based on a limited number of picornavirus protease domains. However, a more detailed protease domain database based on a broader set of Picornavirales proteases, including the novel putative posavirus protease domains, revealed the presence of a protease domain across the entire range of PPLV genomes (Fig. 3).
In an attempt to infer a putative cellular host for the PPLVs, a nucleotide composition analysis (NCA) was performed. NCA incorporates composition measures of dinucleotide frequencies and has been used to predict the infectious hosts of members of RNA virus supergroup I (Kapoor et al. 2010). In a set of sequences for which the infectious host was known, the analysis was able to accurately classify viruses as either being of vertebrate, plant, or arthropod origin in around 95% of the cases (Koonin et al. 2008). Using this analysis method, the PPLV genome sequences were found to cluster with viruses from the arthropod group (Fig. 4). The two outliers of the discriminant analysis (lineage HV_2 and Bv_2) fall within the 95% confidence interval, but given their substantial sequence divergence from other PPLVs it is possible that these viruses infect another hosts.
The prevalence of husaviruses in fecal samples (0.3%) and rectal swabs (1.4%) in Vietnam was lower than the 3.5% prevalence observed in a cohort of predominantly healthy HIV-1 positive and negative individuals (Oude Munnink et al. 2015). Prevalence differences may be due to differences in RT-PCR detection versus genome assembly from next generation sequencing, the small sample numbers and/or true differences between the cohorts. Of interest, posavirus could be detected significantly more often in pig rectal swabs compared to pig fecal samples (P ¼ 0.002), suggesting that the viruses are enriched on the rectal epidermis. This enrichment and the clustering of posaviruses with the arthropod viruses may be consistent. It is known that intestinal parasites can be found perirectally and can be detected using the scotch tape test (Enterobius_Vermicularis_Diagnostic_Test). An interesting follow-up analysis would be to determine the scotch tape virome and our prediction would be that members of the PPLV group can be found in these samples.
In summary, this study provides a large set of seventy-six new PPLV genomes, quadrupling the available genomic data for this broad group viruses. A novel Vietnamese husavirus genetically distant from the previously described husaviruses was identified and PPLV members were also detected in rat and bat feces. In addition, we were able to clarify two additional features of posavirus virology: a putative protease domain was detected in all PPLV genomes and NCA revealed that members of the PPLV group share a conserved nucleotide composition with viruses infecting members of the arthropod phylum.

Samples
Fecal material was collected from 135 bats (Scotophilus kuhlii), 573 humans (Homo sapiens), 146 pigs (Sus domesticus), and 45 rats (Rattus argentiventer). In addition, rectal swabs were collected from seventy-one humans and 189 pigs. These samples were collected from a 150 square kilometer area of Dong Thap province, a southern region within the Mekong Delta River in Vietnam. All fecal samples from human enrollees were diarrheal patients admitted to Dong Thap Provincial hospital, while human rectal swabs were taken from healthy farmers and family members. Pig fecal samples and rectal swabs were collected from individual pigs from breeding farms. Rat fecal samples were collected from rats, which were purchased on the market or collected from rice-field traps. The disease state of these animals is unknown. Bat fecal samples were collected from beneath roosting sites.
Ethical approval for the study was obtained from the Oxford Tropical Research Ethics Committee (OxTREC Approval No. 15-12) (Oxford, United Kingdom), the institutional ethical review board of Dong Thap Provincial Hospital (DTPH) and the Sub-Department of Animal Health Dong Thap province (Dong Thap, Vietnam). Posaviruses are plotted in white diagonals, husavirus in light yellow, basavirus in yellow, rasavirus in dark yellow, fisavirus in orange and insect/nematode infecting viruses in dark orange.

Illumina sequencing
Fecal samples (n ¼ 899) or rectal swabs (n ¼ 260) were centrifuged for 10 min at 10,000 Â g after which the samples were DNase treated at 37 C for 30 min (20 U of TURBO DNase, Thermo Fisher per 100 ml of sample). Nucleic acids were extracted, transcribed into cDNA and subjected to second strand synthesis (de Vries et al. 2011(de Vries et al. , 2012. The resulting dsDNA from each sample was sheared and fractionated to 400-500 bp in length after which Illumina adapters with a unique barcode were ligated to the fragments. Resulting libraries were sequenced with the Illumina MiSeq or HiSeq platforms to generate 1-2 million 150 nt (MiSeq) or 3-4 million 250 nt (HiSeq) paired-end reads per sample.

De novo assembly and complete genome characterization
Adaptor sequences were removed and sequence reads that passed quality control were de novo assembled using SPAdes version 3.5.0 (Bankevich et al. 2012) followed by improve_assembly (Page 2012). The resulting contigs were subjected to a modified protein blast search using USEARCH (Edgar 2010) to identify novel members of the Picornavirales. To minimize the effects of Illumina cross-talk, all preliminary contigs were examined and contigs within a sample with low median coverage (greater than 10-fold lower than the major contig in the sample) were excluded from the analysis. For all PPLVs reported here, the complete or nearly complete (>8,000 nt) genome was obtained and for all viruses the genome coverage was determined by mapping all quality controlled sequence reads to the final genome. The G þ C content was determined using Geneious (Kearse et al. 2012). To determine the average percentage amino acid identity across the PLLV lineages, amino acid sequences were aligned using the ClustalW in Geneious (Kearse et al. 2012).
To identify conserved protein domains encoded by the new genomes an RPS-BLAST search (Marchler-Bauer et al. 2015) against the Conserved Domain Database (CDD) was performed. The initial screen identified the helicase, RdRP, and picornavirus capsid (I and II) domains across almost all genomes. However, the 3C protease domain was identifed in only a subset of genomes, suggesting either a true absence or a misidentification due to great sequence divergence. A modified 3C protease domain profile was generated from a protein sequence alignment of the conserved domain (pfam00548) from the CDD and used to identify the 3C protease-like regions in the new PPLV genomes. An updated alignment containing all the putative protease domains used to create a new HMM index file. A local hmmsearch analysis with this updated 3C protease profile was then performed to identify divergent putative protease domains in the PPLV genome sequences.

Discriminant analysis of the dinucleotide bias
Nucleotide composition analysis (NCA) was performed as previously described (Kapoor et al. 2010) using sequences of members of RNA virus supergroup 1 (Koonin et al. 2008) infecting vertebrates (n ¼ 113), arthropods (n ¼ 66), and plants (n ¼ 172) for classification. The frequencies of each mononucleotide and dinucleotide were used for discriminant analysis to maximize discrimination between control sequences; these canonical factors were then used to infer the host origin of the RNA virus sequences obtained in the current study.

Phylogenetic analysis
All PPLV sequences identified in this study combined with all complete PPLV genomes present in the GenBank database (retrieved on 16 July 2016) were aligned using muscle (Edgar 2004). Amino acids sequences were trimmed to the region encoding for the conserved domains and alignments were manually inspected and trimmed to the most conserved part. Phylogenetic analyses were performed on the conserved putative conserved domains using IQtree (Nguyen et al. 2015), under the best-fitted amino acid model with 500 pseudo-replicates. The resulting trees were visualized using FigTree v1.4.2 (http://tree.bio.ed.ac. uk/software/figtree/).

Statistical analysis
Statistical analysis was performed using the two by two table from Open Epi (Sullivan et al. 2009). As a measure of association, the Chi-squared test or the Fishers's exact test was used.

GenBank accession numbers
All PPLV genome sequences generated in this study were deposited into the GenBank database under the accession numbers KX673215-KX673290.