Metagenomic analysis of planktonic riverine microbial consortia using nanopore sequencing reveals insight into river microbe taxonomy and function

Abstract Background Riverine ecosystems are biogeochemical powerhouses driven largely by microbial communities that inhabit water columns and sediments. Because rivers are used extensively for anthropogenic purposes (drinking water, recreation, agriculture, and industry), it is essential to understand how these activities affect the composition of river microbial consortia. Recent studies have shown that river metagenomes vary considerably, suggesting that microbial community data should be included in broad-scale river ecosystem models. But such ecogenomic studies have not been applied on a broad “aquascape” scale, and few if any have applied the newest nanopore technology. Results We investigated the metagenomes of 11 rivers across 3 continents using MinION nanopore sequencing, a portable platform that could be useful for future global river monitoring. Up to 10 Gb of data per run were generated with average read lengths of 3.4 kb. Diversity and diagnosis of river function potential was accomplished with 0.5–1.0 ⋅ 106 long reads. Our observations for 7 of the 11 rivers conformed to other river-omic findings, and we exposed previously unrecognized microbial biodiversity in the other 4 rivers. Conclusions Deeper understanding that emerged is that river microbial consortia and the ecological functions they fulfil did not align with geographic location but instead implicated ecological responses of microbes to urban and other anthropogenic effects, and that changes in taxa manifested over a very short geographic space.


Background
River ecosystems are Earth's biogeochemical powerhouses, and riverine processes largely are driven by the microbial communities that inhabit their water columns and sediments [1]. From an applied anthropogenic perspective, rivers are the lifeblood of human communities; recognition of this perspective led the New Zealand Government to grant legal personhood status to the Whanganui River as an indivisible and living whole [2]. Rivers provide food and drinking water and are a resource for agricultural and industrial use coupled with waste distribution, thereby reflecting a fingerprint of the total environment. Frequently, these services and activities are provided within an alarming proximity to each other. Regulatory authorities in many regions currently assess river "health" for management and monitoring of water resources using methods such as Biological Condition Gradient [3] and Index of Biotic Integrity [4,5]. Such assessments score river "health" on the basis of occurrence of certain conditions, response to stress, and abundance of eukaryotic organisms. The recent focus on antimicrobial resistance (AMR) has highlighted the potential of AMR genes in aquatic microbes as a potential threat to human health. Complex microbial river water communities, often contributed to by human and animal activity, have more AMR genes than simple communities [6]. However, it is unclear at the moment which microbial resistance genes (or gene combinations) are a threat to human health and at what concentrations. Given that high-throughput sequencing has become economically viable for environmental monitoring, it is now possible to accurately characterize river metagenomes and determine the extent of taxonomic and functional variability among them. We can use this technology to monitor the levels of water-borne disease microorganisms and AMR genes, highlighting the need to include microbial community data in broad-scale ecosystem models.

Data Description
It is likely that there is a correlation between river water microbial community composition, as determined by metagenomic sequencing, and river function and health [7][8][9]. Recent eco-genomic methods offer the capability to understand river ecosystems in greater detail, but for this approach to be widely used, particularly for real-time study in remote river systems such as Amazon, Klinaklini, Onyx, or Yarlung Tsanpo, fielddeployable sequencing technology is necessary; the MinION has been demonstrated to be appropriate for on-site analysis [10]. We designed a study to evaluate river water metagenomes and the occurrence of riverine xenobiotic components, on a global scale, using the MinION portable sequencer paired with on-site data analysis. Assigning taxonomy and/or function for complex environmental river samples traditionally has been accomplished using whole-genome short-read sequences or short amplicons of 16S subregions [7,8]. More recently, Johnson et al. [11] provided data illustrating that taxonomic resolution based on short reads of 16S subregions is less accurate than defining taxa using the full 16S gene sequence, primarily due to intragenomic differences among 16S gene copies. Meanwhile, high-throughput long-read analysis of complex mixtures using MinION and Pacific Biosciences platforms has become routine. Metagenomic analysis of mock communities using long single-molecule reads generated using Oxford Nanopore recently has been validated by comparing taxonomic assignment from long reads against taxonomy assigned using 16S ribosomal DNA genes [12,13], illustrating that long-read metagenomes significantly match expected microbial taxonomic assignments and abundances. Bioinformatic study has shown that long, even error-prone, reads can significantly increase classification accuracy [13,14]. Other critical assessments of the strengths and weaknesses of long-read-based metagenomic analysis have shown that these data can enhance our knowledge of ecosystem function coupled to microbial community structure [15,16] and ultimately should help to more accurately model the biogeochemical processes driven by microbes. A deeper understanding of microbial diversity is needed to discern the implications on human health [17], e.g., the occurrence of antibioticresistant strains of bacteria in waterways that provide food and drinking water, and on productivity (e.g., nutrient cycling, crop irrigation, disposal of industrial and sanitation-related waste). Given that metagenomic analysis based on long-read data is promising, we envisioned a study with broad implementation of field-deployable long-read sequencing wherein we sampled a diverse set of 11 contrasting rivers and waterways across the globe. Here we describe a basic, high-level analysis of the results using multiple bioinformatic pipelines, providing all of the underlying raw sequence data for additional discovery and analysis by other researchers. We document the potential of longread nanopore sequencing and real-time analysis of DNA obtained globally for environmental monitoring of the river biota, detection of microbes that respond to urban anthropogenic influence, and documentation of potential pathogens and AMR presence and diversity, with the aim of ultimately enabling water quality enhancement. We further believe that the methodology developed in this study provides a robust, small-footprint protocol that will facilitate broadening riverine metagenomic studies.

Length and count statistics
Libraries constructed by the MinION SQK-RLB001 kit consistently produced 2-5 kb fragments that yielded sequencing data sets averaging 1.1 · 10 6 reads (3.8 · 10 9 bases) of read length 3.4 · 10 3 bp (Supplemental Fig. 1, Supplemental Table 1). Metagenomic results rarefaction ( Fig. 1) indicated that although curves did not reach saturation, in general 0.5-1.0 · 10 6 sampled long reads were adequate to capture most of the operational taxonomic unit (OTU) diversity of most samples.

Negative control samples
The number of classified reads from the mapped negative control sample reads was very low in comparison with the number of river sample reads (∼0.1%). Within all negative controls, 33 families were identified above a 1% proportion (in any control sample) and negative control reads accounted for 0.04 ± 0.02% of the total read counts in the corresponding samples. Two of the negative control sample data sets had at most 1 read, and all but 4 sets had too few reads to be analysed in MG-RAST. Ultimately, there was no obvious trend that indicated the source of negative control reads being a result of consistent sample contamination during sample preparation (i.e., the isolation, library preparation, barcoding). Combined with the fact that the total number of negative reads was trivial in most cases and that there was no obvious pattern to their derivation, we performed no deeper investigation into the sources of negative control reads.

Taxonomic diversity
Every read in every metagenome was assigned by MG-RAST to a predicted feature (a protein or ribosomal RNA). The mean proportion of reads was classified using the default criteria by MG-RAST to family 99.72% ± 0.09%, and for Kraken2 the average classification was 60.64% ± 3.75%. Metagenomic assignment of the whole-genome shotgun-sequenced (WGS) long-read data using One Codex resulted in much larger proportions of reads that were not classified (47-89%), which we hypothesize was a result of fewer taxa in the reference database and different assignment criteria. Thus, after preliminary analysis of the results, we opted for the MG-RAST and Kraken2 pipelines. Nearly all of the river metagenomes exhibited multimodal GC distributions (Supplemental Fig. 2), another indication of multiple domain representation that mirrors the GC representation in many other reports of freshwater environmental metagenomes [8,[18][19][20]. Reads for most river metagenomes were overwhelmingly assigned to the Bacteria domain at ≥94% with 1 exception, Sydhavnen at Copenhagen Harbour, where Virus accounted for >25% of the data and Bacteria only 68%. Eukaryotes were identified by MG-RAST in every metagenome at a level of ≤4% of reads, and Archaea were represented in all metagenomes by 0.2-6.0% of reads.
The 5 most common bacterial phyla observed were Proteobacteria, Bacteroidetes, Actinobacteria, Firmicutes, and Cyanobacteria. Proteobacteria were the most abundant prokaryote in most metagenomes (Vedder Canal was a distinct outlier where Bacteroidetes predominated), and within that group, the predominant taxon was the Burkholderiales, dominated therein by the Comamonadaceae comprising predominantly Acidovorax species (0.3-5% of assigned bacterial reads; iron and uranium oxidizers, nitrotolulene degraders, and plant pathogens) and Polaromonas (0.1-4% of bacterial reads; degraders of chlorinated alkenes and naphthalene). Another group that dominated the prokaryote hits was Bacteroidetes, composed overwhelmingly of Flavobacterium (0.5-35% of bacteria reads; extremely common in soils and freshwaters, and some are known disease agents). Moderately abundant prokaryotes were Actinobacteria, consisting nearly completely of Actinomycetales, fungus-like soil bacteria (0.4-41% of bacteria). Archaea occurred at an average of 1% of read assignments in all metagenomes except Chena River, which contained a high proportion of Archaea (6%); other published river metagenome studies recorded Archaea at the 1% level [7][8][9]21]. Archaea groups detected were extremely similar across most metagenomes (most of which were Methanomicrobia, CO 2 reducers); a notable exception was the metagenome for Sydhavnen (Copenhagen Harbour), where most of this group's representatives were instead Thaumarchaeota (noted for the ability to nitrify via oxidizing ammonia aerobically), dominated by Nitrosopumilus, a common player in the marine nitrogen cycle.
Across all metagenomes, 64 families were detected at ≥1% normalized abundance (Table 1). Alpha diversity across the 13 metagenomes ranged from a low of 135 species (Vedder Canal) to a high of 1,139 species (Chena River). Of the temperate urban rivers we investigated, Yare, Rhine, Neckar, Corrib, James, and St. Lawrence had average alpha diversity of 413 ± 29 SE species and exhibited family sets that conformed to the core groups that have been found to dominate other large temperate rivers and lakes [7-9, 22, 23]. The families observed in those 7 rivers concurred with what is expected on the basis of a gen- Table 1: Normalized proportions of 64 families that were detected at ≥1% in any of 13 river metagenome samples analysed by MinION, listed with the most commonly noted families at the top of the table * includes OTUs in root where tools were unable to identify precise taxonomy. ND indicates family was not detected. Green-shaded cells indicate occurrence ≥1%. eral understanding of river ecology (covered more extensively below). Like those "typical" rivers, the metagenomes of Vedder River and Canal exhibited prokaryote families ubiquitous in soils and water environments, but these 2 samples stood apart owing to higher abundance of Cytophagaceae and Burkholderiaceae and lower abundance of Streptomycetaceae than in the other rivers. Metagenomes of Chena and Karori Rivers also exhibited prokaryote families ubiquitous in soils and water environments, but their consortia were dominated by different families than the other rivers. We also saw that Chena River showed evidence of hydrocarbon influence as the fourth, eighth, and 10th most abundant microbe families are important degraders of methylnaphthalene and BTEX (benzene, toluene, ethylbenzene, and xylene). Karori River was distinctive in that some of its most numerous microbial families were either marine (Cytophagaceae, Alteromonadaceae, and Vibrionaceae) or signified the presence of sewage (Enterobacteriaceae and Campylobacteraceae). The Skaevinge wastewater inlet metagenome was unique, as expected, in that it was dominated by families (5 of the top 10) that are linked to sewage. The Sydhavnen metagenome was unique, as expected, owing to abundance of marine bacteria, marine-related viruses, and algae; only 2 of the major prokaryote families were typical of freshwater river ecosystems.
Of 1,249 genera classified, 69 occurred at ≥1% in any 1 of the 13 metagenomic samples and of those, 35 genera were represented on average at ≥1% in all of the samples. For the majority of samples, the most common OTUs were the bacterial genera Flavobacterium, Polynucleobacter, Acidovorax, Polaromonas, and Streptomyces. These microbes, known to be members of the "microbial loop" [24], are among the predominant drivers of water and soil ecosystem processes and have been documented as major contributors to the consortia of other aquatic systems [7,9,18,22,23,[25][26][27]. Three rivers exhibited very low frequencies of the common river OTUs. These exceptions included Chena (where Clostridium, Bacillus, and Geobacter predominated), Vedder (where Pelagibacter and Ricketsiales were most common), and Karori Stream (where most numerous were Cellvibrio, Pseudomonas, Arcobacter, Bacteroides, and Burkholderia). The least typical "river" samples were the wastewater influent at Skaevinge (where the dominant genus was Arcobacter, 48.7%, followed by Bacteroides and Campylobacter, both of which are significant clinical pathogens) and Sydhavnen at Copenhagen Harbour (dominated by Prasinovirus and Phycoviridae, and having primary bacterial genera Flavobacterium and Candidatus Pelagibacter). Across all metagenomes, 5 genera that include some human pathogenic species were detected at ≥1% and many occurred at lower abundances. Present in all 13 metagenomes were Campylobacter (normalized proportion of 0.1-2.9%), Clostridium (0.3-3.2%), and Prevotella (0.2-1.3%). Corynebacter was in all except Karori Stream (0.1-1.6%) and Helicobacter present in all except Yare W and Neckar (0.1-1.5%). The fact that the taxonomic assignments for most rivers also implicated taxa that are anthropogenically relevant such as xenobiotic processors, diseasecausing organisms, and pathogens of humans, fish, and crops is not novel. Xenobiotics and significant pathogens previously were observed for 1 of the rivers examined in this study (James River [7]) and have been documented using WGS data for other river metagenomes [9,22].
The long-read WGS data provided important novel insight into the viral complements of some river metagenomes. Across the 11 rivers (13 sampling sites), the normalized proportions of viral reads ranged from 0.03% to 25.8% of read assignments. For 11 samples, virus accounted for <1% of reads, a finding typical of other river planktonic metagenomes [7-9, 18, 21]. Except for the 1 metagenome outlier, most virus read annotations were similar to T4-like virus (bacteriophages with some similarity to cloning vectors). The next most common were Phycoviridae (types that infect bacteria and archaea), followed by Iridoviridae (insect virus), and Cafeteria roenbergensis virus (a giant virus of marine phagotrophic flagellates). The notable outlier metagenome was Sydhavnen (Copenhagen Harbour), where >25% of all reads were mapped to viruses. These were not the type observed to dominate the other river metagenomes; instead the dominant types were Prasinovirus (52,116 annotations, observed e-values ≥1 · e −9 , alignment lengths ≥38, identity ≥80.8%) and Phycodnavirus (e-values ≥1 · e −7 , alignment lengths all >34, all showed >72% identity), which infect oceanic picoalgae, Bathycoccus, Ostreococcus, Micromonas (family Mamiellaceae), and other common groups of coastal green algae and cyanobacteria. Similar viral annotations were found in other samples but at 5-10 times lower abundance. Capture of this viral event may reflect the effect of oceanic water mixing with fresher water because salinity can influence the rate of viral decay and others have observed that algae transitioning from fresher to more saline waters experience increased viral abundances [28]. Alternatively, there could have been a recent bloom of picoalgae that advected onshore and was at the time of sampling in de-cline. The detection of such a high proportion of viral reads is notable in comparison to other halophilic WGS metagenomes where viruses generally are detected at ≤2% [29,30] but actually has been seen recently as a significant benefit of the MinION sequencing method [31].
River location by longitude, latitude, country, or continent was not reflected in the principal component analysis (PCA) grouping. The fact that both family and function PCAs yielded similar groupings and that those clusters did not reflect geography leads to the conclusion that the consortia and the ecological functions they fulfil may be more important than a river's precise location. An example can be seen in the paired sets of samples from Yare and Vedder that were collected up-and downstream of an urban center to examine the extent to which supposed anthropogenic effects on the waterways affected the river microbial consortia. Representative PCAs created from both Kraken2 results based on the annotated families imputed with missMDA and the MG-RAST normalized family frequencies ultimately clustered River Yare samples collected east and west of Norwich, suggesting that both samples have similar metagenomic profiles ( Fig. 2A and B). Conversely, Vedder River and Vedder Canal metagenomes that were separated by a similar distance as the River Yare samples did not cluster.

Functional diversity
There were 2,889 Clusters of Orthologous Groups (COG) pathways, 3,806 KO pathways, and 6,554 Subsystem Functions annotated across the 13 metagenomes. The distribution of detected functions versus sequence count was logarithmic for hits from all 3 databases (Fig. 3), indicating in each case that ∼2.5 · 10 5 long reads seem to be necessary to adequately diagnose river function potential using the MinION sequencing platform, a range well within the read output for 9 out of 12 experiments in this study. The long-read data yielded assignments for functions of Bacteria comparable to North American river bacterial functions detected using data from other sequencing platforms [7][8][9], the vast majority of which were associated with basic cellular housekeeping ( Table 2).
Prior studies suggest that wastewater release contributes to river resistomes [22,[32][33][34] and as mentioned above, we  found signals of urban sewage in all of the metagenomes examined, at low abundance in most but unexpectedly high in others. We also detected other functions that indicated how these river consortia respond to the putative anthropogenic influences on these waterways (Table 3). For example, across the river metagenomes, 24 different mechanisms (0.8% of the COG processes detected) were related to antibiotic or multidrug resistance (AMR), toxins, or virulence. The most prevalent of those were AMR pathways dominated by the cation/multidrug efflux pump and the ABC-type multidrug transport system (ATPase and permease components). According to the SEED viewer, 24 genes were detected that direct transporting and processing of heavy metals (arsenic, copper, cobalt, zinc, lead, and cadmium; those for copper were highly represented). Of all KO functional pathways detected, 60 (24% of all annotated pathways) were related to processing of xenobiotic substances or to human or plant pathogens and diseases. The xenobiotic processes were dominated by degradation of benzoate (PATH:ko00362), chlorocyclohexane and chlorobenzene (PATH:ko00361), aminobenzoate (PATH:ko00627), nitrotoluene (PATH:ko00633), atrazine (PATH:ko00791), and dioxin (PATH:ko00621). Similar observations were made for an earlier James River metagenome previously analysed using different WGS sequencing technologies [7]. The PCA analysis for families ( Fig. 2A and B) grouped samples in a nearly identical fashion as for Subsystem Functions (Fig. 2C), giving support to the contention that microbial function is driving differences among river and waterway metagenomes, not location.

Long-read metagenome analysis for river taxonomy and function
Because rivers are used extensively for anthropogenic purposes (drinking water, recreation, agriculture, and industry), it is essential to understand how these activities affect the composition of river microbial consortia. Such understanding could be facilitated on a massive scale if there were a broadly applicable means for spatiotemporal river testing that could produce an unbiased representation of the microbial community. This would be especially helpful to document the presence, distribution, function, and evolution of epidemiologically significant organisms. Nanopore technology has enabled the application of long-read nucleotide sequencing in a number of ecogenomic applications and holds promise for just such testing in river systems. We found that DNA sequence data from river water produced by the portable MinION device joined with metagenomic analysis provided a sensitive platform for investigating the diversity and ecological function of microbiota inhabiting Earth's rivers and waterways. Because we used a WGS approach, we also captured signals of organisms inhabiting the riparian zones and larger watershed. We even captured genomic signals of an algal/viral event in an eighth waterway sample (Sydnavnen). Detecting biodiversity of rivers and their watersheds previously has been reported using short-read whole-genome and targeted strategies [7][8][9]35]. This study illustrates that similarly comprehensive results are obtained using long-read sequencing. By sequencing DNA from rivers on 3 continents using MinION rapid sequencing and analysing those data with both local and cloudbased tools, we obtained detailed results on taxonomy and function that implied just how distinct and ecologically responsive those river system microbiota are. Furthermore, the study highlights the added value of portable WGS paired with on-site data analysis, which allows us to avoid assembly approaches and resolve possible gene synteny directly from sequencing reads rather than from contigs, to avoid the artifacts and biases inherent to PCR (e.g., false-negative results, polymerase error, primer mismatch, saturation) and to avoid the need for multiple sample processing to investigate different "fractions" (e.g., 16S, 18S, viral, COI).
In addition to the utility of this approach for studying river consortia, we present data that illustrate its potential for monitoring for anthropogenic effects on river biota, detecting pathogen presence and diversity throughout river systems, judging risk associated with water uses, and hopefully enhancing water quality. Despite <1× coverage of taxa and the precautions espoused by Gweon et al. [36], our analyses exposed previously unrecognized aspects of microbial biodiversity in 4 waterways, where the metagenomes deviated from the expected suite of taxa; several illustrated marine influence, some showed taxa responsive to hydrocarbon pollution, and others had strong signals of taxa related to sewage and AMR gene matches linked to antibiotic resistance. We suggest that the most common OTUs (i.e., groups that occur at ≥1% of the detected consortia) seem to be good indicators of the extent to which rivers and waterways are responding to anthropogenic impacts.
Additional bioinformatic development is necessary, however, to ultimately support a field-deployable sequencing device paired with deeply informative statistical analysis that has the capability of rapidly and comprehensively detecting microbes. Comparing widely available bioinformatic tools to analyse the Table 3: Normalized percentage abundances, where a function was represented at ≥0.1% of annotations, of KO pathways detected related to processing of xenobiotic substances or to human or plant pathogens and diseases, and of COG pathways relating to antibiotic or multidrug resistance, toxins, or virulence River abbreviations are as in Table 1. Colours indicate relative abundances: green -≥1%; yellow/orange -intermediate; red -≤0.1% / not detected (ND). river metagenomes revealed that One Codex resulted in a very high proportion of unclassified reads and, as the parameters were not adjustable, was unsuitable in our hands for this analysis. Using the local Kraken2 sequence classification system, we had fewer unclassified reads and detected a wider diversity of organisms in each river, concurrent with levels of diversity detected by other sequencing technologies. The web-based MG-RAST service provided zero unclassified reads and deeper, more comprehensive information, particularly with regard to xenobiotics, pathogens, and AMR. Comparative PCA analyses of these river metagenomes using both Kraken2 and MG-RAST data, at both the family and function levels (Fig. 2), yielded highly similar groupings, indicating that geographic proximity is far less important than the ecological functions being carried out by the predominantly microbial consortia. Such relationships among microbial consortia and urban/agriculture effects have been noted in other watersheds wherein, as for studies op. cit. here, riverine microbial consortia varied as a function of land use and environmental quality [21,37].

Substantial compositional differences between geographically proximal sites
The differences among stories told by the metagenomes were striking in that these microbial "snapshots" (Supplemental Fig.  3) and jobs that the microbial consortia in rivers and waterways are performing (Tables 2 and 3) give signals that we believe could be used to enhance river management. For example, although Yare River samples collected west and east of Norwich yielded highly similar taxonomy and function, our 2 Vedder (River and Canal) samples, which were geographically closer than the Yare samples were to each other, yielded distinct taxonomic and functional arrays. The 2 Vedder sites had substantially different flow and anthropogenic impact. Vedder River is an actively flowing, relatively natural river area with a rock and gravel bottom and fed by an upstream lake of mountain rain and snowmelt runoff, whereas the downstream Vedder Canal is a chan-nelized deep artificial canal with minimal current flow, a mud bottom, and high sediment load. These physicochemical differences manifested in radically different consortia and therefore different predicted river functional pathways, illustrating how informative metagenomic analysis can be for investigating the interaction between geophysical site composition and bacterial community composition. The understanding derived from such observations will be particularly useful for river ecosystem management and, indeed, is emerging as an important component of global change models.

Potential Implications
We demonstrate that yields of >1 M reads (i.e., 1 Gb data) are easily achievable with MinION, and further that yields of >10 Gb are possible using the rapid PCR barcoding kit, thus allowing for multiplexing of environmental samples. If one were planning multiplexed runs, our study shows that this process should be sufficient for a quick indication of highlevel metagenomic diversity. The analyses presented here illustrate that at the present average output, users interested in accurate assessment of taxonomy and function should strive for ≥250,000 reads per sample. Readers should be aware that nanopore sequencing is a new and disruptive technology in a state of constant improvement. As such, by the time of this publication, the approach presented here is anticipated to already have been enhanced through modification of MinION flow cells, library chemistry, and bioinformatic capabilities.

Global river water sites
For this study, 11 diverse global riverine waterways (Supplemental Table 2, Supplemental Fig. 4) were analysed to compare the metagenomic diversity of microorganisms identified and to garner an initial understanding of microbial resistance genes present. In Europe, these rivers included River Yare (collections west and east of Norwich, UK), River Rhine (Bimmen Netherlands/Germany), Neckar River (Tübingen, Germany), River Corrib (Galway, Ireland), Sydhavnen (Copenhagen, Denmark), and the Skaevinge wastewater treatment plant (Skaevinge, Denmark). In the USA, rivers sampled included James River (Richmond, VA) and Chena River (Fairbanks, AK). In Canada, rivers sampled included Vedder River (Vancouver, BC) and St. Lawrence River (Montreal, QC). A final site sampled in this study was the Karori Stream in New Zealand (Wellington).
River Yare is ∼84 km long and flows from the west of Norfolk to the east coast, passing through the city of Norwich (urban population ∼300,000). Outside the city center, most of the rest of the land that the river traverses is rural, with arable agriculture and tourism (sailing, motorboats). Two sample locations from this river were analysed. This eastern sample was collected from the riverbank by a public house in a small village, ∼3 km downstream from the edge of Norwich. The upstream sample was obtained beside the University of East Anglia sports fields in a suburban area of Norwich.
River Rhine is one of Europe's largest rivers, with a length of 1,230 km. The sampling site was a pier extending into the river where the surrounding land was rural in character with mainly agricultural farmland. Upstream of the sampling point (the lower Rhine) consists of one of Europe's largest industrial and urban areas, the Ruhr area (urban population ∼5 million).
The Neckar River flows 362 km in Germany from the Black Forest to the Rhine River. Upstream of the sampling site near Tübingen (urban population ∼100,000), the river flows through an area with a mix of villages, farmland, and forest. The sample was collected from a multi-lane divided bridge.
River Corrib in the west of Ireland is one of the shortest rivers in Europe. It flows 6 km from Lough Corrib to the Atlantic Ocean. Samples for analysis were taken from the Upper Corrib region, 2 km upstream from the Galway city center (rural population ∼80,000) in an unpopulated area where minor cattle and sheep grazing occurs (pastoral farming).
Sydhavnen is a suburb of Copenhagen situated along Sluseløbet Canal and Copenhagen Harbour, directly connected to the Øresund (a sound that forms the border between Denmark and Sweden) in the northeast and the Baltic Sea in the southwest, stretching ∼8 km. In addition to oceanic influence at both ends, the waterway, not technically a river, has heavy urban and transportation influence. The sample was collected in Sluseløbet Canal, beneath a bridge in Sydhavnen (urban population ∼780,000), ∼5 km below the Øresund.
Influent wastewater to the Skaevinge wastewater treatment plant, not technically a river, was included to act as a control for high anthropogenic impact because it was assumed to contain human-associated bacteria. Skaevinge wastewater treatment plant is located in a rural area of Zealand, Denmark, and treats wastewater from residential, industrial, and agricultural areas.
James River runs 560 km from the Appalachian Mountains to the Atlantic Ocean. The sampling site was in downtown Richmond, Virginia, and is known to have heavy urban, industrial, and transportation influence. Near the site is one of the largest combined sewer overflow systems on the mid-Atlantic east coast of North America, and the site also receives local permitted input from construction, power plants, failing sewer systems, and industrial discharges, resulting in elevated levels of polychlorinated biphenyls (PCBs). Upstream watershed activities include >170 active industrial discharges and >90 permitted pre-treatment discharge sites.
Chena River, the northernmost river in our sample, is springfed, stretches 160 km, and collects water from interior Alaska. Samples were taken on the downstream side of the Moose Creek Dam and upstream of the populated areas of the Fairbanks North Star Borough.
Vedder River is a continuously flowing river 80 km in length that drains Chilliwack Lake, itself snow fed from the Cascade Mountains. The immediate area where the sample was collected was near Chilliwack suburbs and exhibited constant current flow. A second sample was collected from the Vedder Canal, a downstream artificial canal that drains into the Fraser River and is a main area for both swimming and salmon fishing. The canal is surrounded by earthen dykes that are immediately adjacent to active farming land on both flanks; there was little visible water movement at the time of sample collection.
Saint Lawrence River, running nearly 1,200 km, is the thirdlongest river in Canada. The sampling site was located off of Jean-Drapeau Park ∼5 km downstream of downtown Montreal (urban population ∼1.8 million). This site is close to a municipal routine sampling site named FLS190 where past data have been collected and are available [38].
Karori Stream, the southernmost river in our sample, traverses ∼10 km through Wellington, New Zealand, with headwaters in bush and suburban areas, discharging into the sea at Wellington's south coast. The "Karori Stream at Makara Peak Mountain Bike Park" sampling site in Wellington is one of the Greater Wellington Regional Council's regular river sampling sites, in the middle reaches of the Karori Stream. There it also has some suburban and transportation influence.

Sample collection, DNA extraction, library generation, sequencing technology
Between April 2017 and October 2018, 12 laboratories with personnel exhibiting a wide diversity of skills and experience followed the standardized protocol [39] outlined below for the filtering and extraction of DNA from shallow waters of 11 riverine waterways. Water samples were taken at 0.5-1 m depth during daylight hours at a time when neither drought nor recent excessive precipitation events occurred within 1 week preceding sample collection. River water (2-4 L) was collected for filtration in sterile collection bottles and was processed immediately or stored at 4 • C for prolonged transportation time or until ready for filtration. The water samples were subsequently processed through a GF/C filter to remove suspended solids, particles, etc. (size retention: 1.2 μm). The water recovered after GF/C filtering was subsequently filtered through a 0.22-μm Durapore filter to capture microorganisms present. Upon completion of all filtering, nucleic acid was recovered using a modified procedure combining enzymatic lysis and purification using a DNeasy PowerWater DNA Isolation Kit (Qiagen, Germantown, Maryland, USA). Briefly, each filter was aseptically transferred to a 5-mL tube. To this tube, a lysis mix was added, which contained 1 mL of PW1 (DNeasy Power Water DNA isolation kit) and a previously described enzyme cocktail [40] containing 100 μL lysozyme (10 mg/mL, Sigma-Aldrich, St. Louis, Missouri, USA), 12 μL mutanolysin (25 KU/mL, Sigma-Aldrich), and 6 μL lysostaphin (4,000 U/mL, Sigma-Aldrich). The 5-mL tube with the lysis mixture was subsequently incubated at 37 • C for 1 hour, with gentle agitation to facilitate washing of the filters.
Steps 8-23 of the "experienced user protocol from the DNeasy Power Water DNA isolation kit" were followed. The eluted DNA was quantified using a Qubit fluorometer with the dsDNA HS kit. After quantification, a 0.4X SPRI bead clean-up of ∼100 ng neat DNA was performed and eluted in 20 μL molecular grade water (or Tris-Cl pH 8-8.5). Subsequently 10-50 ng of DNA was used in conjunction with the Rapid Low Input by PCR Barcoding kit (SQK-RLB001, Oxford Nanopore, Oxford, United Kingdom) in accordance with manufacturer's protocols to prepare WGS libraries for use with a MinION device with minor alterations outlined below. A barcoding kit was chosen to facilitate multiplexing of negative controls and DNA from river samples to determine whether any contamination was present during the processing of the river water templates. Modifications for the library preparation were (i) 10-50 ng of input DNA and 2.5 μL of fragmentation mix (FRM) were used for the tagmentation/fragmentation reaction and nuclease-free water was used to make the volume up to 10 μL; (ii) for the PCR reaction, 20 cycles were used and the PCR reaction volumes were doubled. When multiplexing (negative filter and DNA from associated river samples), PCR products were pooled together in equal volumes, then subjected to a 0.6x AMPure XP bead wash and eluted in 12 μL of the buffer recommended in the manufacturer's instructions (10 μL 50 mM NaCl, 10 mM Tris-HCl pH 8.0).
After amplification a number of quality control checks were implemented to ensure that successful library preparation was achieved. The quantity was assessed using the Qubit fluorometer dsDNA HS kit and DNA quality and estimated size distribution were subsequently determined via Tapestation, Bioanalyzer, or agarose gel. Following quality control steps and removal of unincorporated primers, sequencing adapters were added to the mix and a room-temperature ligation-free reaction was carried out to link the adapters to the prepared DNA template. This prepared library (100-200 fmol) was then loaded into the Min-ION flow cell (R9.4) in accordance with manufacturer's guidelines and the unit was run for a full 48 hours of sequencing.
Sequence processing, annotation, post-processing, and data analysis WGS reads were processed for basecalling and quality control filtering using Albacore version 2.1.10 (Oxford Nanopore), and adapters were removed from the resulting DNA sequence reads using Porechop version 0.2.3 [41] using the command-line parameters "porechop -i $INPUT -o $INPUT.porechop.fq -format fastq -t 32 -discard middle" . In a number of instances where replicate runs were performed for the same sample, the replicate data sets were pooled. The final adapter-trimmed data are accessible in both EBI (fast5) and MG-RAST (FASTQ, Supplemental Table 2). Read lengths for sample-pooled FASTQ files were determined using a custom fastx-length.pl script and processed into cumulative read length distribution plots and digital electrophoresis plots using a custom length plot.pl script [42].
To classify the sequence data for the purposes of identifying the microbial community in each water sample and to consider how these contribute to the ecology of each river ecosystem, FASTQ data initially were submitted to One Codex (based on the recommendation of [12]), an online pipeline that identifies microbial sequences using a k-mer-based taxonomic classification algorithm, typically used for short-read data. The 2018 database chosen for analysis comprises a reference database that included ∼80,000 bacterial, viral, fungal, and protozoan genomes. Reads also were processed using Kraken2 [43], a different k-mer-based sequence classification algorithm optimized for long-read sequences, which uses a publicly available precompiled genome database of bacteria, fungi, and viruses from RefSeq [44]. Last, sequences were uploaded via the commandline API and processed using MG-RAST [45], a pipeline that for whole-genome sequences first performs a protein similarity search between predicted proteins and database proteins and then provides bioinformatic tools to predict ribosomal DNA, gene, and protein functions with default parameters as follows: e-value 1 · e −5 (probability of chance incorrect annotation), identity 60%, and a minimum alignment length of 15 (10-60 bp is common). Pavian plots of the representative taxa for each metagenome [46] were constructed using the Kraken2 output. Using both MG-RAST and Kraken2 taxonomy results and the Bray-Curtis distance matrix among normalized family counts, PCA was implemented, with 1 exception: for PCA on Kraken2 results, families were filtered to only include those that had <20% of samples with missing or zero counts.
To evaluate putative ecosystem-related functions from the reads, the MG-RAST server was used to compare data sets to 3 controlled annotation namespaces: Subsystems, KO, and COG proteins. Normalized function data for each river sample were compared using PCA in MG-RAST (Subsystems Level 1, Minkowski distance matrix).

Availability of Supporting Data and Materials
Raw signal FAST5 and FASTQ files are available from the ENA via accessions Nos. PRJEB34137 and ERP116996. Basecalled FASTQ read sets are archived in MG-RAST. Supplemental tables and figures are available. Custom codes can be found at [47]. Other data further supporting this work can be found in the GigaScience repository, GigaDB [48]. Table S1. Long-read metagenome properties and downstream analysis summary. L50: length of the shortest read in the set of the longest 50% of base-called data; N50: number of reads in the set of the longest 50% of base-called data; rRNA: number of reads that contain ribosomal RNA genes; Features: predicted proteins; CDS: identified proteins; Subsystems: number of reads assigned to all Subsystem Level 1 functional categories; Anthro: percentage of reads with predicted protein functions annotated to Virulence, Disease and Defense, Phages, Prophages, Transposable Elements, Plasmids, Metabolism of Aromatic Compounds, and Stress Response; α-Diversity: estimated from the distribution of the species-level annotations. Table S2. Site, study, and sample information. The notation "NA" in parentheses is included where negative read set files included too few reads to be analysed in MG-RAST. Where accessions are given in parentheses, those refer to the barcoded negative control data. All FASTQ files are available from ENA via accessions Nos. PRJEB34137 and ERP116996. Figure S1. Read length distribution plots for sample (A and B) and negative (C and D) sequencing libraries. The cumulative sequenced bases plots (A and C) allow read length percentiles to be identified; read N10, N50, and N90 are indicated on the plot by vertical lines and a pair of circles. The digital electrophoresis plots (B and D) show the distribution of read lengths in the libraries, as might be seen via gel electrophoresis. The sample libraries generally show a very tight read length distribution (except for the low-count samples, Karori and Neckar), whereas the negative samples have platykurtic length distribution curves. Not shown in panel D: James-neg (because only 1 sequence, H. sapiens, ATP synthase, was detected, 1.67 kb). Figure S2. Proportion of GC in each of 13 metagenomes from 11 rivers. Color scale same as in Supplemental Fig. 1. Figure S3. Representative taxonomy shown as Pavian plots for each of 13 metagenomes from 11 rivers. Figure S4. Full-resolution location maps (insets are reducedresolution Pavian plots) to illustrate the range of sites sampled and the relative diversity of taxa identified in 13 riverine metagenomes.