LAGOS-NE: a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of US lakes

Abstract Understanding the factors that affect water quality and the ecological services provided by freshwater ecosystems is an urgent global environmental issue. Predicting how water quality will respond to global changes not only requires water quality data, but also information about the ecological context of individual water bodies across broad spatial extents. Because lake water quality is usually sampled in limited geographic regions, often for limited time periods, assessing the environmental controls of water quality requires compilation of many data sets across broad regions and across time into an integrated database. LAGOS-NE accomplishes this goal for lakes in the northeastern-most 17 US states. LAGOS-NE contains data for 51 101 lakes and reservoirs larger than 4 ha in 17 lake-rich US states. The database includes 3 data modules for: lake location and physical characteristics for all lakes; ecological context (i.e., the land use, geologic, climatic, and hydrologic setting of lakes) for all lakes; and in situ measurements of lake water quality for a subset of the lakes from the past 3 decades for approximately 2600–12 000 lakes depending on the variable. The database contains approximately 150 000 measures of total phosphorus, 200 000 measures of chlorophyll, and 900 000 measures of Secchi depth. The water quality data were compiled from 87 lake water quality data sets from federal, state, tribal, and non-profit agencies, university researchers, and citizen scientists. This database is one of the largest and most comprehensive databases of its type because it includes both in situ measurements and ecological context data. Because ecological context can be used to study a variety of other questions about lakes, streams, and wetlands, this database can also be used as the foundation for other studies of freshwaters at broad spatial and ecological scales.

But I find costs provided in "The economic value of water quality data in an integrated database" (791-805) out of proportion. The cost estimate of a single lake sample of $2000-6000, based on stream sampling, seems extremely high (line 799). Consider the inexpensive Secchi data and other data collected by volunteers. Commercial water TP analysis is typically less than Can$45, and physical profile data (temperature, oxygen) do not require special expertise and time after an initial investment in equipment (<$5000, depending on lake depth). On the other hand, the section on "Strategies for broad-scale data-integration efforts" (lines 807-858) is well thought out and should help other, similar endeavours.
RESPONSE: We agree that there could be some cost savings in lakes, but then again, lake sampling also requires boats, trailers, etc. that many stream sampling efforts do not. We did not include costs for Secchi samples, and only include records for which a lab analysis is required. Nevertheless, as recommended, we lowered the range relative to stream samples, to $1000-$4000 rather than $2000-$6000. This rough estimate is only intended to put the dataset and costs in context.
One strength of the chosen approach is the modular build. This makes it possible to add potentially useful information, such as:
*Information pertaining to internal P loading, including discrete depth samples of phosphorus, iron and manganese.
Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
*Information pertaining to cyanobacteria proliferation and blooms: maximum chlorophyll concentration, phytoplankton species and biomass, cyanotoxins.
Additional documents and files are extensive. They seem to explain and describe methods of data selection and other approaches used in detail. I believe that a potential user can find all the information needed to determine the data validity.
Detailed comments in the order of the text by line numbers follow:
105: Also indicate the number of nutrient data, especially of total phosphorus (TP).
RESPONSE: Done. We have added TP.
107: Were there no data used from the published peer-reviewed scientific literature?
RESPONSE: No, we have found it sometimes too difficult to acquire the metadata for such studies, as well as the data themselves, because historically it has not been the practice to put data into data repositories. It was more efficient to get data directly from sources, and state agency datasets are larger and typically contain more data than published studies.
140-1: A fitting reference would also be:
--Bachmann, R.W., Hoyer, M.V., and Canfield Jr, D.E. 2013. The extent that natural lakes in the United States of America have been changed by cultural eutrophication. Limnol. Oceanogr. 58(3): 945-950.
RESPONSE: We have chosen not to cite this article due to the numerous responses to the article that were published questioning their conclusions.
157-160: It would be great to test this assumption of lacking metadata for the lake data (and not just citing river data and reference [16]).
RESPONSE: Yes, we agree; however, it is beyond the scope of our data paper to include this estimate. Further, we do not have any reason to expect it to differ greatly between lake and stream samples. Nevertheless, we are now working more closely with the authors of this article, who are employees at the USGS, for the next phase of our research to build LAGOS for the entire US by integrating more with the Water Quality Portal.
195: It would be helpful to be more specific: what time periods are usually provided (before 2012)?
RESPONSE: We agree. We have added: mostly from the late 1980s up until about 2012.
255: Replace "were" with "was" (grammar)
RESPONSE: Done.
327-331: Phosphorus retention in lakes is not usually complete (100%) so the notion of "trapping" TP in any large upstream lakes is an oversimplification. Nonetheless, retention of large and deep lakes without internal loading is usually 70-90%, so that the assumption of R=100% is more valid than R=0%.
--Brett, M.T., and Benjamin, M.M. 2008. A review and reassessment of lake phosphorus retention and the nutrient loading concept. Freshw. Biol. 53: 194-211.
--Nürnberg, G.K. 1984. The prediction of internal phosphorus load in lakes with anoxic hypolimnia. Limnol. Oceanogr. 29: 111-124.
RESPONSE: We agree, but have chosen not to add citations, as this is not a major focus of this manuscript and the paper that we cite also cites these papers.
405: It is confusing that in Table 2: "… lakes are counted for each state in which they occur (i.e., lakes that straddle two states are counted in both states)", while in other files such lakes are counted only once.
RESPONSE: We agree; however, there is little that we can do that would not require a complete GIS analysis to reclassify lakes by state and make decisions about which border lake belongs where. Unfortunately, lakes do not follow state borders, and different table summaries make different assumptions. We felt the important part of this table was to show the relative numbers of lakes by lake type rather than the state data, so slight discrepancies due to border issues were acceptable.
476: "All data in LAGOS-NELIMNO v1.087.1 are from samples that we identified as being collected from either the lake surface or the epilimnion (the well-mixed surface layer of a thermally-stratified lake during the period of stratification)." As mentioned above, it would be useful to expand the dataset to include data that can be used to determine whether there is any sediment P release. Such data include hypolimnetic and discrete deep water samples during the stratification period in stratified lakes.
RESPONSE: We certainly agree, and in fact some of those data reside in the master LAGOS-NE database; we just have not sufficiently processed them to make them available, nor do we have the associated temperature and dissolved oxygen profiles that would make those values even more useful. However, for the next version of LAGOS-US, we will include both oxygen and temperature profiles and, possibly, lake nutrients at depth.
625: "We have published 10 articles using portions of this database". Perhaps these and the 13 articles in review (if available when this ms is published) could be listed and cited in a separate table. But perhaps the subsequent paragraph already refers to these references?
RESPONSE: Correct, the later paragraph describes them and cites the published studies. We would rather not provide citations to the in prep manuscripts in a table since those will likely change in the coming months and soon be out of date. However, we have updated any manuscripts that have now been published so that there are fewer 'in preparation' manuscripts that we discuss in this section. Further, we have chosen not to include a table of papers because this is not the main focus of this manuscript, and this section is intended only to show that many publications have used this database.
808: This sentence is not complete ("which" is awkward)
RESPONSE: We have fixed this by adding 'and to identify the types of datasets…'.
843: I think you mean "disseminate" rather than "dissemination"
RESPONSE: Fixed.
----------------------------------
Data management and R-related files: reviewed by Stefanie LaZerte
This R package is a nice way of providing access to this large dataset. The package was generally easy to install and easy to use. I wasn't able to use lagos_get() to download, as it got through most but failed on one file. It was nice that the function detected previously downloaded files and resumed. But it would be even nicer if it had the option to skip over files that couldn't be reached.
RESPONSE: Now that all files are available on EDI and we have updated the package to point to them, this should not be an issue. We agree that additional flexibility would be a nice feature. We have filed an issue on the Github repository and hope to implement this for users in the future.
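The behaviour requested here (resume past files already on disk, and optionally skip files that cannot be reached) can be sketched in a few lines. This is a language-agnostic illustration written in Python, not the package's R code: the function name `download_all`, the `fetch` callable, and the `skip_failures` flag are all hypothetical and do not describe how lagos_get() is implemented.

```python
from pathlib import Path
from typing import Callable, Iterable


def download_all(names: Iterable[str], dest: Path,
                 fetch: Callable[[str], bytes],
                 skip_failures: bool = True) -> dict:
    """Fetch each named file into dest (hypothetical helper).

    Files already present in dest are not fetched again (resume behaviour).
    If skip_failures is True, a file that cannot be reached is recorded and
    the loop moves on instead of aborting the whole download.
    """
    dest.mkdir(parents=True, exist_ok=True)
    report = {"downloaded": [], "cached": [], "failed": []}
    for name in names:
        target = dest / name
        if target.exists():
            report["cached"].append(name)
            continue
        try:
            data = fetch(name)
        except OSError:  # network errors such as urllib's URLError derive from OSError
            if not skip_failures:
                raise
            report["failed"].append(name)
            continue
        target.write_bytes(data)
        report["downloaded"].append(name)
    return report
```

Returning a report of skipped files, rather than failing silently, would let a user retry only the missing files later.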
I was able to use the files provided in the dropbox folder, by compiling them with the 'lagos_compile()' function, although I needed to fix a couple of typos to make them work:
-'.txt' in LOCUS file needed to be renamed to '.csv'
-'LakesLocus' should be lowercase
RESPONSE: Again, now that all files are available on EDI and we have updated the package to point to them, this should not be an issue. We apologize for the earlier challenges in accessing the data.
Although not crucial, I would suggest having the compile function create individual rds files in a single directory, and then giving users the option of loading select datasets, as the whole set is quite a large table.
RESPONSE: We agree that implementing additional flexibility would be a great option for users. We have filed an issue on the Github repository and we hope to implement this in the future.
The data itself was well explained and organized, but there is such a wealth of information that it may become confusing. Perhaps consider making the output of ?dataset (e.g., ?county) specific to that particular dataset, so users don't have to scroll through the descriptions of all columns for all tables if they're only interested in the one.
RESPONSE: We agree that there is a very large volume of information. We hope to eventually improve the organization of the metadata to maximize ease of use, which is an ongoing effort.
The ability to select by categories is very cool, and it would be nice to have a category for sample information (i.e., sampling event, lakeid, etc.)
RESPONSE: We agree that this is a fantastic idea and we have added this to our 'to do list' for updating the R package in the coming months, which we view as an ongoing process. Nevertheless, the package allows full access to the database now and improves accessibility of the data to other users. We will be working towards making it increasingly user-friendly with such ideas as this one.
Also, although not related to the quality of the dataset, consider including vignettes or more in-depth tutorials, perhaps for how to merge different data sets together or how to extract and transform particular columns (see coding example below). As the data is in wide format as opposed to long (e.g., years are in different columns, as opposed to having a single year column), the data will have to be transformed before most if not all types of analysis. These transformations are not always trivial. By providing some guidance and examples, the accessibility of the data by users less familiar with R can be improved. In particular, if downloading the data separately is expected to be a commonplace occurrence, there should be instructions for the use of the 'lagos_compile()' function.
RESPONSE: We definitely agree and have added a minimal vignette showing basic interaction with LAGOS.
Overall I think this package is a convenient way of accessing both the datasets and the metadata. It is well documented and will be very useful to scientists wishing to use the data.
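To make the wide-versus-long point concrete, the following is a minimal sketch of the reshaping step, written in Python for illustration only. The column naming pattern (`tp_2001`, `tp_2002`) and the helper `wide_to_long` are assumptions made for the example and are not taken from the LAGOS-NE tables or the R package.

```python
def wide_to_long(rows, id_col, prefix, value_name):
    """Reshape wide records (one column per year) into long records.

    rows       : list of dicts, one per lake (hypothetical layout)
    id_col     : name of the identifier column to carry over
    prefix     : text that precedes the year in each wide column name
    value_name : name to give the value column in the long output
    """
    long_rows = []
    for row in rows:
        for col, value in row.items():
            # Keep only year columns that actually hold a measurement.
            if col.startswith(prefix) and value is not None:
                long_rows.append({
                    id_col: row[id_col],
                    "year": int(col[len(prefix):]),
                    value_name: value,
                })
    return long_rows


# Hypothetical wide table: one row per lake, one column per year.
wide = [
    {"lakeid": 1, "tp_2001": 12.0, "tp_2002": 15.5},
    {"lakeid": 2, "tp_2001": None, "tp_2002": 30.1},
]
long_table = wide_to_long(wide, id_col="lakeid", prefix="tp_", value_name="tp")
```

In the long form each row is one lake-year observation, which is the layout most modelling and plotting tools expect.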
Minor Comments
-For imports, best to give a minimum version number, e.g.: dplyr (>= 0.
----------------------End of review------------------------
Reviewer #3: This paper provides a valuable documentation of a geospatial database for lakes of the upper midwest and northeast United States. The value of the database is well illustrated visually in non-uniform distributions of quality (Figure 5) and hydrological variables (Figure 6). The main points, some of which could be addressed in a revision of this paper, include:
(1) [comment only] I have a few misgivings about such a large author list. There is a good justification of the authorship and no doubt, with a few self-citations, this paper will become well cited. But it still does not sit entirely comfortably with me, especially when I can still readily pick out simple typographical errors.
RESPONSE: While we agree for more typical research papers, we do not agree for data papers, in which the author list should be as long as the number of individuals who provided data. We are fixing the typographical errors.
(2) I was disappointed that the dataset extended until 2012. This is hardly a contemporary dataset and it raises a question for me about whether the database is sufficiently nimble to allow rapid incorporation of recent data and time series analysis.
RESPONSE: This is a major issue that we are now addressing in a new grant that will create LAGOS for the entire US and try to integrate with the WQX data repository for updates of newer datasets. Also, our work has shown that for many research questions, the spatial data (i.e., many lakes across broad regions) is more important than good temporal resolution.
(3) I was a little concerned about the large number of 'in prep' articles being cited in section 8. Are these all necessary? Could some be substituted or supplemented with recent published articles? Are other articles recent such as:
-Read JS, Winslow LA, Hansen GJA, Van Den Hoek J, Hanson

pollution, as the primary driver of lake and reservoir eutrophication [1]. In lakes and reservoirs, eutrophication is expected to become more widespread in the coming decades as the human population increases and climate and land use change commensurately, placing increasing pressures on freshwaters [2,3,4]; although, there is also recognition that eutrophication or its response to management actions does not progress in the same way in all lakes (e.g., [5,6,7]). Most research to understand lake nutrients and their effects on algae, plants, and aquatic food webs has been conducted in individual or small groups of lakes by studying the complex within-lake mechanisms that control responses to nutrients (e.g., [8,9]). Such relationships and interactions have also been found to be influenced by the ecological context of lakes (i.e., the land use, geologic, climatic, and hydrologic setting of lakes), which varies by lake and region, and is multi-scaled. In fact, it is not always clear whether local or regional ecological context matters more for predicting lake eutrophication (e.g., [10,11,12]). Therefore, determining the current extent of lake eutrophication and predicting how eutrophication will respond to future global change requires water quality data (e.g., nutrients, water clarity, and chlorophyll concentrations) and measures of lake ecological context across regions, the continent, and the globe (e.g., [13,14,15]).

In practice, measures of water quality are often collected from a relatively small number of lakes

We created a database called LAGOS-NE, the 'lake multi-scaled geospatial and temporal database' for thousands of inland lakes in 17 of the most lake-rich states in the upper midwest and northeastern U.S. (Figure 1). We avoided the problem of lack of metadata for the water quality data by contacting the original data providers for water quality data, asking for metadata, and only including data for which sufficient metadata were available. We addressed the problem of lack of ecological-context data by creating our own database of lake ecological context. The detailed methods and approach for building this database have been published previously [17]; here we publish and describe the database for the 51,101 lakes and reservoirs > 4 ha in the study area (1,800,000 km²).
We had three related motivations for developing this database: (1) to facilitate further development of our basic understanding of lake water quality at broad scales using water quality data on thousands of lakes collected over the last several decades (see [11,17]

There was a number of constraints for each of the categories of data that had to be considered.

For creating the census population of lakes (i.e., their geospatial location, perimeter, and surface area), we relied on a single source of data (the 1:24,000 National Hydrography Dataset (NHD) [21]). For the in-situ water quality data, we incorporated data only if they were in a digitally-accessible format such as a text or spreadsheet file. Finally, for the ecological-context variables, we included only data for which we could obtain a GIS or raster coverage at the national or state scale for all 17 states.

We organized these three categories of data into database 'modules' that had similar data types and sources so that we could develop procedures and set standards for each module (Figure 2). The

'lake' for LAGOS-NE has been developed only for the purpose of this database and its applications (e.g., to answer questions about lake water quality). The intent of LAGOS-NE is not to document and measure the total number of water bodies in our study area, although we are able to perform this calculation for lakes ≥ 4 ha, with an acceptable level of uncertainty (see below).

Definition of lake watersheds: We calculated lake watersheds as 'inter-lake watersheds' (IWSs), defined as the area of land draining directly into the lake as well as the area that drains into upstream-connected streams and lakes < 10 ha (Figure 3). We defined lake watersheds this way so that the drainage basin of a lake includes connected streams and their drainage basins. However, because research has shown that large upstream lakes can trap nutrients flowing into them, these large lakes can block the transport of nutrients that originate upstream of them to downstream lakes in a connected lake chain (e.g., [22]). Therefore, to calculate a drainage basin for a lake with large upstream connected lakes, we did not include the drainage basins of upstream lakes > 10 ha. See Soranno et al. [17] for full details on how lake IWSs were calculated and the section on LAGOS-NEGEO for further details.
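The IWS rule can be read as an upstream traversal that stops at large lakes. The sketch below, in Python, is only an illustration under stated assumptions and is not the GIS procedure described in Soranno et al. [17]: the network is a plain dictionary, the field names (`direct_drainage_ha`, `lake_area_ha`, `upstream`) are hypothetical, and a lake of exactly 10 ha is treated as large because the text leaves that boundary case open.

```python
def iws_area(lake_id, nodes, threshold_ha=10.0):
    """Sum the drainage area of a lake's inter-lake watershed (sketch).

    nodes maps an id to a dict with:
      direct_drainage_ha : land draining directly into this water body
      lake_area_ha       : surface area, or None for a stream reach
      upstream           : ids of directly connected upstream water bodies
    Upstream lakes at or above threshold_ha are assumed to block
    transport, so their basins (and anything above them) are excluded.
    """
    total = 0.0
    stack = [lake_id]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        total += nodes[current]["direct_drainage_ha"]
        for up in nodes[current]["upstream"]:
            up_area = nodes[up]["lake_area_ha"]
            if up_area is not None and up_area >= threshold_ha:
                continue  # large upstream lake: stop the traversal here
            stack.append(up)
    return total


# Hypothetical chain: focal lake <- stream <- (small lake, large lake)
network = {
    "focal":  {"direct_drainage_ha": 100.0, "lake_area_ha": 50.0, "upstream": ["stream"]},
    "stream": {"direct_drainage_ha": 40.0, "lake_area_ha": None, "upstream": ["small", "large"]},
    "small":  {"direct_drainage_ha": 20.0, "lake_area_ha": 5.0, "upstream": []},
    "large":  {"direct_drainage_ha": 500.0, "lake_area_ha": 30.0, "upstream": []},
}
focal_iws = iws_area("focal", network)
```

In this toy network the 30 ha lake and its 500 ha basin are left out, while the stream and the 5 ha lake contribute to the focal lake's watershed.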

The error analysis for this module is described in full in Soranno et al. [17]. However, here we briefly describe our efforts to determine the minimum area of a lake that we could confidently represent using the NHD (further details located in Additional File 9 in Soranno et al. [17]). Although the NHD is a national dataset, it is updated and edited regionally (often at the state level) by local practitioners familiar with each study region. As a result, there are regional differences in the resolution and digitization of water bodies, particularly for small water bodies, making it difficult to quantify or document even nominal error rates, or rather, the minimum lake size that is well-represented in the NHD.

It has been documented previously that the NHD may not successfully identify small water bodies for a variety of reasons, including the resolution of the original underlying data of the NHD database, errors in digitization, and hydrologic changes since the time of map creation (e.g., [25,26]). Because of these documented issues, some programs have set minimum lake area cutoffs for sampling lakes.

depth values, and 4,090 lakes with mean depth values), there is little regional pattern of lake depth; shallow and deep lakes are found throughout the study area (see [28] for further details). Watershed size varies greatly across the study extent, reflecting the wide range of different lake hydrologic types and connections to upstream water bodies (Figure 3). In fact, the proportion of lakes in different lake hydrologic connectivity classes varies regionally across our study extent (Table 2; see [29] for further details).

Quality control of the LAGOS-NELIMNO module
The full description of our QAQC procedures for this module is provided in Additional File 2. Here, we provide a brief overview of our approach. Our goal for this effort was to identify egregiously high values and values that might be too low, both defined below. Note that our quality control procedures were not designed to identify statistical outliers, a task that individual users are expected to perform themselves because such analyses depend on the subsequent statistical analysis of each user. There were three major phases in the quality assurance/quality control (QAQC) procedure for LAGOS-NELIMNO. Phases I and II were designed to identify the egregious values that we defined as those that: 1) did not make ecological sense, 2) were far beyond what has been detected in previous studies, 3) were not technically feasible (e.g., SRP > TP), or 4) were a result of a data or file corruption or error in the data loading stage. For these egregious values, we explored the issues that might be underlying the values and removed them from the LAGOS-NELIMNO data export provided in this data paper because we had sufficient evidence that they were not scientifically valid data values. We were very conservative in these assessments to avoid removing data values that were high, yet still valid. Phase III was designed to identify and flag values that were lower than analytically possible (i.e., below detection limits) when there was sufficient metadata;

Using the three most sampled variables in the dataset (Secchi depth, chlorophyll concentration and total phosphorus), we found that larger lakes were more likely to be sampled for water quality than smaller lakes (Figure 4). This result was expected given the economic and recreational interest in larger lakes, including easier public access. Previous research has already documented this basic pattern in 6 of the states included in LAGOS-NE [30]. Across all states, almost 80% of lakes > 400 ha have water quality data.
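The comparison of sampled lakes against the census population amounts to computing, for each lake-area bin, the share of census lakes that have at least one water quality record. A minimal sketch in Python follows; the bin edges and the example lakes are invented for illustration and are not values from LAGOS-NE, and `sampled_share_by_bin` is a hypothetical helper.

```python
def sampled_share_by_bin(lakes, bin_edges):
    """Percent of census lakes with water quality data in each area bin.

    lakes     : iterable of (area_ha, has_data) pairs, one per census lake
    bin_edges : ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1])
    Returns one percentage per bin, or None for a bin with no lakes.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    sampled = [0] * n_bins
    for area, has_data in lakes:
        for i in range(n_bins):
            if bin_edges[i] <= area < bin_edges[i + 1]:
                totals[i] += 1
                sampled[i] += int(has_data)
                break
    return [100.0 * s / n if n else None for s, n in zip(sampled, totals)]


# Invented example: (area in ha, whether any water quality data exist).
census = [(5, False), (8, True), (50, True), (60, False), (500, True)]
shares = sampled_share_by_bin(census, [4, 10, 100, float("inf")])
```

Because the denominator is the full census rather than only the sampled lakes, the result exposes size-related sampling bias directly.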

Lakes are also unevenly sampled through time, depending on the variable (Figure 5). Some programs' focus is on long-term monitoring, whereas others are short-term initiatives. Typically, long-term monitoring programs are localized to a few lakes, although there are exceptions (e.g., monitoring for acid rain in the NE from the 1980s to the present has resulted in good temporal and spatial coverage for some variables [31]).

and wetland abundance and connectivity measures (Figure 2). We also provide the GIS coverages that include some of the underlying data for this module, including: lake polygons and their hydrologic classifications defined in [17]; wetland polygons and their classification; streams as a line coverage and their classification by stream order; the zones used for this study (state and county; hydrologic units [at the 4, 8 and 12 scales; [32]]); and lake watersheds (IWS). We also include boundaries of U.S. states and Canadian provinces for mapping.

Data sources of the LAGOS-NEGEO module
Detailed information on data sources is found in 'Additional File 5' in Soranno et al. [17].

Almost all data sources for this module are from national-scale datasets and thus use standardized methods throughout the study extent.

Data-integration methods of the LAGOS-NEGEO module

All methods to create this module are described in 'Additional files 5, 7, 8, 13, and 14' in Soranno et al. [17]. Briefly, we calculated the metrics for this module that describe the ecological context surrounding lakes by developing project-specific GIS tools in the ArcGIS environment, which are referred to as the LAGOS GIS Toolbox (and made available here: [33]). The toolbox outputs multiple individual data tables of calculated values organized by the above three data themes that are then imported into LAGOS-NEGEO for different spatial classifications, including values calculated at the level of the individual lake, 100 m and 500 m buffers around each lake, the lake IWS, states and counties, hydrologic units, and ecological drainage units (an ecoregion spatial classification). The unique identifiers for this data module are the zone IDs for each spatial classification for which we calculate these metrics.

In other words, we calculate land use around a lake in each of the zones of the many spatial classifications

U.S. outlined in white and 51,101 lakes > 4 ha shown as blue polygons. Some lakes extend beyond state borders and are included in the database if it was possible to delineate their watersheds. Watershed boundaries rather than state boundaries were used for all analyses of lakes, streams and wetlands. The map is modified from [17].

[17], this module was called LAGOS-lakes), and LAGOS-NELIMNO v.1.087.1. We include descriptions of the type of data that are included in each module; the major categories of variables are the same as those describing the data tables in Additional File 1. The black connectors among the modules show that the modules are connected to each other through common unique identifiers through the LAGOS-NELOCUS module (either the unique lake ID or the zone ID). P is phosphorus, N is nitrogen, C is carbon, S is sulfur, atm is atmospheric, NHD is the National Hydrography Dataset, IWS is the interlake watershed, WBD is the Watershed Boundary Dataset, EDU is Ecological Drainage Unit.
facet_wrap(~ Type, ncol = 1, scales = "free_y")

within individual regions. In the U.S., large investments have been made in water quality monitoring by federal, state, local, and tribal governments; and many, but not all, of the datasets have been placed in government data repositories such as the USGS National Water Information System (NWIS) and the USEPA Storage and Retrieval (STORET) database. Unfortunately, these data repositories do not currently allow us to study lake water quality at broad scales. Despite the large number of water quality records in these systems, a recent analysis of their stream nutrient data found that over half of the data records lacked the most critical metadata necessary to make the data usable (e.g., chemical form, parameter name, units; [16]); and we would expect a similar result with lake data because they are typically treated similarly to stream nutrient data. In addition, STORET and NWIS do not include any measures of lake ecological context. Therefore, to study the controls of eutrophication specifically, and water quality in general, requires development of a comprehensive database for lake water quality that is integrated with measures of lake ecological context and sufficient metadata for robust analysis.

Figure 2. LAGOS-NE data modules and version numbers. The data modules and versions that are included in


Figure 5. The number of years of water quality data by lake. The number of years for which at least one sample


HUC
and abundance (lake, stream, and wetland)
CHAG - Climate, Hydrology, Atmospheric deposition of nitrogen and sulfur, and surficial Geology

LAGOS-NE was supported by:
The National Science Foundation MacroSystems Biology Program in the Emerging Frontiers Division of the Biological Sciences Directorate (EF-1065786, EF-1638679, EF-1065649, EF-1065818, EF-1638554) and the USDA National Institute of Food and Agriculture, Hatch project 176820 to PAS. KEW thanks the STRIVE Programme (2011-W-FS-7) from the Environmental Protection Agency, Ireland. SMC thanks the NSF Division of Biological Infrastructure (1401954).

Figure 1. Map of the study extent of LAGOS-NE. Map includes 17 states in the upper midwest and northeastern

Figure 3. Example lake watersheds (IWS) in LAGOS-NE. The watersheds are coded by the hydrologic class to which each lake belongs. Data are from the LAGOS-NEGEO v.1.01 data module and the GIS data coverages.

Figure 4. Percentage of lakes by lake area with water quality data. Percentage of census lakes in each lake area bin (top panel) compared to the percentage of census lakes for which there are limnological data for Secchi (second panel), chlorophyll a (third panel), and total phosphorus (TP; bottom panel).

Figure 6. Example ecological context variables by spatial classification in LAGOS-NE. The top four panels are zoomed in to selected regions of Minnesota and Wisconsin so that the zone boundaries can be seen. The upper left panel shows stream density in each lake IWS, and the upper right panel shows the percent of connected wetlands in each lake IWS. The middle left panel shows the 2011 percent urban land use/cover in each hydrologic unit code 12 (HUC12), and the middle right panel shows the 2011 percent agricultural land use/cover in each HUC12. The lower left panel shows the 2010 nitrogen deposition in each HUC8, and the lower right panel shows the average percent of streamflow that is baseflow in each HUC8.

requested by the journal, is intended to show the potential value of the dataset by showing the types of research that have been conducted to date. Because it took a long time to complete the database, many manuscripts are still in prep. Some have now been accepted, which we have updated, and in fact a large number have been published relative to the number in preparation, so we have kept them in the manuscript to convey the types of research questions we are
A major concern for water quality in freshwaters globally is cultural eutrophication, or excess nutrient inputs from human activities that lead to increased plant and algal growth. In many parts of the world, runoff from land, or nonpoint-source pollution, has replaced discharges of sewage, or point-source

Table 1: Summary statistics for the LAGOS-NE study area.
NE is comprised of three data modules that, although integrated in the same database, 184 were derived using different data sources and data integration methods, and thus must be version-185 controlled separately.LAGOS-NELOCUS v1.01 includes lake location and physical characteristics based on 186 an existing national-scale database of lake and streams in the U.S. for all lakes.LAGOS-NEGEO v1.05 187 includes measures of land, water, and air (ecological context) obtained from existing national scale GIS [18,19,20,12]of the study extent of LAGOS-NE.Map includes 17 states in the upper midwest and northeastern 179 U.S. outlined in white and 51,101 lakes > 4 ha shown as blue polygons.Some lakes extend beyond state borders and 180are included in the database if it was possible to delineate their watersheds.Watershed boundaries rather than state 181 boundaries were used for all analyses of lakes, streams and wetlands.The map is modified from[17].200Wedesignedthedatabaseusingprinciples of open science so futures users could ask new 201 research questions by using the existing database or adding new data modules to the database.To ensure 202 users could do this, we documented the major steps of dataset integration and carefully integrated 203 metadata directly into the database itself, we emphasized data provenance, and we used a database 204 versioning system.In this data paper, we make the following research products available: (1) data tables 205 with the data that make up LAGOS-NE and an R package for accessing the data and integrating the 206 tables; (2) for each of the 87 water quality datasets, we provide the EML (ecological metadata language) 207 metadata files that we authored after receiving the data, the data files that we processed to import into 208 LAGOS-NE, and the R-script that we wrote to process the data; and (3) GIS coverages of the underlying 209 freshwater geographic features (lakes, streams and wetlands) that are linked to the data tables for GIS 210 
processing by researchers.

2. Study site: Midwest and Northeast U.S. lakes

We selected an area of the U.S. known to have large numbers of lakes and well-developed lake water quality sampling programs, and that spans diverse geographic conditions and thus gradients of ecological context (Table 1). Our study area of 17 U.S. states includes 51,101 lakes > 4 ha (Figure 1). These states are in the north temperate climatic zone, which experiences cold winters and warm, humid summers. The study area includes part of the Interior Plains, Laurentian Uplands, Appalachian Highlands, and Atlantic Plain geological provinces, and thus encapsulates a range of geological ages, glacial histories, and topography. Land use/cover is highly variable, ranging from regions of intense agriculture in the [...]

Table 1 notes: This table includes the number of lakes and the geophysical setting of each state, and state averages for climate and the 4 major land use/cover types, which do not add up to 100% because we do not include all cover types. Temperature and precipitation data are 30-year climate normals (1981-2010); land use/cover data are from the 2011 National Land Cover Database (NLCD). Note that border lakes are counted in only one state.

3. Overview of LAGOS-NE

LAGOS-NE includes some data on all lakes in the study area above the minimum lake area threshold of 4 ha, which we call the 'census' population of lakes. The census population is a critical feature of LAGOS-NE because it allows us to characterize the ecological context of every lake in our study population and to identify whether the lakes for which we have water quality data are biased in any way. LAGOS-NE includes three main categories of variables: (1) variables that describe the physical characteristics and location of lakes themselves; (2) variables that describe in-situ water quality; and (3) variables that describe a lake's ecological context at multiple scales, and across
multiple dimensions (such as hydrology, geology, land use, and climate) based on the principles of landscape limnology [18,19,20,12]. Three factors dictated which data were included: past research and theory about the spatial and temporal controls of lake water quality; data availability and quality; and the time and resources necessary to compile, integrate, or process the original data. In other words, data that were especially time- and resource-intensive to collate, integrate, or process were given lowest priority and, in some cases, were not ultimately incorporated into the database.
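The bias check that the census population makes possible can be sketched as follows. This is a minimal illustration in Python, not the authors' method: the lake records, the area bins, and the function names are all hypothetical. It compares the share of all census lakes in each lake-area bin with the share of sampled lakes in the same bin.

```python
from collections import Counter

# Hypothetical records: (lake_id, lake_area_ha, has_water_quality_data).
lakes = [
    (1, 4.5, False),
    (2, 8.0, False),
    (3, 12.0, True),
    (4, 55.0, True),
    (5, 150.0, True),
    (6, 6.2, False),
    (7, 30.0, False),
    (8, 900.0, True),
]

# Illustrative area bins in hectares (upper bounds are exclusive).
BINS = [(4, 10), (10, 100), (100, float("inf"))]


def area_bin(area_ha):
    """Return the (low, high) bin that contains area_ha."""
    for low, high in BINS:
        if low <= area_ha < high:
            return (low, high)
    raise ValueError(f"area {area_ha} ha is below the 4 ha census threshold")


def bin_percentages(records):
    """Percentage of the given lakes that fall in each area bin."""
    counts = Counter(area_bin(area) for _, area, _ in records)
    total = len(records)
    return {b: 100.0 * counts.get(b, 0) / total for b in BINS}


census = bin_percentages(lakes)
sampled = bin_percentages([r for r in lakes if r[2]])

for b in BINS:
    print(b, f"census {census[b]:.1f}%", f"sampled {sampled[b]:.1f}%")
```

In this toy example the smallest lakes are absent from the sampled subset, which is the kind of mismatch a census population allows one to detect.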

Lakes near and beyond the state borders: For some of our analyses, we delineated boundaries using ecologically relevant criteria rather than political boundaries, which resulted in the inclusion of some lakes outside the borders of the 17 states. This allowed us to include more in situ data collected by state and citizen sampling programs, which do not always follow strict state borders and may include lakes that are outside of state lines. Although most of these border lakes have hydrological (i.e., lake connectivity measures) and topographic (i.e., lake watershed delineations) calculations or water quality data, some measures of ecological context may be missing. For example, for lakes in Canada, we were not able to estimate any data that relied on national datasets that stopped at the Canadian border; one exception is the NHD, which extends into Canada to retain hydrologic boundaries.

Table 2. Numbers of lakes in each state by lake hydrologic class
Columns: State | Lakes ≥ 4 ha (#) | Isolated lakes (#) | Headwater lakes (#) | Drainage lakes (#) | Drainage lakes with upstream lakes (#)
The number of lakes > 4 ha in each of the lake hydrologic classes by state, as well as the total numbers of lakes by hydrologic class calculated for the study extent. Note that, in this table, lakes are counted for each state in which they occur (i.e., lakes that straddle two states are counted in both states).
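One consequence of this counting rule is that per-state counts, when summed, can exceed the number of unique lakes. A minimal sketch in Python, using a hypothetical lake-to-state mapping that is not drawn from the database:

```python
from collections import Counter

# Hypothetical lake -> states mapping; lake "C" straddles a state border.
lake_states = {
    "A": ["MI"],
    "B": ["MI"],
    "C": ["MI", "WI"],
    "D": ["WI"],
}

# Count a lake once in every state in which it occurs.
per_state = Counter(state for states in lake_states.values() for state in states)

unique_lakes = len(lake_states)
summed_over_states = sum(per_state.values())

print(dict(per_state))
print("unique lakes:", unique_lakes, "sum of state counts:", summed_over_states)
```

Here the border lake contributes to both state totals, so the state counts sum to one more than the number of unique lakes.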
[...] of the 17 state and 5 tribal agencies. These contacts helped us to identify the state-agency-collected dataset that is required by the Clean Water Act and is most likely to be in the public domain. In this way, we were able to acquire at least one (and typically more) dataset from each of the 17 states. Because state and tribal agencies vary in sampling approach and intensity (see below for details), we sought to supplement these datasets with other known sources of water quality data, including university researchers, federal agencies, and non-profit groups, to integrate into the LAGOS-NE-LIMNO module. The full list of data sources acquired is in Soranno et al. [17] in 'Additional File 17'; however, we incorporated only a subset of these datasets in LAGOS-NE-LIMNO v1.087.1 (the data file LAGOSNE_source_program_10871.csv contains the list of sources for this version of LAGOS-NE).

Data-integration methods of the LAGOS-NE-LIMNO module

All methods to create this module are described in Soranno et al. [17]. Briefly, for each dataset acquired, we authored LAGOS-NE metadata in EML to aid in data provenance (included in this paper). We also incorporated key metadata features (e.g., methods used, censor codes (if applicable), and sampling program information) into the database so that future users could easily identify these important attributes. Because each dataset was unique in structure, file format, and naming conventions, we manually processed each dataset and its metadata so that they could be translated into the standard LAGOS-NE vocabulary and data model. Although this was labor-intensive, we created customized R scripts to process and load each dataset separately (included in this data paper).
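The idea of translating each source dataset into a shared vocabulary can be sketched as follows. The authors' processing scripts are written in R; this sketch uses Python purely for illustration, and every program name, column name, and variable name in it is hypothetical rather than taken from LAGOS-NE.

```python
# Hypothetical vocabulary: per source program, source column -> standard variable.
# None of these names come from LAGOS-NE; they only illustrate the idea.
COLUMN_MAP = {
    "program_a": {"TP (ug/L)": "total_phosphorus", "Secchi_m": "secchi_depth"},
    "program_b": {"tot_phos": "total_phosphorus", "chl_a": "chlorophyll_a"},
}


def to_standard_rows(program, records):
    """Translate one source dataset into long-format rows in the standard vocabulary.

    Each output row keeps the source program name so provenance is preserved.
    Columns without a mapping and missing values are skipped rather than guessed.
    """
    mapping = COLUMN_MAP[program]
    rows = []
    for record in records:
        for source_name, value in record.items():
            standard_name = mapping.get(source_name)
            if standard_name is None or value is None:
                continue
            rows.append(
                {
                    "lake": record["lake"],
                    "date": record["date"],
                    "variable": standard_name,
                    "value": float(value),
                    "source_program": program,
                }
            )
    return rows


program_a = [{"lake": "L1", "date": "2005-07-01", "TP (ug/L)": "12.5", "Secchi_m": "3.1"}]
program_b = [{"lake": "L2", "date": "2006-08-15", "tot_phos": 30, "chl_a": None}]

combined = to_standard_rows("program_a", program_a) + to_standard_rows("program_b", program_b)
for row in combined:
    print(row)
```

Keeping one mapping per source program mirrors the point made above that each dataset needs its own customized handling, while the output rows share a single structure.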

Table 3. Summary of the water quality variables and the number of values per variable by state.
We include the number of individual values (each representing an individual sampling event); the number of unique lakes for which there is at least one data value; and the earliest and most recent year of sampling, all recorded by state and variable from any time period. Additional variables in LAGOS-NE-LIMNO v1.087.1 that are not included in this table and have relatively low sample sizes include: dissolved Kjeldahl nitrogen, ammonium, nitrite, soluble reactive phosphorus, total dissolved nitrogen, total dissolved phosphorus, total organic carbon, and total organic nitrogen.
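The four summary quantities described for this table can be computed from long-format observations with a simple grouping step. The sketch below is illustrative only: the observation tuples, variable names, and the function name are hypothetical, and the values are not from the database.

```python
from collections import defaultdict

# Hypothetical observations: (state, lake_id, variable, sample_year, value).
observations = [
    ("MI", 101, "secchi", 1995, 3.2),
    ("MI", 101, "secchi", 2004, 2.8),
    ("MI", 102, "secchi", 2010, 4.1),
    ("MI", 101, "tp", 2004, 14.0),
    ("WI", 201, "secchi", 1988, 1.9),
]


def summarize(rows):
    """Per (state, variable): number of values, unique lakes, first and last year."""
    groups = defaultdict(list)
    for state, lake_id, variable, year, _value in rows:
        groups[(state, variable)].append((lake_id, year))
    summary = {}
    for key, items in groups.items():
        years = [year for _, year in items]
        summary[key] = {
            "n_values": len(items),
            "n_lakes": len({lake_id for lake_id, _ in items}),
            "first_year": min(years),
            "last_year": max(years),
        }
    return summary


summary = summarize(observations)
for key in sorted(summary):
    print(key, summary[key])
```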

Percentage of lakes by lake area with water quality data.
Percentage of census lakes in each lake area bin (top panel) compared to the percentage of census lakes for which there are limnological data for Secchi (second [...]

[...] NE. However, the data are exported into individual tables by spatial classification. Therefore, there are different numbers of rows in each table; for example, there are 51,101 rows for the land use metrics calculated for the 100 m lake buffer because there are 51,101 lakes that have a 100 m buffer area, but only 17 rows for the land use metrics calculated for the state spatial classification.
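Because tables at different spatial classifications have different numbers of rows, combining them requires a lookup that links each lake to its zone. The following Python sketch shows one way this could work under stated assumptions: the table contents, the column names, the lake-to-state lookup, and the function name are all hypothetical and do not reflect the actual exported tables.

```python
# Hypothetical tables keyed by the unit of each spatial classification.
buffer100m_landuse = {  # one row per lake
    1: {"pct_forest": 62.0},
    2: {"pct_forest": 15.5},
    3: {"pct_forest": 40.0},
}
state_landuse = {  # one row per state
    "MI": {"pct_agriculture": 23.0},
    "WI": {"pct_agriculture": 31.0},
}
lake_to_state = {1: "MI", 2: "MI", 3: "WI"}  # lookup linking the two scales


def join_scales(lake_table, zone_table, lake_to_zone, zone_prefix):
    """Attach each lake's zone-level metrics to its lake-level metrics."""
    joined = {}
    for lake_id, lake_metrics in lake_table.items():
        zone_id = lake_to_zone[lake_id]
        row = dict(lake_metrics)  # copy so the input table is not modified
        for name, value in zone_table[zone_id].items():
            row[f"{zone_prefix}_{name}"] = value
        row["zone_id"] = zone_id
        joined[lake_id] = row
    return joined


joined = join_scales(buffer100m_landuse, state_landuse, lake_to_state, "state")
for lake_id, row in joined.items():
    print(lake_id, row)
```

The joined result has one row per lake, with state-level values repeated for every lake in the same state.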
[...] water quality is affected by many ecological context features, such as lake physical characteristics, land cover, land use, and climate. The relationship between these features and the water quality measurements is not always linear. In addition, the data tend to be noisy and often contain missing values, which makes it challenging to fit effective statistical models. To overcome these challenges, Yuan et al. [37] developed a novel algorithm for learning non-linear features to predict lake water quality. The