Internet of Samples (iSamples): Toward an interdisciplinary cyberinfrastructure for material samples

Abstract
Sampling the natural world and built environment underpins much of science, yet systems for managing material samples and associated (meta)data are fragmented across institutional catalogs, practices for identification, and discipline-specific (meta)data standards. The Internet of Samples (iSamples) is a standards-based collaboration to uniquely, consistently, and conveniently identify material samples, record core metadata about them, and link them to other samples, data, and research products. iSamples extends existing resources and best practices in data stewardship to render a cross-domain cyberinfrastructure that enables transdisciplinary research, discovery, and reuse of material samples in 21st century natural history.

Keywords
Material sample, specimen, data standards, cyberinfrastructure, unique identifiers, persistent identifiers, collections, geoscience, bioscience, archaeology

Background
Material samples from natural and built environments are fundamental to many branches of science and are increasingly needed for interdisciplinary research with critical societal relevance, such as sustaining natural resources, controlling infectious diseases, and coping with environmental change. Scientific collections have entered the realm of big data with the advent of simultaneous sampling across large areas and repeated sampling of the same area [1][2][3]. Many (perhaps most) material samples, however, are not accessioned into institutional collections but remain 'hidden' in labs, offices, and basements, as researchers and institutions often lack the resources and expertise to properly curate them [4]. Harnessing existing sample-based data for science is cumbersome and often impractical because data about most material samples are difficult or impossible to Find, Access, Interoperate, and Reuse; they are simply not FAIR [5]. As a consequence, the full value of material samples and the data derived from them is rarely realized, either for basic scientific research or societal applications. For example, published DNA sequence data often lack the basic geographic metadata needed to understand the origin and spread of pathogens [6].
Maximizing the value of today's samples for tomorrow's science requires cyberinfrastructure designed to facilitate sharing and reuse across the material sample value chain and to accommodate the interdisciplinary nature of many samples (Box 1). Unleashing the societal value of material samples also requires linking them to derived data and published interpretations of those data, which are essential steps to making sample-based scientific knowledge reproducible, credible, and useful.
In order to achieve these linkages, material samples need globally unique, persistent, and resolvable identifiers with reliably accessible and trustable standards-based metadata describing the sample and its provenance.

Box 1: Interdisciplinarity of Material Samples - Example from Archaeology
Archaeologists study highly diverse material culture created over many millennia by peoples across the world who lived in very different regions, societies, and cultural traditions. While material culture is difficult to describe with standard metadata, archaeologists draw upon geological and biological sources of evidence and vice versa. For example, large-scale data integration of animal remains has been used to demonstrate domestication patterns in Southwest Asia [7]. It is vital, however, that

Main text

iSamples Solution
Recognizing the need for research infrastructure to support material samples, the U.S. National Science Foundation funded iSamples in 2020 to develop consistent services for unique and persistent sample identification and sample metadata registration across disciplines. Collaborating with similar efforts globally, iSamples will provide services for creating and assigning persistent, unique, and resolvable identifiers to material samples in a consistent manner across disciplines, and for registering and indexing metadata using semantic web technologies. The result will be a searchable global index of material samples linked to appropriate metadata and derived data products. iSamples aims to (i) enable previously impossible connections between diverse and disparate sample-based observations; (ii) support existing research programs and facilities that collect and manage diverse sample types; (iii) facilitate new interdisciplinary collaborations; and (iv) provide an efficient solution for FAIR samples, avoiding duplicate efforts in different domains. To achieve its goals, iSamples must incorporate and help advance diverse metadata vocabularies and standards across natural history domains (Figure 1).
Figure 1. (A) Each discipline creates its own community-specific metadata fields (different colored dots) and data standards (oval shaped 'petals') based on their specialized knowledge and needs. Disciplinary communities are at different stages of organization. The most advanced have standardized metadata fields (e.g., yellow, green, and tan petals), sometimes with minimum required fields (darker inner petals). Some disciplines are beginning to organize (pink dots with dotted-line petal) while others have no organization as yet (purple dots). Some metadata fields cut across disciplines (the brown and purple dots in the green domain). At the cutting edge of research, new data types and custom metadata fields are constantly emerging (blue dots). (B) Cyberinfrastructure being built by iSamples focuses on sampling events and the resulting material samples and subsamples thereof. Metadata needed will include the material sample identifier (black dot) and its required metadata kernel (light blue circle), as well as an iSamples core (orange circle) that encompasses all required metadata fields shared across disciplines in the natural history domain. Promoting and facilitating community-driven metadata standards from each domain, iSamples will also support the creation of interdisciplinary metadata profiles (see Figure 2, iSamples in a Box) that include metadata fields from the iSamples core (required) and across domains (optional) to serve the needs of interdisciplinary researchers and other users.

Technical Description: Distributed Cyberinfrastructure
The iSamples system has two core components (Figure 2). An iSamples-in-a-Box instance is a standalone system that enables creation of identifiers and associated metadata, retrieval of the sample information, updates to the sample metadata (e.g., augmenting or correcting metadata or appending provenance statements), sample identifier resolution, and discovery of samples. iSamples-in-a-Box will support different scenarios. Initial use cases include: (a) SESAR, which provides reliable services for sample metadata cataloguing and International Geo Sample Number (IGSN) registration for individual researchers and institutions [8]; (b) GEOME, which supports capturing metadata on biological samples and links to associated genomic data [9]; and (c) Open Context, a publishing service maintained by the Alexandria Archive Institute, which serves as a metadata repository for archaeological artefacts and ecofacts and links samples to associated data.
iSamples Central is designed as a permanent Internet service that preserves and indexes sample metadata to ensure reliable discovery and retrieval, and provides a gateway between iSamples-in-a-Box instances and identifier authorities to ensure that remote iSamples-in-a-Box content is fully synchronized with the relevant authorities (e.g., IGSNs generated on iSamples-in-a-Box are synchronized with iSamples Central and the IGSN central authority). By offering services that augment existing identifier authority capabilities, iSamples Central enables support of other identifier types such as ARKs or DOIs that are not traditionally associated with material samples, but are used by some organizations. iSamples Central is a central discovery and resolution service (search interface on the web and API) for any community that wishes to participate, while iSamples-in-a-Box will deliver distributed infrastructure early in the data production chain with an emphasis on the needs of specific research domains.
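The division of labour between a local box and the central index can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the class names `Box` and `Central`, the identifier format, and the three stand-in "core" fields are assumptions made for illustration, not the iSamples API or its actual metadata core.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical stand-ins for required "core" fields; the real iSamples core
# vocabulary is not specified in this text.
CORE_FIELDS = {"label", "sampled_where", "sampled_when"}


@dataclass
class Central:
    """Toy stand-in for iSamples Central: a global index keyed by identifier."""
    index: Dict[str, dict] = field(default_factory=dict)

    def sync(self, identifier: str, metadata: dict) -> None:
        # Keep a copy so that later local edits only appear after a re-sync.
        self.index[identifier] = dict(metadata)

    def resolve(self, identifier: str) -> Optional[dict]:
        return self.index.get(identifier)


@dataclass
class Box:
    """Toy stand-in for one iSamples-in-a-Box instance with a local index."""
    prefix: str
    central: Optional[Central] = None  # syncing to Central is optional
    records: Dict[str, dict] = field(default_factory=dict)
    _counter: int = 0

    def create(self, metadata: dict) -> str:
        """Validate the core fields, mint a local identifier, store the record."""
        missing = CORE_FIELDS - set(metadata)
        if missing:
            raise ValueError(f"missing core fields: {sorted(missing)}")
        self._counter += 1
        identifier = f"{self.prefix}:{self._counter:06d}"  # assumed format
        self.records[identifier] = dict(metadata)
        self._push(identifier)
        return identifier

    def update(self, identifier: str, **changes: str) -> None:
        """Augment or correct metadata for an existing sample."""
        self.records[identifier].update(changes)
        self._push(identifier)

    def resolve(self, identifier: str) -> Optional[dict]:
        return self.records.get(identifier)

    def _push(self, identifier: str) -> None:
        # Only boxes that opted in to Central propagate their records.
        if self.central is not None:
            self.central.sync(identifier, self.records[identifier])
```

A box constructed without a `central` argument keeps its records purely local, which mirrors the opt-in synchronization described in the text. Coordination with external identifier authorities is deliberately left out of this sketch.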
Figure 2. iSamples System Infrastructure. iSamples infrastructure supports individuals and organizations through two key components. The iSamples project will create generic code that can be used to build many instances of iSamples-in-a-Box (center). Each box is a domain or community portal that provides local services for identifier allocation and metadata collection according to metadata profiles specific to that portal. Individual users will push their sample metadata, collected via spreadsheets or apps (left), to the iSamples-in-a-Box local index. Larger institutions may choose to create sub-boxes (e.g., a museum might create a sub-box for its field station). Boxes connect to iSamples Central (right) to verify their accounts with ID authorities, download or sync metadata profiles, and, if they choose, to sync their metadata with the iSamples Central global index for discovery, resolution, and identifier coordination. iSamples Central manages cross-disciplinary metadata according to the model described in Figure 1B. The iSamples Central index also stores links to related data and publications.

Provenance is often truncated in current data systems (Figure 3). iSamples takes an event-based approach, capturing metadata upstream from Field Information Management Systems and maintaining links downstream, with metadata standards implemented or inferred at each step. Some metadata are inferred, as they must follow all parent-child relationships (e.g., 'where' and 'when' of the event), but other types of metadata (e.g., taxonomy) cannot always be inferred (e.g., a subsample from a fish might not inherit the fish's taxonomy, as it might be something the fish ate or a parasite; similarly, a mineral subsampled from a rock cannot inherit the rock taxonomy).
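The inheritance rule for event metadata can be sketched as follows. This is a minimal illustration under stated assumptions: the `Sample` class, the field names `where`, `when`, and `taxonomy`, and the example values are all hypothetical, and the rule that only event-level fields propagate is a simplification of the behaviour described in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumption: only event-level fields propagate from parent to child.
INHERITED_FIELDS = ("where", "when")


@dataclass
class Sample:
    identifier: str
    metadata: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Sample"] = None

    def effective_metadata(self) -> Dict[str, str]:
        """Return own metadata plus inheritable fields found on ancestors."""
        result = dict(self.metadata)
        ancestor = self.parent
        while ancestor is not None:
            for key in INHERITED_FIELDS:
                # The nearest ancestor wins; values set on the sample itself
                # are never overwritten.
                if key not in result and key in ancestor.metadata:
                    result[key] = ancestor.metadata[key]
            ancestor = ancestor.parent
        return result


# Hypothetical records: a fish and a subsample of its gut contents.
fish = Sample(
    "fish-1",
    {"where": "reef site", "when": "2008-01-01", "taxonomy": "Paracirrhites arcatus"},
)
gut_content = Sample("gut-1", parent=fish)
fin_clip = Sample("fin-1", {"taxonomy": "Paracirrhites arcatus"}, parent=fish)
```

Here `gut_content.effective_metadata()` carries the fish's `where` and `when` but no `taxonomy`, because the gut content may be a different organism. The fin clip has a taxonomy only because it was asserted explicitly on that record.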
Figure 3. The record on the left is for a Genetic Sample in the Smithsonian's NMNH Biorepository (AG5NQ96). Each material sample has its own identifier, in this case an EZID ARK, that is bolded in the relational sample tree. The DNA was extracted from a tissue (439437, Unknown tissue sample) that was taken from a fish (Paracirrhites arcatus, 439437) that can be found (voucher specimen) in the National Museum of Natural History in Paris as catalog number MNHN-IC-2008-0152. The "Sample Tree" field reveals the provenance of the DNA, and also reveals another tissue sample (439437, Fin-clip) that was taken from the same fish.
The record on the right is for a rock sample, KI-04-112710, registered in the SESAR catalog by a research scientist. Each sample registered in SESAR is assigned an IGSN as a unique identifier, in this case IAC000009. KI-04-112710 is a hand sample that is representative of primitive lavas from Antarctica and was subsequently powdered for additional analysis. In the "Related Samples" field of its profile page, the resulting rock powder is listed as a child sample, KI-04-11272010 (IAC00000E). KI-04-11272010 was resampled for phase equilibrium experiments, and lists 27 child samples. The links between the records provide the provenance between the parent, child, and grandchild samples.
By registering identifiers, iSamples will enable such "Sample Trees" to reveal a larger value-chain, such as linking the collecting event to other specimens and their derivatives and resulting data (e.g., GenBank submissions, images).
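A sample tree of this kind reduces to a mapping from each child identifier to its parent. The sketch below uses the two IGSNs named in the rock example; the two `EXAMPLE-` identifiers are hypothetical placeholders for grandchild samples, and the function names are illustrative rather than part of any existing service.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Child identifier -> parent identifier (None marks a root sample).
PARENT_OF: Dict[str, Optional[str]] = {
    "IAC000009": None,          # hand sample KI-04-112710
    "IAC00000E": "IAC000009",   # rock powder KI-04-11272010
    "EXAMPLE-01": "IAC00000E",  # hypothetical grandchild sample
    "EXAMPLE-02": "IAC00000E",  # hypothetical grandchild sample
}


def lineage(identifier: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return ancestors ordered from the immediate parent up to the root."""
    chain: List[str] = []
    current = parent_of.get(identifier)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def descendants(identifier: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return every sample derived, directly or indirectly, from `identifier`."""
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].append(child)
    found: List[str] = []
    stack = [identifier]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            found.append(child)
            stack.append(child)
    return found
```

Calling `lineage("EXAMPLE-01", PARENT_OF)` walks upward to the powder and then the hand sample, while `descendants("IAC000009", PARENT_OF)` collects everything downstream. Links to derived data such as sequence records or images could be stored alongside each identifier in the same way, though that is not shown here.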

Sampling Nature: Sustainability, Inclusion, and Equity
While iSamples has funding to build cyberinfrastructure addressing technological barriers, significant sociological challenges remain to unleashing the full value of material samples. Harnessing material samples for sustainable development, for example, requires empowering a broad swath of stakeholders to benefit from material samples, related data, and research products, particularly people from whose communities the samples are derived. It is vital that standards, training materials, public outreach, and policy recommendations are equitable and inclusive. This is particularly important in the areas of Indigenous data rights and social justice, where inequities of the past and present need to be addressed. Integration of CARE as well as FAIR principles, for example, and the adoption of tools such as Traditional Knowledge and Biocultural Labels (an initiative of "Local Contexts") represent important steps that iSamples will pursue.

Beyond Natural History
iSamples will focus on the natural history sector (any sample where geolocation is of primary importance) but can scale to other domains. Material samples are important in a number of sectors that are increasingly interconnected, such as ecology and medicine in 'One Health' [10]. The need for permanent identifiers and robust metadata is not unique to material samples. Building a fully comprehensive internet of samples will require infrastructure similar to iSamples for all resources connected to samples, including datasets, images, sound recordings, and publications.

Conclusions
iSamples will allow scientists to track natural history samples, subsamples, associated metadata, data, and research products. iSamples is a single, distributed, transdisciplinary infrastructure based on domain-neutral technologies, standards, and consistent sample identification that is extensible to accommodate domain-specific needs. iSamples aims to enhance existing research within disciplines while enabling new research across them.