The iTHRIV Commons: a cross-institution information and health research data sharing architecture and web application

Abstract Objective The integrated Translational Health Research Institute of Virginia (iTHRIV) aims to develop an information architecture to support data workflows throughout the research lifecycle for cross-state teams of translational researchers. Materials and Methods The iTHRIV Commons is a cross-state harmonized infrastructure supporting resource discovery, targeted consultations, and research data workflows. As the front end to the iTHRIV Commons, the iTHRIV Research Concierge Portal supports federated login, personalized views, and secure interactions with objects in the ITHRIV Commons federation. The canonical use-case for the iTHRIV Commons involves an authenticated user connected to their respective high-security institutional network, accessing the iTHRIV Research Concierge Portal web application on their browser, and interfacing with multi-component iTHRIV Commons Landing Services installed behind the firewall at each participating institution. Results The iTHRIV Commons provides a technical framework, including both hardware and software resources located in the cloud and across partner institutions, that establishes standard representation of research objects, and applies local data governance rules to enable access to resources from a variety of stakeholders, both contributing and consuming. Discussion The launch of the Commons API service at partner sites and the addition of a public view of nonrestricted objects will remove barriers to data access for cross-state research teams while supporting compliance and the secure use of data. Conclusions The secure architecture, distributed APIs, and harmonized metadata of the iTHRIV Commons provide a methodology for compliant information and data sharing that can advance research productivity at Hub sites across the CTSA network.


INTRODUCTION
The objective of the iTHRIV Commons is to provide a cross-site scalable research infrastructure that enables information sharing as well as the secure collection, storage, sharing, and analysis of data in accordance with the NIH endorsed FAIR (Findable, Accessible, Interoperable, and Reusable) principles. 1 The system must meet administrative, security, legal, informatics, regulatory, and research requirements from a diverse set of institutions, while supporting both manual and automated collection of searchable metadata and providing a managed lifecycle for health research data. The product must allow research teams from different departments and schools to integrate their research workflows while providing personalized access as controlled by project owners and data administrators. The system should also support workflows for secure and well-governed transmission of medical record data to research teams, within and across the iTHRIV network. This article describes the design and early implementation of this product, demonstrating how to meet this broad set of requirements and how to harmonize research workflows and tools across institutions, supporting the sharing of some artifacts while still allowing for security and governance controls at the local level.

BACKGROUND AND SIGNIFICANCE
The integrated Translational Health Research Institute of Virginia (iTHRIV) is a collaboration of public academic institutions (the University of Virginia and Virginia Tech) and private hospitals (Inova Health and Carilion Clinic) across the Commonwealth of Virginia that supports translational research. The mission of iTHRIV is to catalyze and sustain inclusive clinical and translational research through diverse, collaborative team science, innovative data science, and broad workforce development in order to improve human health and promote health equity. 2 Navigating the complex health research landscape can be challenging for research teams, administrators, and service providers. Targeted communication is always difficult in research communities, but it became an even bigger challenge when iTHRIV's multicenter Clinical and Translational Science Awards (CTSA) consortium opened the door to more cross-state resource sharing. Similarly, health data management was already a complex task at each institution, but cross-site workflows often meant a further reduction in efficiency. Previous system solutions for health data management at our iTHRIV institutions often involved project-driven, siloed technology and the appropriation of tools and systems not specifically designed for research workflows. With regard to medical record data, workflows that support patient care were often supplemented at the project or departmental level with custom databases or distributed for storage on secure servers managed by a variety of teams. This redundancy was resource intensive and prohibited findability and interoperability. Regulatory and policy requirements also posed a challenge for researchers who had to generate and manage compliant solutions based on the specific composition of their team and the data requirements of their project. Although health system policy and Institutional Review Board requirements set rules for the management of sensitive data, there was no streamlined and auditable system whereby health data stewards could transfer data sets to researchers. Generation of an appropriate data security plan, and subsequent compliance over the life of the project, typically required researchers to understand and oversee system requirements well-outside their areas of expertise. Investigators across our institutions had to spend significant time and financial resources to address these challenges.
Our research workforce needed a secure and compliant system to share information and data in order to facilitate collaborative research. Data commons have been shown to provide the framework to meet these research needs through the colocation of data, storage, and computing infrastructure. 3,4 While these existing approaches to the data commons are providing solutions to important data sharing problems, they have been built for structured data and for specific focus areas, such as genomics. 5 The NIH Data Commons pilot initiated in 2017 aimed to solve some of the well-recognized issues around data access and sharing, but that initiative was not extended beyond the initial pilot, leaving academic institutions to design their own solutions based on best-practices and lessons learned in the NIH pilot. 6 Recent work has also shown that rather than a single information commons at the national level, a more effective approach to clinical translational research is a distributed set of connected "commonses." 7 As discussed below, this distributed model is the approach in the iTHRIV Commons with an architecture comprised multiple landing services hosted behind the firewalls of each institution.
In response to these common needs across our CTSA, and in the absence of an existing product meeting all requirements, we designed and developed a custom but cost-effective distributed platform for sharing information and securely managing data. This integrated translational research eco-system provides a crossinstitutional index of information, resources, projects, and data sets. Unlike other approaches data commons, the scope of the iTHRIV Commons is broad, meaning that the product supports all types of translational health research. It also gives the appropriate entities the ability to expose some artifacts while protecting others. This ensures the proper stewardship and privacy protection that is essential to the translational research process. These protections enable societal trust in the researchers, 8 as well as, conformance to ethical and legal standards.

MATERIALS AND METHODS
The iTHRIV Commons has been developed as a cross-institution ecosystem of tools, educational content, translational research events, and expert consults. As the primary entry point to the iTHRIV Commons, the iTHRIV Research Concierge Portal supports broad information and resource sharing using harmonized metadata, and now also provides secure federated access to locally stored data (Figure 1). Building and connecting the Research Data Commons to this web application required both technical innovation and careful consideration of data governance, security, and privacy requirements across the diverse participating iTHRIV partner sites. The features and system design of the Commons are explained in detail below.

iTHRIV Commons features: expert consults
The iTHRIV Research Concierge Portal provides users with access to a range of CTSA-supported consulting services. Consult requests, tied to user-selected topics, are automatically routed to the appropriate team of experts based on the user's home institution. These teams may be different for each institution (such as when the topic selected is "Medical Record Data Pull") or might consist of a cross-CTSA team (such as when the topic selected is "Community Seed Grants"). Consult requests submitted through the iTHRIV Portal generate service management tickets which are tracked for purposes of satisfaction and resource-utilization metrics.

iTHRIV Commons features: informational content
The iTHRIV Research Concierge Portal provides a single online entry point to essential clinical and translational research tools and resources, with customized views of content based on users' institutional affiliation and object-level permissions.
The Portal provides templates for resources and events, with the aim of supporting team science, community engagement, and innovation in health-related research. Content is crowd-sourced; any user with an iTHRIV institutional login is able to quickly build new portal pages and submit to site administrators for curation and approval. Personalized features are also available to authenticated users, including a "favorites" menu, private views of draft content, and access to consultations service lines based on their associated institution.
All users can switch from a filtered view of resources and events curated for them to a broader view, allowing users to focus their browsing. The Portal also offers a limited-access public view, with content curated for the public. The many-to-many relational mapping between resources and categories ensures that users find all content related to their area of interest while browsing, while ensuring that content is not duplicated across the site (see Supplementary  Table S1). An Elasticsearch index supports faceted searches of con-tent in the iTHRIV Commons. User views are provided in Supplementary Figures S1-S5.

iTHRIV Commons features: research data
In addition to information and resource management, the Portal also provides tools for project and data management. The foundational functionality of the iTHRIV Research Data Commons facilitates data access and compliance: • Granular permissions are defined at the project and individual data set level. The Principle Investigator and delegated Project Owners manage the study team permissions and project metadata. Access to project data sets is independently controlled by Data Set Administrators, which may include study team members or stewards of research repositories, medical record systems, research labs, etc. Each project team member has a custom view and may only access the associated objects which they are permitted to see. • Highly Sensitive Data (HSD) access is limited via an interface with the respective iTHRIV organization's Institutional Review Board (IRB) database API. If a data set contains HSD, the data set admin must associate an active local IRB protocol with the data set prior to uploading any files. Data set administrators can only assign permissions to appropriate active study personnel as listed on the approved IRB protocol. If a researcher becomes inactive on the protocol in the IRB database, their access is automatically removed; if the protocol becomes inactive, data access is blocked. • Access for research team members at partner institutions is systematically restricted to limited data sets or deidentified data and requires that a project-specific contract first be uploaded into the project. • Integration with research databases, such as REDCap, allows periodic, scheduled extracts that may be shared with the appropriate project stakeholders. This allows project owners to expose certain elements of their data to a broader audience while maintaining control and protection over the source data. • Manual and automated population of project and data set descriptions and characteristics (metadata) support data organization, discovery, and compliance. Permissions are independently controlled for project metadata, data set metadata, and data set files. This allows researchers to share descriptions of their work with the research community while having the option of keeping the actual data files hidden for most users. • The system facilitates administrative reporting and auditing of research data assets. Audit logs reflect all access to both public and private objects in the Commons.
iTHRIV Commons: metadata Resources, events, projects, and data sets all have a well-defined set of metadata that supports discovery and enables the system to apply business logic based on object characteristics. Sensitive data is delineated at the field level in the metadata (ie, Street Address) and the system applies logic to set the value of hidden metadata fields that drive possible actions around that object (ie, sensitivityLevel ¼ "HIGH SEN-STIVITY DATA" can never be made public). Uploaded data files are scanned to check for accuracy of metadata. All data set uploads require the population of a core set of metadata with metadata extensions provided based on data set type as defined by the user. For example, if the user defines the data set as DICOM data, additional imaging metadata fields appear, most of which are auto-populated at the time of file upload using header information in the files. Schema.org metadata is used whenever possible to provide maximal findability to external search engines and also to enable greater interoperability with other data repositories 9 and other large biological sciences metadata projects using this annotation. 10 Comprehensive metadata documentation is provided in the Supplementary Materials.

iTHRIV Commons: governance
Proper governance, stewardship, and protection of data are paramount when managing potentially sensitive research data. System integration with local Institutional Review Boards (IRBs) ensures that work in the Commons aligns with federal regulations pertaining to ethical human subject research 11 without introducing redundancy with existing IRB processes and systems. Similarly, existing legal and security standards are integrated into the system whenever possible to ensure local governance bodies continue to play their appropriate roles while still providing the researcher with a shared single system for data management. Figure 2 illustrates which users control various objects types in the iTHRIV Commons as well as the governing bodies providing oversight and Table 1 provides further details. User roles and object access flags related to content ownership and governance are provided in the Supplementary Materials.
iTHRIV Commons: review of compensating controls Multiple approaches are taken in the iTHRIV Commons to minimize risks (data misuse and noncompliance). • Regulatory Considerations: IRB database integration ensures that regulation around access to highly sensitive data is carried over from the IRB application process into real-time data management practices in the Commons. • User Accountability: Commons users must sign an end-user agreement upon entering the Commons (and at time of agreement update). Users maintain responsibility for their own actions pertaining to data management as trained by their local institutions, but the system supports compliance with built in business logic and user prompts. For example, users must attest to having reviewed the associated data security (provided as a linked document) prior to downloading any sensitive data. • Account Management: iTHRIV institutions retain local control over the management of provisioning and deprovisioning user accounts. Integration with LDAP systems ensures continual alignment with local policy as well as supporting timely deprovisioning of access to Commons objects upon departure of faculty and staff. • Legal Implications: The Commons software is deployed under contractual agreements (described below). iTHRIV Commons: research concierge portal architecture The iTHRIV Research Concierge Portal is a custom-built full-stack web application, currently hosted in Amazon Web Services (AWS). Front-end code, run on the client web browser, is written in Angular. User authentication is done through respective institutional identity management systems via the Internet2 InCommon Federation, 13 the signer and curator of US research and education trust registry information used in federated transactions globally. Reliance on institutional logins ensures that system access remains aligned with each institution's user account access (such as dualauthentication) and password requirements as they change over time. Each identity provider issues a browser token for the session which is stored in the client's browser while session information is passed back to the iTHRIV Portal Backend Service in AWS. Users can also bi-pass authenticated login by selecting the "PUBLIC" icon on the login screen. The user's browser connects to https://portal. ithriv.org/#/home and runs the iTHRIV Portal Client code; content on the user's web client is then dynamically rendered via RESTful API calls to the iTHRIV Web Service. Figure 3 shows the interactions between the user's machine, the systems involved in authentication, and the components of the iTHRIV Research Concierge Portal, with detailed descriptions of the latter provided in Table 2.  Table 3). Note that the user must be on high-security VPN or on a secure local area network in order to access the Commons Landing Services' APIs. (This is not a requirement for accessing other features in the iTHRIV Research Concierge Portal.) Each landing service is configurable for integration with local research management systems, including IRB databases, research databases, and storage. The flexible framework supports adaptation for each institution to its own systems, processes, and policies. As noted above, an iTHRIV Commons Landing Service is installed at each site ( Figure 4). The user will access Commons objects (according to their permissions) across all the partner sites via API calls across the federation of landing services ( Figure 5). Metadata is sent back via the Portal and then sent to the client's browser. In contrast, API calls related to object permissions or the uploading or downloading of data files are sent directly from the client (connected to a secure network) and through F5 and the local firewall to the local landing service. Thus, sensitive information is never routed through the iTHRIV Research Concierge Portal Backend Service in AWS.

RESULTS
iTHRIV Commons: research concierge portal launch The iTHRIV Research Concierge Portal was launched at the University of Virginia in 2018 with a base set of features supporting crowd-sourced resource sharing. Over the next 2 years, the portal was launched at all partner iTHRIV institutions and new features were added, including the ability to add and manage translational research-related events and an integrated concierge consultation service. With >10 000 users per Google Analytics, 829 resources, 325 events, and >1500 consult requests to date, this cross-institution web application has provided iTHRIV institutions with a foundational design for the federation of information and researcher support.
With its curated and personalized views, the platform has proven invaluable for the launch of programs that are developed at a single institution, then expanded and generalized for use across iTHRIV partner sites and eventually shared with the CTSA consortium. One example is the iTHRIV Research Administration Portal for Training and Resources (RAPTR) program which has been constructed as a set of interconnected iTHRIV Research Concierge Portal resource pages. 14 The portal enables RAPTR creators to curate each training resource page based on the audiences for which it is intended by defining "Institutions with Access" in the metadata of each page. Users from each institution are presented with customized training around local processes as they navigate, while users from outside institutions will be presented with generic templates that can be adapted to fit their local research administration processes. The broader resource sharing supports efficiency and standardization across institutions while the local views support more targeted content, all within a single platform.
Resource content is as varied as the landscape it reflects, but the evolution and use of the portal has demonstrated how metadata can be harnessed to focus the users' experiences even as the breadth of available resources is expanding. Analytics show us that users both browse categorically and apply filters in faceted searches to discover content. The content is associated by its owner with one or more subcategories (Supplementary Table S1) and assigned one primary "type" ( Table 4). The values for these object metadata were defined by a collective team of iTHRIV Portal Admins with representatives from each site and have been dynamically refined over time via an administrator interface based on user feedback and domain developments, such as when a "COVID Collaboration Corner" was added in April of 2020.

Project Owners and Project Collaborators
Although the research team manages the documents they upload, the system requires that certain documents be present in order to unlock features that require additional governance (as described in other rows of this table).

Data Set Administrators
A single project may use data sets that are stewarded by individuals within or outside the project team. The Data Set Administrator could be the electronic health record steward, biorepository manager, lab manager, or anyone responsible for controlling access to the source data. The data set admin may be the Principal Investigator if novel data is prospectively generated or collected through the course of the project. The responsible data admin creates a data set object within a project and defines the metadata attributes. They also decide whether or not to make the metadata discoverable for all users via the iTHRIV Commons index. (continued) iTHRIV Commons: research data commons launch The first iTHRIV Commons Landing Service has been installed at the University of Virginia (UVA) and integrates with local storage endpoints as well as the local IRB-HSR database. The product was initially deployed in User Acceptance Testing (UAT) environments for feature acceptance by the product owner then testing by the development team, the health data analytics team, and iTHRIV staff. This round of testing provided a valuable debugging opportunity and led to usability enhancements and planned future feature development. At the time of launch, this landing service is only accessible to authenticated UVA users who are connected to a high-security UVA network (via LAN or VPN connection). iTHRIV has initiated a phased roll-out of Version 1 of this product at UVA; the research teams invited to use the first production version of the product were selected for breadth and depth of their projects. Their crossdepartment teams may typically face stewardship challenges when sharing their data. Their complex projects also benefit from advanced features in the Research Data Commons such as autopopulation of imaging metadata and integration with REDCap.
Pending institutional review of system performance in this pilot phase, iTHRIV plans to expand use of the system to all UVA researchers conducting clinical and translational research later this year. iTHRIV partner institutions (Virginia Tech, Inova Health, and Carilion Clinic) will be installing an iTHRIV Commons Landing Service and integrating it with their local research data environment. This will support management of local private projects and data sets, object sharing across iTHRIV when applicable, and the publishing of research artifacts from partner site research.   A public view of shared informational content is already supported in the iTHRIV Portal. In 2022, iTHRIV expects to support public access to approved project and data set metadata as well as published data sets ( Figure 6). Public data access will always be mit-igated by integrated IRB controls, institutional governance policies, and will also require approval by project owner and data set admin. Thus, researchers will retain the right to not expose metadata or files related to their private projects. ; Figure 4. This diagram is a simple representation of the system interactions between an iTHRIV researcher's computer, the iTHRIV Research Concierge Portal Web Application (collapsed here but with details available in Figure 3), ancillary authentication services, and the iTHRIV Research Data Commons Landing Service at a single site. Hardware and operating system support are provided by each participating institution, including firewalls, file storage, and servers that support file storage and the Landing Service software components.

Product customization and feature enhancement
With additional institutional and project funding, the iTHRIV Research Data Commons software will continue to be extended to meet the needs of specific research projects. In the past, projects developed custom, siloed solutions. In contrast, the Commons provides a robust foundation and a ready-made set of data management tools upon which new features can be built (ie, new system integra-tions, metadata extensions, analytics pipelines, etc.). Sharable software (analytics tools) will be added as the next object type in the Commons with a corresponding set of metadata and business rules. Developing custom tools as an extension of the iTHRIV Commons will ensure new data management tools and solutions are integrated into this cross-institution platform. Thus, project-specific data management innovations can result in system enhancements that will benefit all future researchers across iTHRIV as well as our community partners. This shared resource will expose a growing index of projects and data sets, while visibility and access continue to be controlled at a granular level. The expansion of the iTHRIV Research Data Commons can support any number of new and developing research domains, thereby creating an economy of scale and also supporting organic interdisciplinary interoperability and cross-pollination.

Potential for future product dissemination
The iTHRIV Commons has been designed to that it can be deployed in the future at any institution that supports login via the InCommon federation and that is capable of installing the landing service APIs, Elasticsearch, and MinIO and doing some local configuration (assigning local system admins, integrating with local storage endpoints, making API calls to local IRB). The IRB API calls are agnostic to vendor as long as the basic information can be returned (active studies and active researchers on those studies). iTHRIV intends to license this product at no cost to other nonprofit translational research institutions.

CONCLUSION
The iTHRIV Commons has fundamentally changed the piecemeal and segregated project approach to translational research within and among the iTHRIV partners. iTHRIV Portal usage statistics demonstrate that diverse teams across our iTHRIV institutions can    Table 4. This table provides current counts of resource content in the portal by "type," a system defined set of 10 tags that are more generic than the 95 subcategories but that provide another method for refining a search now work in a single integrated environment, personalized for each user. The initial installation of the iTHRIV Research Data Commons Landing Service at UVA establishes the feasibility of implementing a distributed data management infrastructure.
Commons metadata supports regulated object-level permissions and provides an opportunity for systematic characterization, indexing, discovery, and auditing of institutional research data. We expect that the successful deployment of the iTHRIV Research Data Commons at partner sites will demonstrate the potential for further expansion of the Commons infrastructure to other external partners across the state as well as to other translational research institutions who wish to adopt this technology.

FUNDING
This work was supported by National Center for Advancing Translational Sciences grant number UL1TR003015.

AUTHOR CONTRIBUTIONS
The products described in the paper were initially conceived of by DEB and KCJ (iTHRIV Principal Investigators) and detailed throughout a series of iTHRIV Informatics meetings involving all authors. GSW, JJL, RKRC, PKD, and SGP were responsible for software design and development at the University of Virginia. MMAP, JEK, and MMT ensured that the product aligned with research and informatics processes and requirements at their local institutions. JJL drafted the initial manuscript under guidance from DEB and KCJ with section content contributed by PKD and RKRC. All authors reviewed and revised the manuscript, approved the final version to be published, and agree to be accountable for all aspects of this work.