Improving PHIRI performance and scalability: working within EGI-ACE

Abstract:
EGI-ACE is a 30-month H2020 project (Jan 2021 - June 2023) with a mission to empower researchers from all disciplines to collaborate in data- and compute-intensive Open Science, enabled by free-at-point-of-use services that are delivered through the European Open Science Cloud (EOSC). EGI-ACE delivers the EOSC Compute Platform (ECP), a federated system of compute and storage infrastructure extended with platform services to support diverse types of data processing and data analytics cases. The ECP currently includes High Throughput Compute (HTC) and Cloud Compute facilities, and will broaden its scope with High Performance Compute services later in 2022. The platform layer of the ECP supports single sign-on, transfer and federation of distributed data, interactive computing, management of large numbers of jobs, orchestration of compute clusters, and AI and machine learning tasks. More than 25 thematic services in EOSC build on the ECP and deliver scalable data analysis for different domains, from astrophysics through life sciences and environmental sciences to the humanities. PHIRI participates in EGI-ACE as one of the 'Early Adopters' of the ECP. Under the EGI-ACE workplan, PHIRI will explore reproducible population health workflows using the cloud computing, single sign-on, Jupyter Notebooks and Binder services of the ECP. The tests will enable PHIRI to scale out existing data analysis notebooks to high-capacity machines, to reproduce simulations and models across users, and to validate the overall technological and sustainability approaches of EGI-ACE. PHIRI will also advise the project on the best ways to introduce secure processing capabilities within the ECP services.

The PHIRI infrastructure follows a federated approach governed in line with the European Interoperability Framework. The vision of PHIRI is to create an infrastructure for individual-level data processing that follows the privacy-by-design principle in a data-centric approach. As a basis for legal interoperability and compliance with the GDPR, the queries or algorithms are moved to the data instead of moving the data. So far, the PHIRI technological developments have focused on a client-server architecture. In this architecture a Coordinator Hub, the server, is in charge of orchestrating the deployment of the data-centric analysis solutions, in the form of R and Python scripts, that are later executed in the partner nodes (data hubs), the clients. To perform the orchestration, the Coordinator Hub encapsulates the scripts in software containers, using Docker images; all the outputs are published in Zenodo. The software containers are then deployed manually from Zenodo in the partner nodes and executed by their IT specialists using their own individual-level data; the software containers represent the technical interoperability layer. The data used by each partner have previously been adapted to a common data model (CDM), and each partner has assessed the quality of its dataset against that data model; this represents the semantic interoperability layer. Finally, the outputs of the analysis are aggregated data that are sent back to the Coordinator Hub for comparative analysis. This stepwise approach has been tested on various research questions, each promoted by a leading researcher and agreed by the partner nodes that act as data hubs.
A developers' forum and a help-desk service have been set up to ease the implementation and deployment of the research queries; together these represent the organisational interoperability layer.

Abstract citation ID: ckac129.468
An enhanced version of the PHIRI infrastructure: improving the analytical services
Francisco Estupiñán
F Estupiñán 1
1 Data Sciences for Health Services and Policy, Institute for Health Sciences in Aragon, Zaragoza, Spain
Contact: festupinnan.iacs@aragon.es

The PHIRI federated approach has consisted of the development of four research queries (use cases) mobilising individual data from a number of data hubs (nodes in the federation). Methodologically speaking, the use cases have required the creation of specific cohorts of patients, population subgroups or populations, and the identification of events of interest: over-time differences in health status and healthcare utilisation before and during the pandemic. Technologically speaking, the PHIRI infrastructure consists of a distributed end-to-end analytical pipeline containing the statistical analysis workflow, including data quality assessment at origin and the mathematical algorithms. Once datasets are prepared in each data hub, partners run the analyses and produce a research output (dashboards containing the research results and tables with aggregated data) that is shared for results compilation and comparative analysis. An enhanced version of the PHIRI infrastructure should allow more complex data distribution. The research questions covered so far aim at inference on populations or providers, which implies a very simple distribution methodology, as described.
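The simple distribution methodology described above (each data hub runs the analysis locally and shares only aggregated tables for compilation) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the record fields `period` and `event`, the functions `run_local_analysis` and `compile_results`, and the toy data are hypothetical and are not taken from the PHIRI pipelines or its common data model.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical individual-level record; field names are illustrative only.
Record = Dict[str, object]
Aggregate = Dict[str, Dict[str, int]]


def run_local_analysis(records: List[Record]) -> Aggregate:
    """Runs inside a data hub: only aggregated counts leave the hub."""
    totals: Counter = Counter()
    events: Counter = Counter()
    for rec in records:
        period = str(rec["period"])
        totals[period] += 1
        events[period] += int(bool(rec["event"]))
    return {p: {"n": totals[p], "events": events[p]} for p in totals}


def compile_results(outputs: Dict[str, Aggregate]) -> Dict[str, Dict[str, float]]:
    """Runs at the Coordinator Hub: compares event rates across hubs."""
    return {
        hub: {p: agg["events"] / agg["n"] for p, agg in per_period.items()}
        for hub, per_period in outputs.items()
    }


# Toy individual-level data that never leaves each (hypothetical) hub.
hub_a = [
    {"period": "before", "event": True},
    {"period": "before", "event": False},
    {"period": "during", "event": True},
    {"period": "during", "event": True},
]
hub_b = [
    {"period": "before", "event": True},
    {"period": "before", "event": False},
    {"period": "before", "event": False},
    {"period": "before", "event": False},
    {"period": "during", "event": True},
    {"period": "during", "event": False},
]

# Only these aggregated outputs are sent back for comparative analysis.
outputs = {"hub_a": run_local_analysis(hub_a), "hub_b": run_local_analysis(hub_b)}
rates = compile_results(outputs)
```

In this sketch a single round trip is enough, because the Coordinator Hub only needs per-hub aggregates to make the comparison.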
However, when the research questions require inference on individuals (e.g., a quasi-experimental study on the effectiveness of a real-life intervention), when the inference requires a hierarchical approach (i.e., part of the variance is at individual level and part at cluster level), or when several rounds of training are needed (e.g., validation of an artificial intelligence model), the approach would require sharing coefficients, distances in n-dimensional spaces or models and, sometimes, various rounds of distribution. Finally, an enhanced version of the PHIRI infrastructure should generalise the current FAIR approach, which is limited to the publication of the analytical pipeline in Zenodo, by setting up the services and tools required for an improved version of the PHIRI open-science strategy.

The proof of concept tested by PHIRI consisted of the development of several research questions in multiple data hubs using a federated approach. It was possible to embed the use cases' analytical pipelines in a portable standalone artefact (i.e. a Docker image) and distribute it to different health data hubs and technological environments for execution. The tested solution has the advantage of not moving sensitive data out of the silos and thus protecting privacy: the code meets the data and not the opposite. Some valuable lessons provide guidance on how to further develop the PHIRI infrastructure.
1) Deep knowledge of what data are available in the different data hubs of a federation is key, since the basis for the development of a research query is the construction of a data model that is common to all the nodes in the federation. In an eventual enhanced PHIRI infrastructure, a solution would be to implement a semantic information system that allows the exchange of metadata using federated and interoperable metadata catalogues based on semantic RDF graph databases, compliant with the W3C DCAT metadata standard and exposing SPARQL query end-points for the Web of linked data.
2) Making training samples that mimic real-world data available within the Docker image has been of high added value for the development of the use cases' analytical pipelines. In an eventual enhanced PHIRI infrastructure, a generalisation could consist of setting up a "knowledge hub" where synthetic data, twinning the population data, would allow expert users to search and find data through federated queries and to prepare and train their analytical pipelines; the "knowledge hub" would provide a computational environment (e.g. Jupyter as a service playground), the necessary tools (i.e. cookbooks and capacity building services) and training samples to answer research questions, with the advantage of using data that are anonymous by nature and open access.
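The more complex distribution discussed earlier, in which nodes share coefficients over several rounds instead of a single set of aggregated tables, can be sketched as follows. The source does not name an algorithm, so this example swaps in one common technique, federated averaging of locally updated linear-regression coefficients, purely as an assumption. The function names, learning rate, number of rounds and toy data are all hypothetical.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs held privately by one node


def local_update(coef: List[float], data: Data,
                 lr: float = 0.05, steps: int = 20) -> Tuple[List[float], int]:
    """Runs inside a node: gradient steps for y ~ a + b*x on local data.

    Only the updated coefficients and the sample size leave the node.
    """
    a, b = coef
    n = len(data)
    for _ in range(steps):
        grad_a = 2.0 * sum((a + b * x - y) for x, y in data) / n
        grad_b = 2.0 * sum((a + b * x - y) * x for x, y in data) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return [a, b], n


def federated_round(coef: List[float], nodes: List[Data]) -> List[float]:
    """Runs at the coordinator: sample-size-weighted average of node updates."""
    updates = [local_update(coef, data) for data in nodes]
    total = sum(n for _, n in updates)
    return [
        sum(c[i] * n for c, n in updates) / total
        for i in range(len(coef))
    ]


# Toy data generated from y = 1 + 2x; the individual points stay in each node.
node_1 = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
node_2 = [(1.0, 3.0), (3.0, 7.0), (4.0, 9.0), (5.0, 11.0)]

coef = [0.0, 0.0]
for _ in range(50):  # several rounds of distribution
    coef = federated_round(coef, [node_1, node_2])
# coef approaches the generating values [1.0, 2.0]
```

Unlike the single-pass case, the coordinator here has to redistribute the current coefficients at every round, which is why this kind of analysis needs more than publishing one container image per research question.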

Background:
Many countries are experimenting with novel ways of organising and delivering more integrated health and social care. Governance is relatively neglected as a focus of attention in this context, but addressing governance challenges is key for successful collaboration.

Methods:
Cross-country case analysis involving document review and semi-structured interviews with 27 local, regional and national level stakeholders in Italy, the Netherlands and Scotland. We used the Transparency, Accountability, Participation, Integrity and Capability (TAPIC) framework to structure our analytical enquiry and to explore factors that influence the governance arrangements in each system.

Results:
Governance arrangements ranged from informal agreements in the Netherlands to mandated integration in Scotland. Novel service models were generally participative, involving a wide range of stakeholders, including the public, although integration was seen to be driven largely from a health perspective. In Italy and Scotland some reversion to 'command & control' was reported in response to the imperatives of the Covid-19 pandemic. Policies, budgets, auditing and reporting systems that are clearly aligned at all levels were seen to help with implementing innovations in service organisation. Where alignment was lacking, cooperation and integration were suboptimal, regardless of whether governance arrangements were statutory or not. There was wide recognition of the importance of buy-in. Enablers of greater engagement included visible leadership, time and long-standing working relationships. A lack of suitable indicators to measure integration, and of openness to data sharing, hindered working relationships and thus the successful delivery of integrated services.

Conclusions:
Our study provides important insights into how to more effectively and efficiently govern service delivery structures within care systems. We will discuss approaches to governance that help support more resilient integrated care systems.

Key messages:
Different governance arrangements face common challenges to greater integration of care. Enablers include strong leadership, inclusivity and openness to work across traditional boundaries.
Meeting the governance challenges of integrated health and social care requires clear lines of accountability, aligned policies, budgets and reporting systems.