sfkit: a web-based toolkit for secure and federated genomic analysis

ABSTRACT
Advances in genomics increasingly depend on the ability to analyze large and diverse genomic data collections, which are often difficult to amass due to privacy concerns. Recent works have shown that it is possible to jointly analyze datasets held by multiple parties while provably preserving the privacy of each party's dataset using cryptographic techniques. However, these tools have been challenging to use in practice due to the complexities of the required setup and coordination among the parties. We present sfkit, a secure and federated toolkit for collaborative genomic studies, to allow groups of collaborators to easily perform joint analyses of their datasets without compromising privacy. sfkit consists of a web server and a command-line interface, which together support a range of use cases including both auto-configured and user-supplied computational environments. sfkit provides collaborative workflows for the essential tasks of genome-wide association study (GWAS) and principal component analysis (PCA). We envision sfkit becoming a one-stop server for secure collaborative tools for a broad range of genomic analyses. sfkit is open-source and available at: https://sfkit.org.


INTRODUCTION
Data sharing has been a key driving force of progress in genomics. Sharing data across organizations allows researchers to analyze data from larger and more diverse cohorts than what they can individually obtain, which is crucial for extracting novel biomedical insights (1,2). However, biomedical data sharing has become increasingly difficult due to growing concerns about data privacy as well as stricter policies and regulations resulting from these concerns (e.g., the European Union's General Data Protection Regulation) (3). Creating tools to facilitate the joint analysis of private data across isolated repositories would thus be a great boon for biomedical research.
An emerging field of privacy-preserving data analysis promises tools for jointly analyzing private datasets held by multiple parties while ensuring the privacy of each party's dataset (3–6). Methods based on cryptographic frameworks for secure computation are especially promising as they provide a formal privacy guarantee that the participating collaborators do not gain information about datasets held by other parties, except the final joint analysis results. Recent works have introduced algorithms built upon these techniques for a range of standard genomic analysis tasks, including genome-wide association studies (GWAS) and principal component analysis (PCA), that can efficiently scale to large datasets including hundreds of thousands of individuals (7–10).
However, leveraging these modern tools for collaboration in real biomedical studies has remained challenging. Applying these methods requires a good knowledge of the underlying cryptographic techniques, which many biomedical researchers may not immediately possess. Even with such knowledge, configuring the tools and coordinating interactive computation across a distributed network of computers spanning different organizations would still require substantial time and effort.
To address these challenges, we developed our web server sfkit (Secure and Federated toolKIT for collaborative genomic studies), which streamlines the deployment of secure collaboration tools to broadly enable groups of researchers to easily perform joint analyses of their genomic datasets without the need to share any private data among them. sfkit provides provably-secure collaborative workflows for GWAS and population structure analysis (based on PCA), both built upon state-of-the-art cryptographic techniques (7–10). The design of our web server allows similar methods being developed for a growing range of genomic analysis tasks to be easily incorporated into our server.
In the following, we summarize the system design of sfkit and highlight its key features. We then illustrate the utility and ease-of-use of sfkit for collaborative GWAS and PCA in a range of different settings and datasets. sfkit represents a key step towards broadening researchers' access to secure collaboration tools for genomics and may help unlock various joint studies that previously could not be realized.

System overview
The sfkit web server provides a common web interface through which users can create, join, configure and run collaborative analyses with other users on a collection of private genomic datasets based on a chosen study workflow (Figure 1). The website implements a range of features, including project bulletin board, chat functions, study parameter configuration and result sharing/visualization, all aimed at streamlining the application of collaborative analysis tools that require complex coordination among multiple users.
To cover a broad range of usage scenarios, sfkit offers two utilization modes: (i) auto-configured and (ii) user-configured.
In the auto-configured mode, the server automatically creates and configures Google Cloud Platform (GCP) virtual machines (VMs) on behalf of users, within the respective user-controlled GCP projects given a minimal set of permissions. This greatly simplifies the setup of networking and computing environments to be used for the joint analysis.
In contrast, the user-configured mode allows users who wish to directly configure the machines to provide their own private (or protected) computing environments. To help streamline this latter use case, sfkit additionally provides a user-friendly command-line interface (CLI), which the users can utilize on their machines to communicate with the web server and launch the analysis client program, which in turn communicates with other users' clients that are executed via the same CLI.

Security
A key feature of sfkit is the rigorous privacy protection it provides to each user by leveraging state-of-the-art cryptographic tools based on the frameworks of secure multiparty computation (MPC) and homomorphic encryption (HE), both of which enable computation over some form of encrypted information. Throughout the joint analysis workflow, data confidentiality is maintained for each user at all times, except what can be inferred based on the final analysis results. Current workflows operate under the standard semi-honest security model, which assumes that the client programs faithfully follow the protocol as given and aims only to prevent leakage of private information in any intermediate data that is visible to each user during the study execution. Additionally, the workflows employ server-aided preprocessing for computational efficiency, whereby an auxiliary party distributes correlated randomness to the users to accelerate certain cryptographic operations. When needed, sfkit automatically creates a study-specific GCP VM for this role, shown as 'Coordinator VM' in Figure 1. Both the web server and the auxiliary VM only facilitate the setup and execution of analysis protocols without receiving any private data from the users. By default, sfkit workflows provide a 128-bit security level, which can be adjusted if desired. All our software modules, including the website, CLI, analysis algorithms and cryptographic libraries (e.g. Lattigo, available at https://github.com/tuneinsight/lattigo), are open-source, which ensures that our tools are fully transparent. Other methods based on alternative security models (e.g., malicious) are also compatible with our web server. Further technical details of the security of each analysis workflow can be found in the original references (7–10).
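To give a feel for what "computation over some form of encrypted information" means in the MPC setting, the following is a minimal sketch of additive secret sharing, a textbook building block of MPC. It is not sfkit's actual protocol: the modulus `PRIME`, the function names and the example counts are all assumptions chosen for illustration.

```python
import secrets

# Hypothetical modulus for illustration only; real protocols pick parameters
# to match their target security level and number representation.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine all shares; any strict subset looks uniformly random."""
    return sum(shares) % PRIME


def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally, without communication."""
    return [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]


# Two sites hold private counts (made-up numbers) and want only the total.
count_site1 = 1234
count_site2 = 5678
shares1 = share(count_site1, n_parties=3)
shares2 = share(count_site2, n_parties=3)
total = reconstruct(add_shared(shares1, shares2))
```

Only the final sum is reconstructed; the individual counts are never assembled in one place, which mirrors the guarantee that nothing is revealed except the final analysis result.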

Collaborative analysis workflows
sfkit currently supports the collaborative execution of the following analysis tasks:
Genome-wide association study (GWAS). GWAS is an essential study designed for identifying genetic variants that are correlated with a phenotype of interest, such as disease status or other quantitative biological traits. Analyzing data from a large cohort is crucial for detecting variants that are rare or weakly associated with the trait. The following sfkit workflows allow a group of collaborators to perform an end-to-end GWAS analysis jointly over their datasets, while keeping the input datasets private.
covariates for a group of individuals) is split into multiple encrypted copies of the dataset, which are then distributed to collaborators' machines as input to the joint analysis.
Both workflows implement a standard GWAS pipeline, including quality control filters (for missing data, allele frequencies, and Hardy-Weinberg equilibrium), population stratification analysis (based on PCA), and association tests using a linear model of the trait based on allele dosages (i.e., Cochran-Armitage test for binary traits). Other workflows based on logistic or linear mixed models will be supported in a subsequent version.
We expect the MPC-GWAS and SF-GWAS workflows to be useful in different settings: MPC-GWAS allows each user to provide only an encrypted version of their dataset as input to the sfkit workflow, simplifying trust assumptions. Although SF-GWAS requires that the user's original input dataset be available on the user's machine, this allows the sfkit protocol to leverage efficient local computation on the unencrypted data to greatly reduce runtime requirements (7).
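As a plaintext reference for two of the pipeline stages described above, the sketch below applies missing-data and allele-frequency filters to a single variant and computes the Cochran-Armitage trend statistic for a binary trait, using the standard identity that the statistic equals N times the squared Pearson correlation between dosage and phenotype. It runs in the clear on one machine and is not the secure protocol; the thresholds, function names and toy data are assumptions, and the Hardy-Weinberg filter is omitted for brevity.

```python
def missing_rate(dosages):
    """Fraction of individuals with a missing genotype (None)."""
    return sum(d is None for d in dosages) / len(dosages)


def minor_allele_frequency(dosages):
    """Minor allele frequency from dosages in {0, 1, 2}, ignoring missing."""
    observed = [d for d in dosages if d is not None]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def passes_qc(dosages, min_maf=0.05, max_missing=0.2):
    """Hypothetical thresholds; a real study sets these as study parameters."""
    return (missing_rate(dosages) <= max_missing
            and minor_allele_frequency(dosages) >= min_maf)


def trend_statistic(dosages, phenotypes):
    """Cochran-Armitage trend statistic, computed as N * r^2."""
    pairs = [(d, y) for d, y in zip(dosages, phenotypes) if d is not None]
    n = len(pairs)
    mean_d = sum(d for d, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((d - mean_d) * (y - mean_y) for d, y in pairs)
    var_d = sum((d - mean_d) ** 2 for d, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return n * cov * cov / (var_d * var_y)


# Toy variant: eight individuals, one missing genotype, binary phenotype.
example_dosages = [0, 0, 1, 1, 2, 2, None, 0]
example_phenotypes = [0, 0, 0, 1, 1, 1, 1, 0]
example_stat = trend_statistic(example_dosages, example_phenotypes)
```

Under the null hypothesis the statistic is compared against a chi-squared distribution with one degree of freedom; in the secure workflows the same quantities would be derived without any party seeing another party's genotypes or phenotypes.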
Principal component analysis (PCA). PCA is a standard algorithm for dimensionality reduction, commonly applied in genetic studies to identify the population structure of a given cohort. Coordinates of each individual in a reduced space output by PCA are thought to represent their genetic ancestry in relation to other individuals in the dataset. This information is useful in various settings, e.g., for defining study cohorts or constructing additional covariate features in GWAS.
• SF-PCA: a secure and federated (SF) protocol for a group of users to perform a PCA jointly on their private datasets to obtain a desired number of top principal components (PCs) (10). This corresponds to one of the steps in the GWAS workflows above, here provided as a standalone tool. Each user provides a matrix with the same number of columns (features) as the local input dataset.
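For orientation, here is a small non-secure reference computation of what a joint PCA produces: a top principal axis estimated from the pooled, column-centered data, and per-user projections of each user's own rows onto that axis. Power iteration is used purely as a simple stand-in and is not claimed to be the algorithm used by SF-PCA; the function names and the toy matrices are assumptions.

```python
from math import sqrt


def center_columns(matrix):
    """Subtract each column's mean so that every feature has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def top_component(matrix, iterations=200):
    """Top principal axis of a centered matrix X by power iteration on X^T X.

    Assumes the uniform start vector is not orthogonal to the top axis and
    that the matrix is not all zeros.
    """
    n_features = len(matrix[0])
    v = [1.0 / sqrt(n_features)] * n_features
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in matrix]
        w = [sum(s * row[j] for s, row in zip(scores, matrix))
             for j in range(n_features)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, axis):
    """Coordinates of each row (individual) along the given axis."""
    return [sum(x * w for x, w in zip(row, axis)) for row in rows]


# Two users, each with two individuals and the same three features.
user_a = [[0, 0, 1], [0, 1, 1]]
user_b = [[2, 2, 1], [2, 1, 1]]
centered = center_columns(user_a + user_b)
axis = top_component(centered)
proj_a = project(centered[:2], axis)  # what user A would receive
proj_b = project(centered[2:], axis)  # what user B would receive
```

The point of the secure version is that the same axis and projections are obtained without ever forming the pooled matrix `user_a + user_b` in the clear, and each user learns only the projections of their own individuals.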

Usage process
sfkit securely executes collaborative workflows in four main steps outlined below. Note that in the auto-configured mode, users only need to create the study and grant sfkit limited access to their GCP project; all subsequent steps are orchestrated and automatically executed by sfkit. In the user-configured mode, users run all the steps after study creation and configuration in their own (private) environment using the sfkit CLI. We illustrate the workflow in Figure 1 and showcase sfkit's user interface in Supplementary
(1) Study creation and configuration. Users navigate to the 'Studies' page of the website to create a study (or join an existing study). When creating a new study, they select or enter configuration options, study name, parameters, and other study details. Once the study has been created, other participants can join the study either through a request button on the 'Studies' page (which requires approval by the study creator) or privately via an invitation button. Note that registration and login are optional; users can choose to create or join studies anonymously if they prefer. In this case, a unique permanent link will be provided to them so they can return to the study later. In both utilization modes, sfkit's web server stores only the study's information (e.g. name and description) and analysis parameters.
(2) Computational setup. In the auto-configured mode, users are guided through the setup of a GCP project containing their data, ensuring compatibility with sfkit. Users enter information about their GCP project, allowing sfkit to set up the networking and compute resources necessary to run the analysis. In the user-configured mode, users provide their own networking and compute environments (e.g. the IP address) and interact with the web server via the sfkit CLI.
(3) Study execution. The study is executed in three steps:
• Key exchange. Participants generate cryptographic key pairs, with public keys being exchanged among them via sfkit. It is important to note that the required private keys are generated locally on participants' machines and never revealed to any other entity, including sfkit's web server.
• Data validation. Each participant's data is validated locally on their machine to ensure compliance with the appropriate format for the study and its parameters. No private information is exchanged during this step.
• Protocol execution. The chosen analysis (e.g. GWAS) is performed using the corresponding secure collaborative protocol. Status updates appear on the study page. Users can leave the page and check back periodically for study completion updates.
In auto-configured mode, users click a single button on the website saying 'Begin Workflow', which automatically performs the above steps. In user-configured mode, these steps are performed via the CLI.
W538 Nucleic Acids Research, 2023, Vol. 51, Web Server issue
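The data validation step above can be pictured with a short sketch of a purely local check that a genotype matrix matches the shape agreed in the study parameters. This is only an illustration of the idea: the function name, the expected-column parameter and the accepted dosage encoding are assumptions and do not describe the checks the sfkit CLI actually performs.

```python
def validate_dosage_rows(rows, expected_columns):
    """Return a list of problems found; an empty list means the data passed.

    Runs entirely on the participant's machine: nothing is sent anywhere.
    """
    problems = []
    for i, row in enumerate(rows):
        if len(row) != expected_columns:
            problems.append(
                f"row {i}: expected {expected_columns} columns, found {len(row)}"
            )
            continue
        # Assumed encoding: allele dosages 0, 1, 2, with None for missing.
        bad = [v for v in row if v not in (0, 1, 2, None)]
        if bad:
            problems.append(f"row {i}: unexpected dosage values {bad}")
    return problems


good_rows = [[0, 1, 2], [2, None, 0]]
bad_rows = [[0, 1], [0, 3, 1]]
good_report = validate_dosage_rows(good_rows, expected_columns=3)
bad_report = validate_dosage_rows(bad_rows, expected_columns=3)
```

Reporting all problems at once, rather than stopping at the first, lets a participant fix their input before the other parties begin the joint protocol.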

Command-line interface (CLI) implementation
The CLI, developed in Python, serves as a secure bridge between the user's compute environment and the web server, facilitating core protocol execution. It uses libraries such as PyNaCl (https://pypi.org/project/PyNaCl/) for encryption and requests (https://pypi.org/project/requests/) for server communication. Available on PyPI, the CLI features a modular design for easy integration of new protocols. In user-configured mode, users follow a guided process using token-based authentication for secure connection. The CLI offers a suite of commands for environment configuration, input validation, and protocol initiation and execution. These commands interact with the sfkit web server (hosted in GCP) for coordination purposes, but do not send any private data to the server. In auto-configured mode, the coordinating VM automates CLI commands, streamlining the process while maintaining security and reliability.

Software documentation and tutorials
Detailed documentation and instructions for using sfkit are publicly available on two online resources: the sfkit website at https://sfkit.org/instructions and the command-line interface documentation at https://sfkit.readthedocs.io/. These resources provide information on how to perform an analysis using sfkit, whether using the auto-configured or user-configured mode. They also offer guidance on how to prepare the input data and describe every step of the workflow.

A case study: consortium-based collaborative analysis
We applied sfkit workflows to analyze a collection of genomic datasets from the eMERGE consortium, which included a total of 31,292 individuals split across seven data collection sites (Supplementary Note 1). To demonstrate a collaborative study, we simulated a user for each site with access to the corresponding data subset. In the following, we describe the results from the perspective of these users. The users utilized sfkit's auto-configured mode, in which the server automatically created a virtual machine with 16 CPUs and 128 GB RAM on GCP for each user. For the GWAS analysis, the users adopted a common set of 38,040,168 imputed biallelic SNPs and 9 covariate features to include in the analysis, and chose body-mass index (BMI) as the target phenotype. More details about the dataset and the analysis parameters can be found in Supplementary Note 1.
After the automatic execution of the SF-GWAS workflow using sfkit, each participating user obtained association statistics that are nearly identical to the equivalent study executed on the pooled dataset (Figure 2A). The top two loci with the strongest association signals identified by sfkit were co-located with SLC25A48 and FTO genes, recapitulating previously reported genetic factors of obesity (11,12). In contrast, a single user analyzing their own dataset (comprising 1827 samples) obtained far fewer significant associations (one with P < 5 × 10⁻⁸, compared to 73 for sfkit), which illustrates the increased power of collaborative analysis enabled by sfkit (Figure 2A). Meta-analysis led to substantially different results compared to the pooled analysis in our setting (Supplementary Figure S1).
Similarly for PCA, the users obtained joint analysis results using sfkit that were highly consistent with a centralized study (Figure 2B). The top two PCs revealed the structure of differing ancestry backgrounds among individuals in the cohort. Note that each user obtains the projections of only the individuals in their local dataset onto the top PCs (Figure 2B). sfkit allows the use of PCs that are jointly constructed across the users, which is otherwise not possible if the datasets cannot be pooled. These projections can in turn be used as covariates in GWAS to correct for population stratification.
sfkit greatly simplifies the setup of the joint analysis among seven parties down to a small number of interactions on the website and deploys the computational protocol in less than five minutes. The entire GWAS computation is executed in 17 hours and PCA in 3 hours (Supplementary Table S1).

Reproducible tutorial demonstration on a public dataset
We additionally demonstrate all three workflows of sfkit on a public dataset (1000 Genomes Project; Supplementary Note 1), which can be reproduced following our step-by-step tutorial on the web server. For GWAS, we simulated both covariate features and phenotypes based on a small set of causal variants. All three workflows produced outputs that closely agree with those of non-secure centralized studies on the pooled dataset, within two hours of runtime for each workflow (Figure 3; Supplementary Table S1).

Alternative utilization modes and environments
We extensively tested sfkit's workflows using both utilization modes (auto-configured and user-configured) and in different computational environments (using machines that are co-located in GCP versus hosted by different cloud providers). For example, we reproduced the experiments from the previous section, but this time between a user using GCP and another user using Azure to host their machines, both utilizing the user-configured mode. These variations in settings did not impact the accuracy of sfkit's workflows and produced identical results. The additional delay introduced by more distant/cross-platform connections remained manageable; e.g., runtime to perform SF-GWAS on the 1000 Genomes Project dataset increased from 110 to 143 min. These experiments are summarized in Supplementary Table S1.

Runtime and monetary cost
We evaluated the cost of sfkit on a range of dataset sizes obtained by replicating the Lung Cancer dataset (Supplementary Note 1). We split each dataset evenly between two users and ran the SF-GWAS workflow. The cost of a study increased linearly with the study's runtime (Figure 4). sfkit allows users to choose from a range of virtual machine (VM) types to strike a balance between runtime and monetary cost. For instance, by opting for a more powerful machine (with 32 CPUs instead of 16), the runtime on a dataset with 150k samples per user could be reduced by half, albeit at an increased cost of $70 instead of $62 per user (in USD). By extrapolating these results, we estimate a cost of $200 per user even on a much larger dataset including, e.g. 200k samples and 90 million SNPs. Further cost reductions may be possible with more optimized usage of cloud computing services.
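The trade-off in the 150k-sample example can be made explicit with a little arithmetic. The sketch below assumes the simplest reading of the linear relationship, namely that cost equals an hourly rate times runtime with no fixed overhead; that assumption, and the function name, are ours and are not stated in the evaluation.

```python
def implied_rate_ratio(cost_fast, cost_slow, speedup):
    """Hourly-rate ratio (fast VM / slow VM) implied by cost = rate * runtime.

    If the fast VM finishes `speedup` times sooner, then
    rate_fast / rate_slow = speedup * cost_fast / cost_slow.
    """
    return speedup * cost_fast / cost_slow


# Figures quoted in the text: $70 (32 CPUs) versus $62 (16 CPUs), half the runtime.
rate_ratio = implied_rate_ratio(cost_fast=70.0, cost_slow=62.0, speedup=2.0)
relative_cost_increase = (70.0 - 62.0) / 62.0
```

Under this assumption the larger machine is billed at roughly 2.26 times the hourly rate of the smaller one, so halving the runtime raises the per-user bill by about 13%.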

Related work
To the best of our knowledge, sfkit is the only existing web server that automates the execution of a collection of cryptographic algorithms developed for collaborative genomic analysis workflows with a provably high level of security.
Several strategies and software tools have been developed for collaborative GWAS (13–16). The PLINK software (https://www.cog-genomics.org/plink/) implements meta-analysis methods, allowing multiple parties to statistically combine their local GWAS results without sharing individual-level data. Nasirigerdeh et al. (14) proposed sPLINK, a federated implementation of GWAS with secure aggregation of local statistics, which is also available as a web-based tool. Neither of these approaches supports collaborative PCA, an essential step in GWAS. They also require the users to install the required software and set up their own machines, processes that are automated in the auto-configured mode of sfkit. Moreover, these existing approaches still reveal some aggregated intermediate results between the participants. It has been shown that the shared intermediate results in federated analysis pipelines can reveal some information about the private input datasets (17,18). With sfkit, no data is revealed except for the final analysis results. We further note that meta-analysis can be less accurate than a centralized study, especially given heterogeneous data distributions (14,19) (Supplementary Figure S1).
Several general-purpose, open-source software packages have been developed for federated model training and data analysis, including FedML (https://www.fedml.ai/), FATE (https://fate.fedai.org/), PySyft (https://blog.openmined.org/tag/pysyft/), OpenFL (https://github.com/intel/openfl) and TensorFlow Federated (https://www.tensorflow.org/federated). While some of these solutions provide a similar level of privacy protection as sfkit (e.g. PySyft), none of them are designed for genomic analyses. Adapting these existing tools to efficiently perform the sophisticated workflows addressed by sfkit would require substantial effort.
Figure 4. Scaling of cost and runtime for collaborative analyses using sfkit. When using sfkit in the auto-configured mode, virtual machines (VMs) are automatically provisioned for users in the Google Cloud Platform (GCP) based on their specified parameters. We illustrate the impact of dataset size and VM type on both runtime and monetary cost of sfkit based on the SF-GWAS workflow. We replicated the Lung Cancer dataset (9178 samples with 613k variants) to obtain datasets of varying sizes, then split each dataset between two users. Runtime and cost remain practical for large datasets, and using a higher-class VM can further reduce the runtime. vCPU: virtual CPU.

CONCLUSION
sfkit is a user-friendly web server designed to help a group of researchers securely perform collaborative genomic analyses, including association tests and population stratification analysis, jointly across datasets that cannot be pooled together. The modular design of sfkit facilitates seamless integration of additional analysis workflows, such as training of disease risk prediction models. A key direction for future work is to demonstrate analysis across large-scale biobanks, e.g., the All of Us Research Program and the UK Biobank, leveraging sfkit's capabilities. sfkit represents a step toward broadening access to state-of-the-art cryptographic tools for collaborative biomedical research.

DATA AVAILABILITY
The eMERGE and Lung Cancer datasets are available through the National Institutes of Health's database of Genotypes and Phenotypes (dbGaP) under accession numbers phs000888.v1.p1 and phs000716.v1.p1. The example dataset constructed based on the 1000 Genomes Project dataset is available for download on the tutorial page of our website.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.