Inferring putative transmission clusters with Phydelity

Abstract Current phylogenetic clustering approaches for identifying pathogen transmission clusters are limited by their dependency on arbitrarily defined genetic distance thresholds for within-cluster divergence. Incomplete knowledge of a pathogen’s underlying dynamics often reduces the choice of distance threshold to an exploratory, ad hoc exercise that is difficult to standardise across studies. Phydelity is a new tool for the identification of transmission clusters in pathogen phylogenies. It identifies groups of sequences that are more closely related than the ensemble distribution of the phylogeny under a statistically principled and phylogeny-informed framework, without the introduction of arbitrary distance thresholds. Relative to other distance threshold- and model-based methods, Phydelity outputs clusters with higher purity and lower probability of misclassification in simulated phylogenies. Applying Phydelity to empirical datasets of hepatitis B and C virus infections showed that Phydelity identified clusters with better correspondence to individuals that are more likely to be linked by transmission events relative to other widely used non-parametric phylogenetic clustering methods without the need for parameter calibration. Phydelity is generalisable to any pathogen and can be used to identify putative direct transmission events. Phydelity is freely available at https://github.com/alvinxhan/Phydelity.

the distribution before the estimator yields unreliable results. However, it is computed from the pairwise differences of all values in the given distribution and does not require a location estimate (i.e. the median, which is used in calculating MAD). It has been proven to be statistically more efficient when applied to both Gaussian and non-Gaussian distributions relative to the MAD (Rousseeuw and Croux 1993).
To facilitate identification of both paraphyletic and monophyletic clusters, Phydelity dissociates distal subtrees and sequences from any ancestral node i where $ > . Starting from the most distant sequence to , Phydelity first applies a leave-one-out strategy dissociating sequences, and the whole descendant subtree if every sequence subtended by it was dissociated, until the recalculated $ without the distal sequences falls below . $ is recalculated for any node with changes to its sequence membership $ after dissociating these distantly related sequences.
Phydelity filters outlying tips from putative clusters under the assumptions that viruses infecting individuals in a rapid transmission chain likely coalesce to the same most recent common ancestor (MRCA). Additionally, Phydelity requires any clonal ancestors in between the MRCA and tips of a putative cluster to be as genetically similar to each other as it is between the tips of the cluster. As such, for a putative transmission cluster, the mean pairwise nodal distance between all internal and tip nodes of a cluster must also fall below MPL.
Phydelity then formulates the phylogenetic clustering problem as an integer linear programming (ILP) model. Let < , T , … , $ , … , U be the set of binary variables indicating if putative cluster satisfies the conditions for clustering ( $ = 1 if it clusters; $ = 0 if otherwise). Each sequence subtended by putative cluster is also assigned a binary variable *,$ indicating if the sequence is clustered under ( *,$ = 1 if is clustered under node ; *,$ = 0 if otherwise).
The ILP model is optimised to cluster all sequences that satisfy the distance bounds defining a putative transmission cluster to as little number of clusters as possible. This is achieved by solving for the following equal-weighted multi-objective: subject to the following constraints: Constraint (3) stipulates that sequence can be clustered under node if and only if is a potential cluster ( $ = 1).
If sequence is subtended by putative clusters and , wherein descended from and both nodes are potential clusters ( $ = ' = 1), constraints (3) and (4) stipulate sequence will not be clustered under the descendant node . Implementing these constraints across all pairwise combinations of subtrees subtending sequence in turn constrains to be clustered under the most ancestral node possible.
Constraint (5) stipulates that each sequence can only be clustered under a single cluster, hence abrogating any fuzzy clustering.
where is any arbitrarily large positive constant. Constraint (6) requires all clusters to subtend at least a pair of sequences.
Constraint (7) ensures that $ of all clusters fall below the stipulated limit.

Algorithmic differences between Phydelity and PhyCLIP
As mentioned in the main text, both Phydelity and PhyCLIP (Han et al. 2019) are statisticallyprincipled phylogenetic clustering algorithms based on integer linear programming optimisation.
However, there are substantial algorithmic differences between the two such that the clusters recovered have distinctly different interpretations (Supplementary Figure 1).
PhyCLIP was developed to identify statistically-supported subpopulations in pathogen phylogenies that putatively capture variant ecological, evolutionary or epidemiological processes that could underlie sub-species nomenclature development. PhyCLIP, like Phydelity, operates on the pairwise patristic distance distribution of an input phylogeny. However, PhyCLIP informs its within-cluster divergence limit using the global distribution of patristic distances, rather than the local distribution of k-neighbouring tips as implemented in Phydelity (Supplementary Figure 2). PhyCLIP's use of the pairwise patristic distance distribution of the entire tree allows it to test against the background genetic diversity of the global population included in the phylogeny, indicating whether the putative clustered sequences are sufficiently more related to one another than to the rest of the dataset to be designated a distinct cluster.
PhyCLIP calculates the within-cluster divergence limit ( ) as: where ̅ is the grand median of the mean pairwise patristic distance distribution { < , T , … , $ , … , U } and is any robust estimator of scale (e.g. median absolute deviation ( ) or , see Han et al., 2019) that quantifies the statistical dispersion of the mean pairwise patristic distance distribution for the ensemble of subtrees.
Both PhyCLIP and Phydelity include a robust estimator of scale when calculating the respective within-cluster and maximal patristic distance limit to negate assumptions of symmetry in the pairwise distribution of patristic distances. However, PhyCLIP multiplies the robust estimator of scale with an input parameter γ, which allows the user to calibrate the resolution of clustering to their research question. PhyCLIP also parameterises the inter-cluster divergence limit of clusters in the model using either the Kolmogorov Smirnov test or Kuiper's test. These tests test the null hypothesis that the pairwise sequence distance distributions of every combinatorial pair of putative clusters is empirically equivalent to that if the two subtrees were clustered together.
Statistical significance is inferred if the multiple-testing corrected p-value for the cluster pair is below a user-input false discovery rate.
PhyCLIP therefore has three user-defined parameters: 1. A minimum cluster size. 2. The γ multiple of the robust estimator of scale and 3. the false discovery rate for the Kolmogorov Smirnov/Kuiper test. Accordingly, PhyCLIP's parameters need to be calibrated across a range of input values, optimising the global statistical properties of the clustering results to select an optimal parameter set. The optimisation criteria (see Han et al., 2019) are prioritised by the research question, as the clustering resolution and cluster definition are dependent on the question, and therefore the degree of information required to capture ecological, epidemiological and/or evolutionary processes of interest. By comparison, Phydelity only has a single userdefined parameter (k) to define ' which is optional and can be automatically scaled.
Unlike for Phydelity, PhyCLIP's designated clusters should not be interpreted as sequences linked by rapid transmission events. The use of the global patristic distribution in PhyCLIP sets the within-cluster limit to a resolution too low to only capture homogenous subpopulations of highly similar viruses descendant from the same source in rapid transmission chains. Phydelity's local maximal patristic distance limit was formulated to allow for the identification of these highly similar sequences putatively linked by transmission events.
Phydelity also does not include PhyCLIP's formulation of an inter-cluster divergence test.
PhyCLIP's approach is underpowered to detect statistically distinct distributions in phylogenetic trees that do not include a high number of so-called "false-positives" -sequences that are markedly from a different or outgroup cluster. PhyCLIP's approach works well when representative or relatively complete phylogenies are clustered, as there are many outgroup clusters to test against. However, it is unlikely that many non-outbreak sequences will be included in the background diversity of transmission dynamic studies for which Phydelity was formulated. Resultantly, Phydelity relies on distal-dissociation and the local maximal patristic distance limit to set rigorous boundaries on clusters to ensure robust inter-cluster divergence.

Supplementary Figures
Supplementary Figure 1. Comparison of output clusters from Phydelity and PhyCLIP when applied to the Hepatitis C virus genotype 1a NS5B phylogeny. Each distinct colour represents a unique cluster. Input parameters for PhyCLIP are minimum cluster size = 2, multiple of the robust estimator of scale = 1, and false discovery rate for the Kuiper's test = 0.05. Comparison between how maximal patristic distance limit (MPL) and within-cluster limit (WCL) are determined. MPL is computed using the median ( ) and robust estimator of scale ( ) based on the k-th core distance distribution ( ' ) of every sequence xj and its k-closest neighbours ( ( * , * + ); k=2 in this case).
In comparison, PhyCLIP uses the distribution of mean pairwise patristic ( $ ) distance for all putative clusters to calculate WCL for a user-provided multiple ( ) of .  Figure 4. Plots of mean Silhouette index (SI) computed for the range of genetic distance thresholds implemented in ClusterPicker (measured as nucleotides/site) and PhyloPart (measured as percentile of pairwise patristic distribution of input phylogeny). Clustering results from the lowest optimal distance threshold ( klm$no9 ) with the highest SI value for each method were compared to Phydelity (see Figure 4 in main text). (a) HCV genotype 1a ( klm$no9 : 0.075 nucleotide/site (ClusterPicker); 2.7% (PhyloPart)). (b) HCV genotype 4d ( klm$no9 : 0.047 nucleotide/site (ClusterPicker); 0.8% (PhyloPart)).  (SD)) comparing results generated from Phydelity, WPGMA and MSBD-method. 300 simulations were performed for each distinct weight (w) of inter-community transmission rate. Metrics compared included coverage (i.e. percentage of sequences clustered), adjusted rand index (ARI), purity, modified Gini index ( i ) and normalised mutual information (NMI).