Transmission trees on a known pathogen phylogeny: enumeration and sampling

One approach to the reconstruction of infectious disease transmission trees from pathogen genomic data has been to use a phylogenetic tree, reconstructed from pathogen sequences, and annotate its internal nodes to provide a reconstruction of which host each lineage was in at each point in time. If only one pathogen lineage can be transmitted to a new host (i.e. the transmission bottleneck is complete), this corresponds to partitioning the nodes of the phylogeny into connected regions, each of which represents evolution in an individual host. These partitions define the possible transmission trees that are consistent with a given phylogenetic tree. However, the mathematical properties of the transmission trees given a phylogeny remain largely unexplored. Here, we describe a procedure to calculate the number of possible transmission trees for a given phylogeny, and we show how to uniformly sample from these transmission trees. The procedure is outlined for situations where one sample is available from each host and trees do not have branch lengths, and we also provide extensions for incomplete sampling, multiple sampling, and the application to time trees in a situation where limits on the period during which each host could have been infected are known. The sampling algorithm is available as an R package (STraTUS).


Introduction
The use of genetic data to reconstruction pathogen transmission trees has been the subject of considerable interest in recent years. Many different approaches have been proposed, both phylogenetic and non-phylogenetic [1,5,10]. The phylogenetic approaches can broadly be divided into two categories: those that assume that internal nodes in the phylogenetic tree correspond to transmission events [7][8][9], and those that do not [2][3][4]6]. In the former case, the phylogeny and the transmission tree are effectively the same object.
The assumption of coinciding lineage coalescences and transmission events may be unwise, and in particular it does not take into account within-host pathogen diversity [11]. Several approaches have been taken that do not make it, one of which to note that if a phylogeny from a completely sampled outbreak has its nodes annotated with the hosts in which each lineage was present, the transmission tree is known [2][3][4]. In particular, Hall, Woolhouse, and Rambaut [4] noted that the set of transmission trees for a known phylogeny, with complete sampling and assuming transmission is a complete bottleneck, is equivalent to the set of partitions of its nodes such that each partition element contains at least one tip and the subgraph induced by the nodes in each partition element is connected. However, for the most part the mathematical properties of this space of partitions remain unexplored.
Here, I provide procedures for counting the total number of partitions (and hence the total number of transmission trees) for a known phylogeny. I also show how an algorithm can be written to sample uniformly from the set of partitions. Initially I assume that the phylogeny is binary, sampling is complete, that each host provided one sample, and that nothing is known about the timings of each infection, but I go on to individually relax each of these assumptions. The procedures outlined here may be useful to researchers wishing to explore the structure that the phylogeny imposes on transmission tree space. An R implementation of the algorithms described here is available at http://github. com/twoseventwo/TTsampler.

.1 Preliminaries
Let the phylogeny T be an unlabelled rooted binary tree, initially without branch lengths. Let T * represent the unrooted tree obtained from T by attaching a single extra tip to the root of T by a single edge. (Note that two distinct T s can have the same T * .) T * , importantly, has one more tip than T . I follow the correspondence described by Hall, Woolhouse, and Rambaut [4] between transmission trees and partitions of the node set of T such that all tips derived from the same host are members of the same partition element and the subgraph induced by each partition element is connected. This assumes that sampling is complete, which I later relax, and that transmission is a complete bottleneck, which is a more fundamental assumption. Furthermore, I assume for now that only one tip is derived from each host. See figure 1 for an example.
In what follows, "subtree" has its normal phylogenetic meaning; a subgraph of tree T consisting of a node of T , all its descendants (if any), and the edges between them. If u is a node then we will denote the subtree rooted at u by T u ; this is defined even if u is a tip.

Counting transmission trees
With T fixed and having n tips, suppose we wish to count the number of partitions, as defined above, of its node set N (T ). If the set of such partitions is P(T ) (this is a set of sets of sets), we wish to calculate |P(T )|. Nothing about the definition of a partition requires a rooted tree, so P(T * ) is defined similarly. It is trivial that if n = 1 then |P(T )| = |P(T * )| = 1.
If T u is a subtree, we can define P(T u ) in the obvious way by regarding it as a tree in its own right. We need to define another set of partitions of N (T u ), however, which is the set of its intersections with all the elements of each member of P(T ). This is different because it allows an internal node of T u to share its partition element with no tip of T u , as that element was constructed by taking the intersection of N (T u ) with an element of a partition of N (T ) that contains a tip of T which is not a tip of T u .
Suppose A is a partition of N (T ) and there exists A ∈ A such that A∩N (T u ) is nonempty and contains no tip of T u . Then: were not then the A would not obey the connectedness requirement for being an element of a partition of N (T ). If v ∈ A ∩ N (T u ) and t is the tip of T in A then the path from v to t must intersect u.
2. A∩N (T u ) is the only member of the set {B ∩N (T u ) : B ∈ A} that contains no tips of T u , because u can belong to only one member of it.
Let Q(T ) be the set of partitions of T which allow (but do not insist on) an extra partition element containing T 's root. Figure 2 shows an example of the extra elements of Q(T ) which are not already elements of P(T ) (and hence already displayed in figure 1). To look at Q(T u ) (u = r) in another way, suppose ∼ is an equivalence relation on the elements of is the set of pairwise intersections of N (T ) with elements of A. Q(T u ) can be seen as the set of equivalence classes. An important point to note is that even if the tip of T which shares a partition element with u is different in A and B, A ∼ B is still possible. The partition of the nodes in N (T ) \ N (T u ) does not matter. |Q(T u )| is the number of ways of partitioning the nodes of T u allowing for the possibility that that an unspecified extra partition element could "creep" down onto it from above.
The exact correspondence of Q(T ) with P(T * ) is obvious, as T * is obtained from T by attaching a single tip to T 's root. Compare figure 3 with the full set of partitions displayed in figures 1 and 2 as an example.
If n is at least 2, then T has a left subtree T rL rooted at the left child rL of its root node r and a right subtree T rR rooted at the right child rR. Proof. Since T is not the tree with one node, its root r is internal. First we count the number of partitions where r is in the same partition element as a tip of T rL . If this is true then rL is in that same element by the connectedness requirement for elements: if it were not then the path from that tip to r would go through a node in a different element. The connectedness requirement then also insists that no node of T rL is in the same partition element as a tip of T rR , and the number of ways of partitioning the nodes of T rL as a subtree of T such that this is true is just |P(T rL )|. For each of those ways, the number of ways of partitioning the nodes of T rR is |Q(T rR )|, since some nodes of T rR can be in the same element as r (and hence rL) and if any are then rR is. Since |Q(T rR )| = |P(T * rR )| the number we are looking for is hence |P(T rL )|×|P(T * rR )|. An identical argument shows that the number of partitions where r is in the same element as a tip of T rR is |P(T rR )| × |P(T * rL )|, so the total number is (|P(T rL )| × |P(T * rR )|) + (|P(T rR )| × |P(T * rL )|).  Proof. The root r of T has become an internal node of T * connected to a new tip, t. In some partitions of N (T * ), t is the only member of its element and there are obviously |P(T )| of these, because counting them is the same problem as counting partitions of the tree rooted at r with t excised.
If t is not the solo member of its element, r is in the same element by the connectedness requirement. The number of ways of partitioning T rL and T rR as subtrees of T if this is true are |Q(T rL )| (= |P(T * rL )|) and |Q(T rR )| (= |P(T * rR )|) respectively. The total number of such partitions is the product of these.
Since P(T ) and P(T * ) are trivially known when T has one tip, |P(T )| can now be calculated for any T by doing a post-order tree traversal. Specifically, at each node u, if u is a tip then we record |P(T u )| = |P(T

Extension to the multifurcating case
The modification is fairly trivial If T is not binary. If the root r has p children then and T rk is the subtree rooted at its kth child, then: and the traversal works as before.

Counting root elements
We now want to determine, of the |P(T )| partitions of T 's node set, what number have r in the same partition element as a tip t. Let {t 1 , . . . , t n } be the tips of T , ordered as they would appear in a postorder traversal, in particular such that the tips from any subtree make up a consecutive run of indexes. We wish to calculate the elements of n-tuple v(T ) = (v 1 (T ), . . . , v n (T )) where v i (T ) is the number of partitions of N (T ) with r in the same element as t i ; then 1≤i≤n v i (T ) = |P(T )|. If T has one tip t 1 , then obviously v 1 (T ) = 1 and v(T ) is the 1-tuple (1). For any other T , define v(T rL ) and v(T rR ) as the same counts when T rL and T rR are considered trees in their own right; these tuples thus have only the same number of elements as T rL and T rR , respectively, have tips. Suppose the tips of T rL occur first in the ordering of the tips of T , and there are z of the former (and hence n − z tips of T rR ).
(the sum of the red number at u and the product of the blue numbers at its children). Proposition 2.3. Suppose T has at least two tips. Then: Proof. First suppose r is in the same element as a tip t i of T rL . (Because the tips of rL occur first in the ordering, the index of t i as a tip of T is the same as that of it as a tip of T rL .) This forces rL to be in the same partition element as r by the connectedness requirement, and the number of members of T rL that have this property is v i (T rL ). Such a choice of partition of T rL leaves |Q(T rR )| = |P(T * rR )| ways of partitioning the nodes of T rR . If r is instead the same element as a tip of T rR , then the argument is the same except that the ith tip of T is the (i − z)th tip of T rR .
All entries of v(T ) can be calculated by a similar post-order traversal to that described in the previous section. At each internal node u, v(T u ) can be formed as described above, |P(T u )| obtained by summing its elements, and |P(T * u )| calculated from |P(T u )| and the partition counts for u's child subtrees. See figure 5 for an example.

Sampling uniformly from P(T )
If the post-order traversal above is complete, sampling a random partition requires a single pre-order traversal. The vector v(T ) consists of probability weights for a draw of partition element for r, as it determines how many of the |P(T )| total partitions have r in each element. Subsequently, when the traversal reaches another node u with parent uP , and we have already placed uP in a partition element containing a tip t, then u must also be in that element if t is one of its descendants, by connectedness, or if t is u itself. Otherwise, there are |P(T * u )| ways in which T u can be partitioned given that uP has already been allocated an element. |P(T * u )| − |P(T u )| of these have u in the same element as uP , while the remaining |P(T u )| do not. The entries of v(T u ) give the numbers of ways in which u can be placed in the same element as each of its descendant tips. The partition element for u can then be sampled with probability determined by a weight vector that has the entries of v(T u ) for the elements containing the tips descended from u, |P(T * u )|−|P(T u )| for the already determined parent partition element, and 0 for any other element.

Incomplete sampling
Now suppose that not every host in the transmission tree was sampled, but instead that that there were m unsampled individuals, all of which are ancestral to at least one sampled individual. It is not sufficient merely add m extra partition elements that contain no tips. This is because, to provide a simple example, if an unsampled host b was infected by a sampled host a and directly infects only one other host c which was also sampled, the region of the phylogeny  corresponding to the infection of b exists only along a branch (the branch whose parent node is in the partition element containing a's tip and whose child node is in the one containing c's tip) and no internal nodes are associated with b at all. Nonetheless, the procedure for counting, and sampling from, the set of tree partitions with m extra elements turns out to lead to the more general answer as a byproduct of the calculations. Now suppose P m (T ) is this set. (P(T ) as described above is P 0 (T )). Let PS m (T ) be the subset of P m (T ) where the root of T shares its partition element with a tip, and PU m (T ) the subset where it does not. P m (T * ) can be defined, although as T * has no root PS m (T * ) and PU m (T * ) cannot be. Q m (T ) can also be defined and again is exactly analogous to P m (T * ).
Call the partition elements containing tips the sampled elements, and those not containing tips the unsampled elements.
If T has a single tip then P m (T ) = 0 and P m (T * ) = 0 for all m > 0. No internal nodes exist to be assigned to unsampled elements.

Counting transmission trees
Proposition 3.1. If T has at least two tips, then Proof. Since r is not part of an unsampled element, the m such elements must be split between the subtree descended from its left child and that descended from its right child. The summation expresses the number of ways to make this split. Apart from this adjustment the argument is the same as in proposition 2.1, as if m = 0 then r is always in a sampled element.
Proposition 3.2. If T has at least two tips, then Proof. Since r is in an unsampled element, one of the m of those is accounted for. The remaining m − 1 are split amongst the left and right subtrees as above.
If we consider T rL in isolation, we want to count the number of ways of partitioning its nodes with i unsampled elements for certain and possibly one extra which, if it exists, must contain rL. (This possible element is the intersection of N (T rL ) with an element A of a member of P(T ) such that r ∈ A.) This number is clearly |Q i (T rL )| = |P i (T * rL )|. Since exactly the same applies to T rR , the product of |P i (T * rL )| and |P m−1−i (T * rL )| is the desired number for a known i.

Proposition 3.3. If T has at least two tips, then
Proof. Identical to proposition 2.2 except that we again allow for all the ways that the m unsampled elements can be distributed.

Counting root sampled elements
Once again, let {t 1 , . . . , t n } be the tips of T , ordered as they would appear in a post-order traversal and suppose that the first z tips are descended from T rL . Define V(T ) be the n × (m + 1) matrix whose ijth entry v ij (T ) is the number of partitions of T such that r is in the same element as t i if there are j − 1 unsampled elements.
Proof. Analagous to proposition 2.3 after counting the ways the i unsampled elements can be split between the two child subtrees.

Sampling uniformly from P m (T )
The previous sections allows us, for m ∈ N and a binary T , to calculate PU i (T ), PS i (T ), P * i (T ) and V(T ) for all i with 0 ≤ i ≤ m by a post-order traversal. The pre-order sampling procedure works by, first, at r, choosing an element using a vector of probability weights consisting of the (m + 1)th row of V(T ) for the sampled elements and PU m (T ) for an unsampled element. Once this is done, we must randomly choose how the remaining unsampled elements are divided between T rL and T rR .
If we chose the element containing a tip t i for r and t i is descended from rL, then the number of partitions which have j unsampled elements amongst the nodes of T rL (and hence m − j amongst the nodes of T rR ) is v ij (T rL ) × |P m−j (T * rR )|. If we chose the element containing a tip t i for r and t i is descended from rR, then the number of partitions which have j unsampled elements amongst the nodes of T rL is v i(m−j) (T rR ) × |P j (T * rL )|.
If we chose an unsampled element for r, then the number of partitions which have j other unsampled elements amongst the nodes of T rL (and hence m−1−j amongst the nodes of T rR ) is |P j (T * rL )| × |P m−1−j (T * rR )|. In all cases we have a set of weights which we can use to randomly select j. When the traversal arrives at a new node u whose parent uP has been assigned an element and we know that there are k previously unencountered unsampled elements in the partition of the nodes of T u , then we are forced to put u in the same element as uP if that element is sampled and contains a tip descended from u. Otherwise, the (k + 1)th row of V(T u ) gives the weights for being in a sampled element along with a tip of T u , |PU k (T u )| the weight for a transition to a new unsampled element (whether or not the element containing uP is unsampled) and |P k (T * u )|−|P k (T u )| the weight for continuing in the same element as uP whether that element was sampled or unsampled. We choose an element for u in this way and then, if u is not a tip, split the k −1 (if we assigned u to a new unsampled element) or k (if we did not) remaining unencountered unsampled elements between u's left and right descendant subtrees as above, except that there is an extra case where we chose a sampled element for u but the tip in that element is descended from neither of u's children. In that situation, the weight for j of k remaining unsampled elements going to u's left subtree

Sampling uniformly from the set of transmission trees with m unsampled hosts
To complete the picture, we now must consider the case where only l of the m unsampled hosts actually correspond to partition elements, and the remaining m − l appear only along branches. If we have sampled an element of P l (T ) as above, we need to distribute the additional m − l hosts, and there are l + n branches on which these can be put, which are the branches ending in the earliest appearances of in the tree of each of the l + n partition elements. The number of ways of doing this is the number of ways of assigning m − l identical objects to l + n possibly empty groups, i.e. m+n−1 l+n−1 . If we calculate |P l (T )| for 0 ≤ l ≤ m, then we know that there are |P l (T )| × m+n−1 l+n−1 transmission trees with m unsampled hosts where l of those hosts have partition elements. We can randomly select an l using those counts as probability weights, randomly generate an element of P l (T ) as above, and finally randomly assign the extra m − l elements to branches.

Multiple sampling
Removing unsampled hosts from consideration, I now relax the assumption that each partition element contains only a single tip. Fix a partition P of the tip set E(T ) of T ; we now investigate the set P(T ; P) which is the set of partitions A of N (T ) such that {A ∩ E(T ) : A ∈ A} = P (i.e. the partitions of N (T ) which agree with P on the tips). Each element of P contains all the tips sampled from a single host. P(T ) as in section 2 is P(T ; I) where I is the partition of singletons of E(T ).
For a set A ∈ P define the bridge b(A) of A to be the minimal subset of N (T ) such that A ⊆ b(A) and the subgraph of T induced by b(A) is connected. This contains all elements of A, the MRCA of A, and all nodes on the paths between them. Obviously if |A| = 1 then b(A) = A.
If any two elements of P have bridges whose intersections are nonempty, then |P(T ; P)| = 0; there are simply no possible transmission trees because the connectedness requirement would insist that some nodes be part of more than one partition element. So assume that this is not true. For any partition, being a bridge node forces a node to be a member of the same element as those tips whose bridge it belongs to.
Note there are two intuitive ways to consider P(T u ; P) for a subtree T u of T rooted at u. The first would be to use the set P u = {A ∩ E(T u ) : A ∈ P} as a partition of the tips of T u . In this case, bridge nodes determined by P are not necessarily determined as such by P u .
The second way is to retain the restrictions on the partition elements to which a node of T u can belong that are determined by P even when we move to counting partitions of T u . This is the version which is useful for our purposes. For example, if A ∈ P consists of the two tips t 1 and t 2 , but only t 1 is a tip of T u , then in enumerating and sampling partitions we still consider the intersection b(A) ∩ N (T u ) to be bridge nodes, even though {t 1 } is a singleton element of P u . (In fact u must be a member of b(A).) So for any node u of T , we define P(T u ; P) to be the number of ways to partitioning T 's nodes that respects the set of bridge nodes that P requires.
It should be fairly obvious that if an internal node is a bridge node then one or both of its children must be as well. P(T * ; P) has the definition one would expect, with the extra node forming a singleton extra element of the tip partition. For a node u, P(T * u ; P) again respects the bridge nodes imposed by P on T . Now suppose T has at least two tips and let u be any internal node of T , including r. Its children are uL and uR.  Proof. If u is a bridge node then at least one of its children is too. At least one of those children, in fact, must be part of the same bridge as itself; suppose this is uL. Now u and uL must be in the same partition element so the |P(T uL ); P| partitions of uL determine which element u belongs to, and because uL is a bridge node |P(T uL ); P| = |P(T * uL ); P| by proposition 4.1. If uR is not a bridge node then there are |P(T * uR ); P| ways of partitioning the nodes of T uR by a by now very familiar logic. If it is, then there are |P(T uR ); P| partitions of those nodes since the element that uR belongs to is fixed, but |P(T uR ); P| = |P(T * uR ); P| by proposition 4.1. Obviously the same applies with uR and uL reversed. If u is not a bridge node then the argument of proposition 2.1 still applies.
If P has l elements, number them in an arbitrary way. Define v(T ; P) = (v 1 (T ; P), . . . , v l (T ; P)) where v i (T ; P) is the number of partitions of the nodes of T where r shares a partition element with the members of the ith element A i of P. For a subgraph T u , let v i (T u ; P) be the number of partitions of the nodes of T u where r shares a partition element with the intersection E(T u ) ∩ A i ; this may be 0 where that intersection is empty. Once again, v i (T u ; P) counts only partitions that respect the bridge nodes imposed by P.
Proof. First note that if A i ∩ E(T uL ) = ∅ and A i ∩ E(T uR ) = ∅ then u is a bridge node and in particular a member of b(A i ). If u is not a bridge node at all, then we must be in one of the first two cases. The argument in that case differs from proposition 2.3 only in terms of notation. If u ∈ b(A i ) but only one child of u has any descendant tips that are members of A i , then suppose uL is the child that does. Then uR is either not a bridge node or a member of b(A j ) for j = i. The number of ways of partitioning the nodes of T uL with uL sharing a partition element with the members of A i is v i (T uL ) and, because u shares a partition element with uL, each of those once again results in |P(T * uR ; P)| ways of partitioning the nodes of T u . The same goes with uL and uR reversed.
If both children have descendant tips that are members of A i then the number of partitions with u sharing an element with the members of A i is just the product of the number of ways of partitioning its child subtrees in the same way. All three nodes are forced to be part of the same partition element.
Under any other situation u ∈ b(A j ) for j = i and there cannot be any partitions that have it sharing an element with the members of A i .
If t is a tip then |P(T * t ; P)| = 1, |P(T t ; P)| = 1, and v i (T t ; P) = 1 if t ∈ A i and 0 otherwise. This is all that is necessary to set up traversals analogous to those described in section 2.

Trees with timings
Finally, I return once more to the case where sampling is single and complete. I now give T branch lengths, which means a height function h : N (T ) → R + can be defined such that for all nodes u with parent uP , h(u) < h(uP ). Branch lengths and heights are intended to be in units of calendar time, not genetic distance. We extend h to N (T * ) by setting h(t) = ∞ if t is the extra tip; while T * remains formally unrooted, there is only one way to display it that makes sense.
Each tip t i is now associated with a closed interval I i = [α i , β i ] such that, for any partition A of the nodes of T , if {u, t i } ⊆ A i ∈ A then h(u) ∈ I i . (Obviously no partitions exist without t i ∈ I i for all i.) The I i s determine minimum and maximum heights for all the nodes in each partition element. This is useful if infection is expected to end with sampling, or if a maximum time from infection to sampling is known. If I is the complete set of intervals, then let P(T ; I) by the set of partitions subject to these additional restrictions. For a subtree T u , P(T u ; I) is the set of partitions subject to the restrictions where they are appropriate (i.e. the I i where t i is actually a tip of T u ).
Let P(T * ; I ∪ [γ, ∞)) be the set of partitions of T * such that the element containing the extra tip contains only nodes of heights greater than γ, and the restrictions imposed by I still apply to the other elements. I omit a single expression for |P(T ; I)| and instead suggest calculating it by calculating the v i first and adding them up. So let v i (T ; I) be the number of partitions of T where r is in the same element as t i and the restrictions imposed by I apply. Returning to the notation of proposition 2.3, so T rL has z of the n tips and those descended from it come first in the ordering: Proof. If h(r) lies outside I i then the answer is trivially zero. If not, and t i is descended from rL, then there are v i (T rL ; I) ways of partitioning the nodes of T rL such that rL is in the same element as t i . For each of these, we need the the number of ways of partitioning the nodes of T rR such that an extra element can creep down from the root, but that that element cannot contain any nodes whose heights are smaller than the lower limit of I i , i.e. α i . This is |P(T * rR ; I ∪ [α i , ∞))|. As usual, an identical argument applies with rL and rR reversed.
Then |P(T ; I)| = n i=1 v i (T ; I). If T has one tip, then both P(T ; I) and P(T * ; I ∪ [γ, ∞)) (regardless of γ) have one element if the tip lies within its own interval and zero if it does not. As in the previous section, the traversals described in section 2 can be used to count the full set of partitions and sample uniformly from it.