In vivo generation of DNA sequence diversity for cellular barcoding

Heterogeneity is a ubiquitous feature of biological systems. A complete understanding of such systems requires a method for uniquely identifying and tracking individual components and their interactions with each other. We have developed a novel method of uniquely tagging individual cells in vivo with a genetic ‘barcode’ that can be recovered by DNA sequencing. Our method is a two-component system comprised of a genetic barcode cassette whose fragments are shuffled by Rci, a site-specific DNA invertase. The system is highly scalable, with the potential to generate theoretical diversities in the billions. We demonstrate the feasibility of this technique in Escherichia coli. Currently, this method could be employed to track the dynamics of populations of microbes through various bottlenecks. Advances of this method should prove useful in tracking interactions of cells within a network, and/or heterogeneity within complex biological samples.


INTRODUCTION
Reverse engineering in any complex system requires the simultaneous monitoring of individual components. Recent advances in high-throughput DNA sequencing have given biologists unprecedented access to massively parallel data streams. Genetic barcoding--the labelling of individual cells with a unique DNA sequence--when combined with these technologies, will enable monitoring of millions or billions of cells within complex populations. This approach has proved useful in tagging neurons (1) and hematopoietic stem cells (2,3) for lineage analysis and could be applied to the normal and/or abnormal development of other cell populations or tissues, including tumours. Indeed, in vivo barcoding of individual neurons is the requisite first step towards converting neuronal connectivity into a form readable by high throughput DNA sequencing (4).
Most current approaches for tagging individual cells with a genetic barcode rely on the creation of diverse libraries in vitro and subsequent delivery of genetic material into a host cell at low-copy number. Such in vitro approaches are limited by cloning bottlenecks that cause reduced library diversities and sequence biases, by incomplete labelling of all cells within a population, by the possibility of introducing multiple barcodes per cell, and by the challenges of working across organisms (e.g. retroviral barcoding cannot be applied in some organisms like Caenorhabditis elegans).
In vivo barcoding has the potential to overcome all of these limitations. Mechanisms for generating diversity in vivo exist, endogenously, in many organisms--most notably the mammalian immune system. However, efforts to repurpose the immune system's V(D)J recombination for in vivo cellular barcoding (5) yielded limited barcode diversity--on the order of a dozen unique sequences--in cells other than lymphocytes [T. N. Schumacher (personal communication)]. Exogenous recombinases have been successfully applied to generate diverse combinations of colours for cellular tagging purposes. This technique, better known as Brainbow (6), relies on Cre recombinase to rearrange a cassette resulting in the stochastic expression of a subset of different coloured fluorescent proteins (XFPs) in neurons. The theoretical diversity of Brainbow is in the hundreds, but cannot be easily assayed with DNA sequencing because it relies heavily on gene copy number variation as well as recombination. We reasoned that by replacing XFPs with unique sequences, we could design a barcoding system with the potential to achieve diversities that matched the scale of high-throughput sequencing technologies.
We have developed a novel method of generating sequence diversity in vivo for the purpose of cellular barcoding. Our method, which relies on a DNA invertase--Rci, recombinase for clustered inversion (7)--to shuffle fragments of DNA, has the potential to easily achieve diversities over 10^9 unique sequences. Here, we show that this method can be applied for the in vivo generation of diversity in E. coli.

MATERIALS AND METHODS
In silico simulations
We performed in silico simulations to determine the behaviour of different cassette architectures. For Cre-based cassettes, n_cell cassettes (n_cell = 10 000) of n_frag fragments (n_frag = 100) were operated on independently. Each fragment was flanked on its 5′ end with a loxP site in sense orientation (5′-GCATACAT-3′) and on its 3′ end with a loxP site in the antisense orientation. Concatenation of fragments resulted in cassettes in which adjacent fragments (excluding ends) were separated by two loxP sites in opposing orientation. We defined Cre recombination as two independent binding events to loxP sites. Binding of Cre to a pair of loxP sites always resulted in recombination, where the result (inversion or excision) was determined by the relative orientation of the sites, as recorded in a lookup table (updated after each event). Completion was defined as the point at which Cre can no longer mediate an excision event. The number of recombination events required to reach completion was tracked for each cassette.
For Rci-based cassettes, n_cell cassettes (n_cell = 10 000) of n_frag fragments (n_frag = 10) were operated on independently. Here, we considered two architectures. For the first architecture, each fragment of the cassette was flanked on its 5′ end with an sfx site in sense orientation and on its 3′ end with an sfx site in the antisense orientation. Concatenation of fragments resulted in cassettes in which adjacent fragments (excluding ends) were separated by two sfx sites in opposing orientation. For the purpose of simulations, we considered each pair of sfx sites between fragments to be equivalent to a single bidirectional sfx site. We defined Rci recombination as two independent binding events to sfx sites. In this case, binding of Rci to a pair of sfx sites always resulted in recombination (inversion). Simulations were allowed to proceed for m recombination events per cassette.
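The Cre-based simulation described above can be sketched in a few lines of Python. This is an illustrative sketch and not the code from the Supplementary Materials: the function names, the uniform random choice of two sites per event, and the use of site-index parity in place of an explicit lookup table are all assumptions.

```python
import random

def cre_step(cassette, rng):
    """Apply one Cre event to a cassette of (fragment_id, orientation) tuples.

    Fragment p is flanked by site 2p (sense) and site 2p + 1 (antisense), so a
    site's orientation is given by the parity of its index. Both excision and
    inversion preserve this alternating layout, so no explicit table is kept.
    """
    a, b = sorted(rng.sample(range(2 * len(cassette)), 2))
    if a % 2 == b % 2:
        # Same orientation: excise the DNA between the sites (one site is lost too).
        lo, hi = (a // 2, b // 2) if a % 2 == 0 else (a // 2 + 1, b // 2 + 1)
        return cassette[:lo] + cassette[hi:]
    # Opposite orientation: invert the fragments lying between the two sites.
    lo, hi = (a + 1) // 2, (b - 1) // 2 + 1
    block = [(frag, -ori) for frag, ori in reversed(cassette[lo:hi])]
    return cassette[:lo] + block + cassette[hi:]

def run_to_completion(n_frag, rng):
    """Recombine until one fragment remains, i.e. no excision is possible."""
    cassette = [(k, 1) for k in range(1, n_frag + 1)]
    events = 0
    while len(cassette) > 1:
        cassette = cre_step(cassette, rng)
        events += 1
    return cassette[0], events
```

Calling `run_to_completion(100, random.Random(0))` once per simulated cell and collecting the surviving fragment and the event count gives the two quantities tracked in the text. Whether events that invert an empty interval should be counted is a modelling choice; this sketch counts them.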
For the second Rci architecture, the 5′ end of the cassette begins with a single sfx site in sense orientation, followed by a single sequence fragment. The cassette is extended by addition of an sfx site and a sequence fragment, with the orientation of sfx sites alternating throughout the cassette. The cassette is terminated at its 3′ end by an sfx site in antisense orientation. We defined Rci recombination as two independent binding events to sfx sites. Binding of Rci to a pair of sfx sites only resulted in recombination if the sfx sites were in opposite orientations (inversion only). Simulations were allowed to proceed for m recombination events per cassette. The code for running all simulations is provided in Supplementary Materials.
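Both Rci architectures can be illustrated with one short Python sketch. It is not the Supplementary code. The names `make_cassette`, `invert` and `shuffle_rci`, the uniform choice of site pairs, and the decision to count only productive binding events towards m are assumptions made for illustration.

```python
import random

def make_cassette(n_frag):
    """Unshuffled cassette: fragments 1..n, all in forward (+1) orientation."""
    return [(k, +1) for k in range(1, n_frag + 1)]

def invert(cassette, i, j):
    """Invert the fragments between sfx sites i and j (i < j).

    Site k sits immediately before the fragment at list index k, and site
    len(cassette) closes the cassette, so cassette[i:j] is the DNA between them.
    """
    block = [(frag, -ori) for frag, ori in reversed(cassette[i:j])]
    return cassette[:i] + block + cassette[j:]

def shuffle_rci(cassette, m, rng, bidirectional=False):
    """Apply m inversion events.

    bidirectional=True  : first architecture, any pair of sites recombines.
    bidirectional=False : second architecture, site orientations alternate, so
                          two sites are opposed only when j - i is odd. Such an
                          inversion leaves the alternating pattern unchanged.
    """
    n_sites = len(cassette) + 1
    events = 0
    while events < m:
        i, j = sorted(rng.sample(range(n_sites), 2))
        if bidirectional or (j - i) % 2 == 1:
            cassette = invert(cassette, i, j)
            events += 1  # assumption: unproductive binding is not counted
    return cassette
```

In the alternating architecture every inverted block has an odd number of fragments, so each fragment keeps the parity of its position. This is why that architecture reaches fewer distinct barcodes than the bidirectional one.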
All cloning was performed using Top10 chemically competent cells (Invitrogen) with growth at 37 °C.

Bacterial culture & shuffling
For initial tests with T7-induced expression of Rci, plasmids IDP205 or DIG35 were transformed into E. coli strain BL21(DE3) (NEB) and grown in 5 ml of normal or OvernightExpress (Millipore) supplemented media overnight. Plasmid DNA was isolated and the Rci coding sequence was removed (to prevent further shuffling) by double digestion (NdeI-NotI), blunting (Mung Bean) and re-ligation (Roche Rapid Ligation kit). The transformed ligation was plated for clonal analysis. Clonal analysis involved the selection of single colonies, growth in LB for 16 h, plasmid isolation and Sanger sequencing with ANCHOR105 used as a primer.
Nucleic Acids Research, 2014, Vol. 42, No. 16, e127
For tests of the pKat-driven expression, plasmids IDP205 or DIG35 were transformed into E. coli strain Top10 cells (Invitrogen) and grown in 5 ml of LB overnight. Plasmid DNA was isolated and the Rci coding sequence was removed (to prevent further shuffling) by double digestion (NdeI-NotI), blunting (Mung Bean) and re-ligation (Roche Rapid Ligation kit). The transformed ligation was plated for clonal analysis. Clonal analysis involved the selection of single colonies, growth in LB for 16 h, plasmid isolation and Sanger sequencing with ANCHOR105 used as a primer.
For high-throughput sequencing by PacBio, plasmids IDP205, DIG35 and DIG71 were transformed into Top10 cells (Invitrogen) and grown overnight in 50 ml of LB. Plasmid DNA was isolated and digested with PciI to release the barcode cassette. The barcode cassette was prepared for PacBio sequencing using the PacBio SMRTbell Template Prep Kit according to the manufacturer's instructions, and cassettes from each original plasmid were sequenced on a single Single Molecule Real Time (SMRT) sequencing cell. PacBio sequences were collapsed into circular consensus reads using the PacBio command line tools.

Barcode Reconstruction
Our sequence alignment algorithm is written in MATLAB and uses the MATLAB Bioinformatics Toolbox. Sequencing reads are processed independently. Each known barcode fragment is aligned to the sequence read (in both orientations) with a thresholded Smith-Waterman alignment (8) and the position of the segment along the read is stored. The threshold is set by a bootstrap method. Briefly, one hundred randomly generated 100-mers are aligned to all of the sequence reads and a score is associated with each alignment. The mean score plus two standard deviations is taken as the threshold for all subsequent alignments. The complete barcode can then be reconstructed from the positions of the segments. Because the algorithm relies only on local alignment, this method is extremely robust to sequencing errors.
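The reconstruction procedure can be illustrated with a self-contained Python sketch. The authors' implementation is in MATLAB and is not reproduced here. The scoring parameters, the linear gap penalty, the function names and the choice to keep only the better-scoring orientation of each fragment are assumptions made for illustration.

```python
import random
from statistics import mean, stdev

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def local_align(query, read, match=2, mismatch=-3, gap=-4):
    """Smith-Waterman local alignment; returns (best score, end position on read)."""
    best, best_end = 0, 0
    prev = [0] * (len(read) + 1)
    for q in query:
        curr = [0]
        for j, r in enumerate(read, start=1):
            score = max(0,
                        prev[j - 1] + (match if q == r else mismatch),
                        prev[j] + gap,
                        curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best, best_end = score, j
        prev = curr
    return best, best_end

def bootstrap_threshold(reads, length=100, n_random=100, seed=0):
    """Mean + 2 SD of the scores of random decoy sequences aligned to the reads."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_random):
        decoy = "".join(rng.choice("ACGT") for _ in range(length))
        scores.extend(local_align(decoy, read)[0] for read in reads)
    return mean(scores) + 2 * stdev(scores)

def reconstruct(read, fragments, threshold):
    """Return [(fragment name, '+' or '-'), ...] ordered by position on the read."""
    hits = []
    for name, seq in fragments.items():
        forward = local_align(seq, read)
        reverse = local_align(revcomp(seq), read)
        (score, end), orientation = max((forward, "+"), (reverse, "-"))
        if score > threshold:
            hits.append((end, name, orientation))
    return [(name, orientation) for _, name, orientation in sorted(hits)]
```

Because each fragment is located independently by local alignment, an error inside one fragment lowers that fragment's score but does not shift the positions assigned to the others. Pure Python is slow for this purpose, and a vectorised or compiled aligner would be needed for thousands of kilobase-length reads.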

Sequences of BC cassettes and Rci ORF
All relevant sequences and plasmid maps are provided in Supplementary Materials.

RESULTS
Design of Cre-based barcoding
Our design goals were to create a modular genetically encoded barcode system that is easily scalable, cross-platform (applicable across model organisms), compatible with high-throughput sequencing technologies and robust to sequencing errors. Our ultimate goal is to generate unique barcodes to label all of the cells of the mouse cortex--approximately 1 × 10^7 neurons (9). In general, if the repertoire of possible barcodes is substantially greater than the number of cells in the population of interest, then even randomly generated barcodes will label most cells uniquely. Specifically, if n is the number of cells and k is the number of possible barcodes, then under simple assumptions the fraction of uniquely labelled cells will be e^(−n/k). Thus, assuming one barcode per cell, a barcode repertoire exceeding the number of cells by a factor of 100 will yield 99% uniquely labelled cells. Therefore, we sought a barcode system with the potential to scale to at least 100 × 10^7 = 10^9 unique barcodes.
Our initial design relied on Cre recombinase to shuffle and pare down a cassette of n barcode fragments (each fragment flanked by lox sites in alternating orientation) by stochastic inversion and excision events to leave a single fragment (Figure 1a). Cre acts by binding to and mediating recombination between two of its cognate DNA sequences, called lox sites. Cre-mediated recombination between any two compatible lox sites in the same orientation causes the excision of the intervening sequence elements. In contrast, recombination between lox sites in opposing orientation causes the inversion of any intervening sequence elements.
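The uniqueness estimate above is easy to check numerically. The sketch below evaluates e^(−n/k) and compares it with the probability that none of the other n − 1 cells draws the same barcode. It assumes each cell draws one barcode uniformly and independently, and the function names are illustrative.

```python
import math

def unique_fraction_approx(n_cells, n_barcodes):
    """Fraction of uniquely labelled cells, e^(-n/k)."""
    return math.exp(-n_cells / n_barcodes)

def unique_fraction_exact(n_cells, n_barcodes):
    """(1 - 1/k)^(n - 1): no other cell shares a given cell's barcode."""
    return math.exp((n_cells - 1) * math.log1p(-1.0 / n_barcodes))

# 10^7 cortical neurons and a repertoire 100 times larger.
print(unique_fraction_approx(10**7, 10**9))
print(unique_fraction_exact(10**7, 10**9))
```

Both expressions give about 0.990 for n = 10^7 and k = 10^9, which is the 99% figure quoted in the text. When the repertoire only equals the number of cells, the approximation falls to about 0.37.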
This architecture (shown in Figure 1a) has a theoretical diversity of 2n after completion, since any fragment, j, of the n fragments can end in either the forward or inverted orientation (Figure 1a). Several variant lox sites have been discovered with a wide variety of characteristics (10,11,12,13). The Brainbow system (6), for example, employed three lox sites that were shown to be mutually incompatible: loxP (the original lox site), loxN (6) and lox2272 (13) (Figure 1b). We reasoned that a Cre-based barcoding approach could be extended to achieve higher diversities by concatenating k cassettes of n fragments, where each cassette employs one of a subset of incompatible lox sites (Figure 1c). Here, the theoretical diversity is (2n)^k because each cassette operates independently. Thus with n = 100, k = 4, theoretical diversities reach our goal of 10^9 unique sequences.
Unfortunately, many of the reported lox variants have not been validated for complete incompatibility or pairwise efficiency. Moreover, because of the repetitive nature of lox sites it would be difficult to synthesize cassettes with the hundreds of fragments required to achieve our target diversity. For example, an architecture with n = 100, k = 4, where each fragment is above the minimum length (∼100 bp) for efficient recombination (14), requires a large genomic insertion with a length >60 kb (Figure 1d). Finally, simulations suggested that Cre-based architectures are subject to considerable biases that limit the diversities that can be achieved in practice (Supplementary Note 1, Supplementary Figure S1).
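The figures quoted for the multi-cassette Cre design can be reproduced with simple arithmetic. The sketch below assumes 100 bp fragments, each flanked by two 34 bp lox sites, with no additional spacer. That is one plausible way to arrive at a length above 60 kb, but the exact accounting behind Figure 1d is not stated in the text.

```python
LOX_BP = 34        # length of a lox site
FRAGMENT_BP = 100  # approximate minimum fragment length quoted in the text

def cre_diversity(n, k):
    """k independent cassettes, each ending as 1 of n fragments in 1 of 2 orientations."""
    return (2 * n) ** k

def cre_cassette_length(n, k, fragment_bp=FRAGMENT_BP, lox_bp=LOX_BP):
    """Total length if every fragment carries its own pair of flanking lox sites."""
    return n * k * (fragment_bp + 2 * lox_bp)

print(cre_diversity(100, 4))        # 1600000000
print(cre_cassette_length(100, 4))  # 67200
```

With n = 100 and k = 4 this gives 1.6 × 10^9 possible barcodes from roughly 67 kb of sequence, consistent with the >60 kb estimate.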

Employing DNA invertases for cassette shuffling
The key limitation of Cre-based designs is that Cre mediates both inversion and excision/insertion. Because insertion is a bimolecular reaction whereas excision is a unimolecular reaction, in general equilibrium will favour excision over insertion. Thus the equilibrium diversity of a Cre-based cassette scales linearly with the number of fragments n. In simulations (see 'Materials and Methods' section), 10-40 recombination events were performed on each cassette before reaching completion (Figure 1e). The diversity of the Cre-based cassettes could, in principle, be increased by preventing the reaction from proceeding to completion. In the limit, if only inversions were permitted, then the fragments of the cassette would be shuffled rather than pared down. Intuitively, the advantage of eliminating excision events can be understood by analogy to a deck of n playing cards, in which each card can occur in either orientation (face up or face down). If excisions dominate, then the diversity is given by eliminating all but one card. If inversions dominate, then the diversity is given by all the possible sequences of n shuffled cards (n!) multiplied by all of the possible orientations (2^n). Thus, eliminating excisions allows the diversity, D = 2^n × n! (Equation 1), to increase supra-exponentially, rather than linearly, with the number of elements n.
With only 10 fragments, the diversity reaches >3 × 10^9 unique sequences. Importantly, the equilibrium state of this architecture, as the number of recombination events m approaches infinity, is an equal distribution of all potential unique sequences (15). In practice, m will reach some finite number that may result in cassette biases. However, simply extending the cassette by one fragment can compensate for any modest biases of this architecture. Moreover, because the barcodes are made of a small number of known sequence fragments, reconstruction of barcode sequences even from highly error-prone sources becomes possible. The order and orientation of each fragment within the cassette after recombination can be determined simply by performing 2n pairwise alignments (each fragment in both orientations is aligned to the recovered sequence). This results in barcodes that are robust to many classes of sequencing errors.
We thus adopted a strategy based on DNA invertases, recombinases that can mediate only inversions (16). Rci (recombinase for clustered inversion) is a site-specific recombinase of the integrase family, of which Cre is also a member (7). Rci recognizes 31 bp sfx sites and mediates recombination events only between sites in inverted orientation (inversions)--it cannot mediate excision events between sites in the same orientation. Unlike other inversion systems, such as Hin and Gin (17), Rci does not appear to require any co-factors or enhancer sequences (18). Because of this, we selected Rci as our recombinase and designed a new barcode cassette in which segments of DNA are shuffled by inversion-only recombination events (Figure 2a).

Design and synthesis of a 5BC cassette
Initially, we planned to synthesize DNA in which n = 5 fragments, each flanked by sfx sites in opposite orientation, were concatenated to form a barcode cassette. After concatenation, each fragment is separated from its immediate neighbour by two sfx sites in opposite orientation (similar to the architecture proposed for Cre in Figure 1a). However, DNA synthesis constraints and plasmid stability forced us to redesign the cassette to minimize the effects of repetitive elements (the sfx sites). Ultimately, we employed an architecture in which individual fragments are separated by a single sfx site--the orientation of which alternates between successive fragments (Figure 2a), relying on several compatible sfx site variants to further reduce complexity (19) (Figure 2b). The theoretical diversity of this architecture is D = 2^n × ⌈n/2⌉! × ⌊n/2⌋! (Equation 2), where D is the total diversity and n is the number of fragments in the cassette. Despite the reduction of diversity due to the modified architecture, only 12 fragments are required (n = 12) to achieve our target diversity of >10^9. Additional segments greatly increase the diversity, making this a scalable approach (Figure 2c). Moreover, unlike the Cre-based architecture, the Rci-based cassettes reach our requisite diversity at reasonable cassette lengths of ∼1-2 kb (Figure 2d).
Ultimately, a 5-fragment (100 bp fragments) barcode cassette was synthesized utilizing five different sfx site sequences to decrease the repetitive nature of the cassette and to aid in synthesis and replication. In addition, known anchor sequences (ANCHOR105 and ANCHOR56), positioned at either end of the cassette, were added outside of the recombination region to aid in sequence reconstruction. The final 5-fragment cassette, 5BC (Figure 3a), was cloned into a low-copy plasmid containing the Rci gene (resulting in plasmid IDP205). This ensures that all barcode cassettes that are transformed into bacterial cells will be exposed to the Rci coding sequence.
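The scaling of the two inversion-only architectures can be compared with a short calculation. The closed form used for the alternating architecture, 2^n × ⌈n/2⌉! × ⌊n/2⌋!, is inferred here from the values quoted in the text (384 for n = 5 and 176 947 200 for n = 11) and should be read as an assumption. The function names are illustrative.

```python
from math import factorial

def diversity_full(n):
    """Every order and orientation reachable: n! orders times 2^n orientations."""
    return 2**n * factorial(n)

def diversity_alternating(n):
    """Single alternating sfx sites: fragments only exchange with positions of
    the same parity, giving ceil(n/2)! * floor(n/2)! orders times 2^n orientations."""
    return 2**n * factorial((n + 1) // 2) * factorial(n // 2)

def min_fragments(diversity_fn, target):
    """Smallest n whose diversity reaches the target."""
    n = 1
    while diversity_fn(n) < target:
        n += 1
    return n

for n in (5, 10, 11, 12):
    print(n, diversity_full(n), diversity_alternating(n))
print(min_fragments(diversity_full, 10**9), min_fragments(diversity_alternating, 10**9))
```

Under these formulas the fully shuffled design passes 10^9 at n = 10 (3 715 891 200 barcodes), while the alternating design needs n = 12 (2 123 366 400 barcodes), in line with the fragment counts stated in the text.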
Plasmid IDP205 (T7→Rci; 5BC) contains the Rci gene driven by the inducible T7 promoter. This plasmid is remarkably stable, showing no signs of recombination in the absence of induction across many generations (Supplementary Figure S2).

Testing of the 5BC cassette in vivo
We transformed two populations of E. coli BL21(DE3) (NEB) cells with plasmid IDP205 (T7→Rci; 5BC) and grew 10 ml cultures overnight. One culture was grown under conditions that induce the expression of Rci from the T7 promoter (see 'Materials and Methods' section). After growth, cells were plated for clonal analysis on plates that did not support Rci expression. Twenty colonies were chosen for each condition (with and without Rci induction) and analysed by Sanger sequencing. Without induction of Rci expression, no recombination was observed (0 of 20 colonies sequenced, data not shown). Moreover, induction of Rci led only to modest recombination, shuffling the cassette in 8 of the 20 reconstructed barcode sequences (Figure 3b). Interestingly, each of the final products could be explained by a single recombination event (Supplementary Figure S3).
We reasoned that the inefficient shuffling might be due to protein aggregation and insolubility caused by high overexpression, as is often the case with T7 overexpression (20). To test this, we fused an HA-tag to the N-terminus of Rci and tested the expression level and solubility via western blot. Indeed, HA-Rci was detected only in the insoluble fraction (Figure 3c), perhaps explaining the inefficient shuffling observed. Thus, we tested expression of HA-Rci from a different promoter, the medium-strength constitutively active promoter pKat (Registry of Standard Biological Parts: BBa_I14034), and found that the protein was soluble when expressed from this promoter (Figure 3d). Therefore, we cloned the pKat promoter in place of T7 to make plasmid DIG35 (pKat→Rci; 5BC) and tested for shuffling efficiency. Briefly, we transformed DIG35 (pKat→Rci; 5BC) into Top10 competent cells and grew cultures overnight. To stop shuffling, plasmid DNA was isolated and the sequence for the Rci gene was removed via restriction enzyme digestion. Plasmids were re-transformed, colonies were selected, and DNA isolated from each colony was subjected to Sanger sequencing. Sequence reads were analysed with our alignment algorithm in order to reconstruct full barcodes. Reads that could not be fully reconstructed from sequencing data were discarded from further analysis.
Expression of Rci from this promoter, pKat, resulted in robust shuffling (Figure 3e). All of the 22 reconstructed cassette sequences (3 sequences failed reconstruction) had undergone shuffling, yielding 19 unique barcode sequences. Moreover, all of the final barcode sequences could only be explained by multiple (>1) recombination events. Based on these positive preliminary results, we next subjected the 5BC cassette to high-throughput DNA sequencing.

High throughput sequencing of shuffled 5BC
Advances in high-throughput sequencing (HTS) have allowed for unprecedented access to massive quantities of DNA sequence data. We took advantage of HTS to sequence the shuffled 5BC cassettes at depths that allowed for a more thorough analysis of the actual in vivo behaviour of barcode generation by Rci. Because of the length of our potential cassettes (∼1-2 kb to reach diversities of 10^9), we chose the PacBio sequencing platform.
Briefly, we transformed E. coli Top10 (Invitrogen) cells with either plasmid IDP205 (T7→Rci; 5BC; negative control) or DIG35 (pKat→Rci; 5BC) and allowed cultures to grow overnight. DNA was isolated and digested to release the barcode cassette. The barcode cassettes were then prepared for HTS on the PacBio RS II and sequenced.
Using our algorithm, we reconstructed 5887 barcodes from 7203 circular consensus reads (see 'Materials and Methods' section) obtained from cells in which Rci was not expressed (plasmid IDP205; T7→Rci, 5BC). Of the reconstructed barcodes, 5886/5887 gave the original sequence (Figure 4a). These data indicated that the PacBio sequencing platform could handle the highly repetitive nature of the barcode cassettes and would allow for HTS of cassettes without the introduction of recombination during sequencing from template switching or other sources. When Rci was expressed from the pKat promoter (DIG35), the cassette was shuffled robustly (Figure 4b). Here, we reconstructed 5243 barcodes from 6105 circular consensus reads, of which 203 were unique sequences. After shuffling, each position along the cassette is populated with a relatively even distribution of all of the possible fragments, with only slight biases at the ends of the cassette (Figure 4b). In other words, the occupancy at each position of the cassette, while still preferential for the original fragment sequence, approaches randomness (Figure 4c). This important observation indicates that there are no biological constraints on our design that prohibit the full exploration of the barcode space. Simulations of shuffling of a 5BC cassette under simple assumptions (see 'Materials and Methods' section for details) show that the occupancy at any given position reaches equilibrium after ∼6 or more recombination events per cassette (Figure 4d). Comparison of the simulated data and the data collected from in vivo shuffling suggests that our cassettes likely experienced 2-3 recombination events on average in vivo (Figure 4d).
Even this limited number of recombination events resulted in 203 observed unique sequences (out of a theoretical 384; for n = 5, diversity = 2^5 × 3 × (2!)^2). Intuitively, after a small number of recombination events, the cassette remains biased at each position to its initial state (Figure 4d). As the number of recombination events increases, however, the cassette goes to equilibrium, and there is a nearly uniform distribution of each fragment at each position in the cassette (Figure 4d). In the limit, the cassette will reach an equilibrium state in which every possible barcode is equally probable (15).
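The theoretical count of 384 barcodes for the 5BC cassette can be checked by brute force. The sketch below enumerates every arrangement reachable from the unshuffled cassette. It assumes that sfx site orientations alternate along the cassette and that Rci inverts only the DNA between two sites of opposite orientation. The function name is illustrative.

```python
from collections import deque

def reachable_barcodes(n_frag):
    """Breadth-first enumeration of all cassettes reachable by allowed inversions.

    A cassette is a tuple of (fragment_id, orientation). Site i precedes the
    fragment at index i and site n_frag closes the cassette. With alternating
    orientations, sites i and j are opposed exactly when j - i is odd.
    """
    start = tuple((k, 1) for k in range(1, n_frag + 1))
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for i in range(n_frag):
            for j in range(i + 1, n_frag + 1, 2):  # j - i odd
                block = tuple((f, -o) for f, o in reversed(state[i:j]))
                nxt = state[:i] + block + state[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen

print(len(reachable_barcodes(5)))  # 384
```

The enumeration reaches 384 states for five fragments. Each inversion spans an odd number of fragments, so a fragment never changes the parity of its position: the three odd-position fragments permute among themselves (3! orders), the two even-position fragments do likewise (2! orders), and each of the five fragments can take either orientation (2^5).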
This proof of principle experiment shows that in vivo recombination by Rci on a cassette is efficient at shuffling the original sequence into a unique barcode. However, the theoretical diversities of the 5BC cassette are well below our initial goals. Therefore, we sought to expand the cassette to achieve higher diversities.

High throughput sequencing of shuffled 11BC
To explore the feasibility of achieving diversities that are capable of labelling large populations of cells, we expanded the cassette to 11 fragments. We synthesized a 6-fragment extension to our original cassette and concatenated this with our original 5BC cassette to create an 11BC cassette (Figure 5a). Importantly, plasmids harbouring this cassette were stable across many generations and showed no evidence of recombination in the absence of Rci expression (Supplementary Figure S4).
The theoretical diversity of this cassette is 2^11 × 6 × (5!)^2 = 176 947 200. Unfortunately, there is currently no sequencing technology that has both the requisite depth and read length to appropriately cover the potential diversity of the 11BC cassette. Nevertheless, we used high-throughput sequencing on the PacBio platform to sample the barcodes produced by the recombination of the 11BC cassette. Briefly, we transformed Top10 bacterial cells with plasmid DIG71 (pKat→Rci; 11BC cassette) and cultured the cells overnight. The barcode cassette was released by restriction digestion and subjected to HTS on the PacBio RS II.
Shuffling of the 11BC cassette in vivo was efficient (Figure 5b). We reconstructed 1786 barcodes, of which 1723 were unique. Again we observe that, after shuffling in vivo, each position along the cassette is populated with a relatively even distribution of all of the possible fragments (Figure 5b), approaching a completely random cassette (Figure 5c). Based on simulations, we estimate that the 11BC cassettes have experienced more than five recombination events on average (Figure 5d). As suggested by simulations, further shuffling will lead to an increasingly random cassette (Supplementary Figure S5).

DISCUSSION
To our knowledge, this is the first example of an in vivo barcoding scheme with the potential to scale to uniquely label all of the individual cells of an entire tissue or organism. The system, which takes advantage of recombinases to shuffle fragments within a cassette, has several advantageous characteristics. Recombinases similar to Rci (e.g. Cre, FLP, PhiC31) have been successfully employed across a wide variety of organisms, suggesting that our barcoding system could be easily ported to organisms beyond E. coli, either by expressing Rci or, alternatively, by employing designs that take advantage of asymmetric mutant recombination sites (21) of other recombinases. In addition, the system is highly scalable--addition of a single fragment to a cassette results in an exponential gain in diversity. Moreover, because the input space is small (only a handful of unique segments), each segment can be designed to be maximally orthogonal, thus rendering barcode readout highly robust to DNA sequencing errors. We took advantage of this fact in designing our barcode reconstruction algorithm, which relies on local alignment between the known input fragments and the final imperfect sequence recovered from HTS.
In its current form, however, this paradigm has several shortcomings that will need to be addressed before the system can be used at a larger scale. First, the expression of the Rci protein must be controlled in order to induce shuffling at a specific time point and then stopped to prevent further recombination events. In theory this can be accomplished through the use of an inducible promoter (e.g. T7 promoter, tetracycline-responsive promoter, etc.). In practice, the expression level and recombination efficiency will need to be monitored at different levels of induction to permit solubility, genomic stability, and optimal recombination kinetics. Second, the cassettes, despite their design, were still subject to various poorly understood biases. Further work will be needed on Rci (and other recombinases) to determine the pair-wise efficiencies between different recombination sites, the efficiency of recombination as a function of length between recombination sites, and to increase the recombination kinetics to achieve the maximal number of recombination events per unit time. As we saw in our simulations, a higher number of recombination events leads to far greater diversity and less bias. The biases that we detected, particularly in the case of the 5BC cassette, were likely exacerbated by the fact that we introduced a homogeneous barcode into exponentially dividing cells, where the kinetics of cell division likely outpace that of recombination (at least initially). In terminally differentiated cells, such as neurons, this is less of an issue as the expression of the recombinase can be sustained for long periods of time (e.g. weeks or months). However, in the case of dividing cells, recombinase expression must be pulsed for short durations to allow shuffling and then abruptly stopped to permit lineage tracing.
An additional limitation of our design is that the barcodes are carried on extrachromosomal plasmids and thus cannot be used for lineage tracing in organisms that do not allow for plasmid replication and inheritance. Introduction of the cassette into the genome of E. coli can be accomplished by homologous recombination and should be straightforward. Genetic engineering tools such as ZFNs, TALENs or CRISPR will allow introduction of barcode cassettes in other organisms.
There are two factors that need to be considered in terms of the compatibility of our technique with current DNA sequencing technology. The first is read depth, or the number of amplicons that can be read in a single DNA sequencing run. The second is read length, or the number of bases that can be obtained for a given amplicon--which is currently the main limitation that constrains the application of our barcoding system. Over the last several years, both read depth and read length have increased steadily. It is likely that this trend will continue and that sequencing technology will soon be able to economically meet the needs of our barcoding system. Furthermore, additional research on Rci could allow for the optimization of barcode fragment lengths to permit recombination with shorter fragments--thereby decreasing the length of the cassette to be within reach of current next-gen sequencing platforms. Nevertheless, it is useful here to review the current high-throughput sequencing technologies and their compatibility with the barcoding approach outlined here.
Current Illumina technology produces ∼8 billion reads per flow cell, which is enough to measure millions of barcoded cells at sufficient depth (assuming a uniform 100 reads per barcode). The read depth of high-throughput sequencing is thus well matched to our current system. The other factor that must be considered is the available read length of current high-throughput sequencing technology. The longest read lengths currently offered by Illumina are for the MiSeq platform, which currently offers 2 × 300 = 600 bp reads. This limitation severely constrains the scale of our approach. With 600 bp reads we can sample barcode cassettes of length <600 bp. With our current design of 100 bp fragments separated by 31 bp sfx sites, we can read cassettes with only n = 4 barcode fragments (∼550 bp). This limits the achievable diversity to <100 unique barcodes. 'Hacking' techniques are available that can push the read length of the Illumina technology to longer read lengths; 2 × 500 has recently been demonstrated (22). Alternatively, stripping a primer after a certain number of reads and rehybridizing with a new primer would allow for multiple subreads from the same amplicon--thus allowing probing of the sequence at various sites for reconstruction of the full barcode (23) [J. Mellor and F. Roth (personal communication)]. Moreover, it is likely that this technology will continue its steady rollout of improvements to both read depth and read length that will allow for a well-matched economical technology for sequencing barcodes generated by our method.
Alternative sequencing platforms, including Roche 454 or Pacific Biosciences (used here), offer different specifications that may be more applicable currently. The PacBio platform allows ∼100k reads of >1 kb, and the newest Roche sequencing platform, the GS FLX Titanium XL+, offers read lengths of up to 1000 bp for ∼700 000 reads. This would allow monitoring of at least 7000 unique sequences. At 1000 bp our cassette design can reach n = 7 barcode fragments for a total diversity of ∼18 000 barcodes. At this read length and depth, our barcoding technology is well matched and would allow tracking of ∼7000 barcodes.
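The read-length and read-depth constraints discussed above can be reproduced with a short calculation. The sketch assumes a cassette of n fragments of 100 bp bounded and separated by n + 1 sfx sites of 31 bp, ignores anchor and primer sequences, and uses the alternating-site diversity formula inferred from the values quoted in the text. All function names are illustrative.

```python
from math import factorial

FRAGMENT_BP = 100
SFX_BP = 31

def cassette_length(n):
    """n fragments plus the n + 1 sfx sites that bound and separate them."""
    return n * FRAGMENT_BP + (n + 1) * SFX_BP

def diversity(n):
    """Alternating-site architecture: 2^n * ceil(n/2)! * floor(n/2)!."""
    return 2**n * factorial((n + 1) // 2) * factorial(n // 2)

def max_fragments(read_length_bp):
    """Largest n whose whole cassette fits in a single read."""
    n = 0
    while cassette_length(n + 1) <= read_length_bp:
        n += 1
    return n

def trackable_barcodes(reads, reads_per_barcode=100):
    """Number of barcodes measurable at the stated uniform depth."""
    return reads // reads_per_barcode

for platform, read_bp, reads in (("MiSeq 2 x 300", 600, 8_000_000_000),
                                 ("GS FLX Titanium XL+", 1000, 700_000)):
    n = max_fragments(read_bp)
    print(platform, n, cassette_length(n), diversity(n), trackable_barcodes(reads))
```

Under these assumptions a 600 bp read accommodates n = 4 fragments (555 bp, 64 barcodes) and a 1000 bp read accommodates n = 7 (948 bp, 18 432 barcodes), while 700 000 reads at 100 reads per barcode cover 7000 barcodes. The 8 billion read figure is the per-flow-cell number quoted in the text and is paired with the MiSeq read length only for illustration.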
Within these constraints, our technology could be applied immediately to a dissection of the dynamics of a microbial population under various stressors, including antimicrobials, limited resources or niche competition. Specifically, our technology is uniquely positioned to probe serial population bottlenecks, which remain poorly understood (24). A population of cells carrying the barcode cassette can be exposed to transient barcode shuffling (Figure 6a, b and c). The pool can then be probed by DNA sequencing to test the original distribution of barcodes (Figure 6d). At this point, the population can be exposed to various stressors resulting in a population bottleneck that selects <1000 barcoded cells (Figure 6e). After recovery (Figure 6f), the population can be probed again to measure the resulting distribution of barcodes (Figure 6g). The effect of serial bottlenecks can be measured by additional rounds of transient shuffling and exposure to stressors. Other barcoding techniques, such as shotgun cloning, would not allow for the serial tracking of population dynamics because new barcodes would need to be introduced at each stage. Our technique will be advantageous in any situation in which genetic diversity must be introduced at a specific point in time (e.g. lineage tracing), or in cells in which the introduction of genetic material is challenging and/or inefficient.
We have shown that in vivo shuffling of a cassette of DNA fragments by a recombinase--Rci--can generate significant diversities for cellular barcoding purposes. Barcoding of individual cells within bacterial or yeast populations should prove to be a useful tool for population geneticists and evolutionary biologists and will allow for a detailed analysis of population genetics and growth dynamics under various conditions. The system has few moving parts (all that is needed is Rci and a barcode cassette) and is likely to work across a variety of higher organisms with optimization. This system, applied in other organisms, could pave the way to the dissection of complex developmental programmes, study of heterogeneity within tissues, and/or probing of interactions between cells in a population (e.g. B- and T-cells, neurons, tumours, etc.). In vivo barcoding will also pave the way to new explorations in systems biology, from high-throughput monitoring of non-cell-autonomous spread of genetic material to the variability of single-cell transcription profiles. In combination with other molecular tools, in vivo barcoding has the potential to provide biologists with unprecedented knowledge of the complex orchestration of single cells within populations.