Reprever: resolving low-copy duplicated sequences using template driven assembly

Genomic sequence duplication is an important mechanism of genome evolution, often producing large sequence variants with implications for disease progression. Although paired-end sequencing technologies are commonly used for structural variation discovery, discovering novel duplicated sequences remains an unmet challenge. We analyze duplicons starting from identified high-copy-number variants. Given paired-end mapped reads and a candidate high-copy region, our tool, Reprever, identifies (a) the insertion breakpoints where the extra duplicons inserted into the donor genome and (b) the actual sequence of each duplicon. Reprever resolves ambiguous mapping signatures arising from existing homologs, repetitive elements and sequencing errors to identify breakpoints. At each breakpoint, Reprever reconstructs the inserted sequence using profile hidden Markov model (PHMM)-based guided assembly. In a test on 1000 artificial genomes with simulated duplications, Reprever identified novel duplicates in up to 97% of genomes within 3 bp positional error and 1% sequence error. Validation on 680 fosmid sequences identified and reconstructed eight duplicated sequences with high accuracy. We applied Reprever to reanalyze re-sequencing data from the African individual NA18507, identifying >800 novel duplications, including insertions in genes and insertions carrying additional variation. Polymerase chain reaction followed by capillary sequencing validated both the insertion locations of the strongest predictions and their predicted sequences.

Figure S1: One-end anchored reads from different homologs identifying an insertion location. One-end anchored reads (discordant-one-end reads, see Methods) from the originally predicted region (Chr 1, blue arrows) and its homolog (Chr M, beige arrows) form a complementary cluster at Chr 8 (insertion breakpoint chr8:63177585). Increased coverage around breakpoints improves breakpoint accuracy and reconstruction performance.

Figure S2: Breakpoint specificity. Mapping around the breakpoint is confounded by non-unique flanking sequences. A. A breakpoint (right yellow dot) can be detected by the discordant-one-end reads mapped around it. However, due to the homologous sequences (L1, L2), the signal around the breakpoint (L3) can be weakened (blue arrows at L1 and L2). B. An example of a false breakpoint signal due to non-unique flanking sequences. A clean-shaped cluster (at D3) is formed simply by mapping ambiguity, but there is no actual duplication.

Figure S3: Breakpoint composition. Discordant-one-end reads form clusters around insertion locations. A. A standard clean-shaped cluster is shown around a true duplication breakpoint (yellow dot). When the flanking sequences are unique (pink ovals), all forward reads (blue arrows) map just prior to the breakpoint, and reverse-strand reads (red arrows) just posterior to it. B. A mixed-shaped cluster can be generated at a false breakpoint, where the reference genome has the same sequence (H2) at the locus. Some forward reads (blue arrows) map at the 3' region of the block (purely from mapping ambiguity caused by sequence similarity to H1), and reverse reads (red arrows) at the 5' region. C. The overlap length is defined as the span between the leftmost reverse-strand read and the rightmost forward-strand read. The cluster size is defined as the span from the beginning of the first read to the end of the last read.
Using these two measures, we can distinguish clean-shaped clusters from mixed-shaped clusters.

Figure S4: Analysis of unmapped reads. A. Possible causes of unmapped (orphan-one-end) reads are depicted. Reads (red arrows) sequenced from additional variations (e.g. SNPs, micro-insertions/deletions), repeat elements (e.g. SINEs, LINEs) and/or across breakpoints can be unmapped in the whole-genome mapping. Reads with bad sequencing quality (e.g. low Phred quality scores) can also be unmapped. B. Rescued unmapped reads from REPREVER LOC. Reads of sufficient sequencing quality (fewer than 10 nucleotides with <Q20 Phred score) are aligned pairwise to each H ∈ H and its flanking regions; reads with >92% similarity and >400 Smith-Waterman score are rescued and reassigned. Contrary to our expectation, bad sequencing quality is the most important cause of mapping loss; only 0.7% of orphan reads came from non-homologous regions. C. The portion of repeat elements in unmapped reads was analyzed using RepeatMasker. Even in unmapped reads, the portion is not much higher than in mapped reads (57% versus ∼50%), meaning that repeat elements in flanking regions are not a major cause of mapping loss.

Figure S5: Classification of paired-end mappings with respect to a high-copy region H. (a) A two-end read is a paired-end read that is successfully mapped inside H. It can be from the same region (blue) or any duplicate (red). (b) When only one read is mapped to H and the mapping is concordant, it is classified as a concordant-one-end read. The read mapped inside H is internal (the right read of blue and the left read of red), while the other is external. (c) When the one-end read is discordant (by insert size or coordinate), it is classified as a discordant-one-end read. It can be generated from boundaries of duplicates.
Discordant-one-end reads play an important role because their external reads can be used to infer breakpoints while the internal reads provide variation information in the duplicate boundaries. (d) When the external read is not mapped (orphan), the paired-end read is called an orphan-one-end read. It can be generated from excessive variation inside duplicates (blue), repeat elements in flanking regions (magenta), or split-read generation at the boundaries (red).

Figure S6: A landscape of copy number increase in NA18507. In total, 990 duplication events whose insertion locations were identifiable are drawn (center circle). Each edge connects an original copy location to its corresponding insertion location; edge color corresponds to the original site. We found no specific preference in chromosomes or chromosomal locations for duplication sites. Five sub-networks are also drawn, including stronger candidates (top), inter/intra-chromosomal duplications (upper and lower right), and homo/heterozygous duplications (upper and lower left). Networks are drawn using Circos (Krzywinski et al., 2009).

Figure S7: Complex copy number variation in MUC20 explained by discordant reads. A. Many alternative transcripts are found in MUC20 around exons 2 and 3. Short tandem repeats are shown in the intron region. B. Discordant-one-end reads from REPREVER LOC (red and blue solid arrows) indicate a breakpoint near intron 2. Non-unique sequences caused by segmental duplication provide a secondary mapping place for each read (outlined arrows); however, mate-pairs remain discordant due to the abnormally large insert size. C. A non-allelic homologous recombination (NAHR) can occur between two short tandem repeats (yellow dashed line). A few repeat units can also be omitted during the recombination. D.
The NAHR explains and resolves all the discordance: i) the gain call made by CNVer, ii) the increased insert size of mate-pairs, iii) discordant mapping across the two homologs predicted from breakpoint specificity.

Figure S8: With an insertion of significant size, mate-pairs generated across the breakpoint in the donor genome are not mapped concordantly in the reference genome. Left: there is no change in breakpoint coverage when no duplication has occurred. Middle: mate-pairs generated in or across the inserted block D (red lines) are not mapped in the reference, causing roughly 50% reduction of coverage at the breakpoint. Right: likewise, we observe a "V"-shaped reduction of coverage around a homozygous breakpoint.

Figure S9: An overview of the duplicate identification process. (a) Boundary regions of duplicates can be reconstructed from discordant-one-end reads, where one (external) sub-read is mapped to the flanking region of a breakpoint and the other (internal) sub-read is mapped to the boundary regions of the duplicates (D1 to DN). If there are any variations, including SNPs and micro-indels, in the boundaries (colored narrow boxes), we can differentiate the duplicates by training PHMMs using the assigned reads (right). (b) After training boundary regions, two-end reads are used for reconstruction of the central regions. For each two-end read, one sub-read is used to determine which duplicate it is from (e.g. the one in the figure is assigned to DN) and the other end is used for extending the trained regions (dotted ovals below each duplicate). This process iterates until 1) all reads are assigned to some duplicate or 2) the trained regions can no longer be extended.

Figure S10: Getting a template sequence of a duplicate (D2) from the original copy (H). In case the duplicate was truncated when it was first generated, a subsequence of H is used as the template.
The range of the subsequence can be inferred from internal sub-reads: the new start point (b') is given by the leftmost position of all reverse internal sub-reads (red), and the new end point (e') by the rightmost position of all forward internal sub-reads (green).
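The rule above for inferring a truncated template range can be sketched as follows. This is a minimal illustration, assuming a hypothetical (strand, start, end) read representation; Reprever's actual data structures are not shown in this supplement.

```python
# Sketch: inferring the truncated template range (b', e') of a duplicate
# from internal sub-read mapping positions, as described for Figure S10.
# The (strand, start, end) tuples are hypothetical example data.

def infer_template_range(internal_reads, h_start, h_end):
    """internal_reads: iterable of (strand, start, end) with strand in {'+', '-'}.

    b' = leftmost position among reverse-strand internal sub-reads,
    e' = rightmost position among forward-strand internal sub-reads.
    Falls back to the full original copy H when one side has no evidence.
    """
    rev_starts = [start for strand, start, end in internal_reads if strand == '-']
    fwd_ends = [end for strand, start, end in internal_reads if strand == '+']
    b_prime = min(rev_starts) if rev_starts else h_start
    e_prime = max(fwd_ends) if fwd_ends else h_end
    return b_prime, e_prime

reads = [('-', 120, 195), ('-', 110, 185), ('+', 900, 975), ('+', 880, 955)]
print(infer_template_range(reads, 0, 1000))  # (110, 975)
```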

Figure S12: PCR experiments on high-confidence NA18507 duplicons. Out of 10 total regions, the six regions not presented in the main manuscript are shown. From the 12 primer pairs, we successfully confirmed amplification for 8 pairs (21, 22, 24, 25, 26, 27, 28, 29). Lanes correspond to primer pairs 20 to 31, run against a 100 bp ladder. Detailed primer information is available in Table S4.

Table S1: Methods for insertion breakpoint detection vary widely depending on the size and contents of the inserted sequences. For example, small insertions (∼a few bp) can be found from mismatches in alignment without considering any discordant mapping coordinates. Insertions of larger (>1 kb) sequences exist in fewer numbers and are harder to identify. There have been a few attempts to reconstruct inserted sequences using local assembly around the breakpoint. However, most of them target novel sequences (which do not exist in the reference sequence). Insertion of duplicated sequences (existing in the reference sequence in one or many copies) has rarely been validated either computationally or experimentally. a: Inserted sequence or its homologs exist in the reference. Here we define homologs as sequences whose length is within 90%-110% of the original sequence, with >95% sequence identity. b: The number of assembled sequences depends on 1) the tools used in assembly (e.g. Velvet, Euler) and 2) the stringency of contig assembly. c: We additionally found Blat matches for 777/2898 inserted sequences, which were not analyzed in the original study and are yet to be validated.

Parameter initialization for profile Hidden Markov Model
A profile HMM is described by the tuple (Q, V, π, A, B), referring to states, symbols, initial state distribution, transition probabilities and emission probabilities, respectively. Q = {q_1, q_2, ..., q_n} is the set of states; V = {v_1, v_2, ..., v_m} is the output alphabet; π(i) is the probability of being in state q_i at time t = 0 (initial state); A = {a(q_i, q_j) | 1 ≤ i, j ≤ n} is the set of transition probabilities, where a(q_i, q_j) is the probability of entering state q_j at time t + 1 from state q_i at time t; and B = {b(q_j, v_k)} is the set of output probabilities, where b(q_j, v_k) is the probability of producing v_k when in state q_j at time t. Here, each state q_i corresponds to a single base pair of a duplicate D, and the output alphabet consists of the single nucleotides ('A', 'T', 'G', and 'C'), their probabilistic combinations (e.g. 'M' = 'A' or 'C', 'R' = 'A' or 'G', and 'N' = any nucleotide), and an empty character (null).
To initialize the background emission probabilities (b_0), we count the number of occurrences (c_k) of each alphabet symbol v_k in the template sequence (H): b_0(v_k) = c_k / c, where c = Σ_k c_k. Background transition probabilities can be initialized by assigning equal probabilities to all possible transitions of the current state.
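The initialization above can be sketched in a few lines. This is an illustrative reading, assuming a plain DNA alphabet and a simple dictionary representation; the function names are ours, not Reprever's.

```python
# Sketch: PHMM parameter initialization as described in the text.
from collections import Counter

ALPHABET = ['A', 'C', 'G', 'T']

def init_background_emissions(template):
    """b0(v_k) = c_k / c, where c_k counts symbol v_k in the template H."""
    counts = Counter(template)
    total = sum(counts[v] for v in ALPHABET)
    return {v: counts[v] / total for v in ALPHABET}

def init_background_transitions(successors):
    """Uniform probability over the transitions allowed from the current state,
    e.g. a match state M_i may move to {M_{i+1}, I_i, D_{i+1}}."""
    return {s: 1.0 / len(successors) for s in successors}

b0 = init_background_emissions("ACGTACGTAA")
print(b0['A'])  # 0.4
print(init_background_transitions(['M2', 'I1', 'D2']))
```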

Parameter training for profile Hidden Markov Model
Training a PHMM involves updating emission and transition probabilities at the sites of new observations. Assume that we have observed an alphabet symbol v_k a total of O_k times at a given match state, and let O = Σ_k O_k be the total number of observations at that state. The updated emission probability b(v_k) after O observations should satisfy the following constraints.
Expression 5 states that the emission probability should be proportional to the observation frequency plus a pseudocount. Inequality 6 limits the range of b(v_k) by a minimum value µ and a maximum value 1 − µ, where µ corresponds to the known sequencing error rate. Because the output is not a direct observation, there is always a chance of observing different symbols even when a state is perfectly trained to emit a single output. More importantly, if an emission probability of 0 (or 1) were allowed at a match state, the PHMM could not accommodate variation, because any sequence with even a single variation would never be assigned to the model. Considering these constraints, we set up the first emission function, where x, y, z, and w are constants controlling weights, and recover the initial emission probability of Equation 1 in the case of no observations. Substituting y = w − wcµ/c_k into Equation 8, using the normalization constraint of Expression 7 (where m is the total number of alphabet symbols), and applying x = (1 − mµ)y from Equation 11 to Equation 10, we rewrite the result with ρ = y/w to obtain the final update rule. Here, the only free parameter is ρ, which controls the weight between the observations (O_k) and the background information (c_k). A bigger ρ gives faster training from the same number of observations and makes the probability converge to its limit value (1 − µ) more rapidly. Transition probabilities a(q_i, q_j) can be trained in a similar way, by regarding the observations as counts of observed transitions instead of observed outputs.
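Since the display equations did not survive extraction, the sketch below implements one closed form that satisfies the stated constraints (pseudocount-proportional, bounded by µ and 1 − µ, normalized, with a single weight parameter ρ); the exact constants are our assumption, not necessarily Reprever's published formula.

```python
# Sketch: a pseudocount-based emission update satisfying the constraints in
# the text. The closed form b(v_k) = mu + (1 - m*mu) * (c_k/c + rho*O_k) / (1 + rho*O)
# is an assumed reconstruction: it normalizes to 1, stays within [mu, 1 - mu],
# and weights observations against the background via rho.

def update_emission(counts_bg, obs, mu=0.01, rho=1.0):
    """counts_bg: background counts c_k from the template H.
    obs: observation counts O_k at this match state."""
    m = len(counts_bg)
    c = sum(counts_bg.values())
    O = sum(obs.values())
    b = {}
    for v, c_k in counts_bg.items():
        o_k = obs.get(v, 0)
        b[v] = mu + (1 - m * mu) * (c_k / c + rho * o_k) / (1 + rho * O)
    return b

bg = {'A': 3, 'C': 3, 'G': 2, 'T': 2}
b = update_emission(bg, {'A': 8, 'C': 1, 'G': 1, 'T': 0}, mu=0.01, rho=0.5)
assert abs(sum(b.values()) - 1.0) < 1e-9           # probabilities normalize
assert all(0.01 <= p <= 0.99 for p in b.values())  # bounded by mu and 1 - mu
```

With a larger ρ, the dominant observed symbol 'A' pulls b('A') toward its upper limit faster, matching the convergence behavior described above.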

Training transition probabilities
To update transition probabilities, we use observed transitions a(q_i, q_j) for training, where the range of q_j depends on the type of q_i. For example, if q_i is a match state M_i, the set of possible transitions A_i is {M_{i+1}, I_i, D_{i+1}} (see Figure S11). The transition probability after observation can then be calculated from the emission update equation above by substituting a(q_i, q_j), O_ij and O_i for b(v_k), O_k and O, respectively.

Prediction of breakpoint zygosity
Consider a candidate breakpoint region, where a duplicon is inserted in the donor but not in the reference. For a homozygous insertion, we do not expect to see any concordant reads straddling the breakpoint. For a heterozygous insertion, we expect the coverage to drop by half (Figure S8).
To check zygosity, we model the clone depth distribution around breakpoints. Define λ as the expected number of clones starting at a specific location. The expected coverage at location x, C(x), is the expected number of clones starting between x − l and x, where l is the clone length (insert size): C(x) = λl. Assuming that the clone depth follows a Poisson distribution, the probability of observing k clones spanning position x with no insertion breakpoint is P(k) = e^(−λl)(λl)^k / k!. Likewise, the probability of observing k clones at a position i bp (i < l) away from the breakpoint b is P(k) = e^(−λi)(λi)^k / k!, because the clone start point must be located in [b − i − l, b − l], allowing only an i bp range instead of l. The probability of the observed data within ±l bp around the breakpoint given a zygosity z, P(obs|z), is the product of these per-position probabilities, so the posterior probability of zygosity z given the observation follows from the priors: P(z|obs) = P(obs|z)P(z) / Σ_z P(obs|z)P(z).
Finally, REPREVER LOC calls the zygosity with the maximal posterior probability.
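The zygosity call above can be sketched with a small Bayes computation. The per-zygosity Poisson means are our simplified reading (homozygous: λi near the breakpoint; heterozygous: one unaffected haplotype contributing half coverage, giving λ(i + l)/2), and the example values of λ and l are arbitrary.

```python
# Sketch: Poisson-based zygosity call around a candidate breakpoint,
# following the model in the text. The heterozygous mean lam*(i + l)/2 is
# an assumption for illustration, not Reprever's exact likelihood.
import math

def poisson_pmf(k, mean):
    return math.exp(-mean) * mean**k / math.factorial(k)

def zygosity_posterior(observed, lam, l, priors={'hom': 0.5, 'het': 0.5}):
    """observed: list of (i, k) pairs, clone counts k at i bp from the breakpoint."""
    weighted = {}
    for z in priors:
        logp = 0.0
        for i, k in observed:
            mean = lam * i if z == 'hom' else lam * (i + l) / 2
            logp += math.log(poisson_pmf(k, mean))
        weighted[z] = math.exp(logp) * priors[z]
    total = sum(weighted.values())
    return {z: p / total for z, p in weighted.items()}

post = zygosity_posterior([(5, 1), (10, 2), (20, 3)], lam=0.5, l=200)
# Counts far below the heterozygous expectation favor the homozygous call.
print(max(post, key=post.get))  # hom
```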

PCR amplification of duplicated regions:
In total, 19 primer pairs were designed for PCR amplification of the four test cases (Table S4) using NCBI Primer-BLAST (http://www.ncbi.nlm.nih.gov/tools/primer-blast/). Consensus sequences reconstructed by REPREVER SEQ were used as templates. For each consensus template, primer sequences were selected by the following criteria: 1) primer length 25 to 30 bp; 2) product size 300 to 700 bp; 3) melting temperature 57 to 63 °C; 4) 50 bp breakpoint-flanking regions excluded. For regions in known segmental duplications, the least similar sequences were used to build primers, which also limited primer quality. PCR reactions were performed on 100 ng of NA18507 genomic DNA template in a 40 µl PCR reaction consisting of 2 µl (100 ng) of each primer, 4 µl (25 mM) dNTPs, 4 µl (1X) PCR buffer, 1 µl DMSO, 0.4 µl (1.2 U) PicoMaxx High-Fidelity Polymerase (Agilent Technologies) and 25.3 µl water. Cycling conditions for the PCR reaction were as follows: 1 cycle of 94 °C for 5 minutes, followed by 30 cycles of 94 °C for 30 seconds, 62 °C for 30 seconds and 72 °C for 1 minute, and finally 1 cycle of 94 °C for 7 minutes and a 10 °C hold. 2 µl of each PCR reaction was run on a 1.5% agarose gel for visualization. Where single PCR bands were observed, products were purified using the MinElute PCR Purification Kit (Qiagen) following the manufacturer's recommended protocol. If gel purification was needed, the product was extracted from the gel and purified using the QIAquick Gel Extraction Kit (Qiagen).

Sanger sequencing of duplicons:
For each successfully amplified region, 50 ng of purified PCR product (prepared as described above) mixed with 50 ng of forward primer was submitted to The Centre for Applied Genomics' Sanger sequencing facility (Toronto, Ontario, Canada; http://www.tcag.ca/facilities/dnaSequencingSynthesis.html). Sequencing reactions were performed using Big Dye Terminator v3.1 cycle sequencing enzyme (Life Technologies) and run on a 96-capillary 3730xl DNA Analyzer (Applied Biosystems/Life Technologies).
coverage. All other parameters were set to default. Another whole-genome sequencing dataset (SRR034939, 2 × 150 bp reads with ∼350 bp insert size, 20× coverage) of the European individual NA12878 was downloaded from SRA and processed similarly. After obtaining the whole-genome mapping file (.bam), we used SAMtools to convert formats for the tools in REPREVER. For additional processing, the SAM-JDK library (ver. 1.40) was downloaded from Picard (http://picard.sourceforge.net) and imported into the REPREVER toolkit.

Candidate copy number increased regions:
Three independent CNV call sets were used as REPREVER input. First, a CNV list of 500 regions was downloaded from the online version of the study by Yoon et al. (2009). We selected the 100 regions marked "duplication" out of the 500, and further reduced these to 85 by filtering out short (< 1 kb) duplications. Second, a list of 206 regions predicted to have an absolute copy number greater than 2 (CN > 2) in the NA18507 genome was downloaded from the project website (http://www.sanger.ac.uk/research/areas/humangenetics/cnv/highres\_discovery.html) (Conrad et al., 2010). A third dataset was generated by running CNVer v0.8.1 (Medvedev et al., 2010) on the NA18507 sequencing dataset. As required by CNVer, we first re-mapped the reads using Bowtie to report all good alignments with the suggested parameters. We ran CNVer with parameter min_mps=5, thus requiring a cluster to contain at least five supporting mappings. From the initial 5325 CNVs we selected 2022 copy-number-increased regions (relative donor copy change > 0), and further reduced these to 1876 long (> 1 kb) regions.
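The filtering applied to the CNVer calls can be sketched as a simple predicate. The record layout is hypothetical; real inputs would come from the published call sets.

```python
# Sketch: candidate-region filtering as described above — keep copy-number-
# increased regions (relative donor copy change > 0) at least 1 kb long.
# The dict fields are illustrative, not CNVer's actual output format.

def filter_candidates(regions, min_len=1000):
    """regions: iterable of dicts with 'chrom', 'start', 'end', 'copy_change'."""
    kept = []
    for r in regions:
        if r['copy_change'] > 0 and (r['end'] - r['start']) >= min_len:
            kept.append(r)
    return kept

calls = [
    {'chrom': 'chr1', 'start': 100000, 'end': 102500, 'copy_change': 1},
    {'chrom': 'chr2', 'start': 5000, 'end': 5600, 'copy_change': 2},   # too short
    {'chrom': 'chr3', 'start': 7000, 'end': 12000, 'copy_change': -1}, # a loss
]
print(len(filter_candidates(calls)))  # 1
```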

Fosmid clone sequences and validated insertions
A dataset of 226 fully sequenced fosmids from NA18507 was downloaded from NCBI GenBank via the Human Genome Structural Variation Project homepage (http://hgsv.washington.edu/). We used the following procedure to discover insertion sequences contained in the fosmid set. 1) We ran Blat for each fosmid sequence against the hg18 reference genome to determine mapping sites; the best match was selected by Blat match score, constrained by the fosmid's origin. 2) For each best match, we compared the sizes of the region used in the query (f) and the match (r), and classified the fosmid match into four classes: i) normal, where |f − r| < 1000; ii) deletion, where r − f > 1000; iii) insertion, where f − r > 1000; and iv) others (e.g. no preferential match or a split match). From this step, we found 16 fosmids with potential insertion sequences (79 normal, 61 deletion, and 70 others). 3) We conducted pairwise alignment between each matched fosmid and the corresponding reference sequence using Stretcher (Rice et al., 2000) to confirm the exact insertion location and sequence. 4) Each inserted sequence was queried against the hg18 reference using Blat to check i) its novelty and ii) the type of sequence (e.g. repeat element). We found that 12/16 do not exist in the reference, confirming they are not duplicated insertions; all 4 previously validated insertions from a subset of the 226 fosmids (Kidd et al., 2008) were rediscovered here. Of the 4 remaining duplicated sequences, 2 could not be determined due to mapping ambiguity (e.g. microsatellites), leaving 2 as working examples.
Another dataset of 454 fully sequenced fosmids from NA12878 was downloaded and processed similarly, yielding 74 insertions. Of the 74, we found 7 were unambiguously duplicated (61 novel, 6 undetermined).

Comparison with split-read based algorithms
Other tools exist for detecting structural variations. Pindel and PRISM (Yue et al., 2012) are two well-known tools based on split-read algorithms. We tested these tools on calling non-tandem duplications. Pindel targets (a) large deletions (1 bp∼10 kb) and (b) medium-sized insertions (1∼20 bp). More recently (2011), the same authors announced that Pindel had been improved to target more SV types (https://trac.nbic.nl/pindel/wiki/BackgroundInformation), including (c) tandem duplications and (d) translocations; note that a translocation is also distinct from an interspersed duplication in that it does not accompany homologous regions or copy number changes. We compared Pindel and PRISM on the same NA18507 dataset. As shown in the table below, neither tool reported any breakpoints explicitly marked "interspersed duplication". In Pindel, only one PCR-validated breakpoint, "chr1:63177985", was called, as a large insertion. The PRISM paper performed the same study, detecting 784,319 indels, 172 inversions and 407 tandem duplications; however, insertion sizes were constrained to less than 100 bp, and no interspersed duplications were reported. Therefore, there is no other current tool specialized in calling the duplicons targeted in our study.