Genome-wide analyses of LINE–LINE-mediated nonallelic homologous recombination

Nonallelic homologous recombination (NAHR), occurring between low-copy repeats (LCRs) >10 kb in size and sharing >97% DNA sequence identity, is responsible for the majority of recurrent genomic rearrangements in the human genome. Recent studies have shown that transposable elements (TEs) can also mediate recurrent deletions and translocations, indicating that the features of substrates that mediate NAHR may be significantly less stringent than previously believed. Using >4 kb length and >95% sequence identity criteria, we analyzed the genome-wide distribution of long interspersed element (LINE) retrotransposons and their potential to mediate NAHR. We identified 17 005 directly oriented LINE pairs located <10 Mbp from each other as potential NAHR substrates, placing 82.8% of the human genome at risk of LINE–LINE-mediated instability. Cross-referencing these regions with CNVs in the Baylor College of Medicine clinical chromosomal microarray database of 36 285 patients, we identified 516 CNVs potentially mediated by LINEs. Using long-range PCR of five different genomic regions in a total of 44 patients, we confirmed that the CNV breakpoints in each patient map within the LINE elements. To additionally assess the scale of the LINE–LINE/NAHR phenomenon in the human genome, we tested DNA samples from six healthy individuals on a custom aCGH microarray targeting LINE elements predicted to mediate CNVs and identified 25 LINE–LINE rearrangements. Our data indicate that LINE–LINE-mediated NAHR is widespread and under-recognized, and is an important mechanism of structural rearrangement contributing to human genomic variability.

Results of the repeated PCR amplifications of junction fragments in patients with LINE-LINE-mediated genomic deletions on chromosome 5 in patient 2 and chromosome 20 in patient 3, with the original (old 1/2 and 5/6) and new (3/4 and 7/8) primers, respectively. The PCR products amplified with the new primers were ~7 kb longer than the original amplicons.
The above files are in CSV format, with the following columns:

aln len: Length of the BLAST alignment between the two mediating transposons; only cases where this is greater than 1000 are included.
distance: Genomic distance between the pair of mediating transposons in case of intrachromosomal rearrangements; meaningless otherwise.
eval: The E-value of the BLAST alignment.
gap openings: Number of gap openings in the BLAST alignment.
hsp1s: Start of the first BLAST HSP in genomic coordinates.
hsp1e: End of the first BLAST HSP in genomic coordinates.
hsp2s: Start of the second BLAST HSP in genomic coordinates.
hsp2e: End of the second BLAST HSP in genomic coordinates.
idperc: Identity percentage of the BLAST alignment; only cases where this is greater than 92.0 are included.
matches: Number of mismatches in the BLAST alignment.
orientation: Orientation of the transposon pair. 1 is directly oriented, -1 is inverted.
q end, q start, query id, s end, s start internal score, subject id
te1s: Start of the first interacting transposon.
te1e: End of the first transposon.
te1 chr: Name of the chromosome containing the first transposon.
te2s: Start of the second interacting transposon.
te2e: End of the second transposon.
te2 chr: Name of the chromosome containing the second transposon.
type: Type of NAHR event suspected to be made possible by the transposons. Valid values: DELDUP, INVERSION, TRANSLOCATION.

All listed coordinates are with respect to the HG19 genome assembly.
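As an illustration of how such a file might be queried, the following minimal Python sketch selects directly oriented pairs that satisfy the >4 kb, >95% identity and <10 Mbp criteria. The underscore-joined column names (te1_chr, idperc, and so on), the sample rows, and the function name candidate_pairs are assumptions made for this example; the actual header spelling in the files may differ, and element length is approximated here as end minus start.

```python
import csv
import io

# Hypothetical rows for illustration only; these are not data from the study.
SAMPLE = """aln_len,distance,idperc,orientation,te1s,te1e,te1_chr,te2s,te2e,te2_chr,type
5900,250000,97.1,1,1000000,1006000,chr5,1250000,1256100,chr5,DELDUP
1200,50000,93.0,1,2000000,2001500,chr5,2050000,2051400,chr5,DELDUP
6000,300000,96.5,-1,3000000,3006000,chr20,3300000,3306000,chr20,INVERSION
"""


def candidate_pairs(handle, min_len=4000, min_identity=95.0,
                    max_distance=10_000_000):
    """Yield rows describing directly oriented LINE pairs that pass the filters."""
    for row in csv.DictReader(handle):
        # Approximate each element's length from its start and end coordinates.
        len1 = int(row["te1e"]) - int(row["te1s"])
        len2 = int(row["te2e"]) - int(row["te2s"])
        if (row["type"] == "DELDUP"
                and int(row["orientation"]) == 1
                and min(len1, len2) > min_len
                and float(row["idperc"]) > min_identity
                and int(row["distance"]) < max_distance):
            yield row


selected = list(candidate_pairs(io.StringIO(SAMPLE)))
for row in selected:
    print(row["te1_chr"], row["te1s"], row["te2s"], row["idperc"])
```

To run the same filter on a real file, the in-memory sample would be replaced by an opened file handle.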

LINES IN THE HUMAN GENOME
There are 1,498,692 LINE elements annotated in the HG19 assembly of the human genome. Most of these are short, fragmentary copies, the shortest being 11 bases long. We decided to focus our analysis on the longer elements: over 4000 base pairs for wet-lab analysis, and over 1000 base pairs for bioinformatics.

Figure S5. Histogram of LINE element lengths greater than 1 kb found in the human genome. The cluster around 6 kb corresponds to full-length LINE elements.
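The length thresholds above can be illustrated with a small Python sketch that bins element lengths above 1 kb, in the spirit of the histogram in Figure S5. The function name, the 500 bp bin width and the example lengths are assumptions chosen for illustration, not values from the study.

```python
from collections import Counter


def length_histogram(lengths, min_len=1000, bin_width=500):
    """Count elements longer than min_len, grouped by the lower edge of their bin."""
    counts = Counter()
    for length in lengths:
        if length > min_len:
            counts[(length // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical element lengths in base pairs.
example_lengths = [11, 950, 1200, 1400, 4100, 6020, 6050, 6100]
histogram = length_histogram(example_lengths)
wet_lab_candidates = [n for n in example_lengths if n > 4000]
print(histogram)
print(wet_lab_candidates)
```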

ALGORITHMIC PREDICTION OF BREAKPOINTS FROM SEQUENCING
For each pair of LINEs, a consensus sequence was computed, and a custom version of the Needleman-Wunsch algorithm (9), modified to compute a semi-global alignment, was used to align the Sanger reads to the consensus. An artificial sequence containing the information about sequence cis-morphisms was computed for each case (Fig. S6). Then, the sequences were analyzed with a Hidden Markov Model (10) trained using a custom version of the Baum-Welch algorithm (11). The HMM has 5 hidden states, S_0, S_1, ..., S_4; the input alphabet is {S, N, L, R, E}; and the structure of the HMM is shown in Figure S7. The modified algorithm differs from the standard version in that it enforces two constraints during training. The first ensures that the model does not favour placement of breakpoints near the beginning or end of alignments merely because the training data happen to be skewed that way. The second assumes that SNVs with respect to the reference sequence, which would make the source LINE ambiguous (such as Fig. S6, location 5) or even suggest the wrong LINE (location 6), are equally likely to occur on either side of the breakpoint.
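The semi-global alignment step can be sketched as follows. This is a textbook variant of Needleman-Wunsch in which the read must be aligned end to end while unaligned consensus bases before and after the read are not penalized; it is not the custom implementation used in the study, and the scoring parameters and function name are assumptions.

```python
def semiglobal_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Align all of `read` to a substring of `ref`; return (score, start, end).

    `start` and `end` delimit the aligned region of `ref` as a 0-based,
    half-open interval.
    """
    n, m = len(read), len(ref)
    # score[i][j]: best score aligning read[:i] to a suffix of ref[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # read bases before the reference starts cost gaps
    # Row 0 stays at zero: skipping a prefix of the reference is free.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two bases
                              score[i - 1][j] + gap,      # read base vs gap
                              score[i][j - 1] + gap)      # ref base vs gap
    # Free trailing reference: take the best cell in the last row.
    end = max(range(m + 1), key=lambda j: score[n][j])
    # Trace back to find where the alignment starts in the reference.
    i, j = n, end
    while i > 0:
        sub = match if j > 0 and read[i - 1] == ref[j - 1] else mismatch
        if j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score[n][end], j, end


result = semiglobal_align("ACGT", "TTTACGTTTT")
print(result)
```

The quadratic table is adequate for Sanger-length reads against a LINE-length consensus; longer inputs would call for a banded or otherwise optimized variant.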
The prior and posterior values for chain parameters are as follows:

The model with parameters obtained from the Baum-Welch algorithm was then used to compute the posterior probabilities of transition from the S_1 state to the S_2 state at all locations, which correspond to the probability that the NAHR cross-over event occurred at each location. These were computed using a custom version of the forward-backward algorithm (12), in which the observation matrices corresponding to the L and R emissions were replaced with an affine combination of the matrices for L and R, with weights based on the PHRED quality score (13,14) of the sequence from which the L or R signals originated. The posterior probabilities were calculated, and in most cases a single location of the breakpoint was obtained. The computed locations were later confirmed by visual inspection using Sequencher software.

Figure S6. Construction of the input sequence for estimation of the NAHR breakpoint location. In the artificial sequence, S and E are special markers for the beginning and end of the sequence, L means that the observed sequence seems to come from the left (first) LINE, R means that it comes from the right (second) one, and N means that the source LINE cannot be determined from this location.

Table 1.
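The posterior computation described above can be illustrated with a deliberately simplified Python sketch. It uses a two-state chain (left LINE, then right LINE) instead of the five-state model of Figure S7, omits the S and E markers, and uses invented emission and transition values; the particular PHRED weighting (mixing the L and R likelihoods by the base-call error probability) is one plausible reading of the affine combination and is an assumption, as are all names below.

```python
def phred_to_error(q):
    """Convert a PHRED quality score to a base-call error probability."""
    return 10.0 ** (-q / 10.0)


# Assumed emission probabilities; state 0 = left LINE, state 1 = right LINE.
EMIT = [
    {"L": 0.09, "R": 0.01, "N": 0.90},
    {"L": 0.01, "R": 0.09, "N": 0.90},
]


def likelihood(state, symbol, qual):
    """Observation likelihood; L and R are blended according to read quality."""
    if symbol == "N":
        return EMIT[state]["N"]
    other = "R" if symbol == "L" else "L"
    e = phred_to_error(qual)
    return (1.0 - e) * EMIT[state][symbol] + e * EMIT[state][other]


def crossover_posterior(symbols, quals, t=0.01):
    """Posterior probability that the left-to-right switch follows position i.

    The chain is assumed to start in the left state and end in the right
    state, so the returned probabilities sum to one.
    """
    n = len(symbols)
    if n < 2 or len(quals) != n:
        raise ValueError("need at least two symbols and one quality per symbol")
    trans = [[1.0 - t, t], [0.0, 1.0]]  # the right state is absorbing
    lik = [[likelihood(s, symbols[i], quals[i]) for s in (0, 1)]
           for i in range(n)]
    # Forward pass (no scaling; fine for short inputs, underflows on long ones).
    alpha = [[0.0, 0.0] for _ in range(n)]
    alpha[0] = [lik[0][0], 0.0]
    for i in range(1, n):
        for s in (0, 1):
            alpha[i][s] = lik[i][s] * sum(alpha[i - 1][r] * trans[r][s]
                                          for r in (0, 1))
    # Backward pass, conditioned on finishing in the right state.
    beta = [[0.0, 0.0] for _ in range(n)]
    beta[n - 1] = [0.0, 1.0]
    for i in range(n - 2, -1, -1):
        for r in (0, 1):
            beta[i][r] = sum(trans[r][s] * lik[i + 1][s] * beta[i + 1][s]
                             for s in (0, 1))
    z = alpha[n - 1][1]
    return [alpha[i][0] * trans[0][1] * lik[i + 1][1] * beta[i + 1][1] / z
            for i in range(n - 1)]


posterior = crossover_posterior("LNLLRNRR", [40] * 8)
best = max(range(len(posterior)), key=posterior.__getitem__)
print(best, [round(p, 3) for p in posterior])
```

In this toy input the most probable cross-over lies between the last L and the first R. Lowering the quality of an L or R call moves its likelihood toward that of the opposite symbol, which reduces the influence of that position on the estimate.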
GCAGGGCACAGACAAACAAAAGGCAGCAGTAGCCTCTGCAGACTTAAATG  TCCCTGTCTGACAGCTTTGAAGAGAGCAGTGGTTCTCCCAGCACGCAGCT  GGAGATCTGAGAACGGGCAGACTGCCTCCTCAAGTGGGTCCCTGACCCCT  GACCCCCGAGCAGCCTAACTGGGAGGCACCCCCCAGCAGGGGCACACTGA  CACCTCACACGGCAGGGTATTCCAACAGACCTGCAGCTGAGGGTCCTGTC  TGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAACCCA  TCTGTACATTACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGAT  GGGGAAAAAACAGAACAGAAAAACTGGAAACTCTAAAACGCAGAGCGCCT  CTCCTCCTCCAAAGGAACGCAGTTCCTCACCAGCAACAGAACAAAGCTGG  ATGGAGAATGACTTTGACGAGCTGAGAGAAGAAGGCTTCAGACGATCAAA  TTACTCTGAGCTACGGGAGGACATTCAAACCAAAGGCAAGGAAGTTGAAA  ACTTTGAAAAAAATTTAGAAGAATGTATAACTAGAATAACCAATACAGAG  AAGTGCTTAAAGGAGCTGATGGAGCTGAAAACCAAGGCTCGAGAACTACG  TGAAGAATGCAGAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAAAGGG  TATCAGCGATGGAAGATGAAATGAATGAAATGAAGCGAGAAGGGAAGTTT  AGAGAAAAAAAGAATAAAAAGAAATGAGCAAAGCCTCCAAGAAGTATGGG  ACTATGTGAAAAGACCAAATCTACGTCTGATTGGTGTACCTGAAAGTGAT  GGGGAGAATGGAACCAAGTTGGAAAACACTCTGCAGGATATTATCCAGGA  GAACTTCCCCAATCTAGCAAGGCAGGCCAACGTTCAGATTCAGGAAATAC  AGAGAACGCCACAAAGATACTCCTTGAGAAGAGCAACTCCAAGACACATA  ATTGTCAGATTCACCAAGGTTGAAATGAAGGAAAAAATGTTAAGGGCAGC  CAGAGAGAAAGGTCGGGTTACCCTCAAAGGGAAGCCCATCAGACTAACAG  TGGATCTCTCAGCAGAAACCCTACAAGCCAGAAGAGAGTGGGGGCCAATA  TTCAACATTCTTAAAGAAAAGAATTTTCAACCCAGAATTTCATATCCAGC  CAAACTAAGCTTCATAAGTGAAGGAGAAATAAAATACTTTACAGACAAGC  AAATGCTGAGAGATTTTGTCACCACCAGGCCTGCCCTAAAAGAGCTCCTG  AAGGAAGCGCTAAACATGGAAAGGAACAACCGGTACCAGCCGCTGCAAAA  TCATGCCAAAATGTAAAGACCATCGAGGCTAGGAAGAAACTGCATCAACT  AACGAGCAAAATCACCAGCTAACATCATAATGACAGGATCAAATTCACAC  ATAACAATATTAACTTTAAATGTAAATGGACCAAATGCTCCAATTAAAAG  ACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGTGCTGTA  TTCAGGAAACCCATCTCACGTGCAGAGACACACATAGGCTCAAAATAAAA  GGATGGAGGAAGATCTACCAGGCAAATGGAAAACAAAAAAAGGCAGGGGT  TGCAATCCTAGTCTCTGATAAAACAGACTTTAAACCAACAAAGATCAAAA  GAGACAAAGAAGGCCATTACATAATGGTAAAGGGATCAATTCAACAAGAA  GAGCTAACTATCCTAAATATATATGCACCCAATACAGGAGCACCCAGATT  CATAAAGCAAGTCCTGAGTGACCTACAAAGAGACTTAGACTCCCACACTT  TAATAATGGGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACG  
AGACAGAAAGTCAACAAGGATACCCAGGAATTGAACTCAGCTCTGCACCA  GGTGGACCTAATTGACATCTACAGAACTCTCCACCCCAAATCAACAGAAT  ATACATTTTTTTCAGCACCACACCACAGCTATTCCAAAATTGACCACATA  CTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAAGAACAGACATTATAAC  AAACTATCTCTCAGACCACAGTGCTATCAAACTAGAACTCAGGATTAAGA  ATCTCACTCAAAACCGCTCAACTACATGGAAACTGAACAACCTGCTCCTG  AATGACTACTGGATACATAACGAAATGAAGGCAGAAATAAAGATGTTCTT  TGAAACCAACGAGAACAAAGACACAACATACCAGAATCTCTGGGACGCAC  TCAAAGCAGTGTGTAGAGGGAAATTTATAGCACTAAATGCCCACAAGAGA