The genomic landscape of polymorphic human nuclear mitochondrial insertions

The transfer of mitochondrial genetic material into the nuclear genomes of eukaryotes is a well-established phenomenon that has been previously limited to the study of static reference genomes. The recent advancement of high throughput sequencing has enabled an expanded exploration into the diversity of polymorphic nuclear mitochondrial insertions (NumtS) within human populations. We have developed an approach to discover and genotype novel Numt insertions using whole genome, paired-end sequencing data. We have applied this method to a thousand individuals in 20 populations from the 1000 Genomes Project and other datasets and identified 141 new sites of Numt insertions, extending our current knowledge of existing NumtS by almost 20%. We find that recent Numt insertions are derived from throughout the mitochondrial genome, including the D-loop, and have integration biases that differ in some respects from previous studies on older, fixed NumtS in the reference genome. We determined the complete inserted sequence for a subset of these events and have identified a number of nearly full-length mitochondrial genome insertions into nuclear chromosomes. We further define their age and origin of insertion and present an analysis of their potential impact to ongoing studies of mitochondrial heteroplasmy and disease.


INTRODUCTION
The presence of mitochondrial DNA in the nuclear genomes of eukaryotes has been well established, and recent reports have shown that this transfer of genetic material is an ongoing evolutionary process (1)(2)(3)(4)(5). In humans, these nuclear insertions of mitochondrial origin (NumtS) have been estimated to occur at a rate of ∼5 × 10 −6 per germ cell per generation (6) and have been implicated directly in a number of genetic disorders (7)(8)(9)(10)(11) while also indirectly hindering studies of mitochondrial diseases (12). A total of 755 NumtS have been identified in version hg19 of the human reference genome (13), although some portion of these have likely arisen through the duplication of previously inserted NumtS. These fragments range in size from 39 bp to almost the entire mitochondrial sequence and are thought to integrate themselves through a nonhomologous end joining mechanism during double-strand break repair based on studies in yeast (14,15). Over evolutionary time, many have been highly modified due to inversions, deletions, duplications and displaced sequences, but some remain very well conserved relative to their parent mitochondria genome. While these fragments appear to be randomly selected from different regions of the mitochondria, an under-representation of the D-loop region has been reported, though why this is observed is currently unknown (16).
Several studies have previously looked at the enrichment of Numt insertions found in the human reference genome assembly relative to different genomic features. Some reports have suggested that Numt insertions tend to colocalize with repetitive elements (16,17), while others have found them to be under-represented (18). Some groups have further shown an under-representation of repetitive elements nearby NumtS in humans but not flanking Numt insertions found in chimpanzees (19). In addition, there is evidence that numts preferably insert into open chromatin regions, typically near A + T oligomer sequences (16). As these studies are primarily based on older, fixed insertions in the human lineage, it is possible that they may have been confounded by evolutionary mutational processes that have occurred since the time of these insertions. As such, an investigation into more recent insertions is warranted in order to determine any insertion biases that may lead to a greater understanding of how this transfer of genetic material occurs.
Another important aspect of NumtS is their potential effect on studies of mitochondrial heteroplasmy, which are cell or tissue level differences in individual mitochondrial genomes due in part to the high rate of mutation within these sequences (20). Low levels of heteroplasmy are typical in healthy individuals, and recent reports have deter-mined that each person carries between 1 to 14 heteroplasmies (4,16,(21)(22)(23)(24)(25)(26). However, higher levels of heteroplasmy have been implicated in aging and various diseases such as Leber's hereditary optic neuropathy, diabetes, deafness and even cancer (27)(28)(29)(30). The presence of NumtS can confound the study and diagnosis of these diseases through the mistaken identification of nuclear-specific Numt mutations as heteroplasmy (12,31). Computational and molecular approaches have been developed to help reduce the effect of NumtS on these studies (25,26,(32)(33)(34), but they only make use of known NumtS already present in the reference sequence and do not take into account recent insertions which may be still prevalent in a significant portion of the population.
In contrast to the many studies that have utilized NumtS present in the reference sequences of many species, there has been comparably little exploration into the landscape of polymorphic NumtS in humans (35,36) and the largest such investigation to date has identified only 14 segregating events through investigation of the 1000 Genomes Project INDEL catalog (37). While rigorous, this analysis was limited in its ability to find novel insertion polymorphisms not present in the human genome reference due to the size of the sequence reads in which the variants could be discovered, resulting in the identification of only four such events. Here, we describe a new method, dinumt, for identifying numt insertions in whole genomes sequenced using paired-end sequencing technology, thus allowing for a greater sensitivity in identifying Numt variants of all sizes. We applied this method to 999 individuals from the 1000 Genomes (38) and Human Genome Diversity Project (HGDP) (21,39) projects and conducted an updated enrichment analysis in humans using these polymorphic insertions. We further sequenced a subset of the polymorphic NumtS we discovered and examined them for their age, origin and sequence characteristics, and assessed their potential impact on ongoing studies of mitochondrial heteroplasmy.

Data sources
Whole genome sequences were generated as a part of Phase I of the 1000 Genomes Project (http://www.1000genomes. org) with an average 4-6X sequence coverage and from the CEPH-HGDP (SRA: SRP036155) (21,39) with a higher average coverage of 5-20X. Alignments to version GRCh37/hg19 of the human reference genome were provided in BAM format and optimized using the Genome Analysis Toolkit (GATK) (McKenna A, 2010) and Picard (http://picard.sourceforge.net/), as described elsewhere (38,39).

Detection of Numt insertions
Non-reference NumtS were discovered in paired-end, whole genome sequences using a newly developed software package named dinumt, as outlined in Figure 1. This approach first derives an empirical insert size distribution from the observed alignment positions of each read pair and then identifies sequences where one end aligns to either the mtDNA or a known reference Numt and the other maps  [1] Identification and filtration of paired reads with one read anchored to the mtDNA and another mapping to a nuclear chromosome; [2] Clustering and linking of nearby mapped nuclear reads together using insert size information; [3] Localization of insertion breakpoint using cluster distribution and truncated read alignments. (B) Example of Numt insertion (Poly NumtS 2541) in sample HGDP00856 (top) compared to sample HGDP002222 (no insertion, bottom) on chromosome 8 as displayed in the IGV Browser. Sequences are represented by blocks and colored by the alignment of their mate sequence to the canonical location (gray), mitochondria genome (teal), or reference Numt homolog on chromosome 1 (blue). Multi-color bars represent split-reads whereby a portion of the sequence aligns to another location in the genome and are indicative of structural genomic breakpoints. elsewhere in the nuclear genome. Individual sequence reads that do not align uniquely to the nuclear genome (MAPQ < 10) are discarded, removing ambiguity from insertion sites but also limiting the ability of this approach to identify NumtS in highly repetitive sequences. These sequences are then clustered together based on their shared mapping orientation (forward or reverse) and whether or not they are within a distance of W L from each other, where W L is calculated as the derived mean insert length + 3 x insert standard deviation. Clusters are further linked together if they are within a distance of 2 x W L from each other and are in the correct orientation relative to each other (forward to reverse). Individual sequence reads are then examined within the clusters to identify soft-clipped reads with breaks at the same position to identify putative breakpoint locations. The likelihood of an insertion is then calculated as where m is the ploidy, l is the total number of reference supporting reads, k is the total number of insertion supporting reads, and e is the mapping error for read j, as modified from (40). Putative insertions were then filtered for quality (at least 50 using Phred scaled maximum insertion likelihood of non-reference allele), the number of total reads supporting the insertion (at least 4) and the depth of total coverage at the insertion point (at least 5).
Identified Numt insertions were then genotyped across the entire set of samples to identify sites which may have been previously missed or filtered in those individuals during the discovery step as well as to determine the copy number of the insertion. This was done by systematically examining each sample at an insertion location for clusters of reads supporting an insertion. In order to refine breakpoint positions, these clusters are used to search for positions where soft-clipped reads are consistently broken at the same location across samples and where the longest unaligned portion of such reads at that position map to the mtDNA reference genome. These refined positions are then utilized to determine how many sequences overlap the insertion location in an unbroken manner, supporting the reference allele, and how many contribute to the inserted Numt. These are then tabulated, combined with the read pair information and scored in a similar fashion as the discovery step above. However, the reported genotype here reflects the overall maximum likelihood of that genotype (0/0, 0/1, 1/1) and not just that of an insertion (as above). Global genotypes are then used to construct population level priors that are then applied in a Bayesian fashion for further likelihood calculations and iterated until convergence or a maximum of 10 iterations utilizing an Expectation-Maximization schema.

Validation and sequencing experiments
NumtS identified by computational analysis were validated by polymerase chain reaction (PCR) and Sanger sequencing of amplicon(s) that spanned 50-500 bp of gDNA flanking the insert, the breakpoint between the gDNA and the insert. Primer sets that hybridize to the gDNA flanking the insert were designed using Primer3 Software (http://www.genome. wi.mit.edu/cgi-bin/primer/primer3 www.cgi) and amplification was done with Platinum Taq (Invitrogen Life Technologies, Gaithersburg, MD), Picomaxx (Agilent Technologies, Palo Alto, CA, USA) or LongAmp (New England Biolabs, Beverly, MA, USA) products in a 20-50 l reaction volume containing 50 ng of template DNA, 1-uM primer and 1.5-mM MgCl 2 if not supplied in the PCR buffer. Thermocycling was done for 30 cycles at 56-67 • C annealing temperature and 1-15 min extension time. For inserts <3 kb, a PCR product of the predicted size was identified in individuals homozygous or heterozygous for the insert by agarose gel electrophoresis and the insert was sequenced in one individual. Amplicons of interest were purified from a PCR reaction for homozygous individuals (Qiaquick PCR purification kit, Qiagen, Valencia, CA, USA) or isolated from the gel for heterozygous individuals (Qiaquick Gel Extraction Kit, Qiagen) and sequenced at the University of Michigan Sequencing Core. For inserts larger than 3 kb, a PCR product of the predicted size was identified in individuals heterozygous for the insert by gel electrophoresis. For sequencing, two overlapping PCR products were made using primer sets designed as outlined above with one primer that binds in the gDNA flanking the insert and one primer that binds in the middle of the insert. Amplicons were purified from PCR reactions as outlined above and sequenced by primer walking at the University of Michigan Sequencing Core. Five loci failed initial validation efforts, likely due to a greater than predicted insertion size and uncertainty in the insertion breakpoints. For these loci, we performed a local assembly of the supporting reads using CAP3 (41) and designed additional PCR primers flanking the genomeinsertion junction.

Enrichment analysis
We analyzed the genomic context of the regions flanking the Numt insertion positions for various characteristics, including genes, %GC content, open chromatin regions, repetitive elements, CpG Islands and microsatellites. With the exception of %GC and AT dinucleotide calculations, which were derived from the reference sequence itself using a custom PERL script, all datasets were downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/) in BED format (see Supplementary Table S3 for the specific tables used). We then performed a two-tailed permutation test by resampling 1000 sets of random positions matched to our insertion set to determine whether they were significant enriched or depleted for each feature.

Phylogenetics and age estimation
An inferred ancestral mitochondria sequence was obtained from ENSEMBL Compara Release 71 based on alignment of six primate species (42). A profile of nucleotide changes was obtained by aligning this sequence to the mtDNA genome from current human reference hg19 using MEGA v5.2.2 with the Muscle algorithm (http://www. megasoftware.net). The age of each NumtS insertion and human-specific reference NumtS (>300 bp) was calculated by aligning each sequence to the previously aligned ancestral and modern mtDNA sequence. We tabulated the total number of sites in the aligned region where the ancestral and modern mitochondrial sequences differ and counted how often the NumtS sequence matched the modern human allele. We used the resulting allele matching ratio as an estimate of the point along the human lineage where the insertion occurred.

Haplotyping and analysis of heteroplasmy
Single nucleotide polymorphism (SNP) profiles for each inserted sequence were generated using mtDNAprofiler (44). The identified variants were then annotated using Haplogrep (45) to identify potential haplotypes of origin. These SNP profiles were also used to assess potential heteroplasmy artifacts by direct comparison with reported events (22,24,25,33,(46)(47)(48). To be considered a match, both the position and the Numt allele must match what was reported and sample specific information was also indicated where available (22).

Software and data availability
The genomic locations and sequences for the identified NumtS are provided as Supplementary Data to this manuscript. Sequences of mitochondria insertions and immediate flanking regions have been submitted to GenBank (Accession: KM281512-KM281534). The software package dinumt is available for download at https://bitbucket. org/remills/dinumt.

Detecting Numt insertions in whole genome sequences
Our discovery approach identifies clusters of read pairs mapping to both the nuclear and mitochondrial genomes and then examines these regions for potential insertion events (Figure 1). This is similar in principle to earlier strategies designed to identify insertions of novel genetic material by finding clusters of one-end anchored reads (23,49) and to discovering mobile element insertions (50)(51)(52)(53)(54), but requires that reads map to either the mitochondria or known reference NumtS to report a putative insertion (see 'Materials and Methods' section). Nearby clusters are grouped together based on their orientation and distance from each other and the surrounding region is examined for split reads that map partially to both the chromosome and mitochondria, indicating the precise molecular breakpoint of the insertion. These sites can then be systematically genotyped across the entire sample set using a statistical framework similar to that previously developed for SNPs (40) to determine the insertion copy number in each individual.
We applied our method to 946 low coverage, whole genomes that were sequenced in Phase 1 of the 1000 Genomes Project (38) as well as 53 additional genomes sequenced to higher coverage from the HGDP (21,39) and were able to identify 141 polymorphic nuclear insertions of mitochondrial origin among all nuclear chromosomes except chrY, (Figure 2), including three which had been previously characterized (37). On average, ∼1.5 non-reference NumtS were seen in each sample, which showed a statistically significant but modest correlation with coverage (r 2 = 0.12, Supplementary Figure S1) suggesting that although some NumtS may be missed due to low coverage, they are likely minimal. These insertions were fairly evenly distributed among all 20 different populations assessed (Supplementary Figures S2 and S3).
We next assessed the overall accuracy of our approach. Using PCR, we were able to verify 23/24 of the predicted Numt insertion sites in the HGDP samples in which they were discovered and a further 17/18 from a subset of the lower coverage 1000 Genomes samples ( Figure 3A, Supplementary Table S1), with the events that we did not validate occurring either in segmental duplications or having uncertain breakpoints with potentially large insertion sizes, making them difficult to conclusively verify. Additional validation with PCR panels across multiple samples showed a concordance for 713/748 (95.3%) of our predicted allele genotypes ( Figure 3B). We further validated that these were indeed of mitochondrial origin and not post-insertion duplications by Sanger sequencing through the breakpoints for 23 events (Supplementary Table S2. These results sug- gest that dinumt is able to accurately discover Numt insertions in whole genome sequence data.

Characteristics and enrichment of Numt insertions
We conducted an analysis to confirm whether the insertion positions of these recent NumtS co-localized with specific genomic features, as has previously been assessed with the older, fixed events present in the reference genome (Supplementary Table S3). Using a series of permutation tests, we found no enrichment in regions containing CpG islands, microsatellites and other types of structural variants (P > 0.05). Most Numt insertions were in intronic (42%) and intergenic regions (43%), consistent with expectations from random sampling, and we observed no insertions into coding exons. NumtS were also found in the 5 and 3 UTR's as well as promoter and terminator regions (5 kbp up and downstream, respectively), albeit at a much lower frequency. Consistent with previous reports, we found a significant enrichment near repetitive regions (P ≤ 0.004) (16) and an insertion preference for slightly higher%GC regions overall (Supplementary Figure S4).
Interestingly, we neither observed enrichment for A + T oligomers immediately adjacent to the polymorphic insertions (Supplementary Figure S5) nor a preference for open chromatin regions in the cell lines we investigated (P > 0.05), as had been previously reported (16). This enrichment was prevalent even when limited to those NumtS for which we had validated breakpoints. To verify the consistency of our permutation approach, we applied the same analysis to the 610 reference NumtS described in Tsuji et al. as well as the human-specific NumtS described in Lang et al. (37) and were able to replicate their results. This difference at the insertion sites could represent a bona fide change in the integration mechanism for recent NumtS, but may also be an artifact from the way reference NumtS have been annotated relative to the mitochondria genome sequence. It is also possible that this is reflective of a technical difference between the identification of NumtS from short-read, next generation sequences and those found in longer capillary sequences from which the reference genome was constructed, as the latter would have fewer potential biases due to alignment artifacts in repetitive regions.

Analysis of polymorphic Numt sequences
While the precise insertion location for each Numt is informative, the underlying sequence itself can provide additional information. Using either direct Sanger sequencing of PCR products or subsequent primer walking and assembly for larger insertions (see 'Materails and Methods' section, Supplementary Table S1), we were able to determine the sequence for 23-Numt insertions (Supplementary  Table S2). Most NumtS sequences were small (<500 bp), however we did identify a number of larger events including an almost complete mitochondrial genome insertion of 16 106 bp in sample HGDP01275. We observed fragments originating from all parts of the mitochondrial genome, including multiple sequences overlapping the D-loop region ( Figure 2C) which had been previously reported as underrepresented in the human reference (16). Interestingly, these Nucleic Acids Research, 2014, Vol. 42, No. 20 12645 polymorphic sequences exhibited a higher%GC than that of their parent mitochondria genome (47 versus 44%) and also showed similar characteristics to other human-specific NumtS in the reference genome (Supplementary Figure S6). In contrast, non-human specific NumtS in the reference showed a markedly lower%GC that is more consistent with the average nuclear genome%GC of 41.5%.
Using these sequences, we estimated when these insertions occurred in the human lineage by comparing the human mtDNA reference sequence to an inferred ancestral mitochondrial sequence and identifying diagnostic mutations that matched specific positions in each Numt sequence (see 'Materials and Methods' section). We then used the fraction of alleles that matched those from the modern human mitochondrial sequence to derive an approximate age for each insertion, relative to an estimated human-chimpanzee divergence time of 6 million years (Table 1). We observed that most of the polymorphic insertions occurred within the past 1 million years, however there were six NumtS that were markedly older, including two that appear to have inserted over 2.5 million years in the past. We next constructed maximum-likelihood trees to compare fixed human-specific NumtS present in the reference genome ( Figure 4A) with the discovered polymorphic NumtS ( Figure 4B). As expected, the ongoing polymorphisms co-localized with the human lineage while the fixed events were likely inserted further back in time.

Assessment of Numt impact on heteroplasmy
Although many recent studies of mitochondrial heteroplasmy have taken NumtS into consideration (22,24,25,33,(46)(47)(48), they have all been limited to those insertions present in the reference sequence. To assess the potential impact of more recent insertions, we compared our set of sequenced polymorphic NumtS to these studies by identifying single nucleotide differences in the Numt insertions relative to the mtDNA reference and comparing the allelic changes to those reported ( Table 2). We identified 59 positions of possible Numt confounding, most of which occur in polymorphic insertions common in the general human population (MAF > 0.01). The samples used in most of these studies differ from those analyzed here, making direct inferences regarding these effects of NumtS difficult. However, one study (22) had an intersecting set of individuals with our analysis, and we were able to determine that there were eight positions within NumtS that were genotyped in those samples and had alleles matching the reported heteroplasmy.

DISCUSSION
Almost every eukaryotic species that has had their genome fully sequenced to date has exhibited evidence for the transfer of organelle DNA to the nuclear genome. These NumtS occur in both animals and plants, and show a strong correlation with genome size and the total number of NumtS observed (1). In humans, there are ∼755 annotated NumtS in the reference genome (13), though this number is variable depending on the methods and parameters used to identify their presence. However, very few of these are due to recent Fixed NumtS were previously identified as human-specific (19) and are present in the human reference sequence (hg19) as well as every 1000 Genomes sample assessed (38). Polymorphic sequences were chosen from among the longest insertions that were identified. Bootstrap values are indicated at branch locations.
insertions and almost all NumtS that have been identified are present in every human genome. Indeed, only 14 events differentially present in human populations have been previously reported (37).
Here, we present a large-scale analysis of polymorphic mitochondrial insertions into the nuclear genome of humans. These recent Numt insertions share many characteristics with previously identified human-specific NumtS that are fixed in the genome, including their patterns of integration within the genome and sequence composition. We identified many NumtS that contain the mitochondrial Dloop, a noncoding region of the mitochondria that controls the synthesis of DNA and RNA within the organelle and typically exhibits a higher mutation rate than the rest of the genome. This region is often used in forensic and population genetics due to its two hypervariable regions that provide distinguishing polymorphisms between individuals  (55,56) and has been previously found to be depleted among NumtS present in the human reference sequence (16). Previous studies have found little effect of existing, older NumtS on these types of assays (57), but have not taken into account these more recent insertions that are more likely to cause off-target amplification and erroneous conclusions. Our methods and datasets should thus provide a useful resource in these types of analyses.
There are a number of limitations to our approach, many of which are shared with other general strategies for identifying genetic variation using next generation sequencing. The depth to which an individual genome is sequenced can have an effect on variant detection accuracy (58), leading to potentially missed polymorphisms. Our results showed a modest correlation between sequence coverage and the number of insertions identified, so it is likely that a small number of NumtS were missed due to the use of low coverage genomes in our analysis. It is also very challenging to identify variation in highly repetitive sequences, and our analytical filters preclude the identification of NumtS in such regions of the genome. As such, it is possible that the results of our enrichment analysis are biased due to their omission. This is a technical obstacle, however, and should be resolved as future sequencing technologies are established.
The polymorphic nature of these insertions within the human lineage indicates that they have likely occurred since the most recent common ancestor between humans and chimpanzees, however it is unknown where they are relative to other species of humans such as Neanderthals or Denisovans. We thus attempted to date our insertions both through direct comparison with a consensus ancestral mito-chondrial as well as through phylogenic analysis and found that most were integrated into the nuclear genome within the past 100 000 years, although the small number of substitutions in many fragments makes their exact age difficult to be precisely determined. Over half of the NumtS we discovered were present in very low frequencies across the samples we interrogated (MAF < 0.1%), suggesting that they were likely integrated even more recently than the resolution of our analysis would allow. This supports the theory that mitochondria gene transfer to the human nuclear genome is ongoing and prevalent (1,3).
NumtS have been previously implicated in a number of sporadic disease cases through their integration into functional regions of the genome (8-10). While we did not identify any NumtS that would directly affect the coding region of a gene, we did identify a Numt insertion in a single individual that represented an almost entire insertion of a mitochondrial genome into chromosomal DNA ( Figure 5). This insertion was 16 106 bp in size and integrated into a potential regulatory region in the first intron of the SDC2 gene (59), a member of the syndecan family that encodes an integral membrane protein and has been associated with cell proliferation and migration, including altered expression in several cancer cells (60,61). We investigated whether this insertion may have had an effect on the expression of this gene by looking at recently published RNA-Seq data over these same samples (39), however this gene is not expressed in the tissues which were used in that analysis and so we are unable to draw any conclusions regarding its potential impact. It is tempting to speculate, however, that an insertion that is highly enriched for functional regions could indeed affect Nucleic Acids Research, 2014, Vol. 42, No. 20 12647 (22). It should be noted that (5) identified heteroplasmy in RNA sequences while the remaining studies were from DNA. * Matches sample specific heteroplasmy allele as reported in (22). canonical gene structure and expression, and ongoing studies in individual tissues from projects such as GTEx (62) and others may provide they keys for further investigating these types of events. Finally, we explored the potential effect of our NumtS on studies of mitochondrial heteroplasmy and identified a number of positions within the mitochondrial genome that could be erroneously attributed to mutations in NumtS. It is possible that these heteroplasmies are prevalent and the allelic changes in the NumtS occurred prior to their insertion, and indeed a recent study using RNA-Seq expression data to examine heteroplasmy identified a number of these same positions (47). While it is possible that NumtS may be expressed at some low level in the nucleus, it is more likely that the reported heteroplasmies are bona fide mitochondrial differences. However we believe that our set of genotypes and insertion sequences will be a useful resource for future studies into mitochondrial heterogeneity.