-
PDF
- Split View
-
Views
-
Cite
Cite
Yijing Wang, Guihu Zhao, Zhenghuan Fang, Hongxu Pan, Yuwen Zhao, Yige Wang, Xun Zhou, Xiaomeng Wang, Tengfei Luo, Yi Zhang, Zheng Wang, Qian Chen, Lijie Dong, Yuanfeng Huang, Qiao Zhou, Lu Xia, Bin Li, Jifeng Guo, Kun Xia, Beisha Tang, Jinchen Li, Genetic landscape of human mitochondrial genome using whole-genome sequencing, Human Molecular Genetics, Volume 31, Issue 11, 1 June 2022, Pages 1747–1761, https://doi.org/10.1093/hmg/ddab358
- Share Icon Share
Abstract
Increasing evidences suggest that mitochondrial dysfunction is implicated in diseases and aging, and whole-genome sequencing (WGS) is the most unbiased method in analyzing the mitochondrial genome (mtDNA). However, the genetic landscape of mtDNA in the Chinese population has not been fully examined. Here, we described the genetic landscape of mtDNA using WGS data from Chinese individuals (n = 3241). We identified 3892 mtDNA variants, of which 3349 (86%) were rare variants. Interestingly, we observed a trend toward extreme heterogeneity of mtDNA variants. Our study observed a distinct purifying selection on mtDNA, which inhibits the accumulation of harmful heteroplasmies at the individual level: (1) mitochondrial dN/dS ratios were much <1; (2) the dN/dS ratio of heteroplasmies was higher than homoplasmies; (3) heteroplasmies had more indels and predicted deleterious variants than homoplasmies. Furthermore, we found that haplogroup M (20.27%) and D (20.15%) had the highest frequencies in the Chinese population, followed by B (18.51%) and F (16.45%). The number of variants per individual differed across haplogroup groups, with a higher number of homoplasmies for the M lineage. Meanwhile, mtDNA copy number was negatively correlated with age but positively correlated with the female sex. Finally, we developed an mtDNA variation database of Chinese populations called MTCards (http://genemed.tech/mtcards/) to facilitate the query of mtDNA variants in this study. In summary, these findings contribute to different aspects of understanding mtDNA, providing a better understanding of the genetic basis of mitochondrial-related diseases.
Introduction
The mitochondrion, present in all human cell types, with the exception of mature erythrocytes, is a vital organelle that contains its own DNA (1) and produces energy in the form of adenosine triphosphate (ATP) through oxidative phosphorylation (OXPHOS) in every cell in the body with the energy it needs to function properly (2). The human mitochondrial genome (mtDNA) is a 16 569 bp double-stranded circular genome, which is small compared to the nuclear genome that has nearly three billion base pairs. The human mtDNA codes for 22 tRNAs, 2 rRNAs and 13 protein-coding genes that make up the respiratory chain complexes (I–V) and the displacement loop (D-loop) region, which is the most important non-coding region of the mtDNA and contains the replicative origins and transcriptive promoters for mtDNA (3–5).
Unlike the nuclear genome, which is diploid, mitochondrial genomes are polyploid, which means that there are multiple copies of mtDNA present within each cell and that mtDNA copy number varies greatly between different individuals and different tissues of the same individual (6). In addition, mtDNA variants result in different copies of mtDNA that are not exactly the same, a condition referred to as heteroplasmy (7). The heteroplasmic levels of mtDNA variants may change over time or with changes in the tissues (8,9). Typically, mtDNA variants lead to corresponding phenotypes only if present above a specific level referred to as the ‘threshold effect’ (10). Moreover, the heteroplasmic levels of some pathogenic mtDNA variants increase with age (11). Therefore, the identification of low-level mtDNA heteroplasmic variants is essential for disease prevention. As the mutation rate in mtDNA is higher and mtDNA is maternally inherited, researchers have defined geographically restricted mtDNA haplogroups based on mtDNA variants and constructed a human mitochondrial lineage, which reveals that human ancestors originated from Africa (4,12). There is strong evidence showing that in addition to primary mitochondrial disease, mtDNA variants have been associated with multiple complex diseases, including hypertension (13), diabetes (14), cancer (15,16) and neurological disorders, such as Parkinson’s (17) and Alzheimer’s disease (18).
In the past few years, next-generation sequencing (NGS) technologies have been extensively used in mitochondrial disease research (19,20). To date, several studies have been conducted on the genetic landscape of mtDNA in people with specific diseases or ethnic groups using NGS data, including target mtDNA gene panels (21), whole mtDNA sequencing (22,23), whole-exome sequencing (WES) (24–27) and whole-genome sequencing (WGS) (28,29). However, most studies usually only describe one specific aspect of the mtDNA based on the WES data, which results in a relatively low sequencing depth of mtDNA and limits the accuracy of the results. Furthermore, most of these studies were based on relatively small sample cohorts, resulting in relatively low statistical power. Thus, the multidimensional, comprehensive genetic landscape of mtDNA in the Chinese population remains unclear, hindering further exploration of the genetic basis of mitochondrial-related diseases. Moreover, previous studies have mainly focused on the European population, and East Asians and Europeans have huge differences in genetic background and population characteristics.
In this study, the WGS data from 3241 Chinese individuals were analyzed using a well-designed variant-calling pipeline to systematically explore the characteristics of mtDNA variants, haplogroups and copy number. Owing to the large sample size analyzed and the high coverage of mtDNA obtained from the WGS data, this study provides a comprehensive genetic landscape of mtDNA in the Chinese population, which will help us better understand the molecular basis of multiple human diseases caused by mitochondrial dysfunction. We also developed a public research resource of mtDNA variants, MTCards, which is a large-scale dataset, and created a tremendous resource for further identification of pathogenic mtDNA variants in mitochondrial-related diseases in the future.
Results
Population characteristics and WGS data
To characterize the genetic landscape of the mitochondrial genome in the Chinese population, 3241 unrelated individuals from the Chinese population were included in this study (Supplementary Material, Table S1). Of these, 1597 were males and 1644 were females, aged 51–92 years (median 65 years; Supplementary Material, Fig. S1).
The average sequencing depth of the whole genome was 12-fold (Supplementary Material, Table S2 and Supplementary Material, Fig. S2). Since mtDNA has a much higher copy number relative to the nuclear genome, the average sequencing depth for each mtDNA genome was ~1198-fold, and 98.9% of mtDNA genome regions were covered at least 500-fold (Supplementary Material, Table S2 and Supplementary Material, Fig. S2). In contrast, compared to WES or targeted sequencing technologies, WGS can interrogate all regions of the genome without bias, thereby reaching a much more uniform and complete coverage of the whole mitochondrial genome region (30). Thus, mtDNA reads with high depth and uniform coverage obtained from the WGS data allowed us to accurately identify mtDNA variants with heteroplasmic levels as low as 1%.
Characterization of mitochondrial DNA variations
To avoid contamination of nuclear mitochondrial DNA segments (nuMTs), the sequencing reads uniquely mapped to the mitochondrial genome were used for further analysis, which included both homoplasmic (heteroplasmic fraction (HF) ≥ 98%) and heteroplasmic variants (1% ≤ HF < 98%). In total, we identified 3892 mtDNA variants (Supplementary Material, Table S3), including 3274 (84%) in the homoplasmic state, 140 (4%) in the heteroplasmic state and 478 (12%) in ‘both’ states (homoplasmic state in some individuals but heteroplasmic state in others; Fig. 1A). Overall, our results show that each individual carries 37.0 homoplasmic variants (range: 8–53, median: 37) and 4.1 heteroplasmic variants (range: 0–40, median: 3) on average (Supplementary Material, Fig. S3). The majority of identified mtDNA variants (n = 3349, 86%) were rare variants with allele frequency (AF) < 0.5%; the low-frequency variants (0.5% ≤ AF < 5%) and common variants (AF ≥ 5%) accounted for 11% (n = 426) and 3% (n = 117), respectively (Fig. 1B). Considering all variant calls, the proportion of homoplasmic variants across all three groups was remarkably high, but the rare group had a slightly higher proportion (Rare: 95.9%; Low-frequency: 90.6%; Common: 89.1%; chi-square test, P < 2.2e−16; Fig. 1C).

Characterization of 3892 mtDNA variants. (A) Proportion of variants present only in the homoplasmic (Hom_only) or the heteroplasmic state (Het_only) or present in both (Hom and Het). Present in both means that there was homoplasmic state in some individuals but heteroplasmic state in others for the given variant. (B) Variants were grouped according to allele frequency (AF). Rare variants: AF < 0.5%; low-frequency variants: 0.5% ≤ AF < 5%; common variants: AF ≥ 5%. The percentage represents the proportion of each group. (C) Stacked bar plot representing the percentage of the homoplasmic (Hom) and homoplasmic (Het) variants in rare, low-frequency, and common variants in all variant calls a heteroplasmic fraction ≥1%. (D) The distribution of heteroplasmic fraction for rare, low-frequency and common variants in all variant calls with a heteroplasmic fraction <98% and supporting reads ≥2. (E) The non-synonymous/synonymous rate ratio (dN/dS) for hom_only, het_only and both groups. (F) The proportion of predicted deleterious mRNA variants (%) for the hom_only, het_only and both groups. (G) The proportion of predicted deleterious tRNA variants (%) for the hom_only, het_only and both groups. Exact values are labeled on the top of the corresponding column. ***P < 0.001; **P < 0.01; *P < 0.05; P-values for chi-square test, Bonferroni-adjusted (E) or Fisher’s exact test, Bonferroni-adjusted (F, G).
For heteroplasmic variants that were detected in at least one of the individuals (het_only and both, n = 618), we further quantified the heteroplasmic level, defined as the HF, and analyzed the distribution of HF for these variants. First, the 618 heteroplasmic variants were divided into three groups according to AF: rare (AF < 0.5%, n = 309), low-frequency (0.5% ≤ AF < 5%, n = 202) and common variants (AF ≥ 5%, n = 107). We found that all three groups showed extreme heterogeneous trends (Fig. 1D). In other words, mtDNA heteroplasmic variants were mainly enriched at high or low levels. Similarly, the trend of rare variants is more distinct than of low-frequency and common variants. In addition, the distribution of the mean HF of heteroplasmic variants observed only in the heteroplasmic state was significantly skewed toward low-level HF (skewness = 0.80, kurtosis = 2.13; Supplementary Material, Fig. S4). In comparison to this, the distribution of the mean HF of heteroplasmic variants observed in the homoplasmic and heteroplasmic states exhibited significantly higher levels of HF (skewness = −0.86, kurtosis = 2.69, Supplementary Material, Fig. S4). This may be owing to the variants that were only observed in the heteroplasmic state being relatively harmful, so that it is difficult for these variants to accumulate to very high heteroplasmic levels at the individual level.
Among the 3892 mtDNA variants, a vast majority were single nucleotide variants (SNVs; n = 3719, 96%), whereas 4% were insertions or deletions (indels). However, we observed a certain preference for the distribution of indel events occurring within the mtDNA. Specifically, heteroplasmic variants were more likely to be indels than homoplasmic variants (chi-square test, P < 2.2e−16, Supplementary Material, Fig. S5A). This observation is consistent with a previous study, which showed that heteroplasmic variants were more likely to accumulate harmful mutations than homoplasmic variants (31). In addition, we observed differences in the distribution of indels in different regions of the mitochondrial genome (chi-square test, P < 2.2e−16, Supplementary Material, Fig. S5B). Almost no indel variants were present in coding regions (n = 4), with the vast majority being present in non-coding regions (n = 148), and approximately equal numbers were present in rRNA (n = 11) and tRNA (n = 10) regions. This result can be explained by the fact that harmful variants are more likely to accumulate in the non-coding region (Supplementary Material, Fig. S5B).
Mutation rate, Ti/Tv and dN/dS in mitochondrial genomes
Of the 3719 SNVs, 14.2% (n = 524) were located in non-coding regions, 7.4% (n = 274) were located in rRNA genes, 6.2% (n = 232) were located in tRNA genes, and 72.3% (n = 2689) were located in protein-coding regions (Table 1). Although most SNVs were located in protein-coding regions, their mutation rate was the highest for the non-coding regions, which accounted for only 7.3% of the entire mitochondrial genome (Table 1). The SNV mutation rate of the non-coding regions was almost 2-fold higher than that of the protein-coding regions (43.27% versus 23.71%). The mutation rates for the rRNA and tRNA genes were 10.90% and 15.43%, respectively, which were lower than those for the protein-coding regions. In addition, the mutation rate varied significantly among the 13 protein-coding genes (Table 1). The top two highest mutation rate were ATP6 and ATP8, and the proteins encoded by these two genes are responsible for the same respiratory chain complex (V) (32).
Region . | Length (bp) . | Number of variants . | Mutation rate (%) . | Ti . | Tv . | Ti/Tv . | dN/dS . |
---|---|---|---|---|---|---|---|
Non-coding region (D-loop) | 1211 (7.30%) | 524 (14.09%) | 43.27 | 435 | 89 | 4.89 | NA |
rRNA gene | 2513 (9.08%) | 274 (7.37%) | 10.90 | 249 | 25 | 9.96 | NA |
tRNA genes | 1504 (9.08%) | 232 (6.24%) | 15.43 | 220 | 12 | 18.33 | NA |
Protein-coding genes | 11 341 (68.45%) | 2689 (72.30%) | 23.71 | 2503 | 186 | 13.46 | NA |
ATP6 | 681 | 232 | 34.07 | 217 | 15 | 14.47 | 0.46 |
ATP8 | 207 | 63 | 30.43 | 58 | 5 | 11.60 | 0.38 |
CYTB | 1141 | 324 | 28.40 | 300 | 24 | 12.50 | 0.19 |
ND6 | 525 | 134 | 25.52 | 128 | 6 | 21.33 | 0.14 |
COX3 | 784 | 188 | 23.98 | 175 | 13 | 13.46 | 0.15 |
COX2 | 684 | 163 | 23.83 | 153 | 6 | 25.50 | 0.13 |
ND5 | 1812 | 422 | 23.29 | 383 | 39 | 9.82 | 0.11 |
ND1 | 956 | 222 | 23.22 | 202 | 20 | 10.10 | 0.16 |
ND2 | 1042 | 227 | 21.79 | 213 | 14 | 15.21 | 0.12 |
COX1 | 1542 | 325 | 21.08 | 311 | 14 | 22.21 | 0.08 |
ND4 | 1378 | 284 | 20.61 | 264 | 20 | 13.20 | 0.09 |
ND3 | 346 | 70 | 20.23 | 64 | 6 | 10.67 | 0.11 |
ND4L | 297 | 52 | 17.51 | 47 | 5 | 9.40 | 0.10 |
mtDNA | 16 569 | 3719 | 22.45 | 3407 | 312 | 10.91 | 0.14 |
Region . | Length (bp) . | Number of variants . | Mutation rate (%) . | Ti . | Tv . | Ti/Tv . | dN/dS . |
---|---|---|---|---|---|---|---|
Non-coding region (D-loop) | 1211 (7.30%) | 524 (14.09%) | 43.27 | 435 | 89 | 4.89 | NA |
rRNA gene | 2513 (9.08%) | 274 (7.37%) | 10.90 | 249 | 25 | 9.96 | NA |
tRNA genes | 1504 (9.08%) | 232 (6.24%) | 15.43 | 220 | 12 | 18.33 | NA |
Protein-coding genes | 11 341 (68.45%) | 2689 (72.30%) | 23.71 | 2503 | 186 | 13.46 | NA |
ATP6 | 681 | 232 | 34.07 | 217 | 15 | 14.47 | 0.46 |
ATP8 | 207 | 63 | 30.43 | 58 | 5 | 11.60 | 0.38 |
CYTB | 1141 | 324 | 28.40 | 300 | 24 | 12.50 | 0.19 |
ND6 | 525 | 134 | 25.52 | 128 | 6 | 21.33 | 0.14 |
COX3 | 784 | 188 | 23.98 | 175 | 13 | 13.46 | 0.15 |
COX2 | 684 | 163 | 23.83 | 153 | 6 | 25.50 | 0.13 |
ND5 | 1812 | 422 | 23.29 | 383 | 39 | 9.82 | 0.11 |
ND1 | 956 | 222 | 23.22 | 202 | 20 | 10.10 | 0.16 |
ND2 | 1042 | 227 | 21.79 | 213 | 14 | 15.21 | 0.12 |
COX1 | 1542 | 325 | 21.08 | 311 | 14 | 22.21 | 0.08 |
ND4 | 1378 | 284 | 20.61 | 264 | 20 | 13.20 | 0.09 |
ND3 | 346 | 70 | 20.23 | 64 | 6 | 10.67 | 0.11 |
ND4L | 297 | 52 | 17.51 | 47 | 5 | 9.40 | 0.10 |
mtDNA | 16 569 | 3719 | 22.45 | 3407 | 312 | 10.91 | 0.14 |
bp = base pair; mutation rate = number of variants/length; Ti/Tv = transition/transversion; dN = non-synonymous substitution; dS, synonymous substitution; dN/dS = means the rate of non-synonymous/synonymous substitutions; NA = not available.
Region . | Length (bp) . | Number of variants . | Mutation rate (%) . | Ti . | Tv . | Ti/Tv . | dN/dS . |
---|---|---|---|---|---|---|---|
Non-coding region (D-loop) | 1211 (7.30%) | 524 (14.09%) | 43.27 | 435 | 89 | 4.89 | NA |
rRNA gene | 2513 (9.08%) | 274 (7.37%) | 10.90 | 249 | 25 | 9.96 | NA |
tRNA genes | 1504 (9.08%) | 232 (6.24%) | 15.43 | 220 | 12 | 18.33 | NA |
Protein-coding genes | 11 341 (68.45%) | 2689 (72.30%) | 23.71 | 2503 | 186 | 13.46 | NA |
ATP6 | 681 | 232 | 34.07 | 217 | 15 | 14.47 | 0.46 |
ATP8 | 207 | 63 | 30.43 | 58 | 5 | 11.60 | 0.38 |
CYTB | 1141 | 324 | 28.40 | 300 | 24 | 12.50 | 0.19 |
ND6 | 525 | 134 | 25.52 | 128 | 6 | 21.33 | 0.14 |
COX3 | 784 | 188 | 23.98 | 175 | 13 | 13.46 | 0.15 |
COX2 | 684 | 163 | 23.83 | 153 | 6 | 25.50 | 0.13 |
ND5 | 1812 | 422 | 23.29 | 383 | 39 | 9.82 | 0.11 |
ND1 | 956 | 222 | 23.22 | 202 | 20 | 10.10 | 0.16 |
ND2 | 1042 | 227 | 21.79 | 213 | 14 | 15.21 | 0.12 |
COX1 | 1542 | 325 | 21.08 | 311 | 14 | 22.21 | 0.08 |
ND4 | 1378 | 284 | 20.61 | 264 | 20 | 13.20 | 0.09 |
ND3 | 346 | 70 | 20.23 | 64 | 6 | 10.67 | 0.11 |
ND4L | 297 | 52 | 17.51 | 47 | 5 | 9.40 | 0.10 |
mtDNA | 16 569 | 3719 | 22.45 | 3407 | 312 | 10.91 | 0.14 |
Region . | Length (bp) . | Number of variants . | Mutation rate (%) . | Ti . | Tv . | Ti/Tv . | dN/dS . |
---|---|---|---|---|---|---|---|
Non-coding region (D-loop) | 1211 (7.30%) | 524 (14.09%) | 43.27 | 435 | 89 | 4.89 | NA |
rRNA gene | 2513 (9.08%) | 274 (7.37%) | 10.90 | 249 | 25 | 9.96 | NA |
tRNA genes | 1504 (9.08%) | 232 (6.24%) | 15.43 | 220 | 12 | 18.33 | NA |
Protein-coding genes | 11 341 (68.45%) | 2689 (72.30%) | 23.71 | 2503 | 186 | 13.46 | NA |
ATP6 | 681 | 232 | 34.07 | 217 | 15 | 14.47 | 0.46 |
ATP8 | 207 | 63 | 30.43 | 58 | 5 | 11.60 | 0.38 |
CYTB | 1141 | 324 | 28.40 | 300 | 24 | 12.50 | 0.19 |
ND6 | 525 | 134 | 25.52 | 128 | 6 | 21.33 | 0.14 |
COX3 | 784 | 188 | 23.98 | 175 | 13 | 13.46 | 0.15 |
COX2 | 684 | 163 | 23.83 | 153 | 6 | 25.50 | 0.13 |
ND5 | 1812 | 422 | 23.29 | 383 | 39 | 9.82 | 0.11 |
ND1 | 956 | 222 | 23.22 | 202 | 20 | 10.10 | 0.16 |
ND2 | 1042 | 227 | 21.79 | 213 | 14 | 15.21 | 0.12 |
COX1 | 1542 | 325 | 21.08 | 311 | 14 | 22.21 | 0.08 |
ND4 | 1378 | 284 | 20.61 | 264 | 20 | 13.20 | 0.09 |
ND3 | 346 | 70 | 20.23 | 64 | 6 | 10.67 | 0.11 |
ND4L | 297 | 52 | 17.51 | 47 | 5 | 9.40 | 0.10 |
mtDNA | 16 569 | 3719 | 22.45 | 3407 | 312 | 10.91 | 0.14 |
bp = base pair; mutation rate = number of variants/length; Ti/Tv = transition/transversion; dN = non-synonymous substitution; dS, synonymous substitution; dN/dS = means the rate of non-synonymous/synonymous substitutions; NA = not available.
We further investigated the transition/transversion (Ti/Tv) ratio of the mitochondrial genome and found that it significantly differed across studies, ranging from 10 to 38 (33,34). In this study, the observed Ti/Tv ratio for the whole mitochondrial genome was 10.91, which is much higher than that of the human nuclear genome (around 3.0 for SNVs inside exons and around 2.0 elsewhere) (35). Consistent with previous studies (33), our results also indicated that the Ti/Tv ratio was highly imbalanced across different regions (Table 1): 4.89 in the non-coding region (D-loop), 13.46 in protein-coding genes, 9.96 in rRNA genes and 18.33 in tRNA genes.
To explore selective pressure, we calculated the ratio of non-synonymous/synonymous substitutions (dN/dS) for the variants of the protein-coding region (see Methods). The dN/dS ratio in the protein-coding region was 0.14, which was significantly <1 (Table 1), suggesting extensive purifying selection on the mtDNA. The dN/dS ratio varied significantly among the 13 protein-coding genes (Table 1). Notably, the top two high dN/dS ratios were those of ATP6 and ATP8, consistent with the mutation rate results.
Heteroplasmic mtDNA variants are enriched for non-synonymous and deleterious variants
The 3892 identified variants were divided into three groups (hom_only, het_only and both), as previously described. Then, we calculated the overall dN/dS ratio for each group separately. We found that the dN/dS ratios of the three groups were all significantly <1, and the dN/dS ratio of heteroplasmic variants was higher than that of homoplasmic variants (chi-square test, Bonferroni-adjusted P = 0.0014; Fig. 1E), suggesting that heteroplasmic variants are more non-synonymous and that purifying selection is likely to play a role in shaping the pattern of mtDNA heteroplasmy.
Of the 2693 variants in the protein-coding regions, 1883 (70.0%) were synonymous, 797 (29.6%) were non-synonymous and only 13 (0.5%) were loss-of-function variants (including one frameshift deletion, three start-loss and nine stop-loss; Supplementary Material, Table S2). We predicted the effects of the 797 non-synonymous variants and three start-loss variants using protein variation effect analyzer (PROVEAN) and found that 579 (72%) were predicted to be neutral and 221 (28%) were predicted to be deleterious. We calculated the proportion of the deleterious variants of the ‘hom_only’ (196/687), ‘het_only’ (15/25) and ‘both’ (15/88) groups, which showed that the ‘het_only’ group had a much higher proportion of deleterious variants than the ‘hom_only’ (60% v.s. 29%, Fisher’s exact P = 0.03, Bonferroni-adjusted P = 0.1) or ‘both’ groups (60% v.s. 17%, Fisher’s exact P = 0.005, Bonferroni-adjusted P = 0.015; Fig. 1F).
We further investigated the pathogenicity of 242 variants in tRNA genes using mitochondrial tRNA informatics predictor (MitoTIP) and found that 176 (73%) were predicted to be ‘Likely benign’ (average MitoTiP score = 3.64), 49 (20%) were predicted to be ‘Possibly benign’ (average MitoTiP score = 10.21), 12 (5%) were predicted to be ‘Possibly pathogenic’ (average Mito Tip score = 14.36), one was predicted to be ‘Likely pathogenic’ (MitoTiP score = 16.68) and four had unknown effects. The percentage of pathogenic variants for the ‘het_only’ group was 12.5% higher than that for the ‘hom_only’ (5.12%) or ‘both’ (6.67%) groups, but this difference was not statistically significant (Fisher’s exact, P = 0.38, Fig. 1G). This result was consistent with the above results for variants in the protein-coding regions.
MTCards: a database of mtDNA variants in the Chinese population
Analysis of genetic variants in large-scale populations is very important for validating known and discovering novel pathogenic variants in nDNA, similar to mtDNA (36). For example, variant frequency in affected and normal populations is the most widely used and important evidence for evaluating the pathogenicity of the identified variant according to the American College of Medical Genetics guidelines (37,38). To our knowledge, ChinaMap is currently the largest database of genomic variants in the Chinese populations from WGS data, but it does not have information of mtDNA variants (39). Thus, to promote further studies on the biology of mitochondrial disease, we developed an open-access mtDNA variation database of the Chinese population named MTCards to facilitate the query of mtDNA variants in this study. The database can be accessed freely at http://www.genemed.tech/mtcards.
This database features a user-friendly query interface and provides detailed information on the mtDNA variants detected in this study. Users can query the detailed information of all variants via quick search or advanced search function (Fig. 2). The quick search function can be found on the home page and automatically recognizes multiple key terms, such as gene symbol, genomic region, genomic coordinate and rs ID. The advanced search can be found on the search page and supports batch searches. In addition to general information about the variants, MTCards includes three sections: (1) AF from the MTCards, gnomAD, and MITOMAP databases; (2) phenotypes from the MITOMAP and ClinVar (40) databases; and (3) in silico prediction for non-synonymous mRNA variants (PolyPhen2 (41), SIFT (42), FatHmm (43), PROVEAN (44), MutationAssessor (45), EFIN (46), and CADD (47), among others) and tRNA variants (MitoTIP (48) and PON-mt-tRNA (49)). Furthermore, MTCards also contains the mtDNA copy numbers and haplogroups of each individual.

Snapshot of query interface of variants in MTCards. Two approaches are available to query mtDNA variants, the ‘Quick search’ and ‘Advanced search’. The results of the ATP6 gene are shown as an example. First, the basic information about variants is shown on the web page. Then, user can click the ‘Detail information’ button to view the detailed information of the corresponding variant, including allele frequencies, database phenotypes and in silico prediction for non-synonymous mRNA or tRNA variants.
Comparison with other mtDNA variation databases
To our knowledge, MITOMAP (50) and Genome Aggregation Database (gnomAD) (51) are available databases for human mtDNA variants with the largest sample size. MITOMAP aggregated 51 836 full-length mitochondrial sequences from GenBank, including 11 993 mtDNA variants (last checked in July 2021). However, GenBank sequences obtained from different studies have resulted in inconsistent sequence quality, some of which are from patients with mitochondrial diseases (52). gnomAD has called mtDNA variants for 56 434 whole-genome samples, including 10 850 mtDNA variants with heteroplasmy levels >10%. In addition, Chinese individuals represented a very small percentage of the total population of both databases; therefore, these two databases cannot represent the characteristics of the mitochondrial variation spectrum of the Chinese population.
The Venn diagram in Figure 3A shows the overlap of mtDNA variants in MITOMAP, gnomAD and MTCards. We found that 3595 (92.4%) and 3566 (91.6%) variants in MTCards have been reported in the MITOMAP database and gnomAD, respectively. In total, 172 (4.4%) novel variants were not recorded in the MITOMAP or gnomAD databases. To assess the mtDNA variant-calling accuracy, we selected variants that were common to all three databases described above and compared their allele frequencies of variations. The results showed that the concordance of allele frequencies between MTCards and other mtDNA variant databases was high (MITOMAP, rho = 0.6; gnomAD, rho = 0.5; Supplementary Material, Fig. S6). Notably, most of the novel variants were homoplasmic and singletons (Fig. 3B). Thus, these results demonstrate the accuracy of the variants detected through our well-designed variant-calling pipeline. In addition, we found that novel variants tended to be singleton in the population (Fig. 3C), heteroplasmic (Fig. 3D) and were located in non-coding regions (Fig. 3E) compared to previously reported variants in MITOMAP or gnomAD.

Comparison with other mtDNA variant databases. (A) Venn diagram of overlapping mtDNA variants in MITOMAP, gnomAD and MTCards. (B) Allele frequency (AF) of novel variants. The majority of novel variants are singletons and only observed in the homoplasmic state in the population (represented by the lightest blue box). (C) Stacked bar plot representing the percentage of different allele frequencies in ‘reported’ and ‘novel’ variants. (D) Stacked bar plot representing the percentage of the hom_only, het_only and both variants in ‘reported’ and ‘novel’ variants. (E) Stacked bar plot representing the percentage of mtDNA different regions in ‘reported’ variants and ‘novel’. The ‘reported’ means that the given variant was previously reported in MITOMAP or gnomAD; ‘Novel’ means that the given variant was only reported in MTCards.
Haplogroup frequency distribution and associations with number of variants per individual
The mtDNA haplogroup for each individual was categorized using HaploGrep2 based on the variant list detected by WGS data (53). We chose the best-ranked haplogroup per sample for further analysis, and only 15 (0.46%) samples had a quality score below 0.8. The results showed that 16 haplogroups (one-capital letter haplogroup level) were identified in 3241 samples, of which 3149 (97%) belonged to specific Asian haplogroups (Supplementary Material, Table S4, Fig. 4A), where M and D were the most prevalent (20.27% and 20.15%, respectively), followed by B (18.51%) and F (16.45%). We found that 86 of the remaining 92 samples belonged to the mitochondrial haplogroup R, which is itself a branch of haplogroup N shared between Europe and East Asia (54).

Haplogroup frequency distribution and associations with the number of variants per individual. (A) The distribution of higher level haplogroups represented in MTCards. rCRS, revised Cambridge Reference Sequence; RSRS, Reconstructed Sapiens Reference Sequence (79). This figure was adapted from that on the MITOMAP website under a Creative Commons Attribution 3.0 license. (B) Stacked bar plots of the frequencies of the haplogroups (one-capital letter haplogroup level) in Chinese and subpopulations from the 1000 Genomes Project (1KG). CHB, Han Chinese; CHS, Southern Han Chinese; CDX, Dai Chinese; JPT, Japanese; KHV, Kinh vietnamese; EAS, East Asia; SAS, South Asia; AMR, American; EUR, European; AFR, African. The correlation between the number of mtDNA homoplasmic (C) and heteroplasmic variants (D) per sample and haplogroup identified as M or N lineage. ****P < 0.0001, P-values for Wilcoxon signed-rank test. The correlation between the number of mtDNA homoplasmic (E) and heteroplasmic variants (F) per sample and haplogroup identified as C, M, Z, D or G, which belong to the M lineage (red), and B, F, A, Y, R or N, which belong to the N lineage (blue).
When comparing the frequency distribution of each haplogroup in the Chinese population to that in the five continental populations from the 1000 Genomes Project Phase 3 (1KG, n = 2534; Fig. 4B). There was no significant difference in the distribution of haplogroups between our cohort and East Asians (EAS) of 1KG (chi-square test, P = 0.08). Next, we further divided the EAS into five populations, namely Southern Han Chinese (CHS, n = 108), Japanese (JPT, n = 104), Han Chinese (CHB, n = 103), Kinh Vietnamese (KHV, n = 101) and Dai Chinese (CDX, n = 99), and there was no significant difference between our cohort and CHS (chi-square test, P = 0.58, chi-square value = 0.31, contingency coefficient = 0.0096) or CHB (chi-square test, P = 0.71, chi-square value = 0.14, contingency coefficient = 0.0065). Thus, our cohort was most similar to CHB based on the contingency coefficient. Notably, we found 31 samples belonging to mitochondrial haplogroup Y, which was absent in the CHB population from 1KG. However, previous studies have found haplogroup Y in a Chinese population (54). Therefore, our results confirmed that haplogroup Y is present in the Chinese population but israre.
In our dataset, 51.03% of the individuals came from the Asian M lineages, and 48.97% came from the Eurasian N lineages (Supplementary Material, Table S4). We found that there was a difference in the number of mtDNA variants per individual between the M and N lineages (Supplementary Material, Fig. S3). Specifically, for homoplasmic variants, the M lineages were significantly higher than N lineages (mean = 40 for M group, mean = 33 for N group, Wilcoxon signed-rank test, P < 2.2e−16; Fig. 4C). Conversely, for heteroplasmic variants, N lineages were modestly but significantly higher than M lineages (mean = 3.4, M group; mean = 4.9, N group; Wilcoxon signed-rank test, P < 2.2e−16; Fig. 4D). The revised Cambridge Reference Sequence (rCRS) genome belongs to the haplogroup HV, which is a mitochondrial clade within the N lineages (Fig. 3A). Thus, there is a closer evolutionary distance between N lineages and the rCRS compared with M lineages, which can explain the smaller number of homoplasmies for the N group described above. Next, we further divided the M and N lineages into multiple haplogroups with >30 samples for further analysis. The result showed that there were also significant differences in the number of mtDNA homoplasmic (Kruskal–Wallis test, P < 2.2e−16, Fig. 4E) and heteroplasmic (Kruskal–Wallis test, P < 2.2e−16, Fig. 4F) variants per sample between the different mitochondrial haplogroups. It is worth noting that for homoplasmies, all haplogroups of M lineages were higher than those of N lineages. However, for heteroplasmies, all of the remaining haplogroups were slightly different, with the exception of haplogroup B was significantly higher than other haplogroups, which may be the reason for the greater number of heteroplasmies in the N group. Expectedly, the difference between the M and N lineages was not significant after the removal of haplogroup B (mean = 3.4, M group; mean = 3.2, N group; Wilcoxon signed-rank test, P = 0.0984). Since mitochondrial haplogroups were assigned based on mtDNA variants carried by individuals, we counted the number of variants used to define haplogroups for each individual. We found that the number of indels in haplogroup B was significantly higher than that in other haplogroups. Moreover, the proportion of heteroplasmic variants in indels (50.0%) was greater than that in SNVs (2.9%). This resulted in a significantly higher number of heteroplasmies in haplogroupB.
Correlation of mtDNA copy number with sex, age and haplogroup
Precise regulation of mtDNA copy number is critical for maintaining normal mitochondrial function and is a potential biomarker for the diagnosis of disease and aging (55,56). However, neither of the two databases mentioned above contains information on mtDNA copy number in the human mitochondrial genome. Before the availability of NGS techniques, mtDNA sequences were mainly obtained using a polymerase chain reaction (PCR)-based strategy; however, because traditional PCR involves random exponential amplification, the PCR products are not proportional to the input. This deviation leads to the inaccuracy of mtDNA quantification, which makes this method unsuitable for the measurement of mtDNA copy number (57). NGS technology minimizes library construction bias, allowing both mtDNA and nuclear DNA (nDNA) to have the same probability of being sequenced (58,59). Therefore, we calculated mtDNA copy number for each individual by comparing the mean sequencing depth of nDNA and mtDNA (see Methods), and observed a range of 61–415 (mean = 196.21; standard deviation = 42.88; Fig. 5A), with the majority of individuals between 125 and 288 (95%CI).

mtDNA copy number distribution and correlation with sex and age. (A) The distribution of mtDNA copy number per individual. The dashed line denotes the mean value of the histogram. SD, standard deviation. (B) The correlation between the mtDNA copy number and sex. ****P < 0.0001, P-values for Wilcoxon signed-rank test. (C) The correlation between the mtDNA copy number and age. The shaded region represents the 95% confidence interval for the predictions of a linear model with the copy number and age as the response and explanatory variables, respectively. The pearson coefficient (cor) and P value are indicated on the upper right corner of theplot.
Next, we investigated the correlation between mtDNA copy number and sex, age or haplogroup. We found that consistent with previous research (31,56,60), females had 17 (9.0%) more copies of mtDNA than males on average (Wilcoxon signed-ranks test, P < 2e−16, Fig. 5B). After linear regression analysis adjusted for age, the obtained result still showed a significant sex effect for mtDNA copy number (β = −16.1, P < 2e−16). Regarding the relationship between mtDNA copy number and age, the results of previous studies are controversial. However, consistent with the most of studies (56,61), we observed that mtDNA copy number and age were significantly negatively correlated (linear regression models; β = −1.3, P < 2e−16), which remained significant after grouping according to sex (Pearson coefficient (cor) = −0.25, P = 1.78e−24 in males; cor = −0.27, P = 3.77e−28 in females; Fig. 5C). In terms of the associations of mitochondrial haplogroups with mtDNA copy number, we did not observe any significant difference in mtDNA copy number among the different mitochondrial haplogroups (Kruskal–Wallis test, P = 0.6486) consistent with the results of a previous study (62).
We selected haplogroups with >30 samples to investigate the mtDNA copy number correlation with sex or age, respectively. We observed that females had a higher mtDNA copy number than males in all haplogroup groups, except haplogroup C (Supplementary Material, Fig. S7). Although not significant, the females showed a slightly higher mtDNA copy number than males in haplogroup C (female = 194.48; male = 191.95; P = 0.6). Similarly, most haplogroups showed a significant negative correlation between mtDNA copy number and age (Supplementary Material, Fig. S8). Although not significant, haplogroup Y (n = 31, cor = −0.233, P = 0.2) and haplogroup R (n = 86, cor = −0.196, P = 0.07) still show a tendency of negative correlation. Thus, our results showed that the correlation of mtDNA copy number with sex and age was independent of haplogroups.
Discussion
It is widely accepted that mitochondrial dysfunction is involved in the pathogenesis of many common diseases. At present, >250 pathogenic mtDNA variants have been shown to be associated with mitochondrial disease (63,64). The availability of mtDNA sequence data from NGS technologies makes it possible to analyze the mitochondrial genomes of large populations, thus revealing new genes and pathological mechanisms of mitochondrial diseases. However, mitochondrial genome research of large sample sizes using WGS data is lacking, particularly in non-European populations (24–26,29). The current study is the largest to date that characterizes the mitochondrial genome in the Chinese population. Here, we characterized the mitochondrial genome in a comprehensive, multidimensional and systematic manner, including mtDNA variants, haplogroups and copy number, from the high-depth WGS data of a large number of Chinese individuals.
In this study, we identified 3892 mtDNA variants, provided a clear landscape of these variants, and observed unique features. First, consistent with previous studies (33,65,66), our results show that the mutation rate was highest for the D-loop region among the whole mitochondrial genome because it contains three hypervariable regions (HVI, HVII and HVIII) with high polymorphism (67). Previous studies have demonstrated significant differences in mutation rates among 13 protein-coding genes (65,66,68). Interestingly, we observed that the mutation rate and dN/dS ratio of ATP8 and ATP6 were higher than those of other genes. These results showed that ATP6 and ATP8 were not only prone to variants, but also more prone to non-synonymous variants compared to other coding genes. Jun et al. (31) showed that the number of mtDNA heteroplasmic variants per sample increased with age. However, we did not observe this trend in either homoplasmic or heteroplasmic variants in ourdata.
The average sequencing depth for mtDNA was 1198-fold in this study and consequently allowed for confident detection of mtDNA heteroplasmic variants, of which 4% (n = 140) were only identified in the heteroplasmic state. Therefore, the novelty of this study was the analysis of mtDNA heteroplasmy and distinctive features were observed. First, consistent with the gnomAD results, a vast majority of variant calls (90%) were identified in the homoplasmic state. However, majority of the individuals (n = 2804, 86.5%) had at least one heteroplasmic variant at a level of 1% or higher, which further confirms previous research that considers mitochondrial heteroplasmy as prevalent in the normal human population (69). All mtDNA variants are generated on a single mtDNA molecule, and there is some mechanism by which emerging variants either gradually accumulate to a high-level heteroplasmy or even homoplasmy, or remain as a low-level heteroplasmy, or even disappear. In our study, we observed that most mtDNA variant calls were homoplasmy and mtDNA heteroplasmic variants exhibited extreme heterogeneity, consistent with the view that intermediate-level heteroplasmy is uncommon by James et al. (70). Additionally, we noticed that low-level or high-level heteroplasmy and homoplasmy were more frequent among rare variants than low-frequency or common variants. This may be attributed to the fact that rare variants are mainly de novo and therefore, more likely to show extreme heterogeneity or homoplasmy in a few individuals. Whereas common variants are usually inherited and become more stably passed on via multi-generation transmission at the population level. Finally, we demonstrated in three aspects that heteroplasmic variants are more likely to be harmful: (1) heteroplasmic variants had a higher proportion of indels; (2) the dN/dS ratio of heteroplasmic variants was higher; and (3) heteroplasmic variants had a higher proportion of predicted deleterious variants. Furthermore, we observed that mtDNA variants only seen at heteroplasmic state showed low-level HF (Supplementary Material, Fig. S4). Thus, these results are consistent with those from a previous study (31) and may be attributed to the existence of purifying selection, which makes it difficult for harmful mtDNA variants to be fixed to homoplasmy, and keeps at a low fraction (31,71).
In addition, we estimated the mtDNA haplogroup for each individual and noted that there was a difference in the number of mtDNA variants per sample in individuals from different haplogroups. This is because the number of variants per sample largely depends on the haplogroup to which it belongs and the evolutionary distance of that haplogroup to the haplogroup of the reference genome (rCRS). Therefore, M lineages should have more variants than N lineages. This is indeed the case for homoplasmic variants, but the heteroplasmic variation did not show an obvious trend. The results suggest that heteroplasmic variants appeared randomly in the process of evolution, whereas homoplasmic variants were stably inherited and were mostly used to define the mtDNA haplogroup. In addition, Wei et al. analyzed over 30 000 human mtDNA sequences and reported that the association between disease-causing mtDNA mutations and background mtDNA haplogroups was universal (72). Thus, it may be necessary to consider the mtDNA haplogroup to which individuals belong when assessing potentially pathogenic mtDNA variants in the future.
Maintaining mtDNA copy number is essential for preserving mitochondrial function. Our results showed that mtDNA copy number of most individuals was between 125 and 288, which was >75–150 as measured from WGS data by Jun et al. (31). These differences might be attributed to mtDNA sources in different cell populations. In our study, mtDNA was extracted from peripheral blood leukocytes, whereas Jun et al. extracted DNA from lymphocytes with low levels of cytoplasm. Our results (mean of 196) are similar to the previously reported reference intervals of mtDNA copy number in peripheral blood for Chinese individuals (mean of 287), which were estimated by quantitative real-time PCR. Although the exact mechanisms regulating mtDNA copy number remain unclear, we observed its significant association with the individuals’ sex or age. First, all studies, including ours, have confirmed that average mtDNA copy number in females was significantly higher than that in males (31,56,60). Meanwhile, previous studies on the association between mtDNA copy number and age were inconsistent (56,60–62,73). Chirag et al. did not observe any significant differences in mtDNA copy number with age (60), whereas Wachsmuth et al. observed a strong age-related decrease in mtDNA copy number only in male (62). In contrast, our study has demonstrated that mtDNA copy number decreased with age in both females and males. The inconsistent results reported in previous studies may be attributed to the way the mtDNA copy number was measured. In our study, we calculated mtDNA copy number using WGS data, which is more accurate than data generated using traditional methods. Thus, our study findings validate that mtDNA copy number is related to age. As mtDNA copy number is an important parameter of mitochondrial function, we hypothesized that low mtDNA copy number indicates decline in the mitochondrial function. Therefore, these results suggest that the mitochondrial function is likely to play an important role in the aging process and partly explain the differences in longevity between males and females (74).
Our study, however, has several limitations. First, the individuals were all middle-aged and older adults, and the obtained variants may contribute to the identification of pathogenic variants for age-related diseases. However, the narrow age span (a median age of 60–70 years) leads to certain constraints when using age as a clinical indicator. Second, most of the individuals were from the central provinces of China, resulting in the inability to analyze the differences in mitochondrial genomes between different provinces, especially the spectra of mtDNA haplogroups. Third, our study cohorts included some patients with Parkinson’s disease (PD). However, as this study focused only on fundamental questions in biology, our conclusions on heteroplasmy, selective effects and copy number were similar between PD cohorts and normal cohorts. Fourth, mtDNA was extracted from the peripheral blood. Therefore, we can only analyze the heteroplasmic levels between different individuals but cannot reflect the heteroplasmic levels present between different tissues.
Taken together, the findings of this study comprehensively and clearly characterized the genetic landscape of the mitochondrial genome in the Chinese population. In addition, we constructed a publicly available mtDNA variant database for the Chinese population, which would become a valuable resource for further identification of pathogenic variants in human mitochondrial-associated diseases.
Materials and Methods
Sample collection and DNA extraction
We recruited 3241 unrelated Chinese individuals from the Parkinson’s Disease & Movement Disorders Multicenter Database and Collaborative Network in China (http://www.pd-mdcnc.com/), including 1962 patients with PD and 1279 normal controls. Peripheral blood specimens were collected after written informed consent of the genetic research was obtained from all participants. Genomic DNA was extracted from peripheral blood leukocytes using standard phenol-chloroform extraction procedures. This study was approved by the Ethics Committee of Xiangya Hospital, Central South University.
Mitochondrial DNA variant calling from the WGS data and variant annotation
WGS was performed at the WuXi NextCODE Genomics using the Illumina Nova sequencing platform. Generated paired-reads files were run through fastp (v0.19.5) to automatically cut adapters, remove low quality or too short base pairs, and correct mismatched base pairs using the arguments ‘--qualified_quality_phred 15 --length_required 30 --cut_tail --correction --overrepresentation_analysis.’ Quality-controlled reads were aligned to the hg19 human reference genome using Burrows-Wheeler alignment-maximal exact match (BWA-MEM, version 0.7.15-r1140). The uniquely mapped reads on the mitochondrial genome sequences were extracted to minimize contamination by nuclear copies of nuMTs (20,75). Then, we used rCRS (NC_012920.1) as the mitochondrial genome to identify mtDNA variants. Base quality score recalibration and local indel realignment were applied to the realigned mitochondrial DNA reads for better genotyping using Realigner and QualCal in Sentieon (version 2019.11). We generated mitochondrial genomic variant call format (VCF) files for each individual using the Sentieon Haplotyper with parameters ‘--emit_mode gvcf --ploidy 1,’ and then performed joint-sample variant calling using the Sentieon GVCFtyper for optimal SNV and indel discovery.
Raw VCF files were converted into ANNOVAR input files using an in-house Python script. All mtDNA variants were annotated using ANNOVAR with ensGene databases, including region (exonic, ncRNA_exonic or non-coding) and effect on amino acid change (non-synonymous or synonymous). Non-synonymous variants were further annotated with PROVEAN (76) to estimate the likelihood that a mutation affected protein function. Functional impact (FI) scores < −2.5 were considered deleterious, whereas FI scores ≥ −2.5 were considered neutral. PROVEAN is freely available in MitImpact (77). The pathogenicity of variants in 22 tRNAs was calculated using MitoTIP (48): >16.25 = Likely Pathogenic (LP); 12.66–16.25 = Possibly Pathogenic (PP); 8.44–12.66 = Possibly benign (PB); and <8.44 = Likely Benign (LB). In this study, LP and PP were considered deleterious.
Determining heteroplasmic and homoplasmic variants
We calculated HF (%) by dividing the number of variants reads by the total number of reads to determine the proportion of variant alleles at each site of the mitochondrial genome. To reduce false positives, variants were filtered out if HF < 1%. To identify true heteroplasmic variants, the remaining variants displaying an HF from 1% to 98% were considered heteroplasmic variants, whereas those with HF ≥ 98% were considered homoplasmic variants.
dN/dS ratio calculation
The dN/dS ratio was obtained by dividing the number of non-synonymous mutations per non-synonymous site by the number of synonymous mutations per synonymous site. Generally, dN/dS assesses selective selection acting on a gene, which may be positive (dN/dS > 1), negative (dN/dS < 1) or neutral (dN/dS ≈ 1).
Database construction
The online database MTCards (http://www.genemed.tech/mtcards) was developed using JavaScript, PHP and Perl using a Linux platform on a NGINX web server. A front and back separation model was used in this study. The front end was based on vue and used the UI Toolkit element, which supports all modern browsers across platforms, including Microsoft Edge, Safari, Firefox and Google Chrome. The back end was based on Laravel, a PHP web framework.
MTCards was developed and is supported by versatile browsing and searching functionalities. All data were stored in a MySQL database. Users can easily access the data through this web interface. The web interface of MTCards contains the search, browse and download modules.
Haplogroup calling
We used the publicly available software HaploGrep2 (53) (http://haplogrep.uibk.ac.at/), an unsupervised clustering method based on Phylotree17, to classify each participant into different mitochondrial haplogroups based on detected mtDNA variants. rCRS was used as the reference genome, and the first (ranked) hits were selected for further analysis. The nomenclature of mtDNA haplogroups begins with a capital letter that indicates a major branch of the mitochondrial DNA tree, followed by a series of lowercase letters and numbers that indicate a subsequent branch of the tree. It is important to note that individuals classified in a subclade were collapsed into the parent haplogroup. For example, individuals in haplogroup B included those classified as ‘B’, ‘B4a’, ‘B5b1’ and ‘B4c1b2c2’. Based on the MITOMAP criteria, we combined haplogroups into higher level haplogroups (such as ‘M’, ‘N’ and ‘L’), which were defined as a haplogroup lineage. Similarly, individuals in the haplogroup lineage included all subsequent branches of the mitochondrial DNA tree. For example, individuals in macrohaplogroup M included those classified as ‘M’, ‘D’, ‘C’, ‘G’, ‘E’, ‘Q’ and ‘Z.’ For comparison, we used the same method to define the mitochondrial haplogroups of the 2534 individuals from the 1KG.
Calculation of mtDNA copy number from sequencing data
Autosomal and mitochondrial average depths were separately accounted for by the bamdst (version 1.0.7) from the aligned binary alignment map files.
Statistical analyses
The P-values for all comparisons of mean values were calculated using the Wilcoxon signed-rank test. Proportions were determined using the chi-square or Fisher’s exact test with Bonferroni correction, as appropriate. Linear regression models were used to test the association between copy number and age as well as mtDNA variants with age. Pearson and Spearman coefficients were used for the correlation analysis. All statistical analyses were two-sided with α = 0.05. and were performed using the R software (v4.0.2).
- ATP
adenosine triphosphate
- CDX
Dai Chinese
- CHB
Han Chinese
- CHS
Southern Han Chinese
- D-loop
displacementloop
- EAS
East Asians
- gnomAD
Genome Aggregation Database
- HF
heteroplasmic fraction
- Indel
insertions or deletion
- JPT
Japanese
- KHV
Kinh Vietnamese
- MitoTip
mitochondrial tRNA informatics predictor
- mtDNA
mitochondrial genome
- NGS
next-generation sequencing
- nuMTs
nuclear mitochondrial DNA segments
- OXPHOS
oxidative phosphorylation
- PD
Parkinson’s disease
- PROVEAN
protein variation effect analyzer
- rCRS
revised Cambridge Reference Sequence
- SNV
single nucleotide variant
- WES
whole-exome sequencing
- WGS
whole-genome sequencing
Acknowledgements
We are grateful to all the subjects for donating samples.
Conflict of Interest statement. None declared.
Funding
The Innovation-Driven Project of Central South University (20180033040004), Young Elite Scientist Sponsorship Program by CAST (2018QNRC001), and Hunan Science and Technology Innovation Platform and Talent Program (2020RC3060 awarded to J.L.) and National Natural Science Foundation of China (U20A20355) and Hunan Innovative Province Construction Project (2019SK2335 awarded to B.T.).
References
Author notes
Yijing Wang and Guihu Zhao authors have contributed equally.