MitoCarta2.0: an updated inventory of mammalian mitochondrial proteins

Mitochondria are complex organelles that house essential pathways involved in energy metabolism, ion homeostasis, signalling and apoptosis. To understand mitochondrial pathways in health and disease, it is crucial to have an accurate inventory of the organelle's protein components. In 2008, we made substantial progress toward this goal by performing in-depth mass spectrometry of mitochondria from 14 organs, epitope tagging/microscopy and Bayesian integration to assemble MitoCarta (www.broadinstitute.org/pubs/MitoCarta): an inventory of genes encoding mitochondrial-localized proteins and their expression across 14 mouse tissues. Using the same strategy we have now reconstructed this inventory separately for human and for mouse based on (i) improved gene transcript models, (ii) updated literature curation, including results from proteomic analyses of mitochondrial sub-compartments, (iii) improved homology mapping and (iv) updated versions of all seven original data sets. The updated human MitoCarta2.0 consists of 1158 human genes, including 918 genes in the original inventory as well as 240 additional genes. The updated mouse MitoCarta2.0 consists of 1158 genes, including 967 genes in the original inventory plus 191 additional genes. The improved MitoCarta2.0 inventory provides a molecular framework for system-level analysis of mammalian mitochondria.


INTRODUCTION
There is increasing appreciation for the essential roles that mitochondria play not only in oxidative phosphorylation and energy metabolism, but also in small molecule metabolism, ion homeostasis, immune signalling and cell death. Mitochondria originally descended from an endosymbiotic bacterium, predicted to resemble modern-day α-proteobacteria, early in eukaryotic evolution (1). Mammalian mitochondria contain their own genome (mtDNA), which encodes a total of 13 proteins that are all core components of oxidative phosphorylation. However, the remaining >1000 mitochondrial proteins (2) are nuclear encoded and imported into the organelle. Mutations in either the mtDNA or the nuclear genome underlie the largest collection of inborn errors of metabolism (3), and there is growing evidence that a gradual decline in mitochondrial activity is associated with aging and age-associated disorders.
To fully understand the molecular basis of mitochondrial physiology and the organelle's role in disease, it is very useful to have a complete protein parts list for this organelle. In 2008, we constructed the MitoCarta1.0 inventory of mitochondrial proteins using multiple experimental and computational approaches (4). At that time, we purified mitochondria from 14 mouse tissues and performed in-depth tandem mass spectrometry (MS/MS) to identify mitochondrial proteins. We then compiled complementary clues of mitochondrial localization from homology to yeast and Rickettsia prowazekii proteins, presence of mitochondrial targeting signals and protein domains, and RNA coexpression across tissues and during mitochondrial biogenesis (Figure 1). Using a naïve Bayes integration (5), every mouse gene was assigned a combined score of mitochondrial localization from the seven data sources, each weighted by its accuracy based on large training sets of known mitochondrial and non-mitochondrial mouse genes. The resulting MitoCarta1.0 inventory of 1098 mouse genes contained 591 curated mitochondrial components used for training, 131 proteins validated using GFP/microscopy, and 376 proteins assigned to the organelle at a 10% false discovery rate (FDR). MitoCarta1.0 has been widely used to elucidate the function of uncharacterized genes and pathways (6-9), including reverse genetic screens (8,10) and forward genetic approaches to identify genes underlying rare mitochondrial disorders (11-14).
Here we present an updated MitoCarta2.0 inventory using the same overall strategy (Figure 1). The seven underlying data sources have been substantially updated using improved transcript models, MS/MS search algorithms, database versions and homology detection methods. Furthermore, the mitochondrial training set was increased by 60%. The MitoCarta2.0 inventory consists of 1158 human genes and 1158 mouse genes encoding mitochondrial proteins. The MitoCarta2.0 website (www.broadinstitute.org/pubs/MitoCarta) freely provides the updated mitochondrial gene identifiers, evidence of mitochondrial localization, protein expression across 14 mouse tissues and protein sequences.

Method overview
The MitoCarta2.0 inventory of mitochondrial proteins is constructed by first compiling the evidence of mitochondrial localization from seven complementary data sources (Figure 1). In parallel, we compiled large training data of known mitochondrial proteins (T_mito) and non-mitochondrial proteins (T_non_mito). These training data are used to assess the accuracy of each input data source by computing a likelihood score of mitochondrial localization at a range of input values (Figure 2). The seven individual likelihood scores are combined using a naïve Bayes methodology into an overall score for each gene (15). The resulting naïve Bayes score is far more accurate at scoring the known training data than any individual method (Figure 2). The final MitoCarta2.0 inventory is constructed by combining the T_mito training data with all genes whose scores correspond to a false discovery rate of 5% or less (Figure 1).

Training data
All human and mouse genes are partitioned into three sets: T_mito (960 human, 961 mouse), T_possible_mito (816 human, 750 mouse) or T_non_mito (17468 human, 18918 mouse) as follows.
The T_mito set of definite mitochondrial proteins is the union of (i) literature curation of proteins with strong experimental evidence of mitochondrial localization in mammals (see Supplementary Data), (ii) presence in the mitochondrial matrix proteome or intermembrane space (IMS) proteome in HEK 293T cells based on APEX-labeling (18,19) or (iii) confirmed mitochondrial localization by GFP-tagging and microscopy (4). Fifteen proteins in the previous T_mito1.0 were excluded based on updated literature curation (Aadat, Armc4, Eln, Iqce, Mobp, Myl10, Nt5c3, Phyhipl, Pisd, Pla2g15, Pts, Tmem143, Tmem186, Tshz3, Txn1). We include the APEX-based matrix and IMS proteomes in T_mito given the extremely high specificity of APEX-labeling. Human and mouse T_mito sets were created using human-mouse orthologs (best reciprocal BlastP hits, Expect < 1e-3) with the addition of species-specific genes with literature evidence.
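The best-reciprocal-hit rule used for human-mouse ortholog mapping can be illustrated with a minimal sketch. It assumes the BlastP results are already available as (query, target, E-value) tuples; the function names and the toy gene identifiers are hypothetical and are not part of the MitoCarta pipeline.

```python
def best_hits(hits, max_evalue=1e-3):
    """Return the lowest-E-value target for each query.

    hits: iterable of (query, target, evalue) tuples.
    Hits with evalue >= max_evalue are ignored (Expect < 1e-3 by default).
    """
    best = {}
    for query, target, evalue in hits:
        if evalue >= max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-3):
    """Keep pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Toy search results (hypothetical identifiers and E-values).
human_vs_mouse = [("hA", "mA", 1e-50), ("hA", "mB", 1e-10), ("hB", "mA", 1e-20)]
mouse_vs_human = [("mA", "hA", 1e-48), ("mB", "hA", 1e-9)]
orthologs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

In this toy example only hA and mA are each other's top hit, so they form the single ortholog pair; hB is dropped because its best mouse hit prefers hA.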
Genes that did not meet our selective criteria for inclusion in T_mito but which had some evidence of mitochondrial localization from the MitoP2 database (20) [...] were assigned to T_possible_mito.
We define the non-mitochondrial training set T_non_mito as all genes not in T_mito or T_possible_mito. This differs from MitoCarta1.0, where T_non_mito1.0 contained 2519 genes whose proteins were reliably localized in non-mitochondrial compartments (e.g. ER, nucleus, lysosome, plasma membrane, vacuole). Thus, T_non_mito2.0 now contains thousands of cytoplasmic proteins that were previously underrepresented.
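The three-way partition of genes into training sets amounts to simple set arithmetic, sketched below. The function name and gene symbols are hypothetical; the real sets come from curation and database evidence as described above.

```python
def partition_genes(all_genes, definite_mito, some_evidence):
    """Split genes into (mito, possible_mito, non_mito) training sets."""
    all_genes = set(all_genes)
    mito = set(definite_mito) & all_genes
    # Genes with weaker evidence that did not qualify as definite.
    possible = (set(some_evidence) & all_genes) - mito
    # Every remaining gene is treated as non-mitochondrial.
    non_mito = all_genes - mito - possible
    return mito, possible, non_mito


# Toy gene symbols chosen only for illustration.
genes = {"G1", "G2", "G3", "G4", "G5"}
mito, possible, non_mito = partition_genes(genes, {"G1", "G2"}, {"G2", "G3"})
```

Defining the non-mitochondrial set as the complement guarantees that the three sets are disjoint and together cover every gene.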

Data integration
As in MitoCarta1.0, seven methods for determining mitochondrial localization were integrated using the Maestro naïve Bayes classifier, whereby each method is weighted based on its accuracy (15). Training sets (T_mito and T_non_mito) were used to create a LogOdds score for each feature F at each predefined bin b (Figure 2A), defined as $\log_2 [P(F_b \mid T_{mito}) / P(F_b \mid T_{non\_mito})]$. Assuming conditional independence between the data sets (Supplementary Figure S1), the individual LogOdds scores were summed to create a Maestro score for each gene (Figure 2B, C). For transcript or protein level scores, the gene inherited the highest score of any isoform. The scores for the seven genomic features were calculated at predefined ranges (Figure 2A) (4). While the scoring method was identical to MitoCarta1.0, the original MS/MS spectra were searched against new RefSeq transcript models using updated SpectrumMill software that reversed faulty acquisition-time lock mass correction of MS1 scans and precursor masses, which resulted in 75% more spectra identified and 10% more unique peptides identified. The improvements are chiefly due to reversing the faulty lock mass calibration (see Supplementary Data).
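The scoring scheme can be sketched as follows, assuming the LogOdds for a bin is the log2 ratio of the fraction of mitochondrial training genes to the fraction of non-mitochondrial training genes falling in that bin. This is an illustration, not the Maestro implementation: the function names, the bin edges and the pseudocount (added here only to avoid division by zero in empty bins) are assumptions.

```python
import math
from bisect import bisect_right


def bin_index(value, edges):
    """Index of the bin containing value, given sorted interior bin edges."""
    return bisect_right(edges, value)


def log_odds_per_bin(mito_values, non_mito_values, edges, pseudocount=1.0):
    """log2[P(bin | mito) / P(bin | non-mito)] for every bin of one feature."""
    n_bins = len(edges) + 1
    mito_counts = [pseudocount] * n_bins
    non_counts = [pseudocount] * n_bins
    for value in mito_values:
        mito_counts[bin_index(value, edges)] += 1
    for value in non_mito_values:
        non_counts[bin_index(value, edges)] += 1
    mito_total = sum(mito_counts)
    non_total = sum(non_counts)
    return [math.log2((m / mito_total) / (n / non_total))
            for m, n in zip(mito_counts, non_counts)]


def combined_score(gene_features, tables, edges_by_feature):
    """Sum per-feature LogOdds, as conditional independence permits."""
    return sum(tables[name][bin_index(value, edges_by_feature[name])]
               for name, value in gene_features.items())


def gene_level_score(isoform_scores):
    """A gene inherits the highest score of any of its isoforms."""
    return max(isoform_scores)


# Toy training values for one feature with a single bin edge at 0.5.
edges = [0.5]
table = log_odds_per_bin([0.9, 0.8, 0.7, 0.2], [0.1, 0.2, 0.3, 0.9], edges)
```

Because the per-feature scores are added on a log scale, a feature whose bins separate the two training sets poorly contributes LogOdds near zero and has little influence on the combined score.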
Yeast: categorical score [OrthologMitoHighConf, OrthologMitoLowConf, HomologMitoHighConf, HomologMitoLowConf, NoMitoHomolog]. Homology was determined by BlastP (22) top hit (Expect < 1e-3) or jackHMMER (23) reciprocal hit (see Supplementary Data). Orthology was defined as a 1:1 homolog (i.e. the yeast gene had only one homolog in human/mouse). Genes with a yeast homolog/ortholog annotated as mitochondrial in SGD (Saccharomyces Genome Database, 03/06/14) (24) were scored as either MitoHighConf (SGD manual annotation, excluding dual localized proteins) or MitoLowConf (SGD dual localized proteins, or annotated mitochondrial based on high throughput data only). Genes that lacked a yeast homolog or where the yeast homolog was not annotated mitochondrial were categorized as NoMitoHomolog. Compared to MitoCarta1.0, this scoring method was more sensitive due to use of jackHMMER to identify distant homologs, and more specific due to the separate scoring of orthologs and homologs.
Protein domain: categorical score [MitoDomain, NonMitoDomain, SharedDomain or NA] representing presence of a protein domain that is exclusively mitochondrial, exclusively non-mitochondrial, ambiguous or not present in any annotated eukaryotic protein (UniProt Knowledgebase Release 2014_06) (27,28). Protein domains were identified using HMMER (23) based on Pfam version 27 (29). This scoring was identical to MitoCarta1.0, with updated UniProt and Pfam databases.
Endosymbiont ancestry: categorical score [Ortholog, Homolog, NoHomolog], where homology was defined by BlastP (Expect < 1e-3) or jackHMMER to Rickettsia prowazekii (see Supplementary Data), and orthology was defined as a 1:1 homolog (i.e. the Rickettsia gene has only one homolog in human/mouse). Compared to MitoCarta1.0, this scoring was more sensitive due to use of jackHMMER and more specific due to separate scoring of orthologs and homologs.
Induction: log2 fold-change of mRNA expression in cellular models of mitochondrial proliferation (overexpression of PGC-1α in mouse myotubes) compared to controls (15,31). Compared to MitoCarta1.0, probes in this data set were re-annotated using new transcript models (32), and data were normalized using gcRMA (33).
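The ortholog/homolog distinction used for the yeast and endosymbiont features can be sketched as below, under the assumption that each human/mouse gene has already been mapped to at most one reference-species gene. The function names and identifiers are hypothetical; the category labels are those listed above.

```python
from collections import Counter


def classify_homology(query_to_target):
    """Label each query gene 'Ortholog', 'Homolog' or 'NoHomolog'.

    query_to_target maps a human/mouse gene to its reference-species hit,
    or to None when no hit passed the homology threshold.
    """
    # Count how many query genes map to each reference gene.
    hits_per_target = Counter(
        target for target in query_to_target.values() if target is not None
    )
    labels = {}
    for gene, target in query_to_target.items():
        if target is None:
            labels[gene] = "NoHomolog"
        elif hits_per_target[target] == 1:
            labels[gene] = "Ortholog"  # 1:1 relationship
        else:
            labels[gene] = "Homolog"
    return labels


def yeast_category(label, mito_confidence):
    """Combine a homology label with the yeast annotation confidence.

    mito_confidence is 'HighConf', 'LowConf', or None when the yeast
    gene is not annotated as mitochondrial.
    """
    if label == "NoHomolog" or mito_confidence is None:
        return "NoMitoHomolog"
    return f"{label}Mito{mito_confidence}"


# Toy mapping: B and C share the same yeast hit, D has none.
labels = classify_homology({"A": "y1", "B": "y2", "C": "y2", "D": None})
```

Scoring orthologs and homologs as separate categories lets the training data assign each its own LogOdds, rather than pooling strong 1:1 evidence with weaker many-to-one evidence.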
In contrast to MitoCarta1.0, these seven features were generated separately for all human genes and all mouse genes. Features that were mouse-specific (MS/MS, Coexpression, Induction) were mapped to human orthologs (BlastP best reciprocal hit, Expect<1e

Accuracy of data sets and naïve Bayes integration
We assessed the accuracy of each data set and the combined Maestro naïve Bayes integration using recall and precision at predicting the training sets (Figure 2D). Recall is equivalent to sensitivity, and precision is the percentage of all predictions expected to be true: TP/(TP+FP), corrected for the size of the training data sets and equivalent to 1-FDR. As shown in Figure 2D, the Maestro naïve Bayes integration has substantially increased accuracy compared with the individual data sets. At the selected 5% FDR threshold on the human data set, the naïve Bayes method has 80% sensitivity and 99.6% specificity, far outstripping any single method. Using ten-fold cross-validation, the naïve Bayes integration showed a similar 79% sensitivity and 99.7% specificity at the same 5% FDR threshold.
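The relationship between sensitivity, specificity and a size-corrected FDR can be sketched as follows. The exact correction used for MitoCarta is not spelled out here, so this sketch assumes one common approach: rescale the training-set error rates to assumed genome-wide numbers of mitochondrial and non-mitochondrial genes. The function names and all counts are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of known mitochondrial genes that are recovered (recall)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of known non-mitochondrial genes that are rejected."""
    return tn / (tn + fp)


def corrected_fdr(sens, spec, n_mito_expected, n_non_mito_expected):
    """Expected false discovery rate after rescaling to assumed class sizes.

    n_mito_expected and n_non_mito_expected are assumptions about how many
    genes of each class exist genome-wide, not values from the source.
    """
    expected_tp = sens * n_mito_expected
    expected_fp = (1.0 - spec) * n_non_mito_expected
    total = expected_tp + expected_fp
    return expected_fp / total if total > 0 else 0.0


# Toy confusion-matrix counts and class sizes, for illustration only.
sens = sensitivity(tp=80, fn=20)
spec = specificity(tn=996, fp=4)
fdr = corrected_fdr(sens, spec, n_mito_expected=1000, n_non_mito_expected=19000)
precision = 1.0 - fdr
```

The sketch shows why very high specificity matters: because non-mitochondrial genes greatly outnumber mitochondrial ones, even a small false-positive rate produces a sizeable share of false discoveries.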

HUMAN AND MOUSE MITOCARTA2.0
We separately performed a naïve Bayes integration to create a human-centric and mouse-centric MitoCarta2.0 inventory (Figure 3).
The human MitoCarta2.0 inventory contains 1158 genes, 79% of which overlap MitoCarta1.0 (Figure 3A). Of the 240 genes not in MitoCarta1.0, 100 were detected in the APEX-based matrix or IMS proteomes, 36 have other experimental literature evidence, and 104 achieve high probability of mitochondrial localization at the 5% FDR. For example, ECHDC1 now has evidence of mitochondrial localization based on updated MS/MS, yeast homology and coexpression (Figure 2B). Of the 94 MitoCarta1.0 human genes that are now retired and not in MitoCarta2.0, 9 were pseudogenes no longer present in the latest RefSeq database and 85 no longer pass our stringent 5% FDR threshold. For example, NFXL1 was in MitoCarta1.0 based solely on a high confidence TargetP prediction (Figure 2C); however, the more recent RefSeq database replaces the previous protein fragment (XP_001052092) with a full-length protein (NP_598682) that has only a low-confidence TargetP prediction, so it is no longer predicted as resident in the mitochondrion.
Nucleic Acids Research, 2016, Vol. 44, Database issue D1255
The mouse MitoCarta2.0 inventory contains 1158 genes, 83% of which overlap MitoCarta1.0 (Figure 3B). 191 mouse genes were not in MitoCarta1.0, including 84 detected in the APEX-based matrix or IMS proteomes, 31 with other experimental literature evidence and 76 computational predictions (FDR≤5%). The previous version of RefSeq had a larger number of mouse pseudogenes. Of the 131 genes only in MitoCarta1.0, 42 were retired pseudogenes, 15 were T_mito1.0 genes no longer deemed to have strong evidence and the rest were low-confidence computational predictions.
The vast majority of human and mouse MitoCarta2.0 genes are reciprocal top hits (96%). However, the separate inventories contain species-specific genes (e.g. human ATAD3B, mouse Csl) and predictions that had slightly different species-specific scores and thus exceeded the FDR threshold in only one of the two mammalian species (e.g. human BOLA3, LDHB and mouse Ppm1m).

WEBSITE INTERFACE
The MitoCarta2.0 inventory is available at www.broadinstitute.org/pubs/MitoCarta. The human and mouse mitochondrial inventories contain the Maestro naïve Bayes score and FDR, a summary of the evidence supporting mitochondrial localization and protein expression in 14 mouse tissues. Available for download are the naïve Bayes scores and mitochondrial evidence for all human and mouse proteins, BED files of gene coordinates, FASTA files of gene sequences and Excel files of the MS/MS peptides detected across 14 mouse tissues. Supporting images from previous GFP-tagging/microscopy experiments (4) are also available.

COMPARISON TO OTHER MITOCHONDRIAL DATABASES
Multiple research groups have created inventories of mammalian mitochondrial proteins. To our knowledge, these include MitoCarta1.0 (4), MitoP2 (20,34-37), MitoProteome (38,39) and IMPI (http://impi.mrc-mbu.cam.ac.uk/); however, MitoP2 and MitoProteome are no longer available on the internet. Similar to MitoCarta, IMPI (Integrated Mitochondrial Protein Index) uses machine learning to predict mitochondrial localization in human, mouse, rat and cow based on experimental proteomics data in MitoMiner (40,41), antibody staining from the Human Protein Atlas (42) and mitochondrial targeting sequence prediction tools. The IMPI version Q2 2015 contains 1480 human Ensembl genes with substantial overlap with MitoCarta2.0 (980 in both, 500 IMPI-specific and 178 MitoCarta2.0-specific). Compared to MitoCarta's naïve Bayes methodology, IMPI's machine learning methods (support vector machines and random forests) have the advantage of allowing redundant data sets that are not conditionally independent; however, the resulting scores are less readily interpretable and the techniques are more susceptible to overfitting of the training data. Additionally, IMPI does not provide the atlas of protein expression across tissues. Several other mitochondrial-focused web resources (43) aggregate useful mitochondrial data but do not include a reference set of mitochondrial proteins, e.g. MitoMap provides human polymorphism and mutation data (44,45), HMPDb (bioinfo.nist.gov/hmpd) aggregates data from nine knowledge bases and MitoMiner aggregates extensive proteomics data with MitoCarta1.0, UniProt and IMPI (40,41).
There are also many general databases of sub-cellular localization that provide breadth across many species and a hierarchy of subcellular locations. NCBI GO cellular compartments (21) annotates mitochondrial localization based on literature reports (including MitoCarta1.0). It currently includes over 1500 human genes linked to the mitochondrion (of which 1050 are in MitoCarta2.0); however, it contains hundreds of genes with annotations electronically inferred from distant species or from single, controversial reports in the literature, and it provides no confidence score of mitochondrial localization. Similarly, UniProt (27,28) includes over 1094 human genes linked to mitochondria (of which 76% are in MitoCarta2.0) but lacks confidence scores of localization. COMPARTMENTS (46) lacks a downloadable list of mitochondrial genes, but for any query gene it provides a confidence score of localization to multiple cellular compartments (e.g. mitochondrion, nucleus, ER, cytoplasm) based on aggregating data from knowledge bases (e.g. UniProt), prediction algorithms (e.g. PSORT, yLoc) and text mining.
Overall, MitoCarta2.0 and IMPI provide the most specific inventories of mammalian mitochondrial components to the community, while broader databases such as NCBI GO and UniProt provide more breadth across species and cellular compartments.

CONCLUSION
MitoCarta2.0 represents an easy-to-use inventory of mitochondrial proteins in mouse and human along with the evidence supporting mitochondrial localization for each protein, thereby providing a molecular framework for systematic studies of mitochondrial function and physiology. The MitoCarta database can be tuned to provide more or less stringent predictions of mitochondrial proteins by altering the FDR threshold. For example, when evaluating the results of a high-throughput screen of mitochondrial function, users may want to use a less stringent threshold such as 20% FDR. Similarly, when interpreting whole exome data from patients with mitochondrial disease, a 15% FDR might help interpret recessive mutations in genes with unknown function that may actually underlie mitochondrial dysfunction.
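Tuning the FDR threshold can be sketched as a simple filter, assuming the per-gene FDR values have been loaded into a dictionary. The function name, gene symbols and FDR values below are hypothetical and do not reflect the format of the downloadable files.

```python
def build_inventory(training_mito, fdr_by_gene, fdr_threshold=0.05):
    """Union of known mitochondrial genes and predictions passing the FDR cut."""
    predicted = {gene for gene, fdr in fdr_by_gene.items() if fdr <= fdr_threshold}
    return set(training_mito) | predicted


# Toy inputs, for illustration only.
training_mito = {"GeneT"}
fdr_by_gene = {"GeneA": 0.01, "GeneB": 0.04, "GeneC": 0.12,
               "GeneD": 0.18, "GeneE": 0.60}

strict = build_inventory(training_mito, fdr_by_gene)            # 5% FDR
exome = build_inventory(training_mito, fdr_by_gene, 0.15)       # 15% FDR
screen = build_inventory(training_mito, fdr_by_gene, 0.20)      # 20% FDR
```

Relaxing the threshold can only add genes, so each looser inventory contains the stricter one; the cost is a larger expected share of false positives among the added predictions.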
The current database has several important limitations. First, it is static and does not continually incorporate new literature evidence. Second, it is a mitochondrial-centric inventory that does not identify additional cellular localizations for proteins, or those that reside in the mitochondrion only under certain conditions. Third, it is a gene-based inventory rather than an isoform-based inventory, because the underlying training data were available only for gene loci. Fourth, the training data were skewed toward proteins that reside within the double membrane, so the inventory will be less accurate at predicting proteins of the outer mitochondrial membrane. Additional experimental data sets will be needed to interrogate the outer membrane and additional tissues and conditions not covered in MitoCarta2.0.
Despite these limitations, the updated MitoCarta2.0 mitochondrial inventory provides a valuable research tool to investigate mitochondrial pathways in health and disease. We expect that in the coming years this inventory will help to elucidate the function of many specific pathways and to interpret many high-throughput data sets in molecular biology and human genetics.