The human iron-proteome †

Organisms from all kingdoms of life use iron-proteins in a multitude of functional processes. We applied a bioinformatics approach to investigate the human portfolio of iron-proteins. We separated iron-proteins based on the chemical nature of their metal-containing cofactors: individual iron ions, heme cofactors and iron–sulfur clusters. We found that about 2% of human genes encode an iron-protein. Of these, 35% are proteins binding individual iron ions, 48% are heme-binding proteins and 17% are iron–sulfur proteins. More than half of the human iron-proteins have a catalytic function. Indeed, we predict that 6.5% of all human enzymes are iron-dependent. This percentage differs considerably among the various enzyme classes. Human oxidoreductases feature the largest fraction of iron-dependent family members (about 37%). The distribution of iron-proteins in the various cellular compartments is uneven. In particular, the mitochondrion and the endoplasmic reticulum are enriched in iron-proteins with respect to the average content of the cell. Finally, we observed that genes encoding iron-proteins are more frequently associated with pathologies than all other human genes on average. The present research provides an extensive overview of iron usage by the human proteome, and highlights several specific features of the physiological role of iron ions in human cells.

The analysis of the subcellular location highlighted that some organelles are enriched in iron-proteins; in particular, about 7% of the proteins localized in the endoplasmic reticulum and in the mitochondrion bind iron. Finally, our data show that mutations in genes encoding iron-binding proteins are more likely to be associated with pathology than all human genes on average.


Introduction
During evolution, organisms have selected some of the available elements from the environment to catalyze physiological reactions. Consequently, some metal ions became essential to life. Iron is one of the most ancient and abundant transition metal ions in living organisms, 1,2 as it was highly available as ferrous ion in the early days of terrestrial life. 3 Iron is essential to all forms of life and participates in fundamental biological processes, such as photosynthesis, respiration and nitrogen fixation. 4,5 In cells, it is normally found in the +2 (ferrous) and/or +3 (ferric) oxidation states. Higher oxidation states may be generated transiently in the course of the catalytic cycle of enzymatic reactions. Besides individual iron ions, proteins can also bind iron-containing cofactors, such as heme or iron-sulfur clusters. [6][7][8] Heme is one of the most versatile prosthetic groups in metalloproteins. The porphyrin constituting the heme group can be of several types, including e.g. heme a, heme b, and heme c. The heme proteins that transfer electrons mainly belong to the cytochrome class, and may contain one or several heme groups; globins are heme-containing proteins involved in dioxygen binding and/or transport; other heme proteins serve as biological sensors for oxidative stress. The broad range of possible reactions occurring at the heme center is mainly based on the ability of the heme iron to coordinate small molecules like CO, NO, and O2. The protein matrix can modulate the affinity towards the different exogenous ligands. Iron-sulfur clusters contain two or more iron ions bridged by sulfide ions. Each iron ion is tetracoordinated, with its coordination sphere typically completed by the sulfur or nitrogen atoms of cysteine and histidine side chains, respectively. 9 The metal site of rubredoxin, which contains a single iron ion coordinated by four cysteines, is generally classified as the simplest unit of iron-sulfur clusters.
Iron-sulfur clusters are among the most versatile inorganic cofactors. 5 They are involved in a plethora of functional processes, including aerobic as well as anaerobic respiration, regulation of gene expression, amino acid and nucleotide metabolism, DNA modification and repair and tRNA modification.
Heme and iron-sulfur clusters are cofactors featuring a high chemical complexity. Therefore, their biosynthesis as well as the biosynthesis of the final holo-proteins containing these cofactors involve a significant number of different protein components, some of which are iron-binding proteins. In the human cell, these biosynthetic processes have multiple pathways, related also to cellular compartmentalization. Nevertheless, some components may move across different compartments; furthermore, the various pathways can communicate with one another via the exchange of biosynthetic intermediates.
While iron is essential for life, it can catalyze the formation of potentially toxic reactive oxygen species (ROS). This process is unavoidable in the present oxygen-rich environment, and iron and ROS are increasingly recognized as important initiators and mediators of cell death in various organisms as well as in pathological conditions in humans. 10 Therefore, biological systems must control iron metabolism by providing the adequate amount of iron for proper cellular function while limiting iron toxicity. 11,12 Iron also has a role in pathogen virulence. The growth of microbial pathogens within the host usually requires iron as an essential nutrient. 13,14 Heme-containing proteins, such as hemoglobin, and transferrin are the preferential iron sources for human pathogens. 15,16 Therefore, another crucial reason for the cell to maintain a strict control on iron homeostasis is to restrict access to iron by pathogens.
In this paper, we carried out a systematic prediction of iron-binding proteins encoded in the human genome, extending our previous analysis on iron-sulfur proteins. 17 By integrating this prediction with information on heme and individual iron ions, we obtained a complete landscape of iron handling by proteins in humans, thus providing a framework for the understanding of physiological iron metabolism and of its dysfunction in diseases.

Iron binding by human proteins and their coordination spheres
We analysed iron usage by the human proteome via three possible modes of binding: as individual iron ions, as iron-containing heme cofactors and as iron-sulfur clusters. In total, we identified 398 human genes whose protein products interact with iron (iron-proteins hereafter), i.e. about 2% of the human genes. Of these, 139 genes express proteins binding individual iron ions (Table S1, ESI †), 192 express proteins binding heme (Table S2, ESI †) and 70 express proteins binding iron-sulfur clusters (Table S3, ESI †). 17
The coordination spheres of the three different iron-containing cofactors are quite diverse; we refer to the pattern of the protein residues coordinating the iron ion(s) of the cofactor as the iron-binding pattern (IBP). The IBP is a regular expression defined by the identity of the amino acids coordinating the metal and by their spacing along the protein sequence (e.g. CX4CX25C). Thus, the coordination sphere of each iron ion corresponds to a single IBP.
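As an illustration of how a pattern such as CX4CX25C relates to the ligand positions, the following minimal sketch builds the pattern string from a list of coordinating residues. The function name `build_ibp` and the example positions are hypothetical, and the sketch assumes that the number after X counts the non-coordinating residues between two consecutive ligands.

```python
def build_ibp(ligands):
    """Build a pattern string such as 'CX4CX25C'.

    `ligands` is a non-empty list of (one-letter residue code, 1-based
    sequence position) pairs. The number after X is assumed to be the
    count of residues lying between two consecutive ligands.
    """
    ordered = sorted(ligands, key=lambda pair: pair[1])
    parts = [ordered[0][0]]
    for (_, prev_pos), (residue, pos) in zip(ordered, ordered[1:]):
        gap = pos - prev_pos - 1
        if gap > 0:
            parts.append(f"X{gap}")
        parts.append(residue)
    return "".join(parts)


# Hypothetical ligand positions, chosen only to reproduce the example pattern.
example_ligands = [("C", 41), ("C", 10), ("C", 15)]
print(build_ibp(example_ligands))
```

The input order does not matter because the ligands are sorted by sequence position before the spacers are computed.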
In IBPs of human iron-proteins binding individual iron ions, histidine is by far the most common residue. His is present in 94% of these IBPs, each of which contains on average two His (Fig. 1). Aspartate, glutamate and tyrosine are found in 53%, 30% and 10% of the identified patterns, respectively. On average, only one Asp and one Tyr are found in each IBP, whereas there can be one (such as in most iron-dependent enzymes) or two (such as in ferritins) Glu residues. Iron-sulfur binding proteins use on average three to four cysteines to coordinate the cluster. Cys is absolutely required in the IBPs of these proteins. In particular, in human iron-sulfur proteins the coordination sphere of the Fe4S4 clusters is composed exclusively of cysteines, whereas the IBPs of Fe2S2 clusters sometimes (37% of Fe2S2 IBPs) include one or two His residues. In human heme-binding proteins, IBPs commonly contain one or two His, with the exception of catalytic heme sites (such as in cytochrome P450) where Cys is more common (83% of IBPs).
The function of the metal cofactor within the protein is also correlated with the number of coordinating residues provided by the protein (i.e. the number of residues in the IBP). Indeed, the coordination sphere of the metal ion is not always completed by atoms of the protein. 64% of the sites that bind individual iron ions contain three protein residues in the IBP, whereas the others contain four protein residues. Similarly, most of the iron ions in heme cofactors have only one ligand provided by the protein (about 58%), which allows the substrate to occupy the second heme axial position. The remaining 42% of heme sites have two coordinating residues provided by the protein. In iron-sulfur proteins, the most common number of protein ligands is 4; however, all the iron-sulfur clusters that perform a catalytic function have only three Cys ligands in the IBP. It is thus evident that there is a trend for human iron-proteins to have a lower number of residues in their IBPs when the metal-binding site performs a catalytic function, in order to allow the iron ion to coordinate directly to the substrate, as already observed for other metal-containing proteins. 18

Subcellular localization of human iron-proteins
We then analysed the subcellular localization of the human iron-proteins identified through our search (Tables S4-S6, ESI †). This information is not available for 94 proteins (37 binding individual iron ions, 10 binding iron-sulfur clusters, and 47 binding hemes), which were thus excluded from this analysis. Various proteins are present in more than one compartment, and thus were included in the statistics of each relevant organelle. Fig. 2 summarizes the distribution of the different types of iron-proteins within each cellular compartment and reports the fraction of iron-proteins with respect to the total number of proteins localized in each compartment (percentages in parentheses). It appears that two subcellular locations stand out for their enrichment in iron-proteins: the mitochondrion and the endoplasmic reticulum.
Our dataset (iron-proteins for which cellular localization is known) is composed of 45% heme-binding proteins, 34% proteins binding individual iron ions, and 21% proteins binding iron-sulfur clusters. From Fig. 2, we can readily identify compartments that differ appreciably in the distribution of the types of iron-proteins. The nucleus is highly depleted of heme-binding proteins, whereas it features a relatively high number of proteins binding individual iron ions. On the other hand, the mitochondrion is the compartment most enriched in iron-sulfur proteins with respect to both other types, whereas the endosome is mostly enriched in heme-binding proteins and does not contain any iron-sulfur protein. In addition, the endoplasmic reticulum is enriched in heme-binding proteins and depleted in iron-sulfur proteins. The distribution of the three types of iron-proteins in the cytoplasm closely resembles that of the overall dataset. It should be noted that here we are referring to the number of proteins and not to their relative quantity, which depends on their expression levels. We did not analyze such levels in this work.
The mitochondrion and the endoplasmic reticulum are the compartments with the largest percentage of iron-proteins. As mentioned, the mitochondrion is significantly enriched in iron-sulfur proteins (about 2.5 times the average fraction for the whole cell), whereas the endoplasmic reticulum is enriched in heme-binding proteins (1.6 times the cell average). The nucleus is the only compartment where proteins binding individual iron ions are the majority of iron-proteins (1.7 times the cell average).

We also annotated the functional role of each iron-binding site (Tables S4-S6, ESI †). This information is not available for 24 proteins (14 binding iron-sulfur clusters, and 10 binding heme), which were thus excluded from this analysis. It appears that sites binding heme or individual iron ions most commonly have a catalytic role, i.e. are directly involved in enzymatic mechanisms. This is also the most common role for the entire set of iron-proteins, partly due to the low number of iron-sulfur proteins. For sites binding individual iron ions the only other relevant function is the use of the ion as a substrate, i.e. in storage and transport processes (this classification of sites is taken from the MetalPDB database 9 ). Heme-binding sites have the largest variety of functional roles, among which electron transfer is the second most common. As is well known, human heme-binding proteins also play a crucial role in the transport of molecular dioxygen and in sensing, particularly of small gaseous molecules such as NO, leading to a regulatory function. Heme-binding proteins associated with a substrate function (i.e. when the heme cofactor is the target/substrate of the protein) are involved in the biosynthesis, transport and degradation of the heme cofactor. This may also be linked to the fact that there are as many as seven different types of heme cofactors in human heme-binding proteins (heme a, b, c, d, i, o, m).
While the most common type is heme b, occurring in 90% of the heme-proteins, the synthesis of all the other heme types requires the action of specific enzymes that modify the cofactor and/or the protein binding it (e.g. cytochrome c 19 ). 20,21 The most common role for iron-sulfur proteins is transport, biosynthesis and insertion into the final target proteins of the clusters themselves (tagged as substrate). [22][23][24][25][26] This is the result of both the chemical complexity of the iron-containing clusters, which requires elaborate biosynthetic and degradation pathways, and the potential toxicity of free iron ions. The second most common roles for iron-sulfur proteins are structural and regulatory. The role of iron-sulfur clusters in several DNA- and RNA-binding proteins is not completely understood, in particular for the many systems involved in DNA repair, where the presence of the cluster could be instrumental to detect lesions. Curiously, sites performing electron transfer are less common.

Functional roles
We then checked whether there is a relationship between cellular localization and protein function in order to rationalize the patterns reported in Fig. 2. To do this we examined the lists of the iron-proteins localized to the various compartments and identified all the processes, as defined by the Gene Ontology (GO 27,28 ), associated with the corresponding genes. Seven processes involve 81% of the genes coding for iron-proteins localized to the endoplasmic reticulum (Table 1). The process involving the most iron-proteins is lipid metabolism, which is a key cellular role played by cytochromes P450; only one tenth of the genes involved in lipid metabolism code for proteins binding individual iron ions. Xenobiotic metabolic process and drug metabolism are common processes which involve exclusively heme-binding proteins and are essentially associated with cytochromes P450, which are involved in the modification of exogenous molecules, from drugs to pollutants. Proteins binding individual iron ions are involved in different pathways, such as peptidyl amino acid hydroxylation. These pathways do not involve any heme-binding protein. Overall, 92% of the iron-proteins localized to the endoplasmic reticulum are oxidoreductases, as directly observed from their Enzyme Commission (EC) numbers, and these are either members of the cytochrome P450 family (heme-containing enzymes) or iron-dependent hydroxylases (typically harboring two iron ions in their active site). The functional role of the iron-proteins in the endoplasmic reticulum is thus tightly linked to their catalytic activity, most commonly in biosynthetic or metabolic processes.
In the nucleus, 5 processes involve about 89% of the iron-proteins present in this cell compartment. Gene expression is the process associated with most of these proteins, because several genes encode iron-proteins involved in the regulation of transcription, e.g. through DNA binding or histone modification. Many iron-proteins in the nucleus are also involved in response to stress, for instance by repairing damaged DNA, in apoptosis 17 and in cell proliferation. About half of the nuclear iron-enzymes are oxidoreductases; transferases and hydrolases are relatively common.
In the mitochondrion, 6 processes involve about 63% of all iron-proteins within this cellular compartment. The process involving the largest number of iron-proteins is cellular respiration, which leverages both heme-binding and iron-sulfur proteins (6 vs. 10 genes, respectively). Other processes involving more than 10 genes are cell death, iron ion homeostasis and response to stress (which is mainly response to oxidative stress), half of which are iron-sulfur proteins. The biosynthesis of iron-sulfur clusters itself involves genes encoding iron-sulfur proteins. At the functional level, the observed enrichment of the mitochondrion in iron-sulfur proteins (Fig. 2) is largely accounted for by the involvement of these proteins in the respiratory chain, in stress response and in the assembly of iron-sulfur clusters themselves. For the latter, the clusters are transiently bound by various proteins along the biosynthetic pathway, also depending upon the final target for cluster insertion. 25,26,29 The electron transfer capabilities of iron-sulfur proteins are important but not the only determinant of the higher abundance in the mitochondrion of iron-sulfur proteins with respect to all iron-proteins.

Uncharacterized putative human iron-proteins
Our analysis identified several proteins that had not been described in the literature as binding iron or iron-containing cofactors. In particular, Retinoid-related Orphan Receptors alpha, beta and gamma (RORα, RORβ, and RORγ, hereafter) were predicted to have a heme-binding site similar to that found in REV-ERBα and REV-ERBβ. The REV-ERB family binds heme with two axial ligands: one His and one Cys. 30 The sequence alignment of these two families (Fig. S1, ESI †) clearly shows that the His ligand is strictly conserved also in the ROR family whereas the Cys ligand is not. However, the superimposition of the heme-containing 3D structure of REV-ERBβ (PDB code 3CQV 30 ) with the experimental structures of RORα, RORβ and RORγ (PDB codes 1N83, 31 1NQ7, 32 4WLB, 33 respectively) shows that the latter contain a Cys (Cys323, Cys262 and Cys320, respectively) that is essentially in the same position as the heme-binding Cys384 of REV-ERBβ (Fig. 4A). A small rearrangement of the side chains of the Cys residues would bring their Sγ atoms to a distance from the iron ion compatible with the formation of a coordination bond. This Cys corresponds to a strictly conserved position in the multiple sequence alignment of the ROR family (Fig. S1, ESI †). Furthermore, the cavities of the 3D structures of ROR are sterically compatible with the binding of a heme molecule and the regions in contact with the cofactor have a high sequence similarity with the REV-ERB family. Another new putative heme-binding protein is the extracellular matrix protein FRAS1. This protein is located in the plasma membrane: it has a very long region exposed in the extracellular matrix and a short cytoplasmic tail. We identified three putative heme-binding sites in the extracellular part. We predicted the occurrence of a site with two potential axial ligands (His2080 and His3301) whereas for the other two sites, we predicted only one ligand, i.e. His1799 and His1945, respectively.
The structure of this protein is not available and we were not able to build a 3D structural model, which would have allowed us to evaluate the possible geometrical features of the three predicted sites. The HSPB1-associated protein 1 is another potential iron-binding protein, which could bind a single iron ion via its residues His175, Asp177 and His257; all three residues are highly conserved in the protein family. For this protein we could identify a suitable template in the PDB for 3D structural prediction by homology modeling: the Hypoxia-inducible factor 1-alpha inhibitor, which has a sequence identity to human HSPB1-associated protein 1 of 26% and contains a site binding a single iron ion. The structural model in Fig. 4B shows that the predicted ligands of HSPB1-associated protein 1 have the proper spatial configuration to bind an iron ion. Finally, we predicted as a putative heme-binding protein the phosphatidylinositol 3,4,5-trisphosphate 5-phosphatase 2. Neither a structure nor a suitable 3D template for the putative heme-binding region of this protein is available. This prediction, however, appears less reliable than the previous ones.

Pathogenic alterations associated to human iron-proteins
To assess the impact of the iron-proteome on human health, we investigated how often defects or mutations affecting genes encoding iron-proteins are associated with pathologies (Tables S4-S6, ESI †). We analysed only proteins in the Swiss-Prot database (Reviewed proteins) 34 and excluded those from the TrEMBL database, which are just predicted and do not have mutational studies associated. Thus, we took into account 385 proteins (137 binding individual iron ions, 178 binding heme, and 70 binding iron-sulfur clusters). Of these, 148 are related to one or more pathogenic mutations or alterations, corresponding to about 38% of the total. Interestingly, if we consider the different types of iron sites, we find that more than half of the identified iron-sulfur proteins are involved in pathologies (37/70, corresponding to 53%). For proteins binding individual iron ions or heme cofactors, the percentage of proteins associated with pathologies is 31% (i.e. 43/137) and 38% (i.e. 68/178), respectively. As of January 2018, the total number of human proteins in the Swiss-Prot database was 20259. Of these, 4014 are associated with pathogenic mutations, corresponding to about 20% of the dataset. It thus appears that on average defects or mutations affecting genes encoding iron-proteins are more commonly associated with pathologies than all the other genes.
In Table 2 we broke down the cumulative data reported in the previous paragraph for the whole human cell by looking at specific compartments. In particular, we took into consideration the compartments with the highest number of iron-proteins. In the mitochondrion, 36% of all proteins are associated with pathologies, whereas as many as 60% of mitochondrial iron-proteins are disease-related, with the main contribution from heme-proteins and iron-sulfur proteins. Similarly, in the cytoplasm and in the nucleus, heme-proteins and iron-sulfur proteins are more commonly associated with pathologies than all other human genes (Table 2).
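The percentages quoted above follow directly from the reported counts. The short sketch below recomputes them and the ratio between the disease-associated fraction of iron-proteins and that of the whole Swiss-Prot human set. The variable and function names are illustrative only, and note that the Swiss-Prot background used here includes the iron-proteins themselves.

```python
# (disease-associated proteins, total proteins) per cofactor type, from the text
disease_counts = {
    "individual iron ions": (43, 137),
    "heme": (68, 178),
    "iron-sulfur clusters": (37, 70),
}
background = (4014, 20259)  # all human Swiss-Prot proteins


def percent(hits, total):
    """Percentage of proteins with at least one associated pathology."""
    return 100.0 * hits / total


# Totals over the three cofactor types: 148 of 385 proteins
all_iron = (
    sum(hits for hits, _ in disease_counts.values()),
    sum(total for _, total in disease_counts.values()),
)
fold = percent(*all_iron) / percent(*background)

for cofactor, (hits, total) in disease_counts.items():
    print(f"{cofactor}: {percent(hits, total):.0f}%")
print(f"all iron-proteins: {percent(*all_iron):.0f}%")
print(f"all human proteins: {percent(*background):.0f}%")
print(f"ratio iron-proteins / all proteins: {fold:.2f}")
```

This is a descriptive comparison only; it does not test statistical significance.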

Discussion
398 human genes encode iron-proteins, which correspond to about 2% of all human genes. Within our approach to the identification of iron-proteins, false positives (i.e. proteins that do not bind iron but are predicted to do so) are quite unlikely to occur. This is due to the fact that we rely significantly on the known 3D structures of iron-proteins, while in the absence of structural data we scan the literature for supporting evidence. On the other hand, it is possible that we did not detect completely uncharacterized iron-proteins, especially if they are membrane-associated. Therefore, this number (398) should be taken as a lower limit, even though we expect the actual number not to be much different.
Of the 398 human iron-proteins, 48% are heme-binding proteins, 35% are proteins binding individual iron ions and 17% are iron-sulfur proteins. The intracellular distribution of these proteins is uneven, with some organelles containing a larger share of iron-proteins than others do. In particular, 7% of all the proteins localized in the endoplasmic reticulum and in the mitochondrion are iron-proteins. Thus these two organelles are significantly enriched (in comparative terms) in iron-proteins with respect to the average of the entire human cell (2%, as mentioned above). Within heme-binding proteins, 90% bind heme b and 61% are membrane-associated.
The three types of iron-proteins feature highly diverse preferences in the coordination sphere of the bound iron ions (i.e. IBPs). Cys is always present in the IBPs of iron-sulfur proteins, whereas it is practically absent from the coordination sphere of individual iron ions. Conversely, His, which is nearly always present in the IBPs of proteins binding individual iron ions, is observed rarely in the IBPs of iron-sulfur proteins. Asp is the second most common ligand in proteins binding individual iron ions. Heme-proteins have a similar preference for His and Cys in their IBPs. Cys is particularly common in the IBPs of heme-proteins that have catalytic function. This is presumably linked to the role of Cys in promoting the heterolytic breakage of the O-O bond of the iron-bound peroxide intermediate that forms along the catalytic cycle of cytochromes P450 or of nitric oxide synthase. [35][36][37] This feature is independent of the overall protein fold, and is defined by the coordination chemistry properties of the sites.
6.5% of the human enzymes are iron-proteins. Unsurprisingly, this percentage is not the same for all enzyme classes. In particular, 37% of human oxidoreductases use a catalytic iron ion. 56% of all human iron-proteins have a catalytic function (Fig. 3). These are mainly represented by proteins that bind individual iron ions: 86% of these proteins (119 out of 139) are iron-dependent enzymes. The large majority of these enzymes are oxidoreductases, in particular dioxygenases, where the iron ion is directly involved in the transfer of electrons from/to the substrate. Also, about half of the heme sites in human proteins have a catalytic function. These enzymes are primarily members of the human cytochrome P450 family, whose isoforms are significantly differentiated in terms of expression but typically have broad and overlapping substrate specificities.
Iron-binding enzymes are commonly located in the nucleus and cytoplasm, followed by the mitochondrion and endoplasmic reticulum. The latter features the highest number of heme-binding proteins as it is the most common localization for cytochromes P450. Consistent with this, we observed that processes such as drug metabolism, lipid metabolism or xenobiotic stimulus are the most common processes associated with iron-proteins localized to the endoplasmic reticulum (Table 1). In the mitochondrion, 63% of all iron-proteins are involved in only 6 processes; the process involving the largest number of iron-proteins is respiration, which leverages both heme-binding and iron-sulfur proteins. The mitochondrion is the most likely localization for iron-sulfur proteins (Fig. 2), whose primary processes within this compartment are, besides respiration, the biosynthesis of iron-sulfur clusters and the response to oxidative stress. The biosynthesis of iron-sulfur clusters is among the most common functional roles of iron-sulfur proteins at the level of the whole cell, 17,38 owing to the chemical complexity of this group of cofactors. Within the nucleus, iron-proteins are largely involved in various aspects of the regulation of protein expression, such as histone modification. In addition, DNA binding, DNA biosynthesis and DNA replication involve several iron-proteins, especially iron-sulfur proteins.

Table 2 Number of proteins associated to at least one pathology in UniProt and their ratio with respect to the total number of iron proteins in each cellular compartment, compared with the data for all human proteins. The percentage of disease-related proteins is in parentheses. Columns: Heme; Individual iron ions; Iron-sulfur clusters; Total iron-proteins; All human proteins

We identified three human members of the retinoid-related orphan receptor (ROR) family as potentially harbouring a heme-binding site similar to those observed in proteins of the REV-ERB family. In the absence of experimental evidence in the literature, our hypothesis is supported by the strict conservation of the two potential heme ligands. The experimental structures of RORα, RORβ, and RORγ feature a His and a Cys residue in spatial positions corresponding to the His and Cys ligands of iron in REV-ERBβ. Another putative human iron-binding protein is the HSPB1-associated protein 1. A structural model of this protein shows that the reciprocal position in 3D space of the putative ligands is completely consistent with our prediction (Fig. 4).
As an important aspect of the present study, we analysed how often human genes encoding iron-proteins are associated with pathologies, based on the occurrence of disease-associated mutations reported in the Swiss-Prot database. The percentage of genes encoding iron-proteins that are associated with pathologies is almost 40%, which is higher than the corresponding percentage for all human genes (about 20%). In practice, two genes out of 10 are associated with pathogenic mutations in the human genome, whereas this proportion is essentially doubled if we take into account specifically the genes encoding iron-proteins. Interestingly, this percentage peaks at 72% for all heme-binding proteins in the mitochondrion.
In summary, this work provided an extensive overview of iron usage by human proteins, spanning from iron coordination properties to biochemical/cellular function and compartmentalization, and addressing the interplay between these aspects. We observed that the distribution of the type of iron cofactors and of their catalytic properties is quite uneven, with some organelles such as the mitochondrion or the nucleus displaying higher occurrence than the others. The main localization of iron-dependent enzymes, which constitute 6.5% of all human enzymes, is the endoplasmic reticulum, where they catalyze the modification of both endo- and exogenous molecules and metabolites. Human iron-enzymes have a lower number of protein residues in their IBPs, in order to allow the iron ion to coordinate directly to the substrate.

Materials and methods
Proteins are generally composed of one or more functional regions, commonly termed domains. The identification of domains that occur within proteins can therefore provide insights into their function. Pfam is a database of protein domains, defined on the basis of the comparison of ensembles of protein regions that share a significant degree of sequence similarity, thereby suggesting homology. Each domain is represented by a multiple sequence alignment and by a more complex mathematical representation called a hidden Markov model (HMM). HMMs can be used for analyzing proteomes to search for occurrences of the corresponding domain (see below). Each domain entry in the Pfam database has an annotation, which may include the ability to bind metal cofactors.
Using the approach described in ref. 39 as implemented in the RDGB program, 40 we predicted all iron-binding proteins encoded by the human genome. RDGB is a computational tool written in Python. The approach of RDGB exploits the protein domains of the Pfam database to identify putative homologues of the proteins of interest in any desired genome or list of genomes. Thus, the input to RDGB is a list of Pfam domains of interest (in our case, domains associated with iron-binding capability) and a list of genomes to be analyzed (in our case only the human genome).
The input list of Pfam domains is created by merging two lists: first, the list of all Pfam domains annotated as iron-binding, retrieved by mining the text of the annotations in the database; second, the list of Pfam domains obtained from the analysis of the sequences of iron-binding proteins with known 3D structure that are available from the Protein Data Bank (PDB). In the latter case, we also extract from the PDB the pattern of amino acids that are responsible for metal binding (i.e. the metal binding pattern, MBP) and its position within the domain sequence. The MBP is defined by the identity and spacing of the amino acids, e.g., CX4CX20H, where X is any amino acid. This pattern provides a way to filter the initial results in order to reduce the number of false positives 39 (i.e., of the proteins containing a Pfam domain annotated as iron-binding but which in reality are unable to bind iron) by rejecting the proteins that lack the MBP or that have the MBP in the wrong position within the domain. The MBP filter cannot be applied in the absence of a relevant 3D structure available from the PDB. The MetalPDB database contains information on all the MBPs and the Pfam domains found in structurally characterized metalloproteins. 9 Our search started from 352 Pfam domains: 261 with an associated iron-containing 3D structure (102 binding individual iron ions, 80 binding iron-sulfur clusters, and 79 binding heme) and 91 annotated as iron-binding domains.
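The MBP filter described above can be sketched as follows. This is not the RDGB implementation: the function names, the `expected_offset` and `tolerance` parameters, and the way the position within the domain is compared are all assumptions made for illustration. The sketch translates a pattern such as CX4CX20H into a regular expression and accepts a domain hit only if the pattern occurs near the expected offset from the domain start.

```python
import re


def mbp_to_regex(mbp):
    """Translate a pattern such as 'CX4CX20H' into the regex 'C.{4}C.{20}H'."""
    parts = []
    for gap, residue in re.findall(r"X(\d*)|([A-WYZ])", mbp):
        if residue:
            parts.append(residue)
        else:
            # A bare X (no number) is read as a single unspecified residue.
            parts.append(".{%s}" % (gap or "1"))
    return re.compile("".join(parts))


def passes_mbp_filter(sequence, domain_start, domain_end, mbp,
                      expected_offset, tolerance=5):
    """Return True if the MBP occurs within the domain near its expected offset.

    `domain_start`/`domain_end` are 0-based slice bounds of the Pfam domain hit;
    `expected_offset` and `tolerance` are hypothetical parameters expressing
    where the pattern should start relative to the domain start.
    """
    regex = mbp_to_regex(mbp)
    domain = sequence[domain_start:domain_end]
    for pos in range(len(domain)):
        if regex.match(domain, pos) and abs(pos - expected_offset) <= tolerance:
            return True
    return False


# Synthetic sequence with the pattern starting at index 10 (illustration only).
seq = "A" * 10 + "C" + "G" * 4 + "C" + "G" * 20 + "H" + "A" * 10
print(passes_mbp_filter(seq, 0, len(seq), "CX4CX20H", expected_offset=10))
print(passes_mbp_filter(seq, 0, len(seq), "CX4CX20H", expected_offset=30))
```

Scanning every start position, rather than using non-overlapping matches, avoids missing an occurrence that overlaps an earlier partial match.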
This search was complemented by locally searching for MBPs within all human protein sequences. This is done by extracting, from the HMM representing the Pfam domain that contains the binding site of interest, only the regions around the MBP. This ''trimmed domain'' provides a convenient way to search for a MBP regardless of the agreement with the whole Pfam domain, thus affording a better sensitivity in the detection of MBPs in divergent sequences. 41 In total we retrieved 363 human iron-proteins. As a qualitative indicator of reliability of our dataset, we checked whether one of the following conditions applied (in decreasing order of reliability):
(1) A 3D structure of the human protein in the iron-bound form is available (105 proteins).
(2) A 3D structure of a close homolog (sequence identity ≥50%) of the human protein in the iron-bound form is available (76 proteins).
(3) The predicted protein contains an iron-binding Pfam domain with a conserved MBP (147 proteins).
(4) The predicted protein contains a conserved MBP (based on local search) (22 proteins).
(5) The predicted protein contains an iron-binding Pfam domain, but the occurrence of the MBP cannot be verified due to the lack of a 3D structure for that domain family (13 proteins).
We complemented these predictions by adding the proteins annotated in the UniProt database, a public comprehensive resource of protein sequence and functional information, as ''iron-binding'', ''iron-sulfur-binding'', or ''heme-binding''. This contributed 35 additional iron-proteins.
For each predicted iron-protein, we retrieved the following annotations from UniProt: 42 intracellular location, EC number, biological processes as reported in the Gene Ontology database, 43 involvement in diseases. Further annotations, such as the cofactor role and type, were manually added by inspecting the literature. We used the Swiss-Prot database (which contained 20259 entries as of February 2018) 34 to compare the iron-protein dataset with all human proteins. For the latter dataset, annotations were retrieved from UniProt in the same way as for the iron-protein dataset.
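The counts reported for the five reliability levels and for the UniProt additions can be cross-checked with a few lines of arithmetic. The dictionary keys below simply mirror the numbering of the conditions in the list, and the variable names are illustrative.

```python
# Number of predicted iron-proteins per reliability condition (1 = most reliable)
evidence_counts = {1: 105, 2: 76, 3: 147, 4: 22, 5: 13}
uniprot_additions = 35  # proteins added from UniProt annotations

predicted = sum(evidence_counts.values())
total = predicted + uniprot_additions

# Share of the final dataset backed by an iron-bound 3D structure of the
# human protein or of a close homolog (conditions 1 and 2)
structure_backed = evidence_counts[1] + evidence_counts[2]

print(predicted, total)
print(f"{100.0 * structure_backed / total:.1f}% backed by conditions 1-2")
```

The totals agree with the 363 predicted iron-proteins and the 398 iron-proteins of the final dataset.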
The 3D structural model of the HSPB1-associated protein 1 was built using MODELLER v.9.2 44 and energy-refined using the AMBER 45 web server provided by the WeNMR platform. 46

Abbreviations
IBP Iron-binding pattern
ROS Reactive oxygen species
ROR Retinoid-related orphan receptor

Conflicts of interest
There are no conflicts to declare.