Carotenoids Database: structures, chemical fingerprints and distribution among organisms

Abstract
To promote understanding of how organisms are related via carotenoids, either evolutionarily or symbiotically, or in food chains through natural histories, we built the Carotenoids Database. This provides chemical information on 1117 natural carotenoids with 683 source organisms. For extracting organisms closely related through the biosynthesis of carotenoids, we offer a new similarity search system ‘Search similar carotenoids’ using our original chemical fingerprint ‘Carotenoid DB Chemical Fingerprints’. These Carotenoid DB Chemical Fingerprints describe the chemical substructure and the modification details based upon International Union of Pure and Applied Chemistry (IUPAC) semi-systematic names of the carotenoids. The fingerprints also allow (i) easier prediction of six biological functions of carotenoids: provitamin A, membrane stabilizers, odorous substances, allelochemicals, antiproliferative activity and reverse multidrug resistance (MDR) activity against cancer cells, (ii) easier classification of carotenoid structures, (iii) partial and exact structure searching and (iv) easier extraction of structural isomers and stereoisomers. We believe this to be the first attempt to establish fingerprints using the IUPAC semi-systematic names. For extracting organisms with similar carotenoid profiles, we provide a new tool ‘Search similar profiled organisms’. Our current statistics show some insights into natural history: carotenoids seem to have been spread largely by bacteria, as they produce C30, C40, C45 and C50 carotenoids, with the widest range of end groups, and they share a small portion of C40 carotenoids with eukaryotes. Archaea share an even smaller portion with eukaryotes. Eukaryotes have then evolved a considerable variety of C40 carotenoids. Considering carotenoids, eukaryotes seem more closely related to bacteria than to archaea aside from 16S rRNA lineage analysis.
Database URL: http://carotenoiddb.jp


Introduction
Carotenoids have been investigated since the beginning of the 19th century because of the importance of their diverse biological functions (1). Investigations of their molecular structures were triggered by the successful determination of the structures of lycopene and β-carotene by Paul Karrer et al. in 1930 (2). The number of compiled carotenoid structures can be estimated to have risen almost linearly with time since 1948, at about 15 structures per year on average (see Figure 1). The growth curve shows no saturation yet, implying the existence of many carotenoids yet to be identified. According to Carotenoids Handbook [...] 'Carotenoids' (4) and by Otto Straub in the book 'Key to Carotenoids' (5). In 1987, 563 carotenoids were compiled in the 'Key to Carotenoids, second edition' by Hanspeter Pfander (6). In 1995, D. Kull and H. Pfander added 54 new carotenoids as 'Appendix' (7).
In the course of evolution, carotenoids have been developed to perform diverse functions, probably starting with photosynthetic and photoprotective pigments and later sources of color, odor and taste. All biological functions investigated here are listed at http://carotenoiddb.jp/Biological_activity/biological_activities_list.html.
Organisms are sometimes related via carotenoids symbiotically as in the case of Arbuscular mycorrhizae accumulating the apocarotenoid mycorradicin in plant-roots during colonization (8). Diatoms produce the feeding deterrent apocarotenoids apo-fucoxanthinals and apofucoxanthinones against copepods, which may significantly influence food chains (9,10).
For a deeper understanding of the world of carotenoids, that is, how organisms are related via carotenoids, either evolutionarily, or symbiotically, or in food chains through natural histories, and how carotenoids have evolved along with their biological functions, we compiled 1117 structures and their distribution among organisms using the latest available original papers. We made these data accessible via the Internet at 'http://carotenoiddb.jp'.
Aiming to extract organisms closely related through the biosynthesis of carotenoids, we developed a precise similarity search system exploiting the 'Carotenoid DB Chemical Fingerprints' from the IUPAC semi-systematic names. IUPAC semi-systematic names are very well defined to fully represent the chemical structures (11).
The Carotenoid DB Chemical Fingerprints describe the chemical substructure and modification details with modified carbon-numbering; for example, '3-OH, 3'-OH, 4=O, 4'=O, beta,beta' for astaxanthin. The carbon-numbering and the naming system follow the Nomenclature of Carotenoids approved by the IUPAC and International Union of Biochemistry (IUB) commissions (11). Our fingerprints are unique in including positional information. Consequently, precise similarity searching has been achieved by a simple scoring method.
The chemical fingerprints also allow (i) easier prediction of biological functions of carotenoids, (ii) easier classification of carotenoid structures, (iii) partial structure searching by simple string searches (for instance, 'psi,psi 4-apo 4-al') from the search box at 'http://carotenoiddb.jp/search.cgi' and (iv) easier extraction of structural isomers and stereoisomers.
It is worth noting that this is the first attempt, to our knowledge, to establish fingerprints from IUPAC semi-systematic names.

Carotenoids Database information
The Carotenoids Database provides carotenoid chemical information, distribution among source organisms, and biological functions of carotenoids. A list of all the carotenoids compiled here is available at 'http://carotenoiddb.jp/Entries/list1.html'. Information on each carotenoid is described in each entry. All the entries can be searched with a free word retrieval system at 'http://carotenoiddb.jp/search.cgi'. Information in each entry can be categorized in six types, namely, (i) name information, (ii) hierarchical classification, (iii) structural information, (iv) biological functions, (v) chemical properties and (vi) source organisms. The details are described in Table 1.
The carotenoid profile of one source organism is described in each organism entry. A list of all organisms in the Carotenoids Database is available at 'http://carotenoiddb.jp/ORGANISMS/all_org.html'. Organism entries are also searchable with a free word retrieval system at 'http://carotenoiddb.jp/search_organism.cgi'. These entries include (i) scientific name, (ii) lineage, (iii) carotenoid [...] Links from the front page of the Carotenoids Database are shown in Figure 2.

Data sources
Information on carotenoid structures, biological functions and source organisms has been collected from the latest available original papers, reviews, and books via Google Scholar, PubMed and the Chemical Abstracts Service. We also refer and link to other databases such as the KEGG COMPOUND database (12), the KNApSAcK database (13), the Lipidbank database (14) [...] (17) and Estate fingerprint (18) are generated by the PaDEL-Descriptor (16) (Figure 3).
We update the database roughly monthly, as declared in the release notes. See: http://carotenoiddb.jp/releasenotes_2016.html.

Similarity search with Carotenoid DB Chemical Fingerprints
Using the Carotenoid DB Chemical Fingerprints, we developed a simple scoring method for similarity searches. Similarity searches are possible from each entry, for example for β-carotene at 'http://carotenoiddb.jp/search_similar_carotenoid.cgi?keyword=CA00309'. In order to evaluate reaction likeliness by the frequency of fingerprints, aside from the Michaelis constant Km and/or maximum reaction velocity Vmax values, we introduced a weighted Tanimoto coefficient.
Here we define similarity as reaction likeliness. Because the fingerprints in each category vary in number of atoms, we weighted each category of fingerprints, inversely proportional to its occurrence rate with a few exceptions, to give a weighted Tanimoto coefficient (19). For example, hydroxylation and saturation occur quite frequently in carotenoids, so we assigned a small weight to those fingerprints.
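The formula itself is not reproduced in this text, so the following is only a minimal sketch of one common form of a weighted Tanimoto coefficient: the summed weights of the shared fingerprints divided by the summed weights of all fingerprints present in either carotenoid. The category weights (0.5 for hydroxylation, 1.0 otherwise), the helper names and the set representation of a fingerprint are assumptions made for illustration; the zeaxanthin and β-carotene fingerprints are written by analogy with the astaxanthin example and are not taken from the database.

```python
from typing import Set

# Hypothetical category weights; the actual values used by the database
# are not given in the text.
CATEGORY_WEIGHTS = {"hydroxylation": 0.5, "other": 1.0}


def category(token: str) -> str:
    """Assign a fingerprint token to a (simplified, assumed) category."""
    return "hydroxylation" if token.endswith("-OH") else "other"


def total_weight(tokens: Set[str]) -> float:
    """Sum the category weights of a set of fingerprint tokens."""
    return sum(CATEGORY_WEIGHTS[category(t)] for t in tokens)


def weighted_tanimoto(a: Set[str], b: Set[str]) -> float:
    """Weight of shared tokens divided by weight of all tokens in a or b."""
    union = a | b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return total_weight(a & b) / total_weight(union)


astaxanthin = {"3-OH", "3'-OH", "4=O", "4'=O", "beta,beta"}
zeaxanthin = {"3-OH", "3'-OH", "beta,beta"}   # assumed fingerprint
beta_carotene = {"beta,beta"}                  # assumed fingerprint

print(weighted_tanimoto(astaxanthin, zeaxanthin))  # 0.5
```

With equal weights the same pair would score 3/5 = 0.6; under these assumed weights the frequent hydroxylation tokens count for less, so the shared hydroxyl groups raise the score less than a shared rare modification would.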
By this combination of fingerprints and weighted Tanimoto coefficient, we obtained more precise results than with conventional fingerprints in the chemical space of carotenoids within short computational times. Comparisons with other, conventional fingerprints were done by calculating Tanimoto coefficients of all-to-all pairs of Carotenoids DB entries. See: http://carotenoiddb.jp/FTP/ and http://carotenoiddb.jp/FTP/Tanimoto_coeff_eq_1/. All the conventional fingerprints were generated by the PaDEL-Descriptor (http://padel.nus.edu.sg/software/padeldescriptor/).

Search similar profiled organisms
We have compiled 683 source organisms' carotenoid profiles. Using these profiles, we have developed a comparison tool. We introduced unweighted Tanimoto coefficients as similarity scores. This 'Search similar profiled organisms' is available at 'http://carotenoiddb.jp/search_similar_profiled_organisms.cgi'.
It seems that we succeeded in extracting species and/or organisms potentially related in some manner to each query organism.
For example, calculating the Tanimoto coefficient for the two carotenoid profiles of Cyanidioschyzon merolae (20) and Prochlorothrix hollandica strain PCC 9006 (21) gives unity. That is, both species have the same simple carotenoid profile: β-carotene and zeaxanthin, which is called ZEA-type by Takaichi et al. (22). See the profile comparisons at http://carotenoiddb.jp/search_similar_profiled_organisms.cgi?keyword=Cyanidioschyzon%20merolae. The same profile can be found in two glaucophytes, Cyanophora paradoxa (23) and Glaucocystis nostochinearum (23), at the same URL. These facts may suggest that the chloroplasts of these primitive unicellular organisms, Cyanidioschyzon merolae, Cyanophora paradoxa and Glaucocystis nostochinearum, were derived from the same cyanobacterium, which is closely related to Prochlorothrix hollandica, in agreement with Takaichi et al. (22) and Tomitani et al. (24).
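The comparison above can be sketched as an unweighted Tanimoto coefficient over sets of carotenoids. This is an illustration only: the function names, the use of carotenoid names rather than database identifiers, and the third organism are assumptions, not part of the database.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Unweighted Tanimoto coefficient: shared items / items in either set."""
    union = a | b
    if not union:
        return 1.0  # two empty profiles are treated as identical
    return len(a & b) / len(union)


def rank_similar(query: str, profiles: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Score every other organism against the query, highest score first."""
    scores = [
        (name, tanimoto(profiles[query], profile))
        for name, profile in profiles.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


profiles = {
    "Cyanidioschyzon merolae": {"beta-carotene", "zeaxanthin"},
    "Prochlorothrix hollandica": {"beta-carotene", "zeaxanthin"},
    # Hypothetical organism, included only to show a score below unity.
    "hypothetical organism X": {"beta-carotene", "zeaxanthin", "lutein"},
}

for name, score in rank_similar("Cyanidioschyzon merolae", profiles):
    print(f"{name}: {score:.2f}")
```

The two ZEA-type profiles score 1.00, while the hypothetical three-carotenoid profile shares two of three carotenoids and scores 0.67.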
However, these results are heavily dependent on the conditions, the accuracies, and the fullness of the data found in the original papers. Similarity searching of carotenoid profiles in every lineage is also possible at 'http://carotenoiddb.jp/search_simiar_profiles_in_all_levels.cgi', which is linked at every webpage of all lineages. (See 'http://carotenoiddb.jp/ORGANISMS/Prochlorothrix.html', for instance).

Predicting biological functions using Carotenoid DB Chemical Fingerprints
We have also investigated a simple method of predicting six biological functions of carotenoids using Carotenoid DB Chemical Fingerprints, namely provitamin A, membrane stabilizers, odorous substances, allelochemicals, antiproliferative activity and reverse MDR activity against cancer cells. Feature extractions are based on empirical findings from the latest original papers, which are listed at 'http://carotenoiddb.jp/Biological_activity/biological_activities_list.html'. Chemically unmodified carotenoids with β end groups can be expected to be provitamin A. Carotenoids with oxygen on both end groups are potentially membrane stabilizers. Namely, carotenoids whose fingerprints include oxygen, such as '=O' describing ketone, '-Methoxy' describing methoxy, '-Epoxy' describing epoxy, '-Glc' describing glucoside, '-al' describing aldehyde and '-SO4' describing sulfate, with carbon-numbering both with and without prime (indicating both ends of the carbon chain), are potentially membrane stabilizers. Carotenoids whose carbon [...] can be expected to be odorous substances and/or allelochemicals. Carotenoids with epoxidized β end groups with β ends on the other side, such as fucoxanthin and peridinin, function as antiproliferative agents against cancer cells (25). Therefore, carotenoids with fingerprint '5,6-Epoxy' or '5,8-Epoxy' with and/or without prime, and 'beta,beta' are predicted as possible antiproliferative agents against cancer cells. Likewise, epoxycarotenoids having β,β or β,κ or β,ε end groups (capsochrome, for example) function as reverse MDR agents against cancer cells (25). That is, carotenoids with fingerprints '5,6-Epoxy' and/or '5,8-Epoxy' with and/or without prime, and 'beta,beta' or 'beta,kappa' or 'beta,epsilon' are potentially reverse MDR agents.

Table 3 (excerpt)
Acetylenecarotenoids: End groups, and/or cis/trans, and saturation/desaturation (7,8-H), and/or glycosylation/hydroxylation/alkoxylation, and/or epoxydation, and/or aldehyde, and/or ketolation, and/or carbonylation/olide, and/or nor and/or apo
Diapocarotenoids: End groups, and/or cis/trans, and saturation/desaturation, and/or glycosylation/hydroxylation/alkoxylation, and/or epoxydation, and/or aldehyde, and/or ketolation, and/or carbonylation/olide, and/or nor and two apos
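The prediction rules stated in this section can be written as simple checks on fingerprint tokens. The sketch below is not the database's implementation: the `Carotenoid` class, the function names and the split into end groups plus modification tokens are assumptions. The oxygen markers are only those named in the text, "with β end groups" is read here as "at least one β end group", and the rule for odorous substances and allelochemicals is left out because its condition is incomplete in the text.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Oxygen-containing markers named in the text (probably not exhaustive).
OXYGEN_MARKERS = ("=O", "-Methoxy", "-Epoxy", "-Glc", "-al", "-SO4")
EPOXY_TOKENS = {"5,6-Epoxy", "5,8-Epoxy"}
MDR_END_GROUPS = {("beta", "beta"), ("beta", "kappa"), ("beta", "epsilon")}


@dataclass
class Carotenoid:
    """Hypothetical container: two end groups plus modification tokens."""
    end_groups: Tuple[str, str]
    modifications: List[str] = field(default_factory=list)


def has_oxygen(token: str) -> bool:
    return any(token.endswith(marker) for marker in OXYGEN_MARKERS)


def predict_functions(c: Carotenoid) -> Set[str]:
    """Apply the rules from the text; returns the set of predicted functions."""
    ends = tuple(sorted(c.end_groups))  # make the end-group order irrelevant
    functions: Set[str] = set()

    # Chemically unmodified, with a beta end group (assumed: at least one).
    if not c.modifications and "beta" in ends:
        functions.add("provitamin A")

    # Oxygen on both ends: oxygen tokens with and without a prime.
    oxygen = [t for t in c.modifications if has_oxygen(t)]
    if any("'" in t for t in oxygen) and any("'" not in t for t in oxygen):
        functions.add("membrane stabilizer")

    # 5,6- or 5,8-epoxy, primed or unprimed.
    epoxy = any(t.replace("'", "") in EPOXY_TOKENS for t in c.modifications)
    if epoxy and ends == ("beta", "beta"):
        functions.add("antiproliferative")
    if epoxy and ends in MDR_END_GROUPS:
        functions.add("reverse MDR")
    return functions


beta_carotene = Carotenoid(("beta", "beta"))
astaxanthin = Carotenoid(("beta", "beta"), ["3-OH", "3'-OH", "4=O", "4'=O"])
# Hypothetical epoxide, used only to exercise the reverse MDR rule.
example_epoxide = Carotenoid(("kappa", "beta"), ["5,6-Epoxy"])

print(predict_functions(beta_carotene))    # {'provitamin A'}
print(predict_functions(astaxanthin))      # {'membrane stabilizer'}
print(predict_functions(example_epoxide))  # {'reverse MDR'}
```

Keeping the end groups separate from the modification tokens avoids having to split a string such as 'beta,beta' on commas, which would otherwise be ambiguous with tokens like '5,6-Epoxy'.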

Classification of carotenoids using Carotenoid DB Chemical Fingerprints
Carotenoids are classified along with their biosynthesis pathways. We simplified the classification into three steps: first, by carbon numbers (C30, C40, C45 and C50 carotenoids); second, by end groups, of which there are seven (ψ, β, γ, ε, φ, χ and κ); and third, by chemical modification pattern, that is, hydrocarbons, hydroxycarotenoids, epoxycarotenoids, aldehydes, ketones, carboxylic acids, apocarotenoids, norcarotenoids, secocarotenoids, retrocarotenoids, olidecarotenoids, allenecarotenoids, acetylenecarotenoids and diapocarotenoids. Carotenoid DB Chemical Fingerprints allowed easier classification as shown in Table 3. Bold characters are the necessary fingerprints of each carotenoid. All these types of carotenoids are available from the links in the front page of 'http://carotenoiddb.jp'.
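The three-step key can be sketched as a small function. This is only an illustration under stated assumptions: the token suffixes are taken from the fingerprint examples quoted in the text ('-OH', '-Epoxy', '-al', '=O', '-apo'), '-nor' is assumed by analogy, only six of the modification classes are covered, and the carbon number is passed in rather than derived from the fingerprint. The actual rules are those of Table 3.

```python
from typing import List, Tuple

# (token suffix, modification class); a partial, assumed mapping.
SUFFIX_CLASSES = [
    ("-OH", "hydroxycarotenoids"),
    ("-Epoxy", "epoxycarotenoids"),
    ("-al", "aldehydes"),
    ("=O", "ketones"),
    ("-apo", "apocarotenoids"),
    ("-nor", "norcarotenoids"),
]


def modification_classes(modifications: List[str]) -> List[str]:
    """Step 3: list the modification classes indicated by the tokens."""
    if not modifications:
        return ["hydrocarbons"]
    return [
        name
        for suffix, name in SUFFIX_CLASSES
        if any(token.endswith(suffix) for token in modifications)
    ]


def classify(
    carbon_number: int, end_groups: Tuple[str, str], modifications: List[str]
) -> Tuple[str, str, List[str]]:
    """Return (carbon class, end-group pair, modification classes)."""
    return (
        f"C{carbon_number}",
        ",".join(end_groups),
        modification_classes(modifications),
    )


print(classify(40, ("beta", "beta"), ["3-OH", "3'-OH", "4=O", "4'=O"]))
print(classify(40, ("beta", "beta"), []))
print(modification_classes(["4-apo", "4-al"]))
```

For the astaxanthin fingerprint this yields the C40 class, the 'beta,beta' end-group pair, and both hydroxycarotenoids and ketones, which reflects that a single carotenoid can fall under several modification classes at once.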

Distribution among organisms
Based on the facts that β, γ and ε rings are formed from ψ ends, and φ, χ and κ rings are formed from β end groups (1), we can postulate that the carotenogenesis pathways may have evolved dendritically (Table 4). The phyla of source organisms producing each end group with carbon numbers are listed in Table 5. Updated lists are also available at 'http://carotenoiddb.jp/stats/stats_endgroup_phylums.html', as well as the lists of scientific names of organisms at 'http://carotenoiddb.jp/stats/stats_endgroup_org_detailed.html', and the lists of families of organisms at 'http://carotenoiddb.jp/stats/stats_endgroup_family.html'. Carotenoids are widely distributed in the three domains of life according to our current investigations. Archaea produce C30 ψ,ψ carotenoids, C40 ψ,ψ carotenoids, β,β carotenoids, β,ε carotenoids, C50 ψ,ψ carotenoids and apocarotenoids. Bacteria produce wider ranges of carotenoids, except that they do not produce C40 ε,ψ, C40 γ end or C40 κ end carotenoids. Eukaryotes produce only C40 originated carotenoids, including apocarotenoids numbering 154. Source references are all listed in carotenoid entries and organism entries.
Although the numbers of organisms we could compile are not evenly distributed in the three domains of life (Archaea: 8, Bacteria: 170 and Eukaryotes: 505), and not all the carotenoid entries could be linked to source organisms in the time so far available, our statistics on distribution in organisms show some insights into natural history. Carotenoids seem to have been diversified largely by bacteria, as they produce C30, C40, C45, C50 carotenoids and C40 originated apocarotenoids, with the widest range of end groups, numbering 307. Bacteria share 52 C40 carotenoids and C40 originated apocarotenoids with eukaryotes, and archaea share only seven with eukaryotes. In terms of carotenoids, eukaryotes seem more closely related to bacteria than to archaea, aside from 16S rRNA lineage analysis. This may be caused by the restricted number of reports on archaeal carotenoids (26). Eukaryotes then probably have evolved a considerable number of C40 carotenoids and their apocarotenoid derivatives, numbering 607 by our present count at the release of December 2016.
Updated information on the distribution of chemical modification details at phylum level is available at 'http://carotenoiddb.jp/stats/org_statistics_phylum.html'. Distribution of chemical modifications at family level is also available at 'http://carotenoiddb.jp/stats/org_statistics_family.html'.
Updated information on common carotenoids and unique carotenoids in the three domains of life is available at http://carotenoiddb.jp/ORGANISMS/common_carotenoids.html.

Summary and future works
We have developed the Carotenoids Database to provide chemical information on 1117 natural carotenoids with 683 source organisms. Our newly developed Carotenoid DB Chemical Fingerprints make classification easier and similarity searching precise among carotenoids known to us. Also, the Carotenoid DB Chemical Fingerprints have made it easy to predict six biological functions of carotenoids, that is (i) provitamin A, (ii) membrane stabilizers, (iii) odorous substances, (iv) allelochemicals, (v) antiproliferative activity against cancer cells and (vi) reverse MDR activities. We have newly developed a tool to search for similar profiled organisms, which helps extract organisms potentially closely related to any query organism, evolutionarily, or symbiotically, or in food chains. Although the numbers of organisms that we have been able to include so far are not evenly distributed in the three domains of life, our statistics on distributions among organisms give some insights into natural history. Carotenoids seem to have been diversified largely by bacteria. Bacteria and archaea seem to have shared small portions of C40 carotenoids with eukaryotes. Eukaryotes then probably have evolved a considerable number of C40 carotenoids. In terms of carotenoids, eukaryotes seem more closely related to bacteria than to archaea, aside from 16S rRNA lineage analysis. In our current investigation, seco, nor, cyclo, olide carotenoids, geranylgeranyl and complex structure polymerized carotenoids are only observable in eukaryotes. See current statistics at http://carotenoiddb.jp/stats/org_statistics_phylum.html. To promote understanding of how organisms are related via carotenoids, further development of fingerprints for de novo reconstruction of carotenoid biosynthesis pathways will be reviewed in a later paper.