Abstract

Alkaptonuria (AKU, OMIM: 203500) is an autosomal recessive disorder caused by mutations in the Homogentisate 1,2-dioxygenase (HGD) gene. A lack of standardized data, information and methodologies to assess disease severity and progression represents a common complication in ultra-rare disorders like AKU. This is the reason why we developed a comprehensive tool, called ApreciseKUre, able to collect AKU patients deriving data, to analyse the complex network among genotypic and phenotypic information and to get new insight in such multi-systemic disease. By taking advantage of the dataset, containing the highest number of AKU patient ever considered, it is possible to apply more sophisticated computational methods (such as machine learning) to achieve a first AKU patient stratification based on phenotypic and genotypic data in a typical precision medicine perspective. Thanks to our sufficiently populated and organized dataset, it is possible, for the first time, to extensively explore the phenotype–genotype relationships unknown so far. This proof of principle study for rare diseases confirms the importance of a dedicated database, allowing data management and analysis and can be used to tailor treatments for every patient in a more effective way.

Introduction

Precision medicine (PM) is an emerging approach for the study of diseases based on individual variability in genes, environment and lifestyle to customize different illness courses. For PM application to an ultra-rare disease like Alkaptonuria (AKU), collecting as much information as possible about each patient, not overlooking apparently ordinary factors, is crucial to achieve a first patient stratification.

AKU is the first disorder to conform with the principles of Mendelian recessive inheritance [1] with an estimated incidence of 1 case in 250.000–1.000.000 births in most ethnic groups [2] and 1233 patients around the world [3, 4]. AKU is due to mutations of the Homogentisate 1,2-dioxygenase (HGD) gene which leads to a deficiency of the HGD enzyme [4, 5], producing accumulation of homogentisic acid (HGA) especially in connective tissues. The active form of the HGD enzyme is a highly complex hexamer [6] with a low tolerance to mutations including missense variants (about 65% of all known AKU substitutions) which can cause a harmful effect on protein folding stability and, consequently, a possible alteration of HGA accumulation [7]. In a recent study, a correlation study between the most abundant AKU-causing missense variants (G161R, M368V and A122V) and the manifestation of different amounts of unmetabolized HGA leading to differences in serum and urine levels and consequently differences in disease severity was performed. The three mutants had a significantly reduced activity compared to the wild-type enzyme, ranging from <1% of wild-type activity for G161R to 37% and 34% for A122V and M368V, respectively [8]. It can be speculated that different residual catalytic activity of the HGD protein carrying different variants can be reflected on different amounts of unmetabolized HGA. In this context, a genotype–phenotype correlation analysis was carried out in 33 patients and it was possible to notice a small but statistically significant difference in urinary HGA excretion in patients who carried variants with 1% and >30% residual HGD activity [4].

The HGA surplus is mostly eliminated through the urine; the remaining part contributes to the production of an ochronotic pigment deposited in cartilage [9]. Ochronosis causes an early onset arthropathy responsible for severe pain, deficit in locomotion and reduction of patients' quality of life [9]. HGA accumulation has also been found to trigger oxidative stress and chronic inflammation in AKU [10–13]. Moreover, recent studies have classified AKU as a secondary amyloidosis [13–15], characterized by deposition of serum amyloid A (SAA) fibers, which is a circulating protein produced at high levels (100–1000 times the normal plasmatic condition of about 4–6 mg/L) in chronic inflammation, making SAA a sensitive biomarker of inflammation [16]. Studies on different AKU samples (i.e., cartilage, salivary glands, chondrocytes and synoviocytes) exhibited that ochronotic pigment and amyloid fibers share the same location [14] and confirm the elevated plasma levels also in AKU patients [13–15, 17, 18].

Another marker linked to chronic inflammation is chitotriosidase (CHIT1), a chitinase mainly expressed in the differentiated and polarized macrophages [19]. Serum concentration of CHIT1 relates with the evolution and severity of several disorders (i.e., sarcoidosis, rheumatoid arthritis, ankylosing spondylitis and chronic obstructive lung diseases), suggesting a probable involvement of CHIT1 as an AKU biomarker [18, 19]. Therefore, in AKU, besides inflammation, patients also suffer from significant oxidative stress caused by high systemic levels of HGA and its products. In the study of Braconi et al. (2016), it was described proteomic alterations in AKU samples of six patients and showed interesting similarities with other rheumatic diseases [17]. In this context, Protein Thiolation index (PTI) interestingly denotes and summarizes the oxidative state of AKU patients [20].

One of the main problems in carrying out clinical research on AKU is the lack of a standardized methodology to assess disease severity and response to treatment [21], which is complicated by the large variety of AKU symptoms from an individual to another [4, 22]. A reliable way to monitor patients’ clinical condition and overall health status is the use in clinical practice and research of measures of Quality of Life (QoL) [18]. Our previous studies showed that, in a rare and multisystemic disease like AKU, QoL scores help to identify health needs and to evaluate the impact of the disease [18]. The presence of a correlation between QoL and the clinical data deposited in the ApreciseKUre database could be helpful in shading light on AKU complexity [23]. Here, we have developed a machine learning application that perform a first AKU patient stratification based on phenotypic and genotypic data collected in the ApreciseKUre, an AKU-dedicated digital platform [20, 23]. ApreciseKUre is the AKU-reference database in which genetic, protein, biochemical, histopathologic, clinical data and images about more than 200 AKU patients areorganized and freely available after user registration (www.bio.unisi.it/aprecisekure/; www.bio.unisi.it/aku-db/) [24]. Furthermore, this integral platform aims to the discovery of new AKU biomarkers, thanks to different analytical tools. We believe this approach can be turned into a best practice model also for other rare diseases and can be useful for overcoming the obstacles in small dataset management and analysis.

Results and discussion

AKU disease is closely related to the loss of activity of HGD enzyme. In this context, the mutations associated to this protein are worth being investigated. Up to now, the relationship between genotype and phenotype is unknown. By taking advantage of the dataset hereby present, containing the highest number of AKU patients ever considered, it is possible to explore the phenotype–genotype distribution study in AKU. In the present study, machine learning algorithms were firstly implemented with the aim to perform a stratification of AKU patients based on clinical, biochemical and QoL data deposited in the ApreciseKUre database [25–27]. To achieve this purpose, we have considered biochemical markers of chronic inflammation and amyloidosis (i.e., SAA and CHIT1, for more details see Supplementary Materials) and markers linked with oxidative stress status (i.e., PTI and S-thiolated proteins, for more details see Supplementary Materials). Moreover, 11 QoL scores were considered to evaluate disease severity and progression (see Supplementary Materials S2 for deeper information):

  • - MHS and PHS are multi-purpose short-form health surveys with 36 questions that measures patients’ QoL across several domains, which are both physically and emotionally based [18]. The score 0 is equivalent to maximum disability and a score of 100 is equivalent to no disability.

  • - KOOS is a knee-specific instrument developed with the purpose of evaluating short- and long-term symptoms. The scale ranges from 0 to 100, with 0 representing extreme knee problems and 100 representing no knee problems [18, 28].

  • - HAQ-DI and hapVAS evaluate the disability index and a global pain visual analog scale where, differently from KOOS, PHS and MHS, 0 means no pain and 100 severe pain [29].

  • - AKUSSI: questionnaire-based evaluation of AKU severity [21]. It incorporates multiple, clinically meaningful AKU outcomes combined with imaging investigations.

Data pre-processing

Indicators of disease severity were firstly examined through a preliminary statistical analysis in order to evaluate the degree of correlation among pairs of variables (Figure 1).

Correlation matrix of health survey questionnaires. In this correlation matrix, all QoL scores and biochemical AKU biomarkers are correlated to each other. In red is indicated an inverse statistical correlation, in blue a direct statistical correlation, in white no correlation.
Figure 1

Correlation matrix of health survey questionnaires. In this correlation matrix, all QoL scores and biochemical AKU biomarkers are correlated to each other. In red is indicated an inverse statistical correlation, in blue a direct statistical correlation, in white no correlation.

It is interesting to notice the presence of statistically significant correlations among all the scores AKUSSI, KOOS, HAQ and PHS considering the different evaluation scale in the questionnaires: in the KOOS, PHS and MHS questionnaires the value 0 represents the maximum negative evaluation (maximum pain, maximum disability, etc.) while 100 the maximum positive evaluation (no pain, no disabilities, etc.) whereas the AKUSSI and HAQ questionnaires are expressed by an opposite scale of values. By taking this point into account, the real relationship shows a direct dependency among the scores and a homogeneous accordance of their capability to investigate disease severity grade.

Specifically, KOOS pain, KOOS symptoms, KOOS daily living and KOOS sport have a high correlation with PHS, AKUSSI joint pain and spinal pain, with hapVAS and HAQ-DI. Differently, the mental health score (MHS) correlation with all the other QoL scores is not statistically significant (Pearson correlation coefficient between −0.3 and 0.3). Taken together, this data suggests that the MHS, the only one assessing the psychological status of the patient, is independent from other QoL scores linked to the individual’s physical status. This finding shows that the patients’ psychological experience, based on the evaluation of levels of anxiety and depression, is not directly related with their actual physical and clinical status. Moreover, differently from other scores (AKUSSI, KOSS, HAQ, hapVAS), MHS is not correlated with age. This observation, in line with Braconi et al. (2018), confirms a not infrequent disability paradox in inherited/chronic disease, underlying the difference between the physical and mental impact on pathological severity, which may underestimate overall mental state.

It can be observed that increasing the severity of spinal and articular pain (AKUSSI) results in a correspondent increase of the patient’s self-assessment condition on his knee (KOOS).

This can indicate that the increases in spinal and articular symptoms, due to the consequent accumulation of ochronotic melanin-like pigment on the articulation, decreases the abilities of the patients in performing their daily activities. These can range from the most intense (e.g., sports) to the simplest ones, with a consequent deterioration of their physical conditions. AKUSSI test considers objective clinical parameters [30], while the KOOS test is only related to the perception of the patient. The good correlation highlighted by the present study indicates that the results obtained with the KOOS test are well correlated with those that are scientifically more reliable such as the AKUSSI test, with the great advantage of their simplicity and straightforward evaluation of results.

HAQ-DI represents the evaluation of disability whereas hapVAS the pain severity. These parameters are more general with respect to KOOS tests which concern more specifically knee related factors. In fact, HAQ tests allow to evaluate the overall condition of the patient based on its perception. The correlation between AKUSSI and HAQ is always direct, thus it exists a direct correlation between the progression of the disease, and in particular of the spinal and joint pain, and with the perceived pain, disability degree, and difficulties in performing daily activities (walking, getting dressed, etc.).

Important biomarkers of chronic inflammation and amyloidosis like CHIT1 and SAA do not result strongly correlated with disease severity represented by the above cited QoL scores; differently from PTI, which is inversely correlated with KOOS scores and directly with age. This result confirms the conclusions of the investigation of disease severity based on the oxidative stress performed by Giustarini et al. [31] which evidenced PTI values significantly higher in old patients affected by AKU with respect to sane controls. It was accordingly proposed that the higher pro-oxidation (i.e., higher oxidative stress) conditions in the blood of people affected by AKU may be related to the age progress. Our results agree with this experimental report.

Methods for unsupervised clustering

A first AKU patients’ stratification was performed by means of a ML approach. We used two methods, K-mean algorithm and hierarchical clustering, in order to cluster a population of AKU patients into subgroups with similar features. The experiment was conducted using three different stratifications, i.e., setting (i) K = 2 to obtain two clusters (two patients subgroups), (ii) K = 3 to obtain three clusters (three patients subgroups) and (iii) K = 4 to obtain four clusters according to the optimal cluster number of the R study package (Figure 2) following the elbow rule. The elbow method is a heuristic approach used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.

Optimal number of clusters. The elbow method is a heuristic approach used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
Figure 2

Optimal number of clusters. The elbow method is a heuristic approach used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.

The resulting clusters are grouped according to the severity of the AKU disease, by considering age, the levels of biomarkers describing the severity of oxidative stress, inflammation and amyloidosis, and QoL scores. The results are shown in Figure 3 and Figure 4 with the aim of finding possible reiterations of certain mutations of the HGD protein in the various clusters. Specifically, in Figure 3 it is possible to notice AKU patients’ stratification based on K-means with K = 2, 3 and 4 whereas in Figure 4 the classification based on hierarchical clustering is shown. The number of patients included in each cluster is reported in Figure 3 and Figure 4.

AKU patient stratification using K-means. The dimension reduction method for the plot on the left is a PCA, where the two axes explain the principal components. Three different stratifications are reported: setting K = 2 to obtain 2 clusters, K = 3 to obtain 3 clusters, and K = 4 to obtain 4 clusters. The bar plots for every K show the ratio between the mean of each one of the 24 features for the patients in the same cluster, divided by the mean of the same feature in the whole population.
Figure 3

AKU patient stratification using K-means. The dimension reduction method for the plot on the left is a PCA, where the two axes explain the principal components. Three different stratifications are reported: setting K = 2 to obtain 2 clusters, K = 3 to obtain 3 clusters, and K = 4 to obtain 4 clusters. The bar plots for every K show the ratio between the mean of each one of the 24 features for the patients in the same cluster, divided by the mean of the same feature in the whole population.

AKU patient stratification using hierarchical clustering. Analogously to the K-Means case, we cut the dendrogram for K = 2, K = 3 and K = 4. The bar plots represent the ratio between the mean of each one of the 24 features for the patients grouped in the same cluster, and the mean of the same feature in the whole population.
Figure 4

AKU patient stratification using hierarchical clustering. Analogously to the K-Means case, we cut the dendrogram for K = 2, K = 3 and K = 4. The bar plots represent the ratio between the mean of each one of the 24 features for the patients grouped in the same cluster, and the mean of the same feature in the whole population.

Higher is the average level of age, biomarkers (i.e., SAA, CHIT1, PTI) and QoL scores (except for AKUSSI, HAQ-DI and hapVAS) more severe is the phenotype. For instance, for both K-means and hierarchical clustering methods we found that for K = 2, the most severe phenotype seems to be the cluster number 1, cluster number 2 for K = 3 and cluster number 4 for K = 4.

Statistical clusters evaluation

As reported in Figures 3 and 4, the bar plots for every K, show the ratio between the mean of each one of the 24 features for the patients in the same cluster, divided by the mean of the same feature in the whole population. Higher is the average level of age, and of the biomarkers (i.e., SAA, CHIT1, PTI) and QoL scores (except for AKUSSI, HAQ-DI and hapVAS) more severe is the phenotype. For instance, for both K-means and hierarchical clustering methods we found that for K = 2, the most severe phenotype seems to be the cluster number 1, cluster number 2 for K = 3 and cluster number 4 for K = 4. Before we jump to conclusions in the evaluation of the most/less severe phenotype in every cluster, it is needed to evaluate the coherence of individuals belonging to the same community. To assess if clusters were significantly identifying sub-groups of individuals, we applied the Kruskall–Wallis (KW) ranking non-parametric test in order to check the presence of a significant difference existing between the obtained groups. The comparison was performed between the medians of the features set considered for each group previously formed. In this manner, it was possible to detected features significantly differently distributed among individuals belonging to diverse clusters. The results, based on biomarkers and QoL scores previously described, are listed in Table 1 which is split into two different sub-tables, one for K-means and the other for hierarchical clustering.

Table 1

Sub-tables related to KW for K-means and for hierarchical clustering. In green are highlighted the statistically significant values adjusted with multiple test (FDR < 0.05)

K-meansK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age35,38966,48E-0936,08052,58E-08251,7120,868,057,333
AJP31,31674,78E-0829,49282,74E-06241,8170,868,057,333
ASP19,44551,66E-0521,31380,000277154,0090,988,098,604
CHIT120,1291,24E-0522,2582,35E-05170,6350,897,373,794
hapVAS29,75382,49E-1126,84187,98E-0617560,868,057,333
HAQ-DI46,46259,81E-0836,6981,34E-08255,2860,868,057,333
KOOS_QOL71,42951,72E-1663,36338,36E-14308,2410,868,057,333
KOOSdl58,23199,32E-1451,00432,88E-11283,9470,868,057,333
KOOSp61,01692,72E-1451,18532,88E-11307,2650,868,057,333
KOOSs54,12056,47E-1349,2036,21E-11271,5070,868,057,333
KOOSsp74,2085,62E-1764,75785,20E-14427,7960,868,057,333
MHS87,9320,00403115,36180,000615132,4260,868,057,333
PHS48,59549,44E-1238,22443,19E-08267,3110,868,057,333
PTI23,68422,10E-0624,55453,53E-05174,3970,868,057,333
SAA981,5230,00244317,07527,88E-07135,0550,902,399,786
Hierarchical ClusteringK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age0,002050,963,9310,11,8950,539,763145,311709173E-08
AJP0,16,7580,927,528160,4020,720,68419,3214594206E-06
ASP0,50,9840,893,266245,9460,633,07332,85410,001050238
CHIT10,43,6830,893,26619,5860,89,69232,4271729205E-07
hapVAS0,10,3780,927,5280,80,1720,0797813,32990,000331199
HAQ_DI0,97,2510,893,266490,2630,626,01160,1806208592E-09
KOOS_QOL0,96,1850,893,266456,7950,166,86255,609171602E-13
KOOSdl0,083250,927,5280,58,2220,539,763528,802161359E-12
KOOSp0,11,9250,927,528123,0470,35,58916,771143593E-12
KOOSs0,73,9550,893,266374,8450,301,04940,1535135751E-11
KOOSsp0,78,0840,893,266402,7930,305,63645,1279109533E-13
MHS544,8680,117,499774,0060,368,33665,48910,005020985
PHS0,55,6810,893,266251,4160,626,01137,5348846952E-07
PTI153,4720,861,624622,3770,213,68160,6891215695E-08
SAA22,08872,08E-0510,02680,943,72267,0517638258E-07
K-meansK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age35,38966,48E-0936,08052,58E-08251,7120,868,057,333
AJP31,31674,78E-0829,49282,74E-06241,8170,868,057,333
ASP19,44551,66E-0521,31380,000277154,0090,988,098,604
CHIT120,1291,24E-0522,2582,35E-05170,6350,897,373,794
hapVAS29,75382,49E-1126,84187,98E-0617560,868,057,333
HAQ-DI46,46259,81E-0836,6981,34E-08255,2860,868,057,333
KOOS_QOL71,42951,72E-1663,36338,36E-14308,2410,868,057,333
KOOSdl58,23199,32E-1451,00432,88E-11283,9470,868,057,333
KOOSp61,01692,72E-1451,18532,88E-11307,2650,868,057,333
KOOSs54,12056,47E-1349,2036,21E-11271,5070,868,057,333
KOOSsp74,2085,62E-1764,75785,20E-14427,7960,868,057,333
MHS87,9320,00403115,36180,000615132,4260,868,057,333
PHS48,59549,44E-1238,22443,19E-08267,3110,868,057,333
PTI23,68422,10E-0624,55453,53E-05174,3970,868,057,333
SAA981,5230,00244317,07527,88E-07135,0550,902,399,786
Hierarchical ClusteringK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age0,002050,963,9310,11,8950,539,763145,311709173E-08
AJP0,16,7580,927,528160,4020,720,68419,3214594206E-06
ASP0,50,9840,893,266245,9460,633,07332,85410,001050238
CHIT10,43,6830,893,26619,5860,89,69232,4271729205E-07
hapVAS0,10,3780,927,5280,80,1720,0797813,32990,000331199
HAQ_DI0,97,2510,893,266490,2630,626,01160,1806208592E-09
KOOS_QOL0,96,1850,893,266456,7950,166,86255,609171602E-13
KOOSdl0,083250,927,5280,58,2220,539,763528,802161359E-12
KOOSp0,11,9250,927,528123,0470,35,58916,771143593E-12
KOOSs0,73,9550,893,266374,8450,301,04940,1535135751E-11
KOOSsp0,78,0840,893,266402,7930,305,63645,1279109533E-13
MHS544,8680,117,499774,0060,368,33665,48910,005020985
PHS0,55,6810,893,266251,4160,626,01137,5348846952E-07
PTI153,4720,861,624622,3770,213,68160,6891215695E-08
SAA22,08872,08E-0510,02680,943,72267,0517638258E-07
Table 1

Sub-tables related to KW for K-means and for hierarchical clustering. In green are highlighted the statistically significant values adjusted with multiple test (FDR < 0.05)

K-meansK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age35,38966,48E-0936,08052,58E-08251,7120,868,057,333
AJP31,31674,78E-0829,49282,74E-06241,8170,868,057,333
ASP19,44551,66E-0521,31380,000277154,0090,988,098,604
CHIT120,1291,24E-0522,2582,35E-05170,6350,897,373,794
hapVAS29,75382,49E-1126,84187,98E-0617560,868,057,333
HAQ-DI46,46259,81E-0836,6981,34E-08255,2860,868,057,333
KOOS_QOL71,42951,72E-1663,36338,36E-14308,2410,868,057,333
KOOSdl58,23199,32E-1451,00432,88E-11283,9470,868,057,333
KOOSp61,01692,72E-1451,18532,88E-11307,2650,868,057,333
KOOSs54,12056,47E-1349,2036,21E-11271,5070,868,057,333
KOOSsp74,2085,62E-1764,75785,20E-14427,7960,868,057,333
MHS87,9320,00403115,36180,000615132,4260,868,057,333
PHS48,59549,44E-1238,22443,19E-08267,3110,868,057,333
PTI23,68422,10E-0624,55453,53E-05174,3970,868,057,333
SAA981,5230,00244317,07527,88E-07135,0550,902,399,786
Hierarchical ClusteringK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age0,002050,963,9310,11,8950,539,763145,311709173E-08
AJP0,16,7580,927,528160,4020,720,68419,3214594206E-06
ASP0,50,9840,893,266245,9460,633,07332,85410,001050238
CHIT10,43,6830,893,26619,5860,89,69232,4271729205E-07
hapVAS0,10,3780,927,5280,80,1720,0797813,32990,000331199
HAQ_DI0,97,2510,893,266490,2630,626,01160,1806208592E-09
KOOS_QOL0,96,1850,893,266456,7950,166,86255,609171602E-13
KOOSdl0,083250,927,5280,58,2220,539,763528,802161359E-12
KOOSp0,11,9250,927,528123,0470,35,58916,771143593E-12
KOOSs0,73,9550,893,266374,8450,301,04940,1535135751E-11
KOOSsp0,78,0840,893,266402,7930,305,63645,1279109533E-13
MHS544,8680,117,499774,0060,368,33665,48910,005020985
PHS0,55,6810,893,266251,4160,626,01137,5348846952E-07
PTI153,4720,861,624622,3770,213,68160,6891215695E-08
SAA22,08872,08E-0510,02680,943,72267,0517638258E-07
K-meansK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age35,38966,48E-0936,08052,58E-08251,7120,868,057,333
AJP31,31674,78E-0829,49282,74E-06241,8170,868,057,333
ASP19,44551,66E-0521,31380,000277154,0090,988,098,604
CHIT120,1291,24E-0522,2582,35E-05170,6350,897,373,794
hapVAS29,75382,49E-1126,84187,98E-0617560,868,057,333
HAQ-DI46,46259,81E-0836,6981,34E-08255,2860,868,057,333
KOOS_QOL71,42951,72E-1663,36338,36E-14308,2410,868,057,333
KOOSdl58,23199,32E-1451,00432,88E-11283,9470,868,057,333
KOOSp61,01692,72E-1451,18532,88E-11307,2650,868,057,333
KOOSs54,12056,47E-1349,2036,21E-11271,5070,868,057,333
KOOSsp74,2085,62E-1764,75785,20E-14427,7960,868,057,333
MHS87,9320,00403115,36180,000615132,4260,868,057,333
PHS48,59549,44E-1238,22443,19E-08267,3110,868,057,333
PTI23,68422,10E-0624,55453,53E-05174,3970,868,057,333
SAA981,5230,00244317,07527,88E-07135,0550,902,399,786
Hierarchical ClusteringK2K3K4
StatisticFDRStatisticFDRStatisticFDR
Age0,002050,963,9310,11,8950,539,763145,311709173E-08
AJP0,16,7580,927,528160,4020,720,68419,3214594206E-06
ASP0,50,9840,893,266245,9460,633,07332,85410,001050238
CHIT10,43,6830,893,26619,5860,89,69232,4271729205E-07
hapVAS0,10,3780,927,5280,80,1720,0797813,32990,000331199
HAQ_DI0,97,2510,893,266490,2630,626,01160,1806208592E-09
KOOS_QOL0,96,1850,893,266456,7950,166,86255,609171602E-13
KOOSdl0,083250,927,5280,58,2220,539,763528,802161359E-12
KOOSp0,11,9250,927,528123,0470,35,58916,771143593E-12
KOOSs0,73,9550,893,266374,8450,301,04940,1535135751E-11
KOOSsp0,78,0840,893,266402,7930,305,63645,1279109533E-13
MHS544,8680,117,499774,0060,368,33665,48910,005020985
PHS0,55,6810,893,266251,4160,626,01137,5348846952E-07
PTI153,4720,861,624622,3770,213,68160,6891215695E-08
SAA22,08872,08E-0510,02680,943,72267,0517638258E-07

Thanks to KW test, it was possible to confirm a statistically significant difference existing between the parameters considered in the clusters with K = 2 and K = 3 obtained by K-means. In the case of hierarchical clustering we considered K = 4 where we obtained all of that parameters considered were statistically significant different among clusters with the exception of age, as you can see in the statistical result in Table 1. Additionally, to have a measure of goodness of fit of the clustering methodologies applied, we considered the Silhouette Score for each cluster, with the aim to test the consistency within elements which have been assigned to the same cluster. We derived the Silhouette score for each community, using as input a combination of QoL scores, demographics and biochemical features. The results are reported in Table 2.

Table 2

Application of Silhouette Score and Calinski–Harabasz to measure the cohesion of elements belonging to the same group leading to a comparison between clustering methodology. Highlighted in green are the best results relative to one of the two methods of clustering

Silhouette score mean
K2. K-means0,239,212
K2. Hierarchical clustering0,494,158
K3. K-means0,271,703
K3. Hierarchical clustering-0,02403
K4. K-means-0,04854
K4. Hierarchical clustering0,141,756
Silhouette score median
K2. K-means0,255,614
K2. Hierarchical clustering0,530,227
K3. K-means0,306,136
K3. Hierarchical clustering-0,01112
K4. K-means-0,03735
K4. Hierarchical clustering0,134,083
Calinski–Harabasz
K2. K-means39,11,002
K2. Hierarchical clustering39,07637
K3. K-means52,05992
K3. Hierarchical clustering1,375,957
K4. K-means0,880,215
K4. Hierarchical clustering38,23,567
Silhouette score mean
K2. K-means0,239,212
K2. Hierarchical clustering0,494,158
K3. K-means0,271,703
K3. Hierarchical clustering-0,02403
K4. K-means-0,04854
K4. Hierarchical clustering0,141,756
Silhouette score median
K2. K-means0,255,614
K2. Hierarchical clustering0,530,227
K3. K-means0,306,136
K3. Hierarchical clustering-0,01112
K4. K-means-0,03735
K4. Hierarchical clustering0,134,083
Calinski–Harabasz
K2. K-means39,11,002
K2. Hierarchical clustering39,07637
K3. K-means52,05992
K3. Hierarchical clustering1,375,957
K4. K-means0,880,215
K4. Hierarchical clustering38,23,567
Table 2

Application of Silhouette Score and Calinski–Harabasz to measure the cohesion of elements belonging to the same group leading to a comparison between clustering methodology. Highlighted in green are the best results relative to one of the two methods of clustering

Silhouette score mean
K2. K-means0,239,212
K2. Hierarchical clustering0,494,158
K3. K-means0,271,703
K3. Hierarchical clustering-0,02403
K4. K-means-0,04854
K4. Hierarchical clustering0,141,756
Silhouette score median
K2. K-means0,255,614
K2. Hierarchical clustering0,530,227
K3. K-means0,306,136
K3. Hierarchical clustering-0,01112
K4. K-means-0,03735
K4. Hierarchical clustering0,134,083
Calinski–Harabasz
K2. K-means39,11,002
K2. Hierarchical clustering39,07637
K3. K-means52,05992
K3. Hierarchical clustering1,375,957
K4. K-means0,880,215
K4. Hierarchical clustering38,23,567
Silhouette score mean
K2. K-means0,239,212
K2. Hierarchical clustering0,494,158
K3. K-means0,271,703
K3. Hierarchical clustering-0,02403
K4. K-means-0,04854
K4. Hierarchical clustering0,141,756
Silhouette score median
K2. K-means0,255,614
K2. Hierarchical clustering0,530,227
K3. K-means0,306,136
K3. Hierarchical clustering-0,01112
K4. K-means-0,03735
K4. Hierarchical clustering0,134,083
Calinski–Harabasz
K2. K-means39,11,002
K2. Hierarchical clustering39,07637
K3. K-means52,05992
K3. Hierarchical clustering1,375,957
K4. K-means0,880,215
K4. Hierarchical clustering38,23,567

Silhouette Score indicates the measure of cohesion of elements belonging to the same cluster, with respect to observations which belong to different communities. In this way we could compare the clusters obtained with K-Means with respect to the clusters obtained with the agglomerative hierarchical clustering approach. For AKU patient stratification with K = 3 and K = 4, the Silhouette Score confirmed the same statistically significative results arisen from KW test in which the best clustering methods turn out to be K-means and hierarchical clustering, respectively. Different is the case of K = 2: KW test considered K-means as the best way to cluster AKU patients into two subgroups, whereas for Silhouette Score, the best one is the hierarchical clustering. As reported in Figure 4, the hierarchical clustering for K = 2 is strongly unbalanced in terms of numbers of patients involved in each cluster (cluster 1, 104 patients; cluster 2, 8 patients). This fact could affect the statistical results; thus, we have decided to consider K-means as algorithm of choice for K = 2 (where there are 60 patients in cluster 1 and 52 in cluster 2). In summary, we have statistically corroborated the application of K-means algorithm for K = 2 and K = 3, and hierarchical clustering for K = 4 for AKU patient stratification. Starting from this point, it was possible to detect the most/less severe subgroups based on demographics, QoL scores and biochemical markers (see Supplementary Materials S3). Specifically, for K = 2, cluster 1 turns out to group AKU patients with most severe symptoms and QoL scores, older age and higher levels of biomarkers of oxidative stress, chronic inflammation and amyloidosis. For K = 3, clusters 2 and 3 comprehend older patients (in particular cluster 2). In cluster 3 there are higher level of SAA and PTI, whereas higher values for CHIT1 are in cluster 2. Patients with less severe symptoms are present in cluster 1, on the contrary patients with the worst QoL score are all included in cluster 2, which turns out to be the most severe one. For K = 4, older patients with more severe symptoms and higher levels of CHIT1, SAA and PTI are stratified in cluster 4, whereas in cluster 2 are grouped younger individuals with less severe AKU manifestations. Further considerations are more extensively faced in ‘Discussion’ section.

Genotype–phenotype distribution

In ApreciseKUre database, about 44% of AKU patients are carriers of G161R, M368V and A122V missense mutations in allele 1 or 2 in HGD. This fact was confirmed by a recent study describing a correlation between the most abundant AKU-causing missense variants (G161R, M368V and A122V) and the manifestation of different amounts of unmetabolized HGA leading to differences in serum and urine levels and consequently differences in disease severity. The three mutants had a reduced activity compared to the wild-type enzyme, ranging from <1% of wild-type enzymatic activity for G161R, which strongly affects HGD protein functions, to a residual activity of 37% and 34% for A122V and M368V, respectively. It can be speculated that different residual catalytic activity of the HGD protein carrying different variants can be reflected on different amounts of unmetabolized HGA. In this context, a genotype–phenotype correlation analysis was described in a previous study involving 33 patients and it was possible to notice a small but statistically significant difference in urinary HGA excretion in patients who carried variants with 1% and >30% residual HGD activity [4]. Thanks to ApreciseKUre database and the previous described AKU patient stratification based on a cohort of 112 individuals, it was possible to detect the presence of these abundant mutations in the more/less severe clusters in order to pave the way for a larger genotype–phenotype distribution study in AKU. In Table 3, are shown the normalized percentage of the three mutations in each cluster in both allele 1 and allele 2. Specifically, G161R mutation, responsible for a dramatic reduction of HGD activity, presented a higher percentage in clusters phenotypically severe (such as cluster 1 in K = 2, cluster 2 in K = 3 and clusters 3 and 4 in K = 4). On the contrary, for M368V and A122V mutations, in which enzymatic activity of HGD is conserved for more than 30%, the trend described a higher percentage in less severe phenotypic sub-groups (i.e., cluster 2 for K = 2, cluster 1 for K = 3 and clusters 1 and 2 for K = 4).

Table 3

Three most abundant missense mutations (G161R, M368 and A122V) with their relative enzymatic residual activity and their normalized mutation percentages (order of magnitude) in each cluster related to K = 2, K = 3 and K = 4 in both allele 1 and allele 2

K = 2Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 1Cluster 2
G161R204 ± 80160406040
M368V7715 ± 17493740607030
A122V6842 ± 17603420803070
K = 3Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 1Cluster 2Cluster 3
G161R204 ± 801305010304030
M368V7715 ± 17493760301040600
A122V6842 ± 1760348020070300
K = 4Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 4Cluster 1Cluster 2Cluster 3Cluster 4
G161R204 ± 8012525153510203535
M368V7715 ± 17493730601002525050
A122V6842 ± 1760345545006030010
K = 2Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 1Cluster 2
G161R204 ± 80160406040
M368V7715 ± 17493740607030
A122V6842 ± 17603420803070
K = 3Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 1Cluster 2Cluster 3
G161R204 ± 801305010304030
M368V7715 ± 17493760301040600
A122V6842 ± 1760348020070300
K = 4Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 4Cluster 1Cluster 2Cluster 3Cluster 4
G161R204 ± 8012525153510203535
M368V7715 ± 17493730601002525050
A122V6842 ± 1760345545006030010
Table 3

Three most abundant missense mutations (G161R, M368 and A122V) with their relative enzymatic residual activity and their normalized mutation percentages (order of magnitude) in each cluster related to K = 2, K = 3 and K = 4 in both allele 1 and allele 2

K = 2Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 1Cluster 2
G161R204 ± 80160406040
M368V7715 ± 17493740607030
A122V6842 ± 17603420803070
K = 3Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 1Cluster 2Cluster 3
G161R204 ± 801305010304030
M368V7715 ± 17493760301040600
A122V6842 ± 1760348020070300
K = 4Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 4Cluster 1Cluster 2Cluster 3Cluster 4
G161R204 ± 8012525153510203535
M368V7715 ± 17493730601002525050
A122V6842 ± 1760345545006030010
K = 2Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 1Cluster 2
G161R204 ± 80160406040
M368V7715 ± 17493740607030
A122V6842 ± 17603420803070
K = 3Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 1Cluster 2Cluster 3
G161R204 ± 801305010304030
M368V7715 ± 17493760301040600
A122V6842 ± 1760348020070300
K = 4Allele 1Allele 2
HGO proteinSpecific activity (nmol/min/mg)% of wild-typeCluster 1Cluster 2Cluster 3Cluster 4Cluster 1Cluster 2Cluster 3Cluster 4
G161R204 ± 8012525153510203535
M368V7715 ± 17493730601002525050
A122V6842 ± 1760345545006030010

Moreover, it is possible to assert that the 2 patients homozygous for the missense mutation E401Q were always clustered in the most severe phenotype sub-groups (cluster 1 for K = 2, Cluster 2 for K = 3 and cluster 4 for K = 4). Similarly, the two patients homozygous for E168K were grouped in all the previous listed phenotypically severe clusters, with the only exception of one patient in K = 4 which is included in cluster 1. Additionally, the AKU homozygote individuals affected by S47L missense mutation and by W60* stop mutation were included in all the most severe clusters. On the contrary, the two AKU homozygote individuals affected by A267V and W97C missense mutations were included in all the less severe clusters. Similarly, the two compound heterozygous patients P230S/V300G were involved in the less severe sub-groups. We cannot know the enzymatic activity of this HGD mutant, however we know that the residual activity of P230S homozygosis condition is equal to 4% [6]. For more details see Supplementary Materials S4.

The limited number of AKU patients spread around the world represent a major obstacle for generating a standardized strategy to assess disease stage and progression. While several biomarkers for AKU have been identified, a clear connection between biomarkers levels and disease severity (QoL score) is still missing. As reported in data pre-processing section, levels of the most important biomarkers of amyloidosis, oxidative stress and chronic inflammation are uncorrelated with QoL scores. Actually, in clusters including AKU patients with the most severe symptoms linked to the worst QoL scores, pathological levels of biomarkers are not always found. Once AKU stratification and clusters validation were performed, we connected the phenotypic data with the genotypic one to compute the normalized percentage of particular mutations involved in the more/less severe sub-group. In the next sections, we reported a description of AKU patient stratification from clinical and genetic point of view for K = 2, K = 3 and K = 4 clusters.

K = 2 using K-means

In particular, for K = 2 with K-means as algorithm of choice, it was possible to notice in cluster 1 the grouping of the oldest patients showing high severity of AKU disease and with higher level of SAA, CHIT1 and PTI. Moreover, in the first cluster all the KOOS scores are low, which indicate that cluster 1-patients have greater knee problems than those in cluster 2. Similarly, hapVAS, HAQ-DI have higher scores in cluster 1, indicating worse arthritic conditions. PHS and MHS, have lower values in cluster 1, indicating worse physical and mental conditions. Analogously, cluster 1 shows higher scores of AKUSSI, indicating greater severity of the disease. From genotypic point of view, the G161R mutation is mostly represent in the phenotypically more serious subgroup according to a low enzymatic specific activity of 1% wt. The A122V and M368V mutations are more represent in the phenotypically less severe subgroup (with the only exception of M368V allele 2, see Table 3). Both mutations resulted in agreement with the experiments conducted in vitro for the measurement of the specific enzymatic activity (33.5% wt for A122V and 37% wt for M368V). Other less common mutations like S47L and stop mutation W60* are exclusively present in the most sever cluster, differently from A267V, W97C in the less severe one.

K = 3 using K-means

In this three clusters stratification, cluster 2 and 3 contain older patients with higher level of SAA, CHIT1 and PTI. Significantly higher values of SAA and PTI can be observed in cluster 3 with respect to the other two clusters, while CHIT1 and age turns out to higher in cluster 2. Younger patients with lower levels of such biomarkers and less severe symptoms are present in cluster 1. In cluster 1 the values of KOOS are higher when compared to the other two clusters. Therefore, patients in cluster 2 present greater problems at the level of knees. The same trend can be observed in the hapVAS, HAQ-DI scores, which are higher in clusters 2 with respect to clusters 1 and 3 and thus the correspondent patients are in worse conditions. Similarly, PHS have lower values in the cluster 2, indicating worse physical conditions. Differently, MHS indicates the most problematic mental condition in cluster 3, which results completely uncorrelated with all the other scores are reported in the pre-processing section. As far as the AKUSSI values are concerned, cluster 2 shows higher scores, and, consequently, patients present greater severity of the disease. Similarly to K = 2 clustering, the G161R, S47L, W60* are mostly represent in the phenotypically more serious subgroup. On the opposite, the A122V, M368V A267V and W97C mutations are more represent in the phenotypically less severe subgroup (with the only exception of M368V allele 2, see Table 3).

K = 4 using hierarchical clustering

In the stratification with four clusters, the analysis of the phenotype showed that older patients with higher levels of CHIT1 and PTI are stratified in cluster 4. In cluster 3 higher level of SAA can be observed. In cluster 2 are present much higher values (less severe) of KOOS and PHS whereas cluster 4 presented patients with greater problems at the knee and physical difficulties. The same trend was found in the hapVAS and HAQ-DI scores, which are higher in clusters 4, therefore the correspondent patients are in worse conditions, and lower in cluster 2. Moreover, cluster 4 showed higher AKUSSI scores, consequently present patients with greater severity of the disease. On the contrary, in cluster 2 are grouped younger individuals with less severe AKU manifestations: all the patients showed the best QoL scores with the only exception of MHS. This last one had the loss value in clusters 3, following an independent trend with all the other scores, as reported in the pre-processing data section. Correspondingly to K = 2 and K = 3 clustering, the G161R, S47L, W60* are mostly represent in the phenotypically more sever cluster, whereas, the A122V, M368V A267V and W97C mutations are more represent in the phenotypically less severe subgroup (with the only exception of M368V allele 2, see Table 3).

Current study limitations and future perspective

There are several challenges in studying an ultra-rare and complex disease like AKU, and specifically (i) the paucity of specimens and available data, and (ii) the lack of a standardized method able to objectively assess disease severity or response to treatment. For this reason we developed ApreciseKUre database, aiming to collect as many AKU patients’ data as possible, and to use QoL scores to monitor patients’ clinical condition and health status, although the database does not yet include objective disease severity findings (i.e., imaging, cardiac valve or calcification, radiographic severity score, treatment modalities, time to surgery, etc.). Currently, we have included more than 1/6 of the worldwide AKU patients, our aim is to provide a computing platform able to collect as much data as possible and make available a tool to extract data to perform analysis on AKU. We are aware that our analysis is a starting point for a better investigation of the utility and reliability of QoL scores, which are becoming increasingly popular, and their relationship with biochemical and genotypic data which is one of main challenges in AKU studies.

The last type of data is not easily analysable. As a matter of fact, it is not uncommon that AKU patients carry compound heterozygotes for two HGD gene variants. In such cases, the estimation of the role of each missense variant is not trivial, since the hexamer could be assembled with monomers all affected by the same variant (homo-oligomer) or by two different ones (hetero-oligomer). Variants affecting two different regions could have additive destructive effect, on the contrary, the effects could partially compensate for those that belong to the same region. However, we do not have any tools able to evaluate such events so far.

Despite all the previous listed limitations, the novelties brought by our work are mainly two:

1. The application of a bioinformatic approach to undercover, for the first time, relationship between genotype and phenotype in a rare disease like AKU, considering the difficulties to work with a small number of data available.

2. The described omni-comprehensive approach showed the utility, the exploiting and the importance of disease-dedicated databases. Thus, this application could give an inspiration for further construction of similar initiatives related to other rare diseases and to other more common pathologies.

Conclusions

In conclusion, PM application to an ultra-rare disease like AKU, in which patients are scattered around the world and the available specimens are scarce, is a challenging task. This is the reason why the collection of as much information as possible about each patient, not overlooking apparently ordinary factors, is crucial. ApreciseKUre is an AKU dedicated database where genetic, proteic, biochemical, histopathologic and clinical data about AKU patients are organized and freely available after user registration. By taking advantage of the dataset hereby presented, containing the highest number of AKU patient ever considered, it is possible to apply more sophisticated computational methods, such as Machine Learning, to achieve a first and preliminary AKU patient stratification based on phenotypic and genotypic data. Our contribution can be mainly summarized in four steps:

  1. Data pre-processing and preliminary statistical analysis. A preliminary analysis based on Pearson Correlation Coefficient was carried out in order to evaluate the relationship between pairs of clinical data, biochemical parameters and QoL scores. It was interesting to notice the presence of statistically significant correlations among all the QoL. Interestingly, important biomarkers of chronic inflammation and amyloidosis like CHIT1 and SAA do not result strongly correlated with disease severity differently from PTI, which instead is correlated with KOOS scores and age.

  2. Methods for Unsupervised Clustering. We applied both K-means and Hierarchical Clustering in order to stratify AKU population into subgroups with similar features. The experiment was conducted using three different stratifications, i.e., setting (i) K = 2, (ii) K = 3 and (iii) K = 4 to obtain two, three and four clusters, respectively. The resulting clusters are grouped according to the severity of the AKU disease, by considering age, the levels of oxidative stress, inflammation and amyloidosis biomarkers and QoL scores. For both the clustering methods, we found that for K = 2, the most severe phenotype seems to be the cluster number 1, cluster number 2 for K = 3 and cluster number 4 for K = 4.To the best of our knowledge this is the first time in literature that such kind of analysis is performed for AKU patients.

  3. Statistical Clusters Evaluation. To evaluate if the clusters were significantly identifying sub-groups of individuals, we applied the KW ranking non-parametric test. Additionally, we computed the Silhouette Score with the aim to test the consistency within elements which have been assigned to the same cluster. Thanks to these methods we have statistically corroborated the application of K-means algorithm for K = 2 and K = 3 and hierarchical clustering for K = 4 for AKU patient stratification.

  4. Once AKU stratification and clusters validation were performed, we investigated the HGD mutations distribution across the obtained clusters, paying attention to G161R, M368V and A122V (representing about 44% of AKU patients’ mutation in ApreciseKUre). Specifically, G161R mutation, responsible for a dramatic reduction of HGD activity, occurred in higher percentage in the most phenotypically severe clusters. On the contrary, for M368V and A122V mutations, in which enzymatic activity of HGD is conserved for more than 30%, the trend described a higher percentage in less severe phenotypic sub-groups.

The combination of a ML to analyse and re-interpret data available in the ApreciseKUre shows the potential direct benefits for patient care and treatments, highlighting the necessity of patient databases for rare diseases, like ApreciseKUre. We believe this is not limited to the study of AKU, but it represents a proof of principle study that could be applied to other rare diseases, allowing data management, analysis and interpretation.

Up to now, the AKU relationship between genotype and phenotype is unknown. Thanks to our sufficiently populated and organized dataset, it was possible, for the first time, to extensively explore the phenotype–genotype distribution in a typical PM perspective.

Materials and methods

Dataset

The ApreciseKUre contains data from 203 patients, of whom 112 do not contain missing data and thus have been used in this study [20, 25, 26]. Each patient in the ApreciseKUre database is characterized by more than 100 features, describing biochemical (i.e., SAA, CHIT1 and PTI), clinical, genotypic information and replies to questionnaires evaluating QoL scores (for a full description of data deposited in ApreciseKUre see Additional file 1 in Spiga et al. 2020 [23]). It has been performed patients assessment involving 11 QoL scores: (i) physical health score (PHS), (ii) mental health score (MHS), (iii) AKU Severity Score Index (AKUSSI) for joint pain (AJP), (iv) AKUSSI spinal pain (ASP), (v) Knee injury and Osteoarthritis Outcome Score (KOOS) pain (KOOSp), (vi) KOOS symptoms (KOOSs), (vii) KOOS daily living (KOOSdl), (viii) KOOS sport (KOOSsp), (ix) KOOS QOL, (x) Health Assessment Questionnaire Disability Index (HAQ-DI) and (xi) global pain visual analog scale (hapVAS) (for more details about every score see Supplementary Materials S2).

The method and the process on how the data and the website are optimized for access and other analyses are described in Figure 1 in the manuscript of Spiga et al. 2018 [25].

Classification model

In the present work we performed an analysis of the AKU patients’ cohort, implementing a machine learning pipeline based on an unsupervised analysis of the set of the patients. There are several key steps we have followed in this study to perform machine learning-based classification model to stratify AKU patients: phenotypic data preprocessing (including clinical and QoL data), methods for algorithm selection based on unsupervised clustering, statistical clusters evaluation, genotype–phenotype correlation. Our workflow is described in Figure 5. After a data preprocessing based on Pearson Correlation Coefficient, in the step 1, we clustered the patients according to their clinical data and QoL scores. We used two clustering methods: K-Means and Agglomerative Hierarchical clustering, which we will describe in the following section. The clusters so obtained, were therefore analysed in step 2. To evaluate the coherence of individuals belonging to the same community, we derived the Silhouette score for each cluster, using as input a combination of QoL and demographics features, as described in the Results section. Moreover, to evaluate if clusters were significantly identifying sub-groups of individuals, we applied the KW ranking non parametric test. Further analysis based on the identification of genetic mutations and relative phenotypic information, were performed for the communities identified as we show in the Results section. The workflow of the machine learning pipeline implemented is depicted in Figure 5.

  • 1) Data preprocessing

Machine learning framework. A four-step workflow of the machine learning-based classification model.
Figure 5

Machine learning framework. A four-step workflow of the machine learning-based classification model.

In the step 0, a preliminary statistical analysis based on Pearson correlation coefficient and p-value (<0.05) was calculated to evaluate the linear correlation between QoL scores and clinical data:
where σxy is the covariance of the two variables x and y, σx and σy are the variances of x and y, respectively, and μx and μy are the mean values.
  • 2) Clustering Methods

K-means, also known as ‘Lloyd’s algorithm’ [32, 33] represents one of the most well-known clustering algorithms nowadays, based on an intuitive iterative approach, which aims to maximize the intra-cluster variance. The K number of desired clusters is specified a priori by the user. In a first step, K partitions are therefore created, and the initial clustering assignment can be performed randomly, or according to some initialization procedure. Subsequently centroids, i.e., representatives, for each cluster are computed, and the following steps consist of an iterative procedure, in which distances between the features representations of the patients, and the centroids computed at the previous steps, are iteratively re-computed until convergence. It has been shown how the algorithm is NP complete [34] and heuristics approaches have therefore been proposed to reach convergence of the algorithm. K-Means has been widely and successfully applied to many different fields, from economics, to social sciences, bioinformatics and image processing [35]. We used the R implementation of the K-Means algorithm, from the VIM package (https://cran.r-project.org/web/packages/VIM/index.html).

Hierarchical Clustering: in hierarchical clustering the aim is to build a hierarchy of the original dataset samples, using distances between observations. As an outcome of the clustering procedure a tree like structure, which describes the hierarchies between samples of the dataset [34, 36], is created. In the case of hierarchical clustering, the number of desired partitions does not need to be specified a-priory, and most common types of hierarchical clustering methodologies are based on the computation of a distance matrix between samples. It is possible to follow a bottom-up or top-down approach to build the tree, depending on the type of hierarchical clustering criterion chosen [36]. The tree built can therefore be cut at any arbitrary height, obtaining the corresponding number of clusters, which have been formed at the height chosen. In our experiments we used the Sklearn Python implementation of the agglomerative cluster (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html). The parameters of affinity, and linkage were left to the default parameters of Euclidean distance and Ward respectively. We therefore cut the trees obtained for the corresponding values of clusters we obtained using the K-Means, to be able to compare the communities between the two procedures.

  • 3) Statistical Clusters Evaluation

To evaluate the clusters obtained using the two methods described above, we used the KW non-parametric rank test [37]. The statistical test checks a significant difference existing between groups. The comparison is performed between the medians of the features set considered for each group previously formed [37, 38]. In this way features significantly differently distributed among individuals, belonging to different clusters, can be detected. The KW test was performed, using R statistical software, with the matrixTests package (https://cran.rstudio.com/web/packages/matrixTests/index.html). Furthermore, to have a measure of goodness of fit of the clustering methodologies applied, we considered the Silhouette Score and the Calinski-Harabasz for each cluster. Such measures are widely used to test the consistency within elements which have been assigned to the same clusters. They are, in fact, indicators of the measure of cohesion of elements belonging to the same clusters, with respect to observations which belong to different communities [39, 40]. In this way we could compare the clusters obtained with K-Means with respect to the clusters obtained with the agglomerative hierarchical clustering approach. The Silhouette score was computed using the cluster package in R (https://cran.r-project.org/web/packages/cluster/cluster.pdf).

Genotype–phenotype distribution

Further analysis based on the identification of genetic mutations and relative phenotypic information, were performed for the communities identified using R software we analysed the percentages of each mutation present in every phenotypically identified clusters. Histograms showing such statistics are attached in Supplementary Materials S4.

Key Points
  • It has been implemented an unsupervised analysis for AKU patients stratification

  • Bioinformatic approach allows to investigate relationship between genotype and phenotype in a rare disease

  • Model for other rare diseases and for overcoming the obstacles in small dataset management and analysis

  • Omni-comprehensive approach showing the utility and the exploiting of disease-dedicated databases.

Data availability statement

The data underlying this article will be shared on reasonable request to the corresponding author.

Acknowledgements

Procedures were approved by Siena University Hospital and national Ethics (Comitato Etico Policlinico Universitario di Siena, number GGP10058, 21 July 2010) in accordance with 1975 Helsinki Declaration, revised in 2000 (52nd WMA General Assembly, Edinburgh, Scotland, October 2000). Informed written consent was obtained from the patient.

Authors’ contributions

Ottavia Spiga conceived the experiments and revised the manuscript. Vittoria Cicaloni conceived the experiments, designed and performed the experiments and wrote the paper. Giovanna Maria Dimitri performed the experiments and wrote the paper. Francesco Pettini performed the experiments. Daniela Braconi checked the data and the results. Andrea Bernini and Annalisa Santucci revised the manuscript.

Ottavia Spiga is a researcher at Biotechnology Chemistry and Pharmacology Department. She obtained her Bachelor’s degree in Biology and received her PhD degree in Biotechnology from University of Siena in 2002. Dr. Spiga has published her work in many renowned journals in the field of Structural and Molecular Biology and Computational Biology. Her strong interest in molecular modelling and computational docking analysis led her to participate in many research projects globally. Her research of interests include nuclear magnetic resonance (NMR) structure determination of biological macromolecules, protein surface accessibility investigated with paramagnetic probes and hydration NMR techniques and computer-aided molecular modelling and docking.

Vittoria Cicaloni currently works as research scientist at Toscana Life Sciences Foundation, Siena. Graduated cum Laude in Medicinal Chemistry in 2016, she received a PhD degree in Biochemistry and Molecular Biology in February 2020. The PhD training took place at the Department of Biotechnology, Chemistry and Pharmacy (University of Siena), at Toscana Life Sciences Foundation (TLS) and Melbourne University. Her master degree thesis as well as her PhD project were focused on the application of a precision medicine approach to rare diseases, development of integrated and interactive databases, algorithms for patient stratification and proteomics data analysis. She is the author of the several papers on international scientific journals.

Giovanna Maria Dimitri holds a PhD in Computer Science, obtained at the University of Cambridge (UK), under the supervision of Prof Pietro Liò, with a dissertation on ‘Multilayer network methodologies for brain data analysis and modelling’. She graduated in July 2015 in the MPhil in Advanced Computer Science at the University of Cambridge, with distinction, under the supervision of Dr Pietro Liò with a dissertation on ‘Predicting drugs side effects from a combination of chemical and biological profiles’. She is a life member of Clare Hall College, Cambridge. Previously, she received her master and bachelor thesis (both 110/110 cum laude) in Computer and Automation Engineering at the University of Siena. Her research interests are in developing deep learning and machine learning models for biomedical data analysis. She is currently a post-doc at the Department of Engineering at the University of Siena, working on deep learning for biomedical image processing. She is the author of several papers in international peer-reviewed journals.

Francesco Pettini is a PhD student in Genetics, Oncology and Clinical Medicine at the Department of Medical Biotechnologies of Siena University, where he graduated in Medical Biotechnologies in 2019. The scientific background is in Bioinformatics and Computational Biology. Actually, his PhD project is focused on development of systematic approach to predict protein–ligand binding interactions by MMPBSA energy calculation methods in Molecular Dynamics simulation.

Daniela Braconi is a research assistant at the Department of Biotechnology, Chemistry and Pharmacy of Siena University, where she graduated in Pharmacy in 2003 and received her PhD degree (‘Doctor Europaeus’ label) in Medical Biotechnology in 2008. Her scientific background is primarily in biochemistry, with several proteomic and immunoproteomic studies related to microorganisms and human diseases, paying particular attention to oxidative post-translational modifications of proteins. In the past years, she focused on the elucidation of molecular mechanisms underlying the rare disease alkaptonuria (AKU) and she became actively involved in the DevelopAKUre EU-funded project for the analysis of oxidative and inflammatory biomarkers in AKU. She is the author of more than 50 papers in international peer-reviewed scientific journals.

Andrea Bernini graduated in Chemistry in 1997 and got his PhD in Biotechnology in 2001. The PhD training took place at the University of Siena and at the Oxford Centre for Molecular Science, Oxford University, UK, where the author got trained in multidimensional/multinuclear NMR of biological macromolecules. The author is currently in charge of the NMR & Structural Biology Lab of the University of Siena. She authored more than 60 papers on international peer-reviewed scientific journals. She is the associate editor for BMC Structural Biology and is reviewer for several journals in the fields of biochemistry and NMR and for proposals for international granting agencies. Her scientific background is primarily in structural biochemistry by means of Nuclear Magnetic Resonance and computational methods, with focus on targeted and untargeted metabolomics, hot-spot mapping of biological macromolecules surface, rational design of protein–protein interface disruptors and molecular core determination.

Annalisa Santucci is a full-time professor of Biochemistry Chair of the Department of Biotechnology Chemistry and Pharmacy Director of the Tuscany Region Pegaso PhD School in Biochemistry and Molecular Biology. She is the founder and president of the Scientific Committee of aimAKU (Associazione Italiana Malati di Alcaptonuria) and a member of the Advisory Board of FindAKUre network. Her expertise include post-genomic approaches for the study of molecular mechanisms of physiopathology of metabolic, rheumatic and rare diseases.

References

1.

Garrod
AE
.
the croonian lectures on inborn errors of metabolism
.
Lancet
1908
;
172
(
4427
):
1
7
. doi: .

2.

Phornphutkul
C
, Introne WJ, Perry MB, et al.
Natural history of alkaptonuria
.
N Engl J Med
2002
;
347
(
26
):
2111
21
. doi: .

3.

Zatkova
A
,
Ranganath
L
,
Kadasi
L
.
Alkaptonuria: current perspectives
.
Appl Clin Genet
2020
;
13
:
37
47
. doi: .

4.

Ascher
DB
,
Spiga
O
,
Sekelska
M
, et al.
Homogentisate 1,2-dioxygenase (HGD) gene variants, their analysis and genotype–phenotype correlations in the largest cohort of patients with AKU
.
Eur J Hum Genet
2019
;
27
(
6
):
888
902
. doi: .

5.

La Du
BN
,
Zannoni
VG
,
Laster
L
, et al.
The nature of the defect in tyrosine metabolism in alcaptonuria
.
J Biol Chem
1958
;
230
:
251
60
.

6.

Titus
GP
, Mueller HA, Burgner J, et al.
Crystal structure of human homogentisate dioxygenase
.
Nat Struct Biol
2000
;
7
(
7
):
542
46
. doi: .

7.

Nemethova
M
, Radvanszky J, Kadasi L, et al.
Twelve novel HGD gene variants identified in 99 alkaptonuria patients: focus on ‘black bone disease’ in Italy
.
Eur J Hum Genet
2016
;
24
(
1
):
66
72
. doi: .

8.

Rodríguez
JM
, et al.
Structural and functional analysis of mutations in alkaptonuria
.
Hum Mol Genet
2000
;
9
(
15
):
2341
50
. doi: .

9.

Milch
RA
.
Studies of Alcaptonuria: a genetic study of 58 cases occurring in eight generations of seven inter-related Dominican Kindreds
.
Arthritis Rheum
1961
;
4
(
2
):
131
36
. doi: .

10.

Braconi
D
,
Millucci
L
,
Bernardini
G
, et al.
Oxidative stress and mechanisms of ochronosis in alkaptonuria
.
Free Radic Biol Med
2015
;
88
:
70
80
. doi: .

11.

Braconi
D
, Laschi M, Taylor AM, et al.
Proteomic and redox-proteomic evaluation of homogentisic acid and ascorbic acid effects on human articular chondrocytes
.
J Cell Biochem
2010
;
111
(
4
):
922
32
. doi: .

12.

Braconi
D
,
Bianchini
C
,
Bernardini
G
, et al.
Redox-proteomics of the effects of homogentisic acid in an in vitro human serum model of alkaptonuric ochronosis
.
J Inherit Metab Dis
2011
;
34
(
6
):
1163
76
. doi: .

13.

Millucci
L
, Ghezzi L, Braconi D, et al.
Secondary amyloidosis in an alkaptonuric aortic valve
.
Int J Cardiol
2014
;
172
(
1
):
121
23
. doi: .

14.

Millucci
L
, Spreafico A, Tinti L, et al.
Alkaptonuria is a novel human secondary amyloidogenic disease
.
Biochim Biophys Acta Mol Basis Dis
2012
;
1822
(
11
):
1682
91
. doi: .

15.

Millucci
L
,
Braconi
D
,
Bernardini
G
, et al.
Amyloidosis in alkaptonuria
.
J Inherit Metab Dis
2015
;
38
(
5
):
797
805
. doi: .

16.

Gabay
C
,
Kushner
I
.
Acute-phase proteins and other systemic responses to inflammation
.
N Engl J Med
1999
;
340
(
6
):
448
54
. doi: .

17.

Braconi
D
, Bernardini G, Paffetti A, et al.
Comparative proteomics in alkaptonuria provides insights into inflammation and oxidative stress
.
Int J Biochem Cell Biol
2016
;
81
:
271
80
. doi: .

18.

Braconi
D
, Giustarini D, Marzocchi B, et al.
Inflammatory and oxidative stress biomarkers in alkaptonuria: data from the DevelopAKUre project
.
Osteoarthr Cartil
2018
;
26
(
8
):
1078
86
. doi: .

19.

Cho
SJ
,
Weiden
MD
,
Lee
CG
.
Chitotriosidase in the pathogenesis of inflammation, interstitial lung diseases and COPD
.
Allergy Asthma Immunol Res
2014
;
7
(
1
):
14
21
. doi: .

20.

Cicaloni
V
, Spiga O, Dimitri GM, et al.
Interactive alkaptonuria database: investigating clinical data to improve patient care in a rare disease
.
FASEB J
2019
;
33
(
11
):
12696
703
. doi: .

21.

Ranganath
LR
,
Cox
TF
.
Natural history of alkaptonuria revisited: analyses based on scoring systems
.
J Inherit Metab Dis
2011
;
34
(
6
):
1141
51
. doi: .

22.

Vilboux
T
, Kayser M, Introne W, et al.
Mutation spectrum of homogentisic acid oxidase (HGD) in alkaptonuria
.
Hum Mutat
2009
;
30
(
12
):
1611
19
. doi: .

23.

Spiga
O
,
Cicaloni
V
,
Fiorini
C
, et al.
Machine learning application for development of a data-driven predictive model able to investigate quality of life scores in a rare disease
.
Orphanet J Rare Dis
2020
;
15
(
1
):
1
10
. doi: .

24.

Rossi
A
, Giacomini G, Cicaloni V, et al.
AKUImg: a database of cartilage images of Alkaptonuria patients
.
Comput Biol Med
2020
;
122
:103863. doi: .

25.

Spiga
O
, Cicaloni V, Zatkova A, et al.
A new integrated and interactive tool applicable to inborn errors of metabolism: application to alkaptonuria
.
Comput Biol Med
2018
;
103
:1–7. doi: .

26.

Spiga
O
,
Cicaloni
V
,
Bernini
A
, et al.
ApreciseKUre: an approach of precision medicine in a rare disease
.
BMC Med Inform Decis Mak
2017
;
17
(
1
):42. doi: .

27.

Cicaloni
V
, Zugarini A, Rossi A, et al.
Towards an integrated interactive database for the search of stratification biomarkers in Alkaptonuria
.
Peer J Prepr
2016
. doi: .

28.

Roos
EM
,
Lohmander
LS
.
The knee injury and osteoarthritis outcome score (KOOS): from joint injury to osteoarthritis
.
Health Qual Life Outcomes
2003
;
1
:64. doi: .

29.

Bruce
B
,
Fries
JF
.
The Stanford health assessment questionnaire: dimensions and practical applications
.
Health Qual Life Outcomes
2003
;
1
:20. doi: .

30.

Langford
B
,
Besford
M
,
Hall
A
, et al.
Alkaptonuria severity score index revisited: Analysing the akussi and its subcomponent features
.
JIMD Rep
2018
;
41
:53–62. doi: .

31.

Giustarini
D
, Dalle-Donne I, Lorenzini S, et al.
Protein thiolation index (PTI) as a biomarker of oxidative stress
.
Free Radic Biol Med
2012
;
53
(
4
):
907
15
. doi: .

32.

MacQueen
J
.
Some methods for classification and analysis of multivariate observations
.
Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability
.
1967
;
1
:
281
96
.

33.

Jain
AK
.
Data clustering: 50 years beyond K-means
.
Pattern Recognit Lett
2010
;
31
(
8
):
651
66
. doi: .

34.

Flach
P
.
Machine Learning: The Art and Science of Algorithms That Make Sense of Data
.
Cambridge: Cambridge University Press
,
2012
. doi: . ISBN: 9781107422223.

35.

Han
J
,
Kamber
M
,
Pei
J
.
Data Mining: Concepts and Techniques
. third edition (3rd ed.). Waltham, Mass. USA: Morgan Kaufmann Publishers. 2012. doi: .

36.

Vercellis
C
.
Business Intelligence: Data Mining and Optimization for Decision Making
. Chichester, West Sussex: John Wiley & Sons Ltd,
2009
. doi: .

37.

Kruskal
WH
,
Wallis
WA
.
Use of ranks in one-criterion variance analysis
.
J Am Stat Assoc
1952
;
47
(
260
):
583
21
. doi: .

38.

Corder
GW
,
Foreman
DI
.
Nonparametric Statistics for Non-Statisticians
. Hoboken, New Jersey, USA: John Wiley & Sons Inc.,
2009
. doi: .

39.

Rousseeuw
PJ
.
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
.
J Comput Appl Math
1987
;
20
(
C
):
53
65
. doi: .

40.

Caliñski
T
,
Harabasz
J
.
A dendrite method foe cluster analysis
.
Commun Stat
1974
;
3
(
1
):
1
27
. doi: .

41.

Andresen
EM
,
Gravitt
GW
,
Aydelotte
ME
, et al.
Limitations of the SF-36 in a sample of nursing home residents
.
Age Ageing
1999
;
28
(
6
):
562
66
. doi: .

42.

Yang
RL
,
Shi
YH
,
Hao
G
, et al.
Increasing oxidative stress with progressive hyperlipidemia in human: relation between malondialdehyde and atherogenic index
.
J Clin Biochem Nutr
2008
;
43
(
3
):
154
58
. doi: .

43.

Giustarini
D
,
Dalle-Donne
I
,
Milzani
A
, et al.
Low molecular mass thiols, disulfides and protein mixed disulfides in rat tissues: influence of sample manipulation, oxidative stress and ageing
.
Mech Ageing Dev
2011
;
132
(
4
):
141
48
. doi: .

Author notes

Ottavia Spiga and Vittoria Cicaloni have contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)