Beyond the Hubble Sequence - Exploring galaxy morphology with unsupervised machine learning

Conventionally, galaxy morphological classiﬁcations are deﬁned by visual assessment. However, visual classiﬁcation systems such as Hubble types can be intrinsically biased due to the subjective judgement of human classiﬁers. Additionally, since morphological ”classi-ﬁcations” into types is an important and complementary process, it is not clear if we know what these ”best types” are, such as, whether a classiﬁcation scheme results in relatively unique physical properties of the galaxies or traces the merger history in each class. As most machine learning applications in astronomical studies focus on the improvement of accuracy and eﬃciency, we apply machine learning to approach a new insight towards galaxy morphological classiﬁcation. In particular, machines decide a classiﬁcation system to describe the variation shown in galaxy morphology in the dataset. We explore galaxy morphology from SDSS imaging data with an unsupervised machine learning technique composed of a feature extractor using vector-quantisation variational autoencoder and hierarchical clustering. Our methodology results in 27 machine-deﬁned classes which are physically distinctive from each other in stellar mass, absolute magnitude, physical size, and colour. When we merge these clusters into 2 clusters for binary classiﬁcation, the unsupervised method provides an accuracy of 87% for separating early-type (ETG) and late-type galaxies (LTG) using a dataset with the morphology distribution of nearby galaxies (i


INTRODUCTION
Galaxy structure and visual morphology have a strong connection with their stellar population properties, such as surface brightness, colour, and the formation history of galaxies (Holmberg 1958;Dressler 1980). The dominant visual morphological classification system in use today was first con-E-mail:ting-yun.cheng@nottingham.ac.uk structed by Hubble (1926), which was then revised by adding a class for lenticulars (S0), a type of galaxy has a disk structure without apparent spiral arms (Hubble 1936;Sandage 1961). Since then, a number of detailed classification systems were proposed such as ones including the notation for the inter and outer ring structure (de Vaucouleurs 1959) and different arm classes (Elmegreen & Elmegreen 1982, 1987, among others.
However, visual classification systems can be intrinsi-cally biased due to the subjective judgement of different human classifiers. These human errors are unavoidable and sometimes cannot be reproduced for carrying out a statistical analysis. This greatly limits the ability to use galaxy classification in a formal quantitative way. These issues led astronomers to search for a quantitative description of galaxy structure based on the shape, structure, and physical properties of galaxies which can in principle be connected with visual morphology. For example, Principal Component Analysis (PCA) was applied to determine the number of dominant features needed to reproduce the variance shown in observation in Whitmore (1984) as well as to provide an objective procedure for analysing galaxy properties (also see Conselice 2006). Other studies such as non-parametric methods, e.g., concentration, asymmetry, smoothness/clumpiness, and gini coefficient Bershady et al. 2000;Abraham et al. 2003;Conselice 2003;Lotz et al. 2004;Law et al. 2007), and parametric methods, e.g., Sérsic profile (Sérsic 1963(Sérsic , 1968 for measuring galaxy structure were also proposed to provide a more objective and quantitative classification systems than visual assessment alone. Even though quantitative measures of galaxy structure are extremely useful for measuring properties such as the merger history (e.g., Conselice 2003), morphological 'classifications' into types is still an important and complementary process. However, it is not clear if indeed we know what these best 'types' are. Thus, in this study we build a galaxy morphological classification system that does not involve human bias; we do this through a machine learning approach. For this purpose, we use unsupervised machine learning which is trained without any prior knowledge (e.g., galaxy labels, such as Hubble types). This approach is able to give us suggestive classifications from the machine's perspective based upon input features. However, with an unsupervised machine learning technique it becomes more challenging to have a 'sensible' classification, that is one with more consistency with human opinion, when the dimensionality of a feature space becomes high (curse of dimensionality , Bellman 1954;Keogh & Mueen 2017). In astronomical studies, unsupervised machine learning applications have been mostly used in the studies of spectroscopic data which is less dimensional than applying to imaging data (e.g., Geach 2012; Krone-Martins & Moitinho 2014;Carrasco Kind & Brunner 2014;Siudek et al. 2018). Therefore, unsupervised learning for galaxy classification is still in its infancy.
There are currently several types of astronomical studies that apply unsupervised machine learning techniques to images which reach reasonable results, including: galaxy morphology (Hocking et al. 2018;Martin et al. 2019), strong lensing identification (Cheng et al. 2020), and anomaly detection (Xiong et al. 2018;Margalef-Bentabol et al. 2020). For example, Hocking et al. (2018) and Martin et al. (2019) apply a technique called Growing Neural Gas algorithm (Fritzke 1994), which is a type of Self-organising Maps (SOMs, Kohonen 1997), to extract features from images. These features are then connected with a hierarchical clustering algorithm (Hastie et al. 2009). On the other hand, Cheng et al. (2020) use a fundamentally different approach by using a convolutional autoencoder (Masci et al. 2011), which includes an architecture of convolutional neural networks, for feature extraction. This method connects the ex-tracted features with a Bayesian Gaussian mixture model from which a clustering analysis can be done.
In this study, we apply an architecture consisting of a convolutional autoencoder, as convolutional neural networks have demonstrated their capability for capturing representative and meaningful features from images (Krizhevsky et al. 2012). We do not use the same convolutional autoencoder as Cheng et al. (2020), but we apply a newly developed technique from Google DeepMind (van den Oord et al. 2017;Razavi et al. 2019) called 'Vector-Quantised Variational Autoencoder (VQ-VAE)'. This technique includes a vector quantisation method that accelerates the time-consuming process of feature extraction when using a convolutional autoencoder, as explained in Cheng et al. (2020). On the other hand, for clustering algorithms, we decide to apply a modified hierarchical clustering method to group the data in order to explore connections between the distances amongst extracted features in feature space, and the number of classification clusters.
In this paper, we use this unsupervised machine learning technique to develop a galaxy morphology classification system defined by a machine, and compare it with traditional visual classification system such as the Hubble sequence. We furthermore also compare our machine developed classification with galaxy physical properties, such as stellar mass, colour, and physical size of galaxies. We use monochromatic images throughout to focus only on the impact of galaxy shape and structure on morphological classifications in this paper. The methodology we develop is introduced in Section 2, while the detailed description of how to approach using our method and the data used in this study are shown in Section 3. Section 4 presents the results in this study. Finally, we conclude the work in Section 5.

METHODOLOGY
In this section we explain our unsupervised machine learning methodology that is used throughout this paper. We give a brief overview here, before going into detail in the following subsections.
Our unsupervised machine learning technique includes a feature learning phase with a vector-quantised variational autoencoder (VQ-VAE; Section 2.1 and Section 2.2) and a clustering phase using a hierarchical clustering algorithm (HC; Section 2.3). Several novel approaches for unsupervised machine learning applications are made in this paper: (1) the VQ-VAE considers both reconstruction and preliminary clustering results in the feature learning phase (Section 2.2 and also see Section 3.3); (2) multiple different distance thresholds are used to draw the decision lines on the merger tree in the clustering process (see details in Section 2.3).

Vector-Quantised Variational Autoencoder
(VQ-VAE) The vector-quantised variational autoencoder (VQ-VAE) was built by Google DeepMind (van den Oord et al. 2017;Razavi et al. 2019) and was originally used for high-fidelity image emulation. The task of image emulation is to learn the distribution of the data given a set of training images, and then to reproduce the images with the learnt distribution.
In details, the structure of an autoencoder ( Fig. 1) contains an encoder with a posterior distribution q (z|x) and a prior distribution p (z) where x is the input data and z represents latent variable, and a decoder with a distribution p (x|z) for reproducing the input data. The VQ-VAE is a type of autoencoder which includes the structure of convolutional neural networks and applies a vector quantisation process (van den Oord et al. 2017) to make the posterior and prior distribution become categorical. By using a categorical distribution, the computational time for training an autoencoder is significantly reduced compared to other machine learning methods. For example, in Cheng et al. (2020), it takes up to 5 days to train 100,000 images by a convolutional autoencoder running on a NVIDIA GeForce GTX 1080 Ti GPU, while a VQ-VAE takes up to a few hours to train the same amount of data with the same device. This is an enormous difference and shows the power of the VQ-VAE method.
Following the top coloured area in Fig. 1, the posterior categorical distribution q (z|x) is defined as (van den Oord et al. 2017;Razavi et al. 2019): where ze (x) is the output of the encoder (the blue part at the left in the figure), the value ej represents a vector in the codebook which is used for vector-quantising the ze (x), and k is the index for the vector used in the selected codebook (the top box of the yellow part in the figure). We then measure the vector-quantised representation zq (x), which is the input of the decoder (the blue shading at the right side in the figure), through Equations 1 and 2.
The vector quantisation process is shown as the yellow part in Fig. 1. The output of an encoder, ze (x) can be represented by a combination of the index of different vectors, k, in the codebook (the square in the middle of the yellow part). For example, in Fig. 1, a three dimensional 'pixel' in the output of an encoder is represented by a vector, e3, after the vector quantisation. We then use the index of these vectors to build a two dimensional index map. For the pixel used in our example the value is 3. With this index map, we can rebuild the distribution, zq (x), with the same dimension as ze (x) but in this case each 'pixel' in zq (x) is quantised to one of the vectors shown in the codebook. For our example, the vector e3 is used for the pixel. The distribution of zq (x) is then used as the input for the decoder to reconstruct the images. The loss function of the original VQ-VAE contains three parts: reconstructed loss, codebook loss, and commitment loss. An additional penalty is considered later in the modified version of the VQ-VAE (see Section 2.2). The reconstructed loss is measured by comparing the reconstructed images with the input images. The codebook loss is used to make the selected codebook, ej, approach the output of the encoder, ze (x), while the commitment loss is applied to encourage the ze (x) to be as close as possible to the chosen codebook from the previous epoch. With these definitions, the loss function, L, for the VQ-VAE is defined as (Razavi et al. 2019): where the value sg is the stopgradient operator and β is used for adjusting the weight of the commitment loss. The study of van den Oord et al. (2017) found that these results correlate with the value of β, and no apparent change occurs when β ranges from 0.1 to 2.0. Therefore, we set β = 0.25 in this study which follows the setting in van den Oord et al. (2017).
The details of the VQ-VAE architecture is shown in Table 1. Four convolutional layers are used in both the encoder and decoder, and residual neural networks (ResNets, He et al. 2016) are used in this architecture to create a deeper neural network with less complexity. The activation function applied in the convolutional layers is the Rectified Linear Unit (ReLu) (Nair & Hinton 2010) The VQ-VAE code is based upon the example provided in sonnet library (DeepMind 2018) 1 which is built on top of TensorFlow (Abadi et al. 2015) 2 . To train the VQ-VAE, we apply the Adam Optimiser (Kingma & Ba 2014) and the learning rate is set to 0.0003 which is used in Razavi et al. (2019).

Modified VQ-VAE
In this study, we apply a modification to our original VQ-VAE to consider both image reconstruction and a preliminary clustering result when extracting the representative features from images ( Fig. 1). To achieve this goal, a penalty defined by silhouette score (Rousseeuw 1987, Equation 4) is added (Equation 5). The silhouette score indicates how well clusters are separated from each other and is defined by the formula, where a represents the mean intra-cluster distance while b is the distance between a cluster and its nearest neighbour cluster. Therefore, a larger silhouette score indicates a better separation between clusters in feature space. To train our VQ-VAE, we minimise the final loss function combining the loss described in Equation 3 and the penalty defined as, where s represents the silhouette score and λ is a constant used for making the magnitude of this penalty of the same order as other losses used in the VQ-VAE (Section 2.1). The value of λ is equal to 0.1 in our case. As shown in Fig. 1, during the training of the VQ-VAE, we interpolate an instance-based clustering algorithm called 'k-medoid clustering' (Maranzana 1963;Park & Jun 2009) to obtain two preliminary classification clusters using a flattened index map. The two clusters are then used for measuring a silhouette score to evaluate the performance of the clustering. The Hamming distance (Hamming 1950

K-medoids Clustering
Vector Quantisation Figure 1. A schematic architecture of the modified VQ-VAE used for feature extraction of images. The top aspect with a coloured background is the main architecture of the VQ-VAE, which is then modified to consider the silhouette score calculated using the two preliminary clusters given by k-medoids clustering as a part of the loss function when training VQ-VAE (see details in Section 2.2). The blue shading at the left and right represents the encoder and the decoder, respectively while the yellow part shows the vector quantisation process. The details of each layer are shown in Table 1 Type #channel kernel size stride size activation function  as the distance metric as our data is represented by the indices of the vectors in the codebook whereby the number itself only represents a category rather than a real value of the vector (more description in Section 2.3). The 'k-medoid clustering' is used here for a fast evaluation; in the main clustering process after feature extraction, we apply hierarchical clustering algorithms (Section 2.3).

Uneven Iterative Hierarchical Clustering
In this section we describe our hierarchical clustering procedure for identifying different types of clusters. Hierarchical Clustering (HC; Johnson 1967; Hastie et al. 2009), in particular agglomerative HC (called sometimes 'bottom-up'), first assigns each input as an individual group, then merges two nearest (the most similar) groups together based upon the measured pair distance in the feature space, recursively. The 'bottom-up' HC structure allows a different number of datapoints in clusters because it starts with individuals ( Fig. 2).
Other kinds of clustering such as 'top-down' HC and Kmedoid clustering used in Section 2.2 start with clusters themselves, which are more difficult to provide a starting point for an uneven number of datapoints for the initial clusters. The distance (similarity) measured in this study is the Hamming distance (Hamming 1950). As stated in Section 2.2, our data is represented by the index of the vectors selected from the codebook. This is such that an index indicates a category rather than the real value of a vector. We compare two data sets represented by a set of features labelled with indices. The Hamming distance is defined as the number of mismatched indices between the pair over the number of features used to represent the data. For example, assuming that an image can be presented by four different features labelled with the indices: 1, 2, 3, 4, after VQ-VAE; in this case the Hamming distance is 0 if the other image is represented as 1, 2, 3, 4 as well, and the Hamming distance is 1 if it is represented by 4, 3, 2, 1.
For further clarification, Fig. 2 illustrates the clustering process. Within this study, we realise that when all the data are considered, the merging point can be less accurate due to the mixture of blindly measured distances from a great variety of extracted features in images. Therefore, we carry out an iterative clustering process with a reverse concept that we control the data used for doing HC from the top to bottom. We first make the HC merge all data into two top parent branches, then apply the second round of HC to the data of a parent branch to obtain two children branches, and apply the same procedure again to the sub-data of a child branch to get two grandchildren branches, and so on. The iterative action stops when it reaches a certain condition (the black circle in Fig. 2; see Section 3.4).
In a typical HC, a uniform distance is used to determine the final clusters. However, a uniform distance threshold is not appropriate considering that galaxies' appearance in different morphological types have different complexity, such that spiral galaxies have a larger diversity in appearance than elliptical galaxies. Therefore, in this study, we propose to allow a different stopping point/distance threshold for each branch depending on the complexity of the objects in the branch (see Section 3.4). For example, a branch which consists of galaxies which can look very different within a class may continue for many iterations, while others may reach the stop criteria with fewer iterations due to a relatively monotonous structure within the data of the branch. For example, spiral galaxies can have a variety of spiral arms appearances, i.e., different number of arms, different positions of arms, etc. Therefore, the distance between spiral-like galaxies are generally larger than the distance between two elliptical-like galaxies. This consideration is sensible and is of great importance in morphological classification of galaxies; however, this is neglected in a typical HC algorithm. Therefore, to distinguish it from a typical HC algorithm, we call this setup 'uneven clustering' which provides us with a more precise distinction in galaxy shape, structure, and morphology.

IMPLEMENTATION
The pipeline of this study includes three main steps: (1) feature selection; (2) feature learning (using the modified VQ-VAE); and finally (3) clustering process. The data used in this study are introduced in Section 3.1. The feature selection is described in Section 3.2, and the setup for the feature learning process using the modified VQ-VAE (Section 2.2) is discussed in Section 3.3. Finally, in Section 3.4 we explain the details of the clustering process we use to classify galaxies.

Data Sets
The imaging data used throughout this work is from the Sloan Digital Sky Survey (SDSS) Data Release 7 (York et al. 2000;Abazajian et al. 2009) with a redshift cut of z < 0.2. In order to focus on the impact of galaxy shape and structure to morphological classifications, we utilise monochromatic -3 -2, -1 0 -2 3 -8 10 Table 2. The classification scheme used in this work and in Domínguez Sánchez et al. (2018, DS18;presented in T-Type). In DS18, they define the T-Type of -3 for ellipticals (E), -2 for lenticulars at the early stage (S0 − ), -1 for lenticulars at the intermediate to late stages (S0), 0 for S0/a, and the positive values of T-Type are for different stages of spirals. Finally the T-Type of 10 represents irregular galaxies (Irr).
r-band images. An extension including colour and other factors is some to consider for the future. Here we are focused on single-band morphological classification on features seen and not in general a physical classification that might result from considering galaxy colours and colour distributions.
To examine what types of systems our classification clusters contain, as well as to have the flexibility within the data distribution in our datasets, we use morphology labels defined by T-Type (de Vaucouleurs 1964) and the probability of being a barred galaxy (P bar ). Both quantities are obtained using deep learning techniques from Domínguez Sánchez et al. (2018, hereafter, DS18). We define eight labels including barred galaxies that contain significant features shown in the Hubble morphological system: ellipticals (E), lenticulars (S0), early spirals (eSp), late spirals (lSp), irregulars (Irr), barred lenticulars (SB0), early barred spirals (bar eSp), and late barred spirals (bar lSp).
The comparison of the classification scheme is shown in Table 2; in which, S0, eSp, and lSp are separated into barred and non-barred galaxies based on the value of P bar . We additionally include labels of irregular galaxies from three other works: Fukugita et al. (2007), Nair & Abraham (2010), and Oh et al. (2013) to provide more irregular galaxies in our database. The morphological labels in our datasets are not used for training our machine, but to prepare an appropriate dataset with a specific data distribution, and as a way to examine the obtained clusters in terms of these types.
To investigate the differences in the classification systems defined by humans and those from a machine, as well as potential application within our unsupervised machine learning technique in future surveys, we prepare two different datasets: which are 'balanced' and 'imbalanced'. In the balanced dataset, we artificially allocate the same number of galaxy images to each morphological type. The eight human defined morphological types have visually distinctive differences from each other; therefore, the purpose of this arrangement is to allow our VQ-VAE consider fairly the characteristics of each morphology type when extracting the representative features from input images. Otherwise it is possible that some type of bias would result if the distribution of the types we select are input into our VQ-VAE in the same fraction as they are found in the nearby universe. In this case we would find that the late-type disks would dominate over early disks and ellipticals (e.g., Conselice 2006).
On the other hand, it is of great importance to know how an unsupervised machine learning technique can be applied in future surveys to explore a large scale of unknown galaxies' morphology in an 'as is' situation. That is, we need to know how our VQ-VAE performs when galax-ies are inputted from imaging observations of the real universe with no balancing. For this goal, we set up the 'imbalanced dataset' with a realistic distribution in terms of galaxy morphological types which follows the distribution of nearby galaxies at z=0.033-0.044 as presented in Oh et al. (2013). The type distributions of the balanced and imbalanced dataset are shown in Fig. 3.

Feature Selection
In this section we discuss a preprocessing procedure to reject irrelevant information from images. The feature selection procedure is used to select the pixels in images that are significant and which reflect the shape or structure of the targets. Cheng et al. (2020) showed that the background noise can result in an overfit to the noise when training the convolutional autonencoder. To solve this, Cheng et al. (2020) applied a simplified convolutional autoencoder to denoise the images and emphasise the pixels from the targets themselves before the main task is computed. However, a denoising process by another autoencoder is time-consuming and could potentially add artificial structure when reconstructing the images. Therefore, in this study, we simply use a one sigma clipping of pixel values measured through the background noises as our selection threshold. Any pixel value is below this criterion the pixel value is set as 0 (Martin et al. 2019). Whilst this will remove noise, it will also potentially remove outer fainter portions of the galaxies themselves. However, this will retain the brighter portions of the inner parts of galaxies where classification is done in any case. Removing this fainter light does not have an effect on our measurements as it would if we were measuring for example surface brightness profiles.

Feature Learning
As described, in this study, we apply a modified vectorquantised variational autoencoder (VQ-VAE) (see Section 2.2) to carry out our unsupervised learning. Our VQ-VAE basically learns the representative features from our images. It considers a preliminary clustering result by including an additional penalty (Equation 5) in the VQ-VAE (Section 2.2). This modification helps to find not only better representative features for image reconstruction, but also the features that can be well separated into two initial groups in feature space.
The main advantage of the VQ-VAE technique is to accelerate the unsupervised feature extraction process which is over 30 times faster than using a typical convolutional autoencoder (e.g., Cheng et al. 2020) without a significant trade-off to the reconstruction accuracy (Razavi et al. 2019). This is achieved by quantising the values used for reconstruction (Section 2.1).
The hyper-parameters setting used in this study follows the setup described in Razavi et al. (2019) except for the codebook size. It determines the number of vectors available in the quantisation process (Section 2.1). This number of vectors decides the 'resolution' of the reconstructed images. Namely, the more available vectors, the more details can be presented in images. Razavi et al. (2019) use 512 vectors in their codebook to generate high-fidelity emulated images of balanced imbalanced Figure 3. The type distributions of the balanced (left) and imbalanced (right) datasets. The latter follows the distribution of nearby galaxies (Oh et al. 2013). The number shown above the coloured bar represents the fraction of the type in all data. The fraction of barred galaxies are highlighted with hashed lines. The orange and light blue colouring represent early-type galaxies and late-type galaxies, respectively. different animals, e.g., dogs, cats. However, with a different goal from emulation in our study, we realised during analysis that a larger codebook size leads to a worse clustering result. This is because the machine with a larger codebook uses too many details of the images into account when carrying out the clustering. These details help to complete the puzzle when emulating images but they blur the boundary in the feature space when doing clustering. In this study, after a series of tests, we choose a size of 16 for our codebook, which forces the machine to use the provided vectors on the most significant features while still retaining a certain level of the reconstruction quality. This number of 16 was determined through experimental method, and is not based on any basic principles related to galaxies or machine learning. It may, and probably does, differ within different instances of use.

Clustering
Within the clustering task, we apply an uneven iterative hierarchical clustering (Section 2.3) on the data represented by a set of vector-quantised features obtained after the VQ-VAE.
In this study, we propose a new approach to decide the number of clusters within unsupervised machine learning applications. This approach can be used in other instances beyond using a VQ-VAE. Part of this is inspired by the fact that the clusters can be highly sensitive to galaxy orientation. The concept we use is to take the threshold measured by the features of galaxy orientation on the merger tree to find where the effect of galaxy orientation in a branch starts to appear (e.g., gray dotted lines in Fig. 2). In other words, this threshold also provides the number of classification clusters that are not separated based on the galaxy orientation. This threshold is defined by the average distance between the artificially rotated images in a branch (drot), where N is the number of datapoints in the branch, and dij represents the distance between an image i and image j. The distance, dij, is measured through the Hamming distance. In this process we stop a branch and decide the number of clusters within that branch when one of two criteria is satisfied: (1) the drot suggests fewer than two clusters (≤ 2) in a branch; (2) the difference between the drot measured using the data of a parent branch and the data of a child branch are smaller than 0.015: that is, dp,rot −dc,rot ≤ 0.015.
The first criterion indicates that galaxy orientation is considered when having more than two clusters (> 2) in this branch (e.g., circle 1 and 2 on Fig. 2). Two clusters are the minimal number to split; therefore, we stop the iterative clustering in a branch when this criterion is satisfied. On the other hand, the second criterion is used to decide whether a branch (the parent branch) should have more sub-branches (the child branches). The variation between branches is less significant when the difference in the distance between the data of a parent branch and a child branch is small (≤ 0.015). The value used in the second criterion is measured based on the branches stopped due to the first criterion. Therefore, there is no need to split a parent branch when the second criterion is satisfied. The suggested number of clusters by the drot of the parent branch is then the number of clusters in the branch without having the effect of galaxy orientation. For example in Fig. 2, the branch stops at the circle 3 by satisfying the second criterion, and the drot (gray dotted line) suggests three clusters without showing the effect of galaxy orientation in this branch. featured group less featured group Figure 4. Examples of galaxies found within our two preliminary clusters using the balanced dataset. Galaxies in one cluster have more features (left left), and galaxies in the other group have relatively fewer features (right).

Unsupervised Binary Classification
Starting with a simple examination, we enforce our machine to merge all galaxies in the balanced dataset into two preliminary clusters. Examples of galaxies within the two clusters are shown in Fig. 4. Galaxies in one cluster have clearly more features (featured group; e.g., arm structure) than the galaxies of the other cluster (less featured group; more elliptical). We examine the morphological distribution in both clusters (left column in Fig. 5); one cluster has ∼ 96% late-type galaxies (LTGs) and the other one has ∼ 60% early-type galaxies (ETGs). Due to an unequal number between the ETGs and the LTGs in the balanced dataset (Fig. 3), the fraction of ETGs and LTGs in each cluster might be biased. We examine another quantity, 'dominance', which represents the ratio between the fraction of a certain type in a given cluster to the fraction of this type within the dataset (right column in Fig. 5). This quantity removes the statistical influence from different number of types used in the input datasets; hence, it shows a better representation of the galaxy features emphasised in the cluster. Through the dominance distribution, we observe that the featured and less featured group are clearly dominated by the features of LTGs and ETGs, respectively.
We further investigate the potential structural factors considered when separating the two clusters. With the analysis of the two clusters, we can decide what are the major structural factors in the clustering process. First of all it is clear that with our unsupervised learning we obtain a separation into two main clusters where one correlates with late-type galaxies and the other with early-type galaxies. This verifies with a machine this basic dichotomy which has existed in classification schemes for over 100 years.
However, we also want to compare our clusters with more quantitiative measures. In Fig. 6, we compare a variety of structural measurements such as concentration, asymmetry, smoothness/clumpiness, Sérsic index, Gini coefficient, M20, apparent half-light radius (Re, arcsec), and r-band apparent magnitude (mr) between the two clusters. These measurements, except for the r-band magnitude, are provided from the catalogue of Meert et al. (2015), and the r-band magnitudes are from Simard et al. (2011). Within these measurements, the asymmetry, Sérsic index, Gini coefficient, and M20 show a clear separation between the two clusters in Fig. 6. This indicates that our machine takes galaxy structure which correlates with measurable strcutural parameters (asymmetry, Gini coefficient, M20) and light distribution (Sérsic index) into account rather than the apparent size and the apparent brightness of galaxies, when categorising galaxies into the two clusters. This is good, as it shows that our method does not depend on distance or the apparent sizes of galaxies but on the inherent morphologies and structures of the galaxies themselves.
Note that the concentration and smoothness distributions show fewer differences between the two clusters. These two quantities also do not have apparent differences between the LTGs and ETGs in our dataset, because the galaxies in our datasets are relatively faint (∼ 74% galaxies fainter than mr = 16) and the image resolution is limited by the groundbased seeing (>1 arcsec; the image sampling is 0.396 arcsec per pixel). Although we cannot straightforwardly confirm the correlation between the two clusters and the concentration parameter, the Gini coefficient and M20 provide a connection with the concept of concentration.
Based on our visual assessment, we proceed to associate the featured group to LTGs and the less featured group to ETGs in order to compare these machine-predicted labels with the catalogue labels. Using the balanced dataset, the machine-predicted and the catalogue labels agree with an accuracy of ∼ 0.75 in this binary classification. The accuracy is defined as the number of the correct matches between the machine labels and the catalogue labels from all galaxies in the dataset.
In Fig. 7, we present the T-Type distribution between the two clusters. It shows that the main confusion in binary classification by our machine happens when classifying early spirals into either ETGs or LTGs, in particular, Sab  . The distribution of visual galaxy morphology in each cluster obtained using the balanced input dataset. The left column shows the fraction of each morphology type in the clusters while the right column presents the dominance of each type. The 'dominance' is defined by the fraction of a certain morphology type in the cluster divided by a fraction of this type within the dataset. The top row shows the distribution of the 'featured group' while the bottom row presents the statistics for the 'less featured group'. galaxies (T-Type=2). When we exclude early spirals from the balanced dataset, the accuracy increases to ∼0.87 for binary classification.
We discuss some plausible reasons for this misclassification compared to visual classification by our machine. For example, one uncertainty originates from the provided labels which combine the uncertainty of both visual classifications and machine learning predictions. Second, from our machine's perspective, in addition to the potential machine learning uncertainty, another possible uncertainty is caused by the reconstruction inaccuracy in the VQ-VAE, particularly within spiral galaxies with insignificant arm structures. However, although these causes are unavoidable, these conditions exist only in a fairly small fraction of the data in the input imaging dataset. The main reason for the mixture of early spirals in both clusters is due to the intrinsic difficulty of classifying this type into either ETGs or LTGs based only on galaxy structure. The 'early spirals' in fact include a wide range of transitional features which are difficult to accurately define. The separation may become better when including colour information; however, with our method, we state the difficulty to discriminate early spirals when con-sidering only galaxy appearance/structure in a unsupervised machine learning methodology.

Machine Classification Scheme
In the previous section, we enforce our machine to provide two initial clusters for a preliminary examination. However, the main motivation for this study is to investigate the classification system a machine would suggest when 'looking' at galaxies and classifying them through machine learning. We use the proposed method in Section 3 with the balanced dataset to let the machine explore freely and suggest a number of clusters to categorise the galaxies in the dataset. Galaxies in our dataset are categorised into 27 classification clusters by our machine. Comparing with previous work on unsupervised learning which produced 160 clusters (Martin et al. 2019). Our method suggests significantly fewer number of galaxy morphology classifications which is more in line with what one would surmise is a more accurate number of classes for galaxies. In addition to the different implementations applied in both works, the difference in the number of obtained clusters might be due to the fact that we only consider monochromatic images to investigate the impact of Figure 6. The comparison of structural measurements including: concentration, asymmetry, smoothness/clumpiness, Sérsic index, Gini coefficient, M20, half-light radius (Re), and r-band apparent magnitude (mr) between the two initial clusters. The blue shading represents the featured group while the orange shading is for the less featured group.  Table 2. The blue shading shows the distribution of the featured group, while the light orange colour represents the less featured group. galaxy structure in this study, while Martin et al. (2019) used coloured images. Additionally, to have more available measurements of galaxy structure and properties, we choose to use the imaging data from the Sloan Digital Sky Survey (SDSS; York et al. 2000;Abazajian et al. 2009) which has a worse resolution and image sampling (0.396 arcsec per pixel) than the one used in Martin et al. (2019, 0.168 arcsec per pixel). This may be a reason for the resulting fewer number of clusters obtained in our work. To further investi-gate galaxy morphology classifications, the colour information and images with better resolutions will be considered in future work.
Examples of images from each of the 27 clusters are shown in Fig. 8. The number shown on the bottom left is the average value of the T-Type in the clusters and the identification number of the cluster is shown on the top right. The identification numbers of groups are generated on the merger tree from left to right; therefore, they are simply labels without physical interpretation. Table 3 lists the characteristics of each cluster in structural measurements, galaxy properties, and statistics. This can be used to co-examine the figures shown from this section to Section 4.4. Through visual assessment in Fig. 8, we observe that galaxies in some clusters show bars (e.g., g15 and g16 in Fig. 8) or show more elongated in shape than in others.
In Fig. 9, we re-examine the influence of the major structural parameters such as the Sérsic index, asymmetry, Gini coefficient, and M20 (Section 4.1), in separating clusters. Each coloured circle represents one cluster and is coloured by the average value of the T-Type in the cluster. We confirm again a clear correlation between our machine classification clusters and major structural features. Additionally, the given clusters show a transition along with the T-Type. This suggests the clusters are correlated with the visual morphology roughly from early-types to late-types.

Machine Classifications versus Human Visual Classifications
It is important to note that the goal of this work is not to find a perfect agreement between our machine-based classifi-  Additionally, the statistics of each cluster are presented in the last four columns where Ng shows the number of galaxies in the cluster and Fg indicates the percentage of total samples. The Dg lists the dominated types in each cluster, which are selected based on the dominance of each morphology type, and F g,D shows the fraction of the dominated types in a cluster. The F g,bar is the fraction of barred galaxies in a cluster. Finally, D g,bar and D g,nobar is the dominance of barred galaxies and non-barred galaxies in a cluster, respectively. The ordering follows the group IDs which are simply labels for convenience.
cation and the visual morphologies. Our goals are to understand the features used by our method to categorise galaxy images, and to introduce a novel classification scheme 'proposed' by our machine. That is, we want to develop a scheme whereby galaxies are classified by a reproducible and scientific computational way and not by human opinion.
To better understand our machine-based classes, we compare them with visual morphological classes such as the Hubble sequence, and discuss the visual features extracted by our machine. To do this comparison, we associate each cluster with one or a mix of Hubble types based on the dominance of each type within each of the clusters (Fig. 10). As mentioned in Section 4.1, the 'dominance' of each type is the ratio between the fraction of a given morphology type in the cluster to the fraction in the dataset. We associate a given cluster with one or several morphology types when the dominance of a certain type is > 1. This selection in- Figure 8. Examples of images from each cluster listed in the order of the average value of the T-Type within that cluster ( Table 2). The number shown at the left bottom corner is the average value of the T-Type in the cluster. At the right top corner, the identification number of the belonging cluster for the image is presented.
dicates which kinds of visual features considered in a visual morphology type are dominated in a cluster.
In Fig. 10, we show the accumulated distribution of the classification clusters to one or a mix of visual morphology types. Each coloured bar represents one cluster and the deeper bluer colours indicate more barred galaxies than nonbarred galaxies within that given cluster. In Fig. 10, the darkest blue represents a cluster with the strong bar dominance, D g,bar ≥ 1 and the non-bar dominance, D g,nobar < 1 (see the last column in Table 3; e.g., g16 in the table). The medium blue is for a cluster with both bar and non-bar dominance ≥ 1 (weak bar dominance; e.g., g27 in Table 3). This criterion indicates that the features of a barred galaxy are not distinctive in a cluster. The lightest blue is used when the bar dominance is D g,bar < 1 (no/less dominance; e.g., g14 and g19 in Table 3). Through the highlight of the bar dominance in clusters in Fig. 10, our machine is shown to successfully discriminate between barred and non-barred galaxies. Examples of clusters with different bar dominance are shown in Fig. 11.
We observe in Fig. 10 that no cluster is dominated by either elliptical galaxies or early spirals only. The features of elliptical galaxies are recognised to have a great similarity to some lenticular galaxies by our machine. Visually, we separate ellipticals and lenticulars mainly based on the disk structure. However, compared to the cluster dominated by only lenticulars (the g25 in Table 3) in Fig. 12, the galaxies in the two clusters dominated by E/S0 (g22; g23) lack significant disk structure, whereas 'g22' represents the 22th cluster, and so on (also see Fig. 8 and Table 3). However, clusters with more disky galaxies, such as g27 (blue solid line in Fig. 12), are dominated by a mix of S0 and eSp. This is likely an indication for an uncertainty in distinguishing ellipticals, lenticulars, and early spirals in the visual classi- Figure 9. The comparison of the major structural features such as the Gini coefficient, M20, Sérsic index and the asymmetry as a function of each cluster from Section 4.1. Each circle represents one classification cluster from our unsupervised machine learning process which is coloured by the average T-Type in the cluster. The average value of the data in the clusters are used for each structural feature value. Clearly strong trends can be seen that varies along the clusters.
fication system we use and not a defect of our unsupervised learning. Only the lenticulars with a moderate range of Sérsic index (peaks at ∼ 3; yellow solid line in Fig. 12) can be separated from other morphology types.
Additionally, as stated in Section 4.1, early spirals are difficult to categorised into either ETGs or LTGs, and as such it is difficult to have a distinctive cluster dominated by only this morphology type (Fig. 10) due to the broad transitional features in this type. This again indicates the intrinsic difficulty of visually separating early spirals from other morphology types, such as lenticulars and late spirals.
Most of our clusters have a mixture of different Hubble types within them which indicates galaxies with similar features in appearance can be visually classifying into a variety of morphology types (see examples in Fig. 13). In other words, a mix of galaxy structure in fact exists in a visually defined morphology type. This result reveals an intrinsic vagueness of the visual classification systems such that they are not always accurately defined, with many galaxies not optimally classified as a certain T-Type due to the diversity of properties beyond a guessed at morphology.
One exception from the above discussion is our cluster 21 (g21 in Table 3 with a mix of four morphology types: S0, eSp, lSp, Irr). This cluster is shown to have galaxies with bright companions which overwhelms the brightness of the central objects (the 'g21' row shown in Fig. 13). After the feature selection and normalisation in Section 3.2, the central objects might become negligible to the machine learning compared to the companions. This can result in difficulty for our machine to capture the structure of the central objects as well as group these galaxies correctly. On the other hand, galaxies with companions are more likely to experience galaxy mergers, and thus this cluster can be used as an indication to find potential merger events or compact groups of galaxies.

Machine Classifications versus Physical Properties
In previous sections, we show that our machine learning classifications trained with monochromatic images are categorised based on structural features (Section 4.2) and visual features (Section 4.3). In this section, we use the machine classification scheme to study the correlation of galaxy physical properties and galaxy morphology using the colourmagnitude diagram and the mass-size relation of galaxies. In Fig. 14, we examine our the machine classification clusters plotted on the colour-magnitude plane (left) and the mass-size plane (right). The colours and physical sizes are again taken from Simard et al. (2011) while the stellar mass originates from Mendel et al. (2014). Each circle represents one cluster, coloured by the average value of the stellar mass of the galaxies in the cluster for the colour-magnitude diagram and by the average value of colour for the masssize relations. These two plots show that each galaxy cluster as defined by the machine has distinctive physical properties in galaxy colour, absolute magnitude, stellar mass, and physical size. Additionally, our machine classes show a clear transition between galaxy morphology and galaxy properties on both the colour-magnitude diagram and the mass-size relations. Each star shows the average value of the data with a certain visual morphology type (written in black) for comparison. The machine-defined morphology types fill in the gap within the correlation of galaxy morphology and galaxy properties along with the Hubble types. This indicates that the machine classification scheme can complete the missing morphologies in the visual classification systems without involving human potential bias. It will be interesting to investigate the correlation of these machine-defined classifications with galaxy environment and other galaxy properties, but this will be left to study in a future paper.
Additionally, we notice on the mass-size diagram (right in Fig. 14) that the five orange clusters above the eSp star- Figure 10. The accumulated distribution of the classification clusters compared with Hubble sequence morphological types. The x-axis shows one or a mix of visual morphology types which dominates the clusters listed in Table 3. All 27 clusters are plotted here, and each coloured bar represents one cluster. The different colours of the bars show different dominance levels of barred galaxies in the cluster, such that from deep to light blue represent more barred galaxies to no/fewer barred galaxies in the cluster.
label are dominated by barred galaxies, in particular, the top cluster with the largest average size has ∼ 80% barred galaxies in the cluster (g16 in Table 3). Galaxies in this cluster have larger sizes, larger stellar masses, and are redder in colour than other clusters with a mix of typical spiral galaxies.

Dataset with a realistic distribution
To test the capability of our method on a realistic data distribution, we apply our method to the imbalanced dataset ( Fig. 3) which follows the distribution of intrinsic morphology for nearby galaxies (Oh et al. 2013, Section 3.1). In this section, we examine the performance using this dataset for: (1) binary classification (Section 4.5.1) and (2) multiple classification clusters (Section 4.5.2) using the imbalanced dataset, and compare the results with the one using the balanced dataset.

Unsupervised binary classification
Similar to Section 4.1 for the balanced dataset, we merge the imbalanced dataset into two preliminary clusters (Example of galaxies in each is shown in Fig. 15). Although the imbalanced data has a significantly different distribution in galaxy types from the balanced dataset, our machine obtains two preliminary clusters with similar features to the two clusters provided using the balanced dataset (Fig. 4). As before, one cluster is dominated by galaxies with many distinct features while the other has galaxies with significantly fewer features. Fig. 16 shows the morphological fractions of different types (left column) and the dominance of each morphology type in each cluster (right column). The dominance is, again, the ratio between the morphological fraction in the cluster to the fraction in the dataset. This quantity removes the impact of the imbalanced numbers between each type, and indicates the visual features emphasised in a cluster. The two clusters are clearly dominated by LTGs and ETGs, respectively. Additionally, the dominance distribution of the imbalanced dataset is completely consistent with that of the balanced dataset (Fig. 5). This confirms that no matter which data distribution is used, our machine is capable of separating the two clusters based on the specific features existing in the corresponding morphology types.
Additionally, applying our method to the imbalanced dataset we get an initial accuracy of ∼0.87 in separating ETGs from LTGs. The accuracy is again defined as the number of correct matches from the total samples. The reason for a higher accuracy compared with the balanced dataset is due to a lower fraction of early spirals in the imbalanced dataset (∼ 8%) than the balanced dataset (∼ 25%). When we exclude the early spirals from the imbalanced dataset, the accuracy barely changes, and it is consistent with the accuracy obtained when using the balanced dataset (accuracy: ∼0.87; Section 4.1). These results show the ability of our method to achieve reliable binary morphological classification for large surveys with unknown morphological mixes.

Multiple classification clusters
Following Section 3.4, and using the imbalanced dataset, we obtain the same number of clusters, 27, as when we used the balanced dataset through our method of determining the number of clusters (Section 4.2). The clustering results for both datasets are very close to each other, with only very minor differences. For example, 7 clusters are separated under the less featured group using the balanced dataset while 8 clusters are obtained using the imbalanced dataset. Conversely, we obtain 20 clusters for the featured group using the balanced dataset, and 19 using the imbalanced dataset.  In Fig. 17, we associate the classification clusters for the imbalanced, realistic, data set with Hubble types based on the dominance of each type. We find no clean clusters for ellipticals (E), lenticulars (S0), early spirals (eSp), irregulars (Irr) when using the imbalanced dataset. The lack of clusters for E and eSp is due to the same reasons for the balanced dataset discussed in Section 4.2: these two visual morphologies are intrinsically difficult to separated from other morphology types. Additionally, in Section 4.2, we conclude that to get a clean S0 cluster, galaxies have to show a moderate disk structure (Fig. 12). However, there is not a sufficient number of lenticulars with the relevant features due to the low fraction of this type in the imbalanced dataset (Fig. 3). It is impossible for the machine to classify a galaxy that does not exist in some abundence within the dataset; therefore, we miss the pure S0 cluster when using the imbalanced dataset. On the other hand, irregular galaxies do not have a specific structure; therefore, it is easy to be confused them with some late spirals with less structured appearances by our machine, based on only galaxy structure and without the prior knowledge of 'late sprials' or 'irregulars'. They also suffer from the similar cause of the missing S0 cluster: the insufficient number of irregular galaxies in our imbalanced set decreases the possibility of the distinctive irregulars to be picked out by our machine.
Similar to the results of the balanced dataset, the separation between clusters might 'improve' in terms of being closer to a more physical classification when we consider colour information in our machine. Therefore, this will be an important part in future work.

CONCLUSION
In this paper, we present an unsupervised machine learning technique by applying a combination of a feature extractor -a vector-quantised variational autoencoder (VQ-VAE) and a hierarchical clustering algorithm (HC). This method involves a vector quantisation process which provides a rate of classification with a feature extractor in the learning phase at least 30 times faster than a typical convolutional antoencoder used in Cheng et al. (2020) on the same device.  Figure 13. Examples of images of galaxies from clusters with a mix of many visual morphology types. Each row shows five randomly picked examples within the cluster, where 'g22' represents the 22th cluster, and so on. The morphology information is shown below each image.
To sensibly explore galaxy morphologies and investigate the suggestive number of galaxy morphological classes, we propose some novel modifications to the machine learning algorithms used in this work (Section 2). First, we include a preliminary clustering result in the VQ-VAE architecture during the feature learning process. This helps to extract features that can not only reproduce the input images but also be well separated into two preliminary clusters in feature space. Second, different distance thresholds are used within each branch in the merger tree in the HC process rather than m featured group less featured group Figure 15. Examples of galaxies within the two preliminary clusters using the imbalanced dataset. Galaxies in one cluster are with more features (left), and galaxies in the other group are with relatively fewer features (right).
a single distance threshold for a whole tree. This flexibility prevents the creation of unnecessary clusters separating galaxies with few features, while allowing more clusters for galaxies that show larger variation. Another innovation is to use galaxy orientation (a potential problem when classifying galaxies) to our advantage, helping to decide the number of clusters (Section 3.4).
Using the monochromatic images from the Sloan Digital Sky Survey (SDSS), we first explore galaxy classifications using a dataset with a balanced number of galaxies in each morphological class (Section 3.1). This is done to reduce potential biases associated with number imbalances. Using this method we obtain 27 clusters within this balanced dataset. We find that our method separates the classification clusters based on galaxy shape and structure (e.g., Sérsic index, asymmetry, Gini coefficient, M20). We then associate our classification clusters with the Hubble sequence based on the dominance of each type in a given cluster (Section 4.2). Clusters with barred, weak-barred, and non-barred galaxies are well distinguished by our machine. However, when using the balanced dataset, no clean clusters are found for ellipticals or early spirals (Fig. 10). Additionally, most clusters are associated with a mixture of Hubble types. We thus conclude that there is a fundamental difficulty in separating accurately galaxies with transitional features such as lenticular galaxies and early spirals with a machine. This applies both to visual and machine classifications.
In addition, we find that each machine classification cluster has characteristic galaxy properties (e.g., colours, masses, luminosities, sizes) that transition smoothly along the Hubble sequence.
Overall, the machine classification clusters provide a reasonable and detailed scheme for galaxy morphological classification based on a combination of multiple structural parameters, avoiding human errors and biases. The dominated features in our classification clusters can be used as the foundation of an objective alternative to the Hubble sequence. Since our system separates well galaxies with different shape, structure, and physical properties, it may prove useful in generic galaxy formation and evolution studies. The system may be improved by including multi-colour imaging and velocity maps. It will also be interesting to apply our technique to higher redshift galaxies to see how the classification cluster change.
To test the performance of our method with realistic morphological distributions, we also apply it to an imbalanced dataset which follows the morphological distribution of nearby galaxies. The results are very similar to the ones obtained with the balanced dataset, showing that our system is able to deal with large galaxy samples with more realistic morphological mixes. It also shows that our set up is not sensitive to different distributions of input galaxy morphologies, but can handle a range of distributions of various galaxy input 'types'.
In the future, we intend to apply the techniques developed here to multi-colour images with better resolution such as the data from the Dark Energy Survey and the Euclid Space Telescope. Velocity maps from integral-field spectroscopic surveys could also be included. The resulting classification system(s) should prove very useful to better understand galaxy properties, their formation and evolution. We also expect that the future development of this work will result in a fundamental change in how we approach galaxy morphological classification -both visually and when using machine learning. Figure 17. The accumulated distribution of the classification clusters compared with the Hubble sequence for the imbalanced dataset. The x-axis shows one or a mix of visual morphology types. Each coloured bar represents one cluster. Different colours are different dominance of barred galaxies in the cluster, such that from deep to light blue represent more barred galaxies to no/fewer barred galaxies in the cluster.