Detection of sputum by interpreting the time-frequency distribution of respiratory sound signal using image processing techniques

Abstract Motivation Sputum in the trachea is hard to expectorate and detect directly for the patients who are unconscious, especially those in Intensive Care Unit. Medical staff should always check the condition of sputum in the trachea. This is time-consuming and the necessary skills are difficult to acquire. Currently, there are few automatic approaches to serve as alternatives to this manual approach. Results We develop an automatic approach to diagnose the condition of the sputum. Our approach utilizes a system involving a medical device and quantitative analytic methods. In this approach, the time-frequency distribution of respiratory sound signals, determined from the spectrum, is treated as an image. The sputum detection is performed by interpreting the patterns in the image through the procedure of preprocessing and feature extraction. In this study, 272 respiratory sound samples (145 sputum sound and 127 non-sputum sound samples) are collected from 12 patients. We apply the method of leave-one out cross-validation to the 12 patients to assess the performance of our approach. That is, out of the 12 patients, 11 are randomly selected and their sound samples are used to predict the sound samples in the remaining one patient. The results show that our automatic approach can classify the sputum condition at an accuracy rate of 83.5%. Availability and implementation The matlab codes and examples of datasets explored in this work are available at Bioinformatics online. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Sputum is produced by the trachea when the trachea is stimulated. Usually when there is sputum in the trachea, the sputum will be ejected through coughing. However, people who are unconscious, especially patients in Intensive Care Unit (ICU) whose breathing is assisted by a ventilator, cannot eject the sputum by themselves. They need medical staff to check the condition of the sputum and remove the sputum using a suction catheter. If the sputum is not timely removed, it can lead to hypo ventilation, dioxide retention and even pulmonary infection. For this nursing work of clearing the sputum in the trachea, the most important is to determine the sputum condition. The traditional method for this determination involves using lung sounds auscultation to get lung sounds from different parts of the lung. Experienced physicians can determine whether there is sputum in the trachea through distinguishing the sputum sound from mixture sounds. However, this job is time-consuming and requires a skill that is difficult to acquire (Jones et al., 2000;Ranson et al.,2014;Sarkar et al., 2015;Yao et al., 2014). Therefore, methods and devices that can accurately detect the condition of the sputum should be investigated. So far, there has been research that focuses on abnormal sound. For example, Habukawa et al. (2013) used breath sounds to evaluate the control level of asthma. Pinho et al. (2015) adopted fractal dimension and box filtering methods to detect the crackle sounds. Bahoura (2009) used pattern recognition methods to classify normal and wheeze sounds. Abbasi et al. (2013) applied Neural Network and Support Vector Machines to classify normal and abnormal sounds. However, few studies involve sputum detection (Shang, 2011;Oliveira et al., 2013;Paratz, et al., 2014). Usually, electronic stethoscopes were used for data acquisition (Riella et al., 2009;Zolnoori et al., 2012). Some research also used two or more sound sensors to detect the signal. Azarbarzin et al. (2011) and Waitman et al.(2000) placed two microphones in different parts of the breast to record data. Charleston et al. (2011) placed 25 sound sensors on the back of the body to obtain the sound signals. All of the devices used in these studies require physical contact with the bodies of the patients. But this extended physical contact can lead to discomfort. For sputum detection, it is necessary to monitor sputum condition in real time. In this paper, we develop a new automatic approach that can be used to measure respiratory sounds of trachea in real time and analyze those sounds in order to detect the condition of the sputum.
Compared with other respiratory sound measurement devices, the sound sensor of our device is embedded in the ventilation tube near the mouth. This helps prevent the noise from the environment from affecting our measurements when we measure the respiratory sound. Currently, most research just focuses on the waveform of a signal and its mathematical procedures. These methods differ from the image recognition that can give better visualization of sputum conditions. Consequently, it is necessary to prepare visual-based evidence of auscultation sound from a practical viewpoint. In this paper, we propose a method for sputum detection by interpreting the time-frequency distribution of respiratory sound signals using image processing techniques. The time-frequency distribution of respiratory sound signals, determined from the spectrum, is treated as an image. After preprocessing, a gray level co-occurrence matrix is used to extract the texture features. Through extracting sputumspecific features, we can then use machine learning to diagnose the sputum condition.
The aim of our research is to develop an approach that can automatically detect the sputum in real-time and inform the medical staff when sputa exist in the respiratory tract of a patient. Indeed, for an in-house experiment with 272 respiratory sound samples (including 145 sputum and 127 non-sputum), our automatic approach can correctly diagnose the sputum at an accuracy of 83.5%. Moreover, this method can be used practically in clinical environments.

Materials and methods
A device is used to measure the respiratory sound from the trachea. After obtaining the sound data, the method of segmentation is taken to divide them into several segments, each of which includes one respiratory cycle. Each segment consists of several frames. We employ texture of the time-frequency distribution spectrum as features for each sound dataset. Using these features, the segments are classified into sputum sound and non-sputum sound. In the feature extraction step, the Gray Level Co-occurrence Matrices (GLCM) are used. The classifier is then set up for classification. The overview of our proposed method is shown in Figure 1.

Data acquisition device
In this study, a new device was developed to measure the respiratory sound signals from the trachea. As shown in Figure 2, this device consists of six major parts: a sound sensor, an audio card, an airway connector, a waterproof connector, a signal amplifier and a power supply. The sound sensor is embedded in the respiratory tube. This way, the noise of environment can be lessened which can increase  the signal-to-noise ratio. The amplifier is used to amplify and denoise the sound signal. The waterproof connector is used to keep the head of the sound sensor desiccated. The audio card is used to connect with the computer and transform the analog signal to a digital one so that the sound can be recorded by the computer with a sampling rate of 44 100 HZ.

Data analysis
In order to use respiratory sound signals to distinguish between sputum and non-sputum, we need to obtain the features inherent in the signals and use these features for classification. The process for data analysis includes segmentation, feature extraction, feature selection, model training and model testing (Fig. 3). Respiratory sound acquisition and feature extraction are performed using code from MatLab (See the Supplementary material for the MatLab codes). Feature selection and classification are performed with the WEKA machine learning tool available at http://www.cs.waikato.ac.nz/ml/weka/.

Segmentation of input lung sound
An autocorrelation method is carried out in the segmentation step (Jalil et al., 2013). Each recording signal will be separated into many frames by a window function. In this paper, a Hamming window is used. The autocorrelation R i x ð Þ of the i th frame of recording sound signal can be calculated as follows.
Þis the i th (i ¼ 1,. . ., M) frame of the signal. M is the total number of frames. L is the length of the data frame and k is the time shift used to compute the autocorrelation. In the experiment, the values of L and k are 1024 and 400 respectively.
Two thresholds, C 1 and C 2 , are calculated to segment the sound data. Based on the property of respiratory sound, the C 1 and C 2 are calculated by the formula C 1 ¼ aC 0 and C 2 ¼ bC 0 . The values of a and b should be adjusted based on the measurement environment. The C 0 is the first silent frames of the recording sound signal. It is assumed that the first 0.025 s of recording sound signal are silent frames, and the average of their maximum of autocorrelated functions C 0 is calculated. Because the autocorrelation of noise is much lower than the respiratory sound, the maximum of autocorrelation is lower than C 1 . When the maximum of autocorrelation (maxR i k ð Þ) is greater than C 2 , it can be used to judge whether the frames of sound belong to the respiratory cycle. And when the maximum value of autocorrelation is greater or less than C 1 , it can be used to judge the start or end of respiratory cycle. In this paper, to prevent the short stop between inspiration and expiration from affecting the segmentation, we introduce the max silence time T s whose value is based on the breathing frequency. The section with low autocorrelation whose lasting time is lower than T s will still be treated as a portion of a respiratory cycle.

Feature extraction
From the image of the time-frequency distribution of respiratory sound signals as described in Figure 4, the difference between the time-frequency distribution of sputum and non-sputum sound signals is displayed on the texture of the image. In Figure 4a, some vertical textures can be seen in the red rectangle. However, in Figure  4b, there are few vertical textures. We can use these textures for classification and sputum detection. That is, the time-frequency distribution of respiratory sound signals, determined from the spectrum, is treated as an image, and the sputum detection is performed by interpreting the patterns in the image. Short-time Fourier transform (STFT) is used to get the time-frequency distribution of sound signal (Wang et al., 1993;Yu et al., 2016) and gray level cooccurrence matrices (GLCM) is used to extract the texture parameters from the image of the time-frequency distribution.
STFT of discrete time signal x n ð Þ can be calculated as: where m is the number of frames in advance of the signal, N is the length of the frame in advance of the signal and the value of N is 1024. w n ð Þ is the window function. In this study, Blackman-Harris window is chosen. Computing the result of this equation with a digital signal and specific window function yields a matrix of complex numbers. It can be expressed as followed: To get a digital spectrogram from this, the magnitude of each number in the STFT matrix is computed and squared.
where E n; k ð Þ is the power spectrum and its value can present the image gray level, n is treated as the horizontal axis and k is treated as the vertical axis. The DB spectrum image is then calculated by using a transformation of 10 log 10 E n; k ð Þ ð Þ . Using the method above, we determine the time-frequency spectrum as shown in Figure 4.
After getting the image of time-frequency distribution, the GLCM is used to quantitatively evaluate textural parameters (Mohanaiah et al., 2013;Soh et al., 1999;Zucker et al.,1980). The GLCM is a matrix where the number of rows and columns is equal to the number of gray level G in the image. The matrix element P i; jjd; w ð Þis the relative frequency of two pixels which are separated by a pixel distance d ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Dx 2 þ Dy 2 p in the direction of w. The element P i; jjd; w ð Þcontains the second order statistical probability and can be written as Due to their large dimensionality, the GLCM is very sensitive to the size of the texture samples on which they are estimated. Thus, the number of gray levels is often reduced. Prior to matrix calculation, the input gray level of image was reduced to 16 levels while maintaining the histogram shape. Four important texture features, angular second moment (energy), inertia moment, correlation and entropy, were selected. Angular Second Moment (Energy) Based on the data we collected and similar respiratory sound analysis (Nogata et al., 2015), the most texture concentrates on four directions (w ¼ 0 , 45 , 90 , 135 ). Therefore, each feature measure is obtained for 4 angles (w ¼ 0 , 45 , 90 , 135 ). We get 4 Â 4 ¼ 16 attributes.

Classification method
Before the classification step, relevant and descriptive features should be selected. A good set of this feature is the one that contains features highly correlated with class. In other words, if the feature is correlated with the class, it will be useful in classification (Hall, 1999). The Pearson correlation coefficient (PCC) gives an indication of the strength of the linear relationship between features and result of classification. Based on the value of PCC, the useful feature can be selected (Zhou et al., 2016). Thus, to find the optimal feature set, a feature selection method based on PCC is used in this paper. The task of classifying is deciding class membership y' of an unknown item x' based on a dataset D ¼ x 1 ; y 1 ð Þ; Á Á Á x n ; y n ð Þ for item x i with known class memberships y i . In this study, the type of classification is a dichotomous classification. The class labels are either sputum sound or non-sputum sound. y i ¼ 0 represents nonsputum sound and y i ¼ 1 represents sputum sound. x i 's are usually d-dimensional vectors which are attributes of the signal. Currently, logistic regression is a popular method to model binary data and is often used in biomedical data processing (Press et al., 1978;Subasi and Erc¸elebi, 2005). Therefore, logistic regression is used to classify the sound data. To avoid unstable parameter estimation when the number of covariates is relatively large or when the covariates are highly correlated, the logistic regression model with a ridge estimator is applied (Lee et al., 1988). Fig. 4. Time-frequency spectrum of signal. Image (a) and image (b) represent sputum sound signal and non-sputum signal respectively. Both images include three parts, the first part is the original wave signal of respiratory sound. After STFT, the time-frequency distribution of signal is represented by the second part. Then through segmentation, the third part that represents the time-frequency distribution of one respiratory cycle is determined. The texture features are extracted from the image of the time-frequency distribution Suppose that there are n observations X i ; Y i ð Þ , where Y is defined as Y i ¼ 1 for a sputum sound signal and Y i ¼ 0 otherwise, X i are d-dimensional feature row vectors of the signal. Then the probability that Y i ¼1, given the value of X i ¼ X i1 ; Á Á ÁX i16 ð Þ , is denoted by p X i ð Þ and is modeled with the standard logistic regression model.
, in this paper, h is a 17-dimensional vector of parameters which can be regarded as weight of each feature parameter. The probability of Y i is determined by the value of h. The vector h is chosen so that it maximizes p X i ð Þ. Based on the p X i ð Þ, the classification result can be acquired. There is no constant term involved in this regression problem. The log-likelihood l of data (X, Y) under this model is When there is correlation between the various explanatory variables, it will lead to unstable parameters estimates. If h can be shrunk towards 0 and allow a little bias, it can stabilize the system and provide more appropriate estimates. Therefore the log-likelihood l h ð Þ can be rewritten as follows: where l h ð Þ is the unrestricted log-likelihood function and khk ¼ ð P h 2 j Þ 1=2 is the norm of the parameter h.k is the ridge parameter that is used to shrink the norm of h. When k ¼ 0, the problem will become maximum likelihood estimation. With a small value of k, a little bias will be introduced to the system, which will stabilize the system and lower the variance of the estimation. The value of b h k can be obtained when the equation reaches its maximum. This way, the estimate b h k will be on average closer to the real value of h. The b h k can be obtained by the Quasi-Newton method (Richard et al.,1989). The first derivative of l k h ð Þ is The first step is to construct an objective function. This step is analogous to Taylor's transformation.
is an approximation to the Hessian matrix which is a positive definite matrix. The gradient of equation (14) is To get the optimal solution, B is chosen to satisfy the Equation (14) and the gradient is set to zero. Define s ¼ s k , B ¼ B k , Therefore Þ should satisfy the Armijo-Goldstein rule condition that is used to ensure a sufficient descent (Renato et al.,1984).
where s k is the searching direction to ensure that the iteration is working. Equation (16) can be identified as When B k satisfies Equation (16), it can be used to update b h k k to be b h k kþ1 . In this paper, the DFP method (Richard et al., 1989) is used to The iteration formula of B k can be written as follows: With the iteration, the estimate b h k can be calculated. In this paper, B 0 and k are set as B 0 ¼ I, k ¼ 1:0 Â 10 À8 , respectively. Figure 5 shows the iteration steps for calculating p X i ð Þ in formula (10) and the class of classification can be determined based on the value of p X i ð Þ.

Results
We conducted an experiment to test our proposed approach. In this experiment, 272 respiratory sound samples were collected from 12 intubated patients (with 145 being sputum sounds and 127 being non-sputum sounds) in the ICU of Chaoyang Hospital. The age of the patients ranged from 60 to 85 years old. Because certain lung diseases and other diseases (such as pneumonia, respiratory failure, cerebellum infarctions, etc.) may lead to the lack of ability in breathing by themselves, they were intubated to help them breath. The measurement environment is shown in Figure 6. Before the experiment, the waterproof connector and other equipment was sterilized with autoclave. Based on the result of the suction and the judgement of experienced nurses, the presence of sputum was determined and categorized as either sputum or non-sputum. When there was secretion after suction, the sounds recorded in the 1-2 min before suction began will be treated as the sputum sound. The sound in 1-2 min recorded after suction was completed was treated as non-sputum sound. The accuracy in detection of the presence of sputum was then evaluated.

Training data and test data
The data before the suction was defined as sputum sound and those after suction was regarded as non-sputum sound. The respiratory sound consists of one or two minute waveforms. All the data was automatically segmented by the autocorrelation method. Through the simulation using Matlab, as shown in Figure 7, the red solid line is the start of a respiratory sound cycle and the red dashed line represents the end of a respiratory sound cycle.
To evaluate the performance of classifiers, we set the training and test sets using the cross-validation strategy applied to the patients as follows. Among the 12 patients, we select one patient and all the sound samples from this patient to form a test set. The sound samples from the remaining 11 patients form the training set. For each classifier, we use the training dataset to predict the test set and obtain a corresponding discrimination rate. We then apply this process to each of the 12 patients. The overall performance is the average of the 12 values of discrimination rate for each classifier. A total of 272 sound samples (145 sputum sounds and 127 non-sputum sounds) from 12 patients were tested in this cross-validation process. For each classifier, we look at the highest discrimination rate and its corresponding number of features.
In this paper, to minimize the effect of the correlations among samples from the same patients on the classification, two methods are taken. One is to use the data that comes from the different measurement time (each patient will be measured 2-4 times.). Because the sputum is related to the patient situation that is easily affected by the treatment performance (Burgel et al., 2009), it is always changing. The generation of sputum sound is more like a random procedure. The correlation of data comes from different times are treated as uncorrelated. The other is to define the training data and test data from the different patients so that the test set comes from a patient that is independent from the patients in the training set.

Analytic result
Our proposed method can be evaluated by discrimination rate: how many segments were correctly classified and how large the sensitivity and specificity of the diagnosis were. The sensitivity is the proportion of actual positives that are correctly identified whereas specificity is the proportion of negatives that are correctly identified.
where TP, TN, FP and FN denote true-positive, true-negative, falsepositive and false-negative, respectively. In order to compare the proposed method with the other classification method, Bayesnet, Naïve Bayes, K-nearest neighbors (KNN), RandomForest and Reptree were also tested.
To test the reliability of the segmentation method described in 2.2.1, we apply the method to 239 segment samples (including 111 sputum and 128 non-sputum segments) all with known positions. The accuracy of segmentation is 98.7% overall, 98.2% for the sputum segments and 99.2% for the non-sputum segments, which indicates that the segmentation method is reliable.
As described in Section 2.2.3, PCC was used to rank an attribute. The results are shown in Table 1. The highest value of PCC was energy in a 45-degree direction and the lowest value was entropy in a 45-degree direction (Table 1).  In order to compare the logistic model with other commonly used existing classification methods, Bayesnet, Naïve Bayes, KNN, RandomForest and Reptree were also tested. Following the rank order in Table 1, different numbers of features were chosen to conduct discrimination using the cross-validation process described in Section 3.1 for each classifier. The results are shown in Figure 8, which indicate that the highest discrimination rate is achieved using the logistic regression model with all the 16 attributes.
The highest discrimination rate and its corresponding number of attributes for each classifier are shown in Table 2, which leads to two discoveries. First, the three classifiers with the single highest discrimination rates, i.e. Logistic, KNN and Random forest, all achieved their highest discrimination rates when all the 16 attributes were included in the model, indicating that all the 16 attributes make contribution to classification and should be included in the classification analysis. Second, among the 6 tested classifiers, the logistic model achieves the highest discrimination rate of 83.5%. Based on these discoveries, we adopted the logistic model for discriminating sputum sounds in our automatic sputum detection procedure. The sensitivity and specificity of this adopted classifier is further shown in Table 3, which indicates that the sensitivity and specificity are 82.1 and 85.0%, respectively.

Discussion
Several techniques for sound feature extraction in either the time, frequency or complex domains have been developed (Azarbarzin et al., 2009;Charleston et al.,2011;Riella et al., 2009;Waitman et al., 2000;Zolnoori et al., 2012). However, most of them (Azarbarzin et al., 2009;Charleston et al.,2011;Waitman et al., 2000;Zolnoori et al., 2012) have focused on the extraction of the information contained in only a narrow band of the signal spectrum, centering on the fundamental or one of the harmonics of the respiratory sound signal frequency. In this paper, the method of features extraction in the joint time-frequency domain was proposed. The spectrum gives a one to one mapping between each signal component. There are no interference terms. The pattern can be displayed in the distribution and correctly reflect the local energy distribution over the time and frequency domains. Based on the GLCM method for texture feature extraction, the wave signal recognition was transformed into a visual-based recognition. This constitutes a novel wide band analysis technique. In addition, to reduce heavy background noise, the sound sensor was embedded in a thick tube which was used to connect the trachea with the airway of the ventilator.
There are five steps in the process of classification. They are respiratory data acquisition, auto-segmentation, feature extraction, feature selection and classification. Because the sound sensor was embedded into respiratory airway, the thick tube of airway can minimize the background environment noise. Because of this, the signalto-noise ratio can be increased. Auto-segmentation was performed by the maximum of autocorrelation function. Based on the property of respiratory sound, the threshold of C 1 and C 2 are calculated by the formulas C 1 ¼ aC 0 and C 2 ¼ bC 0 respectively. The segmentation method is reliable with an accuracy of over 98%.
In this paper, various classifiers (including the logistic model, Bayesnet, Naïve Bayes, KNN, RandomForest and Reptree) were investigated. Table 2 shows that the highest accuracy was obtained by using the logistic model and it can reach 83.5%. The sputum sound and non-sputum sound were 82.1 and 85.0% respectively ( Table 3). The classification accuracy of non-sputum sound is higher than the sputum sound.
In this study, experiments were conducted using real data recorded at the ICU. Our method achieved a discrimination rate of about 83% which can be accepted by the doctor in ICU. The experimental results show that the proposed detection system is able to effectively detect respiratory sounds associated with sputum for the real environments in the hospital. Our proposed approach is a necessary and novel way to help doctors or nurses judge when to proceed with endotracheal suctioning and reduce the frequency of unnecessary suction. According to American Association for Respiratory Care (AARC) guideline (AARC, 1993), the endotracheal suctioning should be performed at a minimum frequency or when clinically indicated (i.e. when complications due to accumulated secretions are manifested). Since endotracheal suctioning can cause hypoxemia, mechanical trauma, bronchospasm and hemodynamic instability, an accurate assessment of the need for suctioning may decrease the frequency of suctioning complications.
For future work, possible improvements can be introduced by increasing the number of features, implementing recently developed non-linear metrics such as sample entropy (Zhang et al., 2017) and strictly standardized mean difference (Zhang et al., 2011) and selecting more appropriate training samples . In addition, more subjects with various types of sounds will be tested. In   clinical practice, the first question to be asked is whether there exists a sputum condition. Thus, in this paper, we develop an automatic approach to detect whether a sputum condition exists. In addition to answering the question of whether a sputum condition exists, doctors are also interested in predicting the amount, proportion and location of sputum. Hence, there is also a future need to build models capable of detecting that information. Conflict of Interest: none declared.