Artificial Intelligent Diagnosis and Monitoring in Manufacturing


September, 2018 DRAFT

The aim of smart manufacturing is to integrate advanced information techniques into manufacturing processes to produce benefits such as improved production quality and reduced cost 4,5 . Unexpected manufacturing failures can halt production and lead to wasted raw materials or system malfunctions. In recent decades, it has been envisioned that manufacturing data, including vibration, pressure, temperature, and energy data, can be used to support artificial intelligence (AI) algorithms 6 . AI algorithms have the potential to detect the locations of faults or even predict them before they occur; this would allow regular maintenance to be replaced by condition-based or predictive maintenance, which is more effective at reducing unnecessary maintenance while also guaranteeing the reliability of the machinery 7 .
However, conversion of the measured data from manufacturing processes into actionable knowledge about the health status of the equipment has proven challenging 8 .
In the past, the measured signals have often been processed via manual feature extraction 9 to represent the complete signals. The extracted features are then used to train the system with standard classification and regression methods to allow predictions to be made in a case-by-case manner [10][11][12] . Once the features have been extracted, the next step is to translate the fault diagnosis problem into classification or regression form. Common methods used to implement this step include neural networks (NNs) 13 , support vector machines (SVMs) 14 , and adaptive neuro-fuzzy inference systems (ANFISs) 15 . ANNs have been a common choice in applications such as medicine, industry, and power systems since 1997 16 . However, ANNs have featured less in the recent literature because it is hard for an ANN to escape from a local minimum 17 .

The data considered in this work include bearing data 26 and hydraulic system data 27 , airplane girder simulation damage data 28 , broken tool data, and the bearing 28 , tool wear, and gearbox data 29 that were collected via our experiments. All of these data were converted into classification problems. National Aeronautics and Space Administration (NASA) tool wear data 30 , battery data 31 and the Center for Advanced Life Cycle Engineering (CALCE) battery data 32 were converted into regression problems. After simple truncation and filter-based pre-processing, we fed the data into a multilayer CNN model for training and testing.
Rolling bearing fault detection and classification is used here as an illustrative example. Rolling bearings are vital components in many types of rotating machinery, ranging from simple electrical fans to complex machine tools. More than half of machinery defects are related to bearing faults 33 . Typically, a rolling bearing fault can lead to machine shutdown, chain damage, and even human casualties 33 . Bearing vibration fault signals are usually caused by localized defects in three components: the rolling elements, the outer race, and the inner race.
When bearings near the end of their lifetimes, deformation, cracking, and burning of these components may cause spindle deviation and further serious damage to the mechanical system.
A bearing data set provided by the Case Western Reserve University (CWRU) data centre 26 , which is regarded as a benchmark for the bearing fault diagnosis problem, was used to validate the effectiveness of our proposed framework. An experimental platform (illustrated in Figure 1) was used to collect the bearing data. For further evaluation of the classification results, we used the following three assessment metrics to evaluate the classification performance with validation data: (a) precision, (b) recall, and (c) accuracy, defined as

precision = TP / (TP + FP), (1)
recall = TP / (TP + FN), (2)
accuracy = (TP + TN) / (TP + FP + FN + TN), (3)

where TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively. In our four-way and ten-way classifications, we regarded the first class as the positive class and the others as negative classes when computing these metrics. Across the three classification tests, all of the defined assessment metrics achieved results of 100%.
These results demonstrate that, without prior knowledge, measurement data suffice to classify fault types accurately and thereby pinpoint fault locations, which makes the repair process efficient. In addition, because of the high sampling frequency (12 kHz) used and the high efficiency of the proposed CNN model, the fault types can be categorized correctly within 0.5 s; thus the proposed algorithm can localize faults in near-real time.
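As a minimal illustration of Equations (1)-(3), the following Python sketch computes the three metrics with the first class treated as positive and all other classes as negative. The counts and labels here are hypothetical examples, not the CWRU results reported above.

```python
def one_vs_rest_counts(y_true, y_pred, positive_class=0):
    """Count TP, FP, FN, TN, treating `positive_class` as positive
    and all other classes as negative (one-vs-rest)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive_class and p != positive_class)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive_class and p != positive_class)
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                   # Equation (1)
    recall = tp / (tp + fn)                      # Equation (2)
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # Equation (3)
    return precision, recall, accuracy

# Hypothetical 4-way predictions (class 0 is the positive class).
y_true = [0, 0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 1, 2, 3, 0, 0]
print(metrics(*one_vs_rest_counts(y_true, y_pred)))
```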
The proposed framework can also be used for a wide range of other applications with strong metrics, including high accuracy, precision, and recall (summarized in Figure 2).

2. Hydraulic system condition classification 27 : hydraulic system condition monitoring is a classification task. We chose a CNN as the base model to make predictions for the different conditions.

Datasets
The datasets used in this manuscript are of the following types: openly accessible data; competition data; experimental data collected in our lab; and real production data provided by industrial partners with permission. These datasets comprise sensory current signals, force signals, vibration signals, acoustic emission signals, or combinations thereof, which are processed for the classification or regression tasks.

Main idea
We convert practical problems into supervised classification and regression tasks and solve them using deep learning techniques. An end-to-end algorithm is proposed to automatically discover the hidden features needed for learning and prediction without prior knowledge.
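One way this conversion can be pictured is by segmenting raw measurement streams into fixed-length labelled windows. The sketch below is an illustrative example only; the window length, stride, and labels are assumptions for demonstration, not the settings used in our experiments.

```python
import numpy as np

def make_windows(signal, label, window=2048, stride=1024):
    """Cut a 1-D measurement signal into overlapping fixed-length
    windows, each paired with the condition label of that run."""
    starts = range(0, len(signal) - window + 1, stride)
    X = np.stack([signal[s:s + window] for s in starts])
    y = np.full(len(X), label)
    return X, y

# Illustrative: one healthy run and one faulty run (random stand-ins
# for real vibration signals, e.g. 1 s sampled at 12 kHz).
healthy = np.random.randn(12000)
faulty = np.random.randn(12000)
Xh, yh = make_windows(healthy, label=0)
Xf, yf = make_windows(faulty, label=1)
X = np.concatenate([Xh, Xf])   # training inputs
y = np.concatenate([yh, yf])   # training labels
```

Each row of X is one supervised training sample; a regression task would simply replace the class label with a continuous target such as remaining useful life.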

Convolutional neural networks
CNNs consist of convolutional layers, pooling layers, and fully connected layers with a final N-way prediction layer. The convolutional layer uses a number of filters to discretely convolve with the input data. We define a vector K ∈ ℝ^m of weights, a vector I ∈ ℝ^k of raw data, and a bias constant b. In a convolutional process, the stride is the distance between two sub-convolution windows; we denote it by the parameter d. We define a sub-vector of I as

I^(i) = (I_((i-1)d+1), I_((i-1)d+2), …, I_((i-1)d+m)).

The idea of a one-dimensional convolution is to take the product between the vector K and the sub-vector I^(i) of raw data, which reads as follows:

S^(i) = Σ_(j=1)^m K_j I^(i)_j + b,

where K_j is the jth element of vector K, j = 1, 2, …, m. When conducting a convolutional process, the number of filters (different filters have different initial values of K and b) is set to determine the depth of the convolutional results. Because the convolution between each filter and the data uses weight sharing, the number of training parameters and the complexity of the model are greatly reduced. As a result, computational efficiency is improved.
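The strided one-dimensional convolution above can be sketched in plain NumPy. This is an illustrative implementation of the formula, not the paper's code; the example filter K = (1, 0, -1) is chosen only so the result is easy to check by hand.

```python
import numpy as np

def conv1d(I, K, b=0.0, d=1):
    """Valid 1-D convolution (cross-correlation form, as in CNNs):
    for each window of length m taken with stride d,
    S^(i) = sum_j K_j * I^(i)_j + b."""
    m = len(K)
    n_out = (len(I) - m) // d + 1
    S = np.empty(n_out)
    for i in range(n_out):
        window = I[i * d : i * d + m]   # the sub-vector I^(i)
        S[i] = np.dot(K, window) + b
    return S

I = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
K = np.array([1.0, 0.0, -1.0])
print(conv1d(I, K, b=0.0, d=1))   # each output equals I[i] - I[i+2]
```

With stride d = 2 the windows skip every other starting position, which halves the output length and illustrates how the stride controls down-sampling during convolution.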
Each convolutional layer is followed by an activation function, the Rectified Linear Unit (ReLU), which has the following form: f(S^(i)) ≜ max(0, S^(i)).
Compared with saturating activation functions, ReLU prevents gradient saturation when the optimizer performs gradient descent, while also encouraging sparsity in the convolutional network. In conclusion, the entire convolution process for a sub-vector I^(i) can be described as

C^(i) = f(S^(i)) = max(0, Σ_(j=1)^m K_j I^(i)_j + b).

The data are then fed into the pooling layer, which down-samples the feature maps. The commonly used pooling methods are max pooling, average pooling, and L2-norm pooling. Because of its improved performance in practice, max pooling is chosen here:

P^(i) = max(C_((i-1)e+1), …, C_((i-1)e+p)),

where p is the pooling size and e is the stride size in max pooling.
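The ReLU activation and the max-pooling step can be sketched together as follows. Again this is an illustrative NumPy implementation of the two formulas, with a small hand-checkable input rather than real feature maps.

```python
import numpy as np

def relu(S):
    """f(S) = max(0, S), applied element-wise."""
    return np.maximum(0.0, S)

def max_pool1d(C, p=2, e=2):
    """Max pooling with pool size p and stride e:
    P^(i) = max over the window C[i*e : i*e + p]."""
    n_out = (len(C) - p) // e + 1
    return np.array([C[i * e : i * e + p].max() for i in range(n_out)])

C = relu(np.array([-1.0, 3.0, 2.0, -4.0, 5.0, 1.0]))
print(C)               # negatives clipped to zero: 0, 3, 2, 0, 5, 1
print(max_pool1d(C))   # window maxima: 3, 2, 5
```

Note how pooling halves the length here (p = e = 2), which is the down-sampling role described above.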
After convolution and pooling, the data are fed into a fully connected layer. The data are first flattened into a one-dimensional structure to facilitate processing in the fully connected layer. The final layer is the output of the model, with size N.
For classification problems, the activation function of the final layer is the softmax. The probabilities of the fault types of the ith training measurement, (P(y^(i) = 1|X^(i); W^(i)), …, P(y^(i) = N|X^(i); W^(i))), are calculated by the softmax classifier as

P(y^(i) = n | X^(i); W^(i)) = exp((W^(i)_n)^T X^(i)) / Σ_(j=1)^N exp((W^(i)_j)^T X^(i)), (8)

where the vector X^(i), i = 1, 2, …, q, represents the penultimate layer of the fully connected layers (the output layer is the last fully connected layer), the matrix W^(i) contains the weights for the different fault types in the ith training measurement, and W^(i)_j is the jth column of the matrix W^(i).
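The softmax of Equation (8) can be sketched as below. This is a generic illustrative implementation, with a made-up penultimate-layer vector x and weight matrix W; the max-shift is a standard numerical-stability trick and does not change the probabilities.

```python
import numpy as np

def softmax_probs(W, x):
    """Class probabilities of Equation (8): the n-th probability is
    exp(W[:, n] . x) normalized over all N columns of W."""
    scores = W.T @ x          # one score (W_n^T x) per class
    scores -= scores.max()    # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Illustrative penultimate-layer output and weights for N = 3 classes.
x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, 1.0]])
p = softmax_probs(W, x)
print(p, p.sum())   # probabilities over the 3 classes; sums to 1
```

The predicted fault type is simply the index of the largest probability, argmax over n.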
For regression, the estimated result y_estimate^(i) of the ith training measurement is represented as

y_estimate^(i) = (W^(i))^T X^(i), (9)

where the definition of the vector X^(i), i = 1, …, q, is the same as in Equation (8), while the vector W^(i) contains the weights of the penultimate fully connected layer for the ith training measurement.
In the training process, to minimize the difference between the predicted scores and the ground-truth labels in the training data, we need to design a proper loss function and specify an optimizer. The two loss functions used for classification and regression are the cross-entropy L_ce and least squares L_ls, respectively, described in Equations (10) and (11):

L_ce = -(1/q) Σ_(i=1)^q Σ_(j=1)^N 1{y^(i) = j} log P(y^(i) = j | X^(i); W^(i)), (10)

L_ls = (1/q) Σ_(i=1)^q (y^(i) - y_estimate^(i))^2. (11)

Specifically, the term 1{y^(i) = j} in Equation (10) is a logical expression that returns either zero or one. Meanwhile, y^(i) and y_estimate^(i) in Equation (11) are the ground-truth and estimated values, respectively. The datasets for supervised classification problems include: CWRU bearing data; broken tool data; bearing data; airplane girder data; blade processing data; gearbox data; and hydraulic system data. The datasets for supervised regression problems include: NASA tool wear data; NASA battery data; and the CALCE data. Specifically, for the multi-classification problem, we define the first class as the positive class to calculate the precision, recall, and accuracy according to Equations (1), (2) and (3).
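The two losses can be sketched as follows. This is a common mean-averaged form of Equations (10) and (11), shown on tiny hypothetical predictions; the exact normalization used in training may differ.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Equation (10): -(1/q) * sum_i log P(y^(i) | X^(i)).
    probs[i, j] is the predicted probability of class j for sample i;
    the indicator 1{y^(i) = j} selects the true class's probability."""
    q = len(labels)
    return -np.mean(np.log(probs[np.arange(q), labels]))

def least_squares(y_true, y_est):
    """Equation (11): mean squared difference between the ground-truth
    and estimated values."""
    return np.mean((np.asarray(y_true) - np.asarray(y_est)) ** 2)

# Hypothetical softmax outputs for q = 2 samples, N = 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
print(cross_entropy(probs, np.array([0, 1])))  # small: both nearly correct
print(least_squares([1.0, 2.0], [1.5, 2.0]))   # (0.5^2 + 0^2) / 2 = 0.125
```

Either loss is then handed to the optimizer (e.g. a gradient-descent variant) to update the network weights.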