Zhiqiang Zhang, Yi Zhao, Xiangke Liao, Wenqiang Shi, Kenli Li, Quan Zou, Shaoliang Peng, Deep learning in omics: a survey and guideline, Briefings in Functional Genomics, Volume 18, Issue 1, January 2019, Pages 41–57, https://doi.org/10.1093/bfgp/ely030
Abstract
Omics fields such as genomics, transcriptomics and proteomics have entered the era of big data. The huge volume of high-dimensional, complex structured data now being generated has outgrown conventional machine learning algorithms. Fortunately, deep learning technology can help resolve these challenges. There is evidence that deep learning handles omics data well and can resolve omics problems. This survey aims to provide an entry-level guideline for researchers to understand and use deep learning to solve omics problems. We first introduce several deep learning models and then discuss several research areas that have combined omics and deep learning in recent years. In addition, we summarize the general steps involved in using deep learning, which have not yet been systematically discussed in the existing literature on this topic. Finally, we compare the features and performance of current mainstream open-source deep learning frameworks and present the opportunities and challenges involved in deep learning. This survey will be a good starting point and guideline for omics researchers seeking to understand deep learning.
Introduction
The impressive achievement of Google's AlphaGo inspired researchers outside computer science to pay attention to deep learning technology. Deep learning is a machine learning method based on neural networks. Compared with traditional machine learning methods, deep learning tends to use more network layers and require more data, and its ability to extract features automatically from raw data is greatly enhanced. Given massive data and this stronger feature learning ability, deep learning tends to achieve more satisfactory experimental results.
Deep learning technology has a long history; the earliest prototype was the MCP artificial neuron model developed by McCulloch and Pitts [1] in 1943. Rosenblatt [2] then proposed the concept of the perceptron on the basis of artificial neurons. In 1974, the backpropagation algorithm was proposed in Werbos' [3] doctoral thesis, which made training multilayer neural networks practical. The most significant breakthrough in this field occurred in 2006, when Hinton's algorithm effectively resolved the vanishing gradient problem in backpropagation and revealed the potential of deep learning technology [4]. The technology has since developed rapidly for the following three reasons:
With the arrival of the big data era, the amount of data has become huge, data dimensionality has increased and data structures have become more complex. Traditional machine learning methods, such as support vector machines, are not good at handling such data.
The development of computing hardware makes it feasible to train deep learning models.
The deep learning technology community, including big companies like Google, is growing rapidly every year, promoting the continuous development of this technology.
Figure 1. Approximate number of published articles. The number of articles is based on searches for 'deep learning' and 'deep learning + DNA/RNA/protein' in https://apps.webofknowledge.com.
At present, deep learning technology has achieved great success in image recognition, speech recognition and natural language processing. In addition, many applications in bioinformatics, such as disease prediction using electronic health records [5, 6], the classification of biomedical images [7–10] and biological signal processing [11–13], have benefited from deep learning too. As an important discipline in biological science, omics is no exception. Omics data, represented by genome, transcriptome and proteome data, are increasing exponentially. Many well-known biological data projects and databases, such as the Encyclopedia of DNA Elements (ENCODE) [14] and the Gene Expression Omnibus (GEO) [15], provide a growing amount of publicly accessible data, which meets deep learning's need for massive data and enables deep learning to be applied in omics.
In fact, the development of contemporary omics has become inseparable from the support of deep learning. On the one hand, although experimental methods such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy produce accurate results, they can be time-consuming and expensive. On the other hand, most existing data are diverse, complex and high dimensional, and resolving a problem often requires combining various types of data, which increases the difficulty of data analysis. Deep learning technology has the potential to address both problems. Compared with traditional experimental methods, it is faster and more economical; compared with traditional machine learning methods, it is better able to handle these complex data, making it easier to obtain accurate experimental results. The approach of combining deep learning and omics has gained great popularity since 2010, as shown in Figure 1.
Prior to this article, many researchers have reviewed the application of deep learning in bioinformatics, biomedicine and other fields [16–22]. However, there has been no specific discussion of the application of deep learning in omics research. Unlike other works, we focus on genomics, transcriptomics, proteomics and related omics research, providing more detailed views on these areas. In addition, we provide a detailed guideline on how to apply deep learning technology in omics research, which has not been addressed in detail in previous works. First, we introduce several deep learning models commonly used in omics research. Then, we present application cases and the latest developments of deep learning in the field of omics over the past few years. The steps involved in using deep learning technology in omics research are also discussed. In addition, to enable researchers who do not yet understand deep learning to apply this technology, we summarize and compare several open-source deep learning frameworks and bioinformatics tools. Finally, we discuss the potential challenges and future opportunities in this field. We believe that this work will assist omics researchers in understanding deep learning technology.
Deep learning models in omics
Deep learning models are varied, and different models are appropriate for different types of problems. Here, we introduce three common models: deep neural networks (DNNs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Deep neural networks
In this section, we use the term DNNs to denote fully connected neural networks, which comprise the multilayer perceptron (MLP) [23], the auto-encoder [24] and the restricted Boltzmann machine (RBM) [25].
The MLP is also known as a multilayer neural network: in addition to the input and output layers, it contains multiple hidden layers. One of the simplest MLP models is shown in Figure 2A. Given a large amount of training data, an MLP constantly adjusts the weights between neurons using the backpropagation algorithm so that the correct mapping between input and output is established. Therefore, an MLP is usually trained with a supervised method when a large amount of labeled data is available. In omics research, MLPs are widely used when features are not related in time or space.
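As a concrete illustration, the following is a minimal sketch of a supervised MLP classifier written with Keras (one of the frameworks surveyed later). The layer sizes and the synthetic data are illustrative assumptions, not taken from any study cited here.

```python
# A minimal MLP sketch for binary classification (illustrative data).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(1000, 20)          # 1000 samples, 20 features (synthetic)
y = (X.sum(axis=1) > 10).astype(int)  # synthetic binary labels

model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),  # hidden layer 1
    Dense(32, activation='relu'),                     # hidden layer 2
    Dense(1, activation='sigmoid'),                   # output layer
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)   # supervised training
```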
Figure 2. Some typical DNN structures. (A) An MLP structure containing only one hidden layer. (B) A typical auto-encoder structure. (C) A typical RBM structure. (D) The basic structure and training process of a DBN. The first step of training is pretraining each RBM layer along the solid arrows, and the second step is fine-tuning the network along the dashed arrows based on the labeled data.
The auto-encoder is a neural network that reproduces its input signal as faithfully as possible. It can capture the most important features of the input data and restore the original data. Its main idea is to treat the hidden layers of the neural network as an encoder and a decoder: after the input data are encoded and then decoded by the hidden layers, the decoded data should be consistent with the original input. One of the simplest auto-encoder structures is shown in Figure 2B. Auto-encoders routinely use greedy layer-wise pretraining to implement unsupervised learning. A trained auto-encoder is often used for dimensionality reduction or feature extraction, typically in situations where a large amount of labeled data is unavailable. In general, it is difficult to obtain large amounts of labeled omics data, and omics data are usually high dimensional; therefore, auto-encoders are often used in omics research.
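The following is a minimal auto-encoder sketch in Keras, assuming a 100-dimensional input compressed to a 10-dimensional code; the dimensions and data are illustrative assumptions.

```python
# A minimal auto-encoder sketch for unsupervised feature extraction.
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

X = np.random.rand(500, 100)                 # illustrative unlabeled data

inp = Input(shape=(100,))
encoded = Dense(10, activation='relu')(inp)          # encoder: 100 -> 10
decoded = Dense(100, activation='sigmoid')(encoded)  # decoder: 10 -> 100

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='mse')    # reconstruct the input
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

encoder = Model(inp, encoded)     # reuse the encoder for feature extraction
features = encoder.predict(X)     # reduced 10-dimensional representation
```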
The RBM, proposed by Hinton et al., is a generative stochastic neural network that contains a visible layer and a hidden layer. In an RBM, neurons in different layers are connected to each other, while neurons in the same layer are independent of each other. Furthermore, the connections between neurons are bidirectional and symmetrical, as shown in Figure 2C. Based on an energy model and probability equations, an RBM can establish the correct relationship model between the visible and hidden layers, thus extracting the features of the original data. An RBM can be trained with an unsupervised method, for instance the contrastive divergence algorithm [26]. In omics research, RBMs are used in two main ways: encoding the data and then using supervised learning methods to classify or regress them, as in the deep belief network (DBN) [4]; and using the RBM to obtain weight matrices and biases that initialize a backpropagation (BP) neural network.
Having introduced the three basic components of DNNs, we now introduce a DNN model commonly used in omics research: the DBN [4]. One of the most classical DBN structures consists of several RBM layers and one BP layer, as shown in Figure 2D. Training this DBN model involves two steps: the first is pretraining each RBM layer, and the second is fine-tuning the network based on the labeled data. DBNs have achieved considerable progress in omics research; common omics challenges such as protein residue–residue contact prediction [27] and RNA-binding protein site prediction [28] have been addressed with DBNs.
Figure 3. A simple CNN structure.
Figure 4. Schematic diagrams of the convolution and pooling operations. (A) Schematic diagram of the convolution operation. (B) Max-pooling and mean-pooling schematic diagram.
In general, DNNs are a conventional and effective class of neural network. Although the best results are not guaranteed, this type of model can be adapted to almost all types of data. Therefore, DNNs are worth trying in omics research.
Convolutional neural networks
CNNs were first proposed by LeCun in 1989 [29]. In recent years, they have been successfully applied in many fields, including speech recognition, face recognition, general object recognition, motion analysis and natural language processing. CNNs have also played an important role in omics research, including gene expression prediction, protein classification and gene structure prediction.
In general, CNNs consist of multiple convolution layers, pooling layers and a fully connected layer. A simple CNN structure is shown in Figure 3. The function of a convolution operation is to extract the various features of the data. During convolution, a convolution kernel slides over the input window, the weight parameters of the kernel are multiplied by the corresponding pixels and the products are summed. The role of the pooling layer is to abstract the original feature signal, which greatly reduces the number of training parameters and can also reduce the degree of overfitting. Pooling operations fall into two categories: max-pooling, which selects the largest value of the corresponding pixels as the sampling result, and mean-pooling, which takes their average value as the sampling result. The principles of convolution and pooling operations are shown in Figure 4, and a toy implementation is sketched below.
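To make the multiply-and-sum and pooling operations concrete, here is a toy NumPy sketch (not a trainable network); the image and kernel sizes are arbitrary assumptions.

```python
# Toy illustrations of the convolution and max-pooling operations.
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Max-pooling: keep the largest value in each size x size window."""
    h, w = feature_map.shape
    return np.array([[feature_map[i:i+size, j:j+size].max()
                      for j in range(0, w - size + 1, size)]
                     for i in range(0, h - size + 1, size)])

image = np.random.rand(6, 6)
kernel = np.ones((3, 3)) / 9.0       # simple averaging kernel
fmap = convolve2d(image, kernel)     # 4 x 4 feature map
pooled = max_pool(fmap)              # 2 x 2 map after max-pooling
```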
Compared with other models, CNNs have an outstanding ability to analyze spatial information and require fewer data preprocessing steps. CNNs are therefore particularly good at manipulating image data, and encoding omics data into two-dimensional matrices is often quite easy. CNNs have achieved good results in identifying various gene sequence structures, such as protein binding sites and enhancer sequences. In addition, CNNs' capacity for transfer learning is powerful when it is difficult to obtain a large amount of labeled omics data.
Recurrent neural networks
RNNs are a class of neural network models proposed in the late 1980s [30]. In recent years, RNNs have been increasingly applied in many fields, such as natural language processing, image recognition and speech recognition. In omics research, RNNs have various applications, such as determining the exon/intron boundaries of a gene and predicting RNA sequence-specific bias.
RNNs are so named because the input to the hidden layer includes not only the output of the input layer but also the hidden layer's own output from the previous time step. A simple RNN model can thus be unfolded into a complex network. A specific RNN structure diagram and its time-dependency map are shown in Figure 5.
Figure 5. The structure of an RNN and the structure after unfolding in time. Ht is the hidden state at time t, and Ot represents the output at time t. U is the weight matrix from the input layer to the hidden layer, which abstracts the original input into a hidden-layer input; W is the hidden-to-hidden weight matrix, the memory controller of the network responsible for scheduling the memory; V is the weight matrix from the hidden layer to the output layer, through which the features learned by the hidden layer pass to become the final output.
At present, the two most widely used RNN architectures are long short-term memory (LSTM) networks [31] and gated recurrent unit (GRU) networks [32]. These two networks are enhanced versions of the general RNN structure. Training an RNN model is slightly different from training DNN models: when an RNN is unfolded, the parameters W, U and V are shared across time steps, unlike in a traditional neural network. Furthermore, during training, the output at each step depends not only on the current network but also on the states of several previous steps. This increases the difficulty of training and makes it prone to problems such as exploding gradients. Fortunately, improved networks such as LSTM and GRU are able to mitigate such problems; a minimal LSTM sketch follows.
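The following sketch shows an LSTM classifier for one-hot-encoded nucleotide sequences in Keras; the sequence length, sample count and labels are illustrative assumptions.

```python
# A minimal LSTM sketch for classifying one-hot-encoded sequences.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

X = np.random.rand(200, 100, 4)    # 200 sequences, 100 bases, 4-letter one-hot
y = np.random.randint(0, 2, 200)   # illustrative binary labels

model = Sequential([
    LSTM(32, input_shape=(100, 4)),   # reads the sequence step by step
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```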
Given the powerful memory ability of RNNs, they can handle time-series problems well. Dependencies exist within most omics data, such as nucleotide and amino acid sequences. RNNs can automatically learn the correlations between elements of a sequence from such data and extract global sequence characteristics. Therefore, RNNs also occupy a key position in omics research.
So far, we have explained the three deep learning models most commonly used in omics research: DNNs, CNNs and RNNs. However, these models can also be combined according to actual needs, which can achieve better performance. For example, the approach in [33] was to use a hybrid CNN-LSTM model to predict the properties and functions of DNA. Compared with the CNN-based method, the performance of this hybrid model was significantly higher.
Application of deep learning in omics
At present, an increasing number of omics researchers have taken note of the value of deep learning technology. They have used deep learning to resolve problems in this field and have achieved higher accuracy and faster speeds than traditional methods. In this section, we briefly explain the latest advances in deep learning from three perspectives: genomics, transcriptomics and proteomics.
Genomics
First, deep learning technology can be used to predict and identify the functional units in DNA sequences, including replication domains, transcription factor binding sites (TFBSs), transcription initiation sites, promoters, enhancers and gene deletion sites. In 2015, a novel hybrid architecture combining a pre-trained DNN and a hidden Markov model was developed [34] to identify distinct replication domain types; this model achieved significant improvements in recognition accuracy and robustness compared with previous methods. In 2016, a deep convolutional/highway MLP framework [35] was applied to classify genomic sequences according to TFBSs and achieved a good result, with a median area under the curve (AUC) of 0.946. In [36], a CNN model was used to analyze the sequence characteristics of prokaryotic and eukaryotic promoters and to develop predictive models. This experiment performed very well on the classification of promoter and non-promoter sequences; for human promoters, the prediction accuracy reached 0.90 on TATA and 0.89 on non-TATA promoter sequences [36]. Similarly, the method used in [37] effectively distinguished active enhancers and promoters using a deep three-layer feed-forward neural network, achieving a maximum accuracy of 93.59% on GM12878 lymphoblastoid cells. In addition, for gene deletion, the experiment in [38] proposed a tool named CNNdel, which uses shallow CNNs to detect genomic deletions in real data from the 1000 Genomes Project; its results show that both accuracy and sensitivity improved compared with other existing methods. In general, DNA sequence data are the primary training data for predicting and identifying functional units in DNA sequences. Moreover, according to our summary, the application of CNNs in such research is increasingly common, while the application of DNNs such as DBN and MLP is gradually decreasing. In the past two years, CNNs have taken a mainstream position in the prediction of promoters, enhancers, TFBSs and replication domains, the detection of gene deletions and the differentiation of introns and exons. In addition, the hybrid CNN+LSTM model is gradually coming into use.
Deep learning technology can also predict gene expression. This work usually involves predicting the expression of target genes, predicting gene function, modeling gene regulatory networks, etc. For example, in 2016, Chen et al. [39] used the microarray-based GEO dataset to train a DNN model to infer the expression of target genes. Their method was significantly better than logistic regression, with a relative improvement of 15.33% in mean absolute error (MAE). In the same year, a DNN model based on an MLP and a stacked denoising auto-encoder [40] was proposed to predict gene expression from genotypes of genetic variation, achieving better performance than lasso and random forests. As another example, a novel hybrid convolutional and bidirectional LSTM RNN framework named DanQ [41] was proposed for predicting the functions of noncoding regions; compared with related models, DanQ performed better on 97.6% of targets in terms of the precision-recall curve. With regard to gene regulatory networks, the approach taken in [42] was to use an RNN model to train the gene regulatory network; the results obtained were superior to all previous methods while maintaining robustness. On the whole, gene expression profiles, DNA sequence data with functional labels and histone modification data are all common training data for predicting gene expression. For predicting the expression of target genes and the function of genes, CNNs are currently the most commonly used deep learning models, followed by MLPs. The application of deep learning to modeling gene regulatory networks is still relatively rare; there, RNNs are the most widely used deep learning model.
Deep learning technology also enables the exploration of genomes and diseases in epigenetics and other fields. For example, in 2017, an MLP model was used to predict cancer risk and cancer survival rates [43]. Using the clinical and molecular data of The Cancer Genome Atlas (TCGA) as training data, this work achieved performance comparable to the Cox elastic net. Another example, [44], used a deep CNN to predict the impact of sequence variation on proximal CpG site DNA methylation, achieving an area under the receiver operating characteristic curve (AUROC) of 0.854. Similarly, a method named DeepCpG was proposed in [45] to predict methylation states in single cells. Using a joint RNN-CNN network, this method can predict single-cell methylation states accurately, and the parameters of the model can be interpreted, thereby providing insight into how sequence composition affects methylation variability. According to our summary, DNA sequences and their methylation states are commonly used as training data for predicting DNA methylation. In this work, CNNs are the most commonly used deep learning model; RNNs are sometimes used in this field, but usually combined with CNNs in a hybrid model rather than alone. In research on the association between genomics and disease, TCGA data, gene expression profiles and clinical data are common training data, and the most commonly used deep learning models are DNNs. Among them, auto-encoders are often used for feature extraction, while DBNs are often used to directly predict or classify diseases.
Transcriptomics
Using deep learning technology, we can analyze the structure of RNA sequences, including predicting RNA-binding protein (RBP) binding sites, alternative splicing sites and RNA types. For example, in 2015, a DBN was used to discover potential binding motifs and predict novel candidate binding sites [28]. This model uses RNA sequence, secondary structure and tertiary structure information as training data and achieves a 22% reduction in MRE compared with previous methods. As another example, the approach taken in [84] was to employ deep CNNs in a novel splice junction classification tool named DeepSplice. Compared with traditional machine learning methods, this method not only improves accuracy but also increases computational efficiency and flexibility. Furthermore, in 2017, an MLP was constructed in [46] to correctly classify pre-miRNAs against pseudo hairpins, achieving an accuracy of 0.968 ± 0.002. In general, RNA sequences, RNA secondary and tertiary structures, CLIP-seq data, etc. are very useful training data for predicting RNA structure. The most common models for predicting RBP binding sites and alternative splice sites are CNNs, DBNs and the hybrid CNN+DBN model. For classifying RNA, such as identifying whether an RNA sequence is a miRNA or a long non-coding RNA (lncRNA), the most commonly used deep learning models are MLP, RBM and RNNs. Judging from their frequency of use over the last two years, RNNs, represented by LSTM, will be applied more and more widely in this area.
Deep learning technology can also be used in other areas, such as the association between RNA and disease, RNA and drug design, etc. For example, in 2014, a classification model for disease was successfully trained [47] using a DBN. Using miRNA data as training data, this method increased the F1-measure on many kinds of cancer test data by 6–10% compared with typical machine learning methods. Furthermore, the approach taken in [48] was to use transcriptome data with DNN models to identify the pharmacological properties of multiple drugs across different biological systems and conditions; ultimately, their approach achieved much better classification performance than previous support vector machine (SVM) methods. In general, the training data commonly used in such research include RNA sequence data represented by miRNA-seq, transcriptome data in TCGA and RNA methylation state data. The most commonly used deep learning models are MLP and DBN. In addition, auto-encoders are sometimes used to extract features from raw data.
Proteomics
Similarly, deep learning technology can identify protein structures, including protein secondary and tertiary structure prediction, protein model quality assessment, protein contact map prediction, etc. For example, the approach taken in [49] was to use stacked sparse auto-encoders to predict secondary structures and torsion angles. This method used the original amino acid sequence as the initial input, with the protein secondary structure, backbone torsion angles and other features as iterative inputs; the model achieved nearly 82% accuracy in predicting secondary structures. Additionally, the approach taken in [36] was to evaluate protein model quality by replacing the support vector machine with a DNN, which increased the Pearson correlation coefficient from 0.85 to 0.9. The approach taken in [50] was to use an ultra-deep neural network, formed by combining two deep residual neural networks, to predict contacts by integrating both sequence conservation and evolutionary coupling information, achieving the highest F1 score on free-modeling targets in the latest Critical Assessment of protein Structure Prediction (CASP). According to our summary, predicting protein structure generally requires amino acid sequences, low-dimensional protein structures and the physicochemical properties of amino acids as training data. In this work, the most commonly used deep learning models in the literature are DNNs: auto-encoders are often used to extract features from the input data, while MLP, DBN and RBM are often used as the core model for predicting protein structure. However, in the recent literature, CNNs and RNNs, especially RNNs, have gradually been applied as the main models for predicting protein structure and have achieved higher accuracy than DNNs.
Table 1. Deep learning models applied to omics problems

| Classification | Problem to be solved | Deep learning model |
| --- | --- | --- |
| Genomics | DNA sequence structure | MLP (DNN) [37, 55, 56] |
| | | SAE (DNN) [57] |
| | | DBN (DNN) [34, 58] |
| | | CNN [33, 36, 38, 59–68] |
| | | RNN [33, 63, 64, 69, 70] |
| | Gene expression regulation | MLP (DNN) [39, 40] |
| | | SAE (DNN) [40, 71] |
| | | CNN [41, 72–80] |
| | | RNN [41, 81] |
| | Gene expression and disease | MLP (DNN) [43, 82, 83] |
| | | SAE (DNN) [84, 85] |
| | | DBN (DNN) [86–88] |
| | Genotype and drugs | MLP (DNN) [89] |
| | Epigenomics (DNA methylation) | CNN [44, 45] |
| Transcriptomics | RNA sequence structure | MLP (DNN) [90–92] |
| | | SAE (DNN) [93] |
| | | DBN (DNN) [28, 46, 94] |
| | | CNN [94] |
| | | RNN [95–97] |
| | RNA and drug classification | MLP (DNN) [48] |
| | | SAE (DNN) [98] |
| | RNA and disease prediction | MLP (DNN) [99] |
| | | SAE (DNN) [100] |
| | | DBN (DNN) [47] |
| | | CNN [101] |
| Proteomics | Protein classification | RNN [102] |
| | Protein structure | MLP (DNN) [103–106] |
| | | SAE (DNN) [49, 107, 108] |
| | | DBN (DNN) [27, 109–111] |
| | | CNN [50, 112–114] |
| | | RNN [112, 115] |
| | Protein function | CNN [51, 116, 117] |
| | | LSTM (RNN) [52] |
| | Drug design | CNN [118] |
| | Intracellular distribution of proteins | SAE (DNN) [119] |
| | | CNN [54] |
| | | RNN [120, 121] |
| | Protein interactions | MLP (DNN) [122] |
| | | RNN [53, 123] |
Deep learning technology can also be used to predict protein function. For example, in [51], a CNN model was used to identify protein function; the experiment used protein tertiary structures as input and achieved an accuracy of 87.6%. Furthermore, the experiment in [52] used an LSTM model to predict the function of four kinds of proteins; using the original amino acid sequences as training data, the model achieved over 99% accuracy. In general, amino acid sequences, protein structures and protein–protein interaction data are very useful for predicting protein function, with CNNs and LSTM currently the most important prediction models.
Deep learning technology can also be used to predict protein–protein interactions and protein subcellular localization, among many other tasks. For example, the approach taken in [53] was to use a stacked auto-encoder to predict sequence-based protein–protein interactions; this model achieved an average accuracy of 97.19%. Additionally, the approach taken in [54] was to use a CNN to automate the detection of the cell compartment in which a fluorescently labeled protein is located. This model performs very well, achieving 91% accuracy per cell localization classification and 99% accuracy per protein. For predicting protein–protein interactions, amino acid sequences are the most common training data, and DNNs, CNNs and RNNs have all been used as prediction models in different studies; in comparison, CNNs rank slightly higher in both frequency of use and prediction accuracy. For protein subcellular localization, amino acid sequences and fluorescently labeled microscopy images are commonly used training data. The application of deep learning in this field is still relatively rare; in related work, some researchers use CNNs as the core model, some use RNNs and some use stacked auto-encoders, but in comparison, the classification accuracy of CNNs is higher.
We have briefly described several typical examples of the application of deep learning in omics research. More specific work is included in Table 1. Of course, we believe that deep learning can achieve even greater success in the field of omics, as better training data, more advanced deep learning models, more reasonable deep learning architecture and parametric designs can further improve performance.
Table 2. Open-source software and source code applying deep learning in omics

| Problem to be solved | Deep learning model | Source of the software or source code | Type |
| --- | --- | --- | --- |
| Predict RBP binding sites | DBN | https://github.com/thucombio/deepnet-rbp [28] | code |
| DNA-binding protein site prediction | CNN | http://cnn.csail.mit.edu [44] | code |
| Identify and distinguish replication domains based on replication timing profiles | DBN | https://github.com/wenjiegroup [34] | code |
| Identification of enhancer and promoter regions in the human genome | MLP (DNN) | https://github.com/yifeng-li [37] | code |
| | MLP (DNN) | https://github.com/wenjiegroup/PEDLA [55] | code |
| Discriminate between bound and unbound sequences of TF binding data | CNN | https://github.com/kundajelab/keras/tree/keras_1 [61] | code |
| Predict binding of all TF/cell type pairs | CNN+RNN (LSTM) | http://github.com/uci-cbcl/FactorNet [33] | code |
| Predict conserved sequences | CNN | https://github.com/uci-cbcl/DeepCons [79] | code |
| Predict translation initiation sites | CNN | https://github.com/zhangsaithu/titer [66] | code |
| Annotate the pathogenicity of genetic variants | MLP (DNN) | https://cbcl.ics.uci.edu/public_data/DANN/ [83] | code |
| Gene expression data compendium for Pseudomonas aeruginosa | SAE (DNN) | https://github.com/greenelab/adage [56] | code |
| Predict the properties and function of DNA sequences | CNN+RNN | http://github.com/uci-cbcl/DanQ [41] | code |
| Predict gene expression | CNN | https://github.com/QData/DeepChrome [76] | code |
| | MLP (DNN) | https://github.com/uci-cbcl/D-GEX [27] | code |
| Histone ChIP-seq data denoising | CNN | https://github.com/kundajelab/coda [77] | code |
| Patient prognosis based on transcriptome data | MLP (DNN) | https://github.com/lanagarmire/cox-nnet [99] | code |
| Predict the effect of genome sequence variation on DNA methylation | CNN | http://cpgenie.csail.mit.edu [31] | code |
| Use the clinical and molecular data of TCGA to predict disease risk and survival | MLP (DNN) | https://github.com/CancerDataScience/SurvivalNet [30] | code |
| Predict protein contacts | CNN | http://raptorx.uchicago.edu/ContactMap/ [36] | webserver |
| | MLP (DNN) | http://compbio.robotics.tu-berlin.de/epsilon/ [105] | webserver |
| | MLP (DNN) | http://scratch.proteomics.ics.uci.edu/ [105] | webserver |
| Protein model quality assessment | MLP (DNN) | http://proq3.bioinfo.se/ [104] | webserver |
| Identify protein folds | DBN | http://iris.rnet.missouri.edu/dnfold [110] | webserver |
| Comprehensive website | … | http://www.softberry.com/ [36] | webserver |
| | … | http://sparks-lab.org [49, 108] | webserver |
Open-source software
Especially in the past two years, some excellent software has been developed for applying deep learning technology to omics research. In 2015, Alipanahi et al. [73] developed a tool called DeepBind to explore the sequence specificities of DNA- and RNA-binding proteins. In the same year, Zhou and Troyanskaya [72] developed a tool called DeepSEA for identifying the functional effects of noncoding variants. In 2016, Kelley et al. [74] developed a tool called Basset to understand the complex language of eukaryotic gene expression. At present, these three tools have become benchmarks in the field.
In addition to these three tools, many researchers who use deep learning to resolve problems in other areas have packaged their software or algorithm source code and uploaded it to the Internet for everyone to learn from and use. We can directly use their software or algorithms to expand our understanding of deep learning. We present these software packages and source codes in Table 2.
All of the applications we have listed are verified and available. Statistically, CNN applications are the most extensive among them, while RNN applications remain few. In addition, combined models, such as CNN+RNN, often improve performance.
Resolving omics problems using deep learning
In this section, we summarize the general steps of using deep learning technology to resolve an omics problem: data acquisition, data preprocessing, encoding, model selection, model training and performance evaluation.
Data acquisition
A large amount of omics data is produced every year. Furthermore, with the establishment of various bioinformatics databases, data acquisition is no longer a difficult problem. Table 3 presents several commonly used omics databases in omics research.
Table 3. Commonly used databases in omics research

| Category | Database name | Website |
| --- | --- | --- |
| Genome database | NCBI | https://www.ncbi.nlm.nih.gov/genome |
| | Ensembl | https://www.ensembl.org/ |
| | UCSC | http://genome.ucsc.edu/ |
| Nucleic acid sequence database | EMBL | http://www.ebi.ac.uk/embl/ |
| | GenBank | https://www.ncbi.nlm.nih.gov/genbank/ |
| | DDBJ | http://www.ddbj.nig.ac.jp |
| Protein sequence database | SWISS-PROT | http://cn.expasy.org/sprot |
| | PIR | http://pir.georgetown.edu/ |
| Protein structure database | PDB | http://www.rcsb.org/pdb |
| Protein structure classification database | SCOP | http://scop.mrc-lmb.cam.ac.uk/scop/ |
| | CATH | http://www.cathdb.info/ |
It is important to note that omics data have their own standard formats; for example, 'fasta', 'fastq', 'gff2' and 'bed' are common data formats in omics. Obviously, it is difficult to feed these formats directly to deep learning models. However, detailed descriptions of these formats are easy to find on the Internet. It may be necessary to know a scripting language, such as Perl, R or Python, to extract the information we need from such data; fortunately, the cost of learning these scripting languages is often low. As an illustration, a minimal FASTA reader is sketched below.
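The following is a minimal sketch of reading a FASTA file in plain Python; 'example.fasta' is a hypothetical file name.

```python
# A minimal FASTA reader: maps sequence IDs to sequences.
def read_fasta(path):
    sequences, seq_id, parts = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):          # header line: start a new record
                if seq_id is not None:
                    sequences[seq_id] = ''.join(parts)
                seq_id, parts = line[1:].split()[0], []
            elif line:
                parts.append(line)            # sequence lines may wrap
    if seq_id is not None:
        sequences[seq_id] = ''.join(parts)    # flush the last record
    return sequences

seqs = read_fasta('example.fasta')            # hypothetical input file
```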
In omics, the input data for common deep learning models include sequencing data (DNA, RNA and amino acid sequences), gene expression data, image data (such as in situ hybridization images), the physicochemical properties of proteins or amino acids, contact maps, etc. Overall, these data need to be downloaded, extracted and tidied into a form that deep learning models can consume (such as vectors and matrices).
Data preprocessing
Although deep learning models can automatically learn the features of data, this does not mean that the original data can always be fed directly into a deep learning model. Proper preprocessing of the data can greatly improve the accuracy and speed of the deep learning model. The most commonly used data preprocessing methods in omics research include data cleaning, normalization and dimensionality reduction.
(1) Data cleaning: The omics data that are obtained may contain many missing values, erroneous values and noise, which can cause serious problems in model training. Therefore, we need to improve the quality of the data as much as possible. Data cleaning is usually undertaken prior to the encoding operation and mainly involves handling missing values and outliers, removing duplicate data and processing noisy data. For missing data or outliers, we can fill in the incomplete data using the k-nearest neighbor algorithm, regression, decision tree analysis and other methods (a k-nearest neighbor imputation sketch follows this item). We can deal with noisy data by applying clustering, regression and binning. For duplicate data, we can eliminate records whose similarity exceeds a threshold.
Data cleaning is a time-consuming and labor-intensive task, and it is difficult to judge which method is best because of the different types of data involved. Many omics researchers may not understand the fundamentals of machine learning or how to implement the various algorithms mentioned above. Fortunately, software packages such as OpenRefine [124] and DataKleenr are well suited to data cleaning, and learning to use them is much easier than implementing the cleaning algorithms ourselves.
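As an example of the k-nearest neighbor method for missing values mentioned above, here is a sketch using scikit-learn; the toy matrix and the choice of two neighbors are illustrative assumptions.

```python
# A sketch of k-nearest neighbor imputation of missing values.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [2.0, 3.0, 5.0],
              [1.5, 2.5, 4.5]])

imputer = KNNImputer(n_neighbors=2)   # average the 2 most similar samples
X_clean = imputer.fit_transform(X)    # missing entries are filled in
```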
(2) Normalization: Normalization limits the collected data to a certain range. A good normalization method can alleviate the problem of falling into local optima. Normally, normalization is undertaken after the data are encoded. Two common normalization methods are min-max normalization and zero-mean (z-score) normalization, as sketched below; which one to choose depends on the type of data to be handled.
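The two methods can be written directly in NumPy; the feature matrix here is an illustrative assumption.

```python
# Min-max and zero-mean normalization of a feature matrix.
import numpy as np

X = np.random.rand(100, 5) * 50       # illustrative raw features

# Min-max normalization: rescale each feature to [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Zero-mean (z-score) normalization: mean 0, standard deviation 1
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```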
(3) Dimensionality reduction: Omics data are usually high dimensional. Properly reducing the data dimensions can remove irrelevant features and lead to better training. Many deep learning models, such as the auto-encoder, can perform dimensionality reduction themselves; compared with traditional machine learning methods, dimensionality reduction by deep learning retains more non-linear features. In addition, to reduce the amount of computation, conventional dimensionality reduction methods are also sometimes used in omics research. For example, principal component analysis (PCA) was used to reduce the dimensions of gene expression profile data in [79]; a PCA sketch follows.
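Here is a sketch of PCA with scikit-learn; the sample count, gene count and number of components are illustrative assumptions, not the settings of [79].

```python
# A sketch of PCA-based dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 5000)       # e.g. 200 samples, 5000 genes (illustrative)

pca = PCA(n_components=50)          # keep the top 50 principal components
X_reduced = pca.fit_transform(X)    # shape: (200, 50)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```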
Encoding
Reasonable data input forms have profound implications for the ultimate learning outcomes of deep learning models. In general, the most accepted input form for deep learning or conventional machine learning methods is a vector or matrix. In omics, the most common types of data are sequence data, such as DNA, RNA and amino acid sequences. For sequence data, we often use the following three methods to encode them into matrix form (a one-hot encoding sketch follows the list):
One-hot encoding: This is the most common encoding method in the existing literature on this topic. It can be used for both nucleotide and amino acid sequences. In the case of a DNA sequence, the sequence ATGCT after one-hot encoding is shown in Figure 6A, where the black pattern represents 1 and the white pattern represents 0.
Position-specific scoring matrix (PSSM): This encoding method can be used for both amino acid and nucleotide sequences. The matrix gives the probability of a base or a certain amino acid being present at a specific position. Some software packages, such as the position-specific iterative basic local alignment search tool (PSI-BLAST), can generate a PSSM. A simple PSSM is shown in Figure 6B.
PAM and BLOSUM matrices: The point accepted mutation (PAM) matrix and the blocks substitution matrix (BLOSUM) are scoring matrices for sequence similarity. These two encoding methods are mainly used for amino acid sequences. Some experiments, such as [107], have compared the two encoding methods in detail. Currently, the BLOSUM matrix is the more frequently used of the two, and mature tools such as BLAST [125] provide good support for both.
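As promised above, here is a minimal sketch of one-hot encoding a DNA sequence; the base ordering A/T/G/C is an illustrative convention.

```python
# One-hot encoding of a DNA sequence as a (length x 4) binary matrix.
import numpy as np

BASES = {'A': 0, 'T': 1, 'G': 2, 'C': 3}   # illustrative column ordering

def one_hot(seq):
    mat = np.zeros((len(seq), 4), dtype=np.int8)
    for i, base in enumerate(seq.upper()):
        if base in BASES:                  # ambiguous bases (e.g. N) stay all-zero
            mat[i, BASES[base]] = 1
    return mat

print(one_hot('ATGCT'))   # the example sequence from Figure 6A
```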
Figure 6. Two common encoding methods. (A) One-hot encoding of bases, where black blocks represent 1 and white blocks represent 0. (B) A simple PSSM; each number represents the probability that the base appears at that location.
In addition to these three common encoding methods, there are methods specific to protein sequences, such as the autocovariance method and the conjoint triad method [126]. Among them, the autocovariance method, which describes how variables at different positions are correlated and interact, has been widely used for encoding protein sequences [127].
Beyond sequence data, some omics data, such as contact maps and image data, already take the form of a two-dimensional matrix and can be used directly by a deep learning model. Other data, such as gene expression data and the physicochemical properties of proteins or amino acids, take the form of numerical vectors, which are simple to assemble into a matrix.
Joint encoding of several data types is also often used in omics research. For example, the Atchley factor method [128] is one of the most frequently used: it characterizes an amino acid by five numerical factors covering secondary structure, polarity, volume, codon diversity and electrostatic charge. It is also common to combine the various physicochemical properties of amino acids and the PSSM into one matrix [49, 50, 98, 129].
Model selection
So far, no universal deep learning model that can solve all problems has been developed. As mentioned above, each model has its own advantages. In omics research, auto-encoders are mainly used for dimensionality reduction and removing data noise, and RBMs are mainly used for feature learning; these two models are rarely used alone, and are instead combined with other models to solve a problem. MLP and DBN are suitable for almost all omics problems, but there is no guarantee that they will outperform other models. CNNs can handle most grid-like data in omics, such as image data and encoded sequence data. RNNs are mainly used for sequence learning problems. For a detailed view of which models fit which omics problems, please refer to our discussion in the section Application of deep learning in omics.
It is worth mentioning that although the deep learning models we introduced in the section Deep learning models in omics are the most commonly used in omics research, we should not rely on any single model when making a selection. First, we should learn to exploit the advantages of various models and combine them to build a more powerful network; for example, the method used in [41] was a combined CNN+LSTM model for predicting the properties and functions of DNA sequences, and it achieved impressive results. Second, we should pay attention to new techniques. For example, in 2015, the more advanced deep residual learning method [130] was proposed; this network avoids the vanishing gradient problem while increasing the number of layers, thus facilitating greater accuracy. We could boldly try such deep learning models in omics research.
Model training
The training of a deep learning model has never been simple. To mitigate the challenges involved, it is first important to consider the hardware. Training a network can take a long time because of the huge number of parameters in a deep learning model. When training a large network, a graphics processing unit (GPU) is recommended to accelerate the training process, and many deep learning frameworks now support GPU acceleration.
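As a quick sanity check (assuming the TensorFlow 2.x API), the GPUs visible to the framework can be listed as follows:

```python
# List the GPUs TensorFlow can see; an empty list means training runs on CPU.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))
```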
Second, we should pay attention to the allocation of datasets. In general, the samples are divided into three parts: a training set, a validation set and a testing set. The training set is used to train the model; the validation set is used to choose the network structure or the parameters that control model complexity; and the testing set is used to measure the performance of the final model. Typically, 70% of the samples are used for training and validation and 30% for testing, although this proportion is not absolute and can be adjusted according to the sample size.
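A minimal sketch of such a split, assuming scikit-learn is available and using placeholder data, might look like this:

```python
# 70/30 split into (train + validation) and test, then a further split of the
# 70% into training and validation sets; the data here are random placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 100), np.random.randint(0, 2, 1000)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.3)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2)
print(len(X_train), len(X_val), len(X_test))
```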
The choice of the various functions in the network is another key issue when building models. Common activation functions (sigmoid, tanh, softmax, ReLU, maxout) and common loss functions (the mean square error loss, the log-likelihood loss and the cross-entropy loss) are all important factors in the final effect of training. Activation functions fall into two categories: output layer activations and hidden layer activations. Consider the output layer first. For a simple regression task, a linear output activation paired with the mean square error loss is usually sufficient. For a binary classification task, the sigmoid (or tanh) activation is usually paired with the cross-entropy loss. For multi-class classification tasks, the softmax activation is usually paired with the log-likelihood loss. For the hidden layers, because sigmoid and tanh suffer from vanishing or exploding gradients, ReLU and maxout have now become the most widely used activation functions.
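The following Keras-style snippet maps these task types to the pairings just described; layer widths are illustrative:

```python
# Illustrative output-layer / loss pairings (Keras syntax); the matching
# loss is noted in the comment next to each output layer.
from tensorflow.keras import layers

# Regression: linear output + mean square error
out_regression = layers.Dense(1, activation="linear")    # loss="mse"

# Binary classification: sigmoid output + cross-entropy
out_binary = layers.Dense(1, activation="sigmoid")       # loss="binary_crossentropy"

# Multi-class classification: softmax output + (log-likelihood) cross-entropy
out_multiclass = layers.Dense(10, activation="softmax")  # loss="categorical_crossentropy"

# Hidden layers: ReLU is now the usual default choice
hidden = layers.Dense(128, activation="relu")
```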
The dropout technique is used in a considerable proportion of omics research. Training a large network on little or noisy data easily leads to overfitting. To mitigate this, Hinton [131] proposed dropout: in each training iteration, some neurons are temporarily dropped from the network with a certain probability, which improves the generalization ability of the network. The principle, though simple, is highly effective and worth trying. Beyond dropout, there are other ways to prevent overfitting, such as early stopping and weight decay; these methods do not conflict and can be combined, as in the sketch below.
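A small sketch combining dropout with early stopping in Keras follows; the dropout rate and patience are illustrative values, and X_train/X_val are assumed to have been prepared beforehand:

```python
# Combining two overfitting remedies: a Dropout layer plus an EarlyStopping
# callback that halts training when the validation loss stops improving.
from tensorflow.keras import layers, models, callbacks

model = models.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                    # randomly drop 50% of units per step
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])
```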
In addition, many parameters in a deep learning model must still be set and tuned manually, such as the learning rate and weight initialization. We summarize some common parameter settings in Table 4, and a configuration sketch follows the table. Many algorithms also support automatic hyperparameter tuning, such as grid search, random search and Bayesian optimization [132], which alleviate the difficulty of tuning to a certain extent.
Table 4. Common parameter settings

| Name | Common settings |
| --- | --- |
| Learning rate | Initial value = 0.1; use Adam for dynamic adjustment |
| Parameter optimization method | SGD; momentum; Adagrad; Adadelta; RMSprop; Adam |
| Weight initialization | Gaussian; Xavier; MSRA |
| Batch size | 64; 128; 256 |
| Number of nodes | e.g. 16, 32, 128; no more than the number of samples |
| Dropout rate | 0.3; 0.5; 0.7 |

Note: bold font represents common values.
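As a hedged illustration, the following Keras snippet wires several of the settings from Table 4 into a model; these are starting points to tune, not universal defaults (in practice, smaller Adam learning rates such as 1e-3 are also common):

```python
# Wiring common settings from Table 4 into a Keras model: Xavier weight
# initialization, dropout rate 0.5, Adam optimization and batch size 128.
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(128, activation="relu",
                 kernel_initializer="glorot_uniform"),  # Xavier initialization
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.1),  # 0.1 follows Table 4
              loss="binary_crossentropy")
# model.fit(X_train, y_train, batch_size=128, epochs=50, ...)
```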
Performance evaluation
K-fold cross-validation is usually the first step in checking the accuracy of an algorithm. Take 10-fold cross-validation as an example: the dataset is divided into 10 parts; nine are used as training data and one as testing data. Each part serves as the testing set once, so the model is trained and tested 10 times in total, and the average accuracy across the 10 runs is used as an estimate of the accuracy of the algorithm.
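A minimal sketch with scikit-learn follows; a logistic regression stands in for a deep model purely to show the mechanics of 10-fold cross-validation on placeholder data:

```python
# 10-fold cross-validation: cross_val_score trains and tests 10 times and
# returns one score per fold; the mean estimates overall accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = np.random.rand(500, 100), np.random.randint(0, 2, 500)  # placeholder data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean())  # average accuracy across the 10 folds
```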
After this step, there are many criteria for measuring the performance of a deep learning model, such as accuracy and the F1-measure. We summarize the criteria most commonly used in omics and deep learning research, together with their computational methods, in Table S1 in the Supplementary Materials.
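For example, assuming scikit-learn, several common criteria can be computed directly from predicted labels and probabilities:

```python
# Common evaluation criteria computed on small placeholder predictions.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1]            # ground-truth labels
y_pred = [0, 1, 0, 0, 1]            # predicted labels
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8]  # predicted probabilities (for AUC)

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))
```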
Deep learning framework
Although deep learning technology has been shown to have many great advantages, especially in omics research, it is difficult for an omics researcher who does not have a computing background to learn the necessary skills for its application. Fortunately, with the great success of deep learning technology, many well-known companies and institutions, such as Google and Microsoft, have released deep learning frameworks. We just need to learn how to build a model based on these frameworks, which is much simpler than programming a neural network by ourselves.
Many frameworks are now publicly available, and choosing one that suits our purpose can greatly increase productivity. We summarize the properties of some common frameworks in Table 5 and their advantages and disadvantages in Table 6; together, these can help us make an initial choice of a suitable framework.
Table 5. Properties of some common deep learning frameworks

| Name | Caffe | MXNet | Torch | Deeplearning4j | TensorFlow | Theano | CNTK | Neon | Keras |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Creator | UC Berkeley | CMU, UW and Microsoft | Ronan Collobert et al. | Skymind | Google | Université de Montréal | Microsoft | Nervana Systems | François Chollet |
| Interface | C++, Python, MATLAB | C++, R, Python, Scala, MATLAB, JavaScript, Go, Julia | Lua, LuaJIT, C | Java, Scala, Clojure | C++, Python, Go, Java | Python | NDL, C++, Python | Python | Python |
| Suitable models | CNN, RNN | CNN, RNN | DNN, CNN, RNN | DNN, CNN, RNN | DNN, CNN, RNN | DNN, CNN, RNN | CNN, RNN | DNN, CNN, RNN | DNN, CNN, RNN |
| OS | Linux, Win, OSX, Android | Linux, Win, OSX, Android | Linux, Win, OSX, Android, iOS | Linux, Win, OSX, Android | Linux, OSX, Win | Linux, OSX, Win | Linux, OSX, Win | OSX, Linux | Linux, Win, OSX |
| Stars on GitHub | 20212 | 11170 | 7279 | 7203 | 68800 | 6914 | 12396 | 3200 | 19589 |
| Multi-GPU | √ | √ | √ | √ | √ | × | √ | √ | × |
| Distributed | × | √ | × | √ | √ | × | √ | √ | × |
| Cloud computing | × | √ | × | × | √ | × | × | √ | × |

Note: although the various frameworks support overlapping sets of deep learning models, each framework is best suited to different models; Caffe is strongest for CNN modeling, while CNTK is strongest for RNN modeling.
Table 6. Advantages and disadvantages of some common frameworks

| Name | Advantages | Disadvantages |
| --- | --- | --- |
| TensorFlow | 1. Flexible portability; 2. Fast compilation; 3. Powerful supporting software, such as TensorFlow Serving; 4. Strong technical support services; 5. Excellent overall architecture. | 1. Documentation and interfaces are insufficiently clear; 2. Difficult to debug; 3. High memory footprint. |
| Caffe | 1. Easy to use; 2. Fast training speed; 3. Highly modular components. | 1. Inadequate support for RNNs; 2. Interfaces are incompatible across versions; 3. No support for distributed training. |
| Keras | 1. Complete documentation; 2. Easy to learn and use; 3. Updated quickly; 4. Highly modular components. | 1. Insufficiently flexible; 2. Cannot directly use multiple GPUs. |
| Theano | 1. Highly flexible; 2. Low API learning cost; 3. Good computational stability. | 1. Difficult to learn; 2. No underlying C++ interface; 3. Models are inconvenient to deploy; 4. Debugging error messages are hard to understand. |
| Torch | 1. Easy to use; 2. Highly modular components; 3. Convenient GPU use; 4. High efficiency of the Lua language. | 1. Lua is not commonly used; 2. Torch's data file format is nonstandard and needs conversion. |
| MXNet | 1. Good performance; 2. Highly flexible; 3. Memory-efficient; 4. Supports many language bindings. | 1. Poor API documentation; 2. Difficult to learn. |
| CNTK | 1. Very good performance; 2. Good scalability. | 1. Difficult to install; 2. Fewer learning materials than other frameworks. |
Speed is also an important factor in measuring the performance of a deep learning framework. Many researchers have already analyzed and compared the performance of several frameworks [133, 134]. In general, different frameworks have different strengths in different network models and different hardware conditions. In terms of overall performance, CNTK and MXNet may perform better than other frameworks.
Challenges and opportunities
In omics research, deep learning technology faces the following difficulties:
• Data volume: Deep learning models need more data than traditional models to avoid overfitting. If the amount of data is too small, deep learning models may perform worse than traditional machine learning algorithms. Although a large amount of omics data is generated every year, insufficient data remains a problem for specific tasks. For example, due to privacy or sample size limitations, gene expression profiles for some diseases are still scarce and may not suffice for deep learning models. Moreover, such datasets are often imbalanced as well, which further degrades training.
• Data quality: The essence of deep learning technology is to learn certain rules according to the input data; therefore, the quality of learning depends on the quality of the input data. However, most of the data in omics are obtained through experiments, and it is difficult to ensure that the data are accurate.
• Computation costs: The structure of a deep learning model is complex. It contains a huge number of parameters, and training them requires forward propagation, backpropagation and a series of other costly processes. Deep learning therefore demands strong computing power: training a deep learning model typically requires at least one server with multiple GPU cards, and many purely biological labs or individuals may not have such resources.
• The ‘black box’ problem: The models produced by deep learning behave like black boxes that most people cannot easily understand, and they cannot be proven or falsified by mathematical methods. Sometimes we cannot understand the real principles even when we obtain correct results. For example, a deep learning model may correctly identify a gene expression profile as that of a cancer, but it does not explain why. In omics, such explanatory power is crucial for advancing the discipline.
• Model selection and training: Deep learning technology is developing rapidly, so there are many models to choose from, and it is often not easy to select one suited to a specific problem. In addition, hyperparameters are difficult to set and adjust, and even small changes in them can substantially change the training outcome.
Although we have mentioned many deficiencies, we need not be pessimistic about deep learning technology; the many successful applications discussed above have proven its usability in the field of omics. For the problem of insufficient sample data, new techniques such as zero-shot learning [135], one-shot learning [136] and generative adversarial networks [137] can mitigate it to a certain extent. For imbalanced omics data, methods such as resampling and cost-sensitive learning [138] also provide solutions (see the sketch below). For poor data quality, the data cleansing methods explained above can improve the data. As for the ‘black box’ problem, ongoing research is making it increasingly feasible to turn the black box into a white box. For example, Lanchantin et al. [139] proposed the Deep Motif Dashboard, a toolkit providing a suite of visualization strategies to extract motifs or sequence patterns from DNN models for TFBS classification, which alleviates the ‘black box’ problem to a certain extent. The remaining problems, such as the computation involved and the difficulty of tuning hyperparameters, can also be addressed through cooperation between organizations and between experts in different fields.
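As a small sketch of cost-sensitive learning, one of the remedies mentioned above, balanced class weights can be computed with scikit-learn and passed to Keras training; the labels here are placeholders:

```python
# Cost-sensitive learning for imbalanced data: weight minority-class errors
# more heavily during training via balanced class weights.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 100)  # placeholder imbalanced labels
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))    # e.g. {0: ~0.56, 1: ~5.0}
print(class_weight)
# model.fit(X_train, y_train, class_weight=class_weight, ...)
```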
In terms of future development, reinforcement learning [140], incremental learning [141] and transfer learning [142] will be increasingly applied in deep learning research on omics. Reinforcement learning, which is closer to human learning, will greatly improve the self-learning ability of artificial intelligence, although at present it is mainly used in robotics [143]. Incremental learning addresses the problem of repetitive training: a large amount of omics data is produced every year, retraining on all the data is time-consuming, and storing historical data also consumes storage resources; incremental learning can be a good solution to this problem. Transfer learning will greatly alleviate the problem of small omics samples and can also greatly reduce training time.
Conclusion
In conclusion, deep learning technology is well suited to resolving omics problems. Because the combination of omics and deep learning is still new, we have summarized the relevant recent work and provided a guideline for this topic. We have introduced the deep learning models commonly used in omics research and summarized some recent omics studies. In addition, we have discussed the steps involved in using deep learning technology and some well-known deep learning frameworks, which have, until now, not been systematically discussed in the literature on this topic. A researcher interested in this field can gain a general understanding of deep learning from our survey. Although deep learning technology does have limitations in its application to omics, these are being resolved, and in the future it will play an increasingly important role in this field.
Key Points

• With the advent of the big data era, huge amounts of high-dimensional, complex structured omics data have rendered conventional machine learning algorithms powerless; fortunately, deep learning technology can contribute toward resolving these challenges.
• We introduce several deep learning models and discuss several research areas that have combined omics and deep learning in recent years.
• We summarize the general steps involved in using deep learning, which have not yet been systematically discussed in the literature on this topic.
• The features of some mainstream deep learning frameworks are discussed in detail, and we put forward our own views on the opportunities and challenges of deep learning in omics research.
• Overall, our review provides a detailed guideline on deep learning technology for omics researchers.
Funding
This work was supported by National Key R&D Program of China (2018YFC090002, 2017YFB0202602, 2017YFC1311003, 2016YFC1302500, 2016YFB0200400, 2017YFB0202104); National Natural Science Foundation of China (NSFC) (61772543, U1435222, 61625202, 61272056); the Funds of State Key Laboratory of Chemo/Biosensing and Chemometrics; the Fundamental Research Funds for the Central Universities; Guangdong Provincial Department of Science and Technology (2016B090918122).
Zhiqiang Zhang is a master's student at the National University of Defense Technology. His research interests include bioinformatics, high-performance computing and artificial intelligence.
Yi Zhao is a professor at the Chinese Academy of Sciences. His research interests include bioinformatics and intelligent information processing.
Xiangke Liao is a professor at the National University of Defense Technology and a member of the Chinese Academy of Engineering. His research interests include machine learning, large-scale scientific computing and quantum computing.
Wenqiang Shi recently received his PhD from the National University of Defense Technology and also studied at the University of British Columbia for some time. His research interests include bioinformatics and artificial intelligence.
Kenli Li is a professor at Hunan University. His research interests include cloud computing, biological computing and big data management.
Quan Zou is a professor at Tianjin University and the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. His research interests include bioinformatics, large-scale data mining and parallel computing applications.
Shaoliang Peng is a professor at the National University of Defense Technology and Hunan University, and the executive director of the National Supercomputing Center in Changsha. His research interests include high-performance computing, bioinformatics, big data, virtual screening and biology simulation.
References
Author notes
These authors contributed equally to this work.