Deep learning for natural language processing: advantages and challenges

Natural language processing is an important direction in artificial intelligence, studying theories and methods that enable computers to use human language, i.e. natural language. Deep learning refers to machine learning technologies based on deep neural networks. Deep learning has now been successfully applied to natural language processing and significant progress has been made. This paper summarizes the achievements of deep learning in natural language processing and discusses the advantages and challenges of the technology. We view natural language processing as having five major tasks: classification, matching, translation, structured prediction and the sequential decision process. For the first four of these tasks, deep learning approaches have outperformed or significantly outperformed the traditional approaches and have become the state-of-the-art technologies for the problems. For the fifth task, the sequential decision process, which includes multi-turn dialogue, how much deep learning can contribute has not been fully verified. Among the applications of deep learning to natural language processing, progress in machine translation is particularly remarkable, and it is becoming the representative technology of the field. Moreover, deep learning has for the first time made certain applications possible, for example image retrieval and generation-based natural language dialogue. The advantages of deep learning for natural language processing lie mainly in end-to-end training and representation learning, which distinguish it from traditional machine learning approaches and make it a powerful tool for natural language processing. Deep learning also faces challenges, such as the lack of theoretical foundation and model interpretability, and the need for large amounts of data and powerful computing resources. It furthermore faces challenges unique to natural language processing, such as the long tail problem, the combination with symbolic processing, and inference and decision making. It is foreseeable that combining deep learning with other technologies (reinforcement learning, inference, knowledge) will take natural language processing to a new level.


INTRODUCTION
Deep learning refers to machine learning technologies for learning and utilizing 'deep' artificial neural networks, such as deep neural networks (DNN), convolutional neural networks (CNN) and recurrent neural networks (RNN). Recently, deep learning has been successfully applied to natural language processing and significant progress has been made. This paper summarizes the recent advancement of deep learning for natural language processing and discusses its advantages and challenges.
We think that there are five major tasks in natural language processing, including classification, matching, translation, structured prediction and the sequential decision process. For the first four tasks, it is found that the deep learning approach has outperformed or significantly outperformed the traditional approaches.
End-to-end training and representation learning are the key features of deep learning that make it a powerful tool for natural language processing. Deep learning is not almighty, however. It might not be sufficient for inference and decision making, which are essential for complex problems like multi-turn dialogue. Furthermore, how to combine symbolic processing and neural processing, how to deal with the long tail phenomenon, etc. are also challenges of deep learning for natural language processing.

PROGRESS IN NATURAL LANGUAGE PROCESSING
In our view, there are five major tasks in natural language processing, namely classification, matching, translation, structured prediction and the sequential decision process. Most problems in natural language processing can be formalized as one of these five tasks, as summarized in Table 1. In these tasks, words, phrases, sentences, paragraphs and even documents are usually viewed as sequences of tokens (strings) and treated similarly, although they differ in complexity. In fact, sentences are the most widely used processing units.
It has recently been observed that deep learning can enhance performance on the first four tasks and has become the state-of-the-art technology for them (e.g. [1][2][3][4][5][6][7][8]). Table 2 shows the performances on example problems for which deep learning has surpassed traditional approaches. Among all NLP problems, progress in machine translation is particularly remarkable. Neural machine translation, i.e. machine translation using deep learning, has significantly outperformed traditional statistical machine translation. The state-of-the-art neural machine translation systems employ sequence-to-sequence learning models comprising RNNs [4][5][6].
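The sequence-to-sequence idea mentioned above can be illustrated with a toy sketch: an encoder RNN compresses the source tokens into a fixed-length context vector, and a decoder RNN produces a distribution over target tokens from that context. Everything here (the tiny vocabulary, random parameters, shared source/target embeddings) is a hypothetical simplification for illustration, not an actual NMT system, which would learn these parameters from a parallel corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared vocabulary and randomly initialised parameters (illustrative
# only; a real system learns these and uses separate source/target vocabularies).
vocab = {"<s>": 0, "hello": 1, "world": 2, "</s>": 3}
V, H = len(vocab), 8
E = rng.normal(0, 0.1, (V, H))      # token embeddings
W_enc = rng.normal(0, 0.1, (H, H))  # encoder recurrence weights
W_dec = rng.normal(0, 0.1, (H, H))  # decoder recurrence weights
W_out = rng.normal(0, 0.1, (H, V))  # projection to target vocabulary

def encode(tokens):
    """Run a simple RNN over the source tokens; the final hidden state
    serves as the fixed-length 'context' vector."""
    h = np.zeros(H)
    for t in tokens:
        h = np.tanh(E[vocab[t]] + W_enc @ h)
    return h

def decode_step(h, prev_token):
    """One decoder step: update the hidden state, then softmax-score
    every token in the target vocabulary."""
    h = np.tanh(E[vocab[prev_token]] + W_dec @ h)
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()

context = encode(["hello", "world", "</s>"])
h, p = decode_step(context, "<s>")
print(p.shape)  # a probability distribution over the 4 target tokens
```

Decoding would repeat `decode_step`, feeding back the most probable token each time, until `</s>` is emitted.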
Deep learning has also, for the first time, made certain applications possible. For example, deep learning has been successfully applied to image retrieval (also known as text to image), in which the query and the image are first transformed into vector representations with CNNs, the representations are matched with a DNN, and the relevance of the image to the query is calculated [3]. Deep learning is also employed in generation-based natural language dialogue, in which, given an utterance, the system automatically generates a response, with the model trained in a sequence-to-sequence learning fashion [7]. The fifth task, the sequential decision process such as the Markov decision process, is the key issue in multi-turn dialogue, as explained below. It has not been thoroughly verified, however, how much deep learning can contribute to this task.

ADVANTAGES AND CHALLENGES
Deep learning certainly has advantages and challenges when applied to natural language processing, as summarized in Table 3.

Advantages
We think that, among the advantages, end-to-end training and representation learning really differentiate deep learning from traditional machine learning approaches, and make it powerful machinery for natural language processing.
It is often possible to perform end-to-end training in deep learning for an application. This is because the model (a deep neural network) offers rich representability, and information in the data can be effectively 'encoded' in the model. For example, in neural machine translation, the model is constructed completely automatically from a parallel corpus, and usually no human intervention is needed. This is a clear advantage over the traditional approach of statistical machine translation, in which feature engineering is crucial.
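The essence of end-to-end training, that the mapping from input to output is learned directly from data by gradient descent with no hand-engineered features, can be sketched on a deliberately tiny problem. The synthetic data and the linear model below are stand-ins chosen for brevity; a real NLP system would use a deep network and a large corpus, but the loop is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic supervision: the target mapping y = 3x - 2 plays the role of
# the parallel corpus; the model must discover it on its own.
x = rng.normal(size=(100, 1))
y = 3 * x - 2

w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(pred - y)         # d(MSE)/db
    w -= lr * grad_w                        # end-to-end: only the loss
    b -= lr * grad_b                        # drives the parameter updates

print(round(w, 2), round(b, 2))  # converges to roughly 3.0 and -2.0
```

No feature was designed by hand; the parameters were recovered purely from input-output pairs, which is the property the paragraph above attributes to neural machine translation.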
With deep learning, representations of data in different forms, such as text and image, can all be learned as real-valued vectors. This makes it possible to perform information processing across multiple modalities. For example, in image retrieval, it becomes feasible to match a query (text) against images and find the most relevant ones, because all of them are represented as vectors.
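Once query and images live in the same vector space, retrieval reduces to ranking by a similarity measure. The sketch below uses cosine similarity over hand-written vectors; in a real system the query vector would come from a text encoder and the image vectors from a CNN, so the file names and numbers here are purely hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical pre-computed embeddings standing in for encoder outputs.
query_vec = np.array([0.9, 0.1, 0.0])          # e.g. the query "cat"
image_vecs = {
    "cat.jpg": np.array([0.8, 0.2, 0.1]),
    "car.jpg": np.array([0.1, 0.1, 0.9]),
}

# Rank images by similarity to the query, most relevant first.
ranked = sorted(image_vecs,
                key=lambda name: cosine(query_vec, image_vecs[name]),
                reverse=True)
print(ranked)  # → ['cat.jpg', 'car.jpg']
```

The point is that text and image become directly comparable only because both were mapped into the same real-valued vector space.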

Challenges
Some challenges of deep learning are general, such as the lack of theoretical foundation, the lack of model interpretability, and the requirement for large amounts of data and powerful computing resources. There are also challenges more unique to natural language processing, namely difficulty in dealing with the long tail, incapability of directly handling symbols, and ineffectiveness at inference and decision making.
Data in natural language always follow a power-law distribution. As a result, for example, the size of the vocabulary increases as the size of the data increases. This means that, no matter how much data there is for training, there always exist cases that the training data cannot cover. How to deal with this long tail problem poses a significant challenge to deep learning; by resorting to deep learning alone, the problem would be hard to solve.
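The long tail effect is easy to demonstrate on synthetic data. The simulation below draws tokens from a Zipf-like (power-law) distribution over a hypothetical vocabulary and measures how many test tokens were never seen in training; the vocabulary size and corpus sizes are arbitrary choices for illustration.

```python
import random

random.seed(0)

# Power-law token distribution: token i has weight proportional to 1/i,
# over a hypothetical vocabulary of 50,000 types.
N = 50_000
weights = [1.0 / i for i in range(1, N + 1)]
corpus = random.choices(range(1, N + 1), weights=weights, k=200_000)

train, test = corpus[:150_000], corpus[150_000:]
train_vocab = set(train)

# Fraction of test tokens never observed in training (out-of-vocabulary).
oov = sum(1 for t in test if t not in train_vocab) / len(test)
print(f"train vocab size: {len(train_vocab)}, test OOV rate: {oov:.1%}")
```

However large the training split is made, the OOV rate never reaches zero, which is exactly the coverage problem described above.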
Language data is by nature symbolic, which differs from the vector data (real-valued vectors) that deep learning normally utilizes. Currently, symbol data in language are converted into vector data and then input into neural networks, and the output of the neural networks is converted back into symbol data. In fact, a large amount of knowledge for natural language processing is in the form of symbols, including linguistic knowledge (e.g. grammar), lexical knowledge (e.g. WordNet) and world knowledge (e.g. Wikipedia). Currently, deep learning methods have not yet made effective use of this knowledge. Symbol representations are easy to interpret and manipulate, while vector representations are robust to ambiguity and noise. How to combine symbol data and vector data, and how to leverage the strengths of both, remains an open question for natural language processing.
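The symbol-to-vector round trip described above can be sketched minimally: an embedding table maps each symbol to a vector, and the reverse direction maps a vector back to its nearest symbol. The three-word vocabulary and the `<unk>` catch-all are hypothetical; real systems use vocabularies of tens of thousands of learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny embedding table mapping symbols to vectors; "<unk>" absorbs
# any symbol never seen during training.
vocab = {"<unk>": 0, "deep": 1, "learning": 2}
embeddings = rng.normal(size=(len(vocab), 4))

def to_vectors(tokens):
    """Symbol -> vector: the direction neural networks consume."""
    return np.stack([embeddings[vocab.get(t, vocab["<unk>"])] for t in tokens])

def to_symbols(vectors):
    """Vector -> symbol: map each vector to its nearest embedding."""
    inv = {i: t for t, i in vocab.items()}
    return [inv[int(np.argmin(np.linalg.norm(embeddings - v, axis=1)))]
            for v in vectors]

vecs = to_vectors(["deep", "learning", "grammar"])  # "grammar" is out of vocabulary
print(vecs.shape, to_symbols(vecs))  # → (3, 4) ['deep', 'learning', '<unk>']
```

Note what is lost in each direction: "grammar" collapses to `<unk>` on the way in, and any symbolic structure (grammar rules, WordNet relations) has no obvious slot in the vector side, which is the open question the paragraph raises.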
There are complex tasks in natural language processing that may not be easily realized with deep learning alone. For example, multi-turn dialogue is a very complicated process. It involves language understanding, language generation, dialogue management, knowledge base access and inference. Dialogue management can be formalized as a sequential decision process, in which reinforcement learning can play a critical role. A combination of deep learning and reinforcement learning could therefore be useful for the task, which goes beyond deep learning itself.
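To make the sequential decision view concrete, here is tabular Q-learning on a deliberately tiny, invented dialogue MDP: the system should confirm the user's request before booking. The states, actions and rewards are hypothetical; a real dialogue manager would use learned state representations (where deep learning re-enters) and far richer action sets.

```python
import random

random.seed(3)

# Toy dialogue MDP (hypothetical): confirm the request, then book it.
states = ["start", "confirmed", "done"]
actions = ["ask", "confirm", "book"]

def step(s, a):
    """Environment dynamics: reward only for booking after confirming."""
    if s == "start" and a == "confirm":
        return "confirmed", 0.0
    if s == "confirmed" and a == "book":
        return "done", 1.0
    return s, -0.1  # any other move wastes a dialogue turn

Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(2000):                       # training dialogues
    s = "start"
    for _ in range(10):                     # at most 10 turns each
        a = (random.choice(actions) if random.random() < eps
             else max(actions, key=lambda act: Q[(s, act)]))
        s2, r = step(s, a)
        # Standard Q-learning update.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2
        if s == "done":
            break

policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in states[:2]}
print(policy)  # → {'start': 'confirm', 'confirmed': 'book'}
```

The learned policy orders the dialogue moves correctly from delayed reward alone, which is the kind of credit assignment that supervised deep learning by itself does not provide.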
In summary, there are still a number of open challenges with regard to deep learning for natural language processing. Deep learning, when combined with other technologies (reinforcement learning, inference, knowledge), may further push the frontier of the field.

FUNDING
This work is supported in part by the National Basic Research Program of China (973 Program, 2014CB340301).
