Abstract

Affective computing is a key research topic in artificial intelligence which is applied to psychology and machines. It consists of the estimation and measurement of human emotions. A person’s body language is one of the most significant sources of information during job interview, and it reflects a deep psychological state that is often missing from other data sources. In our work, we combine two tasks of pose estimation and emotion classification for emotional body gesture recognition to propose a deep multi-stage architecture that is able to deal with both tasks. Our deep pose decoding method detects and tracks the candidate’s skeleton in a video using a combination of depthwise convolutional network and detection-based method for 2D pose reconstruction. Moreover, we propose a representation technique based on the superposition of skeletons to generate for each video sequence a single image synthesizing the different poses of the subject. We call this image: ‘history pose image’, and it is used as input to the convolutional neural network model based on the Visual Geometry Group architecture. We demonstrate the effectiveness of our method in comparison with other methods in the state of the art on the standard Common Object in Context keypoint dataset and Face and Body gesture video database.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
You do not currently have access to this article.