Audio Visual Recognition of Spontaneous Emotions In-the-Wild. Host Publication: 7th Chinese Conference on Pattern Recognition, Communications in Computer and Information Science. Authors: X. Xia, L. Guo, D. Jiang, E. Pei, L. Yang and H. Sahli. Publisher: Springer, Singapore. Publication Year: 2016.
Abstract: In this paper, we target the CCPR 2016 Multimodal Emotion Recognition Challenge (MEC 2016), which is based on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) of movies and TV programs showing (nearly) spontaneous human emotions. Low-level descriptors (LLDs) are proposed as audio features. As visual features, we propose using histograms of oriented gradients (HOG), local phase quantisation (LPQ), shape features, and behavior-related features such as head pose and eye gaze. The visual features are post-processed to delete or smooth the all-zero feature vector segments. Single-modal emotion recognition is performed using fully connected hidden Markov models (HMMs). For multimodal emotion recognition, two fusion schemes are proposed: 1) fusing the normalized probability vectors from the HMMs by a support vector machine (SVM), and 2) using the LLD features when the visual feature sequences are all-zero vector sequences. Moreover, to make full use of the labeled data and to overcome the problem of unbalanced data, we use the training set and validation set together to train the HMMs and SVMs, with parameters optimized via cross-validation experiments. Experimental results on the test set show that the macro average precision (MAP) of audio emotion recognition is 42.85%, the HOG features obtain the best visual emotion recognition performance with MAP reaching 54.24%, and multimodal fusion scheme 2 obtains 53.90% MAP, all of which are much higher than the baseline results (24.02%, 34.28%, and 30.63% for audio, visual, and multimodal recognition, respectively). The obtained classification accuracies are also much higher than the baseline.
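The abstract's fusion scheme 1 can be sketched as follows: the per-class likelihood scores produced by the audio and visual HMMs are normalized into probability vectors, concatenated, and fed to an SVM. This is a minimal illustration, not the authors' implementation; the number of emotion classes, the synthetic likelihood scores, and the SVM kernel here are assumptions for the sketch.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes = 8   # assumed number of emotion categories for this sketch
n_samples = 40  # synthetic sample count, stand-in for real utterances

# Stand-ins for per-class likelihood scores from the audio and visual HMMs.
audio_ll = rng.random((n_samples, n_classes)) + 1e-6
visual_ll = rng.random((n_samples, n_classes)) + 1e-6
labels = rng.integers(0, n_classes, n_samples)

def normalize(ll):
    # Scale each sample's per-class scores so they sum to 1,
    # yielding a probability vector per modality.
    return ll / ll.sum(axis=1, keepdims=True)

# Fusion scheme 1: concatenate the normalized probability vectors
# of both modalities and classify the joint vector with an SVM.
fused = np.hstack([normalize(audio_ll), normalize(visual_ll)])
clf = SVC(kernel="rbf").fit(fused, labels)
pred = clf.predict(fused)
```

In the paper's pipeline the SVM would be trained on the combined training and validation sets, with hyperparameters chosen by cross-validation; here a default RBF kernel is used purely to keep the sketch self-contained.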