An efficient model-level fusion approach for continuous affect recognition from audiovisual signals
This publication appears in: Neurocomputing
Authors: D. Jiang and H. Sahli
Volume: 376
Pages: 42-53
Publication Year: 2020
Abstract: Continuous affect recognition has great potential in human-computer interaction applications. Efficiently fusing speech and facial information to infer the affective state of a person from data captured in real-world conditions is a key issue in continuous affect recognition. Currently, late fusion is commonly used in multi-modal continuous affect recognition to improve system performance; however, late fusion ignores the complementarity and redundancy between the streams from the different modalities. In this work, we propose an efficient model-level fusion approach for audiovisual continuous affect recognition. First, we introduce an LSTM-based model-level fusion architecture that accounts for the complementarity and redundancy between the streams from the different modalities. In addition, the model can efficiently incorporate side information, such as gender, through an adaptive weight network. Finally, we design an effective optimization strategy based on deep supervision for training the proposed audiovisual continuous affect recognition model. We demonstrate the effectiveness of our approach on the RECOLA dataset. The experimental results show that the proposed adaptive weight network improves performance compared to a plain neural network without adaptive weights, and that our approach achieves considerable improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC) compared to state-of-the-art early fusion and model-level fusion approaches. We therefore believe that the proposed approach provides a promising direction for further improving the performance of audiovisual continuous affect recognition.
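The abstract describes model-level fusion of audio and video feature streams with an LSTM, an adaptive weight network driven by side information such as gender, and evaluation in terms of CCC. The sketch below is a minimal illustration of these ideas, not the authors' implementation: all layer sizes, feature dimensions, the single-layer weight network, and the class and function names are assumptions made for the example.

```python
# Hypothetical sketch of model-level audiovisual fusion with an adaptive weight
# network (assumed design, not the paper's released code).
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    def __init__(self, audio_dim=88, video_dim=136, hidden_dim=64, side_dim=1):
        super().__init__()
        fused_dim = audio_dim + video_dim
        # Side-information branch (e.g. gender) predicts per-feature weights.
        self.weight_net = nn.Sequential(nn.Linear(side_dim, fused_dim), nn.Sigmoid())
        # Shared LSTM over the adaptively weighted, concatenated streams.
        self.lstm = nn.LSTM(fused_dim, hidden_dim, batch_first=True)
        # Separate regression heads for arousal and valence.
        self.arousal_head = nn.Linear(hidden_dim, 1)
        self.valence_head = nn.Linear(hidden_dim, 1)

    def forward(self, audio, video, side):
        # audio: (B, T, audio_dim), video: (B, T, video_dim), side: (B, side_dim)
        fused = torch.cat([audio, video], dim=-1)          # model-level fusion
        w = self.weight_net(side).unsqueeze(1)             # (B, 1, fused_dim)
        h, _ = self.lstm(fused * w)                        # weighted fused stream
        return self.arousal_head(h), self.valence_head(h)  # per-frame predictions

def ccc(pred, gold):
    # Concordance correlation coefficient, the evaluation metric named in the abstract.
    pm, gm = pred.mean(), gold.mean()
    pv, gv = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pm) * (gold - gm)).mean()
    return 2 * cov / (pv + gv + (pm - gm) ** 2)

# Example forward pass with made-up shapes (88 acoustic and 136 facial features assumed).
model = AdaptiveWeightFusion()
audio = torch.randn(4, 100, 88)
video = torch.randn(4, 100, 136)
gender = torch.randint(0, 2, (4, 1)).float()
arousal, valence = model(audio, video, gender)
```

One plausible reading of the adaptive weight network is shown here as feature-wise gating of the fused representation conditioned on the side information; the paper's actual weighting mechanism and deep-supervision training strategy are not reproduced.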