Abstract:

Different from the previous studies with only text, this paper focuses on multimodal data (text and audio) to perform emotion recognition. To simultaneously address the characteristics of multimodal data, we propose a novel joint learning framework, which allows auxiliary task (multimodal sentiment classification) to help the main task (multimodal emotion classification). Specifically, private neural layers are designed for text and audio modalities from the main task to learn the uni-modal independent dynamics. Secondly, with the shared neural layers from auxiliary task, we obtain the uni-modal representations of the auxiliary task and the auxiliary representations of the main task. The uni-modal independent dynamics is combined with the auxiliary representations for each modality to acquire the uni-modal representations of the main task. Finally, in order to capture multimodal interactive dynamics, we fuse the text and audio modalities’ representations for the main and auxiliary tasks separately to obtain the final multimodal emotion and sentiment representations with the self attention mechanism. Empirical results demonstrate the effectiveness of our approach to multimodal emotion classification task as well as the sentiment classification task.

Logo

电影级数字人,免显卡端渲染SDK,十行代码即可调用,工业级demo免费开源下载!

更多推荐