This patent application claims the benefit and priority of Chinese Patent Application No. 202211139018.0, filed with the China National Intellectual Property Administration on Sep. 19, 2022, the disclosure of which is incorporated by reference herein in its entirety as part of the present application.
The present disclosure belongs to the field of multimodal emotion recognition, at the intersection of natural language processing, computer vision, and speech processing, and relates to a method for multimodal emotion classification based on modal space assimilation and contrastive learning, and in particular to a method for determining a subject's emotion state by assimilating heterogeneous multimodal spaces with a guidance vector and constraining the resulting multimodal representation with supervised contrastive learning.
Emotion analysis typically involves data such as text, video, and audio. Previous studies have confirmed that such single-modal data carries information relevant to determining emotion states, but have also found that analyzing the data of a single modality alone rarely yields accurate emotion analysis. Using information from a plurality of modalities allows a model to perform more accurate emotion analysis: the complementarity between the modalities compensates for the one-sidedness and uncertainty of any single modality, which effectively enhances the generalization ability and robustness of the model and improves the performance of the emotion analysis task.
An existing fusion model based on an attention mechanism is designed to establish a compact multimodal representation from information extracted from each modality and to perform emotion analysis based on that representation; such fusion models have therefore attracted an increasing number of researchers. Typically, attention coefficients between the information of the text modality and the information of the other two modalities (video and audio) are first obtained by the attention mechanism, and multimodal fusion is then performed based on the obtained attention coefficients. However, the interactive relationships among the information of all modalities are neglected. Moreover, a gap exists between modalities and there is redundancy within each modality, both of which may increase the difficulty of learning a joint embedding space. Existing multimodal fusion methods rarely take these two details into account and cannot guarantee that the information of the plurality of modalities used for interaction is fine-grained, which has a certain influence on final task performance.
An existing multimodal fusion model based on a transformation network has a great advantage in modeling time dependence, and the self-attention mechanism involved can effectively handle non-alignment between the data of a plurality of modalities; such models have therefore received extensive attention. The multimodal fusion model obtains a cross-modal common subspace by transforming the distribution of a source modality into the distribution of a target modality and uses the cross-modal common subspace as the multimodal fused information. However, because the solution space is obtained by transforming the source modality into another modality, it may be overly dependent on the contribution of the target modality, and when the data of a modality is missing, the solution space will lack the contribution of that modality. This results in a failure to effectively balance the contributions of the modalities to the final solution space. In another aspect, an existing transformation model usually takes into account only transformation from text to audio and from text to video, and does not take into account transformations between other modalities, which has a certain influence on the final task performance.
Chinese patent No. CN114722202A discloses multimodal emotion classification using a bidirectional double-layer attention long short-term memory (LSTM) network, where more comprehensive time dependence can be explored using the bidirectional attention LSTM network. Chinese patent No. CN113064968A provides an emotion analysis method based on a tensor fusion network, where interaction between modalities is modeled using the tensor network. However, it is hard for these two networks to effectively explore a multimodal emotion context from a long sequence, which may limit the expression ability of a learning model. Chinese patent No. CN114973062A discloses a method for multimodal emotion analysis based on a Transformer. The method uses paired cross-modal attention mechanisms to capture interaction between sequences of a plurality of modalities across different time strides, thereby potentially mapping a sequence from one modality into another modality. However, the redundant information of an auxiliary modality is neglected, which increases the difficulty of performing effective reasoning on multimodal information. More importantly, an attention-based framework mainly focuses on static or implicit interaction between the modalities, which may result in a relatively coarse-grained multimodal emotion context.
In view of the shortcomings of the prior art, a first objective of the present disclosure is to provide a method for multimodal emotion classification based on modal space assimilation and contrastive learning, where a TokenLearner module is proposed to establish a guidance vector composed of complementary information between modalities. Firstly, this module calculates a weight map for each modality based on the multi-head attention scores of that modality. Each modality is then mapped into a new vector according to the obtained weight map, and an orthogonality constraint guarantees that the information contained in these new vectors is complementary. Finally, a weighted average of the vectors is calculated to obtain the guidance vector. The learned guidance vector guides each modality to concurrently approach a solution space, which renders the heterogeneous spaces of the three modalities isomorphic. Such a strategy avoids the problem of unbalanced contributions of the modalities to the final solution space and can effectively explore a more complicated multimodal emotion context. To further improve the ability of the model to distinguish between various emotions, supervised contrastive learning is used as an additional constraint for fine-tuning the model. With the aid of label information, the model is capable of capturing a more comprehensive multimodal emotion context.
The present disclosure adopts the technical solutions as follows.
A method for multimodal emotion classification based on modal space assimilation and contrastive learning includes the following steps:
$Z = \frac{1}{3}\sum_{m} w_m \cdot Z_m, \quad m \in \{t, a, v\} \qquad (6)$

$[H_m^{l+1}, \_] = \mathrm{Transformer}([H_m^{l}, Z^{l}]; \theta_m) \qquad (7)$

$[H_m^{l+1}, \_] = \mathrm{MLP}(\mathrm{LN}(y^{l})) + \mathrm{MSA}(\mathrm{LN}([H_m^{l}, Z^{l}])) + [H_m^{l}, Z^{l}] \qquad (8)$
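Equations (7) and (8) amount to appending the guidance tokens $Z^{l}$ to the modality sequence $H_m^{l}$, refining the concatenation with a pre-norm transformer block (multi-head self-attention followed by an MLP, each with a residual connection), and keeping only the modality part of the output. The following is a minimal PyTorch sketch under assumed dimensions (hidden size 128, 4 heads, batch-first tensors); the class and argument names are illustrative and not part of the disclosure.

```python
import torch
import torch.nn as nn

class GuidedBlock(nn.Module):
    """Illustrative pre-norm transformer block in the spirit of Eqs. (7)-(8):
    the guidance tokens Z^l are appended to the modality sequence H_m^l,
    refined by MSA + MLP with residuals, and only the modality tokens are
    kept (the '_' in Eq. (7))."""
    def __init__(self, dim=128, heads=4, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio),
                                 nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, h_m, z):                  # h_m: (B, T, D), z: (B, S, D)
        x = torch.cat([h_m, z], dim=1)          # [H_m^l, Z^l]
        xn = self.ln1(x)
        y = self.msa(xn, xn, xn)[0] + x         # y^l = MSA(LN(x)) + x
        out = self.mlp(self.ln2(y)) + y         # MLP(LN(y^l)) + y^l
        return out[:, :h_m.size(1)]             # keep H_m^{l+1}, drop guided tokens
```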
During training, prediction quality may be estimated using a mean absolute error loss:
$\mathcal{L}_{task} = \mathrm{MAE}(\hat{y}, y) \qquad (13)$

$\mathcal{L}_{overall} = \alpha \mathcal{L}_{task} + \beta \mathcal{L}_{diff} + \gamma \mathcal{L}_{scl} \qquad (14)$
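As a minimal sketch of Eqs. (13) and (14), the overall objective can be assembled as a weighted sum of the task, orthogonality (diff), and supervised contrastive (scl) terms. The weights alpha, beta, and gamma below are placeholder values, not values disclosed herein.

```python
import torch
import torch.nn.functional as F

def overall_loss(y_hat, y, diff_loss, scl_loss, alpha=1.0, beta=0.1, gamma=0.1):
    """Sketch of Eqs. (13)-(14): L_task is the mean absolute error between the
    predicted and true emotion intensity; the overall loss is a weighted sum of
    the task, orthogonality (diff), and supervised contrastive (scl) terms."""
    task = F.l1_loss(y_hat, y)                  # MAE(y_hat, y)
    return alpha * task + beta * diff_loss + gamma * scl_loss
```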
A second objective of the present disclosure is to provide an electronic device, including a processor and a memory, where the memory stores machine-executable instructions capable of being executed by the processor, and the processor is configured to execute the machine-executable instructions to implement the method.
A third objective of the present disclosure is to provide a machine-readable storage medium, storing machine-executable instructions which, when called and executed by a processor, cause the processor to implement the method.
The present disclosure has the following beneficial effects:
The present disclosure introduces the concept of assimilation. A guidance vector is utilized to guide the space where each modality is located to simultaneously approach a solution space, so that the heterogeneous spaces of the modalities are assimilated. Such a strategy avoids the problem of unbalanced contributions of the modalities to the final solution space and can effectively explore a more complicated multimodal emotion context. Meanwhile, the guidance vector guiding each single modality is composed of complementary information between the modalities, which makes the model focus more on emotion features. Thus, intra-modal redundancy that may increase the difficulty of obtaining a multimodal representation is naturally removed.
By combining a dual learning mechanism with a self-attention mechanism, directional long-term interactive cross-modal fused information between a modality pair is mined in the process of transforming one modality into another. Meanwhile, the dual learning technique enhances the robustness of the model and thus copes well with the inherent problem of missing modal data in multimodal learning. Next, a hierarchical fusion framework is constructed on this basis to splice together all cross-modal fused information having the same source modality, and a one-dimensional convolutional layer is used to perform high-level multimodal fusion. This is an effective complement to existing multimodal fusion frameworks in the field of emotion recognition. Moreover, supervised contrastive learning is introduced to help the model identify the differences between categories, thereby improving the ability of the model to distinguish between different emotions.
The present disclosure is described in detail below with reference to the accompanying drawings.
A method for multimodal emotion classification based on modal space assimilation and contrastive learning provided in the present disclosure, as shown in the accompanying drawing, includes the following steps.
Step 1, data of a plurality of modalities is acquired.
Data of a plurality of modalities of a subject is recorded when the subject performs a particular emotion task. The plurality of modalities include a text modality, an audio modality, and a video modality.
Step 2, the data of the plurality of modalities is preprocessed.
A primary feature is extracted from each modality through a particular network:
$H_t = \mathrm{BERT}(T), \quad H_a = \mathrm{Transformer}(A), \quad H_v = \mathrm{Transformer}(V) \qquad (1)$
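A minimal sketch of Eq. (1) in PyTorch is given below. It assumes the HuggingFace Transformers library for the BERT text encoder and standard transformer encoders for audio and video; the hidden size (128) and the audio/video feature dimensions (74 and 35, typical of publicly released CMU-MOSI/MOSEI features) are illustrative assumptions, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumes HuggingFace Transformers is installed

class PrimaryEncoders(nn.Module):
    """Sketch of Eq. (1): BERT for the text modality and transformer encoders
    for the audio and video modalities, each projected to a shared size."""
    def __init__(self, d_model=128, a_dim=74, v_dim=35):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.t_proj = nn.Linear(self.bert.config.hidden_size, d_model)
        self.a_proj = nn.Linear(a_dim, d_model)
        self.v_proj = nn.Linear(v_dim, d_model)
        a_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        v_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.a_enc = nn.TransformerEncoder(a_layer, num_layers=2)
        self.v_enc = nn.TransformerEncoder(v_layer, num_layers=2)

    def forward(self, input_ids, attention_mask, audio, video):
        # H_t from BERT, H_a and H_v from transformer encoders over raw features
        h_t = self.t_proj(self.bert(input_ids, attention_mask=attention_mask).last_hidden_state)
        h_a = self.a_enc(self.a_proj(audio))
        h_v = self.v_enc(self.v_proj(video))
        return h_t, h_a, h_v
```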
Step 3, a guidance vector is established to guide a modal space.
In the proposed multimodal fusion framework, the TokenLearner module is one of the core processing modules. During multimodal fusion, one such module is designed for each modality to extract complementary information between the modalities, whereby a guidance vector is established to simultaneously guide each modal space to approach a solution space. This guarantees that the contribution of each modality to the final solution space is identical.
Firstly, a multi-head attention score matrix MultiHead(Q, K) of each modality is calculated based on the data $H_m\ (m \in \{t, a, v\})$ of the plurality of modalities. One-dimensional convolution is then carried out on the matrix and a softmax function is applied after the convolution, whereby a weight matrix is obtained. The number of rows of the weight matrix is far less than the number of rows of $H_m$. The weight matrix is multiplied by the data $H_m$ to extract the information $Z_m\ (m \in \{t, a, v\})$.
A weighted average of $Z_m\ (m \in \{t, a, v\})$, which contains the complementary information between the modalities, is calculated to establish the guidance vector Z in the current state.
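The following is a hedged PyTorch sketch of the per-modality TokenLearner and of Eq. (6). The treatment of the one-dimensional convolution step (reducing the per-head attention scores to S rows, with S far less than the sequence length T, before the softmax) is one plausible interpretation of the description above; the token count, head count, and scalar weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenLearner(nn.Module):
    """Illustrative TokenLearner: self-attention scores over H_m are reduced by
    a 1-D convolution to S << T rows, normalised with softmax over the time
    axis, and used as a weight map to pool H_m into a few tokens Z_m."""
    def __init__(self, dim=128, heads=4, num_tokens=4):
        super().__init__()
        self.heads, self.dim_head = heads, dim // heads
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.conv = nn.Conv1d(heads, num_tokens, kernel_size=1)

    def forward(self, h_m):                                    # h_m: (B, T, D)
        B, T, D = h_m.shape
        q = self.q(h_m).view(B, T, self.heads, self.dim_head).transpose(1, 2)
        k = self.k(h_m).view(B, T, self.heads, self.dim_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.dim_head ** 0.5   # (B, h, T, T)
        saliency = scores.mean(dim=-1)                              # (B, h, T)
        weights = F.softmax(self.conv(saliency), dim=-1)            # (B, S, T)
        return weights @ h_m                                        # Z_m: (B, S, D)

def guidance_vector(z_t, z_a, z_v, w=(1.0, 1.0, 1.0)):
    """Eq. (6): weighted average of the per-modality tokens Z_m; the scalar
    weights w_m are placeholders (they could equally be learnable)."""
    return (w[0] * z_t + w[1] * z_a + w[2] * z_v) / 3.0
```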
Step 3 is repeated a plurality of times, and a new guidance vector Z is generated each time according to the current state of each modality to guide the modal spaces to approach the final solution space. Meanwhile, to guarantee that the information extracted by the TokenLearner modules is complementary between the modalities, an orthogonality constraint is used to train the three TokenLearner modules.
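The exact form of the orthogonality constraint is not reproduced above. A minimal sketch, assuming a common soft-orthogonality (difference) penalty on the tokens extracted by the three TokenLearner modules, is given below; it is one way such a constraint is often written, not necessarily the disclosed formulation.

```python
import torch

def orthogonality_loss(z_t, z_a, z_v):
    """Penalise pairwise overlap between the tokens extracted by the three
    TokenLearner modules so that each module captures information the others
    do not (squared inner products between token sets, averaged)."""
    def pair(a, b):                               # a, b: (B, S, D)
        return (a @ b.transpose(-2, -1)).pow(2).mean()
    return pair(z_t, z_a) + pair(z_t, z_v) + pair(z_a, z_v)
```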
Step 4, pre-training continues.
Based on step 3, after guiding a plurality of times, the last elements of the data $H_m\ (m \in \{t, a, v\})$ of the plurality of modalities are extracted and integrated into a compact multimodal representation $H_{final}$. To enable the model to distinguish between various emotions more easily, supervised contrastive learning is introduced to constrain the multimodal representation $H_{final}$. This strategy introduces label information: with the label information fully utilized, samples of the same emotion are pulled closer together and samples of different emotions mutually repel. Finally, the final fused information is input to a linear classification layer, and the output is compared with the emotion category label to obtain the final classification result.
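A hedged sketch of the supervised contrastive constraint on $H_{final}$ follows, in the usual SupCon style: representations sharing an emotion label are pulled together and all others are pushed apart. The temperature value and the assumption of discretised emotion labels are illustrative choices, not details fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(h_final, labels, temperature=0.07):
    """SupCon-style loss on the fused representation: for each anchor, maximise
    the likelihood of in-batch samples with the same (discretised) label
    relative to all other samples in the batch."""
    z = F.normalize(h_final, dim=-1)                       # (B, D)
    sim = z @ z.t() / temperature                          # (B, B) similarities
    B = labels.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    # log-softmax over all other samples (exclude self-similarity from the sum)
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss.mean()
```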
The present disclosure is compared with several fusion methods with excellent performance on two public multimodal emotion databases: CMU multimodal opinion sentiment intensity (CMU-MOSI) and CMU multimodal opinion sentiment and emotion intensity (CMU-MOSEI). The CMU-MOSI dataset is composed of 2199 video clips collected from 93 opinion videos downloaded from YouTube and includes opinions of 89 different narrators on various topics. Each video clip is manually annotated with an emotional intensity from −3 (strongly negative) to 3 (strongly positive).
The results in Table 1 report the mean absolute error (MAE), the correlation coefficient (Corr), the accuracy (Acc-2) and F1 score (F1-Score) of the binary emotion classification task, and the accuracy (Acc-7) of the seven-way emotion classification task. Although Self-MM is superior to the other existing methods, the advantages and effectiveness of the present disclosure can still be observed in Table 1. On the CMU-MOSI dataset, the present disclosure is superior to the state-of-the-art Self-MM on all indicators. Moreover, on the CMU-MOSEI dataset, the present disclosure is superior to Self-MM, with an increase of about 0.8% in Acc-2 and an improvement of 0.9% in F1-Score. Therefore, the effectiveness of the method provided in the present disclosure has been proven.