This patent application claims the benefit of and priority to Chinese Patent Application No. 202310802378.2 filed with the Chinese Patent Office on Jun. 30, 2023, which is hereby incorporated by reference herein in its entirety.
The present disclosure belongs to the field of emotional state recognition of electroencephalogram (EEG) in the field of biometric recognition, and particularly relates to a multi-source domain adaptive EEG emotional state classification method based on knowledge distillation.
Emotion recognition plays an important role in human-computer interaction. In recent years, with the improvement of computing power, emotion recognition methods based on deep learning have attracted more and more attention. These methods make decisions that reflect human emotions by deeply mining the latent objective emotional features of users.
Affective brain-computer interfaces (aBCIs) are an important application of emotion recognition. By measuring the signals of the peripheral and central nervous system, features related to the emotional state of users are extracted, and these features are used to adjust human-computer interaction (HCI). The aBCIs show potential in rehabilitation and communication.
Generally speaking, emotion recognition methods may be classified into two categories: methods based on non-physiological signals, such as facial expression images, body gestures, and voice signals; and methods based on physiological signals, such as electroencephalography (EEG), electromyography (EMG) and electrocardiogram (ECG). However, compared with non-physiological signals, physiological signals directly reflect the internal emotional state of individuals, making that state less susceptible to conscious or unconscious manipulation. Among various emotion recognition methods based on physiological signals, the EEG is one of the most commonly used, because the EEG is collected directly from the cerebral cortex and is very valuable for reflecting the psychological state of people. With the rapid development of EEG collecting technology and processing methods, emotion recognition based on the EEG has attracted more and more attention in recent years.
However, due to the low signal-to-noise ratio (SNR) and the significant individual differences between different subjects at different times, it is still a huge challenge to construct an efficient and robust emotion recognition deep learning model based on the EEG. In addition, it is very important to use existing labeled data to analyze new unlabeled data in the brain-computer interface (BCI) based on the EEG. Therefore, domain adaptation is widely used in research work: by learning from the source data distribution, a model that performs well on a related but different target data distribution is trained. However, in practice, there is usually a plurality of source domains, so that multi-source domain adaptation becomes a powerful extension of domain adaptation. Nevertheless, the technique used for domain alignment in multi-source domain adaptation is usually the maximum mean discrepancy (MMD), which only takes into account adaptation at the domain level but lacks adaptation at the data pair level. Such a limitation may lead to a lack of discriminative ability. In addition, in most multi-source domain adaptation frameworks, only the average prediction result of a plurality of single-source domain models is used as the final result, and these single-source domain models are not fully utilized.
In order to solve the defects of the prior art and make better use of the advantages of a plurality of single-source models, the present disclosure proposes a multi-source domain adaptive EEG emotional state classification method based on knowledge distillation (MS-KTF).
The technical scheme used by the present disclosure is as follows.
According to the present disclosure, the differential entropy (DE) features are used as frequency domain features of the used EEG signals, the EEGNet model is slightly modified as the feature extractor, and a single linear layer is used as a classifier to analyze the EEG signals, so as to implement the task of emotional state recognition in a cross-subject scenario and a cross-session scenario.
According to the present disclosure, the training process is divided into three steps: (1) pre-training each teacher model based on each labeled source domain; (2) based on the corresponding labeled source domain and the unlabeled target domain, performing domain adaptation for each teacher model by using a source domain classification loss (SCL), a target domain classification loss (TCL), a maximum mean discrepancy (MMD) and a pseudo-label triplet loss; (3) transferring knowledge of teachers from a plurality of single-source domains to a student model. In addition, in step (2), in order to improve the effectiveness of the pseudo-label triplet loss, a margin-based sampling strategy is used to filter the original features, and only those features whose marginal scores are higher than the preset threshold are selected as embedded features for calculating the pseudo-label triplet loss.
The embodiment of the present disclosure includes the following steps.
In Step S1, data processing is performed.
The emotional data set SEED is taken as an example for analysis, and the processing steps of the original EEG data collected by an EEG collection device are as follows.
In Step S1-1, data denoising is performed.
The data set used by the present disclosure to verify the model performance comes from SEED. First, the original EEG signal collected in the data set is down-sampled to 200 Hz, then band-pass filtering of 0-75 Hz is performed, ocular artifacts in the signals are removed by an independent component analysis (ICA) technology, and finally the traditional moving average and linear dynamic system (LDS) methods are used to further smooth the features.
In Step S1-2, differential entropy (DE) feature extraction is performed.
DE features are extracted from the EEG data after removing artifacts, and the data is segmented with a 1 s non-overlapping sliding window for each subject to obtain 3394 data samples. For each data sample x_i, the number of EEG data collecting channels is 62; and the frequency domain features of five frequency bands including δ (1-3 Hz), θ (4-7 Hz), α (8-13 Hz), β (14-30 Hz) and γ (31-50 Hz) are extracted.
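The DE extraction described above can be sketched in Python. This is an illustrative sketch only: it isolates each band with an FFT mask rather than the disclosure's exact filtering pipeline, and the function names are assumptions. Under the Gaussian assumption commonly made for band-filtered EEG, the DE of a segment equals 0.5·log(2πe·σ²).

```python
import numpy as np

def differential_entropy(segment):
    # DE of a band-filtered EEG segment, assuming it is approximately Gaussian:
    # DE = 0.5 * log(2 * pi * e * variance)
    var = np.var(segment)
    return 0.5 * np.log(2 * np.pi * np.e * var)

# the five frequency bands used in the disclosure (Hz)
BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}

def de_features(window, fs=200):
    # window: (n_channels, n_samples) 1-s EEG window; returns (n_channels, 5)
    n_ch, n = window.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.fft.rfft(window, axis=1)
    feats = np.zeros((n_ch, len(BANDS)))
    for b, (lo, hi) in enumerate(BANDS.values()):
        mask = (freqs >= lo) & (freqs <= hi)
        # inverse FFT of the band-limited spectrum approximates band-pass filtering
        band = np.fft.irfft(spectrum * mask, n=n, axis=1)
        feats[:, b] = [differential_entropy(band[c]) for c in range(n_ch)]
    return feats
```

Applied to a 62-channel, 1 s window sampled at 200 Hz, this yields one 62 x 5 feature matrix per sample, matching the sample layout described above.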
In Step S2, data definitions and data set divisions are performed.
There are two test scenarios for emotional state classification in the method: the cross-subject scenario and the cross-session scenario, and the model tests in the two scenarios have their own different data definitions and data set divisions, which are explained in detail hereinafter.
It is assumed that there are N subjects, and each subject has D different session (period) tests. The whole sample set is expressed as U = {{(X_i, Y_i)}_{j=1}^{D}}_{i=1}^{N}, where i indicates the serial number of the subject, j indicates the serial number of a session (period), X_i indicates the sample set of subject i, and the corresponding label set is Y_i.
For a task of emotional state classification in the cross-session scenario, the data set is cross-verified by using a leave-one-out method; specifically, for each subject i, the data of the 15 emotional tests in the latest session is taken as the test set; each of the remaining D−1 sessions, in a unit of session, is deemed a source domain in the training set, and finally D−1 source domains are obtained as the training set; a total of N tests are conducted and the average accuracy is calculated.
For a task of emotional state classification in the cross-subject scenario, the data set is cross-verified by using the leave-one-out method; specifically, in a session (period), the data of all 15 emotional tests of one subject is iteratively extracted, with the assumption that its emotional state labels are unknown, as the test set; from the remaining N−1 subjects, R subjects at a time are randomly and unrepeatably grouped together, each group serving as a source domain in the training set; finally, ⌊(N−1)/R⌋ (round down) source domains are obtained as the training set, a total of D×N tests are conducted, and the average accuracy is calculated.
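The cross-subject leave-one-out grouping can be sketched as follows (a hypothetical helper, not part of the disclosure; it returns the ⌊(N−1)/R⌋ disjoint source-domain groups described above):

```python
import numpy as np

def cross_subject_split(n_subjects, test_subject, group_size, seed=0):
    # Leave one subject out as the target; randomly partition the remaining
    # N-1 subjects into floor((N-1)/R) disjoint groups, one per source domain.
    rng = np.random.default_rng(seed)
    rest = [s for s in range(n_subjects) if s != test_subject]
    rng.shuffle(rest)
    n_groups = len(rest) // group_size
    return [rest[g * group_size:(g + 1) * group_size] for g in range(n_groups)]
```

For the SEED setting of N = 15 subjects with R = 3, this produces ⌊14/3⌋ = 4 source domains per held-out subject.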
In Step S3, the construction and training of the MS-KTF model is performed.
The main parameters in the neural network MS-KTF model include:
The MS-KTF model consists of two parts: teacher models, each based on a corresponding single source domain, and a student model acting on the target domain. Both the teacher models and the student model consist of two modules: a domain-specific feature extractor Nf and a label classifier Ny. The plurality of single-source domain teacher models based on the multi-source domains and the parameters of the target domain student model are initialized.
In Step S3-2, a plurality of single-source domain teacher models are pre-trained.
Based on a multi-source domain sample set, a feature extractor Nf and a label classifier Ny of each domain-specific teacher model are pre-trained using the corresponding labeled single source domain sample set, such that each domain-specific teacher model has a certain pattern recognition ability in its respective source domain.
In Step S3-3, domain adaptation is performed on feature extractors of a plurality of single-source domain teacher models.
One labeled source domain sample set and one unlabeled target domain sample set are formed into a branch. In each branch, the feature extractor Nf of the corresponding domain-specific teacher model is used to extract features from the respective source domain samples and target domain samples, mapping them from the original feature space into the embedded space.
Thereafter, the embedded features are aligned at a domain level based on the maximum mean discrepancy in the feature space; the embedded features are aligned at a data pair level based on the pseudo-label triplet loss of margin-based sampling.
By minimizing the maximum mean discrepancy and the pseudo-label triplet loss, the feature extractors Nf of a plurality of single-source domain teacher models are trained to extract domain-invariant features in the source domain and the target domain.
In Step S3-4, label classifiers Ny of a plurality of single-source domain teacher models are trained.
In each single-source domain teacher model, the extracted source domain feature information is passed through the label classifier Ny to obtain the predicted emotion Ŷ^S, and a cross-entropy between the predicted emotion Ŷ^S and the corresponding label Y^S of the actual sample is calculated; similarly, a cross-entropy between the predicted emotion Ŷ^T of the target domain feature information and the generated pseudo-label Ỹ^T is calculated.
By minimizing the two obtained cross-entropies, the label classifiers Ny of a plurality of single-source domain teacher models are trained to have good emotion classification ability in their respective source domain and respective target domain.
In Step S3-5, the knowledge of a plurality of single-source domain teacher models is merged.
Two different merging strategies are used to balance performances of the teacher models.
The first strategy, voting-based merging, is more suitable for a case in which the teacher models have poor performance balance. Based on the emotion prediction result Ŷ_teacher^T obtained by the feature extractor and the label classifier of each teacher model on an unlabeled target domain sample, a corresponding one-hot coding result Ô_teacher^T is generated; voting is performed based on the one-hot coding results Ô_teacher^T generated by the teacher models, and the voting result is deemed the decision variable D̂^T; in a case that the emotion prediction result Ŷ_teacher^T of a teacher model is the same as the decision variable D̂^T, that teacher model is selected for knowledge merging.
A mean value
The second strategy, mean-based merging, is more suitable for a case in which the teacher models have strong performance balance; in this case, all the teacher models have the same weight, and the mean value
In Step S3-6, the merged knowledge of the teacher models is taught to the student model.
With unlabeled target domain sample data, a prediction result ŶstudentT of the student model is obtained through the feature extractor and the label classifier of the student model;
based on a predetermined distillation temperature, smoothing processing is performed on the merged knowledge
by minimizing the KL divergence between the outputs of the teacher models and the student model, the student model learns the knowledge of the teacher models and obtains feature extraction and label classification abilities in the target domain that are more extensive than those of the teacher models.
In Step S4, model performance evaluation in two scenarios including a cross-session scenario and a cross-subject scenario is performed.
The present disclosure specifically verifies the performance of the model on the SEED data set.
The emotional state Ŷ_student^T predicted on the target domain sample set by the training-converged student model is compared with the real state Y^T, the accuracy result is obtained, and the model performance is evaluated. The accuracy rate is the ratio of the number of correctly classified samples during model testing to the total number of test samples, and the calculation formula of the model accuracy is as follows:
Where TP is the number of positive samples predicted as positive by the model, TN is the number of negative samples predicted as negative, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative.
A multi-source domain adaptive EEG emotional state classification system based on knowledge distillation includes a pre-training module, a teacher model domain adaptation module and a student model training module; where the pre-training module pre-trains each teacher model based on each labeled source domain; the teacher model domain adaptation module performs domain adaptation for each teacher model by using a source domain classification loss (SCL), a target domain classification loss (TCL), a maximum mean discrepancy (MMD) and a pseudo-label triplet loss based on the corresponding labeled source domain and an unlabeled target domain; and the student model training module transfers knowledge of teachers from a plurality of single-source domains to a student model. The pre-training module, the teacher model domain adaptation module and the student model training module are computer-implemented modules.
In addition, in the teacher model domain adaptation module, in order to improve effectiveness of the pseudo-label triplet loss, a margin-based sampling strategy is used to filter original features, and only features with marginal scores higher than a preset threshold are selected as embedded features for calculating the pseudo-label triplet loss.
The beneficial effects of the present disclosure are as follows.
The present disclosure solves the blind estimation problem of the maximum mean discrepancy (MMD) technology in multi-source domain adaptation by using the pseudo-label triplet loss. In addition, the margin-based sampling strategy based on uncertainty measurement is used to improve its effectiveness, while knowledge distillation technology is introduced to train a more robust student model by teaching it the knowledge of a plurality of teacher models, so as to make maximum use of multi-source domain knowledge. Through experimental verification on the public emotion data set SEED, the present disclosure achieves significant improvement compared with previous methods.
The preferred embodiments of the present disclosure will be described in detail with reference to the attached drawings hereinafter, so that the advantages and features of the present disclosure can be more easily understood by those skilled in the art, and the scope of protection of the present disclosure can be more clearly defined.
Multi-source domain adaptation (MSDA) aims to transfer knowledge from a plurality of source domains to an unlabeled target domain, which is very suitable for cross-session and cross-subject EEG emotion recognition. However, existing MSDA models only take into account the domain-level relationship between each pair of source and target feature distributions, but rarely take into account the data-pair-level correlation between the two domains, resulting in poor robustness.
The present disclosure discloses a multi-source domain knowledge transfer framework (MS-KTF) for EEG emotional recognition. First, the obtained data is band-pass filtered, and artifacts are removed by an independent component analysis (ICA) technology. Second, EEG features are extracted by using a differential entropy (DE) method, and a three-dimensional EEG time series is converted into a two-dimensional sample matrix. Then, a training set and a test set are defined in two task scenarios, respectively, so as to ensure that the training set and the test set do not overlap. For these samples, MS-KTF combines a pseudo-label triplet loss based on margin-based sampling with a maximum mean discrepancy (MMD). According to the method, unbiased alignment between each pair of source and target domains can be implemented at the domain level and the data pair level. Specifically, the framework learns knowledge from different source domains, so that a plurality of single-source models are utilized to the greatest extent, and a more powerful model is implemented with less time consumption. Finally, the classification accuracy is used to evaluate the performance of the model in the two task scenarios. According to the present disclosure, the triplet loss and the maximum mean discrepancy are combined, so that the problem of insufficient alignment of EEG signal distribution differences is solved to a certain extent, and a high-precision cross-session and cross-subject emotional state classifier is trained, which has the advantages of low time complexity, high calculation efficiency, strong generalization ability and the like, so as to have a wide application prospect in actual brain-computer interaction.
Refer to
In Step S1, data processing is performed.
The emotional data set is taken as an example for analysis, and the processing steps of the original EEG data collected by an EEG collection device are as follows.
In Step S1-1, data denoising is performed.
The data set used by the present disclosure to verify the model performance comes from SEED. Refer to the paper "Investigating Critical Frequency Bands and Channels for EEG-Based Emotion Recognition with Deep Neural Networks" for details. First, the original EEG signal collected in the data set is down-sampled to 200 Hz, then band-pass filtering of 0.3-50 Hz is performed, and finally ocular artifacts in the signals are removed by an independent component analysis (ICA) technology.
In Step S1-2, differential entropy (DE) feature extraction is performed.
DE features are extracted from the EEG data after removing artifacts. Each subject watches 15 videos that can cause obvious emotional changes, and the EEG data collected within the same video playing duration is regarded as one emotional test; each subject thus has 15 emotional tests. The data is segmented with a 1 s non-overlapping sliding window for each subject to obtain 3394 data samples. For each data sample x_i, the number of EEG data collecting channels is 62; and the frequency domain features of five frequency bands including δ (1-3 Hz), θ (4-7 Hz), α (8-13 Hz), β (14-30 Hz) and γ (31-50 Hz) are extracted.
In Step S2, data definitions and data set divisions are performed.
There are two test scenarios for emotional state classification: the cross-subject scenario and the cross-session scenario, and the model tests in the two scenarios have their own different data definitions and data set divisions, which are explained in detail hereinafter.
It is assumed that there are N subjects, and each subject has D different session (period) tests. The whole sample set is expressed as U = {{(X_i, Y_i)}_{j=1}^{D}}_{i=1}^{N}, where i indicates the serial number of the subject, j indicates the serial number of a session (period), X_i indicates the sample set of subject i, and the corresponding label set is Y_i.
For a task of emotional state classification in the cross-session scenario, the data set is cross-verified by using a leave-one-out method; specifically, for each subject i, the data of the 15 emotional tests in the latest session is taken as the test set; each of the remaining D−1 sessions, in a unit of session, is deemed a source domain in the training set, and finally D−1 source domains are obtained as the training set; a total of N tests are conducted and the average accuracy is calculated.
For a task of emotional state classification in the cross-subject scenario, the data set is cross-verified by using the leave-one-out method; specifically, in a session (period), the data of all 15 emotional tests of one subject is iteratively extracted, with the assumption that its emotional state labels are unknown, as the test set; from the remaining N−1 subjects, R subjects at a time are randomly and unrepeatably grouped together, each group serving as a source domain in the training set; and ⌊(N−1)/R⌋ (round down) source domains are obtained as the training set. Finally, the performance of the model is verified on the test sets of the N subjects, a total of D×N tests are conducted, and the average accuracy is calculated.
In Step S3, the construction and training of the MS-KTF model is performed.
The main parameters in the neural network MS-KTF model include:
In Step S3-1-1, the data division of the model is performed.
The data division and construction of the model are shown in
For the cross-subject scenario: the target domain sample set of the model is U^T = {X_i}, where X_i indicates the feature data set of the i-th subject; X_i = {x_j}_{j=1}^{n}, where x_j indicates the j-th sample in X_i and n indicates the number of samples in X_i; the multi-source domain sample set of the model is
P_j ⊆ [N]\{i}, P_j ∩ P_k = Ø, ∀j, ∀k, j ≠ k, where [N]\{i} indicates the serial number set of all the subjects after the i-th subject's data is removed, and P_j indicates the serial number set of the subjects included in the j-th source domain (all the data in the cross-subject scenario comes from the same session).
For the cross-session scenario: the target domain sample set of the model is U^T = {{X_i}_{i=1}^{N}}_j, where X_i indicates the feature data set of the i-th subject and j indicates the j-th session (period); the multi-source domain sample set of the model is U^S = {{(X_i, Y_i)}_{i=1}^{N}}_k, k ∈ [D]\{j}, where [D]\{j} indicates the serial number set of all the sessions after the j-th session is removed.
In Step S3-1-2, data input of the model is performed.
As shown in the left half of
In Step S3-2, initialization of the model is performed.
As shown in the right half of
In Step S3-3, pre-training of the single-source domain teacher models is performed.
As shown in
Based on the multi-source domain sample set U^S, a feature extractor Nf and a label classifier Ny of each domain-specific teacher model are pre-trained using the corresponding labeled single source domain sample set, such that each domain-specific teacher model has a certain pattern recognition ability in its respective source domain (the optimization goal is the same as the SCL in formula (5) below, which will not be described in detail here).
In Step S3-4, training of the feature extractor of single-source domain teacher models is performed.
After passing through the feature extractor Nf of the domain-specific teacher model, the respective low-dimensional features F^S and F^T of the corresponding source domain data U^S and target domain data U^T are extracted. In order to ensure the unbiased adaptability of the extracted features, two methods including domain-level distribution alignment and data-pair-level distribution alignment are used in this patent.
In Step S3-4-1, domain-level distribution alignment is performed.
Corresponding to the unbiased distribution alignment in
The maximum mean discrepancy (MMD) is a distance metric in probabilistic metric space, which is widely used in machine learning and nonparametric testing. The distance metric is based on the idea of embedding the probability into reproducing kernel Hilbert space (RKHS), which aims to reduce the distribution difference between the source domain and the target domain while retaining their specific discriminant information. In the training process, the distance between the source domain and the target domain in the feature space is reduced by minimizing the maximum mean discrepancy (MMD) loss, so as to achieve domain-level alignment, and a specific formula is as follows:
Where F_i^S and F_j^T indicate the extracted low-dimensional features of the i-th sample in the source domain and the j-th sample in the target domain, respectively; N_S and N_T indicate the number of source domain samples and the number of target domain samples, respectively.
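A minimal NumPy sketch of the MMD loss, assuming an RBF kernel as the RKHS embedding (the kernel choice and bandwidth here are illustrative, and the disclosure's exact kernel may differ):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd_loss(fs, ft, gamma=1.0):
    # squared MMD between source features fs (Ns, d) and target features ft (Nt, d):
    # mean within-source similarity + mean within-target similarity
    # - 2 * mean cross-domain similarity
    return (rbf_kernel(fs, fs, gamma).mean()
            + rbf_kernel(ft, ft, gamma).mean()
            - 2 * rbf_kernel(fs, ft, gamma).mean())
```

The loss is exactly zero when the two feature sets coincide and grows as their distributions in the kernel space drift apart, which is the quantity minimized for domain-level alignment.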
In Step S3-4-2: data-pair-level distribution alignment is performed.
Because the MMD blindly estimates parameters while taking into account only domain-level statistics and their relationships, feature distinguishability may be reduced, and the relationship between the intra-class distance and the inter-class distance may be affected, because one of the distance values decreases while the other increases. The triplet loss, which reduces the intra-class distance and increases the inter-class distance, is one way to solve this problem. However, in domain adaptation, the target domain is usually unlabeled. Therefore, the triplet loss with margin-based sampling is used to perform data-pair-level distribution alignment in this patent.
A margin-based score of the prediction result of each sample is used in this patent as a basis for determining whether the sample is sampled, and this method may be expressed by the following formula:
Where x is an input sample, g_θ is an abstract function of the label classifier, i* is the category with the highest prediction probability in the prediction result, k is the number of all categories, [k]\{i*} indicates the set of all the categories except i*, and Threshold is a predetermined threshold for margin-based sampling.
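The margin-based sampling described above can be sketched as follows (helper names and the threshold value are illustrative assumptions): the margin score is the gap between the top-1 and top-2 predicted probabilities, and only samples whose score exceeds the threshold are kept as embedded features.

```python
import numpy as np

def margin_scores(probs):
    # probs: (n, k) softmax outputs; margin = top-1 prob minus top-2 prob
    sorted_p = np.sort(probs, axis=1)
    return sorted_p[:, -1] - sorted_p[:, -2]

def select_confident(features, probs, threshold=0.3):
    # keep only samples whose margin score exceeds the threshold,
    # returning the filtered features and their pseudo-labels (argmax class)
    keep = margin_scores(probs) > threshold
    return features[keep], probs[keep].argmax(1)
```

A confidently classified sample (e.g. probabilities [0.9, 0.05, 0.05], margin 0.85) passes the filter, while an ambiguous one (e.g. [0.4, 0.35, 0.25], margin 0.05) is discarded before the triplet loss is computed.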
The triplet loss requires sampling in the form of triplets {(x_i^a, x_i^p, x_i^n)}_{i=1}^{N}
Where N is the number of samples contained in X_selected, α is a predetermined margin value for guiding separability, d(⋅) is a function for calculating the Euclidean distance between regularized embedded feature pairs, and f_θ(⋅) is an abstract function for feature extraction.
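A sketch of the triplet loss on regularized (L2-normalized) embeddings, assuming the standard hinge form max(d(a, p) − d(a, n) + α, 0); the actual mining of triplets from X_selected is omitted here:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # hinge-style triplet loss on L2-normalized embeddings:
    # pulls the anchor-positive pair together and pushes the
    # anchor-negative pair apart by at least the margin
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negative)
    d_ap = np.linalg.norm(a - p, axis=1)   # intra-class distance
    d_an = np.linalg.norm(a - n, axis=1)   # inter-class distance
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

When the negative already lies farther from the anchor than the positive by more than the margin, the hinge clamps the loss to zero and the triplet contributes no gradient.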
In Step S3-5, training of label classifiers of single-source domain teacher models is performed.
The cross-entropy (CE) loss is used as the evaluation index of the classification result of the label classifier in the source domain and the target domain in this patent; a source classification loss (SCL) is used as the classification loss in the source domain, and a target classification loss (TCL) is used as the classification loss in the target domain.
In the source domain, there is a real label, so the SCL uses the real label and the classification result of the label classifier as the comparison object, and the specific formula is as follows:
Where x_i is the i-th source domain input sample, y_i^S is the real label of the i-th source domain input sample, ŷ_i^S is the prediction result of the label classifier for the i-th source domain input sample, f_θ(⋅) is an abstract function for feature extraction, and g_θ(⋅) is an abstract function of the label classifier.
In the target domain, the sample lacks a real label, and the corresponding TCL uses the generated pseudo label and the classification result of the label classifier as comparison objects, and a specific formula is as follows:
Where x_i is the i-th target domain input sample, ỹ_i^T is the pseudo label generated for the i-th target domain input sample, ŷ_i^T is the prediction result of the label classifier for the i-th target domain input sample, f_θ(⋅) is an abstract function for feature extraction, and g_θ(⋅) is an abstract function of the label classifier.
In Step S3-6, the goal optimization and the training of the single-source domain teacher model is performed.
Summarizing Step S3-4 and Step S3-5, in the domain adaptation stage of the teacher model, the final optimization goal is shown in the following formula:
Where β, γ and σ are weighting factors for balancing the loss function.
By using a stochastic gradient optimizer combined with a mini-batch training mode, domain-invariant features are obtained for each pair of source and target domains at the domain level and the data pair level by minimizing the MMD loss and the triplet loss (L_MMD, L_trip) in formula (7); by minimizing the classification losses (L_SCL, L_TCL) in the source domain and the target domain, a better classifier is obtained, which accurately predicts the source domain samples without sacrificing the ability to discriminate the target domain samples.
In Step S3-7, training of the student model is performed.
The structure of the specific student model can be seen in
In order to better merge the knowledge of the teacher models, a voting-based method is used to select the knowledge of the teacher models to be merged, which is expressed as the following formula in the patent:
Where x_i is the i-th input sample, N_t is the number of teacher models, mode(⋅) is a function for finding the mode (or multiple modes), * denotes element-wise multiplication, and ŷ_i^T
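The voting-based merging can be sketched as follows. This is an illustrative reading of the formula: each teacher votes with its one-hot (argmax) prediction, and only the teachers that agree with the per-sample majority contribute to the merged knowledge; tie-breaking toward the lowest class index is an assumption of this sketch.

```python
import numpy as np

def merge_teacher_predictions(teacher_probs):
    # teacher_probs: (n_teachers, n_samples, k) softmax outputs of each teacher.
    # Per sample: take the majority-voted class, then average the probability
    # vectors of only those teachers whose prediction matches the majority.
    votes = teacher_probs.argmax(axis=2)          # (n_teachers, n_samples)
    merged = np.empty(teacher_probs.shape[1:])
    for s in range(teacher_probs.shape[1]):
        classes, counts = np.unique(votes[:, s], return_counts=True)
        majority = classes[counts.argmax()]       # ties resolve to lowest class
        agree = votes[:, s] == majority
        merged[s] = teacher_probs[agree, s].mean(axis=0)
    return merged
```

A teacher that disagrees with the majority is excluded for that sample only, so a teacher with uneven per-class performance can still contribute where it agrees with its peers.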
After obtaining the merged knowledge of a plurality of single-source domain teacher models, Kullback-Leibler (KL) divergence is used to evaluate a difference between the prediction result of the teacher model and the prediction result of the student model, and the formula is as follows:
Where X is the input sample set, Ŷ_merge is the merged teacher knowledge set, T is a predetermined temperature coefficient for controlling the smoothness of the softmax function, and KLD[p, q] is an evaluation function measuring the KL divergence between a distribution p and a distribution q.
By using an Adam optimizer combined with a mini-batch training mode, the KL loss in formula (9) is minimized, so that the student model fully learns the merged knowledge of the teacher models and obtains better performance in the target domain.
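The temperature-smoothed distillation loss can be sketched as follows (KL(teacher ‖ student) on temperature-softened softmax distributions; the default temperature value is illustrative):

```python
import numpy as np

def softmax(z, t=1.0):
    # temperature-scaled, numerically stable softmax
    z = z / t
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_kl_loss(teacher_logits, student_logits, temperature=4.0):
    # KL divergence from the softened teacher distribution to the softened
    # student distribution, averaged over the batch
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=1).mean()
```

A temperature above 1 flattens both distributions, exposing the teachers' inter-class similarity structure ("dark knowledge") that a hard argmax label would discard.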
In Step S4: model performance evaluation in two scenarios including a cross-session scenario and a cross-subject scenario is performed.
The present disclosure specifically verifies the performance of the model on the SEED data set and the SEED-IV data set.
The prediction result y_pred obtained by the converged student model in the target domain is compared with the real label y^T in the target domain by using a confusion matrix, and the comparison result is used to evaluate the model performance. The accuracy rate is the ratio of the number of correctly classified samples during model testing to the total number of test samples, and the calculation formula of the model accuracy is as follows:
Where TP is the number of positive samples predicted as positive by the model, TN is the number of negative samples predicted as negative, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative. SEED data includes 15 subjects, and each subject has three tests, with a total of 45 tests. The average accuracy of the first two tests of the 15 subjects is as follows:
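The accuracy computation above reduces to a one-line formula over the confusion-matrix counts (the helper name is illustrative):

```python
def accuracy(tp, tn, fp, fn):
    # accuracy = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)
```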
The mean square error formula of the result is as follows:
Refer to Step S3-1-1 for the data set division in the two scenarios including the cross-subject scenario and the cross-session scenario. For the cross-subject scenario, the model proposed by the present disclosure is tested on EEG data of 15 subjects in one test. For the cross-session scenario, the model proposed by the present disclosure is tested on EEG data of 15 subjects in one test. The comparison between the final test results and the existing technologies (SVM, DGCNN and RGNN) is shown in the following table:
As can be seen from the results in the above tables, the method proposed by the present disclosure has higher accuracy than those of DDC, DAN and MS-MDA in the cross-session scenario and the cross-subject scenario. The present disclosure is not only suitable for the research of emotional state recognition, but also suitable for any EEG-based cross-session and cross-subject classification prediction task, which solves the problem of individual differences of the EEG to some extent.
Number | Date | Country | Kind
202310802378.2 | Jun 2023 | CN | national