Embodiments of the present disclosure relate to the technical field of emotion recognition, and in particular to a method for recognizing emotion, a method for training an emotion recognition model, apparatuses, an electronic device, a non-transitory computer-readable storage medium, and a computer program product.
In the related art, with the continuous development of cloud computing, big data and artificial intelligence, applications including but not limited to face recognition and gait recognition have been widely applied in various industries. Artificial intelligence customer service dialog is another important commercial scenario. This potential application of human-machine interaction poses many challenges, an important one of which is how to enable a machine to understand the emotion of a human during human-machine interaction, i.e. the emotion recognition task. Emotion recognition, as a hot research topic in the field of affective computing, has attracted attention from many researchers in fields such as computer vision, natural language processing and human-machine interaction. Most methods use an Artificial Neural Network (ANN, or ANNs, i.e. Artificial Neural Networks) to complete the emotion recognition. However, inference of such an emotion recognition model consumes a large amount of energy on a mobile terminal device, and this high-energy-consumption ANN-based manner of emotion recognition hinders the application of emotion recognition in embedded and mobile devices.
As a third-generation neural network, a low-power-consumption Spiking Neural Network (SNN, or SNNs, i.e. Spiking Neural Networks) is a potential solution for achieving an emotion recognition algorithm applicable to embedded and mobile terminals; and compared with an ANN, a single neuron in an SNN bears a stronger similarity to the structure of a neuron in the brain.
In the related art, methods that complete an emotion recognition task by using SNNs generally extract emotion information from voice, cross-modality data or electroencephalogram signals, while extracting emotion information from a video clip has not been realized, thereby limiting the ways for emotion recognition. Therefore, how to extract emotion information from a video clip is a problem to be solved by a person skilled in the art.
An object of embodiments of the present disclosure is to provide a method for recognizing emotion, a method for training an emotion recognition model, apparatuses, an electronic device, a non-transitory computer-readable storage medium, and a computer program product, which may recognize an emotion category on the basis of video information in use, so that the ways for emotion recognition increase, which is beneficial to better achieving emotion recognition.
In order to solve the technical problem, embodiments of the present disclosure provide a method for recognizing emotion, including:
In some embodiments of the present disclosure, before the to-be-recognized spiking sequences corresponding to video information are acquired, the method further includes:
In some embodiments of the present disclosure, the process that the pre-established spiking neural network emotion recognition model is trained to obtain the trained spiking neural network emotion recognition model includes:
In some embodiments of the present disclosure, the process that the pre-established spiking neural network emotion recognition model is trained to obtain the trained spiking neural network emotion recognition model includes:
In some embodiments of the present disclosure, the process that the emotion recognition-based dynamic visual data set is pre-established includes:
In some embodiments of the present disclosure, the process that simulation processing is performed on the raw visual data by using the dynamic visual sensor simulation method to obtain a plurality of spiking sequences corresponding to the raw visual data, includes:
In some embodiments of the present disclosure, the process that simulation processing is performed on the raw visual data by using the dynamic visual sensor simulation method to obtain corresponding spiking sequences, further includes:
In some embodiments of the present disclosure, the process that simulation processing is performed on the raw visual data by using the dynamic visual sensor simulation method to obtain corresponding spiking sequences, further includes:
In some embodiments of the present disclosure, the process that the first output channel and the second output channel are respectively assigned according to the grayscale difference value between the current video frame and the previous video frame and the preset threshold includes:
In some embodiments of the present disclosure, the spiking neural network includes a voting neuronal population;
In some embodiments of the present disclosure, the process that the pre-established spiking neural network emotion recognition model is trained by using the dynamic visual data set to obtain the trained spiking neural network emotion recognition model further includes:
In some embodiments of the present disclosure, it is judged, in the following manners, whether the current spiking neural network converges after the parameter weight is updated:
In some embodiments of the present disclosure, the process that the raw visual data is processed by using the dynamic visual sensor simulation method to obtain spiking sequences corresponding to the raw visual data, includes:
In some embodiments of the present disclosure, the process that the first output channel and the second output channel are respectively assigned according to the grayscale difference value between the current video frame and the previous video frame and the preset threshold includes:
In some embodiments of the present disclosure, the spiking neural network includes a feature extraction component, a voting neuronal population component and an emotion mapping component;
In some embodiments of the present disclosure, the spiking neural network further includes: a feature extraction component, wherein the feature extraction component includes a single forward extraction unit composed of convolution, normalization, a Parametric Leaky-Integrate and Fire (PLIF) model and average pooling, and a network unit composed of two fully-connected layers and two PLIF models, which are arranged alternately.
Embodiments of the present disclosure further provide a method for training an emotion recognition model, including:
Embodiments of the present disclosure further provide an apparatus for recognizing emotion, including:
In some embodiments of the present disclosure, the apparatus further includes:
In some embodiments of the present disclosure, the training component includes:
In some embodiments of the present disclosure, the training component includes:
In some embodiments of the present disclosure, the establishment component includes: a raw visual data collection component, a data set establishment component, and at least one of a simulation processing component and a spiking sequence acquisition component, wherein
In some embodiments of the present disclosure, the simulation processing component includes:
In some embodiments of the present disclosure, the simulation processing component further includes:
In some embodiments of the present disclosure, the simulation processing component further includes:
In some embodiments of the present disclosure, the assignment component includes: a first calculation component, and at least one of a first position assignment component and a second position assignment component, wherein
In some embodiments of the present disclosure, the spiking neural network includes a voting neuronal population;
In some embodiments of the present disclosure, the second training component further includes:
In some embodiments of the present disclosure, the judgment component includes at least one of the following:
In some embodiments of the present disclosure, the spiking neural network further includes: a feature extraction component, wherein the feature extraction component includes a single forward extraction unit composed of convolution, normalization, a Parametric Leaky-Integrate and Fire (PLIF) model and average pooling, and a network unit composed of two fully-connected layers and two PLIF models, which are arranged alternately.
Embodiments of the present disclosure further provide an apparatus for training an emotion recognition model, including:
Embodiments of the present disclosure further provide an apparatus for training an emotion recognition model, including:
Embodiments of the present disclosure further provide an apparatus for recognizing emotion, including:
Embodiments of the present disclosure further provide an electronic device, including:
Embodiments of the present disclosure further provide a non-transitory computer-readable storage medium; the non-transitory computer-readable storage medium stores a computer program which, when executed by a processor, implements the method for recognizing emotion or the method for training an emotion recognition model.
Some embodiments of the present disclosure further provide a computer program product, including a computer program or instructions, which when executed by a processor, implement the method for recognizing emotion or the method for training an emotion recognition model.
The technical solutions provided in the embodiments of the present disclosure at least bring about the following beneficial effects:
In the embodiments of the present disclosure, first, to-be-recognized spiking sequences corresponding to video information are acquired; and then the to-be-recognized spiking sequences are recognized by using a spiking neural network emotion recognition model, so as to obtain a corresponding emotion category. That is to say, embodiments of the present disclosure may recognize an emotion category on the basis of video information, so that the ways for emotion recognition increase, which is beneficial to better achieving emotion recognition. Further, in the embodiments of the present disclosure, a spiking neural network may be trained by using a pre-established dynamic visual data set, to obtain the spiking neural network emotion recognition model; the to-be-recognized spiking sequences corresponding to the video information are then acquired, inputted into the spiking neural network emotion recognition model, and recognized by the model, so as to obtain the corresponding emotion category. In this way, the ways for emotion recognition increase, which is beneficial to better achieving emotion recognition of video information.
In order to describe the technical solutions in the embodiments of the present disclosure or in the related art more clearly, the accompanying drawings required in the description of the embodiments of the present disclosure or the related art will be introduced briefly below. Apparently, the accompanying drawings in the following description merely relate to embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other accompanying drawings according to these accompanying drawings without any inventive effort.
Embodiments of the present disclosure provide a method for recognizing emotion, a method for training an emotion recognition model, apparatuses, a non-transitory computer-readable storage medium, and a computer program product, which can recognize an emotion category on the basis of video information in use, so that the ways for emotion recognition increase, which is beneficial to better achieving emotion recognition.
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, hereinafter, the technical solutions in embodiments of the present disclosure will be described clearly and thoroughly in combination with the accompanying drawings in the embodiments of the present disclosure. Apparently, the embodiments as described are merely some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present disclosure without any inventive effort shall all fall within the scope of protection of the present disclosure.
Please refer to
In some embodiments, the to-be-recognized spiking sequences corresponding to the video information may be acquired directly by using a dynamic visual camera, or may be obtained from simulation data of the video information. It should be noted that, as the cost of a dynamic visual camera is high, in the embodiments of the present disclosure, in order to reduce costs, the video information may be acquired first, and simulation may then be performed on the video information to obtain the corresponding to-be-recognized spiking sequences.
In this step, the specific process of recognizing the to-be-recognized spiking sequences includes: the to-be-recognized spiking sequences are inputted into the spiking neural network emotion recognition model, and the to-be-recognized spiking sequences are recognized by the spiking neural network emotion recognition model, so as to obtain a corresponding emotion category.
The spiking neural network emotion recognition model in this step is pre-established. In some embodiments of the present disclosure, after the spiking neural network emotion recognition model is established and before the to-be-recognized spiking sequences are recognized, the pre-established spiking neural network emotion recognition model is trained first to obtain a trained spiking neural network emotion recognition model. The specific training process includes:
One training mode: test sets of a plurality of emotion categories are acquired; and test training is performed on a pre-established spiking neural network emotion recognition model by using the test sets, to obtain a trained spiking neural network emotion recognition model.
Another training mode: an emotion recognition-based dynamic visual data set is pre-established; and a pre-established spiking neural network emotion recognition model is trained by using the dynamic visual data set, to obtain a trained spiking neural network emotion recognition model.
It can be understood that, in the embodiments of the present disclosure, the emotion recognition-based dynamic visual data set and the spiking neural network are pre-established, and the spiking neural network is then trained by using the dynamic visual data set, so as to obtain the trained spiking neural network emotion recognition model. The process of pre-establishing the dynamic visual data set may include:
It should be noted that, in practical applications, the dynamic visual camera may be used to directly acquire the to-be-recognized spiking sequences corresponding to the video information, but the cost of the dynamic visual camera is high. In the embodiments of the present disclosure, in order to further reduce costs, the emotion recognition-based raw visual data may be collected first by an ordinary video collection device, and then the dynamic visual sensor simulation method is used to perform simulation on the raw visual data to obtain spiking data corresponding to the raw visual data, thereby converting the raw visual data into spiking data and reducing device costs. It can be understood that spiking sequences corresponding to one piece of raw visual data are actually a spiking sequence array constituted by spiking sequences at all pixel positions of all video pictures in the whole raw visual data. In the embodiments of the present disclosure, the spiking sequence array is simply referred to as spiking sequences corresponding to the raw visual data; and in practical applications, simulation processing is performed on a plurality of pieces of raw visual data by using the dynamic visual sensor simulation method, so as to obtain a plurality of spiking sequences; and the emotion recognition-based dynamic visual data set is established on the basis of the plurality of spiking sequences.
Still further, please refer to
It should be noted that a feature of dynamic vision is that the information captured by the camera is no longer all the information in the whole scenario, which may greatly reduce the amount of data to be recorded and transmitted, especially when the scenario changes little. In the embodiments of the present disclosure, grayscale information between adjacent picture frames in video data is differenced, and the difference result is judged against the preset threshold, so as to determine whether data needs to be recorded, thereby completing simulation conforming to dynamic visual characteristics.
Recording features of dynamic visual data are that only changes are recorded. The data is defined by a formalized symbol description, generally represented as E[xi, yi, ti, pi], where E denotes an event, which has only two states: occurring and not occurring; (xi, yi) denotes the position in the scenario where the event occurs; ti denotes the time when the event occurs; and pi denotes the polarity of the event. For example, for a change of light intensity in a scenario recorded as an event, the light intensity may change in two directions: from strong to weak, or from weak to strong; both changes represent the occurrence of an event, and the polarity dimension is defined in order to distinguish these two types of events. The method provided in the embodiments of the present disclosure generates dynamic visual data of a similar form by computer simulation, where continuous recording of a scenario is represented by video data. As the task of the present system is emotion recognition, the data used herein is raw visual data for emotion recognition. It is assumed that a segment of raw visual data contains N frames of video frame images in total; these video frame images are the inputs to the dynamic visual sensor simulation method, and simulated dynamic visual data may be generated by calculation according to the following simulation steps. In practical applications, the simulated visual data is initialized as an all-zero array E[xi, yi, ti, pi], where i ranges from 1 to N and the size of E is H×W×N×2, H and W being respectively the height and width of a video frame image; an intermediate variable recording the data of the previous frame is initialized and marked as Fpre; and the inter-frame sensitivity (i.e. the preset threshold) is defined as Sens, a simulation event occurring when the difference between two frames exceeds the sensitivity.
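As a concrete illustration, the initialization described above may be sketched in Python with NumPy as follows; the frame dimensions and the sensitivity value are placeholder assumptions, while the names E, Fpre and Sens follow the symbols above:

    import numpy as np

    # Placeholder dimensions: H x W video frames, N frames in total
    H, W, N = 128, 128, 64
    Sens = 10.0  # inter-frame sensitivity, i.e. the preset threshold

    # All-zero simulated dynamic visual data of size H x W x N x 2
    E = np.zeros((H, W, N, 2), dtype=np.float32)
    # Intermediate variable recording the grayscale data of the previous frame
    Fpre = np.zeros((H, W), dtype=np.float32)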
In some embodiments, in the process of converting the raw visual data into the spiking sequences, the N frames of video frame images in the entire raw visual data may be traversed starting from the first frame of video frame image. For example, the current ith frame of video frame image is converted from the RGB color space to the grayscale space, represented by Vgray; the converted video frame data is used as the current video frame data; and the value of i is then judged.
In some embodiments, when i is equal to 1, that is, for the current video frame data corresponding to the first frame of video frame image, all floating-point data of the current video frame data is assigned to a first output channel of a first time step of the simulation data (which may be achieved by the code E[:,:,i,0]←Vgray), and the current video frame data is taken as the previous video frame (which may be achieved by the code Fpre←Vgray).
When i is not equal to 1, the first output channel and a second output channel are respectively assigned according to a grayscale difference value between the current video frame and the previous video frame and a preset threshold, the current video frame data is taken as the previous video frame, and the step in S240 that the value of i is increased by 1 is executed. This process may be implemented by the following method:
In some embodiments, in the embodiments of the present disclosure, for each pixel in the current video frame image, a grayscale difference value between the current video frame and the previous video frame at the pixel is calculated; the grayscale difference value is then compared with the preset threshold, and assignment is performed for two different types of events according to the comparison result. In some embodiments, when the grayscale difference value is greater than the preset threshold, 1 is assigned to the corresponding position of the first output channel, which may be achieved by the code E[:,:,i,0]←int(Vgray−Fpre>Sens); and when the grayscale difference value is less than the negative of the preset threshold, 1 is assigned to the corresponding position of the second output channel, which may be achieved by the code E[:,:,i,1]←int(Vgray−Fpre<−Sens).
In addition, in the embodiments of the present disclosure, after the value of i is increased by 1, it is judged whether the updated i is less than N; when i is less than N, the method returns to the step in which the ith frame of video frame image is converted from the RGB color space to the grayscale space, so as to continue processing the next video frame image; and when i is not less than N, the operation ends, which indicates that all N video frame images have been processed, so as to obtain spiking sequences composed of the first output channel and the second output channel.
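Putting the steps above together, a minimal sketch of the whole simulation procedure is given below, assuming the video is available as a list of H×W×3 RGB frames; the helper names are illustrative, and the negative-polarity comparison uses −Sens, consistent with the dynamic-vision principle that an event occurs only when the inter-frame difference exceeds the sensitivity:

    import numpy as np

    def rgb_to_gray(frame: np.ndarray) -> np.ndarray:
        # Convert an H x W x 3 RGB frame to grayscale (ITU-R BT.601 weights)
        return (frame[..., 0] * 0.299 + frame[..., 1] * 0.587
                + frame[..., 2] * 0.114).astype(np.float32)

    def simulate_dvs(frames: list, Sens: float) -> np.ndarray:
        """Simulate dynamic visual data from N RGB video frames.
        Returns an H x W x N x 2 array: channel 0 records brightness
        increases, channel 1 records brightness decreases."""
        H, W = frames[0].shape[:2]
        N = len(frames)
        E = np.zeros((H, W, N, 2), dtype=np.float32)
        Fpre = None
        for i, frame in enumerate(frames):
            Vgray = rgb_to_gray(frame)
            if i == 0:
                # First frame: record the full grayscale data on channel 0
                E[:, :, i, 0] = Vgray
            else:
                diff = Vgray - Fpre
                E[:, :, i, 0] = (diff > Sens).astype(np.float32)   # ON events
                E[:, :, i, 1] = (diff < -Sens).astype(np.float32)  # OFF events
            Fpre = Vgray  # the current frame becomes the previous frame
        return E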
It should also be noted that the spiking neural network transfers information in a spiking manner, and the spike emission process itself is non-differentiable, such that gradient back propagation cannot be used directly for synaptic weight updates. Moreover, in the optimization process, in order to avoid manually setting some hyper-parameters (e.g. the membrane time constant T of a neuron), it has recently been proposed that the membrane time constant T of a neuron may be integrated into the joint update of the synaptic weights of the whole model, and this model is referred to as PLIF (Parametric Leaky-Integrate and Fire model). Joint optimization is more convenient than manual setting, and may yield better synaptic weights through optimization. In the embodiments of the present disclosure, PLIF is used as layers in the SNN to construct an emotion recognition SNN model, as follows:
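For illustration of the PLIF neuron itself, a minimal single-step implementation may be sketched in PyTorch as below; parameterizing the reciprocal of the membrane time constant by a sigmoid and using a hard reset are common implementation choices (adopted, e.g., by SNN libraries such as SpikingJelly) and are assumptions here, not necessarily the exact formulation of the present disclosure:

    import torch
    import torch.nn as nn

    class PLIF(nn.Module):
        """Parametric Leaky-Integrate and Fire neuron (sketch).
        1/T = sigmoid(w) is learned jointly with the synaptic weights."""
        def __init__(self, init_w: float = 0.0, v_threshold: float = 1.0):
            super().__init__()
            self.w = nn.Parameter(torch.tensor(init_w))  # learnable time-constant parameter
            self.v_threshold = v_threshold
            self.v = None  # membrane potential, kept across time steps

        def reset(self):
            self.v = None

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.v is None:
                self.v = torch.zeros_like(x)
            # Leaky integration of the input current
            self.v = self.v + torch.sigmoid(self.w) * (x - self.v)
            # Hard threshold; training would substitute a surrogate gradient here
            spike = (self.v >= self.v_threshold).float()
            self.v = self.v * (1.0 - spike)  # reset the membrane after a spike
            return spike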
Please refer to
It should be noted that, in
In some embodiments, the feature extraction component in the embodiments of the present disclosure simulates the manner in which brain neurons process information, and abstracts a convolution operation and a pooling operation; in addition, in the embodiments of the present disclosure, the spiking neuron model PLIF is used during information transfer. The feature extraction component includes a single forward extraction unit composed of convolution, normalization, a Parametric Leaky-Integrate and Fire (PLIF) model and average pooling, and a network unit composed of two fully-connected layers and two PLIF models, which are arranged alternately. In some embodiments, an operation of single forward feature extraction includes: a convolution operation with a convolution kernel of 3×3 (e.g. Conv 3×3 in
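Under the same assumptions, one single forward extraction unit and the alternating fully-connected network unit may be sketched as follows, applied once per time step and reusing the PLIF module above; the channel and feature sizes are placeholders:

    import torch.nn as nn

    def forward_extraction_unit(in_ch: int, out_ch: int) -> nn.Sequential:
        # Conv 3x3 -> normalization -> PLIF spiking activation -> average pooling
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            PLIF(),
            nn.AvgPool2d(kernel_size=2),
        )

    def fully_connected_unit(in_features: int, hidden: int, out_features: int) -> nn.Sequential:
        # Two fully-connected layers and two PLIF models, arranged alternately
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, hidden),
            PLIF(),
            nn.Linear(hidden, out_features),
            PLIF(),
        )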
Regarding the voting neuronal population module, decision-making by neurons in the brain is based on the collaborative work of a plurality of neurons; therefore, in the embodiments of the present disclosure, for the final number of emotion recognition categories, each emotion category is recognized by using a plurality of neurons forming a population. In some embodiments, ten neurons may be used to form the population corresponding to one category, and in the embodiments of the present disclosure, recognition of two emotion categories is used as an example for explanation. That is to say, ten neurons cooperatively decide whether the input ultimately belongs to the emotion category corresponding to that population, the total number of voting neurons is the number of emotion categories multiplied by ten, and the voting neuronal population module outputs spiking sequences.
Regarding the emotion mapping module, the emotion mapping module may map the spiking sequences outputted by the voting neuronal population module to a final emotion category. In some embodiments, the spiking sequence emitted by each neuron corresponds to one frequency, and the frequency may be used as one of the output mappings of the neuron; then, the frequencies of all the neurons in the neuronal population of the current category are averaged, so that each neuronal population corresponds to a final frequency. A larger frequency indicates that the corresponding emotion category is more strongly activated, and the emotion category corresponding to the neuronal population with the largest frequency is outputted.
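To make the mapping concrete, the frequency-based decoding may be sketched as follows, assuming the voting layer emits spikes of shape [T, batch, number of categories × 10] over T time steps:

    import torch

    def decode_emotion(voting_spikes: torch.Tensor, num_classes: int,
                       pop_size: int = 10) -> torch.Tensor:
        # Firing frequency of each voting neuron: mean spike count over time
        freq = voting_spikes.mean(dim=0)  # [batch, num_classes * pop_size]
        # Average the frequencies within each neuronal population
        pop_freq = freq.view(-1, num_classes, pop_size).mean(dim=2)
        # The population with the largest frequency gives the emotion category
        return pop_freq.argmax(dim=1)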
Please refer to
S310: a parameter weight of a pre-established spiking neural network is initialized; in some embodiments, a parameter weight of the pre-established spiking neural network emotion recognition model is initialized.
It should be noted that, in actual applications, the dynamic visual data set may be divided into three parts, which are respectively a training set, a verification set and a test set; and the spiking neural network emotion recognition model is pre-constructed, the spiking neural network emotion recognition model is as described above, and will not be repeated in the embodiments of the present disclosure. In some embodiments, the parameter weight of the spiking neural network emotion recognition model is initialized first.
S320: the dynamic visual data set is used as an input to a current spiking neural network, and an output frequency of a voting neuronal population of each emotion category is obtained via forward propagation of the network. In some embodiments, the dynamic visual data set is used as an input to the current spiking neural network in the spiking neural network emotion recognition model, and the output frequency of the voting neuronal population of each emotion category is obtained via forward propagation of the current spiking neural network.
In this step, in each round of training process, the current spiking neural network is determined on the basis of a current parameter weight, and the training set in the dynamic visual data set is taken as an input to the current spiking neural network in the spiking neural network emotion recognition model; and then, via forward propagation of the current spiking neural network, the output frequency of the voting neuronal population of each emotion category is obtained. For one voting neuronal population, an average value of output frequencies of various voting neurons in the voting neuronal population may be calculated, to obtain the output frequency of the voting neuronal population.
S330: regarding each emotion category, an error between the output frequency and a real label of a corresponding emotion category is calculated. In some embodiments, regarding each emotion category, an error between the output frequency of the voting neuronal population of the emotion category and a real label of a corresponding emotion category is calculated.
In this step, since each voting neuronal population corresponds to one emotion category, the error may be calculated according to the output frequency of the voting neuronal population and the real label of the corresponding emotion category. In the embodiments of the present disclosure, a mean square error (MSE) may be calculated.
S340: a gradient corresponding to the parameter weight is calculated according to the error, and the parameter weight of the current spiking neural network is updated by using the gradient.
In some embodiments, a final average error may be obtained by calculation according to the error corresponding to each voting neuronal population, then a gradient corresponding to the parameter weight is calculated according to the average error, and then the parameter weight of the current spiking neural network in the spiking neural network emotion recognition model is updated by using the gradient.
It should be noted that, in practical applications, a Stochastic Gradient Descent (SGD) algorithm may be used, and other gradient descent-based parameter optimization methods may also be selected to update the parameter weight, including but not limited to methods such as RMSprop (Root Mean Square propagation), Adagrad (Adaptive Gradient), Adam (Adaptive Moment Estimation), Adamax (an Adam variant based on the infinity norm) and ASGD (Averaged Stochastic Gradient Descent); which method to use may be determined according to actual situations, which is not limited in the embodiments of the present disclosure.
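As an illustration of S320 to S340, one training iteration may be sketched as below; the model is assumed (hypothetically) to return the population output frequencies directly, and the optimizer may be SGD or any of the gradient-based variants listed above:

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, spike_input, labels, num_classes):
        optimizer.zero_grad()
        pop_freq = model(spike_input)                    # forward propagation, [batch, C]
        target = F.one_hot(labels, num_classes).float()  # real labels, one-hot encoded
        loss = F.mse_loss(pop_freq, target)              # mean square error (MSE)
        loss.backward()                                  # gradients for the parameter weights
        optimizer.step()                                 # e.g. torch.optim.SGD / Adam
        return loss.item()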
S350: it is judged whether the current spiking neural network converges after the parameter weight is updated; when it converges, the method proceeds to S360; and when it does not converge, the method returns to S320 to perform the next round of training, until the trained spiking neural network emotion recognition model is obtained.
In some embodiments, after the parameter weight is updated, the current spiking neural network in the spiking neural network emotion recognition model is determined on the basis of the updated parameter weight, and the convergence of the current spiking neural network may then be judged according to the verification set in the dynamic visual data set. When the current spiking neural network has converged, the method proceeds to S360 and the operation ends, and a spiking neural network emotion recognition model based on the latest parameter weight is obtained; test training may also be performed on the spiking neural network emotion recognition model by using the acquired test sets of a plurality of emotion categories, and the corresponding emotion category is outputted, so as to obtain the trained spiking neural network emotion recognition model. When the current spiking neural network has not converged, the method returns to S320 to use the training set again to perform the next round of training on the updated current spiking neural network, so as to update the parameter weight again until the updated current spiking neural network converges.
S360: the training ends, to obtain a trained spiking neural network emotion recognition model. In some embodiments, when it is judged that the current spiking neural network after updating the parameter weight has converged, the training ends, to obtain a trained spiking neural network emotion recognition model.
It should be noted that in practical applications, there may be a plurality of manners for judging whether the current spiking neural network converges. For example, it may be judged whether the current training number of times reaches a preset number of times: when it reaches the preset number of times, the current spiking neural network has converged, and when it does not reach the preset number of times, the current spiking neural network has not converged. The judgment may also be performed by judging whether an error reduction degree of the current spiking neural network is stabilized within a preset range: when the error reduction degree is stabilized within the preset range, the current spiking neural network has converged, and otherwise it has not converged. Further, it may be judged whether an error based on the current spiking neural network is less than an error threshold: when the error is less than the error threshold, the current spiking neural network has converged, and otherwise it has not converged.
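The three judgment manners above may be combined into a small helper, sketched as follows (all threshold values are placeholder assumptions):

    def has_converged(epoch: int, errors: list, max_epochs: int = 100,
                      stable_range: float = 1e-4, error_threshold: float = 0.01) -> bool:
        # Manner 1: the current training number of times reaches the preset number
        if epoch >= max_epochs:
            return True
        # Manner 2: the error reduction degree is stabilized within the preset range
        if len(errors) >= 2 and abs(errors[-2] - errors[-1]) < stable_range:
            return True
        # Manner 3: the current error is less than the error threshold
        return errors[-1] < error_threshold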
Hence, in the embodiments of the present disclosure, a spiking neural network is trained by using a pre-established dynamic visual data set to obtain a spiking neural network emotion recognition model, and then to-be-recognized spiking sequences corresponding to video information are acquired; the to-be-recognized spiking sequences are inputted into the spiking neural network emotion recognition model and recognized by the model, so as to obtain a corresponding emotion category. Some embodiments of the present disclosure may recognize an emotion category on the basis of video information in use, so that the ways for emotion recognition increase, which is beneficial to better achieving emotion recognition.
In some embodiments of the present disclosure, please further refer to
It should be noted that, for the specific implementation process of each step in these embodiments, reference may be made to the implementation process of the corresponding steps in the embodiments above, and details will not be repeated herein.
In some embodiments of the present disclosure, please further refer to
It should be noted that, for the specific implementation process of each step in these embodiments, reference may be made to the implementation process of the corresponding steps in the embodiments above, and details will not be repeated herein.
On the basis of the embodiments above, embodiments of the present disclosure further provide an apparatus for recognizing emotion. For details, please refer to
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the apparatus further includes: a training module 801, a schematic structural diagram of which is as shown in
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the training module includes: a test set acquisition module and a first training module, wherein
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the training module includes: an establishment module and a second training module,
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the establishment module includes: a raw visual data collection module, a data set establishment module, and at least one of a simulation processing module and a spiking sequence acquisition module, wherein
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the simulation processing module includes: a traversing module, a first conversion module and a first assignment module, wherein
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the simulation processing module further includes: a second assignment module, an update module and a second conversion module, wherein
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the simulation processing module further includes: a second spiking sequence acquisition module, wherein
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the assignment module includes: a first calculation module, and at least one of a first position assignment module and a second position assignment module,
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the spiking neural network includes a voting neuronal population;
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the second training module further includes: a second propagation module, wherein
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the judgment module includes at least one of the following:
In some embodiments of the present disclosure, in some other embodiments, based on the embodiments above, the spiking neural network further includes: a feature extraction module and an emotion mapping module, wherein the feature extraction module includes a single forward extraction unit composed of convolution, normalization, a Parametric Leaky-Integrate and Fire (PLIF) model and average pooling, and a network unit composed of two fully-connected layers and two PLIF models, which are arranged alternately; and the emotion mapping module is configured to map the spiking sequences outputted by the voting neuronal population module to a final emotion category.
Please further refer to
Please further refer to
With regard to the apparatuses in the embodiments above, the specific manner in which various components execute operations has been described in detail in the corresponding method embodiments, and will not be described in detail herein.
It should be noted that the apparatus for recognizing emotion provided in the embodiments of the present disclosure has the same beneficial effects as the method for recognizing emotion provided in the embodiments above; and for specific introduction of the method for recognizing emotion involved in the embodiments of the present disclosure, reference may be made to the embodiments above, and they will not be repeated herein.
On the basis of the embodiments above, embodiments of the present disclosure further provide an apparatus for recognizing emotion, including:
For example, the processor in the embodiments of the present disclosure may be used for: acquiring to-be-recognized spiking sequences corresponding to video information; and recognizing the to-be-recognized spiking sequences by using a pre-established spiking neural network emotion recognition model, so as to obtain a corresponding emotion category; wherein the spiking neural network emotion recognition model is obtained by training a spiking neural network by using a pre-established dynamic visual data set.
Embodiments of the present disclosure further provide an electronic device, including:
On the basis of the embodiments above, embodiments of the present disclosure further provide a non-transitory computer-readable storage medium; the non-transitory computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method for recognizing emotion or the method for training an emotion recognition model is implemented.
The non-transitory computer-readable storage medium may include various media capable of storing program codes, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
Some embodiments of the present disclosure also provide a computer program product, including a computer program or instructions, which when executed by a processor, implement the method for recognizing emotion or the method for training an emotion recognition model.
The apparatus embodiments described above are merely exemplary. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or may be distributed over a plurality of network units. Some or all of the components may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. A person of ordinary skill in the art can understand and implement the embodiments without any inventive effort.
Refer to
The processing assembly 1102 generally controls overall operations of the electronic device 1100, such as operations associated with display, phone calls, data communications, camera operations and recording operations. The processing assembly 1102 may include one or more processors 1120 to execute instructions to complete all or some of the steps of the described methods. In addition, the processing assembly 1102 may include one or more modules to facilitate interaction between the processing assembly 1102 and other assemblies. For example, the processing assembly 1102 may include a multimedia module to facilitate interaction between the multimedia assembly 1108 and the processing assembly 1102.
The memory 1104 is configured to store various types of data to support operations on the device 1100. Examples of such data include instructions for any application program or method operating on the electronic device 1100, contact data, telephone directory data, messages, pictures, video, etc. The memory 1104 may be implemented by any type of transitory or non-transitory storage device or combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
The power supply assembly 1106 provides power for various assemblies of the electronic device 1100. The power supply assembly 1106 may include a power management system, one or more power supplies, and other assemblies associated with generation, management and distribution of power for the electronic device 1100.
The multimedia assembly 1108 includes a screen that provides an output interface between the electronic device 1100 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). When the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from the user. The touch panel includes one or more touch sensors to sense a touch, a swipe, and a gesture on the touch panel. The touch sensor may not only sense boundaries of the touch or swipe actions, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia assembly 1108 includes a front-facing camera and/or a rear-facing camera. When the device 1100 is in an operation mode, such as a photographing mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zooming capability.
The audio assembly 1110 is configured to output and/or input audio signals. For example, the audio assembly 1110 includes a microphone (MIC), and when the electronic device 1100 is in an operation mode, such as a call mode, a recording mode and a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 1104 or sent via the communication assembly 1116. In some embodiments, the audio assembly 1110 further includes a loudspeaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing assembly 1102 and a peripheral interface component, and the peripheral interface component may be a keyboard, a click wheel, buttons, etc. These buttons may include but are not limited to: a home button, a volume button, a start button and a lock button.
The sensor assembly 1114 includes one or more sensors for providing state assessment of various aspects of the electronic device 1100. For example, the sensor assembly 1114 may detect an on/off state of the device 1100 and relative positioning of assemblies, for example, the display and keypad of the electronic device 1100; and the sensor assembly 1114 may also detect position change of the electronic device 1100 or of one assembly of the electronic device 1100, existence or non-existence of contact between the user and the electronic device 1100, orientation or acceleration/deceleration of the electronic device 1100, and temperature change of the electronic device 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 1114 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication assembly 1116 is configured to facilitate wired or wireless communications between the electronic device 1100 and other devices. The electronic device 1100 may access a wireless network based on a communication standard, such as Wi-Fi, operator networks (such as 2G, 3G, 4G or 5G), or a combination thereof. In some exemplary embodiments, the communication assembly 1116 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In some exemplary embodiments, the communication assembly 1116 also includes a near field communication (NFC) component to facilitate short-range communication. For example, the NFC component may be achieved on the basis of radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In embodiments, the electronic device 1100 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic elements, to perform the method for recognizing emotion or the method for training an emotion recognition model as described above.
In embodiments, a computer-readable storage medium is also provided, e.g., a memory 1104 including instructions, which may be executed by the processor 1120 of the electronic device 1100 to implement the method for recognizing emotion or the method for training an emotion recognition model as described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In embodiments, a computer program product is further provided. When instructions in the computer program product are executed by the processor 1120 of the electronic device 1100, the electronic device 1100 is enabled to execute the method for recognizing emotion or the method for training an emotion recognition model as described above.
The apparatus 1200 may also include a power supply assembly 1226 configured to perform power supply management of the apparatus 1200, a wired or wireless network interface 1250 configured to connect the apparatus 1200 to a network, and an input/output (I/O) interface 1258. The apparatus 1200 may operate on the basis of an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Various embodiments in the description are described in a progressive manner. Each embodiment focuses on differences from other embodiments. For the same or similar parts among the embodiments, reference may be made to each other. For the apparatuses disclosed in the embodiments, as the apparatuses correspond to the methods disclosed in the embodiments, the illustration thereof is relatively simple, and for the related parts, reference may be made to the illustration of the method part.
It should be noted that in the present description, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or sequence between these entities or operations. Furthermore, terms "include", "including", or any other variations thereof are intended to cover a non-exclusive inclusion, so that a process, a method, an article, or a device that includes a series of elements not only includes those elements, but also includes other elements that are not explicitly listed, or further includes inherent elements of the process, the method, the article, or the device. Without further limitation, an element defined by a sentence "including a . . . " does not exclude other same elements existing in the process, the method, the article, or the device that includes the element.
The illustration of the disclosed embodiments enables a person skilled in the art to implement or use some embodiments of the present disclosure. Various modifications to these embodiments will be apparent to a person skilled in the art. The general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Accordingly, the present disclosure is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Priority application: No. 202210119803.3, filed February 2022, China (national).
The present application is a National Stage Application of PCT International Application No. PCT/CN2022/122733, filed Sep. 29, 2022, which claims the benefit of priority to Chinese Patent Application No. 202210119803.3, filed with the China National Intellectual Property Administration on Feb. 9, 2022 and entitled "Method for recognizing emotion and Apparatus, System and Computer-Readable Storage Medium", which is incorporated herein by reference in its entirety. To the extent appropriate, a claim of priority is made to each of the above disclosed applications.
Filing document: PCT/CN2022/122788, filed 9/29/2022 (WO).