This application relates to systems and methods for detecting the emotion engagement of a group of people performing certain tasks, based on visual information streamed from video recording devices over the Internet or local networks. The system can be deployed on a cloud computing backend as well as on mobile and edge devices. The system has particular applicability to applications such as virtual meetings, e-learning, classrooms, shopping, retail, e-commerce, etc.
Machine Learning (ML) focuses on training intelligent systems capable of learning patterns from data without being explicitly programmed. Deep Learning (DL), a subset of Machine Learning within the Artificial Intelligence (AI) domain, can train complex neural network systems to gain enormous insight from various data sources such as audio, video, text, etc. Computer-vision-based DL systems operate on video and image data, such as images of objects, frontal facial images, retina fundus images, etc., to train classifiers for specific tasks such as object detection, face identification, emotion classification, disease classification, etc. With advances in technology, there is a significant shift in the way we set up meetings, learn in classrooms, and engage with other people. The use of technology in all aspects of life and the use of handheld and portable devices are changing the way we interact with both computers and fellow humans alike.
It is often important, as a consumer and producer of information and as a social animal, to evaluate the engagement of others in a large group in certain tasks, such as listening to lectures or participating in meetings, because people's attention spans vary widely. Therefore, a non-interactive method of extracting behavioral patterns, attention span, excitement, and engagement could help interpret whether a group of people paid attention to the events and gained any insight from them. This system can be used to assess qualitative and quantitative measures of sharing and receiving information in a group. Temporal information on attentiveness, on a per-user basis as well as for the group, can be used to infer points of interest and thereby help design novel interactive teaching methods, personalized exercise recommendations for students, better resource sharing, more efficient meetings, etc.
Emotions play a crucial role in human lives, being functionally adaptive in our evolutionary history and assisting individual survival. Humans are rapid information-processing organisms that use emotion detection to make quick decisions about whether to defend, attack, care for others, escape, reject food, or approach something useful. Emotions, therefore, not only influence immediate actions but also serve as an important motivational basis for future behaviors. Emotions are expressed both verbally through words and nonverbally through facial expressions, voices, gestures, body postures, and movements. Emotions communicate information about our feelings, intentions, and relationships when interacting with others. Therefore, emotions have signal value to others and influence our social interactions. In general, as described in “The Expression of Emotion in Man and Animals,” published by Oxford University Press, by Charles Darwin and Philip Prodger [1998], emotion expressions are evolved and adaptive; they not only serve as part of an emotion mechanism that protects the organism or prepares it for action, but also have significant communicative functionality.
Facial Expression (FE) research gained momentum with Darwin's theory that expressions are universal. Later, FEs were categorized into a set of six basic emotions. Paul Ekman, with his collaborators, and Izard conducted cross-cultural studies and proposed that the interpretation of emotion from facial expressions is universal. Ekman et al. published their findings in “Constants across Cultures in the Face and Emotion,” Journal of Personality and Social Psychology [1971], pages 124-129. Furthermore, Ekman and his team developed objective measures of facial expression, named the Facial Action Coding System (FACS), and published their findings in “Facial Action Coding System,” Consulting Psychologists Press, Inc., Palo Alto, Calif., in 1978. Ekman has proposed several theories of emotion. Among them, the dimensional approach argues that emotions are not discrete and separate, but are better measured as differing only in degree on one or another dimension. The findings were published in a Cambridge University Press publication, “Emotion in the Human Face,” by Ekman et al. [1982]. Dimension theory has proposed that different emotions are linked to relatively distinct patterns of autonomic nervous system activity. A Micro-Expression (ME) is a very transitory, automatic reflex of FE corresponding to experienced emotions. MEs may occur in high-stakes situations when people attempt to conceal or mask their actual mindsets. This underscores the connection between facial expression and autonomic physiology. Studies of the central nervous system correlates of facial expressions also bear upon the dimensional-versus-discrete issue. Discrete emotion theorists have argued that the experience and perception of different facial expressions of emotion involve distinct central nervous system regions. Publications such as “An Argument for Basic Emotions,” published in Cognition & Emotion (Taylor & Francis Press) by Paul Ekman in 1992, pages 169-200, and “Four Systems for Emotion Activation: Cognitive and Noncognitive Processes,” published in Psychological Review by C. E. Izard in 1993, pages 68-90, describe the discrete emotion model. These concepts were examined and supported by functional magnetic resonance imaging by Morris, J. S. et al., who published their findings in “A Differential Neural Response in the Human Amygdala to Fearful and Happy Facial Expressions,” Nature, pages 812-815 [1996].
To include all facial image variations, a sequence of video frames (c1, c2, . . . , cn) is considered as the input, and the output of the network is a binary number y. We propose a residual network with a Long Short-Term Memory (LSTM) layer on top of it to extract the intra-class similarity and inter-class discriminability of facial images captured from different video frames; in other words, the conditional probability of the output, p(y | (c1, c2, . . . , cn)). The temporal feature of a facial image in a frame is represented as an embedding vector. The embedding vector per identity is constructed through the residual network architecture, which consists of residual blocks.
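As an illustration only, and not the claimed implementation, the following sketch shows how such a frame-sequence model could be assembled in PyTorch. The class name FrameSequenceClassifier, the ResNet-18 backbone, and all hyperparameters are assumptions made for the example.

```python
# Minimal sketch (PyTorch assumed): per-frame embeddings from a residual
# backbone are fed to an LSTM, and a sigmoid head yields p(y | c1, ..., cn).
# All class names and hyperparameters here are illustrative, not from the source.
import torch
import torch.nn as nn
import torchvision.models as models

class FrameSequenceClassifier(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)      # residual embedding network
        backbone.fc = nn.Identity()                   # keep the 512-d embedding
        self.backbone = backbone
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)          # binary output y

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (B*T, 512) per-frame embeddings
        feats = feats.view(b, t, -1)                  # (B, T, 512)
        out, _ = self.lstm(feats)                     # temporal modeling over frames
        return torch.sigmoid(self.head(out[:, -1]))   # p(y | c1, ..., cn)

probs = FrameSequenceClassifier()(torch.randn(2, 8, 3, 224, 224))
```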
The general form of each block can be formulated as:
$$y_l = h(x_l) + F\big(x_l, (W_r, b_r)_l\big)$$
$$x_{l+1} = f(y_l)$$
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th unit, $h$ is a forward function of the plain unit, $F$ is a residual function, $r$ stands for the number of repeated convolution layers in the residual function, and $f$ is a differentiable threshold function. The initial idea of the present invention follows ResNet: achieve an additive residual function $F$ with respect to $h(x_l)$ and thereby facilitate minimizing the loss function. In this regard, we emphasize the importance of the identity facial feature mapping, $h(x_l) = x_l$; in the general formula, $r$ denotes how many times the convolutional layers are repeated in the residual branch, and we keep the identity mapping for the plain branch. The other noteworthy knob in the residual block is the differentiable threshold function $f$. If $f$ is also taken to be an identity mapping, then for any deeper unit $L$ and shallower unit $l$:

$$x_L = x_l + \sum_{i=l}^{L-1} F\big(x_i, (W_r, b_r)_i\big)$$

This assumption turns the matrix-vector products, say $x_L = \prod_{i=l}^{L-1} W_i\, x_l$, into the summation of the outputs of all preceding residual functions (plus $x_l$), and consequently yields a clean backpropagation formula:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F\big(x_i, (W_r, b_r)_i\big)\right)$$
One of the most interesting properties of this architecture is that it reduces the probability of the gradient being canceled out. Referring back to the general form of the residual units, there are other residual units that increase dimensions and reduce feature-map sizes by using the conventional activation function, the Rectified Linear Unit (ReLU), as the differentiable threshold function.
The last residual block maps a facial image into the embedding vector.
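For concreteness, a minimal sketch of such a residual unit is shown below, assuming a PyTorch implementation. The choice of ReLU as the threshold function f, the use of batch normalization, and r = 2 convolution layers are assumptions of the example rather than limitations of the design.

```python
# Minimal sketch (PyTorch assumed) of the residual unit described above:
# h(x_l) is the identity mapping, F is a branch of r repeated convolution
# layers, and f is the differentiable threshold function (ReLU here).
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, channels, r=2):
        super().__init__()
        layers = []
        for _ in range(r):                              # r repeated conv layers in F
            layers += [nn.Conv2d(channels, channels, 3, padding=1, bias=True),
                       nn.BatchNorm2d(channels),
                       nn.ReLU(inplace=True)]
        self.residual = nn.Sequential(*layers[:-1])     # drop last ReLU; f is applied after the sum
        self.f = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.f(x + self.residual(x))             # x_{l+1} = f(h(x_l) + F(x_l))

x = torch.randn(1, 64, 56, 56)
print(ResidualUnit(64)(x).shape)                        # torch.Size([1, 64, 56, 56])
```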
The LSTM layer that follows the residual network is governed by the gate equations:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o)$$
$$g_t = \mathrm{PReLU}(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \mathrm{PReLU}(c_t)$$
The inputs of the three gates consist of the current time step of the input, the last time step of the output, and the internal memory. The cell memory is updated by a combination of the input gate ($i_t$) and the forget gate ($f_t$). The influence of the input on the internal state is controlled by the input gate, while the forget gate controls the contribution of the last internal state to the current internal state.
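The sketch below (PyTorch assumed) mirrors the gate equations above. As a simplifying assumption, the peephole terms $W_{ci}$, $W_{cf}$, and $W_{co}$ are treated as element-wise (diagonal) weights, and PReLU is used in place of the usual tanh, as stated in the equations; the class name is illustrative.

```python
# Minimal sketch of a peephole LSTM cell with PReLU activations.
# Peephole weights are diagonal (element-wise), a simplifying assumption.
import torch
import torch.nn as nn

class PeepholePReLULSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.x2g = nn.Linear(input_dim, 4 * hidden_dim)      # W_x* terms plus biases
        self.h2g = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)  # W_h* terms
        self.c2i = nn.Parameter(torch.zeros(hidden_dim))     # diagonal W_ci
        self.c2f = nn.Parameter(torch.zeros(hidden_dim))     # diagonal W_cf
        self.c2o = nn.Parameter(torch.zeros(hidden_dim))     # diagonal W_co
        self.act = nn.PReLU()

    def forward(self, x_t, h_prev, c_prev):
        gi, gf, go, gg = (self.x2g(x_t) + self.h2g(h_prev)).chunk(4, dim=-1)
        i_t = torch.sigmoid(gi + self.c2i * c_prev)          # input gate
        f_t = torch.sigmoid(gf + self.c2f * c_prev)          # forget gate
        o_t = torch.sigmoid(go + self.c2o * c_prev)          # output gate (uses c_{t-1} as written)
        g_t = self.act(gg)                                   # candidate update with PReLU
        c_t = f_t * c_prev + i_t * g_t                       # internal memory update
        h_t = o_t * self.act(c_t)                            # hidden state
        return h_t, c_t

cell = PeepholePReLULSTMCell(512, 256)
h, c = cell(torch.randn(4, 512), torch.zeros(4, 256), torch.zeros(4, 256))
```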
Basic human emotions translate to a variety of facial muscle movements. It is often easy for humans to read basic emotions such as happiness, sadness, etc. from facial expressions. Teaching a neural network to classify these basic emotions at or beyond human-level accuracy is a tedious task. The model must not only detect faces of different sizes, but also accurately generate emotion probabilities for each face. Mathematically, the temporal deep learning model solves an optimization problem over a facial expression image database to find the optimal model on the selected training set for detecting basic emotions. The model consists of several convolutional neural network layers, with a very large number of learnable parameters between the layers, that extract various Action Unit (AU) features from the facial images and discover the hidden patterns in them. Action Units (AUs) are the fundamental actions of individual muscles or groups of facial muscles. They are classified as additive or non-additive according to whether their appearance changes when they occur in combination. In additive AUs, the combination does not change the appearance of the other AUs present. The main goal of the proposed deep learning model is to provide the probability of the basic emotions for a real-time video as a single-modal input and to analyze their emotion trajectories. Teaching the neural network falls into the category of supervised learning, in which the neural network is provided with actual data and ground truths to learn from; teaching the neural network thus becomes an optimization problem. The input layer accepts a streaming video with facial images, and the output layer generates eight (8) classes of emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise, continuously for each frame in the video. Since the output depends on a short span of time, this temporal model provides interesting advantages over other traditional machine learning methods.
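As a hedged illustration of the supervised setup described above, and not the actual patented architecture, the sketch below trains a small convolutional classifier over the eight emotion classes with a cross-entropy objective; the layer sizes, optimizer settings, and input resolution are placeholders.

```python
# Minimal sketch (PyTorch assumed): a small convolutional classifier trained
# with supervision to output probabilities for the eight emotion classes.
import torch
import torch.nn as nn

EMOTIONS = ["anger", "contempt", "disgust", "fear",
            "happiness", "neutral", "sadness", "surprise"]

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, len(EMOTIONS)),                 # 8 logits, one per emotion class
)

criterion = nn.CrossEntropyLoss()                 # supervised learning against ground truth
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

faces = torch.randn(16, 3, 64, 64)                # a batch of cropped face images
labels = torch.randint(0, len(EMOTIONS), (16,))   # ground-truth emotion indices

logits = model(faces)
loss = criterion(logits, labels)                  # optimization objective
loss.backward()
optimizer.step()

probs = logits.softmax(dim=-1)                    # per-frame emotion probabilities
```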
A gaze tracking subsystem uses video frame information to compute spatial and temporal characteristics of eye movement in order to estimate user intent and attention. Gaze is estimated from the relative movement between the pupil center and glint positions and can be tracked actively. The objective is to estimate the orientation of the eyes with no or slight head movement. The direction of eye gaze, including head orientation, is considered in this task.
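One common way to realize such pupil-glint gaze estimation is a calibrated polynomial mapping from the pupil-center-to-glint offset to a gaze point. The sketch below illustrates that idea only; the calibration coefficients and polynomial feature set are assumptions, not the disclosed method.

```python
# Minimal geometric sketch: gaze is approximated from the pupil-center-to-glint
# offset in the eye image, mapped to gaze coordinates with a calibrated
# polynomial. The calibration coefficients here are placeholders (assumption).
import numpy as np

def gaze_from_pupil_glint(pupil_xy, glint_xy, calib):
    """Map the pupil-glint difference vector to an estimated gaze point."""
    dx, dy = np.asarray(pupil_xy, float) - np.asarray(glint_xy, float)
    features = np.array([1.0, dx, dy, dx * dy, dx**2, dy**2])
    gaze_x = features @ calib[0]        # calibrated mapping for x
    gaze_y = features @ calib[1]        # calibrated mapping for y
    return gaze_x, gaze_y

calib = np.random.randn(2, 6) * 0.01    # placeholder per-user calibration
print(gaze_from_pupil_glint((312, 240), (305, 236), calib))
```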
Thus, what is needed is a system capable of reading streaming video and audio, finding each user's face in the video, and converting it to emotional attributes and general attentiveness levels, leading to the attentiveness and engagement of the whole group over time. The attentiveness of each user can be estimated by exploring emotional attributes and gaze estimation.
The present invention has a variety of uses including, but not limited to, the following illustrative uses:
Many aspects of the present disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
The present disclosure relates to a temporal neural network system capable of estimating the excitement and attentiveness of multiple users from streaming videos. The system is capable of detecting spontaneous facial expressions through time. Subtle facial expressions provide enormous insight into a person's behavior. A face detection neural network system is coupled with a temporal neural network emotion model applied to streaming videos and can model a person's behavioral pattern over time. This is crucial in understanding a person's attentiveness and excitement. Overall, the system extracts multiple faces from a streaming video, finds specific emotions in each face (e.g., happiness, fear, anger, etc.), and determines the degree of arousal and valence associated with each emotion. Thus, the affective computing system provides a comprehensive model for extracting a person's emotions, the person's emotional behavior over time, and the associated degree of arousal and valence.
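The overall flow can be pictured with the following sketch, in which detect_faces, emotion_model, and estimate_gaze stand in for the subsystems described in this disclosure; the data structure and the simple attentive-fraction aggregation are illustrative assumptions.

```python
# Minimal pipeline sketch of the described flow; detect_faces, emotion_model,
# and estimate_gaze are placeholders, not actual library calls.
from dataclasses import dataclass

@dataclass
class FrameResult:
    emotions: dict      # per-face emotion probabilities
    arousal: float      # degree of arousal
    valence: float      # degree of valence
    attentive: bool     # gaze roughly on target

def analyze_frame(frame, detect_faces, emotion_model, estimate_gaze):
    results = []
    for face in detect_faces(frame):                      # extract multiple faces
        probs, arousal, valence = emotion_model(face)     # emotions plus arousal/valence
        gaze = estimate_gaze(face)                        # direction of sight
        results.append(FrameResult(probs, arousal, valence, gaze.on_target))
    return results

def group_engagement(frame_results):
    """Fraction of detected faces judged attentive in this frame."""
    if not frame_results:
        return 0.0
    return sum(r.attentive for r in frame_results) / len(frame_results)
```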
The term facial expression is defined as a distinctive change in the human face involving various facial muscle groups according to different situations or emotions. The Facial Action Coding System (FACS) uses physical, visual changes in the face, called action units (AUs), to encode facial expressions. FACS encoding can combine basic facial actions to represent complex human facial expressions. Each facial expression can have one or many AUs associated with it. Unique facial AUs are the result of one or more facial muscle movements. Thus, at a high level, FACS encodes subtle facial muscle movements into discrete action units. For example, AUs 1, 4, and 15 together correlate to a ‘sad’ emotion. In other words, the emotion ‘sad’ is encoded using FACS by combining AU 1 (Inner Brow Raiser), AU 4 (Brow Lowerer), and AU 15 (Lip Corner Depressor).
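A toy illustration of this encoding is a lookup from emotions to AU sets. Only the ‘sad’ mapping (AUs 1, 4, 15) comes from the text above; the other entries are common textbook examples included purely for illustration.

```python
# Illustrative sketch of FACS-style encoding: each basic emotion is associated
# with a set of action units. Only the 'sad' mapping is stated in the text above.
EMOTION_TO_AUS = {
    "sad":      {1, 4, 15},     # Inner Brow Raiser, Brow Lowerer, Lip Corner Depressor
    "happy":    {6, 12},        # Cheek Raiser, Lip Corner Puller (illustrative)
    "surprise": {1, 2, 5, 26},  # illustrative
}

def match_emotions(detected_aus):
    """Return emotions whose AU sets are fully present among the detected AUs."""
    return [e for e, aus in EMOTION_TO_AUS.items() if aus <= set(detected_aus)]

print(match_emotions([1, 4, 15, 17]))   # ['sad']
```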
Steps 101 through 107 in
Step 104 in
Step 105 in
In Step 106 in
Step 107 uses as input the output of Step 103 and estimates the direction of sight or gaze of each face and head movement in the streaming video (
Other Publications:
International Search Report and Written Opinion dated Mar. 27, 2019; 14 pages, PCT/US19/14228.
Excerpt from Ekman, P. et al., “Emotion in the Human Face: Chapter XIV: What Emotion Dimensions Can Observers Judge from Facial Behavior?”, Cambridge University Press, pp. 67, 75 [1982].
Excerpts from Darwin, C. et al., “The Expression of Emotion in Man and Animals: Introduction,” Oxford University Press, p. 11 [1998].
Excerpts from Darwin, C. et al., “The Expression of Emotion in Man and Animals: The Principle of Serviceable Associated Habits,” Oxford University Press, p. 23 [1998].
Excerpts from Darwin, C. et al., “The Expression of Emotion in Man and Animals: The Principle of the Direct Action of the Nervous System,” Oxford University Press, p. 45 [1998].