Advances in deep learning techniques have had a significant impact on end-to-end conversational systems. Most of the present research in this area focuses primarily on the functional aspects of conversational systems, such as keyword extraction, natural language understanding, and the logical pertinence of generated responses. As a result, conventional conversational systems are typically designed and trained to generate grammatically correct, logically coherent responses relative to input prompts. Although proper grammar and logical coherence are necessary qualities for dialog generated by a conversational system, most existing systems fail to express social intelligence. Consequently, there is a need in the art for affect-driven dialog generation solutions capable of producing responses that express emotion in a controlled manner, without sacrificing grammatical correctness or logical coherence.
There are provided systems and methods for performing affect-driven dialog generation, substantially as shown in and/or described in connection with at least one of the figures, and as set forth more completely in the claims.
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
The present application discloses affect-driven dialog generation systems and methods that overcome the drawbacks and deficiencies in the conventional art. The present solution does so at least in part by utilizing a sequence-to-sequence architecture (hereinafter “seq2seq architecture”) to generate emotionally diverse dialog responses to an input dialog sequence. The present application discloses a novel and inventive seq2seq architecture including an affective re-ranking stage, in addition to an encoder and decoder.
It is noted that, as defined in the present application, the expression "affect-driven dialog generation" refers to a dialog generation approach that includes a socially intelligent, emotionally responsive component, as well as components addressing grammar and pertinence. Thus, the dialog sequences generated using an "affect-driven" approach express emotions that are relevant to the interaction supported by those dialog sequences, while also being grammatically correct and logically coherent relative to that interaction.
It is also noted that, as used herein, the feature “conversational agent” refers to a non-human interactive interface implemented using hardware and software, and configured to autonomously acquire and consolidate dialog knowledge so as to enable the conversational agent to engage in extended social interactions. A conversational agent may be utilized to animate a virtual character, such as an avatar, or a machine, such as a robot, automated voice response (AVR) system, or an interactive voice response (IVR) system, for example.
As further shown in FIG. 1, computing platform 102 of affect-driven dialog generation system 100 includes hardware processor 104 and system memory 106 storing software code 110 including seq2seq architecture 150, as well as non-human conversational agent 108.
It is noted that, although the present application refers to software code 110 as being stored in system memory 106 for conceptual clarity, more generally, system memory 106 may take the form of any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium,” as used in the present application, refers to any medium, excluding a carrier wave or other transitory signal that provides instructions to hardware processor 104 of computing platform 102. Thus, a computer-readable non-transitory medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory media include, for example, optical discs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
It is further noted that although FIG. 1 depicts software code 110 and non-human conversational agent 108 as being stored together in system memory 106, that representation is merely exemplary. More generally, affect-driven dialog generation system 100 may include one or more computing platforms 102, such as computer servers, for example, which may be co-located, or may form an interactively linked but distributed system, such as a cloud-based system, for instance.
As a result, hardware processor 104 and system memory 106 may correspond to distributed processor and memory resources within affect-driven dialog generation system 100. Thus, it is to be understood that non-human conversational agent 108 and software code 110 may be stored remotely from one another and/or may be executed using the distributed processor resources of affect-driven dialog generation system 100.
According to some implementations, user 120 may interact with computing platform 102 to utilize software code 110 to train seq2seq architecture 150 to perform affect-driven dialog generation. Alternatively, or in addition, user 120 may utilize computing platform 102 to interact with non-human conversational agent 108 supported by software code 110 including seq2seq architecture 150.
It is noted that, during training of seq2seq architecture 150, dialog sequence 128 determined by seq2seq architecture 150 of software code 110 may be reviewed and/or corrected by one or more annotators 122 using training input 144. One or more annotators 122 may be remote from one another, as well as remote from computing platform 102, and may be in communication with computing platform 102 via communication network 124, which may be a packet-switched network such as the Internet, for example.
After completion of training, i.e., when software code 110 utilizes seq2seq architecture 150 to perform affect-driven dialog generation substantially autonomously, software code 110 is configured to receive input dialog sequence 130 and predetermined target emotion 132 from non-human conversational agent 108 or another source external to software code 110, and to generate final dialog sequence 134 in response. In various implementations, final dialog sequence 134, when provided as an output by software code 110 including seq2seq architecture 150, may be stored in system memory 106 and/or may be copied to non-volatile storage. Alternatively, or in addition, in some implementations, final dialog sequence 134 may be rendered by non-human conversational agent 108, for example as text on display 118, or as verbal communications by a virtual character rendered on display 118. It is noted that display 118 may be implemented as a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, or another suitable display screen that performs a physical transformation of signals to light.
Input dialog sequence 230, predetermined target emotion 232, dialog sequence 228 determined during training, training input 244, and final dialog sequence 234 correspond respectively in general to input dialog sequence 130, predetermined target emotion 132, dialog sequence 128 determined during training, training input 144, and final dialog sequence 134, in FIG. 1.
In addition, software code 210 including seq2seq architecture 250 corresponds in general to software code 110 including seq2seq architecture 150, and those corresponding features may share the characteristics attributed to either corresponding feature by the present disclosure. That is to say, like software code 210, software code 110 may include features corresponding respectively to training module 212 configured to provide training target dialog response 240, and training database 214 storing training datasets 216a and 216b.
As shown in FIG. 2, software code 210 includes seq2seq architecture 250, as well as training module 212 configured to provide training target dialog response 240, and training database 214 storing training datasets 216a and 216b.
The functionality of software code 110/210 and seq2seq architecture 150/250/350 will be further described by reference to FIG. 4. FIG. 4 shows flowchart 480 presenting an exemplary method for performing affect-driven dialog generation, according to one implementation.
As a preliminary matter, $V = \{w_1, w_2, \ldots, w_{|V|}\}$ is defined to be a vocabulary, while $X = (x_1, x_2, \ldots, x_{|X|})$ is defined to be a sequence of words. In addition, $E_X \in \mathbb{R}^6$ is denoted as an emotion vector representing a probability distribution over six emotions associated with the sequence of words $X$ as:

$E_X = [p_{anger}, p_{surprise}, p_{joy}, p_{sadness}, p_{fear}, p_{disgust}]$ (Equation 1)
It is noted that $X$ can be an input dialog sequence $S$ corresponding to input dialog sequence 130/230/330, a candidate response $R_C$, a final response $R_{final}$ corresponding to final dialog sequence 134/234/334, or a target dialog response $R_0$ corresponding to target dialog response 240. Also utilized is $E_0$, which, during training of seq2seq architecture 150/250/350, is the representation of the emotion of target dialog response $R_0$ 240. Once seq2seq architecture 150/250/350 is trained, $E_0$ indicates predetermined target emotion 132/232/332 for final dialog sequence $R_{final}$ 134/234/334.
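Purely as an illustrative aside, the emotion vector of Equation 1 can be realized as a six-element probability distribution. The following minimal Python sketch shows one such realization; the helper name make_emotion_vector and the example probability values are assumptions introduced here for illustration only:

    import numpy as np

    EMOTIONS = ["anger", "surprise", "joy", "sadness", "fear", "disgust"]

    def make_emotion_vector(probs):
        # Build E_X as in Equation 1: a probability distribution over six emotions.
        e = np.asarray([probs[k] for k in EMOTIONS], dtype=np.float64)
        assert np.isclose(e.sum(), 1.0), "E_X must sum to 1"
        return e

    # Illustrative target emotion E_0: mostly joy, with some surprise.
    E_0 = make_emotion_vector({"anger": 0.02, "surprise": 0.18, "joy": 0.70,
                               "sadness": 0.04, "fear": 0.03, "disgust": 0.03})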
The present novel and inventive approach to performing affect-driven dialog generation utilizes a sequence-to-sequence architecture that is significantly distinguishable over conventional sequence-to-sequence models. In particular, seq2seq architecture 150/250/350 includes encoder 352 and decoder 360, as well as affective re-ranking stage 370 configured to determine the emotional relevance of each of multiple emotionally diverse dialog responses $R_C$ to input dialog sequence $S$ 130/230/330.
According to a conventional sequence-to-sequence model having an encoder and a decoder but no feature corresponding to affective re-ranking stage 370, the encoder computes vector representation hS for source sequence S, while the decoder generates one word at a time, and computes the conditional probability of a candidate response, RC, as:
$\log p(R_C \mid S) = \sum_{t=1}^{|R_C|} \log p(r_t \mid r_{<t}, h_S)$, (Equation 2)

$p(r_t \mid r_{<t}, h_S) = \operatorname{softmax}(g(h_t))$, (Equation 3)

where $g(h_t)$ is a linear mapping from gated recurrent unit (GRU) hidden state $h_t = f(h_{t-1}, r_{t-1})$ to an output vector of size $|V|$, and $r_{t-1}$ is an input to the decoder at time $t$. To generate $R_{final}$, the conventional sequence-to-sequence approach uses the following objective function:

$R_{final} = \operatorname*{argmax}_{R_C} \log p(R_C \mid S)$ (Equation 4)
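For orientation only, the conventional decoding step of Equations 2 and 3 can be sketched in a few lines of PyTorch. The sketch below is an assumed minimal implementation, not the disclosed architecture; the sizes, the token id, and the stand-in encoder state h_S are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, EMB, HID = 10000, 128, 256        # assumed vocabulary, embedding, hidden sizes
    embed = nn.Embedding(V, EMB)         # word embeddings
    gru = nn.GRUCell(EMB, HID)           # h_t = f(h_{t-1}, r_{t-1})
    g = nn.Linear(HID, V)                # linear map g(h_t) to |V| logits

    def decode_step(prev_word, h_prev):
        # One step of Equation 3: p(r_t | r_<t, h_S) = softmax(g(h_t)).
        h_t = gru(embed(prev_word), h_prev)
        return F.softmax(g(h_t), dim=-1), h_t

    h_S = torch.zeros(1, HID)            # stand-in for the encoder representation
    p_1, h_1 = decode_step(torch.tensor([2]), h_S)   # id 2: assumed <bos> token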
The present approach to performing affect-driven dialog generation extends the conventional sequence-to-sequence architecture by including emotion-specific information during both training and inference. A significant challenge for generating and evaluating affect-driven dialog responses is a reliable assessment of emotional state. The present exemplary approach uses two representations of emotion. The first representation of emotion is categorical, and uses the six emotions addressed by Equation 1, i.e., anger, surprise, joy, sadness, fear, and disgust. The second representation is continuous, in a Valence-Arousal-Dominance (VAD) mood space known in the art. In one implementation, the second, continuous, representation of emotion uses a VAD lexicon introduced by S. M. Mohammad and also known in the art, in which each of twenty thousand words is mapped to a three-dimensional (3D) vector of VAD values ranging from zero (lowest) to one (highest), i.e., $v \in [0,1]^3$. It is noted that Valence measures positivity/negativity, Arousal measures excitement/calmness, and Dominance measures powerfulness/weakness of an emotion or word.
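By way of illustration only, word-level VAD lookup and a simple sentence-level aggregation might be sketched as follows. The toy lexicon entries, the neutral default for out-of-lexicon words, and the choice of the mean as the aggregator are assumptions made for this example; an actual implementation would load the full twenty-thousand-word lexicon:

    import numpy as np

    # Toy stand-in for the VAD lexicon: word -> [valence, arousal, dominance].
    VAD_LEXICON = {
        "loves": np.array([0.95, 0.60, 0.60]),
        "cat":   np.array([0.70, 0.40, 0.45]),
        "angry": np.array([0.10, 0.85, 0.60]),
    }
    NEUTRAL = np.array([0.5, 0.5, 0.5])   # assumed default for unlisted words

    def sentence_vad(words):
        # Aggregate word-level VAD vectors into one sentence-level vector.
        return np.mean([VAD_LEXICON.get(w, NEUTRAL) for w in words], axis=0)

    print(sentence_vad("the cat really loves me".split()))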
Referring now to FIG. 3, seq2seq architecture 350 includes encoder 352, decoder 360, and affective re-ranking stage 370 having emotion classifier 372.
Affective training requires E0, the emotion representation of target dialog response R0 240. In order to label all sentences of datasets 216a and 216b with E0, emotion classifier 372 is utilized. Emotion classifier 372 is configured to predict a probability distribution over each of the six classes of emotions identified above in Equation 1, i.e., anger, surprise, joy, sadness, fear, and disgust.
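The present disclosure does not limit emotion classifier 372 to any particular model. As one assumed minimal sketch, a linear softmax head over precomputed sentence features would suffice to produce the required distribution over the six emotion classes; the class name, feature dimension, and random input below are illustrative assumptions:

    import torch
    import torch.nn as nn

    class EmotionClassifier(nn.Module):
        # Maps a sentence feature vector to a distribution over six emotions.
        def __init__(self, feat_dim=256, n_emotions=6):
            super().__init__()
            self.head = nn.Linear(feat_dim, n_emotions)

        def forward(self, feats):
            return torch.softmax(self.head(feats), dim=-1)   # rows sum to 1

    clf = EmotionClassifier()
    E_0 = clf(torch.randn(1, 256))   # E_0 label for one training sentence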
According to the present affect-driven dialog generation approach, training of seq2seq architecture 150/250/350 aims to generate the final training dialog sequence Rfinal expressing emotions encoded in E0, using the following objective function in lieu of Equation 4, above, utilized in the conventional art:
$R_{final} = \operatorname*{argmax}_{R_C} \log p(R_C \mid S, E_0)$ (Equation 5)
Discussed below are four different affective training procedures modeling p(RC|S,E0) that are divided into the following implicit and explicit models: (1) Sequence-Level Explicit Encoder Model, (2) Sequence-Level Explicit Decoder Model, (3) Word-Level Implicit Model, and (4) Word-Level Explicit Model.
Sequence-Level Explicit Encoder Model (SEE): Referring to FIG. 3, one way of forcing an emotional output is to explicitly indicate $E_0$ to encoder 352. According to this approach, an emotion embedding $e_{SEE}$ is provided as an additional input to encoder 352 along with input dialog sequence $S$ 130/230/330, where $e_{SEE} = A_{SEE} E_0$ and $A_{SEE}$ is a trainable matrix mapping target emotion $E_0$ into a continuous emotion embedding space.
Sequence-Level Explicit Decoder Model (SED): Another way of forcing an emotional output is to explicitly indicate $E_0$ at every decoding step. According to this approach, in addition to other inputs, an emotion embedding $e_{SED}$ 362 is provided at each step of decoder 360. Formally, the GRU hidden state at time $t$ is calculated as $h_t = f(h_{t-1}, r'_t)$ with $r'_t = [r_{t-1}; e_{SED}]$, where $e_{SED}$ is defined equivalently to $e_{SEE}$, i.e., $e_{SED} = A_{SED} E_0$. It is noted, however, that $A_{SEE}$ and $A_{SED}$ are different, which implies that the emotion embedding spaces into which they map are also different. The present approach advantageously enables the target emotion, $E_0$, to be provided in a continuous space.
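The SEE and SED variants thus differ only in where the learned emotion embedding enters the network. A minimal sketch under the assumptions above, with the trainable matrices realized as bias-free linear layers and illustrative sizes, might read:

    import torch
    import torch.nn as nn

    EMB, N_EMOTIONS = 128, 6
    A_SEE = nn.Linear(N_EMOTIONS, EMB, bias=False)   # e_SEE = A_SEE E_0
    A_SED = nn.Linear(N_EMOTIONS, EMB, bias=False)   # e_SED = A_SED E_0 (distinct matrix)

    E_0 = torch.tensor([[0.02, 0.18, 0.70, 0.04, 0.03, 0.03]])

    # SEE: the emotion embedding is consumed once by the encoder, alongside S.
    e_see = A_SEE(E_0)                               # shape (1, EMB)

    # SED: the emotion embedding is concatenated with the previous word embedding
    # at every decoding step, forming r'_t = [r_{t-1}; e_SED].
    r_prev = torch.randn(1, EMB)                     # stand-in for embedded r_{t-1}
    r_t_in = torch.cat([r_prev, A_SED(E_0)], dim=-1) # decoder input at step t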
Word-Level Implicit Model (WI): To model the word-level emotion carried by each dialog sequence, an Affective Regularizer (AR) is used, which expresses the affective distance between training $R_{final}$ and training $R_0$ in the VAD space. The AR forces seq2seq architecture 150/250/350 to prefer words in the vocabulary that carry emotions in terms of VAD. Formally, the conventional Negative Log Likelihood (NLL) loss is additively combined with a novel and inventive affective regularizer loss term AR as:

$\mathcal{L} = \mathrm{NLL} + \mu \cdot \mathrm{AR}$, (Equation 6)

with:

$\mathrm{AR} = \left\| \sum_{t=1}^{|R_{final}|} E^{VAD} s_t - \sum_{t=1}^{|R_0|} e_{r_t^0}^{VAD} \right\|$, (Equation 7)

where $s_t = \operatorname{softmax}(g(h_t))$ ($s_t \in \mathbb{R}^{|V|}$) is a confidence of the system of generating words $w_1, \ldots, w_{|V|}$ at time $t$, and $\mu \in \mathbb{R}$ is a weighting factor. $e_x^{VAD} \in \mathbb{R}^3$ is a 3D vector representing the emotion associated with a word $x$ in VAD space (note that $e_x^{VAD}$ is constant with respect to $t$), and $E^{VAD} \in \mathbb{R}^{3 \times |V|}$ is a matrix containing $e_{w_i}^{VAD}$ for every word $w_i$ in vocabulary $V$:

$E^{VAD} = [e_{w_1}^{VAD}, e_{w_2}^{VAD}, \ldots, e_{w_{|V|}}^{VAD}]$ (Equation 8)
The affective regularizer term AR penalizes the deviation of the emotional content of the final response $R_{final}$ generated during training from that of target dialog response $R_0$ 240. It is noted that the emotional information carried by $R_{final}$ is the weighted sum of the emotion representations $e_{w_i}^{VAD}$ of the vocabulary words, weighted at each time step by the confidences $s_t$, while the emotional information carried by target dialog response $R_0$ 240 is the sum of the emotion vectors $e_{r_t^0}^{VAD}$ of its words.
Thus, seq2seq architecture 150/250/350 is trained using the loss function given by Equation 6 having affective regularizer term AR based on a difference in emotional content between target dialog response R0 240 and dialog sequence Rfinal determined by seq2seq architecture 150/250/350 during training. Moreover, affective regularizer term AR corresponds to the distance between target dialog response R0 240 and dialog sequence Rfinal determined by seq2seq architecture 150/250/350 during training, in VAD space.
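As an illustrative sketch of the training objective of Equations 6 through 8, the following PyTorch fragment computes the combined loss, assuming teacher forcing so that the decoder emits one row of logits per target word; the function name and the use of cross_entropy for the NLL term are assumptions of this example:

    import torch
    import torch.nn.functional as F

    def affective_loss(logits, target_ids, E_VAD, mu=1.0):
        # logits:     (T, |V|) unnormalized decoder outputs
        # target_ids: (T,)     word ids of target dialog response R_0
        # E_VAD:      (3, |V|) matrix of per-word VAD vectors (Equation 8)
        s = torch.softmax(logits, dim=-1)                # s_t for every step t
        gen_emotion = (s @ E_VAD.T).sum(dim=0)           # sum over t of E_VAD s_t
        tgt_emotion = E_VAD[:, target_ids].sum(dim=1)    # sum over t of e_{r_t^0}
        ar = torch.norm(gen_emotion - tgt_emotion)       # AR: distance in VAD space
        nll = F.cross_entropy(logits, target_ids)        # conventional NLL term
        return nll + mu * ar                             # Equation 6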
Word-Level Explicit Model (WE): Sequential word generation allows sampling of the next word based on the emotional content of the current incomplete sentence. If some words in a dialog sequence do not express the target emotion $E_0$, other words can compensate by changing the final affective content significantly. For example, in the sentence "I think that the cat really loves me!", the first six words are neutral, while the end of the sentence makes it clearly express joy. Such a phenomenon may be captured during training by using an Adaptive Affective Sampling Method:
$\log p(R_C \mid S, E_0) = \sum_{t=1}^{|R_C|} \log p(r_t \mid r_{<t}, e_{r_{<t}}^{VAD}, h_S, E_0)$, (Equation 9)

where:

$p(r_t \mid r_{<t}, e_{r_{<t}}^{VAD}, h_S, E_0) = \lambda \operatorname{softmax}(g(h_t)) + (1-\lambda) \operatorname{softmax}(v(E_t^{VAD}))$ (Equation 10)
Here, $g(h_t)$ is defined as in Equation 3, and $0 \le \lambda \le 1$ is learned during training. The first term in Equation 10 is responsible for generating words according to a language model preserving grammatical correctness of the sequence, while the second term forces generation of words carrying emotionally relevant content. $E_t^{VAD} \in \mathbb{R}^3$ is a vector representing the remaining emotional content needed to match a goal ($E_0^{VAD}$) after generating all words up to time $t$. It is updated every time a new word $r_t$ with an associated emotion vector $e_{r_t}^{VAD}$ is generated:

$E_t^{VAD} = E_0^{VAD} - e_{r_{<t}}^{VAD}$, (Equation 11)

where

$e_{r_{<t}}^{VAD} = \sum_{k=0}^{t-1} e_{r_k}^{VAD}$ (Equation 12)

is an emotion vector associated with words $r_0, \ldots, r_{t-1}$ generated up to time $t$.

The expression $v(E_t^{VAD})$ defines a vector whose $i$-th component measures the potential remaining emotional content of the sequence in the case of choosing the $i$-th word $w_i$:

$v(E_t^{VAD})_i = -\left\| E_t^{VAD} - e_{w_i}^{VAD} \right\|$ (Equation 13)
In one implementation, the parameter λ is set equal to one, i.e., λ=1, after generating the first maxlength/2 words. This setting for λ has been found to ensure that the first generated words carry the correct emotional content, while preserving the grammatical correctness of the dialog response as a whole.
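An assumed sketch of one step of the adaptive affective sampling of Equations 10 through 13 follows. The negative-distance word scoring (per Equation 13 as reconstructed above) and the function signature are assumptions of this example:

    import torch

    def affective_sampling_step(h_t, g, E_t_vad, e_vad, lam):
        # h_t:     (HID,)    decoder hidden state
        # g:       linear map from hidden state to |V| logits (as in Equation 3)
        # E_t_vad: (3,)      remaining emotional content needed to reach E_0^VAD
        # e_vad:   (3, |V|)  per-word VAD vectors
        # lam:     mixing weight, 0 <= lam <= 1 (learned; set to 1 after the
        #          first maxlength/2 words, as described above)
        p_lm = torch.softmax(g(h_t), dim=-1)                  # grammar term
        v = -torch.norm(e_vad - E_t_vad.unsqueeze(1), dim=0)  # Equation 13 scores
        p_emo = torch.softmax(v, dim=-1)                      # emotion term
        p_t = lam * p_lm + (1.0 - lam) * p_emo                # Equation 10 mixture
        r_t = torch.multinomial(p_t, 1).item()                # sample next word
        E_next = E_t_vad - e_vad[:, r_t]                      # Equation 11 update
        return r_t, E_next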
Referring once again to FIG. 4 in combination with FIGS. 1, 2, and 3, flowchart 480 begins with receiving input dialog sequence S 130/230/330 (action 481). Input dialog sequence S 130/230/330 may be received by software code 110/210, executed by hardware processor 104 of computing platform 102.

Thus, as shown in FIG. 4, flowchart 480 continues with receiving predetermined target emotion E0 132/232/332 for final dialog sequence Rfinal 134/234/334 responsive to input dialog sequence S 130/230/330 (action 482). Predetermined target emotion E0 132/232/332 may likewise be received by software code 110/210, executed by hardware processor 104.
Flowchart 480 continues with using seq2seq architecture 150/250/350 to generate multiple emotionally diverse dialog responses RC to input dialog sequence S 130/230/330 based on input dialog sequence S 130/230/330 and predetermined target emotion E0 132/232/332 (action 483). In some implementations it has been found to be advantageous or desirable to generate a number “B” of emotionally diverse, grammatically correct, dialog responses RC for subsequent re-ranking based on their emotional relevance to input dialog sequence S 130/230/330. For example, a list of B emotionally diverse and grammatically correct dialog responses RC may be generated using Beam Search of size B with length normalization.
Generation of the multiple emotionally diverse dialog responses RC to input dialog sequence S 130/230/330 based on input dialog sequence S 130/230/330 and predetermined target emotion E0 132/232/332 may be performed by software code 110/210, executed by hardware processor 104, and using seq2seq architecture 150/250/350. For example, action 483 may be performed using the following objective function:
$R_{final} = \operatorname*{argmax}_{R_C} \left[ \log p(R_C \mid S, E_0) + \alpha \, s_{gram}(R_C) + \beta \, s_{div}(R_C) - \gamma \left\| E_{R_C} - E_0 \right\| \right]$, (Equation 14)

where $s_{gram}(R_C)$ and $s_{div}(R_C)$ are scores promoting the grammatical correctness and the emotional diversity, respectively, of candidate dialog response $R_C$.
It is noted that the weighting factors α and β applied to the second and third terms on the right-hand side of Equation 14 are optimized to ensure grammatical correctness and emotional diversity, respectively. The weighting factors α and β may be optimized using grid search, for example. In one implementation, α may be equal to approximately fifty (α ≈ 50.0), for example. In one implementation, β may be equal to approximately 0.001 (β ≈ 0.001), for example. It is noted that the weighting factor γ may be initially set to zero while the values for α and β are determined.
Flowchart 480 continues with using seq2seq architecture 150/250/350 to determine final dialog sequence Rfinal 134/234/334 responsive to input dialog sequence S 130/230/330 based on the emotional relevance of each of the multiple emotionally diverse dialog responses RC generated in action 483 (action 484). Final dialog sequence Rfinal 134/234/334 may be determined by software code 110/210, executed by hardware processor 104, and using affective re-ranking stage 370 of seq2seq architecture 150/250/350 and the last term of Equation 14.
The last term of Equation 14, i.e., $-\gamma \left\| E_{R_C} - E_0 \right\|$, re-ranks the multiple emotionally diverse dialog responses $R_C$ according to the distance between the emotion vector $E_{R_C}$ of each candidate response and predetermined target emotion $E_0$ 132/232/332, thereby determining the emotional relevance of each candidate response to input dialog sequence S 130/230/330. The value of weighting factor γ may be determined based on training input 144/244 provided by one or more annotators 122.
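As an illustrative sketch of action 484, the fragment below re-ranks B beam-search candidates using the last term of Equation 14. The callable classify_emotion, returning the emotion vector of a candidate, and the array base_scores, holding the remaining Equation 14 terms for each candidate, are assumptions of this example:

    import numpy as np

    def rerank_candidates(candidates, base_scores, E_0, classify_emotion, gamma):
        # Pick R_final among B candidates per Equation 14: the base score of
        # each candidate is penalized by gamma times the distance, in emotion
        # space, between its emotion vector and predetermined target emotion E_0.
        def score(i):
            e_rc = classify_emotion(candidates[i])
            return base_scores[i] - gamma * np.linalg.norm(e_rc - E_0)
        return candidates[max(range(len(candidates)), key=score)]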
According to some implementations, for example, dialog sequence or sequences 128/228 determined using seq2seq architecture 150/250/350 during training are transmitted to one or more annotators 122 for evaluation. In one implementation, one or more annotators 122 may each utilize the AffectButton, introduced by Broekens and Brinkman and known in the art, for assigning emotions. In one implementation, the AffectButton lets one or more annotators 122 choose a facial expression from a continuous space that best matches the emotional state associated with dialog sequence 128/228. That facial expression is received by software code 110/210 of affect-driven dialog generation system 100, when executed by hardware processor 104, and is then mapped into VAD space.
Thereafter, determination of weighting factor γ may be performed by software code 110/210 based on training input 144/244. In some implementations, for example, γ may be greater than one (γ > 1.0). In one implementation, γ may be approximately equal to 4.2 (γ ≈ 4.2).
Flowchart 480 can conclude with providing final dialog sequence Rfinal 134/234/334 as an output for responding to input dialog sequence S 130/230/330 (action 485). Final dialog sequence Rfinal 134/234/334 may be output by software code 110/210, executed by hardware processor 104. In implementations in which software code 110/210 including seq2seq architecture 150/250/350 is used to support interactions by non-human conversational agent 108 with a human user, such as user 120, action 485 may result in final dialog sequence Rfinal 134/234/334 being output to non-human conversational agent 108 for rendering by non-human conversational agent 108.
As further noted above, conversational agent 108 may be utilized to animate a virtual character, such as an avatar, or a machine, such as a robot, for example, so as to enable that virtual character or machine to engage in an extended social interaction with user 120. For example, and as discussed above, in some implementations, final dialog sequence Rfinal 134/234/334 may be rendered by non-human conversational agent 108 as text on display 118 or may be rendered as verbal communications by a machine such as a robot, or by a virtual character rendered on display 118.
Thus, the present application discloses affect-driven dialog generation systems and methods. By training a seq2seq architecture using a loss function having an affective regularizer term based on the difference in emotional content between a target dialog response and a dialog sequence determined by the seq2seq architecture, the present solution enables the generation of multiple, emotionally diverse dialog responses to an input dialog sequence. In addition, by determining a final dialog sequence from among the emotionally diverse dialog responses based on their emotional relevance to the input dialog sequence, the present solution can advantageously provide a socially intelligent and appropriate response to the input dialog sequence. As a result, the present application discloses an affect-driven dialog generation solution capable of expressing emotion in a controlled manner, without sacrificing grammatical correctness or logical coherence.
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.