A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present disclosure relates generally to dialogue systems, and more specifically to using global-to-local memory pointer networks for task-oriented dialogue.
Task-oriented dialogue systems have been developed to achieve specific user goals such as, for example, making restaurant reservations, finding places of interest, helping with navigation or driving directions, etc. Typically, user enquiries into these dialogue systems are limited to a relatively small set of dialogue words or utterances, which are entered or provided via natural language. Conventional task-oriented dialogue solutions are implemented with techniques for natural language understanding, dialogue management, and natural language generation, where each module is customized (designed separately and at some expense) for a specific purpose or task.
In the figures, elements having the same designations have the same or similar functions.
This description and the accompanying drawings that illustrate aspects, embodiments, implementations, or applications should not be taken as limiting; the claims define the protected invention. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail as these are known to one skilled in the art. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
In order to reduce the amount of human effort needed for development of dialogue systems, and to scale up between domains or applications for the same, end-to-end dialogue systems, which input plain text and directly output system responses, have been developed. However, these end-to-end dialogue systems usually suffer in that they are not able to effectively incorporate an external knowledge base (KB) into the system response generation. One of the reasons for this is that a large, dynamic knowledge base can be a voluminous and noisy input, which will make the generation or output of responses unstable. Different from a chit-chat scenario, this problem can be especially challenging or harmful for use in a task-oriented dialogue system, because the information in the knowledge bases is usually expected to include the correct or proper entities in the response. For example, for a dialogue system implementing a car driving assistant, a knowledge base could include information like that illustrated in the example table 610 shown in
To address this problem, according to some embodiments, the present disclosure provides a global local memory pointer (GLMP) network or model for response generation in a task-oriented dialogue system. The GLMP network or model comprises a global memory encoder, a local memory decoder, and an external knowledge memory. The GLMP shares the external knowledge between the encoder and decoder, and leverages the encoder and the external knowledge to learn a global memory pointer. The global memory pointer is then propagated to the decoder and modifies the external knowledge, filtering out words that are not necessary for copying into a response. Afterward, instead of generating system responses directly, the local memory decoder first uses a recurrent neural network (RNN) to obtain sketch responses with sketch tags. The sketch responses with tags operate, or can be considered, as learning a latent dialogue management to generate a template for dialogue action. Then the decoder generates local memory pointers to copy words from the external knowledge memory to replace the sketch tags.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
As shown, memory 120 includes a global local memory pointer module 130. The global local memory pointer module 130 may be used to implement and/or generate the global local memory pointers for response generation in task-oriented dialogue for the systems, methods, and models described further herein. In some examples, global local memory pointer module 130 may be used or incorporated in a dialogue system by which one or more users can interact with a machine, e.g., a computer. Each dialogue may comprise an interchange of information, questions, queries, and responses between a user and the machine. This sequence of exchanges makes up a history for the dialogue. For a given dialogue, the global local memory pointer module 130 receives user utterances or speech 150, and generates suitable responses 160 for the same. To accomplish this, as described below in more detail, the global local memory pointer module 130 generates both a global pointer and a local pointer for information or data in a knowledge base from which responses can be generated or created. The global local memory pointer module 130 may also receive one or more knowledge bases 155.
In some examples, global local memory pointer module 130 may include a single- or multi-layer neural network, with suitable pre-processing, encoding, decoding, and output layers. Neural networks have demonstrated great promise as a technique for automatically analyzing real-world information with human-like accuracy. In general, neural network models receive input information and make predictions based on the input information. For example, a neural network classifier may predict a class of the input information among a predetermined set of classes. Whereas other approaches to analyzing real-world information may involve hard-coded processes, statistical analysis, and/or the like, neural networks learn to make predictions gradually, by a process of trial and error, using a machine learning process. A given neural network model may be trained using a large number of training examples, proceeding iteratively until the neural network model begins to consistently make similar inferences from the training examples that a human might make. In some examples, global local memory pointer module 130 may include a memory network for storing, among other things, the knowledge base and a history for a current dialogue. And although global local memory pointer module 130 is depicted as a software module, it may be implemented using hardware, software, and/or a combination of hardware and software.
In some embodiments, as shown, this model can include or comprise a global memory encoder 210, a local memory decoder 220, and a shared external knowledge memory 230. In some embodiments, one or both of the encoder 210 and decoder 220 comprises one or more recurrent neural networks (RNN).
The global local memory pointer model 200 receives as input one or more knowledge bases (KB) and information for a current dialogue (e.g., exchange between user and system). The knowledge base comprises information or data that may be relevant for generating responses to a user's queries or utterances in connection with a dialogue. This information can include, for example, names of people, places, or points of interest (poi), a type for each poi, addresses or contact information for the same, etc. Examples of this information for a knowledge base are shown in table 610 of
The global memory encoder 210 may receive one or more utterances issued by a user during a dialogue with the computing device (process 720 of
In some embodiments, the external knowledge in memory 300 contains the global contextual representation that is shared with the encoder (e.g., 210) and the decoder (e.g., 220) of the global local memory pointer model (e.g., 200). To incorporate external knowledge into a learning framework, in some embodiments, external knowledge memory 300 can be implemented using end-to-end memory networks (MN) to store word-level information for both the structural KB and the temporal-dependent dialogue history. As shown, this can include the KB memory and the dialogue memory. In addition, end-to-end memory networks (MN) provide, support, or allow for multiple hop reasoning ability, which can strengthen the copy mechanism.
In some embodiments, in the KB memory module 332, each element $b_i \in B$ is represented in the triplet format (Subject, Relation, Object), which is a common format used to represent KB nodes. For example, the knowledge base B in table 610 of
In some embodiments, the external knowledge comprises a set of trainable embedding matrices $C = (C^1, \ldots, C^{K+1})$, where $C^k \in \mathbb{R}^{|V| \times d_{emb}}$, K is the maximum memory hop in the end-to-end memory network (MN), |V| is the vocabulary size, and $d_{emb}$ is the embedding dimension. The memory in the external knowledge is denoted as $M = [B; X] = (m_1, \ldots, m_{n+l})$, where $m_i$ is one of the triplet components mentioned. To read the memory, the external knowledge uses an initial query vector $q^1$. Moreover, it can loop over K hops and compute the attention weights at each hop k using
$$p_i^k = \mathrm{Softmax}\big((q^k)^T c_i^k\big), \tag{1}$$
where $c_i^k = B(C^k(m_i)) \in \mathbb{R}^{d_{emb}}$
In some embodiments, the encoder 400 can be implemented as a context recurrent neural network (RNN). The context RNN is used to model the sequential dependency and encode the context or dialogue history X. Then the hidden states H are written into the external knowledge or the memory (e.g., 230 or 300 as shown in
Intuitively, since it can be hard for end-to-end memory network (MN) architectures to model the dependencies between memories, which can be a drawback especially in conversation-related tasks, writing the hidden states to the external knowledge can provide sequential and contextualized information, and the common out-of-vocabulary (OOV) challenge can be mitigated as well. In addition, using the encoded dialogue context as a query can encourage the external knowledge memory (e.g., 230 or 300) to read out information related to the hidden dialogue states or user intention. Moreover, the global memory pointer that learns a global memory distribution is passed to the decoder along with the encoded dialogue history and the encoded knowledge base (KB) information.
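By way of illustration only, the attention read over the external knowledge described above can be sketched for a single hop as follows. This is a minimal sketch under stated assumptions: the toy embedding table `emb`, the toy memory contents, and the helper names `bag_of_words`, `softmax`, and `attention` are hypothetical, and only one hop is shown because the query update between hops is not reproduced here.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def bag_of_words(tokens: Sequence[str], emb: Dict[str, Vector]) -> Vector:
    """B(.): sum the embeddings of all tokens stored in one memory slot."""
    dim = len(next(iter(emb.values())))
    out = [0.0] * dim
    for tok in tokens:
        for j, value in enumerate(emb[tok]):
            out[j] += value
    return out


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector,
              memory: Sequence[Sequence[str]],
              emb: Dict[str, Vector]) -> List[float]:
    """Attention weights p_i = Softmax(q^T c_i) for one hop."""
    slots = [bag_of_words(slot, emb) for slot in memory]
    scores = [sum(q * c for q, c in zip(query, slot)) for slot in slots]
    return softmax(scores)


# Hypothetical two-dimensional embeddings and a two-slot memory.
emb = {
    "starbucks": [1.0, 0.0],
    "distance": [0.0, 1.0],
    "1_mile": [0.5, 0.5],
    "hello": [-1.0, 0.0],
}
memory = [("starbucks", "distance", "1_mile"), ("hello",)]
p = attention([1.0, 0.0], memory, emb)
```

In this toy setting the query aligns with the first slot, so the first attention weight is the larger of the two.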
In some embodiments, the context RNN of encoder 400 can include or be implemented with a plurality of encoding elements 402, which separately or together may comprise one or more bi-directional gated recurrent units (GRUs) (such as described, for example, in Chung et al., 2014, which is incorporated by reference herein). Each encoding element 402 may operate on a word or text of the context or dialogue history X to generate hidden states $H = (h_1^e, \ldots, h_n^e)$. The last hidden state $h_n^e$ is used to query the external knowledge memory as the encoded dialogue history. In addition, the hidden states H are written into the dialogue memory module 334 in the external knowledge 300 by summing up the original memory representation with the corresponding hidden states. In formula,
$$c_i^k = c_i^k + h_{m_i}^e \quad \text{if } m_i \in X \text{ and } \forall k \in [1, K+1], \tag{3}$$
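As a non-limiting illustration, the memory write of equation (3) can be sketched as follows. The sketch assumes that the memory is laid out as M = [B; X], so that the dialogue slots follow the KB slots; the function name `write_hidden_states` and the parameter `num_kb` (the number of KB slots) are hypothetical.

```python
from typing import List

Vector = List[float]


def write_hidden_states(c: List[List[Vector]],
                        hidden: List[Vector],
                        num_kb: int) -> List[List[Vector]]:
    """Add each encoder hidden state to its dialogue slot under every hop.

    c[k][i] is the embedding of memory slot i under the k-th embedding matrix.
    Slots 0..num_kb-1 hold KB entries and are left unchanged; slot num_kb + t
    holds the t-th dialogue token and receives hidden[t].
    """
    updated = [[list(slot) for slot in hop] for hop in c]  # copy, do not mutate
    for hop in updated:
        for t, h in enumerate(hidden):
            slot = hop[num_kb + t]
            for j, h_j in enumerate(h):
                slot[j] += h_j
    return updated


# Two hop matrices, one KB slot followed by two dialogue slots, dimension 2.
c = [
    [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]],
    [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
]
hidden = [[0.5, 0.5], [1.0, -1.0]]
updated = write_hidden_states(c, hidden, num_kb=1)
```

The function returns a new structure rather than modifying its input, which is a design choice of this sketch only.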
The encoder 400 generates the global memory pointer G (process 730 of
In the auxiliary task, the label $G^{label} = (g_1^l, \ldots, g_{n+l}^l)$ is defined by checking whether the object words in the memory exist in the expected system response Y. Then the global memory pointer is trained using binary cross-entropy loss $Loss_g$ between G and $G^{label}$. In formula,
In some embodiments, as explained in more detail below, the global memory pointer functions to filter information from the knowledge base module (232 or 332) of the memory for use in generating a suitable dialogue response to a user utterance.
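For illustration, the auxiliary labels for the global memory pointer and the associated binary cross-entropy can be sketched as follows. This is a sketch under stated assumptions: the helper names are hypothetical, each memory slot is represented only by its object word, and the loss is summed (rather than averaged) over memory positions, which is an assumption of this example.

```python
import math
from typing import List, Sequence


def global_pointer_labels(memory_objects: Sequence[str],
                          response_tokens: Sequence[str]) -> List[int]:
    """Label a slot 1 if its object word appears in the expected response."""
    response = set(response_tokens)
    return [1 if obj in response else 0 for obj in memory_objects]


def binary_cross_entropy(g: Sequence[float],
                         labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Binary cross-entropy between pointer values g and 0/1 labels (summed)."""
    total = 0.0
    for p, y in zip(g, labels):
        total += y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
    return -total


# Hypothetical memory objects and expected response.
labels = global_pointer_labels(
    ["starbucks", "1_mile", "5_miles"],
    "starbucks is 1_mile away".split(),
)
loss_g = binary_cross_entropy([0.9, 0.8, 0.1], labels)
```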
In some embodiments, the RNN of the decoder 500 generates a template or sketch for the computer response to a user utterance. The sketch response may comprise a set of elements. Some of these elements of the sketch response will appear in the actual dialogue response output from the computing device 100. Other of these elements, which may be referred to as sketch tags, will be replaced by words from the knowledge base in the actual dialogue response. An example of a sketch response is “@poi is @distance away”, where @poi and @distance are each sketch tags. In the computer dialogue response, these sketch tags may be replaced with the words “Starbucks” and “1 mile,” respectively, from the knowledge memory (e.g., 232 or 332), so that the response actually output is “Starbucks is 1 mile away.”
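The replacement of sketch tags in the example above can be sketched as follows. This is a simplified illustration: the function name `fill_sketch` is hypothetical, and the copied words are supplied directly here, whereas in the model they would be selected from the external knowledge.

```python
from typing import List, Sequence


def fill_sketch(sketch_tokens: Sequence[str],
                copied_words: Sequence[str]) -> List[str]:
    """Replace each sketch tag (a token starting with '@') with the next copied word."""
    words = iter(copied_words)
    return [next(words) if tok.startswith("@") else tok for tok in sketch_tokens]


sketch = "@poi is @distance away".split()
response = " ".join(fill_sketch(sketch, ["Starbucks", "1 mile"]))
```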
Using the encoded dialogue history $h_n^e$, the encoded KB information $q^{K+1}$, and the global memory pointer G, the local memory decoder 500 first initializes its sketch RNN using the concatenation of the dialogue history $h_n^e$ and the encoded KB information $q^{K+1}$, and generates a sketch response that excludes slot values but includes sketch tags. At each decoding time step, the hidden state of the sketch RNN is used for two purposes: (1) predict the next token in vocabulary, which can be the same as standard sequence-to-sequence (S2S) learning; (2) serve as the vector to query the external knowledge. If a sketch tag is generated, the global memory pointer G is passed to the external knowledge 300, and the expected output word will be picked up from the local memory pointer L. Otherwise, the output word is the word that is generated by the sketch RNN. For example in
The decoder 500 generates a sketch response (process 740 in
$$h_t^d = \mathrm{GRU}\big(C^1(\hat{y}_{t-1}^s),\, h_{t-1}^d\big), \quad P_t^{vocab} = \mathrm{Softmax}(W h_t^d) \tag{5}$$
The standard cross-entropy loss is used to train the sketch RNN, and the Lossv is defined as
The slot values in Y are replaced with sketch tags based on the provided entity table. The sketch tags ST are all the possible slot types that start with a special token; for example, @address stands for all the addresses and @distance stands for all the distance information.
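By way of example, the conversion of an expected response Y into a sketch response can be illustrated as follows. The sketch assumes that the entity table maps each entity value to its slot type and that multi-word entities have already been joined into single tokens; both the table contents and the function name `to_sketch` are hypothetical.

```python
from typing import Dict, List, Sequence


def to_sketch(response_tokens: Sequence[str],
              entity_table: Dict[str, str]) -> List[str]:
    """Replace every token found in the entity table with '@' + its slot type."""
    return [
        "@" + entity_table[tok] if tok in entity_table else tok
        for tok in response_tokens
    ]


# Hypothetical entity table: entity value -> slot type.
entity_table = {"starbucks": "poi", "1_mile": "distance"}
sketch_tokens = to_sketch(["starbucks", "is", "1_mile", "away"], entity_table)
```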
The decoder 500 generates one or more local memory pointers L (process 760 in
$$c_i^k = c_i^k \times g_i, \quad \forall i \in [1, n+l] \text{ and } \forall k \in [1, K+1], \tag{7}$$
and then the sketch RNN hidden state $h_t^d$ queries the external knowledge 300. The memory attention in the last hop is the corresponding local memory pointer $L_t$, which is represented as the memory distribution at time step t. To train the local memory pointer, a supervision on top of the last hop memory attention in the external knowledge is added. The position label of the local memory pointer $L^{label}$ at the decoding time step t is defined as
The position n+l+1 is a null token in the memory that allows the model to calculate the loss function even if $y_t$ does not exist in the external knowledge. Then, the loss between L and $L^{label}$ is defined as
Furthermore, a record $R \in \mathbb{R}^{n+l}$ is utilized to prevent copying of the same entities multiple times. All the elements in R are initialized as 1 in the beginning. The global local memory pointer model or network generates the dialogue computer response Y for the current user utterance (process 770 of
where ⊙ is the element-wise multiplication. Lastly, all the parameters are jointly trained by minimizing the sum of three losses:
$$Loss = Loss_g + Loss_v + Loss_l \tag{11}$$
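As a further non-limiting illustration, the global filtering of equation (7) and the use of the record R during copying can be sketched as follows. This is a sketch under stated assumptions: the function names are hypothetical, the local memory pointer is supplied as a plain list of weights, and zeroing the selected position of R after a copy is an assumed update rule chosen for this example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def filter_memory(c: List[List[Vector]], g: Sequence[float]) -> List[List[Vector]]:
    """Equation (7): scale slot i under every hop k by the global pointer g_i."""
    return [
        [[x * g_i for x in slot] for slot, g_i in zip(hop, g)]
        for hop in c
    ]


def copy_with_record(local_pointer: Sequence[float],
                     record: Sequence[float],
                     objects: Sequence[str]) -> Tuple[str, List[float]]:
    """Copy the object at argmax(L * R) and mask that position in the record.

    The multiplication is element-wise. Setting the chosen position of R to
    zero is an assumption of this sketch.
    """
    scores = [l * r for l, r in zip(local_pointer, record)]
    idx = max(range(len(scores)), key=scores.__getitem__)
    new_record = list(record)
    new_record[idx] = 0.0
    return objects[idx], new_record


filtered = filter_memory([[[1.0, 2.0], [3.0, 4.0]]], [1.0, 0.0])

objects = ["starbucks", "1_mile", "$$$$"]  # last entry stands in for a null token
pointer = [0.6, 0.3, 0.1]
first, record = copy_with_record(pointer, [1.0, 1.0, 1.0], objects)
second, record = copy_with_record(pointer, record, objects)
```

With the same pointer weights applied twice, the record causes the second copy to select a different memory position than the first.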
In some embodiments, two public multi-turn task-oriented dialogue datasets can be used to evaluate the model: the bAbI dialogue (as described in more detail in Bordes et al., “Learning end-to-end goal-oriented dialog,” International Conference on Learning Representations, abs/1605.07683, 2017, which is incorporated by reference herein) and Stanford multi-domain dialogue (SMD) (as described in more detail in Eric et al., “A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue,” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 468-473, Valencia, Spain, April 2017, which is incorporated by reference herein). The bAbI dialogue includes five simulated tasks in the restaurant domain. Tasks 1 to 4 are about issuing API calls, modifying API calls, recommending options, and providing additional information, respectively. Task 5 is the union of tasks 1-4. There are two test sets for each task: one follows the same distribution as the training set and the other has OOV entity values. On the other hand, SMD is a human-human, multi-domain dialogue dataset. It has three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. The key difference between these two datasets is that the former has longer dialogue turns but regular user and system behaviors, whereas the latter has fewer conversational turns but more varied responses and much more complicated KB information.
bAbI Dialogue.
The table of
Stanford Multi-Domain Dialogue (SMD).
The tables of
Moreover, human evaluation of the generated responses is reported, as shown in the second table of
Thus, in the SMD dataset, the GLMP model achieves the highest BLEU score and entity F1 score among the baselines, including previous state-of-the-art results.
Ablation Study.
The contributions of the global memory pointer G and the memory writing of dialogue history H are shown in the table of
Visualization and Qualitative Evaluation.
Analyzing the attention weights has been frequently used to interpret deep learning models.
According to some embodiments, the model of the present disclosure is trained end-to-end using the Adam optimizer (Kingma et al., “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2015, which is incorporated by reference herein), and the learning rate is annealed from 1e−3 to 1e−4. The number of hops K is set to 1, 3, or 6 to compare the performance difference. All the embeddings are initialized randomly, and a simple greedy strategy is used without beam-search during the decoding stage. The hyper-parameters such as hidden size and dropout rate are tuned with grid-search over the development set (per-response accuracy for bAbI Dialogue and BLEU score for the SMD). In addition, to increase model generalization and simulate the OOV setting, a small number of input source tokens are randomly masked into an unknown token. The model is implemented in PyTorch and the hyper-parameters used for each task T1, T2, T3, T4, T5 are listed in the table of
The outputs of the GLMP model and Mem2Seq were compared by human evaluation with respect to appropriateness and human-likeness (naturalness). The level of appropriateness was rated from 1 to 5, as follows:
5: Correct grammar, correct logic, correct dialogue flow, and correct entity provided
4: Correct dialogue flow, logic and grammar but has slight mistakes in entity provided
3: Noticeable mistakes about grammar or logic or entity provided but acceptable
2: Poor grammar, logic and entity provided
1: Wrong grammar, wrong logic, wrong dialogue flow, and wrong entity provided
The level of human-likeness (naturalness) was rated from 1 to 5, as follows:
5: The utterance is 100% like what a person will say
4: The utterance is 75% like what a person will say
3: The utterance is 50% like what a person will say
2: The utterance is 25% like what a person will say
1: The utterance is 0% like what a person will say
The charts in
Thus, disclosed herein is an end-to-end trainable model, using global-to-local memory pointer networks, for task-oriented dialogues. The global memory encoder and the local memory decoder are designed to incorporate the shared external knowledge into the learning framework. It is empirically shown that the global and the local memory pointers are able to effectively produce system responses even in the out-of-vocabulary (OOV) scenario, and a visualization shows how the global memory pointer helps as well. As a result, the model achieves state-of-the-art results in both the simulated and the human-human dialogue datasets, and holds potential for extending to other tasks such as question answering and text summarization.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
This application claims priority to U.S. Provisional Patent Application No. 62/737,234, filed Sep. 27, 2018, which is incorporated by reference herein in its entirety.