The present disclosure relates generally to machine learning models and neural networks, and more specifically, to composed variational natural language generation.
Neural networks have been used to generate conversational responses and thus conduct a dialogue with a human user. For example, a human user can engage in a conversation with an intelligent assistant to gather information on a specific topic, to perform a task such as booking travel tickets, making restaurant reservations, and/or the like. Given the broad scope and diversity of natural language inputs that a user may provide to the intelligent assistant, providing an effective answer can often be challenging. For example, identifying user intentions from a user's spoken input has proven essential for the intelligent assistant to generate an effective natural language response. However, most existing systems formulate intent detection as a classification task and rely on supervised learning for this classification task. Such intent detection systems thus require large amounts of labeled examples for each class for training, which can be both expensive and time-consuming.
In addition, the requirement for labeled training data limits adaptability to previously unseen classes. For example, the available training data for a natural language model may be highly imbalanced, with abundant dialogue data for existing many-shot intents but insufficient dialogue data for few-shot intents. A model trained on such imbalanced data may thus heavily discriminate against few-shot intents, leading to unsatisfactory performance of the trained language model.
In the figures and appendix, elements having the same designations have the same or similar functions.
Existing intelligent dialogue systems usually formulate intent detection as a classification task and rely on supervised learning for this classification task. Such intent detection systems thus require large amounts of labeled examples for each class for training, which can be both expensive and time-consuming. Such systems also generate training data samples that are largely biased towards many-shot intents but fail to provide sufficient training samples with few-shot intents.
In view of the need for providing a balanced training dataset with both many-shot intents and few-shot intents, embodiments described herein provide a composed variational natural language generation (CLANG) model that is configured to generate training samples for few-shot intents. Specifically, the CLANG model may build connections between existing training samples of many-shot intents and new training samples of few-shot intents by modeling an intent as a combination of a domain and an action. The CLANG model may then encode the domain and the action of the intent with a transformer-based conditional variational autoencoder. Specifically, two latent variables are utilized to represent two different portions of an input utterance, each portion corresponding to the domain or the action in the intent. The latent variables are then composed together to generate natural examples with few-shot intents. In this way, the CLANG model transfers knowledge from existing many-shot intents to few-shot intents in natural language generation by learning how utterances with many-shot intents are composed and applying that knowledge to few-shot intents.
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise a hardware- or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Overview
For example, various actions such as “query”, “set”, “remove,” and/or the like can be combined with the domain “weather” to form various different intents. The action “query” can also be associated with multiple domains such as “weather”, “calendar,” “movie” and/or the like. Thus, if a neural model can learn how the expressions representing a certain action or domain compose an utterance for existing intents, then the neural model can learn how to compose utterances for few-shot intents.
Specifically, the bi-latent variational encoder in the CLANG model 130 utilizes two independent latent variables to model the distributions of the action and the domain separately. At the encoder, special attention masks are designed to guide these two latent variables to focus on different parts of the utterance 205 (e.g., “what will be” 110, which relates to the action 105, and “the highest temperature next week” 112, which relates to the domain 106) and to disentangle the semantics of the action and the domain. By decomposing utterances of existing many-shot intents during the training 210, the CLANG model 130 learns to generate utterances for few-shot intents as a composition of the learned expressions for domain and action. Further details of the framework of the CLANG model 130 are described below.
Additionally, the training 210 of the CLANG model 130 further adopts a contrastive regularization loss to improve learning. For example, during the training 210, an in-class utterance (e.g., an utterance that specifically corresponds to a certain intent) from one intent may be contrasted with an out-of-class utterance (e.g., an utterance that does not match with the one intent) from another intent. Specifically, the contrastive loss constrains the model to generate the positive example with a higher probability than the negative example by at least a certain margin. With the contrastive loss, the CLANG model 130 is regularized to focus on the given intent, and the probability of generating negative examples is reduced. Further detail of the contrastive learning is provided below.
Computer Environment
Memory 320 may be used to store software executed by computing device 300 and/or one or more data structures used during operation of computing device 300. Memory 320 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 310 and/or memory 320 may be arranged in any suitable physical arrangement. In some embodiments, processor 310 and/or memory 320 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 310 and/or memory 320 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 310 and/or memory 320 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 320 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 310) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 320 includes instructions for a CLANG module 330 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the CLANG module 330 may include an input representation submodule 331, an encoder submodule 332 and a decoder submodule 333. The submodules 331-333 may optionally be sequentially connected as shown.
Specifically, the input representation submodule 331 is configured to receive an input 340 of an utterance via the data interface 335, e.g., from a database of training data. The input representation submodule 331 may then generate a sequence of token representations for the input 340 of the utterance and its associated intent. The encoder submodule 332 is a conditional autoencoder with dual latent variables that encode the intent domain and the intent action. During training, the CLANG module 330 reconstructs the utterance from input 340 with a certain intent via the decoder submodule 333, e.g., to output an output 350 of a reconstructed utterance. During inference of the CLANG module 330, the CLANG module 330 generates utterances associated with a given intent by sampling the latent variables at the decoder submodule 333. The output 350 during inference may include an utterance generated for a specific (e.g., few-shot) intent. Further detail of the submodules 331-333 is provided below.
In some embodiments, to learn the distinctiveness of each intent, an additional contrastive loss may be used by the CLANG module 330 during training to push the CLANG module 330 to generate the positive examples over the negatives. The input 340 may then include positive (in-class) samples and negative (out-of-class) samples corresponding to a specific intent. Both samples may be passed to the CLANG module 330 to generate an output 350 of a conditional distribution of utterances conditioned on the specific intent. The output 350 may then be used to compute the contrastive loss. Further details of the contrastive learning are discussed below.
In some examples, the CLANG module 330 and the sub-modules 331-333 may be implemented using hardware, software, and/or a combination of hardware and software.
Composed Variational Natural Language Generator (CLANG)
The input representation module 401 is configured to convert an input utterance 405 (e.g., “what will be the highest temperature next week”) and the associated intent 406 (e.g., domain = “weather,” action = “query”), which may be pre-annotated with the input utterance, into an input representation of tokens 409. For example, the intent 406 may include a combination of a domain and an action, denoted by y=(yd, ya). Then, for a given intent y=(yd, ya) and an utterance x=(w1, w2, . . . , wn) with n tokens, the input representation module 401 concatenates the intent tokens and the utterance tokens, and further adds a special start-of-sequence ([CLS]) token at the beginning of the input and a special end-of-sequence ([SEP]) token at the end of each part. The resulting input sequence 407 may take the form ([CLS], yd, ya, [SEP], w1, w2, . . . , wn, [SEP]). In the example shown, the input sequence is ([CLS], weather, query, [SEP], what, will, be, the, highest, temperature, next, week, [SEP]).
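As a minimal illustration of this concatenation (the function name below is illustrative and not part of the disclosure), the intent tokens and utterance tokens may be joined with the [CLS] and [SEP] markers as follows:

```python
# Hypothetical sketch of building the input sequence
# ([CLS], y_d, y_a, [SEP], w_1, ..., w_n, [SEP]) described above.

def build_input_sequence(domain: str, action: str, utterance_tokens: list) -> list:
    """Concatenate intent tokens and utterance tokens with special markers."""
    return ["[CLS]", domain, action, "[SEP]"] + utterance_tokens + ["[SEP]"]

tokens = build_input_sequence(
    domain="weather",
    action="query",
    utterance_tokens=["what", "will", "be", "the", "highest",
                      "temperature", "next", "week"],
)
print(tokens)
# ['[CLS]', 'weather', 'query', '[SEP]', 'what', ..., 'week', '[SEP]']
```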
In some embodiments, [SEP] not only marks the sentence boundary but is also used by the model to learn when to terminate the decoding process.
In some embodiments, input texts may be tokenized into subword units, e.g., at 408. For example, the tokenization may be performed by WordPiece as described in Wu et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016, which is hereby expressly incorporated by reference herein in its entirety. The input embeddings of a token sequence may be represented as the sum of three embeddings: token embeddings, position embeddings (see Vaswani et al., Attention is all you need, in Advances in Neural Information Processing Systems, pages 5998-6008, 2017), and segment embeddings (Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018). The segment embeddings are learned to distinguish the intent and the utterance with different embeddings. The resulting tokenized sequence 409 may be output from the input representation module 401 to the bi-latent variational encoder 402.
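The summation of token, position, and segment embeddings may be sketched as follows; this is a minimal illustration assuming BERT-base dimensions (hidden size 768) and a BERT-style vocabulary, and the class and variable names are illustrative rather than taken from the disclosure:

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Sum of token, position, and segment embeddings (BERT-style sketch)."""
    def __init__(self, vocab_size=30522, max_len=128, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(2, hidden)  # 0 = intent part, 1 = utterance part

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.position(positions)[None, :, :]
                + self.segment(segment_ids))

emb = InputEmbedding()
token_ids = torch.randint(0, 30522, (1, 12))          # placeholder token ids
segment_ids = torch.tensor([[0, 0, 0, 0] + [1] * 8])  # intent vs. utterance segments
print(emb(token_ids, segment_ids).shape)              # torch.Size([1, 12, 768])
```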
The bi-latent variational encoder 402 is configured to encode the input sequence representation of tokens 409 into two latent variables that contain the disentangled semantics in the utterance corresponding to the domain and the action separately. Specifically, the bi-latent variational encoder 402 includes multiple transformer layers 410a-n, which may be similar to the Transformer model described in Vaswani et al. Through a self-attention mechanism, these transformer layers 410a-n extract semantically meaningful representations of the tokens in the input representation sequence 409, and also model the relation between the intent 406 (the “D” token and “A” token in input representation 409) and the utterance 405 (the “W1 . . . Wn” tokens in input representation 409).
Specifically, the information for the domain and the action is kept disentangled through the self-attention mechanism in the multiple transformer layers 410a-n, e.g., allowing the domain token (represented by “D” in the input sequence 409) to only attend to tokens related to the domain and the action token (represented by “A” in the input sequence 409) to only attend to tokens related to the action. To achieve that, a variation of the attention calculations in the transformer layers 410a-n is adopted to avoid direct interactions between the domain token and the action token in each transformer layer 410a-n.
For example, instead of applying full bidirectional attention to the input sequence 409, an attention mask matrix $M \in \mathbb{R}^{N \times N}$ is added at each transformer layer 410a-n to determine whether a pair of tokens can attend to each other, where $N$ is the length of the input sequence 409. Thus, for the $l$-th transformer layer, the output of a self-attention head $A_l$ is computed via:

$$A_l = \operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V_l,$$

where $Q = T^{l-1} W_Q^l$, $K = T^{l-1} W_K^l$, and $V_l = T^{l-1} W_V^l$ are computed from $T^{l-1} \in \mathbb{R}^{N \times d_h}$, the output of the previous transformer layer, using learned projection matrices; $d_k$ is the dimension of the attention head; and $M_{ij} = 0$ if token $i$ is allowed to attend to token $j$, and $M_{ij} = -\infty$ otherwise.
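A sketch of such a masked self-attention head is shown below; it follows standard scaled dot-product attention with an additive mask and assumes, purely for illustration, that the domain and action tokens sit at positions 1 and 2 of the sequence. Function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def masked_attention_head(T, W_q, W_k, W_v, M):
    """T: (N, d_h) previous-layer outputs; M: (N, N) additive attention mask."""
    Q, K, V = T @ W_q, T @ W_k, T @ W_v
    scores = Q @ K.T / (K.size(-1) ** 0.5) + M   # -inf entries block attention
    return F.softmax(scores, dim=-1) @ V

N, d_h, d_k = 12, 768, 64
T = torch.randn(N, d_h)
W_q, W_k, W_v = (torch.randn(d_h, d_k) for _ in range(3))
M = torch.zeros(N, N)
M[1, 2] = M[2, 1] = float("-inf")  # domain and action tokens do not attend to each other
out = masked_attention_head(T, W_q, W_k, W_v, M)
print(out.shape)  # torch.Size([12, 64])
```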
With reference to the bi-latent variational encoder 402, the final hidden states ed and ea corresponding to the domain token and the action token are projected into the means and log-variances of two Gaussian distributions:

$$\mu_d = e_d W_{\mu_d}, \qquad \log(\sigma_d^2) = e_d W_{\sigma_d},$$

$$\mu_a = e_a W_{\mu_a}, \qquad \log(\sigma_a^2) = e_a W_{\sigma_a},$$

where the parameters Wμd, Wσd, Wμa, and Wσa are learned projection matrices. The two latent variables are then obtained with the reparameterization trick, where εd, εa are sampled from a standard Gaussian distribution:

$$z_d = \mu_d + \sigma_d \cdot \varepsilon_d,$$

$$z_a = \mu_a + \sigma_a \cdot \varepsilon_a.$$
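The projection of the encoder outputs into Gaussian parameters and the reparameterized sampling of zd and za may be sketched as follows, assuming the hidden states ed and ea are taken at the domain-token and action-token positions; the layer names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

hidden, latent = 768, 768
W_mu_d, W_sigma_d = nn.Linear(hidden, latent), nn.Linear(hidden, latent)
W_mu_a, W_sigma_a = nn.Linear(hidden, latent), nn.Linear(hidden, latent)

e_d = torch.randn(1, hidden)   # hidden state at the domain token (placeholder)
e_a = torch.randn(1, hidden)   # hidden state at the action token (placeholder)

mu_d, log_var_d = W_mu_d(e_d), W_sigma_d(e_d)
mu_a, log_var_a = W_mu_a(e_a), W_sigma_a(e_a)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
z_d = mu_d + torch.exp(0.5 * log_var_d) * torch.randn_like(mu_d)
z_a = mu_a + torch.exp(0.5 * log_var_a) * torch.randn_like(mu_a)
print(z_d.shape, z_a.shape)
```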
In one embodiment, the bi-latent variational encoder may compute a KL-loss function to regularize the posterior distributions of the two latent variables to be close to their Gaussian priors. Specifically, the encoder loss is computed as a first KL divergence between the posterior distribution of the first latent variable zd conditioned on both the training sequence of tokens x and the first intent token yd and the prior distribution of zd conditioned on yd, plus a second KL divergence between the posterior distribution of the second latent variable za conditioned on both x and the second intent token ya and the prior distribution of za conditioned on ya. That is, the encoder KL-loss can be computed by:
$$\mathcal{L}_{KL} = \mathrm{KL}\big[q(z_d \mid x, y_d),\, p(z_d \mid y_d)\big] + \mathrm{KL}\big[q(z_a \mid x, y_a),\, p(z_a \mid y_a)\big].$$
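Assuming standard Gaussian priors p(zd|yd) = p(za|ya) = N(0, I), each KL term has the usual closed form, which may be sketched as follows (this is an assumption consistent with the description above, not a verbatim reproduction of the disclosed code):

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var, dim=-1)

mu_d, log_var_d = torch.randn(1, 768), torch.randn(1, 768)  # placeholder parameters
mu_a, log_var_a = torch.randn(1, 768), torch.randn(1, 768)

kl_loss = (kl_to_standard_normal(mu_d, log_var_d)
           + kl_to_standard_normal(mu_a, log_var_a)).mean()
print(kl_loss.item())
```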
In one embodiment, the bi-latent variational encoder 402 includes a fully-connected layer with a GeLU activation function. The GeLU activation function is similar to that described in Hendrycks et al., Bridging nonlinearities and stochastic regularizers with Gaussian error linear units, arXiv preprint arXiv:1606.08415, 2016, which is hereby expressly incorporated herein by reference in its entirety. The fully-connected layer is applied to compose the two latent variables together and outputs z in the encoded sequence 412. The composed latent information z is utilized by the decoder for generation.
The decoder 403 receives the composed latent information z together with the intent and reconstructs the input utterance by modeling p(x|zd, za, yd, ya). As shown, the decoder input sequence is formed by replacing the start-of-sequence [CLS] token of the input sequence with a token representing the composed latent information z.
The decoder 403 is built with multiple transformer layers 415a-n to reconstruct the utterance. It is worth noting that although two transformer layers are illustrated for simplicity, any number of transformer layers 415a-n may be used.
Specifically, the decoder 403 may sequentially generate a text output from left to right, e.g., the left context is used to predict the next token. To simulate this left-to-right generation process, the multiple transformer layers 415a-n employ an attention mask. In the decoder attention mask, tokens in the intent can only attend to intent tokens, while tokens in the utterance can attend to both the intent and all tokens to their left in the utterance. For the first token z, which holds the composed latent information, the decoder 403 only attends the token to itself, in order to mitigate the vanishing latent variable problem: the latent information can be overwhelmed by the information of other tokens when adapting a VAE to natural language generators, whether built on LSTMs or transformers. To further increase the impact of the composed latent information z and alleviate the vanishing latent variable problem, the token representation of z is concatenated to all the other token embeddings output from the last transformer layer in the decoder 403.
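One possible construction of such a decoder attention mask is sketched below. The ordering of the latent token, intent tokens, and utterance tokens, and the assumption that utterance tokens do not attend to the latent token, are illustrative choices consistent with the description above rather than the exact disclosed implementation:

```python
import torch

def decoder_attention_mask(n_intent: int, n_utterance: int) -> torch.Tensor:
    """Additive mask (0 = allowed, -inf = blocked) over the assumed sequence
    [z, y_d, y_a, [SEP], w_1, ..., w_n, [SEP]]."""
    n = 1 + n_intent + n_utterance
    mask = torch.full((n, n), float("-inf"))
    mask[0, 0] = 0.0                              # z attends only to itself
    mask[1:1 + n_intent, 1:1 + n_intent] = 0.0    # intent tokens attend to intent tokens
    for i in range(1 + n_intent, n):              # utterance tokens: intent + left context
        mask[i, 1:i + 1] = 0.0
    return mask

print(decoder_attention_mask(n_intent=3, n_utterance=9))
```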
The hidden dimension of the output from the last transformer layer 415n of the decoder 403 may increase to 2×dh after the concatenation. To reduce the hidden dimension back to dh and obtain the embeddings used to decode the vocabulary, the decoder 403 further employs two fully-connected (FC) layers 420 followed by an optional layer normalization on top of the transformer layers 415a-n. For example, GeLU is used as the activation function at the fully-connected layers 420, and the optional normalization layer may be similar to the layer normalization described in Ba et al., Layer normalization, arXiv preprint arXiv:1607.06450, 2016, which is hereby expressly incorporated by reference herein in its entirety.
The embeddings output from these two FC layers 420 are decoded into tokens 425 in the vocabulary sequentially. For example, the embeddings at position i={1, . . . , n−1} are used to predict the next token at position i+1 till the [SEP] token is generated.
In one embodiment, to train the decoder to reconstruct the input, a reconstruction loss is formulated as:
$$\mathcal{L}_r = -\,\mathbb{E}_{q(z_d, z_a \mid x, y)}\big[\log p(x \mid z_d, z_a, y_d, y_a)\big].$$
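In practice, this reconstruction term may be computed as a token-level negative log-likelihood (cross-entropy) over the decoder outputs, as in the following sketch with placeholder tensors standing in for the decoder head outputs and gold tokens:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30522, 12
logits = torch.randn(1, seq_len, vocab_size)        # decoder outputs (placeholder)
targets = torch.randint(0, vocab_size, (1, seq_len))  # gold token ids (placeholder)

# Shift so that position i predicts token i+1, then average the NLL.
recon_loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    targets[:, 1:].reshape(-1),
)
print(recon_loss.item())
```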
At subprocess 702, the CLANG model obtains, via an input interface (e.g., data interface 335), an utterance and an associated intent that includes a domain and an action.
At subprocess 704, the CLANG model converts the utterance and the intent into an input sequence of tokens. For example, the utterance and the intent tokens may be concatenated into the sequence ([CLS], yd, ya, [SEP], w1, . . . , wn, [SEP]) described above.
At subprocess 706, the CLANG model applies, by a bi-latent variational encoder, an attention mask that prevents a first token representing the domain and a second token representing the action in the input sequence from attending to each other. For example, the attention mask matrix M described above blocks direct interaction between the domain token and the action token at each transformer layer.
At subprocess 708, the CLANG model generates a first latent variable corresponding to the first token and a second latent variable corresponding to the second token. The first latent variable, e.g., zd, and the second latent variable, e.g., za, are disentangled from each other conditioned on the domain and the action, respectively.
At subprocess 710, the CLANG model outputs, to the decoder (e.g., decoder 403), composed latent information representing the first latent variable and the second latent variable.
At subprocess 714, the CLANG model receives, at a decoder from the bi-latent variational encoder, latent information representing the first latent variable and the second latent variable. For example, as described above, a fully-connected layer composes the two latent variables into the latent information z that is passed to the decoder 403.
At subprocess 720, the CLANG model receives, from the input interface, the input sequence of tokens. For example, the original input sequence 409 is also fed to the decoder 403.
At subprocess 722, the CLANG model generates a decoder input sequence of tokens by replacing a start token (e.g., the [CLS] token) from the input sequence with a latent token representing the latent information.
At subprocess 724, the CLANG model applies a decoder attention mask to the decoder input sequence of tokens. Specifically, the decoder attention mask prevents intent tokens in the decoder input sequence of tokens from attending to non-intent tokens and prevents the latent token from attending to other tokens.
At subprocess 726, the CLANG model concatenates the latent token to all other token embeddings from a last transformer layer in the one or more transformer layers.
At subprocess 728, the CLANG model sequentially decodes, by at least one fully-connected layer, the concatenated latent token and the token embeddings into tokens corresponding to the utterance by next token prediction. For example, the embeddings at position i={1, . . . , n−1} are used to predict the next token at position i+1 till the [SEP] token is generated.
Contrastive Learning
Given an intent y=(yd, ya) 804, an in-class utterance x+ 802 corresponding to this intent and an out-of-class utterance x− 805 from another intent may be obtained, e.g., received from a dataset via the input interface 335.
Thus, in the same batch of training data, the in-class example 807 (yd, ya, x+) and the out-of-class example 808 (yd, ya, x−) are fed into the CLANG model 130 to model the likelihoods P(x+|y) 815 and P(x−|y) 816. Specifically, the chain rule is used to calculate the likelihood of the whole utterance: p(x|y)=p(w1|y)p(w2|y, w1) . . . p(wn|y, w1, . . . , wn-1). In the contrastive loss, the log-likelihood of the in-class example 807 is constrained to be higher than that of the out-of-class example 808 by at least a margin λ, and thus the contrastive loss 820 is computed as:
$$\mathcal{L}_c = \max\big\{0,\ \lambda - \log p(x^{+} \mid y) + \log p(x^{-} \mid y)\big\}.$$
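A sketch of this hinge-style contrastive term is shown below, using the margin of 0.5 reported for the experiments; the log-likelihood inputs are placeholders standing in for log p(x+|y) and log p(x−|y) computed with the chain rule over the decoder's next-token probabilities:

```python
import torch

def contrastive_loss(log_p_pos: torch.Tensor, log_p_neg: torch.Tensor,
                     margin: float = 0.5) -> torch.Tensor:
    """max(0, margin - log p(x+|y) + log p(x-|y))."""
    return torch.clamp(margin - log_p_pos + log_p_neg, min=0.0)

log_p_pos = torch.tensor(-12.3)   # placeholder sequence log-likelihoods
log_p_neg = torch.tensor(-12.6)
print(contrastive_loss(log_p_pos, log_p_neg).item())  # ~0.2: margin not yet satisfied
```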
In this way, for a given intent, the contrastive loss 820 constrains the CLANG model 130 to generate the in-class example x+ with a higher probability than the out-of-class example x−.
To generate out-of-class utterances 805 for contrastive learning, the utterance in the training dataset that is most similar to the in-class utterance but has a different intent is chosen as the out-of-class utterance. Three indicators are used to measure the similarity between the in-class utterance and another utterance having a different intent: the number of shared uni-grams s1 between the two utterances, the number of shared bi-grams s2 between the two utterances, and the number of shared uni-grams s3 between the name of the given intent associated with the in-class utterance and the name of the different intent associated with the other utterance. The sum of these three numbers, s=s1+s2+s3, is used to find the out-of-class utterance with the highest similarity. If multiple utterances share the same highest similarity, the out-of-class utterance may be randomly selected from among them.
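The similarity score s = s1 + s2 + s3 and the selection of the out-of-class utterance may be sketched as follows; whitespace tokenization and the function names are simplifying assumptions for illustration:

```python
import random

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def similarity(in_utt, in_intent, other_utt, other_intent):
    t1, t2 = in_utt.split(), other_utt.split()
    s1 = len(ngrams(t1, 1) & ngrams(t2, 1))                       # shared uni-grams
    s2 = len(ngrams(t1, 2) & ngrams(t2, 2))                       # shared bi-grams
    s3 = len(set(in_intent.split()) & set(other_intent.split()))  # shared intent-name uni-grams
    return s1 + s2 + s3

def pick_out_of_class(in_utt, in_intent, candidates):
    """candidates: list of (utterance, intent) pairs whose intent differs from in_intent."""
    scores = [similarity(in_utt, in_intent, u, y) for u, y in candidates]
    best = max(scores)
    # Break ties randomly among the most similar candidates.
    return random.choice([c for c, s in zip(candidates, scores) if s == best])

candidates = [("what is the weather today", "weather get"),
              ("set an alarm for next week", "alarm set")]
print(pick_out_of_class("what will be the highest temperature next week",
                        "weather query", candidates))
```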
In some embodiments, the overall loss function is a summation of the KL-loss, the reconstruction loss and the contrastive loss:
$$\mathcal{L} = \mathcal{L}_{KL} + \mathcal{L}_r + \mathcal{L}_c.$$
The CLANG model may be trained end-to-end by the overall loss function, e.g., via backpropagation.
At subprocess 918, the CLANG model receives an in-class training sample (e.g., 807) corresponding to a first intent token and a second intent token, and an out-of-class training sample (e.g., 808) corresponding to a different intent.
At subprocess 920, the out-of-class training sample is generated by selecting, from a set of training samples, a training sample that is similar to the in-class training sample and corresponds to an intent other than the first intent token or the second intent token. For example, the selected training sample has a similarity with the in-class training sample based on a first number of shared unigrams between the selected training sample and the in-class training sample, a second number of shared bi-grams between the selected training sample and the in-class training sample, and a third number of shared uni-grams between names of intents corresponding to the selected training sample and the in-class training sample.
At subprocess 922, the CLANG model encodes the in-class training sample with the first intent token and the second intent token into a first encoded sequence, and the out-of-class training sample with the first intent token and the second intent token into a second encoded sequence, respectively. For example, the encoding may be similar to embodiments of the input representation module 401 and the bi-latent variational encoder 402 described above.
At subprocess 926, the CLANG model generates a reconstructed in-class sample from the first encoded sequence and a reconstructed out-of-class sample from the second encoded sequence, respectively. For example, the reconstruction may be similar to embodiments described above in relation to the decoder 403.
At subprocess 928, the CLANG model sequentially generates each token in the reconstructed in-class sample or in the reconstructed out-of-class sample. For example, the utterance of tokens 425 may be sequentially generated via next token prediction.
At subprocess 930, the CLANG model sequentially computes a first conditional probability of the reconstructed in-class sample and a second conditional probability of the reconstructed out-of-class sample, both conditioned on the first intent token and the second intent token. For example, as the reconstructed in-class sample or out-of-class sample is sequentially generated, the chain rule is used to calculate the likelihood of the whole utterance: p(x|y)=p(w1|y)p(w2|y, w1) . . . p(wn|y, w1, . . . , wn-1).
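The chain-rule likelihood may be computed by summing the log-probability assigned to each gold next token under the decoder's output distribution, as in the following sketch with placeholder logits standing in for the decoder outputs conditioned on the intent:

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab); token_ids: (seq_len,). Position i predicts token i+1."""
    log_probs = F.log_softmax(logits[:-1], dim=-1)   # (seq_len - 1, vocab)
    gold = token_ids[1:].unsqueeze(-1)               # gold next tokens
    return log_probs.gather(-1, gold).sum()          # sum of log p(w_{i+1} | y, w_1..w_i)

logits = torch.randn(12, 30522)                      # placeholder decoder outputs
token_ids = torch.randint(0, 30522, (12,))           # placeholder gold tokens
print(sequence_log_likelihood(logits, token_ids).item())
```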
At subprocess 932, the CLANG model computes a contrastive loss by constraining a logarithm of the first conditional probability to be higher than that of the second conditional probability for at least a pre-defined margin. The contrastive loss, together with the encoder loss and decoder reconstruction loss, may be used to update the CLANG model, e.g., via backpropagation.
Few-Shot Intent Detection and Performance
Utterances for few-shot intents can be generated by sampling the two latent variables, zd and za, separately from multivariate standard Gaussian distributions. Beam search may be applied for the generation. To improve the diversity of the generated utterances, the latent variables may be sampled s times, and the top k results are stored each time. The sampled latent variables are then sent to the decoder 403 to generate utterances for the given few-shot intent.
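A high-level sketch of this sampling-and-decoding loop is shown below; decode_beam is a hypothetical stand-in for the decoder plus beam search described above and is not part of the disclosure:

```python
import torch

def generate_for_intent(decode_beam, domain, action,
                        latent_dim=768, s=10, k=30):
    """Sample (z_d, z_a) from standard Gaussians s times, keep top-k outputs each time."""
    utterances = set()
    for _ in range(s):
        z_d = torch.randn(1, latent_dim)   # sample the domain latent variable
        z_a = torch.randn(1, latent_dim)   # sample the action latent variable
        # decode_beam is assumed to return the k most probable utterances
        # for this intent and latent sample.
        utterances.update(decode_beam(domain, action, z_d, z_a, beam_size=k))
    return utterances

# Dummy stand-in so the sketch runs end-to-end.
def dummy_decode_beam(domain, action, z_d, z_a, beam_size):
    return {f"{domain}-{action} sample {i}" for i in range(3)}

print(len(generate_for_intent(dummy_decode_beam, "weather", "query", s=2, k=3)))
```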
Specifically, for performance evaluation of CLANG, two public intent detection datasets are used in the experiments: SNIPS-NLU (Coucke et al., Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces, arXiv preprint arXiv:1805.10190, 2018) and NLUED (Liu et al., Benchmarking natural language understanding services for building conversational agents, in Proceedings of the Tenth International Workshop on Spoken Dialogue Systems Technology (IWSDS), Ortigia, Siracusa (SR), Italy, 2019). These two datasets contain utterances from users when interacting with intelligent assistants and are annotated with pre-defined intents.
Specifically, both the encoder and the decoder of the CLANG model 130 use six transformer layers. Pre-trained weights from BERT-base are used to initialize the embeddings and the transformer layers: the weights from the first six layers of BERT-base initialize the transformer layers in the encoder, and the last six layers initialize the decoder. The margin for the contrastive loss is 0.5 for all settings. All hidden dimensions used in CLANG are 768. For CLANG, the learning rate is 1e-5 and the batch size is 16. Each epoch has 1000 steps. Fifty examples from the training data are sampled as the validation set. The reconstruction error on the validation set is used to search for the number of training epochs in the range of [50, 75, 100]. The reported performances of CLANG and the ablation without the contrastive loss are both trained with 100 epochs.
The hyperparameters for the generation process, including the top index k and the number of sampling times s, are chosen by evaluating the quality of the generated utterances. The search is performed over s in [10, 20] and k in [20, 30]. For example, k=30 and s=20 are used for BERT+CLANG on NLUED, while k=30 and s=10 are used for all the other experiments. When fine-tuning BERT for the GF-SID task, the hyperparameters are chosen as follows: the batch size is 32, the learning rate is 2e-5, and the number of training epochs is 3.
Compared to the few-shot learning method, CLANG achieves better performance consistently in all the settings. In contrast, BERT-PN+ may achieve decent performance on many-shot intents but lacks the ability to provide embeddings that generalize from existing intents to few-shot intents.
For the data augmentation baselines, CLANG obtains the best performance on few-shot intents and on the harmonic mean. These results demonstrate the high quality and diversity of the utterances generated from CLANG. CGT and SVAE barely improve the performance for few-shot intents; they only work well with sufficient training data, and the utterances generated by these two models are almost the same as the few-shot examples. The performance improvement from EDA is also limited since it only provides simple language transformations such as insertion and deletion. Compared with CG-BERT, which also incorporates the pre-trained language model BERT, CLANG further improves the ability to generate utterances for few-shot intents with composed natural language generation.
For a few-shot natural language generation model, diversity is an important indicator of quality. The percentage of unique utterances generated by CLANG is therefore compared with CG-BERT. For CG-BERT, the top 20 results are generated for each intent by sampling the hidden variable once, yielding 257 unique sentences out of 320 utterances (80.3%). For CLANG, the top 30 results for each intent are generated by sampling the latent variables once, yielding 479 unique sentences out of 480 utterances (99.8%), which is much higher than CG-BERT.
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 200. Some common forms of machine readable media that may include the processes of method 200 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
The present disclosure is a non-provisional application of and claims priority under 35 U.S.C. 119 to U.S. provisional application No. 63/032,673, filed on May 31, 2020, which is hereby expressly incorporated by reference herein in its entirety.
Other Publications
Li Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, Oct. 15, 2019, arXiv: 1905.03197v3, pp. 1-14 (Year: 2019).
Ba et al. "Layer Normalization." Jul. 21, 2016.
Bowman et al. "Generating Sentences from a Continuous Space." Nov. 19, 2015.
Chawla et al. "Smote: Synthetic Minority Over-Sampling Technique." Journal of Artificial Intelligence Research. vol. 1. Issue 1. pp. 321-357.
Chen et al. "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding." Interspeech. Sep. 8-12, 2016. pp. 3245-3249.
Chen et al. "Few-Shot NLG with Pre-Trained Language Model." Sep. 6, 2019.
Cho et al. "Contrastive Multi-Document Question Generation." Apr. 29, 2020.
Coucke et al. "Snips Voice Platform: An Embedded Spoken Language Understanding System for Private-by-Design Voice Interfaces." May 25, 2018.
Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Oct. 11, 2018.
Dong et al. "Unified Language Model Pre-training for Natural Language Understanding and Generation." Oct. 15, 2019.
Geng et al. "Induction Networks for Few-Shot Text Classification." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Sep. 29, 2019. pp. 3895-3904.
Gutmann et al. "Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models." Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
Hendrycks et al. "Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units." Jun. 27, 2016.
Hochreiter et al. "Long Short-Term Memory." Neural Computation. vol. 9. Issue 8. Dec. 1997. pp. 1735-1780.
Hoy. "Alexa, Siri, Cortana and More: An Introduction to Voice Assistants." Medical Reference Services Quarterly. vol. 37. Issue 1. Jan. 2018. pp. 81-88.
Hu et al. "Toward Controlled Generation of Text." Proceedings of the 34th International Conference on Machine Learning. Journal of Machine Learning Research. vol. 70. Jul. 11, 2017. pp. 1587-1596.
Hu et al. "Understanding User's Query Intent with Wikipedia." WWW2009. Apr. 20-24, 2009. pp. 471-480.
IBM Watson Assistant. "Defining Intents." https://cloud.ibm/docs/assisant-data?topic=assistant-data-intents.
Kingma et al. "Adam: A Method for Stochastic Optimization." Dec. 22, 2014.
Kingma et al. "Auto-Encoding Variational Bayes." Dec. 27, 2013.
Kingma et al. "Semi-Supervised Learning with Deep Generative Models." Proceedings of Neural Information Processing Systems. Oct. 31, 2014. pp. 3581-3589.
Li et al. "One-Shot Learning of Object Categories." IEEE Transactions on Pattern Analysis and Machine Intelligence. vol. 28. No. 4. Apr. 2006. pp. 594-611.
Liu et al. "Benchmarking Natural Language Understanding Services for Building Conversational Agents." Proceedings of the Tenth International Workshop on Spoken Dialogue Systems Technology. Mar. 26, 2019.
Malandrakis et al. "Controlled Text Generation for Data Augmentation in Intelligent Artificial Agents." Oct. 4, 2019.
Mikolov et al. "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems. Oct. 16, 2013. pp. 3111-3119.
Peng et al. "Few-shot Natural Language Generation for Task-Oriented Dialog." Feb. 27, 2020.
Snell et al. "Prototypical Networks for Few-shot Learning." Advances in Neural Information Processing Systems. Jun. 19, 2017. pp. 4077-4087.
Tran et al. "Dual Latent Variable Model for Low-Resource Natural Language Generation in Dialogue Systems." Nov. 10, 2018.
Vaswani et al. "Attention is All You Need." Advances in Neural Information Processing Systems. Dec. 6, 2017. pp. 5998-6008.
Wu et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." Oct. 8, 2016.
Xia et al. "CG-BERT: Conditional Text Generation with BERT for Generalized Few-shot Intent Detection." Apr. 4, 2020.
Xian et al. "Zero-Shot Learning—The Good, the Bad and the Ugly." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Mar. 13, 2017. pp. 4582-4591.
Xu et al. "Convolutional Neural Network Based Triangular CRF for Joint Intent Detection and Slot Filling." IEEE Workshop on Automatic Speech Recognition and Understanding. Dec. 2013. pp. 78-83.
Yoo et al. "Data Augmentation for Spoken Language Understanding via Joint Variational Generation." Proceedings of the AAAI Conference on Artificial Intelligence. Jul. 17, 2019. pp. 7402-7409.
Zhao et al. "Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders." Oct. 21, 2017.