The present disclosure is related to artificial intelligence performing classification of input data.
The present application relates to classification of input data. In an example, the present application discusses joint intent detection and slot filling for natural language understanding (NLU). Existing systems learn features collectively over all slot types (i.e., labels) and have no way to explain the model. A lack of explainability creates doubt in a user as to what a model is doing. A lack of explainability also makes improving the model difficult when errors occur. Adding explainability by an additional process unrelated to intent detection and slot filling reduces efficiency and correctness of explanations.
Embodiments provided herein provide classification (inference of mapping input data to one particular class from a set of classes or mapping input data to soft values, one soft value for each class of the set of classes) and explainability (visual outputs that explain how an AI model arrived at a classification).
In an artificial intelligence model (AI model) of embodiments provided herein, an utterance is processed to provide both classification of the utterance and a visualization of the process of the AI model. This is done by performing intent classification of the utterance, generating slot type weights and binary classifier logits, performing feature fusion, and performing slot classification. A graphical slot explanation is then output as a visualization along with slot logits. Based on the output, a voice-activated AI-based personal assistant can take action on the input utterance. Also, a debugging engineer is assisted by the visualization in a task of improving the AI model.
Provided herein is a method of visualizing a natural language understanding model, the method including: parsing an utterance into a vector of tokens; encoding the utterance with an encoder to obtain a vector of token embeddings; applying an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtaining a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the estimated intent; obtaining a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the estimated intent; visualizing the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; performing a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtaining, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
Also provided herein is a server for utterance recognition and model visualization, the server including: one or more processors; and one or more memories, the one or more memories storing a program, wherein execution of the program by the one or more processors is configured to cause the server to at least: parse an utterance into a vector of tokens; encode the utterance with an encoder to obtain a vector of token embeddings; apply an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtain a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the estimated intent; obtain a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the estimated intent; visualize the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; perform a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtain, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
Also provided herein is a non-transitory computer readable medium configured to store a program for utterance recognition and model visualization, wherein execution of the program by one or more processors of a server is configured to cause the server to at least: parse an utterance into a vector of tokens; encode the utterance with an encoder to obtain a vector of token embeddings; apply an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtain a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the estimated intent; obtain a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the estimated intent; visualize the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; perform a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtain, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Embodiments can be applied to any classification task to provide inherent explainability (no need for post-hoc techniques) to an existing classification method.
In one example, explainability is added to a joint intent detection and slot filling task related to natural language understanding (NLU). For example, a joint NLU model can include explainability in intent detection. Intent detection can be considered similar to document classification or paragraph/sentence classification, where the model classifies the entire input into one of the candidate class labels.
Some examples also include adding an explainable component to the slot filling task. Embodiments improve model accuracy in addition to providing explanations for the slot filling task.
Slot filling is similar to other applications such as named entity recognition (NER), part-of-speech (POS) tagging, and information extraction (IE). Therefore, embodiments provided herein are applicable to applications similar to slot filling where each word/token in an input has to be classified into one of the candidate class labels.
In embodiments provided herein, attention is found for each slot type of the natural language model. See the slot type attention module 3-3 in auxiliary network 3-10, described below.
Language encoder 3-1 performs parsing an utterance into a vector of tokens.
Language encoder 3-1 also performs encoding the utterance with an encoder to obtain a vector of token embeddings.
Intent classifier 3-2 performs applying an intent classifier, based on the vector of token embeddings, to obtain an estimated intent. Intent classifier 3-2 also provides intent logits.
In an example, intent classifier 3-2 is a single layer multi-class classifier that uses BERT as the language encoder 3-1. For a given utterance $u$, the embedding vector is the context embedding $u_c \in \mathbb{R}^d$. The intent classifier 3-2 outputs intent logits $g_{intent}$ as shown in Eq. 1.
$$g_{intent} = u_c W_{intent} + b_{intent} \quad \text{(Eq. 1)}$$

where $W_{intent} \in \mathbb{R}^{d \times |I_{label}|}$ and $b_{intent} \in \mathbb{R}^{|I_{label}|}$, with $I_{label}$ denoting the set of intent labels. The intent loss $L_{intent}$ is computed using the cross entropy loss over the intent labels as shown in Eq. 2.

$$L_{intent} = -\sum_{x=1}^{|I_{label}|} y_x \log\big(\mathrm{softmax}(g_{intent})_x\big) \quad \text{(Eq. 2)}$$

where $y_x$ is the ground truth indicator for intent label $x$.
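A minimal sketch of the single-layer intent classifier of Eq. 1 and the cross entropy loss of Eq. 2, assuming PyTorch; the dimensions, the random context embedding, and the target index are illustrative placeholders, not values from the embodiments.

```python
import torch
import torch.nn as nn

d, num_intents = 768, 7                    # illustrative sizes; d matches a BERT hidden size
intent_head = nn.Linear(d, num_intents)    # W_intent and b_intent of Eq. 1

u_c = torch.randn(1, d)                    # context embedding u_c (e.g., a [CLS] embedding)
g_intent = intent_head(u_c)                # intent logits g_intent, Eq. 1

target = torch.tensor([3])                 # ground-truth intent index (illustrative)
loss_intent = nn.functional.cross_entropy(g_intent, target)   # Eq. 2
```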
Auxiliary network 3-10 allows system 1-11 to learn explainable weights and general, yet slot-specific, features. Auxiliary network 3-10 includes a slot type attention module 3-3 and a slot type classifier 3-4. Auxiliary network 3-10 performs obtaining a vector of multiple self-attentions. The auxiliary network 3-10 performs obtaining the vector of multiple self-attentions based on the vector of token embeddings and based on the intent logits. Embodiments apply self attention for each slot type and have a binary classifier for each slot type.
Dropout is used to randomly drop (not retain) data during processing, which helps the model generalize.

After applying dropout to the encoded utterance $u_e$, the intent logits $g_{intent}$ are concatenated with $u_e$. Because $g_{intent} \in \mathbb{R}^{|I_{label}|}$ while $u_e \in \mathbb{R}^{l \times d}$, the intent logits are repeated for each of the $l$ tokens to form $g_{int\_e}$, which is concatenated with $u_e$ token by token as shown in Eq. 3.
$$u_{sa} = SA\big(LL(LN(u_e \oplus g_{int\_e}; \theta_{LN}); \theta_{LL}); \theta_{SA}\big) \quad \text{(Eq. 3)}$$
The model parameters for LL, LN and SA are, respectively, θLL, θLN and θSA.
The self attention SA is as shown in Eq. 4.
Query, key and value (Qx, Kx, Vx) are calculated from input x using three different linear projections, respectively, LLq(x), LLk(x) and LLv(x). dx is the dimension of the input x and also of Kx.
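The self attention referred to by Eq. 4 corresponds to the standard scaled dot-product form; consistent with the definitions just given, it can be written as:

$$SA(x) = \mathrm{softmax}\!\left(\frac{Q_x K_x^{T}}{\sqrt{d_x}}\right) V_x, \qquad Q_x = LL_q(x),\; K_x = LL_k(x),\; V_x = LL_v(x)$$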
Next, the original embedding tensor is added and a layer normalization is performed before proceeding to the auxiliary network 3-10. See Eq. 5.
$$\bar{u}_e = LN(u_e + u_{sa}; \theta) \quad \text{(Eq. 5)}$$

where $\bar{u}_e$ denotes the normalized token embeddings that are provided to the auxiliary network 3-10.
To obtain the slot type weights, the slot type attention module 3-3 projects the input utterance into various representations, one for each slot type of the number of slot types. Slot type attention module 3-3 thus explains the slots at a fine level of granularity. The weights are computed over the entire input for each token and also per slot type. Multiple attentions are computed, one for each slot type. Thus, embodiments obtain slot type specific attention weights per token with respect to the entire input utterance.
Training is important to obtain meaningful weights. Training proceeds as follows. Embodiments project the input utterance into n different embedding vectors, where n is the number of slot types for the entire problem definition. Embodiments use these n projections to predict each of the slot type binary outputs. During training, each binary classifier is trained to predict this binary decision, and because of this explicit auxiliary network training, embodiments are able to learn slot type specific attention weights, since the overall network is constrained by directly linking the n self attentions with the n binary classifiers. That is, each of the n binary classifiers predicts, for each token, whether the token belongs to its slot type or not. The binary classifiers do not identify entity spans as the final slot classifier does; they only predict whether each token belongs to a slot type or not. This output is a soft value between 0 and 1.
The process of obtaining training data for the auxiliary network is as follows.
Step 1 of obtaining training data: for each slot type except O, assign a digit "1" to slot tokens (tokens whose B- or I-prefixed label belongs to that slot type) and assign a digit "0" otherwise.

Step 2 of obtaining training data: for the O slot type, all non-slot tokens are marked with a digit "1" and the rest are marked with a digit "0."
To train the n binary classifiers, embodiments create training data as shown in the example of Table 1 and using the steps given above. In general, for each utterance, embodiments create n different training examples, one per slot type. See the example of Table 1, which indicates the original ground truth in BIO (sequence labeling) format.
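A short sketch of Steps 1 and 2 above, generating one binary target vector per slot type from a BIO-labeled utterance; the example utterance, labels, and slot type names are illustrative and are not taken from Table 1.

```python
from typing import Dict, List

def bio_to_binary_targets(bio_labels: List[str], slot_types: List[str]) -> Dict[str, List[int]]:
    """Create one binary target vector per slot type from BIO ground truth.

    Step 1: for every slot type except "O", mark B-/I- tokens of that type with 1, else 0.
    Step 2: for the "O" slot type, mark non-slot tokens with 1, else 0.
    """
    targets = {}
    for slot_type in slot_types:
        if slot_type == "O":
            targets["O"] = [1 if lab == "O" else 0 for lab in bio_labels]
        else:
            targets[slot_type] = [
                1 if lab in (f"B-{slot_type}", f"I-{slot_type}") else 0 for lab in bio_labels
            ]
    return targets

# Illustrative utterance: "play songs by the beatles"
bio = ["O", "O", "O", "B-artist", "I-artist"]
print(bio_to_binary_targets(bio, ["artist", "O"]))
# {'artist': [0, 0, 0, 1, 1], 'O': [1, 1, 1, 0, 0]}
```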
Further with respect to training, binary input vectors (such as the example of Table 1) are used to train each of the binary classifiers (see the slot type classifier 3-4 of auxiliary network 3-10).
As mentioned above, without proper supervision (training), these weights of the auxiliary network 3-10 would not be meaningful. Embodiments provide binary classifiers to ensure the weights are meaningful. The multiple projections are made to correspond respectively to the multiple classifiers; there is one classifier for each slot type. The self attention defined in Eq. 5 is applied $|T_{label}|$ times ($|T_{label}| = n$, the number of slot types for this problem), once for each slot type, by projecting different query, key and value tensors ($Q_{label\_i}$, $K_{label\_i}$, $V_{label\_i}$, which are $\in \mathbb{R}^{l \times d}$) for each slot type. The output of the $i$-th slot type attention yields the per-type token representation $h_{type\_i} \in \mathbb{R}^{l \times d_h}$ used for binary classification in Eq. 7.
As mentioned above, slot type projections require additional optimization objectives so that the self attention weights are meaningful and not random noise. Embodiments train each slot type projection to predict binary output that states whether an input token in the utterance is true (1) or false (0) for that slot type. This data takes the role of ground truth for the training of the auxiliary network that computes slot type specific attentions. This training data is automatically generated from sequence labeling BIO ground truth as shown in the example in Table 1. Binary format for a slot type except for O is generated by assigning digit “1” to slot tokens (non-O positions) and assigning digit “0” otherwise. For O, all non-slot tokens are marked as digit “1” and the rest as digit “0.”
With the additional optimization objectives for each slot type, self attention weights can be used to visualize attention points of the input for each token with respect to each slot type as explanations. This is because: (i) per slot type, the embedding for the token being classified is computed attending to all the input tokens including itself, (ii) the same token embedding is used for the particular slot type classification (controlled supervision), and (iii) type classification output logits are fed to the final slot classification as additional features (explicit connection to final classification). Embodiments provide binary classification output logits enabling the model to learn general patterns/features specific to each slot type that also boost model performance. Binary classification for ith slot type type_i∈Tlabel is performed as follows.
Embodiments initialize a set of weight tensors $W^H$ and biases $b^H$, of which there are $|T_{label}|$ each (one per slot type).
An output feature vector is then given as in Eq. 7.
$$g^{slot\_type}_{type\_i} = h_{type\_i} W^{H}_{type\_i} + b^{H}_{type\_i} \quad \text{(Eq. 7)}$$
where the dimension of the left-hand side of Eq. 7 is $l \times 1$, and the dimensions of the three quantities on the right-hand side of Eq. 7, from left to right, are $l \times d_h$, $d_h \times 1$, and a scalar (the bias).
The cross entropy loss for optimization is computed as shown in Eq. 8.
where $Y_{jt}$ is the one-hot encoded ground truth vector with element values 0 except for a single 1, $p_j$ is the softmax probability, and $N$ is the total number of data points. Embodiments measure the binary cross entropy loss collectively for all the slot-type classifiers. If there are $\eta$ data points originally in the dataset, then $N = \eta \cdot |T_{label}|$. For a set $S$, $|S|$ indicates the number of elements in the set $S$.
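A minimal sketch of the auxiliary network described above: one self attention and one binary head per slot type, trained with a binary cross entropy objective over the per-token binary targets. The module names, dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the exact layers of the embodiments.

```python
import torch
import torch.nn as nn

class SlotTypeAuxiliaryNetwork(nn.Module):
    """One self attention (slot type attention) and one binary classifier per slot type."""

    def __init__(self, d: int, d_h: int, num_slot_types: int):
        super().__init__()
        self.attentions = nn.ModuleList(
            [nn.MultiheadAttention(d, num_heads=1, batch_first=True) for _ in range(num_slot_types)]
        )
        self.proj = nn.ModuleList([nn.Linear(d, d_h) for _ in range(num_slot_types)])
        self.binary_heads = nn.ModuleList([nn.Linear(d_h, 1) for _ in range(num_slot_types)])

    def forward(self, u):                      # u: (batch, l, d) normalized token embeddings
        weights, logits = [], []
        for attn, proj, head in zip(self.attentions, self.proj, self.binary_heads):
            h, w = attn(u, u, u)               # per-slot-type self attention; w: (batch, l, l)
            h = proj(h)                        # h_type_i: (batch, l, d_h)
            logits.append(head(h))             # per-token binary logits, as in Eq. 7
            weights.append(w)                  # attention weights used for visualization
        return torch.stack(weights, 1), torch.cat(logits, dim=-1)   # (b, n, l, l), (b, l, n)

model = SlotTypeAuxiliaryNetwork(d=768, d_h=128, num_slot_types=3)
u = torch.randn(2, 10, 768)                                   # illustrative token embeddings
attn_weights, type_logits = model(u)
binary_targets = torch.randint(0, 2, (2, 10, 3)).float()      # per-token, per-type 0/1 targets
loss_type = nn.functional.binary_cross_entropy_with_logits(type_logits, binary_targets)
```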
Auxiliary network 3-10 performs obtaining a vector of slot type weights for visualization. The obtaining the vector of slot type weights is performed by the auxiliary network 3-10 based on the vector of token embeddings and based on the estimated intent.
Feature fusing module 3-5 performs a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features.
Slot classifier 3-6 performs obtaining, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
Visualization module 3-7 performs visualizing the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column. Examples of the two-column format are provided in the drawings.
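One plausible rendering of the two column format, sketched here under the assumption that the first column holds the utterance tokens and the second column holds the slot type weight for a chosen slot type; the exact layout of the embodiments is the one shown in the drawings, and the tokens and weights below are illustrative only.

```python
def visualize_slot_type_weights(tokens, weights, slot_type):
    """Print a two column view: token in the first column, weight and a bar in the second."""
    print(f"slot type: {slot_type}")
    for token, w in zip(tokens, weights):
        bar = "#" * int(round(w * 20))          # simple textual bar for the weight
        print(f"{token:<15} | {w:.2f} {bar}")

visualize_slot_type_weights(
    ["play", "songs", "by", "the", "beatles"],
    [0.02, 0.05, 0.08, 0.25, 0.60],             # illustrative weights for one slot type
    slot_type="artist",
)
```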
Slot type logits obtained as explained above are AI features provided by embodiments; these AI features improve the accuracy of the slot classifier 3-6. The slot type logits are slot type specific and capture fine grained slot patterns. For quantitative examples of improvements compared to other language modelers for benchmark data sets, see Table 1 of U.S. Provisional Application No. 63/307,592 filed Feb. 7, 2022 to which benefit or priority is claimed, all of which application has been incorporated by reference above.
As mentioned above, embodiments combine slot specific features with original utterance embeddings. Each binary classifier outputs an $l$-length vector, where $l$ is the original utterance length. That is, each binary classifier outputs logit values, one real number per utterance token. Embodiments first combine all the binary classifier outputs (there are $n$ binary classifiers) by concatenating them at the token level, providing an $l \times n$ matrix, $g_{slot\_type\_c}$. This is projected through a linear layer into $g_p$, which has the same dimension as the input utterance embedding.
Then two projections from $g_p$ are computed using two different linear layers. These two projections act as the Key and Value vectors for the self attention (cross attention, because the Query comes from a different source than the Key and Value) performed with the Query from the original utterance embedding. The Query vector is computed from the original utterance embedding vector by applying a linear projection. Then, the cross attention mechanism is applied using the Query projected from the input utterance and the Key and Value results projected from the slot type features. This cross attention mechanism extracts features $u_{cross}$ from the input utterance influenced by the slot type features computed using slot type attention modeling. Cross attention is performed in the same way as self attention, except that the query and the key/value are projected from two different sources.
Specifically, embodiments concatenate all the slot type logits for each input utterance token and then apply cross attention with the query $Q_u$ projected from the utterance embeddings and the key $K_g$ and value $V_g$ projected from the slot type features.
Thus, embodiments perform feature fusion by computing a cross-attention vector based on the vector of slot type weights and based on the vector of token embeddings. The feature fusion includes forming an intermediate vector as a sum of the cross-attention vector and the vector of token embeddings. Also, the feature fusion includes forming the vector of fused features based on applying the intermediate vector to a linear layer and normalizing an output of the linear layer.
The logits concatenation is performed at the token level, where all the slot type logits for a token are concatenated to produce the tensor $g_{slot\_type\_c}$, which is $\in \mathbb{R}^{l \times |T_{label}|}$. A linear layer then projects $g_{slot\_type\_c}$ into $g_p \in \mathbb{R}^{l \times d}$.
Query, key and value tensor projections (dimension l×d) are computed as follows. The layer parameters for the three different projection layers are θ1, θ2, and θ3.
$$Q_{u} = LL(u_e; \theta_1), \quad K_{g} = LL(g_p; \theta_2), \quad V_{g} = LL(g_p; \theta_3) \quad \text{(Eq. 9)}$$

Cross attention over these projections (applied in the manner of the self attention described with Eq. 4, with the query taken from a different source than the key and value) yields $u_{cross}$.
Cross attention between the utterance embeddings and slot type logits highlights slot type specific features from the utterance embeddings. This is added to the utterance as follows to provide the slot classifier input $u_{slot}$, which has dimension $l \times d$. Here $\mathrm{drop}(\cdot)$ indicates application of dropout; dropout is applied to the embedding vector passed from the feature fusion module 3-5 in order to generalize and reduce overfitting. Embodiments apply a single layer classifier that outputs slot logits $g_{slot}$. Argmax is applied on top of these logits to determine the slot predictions from the candidate slots; the index of the maximum value is the predicted slot label. $S_{label}$ is the BIO slot label set.
$$u_{slot} = \mathrm{drop}\big(LL(LN(u_e + u_{cross}; \theta_{s_{LN}}); \theta_{s_{LL}})\big) \quad \text{(Eq. 10)}$$
Finally, the slot logits tensor $g_{slot}$, which is of dimension $l \times |S_{label}|$, is computed as shown in Eq. 11 ($W_{slot}$ has dimension $d \times |S_{label}|$).

$$g_{slot} = u_{slot} W_{slot} + b_{slot} \quad \text{(Eq. 11)}$$
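A minimal sketch of the feature fusion (Eqs. 9 and 10) and the slot classifier (Eq. 11), assuming PyTorch; the layer names, dimensions, and the single-head cross attention are illustrative assumptions rather than the exact layers of the embodiments.

```python
import torch
import torch.nn as nn

class FeatureFusionSlotClassifier(nn.Module):
    def __init__(self, d: int, num_slot_types: int, num_slot_labels: int, p_drop: float = 0.1):
        super().__init__()
        self.type_proj = nn.Linear(num_slot_types, d)   # g_slot_type_c (l x n) -> g_p (l x d)
        self.q = nn.Linear(d, d)                        # Q from the utterance embeddings (theta_1)
        self.k = nn.Linear(d, d)                        # K from the slot type features (theta_2)
        self.v = nn.Linear(d, d)                        # V from the slot type features (theta_3)
        self.norm = nn.LayerNorm(d)
        self.fuse = nn.Linear(d, d)
        self.drop = nn.Dropout(p_drop)
        self.slot_head = nn.Linear(d, num_slot_labels)  # W_slot, b_slot of Eq. 11

    def forward(self, u_e, type_logits):                # u_e: (b, l, d); type_logits: (b, l, n)
        g_p = self.type_proj(type_logits)               # (b, l, d)
        q, k, v = self.q(u_e), self.k(g_p), self.v(g_p)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5), dim=-1)
        u_cross = attn @ v                              # cross attention output (b, l, d)
        u_slot = self.drop(self.fuse(self.norm(u_e + u_cross)))   # Eq. 10
        return self.slot_head(u_slot)                   # slot logits g_slot, Eq. 11

fusion = FeatureFusionSlotClassifier(d=768, num_slot_types=3, num_slot_labels=7)
g_slot = fusion(torch.randn(2, 10, 768), torch.randn(2, 10, 3))
slot_pred = g_slot.argmax(dim=-1)                      # predicted BIO slot label per token
```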
The slot loss Lslot is computed using the cross entropy loss as shown in Eq. 12.
$$L_{slot} = -\sum_{x=1}^{|S_{label}|} y_x \log\big(\mathrm{softmax}(g_{slot})_x\big) \quad \text{(Eq. 12)}$$
The various parameters θ of the three networks indicated by the intent classifier 3-2, the auxiliary network 3-10 and the slot classifier 3-6 are trained using an overall loss defined in Eq. 13, in which hyperparameters representing loss weights are indicated by α, β, γ.
$$L = \alpha L_{intent} + \beta L_{type} + \gamma L_{slot} \quad \text{(Eq. 13)}$$
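A brief sketch of the joint optimization of Eq. 13; the loss weights and loss values below are illustrative placeholders.

```python
import torch

alpha, beta, gamma = 1.0, 0.5, 1.0        # illustrative loss weight hyperparameters
loss_intent = torch.tensor(0.9)           # L_intent from Eq. 2 (placeholder value)
loss_type = torch.tensor(0.4)             # L_type from Eq. 8 (placeholder value)
loss_slot = torch.tensor(1.2)             # L_slot from Eq. 12 (placeholder value)

loss = alpha * loss_intent + beta * loss_type + gamma * loss_slot   # Eq. 13
# In training, loss.backward() would update the intent classifier, the auxiliary network
# and the slot classifier jointly.
```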
In an example implementation, the language encoder 3-1 is a BERT encoder.
The BERT encoder outputs the token embeddings. The intent classifier 3-2 operates on the token embeddings to produce the intent logits.
The slot type attention module 3-3 operates on the token embeddings and the intent logits, and its output is provided to the slot type classifier 3-4.
The slot type classifier 3-4 outputs vectors of dimension l×1 composed of binary classifier logits. A binary classifier logit is a likelihood indication of whether a given token should be associated with that slot type.
To provide an input to feature fusing module 3-5, the binary classifier logits are concatenated and the resulting data structure has dimension l×n. This data structure, in the feature fusing module 3-5, passes through a linear layer ("LL") to obtain a data structure of dimension l×d. This data structure is processed by respective linear layers to obtain K and V values. Q values are obtained by applying a linear layer to the token embeddings. Cross attention is then applied to the Q, K and V values and the result is added to the token embeddings. The result of the sum is applied to a layer normalization ("LN") and a linear layer, and the output is input to slot classifier 3-6, which provides the slot logits.
Considering the flow of operations, an utterance is first obtained and intent classification of the utterance is performed.
At operation 6-6, slot type weights and binary classifier logits are obtained based on the utterance.
At operation 6-8, feature fusion is performed based on the utterance and the binary classifier logits.

At operation 6-10, visualization is provided based on the slot type weights, and utterance recognition is provided in terms of slot logits.
Logic 7-1, at operation 7-2, performs parsing an utterance into a vector of tokens.
Logic 7-1, at operation 7-4, performs encoding the utterance with an encoder to obtain a vector of token embeddings.
Logic 7-1, at operation 7-6, performs applying an intent classifier, based on the vector of token embeddings, to obtain an estimated intent. Intent classifier 3-2 also provides intent logits.
Logic 7-1, at operation 7-8, performs obtaining a vector of slot type weights for visualization.
Logic 7-1, at operation 7-10 performs obtaining a vector of multiple self-attentions based on the vector of token embeddings and based on the estimated intent.
Logic 7-1, at operation 7-12 provides a visualization of the slot type weights in a two column format.
Logic 7-1, at operation 7-14, performs a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features.
Logic 7-1, at operation 7-16 performs obtaining, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
In some embodiments, intent explainability is provided along with the classifier.
Intent weights are developed as described below.
Details of obtaining the intent weights and gintent are provided in the description below, including Equations 14-20.
The following description relates to a general classification task. A specific example is intent classification and explainability for NLU. Intent classification and explainability is analogous to sentence classification and many other similar classification applications. Some embodiments use a two step transformation as described below.
In the two step transformation, embodiments first use multiple self attentions to transform input into m (number of intent classes in the NLU case) transformations and then use general query-based attention to learn and compute explainable attention weights.
In a first step, the input words of the input utterance are visualized with respect to their importance to the intent classification task. Fine grained explainability is provided by embodiments, that is, the ability to explain through visualization of attention weights per intent class. The input utterance is projected m times, where m is the number of intent class labels, using m separate self attentions. The original utterance is projected into every intent class representation. The ith intent class projection of the input utterance $u_e$ is as follows.
$$u^{I}_{i} = SA(u_e; \theta_{I_i}) \quad \text{(Eq. 14)}$$
In a second step, a general attention mechanism is applied on those projections so that attention weights on the input utterance tokens are obtained with respect to each intent class. The CLS embedding from the BERT encoding of utterance u is the query for the general attention. Score vector scorei for the ith intent class and the corresponding attention weights αi are computed as follows.
$$\mathrm{score}_i = CLS \times u^{I}_{i} \quad \text{(Eq. 15)}$$

$$\alpha_i = \mathrm{softmax}(\mathrm{score}_i) \quad \text{(Eq. 16)}$$
As a third step, attention weights α computed using the general attention are used to visualize the intent detection task. Embodiments supervise this attention weight computation specifically, and hence embodiments use a set of binary classifiers, one for each intent class. Each binary classifier predicts whether its corresponding intent class (the class designated for that binary classifier) is true or false. For each binary classifier, embodiments take the attention weighted average of the token embeddings as the input, as follows. The ith intent class representation $c_i$ of the input utterance $u$ is computed using the ith intent class attention weights $\alpha_i$, where $t_{i,x} \in u^{I}_{i}$ is the xth token embedding in the intent class representation of input utterance $u$.

$$c_i = \sum_{x=1}^{l} \alpha_{i,x}\, t_{i,x} \quad \text{(Eq. 17)}$$
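A minimal sketch of the first three steps above (per-intent-class self attention of Eq. 14, the CLS-query general attention of Eqs. 15 and 16, and the weighted class representation of Eq. 17), assuming PyTorch; the dimensions, the use of `nn.MultiheadAttention`, and the sample inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntentClassAttention(nn.Module):
    """Per-intent-class projections and CLS-query attention producing class representations c_i."""

    def __init__(self, d: int, num_intents: int):
        super().__init__()
        self.class_attn = nn.ModuleList(
            [nn.MultiheadAttention(d, num_heads=1, batch_first=True) for _ in range(num_intents)]
        )

    def forward(self, u_e, cls):                 # u_e: (b, l, d); cls: (b, d)
        reps, alphas = [], []
        for attn in self.class_attn:
            u_i, _ = attn(u_e, u_e, u_e)         # Eq. 14: i-th intent class projection of u_e
            score = torch.einsum("bd,bld->bl", cls, u_i)    # Eq. 15: score_i = CLS x u_i
            alpha = torch.softmax(score, dim=-1)             # Eq. 16: attention weights
            c_i = torch.einsum("bl,bld->bd", alpha, u_i)     # Eq. 17: weighted class representation
            reps.append(c_i)
            alphas.append(alpha)
        return torch.stack(reps, 1), torch.stack(alphas, 1)  # (b, m, d), (b, m, l)

module = IntentClassAttention(d=768, num_intents=5)          # m = 5 intent classes (illustrative)
u_e, cls = torch.randn(2, 10, 768), torch.randn(2, 768)
c, alpha = module(u_e, cls)                                   # alpha can be visualized per intent class
```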
In a fourth step, embodiments initialize a set of weight tensors $W^I$ and biases $b^I$, one weight tensor and one bias per intent class in $I$, and compute a binary classifier logit $g^{I}_{i} \in \mathbb{R}^{1}$ as follows.
$$g^{I}_{i} = c_i W^{I}_{i} + b^{I}_{i} \quad \text{(Eq. 18)}$$
The network is optimized using cross entropy loss LI.
As a fifth step, embodiments concatenate the intent class specific binary classifier logits and apply a linear layer to get $out_c = \mathrm{Linear}(\mathrm{concat}(g^I))$, where $out_c \in \mathbb{R}^d$. Embodiments then add the original context CLS to the projected intent-specific logits to get $CLS_c = LL(LN(CLS + out_c; \theta_{nc}); \theta_{lc})$. Then the intent classification is performed as follows.
$$g_{intent} = CLS_c W_{intent} + b_{intent} \quad \text{(Eq. 19)}$$
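A minimal sketch of the fourth and fifth steps above (per-class binary logits of Eq. 18, fusion with the CLS embedding, and the final intent classification of Eq. 19); the layer names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntentExplainableHead(nn.Module):
    def __init__(self, d: int, num_intents: int):
        super().__init__()
        self.binary_heads = nn.ModuleList([nn.Linear(d, 1) for _ in range(num_intents)])  # Eq. 18
        self.out_proj = nn.Linear(num_intents, d)      # project concatenated binary logits to d
        self.norm = nn.LayerNorm(d)
        self.fuse = nn.Linear(d, d)
        self.intent_head = nn.Linear(d, num_intents)   # W_intent, b_intent of Eq. 19

    def forward(self, c, cls):                         # c: (b, m, d) class representations; cls: (b, d)
        g_i = torch.cat([head(c[:, i]) for i, head in enumerate(self.binary_heads)], dim=-1)
        out_c = self.out_proj(g_i)                     # (b, d)
        cls_c = self.fuse(self.norm(cls + out_c))      # CLS_c fused with intent-specific logits
        return self.intent_head(cls_c), g_i            # g_intent (Eq. 19) and per-class binary logits

head = IntentExplainableHead(d=768, num_intents=5)
g_intent, g_binary = head(torch.randn(2, 5, 768), torch.randn(2, 768))
```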
Full network optimization with slot classifier and slot type classifier are performed as follows with the aggregated loss, where weights to be adjusted for performing the optimization are given as α, η, β, and γ.
$$L = \alpha L_{intent} + \eta L_{I} + \beta L_{type} + \gamma L_{slot} \quad \text{(Eq. 20)}$$
Hardware for performing embodiments provided herein is now described with respect to the accompanying drawings.
An example of a joint intent detection and slot filling task is provided above.
Some examples also include adding an explainable component to the slot filling task. As shown, embodiments improve joint model accuracy in addition to providing explanations for the slot filling task.
The explainable component of embodiments can be applied to any classification task to provide inherent explainability (no need for post-hoc techniques) to an existing classification method.
As an example provided above, a joint NLU model can include explainability in intent detection. See the description above relating to Equations 14-20.
On the other hand, slot filling is similar to other applications such as named entity recognition (NER), part-of-speech (POS) tagging, and information extraction (IE).
Therefore, embodiments provided herein are applicable to applications similar to slot filling where each word/token in an input has to be classified into one of the candidate class labels.
As explained above, the inherently explainable component of embodiments may include: (i) transforming the input into the exact number of final classifier classes while incorporating a suitable attention mechanism, (ii) constraining the attention weight learning through an auxiliary network of classifiers with an additional task, and (iii) combining the output of the auxiliary network with the main network classifier so that, with the new features, the main network classifier performance can also be improved.
This application claims benefit of priority of U.S. Provisional Application No. 63/307,592 filed Feb. 7, 2022, the contents of which are hereby incorporated by reference.