This application claims priority to Chinese Patent Application No. 202211187080.7, filed on Sep. 28, 2022, which is hereby incorporated by reference in its entirety.
The present disclosure relates to the technical field of natural language processing and artificial intelligence, in particular to an automatic dialogue method and system based on deep bi-directional attention.
Automatic dialogue technology is an important mode of human-computer interaction and has been widely applied across society, for example in intelligent customer service, intelligent assistants and search engines. As an important research direction in the computer field, automatic dialogue technology has great research significance and application value. Depending on the number of dialogue rounds, automatic dialogue methods can be divided into single-round dialogue and multi-round dialogue. Single-round dialogue only needs to judge the relationship between a question and a candidate response, while multi-round dialogue needs to judge the relationship between a plurality of historical dialogue utterances and candidate responses, which is closer to actual application scenarios and is more challenging. In short, the difficulties of multi-round dialogue mainly include two points.
First, the historical dialogue sequence is very long, so the encoding process inevitably leads to the loss of a large amount of semantic information.
Second, because of the semantic information lost during encoding, the interaction between the historical dialogue sequence and the candidate response is insufficient, which leads to inaccurate response prediction.
However, existing multi-round dialogue methods have not substantially solved the above technical problems. Therefore, the technical problem that urgently needs to be solved at present is how to alleviate the information loss in the semantic encoding process and how to enhance the semantic interaction between the historical dialogue and the candidate response, so as to improve the prediction accuracy of automatic dialogue.
The technical task of the present disclosure is to provide an automatic dialogue method and system based on deep bi-directional attention, so as to alleviate the problem of information loss in the semantic encoding process and enhance the semantic interaction between the historical dialogue and the candidate response, thereby improving the prediction accuracy of automatic dialogue.
The technical task of the present disclosure is implemented in the following way. An automatic dialogue method based on deep bi-directional attention is provided, the method includes:
Preferably, building an automatic dialogue model includes:
{right arrow over (I)}=ReLU(Dense(Idepth));
performing concatenating operation Concat on the n-th historical-dialogue longitudinal self-screening feature representation, the n-th candidate-response longitudinal self-screening feature representation, and the transverse interactive feature representation, so as to obtain a bi-directional feature representation {right arrow over (B)}, which is expressed as follows:
{right arrow over (B)}=Concat({right arrow over (Znh)},{right arrow over (Znr)},{right arrow over (I)});
label predicting: feeding the bi-directional feature representation as input into a fully connected layer with dimension 1 and a Sigmoid activation function, so as to obtain a probability that the current response is a correct response.
More preferably, the embedding processing includes:
where h represents a historical dialogue sequence; r represents a candidate-response sequence; Token_Emb( ) represents a Token layer embedding operation; Segment_Emb ( ) represents a Segment layer embedding operation; Position_Emb ( ) represents a Position layer embedding operation.
More preferably, deep bi-directional attention encoding includes:
{right arrow over (F1h)}=Encoder1({right arrow over (Eh)});
{right arrow over (F1r)}=Encoder1({right arrow over (Er)});
where {right arrow over (Eh)} represents the historical-dialogue embedded representation, {right arrow over (Er)} represents the candidate-response embedded representation, and Encoder1 represents a first-layer encoder;
where {right arrow over (F1h)} represents the first historical-dialogue encoded representation; {right arrow over (Eh)} represents the historical-dialogue embedded representation; {right arrow over (F1r)} represents the first candidate-response encoded representation; {right arrow over (Er)} represents the candidate-response embedded representation;
{right arrow over (F2h)}=Encoder2({right arrow over (Z1h)});
{right arrow over (F2r)}=Encoder2({right arrow over (Z1r)});
where {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation; Encoder2 represents the second-layer encoder;
performing cross-attention calculation on the second historical-dialogue encoded representation and the first historical-dialogue longitudinal self-screening feature representation, so as to obtain a second historical-dialogue longitudinal self-screening feature representation, which is denoted as {right arrow over (Z2h)}; performing cross-attention calculation on the second candidate-response encoded representation and the first candidate-response longitudinal self-screening feature representation, so as to obtain a second candidate-response longitudinal self-screening feature representation, which is denoted as {right arrow over (Z2r)}; performing concatenating operation on the second historical-dialogue encoded representation and the second candidate-response encoded representation, and using the self-attention mechanism to implement interactive processing therebetween, so as to obtain a second transverse interactive feature representation, which is denoted as {right arrow over (I2)}, wherein expressions are as follows:
where {right arrow over (F2h)} represents the second historical-dialogue encoded representation; {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (F2r)} represents the second candidate-response encoded representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation;
performing encoding operation on the second historical-dialogue longitudinal self-screening feature representation and the second candidate-response longitudinal self-screening feature representation by a third-layer encoder Encoder3; in a similar fashion, repeating the encoding operation for a plurality of times according to a predetermined hierarchical depth of the automatic dialogue model, until a final n-th historical-dialogue longitudinal self-screening feature representation, a final n-th candidate-response longitudinal self-screening feature representation and a final n-th transverse interactive feature representation are generated; performing encoding operation on an (n−1)-th historical-dialogue longitudinal self-screening feature representation and an (n−1)-th candidate-response longitudinal self-screening feature representation by an nth-layer encoder Encodern, so as to obtain an n-th historical-dialogue encoded representation and an n-th candidate-response encoded representation, which are denoted as {right arrow over (Fnh)} and {right arrow over (Fnr)}, and are expressed as follows:
where {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation; Encodern represents the nth-layer encoder;
performing cross-attention calculation on the n-th historical-dialogue encoded representation and the (n−1)-th historical-dialogue longitudinal self-screening feature representation, so as to obtain an n-th historical-dialogue longitudinal self-screening feature representation, which is denoted as {right arrow over (Znh)}; performing cross-attention calculation on the n-th candidate-response encoded representation and the (n−1)-th candidate-response longitudinal self-screening feature representation, so as to obtain an n-th candidate-response longitudinal self-screening feature representation, which is denoted as {right arrow over (Znr)}; performing concatenating operation on the n-th historical-dialogue encoded representation and the n-th candidate-response encoded representation, and using a self-attention mechanism to implement interactive processing therebetween, so as to obtain an n-th transverse interactive feature representation, which is denoted as {right arrow over (In)}, wherein expressions are as follows:
where {right arrow over (Fnh)} represents the n-th historical-dialogue encoded representation; {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Fnr)} represents the n-th candidate-response encoded representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation;
{right arrow over (Idepth)}=Concat({right arrow over (I1)},{right arrow over (I2)}, . . . ,{right arrow over (In)});
where {right arrow over (I1)}, {right arrow over (I2)}, and {right arrow over (In)} represent the first transverse interactive feature representation, the second transverse interactive feature representation and the n-th transverse interactive feature representation, respectively.
Preferably, the training an automatic dialogue model includes:
where ytrue is the true label; ypred is the correct probability output by the model;
An automatic dialogue system based on deep bi-directional attention is provided, the system includes:
Preferably, the automatic question-and-answer model building unit includes an input data building module, an embedding processing module, a deep bi-directional attention encoding module, a feature compressing module and a label predicting module;
More preferably, the implementation of the deep bi-directional attention encoding module includes:
{right arrow over (F1h)}=Encoder1({right arrow over (Eh)});
{right arrow over (F1r)}=Encoder1({right arrow over (Er)});
where {right arrow over (Eh)} represents the historical-dialogue embedded representation, {right arrow over (Er)} represents the candidate-response embedded representation, and Encoder1 represents the first-layer encoder;
where {right arrow over (F1h)} represents the first historical-dialogue encoded representation; {right arrow over (Eh)} represents the historical-dialogue embedded representation; {right arrow over (F1r)} represents the first candidate-response encoded representation; {right arrow over (Er)} represents the candidate-response embedded representation;
{right arrow over (F2h)}=Encoder2({right arrow over (Z1h)});
{right arrow over (F2r)}=Encoder2({right arrow over (Z1r)});
where {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation; Encoder2 represents the second-layer encoder;
where {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation; Encodern represents the nth-layer encoder;
where {right arrow over (Fnh)} represents the n-th historical-dialogue encoded representation; {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Fnr)} represents the n-th candidate-response encoded representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation;
{right arrow over (Idepth)}=Concat({right arrow over (I1)},{right arrow over (I2)}, . . . ,{right arrow over (In)});
where {right arrow over (I1)}, {right arrow over (I2)}, and {right arrow over (In)} represent the first transverse interactive feature representation, the second transverse interactive feature representation and the n-th transverse interactive feature representation, respectively.
An electronic device is provided, which includes a memory and at least one processor;
A computer-readable storage medium is provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the automatic dialogue method based on deep bi-directional attention as described above.
The automatic dialogue method and system based on deep bi-directional attention according to the present disclosure have the following advantages.
The present disclosure will be further explained with reference to the accompanying drawings hereinafter.
The automatic dialogue method and system based on deep bi-directional attention of the present disclosure will be described in detail with reference to the drawings and specific embodiments hereinafter.
As shown in
In S1, an automatic dialogue data set is acquired: a published automatic dialogue data set is downloaded from the network, or the automatic dialogue data set is built independently.
For example, there are many published automatic dialogue data sets on the Internet, such as Ubuntu Dialogue Corpus. The data format in the data set is shown in the following table:
In the training set and the verification set, there is only one positive response (positive (label: 1)) and one negative response (negative (label: 0)) for the same historical dialogue sequence. In the test set, there are one positive response (positive (label: 1)) and nine negative responses (negative (label: 0)).
In S2, an automatic dialogue model is built, particularly, an automatic dialogue model based on deep bi-directional attention is built.
In S3, an automatic dialogue model is trained, particularly, an automatic dialogue model is trained on the automatic dialogue data set.
As shown in
In S201, input data is built. Specifically, for each piece of data in the automatic dialogue data set, all historical dialogue sentences are concatenated and separated from each other by the separator token "[SEP]"; the result is denoted as h (history). A response is selected from a plurality of responses as the current response, and is formalized as r (response). The label of the piece of data is determined according to whether the response is correct: if the response is correct, the label is denoted as 1; otherwise, the label is denoted as 0. Together, h, r and the label form a piece of input data.
For example, the data shown in step S1 is used as an example to form a piece of input data. The results are as follows:
(h:what is that ubuntu package that installs all the mp3 codec the nvidia driver dvd support etc.? [SEP] you can do that easily with instructions provided here [SEP] i remember there was some package that did it all for you . . . called ez-something [SEP] easyubuntu [SEP] that be it . . . thanks, r: man mount what flag am i looking for, 1)
In S202, embedding processing is performed. Specifically, embedding processing is performed on the input data through a Token layer, a Segment layer and a Position layer, and embedded representations of the three layers are added to obtain a historical-dialogue embedded representation and a candidate-response embedded representation, which specifically includes steps S20201-S20204.
where h represents a historical dialogue sequence; r represents a candidate response sequence; Token_Emb( ) represents the Token layer embedding operation; Segment_Emb( ) represents the Segment layer embedding operation; Position_Emb( ) represents the Position layer embedding operation.
For example, when the present disclosure is implemented on the Ubuntu Dialogue Corpus data set, the embedding layer of the pre-trained language model BERT is called to complete the embedding and addition operations of the Token layer, the Segment layer and the Position layer, and the embedding dimension is that of the BERT embedding layer, that is, 768 dimensions. In pytorch, the code described above is implemented as follows:
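A minimal sketch of the three-layer embedding-and-addition is given below. Since the actual BERT weights are external, randomly initialized nn.Embedding tables are used as stand-ins for BERT's embedding layer; the vocabulary size, maximum length and sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes; 30522 and 768 match BERT-base's vocabulary and hidden size.
vocab_size, max_len, dim = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, dim)   # Token layer embedding
segment_emb = nn.Embedding(2, dim)          # Segment layer embedding
position_emb = nn.Embedding(max_len, dim)   # Position layer embedding

def embed(ids, seg):
    # Add the three embedded representations element-wise.
    pos = torch.arange(ids.size(1)).unsqueeze(0)
    return token_emb(ids) + segment_emb(seg) + position_emb(pos)

# Tokenized historical dialogue sequence h (ids are placeholders).
h_ids = torch.randint(0, vocab_size, (1, 40))
E_h = embed(h_ids, torch.zeros(1, 40, dtype=torch.long))  # historical-dialogue embedded representation
```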
In S203, deep bi-directional attention encoding is performed. A multi-layer encoder is used to perform a longitudinal self-screening feature encoding operation and a transverse interactive feature encoding operation on the historical-dialogue embedded representation and the candidate-response embedded representation, so as to obtain the n-th historical-dialogue longitudinal self-screening feature representation, the n-th candidate-response longitudinal self-screening feature representation and the deep transverse interactive feature representation, which are denoted as {right arrow over (Znh)}, {right arrow over (Znr)} and {right arrow over (Idepth)};
{right arrow over (I)}=ReLU(Dense(Idepth)).
For example, in pytorch, the code described above is implemented as follows:
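A minimal sketch of the Dense-plus-ReLU feature compression is given below; the batch size, the number of layers n=12, and concatenation of the transverse representations along the feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden = 768        # config.hidden_size, the encoding dimension
n_layers = 12
# I_depth: deep transverse interactive feature representation, assumed here
# to be the feature-dimension concatenation of n transverse representations.
I_depth = torch.randn(2, n_layers * hidden)
# Full connection mapping (Dense) followed by ReLU mapping.
compress = nn.Sequential(nn.Linear(n_layers * hidden, hidden), nn.ReLU())
I = compress(I_depth)   # compressed transverse interactive feature representation
```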
where config.hidden_size is the encoding dimension, which is set to 768 in the present disclosure; I_depth is the deep transverse interactive feature representation; I is the transverse interactive feature representation.
The concatenating operation Concat is performed on the n-th historical-dialogue longitudinal self-screening feature representation, the n-th candidate-response longitudinal self-screening feature representation, and the transverse interactive feature representation, so as to obtain the bi-directional feature representation {right arrow over (B)}, which is expressed as follows:
{right arrow over (B)}=Concat({right arrow over (Znh)},{right arrow over (Znr)},{right arrow over (I)});
For example, in pytorch, the code described above is implemented as follows:
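A minimal sketch of the concatenating operation is given below; the batch size and the 768-dimensional feature vectors are illustrative assumptions.

```python
import torch

Z_h_n = torch.randn(2, 768)  # n-th historical-dialogue longitudinal self-screening representation
Z_r_n = torch.randn(2, 768)  # n-th candidate-response longitudinal self-screening representation
I = torch.randn(2, 768)      # transverse interactive feature representation
# Concat along the feature dimension yields the bi-directional representation B.
B = torch.cat([Z_h_n, Z_r_n, I], dim=-1)
```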
where Z_h_n is the n-th historical-dialogue longitudinal self-screening feature representation; Z_r_n is the n-th candidate-response longitudinal self-screening feature representation; I is the transverse interactive feature representation; B is the bi-directional feature representation.
In S205, label prediction is performed. The bi-directional feature representation is used as input and is processed by a fully connected layer with dimension 1 and a Sigmoid activation function, so as to obtain a probability indicating that the current response is a correct response.
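The label-prediction step can be sketched in PyTorch as follows; the input dimension of 768 and the batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden = 768
# A dimension-1 fully connected layer followed by Sigmoid maps each
# bi-directional feature vector B to a correct-response probability.
predictor = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

B = torch.randn(4, hidden)   # batch of bi-directional feature representations
prob = predictor(B)          # probability that each current response is correct
```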
The embedding processing is performed as described in step S202 of this embodiment.
As shown in
In S20301, encoding operation is performed on the historical-dialogue embedded representation and the candidate-response embedded representation by a first-layer encoder Encoder1, so as to obtain the first historical-dialogue encoded representation and the first candidate-response encoded representation, which are denoted as {right arrow over (F1h)} and {right arrow over (F1r)}, which is expressed as follows:
{right arrow over (F1h)}=Encoder1({right arrow over (Eh)});
{right arrow over (F1r)}=Encoder1({right arrow over (Er)});
where {right arrow over (Eh)} represents the historical-dialogue embedded representation, {right arrow over (Er)} represents the candidate-response embedded representation, and Encoder1 represents the first-layer encoder;
where {right arrow over (F1h)} represents the first historical-dialogue encoded representation; {right arrow over (Eh)} represents the historical-dialogue embedded representation; {right arrow over (F1r)} represents the first candidate-response encoded representation; {right arrow over (Er)} represents the candidate-response embedded representation.
In S20303, encoding operation is performed on the first historical-dialogue longitudinal self-screening feature representation and the first candidate-response longitudinal self-screening feature representation by a second-layer encoder Encoder2, so as to obtain the second historical-dialogue encoded representation and the second candidate-response encoded representation, which are denoted as {right arrow over (F2h)} and {right arrow over (F2r)}, which are expressed as follows:
{right arrow over (F2h)}=Encoder2({right arrow over (Z1h)});
{right arrow over (F2r)}=Encoder2({right arrow over (Z1r)});
where {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation; Encoder2 represents the second-layer encoder.
In S20304, cross-attention calculation is performed on the second historical-dialogue encoded representation and the first historical-dialogue longitudinal self-screening feature representation, so as to obtain the second historical-dialogue longitudinal self-screening feature representation, which is denoted as {right arrow over (Z2h)}. A cross-attention calculation is performed on the second candidate-response encoded representation and the first candidate-response longitudinal self-screening feature representation, so as to obtain the second candidate-response longitudinal self-screening feature representation, which is denoted as {right arrow over (Z2r)}. At the same time, concatenating operation is performed on the second historical-dialogue encoded representation and the second candidate-response encoded representation, and then a self-attention mechanism is used to complete the interactive processing therebetween, so as to obtain the second transverse interactive feature representation, which is denoted as {right arrow over (I2)}, in which expressions are as follows:
where {right arrow over (F2h)} represents the second historical-dialogue encoded representation; {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (F2r)} represents the second candidate-response encoded representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation.
In S20305, encoding operation is performed on the second historical-dialogue longitudinal self-screening feature representation and the second candidate-response longitudinal self-screening feature representation by a third-layer encoder Encoder3; in a similar fashion, encoding is performed repeatedly according to a preset hierarchical depth of the automatic dialogue model, until a final n-th historical-dialogue longitudinal self-screening feature representation, the n-th candidate-response longitudinal self-screening feature representation and the n-th transverse interactive feature representation are generated; encoding operation is performed on the (n−1)-th historical-dialogue longitudinal self-screening feature representation and the (n−1)-th candidate-response longitudinal self-screening feature representation by the nth-layer encoder Encodern, so as to obtain the n-th historical-dialogue encoded representation and the n-th candidate-response encoded representation, which are denoted as {right arrow over (Fnh)} and {right arrow over (Fnr)}, which are expressed as follows:
where {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation; Encodern represents the nth-layer encoder;
where {right arrow over (Fnh)} represents the n-th historical-dialogue encoded representation; {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Fnr)} represents the n-th candidate-response encoded representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation;
{right arrow over (Idepth)}=Concat({right arrow over (I1)},{right arrow over (I2)}, . . . ,{right arrow over (In)});
where {right arrow over (I1)}, {right arrow over (I2)}, and {right arrow over (In)} represent the first transverse interactive feature representation, the second transverse interactive feature representation and the n-th transverse interactive feature representation, respectively.
For example, when the present disclosure is implemented on the Ubuntu Dialogue Corpus data set, the Transformer Encoder is selected as the encoder, the encoding dimension is set to 768, and the number of layers is set to 12.
Cross Attention adopts the Dot-Product Attention calculation method. Taking the calculation of the first historical-dialogue longitudinal self-screening feature representation as an example, the calculation process is as follows:
The expression realizes an interactive calculation between the first historical-dialogue encoded representation and the historical-dialogue embedded representation by dot product multiplication operation, F1h represents the first historical-dialogue encoded representation, Eh represents the historical-dialogue embedded representation, and ⊗ represents dot product multiplication operation.
The above expression represents an attention weight α obtained by normalization operation, i and i′ represent element subscripts in the corresponding input tensors, l represents a number of elements in the input tensors Eh, and other symbols have the same meanings as the above expression;
The above expression uses the obtained attention weight α to complete the feature screening of the historical-dialogue embedded representation, so as to obtain the first historical-dialogue longitudinal self-screening feature representation; l represents a number of elements in Eh and α.
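The three steps above (dot-product interaction, normalization into the attention weight α, and α-weighted feature screening) can be sketched as a batched Dot-Product Cross Attention; the tensor shapes are illustrative assumptions.

```python
import torch

def cross_attention(F, E):
    # F: (batch, len_f, dim) encoded representation (e.g. F1h)
    # E: (batch, len_e, dim) representation to be screened (e.g. Eh)
    scores = torch.bmm(F, E.transpose(1, 2))  # interactive calculation F ⊗ E
    alpha = torch.softmax(scores, dim=-1)     # attention weight α via normalization
    return torch.bmm(alpha, E)                # feature screening of E by α

F1_h = torch.randn(2, 40, 768)   # first historical-dialogue encoded representation
E_h = torch.randn(2, 40, 768)    # historical-dialogue embedded representation
Z1_h = cross_attention(F1_h, E_h)  # first longitudinal self-screening representation
```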
Self-Attention adopts the Self Dot-Product Attention calculation method; the calculation of the first transverse interactive feature representation is taken as an example. It is assumed that L=Concat(F1h;F1r). The calculation process is as follows:
The above expression indicates that the self-attention interaction calculation of a concatenating result of the first historical-dialogue encoded representation and the first candidate-response encoded representation is realized by dot product multiplication operation, L represents the concatenating result of the first historical-dialogue encoded representation and the first candidate-response encoded representation, and ⊗ represents dot product multiplication operation;
The above expression represents the attention weight α obtained by normalization operation, i and i′ represent element subscripts in the corresponding input tensors, l represents a number of elements in the input tensors L, and other symbols have the same meanings as the above expression;
The above expression indicates that the obtained attention weight α is used to complete the self-attention feature screening of the concatenating result of the first historical-dialogue encoded representation and the first candidate-response encoded representation, so as to obtain the first transverse interactive feature representation; l represents a number of elements in L and α.
In pytorch, the code described above is implemented as follows:
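A minimal sketch consistent with the description above is given below. The use of nn.TransformerEncoderLayer as the per-layer encoder, a shared stand-in self-attention layer for the transverse interaction, and concatenation of the per-layer transverse representations along the feature dimension are assumptions for illustration, not the definitive implementation.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 12

def cross_attention(F, E):
    # Dot-Product Cross Attention: screen E using weights derived from F ⊗ E.
    alpha = torch.softmax(torch.bmm(F, E.transpose(1, 2)), dim=-1)
    return torch.bmm(alpha, E)

# One Transformer encoder layer per hierarchical depth level.
encoders = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    for _ in range(num_layers)
)
# Stand-in self-attention layer for the transverse interactive processing.
self_attn = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

def deep_biattention(history_embed, response_embed):
    Z_h, Z_r, interactions = history_embed, response_embed, []
    for enc in encoders:
        F_h, F_r = enc(Z_h), enc(Z_r)      # layer-wise encoding
        Z_h = cross_attention(F_h, Z_h)    # longitudinal self-screening (history)
        Z_r = cross_attention(F_r, Z_r)    # longitudinal self-screening (response)
        L = torch.cat([F_h, F_r], dim=1)   # transverse concatenation
        interactions.append(self_attn(L))  # self-attention interactive processing
    # Deep transverse interactive representation: Concat(I1, I2, ..., In).
    I_depth = torch.cat(interactions, dim=-1)
    return Z_h, Z_r, I_depth

history_embed = torch.randn(2, 40, d_model)
response_embed = torch.randn(2, 16, d_model)
z_h_final, z_r_final, I_depth = deep_biattention(history_embed, response_embed)
```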
where history_embed represents the historical-dialogue embedded representation; response_embed represents the candidate-response embedded representation; z_h_final represents the n-th historical-dialogue longitudinal self-screening feature representation; z_r_final represents the n-th candidate-response longitudinal self-screening feature representation; d_model represents the word vector size required by the encoder, which is set to 512 here; nhead represents the number of heads in the multi-head attention model, which is set to 8 here; num_layers represents the number of layers of the encoder, which is set to 12 here. The training of the automatic dialogue model in step S3 of this embodiment is as follows.
In S301, a loss function is built with a cross entropy as the loss function, which is expressed as follows:
where ytrue is the true label; ypred is the correct probability output by the model.
For example, in pytorch, the code described above is implemented as follows.
The error between the predicted value and the true label is calculated by the cross entropy loss function.
where labels are the true labels, and logits is the correct probability of model output.
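Using the variable names labels and logits described above, a minimal sketch of this loss calculation is given below; because logits is a probability output (after Sigmoid), binary cross entropy is used, and the example values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.9, 0.2, 0.7])  # correct probabilities output by the model
labels = torch.tensor([1.0, 0.0, 1.0])  # true labels
# Error between the predicted value and the label via cross entropy.
loss = F.binary_cross_entropy(logits, labels)
```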
In S302, an optimization function is built. After a plurality of optimization functions were tested, the AdamW optimization function was finally selected; the learning rate is set to 2e-5, and the other hyper-parameters of AdamW are set to their default values in pytorch.
For example, in pytorch, the code described above is implemented as follows.
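A minimal sketch of building the AdamW optimizer is given below; the stand-in linear model is an illustrative assumption in place of the full dialogue model.

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(768, 1)  # stand-in for the automatic dialogue model
optimizer_grouped_parameters = model.parameters()  # all model parameters by default
# Learning rate 2e-5; other AdamW hyper-parameters left at PyTorch defaults.
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
```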
where optimizer_grouped_parameters are the parameters to be optimized, which by default are all parameters in the automatic question-and-answer model.
When the automatic dialogue model has not been trained, it is necessary to train the automatic dialogue model to optimize the parameters of the model; and when the automatic dialogue model has been trained, a label predicting module predicts which one of the candidate responses is the correct response.
As shown in
The automatic question-and-answer data set acquisition unit is configured to download a published automatic dialogue data set from the network or to build the automatic dialogue data set independently.
The automatic question-and-answer model building unit is configured to build an automatic dialogue model based on deep bi-directional attention.
The automatic question-and-answer model training unit is configured to train an automatic dialogue model on the automatic dialogue data set to complete the prediction of the candidate response.
The automatic question-and-answer model building unit in this embodiment includes an input data building module, an embedding processing module, a deep bi-directional attention encoding module, a feature compressing module and a label predicting module.
The input data building module is configured to preprocess an original data set so as to build input data.
The embedding processing module is configured to perform embedding processing on the input data through a Token layer, a Segment layer and a Position layer, and add the embedded representation of the Token layer, the embedded representation of the Segment layer, and the embedded representation of the Position layer to obtain the historical dialogue embedded representation and the candidate response embedded representation.
The deep bi-directional attention encoding module is configured to receive the historical-dialogue embedded representation and the candidate-response embedded representation output by the embedding processing module, and then perform longitudinal self-screening feature encoding operation and transverse interactive feature encoding operation on the historical-dialogue embedded representation and the candidate-response embedded representation in sequence by using a multilayer encoder, so as to obtain the n-th historical-dialogue longitudinal self-screening feature representation, the n-th candidate-response longitudinal self-screening feature representation and the deep transverse interactive feature representation.
The feature compressing module is configured to perform full connection mapping processing (Dense) and ReLU mapping processing on the deep transverse interactive feature representation, and concatenate a mapping result with the n-th historical-dialogue longitudinal self-screening feature representation and the n-th candidate-response longitudinal self-screening feature representation, so as to obtain a bi-directional feature representation.
The label predicting module is configured to predict a probability that the current response is a correct response based on the bi-directional feature representation.
The automatic question-and-answer model training unit in this embodiment includes a loss function building module and an optimization function building module.
The loss function building module is configured to calculate an error between the prediction result and a true label by using the cross entropy loss function.
The optimization function building module is configured to train and adjust parameters to be trained in the model and reduce the prediction error.
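The cross-entropy error between a predicted probability and a true label, as computed by the loss function building module, can be sketched as follows (binary form, since the model predicts whether the current response is correct):

```python
import numpy as np

def cross_entropy(p_pred, y_true, eps=1e-12):
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# A correct response (label 1) predicted with probability 0.9 incurs a
# small loss; the same prediction against label 0 incurs a large one.
print(cross_entropy(0.9, 1))  # ≈ 0.105
print(cross_entropy(0.9, 0))  # ≈ 2.303
```

During training, the optimization step adjusts the model parameters in the direction that reduces this error.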
As shown in
As shown in
Encoding operation is performed on the historical-dialogue embedded representation and the candidate-response embedded representation by a first-layer encoder Encoder1, respectively, so as to obtain the first historical-dialogue encoded representation and the first candidate-response encoded representation, which are denoted as {right arrow over (F1h)} and {right arrow over (F1r)} and expressed as follows:
{right arrow over (F1h)}=Encoder1({right arrow over (Eh)});
{right arrow over (F1r)}=Encoder1({right arrow over (Er)});
where {right arrow over (Eh)} represents the historical-dialogue embedded representation, {right arrow over (Er)} represents the candidate-response embedded representation, and Encoder1 represents the first-layer encoder.
Cross-attention calculation is performed on the first historical-dialogue encoded representation and the historical-dialogue embedded representation, so as to obtain the first historical-dialogue longitudinal self-screening feature representation, which is denoted as {right arrow over (Z1h)}; cross-attention calculation is performed on the first candidate-response encoded representation and the candidate-response embedded representation, so as to obtain the first candidate-response longitudinal self-screening feature representation, which is denoted as {right arrow over (Z1r)}; at the same time, concatenating operation is performed on the first historical-dialogue encoded representation and the first candidate-response encoded representation, and then a self-attention mechanism is used to complete the interactive processing therebetween, so as to obtain the first transverse interactive feature representation, which is denoted as {right arrow over (I1)}, in which the expressions are as follows:
{right arrow over (Z1h)}=CrossAttention({right arrow over (F1h)},{right arrow over (Eh)});
{right arrow over (Z1r)}=CrossAttention({right arrow over (F1r)},{right arrow over (Er)});
{right arrow over (I1)}=SelfAttention(Concat({right arrow over (F1h)},{right arrow over (F1r)}));
where {right arrow over (F1h)} represents the first historical-dialogue encoded representation; {right arrow over (Eh)} represents the historical-dialogue embedded representation; {right arrow over (F1r)} represents the first candidate-response encoded representation; {right arrow over (Er)} represents the candidate-response embedded representation; CrossAttention represents the cross-attention calculation; SelfAttention represents the self-attention mechanism; Concat represents the concatenating operation.
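One layer of the longitudinal self-screening and transverse interaction described above can be sketched in NumPy as follows; the encoder is replaced by a simple stand-in mapping (a real model would use a Transformer encoder block), and all sizes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(2)
L_h, L_r, DIM = 4, 3, 8
E_h = rng.normal(size=(L_h, DIM))  # historical-dialogue embedded repr.
E_r = rng.normal(size=(L_r, DIM))  # candidate-response embedded repr.

# Stand-in for the first-layer encoder Encoder1.
W_enc = rng.normal(size=(DIM, DIM))
F1_h, F1_r = np.tanh(E_h @ W_enc), np.tanh(E_r @ W_enc)

# Longitudinal self-screening: cross-attention between the encoded
# output and the layer's own input representation.
Z1_h = attention(F1_h, E_h, E_h)
Z1_r = attention(F1_r, E_r, E_r)

# Transverse interaction: concatenate both encodings, then self-attend.
C = np.concatenate([F1_h, F1_r], axis=0)
I1 = attention(C, C, C)
print(Z1_h.shape, Z1_r.shape, I1.shape)  # (4, 8) (3, 8) (7, 8)
```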
Encoding operation is performed on the first historical-dialogue longitudinal self-screening feature representation and the first candidate-response longitudinal self-screening feature representation by a second-layer encoder Encoder2, respectively, so as to obtain the second historical-dialogue encoded representation and the second candidate-response encoded representation, which are denoted as {right arrow over (F2h)} and {right arrow over (F2r)} and expressed as follows:
{right arrow over (F2h)}=Encoder2({right arrow over (Z1h)});
{right arrow over (F2r)}=Encoder2({right arrow over (Z1r)});
where {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation; Encoder2 represents the second-layer encoder.
Cross-attention calculation is performed on the second historical-dialogue encoded representation and the first historical-dialogue longitudinal self-screening feature representation, so as to obtain the second historical-dialogue longitudinal self-screening feature representation, which is denoted as {right arrow over (Z2h)}; cross-attention calculation is performed on the second candidate-response encoded representation and the first candidate-response longitudinal self-screening feature representation, so as to obtain the second candidate-response longitudinal self-screening feature representation, which is denoted as {right arrow over (Z2r)}; at the same time, concatenating operation is performed on the second historical-dialogue encoded representation and the second candidate-response encoded representation, and then a self-attention mechanism is used to complete the interactive processing therebetween, so as to obtain the second transverse interactive feature representation, which is denoted as {right arrow over (I2)}, in which the expressions are as follows:
{right arrow over (Z2h)}=CrossAttention({right arrow over (F2h)},{right arrow over (Z1h)});
{right arrow over (Z2r)}=CrossAttention({right arrow over (F2r)},{right arrow over (Z1r)});
{right arrow over (I2)}=SelfAttention(Concat({right arrow over (F2h)},{right arrow over (F2r)}));
where {right arrow over (F2h)} represents the second historical-dialogue encoded representation; {right arrow over (Z1h)} represents the first historical-dialogue longitudinal self-screening feature representation; {right arrow over (F2r)} represents the second candidate-response encoded representation; {right arrow over (Z1r)} represents the first candidate-response longitudinal self-screening feature representation.
Encoding operation is performed on the second historical-dialogue longitudinal self-screening feature representation and the second candidate-response longitudinal self-screening feature representation by a third-layer encoder Encoder3, respectively. In a similar fashion, the encoding is repeated a plurality of times according to a preset hierarchical depth of the automatic dialogue model, until the final n-th historical-dialogue longitudinal self-screening feature representation, the n-th candidate-response longitudinal self-screening feature representation and the n-th transverse interactive feature representation are generated. Specifically, encoding operation is performed on the (n−1)-th historical-dialogue longitudinal self-screening feature representation and the (n−1)-th candidate-response longitudinal self-screening feature representation by an n-th-layer encoder Encodern, respectively, so as to obtain the n-th historical-dialogue encoded representation and the n-th candidate-response encoded representation, which are denoted as {right arrow over (Fnh)} and {right arrow over (Fnr)} and expressed as follows:
{right arrow over (Fnh)}=Encodern({right arrow over (Zn-1h)});
{right arrow over (Fnr)}=Encodern({right arrow over (Zn-1r)});
where {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation; Encodern represents the n-th-layer encoder.
Cross-attention calculation is performed on the n-th historical-dialogue encoded representation and the (n−1)-th historical-dialogue longitudinal self-screening feature representation, so as to obtain the n-th historical-dialogue longitudinal self-screening feature representation, which is denoted as {right arrow over (Znh)}; cross-attention calculation is performed on the n-th candidate-response encoded representation and the (n−1)-th candidate-response longitudinal self-screening feature representation, so as to obtain the n-th candidate-response longitudinal self-screening feature representation, which is denoted as {right arrow over (Znr)}; at the same time, concatenating operation is performed on the n-th historical-dialogue encoded representation and the n-th candidate-response encoded representation, and then a self-attention mechanism is used to complete the interactive processing therebetween, so as to obtain the n-th transverse interactive feature representation, which is denoted as {right arrow over (In)}, in which the expressions are as follows:
{right arrow over (Znh)}=CrossAttention({right arrow over (Fnh)},{right arrow over (Zn-1h)});
{right arrow over (Znr)}=CrossAttention({right arrow over (Fnr)},{right arrow over (Zn-1r)});
{right arrow over (In)}=SelfAttention(Concat({right arrow over (Fnh)},{right arrow over (Fnr)}));
where {right arrow over (Fnh)} represents the n-th historical-dialogue encoded representation; {right arrow over (Zn-1h)} represents the (n−1)-th historical-dialogue longitudinal self-screening feature representation; {right arrow over (Fnr)} represents the n-th candidate-response encoded representation; {right arrow over (Zn-1r)} represents the (n−1)-th candidate-response longitudinal self-screening feature representation.
The first transverse interactive feature representation, the second transverse interactive feature representation, . . . , and the n-th transverse interactive feature representation are concatenated, so as to obtain the deep transverse interactive feature representation, which is denoted as {right arrow over (Idepth)}, and is expressed as follows:
{right arrow over (Idepth)}=Concat({right arrow over (I1)},{right arrow over (I2)}, . . . ,{right arrow over (In)});
where {right arrow over (I1)}, {right arrow over (I2)}, and {right arrow over (In)} represent the first transverse interactive feature representation, the second transverse interactive feature representation and the n-th transverse interactive feature representation, respectively.
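The full multilayer procedure, ending with the concatenation of all transverse interactive feature representations into the deep representation, can be sketched as follows; as before, the per-layer encoder is a simple stand-in mapping and all sizes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(3)
n_layers, L_h, L_r, DIM = 3, 4, 3, 8
Z_h = rng.normal(size=(L_h, DIM))  # starts as the embedded representation
Z_r = rng.normal(size=(L_r, DIM))

transverse = []
for _ in range(n_layers):
    W = rng.normal(size=(DIM, DIM))          # stand-in per-layer encoder
    F_h, F_r = np.tanh(Z_h @ W), np.tanh(Z_r @ W)
    # Longitudinal: cross-attend the encoder output with the layer input.
    Z_h, Z_r = attention(F_h, Z_h, Z_h), attention(F_r, Z_r, Z_r)
    # Transverse: concatenate both encodings and self-attend.
    C = np.concatenate([F_h, F_r], axis=0)
    transverse.append(attention(C, C, C))

# Deep transverse interactive feature: concatenation of all n layers.
I_depth = np.concatenate(transverse, axis=-1)
print(I_depth.shape)  # (7, 24)
```

After the loop, Z_h and Z_r hold the n-th longitudinal self-screening features, and I_depth holds the deep transverse interactive feature passed to the feature compressing module.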
The embodiment further provides an electronic device, which includes a memory and a processor.
The memory stores computer-executable instructions.
The processor executes the computer-executable instructions stored in the memory, so that the processor executes the automatic dialogue method based on deep bi-directional attention in any embodiment of the present disclosure.
The processor can be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor can be a microprocessor, any conventional processor, etc.
The memory can be used to store computer programs and/or modules. The processor can implement various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory can mainly include a program storage area and a data storage area. The program storage area can store an operating system, an application program required by at least one function, etc.; and the data storage area can store data created according to use of the terminal, etc. In addition, the memory can further include a high-speed random access memory and a nonvolatile memory, such as a hard disk, a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, a flash memory card, at least one disk storage device, a flash memory device, or other nonvolatile solid-state storage device.
The embodiment further provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions are loaded by a processor, so that the processor executes the automatic dialogue method based on deep bi-directional attention in any embodiment of the present disclosure. Specifically, a system or device equipped with a storage medium is provided. The software program code for implementing the functions of any of the above embodiments is stored in the storage medium, so that the computer (or CPU or MPU) of the system or device reads out and executes the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can implement the functions of any of the above embodiments, and thus the program code and the storage medium storing the program code form a part of the present disclosure.
Embodiments of the storage medium for providing the program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tapes, nonvolatile memory cards and ROMs. Alternatively, the program code can be downloaded from a server computer over a communication network.
In addition, it should be clear that an operating system or the like running on the computer can complete some or all of the actual operations by executing the program code read out by the computer, or based on the instructions of the program code, so as to implement the functions of any of the above embodiments.
In addition, it can be understood that the program code read from the storage medium may be written into a memory provided on an expansion board inserted into the computer, or into a memory provided in an expansion unit connected to the computer, and then the CPU or the like mounted on the expansion board or the expansion unit executes some or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above embodiments.
Finally, it should be explained that the above embodiments are only used to illustrate the technical solutions of the present disclosure, rather than limit the technical solutions. Although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solutions described in the foregoing embodiments can be still modified, or some or all technical features thereof are equivalently replaced. These modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of various embodiments of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202211187080.7 | Sep 2022 | CN | national |