This application claims foreign priority of Chinese Patent Application No. 202310295715.3, filed on Mar. 24, 2023, in the China National Intellectual Property Administration, the disclosure of which is hereby incorporated by reference.
The disclosure relates to the technical field of artificial intelligence and natural language processing, and in particular to a multi-turn human-machine conversation method and apparatus based on a time-sequence feature screening encoding module.
With the continuous development of artificial intelligence technology, the mode of human-machine interaction is gradually shifting from a graphical user interface to a conversational user interface, and conversing with a machine has long been a pursued goal in the field of artificial intelligence. At present, human-machine conversation technology may be divided into single-turn human-machine conversation and multi-turn human-machine conversation according to the number of conversation turns. Multi-turn human-machine conversation has more practical applications, such as intelligent customer service, mobile assistants and search engines, because it is closer to real human conversation scenarios. As an important mode of human-machine interaction, multi-turn human-machine conversation technology has great research significance and application value, but it is also more challenging.
Specifically, the challenges currently faced by multi-turn human-machine conversation technology mainly include the following two points. First, not all of the information contained in each utterance of a historical conversation sequence is useful for final response selection; the most important difference between multi-turn and single-turn human-machine conversation is that a single-turn conversation has a unique subject, whereas a multi-turn conversation may involve more than one subject, so how to identify and screen the semantic information of each utterance in the historical conversation is the primary challenge faced by multi-turn human-machine conversation technology. Second, for a multi-turn human-machine conversation system, the order among utterances is important feature information that cannot be ignored; in the course of a conversation, two or more identical sentences may carry different semantics and intentions due to different orders. Therefore, if the time-sequence features in the historical conversation can be effectively extracted, the performance of a multi-turn human-machine conversation method can be improved. So far, however, existing methods have not substantially solved these problems, and multi-turn human-machine conversation remains a very challenging task.
Aiming at the challenges faced by multi-turn human-machine conversation, the disclosure provides a multi-turn human-machine conversation method and apparatus based on a time-sequence feature screening encoding module, which can screen the information of each utterance in a historical conversation so as to obtain only the semantic information relevant to candidate responses, and can retain and extract the time-sequence features in the historical conversation, thus improving the prediction accuracy of a multi-turn human-machine conversation system.
The technical problem to be solved by the disclosure is to provide a multi-turn human-machine conversation method and apparatus based on a time-sequence feature screening encoding module, which can screen the information of each utterance in a historical conversation so as to obtain only the semantic information relevant to candidate responses, and can retain and extract the time-sequence features in the historical conversation, thus improving the prediction accuracy of a multi-turn human-machine conversation system.
The purpose of the disclosure is achieved by the following technical means.
A multi-turn human-machine conversation method based on a time-sequence feature screening encoding module includes the following steps:
As a further limitation to the present technical scheme, the constructing a multi-turn human-machine conversation model in S2 includes constructing an input module, constructing a pre-training model embedding module, constructing a time-sequence feature screening encoding module, and constructing a label prediction module.
As a further limitation to the present technical scheme, the input module is configured to, for each piece of data in the data set, record all sentences in the historical conversation as h1, h2, . . . , hn respectively according to the sequence of the conversation; select a response from a plurality of responses as a current response, and formalize the response as r; determine a label of the data according to whether the response is a positive response, that is, if the response is a positive response, record the label as 1, otherwise record the label as 0; and h1, h2, . . . , hn, r and the label together form a piece of input data.
As a further limitation to the present technical scheme, the pre-training model embedding module is configured to perform embedding processing on the input data constructed by the input module by using the pre-training language model BERT, to obtain the embedding representation of each utterance in the historical conversation and the candidate response embedding representation, recorded as $\vec{E}_1^h, \vec{E}_2^h, \ldots, \vec{E}_n^h$ and $\vec{E}_r$; and for the specific implementation, see the following formula:
$\vec{E}_1^h = \mathrm{BERT}(h_1),\ \vec{E}_2^h = \mathrm{BERT}(h_2),\ \ldots,\ \vec{E}_n^h = \mathrm{BERT}(h_n);\quad \vec{E}_r = \mathrm{BERT}(r)$;
where h1, h2, . . . hn represent the first utterance, the second utterance, . . . , the nth utterance in the historical conversation, and r represents the candidate response.
As a further limitation to the present technical scheme, the time-sequence feature screening encoding module is configured to receive the embedding representation of each utterance in the historical conversation and the candidate response embedding representation output by the pre-training model embedding module, perform an encoding operation on them respectively by using an encoder, and then complete the semantic information screening and time-sequence feature extraction process through an attention mechanism and a fusion operation, so as to obtain the semantic feature representation of the conversation.
Specifically, the implementation process of the module is as follows:
where $\vec{Z}_n^h$ represents the encoding alignment representation n; and $\vec{F}_r$ represents the candidate response encoding representation.
A multi-turn human-machine conversation apparatus applying the above method, including:
Compared with the existing technology, the disclosure has the following beneficial effects:
(1) By utilizing the pre-training model embedding module, deep semantic embedding features in the historical conversation and the candidate response may be captured, thus obtaining richer and more accurate embedding representations.
(2) By utilizing the time-sequence feature screening encoding module, information screening may be performed on each utterance in the historical conversation so as to obtain only the semantic information relevant to the candidate response, thus obtaining a more complete and accurate semantic feature representation.
(3) By utilizing the time-sequence feature screening encoding module, the time-sequence features in the historical conversation may be retained and extracted, thus improving the prediction accuracy of the multi-turn human-machine conversation system.
(4) According to the method and apparatus provided by the disclosure, in conjunction with the time-sequence feature screening encoding module, the prediction accuracy of the multi-turn human-machine conversation model may be effectively improved.
The technical schemes in the embodiments of the disclosure are clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the disclosure. Apparently, the embodiments described are only a part rather than all of the embodiments of the disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the disclosure without creative efforts shall fall within the protection scope of the disclosure.
As shown in
At S1, a multi-turn human-machine conversation data set is acquired: a publicly available multi-turn human-machine conversation data set is downloaded from the network, or a multi-turn human-machine conversation data set is constructed automatically;
At S2, a multi-turn human-machine conversation model is constructed: a multi-turn human-machine conversation model is constructed by using a time-sequence feature screening encoding module.
At S3, the multi-turn human-machine conversation model is trained: the multi-turn human-machine conversation model constructed in S2 is trained using the multi-turn human-machine conversation data set obtained in S1.
At S1, a multi-turn human-machine conversation data set is acquired:
For example: there are a number of publicly available multi-turn human-machine conversation data sets on the network, such as the Ubuntu Dialogue Corpus. The data format in the data set is as follows:
In the training set and the validation set, there is one positive response (Positive (label: 1)) and one negative response (negative (label: 0)) for the same historical conversation sequence; in the test set, there is one positive response (Positive (label: 1)) and nine negative responses (negative (label: 0)) for the same historical conversation sequence.
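For illustration only, one such test-set record might be organized as follows in Python; the field names and placeholder texts are hypothetical and do not reproduce actual Ubuntu Dialogue Corpus content.
# Hypothetical illustration of one test-set record: a shared historical
# conversation sequence paired with one positive and nine negative responses.
record = {
    "history": ["utterance 1", "utterance 2", "utterance 3"],  # h1, h2, ... hn
    "candidates": (
        [{"response": "positive response text", "label": 1}]        # Positive (label: 1)
        + [{"response": "negative response text " + str(k), "label": 0}  # negative (label: 0)
           for k in range(9)]
    ),
}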
At S2, a multi-turn human-machine conversation model is constructed.
The process of constructing the multi-turn human-machine conversation model is shown in
At S201, an input module is constructed.
For each piece of data in the data set, all sentences in the historical conversation are recorded as h1, h2, . . . , hn respectively according to the sequence of the conversation; a response is selected from a plurality of responses as a current response and formalized as r; a label of the data is determined according to whether the response is a positive response, that is, if the response is a positive response, the label is recorded as 1, otherwise the label is recorded as 0; and a piece of input data is formed by h1, h2, . . . , hn, r and the label together.
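A minimal sketch, using hypothetical utterances rather than actual corpus text, of how one such piece of input data could be assembled in Python is:
# Hypothetical assembly of one piece of input data: the historical utterances,
# one selected candidate response, and its label.
history = ["utterance 1", "utterance 2", "utterance 3"]  # h1, h2, ... hn
response = "candidate response text"                     # r
label = 1                                                # 1 for a positive response, 0 otherwise
input_data = (history, response, label)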
For example, data shown in S1 is taken as an example to form a piece of input data. The result is as follows:
At S202, a pre-training model embedding module is constructed.
The pre-training model embedding module is configured to perform embedding processing on the input data constructed in S201 by using the pre-training language model BERT, to obtain the embedding representation of each utterance in the historical conversation and the candidate response embedding representation, recorded as $\vec{E}_1^h, \vec{E}_2^h, \ldots, \vec{E}_n^h$ and $\vec{E}_r$; and for the specific implementation, see the following formula:
$\vec{E}_1^h = \mathrm{BERT}(h_1),\ \vec{E}_2^h = \mathrm{BERT}(h_2),\ \ldots,\ \vec{E}_n^h = \mathrm{BERT}(h_n);\quad \vec{E}_r = \mathrm{BERT}(r)$;
For example, when the disclosure is implemented on the Ubuntu Dialogue Corpus data set, the operation of the module is completed by using the pre-training language model BERT, and all settings are according to the default settings of BERT in pytorch. In pytorch, the code described above is implemented as follows:
# Input data is encoded by using the embedding layer of BERT.
h_encoder_list = []
for i in h_embed_list:
    # each utterance is embedded in the same way as r_embed below, and its
    # pooled representation is collected
    h_encoder_list.append(BERT(i)[1])
r_embed = BERT(r)[1]
where h_embed_list represents each utterance in the historical conversation, r is the candidate response, h_encoder_list represents the embedding representation of each utterance in the historical conversation, and r_embed represents the embedding feature representation of the candidate response.
At S203, a time-sequence feature screening encoding module is constructed.
The time-sequence feature screening encoding module is shown in
At S204, a label prediction module is constructed.
The semantic feature representation of the conversation obtained in S203 is taken as the input of the module and is processed by a dense network with an output dimension of 1 and a Sigmoid activation function, to obtain the probability that the current response is the positive response.
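As an illustrative sketch only, such a label prediction module could be written in pytorch as follows; the class name and the assumed hidden size of 768 are not taken from the patent's own code.
import torch
import torch.nn as nn

class LabelPrediction(nn.Module):
    """Dense network with an output dimension of 1 and a Sigmoid activation."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, 1)

    def forward(self, conversation_feature):
        # Map the semantic feature representation of the conversation to the
        # probability that the current candidate response is the positive response.
        return torch.sigmoid(self.dense(conversation_feature))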
When the model is not yet trained, S3 needs to be further performed for training to optimize the parameters of the model; after the model is trained, S204 may be performed to predict which of the candidate responses is the positive response.
At S3, a multi-turn human-machine conversation model is trained.
The multi-turn human-machine conversation model constructed in S2 is trained on the multi-turn human-machine conversation data set obtained in S1. The process is shown in
At S301, a loss function is constructed.
In the disclosure, cross entropy is taken as the loss function; the formula is as follows:
$L = -\left[\, y_{true}\log(y_{pred}) + (1 - y_{true})\log(1 - y_{pred}) \,\right]$
where $y_{true}$ is the real label, and $y_{pred}$ is the probability, output by the model, that the response is the positive response.
For example, in pytorch, the code described above is implemented as follows:
# The error between the predicted value and the label is calculated by using a cross entropy loss function.
from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
At S302, an optimization function is constructed.
Various optimization functions were tested, and the AdamW optimization function was finally selected as the optimization function; except for its learning rate, which is set to 2e-5, the other hyperparameters of AdamW are set to their default values in pytorch.
For example, in pytorch, the code described above is implemented as follows:
# Model parameters are optimized by the AdamW optimizer.
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
where optimizer_grouped_parameters are the parameters to be optimized, defaulting to all parameters of the multi-turn human-machine conversation model.
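To illustrate how the loss function of S301 and the optimizer of S302 work together, a minimal, hypothetical training-step sketch is given below; model and train_dataloader are assumed to exist and are not part of the code above.
# Hypothetical training step combining the cross entropy loss (S301)
# and the AdamW optimizer (S302); loss_fct and optimizer are defined above.
for history, response, labels in train_dataloader:
    optimizer.zero_grad()                # clear gradients from the previous step
    logits = model(history, response)    # forward pass of the conversation model
    loss = loss_fct(logits.view(-1, model.num_labels), labels.view(-1))
    loss.backward()                      # back-propagate the prediction error
    optimizer.step()                     # update the model parameters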
When the model is not yet trained, S3 needs to be further performed for training to optimize parameters of the model; after the model is trained, S204 may be performed to predict which of the candidate responses is the positive response.
An apparatus mainly includes three units, namely, a multi-turn human-machine conversation data set acquisition unit, a multi-turn human-machine conversation model construction unit and a multi-turn human-machine conversation model training unit. The process is shown in
The multi-turn human-machine conversation data set acquisition unit is configured to download a publicly available multi-turn human-machine conversation data set from the network or automatically construct a multi-turn human-machine conversation data set.
The multi-turn human-machine conversation model construction unit is configured to construct a pre-training model embedding module, construct a time-sequence feature screening encoding module, and construct a label prediction module, so as to construct a multi-turn human-machine conversation model.
The multi-turn human-machine conversation model training unit is configured to construct a loss function and an optimization function, thus completing prediction of a candidate response.
Furthermore, the multi-turn human-machine conversation model construction unit further includes:
The multi-turn human-machine conversation model training unit further includes:
The disclosure provides a storage medium storing a plurality of instructions where the instructions are loaded by a processor to execute the steps of the above multi-turn human-machine conversation method.
The disclosure provides an electronic device, the electronic device including:
The overall model framework structure of the disclosure is shown in
The time-sequence feature screening encoding module is shown in
Specifically, the implementation process of the module is as follows:
first, an encoding operation is performed on the candidate response embedding representation by using the encoder, to obtain the candidate response encoding representation, recorded as $\vec{F}_r$; and for the specific implementation, see the following formula:
$\vec{F}_r = \mathrm{Encoder}(\vec{E}_r)$;
For example, when the disclosure is implemented on the Ubuntu Dialogue Corpus data set, a Transformer Encoder is selected as the encoding structure Encoder, with the encoding dimension set to 768 and the number of layers set to 2; a Dot-Product Attention calculation method is selected as the attention mechanism. Taking the calculation of the encoding alignment representation 1 as an example, the calculation process is as follows:
$F(\vec{F}_r, \vec{F}_1^h) = \vec{F}_r \otimes \vec{F}_1^h$;
The formula represents the interaction calculation between the candidate response encoding representation and the encoding representation 1 through a dot product multiplication operation, where $\vec{F}_r$ represents the candidate response encoding representation, $\vec{F}_1^h$ represents the encoding representation 1, that is, the encoded representation of the first utterance in the historical conversation, and $\otimes$ represents the dot product multiplication operation.
The formula represents completing the feature screening of the encoding representation 1 by using the obtained attention weight, so as to obtain the encoding alignment representation 1, where $l$ represents the number of elements in $\vec{F}_1^h$ and $\alpha$.
In pytorch, the code described above is implemented as follows:
import torch.nn as nn

# The calculation process of attention is defined.
# The encoding structure is defined.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
# The encoding alignment representation is fused with the candidate response encoding representation.
final_I = z_r_final + f_r
where history_embed_list represents a list of the embedding representations of all utterances in the historical conversation; response_embed represents the candidate response embedding representation; z_r_final represents the encoding alignment representation n; final_I represents the semantic feature representation of the conversation; d_model represents the size of the word vector required by the encoder, which is set to 512 here; nhead represents the number of heads in the multi-head attention model, which is set to 8 here; and layers represents the number of layers of the encoding structure, which is set to 2 here.
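The following is a minimal illustrative sketch of the screening-and-fusion computation described in this section: a dot-product interaction between the encoded candidate response and the elements of each encoded utterance, a softmax-weighted sum giving the encoding alignment representation, and an additive fusion. The input shapes, the mean pooling of the response encoding, the use of the last alignment representation as z_r_final, and the function name screen_utterance are assumptions rather than the patent's own implementation.
import torch
import torch.nn as nn

def screen_utterance(f_r, f_h):
    # Dot-product interaction between the candidate response encoding f_r
    # (a vector of size d_model) and each of the l element vectors of the
    # encoded utterance f_h (a matrix of shape (l, d_model)).
    scores = f_h @ f_r                    # shape (l,)
    # Attention weights over the l elements of the utterance encoding.
    alpha = torch.softmax(scores, dim=0)
    # Feature screening: the weighted sum of the elements of f_h gives the
    # encoding alignment representation of this utterance.
    return (alpha.unsqueeze(1) * f_h).sum(dim=0)

d_model = 512
# The encoding structure mirrors the settings stated above (nhead=8, 2 layers).
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Hypothetical inputs: embeddings of n utterances and of the candidate response,
# each a batch of one sequence of 20 token vectors of size d_model.
history_embed_list = [torch.randn(1, 20, d_model) for _ in range(3)]
response_embed = torch.randn(1, 20, d_model)

# Encoding operation on each utterance embedding and on the response embedding;
# the response encoding is mean-pooled into a single vector (an assumption).
f_h_list = [encoder(e).squeeze(0) for e in history_embed_list]   # each (20, d_model)
f_r = encoder(response_embed).mean(dim=1).squeeze(0)             # (d_model,)

# Screening each utterance in conversational order; the last alignment
# representation is used here as z_r_final (a simplifying assumption).
z_list = [screen_utterance(f_r, f_h) for f_h in f_h_list]
z_r_final = z_list[-1]

# Fusion with the candidate response encoding, as in final_I = z_r_final + f_r.
final_I = z_r_final + f_r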
Although the embodiments of the disclosure have been shown and described, those of ordinary skill in the art can understand that various changes, modifications, replacements, and variations can be made to these embodiments without departing from the principle and spirit of the disclosure, and the scope of the disclosure is defined by the appended claims and equivalents thereof.
Number | Date | Country | Kind |
---|---|---|---|
202310295715.3 | Mar 2023 | CN | national |