The present disclosure relates to computing systems and, in particular, methods and systems for processing text data that is exchanged between computing devices. Certain examples relate to providing feedback for a conversational agent, where the conversational agent uses a predictive model to compute responses.
Many users of computing devices prefer to interact with computing systems using natural language, e.g. words and sentences in the user's native language, as opposed to more restrictive user interfaces (such as forms) or using specific programming or query languages. For example, users may wish to ascertain a status of a complex technical system, such as a transport control system or a data center, or be provided with assistance in operating technical devices, such as embedded devices in the home or industry. Natural language interfaces also provide a much larger range of potential queries. For example, users may find that structured queries or forms do not provide options that relate to their particular query. This becomes more of an issue as computing systems increase in complexity; it may not be possible to enumerate (or predict) all the possible user queries in advance of operation.
To provide a natural language interface to users, conversational agents have been proposed. These include agents sometimes known colloquially as “chatbots”. In the past, these systems used hand-crafted rules to parse user messages and provide a response. For example, a user query such as “Where is the power button on device X?” may be parsed by looking for string matches for the set of terms “where”, “power button” and “device X” in a look-up table, and replying with a retrieved answer from the table, e.g. “On the base”. However, these systems are somewhat limited; for example, the user message “I am looking for the on switch for my X” would not return a match and the conversational agent would fail to retrieve an answer.
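By way of illustration only, the limitation of hand-crafted rules may be sketched as follows. The look-up table `RULES` and the function `rule_based_reply` are hypothetical names introduced for this sketch; only the example messages and the answer "On the base" come from the description above.

```python
# Hypothetical look-up table: every term of a key must occur in the message.
RULES = {
    ("where", "power button", "device x"): "On the base",
}


def rule_based_reply(message):
    """Return the stored answer whose terms all occur in the message, else None."""
    text = message.lower()
    for terms, answer in RULES.items():
        if all(term in text for term in terms):
            return answer
    return None


matched = rule_based_reply("Where is the power button on device X?")
unmatched = rule_based_reply("I am looking for the on switch for my X")
print(matched, unmatched)
```

The first message contains all three terms and retrieves the stored answer. The second message expresses the same request in different words, so no rule matches and no answer is retrieved.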
To improve conversational modelling, a neural conversation model has been proposed to provide a conversational agent, e.g. as in the following document. VINYALS, Oriol and LE, Quoc. A neural conversational model. arXiv preprint arXiv:1506.05869. Submitted 19 Jun. 2015. In this neural conversation model, a sequence-to-sequence framework is used to generate short machine replies to user-submitted text. The model uses a data-driven approach, rather than a rule-based approach. While the neural conversation model generates replies that are rated more useful than a comparative rule-based system, the authors admit that their model still has limitations. For example, the conversational agent only gives short and simple answers, which may not always address a user's query. Additionally, the authors found that replies were often inconsistent, e.g. if semantically similar user queries with differing text data were submitted, the conversational agent would provide inconsistent (i.e. differing) answers. Neural conversation models such as that in the above paper have been found to be difficult to implement as practical user interfaces in the real world, e.g. due to the aforementioned issues.
Accordingly, there is a desire to improve user-computing interfaces to enable users to submit natural language queries and to provide these interfaces in a practical and implementable manner. By improving user-computing interfaces, it may be possible to efficiently provide responses to a large number of user queries, e.g. which are received concurrently. In particular, there is a desire to build computer systems to implement these user-computing interfaces that allow for improvement and feedback on their operation.
Aspects of the present disclosure are set out in the appended independent claims. Certain variations of the present disclosure are set out in the appended dependent claims.
Some embodiments provide a computer-implemented method for providing feedback to a conversational agent. The method includes loading text data representative of one or more messages received from a user. The method includes converting the text data to a numeric array, each element in the numeric array being associated with one of a predefined set of tokens, each token comprising a sequence of character encodings. The method includes applying a trained predictive model to the numeric array to generate an array of probabilities, a probability in the array of probabilities being associated with a response template for use in responding to the one or more messages. The method includes generating, for display to an operator of the conversational agent, a list of response templates ordered based on the array of probabilities. The method includes receiving, from the operator of the conversational agent, data indicating an incorrect response template that is to be disassociated with the one or more messages. The method includes computing a contribution of elements in the numeric array to an output of the trained predictive model for the incorrect response template. The method includes generating, for display to the operator of the conversational agent, at least a subset of tokens from the predefined set of tokens based on the computed contribution. The method includes receiving, from the operator of the conversational agent, data indicating one or more of the displayed tokens that are to be disassociated with the incorrect response template. The method includes adjusting parameters of the trained predictive model to reduce the contribution of the indicated tokens for the incorrect response template.
Some embodiments provide a system for adjusting a dialogue system. The system includes a conversational agent comprising at least a processor and a memory to receive one or more user messages from a client device over a network and send agent messages in response to the one or more user messages. The system includes a template database comprising response templates for use by the conversational agent to generate agent messages. The system includes a trained predictive model comprising data indicative of stored values for a plurality of model parameters, the trained predictive model being configured to receive a numeric array and output an array of probabilities, each element in the numeric array being associated with one of a predefined set of tokens, each token comprising a sequence of character encodings, a probability in the array of probabilities being associated with a response template from the template database. The system includes a feedback engine comprising at least a processor and a memory configured to apply the trained predictive model to a numeric array generated based on text data received from a client device to generate an array of probabilities associated with a plurality of response templates in the template database. The processor and memory are configured to receive an indication of an incorrect response template in the plurality of response templates that is to be disassociated with the text data. The processor and memory are configured to compute a contribution of elements in the numeric array to an output of the trained predictive model for the incorrect response template. The processor and memory are configured to receive an indication of one or more tokens whose computed contribution values are to be reduced with reference to the incorrect response template. The processor and memory are configured to adjust the data indicative of the stored values of the trained predictive model to reduce the contribution of the indicated tokens.
Some embodiments provide a non-transitory, computer-readable medium comprising computer program instructions. The computer program instructions, when executed by a processor, cause the processor to load text data representative of one or more messages received from a user. The computer program instructions, when executed by a processor, cause the processor to convert the text data to a numeric array, each element in the numeric array being associated with one of a predefined set of tokens, each token comprising a sequence of character encodings. The computer program instructions, when executed by a processor, cause the processor to apply a trained predictive model to the numeric array to generate an array of probabilities, a probability in the array of probabilities being associated with a response template for use in responding to the one or more messages. The computer program instructions, when executed by a processor, cause the processor to generate, for display to an operator of the conversational agent, a list of response templates ordered based on the array of probabilities. The computer program instructions, when executed by a processor, cause the processor to receive, from the operator of the conversational agent, data indicating an incorrect response template that is to be disassociated with the one or more messages. The computer program instructions, when executed by a processor, cause the processor to compute a contribution of elements in the numeric array to an output of the trained predictive model for the incorrect response template. The computer program instructions, when executed by a processor, cause the processor to generate, for display to the operator of the conversational agent, at least a subset of tokens from the predefined set of tokens based on the computed contribution. 
The computer program instructions, when executed by a processor, cause the processor to receive, from the operator of the conversational agent, data indicating one or more of the displayed tokens that are to be disassociated with the incorrect response template. The computer program instructions, when executed by a processor, cause the processor to adjust parameters of the trained predictive model to reduce the contribution of the indicated tokens for the incorrect response template.
Further features and advantages of the disclosure will become apparent from the following description of preferred embodiments of the disclosure, given by way of example only, which is made with reference to the accompanying drawings.
Certain examples described herein provide methods and systems for providing feedback to a conversational agent. These examples address some of the issues encountered when practically implementing a conversational agent. For example, they enable an operator to inspect the internal operation of the conversational agent and provide feedback to update the agent. In turn, they enable a natural language interface to be provided efficiently and its performance to improve over time.
In the description below, the operation and configuration of an example conversational agent will be described. Certain examples described herein may allow for feedback to be provided to a conversational agent of a form similar to that described.
The user computing devices 110 may comprise a variety of computing devices including, but not limited to, mobile devices (e.g. smartphones, tablets), embedded devices (e.g. so-called “smart” appliances, or microphone and speaker devices for use with intelligent personal assistants), desktop computers and laptops, and/or server devices. These computing devices comprise at least a processor and memory, wherein computer program code may be stored in the memory and implemented using the at least one processor to provide described functionality. The user computing devices 110 may comprise a network interface to couple to the one or more networks 130. This network interface may be a wired and/or wireless interface.
The conversational agent 120 may be implemented upon a server computing device comprising at least one processor and memory. In examples described herein, the functionality of the conversational agent 120 may be implemented, at least in part, by at least one processor and memory, wherein computer program code is stored in the memory and executed upon the at least one processor. Certain aspects of the conversational agent 120 may also be implemented in programmable integrated circuits. The server computing device may also comprise a wired and/or wireless network interface to couple to the one or more networks 130.
In
Messages may be exchanged over a plurality of differing protocols and mechanisms. Text dialogues may have a single mode (e.g. be based around a single protocol or mechanism) or be multi-modal (e.g. where messages are collated from multiple differing message exchange mechanisms). Example protocols and mechanisms include, amongst others, email, Short-Message Service (SMS) messages, instant messaging systems, web-conferencing, Session Initiation Protocol (SIP) services, Text over Internet Protocol (ToIP) systems, and/or web-based applications (e.g. Hyper Text Markup Language—HTML—data transmission via Hypertext Transfer Protocol—HTTP). Certain messaging systems may be based in the application layer and operate over, for example, transport control protocol (TCP) over Internet Protocol (IP). Messages may be stored and/or managed as part of a Customer Relationship Management (CRM) platform. Text dialogues are typically one-to-one but in certain examples may comprise messages originating from multiple conversational agents and/or users. Text dialogues may be live, e.g. comprise messages exchanged in real-time or near real-time, or may exist over a period of time (e.g. days, weeks or months). Users may be identified via user identifiers such as email addresses, usernames for login credentials, phone numbers and/or Internet Protocol address. A start of a text dialogue may be indicated by a first message exchanged over a given protocol or mechanism, a user or agent initiating a messaging session, and/or a protocol request to start a conversation. An end of a text dialogue may be marked by a period of inactivity, be closed by a user or agent action and/or be set by the closing of a message exchange session, amongst others.
Although a single conversational agent 120 is shown in
Returning to the example of
In the example of
In certain examples, each text string 215 may be pre-processed. One method of pre-processing is text tokenization. Text tokenization splits a continuous sequence of characters into one or more discrete sets of characters, e.g. where each character is represented by a character encoding. The discrete sets of characters may correspond to words or word components in a language. Each discrete set may be referred to as a “term” or “token”. A token may be deemed a “word” in certain cases if it matches an entry in a predefined dictionary. In certain cases, tokens need not always match agreed words in a language, for example “New York” may be considered one token, as may “gr8” or “don't”. One text tokenization method comprises splitting a text string at the location of a white space character, such as “ ”.
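As a minimal sketch of the white space method described above, and not of any particular tokenization implementation, a text string may be split using Python's built-in `str.split`. The function name `tokenize` is an assumption made for this sketch.

```python
def tokenize(text):
    """Split a text string into tokens at runs of white space characters."""
    return text.split()


print(tokenize("I am looking for the on switch for my X"))
print(tokenize("gr8 don't New York"))
```

In this sketch "gr8" and "don't" each remain a single token, whereas "New York" is split into two tokens. Treating "New York" as one token would therefore need rules beyond a plain white space split.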
There are several possible text tokenization implementations, some of which may produce an output that differs from the example of
In certain examples, text tokens may be converted into a numeric form. For example, a dictionary may be generated that comprises a list or array of all discrete sets of characters that are present following text tokenization of one or more messages, e.g. as received by the conversational agent 120 or retrieved from the dialogue database 150. In this case, within the data or for a copy of the data, each unique set of characters, i.e. each token, may be replaced with a numeric value representing an index in the dictionary. In
In the system 100 shown in
Given text data extracted from received messages as input, this text data may be pre-processed and supplied to a trained version of the predictive model to output (i.e. predict) a set of probability values for a set of response templates in template database 160. This set of response templates may be the set 170 of all response templates or a subset of this set (e.g. based on hierarchical selection methods). A conversational agent 120 may be configured to select the response template associated with the largest probability value output by the trained predictive model and use this response template to respond to the received messages. The probability values may be seen as confidence levels for the selection of a particular response template. Hierarchical groupings are also possible with tiers of response template groups, e.g. a first prediction may generate probabilities for one of eight elements in an array representing eight initial groups, where the element with the largest value (typically selected using an argmax function) may indicate a first predicted group, then a second prediction of a group or response template within the first predicted group may be made.
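The selection described above may be sketched as follows, purely by way of example. The template names, the probability values, the three-group array (used here in place of eight groups for brevity) and the function `stub_second_stage`, which stands in for a second prediction restricted to one group, are all assumptions made for this sketch.

```python
TEMPLATES = ["How to reset device", "How to turn on screen", "How to use device"]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def select_template(probabilities, templates):
    """Pick the template with the largest probability and report that probability."""
    best = argmax(probabilities)
    return templates[best], probabilities[best]


def select_hierarchical(group_probabilities, predict_within_group):
    """Two-tier selection: choose a group first, then a template within that group."""
    group = argmax(group_probabilities)
    probabilities, templates = predict_within_group(group)
    return group, select_template(probabilities, templates)


def stub_second_stage(group):
    # Hypothetical stand-in for a second prediction restricted to one group.
    return [0.2, 0.8], [f"group {group} template A", f"group {group} template B"]


print(select_template([0.1, 0.6, 0.3], TEMPLATES))
print(select_hierarchical([0.05, 0.7, 0.25], stub_second_stage))
```

Passing the second-stage predictor as a function keeps the sketch independent of how the within-group probabilities are produced.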
In the examples discussed herein, a “predictive model” may comprise a selection and specific coupling of interconnected functions, where each function has a set of parameter values. A function may define a geometric operation that is applied by way of matrix multiplication, e.g. on a graphics processing unit (GPU) or central processing unit (CPU), and/or vector addition. A “predictive model” may have a variety of different architectures depending on the implementation. Functions may be defined in libraries of computer program code, wherein, in use for training and prediction, the computer program code is executed by at least one processor of a computing device. Predictive models may be based, amongst others, on feed forward neural networks, convolutional neural networks or recurrent neural networks. Functional units such as embedding layers, softmax layers and non-linear functions may also be used. Predictive models may be based on differentiable computing approaches that use back-propagation to train the model.
When using predictive models, there is a problem that the operation of the model is often opaque to an operator. For example, many predictive models are implemented as “black boxes” that are configured through training to turn input data into output data. Practical predictive models used in production environments may have millions, hundreds of millions or billions of parameters. Training may comprise using millions of training examples. This is especially the case for modern multi-layer neural networks. With these models, there is no mechanism to present the working of the model to the operator, e.g. to indicate “why” one class label has a higher probability than another class label. There is also no way for the operator to modify the working of the model if the predicted probability values differ from test data values.
Certain examples described herein allow feedback to be exchanged between a conversational agent and an operator (so-called “bi-directional” feedback). Certain examples allow an incorrect response template to be indicated by the operator and the conversational agent to compute a contribution for tokens representative of how influential the tokens were in the prediction of the incorrect response template by an applied predictive model. The computed contribution is used to provide further feedback to the operator comprising potential tokens to disassociate with the incorrect response template. The operator then selects the tokens they wish to disassociate and the parameters of the predictive model are adjusted based on this feedback. By repeating this process, an accuracy of a conversational agent, in the form of the response templates that are selectable for a text dialogue, may be improved.
An example 400 of a response template 410 is shown in
If the predictive model 325 comprises a recurrent neural network then the numeric array 335 may comprise an array similar to array 235 in
The predictive model 325 is configured to output an array of probabilities 340. Each element in the array comprises a probability value (e.g. a real value between 0 and 1) that is associated with a particular response template 315 from the template database 310. As a simple example, there may be three response templates: [“How to reset device”, “How to turn on screen”, “How to use device”], and so in this case the predictive model 325 would output an array of three elements, e.g. [0.1, 0.6, 0.3] representing a probability of each respective template being appropriate (e.g. here “How to turn on screen” has a confidence of 60% and may be selected as the most likely response template to use). As with the numeric array 335, a dictionary or hash table may be provided to map between an index of an element in the array and a response template (e.g. a path of response template data or a database record identifier). In use, e.g. when implementing the dialogue system 100 of
The predictive model 325 is trained on a set of training data to determine a mapping between the numeric array 335 and the array of probabilities 340. The result of training is a trained predictive model comprising data indicative of stored values for a plurality of model parameters. These model parameters are used to implement the geometric transformations that convert numeric values in the numeric array 335 to the probability values in the array of probabilities 340. As discussed above, the trained predictive model 325 may comprise computer program code to implement the model on a processor of a computing device and the data indicative of stored values for a plurality of model parameters. In practice an untrained predictive model may be constructed by assembling computer program code, e.g. from machine learning libraries in programming languages such as Python, Java, Lua or C++. The predictive model may be applied to training data by executing this computer program code on one or more processors, such as groups of CPUs or GPUs. Following training, a trained predictive model may comprise computer program code as executed by a processor and a set of stored parameter values that parameterize (i.e. result in) a specific model configuration.
In particular, the feedback engine 345 is first configured to apply the trained predictive model 325 to the numeric array 335 to generate the array of probabilities 340. For example, the feedback engine 345 may operate on a batch of historical text dialogues from the dialogue database 150 of
In the example of
The data 350 is displayed to the operator, e.g. via a monitor or other screen. A front-end client process may be arranged to receive the data and present it for display (e.g. operating on a client device of the operator). An example of a user interface 510 to display this data is shown in
The feedback from the operator is received by the feedback engine 345 as data 355. Data 355 may be processed on a client device used by the operator and sent to the feedback engine 345 across a network. Data 355 in
Following receipt of an indication of an incorrect response template, feedback engine 345 is configured to compute a contribution 360 of elements in the numeric array 335 to an output of the trained predictive model 325 for the incorrect response template. This may comprise identifying parameters of the predictive model 325 that are associated with the incorrect response template. These may be weights that contribute to the output in the array of probabilities 340 that corresponds to the incorrect response template (e.g. weights that contribute to the value of the third element—RT3—as shown in
The set of tokens 365 is sent to the operator to allow the operator to select one or more tokens that are to be disassociated with the incorrect response template (e.g. the response template selected by the operator as indicated in data 355). A client device operated by the operator may receive the set of tokens 365 and display them for selection, e.g. in a similar manner to the set of response templates. In certain cases, the contributions may also be displayed. In certain cases, the set of tokens 365 may be ordered based on their contributions. An example user interface 520 for selecting a token is shown in
In
The example system 300 shown in
At block 610, text data representative of one or more messages received from a user is loaded. This text data may be loaded from a live conversation or dialogue database 150. At block 620, the text data is converted to a numeric array, e.g. as described in detail above. At block 630, a trained predictive model is applied to the numeric array to generate an array of probabilities, wherein a probability in the array of probabilities is associated with a response template for use in responding to the one or more messages. At block 640, a list of response templates ordered based on the array of probabilities is generated for display to an operator of the conversational agent. At block 650, data indicating an incorrect response template that is to be disassociated with the one or more messages is received from the operator. At block 660, a contribution of elements in the numeric array to an output of the trained predictive model is computed for the incorrect response template. At block 670, at least a subset of tokens from the predefined set of tokens is generated based on the computed contribution. These tokens are for display to the operator of the conversational agent. At block 680, data indicating one or more of the displayed tokens that are to be disassociated with the incorrect response template is received from the operator of the conversational agent. At block 690, parameters of the trained predictive model are adjusted to reduce the contribution of the indicated tokens for the incorrect response template.
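The blocks above may be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation. The token set `VOCAB`, the templates, the weight values, the token-count form of the numeric array, the small linear model with a softmax output, and the rule of scaling weights by `factor` at block 690 are all assumptions. The operator input at blocks 650 and 680 is hard-coded.

```python
import math

# Hypothetical token set, templates and weights, chosen only for illustration.
VOCAB = ["reset", "screen", "turn", "on", "device"]
TEMPLATES = ["How to reset device", "How to turn on screen", "How to use device"]
WEIGHTS = [  # one row per response template, one column per token
    [2.0, 0.0, 0.0, 0.0, 0.5],
    [0.0, 2.0, 1.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.5, 1.5],
]


def to_numeric_array(text):
    """Block 620: count how often each token of VOCAB occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(token) for token in VOCAB]


def predict(x, weights):
    """Block 630: linear score per template, normalised with a softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranked_templates(probabilities):
    """Block 640: response templates ordered by decreasing probability."""
    order = sorted(range(len(probabilities)), key=lambda i: -probabilities[i])
    return [(TEMPLATES[i], probabilities[i]) for i in order]


def token_contributions(x, weights, template_index):
    """Blocks 660 and 670: share of the template's score due to each present token."""
    parts = [w * v for w, v in zip(weights[template_index], x)]
    total = sum(parts)
    if total == 0:
        return []
    shares = [(VOCAB[i], part / total) for i, part in enumerate(parts) if x[i] > 0]
    return sorted(shares, key=lambda item: -item[1])


def reduce_contribution(weights, template_index, tokens, factor=0.0):
    """Block 690: scale the weights linking the tokens to the template (assumed rule)."""
    adjusted = [row[:] for row in weights]  # copy so the original weights are kept
    for token in tokens:
        adjusted[template_index][VOCAB.index(token)] *= factor
    return adjusted


x = to_numeric_array("please turn on my device")          # blocks 610 and 620
ranking = ranked_templates(predict(x, WEIGHTS))            # blocks 630 and 640
incorrect = TEMPLATES.index("How to turn on screen")       # block 650, operator input
shown = token_contributions(x, WEIGHTS, incorrect)         # blocks 660 and 670
new_weights = reduce_contribution(WEIGHTS, incorrect, ["turn", "on"])  # blocks 680, 690
new_ranking = ranked_templates(predict(x, new_weights))
print(ranking[0][0], "->", new_ranking[0][0])
```

Under these assumed values, the top-ranked template for the message moves from "How to turn on screen" to "How to use device" once the weights for "turn" and "on" are reduced for the incorrect template.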
In certain implementations of the system 300 and method 600, the trained predictive model comprises a multiclass linear model that is trained upon pairs of associated text data and response templates. In this case, it is assumed that each set of text data has one associated response template. Each response template may be associated with an index in an output array. The multiclass linear model takes the numeric array as input and is trained to determine a linear mapping to class logits, representing unnormalized log-probabilities associated with respective ones of a set of potential response templates. The multiclass linear model may be trained with “one-hot” encodings of response templates, e.g. in an array corresponding to the array of probabilities the entry associated with the response template assigned to the text data is set to “1” and all other entries are set to “0”. The unnormalized log-probabilities may be normalized using a softmax function. The output of the softmax function comprises an array of probabilities, such as array 340 in
In the above case of a multiclass linear model implementing the trained predictive model, a contribution of a given element in the numeric array, corresponding to a given token, may be determined by computing logit-contributions for the response template indicated as “incorrect”. In this case, weights of the trained predictive model that are associated with the incorrect response template are obtained. This may comprise extracting a row of weights from the aforementioned matrix of weight values that are associated with the incorrect response template. For a given element in the numeric array, a contribution of the corresponding token may be computed as a ratio of a contribution from a weight from the row of weights associated with the given element and a contribution of the row of weights applied to all the elements of the numeric array. Or in other words, a contribution of a particular token-index to predicting a particular response template is proportional to the value of a token-specific parameter and a score for the particular response template. In this case, salient or “trigger” tokens are those which increase a logit-weight of the predicted incorrect response template. In this case, adjusting parameters of the trained predictive model, e.g. at block 690 of
In certain implementations of the system 300 and method 600, the trained predictive model comprises a multi-layer (a so-called “deep”) neural network that is trained upon pairs of associated text data and response templates. In this case, computing a contribution of elements may comprise using values of back-propagated partial derivatives of a loss computed during training of the trained predictive model. A loss (also referred to as an error) of a predictive model may be computed by a loss function, which takes as input the array of probabilities output by the predictive model and a “ground truth” array, which may be a “one-hot” encoded array of equal size to the array of probabilities where the “correct” response template has a value of 1 and all other elements are set to 0 (representing a point-mass probability distribution). For example, if cross-entropy is used as a loss function for the predictive model, then a partial derivative of the loss with respect to the numeric array may be computed as both the model and the loss function are differentiable. The loss of the model may be computed by seeing how an indicated “ground truth” or proposed response template (which may be the “incorrect” response template) compares to the corresponding element in the output array of probabilities. The cross-entropy may be computed for the incorrect response template. The partial derivative is an array of the same dimension as the numeric array, and as such an element in this partial derivative corresponds to a particular element of the numeric array, i.e. a particular token in the predefined set of tokens. A large positive value in the partial derivative array indicates that increasing the value of the corresponding element in the numeric array increases the loss, i.e. decreases the predictive model's belief that the element is associated with the incorrect response template.
Correspondingly, a large negative value in the partial derivative array indicates that increasing the value of the corresponding element in the numeric array decreases the loss, i.e. increases the predictive model's belief that the element is associated with the incorrect response template. Hence, a logit-contribution for a multi-layer neural network may be computed as a ratio of a contribution from the partial derivative associated with a given element (and the incorrect response template) and a sum of the partial derivatives for all elements of the numeric array (assuming the incorrect response template). In other words, for tokens that occur in the numeric array, their contribution to predicting a particular response template is proportional to the negative partial derivative, with respect to the numeric array, of a loss of predicting that template.
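The gradient-based computation described above may be sketched as follows for a small network with one hidden layer. The network size, the tanh non-linearity, the parameter values in `PARAMS` and the three-token `VOCAB` are assumptions chosen only to keep the sketch short. The sketch returns the negative partial derivative for each token present in the numeric array and leaves out the normalisation into a ratio.

```python
import math


def forward(x, W1, b1, W2, b2):
    """Hidden layer with tanh, then a softmax over response templates."""
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(W2, b2)]
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return h, [e / total for e in exps]


def loss(x, target, params):
    """Cross-entropy of the output against a one-hot array for the target template."""
    _, p = forward(x, *params)
    return -math.log(p[target])


def input_gradient(x, target, params):
    """Back-propagate the loss to obtain its partial derivative per input element."""
    W1, b1, W2, b2 = params
    h, p = forward(x, W1, b1, W2, b2)
    dz = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    dh = [sum(W2[k][j] * dz[k] for k in range(len(dz))) for j in range(len(h))]
    da = [dh[j] * (1.0 - h[j] ** 2) for j in range(len(h))]  # derivative of tanh
    return [sum(W1[j][i] * da[j] for j in range(len(h))) for i in range(len(x))]


def token_scores(x, target, params, vocab):
    """Negative partial derivative for each present token, largest first."""
    grad = input_gradient(x, target, params)
    scores = [(vocab[i], -g) for i, g in enumerate(grad) if x[i] > 0]
    return sorted(scores, key=lambda item: -item[1])


# Hypothetical parameters: three tokens, two hidden units, two response templates.
VOCAB = ["turn", "on", "screen"]
PARAMS = (
    [[1.0, -0.5, 0.2], [0.3, 0.8, -0.4]], [0.0, 0.0],
    [[1.5, -1.0], [-0.7, 1.2]], [0.0, 0.0],
)
x = [1.0, 1.0, 0.0]
incorrect = 0  # index of the template indicated as incorrect
print(token_scores(x, incorrect, PARAMS, VOCAB))
```

A token with a large positive score in this sketch is one for which increasing the corresponding input element would lower the loss for the incorrect template, and which is therefore a candidate for display to the operator.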
When using a multi-layer neural network, or other trained model, the feedback engine 345 and the method 600 may further comprise a mechanism to update the training data for the predictive model based on the indicated tokens to disassociate these tokens with the incorrect response template. In this case, the training data may be updated with text data that includes the indicated tokens, wherein the text data is paired with the incorrect response template. The predictive model may be updated, e.g. at block 690, by re-training the predictive model using the updated training data. “Re-training” here covers further training of an existing model with the additional data as well as training a new model with initial (non-trained) parameter values. The revised text data may be supplied by the operator or another party. The method 600 may comprise requesting text data comprising the indicated tokens and an indication of a correct response template, receiving the text data and the indication of the correct response template, and adding the text data and the indication of the correct response template to training data for the trained predictive model. This may comprise requesting, from the operator, example user messages that comprise the indicated tokens and using this as text data to be associated with the incorrect response template. The incorrect response template for the original text data is thus here deemed to be a “correct” response template for the newly generated text data that includes the indicated tokens.
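The step of adding received text data to the training data may be sketched as follows. The function name `add_feedback_examples`, the representation of training data as a list of (text, response template) pairs, and the example messages are assumptions made for this sketch. The re-training step itself is not shown.

```python
def add_feedback_examples(training_data, received_examples, indicated_tokens):
    """Return training data extended with received (text, template) pairs.

    Only received text that contains at least one indicated token is added.
    """
    wanted = {token.lower() for token in indicated_tokens}
    updated = list(training_data)  # leave the original training data unchanged
    for text, template in received_examples:
        if wanted & set(text.lower().split()):
            updated.append((text, template))
    return updated


training_data = [("my screen is dark", "How to turn on screen")]
received = [
    ("how do I turn on my device", "How to use device"),
    ("thanks for the help", "How to use device"),
]
updated = add_feedback_examples(training_data, received, ["turn", "on"])
print(len(training_data), len(updated))
```

The second received message contains none of the indicated tokens and is therefore not added in this sketch.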
In certain implementations of the system 300 and method 600, the trained predictive model comprises a recurrent neural network that is trained upon pairs of associated text data and response templates. This may have one or more layers. It may be implemented using Long Short-Term Memory (LSTM) units or Gated Recurrent Units (GRUs), e.g. as implemented in machine learning libraries of computer program code or dedicated integrated circuits. In this case, a contribution for each token may be computed by determining how probability values in the array of probabilities change when the token (e.g. in the form of an integer index or word embedding) is excluded from the numeric array (i.e. the input sequence). In this case, similar to the case of the multi-layer neural network described above, pairings of new numeric arrays containing the indicated tokens to be disassociated and an indicator of the “incorrect” response template may be generated as further training examples and the predictive model re-trained.
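The exclusion-based computation described above may be sketched as follows. The function `toy_rnn_predict` is a hypothetical stand-in with a single scalar hidden state and made-up embedding values in `EMBED`. It is not an LSTM or GRU, and any function that maps a sequence of token indices to an array of probabilities could be passed in its place.

```python
import math

# Hypothetical scalar embedding per token index, chosen only for illustration.
EMBED = {0: 0.0, 1: 1.2, 2: -0.8, 3: 0.1}


def toy_rnn_predict(token_ids):
    """Tiny recurrent model: scalar hidden state, two-template softmax output."""
    h = 0.0
    for token in token_ids:
        h = math.tanh(EMBED[token] + 0.5 * h)
    logits = [h, -h]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exclusion_contributions(token_ids, template_index, predict):
    """For each position, the fall in the template's probability when the token is removed."""
    base = predict(token_ids)[template_index]
    contributions = []
    for position, token in enumerate(token_ids):
        reduced = token_ids[:position] + token_ids[position + 1:]
        change = base - predict(reduced)[template_index]
        contributions.append((position, token, change))
    return contributions


sequence = [3, 1, 2]
for position, token, change in exclusion_contributions(sequence, 0, toy_rnn_predict):
    print(position, token, round(change, 3))
```

A positive value indicates that the probability of the template falls when the token is excluded, i.e. that the token supported the prediction of that template. One design consequence is that the model is applied once per token in the sequence, in addition to the initial prediction.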
In one case, the method 600 may be performed interactively when engaging in conversations with users. In this case, the conversational agent 305 may use a selected “correct” response template, which may be the response template with the largest probability value in the array of probabilities 340, to retrieve that template from the template database 310 and to populate the indicated response template with user data to generate an agent response 320. This may comprise inserting user and/or case specific details as field data 430 as shown in
Certain examples described herein address an issue of providing feedback to a conversational agent that uses a trained predictive model to select data to respond to users. Through the described example processes, the conversational agent may be considered to partly “explain” its operation, e.g. why it selected particular response text to reply to a user message. An operator is then able to use this “explanation” to provide feedback to the conversational agent so that it can adjust its predictive model and generate more natural replies. An operator may thus be provided with a high-level appreciation of how the conversational agent is working, and the conversational agent is less like a “black box”. In particular, “trigger” tokens that the predictive model uses in selection of a particular response template may be marked and disassociated. The above examples are to be understood as illustrative. Further examples are envisaged. Even though conversations are referred to as “text dialogues”, it is noted that front-end speech-to-text and text-to-speech may be used to convert sound data from a user into text data, and similarly to convert an agent message into sound data. As such, the examples described herein may be used with voice communication systems, wherein “messages” represent portions of an audio conversation that have been converted to text. It is to be understood that any feature described in relation to any one example may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the examples, or any combination of any other of the examples. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the disclosure, which is defined in the accompanying claims.