The present invention relates generally to the field of computer technologies and, more particularly, to a context-aware chatbot system and method.
As E-commerce is emerging, successful information access on E-commerce websites, which accommodate both customer needs and business requirements, becomes essential and critical. Menu driven navigation and keyword search provided by most commercial sites have tremendous limitations, as they tend to overwhelm and frustrate users with lengthy and rigid interactions. User interest in a particular site often decreases exponentially with the increase in the number of mouse clicks. Thus, shortening the interaction path to provide useful information becomes important.
Many E-commerce sites attempt to solve the problem by providing keyword search capabilities. However, keyword search engines usually require users to know domain-specific jargon. Moreover, keyword search does not allow users to describe their intention precisely and, more importantly, lacks an understanding of the semantic meanings of the search words and phrases. For example, keyword search engines usually may not understand that “summer dress” should be looked up in women's clothing under “dress”, whereas “dress shirt” most likely belongs in men's clothing under “shirt”. A search for “shirt” often returns dozens or even hundreds of items, which are useless for somebody who has a specific style and pattern in mind.
Given the abovementioned limitations, a current solution is natural language (and multimodal) dialog, namely, a chatbot. Chatbots have been used in a large variety of fields, such as call-center/routing applications, e-mail routing, information retrieval and database access, and telephony banking. Recently, chatbots have become even more popular with access to a large amount of user data.
However, according to the present disclosure, existing chatbot technologies are often restricted to specific domains or applications (e.g., booking an airline ticket) and require handcrafted rules. Furthermore, in a real dialogue between a user and a robot, the user's context may be substantially complex and may change continuously. Thus, it is highly desirable to incorporate context-aware and proactive technologies into a chatbot system.
The disclosed methods and systems are directed to solve one or more problems set forth above and other problems.
One aspect of the present disclosure includes a context-aware chatbot method. The context-aware chatbot method comprises receiving a user's voice; converting the user's voice to a question to be answered; determining a question type of the question to be answered; generating at least one answer to the question based on a context-aware neural conversation model; validating the answer generated by the context-aware neural conversation model; and delivering the answer validated to the user. The context-aware neural conversation model takes contextual information of the question into consideration, and decomposes the contextual information of the question into a plurality of high dimension vectors.
One aspect of the present disclosure includes a non-transitory computer-readable medium having a computer program for, when being executed by a processor, performing a context-aware chatbot method based on a multimodal deep neural network. The context-aware chatbot method comprises receiving a user's voice; converting the user's voice to a question to be answered; determining a question type of the question to be answered; generating at least one answer to the question based on a context-aware neural conversation model; validating the answer generated by the context-aware neural conversation model; and delivering the answer validated to the user. The context-aware neural conversation model takes contextual information of the question into consideration, and decomposes the contextual information of the question into a plurality of high dimension vectors.
One aspect of the present disclosure includes a context-aware chatbot system. The context-aware chatbot system comprises a question acquisition module configured to receive a user's voice and convert the user's voice to a question to be answered; a question determination module configured to determine a question type of the question to be answered; a context-aware neural conversation module configured to generate at least one answer to the question by taking contextual information of the question into consideration and decomposing the contextual information of the question into a plurality of high dimension vectors; an evidence validation module configured to validate the answer generated by the context-aware neural conversation model; and an answer delivery module configured to deliver the answer validated to the user.
Other aspects of the present disclosure can be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.
The following drawings are merely examples for illustrative purposes according to various disclosed embodiments and are not intended to limit the scope of the present disclosure.
Reference will now be made in detail to exemplary embodiments of the invention, which are illustrated in the accompanying drawings. Hereinafter, embodiments consistent with the disclosure will be described with reference to drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. It is apparent that the described embodiments are some but not all of the embodiments of the present invention. Based on the disclosed embodiments, persons of ordinary skill in the art may derive other embodiments consistent with the present disclosure, all of which are within the scope of the present invention.
Chatbot systems are paramount for a wide range of tasks in enterprise. A chatbot system has to communicate clearly with its suppliers and partners, and engage clients in an ongoing dialog, not merely metaphorically but also literally, which is essential for maintaining an ongoing relationship. Communication characterized by information-seeking and task-oriented dialogs is central to five major families of business applications: customer service, help desk, website navigation, guided selling, and technical support.
Customer service responds to customers' general questions about products and services, e.g., answering questions about applying for an automobile loan or home mortgage. Help desk responds to internal employee questions, e.g., responding to HR questions. Website navigation guides customers to relevant portions of complex websites. A “Website concierge” is invaluable in helping people determine where information or services reside on a company's website. Guided selling provides answers and guidance in the sales process, particularly for complex products being sold to novice customers. Technical support responds to technical problems, such as diagnosing a problem with a device.
In commerce, clear communication is critical for acquiring, serving, and retaining customers. Companies often educate their potential customers about their products and services and, meanwhile, increase customer satisfaction and customer retention by developing a clear understanding of their customers' needs. However, customers are often frustrated by fruitless searches through websites, long waiting in call queues to speak with customer service representatives, and delays of several days for email responses. Thus, correct and prompt answers to customers' inquiries are highly desired.
The existing chatbot systems focus on training the question-answer pairs and recommending the most likely response to individual users, while not taking any contextual information into consideration. Contextual information refers to information relevant to an understanding of the text, for example, the identity of things named in the text: people, places, books, etc., information about things named in the text: birth dates, geographical locations, date published, etc., interpretive information: themes, keywords, and normalization of measurements, dates, etc.
That is, traditional chatbot systems only deal with users and conversations, but do not embed the conversation into a context when responding to the users. Considering only users and conversations may be insufficient for many applications. For example, using the temporal context, a travel conversational system would provide a vacation recommendation in the winter which may be very different from the one in the summer. Similarly, in a consumer conversational system, it is important to determine what content is to be delivered to a customer and when. Thus, incorporating the contextual information in the conversational system to respond to users in certain circumstances is highly desired.
Mapping sequences to sequences based on neural networks has been used for neural machine translation, improving English-French and English-German translation tasks. Because vanilla recurrent neural networks (RNNs) suffer from vanishing gradients, variants of the Long Short Term Memory (LSTM) recurrent neural network may be adopted. Besides, bots and conversational agents have been proposed. However, most of these systems require a rather complicated processing pipeline of many stages, and the corresponding methods do not consider the changes in the user's context.
The present disclosure provides a context-aware chatbot method based on a neural conversational model, which may take contextual features into consideration. The neural conversational model may be trained end-to-end and, thus, may require significantly fewer handcrafted rules. The disclosed context-aware chatbot method may incorporate contextual information in a neural conversational model, which may enable a chatbot to be aware of context in a communication with the user. A contextual real-valued input vector may be provided in association with each word to simplify the training process. The vector learned from the context may be used to convey the contextual information of the sentences being modeled.
The user terminal 102 may include any appropriate type of electronic device with computing capabilities, such as a wearable device (e.g., a smart watch, a wristband), a mobile phone, a smartphone, a tablet, a personal computer (PC), a server computer, a laptop computer, or a personal digital assistant (PDA), etc.
The server 104 may include any appropriate type of server computer or a plurality of server computers for providing personalized contents to the user 106. For example, the server 104 may be a cloud computing server. The server 104 may also facilitate the communication, data storage, and data processing between the other servers and the user terminal 102. The user terminal 102 and the server 104 may communicate with each other through one or more communication networks 110, such as a cable network, a phone network, and/or a satellite network, etc.
The user 106 may interact with the user terminal 102 to query and to retrieve various contents and perform other activities of interest, or the user may use voice, hand or body gestures to control the user terminal 102 if a speech recognition engine, a motion sensor, or a depth camera is used by the user terminal 102. The user 106 may be a single user or a plurality of users, such as family members.
The user terminal 102, and/or server 104 may be implemented on any appropriate computing circuitry platform.
As shown in
The processor 202 may include any appropriate processor or processors. Further, the processor 202 can include multiple cores for multi-thread or parallel processing. The storage medium 204 may include memory modules, such as ROM, RAM, flash memory modules, and mass storages, such as CD-ROM and hard disk, etc. The storage medium 204 may store computer programs for implementing various processes, when the computer programs are executed by the processor 202.
Further, the peripherals 212 may include various sensors and other I/O devices, such as keyboard and mouse, and the communication module 208 may include certain network interface devices for establishing connections through communication networks. The database 214 may include one or more databases for storing certain data and for performing certain operations on the stored data, such as database searching.
Returning to
The question acquisition module 301 may be configured to receive a user's question. The user's questions may be received in various ways, for example, text, voice, sign language. In one embodiment, the question acquisition module 301 may be configured to receive a user's voice and convert the user voice to a corresponding question, for example, with the help of speech recognition engines.
The question determination module 302 may be configured to analyze the question and determine a question type. Analyzing the question may refer to deriving the semantic meaning of that question (what the question is actually asking). The question determination module 302 may be configured to analyze the question through deriving how many parts or meanings are embedded in the question. Features of questions may be learned for a question-answer matching.
In particular, the question determination module 302 may be configured to identify a Lexical Answer Type (LAT). A lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. For example, given a question “recommend me some restaurant?”, the question determination module 302 may be configured to analyze the syntax of the sentence and infer that the question is asking for a place.
The context-aware neural conversation module 303 may be configured to generate answers to the question and a sequence of answers to the question based on a context-aware neural conversation model, i.e., use the data from the question analysis to generate candidate answers. In particular, when a question is received, the context-aware neural conversation module 303 may be configured to recognize the contextual information of the question even if the context is not explicitly present in the question. For example, the context-aware neural conversation module 303 may be configured to add time, event, etc., as input into the context-aware neural conversation model.
Moreover, the context-aware neural conversation module 303 may be configured to infer answers to questions even if the evidence is not readily present in the training set, which may be important because the training data may not contain explicit information about every attribute of each user. The context-aware neural conversation module 303 may be configured to learn event representations based on conversational content produced by different events, in which events producing similar responses may tend to have similar embeddings. Thus, the training data nearby in the vector space may increase the generalization capability of the context-aware neural conversation model.
The evidence validation module 304 may be configured to validate the answer generated by the context-aware neural conversation module 303. Although the answers are generated, the user may not accept the answer. Thus, the evidence validation module 304 may be configured to calculate a confidence score for quality control. In one embodiment, the confidence score may be calculated as the Kullback-Leibler distance between the question and the answer, normalized between 0 and 1.
For example, a predetermined confidence score may be provided as a standard. If the calculated confidence score is larger than the predetermined confidence score, the corresponding answer may be considered as valid. The answer delivery module 305 may be configured to deliver the validated answer to the user. If the calculated confidence score is smaller than the predetermined confidence score, the corresponding answer may be considered as invalid. Then the context-aware neural conversation module 303 may generate a new answer until the answer is validated. In addition, the validated answers may also be used in training for future questions.
The present disclosure also provides a context-aware chatbot method. To take the contextual information into consideration, the context-aware chatbot method may model the response with context. Each event may be represented as a vector for embedding, such that event information (e.g., weather, traffic) that influences the content and style of responses may be encoded.
As shown in
Further, the user's voice is converted to a question to be answered (S404). That is, a question is issued by the user in his/her voice. In one embodiment, the user's voice may be recognized into text and the question may be obtained by analyzing the text. Alternatively, the data of the user's voice may be analyzed to obtain the question or questions. In another embodiment, the question to be answered may be received in other ways, for example, text or sign language, and is not limited to voice.
Then, the question to be answered is analyzed to determine a question type (S406). For example, the question to be answered may be regarding time, location or place, etc. The question to be answered may be analyzed through deriving how many parts or meanings are embedded in the question to be answered. In one embodiment, the question type may be determined through identifying Lexical Answer Type (LAT). For example, given a question “recommend me some restaurant?”, the syntax of the sentence may be analyzed, and the question to be answered may be inferred as a question regarding a place.
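As a rough illustration of step S406, the sketch below maps a question to a coarse answer type using a small keyword table. The table `LAT_KEYWORDS`, the function name `question_type`, and the type labels are hypothetical and chosen only for this example; the disclosure describes analyzing the syntax of the sentence and does not prescribe a keyword lookup.

```python
import re

# Hypothetical keyword table: answer type -> trigger words (illustrative only).
LAT_KEYWORDS = {
    "place": {"restaurant", "hotel", "where", "place"},
    "time": {"when", "time", "date"},
    "person": {"who", "whom"},
}

def question_type(question: str) -> str:
    """Return a coarse answer type for the question, or "unknown"."""
    tokens = set(re.findall(r"[a-z']+", question.lower()))
    # Types are checked in table order; the first match wins.
    for answer_type, keywords in LAT_KEYWORDS.items():
        if tokens & keywords:
            return answer_type
    return "unknown"

print(question_type("recommend me some restaurant?"))  # prints "place"
```

A trained classifier or a syntactic parser could replace the table without changing the interface of the function.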
After the question type is determined, at least one answer to the question is generated based on a context-aware neural conversation model (S408). That is, candidate answers may be generated based on the data from the step S406. A sequence of answers to the question to be answered may also be generated based on the context-aware neural conversation model, in which the answers may be ranked in a certain order, for example, an order of preference.
In particular, when a question is received by the context-aware neural conversation model, the context-aware neural conversation model may recognize the contextual information even if the context is not explicitly present in the question. For example, time, event, etc., may be added as input into the context-aware neural conversation model.
The context-aware neural conversation model may add a hidden layer that encodes the event information $v_i$, making the response context-aware. The embedding $v_i$ may be shared across all conversations that involve event $i$. $\{v_i\}$ may be learned by back-propagating word prediction errors to each neural component during training.
Moreover, the context-aware neural conversation model may be able to infer answers to questions even if the evidence is not readily present in the training set, which may be important as the training data may not contain explicit information about every attribute of each user. The context-aware neural conversation model may learn event representations based on conversational content produced by different events, and events producing similar responses may tend to have similar embeddings. Thus, the training data nearby in the vector space may increase the generalization capability of the model.
For example, considering a question-answer pair “recommend some place for fun” and “I think lake tahoe is good” which is generated in the winter season, the context-aware neural conversation model may add time, location, people and other contextual information as inputs in the training process, which may be embedded into the learning of restaurant representations considering the contextual information. Then, “lake tahoe” may be a better answer for the winter season. In the test process, when a restaurant is asked about in a question, “how about the restaurant B.J. in lake tahoe”, the context-aware neural conversation model may detect that this question is asked in the summer season and may recommend a better result other than B.J. when noticing that “lake tahoe” is not close to the current context.
Then the step S408 may be converted to finding a response sentence or an answer $Y=\{y_1, y_2, \ldots, y_n\}$ to a given input sentence $X=\{x_1, x_2, \ldots, x_n\}$, by taking the context $EC=\{ec_1, ec_2, \ldots, ec_m\}$ into consideration, where $x$ represents a word in the question, and $y$ represents a word in the response. The problem of finding the response sentence $Y$ may be converted to predicting $y_t$ by maximizing the probability $P(y_t \mid y_{t-1}, \ldots, y_1, ec)$. A neural network may be adopted to learn the representation of sentences without applying handcrafted rules.
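The sharing of an event embedding across conversations can be sketched as a lookup table whose row for event i is attached to every word vector produced under that event. Everything named here is an assumption made for illustration: the event vocabulary `EVENTS`, the size `K`, the random initialization (standing in for vectors that would be learned by back propagation), and the choice to combine the two vectors by concatenation.

```python
import numpy as np

K = 4  # embedding size (assumed for this example)
rng = np.random.default_rng(0)

# Hypothetical event vocabulary; one vector v_i per event i.
EVENTS = {"winter": 0, "summer": 1, "rain": 2}
# Random values stand in for embeddings that training would learn.
event_table = rng.normal(size=(len(EVENTS), K))

def contextual_input(word_vec: np.ndarray, event: str) -> np.ndarray:
    """Concatenate a word vector with the shared embedding of its event."""
    v_i = event_table[EVENTS[event]]
    return np.concatenate([word_vec, v_i])

x_a = contextual_input(np.ones(K), "winter")
x_b = contextual_input(np.zeros(K), "winter")
# x_a and x_b differ in their word part but carry the same event vector
# in their last K entries, because both were produced under "winter".
```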
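In a typical neural conversational model, each time step may be associated with an input gate, a memory gate, and an output gate, which are respectively denoted as $i_t$, $f_t$, and $o_t$. $x_t$ denotes the vector for an individual text unit at time step $t$, $h_t$ denotes the vector computed by the LSTM model at time step $t$ by combining $x_t$ and $h_{t-1}$, $c_t$ denotes the cell state vector at time step $t$, and $\theta$ denotes the sigmoid function. Then, the vector representation $h_t$ for each time step $t$ is given by:
$$i_t = \theta\left(W_i \cdot [h_{t-1}; x_t]\right), \quad f_t = \theta\left(W_f \cdot [h_{t-1}; x_t]\right), \quad o_t = \theta\left(W_o \cdot [h_{t-1}; x_t]\right), \quad l_t = \tanh\left(W_l \cdot [h_{t-1}; x_t]\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot l_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $[h_{t-1}; x_t]$ denotes the concatenation of $h_{t-1}$ and $x_t$, $l_t$ denotes the candidate update of the cell state, and $\odot$ denotes element-wise multiplication,
where $W$ denotes learned parameters, and $W_i, W_f, W_o, W_l \in \mathbb{R}^{K \times 2K}$.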
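A single LSTM time step of the kind described above can be sketched in NumPy as follows. The update rules are the standard LSTM formulation, which is assumed here to be what the description refers to; the function name `lstm_step`, the toy size `K = 3`, and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_l):
    """One LSTM time step; each W_* has shape (K, 2K)."""
    z = np.concatenate([h_prev, x_t])   # combine h_{t-1} and x_t, shape (2K,)
    i_t = sigmoid(W_i @ z)              # input gate
    f_t = sigmoid(W_f @ z)              # memory gate
    o_t = sigmoid(W_o @ z)              # output gate
    l_t = np.tanh(W_l @ z)              # candidate update of the cell state
    c_t = f_t * c_prev + i_t * l_t      # new cell state
    h_t = o_t * np.tanh(c_t)            # new vector representation
    return h_t, c_t

K = 3
rng = np.random.default_rng(0)
W_i, W_f, W_o, W_l = (rng.normal(scale=0.1, size=(K, 2 * K)) for _ in range(4))
h, c = np.zeros(K), np.zeros(K)
for x_t in rng.normal(size=(5, K)):     # a toy sequence of five word vectors
    h, c = lstm_step(x_t, h, c, W_i, W_f, W_o, W_l)
```

Feeding a word vector concatenated with a context vector would only change the width of the weight matrices, not the structure of the step.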
Different from the SEQ2SEQ generation task, each input $X$ may be paired with a sequence of predicted outputs: $Y=\{y_1, y_2, \ldots, y_n\}$. The distribution over outputs and sequentially predicted tokens may be expressed by a softmax function:
$$p(Y \mid X) = \prod_{t=1}^{n} p(y_t \mid x_1, \ldots, x_n, y_1, \ldots, y_{t-1}) = \prod_{t=1}^{n} \frac{\exp\left(f(h_{t-1}, e_{y_t})\right)}{\sum_{y'} \exp\left(f(h_{t-1}, e_{y'})\right)}$$
where $f(h_{t-1}, e_{y_t})$ denotes an activation function between $h_{t-1}$ and $e_{y_t}$, and the sum runs over all candidate tokens $y'$. Each sentence may be terminated with a special end-of-sentence symbol EOS. Thus, during decoding, the decoding algorithm may be terminated when an EOS token is predicted. At each time step, either a greedy approach or beam search may be adopted for word prediction.
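The greedy variant of the decoding described above can be sketched as follows: at each step a softmax turns scores into a distribution over the vocabulary, the most probable token is emitted, and decoding stops when EOS is predicted. The vocabulary `VOCAB`, the integer `state`, and the scorer `toy_scores` are hypothetical stand-ins for a trained model and are not part of the disclosure.

```python
import numpy as np

VOCAB = ["EOS", "lake", "tahoe", "is", "good"]  # toy vocabulary (hypothetical)

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

def greedy_decode(score_fn, state, max_len=10):
    """Pick the most probable token at each step until EOS or max_len."""
    out = []
    for _ in range(max_len):
        probs = softmax(score_fn(state))  # distribution over the vocabulary
        token = int(np.argmax(probs))
        if VOCAB[token] == "EOS":
            break
        out.append(VOCAB[token])
        state = state + 1                 # stand-in for the recurrent update
    return out

def toy_scores(state):
    # Hypothetical scorer: favors token state+1, wrapping around to EOS.
    scores = np.zeros(len(VOCAB))
    scores[(state + 1) % len(VOCAB)] = 5.0
    return scores

print(greedy_decode(toy_scores, 0))  # prints ['lake', 'tahoe', 'is', 'good']
```

Beam search would keep several partial sequences per step instead of one, at the cost of more scoring calls.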
After the answer to the question is generated, the answer is validated by an evidence validation model (S410). Although the answers are generated, the user may not accept the answers. Thus, a confidence score for quality control may be provided. In one embodiment, the confidence score may be calculated as the Kullback-Leibler distance between the question and the answer, normalized between 0 and 1. The calculation of the Kullback-Leibler distance is well known to those skilled in the art and, thus, is not explained here.
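One possible way to turn a Kullback-Leibler distance into a score between 0 and 1 is sketched below. The disclosure does not specify how the question and the answer are represented as distributions or how the distance is normalized, so both choices here are assumptions: smoothed word-frequency distributions over the joint vocabulary, and the mapping exp(-D), which gives a higher score to a smaller distance. The function names are hypothetical.

```python
import math
from collections import Counter

def word_distribution(text, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution of `text` over `vocab`."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return [(counts[w] + alpha) / total for w in vocab]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def confidence(question, answer):
    """Map the divergence into (0, 1]; exp(-D) is an assumed normalization."""
    vocab = sorted(set(question.lower().split()) | set(answer.lower().split()))
    p = word_distribution(question, vocab)
    q = word_distribution(answer, vocab)
    return math.exp(-kl_divergence(p, q))
```

Smoothing keeps every probability positive, so the logarithm is always defined even when the question and the answer share no words.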
For example, a predetermined confidence score may be provided as a standard, and whether the answer is valid or not is determined based on the calculated confidence score of the answer (S411). If the calculated confidence score is larger than the predetermined confidence score, the corresponding answer may be considered as valid and the valid answer is delivered to the user (S412). If the calculated confidence score is smaller than the predetermined confidence score, the corresponding answer may be considered as invalid, and steps S408 and S410 and S411 may be repeated until the answer is determined as valid. In addition, the validated answers may also be used for training for the future questions.
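The generate-and-validate loop of steps S408 to S412 can be sketched as below. The function `answer_question`, the default threshold of 0.5, and the word-overlap scorer `toy_score` are hypothetical; the sketch also assumes a finite list of candidates and returns `None` when none is validated, whereas the description repeats generation until a valid answer is found.

```python
from typing import Callable, Iterable, Optional

def answer_question(
    question: str,
    candidates: Iterable[str],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return the first candidate whose confidence exceeds the threshold."""
    for answer in candidates:                      # S408: next generated answer
        if score(question, answer) > threshold:    # S410 and S411: validate
            return answer                          # S412: deliver valid answer
    return None                                    # no candidate was validated

# Toy stand-in for the validator (hypothetical): fraction of answer words
# that also occur in the question.
def toy_score(question: str, answer: str) -> float:
    answer_words = set(answer.split())
    shared = set(question.split()) & answer_words
    return len(shared) / max(len(answer_words), 1)

result = answer_question(
    "recommend some place for fun",
    ["it is sunny", "some place for fun is lake tahoe"],
    toy_score,
)
```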
The disclosed method and context-aware chatbot system may respond to the user or answer questions by taking the contextual information into consideration. To realize a more accurate representation of question, answer and context, the contextual information may be input into the context-aware neural conversation model. That is, the contextual information may be input into the chatbot at a system level. The context-aware neural conversation model may learn the contextual information and question-answer pairs together. With the context-aware neural conversation model, the question-answer pairs may be trained without handcrafted rules, and the contextual information may be decomposed into a plurality of high dimension vectors, such as people, organization, object, agent, occurrence, purpose, time, place, form of expression, concept/abstraction, and relationship, etc.
By analyzing the context in the questions, the user's question may be paired with a better answer. That is, the chatbot may provide more relevant responses to the users, and the users may find services and products they need in different contexts, significantly improving the user experience. The disclosed method and context-aware chatbot system may be applied to various interesting applications without handcrafted rules.
In addition, the disclosed method and context-aware chatbot system may provide a general learning framework for methods and systems which have to take contextual information into consideration. The learned embedded representation of the context may be used for other tasks in the future. The high dimension vectors representing the contextual information may also be used for personalization in recommender systems in the future.
Those of skill would further appreciate that the various illustrative modules and method steps disclosed in the embodiments may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The description of the disclosed embodiments is provided to illustrate the present invention to those skilled in the art. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.