This disclosure relates to emotive text-to-speech (TTS) with automatic detection of emotions.
Large language models (LLMs) are increasingly used to provide conversational experiences between users and digital assistant interfaces executing on user devices. Generally, when a user provides a request/query to a digital assistant interface powered by an LLM, the response generated by the LLM and reproduced as synthesized speech is devoid of emotion, sounding monotonic and unnatural. However, when used for a personal assistant or content narration, injecting emotion into generated speech significantly improves the user experience. Previous solutions have attempted to manually dictate emotions into generated speech. Alternatively, highly specialized speech generation modules (e.g., for reading news, kids' stories, etc.) are used. Both of these solutions, however, require a cost-prohibitive amount of annotated data and time given the ever-increasing volume of synthesized speech and the introduction of newer voice-first technologies.
One aspect of the disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations that include obtaining input text characterizing a natural language response generated by an assistant large language model (LLM) to a query input by a user during a conversation between the user and the assistant LLM, and processing, using the assistant LLM, the input text conditioned on an emotion detection task prompt to predict, as output from the assistant LLM, an emotional state of the natural language response. Here, the emotion detection task prompt specifies a task for the assistant LLM to detect an emotional state of the input text from a set of possible emotional states. The operations also include determining, based on the emotional state of the natural language response predicted as output from the assistant LLM, an emotional embedding for the input text. The emotional embedding specifies the emotional state of the natural language response for synthesizing the input text into expressive speech. The operations further include instructing a text-to-speech (TTS) model to process the input text and the emotional embedding to generate a synthesized speech representation of the natural language response. Here, the synthesized speech representation conveys the emotional state of the natural language response as specified by the emotional embedding. Implementations of the disclosure may include one or more of the following optional features.
In some implementations, the operations further include receiving audio data characterizing an utterance of the query spoken by the user in natural language and captured by a user device, and performing speech recognition on the audio data to generate a textual representation of the query spoken by the user. In some examples, the operations further include receiving one or more few-shot learning examples each depicting an example text input paired with a ground-truth emotional state classification of the example text input. Each few-shot learning example provides in-context learning for enabling the assistant LLM to generalize for the task of detecting emotional states of input texts. In these examples, processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt and the one or more few-shot learning examples to predict, as output from the assistant LLM, the emotional state of the natural language response.
In some implementations, the operations further include receiving a fine-tuned prompt embedding. The fine-tuned prompt embedding includes a soft prompt configured to guide the assistant LLM to detect the emotional state of the input text from the set of possible emotional states while parameters of the assistant LLM are held fixed. In these implementations, processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt and the fine-tuned prompt embedding to predict, as output from the assistant LLM, the emotional state of the natural language response. In these implementations, the fine-tuned prompt embedding may be learned during a prompt embedding fine-tuning process. The fine-tuning process includes initializing a prompt embedding as a fixed-length sequence of learnable vectors, and receiving a training dataset of natural language training utterances. Each natural language training utterance includes a corresponding textual representation of the natural language training utterance, and a corresponding ground-truth emotional state of the natural language training utterance. For each natural language training utterance in the training dataset, the operations further include processing, using the assistant LLM, the corresponding textual representation of the natural language training utterance to generate a corresponding predicted emotional state for the natural language training utterance as output from the assistant LLM, determining a training loss based on the corresponding predicted emotional state and the corresponding ground-truth emotional state of the natural language training utterance, and tuning, using the training loss, the prompt embedding by updating the learnable vectors while parameters of the assistant LLM are kept fixed.
In some examples, the assistant LLM includes a pre-trained LLM and a low-rank adaptation training process fine-tunes a fraction of parameters of the pre-trained LLM to learn how to predict emotional states of input texts. In these examples, the low-rank adaptation training process fine-tunes the fraction of the parameters of the pre-trained LLM by receiving a training dataset of natural language training utterances. Each natural language training utterance includes a corresponding textual representation of the natural language training utterance, and a corresponding ground-truth emotional state of the natural language training utterance. For each natural language training utterance in the training dataset, the operations further include processing, using the assistant LLM, the corresponding textual representation of the natural language training utterance to generate a corresponding predicted emotional state for the natural language training utterance as output from the assistant LLM, and determining a training loss based on the corresponding predicted emotional state and the corresponding ground-truth emotional state of the natural language training utterance. The operations further include fine-tuning, using the training losses, the fraction of the parameters of the assistant LLM while a remaining portion of the parameters of the assistant LLM is kept fixed.
In some implementations, obtaining the input text characterizing the natural language response includes processing, using the assistant LLM, a textual representation of the query input by the user to generate, as output from the assistant LLM, the input text characterizing the natural language response to the query, and processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes, after the input text characterizing the natural language response to the query is output from the assistant LLM and provided as feedback to the assistant LLM, processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt to predict, as output from the assistant LLM, the emotional state of the natural language response. In some examples, obtaining the input text characterizing the natural language response includes processing, using the assistant LLM, a textual representation of the query input by the user to generate the input text characterizing the natural language response to the query. Here, processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response and generate, as output from the assistant LLM, marked-up text that includes the input text characterizing the natural language response annotated with the predicted emotional state of the natural language response. In some implementations, determining the emotional embedding specifying the emotional state of the natural language response for synthesizing the input text into expressive speech includes accessing a two-dimensional embedding space that maps each respective emotional state from the set of possible emotional states to a different respective emotional embedding.
Another aspect of the disclosure provides a system including data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that when executed by the data processing hardware cause the data processing hardware to perform operations that include obtaining input text characterizing a natural language response generated by an assistant large language model (LLM) to a query input by a user during a conversation between the user and the assistant LLM, and processing, using the assistant LLM, the input text conditioned on an emotion detection task prompt to predict, as output from the assistant LLM, an emotional state of the natural language response. Here, the emotion detection task prompt specifies a task for the assistant LLM to detect an emotional state of the input text from a set of possible emotional states. The operations also include determining, based on the emotional state of the natural language response predicted as output from the assistant LLM, an emotional embedding for the input text. The emotional embedding specifies the emotional state of the natural language response for synthesizing the input text into expressive speech. The operations further include instructing a text-to-speech (TTS) model to process the input text and the emotional embedding to generate a synthesized speech representation of the natural language response. Here, the synthesized speech representation conveys the emotional state of the natural language response as specified by the emotional embedding.
This aspect may include one or more of the following optional features. In some implementations, the operations further include receiving audio data characterizing an utterance of the query spoken by the user in natural language and captured by a user device, and performing speech recognition on the audio data to generate a textual representation of the query spoken by the user. In some examples, the operations further include receiving one or more few-shot learning examples each depicting an example text input paired with a ground-truth emotional state classification of the example text input. Each few-shot learning example provides in-context learning for enabling the assistant LLM to generalize for the task of detecting emotional states of input texts. In these examples, processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt and the one or more few-shot learning examples to predict, as output from the assistant LLM, the emotional state of the natural language response.
In some implementations, the operations further include receiving a fine-tuned prompt embedding. The fine-tuned prompt embedding includes a soft prompt configured to guide the assistant LLM to detect the emotional state of the input text from the set of possible emotional states while parameters of the assistant LLM are held fixed. In these implementations, processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt and the fine-tuned prompt embedding to predict, as output from the assistant LLM, the emotional state of the natural language response. In these implementations, the fine-tuned prompt embedding may be learned during a prompt embedding fine-tuning process. The fine-tuning process includes initializing a prompt embedding as a fixed-length sequence of learnable vectors, and receiving a training dataset of natural language training utterances. Each natural language training utterance includes a corresponding textual representation of the natural language training utterance, and a corresponding ground-truth emotional state of the natural language training utterance. For each natural language training utterance in the training dataset, the operations further include processing, using the assistant LLM, the corresponding textual representation of the natural language training utterance to generate a corresponding predicted emotional state for the natural language training utterance as output from the assistant LLM, determining a training loss based on the corresponding predicted emotional state and the corresponding ground-truth emotional state of the natural language training utterance, and tuning, using the training loss, the prompt embedding by updating the learnable vectors while parameters of the assistant LLM are kept fixed.
In some examples, the assistant LLM includes a pre-trained LLM and a low-rank adaptation training process fine-tunes a fraction of parameters of the pre-trained LLM to learn how to predict emotional states of input texts. In these examples, the low-rank adaptation training process fine-tunes the fraction of the parameters of the pre-trained LLM by receiving a training dataset of natural language training utterances. Each natural language training utterance includes a corresponding textual representation of the natural language training utterance, and a corresponding ground-truth emotional state of the natural language training utterance. For each natural language training utterance in the training dataset, the operations further include processing, using the assistant LLM, the corresponding textual representation of the natural language training utterance to generate a corresponding predicted emotional state for the natural language training utterance as output from the assistant LLM, and determining a training loss based on the corresponding predicted emotional state and the corresponding ground-truth emotional state of the natural language training utterance. The operations further include fine-tuning, using the training losses, the fraction of the parameters of the assistant LLM while a remaining portion of the parameters of the assistant LLM is kept fixed.
In some implementations, obtaining the input text characterizing the natural language response includes processing, using the assistant LLM, a textual representation of the query input by the user to generate, as output from the assistant LLM, the input text characterizing the natural language response to the query, and processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes, after the input text characterizing the natural language response to the query is output from the assistant LLM and provided as feedback to the assistant LLM, processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt to predict, as output from the assistant LLM, the emotional state of the natural language response. In some examples, obtaining the input text characterizing the natural language response includes processing, using the assistant LLM, a textual representation of the query input by the user to generate the input text characterizing the natural language response to the query. Here, processing the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response includes processing, using the assistant LLM, the input text conditioned on the emotion detection task prompt to predict the emotional state of the natural language response and generate, as output from the assistant LLM, marked-up text that includes the input text characterizing the natural language response annotated with the predicted emotional state of the natural language response. In some implementations, determining the emotional embedding specifying the emotional state of the natural language response for synthesizing the input text into expressive speech includes accessing a two-dimensional embedding space that maps each respective emotional state from the set of possible emotional states to a different respective emotional embedding.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
Humans may engage in human-to-computer dialogs with interactive software applications referred to as “chatbots,” “voice bots,” “automated assistants,” “interactive personal assistants,” “intelligent personal assistants,” “conversational agents,” etc., via a variety of computing devices. As one example, these chatbots may correspond to a machine learning model or a combination of different machine learning models, and may be utilized to perform various tasks on behalf of users.
Chatbots adopting large language models (LLMs) are currently opening up a wide range of applications due to their powerful understanding and generation capabilities, which can operate over text, image, and/or audio inputs. These models are also being extended with actuation capabilities via integration mechanisms with various service providers.
LLMs are increasingly used to provide conversational experiences between users and digital assistant interfaces executing on user devices. Generally, when a user provides a request/query to a digital assistant interface powered by an LLM, the synthesized speech produced for the response generated by the LLM lacks any emotion for a typical turn in a conversation. However, in spoken conversations where the user speaks an input query/request and synthesized speech conveying the response generated by the LLM is audibly output, monotonic and unnatural synthesized speech degrades the user experience.
A spoken conversation may be conducted between a user 102 and an assistant LLM 220. A conversational assistant application 200 may execute on a user device 10 associated with the user 102 and/or a remote system 60 in communication with the user device 10 via a network 40 to enable the user 102 and the assistant LLM 220 to interact with one another through spoken conversation. The conversational assistant application 200 may access various components for facilitating the spoken conversation in a natural manner between the user 102 and the assistant LLM 220. For instance, through the use of application programming interfaces (APIs) or other types of plug-ins, the conversational assistant application 200 may access an automated speech recognition (ASR) system 112 and a prompt structurer 210.
During a user turn of the spoken conversation between the user 102 and the conversational assistant application 200 (i.e., the assistant LLM 220), the user device 10 captures audio data characterizing an utterance 104 of a query 106 spoken by the user 102 and directed toward the conversational assistant application 200 to solicit a response from the assistant LLM 220. For instance, the query 106 may specify a particular question that the user 102 would like the assistant LLM 220 to answer, and the assistant LLM 220 may generate a response that answers the question.
For example, the assistant LLM 220 generates input text 202 characterizing a natural language response generated by the assistant LLM 220 to the query 106 input by the user 102. The query 106 may similarly correspond to a request for information and the assistant LLM 220 may generate the input text 202 as the response conveying the requested information. While the term query 106 is used, the query 106 may correspond to any natural language dialog (e.g., a greeting) directed toward the assistant LLM 220 during the user's turn in the spoken conversation between the user 102 and the assistant LLM 220. The user 102 may speak the utterance 104 of the query 106 in natural language and the ASR system 112 may perform speech recognition on the audio data characterizing the utterance 104 of the query 106 to generate a textual representation 108 of the query 106 spoken by the user 102. The textual representation 108 of the query 106 may be simply referred to as a textual query 108.
During a second round trip, the assistant LLM 220 performs the task of predicting the emotional state 232P of the input text 202 and then, based on the predicted emotional state 232P of the input text 202 characterizing the natural language response, the conversational assistant application 200 determines an emotional embedding 242 specifying the emotional state of the input text 202 characterizing the natural language response for synthesizing the input text 202 into expressive speech, and instructs the TTS model 300 to process the input text 202 and the emotional embedding 242 to generate a synthesized speech representation 352 of the natural language response. Here, the synthesized speech representation 352 conveys the emotional state 232 of the natural language response as specified by the emotional embedding 242. While examples herein depict the same assistant LLM 220 generating the input text 202 characterizing the natural language response to the user's query 106 input to the assistant LLM 220 and detecting the emotional state 232P of the input text 202, other configurations may utilize two LLMs: a first LLM that processes the user's query 106 to generate the input text 202 characterizing the natural language response; and a second LLM that processes the input text 202 to predict the emotional state 232P of the input text 202.
In these implementations, processing the input text 202 conditioned on the emotion detection task prompt 214 to predict the emotional state 232P of the natural language response includes the assistant LLM 220 first generating the input text 202 characterizing the natural language response to the query 106 and then providing the input text 202 as feedback to the assistant LLM 220 during the second round trip to predict the emotional state 232P of the natural language response. Alternatively, the assistant LLM 220 performs the task of generating the input text 202 and the task of detecting an emotional state 232 simultaneously such that the input text 202 and the emotional state 232 are generated/output in a single round trip. In these implementations, the assistant LLM 220 obtains the input text 202 characterizing the natural language response by processing the textual representation 108 of the query 106 input by the user 102 to generate the input text 202 characterizing the natural language response to the query 106. Here, the assistant LLM 220 processes the input text 202 conditioned on the emotion detection task prompt 214 to predict the emotional state 232P of the natural language response and generate, as output from the assistant LLM 220, marked-up text that includes the input text 202 characterizing the natural language response annotated with the predicted emotional state 232P of the natural language response.
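Where the single-round-trip configuration emits marked-up text, the application must separate the response text from its emotion annotation before synthesis. The following is a minimal sketch of such a parser in Python; the bracketed tag format, function names, and the neutral fallback are illustrative assumptions, as the disclosure does not define a markup syntax.

```python
import re

# Assumed (hypothetical) markup format: "[emotion: excited] That's wonderful news!"
_MARKUP_PATTERN = re.compile(
    r"^\s*\[emotion:\s*(?P<emotion>[^\]]+)\]\s*(?P<text>.*)$", re.DOTALL
)

def parse_marked_up_text(marked_up_text: str,
                         default_emotion: str = "neutral") -> tuple[str, str]:
    """Split LLM output into (response_text, emotional_state).

    Falls back to `default_emotion` when no annotation is present.
    """
    match = _MARKUP_PATTERN.match(marked_up_text)
    if match is None:
        return marked_up_text.strip(), default_emotion
    return match.group("text").strip(), match.group("emotion").strip().lower()

# Example usage:
text, emotion = parse_marked_up_text("[emotion: excited] You won the lottery!")
assert (text, emotion) == ("You won the lottery!", "excited")
```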
The user device 10 may be any computing device capable of communicating with the remote computing system 60 through the network 40. The user device 10 includes, but is not limited to, desktop computing devices and mobile computing devices, such as laptops, tablets, smart phones, smart speakers/displays, digital assistant devices, smart appliances, internet-of-things (IoT) devices, infotainment systems, vehicle infotainment systems, and wearable computing devices (e.g., headsets, smart glasses, and/or watches).
The remote computing system 60 may be a distributed system (e.g., a cloud computing environment) having scalable elastic resources. The resources include computing resources 62 (e.g., data processing hardware) and/or storage resources 64 (e.g., memory hardware). Additionally or alternatively, the remote computing system 60 may be a centralized system. The network 40 may be wired, wireless, or a combination thereof, and may include private networks and/or public networks, such as the Internet.
The assistant LLM 220 may power the conversational assistant application 200 to function as a personal chat bot capable of having dialog conversations with the user 102 in natural language and performing tasks/actions on the user's behalf. In some examples, the assistant LLM 220 includes an instance of Bard, LaMDA, BERT, Meena, ChatGPT, or any other previously trained LLM. These LLMs have been previously trained on enormous amounts of diverse data and are capable of engaging in conversations with users in a natural and intuitive manner. However, these LLMs have a plurality of machine learning (ML) layers and hundreds of millions to hundreds of billions of ML parameters.
By conditioning the input text 202 on the emotion detection task prompt 214 to form the emotion prompt 212, the conversational assistant application 200 guides the assistant LLM 220 to detect the emotional state 232 of the input text 202 characterizing the natural language response to the query 106, as opposed to generating the input text 202 without any accompanying emotion. Thereafter, the TTS model 300 may process the input text 202 and an emotional embedding 242 specifying the detected emotional state 232 to generate the synthesized speech representation 352.
As referenced above, the prompt structurer 210 is configured to receive the input text 202 and a set of possible emotional states 232 from the emotional state data store 230 and generate, as output, an emotion prompt 212. The emotion prompt 212 includes the input text 202 conditioned on an emotion detection task prompt 214 that directs the assistant LLM 220 to detect an emotional state 232 of the input text 202 from the set of possible emotional states 232 from the emotional state data store 230. Put another way, the prompt structurer 210 concatenates the emotion detection task prompt 214, the input text 202, and the set of possible emotional states 232 from the emotional state data store 230 to generate the emotion prompt 212 that serves as an instruction to the assistant LLM 220 to detect the emotional state 232 of the input text 202.
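To make the concatenation concrete, the following is a minimal sketch of such a prompt structurer in Python. The exact wording of the emotion detection task prompt 214 and the listed emotional states are illustrative assumptions; the disclosure does not prescribe a specific phrasing.

```python
# Hypothetical set of possible emotional states (emotional state data store 230).
POSSIBLE_EMOTIONAL_STATES = [
    "happy", "sad", "angry", "calm", "excited", "apologetic", "firm", "lively",
]

def build_emotion_prompt(input_text: str, states: list[str]) -> str:
    """Concatenate the emotion detection task prompt, the set of possible
    emotional states, and the input text into a single emotion prompt."""
    task_prompt = (
        "Classify the emotional state of the following response. "
        "Answer with exactly one label from the list."
    )
    return (
        f"{task_prompt}\n"
        f"Possible emotional states: {', '.join(states)}\n"
        f"Response: {input_text}\n"
        "Emotional state:"
    )

emotion_prompt = build_emotion_prompt("You won the lottery!", POSSIBLE_EMOTIONAL_STATES)
```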
The assistant LLM 220 is configured to receive the emotion prompt 212 and process the input text 202 conditioned on the emotion detection task prompt 214 output by the prompt structurer 210 to predict, as output, an emotional state 232P of the input text 202 (i.e., the natural language response). In some implementations, the assistant LLM 220 also receives, as input, one or more few-shot learning examples 216 that each depict an example text input paired with a ground-truth emotional state classification of the example text input. Here, each few-shot learning example 216 provides in-context learning for enabling the assistant LLM 220 to generalize for the task of detecting emotional states of input texts. For example, a few-shot learning example 216 pairs the example text input of “I'll try to do better, but no promises” with the ground-truth emotional state classifications of “firm” and “apologetic.” In another example, a few-shot learning example 216 pairs the example text input of “congratulations, I knew you'd be a hit!” with the ground-truth emotional state classification of “lively.” Here, processing the input text 202 conditioned on the emotion detection task prompt 214 to predict the emotional state 232 of the natural language response includes processing, using the assistant LLM 220, the input text 202 conditioned on the emotion detection task prompt 214 and the one or more few-shot learning examples 216 to predict, as output from the assistant LLM 220, the emotional state 232P of the natural language response (i.e., the input text 202). In these implementations, the assistant LLM 220 may be a pre-trained LLM that was never trained on the task of emotion detection, where the few-shot learning examples 216 paired with the input text 202 conditioned on the emotion detection task prompt 214 further aid in guiding the assistant LLM 220 to detect an emotional state of input text as an emergent property of the assistant LLM 220. In some implementations, the few-shot learning examples 216 guide the assistant LLM 220 to generate/detect emotional states of input text without training or updating parameters of the pre-trained assistant LLM 220. The assistant LLM 220 may also include the pre-trained LLM in zero-shot settings where the emotion prompt 212 is fed to the assistant LLM 220 without any few-shot learning examples 216.
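In-context few-shot learning can be layered onto the same prompt by prepending labeled examples before the input text. A minimal sketch, reusing the hypothetical `build_emotion_prompt` format from the sketch above; the example pairs mirror the few-shot learning examples 216 described in this paragraph.

```python
# Few-shot learning examples 216: example text inputs paired with
# ground-truth emotional state classifications.
FEW_SHOT_EXAMPLES = [
    ("I'll try to do better, but no promises", "firm, apologetic"),
    ("Congratulations, I knew you'd be a hit!", "lively"),
]

def build_few_shot_emotion_prompt(input_text: str, states: list[str],
                                  examples: list[tuple[str, str]]) -> str:
    """Prepend labeled examples for in-context learning, so the frozen
    pre-trained LLM can generalize to emotion detection without any
    parameter updates."""
    shots = "\n".join(f"Response: {text}\nEmotional state: {label}"
                      for text, label in examples)
    return f"{shots}\n{build_emotion_prompt(input_text, states)}"
```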
Additionally or alternatively to providing few-shot learning examples 216 with the emotion prompt 212, the assistant LLM 220 also receives, as input, a fine-tuned prompt embedding 450 that includes a soft prompt configured to guide the assistant LLM 220 to detect the emotional state 232P of the input text 202 from the set of possible emotional states 232 while parameters of the assistant LLM 220 are held fixed. Here, processing the input text 202 conditioned on the emotion detection task prompt 214 to predict the emotional state 232P of the natural language response includes processing, using the assistant LLM 220, the input text 202 conditioned on the emotion detection task prompt 214 and the fine-tuned prompt embedding 450 to predict, as output from the assistant LLM 220, the emotional state 232P of the natural language response. The fine-tuned prompt embedding 450 may be learned during a prompt embedding fine-tuning process, described in more detail below.
A loss module 440 for the training process 400a receives, as input, the corresponding ground-truth emotional state 434 of the natural language training utterance 430 and the corresponding predicted emotional state 232P for the natural language training utterance 430 as output from the assistant LLM 220 and determines a training loss 442 based on the corresponding predicted emotional state 232P and the corresponding ground-truth emotional state 434 of the natural language training utterance 430. Thereafter, the training process 400a fine-tunes, using the training loss 442, the fine-tuned prompt embedding 450 by updating the learnable vectors while parameters of the assistant LLM 220 are kept fixed. By keeping the parameters of the assistant LLM 220 fixed, the fine-tuned prompt embedding 450 extracts, from the training dataset 420, evidence about how to perform the task of detecting an emotion from input text and, as such, performs the same role as a manually written text prompt without the constraints of discrete language.
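A minimal, self-contained sketch of this prompt embedding fine-tuning process in PyTorch. The toy dimensions and the small stand-in network are assumptions for illustration; in practice, the frozen network would be the pre-trained assistant LLM 220 and the dataset would be the training dataset 420.

```python
import torch
import torch.nn as nn

PROMPT_LEN, HIDDEN, NUM_STATES = 8, 64, 6   # toy dimensions for illustration

class FrozenLLMStub(nn.Module):
    """Stand-in for the pre-trained assistant LLM 220: maps a sequence of
    input embeddings to logits over the set of possible emotional states."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_STATES)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        _, h = self.encoder(inputs_embeds)
        return self.head(h[-1])

llm = FrozenLLMStub()
for p in llm.parameters():                  # parameters of the LLM are held fixed
    p.requires_grad_(False)

# Initialize the prompt embedding 450 as a fixed-length sequence of learnable vectors.
soft_prompt = nn.Parameter(0.02 * torch.randn(PROMPT_LEN, HIDDEN))
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

# Toy stand-in for the training dataset 420: (token embeddings, ground-truth state id).
dataset = [(torch.randn(12, HIDDEN), torch.tensor([2])) for _ in range(4)]

for token_embeds, ground_truth_state in dataset:
    inputs = torch.cat([soft_prompt, token_embeds]).unsqueeze(0)  # prepend soft prompt
    logits = llm(inputs)                                          # predicted state 232P
    loss = nn.functional.cross_entropy(logits, ground_truth_state)
    optimizer.zero_grad()
    loss.backward()            # gradients reach only the learnable prompt vectors
    optimizer.step()
```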
A loss module 440 for the training process 400b receives, as input, the corresponding ground-truth emotional state 434 of the natural language training utterance 430 and the corresponding predicted emotional state 232P for the natural language training utterance 430 as output from the assistant LLM 220 and determines a training loss 442 based on the corresponding predicted emotional state 232P and the corresponding ground-truth emotional state 434 of the natural language training utterance 430. Thereafter, the training process 400b fine-tunes, using the training loss 442, the fraction of the parameters of the assistant LLM 220 while a remaining portion of the parameters of the assistant LLM 220 is kept fixed.
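A minimal sketch of the low-rank adaptation idea behind training process 400b in PyTorch: each frozen weight matrix of the pre-trained LLM is augmented with a trainable low-rank update, so only a small fraction of parameters is fine-tuned. The rank, scaling factor, and layer choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer augmented with a trainable
    low-rank update: y = W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # pre-trained weights stay fixed
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(64, 64))
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
assert trainable == ["lora_a", "lora_b"]    # only the low-rank factors are tuned
```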
The emotional embedding EE 242 specifies the emotional state 232P of the natural language response for synthesizing the input text 202 into expressive speech. As described above, the emotional embedding 242 may be a controllable feature that the TTS model 300 uses to synthesize speech with different emotional states 232. For example, determining the emotional embedding 242 specifying the emotional state 232P of the natural language response for synthesizing the input text 202 into expressive speech may include accessing a two-dimensional (2D) embedding space that maps each respective emotional state 232 from the set of possible emotional states 232 to a different respective emotional embedding 242. Each emotional embedding EE 242 may specify a style/prosody and may be provided to an end-to-end TTS model 300 for converting the input text 202 into synthesized speech 352 having the style/prosody specified by the emotional embedding EE 242.
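A minimal sketch of such a lookup follows. The coordinates are hypothetical; the disclosure does not specify the axes of the two-dimensional embedding space, though valence/arousal-style axes are one plausible choice.

```python
# Hypothetical 2D embedding space mapping each possible emotional state
# to a different emotional embedding (valence/arousal-like axes assumed).
EMOTION_EMBEDDING_SPACE = {
    "happy":      (0.8, 0.6),
    "excited":    (0.7, 0.9),
    "calm":       (0.5, 0.1),
    "sad":        (-0.6, -0.4),
    "angry":      (-0.7, 0.8),
    "apologetic": (-0.2, -0.1),
}

def emotional_embedding(state: str) -> tuple[float, float]:
    """Map a predicted emotional state 232P to its emotional embedding 242,
    defaulting to a neutral point for unknown states."""
    return EMOTION_EMBEDDING_SPACE.get(state, (0.0, 0.0))
```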
In some implementations, a context model 360 in communication with the assistant LLM 220 is configured to receive and process one or more context features 362 to generate a context embedding 364 associated with the input text 202. For example, the context features 362 may include the conversation history between the user 102 and the conversational assistant application 200 as context to the assistant LLM 220. By receiving historical context (e.g., via the context embedding 364), the assistant LLM 220 may more efficiently perform the task of predicting the emotional state 232 of the input text 202. For example, the historical emotional states 232 (e.g., the previously predicted emotional states 232P from previous conversation turns) may better inform the assistant LLM 220 on the tone and/or emotion of the conversation between the user 102 and the assistant LLM 220.
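One way such a context model might pool conversation-history features into a fixed-size context embedding 364 is sketched below; the mean-pooling strategy, dimensions, and per-turn featurization are assumptions, not a design prescribed by this disclosure.

```python
import torch
import torch.nn as nn

class ContextModel(nn.Module):
    """Pools per-turn feature vectors from the conversation history into a
    single context embedding 364 associated with the current input text."""
    def __init__(self, feature_dim: int = 32, embed_dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(feature_dim, embed_dim)

    def forward(self, turn_features: torch.Tensor) -> torch.Tensor:
        # turn_features: (num_turns, feature_dim), e.g., encoded prior queries,
        # responses, and previously predicted emotional states 232P.
        return self.proj(turn_features.mean(dim=0))   # mean-pool, then project

context_model = ContextModel()
history = torch.randn(5, 32)                  # five prior conversation turns
context_embedding = context_model(history)    # shape: (16,)
```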
At operation 502, the method 500 includes obtaining input text 202 characterizing a natural language response generated by an assistant large language model (LLM) 220 to a query 106 input by a user 102 during a conversation between the user 102 and the assistant LLM 220. The method 500 also includes, at operation 504, processing, using the assistant LLM 220, the input text 202 conditioned on an emotion detection task prompt 214 to predict, as output from the assistant LLM 220, an emotional state 232 of the natural language response. Here, the emotion detection task prompt 214 specifies a task for the assistant LLM 220 to detect an emotional state 232 of the input text 202 from a set of possible emotional states 232.
At operation 506, the method 500 also includes determining, based on the emotional state 232 of the natural language response predicted as output from the assistant LLM 220, an emotional embedding 242 for the input text 202. Here, the emotional embedding 242 specifies the emotional state 232 of the natural language response for synthesizing the input text 202 into expressive speech. At operation 508, the method 500 further includes instructing a text-to-speech (TTS) model 300 to process the input text 202 and the emotional embedding 242 to generate a synthesized speech representation 352 of the natural language response, the synthesized speech representation 352 conveying the emotional state 232 of the natural language response as specified by the emotional embedding 242.
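Tying operations 502 through 508 together, a structural sketch of one conversation turn follows, reusing `build_emotion_prompt` and `emotional_embedding` from the sketches above. The remaining helpers are hypothetical stand-ins for the ASR system 112, the assistant LLM 220, and the TTS model 300.

```python
# Hypothetical stand-ins for components described above.
def asr_transcribe(audio: bytes) -> str:
    return "tell me some good news"              # ASR system 112 (stub)

def assistant_llm_generate(prompt: str) -> str:
    return "You got the job, congratulations!"   # assistant LLM 220 (stub)

def tts_synthesize(text: str, embedding: tuple[float, float]) -> bytes:
    return b"\x00"                               # TTS model 300 (stub)

def conversation_turn(audio_data: bytes) -> bytes:
    """One assistant turn, following operations 502-508 of method 500."""
    textual_query = asr_transcribe(audio_data)
    input_text = assistant_llm_generate(textual_query)            # 502: obtain input text 202
    prompt = build_emotion_prompt(input_text, POSSIBLE_EMOTIONAL_STATES)
    emotional_state = assistant_llm_generate(prompt).strip()      # 504: predict emotional state 232P
    embedding = emotional_embedding(emotional_state)              # 506: determine emotional embedding 242
    return tts_synthesize(input_text, embedding)                  # 508: instruct the TTS model 300
```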
The computing device 600 includes a processor 610, memory 620, a storage device 630, a high-speed interface/controller 640 connecting to the memory 620 and high-speed expansion ports 650, and a low-speed interface/controller 660 connecting to a low-speed bus 670 and the storage device 630. Each of the components 610, 620, 630, 640, 650, and 660 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 610 (e.g., the data processing hardware 12, 62 described above) can process instructions for execution within the computing device 600.
The memory 620 (e.g., the memory hardware 14, 64 described above) stores information non-transitorily within the computing device 600.
The storage device 630 is capable of providing mass storage for the computing device 600. In some implementations, the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 620, the storage device 630, or memory on processor 610.
The high-speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low-speed controller 660 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 640 is coupled to the memory 620, the display 680 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 650, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 660 is coupled to the storage device 630 and a low-speed expansion port 690. The low-speed expansion port 690, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 600a or multiple times in a group of such servers 600a, as a laptop computer 600b, or as part of a rack server system 600c.
Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
The non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.