1. Field
The disclosure relates to computer generation of voice with emotional content.
2. Background
Computer speech synthesis is increasingly prevalent in the human interface capabilities of modern computing devices. For example, modern smartphones may offer an intelligent personal assistant interface for a user of the smartphone, providing services such as answering user questions and providing reminders or other useful information. Other applications of speech synthesis may include any system in which speech output is desired to be generated, e.g., personal computer systems delivering media content in the form of speech, automobile navigation systems, systems for assisting people with visual impairment, etc.
Prior art techniques for generating voice may employ a straight text-to-speech conversion, in which emotional content is absent from the speech rendering of the underlying text. In such cases, the computer-generated voice may sound unnatural to the user, thus degrading the overall experience of the user when interacting with the system. Accordingly, it would be desirable to provide efficient and robust techniques for generating voice with emotional content to enhance user experience.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards techniques for generating speech output having emotion type. In one aspect, an apparatus includes a candidate generation block configured to generate a plurality of candidates associated with a message, and a candidate selection block configured to select one of the plurality of candidates as corresponding to a predetermined emotion type. The plurality of candidates preferably span a diverse emotional content range, such that a candidate having emotional content close to the predetermined emotion type will likely be present.
In one aspect, the plurality of candidates associated with a message may be generated offline via, e.g., crowd-sourcing, and stored in a look-up table or database associating each message with a corresponding plurality of candidates. The candidate generation block may query the look-up table to determine the plurality of candidates. Furthermore, the candidate selection block may be configured using predetermined parameters derived from a machine learning algorithm. The machine learning algorithm may be trained offline using training messages having known emotion types.
Other advantages may become apparent from the following detailed description and drawings.
Various aspects of the technology described herein are generally directed towards a technology for generating voice with emotional content. The techniques may be used in real time, while nevertheless drawing on substantial human feedback and algorithm training that is performed offline.
It should be understood that the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways to provide benefits and advantages in text-to-speech systems in general. For example, exemplary techniques for generating a plurality of emotionally diverse candidates and for selecting a candidate matching the specified emotion type are described, but any other techniques for performing similar functions may be used.
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary aspects of the invention and is not intended to represent the only exemplary aspects in which the invention can be practiced. The term “exemplary” used throughout this description means “serving as an example, instance, or illustration,” and should not necessarily be construed as preferred or advantageous over other exemplary aspects. The detailed description includes specific details for the purpose of providing a thorough understanding of the exemplary aspects of the invention. It will be apparent to those skilled in the art that the exemplary aspects of the invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the novelty of the exemplary aspects presented herein.
In
Based on the processing performed by processor 125, device 120 may generate speech output 126 responsive to speech input 122, using speaker 128. Note in alternative processing scenarios, device 120 may also generate speech output 126 independently of speech input 122, e.g., device 120 may autonomously provide alerts or relay messages from other users (not shown) to user 110 in the form of speech output 126.
In
At block 220, speech recognition is performed on speech input 210. In an exemplary embodiment, speech recognition 220 converts speech input 210 into text form, e.g., based on knowledge of the language in which speech input 210 is expressed.
At block 230, language understanding is performed on the output of speech recognition 220. In an exemplary embodiment, natural language understanding techniques such as parsing and grammatical analysis may be performed to derive the intended meaning of the speech.
At block 240, a dialog engine generates a suitable response to the user's speech input as determined by language understanding 230. For example, if language understanding 230 determines that the user speech input corresponds to a query regarding a weather forecast for a particular location, then dialog engine 240 may obtain and assemble the requisite weather information from sources, e.g., a weather forecast service or database.
At block 250, language generation is performed on the output of dialog engine 240. Language generation presents the information generated by the dialog engine in a natural language format, e.g., obeying lexical and grammatical rules, for ready comprehension by the user. The output of language generation 250 may be, e.g., sentences in the target language that convey the information from dialog engine 240 in a natural language format. For example, in response to a query regarding the weather, language generation 250 may output the following text: “The weather today will be 72 degrees and sunny.”
At block 260, text-to-speech conversion is performed on the output of language generation 250. The output of text-to-speech conversion 260 may be an audio waveform.
At block 270, speech output in the form of an acoustic signal is generated from the output of text-to-speech conversion 260. The speech output may be provided to a listener, e.g., user 110 in
In certain applications, it is desirable for speech output 270 to be generated not only as an emotionally neutral rendition of text, but further for speech output 270 to include specified emotional content when delivered to the listener. In particular, a human listener is sensitive to a vast array of cues indicating the emotional content of speech segments. For example, the perceived emotional content of speech output 270 may be affected by a variety of parameters, including, but not limited to, speed of delivery, lexical content, voice and/or grammatical inflection, etc. The vast array of parameters renders it particularly challenging to artificially synthesize natural sounding speech with emotional content. Accordingly, it would be desirable to provide efficient yet reliable techniques to generate speech having emotional content.
In
It will be appreciated that semantic content 310 may be represented in any of a plurality of ways, and need not correspond to a full, grammatically correct sentence in a natural language such as English. For example, alternative representations of semantic content may include semantic representations employing abstract formal languages for representing meaning.
Emotion type 312, on the other hand, may indicate an emotion to be associated with the corresponding semantic content 310, as determined by dialog engine 240.1. For example, in certain circumstances, dialog engine 240.1 may specify the emotion type 312 to be “excited.” However, in other circumstances, dialog engine 240.1 may specify the emotion type 312 to be “neutral,” or “sad,” etc.
Semantic content 310 and emotion type 312 generated by dialog engine 240.1 are provided to a composite language generation block 320. In the exemplary embodiment shown, block 320 may be understood to perform both the functions of language generation block 250 and text-to-speech block 260 in
In
For example, returning to the sports news example described hereinabove, candidate speech segments corresponding to the particular semantic content 310 of “The Red Sox have won the World Series” may include the following:
In Table I, the first column lists the identification numbers associated with four candidate speech segments. The second column provides the text content of each candidate speech segment. The third column provides certain heuristic characteristics of each candidate speech segment. Note the heuristic characteristics of each candidate speech segment are provided only to aid the reader of the present disclosure in understanding the nature of the corresponding candidate speech segment when listened to in person. The heuristic characteristics are not required to be explicitly determined by any means, or otherwise explicitly provided for each candidate speech segment.
It will be appreciated that the four candidate speech segments shown in Table I offer a diversity of emotional content corresponding to the specified semantic content, in that each candidate speech segment has text content and heuristic characteristics that will likely provide the listener with a perceived emotional content distinct from the other candidate speech segments.
Note that Table I is shown for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular parameters or characteristics shown in Table I. For example, the candidate speech segments need not have different text content from each other, and may all include identical text, with differing heuristic characteristics only. Furthermore, any number of candidate speech segments (e.g., more than four) may be provided. It will be appreciated that the number of candidate speech segments generated is a design parameter that may depend on, e.g., the effectiveness of block 410 in generating suitably diverse candidate speech segments, as well as processing and memory constraints of computer hardware implementing the processes described. Note there generally need not be any predetermined relationship between the different candidate speech segments, or any significance attributed to the sequence in which the candidate speech segments are presented.
Various techniques may be employed to generate a plurality of emotionally diverse candidate speech segments associated with a given semantic content. For example, in an exemplary embodiment, an emotionally neutral reading of a sentence may be generated, and the reading may then be post-processed to modify one or more speech parameters known to be correlated with emotional content. For example, the speed of a single candidate speech segment may be alternately set to fast and slow to generate two candidate speech segments. Other parameters to be varied may include, e.g., volume, rising or falling pitch, etc. In an alternative exemplary embodiment, crowd-sourcing techniques may be utilized to generate the plurality of emotionally diverse candidate speech segments, as further described hereinbelow with reference to
Returning to
Further in
In an exemplary embodiment, as shown in
Note the plurality of candidate speech segments (e.g., 510a.1 through 510a.N) for each entry in LUT 410.1 may be predetermined and stored in, e.g., memory local to device 120, or in memory accessible via a wired or wireless network remote from device 120. The determination of candidate speech segments associated with a given semantic content 310 may be performed, e.g., as described with reference to
In an exemplary embodiment, LUT 410.1 may correspond to a database, to which a module of block 410 submits a query requesting a plurality of candidates associated with a given message. Responsive to the query, the database returns a plurality of candidates having diverse emotional content associated with the given message. In an exemplary embodiment, block 410 may submit the query wirelessly to an online version of LUT 410.1 that is located, e.g., over a network, and LUT 410.1 may return the results of such query also over the network.
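By way of illustration only, the query described above may be sketched as follows, assuming an in-memory Python dictionary stands in for LUT 410.1. The `Candidate` type, the `CANDIDATE_LUT` table, its entries, the audio file names, and the `query_candidates` function are hypothetical and are introduced solely for this sketch; the candidate texts are invented examples and are not taken from Table I.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    """One candidate speech segment (hypothetical representation)."""
    candidate_id: int
    text: str
    audio_path: str  # placeholder for where the rendered audio would be stored


# Hypothetical stand-in for LUT 410.1: message -> emotionally diverse candidates.
# The entries below are invented for illustration only.
CANDIDATE_LUT: Dict[str, List[Candidate]] = {
    "The Red Sox have won the World Series": [
        Candidate(1, "The Red Sox have won the World Series.", "seg_1.wav"),
        Candidate(2, "Wow, the Red Sox have won the World Series!", "seg_2.wav"),
    ],
}


def query_candidates(message: str) -> List[Candidate]:
    """Return the stored candidates for a message, or an empty list if none."""
    return list(CANDIDATE_LUT.get(message, []))
```

In a networked variant, the dictionary access could be replaced by a request to a remote database, with the remainder of the interface unchanged.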
In an exemplary embodiment, block 412 may be implemented as, e.g., an algorithm that applies certain rules to rank a plurality of candidate speech segments to determine consistency with a specified emotion type 312. Such algorithm may be executed locally on device 120, or the results of the ranking may be accessible via a wired or wireless network remote from device 120.
It will be appreciated that using the architecture shown in
In
The task 612a formulated by module 612 is subsequently provided to task distribution/results collection module 614. Module 614 transmits information regarding the formulated task 612a to crowd-sourcing (CS) agents 620.1 through 620.N. Each of CS agents 620.1 through 620.N may independently execute the formulated task 612a, and return the results of the executed task to module 614. Note in
In an exemplary embodiment, module 614 may interface with any or all of CS agents 620.1 through 620.N over a network, e.g., a plurality of terminals linked by the standard Internet protocol. In particular, any CS agent may correspond to one or more human users (not shown in
Given the variety of distinct users participating as CS agents 620.1 through 620.N, it is probable that one of the expressions generated by the CS agents will closely correspond to the target emotion type 312, as may be subsequently determined by a module for identifying the optimal candidate speech segment, such as block 412 described with reference to
Note CS agents 620.1 through 620.N may be provided with only the semantic content 310. The CS agents need not be provided with emotion type 312. In alternative exemplary embodiments, the CS agents may be provided with emotion type 312. In general, since it is not necessary to provide the CS agents with knowledge of the emotion type 312, the crowd-sourcing operations as shown in
In view of the techniques disclosed herein, it will be appreciated that any techniques known for performing crowd-sourcing not explicitly described herein may generally be employed for the task of generating a plurality of emotionally diverse candidate speech segments for a given semantic content 310. For example, standard techniques for providing incentives to crowd-sourcing agents, for distributing tasks, etc., may be applied along with the techniques of the present disclosure. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
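For purposes of illustration, the task formulation and the distribution and collection of results described above may be sketched as follows. The sketch assumes each CS agent can be modeled as a Python callable, which is a simplification since the agents may correspond to human users; the names `formulate_task` and `distribute_and_collect`, and the wording of the task instruction, are hypothetical.

```python
from typing import Callable, List

# A CS agent is modeled as a callable that receives the task text and returns
# that agent's own expression of it (hypothetical simplification).
Agent = Callable[[str], str]


def formulate_task(semantic_content: str) -> str:
    """Analogue of module 612: wrap the semantic content in an instruction."""
    return f"Express the following in your own words and voice: {semantic_content}"


def distribute_and_collect(semantic_content: str, agents: List[Agent]) -> List[str]:
    """Analogue of module 614: send the task to every agent, gather the replies."""
    task = formulate_task(semantic_content)
    results = []
    for agent in agents:
        reply = agent(task)
        if reply:  # skip agents that return an empty result
            results.append(reply)
    return results
```

Note the emotion type is not passed to the agents in this sketch, consistent with the case in which the CS agents are provided with only the semantic content.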
Note while a plurality N of crowd-sourcing agents are shown in
In
In certain exemplary embodiments, the algorithm underlying engine 720 may be derived from machine learning techniques. For example, in a classification-based approach, the algorithm may determine, for every candidate, whether it is or is not of the given emotion type. In a ranking-based approach, the algorithm may rank all candidates in order of their consistency with the predetermined emotion type.
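The distinction between the classification-based and ranking-based approaches may be illustrated by the following minimal sketch. It assumes a scoring function that maps a candidate and an emotion type to a value between 0 and 1; the names `classify`, `rank`, and `toy_scorer`, as well as the threshold of 0.5, are hypothetical. The `toy_scorer` shown is a trivial stand-in for a learned model and is not the algorithm of engine 720.

```python
from typing import Callable, List, Sequence

# (candidate text, emotion type) -> consistency score in [0, 1]
Scorer = Callable[[str, str], float]


def classify(candidates: Sequence[str], emotion: str, scorer: Scorer,
             threshold: float = 0.5) -> List[bool]:
    """Classification-based approach: one yes/no decision per candidate."""
    return [scorer(c, emotion) >= threshold for c in candidates]


def rank(candidates: Sequence[str], emotion: str, scorer: Scorer) -> List[str]:
    """Ranking-based approach: candidates ordered from most to least consistent."""
    return sorted(candidates, key=lambda c: scorer(c, emotion), reverse=True)


def toy_scorer(text: str, emotion: str) -> float:
    """Illustrative stand-in only: treats exclamation marks as a cue for 'excited'."""
    if emotion != "excited":
        return 0.0
    return min(1.0, text.count("!") / 2)
```

Under the ranking-based approach the first element of the returned list may be taken as the selected candidate, whereas the classification-based approach may accept zero, one, or several candidates.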
While certain exemplary embodiments of block 412 are described herein with reference to machine-learning based techniques, it will be appreciated that the scope of the present disclosure need not be so limited. Any algorithms for assessing the emotion type of candidate text or speech segments may be utilized according to the techniques of the present disclosure. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
In
In an exemplary embodiment, crowd-sourcing scheme 600 may be utilized to derive the training inputs, e.g., training speech segments 810 and tagged emotion type 820. For example, any of CS agents 620.1 through 620.N may be requested to provide a tagged emotion type 820 corresponding to the speech segment generated by that CS agent.
Algorithm training block 801 may further accept a list of features to be extracted 830 from speech segments 810 relevant to the determination of emotion type. Based on the list of features, algorithm training block 801 may derive dependencies amongst the features 830 and the tagged emotion type 820 that most correctly match the training speech segments 810 to their corresponding predetermined emotion type 820 over the entire sample of training speech segments 810. Similar machine learning techniques may also be applied to, e.g., text segments, and/or combinations of text and speech. Note techniques for algorithm training in machine learning may include, e.g., Bayesian techniques, artificial neural networks, etc. The output of algorithm training block 801 includes learned algorithm parameters 801a, e.g., weights or other specified dependencies to estimate the emotion type 820 of an arbitrary speech segment.
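As one non-limiting illustration of a Bayesian technique, the following sketch trains a multinomial naive Bayes model with add-one smoothing on text segments tagged with emotion types. It operates on text only and uses individual words as the sole features; acoustic features of speech segments are not modeled. The function names and the dictionary layout of the learned parameters are assumptions made for this sketch and do not describe the internal structure of block 801 or parameters 801a.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_naive_bayes(samples: List[Tuple[str, str]]) -> Dict:
    """Learn per-emotion word counts from (text, tagged emotion type) pairs."""
    class_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab = set()
    for text, emotion in samples:
        words = text.lower().split()
        class_counts[emotion] += 1
        word_counts[emotion].update(words)
        vocab.update(words)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}


def log_score(params: Dict, text: str, emotion: str) -> float:
    """Log of P(emotion) times the product of P(word | emotion), add-one smoothed.

    The emotion must be one of the types seen during training.
    """
    total_docs = sum(params["class_counts"].values())
    counts = params["word_counts"][emotion]
    denom = sum(counts.values()) + len(params["vocab"])
    score = math.log(params["class_counts"][emotion] / total_docs)
    for word in text.lower().split():
        score += math.log((counts[word] + 1) / denom)
    return score


def predict(params: Dict, text: str) -> str:
    """Return the trained emotion type with the highest smoothed log score."""
    return max(params["class_counts"], key=lambda e: log_score(params, text, e))
```

The dictionary returned by `train_naive_bayes` plays the role of learned parameters in this sketch: it is computed once offline and may then be reused to score arbitrary text segments.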
In certain exemplary embodiments, the features to be extracted 830 from speech segments 810 may include (but are not restricted to) any combination of the following:
1. Lexical features. Each word in a speech segment may be a feature.
2. N-gram features. Each sequence of N words, where N ranges from 2 to any arbitrarily large integer, in a sentence may be a feature.
3. Language model score. Based on raw sentences and/or speech segments for each predetermined emotion type, language models may be trained to recognize the raw sentences and/or speech segments as corresponding to the predetermined emotion type. The score assigned to a sentence by the language model of the given emotion type may be a feature. Such language models may include those used in statistical natural language processing (NLP) tasks such as speech recognition, machine translation, etc., wherein, e.g., probabilities are assigned to a particular sequence of words or N-grams. It will be appreciated that the language model score may enhance the accuracy of emotion type assessment.
4. Topic model score. Based on raw sentences and/or speech segments for each predetermined emotion type, topic models may be trained to recognize the raw sentences and/or speech segments as corresponding to a topic. The score assigned to a sentence by the topic model may be a feature. Topic modeling may utilize, e.g., latent semantic analysis techniques.
5. Word embedding. Word embedding may correspond to a neural network-based technique for mapping a word to a real-valued vector, wherein vectors of semantically related words may be geometrically close to each other. The word embedding feature can be used to convert sentences into real-valued vectors, according to which sentences with the same emotion type may be clustered together.
6. Number of words. The word count, e.g., normalized word count, of a sentence may be a feature.
7. Number of clauses. The normalized count of clauses in each sentence may be a feature. A clause may be defined, e.g., as a smallest grammatical unit that can express a complete proposition. The proposition may generally include a verb and possible arguments, which are then identifiable by algorithms.
8. Number of personal pronouns. The normalized count of personal pronouns (such as “I,” “you,” “me,” etc.) in a sentence may be a feature.
9. Number of emotional/sentimental words. The normalized count of emotional words (e.g., “happy,” “sad,” etc.) and sentimental words (e.g., “like,” “good,” “awful,” etc.) may be features.
10. Number of exclamation words. The (normalized) count of exclamation words (e.g., “oh,” “wow,” etc.) may be a feature.
Note the preceding list of features is provided for illustrative purposes only, and is not meant to limit the scope of the present disclosure to any particular features enumerated herein. One of ordinary skill in the art will appreciate that other features not explicitly disclosed herein may readily be extracted and utilized for the purposes of the present disclosure. Exemplary embodiments incorporating such alternative features are contemplated to be within the scope of the present disclosure.
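Several of the simpler features enumerated above may be extracted from a sentence as in the following minimal sketch. The word lists are small hypothetical lexicons, the tokenizer is a simple regular expression, only bigrams are generated for the N-gram features, and the normalization (division by the word count of the sentence) is an assumption, since no particular normalization is specified herein. Language model scores, topic model scores, word embeddings, and clause counts are omitted.

```python
import re
from typing import Dict, List

# Hypothetical word lists; a practical system would use far larger lexicons.
PERSONAL_PRONOUNS = {"i", "you", "me", "we", "he", "she", "they"}
EMOTIONAL_WORDS = {"happy", "sad", "like", "good", "awful"}
EXCLAMATION_WORDS = {"oh", "wow"}


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every contiguous sequence of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(sentence: str) -> Dict[str, float]:
    """Build a sparse feature dictionary for one sentence."""
    tokens = tokenize(sentence)
    norm = len(tokens) or 1  # avoid division by zero for empty input
    features: Dict[str, float] = {}
    for tok in tokens:                      # lexical features
        features[f"word={tok}"] = 1.0
    for bigram in ngrams(tokens, 2):        # N-gram features (N = 2 only)
        features[f"bigram={bigram}"] = 1.0
    features["num_words"] = float(len(tokens))
    features["pronoun_ratio"] = sum(t in PERSONAL_PRONOUNS for t in tokens) / norm
    features["emotional_ratio"] = sum(t in EMOTIONAL_WORDS for t in tokens) / norm
    features["exclamation_ratio"] = sum(t in EXCLAMATION_WORDS for t in tokens) / norm
    return features
```

A sparse dictionary is used so that lexical and N-gram features, whose number grows with the vocabulary, need only be stored when present in the sentence.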
Learned algorithm parameters 801a are provided to real-time emotional classification/ranking algorithm 412.1.1. In an exemplary embodiment, configurable parameters of the real-time emotional classification/ranking algorithm 412.1.1 may be programmed to the learned settings 801a. Based on the learned parameters 801a, algorithm 412.1.1 may, in an exemplary embodiment, classify each of candidates 410a according to whether they are consistent with the predetermined emotion type 312. Alternatively, algorithm 412.1.1 may rank candidates 410a in order of their consistency with the predetermined emotion type 312. In either case, algorithm 412.1.1 may output an optimal candidate 412.1.1a most consistent with the predetermined emotion type 312.
Computing system 900 includes a processor 910 and a memory 920. Computing system 900 may optionally include a display subsystem, communication subsystem, sensor subsystem, camera subsystem, and/or other components not shown in
Processor 910 may include one or more physical devices configured to execute one or more instructions. For example, the processor may be configured to execute one or more instructions that are part of one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more devices, or otherwise arrive at a desired result.
The processor may include one or more processors that are configured to execute software instructions. Additionally or alternatively, the processor may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. Processors of the processor may be single core or multicore, and the programs executed thereon may be configured for parallel or distributed processing. The processor may optionally include individual components that are distributed throughout two or more devices, which may be remotely located and/or configured for coordinated processing. One or more aspects of the processor may be virtualized and executed by remotely accessible networked computing devices configured in a cloud computing configuration.
Memory 920 may include one or more physical devices configured to hold data and/or instructions executable by the processor to implement the methods and processes described herein. When such methods and processes are implemented, the state of memory 920 may be transformed (e.g., to hold different data).
Memory 920 may include removable media and/or built-in devices. Memory 920 may include optical memory devices (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory devices (e.g., RAM, EPROM, EEPROM, etc.) and/or magnetic memory devices (e.g., hard disk drive, floppy disk drive, tape drive, MRAM, etc.), among others. Memory 920 may include devices with one or more of the following characteristics: volatile, nonvolatile, dynamic, static, read/write, read-only, random access, sequential access, location addressable, file addressable, and content addressable. In some embodiments, processor 910 and memory 920 may be integrated into one or more common devices, such as an application specific integrated circuit or a system on a chip.
Memory 920 may also take the form of removable computer-readable storage media, which may be used to store and/or transfer data and/or instructions executable to implement the herein described methods and processes. Removable computer-readable storage media 930 may take the form of CDs, DVDs, HD-DVDs, Blu-Ray Discs, EEPROMs, and/or floppy disks, among others.
It is to be appreciated that memory 920 includes one or more physical devices that store information. The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 900 that is implemented to perform one or more particular functions. In some cases, such a module, program, or engine may be instantiated via processor 910 executing instructions held by memory 920. It is to be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” are meant to encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
In an aspect, computing system 900 may correspond to a computing device including a memory 920 holding instructions executable by a processor 910 to retrieve a plurality of speech candidates having semantic content associated with a message, and select one of the plurality of speech candidates corresponding to a specified emotion type. The memory 920 may further hold instructions executable by processor 910 to generate speech output corresponding to the selected one of the plurality of speech candidates. Note such a computing device will be understood to correspond to a process, machine, manufacture, or composition of matter.
In
At block 1020, one of the plurality of speech candidates corresponding to a specified emotion type is selected.
At block 1030, speech output corresponding to the selected one of the plurality of candidates is generated.
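The retrieval, selection, and output generation steps of the method may be composed as in the following minimal sketch. The function `generate_emotional_speech` and its three callable parameters (`retrieve`, `score`, and `render`) are hypothetical; each stands in for whichever retrieval mechanism, selection algorithm, and audio output path a given embodiment employs.

```python
from typing import Callable, List


def generate_emotional_speech(
    message: str,
    emotion: str,
    retrieve: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    render: Callable[[str], bytes],
) -> bytes:
    """Retrieve candidates, select the best match for the emotion, render it."""
    # Retrieval step: obtain the speech candidates associated with the message.
    candidates = retrieve(message)
    if not candidates:
        raise LookupError(f"no candidates stored for message: {message!r}")
    # Block 1020: select the candidate most consistent with the emotion type.
    best = max(candidates, key=lambda c: score(c, emotion))
    # Block 1030: generate speech output for the selected candidate.
    return render(best)
```

Passing the three stages in as parameters keeps the sketch independent of any particular look-up table, learned model, or audio format.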
In this specification and in the claims, it will be understood that when an element is referred to as being “connected to” or “coupled to” another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected to” or “directly coupled to” another element, there are no intervening elements present. Furthermore, when an element is referred to as being “electrically coupled” to another element, it denotes that a path of low resistance is present between such elements, while when an element is referred to as being simply “coupled” to another element, there may or may not be a path of low resistance between such elements.
The functionality described herein can be performed, at least in part, by one or more hardware and/or software logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.