The present disclosure relates to text-to-speech systems.
Text-to-speech processing is also known as speech synthesis, that is, the artificial production of human speech from a text source. Text-to-speech conversion is a complex process that converts a stream of written text into an audio output file or audio signal. There are many conventional text-to-speech (TTS) programs that convert text to audio. Conventional TTS algorithms typically function by trying to understand the composition of the text that is to be converted. Example techniques include splitting text into phonemes, splitting phrases within a line of text, digitizing speech, and so forth.
TTS processing capability is useful for visually impaired computer users that have difficulty interpreting visually displayed content and for users of mobile and embedded computing devices, where the mobile and embedded computing devices may either lack a screen, possess a tiny screen unsuitable for displaying large amounts of content, or can be used in an environment where it is not appropriate for a user to visually focus upon a display. Such an inappropriate environment can include, for example, a vehicle navigation environment, where outputting navigation information to a display for viewing can be distracting to a driver. Thus, TTS systems provide a convenient way to listen to text-based communications.
One challenge in converting text to speech is accurately conveying emotion or audible expressivity. Conventional TTS systems are limited to analyzing punctuation and word arrangement in an attempt to guess at a possible mood of a text block to add some type of inflection, speech/pitch change, pause, etc. Such attempts at introducing inflection from approximated natural language understanding can at times come close, or can just as easily miss the mark completely. Generally it is difficult to determine mood from mere language analysis because the actual mood of a composer can vary dramatically even when using identical text.
Accordingly, techniques disclosed herein include systems and methods that improve audible emotion characteristics when synthesizing speech from a text source. Specifically, techniques disclosed herein use emoticons as a basis for providing contextual text-to-speech expressivity. Emoticons are common in text messages and chat messages, and their presence often indicates a sender's mood or attitude when composing the text. With the system herein, when a given emoticon has been identified in a given character string or block of text, a text-to-speech (TTS) engine makes use of the identified emoticon to enhance expressivity of the audio read out. For example, a common emoticon is known as a “smiley face,” which is conventionally formed using a colon immediately followed by a right parenthesis “:)” or, alternatively, a colon immediately followed by a hyphen and then immediately followed by a right parenthesis “:-).” Sometimes applications graphically convert this combination of punctuation marks to a drawing of a smiley face.
With techniques disclosed herein, when a smiley face emoticon is included in a text message, then the TTS engine can read out the text in a more cheerful or upbeat manner. Likewise, if the system identifies an angry emoticon, then the TTS engine can make use of this information to change a read out tone to match an angry mood of a respective message. Changing the expressivity through emoticon-based contextual cues allows for an enhanced audio experience and the perception of a more intelligent and advanced TTS system. The expressivity of the TTS engine can include, but is not limited to, changes in intonation, prosody, speed, pauses and other features.
One embodiment includes an expressivity manager of a software application and/or hardware device. The expressivity manager receives a character string, such as a text message or other unit of text. The expressivity manager identifies one or more emoticons within the character string, such as an emoticon at the end of a particular sentence. The expressivity manager tags the character string with an expressivity tag that indicates expressivity corresponding to the emoticon. Then the expressivity manager converts the character string into an audible signal or audio output file using a text-to-speech module or engine, such that audible expressivity of the audible signal is based on data from the expressivity tag, that is, audible expressivity is driven by a particular type of identified emoticon.
Conventionally, TTS engines, when encountering emoticons, typically either ignore the emoticon or speak the name of the emoticon, such as literally speaking “smiley face” or “angry face” or even speaking the name of the punctuation combination such as “colon right parenthesis.” Emoticons are useful for disambiguating emotion or mood of textual content, which otherwise might be difficult to identify just from a textual analysis alone. Emoticons are helpful to a reader to mentally recreate a sound representative of how a sender would speak corresponding text. Emoticons thus have an immediate emotional tie-in to text, and thus driving text-to-speech expressivity using information from emoticons can provide an accurate enhancement to text read out.
Yet other embodiments herein include software programs to perform the steps and operations summarized above and disclosed in detail below. One such embodiment comprises a computer program product that has a computer-storage medium (e.g., a non-transitory, tangible, computer-readable medium, disparately located or commonly located storage media, computer storage media or medium, etc.) including computer program logic encoded thereon that, when performed in a computerized device having a processor and corresponding memory, programs the processor to perform (or causes the processor to perform) the operations disclosed herein. Such arrangements are typically provided as software, firmware, microcode, code data (e.g., data structures), etc., arranged or encoded on a computer readable storage medium such as an optical medium (e.g., CD-ROM), floppy disk, hard disk, one or more ROM or RAM or PROM chips, an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA), and so on. The software or firmware or other such configurations can be installed onto a computerized device to cause the computerized device to perform the techniques explained herein.
Accordingly, one particular embodiment of the present disclosure is directed to a computer program product that includes one or more non-transitory computer storage media having instructions stored thereon for supporting operations such as: receiving a character string; identifying an emoticon within the character string; tagging the character string with an expressivity tag that indicates expressivity corresponding to the emoticon; and converting the character string into an audible signal using a text-to-speech module, such that audible expressivity of the audible signal is based on data from the expressivity tag. The instructions, and method as described herein, when carried out by a processor of a respective computer device, cause the processor to perform the methods disclosed herein.
Other embodiments of the present disclosure include software programs to perform any of the method embodiment steps and operations summarized above and disclosed in detail below.
Of course, the order of discussion of the different steps as described herein has been presented for clarity's sake. In general, these steps can be performed in any suitable order.
Also, it is to be understood that each of the systems, methods, apparatuses, etc. herein can be embodied strictly as a software program, as a hybrid of software and hardware, or as hardware alone such as within a processor, or within an operating system or within a software application, or via a non-software application such as a person performing all or part of the operations.
As discussed above, techniques herein are well suited for use in software applications supporting speech synthesis and text-to-speech functionality. It should be noted, however, that embodiments herein are not limited to use in such applications and that the techniques discussed herein are well suited for other applications as well.
Additionally, although each of the different features, techniques, configurations, etc. herein may be discussed in different places of this disclosure, it is intended that each of the concepts can be executed independently of each other or in combination with each other. Accordingly, the present invention can be embodied and viewed in many different ways.
Note that this summary section herein does not specify every embodiment and/or incrementally novel aspect of the present disclosure or claimed invention. Instead, this summary only provides a preliminary discussion of different embodiments and corresponding points of novelty over conventional techniques. For additional details and/or possible perspectives of the invention and embodiments, the reader is directed to the Detailed Description section and corresponding figures of the present disclosure as further discussed below.
The foregoing and other objects, features, and advantages of the invention will be apparent from the following more particular description of preferred embodiments herein as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, with emphasis instead being placed upon illustrating the embodiments, principles and concepts.
Techniques disclosed herein include systems and methods that improve audible representation of emotion when synthesizing speech from a text source. Specifically, techniques disclosed herein use emoticons to provide contextual text-to-speech expressivity. In general, techniques herein analyze text received at (or accessed by) a text-to-speech engine. The system parses out emoticons (and can also identify punctuation) and uses identified emoticons to form expressivity of the text read out, that is, machine-generated speech. For example, if the system identifies a smiley face emoticon at the end of a sentence, then the system can infer that this sentence, and possibly a subsequent sentence, has a tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer tone or mood from the various emoticons and then change or modify the expressivity of the TTS output. Expressivity of the TTS system, and modifications to it, can include several changes. For example, a speech pitch can be modified between high and low, a read speed can be slowed or accelerated, certain words can be emphasized, and other audible characteristics such as intonation and prosody can be adjusted. This includes essentially any changes to the audible read out of text that can reflect or represent one or more given emotions.
Emoticons are common in text messages, and their presence often indicates a sender's mood or attitude. When a given emoticon has been identified in a given character string or block of text, a text-to-speech (TTS) engine makes use of the identified emoticon to enhance expressivity of the audio read out. For example, a common emoticon is known as a “smiley face,” which is conventionally formed using a colon immediately followed by a right parenthesis “:)” or, alternatively, a colon immediately followed by a hyphen and then immediately followed by a right parenthesis “:-).” Sometimes applications graphically convert this combination of punctuation marks to a drawing of a smiley face.
Referring now to
In the
Modifying expressivity based on emoticons becomes more complex, however, as the number and type of emoticons used increases.
In other embodiments, the TTS system can identify confidence around a particular emoticon identified/tagged as part of the emoticon processing. This is especially useful for text bodies having more than one emoticon because each emoticon used can influence other emoticons. For example, a given text message reads: “I'm really excited to go to the football game. :), but my best friend is not going to be able to attend. :(.” With no confidence or intensity tags, the system might read the first sentence with intense happiness and then dramatically switch to intense sadness for the second sentence. Such an extreme mood flip would typically not happen in natural conversation. Thus, by assigning confidence levels and/or intensity levels to each mood tag, subsequent or surrounding emoticons can modify an initial confidence level and/or intensity level to either increase or decrease intensity. By way of a more specific example, in the example text message about the football game, there is a first instance of a smiley face emoticon, and then a subsequent instance of a sad face emoticon. In one processing example, the system tags the first sentence with a happy mood tag and a 50 percent intensity level. Then the system tags the second sentence with a sad mood tag and a 50 percent intensity level. Next, the system recognizes that two opposite mood tags are in close proximity to each other. In response, the system could then lower both intensity levels to perhaps 25 percent. The system can optionally include a separate tag that instructs a smooth transition between sentences. As a result, during read out, the first sentence can be read with a relatively slight increase in happiness expressivity, and then the second sentence is read with a relatively slight increase in sadness expressivity.
In other words, the mood characteristics during read out are more subdued, which reflects mood of the sentence because the happiness of going to a football game is checked by not having a best friend at the game. This helps the tags define a more conversational and natural speech.
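By way of a non-limiting illustration, the football game example can be sketched in Python as follows. The `MoodTag` class, the `VALENCE` table, and the `damp_opposing` function are hypothetical names introduced only for this sketch, and halving the intensity is just one possible way to arrive at the 25 percent figure mentioned above.

```python
from dataclasses import dataclass

# Hypothetical valence per mood: +1 positive, -1 negative (an assumption).
VALENCE = {"happy": 1, "sad": -1, "angry": -1}


@dataclass
class MoodTag:
    mood: str
    intensity: int  # percent, 0 to 100


def damp_opposing(tags, divisor=2):
    """Return copies of tags, softening neighbors whose valences oppose."""
    damped = [MoodTag(t.mood, t.intensity) for t in tags]
    for i in range(len(tags) - 1):
        first, second = tags[i], tags[i + 1]
        if VALENCE.get(first.mood, 0) * VALENCE.get(second.mood, 0) < 0:
            # min() keeps a tag with two opposing neighbors from being damped twice.
            damped[i].intensity = min(damped[i].intensity, first.intensity // divisor)
            damped[i + 1].intensity = min(damped[i + 1].intensity, second.intensity // divisor)
    return damped


# First sentence tagged happy at 50 percent, second sentence tagged sad at 50 percent.
softened = damp_opposing([MoodTag("happy", 50), MoodTag("sad", 50)])
print([(t.mood, t.intensity) for t in softened])  # [('happy', 25), ('sad', 25)]
```

Tags of the same valence are left unchanged in this sketch, so a message with two smiley faces would keep its initial intensity levels.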
In other embodiments, the TTS system can also lower or increase expressivity based on a number of emoticons per characters of text. For example, if a given paragraph is scattered with emoticons of various moods, then a confidence level can be lowered, or an intensity level of expressivity can be lowered. Conversely, if a given block of text includes multiple emoticons that are all smiley faces, then the system can increase happiness expressivity because of increased confidence of a happy mood. Thus, emoticons can influence both a type of expressivity and an intensity level of expressivity.
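One possible way to express this adjustment is sketched below. The function name `adjust_for_mix`, the starting level of 50 percent, and the step of 10 percent per repeated emoticon are all assumptions chosen for illustration; the disclosure does not prescribe particular values.

```python
from collections import Counter


def adjust_for_mix(moods, base=50, step=10):
    """Map each mood to an intensity (percent) given one mood label per emoticon.

    A single repeated mood raises intensity; a mix of moods lowers it.
    """
    counts = Counter(moods)
    if not counts:
        return {}
    if len(counts) == 1:
        mood, occurrences = next(iter(counts.items()))
        return {mood: min(100, base + step * (occurrences - 1))}
    # Several different moods: split the base level among them.
    return {mood: base // len(counts) for mood in counts}


print(adjust_for_mix(["happy", "happy", "happy"]))  # {'happy': 70}
print(adjust_for_mix(["happy", "sad"]))             # {'happy': 25, 'sad': 25}
```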
The confidence evaluation can be simultaneous with mood tagging, or occur after initial tagging. In some embodiments, a decision engine or module can be used to make micro or macro decisions. For example, TTS expressivity can be modified based on an entire block of text, instead of merely a single sentence from a block of text. The system can make decisions on which phrases to influence, such as by using a sliding window of influence. For example, there may be an emoticon between two sentences. Does this emoticon influence the prior sentence, the subsequent sentence, or both? In some embodiments, this emoticon could be determined to influence the first sentence, and part of the second (subsequent) sentence, and then return to default speech expressivity.
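A sliding window of influence could, for example, be sketched as below. The function `influenced_words` and its `before` and `after` word counts are hypothetical; this sketch simply assumes whitespace-separated tokens and a fixed-size window around the emoticon.

```python
def influenced_words(tokens, emoticon_index, before=3, after=2):
    """Return indices of tokens that fall inside the emoticon's window of influence."""
    start = max(0, emoticon_index - before)
    end = min(len(tokens), emoticon_index + 1 + after)
    # The emoticon itself is not spoken, so it is left out of the result.
    return [i for i in range(start, end) if i != emoticon_index]


tokens = "Great news today :) see you at noon".split()
affected = influenced_words(tokens, tokens.index(":)"))
print([tokens[i] for i in affected])  # ['Great', 'news', 'today', 'see', 'you']
```

In this sketch the words "at noon" fall outside the window and would be read with default expressivity, which corresponds to the case where an emoticon influences the prior sentence and only part of the subsequent sentence.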
Global analysis can help determine transitions and pauses to insert. Some pauses can be based on punctuation. Pauses, however, can be exaggerated. In some embodiments, the system aims to avoid extreme expression swings, such as going from exuberantly happy to miserably sad. For example, if one sentence has a smiley face and then a next sentence has a sad face, one modification response can be represented as extreme happiness to extreme sadness, but this may not be ideal. Alternatively, both the happiness and sadness (or anger) could be subdued. Such conflicting emoticons can affect a confidence level. For example, when exact opposite emoticons are identified close to each other, this may not result in a confidence level sufficient to modify default TTS read back.
There is local and global expressivity available, and both can be tagged. For example, local expressivity can be influenced by emoticons immediately surrounding or close to a given sentence or phrase of a character string. A global level of expressivity can be based on confidence about the mood of the speaker and/or number of emoticons, number of mood transitions, type of mood transitions, etc. For example, there could be a string of smiley faces, which could indicate a globally positive message. In contrast, there could be alternating smiley faces, angry faces, and sad faces throughout a text sample, and such mood swings could lower confidence because quickly switching expressivity among those emotions could result in the text reading seeming unnatural or extreme. Thus, in some embodiments an initial confidence level and/or intensity level is assigned, and then a corresponding passage is rescored after parsing an entire message or unit of text. In some embodiments, the global value can be a multiplier, which can normalize transitions. The global multiplier can also function to increase intensity. For example, if a given text message is identified as having nothing but smiley faces throughout, then the level of intensity for happy expressivity can be increased proportionately.
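The two-pass approach described above (local tagging followed by global rescoring) might be sketched as follows. The functions `global_multiplier` and `rescore`, the boost value of 1.5, and the formula that divides by the number of mood transitions are assumptions made for illustration only.

```python
def global_multiplier(moods, boost=1.5):
    """Derive one multiplier for a whole message from its sequence of mood labels."""
    transitions = sum(1 for a, b in zip(moods, moods[1:]) if a != b)
    if len(moods) >= 2 and transitions == 0:
        return boost  # consistent mood throughout: raise intensity
    return 1.0 / (1 + transitions)  # more mood swings: lower intensity


def rescore(local_intensities, multiplier):
    """Apply the global multiplier to locally assigned intensities (percent)."""
    return [min(100, round(value * multiplier)) for value in local_intensities]


all_happy = ["happy", "happy", "happy"]
print(rescore([50, 50, 50], global_multiplier(all_happy)))  # [75, 75, 75]

mixed = ["happy", "sad"]
print(rescore([50, 50], global_multiplier(mixed)))  # [25, 25]
```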
The TTS system can also incorporate information about the font. For example, bold, italics, and capitalized text can also increase or decrease corresponding intensity levels and/or support confidence levels.
Note that as used herein, “emoticon” refers to any combination of punctuation marks and/or characters appearing in a character or text string used to express a person's mood. This can include pictorial representations of facial expressions. Emoticon also includes graphics or images within text used to convey tone or mood, such as emoji or other picture characters or pictograms. The system can update mood tags as new emoticons are introduced. Conventionally there are numerous emoticons, and some of these can be ambiguous or add nothing to change mood. Thus, optionally, specific emoticons can be ignored or grouped with similar emoticons represented by a single mood tag. Certain TTS systems can include advanced expressivity such as different types of audible happiness, laughs, sadness, and so forth. In other words, there can be more than one way to vary a certain type of expressivity on specific TTS systems (apart from simply increasing or decreasing speed or intensity). TTS systems disclosed herein can maintain mood tags for the various subclasses of moods available for read out.
Functionality associated with TTS expressivity manager 140 will now be discussed via flowcharts and diagrams in
Now describing embodiments more specifically,
In step 220, the TTS expressivity manager identifies an emoticon within the character string, such as by parsing the character string to recognize punctuation mark combinations or graphical characters such as emojis.
In step 230, the TTS expressivity manager tags the character string with an expressivity tag that indicates expressivity corresponding to the emoticon. For example, if the identified emoticon was a smiley face, then the corresponding expressivity tag would indicate a happy mood. Likewise, if the identified emoticon was an angry face, then the corresponding expressivity tag would indicate an angry mood for read out.
In step 240, the TTS expressivity manager converts the character string into an audible signal using a text-to-speech module, such that audible expressivity of the audible signal is based on data from the expressivity tag. In other words, when selecting or modifying a speed, pitch, intonation, prosody, etc. of a read out, the TTS system uses included mood tags to structure or change the expressivity. Note that the TTS system can use concatenated recorded speech (such as stringing together individual phonemes), purely machine-synthesized speech (computer voice), or otherwise.
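The identifying and tagging steps can be sketched, under stated assumptions, as shown below. The `EMOTICON_MOODS` table and the `tag_string` function are hypothetical and cover only a handful of punctuation-based emoticons; the conversion of step 240 is left to whatever text-to-speech module is in use and is not shown.

```python
import re

# Hypothetical mapping from emoticon to mood tag (an assumption for this sketch).
EMOTICON_MOODS = {
    ":)": "happy",
    ":-)": "happy",
    ":(": "sad",
    ":-(": "sad",
    ">:(": "angry",
}

# Longer emoticons are listed first so that ">:(" is preferred over ":(".
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_MOODS, key=len, reverse=True))
)


def tag_string(text):
    """Return (text without emoticons, mood of the first emoticon or None)."""
    match = _PATTERN.search(text)
    if match is None:
        return text, None
    mood = EMOTICON_MOODS[match.group()]
    cleaned = _PATTERN.sub("", text)
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    return cleaned, mood


print(tag_string("See you soon :)"))  # ('See you soon', 'happy')
```

Removing the emoticon from the returned text reflects the goal that the emoticon drives expressivity instead of being spoken aloud as “smiley face.”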
In step 310, the TTS expressivity manager receives a character string, such as a sentence, statement, group of sentences, block of text, or any other unit of text that has at least one emoticon included.
In step 312, the character string includes a sequence of alphanumeric characters, special characters, and spaces.
In step 320, the TTS expressivity manager identifies multiple emoticons within the character string. Note that emoticons that appear at the end of a sentence or text block are still within or part of the character string, such as that composed and sent by another person.
In step 322, the TTS expressivity manager identifies punctuation within the character string, that is, non-emoticon punctuation such as periods, exclamation marks, quotes, and so forth.
In step 330, the TTS expressivity manager tags the character string with expressivity tags that indicate expressivity corresponding to each respective emoticon. For example a mapping table can be used to determine which expressivity tags are used with which emoticons or emoticon combinations.
In step 332, each expressivity tag indicates a type of expressivity and indicates a level of intensity assigned to the type of expressivity. For example, a given expressivity tag might indicate that a type of expressivity is happiness or anger, and then also indicate how strong the happiness or anger should be conveyed. Any scoring system or scale can be used for the intensity level. The intensity level essentially serves to instruct whether the expressivity is going to be conveyed as subdued, moderate, bold, exaggerated, and so forth.
In step 333, each expressivity tag indicates a specific portion of the character string that receives corresponding audible expressivity. This can be accomplished either by specific placement of an expressivity tag, or range indicator. For example, in one embodiment, the expressivity tag can include a pair of tags or a two-part tag where a first tag indicates when a particular type of expressivity should begin, and when/where that particular type of expressivity should terminate. Alternatively, a single expressivity tag can be used that indicates a number of characters/words either before and/or after the expressivity tag that should be modified with the particular type of expressivity.
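As one illustrative possibility, a range indicator can be stored as start and end character offsets on the tag itself. The `ExpressivityTag` class and the `segments` function below are hypothetical, and the sketch assumes that tagged ranges do not overlap.

```python
from dataclasses import dataclass


@dataclass
class ExpressivityTag:
    mood: str
    intensity: int  # percent
    start: int      # offset of the first tagged character
    end: int        # offset one past the last tagged character


def segments(text, tags, default="neutral"):
    """Split text into (substring, mood) pieces; untagged text gets the default mood."""
    pieces = []
    position = 0
    for tag in sorted(tags, key=lambda t: t.start):
        if tag.start > position:
            pieces.append((text[position:tag.start], default))
        pieces.append((text[tag.start:tag.end], tag.mood))
        position = tag.end
    if position < len(text):
        pieces.append((text[position:], default))
    return pieces


text = "I won. It rained."
print(segments(text, [ExpressivityTag("happy", 50, 0, 6)]))
# [('I won.', 'happy'), (' It rained.', 'neutral')]
```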
In step 334, the TTS expressivity manager assigns an initial confidence level to each respective assigned level of intensity based on individual emoticons, and modifies respective assigned levels of intensity based on analyzing the multiple emoticons within the character string as a group. Thus, the TTS expressivity manager can first execute local tagging based on each emoticon occurrence, and then revise/modify confidences and/or intensity levels after examining emoticons within the entire text corpus being analyzed.
In step 335, the TTS expressivity manager analyzes an amount of emoticons within the character string, and modifies intensity levels based on analyzed amounts of emoticons. For example, identifying many emoticons of a same type can increase a corresponding intensity, while identifying multiple emoticons of various types can result in decreasing intensity across various types of expressivity.
In step 336, the TTS expressivity manager analyzes placement of emoticons within the character string, and modifies intensity levels based on analyzed placement of emoticons. For example, if several emoticons appear only at the end of a unit of text, or only at the beginning of a unit of text, then expressivity can be increased or decreased at corresponding sections of the text, and left to a default expressivity at sections with no emoticons.
In step 338, the TTS expressivity manager modifies the expressivity tag based on identified punctuation, such as exclamation point placement. Such punctuation can serve to enhance or influence initial confidence and intensity assignments.
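A minimal sketch of such a punctuation adjustment follows. The function `boost_for_exclamations` and the increment of 10 percent per exclamation point are assumptions; other punctuation, or other weighting schemes, could equally be used.

```python
def boost_for_exclamations(intensity, span, per_mark=10, cap=100):
    """Raise an intensity level (percent) for each exclamation point in the tagged span."""
    return min(cap, intensity + per_mark * span.count("!"))


print(boost_for_exclamations(50, "We won!!"))  # 70
print(boost_for_exclamations(50, "Fine."))     # 50
```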
In step 340, the TTS expressivity manager converts the character string into an audible signal using a text-to-speech module, such that audible expressivity of the audible signal is based on data from the expressivity tags. In other words, a TTS system uses expressivity tags to drive expressivity selected for use during read out.
In step 342, the TTS expressivity manager modifies audible expressivity selected from the group consisting of intonation, prosody, speed, and pitch, as compared to a default audible expressivity.
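For illustration only, the modification relative to a default can be sketched as a set of per-mood offsets scaled by the intensity level. The names `DEFAULT_PROSODY`, `FULL_OFFSETS`, and `prosody_for`, as well as every numeric value, are hypothetical and are not tied to any particular text-to-speech engine.

```python
# Default read-out settings, expressed as multipliers (assumed units).
DEFAULT_PROSODY = {"rate": 1.0, "pitch": 1.0}

# Hypothetical offsets applied at 100 percent intensity.
FULL_OFFSETS = {
    "happy": {"rate": 0.25, "pitch": 0.5},
    "sad": {"rate": -0.25, "pitch": -0.5},
    "angry": {"rate": 0.25, "pitch": -0.25},
}


def prosody_for(mood, intensity):
    """Return prosody settings for a mood, scaled by intensity (percent)."""
    offsets = FULL_OFFSETS.get(mood, {})
    scale = intensity / 100
    return {
        name: default + offsets.get(name, 0.0) * scale
        for name, default in DEFAULT_PROSODY.items()
    }


print(prosody_for("happy", 50))  # {'rate': 1.125, 'pitch': 1.25}
```

A mood with no entry in the table falls back to the default settings, which mirrors leaving untagged text at default expressivity.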
Continuing with
In different embodiments, computer system 149 may be any of various types of devices, including, but not limited to, a cell phone, a personal computer system, desktop computer, laptop, notebook, or netbook computer, mainframe computer system, handheld computer, workstation, network computer, router, network switch, bridge, application server, storage device, a consumer electronics device such as a camera, camcorder, set top box, mobile device, video game console, handheld video game device, or in general any type of computing or electronic device.
Computer system 149 is shown connected to display monitor 130 for displaying a graphical user interface 133 for a user 136 to operate using input devices 135. Repository 138 can optionally be used for storing data files and content both before and after processing. Input devices 135 can include one or more devices such as a keyboard, computer mouse, microphone, etc.
As shown, computer system 149 of the present example includes an interconnect 143 that couples a memory system 141, a processor 142, I/O interface 144, and a communications interface 145, which can communicate with additional devices 137.
I/O interface 144 provides connectivity to peripheral devices such as input devices 135 including a computer mouse, a keyboard, a selection tool to move a cursor, display screen, etc.
Communications interface 145 enables the TTS expressivity manager 140 of computer system 149 to communicate over a network and, if necessary, retrieve any data required to create views, process content, communicate with a user, etc. according to embodiments herein.
As shown, memory system 141 is encoded with TTS expressivity manager 140-1 that supports functionality as discussed above and as discussed further below. TTS expressivity manager 140-1 (and/or other resources as described herein) can be embodied as software code such as data and/or logic instructions that support processing functionality according to different embodiments described herein.
During operation of one embodiment, processor 142 accesses memory system 141 via the use of interconnect 143 in order to launch, run, execute, interpret or otherwise perform the logic instructions of the TTS expressivity manager 140-1. Execution of the TTS expressivity manager 140-1 produces processing functionality in TTS expressivity manager process 140-2. In other words, the TTS expressivity manager process 140-2 represents one or more portions of the TTS expressivity manager 140 performing within or upon the processor 142 in the computer system 149.
It should be noted that, in addition to the TTS expressivity manager process 140-2 that carries out method operations as discussed herein, other embodiments herein include the TTS expressivity manager 140-1 itself (i.e., the un-executed or non-performing logic instructions and/or data). The TTS expressivity manager 140-1 may be stored on a non-transitory, tangible computer-readable storage medium including computer readable storage media such as floppy disk, hard disk, optical medium, etc. According to other embodiments, the TTS expressivity manager 140-1 can also be stored in a memory type system such as in firmware, read only memory (ROM), or, as in this example, as executable code within the memory system 141.
In addition to these embodiments, it should also be noted that other embodiments herein include the execution of the TTS expressivity manager 140-1 in processor 142 as the TTS expressivity manager process 140-2. Thus, those skilled in the art will understand that the computer system 149 can include other processes and/or software and hardware components, such as an operating system that controls allocation and use of hardware resources, or multiple processors.
Those skilled in the art will also understand that there can be many variations made to the operations of the techniques explained above while still achieving the same objectives of the invention. Such variations are intended to be covered by the scope of this invention. As such, the foregoing descriptions of embodiments of the invention are not intended to be limiting. Rather, any limitations to embodiments of the invention are presented in the following claims.
Number | Date | Country | |
---|---|---|---|
20140067397 A1 | Mar 2014 | US |