AUDITORY AUGMENTATION OF SPEECH

INTRODUCTION
Field of the Disclosure

The present disclosure generally relates to audio processing techniques, and more specifically, to techniques for dynamically augmenting voice content with audio content.

BACKGROUND

In today's media-driven society, there are increasingly more ways for a user to access video and audio, with a plethora of devices producing sound in the home, car, or almost any other environment. Portable devices producing audio, such as phones, tablets, laptops, headphones, portable loudspeakers, soundbars, and many other devices, are ubiquitous. The sounds produced by these devices may include, for example, a large variety of audio such as music, speech, podcasts, sound effects, and audio associated with video content.

Additionally, many devices today employ speech recognition technology to allow users to use their voice to interact with the devices. For example, speech recognition technology generally involves converting speech content into text content. The ability to use voice to interact with a device is much easier and more intuitive than using a mouse, keyboard, touchscreen, and other input devices.

SUMMARY

One embodiment described herein is a computer-implemented method. The computer-implemented method includes obtaining, via at least one microphone, voice content within an environment, and determining text content corresponding to the voice content. The computer-implemented method also includes, upon detecting at least one keyword within the text content, determining a first audio content, based at least in part on the at least one keyword. The computer-implemented method also includes predicting at least one emotion associated with the text content, based on evaluating a set of words of the text content with a machine learning algorithm. The computer-implemented method also includes determining a second audio content, based at least in part on evaluating the at least one emotion with a procedural audio engine. The computer-implemented method also includes determining one or more output parameters for at least one of the first audio content or the second audio content, based on one or more acoustic parameters of the voice content. The computer-implemented method further includes controlling one or more transducers within the environment to output at least one of the first audio content or the second audio content, according to the one or more output parameters, as the voice content is output within the environment.

Another embodiment described herein is a computer-implemented method. The computer-implemented method includes obtaining, via at least one microphone communicatively coupled to a loudspeaker device, voice content within an environment. The computer-implemented method also includes determining text content corresponding to the voice content, and determining at least one audio content, based at least in part on the text content. The computer-implemented method further includes dynamically augmenting the voice content within the environment with the at least one audio content, comprising outputting, via a transducer of the loudspeaker device, the at least one audio content in the environment as the voice content is output in the environment.

Another embodiment described herein is a system. The system includes at least one microphone and a loudspeaker communicatively coupled to the at least one microphone. The loudspeaker includes a processor and a memory. The memory stores instructions, which, when executed on the processor, preform an operation. The operation includes obtaining, via the at least one microphone, voice content within an environment. The operation also includes determining text content corresponding to the voice content, and determining at least one audio content, based at least in part on the text content. The operation further includes dynamically augmenting the voice content within the environment with the at least one audio content, comprising outputting, via a transducer of the loudspeaker, the at least one audio content in the environment as the voice content is output in the environment.

Other embodiments provide: an apparatus operable, configured, or otherwise adapted to perform any one or more of the aforementioned methods and/or those described elsewhere herein; a non-transitory, computer-readable media comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform the aforementioned methods as well as those described elsewhere herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the aforementioned methods as well as those described elsewhere herein; and/or an apparatus comprising means for performing the aforementioned methods as well as those described elsewhere herein. By way of example, an apparatus may comprise a processing system, a device with a processing system, or processing systems cooperating over one or more networks.

The following description and the appended figures set forth certain features for purposes of illustration.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, where like designations denote like elements. Note that the appended drawings illustrate typical embodiments and are therefore not to be considered limiting; other equally effective embodiments are contemplated.

FIG. 1 illustrates an example system, according to one embodiment.

FIG. 2 illustrates an example computing environment, according to one embodiment.

FIG. 3 illustrates an example workflow for dynamically augmenting voice content, according to one embodiment.

FIG. 4 further illustrates certain components of the workflow illustrated in FIG. 3, according to one embodiment.

FIG. 5 further illustrates certain components of the workflow illustrated in FIG. 3, according to one embodiment.

FIG. 6 further illustrates certain components of the workflow illustrated in FIG. 3, according to one embodiment.

FIG. 7 is a flowchart of a method for dynamically augmenting voice content with audio content, according to one embodiment.

FIG. 8 is a flowchart of another method for dynamically augmenting voice content with audio content, according to one embodiment.

FIG. 9 is a flowchart of another method for dynamically augmenting voice content with audio content, according to one embodiment.

FIG. 10A, FIG. 10B, and FIG. 10C illustrate an example scenarios in which voice content is dynamically augmented with audio content in real-time, according to one embodiment.

DETAILED DESCRIPTION

The present disclosure provides systems and techniques for dynamically augmenting speech with audio content. More specifically, embodiments provide techniques for augmenting real-time speech with contextual sound reproduced over loudspeaker devices.

In one embodiment described herein, an auditory augmentation system includes a microphone(s) and a loudspeaker(s). The auditory augmentation system may capture real-time speech within an environment via the microphone(s) and analyze the speech in real-time to detect (i) keywords in the speech and (ii) the underlying mood/emotion of the speech. The auditory augmentation system may use the detected keywords to trigger relevant/contextual audio clips from a database and use the classified mood/emotion of the spoken speech to steer a procedural audio engine. The auditory augmentation system may output the triggered audio clips and the procedural audio output from the procedural audio engine from the loudspeaker(s).

As used herein, a hyphenated form of a reference numeral refers to a specific instance of an element and the un-hyphenated form of the reference numeral refers to the collective element. Thus, for example, device “12-1” refers to an instance of a device class, which may be referred to collectively as devices “12” and any one of which may be referred to generically as a device “12”.

Example System with Auditory Augmentation of Speech

FIG. 1 illustrates an example auditory augmentation system 100, which is configured to implement one or more techniques described herein, according to one embodiment. The auditory augmentation system 100 may be located in any environment, such as a home, for example, in a living room, home theater, bedroom, yard, or other room, in a vehicle, in an indoor or outdoor venue, or any other suitable location.

As shown, the auditory augmentation system 100 includes a computing system 110 and one or more loudspeakers 1041-M. The computing system 110 is representative of a variety of computing systems, including, for example, a laptop computer, desktop computer, server, and similar computing devices. In one embodiment, the computing system 110 is located in a cloud computing environment. In such an embodiment, the computing system 110 may include a number of compute resources (e.g., processor(s), memory, and storage) distributed across one or more systems in the cloud computing environment.

The loudspeakers 104 are generally representative of any type of speakers, such as surround-sound speakers, satellite speakers, tower or floor-standing speakers, bookshelf speakers, sound bars, TV speakers, in-wall speakers, smart speakers, and portable speakers. Additionally, the loudspeakers 104 may be installed in fixed positions or moveable. The loudspeakers 104 may be communicatively coupled to the computing system 110 via a wireless or wired connection. That is, the loudspeakers 104 may be wired or wireless. The loudspeakers 104 are generally capable of converting an electrical audio signal into a corresponding sound. As shown, for example, each loudspeaker 104 includes an electroacoustic transducer(s) 140 for converting electrical audio signals into sound. Each loudspeaker 104 also includes a microphone(s) 130 for capturing audio signals in an environment in which the loudspeaker is located.

In some embodiments, the loudspeakers 104 may be controlled via an input controller, such as the computing device 150 (e.g., smartphone or tablet). For example, the computing device 150 may receive user input and may provide corresponding control signals the loudspeakers 104 to control various settings/functions, such as volume, communication settings, and other suitable settings. In some systems, the loudspeakers 104 may have integrated input controllers.

Note, however, that while the FIG. 1 depicts microphone(s) 130 within one or more loudspeakers 104, in certain embodiments, the microphone(s) 130 may be non-collocated with the loudspeakers 104. In these embodiments, the microphone(s) 130 may be communicatively coupled with the loudspeakers 104 via a wireless or wired connection. Additionally, although not depicted in FIG. 1, each loudspeaker 104 may also include one or more speaker drivers, subwoofer drivers, woofer drivers, mid-range drivers, tweeter drivers, coaxial drivers, and amplifiers which may be mounted in a speaker enclosure. Additionally, note that while FIG. 1 depicts the computing system 110 as separate from the loudspeakers 104, in certain embodiments, the computing system 110 is integral with at least one of the loudspeakers 104.

In certain embodiments, each loudspeaker 104 includes an augmentation component 102. Additionally or alternatively, in other embodiments, the computing system 110 includes an augmentation component 102. As described in greater detail below, the augmentation component 102 is configured to implement one or more techniques described herein for dynamically augmenting speech or voice content with audio content in real-time.

Consider the scenario in FIG. 1, which depicts a user 108 interacting with another user 106 within an environment, such as a living room, bedroom, or any other suitable location. In an exemplary scenario, the user 108 (e.g., parent) may be reading a bedtime story to the user 106 (e.g., child). In such a scenario, the voice content 120 of the user 108 may be captured by one or more microphone(s) 130 of one or more loudspeakers 104 in the environment. The augmentation component 102 may analyze the voice content 120 in real-time using speech processing techniques, speech recognition techniques, sentiment analysis techniques, emotion mining, procedural audio algorithms, and combinations thereof, to generate audio content 112 that is contextual with the voice content 120.

For example, assuming the user 108 is reading a bedtime story to the user 106, the augmentation component 102 may employ speech processing to remove background noise from the voice content 120, compute an estimated loudness of the voice content 120, and employ automatic speech recognition (ASR) to convert the voice content 120 to text content. In addition, the augmentation component 102 may analyze the text content to detect certain predefined keywords as well as to predict the underlying emotion of the text content. The augmentation component 102 may then use the predefined keywords and predicted emotion(s) to determine contextual audio content 112 to augment the voice content 120 with in real-time. In the “bedtime story” scenario, for example, when the user 108 reads the word “firework” in the bedtime story, the augmentation component 102 may trigger the loudspeaker(s) 104 to output a firework audio clip. Additionally, the augmentation component 102 may trigger the loudspeaker(s) 104 to play ambient background music that is associated with the underlying emotion of the portion of the bedtime story being read to the user 106. Further, as the mood of the bedtime story shifts (e.g., from upbeat to dark/mysterious or vice versa), the augmentation component 102 may trigger the loudspeaker(s) 104 to gradually shift the ambient background music to match the current underlying emotion of the portion of the bedtime story being read to the user 106.

In some embodiments, the augmentation component 102 also controls one or more output parameters of the audio content 112, based on one or more acoustic parameters of the voice content 120. For example, the augmentation component 102 may compute an estimated loudness of the voice content 120 and use the estimated loudness to adjust the output gain of the loudspeaker(s) 104 to ensure that the voice content 120 is not masked by the loudspeaker(s) 104. It should be understood that the “bedtime story” examples described herein ae merely illustrative, and that the augmentation component 102 may dynamically augment voice content in any scenario in which live or pre-recorded voice content is captured in an environment.

In this manner, embodiments described herein provide techniques that allow for dynamic augmentation of speech or voice content in real-time with contextual audio content in order to create an enhanced immersive auditory experience for users in an environment. For example, the techniques described herein can dynamically adapt to different acoustic environments when performing dynamic augmentation of speech, such that the auditory experience is modified in a way that is intuitive and expected by the users (e.g., speaker and listeners).

FIG. 2 illustrates an example of a computing environment 200 used to dynamically augment voice content with audio content, according to one embodiment. As shown, the computing environment 200 includes one or more loudspeakers 1041-M, computing system 110, and the computing device 150, which are interconnected via a network 240.

The network 240, in general, may be a wide area network (WAN), a local area network (LAN), a wireless LAN, a personal area network (PAN), a cellular network, a wired network, etc. In a particular embodiment, the network 240 is the Internet. Wireless connections between components of the computing environment 200 may be provided via a short-range wireless communication technology, such as Bluetooth, WiFi, ZigBee, ultra wideband (UWB), or infrared. Wired connections between components of the multimedia system 100 may be via auxiliary audio cable, universal serial bus (USB), high-definition multimedia interface (HDMI), video graphics array (VGA), or any other suitable wired connection.

As shown, each loudspeaker 104 includes a processor 202, a memory 204, a storage 206, one or more sensors 208, and a network interface 212. The processor 202 represents any number of processing elements, which can include any number of processing cores. The memory 204 can include volatile memory, non-volatile memory, and combinations thereof.

The memory 204 generally includes program code for performing various functions for dynamically augmenting voice content with audio content. The program code is generally described as various functional “components” or “modules” within the memory 204, although alternate implementations may have different functions or combinations of functions. Here, the memory 204 includes an augmentation component 102. The augmentation component 102 includes an analysis component 222, an analysis component 224, an analysis component 226, and analysis component 228, and a loudspeaker renderer 230, each of which is a software component and is described in greater detail below.

The storage 206 may be a disk drive storage device. Although shown as a single unit, the storage 206 may be combination of fixed and/or removable storage devices, such as fixed disc drives, removable memory cards, optical storage, network attached storage (NAS), or a storage area network (SAN). Here, the storage 206 includes audio content 112, speech content 120, ML algorithm(s) 232, and a local reproduction setup 234, which is described in greater detail herein.

The sensor(s) 208 includes one or more microphone(s) 130 for recording audio content and one or more electroacoustic transducer(s) 140 for converting electrical audio signals into corresponding sounds. In general, however, the sensor(s) 208 can include any suitable type of sensor that is configured to sense information from the physical environment. The network interface 212 may be any type of network communications interface that allows the loudspeaker 104 to communicate with other computers and/or components in the computing environment 200 via a data communications network (e.g., network 240).

Computing device 150 is generally representative of a mobile or handheld computing device, including, for example, a smartphone, a tablet, a laptop computer, etc. Here, the computing device 150 includes a processor 250, a memory 252, a storage 258, a screen 260, and a network interface 262. The processor 250 represents any number of processing elements, which can include any number of processing cores. The memory 252 can include volatile memory, non-volatile memory, and combinations thereof.

The memory 252 generally includes program code for performing various functions related to applications (e.g., application 256, browser 254) hosted on the computing device 150. The program code is generally described as various functional “applications” or “modules” within the memory 252, although alternate implementations may have different functions or combinations of functions. Here, the memory 252 includes a browser 254 and an application 256. The application 256 and/or browser 254 may be used for a variety of functions, including, for example, accessing voice content, accessing computing system 110 (including augmentation component 102), playing audio content, accessing/controlling settings of the loudspeakers 104, and other suitable functions.

In particular, the browser 254 may be used to access the computing system 110 by rendering web pages received from the computing system 110. The application 256 may be representative of a component of a client server application or other distributed application which can communicate with the computing system 110 over the network 240. Application 256 may be a “thin” client where the processing is largely directed by the application 256, but performed by computing systems, or a conventional software application installed on the computing device 150.

The storage 258 may be a disk drive storage device. Although shown as a single unit, the storage 258 may be combination of fixed and/or removable storage devices, such as fixed disc drives, removable memory cards, optical storage, network attached storage (NAS), or a storage area network (SAN). Here, the storage 258 includes the local reproduction setup 234, which is described in greater detail herein. The screen 260 may include a Liquid Crystal Display (LCD), a Light Emitting Diode (LED), or other display technology. In one embodiment, the screen 260 includes a touch-screen interface. The network interface 262 may be any type of network communications interface that allows the computing device 150 to communicate with other computers and/or components in the computing environment 200 via a data communications network (e.g., network 240).

Note that FIG. 2 illustrates a reference example of a computing environment 200 in which the techniques presented herein can be implemented and that the techniques presented herein can be implemented in other computing environments.

FIG. 3 illustrates an example workflow 300 for dynamically augmenting voice content with audio content, according to one embodiment. In the workflow 300, the analysis component 222 is configured to employ speech processing techniques on the voice content 120. The speech processing techniques may be used to filter speech signals, for example, to remove competing talkers or background noise. For example, the analysis component 222 includes a speech processing tool 310, which receives audio samples 302 of the voice content 120. The speech processing tool 310 may process the audio samples 302 to improve the speech to noise ratio and may output filtered audio samples 304. In particular, the speech processing tool 310 may remove background noise from the audio samples 302. The filtered audio samples 304 are then provided to the analysis component 224, the analysis component 226, and the analysis component 228.

The analysis component 224 is generally configured to convert the filtered audio samples 304 into text content, and analyze the text content in order to detect and trigger keywords within the text content. As shown, the analysis component 224 includes a keyword detection component 320, a keyword trigger tool 330, and a keyword sounds database 340. The keyword detection component 320 analyzes text content associated with the filtered audio samples 304 at a fine grain resolution to detect a keyword(s) 322.

The keyword trigger tool 330 receives the detected keyword(s) 322 and selects one or more of the detected keyword(s) 322 to use as the triggered keyword(s) 324. For example, assuming the keyword detection component 320 identifies “firework,” “thunder,” and “ocean,” as detected keywords 322, the keyword trigger tool 330 may select “firework” as the triggered keyword 324. In some embodiments, the triggering of detected keyword(s) 322 may be controlled by a time constant to avoid keywords being triggered too often or too seldom.

The keyword trigger tool 330 may use the triggered keyword 324 to retrieve a corresponding audio file from a database. In FIG. 3, for example, the keyword trigger tool 330 retrieves audio samples 306, which correspond to the triggered keyword 324. In addition to the audio samples 306, the keyword trigger tool 330 may obtain metadata 312, which includes relevant contextual data to the audio samples 306 (e.g., target spatial location of the audio samples 306). The analysis component 224 may provide the audio samples 306 and the metadata 312 to a loudspeaker renderer 230.

In some embodiments, the analysis component 224 may use a script 326 containing text content of the voice content 120 to aid the keyword detection. For example, if the user is reading a book, then the user's voice content 120 may be known a priori, and the script 326 containing text content of the book can be used by the keyword detection component 320 to detect keyword(s) 322. Note, the analysis component 224 is described in greater detail below with respect to FIG. 4.

The analysis component 226 is generally configured to perform sentiment analysis to predict the mood or emotion of the voice content 120 and generate audio content, based on the predicted mood or emotion of the voice content 120. As shown, the analysis component 226 includes an emotion analysis tool 350 and a procedural audio component 360. The emotion analysis tool 350 is configured to predict the current underlying emotion 352 of the voice content 120, where the predicted emotion 352 is a time-varying parameter. For example, the emotion analysis tool 350 receives the filtered audio samples 304 and may evaluate the filtered audio samples 304 using one or more sentiment analysis techniques. In one embodiment described below, the emotion analysis tool 350 employs emotion mining to predict the current underlying emotion of the text content (e.g., words/sentences) of the filtered audio samples 304. The predicted emotion 352 is then provided to the procedural audio component 360.

The procedural audio component 360 is generally configured to dynamically generate audio clips, based at least in part on the current predicted emotion 352. Procedural audio generally involves adapting the playback of sound in real-time. Such sound adaptation may be achieved by creating a database of individual sound clips and dynamically selecting a subset of clips, which when combined, elicits a certain percept. Additionally, procedural audio may also involve real-time audio generation. Real-time audio generation can be achieved by controlling individual sound generators (e.g., waveform/tone generators) to compose sound in real-time. In the example depicted in FIG. 3, the procedural audio component 360 generates audio samples 308, based on the current predicted emotion 352 of the filtered audio samples 304 and provides the audio samples 308 to the loudspeaker renderer 230. Note, the analysis component 226 is described in greater detail below with respect to FIG. 5.

The analysis component 228 is generally configured to determine one or more acoustic parameters of the voice content 120. As shown, the analysis component 228 includes a loudness/energy calculation component 370 and a loudness compensation component 380. The loudness/energy calculation component 370 estimates the loudness of the filtered audio samples 304 and generates a loudness estimate 372. The loudness compensation 380 uses the loudness estimate 372 to control the output level of the loudspeaker(s) 104. For example, the loudness compensation 380 may compute a gain 382, based on the loudness estimate 372. The gain 328 may be used to dynamically adjust the output level of the loudspeaker(s) 104 to ensure that the reproduced audio is at a target relative level to the speech (e.g., the reproduced audio is audible and does not mask the speaker of the voice content 120).

The loudspeaker renderer 230 is generally configured to map input audio signals to the loudspeakers 104. As shown, the loudspeaker renderer 230 receives audio samples 306, metadata 312, audio samples 308, gain 382, and a local reproduction setup 234. In the embodiment depicted in FIG. 3, the loudspeaker renderer 230 maps the audio samples 306 and the audio samples 308 to one or more of the loudspeakers 104, based on the metadata 312 and the local reproduction setup 234.

The local reproduction setup 234 includes an indication of at least one of (i) the number of loudspeakers 104 within an environment, (ii) the positions of the loudspeakers 104 within an environment, or (iii) capabilities of loudspeakers 104 (e.g., maximum output level, frequency response, sensitivity, non-linearities, and other suitable parameters). For example, different environments may include different numbers of loudspeakers with different capabilities as well as different loudspeaker arrangements.

For example, the loudspeaker renderer 230 can apply positional information within the metadata 312 to the audio samples 306. The loudspeaker renderer 230 outputs M-channels audio signal 392 containing the audio samples 306 and the audio samples 308 to the loudspeakers 104. The output level of the loudspeakers 104 may be controlled by the gain 382. Note, the loudspeaker renderer 230 is described in greater detail below with respect to FIG. 6.

FIG. 4 further illustrates certain components of the analysis component 224 of the workflow 300, described relative to FIG. 3, according to one embodiment. As shown, the keyword detection component 320 includes a speech-to-text converter 410, a text database 420, and a detector component 430. In one embodiment, the speech-to-text converter 410 converts the filtered audio samples 304 into text content 412. The speech-to-text converter 410 can employ any suitable ASR technique to convert the filtered audio samples 304 to text content 412. The text content 412 is stored as words in the text database 420 and parsed as individual words 414.

The detector component 430 implements the keyword detection stage. For example, the detector component 430 may scan the incoming word(s) 414 and flag if a keyword 322 is detected. In one embodiment, the keywords are predefined by the auditory augmentation system 100. In one example, the keywords may be located within a keyword sounds database 340, which is database of sounds being used. In such an example, a keyword may exist if a corresponding audio file in the keyword sounds database 340 exists. As shown, for example, the detector component 430 may receive a bias 444, which includes an indication of keyword(s) within the keyword sounds database 340.

Additionally, in embodiments where a script 326 is available (e.g., known a priori), the script 326 can be used to bias the keyword detection stage. For example, the detector component 430 may receive a bias 442, which includes an indication of keyword(s) from the script 326. Additionally, in some embodiments, the script 326 itself can be used to bias the keyword sounds database 340. For example, if a keyword from the script 326 does not have a corresponding audio file, the keyword sounds database 340 may be updated.

Once keyword(s) 322 have been detected, the keyword trigger tool 330 implements a keyword trigger stage, which involves triggering one or more of the detected keywords 322. In one embodiment, the triggering of a detected keyword(s) 322 is controlled by the time constant 450. When the keyword trigger tool 330 triggers a keyword, a corresponding sound file is retrieved from the keyword sound database 340. As shown, for example, the keyword trigger tool 330 transmits a sound file request 464, which includes the triggered keyword 324, to the keyword sounds database 340. In response to the sound file request 464, the keyword trigger tool 330 receives the corresponding sound file 462 as well as metadata 312 associated with the sound file 462. The keyword trigger tool 330 may provide audio samples 306 of the sound file 462 and the metadata 312 to the loudspeaker renderer 230.

FIG. 5 further illustrates certain components of the analysis component 226 of the workflow 300, described relative to FIG. 3, according to one embodiment. As shown, the emotion analysis tool 350 includes a speech-to-text converter 510, a text database 520, and an emotion mining component 530. In one embodiment, the speech-to-text converter 510 converts the filtered audio samples 304 into text content 512. The speech-to-text converter 510 can employ any suitable ASR technique to convert the filtered audio samples 304 to text content 512. The text content 512 is stored as words in the text database 420 and parsed as individual words, sentences, or longer chunks of text (collectively referred to as document(s) 514).

The emotion mining component 530 parses through the document(s) 514 and evaluates the document(s) 514 using an emotion mining algorithm to determine a predicted emotion P(emotion) 532 associated with the document(s) 514. The resolution of the parsing of the document(s) 514 may be controlled by the time constant 552. For example, shorter time constants may result in a predicted emotion varying more rapidly, whereas longer time constants may result in a predicted emotion varying less often.

The emotion mining component 530 may employ any suitable emotion mining algorithm to determine P(emotion) 532. For example, the emotion mining component may implement emotion detection (e.g., detecting if text conveys any type of emotion), emotion polarity classification (e.g., determining polarity of existing emotion), emotion classification (e.g., fine grained classification of emotion of text into one or more emotions), and/or emotion cause detection (e.g., determining the reason behind a certain detected emotion). In one embodiment, the emotion mining component 530 uses a 2-dimensional emotional space as the underlying model of the emotion mining algorithm, which outputs a prediction value for each of a set of emotions. In another embodiment, the emotion mining component 530 may use an emotion mining algorithm that classifies the input as a single emotion.

In FIG. 5, the emotion mining component 530 passes the predicted emotion 532 to a smoothing component 540, which is configured to process the predicted emotion 532 with a smoothing algorithm to reduce rapid changes in the predicted emotion 532. The smoothing component 540 outputs a smoothed predicted emotion 534 to the weighting component 550. The weighting component 550 is configured to apply a set of weights to smoothed predicted emotion 534 to give greater emphasis to certain emotions. The weighting component 550 may apply weights based on a priori knowledge of the input voice content 120 and/or corresponding text content. For example, if it is known that the voice content 120 contains zero references to “sadness,” then the weighting can be set to 0 for “sadness,” to minimize any false positives for “sadness” emotion. The weighting component 550 outputs the weighted predicted emotion 352 to the procedural audio component 360.

The procedural audio component 360 uses the predicted emotion 352 to drive a procedural audio engine 560. As noted, the procedural audio engine 560 may dynamically select a set of audio clips from the sounds database 572, based on the predicted emotion 352 to elicit a certain perception. In some embodiments, assuming the script 326 is available, the script 326 can be used to update the sounds database 572. The procedural audio component 360 outputs the audio samples 308 (containing the selected audio clips) to the loudspeaker renderer 230.

FIG. 6 further illustrates certain components of the loudspeaker renderer 230 of the workflow 300, described relative to FIG. 3, according to one embodiment. As shown, the loudspeaker renderer 230 includes a distribution tool 610 and a panning algorithm 620. The distribution tool 610 receives the audio samples 308, and provides the audio samples 308 and metadata 670 to the panning algorithm 620. In one embodiment, the metadata 670 includes an indication to distribute the audio samples 308 to the loudspeakers 104 equally. The distribution tool 610 may generate the metadata 670 based on the local reproduction setup 234. For example, the metadata 670 may set the target positions for the audio samples 308 to be equally distributed across the real positions for the loudspeakers 104.

The panning algorithm 620 is configured to map the audio samples 306 and audio samples 308 to the available loudspeakers 104, based on the metadata 312, metadata 670, and local reproduction setup 234. The panning algorithm 620 may be any suitable panning algorithm that maps input signals with target spatial locations to output loudspeaker signals with known spatial locations. For the audio samples 306, the panning algorithm 620 may use the corresponding metadata 312 to position the audio samples 306 at specific target locations in space. For example, if the triggered keyword audio is a “bird,” then the metadata 312 may inform the panning algorithm 620 that the target spatial location is elevated. In such an example, the panning algorithm 620 may determine whether elevating the target spatial location for the audio samples 306 is possible, based on the local reproduction setup 234 (e.g., elevating the target spatial location may be a function of the real loudspeaker positions). As shown, the panning algorithm 620 outputs at least one of the audio samples 306 or the audio samples 308 as M-channels audio 392.

FIG. 7 is a flowchart of a method 700 for dynamically augmenting voice content with audio content, according to one embodiment. The method 700 may be performed by an augmentation component (e.g., augmentation component 102).

Method 700 may enter at block 702, where the augmentation component obtains voice content (e.g., voice content 120) within an environment. As noted, the augmentation component may obtain the voice content via a microphone(s) 130. The microphone(s) 130 may be integral with one or more loudspeakers (e.g., loudspeakers 104) in the environment or separate from the loudspeakers in the environment.

At block 704, the augmentation component determines text content (e.g., text content 412, 512) corresponding to the voice content. For example, the augmentation component may use ASR techniques to convert the voice content into text content. At block 706, the augmentation component determines one or more audio content (e.g., audio samples 306 and audio samples 308), based in part on the text content. At block 708, the augmentation component dynamically augments the voice content by outputting the one or more audio content as the voice content is output in the environment.

FIG. 8 is a flowchart of another method 800 for dynamically augmenting voice content with audio content, according to one embodiment. The method 800 may be performed by an augmentation component (e.g., augmentation component 102).

Method 800 may enter at block 802, where the augmentation component obtains voice content (e.g., voice content 120) within an environment. As noted, the augmentation component may obtain the voice content via a microphone(s) 130. The microphone(s) 130 may be integral with one or more loudspeakers (e.g., loudspeakers 104) in the environment or separate from the loudspeakers in the environment.

At block 804, the augmentation component determines text content (e.g., text content 412, 512) corresponding to the voice content. For example, the augmentation component may use ASR techniques to convert the voice content into text content.

At block 806, the augmentation component detects at least one keyword (e.g., keyword 322, 324) within the text content. At block 808, the augmentation component predicts at least one emotion associated with the text content (e.g., predicted emotion 352). At block 810, the augmentation component determines a first audio content (e.g., audio samples 306), based at least in part on the at least one keyword.

At block 812, the augmentation component determines a second audio content (e.g., audio samples 308), based at least in part on the at least one emotion associated with the text content. At block 814, the augmentation component determines one or more output parameters (e.g., gain 382) for at least one of the first audio content or the second audio content, based at least in part on one or more acoustic parameters (e.g., loudness estimate 372) of the voice content. At block 816, the augmentation component outputs at least one of (i) the first audio content or (ii) the second audio content using the one or more output parameters.

FIG. 9 is a flowchart of another method 900 for dynamically augmenting voice content with audio content, according to one embodiment. The method 900 may be performed by an augmentation component (e.g., augmentation component 102).

The method 900 may be performed while receiving voice content (e.g., voice content 120) in an environment. At block 902, the augmentation component converts the voice content into text content (e.g., text content 412, 512). At block 904, the augmentation component determines whether a keyword (e.g., keyword 322, 324) is detected within the text content. If so, at block 906, the augmentation component retrieves first audio content (e.g., audio samples 306) associated with the keyword from a database (e.g., keyword sounds database 340).

At block 908, the augmentation component determines an emotion (e.g., predicted emotion 352) associated with a set of words within the text content. At block 910, the augmentation component determines, via a procedural audio engine (e.g., procedural audio engine 560), second audio content (e.g., audio samples 308) associated with the emotion.

At block 912, the augmentation component determines at least one acoustic parameter (e.g., loudness estimate 372) of the voice content. At block 914, the augmentation component obtains configuration information associated with one or more loudspeakers (e.g., loudspeakers 104) in the environment (e.g., local reproduction setup 234). At block 916, the augmentation component determines an output level of the one or more loudspeakers in the environment, based on the at least one acoustic parameter.

At block 918, the augmentation component maps at least one of the first audio content (if available) or the second audio content to the one or more of the loudspeakers in the environment, based on the configuration information. At block 920, the augmentation component outputs at least one of the first audio content (if available) or the second audio content from the one or more loudspeakers, according to the mapping and the output level.

FIGS. 10A-10C illustrate an example scenario in which voice content is dynamically augmented with audio content in real-time, according to one embodiment. As shown in FIG. 10A, a first user may be reading a book to a second user in the environment 1000 (e.g., bedroom). As the first user reads the book, the loudspeakers 1041-2 in the environment 1000 may capture the voice content 120 of the first user in real-time (e.g., as the first user is speaking). In FIG. 10A, for example, at a particular point in time, the loudspeakers 1041-2 detect the first user speaking “It was a beautiful day . . . .” In response to detecting this voice content 120, the loudspeakers 1041-2 output background ambient audio (e.g., audio samples 308-1) that is associated with “happy” and “uplifting” emotions being conveyed by the voice content 120.

As shown in FIG. 10B, at a subsequent point in time to FIG. 10A, the loudspeakers 1041-2 detect the first user speaking “A firework explodes!.” Here, the loudspeakers 1041-2 may detect the keyword 322 (e.g., “firework”) within the voice content 120 and trigger a corresponding audio clip of a “firework” sound (e.g., audio samples 306). Similarly, as shown in FIG. 10C, at a subsequent point in time to FIG. 10A or FIG. 10B, the loudspeakers 1041-2 may detect a change in mood as the first user reads the book. For example, the loudspeakers 1041-2 may detect the first user speaking “The dragon swept in . . . .” In response to detecting this voice content 120, the loudspeakers 1041-2 may gradually shift to outputting background ambient audio (e.g., audio samples 308-2) associated with “dramatic” and “fearful” emotions being conveyed by the voice content 120.

Advantageously, by providing techniques that allow for dynamic augmentation of speech or voice content in real-time with contextual audio content, embodiments allow devices to create an enhanced immersive auditory experience for users in an environment. For example, the techniques described herein can dynamically adapt to different acoustic environments when performing dynamic augmentation of speech, such that the auditory experience is modified in a way that is intuitive and expected by the users (e.g., speaker and listeners).

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

In the preceding, reference is made to embodiments presented in this disclosure. However, the scope of the present disclosure is not limited to specific described embodiments. Instead, any combination of the features and elements described herein, whether related to different embodiments or not, is contemplated to implement and practice contemplated embodiments. Furthermore, although embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the scope of the present disclosure. Thus, the aspects, features, embodiments and advantages described herein are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s). Likewise, reference to “the invention” shall not be construed as a generalization of any inventive subject matter disclosed herein and shall not be considered to be an element or limitation of the appended claims except where explicitly recited in a claim(s).

Aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.”

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Embodiments of the invention may be provided to end users through a cloud computing infrastructure. Cloud computing generally refers to the provision of scalable computing resources as a service over a network. More formally, cloud computing may be defined as a computing capability that provides an abstraction between the computing resource and its underlying technical architecture (e.g., servers, storage, networks), enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. Thus, cloud computing allows a user to access virtual computing resources (e.g., storage, data, applications, and even complete virtualized computing systems) in “the cloud,” without regard for the underlying physical systems (or locations of those systems) used to provide the computing resources.

Typically, cloud computing resources are provided to a user on a pay-per-use basis, where users are charged only for the computing resources actually used (e.g. an amount of storage space consumed by a user or a number of virtualized systems instantiated by the user). A user can access any of the resources that reside in the cloud at any time, and from anywhere across the Internet. In context of the present invention, a user may access applications or related data (e.g., augmentation component 102) available in the cloud. Doing so allows a user to access this information from any computing system attached to a network connected to the cloud (e.g., the Internet).

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

AUDITORY AUGMENTATION OF SPEECH

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims