1. Field of the Invention
The invention relates to systems and methods for improving intelligibility of human speech (e.g., dialog) determined by a multi-channel audio signal. In some embodiments, the invention is a method and system for filtering an audio signal having a speech channel and a non-speech channel to improve intelligibility of speech determined by the signal, by determining at least one attenuation control value indicative of a measure of similarity between speech-related content determined by the speech channel and speech-related content determined by the non-speech channel, and attenuating the non-speech channel in response to the attenuation control value.
2. Background of the Invention
Throughout this disclosure including in the claims, the term “speech” is used in a broad sense to denote human speech. Thus, “speech” determined by an audio signal is audio content of the signal that is perceived as human speech (e.g., dialog, monologue, singing, or other human speech) upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer). In accordance with typical embodiments of the invention, the audibility of speech determined by an audio signal is improved relative to other audio content (e.g., instrumental music or non-speech sound effects) determined by the signal, thereby improving the intelligibility (e.g., clarity or ease of understanding) of the speech.
Throughout this disclosure including in the claims, the expression “speech-enhancing content” of a channel of a multi-channel audio signal is content (determined by the channel) that enhances the intelligibility or other perceived quality of speech content determined by another channel (e.g., a speech channel) of the signal.
Typical embodiments of the invention assume that the majority of speech determined by a multi-channel input audio signal is determined by the signal's center channel. This assumption is consistent with the convention in surround sound production according to which the majority of speech is usually placed into only one channel (the Center channel), and the majority of music, ambient sound, and sound effects is usually mixed into all the channels (e.g., the Left, Right, Left Surround and Right Surround channels as well as the Center channel).
Thus, the center channel of a multi-channel audio signal will sometimes be referred to herein as the “speech” channel and all other channels (e.g., Left, Right, Left Surround, and Right Surround channels) of the signal will sometimes be referred to herein as “non-speech” channels. Similarly, a “center” channel generated by summing the left and right channels of a stereo signal whose speech is center panned will sometimes be referred to herein as a “speech” channel, and a “side” channel generated by subtracting such a center channel from the stereo signal's left (or right) channel will sometimes be referred to herein as a “non-speech” channel.
Throughout this disclosure including in the claims, the expression performing an operation “on” signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout the disclosure including in the claims, the expression “ratio” of a first value (“A”) to a second value (“B”) is used in a broad sense to denote A/B, or B/A, or a ratio of a scaled or offset version of one of A and B to a scaled or offset version of the other one of A and B (e.g., (A+x)/(B+y), where x and y are offset values).
Throughout the disclosure including in the claims, the expression “reproduction” of signals by sound-emitting transducers (e.g., speakers) denotes causing the transducers to produce sound in response to the signals, including by performing any required amplification and/or other processing of the signals.
When speech is heard in the presence of competing sounds (such as listening to a friend over the noise of a crowd in a restaurant), a portion of the acoustic features that signal the phonemic content of the speech (speech cues) are masked by the competing sounds and are no longer available to the listener to decode the message. As the level of the competing sound increases relative to the level of the speech, the number of speech cues that are received correctly diminishes and speech perception becomes progressively more difficult until, at some level of competing sound, the speech perception process breaks down. While this relation holds true for all listeners, the level of competing sound that can be tolerated for any speech level is not the same for all listeners. Some listeners, e.g., those with hearing loss due to aging (presbyacusis) or those listening to a language that they acquired after puberty, are less capable of tolerating competing sounds than are listeners with good hearing or those operating in their native language.
The fact that listeners differ in their ability to understand speech in the presence of competing sounds has implications for the level at which ambient sounds and background music in news or entertainment audio are mixed with speech. Listeners with hearing loss or those operating in a foreign language often prefer a lower relative level of non-speech audio than that provided by the content creator.
To accommodate these special needs, it is known to apply attenuation (ducking) to non-speech channels of a multi-channel audio signal, but less (or no) attenuation to the signal's speech channel, to improve intelligibility of speech determined by the signal.
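By way of illustration only, the basic ducking operation can be sketched as follows in Python (this sketch is not drawn from any cited reference, and the fixed 12 dB attenuation is an arbitrary assumption):

    import numpy as np

    def duck_non_speech(speech, non_speech_channels, ducking_gain_db=-12.0):
        # Conventional ducking: pass the speech channel through unchanged and
        # attenuate every non-speech channel by a fixed gain (here -12 dB,
        # an arbitrary illustrative value).
        gain = 10.0 ** (ducking_gain_db / 20.0)
        return speech, [np.asarray(ch) * gain for ch in non_speech_channels]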
For example, PCT International Application Publication Number WO 2010/011377, naming Hannes Muesch as inventor and assigned to Dolby Laboratories Licensing Corporation (published Jan. 28, 2010), discloses that non-speech channels (e.g., left and right channels) of a multi-channel audio signal may mask speech in the signal's speech channel (e.g., center channel) to the point that a desired level of speech intelligibility is no longer met. WO 2010/011377 describes how to determine an attenuation function to be applied by ducking circuitry to the non-speech channels in an attempt to unmask the speech in the speech channel while preserving as much of the content creator's intent as possible. The technique described in WO 2010/011377 is based on the assumption that content in a non-speech channel never enhances the intelligibility (or other perceived quality) of speech content determined by the speech channel.
The present invention is based in part on the recognition that, while this assumption is correct for the vast majority of multi-channel audio content, it is not always valid. The inventor has recognized that when at least one non-speech channel of a multi-channel audio signal does include content that enhances the intelligibility (or other perceived quality) of speech content determined by the signal's speech channel, filtering of the signal in accordance with the method of WO 2010/011377 can negatively affect the entertainment experience of one listening to the reproduced filtered signal. In accordance with typical embodiments of the present invention, application of the method described in WO 2010/011377 is suspended or modified during times when content does not conform to the assumptions underlying the method of WO 2010/011377.
There is a need for a method and system for filtering a multi-channel audio signal to improve speech intelligibility in the common case that at least one non-speech channel of the audio signal includes content that enhances the intelligibility of speech content in the audio signal's speech channel.
In a first class of embodiments, the invention is a method for filtering a multi-channel audio signal having a speech channel and at least one non-speech channel, to improve intelligibility of speech determined by the signal. The method includes steps of: (a) determining at least one attenuation control value indicative of a measure of similarity between speech-related content determined by the speech channel and speech-related content determined by at least one non-speech channel of the multi-channel audio signal; and (b) attenuating at least one non-speech channel of the multi-channel audio signal in response to the at least one attenuation control value. Typically, the attenuating step comprises scaling a raw attenuation control signal (e.g., a ducking gain control signal) for the non-speech channel in response to the at least one attenuation control value. Preferably, the non-speech channel is attenuated so as to improve intelligibility of speech determined by the speech channel without undesirably attenuating speech-enhancing content determined by the non-speech channel. In some embodiments, each attenuation control value determined in step (a) is indicative of a measure of similarity between speech-related content determined by the speech channel and speech-related content determined by one non-speech channel of the audio signal, and step (b) includes the step of attenuating this non-speech channel in response to said each attenuation control value. In some other embodiments, step (a) includes a step of deriving a derived non-speech channel from at least one non-speech channel of the audio signal, and the at least one attenuation control value is indicative of a measure of similarity between speech-related content determined by the speech channel and speech-related content determined by the derived non-speech channel. For example, the derived non-speech channel can be generated by summing or otherwise mixing or combining at least two non-speech channels of the audio signal. Determining each attenuation control value from a single derived non-speech channel can reduce the cost and complexity of implementing some embodiments of the invention, relative to the cost and complexity of determining different subsets of a set of attenuation values from different non-speech channels. In embodiments in which the input audio signal has at least two non-speech channels, step (b) can include the step of attenuating a subset of the non-speech channels (e.g., each non-speech channel from which a derived non-speech channel has been derived), or all of the non-speech channels, in response to the at least one attenuation control value (e.g., in response to a single sequence of attenuation control values).
In some embodiments in the first class, step (a) includes a step of generating an attenuation control signal indicative of a sequence of attenuation control values, each of the attenuation control values indicative of a measure of similarity between speech-related content determined by the speech channel and speech-related content determined by the at least one non-speech channel at a different time (e.g., in a different time interval), and step (b) includes steps of: scaling a ducking gain control signal in response to the attenuation control signal to generate a scaled gain control signal, and applying the scaled gain control signal to attenuate the at least one non-speech channel (e.g., asserting the scaled gain control signal to ducking circuitry to control attenuation of the at least one non-speech channel by the ducking circuitry). For example, in some such embodiments, step (a) includes a step of comparing a first speech-related feature sequence (indicative of the speech-related content determined by the speech channel) to a second speech-related feature sequence (indicative of the speech-related content determined by the at least one non-speech channel) to generate the attenuation control signal, and each of the attenuation control values indicated by the attenuation control signal is indicative of a measure of similarity between the first speech-related feature sequence and the second speech-related feature sequence at a different time (e.g., in a different time interval). In some embodiments, each attenuation control value is a gain control value.
In some embodiments in the first class, each attenuation control value is monotonically related to likelihood that at least one non-speech channel of the audio signal is indicative of speech-enhancing content that enhances the intelligibility (or another perceived quality) of speech content determined by the speech channel. In some other embodiments in the first class, each attenuation control value is monotonically related to an expected speech-enhancing value of the at least one non-speech channel (e.g., a measure of probability that the at least one non-speech channel is indicative of speech-enhancing content, multiplied by a measure of perceived quality enhancement that speech-enhancing content determined by the at least one non-speech channel would provide to speech content determined by the multi-channel signal). For example, where step (a) includes a step of comparing a first speech-related feature sequence indicative of speech-related content determined by the speech channel to a second speech-related feature sequence indicative of speech-related content determined by the at least one non-speech channel, the first speech-related feature sequence may be a sequence of speech likelihood values, each indicating the likelihood at a different time (e.g., in a different time interval) that the speech channel is indicative of speech (rather than audio content other than speech), and the second speech-related feature sequence may also be a sequence of speech likelihood values, each indicating the likelihood at a different time (e.g., in a different time interval) that the at least one non-speech channel is indicative of speech. Various methods of automatically generating such sequences of speech likelihood values from an audio signal are known. For example, one such method is described by Robinson and Vinton in “Automated Speech/Other Discrimination for Loudness Monitoring” (Audio Engineering Society, Preprint number 6437 of Convention 118, May 2005). Alternatively, it is contemplated that the sequences of speech likelihood values could be created manually (e.g., by the content creator) and transmitted alongside the multi-channel audio signal to the end user.
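As a rough sketch of how two such sequences of speech likelihood values might be compared (the absolute-difference similarity measure, the smoothing window, and the scaling below are illustrative assumptions, not requirements of the method):

    import numpy as np

    def attenuation_control_values(q, p, window=50, scale=1.0):
        # q: speech likelihood sequence for the speech channel.
        # p: speech likelihood sequence for a non-speech channel.
        # Where q and p agree (both channels likely carry the same speech),
        # the control value approaches 0, suppressing ducking; where they
        # disagree, it approaches 1, permitting full ducking.
        diff = np.abs(np.asarray(q, dtype=float) - np.asarray(p, dtype=float))
        kernel = np.ones(window) / window                 # moving-average smoothing
        smoothed = np.convolve(diff, kernel, mode="same")
        return np.clip(scale * smoothed, 0.0, 1.0)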
In a second class of embodiments, in which the multi-channel audio signal has a speech channel and at least two non-speech channels including a first non-speech channel and a second non-speech channel, the inventive method includes steps of: (a) determining at least one first attenuation control value indicative of a measure of similarity between speech-related content determined by the speech channel and second speech-related content determined by the first non-speech channel (e.g., including by comparing a first speech-related feature sequence indicative of speech-related content determined by the speech channel to a second speech-related feature sequence indicative of the second speech-related content); and (b) determining at least one second attenuation control value indicative of a measure of similarity between speech-related content determined by the speech channel and third speech-related content determined by the second non-speech channel (e.g., including by comparing a third speech-related feature sequence indicative of speech-related content determined by the speech channel to a fourth speech-related feature sequence indicative of the third speech-related content, where the third speech-related feature sequence may be identical to the first speech-related feature sequence of step (a)). Typically, the method includes the step of attenuating the first non-speech channel (e.g., scaling attenuation of the first non-speech channel) in response to the at least one first attenuation control value and attenuating the second non-speech channel (e.g., scaling attenuation of the second non-speech channel) in response to the at least one second attenuation control value. Preferably, each non-speech channel is attenuated so as to improve intelligibility of speech determined by the speech channel without undesirably attenuating speech-enhancing content determined by either non-speech channel.
In some embodiments in the second class:
the at least one first attenuation control value determined in step (a) is a sequence of attenuation control values, and each of the attenuation control values is a gain control value for scaling the amount of gain applied to the first non-speech channel by ducking circuitry so as to improve intelligibility of speech determined by the speech channel without undesirably attenuating speech-enhancing content determined by the first non-speech channel; and
the at least one second attenuation control value determined in step (b) is a sequence of second attenuation control values, and each of the second attenuation control values is a gain control value for scaling the amount of gain applied to the second non-speech channel by ducking circuitry so as to improve intelligibility of speech determined by the speech channel without undesirably attenuating speech-enhancing content determined by the second non-speech channel.
In a third class of embodiments, the invention is a method for filtering a multi-channel audio signal having a speech channel and at least one non-speech channel, to improve intelligibility of speech determined by the signal. The method includes steps of: (a) comparing a characteristic of the speech channel and a characteristic of the non-speech channel to generate at least one attenuation value for controlling attenuation of the non-speech channel relative to the speech channel; and (b) adjusting the at least one attenuation value in response to at least one speech enhancement likelihood value to generate at least one adjusted attenuation value for controlling attenuation of the non-speech channel relative to the speech channel. Typically, the adjusting step is (or includes) scaling each said attenuation value in response to one said speech enhancement likelihood value to generate one said adjusted attenuation value. Typically, each speech enhancement likelihood value is indicative of (e.g., monotonically related to) a likelihood that the non-speech channel (or a non-speech channel derived from the non-speech channel or from a set of non-speech channels of the input audio signal) is indicative of speech-enhancing content (content that enhances the intelligibility or other perceived quality of speech content determined by the speech channel). In some embodiments, the speech enhancement likelihood value is indicative of an expected speech-enhancing value of the non-speech channel (e.g., a measure of probability that the non-speech channel is indicative of speech-enhancing content multiplied by a measure of perceived quality enhancement that speech-enhancing content determined by the non-speech channel would provide to speech content determined by the multi-channel audio signal). In some embodiments in the third class, the at least one speech enhancement likelihood value is a sequence of comparison values (e.g., difference values) determined by a method including a step of comparing a first speech-related feature sequence indicative of speech-related content determined by the speech channel to a second speech-related feature sequence indicative of speech-related content determined by the non-speech channel, and each of the comparison values is a measure of similarity between the first speech-related feature sequence and the second speech-related feature sequence at a different time (e.g., in a different time interval). In typical embodiments in the third class, the method also includes the step of attenuating the non-speech channel in response to the at least one adjusted attenuation value. Step (b) can comprise scaling the at least one attenuation value (which typically is, or is determined by, a ducking gain control signal or other raw attenuation control signal) in response to the at least one speech enhancement likelihood value.
In some embodiments in the third class, each attenuation value generated in step (a) is a first factor, indicative of an amount of attenuation of the non-speech channel sufficient to ensure that the ratio of signal power in the non-speech channel to signal power in the speech channel does not exceed a predetermined threshold, scaled by a second factor monotonically related to the likelihood of the speech channel being indicative of speech. Typically, the adjusting step in these embodiments is (or includes) scaling each said attenuation value by one said speech enhancement likelihood value to generate one said adjusted attenuation value, where the speech enhancement likelihood value is a factor monotonically related to one of: a likelihood that the non-speech channel is indicative of speech-enhancing content (content that enhances the intelligibility or other perceived quality of speech content determined by the multi-channel signal), and an expected speech-enhancing value of the non-speech channel (e.g., a measure of probability that the non-speech channel is indicative of speech-enhancing content multiplied by a measure of the perceived quality enhancement that speech-enhancing content in the non-speech channel would provide to speech content determined by the multi-channel signal).
In some embodiments in the third class, each attenuation value generated in step (a) is a first factor indicative of an amount (e.g., the minimum amount) of attenuation of the non-speech channel sufficient to cause predicted intelligibility of speech determined by the speech channel in the presence of content determined by the non-speech channel to exceed a predetermined threshold value, scaled by a second factor monotonically related to the likelihood of the speech channel being indicative of speech. Preferably, the predicted intelligibility of speech determined by the speech channel in the presence of content determined by the non-speech channel is determined in accordance with a psycho-acoustically based intelligibility prediction model. Typically, the adjusting step in these embodiments is (or includes) scaling each said attenuation value by one said speech enhancement likelihood value to generate one said adjusted attenuation value, where the speech enhancement likelihood value is a factor monotonically related to one of: a likelihood that the non-speech channel is indicative of speech-enhancing content, and an expected speech-enhancing value of the non-speech channel.
In some embodiments in the third class, step (a) includes the steps of generating each said attenuation value including by determining a power spectrum (indicative of power as a function of frequency) of each of the speech channel and the non-speech channel, and performing a frequency-domain determination of the attenuation value in response to each said power spectrum. Preferably, the attenuation values generated in this way determine attenuation as a function of frequency to be applied to frequency components of the non-speech channel.
In a class of embodiments, the invention is a method and system for enhancing speech determined by a multi-channel audio input signal. In some embodiments, the inventive system includes an analysis module (subsystem) configured to analyze the input multi-channel signal to generate attenuation control values, and an attenuation subsystem. The attenuation subsystem is configured to apply ducking attenuation, steered by at least some of the attenuation control values, to each non-speech channel of the input signal to generate a filtered audio output signal. In some embodiments, the attenuation subsystem includes ducking circuitry (steered by at least some of the attenuation control values) coupled and configured to apply attenuation (ducking) to each non-speech channel of the input signal to generate the filtered audio output signal. The ducking circuitry is steered by control values in the sense that the attenuation it applies to the non-speech channels is determined by current values of the control values.
In typical embodiments, the inventive system is or includes a general or special purpose processor programmed with software (or firmware) and/or otherwise configured to perform an embodiment of the inventive method. In some embodiments, the inventive system is a general purpose processor, coupled to receive input data indicative of the audio input signal and programmed (with appropriate software) to generate output data indicative of the audio output signal in response to the input data by performing an embodiment of the inventive method. In other embodiments, the inventive system is implemented by appropriately configuring (e.g., by programming) a configurable audio digital signal processor (DSP). The audio DSP can be a conventional audio DSP that is configurable (e.g., programmable by appropriate software or firmware, or otherwise configurable in response to control data) to perform any of a variety of operations on input audio. In operation, an audio DSP that has been configured to perform active speech enhancement in accordance with the invention is coupled to receive the audio input signal, and the DSP typically performs a variety of operations on the input audio in addition to speech enhancement. In accordance with various embodiments of the invention, an audio DSP is operable to perform an embodiment of the inventive method after being configured (e.g., programmed) to generate an output audio signal in response to the input audio signal by performing the method on the input audio signal.
Aspects of the invention include a system configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method.
Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Embodiments of the inventive system, method, and medium will be described with reference to
The inventor has observed that some multi-channel audio content has different, yet related speech content in the speech channel and at least one non-speech channel. For example, multi-channel audio recordings of some stage shows are mixed such that “dry” speech (i.e., speech without noticeable reverberation) is placed into the speech channel (typically, the center channel, C, of the signal) and the same speech, but with a significant reverberation component (“wet” speech) is placed in the non-speech channels of the signal. In a typical scenario, the dry speech is the signal from the microphone that the stage performer holds close to his mouth and the wet speech is the signal from microphones placed in the audience. The wet speech is related to the dry speech since it is the performance as heard by the audience in the venue. Yet it differs from the dry speech. Typically the wet speech is delayed relative to the dry speech, and has a different spectrum and different additive components (e.g., audience noises and reverberation).
Depending on the relative levels of dry and wet speech, it is possible that the wet speech component masks the dry speech component to a degree that attenuation of non-speech channels in ducking circuitry (e.g., as in the method described in above-cited WO 2010/011377) undesirably attenuates the wet speech signal. Although the dry and wet speech components can be described as separate entities, a listener perceptually fuses the two and hears them as a single stream of speech. Attenuating the wet speech component (e.g., in ducking circuitry) may have the effect of lowering the perceived loudness of the fused speech stream along with collapsing its image width. The inventor has recognized that for multi-channel audio signals having wet and dry speech components of the noted type, it would often be more perceptually pleasing as well as more conducive to speech intelligibility if the level of the wet speech components were not altered during speech enhancement processing of the signals.
The invention is based in part on the recognition that, when at least one non-speech channel of a multi-channel audio signal includes content that enhances the intelligibility (or other perceived quality) of speech content determined by the signal's speech channel, filtering the signal's non-speech channels using ducking circuitry (e.g., in accordance with the method of WO 2010/011377) can negatively affect the entertainment experience of one listening to the reproduced filtered signal. In accordance with typical embodiments of the invention, attenuation (in ducking circuitry) of at least one non-speech channel of a multi-channel audio signal is suspended or modified during times when the non-speech channel includes speech-enhancing content (content that enhances the intelligibility or other perceived quality of speech content determined by the signal's speech channel). At times when the non-speech channel does not include speech-enhancing content (or does not include speech-enhancing content that meets a predetermined criterion), the non-speech channel is attenuated normally (the attenuation is not suspended or modified).
A typical multi-channel signal (having a speech channel) for which conventional filtering in ducking circuitry is inappropriate is one including at least one non-speech channel that carries speech cues that are substantially identical to speech cues in the speech channel. In accordance with typical embodiments of the present invention, a sequence of speech related features in the speech channel is compared to a sequence of speech related features in the non-speech channel. A substantial similarity of the two feature sequences indicates that the non-speech channel (i.e., the signal in the non-speech channel) contributes information useful for understanding the speech in the speech channel and that attenuation of the non-speech channel should be avoided.
To appreciate the significance of examining the similarity between such speech-related feature sequences rather than the signals themselves, it is important to recognize that “dry” and “wet” speech content (determined by speech and non-speech channels) is not identical; the signals indicative of the two types of speech content are typically temporally offset, have undergone different filtering processes, and have had different extraneous components added. Therefore, a direct comparison between the two signals will yield a low similarity, regardless of whether the non-speech channel contributes speech cues that are the same as those in the speech channel (as in the case of dry and wet speech), unrelated speech cues (as in the case of two unrelated voices in the speech and non-speech channels [e.g., a target conversation in the speech channel and background babble in the non-speech channel]), or no speech cues at all (e.g., the non-speech channel carries music and effects). By basing the comparison on speech features (as in preferred embodiments of the present invention), a level of abstraction is achieved that lessens the impact of irrelevant signal aspects, such as small amounts of delay, spectral differences, and extraneous added signals. Thus, preferred implementations of the invention typically generate at least two streams of speech features: one representing the signal in the speech channel, and at least one representing the signal in a non-speech channel.
A first embodiment (125) of the inventive system will be described with reference to
With reference again to
The power of each channel of the multi-channel input signal is measured with a bank of power estimators (104, 105, and 106) and expressed on a logarithmic scale [dB]. These power estimators may implement a smoothing mechanism, such as a leaky integrator, so that the measured power level reflects the power level averaged over the duration of a sentence or an entire passage. The power level of the signal in the speech channel is subtracted from the power level in each of the non-speech channels (by subtraction elements 107 and 108) to give a measure of the ratio of power between the two signal types. The output of element 107 is a measure of the ratio of power in non-speech channel 103 to power in speech channel 101. The output of element 108 is a measure of the ratio of power in non-speech channel 102 to power in speech channel 101.
Comparison circuit 109 determines for each non-speech channel the number of decibels (dB) by which the non-speech channel must be attenuated in order for its power level to remain at least ϑ dB below the power level of the signal in the speech channel (where the symbol “ϑ,” known as script theta, denotes a predetermined threshold value). In one implementation of circuit 109, addition element 120 adds the threshold value (stored in element 110, which may be a register) to the power level difference (or “margin”) between non-speech channel 103 and speech channel 101, and addition element 121 adds the threshold value to the power level difference between non-speech channel 102 and speech channel 101. Elements 111-1 and 112-1 change the sign of the output of addition elements 120 and 121, respectively. This sign change operation converts attenuation values into gain values. Elements 111 and 112 limit each result to be equal to or less than zero (the output of element 111-1 is asserted to limiter 111 and the output of element 112-1 is asserted to limiter 112). The current value C1 output from limiter 111 determines the gain (negated attenuation) in dB that must be applied to non-speech channel 103 to keep its power level ϑ dB below the power level of speech channel 101 (at the relevant time, or in the relevant time window, of the multi-channel input signal). The current value C2 output from limiter 112 determines the gain (negated attenuation) in dB that must be applied to non-speech channel 102 to keep its power level ϑ dB below the power level of speech channel 101 (at the relevant time, or in the relevant time window, of the multi-channel input signal). A typical suitable value for ϑ is 15 dB.
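A minimal sketch of this computation in Python (the leaky-integrator coefficient is an illustrative assumption; the element numbers in the comments refer to the description above):

    import numpy as np

    def smoothed_power_db(x, alpha=0.999):
        # Power estimator with leaky integration (cf. elements 104-106),
        # expressed on a logarithmic scale (dB).
        out = np.empty(len(x))
        acc = 1e-12
        for n, sample in enumerate(np.asarray(x, dtype=float)):
            acc = alpha * acc + (1.0 - alpha) * sample * sample
            out[n] = 10.0 * np.log10(acc)
        return out

    def raw_ducking_gain_db(speech, non_speech, theta_db=15.0):
        # Margin (cf. elements 107/108): non-speech level minus speech level, in dB.
        margin = smoothed_power_db(non_speech) - smoothed_power_db(speech)
        # Add the threshold (cf. elements 120/121), negate (cf. 111-1/112-1),
        # and limit to <= 0 (cf. limiters 111/112): raw gain C1 or C2 in dB.
        return np.minimum(0.0, -(margin + theta_db))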
Because there is a unique relation between a measure expressed on a logarithmic scale (dB) and that same measure expressed on a linear scale, a circuit (or programmed or otherwise configured processor) that is equivalent to elements 104, 105, 106, 107, 108, and 109 of
The signal C1 output from limiter 111 is a raw attenuation control signal for non-speech channel 103 (a gain control signal for ducking amplifier 116) which could be asserted directly to amplifier 116 to control ducking attenuation of non-speech channel 103. The signal C2 output from limiter 112 is a raw attenuation control signal for non-speech channel 102 (a gain control signal for ducking amplifier 117) which could be asserted directly to amplifier 117 to control ducking attenuation of non-speech channel 102.
In accordance with the invention, however, raw attenuation control signals C1 and C2 are scaled in multiplication elements 114 and 115 to generate gain control signals S3 and S4 for controlling ducking attenuation of the non-speech channels by amplifiers 116 and 117. Signal C1 is scaled in response to a sequence of attenuation control values S1, and signal C2 is scaled in response to a sequence of attenuation control values S2. Each control value S1 is asserted from the output of processing element 134 (to be described below) to an input of multiplication element 114, and signal C1 (and thus each “raw” gain control value C1 determined thereby) is asserted from limiter 111 to the other input of element 114. Element 114 scales the current value C1 in response to the current value S1 by multiplying these values together to generate the current value S3, which is asserted to amplifier 116. Each control value S2 is asserted from the output of processing element 135 (to be described below) to an input of multiplication element 115, and signal C2 (and thus each “raw” gain control value C2 determined thereby) is asserted from limiter 112 to the other input of element 115. Element 115 scales the current value C2 in response to the current value S2 by multiplying these values together to generate the current value S4, which is asserted to amplifier 117.
Control values S1 and S2 are generated in accordance with the invention as follows. In speech likelihood processing elements 130, 131, and 132, a speech likelihood signal (each of signals P, Q, and T of
Speech likelihood signal Q is a value monotonically related to the likelihood that the signal in the speech channel is in fact indicative of speech. Speech likelihood signal P is a value monotonically related to the likelihood that the signal in non-speech channel 102 is speech, and speech likelihood signal T is a value monotonically related to the likelihood that the signal in non-speech channel 103 is speech. Processors 130, 131, and 132 (which are typically, but not necessarily, identical to each other) can implement any of various methods for automatically determining the likelihood that the input signals asserted thereto are indicative of speech. In one embodiment, speech likelihood processors 130, 131, and 132 are identical to each other. Processor 130 generates signal P (from information in non-speech channel 102) such that signal P is indicative of a sequence of speech likelihood values, each monotonically related to the likelihood that the signal in channel 102 at a different time (or time window) is speech; processor 131 generates signal Q (from information in channel 101) such that signal Q is indicative of a sequence of speech likelihood values, each monotonically related to the likelihood that the signal in channel 101 at a different time (or time window) is speech; and processor 132 generates signal T (from information in non-speech channel 103) such that signal T is indicative of a sequence of speech likelihood values, each monotonically related to the likelihood that the signal in channel 103 at a different time (or time window) is speech. Each of processors 130, 131, and 132 does so by implementing (on the relevant one of channels 102, 101, and 103) the mechanism described by Robinson and Vinton in “Automated Speech/Other Discrimination for Loudness Monitoring” (Audio Engineering Society, Preprint number 6437 of Convention 118, May 2005). Alternatively, signal P may be created manually, for example by the content creator, and transmitted alongside the audio signal in channel 102 to the end user, and processor 130 may simply extract such previously created signal P from channel 102 (or processor 130 may be eliminated and the previously created signal P directly asserted to processor 134). Similarly, signal Q may be created manually and transmitted alongside the audio signal in channel 101, processor 131 may simply extract such previously created signal Q from channel 101 (or processor 131 may be eliminated and the previously created signal Q directly asserted to processors 134 and 135), signal T may be created manually and transmitted alongside the audio signal in channel 103, and processor 132 may simply extract such previously created signal T from channel 103 (or processor 132 may be eliminated and the previously created signal T directly asserted to processor 135).
In a typical implementation of processor 134, the speech likelihood values determined by signals P and Q are compared pairwise, to determine the difference between the current values of signals P and Q at each of a sequence of times (e.g., time windows). In a typical implementation of processor 135, the speech likelihood values determined by signals T and Q are compared pairwise in the same manner, to determine the difference between the current values of signals T and Q at each of a sequence of times. As a result, each of processors 134 and 135 generates a time sequence of difference values for a pair of speech likelihood signals.
Processors 134 and 135 are preferably implemented to smooth each such difference value sequence by time averaging, and optionally to scale each resulting averaged difference value sequence. Scaling of the averaged difference value sequences may be necessary so that the scaled averaged values output from processors 134 and 135 are in such a range that the outputs of multiplication elements 114 and 115 are useful for steering the ducking amplifiers 116 and 117.
In a typical implementation, the signal S1 output from processor 134 is a sequence of scaled averaged difference values (each of these scaled averaged difference values being a scaled average of the difference between current values of signals P and Q in a different time window). The signal S1 is a ducking gain control signal for non-speech channel 102, and is employed to scale the independently generated raw ducking gain control signal C1 for non-speech channel 102. Similarly, in a typical implementation, the signal S2 output from processor 135 is a sequence of scaled averaged difference values (each of these scaled averaged difference values being a scaled average of the difference between current values of signals T and Q in a different time window). The signal S2 is a ducking gain control signal for non-speech channel 103, and is employed to scale the independently generated raw ducking gain control signal C2 for non-speech channel 103.
Scaling of raw ducking gain control signal C1 in response to ducking gain control signal S1 in accordance with the invention can be performed by multiplying (in element 114) each raw gain control value of signal C1 by a corresponding one of the scaled averaged difference values of signal S1, to generate signal S3. Scaling of raw ducking gain control signal C2 in response to ducking gain control signal S2 in accordance with the invention can be performed by multiplying (in element 115) each raw gain control value of signal C2 by a corresponding one of the scaled averaged difference values of signal S2, to generate signal S4.
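Continuing the illustrative sketches above, the scaling step and the ducking amplifiers reduce to an element-wise product followed by application of the resulting time-varying gain:

    import numpy as np

    def scaled_ducking_gain_db(raw_gain_db, control_values):
        # Cf. elements 114/115: multiply each raw gain control value (C1 or
        # C2, in dB) by the corresponding attenuation control value (S1 or
        # S2). A control value near 0 yields a scaled gain near 0 dB
        # (ducking suspended); a value near 1 leaves the raw gain in effect.
        return np.asarray(raw_gain_db) * np.asarray(control_values)

    def apply_gain_db(x, gain_db):
        # Cf. ducking amplifiers 116/117: apply the time-varying gain (dB).
        return np.asarray(x) * 10.0 ** (np.asarray(gain_db) / 20.0)

    # Example wiring for one non-speech channel, using the earlier sketches
    # (q_likelihood and p_likelihood are hypothetical likelihood sequences):
    # c = raw_ducking_gain_db(speech, non_speech)
    # s = attenuation_control_values(q_likelihood, p_likelihood)
    # ducked = apply_gain_db(non_speech, scaled_ducking_gain_db(c, s))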
Another embodiment (125′) of the inventive system will be described with reference to
In the system of
The
To generate the sequence of attenuation control values V1, the signal Q (asserted at the output of processor 131) is asserted to an input of multiplier 214, and the control signal S1 (asserted at the output of processor 134) is asserted to the other input of multiplier 214. The output of multiplier 214 is the sequence of attenuation control values V1. Each of the attenuation control values V1 is one of the speech likelihood values determined by signal Q, scaled by a corresponding one of the attenuation control values S1.
Similarly, to generate the sequence of attenuation control values V2, the signal Q (asserted at the output of processor 131) is asserted to an input of multiplier 215, and the control signal S2 (asserted at the output of processor 135) is asserted to the other input of multiplier 215. The output of multiplier 215 is the sequence of attenuation control values V2. Each of the attenuation control values V2 is one of the speech likelihood values determined by signal Q, scaled by a corresponding one of the attenuation control values S2.
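In the same illustrative style, the combination performed by multipliers 214 and 215 is a simple element-wise product:

    import numpy as np

    def combined_control_values(q, s):
        # Cf. multipliers 214/215: each attenuation control value (V1 or V2)
        # is a speech likelihood value from signal Q scaled by the
        # corresponding similarity-based control value (S1 or S2).
        return np.asarray(q, dtype=float) * np.asarray(s, dtype=float)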
The
In variations on the
Similarly, in variations on the
Another embodiment (225) of the inventive system will be described with reference to
In the
The
The power spectra are fed into comparison circuit 204. The purpose of circuit 204 is to determine the attenuation to be applied to each non-speech channel to ensure that the signal in the non-speech channel does not reduce the intelligibility of the signal in the speech channel to below a predetermined criterion. This functionality is achieved by employing an intelligibility prediction circuit (205 and 206) that predicts speech intelligibility from the power spectra of the speech channel signal (201) and non-speech channel signals (202 and 203). The intelligibility prediction circuits 205 and 206 may implement a suitable intelligibility prediction model according to design choices and tradeoffs. Examples are the Speech Intelligibility Index as specified in ANSI S3.5-1997 (“Methods for Calculation of the Speech Intelligibility Index”) and the Speech Recognition Sensitivity model of Muesch and Buus (“Using statistical decision theory to predict speech intelligibility. I. Model structure,” Journal of the Acoustical Society of America, 2001, Vol. 109, pp. 2896-2909). The output of the intelligibility prediction model has no meaning when the signal in the speech channel is something other than speech; despite this, in what follows that output will be referred to as the predicted speech intelligibility. This apparent error is accounted for in subsequent processing by scaling the gain values output from comparison circuit 204 with parameters S1 and S2, each of which is related to the likelihood of the signal in the speech channel being indicative of speech.
The intelligibility prediction models have in common that they predict either increased or unchanged speech intelligibility as the result of lowering the level of the non-speech signal. Continuing on in the process flow of
It is of course possible that the signal in the speech channel is such that the criterion intelligibility cannot be reached even in the absence of a signal in the non-speech channel. An example of such a situation is a speech signal of very low level or with severely restricted bandwidth. If that happens, a point will be reached where any further reduction of the gain applied to the non-speech channel does not affect the predicted speech intelligibility, and the criterion is never met. In such a condition, the loop formed by elements 205, 207, and 209 (or elements 206, 208, and 210) continues indefinitely, and additional logic (not shown) may be applied to break the loop. One particularly simple example of such logic is to count the number of iterations and exit the loop once a predetermined number of iterations has been exceeded.
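A sketch of this loop in Python, with a hypothetical predict_intelligibility function standing in for whichever prediction model is chosen; the criterion value, gain step, and iteration cap are illustrative assumptions:

    def criterion_gain_db(speech_spectrum, non_speech_spectrum,
                          predict_intelligibility, criterion=0.5,
                          step_db=1.0, max_iterations=120):
        # Cf. the loop of elements 205/207/209 (or 206/208/210): lower the
        # gain applied to the non-speech channel until predicted
        # intelligibility of the speech channel meets the criterion, or until
        # the iteration cap breaks the loop (the degenerate case noted above).
        # Spectra are assumed to be numpy arrays of per-band powers.
        gain_db = 0.0
        for _ in range(max_iterations):
            scaled = non_speech_spectrum * 10.0 ** (gain_db / 10.0)  # power domain
            if predict_intelligibility(speech_spectrum, scaled) >= criterion:
                break
            gain_db -= step_db
        return gain_db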
Scaling of raw ducking gain control signal C3 in response to ducking gain control signal S1 in accordance with the invention can be performed by multiplying (in element 114) each raw gain control value of signal C3 by a corresponding one of the scaled averaged difference values of signal S1, to generate signal S5. Scaling of raw ducking gain control signal C4 in response to ducking gain control signal S2 in accordance with the invention can be performed by multiplying (in element 115) each raw gain control value of signal C4 by a corresponding one of the scaled averaged difference values of signal S2, to generate signal S6.
The
In variations on the
Similarly, in variations on the
Another embodiment (225′) of the inventive system will be described with reference to
In the system of
The
The system of
A second major respect in which the
In operation, the
Another embodiment (325) of the inventive system will be described with reference to
In the
In the
The process of
Describing now the side-branch path of the process of
Depending on the computational resources available and the constraints imposed, the form and complexity of the optimization circuits (307, 308) may vary greatly. According to one embodiment, an iterative, multidimensional constrained optimization of N free parameters is used. Each parameter represents the gain applied to one of the frequency bands of the non-speech channel. Standard techniques, such as following the steepest gradient in the N-dimensional search space, may be applied to find the maximum. In another embodiment, a computationally less demanding approach constrains the gain-vs.-frequency functions to be members of a small set of possible gain-vs.-frequency functions, such as a set of different spectral gradients or shelf filters. With this additional constraint the optimization problem can be reduced to a small number of one-dimensional optimizations. In yet another embodiment, an exhaustive search is made over a very small set of possible gain functions. This latter approach may be particularly desirable in real-time applications where a constant computational load and search speed are desired.
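The exhaustive-search variant, for example, might be sketched as follows (the candidate gain set and the hypothetical predict_intelligibility function are assumptions; spectra are per-band power vectors):

    import numpy as np

    def best_gain_function_db(speech_bands, non_speech_bands,
                              predict_intelligibility, candidate_gains_db):
        # Exhaustive search over a very small set of gain-vs.-frequency
        # functions, each an N-dimensional vector of per-band gains in dB
        # (e.g., a few shelf filters or spectral gradients). Returns the
        # candidate that maximizes predicted intelligibility.
        best, best_score = None, -np.inf
        for gains_db in candidate_gains_db:
            scaled = non_speech_bands * 10.0 ** (np.asarray(gains_db) / 10.0)
            score = predict_intelligibility(speech_bands, scaled)
            if score > best_score:
                best, best_score = gains_db, score
        return np.asarray(best)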
Those of ordinary skill in the art will easily recognize additional constraints that might be imposed on the optimization according to additional embodiments of the present invention. One example is restricting the loudness of the modified non-speech channel to be not larger than the loudness before modification. Another example is imposing a limit on the gain differences between adjacent frequency bands in order to limit the potential for temporal aliasing in the reconstruction filter bank (313, 314) or to reduce the possibility for objectionable timbre modifications. Desirable constraints depend both on the technical implementation of the filter bank and on the chosen tradeoff between intelligibility improvement and timbre modification. For clarity of illustration, these constraints are omitted from
Scaling of N-dimensional raw ducking gain control vector C6 in response to ducking gain control signal S2 in accordance with the invention can be performed by multiplying (in element 115′) each raw gain control value of vector C6 by a corresponding one of the scaled averaged difference values of signal S2, to generate N-dimensional ducking gain control vector S8. Scaling of N-dimensional raw ducking gain control vector C5 in response to ducking gain control signal S1 in accordance with the invention can be performed by multiplying (in element 114′) each raw gain control value of vector C5 by a corresponding one of the scaled averaged difference values of signal S1, to generate N-dimensional ducking gain control vector S7.
The
In variations on the
Similarly, in variations on the
It will be apparent to those of ordinary skill in the art from this disclosure how the
As described, the system of
(a) determining at least one attenuation control value (e.g., signal S1 or S2 of
(b) attenuating at least one non-speech channel of the audio signal in response to the at least one attenuation control value (e.g., in element 114 and amplifier 116, or element 115 and amplifier 117, of
Typically, the attenuating step comprises scaling a raw attenuation control signal (e.g., ducking gain control signal C1 or C2 of
In some embodiments in the first class, each attenuation control value is monotonically related to likelihood that the non-speech channel is indicative of speech-enhancing content that enhances the intelligibility (or another perceived quality) of speech content determined by the speech channel. In some other embodiments in the first class, each attenuation control value is monotonically related to an expected speech-enhancing value of the non-speech channel (e.g., a measure of probability that the non-speech channel is indicative of speech-enhancing content, multiplied by a measure of perceived quality enhancement that speech-enhancing content determined by the non-speech channel would provide to speech content determined by the multi-channel signal). For example, where step (a) includes a step of comparing (e.g., in element 134 or 135 of
As described, the system of
(a) comparing a characteristic of the speech channel and a characteristic of the non-speech channel to generate at least one attenuation value (e.g., values determined by signal C1 or C2 of
(b) adjusting the at least one attenuation value in response to at least one speech enhancement likelihood value (e.g., signal S1 or S2 of
In operation of the
In operation of the
In operation of the
In a class of embodiments, the invention is a method and system for enhancing speech determined by a multi-channel audio input signal. In some such embodiments, the inventive system includes an analysis module or subsystem (e.g., elements 130-135, 104-109, 114, and 115 of
In some embodiments, a ratio of speech channel (e.g., center channel) power to non-speech channel (e.g., side channel and/or rear channel) power is used to determine how much ducking (attenuation) should be applied to each non-speech channel. For example, in the
In some alternative embodiments, a modified version of the analysis module of
In operation, an audio DSP that has been configured to perform speech enhancement in accordance with the invention (e.g., system 420 of
In some embodiments, the inventive system is or includes a general purpose processor coupled to receive or to generate input data indicative of a multi-channel audio signal. The processor is programmed with software (or firmware) and/or otherwise configured (e.g., in response to control data) to perform any of a variety of operations on the input data, including an embodiment of the inventive method. The computer system of
The computer system of
Computer readable storage medium 504 (e.g., an optical disk or other tangible object) has computer code stored thereon that is suitable for programming processor 501 to perform an embodiment of the inventive method. In operation, processor 501 executes the computer code to process data indicative of a multi-channel audio input signal in accordance with the invention to generate output data indicative of a multi-channel audio output signal.
The system of above-described
Aspects of the invention are a computer system programmed to perform any embodiment of the inventive method, and a computer readable medium which stores computer-readable code for implementing any embodiment of the inventive method.
While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.
This application is a continuation of U.S. patent application Ser. No. 13/583,204, filed Sep. 6, 2012, which is a national-stage entry of International Patent Application No. PCT/US2011/026505, filed Feb. 28, 2011, which claims priority to U.S. Provisional Patent Application No. 61/311,437, filed Mar. 8, 2010, all of which are hereby incorporated by reference.