The invention relates to audio signal processing and speech enhancement. In accordance with one aspect, the invention combines a high-quality audio program that is a mix of speech and non-speech audio with a lower-quality copy of the speech components contained in the audio program for the purpose of generating a high-quality audio program with an increased ratio of speech to non-speech audio such as may benefit the elderly, hearing impaired or other listeners. Aspects of the invention are particularly useful for television and home theater sound, although they may be applicable to other audio and sound applications. The invention relates to methods, apparatus for performing such methods, and to software stored on a computer-readable medium for causing a computer to perform such methods.
In movies or on television, dialog and narrative are often presented together with other, non-speech, sounds such as music, jingles, effects, and ambiance. In many cases the speech sounds and the non-speech sounds are recorded separately and mixed under the control of a sound engineer. When speech and non-speech sounds are mixed, the non-speech sounds may partially mask the speech, thereby rendering a fraction of the speech inaudible. As a result, listeners must comprehend the speech based on the remaining, partial information. A small amount of masking is easily tolerated by young listeners with healthy ears. However, as masking increases, comprehension becomes progressively more difficult until the speech eventually becomes unintelligible (see, e.g., ANSI S3.5-1997, “Methods for Calculation of the Speech Intelligibility Index”). The sound engineer is intuitively aware of this relationship and mixes speech and background at relative levels that usually provide adequate intelligibility for the majority of viewers.
While background sounds hinder intelligibility for all viewers, the detrimental effect of background sounds is larger for seniors and persons with hearing impairment (cf. Killion, M. 2002. “New thinking on hearing in noise: A generalized Articulation Index” in Seminars in Hearing, Volume 23, Number 1, pages 57 to 75, Thieme Medical Publishers, New York, N.Y.). The sound engineer, who typically has normal hearing and is younger than at least part of his audience, selects the ratio of speech to non-speech audio based on his own internal standards. Sometimes that leaves a significant portion of the audience straining to follow the dialog or narrative.
One solution known in the prior art exploits the fact that speech and non-speech audio exist separately at some point in the production chain in order to provide the viewer with two separate audio streams. One stream carries primary content audio (mainly speech) and the other carries secondary content audio (the remaining audio program, which excludes speech). The user is given control over the mixing process. Unfortunately, this scheme is impractical because it does not build on the current practice of transmitting a fully mixed audio program. Rather, it replaces the main audio program with two audio streams that are not in use today. A further disadvantage of the approach is that it requires approximately twice the bandwidth of current broadcast practice because two independent audio streams, each of broadcast quality, must be delivered to the user.
The successful audio coding standard AC-3 allows simultaneous delivery of a main audio program and other, associated audio streams. All streams are of broadcast quality. One of these associated audio streams is intended for the hearing impaired. According to the “Dolby Digital Professional Encoding Guidelines,” section 5.4.4, available at http://www.dolby.com/assets/pdf/tech_library/46_DDEncodingGuidelines.pdf, this audio stream typically contains only dialog and is added, at a fixed ratio, to the center channel of the main audio program (or to the left and right channels if the main audio is two-channel stereo), which already contains a copy of that dialog. See also ATSC Standard: Digital Television Standard (A/53), revision D, Including Amendment No. 1, Section 6.5 Hearing Impaired (HI). Further details of AC-3 may be found in the AC-3 citations below under the heading “Incorporation by Reference.”
It is clear from the preceding discussion that at present there is a need for, but no existing way of, increasing the ratio of speech to non-speech audio in a manner that exploits the fact that speech and non-speech audio are recorded separately, builds on the current practice of transmitting a fully mixed audio program, and requires minimal additional bandwidth. Therefore, it is an object of the present invention to provide a method for optionally increasing the ratio of speech to non-speech audio in a television broadcast that requires only a small amount of additional bandwidth, exploits the fact that speech and non-speech audio are recorded separately, and extends rather than replaces existing broadcast practice.
According to a first aspect of the invention for enhancing speech portions of an audio program having speech and non-speech components, the audio program having speech and non-speech components is received, the audio program having a high quality such that when reproduced in isolation the program does not have audible artifacts that listeners would deem objectionable, a copy of speech components of the audio program is received, the copy having a low quality such that when reproduced in isolation the copy has audible artifacts that listeners would deem objectionable, and the low-quality copy of speech components and the high-quality audio program are combined in such proportions that the ratio of speech to non-speech components in the resulting audio program is increased and the audible artifacts of the low-quality copy of speech components are masked by the high-quality audio program.
According to an aspect of the invention in which speech portions of an audio program having speech and non-speech components are enhanced with a copy of speech components of the audio program, the copy having a low quality such that when reproduced in isolation the copy has audible artifacts that listeners would deem objectionable, the low-quality copy of the speech components and the audio program are combined in such proportions that the ratio of speech to non-speech components in the resulting audio program is increased and the audible artifacts of the low-quality copy of speech components are masked by the audio program.
In either of the just-mentioned aspects, the proportions of combining the copy of speech components and the audio program may be such that the speech components in the resulting audio program have substantially the same dynamic characteristics as the corresponding speech components in the audio program and the non-speech components in the resulting audio program have a compressed dynamic range relative to the corresponding non-speech components in the audio program.
Alternatively, in either of the just-mentioned aspects, the proportions of combining the copy of speech components and the audio program are such that the speech components in the resulting audio program have a compressed dynamic range relative to the corresponding speech components in the audio program and the non-speech components in the resulting audio program have substantially the same dynamic characteristics as the corresponding non-speech components in the audio program.
In accordance with another aspect of the invention, enhancing speech portions of an audio program having speech and non-speech components includes receiving the audio program having speech and non-speech components, receiving a copy of speech components of the audio program, and combining the copy of speech components and the audio program in such proportions that the ratio of speech to non-speech components in the resulting audio program is increased, the speech components in the resulting audio program having substantially the same dynamic characteristics as the corresponding speech components in the audio program, and the non-speech components in the resulting audio program having a compressed dynamic range relative to the corresponding non-speech components in the audio program.
In accordance with another aspect of the invention, enhancing speech portions of an audio program having speech and non-speech components with a copy of speech components of the audio program includes combining the copy of speech components and the audio program in such proportions that the ratio of speech to non-speech components in the resulting audio program is increased, the speech components in the resulting audio program have substantially the same dynamic characteristics as the corresponding speech components in the audio program, and the non-speech components in the resulting audio program have a compressed dynamic range relative to the corresponding non-speech components in the audio program.
In accordance with yet another aspect of the invention for enhancing speech portions of an audio program having speech and non-speech components, the audio program having speech and non-speech components is received, a copy of speech components of the audio program is received, and the copy of speech components and the audio program are combined in such proportions that the ratio of speech to non-speech components in the resulting audio program is increased, the speech components in the resulting audio program have a compressed dynamic range relative to the corresponding speech components in the audio program, and the non-speech components in the resulting audio program have substantially the same dynamic characteristics as the corresponding non-speech components in the audio program.
In accordance with a further aspect of the invention for enhancing speech portions of an audio program having speech and non-speech components with a copy of speech components of the audio program, the copy of speech components and the audio program are combined in such proportions that the ratio of speech to non-speech components in the resulting audio program is increased, the speech components in the resulting audio program have a compressed dynamic range relative to the corresponding speech components in the audio program, and the non-speech components in the resulting audio program have substantially the same dynamic range characteristics as the corresponding non-speech components in the audio program.
Although the examples of implementing the present invention are in the context of television or home theater sound, it will be understood by those of ordinary skill in the art that the invention may be applied in other audio and sound applications.
If television or home theater viewers have access to both the main audio program and a separate audio stream that contains only the speech components, any ratio of speech to non-speech audio can be achieved by suitably scaling and mixing the two components. For example, if it is desired to suppress the non-speech audio completely so that only speech is heard, only the stream containing the speech sound is played. At the other extreme, if it is desired to suppress the speech completely so that only the non-speech audio is heard, the speech audio is simply subtracted from the main audio program. Between the extremes, any intermediate ratio of speech to non-speech audio may be achieved.
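The scaling-and-mixing arithmetic described above can be sketched in a few lines. This is a minimal illustration under the assumption that the main program is the sample-wise sum of the speech and non-speech components; the function name `remix` and the sample values are assumptions made for demonstration, not part of any described embodiment:

```python
def remix(main, speech, speech_gain):
    """Remix a fully mixed program using a separate, time-aligned speech stream.

    main        : samples of the full mix (speech + background)
    speech      : samples of the speech components alone
    speech_gain : linear gain applied to the speech components
                  (1.0 leaves the mix unchanged, 0.0 removes the speech
                  entirely, values > 1.0 boost speech over the background)
    """
    # Because main = speech + background, adding (g - 1) * speech yields
    # g * speech + background: only the speech level changes.
    return [m + (speech_gain - 1.0) * s for m, s in zip(main, speech)]

# The two extremes from the text: subtracting the speech leaves only the
# background, while playing the speech stream alone suppresses the background.
main = [0.5, -0.2, 0.1]
speech = [0.3, -0.1, 0.0]
background_only = remix(main, speech, 0.0)
```

Any intermediate ratio follows from intermediate values of `speech_gain`.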
To make an auxiliary speech channel commercially viable it must not be allowed to increase the bandwidth allocated to the main audio program by more than a small fraction. To satisfy this constraint, the auxiliary speech must be encoded with a coder that reduces the data rate drastically. Such data rate reduction comes at the expense of distorting the speech signal. Speech distorted by low-bitrate coding can be described as the sum of the original speech and a distortion component (coding noise). When the distortion becomes audible it degrades the perceived sound quality of the speech. Although the coding noise can have a severe impact on the sound quality of a signal, its level is typically much lower than that of the signal being coded.
In practice, the main audio program is of “broadcast quality” and the coding noise associated with it is nearly imperceptible. In other words, when reproduced in isolation the program does not have audible artifacts that listeners would deem objectionable. In accordance with aspects of the present invention, the auxiliary speech, on the other hand, if listened to in isolation, may have audible artifacts that listeners would deem objectionable because its data rate is restricted severely. If heard in isolation, the quality of the auxiliary speech is not adequate for broadcast applications.
Whether or not the coding noise that is associated with the auxiliary speech is audible after mixing with the main audio program depends on whether the main audio program masks the coding noise. Masking is likely to occur when the main program contains strong non-speech audio in addition to the speech audio. In contrast, the coding noise is unlikely to be masked when the main program is dominated by speech and the non-speech audio is weak or absent. These relationships are advantageous when viewed from the perspective of using the auxiliary speech to increase the relative level of the speech in the main audio program. Program sections that are most likely to benefit from adding auxiliary speech (i.e., sections with strong non-speech audio) are also most likely to mask the coding noise. Conversely, program sections that are most vulnerable to being degraded by coding noise (e.g., speech in the absence of background sounds) are also least likely to require enhanced dialog.
These observations suggest that, if a signal-adaptive mixing process is employed, it is possible to combine auxiliary speech that is audibly distorted with a high-quality main audio program to create an audio program with an increased ratio of speech to non-speech audio that is free of audible distortions. The adaptive mixer preferably limits the relative mixing levels so that the coding noise remains below the masking threshold caused by the main audio program. This is possible by adding low-quality auxiliary speech only to those sections of the audio program that have a low ratio of speech to non-speech audio initially. Exemplary implementations of this principle are described below.
In an experimental implementation of aspects of the invention, a speech encoder implemented as a CELP vocoder running at 8 kbit/s was found to be suitable and to provide the perceptual equivalent of about a 10-dB increase in the speech to non-speech audio level.
If the coding delays of the two encoders differ, at least one of the signals should be time shifted to maintain time alignment between the signals (not shown). The outputs of both the high-quality Audio Encoder 110 and the low-quality Speech Encoder 120 may subsequently be combined into a single bitstream by a multiplexer or multiplexing function (“Multiplexer”) 104 and packed into a bitstream 103 suitable for broadcasting or storage.
Referring now to the
The Signal-Adaptive Crossfader 181 scales the decoded auxiliary speech by α and the decoded main audio program by (1-α) prior to additively combining them in the Crossfader 160. The symmetry in the scaling causes the level and dynamic characteristics of the speech components in the resulting signal to be independent of the scaling factor α—the scaling does not affect the level of the speech components in the resulting signal nor does it impose any dynamic range compression or other modifications to the dynamic range of the speech components. The level of the non-speech audio in the resulting signal, in contrast, is affected by the scaling. Specifically, because the value of α increases with increasing power level P of the non-speech audio, the scaling tends to counteract any change of that level, effectively compressing the dynamic range of the non-speech audio signal. The form of the dynamic range compression is determined by the Transformation 170. For example, if the function α=ƒ(P) takes the form as shown in
The function of the Adaptive Crossfader 181 may be summarized as follows: when the level of the non-speech audio components is very low, the scaling factor α is zero or very small and the Adaptive Crossfader outputs a signal that is identical or nearly identical to the decoded main audio program. When the level of the non-speech audio increases, the value of α increases also. This leads to a larger contribution of the decoded auxiliary speech to the final audio program 180 and to a larger suppression of the decoded main audio program, including its non-speech audio components. The increased contribution of the auxiliary speech to the enhanced signal is balanced by the decreased contribution of speech in the main audio program. As a result, the level of the speech in the enhanced signal remains unaffected by the adaptive crossfading operation—the level of the speech in the enhanced signal is substantially the same level as the level of the decoded speech audio signal 141 and the dynamic range of the non-speech audio components is reduced. This is a desirable result inasmuch as there is no unwanted modulation of the speech signal.
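The adaptive crossfading just described can be sketched as follows. This is an illustrative simplification: the actual form of the Transformation 170 is defined by a figure not reproduced here, so a clamped piecewise-linear mapping from the non-speech power level P (in dB) to α is assumed, along with the break-point values:

```python
def scale_factor(p_db, alpha_max=0.8, p_low_db=-60.0, p_high_db=-20.0):
    """Map the non-speech power level P (dB) to the mixing factor alpha.

    Assumed shape: alpha rises linearly from 0 at p_low_db to alpha_max
    at p_high_db and is clamped outside that range.
    """
    t = (p_db - p_low_db) / (p_high_db - p_low_db)
    return alpha_max * min(1.0, max(0.0, t))

def crossfade_block(main_block, speech_block, p_db):
    # out = alpha * aux_speech + (1 - alpha) * main. The complementary
    # scaling keeps the speech level constant while effectively
    # compressing the dynamic range of the non-speech audio.
    a = scale_factor(p_db)
    return [a * s + (1.0 - a) * m for m, s in zip(main_block, speech_block)]
```

When the non-speech level is very low, α is zero and the output is the decoded main program unchanged, matching the behavior described above.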
For the speech level to remain unchanged, the amount of auxiliary speech added to the dynamic-range-compressed main audio signal should be a function of the amount of compression applied to the main audio signal. The added auxiliary speech compensates for the level reduction resulting from the compression. This automatically results from applying the scale factor α to the auxiliary speech signal and the complementary scale factor (1-α) to the main audio when α is a function of the dynamic range compression applied to the main audio. The effect on the main audio is similar to that provided by the “night mode” in AC-3 in which as the main audio level input increases the output is turned down in accordance with a compression characteristic.
To ensure that the coding noise does not become unmasked, the adaptive crossfader 160 should prevent the suppression of the main audio program beyond a critical value. This may be achieved by limiting α to be less than or equal to αmax. Although satisfactory performance may be achieved when αmax is a fixed value, better performance is possible if αmax is derived with a psychoacoustic masking model that compares the spectrum of the coding noise associated with the low-quality speech signal 141 to the predicted auditory masking threshold caused by the main audio program signal 131.
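A full psychoacoustic masking model is beyond the scope of a short example, but the clamping of α can be sketched with a crude per-band proxy in which the scaled coding-noise power must stay below a fixed fraction of the main-program band power. The `margin` parameter and the band-power inputs are assumptions standing in for the predicted auditory masking threshold, not a replacement for it:

```python
import math

def alpha_max_from_masking(noise_band_power, main_band_power,
                           margin=0.1, cap=1.0):
    """Largest alpha such that alpha^2 * Pn <= margin * Pm in every band.

    noise_band_power : per-band power of the coding noise of the
                       low-quality speech signal
    main_band_power  : per-band power of the main audio program
    margin           : crude stand-in for a masking threshold ratio
    """
    a = cap
    for pn, pm in zip(noise_band_power, main_band_power):
        if pn > 0.0:
            # Scaling an amplitude by alpha scales its power by alpha^2.
            a = min(a, math.sqrt(margin * pm / pn))
    return a
```

Bands where the main program is strong permit a large α; bands dominated by coding noise force αmax down, which is the qualitative behavior the text requires.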
Referring to the
The function of the
The decoding examples of
In the
Although the Compressor 301 gain is not critical, a gain of about 15 to 20 dB has been found to be acceptable.
The purpose of the Compressor 301 may be better understood by considering the operation of the
Problems such as overload or excessive loudness may be overcome by including Compressor 301 and adding compressed speech to the main audio. Assume again that α=1. When the instantaneous speech level is high, the compressor has no effect (0 dB gain) and the speech level of the summed signal increases by a comparatively small amount (6 dB). This is identical to the case in which there is no Compressor 301. But when the instantaneous speech level is low (say 30 dB below the peak level), the compressor applies a high gain (say 15 dB). When added to the main audio, the instantaneous speech level in the resultant audio is practically dominated by the compressed auxiliary audio, i.e., the instantaneous speech level is boosted by about 15 dB. Compare this to the 6-dB boost of the speech peaks. So even when α is constant (e.g., because the power level, P, of the non-speech audio components is constant), there is a time-varying speech to non-speech improvement that is largest in the speech troughs and smallest at the speech peaks.
As the level of the non-speech audio decreases and α decreases, the speech peaks in the summed audio remain nearly unchanged. This is because the level of the decoded speech copy signal is substantially lower than the level of the speech in the main audio (due to the attenuation imposed by α&lt;1) and adding the two together does not significantly affect the level of the resulting speech signal. The situation is different for low-level speech portions. They receive gain from the compressor and attenuation due to α. The end result is levels of the auxiliary speech that are comparable to (or even larger than, depending on the compressor settings) the level of the speech in the main audio. When added together they do affect (increase) the level of the speech components in the summed signal.
The end result is that the level of the speech peaks is more “stable” (i.e., never changes by more than 6 dB) than the speech level in the speech troughs. The speech to non-speech ratio is increased most where increases are needed most, and the level of the speech peaks changes comparatively little.
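The compressor behavior discussed above can be sketched numerically. The characteristic below is an assumption chosen only to reproduce the figures in the text (0 dB of gain at the peak level, 2:1 upward compression below it, capped at 15 dB); it is not the actual design of Compressor 301:

```python
def upward_compressor_gain_db(level_rel_peak_db, max_gain_db=15.0, ratio=2.0):
    """Gain (dB) applied to the auxiliary speech at a given level below peak.

    0 dB of gain at the peak; below the peak the gain grows at
    (1 - 1/ratio) dB per dB of level drop, capped at max_gain_db.
    With ratio=2, speech 30 dB below peak receives the full 15 dB.
    """
    gain = (1.0 - 1.0 / ratio) * max(0.0, -level_rel_peak_db)
    return min(gain, max_gain_db)

def enhance_block(main_block, speech_block, level_rel_peak_db, alpha=1.0):
    # This decoder variant *adds* the compressed auxiliary speech to the
    # main audio: out = main + alpha * g * speech.
    g = 10.0 ** (upward_compressor_gain_db(level_rel_peak_db) / 20.0)
    return [m + alpha * g * s for m, s in zip(main_block, speech_block)]
```

At a speech peak the summed signal doubles in amplitude (a 6-dB rise), while a trough 30 dB below peak is boosted by about 15 dB, matching the time-varying improvement described above.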
Because the psychoacoustic model is computationally expensive, it may be desirable from a cost standpoint to derive the largest permissible value of α at the encoding rather than the decoding side and to transmit that value, or components from which that value may be easily calculated, as a parameter or plurality of parameters. For example, that value may be transmitted as a series of αmax values to the decoding side. An example of such an arrangement is shown in
The function or device 203 also has knowledge of the processes performed by the decoder and the details of its operation depend on the decoder configuration in which αmax is used. Suitable decoder configurations may be in the form of the
If the stream of αmax values generated by the function or device 203 is intended to be used by a decoder such as illustrated in
If the stream of αmax values generated by the function or device 203 is intended to be used by a decoder such as illustrated in
The value of αmax should be updated at a rate high enough to reflect changes in the predicted masking threshold and in the coding noise 202 adequately. Finally, the coded auxiliary speech 121, the coded main audio program 111, and the stream of αmax values 204 may subsequently be combined into a single bitstream by a multiplexer or multiplexing function (“Multiplexer”) 104 and packed into a single data bitstream 103 suitable for broadcasting or storage. Those of ordinary skill in the art will understand that the details of multiplexing, demultiplexing, and the packing and unpacking of a bitstream in the various example embodiments are not critical to the invention.
Aspects of the present invention include modifications and extensions of the examples set forth above. For example, the speech signal and the main signal may each be split into corresponding frequency subbands in which the above-described processing is applied in one or more of such subbands and the resulting subband signals are recombined, as in a decoder or decoding process, to produce an output signal.
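A two-band version of this subband extension might be sketched as follows. The complementary one-pole split and the per-band speech gains are illustrative assumptions (a practical decoder would use a proper filterbank); this split is chosen only because the high band is defined as the residual, so the two bands recombine to the original signal exactly:

```python
def one_pole_lowpass(x, coeff=0.1):
    """Simple one-pole smoothing filter used as the low band."""
    y, out = 0.0, []
    for s in x:
        y += coeff * (s - y)
        out.append(y)
    return out

def split_two_bands(x):
    # Complementary split: high = x - low, so low + high reconstructs x.
    low = one_pole_lowpass(x)
    high = [s - l for s, l in zip(x, low)]
    return low, high

def process_per_band(main, speech, band_gains=(1.0, 2.0)):
    # Apply a (hypothetical) speech gain in each band, then recombine.
    m_lo, m_hi = split_two_bands(main)
    s_lo, s_hi = split_two_bands(speech)
    lo = [m + (band_gains[0] - 1.0) * s for m, s in zip(m_lo, s_lo)]
    hi = [m + (band_gains[1] - 1.0) * s for m, s in zip(m_hi, s_hi)]
    return [a + b for a, b in zip(lo, hi)]
```

With unity gains in both bands the output equals the main program, confirming that the split-and-recombine path itself is transparent.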
Aspects of the present invention may also allow a user to control the degree of dialog enhancement. This may be achieved by scaling the scaling factor α with an additional user-controllable scale factor β to obtain a modified scaling factor α′, i.e., α′ = β·α, where 0 ≤ β ≤ 1. If β is selected to be zero, the unmodified main audio program is always heard. If β is selected to be 1, the maximum amount of dialog enhancement is applied. Because αmax ensures that the coding noise is never unmasked, and because the user can only reduce the degree of dialog enhancement relative to the maximal degree of enhancement, the adjustment does not carry the risk of making coding distortions audible.
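The user adjustment can be expressed in a few lines; `effective_alpha` is a hypothetical helper combining the β scaling with a clamp to an αmax value, and is not a name used in the text:

```python
def effective_alpha(alpha, alpha_max, beta):
    """User-scaled mixing factor: alpha' = beta * min(alpha, alpha_max).

    beta = 0 yields the unmodified main program; beta = 1 applies the
    maximum enhancement permitted without unmasking the coding noise.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * min(alpha, alpha_max)
```

Because β only attenuates an already-safe α, no choice of β can make the coding distortions audible.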
In the embodiments just described, the dialog enhancement is performed on the decoded audio signals. This is not an inherent limitation of the invention. In some situations, for example when the audio coder and the speech coder employ the same coding principles, at least some of the operations may be performed in the coded domain (i.e., before full or partial decoding).
The following patents, patent applications and publications are hereby incorporated by reference, each in its entirety.
The invention may be implemented in hardware or software, or a combination of both (e.g., programmable logic arrays). Unless otherwise specified, the algorithms included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion.
Each such program may be implemented in any desired computer language (including machine, assembly, or high level procedural, logical, or object oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.
Each such computer program is preferably stored on or downloaded to a storage medium or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.
A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, some of the steps described herein may be order independent, and thus can be performed in an order different from that described.
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/US08/01841 | 2/12/2008 | WO | 00 | 8/11/2009

Number | Date | Country
---|---|---
60900821 | Feb 2007 | US