This application claims priority to Chinese application Serial No. 202110895493.X filed Aug. 5, 2021, the disclosure of which is hereby incorporated in its entirety by reference herein.
The present disclosure relates generally to the field of audio signal processing, and more particularly, to a method and system for dynamic voice enhancement of an audio source.
Thanks to new ways of media consumption such as high-definition cable TV and online streaming, and with the advent of large-screen TVs and displays, the cinema experience is gaining popularity in the consumer market. These media sources are often accompanied by multi-channel audio technology, commonly referred to as surround technology. Surround providers such as Dolby, THX, and DTS each have their own multi-channel audio encoding technology that provides better spatial audio resolution for source content. Since one purpose of content in a movie format is to provide an immersive surround experience, voice intelligibility is often sacrificed in favor of the surround experience. While this provides benefits in terms of immersion and spatial resolution, it often results in poor voice quality and sometimes even difficulty in understanding the movie content. In order to improve the intelligibility and audibility of voice in a movie content source, methods of voice enhancement are often applied to the movie content.
A common existing method of voice enhancement is to apply static equalization. This method applies static equalization to an audio channel only in a band of about 200 Hz to 4 kHz to increase the loudness of the voice band. This implementation requires very few system resources, but it introduces obvious distortion. Since it operates all the time, even when there is no voice or dialogue in a clip, it causes a pitch imbalance and amplifies the background. A more advanced method is to first detect voice within each time frame and then automatically process the audio signal based on the detection result. This one-way execution method requires accurate detection of voice and fast response of system processing. However, some existing methods cannot detect voice quickly and accurately, and often color the signal so that it sounds harsh.
Therefore, there is a need for an improved technical solution to overcome the above-mentioned shortcomings in the existing solutions.
According to an aspect of the present disclosure, a method of dynamic voice enhancement is provided. The method may include performing a first path signal processing, the first path signal processing including receiving an audio source input and performing dynamic loudness balancing on the audio source input based on a first gain control parameter. The method may also include: performing a second path signal processing, the second path signal processing including performing voice detection on the audio source input and calculating a detection confidence, wherein the detection confidence indicates the possibility of voice in the audio source input; and calculating a second gain control parameter based on the detection confidence. The method may further include updating the first gain control parameter with the second gain control parameter, and performing the first path signal processing based on the updated first gain control parameter.
According to one or more embodiments, the audio source input may include a multi-channel source input, and performing voice detection on the audio source input and calculating a detection confidence may include: extracting a center channel signal from the multi-channel source input; performing normalization on the center channel signal; and performing fast autocorrelation on the normalized center channel signal, the result of the fast autocorrelation representing the detection confidence.
According to one or more embodiments, calculating a second gain control parameter based on the detection confidence may include: calculating the second gain control parameter based on a logarithmic function of the detection confidence; smoothing the calculated second gain control parameter; and limiting the smoothed second gain control parameter.
According to one or more embodiments, the audio source input may include a multi-channel source input, and performing dynamic loudness balancing on the audio source input includes: extracting a center channel signal from the multi-channel source input; enhancing the loudness of the center channel signal and reducing the loudness of other channel signals based on the first gain control parameter or the updated first gain control parameter; and concatenating and mixing the enhanced center channel signal and the reduced other channel signals to generate an output signal.
According to one or more embodiments, the method may also include performing crossover filtering on the audio source input before performing the dynamic loudness balancing.
According to one or more embodiments, the method may also include: performing the dynamic loudness balancing only on signals in a mid frequency range of the audio source input; and concatenating and mixing signals in a low frequency range and a high frequency range of the audio source input and signals in the mid frequency range of the audio source input after the dynamic loudness balancing to generate the output signal.
According to one or more embodiments, the audio source input also includes a dual-channel source input, and the method also includes generating a multi-channel source input based on the dual-channel source input.
According to one or more embodiments, the generating a multi-channel source input based on the dual-channel source input may include: performing a cross-correlation between a left channel signal and a right channel signal from the dual-channel source input; and generating the multi-channel source input according to a combination ratio. The combination ratio depends on the result of the cross-correlation.
According to one or more embodiments, the first path signal processing and the second path signal processing are synchronous or asynchronous.
According to another aspect of the present disclosure, a system for voice enhancement is provided, including: a memory and a processor. The memory is configured to store computer-executable instructions. The processor is configured to execute the instructions to implement the method described above.
The present disclosure may be better understood by reading the following description of non-limiting implementations with reference to the accompanying drawings, in which:
It should be understood that the following description of the embodiments is given for purposes of illustration only and not limitation. The division of examples in functional blocks, modules or units shown in the figures should not be construed as implying that these functional blocks, modules or units must be implemented as physically separate units. The functional blocks, modules or units shown or described may be implemented as separate units, circuits, chips, functional blocks, modules, or circuit elements. One or more functional blocks or units may also be implemented in a common circuit, chip, circuit element, or unit.
The use of singular terms (for example, but not limited to, “a”) is not intended to limit the number of items. Relational terms, for example but not limited to, “top,” “bottom,” “left,” “right,” “upper,” “lower,” “down,” “up,” “side,” “first,” “second” (“third,” etc.), “entry,” “exit,” etc., are used herein for clarity in specific reference to the drawings and are not intended to limit the scope of the present disclosure or the appended claims, unless otherwise noted. The terms “couple,” “coupling,” “being coupled,” “coupled,” “coupler,” and similar terms are used broadly herein and may include any method or device for fixing, bonding, adhering, fastening, attaching, associating, inserting, forming thereon or therein, communicating with, or otherwise directly or indirectly mechanically, magnetically, electrically, chemically, and operatively associated with an intermediate element and one or more members, or may also include, but is not limited to, one member being integrally formed with another member in a unified manner. Coupling may occur in any direction, including rotationally. The terms “including” and “such as” are illustrative rather than restrictive, and the word “may” entails “may, but not necessarily,” unless stated otherwise. Although any other language is used in the present disclosure, the embodiments shown in the figures are examples given for purposes of illustration and explanation and are not the only embodiments of the subject matter herein.
In order to overcome the defects of the existing technical solutions and improve the quality of voice output so as to bring a better experience to users, the present disclosure proposes a solution of actively detecting human voice and dynamically enhancing voice loudness in an audio source (for example, a theater audio source) based on a detection confidence that indicates the possibility of voice in an audio source input. The method and system of the present disclosure may simultaneously perform signal processing of two paths on an input signal. The first path signal processing includes receiving an audio source input and performing dynamic loudness balancing on the audio source input based on a first gain control parameter. The second path signal processing includes: performing voice detection on the audio source input and calculating a detection confidence; and calculating a second gain control parameter based on the detection confidence. The first path signal processing and the second path signal processing may be synchronous or asynchronous. The method of the present disclosure also includes updating the first gain control parameter with the second gain control parameter calculated by a second processing path, and performing the first path signal processing based on the updated first gain control parameter. The method and system of the present disclosure can better enhance the intelligibility of voice and improve the user's experience of using audio products.
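The two-path structure described above can be summarized as a per-frame loop in which the second path produces a gain control parameter that the first path then applies. The following sketch is purely illustrative: `detect_confidence` and `balance` are hypothetical stand-ins (the disclosure uses normalized fast autocorrelation for detection and per-channel loudness balancing, described later), and the mapping from confidence to gain is a toy choice.

```python
def detect_confidence(frame):
    # Hypothetical stand-in for the second-path voice detector; the
    # disclosure uses normalized fast autocorrelation here instead.
    e = sum(s * s for s in frame) / max(len(frame), 1)
    return min(1.0, e)

def balance(frame, gain):
    # Hypothetical stand-in for dynamic loudness balancing: one gain
    # applied to the whole frame.
    return [(1.0 + gain) * s for s in frame]

def enhance(frames):
    """Two-path loop: each frame first passes through voice detection
    (second path), whose result updates the gain control parameter used
    by the dynamic loudness balancing (first path)."""
    gain = 0.0                            # first gain control parameter
    out = []
    for frame in frames:
        conf = detect_confidence(frame)   # second path: detection confidence
        gain = 0.5 * conf                 # second gain control parameter (toy mapping)
        out.append(balance(frame, gain))  # first path with the updated gain
    return out
```

In a real system the two paths may run synchronously or asynchronously; the sequential loop here simply makes the parameter update explicit.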
The method and system of dynamic voice enhancement according to various embodiments of various implementations of the present disclosure will be described in detail below with reference to the accompanying drawings.
Referring to
The new gain control parameter estimated by the gain control module 110 may be used to update the gain control parameter currently used by the dynamic loudness balancing module 104. Thus, the dynamic loudness balancing module 104 may perform the first path signal processing based on the updated gain control parameter. That is, the dynamic loudness balancing module 104 may perform dynamic loudness balancing on the received audio source input signal based on the updated gain control parameter. The audio signal after the dynamic loudness balancing may be output through the signal output module 106.
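For a multi-channel input, the dynamic loudness balancing enhances the center channel and reduces the other channels before mixing. A minimal sketch follows; the mapping from the gain control parameter to the boost and cut factors is a tuning assumption not specified by the disclosure (here a boost of 1 + gain and a reciprocal cut, assuming gain > −1).

```python
def dynamic_loudness_balance(channels, center_idx, gain):
    """Enhance the center channel and attenuate the other channels
    based on the (updated) gain control parameter, returning the
    per-channel result ready for concatenating and mixing. The
    boost/cut mapping is an assumption, not taken from the disclosure."""
    boost = 1.0 + gain
    cut = 1.0 / boost
    return [[(boost if i == center_idx else cut) * s for s in ch]
            for i, ch in enumerate(channels)]
```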
The audio source input may include a multi-channel source input, a dual-channel source input, or a single-channel source input. The processing aspects of the different source inputs will be described below respectively with reference to the accompanying drawings.
xi_norm(n)=(xi(n)−μi)/σi (1)

where xi(n) represents the input signal at the nth sampling point of the ith time frame, and xi_norm(n) represents the corresponding output signal, that is, the normalized signal. μi and σi are the mean and standard deviation of the input signal in the ith time frame.
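Equation (1) is ordinary per-frame standardization and may be sketched as follows (the guard for a silent frame, where σi would be zero, is an added assumption):

```python
import math

def normalize_frame(x):
    """Per-frame normalization per equation (1): subtract the frame
    mean and divide by the frame standard deviation (a silent frame is
    passed through unscaled rather than dividing by zero)."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in x) / n) or 1.0
    return [(s - mu) / sigma for s in x]
```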
Next, fast autocorrelation processing is performed on the normalized signal and an autocorrelation result is output. For example, the fast autocorrelation processing may first perform a Fourier transformation on the normalized input signal by using a short-time Fourier transform (STFT) method, and perform fast autocorrelation on the Fourier transformed signal. For example, the fast autocorrelation processing procedure is shown in the following equations (2)-(4).
Xi(z)=STFT(xi_norm(n)) (2)

ci(n)=iSTFT(Xi(z)·Xi*(z)) (3)

Ci=norm(ci(n)) (4)

where Xi(z) is the Fourier transformed signal, Xi*(z) is its complex conjugate, ci(n) is the fast autocorrelation result, and Ci is the normalized autocorrelation result, which serves as the detection confidence.
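The frequency-domain autocorrelation of equations (2)-(4) can be sketched as below. Two caveats: a direct DFT replaces the FFT purely for readability (the FFT is what makes the autocorrelation "fast"), and collapsing ci(n) to one scalar per frame via the largest non-zero-lag peak over the lag-0 value is an assumption about norm(·), which the disclosure does not pin down.

```python
import cmath

def dft(x):
    # Direct discrete Fourier transform (an FFT would be used in practice).
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(X):
    # Inverse discrete Fourier transform.
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]

def autocorr_confidence(x_norm):
    """Equations (2)-(4): autocorrelation via the frequency domain
    (Wiener-Khinchin), then reduced to one confidence value per frame.
    Periodic (voiced) frames score near 1; noise-like frames score
    near 0."""
    X = dft(x_norm)                              # equation (2)
    power = [v * v.conjugate() for v in X]       # Xi(z) . Xi*(z)
    c = [v.real for v in idft(power)]            # ci(n), equation (3)
    if c[0] <= 0:
        return 0.0
    return max(abs(v) for v in c[1:]) / c[0]     # Ci, equation (4) (assumed norm)
```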
Gi=D0*ln(Ci+D1) (5)
where Gi represents an output of a dynamic control module; D0 and D1 are control parameters of a dynamic gain fluctuation range, which may be real numbers greater than zero; and ln(·) is a natural logarithmic function. In some examples, Gi may be provided to dynamic loudness balancing module 104 as an output from the gain control module 110.
In some other examples, Gi may be further processed and then serve as the output from the gain control module 110. For example, Gi may be smoothed to reduce audio distortion. In addition, a soft limiter may be used to ensure that the resulting gain Gi_lim is within a reasonable range of magnitude. For example, the hyperbolic tangent function of the following equation (6) may be used as the soft limiter.
Gi_lim=tanh(αGi+β)+γ (6)
where α, β, and γ are limiter parameters that depend on the system configuration; α may be a real number greater than zero, and β and γ may be non-zero real numbers. In this case, Gi_lim serves as the output from the gain control module 110.
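The full gain control chain, equation (5) followed by smoothing and the equation (6) soft limiter, may be sketched as below. The one-pole smoother and all default parameter values are assumptions; the disclosure only requires D0, D1 > 0 and α > 0, and does not specify the smoothing form.

```python
import math

def gain_update(confidence, prev_gain, d0=1.0, d1=1.0,
                alpha=1.0, beta=0.0, gamma=0.0, smooth=0.8):
    """Logarithmic gain from the detection confidence (equation (5)),
    assumed one-pole smoothing against the previous frame's gain, then
    a tanh soft limiter (equation (6))."""
    g = d0 * math.log(confidence + d1)            # equation (5)
    g = smooth * prev_gain + (1.0 - smooth) * g   # smoothing (assumed one-pole)
    return math.tanh(alpha * g + beta) + gamma    # equation (6) soft limiter
```

With D1 = 1 a zero confidence yields zero gain, and the tanh limiter bounds the output regardless of how large the smoothed gain grows.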
A number of processing methods performed in the case where the source input is a multi-channel source input with a center channel are described above in conjunction with
In the case where the source input is a dual-channel source input, it is necessary to add a center extraction process in advance before implementing the method and system disclosed above, so that a multi-channel source input is generated based on the dual-channel source input.
An upmixing process shown in
center(n)=θ*corr(left(n),right(n))*(left(n)+right(n)) (7)
where left(n) is the left channel input signal, right(n) is the right channel input signal, center(n) is the center channel signal, corr(·) represents a cross-correlation function, and θ is a tuning parameter in practice, with θ greater than 0 and less than or equal to 1.
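Equation (7) may be sketched per frame as follows. corr(·) is taken here as the zero-lag normalized cross-correlation coefficient over the frame, clamped to be non-negative so that anti-correlated content contributes no center; both choices are assumptions, since the disclosure does not fix the correlation lag or window.

```python
def extract_center(left, right, theta=0.5):
    """Equation (7): center(n) = theta * corr(left, right) * (left(n) + right(n)).
    Strongly correlated left/right content (typically dialogue) maps
    into the center channel; uncorrelated or anti-correlated content
    contributes little or nothing."""
    num = sum(l * r for l, r in zip(left, right))
    den = (sum(l * l for l in left) * sum(r * r for r in right)) ** 0.5
    corr = max(num / den, 0.0) if den else 0.0   # clamped zero-lag coefficient
    return [theta * corr * (l + r) for l, r in zip(left, right)]
```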
The method and system provided by the present disclosure may be applied not only to consumer products such as soundbars and stereo speakers, but also to products in cinema applications such as theaters and concert halls. They can better enhance the intelligibility of voice and improve the user's experience of using audio products and applications. The method and system described in the present disclosure with reference to the accompanying drawings may be implemented by at least one processor.
Aspect 1. A method for dynamic voice enhancement, comprising: performing a first path signal processing, the first path signal processing comprising receiving an audio source input and performing dynamic loudness balancing on the audio source input based on a first gain control parameter; performing a second path signal processing, the second path signal processing comprising: performing voice detection on the audio source input and calculating a detection confidence, wherein the detection confidence indicates the possibility of voice in the audio source input; and calculating a second gain control parameter based on the detection confidence; and updating the first gain control parameter with the second gain control parameter, and performing the first path signal processing based on the updated first gain control parameter.
Aspect 2. The method according to aspect 1, wherein the audio source input comprises a multi-channel source input, and the performing voice detection on the audio source input and calculating a detection confidence comprises: extracting a center channel signal from the multi-channel source input; performing normalization on the center channel signal; and performing fast autocorrelation on the normalized center channel signal, the result of the fast autocorrelation representing the detection confidence.
Aspect 3. The method according to any one of the preceding aspects, wherein calculating a second gain control parameter based on the detection confidence comprises: calculating the second gain control parameter based on a logarithmic function of the detection confidence; smoothing the calculated second gain control parameter; and limiting the smoothed second gain control parameter.
Aspect 4. The method according to any one of the preceding aspects, wherein the audio source input comprises a multi-channel source input, and performing dynamic loudness balancing on the audio source input comprises: extracting a center channel signal from the multi-channel source input; enhancing the loudness of the center channel signal and reducing the loudness of other channel signals based on the first gain control parameter or the updated first gain control parameter; and concatenating and mixing the enhanced center channel signal and the reduced other channel signals to generate an output signal.
Aspect 5. The method according to any one of the preceding aspects, further comprising: performing crossover filtering on the audio source input before performing the dynamic loudness balancing.
Aspect 6. The method according to any one of the preceding aspects, further comprising: performing the dynamic loudness balancing only on signals in a mid frequency range of the audio source input; and concatenating and mixing signals in a low frequency range and a high frequency range of the audio source input and signals in the mid frequency range of the audio source input after the dynamic loudness balancing to generate the output signal.
Aspect 7. The method according to any one of the preceding aspects, wherein the audio source input further comprises a dual-channel source input, and the method further comprises generating a multi-channel source input based on the dual-channel source input.
Aspect 8. The method according to any one of the preceding aspects, wherein the generating a multi-channel source input based on the dual-channel source input comprises: performing a cross-correlation between a left channel signal and a right channel signal from the dual-channel source input; and generating the multi-channel source input according to a combination ratio, wherein the combination ratio depends on the result of the cross-correlation.
Aspect 9. The method according to any one of the preceding aspects, wherein the first path signal processing and the second path signal processing are synchronous or asynchronous.
Aspect 10. A system of dynamic voice enhancement, comprising: a memory configured to store computer-executable instructions; and a processor configured to execute the computer-executable instructions to implement the method according to any one of the preceding aspects 1-9.
The description of the implementations has been presented for the purposes of illustration and description. The implementations may be appropriately modified and changed according to the above description or these modifications and changes may be obtained by practicing the method. For example, unless otherwise indicated, one or more of the methods described may be performed by a suitable device and/or a combination of devices. The method may be performed by using one or more logic devices (for example, processors) in combination with one or more additional hardware elements (such as storage devices, memories, hardware network interfaces/antennas, switches, actuators, clock circuits, etc.) to perform stored instructions. The method described and associated actions may also be executed in parallel and/or simultaneously in various orders other than the order described in this application. The system described is illustrative in nature, and may include additional elements and/or omit elements. The subject matter of the present disclosure includes all novel and non-obvious combinations of the disclosed various systems and configurations as well as other features, functions, and/or properties.
The system may include additional or different logic, and may be implemented in many different ways. The processor may be implemented as a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a digital signal processor (DSP), discrete logic, or a combination of these and/or other types of circuits or logic. Similarly, the memory may be a dynamic random access memory (DRAM), a static random access memory (SRAM), a flash memory, or another type of memory. Parameters (for example, conditions and thresholds) and other data structures may be stored and managed separately, may be combined into a single memory or database, or may be logically and physically organized in many different ways. Programs and instruction sets may be parts of a single program, or separate programs, or distributed across a plurality of memories and processors.
As used in this application, an element or step listed in the singular form and preceded by the word “a/one” should be understood as not excluding a plurality of said elements or steps, unless such exclusion is indicated. Furthermore, references to “one implementation” or “one example” of the present disclosure are not intended to be interpreted as excluding the existence of additional implementations that also incorporate the recited features. The present invention has been described above with reference to specific implementations. However, those of ordinary skill in the art will appreciate that various modifications and changes may be made therein without departing from the broader spirit and scope of the present invention as set forth in the appended claims.