Priority is claimed to application No. 202311388318.7, filed Oct. 24, 2023, in China, the disclosure of which is incorporated in its entirety by reference.
The inventive subject matter relates generally to the field of audio signal processing, and more particularly to a method and a system for intelligent dynamic speech enhancement for an audio source.
With the proliferation of high-definition cable TV, online streaming, and large-screen display devices, the home theater experience is becoming increasingly popular in the marketplace. These media sources are typically equipped with multi-channel audio to provide users with a more immersive surround sound experience. However, since the primary purpose of the movie format is to provide highly immersive sound effects, speech clarity is often sacrificed in the pursuit of surround sound.
Speech enhancement technology plays a vital role in the filmmaking process and is designed to improve the quality, audibility, and clarity of the dialogue in film material.
The development of existing speech enhancement technology can be traced to surround sound technology. For example, surround sound providers such as Dolby, THX, and DTS provide multi-channel audio coding technologies for better spatial resolution and a richer stereo sound experience. This technology immerses the audience in a surround sound environment, but sometimes results in less clear dialogue in the mix. In addition, filmmaking includes a mixing process in which a sound mixer is responsible for adjusting the volume, balance, and spatial positioning of different audio elements, including dialogue, music, and sound effects, in order to create a realistic listening effect. However, in the pursuit of an immersive surround sound experience, dialogue clarity tends to be compromised.
Therefore, there is a need for a method and a system for intelligent dynamic speech enhancement for an audio source to overcome the above disadvantages in the existing solutions.
In one aspect, the inventive subject matter provides a method for intelligent dynamic speech enhancement. The method for intelligent dynamic speech enhancement comprises performing speech detection and intelligent enhancement gain control on a multi-channel audio source input to determine speech enhancement gain, the multi-channel audio source input comprising a signal of a center channel and signals of other channels. The method for intelligent dynamic speech enhancement further comprises applying the speech enhancement gain in dynamic loudness balancing performed on the multi-channel audio source input, wherein the intelligent enhancement gain control comprises setting the speech enhancement gain based on a signal power strength ratio of the signal of the center channel to a sum of the signals of the other channels, and setting the speech enhancement gain based on a system volume level.
In another aspect, the inventive subject matter provides a system for intelligent dynamic speech enhancement. The system for intelligent dynamic speech enhancement comprises a memory configured to store computer-executable instructions, and a processor configured to execute the computer-executable instructions to implement the method for intelligent dynamic speech enhancement.
The inventive subject matter can be better understood by reading the following description of non-limiting implementations with reference to the accompanying drawings, in which:
It should be understood that the following description of the embodiments is given for illustrative purposes only, and is not restrictive. The division of the examples into the functional blocks, modules, or units illustrated in the accompanying drawings should not be interpreted as indicating that these functional blocks, modules, or units must be implemented as physically separate units. The functional blocks, modules, or units illustrated or described may be implemented as separate units, circuits, chips, functional blocks, modules, or circuit elements. One or more functional blocks or units may also be implemented in a common circuit, chip, circuit element, or unit.
The use of singular terms (for example, but not limited to, “one”) is not intended to limit the number of items. The use of relational terms, for example, but not limited to, “top”, “bottom”, “left”, “right”, “upper”, “lower”, “downward”, “upward”, “side”, “first”, “second” (“third”, and the like), “entrance”, “exit”, and the like shall be used in written descriptions for the purpose of clarity in the specific reference to the accompanying drawings and are not intended to limit the scope of the inventive subject matter or the accompanying claims, unless otherwise indicated. The terms “couple”, “coupling”, “being coupled”, “coupled”, “coupler” and similar terms are used broadly herein and may include any method or apparatus for fixing, bonding, adhering, fastening, attaching, combining, inserting thereon, forming thereon or therein, and in communication therewith, or otherwise being directly or indirectly mechanically, magnetically, electrically, chemically, or operatively associated with an intermediate element, or one or more members, or may also include, but is not limited to, one member being integrally formed with another member in a uniform manner. The coupling may occur in any direction, including in a rotational manner. The terms “comprise/include” and “such as” are illustrative rather than restrictive and, unless otherwise indicated, the word “may” means “may, but does not have to”. Notwithstanding the use of any other language in the inventive subject matter, the embodiments illustrated in the accompanying drawings are examples given for purposes of illustration and explanation and are not the only embodiments of the subject matter herein.
In order to improve the quality of the speech output and thereby provide a better listening experience to a user, the inventive subject matter proposes a solution for active detection of human speech based on detection confidence level scores and intelligent dynamic enhancement of speech loudness for an audio source (e.g., a theater audio source).
One speech enhancement method for improving the quality, audibility, and clarity of dialogue in, for example, a movie product utilizes static equalization techniques. The method may use a static equalizer that typically adjusts the frequency response within the range of 200 Hz to 4 kHz to increase the volume and the clarity of the dialogue range, thereby emphasizing the speech dialogue. However, the disadvantage of this approach is that its processing of the audio is active throughout the entire audio clip, resulting in unbalanced sound even in the absence of dialogue and amplifying background noise at the same time. Another, dynamic speech enhancement method is to detect speech signals in each time frame and apply adaptive audio processing based on the result of the detection. This method enables the sound to be enhanced when speech is detected so as to improve the clarity and the intelligibility of the dialogue. However, this method requires very accurate and fast speech detection algorithms to process the sound quickly.
However, when speech enhancement is applied continuously at high system volumes, there are side effects: for example, it can lead to perceptual imbalance and a loss of dynamics in the listening experience. In addition, in audio clips where only speech signals are present, the speech may already be clear and not need further enhancement. Accordingly, in order to obtain reasonable audio processing and enhancement methods to improve the quality of dialogue in a movie product and to allow viewers to better understand and appreciate the content of the movie, the disclosure of the inventive subject matter provides a method and a system for further optimized and intelligent dynamic speech enhancement.
The inventive subject matter focuses on a multi-channel audio source input from which speech and environment channels in an audio source can be easily extracted and directly analyzed.
The audio source input that can be handled by this system may include a single-channel source input, a dual-channel source input, and a multi-channel audio source input. In the home theater example, the audio source input can typically be considered to be a multi-channel audio source input that has already been configured. For example, in a 5.1 Dolby surround sound theater, the audio source input can typically include one bass channel and five surround channels, where these five surround channels include a center channel, a left front channel, a right front channel, a left rear channel, and a right rear channel. In the case of the multi-channel audio source input, most of the speech signals are usually present in the center channel, and the other channels (i.e., the left front channel, the right front channel, the left rear channel, and the right rear channel, or the like) can be considered to be the surrounding environment channels. Accordingly, in an example of the inventive subject matter, the speech detection processing in the speech detection module uses a method for detecting speech in the center channel.
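For illustration only, the following sketch shows one way such a channel split could be performed on a block of 5.1 audio; the channel ordering, the function name, and the decision to exclude the LFE channel are assumptions and are not taken from the disclosure.

```python
import numpy as np

# Illustrative sketch: separate the center channel (which usually carries the
# dialogue) from the surrounding environment channels in a block of 5.1 audio.
# The channel ordering below is an assumption.
CHANNEL_ORDER = ["L", "R", "C", "LFE", "Ls", "Rs"]

def split_center_and_others(block: np.ndarray):
    """block has shape (samples, channels); returns (center, other_channels)."""
    c_idx = CHANNEL_ORDER.index("C")
    center = block[:, c_idx]
    # Environment channels: everything except the center and the bass (LFE) channel.
    other_idx = [i for i, name in enumerate(CHANNEL_ORDER) if name not in ("C", "LFE")]
    return center, block[:, other_idx]
```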
In the speech detection, the signal of the center channel is first extracted from the multi-channel audio source input, and normalization processing is performed on the extracted signal, for example, according to the following Equation (1):

xi_norm(n) = (xi(n) − μi) / σi    (1)
where xi(n) denotes the input signal of the nth sampling point of the ith time frame, and xi_norm(n) denotes the output signal, i.e., the normalized signal, of the nth sampling point of the ith time frame. μi and σi are the mean and the standard deviation, respectively, of the input signal of the ith time frame.
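A minimal sketch of this per-frame normalization is shown below; the epsilon guard against silent frames is an implementation assumption rather than part of the disclosure.

```python
import numpy as np

# Per-frame normalization of Equation (1): subtract the frame mean and divide
# by the frame standard deviation. The small epsilon is an assumed guard
# against division by zero on silent frames.
def normalize_frame(x_i: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    mu_i = np.mean(x_i)        # mean of the i-th time frame
    sigma_i = np.std(x_i)      # standard deviation of the i-th time frame
    return (x_i - mu_i) / (sigma_i + eps)
```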
Next, fast autocorrelation processing 206 is performed on the normalized signal and the autocorrelation result is output. For example, the fast autocorrelation processing may first perform a Fourier transform on the normalized input signal using a short-time Fourier transform (STFT) method and then compute the autocorrelation efficiently from the Fourier-transformed signal, for example, by taking the inverse transform of its power spectrum.
Here, the Fourier-transformed signal of the ith time frame may be denoted as Xi(z), and the resulting autocorrelation can be taken as a detection confidence level Ci that is indicative of the possibility of speech being present in the signal of the center channel.
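The following sketch illustrates such a frequency-domain ("fast") autocorrelation; how the disclosure actually maps the autocorrelation result to the confidence level is not reproduced here, and the use of the largest normalized off-zero-lag peak, the zero padding, and the min_lag parameter are assumptions.

```python
import numpy as np

# Frequency-domain autocorrelation of a normalized frame: the autocorrelation
# is the inverse transform of the power spectrum (Wiener-Khinchin theorem).
def fast_autocorrelation(x_norm: np.ndarray) -> np.ndarray:
    n = len(x_norm)
    nfft = 2 * n                              # zero-pad to avoid circular wrap-around
    spectrum = np.fft.rfft(x_norm, nfft)      # Fourier transform of the frame
    power = spectrum * np.conj(spectrum)      # power spectrum
    acorr = np.fft.irfft(power, nfft)[:n]     # autocorrelation for lags 0..n-1
    return acorr / (acorr[0] + 1e-12)         # normalize so that lag 0 equals 1

def detection_confidence(x_norm: np.ndarray, min_lag: int = 40) -> float:
    # Voiced speech produces a strong secondary autocorrelation peak; taking the
    # largest peak beyond an assumed minimum lag is only one possible confidence measure.
    acorr = fast_autocorrelation(x_norm)
    return float(np.max(acorr[min_lag:]))
```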
Next, in the intelligent enhancement gain control, the detection confidence level output by the speech detection is converted into an enhancement gain by a dynamic control module.
The dynamic range of the enhancement gain can be defined in terms of the detection confidence level, for example, by mapping the confidence level to the gain through a natural logarithmic function ln(⋅). Here, Gi denotes the gain of the dynamic control module 304, Ci represents the aforementioned detection confidence level, and D0 and D1 are control parameters for a dynamic gain fluctuation range, which control parameters may be real numbers greater than zero.
In some examples, further processing of the enhancement gain Gi is needed to reduce audio distortion. For example, smoothing processing 306 can be performed on Gi.
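Purely as an illustration of this step, the sketch below assumes a logarithmic confidence-to-gain mapping and simple one-pole smoothing; the exact formula of the dynamic control module, the constants d0 and d1, and the smoothing coefficient are assumptions and not the disclosed definitions.

```python
import numpy as np

# Assumed log-domain mapping from the detection confidence C_i to the gain G_i,
# with D0 and D1 setting the gain fluctuation range, followed by exponential
# smoothing of the gain trajectory to reduce audible distortion.
def confidence_to_gain(c_i: float, d0: float = 1.0, d1: float = 0.5) -> float:
    c_i = max(c_i, 1e-6)                      # keep the logarithm defined
    return max(d0 + d1 * np.log(c_i), 0.0)    # assumed form, not the patented formula

def smooth_gain(g_i: float, g_prev: float, smoothing_coeff: float = 0.9) -> float:
    # One-pole smoothing across time frames (smoothing_coeff is an assumed constant).
    return smoothing_coeff * g_prev + (1.0 - smoothing_coeff) * g_i
```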
In one or more implementations of the inventive subject matter, for the gain Gi, when speech enhancement is applied continuously at high system volumes, the continuous enhancement of the speech can lead to side effects, for example, perceptual imbalance and loss of dynamics, in the listening experience. In addition, in scenarios where only speech signals are present, for these audio clips where the speech is already clear, there is no longer a need to apply further gain processing to enhance the speech signals. Accordingly, to address these side effects, the inventive subject matter further provides intelligent logical judgment 308 and intelligent enhancement gain 310 processing to intelligently optimize the speech enhancement gain processing.
In an example, in the intelligent logical judgment, the setting of the speech enhancement gain is judged by comparing the signal power strength of the center channel with that of the other channels. Specifically, in one or more embodiments of the inventive subject matter, the energy of the signals of the other channels is first extracted and summed and then compared to the energy of the center channel. For example, when there exists a speech signal in the center channel and the surrounding environment sound in the other channels is loud, the speech enhancement gain needs to be set high, while, for example, when there exists a speech signal in the center channel but the environment sound in the other channels is weak, the speech enhancement gain can be set low. This relationship can be expressed in terms of the power ratio of the center channel to the other channels:
where Gp is the gain of the intelligent speech enhancement control module, (Pc/Po) is the power ratio of the center channel to the other channels, Pc is the power of the center channel, Po is the sum of the power of the other channels, and α is a limiting parameter according to the system configuration.
In other words, the speech enhancement gain needs to be set high when the signal power strength ratio of the center channel to the sum of the other channels is small; and the speech enhancement gain can be set low when the signal power strength ratio of the center channel to the sum of the other channels is large.
By performing a logical judgment on the signal power strength ratio of the center channel to the sum of the other channels, i.e., by comparing the relative strength of the environment sound with that of the vocal dialogue in the audio clip, the problem of perceptual imbalance when enhancing the audio sound can be solved. In the case of poor audibility and clarity due to a strong environment sound and an insufficiently prominent human voice, the gain of the human voice dialogue is further intelligently optimized so as to provide the user with better listening perception.
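As an illustration of this judgment, the sketch below computes the center-to-others power ratio and chooses a gain that varies inversely with it; the inverse mapping, the limiter value alpha, and the function names are assumptions rather than the disclosed formula.

```python
import numpy as np

# Power-ratio judgment (assumed form): a weak center channel relative to the
# summed environment channels yields a larger enhancement gain Gp, while a
# dominant center channel yields a gain near 1 (little or no enhancement).
def power_ratio_gain(center: np.ndarray, others: np.ndarray, alpha: float = 4.0) -> float:
    p_c = float(np.mean(center ** 2))                    # power of the center channel
    p_o = float(np.sum(np.mean(others ** 2, axis=0)))    # summed power of the other channels
    ratio = p_c / (p_o + 1e-12)
    # Inverse relationship, limited by the configuration parameter alpha (assumed limiter).
    return float(np.clip(1.0 / (ratio + 1e-12), 1.0, alpha))
```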
In an example, in the intelligent logical judgment, the setting of the speech enhancement gain is judged by means of the system volume of an audio source, for example, a movie product. The current system volume level can be recognized, and different speech enhancement gains can then be set for the different volume ranges recognized. For example, the current system volume level is recognized, and when the system volume level is within a low range, the speech enhancement gain can be set high, whereas when the system volume level is within a high range, the speech enhancement gain should be set low. For example, when the current system volume level exceeds a threshold, the speech enhancement gain may be set to 1, i.e., leaving the loudness level of the speech signal unchanged.
Here, Gv denotes the output of the intelligent speech enhancement control module and fV is a non-linear volume correlation function of the system volume level; as described above, Gi denotes the gain of the dynamic control module, and Gp is the gain of the intelligent speech enhancement control module, and Gv may be obtained, for example, by scaling these gains with the value of fV. The plot 400 of this non-linear volume correlation function may be, for example, a curve in which the volume-related gain contribution decreases as the system volume level increases.
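The following sketch gives one possible shape for such a volume correlation function; the normalized volume scale, the threshold, the quadratic roll-off, and the maximum boost are all assumptions rather than the curve of plot 400.

```python
# Possible non-linear volume correlation function fV (assumed shape): the
# speech enhancement contribution is large at low system volume, falls off as
# the volume rises, and becomes 1 (no extra enhancement) above a threshold.
def volume_gain(volume_level: float, threshold: float = 0.8, max_boost: float = 2.0) -> float:
    """volume_level is assumed to be normalized to the range [0, 1]."""
    if volume_level >= threshold:
        return 1.0                              # loud enough: leave the speech loudness unchanged
    t = volume_level / threshold                # 0 at silence, 1 at the threshold
    return max_boost - (max_boost - 1.0) * t ** 2
```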
Further, soft limiting processing may be performed on the set speech enhancement gain so as to constrain the gain within a reasonable range.
The soft limiting may use limiter parameters α, β, and γ that depend on the system configuration, where α may be a real number greater than zero, and β and γ may be non-zero real numbers. At this point, Gv that has undergone the soft limiting serves as the final speech enhancement gain applied in the dynamic loudness balancing.
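For illustration, the sketch below uses a tanh-shaped limiter to convey the idea of compressing large gains instead of hard-clipping them; it does not reproduce the disclosed limiter with parameters α, β, and γ, and the ceiling value is an assumption.

```python
import numpy as np

# Illustrative soft limiter: gains at or below 1 pass through unchanged, while
# gains above 1 are compressed smoothly toward an assumed ceiling.
def soft_limit_gain(g_v: float, ceiling: float = 3.0) -> float:
    if g_v <= 1.0:
        return g_v
    return 1.0 + (ceiling - 1.0) * np.tanh((g_v - 1.0) / (ceiling - 1.0))
```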
By performing the logical judgment mentioned above, i.e., by recognizing the system volume level of the film product and taking into account the strength of the system volume in the audio clip, it is possible to avoid audio distortion caused by excessive enhancement of the speech dialogue when the volume is already high, thus further intelligently optimizing the dynamic speech gain and enabling the user to obtain better auditory perception.
By means of the above logical judgment of two aspects, namely, the signal power strength ratio of the center channel to the sum of the other channels and the system volume level, the dynamic speech enhancement processing can be intelligently optimized so as to further improve the user experience, with the speech enhancement intelligently adjusted for the various scenarios in diversified film product sources. Therefore, this method enables the user to obtain a better movie viewing experience.
The intelligent dynamic processing provided in the inventive subject matter is directed to speech enhancement. Since the frequency range of the human voice lies substantially within the mid-frequency range, e.g., between 250 Hz and 4000 Hz, the dynamic loudness balancing module 108 in the inventive subject matter may also focus primarily on the mid-frequency range of the input audio source signals for processing. Therefore, a crossover filter may be used to first divide the inputted audio signals into low-, mid-, and high-frequency bands, for example, prior to the dynamic loudness balancing, so that the dynamic loudness balancing is performed only on the mid-frequency range and the processed mid-frequency signals are then concatenated and mixed with the low-frequency and high-frequency signals to generate the output signal.
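A minimal sketch of such a crossover step is shown below, assuming fourth-order Butterworth filters with crossover points at roughly 250 Hz and 4 kHz; the filter order, filter type, and exact crossover frequencies are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Split a signal into low, mid, and high bands so that only the mid band, where
# most speech energy lies, is sent to the dynamic loudness balancing.
def split_bands(x: np.ndarray, fs: float, lo: float = 250.0, hi: float = 4000.0):
    sos_low = butter(4, lo, btype="lowpass", fs=fs, output="sos")
    sos_mid = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    sos_high = butter(4, hi, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos_low, x), sosfilt(sos_mid, x), sosfilt(sos_high, x)
```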
In the dynamic loudness balancing, the loudness of the signal of the center channel is enhanced and the loudness of the signals of the other channels is attenuated based on the set speech enhancement gain, and concatenating and mixing processing is then performed on the enhanced signal of the center channel and the attenuated signals of the other channels to generate an output signal.
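By way of illustration, the sketch below applies the final gain to the mid-band center channel and a fixed attenuation to the mid-band environment channels; the attenuation factor and the function names are assumptions, and the subsequent concatenating and mixing with the low and high bands is left to the surrounding processing chain.

```python
import numpy as np

# Dynamic loudness balancing on the mid band (assumed form): boost the center
# channel by the final speech enhancement gain and slightly attenuate the other
# channels; the results are afterwards mixed back with the low and high bands.
def balance_loudness(center_mid: np.ndarray, others_mid: np.ndarray,
                     gain: float, attenuation: float = 0.8):
    enhanced_center = gain * center_mid
    attenuated_others = attenuation * others_mid    # applied to each environment channel
    return enhanced_center, attenuated_others
```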
Additionally or alternatively, the steps of the method for intelligent dynamic speech enhancement described above may be implemented as computer-executable instructions stored in a memory and executed by one or more processors.
The method and the system for intelligent dynamic speech enhancement provided in the inventive subject matter can be used not only for consumer products such as soundbars and stereo speakers, but also in venues such as theaters and concert halls. Compared to static audio equalization and static speech enhancement techniques, the method and the system for intelligent dynamic speech enhancement provided in the inventive subject matter can improve speech clarity by means of intelligent gain control. In the method for intelligent dynamic speech enhancement provided in the inventive subject matter, the existing speech enhancement technology is optimized so as to intelligently incorporate dynamic speech enhancement gain control in applications such as theaters, thereby enabling an improved user experience for a diversity of movie clips, including, but not limited to: fast implementation of speech activity detection based on the center channel; intelligent enhancement gain control that intelligently adjusts the speech enhancement gain level according to the movie content and according to the volume of the playback system; and implementation of multi-channel intelligent dynamic loudness balancing using dual processing paths.
Examples of one or more implementations of the inventive subject matter are described in the following clauses:
Clause 1. A method for intelligent dynamic speech enhancement, the method comprising:
performing speech detection and intelligent enhancement gain control on a multi-channel audio source input to determine speech enhancement gain, the multi-channel audio source input comprising a signal of a center channel and signals of other channels;
applying the speech enhancement gain in dynamic loudness balancing performed on the multi-channel audio source input, wherein the intelligent enhancement gain control comprises:
setting the speech enhancement gain based on a signal power strength ratio of the center channel to a sum of the other channels, and
setting the speech enhancement gain based on a system volume level.
Clause 2. The method for intelligent dynamic speech enhancement of clause 1, wherein setting the speech enhancement gain based on a signal power strength ratio of the center channel to a sum of the other channels comprises:
setting the speech enhancement gain to be high when the signal power strength ratio of the center channel to the sum of the other channels is small; and
setting the speech enhancement gain to be low when the signal power strength ratio of the center channel to the sum of the other channels is large.
Clause 3. The method for intelligent dynamic speech enhancement of clause 1 or 2, wherein setting the speech enhancement gain based on a system volume level comprises:
recognizing the system volume level and setting different speech enhancement gain when the recognized system volume level is within different volume ranges,
wherein the speech enhancement gain is set to be high when the system volume level is within a low range, and
wherein the speech enhancement gain is set to be low when the system volume level is within a high range.
Clause 4. The method for intelligent dynamic speech enhancement of any one of clauses 1 to 3, wherein the speech detection comprises:
extracting the signal of the center channel from the multi-channel audio source input;
performing normalization on the signal of the center channel; and
performing fast autocorrelation on the normalized signal of the center channel, a result of the fast autocorrelation representing a detection confidence level which is indicative of the possibility of speech being present in the signal of the center channel.
Clause 5. The method for intelligent dynamic speech enhancement of any one of clauses 1 to 4, wherein the intelligent enhancement gain control further comprises:
converting the detection confidence level to the speech enhancement gain; and
performing smoothing processing on the speech enhancement gain.
Clause 6. The method for intelligent dynamic speech enhancement of any one of clauses 1 to 5, wherein the intelligent enhancement gain control further comprises performing soft limiting processing on the set speech enhancement gain.
Clause 7. The method for intelligent dynamic speech enhancement of any one of clauses 1 to 6, wherein the dynamic loudness balancing performed on the multi-channel audio source input comprises:
enhancing the loudness of the signal of the center channel and attenuating the loudness of the signals of the other channels based on the set speech enhancement gain; and
performing concatenating and mixing processing on the enhanced signal of the center channel and the attenuated signals of the other channels to generate an output signal.
Clause 8. The method for intelligent dynamic speech enhancement of any one of clauses 1 to 7, further comprising: performing crossover filtering processing on the multi-channel audio source input prior to the dynamic loudness balancing performed on the multi-channel audio source input.
Clause 9. The method for intelligent dynamic speech enhancement of any one of clauses 1 to 8, further comprising:
performing the dynamic loudness balancing only on the multi-channel audio source input within a mid-frequency range; and
concatenating and mixing the multi-channel audio source input within the mid-frequency range that has undergone the dynamic loudness balancing with the multi-channel audio source input within a low-frequency range and a high-frequency range to generate an output signal.
Clause 10. A system for intelligent dynamic speech enhancement, comprising:
a memory configured to store computer-executable instructions; and
one or more processors configured to execute the stored computer-executable instructions to implement the method for intelligent dynamic speech enhancement of any one of clauses 1-9.
A description of implementations has been presented for illustrative and descriptive purposes. Appropriate modifications and changes to the implementations may be made in accordance with the above description, or may be obtained from the practice of the method. For example, one or more of the described methods may be performed by suitable apparatuses and/or combinations of apparatuses unless otherwise indicated. The methods may be performed by executing the stored instructions with one or more logic apparatuses (e.g., processors) in conjunction with one or more additional hardware elements (such as storage apparatuses, memories, hardware network interfaces/antennas, switches, actuators, clock circuits, and so on). In addition to the order described in the present application, the described methods and associated actions may also be performed in various sequences, in parallel, and/or simultaneously. The systems described are exemplary in nature and may include additional elements and/or omit elements. The inventive subject matter includes all novel and non-obvious combinations of the various systems and configurations, and other features, functions, and/or properties disclosed.
The elements of the various implementation schemes of the modules, elements, and components for implementing the methods provided in the inventive subject matter may be fabricated as one or more electronic devices, including but not limited to arrays of fixed or programmable logic elements (e.g., transistors or gates, or the like), residing on the same chip or in a chipset. One or more elements of the various implementations of the devices described herein may also be implemented, in whole or in part, as one or more instruction sets, which may be arranged to be executed on one or more arrays of fixed or programmable logic elements (e.g., microprocessors, embedded processors, IP cores, digital signal processors, FPGAs, ASSPs, and ASICs, and so on).
The terminology used herein has been chosen to best explain the principles of the embodiments, the practical applications or the improvements to the technology found in the market, or to enable those of ordinary skill in the art to understand the embodiments disclosed herein.
In the foregoing, the embodiments presented in the inventive subject matter are identified by reference. However, the scope of the inventive subject matter is not limited to the specifically described embodiments. Rather, any combination of the foregoing features and elements, whether or not involving different embodiments, is envisioned to implement and practice the envisioned embodiments.
Furthermore, while the embodiments disclosed herein can achieve advantages over other possible solutions or over the prior art, whether or not a given embodiment achieves a particular advantage does not limit the scope of the inventive subject matter. Accordingly, the foregoing aspects, features, embodiments, and advantages are merely illustrative and are not considered to be elements or limitations of the appended claims unless expressly set forth in the claims.
While the foregoing is directed to embodiments of the inventive subject matter, other and further embodiments of the inventive subject matter may be devised without departing from the essential scope of the inventive subject matter, and the scope of the inventive subject matter is determined by the appended claims.
Number | Date | Country | Kind
---|---|---|---
202311388318.7 | Oct. 24, 2023 | CN | national