This invention relates to audio processing including dialogue enhancement. Specifically, the invention relates to improving dialogue enhancement by smoothing an amplified extracted dialogue.
Dialogue enhancement is an algorithm that enhances speech/dialog in an audio signal to improve intelligibility. One example of a dialogue enhancement system is shown in
It is desirable to even further improve the performance of such dialogue enhancement algorithms.
Methods, systems, and computer program products for enhancing dialog intelligibility in audio are described.
A first aspect of the invention relates to a method of enhancing dialog intelligibility in an audio signal, comprising determining, by a speech classifier, a speech confidence score that the audio content includes speech content, determining, by a music classifier, a music confidence score that the audio content includes music correlated content, and, in response to the speech confidence score, applying, by a dialog enhance module, a user selected gain to selected frequency bands of the audio signal to obtain a dialogue enhanced audio signal, wherein the user selected gain is smoothed by an adaptive smoothing algorithm, an impact of past frames in said smoothing algorithm being determined by a smoothing factor, the smoothing factor being selected in response to the music confidence score, and having a relatively higher value for content having a relatively higher music confidence score and a relatively lower value for speech content having a relatively lower music confidence score, so as to increase the impact of past frames on the dialogue enhancement of music correlated content.
By “music correlated content” is simply meant content for which speech classification can be expected to be more difficult due to the presence of music. By increasing the impact of past frames, the dialogue enhancement becomes less sensitive to “false positives” in the speech classifier.
The smoothing factor relates to the number of frames taken into consideration in the adaptive smoothing. So, for a large smoothing factor, more frames are taken into account, thus making the application of dialogue enhancement more gradual (slower) and thus avoiding fluctuating boost caused by “false positives”. For a small smoothing factor, fewer frames are taken into account, thus allowing for faster application of dialogue enhancement. The relationship between the smoothing factor and the smoothing function may be direct (e.g. the smoothing factor defines how many frames are taken into account) or indirect (e.g. the smoothing factor defines the slope of decline in the relative weight of a past frame).
The adaptive smoothing factor makes it possible to adapt the smoothing factor based on the content. For content where music is present (high music confidence score) the smoothing factor can be set relatively large (e.g. in the order of 500 ms or larger), while for content where music is not present (low music confidence score) the smoothing factor can be set relatively small (e.g. in the order of 100 ms or smaller).
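The content-dependent selection above can be sketched as a small helper. This is a minimal hypothetical illustration, not the claimed implementation: the function name, the 0.5 threshold, and the exact 500 ms / 100 ms values are assumptions taken from the orders of magnitude mentioned in the text.

```python
def select_smoothing_factor_ms(music_confidence, threshold=0.5):
    """Pick a smoothing factor (in ms) from the music confidence score.

    Hypothetical sketch: content with a high music confidence score gets a
    large smoothing factor (on the order of 500 ms or larger), while content
    with a low music confidence score gets a small one (on the order of
    100 ms or smaller), as described in the text.
    """
    return 500.0 if music_confidence >= threshold else 100.0
```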
The smoothing factor may be further adapted based on additional parameters. For example, a low signal-to-noise ratio (SNR) may result in a larger smoothing factor, and a large latency in the speech classifier may result in a larger smoothing factor.
According to a second aspect, the speech and music classifiers receive an audio signal, the audio signal including audio content. The speech classifier determines the speech confidence and the music classifier determines the music confidence. In response to an output of the speech and music classifiers, an adaptive smoothing algorithm calculates a higher value of a dialog smoothing factor for the music correlated content and a lower value of the dialog smoothing factor for the pure speech content. The adaptive smoothing algorithm adjusts the dialog smoothing factor based on the SNR of the audio content; a lower SNR corresponds to a larger increase of the dialog smoothing factor. A transient detector could be used to measure the latency of the speech classifier in real time, in which case the dialog smoothing factor should be increased linearly as the latency increases. A dialog enhancer enhances the audio content based on the adjusted dialog smoothing factor to generate enhanced audio.
The present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments of the invention.
A conventional dialogue enhancement algorithm, e.g. as illustrated in
In an attempt to overcome these drawbacks, look-ahead is sometimes introduced to reduce the false positives and latency of the speech classifier. For example, a 2000 ms latency may be acceptable on the encoding side. On the mobile playback side, however, latency is critical, and look-ahead is not allowable. As a result, the accuracy and latency issues are even worse in a conventional speech classifier.
Additionally, the above artifacts may be eliminated, or at least mitigated, by using a conventional smoothing algorithm, as shown in
The technology disclosed in this specification relates to dialogue enhancement that results in dialog that is not only pronounced but also comfortable, with fewer artifacts.
Some examples of how the smoothing factor may be adapted are the following:
Make use of the history and the current music confidence score
If music is dominant in the last few frames or in the current frame, the smoothing factor should tend to be large, for example 500 ms or more, to filter out any false positives.
Reduce the smoothing for pure speech content
If the content is pure speech, the smoothing factor could be small, for example 50 ms to 100 ms, to make the dialogue boost more pronounced.
Make use of the SNR
The SNR could be measured to help guide the smoothing. The false positive/negative rate tends to be high for low-SNR content; as a result, the smoothing factor should conservatively be large, for example 500 ms.
Dynamically change the smoothing factor by measuring the latency in real time
A VAD or transient detector could be used to measure the latency of the speech classifier in real time; the smoothing factor should then be increased linearly as the latency increases. Depending on the content, the latency could be as small as 100 ms or as large as 500 ms.
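The linear latency-based increase in the last strategy above can be sketched as follows. This is a hypothetical illustration: the clamping to the 100–500 ms latency range comes from the text, but the function name, the base value, and the maximum extra amount are assumed for the example.

```python
def latency_adjusted_smoothing_ms(base_ms, latency_ms,
                                  min_latency_ms=100.0,
                                  max_latency_ms=500.0,
                                  max_extra_ms=400.0):
    """Increase the smoothing factor linearly with the measured classifier
    latency, clamped to the 100-500 ms latency range mentioned above.
    max_extra_ms (the increase at maximum latency) is a hypothetical value.
    """
    clamped = min(max(latency_ms, min_latency_ms), max_latency_ms)
    frac = (clamped - min_latency_ms) / (max_latency_ms - min_latency_ms)
    return base_ms + frac * max_extra_ms
```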
A more detailed embodiment of the invention is shown in
The speech confidence score is used to activate a dialogue enhance module 23, e.g. of a type known in the art. In a simple case, the dialogue enhancement module is static, and configured to boost preselected frequencies of the audio signal by a user selected gain. In more complex cases, the enhancement module makes a dynamic estimation of a dialogue component, and boosts this estimated dialogue component.
In principle, the speech confidence score may be used directly as an activation signal, which is multiplied by the user gain. However, it may be advantageous to first map the confidence score to a binary value ON/OFF. In
The confidence score or the binary activation signal is multiplied by the user gain, which is supplied to an adaptive smoothing module 25 before being fed to the dialogue enhancement module 23. Much like the conventional smoothing module in
The system further comprises a signal-to-noise ratio (SNR) detector 26, which detects an SNR in the audio signal (frame by frame) and provides this to the adaptive smoothing module 25.
The system further comprises a less complex, but fast, voice detector 27, such as a conventional voice activity detector (VAD) or a transient detector. The output from the voice detector 27 is provided to the adaptive smoothing module to enable determination of a latency of the speech classifier.
The adaptive smoothing module may use various smoothing functions to smooth the gain applied to the dialogue enhancement module 23. Generally speaking, the smoothing factor relates to the number of past frames that are taken into account when determining the gain for a current frame. In a simple example, the smoothing factor may define a window of past frames which are included in a moving average to determine the smoothed gain for a current frame.
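The moving-average example above can be sketched as a small class. This is a minimal sketch, assuming the smoothing factor has already been converted to a window length in frames; the class and method names are hypothetical.

```python
from collections import deque

class MovingAverageSmoother:
    """Smooth the gain with a moving average over a window of past frames.

    window_frames is the smoothing factor expressed as a number of frames;
    deque(maxlen=...) automatically discards frames older than the window.
    """
    def __init__(self, window_frames):
        self.history = deque(maxlen=window_frames)

    def smooth(self, gain):
        # Include the current frame, then average over the window.
        self.history.append(gain)
        return sum(self.history) / len(self.history)
```

With a window of four frames, a sudden gain step from 0 to 6 is reached only gradually, which is the intended slow, fluctuation-free application of the boost.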
In another example, the filter is a weighted average, such as the one-pole filter below:
Out(n)=α Out(n−1)+(1−α) In(n),
where Out(n) is the smoothed output gain of the current frame, Out(n−1) is the smoothed output gain of the previous frame, In(n) is the original input gain of the current frame, and α is an adaptively adjusted variable between zero and one. It is clear that the impact of a past frame will decline exponentially with α as the base. The larger the value of α, the more slowly a past frame fades, and the more smoothly the output gain changes.
The relationship between α and the smoothing factor may for example be as follows:
α=0.5^(samples per frame/(sample rate×smoothing factor))
The smoothing factor could be e.g. 50 ms, 300 ms, 500 ms or even 1 s, depending on the circumstances as discussed herein.
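The one-pole filter and the α mapping above can be sketched together. This is a hypothetical illustration of the formulas as written (smoothing factor converted to seconds in the exponent); the frame size and sample rate in the usage comment are example values, not values from the text.

```python
def alpha_from_smoothing(samples_per_frame, sample_rate, smoothing_ms):
    """Map a smoothing factor to the one-pole coefficient:
    alpha = 0.5 ** (samples per frame / (sample rate * smoothing factor)),
    with the smoothing factor in seconds. Longer smoothing factors give
    alpha closer to 1, so past frames fade more slowly.
    """
    smoothing_s = smoothing_ms / 1000.0
    return 0.5 ** (samples_per_frame / (sample_rate * smoothing_s))

def one_pole_smooth(prev_out, gain_in, alpha):
    """Out(n) = alpha * Out(n-1) + (1 - alpha) * In(n)."""
    return alpha * prev_out + (1.0 - alpha) * gain_in
```

For example, with 1024-sample frames at 48 kHz, a 500 ms smoothing factor yields α ≈ 0.97, while a 50 ms smoothing factor yields α ≈ 0.74, so the shorter smoothing factor lets the output gain track the input gain much faster.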
An example of how the smoothing factor is adaptively set is provided by the simple flow chart in
First, in step S1, the music confidence score is used to determine if the audio signal is music correlated. In a simple approach, the determination is performed by comparing the music confidence score of the current frame with a threshold, thus generating a binary signal ON/OFF. A hysteresis model may also be applied, using the binary value of one or several preceding frames. If the determination is positive, i.e. the frame is found to be music correlated, then the larger smoothing factor (here >500 ms) is applied.
If the content is not music correlated, processing continues to step S2, where the SNR from detector 26 is compared to a threshold, e.g. 0 dB. If the SNR is below the threshold, indicating that the signal is weak in relation to the noise, then again the larger (here >500 ms) smoothing factor is applied.
Further, in step S3, the latency of the speech classifier is compared to a threshold, e.g. 150 ms. If the latency is not below the threshold, again the larger (here >500 ms) smoothing factor is applied.
For all other content, which may be considered to be “pure speech”, a small smoothing factor (here in the range 50-100 ms) is applied.
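The decision chain of steps S1–S3 can be sketched as a single function. This is a hypothetical sketch of the flow chart as described: the threshold values (0 dB SNR, 150 ms latency) and the two smoothing factors (500 ms large, 50 ms small) are the example values from the text, and the function name is assumed.

```python
def choose_smoothing_factor_ms(music_correlated, snr_db, latency_ms,
                               snr_threshold_db=0.0,
                               latency_threshold_ms=150.0):
    """Follow steps S1-S3: music-correlated, low-SNR, or high-latency
    content gets the large smoothing factor; everything else is treated
    as "pure speech" and gets the small one.
    """
    if music_correlated:               # S1: music correlated content
        return 500.0
    if snr_db < snr_threshold_db:      # S2: weak signal relative to noise
        return 500.0
    if latency_ms >= latency_threshold_ms:  # S3: slow speech classifier
        return 500.0
    return 50.0                        # pure speech: fast boost
```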
The speech and music classifiers of the dialog enhancement system receive (410) an audio signal, the audio signal including audio content. The speech classifier of the dialog enhancement system determines (420) the speech confidence. The music classifier determines (430) the music confidence.
In response to an output of the speech and music classifiers, an adaptive smoothing algorithm calculates (440) a higher value of a dialog smoothing factor for the music correlated content and a lower value of the dialog smoothing factor for the pure speech content. The adaptive smoothing algorithm adjusts (450) the dialog smoothing factor based on a measured signal to noise ratio (SNR) of the audio content. A lower SNR value corresponds to a larger increase of the dialog smoothing factor.
In some implementations, the system adjusts the dialog smoothing factor based on latency. The latency-based adjusting can include measuring, by a transient detector, an amount of latency of the output of the speech classifier; and increasing, by the adaptive smoothing algorithm, the dialog smoothing factor according to the amount of latency. A higher latency corresponds to a higher amount of increase. The amount of increase can linearly correspond to the amount of latency. Measuring the amount of latency and increasing the dialog smoothing factor can occur in real time. Each portion of the audio content includes a given number of one or more frames. The dialog smoothing factor can be set to an optimum value for reducing false positives; the optimum value for reducing false positives is 500 milliseconds (ms). The dialog smoothing factor can be set to an optimum value for boosting dialog; the optimum value for boosting dialog can be between 50 and 100 milliseconds (ms), inclusive.
A dialog enhancer enhances (460) the audio content based on the adjusted dialog smoothing factor to generate enhanced audio. During the enhancing, a higher value of the dialog smoothing factor reduces false positives in the enhancing and a lower value of the dialog smoothing factor increases dialogue boost in the enhancing. The system then provides (470) the enhanced audio content to a downstream device, e.g., a processor, an amplifier, a streaming service, or a storage medium for processing, playback, streaming, or storage.
Memory interface 814 is coupled to processors 801, peripherals interface 802 and memory 815 (e.g., flash, RAM, ROM). Memory 815 stores computer program instructions and data, including but not limited to: operating system instructions 816, communication instructions 817, GUI instructions 818, sensor processing instructions 819, phone instructions 820, electronic messaging instructions 821, web browsing instructions 822, audio processing instructions 823, GNSS/navigation instructions 824 and applications/data 825. Audio processing instructions 823 include instructions for performing the audio processing described in reference to
Aspects of the systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
One or more of the components, blocks, processes or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
List of enumerated exemplary embodiments (EEE):
receiving, by speech and music classifiers of a dialog enhancement system, an audio signal, the audio signal including audio content;
determining, by the speech classifier, a confidence score that the audio content includes pure speech content;
determining, by the music classifier, a confidence score that the audio content includes music correlated content;
in response to an output of the speech and music classifiers, calculating, by an adaptive smoothing algorithm, a higher value of a dialog smoothing factor for the music correlated content and a lower value of the dialog smoothing factor for the pure speech content;
adjusting the dialog smoothing factor by the adaptive smoothing algorithm based on a measured signal to noise ratio (SNR) of the audio content, wherein a lower SNR value corresponds to a larger increase of the dialog smoothing factor; and
enhancing, by a dialog enhancer, the audio content based on the adjusted dialog smoothing factor to generate enhanced audio, wherein a higher value of the dialog smoothing factor reduces false positives in the enhancing and a lower value of the dialog smoothing factor increases dialogue boost in the enhancing,
wherein each of the determining, calculating, adjusting and enhancing is performed by one or more processors.
measuring, by a transient detector, an amount of latency of the output of the speech classifier; and
increasing, by the adaptive smoothing algorithm, the dialog smoothing factor according to the amount of latency, wherein a higher latency corresponds to a higher amount of increase.
one or more computer processors; and
a non-transitory computer-readable medium storing instructions that, upon execution by the one or more processors, cause the one or more processors to perform operations of any one of claims EEE1-EEE9.
Number | Date | Country | Kind |
---|---|---|---|
PCT/CN2019/102775 | Aug 2019 | CN | national |
This application claims priority of U.S. Provisional Patent Application No. 62/963,711, filed Jan. 21, 2020, U.S. Provisional Patent Application No. 62/900,969, filed Sep. 16, 2019, and International Patent Application No. PCT/CN2019/102775, filed Aug. 27, 2019, all of which are hereby incorporated by reference in their entireties.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/048034 | 8/26/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62963711 | Jan 2020 | US | |
62900969 | Sep 2019 | US |