The present invention relates to a method and a system for performing volume leveling of an audio signal.
Various audio processing techniques have been applied in playback end-point devices to improve audio quality. An example of an audio processing module is a volume leveling unit, which aims to monitor and adjust loudness of the audio from moment to moment to maintain a consistent loudness for the consumer.
The volume leveling unit was originally developed for Professionally Generated Content (PGC). However, in recent years User Generated Content (UGC) has become increasingly popular and must also be properly handled. Therefore, the volume leveler should ideally be able to ensure good performance for both PGC and UGC.
One of the most important aspects of UGC handling is the environmental noise contained in UGC (hereafter called the UGC noise). The UGC noise can be caused by capturing content using a mobile phone in real scenes. In general, the UGC noise is the background noise and thus meaningless or unwanted. Therefore, the UGC noise—especially approximately stationary noise—should not be boosted by the volume leveling unit.
However, the PGC also contains approximately stationary noise-like content (hereafter called the PGC noise). The PGC noise intervals occur frequently, e.g., as background sound intervals between dialogue in a movie. Such PGC noise is usually captured independently from the dialogue using professional recording devices, and carefully processed by the audio mixer in the content creation phase. In contrast to the UGC noise, such PGC noise is part of the content and is usually wanted from an artist/content creator perspective. In such cases, the volume leveling unit can safely boost the PGC noise.
The present invention seeks to provide a volume leveling unit which can satisfactorily handle both PGC and UGC noise. Specifically, the boosting level should be reduced for UGC noise while maintaining the original behavior for PGC noise. To achieve this objective, the present invention proposes a method and system for intelligently steering a control signal (e.g. between zero and one) for a volume leveling unit.
A first aspect of the present invention relates to a method for applying volume leveling of an audio signal including a plurality of time segments each consisting of a set of N frames. The method comprises providing a volume leveling control signal, applying volume leveling to the audio signal using the volume leveling control signal, identifying, in a current time segment, all noise-like frames which are likely to contain noise, and determining a noise reliability ratio w(n) as a ratio of noise-like frames over all frames in the current time segment, determining, for the current time segment, a PGC noise confidence score xPGC (n) indicating a likelihood that professionally generated content, PGC, noise is present in the audio signal and determining, for the current time segment, whether the noise reliability ratio is above a predetermined threshold. When the noise reliability ratio is above the predetermined threshold, the volume leveling control signal is updated based on the PGC noise confidence score, and when the noise reliability ratio is below the predetermined threshold, the volume leveling control signal is left unchanged.
According to this approach, a noise-type adaptive volume leveling is achieved using a two-stage noise classifier. As a result, the performance of the volume leveling is improved by preventing boosting of e.g. phone-recorded environmental noise in UGC, while keeping original behavior for other types of content. The control signal will be updated if and only if the reliability of noise for the segment is high. As a result, the volume leveling control signal will be stable for each segment, and also stay consistent for the entire audio signal. The update may be made on a frame-by-frame basis.
A two-stage noise classifier can be used with the present invention. In a first stage, noise is distinguished from other types of content, and in a second stage, PGC noise is distinguished from UGC noise. The classifier in the first stage can also output a frame weight to identify stationary noise with low latency, and a clip weight, indicating whether the output from the second stage is reliable or not.
The outputs from the classifiers are not always stable. In order to obtain a stable and consistent control signal, the updated volume leveling control signal may be formed by weighting a volume leveling control signal for a previous frame with an updating value based on the PGC noise confidence score. In order to increase the rate of change when the noise detection is reliable, the updating value can be made proportional to the noise reliability ratio.
A second aspect of the present invention relates to a system for volume leveling of an audio signal including a plurality of time segments each consisting of a set of N frames. The system comprises a noise detector configured to identify, in a current time segment, all noise-like frames which are likely to contain noise, and to determine a noise reliability ratio w(n) as a ratio of noise-like frames over all frames in the current time segment, a noise discriminator configured to determine, for the current time segment, a PGC noise confidence score xPGC(n) indicating a likelihood that professionally generated content, PGC, noise is present in the audio signal, and a controller. The controller is configured to provide a volume leveling control signal, determine, for the current time segment, whether the noise reliability ratio is above a predetermined threshold, when the noise reliability ratio is above the predetermined threshold, update the volume leveling control signal based on the PGC noise confidence score, and when the noise reliability ratio is below the predetermined threshold, keep the volume leveling control signal unchanged.
The present invention will be described in more detail with reference to the appended drawings, showing currently preferred embodiments of the invention.
Systems and methods disclosed in the present application may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
The computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware. Further, the present disclosure shall relate to any collection of computer hardware that individually or jointly execute instructions to perform any one or more of the concepts discussed herein.
Certain or all components may be implemented by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system (i.e., computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM. A bus subsystem may be included for communicating between the components. The software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
The one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
The software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information, and which can be accessed by a computer. Further, it is well known to the skilled person that communication media (transitory) typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The volume leveling system 10 in
In use, an audio signal is provided to the volume leveling unit 1, which applies volume levelling based on a control signal from the controller 2. The audio signal is also provided to the audio classification arrangement, and output from the various classifiers is provided to the controller 2, which adjusts the control signal accordingly. The details of the control signal adjustment will be discussed in more detail in the following.
With reference to
Consecutive time segments may overlap. In a real-time processing framework, the number of overlapping frames can be set to N−1, which means that processing operates frame by frame. Without loss of generality, the index n here denotes the n-th segment 11, which consists of frames n−N+1, n−N+2, . . . , n.
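The segment indexing described above can be illustrated with a minimal sketch. The function name and the clipping of the start index at zero (for the first N−1 frames of a stream) are assumptions made for illustration and are not taken from the description.

```python
def segment_frame_indices(n, num_frames_per_segment):
    """Return the frame indices covered by the n-th segment.

    With an overlap of N-1 frames, segment n spans frames
    n-N+1 .. n, so each new frame yields one new segment.
    Clipping at frame 0 is an assumed start-up behavior.
    """
    if num_frames_per_segment < 1:
        raise ValueError("a segment must contain at least one frame")
    start = n - num_frames_per_segment + 1
    return list(range(max(start, 0), n + 1))


# Two consecutive segments of N = 3 frames share N-1 = 2 frames.
current = segment_frame_indices(5, 3)
following = segment_frame_indices(6, 3)
shared = sorted(set(current) & set(following))
```

Here `current` covers frames 3 to 5 and `following` covers frames 4 to 6, so the two segments differ by a single frame.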
With reference to both
The noise discriminator 4 calculates a PGC noise confidence score xPGC (n)∈[0,1] representing the likelihood of PGC noise in the frame. Similar to the noise reliability ratio, the PGC confidence score is calculated based on all frames in the segment 11. Therefore, the reliability ratio w(n) indicates whether the confidence scores provided by the noise discriminator 4 are reliable or not.
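As an illustration of the noise reliability ratio w(n), defined as the ratio of noise-like frames over all frames in the current segment, the following sketch keeps a sliding window of per-frame decisions. The class name, the boolean per-frame flag, and the start-up behavior (dividing by the number of frames seen so far) are assumptions; the description does not specify how the per-frame decision is produced.

```python
from collections import deque


class NoiseReliabilityTracker:
    """Sliding-window ratio of noise-like frames over the last N frames."""

    def __init__(self, num_frames_per_segment):
        if num_frames_per_segment < 1:
            raise ValueError("a segment must contain at least one frame")
        # deque with maxlen drops the oldest flag automatically
        self._flags = deque(maxlen=num_frames_per_segment)

    def push(self, is_noise_like):
        """Add the newest frame decision and return w(n) in [0, 1]."""
        self._flags.append(bool(is_noise_like))
        return sum(self._flags) / len(self._flags)


tracker = NoiseReliabilityTracker(num_frames_per_segment=4)
frame_flags = [True, False, True, True, True, True]
ratios = [tracker.push(flag) for flag in frame_flags]
```

Once the window is full, each new frame replaces the oldest one, which corresponds to the frame-by-frame operation with an overlap of N−1 frames.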
In addition to the two-stage noise classifier 3, 4, the auxiliary classifiers 5a, 5b may be used to further increase the confidence of noise. As an example, the auxiliary classifiers may include a speech classifier 5a and a music classifier 5b, outputting a speech confidence score xSP(n) and a music confidence score xmu(n). As indicated in
The controller 2 calculates a volume leveling control signal y(n) for the n-th frame. The control signal y(n) should be equal to the previous value y(n−1) as long as the noise reliability ratio w(n) is below a given threshold Tw. When the noise reliability ratio w(n) exceeds the threshold, the control signal y(n) should be adjusted based on the classification results, including the PGC noise confidence score.
In the present implementation, the control signal y(n) is formed as a weighted combination of the leveling control signal for the previous frame, y(n−1), and a term y′(n) based on the classification results. This can be written as:
where αu(n) is a weighting coefficient that will be discussed below.
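For orientation, the weighted combination described in the text can be written as a convex blend. The form below is an assumption inferred from the surrounding description, in which a large αu(n) corresponds to heavy weighting of y′(n); it is not a reproduction of the original equation.

```latex
y(n) = \bigl(1 - \alpha_u(n)\bigr)\, y(n-1) + \alpha_u(n)\, y'(n),
\qquad \alpha_u(n) \in [0, 1]
```

Under this assumed form, setting y′(n) = y(n−1) yields y(n) = y(n−1) for any value of αu(n).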
It can easily be seen that y′(n) should be equal to y(n−1) when the noise reliability ratio w(n) is below the threshold Tw (to ensure that y(n) is also equal to y(n−1)). When the noise reliability ratio w(n) exceeds the threshold Tw, y′(n) may be a function of the noise reliability ratio w(n) and the PGC noise confidence score xPGC (n).
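The gating behavior described above can be sketched as follows. The function name and argument names are hypothetical, and the weighted combination is assumed to be a convex blend. The candidate value y′(n) is passed in as an argument, because its exact definition is not reproduced here. The description does not state how the boundary case w(n) = Tw is handled, so the sketch arbitrarily leaves the control signal unchanged in that case.

```python
def update_control_signal(y_prev, w, y_candidate, alpha_u, t_w):
    """Return y(n) given y(n-1), the reliability ratio w(n), a candidate
    value y'(n), the weighting coefficient alpha_u(n) and threshold T_w.

    Assumption: y(n) = (1 - alpha_u) * y(n-1) + alpha_u * y'(n).
    """
    if not 0.0 <= alpha_u <= 1.0:
        raise ValueError("alpha_u is assumed to lie in [0, 1]")
    if w <= t_w:
        # Noise detection is not reliable enough: keep the previous value.
        return y_prev
    # Reliable noise segment: move toward the classification-based candidate.
    return (1.0 - alpha_u) * y_prev + alpha_u * y_candidate


# Unreliable segment: the control signal is left unchanged.
y_unreliable = update_control_signal(0.5, w=0.2, y_candidate=0.9,
                                     alpha_u=0.5, t_w=0.6)
# Reliable segment with a low PGC confidence (candidate taken here,
# as a simplification, to be 1 - xPGC with xPGC = 0).
y_reliable = update_control_signal(0.5, w=0.8, y_candidate=1.0,
                                   alpha_u=0.5, t_w=0.6)
```

In this example the control signal stays at 0.5 for the unreliable segment and moves halfway toward the candidate, to 0.75, for the reliable one.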
To achieve this, y′(n) can be defined by
where αr denotes a reliable factor defined by
and γ is a preset constant. According to equations (2) and (3), the higher the reliability ratio w(n), the more weight is assigned to (1−xPGC(n)), meaning that xPGC(n) has a greater influence on y′(n).
Returning to equation (1), it would be beneficial to update the control signal more quickly if the following conditions are satisfied simultaneously: low speech confidence, low music confidence, and high noise confidence. This can be achieved by properly setting the weighting coefficient αu(n) as
where β∈(0,1) is a constant. A high likelihood of PGC noise type will lead to a large αu(n), and thus a fast updating of the control signal (heavy weighting of y′(n)).
The volume leveling control signal y(n) is used to adjust the dynamic range controller (DRC) in the volume leveling unit 1. Noise related to UGC will result in a large control signal and cause a large reduction of the DRC gain. In contrast, noise related to PGC will lead to a low control signal/steering signal, maintaining the original behavior of the DRC.
In one example, the control signal y(n) is first clamped and normalized to get a steering signal s(n):
where T2∈(0,1) is another preset threshold.
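One possible way to clamp and normalize the control signal is sketched below. The exact expression is not reproduced in this text, so the mapping shown (clamping y(n) to the interval [T2, 1] and rescaling it linearly to [0, 1]) is only an assumption, as is the function name. It is chosen so that a larger control signal gives a larger steering signal.

```python
def steering_signal(y, t2):
    """Map a control signal y(n) to a steering signal s(n) in [0, 1].

    Assumed mapping: values at or below T2 give 0, a value of 1 gives 1,
    and values in between are rescaled linearly.
    """
    if not 0.0 < t2 < 1.0:
        raise ValueError("T2 is expected to lie strictly between 0 and 1")
    clamped = min(max(y, t2), 1.0)
    return (clamped - t2) / (1.0 - t2)


low = steering_signal(0.3, t2=0.5)
middle = steering_signal(0.75, t2=0.5)
high = steering_signal(1.0, t2=0.5)
```

With T2 = 0.5, control signals of 0.3, 0.75 and 1.0 map to steering signals of 0.0, 0.5 and 1.0 respectively under this assumed mapping.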
Given an original DRC gain gDRC, the reduced DRC gains can be obtained by
The original and reduced boosting gain curves of DRC are illustrated in
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, refer to the action and/or processes of a computer hardware or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.
It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Thus, while there have been described specific embodiments of the invention, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, different amounts of overlap between successive segments may be implemented. Also, the time windows used to determine various confidence scores may include not only look-back frames but also look-ahead frames (at the cost of certain delay).
Number | Date | Country | Kind |
---|---|---|---|
PCT/CN2022/075381 | Feb 2022 | WO | international |
22161093.4 | Mar 2022 | EP | regional |
This application claims priority to PCT application PCT/CN2022/075381, filed 7 Feb. 2022, U.S. provisional application 63/312,921, filed 23 Feb. 2022, and European Patent Application No. 22161093.4, filed 9 Mar. 2022, all of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2023/062062 | 2/6/2023 | WO |
Number | Date | Country | |
---|---|---|---|
63312921 | Feb 2022 | US |