This application claims priority to European Patent Application titled, “AUDIO SYSTEM AND METHOD,” filed Nov. 28, 2023, and having Application No. EP 23212578.1. The subject matter of this related application is hereby incorporated herein by reference.
The disclosure relates to an audio system and related method, in particular an audio system and method for adding reverberation to an audio signal.
By extending an audio signal with surround or 3D information, e.g., adding a reverberation effect that matches a reverberation already present in the audio signal, thereby simulating a certain listening environment, the listening experience of a user to whom the audio signal is presented can be significantly enhanced. An audio signal may be expanded, e.g., in the context of an upmixing process, by adding a reverberation which matches the original audio signal, or by creating additional reverberation channels or signals which match the existing audio signal. Acoustically simulating a concert hall, or any other kind of listening space, by suitably adding and reproducing a matching reverberation, however, can be challenging. The resulting audio signal may comprise unwanted artifacts, may not be satisfying to listen to, and generating such an extended audio signal may require a high computational load.
There is a need for an audio system and related method that allow a listening environment to be simulated by extending an audio signal with surround or 3D information, resulting in a highly satisfying listening experience for a listener while requiring comparatively little computational load.
An audio system includes a processing unit, and a reverb classification unit, wherein the reverb classification unit is configured to receive a first plurality of audio input signals, estimate a class of reverberation suitable for the first plurality of audio input signals using a deep learning (DL) classification algorithm, and output a prediction to the processing unit, the prediction including information concerning the estimated class of reverberation, and the processing unit is configured to receive the first plurality of audio input signals, generate a second plurality of audio output signals based on the first plurality of audio input signals, and output the second plurality of audio output signals, wherein generating the second plurality of audio output signals includes adding reverberation to at least one of the second plurality of audio output signals based on the prediction received from the reverb classification unit.
A method includes estimating a class of reverberation suitable for a first plurality of audio input signals using a deep learning (DL) classification algorithm, and making a prediction including information concerning the estimated class of reverberation; generating a second plurality of audio output signals based on the first plurality of audio input signals, wherein generating the second plurality of audio output signals includes adding reverberation to at least one of the second plurality of audio output signals based on the prediction; and outputting the second plurality of audio output signals.
Other systems, features, and advantages of the disclosure will be or will become apparent to one with skill in the art upon examination of the following detailed description and figures. It is intended that all such additional systems, methods, features, and advantages included within this description be within the scope of the present disclosure and be protected by the following claims.
The arrangements and methods may be better understood with reference to the following description and drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosed embodiments. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views.
The audio system and related method according to the various embodiments described herein make it possible to simulate different listening environments by adding reverberation to an audio signal. Only comparatively little computational load is required, and the resulting audio signal is highly satisfying to listen to, as it comprises only few or even no artifacts. Instead of “blindly” processing an audio signal as is done in conventional audio systems, the audio system and method disclosed herein perform an “informed” processing of the audio signal.
Generally, by adding 3D or surround information to an audio signal, the listening experience for a user listening to the audio signal can be significantly enhanced. Multi-channel playback in particular can provide the user with the impression of being at a certain location or event while listening to a musical piece. Additional ambience playback can create an envelopment that can be compared to the experience of being at a live event. For example, if a center signal of a multi-channel audio signal is extracted and added to a centrally positioned speaker, the optimum listening area (the so-called sweet spot) can be enlarged and the stability of the front image can be significantly improved. Using 3D speakers, the feeling of a realistic immersion into an audio event can be improved even further. There is even the possibility of lifting the stage and playing back overhead effects.
In order to improve the quality of a reproduced sound scene, the perception of the sound scene is often modeled as a combination of the foreground sound and the background sound, which are often also referred to as primary (or direct) and ambient (or diffuse) components, respectively. The primary components consist of point-like directional sound sources, whereas the ambient components are generally made up of diffuse environmental sound (reverberation). Due to perceptual differences between the primary components and the ambient components, different rendering schemes are generally applied to them for optimal spatial audio reproduction of sound scenes. Channel-based audio, however, only provides mixed signals. Some approaches, therefore, focus on extracting the primary components and the ambient components from the mixed signals. Known methods, which may include, e.g., ambience estimation in the frequency domain, often require a large computational load, and the resulting audio signal often is not satisfying to listen to, as it comprises a significant number of artifacts.
The audio system and related method disclosed herein, in contrast to conventional methods, use artificial intelligence (deep learning, DL) algorithms in order to classify the spatial component of an existing audio signal, and further use this information in order to create artificial reverberation that matches the original reverberation. That is, instead of extracting the ambient components from the mixed signal, the environment in which a musical piece might have been recorded is estimated by using artificial intelligence. A musical piece generally includes a specific class of reverberation. The reverberation included in the musical piece generally has certain characteristics that are typical of the specific class of reverberation, e.g., a long reverb tail, specific room modes, early reflections, etc. As a result, reverberation may be added to the audio signal that matches the reverberation included in the musical piece. That is, based on an estimated class of reverberation, for example, a matching artificial reverberation is added to the audio signal.
The reverb classification unit 200 is configured to estimate the class of reverberation suitable for the first plurality of audio input signals IN1, . . . , INN (a class of reverberation that matches the first plurality of audio input signals IN1, . . . , INN). According to some embodiments of the disclosure, estimating a class of reverberation suitable for the first plurality of audio input signals IN1, . . . , INN comprises separating the first plurality of audio input signals IN1, . . . , INN into a plurality of successive separate frames, extracting one or more features from each of the separate frames, each of the one or more features being characteristic of one of a plurality of types of listening environments, identifying a specific pattern in each of the separate frames based on the extracted features, and estimating a class of reverberation suitable for each of the separate frames based on the identified specific pattern. The length of each frame may influence the accuracy of the deep learning (DL) classification algorithm and may be chosen in any suitable way.
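By way of illustration, the following is a minimal sketch (in Python) of the frame separation step; the use of a mono signal and the 5-second default frame length are assumptions of the example, not values prescribed by the disclosure.

```python
import numpy as np

def split_into_frames(x: np.ndarray, sr: int, frame_seconds: float = 5.0) -> list[np.ndarray]:
    """Split a mono audio signal into successive, non-overlapping frames.

    The frame length (a hypothetical 5 s default) is a tuning parameter:
    longer frames capture more of the reverb tail, shorter frames allow
    finer-grained per-frame predictions.
    """
    frame_len = int(frame_seconds * sr)
    n_frames = len(x) // frame_len
    return [x[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```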
Suitable signal representations may include time-frequency representations such as log-frequency spectrograms, for example. This is schematically illustrated in the figures.
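A minimal sketch of such a transformation, assuming a mel spectrogram as the concrete log-frequency representation and librosa as the signal-processing library (FFT size, hop length, and mel-band count are illustrative values):

```python
import librosa
import numpy as np

def to_log_spectrogram(frame: np.ndarray, sr: int) -> np.ndarray:
    """Convert one audio frame into a log-frequency (mel) spectrogram."""
    mel = librosa.feature.melspectrogram(y=frame, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=128)
    # Log compression of the magnitudes makes the decaying energy of a
    # reverb tail easier for a DL model to pick up.
    return librosa.power_to_db(mel, ref=np.max)
```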
The transformed audio input signal frames (e.g., input samples) can then be processed in batches in a classification unit 208 using the classification algorithm. This results in a prediction of a suitable reverberation class for each of the audio input signal segments (frames). The reverb classification unit 200 may further comprise a prediction unit 210 that is configured to make and output the prediction P1 to the processing unit 100.
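What such a classification unit could look like is sketched below, assuming a small convolutional network in PyTorch; the architecture and the class count are illustrative and not taken from the disclosure.

```python
import torch
import torch.nn as nn

class ReverbClassifier(nn.Module):
    """Minimal CNN that maps a batch of log-spectrogram frames, shaped
    (batch, 1, n_mels, n_time), to reverberation-class logits."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to a fixed-size embedding
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

# Batched inference: one predicted class index per frame.
# frame_classes = model(specs).argmax(dim=1)
```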
The reverb classification unit 200 may be configured to make a global prediction P1 based on the estimated classes of reverberation in a defined plurality of separate successive frames. The defined plurality of separate successive frames may constitute a musical piece, and the global prediction P1 may include information concerning the estimated class of reverberation suitable for the entire musical piece. That is, a musical piece may be separated into a plurality of separate frames. A suitable reverberation may be determined for each of the plurality of separate frames. For example, a suitable reverberation may either be “high reverberation” or “low reverberation”. If it is determined that within the plurality of separate frames of a musical piece the result “high reverberation” is predominant as compared to “low reverberation”, high reverberation may be added to the entire musical piece, or vice versa. “High reverberation”, and “low reverberation”, however, are merely examples. Other classes of reverberation may include, but are not limited to, e.g., “large/medium/small Jazz hall”, “large/medium/small living room”, “wooden large/medium/small concert hall”, etc. Other even more generic classes of reverberation may include “Hall 1”, “Hall 2”, “Hall 3”, etc.
Alternatively, it is also possible that a sub-prediction P1 is made based on the estimated classes of reverberation in a sub-set of a defined plurality of separate successive frames. The defined plurality of separate successive frames may constitute a musical piece, and the sub-prediction P1 may include information concerning the estimated class of reverberation suitable for a fraction of the musical piece. That is, a musical piece may be separated into a plurality of separate frames. A suitable reverberation may be determined for each of the plurality of separate frames. For example, a suitable reverberation may be either “high reverberation” or “low reverberation”. A different reverberation may be added to each of the different frames of the audio input signal based on the respective predictions. It is, however, also possible to make a combined prediction for several of the separate frames, but not all of them.
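One straightforward reading of “predominant” is a majority vote over the per-frame classes. The sketch below covers both the global prediction and the sub-predictions; the group size of 8 frames is an arbitrary example value.

```python
from collections import Counter

def global_prediction(frame_classes: list[str]) -> str:
    """Majority vote over all frames of a musical piece."""
    return Counter(frame_classes).most_common(1)[0][0]

def sub_predictions(frame_classes: list[str], group: int = 8) -> list[str]:
    """Combined prediction per group of successive frames, i.e., per
    fraction of the musical piece."""
    return [global_prediction(frame_classes[i:i + group])
            for i in range(0, len(frame_classes), group)]
```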
The deep learning (DL) classification algorithm may be based on a DL model, wherein the DL model is trained using annotated data consisting of audio signals with different known grades of reverberation. The DL model may learn hierarchical representations from input samples, for example. In order to be able to predict reverberation classes with a high accuracy, the model is trained with annotated data in which the grades of reverberation are known; the grades of reverberation may be perceptually measured, for example. One or more different databases may generally be used for this purpose. The one or more databases may be obtained in any suitable way.
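A minimal supervised training loop over such annotated data might look as follows; PyTorch, the Adam optimizer, and all hyperparameters are assumptions of the example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, loader: DataLoader, epochs: int = 10) -> None:
    """Train on batches of (log-spectrogram frame, known reverb-grade
    label) pairs, i.e., on the annotated data described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for specs, labels in loader:  # specs: (batch, 1, n_mels, n_time)
            optimizer.zero_grad()
            loss = loss_fn(model(specs), labels)
            loss.backward()
            optimizer.step()
```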
The audio systems described herein are able to directly classify an amount of reverberation present in an audio input signal, which is directly aligned with the perceptual measure for reverberation. The audio system is highly flexible concerning the number of reverberation classes that can be estimated. According to one example, two reverberation classes, e.g., “high reverberation” and “low reverberation”, may be estimated. According to another example, three reverberation classes, e.g., “high reverberation”, “mid reverberation”, and “low reverberation”, may be estimated, wherein “mid reverberation” is a reverberation that is less than “high reverberation” and greater than “low reverberation”. Any other intermediate reverberation classes between “high reverberation” and “low reverberation” may generally be estimated as well. As mentioned above, other additional or alternative classes of reverberation may include, but are not limited to, e.g., “large/medium/small Jazz hall”, “large/medium/small living room”, “wooden large/medium/small concert hall”, “Hall 1”, “Hall 2”, “Hall 3”, etc.
The audio systems described above may be surround sound systems, or any kind of 3D audio systems (e.g., VR/AR applications), for example. That is, the number of audio input signals included in the first plurality of audio input signals IN1, . . . , INN may equal the number of audio signals included in the second plurality of audio output signals OUT1, . . . , OUTM, as is illustrated in the figures.
A surround sound system is schematically illustrated in the figures.
The reverberation added to the one or more audio output signals OUT1, . . . , OUTM may be generated based on the prediction P1 (e.g., by using a reverberation engine 104), as exemplarily illustrated in the figures.
It is, however, also possible that the processing unit 100 comprises or is coupled to a memory 110, wherein different types of reverberation are stored in the memory 110. The processing unit 100 (e.g., a reverberation engine 104 of the processing unit 100), based on the prediction P1, can retrieve a suitable reverberation from the memory 110 and add it to the one or more audio output signals OUT1, . . . , OUTM accordingly.
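By way of illustration, the memory 110 could hold one impulse response per reverberation class, which is then applied by convolution. The class names, the impulse-response files, and the wet gain below are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

# Hypothetical "memory" of pre-stored impulse responses, keyed by the
# reverberation classes the classifier can predict.
IR_MEMORY: dict[str, np.ndarray] = {
    "high reverberation": np.load("ir_large_hall.npy"),  # assumed file
    "low reverberation": np.load("ir_small_room.npy"),   # assumed file
}

def add_reverb(dry: np.ndarray, prediction: str, wet_gain: float = 0.3) -> np.ndarray:
    """Retrieve the impulse response matching the prediction and mix the
    convolved (wet) signal back onto the dry signal."""
    ir = IR_MEMORY[prediction]
    wet = fftconvolve(dry, ir)[:len(dry)]
    return dry + wet_gain * wet
```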
According to some embodiments of the disclosure, estimating a class of reverberation suitable for the first plurality of audio input signals IN1, . . . , INN may comprise separating the first plurality of audio input signals IN1, . . . , INN into a plurality of successive separate frames, extracting one or more features from each of the separate frames, each of the one or more features being characteristic of one of a plurality of types of listening environments, identifying a specific pattern in each of the separate frames by using the extracted features, and estimating a class of reverberation suitable for each of the separate frames based on the identified specific pattern.
According to some embodiments, the method may further comprise, after separating the first plurality of audio input signals IN1, . . . , INN into a plurality of successive separate frames and before extracting one or more features from each of the separate frames, transforming the separate signal frames into a log-frequency spectrogram.
The artificially generated spatiality matches the (possibly) existing spatiality present in the original audio signal (the first plurality of audio input signals IN1, . . . , INN). Ideally, it has the same room acoustic properties. Classic ambience extraction methods that may be used to extract the original ambience signal from an input signal are algorithmically complex and cause perceptually relevant artifacts. A multiplication and distribution of an extracted ambience signal to the speaker channels of a multi-channel system further increases the perceived artifacts. The audio system described above overcomes these drawbacks. The reverberation class information of the ambience portion of the original signal is used to control or configure an algorithm that artificially generates the output ambience signals. The manner in which the artificial ambience component is created is generally not relevant; the artificial ambience component may generally be created in any suitable way. Any type of reverb method can generally be used (e.g., Feedback Delay Networks (FDNs), convolution reverb, etc.).
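As one concrete example of such a reverb method, the following is a tiny four-line FDN sketch; the delay lengths, T60, and wet/dry mix are illustrative values only.

```python
import numpy as np

def fdn_reverb(x: np.ndarray, sr: int, delays_ms=(29.7, 37.1, 41.1, 43.7),
               t60: float = 2.0, wet: float = 0.3) -> np.ndarray:
    """Minimal 4-line Feedback Delay Network (sample-by-sample, slow but
    illustrative)."""
    delays = [int(d * sr / 1000) for d in delays_ms]
    # Orthogonal feedback matrix (normalized 4x4 Hadamard) scatters energy
    # between the delay lines and builds up echo density.
    H = 0.5 * np.array([[1, 1, 1, 1], [1, -1, 1, -1],
                        [1, 1, -1, -1], [1, -1, -1, 1]], dtype=float)
    # Per-line gain so each feedback loop decays by 60 dB after t60 seconds.
    g = np.array([10 ** (-3.0 * d / (sr * t60)) for d in delays])
    bufs = [np.zeros(d) for d in delays]
    idx = [0, 0, 0, 0]
    y = np.zeros(len(x))
    for n, s in enumerate(x):
        outs = np.array([bufs[i][idx[i]] for i in range(4)])
        y[n] = outs.sum()
        feedback = H @ (g * outs)
        for i in range(4):
            bufs[i][idx[i]] = s + feedback[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return (1 - wet) * x + wet * y
```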
Based on the requirements of the specific application (e.g., upmix technology), the number and semantic properties of the reverb classes can be adjusted, and the deep learning (DL) classification network may be trained accordingly. Further, the classes may be defined based on perceptually relevant characteristics (e.g., reverberation length, density of reflections, early reflection patterns, dry/wet ratio, decay rate, spectral behavior, etc.), and the artificial reverberation algorithm may be configured accordingly. For example, if the DL network was trained to estimate reverberation lengths in the input signal, the output predictions can be used to set the same parameter in an artificial reverb engine.
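For instance, a hypothetical lookup from predicted class to the parameters of an artificial reverb engine (reusing the fdn_reverb sketch above; all class names and values are assumptions) could look as follows:

```python
import numpy as np

# Hypothetical mapping from predicted reverberation class to engine
# parameters; the classes and values are examples only.
CLASS_TO_PARAMS = {
    "Hall 1": {"t60": 2.5, "wet": 0.35},
    "Hall 2": {"t60": 1.6, "wet": 0.30},
    "small living room": {"t60": 0.4, "wet": 0.15},
}

def render_with_matching_reverb(x: np.ndarray, sr: int, predicted: str) -> np.ndarray:
    """Configure the reverb engine with the parameters that belong to the
    predicted class and apply it to the signal."""
    p = CLASS_TO_PARAMS[predicted]
    return fdn_reverb(x, sr, t60=p["t60"], wet=p["wet"])
```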
It may be understood that the illustrated systems are merely examples. While various embodiments have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the present disclosure. In particular, the skilled person will recognize the interchangeability of various features from different embodiments. Although these techniques and systems have been disclosed in the context of certain embodiments and examples, it will be understood that these techniques and systems may be extended beyond the specifically disclosed embodiments to other embodiments and/or uses and obvious modifications thereof. Accordingly, the present disclosure is not to be restricted except in light of the attached claims and their equivalents.
The description of embodiments has been presented for purposes of illustration and description. Suitable modifications and variations to the embodiments may be performed in light of the above description or may be acquired from practicing the methods. The described arrangements are exemplary in nature, and may include additional elements and/or omit elements. As used in this application, an element recited in the singular and preceded by the word “a” or “an” should be understood as not excluding plurals of the elements, unless such exclusion is stated. Furthermore, references to “one embodiment” or “one example” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. The terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to impose numerical requirements or a particular positional order on their objects. The described systems are exemplary in nature, and may include additional elements and/or omit elements. The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various systems and configurations, and other features, functions, and/or properties disclosed. The following claims particularly point out subject matter from the above disclosure that is regarded as novel and non-obvious.