For sound reproduction, there are different kinds of systems which differ with regard to their complexity and reproduction quality. The reference for movie sound is the cinema. Cinemas provide multi-channel surround sound, with loudspeakers installed not only in the front of the listener (usually behind the screen), but additionally on the sides and rear, and recently also on the ceiling. The side and rear loudspeakers enable a horizontally enveloping sound reproduction, which can be further enhanced by vertically engulfing sound using height and ceiling loudspeakers.
With the latest coding techniques, immersive, interactive, and object-based audio content can not only be used in professional environments, but can also conveniently be transmitted into the consumer's home, adding further features and dimensions, such as height reproduction.
Enhanced reproduction setups for realistic sound reproduction use loudspeakers not only mounted in the horizontal plane (usually at or close to ear-height of the listener), but additionally also loudspeakers spread in vertical direction. Those loudspeakers are e.g. elevated (mounted on the ceiling, or at some angle above head height) or are placed below the listener's ear height (e.g. on the floor, or on some intermediate or specific angle).
Often it is inconvenient or impossible to install loudspeakers at top or bottom directions.
In a home environment, likely only enthusiasts will install the number of loudspeakers needed to replicate the loudspeaker setups that are used in professional environments, research labs, or cinemas. Here, the term loudspeaker setup does also include devices and topologies like soundbars, TVs with built in loudspeakers, boomboxes, sound plates, loudspeaker arrays, smart speakers, and so forth.
Nonetheless, when rendering sound for an immersive sound experience or virtual reality, it is often desirable to render sound in height directions as well, i.e. above and below the listener (denoted “top and bottom directions” in the following). Of course, not both directions have to be processed each time, so this is equivalent to “(either) top or bottom directions” or “top/bottom directions”.
Therefore, the need arises to render sound in top and bottom directions without having height loudspeakers, e.g. top loudspeakers and/or bottom loudspeakers.
A convenient alternative to those rather complex setups is compact reproduction systems that use signal processing means to generate a spatial auditory perception comparable or similar to that of the enhanced loudspeaker setups. Here, the term reproduction system includes all devices and topologies for audio reproduction, such as setups comprising a number of individual loudspeakers, soundbars, TVs with built-in loudspeakers, boomboxes, sound plates, loudspeaker arrays, smart speakers, and so forth.
A practical method and an apparatus to achieve this is presented in the following.
According to an embodiment, an apparatus for generating loudspeaker signals for a plurality of loudspeakers so that an application of the loudspeaker signals at the plurality of loudspeakers renders at least one audio object at an intended virtual position, may have: an interface configured to receive an audio input signal which represents the at least one audio object, a first panning gain determiner, configured to determine, depending on the intended virtual position, first panning gains for a first set of loudspeakers of the plurality of loudspeakers, which are arranged within a first horizontal layer, the first panning gains defining a derivation of first partial loudspeaker signals from the at least one audio input signal, which are associated with a rendering of the at least one audio object at a first virtual position upon application of the first partial loudspeaker signals onto the first set of loudspeakers, a vertical panning gain determiner, configured to determine, depending on the intended virtual position, further panning gains for a panning between the first partial loudspeaker signals and one or more second partial loudspeaker signals which is to be applied to a second set of one or more loudspeakers, which is arranged within a second horizontal layer, which is vertically offset relative to the first layer set, and is associated with a rendering of the at least one audio object at a second position so as to pan between the first virtual position and the second position, wherein the apparatus is configured to compose the loudspeaker signals from the audio input signal using the first panning gains and the further panning gains, wherein the apparatus is adaptive to different setups of the plurality of loudspeakers and configured to associate the plurality of loudspeakers to a plurality of horizontal layers so that one of the loudspeakers may be associated with different ones of the horizontal layers, and to select the first horizontal layer and the 
second horizontal layer out of the plurality of horizontal layers so that the intended virtual position is between the first horizontal layer and the second horizontal layer.
According to another embodiment, an apparatus for generating loudspeaker signals for a plurality of loudspeakers so that an application of the loudspeaker signals at the plurality of loudspeakers renders at least one audio object at an intended virtual position, wherein the plurality of loudspeakers are distributed onto one or more horizontal layers, may have: an interface configured to receive an audio input signal which represents the at least one audio object, a first loudspeaker signal set determiner, configured to determine, depending on the intended virtual position, first panning gains for a first set of loudspeakers of the plurality of loudspeakers, and use the first panning gains to derive first partial loudspeaker signals from the at least one audio input signal, which are associated with a rendering of the at least one audio object at a first virtual position upon application of the first partial loudspeaker signals onto the first set of loudspeakers, a second loudspeaker signal set determiner, configured to, by spectral shaping and by panning gains, derive second partial loudspeaker signals from the at least one audio input signal, the second partial loudspeaker signals being associated with a rendering of the at least one audio object at a second virtual position upon application of the second partial loudspeaker signals onto a second set of loudspeakers of the plurality of loudspeakers, wherein the panning gains are selected so that the second virtual position is above or below the one or more horizontal layers and corresponds to a horizontal position which coincides with a listener position along a vertical projection, and a vertical panning gain determiner configured to, depending on the intended virtual position, determine further panning gains for the first and second partial loudspeaker signals so as to pan between the first and second virtual positions, and a composer configured to compose the loudspeaker signals from the first and second partial loudspeaker signals using the further panning gains.
According to another embodiment, a system may have: a plurality of loudspeakers and any of the inventive apparatuses.
According to another embodiment, a method for generating loudspeaker signals for a plurality of loudspeakers so that an application of the loudspeaker signals at the plurality of loudspeakers renders at least one audio object at an intended virtual position may have the steps of: receiving an audio input signal which represents the at least one audio object, determining, depending on the intended virtual position, first panning gains for a first set of loudspeakers of the plurality of loudspeakers, which are arranged within a first layer set of one or more first horizontal layers, the first panning gains defining a derivation of first partial loudspeaker signals from the at least one audio input signal, which are associated with a rendering of the at least one audio object at a first virtual position upon application of the first partial loudspeaker signals onto the first set of loudspeakers, determining, depending on the intended virtual position, further panning gains for a panning between the first partial loudspeaker signals and one or more second partial loudspeaker signals which is to be applied to a second set of one or more loudspeakers, which is vertically offset relative to the first layer set, and is associated with a rendering of the at least one audio object at a second position so as to pan between the first virtual position and the second position, composing the loudspeaker signals from the audio input signal using the first panning gains and the further panning gains.
According to another embodiment, a method for generating loudspeaker signals for a plurality of loudspeakers so that an application of the loudspeaker signals at the plurality of loudspeakers renders at least one audio object at an intended virtual position, wherein the plurality of loudspeakers are distributed onto one or more horizontal layers, may have the steps of: receiving an audio input signal which represents the at least one audio object, determining, depending on the intended virtual position, first panning gains for a first set of loudspeakers of the plurality of loudspeakers, and use the first panning gains to derive first partial loudspeaker signals from the at least one audio input signal, which are associated with a rendering of the at least one audio object at a first virtual position upon application of the first partial loudspeaker signals onto the first set of loudspeakers, by spectral shaping, deriving second partial loudspeaker signals from the at least one audio input signal, the second partial loudspeaker signals being associated with a rendering of the at least one audio object at a second virtual position upon application of the second partial loudspeaker signals onto a second set of loudspeakers, the second virtual position being above or below the one or more horizontal layers, and depending on the intended virtual position, determining further panning gains for the first and second partial loudspeaker signals so as to pan between the first and second virtual positions, and composing the loudspeaker signals from the first and second partial loudspeaker signals using the further panning gains.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform any of the inventive methods when said computer program is run by a computer.
A more efficient rendering of audio objects, which allows 3D panning, is achieved by performing the panning in two stages, namely at least one horizontal in-layer panning leading to a first virtual (speaker) position, with a second virtual or real (speaker) position being vertically offset, and a further panning performed vertically between the two positions. Although such staging seems to increase the computational complexity, it in fact increases the stability of the rendering and the precision with which the intended virtual position is localized. Moreover, according to an embodiment, the staged processing makes it possible to perform the panning using amplitude panning gains only, i.e. phase processing is not necessary, which keeps the computational complexity low. Furthermore, the rendering is flexible in that it is applicable to a variety of loudspeaker setups.
Embodiments of the present application refer to an apparatus for generating loudspeaker signals for a plurality of loudspeakers so that an application of the loudspeaker signals at the plurality of loudspeakers renders at least one audio object at an intended virtual position. The apparatus comprises an interface configured to receive an audio input signal which represents the at least one audio object. It may be one of a channel-based audio signal, object-based audio signal, and/or scene-based audio signal. A first panning gain determiner is configured to determine, depending on the intended virtual position, first panning gains for a first set of loudspeakers of the plurality of loudspeakers, which are arranged within a first layer set of one or more first horizontal layers, the first panning gains defining a derivation of first partial loudspeaker signals from the at least one audio input signal, which are associated with a rendering of the at least one audio object at a first virtual position upon application of the first partial loudspeaker signals onto the first set of loudspeakers. This is the afore-mentioned in-layer panning. A vertical panning gain determiner is configured to determine, depending on the intended virtual position, further panning gains for a panning (or fading) between the first partial loudspeaker signals and one or more second partial loudspeaker signals which is to be applied to a second set of one or more loudspeakers and is associated with a rendering of the at least one audio object at a second position, which is vertically offset relative to the first position, so as to pan between the first virtual position and the second position. This is the vertical panning. 
The one or more second partial loudspeaker signals may be the result of another in-layer panning in which case the second position is a second virtual position or the second position may be the real position of another one of the loudspeakers, which is positioned vertically offset to the first set of loudspeakers. The apparatus is configured to compose the loudspeaker signals from the first partial loudspeaker signals and the one or more second partial loudspeaker signals using the first panning gains and the further panning gains. That is, in the composition, the first and further panning gains are actually applied onto the audio input signal, thereby leading to the loudspeaker signals. There may possibly be one or more loudspeaker signals, for the generation of which just one of the panning gains is to be used, such as for the just-mentioned second loudspeaker positioned at the real loudspeaker position and fed with the second partial loudspeaker signal.
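The two-stage panning described above can be illustrated with a minimal sketch. It assumes a constant-power sine/cosine gain law and two loudspeakers per layer; the text does not prescribe a particular gain law, and the names `pair_gains`, `two_stage_gains` and the layer dictionaries are hypothetical.

```python
import math

def pair_gains(angle, angle_a, angle_b):
    """Constant-power amplitude panning gains between two positions.

    Assumption: a sine/cosine law; the text does not prescribe a gain law.
    """
    # Fraction of the way from position a to position b, clamped to [0, 1].
    t = (angle - angle_a) / (angle_b - angle_a)
    t = min(1.0, max(0.0, t))
    return math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)

def two_stage_gains(azimuth, elevation, lower, upper):
    """Return per-loudspeaker gains for a lower and an upper layer.

    `lower` and `upper` are hypothetical dicts holding the layer elevation and
    the azimuths of the two loudspeakers used in that layer (degrees).
    """
    # Stage 1: horizontal in-layer panning, using the same azimuth in both layers.
    g_low = pair_gains(azimuth, *lower["azimuths"])
    g_up = pair_gains(azimuth, *upper["azimuths"])
    # Stage 2: vertical panning between the two in-layer virtual positions.
    v_low, v_up = pair_gains(elevation, lower["elevation"], upper["elevation"])
    # Composition: each loudspeaker gain is the product of both stages.
    return [v_low * g for g in g_low], [v_up * g for g in g_up]

lower = {"elevation": 0.0, "azimuths": (-30.0, 30.0)}
upper = {"elevation": 35.0, "azimuths": (-30.0, 30.0)}
low_gains, up_gains = two_stage_gains(0.0, 17.5, lower, upper)
```

Because both stages use amplitude gains only, the total power of the four resulting gains stays at one under the assumed gain law, and no phase processing is involved.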
According to some embodiments, as said, the second set of one or more loudspeakers comprises more than one loudspeaker, and the one or more second partial loudspeaker signals comprise more than one second partial loudspeaker signal, and the apparatus further comprises a second panning gain determiner, configured to determine, depending on the intended virtual position, second panning gains for the second set of loudspeakers, the second panning gains defining a derivation of second partial loudspeaker signals from the at least one audio input signal, wherein the apparatus is configured to compose the loudspeaker signals from the first and second partial loudspeaker signals using the first and second panning gains and the further panning gains. Here, according to an embodiment, the second partial loudspeaker signals may be derived from the at least one audio input signal by spectral shaping, so that the second position is a virtual position above or below the second layer set, i.e. not between or within any of the one or more first horizontal layers and the one or more second horizontal layers, within which the second set of loudspeakers are arranged, but on one side, vertically, relative to these horizontal layers.
In accordance with corresponding embodiments, an apparatus results which is for generating loudspeaker signals for a plurality of loudspeakers so that an application of the loudspeaker signals at the plurality of loudspeakers renders at least one audio object at an intended virtual position, wherein the plurality of loudspeakers are distributed onto one or more horizontal layers, the apparatus comprising an interface configured to receive an audio input signal which represents the at least one audio object, a first loudspeaker signal set determiner, configured to determine, depending on the intended virtual position, first panning gains, e.g., as said pure amplitude panning gains so that the first virtual position is in-between positions of the first set of loudspeakers, for a first set of loudspeakers of the plurality of loudspeakers, and use the first panning gains to derive first partial loudspeaker signals from the at least one audio input signal, which are associated with a rendering of the at least one audio object at a first virtual position upon application of the first partial loudspeaker signals onto the first set of loudspeakers, a second loudspeaker signal set determiner, configured to, by spectral shaping, derive second partial loudspeaker signals from the at least one audio signal, the second partial loudspeaker signals being associated with a rendering of the at least one audio object at a second virtual position upon application of the second partial loudspeaker signals onto the second set of loudspeakers, the second virtual position being above or below the one or more horizontal layers, e.g. 
not between or within any of the one or more horizontal layers, but on one side, vertically, relative to the one or more horizontal layers, and a vertical panning gain determiner configured to, depending on the intended virtual position, determine second panning gains for the first and second partial loudspeaker signals so as to pan between the first and second virtual positions, and a composer configured to compose the loudspeaker signals from the first and second partial loudspeaker signals using the second panning gains.
Embodiments set out herein thus reveal a concept for rendering at least one audio object to a set of loudspeakers from at least one audio input signal. In brief, audio input signals may comprise information about audio objects that are to be output by the loudspeakers. For example, such an audio object can be the sound of a helicopter flying in a movie, the sound of an instrument playing in an orchestra, or the sound of a voice. The audio object is rendered using loudspeakers. The audio input signal is processed to determine how the audio object is to be output at individual loudspeakers. For this, each audio input signal is associated with position information of the at least one audio object. Such position information can be static, e.g. the violin is located on the left of the orchestra or the speaker is in front of the listener, or dynamic, e.g. the helicopter flies from right to left. The set of loudspeakers used to render the audio object may comprise one or more groups of loudspeakers, each group located in one horizontal layer. An additional loudspeaker may be a physical or virtual loudspeaker, located above or below the one or more groups.
That means that, for the set of loudspeakers, an association with layers, and with positions offset above or below the layers, may be defined. For example, the setup can comprise four loudspeakers in one layer, e.g. all at the same height, and one physical or virtual loudspeaker elevated above the four other loudspeakers. This setup would then have one layer. One or more additional layers are also possible.
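As a rough illustration of such an association, the following sketch groups loudspeakers into horizontal layers by elevation and treats a loudspeaker without peers at its height as an offset position. The tolerance value, the singleton rule and the name `describe_setup` are assumptions made for this example only.

```python
def describe_setup(elevations, tolerance=5.0):
    """Split loudspeaker indices into horizontal layers and offset positions.

    Assumptions: loudspeakers whose elevation (degrees) lies within `tolerance`
    of a group's lowest member share one layer, and a group with a single
    loudspeaker counts as an offset position rather than as a layer.
    """
    groups = []  # each entry: [reference_elevation, [indices]]
    for index in sorted(range(len(elevations)), key=lambda i: elevations[i]):
        if groups and abs(elevations[index] - groups[-1][0]) <= tolerance:
            groups[-1][1].append(index)
        else:
            groups.append([elevations[index], [index]])
    layers = [sorted(members) for _, members in groups if len(members) > 1]
    offsets = [members[0] for _, members in groups if len(members) == 1]
    return layers, offsets

# Four loudspeakers at roughly ear height plus one elevated loudspeaker.
layers, offsets = describe_setup([0.0, 0.0, 2.0, 0.0, 90.0])
```

For the example setup this yields one layer with four loudspeakers and one offset loudspeaker, matching the one-layer configuration described above.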
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
The following description starts with a description of an embodiment of an apparatus for generating loudspeaker signals for a plurality of loudspeakers. More specific embodiments are outlined herein below along with a description of details which may, individually or in groups, apply to the apparatus of
The apparatus of
The apparatus 10 might be configured for a certain arrangement of loudspeakers 14, i.e., for certain positions in which the plurality of loudspeakers 14 are positioned or positioned and oriented. The apparatus may, however, alternatively be able to be configurable for different loudspeaker arrangements of loudspeakers 14. Likewise, the number of loudspeakers 14 may be two or more and the apparatus may be designed for a set number of loudspeakers 14 or may be configurable to deal with any number of loudspeakers 14.
The apparatus 10 comprises an interface 16 at which apparatus 10 receives an audio signal 18 which represents the at least one audio object. For the time being, let's assume that the audio input signal 18 is a mono audio signal which represents the audio object such as the sound of a helicopter or the like. Additional examples and further details are provided below.
In any case, the audio signal 18 may represent the audio object in time domain, in frequency domain or in any other domain and it may represent the audio object in a compressed manner or without compression.
As depicted in
As depicted in
Apparatus 10 of
The further panning gains 32 determined by vertical panning gain determiner 30 finally result in a panning between the first virtual position and the second position.
As shown in
A further task of composer 40 is the following: as mentioned above, loudspeaker sets 26 and 36 may or may not overlap. Composer 40 correctly distributes the partial loudspeaker signals 28 and 34, obtained by panning using panning gains 24 and 32, onto loudspeakers 14. A partial loudspeaker signal of sets 28 and 34 which belongs to merely one of these sets directly becomes one of the loudspeaker signals 12. Partial loudspeaker signals which are associated with the same loudspeaker out of loudspeakers 14, however, are added up by composer 40 using an adder 46, so that the sum of mutually corresponding partial loudspeaker signals out of sets 28 and 34, respectively, becomes one of the loudspeaker signals 12.
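The distribution and summation performed by the composer can be sketched as follows. This is a simplified illustration operating on a mono block of samples; the function name `compose` and the dictionary-based gain representation are hypothetical and not taken from the text.

```python
def compose(audio, first_gains, second_gains, v_first, v_second, num_loudspeakers):
    """Compose loudspeaker signals from one mono input block.

    `first_gains` and `second_gains` are hypothetical dicts mapping a
    loudspeaker index to its horizontal panning gain within the first and
    second set; `v_first` and `v_second` are the vertical panning gains.
    """
    out = [[0.0] * len(audio) for _ in range(num_loudspeakers)]
    for gains, vertical in ((first_gains, v_first), (second_gains, v_second)):
        for speaker, gain in gains.items():
            for n, sample in enumerate(audio):
                # A loudspeaker present in both sets receives the sum of its
                # two partial loudspeaker signals.
                out[speaker][n] += vertical * gain * sample
    return out

audio = [1.0, -2.0]
signals = compose(audio, {0: 0.6, 1: 0.8}, {1: 0.5, 2: 0.5}, 0.8, 0.6, 3)
```

Here loudspeaker 1 belongs to both sets, so its signal is the sum of two partial signals, whereas loudspeakers 0 and 2 each receive a single partial signal.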
It should be noted that, owing to the associative and commutative properties of the multiplication, composer 40 is not restricted to perform the multiplications for each partial loudspeaker signal in the order depicted in
The first processing step corresponds to a horizontal panning with respect to the partial loudspeaker signals 34 in a manner substantially corresponding to the horizontal panning realized by elements 22, 24 and 42 with respect to partial loudspeaker signals 28. That is, as shown in
Additionally or alternatively relative to elements 52-56, apparatus 10 may comprise a spectral shaper 58 which performs spectral shaping to the input audio signal or intermediary or final products as a result of the horizontal panning at multipliers 56 and vertical panning at multiplier 44b, so that the second partial loudspeaker signals 34 are derived from the at least one audio input signal by this spectral shaping. The spectral shaping is, for instance, for each of the partial loudspeaker signals 34 equal, i.e., the same spectral shaping function may be used. As outlined in more detail below, the spectral shaping function 60 used by spectral shaper 58, is selected so as to form a psycho-acoustical cue for the listener that the second virtual position associated with the second partial loudspeaker signals 34 is positioned above or below the second set 36 of loudspeakers.
The spectral shaping performed by spectral shaper 58 may be performed in spectral domain by means of a multiplication of the partial loudspeaker signals' spectrum with the shaping function 60, or may be done in time domain such as by means of a time domain filter such as an IIR or FIR filter, which time domain filter then would have the frequency response corresponding to spectral shaping function 60. Further notes will be made with respect to the sets 26 and 36. The apparatus may select same depending on a current speaker setup. In other words, the apparatus may be adaptive to different setups. The apparatus may select the first set 26 of loudspeakers out of the plurality of loudspeakers depending on a horizontal component of the intended virtual position such as out of one layer those speakers nearest to the intended virtual position (as far as its vertical projection into the one layer is concerned) or depending on the horizontal component of the intended virtual position and a vertical component of the intended virtual position such as by selecting an outmost layer nearest to the intended virtual position and then selecting the speakers within that one layer. Additionally or alternatively, the second set 36 of loudspeakers may be selected out of the plurality of loudspeakers depending on a vertical component of the intended virtual position such as by selecting an outmost layer nearest to the intended virtual position and using all the speakers belonging to that layer for set 36, or depending on the horizontal component of the intended virtual position and the vertical component of the intended virtual position such as by selecting an outmost layer nearest to the intended virtual position and selecting the set 36 out of the speakers of the layer so that same are nearest to the intended virtual position (as far as its vertical projection into the one layer is concerned).
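One possible way to select the two sets for a given setup is sketched below: the layers bracketing the intended elevation are picked, and within each layer the loudspeakers nearest in azimuth are chosen. The data layout, the choice of two loudspeakers per layer and the fallback to the outmost layer are assumptions of this example.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two azimuths in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def select_sets(layers, azimuth, elevation, per_layer=2):
    """Pick the layers bracketing `elevation` and the nearest loudspeakers in each.

    `layers` is a hypothetical list of (layer_elevation, [azimuths]) tuples,
    sorted by ascending elevation. If the intended position lies above or
    below all layers, both sets are taken from the nearest outmost layer.
    """
    below = [layer for layer in layers if layer[0] <= elevation]
    above = [layer for layer in layers if layer[0] > elevation]
    first = below[-1] if below else above[0]
    second = above[0] if above else below[-1]

    def nearest(layer):
        ranked = sorted(layer[1], key=lambda az: angular_distance(az, azimuth))
        return sorted(ranked[:per_layer])

    return (first[0], nearest(first)), (second[0], nearest(second))

layers = [
    (0.0, [-110.0, -30.0, 0.0, 30.0, 110.0]),
    (35.0, [-135.0, -45.0, 45.0, 135.0]),
]
first_set, second_set = select_sets(layers, azimuth=20.0, elevation=10.0)
```

When the intended position lies outside all layers, the fallback case corresponds to the situation in which the second position would have to be a virtual position created by other means, such as the spectral shaping described above.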
As mentioned before with respect to the first partial loudspeaker signals 28, composer 40 may be configured to perform the multiplications 56 and 44b as well as the spectral shaping 58 in any order, i.e., it may apply the three tasks in any order to the audio input signal 18 in order to obtain the corresponding partial loudspeaker signals 34.
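The order independence follows from the linearity of the operations involved, which the following sketch checks numerically. It assumes a time-domain FIR filter as the spectral shaper and gains that are constant over the processed block; the filter taps and gain values are arbitrary illustration values, not taken from the text.

```python
def fir_filter(signal, taps):
    """Direct-form FIR filter, used here as a time-domain spectral shaper."""
    return [
        sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(signal))
    ]

def scale(signal, gain):
    """Apply a constant amplitude panning gain to a block of samples."""
    return [gain * x for x in signal]

audio = [1.0, 0.5, -0.25, 0.0, 0.75]
taps = [0.5, 0.3, 0.2]           # hypothetical shaping filter
g_horizontal, g_vertical = 0.8, 0.6

# Gains first, spectral shaping last.
shaped_last = fir_filter(scale(scale(audio, g_horizontal), g_vertical), taps)
# Spectral shaping first, gains afterwards and in swapped order.
shaped_first = scale(scale(fir_filter(audio, taps), g_vertical), g_horizontal)
```

Both orderings give the same samples up to floating-point rounding. If the gains vary within a block, as they may for a moving audio object, the orderings are no longer strictly identical, so an implementation would have to pick one order or update the gains between blocks.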
Lastly, it should be noted that, according to an example, the number of loudspeakers within set 36 and, thus, the number of partial loudspeaker signals 34 may be one, even when the spectral shaper 58 is used.
Before proceeding with the description of certain details and embodiments of the present application, which are described in the following by reusing the reference signs and the description brought forward above, the following note shall be made with respect to the composer 40: in case of
Before resuming the description with the announced further details and further detailed embodiments, a brief note shall be made with respect to the advantages achieved by the concept of audio rendering as depicted in
Moreover, although the decomposition of the 3D panning into horizontal panning on the one hand and vertical panning on the other hand might appear to result in a more complex rendering procedure, the resulting computational complexity is still low, while the rendering accuracy in terms of positioning the intended virtual position is high even at this moderate computational complexity.
That is, embodiments described herein provide an alternative to the rather complex setups set-out in the introductory portion of the specification and form a compact reproduction that uses signal processing means to generate a comparable or similar spatial auditory perception as more complex loudspeaker setups. The concepts presented above and in the following are capable of
Note that the embodiments described herein are independent of the reproduction environment and could also be used, e.g., in an automotive environment. Furthermore, the embodiments are independent of the specific type of transducer or topology used for reproduction. That is, the embodiments could be applied e.g. in headphone reproduction, as well as in reproduction using specific loudspeakers such as loudspeaker arrays, soundbars, smart speakers, etc.
That is, the just-made notes make clear that the loudspeakers 14 may be headphone loudspeakers or stereo loudspeakers, but may as well form a loudspeaker array, a soundbar, a set of loudspeakers, a smart speaker or a set of smart speakers, may stem from a surround sound setup, or may be individual loudspeakers, wherein combinations may be feasible as well. Moreover, the description made clear that apparatus 10 operates adaptively in order to adapt, in real time, the composition of the loudspeaker signals 12 to the intended virtual position 21, which may vary in time.
In this regard, it shall briefly be noted that, while embodiments of the rendering apparatuses may be pre-configured for certain loudspeaker setups, i.e. they expect a predefined set of loudspeakers 14 to be positioned at predefined positions, the apparatuses described herein may also be adaptive to different loudspeaker setups, differing in number of loudspeakers and/or speaker positions, in terms of an initialization of the apparatus and/or in terms of an adaptation to moving loudspeaker positions. In the former case, the apparatus may, after initialization, assume the loudspeaker setup to be constant. In the latter case, the apparatus may even adapt to speaker setup variations during runtime. Even the number of speakers could vary during runtime. Accordingly, the apparatus may receive information on the loudspeaker positions, with this optional circumstance, however, not being explicitly shown in the figures. Thus, similar to the optional reception of the listener position information, apparatus of
Commonly used methods for rendering are amplitude panning techniques. To generate the perception of an auditory object at positions that are not covered by loudspeakers (e.g. not between two or more loudspeakers), rendering techniques such as crosstalk cancellation can be utilized. Crosstalk cancellation (XTC) [1-7] has the goal of controlling the left and right ear signals of a listener by means of loudspeakers. This is achieved by “cancelling the crosstalk between the ears” which occurs when a loudspeaker's signal reaches a listener. Once the ear signals can be controlled directly, binaural techniques [8, 9] can be applied to render sound at top and bottom directions. There are two major limitations of the aforementioned techniques. Firstly, XTC has limitations related to sound coloration, an extremely small sweet spot, and a high dependence on loudspeaker positions relative to the listener. Secondly, without head tracking/listener tracking and/or individualized head related transfer functions (HRTFs) or binaural room impulse responses (BRIRs), binaural techniques are limited in the achievable quality/performance. Both of these would add high complexity, cost, and user inconvenience to the system.
Enhancements to conventional amplitude panning have been proposed, using virtual loudspeakers in dimensions not covered by the loudspeaker setup, see e.g. [14, 15]. Height panning using such techniques is not entirely realistic as timbre deviates from sources truly rendered at height.
Vertical Hemispherical Amplitude Panning (VHAP) [10, 11] uses two lateral loudspeakers to render objects with height and on top of a listener. As the loudspeakers have to be at ±90 degrees lateral directions, VHAP is inflexible in terms of listener position.
In this specification, the term virtual loudspeaker is used for a non-existent loudspeaker which is considered during the process of panning an object.
The concept of
The embodiments described herein allow for very straightforward implementations of virtual height rendering.
That is, object panning according to
In the following, the concept of embodiments of the present application is visualized three-dimensionally. See
Stated differently,
a and 5b show, decomposed into individual sub-concepts or steps, as to how the rendering at an intended virtual position 104 using the available loudspeakers 14a to 14d and the virtual loudspeaker 102 is done.
Note that the distance of the intended virtual position 104 does not play a major role in the context of this application and that, accordingly, position 104 is depicted as being far away from the listener for sake of an easier perspective representation only. The rendition may, optionally, operate dependent on the direction towards position 104 only.
According to the concepts set forth above, the efficient generation of a virtual height reproduction is part of a panning algorithm that allows for using the corresponding virtual height speaker in arbitrary loudspeaker setups. Further details are described in the following.
An (object) panning algorithm/panning processor or an apparatus according to any of
Due to the efficiency of the underlying concept, it can also be used for static as well as moving listener positions, i.e. also for applications, for instance, in which the position of the listener 100 is tracked, and the rendering by the apparatus is adapted to the listener position. Adaptation examples are set-out below. Furthermore, an apparatus as described herein could even be applied to scenarios with static as well as moving loudspeakers 14.
In typical reproduction scenarios, the loudspeaker positions are fixed, but the listener's 100 position may continuously change. In such a case, the angles under which the listener 100 sees the loudspeakers 14, as well as the respective angles between loudspeakers change as a function of the listener's 100 position.
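Purely as an illustrative sketch, and not as a limitation of the embodiments, the dependence of the loudspeaker angles on the listener position could be computed as follows. The coordinate convention (x to the front, y to the left, z up), the function name `relative_angles`, and the example positions are assumptions made for this sketch; head orientation is ignored.

```python
import math

def relative_angles(listener, speaker):
    """Return (azimuth_deg, elevation_deg) of a loudspeaker as seen from the listener.

    Both arguments are (x, y, z) tuples in metres. Assumed convention
    (not from the text): x points to the front, y to the left, z up.
    """
    dx = speaker[0] - listener[0]
    dy = speaker[1] - listener[1]
    dz = speaker[2] - listener[2]
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation

# Hypothetical stereo pair; the listener moves 1 m to the left.
speakers = {"L": (2.0, 2.0, 0.0), "R": (2.0, -2.0, 0.0)}
for listener in [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]:
    angles = {name: relative_angles(listener, pos) for name, pos in speakers.items()}
    print(listener, angles)
```

In this example, the two loudspeakers are seen symmetrically from the first listener position and asymmetrically from the second, which is why angles would have to be re-evaluated whenever the tracked position changes.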
Conventional panning algorithms, such as VBAP, typically need initialization for their considered invariant sweet spot and loudspeaker positions. During initialization phase, some complex operations are used, such as mapping loudspeakers to pair, triplet, or quadruplet panning groups.
Since in a tracking scenario, relative positioning of loudspeakers 14 and listener 100 frequently changes, it is undesirable to have a complex initialization phase and fixed mapping. The described panning according to
In particular, the following steps assist in achieving an efficient rendering and in dealing with speaker setups with more than one layer of speakers 14a-d as exemplarily shown in
The steps/functions/blocks participating in the rendering between two layers, or speakers of two layers, are depicted in
The cooperation of the individual elements of
In accordance with an embodiment, both horizontal pannings, namely the one or more module with respect to partial loudspeaker signals 28 and the one regarding the other partial loudspeaker signals 34 by way of elements 52 to 56 use the same azimuth angle for panning. That is, the same azimuth angle is used for both layers. In other words, the horizontal panning is done in a manner so that the projected virtual positions 106 depicted in
A beneficial feature of the embodiments discussed herein is the fact that they do not require extensive initialization. Instead, panning parameters are computed directly from given or changing listener and loudspeaker coordinates or positions. The initialization of the rendering is not dependent on predefined pairs, triplets, or quadruplets of loudspeakers.
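One conceivable way to compute horizontal panning gains directly from the current angles, without predefined loudspeaker pairs, is sketched below. The sine/cosine panning law and the function name `horizontal_gains` are assumptions of this sketch and are not prescribed by the description; the azimuths are taken to be listener-relative and given in degrees.

```python
import math

def horizontal_gains(target_az, speaker_az):
    """Pan to target_az (degrees) using the two loudspeakers adjacent in azimuth.

    speaker_az maps loudspeaker names to listener-relative azimuths in degrees.
    The neighbours are found on the fly, so nothing has to be precomputed
    when the listener or a loudspeaker moves. Needs at least two loudspeakers.
    """
    def ccw(a, b):
        # counter-clockwise angular distance from a to b in [0, 360)
        return (b - a) % 360.0

    # nearest loudspeaker on either side of the target direction
    lower = min(speaker_az, key=lambda n: ccw(speaker_az[n], target_az))
    upper = min((n for n in speaker_az if n != lower),
                key=lambda n: ccw(target_az, speaker_az[n]))
    span = ccw(speaker_az[lower], speaker_az[upper])
    frac = ccw(speaker_az[lower], target_az) / span if span > 0 else 0.0
    gains = {name: 0.0 for name in speaker_az}
    # sine/cosine law keeps g_lower^2 + g_upper^2 = 1 (an assumed panning law)
    gains[lower] = math.cos(frac * math.pi / 2)
    gains[upper] = math.sin(frac * math.pi / 2)
    return gains

# Hypothetical 5.0 layout, azimuths as seen from the current listener position.
layout = {"C": 0.0, "L": 30.0, "R": -30.0, "Ls": 110.0, "Rs": -110.0}
print(horizontal_gains(15.0, layout))
```

Because the neighbouring loudspeakers are looked up per call, the same function could simply be re-run with updated azimuths in a tracked scenario.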
As is clear from the above description, apparatuses according to embodiments of the present application are not restricted to deal with loudspeaker setups where the available loudspeakers 14 are arranged in one layer only. The latter example had been depicted in
Given an arbitrary loudspeaker setup, initialization may involve only that each loudspeaker 14 is classified as belonging to one or more of the following categories:
Layer 1:
Typically this loudspeaker layer is used for panning objects horizontally (approx. on ear height of a seated listener).
Layer 2 to N:
Optionally, loudspeakers in a second layer can be defined, such as loudspeakers in a height (top or bottom) layer. These are layers vertically above or below Layer 1. The loudspeaker layers can, thus, be more than two. The distinction between Layer 1, being on ear height, and any other layer or the other layers is optional.
Top:
Loudspeaker(s) over which vertical top direction is reproduced. This can be a dedicated loudspeaker, or a subset of loudspeakers of other layers.
Bottom:
Loudspeaker(s) over which vertical bottom direction is reproduced. This can be a dedicated loudspeaker, or a subset of other layers.
The above description is not limited to regular setups, where regular would e.g. imply that an equal number of loudspeakers is present in every layer, having equal angles/distances between them, or that all layers completely surround the listener, or that all layers have loudspeakers arranged at exactly the same vertical angle as seen from the listener.
Actually, as mentioned before, any arbitrary setup can be used. The different loudspeakers could be positioned at different/arbitrary azimuth angles, and at different/arbitrary elevation angles (i.e. different heights). Loudspeakers considered to be part of one layer do not necessarily need to lie within a plane. Variations in their vertical positioning are allowed.
Horizontal panning by module 70 would be done using all available loudspeakers (Layer 1). Top and Bottom directions are rendered using module 72 over all loudspeakers except the center (C). That is, set 36 would comprise all loudspeakers except the center, while set 28 would encompass all speakers.
Please note that this is an explicit decision for this example. Of course, the center loudspeaker could also be used for height rendering.
A further classification using a 5.0+2H loudspeaker setup is depicted in
In this example, the middle layer surround loudspeakers (M_Ls and M_Rs) are used for both layers (Layer 1 and Layer 2), since otherwise Layer 2 would not surround the listener. That is, Layer 1 and Layer 2 speakers would be used for inter-layer panning as illustrated in
Alternative classifications in this setup could be to decide for rendering without a Layer 2. The Top could be rendered using only the elevated loudspeakers U_L and U_R, or alternatively, the top could also be rendered by a combination of the U_L, U_R, M_Ls, and M_Rs as described before.
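As a non-authoritative illustration, the classification of the 5.0+2H example could be held in a simple mapping from category to loudspeaker names. The names M_Ls, M_Rs, U_L and U_R are taken from the description; M_L, M_R and M_C are assumed names for the remaining middle layer loudspeakers, and the empty Bottom category as well as the helper `categories_of` are hypothetical.

```python
# Names M_Ls, M_Rs, U_L, U_R come from the text; M_L, M_R and M_C are assumed
# names for the remaining middle-layer loudspeakers of a 5.0 base setup.
CLASSIFICATION = {
    "layer_1": ["M_L", "M_R", "M_C", "M_Ls", "M_Rs"],
    # surrounds are shared so that Layer 2 encloses the listener
    "layer_2": ["U_L", "U_R", "M_Ls", "M_Rs"],
    "top": ["U_L", "U_R", "M_Ls", "M_Rs"],
    # hypothetical: no bottom rendering configured in this sketch
    "bottom": [],
}

def categories_of(speaker, classification=CLASSIFICATION):
    """List every category a loudspeaker belongs to (it may be more than one)."""
    return [cat for cat, members in classification.items() if speaker in members]

print(categories_of("M_Ls"))
print(categories_of("M_C"))
```

The alternative classification without a Layer 2 would, in this representation, amount to leaving the `layer_2` list empty and setting `top` to either the two elevated loudspeakers or the combination of four.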
Further examples are readily derivable, e.g. with bottom layer loudspeakers, with more or fewer elevated loudspeakers, with more or fewer loudspeakers in the middle layer, or with more arbitrary or irregular loudspeaker setups.
In the following, the case of rendering an object in 3D is explained for an example case where the object is panned in a direction (as seen from the listener) that lies between two physically present loudspeakers layers (which are at different height). This had already been discussed above with respect to
The object is amplitude panned in the first layer by giving the object signal to loudspeakers in this layer with different gains 24, e.g. by giving the object signal to M_L and M_Ls such that it is amplitude panned to bottom layer gray dot position 1061 in
This weighting for the horizontal panning between (real) loudspeaker layers can additionally be frequency dependent to compensate for the effect that in vertical panning different frequency ranges may be perceived at different elevation [13].
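The combination of horizontal panning within each layer and a weighting between the two layers may be sketched as follows. This is a simplified, frequency-independent illustration: each layer is represented by a single nominal elevation, a sine/cosine law is assumed for the vertical weighting, and the function names are hypothetical. The frequency-dependent weighting mentioned above is not modelled.

```python
import math

def vertical_gains(target_el, lower_el, upper_el):
    """Gains (g_lower, g_upper) for panning between two layer elevations in degrees.

    Assumes upper_el != lower_el; the target is clamped to the range between them.
    """
    frac = (target_el - lower_el) / (upper_el - lower_el)
    frac = min(max(frac, 0.0), 1.0)
    return math.cos(frac * math.pi / 2), math.sin(frac * math.pi / 2)

def compose_gains(layer1_gains, layer2_gains, target_el, layer1_el, layer2_el):
    """Weight the per-layer horizontal gains and sum them per loudspeaker.

    A loudspeaker belonging to both layers simply receives both contributions.
    """
    g1, g2 = vertical_gains(target_el, layer1_el, layer2_el)
    total = {}
    for name, g in layer1_gains.items():
        total[name] = total.get(name, 0.0) + g1 * g
    for name, g in layer2_gains.items():
        total[name] = total.get(name, 0.0) + g2 * g
    return total

# Hypothetical per-layer gains for an object halfway between layers at 0 and 30 degrees.
print(compose_gains({"M_L": 0.8, "M_Ls": 0.6}, {"U_L": 1.0}, 15.0, 0.0, 30.0))
```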
The rendering of objects above or below a layer, or above or below the outermost layer, is now examined further, supplementing the description set forth above.
An object may have a direction or position 104 which is not within the range of directions between two layers as discussed with respect to
In this case, horizontal amplitude panning is applied by module 70 to the height layer to render the object in that layer. The resulting position 1061 of the rendered object is indicated as height layer gray dot position 1061 in
Then, panning is applied between position 1061 in the height layer and the vertical direction/position 1062, indicated as gray dot position 1062 in
Since there is no real loudspeaker at the vertical top or bottom direction, the vertical signal at 1062 is equalized by module 58 to mimic coloration of top or bottom sound respectively (see subsequent explanation for more details on the equalization). The vertical signal is then given to the loudspeakers designated for top/bottom direction, i.e. set 36.
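The signal flow for an object located between the height layer and the virtual top direction could, under simplifying assumptions, look like the sketch below. The equalizer is passed in as a callable `top_eq` standing in for module 58, the sine/cosine vertical panning law and the equal, power-normalized gains for the top set are assumptions of this sketch, and the function name is hypothetical.

```python
import math

def render_above_layer(x, layer_gains, top_set, target_el, layer_el, top_eq):
    """Return {speaker: samples} for an object between a layer and the virtual top.

    x            list of input samples
    layer_gains  {speaker: gain} from horizontal panning within the layer
    top_set      loudspeakers designated for the top direction (non-empty)
    top_eq       callable mapping samples to equalized samples (stand-in for the EQ)
    """
    frac = min(max((target_el - layer_el) / (90.0 - layer_el), 0.0), 1.0)
    g_layer = math.cos(frac * math.pi / 2)   # assumed sine/cosine panning law
    g_top = math.sin(frac * math.pi / 2)
    g_each = 1.0 / math.sqrt(len(top_set))   # equal, power-normalized (assumption)
    x_eq = top_eq(x)
    out = {}
    # non-equalized part: horizontal panning within the layer
    for name, g in layer_gains.items():
        out[name] = [g_layer * g * s for s in x]
    # equalized part: virtual top signal distributed over the top set
    for name in top_set:
        prev = out.get(name, [0.0] * len(x))
        out[name] = [p + g_top * g_each * s for p, s in zip(prev, x_eq)]
    return out

# Identity "EQ" used only to keep the example short; it does not colour the signal.
signals = render_above_layer([1.0, 0.0], {"U_L": 1.0}, ["U_L", "U_R"],
                             60.0, 30.0, lambda s: list(s))
print(signals)
```

A loudspeaker that belongs both to the layer and to the top set receives the sum of both contributions, which corresponds to the composition of the loudspeaker signals from the partial signals.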
As to the rendering of the virtual Top or Bottom loudspeakers 102 the following may be said.
In general, different approaches can be chosen to render the virtual vertical Top or Bottom loudspeakers.
In general, two different approaches can be chosen:
As application examples, (1) could be beneficially chosen, if the listener position can be tracked, while (2) could be chosen if the possibility for listener tracking is not available.
A simple implementation uses the same gain for each loudspeaker selected for Top or Bottom rendering, i.e. the gains 54 would be chosen to be equal. This scheme works well. (It can e.g. be used as the simplest implementation and is especially useful when the listener position is not tracked and thus not known.)
Especially when the listener is not centrally located within the loudspeaker setup, then the following considerations can improve top and bottom rendering:
In the following, the equalizer (or spectral shaper) 58 is described in further detail. The main cues enabling the listener 100 to localize a sound source in the horizontal plane are differences between the left and right ear input signals (interaural time differences (ITDs) and interaural level differences (ILDs)). The primary cues for estimating the vertical position of a sound source are spectral variations due to reflections produced by the listener's head, torso, and pinnae. Such cues are often called monaural cues (MCs), and are referred to as psycho-acoustical cues in the above description.
The specific ILDs, ITDs, and MCs, which occur due to the unique body features of each individual and the considered direction of incidence, are commonly subsumed under the term Head Related Transfer Functions (HRTFs). Especially the MCs are highly individual. Still, there are some common features that influence the height perception in general.
By shaping the frequency content of a specific source signal that is received from one direction, the illusion that this sound actually comes from a different elevation and/or front-back-orientation on the same cone of confusion can be supported. This corresponds to changing MCs and is the purpose of the equalizer (EQ) 58.
A simple but well working implementation of the concept of using virtual top/bottom loudspeakers, and equalization of these signals, uses a specific static EQ for the top and bottom direction respectively.
The equalizer 60a for top direction typically has one or more notches and/or peaks. Typically there is a notch below 1 kHz and one or more peaks at higher frequencies. An equalizer 60b for bottom direction includes the effect of “body shadowing”, that is, overall high frequencies are attenuated. In other words, by function 60a, the second partial loudspeaker signals 34 are, relative to the audio input signal 18, dampened in a notch spectral range 120 between 200 and 1000 Hz and amplified within one or more peak spectral ranges 1221 and 1222 (here there are exemplarily two) lying between 1 kHz and 10 kHz. By function 60b, the second partial loudspeaker signals 34 are, relative to the at least one audio signal, dampened in a spectral range 124 above 1000 Hz with a reduction of the dampening within a spectral subrange 126 within the spectral range 124, which subrange is located between 5 and 10 kHz. Further, function 60b may, as depicted in
The effective overall spectrum of the acoustic signal arriving at the listener is determined partially by non-EQ'ed signal (amplitude panning within a layer) 28 and partially by EQ'ed signal (signal from virtual top/bottom) 34. Thus the effective overall EQ is a linear combination of unity and the top/bottom EQs 60a/60b. In that way, the EQing at the listener is fading in as a source 104 moves towards top position (or correspondingly towards bottom position).
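The fading-in of the equalization can be illustrated numerically. In the following sketch, the top equalizer is modelled as a sum of Gaussian-shaped gains on a logarithmic frequency axis; the centre frequencies, gains in dB, and widths are hypothetical values chosen only to match the qualitative description (one notch between 200 and 1000 Hz, two peaks between 1 and 10 kHz). The effective response assumes that the non-equalized and the equalized signal add in phase at the listener, which is a simplification.

```python
import math

def bump_db(f, centre_hz, gain_db, width_oct):
    """Gaussian-shaped gain (dB) on a log-frequency axis."""
    octaves = math.log2(f / centre_hz)
    return gain_db * math.exp(-0.5 * (octaves / width_oct) ** 2)

def top_eq_db(f):
    # Hypothetical values: one notch between 200 and 1000 Hz and
    # two peaks between 1 and 10 kHz, as the text describes qualitatively.
    return (bump_db(f, 500.0, -6.0, 0.5)
            + bump_db(f, 2000.0, 3.0, 0.4)
            + bump_db(f, 7000.0, 5.0, 0.4))

def effective_eq_db(f, g_layer, g_top):
    """Effective magnitude (dB) of g_layer * 1 + g_top * H_top(f).

    Assumes non-negative gains, not both zero, and in-phase summation.
    """
    h_top = 10.0 ** (top_eq_db(f) / 20.0)
    return 20.0 * math.log10(g_layer + g_top * h_top)

# Linear weights are used here so that only the amount of EQ changes.
for w in (0.0, 0.5, 1.0):
    print(w, round(effective_eq_db(500.0, 1.0 - w, w), 2))
```

With the weight moving from 0 to 1, the depth of the notch at 500 Hz in this example grows from no attenuation to the full attenuation of the assumed top equalizer.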
Such a continuous fade/change in the amount of EQing is specifically beneficial, since the human auditory system can use those changes in the spectrum of the received signal to judge its location. Especially in tracked scenarios, these changes can be used to distinguish whether a specific spectral feature is a property of the actual signal, or changes while the listener is moving, and can thus be interpreted as a feature related to the source location.
Summarizing, a reproduction of object based audio or multichannel audio with reproduction of elevated or lowered height sounds (top and bottom) is enabled. A playback of input audio signals (featuring sound intended for reproduction over elevated or lower loudspeaker layers) over arbitrary loudspeaker setups is possible. Here, “loudspeaker setups” does also include devices and topologies like soundbars, TVs with built in loudspeakers, boomboxes, soundplates, loudspeaker arrays, smart speakers, and so forth. There is no need to have elevated or lower loudspeaker layers. Thus, a perceptual effect of top or bottom sounds in almost any arbitrary loudspeaker setup (even without elevated or lower loudspeakers) is made possible.
The embodiments are computationally efficient, such that they can also be beneficially used in scenarios where the (changing) listener position is known and/or (constantly) tracked by the playback system.
The embodiments can be used for channel-based audio, object-based audio, and scene-based audio (e.g. Ambisonics) input format signals.
Compared to rendering methods which are HRTF based, it is to be emphasized that the embodiments do not aim at simulating detailed specific binaural cues for specific object positions in all possible directions (which might be difficult to achieve over a wide range). Instead, a good simulation of cues is produced that evokes the perception of a sound source above or below the listener (i.e., produces a virtual source above or below) at one specific position/direction. The aim is thus to mimic the perception for those two directions (top/bottom 102) in a convincing way. A benefit of these two specific directions chosen is that, besides the spectral cues, the two other dominant spatial audio cues (i.e. ITDs and ILDs) are minimal; theoretically, no ITD and no ILD occurs for sound sources perfectly above or below a listener, i.e., the particle velocity in horizontal direction is close to zero for the direct sound from the sound source. Thus, the two-stage approach with panning horizontally and vertically, potentially with virtually rendering the top/bottom speaker 102, is stable and leads to high accuracy.
In the following, we describe some further example selection criteria how loudspeakers of the plurality of loudspeakers could automatically be assigned to a set or a layer of loudspeakers for reproduction of a virtual loudspeaker
Possible input parameters for (possibly adaptive) rendering are:
To conclude, the embodiments described herein can optionally be supplemented by any of the important points or aspects described here. However, it is noted that the important points and aspects described here can either be used individually or in combination and can be introduced into any of the embodiments described herein, both individually and in combination. As an outcome of the latter, the above description inter alia, includes an apparatus for generating loudspeaker signals 12 for a plurality of loudspeakers 14 so that an application of the loudspeaker signals 12 at the plurality of loudspeakers 14 renders at least one audio object at an intended virtual position 104, the apparatus comprising an interface 16 configured to receive an audio input signal 18 which represents the at least one audio object, a first panning gain determiner 22, configured to determine, depending on the intended virtual position, first panning gains 24 for a first set 26 of loudspeakers of the plurality of loudspeakers, which are arranged within, or form, a first horizontal layer, the first panning gains 24 defining a derivation of first partial loudspeaker signals 28 from the at least one audio input signal 18, which are associated with a rendering of the at least one audio object at a first virtual position 106 upon application of the first partial loudspeaker signals 28 onto the first set 26 of loudspeakers, a vertical panning gain determiner 30, configured to determine, depending on the intended virtual position, further panning gains 32 for a panning between the first partial loudspeaker signals 28 and second partial loudspeaker signals 34 which are to be applied to a second set 36 of loudspeakers, which is vertically offset relative to the first layer set, so as to be arranged in, or form, a second horizontal layer, and is associated with a rendering of the at least one audio object at a second position 102 so as to pan between the first virtual position 106 and the second 
position 102, wherein the apparatus is configured to compose the loudspeaker signals 12 from the audio input signal 18 using the first panning gains 24 and the further panning gains 32. A second panning gain determiner 52 is also comprised, which is configured to determine, depending on the intended virtual position, second panning gains 54 for the second set of loudspeakers, the second panning gains 54 defining a derivation of the second partial loudspeaker signals 34 from the at least one audio input signal, and the apparatus is configured to compose the loudspeaker signals 12 from the audio input signal 18 using the first and second panning gains and the further panning gains. The first and second panning gain determiners 22, 52 are configured to select the first and second sets 26, 36 of loudspeakers of the plurality of loudspeakers so that the first and second layer sets have, among horizontal layers which the plurality of loudspeakers are distributed onto, the intended virtual position 104 vertically therebetween. Note that the first set 26 of loudspeakers and the second set 36 of loudspeakers may partially overlap, i.e. one loudspeaker may be contained by both sets 26 and 36. To be more precise, the plurality of loudspeakers may be distributed onto the horizontal layers in a manner that, for each horizontal layer, the loudspeakers belonging to that horizontal layer surround, horizontally (i.e. in horizontal projection) a listener position, or, differently speaking, allow for, horizontally, a 360 degree panning around the listener position, and for sake of achieving this circumstance, for instance, at least one pair of horizontal layers may share one or more of their loudspeakers. That is, horizontality and vertical offsetness of the horizontal layers may be abstracted to an extent that sometimes, such as for at least one pair of horizontal layers, one or more loudspeakers belong to more than one of the horizontal layers, respectively.
In even other words, the above description, inter alia, includes an apparatus for generating loudspeaker signals 12 for a plurality of loudspeakers 14 so that an application of the loudspeaker signals 12 at the plurality of loudspeakers 14 renders at least one audio object at an intended virtual position 104, wherein the plurality of loudspeakers are distributed onto one or more horizontal layers, the apparatus comprising an interface 16 configured to receive an audio input signal 18 which represents the at least one audio object, a first loudspeaker signal set determiner 70, configured to determine, depending on the intended virtual position, first panning gains 24 for a first set of loudspeakers 26 of the plurality of loudspeakers, and use the first panning gains 24 to derive first partial loudspeaker signals 28 from the at least one audio input signal 18, which are associated with a rendering of the at least one audio object at a first virtual position 106 upon application of the first partial loudspeaker signals onto the first set 26 of loudspeakers, a second loudspeaker signal set determiner 72, configured to, by spectral shaping, derive second partial loudspeaker signals 34 from the at least one audio input signal 18, the second partial loudspeaker signals 34 being associated with a rendering of the at least one audio object at a second virtual position 102 upon application of the second partial loudspeaker signals 34 onto a second set of loudspeakers 36, the second virtual position being above or below the one or more horizontal layers, and a vertical panning gain determiner 30 configured to, depending on the intended virtual position, determine further panning gains 32 for the first and second partial loudspeaker signals so as to pan between the first and second virtual positions, and a composer 40 configured to compose the loudspeaker signals from the first and second partial loudspeaker signals using the further panning gains 32. 
Again, note that the first set 26 of loudspeakers and the second set 36 of loudspeakers may partially overlap, i.e. one loudspeaker may be contained by both sets 26 and 36. To be more precise, the plurality of loudspeakers may be distributed onto the horizontal layers in a manner that, for each horizontal layer, the loudspeakers belonging to that horizontal layer surround, horizontally (i.e. in horizontal projection) a listener position, or, differently speaking, allow for, horizontally, a 360 degree panning around the listener position, and for sake of achieving this circumstance, for instance, at least one pair of horizontal layers may share one or more of their loudspeakers. That is, horizontality and vertical offsetness of the horizontal layers may be abstracted to an extent that sometimes, such as for at least one pair of horizontal layers, one or more loudspeakers belong to more than one of the horizontal layers, respectively. All the other modifications described above and mentioned in the subsequent claims are feasible as well, such as the usage of spectral shaping 58 so as to derive the second partial loudspeaker signals 34 from the at least one audio signal 18 so that the second position is a virtual position 102 above the highest one or below the lowest one of the horizontal layers.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a device or a part thereof corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding apparatus or part of an apparatus or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine-readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein, or any parts of the methods described herein, may be performed at least partially by hardware and/or by software.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
PCT/EP2021/054853 | Feb 2021 | WO | international |
This application is a continuation of copending International Application No. PCT/EP2022/054880, filed Feb. 25, 2022, which is incorporated herein by reference in its entirety, and additionally claims priority from International Application No. PCT/EP2021/054853, filed Feb. 26, 2021, which is also incorporated herein by reference in its entirety. The invention relates to the technical field of audio reproduction. Specifically, reproduction of multichannel audio with reproduction of elevated or lowered height sounds is described herein.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2022/054880 | Feb 2022 | US |
Child | 18454942 | | US |