Example embodiments disclosed herein relate generally to processing audio data, and more specifically, to adaptive panner of audio objects including dynamic audio objects and static audio objects.
Input audio content such as originally authored/produced audio content, and the like, may include a large number of audio objects individually represented in an object-based audio format such as Dolby ATMOS® to help create a spatially diverse, immersive and accurate audio experience. Audio playback systems such as those used by cinemas and home theaters are also becoming increasingly versatile and complex, evolving from 5.1 to 7.1, then from 5.1.2 to 7.1.4, then 22.2 (e.g., as defined in ITU-R BS.2051-0), the content of which is incorporated herein by reference in its entirety, among others. As audio source layouts (or audio speaker layouts) transition from planar two-dimensional (2D) arrays to three-dimensional (3D) arrays with elevated speakers and increasing audio channels, reproducing sounds in a playback environment is becoming increasingly complex.
In content creation as well as end user content consumption, speaker positions might be presumed to be in compliance with a standard audio source layout's recommended specification. This presumption, however, can be incorrect in the real world. For example, in a home theater, speakers such as surround speakers are often located at non-standard positions despite the standard audio source layout's recommended specification. As a result, spatial distortion can occur in audio rendering if the audio rendering is based on a presumption that the speakers are located at the standard positions.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not assume to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
The example embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
Example embodiments, which relate to adaptive panner of audio objects, are described herein. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the example embodiments. It will be apparent, however, that the example embodiments may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating the example embodiments.
Example embodiments are described herein according to the following outline:
This overview presents a basic description of some aspects of the example embodiments described herein. It should be noted that this overview is not an extensive or exhaustive summary of aspects of the example embodiments. Moreover, it should be noted that this overview is not intended to be understood as identifying any particularly significant aspects or elements of the embodiment, nor as delineating any scope of the embodiment in particular, nor in general. This overview merely presents some concepts that relate to the example embodiment in a condensed and simplified format, and should be understood as merely a conceptual prelude to a more detailed description of example embodiments that follows below.
Example embodiments described herein relate to adaptive panner of audio objects. An audio object including audio content and object metadata is received. Examples of audio objects may include, but are not necessarily limited to only, any of: audio objects that are defined in a manner independent of any specific audio source layout, audio objects that represent audio channels of a specific audio source layout (e.g., a left audio channel or a right audio channel in a stereo audio source layout, a left front audio channel or a right front audio channel in a surround sound audio source layout, among others) that may be treated as static objects located at expected canonical positions of the audio channels (or speakers) in the specific audio source layout. The object metadata of the audio object indicates an object spatial position of the audio object to be rendered by a plurality of audio speakers in a playback environment. Each audio speaker in the plurality of audio speakers is located in a respective source spatial position in a plurality of source spatial positions in the playback environment. Based on the object spatial position of the audio object and the plurality of source spatial positions of the plurality of audio speakers, a plurality of initial gain values for the plurality of audio speakers is determined. Each audio speaker in the plurality of audio speakers is assigned with a respective initial gain value in the plurality of initial gain values. The plurality of initial gain values is used to select a set of audio speakers from among the plurality of audio speakers. Based on the object spatial position of the audio object and a set of source spatial positions at which the set of audio speakers are respectively located in the playback environment, a set of optimized gain values is determined for the set of audio speakers. The audio object at the object spatial position is caused to be rendered with the set of optimized gain values for the set of audio speakers. Each audio speaker in the set of audio speakers being assigned with a respective optimized gain value in the plurality of optimized gain values.
In some example embodiments, mechanisms as described herein form a part of a media processing system, including, but not limited to, any of: an audio video receiver, a home theater system, a cinema system, a game machine, a television, a set-top box, a tablet, a mobile device, a laptop computer, netbook computer, desktop computer, computer workstation, computer kiosk, various other kinds of terminals and media processing units, and the like.
Various modifications to the preferred embodiments and the generic principles and features described herein will be readily apparent to those skilled in the art. Thus, the disclosure is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features described herein.
Any of embodiments as described herein may be used alone or together with one another in any combination. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
Techniques as described herein can be applied to support audio source layouts with arbitrary positions at which audio speakers positions may be (e.g., actually, virtually, etc.) located. These techniques can be implemented by a wide variety of media processing systems including but not limited to audio video receivers (AVRs), etc., some of which could be embedded systems with severe or stringent constraints in CPU power, memory space, I/O speed, and the like.
As compared with other audio rendering methods, techniques as described herein provide an audio object rendering method that is highly flexible, configurable, and adaptable, with different audio source layouts in different playback environments. Under the techniques as described herein, representations by interior objects (e.g., audio objects located in a small spatial volume contained inside the convex hull of the audio speakers) can be made with optimized gain values. In addition, calculation of the optimized gain values under the techniques as described herein do not require any previous geometrical construction (triangulation) as some other approaches (e.g., vector base amplitude panning (VBAP), among others) do. For example, the audio object rendering method can adopt a solution with complete flexibility with respect to spatial positions of audio speakers (e.g., loudspeakers, audio sources, etc.), can take advantage of system resources while avoiding adverse impacts of resource constraints (e.g., embedded resource constraints, etc.). Consequently, the audio object rendering under the techniques as described herein leads to better listening experiences, for example, in irregular audio source layouts.
As used herein, the term “audio object” (or simply “object”) refers to a combination of audio content (or audio signal) and object metadata (e.g., spatial positional metadata, etc.). The audio content and the object metadata may be created without reference to (or regardless of) any particular playback environment or audio source layouts therein that is to actually render the audio object. Examples of audio content may include, but are not necessarily limited to only, any of: audio frames, audio data blocks, audio samples, and the like. Examples of spatial positional metadata in the object metadata may include, but are not necessarily limited to only, any of: spatial positions (e.g., linear positions, angular positions, etc.), spatial velocities (e.g., linear velocities, angular velocities, etc.), spatial accelerations (e.g., linear accelerations, angular accelerations, etc.), spatial trajectories, and the like, in connection with an audio object.
As used herein, the term “audio sources” (or simply “sources”) refers to audio speakers, audio speaker clusters, audio speaker groups, and the like, in a playback environment for which audio channel data generated by an adaptive audio playback system based on audio objects is to be rendered. As used herein, the term “rendering” may refer to a process of transforming audio objects into audio channel data (1) to be used to directly drive the audio sources of the adaptive audio playback system for rendering, or (2) to be transmitted/delivered to a recipient audio rendering system for rendering. The audio channel data, which represents the audio objects in the specific playback environment, may be audio content data adapted for a specific audio source layout in the specific playback environment. In some example embodiments, the audio channel data may be compressed/encoded/packaged (e.g., by the adaptive audio playback system, by an audio encoder, etc.) in an efficient form for transmission/delivery to a downstream recipient audio rendering system for driving audio sources of a specific audio source layout in connection with the downstream recipient audio rendering system. The recipient audio rendering system may be local or remote to the adaptive audio playback system or the audio encoder that generates the audio channel data.
An adaptive audio playback system as described herein may receive or otherwise determine source configuration data for a specific audio source layout in a specific playback environment such as a movie theater, a concert hall, a theme park, a home, an office, a theater, a restaurant, a bar, and the like. As used herein, the term “source configuration data” may include location data indicating (source spatial) positions of some or all of audio speakers in a playback environment. For example, the source configuration data may define or specify a respective source spatial location for each audio source of a plurality of audio sources in the specific playback environment. A source spatial location as described herein may be provided as spatial coordinates of a spatial location of an audio source in a coordinate system such as one related to Cartesian coordinates, spherical coordinates, angular coordinates, and the like. The spatial coordinates can be defined relative to a reference location in the specific playback environment, such as a spatial location of a specific audio source in the specific playback environment, and the like. In some embodiments, each audio source in the plurality of audio sources may correspond to one or more audio speakers of the specific playback environment.
The adaptive audio playback system as described herein may receive one or more audio objects each of which comprises one or more respective audio content (e.g., respective audio signals) and respective object metadata (including but not limited to spatial positional metadata). Spatial positional metadata of an audio object may comprise a plurality of (e.g., time-varying, time-constant, etc.) object spatial locations of the audio object in a coordinate system (which may be the same coordinate system used to represent audio sources). The plurality of object spatial locations of the audio object may be a function of time, and may represent or indicate a spatial trajectory of the audio object in the spatial volume such as represented in the specific playback environment. More specifically, the adaptive audio playback system can be configured to translate the spatial positional metadata of the audio object into the spatial trajectory of the audio object in the spatial volume as represented in the specific playback environment.
When the audio object is rendered or played back in a specific playback environment, the audio object may be rendered in the specific playback environment according to at least the spatial positional metadata of the audio object and the source configuration data of the specific audio source layout. A process of rendering the audio object by the adaptive audio playback system may involve determining a respective (e.g., time-varying, time-constant, etc.) contribution (e.g., as represented by a gain value, etc.) from each audio source of the plurality of audio sources in the specific playback environment, based at least in part on the source spatial data of the specific audio source layout in the specific playback environment and the object spatial data of the audio object. In some embodiments, a contribution of an audio source in the plurality of audio sources for rendering the audio object may be represent by an audio object gain (e.g., gain, gain value, etc.) that is assigned to or determined for the audio source.
Determination of individual contributions from, or individual gains for, audio sources in the plurality of audio sources in the specific playback environment for the purpose of rendering the audio object can be made in one or more of a variety of methods. In some example embodiments, the adaptive audio playback system may determine the individual gains based on minimizing or optimizing an audio object cost function of which the individual gains are variables that form a search space, and (source) spatial positions of the audio sources in the specific playback environment are (e.g., input) parameters. Additionally, optionally, or alternatively, the adaptive audio playback system may incorporate one or more regularization terms in favor of a certain optimization solution among a large number of possible solutions.
For the purpose of illustration only, in some embodiments, gain optimization can be performed through an inverse-matrix method, a multiplicative-update method, or some other iterative method. Various embodiments include using gain optimization methods other than the inverse-matrix method, the multiplicative-update method, and the like. For example, in some embodiments, instead of using an inverse-matrix method to generate nonnegative and/or negative initial gain values, a different gain optimization method that can generate nonnegative and/or negative initial gain values may be used instead of, or in conjunction with, the inverse-matrix method. For example, a quadratic programming method that does not implement a nonnegativity constraint may be used to generate nonnegative and/or negative initial gain values. Additionally, optionally, or alternatively, in some embodiments, instead of using a multiplicative-update method to maintain nonnegativity of updated gain values, a different gain optimization method that can maintain nonnegativity of updated gain values may be used instead of, or in conjunction with, the multiplicative-update method. In an example, a quadratic programming method (e.g., implemented as a function in a third party extension of MATLAB such as pdco( ) etc.) that implements a nonnegativity constraint may be used to update gain values and maintain nonnegativity of the updated gain values. In another example, an interior point optimizer (e.g., implemented in the software library Interior Point OPTimizer, or IPOPT) may be used to update gain values and maintain nonnegativity of the updated gain values. Such a method may, but is not necessarily limited to only, be implemented as an iterative method, a recursive method, and the like.
Let g.{tilde over (g)} denote the element-wise product of two 1×N vectors g and {tilde over (g)}. Let g−1 denote a vector in which the i-th element is equal to the inverse gi−1 of the i-th element (gi) of a 1×N vector g.
By way of example but not limitation, the adaptive audio playback system may implement a Center of Mass Amplitude Panning (CMAP) paradigm that determines the individual gains for the audio sources based on minimizing/optimizing an audio object cost function (or objective function). In an example embodiment, such an audio object cost function may be given as follows:
E=E
CL
+E
distance
+E
sum-to-one (1)
where each term or criterion is given as follows:
E
CL=[(Σigi)−Σigi]2 (2)
E
distance=αdistanceΣigi2(−)2 (3)
E
sum-to-one=αsum-to-one[Σigi−1]2 (4)
where rs represents the (object) spatial position of the audio object; ri represent the (source) spatial positions of the audio sources; gi represent the individual gains of the audio sources; ECL is a term in favor of representing the audio object at a center of loudness of the audio sources; Edistance is a constraint term for penalizing activating those audio sources (e.g., firing audio speakers, etc.) that are far from the audio object with its weight, αdistance (e.g., set to 0.01, 0.02, etc.); Esum-to-one is another constraint term for restricting the magnitudes/values of the gains to unit sum with its weight, αsum-to-one (e.g., set to 1, 1.1, etc.).
Techniques as described herein can be applied to deriving optimal representation of audio objects by audio sources in a wide variety of possible audio source layouts. These techniques can be used to prevent audible artifacts, spatial distortion, instability (e.g., with negative gains for the audio sources), and the like. While an audio object cost function that includes terms such as the center-of-loudness term, the constraint terms, and the like, may be used to determine gains for audio sources, other audio object cost functions may also be used instead of or in addition to the audio object cost function as described herein. Additionally, alternatively or optionally, other terms for other regularization purposes may also be used instead of or in addition to the center-of-loudness term, the constraint terms, and the like, as given above.
The audio object cost function in expression (1) may be represented in a matrix notation as follows:
E(g)=gTA′g+BTg+C, (5)
where A′ represents a matrix including matrix elements/components denoted as Aij′, B represents a vector including vector elements/components denoted as Bi, and C represents a constant, as follows:
A
ij
′=[r
s
2+·−·(+)]+αdistance(−)2δij+αsum-to-one (6)
Bi=−2αsum-to-one (7)
C=αsum-to-one (8)
The above expression may also be rewritten as follows:
E(g)=1/2gTAg+BTg+C (9)
where A represents a symmetric matrix that can be derived from the matrix A′ and the transpose of A′T as follows:
A=A′+A′
T (10)
From expression (5) above, a derivative ∇E(g) (or a gradient in a search space formed by gains) of the audio object cost function E( . . . |g) can be obtained with respect to g as follows:
∇E(g)=Ag+B (11)
In some embodiments, the adaptive audio playback system may use an inverse-matrix method to determine optimized values of the gains as follows:
Ag+B=0→g=−A−1B (12)
A center of loudness, CL, of the audio sources for the purpose of representing the audio object can be defined as the weighted sum of the spatial positions of the audio sources as weighted by respective gains of the audio sources as follows:
CL=Σ
i
g
i
/Σigi (13)
In many operational scenarios, the center of loudness of the audio sources for the purpose of representing the audio object does not always lie inside the convex hull of the audio sources. For example, (e.g., all) speakers in the specific playback environment that constitute audio sources may be located in a relatively small region of a room. It may not be possible to obtain a center of loudness to match a spatial position of the audio object outside that small region, unless negative gains are used. Accordingly, the inverse-matrix method as represented by expression (12) may lead to nonnegative gains as well as negative gains for audio sources (or negative speaker gains).
As used herein, an audio source that uses a positive gain in rendering an audio object tends to pull the audio object spatially close to the audio source. In contrast, an audio source that uses a negative gain in rendering an audio object tends to push the audio object spatially away from the audio source. Negative gains may cause audible artifacts, spatial distortions, instability, and other similarly undesirable effects in rendering audio objects.
If these negative gains are set to zero, discontinuity may be observed on the border of the convex hull formed by the audio sources. For example, sound signals generated by audio sources (or audio speakers) have drop-ins and outs each time when the audio object crosses the convex hull, introducing audible artifacts and spatial distortions.
In some example embodiments, instead of or in addition to using the inverse-matrix method, the adaptive audio playback system may use a multiplicative-update method to determine optimized values of the gains and to enforce a non-negativity constraint in optimized values computed for gains of audio sources. Under this approach, current values of the gains are obtained by iteratively updating previous values of the gains (which were also ensured to be nonnegative) with a nonnegative multiplier. For the purpose of illustration only, the current values of the gains may be derived from the previous values of the gains with a nonnegative multiplier as follows:
g←1/2g.(√{square root over (B.B+4([A]+g).([A]−g))}−B).([A]+g)−1 (14)
where a positive component [A]+ and a negative component [A]− of a matrix A are respectively defined as follows:
Updating gain values (or values of the gains) through an update factor that is a positive multiplier ensures non-negativity in the optimization process of the values of the gains, provided that initial values of the gains are not negative.
The update factor, as represented by expression (14), can be further simplified as follows:
g.{[∇E(g)]−/[∇E(g)]+}α, (17)
where typically 1≦α≦2; [∇E(g)]+ and [∇E(g)]− are both nonnegative, and are related in ∇E(g) as follows:
∇E(g)=[∇E(g)]+−[∇E(g)]− (18)
[∇E(g)]+=[A]+g and [∇E(g)]−=−B−[A]−g (19)
In some embodiments, the matrix A (e.g., related to the audio object cost function E(g) in expression (5), etc.) is positive definite; the audio object cost function E(g) in expression (5) is bounded below (e.g., greater than or equal to zero since all terms in expression (5) are nonnegative, etc.) and the optimization of the audio object cost function E(g) is convergent. It is worth noting that while A may be diagonalizable and positive definite, the gains obtained under the inverse-matrix method in expression (12) are not necessarily positive. In contrast, gains obtained under a multiplicative-update method as described herein such as in expressions (14) and (17) remain positive provided the initial values of the gains are positive. In some embodiments, gains obtained under a multiplicative-update method as described herein such as in expressions (14) and (17) remain zero provided the initial values of the gains are zero.
In some discussion herein, it has been described that the multiplicative-update method can be applied to a cost function as given in expression (5). This is for the purpose of illustration only. It should be noted that in various embodiments the multiplicative-update method can also be applied to any of a wide variety of cost or objective functions including but not limited to only the examples given above.
In some other embodiments, the adaptive audio playback system may use an alternate method for optimization to determine optimized values of the gains and to enforce a non-negativity constraint in optimized gain values, such as using a quadratic programming framework with non-negative constraints or a general optimization method, such as IPOPT, which guarantees minimizing a cost function such as expression (1) subject to the constraint gi≧0 for all values of i.
Thus, under techniques as described herein, panning of audio objects is determined by solving a minimization/optimization problem with a method that constraints all gain values of audio sources to be non-negative. In some embodiments, two general steps can be used to achieve a final solution to the minimization/optimization problem.
In a first step, a set of initial gain values (or seed gain values) is assigned, determined, and/or calculated. In some cases, the seed gain values are close to the final solution; in some other cases the seed gain values are to be non-negative; in yet other cases there may be no strict requirements for the seed gain values. In various embodiments, the set of initial gain values can be computed via matrix inversion, an iterative method, or even sometimes with a trivial initialization (all gains equal), among others.
In a second step, the constrained minimization/optimization problem can be solved with a multiplicative method, with a quadratic programming (QP) method, an IPOPT method or whatever other method, starting from the seed gain values to compute the non-negative gain values.
In embodiments in which the multiplicative method is used (e.g., to solve the constrained minimization/optimization problem), the two steps above may be particularized as follows. In the first step, the initial gain values (or the seed gain values) can be set reasonably close to the final solution. In an example, gain values (e.g., the initial gain values) for all active loudspeaker should be strictly positive. In some embodiments, gain values from an inverse matrix based solution can be taken with all gains clipped from below a threshold to a small positive value (e.g., a negligible positive value below the threshold). In the second step, these positive gain values (not necessarily optimized) can be optimized to gain values of the final solution through iterative minimization, for example, in accordance with the multiplicative equations/expressions as described herein that ensure non-negative updates of gain values between successive iterations.
In some example embodiments, the gain optimizer (100), which may be implemented by one or more computing devices, includes an audio object cost function generator 102, a gain value initializer 104, and a multiplicative updater 106,
In some example embodiments, the audio object cost function generator (102) includes software, hardware, a combination of software and hardware, and the like, configured to receive source configuration data that specifies or defines a specific audio source layout in a specific playback environment. The source configuration data may include but is not necessarily limited to only, any, some or all of: (source) spatial positions of a plurality of audio sources in the specific audio source layouts, room configuration, reference locations, coordinate system information, and the like.
In some embodiments, the audio object cost function generator (102) is configured to receive object configuration data for the audio object, which may be one of one or more audio objects that are to be (e.g., concurrently, serially, partly concurrently, partly serially, etc.) rendered by the plurality of audio sources. As used herein, object configuration data for an audio object includes or specifies one or more spatial positions (which form the spatial trajectory) of the audio object as a function of time, as a time-indexed table, as a time-dependent array, as a time-dependent sequence, etc.
In some embodiments, based on some or all of the source configuration data, the object configuration data and the room configuration, the audio object cost function generator (102) generates a spatial representation of the audio sources and the audio object in the specific playback environment. The audio object cost function generator (102) uses the spatial representation of the audio sources and the audio object in the specific playback environment to generate audio object cost functions (e.g., expression (5), etc.) to be used to determine optimized values for individual gains of the audio sources at each of the spatial positions representing the spatial trajectory of the audio object. For example, based on the source spatial positions of the audio sources, a spatial position of the audio object, etc., in the spatial representation, the audio object cost function generator (102) generates an audio object cost function for that spatial position of the audio object.
In some example embodiments, the gain value initializer (104) comprises software, hardware, a combination of software and hardware, etc., configured to generate initial values (e.g., denoted as “initial gains” in
In some example embodiments, the multiplicative updater (106) includes software, hardware, a combination of software and hardware, and the like, configured to iteratively generate an update factor (e.g., expression (14), and/or expression (17)) from the audio object cost function that is generated by the audio object cost function generator (102) for the spatial position of the audio object. The update factor may include one or more multiplicative factors, zero or more offset factors, etc. The multiplicative updater (106) uses the update factor to derive current values of the gains for the audio sources for the spatial position of the audio object from previous values of the gains for the audio sources for the same spatial position of the audio object, until converged (or optimized) values of the gains for the audio sources for the spatial position of the audio object are obtained. The converged values of the gains are reached, provided that one or more convergent criteria (e.g., differences in gain values between two successive updates become smaller than convergence thresholds (e.g., present_convergence_threshold in TABLE 1), etc.) are satisfied. The multiplicative updater (106) then outputs the converged values (denoted as “gains” in
An example implementation of the multiplicative-update method is shown in TABLE 1 as follows:
In an example embodiment, the update factor is a positive multiplier. For example, the audio object cost function generator (102) may generate a gradient ∇E(g) (denoted as “the derivative of the criterion” in
Since the spatial trajectory of the audio object may include a plurality of (e.g., discrete) spatial positions at a plurality of time points, some or all of the operations as described above (e.g., the audio object cost function generation, the gain value initialization, the gain value updates, etc.) may be repeated for any, some or all of these spatial positions of the audio object. In some embodiments, the initial gains are set for each spatial position of the audio object. In some embodiments, the initial gains are set for each group (e.g., every two adjacent spatial positions, every three adjacent spatial positions, etc.) of spatial positions of the audio object. In some embodiments, the initial gains are set only for an initial spatial position. Once the optimized values of the gains for the initial spatial position of the audio object are obtained through the convergence process, the optimized values for the initial spatial position of the audio object may be used as initial values of the gains for the spatial position of the audio object immediately following the initial spatial position of the audio object. Similarly, optimized values for a non-initial spatial position of the audio object may be used as initial values of the gains for the spatial position of the audio object immediately following the non-initial spatial position of the audio object, until optimized values for all spatial positions in the plurality of spatial positions of the audio object are computed.
In some embodiments, the specific playback environment is populated by a plurality of (e.g., discrete) precomputed spatial positions—at which an audio object to be rendered by the adaptive audio playback system may or may not be located. In various embodiments, the plurality of precomputed (object) spatial positions may be distributed in the specific playback environment uniformly or non-uniformly. In some embodiments, more spatial positions may be placed or distributed in certain portions of the specific playback environment than in other portions of the same environment. Additionally, optionally, or alternatively, the plurality of precomputed spatial positions may be distributed in the specific playback environment regularly or irregularly.
By way of example but not limitation, the specific playback environment may be represented by a three-dimensional (3D) rectangular room of
In some embodiments, the precomputation of the plurality of sets of optimized values of gains for the plurality of precomputed (object) spatial positions in the offline processing is only calculated once, given the specific audio source layout in the specific playback environment. Each set of optimized values of gains in the plurality of sets of optimized values of gains may correspond to a respective precomputed spatial position in the plurality of precomputed spatial positions. More specifically, a set of optimized values of gains (for the audio sources), which corresponds to a respective precomputed spatial position, is precomputed in the offline processing for the respective precomputed spatial position as if an audio object is located at the respective precomputed spatial position.
In some embodiments, the adaptive audio playback system stores the plurality of sets of gains precomputed in the offline processing at the plurality of precomputed spatial positions (denoted as “discrete object positions” in
In online processing when the adaptive audio playback system is to use the audio sources in the specific playback environment to reproduce or render an actual audio object in the specific playback environment, to reduce computational complexity of the online processing, gain values for actual spatial positions of the actual audio object may be obtained through interpolation based on the optimized values of gains precomputed in the offline processing. More specifically, optimized values of gains for actual spatial positions of the actual audio object may be computed by the interpolation operator (110) through interpolating the optimized values of gains that were precomputed and stored in memory (e.g., in the look-up table, etc.) in the offline processing based on the actual spatial positions of the actual audio object.
In the present example of the grid or lattice as illustrated in
In some embodiments, interpolation can be applied to the precomputed values of gains without first performing other operations such as normalization, gating, expanding, clipping, etc. In some embodiments, these other operations may be applied after the interpolation.
In some embodiments, during offline processing, for each precomputed spatial position in a plurality of (e.g., discrete) precomputed spatial positions that are populated in a specific playback environment, the CMAP gain value initializer (104-1) generates optimized gain values for that precomputed spatial position based at least in part on the CMAP paradigm and uses the optimized gain values as (optimized) initial values of the gains of the audio sources as if an audio object is located at that precomputed spatial position. These initial values of gains generated by the CMAP gain value initializer (104-1) for each spatial position may be used to deactivate audio sources (e.g.., with negative initial gain values, with negative and zero initial gain values, with initial gain values below a gain value threshold, etc.). The remaining initial gains for the remaining audio sources (or activated audio sources) are refined for the precomputed spatial position by the multiplicative updater (106) until reaching convergence. Converged values (or optimized values) of gains for activated audio sources at each such precomputed spatial position in the plurality of precomputed spatial positions are stored into the sparse storage (108). In some embodiments, the adaptive audio playback system may select, from one or more different sparseness settings, a sparseness setting for populating precomputed spatial positions in the specific playback environment. The sparseness setting may include the total number of precomputed spatial positions, possibly same or different densities of precomputed spatial positions in different portions of a spatial volume represented by the specific playback environment, etc.
Given an actual spatial position of an actual audio object in online processing, an interpolation such as a trilinear interpolation, or the like, can be applied by the interpolation operator (110), which uses optimized values of gains at the neighboring precomputed spatial positions—for example, one or more precomputed spatial positions that are closest to the actual spatial position of the actual object—to derive approximate values of gains (for the audio sources) for reproducing or rendering the actual audio object at the actual spatial position.
Consumer devices, such as televisions, audio-video receivers (AVRs), mobile devices, and the like generally have rigorous memory and/or computation limitations. For example, the audio processing capabilities, disk storage space limitations, and the like, of a home theater system will generally not be on par with those of a cinema sound system. Accordingly, some implementations may need to use relatively small amounts of memory, as such some implementations may need to have relatively low computational complexity. Hence, different usage scenarios and applications may need different balances and tradeoffs between memory footprint and computational power (e.g., in terms of computational cost, etc.).
Various tradeoffs between computational load and memory space can be made under techniques as described herein.
In offline processing, the adaptive audio playback system can select or use a specific sparseness setting, from different sparseness settings, for a sparseness storage. The selection of the specific sparseness setting from the different sparseness settings can be based on one or more selection criteria including but not limited to, available memory space, computational power, an upper bound (e.g., 200 milliseconds, 50 milliseconds, 10 milliseconds, 5 milliseconds, 3 milliseconds, 1 millisecond or less, etc.) for online processing convergence time, and the like. The specific sparseness setting determines how the specific playback environment is populated by a plurality of (e.g., discrete) precomputed spatial positions.
The gain optimizer (e.g., 100 of
In some embodiments, the adaptive audio playback system stores the plurality of sets of gains precomputed in the offline processing at the plurality of precomputed spatial positions in the sparse storage (108), for example, in the form of a look-up table.
In online processing, the adaptive audio playback system is to use the audio sources in the specific playback environment to reproduce or render an actual audio object in the specific playback environment. To reduce computational complexity of the online processing, initial values of gains to reproduce or render the actual audio object at an actual spatial position may be obtained by the interpolation operator (110) through interpolation based on the optimized values of gains precomputed in the offline processing. More specifically, given the actual spatial position of the actual audio object in the online processing, an interpolation such as a trilinear interpolation, etc., can be applied by the interpolation operator (110), which uses optimized values of gains at the neighboring vertices. For example, one or more neighboring precomputed spatial positions that are closest to the actual spatial position of the actual object—of the lattices to derive initial values of gains (for the audio sources) for reproducing or rendering the actual audio object at the actual spatial position.
In some embodiments, the online audio object cost function generator (102-1) comprises software, hardware, a combination of software and hardware, and the like, configured to receive source configuration data for the specific playback environment, object configuration data for the actual audio object, which may be one of one or more audio objects that are to be (e.g., concurrently, serially, partly concurrently, partly serially, etc.) rendered by the audio sources.
In some embodiments, based on some or all of the source configuration data, the object configuration data and the room configuration, the online audio object cost function generator (102-1) generates a spatial representation of the audio sources and the actual audio object in the specific playback environment. The online audio object cost function generator (102-1) uses the spatial representation of the audio sources and the actual audio object in the specific playback environment to generate audio object cost functions (e.g., expression (5), etc.). For example, based on source spatial positions of the audio sources, an actual spatial position of the audio object, and the like, in the spatial representation, the online audio object cost function generator (102-1) generates an audio object cost function for the actual spatial position of the actual audio object.
In some embodiments, the online multiplicative updater (106-1) includes software, hardware, a combination of software and hardware, and the like, configured to iteratively generate or determine an update factor (e.g., expression (14) or expression (17)) from the audio object cost function that is generated by the online audio object cost function generator (102-1) for the actual spatial position (e.g., the initial spatial position) of the actual audio object. The multiplicative updater (106-1) uses the update factor to derive current values of the gains for the audio sources for the actual spatial position of the actual audio object from previous values of the gains for the audio sources for the same actual spatial position of the actual audio object, until converged (or optimized) values of the gains for the audio sources for the actual spatial position of the actual audio object are obtained. The multiplicative updater (106) then outputs the converged values (denoted as “gains” in
As illustrated in
Conversely, if the specific sparseness setting corresponds to a relatively small number of precomputed spatial positions populated (or a higher lattice density) in the specific playback environment, dispersion gets larger; accordingly, (e.g., linearly) interpolated gain values becomes less accurate. In the embodiments where the interpolated gain values are further refined or optimized (e.g., by a multiplicative update method, etc.) as illustrated in
Many operational scenarios may involve some irregular surround set-ups in which all audio speakers are in small or irregular regions of spatial volumes of playback environments. Since the center of (speaker) loudness is inside the convex hull of the audio speakers, it may not be possible to obtain a center of speaker loudness to match an audio object in out-of-hull regions, unless negative gains are used. While it is possible to obtain final nonnegative gains by post-processing of gains such as zeroing negative gains at the end of optimization, the result represented by the nonnegative gains after zeroing negative gains is an incomplete solution to optimization and does not represent an optimized solution for the given speaker set-up; these final nonnegative gains are no longer optimized values of gains.
Techniques as described herein can be implement to support out-of-hull optimization of gain values. The out-of-hull optimization refers to a determination of optimized values of gains for audio sources (e.g., in an adaptive source layout, etc.) to reproduce or render an audio object that is located out of the convex hull formed by the audio sources.
In some embodiments, a playback environment may include a plurality of audio sources (or audio speakers). Each audio speaker in the plurality of audio speakers is located in a respective spatial position in a plurality of (e.g., discrete) source spatial positions in the playback environment.
Under adaptive source layout techniques as described herein, an adaptive audio playback system may activate a first subset of selected audio sources in the plurality of audio sources for reproducing or rendering an audio object at a first spatial position of a spatial trajectory of the audio object. The adaptive audio playback system may activate a second subset of selected audio sources in the plurality of audio sources for reproducing or rendering the audio object at a second spatial position of the spatial trajectory of the audio object. The first subset of selected audio sources and the second subset of selected audio sources may or may not have an identical composition of audio sources in the specific playback environment.
Similarly, under the adaptive source layout techniques as described herein, an adaptive audio playback system may activate a first subset of selected audio sources in the plurality of audio sources for reproducing or rendering a first audio object at a first spatial position of a first spatial trajectory of the first audio object. The adaptive audio playback system may activate a second subset of selected audio sources in the plurality of audio sources for reproducing or rendering a second audio object at a second spatial position of a second spatial trajectory of the second audio object. The first subset of selected audio sources and the second subset of selected audio sources may or may not have an identical composition of audio sources in the specific playback environment. Additionally, optionally, or alternatively, the first and second audio objects may be (e.g., in entirety, in part, etc.) concurrently rendered by the first subset of selected audio sources and the second subset of selected audio sources in the specific playback environment.
In some embodiments, some media applications (e.g., audio applications, audiovisual applications, etc.) may need activating fewer audio sources (e.g., firing fewer audio speakers) than what available in a given audio source layout in a specific playback environment. The activation of fewer than available audio sources can be used to reduce potentials or probabilities of spatial combing due to excessive phantom imaging, to comply with specific regularizations in spatial coding, to meet artistic intent such as zone-masking, etc.
Using the adaptive source layout techniques as described herein, an adaptive audio playback system may activate a first subset of selected audio sources in the plurality of audio sources in a first media application. The adaptive audio playback system may activate a second subset of selected audio sources in the plurality of audio sources in a second different media application. The first subset of selected audio sources and the second subset of selected audio sources may or may not have an identical composition of audio sources.
Additionally, optionally, or alternatively, an adaptive audio playback system may activate a first subset of selected audio sources in the plurality of audio sources for creating a first audio effect in compliance with artistic intent. The adaptive audio playback system may activate a second subset of selected audio sources in the plurality of audio sources in a second different audio effect in compliance with artistic intent. The first subset of selected audio sources and the second subset of selected audio sources may or may not have an identical composition of audio sources in the specific playback environment.
From an implementation point of view, relatively high computational cost may be associated with a high number of non-zero gains due to audio mixing operations in connection with a high number of audio sources that correspond to the high number of the non-zero gains. An adaptive audio playback system as described herein can tune or select a rendering method to fire “fewer speakers” than what available in a specific playback environment without sacrificing spatial quality. The adaptive audio playback system can apply different criteria to select or force only a subset of audio sources in a plurality of audio sources in a given audio source layout in a specific playback environment to be activated (or fired). Examples of criteria for selecting fewer than available audio sources may include but are not necessarily limited to only, any, some, or all of: distances of audio sources (e.g., relative to an audio object to be reproduced or rendered, etc.), gain rankings (e.g., ranks in initial gain values obtained using a gain computation method that may generate positive and/or negative gain values, etc.), media applications, audio effect types, audio source control information (e.g., as received in audio metadata, etc.), or some other metrics used to differentiate among audio sources/objects/applications/effects.
By way of example but not limitation, in some embodiments, a first gain optimization method (e.g., the inverse-matrix method, a (quadratic programming) QP-based solution that does not enforce nonnegativity gain constraint, a gradient descent method, etc.) that may generate nonnegative as well as negative gain values may be combined with a second gain optimization method (e.g., the multiplicative-update method, a QP-based solution that enforces nonnegativity or positivity gain constraint, an interior point optimizer, a gradient descent method that enforces nonnegativity or positivity gain constraint, etc.) that maintains positivity of updated gain values into an efficient and optimized method for firing fewer audio sources. More specifically, gain values derived by the first gain optimization method may be used as (e.g., optimized) initial gain values. Furthermore, based on the initial gain values obtained with the first gain optimization method, those audio sources with negative initial gain values may (e.g., automatically) become unselected simply by setting each of those negative initial gain values to a special value such as zero or a negligible small gain value (e.g., 0.001, 0.0001, a gain value below a near-zero positive gain value limit, etc.) indicating that audio sources associated with those negative initial gain values are excluded from optimization, before the second gain optimization method is applied to obtain optimized gain values that are nonnegative (e.g., positive, above a positive gain value threshold, etc.). Those audio sources that have not been excluded based on the initial gain values obtained by the first gain optimization method may (e.g., automatically) become selected (or activated) for the optimization of gain values based on the second gain optimization method.
In some embodiments, only audio sources with negative initial gain values are excluded from being optimized in the second gain optimization method and become unselected. In some embodiments, only audio sources with negative and zero initial gain values are excluded from being optimized in the second gain optimization method and become unselected. In some embodiments, only audio sources with initial gain values below a gain value threshold (which may be a positive gain value) are excluded from being optimized in the second gain optimization method and become unselected. Thus, in some embodiments, an audio source with a small positive gain value below an applicable gain value threshold may have its gain value to be reset to zero or a negligible small gain value (e.g., 0.001, 0.0001, a gain value below a near-zero positive gain value limit, etc.) by a gain optimizer as described herein (which may mean that the audio source is relatively far from the audio object to be rendered).
By way of example but not limitation, the irregular 7.1.4 speaker setup (in the present example, the audio source layout 802-2 of
Panning curves (902-1) are generated for all audio sources (or audio speakers) in Configuration-II under the inverse-matrix method. Panning curves (902-2) are generated for selected audio sources (or selected audio speakers) in Configuration-II under a combination of the inverse-matrix method and the multiplicative-update method. Panning curves (902-3) are generated for all audio sources (or audio speakers) in Configuration-I under the inverse-matrix method.
In some embodiments, in Configuration-II, for the purpose of reproducing or rendering the audio object with the diagonal trajectory, only audio sources (or “activatable speakers”) that can deliver nonnegative initial gain values (e.g., based on initial gain values as determined under the inverse-matrix method, etc.) will be engaged or selected in the optimization of gain values, whilst the other speakers (or “unactivatable speakers”) will be automatically excluded from the optimization of gain values. Panning curves (902-2 of
In some embodiments, according to the spatial trajectory of the audio object and spatial positions of audio sources (or source spatial positions) in the audio source layout (802-2 of
For example, a set of panning curves with solid lines in 902-2 of
Techniques as described herein and other approaches give different optimization results with different topologies (or different topological changes) of audio source layouts. For example, in Configuration-II, the audio object is outside the convex hull of the audio sources for the first 100 frames, whereas in Configuration-I (which has the additional speaker at (0, 0, 0)), the audio object is inside the convex hull of the audio sources for the first 100 frames. As can be seen from
In contrast, under techniques as described herein, initialization is performed with a gain optimization method that generates nonnegative as well as negative optimized gain values for activating/deactivating audio sources, and further optimization of selected audio sources is performed with a second gain optimization method that maintains nonnegativity of updated gain values. The approach under these techniques manages to produce globally optimized gains and avoid spatial distortion during rendering as shown in panning curves (902-2) of
By comparing panning curves (902-2) and panning curves (902-3), it can be seen that panning curves (902-2) are relatively consistent with panning curves (902-3) that represent optimization by the inverse-matrix method after Configuration-II is changed to Configuration-I by placing the additional audio speaker at (0, 0, 0). In other words, the optimization results, or the panning curves, for Configuration-II with the selected audio sources under the techniques as described herein are consistent with the optimization results, or the panning curves, for Configuration-I with the additional audio source added at the source spatial position (0, 0, 0).
In addition, when continuing disabling more speakers or selecting fewer speakers, the optimization result under the techniques as described herein changes in a consistent way. For example, when the right speaker at (1, 0, 0) in the audio source layout (802-1) of
In block 1002, the adaptive audio playback system determines a reference audio source layout available in a specific playback environment. The adaptive audio playback system uses the reference audio source layout for initializing gain values and/or for performing offline processing to generate precomputed gain values for precomputed (object) spatial locations in the specific playback environment.
The reference audio source layout may or may not represent an actual audio source layout in the specific playback environment. In some embodiments, the reference audio source layout may represent a superset of one or more (e.g., defined, standard, proprietary, etc.) audio source layouts each of which may be used in some specific or general audio playing applications (e.g., cinema, home theater, living room, auditorium, bar, restaurant, amusement park, etc.). By way of example but not limitation, in some embodiments, the reference audio source layout may represent a 7.1.4 speaker layout, which may represent a superset of a 7.1.2 speaker layout, a 7.1 speaker layout, a 5.1.4 speaker layout, a 5.1.2 speaker layout, a 5.1 speaker layout, a stereo speaker layout, etc., each of which may be applicable to a respective set of specific or general media applications (e.g., audio playing applications, etc.).
In some example embodiments, the reference audio source layout may represent a 22.2 speaker layout, which may be a superset or pseudo-superset of other speaker layouts. As used herein, a pseudo-superset may, but is not limited to only, refer to a virtual speaker layout that is not necessarily defined in standards or in proprietary specifications. In some example embodiments, a pseudo-superset may be formed by audio sources in a standard or proprietary defined audio source layout plus or minus certain audio sources, for example, in scenarios that the standard or proprietary defined audio source layout does not include audio source located at certain specific (e.g., irregular, etc.) locations of a specific audio source layout in a specific playback environment. In some embodiments, lattice points may be populated in the specific playback environment as source spatial positions for audio sources included in a pseudo-superset.
In block 1004, for one or more spatial positions of an audio object, the adaptive audio playback system links an adaptive audio source layout to the reference audio source layout by identifying which audio sources in the reference audio source are to be deactivated from being used as selected audio sources to reproduce or render the audio object at the one or more spatial positions. This may be done with a first gain optimization method that generates nonnegative and/or negative gain values as initial gain values for audio sources in the reference audio source layout, as if all the audio sources in the reference audio source layout are to be used to reproduce or render the audio object at the one or more spatial positions.
In an example embodiment, the first gain optimization method that generates the nonnegative and/or negative initial gain values may be, but is not limited to only, the inverse-matrix method as represented in expression (12).
In some embodiments, audio sources that have negative (optimized) initial gain values as derived from the first gain optimization method are deactivated from being used to reproduce or render the audio object at the one or more spatial positions. In some embodiments, audio sources that have negative and zero initial gain values are deactivated from being used to reproduce or render the audio object at the one or more spatial positions. In some embodiments, audio sources that have initial gain values below a gain value threshold are deactivated from being used to reproduce or render the audio object at the one or more spatial positions.
The deactivated audio sources in the reference audio source layout are excluded from further optimization for reproducing or rendering the audio object at the one or more spatial positions. These deactivated audio sources could be used to reproduce or render the audio object in one or more other spatial positions. These deactivated audio sources could also be used to reproduce or render one or more different audio objects.
In block 1006, for the one or more spatial positions of the audio object, the adaptive audio playback system applies a second gain optimization method such as the multiplicative-update method that maintains nonnegativity (e.g., positivity, etc.) of gain values to converge the initial gain values for activated audio sources in the adaptive audio source layout (or audio sources in the reference audio source layout that have not been deactivated in block 1004) into optimized gain values to reproduce or render the audio object at the one or more spatial positions by the activated audio sources (which represents a set audio sources that form an adaptive source layout).
In some embodiments, additional processing such as interpolation, etc., can be performed in conjunction with some or all of the operations as described herein. In an example, in connection with or as a part of operations in block 1004, interpolation between source spatial positions of audio sources defined in the reference audio source layout and source spatial positions of actual audio sources in the actual audio source layout may be performed to adapt (optimized) initial gain values obtained with the reference audio source layout into initial gain values for the audio sources of the actual audio source layout in the specific playback environment. The interpolated initial gain values may be used deactivate audio sources in the actual audio source layout that have disqualifying initial gain values (e.g., negative interpolated initial gain values, etc.). The remaining audio sources in the actual audio source layout with interpolated initial gain values may be used for further optimization.
In another example embodiment, in connection with or as a part of operations in block 1006, interpolation between source spatial positions of activated (e.g., with positive gain) audio sources defined in the reference audio source layout and source spatial positions of actual audio sources in the actual audio source layout may be performed to adapt optimized gain values obtained with the activated audio sources of the reference audio source layout into approximate gain values for the audio sources of an actual audio source layout in the specific playback environment. Further optimization, for example using the second gain optimization method as mentioned above, may be performed on the approximate gain values (or interpolated gain values) to generate final optimized gain values for the audio sources of the actual source layout in the specific audio playback to reproduce or render the audio object at the one or more spatial positions.
Under other approaches that do not implement the techniques as described herein, an optimization method may need to be re-implemented or specifically ported (with device specific functionality that is tied to specific system configuration) many times on different platforms, and may need to involve complicated and customized distributed processing across multiple processors. As a result, the optimizations implemented under the other approaches often have to run in stringent, specialized system configurations and cannot be efficiently applied or adapted to a wide variety of playback environments, audio source layouts, systems, applications, etc.
By way of comparison, a number of benefits can be obtained under techniques as described herein. For example, an iterative gain optimization method such as nonnegative multiplicative updates can be implemented in a wide variety of playback environments, audio source layouts, systems, applications, etc. The iterative gain optimization method may be implemented with fewer or no tunable parameters or ad hoc heuristics to ensure convergence. In addition, the iterative gain optimization method can be implemented to provide a guarantee of monotonic convergence, as the updates of the iterative gain optimization can be implemented to decrease the numeric value (representing the cost) of the audio object cost function at each iteration.
Techniques as described herein can also be used to eliminate undesirable features of generating negative gains and sub-optimal approximations ab initio before actual optimization of activated audio sources in a specific playback environment rather than simply zeroing negative gains at the end of optimization as in other approaches. The techniques as described herein are also computationally efficient and can be implemented in an audio playback system that has relatively stringent computational resources.
Many computing processors such as fixed-point processors are to some extent inefficient at “division-shaped” problems such as those performed in the inverse-matrix method. In addition, division-shaped problems may create scalability issues. Matrix inversion operations may involve re-estimating all multiple elements in a gain vector in parallel, as opposed to simply performing coordinate descent under an iterative method. For vectors or matrixes of large dimensionality, the matrix inversion operations may be prohibitively expensive in terms of CPU costs and memory usages.
In contrast, under the techniques as described herein, most if not all gain computation can be performed in an iterative method that involves few or no computing divisions. Iterative multiplicative operations can be performed relatively efficiently with a variety of type of computing processors including but not necessarily limited to only fixed-point processors.
Techniques as described herein further allow flexibilities in several aspects. Tradeoffs can be made between memory space and computational complexity. Gain computation as described herein can operate with a relatively small memory space and a relatively large number of computations. Gain computation can also operate with a relatively large memory space and a relatively small number of computations. Distributions of precomputed spatial positions in a playback environment for generating precomputed gain values can be controlled flexibly by sparseness settings. In addition, optimization of gain values can be generated with adaptive source layouts adapted from a reference audio source layout that may or may not be an actual audio source layout in a specific playback environment, a superset or pseudo-superset that may or may not be based on standards or proprietary specifications, etc.
In some example embodiments, initial gain values may be individually determined for each spatial position in a plurality of spatial positions that represent a spatial trajectory of an audio object, for example, using a gain optimization method (e.g., one that generates nonnegative and/or negative gain values, etc.) for reproducing or rendering the audio object at that spatial position.
More specifically, initial gain values may be determined for a first spatial position of one or more spatial positions in a plurality of spatial positions that represent a spatial trajectory of an audio object, for example, using a gain optimization method (e.g., one that generates nonnegative and/or negative gain values, etc.) for reproducing or rendering the audio object at the one or more spatial positions. Initial gain values for another spatial position of the one or more spatial positions may use optimized gain values of a spatial position (e.g., the first spatial position) that is spatially or time-wise before the other spatial position. This may be used to ensure the same set of audio sources is (e.g., stably, smoothly, continuously, etc.) activated for all of the one or more spatial positions in these embodiments.
As described herein, a spatial position of an audio object may be associated with, or correspond to, one or more audio frames or a subdivision (e.g., one or more audio data blocks, one or more audio samples, etc.) of a single audio frame. In an example, a set of activated audio sources used to reproduce or render an audio object at a spatial position may mean that the set of activated audio sources are used to reproduce or render the audio object represented in one or more specific audio frames. In another example, a set of activated audio sources used to reproduce or render an audio object at a spatial position may mean that the set of activated audio sources are used to reproduce or render the audio object represented in one or more specific audio data blocks of a specific audio frame. In yet another example, a set of activated audio sources used to reproduce or render an audio object at a spatial position may mean that the set of activated audio sources are used to reproduce or render the audio object represented in one or more audio samples in a specific audio data block of a specific audio frame. Embodiments may include these and other variations of what portion of audio content a spatial position of an audio object may correspond to.
In some application scenarios such as those related to AVRs, both memory and computation resources could be severely limited. An adaptive audio playback system may be implemented with a system configuration such as illustrated in
It may be noted that in expression (4) the value of αsum-to-one, relative to the range of values of spatial positions of an audio object and source spatial positions (or spatial positions of audio sources), could have a relatively significant effect on the speed of convergence.
Assuming that the objective constraint Σigi=1 is satisfied, from expressions (6) through (12) it can be seen that if αsum-to-one is numerically large, such that it dominates the other terms in (6), then [∇E(g)]−≈[∇E(g)]+≈2αsum-to-one. As a result, the update rule in expression (17) becomes approximately as follows:
g←g. (20)
In other words, convergence would require potentially infinitely many iterations. Thus, to achieve fast convergence, αsum-to-one may be kept to a small value, relative to the magnitude of the other terms in (6). In some embodiments, a value of αsum-to-one=0.01 or some other small values (e.g., 0.02, etc.) may be used.
In some discussion herein, an audio object has been described to be located at a specific spatial position. This is for the purpose of illustration only. In various embodiments, an audio object as described herein may or may not have a single spatial position at any given time. For example, an audio object may not be a single point, but rather may be of a non-zero spatial size (e.g., a volume or planar size, etc.) that corresponds to more than one spatial location. In some embodiments, a spatial location of an audio object may represent a center of loudness, a point of symmetry, and the like, of the audio object that may be of a non-zero spatial size. In some embodiments, an audio object that is of a non-zero spatial size may be represented spatially as an integration of many small component audio objects that are approximated as spatial points with zero or infinitesimally small spatial sizes.
In block 1102, the audio playback system receives an audio object comprising audio content and object metadata, the object metadata of the audio object indicating an object spatial position of the audio object to be rendered by a plurality of audio speakers in a playback environment, each audio speaker in the plurality of audio speakers being located in a respective source spatial position in a plurality of source spatial positions in the playback environment.
In block 1104, the audio playback system determines, based on the object spatial position of the audio object and the plurality of source spatial positions of the plurality of audio speakers, a plurality of initial gain values for the plurality of audio speakers, each audio speaker in the plurality of audio speakers being assigned with a respective initial gain value in the plurality of initial gain values.
In block 1106, the audio playback system determines, based on the object spatial position of the audio object and a set of source spatial positions at which the set of audio speakers are respectively located in the playback environment, a set of optimized gain values for the set of audio speakers.
In block 1108, the audio playback system causes the audio object at the object spatial position to be rendered with the set of optimized gain values for the set of audio speakers, each audio speaker in the set of audio speakers being assigned with a respective optimized gain value in the plurality of optimized gain values.
In an embodiment, the audio playback system uses one or more negative initial gain values among the plurality of initial gain values to deactivate one or more corresponding audio sources, in the plurality of audio sources in the playback environment, from taking part in rendering the audio object located at the object spatial position.
In an embodiment, the audio playback system uses one or more zero and negative initial gain values among the plurality of initial gain values to deactivate one or more corresponding audio sources, in the plurality of audio sources in the playback environment, from taking part in rendering the audio object located at the object spatial position.
In an embodiment, the audio playback system uses one or more initial gain values below a gain value threshold among the plurality of initial gain values to deactivate one or more corresponding audio sources, in the plurality of audio sources in the playback environment, from taking part in rendering the audio object located at the object spatial position.
In an embodiment, the plurality of initial gain values is generated by a first gain optimizer that generates nonnegative optimized gain values and negative optimized gain values; the set of initial gain values is generated by a second different gain optimizer that maintains nonnegativity of nonnegative optimized gain values.
In an example embodiment, the first gain optimizer represents one of an inverse-matrix gain optimizer, a gain optimizer that does not preclude negative gain values, and the like.
In an example embodiment, the second gain optimizer represents one of a multiplicative-update gain optimizer, an interior point optimizer, a quadratic-programming gain optimizer, a gradient descent gain optimizer, a gain optimizer that maintains nonnegativity of nonnegative optimized gain values, and the like.
In an embodiment, the object spatial position represents a spatial position in a spatial trajectory of the audio object.
In an embodiment, the object spatial position is related to audio content in one of one or more audio frames, one or more subdivision of an audio frame, etc.
In an embodiment, the plurality of initial gain values for the plurality of audio speakers are at least in part derived through interpolating precomputed optimized gain values for the plurality of audio speakers in the playback environment.
In an embodiment, the precomputed optimized gain values are a part of a plurality of sets of precomputed optimized gain values for a plurality of precomputed object spatial positions in the playback environment. In an embodiment, the plurality of precomputed object spatial positions in the playback environment is determined based on a specific sparseness setting.
In an embodiment, the precomputed optimized gain values are precomputed and stored in a lookup table in offline processing.
In an embodiment, the audio playback system performs: while in offline processing:
selecting, based on one or more selection criteria, a specific sparseness setting from among a plurality of selectable sparseness settings, the specific sparseness setting determining a plurality of precomputed spatial positions in the playback environment; generating a plurality of sets of precomputed optimized gain values for the plurality of precomputed spatial positions, each set of precomputed optimized gain values in the plurality of sets of precomputed optimized gain values corresponding to a respective precomputed spatial position in the plurality of precomputed spatial positions; while in online processing: deriving the plurality of initial gain values for the plurality of audio speakers at least in part from interpolated gain values from the plurality of sets of precomputed optimized gain values.
In an embodiment, the audio playback system, while in the online processing: performs optimization of the interpolated gain values to determine the plurality of initial gain values for the plurality of audio speakers.
In an embodiment, the plurality of initial gain values for the plurality of audio speakers are directly set to the interpolated gain values in the online processing.
Embodiments include a media processing system configured to perform any one of the methods as described herein.
Embodiments include an apparatus including a processor and configured to perform any one of the foregoing methods.
Embodiments include a non-transitory computer readable storage medium, storing software instructions, which when executed by one or more processors cause performance of any one of the foregoing methods. Note that, although separate embodiments are discussed herein, any combination of embodiments and/or partial embodiments discussed herein may be combined to form further embodiments.
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example,
Computer system 1200 also includes a main memory 1206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 1202 for storing information and instructions to be executed by processor 1204. Main memory 1206 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1204. Such instructions, when stored in non-transitory storage media accessible to processor 1204, render computer system 1200 into a special-purpose machine that is device-specific to perform the operations specified in the instructions.
Computer system 1200 further includes a read only memory (ROM) 1208 or other static storage device coupled to bus 1202 for storing static information and instructions for processor 1204. A storage device 1210, such as a magnetic disk or optical disk, is provided and coupled to bus 1202 for storing information and instructions.
Computer system 1200 may be coupled via bus 1202 to a display 1212, such as a liquid crystal display (LCD), for displaying information to a computer user. An input device 1214, including alphanumeric and other keys, is coupled to bus 1202 for communicating information and command selections to processor 1204. Another type of user input device is cursor control 1216, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1204 and for controlling cursor movement on display 1212. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 1200 may implement the techniques described herein using device-specific hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 1200 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1200 in response to processor 1204 executing one or more sequences of one or more instructions contained in main memory 1206. Such instructions may be read into main memory 1206 from another storage medium, such as storage device 1210. Execution of the sequences of instructions contained in main memory 1206 causes processor 1204 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operation in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 1210. Volatile media includes dynamic memory, such as main memory 1206. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1202. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 1204 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 1200 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 1202. Bus 1202 carries the data to main memory 1206, from which processor 1204 retrieves and executes the instructions. The instructions received by main memory 1206 may optionally be stored on storage device 1210 either before or after execution by processor 1204.
Computer system 1200 also includes a communication interface 1218 coupled to bus 1202. Communication interface 1218 provides a two-way data communication coupling to a network link 1220 that is connected to a local network 1222. For example, communication interface 1218 may be an integrated service digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1218 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1218 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 1220 typically provides data communication through one or more networks to other data devices. For example, network link 1220 may provide a connection through local network 1222 to a host computer 1224 or to data equipment operated by an Internet Service Provider (ISP) 1226. ISP 1226 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 1228. Local network 1222 and Internet 1228 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1220 and through communication interface 1218, which carry the digital data to and from computer system 1200, are example forms of transmission media.
Computer system 1200 can send messages and receive data, including program code, through the network(s), network link 1220 and communication interface 1218. In the Internet example, a server 1230 might transmit a requested code for an application program through Internet 1228, ISP 1226, local network 1222 and communication interface 1218.
The received code may be executed by processor 1204 as it is received, and/or stored in storage device 1210, or other non-volatile storage for later execution.
In the foregoing specification, example embodiments have been described with reference to numerous specific details that may vary from implementation to implementation. Any definitions expressly set forth herein for terms contained in the claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Various modifications and adaptations to the foregoing example embodiments may become apparent to those skilled in the relevant arts in view of the foregoing description, when it is read in conjunction with the accompanying drawings. Any and all modifications will still fall within the scope of the non-limiting and example embodiments. Furthermore, other example embodiment category forth herein will come to mind to one skilled in the art to which these embodiments pertain having the benefit of the teachings presented in the foregoing descriptions and the drawings.
Accordingly, the present invention may be embodied in any of the forms described herein. For example, the following enumerated example embodiments (EEEs) describe some structures, features, and functionalities of some aspects of the present invention.
EEE 1. A computer-implemented method, comprising: receiving an audio object comprising audio content and object metadata, the object metadata of the audio object indicating an object spatial position of the audio object to be rendered by a plurality of audio speakers in a playback environment, each audio speaker in the plurality of audio speakers being located in a respective source spatial position in a plurality of source spatial positions in the playback environment; determining, based on the object spatial position of the audio object and the plurality of source spatial positions of the plurality of audio speakers, a plurality of initial gain values for the plurality of audio speakers, each audio speaker in the plurality of audio speakers being assigned with a respective initial gain value in the plurality of initial gain values; determining, based on the object spatial position of the audio object and a set of source spatial positions at which the set of audio speakers are respectively located in the playback environment, a set of optimized non-negative gain values for the set of audio speakers; causing the audio object at the object spatial position to be rendered with the set of optimized gain values for the set of audio speakers, each audio speaker in the set of audio speakers being assigned with a respective optimized gain value in the plurality of optimized gain values.
EEE 2. The method as recited in EEE 1, further comprising using one or more negative initial gain values among the plurality of initial gain values to deactivate one or more corresponding audio sources, in the plurality of audio sources in the playback environment, from taking part in rendering the audio object located at the object spatial position.
EEE 3. The method as recited in EEE 1, further comprising using one or more zero and negative initial gain values among the plurality of initial gain values to deactivate one or more corresponding audio sources, in the plurality of audio sources in the playback environment, from taking part in rendering the audio object located at the object spatial position.
EEE 4. The method as recited in EEE 1, further comprising using one or more initial gain values below a gain value threshold among the plurality of initial gain values to deactivate one or more corresponding audio sources, in the plurality of audio sources in the playback environment, from taking part in rendering the audio object located at the object spatial position.
EEE 5. The method as recited in EEE 1, wherein the plurality of initial gain values is generated by a first gain optimizer that generates nonnegative optimized gain values and negative optimized gain values; and wherein the set of initial gain values is generated by a second different gain optimizer that maintains nonnegativity of nonnegative optimized gain values and turns negative gain values non-negative.
EEE 6. The method as recited in EEE 5, wherein the first gain optimizer represents one of an inverse-matrix gain optimizer, or a gain optimizer that does not preclude negative gain values.
EEE 7. The method as recited in EEE 5, wherein the second gain optimizer represents one of a multiplicative-update gain optimizer, an interior point optimizer, a quadratic-programming gain optimizer, a gradient descent gain optimizer, or a gain optimizer that maintains nonnegativity of nonnegative optimized gain values and turns negative gain values non-negative.
EEE 8. The method as recited in EEE 1, wherein the object spatial position represents a spatial position in a spatial trajectory of the audio object.
EEE 9. The method as recited in EEE 1, wherein the object spatial position is related to audio content in one of one or more audio frames, or one or more subdivision of an audio frame.
EEE 10. The method as recited in EEE 1, wherein the plurality of initial gain values for the plurality of audio speakers are at least in part derived through interpolating precomputed optimized gain values for the plurality of audio speakers in the playback environment.
EEE 11. The method as recited in EEE 10, wherein the precomputed optimized gain values are a part of a plurality of sets of precomputed optimized gain values for a plurality of precomputed object spatial positions in the playback environment.
EEE 12. The method as recited in EEE 11, wherein the plurality of precomputed object spatial positions in the playback environment is determined based on a specific sparseness setting.
EEE 13. The method as recited in EEE 10, wherein the precomputed optimized gain values are precomputed and stored in a lookup table in offline processing.
EEE 14. The method as recited in EEE 1, further comprising: while in offline processing: selecting, based on one or more selection criteria, a specific sparseness setting from among a plurality of selectable sparseness settings, the specific sparseness setting determining a plurality of precomputed spatial positions in the playback environment; generating a plurality of sets of precomputed optimized gain values for the plurality of precomputed spatial positions, each set of precomputed optimized gain values in the plurality of sets of precomputed optimized gain values corresponding to a respective precomputed spatial position in the plurality of precomputed spatial positions; while in online processing: deriving the plurality of initial gain values for the plurality of audio speakers at least in part from interpolated gain values from the plurality of sets of precomputed optimized gain values.
EEE 15. The method as recited in EEE 14, further comprising: while in the online processing: performing optimization of the interpolated gain values to determine the plurality of initial gain values for the plurality of audio speakers.
EEE 16. The method as recited in EEE 14, wherein the plurality of initial gain values for the plurality of audio speakers are directly set to the interpolated gain values in the online processing.
EEE 17. The method as recited in EEE 1, further comprising using the plurality of initial gain values to select a set of audio speakers from among the plurality of audio speakers.
EEE 18. A media processing system configured to perform any one of the methods recited in EEEs 1-17.
EEE 19. An apparatus comprising a processor and configured to perform any one of the methods recited in EEEs 1-17.
EEE 20. A non-transitory computer readable storage medium, storing software instructions, which when executed by one or more processors cause performance of any one of the methods recited in EEEs 1-17.
It will be appreciated that the embodiments of the invention are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are used herein, they are used in a generic and descriptive sense only, and not for purposes of limitation.
Number | Date | Country | Kind |
---|---|---|---|
16181436.3 | Jul 2016 | EP | regional |
Number | Date | Country | |
---|---|---|---|
62345602 | Jun 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 15451241 | Mar 2017 | US |
Child | 15647121 | US |