The invention relates to systems and methods for upmixing (or otherwise modifying an audio object trajectory determined by) object based audio (i.e., audio data indicative of an object based audio program) to generate modified data (i.e., data indicative of a modified version of the audio program) from which multiple speaker feeds can be generated. In some embodiments, the invention is a system and method for rendering object based audio to generate speaker feeds for driving sets of loudspeakers, including by performing upmixing on the object based audio.
Conventional channel-based audio encoders typically operate under the assumption that each audio program (that is output by the encoder) will be reproduced by an array of loudspeakers in predetermined positions relative to a listener. Each channel of the program is a speaker channel. This type of audio encoding is commonly referred to as channel-based audio encoding.
Another type of audio encoder (known as an object-based audio encoder) implements an alternative type of audio coding known as audio object coding (or object based coding) and operates under the assumption that each audio program (that is output by the encoder) may be rendered for reproduction by any of a large number of different arrays of loudspeakers. Each audio program output by such an encoder is an object based audio program, and typically, each channel of such object based audio program is an object channel. In audio object coding, audio signals associated with distinct sound sources (audio objects) are input to the encoder as separate audio streams. Examples of audio objects include (but are not limited to) a dialog track, a single musical instrument, and a jet aircraft. Each audio object is associated with spatial parameters, which may include (but are not limited to) source position, source width, and source velocity and/or trajectory. The audio objects and associated parameters are encoded for distribution and storage. Final audio object mixing and rendering is performed at the receive end of the audio storage and/or distribution chain, as part of audio program playback. The step of audio object mixing and rendering is typically based on knowledge of actual positions of loudspeakers to be employed to reproduce the program.
Typically, during generation of an object based audio program, the content creator embeds the spatial intent of the mix (e.g., the trajectory of each audio object determined by each object channel of the program) by including metadata in the program. The metadata can be indicative of the position or trajectory of each audio object determined by each object channel of the program, and/or at least one of the size, velocity, type (e.g., dialog or music), and another characteristic of each such object.
During rendering of an object based audio program, each object channel can be rendered (“at” a time-varying position having a desired trajectory) by generating speaker feeds indicative of content of the channel and applying the speaker feeds to a set of loudspeakers (where the physical position of each of the loudspeakers may or may not coincide with the desired position at any instant of time). The speaker feeds for a set of loudspeakers may be indicative of content of multiple object channels (or a single object channel). The rendering system typically generates the speaker feeds to match the exact hardware configuration of a specific reproduction system (e.g., the speaker configuration of a home theater system, where the rendering system is also an element of the home theater system).
In the case that an object based audio program indicates a trajectory of an audio object, the rendering system would typically generate speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived (and which typically will be perceived) as emitting from an audio object having said trajectory. For example, the program may indicate that sound from a musical instrument (an object) should pan from left to right, and the rendering system might generate speaker feeds for driving a 5.1 array of loudspeakers to emit sound that will be perceived as panning from the L (left front) speaker of the array to the C (center front) speaker of the array and then the R (right front) speaker of the array. Herein, “trajectory” of an audio object (indicated by an object based audio program) is used in a broad sense to denote the position or positions (e.g., position as a function of time) from which sound emitted by the object during rendering of the program is intended to be perceived as emitting. Thus, a trajectory could consist of a single, stationary point (or other position), or it could be a sequence of positions, or it could be a point (or other position) which varies as a function of time.
However, until the present invention it had not been known how to render an object based audio program (which is indicative of a trajectory of an audio source) by generating speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived as emitting from the source but with said source having a different trajectory than the one indicated by the program. Typical embodiments of the invention are methods and systems for rendering an object based audio program (which is indicative of a trajectory of an audio source), including by efficiently generating speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived as emitting from the source but with said source having a different trajectory than the one indicated by the program (e.g., with said source having a trajectory in a vertical plane, or a three-dimensional trajectory, where the program indicates the source's trajectory is in a horizontal plane).
There are many conventional methods for rendering audio programs in systems that employ channel-based audio encoding. For example, conventional upmixing techniques could be implemented during rendering of audio programs (comprising speaker channels) which are indicative of sound from sources moving along trajectories within a subspace of a full three-dimensional volume (e.g., trajectories which are along horizontal lines), to generate speaker feeds for driving speakers positioned outside this subspace. Such upmixing techniques are based on phase and amplitude information included in the program to be rendered, whether this information was intentionally coded (in which case the upmixing can be implemented by matrix encoding/decoding with steering) or is naturally contained in the speaker channels of the program (in which case the upmixing is blind upmixing). However, the conventional phase/amplitude-based upmixing techniques which have been applied to audio programs comprising speaker channels are subject to a number of limitations and disadvantages, including the following:
whether the content is matrix encoded or not, they generate a significant amount of crosstalk across speakers;
in the case of blind upmixing, the risk of panning a sound in a non-coherent way with video is greatly increased, and the typical way to lower this risk is to upmix only what appears to be non-directional elements of the program (typically decorrelated elements); and
they often create artifacts either by limiting the steering logic to wide band, often making the sound collapse during reproduction, or by applying a multiband steering logic that creates a spatial smearing of the frequency bands of a unique sound (sometimes referred to as “the gargling effect”).
Even if conventional phase/amplitude-based techniques for upmixing audio programs comprising speaker channels (to generate upmixed programs having more speaker channels than the input programs) were somehow applied to object based audio programs (to generate speaker feeds for more loudspeakers than could be generated from the input programs without the upmixing), this would result in a loss of perceived discreteness (of the audio objects indicated by the upmixed programs) and/or would generate artifacts of the type described above. Thus, systems and related methods are needed for rectifying the deficiencies noted above.
Typical embodiments of the invention are methods for rendering an object based audio program (which is indicative of a trajectory of an audio source), including by generating speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived as emitting from the source, but with the source having a different trajectory than the one indicated by the program (e.g., with the source having a trajectory in a vertical plane or a three-dimensional trajectory, where the program indicates a source trajectory in a horizontal plane). The term “trajectory” of an audio object (indicated by an object based audio program) is used herein in a broad sense to denote the position or positions (e.g., position as a function of time) from which sound emitted by the object during rendering of the program is intended to be perceived as emitting. Thus, a trajectory could consist of a single, stationary position, or it could be a sequence of positions, or it could be a point (or other position) which varies as a function of time.
In some embodiments, the invention is a method for rendering an object based audio program for playback by a set of loudspeakers, where the program is indicative of a trajectory of an audio object, and the trajectory is within a subspace of a full three-dimensional volume (e.g., the trajectory is limited to be in a horizontal plane within the volume, or is a horizontal line within the volume). The method includes the steps of modifying the program to determine a modified program indicative of a modified trajectory of the object (e.g., by modifying coordinates of the program indicative of the trajectory), where at least a portion of the modified trajectory is outside the subspace (e.g., where the trajectory is a horizontal line, the modified trajectory is a path in a vertical plane including the horizontal line); and generating speaker feeds in response to the modified program, such that the speaker feeds include at least one feed for driving at least one speaker in the set whose position corresponds to a position outside the subspace and feeds for driving speakers in the set whose positions correspond to positions within the subspace.
In other embodiments, the inventive method includes a step of modifying an object based audio program indicative of a trajectory of an audio object, to determine a modified program indicative of a modified trajectory of the object, where both the trajectory and the modified trajectory are defined in the same space (i.e., no portion of the modified trajectory extends outside the space in which the trajectory extends). For example, the trajectory may be modified to optimize (or otherwise modify) the timbre of sound emitted in response to speaker feeds determined from the modified program relative to the sound that would be emitted in response to speaker feeds determined from the original program (e.g., in the case that the modified trajectory, but not the original trajectory, determines a single ended “snap to” or “snap toward” a speaker).
Typically, the object based audio program (unless it is modified in accordance with the invention) is capable of being rendered to generate only speaker feeds for driving a subset of the set of loudspeakers (e.g., only those speakers in the set whose positions correspond to the subspace of the full three-dimensional volume). For example, the audio program may be capable of being rendered to generate only speaker feeds for driving the speakers in the set which are positioned in a horizontal plane including the listener's ears, where the subspace is said horizontal plane. The inventive rendering method can implement upmixing by generating at least one speaker feed (in response to the modified program) for driving a speaker in the set whose position corresponds to a position outside the subspace, as well as generating speaker feeds for driving speakers in the set whose positions correspond to positions within the subspace. For example, one embodiment of the method includes a step of generating speaker feeds in response to the modified program for driving all the loudspeakers of the set. Thus, this embodiment leverages all speakers present in the playback system, whereas rendering of the original (unmodified) program would not generate speaker feeds for driving all the speakers of the playback system.
In typical embodiments, the method includes steps of distorting over time a trajectory of an authored object to determine a modified trajectory of the object, where the object's trajectory is indicated by an object based audio program and is within a subspace of a three-dimensional volume, and such that at least a portion of the modified trajectory is outside the subspace, and generating at least one speaker feed for a speaker whose position corresponds to a position outside the subspace (e.g., a speaker feed for a speaker located at a nonzero elevational angle relative to a listener, where the subspace is a horizontal plane at an elevational angle of zero relative to the listener). For example, the method may include a step of distorting an audio object's trajectory indicated by an object based audio program, where the trajectory is in a horizontal plane at an elevational angle of zero relative to the listener, in order to generate a speaker feed for a speaker (of a playback system) located at a nonzero elevational angle relative to a listener, where none of the speakers of the original authoring speaker system was located at a nonzero elevational angle relative to the content creator.
In some embodiments, the inventive method includes the step of modifying (upmixing) an object based audio program indicative of a trajectory of an audio object, where the trajectory is within a subspace of a full three-dimensional volume, to determine a modified program indicative of a modified trajectory of the object (e.g., by modifying coordinates of the program indicative of the trajectory, where such coordinates are determined by metadata included in the program), such that at least a portion of the modified trajectory is outside the subspace. Some such embodiments are implemented by a stand-alone system or device (an “upmixer”). The modified program determined by the upmixer's output is typically provided to a rendering system configured to generate speaker feeds (in response to the modified program) for driving a set of loudspeakers, typically including a speaker feed for driving at least one speaker in the set whose position corresponds to a position outside the subspace. Alternatively, some such embodiments of the inventive method are implemented by a rendering system which generates the modified program and generates speaker feeds (in response to the modified program) for driving a set of loudspeakers, typically including a speaker feed for driving at least one speaker in the set whose position corresponds to a position outside the subspace.
Some embodiments of the method implement both audio object trajectory modification and rendering in a single step. For example, the rendering could implicitly distort (modify) a trajectory (of an audio object) determined by an object based audio program (to determine a modified trajectory for the object) by explicit generation of speaker feeds for speakers having distorted versions of known positions (e.g., by explicit distortion of known loudspeaker positions). The distortion could be implemented as a scale factor applied to an axis (e.g., a height axis). For example, application of a first scale factor (e.g., a scale factor equal to 0.0) to the height axis of a trajectory during generation of speaker feeds could cause the modified trajectory to intersect the position of an overhead speaker (resulting in “100% distortion”), so that the sound emitted from the speakers of the playback system in response to the speaker feeds would be perceived as emitting from a source whose (modified) trajectory includes the location of the overhead speaker. Application of a second scale factor (e.g., a scale factor greater than 0.0 but not greater than 1.0) to the height axis of the trajectory during generation of speaker feeds could cause the modified trajectory to approach (but not intersect) the position of the overhead speaker more closely than does the original trajectory (resulting in “X % distortion,” where the value of X is determined by the value of the scale factor), so that the sound emitted from the speakers of the playback system in response to the speaker feeds would be perceived as emitting from a source whose (modified) trajectory approaches (but does not include) the location of the overhead speaker. 
Application of a third scale factor (e.g., a scale factor greater than 1.0) to the height axis of the trajectory during generation of speaker feeds could cause the modified trajectory to diverge from the position of the overhead speaker (farther than the original trajectory does). Combined trajectory modification and speaker feed generation can be implemented without any need to determine an inflection point, or to implement look ahead.
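By way of illustration only, the height-axis scaling described above can be sketched as follows. The sketch is not the claimed implementation. It assumes Cartesian coordinates with z as the height axis, an authored trajectory lying in the plane z = 0, and distortion applied to the known speaker positions. The function names and the mapping from scale factor to distortion percentage (X = 100 × (1 − scale)) are hypothetical, since the text states only that X is determined by the scale factor.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z); z is the height axis (assumed)


def scale_speaker_heights(speakers: List[Point], scale: float) -> List[Point]:
    """Return speaker positions with the height coordinate multiplied by `scale`."""
    return [(x, y, z * scale) for (x, y, z) in speakers]


def distortion_percent(scale: float) -> float:
    """Hypothetical mapping: scale 0.0 gives 100% distortion, scale 1.0 gives 0%.

    Scale factors above 1.0 give negative values, i.e. divergence.
    """
    return 100.0 * (1.0 - scale)


def vertical_gap(trajectory_point: Point, speaker: Point) -> float:
    """Height difference between a trajectory point and a (distorted) speaker."""
    return abs(speaker[2] - trajectory_point[2])


overhead = [(0.0, 0.0, 1.0)]   # one overhead speaker, one unit above the plane
source = (0.0, 0.0, 0.0)       # a point of the authored trajectory, in the plane

for s in (0.0, 0.5, 1.5):
    warped = scale_speaker_heights(overhead, s)
    print(s, vertical_gap(source, warped[0]), distortion_percent(s))
```

In this sketch a scale factor of 0.0 collapses the overhead speaker onto the plane of the trajectory, so the trajectory intersects it. A factor of 0.5 halves the gap, and a factor of 1.5 enlarges it. Each trajectory point is processed independently, so no inflection point or look-ahead is needed.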
Typically, the playback system includes a set of loudspeakers, and the set includes a first subset of speakers at known positions in a first space corresponding to positions in the subspace containing the object trajectory indicated by the audio program to be rendered (e.g., loudspeakers at positions nominally in a horizontal plane including the listener's ears, where the subspace is a horizontal plane including the listener's ears), and a second subset including at least one speaker, where each speaker in the second subset is at a known position corresponding to a position outside the subspace. To determine the modified trajectory (which is typically, but not necessarily, a curved trajectory), the rendering method may determine a candidate trajectory. The candidate trajectory may include a start point in the first space (such that one or more speakers in the first subset can be driven to emit sound perceived as originating at the start point) which coincides with a start point of the object trajectory, an end point in the first space (such that one or more speakers in the first subset can be driven to emit sound perceived as originating at the end point) which coincides with an end point of the object trajectory, and at least one intermediate point corresponding to the position of a speaker in the second subset (such that, for each intermediate point, a speaker in the second subset can be driven to emit sound perceived as originating at said intermediate point). In some cases, the candidate trajectory is used as the modified trajectory.
In other cases, a distorted version of the candidate trajectory (determined by distorting the candidate trajectory by applying at least one distortion coefficient thereto) is used as the modified trajectory. Each distortion coefficient's value determines a degree of distortion applied to the candidate trajectory. For example, in one embodiment, the projection of each intermediate point (along the candidate trajectory) on the first space defines an inflection point (in the first space) which corresponds to the intermediate point. The line (normal to the first space) between the intermediate point and the corresponding inflection point is referred to as a distortion axis for the intermediate point. A distortion coefficient (for each intermediate point), whose value indicates position along the distortion axis for the intermediate point, determines a modified version of the intermediate point. Using such a distortion coefficient for each intermediate point, the modified trajectory may be determined to be a trajectory which extends from the start point of the candidate trajectory, through the modified version of each intermediate point, to the end point of the candidate trajectory. Because the modified trajectory determines (with the audio content for the relevant object) each speaker feed for the relevant object channel, each distortion coefficient controls how close the rendered object will be perceived to get to the corresponding speaker (in the second subset) when the rendered object pans along the modified trajectory.
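The use of a distortion coefficient along a distortion axis can be sketched as follows. This is a minimal sketch under stated assumptions: the first space is taken to be the horizontal plane z = 0, the coefficient is taken to range from 0 (modified point at the inflection point) to 1 (modified point at the intermediate point itself), and the modified trajectory is formed by piecewise-linear interpolation. None of these choices is prescribed by the description above, and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z); z is the height axis (assumed)


def inflection_point(intermediate: Point) -> Point:
    """Projection of an intermediate point onto the plane z = 0 (the first space)."""
    return (intermediate[0], intermediate[1], 0.0)


def distort_intermediate(intermediate: Point, coeff: float) -> Point:
    """Position along the distortion axis: coeff 0 -> inflection point, 1 -> intermediate."""
    base = inflection_point(intermediate)
    return tuple(b + coeff * (p - b) for b, p in zip(base, intermediate))


def lerp(a: Point, b: Point, t: float) -> Point:
    """Linear interpolation between points a and b for t in [0, 1]."""
    return tuple(x + t * (y - x) for x, y in zip(a, b))


def modified_trajectory(start: Point, intermediate: Point, end: Point,
                        coeff: float, steps: int = 4) -> List[Point]:
    """Path from start, through the modified intermediate point, to end."""
    mid = distort_intermediate(intermediate, coeff)
    first_leg = [lerp(start, mid, i / steps) for i in range(steps)]
    second_leg = [lerp(mid, end, i / steps) for i in range(steps + 1)]
    return first_leg + second_leg


# Pan from left to right in the plane, bending halfway toward an overhead speaker.
path = modified_trajectory((-1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), 0.5, steps=2)
print(path)
```

With a coefficient of 0.5, the apex of the path lies halfway between the inflection point and the overhead speaker position. A coefficient of 1.0 reproduces the candidate trajectory's intermediate point, and 0.0 leaves the path in the plane.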
In the case that the inventive system (either a rendering system, or an upmixer for generating a modified program for rendering by a rendering system) is configured to process content in a non-real-time manner, it is useful to include metadata in an object based audio program to be rendered, where the metadata indicates both the starting and finishing points for each object trajectory indicated by the program, and to configure the system to use such metadata to implement upmixing (to determine a modified trajectory for each such trajectory) without need for look-ahead delays. Alternatively, the need for look-ahead delays could be eliminated by configuring the inventive system to average over time the coordinates of an object trajectory (indicated by an object based audio program to be rendered) to generate a trajectory trend and to use such averages to predict the path of the trajectory and find each inflection point of the trajectory.
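One possible reading of the trajectory-trend alternative is sketched below. It is an assumption-laden illustration rather than a described embodiment: the trend is taken to be the average per-frame displacement over a short window of past positions, and the prediction is a simple linear extrapolation. The window length and the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def trajectory_trend(positions: List[Point], window: int = 4) -> Point:
    """Average per-frame displacement over (at most) the last `window` steps."""
    recent = positions[-(window + 1):]
    steps = len(recent) - 1
    if steps < 1:
        return (0.0, 0.0, 0.0)  # not enough history to estimate a trend
    # The mean of successive displacements equals (last - first) / steps.
    return tuple((recent[-1][k] - recent[0][k]) / steps for k in range(3))


def predict_position(positions: List[Point], frames_ahead: int,
                     window: int = 4) -> Point:
    """Extrapolate the last known position along the averaged trend."""
    trend = trajectory_trend(positions, window)
    last = positions[-1]
    return tuple(last[k] + frames_ahead * trend[k] for k in range(3))


history = [(0.0, 0.0, 0.0), (0.25, 0.0, 0.0), (0.5, 0.0, 0.0), (0.75, 0.0, 0.0)]
print(trajectory_trend(history))       # averaged motion per frame
print(predict_position(history, 2))    # estimated position two frames later
```

A predicted path of this kind could then be used to estimate where the trajectory passes closest to a speaker outside the subspace, without buffering future frames.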
Additional metadata could be included in an object based audio program, to provide to the inventive system (either a system configured to render the program, or an upmixer for generating a modified version of the program for rendering by a rendering system) information that enables the system to override a coefficient value or otherwise influences the system's behavior (e.g., to prevent the system from modifying the trajectories of certain objects indicated by the program). For example, the metadata could indicate a characteristic (e.g., a type or a property) of an audio object, and the system could be configured to operate in a specific mode in response to such metadata (e.g., a mode in which it is prevented from modifying the trajectory of an object of a specific type). For example, the system could be configured to respond to metadata indicating that an object is dialog, by disabling upmixing for the object (e.g., so that speaker feeds will be generated using the trajectory, if any, indicated by the program for the dialog, rather than from a modified version of the trajectory, e.g., one which extends above or below the horizontal plane of the intended listener's ears).
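The metadata-controlled behavior described above can be sketched as follows. The field name `object_type`, the value `"dialog"`, and the function names are hypothetical and do not correspond to any particular program format. The sketch shows only the decision to bypass trajectory modification for an object flagged as dialog.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class AudioObject:
    name: str
    trajectory: List[Point]
    object_type: Optional[str] = None  # hypothetical metadata field


def select_trajectory(obj: AudioObject,
                      upmix: Callable[[List[Point]], List[Point]]) -> List[Point]:
    """Return the trajectory to render: unmodified for dialog, upmixed otherwise."""
    if obj.object_type == "dialog":
        return obj.trajectory          # upmixing disabled by metadata
    return upmix(obj.trajectory)


def lift(trajectory: List[Point]) -> List[Point]:
    """Placeholder upmix: raise every point half a unit above the plane."""
    return [(x, y, z + 0.5) for (x, y, z) in trajectory]


voice = AudioObject("voice", [(0.0, 1.0, 0.0)], object_type="dialog")
jet = AudioObject("jet", [(0.0, 1.0, 0.0)], object_type="effect")
print(select_trajectory(voice, lift))  # unchanged trajectory
print(select_trajectory(jet, lift))    # modified trajectory
```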
In a class of embodiments, the inventive rendering system is configured to determine, from an object based audio program (and knowledge of the positions of the speakers to be employed to play the program), the distance between each position of an audio source indicated by the program and the position of each of the speakers. The positions of the speakers can be considered to be desired positions of the source (if it is desired to render a modified version of the program so that the emitted sound is perceived as emitting from positions that include positions at or near all the speakers of the playback system), and the source positions indicated by the program can be considered to be actual positions of the source. The system is configured in accordance with the invention to determine, for each actual source position (e.g., each source position along a source trajectory) indicated by the program, a subset of the full set of speakers (a “primary” subset) consisting of those speakers of the full set which are (or the speaker of the full set which is) closest to the actual source position, where “closest” in this context is defined in some reasonably defined sense (e.g., the speakers of the full set which are “closest” to a source position may be each speaker whose position in the playback system corresponds to a position, in the three dimensional volume in which the source's trajectory is defined, whose distance from the source position is within a predetermined threshold value, or whose distance from the source position satisfies some other predetermined criterion). Typically, speaker feeds are generated (for each source position) which cause sound to be emitted with relatively large amplitudes from the speaker(s) of the primary subset (for the source position) and with relatively smaller amplitudes (or zero amplitudes) from the other speakers of the playback system.
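The selection of a primary subset can be sketched as follows, using the distance-threshold criterion given above as one example of “closest.” The sketch assumes Euclidean distance in Cartesian coordinates and falls back to the single nearest speaker when none is within the threshold. The two-level gain assignment is a placeholder standing in for “relatively large” and “relatively smaller (or zero)” amplitudes, not a panning law taken from the description. All names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def primary_subset(source: Point, speakers: Dict[str, Point],
                   threshold: float) -> List[str]:
    """Names of speakers within `threshold` of the source, else the single nearest."""
    within = [name for name, pos in speakers.items()
              if math.dist(source, pos) <= threshold]
    if within:
        return within
    nearest = min(speakers, key=lambda name: math.dist(source, speakers[name]))
    return [nearest]


def placeholder_gains(source: Point, speakers: Dict[str, Point],
                      threshold: float, floor: float = 0.0) -> Dict[str, float]:
    """Gain 1.0 for primary-subset speakers and `floor` for all others."""
    primary = set(primary_subset(source, speakers, threshold))
    return {name: (1.0 if name in primary else floor) for name in speakers}


layout = {"L": (-1.0, 1.0, 0.0), "R": (1.0, 1.0, 0.0), "Top": (0.0, 0.0, 1.0)}
print(primary_subset((0.0, 0.0, 0.8), layout, threshold=0.5))
print(placeholder_gains((0.0, 0.0, 0.8), layout, threshold=0.5))
```

Evaluating `primary_subset` for each position along a source trajectory yields the sequence of primary subsets, one per source position.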
A sequence of source positions indicated by the program (which can be considered to define a source trajectory) determines a sequence of primary subsets of the full set of speakers (one primary subset for each source position in the sequence). The positions of the speakers in each primary subset define a three-dimensional (3D) space which contains each speaker of the primary subset and the relevant actual source position (but contains no other speaker of the full set). The steps of determining a modified trajectory (in response to a source trajectory indicated by the program) and generating speaker feeds (for driving all speakers of the playback system) in response to the modified trajectory, can thus be implemented in the exemplary rendering system as follows: for each of the sequence of source positions indicated by the program (which can be considered to define a trajectory, e.g., the “original trajectory” of
Optionally, a scaling parameter is applied to each of the 3D spaces (which are determined in accordance with an embodiment in the noted class) to generate a scaled space (sometimes referred to herein as a “warped” space) in response to the 3D space, and speaker feeds are generated for driving the speakers (of the full set employed to play the program) to emit sound intended to be perceived (and which typically will be perceived) as being emitted by the source from a characteristic point of the warped space rather than from the above-noted characteristic point of the 3D space (e.g., the characteristic point of the warped space may be the intersection of the top surface of the warped space with a vertical line through the source position determined by the program). The warping could be implemented as a scale factor applied to a height axis, so that the height of each warped space is a scaled version of the height of the corresponding 3D space.
Aspects of the invention include a system (e.g., an upmixer or a rendering system) configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc or other tangible object) which stores code for implementing any embodiment of the inventive method.
In some embodiments, the inventive system is or includes a general or special purpose processor programmed with software (or firmware) and/or otherwise configured to perform an embodiment of the inventive method. In some embodiments, the inventive system is or includes a general purpose processor, coupled to receive input audio (and optionally also input video), and programmed to generate (by performing an embodiment of the inventive method) output data (e.g., output data determining speaker feeds) in response to the input audio. In other embodiments, the inventive system is implemented as an appropriately configured (e.g., programmed and otherwise configured) audio digital signal processor (DSP) which is operable to generate output data (e.g., output data determining speaker feeds) in response to input audio.
Throughout this disclosure, including in the claims, the expression performing an operation “on” signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the following expressions have the following definitions:
speaker and loudspeaker are used synonymously to denote any sound-emitting transducer. This definition includes loudspeakers implemented as multiple transducers (e.g., woofer and tweeter);
speaker feed: an audio signal to be applied directly to a loudspeaker, or an audio signal that is to be applied to an amplifier and loudspeaker in series;
channel (or “audio channel”): a monophonic audio signal;
speaker channel (or “speaker-feed channel”): an audio channel that is associated with a named loudspeaker (at a desired or nominal position), or with a named speaker zone within a defined speaker configuration. A speaker channel is rendered in such a way as to be equivalent to application of the audio signal directly to the named loudspeaker (at the desired or nominal position) or to a speaker in the named speaker zone;
object channel: an audio channel indicative of sound emitted by an audio source (sometimes referred to as an audio “object”). Typically, an object channel determines a parametric audio source description. The source description may determine sound emitted by the source (as a function of time), the apparent position (e.g., 3D spatial coordinates) of the source as a function of time, and optionally also at least one additional parameter (e.g., apparent source size or width) characterizing the source;
audio program: a set of one or more audio channels (at least one speaker channel and/or at least one object channel) and optionally also associated metadata that describes a desired spatial audio presentation;
object based audio program: an audio program comprising a set of one or more object channels (and typically not comprising any speaker channel) and optionally also associated metadata that describes a desired spatial audio presentation (e.g., metadata indicative of a trajectory of an audio object which emits sound indicated by an object channel);
render: the process of converting an audio program into one or more speaker feeds, or the process of converting an audio program into one or more speaker feeds and converting the speaker feed(s) to sound using one or more loudspeakers (in the latter case, the rendering is sometimes referred to herein as rendering “by” the loudspeaker(s)). An audio channel can be trivially rendered (“at” a desired position) by applying a speaker feed indicative of content of the channel directly to a physical loudspeaker at the desired position, or one or more audio channels can be rendered using one of a variety of virtualization techniques designed to be substantially equivalent (for the listener) to such trivial rendering. In this latter case, each audio channel may be converted to one or more speaker feeds to be applied to loudspeaker(s) in known locations, which are in general different from the desired position, such that sound emitted by the loudspeaker(s) in response to the feed(s) will be perceived as emitting from the desired position. Examples of such virtualization techniques include binaural rendering via headphones (e.g., using Dolby Headphone processing which simulates up to 7.1 channels of surround sound for the headphone wearer) and wave field synthesis. An object channel can be rendered (“at” a time-varying position having a desired trajectory) by applying speaker feeds indicative of content of the channel to a set of physical loudspeakers (where the physical position of each of the loudspeakers may or may not coincide with the desired position at any instant of time);
azimuth (or azimuthal angle): the angle, in a horizontal plane, of a source relative to a listener/viewer. Typically, an azimuthal angle of 0 degrees denotes that the source is directly in front of the listener/viewer, and the azimuthal angle increases as the source moves in a counterclockwise direction around the listener/viewer;
elevation (or elevational angle): the angle, in a vertical plane, of a source relative to a listener/viewer. Typically, an elevational angle of 0 degrees denotes that the source is in the same horizontal plane as the listener/viewer (e.g., the ears of the listener/viewer), and the elevational angle increases as the source moves upward (in a range from 0 to 90 degrees) relative to the listener/viewer;
L: Left front audio channel. A speaker channel, typically intended to be rendered by a speaker positioned at about 30 degrees azimuth, 0 degrees elevation;
C: Center front audio channel. A speaker channel, typically intended to be rendered by a speaker positioned at about 0 degrees azimuth, 0 degrees elevation;
R: Right front audio channel. A speaker channel, typically intended to be rendered by a speaker positioned at about −30 degrees azimuth, 0 degrees elevation;
Ls: Left surround audio channel. A speaker channel, typically intended to be rendered by a speaker positioned at about 110 degrees azimuth, 0 degrees elevation;
Rs: Right surround audio channel. A speaker channel, typically intended to be rendered by a speaker positioned at about −110 degrees azimuth, 0 degrees elevation;
Full Range Channels: All audio channels of an audio program other than each low frequency effects channel of the program. Typical full range channels are L and R channels of stereo programs, and L, C, R, Ls and Rs channels of surround sound programs. The sound determined by a low frequency effects channel (e.g., a subwoofer channel) comprises frequency components in the audible range up to a cutoff frequency, but does not include frequency components in the audible range above the cutoff frequency (as do typical full range channels);
Front Channels: speaker channels (of an audio program) associated with frontal sound stage. Typical front channels are L and R channels of stereo programs, or L, C and R channels of surround sound programs; and
AVR: an audio video receiver. For example, a receiver in a class of consumer electronics equipment used to control playback of audio and video content, for example in a home theater.
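The azimuth and elevation conventions defined above can be illustrated by converting a direction into a unit vector. The following is a minimal sketch only: the axis assignment (+x toward the front, +y toward the listener's left, +z upward) and the names `direction_vector` and `NOMINAL_SPEAKERS` are assumptions introduced for illustration and are not taken from this disclosure.

```python
import math


def direction_vector(azimuth_deg, elevation_deg):
    """Return a unit vector for a direction given in degrees.

    Assumed axes (hypothetical, not defined in the text):
    +x is front, +y is the listener's left, +z is up, so that the
    azimuth increases counterclockwise when viewed from above.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


# Nominal (azimuth, elevation) pairs taken from the channel definitions above.
NOMINAL_SPEAKERS = {
    "L": (30.0, 0.0),
    "C": (0.0, 0.0),
    "R": (-30.0, 0.0),
    "Ls": (110.0, 0.0),
    "Rs": (-110.0, 0.0),
}

if __name__ == "__main__":
    for name, (az, el) in NOMINAL_SPEAKERS.items():
        x, y, z = direction_vector(az, el)
        print(f"{name}: ({x:+.3f}, {y:+.3f}, {z:+.3f})")
```

Under these assumed axes, a positive azimuth yields a positive y component (a source to the listener's left), and an elevation of 90 degrees yields a vector pointing straight up.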
Exemplary embodiments are directed to systems and methods that implement a type of audio coding called audio object coding (or object based coding or “scene description”), and operate under the assumption that each audio program (that is output by the encoder) may be rendered for reproduction by any of a large number of different arrays of loudspeakers. Each audio program output by such an encoder is an object based audio program, and typically, each channel of such object based audio program is an object channel. In audio object coding, audio signals associated with distinct sound sources (audio objects) are input to the encoder as separate audio streams. Examples of audio objects include (but are not limited to) a dialog track, a single musical instrument, and a jet aircraft. Each audio object is associated with spatial parameters, which may include (but are not limited to) source position, source width, and source velocity and/or trajectory. The audio objects and associated parameters are encoded for distribution and storage. Final audio object mixing and rendering may be performed at the receive end of the audio storage and/or distribution chain, as part of audio program playback. The step of audio object mixing and rendering is typically based on knowledge of actual positions of loudspeakers to be employed to reproduce the program.
Typically, during generation of an object based audio program, the content creator may embed the spatial intent of the mix (e.g., the trajectory of each audio object determined by each object channel of the program) by including metadata in the program. The metadata can be indicative of the position or trajectory of each audio object determined by each object channel of the program, and/or at least one of the size, velocity, type (e.g., dialog or music), and another characteristic of each such object.
During rendering of an object based audio program, each object channel can be rendered (“at” a time-varying position having a desired trajectory) by generating speaker feeds indicative of content of the channel and applying the speaker feeds to a set of loudspeakers (where the physical position of each of the loudspeakers may or may not coincide with the desired position at any instant of time). The speaker feeds for a set of loudspeakers may be indicative of content of multiple object channels (or a single object channel). The rendering system typically generates the speaker feeds to match the exact hardware configuration of a specific reproduction system (e.g., the speaker configuration of a home theater system, where the rendering system is also an element of the home theater system).
In the case that an object based audio program indicates a trajectory of an audio object, the rendering system would typically generate speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived (and which typically will be perceived) as emitting from an audio object having said trajectory. For example, the program may indicate that sound from a musical instrument (an object) should pan from left to right, and the rendering system might generate speaker feeds for driving a 5.1 array of loudspeakers to emit sound that will be perceived as panning from the L (left front) speaker of the array to the C (center front) speaker of the array and then the R (right front) speaker of the array.
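The left-to-right pan in the preceding example can be sketched as a mapping from a pan position to gains for the L, C and R speakers. The disclosure does not specify a panning law; the constant-power (sine/cosine) law used below, the normalized pan position, and the name `front_pan_gains` are assumptions made purely for illustration. The remaining channels of the 5.1 array are simply left out of the sketch.

```python
import math


def front_pan_gains(position):
    """Return gains for the L, C and R speakers for a pan position in [0, 1].

    position 0.0 is fully at L, 0.5 fully at C, and 1.0 fully at R.
    A constant-power (sine/cosine) law between adjacent speakers is an
    assumption of this sketch, not a requirement of the text.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    if position <= 0.5:
        frac = position / 0.5  # 0 at L, 1 at C
        angle = frac * math.pi / 2
        return {"L": math.cos(angle), "C": math.sin(angle), "R": 0.0}
    frac = (position - 0.5) / 0.5  # 0 at C, 1 at R
    angle = frac * math.pi / 2
    return {"L": 0.0, "C": math.cos(angle), "R": math.sin(angle)}


if __name__ == "__main__":
    for step in range(5):
        pos = step / 4
        gains = front_pan_gains(pos)
        print(pos, {name: round(g, 3) for name, g in gains.items()})
```

With this law the sum of squared gains stays at one for every pan position, so the perceived loudness remains roughly constant as the object moves across the front sound stage.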
Audio object coding allows an object based audio program (sometimes referred to herein as a mix) to be played on any speaker configuration. Some embodiments for rendering an object based audio program assume that each audio object determined by the program is positioned in a space (e.g., moves along a trajectory in the space) which matches the space in which the speakers of the loudspeaker array to be employed to reproduce the program are located. For example, if an object based audio program indicates an object moving in a panning plane defined by a panning axis (e.g., a horizontally oriented front-back axis, a horizontally oriented left-right axis, a vertically oriented up-down axis, or a near-far axis) and a listener, the rendering system would conventionally generate speaker feeds (in response to the program) for a loudspeaker array consisting of speakers nominally positioned in a plane parallel to the panning plane (i.e., the speakers are nominally in a horizontal plane if the panning plane is a horizontal plane).
Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Embodiments of the inventive system, method, and medium will be described with reference to
Typical embodiments of the invention are methods for rendering an object based audio program (which is indicative of a trajectory of an audio source), including by generating speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived as emitting from the source, but with the source having a different trajectory than the one indicated by the program (e.g., with the source having a trajectory in a vertical plane or a three-dimensional trajectory, where the program indicates a source trajectory in a horizontal plane).
In some embodiments, the invention is a method for rendering an object based audio program for playback by a set of loudspeakers, where the program is indicative of a trajectory of an audio object, and the trajectory is within a subspace of a full three-dimensional volume (e.g., the trajectory is limited to be in a horizontal plane within the volume, or is a horizontal line within the volume). The method includes the steps of modifying the program to determine a modified program indicative of a modified trajectory of the object (e.g., by modifying coordinates of the program indicative of the trajectory), where at least a portion of the modified trajectory is outside the subspace (e.g., where the trajectory is a horizontal line, the modified trajectory is a path in a vertical plane including the horizontal line); and generating speaker feeds (in response to the modified program) for driving at least one speaker in the set whose position corresponds to a position outside the subspace and for driving speakers in the set whose positions correspond to positions within the subspace.
Typically, the object based audio program (unless it is modified in accordance with the invention) is capable of being rendered to generate only speaker feeds for driving a subset of the set of loudspeakers (e.g., only those speakers in the set whose positions correspond to the subspace of the full three-dimensional volume). For example, the audio program may be capable of being rendered to generate only speaker feeds for driving the speakers in the set which are positioned in a horizontal plane including the listener's ears, where the subspace is said horizontal plane. The inventive rendering method implements upmixing by generating at least one speaker feed (in response to the modified program) for driving a speaker in the set whose position corresponds to a position outside the subspace, as well as generating speaker feeds for driving speakers in the set whose positions correspond to positions within the subspace. For example, a preferred embodiment of the method includes a step of generating speaker feeds in response to the modified program for driving all the loudspeakers of the set. Thus, the preferred embodiment leverages all speakers present in the playback system, whereas rendering of the original (unmodified) program would not generate speaker feeds for driving all the speakers of the playback system.
In other embodiments, the inventive method includes a step of modifying an object based audio program indicative of a trajectory of an audio object, to determine a modified program indicative of a modified trajectory of the object, where both the trajectory and the modified trajectory are defined in the same space (i.e., no portion of the modified trajectory extends outside the space in which the trajectory extends). For example, the trajectory may be modified to optimize (or otherwise modify) the timbre of sound emitted in response to speaker feeds determined from the modified program relative to the sound that would be emitted in response to speaker feeds determined from the original program (e.g., in the case that the modified trajectory, but not the original trajectory, determines a single ended “snap to” or “snap toward” a speaker).
In typical embodiments, the inventive method includes the steps of: distorting over time a trajectory of an authored object to determine a modified trajectory of the object, where the object's trajectory is indicated by an object based audio program and is within a subspace of a three-dimensional volume, such that at least a portion of the modified trajectory is outside the subspace; and generating at least one speaker feed for a speaker whose position corresponds to a position outside the subspace. For example, where the subspace is a horizontal plane at a first elevational angle relative to an expected listener, a speaker feed is generated for driving a speaker located at a second elevational angle relative to the listener, where the second elevational angle is different from the first elevational angle (e.g., the first elevational angle may be zero and the second elevational angle may be nonzero). As a further example, the method may include a step of distorting an audio object's trajectory indicated by an object based audio program, where the trajectory is in a horizontal plane at an elevational angle of zero relative to the listener, in order to generate a speaker feed for a speaker (of a playback system) located at a nonzero elevational angle relative to the listener, where none of the speakers of the original authoring speaker system was located at a nonzero elevational angle relative to the content creator.
In some embodiments, the inventive method includes the step of modifying (upmixing) an object based audio program indicative of a trajectory of an audio object, where the trajectory is within a subspace of a full three-dimensional volume, to determine a modified program indicative of a modified trajectory of the object (e.g., by modifying coordinates of the program indicative of the trajectory, where such coordinates are determined by metadata included in the program), such that at least a portion of the modified trajectory is outside the subspace. Some such embodiments are implemented by a stand-alone system or device (an “upmixer”). The modified program determined by the upmixer's output is typically provided to a rendering system configured to generate speaker feeds (in response to the modified program) for driving a set of loudspeakers, typically including a speaker feed for driving at least one speaker in the set whose position corresponds to a position outside the subspace. Alternatively, some such embodiments of the inventive method are implemented by a rendering system which generates the modified program and generates speaker feeds (in response to the modified program) for driving a set of loudspeakers, typically including a speaker feed for driving at least one speaker in the set whose position corresponds to a position outside the subspace.
An example of the inventive method is the rendering of an audio program which includes an object channel indicative of a source which undergoes front to back panning (i.e., the source's trajectory is a horizontal line). The pan may have been authored on a traditional 5.1 speaker setup, with the content creator monitoring an amplitude pan between the center speaker and the two (left rear and right rear) surround speakers of the 5.1 speaker array. The exemplary embodiment of the inventive rendering method generates speaker feeds for reproducing the program over all the speakers of a 6.1 speaker system, including an overhead speaker (e.g., speaker Ts of
Typically, the playback system includes a set of loudspeakers, and the set includes a first subset of speakers at positions in a first space corresponding to positions in the subspace containing the object trajectory indicated by the audio program to be rendered (e.g., loudspeakers at positions nominally in a horizontal plane including the listener, where the subspace is a horizontal plane including the listener), and a second subset including at least one speaker, where each speaker in the second subset is at a position corresponding to a position outside the subspace. To determine the modified trajectory (which is typically but not necessarily a curved trajectory), the rendering method may determine a candidate trajectory. The candidate trajectory includes a start point in the first space (such that one or more speakers in the first subset can be driven to emit sound perceived as originating at the start point) which coincides with a start point of the object trajectory, an end point in the first space (such that one or more speakers in the first subset can be driven to emit sound perceived as originating at the end point) which coincides with an end point of the object trajectory, and at least one intermediate point corresponding to the position of a speaker in the second subset (such that, for each intermediate point, a speaker in the second subset can be driven to emit sound perceived as originating at said intermediate point). In some cases, the candidate trajectory is used as the modified trajectory.
In other cases, a distorted version of the candidate trajectory (determined by at least one distortion coefficient) is used as the modified trajectory. Each distortion coefficient's value determines a degree of distortion applied to the candidate trajectory. For example, in one embodiment, the projection of each intermediate point (along the candidate trajectory) on the first space defines an inflection point (in the first space) which corresponds to the intermediate point. The line (normal to the first space) between the intermediate point and the corresponding inflection point is referred to as a distortion axis for the intermediate point. A distortion coefficient (for each intermediate point), whose value indicates position along the distortion axis for the intermediate point, determines a modified version of the intermediate point. Using such a distortion coefficient for each intermediate point, the modified trajectory may be determined to be a trajectory which extends from the start point of the candidate trajectory, through the modified version of each intermediate point, to the end point of the candidate trajectory. Because the modified trajectory determines (with the audio content for the relevant object) each speaker feed for the relevant object channel, each distortion coefficient controls how close the rendered object will be perceived to get to the corresponding speaker (in the second subset) when the rendered object pans along the modified trajectory.
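The use of a distortion coefficient described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the disclosed implementation: the first space is taken to be the plane z = 0, the coefficient is taken to map linearly onto position along the distortion axis (0.0 at the inflection point, 1.0 at the intermediate point of the candidate trajectory), and the modified trajectory is approximated by straight segments even though it is typically curved. The function names and the coordinates in the usage example are hypothetical.

```python
def modified_intermediate_point(intermediate, coefficient):
    """Move an intermediate point along its distortion axis.

    intermediate: (x, y, z) of a candidate intermediate point, for example
        the position of an overhead speaker. The first space is assumed
        to be the plane z = 0, so the inflection point is (x, y, 0).
    coefficient: 0.0 selects the inflection point, 1.0 selects the
        intermediate point itself (a linear mapping is assumed).
    """
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient must lie in [0, 1]")
    x, y, z = intermediate
    return (x, y, coefficient * z)


def modified_trajectory(start, end, intermediates, coefficient,
                        steps_per_segment=4):
    """Sample a piecewise-linear path from start to end.

    The path passes through the modified version of each intermediate
    point. Straight segments are a simplification made for this sketch.
    """
    waypoints = [start]
    waypoints += [modified_intermediate_point(p, coefficient)
                  for p in intermediates]
    waypoints.append(end)

    points = [waypoints[0]]
    for a, b in zip(waypoints, waypoints[1:]):
        for k in range(1, steps_per_segment + 1):
            t = k / steps_per_segment
            points.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return points


if __name__ == "__main__":
    front = (1.0, 0.0, 0.0)      # hypothetical start point in the plane
    rear = (-1.0, 0.0, 0.0)      # hypothetical end point in the plane
    overhead = (0.0, 0.0, 1.0)   # hypothetical overhead speaker position
    for coeff in (0.0, 0.5, 1.0):
        path = modified_trajectory(front, rear, [overhead], coeff, 2)
        print(coeff, path)
```

With a coefficient of 0.0 every sampled point stays in the plane, so the original trajectory is reproduced; with a coefficient of 1.0 the path passes through the overhead speaker position, which corresponds to using the candidate trajectory as the modified trajectory.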
One may define the direction of arrival of sound from an audio source in terms of Azimuth and Elevation angles (Az, El), or in terms of an (x,y,z) unit vector. For example, in
An exemplary embodiment will be described with reference to
The rendering system is configured to generate speaker feeds for driving all speakers of the 6.1 array (including the overhead speaker, Ts) in response to an object based audio program (e.g., the program in the example) which is not specifically indicative of audio content to be perceived as emitting from a location above the horizontal plane of the listener's ears. In accordance with the invention, the rendering system is configured to modify the original (horizontal) trajectory indicated by the program to determine a modified trajectory (for the same audio object) which extends from the location (point A) of the center speaker, C, upward and backward toward the location of the overhead speaker, Ts, and then downward and backward to the location (point B) midway between the surround speakers, Rs and Ls. Such a modified trajectory is also shown in
As shown in
The rendering system may use the candidate trajectory as the modified trajectory (e.g., in response to assertion of the below-described distortion coefficient with the value 100%, or in response to some other user-determined control value).
The rendering system is preferably also configured to use any of a set of distorted versions of the candidate trajectory as the modified trajectory (e.g., in response to the below-described distortion coefficient having some value other than 100%, or in response to some other user-determined control value).
In the example, the rendering system is configured to respond to a user specified distortion coefficient having a value in the range from 100% (to achieve maximum distortion of the original trajectory, thereby maximizing use of the overhead speaker) to 0% (preventing any distortion of the original trajectory for the purpose of increasing use of the overhead speaker). In response to the specified value of the distortion coefficient, the rendering system uses a corresponding one of the distorted versions of the candidate trajectory as the modified trajectory. Specifically, the candidate trajectory is used as the modified trajectory in response to the distortion coefficient having the value 100%, the distorted candidate trajectory passing through point F (of
In the example, the rendering system is configured to efficiently determine the modified trajectory so as to achieve a desired degree of use of the overhead speaker determined by the distortion coefficient's value. This can be understood by considering the distortion axis through points I and E of
The intersection of each distorted version of the candidate trajectory with the distortion axis is the inflection point of said distorted version of the candidate trajectory. Thus, point G of
In a class of embodiments, the inventive rendering system is configured to determine, from an object based audio program (and knowledge of the positions of the speakers to be employed to play the program), the distance between each position of an audio source indicated by the program and the position of each of the speakers. Desired positions of the source can be defined relative to the positions of the speakers (e.g., it may be desired to play back sound so that the sound will be perceived as emitting from one of the speakers, e.g. an overhead speaker), and the source positions indicated by the program can be considered to be actual positions of the source. The system is configured in accordance with the invention to determine, for each actual source position (e.g., each source position along a source trajectory) indicated by the program, a subset of the full set of speakers (a “primary” subset) consisting of those speakers of the full set which are (or the speaker of the full set which is) closest (in some reasonably defined sense) to the source position. Typically, speaker feeds are generated (for each source position) which cause sound to be emitted with relatively large amplitudes from the speaker(s) of the primary subset (for the source position) and with relatively smaller amplitudes (or zero amplitudes) from the other speakers of the playback system. The speaker(s) of the full set which are (or is) “closest” to a source position may be each speaker whose position in the playback system corresponds to a position (in the three-dimensional volume in which the source trajectory is defined) whose distance from the source position is within a predetermined threshold value, or whose distance from the source position satisfies some other predetermined criterion.
A sequence of source positions indicated by the program (which can be considered to define a source trajectory) determines a sequence of primary subsets of the full set of speakers (one primary subset for each source position in the sequence).
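The selection of a primary subset for each source position can be sketched as a distance test against every speaker. The sketch below assumes a Euclidean distance and a single threshold value, and falls back to the single nearest speaker when no speaker is within the threshold; that fallback, the function names, and the speaker coordinates are assumptions introduced for illustration only.

```python
import math


def primary_subset(source, speakers, threshold):
    """Return the names of the speakers closest to a source position.

    A speaker belongs to the primary subset when its Euclidean distance
    from the source is within the threshold. If no speaker qualifies,
    the single nearest speaker is returned (an assumption of this sketch).
    """
    distances = {name: math.dist(source, pos)
                 for name, pos in speakers.items()}
    near = [name for name, d in distances.items() if d <= threshold]
    if near:
        return sorted(near)
    return [min(distances, key=distances.get)]


def primary_subsets(trajectory, speakers, threshold):
    """Return one primary subset per source position in the sequence."""
    return [primary_subset(p, speakers, threshold) for p in trajectory]


if __name__ == "__main__":
    # Hypothetical positions on a unit sphere around the listener.
    speakers = {
        "C": (1.0, 0.0, 0.0),
        "L": (0.866, 0.5, 0.0),
        "R": (0.866, -0.5, 0.0),
        "Ls": (-0.342, 0.940, 0.0),
        "Rs": (-0.342, -0.940, 0.0),
        "Ts": (0.0, 0.0, 1.0),
    }
    trajectory = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.9), (-0.342, 0.0, 0.0)]
    for position, subset in zip(trajectory,
                                primary_subsets(trajectory, speakers, 0.6)):
        print(position, subset)
```

A renderer built along these lines would then assign relatively large amplitudes to the speakers named in each subset and smaller (or zero) amplitudes to the others.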
The positions of the speakers in each primary subset define a three-dimensional (3D) space which contains each speaker of the primary subset and a position corresponding to the relevant source position, but which contains no other speaker of the full set. Each such position which “corresponds” to an actual source position is a position, in the actual playback system, which “corresponds” to the source position in the sense that the content creator intends that sound emitted from the speakers of the playback system should be perceived by a listener as emitting from said source position. Thus, for convenience, such a position in the playback system which “corresponds” to a source position will sometimes be referred to as an actual source position, where it is clear from the context that it is a position in an actual playback system (e.g., a 3D space including a primary subset of a set of speakers, which is a space in a playback system of the type mentioned above in this paragraph, will sometimes be referred to as a 3D space including the source position which corresponds to the primary subset). For example, consider the 6.1 speaker array of
The steps of determining a modified trajectory (in response to a source trajectory indicated by the program) and generating speaker feeds (for driving all speakers of the playback system) in response to the modified trajectory, can thus be implemented in the exemplary rendering system as follows: for each of the sequence of source positions indicated by the program (which can be considered to define a trajectory, e.g., the “original trajectory” of
Optionally, a scaling parameter is applied to each of the 3D spaces (which are determined in accordance with an embodiment in the noted class) to generate a scaled space (sometimes referred to herein as a “warped” space) in response to the 3D space, and speaker feeds are generated for driving the speakers (of the full set employed to play the program) to emit sound intended to be perceived (and which typically will be perceived) as being emitted by the source from a characteristic point of the warped space rather than from the above-noted characteristic point of the 3D space (e.g., the characteristic point of the warped space may be the intersection of the top surface of the warped space with a vertical line through the source position determined by the program). Warping of a 3D space is a relatively simple, well known mathematical operation. In the example described with reference to
For example, a scaling parameter of “0.0” could maximize the height of the warped space (e.g., the warped space determined by applying such a scaling parameter of 0.0 to volume V′ of
Some embodiments of the inventive method implement both audio object trajectory modification and rendering in a single step. For example, the rendering could implicitly distort (modify) a trajectory (of an audio object) determined by an object based audio program (to determine a modified trajectory for the object) by explicit generation of speaker feeds for speakers having distorted versions of known positions (e.g., by explicit distortion of known loudspeaker positions). The distortion could be implemented as a scale factor applied to an axis (e.g., a height axis). For example, application of a first scale factor (e.g., a scale factor equal to 0.0) to the height axis of a trajectory (e.g., the original trajectory shown in
In some embodiments, the inventive system is or includes a general or special purpose processor programmed with software (or firmware) and/or otherwise configured to perform an embodiment of the inventive method. In some embodiments, the inventive system is or includes a general purpose processor, coupled to receive input audio (and optionally also input video), and programmed to generate (by performing an embodiment of the inventive method) output data (e.g., output data determining speaker feeds) in response to the input audio. For example, the system (e.g., system 3 of
In some embodiments, the inventive system is or includes a general or special purpose processor (e.g., an audio digital signal processor (DSP)), coupled to receive input audio data (indicative of an object based audio program) and programmed with software (or firmware) and/or otherwise configured to generate output data (a modified version of source position metadata indicated by the program, or data determining speaker feeds for rendering a modified version of the program) in response to the input audio data by performing an embodiment of the inventive method. The processor may be programmed with software (or firmware) and/or otherwise configured (e.g., in response to control data) to perform any of a variety of operations on the input audio data, including an embodiment of the inventive method.
More specifically, a typical implementation of upmixer 4 is programmed to modify (upmix) the object based audio program (which is indicative of a trajectory of an audio object and the trajectory is within a subspace of a full three-dimensional volume) determined by the audio data from subsystem 2, in response to source position metadata of the program to generate (and assert at at least one output 4B) output data which determine (with the original audio data from subsystem 2) a modified version of the program. For example, upmixer 4 may be configured to modify the source position metadata of the program to generate output data indicative of modified source position data which determine a modified trajectory of the object, such that at least a portion of the modified trajectory is outside the subspace. The output data (with the audio content of the object, included in the original audio data from subsystem 2) determine a modified program indicative of the modified trajectory of the object. In response to the modified program, rendering system 5 generates speaker feeds for driving the speakers of array 6 to emit sound that will be perceived as being emitted by the object as it translates along the modified trajectory.
For another example, upmixer 4 may be configured to generate (from the source position metadata of the program) output data indicative of a sequence of characteristic points (one for each of the sequence of source positions indicated by the program), each of the characteristic points being in one of a sequence of 3D spaces (e.g., scaled 3D spaces of the type described above with reference to
The system of
Similarly, the system of
In the case that the inventive system (either a rendering system, e.g., system 3 of
Additional metadata could be included in an object based audio program, to provide to the inventive system (either a system configured to render the program, e.g., system 3 of
Upmixing in accordance with the invention can be directly applied to an object based audio program whose content was object audio from the beginning (i.e., which was originally authored as an object based program). Such upmixing can also be applied to content that has been “objectized” (i.e., converted to an object based audio program) through the use of a source separation upmixer. A typical source separation upmixer would apply analysis and signal processing to content (e.g., an audio program including only speaker channels; not object channels) to separate individual tracks (each corresponding to audio content from an individual audio object) that had been mixed together to generate the content, thereby determining an object channel for each individual audio object.
Aspects of the invention include a system (e.g., an upmixer or a rendering system) configured (e.g., programmed) to perform any embodiment of the inventive method, and a computer readable medium (e.g., a disc or other tangible object) which stores code for implementing any embodiment of the inventive method.
In some embodiments of the inventive method, some or all of the steps described herein are performed simultaneously or in a different order than specified in the examples described herein. Although steps are performed in a particular order in some embodiments of the inventive method, some steps may be performed simultaneously or in a different order in other embodiments.
While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.
This application claims priority to U.S. Provisional Application No. 61/504,005 filed 1 Jul. 2011 and U.S. Provisional Application No. 61/635,930 filed 20 Apr. 2012, both of which are hereby incorporated by reference in their entirety for all purposes.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/US2012/044345 | 6/27/2012 | WO | 00 | 12/12/2013

Number | Date | Country
---|---|---
61/504,005 | Jul 2011 | US
61/635,930 | Apr 2012 | US