One or more implementations relate generally to spatial audio rendering, and more particularly to creating the perception of sound at a virtual auditory source location.
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
There are an increasing number of applications in which it is desirable to create an acoustic sound field that creates the impression of a particular sound scene for listeners within the sound field. One example is the sound created as part of a cinema presentation using newly developed formats that extend the sound field beyond standard 5.1 or 7.1 surround sound systems. The sound field may include elements that are a reproduction of a sound event recorded using one or more microphones. The microphone placement and orientation can be used to capture spatial relationships within an existing sound field. In other cases, an auditory source may be recorded or synthesized as a discrete signal without accompanying location information. In this latter case, location information can be imparted by an audio mixer using a pan control (panner) to specify a desired auditory source location. The audio signal can then be rendered to individual loudspeakers to create the intended auditory impression. A simple example is a two-channel panner that assigns an audio signal to two loudspeakers so as to create the impression of an auditory source somewhere at or between the loudspeakers. In the following, the term “sound” refers to the physical attributes of acoustic vibration, while “auditory” refers to the perception of sound by a listener. Thus, the term “auditory event” generally refers to a perception of sound rather than the physical phenomenon itself.
At present there are several existing rendering methods that generate loudspeaker signals from an input signal to create the desired auditory event at a particular source location. In general, a renderer determines a set of gains, such as one gain value for each loudspeaker output, that is applied to the input signal to generate the associated output loudspeaker signal. The gain value is typically positive, but can be negative (e.g., Ambisonics) or even complex (e.g., amplitude and delay panning, Wavefield Synthesis). Existing audio renderers determine the set of gain values based on the desired, instantaneous auditory source location. Such present systems are competent to recreate static auditory events, i.e., auditory events that emanate from a non-moving, static source in 3D space. However, these systems do not always satisfactorily recreate moving or dynamic auditory events.
To generate a sense of motion through acoustics, the desired source location is time-varying. Analog systems (e.g., pan pots) can provide continuous location updates, while digital panners provide discrete time and location updates. The renderer may then apply gain smoothing to avoid discontinuities or clicks, such as might occur if the gains are changed abruptly in a digital, discrete-time panning and rendering system.
With existing, instantaneous location renderers, the loudspeaker gains are determined based on the instantaneous location of the desired auditory source location. The loudspeaker gains may be based on the relative location of the desired auditory source and the available loudspeakers, the signal level or loudness of the auditory source, or the capabilities of the individual loudspeakers. In many cases, the renderer includes a database describing the location and capabilities of each loudspeaker. In many cases the loudspeaker gains are controlled such that the signal power is preserved, and loudspeaker(s) that are closest to the desired instantaneous auditory location are usually assigned larger gains than loudspeaker(s) that are further away. This type of system does not take into account the trajectory of a moving auditory source, so that the selected loudspeaker may be fine for an instantaneous location of the source, but not for the future location of the source. For example, if the trajectory of the source is front-to-back rather than left-to-right, it may be better to bias the front and rear loudspeakers to play the sound rather than the side loudspeakers, even though the instantaneous location along the trajectory may favor the side loudspeakers.
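For illustration only, a minimal sketch of such an instantaneous, location-based gain assignment might look like the following Python fragment; the loudspeaker layout, the inverse-distance weighting, and the power-preserving normalization are assumptions chosen for the example, not a description of any particular existing renderer:

```python
import numpy as np

def instantaneous_gains(source_pos, speaker_positions, eps=1e-6):
    """Assign larger gains to loudspeakers nearer the instantaneous source
    location, normalized so that the total signal power is preserved."""
    source = np.asarray(source_pos, dtype=float)
    speakers = np.asarray(speaker_positions, dtype=float)
    distances = np.linalg.norm(speakers - source, axis=1)
    weights = 1.0 / (distances + eps)                 # closer speakers get larger weights
    gains = weights / np.sqrt(np.sum(weights ** 2))   # power-preserving normalization
    return gains

# Example: a source near the front-left loudspeaker of a simple 4-speaker layout.
speakers = [(-1, 1, 0), (1, 1, 0), (-1, -1, 0), (1, -1, 0)]
print(instantaneous_gains((-0.9, 0.2, 0), speakers))
```

As the surrounding text notes, a renderer of this kind considers only the instantaneous source location; nothing in the gain computation reflects where the source is heading.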
It is therefore advantageous to provide a method for accommodating the trajectory of a dynamic auditory source in 3D space to determine the most appropriate loudspeakers for gain control so that the motion of the sound is accurately played back with minimal distortion or rendering discontinuities.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. Dolby, Atmos, Dolby Digital Plus, Dolby TrueHD, DD+, and Dolby Pulse are trademarks of Dolby Laboratories.
Embodiments are directed to a method of rendering an audio program by generating one or more loudspeaker channel feeds based on the dynamic trajectory of each audio object in the audio program, wherein the parameters of the dynamic trajectory may be included explicitly in the audio program, or may be derived from the instantaneous location of audio objects at two or more points in time. In this context, an audio program may be accompanied by picture, and may be a complete work intended to be viewed in its entirety (e.g. a movie soundtrack), or may be a portion of the complete work.
Embodiments are further directed to a method of rendering an audio program by defining a nominal loudspeaker map of loudspeakers used for playback in a listening environment, determining a trajectory of an auditory source corresponding to each audio object through 3D space, and deforming the loudspeaker map to create an updated loudspeaker map based on the audio object trajectory to playback audio to match the trajectory of the auditory source as perceived by a listener in the listening environment. The map deformation results in different gains being applied to the loudspeaker feeds. Depending on configuration and in a general case, the loudspeakers may be located in the listening environment, outside the listening environment, or placed behind or within acoustically transparent scrims, screens, baffles, and other structures. Similarly, the auditory location may be within or outside of the listening environment; that is, sounds could be perceived to come from outside of the room or behind the viewing screen.
Embodiments are further directed to a system for rendering an audio program, comprising a first component collecting or deriving dynamic trajectory parameters of each audio object in the audio program, wherein the parameters of the dynamic trajectory may be included explicitly in the audio program or may be derived from the instantaneous location of audio objects at two or more points in time; a second component deforming a loudspeaker map comprising locations of loudspeakers based on the audio object trajectory parameters; and a third component deriving one or more loudspeaker channel feeds based on the instantaneous audio object location and the corresponding deformed loudspeaker map associated with each audio object.
Embodiments are yet further directed to systems and articles of manufacture that perform or embody processing commands that perform or implement the above-described method acts.
Each publication, patent, and/or patent application mentioned in this specification is herein incorporated by reference in its entirety to the same extent as if each individual publication and/or patent application was specifically and individually indicated to be incorporated by reference.
In the following drawings like reference numbers are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.
Systems and methods are described for rendering audio streams to loudspeakers to produce a sound field that creates the perception of a sound at a particular location, the auditory source location, and that accurately reproduces the sound as it moves along a trajectory. This provides an improvement over existing solutions for situations where the intended auditory source location changes with time. In an embodiment, the degree to which each loudspeaker is used to generate the sound field is determined at least in part by the velocity of the auditory source location. Aspects of the one or more embodiments described herein may be implemented in an audio or audio-visual system that processes source audio information in a rendering/encoding system for transmission to a decoding/playback system, wherein both the rendering and playback systems include one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or together with one another in any combination. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address all of these deficiencies.
For purposes of the present description, the following terms have the associated meanings: the term “channel” means an audio signal plus metadata in which the position is explicitly or implicitly coded as a channel identifier, e.g., left-front or right-top surround; “channel-based audio” is audio formatted for playback through a pre-defined set of loudspeaker zones with associated nominal locations, e.g., 5.1, 7.1, and so on; the term “object” or “object-based audio” means one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, etc.; “immersive audio” means channel-based and/or object-based audio signals plus metadata that renders the audio signals based on the playback environment using an audio stream plus metadata in which the position is coded as a 3D position in space; and “listening environment” means any open, partially enclosed, or fully enclosed area, such as a room, that can be used for playback of audio content alone or with video or other content, and can be embodied in a home, cinema, theater, auditorium, studio, game console, and the like.
Further terms in the following description and in relation to one or more of the Figures have the associated definitions, unless stated otherwise: “sound field” means the physical acoustic pressure waves in a space that are perceived as sound; “sound scene” means an auditory environment, whether natural, captured, or created; “virtual sound” means an auditory event in which the apparent auditory source does not correspond with a physical auditory source, such as a “virtual center” created by playing the same signal from a left and right loudspeaker; “render” means conversion of input audio streams and descriptive data (metadata) to streams intended for playback over a specific loudspeaker configuration, where the metadata can include sound location, size, and other descriptive or control information; “panner” means a control device used to indicate intended auditory source location within a sound scene; “panning laws” means the algorithms used to generate per-loudspeaker gains based on auditory source location; and “loudspeaker map” means the set of locations of the available reproduction loudspeakers.
In an embodiment, the rendering system is implemented as part of an audio system that is configured to work with a sound format and processing system that may be referred to as an “immersive audio system” (and which may be referred to as a “spatial audio system,” “hybrid audio system,” or “adaptive audio system” in other related documents). Such a system is based on an audio format and rendering technology to allow enhanced audience immersion, greater artistic control, and system flexibility and scalability. An overall immersive audio system generally comprises an audio encoding, distribution, and decoding system configured to generate one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements (object-based audio). Such a combined approach provides greater coding efficiency and rendering flexibility compared to either channel-based or object-based approaches taken separately.
An example implementation of an immersive audio system and associated audio format is the Dolby® Atmos® platform. Such a system incorporates a height (up/down) dimension that may be implemented as a 9.1 surround system, or similar surround sound configurations. Such a height-based system may be designated by different nomenclature where height loudspeakers are differentiated from floor loudspeakers through an x.y.z designation where x is the number of floor loudspeakers, y is the number of subwoofers, and z is the number of height loudspeakers. Thus, a 9.1 system may be called a 5.1.4 system comprising a 5.1 system with 4 height loudspeakers.
Audio objects can be considered groups of auditory events that may be perceived to emanate from a particular physical location or locations in the listening environment. Such objects can be static (i.e., stationary) or dynamic (i.e., moving). Audio objects are controlled by metadata that defines the position of the sound at a given point in time, along with other functions. When objects are played back, they are rendered according to the positional metadata using the loudspeakers that are present, rather than necessarily being output to a predefined physical channel.
The immersive audio system is configured to support audio beds in addition to audio objects, where beds are effectively channel-based sub-mixes or stems. These can be delivered for final playback (rendering) either individually, or combined into a single bed, depending on the intent of the content creator. These beds can be created in different channel-based configurations such as 5.1, 7.1, and 9.1, and arrays that include overhead loudspeakers, such as shown in
For an immersive audio mix, a playback system can be configured to render and playback audio content that is generated through one or more capture, pre-processing, authoring and coding components that encode the input audio as a digital bitstream. An immersive audio component may be used to automatically generate appropriate metadata through analysis of input audio by examining factors such as source separation and content type. For example, positional metadata may be derived from a multi-channel recording through an analysis of the relative levels of correlated input between channel pairs. Detection of content type, such as speech or music, may be achieved, for example, by feature extraction and classification. Certain authoring tools allow the authoring of audio programs by optimizing the input and codification of the sound engineer's creative intent, allowing the engineer to create, once, a final audio mix that is optimized for playback in practically any playback environment. This can be accomplished through the use of audio objects and positional data that is associated and encoded with the original audio content. Once the immersive audio content has been authored and coded in the appropriate codec devices, it is decoded and rendered for playback through loudspeakers, such as shown in
Many audio programs may feature audio objects that are fixed in space, such as when certain instruments are tied to specific locations in a sound stage. For other audio/visual (e.g., TV, cinema, game) content, however, audio objects are dynamic in that they are associated with objects that move through space, such as cars, planes, birds, etc. Rendering and playback systems mimic or recreate this movement of sound associated with a moving object by sending the audio signal to different loudspeakers in the listening environment so that perceived auditory source location matches the desired location of the object. In general, the frame of reference for the trajectory of the moving object could be the listener, the listening environment itself, or any location within the listening environment.
Embodiments are directed to generating loudspeaker signals (loudspeaker feeds) for audio objects that are situated in and move through 3D space. The audio objects comprise program content that may be provided in various different formats including cinema, TV, streaming audio, live broadcast (and sound), UGC (user generated content), games, and music. Traditional surround sound (and even stereo) is distributed in the form of channel signals (i.e., loudspeaker feeds) where each audio track delivered is intended to be played over a specific loudspeaker (or loudspeaker array) at a nominal location in the listening environment. Object-based audio comprises an audio program that is distributed in the form of a “scene description” consisting of audio signals and their location properties. For streaming audio, the program may be received and played back while being delivered.
The authored content generated by component 212 represents the audio program to be transmitted over link 213. The audio program is generally prepared for transmission using a content encoder. In general the audio is also combined with other parts of the program that may include associated video and subtitles (e.g., digital cinema). The link 213 may comprise a direct connection, physical media, short or long-distance network link, Internet connection, wireless transmission link, or any other appropriate transmission link for transmitting the digital A/V program data.
The playback environment typically comprises a movie theatre or similar venue for playback of a movie and associated audio (cinema content) to an audience, but any room or environment is possible. The encoded program content transmitted over link 213 is received and decoded from the transmission format. Renderer 214 takes in the audio program and renders the audio based on a map of the local playback loudspeaker configuration 216 for playback through loudspeakers 218 in the listening environment. The renderer outputs channel-based audio 219 that comprises loudspeaker feeds to the individual playback loudspeakers 218. The overall playback stage may include one or more amplifier, buffer, or sound processing components that amplify and process the audio for playback through loudspeakers. The loudspeakers typically comprise an array of loudspeakers, such as a surround-sound array or immersive audio loudspeaker array, such as shown in
The description of the arrangement of loudspeakers in the listening environment with respect to the physical location of each loudspeaker relative to the other loudspeakers and the audio boundaries (wall/floor/ceiling) of the room represents a loudspeaker map. For the example of
In an embodiment in which the program content comprises immersive audio, the renderer 214 converts the object-based scene description into channel signals. With object-based audio distribution, the renderer operates in the listening environment, and combines the audio scene description and the room description (loudspeaker map) to compute channel signals. A similar process is followed during program authoring. In particular, the authoring process involves capturing the input of the mix engineer using the mixing tool, such as by turning pan pots, or moving a joystick, and then converting the output to loudspeaker feeds using a renderer. In this case, the transmission link 413 is a direct connection with little or no encoding or decoding, and the loudspeaker map 216 describes the playback equipment in the authoring environment.
For the embodiment of
As shown in
With respect to the step 402 of estimating the current velocity, at a given point in time, for each auditory source to be rendered, the process estimates the velocity based on previous, current and/or future auditory source locations. The velocity comprises one or both of speed and direction of the auditory source. The trajectory may thus comprise a velocity as well as a change in velocity of the audio object, such as a change in speed (slowing down or speeding up) or a change in direction of the audio object. The trajectory of an audio object thus represents higher-order position information of the audio object as manifested by the change in instantaneous location of the apparent auditory source of the object over time.
The derivation of future information may depend on the type of content comprising the audio program. If the content is cinema content, typically the whole program file is provided to the renderer. In this case, future information is derived simply by looking ahead in the file by an appropriate amount of time (e.g., 1 second ahead, 1/10 second ahead, and so on). In the case of streaming content or instantaneously generated content in which the entire file is not available, a buffer and delay scheme may be utilized in which playback is delayed by an appropriate amount of time (e.g., 1 second or 1/10 second, etc.). This delay provides a look-ahead capability that allows for derivation of future locations. In some cases, if future auditory source locations are used, algorithmic latency must be accounted for as part of the system design. In some systems, the audio program to be rendered may include velocity as part of the sound scene description, in which case velocity need not be computed.
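As one illustration of the buffer and delay scheme described above, the following sketch delays playback by a fixed number of position updates so that a “future” location is always available when the current one is consumed; the class name, the buffer depth, and the assumption of periodic position updates are illustrative only:

```python
from collections import deque

class LookAheadPositionBuffer:
    """Buffer incoming object positions so that a future position is available
    for velocity estimation. Playback is effectively delayed by the buffer depth,
    which is the algorithmic latency mentioned in the text."""

    def __init__(self, look_ahead_updates=10):   # e.g. 1/10 s at 100 updates/s
        self.look_ahead = look_ahead_updates
        self.positions = deque()

    def push(self, position):
        """Store one incoming (x, y, z) position update."""
        self.positions.append(position)

    def current_and_future(self):
        """Return (current, future) positions once enough updates are buffered,
        where 'future' lies look_ahead updates after 'current'."""
        if len(self.positions) <= self.look_ahead:
            return None                           # still filling the delay line
        current = self.positions.popleft()
        future = self.positions[self.look_ahead - 1]
        return current, future
```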
With respect to the step of deforming the nominal loudspeaker map, 404, at a given point in time, for each auditory source to be rendered, the process modifies the nominal loudspeaker map based on the object velocity. The nominal loudspeaker map represents an initial layout of loudspeakers (such as shown in
Although embodiments are described with respect to trajectory based on velocity of an audio object or auditory source, it should be noted that the trajectory could also or instead be based on the acceleration of the auditory source, the variance of the direction of the auditory source, or past and future values of the auditory source velocity.
In an embodiment the renderer thus begins with a nominal map defining loudspeaker locations in the listening environment. This can be defined in an AVR or cinema processor using known loudspeaker location definitions (e.g., left front, right front, center, etc.). The loudspeaker map is then deformed so as to modify the signals that are derived and reproduced over the loudspeakers. In particular, the loudspeaker map may be deformed using appropriate gain values sent to each of the loudspeakers so that the sound scene may effectively collapse in a given direction, such as shown in
In an embodiment, audio object (auditory source) location is sent to the renderer at regular intervals, such as 100 times/second, or any other appropriate interval, at a time (e.g., 1/10 second) in the future. The renderer then determines how much gain to apply to each loudspeaker to accurately reproduce an instantaneous location of the object at that time. The frequency of the updates and the amount of time delay (look ahead) can be set by the renderer, or these may be parameters that can be set based on actual configuration and content requirements.
In an embodiment, a location-based renderer is used to determine the loudspeaker gains based on source location, the deformed loudspeaker map, and preferred panning laws. This may represent renderer 214 of
In an alternative embodiment, other features of the auditory source location may be computed, such as auditory source acceleration, rate of change of auditory source velocity direction, or the variance of the auditory source velocity. In some systems, the audio program to be rendered may include auditory source velocity, or other parameters, as part of the sound scene description, in which case the velocity and/or other parameters need not be estimated at the time of playback. The map scaling may alternatively or additionally be determined by the auditory source acceleration, rate of change of auditory source velocity direction, or the variance of the auditory source velocity.
Hence, a method 400 for rendering an audio program is described. The audio program may comprise one of: an audio file downloaded in its entirety to a playback processor including a renderer 214, and streaming digital audio content. The audio program comprises one or more audio objects 506, which are to be rendered as part of the audio program. Furthermore, the audio program may comprise one or more audio beds. The method 400 may comprise determining a nominal loudspeaker map representing a layout of loudspeakers 508 used for playback of the audio program. The loudspeakers 508 may be arranged in a listening environment 502 such as a cinema. The loudspeakers 508 may be located within the listening environment 502 in accordance with the nominal loudspeaker map. As such, the nominal loudspeaker map may correspond to the physical layout of loudspeakers 508 within a listening environment 502.
The method 400 may further comprise determining 402 a trajectory of an audio object 506 of the audio program from and/or to a source location through 3D space. The audio object 506 may be positioned at a first time instant at the (current) source location. Furthermore, the audio object 506 may move away from the (current) source location through 3D space at later time instants according to the determined trajectory. As such, the trajectory may comprise or may indicate a direction of motion of the audio object 506 starting from the (current) source location. In particular, the trajectory may comprise or may indicate a difference of location of the audio object 506 at a first time instant and at a (subsequent) second time instant. In other words, the trajectory may indicate a sequence of different locations at a corresponding sequence of subsequent time instants.
The trajectory may be determined based at least in part on past, present, and/or future location values of the audio object 506. As such, the trajectory is indicative of the object location and of object change information. The future location values may be determined by one of: looking ahead in an audio file containing the audio object 506, and using a latency factor created by a delay in playback of the audio program. The trajectory may further comprise or may further indicate a velocity or speed and/or an acceleration/deceleration of the audio object 506. The direction of motion, the velocity and/or the change of velocity of the trajectory may be determined based on the location values (which indicate the location of the audio object 506 within the 3D space, as a function of time).
The method 400 may further comprise deforming 404 the nominal loudspeaker map such that the map is scaled relative to the source location in the direction of motion of the audio object 506, to create an updated loudspeaker map. In other words, the nominal loudspeaker map may be scaled to move the loudspeakers 508 which are arranged to the left and to the right of the direction of motion of the audio object 506 closer to or further away from the audio object 506. A degree of scaling of the nominal loudspeaker map may depend on the velocity of the audio object 506. In particular, the degree of scaling may increase with increasing velocity of the audio object 506 or may decrease with decreasing velocity of the audio object 506. As such, the loudspeakers of the updated loudspeaker map may be moved towards the trajectory of the audio object 506, thereby moving the loudspeakers 508 into a collapsed region 510 around the trajectory of the audio object 506. The width of this region 510 perpendicular to the trajectory of the audio object 506 may decrease with increasing velocity of the audio object 506 (and vice versa). By making the degree of scaling dependent on the velocity of the audio object 506, the rendering of moving audio objects 506 may be improved further.
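A minimal sketch of such a velocity-dependent map deformation is shown below; the specific contraction law (an inverse relationship between speed and the perpendicular scale factor), the constant k, and the floor min_scale are illustrative assumptions rather than a prescribed implementation:

```python
import numpy as np

def deform_speaker_map(speakers, source_pos, velocity, k=0.5, min_scale=0.2):
    """Scale the nominal loudspeaker map relative to the source location so that
    it contracts perpendicular to the direction of motion; the faster the object
    moves, the stronger the contraction (the narrower the collapsed region)."""
    speakers = np.asarray(speakers, dtype=float)
    source = np.asarray(source_pos, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    speed = np.linalg.norm(velocity)
    if speed == 0.0:
        return speakers.copy()                    # static object: keep the nominal map
    direction = velocity / speed
    # Contraction factor across the trajectory (1 = no deformation).
    scale = max(min_scale, 1.0 / (1.0 + k * speed))
    rel = speakers - source
    along = np.outer(rel @ direction, direction)  # component along the motion
    perp = rel - along                            # component across the motion
    return source + along + scale * perp
```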
The step of deforming 404 the nominal loudspeaker map may comprise determining gain values for the loudspeakers 508 such that loudspeakers 508 along the direction of motion of the audio object 506 (i.e., to the left and right of the direction of motion) move closer to the source location and/or closer to the trajectory of the audio object 506. By determining such gain values for the loudspeakers 508, the loudspeakers 508 are mapped to a collapsed region 510 which follows the shape of the trajectory of the audio object 506. As such, the task of selecting two or more loudspeakers 508 for rendering sound that is associated with the audio object 506 is simplified. Furthermore, a smooth transition between selected loudspeakers 508 along the trajectory of the audio object 506 may be achieved, thereby enabling a consistent rendering of moving audio objects 506.
The method 400 may further comprise determining 406 loudspeaker gains for the loudspeakers 508 for rendering the audio object 506 based on the trajectory, based on the nominal loudspeaker map and based on a panning law. In particular, the loudspeaker gains may be determined based on the updated loudspeaker map and based on a panning law (and possibly based on the source location). The panning law may be used for determining the loudspeaker gains for the loudspeakers 508 based on a relative position of the loudspeakers 508 in the updated loudspeaker map. Furthermore, the trajectory and/or the (current) source location may be taken into consideration by the panning law. By way of example, the two loudspeakers 508 in the updated loudspeaker map which are closest to the (current) source location of the audio object 506 may be selected for rendering the sound associated with the audio object 506. The sound may then be panned between the two selected loudspeakers 508. As such, panning of audio objects 506 may be improved and simplified by deforming a nominal loudspeaker map based on the trajectory of the audio object 506. In particular, at each time instant (at which a panning law is to be applied, e.g. at a periodic rate), the two loudspeakers 508 from the updated (i.e. deformed) loudspeaker map which are closest to the current source location of the audio object 506 may be selected for panning the sound that is associated with the audio object 506. By doing this, a smooth and consistent rendering of moving audio objects 506 may be achieved.
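For illustration, the following sketch selects the two loudspeakers of the updated (deformed) map that are closest to the current source location and pans between them; the sine/cosine constant-power law is used here only as a simple stand-in for whatever panning law the renderer prefers:

```python
import numpy as np

def pan_on_deformed_map(source_pos, deformed_speakers):
    """Pick the two loudspeakers of the deformed map closest to the current
    source location and pan between them with a constant-power law."""
    source = np.asarray(source_pos, dtype=float)
    speakers = np.asarray(deformed_speakers, dtype=float)
    distances = np.linalg.norm(speakers - source, axis=1)
    nearest = np.argsort(distances)[:2]          # indices of the two closest speakers
    d0, d1 = distances[nearest]
    # Pan position between the pair: 0 -> first speaker only, 1 -> second only.
    x = d0 / (d0 + d1) if (d0 + d1) > 0 else 0.5
    gains = np.zeros(len(speakers))
    gains[nearest[0]] = np.cos(0.5 * np.pi * x)  # constant-power (sine/cosine) law
    gains[nearest[1]] = np.sin(0.5 * np.pi * x)
    return gains
```

In this sketch the simplification described above is visible directly: because the map has already been deformed to follow the trajectory, the nearest-speaker selection changes smoothly as the object moves.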
In other words, a method 400 for rendering a moving audio object 506 of an audio program in a consistent manner is described. A trajectory of the audio object 506 starting from a current source location of the audio object 506 is determined. Furthermore, a nominal loudspeaker map is determined, which indicates the layout of loudspeakers 508 within a listening environment 502. The nominal loudspeaker map may be deformed based on the trajectory of the audio object 506 (i.e. based on the current, and past and/or future locations of the audio object). The nominal loudspeaker map may be deformed by scaling the nominal loudspeaker map relative to the source location in the direction of motion of the audio object 506. As a result of this, an updated loudspeaker map is obtained which follows the trajectory of the audio object 506. The loudspeaker gains for the loudspeakers 508 for rendering the audio object 506 may then be determined based on the updated loudspeaker map and based on a panning law (and possibly based on the source location).
As a result of using the updated loudspeaker map for determining the loudspeaker gains, panning of the sound associated with the audio object 506 is simplified. In particular, the selection of the appropriate loudspeakers 508 for rendering the sound associated with the audio object 506 along the trajectory is simplified, due to the fact that the loudspeakers 508 have been scaled to follow the trajectory of the audio object 506. This enables a smooth and consistent rendering of the sound associated with moving audio objects 506.
The method 400 may be applied to a plurality of different audio objects 506 of an audio program. Due to the different trajectories of the different audio objects 506, the nominal loudspeaker map is typically deformed differently for the different audio objects 506.
The method 400 may further comprise generating loudspeaker signals feeding the loudspeakers 508 (i.e. generating loudspeaker feeds) using the loudspeaker gains. In particular, the sound associated with the audio object 506 may be amplified / attenuated with the loudspeaker gains for the different loudspeakers 508, thereby generating the different loudspeaker signals for the different loudspeakers 508. As indicated above, this process may be repeated at a periodic rate (e.g. 100 times/second), in order to update the loudspeaker gains for the updated source location of the audio object 506. By doing this, the sound associated with the audio object 506 may be rendered smoothly along the trajectory of the moving audio object 506.
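A sketch of this per-block rendering step is shown below; the linear gain ramp between successive updates is an illustrative choice for the gain smoothing mentioned earlier, not a mandated interpolation scheme:

```python
import numpy as np

def render_block(object_samples, previous_gains, target_gains):
    """Apply per-loudspeaker gains to one block of object audio, ramping linearly
    from the previous gains to the new ones so that periodic gain updates
    (e.g. 100 times per second) do not produce audible discontinuities."""
    object_samples = np.asarray(object_samples, dtype=float)  # shape: (n_samples,)
    prev = np.asarray(previous_gains, dtype=float)            # shape: (n_speakers,)
    target = np.asarray(target_gains, dtype=float)
    ramp = np.linspace(0.0, 1.0, len(object_samples))
    # gains[s, n]: gain of loudspeaker s at sample n within the block
    gains = prev[:, None] + (target - prev)[:, None] * ramp[None, :]
    return gains * object_samples[None, :]                    # per-speaker feeds
```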
The method 400 may comprise encoding the trajectory as metadata defining e.g. instantaneous x, y, z position coordinates of the audio object 506, which are updated at the defined periodic rate. The method 400 may further comprise transmitting the metadata with the loudspeaker gains from a renderer 214.
The audio program may be part of audio/visual content and the direction of motion of the audio object 506 may be determined based on a visual representation of the audio object 506 comprised within the audio/visual content. As such, the trajectory of an audio object 506 may be determined to be consistent with the visual representation of the audio object 506.
Furthermore, a system for rendering an audio program is described. The system comprises a component for determining a nominal loudspeaker map representing a layout of loudspeakers 508 used for playback of the audio program. The system also comprises a component for determining a trajectory of an audio object 506 of the audio program from and/or to a source location through 3D space, wherein the trajectory comprises a direction of motion of the audio object 506 from and/or to the source location. In addition, the system may comprise a component for deforming the nominal loudspeaker map such that the map is scaled relative to the source location in the direction of motion of the audio object 506, to create an updated loudspeaker map. Furthermore, the system comprises a component for determining loudspeaker gains for the loudspeakers 508 for rendering the audio object 506 based on the source location, based on the updated loudspeaker map and based on a panning law. The panning law may determine the loudspeaker gains for the loudspeakers based on a relative position of the loudspeakers 508 in the updated loudspeaker map and the source location. The system may further comprise an encoder for encoding the trajectory as a trajectory description that includes a current instantaneous location of the audio object 506 as well as information on how the location of the audio object 506 changes with time.
In an embodiment, the immersive audio system includes components that generate metadata from an original spatial audio format. The methods and components of the described systems comprise an audio rendering system configured to process one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. The audio content thus comprises audio objects, channels, and position metadata. Metadata is generated in the audio workstation in response to the engineer's mixing inputs to provide rendering cues that control spatial parameters (e.g., position, velocity, intensity, timbre, etc.) and specify which driver(s) or loudspeaker(s) in the listening environment play respective sounds during playback. The metadata is associated with the respective audio data in the workstation for packaging and transport by an audio processor.
In an embodiment, the audio type (i.e., channel or object-based audio) metadata definition is added to, encoded within, or otherwise associated with the metadata payload transmitted as part of the audio bitstream processed by an immersive audio processing system. In general, authoring and distribution systems for immersive audio create and deliver audio that allows playback via fixed loudspeaker locations (left channel, right channel, etc.) and object-based audio elements that have generalized 3D spatial information including position, size and velocity. The system provides useful information about the audio content through metadata that is paired with the audio essence by the content creator at the time of content creation/authoring. The metadata thus encodes detailed information about the attributes of audio that can be used during rendering. Such attributes may include content type (e.g., dialog, music, effect, Foley, background/ambience, etc.) as well as audio object information such as spatial attributes (e.g., 3D position, object size, velocity, etc.) and useful rendering information (e.g., snap to loudspeaker location, channel weights, gain, ramp, bass management information, etc.). The audio content and reproduction intent metadata can either be manually created by the content creator or created through the use of automatic, media intelligence algorithms that can be run in the background during the authoring process and be reviewed by the content creator during a final quality control phase if desired.
Many other metadata types may be defined by the audio processing framework. In general, a metadatum consists of an identifier, a payload size, an offset into the data buffer, and an optional payload. Many metadata types do not have any actual payload, and are purely informational. For instance, the “sequence start” and “sequence end” signaling metadata have no payload, as they are just signals without further information. The actual object audio metadata is carried in “Evolution” frames, and the metadata type for Evolution has a payload size equal to the size of the Evolution frame, which is not fixed and can change from frame to frame. The term Evolution frame generally refers to a secure, extensible metadata packaging and delivery framework in which a frame can contain one or more metadata payloads and associated timing and security information. Although embodiments are described with respect to Evolution frames, it should be noted that any appropriate frame configuration that provides similar capabilities may be used.
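Purely as a schematic illustration of the metadatum structure just described (identifier, payload size, offset into the data buffer, and optional payload), and not a description of the actual Evolution bitstream layout, such a record could be represented as:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Metadatum:
    """Schematic metadatum record: identifier, payload size, buffer offset,
    and an optional payload. Field widths and example values are illustrative."""
    identifier: int
    payload_size: int          # 0 for purely informational types
    buffer_offset: int
    payload: Optional[bytes] = None

# Purely informational signals such as "sequence start" carry no payload.
sequence_start = Metadatum(identifier=0x01, payload_size=0, buffer_offset=0)
```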
In an embodiment, the metadata conforms to a standard defined for the Dolby Atmos system. Such a format is defined in the WD Standard SMPTE 429-XX:20YY entitled “Immersive Audio Bitstream Specification.”
In an embodiment, the metadata package includes audio object location information in the form of (x,y,z) coordinates as 16-bit scalar values, with updates corresponding to a rate of up to 192 times per second, where sb denotes the time index of an update.
The velocity is computed based on current and past values as follows:
velocity[sb]=(ObjectPosX[sb]−ObjectPosX[sb−n])/n*x+(ObjectPosY[sb]−ObjectPosY[sb−n])/n*y+(ObjectPosZ[sb]−ObjectPosZ[sb−n])/n*z
In the above expressions, n is the time interval over which to estimate the average velocity, and x,y,z are unit vectors in the location coordinate space.
Alternatively, by reading ahead in a file, or by introducing latency in a streaming application, the velocity can be computed over a time interval centered on the current time, sb:
velocity[sb]=(ObjectPosX[sb+n/2]−ObjectPosX[sb−n/2])/n*x+(ObjectPosY[sb+n/2]−ObjectPosY[sb−n/2])/n*y+(ObjectPosZ[sb+n/2]−ObjectPosZ[sb−n/2])/n*z
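The two finite-difference estimates above translate directly into code; in the following sketch the array layout (one (x, y, z) row per time index sb) and the example values are illustrative assumptions:

```python
import numpy as np

def velocity_backward(positions, sb, n):
    """Backward-difference estimate: average velocity over the past n updates."""
    return (positions[sb] - positions[sb - n]) / n

def velocity_centered(positions, sb, n):
    """Centered estimate using look-ahead: average velocity over a window of
    n updates centered on the current time index sb."""
    return (positions[sb + n // 2] - positions[sb - n // 2]) / n

# positions[sb] holds the (x, y, z) coordinates of the object at time index sb.
positions = np.array([[0.0, 0.0, 0.0],
                      [0.1, 0.0, 0.0],
                      [0.2, 0.1, 0.0],
                      [0.3, 0.2, 0.0],
                      [0.4, 0.4, 0.0]])
v = velocity_centered(positions, sb=2, n=4)
speed = np.linalg.norm(v)   # scalar speed, if only the magnitude is needed
```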
Embodiments have been described for a system that uses different loudspeakers in a listening environment to generate a different sound field (i.e., change the physical sound attributes), with the intention of having listeners perceive the sound scene exactly as described in the soundtrack by maintaining the perceived auditory attributes.
Although embodiments have been described with respect to digital audio signals and program transmission using digital bitstreams, it should be noted that the audio content and associated transfer function information may instead comprise analog signals. In this case, the transfer function can be encoded and defined, or a transfer function preset selected, using analog signals such as tones. Alternatively, for analog or digital programs, the target transfer function could be described using an audio signal; for example, a signal with flat frequency response (e.g. a tone sweep or pink noise) could be processed using a pre-emphasis filter so as to give a flat response when the desired transfer function (acting as a de-emphasis filter) is applied.
Furthermore, although embodiments have been primarily described in relation to content and distribution for cinema (movie) applications, it should be noted that embodiments are not so limited. The playback environment may be a cinema or any other appropriate listening environment for any type of audio content, such as a home, room, car, small auditorium, outdoor venue, and so on.
Aspects of the methods and systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the immersive audio system may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof. In an embodiment in which the network comprises the Internet, one or more machines may be configured to access the Internet through web browser programs.
One or more of the components, blocks, processes or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
Embodiments are further directed to systems and articles of manufacture that perform or embody processing commands that perform or implement the above-described method acts, such as those illustrated in the flowchart of
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number, respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
Various aspects of the present invention may be appreciated from the following enumerated example embodiments (EEEs).
defining a nominal loudspeaker map of loudspeakers used for playback of the audio program;
determining a trajectory of an auditory source corresponding to one or more audio objects through 3D space;
generating loudspeaker signals feeding the loudspeakers based on the one or more audio object trajectories; and
defining a loudspeaker map of loudspeakers used for playback in a listening environment;
determining a subsequent location of the audio object at a second time, the difference in location between the first time and second time defining a trajectory of the audio object through 3D space; and
using the trajectory to change loudspeaker feed signals to the loudspeakers by applying different loudspeaker gains to same or different sets of loudspeakers while maintaining perceived auditory attributes of the audio object.
EEE 21. A system for rendering an audio program, comprising:
This application claims priority to U.S. Provisional Patent Application No. 62/221,536, filed Sep. 21, 2015, and European Patent Application No. 15192091.5, filed Oct. 29, 2015, both of which are hereby incorporated by reference in their entireties.