One or more embodiments relate generally to audio signal processing, and more specifically to clustering audio objects based on perceptual criteria to compress object-based audio data for efficient coding and/or rendering through various playback systems.
The advent of object-based audio has significantly increased the amount of audio data and the complexity of rendering this data within high-end playback systems. For example, cinema sound tracks may comprise many different sound elements corresponding to images on the screen, dialog, noises, and sound effects that emanate from different places on the screen and combine with background music and ambient effects to create the overall auditory experience. Accurate playback requires that sounds be reproduced in a way that corresponds as closely as possible to what is shown on screen with respect to sound source position, intensity, movement, and depth. Object-based audio represents a significant improvement over traditional channel-based audio systems that send audio content in the form of speaker feeds to individual speakers in a listening environment, and are thus relatively limited with respect to spatial playback of specific audio objects.
The introduction of digital cinema and the development of three-dimensional (“3D”) content have created new standards for sound, such as the incorporation of multiple channels of audio to allow for greater creativity for content creators, and a more enveloping and realistic auditory experience for audiences. Expanding beyond traditional speaker feeds and channel-based audio as a means for distributing spatial audio is critical, and there has been considerable interest in a model-based audio description that allows the listener to select a desired playback configuration with the audio rendered specifically for their chosen configuration. The spatial presentation of sound utilizes audio objects, which are audio signals with associated parametric source descriptions of apparent source position (e.g., 3D coordinates), apparent source width, and other parameters. As a further advancement, a next generation spatial audio (also referred to as “adaptive audio”) format has been developed that comprises a mix of audio objects and traditional channel-based speaker feeds (beds), along with positional metadata for the audio objects.
In some soundtracks, there may be several (e.g., 7, 9, or 11) bed channels containing audio. Additionally, based on the capabilities of an authoring system, there may be tens or even hundreds of individual audio objects that are combined during rendering to create a spatially diverse and immersive audio experience. In some distribution and transmission systems, there may be large enough available bandwidth to transmit all audio beds and objects with little or no audio compression. In some cases, however, such as Blu-ray disc, broadcast (cable, satellite and terrestrial), mobile (3G and 4G) and over-the-top (OTT, or Internet) distribution, there may be significant limitations on the available bandwidth to digitally transmit all of the bed and object information created at the time of authoring. While audio coding methods (lossy or lossless) may be applied to the audio to reduce the required bandwidth, audio coding alone may not be sufficient to reduce the bandwidth required to transmit the audio, particularly over very limited networks such as mobile 3G and 4G networks.
Some prior methods have been developed to reduce the number of input objects and beds to a smaller set of output objects by means of clustering. Essentially, objects with similar spatial or rendering attributes are combined into a single merged object or a smaller number of new, merged objects. The merging process encompasses combining the audio signals (for example by summation) and the parametric source descriptions (for example by averaging). The allocation of objects to clusters in these previous methods is based on spatial proximity. That is, objects that have similar parametric position data are combined into one cluster while ensuring a small spatial error for each object individually. This process is generally effective as long as the spatial positions of all perceptually relevant objects in the content allow for such clustering with reasonably small error. In very complex content, however, with many objects active simultaneously having a sparse spatial distribution, the number of output clusters required to accurately model such content can become significant when only moderate spatial errors are tolerated. Alternatively, if the number of output clusters is restricted, such as due to bandwidth or complexity constraints, complex content may be reproduced with degraded spatial quality due to the constrained clustering process and the resulting significant spatial errors. Hence in that case, the use of proximity alone to define the clusters often returns suboptimal results. In this case, the importance of the objects themselves, as opposed to just their spatial position, should be taken into account to optimize the perceived quality of the clustering process.
Other solutions have also been developed to improve the clustering process. One such solution is a culling process that removes objects that are perceptually irrelevant, such as objects that are masked or silent. Although this process helps to improve the clustering process, it does not provide an improved clustering result if the number of perceptually relevant objects is larger than the number of available output clusters.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
Some embodiments are directed to compressing object-based audio data for rendering in a playback system by identifying a first number of audio objects to be rendered in a playback system, where each audio object comprises audio data and associated metadata; defining an error threshold for certain parameters encoded within the associated metadata for each audio object; and grouping audio objects of the first number of audio objects into a reduced number of audio objects based on the error threshold so that the amount of data for the audio objects transmitted through the playback system is reduced.
Some embodiments are further directed to rendering object-based audio by identifying a spatial location of each object of a number of objects at defined time intervals, and grouping at least some of the objects into one or more time-varying clusters based on a maximum distance between pairs of objects and/or distortion errors caused by the grouping on certain other characteristics associated with the objects.
Some embodiments are directed to a method of compressing object-based audio data for rendering in a playback system by determining a perceptual importance of objects in an audio scene, wherein the objects comprise object audio data and associated metadata, and combining certain audio objects into clusters of audio objects based on the determined perceptual importance of the objects, wherein a number of clusters is less than an original number of objects in the audio scene. In this method, the perceptual importance may be a value derived from at least one of a loudness value and a content type of the respective object, and the content type is at least one of dialog, music, sound effects, ambiance, and noise.
In an embodiment of the method, the content type is determined by an audio classification process that receives an input audio signal for the audio objects and the loudness is obtained by a perceptual model based on a calculation of excitation levels in critical frequency bands of the input audio signal, with the method further comprising defining a centroid for a cluster around a first object of the audio objects and aggregating all excitation of the audio objects. The loudness value may be dependent at least in part on spatial proximity of a respective object to the other objects, and the spatial proximity may be defined at least in part by a position metadata value of the associated metadata for the respective object. The act of combining may cause certain spatial errors associated with each clustered object. In an embodiment, the method further comprises clustering the objects such that a spatial error is minimized for objects of relatively high perceptual importance. In an embodiment, the determined perceptual importance of the objects depends on the relative spatial location of the objects in the audio scene, and the step of combining further comprises determining a number of centroids, with each centroid comprising a center of a cluster for grouping a plurality of audio objects, the centroid positions being dependent on the perceptual importance of one or more audio objects relative to other audio objects, and grouping the objects into one or more clusters by distributing object signals across the clusters. The clustering may further comprise grouping an object with a nearest neighbor, or distributing an object over one or more clusters using a panning method.
The act of combining audio objects may involve combining waveforms embodying the audio data for the constituent objects within the same cluster together to form a replacement object having a combined waveform of the constituent objects, and combining the metadata for the constituent objects within the same cluster together to form a replacement set of metadata for the constituent objects.
Some embodiments are further directed to a method of rendering object-based audio by determining a first spatial location of each object relative to the other objects of a plurality of audio objects, determining a relative importance of each audio object of the plurality of audio objects, said relative importance depending on the relative spatial locations of the objects, determining a number of centroids, each centroid comprising a center of a cluster for grouping a plurality of audio objects, the centroid positions being dependent on the relative importance of one or more audio objects, and grouping the objects into one or more clusters by distributing object signals across the clusters. This method may further comprise determining a partial loudness of each audio object of the plurality of audio objects and a content type and associated content type importance of each audio object of the plurality of audio objects. In an embodiment, the partial loudness and the content type of each audio object are combined to determine the relative importance of a respective audio object. Objects are clustered such that a spatial error is minimized for objects of relatively high perceptual importance, where the spatial error may be caused by moving an object from a first perceived source location to a second perceived source location when clustered with other objects.
Some further embodiments are described for systems or devices and computer-readable media that implement the embodiments for the method of compressing or the method of rendering described above.
The methods and systems described herein may be implemented in an audio format and system that includes updated content creation tools, distribution methods and an enhanced user experience based on an adaptive audio system that includes new speaker and channel configurations, as well as a new spatial description format made possible by a suite of advanced content creation tools. In such a system, audio streams (generally including channels and objects) are transmitted along with metadata that describes the content creator's or sound mixer's intent, including desired position of the audio stream. The position can be expressed as a named channel (from within the predefined channel configuration) or as three-dimensional (3D) spatial position information.
Each publication, patent, and/or patent application mentioned in this specification is herein incorporated by reference in its entirety to the same extent as if each individual publication and/or patent application was specifically and individually indicated to be incorporated by reference.
In the following drawings like reference numbers are used to refer to like elements. Although the following figures depict various examples, the one or more implementations are not limited to the examples depicted in the figures.
Systems and methods are described for an object clustering-based compression scheme for object-based audio data. Embodiments of the clustering scheme utilize the perceptual importance of objects when allocating objects to clusters, and expand on clustering methods that are purely position- and proximity-based. A perceptual-based clustering system augments proximity-based clustering with perceptual correlates derived from the audio signals of each object to obtain an improved allocation of objects to clusters in constrained conditions, such as when the number of perceptually relevant objects is larger than the number of output clusters.
In an embodiment of an audio processing system, an object combining or clustering process is controlled in part by the spatial proximity of the objects, and in part by certain perceptual criteria. In general, clustering objects results in a certain amount of error since not all input objects can maintain spatial fidelity when clustered with other objects, especially in applications where a large number of objects are sparsely distributed. Objects with relatively high perceived importance are favored by the clustering process in terms of minimizing spatial/perceptual errors. The object importance can be based on factors such as partial loudness, which is the perceived loudness of an object taking into account masking by the other objects in the scene, and content semantics or type (e.g., dialog, music, effects, etc.).
Aspects of the one or more embodiments described herein may be implemented in an audio or audio-visual (AV) system that processes source audio information in a mixing, rendering and playback system that includes one or more computers or processing devices executing software instructions. Any of the described embodiments may be used alone or together with one another in any combination. Although various embodiments may have been motivated by various deficiencies with the prior art, which may be discussed or alluded to in one or more places in the specification, the embodiments do not necessarily address any of these deficiencies. In other words, different embodiments may address different deficiencies that may be discussed in the specification. Some embodiments may only partially address some deficiencies or just one deficiency that may be discussed in the specification, and some embodiments may not address any of these deficiencies.
For purposes of the present description, the following terms have the associated meanings: the term “channel” or “bed” means an audio signal plus metadata in which the position is coded as a channel identifier, e.g., left-front or right-top surround; “channel-based audio” is audio formatted for playback through a pre-defined set of speaker zones with associated nominal locations, e.g., 5.1, 7.1, and so on; the term “object” or “object-based audio” means one or more audio channels with a parametric source description, such as apparent source position (e.g., 3D coordinates), apparent source width, etc.; “adaptive audio” means channel-based and/or object-based audio signals plus metadata that renders the audio signals based on the playback environment using an audio stream plus metadata in which the position is coded as a 3D position in space; and “rendering” means conversion to electrical signals used as speaker feeds.
In an embodiment, the scene simplification process using object clustering is implemented as part of an audio system that is configured to work with a sound format and processing system that may be referred to as a “spatial audio system” or “adaptive audio system.” Such a system is based on an audio format and rendering technology to allow enhanced audience immersion, greater artistic control, and system flexibility and scalability. An overall adaptive audio system generally comprises an audio encoding, distribution, and decoding system configured to generate one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. Such a combined approach provides greater coding efficiency and rendering flexibility compared to either channel-based or object-based approaches taken separately. An example of an adaptive audio system that may be used in conjunction with present embodiments is described in pending International Patent Application No. PCT/US2012/044388 filed 27 Jun. 2012, and entitled “System and Method for Adaptive Audio Signal Generation, Coding and Rendering,” which is hereby incorporated by reference. An example implementation of an adaptive audio system and associated audio format is the Dolby® Atmos™ platform. Such a system incorporates a height (up/down) dimension that may be implemented as a 9.1 surround system, or similar surround sound configuration.
Audio objects can be considered individual or collections of sound elements that may be perceived to emanate from a particular physical location or locations in the listening environment. Such objects can be static (that is, stationary) or dynamic (that is, moving). Audio objects are controlled by metadata that defines the position of the sound at a given point in time, along with other functions. When objects are played back, they are rendered according to the positional metadata using the speakers that are present, rather than necessarily being output to a predefined physical channel. A track in a session can be an audio object, and standard panning data is analogous to positional metadata. In this way, content placed on the screen might pan in effectively the same way as with channel-based content, but content placed in the surrounds can be rendered to individual speakers, if desired. While the use of audio objects provides control over discrete effects, other aspects of a soundtrack may work more effectively in a channel-based environment. For example, many ambient effects or reverberation actually benefit from being fed to arrays of speakers rather than individual drivers. Although these could be treated as objects with sufficient width to fill an array, it is beneficial to retain some channel-based functionality.
The adaptive audio system is configured to support “beds” in addition to audio objects, where beds are effectively channel-based sub-mixes or stems. These can be delivered for final playback (rendering) either individually, or combined into a single bed, depending on the intent of the content creator. These beds can be created in different channel-based configurations such as 5.1, 7.1, and 9.1, and arrays that include overhead speakers.
An adaptive audio system extends beyond speaker feeds as a means for distributing spatial audio and uses advanced model-based audio descriptions to tailor playback configurations that suit individual needs and system constraints so that audio can be rendered specifically for individual configurations. The spatial effects of audio signals are critical in providing an immersive experience for the listener. Sounds that are meant to emanate from a specific region of a viewing screen or room should be played through speaker(s) located at that same relative location. Thus, the primary audio metadatum of a sound event in a model-based description is position, though other parameters such as size, orientation, velocity and acoustic dispersion can also be described.
As stated above, adaptive audio content may comprise several bed channels 102 along with many individual audio objects 104 that are combined during rendering to create a spatially diverse and immersive audio experience. In a cinema environment with a great deal of processing bandwidth, virtually any number of beds and objects can be created and accurately rendered in a theater. However, as cinema or other complex audio content is produced for distribution and reproduction in home or personal listening environments, the relatively limited processing bandwidth of such devices and media prevents optimum rendering or playback of this content. For example, typical transmission media used for consumer and professional applications include Blu-ray disc, broadcast (cable, satellite and terrestrial), mobile (3G and 4G) and over the top (OTT) or Internet distribution. These media channels may pose significant limitations on the available bandwidth to digitally transmit all of the bed and object information of adaptive audio content. Embodiments are directed to mechanisms to compress complex adaptive audio content so that it may be distributed through transmission systems that may not possess large enough available bandwidth to otherwise render all of the audio bed and object data.
With current monophonic, stereo and multichannel audio content, the bandwidth constraints of the aforementioned delivery methods and networks are such that audio coding is generally required to reduce the bandwidth of the content to match the available bandwidth of the distribution method. Present cinema systems are capable of providing uncompressed audio data at a bandwidth on the order of 10 Mbps for a typical 7.1 cinema format. In comparison to this capacity, the available bandwidth for the various other delivery methods and playback systems is substantially less. For example, disc-based bandwidth is on the order of several hundred kbps up to tens of Mbps; broadcast bandwidth is on the order of several hundred kbps down to tens of kbps; OTT Internet bandwidth is on the order of several hundred kbps up to several Mbps; and mobile (3G/4G) bandwidth is only on the order of several hundred kbps down to tens of kbps. Because adaptive audio contains additional audio essence that is part of the format, i.e., objects 104 in addition to channel beds 102, the already significant constraints on transmission bandwidth are exacerbated beyond those of normal channel-based audio formats, and reductions in bandwidth beyond those achievable with audio coding tools alone are required to facilitate accurate reproduction in reduced bandwidth transmission and playback systems.
Scene Simplification Through Object Clustering
In an embodiment, an adaptive audio system provides a component to reduce the bandwidth of object-based audio content through object clustering and perceptually transparent simplifications of the spatial scenes created by the combination of channel beds and objects. An object clustering process executed by the component uses certain information about the objects, including spatial position, content type, temporal attributes, object width, and loudness, to reduce the complexity of the spatial scene by grouping like objects into object clusters that replace the original objects.
The additional audio processing for standard audio coding to distribute and render a compelling user experience based on the original complex bed and audio tracks is generally referred to as scene simplification and/or object clustering. The purpose of this processing is to reduce the spatial scene through clustering or grouping techniques that reduce the number of individual audio elements (beds and objects) to be delivered to the reproduction device, but that still retain enough spatial information so that the perceived difference between the originally authored content and the rendered output is minimized.
The scene simplification process facilitates the rendering of object-plus-bed content in reduced bandwidth channels or coding systems using information about the objects, including spatial position, temporal attributes, content type, width, and other appropriate characteristics, to dynamically cluster objects to a reduced number. This process can reduce the number of objects by performing the following clustering operations: (1) clustering objects to objects; (2) clustering objects with beds; and (3) clustering objects and beds to objects. In addition, an object can be distributed over two or more clusters. The process further uses certain temporal and/or perceptual information about objects to control the clustering and de-clustering of objects. Object clusters replace the individual waveforms and metadata elements of constituent objects with a single equivalent waveform and metadata set, so that data for N objects is replaced with data for a single object, thus essentially compressing object data from N to 1. As mentioned above, alternatively or additionally, an object or bed channel may be distributed over more than one cluster (for example using amplitude panning techniques), compressing object data from N to M, with M<N. The clustering process utilizes an error metric based on the distortion due to a change in location, loudness or other characteristic of the clustered objects to determine an optimum tradeoff between clustering compression and sound degradation of the clustered objects. The clustering process can be performed synchronously or it can be event-driven, such as by using auditory scene analysis (ASA) and event boundary detection to control object simplification through clustering. In some embodiments, the process may utilize knowledge of endpoint rendering algorithms and devices to control clustering. In this way, certain characteristics or properties of the playback device may be used to inform the clustering process. For example, different clustering schemes may be utilized for speakers versus headphones or other audio drivers, or different clustering schemes may be utilized for lossless versus lossy coding, and so on.
For purposes of the following description, the terms ‘clustering’ and ‘grouping’ or ‘combining’ are used interchangeably to describe the combination of objects and/or beds (channels) to reduce the amount of data in a unit of adaptive audio content for transmission and rendering in an adaptive audio playback system; and the terms ‘compression’ or ‘reduction’ may be used to refer to the act of performing scene simplification of adaptive audio through such clustering of objects and beds. The terms ‘clustering’, ‘grouping’ or ‘combining’ throughout this description are not limited to a strictly unique assignment of an object or bed channel to a single cluster only; instead, an object or bed channel may be distributed over more than one output bed or cluster using weights or gain vectors that determine the relative contribution of an object or bed signal to the output cluster or output bed signal.
In an adaptive audio system, at least a portion of the input audio comprises input signals 201 including objects that consist of audio and metadata. The metadata defines certain characteristics of the associated audio content, such as object spatial position, content type, loudness, and so on. Any practical number of audio objects (e.g., hundreds of objects) may be processed through the system for playback. To facilitate accurate playback of this multitude of objects in a wide variety of playback systems and transmission media, system 200 includes a clustering process or component 202 that reduces the number of objects into a smaller, more manageable number of objects by combining the original objects into a smaller number of object groups. The clustering process thus builds groups of objects to produce a smaller number of output groups 203 from an original set of individual input objects 201. The clustering process 202 essentially processes the metadata of the objects as well as the audio data itself to produce the reduced number of object groups. The metadata is analyzed to determine which objects at any point in time are most appropriately combined with other objects, and the corresponding audio waveforms for the combined objects are then summed together to produce a substitute or combined object. The combined object groups are then input to the encoder 204, which generates a bitstream 205 containing the audio and metadata for transmission to the decoder 206.
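The following is a minimal, illustrative sketch (in Python, using numpy) of the kind of combination a clustering component such as 202 might perform: each input object carries a waveform and position metadata, objects are assigned to a group, the waveforms within a group are summed, and a single replacement object with merged metadata is produced. The Obj structure, the nearest-centroid assignment, and the loudness-weighted position merge are hypothetical simplifications, not the system's prescribed implementation.

```python
# Illustrative sketch only: combine input objects into a smaller set of
# replacement objects by summing waveforms and merging position metadata.
from dataclasses import dataclass
import numpy as np

@dataclass
class Obj:                        # hypothetical simplified audio object
    audio: np.ndarray             # mono waveform, shape (num_samples,)
    position: np.ndarray          # 3D position taken from the object metadata
    loudness: float               # per-object importance proxy

def cluster_objects(objects, centroids):
    """Assign each object to its nearest centroid and mix each group down."""
    groups = {i: [] for i in range(len(centroids))}
    for obj in objects:
        i = int(np.argmin([np.linalg.norm(obj.position - c) for c in centroids]))
        groups[i].append(obj)

    merged = []
    for members in groups.values():
        if not members:
            continue
        audio = np.sum([m.audio for m in members], axis=0)         # sum waveforms
        weights = np.array([m.loudness for m in members], dtype=float)
        weights = weights / max(weights.sum(), 1e-12)              # importance weights
        position = np.sum([w * m.position for w, m in zip(weights, members)], axis=0)
        merged.append(Obj(audio=audio, position=position,
                          loudness=float(sum(m.loudness for m in members))))
    return merged
```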
In general, the adaptive audio system incorporating the object clustering process 202 includes components that generate metadata from the original spatial audio format. The codec circuit 200 comprises part of an audio rendering system configured to process one or more bitstreams containing both conventional channel-based audio elements and audio object coding elements. An extension layer containing the audio object coding elements is added to either one of the channel-based audio codec bitstream or the audio object bitstream. This approach enables bitstreams 205, which include the extension layer, to be processed by renderers for use with existing speaker and driver designs or next generation speakers utilizing individually addressable drivers and driver definitions. The spatial audio content from the spatial audio processor comprises audio objects, channels, and position metadata. When an object is rendered, it is assigned to one or more speakers according to the position metadata and the location of the playback speakers. Additional metadata may be associated with the object to alter the playback location or otherwise limit the speakers that are to be used for playback. Metadata may be generated in the audio workstation in response to the engineer's mixing inputs to provide rendering cues that control spatial parameters (e.g., position, velocity, intensity, timbre, etc.) and specify which driver(s) or speaker(s) in the listening environment play respective sounds during exhibition. The metadata is associated with the respective audio data in the workstation for packaging and transport by the spatial audio processor.
The object processing component 256 utilizes certain processing configuration information 272. In an embodiment, these include the number of output objects, the frame size and certain media intelligence settings. Media intelligence can include several parameters or characteristics associated with the objects, such as content type (i.e., dialog/music/effects/etc.), regions (segment/classification), preprocessing results, auditory scene analysis results, and other similar information.
In an alternative embodiment, audio generation could be deferred by keeping a reference to all original tracks as well as simplification metadata (e.g., which objects belong to which cluster, which objects are to be rendered to beds, etc.). This can be useful to distribute the simplification process between a studio and an encoding house, or in other similar scenarios.
In the post-production stage 221, the input audio data 222, which could be cinema and/or home based adaptive audio content, is input to a metadata generation process 224. This process generates spatial metadata for the objects, including position, width, decorrelation, and rendering mode information, as well as content metadata including content type, object boundaries and relative importance (energy/loudness). A clustering process 226 is then applied to the input data to reduce the overall number of input objects to a smaller number of objects by combining certain objects together based on their spatial proximity, temporal proximity, or other characteristics. The clustering process 226 may be a dynamic clustering process that performs clustering as a constant or periodic process as the input data is processed in the system, and it may utilize user input 228 that specifies certain constraints such as the target number of clusters, importance weighting of objects/clusters, filtering effects, and so on. The post-production stage may also include a cluster down-mixing step that provides certain processing of the clusters, such as mixing, decorrelation, limiters, and so on. The post-production stage may include a render/monitor option 232 that allows the audio engineer to monitor or listen to the result of the clustering process, and modify the input data 222 or user input 228 if the results are not adequate.
The transmission stage 223 generally comprises components that perform raw data to codec interfacing 234, and packaging of the audio data into the appropriate output format 236 for delivery or streaming of the digital data using the appropriate codec (e.g., TrueHD, Dolby Digital+, etc.). In the transmission stage 223, a further dynamic clustering process 238 may also be applied to the objects that are produced during the post-production stage 221.
The playback system 225 receives the transmitted digital audio data and performs a final render step 242 for playback through the appropriate equipment (e.g., amplifiers plus speakers). During this stage an additional dynamic clustering process 240 may be applied using certain user input 244 and playback system (compute) capability 245 information to further group objects into clusters.
In an embodiment, the clustering processes 240 and 238 performed in either the transmission or playback stages may be limited clustering processes in that the amount of object clustering may be limited as compared to the post-production clustering process 226 in terms of number of clusters formed and/or the amount and type of information used to perform the clustering.
The loudness of the combined object may be derived by averaging or summing the loudness of the constituent objects. In an embodiment, the loudness metric of a signal represents the perceptual energy of the signal, which is a measure of the energy that is weighted based on frequency. Loudness is thus a spectrally weighted energy that corresponds to a listener's perception of the sound. In an alternative embodiment, instead of, or along with loudness, the process may use the pure energy (RMS energy) of the signal, or some other measure of signal energy, as a factor in determining the importance of an object. In yet another alternative embodiment, the loudness of the combined object is derived from the partial loudness data of the clustered objects, in which the partial loudness represents the (relative) loudness of an object in the context of the complete set of objects and beds according to psychoacoustic principles. Thus, as shown in table 350, the loudness metadata type may be embodied as an absolute loudness, a partial loudness or a combined loudness metadata definition. Partial loudness (or relative importance) of an object can be used for clustering as an importance metric, or as a means to selectively render objects if the rendering system does not have sufficient capabilities to render all objects individually.
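As a rough illustration of a spectrally weighted energy of the kind described above, the sketch below weights the power spectrum of a signal by a set of per-band weights. The band edges and weights are invented for illustration; an actual perceptual model would compute excitation in critical bands and, for partial loudness, account for masking by the other objects in the scene, which this sketch does not attempt.

```python
# Rough stand-in for a spectrally weighted energy; not a full perceptual
# loudness model (no critical-band excitation, no masking between objects).
import numpy as np

def weighted_energy(signal, sample_rate, band_edges_hz, band_weights):
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = 0.0
    for (lo, hi), w in zip(band_edges_hz, band_weights):
        total += w * spectrum[(freqs >= lo) & (freqs < hi)].sum()
    return total / len(signal)

# Hypothetical band layout and weights, loosely emphasizing the 1-8 kHz region:
bands = [(0, 250), (250, 1000), (1000, 4000), (4000, 8000), (8000, 20000)]
weights = [0.3, 0.7, 1.0, 1.0, 0.5]
```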
Other metadata types may require other combination methods. For example, certain metadata cannot be combined through a logical or arithmetic operation, and thus a selection must be made. For example, in the case of rendering mode, which is either one mode or another, the rendering mode of the dominant object is assigned to be the rendering mode of the combined object. Other types of metadata, such as control signals and the like may be selected or combined depending on application and metadata characteristics.
With regard to content type, audio is generally classified into one of a number of defined content types, such as dialog, music, ambience, special effects, and so on. An object may change content type throughout its duration, but at any specific point in time it is generally only one type of content. The content type is thus expressed as a probability that the object is a particular type of content at any point in time. Thus, for example, a constant dialog object would be expressed as a one-hundred percent probability dialog object, while an object that transforms from dialog to music may be expressed as fifty percent dialog/fifty percent music. Clustering objects that have different content types could be performed by averaging their respective probabilities for each content type, selecting the content type probabilities for the most dominant object, or some other logical combination of content type measures. The content type may also be expressed as an n-dimensional vector (where n is the total number of different content types, e.g., four, in the case of dialog/music/ambience/effects). The content type of the clustered objects may then be derived by performing an appropriate vector operation. As shown in table 350, the content type metadata may be embodied as a combined content type metadata definition, where a combination of content types reflects the probability distributions that are combined (e.g., a vector of probabilities of music, speech, etc.).
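A sketch of how content-type probability vectors might be combined when objects are clustered is shown below, either by an energy-weighted average of the vectors or by selecting the dominant object's vector. The content-type ordering, function names, and weighting choice are assumptions for illustration.

```python
import numpy as np

# Assumed content-type order: [dialog, music, ambience, effects]
def combine_content_type(prob_vectors, energies, mode="weighted"):
    p = np.asarray(prob_vectors, dtype=float)    # shape (num_objects, num_types)
    e = np.asarray(energies, dtype=float)
    if mode == "dominant":                       # take the most energetic object's vector
        return p[int(np.argmax(e))]
    w = e / e.sum()                              # energy-weighted average of the vectors
    return w @ p

# Example: a pure-dialog object clustered with a half-dialog/half-music object
print(combine_content_type([[1.0, 0, 0, 0], [0.5, 0.5, 0, 0]], [2.0, 1.0]))
# -> approximately [0.83, 0.17, 0.0, 0.0]
```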
With regard to classification of audio, in an embodiment, the process operates on a per time-frame basis to analyze the signal, identify features of the signal and compare the identified features to features of known classes in order to determine how well the features of the object match the features of a particular class. Based on how well the features match a particular class, the classifier can identify a probability of an object belonging to a particular class. For example, if at time t=T the features of an object match very well with dialog features, then the object would be classified as dialog with a high probability. If, at time t=T+N, the features of an object match very well with music features, the object would be classified as music with a high probability. Finally, if at time t=T+2N the features of an object do not match particularly well with either dialog or music, the object might be classified as 50% music and 50% dialog.
The listing of metadata definitions in table 350 is intended to be exemplary rather than exhaustive.
In an embodiment and with reference to diagram 400, a first clustering scheme 402 groups audio objects based on their spatial position, combining objects that are in close spatial proximity into a single replacement object or cluster.
A second clustering scheme 404 determines when it is appropriate to combine audio objects that may be spatially diverse with channel beds that represent fixed spatial locations. An example of this type of clustering is when there is not enough available bandwidth to transmit an object that was originally represented as traversing a three-dimensional space; instead, the object is mixed into its projection onto the horizontal plane, which is where channel beds are typically represented. This allows one or more objects to be dynamically mixed into the static channels, thereby reducing the number of objects that need to be transmitted.
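The sketch below illustrates one plausible way an elevated object could be mixed into a horizontal-plane bed: the height coordinate is dropped and the object is amplitude-panned between the two nearest bed speakers. The speaker layout, coordinate convention, and equal-power panning law are assumptions, not the specific method mandated by the system.

```python
import numpy as np

# Hypothetical 5.0 bed speaker azimuths in degrees on the horizontal plane
# (0 = front-center, positive = listener's left).
BED_SPEAKERS = {"C": 0.0, "L": 30.0, "R": -30.0, "Ls": 110.0, "Rs": -110.0}

def object_to_bed_gains(position_xyz):
    """Drop the height coordinate of a 3D object position and pan the object
    between its two nearest bed speakers using an equal-power pair pan."""
    x, y, _ = position_xyz                           # assumed convention: y front, x left
    azimuth = np.degrees(np.arctan2(x, y))
    names = list(BED_SPEAKERS)
    diffs = [abs((BED_SPEAKERS[n] - azimuth + 180.0) % 360.0 - 180.0) for n in names]
    order = np.argsort(diffs)
    a, b = names[order[0]], names[order[1]]          # two nearest bed speakers
    total = diffs[order[0]] + diffs[order[1]]
    frac = 0.5 if total == 0 else diffs[order[0]] / total
    gains = {n: 0.0 for n in names}
    gains[a] = float(np.cos(frac * np.pi / 2))       # all energy to 'a' when frac == 0
    gains[b] = float(np.sin(frac * np.pi / 2))
    return gains

# Example: an object above and slightly to the left lands mostly in L and C.
print(object_to_bed_gains((0.3, 1.0, 0.8)))
```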
A third clustering scheme 406 uses prior knowledge of certain known system characteristics. For example, knowledge of the endpoint rendering algorithms and/or the reproduction devices in the playback system may be used to control the clustering process. For example, a typical home theater configuration relies on physical speakers located in fixed locations. These systems may also rely on speaker virtualization algorithms that compensate for the absence of some speakers in the room by presenting the listener with virtual speakers within the room. If information such as the spatial diversity of the speakers and the accuracy of the virtualization algorithms is known, then it may be possible to send a reduced number of objects because the speaker configuration and virtualization algorithms can only provide a limited perceptual experience to a listener. In this case, sending a full bed plus object representation may be a waste of bandwidth, so some degree of clustering would be appropriate. Other types of known information could also be used in this clustering scheme, such as the content type of the object or objects, or the width of an object or objects, to control clustering. For this embodiment, the codec circuit 200 may be configured to adapt the output audio signals 207 based on the playback device. This feature allows a user or other process to define the number of grouped clusters 203, as well as the compression rate for the compressed audio 211. Since different transmission media and playback devices can have significantly different bandwidth capacities, a flexible compression scheme for both standard compression algorithms and object clustering can be advantageous. For example, if the input comprises a first number of objects, e.g., 100 original objects, the clustering process may be configured to generate 20 combined groups 203 for Blu-ray systems or 10 objects for cell phone playback, and so on. The clustering process 202 may be applied recursively to generate incrementally fewer clustered groups 203 so that different sets of output signals 207 may be provided for different playback applications.
A fourth clustering scheme 408 comprises the use of temporal information to control the dynamic clustering and de-clustering of objects. In one embodiment, the clustering process is performed at regular intervals or periods (e.g., once every 10 milliseconds). Alternatively, other temporal events can be used, including techniques such as auditory scene analysis (ASA) and auditory event boundary detection to analyze and process the audio content to determine the optimum clustering configurations based on the duration of individual objects.
It should be noted that the schemes illustrated in diagram 400 can be performed by the clustering process 202 either as stand-alone acts or in combination with one or more other schemes. They may also be performed in any order relative to the other schemes, and no particular order is required for execution of the clustering process.
For the case where clustering is based on spatial position 402, the original objects are grouped into clusters for which a spatial centroid is dynamically constructed. The position of the centroid becomes the new position of the group. The audio signal for the group is a mix-down of all the original audio signals for each object belonging to the group. Each cluster can be seen as a new object that approximates its original contents but shares the same core attributes/data structures as the original input objects. As a result, each object cluster can be directly processed by the object renderer.
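One conceivable way to construct such centroids dynamically is an importance-weighted, k-means-style iteration over object positions, sketched below. This is offered only as an illustrative approach under assumed inputs (object positions and per-object importance weights); the actual centroid construction may differ.

```python
import numpy as np

def build_centroids(positions, weights, num_clusters, iterations=20, seed=0):
    """Importance-weighted, k-means-style sketch for placing cluster centroids."""
    rng = np.random.default_rng(seed)
    pos = np.asarray(positions, dtype=float)       # shape (num_objects, 3)
    w = np.asarray(weights, dtype=float)
    centroids = pos[rng.choice(len(pos), num_clusters, replace=False)]
    assignment = np.zeros(len(pos), dtype=int)
    for _ in range(iterations):
        # assign every object to its nearest centroid
        d = np.linalg.norm(pos[:, None, :] - centroids[None, :, :], axis=-1)
        assignment = d.argmin(axis=1)
        # move each centroid to the importance-weighted mean of its members
        for c in range(num_clusters):
            members = assignment == c
            if members.any():
                centroids[c] = np.average(pos[members], axis=0, weights=w[members])
    return centroids, assignment
```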
In an embodiment, the clustering process dynamically groups an original number of audio objects and/or bed channels into a target number of new equivalent objects and bed channels. In most practical applications, the target number is substantially lower than the original number, e.g., 100 original input tracks combined into 20 or fewer combined groups. These solutions apply to scenarios where both bed and object channels are available either as an input and/or an output to the clustering process. A first solution to support both objects and bed tracks is to process input bed tracks as objects with fixed pre-defined position in space. This allows the system to simplify a scene comprising, for example, both objects and beds into a target number of object tracks only. However, it might also be desirable to preserve a number of output bed tracks as part of the clustering process. Less important objects can then be rendered directly to the bed tracks as a pre-process, while the most important ones can be further clustered into a smaller target number of equivalent object tracks. If some of the resulting clusters have high distortion they can also be rendered to beds as a post-process, as this may result in a better approximation of the original content. This decision can be made on a time-varying basis, since the error/distortion is a time-varying function.
In an embodiment, the clustering process involves analyzing the audio content of every individual input track (object or bed) 201 as well as the attached metadata (e.g., the spatial position of the objects) to derive an equivalent number of output object/bed tracks that minimizes a given error metric. In a basic implementation, the error metric is based on the spatial distortion due to shifting the clustered objects and can further be weighted by a measure of the importance of each object over time. The importance of an object can encapsulate other characteristics of the object, such as loudness, content type, and other relevant factors. Alternatively, these other factors can form separate error metrics that can be combined with the spatial error metric.
Error Calculation
The clustering process essentially represents a type of lossy compression scheme that reduces the amount of data transmitted through the system, but that inherently introduces some amount of content degradation due to the combination of original objects into a fewer number of rendered objects. As stated above, the degradation due to the clustering of objects is quantified by an error metric. The greater the reduction of original objects into relatively few combined groups and/or the greater the amount of spatial collapsing of original objects into combined groups, the greater the error, in general. In an embodiment, the error metric used in the clustering process is expressed as shown in Equation 1:
E(s,c)[t]=Importance_s[t]*dist(s,c)[t] (1)
As stated above, an object may be distributed over more than one cluster, rather than grouped into a single cluster with other objects. When an object signal x(s)[t] with index s is distributed over more than one cluster c, the representative cluster audio signals y(c)[t] are formed using amplitude gains g(s,c)[t] as shown in Equation 2:
y(c)[t]=sum_s g(s,c)[t]*x(s)[t] (2)
The error metric E(s,c)[t] for each cluster c can be a weighted combination of the terms expressed in Equation 1, with weights that are a function of the amplitude gains g(s,c)[t], as shown in Equation 3:
E(s,c)[t]=sum_s(f(g(s,c)[t])*Importance_s[t]*dist(s,c)[t]) (3)
In an embodiment, the clustering process supports objects with a width or spread parameter. Width is used for objects that are not rendered as pinpoint sources but rather as sounds with an apparent spatial extent. As the width parameter increases, the rendered sound becomes more spatially diffuse and, consequently, its specific location becomes less relevant. It is thus advantageous to include width in the clustering distortion metric so that more positional error is tolerated as the width increases. The error expression E(s,c) can thus be modified to accommodate a width metric, as shown in Equation 4:
E(s,c)[t]=Importance_s[t]*(α*(1−Width_s[t])*dist(s,c)[t]+(1−α)*Width_s[t]) (4)
In the equations above, Importance_s[t] is the relative importance of object s, c is the centroid of the cluster, and dist(s,c)[t] is the Euclidean three-dimensional distance between the object and the centroid of the cluster. All of these quantities are time-varying, as denoted by the [t] term. The weighting term α in Equation 4 controls the relative weight of the size (width) of an object versus its position.
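For reference, the sketch below expresses Equations 1 through 4 directly in code for a single time frame, with variable names mirroring the equations. The importance values, gains, and the weighting function f() are assumed to be supplied by other parts of the process.

```python
import numpy as np

def dist(p_s, p_c):
    """Euclidean 3D distance between object position p_s and cluster centroid p_c."""
    return float(np.linalg.norm(np.asarray(p_s, dtype=float) - np.asarray(p_c, dtype=float)))

# Equation 1: error of assigning object s to cluster c at one time instant.
def error_eq1(importance_s, p_s, p_c):
    return importance_s * dist(p_s, p_c)

# Equation 2: cluster signal as a gain-weighted sum of the object signals.
def cluster_signal_eq2(object_signals, gains_c):
    return sum(g * x for g, x in zip(gains_c, object_signals))

# Equation 3: error of cluster c as a gain-weighted combination over objects.
def error_eq3(importances, positions, p_c, gains_c, f=lambda g: g):
    return sum(f(g) * imp * dist(p, p_c)
               for g, imp, p in zip(gains_c, importances, positions))

# Equation 4: width-aware error; alpha trades positional error against width.
def error_eq4(importance_s, width_s, p_s, p_c, alpha=0.5):
    return importance_s * (alpha * (1.0 - width_s) * dist(p_s, p_c)
                           + (1.0 - alpha) * width_s)
```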
The importance function, Importance_s[t], can be a combination of signal-based metrics, such as the loudness of the signal, with a higher-level measure of how salient each object is relative to the rest of the mix. For example, a spectral similarity measure computed for each pair of input objects can further weight the loudness metric so that similar signals tend to be grouped together. For cinematic content, as an example, it might also be desirable to give more importance to on-screen objects, in which case the importance can be further weighted by a directional dot-product term which is maximal for front-center objects and diminishes as the objects move off-screen.
When constructing the clusters, the importance function is temporally smoothed over a relatively long time window (e.g. 0.5 second) to ensure that the clustering is temporally consistent. In this context, including look-ahead or prior knowledge of object start and stop times can improve the accuracy of the clustering. In contrast, the equivalent spatial location of the cluster centroid can be adapted at a higher rate (10 to 40 milliseconds) using a higher rate estimate of the importance function. Sudden changes or increments in the importance metric (for example using a transient detector) may temporarily shorten the relatively long time window, or reset any analysis states in relation to the long time window.
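A simple way to realize the slow and fast adaptation rates described above is a pair of one-pole (exponential) smoothers with different time constants, sketched below. The 0.5 second and 20 millisecond constants follow the text; the smoother itself and the frame rate are illustrative assumptions.

```python
import numpy as np

def one_pole_smooth(values, frame_rate_hz, time_constant_s):
    """Exponential (one-pole) smoothing of a per-frame importance track."""
    alpha = 1.0 - np.exp(-1.0 / (frame_rate_hz * time_constant_s))
    out = np.empty(len(values))
    state = values[0]
    for i, v in enumerate(values):
        state += alpha * (v - state)
        out[i] = state
    return out

# Example: 10 seconds of hypothetical per-frame importance values at 100 frames/s.
raw_importance = np.abs(np.random.default_rng(0).standard_normal(1000))
frame_rate = 100.0
importance_slow = one_pole_smooth(raw_importance, frame_rate, 0.5)   # drives clustering decisions
importance_fast = one_pole_smooth(raw_importance, frame_rate, 0.02)  # drives centroid positions
```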
As stated above, other information such as content type can also be included in the error metric as an additional importance weighting term. For instance, in a movie soundtrack, dialog might be considered more important than music and sound effects. It would therefore be preferable to separate dialog into one or a few dialog-only clusters by increasing the relative importance of the corresponding objects. The relative importance of each object could also be provided or manually adjusted by a user. Similarly, only a specific subset of the original objects may be clustered or simplified if the user so desires, while the others are preserved as individually rendered objects. The content type information could also be generated automatically using media intelligence techniques to classify audio content.
The error metric E(s,c) could be a function of several error components based on the combined metadata elements. Thus, other information besides distance could factor in the clustering error. For example, like objects may be clustered together rather than disparate objects, based on object type, such as dialog, music, effects, and so on. Combining objects of different types that are incompatible can result in distortion or degradation of the output sound. Error could also be introduced due to inappropriate or less than optimum rendering modes for one or more of the clustered objects. Likewise, certain control signals for specific objects may be disregarded or compromised for clustered objects. An overall error term may thus be defined that represents the sum of errors for each metadata element that is combined when an object is clustered. An example expression of overall error is shown in Equation 5:
Eoverall[t]=Σ EMDn (5)
In Equation 5, MDn represents a specific metadata element of the N metadata elements that are combined for each object that is merged in a cluster, and EMDn represents the error associated with combining that metadata value with the corresponding metadata values for other objects in the cluster. The error value may be expressed as a percentage value for metadata values that are averaged (e.g., position/loudness), as a binary 0 percent or 100 percent value for metadata values that are selected as one value or another (e.g., rendering mode), or as any other appropriate error metric. For the metadata elements illustrated in table 350, the overall error may be expressed as shown in Equation 6:
Eoverall[t]=Espatial+Eloudness+Erendering+Econtrol (6)
The different error components other than spatial error can be used as criteria for the clustering and de-clustering of objects. For example, loudness may be used to control the clustering behavior. Specific loudness is a perceptual measure of loudness based on psychoacoustic principles. By measuring the specific loudness of different objects, the perceived loudness of an object may guide whether or not it is clustered. For example, a loud object is likely to be more apparent to a listener if its spatial trajectory is modified, while the opposite is generally true for quieter objects. Therefore, specific loudness could be used as a weighting factor in addition to spatial error to control the clustering of objects. Another example is object type, wherein some types of objects may be more perceptible if their spatial organization is modified. For example, humans are very sensitive to speech signals, and these types of objects may need to be treated differently than other objects such as noise-like or ambient effects for which spatial perception is less acute. Therefore, object type (such as speech, effects, ambience, etc.) could be used as a weighting factor in addition to spatial error to control the clustering of objects.
The clustering process 202 thus combines objects into clusters based on certain characteristics of the objects and a defined amount of error that cannot be exceeded.
In one embodiment, the clustering process analyzes the objects and performs clustering at regular periodic intervals, such as once every 10 milliseconds, or any other appropriate time period.
Instead of performing clustering at every regular time period, the clustering process may alternatively be event-driven, for example using auditory scene analysis (ASA) and auditory event boundary detection to determine when objects should be clustered or de-clustered.
In an adaptive audio system, certain objects may be defined as fixed objects, such as channel beds that are associated with specific speaker feeds. In an embodiment, the clustering process accounts for bed plus dynamic object interaction, such that when an object creates too much error when being grouped with a clustered object (e.g., it is an outlying object), it is instead mixed to a bed.
In an embodiment, object signal-based saliency is the difference between the average spectrum of the mix and the spectrum of each object, and saliency metadata elements may be added to objects/clusters. The relative loudness is the percentage of the energy/loudness contributed by each object to the final mix. A relative loudness metadata element can also be added to objects/clusters. The process can then sort by saliency to cull masked sources and/or preserve the most important sources. Clusters can be simplified by further attenuating low-importance/low-saliency sources.
The clustering process is generally used as a means for data rate reduction prior to audio coding. In an embodiment, object clustering/grouping is used during decoding based on the end-point device rendering capabilities. Various different end-point devices may be used in conjunction with a rendering system that employs a clustering process as described herein, ranging from a full cinema playback environment to home theater systems, gaming systems, personal portable devices, and headphone systems. Thus, the same clustering techniques may be utilized while decoding the objects and beds in a device, such as a Blu-ray player, prior to rendering so that the capabilities of the renderer are not exceeded. In general, rendering of the object and bed audio format requires that each object be rendered to some set of channels associated with the renderer as a function of each object's spatial information. The computational cost of this rendering scales with the number of objects, and therefore any rendering device will have some maximum number of objects it can render that is a function of its computational capabilities. A high-end renderer, such as an AVR, may contain an advanced processor that can render a large number of objects simultaneously. A less expensive device, such as a home theater in a box (HTIB) or a soundbar, may be able to render fewer objects due to a more limited processor. It is therefore advantageous for the renderer to communicate to the decoder the maximum number of objects and beds that it can accept. If this number is smaller than the number of objects and beds contained in the decoded audio, then the decoder may apply clustering of objects and beds prior to transmission to the renderer so as to reduce the total to the communicated maximum. This communication of capabilities may occur between separate decoding and rendering software components within a single device, such as an HTIB containing an internal Blu-ray player, or over a communications link, such as HDMI, between two separate devices, such as a stand-alone Blu-ray player and an AVR. The metadata associated with objects and clusters may indicate or provide information as to how the renderer can optimally reduce the number of clusters, for example by enumerating the order of importance, signaling the (relative) importance of clusters, or specifying which clusters should be combined sequentially to reduce the overall number of clusters that must be rendered.
In some embodiments, the clustering process may be performed in the decoder stage 206 with no additional information other than that inherent to each object. However, the computational cost of this clustering may be equal to or greater than the rendering cost that it is attempting to save. A more computationally efficient embodiment involves computing a hierarchical clustering scheme at the encode side 204, where computational resources may be much greater, and sending metadata along with the encoded bitstream that instructs the decoder how to cluster objects and beds into progressively smaller numbers. For example, the metadata may state: first, merge object 2 with object 10; second, merge the resulting object with object 5; and so on.
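The sketch below illustrates how a decoder might apply such encoder-supplied merge instructions until the object count fits the renderer's stated maximum. The instruction format (an ordered list of index pairs), the energy-weighted position merge, and all names are hypothetical.

```python
def apply_merge_plan(audio, positions, energies, merge_plan, max_objects):
    """Apply an ordered list of (keep_id, merge_id) instructions until the
    object count is at or below the renderer's stated maximum.
    audio/positions/energies: dicts keyed by object id; positions hold 3D
    numpy arrays so the weighted-average arithmetic below is valid."""
    for keep, merge in merge_plan:
        if len(audio) <= max_objects:
            break
        if keep not in audio or merge not in audio:
            continue
        audio[keep] = audio[keep] + audio[merge]              # sum the waveforms
        w = energies[merge] / (energies[keep] + energies[merge])
        positions[keep] = (1 - w) * positions[keep] + w * positions[merge]
        energies[keep] += energies[merge]
        del audio[merge], positions[merge], energies[merge]
    return audio, positions, energies

# e.g. the encoder metadata might say: merge object 10 into 2, then 5 into 2, ...
# apply_merge_plan(audio, positions, energies, [(2, 10), (2, 5)], max_objects=16)
```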
In an embodiment, objects may have one or more time-varying labels associated with them to denote certain properties of the audio contained in the object track. As described above, an object may be categorized into one of several discrete content types, such as dialog, music, effects, background, etc., and these types may be used to help guide the clustering. At the same time, these categories may also be useful during the rendering process. For example, a dialog enhancement algorithm might be applied only to objects labeled as dialog. When objects are clustered, however, the cluster might be comprised of objects with different labels. In order to label the cluster, several techniques may be employed. A single label for the cluster may be chosen, for example, by selecting the label of the object with the largest amount of energy. This selection may also be time-varying, where a single label is chosen at regular intervals of time during the cluster's duration, and at each particular interval the label is chosen from the object with the largest energy within that particular interval. In some cases, a single label may not be sufficient, and a new, combined label may be generated. For example, at regular intervals, the labels of all objects contributing to the cluster during that interval may be associated with the cluster. Alternatively, a weight may be associated with each of these contributing labels. For example, the weight may be set equal to the percentage of overall energy belonging to that particular type: for example, 50% dialog, 30% music, and 20% effects. Such labeling may then be used by the renderer in a more flexible manner. For example, a dialog enhancement algorithm may only be applied to clustered object tracks containing at least 50% dialog.
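The sketch below shows one way the per-interval, energy-based label weights described above could be computed for a cluster. The data layout and function name are assumptions for illustration.

```python
import numpy as np

def cluster_label_weights(object_labels, object_energy_per_interval):
    """Per-interval label weights for a cluster: the fraction of the cluster's
    energy in each interval contributed by objects carrying each label.
    object_labels: one label string per contributing object.
    object_energy_per_interval: array of shape (num_objects, num_intervals)."""
    e = np.asarray(object_energy_per_interval, dtype=float)
    totals = e.sum(axis=0)
    totals[totals == 0] = 1.0                      # avoid division by zero
    weights = []
    for t in range(e.shape[1]):
        interval = {}
        for label, energy in zip(object_labels, e[:, t]):
            interval[label] = interval.get(label, 0.0) + energy / totals[t]
        weights.append(interval)
    return weights

# A renderer might apply dialog enhancement only in intervals where
# weights[t].get("dialog", 0.0) >= 0.5.
```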
Once the clusters that combine different objects have been defined, equivalent audio data must be generated for each cluster. In an embodiment, the combined audio data is simply the sum of the original audio content for each original object in the cluster, as shown in
When generating a downmix, the process can further reduce the bit depth of a cluster to increase the compression of data. This can be performed through noise shaping or a similar process. A bit depth reduction generates a cluster that is represented with fewer bits than its constituent objects. For example, one or more 24-bit objects can be grouped into a cluster that is represented as 16 or 20 bits. Different bit reduction schemes may be used for different clusters and objects depending on the cluster importance or energy, or other factors. Additionally, when generating a downmix, the resulting downmix signal may have sample values beyond the acceptable range that can be represented by digital representations with a fixed number of bits. In such a case, the downmix signal may be limited using a peak limiter, or (temporarily) attenuated by a certain amount to prevent out-of-range sample values. The amount of attenuation applied may be included in the cluster metadata so that it can be undone (or inverted) during rendering, coding, or other subsequent processing.
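The following sketch illustrates the attenuation variant: the downmix is the sum of the constituent signals, and any gain applied to avoid out-of-range samples is recorded in the cluster metadata so that a later stage can invert it. The fixed-point limit of 1.0 and the metadata field name are assumptions.

```python
import numpy as np

def downmix_with_attenuation(object_signals, limit=1.0):
    """Sum the constituent object signals; if the sum exceeds the representable
    range, attenuate it and report the gain so rendering or coding can undo it."""
    mix = np.sum(np.stack(object_signals), axis=0)
    peak = float(np.max(np.abs(mix)))
    gain = 1.0 if peak <= limit else limit / peak
    metadata = {"attenuation_gain": gain}   # include in cluster metadata for inversion
    return mix * gain, metadata
```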
In an embodiment, the clustering process may employ a pointer mechanism whereby the metadata includes pointers to specific audio waveforms that are stored in a database or other storage. Clustering of objects is performed by pointing to appropriate waveforms through combined metadata elements. Such a system can be implemented in an archive system that generates a precomputed database of audio content, transmits the audio waveforms between the coder and decoder stages, and then constructs the clusters in the decode stage using pointers to specific audio waveforms for the clustered objects. This type of mechanism can be used in a system that facilitates packaging of object-based audio for different end-point devices.
The clustering process can also be adapted to allow for re-clustering on the end-point client device. Generally, substitute clusters replace the original objects; for this embodiment, however, the clustering process also sends error information associated with each object to allow the client to determine whether an object is an individually rendered object or a clustered object. If the error value is 0, then it can be deduced that there was no clustering. If, however, the error value equals some non-zero amount, then it can be deduced that the object is the result of some clustering. Rendering decisions at the client can then be based on the amount of error. In general, the clustering process is run as an off-line process. Alternatively, it may be run as a live process as the content is created. For this embodiment, the clustering component may be implemented as a tool or application that may be provided as part of the content creation and/or rendering system.
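As a hypothetical illustration of how a client might consume the per-object error information described above, the sketch below separates individually rendered objects (error of 0) from clustered objects and orders the latter by error; the field names are assumptions.

```python
def classify_objects(objects):
    """objects: list of dicts, each with an 'error' value supplied by the
    clustering process (0 means no clustering was applied to that object)."""
    individually_rendered = [o for o in objects if o["error"] == 0.0]
    clustered = [o for o in objects if o["error"] > 0.0]
    # A client might prefer to re-cluster objects that have already absorbed
    # the most spatial error last, so it sorts them from small to large error.
    clustered.sort(key=lambda o: o["error"])
    return individually_rendered, clustered
```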
Perceptual-Based Clustering
In an embodiment, a clustering method is configured to combine object and/or bed channels in constrained conditions, e.g., in which the input objects cannot be clustered without violating a spatial error criterion, due to the large number of objects and/or their spatially sparse distribution. In such conditions, the clustering process is not only controlled by spatial proximity (derived from metadata), but is augmented by perceptual criteria derived from the corresponding audio signals. More specifically, objects with a high (perceived) importance in the content will be favored over objects with low importance in terms of minimizing spatial errors. Examples of quantifying importance include, but are not limited to, partial loudness and semantics (content type).
The preprocessing unit 366 may include individual functional components such as a metadata processor 368, an object decorrelation unit 370, an offline processing unit 372, and a signal segmentation unit 374, among other components. External data, such as a metadata output update rate 396 may be provided to the preprocessor 366. The perceptual importance component 376 comprises a centroid initialization component 378, a partial loudness component 380, and a media intelligence unit 382, among other components. External data, such as an output beds and objects configuration data 398 may be provided to the perceptual importance component 376. The clustering component 384 comprises signal merging 386 and metadata merging 388 components that form the clustered beds/objects to produce the metadata 390 and clusters 392 for the combined bed channels and objects.
With regard to partial loudness, the perceived loudness of an object is usually reduced in the context of other objects. For example, objects may be (partially) masked by other objects and/or bed channels present in the scene. In an embodiment, objects with a high partial loudness are favored over objects with a low partial loudness in terms of spatial error minimization. Thus, relatively unmasked (i.e., perceptually louder) objects are less likely to be clustered while relatively masked objects are more likely to be clustered. This process preferably includes spatial aspects of masking, e.g., the release from masking if a masked object and a masking object have different spatial attributes. In other words, the loudness-based importance of a certain object of interest is higher when that object is spatially separated from other objects compared to when other objects are in the direct vicinity of the object of interest.
In an embodiment, the partial loudness of an object comprises the specific loudness extended with spatial unmasking phenomena. A binaural release from masking is introduced to represent the amount of masking based on the spatial distance between two objects, as provided in the equation below.
N′k(b)=(A+ΣEm(b))α−(A+ΣEm(b)(1−f(k,m)))α
In the above equation, the first summation is performed over all m, and the second summation is performed for all m≠k. The term Em(b) represents the excitation of object m, the term A reflects the absolute hearing threshold, and the term (1−ƒ(k, m)) represents the release from masking. Further details regarding this equation are provided in the discussion below.
With regard to content semantics or audio type, dialogue is often considered to be more important (or draws more attention) than background music, ambience, effects, or other types of content. The importance of an object is therefore dependent on its (signal) content, and relatively unimportant objects are more likely to be clustered than important objects.
The perceptual importance of an object can be derived by combining the perceived loudness and content importance of the objects. For example, in an embodiment, content importance can be derived based on a dialog confidence score, and a gain value (in dB) can be estimated based on this derived content importance. The loudness or excitation of the object can then be modified by the estimated gain, with the modified loudness representing the final perceptual importance of the object.
In an embodiment, the estimate importance 906 and clustering 904 processes are performed as a function of time. For this embodiment, the audio signals of the input objects 900 are segmented into individual frames that are subjected to certain analysis components. Such segmentation may be applied to time-domain waveforms, or may be performed using filter banks or any other transform domain. The estimate importance function 906 operates on one or more characteristics of the input audio objects 902 including content type and partial loudness.
With regard to estimating the object content type (1102), the content type (e.g., dialog, music, and sound effects) provides critical information to indicate the importance of an audio object. For example, dialog is usually the most important component in a movie since it conveys the story, and proper playback typically requires not allowing the dialog to move around with other moving audio objects. The estimate importance function 906 in
With regard to estimating content-based audio object importance, for dialog oriented applications, the content-based audio object importance is computed based on the dialog confidence score only, assuming that dialog is the most important component in audio as stated above. In other applications, confidence scores for different content types may be used, depending on the preferred type of content. In one embodiment, a sigmoid function is utilized, as provided in the following equation:
In the above equation, Ik is the estimated content-based importance of object k, pk is the corresponding estimated probability of object k consisting of speech/dialogue, and A and B are two parameters.
In order to further force the content-based importance to be consistently close to 0 for objects with dialog probability scores less than a threshold c, the above formula can be modified as follows:
In an embodiment, the constant c can take the value c=0.1, and the two parameters A and B can be either constants or adaptively tuned based on the probability score pk.
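Since the sigmoid itself is not reproduced above, the following is only a plausible sketch of the mapping from a dialog confidence score pk to a content-based importance Ik, with illustrative values for A, B and the threshold c; the exact parameterization used in the embodiment may differ.

```python
import math

def content_importance(p_k, A=10.0, B=0.5, c=0.1):
    """Map a dialog confidence score p_k in [0, 1] to a content-based importance.
    Scores below the threshold c are forced to 0; above it, a sigmoid in p_k
    with slope A and midpoint B is used (both values are illustrative)."""
    if p_k < c:
        return 0.0
    return 1.0 / (1.0 + math.exp(-A * (p_k - B)))
```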
With regard to calculating object partial loudness, one method to calculate partial loudness of one object in a complex auditory scene is based on the calculation of excitation levels E(b) in critical bands (b). The excitation level for a certain object of interest Eobj(b) and the excitation of all remaining (masking) signals Enoise(b) results in a specific loudness N′(b) in band b, as provided in the following equation:
N′(b)=C[(GEobj+GEnoise+A)α−Aα]−C[(GEnoise+A)α−Aα],
with G, C, A and α being model parameters. Subsequently, the partial loudness N is obtained by summing the specific loudness N′(b) across critical bands as follows:
N=ΣbN′(b)
When an auditory scene consists of K objects (k=1, . . . , K) with excitation levels Ek(b), and for simplicity of notation, model parameters G and C are assumed to be equal to +1, the specific loudness N′k (b) of object k is given by:
N′k(b)=(A+ΣmEm(b))α−(−Ek(b)+A+ΣmEm(b))α
The first term in the equation above represents the overall excitation of the auditory scene, plus an excitation A that reflects the absolute hearing threshold. The second term reflects the overall excitation except for the object of interest k, and hence the second term can be interpreted as a ‘masking’ term that applies to object k. This formulation does not account for a binaural release from masking. A release from masking can be incorporated by reducing the masking term above if the object of interest k is distant from another object m as given by the following equation:
N′k(b)=(A+ΣmEm(b))α−(−Ek(b)+A+ΣmEm(b)(1−ƒ(k,m)))α,
In the above equation, ƒ(k,m) is a function that equals 0 if object k and object m have the same position, and that increases toward +1 with increasing spatial distance between objects k and m. Said differently, the function ƒ(k,m) represents the amount of unmasking as a function of the distance in parametric positions of objects k and m. Alternatively, the maximum value of ƒ(k,m) may be limited to a value slightly smaller than +1, such as 0.995, to reflect an upper limit in the amount of spatial unmasking for objects that are spatially separated.
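A compact numerical sketch of the partial loudness with spatial release from masking is given below; the choice of unmasking function f(k,m) = d/(1+d), its cap at 0.995, and the values of A and α are illustrative stand-ins for the model parameters.

```python
import numpy as np

def partial_loudness(excitations, positions, k, A=1e-4, alpha=0.2):
    """excitations: (K, B) array of per-object, per-band excitation E_m(b);
    positions: (K, 3) array of object positions; k: index of the object of interest.
    Implements N'_k(b) = (A + sum_m E_m(b))^a - (A + sum_{m!=k} E_m(b)(1-f(k,m)))^a,
    summed over bands b."""
    d = np.linalg.norm(positions - positions[k], axis=1)
    f = np.minimum(d / (1.0 + d), 0.995)            # release from masking, capped below 1
    total = A + excitations.sum(axis=0)             # overall excitation plus hearing threshold
    # f(k, k) = 0, so subtracting E_k removes the m = k term from the masking sum.
    masking = A + ((1.0 - f)[:, None] * excitations).sum(axis=0) - excitations[k]
    specific = total ** alpha - masking ** alpha    # specific loudness N'_k(b)
    return float(np.sum(specific))                  # N_k = sum over bands
```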
The loudness calculation can also be performed with respect to a defined cluster centroid. In general, a centroid is the location in attribute space that represents the center of a cluster, and an attribute is a set of values corresponding to a measurement (e.g., loudness, content type, etc.). The partial loudness of individual objects is only of limited relevance if objects are clustered, and if the goal is to derive a constrained set of clusters and associated parametric positions that provides the best possible audio quality. In an embodiment, a more representative metric is the partial loudness accounted for by a specific cluster position (or centroid), aggregating all excitation in the vicinity of that position. Similar to the case above, the partial loudness accounted for by cluster centroid c can be expressed as follows:
N′c(b)=(A+ΣmEm(b))α−(A+ΣmEm(b)(1−ƒ(m,c)))α
In this context, an output bed channel (e.g., an output channel that should be reproduced by a specific loudspeaker in a playback system) can be regarded as a centroid with a fixed position, corresponding to the position of the target loudspeaker. Similarly, input bed signals can be regarded as objects with a position corresponding to the position of the corresponding reproduction loudspeaker. Hence objects and bed channels can be subjected to the exact same analysis, under the constraint that bed channel positions are fixed.
In an embodiment, the loudness and content analysis data are combined to derive a combined object importance value, as shown in block 1108 of
E′k(b)=Ek(b)g(Ik)
In the above equation, Ik is the content-based object importance of object k, E′k(b) is the modified excitation level, and g(·) is a function to map the content importance into excitation level modifications. In an embodiment, g(·) is an exponential function interpreting the content importance as a gain in dB:
g(Ik)=10^(G·Ik)
where G is a gain applied to the content-based object importance, which can be tuned to obtain the best performance.
In another implementation, g(·) is a linear function, as follows:
g(Ik)=1+G·Ik
The above equations are merely examples of possible embodiments. Alternative methods may operate on loudness instead of excitation, and may combine the information in ways other than a simple product.
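For concreteness, a small sketch of applying the two example mappings above to an excitation pattern is given below; the default value of G is arbitrary.

```python
import numpy as np

def modified_excitation(E_k, I_k, G=0.5, mode="exponential"):
    """E'_k(b) = E_k(b) * g(I_k), where g is either the exponential mapping
    10**(G*I_k) (content importance treated as a gain) or the linear mapping
    1 + G*I_k; G is a tunable gain."""
    g = 10.0 ** (G * I_k) if mode == "exponential" else 1.0 + G * I_k
    return np.asarray(E_k) * g
```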
As also shown in
In one embodiment, the time constant is positively correlated to the content-based object importance, as follows:
τ=τ0+Ik·τ1
In the above equation, τ is the estimated importance-dependent time constant, and τ0 and τ1 are parameters. Moreover, similar to the excitation/loudness level modification based on content importance, the adaptive time constant scheme can also be applied to either loudness or excitation.
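The text does not specify the smoother itself, so the sketch below assumes a simple first-order recursion whose forgetting factor follows from τ = τ0 + Ik·τ1; the parameter values and hop size are illustrative.

```python
import math

def smooth_loudness(frame_loudness, frame_importance, tau0=0.2, tau1=0.8, frame_dt=0.02):
    """Smooth per-frame loudness (or excitation) with an importance-dependent
    time constant tau = tau0 + I_k * tau1 (seconds); frame_dt is the frame hop."""
    smoothed, state = [], None
    for loudness, importance in zip(frame_loudness, frame_importance):
        tau = tau0 + importance * tau1
        a = math.exp(-frame_dt / tau)                 # per-frame forgetting factor
        state = loudness if state is None else a * state + (1.0 - a) * loudness
        smoothed.append(state)
    return smoothed
```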
As stated above, the partial loudness of audio objects is calculated with respect to a defined cluster centroid. In an embodiment, a cluster centroid calculation is performed such that when the total number of clusters is constrained, a subset of cluster centroids is selected that accounts for the maximum partial loudness of the centroids.
In an alternative embodiment, the loudness processing could involve performing a loudness analysis on a sampling of all possible positions in the spatial domain, followed by selecting local maxima across all positions. In a further alternative embodiment, Hochbaum centroid selection is augmented with loudness. The Hochbaum centroid selection is based on the selection of a set of positions that have maximum distance with respect to one another. This process can be augmented by multiplying or adding loudness to the distance metric to select centroids.
As shown in
In an embodiment, the clustering process selects n centroids within the X/Y plane for clustering the objects, where n is the number of clusters. The process selects the n centroids that correspond to the highest importance, or maximum loudness accounted for. The remaining objects are then either (1) clustered according to a nearest-neighbor rule, or (2) rendered into the cluster centroids by panning techniques. Thus, audio objects can be allocated to clusters by adding the object signal of a clustered object to the closest centroid, or mixing the object signal into a (sub)set of clusters. The number of selected clusters may be dynamic and determined through mixing gains that minimize the spatial error in a cluster. The cluster metadata consists of weighted averages of the objects that reside in the cluster. The weights may be based on the perceived loudness, as well as object position, size, zone, exclusion mask, and other object characteristics. In general, clustering of objects is primarily dependent on object importance and one or more objects may be distributed over multiple output clusters. That is, an object may be added to one cluster (uniquely clustered), or it may be distributed over more than one cluster (non-uniquely clustered).
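A greatly simplified sketch of the nearest-neighbor variant follows: the n most important objects are taken as centroids and every remaining object is assigned to its closest centroid. Panning an object into several centroids, dynamic cluster counts, and metadata averaging are omitted here.

```python
import numpy as np

def select_centroids_and_allocate(positions, importance, n):
    """positions: (K, 3) array of object positions; importance: length-K array
    (e.g., partial loudness accounted for); n: number of output clusters."""
    order = np.argsort(importance)[::-1]
    centroid_ids = order[:n]                         # n most important objects
    assignment = {}
    for k in range(len(positions)):
        if k in centroid_ids:
            assignment[k] = int(k)                   # centroids form their own clusters
        else:
            d = np.linalg.norm(positions[centroid_ids] - positions[k], axis=1)
            assignment[k] = int(centroid_ids[int(np.argmin(d))])
    return centroid_ids, assignment
```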
As shown in
In an embodiment, the clustering process involves analyzing the audio content of every individual input track (object or bed) as well as the attached metadata (e.g., the spatial position of the objects) to derive an equivalent number of output object/bed tracks that minimizes a given error metric. In a basic implementation, the error metric 1302 is based on the spatial distortion due to shifting the clustered objects and can further be weighted by a measure of the importance of each object over time. The importance of an object can encapsulate other characteristics of the object, such as loudness, content type, and other relevant factors. Alternatively, these other factors can form separate error metrics that can be combined with the spatial error metric.
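One plausible form of such an importance-weighted spatial error metric is sketched below; the use of Euclidean distance and a weighted mean is an assumption, not the only possible choice.

```python
import numpy as np

def weighted_spatial_error(original_positions, cluster_positions, importance):
    """Average spatial distortion caused by moving each object to its cluster
    position, weighted by per-object importance (loudness, content type, etc.)."""
    d = np.linalg.norm(np.asarray(original_positions) - np.asarray(cluster_positions), axis=1)
    w = np.asarray(importance, dtype=float)
    return float(np.sum(w * d) / (np.sum(w) + 1e-12))
```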
Object and Channel Processing
In an adaptive audio system, certain objects may be defined as fixed objects, such as channel beds that are associated with specific speaker feeds. In an embodiment, the clustering process accounts for bed plus dynamic object interaction, such that when an object creates too much error when being grouped with a clustered object (e.g., it is an outlying object), it is instead mixed to a bed.
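As a sketch of this outlier handling, the decision below routes an object to the bed when even its best cluster would exceed an error threshold; the error values are assumed to come from a metric such as the one above.

```python
def route_outlier(errors_per_cluster, error_threshold):
    """errors_per_cluster: {cluster_id: spatial error if the object joins that
    cluster}.  If the best cluster still exceeds the threshold, the object is
    treated as an outlier and mixed into the bed instead."""
    best_cluster = min(errors_per_cluster, key=errors_per_cluster.get)
    if errors_per_cluster[best_cluster] > error_threshold:
        return "bed"
    return best_cluster
```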
Playback System
As described above, various different end-point devices may be used in conjunction with a rendering system that employs a clustering process as described herein, and such devices may have certain capabilities that may impact the clustering process.
The adaptive audio system employing aspects of the clustering process may comprise a playback system that is configured to render and play back audio content that is generated through one or more capture, pre-processing, authoring and coding components. An adaptive audio pre-processor may include source separation and content type detection functionality that automatically generates appropriate metadata through analysis of input audio. For example, positional metadata may be derived from a multi-channel recording through an analysis of the relative levels of correlated input between channel pairs. Detection of content type, such as speech or music, may be achieved, for example, by feature extraction and classification. Certain authoring tools allow the authoring of audio programs by optimizing the input and codification of the sound engineer's creative intent, allowing the engineer to create the final audio mix once in a form that is optimized for playback in practically any playback environment. This can be accomplished through the use of audio objects and positional data that is associated and encoded with the original audio content. In order to accurately place sounds around an auditorium, the sound engineer needs control over how the sound will ultimately be rendered based on the actual constraints and features of the playback environment. The adaptive audio system provides this control by allowing the sound engineer to change how the audio content is designed and mixed through the use of audio objects and positional data. Once the adaptive audio content has been authored and coded in the appropriate codec devices, it is decoded and rendered in the various components of the playback system.
In general, the playback system may be any professional or consumer audio system, which may include home theater (e.g., A/V receiver, soundbar, and Blu-ray), E-media (e.g., PC, tablet, mobile including headphone playback), broadcast (e.g., TV and set-top box), music, gaming, live sound, user generated content, and so on. The adaptive audio content provides enhanced immersion for the consumer audience for all end-point devices, expanded artistic control for audio content creators, improved content dependent (descriptive) metadata for improved rendering, expanded flexibility and scalability for consumer playback systems, timbre preservation and matching, and the opportunity for dynamic rendering of content based on user position and interaction. The system includes several components, including new mixing tools for content creators, updated and new packaging and coding tools for distribution and playback, in-home dynamic mixing and rendering (appropriate for different consumer configurations), and additional speaker locations and designs.
Aspects of the audio environment described herein represent the playback of the audio or audio/visual content through appropriate speakers and playback devices, and may represent any environment in which a listener is experiencing playback of the captured content, such as a cinema, concert hall, outdoor theater, a home or room, listening booth, car, game console, headphone or headset system, public address (PA) system, or any other playback environment. The spatial audio content comprising object-based audio and channel-based audio may be used in conjunction with any related content (associated audio, video, graphic, etc.), or it may constitute standalone audio content. The playback environment may be any appropriate listening environment from headphones or near field monitors to small or large rooms, cars, open air arenas, concert halls, and so on.
Aspects of the systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof. In an embodiment in which the network comprises the Internet, one or more machines may be configured to access the Internet through web browser programs.
One or more of the components, blocks, processes or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media.
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list and any combination of the items in the list.
While one or more implementations have been described by way of example and in terms of the specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
This application claims the benefit of priority to U.S. Provisional Patent Application No. 61/745,401 filed 21 Dec. 2012 and U.S. Provisional Application No. 61/865,072 filed 12 Aug. 2013, hereby incorporated by reference in entirety.