Example embodiments disclosed herein generally relate to audio content processing, and more specifically, to a method and system for audio object extraction with sub-band object probability estimation.
Traditionally, audio content is created and stored in channel-based formats. As used herein, the term “audio channel” or “channel” refers to the audio content that usually has a predefined physical location. For example, stereo, surround 5.1, surround 7.1 and the like are all channel-based formats for audio content. Recently, with the development in the multimedia industry, three-dimensional (3D) audio content is getting more and more popular in cinema and home. In order to create a more immersive sound field and to control discrete audio elements accurately, irrespective of specific playback speaker configurations, many conventional playback systems need to be extended to support a new format of audio that includes both the audio channels and audio objects.
As used herein, the term “audio object” refers to an individual audio element that exists for a defined duration of time in the sound field. An audio object may be dynamic or static. For example, an audio object may be a human, an animal, or any other object serving as a sound source in the sound field. Optionally, the audio objects may have associated metadata, such as information describing the position, velocity, and size of an object. Use of audio objects enables the audio content to provide a highly immersive listening experience, while allowing an operator, such as an audio mixer, to control and adjust the audio objects in a convenient manner. During transmission, the audio objects and channels can be sent separately and then used by a reproduction system on the fly to recreate the artistic intention adaptively, based on the configuration of the playback speakers. As an example, in a format known as “adaptive audio content,” there may be one or more audio objects and one or more “audio beds”. As used herein, the term “audio beds” or “beds” refers to audio channels that are meant to be reproduced in predefined, fixed locations.
In general, object-based audio content is generated in a quite different way from traditional channel-based audio content. Although the new object-based format allows the creation of a more immersive listening experience with the aid of audio objects, the channel-based audio format, especially the final-mix audio format, still prevails in the movie sound ecosystem, for example, in the chains of sound creation, distribution, and consumption. As a result, given traditional channel-based content, in order to provide end users with immersive experiences similar to those provided by audio objects, there is a need to extract audio objects from the traditional channel-based content.
In order to address the foregoing and other potential problems, example embodiments disclosed herein propose a method and system for extracting audio objects from audio content.
In one aspect, example embodiments disclosed herein provide a method for audio object extraction from audio content. The method includes determining a sub-band object probability for a sub-band of an audio signal in a frame of the audio content, the sub-band object probability indicating a probability of the sub-band of the audio signal containing an audio object. The method further includes dividing the sub-band of the audio signal into an audio object portion and a residual audio portion based on the determined sub-band object probability. Embodiments in this regard further include a corresponding computer program product.
In another aspect, example embodiments disclosed herein provide a system for audio object extraction from audio content. The system includes a probability determining unit configured to determine a sub-band object probability for a sub-band of an audio signal in a frame of the audio content, the sub-band object probability indicating a probability of the sub-band of the audio signal containing an audio object. The system further includes an audio dividing unit configured to divide the sub-band of the audio signal into an audio object portion and a residual audio portion based on the determined sub-band object probability.
Through the following description, it will be appreciated that, in accordance with example embodiments disclosed herein, the sub-bands of the audio signal can be softly divided into an audio object portion and a residual audio portion. In this way, instability in the audio content regenerated from the divided audio object portions and residual audio portions can be better prevented. Other advantages achieved by example embodiments disclosed herein will become apparent through the following descriptions.
Through the following detailed description with reference to the accompanying drawings, the above and other objectives, features and advantages of example embodiments disclosed herein will become more comprehensible. In the drawings, several example embodiments will be illustrated in an example and non-limiting manner, wherein:
Throughout the drawings, the same or corresponding reference symbols refer to the same or corresponding parts.
Principles of the example embodiments will now be described with reference to various example embodiments illustrated in the drawings. It should be appreciated that the depiction of these embodiments is only to enable those skilled in the art to better understand and further implement the example embodiments disclosed herein, and is not intended to limit the scope in any manner.
As mentioned above, it is desired to extract audio objects from audio content. The previously developed channel-grouping based method typically works well on multi-channel pre-dubs and stems, which usually contain only one audio object in each channel. As used herein, the term “pre-dub” refers to channel-based audio content prior to being combined with other pre-dubs to produce a stem. The term “stem” refers to channel-based audio content prior to being combined with other stems to produce a final mix. Examples of such content comprise dialogue stems, sound effect stems, music stems, and so forth. For these kinds of audio content, there are few cases in which audio objects overlap within channels. The channel-grouping based method is appropriate for the re-authoring or content creation use cases where pre-dubs and stems are available and audio mixers can further manipulate the audio objects, such as editing, deleting, or merging the audio objects, or modifying their positions, trajectories, or other metadata. However, the above-presented method is not purposely designed (and may not work well) for another use case, where a more complex multi-channel final mix is considered and is automatically up-mixed from 2D to 3D through object extraction to create a 3D audio experience. Moreover, in multi-channel final mixing, multiple sources are usually mixed together in one channel. Thus, an automatically extracted object may contain more than one actual audio object, which may in turn make its position determination incorrect. If source separation algorithms are applied to separate the mixed sources, for example, to extract individual audio objects from the audio content, the extracted audio objects may have audible artifacts, causing an instability problem.
In order to address the above and other potential problems, example embodiments disclosed herein propose a method and system for audio object extraction in a soft manner. Each sub-band of each frame (that is, each spectral-temporal tile) of the audio is analyzed and softly assigned to an audio object portion and an audio bed (residual audio) portion. Compared with a hard decision scheme, where one spectral-temporal tile is extracted as an audio object in the current frame and as residual audio in the next frame or vice versa, causing audible switching artifacts at the transition point, the soft-decision scheme of the example embodiments can minimize the switching artifact.
Reference is first made to the accompanying flowchart, which illustrates a method 100 for audio object extraction from audio content in accordance with example embodiments disclosed herein.
At S101, a sub-band object probability is determined for a sub-band of the audio signal in a frame of the audio content. The sub-band object probability indicates a probability of the sub-band of the audio signal containing an audio object.
A frame is a processing unit of the audio content, and the duration of a frame may vary and may depend on the configuration of the audio processing system. In some embodiments, a frame of the audio content is converted into multiple filter band signals using a time-frequency transform such as complex quadrature mirror filterbanks (CQMF), the Fast Fourier Transform (FFT), or the like. For a frame, its full frequency range may be divided into a plurality of frequency sub-bands, each of which occupies a predefined frequency range. For example, for a frame with a frequency range from 0 Hz to 24 kHz, a sub-band may occupy a frequency range of 400 Hz. In example embodiments disclosed herein, the plurality of sub-bands may have the same or different frequency ranges. The scope of the example embodiments is not limited in this regard.
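Purely by way of illustration (and not as part of the embodiments), the following Python sketch shows one way such a division could be implemented with an FFT; the 48 kHz sample rate, the 2048-sample frame, and the 400 Hz sub-band width are illustrative assumptions rather than values mandated by the embodiments.

```python
# A minimal sketch of dividing one frame into equal-width frequency sub-bands.
import numpy as np

def frame_to_subbands(frame, sample_rate=48000, subband_hz=400):
    """Transform a mono time-domain frame into a list of complex sub-band spectra."""
    spectrum = np.fft.rfft(frame)                       # frequency-domain representation
    bin_hz = sample_rate / len(frame)                   # frequency resolution per FFT bin
    bins_per_band = max(1, round(subband_hz / bin_hz))  # FFT bins grouped per sub-band
    return [spectrum[k:k + bins_per_band]
            for k in range(0, len(spectrum), bins_per_band)]

frame = np.random.randn(2048)                           # stand-in for one frame of audio
subbands = frame_to_subbands(frame)
print(len(subbands), "sub-bands of up to", len(subbands[0]), "bins each")
```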
The division of the whole frequency band into multiple frequency sub-bands is based on the observation that when different audio objects overlap within channels, they are not likely to overlap in all of the sub-bands, due to the well-known sparsity property of most audio signals. It is therefore much more reasonable to assume that each sub-band contains one dominant source at each time. Accordingly, the following processing of audio object extraction can be performed on a sub-band of the audio signal.
For audio content in a traditional format, such as final-mix multichannel audio, directly extracting each sub-band of the audio signal as an audio object might introduce some audible artifacts, especially in some “bad” cases, for example, where the sparsity assumption that the sub-band contains only one dominant object is not satisfied; where some sub-bands are not suitable to be extracted as audio objects from the artistic point of view; or where some sub-bands are difficult for the renderer to render to a specific position after being extracted as objects. In some cases, the sparsity assumption might not be satisfied, since multiple sources (ambience, and/or objects from different spatial positions) might be mixed together in different sub-bands with different proportions. One example case is that two different objects, one in the left channel and the other in the right channel, are mixed in one sub-band. In this case, if the sub-band is extracted as an audio object, the two different objects will be processed as one object and rendered to the center channel, which will introduce audible artifacts.
Therefore, in order to extract sub-band objects from the input audio content without introducing audible artifacts, the sub-band object probability is proposed in example embodiments disclosed herein to indicate whether the sub-band is suitable to be extracted as an audio object or not. More specifically, the sub-band object probability is intended to avoid extracting audio objects in sub-bands in the “bad” cases discussed above. To this end, each sub-band of the audio signal is analyzed and the sub-band object probability is determined at this step. Based on the determined sub-band object probability, the sub-band of the audio signal will be divided into an audio object portion and a residual audio portion in a soft manner.
For each “bad” case of object extraction, there may be one or more factors/clues associated with it. For example, when two different objects exist in one sub-band, the channel correlation of the sub-band would be low. Therefore, in some example embodiments disclosed herein, several factors, for example, a spatial position of the sub-band, channel correlation, panning rules and/or frequency range of the sub-band may be considered separately or jointly in sub-band object probability determination, which will be described below in more details.
At S102, the sub-band of the audio signal is split into an audio object portion and a residual audio portion based on the determined sub-band object probability. In this step, the sub-band of the audio signal may not be determined as exactly either an audio object or an audio bed, but may be split into an audio object portion and a residual audio/audio bed portion in a soft manner based on the sub-band object probability. In example embodiments disclosed herein, one audio object portion may not exactly contain one so-called audio object, such as the sound of a person, an animal, or thunder, but may contain a portion of the sub-band of the audio signal that may be viewed as an audio object. In some embodiments, the audio object portions may then be rendered to their estimated spatial positions and the residual audio portions may be rendered as bed channels in adaptive audio content processing.
One of the advantages of soft audio object extraction is to avoid both the audio instability and the switching artifact between audio object rendering and channel-based rendering that may be caused by a hard decision. For example, with a hard decision scheme, if one sub-band is extracted as an audio object in the current frame and extracted as an audio bed in the next frame, or vice versa, the switching artifacts may be audible at this transition point. However, with the soft-decision scheme of the example embodiments, part of the sub-band is extracted as an object and the other part of the sub-band remains in the audio beds, and the switching artifact may be minimized.
In the processing illustrated in the accompanying block diagram, the frame of the input audio content is first divided into a plurality of sub-bands, and the sub-band object probability determination and the audio object/residual audio splitting are then performed for each sub-band.
With reference to the accompanying block diagram, the block of sub-band object probability determining 202 corresponds to step S101 of the method 100, in which the sub-band object probability for a sub-band of the audio signal is determined.
With respect to the factors having impact on the sub-band object probability, according to some example embodiments disclosed herein, the determination of the sub-band object probability for the sub-band of the audio signal in step S101 of the method 100 may comprise determining the sub-band object probability based on at least one of: a first probability determined based on a spatial position of the sub-band of the audio signal; a second probability determined based on correlation between multiple channels of the sub-band of the audio signal when the audio content is of a format based on multiple-channels; a third probability determined based on at least one panning rule in audio mixing; and a fourth probability determined based on a frequency range of the sub-band of the audio signal.
The determination of the first, second, third, and fourth probabilities will be respectively discussed below.
The First Probability Based on Spatial Position
As is known, in order to enhance spatial perception in audio processing, audio objects are usually rendered to different spatial positions by audio mixers. As a result, in traditional channel-based audio content, spatially different audio objects are usually panned into different sets of channels with different energy portions.
When an audio object is panned into multiple channels, the sub-bands where the audio object exists will have the same energy distribution across the multiple channels, as well as the same determined spatial position. Correspondingly, if several sub-bands are at the same or close positions, there may be a high probability that these sub-bands belong to the same object. On the contrary, if the sub-bands are distributed sparsely, their sub-band object probabilities may be low, since these sub-bands are probably a mixture of different objects or ambience.
For example, two different cases of spatial position distribution of the sub-bands are shown in the accompanying drawings: one in which the sub-bands are concentrated around the same positions, and one in which the sub-bands are sparsely distributed.
In view of the above, the spatial position of the sub-band of the audio signal may be used as a factor to determine the sub-band object probability, and a first probability based on spatial position may be determined. In some example embodiments, to calculate the first probability determined based on a spatial position of the sub-band of the audio signal, the following steps may be performed: obtaining spatial positions of the plurality of sub-bands of the audio signal in the frame of the audio content; determining a sub-band density around the spatial position of the sub-band of the audio signal according to the obtained spatial positions of the plurality of sub-bands of the audio signal; and determining the first probability for the sub-band of the audio signal based on the sub-band density. As discussed above, the first probability may be positively correlated with the sub-band density. That is, the higher the sub-band density is, the higher the first probability is. The first probability is in a range from 0 to 1.
There may be many ways to obtain the spatial positions of the plurality of sub-bands of the audio signal, for example, an energy weighting based method or a loudness weighting based method. In some embodiments, clues or information provided by a human user may be used to determine the spatial positions of the plurality of sub-bands of the audio signal. The scope of the example embodiments disclosed herein is not limited in this regard. In one embodiment, spatial position determination using the energy weighting based method is presented as follows as an example:
$$p_i = \frac{\sum_{m=1}^{M} e_i^m \, P_m}{\sum_{m=1}^{M} e_i^m} \tag{1}$$

where $p_i$ represents the spatial position of the ith sub-band in the processing frame; $e_i^m$ represents the energy of the mth channel of the ith sub-band; $P_m$ represents the predefined spatial position of the mth channel in the playback place; and $M$ represents the number of channels.
Usually the speakers of the corresponding channels are deployed at predefined positions in a playback place, such as a TV room or a cinema. In one embodiment, $P_m$ may be the spatial position of the speaker of the mth channel. If the input audio content is of a format based on a single channel, $P_m$ may be the position of that single channel. In cases where the deployment of the channels is not clearly known, $P_m$ may be a predefined position of the mth channel.
As discussed above, the sub-band object probability of a sub-band may be high if there are many sub-bands nearby, and it may be low if it is spatially sparse. From this point of view, the first probability may be positively correlated with the sub-band density and may be calculated as a monotonically increasing function of sub-band density. In one embodiment, a sigmoid function may be used to represent the relation between the first probability and the sub-band density, and the first probability may be calculated as follows:
$$\text{prob}_1(i) = \frac{1}{1 + e^{a_d D_i + b_d}} \tag{2}$$

where $\text{prob}_1(i)$ represents the first probability of the ith sub-band; $D_i$ represents the sub-band density around the ith sub-band; and $a_d$ and $b_d$ represent the parameters of the sigmoid function mapping the sub-band density $D_i$ to the first probability, with $a_d$ typically negative so that the first probability increases with the sub-band density.
It should be noted that there are many other ways to determine the first probability based on the sub-band density, as long as the first probability is positively correlated with the sub-band density. The scope of the example embodiment is not limited in this regard. For example, the first probability and the sub-band density may satisfy a linear relation. For another example, different ranges of sub-band density may correspond to linear functions with different slopes when determining the first probability. That is, the relation between the first probability and the sub-band density may be represented as a broken line, having several segments with different slopes. In any case, the first probability is in a range from 0 to 1.
Various approaches may be used to estimate the sub-band density, including but not limited to a histogram based method, a kernel density determination method, and a range of data clustering techniques. The scope of the example embodiments is not limited in this regard. In one embodiment, the kernel density determination method is described as an example to estimate the sub-band density $D_i$ as follows:
$$D_i = \frac{1}{N} \sum_{j=1}^{N} k(p_i, p_j) \tag{3}$$

where $N$ represents the number of sub-bands; $p_i$ and $p_j$ represent the spatial positions of the ith and jth sub-bands; and $k(p_i, p_j)$ represents a kernel function that equals 1 if the ith and jth sub-bands are at the same position and decreases toward 0 as the spatial distance between the ith and jth sub-bands increases. In other words, the function $k(p_i, p_j)$ represents the density contribution as a function of the spatial distance between the ith and jth sub-bands.
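Purely as an illustration of Equations (1) to (3), the following sketch computes energy-weighted sub-band positions, a kernel density estimate, and the sigmoid mapping to the first probability. The Gaussian kernel (one admissible choice of $k(p_i, p_j)$) and the parameter values sigma, a_d, and b_d are illustrative assumptions, not values prescribed by the embodiments.

```python
# A sketch of Equations (1)-(3): sub-band positions, kernel density, first probability.
import numpy as np

def subband_positions(energies, channel_positions):
    """Eq. (1): energy-weighted spatial position p_i of each sub-band.
    energies: (N, M) energies of N sub-bands in M channels.
    channel_positions: (M, 2) predefined 2-D position P_m of each channel."""
    weights = energies / np.maximum(energies.sum(axis=1, keepdims=True), 1e-12)
    return weights @ channel_positions

def kernel_density(positions, sigma=0.2):
    """Eq. (3): density D_i around each sub-band; a Gaussian kernel is one
    admissible k(p_i, p_j) that is 1 at zero distance and decays toward 0."""
    diff = positions[:, None, :] - positions[None, :, :]
    k = np.exp(-(diff ** 2).sum(axis=-1) / (2 * sigma ** 2))
    return k.mean(axis=1)

def first_probability(density, a_d=-10.0, b_d=3.0):
    """Eq. (2): sigmoid mapping of density to probability; a_d is negative so
    that the probability increases with density (values are illustrative)."""
    return 1.0 / (1.0 + np.exp(a_d * density + b_d))

energies = np.abs(np.random.randn(32, 5))      # 32 sub-bands, 5 channels (toy data)
channels = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0], [0.0, 1.0], [1.0, 1.0]])
prob1 = first_probability(kernel_density(subband_positions(energies, channels)))
print(prob1.round(2))
```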
The Second Probability Based on Channel Correlation
To determine whether a spectral-temporal tile (a sub-band of the audio signal) is suitable to be extracted as an audio object and rendered to a specific position, another factor that may be used is the channel correlation. In this case, the input audio content may be of a format based on a plurality of channels. For each multichannel spectral-temporal tile, if it contains one dominant object, the correlation value between the multiple channels may be high. On the contrary, the correlation value may be low if it contains large amounts of ambience or more than one object. Since the extracted sub-band object will be further down-mixed into a mono audio object for object-based rendering, low correlation among channels may pose a great challenge to the down-mixer, and an obvious timbre change may be perceived after down-mixing. Therefore, the correlation between different channels may be used as a factor to estimate the sub-band object probability, and a second probability based on channel correlation may be determined.
In some example embodiments, to calculate the second probability based on the correlation between multiple channels of the sub-band of the audio signal when the audio content is of a format based on multiple channels, the following steps may be performed: determining a degree of correlation between each two of the multiple channels for the sub-band of the audio signal; obtaining a total degree of correlation between the multiple channels of the sub-band of the audio signal based on the determined degrees of correlation; and determining the second probability for the sub-band of the audio signal based on the total degree of correlation. As discussed above, the second probability may be positively correlated with the total degree of correlation. That is, the higher the total degree of correlation is, the higher the second probability is. The second probability is in a range from 0 to 1.
There may be many ways to estimate the degree of correlation between multiple channels, for example, an energy weighted channel correlation based method or a loudness weighted channel correlation based method. The scope of the example embodiments is not limited in this regard. In one embodiment, the correlation determination using the energy weighting based method is presented as follows as an example:
$$C_i = \frac{\sum_{n=1}^{M} \sum_{m=n+1}^{M} e_i^n \, e_i^m \, \text{corr}(\vec{x}_i^{\,n}, \vec{x}_i^{\,m})}{\sum_{n=1}^{M} \sum_{m=n+1}^{M} e_i^n \, e_i^m} \tag{4}$$

where $C_i$ represents the total degree of correlation between the multiple channels; $\vec{x}_i^{\,n}$ and $\vec{x}_i^{\,m}$ represent the temporal sequences of the audio signal of the nth and mth channels of the ith sub-band in the processing frame; $M$ represents the number of channels; $e_i^n$ and $e_i^m$ represent the energies of the nth and mth channels of the ith sub-band; and $\text{corr}(\vec{x}_i^{\,n}, \vec{x}_i^{\,m})$ represents the degree of correlation between the two channels, the nth channel and the mth channel, of the ith sub-band, which may be determined as the correlation/similarity between the two temporal sequences of the audio signal.
As discussed above, the second probability based on channel correlation may be positively correlated with the total degree of correlation. In one embodiment, similar to the position distribution based probability, a sigmoid function may be used to represent the relation between the second probability and the total degree of correlation, and the second probability may be calculated as follows:
$$\text{prob}_2(i) = \frac{1}{1 + e^{a_c C_i + b_c}} \tag{5}$$

where $\text{prob}_2(i)$ represents the second probability of the ith sub-band; $C_i$ represents the total degree of correlation; and $a_c$ and $b_c$ represent the parameters of the sigmoid function mapping the total degree of correlation to the second probability, with $a_c$ typically negative so that the second probability increases with the total degree of correlation.
It should be noted that there are many other ways to determine the second probability based on the total degree of correlation, as long as the second probability is positively correlated with the total degree of correlation. The scope of the example embodiment is not limited in this regard. For example, the second probability and the total degree of correlation may satisfy a linear relation. For another example, different degrees of correlation may correspond to linear functions with different slopes when determining the second probability. That is, the relation between the second probability and the total degree of correlation may be represented as a broken line, having several segments with different slopes. In any case, the second probability is in a range from 0 to 1.
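A minimal sketch of Equations (4) and (5) might look as follows; the sigmoid parameters a_c and b_c, and the use of the Pearson coefficient for corr(·,·), are illustrative assumptions of this sketch.

```python
# A sketch of Equations (4)-(5): energy-weighted correlation and second probability.
import numpy as np

def total_channel_correlation(x):
    """Eq. (4): energy-weighted average of pairwise channel correlations for one
    sub-band. x: (M, T) temporal sequences of the M channels of the sub-band."""
    e = (x ** 2).sum(axis=1)                     # per-channel energies e_i^m
    num = den = 0.0
    for n in range(len(x)):
        for m in range(n + 1, len(x)):
            c = np.corrcoef(x[n], x[m])[0, 1]    # corr(x_i^n, x_i^m), Pearson here
            num += e[n] * e[m] * c
            den += e[n] * e[m]
    return num / max(den, 1e-12)

def second_probability(total_corr, a_c=-8.0, b_c=4.0):
    """Eq. (5): sigmoid mapping; a_c is negative so that a higher total degree
    of correlation yields a higher probability (values are illustrative)."""
    return 1.0 / (1.0 + np.exp(a_c * total_corr + b_c))

tile = np.random.randn(5, 256)                   # one 5-channel spectral-temporal tile
print(second_probability(total_channel_correlation(tile)))
```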
The Third Probability Based on Panning Rules
Although the extracted audio objects may be used to enhance the listening experience by rendering the audio objects at determined positions in adaptive audio content generation, doing so may sometimes violate the artistic intention of the content creator, such as an audio mixer, which is a great challenge for publishing the generated adaptive audio content to consumers. For example, an audio mixer might pan an object into both the left channel and the right channel with the same energy to create a wide central sound image; directly extracting this sound signal as an object and rendering it to the center channel might make the sound not as wide as the audio mixer intended. Therefore, the artistic intention of the content creator may be taken into consideration during the audio object extraction, to avoid undesirable intention violation.
Audio mixers usually realize their artistic intention by panning audio objects/sources with specific panning rules. Therefore, to preserve the artistic intention of the content creator during the audio object extraction, it is reasonable to understand what kinds of sub-bands are created with special artistic intention (and with specific panning rules). Sub-bands created with such special panning rules are undesirable to extract as objects.
In some example embodiments, the following panning rules in the original audio mixing may be considered during the object extraction: a rule based on untypical energy distribution, covering sub-bands whose energy distribution across channels deviates from what typical panning methods would produce; and a rule based on vicinity to a center channel, covering sub-bands panned into both the left and right channels with similar energy to create a wide central sound image.
It should be noted that besides the above two panning rules, there may be other panning rules that should be taken into account during the audio object extraction. The scope of the example embodiment is not limited in this regard.
In some example embodiments, to calculate the third probability determined based on at least one panning rule in audio mixing, the following steps may be performed: determining, for the sub-band of the audio signal, a degree of association with each of the at least one panning rule in audio mixing, each panning rule indicating a condition under which a sub-band of the audio signal is unsuitable to be an audio object; and determining the third probability for the sub-band of the audio signal based on the determined degree of association. As discussed above, the panning rules generally indicate the cases where the sub-bands of the audio signal may not be extracted as audio objects, in order to avoid destroying the special artistic intention in the audio mixing. As a result, the third probability may be negatively correlated with the total degree of association with the panning rules. That is, the higher the total degree of association with the panning rules is, the lower the third probability is. The third probability is in a range from 0 to 1.
Suppose there are K panning rules, each of which indicates a case in which the sub-band of the audio signal may not be suitable to be extracted as an object from the artistic intention preservation point of view. In one embodiment, the third probability based on the panning rules may be determined for each sub-band as follows:
$$\text{prob}_3(i) = \prod_{k=1}^{K} \bigl(1 - q_k(i)\bigr) \tag{6}$$

where $\text{prob}_3(i)$ represents the third probability of the ith sub-band, and $q_k(i)$ represents the degree to which the ith sub-band is associated with the kth panning rule. Therefore, the third probability may be high if the sub-band is not associated with any specific panning rule, and it may be low if the sub-band is associated with one specific panning rule. In some embodiments, $q_k(i)$ is 1 if the ith sub-band is totally associated with the kth panning rule, and 0 if it is not; in other embodiments, the degree of association with the kth panning rule may be determined as a value varying from 0 to 1.
In some other embodiments, the at least one panning rule may include at least one of: a rule based on untypical energy distribution and a rule based on vicinity to a center channel. These correspond respectively to the two panning rules discussed above. Sub-bands associated with either of the two rules may be considered undesirable to extract as objects.
In some embodiments, the determination of the degree of association with the rule based on untypical energy distribution may comprise: determining the degree of association with the rule based on untypical energy distribution according to a first distance between an actual energy distribution and an estimated typical energy distribution of the sub-band of the audio signal. In an example embodiment, the degree of association with the rule based on untypical energy distribution may be represented as a probability, and may be defined as below:
$$q_1(i) = \frac{1}{1 + e^{a_e \, d(\vec{e}_i, \hat{e}_i) + b_e}} \tag{7}$$

where $q_1(i)$ represents the probability that the ith sub-band is associated with the rule based on untypical energy distribution; $\vec{e}_i$ represents the actual energy distribution of the ith sub-band; $\hat{e}_i$ represents the estimated typical energy distribution of the ith sub-band obtained by traditional panning methods; $d(\vec{e}_i, \hat{e}_i)$ represents the distance between the two energy distributions, which indicates whether the actual energy distribution $\vec{e}_i$ of the ith sub-band is untypical or not; and $a_e$ and $b_e$ represent the parameters of the sigmoid function mapping the distance $d(\vec{e}_i, \hat{e}_i)$ to the probability $q_1(i)$.
The actual energy distribution $\vec{e}_i$ of the ith sub-band may be measured by well-known methods. To determine the estimated typical energy distribution $\hat{e}_i$ of the ith sub-band, the spatial position $p_i$ of the ith sub-band may first be determined based on the actual energy distribution $\vec{e}_i$. For example, if the energy is distributed equally between the left and right channels, the spatial position $p_i$ may be the center between the left and right channels. Assuming that traditional panning methods are used, the ith sub-band would be panned to a channel near the spatial position $p_i$, which yields the estimated typical energy distribution $\hat{e}_i$.

The larger the distance between the two energy distributions is, the higher the probability that the sub-band has an untypical energy distribution, and hence the lower the probability that the sub-band should be extracted as an audio object, in order to preserve the special artistic intention. From this point of view, the parameter $a_e$ is typically negative. In some embodiments, $a_e$ and $b_e$ may be predetermined and respectively keep the same values for different energy distributions (the actual energy distribution or the estimated typical energy distribution). In some other embodiments, $a_e$ and $b_e$ may each be a function of the energy distribution or of the distance $d(\vec{e}_i, \hat{e}_i)$. For example, for different energy distributions or different values of $d(\vec{e}_i, \hat{e}_i)$, $a_e$ and $b_e$ may have different values.
It should be noted that there are many other ways to determine the degree of association with the rule based on untypical energy distribution besides the above sigmoid function, as long as the degree of association is positively correlated with the distance between the actual energy distribution and the estimated typical energy distribution. The scope of the example embodiment is not limited in this regard.
In some embodiments, the determination of the degree of association with the rule based on vicinity to a center channel may comprise: determining the degree of association with the rule based on vicinity to the center channel according to a second distance between a spatial position of the sub-band of the audio signal and a spatial position of the center channel. In an example embodiment, the degree of association with the rule based on vicinity to a center channel may be represented as a probability, and may be defined as below:
$$q_2(i) = \frac{1}{1 + e^{a_p \, d(p_c, p_i) + b_p}} \tag{8}$$

where $q_2(i)$ represents the probability that the ith sub-band is associated with the rule based on vicinity to a center channel; $p_c$ represents the spatial position of the center channel, which may be predefined; $p_i$ represents the spatial position of the ith sub-band, which may be determined based on Equation (1); $d(p_c, p_i)$ represents the distance between the center channel and the position of the ith sub-band; and $a_p$ and $b_p$ represent the parameters of the sigmoid function mapping the distance $d(p_c, p_i)$ to the probability $q_2(i)$.
The smaller the distance $d(p_c, p_i)$ is, the higher the probability that the ith sub-band is associated with the rule based on vicinity to a center channel, and hence the lower the probability that this sub-band should be extracted as an audio object, in order to preserve the special artistic intention. From this point of view, the parameter $a_p$ is typically positive. In some embodiments, $a_p$ and $b_p$ may be predetermined and respectively keep the same values for different spatial positions (the center channel position or the position of the ith sub-band). In some other embodiments, $a_p$ and $b_p$ may each be a function of the spatial position or of the distance $d(p_c, p_i)$. For example, for different spatial positions or different distances $d(p_c, p_i)$, $a_p$ and $b_p$ may have different values.
It should be noted that there are many other ways to determine the degree of association with the rule based on vicinity to a center channel besides the above sigmoid function, as long as the degree of association is negatively correlated with the distance between the center channel position and the position of the ith sub-band. The scope of the example embodiment is not limited in this regard.
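The following sketch illustrates Equations (6) to (8) on a toy example of the phantom-center case discussed above; all sigmoid parameter values, and the three-element (left, right, center) energy layout, are illustrative assumptions of this sketch.

```python
# A sketch of Equations (6)-(8) on the phantom-center case discussed above.
import numpy as np

def sigmoid(x, a, b):
    return 1.0 / (1.0 + np.exp(a * x + b))

def q_untypical_energy(actual, typical, a_e=-6.0, b_e=2.0):
    """Eq. (7): a_e is negative, so a larger distance between the actual and the
    estimated typical energy distribution yields a larger q1 (values illustrative)."""
    return sigmoid(np.linalg.norm(actual - typical), a_e, b_e)

def q_center_vicinity(subband_pos, center_pos, a_p=6.0, b_p=-2.0):
    """Eq. (8): a_p is positive, so a smaller distance to the center channel
    yields a larger q2 (values illustrative)."""
    return sigmoid(np.linalg.norm(subband_pos - center_pos), a_p, b_p)

def third_probability(qs):
    """Eq. (6): high only if the sub-band is associated with none of the rules."""
    return float(np.prod([1.0 - q for q in qs]))

actual = np.array([0.5, 0.5, 0.0])     # energy split equally over (L, R, C): phantom center
typical = np.array([0.0, 0.0, 1.0])    # a typical pan of a central source uses channel C
q1 = q_untypical_energy(actual, typical)
q2 = q_center_vicinity(np.array([0.5, 0.0]), np.array([0.5, 0.0]))  # at the center position
print(third_probability([q1, q2]))     # low: the tile should mostly remain in the beds
```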
The Fourth Probability Based on Frequency Range
Since the extracted audio objects may be reproduced and further played back by various devices with corresponding renderers, it would be beneficial to consider the performance limitations of the renderers during the object extraction. For example, there may be some energy build-up when various renderers render sub-bands with frequencies lower than 200 Hz. To avoid introducing the energy build-up, low-frequency sub-bands may preferably be kept in the audio beds/residual audio portions during the audio object extraction. Therefore, the frequency range of the sub-band may be used as a factor to estimate the sub-band object probability, and a fourth probability based on frequency range may be determined.
In some example embodiments, to calculate the fourth probability based on the frequency range, the following steps may be performed: determining a center frequency in the frequency range of the sub-band of the audio signal; and determining the fourth probability for the sub-band of the audio signal based on the center frequency. As discussed above, the fourth probability may be positively correlated with the value of the center frequency. That is, the lower the center frequency is, the lower the fourth probability is. The fourth probability is in a range from 0 to 1. It should be noted that any other frequency in the frequency range of the sub-band may be used instead of the center frequency to estimate the fourth probability, such as the low boundary, the high boundary, or the frequency at ⅓ or ¼ of the frequency range. In an example, the fourth probability may be determined as below:
$$\text{prob}_4(i) = \frac{1}{1 + e^{a_f f_i + b_f}} \tag{9}$$

where $\text{prob}_4(i)$ represents the fourth probability of the ith sub-band, and $f_i$ represents a frequency in the frequency range of the ith sub-band, which may be the center frequency, the low boundary, or the high boundary. For example, if the ith sub-band has a frequency range from 200 Hz to 600 Hz, $f_i$ may be 400 Hz, 200 Hz, or 600 Hz. $a_f$ and $b_f$ represent the parameters of the sigmoid function mapping the frequency $f_i$ of the ith sub-band to the fourth probability. Typically, $a_f$ is negative, so that the fourth probability $\text{prob}_4(i)$ becomes higher as the frequency $f_i$ becomes higher. In some embodiments, $a_f$ and $b_f$ may be predetermined and respectively keep the same values for different values of the frequency $f_i$. In some other embodiments, $a_f$ and $b_f$ may each be a function of the frequency $f_i$; for example, for different values of the frequency $f_i$, $a_f$ and $b_f$ may have different values.
It should be noted that there are many other ways to determine the fourth probability based on the frequency range, as long as the fourth probability is positively correlated with some frequency value in the frequency range of the ith sub-band. The scope of the example embodiment is not limited in this regard.
In the above discussion, four probabilities based on four factors are described. The sub-band object probability may be determined based on one or more of the first, second, third, and fourth probabilities.
In some example embodiments disclosed herein, to avoid introducing artifacts and to prevent audio instability during the audio object extraction, the combined sub-band object probability may be high only in the case that all of the individual factors are high, and it may be low as long as one of the individual factors is low. In one embodiment, the sub-band object probability may be a combination of the different factors as follows:
$$\text{prob}_{\text{sub-band}}(i) = \prod_{k=1}^{K} \text{prob}_k(i)^{\alpha_k} \tag{10}$$

where $\text{prob}_{\text{sub-band}}(i)$ represents the sub-band object probability of the ith sub-band; $K$ represents the number of factors to be considered in the sub-band object probability determination (for example, $K$ may be 4, with all four of the above-mentioned factors considered; $K$ may be 3, with three of the four factors considered; or $K$ may be 1, with only one of the factors considered); $\text{prob}_k(i)$ represents the kth individual probability of the ith sub-band; and $\alpha_k$ may represent a weight reflecting the importance of the kth factor.
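A minimal sketch of Equations (9) and (10) follows: the fourth probability is computed from a representative sub-band frequency and then combined with the other factors by a weighted product. The parameter values a_f and b_f, the example probabilities, and the unit weights are illustrative assumptions.

```python
# A sketch of Equations (9)-(10): fourth probability and the combined probability.
import numpy as np

def fourth_probability(freq_hz, a_f=-0.01, b_f=2.0):
    """Eq. (9): sigmoid on a representative sub-band frequency; a_f is negative,
    so low-frequency sub-bands get a low object probability (values illustrative)."""
    return 1.0 / (1.0 + np.exp(a_f * freq_hz + b_f))

def subband_object_probability(probs, alphas=None):
    """Eq. (10): weighted product of the individual probabilities; the result is
    high only when every considered factor is high."""
    probs = np.asarray(probs, dtype=float)
    alphas = np.ones_like(probs) if alphas is None else np.asarray(alphas, dtype=float)
    return float(np.prod(probs ** alphas))

prob4 = fourth_probability(400.0)                        # sub-band centered at 400 Hz
print(subband_object_probability([0.9, 0.8, 0.95, prob4]))
```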
It should be noted that, in the sub-band object probability determination, other factors besides or instead of the above discussed four factors may be considered. For example, some clues or information about the audio objects in the audio content provided by the human user may be considered in sub-band object probability determination. The scope of the example embodiment is not limited in this regard.
In the method 100, after the sub-band object probability is determined in step S101, the sub-band of the audio signal may be split into an audio object portion and a residual audio portion in step S102, which corresponds to the block of audio object/residual audio splitting 203 in the block diagram described above.
In some example embodiments disclosed herein, splitting the sub-band of the audio signal into an audio object portion and a residual audio portion based on the determined sub-band object probability may comprise: determining an object gain of the sub-band of the audio signal based on the sub-band object probability; and splitting each of the plurality of sub-bands of audio signal into the audio object portion and the residual audio portion according to the determined object gain. In one example, each sub-band may be split into an audio object portion and a residual audio portion as follows:
$$x_{obj}(i) = x(i) \cdot g(i), \qquad x_{res}(i) = x(i) \cdot \bigl(1 - g(i)\bigr) \tag{11}$$

where $x(i)$ represents the ith sub-band of the input audio content, which may be a time-domain sequence or a frequency-domain sequence; $g(i)$ represents the object gain of the ith sub-band; and $x_{obj}(i)$ and $x_{res}(i)$ represent the audio object portion and the residual audio portion of the ith sub-band, respectively.
In one example embodiment, determining an object gain of the sub-band of the audio signal based on the sub-band object probability may comprise determining the sub-band object probability as the object gain of the sub-band of the audio signal. That is, the sub-band object probability may be directly used as the object gain, which may be represented as below:
$$g(i) = \text{prob}_{\text{sub-band}}(i) \tag{12}$$
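The soft split of Equations (11) and (12) then amounts to a simple gain application, as the following minimal sketch shows; note that the object and residual portions always sum back to the original sub-band signal.

```python
# A sketch of Equations (11)-(12): soft splitting of one sub-band by its object gain.
import numpy as np

def split_subband(x, gain):
    """Split a sub-band into an object portion and a residual portion, using the
    sub-band object probability directly as the object gain g(i)."""
    x = np.asarray(x)
    return x * gain, x * (1.0 - gain)                    # x_obj(i), x_res(i)

subband = np.random.randn(5, 17)                         # one multichannel sub-band tile
x_obj, x_res = split_subband(subband, gain=0.7)
assert np.allclose(x_obj + x_res, subband)               # the soft split is lossless
```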
Although soft splitting directly using the sub-band object probability may avoid some instability or switching artifacts during the audio object extraction, the stability of the audio object extraction may be further improved, since there may still be some noise in the determined sub-band object probability. In some example embodiments disclosed herein, temporal smoothing and/or spectral smoothing of the object gain is proposed to improve the stability of the extracted objects.
Temporal Smoothing
In some example embodiments disclosed herein, the object gain of the sub-band may be smoothed with a time-related smoothing factor. The temporal smoothing may be performed on each sub-band separately over time, which may be represented as below:
$$\tilde{g}_t(i) = \alpha_t(i) \cdot \tilde{g}_{t-1}(i) + \bigl(1 - \alpha_t(i)\bigr) \cdot g_t(i) \tag{13}$$

where $g_t(i)$ represents the object gain of the ith sub-band in the processing frame t, which may be the determined sub-band object probability of the ith sub-band; $\alpha_t(i)$ represents the time-related smoothing factor; and $\tilde{g}_t(i)$ and $\tilde{g}_{t-1}(i)$ represent the smoothed object gains of the ith sub-band in the processing frames t and t−1, respectively.
Since audio objects may appear or disappear frequently over time in each sub-band, especially in complex final-mix content, the time-related smoothing factor may be changed correspondingly to avoid smoothing between two different kinds of content, for example, between two different objects or between an object and ambience.
Therefore, in some example embodiments disclosed herein, the time-related smoothing factor may be associated with the appearance and disappearance of an audio object in the sub-band of the audio signal over time. In further embodiments, at the time an audio object appears or disappears, a small time-related smoothing factor may be used, which indicates that the object gain largely depends on the current processing frame. The object appearance/disappearance information may be determined by sub-band transient detection, for example, using the well-known onset probability corresponding to the appearance of an audio object and the offset probability corresponding to the disappearance of the audio object. Supposing the transient probability of the ith sub-band in frame t is $TP_t(i)$, in an embodiment, the time-related smoothing factor $\alpha_t(i)$ for the spectral-temporal tile may be determined as follows:
$$\alpha_t(i) = TP_t(i) \cdot \alpha_{fast} + \bigl(1 - TP_t(i)\bigr) \cdot \alpha_{slow} \tag{14}$$

where $\alpha_{fast}$ represents the fast smoothing time constant (smoothing factor) with a small value, and $\alpha_{slow}$ represents the slow smoothing time constant (smoothing factor) with a large value; that is, $\alpha_{fast}$ is smaller than $\alpha_{slow}$. Therefore, according to Equation (14), when the transient probability $TP_t(i)$ is large, indicating that there is a transient point (an audio object appearance or disappearance) in the processing frame t, the smoothing factor is small, so that the object gain largely depends on the current processing frame and smoothing across two different kinds of content is avoided. In some embodiments, the transient probability $TP_t(i)$ may be 1 if there is an audio object appearance or disappearance, and 0 otherwise. The transient probability $TP_t(i)$ may also be a continuous value between 0 and 1.
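A minimal sketch of Equations (13) and (14) follows; the values of alpha_fast and alpha_slow, and the toy gain trajectory with an onset at frame 2, are illustrative assumptions.

```python
# A sketch of Equations (13)-(14): transient-aware temporal smoothing of one sub-band.
import numpy as np

def smooth_gain_over_time(gains, transient_probs, alpha_fast=0.1, alpha_slow=0.9):
    """One-pole smoothing of a sub-band's object gain over frames, with the
    smoothing factor driven by the transient probability; alpha_fast < alpha_slow,
    so a transient makes the output track the current frame."""
    smoothed = np.empty_like(gains, dtype=float)
    prev = gains[0]
    for t, (g, tp) in enumerate(zip(gains, transient_probs)):
        alpha = tp * alpha_fast + (1.0 - tp) * alpha_slow     # Eq. (14)
        prev = alpha * prev + (1.0 - alpha) * g               # Eq. (13)
        smoothed[t] = prev
    return smoothed

gains = np.array([0.1, 0.15, 0.9, 0.85, 0.8])       # an object appears at frame 2
transients = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # detected onset at frame 2
print(smooth_gain_over_time(gains, transients))     # tracks the jump without lag
```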
There are many other methods that can be used to smooth the object gain. For example, the smoothing factor used to smooth the object gain may be the same across multiple frames or all frames of the input audio content. The scope of the example embodiment is not limited in this regard.
Spectral Smoothing
In some example embodiments disclosed herein, the object gain of the sub-band may be smoothed within a frequency window. In these embodiments, a predefined smoothing window may be applied to multiple sub-bands to obtain a spectrally smoothed gain value:
$$\tilde{g}(i) = \sum_{l=-L}^{L} w_l \cdot g(i+l) \tag{15}$$

where $\tilde{g}(i)$ represents the smoothed object gain of the sub-band i; $g(i+l)$ represents the object gain of the sub-band (i+l), which may be the determined sub-band object probability of the sub-band (i+l); $w_l$ represents the coefficient of the frequency window corresponding to the offset l, which may have a value between 0 and 1; and 2L+1 represents the length of the frequency window, which may be predetermined.
For some kinds of audio content, such as final-mix audio, there may be multiple sources (different objects and ambience) in different spectral regions, and smoothing based on a fixed predetermined window may result in smoothing between two different sources in nearby spectral regions. Therefore, in some example embodiments disclosed herein, spectral segmentation results may be utilized to avoid smoothing over the spectral boundary between two sources, and the length of the frequency window may be associated with a low boundary and a high boundary of the spectral segment of the sub-band. In one embodiment, if the low boundary of the spectral segment is larger than the low boundary of the predetermined frequency window, the low boundary of the spectral segment may be used instead of the low boundary of the predetermined frequency window; and if the high boundary of the spectral segment is smaller than the high boundary of the predetermined frequency window, the high boundary of the spectral segment may be used instead of the high boundary of the predetermined frequency window.
In one example, the smoothed object gain may be determined with a frequency window that takes the low boundary and the high boundary of the spectral segment of the sub-band into account, and the above Equation (15) may be modified as follows:
$$\tilde{g}(i) = \sum_{l=\max(-L,\; BL_i - i)}^{\min(L,\; BH_i - i)} w_l \cdot g(i+l) \tag{16}$$

where $BL_i$ represents the low boundary of the spectral segment of the sub-band i, and $BH_i$ represents the high boundary of the spectral segment of the sub-band i. The boundaries of the spectral segment may be determined based on the object gain and/or the spectrum similarity of the spectral-temporal tiles (the sub-bands).
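The following sketch illustrates this segment-limited smoothing; the triangular window and the renormalization by the clipped window coefficients are assumptions of this sketch rather than prescriptions of the embodiments.

```python
# A sketch of Eq. (16): spectral smoothing clipped to each sub-band's segment.
import numpy as np

def smooth_gain_over_frequency(gains, low_bound, high_bound, L=2):
    """Windowed smoothing of the object gains across sub-bands, with the window
    clipped to the spectral segment [BL_i, BH_i] of each sub-band so that
    smoothing never crosses a segment boundary."""
    w = 1.0 - np.abs(np.arange(-L, L + 1)) / (L + 1.0)   # triangular coefficients w_l
    out = np.empty_like(gains, dtype=float)
    for i in range(len(gains)):
        lo = max(i - L, low_bound[i])                    # respect BL_i
        hi = min(i + L, high_bound[i])                   # respect BH_i
        coeffs = w[(lo - i) + L:(hi - i) + L + 1]
        out[i] = np.dot(coeffs, gains[lo:hi + 1]) / coeffs.sum()
    return out

gains = np.array([0.9, 0.8, 0.85, 0.1, 0.15])            # two spectral segments
seg_lo = np.array([0, 0, 0, 3, 3])                       # BL_i per sub-band
seg_hi = np.array([2, 2, 2, 4, 4])                       # BH_i per sub-band
print(smooth_gain_over_frequency(gains, seg_lo, seg_hi)) # no bleed across the boundary
```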
In the sub-band dividing, in order to avoid different objects with different frequency ranges being contained in the same sub-band, such that the individual objects could not be extracted correctly, the frequency resolution of the sub-bands may be high; that is, each sub-band may span a narrow frequency range. As mentioned above, the audio object portions and residual audio portions split based on the sub-band object probabilities may be rendered in the adaptive audio content generation or other further audio processing. However, a high frequency resolution may result in a large number of extracted audio object portions, which may pose new challenges for the rendering and the distribution of such content. Therefore, the number of audio object portions may be further reduced by grouping/clustering approaches in some embodiments.
Reference is now made to the accompanying flowchart, which illustrates a method 500 for audio object extraction from audio content in accordance with further example embodiments disclosed herein.
At step S501, a frame of the audio content is divided into a plurality of sub-bands of an audio signal in a frequency domain. As mentioned above, considering the sparsity property of audio objects in audio content, a soft splitting may be performed on each sub-band of the frame of the audio content. The number of divided sub-bands and the frequency range of each sub-band are not limited in the example embodiments.
At step S502, a sub-band object probability is determined for each of the plurality of sub-bands of the audio signal. This step is similar to step S101 of the method 100, in which the determination of the sub-band object probability has been discussed. Therefore, the detailed description of this step is omitted here for the sake of clarity.
At step S503, each of the plurality of sub-bands of the audio signal is split into an audio object portion and a residual audio portion based on the respective sub-band object probability. This step is similar to step S102 of the method 100, in which the splitting of a sub-band has been discussed. Therefore, the detailed description of this step is omitted here for the sake of clarity.
The method 500 then proceeds to step S504, in which the audio object portions of the plurality of sub-bands of the audio signal may be clustered. The number of clustered audio object portions is smaller than the number of split audio object portions of the plurality of sub-bands of the audio signal.
As a result, the block diagram of audio object extraction described above may be extended with an additional block of audio object portion clustering following the audio object/residual audio splitting.
Various grouping or clustering technologies may be applied to cluster the large number of split audio object portions into a small number of audio object portions. In some embodiments, the clustering of the audio object portions of the plurality of sub-bands of the audio signal may be based on at least one of: critical bands, spatial positions of the audio object portions of the plurality of sub-bands of the audio signal, and perceptual criteria.
Clustering Based on Critical Bands
Based on the auditory masking phenomena of psychoacoustics, it may be hard for humans to perceive an original sound signal in the presence of a second signal of higher intensity within the same critical band. Therefore, the audio object portions of the plurality of sub-bands may be grouped together based on the critical bands without causing obvious audible problems. The ERB (Equivalent Rectangular Bandwidth) bands may be used to group the audio object portions. The ERB bands may be represented as:
$$ERB(f) = 24.7 \cdot (4.37 \cdot f + 1) \tag{17}$$

where $f$ represents the center frequency of the ERB band in kHz, and $ERB(f)$ represents the bandwidth of the ERB band in Hz.
In one embodiment, the audio object portions of different sub-bands may be grouped into the ERB bands based on the center frequency (or the low boundary, or the high boundary) of the sub-bands.
In different embodiments, the number of ERB bands may be preset, for example to 20, which means that the audio object portions of the multiple sub-bands of the processing frame are clustered into at most the preset number of ERB bands.
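One possible grouping consistent with Equation (17) is sketched below, using the corresponding ERB-rate (ERB-number) scale to map each sub-band's center frequency to one of a preset number of ERB bands; the 24 kHz upper frequency and the preset of 20 bands follow the examples in the text, while uniform spacing on the ERB-rate scale is an assumption of this sketch.

```python
# A sketch of grouping sub-band object portions into a preset number of ERB bands.
import numpy as np

def erb_rate(freq_hz):
    """ERB-number (ERB-rate) scale matching the bandwidth formula of Eq. (17):
    the number of ERBs below the given frequency."""
    return 21.4 * np.log10(4.37 * freq_hz / 1000.0 + 1.0)

def group_subbands_by_erb(center_freqs_hz, num_erb_bands=20, max_freq_hz=24000.0):
    """Assign each sub-band, by its center frequency, to one of num_erb_bands
    ERB bands spaced uniformly on the ERB-rate scale."""
    rates = erb_rate(np.asarray(center_freqs_hz, dtype=float))
    edges = np.linspace(0.0, erb_rate(max_freq_hz), num_erb_bands + 1)
    return np.clip(np.digitize(rates, edges) - 1, 0, num_erb_bands - 1)

centers = np.arange(200.0, 24000.0, 400.0)     # center frequencies of the sub-bands
print(group_subbands_by_erb(centers))          # ERB band index per sub-band
```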
Clustering Based on Spatial Position
An alternative method of sub-band object clustering is based on the spatial position, since sub-band audio object portions with the same or close spatial positions may belong to the same object. Meanwhile, when the extracted audio object portions are rendered at their obtained spatial positions by various renderers, rendering a group of sub-bands with the same position is similar to rendering each individual sub-band at that position. An example spatial position based hierarchical clustering method is sketched below.
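Purely by way of illustration, the following minimal agglomerative sketch merges the closest audio object portions by spatial position until no two cluster centroids are within a distance threshold; the specific procedure and the max_dist threshold are illustrative assumptions rather than the embodiments' prescribed clustering method.

```python
# A minimal agglomerative sketch of spatial-position based clustering.
import numpy as np

def cluster_by_position(positions, max_dist=0.15):
    """Repeatedly merge the two clusters whose centroids are closest, until no
    pair of centroids is within max_dist (the threshold is illustrative)."""
    clusters = [[i] for i in range(len(positions))]
    centroids = [positions[i].astype(float) for i in range(len(positions))]
    while len(clusters) > 1:
        best, best_d = None, max_dist
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best_d:
                    best, best_d = (a, b), d
        if best is None:                          # no pair close enough: stop
            break
        a, b = best
        clusters[a] += clusters.pop(b)            # merge cluster b into cluster a
        centroids.pop(b)
        centroids[a] = positions[clusters[a]].mean(axis=0)
    return clusters

pos = np.array([[0.10, 0.10], [0.12, 0.11], [0.90, 0.80], [0.88, 0.82], [0.50, 0.50]])
print(cluster_by_position(pos))                   # e.g. [[0, 1], [2, 3], [4]]
```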
It should be noted that there are many other ways to cluster the audio object portions besides the above described method, and the scope of the example embodiment is not limited in this regard.
Clustering Based on Perceptual Criteria
When the total number of clusters is constrained, clustering the sub-band audio object portions solely based on the spatial position may introduce some artifacts if the audio objects are sparsely distributed. Therefore, clustering based on perceptual criteria may be used to group the sub-band audio object portions in some embodiments. The perceptual criteria may relate to perceptual factors of the audio signal, such as partial loudness, content semantics or type, and so on. In general, clustering sub-band objects results in a certain amount of error, since not all sub-band objects can maintain spatial fidelity when clustered with other objects, especially in applications where a large number of audio objects are sparsely distributed. Objects with a relatively high perceived importance are therefore favored in terms of minimizing spatial/perceptual errors in the clustering process. The object importance can be based on perceptual criteria such as partial loudness, which is the perceived loudness of an audio object factoring in the masking effects among the other audio objects in the scene, and content semantics or type (such as dialog, music, effects, etc.). Usually, objects with a high perceived importance are favored over objects with a low importance in terms of minimizing spatial errors during the grouping process, and are more likely to be clustered together. Low-importance objects may instead be rendered into nearby groups of high-importance objects and/or into the beds.
Therefore, in some example embodiments, the perceptual importance of each of the multiple audio object portions of a processing frame may first be determined, and the audio object portions may then be clustered based on the perceptual importance measured by the perceptual criteria. The perceptual importance of an audio object portion may be determined by combining the perceived loudness (the partial loudness) and the content importance of the audio object portion. For example, in an embodiment, the content importance may be derived based on a dialog confidence score, and a gain value (in dB) can be determined based on this derived content importance. The loudness or excitation of the audio object portion may then be modified by the determined gain, with the modified loudness representing the final perceptual importance of the audio object portion.
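A hedged sketch of such an importance measure follows: a dialog confidence score is mapped to a gain in dB, and the gain modifies the partial loudness. The max_boost_db mapping and all example values are illustrative assumptions of this sketch.

```python
# A sketch of a perceptual importance measure combining loudness and content type.
import numpy as np

def perceptual_importance(partial_loudness, dialog_confidence, max_boost_db=6.0):
    """Map a content-importance score (here, dialog confidence in [0, 1]) to a
    gain in dB and apply it to the partial loudness; the boosted loudness then
    serves as the perceptual importance (max_boost_db is illustrative)."""
    gain_db = max_boost_db * np.asarray(dialog_confidence, dtype=float)
    return np.asarray(partial_loudness, dtype=float) * 10.0 ** (gain_db / 20.0)

loudness = np.array([0.4, 0.5, 0.2])            # partial loudness per object portion
dialog = np.array([0.9, 0.1, 0.0])              # dialog confidence per object portion
print(perceptual_importance(loudness, dialog))  # dialog-heavy portions rank higher
```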
The split (or clustered) audio object portions and residual audio (audio bed) portions may then be used in an adaptive audio content generation system, where the audio object portions and residual audio (audio bed) portions of the input audio content are converted into adaptive audio content (including beds and objects with metadata) to create a 3D audio experience. An example framework of such a system 700 is shown in the accompanying figure.
The block of direct/diffuse separation 10 in the system 700 may be used to first separate the input audio content into a direct signal and a diffuse signal, where the direct component may mainly contain the audio objects with direction, and the diffuse component may mainly contain the ambience without direction.
The block of audio object extraction 11 may perform the process of audio object extraction discussed above in accordance with the example embodiments disclosed herein. The audio object portions and the residual audio portions may be extracted from the direct signal in this block. Based on some of the embodiments above, the audio object portions here may be groups of audio object portions, and the number of groups may depend on the requirements of the system 700.
The block of audio bed generation 12 may be used to combine the diffuse signal and the residual audio portions of the audio object extraction together to generate the audio beds. To enhance the immersive experience, up-mixing technologies may be applied in this block to create some overhead bed channels.
The block of down-mixing and metadata determination 13 may be used to down-mix the audio object portions into mono audio objects with determined metadata. The metadata may include information for better rendering the audio object content, such as the spatial position, velocity, and size of the audio object, and/or the like. The metadata may be derived from the audio content by some well-known techniques.
It should be noted that some additional components may be added to the system 700, and one or more blocks of the system 700 shown in the figure may be omitted or replaced in some embodiments. The scope of the example embodiments is not limited in this regard.
The generated adaptive audio content (including the audio beds and the mono audio objects with metadata) of the system 700 may be rendered by various kinds of renderers. It may enhance the audio experience in different listening environments, where the audio beds may be rendered to the predefined positions, and the audio objects may be rendered based on the determined metadata. The rendered audio content may then be played back by various kinds of speakers, such as sound-boxes, headphones, earphones, or the like.
The adaptive audio content generation and its playback are just some example use cases of the audio object portions and residual audio portions generated in accordance with the example embodiments, and there may be many other use cases. The scope of the example embodiments is not limited in this regard.
In accordance with example embodiments disclosed herein, a system 800 for audio object extraction from audio content may comprise a probability determining unit 801 configured to determine the sub-band object probability, and an audio splitting unit 802 configured to split the sub-band of the audio signal, as described above. In some embodiments, the system 800 may further comprise a frequency band dividing unit configured to divide the frame of the audio content into a plurality of sub-bands of the audio signal in a frequency domain. Respective sub-band object probabilities may be determined for the plurality of sub-bands of the audio signal, and each of the plurality of sub-bands of the audio signal may be split into an audio object portion and a residual audio portion based on the respective sub-band object probability.
In some embodiments, the determination of the sub-band object probability for each of the plurality of sub-bands of the audio signal may be based on at least one of the following: a first probability determined based on a spatial position of the sub-band of the audio signal; a second probability determined based on correlation between multiple channels of the sub-band of the audio signal when the audio content is in a multi-channel format; a third probability determined based on at least one panning rule in audio mixing; and a fourth probability determined based on a frequency range of the sub-band of the audio signal.
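The specification does not fix how these probabilities are fused when more than one is used; as one illustrative assumption, a simple product (so that any strong counter-indication pulls the result down) could serve:

```python
import numpy as np

def subband_object_probability(p_position, p_correlation,
                               p_panning, p_frequency):
    """Fuse the four cues into a single sub-band object probability.

    A plain product is an illustrative fusion rule; the specification
    only requires that at least one of the four cues be used.
    """
    return float(np.prod([p_position, p_correlation, p_panning, p_frequency]))

print(subband_object_probability(0.9, 0.8, 0.95, 0.7))  # ~0.48
```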
In some embodiments, the determination of the first probability may comprise: determining spatial positions of the plurality of sub-bands of the audio signal; determining a sub-band density around the spatial position of the sub-band of the audio signal according to the obtained spatial positions of the plurality of sub-bands of the audio signal; and determining the first probability for the sub-band of the audio signal based on the sub-band density, wherein the first probability is positively correlated with the sub-band density.
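A sketch of this first probability follows, combining an energy-weighted sub-band position (cf. EEE 4 below) with a radius-based density estimate; the estimator and the radius value are illustrative assumptions, as only the positive correlation between density and probability is prescribed:

```python
import numpy as np

def subband_position(energies, channel_positions):
    """Energy-weighted average of the pre-defined channel positions."""
    w = energies / energies.sum()
    return w @ channel_positions                 # (2,) position in the room plane

def first_probability(positions, idx, radius=0.2):
    """Fraction of sub-bands whose positions lie within `radius` of
    sub-band `idx`; used directly as the density-based probability."""
    dists = np.linalg.norm(positions - positions[idx], axis=1)
    return np.count_nonzero(dists <= radius) / len(positions)

# 5 channels at nominal surround positions; 20 sub-bands with random energies.
channel_positions = np.array([[0, 1], [1, 1], [.5, 1], [0, 0], [1, 0]], dtype=float)
energies = np.random.rand(20, 5)
positions = np.array([subband_position(e, channel_positions) for e in energies])
print(first_probability(positions, idx=0))
```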
In some embodiments, the determination of the second probability may comprise: determining a degree of correlation between each pair of the multiple channels for the sub-band of the audio signal; obtaining a total degree of correlation between the multiple channels of the sub-band of the audio signal based on the determined degrees of correlation; and determining the second probability for the sub-band of the audio signal based on the total degree of correlation, wherein the second probability is positively correlated with the total degree of correlation.
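One plausible realization of this second probability uses normalized pairwise correlation magnitudes weighted by channel-pair energy; the energy weighting is an assumption (cf. EEE 5 below), with only the positive correlation with the total degree being prescribed:

```python
import numpy as np

def second_probability(subband):
    """Probability from inter-channel correlation of one sub-band.

    subband: (channels, bins) complex spectra. The total degree of
    correlation is the energy-weighted mean of normalized pairwise
    correlation magnitudes, an illustrative choice.
    """
    n = subband.shape[0]
    energy = np.sum(np.abs(subband) ** 2, axis=1)
    corrs, weights = [], []
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(np.vdot(subband[i], subband[j]))
            den = np.sqrt(energy[i] * energy[j]) + 1e-12
            corrs.append(num / den)              # <= 1 by Cauchy-Schwarz
            weights.append(energy[i] + energy[j])  # weight loud pairs more
    total = np.average(corrs, weights=weights)
    return float(np.clip(total, 0.0, 1.0))

subband = np.fft.rfft(np.random.randn(5, 256), axis=-1)
print(second_probability(subband))
```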
In some embodiments, the determination of the third probability may comprise: determining for the sub-band of the audio signal a degree of association with each of the at least one panning rule in audio mixing, each panning rule indicating a condition where a sub-band of the audio signal is unsuitable to be an audio object; and determining the third probability for the sub-band of the audio signal based on the determined degree of association, wherein the third probability is negatively correlated with the degree of association.
In some embodiments, the at least one panning rule may include at least one of: a rule based on untypical energy distribution and a rule based on vicinity to a center channel. In one embodiment, the determination of the degree of association with the rule based on untypical energy distribution may comprise: determining the degree of association with the rule based on untypical energy distribution according to a first distance between an actual energy distribution and an estimated typical energy distribution of the sub-band of the audio signal. In another embodiment, the determination of the degree of association with the rule based on vicinity to a center channel may comprise: determining the degree of association with the rule based on vicinity to the center channel according to a second distance between a spatial position of the sub-band of the audio signal and a spatial position of the center channel.
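The following sketch combines both rules into a third probability; the exponential distance-to-association mappings and the max-combination of the two rules are illustrative assumptions, with only the negative correlation between association and probability being prescribed:

```python
import numpy as np

def third_probability(actual_dist, typical_dist, position, center_pos,
                      alpha=1.0, beta=1.0):
    """Probability from the two panning rules; negatively correlated
    with the degree of association with either rule."""
    # Rule 1: untypical energy distribution. A larger first distance
    # means a more untypical distribution, hence stronger association.
    d1 = np.linalg.norm(np.asarray(actual_dist) - np.asarray(typical_dist))
    assoc1 = 1.0 - np.exp(-alpha * d1)

    # Rule 2: vicinity to the center channel. A smaller second distance
    # means the sub-band sits closer to the center, hence stronger association.
    d2 = np.linalg.norm(np.asarray(position) - np.asarray(center_pos))
    assoc2 = np.exp(-beta * d2)

    association = max(assoc1, assoc2)            # strongest rule dominates
    return 1.0 - association                     # negative correlation

print(third_probability([0.5, 0.5, 0.0], [0.4, 0.4, 0.2],
                        position=[0.5, 0.9], center_pos=[0.5, 1.0]))
```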
In some embodiments, the determination of the fourth probability may comprise: determining a center frequency in the frequency range of the sub-band of the audio signal; and determining the fourth probability for the sub-band of the audio signal based on the center frequency, wherein the fourth probability is positively correlated with the value of the center frequency.
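As a sketch of the fourth probability, a logistic function of the center frequency preserves the prescribed positive correlation; the pivot and slope values below are hypothetical:

```python
import numpy as np

def fourth_probability(f_lo, f_hi, pivot_hz=400.0, slope=2.0):
    """Probability rising with the sub-band's center frequency, so
    low-frequency (e.g. bass) sub-bands are less likely to be treated
    as objects. The logistic mapping is an illustrative assumption."""
    fc = 0.5 * (f_lo + f_hi)                     # center of the frequency range
    x = slope * np.log2(fc / pivot_hz)           # octaves above/below the pivot
    return float(1.0 / (1.0 + np.exp(-x)))

print(fourth_probability(50.0, 150.0))           # low band -> small probability
print(fourth_probability(2000.0, 4000.0))        # high band -> close to 1
```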
In some embodiments, the audio splitting unit 802 may comprise: an object gain determining unit configured to determine an object gain of the sub-band of the audio signal based on the sub-band object probability. The audio splitting unit 802 may be further configured to split each of the plurality of sub-bands of the audio signal into the audio object portion and the residual audio portion based upon the determined object gain.
In some embodiments, the object gain determining unit may be further configured to determine the sub-band object probability as the object gain of the sub-band of the audio signal. The system 800 may further comprise at least one of: a temporal smoothing unit configured to smooth the object gain of the sub-band of the audio signal with a time related smoothing factor; and a spectral smoothing unit configured to smooth the object gain of the sub-band of the audio signal in a frequency window. In one embodiment, the time related smoothing factor is associated with the appearance and disappearance of an audio object in the sub-band of the audio signal over time. In another embodiment, a length of the frequency window is predetermined or is associated with a low boundary and a high boundary of a spectral segment of the sub-band of the audio signal.
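A compact sketch of both smoothing steps follows; the attack/release-style time factor (fast when the gain rises and an object appears, slow when it falls and an object disappears) and the fixed moving-average frequency window are illustrative choices:

```python
import numpy as np

def smooth_gains(gain, prev_gain, attack=0.3, release=0.9, window=3):
    """Temporal then spectral smoothing of per-sub-band object gains.

    gain, prev_gain: arrays of length n_subbands for the current and
    previous frames. The constants and window length are assumptions;
    an adaptive window tied to spectral segment boundaries would be
    equally admissible.
    """
    # Temporal smoothing: small weight on the past when the gain rises
    # (object appearing), large weight when it falls (object disappearing).
    alpha = np.where(gain > prev_gain, attack, release)
    g = alpha * prev_gain + (1.0 - alpha) * gain

    # Spectral smoothing: moving average over neighboring sub-bands.
    kernel = np.ones(window) / window
    return np.convolve(g, kernel, mode="same")

prev = np.zeros(10)
cur = np.array([0, 0, .9, .95, .9, 0, 0, 0, 0, 0])
print(np.round(smooth_gains(cur, prev), 2))
```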
In some embodiments, the system 800 may further comprise: a clustering unit configured to cluster the audio object portions of the plurality of sub-bands of the audio signal, the number of the clustered audio object portions being smaller than the number of the audio object portions of the plurality of sub-bands of the audio signal. In one embodiment, the clustering of the audio object portions of the plurality of sub-bands of the audio signal may be based on at least one of: critical bands, spatial positions of the audio object portions of the plurality of sub-bands of the audio signal, and perceptual criteria.
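As one illustrative realization of the spatial-position criterion, the sketch below groups sub-band object portions with a plain k-means on their positions; grouping by critical bands or perceptual criteria would be equally admissible under the specification:

```python
import numpy as np

def cluster_by_position(positions, object_portions, n_clusters=4, iters=10):
    """Group sub-band object portions by spatial position via k-means.

    positions: (n, 2) sub-band object positions; object_portions: list
    of per-sub-band object spectra. Plain k-means is an illustrative
    choice for reducing n portions to n_clusters grouped portions.
    """
    rng = np.random.default_rng(0)
    centers = positions[rng.choice(len(positions), n_clusters, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(positions[:, None] - centers[None], axis=2), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = positions[labels == k].mean(axis=0)
    # Sum the object portions assigned to each cluster.
    grouped = [sum(s for s, l in zip(object_portions, labels) if l == k)
               for k in range(n_clusters)]
    return labels, grouped

positions = np.random.rand(20, 2)
portions = [np.random.randn(256) for _ in range(20)]
labels, groups = cluster_by_position(positions, portions)
print(labels)                                    # cluster index per sub-band
```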
For the sake of clarity, some optional components of the system 800 are not shown in FIG. 8.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, or the like; an output section 907 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a loudspeaker or the like; the storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processes via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as required. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 910 as required, so that a computer program read therefrom is installed into the storage section 908 as needed.
Specifically, in accordance with the example embodiments disclosed herein, the processes described above may be implemented as computer software programs. For example, the computer program may be downloaded via the communication section 909 and/or installed from the removable medium 911, and, when executed, performs the methods described above.
Generally speaking, the various example embodiments disclosed herein may be implemented in hardware or special purpose circuits, software, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor, or other computing device. While various aspects of the example embodiments disclosed herein are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controllers, other computing devices, or some combination thereof.
Additionally, various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, the example embodiments disclosed herein include a computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods described above.
In the context of the disclosure, a machine readable medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Computer program code for carrying out the methods disclosed herein may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer as a stand-alone software package, partly on the computer and partly on a remote computer, or entirely on the remote computer or server.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of any embodiment or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments disclosed herein. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
Various modifications and adaptations to the foregoing example embodiments may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. Any and all such modifications will still fall within the scope of the non-limiting and example embodiments. Furthermore, other embodiments set forth herein will come to mind to one skilled in the art to which these embodiments pertain, having the benefit of the teachings presented in the foregoing descriptions and the drawings.
Accordingly, the example embodiments may be embodied in any of the forms described herein. For example, the following enumerated example embodiments (EEEs) describe some structures, features, and functionalities of some aspects of the example embodiments.
EEE 1. A method of extracting sub-band objects from multichannel audio comprising: dividing a frame of the multichannel audio into a plurality of sub-bands in a frequency domain; determining a sub-band object probability for each sub-band; splitting each sub-band into an object portion and a residual audio portion based on the respective sub-band object probability; and grouping the split object portions into sub-band objects.
EEE 2. The method according to EEE 1, wherein the sub-band object probability is determined based on at least one of: position distribution, channel correlation, panning rules, and center frequency.
EEE 3. The method according to EEE 2, wherein the sub-band object probability is positively correlated with the spatial density of the sub-band distribution, that is, the higher the spatial density of the sub-band distribution is, the higher the sub-band object probability is.
EEE 4. The method according to EEE 3, wherein the sub-band spatial position is determined based on the energy weights of the pre-defined channel positions.
EEE 5. The method according to EEE 2, wherein the sub-band object probability is positively correlated with the energy-weighted channel correlation, that is, the higher the channel correlation is, the higher the sub-band object probability is.
EEE 6. The method according to EEE 2, wherein the sub-band is kept in the residual audio if it is associated with one of the specific panning rules.
EEE 7. The method according to EEE 6, wherein the specific panning rules include at least one of: a rule based on untypical energy distribution; and a rule based on vicinity to a center channel.
EEE 8. The method according to EEE 2, wherein the sub-band object probability is positively correlated with the sub-band center frequency, that is, the lower the sub-band center frequency is, the lower the sub-band object probability is.
EEE 9. The method according to EEE 1, wherein the sub-band object probability is used as a gain for splitting the sub-band into an object and residual audio.
EEE 10. The method according to EEE 9, wherein both temporal smoothing and spectral smoothing are used to smooth the sub-band object gain.
EEE 11. The method according to EEE 10, wherein temporal transient detection is used to calculate an adaptive time constant for the temporal smoothing.
EEE 12. The method according to EEE 10, wherein spectral segmentation is used to calculate an adaptive smoothing window for the spectral smoothing.
EEE 13. The method according to EEE 1, wherein the sub-band object grouping method includes at least one of: grouping based on critical bands; grouping based on spatial positions of the sub-band objects; and grouping based on perceptual criteria.
It will be appreciated that the example embodiments are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are used herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
This application claims priority to Chinese Patent Application No. 201410372867.X, filed on 25 Jul. 2014, and U.S. Provisional Patent Application No. 62/037,748, filed on 15 Aug. 2014, each of which is hereby incorporated by reference in its entirety.