The present application is based on PCT filing PCT/EP2020/080819, filed Nov. 3, 2020, which claims priority to EP 19207275.9, filed Nov. 5, 2019, the entire contents of each of which are incorporated herein by reference.
The present disclosure generally pertains to the field of audio processing, in particular to devices, methods and computer programs for source separation and mixing.
There is a lot of audio content available, for example, in the form of compact disks (CDs), tapes, audio data files which can be downloaded from the internet, but also in the form of sound tracks of videos, e.g. stored on a digital video disk or the like, etc. Typically, audio content is already mixed, e.g. for a mono or stereo setting, without keeping the original audio source signals from the audio sources that were used for production of the audio content. However, there exist situations or applications where a remixing of the audio content is envisaged.
With the arrival of spatial, audio-object-oriented systems such as Dolby Atmos, DTS:X or, more recently, Sony 360RA, there is a need for methods that make the huge amount of legacy content, which was not originally mixed with audio objects in mind, enjoyable on such systems. Some existing upmixing systems extract spectrally based features or add external effects to render legacy content spatially. Accordingly, although there generally exist techniques for mixing audio content, it is generally desirable to improve devices and methods for mixing of audio content.
According to a first aspect, the disclosure provides an electronic device comprising circuitry configured to analyze the results of a stereo or multi-channel source separation to determine one or more time-varying parameters, and to create spatially dynamic audio objects based on the one or more time-varying parameters.
According to a further aspect, the disclosure provides a method comprising analyzing the results of a stereo or multi-channel source separation to determine one or more time-varying parameters, and creating spatially dynamic audio objects based on the one or more time-varying parameters.
Further aspects are set forth in the dependent claims, the following description and the drawings.
Embodiments are explained by way of example with respect to the accompanying drawings.
Before a detailed description of the embodiments under reference of the drawings, general explanations are made.
The embodiments disclose an electronic device comprising circuitry configured to analyze the results of a stereo or multi-channel source separation to determine one or more time-varying parameters, and to create spatially dynamic audio objects based on the one or more time-varying parameters.
The electronic device may thus provide spatial, audio-object-oriented audio content, which creates a more natural sound compared with conventional stereo audio content. By taking time-varying parameters into account, a time-dependent spatial upmix which, for example, preserves the original balance of the content may be achieved by analyzing the results of a multi-channel (source) separation and creating spatially dynamic audio objects.
The circuitry of the electronic device may include a processor (for example a CPU), a memory (RAM, ROM or the like), and/or storage, interfaces, etc. The circuitry may also comprise or be connected with input means (mouse, keyboard, camera, etc.), output means (a display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc.), a (wireless) interface, etc., as is generally known for electronic devices (computers, smartphones, etc.). Moreover, the electronic device may be an audio-enabled product which generates some multi-channel spatial rendering. The electronic device may be a TV, a sound bar, a multi-channel (playback) system, a virtualizer on headphones, binaural headphones, or the like.
As mentioned in the outset, there is a lot of audio content already mixed as a stereo audio content signal, which has two audio channels. In particular, with conventional stereo, each sound of an audio signal is fixed to a specific channel. For example, one channel may carry instruments such as guitar and drums, while the other channel may carry instruments such as guitar, vocals, and others. Therefore, the sounds of each channel are tied to a specific speaker.
Accordingly, the circuitry may be configured to determine, as a time-varying parameter, a parameter describing the signal level-loudness between separated channels, and/or a spectral balance parameter, and/or a primary-ambience indicator, and/or a dry-wet indicator, and/or a parameter describing the percussive-harmonic content.
Moreover, position mapping may include audio object positioning that may, for example, be genre dependent or may be computed dynamically based on a combination of different indexes. The position mapping may for example be implemented using an algorithm such as described in the embodiments below. For example, a dry/wet or primary/ambience indicator may be used, or may be combined with the ratio of any one of the separated sources, to modify the parameters of the audio objects, like the spread in monopole synthesis, which may create a more enveloping sound field, or the like.
The electronic device, when performing upmixing, may modify the original content and may take into account its specificity, in particular the balance of instruments in the case of stereo content.
In particular, the circuitry may be configured to determine, as a time-varying parameter, a parameter describing the balance of instruments in a stereo content, and to create the spatially dynamic audio objects based on the balance of instruments in the stereo content.
The circuitry may be configured to determine, as a time-varying parameter, a side-mid ratio of a separated source, and to create the spatially dynamic audio objects based on the side-mid ratio.
In this way, the electronic device may create spatial mixes which are content dependent and match more naturally and intuitively to the original intention of the mixing engineers or composers. The derived meta-data can also be used as a starting point for an audio engineer to create a new spatial mix.
The circuitry may be configured to determine spatial positioning parameters for the audio objects based on the one or more time-varying parameters obtained from the results of the stereo or multi-channel source separation.
Determining spatial positioning parameters may comprise performing position mapping based on positioning indexes. Positioning indexes may allow selecting a position of an audio object from an array of possible positions. Moreover, performing position mapping may result in an automatic creation of a spatial object audio mix from an analysis of existing multi-channel content or the like.
In some embodiments, the circuitry may be further configured to perform segmentation based on the side-mid ratio to obtain segments of the separated source.
In some embodiments, the side-mid ratio calculation may include a silence suppression process. A silence suppression process may include silence detection in the stereo channels. In the presence of silent parts on the separated sources, the side-mid ratio may be set to zero.
The circuitry may be configured to dynamically adapt positioning parameters of the audio objects. Spatial positioning parameters may for example be positioning indexes, an array of positioning indexes, a vector of positions, an array of positions, or the like. Some embodiments may use a positioning index depending on an original balance between the separated channels of a music sound source separation process, without limiting the present disclosure in that regard.
Deriving spatial positioning parameters may result in a spatial mix in which each separated (instrument) source is treated separately. The spatial mixes may be content dependent and may match naturally and intuitively the original intention of the mixing engineer. The derived meta-data may be used as a starting point to create a new spatial mix, or the like.
The circuitry may be configured to create the spatially dynamic audio objects by monopole synthesis. For example, the circuitry may be configured to dynamically adapt a spread in monopole synthesis. In particular, the spatially dynamic audio objects may be monopoles.
The circuitry may be configured to dynamically create, based on the one or more time-varying parameters, a first monopole used for rendering the left channel of a separated source, and a second monopole used for rendering the right channel of the separated source.
The circuitry may be configured to create, from the results of the multi-channel source separation, a time-dependent spatial upmix which preserves the original balance of the content.
The circuitry may be configured to perform, based on the time-varying parameter, a segmentation process to obtain segments of a separated source.
In some embodiments, the automatic time-dependent spatial upmixing is based on the results of a similarity analysis of multi-channels content. The automatic time-dependent spatial upmixing may for example be implemented using an algorithm such as described in the embodiments below.
The circuitry may be configured to perform a cluster detection based on the time-varying parameter. The cluster detection may be implemented using an algorithm, such as described in the following embodiments.
The circuitry may be configured to perform a smoothening process on the segments of the separated source.
The circuitry may be configured to perform a beat detection process to analyze the results of the multi-channel source separation.
The time-varying parameter may be determined per beat, per window, or per frame of a separated source.
The embodiments also disclose a method comprising analyzing the results of a stereo or multi-channel source separation to determine one or more time-varying parameters, and creating spatially dynamic audio objects based on the one or more time-varying parameters.
The embodiments also disclose a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the methods and processes described above and in the embodiments below.
Embodiments are now described by reference to the drawings.
The process of the embodiments described below in more detail starts with a (music) source separation approach, which is described in the following section.
Audio Upmixing/Remixing by Means of Blind Source Separation (BSS)
In a second step, the separations 2a-2d and the possible residual 3 are remixed and rendered to a new loudspeaker signal 4, here a signal comprising five channels 4a-4e, namely a 5.0 channel system. On the basis of the separated audio source signals and the residual signal, an output audio content is generated by mixing the separated audio source signals and the residual signal on the basis of spatial information. The output audio content is exemplarily illustrated and denoted with reference number 4.
In the following, the number of audio channels of the input audio content is referred to as Min and the number of audio channels of the output audio content is referred to as Mout. As the input audio content 1 in the example described above has two channels (Min=2) and the output audio content 4 has five channels (Mout=5), the remixing process performs an upmixing from two to five channels.
In audio source separation, an input signal comprising a number of sources (e.g. instruments, voices, or the like) is decomposed into separations. Audio source separation may be unsupervised (called “blind source separation”, BSS) or partly supervised. “Blind” means that the blind source separation does not necessarily have information about the original sources. For example, it may not necessarily know how many sources the original signal contained or which sound information of the input signal belongs to which original source. The aim of blind source separation is to decompose the original signal into separations without knowing the separations beforehand. A blind source separation unit may use any of the blind source separation techniques known to the skilled person. In (blind) source separation, source signals may be searched that are minimally correlated or maximally independent in a probabilistic or information-theoretic sense, or structural constraints on the audio source signals may be found on the basis of a non-negative matrix factorization. Methods for performing (blind) source separation are known to the skilled person and are based on, for example, principal components analysis, singular value decomposition, (in)dependent component analysis, non-negative matrix factorization, artificial neural networks, etc.
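For illustration only, the following minimal sketch decomposes a magnitude spectrogram into components using non-negative matrix factorization with the Python libraries librosa and scikit-learn. The file name, the number of components and all parameter values are assumptions; the separations of the embodiments may equally be produced by a trained DNN-based music source separation system.

```python
# Minimal NMF-based separation sketch (illustrative only, not the
# embodiments' actual separator). Assumes a file "mix.wav" exists.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("mix.wav", sr=None, mono=True)
S = librosa.stft(y, n_fft=2048, hop_length=512)
mag, phase = np.abs(S), np.angle(S)

# Factorize |S| ~ W @ H into spectral templates W and activations H.
nmf = NMF(n_components=4, init="nndsvd", max_iter=300)
W = nmf.fit_transform(mag)   # (freq x components)
H = nmf.components_          # (components x time)

# Reconstruct one component with a soft (Wiener-like) mask.
eps = 1e-12
mask = np.outer(W[:, 0], H[0]) / (W @ H + eps)
component = librosa.istft(mask * mag * np.exp(1j * phase), hop_length=512)
```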
Although some embodiments use blind source separation for generating the separated audio source signals, the present disclosure is not limited to embodiments where no further information is used for the separation of the audio source signals, but in some embodiments, further information is used for generation of separated audio source signals. Such further information can be, for example, information about the mixing process, information about the type of audio sources included in the input audio content, information about a spatial position of audio sources included in the input audio content, etc.
The input audio signal can be an audio signal of any type. It can be in the form of analog signals, digital signals, it can originate from a voice recorder, a compact disk, a digital video disk, or the like, it can be a data file, such as a wave file, an mp3 file or the like, and the present disclosure is not limited to a specific format of the input audio content. An input audio content may for example be a stereo audio signal having a first channel input audio signal and a second channel input audio signal, without the present disclosure being limited to input audio contents with two audio channels. An input audio signal may be a multi-channel content signal. For example, in other embodiments, the input audio content may include any number of channels, such as a 5.1 audio signal or the like. The input signal may comprise one or more source signals. In particular, the input signal may comprise several audio sources. An audio source can be any entity which produces sound waves, for example, music instruments, voice, vocals, artificially generated sound, e.g. originating from a synthesizer, etc.
The input audio content may represent or include mixed audio sources, which means that the sound information is not separately available for all audio sources of the input audio content, but that the sound information for different audio sources, e.g., at least partially overlaps or is mixed.
The separations produced by blind source separation from the input signal may for example comprise a vocals separation, a bass separation, a drums separation and an “other” separation. In the vocals separation all sounds belonging to human voices might be included, in the bass separation all sounds below a predefined threshold frequency might be included, in the drums separation all sounds belonging to the drums in a song/piece of music might be included, and in the other separation all remaining sounds might be included. Source separation obtained by a Music Source Separation (MSS) system may result in artefacts such as interference, crosstalk or noise.
Time-Dependent Spatial Upmixing with Dynamic Sound Objects
According to the embodiments described below in more detail, a side-mid ratio parameter obtained from a separated source is used to modify the parameters of audio-objects of a virtual sound system used for rendering the separated source. In particular, the spread in monopole synthesis (i.e. the position of the monopoles used for rendering the separated source) is influenced. This creates a more enveloping sound field.
The “Bass” separation 2a is processed using a side-mid ratio calculation 5 in order to determine a side-mid ratio for the Bass separation. The side-mid ratio calculation 5 compares the energy of the left channel to the energy of the right channel of the stereo file representing the Bass separation to determine the side-mid ratio, and is described in more detail in the section on side-mid processing below.
In the above described embodiment, the process of source separation decomposes the stereo file into the separations “Bass”, “Drums”, “Other”, and “Vocals”. These types of separations are only given for the purpose of illustration; they can be replaced by any type of instrument for which the source separation has been trained, e.g. with a DNN.
In the above described embodiment, audio upmixing is performed on a stereo file which comprises two channels. The embodiments, however, are not limited to stereo files. The input audio content may also be a multichannel content such as a 5.0 audio file, a 5.1 audio file, or the like.
It is understood that monopoles are only an example of audio objects that may be positioned according to the principles of the example process described here.
Still further, it is understood that this is only one example of a possible embodiment, but that each step can be replaced by other analysis methods, and the audio object positioning could, for example, also be made genre dependent or computed dynamically based on the combination of different indexes. For example, a dry/wet or a primary/ambience indicator could also be used instead of the side-mid ratio, or combined with the side-mid ratio, to modify the parameters of the audio objects, like the spread in monopole synthesis, which would create a more enveloping sound field.
Beat Detection
A process of beat detection is performed on the original stereo signal which, for this purpose, may be down-mixed into mono.
By the beat detection, the audio signal of the original stereo signal (stereo file 1 described above) is divided into beats, which serve as time markers for the subsequent per-beat analysis.
A beat detection process as described above may be implemented using any beat detection or beat tracking algorithm known to the skilled person.
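As one possible implementation, the following sketch obtains beat time markers with librosa's beat tracker; the choice of library, the file name and the parameters are assumptions, and any beat detection method may be used instead.

```python
# Beat detection sketch: obtain per-beat time markers from the original
# stereo file, down-mixed to mono (library choice is an assumption).
import librosa

y, sr = librosa.load("stereo_mix.wav", sr=None, mono=True)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_samples = librosa.frames_to_samples(beat_frames)
# beat_samples delimits the per-beat analysis windows that are common
# to all separated sources.
```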
Beat detection is a windowing process which is particularly adequate for audio signals that represent music content. As an alternative to beat detection, a windowing process (or framing process) may be performed based on a predefined and constant window size and a predefined “hopping distance” (in samples). The window size may be arbitrarily chosen (e.g. 128 samples per window, 512 samples per window, or the like). The hopping distance may for example be chosen equal to the window length, or overlapping windows/frames might be chosen.
In still other embodiments, no beat detection or windowing process is applied, but e.g. a side-mid ratio is computed on a sample-by-sample basis (which corresponds to a window size of one sample).
Side-Mid Processing
The side signal and the mid signal are computed using equation 1:
side=0.5·(L−R)
mid=0.5·(L+R) (equation 1)
The mid signal mid is computed by summing the left signal L and the right signal R of the separated source 2a-2d, and then multiplying the computed sum with a normalization factor of 0.5 (in order to preserve loudness). The side signal side is computed by subtracting the signal R of the right channel of the separated source 2a-2d from the signal L of the left channel of the separated source 2a-2d, and then multiplying the computed difference with a normalization factor of 0.5.
For each beat of the separated source 2a-2d, the Mid signal mid and the Side signal side are related to each other by determining the ratio rat of the energies of the Side signal side and the Mid signal mid using equation 2:

rat=mean(side²)/mean(mid²) (equation 2)
Here, side² is the energy of the Side signal side, which is computed by samplewise squaring the side signal side, and mid² is the energy of the Mid signal mid, which is computed by samplewise squaring the mid signal mid. The ratio rat is computed by averaging the energy side² of the Side signal side over a beat to obtain the average value mean(side²) of the side energy for the beat, by averaging the energy mid² of the Mid signal mid over the same beat to obtain the average value mean(mid²) of the mid energy for the beat, and by dividing the average mean(side²) of the side energy by the average mean(mid²) of the mid energy.
The energy of a signal is related to the amplitude of the signal, and may for example be obtained as the short-time energy as follows:
E=∫−∞+∞ |x(t)|² dt (equation 3)
where x(t) is the audio signal, here in particular the left channel L or the right channel R.
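The per-beat side-mid ratio of equations 1 to 3 may, for example, be computed as in the following sketch; the function name and the treatment of the signal after the last beat marker are assumptions.

```python
# Per-beat side-mid ratio (equations 1 to 3) of one separated stereo
# source. L and R are numpy arrays of equal length; beat_samples holds
# the beat boundaries from the beat detection step.
import numpy as np

def side_mid_ratio_per_beat(L, R, beat_samples):
    side = 0.5 * (L - R)          # equation 1
    mid = 0.5 * (L + R)
    bounds = list(beat_samples) + [len(L)]   # close the last beat
    ratios = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        side_energy = np.mean(side[start:stop] ** 2)  # mean(side^2)
        mid_energy = np.mean(mid[start:stop] ** 2)    # mean(mid^2)
        ratios.append(side_energy / mid_energy if mid_energy > 0 else 0.0)
    return np.asarray(ratios)     # one ratio rat per beat (equation 2)
```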
In this embodiment, the side-mid ratio is calculated per beat, which leads to smoother values (compared to a fixed window length). The beats are calculated based on the input stereo file as described above.
In the embodiment above, the energy side² of the Side signal and the energy mid² of the Mid signal are used to determine a time-varying parameter rat, and spatially dynamic audio objects are created based on this time-varying parameter. It is, however, not necessary to use the energy for calculating the time-varying parameter. In alternative embodiments, for example, the ratio of amplitude differences |L−R|/|L+R| may be used to determine a time-dependent factor.
Still further, in the embodiment above, a normalization factor of 0.5 is foreseen. This normalization factor is, however, only provided for reasons of convention. It is not essential, as it does not influence the ratio, and can thus also be disregarded.
Silent parts in separated sources may still contain virtually imperceptible artefacts. Accordingly, the side-mid ratio may be set automatically to zero in silent parts of the separated sources 2a-2d, in order to minimize such artefacts, as illustrated below.
Silent parts of the separated sources 2a-2d may for example be identified by comparing the energies L² and R² of the left and right stereo channel with respective predefined threshold levels (or by comparing the overall energy L²+R² in both stereo channels with a predefined threshold level).
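A sketch of such a silence suppression is given below; the energy threshold value is an illustrative assumption.

```python
# Silence suppression sketch: force the side-mid ratio of a beat to
# zero when the overall channel energy L^2 + R^2 falls below a threshold.
import numpy as np

def suppress_silence(ratios, L, R, beat_samples, threshold=1e-6):
    bounds = list(beat_samples) + [len(L)]
    out = np.asarray(ratios, dtype=float).copy()
    for i, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:])):
        energy = np.mean(L[start:stop] ** 2 + R[start:stop] ** 2)
        if energy < threshold:
            out[i] = 0.0   # silent beat: suppress imperceptible artefacts
    return out
```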
The embodiment described above derives a side-mid ratio as an example of a time-varying parameter. In other embodiments, time-varying parameters may for example also be signal level/loudness between separated channels, spectral balance, primary/ambience, dry/wet, percussive/harmonic content or other parameters which can be derived from Music Information Retrieval approaches, without limiting the present disclosure in that regard.
Segmentation (Cluster Detection)
For preventing unnatural, unpleasant, or too fast position variations, such as fast spatial jumps in time, or the like, the side-mid ratio may be segmented in beats and smoothened using time-smoothing methods. An exemplary segmentation process, in which the side-mid ratio is segmented, is described in detail below.
It should be noted that in the embodiment above, the segmentation happens based on the side-mid ratio (or another time-varying parameter), which provides different results for the individual separated sources (instruments). However, the time markers (detected beats) of the segmentation/clustering process are common to all separated signals. The segmentation is done beat-synchronously to the original stereo signal, which is down-mixed into mono. Between successive beats, a time-varying parameter such as the per-beat mean of the side-mid ratio is computed for each separated signal.
As stated above, the goal of audio clustering is to identify and group together all beats, which have the same per-beat side-mid ratio. Audio beats with different per-beat side-mid ratio classification are clustered in different segments. Any clustering algorithm known to the skilled person, such as the K-means algorithm, Agglomerative Clustering (as described in https://en.wikipedia.org/wiki/Hierarchical_clustering), or the like, can be used to identify the side-mid ratio clusters which are indicative of segments of the audio signal.
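For illustration, the following sketch clusters the per-beat side-mid ratios with scikit-learn's agglomerative clustering; the distance threshold is an assumption, and other clustering algorithms may equally be used.

```python
# Cluster detection sketch: group beats with similar per-beat side-mid
# ratios (one 1-D feature per beat) into segments.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_beats(ratios, distance_threshold=0.1):
    X = np.asarray(ratios, dtype=float).reshape(-1, 1)
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    return clustering.fit_predict(X)   # cluster label for every beat
```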
The distance measure when comparing two clusters using the Bayesian Information Criterion (BIC) can be stated as a model selection criterion, where one model is represented by two separate clusters C1 and C2 and the other model represents the clusters joined together, C={C1, C2}. The BIC expression may be given as follows:
BIC=n log|Σ|−n1 log|Σ1|−n2 log|Σ2|−λP (equation 4)
where n=n1+n2 is the data size (overall number of beats, windows, etc.), Σ is the covariance matrix for the joined cluster C={C1, C2}, Σ1 and Σ2 are the covariance matrices for cluster C1 and, respectively, cluster C2, P is a penalty factor related to the number of parameters in the model, and λ is a penalty weight. The covariance matrix Σ of the joined cluster C is computed from the data of clusters C1 and C2 taken together.
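A minimal sketch of the BIC comparison of equation 4 for one-dimensional per-beat side-mid ratio features might look as follows; the concrete penalty term P and the default penalty weight are assumptions based on common practice.

```python
# BIC merge test (equation 4) for two candidate clusters of per-beat
# side-mid ratios; for 1-D features, |covariance| reduces to variance.
import numpy as np

def bic(c1, c2, lam=1.0):
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    n1, n2 = len(c1), len(c2)
    n = n1 + n2
    log_var = lambda c: np.log(np.var(c) + 1e-12)  # log|Sigma|, 1-D case
    P = 0.5 * 2 * np.log(n)  # assumed penalty: mean and variance parameters
    # positive value: the clusters are better modelled separately
    return (n * log_var(np.concatenate([c1, c2]))
            - n1 * log_var(c1) - n2 * log_var(c2) - lam * P)
```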
By means of the segmentation process described above, segments Sn of the audio signal are obtained. For each segment Sn, a smoothened side-mid ratio ratn is computed by averaging the per-beat side-mid ratios rati over the segment:

ratn=(1/Nn)·Σi∈Sn rati

where Nn is the number of beats in segment Sn.
According to the embodiments described here in more detail, the positions of the final monopoles are determined based on the side-mid ratio, and in particular based on the smoothened side-mid ratio, which attributes a side-mid ratio to every segment of the audio signal.
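The smoothening may, for example, be sketched as follows, assuming per-beat ratios and cluster labels from the previous steps.

```python
# Smoothening sketch: replace each per-beat side-mid ratio by the mean
# ratio of the segment it belongs to, so that the derived monopole
# positions stay constant within a segment.
import numpy as np

def smooth_per_segment(ratios, labels):
    ratios = np.asarray(ratios, dtype=float)
    labels = np.asarray(labels)
    smoothed = ratios.copy()
    for label in np.unique(labels):
        members = labels == label                   # beats in segment Sn
        smoothed[members] = ratios[members].mean()  # per-segment mean
    return smoothed
```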
Position Mapping
It should be noted that it is difficult to render virtual monopoles directly at the position of a physical speaker, or very close to a physical speaker. Accordingly, the possible monopole positions which are close to one of the speakers SP1, SP2, SP3, SP4 are marked with a dotted pattern, whereas all other possible positions are marked with a dashed pattern.
The mapping between the smoothened side-mid ratio of a segment and the positions of the monopoles may be realized by converting the smoothened side-mid ratio into a positioning index that selects a position from the array of possible monopole positions. For example, the mapping process may scale the smoothened side-mid ratio to the range of available positioning indexes, where a larger side-mid ratio selects a position farther away from the center, i.e. a wider spread of the monopoles. In the above described mapping process, the side-mid ratio of a separated source thus directly controls the spread of the monopoles used for rendering the left and right channels of that separated source.
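A sketch of such a position mapping is given below. The array of allowed positions, the linear scaling and the clipping range are illustrative assumptions; for simplicity, the sketch obtains the right-channel monopole by mirroring, which corresponds to one of the alternatives described below.

```python
# Position mapping sketch: scale the smoothened side-mid ratio of a
# segment to a positioning index into an array of allowed monopole
# azimuths (positions too close to a physical speaker are excluded
# when building the array).
import numpy as np

def map_to_positions(smoothed_ratio, allowed_angles_deg, max_ratio=1.0):
    # Larger side-mid ratio -> index farther from the centre position,
    # i.e. a wider spread of the left/right monopole pair.
    rel = np.clip(smoothed_ratio / max_ratio, 0.0, 1.0)
    idx = int(round(rel * (len(allowed_angles_deg) - 1)))
    left_angle = allowed_angles_deg[idx]   # monopole for the left channel
    right_angle = -left_angle              # mirrored right-channel monopole
    return left_angle, right_angle

# Example: 8 allowed azimuths between centre (0 deg) and hard left (80 deg).
allowed = np.linspace(0.0, 80.0, 8)
left, right = map_to_positions(0.4, allowed)
```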
Still further, in the embodiment described above, it is described that the position mapping happens for the left and the right stereo channel separately. In alternative embodiments, however, a position mapping as described above might only be performed for one of the stereo channels (e.g. the left channel), and the monopole position for the other stereo channel (e.g. the right channel) might be obtained by mirroring the position of the mapped stereo channel (e.g. left channel).
In the embodiments described above, the determination of the monopole positions for rendering the stereo signal of a separated source is based on a side-mid ratio parameter obtained from the separated source. However, in alternative embodiments, other parameters of the separated source may be chosen to determine the monopole positions for rendering the stereo signal. For example, a dry/wet or a primary/ambience indicator could also be used to modify the parameters of the audio objects, like the spread in monopole synthesis, which would create a more enveloping sound field. Also, combinations of such parameters might be used to modify the parameters of the audio objects.
Monopole Synthesis
The technique, which is implemented in the embodiments of US 2016/0037282 A1, is conceptually similar to the Wavefield synthesis, which uses a restricted number of acoustic enclosures to generate a defined sound field. The fundamental basis of the generation principle of the embodiments is, however, specific, since the synthesis does not try to model the sound field exactly but is based on a least square approach.
A target sound field is modelled as at least one target monopole placed at a defined target position. In one embodiment, the target sound field is modelled as one single target monopole. In other embodiments, the target sound field is modelled as multiple target monopoles placed at respective defined target positions. For example, each target monopole may represent a noise cancelation source comprised in a set of multiple noise cancelation sources positioned at a specific location within a space. The position of a target monopole may be moving. For example, a target monopole may adapt to the movement of a noise source to be attenuated. If multiple target monopoles are used to represent a target sound field, then the methods of synthesizing the sound of a target monopole based on a set of defined synthesis monopoles as described below may be applied for each target monopole independently, and the contributions of the synthesis monopoles obtained for each target monopole may be summed to reconstruct the target sound field.
A source signal x(n) is fed to delay units labelled z−np and to amplification units ap (where p=1, . . . , N is the index of the respective synthesis monopole), which generate the delayed and amplified signal components used for synthesizing the target sound field.
In this embodiment, the synthesis is thus performed in the form of delayed and amplified components of the source signal x.
According to this embodiment, the delay np for a synthesis monopole indexed p corresponds to the propagation time of sound for the Euclidean distance r=Rp0=|rp−ro| between the target monopole at position ro and the generator (synthesis monopole) at position rp.
Further, according to this embodiment, the amplification factor ap is inversely proportional to the distance r=Rp0.
In alternative embodiments of the system, the modified amplification factor according to equation (118) of reference US 2016/0037282 A1 can be used.
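For illustration, the delays and amplification factors of the synthesis monopoles may be computed as in the following sketch, where the sampling rate and the speed of sound are assumed values.

```python
# Monopole synthesis sketch: per-speaker delay n_p and gain a_p for a
# target monopole at position r_o, given synthesis monopoles (speakers)
# at positions r_p; the gain is inversely proportional to the distance.
import numpy as np

def monopole_coefficients(r_o, speaker_positions, fs=48000, c=343.0):
    r_o = np.asarray(r_o, dtype=float)
    delays, gains = [], []
    for r_p in speaker_positions:
        dist = np.linalg.norm(np.asarray(r_p, dtype=float) - r_o)  # R_p0
        delays.append(int(round(fs * dist / c)))  # delay n_p in samples
        gains.append(1.0 / max(dist, 1e-3))       # a_p ~ 1/R_p0 (guarded)
    return delays, gains
```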
Example Process for Spatial Upmixing of Stereo Content

The example process for spatial upmixing of stereo content combines the steps described above: source separation of the stereo file into separated sources, beat detection, side-mid processing, segmentation and smoothening of the side-mid ratio, position mapping, and rendering of the resulting audio objects by monopole synthesis.
Real-Time Processing
The above described process of upmixing/remixing by dynamically determining parameters of audio objects to be rendered by e.g. a 3D audio rendering process may be performed as a post-processing step on an audio source file, respectively on the separated sources that have been obtained from the audio source file by a source separation process. In such a post-processing scenario, the whole audio file is available for processing. Accordingly, a side-mid ratio may be determined for all beats/windows/frames of a separated source as described above.
The above processes may, however, also be implemented as a real-time system. For example, upmixing/remixing of a stereo file may be performed in real-time on a received audio stream. In the case that the audio signal is processed in real time, it is not appropriate to determine segments of the audio stream only after receipt of the complete audio file (piece of music, or the like). However, a change of audio characteristics or segment boundaries should be detected “on-the-fly” during the streaming process, so that the audio object rendering parameters can be changed immediately after detection of a change, during streaming of the audio file.
For example, a smoothening may be performed by continuously determining a parameter such as the side-mid ratio, and by continuously determining the standard deviation σ of this parameter. Current changes in the parameter can be related to the standard deviation σ. If a current change in the parameter is large with respect to the standard deviation, then the system may determine that there is a significant change in the audio characteristics. A significant change in the audio signal (a jump) may for example be detected when a difference between subsequent parameters (e.g. the per-beat side-mid ratio) in the signal is higher than a threshold value, for example, when the difference exceeds 2σ, or the like, without limiting the present disclosure in that regard.
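An on-the-fly detector of such significant changes may, for example, be sketched as follows; the exponential update of the running statistics and the smoothing factor are assumptions.

```python
# Online jump detection sketch: track a running mean/variance of the
# changes of the per-beat side-mid ratio and report a segment boundary
# when the current change exceeds k*sigma (k=2 as in the text above).
class OnlineJumpDetector:
    def __init__(self, alpha=0.05, k=2.0):
        self.alpha, self.k = alpha, k
        self.mean, self.var, self.prev = 0.0, 1e-6, None

    def update(self, ratio):
        jump = False
        if self.prev is not None:
            diff = abs(ratio - self.prev)
            if diff > self.k * self.var ** 0.5:
                jump = True   # significant change: re-position monopoles
            # exponentially update running statistics of the changes
            self.mean += self.alpha * (diff - self.mean)
            self.var += self.alpha * ((diff - self.mean) ** 2 - self.var)
        self.prev = ratio
        return jump
```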
Such a significant change in the audio characteristics which is detected on-the-fly can be treated like a segment boundary described in the embodiments above. That is, the significant change in the audio characteristics may trigger a reconfiguration of the parameters of the 3D audio rendering process, e.g. a repositioning of monopole positions used in monopole synthesis.
Implementation
The electronic system 700 further comprises a data storage 702 and a data memory 703 (here a RAM). The data memory 703 is arranged to temporarily store or cache data or computer instructions for processing by the processor 701. The data storage 702 is arranged as a long-term storage, e.g. for recording sensor data obtained from the microphone array 710. The data storage 702 may also store audio data that represents audio messages, which the public announcement system may transport to people moving in the predefined space.
It should be noted that the description above is only an example configuration. Alternative configurations may be implemented with additional or other sensors, storage devices, interfaces, or the like.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is, however, given for illustrative purposes only and should not be construed as binding.
It should also be noted that the division of the electronic system 700 into units is only made for illustration purposes and that the present disclosure is not limited to any specific division of functions in specific units.
All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example, on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
Note that the present technology can also be configured as described below.
(1) An electronic device comprising circuitry configured to analyze the results of a stereo or multi-channel source separation to determine one or more time-varying parameters, and to create spatially dynamic audio objects based on the one or more time-varying parameters.
(2) The electronic device of (1), wherein the circuitry is configured to determine, as a time-varying parameter, a parameter describing the signal level-loudness between separated channels, and/or a spectral balance parameter, and/or a primary-ambience indicator, and/or a dry-wet indicator, and/or a parameter describing the percussive-harmonic content.
(3) The electronic device of (1) or (2), wherein the circuitry is configured to determine, as a time-varying parameter, a parameter describing the balance of instruments in a stereo content, and to create the spatially dynamic audio objects based on the balance of instruments in the stereo content.
(4) The electronic device of (1) to (3), wherein the circuitry is configured to determine, as a time-varying parameter, a side-mid ratio of a separated source, and to create the spatially dynamic audio objects based on the side-mid ratio.
(5) The electronic device of (1) to (4), wherein the circuitry is configured to determine spatial positioning parameters for the audio objects based on the one or more time-varying parameters obtained from the results of the stereo or multi-channel source separation.
(6) The electronic device of (1) to (5), wherein the circuitry is configured to dynamically adapt positioning parameters of the audio objects.
(7) The electronic device of (1) to (6), wherein the circuitry is configured to create the spatially dynamic audio objects by monopole synthesis.
(8) The electronic device of (1) to (7), wherein the circuitry is configured to dynamically adapt a spread in monopole synthesis.
(9) The electronic device of (1) to (8), wherein the spatially dynamic audio objects are monopoles.
(10) The electronic device of (1) to (9), wherein the circuitry is configured to dynamically create, based on the one or more time-varying parameters, a first monopole used for rendering the left channel of a separated source, and a second monopole used for rendering the right channel of the separated source.
(11) The electronic device of (1) to (10), wherein the circuitry is configured to create, from the results of the multi-channel source separation, a time-dependent spatial upmix which preserves the original balance of the content.
(12) The electronic device of (1) to (11), wherein the circuitry is further configured to perform, based on the time-varying parameter, a segmentation process to obtain segments of a separated source.
(13) The electronic device of (1) to (12), wherein the circuitry is configured to perform a cluster detection based on the time-varying parameter.
(14) The electronic device of (1) to (13), wherein the circuitry is configured to perform automatic time-dependent spatial upmixing based on the results of a similarity analysis of multi-channel content.
(15) The electronic device of (1) to (14), wherein the circuitry is configured to perform a smoothening process on the segments of the separated source.
(16) The electronic device of (1) to (15), wherein the circuitry is configured to perform a beat detection process to analyze the results of the multi-channel source separation.
(17) The electronic device of (1) to (16), wherein the time-varying parameter is determined per beat, per window, or per frame of a separated source or original content.
(18) A method comprising analyzing the results of a stereo or multi-channel source separation to determine one or more time-varying parameters, and creating spatially dynamic audio objects based on the one or more time-varying parameters.
(19) A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of (18).
References Cited

U.S. Patent Documents

Number | Name | Date | Kind
2686294 | Hower | Aug 1954 | A
8952233 | Johnson | Feb 2015 | B1
20110081024 | Soulodre | Apr 2011 | A1
20120177204 | Hellmuth | Jul 2012 | A1
20140297296 | Koppens | Oct 2014 | A1
20150146873 | Chabanne et al. | May 2015 | A1
20160037282 | Giron | Feb 2016 | A1
20160125867 | Jarvinen | May 2016 | A1
20170289721 | Davis | Oct 2017 | A1
20210055796 | Mansbridge | Feb 2021 | A1
20220392461 | Giron | Dec 2022 | A1

Foreign Patent Documents

Number | Date | Country
1377959 | Jun 2011 | EP
2000295700 | Oct 2000 | JP
2002304191 | Oct 2002 | JP
2012211768 | Nov 2012 | JP
2013006325 | Jan 2013 | WO
2014204997 | Dec 2014 | WO

Other Publications

International Search Report and Written Opinion mailed on Jan. 12, 2021, received for PCT Application PCT/EP2020/080819, filed on Nov. 3, 2020, 9 pages.
Kamado et al., “Object-Based Stereo Up-Mixer for Wave Field Synthesis Based on Spatial Information Clustering”, 20th European Signal Processing Conference (EUSIPCO 2012), Aug. 27-31, 2012, pp. 594-598.
Kraft et al., “Low-Complexity Stereo Signal Decomposition and Source Separation for Application in Stereo to 3D Upmixing”, Audio Engineering Society, Convention Paper 9586, presented at the 140th Convention, Jun. 4-7, 2016, pp. 1-10.
Cano et al., “Musical Source Separation: An Introduction”, IEEE Signal Processing Magazine, vol. 36, No. 1, Jan. 2019, pp. 31-40.