This patent application claims priority of co-pending EP Patent Application No. 15198713.8, filed Dec. 9, 2015 and FR Patent Application No. 1463482, filed Dec. 31, 2014, each of which is herein incorporated by reference in its entirety and for all that it describes.
The present application relates to the field of processes and systems for separation of a plurality of components in a mixture of acoustic signals and in particular the separation of a vocal component affected by reverberation and of a musical background component in a mixture of acoustic signals.
A soundtrack of a song is composed of a vocal component (the lyrics sung by one or more singers) and a musical component (the musical accompaniment or background played by one or more instruments). A soundtrack of a film has a vocal component (dialogue between actors) superimposed on a musical component (sound effects and/or background music). There are instances where one needs to separate a vocal component from a musical component in a soundtrack. For example, in a film, one may need to isolate the background component from the vocal component in order to combine it with dubbed dialogue in a different language to produce a new soundtrack.
Several algorithms which aim at separating the vocal component from the musical component exist in the literature. For example, the article by Jean-Louis Durrieu et al., “An Iterative Approach to Monaural Musical Mixture De-Soloing,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 105-108, discloses a source separation algorithm in under-determined conditions based on a Non-negative Matrix Factorization (NMF) framework that specifically allows for the separation of the vocal contribution from a music background contribution. However, known separation algorithms do not explicitly and properly deal with the reverberation effects that affect the components of the mixture.
In the particular case of a vocal component, the reverberated voice results from the superposition of the dry voice, corresponding to the recording of the sound produced by the singer that propagates directly to the microphone, and the reverb, corresponding to the recording of the sound produced by the singer that arrives indirectly at the microphone, i.e. by reflection, possibly multiple, on the walls of the recording room. The reverberation, composed of echoes of the dry voice at given instants, spreads over a time interval that may be significant (e.g. three seconds). Stated otherwise, at a given instant, the vocal component results from the superposition of the dry voice at this instant and the various echoes of the dry voice at preceding instants.
Existing separation algorithms do not take into account the long-term effects of reverberation affecting a component of the mixture of acoustic signals. The article by Ngoc Q. K. Duong, Emmanuel Vincent, and Rémi Gribonval, “Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 7, pp. 1830-1840, September 2010, focuses on the instantaneous effects of reverberation related to spatial diffusion, but does not model memory effects, i.e. the delay between the recording of a dry sound and the recording of the echoes associated with that dry sound. Moreover, the type of algorithm proposed by the authors of the article applies only to multi-channel signals and does not allow for a correct extraction of reverberation effects, which are common in music. As a consequence, the reverberation that affects a specific component, for example the vocal component, is distributed among the various components obtained after the separation. The separated vocal component then loses its richness and the musical accompaniment component is not of good quality.
Embodiments of the disclosure provide a method and system for separation of components in a mixture of audio components, where the components incorporate reverberations of a corresponding dry signal. For example, embodiments of the disclosure may be used to separate a dry vocal component x(t), whose reverberated version is present in a mixture acoustic signal w(t), from a musical background component z(t). The system includes a non-transitory computer readable medium containing computer executable instructions for separating the components. The medium includes computer executable instructions to run an estimation-correction loop that includes, at each iteration, an estimation function and a correction function. The steps in the estimation-correction loop include first using a model of the spectrogram of the mixture acoustic signal $\hat{V}^{rev}$ corresponding to the sum of a model of the spectrogram of a specific acoustic signal affected by reverberation $\hat{V}^{rev,y}$ and of a model of the spectrogram of the background acoustic signal $\hat{V}^{z}$, the model of the spectrogram of the specific acoustic signal affected by reverberation being related to the model of the spectrogram of the specific dry acoustic signal $\hat{V}^{x}$ according to:

$$\hat{V}^{rev,y}_{f,t} = \sum_{i=1}^{T} \hat{V}^{x}_{f,t-i+1}\, R_{f,i}$$

where R is a reverberation matrix of dimensions F×T, f is a frequency index, t is a time index, and i is an integer between 1 and T; and computing iteratively an estimation of the model of the spectrogram of the background acoustic signal $\hat{V}^{z}$, of the model of the spectrogram of the specific dry acoustic signal $\hat{V}^{x}$, and of the reverberation matrix R so as to minimize a cost-function C between the spectrogram of the mixture acoustic signal V and the model of the spectrogram of the mixture acoustic signal $\hat{V}^{rev}$.
Embodiments of the disclosure will be described in even greater detail below based on the exemplary figures. The present application is not limited to the exemplary embodiments. All features described and/or illustrated herein can be used alone or combined in different combinations in embodiments. The features and advantages of various embodiments of the disclosure will become apparent by reading the following detailed description with reference to the attached drawings which illustrate the following:
In the depicted embodiment, the mixture signal data structure w(t) represents acoustical waves that comprise at least a first and a second component. In an embodiment, the first component is referred to as specific and may be a vocal component corresponding to lyrics sung by a singer, and the second component is referred to as background and may be a musical component corresponding to the accompaniment of the singer.
The vocal signal data structure y(t) is a representation, computed and stored on a computer readable medium, of acoustical waves that represent the first component of the acoustical waves represented by the mixture signal data structure w(t) isolated from the remaining components of the acoustical waves represented by the mixture signal data structure. The background signal data structure z(t) is a representation, computed and stored on a computer readable medium, of acoustical waves that represent the second component of the acoustical waves represented by the mixture signal data structure w(t) isolated from the remaining components of the acoustical waves represented by the mixture signal data structure w(t).
In the depicted embodiment, the vocal signal data structure y(t) is considered as affected by reverberation and is modelled as:

$$y(t) = r(t) * x(t)$$
where x(t) is the dry vocal signal data structure, i.e. the acoustic signal produced by the singer that propagates directly to the microphone; where r(t) is an impulse response data structure, which corresponds to a distribution giving the amplitude of each echo as a function of its time of arrival at the microphone; and where * is the convolution product.
The dry vocal signal data structure x(t) is a representation, computed and stored on a computer readable medium, of acoustical waves that corresponds to the signal propagated in free-field. The impulse response data structure r(t) is a representation, computed and stored on a computer readable medium, that characterizes the acoustic environment of the recording of the dry vocal signal data structure x(t). In some embodiments, the reverberation can result from the environment where the sound is being recorded as described above, but it can also be artificially added during the mixing or the post-production process of the vocal component, mainly for aesthetic reasons.
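As a concrete illustration of the convolutive model above, the following sketch synthesizes a toy impulse response and applies it to a placeholder dry signal; the sample rate, echo times, and gains are invented for the example and are not values from the disclosure.

```python
# Minimal sketch: reverberation as a convolution y(t) = (r * x)(t).
import numpy as np
from scipy.signal import fftconvolve

fs = 44100                                   # sample rate (Hz), assumed
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220.0 * t)            # stand-in for a dry voice

# Toy impulse response: a direct path followed by decaying echoes.
rir = np.zeros(fs // 2)
rir[0] = 1.0                                 # direct (dry) path
echo_samples = (np.array([0.03, 0.09, 0.21]) * fs).astype(int)
rir[echo_samples] = [0.5, 0.3, 0.15]         # early reflections

y = fftconvolve(x, rir)                      # reverberated voice y = r * x
```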
In the time-frequency domain, for non-negative spectrograms, this reverberation model can be approximated, as proposed in the article by Rita Singh, Bhiksha Raj and Paris Smaragdis, “Latent-Variable Decomposition Based Dereverberation of Monaural and Multi-Channel Signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Tex., USA, March 2010, by:

$$V^{rev,y}_{f,t} \approx \sum_{i=1}^{T} V^{x}_{f,t-i+1}\, R_{f,i}$$

where $V^{rev,y}$ is the spectrogram of the vocal signal data structure y(t), considered as affected by reverberation, $V^x$ is the spectrogram of the dry vocal signal data structure x(t), and R is a reverberation matrix of dimensions F×T corresponding to the spectrogram of the impulse response data structure r(t), with F being the frequency dimension and T being the temporal dimension of R.
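For illustration, this non-negative approximation amounts to convolving each frequency row of the dry spectrogram with the corresponding row of R; the following minimal numpy sketch (function name and zero-padding convention are assumptions of this illustration) makes that explicit.

```python
# Line-wise convolution: V_rev_y[f, t] = sum_i Vx[f, t-i+1] * R[f, i].
import numpy as np

def reverberate_spectrogram(Vx: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Vx: F x U dry spectrogram; R: F x T reverberation matrix."""
    F, U = Vx.shape
    out = np.zeros((F, U))
    for f in range(F):
        # full 1-D convolution of the rows, truncated to the first U frames
        out[f] = np.convolve(Vx[f], R[f])[:U]
    return out
```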
At Step 105, the process 100 receives the mixture signal data structure w(t).
At Step 110, the process 100 creates a data structure representing a spectrogram of the mixture signal data structure w(t). This step may be performed by calculating the spectrogram V of the mixture signal data structure w(t) and storing V on a computer readable medium. In general, a spectrogram is defined as the modulus (or the squared modulus) of the Short-Time Fourier Transform of a signal. In other embodiments, other time-frequency transformations can be used, such as a Constant Q Transform (CQT), or a Short-Time Fourier Transform followed by a filtering in the frequency domain (using filter banks on a Mel or Bark scale, for instance). For each time-frame of the signal, the spectrogram is composed of a vector that represents the instantaneous energy of the signal at each frequency point. In this embodiment, the spectrogram V is therefore a matrix of dimensions F×U composed of positive real numbers, where U is the total number of time-frames which divide the duration of the mixture signal data structure w(t), and F is the total number of frequency points, which may be between 200 and 2000. After Step 110, two paths are defined: a first path following Steps 115-140 and a second path following Steps 215-240, referred to as the first part and the second part of the process, respectively.
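A minimal sketch of Step 110 using scipy's STFT follows; the window length and overlap are illustrative choices, not values prescribed by the disclosure.

```python
# Compute a non-negative spectrogram V (F x U) as the STFT magnitude.
import numpy as np
from scipy.signal import stft

def spectrogram(w: np.ndarray, fs: int = 44100) -> np.ndarray:
    _, _, W = stft(w, fs=fs, nperseg=2048, noverlap=1536)
    return np.abs(W)   # modulus of the time-frequency transform
```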
At Step 115, the process progresses to determining a cost function and parameters of the cost function using data structures representing spectrograms of the mixture signal data structure w(t), the vocal signal data structure y(t), and the background signal data structure z(t). This step involves first assuming that the vocal signal data structure y(t) is a dry vocal signal data structure, that is, that it contains no reverberation.
With the foregoing assumption, the model of the spectrogram of the mixture signal data structure is assumed to be the sum of the spectrogram of the vocal signal data structure $\hat{V}^y$ and the spectrogram of the background signal data structure $\hat{V}^z$, where $\hat{V}^y$ is the data structure representing the spectrogram of the signal y(t), considered unaffected by reverberation, and $\hat{V}^z$ is the data structure representing the spectrogram of the signal z(t). This additive model is commonly assumed within the framework of Non-negative Matrix Factorization. Note that the nomenclature $\hat{a}$ denotes an estimation of a; thus the data structures $\hat{V}^y$ and $\hat{V}^z$ in this step are estimates. The estimated spectrograms are stored on a computer readable medium. This step involves estimating the spectrograms of the two contributions under the constraint that their sum is approximately equal to the spectrogram of the mixture signal data structure. Mathematically, this is expressed as:
$$V \approx \hat{V} = \hat{V}^y + \hat{V}^z$$
In some embodiments, the modelling of the spectrogram of the vocal signal may be based on a source-filter voice production model, as proposed in Jean-Louis Durrieu et al. “An iterative approach to monaural musical mixture de-soloing,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 105-108:
$$\hat{V}^y = (W^{F0} H^{F0}) \odot (W^K H^K)$$
where the first term corresponds to a modelling of the vocal source produced by the vibration of the vocal folds: WF0 is a matrix representation composed of predefined harmonic atoms and HF0 is a matrix of activation that controls at every instant which harmonic atoms of WF0 are activated. The second term corresponds to a modelling of the vocal filter and reproduces the filtering that is performed in the vocal tract: WK is a matrix representation of filter atoms, and HK is a matrix of activation that controls at every instant which filter atoms of WK are activated. The operator ⊙ corresponds to the element-wise matrix product (also known as the Hadamard product).
Similarly, the modelling of the musical background signal may be based on a generic Non-negative Matrix Factorization model:
$$\hat{V}^z = W^R H^R$$
where the columns of WR can be seen as elementary spectral patterns, and HR as a matrix of activation of these elementary spectral patterns over time.
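To make the two models concrete, here is a minimal numpy sketch assembling $\hat{V}^y$ and $\hat{V}^z$ from randomly initialized matrices; all dimensions and atom counts are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

F, U = 513, 400                  # frequency bins, time frames (illustrative)
n_f0, n_k, n_r = 300, 10, 20     # numbers of atoms (illustrative)

rng = np.random.default_rng(0)
W_F0 = rng.random((F, n_f0))     # predefined harmonic atoms (kept fixed)
H_F0 = rng.random((n_f0, U))     # activations of the harmonic atoms
W_K = rng.random((F, n_k))       # filter atoms (vocal tract)
H_K = rng.random((n_k, U))       # activations of the filter atoms
W_R = rng.random((F, n_r))       # elementary spectral patterns (background)
H_R = rng.random((n_r, U))       # activations of the spectral patterns

V_y = (W_F0 @ H_F0) * (W_K @ H_K)   # source-filter vocal model, * is element-wise
V_z = W_R @ H_R                     # generic NMF background model
V_hat = V_y + V_z                   # additive model of the mixture spectrogram
```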
At Step 115, when using the foregoing representations, the parameters being determined are the entries of the matrices HF0, WK, HK, WR and HR (the matrix WF0 of predefined harmonic atoms being fixed). In order to estimate the parameters of these matrices, a cost-function C, based on an element-wise divergence d, is used:
$$C = D(V \mid \hat{V}^y + \hat{V}^z) = \sum_{f,t} d\left(V_{f,t} \,\middle|\, \hat{V}^y_{f,t} + \hat{V}^z_{f,t}\right)$$
An implementation is herein contemplated in which the Itakura-Saito divergence, well known to a person skilled in the art, is used. This divergence is obtained from the beta-divergence family by setting the parameter β = 0 and is written:

$$d_{IS}(a \mid b) = \frac{a}{b} - \log\frac{a}{b} - 1$$
As a reminder, the beta-divergence family is defined by:

$$d_\beta(a \mid b) = \frac{1}{\beta(\beta-1)}\left(a^\beta + (\beta-1)\,b^\beta - \beta\, a\, b^{\beta-1}\right)$$

where a and b are two real, positive scalars (the cases β = 0 and β = 1 being defined by continuity).
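The Itakura-Saito case (β = 0) is compact enough to state in a few lines of numpy; the eps guard below is an implementation detail added for this sketch, not part of the cost-function.

```python
# Itakura-Saito divergence between spectrograms, summed element-wise as in C.
import numpy as np

def is_divergence(V: np.ndarray, V_hat: np.ndarray, eps: float = 1e-12) -> float:
    ratio = (V + eps) / (V_hat + eps)
    return float(np.sum(ratio - np.log(ratio) - 1.0))
```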
At Step 120, the cost-function C is thus minimized so as to estimate an optimal value of the parameters of each matrix. This minimization is performed iteratively using multiplicative update rules successively applied to each of the parameters of HF0, WK, HK, WR and HR matrices.
For each parameter, an update rule can be obtained from the partial derivative of the cost-function C with respect to that parameter. More precisely, the partial derivative of the cost-function with respect to a given parameter is decomposed as a difference of two positive terms, and the update rule for the considered parameter consists in multiplying the parameter by the ratio of the two positive terms. This technique ensures that parameters initialized with positive values stay positive at each iteration, and that partial derivatives that are null, corresponding to local minima, leave the value of the corresponding parameters unchanged. Using such an optimization algorithm, the parameters are updated so that the cost-function approaches a local minimum.
The update rules of the parameters of the spectrograms, derived for the Itakura-Saito divergence (β = 0), can be written as:

$$H^{F0} \leftarrow H^{F0} \odot \frac{(W^{F0})^T\left[(W^K H^K) \odot V \odot \hat{V}^{\odot-2}\right]}{(W^{F0})^T\left[(W^K H^K) \odot \hat{V}^{\odot-1}\right]}$$

$$W^{K} \leftarrow W^{K} \odot \frac{\left[(W^{F0} H^{F0}) \odot V \odot \hat{V}^{\odot-2}\right](H^K)^T}{\left[(W^{F0} H^{F0}) \odot \hat{V}^{\odot-1}\right](H^K)^T}, \qquad H^{K} \leftarrow H^{K} \odot \frac{(W^K)^T\left[(W^{F0} H^{F0}) \odot V \odot \hat{V}^{\odot-2}\right]}{(W^K)^T\left[(W^{F0} H^{F0}) \odot \hat{V}^{\odot-1}\right]}$$

$$W^{R} \leftarrow W^{R} \odot \frac{\left[V \odot \hat{V}^{\odot-2}\right](H^R)^T}{\left[\hat{V}^{\odot-1}\right](H^R)^T}, \qquad H^{R} \leftarrow H^{R} \odot \frac{(W^R)^T\left[V \odot \hat{V}^{\odot-2}\right]}{(W^R)^T\left[\hat{V}^{\odot-1}\right]}$$

where $\hat{V} = \hat{V}^y + \hat{V}^z$, ⊙ is the element-wise matrix (or vector) product operator, $(\cdot)^{\odot\alpha}$ is the element-wise exponentiation of a matrix by a scalar, $(\cdot)^T$ is the matrix transpose operator, and the fraction bars denote element-wise division. For this first part of the process, all the parameters of the HF0, WK, HK, WR and HR matrices are initialized with randomly chosen non-negative values.
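The following sketch performs one pass of multiplicative updates of this kind for the Itakura-Saito case; it is an interpretation of the rules above (with an eps guard added for numerical safety) rather than verbatim code from the disclosure, and it keeps WF0 fixed as the text prescribes.

```python
import numpy as np

def is_nmf_update_step(V, W_F0, H_F0, W_K, H_K, W_R, H_R, eps=1e-12):
    """One pass of multiplicative updates (Itakura-Saito, beta = 0)."""
    def terms():
        V_hat = (W_F0 @ H_F0) * (W_K @ H_K) + W_R @ H_R + eps
        return V / V_hat**2, 1.0 / V_hat   # V (.) V_hat^(-2)  and  V_hat^(-1)

    num, den = terms()
    S = W_K @ H_K                          # filter part of the vocal model
    H_F0 *= (W_F0.T @ (S * num)) / (W_F0.T @ (S * den) + eps)

    num, den = terms()
    E = W_F0 @ H_F0                        # source part of the vocal model
    W_K *= ((E * num) @ H_K.T) / ((E * den) @ H_K.T + eps)

    num, den = terms()
    H_K *= (W_K.T @ (E * num)) / (W_K.T @ (E * den) + eps)

    num, den = terms()
    W_R *= (num @ H_R.T) / (den @ H_R.T + eps)

    num, den = terms()
    H_R *= (W_R.T @ num) / (W_R.T @ den + eps)
    return H_F0, W_K, H_K, W_R, H_R
```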
At Step 130, the process uses a tracking algorithm on the parameters of the spectrogram corresponding to the vocal component in order to identify the frequency components of the vocal component with maximum energy at each time step. The matrix HF0 is processed with a tracking algorithm, such as a Viterbi algorithm, in order to select, for each time step, the frequency component (corresponding to one atom of the matrix WF0) for which the energy is maximal, while constraining this selection not to be too far from the selection at the preceding time step. This step leads to the estimation of a melodic line corresponding to the melody sung by the singer.
At Step 140, the process then removes frequency components distant from the melodic line determined at Step 130. In some embodiments, this is accomplished by setting to 0 the elements of the matrix HF0 that are further than a predefined distance from the melodic line. By modifying the HF0 matrix in this way, a new matrix H′F0 is obtained.
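Steps 130 and 140 might be sketched as follows, with a simple dynamic-programming tracker standing in for the Viterbi algorithm; the jump penalty and tolerance are illustrative parameters, and the atom index is used here as a proxy for frequency distance.

```python
import numpy as np

def prune_to_melody(H_F0: np.ndarray, jump_penalty: float = 0.1,
                    tolerance: int = 5) -> np.ndarray:
    """Track one atom per frame, then zero activations far from the path."""
    n_atoms, U = H_F0.shape
    score = np.log(H_F0 + 1e-12)
    cost = score[:, 0].copy()
    back = np.zeros((n_atoms, U), dtype=int)
    idx = np.arange(n_atoms)
    for t in range(1, U):
        # transition score: penalize jumps between consecutive atoms
        trans = cost[None, :] - jump_penalty * np.abs(idx[:, None] - idx[None, :])
        back[:, t] = np.argmax(trans, axis=1)
        cost = trans[idx, back[:, t]] + score[:, t]
    # backtrack the melodic line
    path = np.zeros(U, dtype=int)
    path[-1] = int(np.argmax(cost))
    for t in range(U - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    # keep only activations near the line, yielding H'_F0
    H_pruned = np.zeros_like(H_F0)
    for t in range(U):
        lo, hi = max(0, path[t] - tolerance), path[t] + tolerance + 1
        H_pruned[lo:hi, t] = H_F0[lo:hi, t]
    return H_pruned
```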
After Step 140 of process 100, the assumption that the vocal signal data structure contains no reverberation is abandoned. At Step 215, the process determines cost-function parameters using V (the data structure representing the spectrogram of the mixture signal data structure w(t)) and the data structures representing a spectrogram estimating a vocal signal data structure with reverberation and a spectrogram estimating a background signal data structure. Since the vocal component is considered as being affected by some reverberation, the model of the spectrogram of the reverberated vocal signal data structure $\hat{V}^{rev,y}$, as a function of the spectrogram of the dry vocal signal $\hat{V}^x$, is expressed as:

$$\hat{V}^{rev,y}_{f,t} = \left[\hat{V}^x *_t R\right]_{f,t} = \sum_{i=1}^{T} \hat{V}^{x}_{f,t-i+1}\, R_{f,i}$$
where $*_t$ denotes a line-wise convolutional operator, as defined in the right-hand term of the above equation. The reverberation matrix R is composed of T time steps (of the same duration as the time steps of the spectrogram of the mixture signal) and F frequency steps. In some embodiments, T is predefined by the user and is usually in the range 20-200, for instance 100.
Similarly to the previous discussion, the data structure representing the spectrogram of the dry vocal signal $\hat{V}^x$ is modelled as:
$$\hat{V}^x = (W^{F0} H^{F0}) \odot (W^K H^K)$$
and the spectrogram of the music background signal $\hat{V}^z$ is modelled as:
$$\hat{V}^z = W^R H^R$$
Thus, steps 215 to 240 involve the estimation of parameters for the matrices HF0, WK, HK, WR, HR and R that best approximate V (the spectrogram of the mixture signal data structure). Mathematically, this is written as:
$$V \approx \hat{V}^{rev} = \hat{V}^{rev,y} + \hat{V}^z$$
In order to estimate the parameters of these matrices, at step 215, a cost-function C, based on an element-wise divergence d is used:
$$C = D(V \mid \hat{V}^{rev,y} + \hat{V}^z) = \sum_{f,t} d\left(V_{f,t} \,\middle|\, \hat{V}^{rev,y}_{f,t} + \hat{V}^z_{f,t}\right)$$
where the divergence is obtained from the beta-divergence family, by setting the parameter β = 0, as:

$$d_{IS}(a \mid b) = \frac{a}{b} - \log\frac{a}{b} - 1$$
Since similar models are utilized in Steps 115 and 215, the cost-function obtained in Step 215 is similar to the cost-function of Step 115.
At Step 220, the cost-function C is then minimized in order to estimate an optimal value for each parameter, in particular for the parameters of the reverberation matrix. The minimization is performed iteratively by means of multiplicative update rules, successively applied to each parameter of the matrices. For the reverberation matrix, the update rule derived from the partial derivatives of the cost-function is expressed as:

$$R \leftarrow R \odot \frac{\left[V \odot (\hat{V}^{rev})^{\odot-2}\right] *_t \hat{V}^x}{\left[(\hat{V}^{rev})^{\odot-1}\right] *_t \hat{V}^x}$$

where $*_t$ here denotes the line-wise correlation operator between two matrices defined as $[A *_t B]_{f,t} = \sum_{\tau=t}^{U} A_{f,\tau}\, B_{f,\tau-t+1}$, and the fraction bar denotes element-wise division. The update rules for the parameters HF0, WK and HK of the dry vocal model take the same form as in Step 120, with the terms $V \odot \hat{V}^{\odot-2}$ and $\hat{V}^{\odot-1}$ replaced by their line-wise correlations with the reverberation matrix R, so that the gradient is propagated through the convolution.
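A numpy sketch of the reverberation-matrix update, with the line-wise correlation operator written out explicitly, follows; the helper names and the eps guard are assumptions of this illustration, not the patent's literal code.

```python
import numpy as np

def corr_t(A: np.ndarray, B: np.ndarray, T: int) -> np.ndarray:
    """Line-wise correlation: out[f, i] = sum_tau A[f, tau] * B[f, tau - i],
    a 0-based version of the operator defined above. A, B: F x U."""
    F, U = A.shape
    out = np.zeros((F, T))
    for i in range(T):
        out[:, i] = np.sum(A[:, i:] * B[:, :U - i], axis=1)
    return out

def update_R(R, V, V_rev, V_x, eps=1e-12):
    """Multiplicative update of the reverberation matrix R (F x T)."""
    T = R.shape[1]
    num = corr_t(V / (V_rev + eps) ** 2, V_x, T)
    den = corr_t(1.0 / (V_rev + eps), V_x, T)
    return R * num / (den + eps)
```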
For the background component, similarly to the case with no reverberation, the update rules are given by:

$$W^{R} \leftarrow W^{R} \odot \frac{\left[V \odot (\hat{V}^{rev})^{\odot-2}\right](H^R)^T}{\left[(\hat{V}^{rev})^{\odot-1}\right](H^R)^T}, \qquad H^{R} \leftarrow H^{R} \odot \frac{(W^R)^T\left[V \odot (\hat{V}^{rev})^{\odot-2}\right]}{(W^R)^T\left[(\hat{V}^{rev})^{\odot-1}\right]}$$
Analogous to Step 120, the update rules are obtained from the partial derivatives of the cost-function with respect to each corresponding parameter. These update rules thus depend on the type of cost-function that has been chosen, and hence on the type of divergence used in building the cost-function. As such, all the update rules given above are examples derived using the beta-divergence. Other models may yield different rules.
Even though different models may yield different rules, embodiments of the disclosure obtain update rules from partial derivatives with respect to a specific parameter. As such, the update rule of the reverberation matrix R is generic in the sense that it is not a function of the modelling selected for the spectrogram of the dry vocal signal data structure $\hat{V}^x$ or the spectrogram of the music background signal data structure $\hat{V}^z$.
The estimation of the matrix HF0 is accomplished iteratively, starting with the initialization set to H′F0, the modified activation matrix obtained from Step 140. Note that since the update rules are multiplicative, the coefficients of the matrix HF0 that are initialized to 0 will remain null during the minimization of the cost-function of the second part of the process. The other parameters of the model, in particular those related to the reverberated specific contribution $\hat{V}^{rev,y}$, are initialized with non-negative random values.
When the value of the cost-function measuring the divergence between the spectrogram of the mixture signal V and the estimated spectrogram $\hat{V}^{rev} = \hat{V}^{rev,y} + \hat{V}^z$ falls below a predefined threshold, or when the number of iterations of the optimization process reaches a limit fixed beforehand, the process exits the iteration loop and the values obtained for the matrices R, HF0, WK, HK, WR and HR are dubbed the final estimates.
At Step 230, estimated complex spectrograms of the dry vocal signal and of the background signal are obtained by means of a Wiener-like filtering applied to the time-frequency transform of the mixture signal. In some embodiments, this step involves creating time-frequency masks from the final estimates. An example of a mask (or Wiener mask) for the dry signal is $\hat{V}^x / (\hat{V}^{rev,y} + \hat{V}^z)$, and an example of a mask for the background signal is $\hat{V}^z / (\hat{V}^{rev,y} + \hat{V}^z)$. To obtain the time-frequency representations of the dry signal and the background signal, these masks are successively applied (element-wise multiplication) to the spectrogram of the mixture signal V and multiplied by the phase component of the time-frequency transform of the mixture signal (the spectrogram being defined as the modulus of the time-frequency transform). Thus, for each source, a complex spectrogram is obtained.
Then, at Step 240, the process obtains data structures representing the dry vocal signal x(t) and the background signal z(t) by applying an inverse time-frequency transformation to the complex spectrograms of the dry vocal signal and of the background signal obtained at Step 230.
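Steps 230 and 240 together might look like the following sketch, where Wiener-like masks built from the model spectrograms are applied to the complex STFT of the mixture and then inverted; the STFT parameters and the frame-alignment step are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(w, V_x, V_rev_y, V_z, fs=44100, eps=1e-12):
    """Wiener-like masking followed by inverse STFT.
    V_x, V_rev_y, V_z: F x U model spectrograms matching the STFT grid."""
    _, _, W = stft(w, fs=fs, nperseg=2048, noverlap=1536)   # complex F x U'
    W = W[:, : V_x.shape[1]]                 # align frame counts (sketch only)
    denom = V_rev_y + V_z + eps
    X = (V_x / denom) * W                    # masked dry-voice transform
    Z = (V_z / denom) * W                    # masked background transform
    _, x = istft(X, fs=fs, nperseg=2048, noverlap=1536)
    _, z = istft(Z, fs=fs, nperseg=2048, noverlap=1536)
    return x, z                              # time-domain estimates
```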
The described embodiment is applied to the extraction of a specific component of interest, which is preferably a vocal signal. However, the modelling of the reverberation affecting a component is generic and can be applied to any kind of component. In particular, the music background component might also be affected by reverberation. Moreover, any kind of non-negative spectrogram model for a dry component can equally be used in place of those described above. Furthermore, in the presented embodiment, the mixture signal is composed of two components; the generalization to any number of components is straightforward for a person skilled in the art.
In the depicted embodiment, the central server 12 includes means of executing computations, e.g. one or more processors, and computer readable media, e.g. non-volatile memory. The computer readable media can store processor executable instructions for performing the process 100 described above.
The server 12 also includes a first step module 30 configured to obtain, in a manner such as that described in connection with Steps 115, 120, 130, and 140 of the process 100, the modified activation matrix H′F0 used to initialize the second part of the process. The first step module 30 may include a first modeling module 32 configured to model the spectrogram of the vocal signal and a second modeling module 34 configured to model the spectrogram of the background signal.
The server 12 also includes a second step module 40 configured to obtain, in a manner such as that described in connection with elements 215 and 220 of the process 100, the final estimates of the model parameters. The second step module 40 includes a first modeling module configured to obtain a parametric spectrogram data structure $\hat{V}^{rev,y}$ that models the spectrogram of the vocal signal affected by reverberation.
The second step module 40 further includes a second modeling module 60 configured to obtain a parametric spectrogram data structure $\hat{V}^z$ that models the spectrogram of the background signal (similar to the second modeling module 34). In addition, the second step module 40 includes an estimation module 70 configured to estimate the parameters of the parametric spectrogram data structures $\hat{V}^{rev,y}$ and $\hat{V}^z$ using the spectrogram data structure V. The estimation module 70 is configured to perform an estimation in a manner such as that described in connection with element 220 of the process 100.
Furthermore, the central server 12 includes a filtering module 80 configured to implement Wiener filtering for determining the spectrogram data structure $\hat{V}^x$ of the dry vocal signal data structure x(t) and the spectrogram data structure $\hat{V}^z$ of the background signal data structure z(t) from the optimized parameters, in a manner such as that described in connection with element 230 of the process 100.
The computer environment includes a computer 300, which includes a central processing unit (CPU) 310, a system memory 320, and a system bus 330. The system memory 320 includes both read only memory (ROM) 340 and random access memory (RAM) 350. The ROM 340 stores a basic input/output system (BIOS) 360, which contains the basic routines that assist in the exchange of information between elements within the computer, for example, during start-up. The RAM 350 stores a variety of information including an operating system 370, application programs 380, other programs 390, and program data 400. The computer 300 further includes secondary storage drives 410A, 410B, and 410C, which read from and write to secondary storage media 420A, 420B, and 420C, respectively. The secondary storage media 420A, 420B, and 420C may include, but are not limited to, flash memory, one or more hard disks, one or more magnetic disks, one or more optical disks (e.g. CDs, DVDs, and Blu-Ray discs), and various other forms of computer readable media. Similarly, the secondary storage drives 410A, 410B, and 410C may include solid state drives (SSDs), hard disk drives (HDDs), magnetic disk drives, and optical disk drives. In some implementations, the secondary storage media 420A, 420B, and 420C may store a portion of the operating system 370, the application programs 380, the other programs 390, and the program data 400.
The system bus 330 couples various system components, including the system memory 320, to the CPU 310. The system bus 330 may be of any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system bus 330 connects to the secondary storage drives 410A, 410B, and 410C via secondary storage drive interfaces 430A, 430B, and 430C, respectively. The secondary storage drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, programs, and other data for the computer 300.
A user may enter commands and information into the computer 300 through user interface device 440. User interface device 440 may be, but is not limited to, any of a microphone, a touch screen, a touchpad, a keyboard, and a pointing device, e.g. a mouse or a joystick. User interface device 440 is connected to the CPU 310 through port 450. The port 450 may be, but is not limited to, any of a serial port, a parallel port, a universal serial bus (USB), a 1394 bus, and a game port. The computer 300 may output various signals through a variety of different components.
The computer 300 may operate in a networked environment by utilizing connections to one or more devices within a network 500, including another computer, a server, a network PC, a peer device, or other network node. These devices typically include many or all of the components found in the example computer 300. For example, the example computer 300 may be connected to the network 500 through a network interface or adapter.
Comparative tests were performed to evaluate the performance of the proposed embodiment of the disclosure against other known processes. The first system performs the extraction of the vocal part by considering a Non-negative Matrix Factorization model based on a source-filter voice production model, without modelling the reverberation. The second system corresponds to the process described above and therefore explicitly models the effects of reverberation on the vocal component. The third system corresponds to a theoretical limit that can be reached using Wiener masks computed from the actual spectrograms of the original separated sources, which were available for these experiments.
In order to quantify the results for the different systems, objective metrics commonly used in the domain of audio source separation are computed. These metrics are the Signal to Distortion Ratio (SDR), which corresponds to a global quantitative metric; the Signal to Artifact Ratio (SAR), which quantifies the amount of artifacts present in the separated components; and the Signal to Interference Ratio (SIR), which quantifies the amount of residual interference between the separated components. For all three metrics, a higher value signifies better performance.
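As an aside, these BSS Eval metrics are implemented in the open-source mir_eval package; a hedged usage sketch with synthetic placeholder signals follows (any BSS Eval implementation would do equally well).

```python
# Compute SDR / SIR / SAR for two estimated sources against references.
import numpy as np
import mir_eval.separation

rng = np.random.default_rng(0)
x_true, z_true = rng.standard_normal((2, 44100))   # placeholder references
x_hat = x_true + 0.1 * rng.standard_normal(44100)  # placeholder estimates
z_hat = z_true + 0.1 * rng.standard_normal(44100)

ref = np.vstack([x_true, z_true])   # rows are sources, columns are samples
est = np.vstack([x_hat, z_hat])
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(ref, est)
```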
Results are presented in the accompanying figures.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
The use of the terms “a” and “an” and “the” and “at least one” and similar referents in the context of describing the embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The use of the term “at least one” followed by a list of one or more items (for example, “at least one of A and B”) is to be construed to mean one item selected from the listed items (A or B) or any combination of two or more of the listed items (A and B), unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate certain aspects of the disclosure and does not pose a limitation on the scope of the application unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Preferred embodiments of the disclosure are described herein, including the best mode known to the inventors for carrying out the embodiments. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this application includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the present application unless otherwise indicated herein or otherwise clearly contradicted by context.
Other Publications:
Lee, Daniel D., et al., “Learning the parts of objects by non-negative matrix factorization”, Nature, vol. 401, pp. 788-791, (Oct. 21, 1999).
Durrieu, Jean-Louis, et al., “An Iterative Approach to Monaural Musical Mixture De-Soloing”, IEEE International Conference on Acoustics, Speech and Signal Processing, 4 pages total, (2009).
Virtanen, Tuomas, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, No. 3, pp. 1066-1074, (Mar. 2007).
Smaragdis, Paris, et al., “Supervised and Semi-Supervised Separation of Sounds from Single-Channel Mixtures”, ICA '07 Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation, 8 pages total, (2007).
Duong, Ngoc Q.K., et al., “Under-determined reverberant audio source separation using a full-rank spatial covariance model”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, No. 7, pp. 1-11, (2010).
Arberet, Simon, et al., “Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation”, 10th International Conference on Information Sciences, Signal Processing and their Applications (ISSPA), 6 pages total, (May 2010).
Singh, Rita, et al., “Latent-Variable Decomposition based Dereverberation of Monaural and Multi-Channel Signals”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4 pages, (2010).
Smaragdis, Paris, et al., “Non-Negative Matrix Factorization for Polyphonic Music Transcription”, 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 177-180, (Oct. 19-22, 2003).
Vincent, Emmanuel, et al., “Adaptive harmonic spectral decomposition for multiple pitch estimation”, IEEE Transactions on Audio, Speech and Language Processing, IEEE, 18(3), 10 pages total, (2010).
Ozerov, Alexey, et al., “A General Flexible Framework for the Handling of Prior Information in Audio Source Separation”, IEEE Transactions on Audio, Speech and Language Processing, 20(4), 17 pages total, (2011).
Ozerov, Alexey, et al., “Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation”, IEEE Transactions on Audio, Speech, and Language Processing, 18(3), pp. 550-563, (Mar. 2010).
Fevotte, Cédric, et al., “Algorithms for nonnegative matrix factorization with the beta-divergence”, Neural Computation, 25 pages, (Mar. 2011).
Vincent, Emmanuel, “Performance measurement in blind audio source separation”, IEEE Transactions on Audio, Speech, and Language Processing, Institute of Electrical and Electronics Engineers, 14(4), 10 pages total, (2006).
French Preliminary Research Report and Written Opinion for French Application No. 1463482, filed Dec. 31, 2014, (13 pages).
Hennequin, Romain, et al., “Internal Research Report: Long-Term Reverberation Modeling in Under-Determined Audio Source Separation”, (5 pages), unpublished, date no later than Jan. 8, 2016.