The present invention relates to the field of processes and systems for separation of a specific contribution from a background component of an audio mixture signal.
A soundtrack of a movie or a TV show consists of dialogue superimposed with special audio effects and/or music. For an old movie, the soundtrack is a mixture of at least two of these components. Thus, if one wishes to broadcast the movie in a version other than the original one, one may need to separate the dialogue component from the background component in the original soundtrack. Doing so makes it possible to add, onto an isolated background component, a dubbed dialogue in a different language in order to produce a new soundtrack.
In some situations, the producers of a movie may only have a license to broadcast a piece of music in a particular country or region or for a limited duration of time. It may be illegal to broadcast a movie for which the soundtrack does not conform to the contract terms. To broadcast the movie, it may then be necessary to separate the dialogue component of the soundtrack from the background component of the soundtrack, in order to add the isolated original dialogue to a new piece of music and thereby obtain a new soundtrack.
In the general field of audio signal processing, source separation has been an important topic during the past decade. In the prior art, audio source separation was first addressed in a blind context. Non-negative matrix factorization (NMF) has been widely used in this context. For instance, the document by T. Virtanen, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1066-1074, March 2007, divulges an NMF-based method for source separation. However, one of the main drawbacks of this technique is the difficulty of clustering the factorized elements and associating them with a particular source.
More recently, numerous works have proposed adding extra information to NMF methods to improve results. In the particular field of musical source separation (i.e. separation of an instrument from a band or orchestra), an algorithm was proposed in which the different spectral shapes of each source are learned on isolated sounds and then used to decompose the mixture. In another work, a MIDI file is used to guide the separation of instruments in music pieces.
In the particular field of separating speech from background noise, one proposal has been to use a guide sound signal that mimics the dialogue component of the mixture signal in order to guide the separation process. More particularly, the guide signal corresponds to a recording of the voice of a speaker dubbing the target dialogue component that is to be separated. The document P. Smaragdis and G. Mysore, “Separation by Humming: User-Guided Sound Extraction from Monophonic Mixtures,” in Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y., USA, October 2009, proposed such an approach. In this document, the authors use a process based on Probabilistic Latent Component Analysis (PLCA). This process uses a guide signal that mimics the dialogue component to be extracted from the audio mixture signal and is set as an input to the PLCA.
The document by L. Le Magoarou et al., “Text-Informed Audio Source Separation Using Nonnegative Matrix Partial Co-Factorization,” in IEEE International Workshop on Machine Learning for Signal Processing, Southampton, UK, September 2013, divulges an algorithm based on a source-filter model of vocal production in the dialogue contribution of the mixture signal and in the guide signal. That algorithm models time misalignments and equalization differences, but it does not model pitch differences between the guide signal and the dialogue contribution of the mixture signal.
A method is described herein for transforming an audio mixture signal data structure x(t) representing an audio mixture signal having a specified component and a background component into a data structure corresponding to the specified component and a data structure corresponding to the background component, the method including obtaining a guide signal data structure g(t) corresponding to a dubbing of the specified component and storing the guide signal data structure g(t) at a computer readable medium, modeling, by a first modeling module, a spectrogram of a specified signal data structure y(t) as a parametric spectrogram data structure {circumflex over (V)}py having a plurality of frames and including, for each of the plurality of frames, a parameter that models a pitch difference between the guide signal data structure g(t) and the specified component, modeling, by a second modeling module, a spectrogram of a background signal data structure z(t) as a parametric spectrogram data structure {circumflex over (V)}pz, estimating, by an estimating module, the parameters of the parametric spectrogram data structure {circumflex over (V)}py to produce a temporary specified signal spectrogram data structure Viy for the specified signal data structure y(t), estimating, by the estimating module, the parameters of the parametric spectrogram data structure {circumflex over (V)}pz to produce a temporary background signal spectrogram data structure Viz for the background signal data structure z(t), obtaining, from the audio mixture signal data structure x(t), an audio mixture signal constant Q transform (CQT) data structure Vx and storing the CQT data structure Vx at the computer readable medium, and filtering, to provide a specified audio signal CQT data structure Vy and a background audio signal CQT data structure Vz, the audio mixture signal CQT Vx using the temporary specified signal spectrogram Viy and the temporary background signal spectrogram Viz, wherein the specified audio signal CQT data structure Vy is the data structure corresponding to the specified component, and wherein the background audio signal CQT data structure Vz is the data structure corresponding to the background component.
A system is described herein for transforming an audio mixture signal data structure x(t) representing an audio mixture signal having a specified component and a background component into a data structure corresponding to the specified component and a data structure corresponding to the background component, the system including a spectrogram computation module configured to apply a time-frequency transform to the audio mixture signal data structure x(t) to produce an audio mixture signal spectrogram data structure Vx, and apply a time-frequency transform to the audio guide signal data structure g(t) to produce an audio guide signal spectrogram data structure Vg, a first modeling module configured to model a spectrogram of a specified signal data structure y(t) corresponding to the specified component as a parametric spectrogram data structure {circumflex over (V)}py having a plurality of frames and including, for each of the plurality of frames, a parameter that accounts for a pitch difference between the audio guide signal data structure g(t) and the specified component, a second modeling module configured to model a spectrogram of a background audio signal data structure z(t) corresponding to the background component as a parametric spectrogram data structure {circumflex over (V)}pz, an estimation module configured to produce a temporary specified signal spectrogram data structure Viy by estimating values for the parameters of the model parametric spectrogram data structure {circumflex over (V)}py, and produce a temporary background audio signal spectrogram data structure Viz by estimating values for parameters of the model parametric spectrogram data structure {circumflex over (V)}pz, and a filtering module configured to filter an audio mixture signal CQT data structure Vx using the temporary specified signal spectrogram data structure Viy and the temporary background signal spectrogram data structure Viz to provide a specific audio signal CQT data structure Vy and an audio background signal data structure CQT Vz, wherein the specified audio signal CQT data structure Vy is the data structure corresponding to the specified component, and wherein the background audio signal CQT data structure Vz is the data structure corresponding to the background component.
The present invention will be described in even greater detail below based on the exemplary figures. The invention is not limited to the exemplary embodiments. All features described and/or illustrated herein can be used alone or combined in different combinations in embodiments of the invention. The features and advantages of various embodiments of the present invention will become apparent by reading the following detailed description with reference to the attached drawings which illustrate the following:
The process 100 transforms an audio mixture signal data structure x(t) by using a guide signal data structure g(t) in order to provide a dialogue signal data structure y(t) and a background signal data structure z(t), all of which are functions of time. In the filtering process depicted in
At 110, the process obtains the guide signal g(t) by, for example, recording a dubbing of the dialogue to which the first component of the mixture signal x(t) corresponds and creating a data structure representing the dubbing at a computer readable medium.
At 115, the process creates a data structure representing a log-frequency spectrogram Vg of the guide signal g(t) at a computer readable medium. The log-frequency spectrogram Vg is defined as the squared modulus of the constant-Q transform (CQT) of the guide signal g(t). In order to avoid any confusion, it is preferable to distinguish non-negative matrices (obtained from the squared modulus of a CQT) and complex matrices (obtained from the CQT directly). In the remainder of this document, the term “spectrogram” denotes a non-negative matrix and the term “constant-Q transform,” or “CQT,” denotes a complex matrix. The process uses an algorithm to facilitate a transform from the time domain to the frequency domain, in such a way that the central frequency fc of each frequency bin is distributed on a logarithmic scale and the quality factor Q of each bin is constant. The quality factor Q of a frequency bin is provided by the equation

$Q = \frac{f_c}{\Delta f}$

where fc is the central frequency of the bin and Δf is the width of the bin.
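By way of illustration only, such a log-frequency spectrogram could be computed with an off-the-shelf CQT implementation; the Python sketch below uses librosa and borrows the frequency range (40 Hz to roughly 16 kHz) and the 48 bins-per-octave resolution from the experiments reported further below, while the file name "guide.wav" is a hypothetical placeholder.

```python
import numpy as np
import librosa

def log_frequency_spectrogram(signal, sr, fmin=40.0, fmax=16000.0, bins_per_octave=48):
    """Return the complex CQT (the "CQT") and its squared modulus (the "spectrogram")."""
    n_bins = int(np.floor(np.log2(fmax / fmin) * bins_per_octave))
    C = librosa.cqt(signal, sr=sr, fmin=fmin, n_bins=n_bins,
                    bins_per_octave=bins_per_octave)  # complex F x T matrix
    V = np.abs(C) ** 2                                # non-negative F x T matrix
    return C, V

# Hypothetical usage with a recorded dubbing g(t) stored in "guide.wav".
g, sr = librosa.load("guide.wav", sr=None, mono=True)
Cg, Vg = log_frequency_spectrogram(g, sr)
```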
At 116, the process creates a data structure representing a spectrogram Vx of the audio mixture signal x(t) at a computer readable medium in the same manner in which the spectrogram Vg of the guide signal g(t) was created at 115.
Assuming that the mixture signal x(t) and the guide signal g(t) have the same duration, the spectrograms Vg and Vx are both F×T matrices, where T corresponds to the total number of frames that subdivide the total duration of the mixture signal x(t) and the guide signal g(t). If the guide signal g(t) and the mixture signal x(t) do not have the same duration, a synchronization matrix S having dimensions T′×T (where T′ is the number of frames of matrix Vg and T is the number of frames of matrix Vx) can be used to perform a time modification on Vg.
In the process of FIG. 1, the spectrogram of the mixture signal Vx is modeled as the sum of an estimated spectrogram of the dialogue signal {circumflex over (V)}y and an estimated spectrogram of the background signal {circumflex over (V)}z:
$V^x \approx \hat{V}^y + \hat{V}^z$   (1)
As the guide signal g(t) is not identical to the dialogue signal y(t), there are differences between the guide signal g(t) and the dialogue component of the mixture signal x(t) that must be modeled in order to account for them in the separation process. A parametric spectrogram {circumflex over (V)}py enables the differences between the spectrogram of the guide signal Vg and the dialogue component of the spectrogram of the mixture signal Vx to be modeled. Determining values for the parameters of the parametric spectrogram {circumflex over (V)}py provides the estimated spectrogram of the dialogue signal {circumflex over (V)}y in equation (1). The parametric spectrogram {circumflex over (V)}py is determined by performing three types of operation on the spectrogram of the guide signal Vg. First, a pitch shift operator is applied in order to account for pitch difference between the guide signal g(t) and the dialogue component of the mixture signal x(t) within a frame. Next, a synchronization operator is applied in order to account for temporal misalignment of frames of the guide signal and corresponding frames of the dialogue component of the mixture signal x(t). Finally, an equalization operator is applied to permit an adjustment that accounts for global spectral differences, or equalization differences, between the guide signal g(t) and the mixture signal x(t). In these operations, all corresponding parameters can be constrained to be non-negative.
At 120, a data structure representing a pitch shift operator P is created at a computer readable medium and applied to the spectrogram Vg to produce a pitch-shifted spectrogram $V^g_{\text{shifted}}$, for which another data structure is created at a computer readable medium. In a time-frequency representation of an audio signal, a pitch modification of a sound corresponds to a simple shift along the frequency axis of the spectrogram, or at least to a simple shift along the frequency axis within a single frame of the spectrogram. The pitch shift operator P is a Φ×T matrix that applies a vertical shift to each frame of the spectrogram of the guide signal Vg. It is worth noting that a frame of a spectrogram corresponds to one sampling period of the time-dependent signal. For spectrograms computed with a CQT, a vertical shift of a frame corresponds to a pitch modification, as previously mentioned herein above. This operation can be written as:
$V^g_{\text{shifted}} = \sum_{\varphi} \left(\downarrow_{\varphi} V^g\right)\, \operatorname{diag}(P_{\varphi,:})$   (2)
where $\downarrow_{\varphi} V^g$ corresponds to a shift of the spectrogram $V^g$ by φ bins downward (i.e. $[\downarrow_{\varphi} V^g]_{f,t} = [V^g]_{f-\varphi,t}$) and $\operatorname{diag}(P_{\varphi,:})$ is the diagonal matrix that has the coefficients of the φth row of P as its main diagonal.
The pitch shift operator P models the difference between the instantaneous pitch of the guide signal g(t) and that of the dialogue component of the mixture signal x(t). In practice, only one pitch shift φ must be retained for each frame t. To achieve this, a selection procedure is applied, as described below.
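Equation (2) could be evaluated with plain NumPy as in the following sketch; the set of candidate shifts passed in `shifts` is not fixed by the present description, and the symmetric range suggested in the docstring is only an assumption.

```python
import numpy as np

def shift_down(V, phi):
    """Shift the rows of V down by phi bins: [shift_down(V, phi)]_{f,t} = V_{f-phi,t},
    with rows pushed outside the frequency range set to zero."""
    out = np.zeros_like(V)
    if phi > 0:
        out[phi:, :] = V[:-phi, :]
    elif phi < 0:
        out[:phi, :] = V[-phi:, :]
    else:
        out[:] = V
    return out

def apply_pitch_shift_operator(Vg, P, shifts):
    """Equation (2): V_shifted = sum_phi shift_down(Vg, phi) @ diag(P[phi, :]).

    Vg     : F x T' guide spectrogram.
    P      : Phi x T' non-negative pitch shift operator (one row per candidate shift).
    shifts : length-Phi sequence of candidate shifts in CQT bins, e.g. an assumed
             symmetric range such as range(-24, 25).
    """
    V_shifted = np.zeros_like(Vg)
    for row, phi in enumerate(shifts):
        # Right-multiplying by diag(P[row, :]) scales every frame t by P[row, t].
        V_shifted += shift_down(Vg, phi) * P[row, :][np.newaxis, :]
    return V_shifted
```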
At 130, a data structure is created at a computer readable medium for a synchronization operator S, which is applied to the pitch-shifted spectrogram $V^g_{\text{shifted}}$ to produce a pitch-shifted and synchronized spectrogram $V^g_{\text{sync}}$, for which a data structure is also created. The synchronization operator S is a T′×T matrix that models a temporal misalignment of the spectrogram of the guide signal Vg and the dialogue component of the spectrogram of the mixture signal Vx. A time frame of the spectrogram of the mixture signal Vx is modeled as a linear combination of the previous and following frames of the pitch-shifted spectrogram $V^g_{\text{shifted}}$. This operation can be written as:
$V^g_{\text{sync}} = V^g_{\text{shifted}}\, S$   (3)
where S is a band matrix, i.e. there exists a positive integer w such that, for all pairs of frames $(t_1, t_2)$ with $|t_1 - t_2| > w$, the coefficient $S_{t_1, t_2}$ is zero.
The bandwidth w of the matrix S corresponds to the misalignment tolerance between frames of the guide signal and frames of the dialogue component of the mixture signal. A large value of w allows a large tolerance but at the cost of quality of estimation of the model parameters. Limiting w to a small number of time frames can therefore be advantageous. The correct synchronization can also be optimized with a selection procedure that will be described below.
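A minimal sketch of this band constraint is given below: each column of S receives random non-negative coefficients only within a band of width w around the diagonal. The value w = 3 frames is an illustrative choice, not one prescribed by the description.

```python
import numpy as np

def band_synchronization_matrix(T_guide, T_mix, w=3, rng=None):
    """Random non-negative initialisation of the T' x T synchronization operator S,
    with S[t1, t2] = 0 whenever |t1 - t2| > w (band matrix constraint)."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros((T_guide, T_mix))
    for t2 in range(T_mix):
        lo, hi = max(0, t2 - w), min(T_guide, t2 + w + 1)
        S[lo:hi, t2] = rng.random(hi - lo)   # at most 2w+1 non-zero coefficients per frame
    return S

# Equation (3): V_sync = V_shifted @ S, where V_shifted is F x T' and S is T' x T.
```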
At 140, the process creates data structures representing the parametric spectrogram of the dialogue signal {circumflex over (V)}py and an equalization operator E, which is an F×1 vector, at a computer readable medium. The equalization operator E models global spectral differences, or equalization differences, between the guide signal g(t) and the mixture signal x(t) and is modeled as a global filter on the pitch-shifted and synchronized spectrogram $V^g_{\text{sync}}$, such that the parametric spectrogram of the dialogue signal {circumflex over (V)}py can be modeled as:
$\hat{V}^y_p = \operatorname{diag}(E)\left(\sum_{\varphi} \left(\downarrow_{\varphi} V^g\right)\, \operatorname{diag}(P_{\varphi,:})\right) S$   (4)
where diag(E) is a diagonal matrix which has the coefficients of E as a main diagonal.
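Combining the three operators, equation (4) could be rendered as the short NumPy sketch below; it takes the pitch-shifted spectrogram produced by the earlier sketch as input, and the argument names are illustrative.

```python
import numpy as np

def dialogue_parametric_spectrogram(V_shifted, E, S):
    """Equation (4): V_hat_p^y = diag(E) @ V_shifted @ S.

    V_shifted : F x T' pitch-shifted guide spectrogram (see the pitch-shift sketch above).
    E         : length-F non-negative equalization vector (global filter).
    S         : T' x T non-negative synchronization operator (band matrix).
    """
    V_sync = V_shifted @ S                       # temporal alignment, equation (3)
    return E.reshape(-1, 1) * V_sync             # left multiplication by diag(E)
```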
At 150, as no information on the content of the audio background signal z(t) is available, a parametric spectrogram of the audio background signal Vpz is modeled from a standard NMF, and a data structure representative of Vpz is created and stored at a computer readable medium. In this manner, the spectrogram of the audio background signal {circumflex over (V)}z is parametrically modeled as:
$\hat{V}^z_p = W H$   (5)
where W is an F×R non-negative matrix and H is an R×T non-negative matrix. R is constrained to be far less than F and T. The choice of R is important and application-dependent. Columns of W can be considered as elementary spectral shapes and H can be considered to be a matrix for activation of the elementary spectral shapes over time. At 150, the process also creates data structures for W and H and stores the data structures at a computer readable medium.
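For illustration, the background model of equation (5) could be initialised as follows; the rank R = 20 is an arbitrary example, the description only requiring R to be far smaller than F and T.

```python
import numpy as np

def init_background_nmf(F, T, R=20, rng=None):
    """Random non-negative initialisation of the NMF background model of equation (5):
    W (F x R) holds elementary spectral shapes, H (R x T) their activations over time."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.random((F, R)), rng.random((R, T))

# Equation (5): the parametric background spectrogram is simply the product W @ H.
```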
At 160, the process performs a first estimation of the parameters of model parametric spectrograms {circumflex over (V)}py and {circumflex over (V)}pz and updates the data structures representative of the model parametric spectrograms {circumflex over (V)}py and {circumflex over (V)}pz and of their parameters accordingly. For the first estimation, all parameters can be initialized with random non-negative values. In order to estimate the parameters of the spectrograms {circumflex over (V)}py and {circumflex over (V)}pz, a cost function C, based on an element-wise divergence d, is used:
$C = D\left(V^x \mid \hat{V}^y_p + \hat{V}^z_p\right) = \sum_{f,t} d\left(v_{ft} \mid \hat{v}^y_{ft} + \hat{v}^z_{ft}\right)$   (6)
An implementation is herein contemplated in which the Itakura-Saito divergence, well known to those skilled in the art, is used. It is written as:

$d_{IS}(x \mid y) = \frac{x}{y} - \log\frac{x}{y} - 1$
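A compact way to evaluate the cost of equation (6) with this divergence is sketched below; the small constant eps guards against divisions by zero in silent time-frequency bins and is an implementation convenience rather than part of the model.

```python
import numpy as np

def itakura_saito_cost(Vx, V_hat_y, V_hat_z, eps=1e-12):
    """Equation (6) with d = d_IS: C = sum_{f,t} d_IS(v_ft | v_hat^y_ft + v_hat^z_ft),
    where d_IS(x | y) = x / y - log(x / y) - 1."""
    ratio = (Vx + eps) / (V_hat_y + V_hat_z + eps)
    return float(np.sum(ratio - np.log(ratio) - 1.0))
```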
The cost function C is minimized in order to determine the optimal value of each parameter. This minimization is done iteratively, with multiplicative update rules that are successively applied to each parameter of the model spectrograms: W, H, E, S, and P.
The update rules can be derived from the gradient of the cost function C with respect to each parameter. Specifically, the gradient of the cost function with respect to a selected parameter can be written as the difference of two non-negative terms, and the corresponding update rule is then the element-wise multiplication of the selected parameter by the element-wise ratio of both these terms. This ensures that parameters remain non-negative for each update and become constant if the gradient of the cost function with respect to the selected parameter is zero. In this manner, the parameters approach a local minimum of the cost function.
The update rules of the parameters of the parametric spectrogram of the dialogue signal {circumflex over (V)}py can be written:
where ⊙ is an operator that corresponds to an element-wise product between matrices (or vectors); $(\cdot)^{\odot(\cdot)}$ is an operator that corresponds to element-wise exponentiation of a matrix by a scalar; $(\cdot)^T$ is a matrix transposition; and $1_T$ is a T×1 vector with all coefficients equal to 1.
The update rules for W and H are the standard multiplicative update rules for NMF with a cost function based on the Itakura-Saito divergence. For instance, the document by C. Févotte et al., “Nonnegative matrix factorization with the Itakura-Saito divergence, with application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793-830, March 2009, describes such update rules.
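Those standard multiplicative updates for W and H can be sketched as follows. Note that in the full separation model the reconstruction appearing in the ratios would be the sum {circumflex over (V)}py+{circumflex over (V)}pz rather than WH alone; the standalone NMF form is shown here only to illustrate the update principle described above.

```python
import numpy as np

def is_nmf_update(V, W, H, eps=1e-12):
    """One pass of the standard Itakura-Saito multiplicative updates for the model V ~ W @ H.

    Each parameter is multiplied element-wise by the ratio of the two non-negative
    terms of the gradient, so it stays non-negative and stops moving at a stationary point.
    """
    V_hat = W @ H + eps
    W *= ((V * V_hat ** -2) @ H.T) / ((V_hat ** -1) @ H.T + eps)
    V_hat = W @ H + eps
    H *= (W.T @ (V * V_hat ** -2)) / (W.T @ (V_hat ** -1) + eps)
    return W, H
```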
At 170, the process enters a tracking step, particularly, of parameters of the pitch shift operator P. A frame of the spectrogram Vy is modeled (up to an equalization operator and a synchronization operator) as a linear combination of the corresponding frame of the spectrogram Vg pitch-shifted with different pitch shift values. In order to describe a global pitch shift, only one pitch shift must be retained for each frame. The tracking step aims at determining this unique shift value for every frame. To do so, a method of pitch shift tracking in matrix P is used. The Viterbi algorithm, which is well known by those skilled in the art, can be applied to matrix P after the first estimation at 160. For instance, the document J.-L. Durrieu et al, “An iterative approach to monaural musical mixture de-soloing,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 105-108, describes such a tracking algorithm. Then, once the optimal pitch shift is selected for each frame, the coefficients of matrix P that do not correspond to this pitch shift are set to 0 to provide an optimized pitch shift matrix Popt, and a data structure representative of Popt is created at a computer readable medium.
In practice, a small margin can be allowed around the optimal pitch shift, for two reasons. First, pitch shifts are quantized in this process, whereas they are physically continuous. Second, the tracking algorithm may produce small errors. The non-zero area of matrix P is therefore smoothed around the optimal pitch shift value. In alternative implementations, the synchronization matrix S is optimized using a tracking method adapted to the optimization of the parameters of that operator.
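The selection and smoothing of matrix P could look like the sketch below, in which a simple per-frame argmax stands in for the Viterbi-based tracking described above and the two-bin margin is an illustrative tolerance.

```python
import numpy as np

def restrict_pitch_shift_operator(P, margin=2):
    """Keep, for every frame, only the dominant pitch shift plus a small margin around it;
    all other coefficients are set to 0 (they then stay at 0 under multiplicative updates)."""
    Phi, T = P.shape
    P_opt = np.zeros_like(P)
    best = np.argmax(P, axis=0)          # dominant shift index per frame (stand-in for Viterbi)
    for t in range(T):
        lo, hi = max(0, best[t] - margin), min(Phi, best[t] + margin + 1)
        P_opt[lo:hi, t] = P[lo:hi, t]
    return P_opt
```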
At 180, the process performs a second estimation of the parametric spectrograms {circumflex over (V)}py and {circumflex over (V)}pz. The second estimation is similar to the estimation performed at 160 but instead of initializing the operators with random values, the operators are initialized with the values obtained from the first estimate optimization at 170. It is worth noting that, since update rules are multiplicative, coefficients of P (and of S) initialized to 0 will remain 0 during the second estimation. At 190, temporary spectrograms {circumflex over (V)}iy and {circumflex over (V)}iz are computed with the parameter values obtained from the second estimation at 180.
At 200, separation is performed by means of Wiener filtering on the CQT of the mixture signal Vx using the temporary spectrograms {circumflex over (V)}iy and {circumflex over (V)}iz. This way, one obtains the CQT of the estimated dialogue signal Vy and the CQT of the estimated audio background signal Vz. At 205 and 206, the estimated dialogue signal y(t) and the estimated background signal z(t) are obtained from the CQT of the estimated dialogue signal Vy and the CQT of the estimated audio background signal Vz, respectively, using a transform that is the inverse of the transform used at 115 and 116.
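One possible rendering of the Wiener filtering and inverse transform of 200 to 206 is sketched below; it assumes the librosa CQT/inverse-CQT pair used in the earlier spectrogram sketch, which is only one possible choice of transform, and the variable names are illustrative.

```python
import numpy as np
import librosa

def wiener_separation(Cx, Vy_temp, Vz_temp, sr, fmin=40.0, bins_per_octave=48, eps=1e-12):
    """Split the complex mixture CQT Cx into dialogue and background estimates,
    then invert both back to time-domain signals."""
    mask_y = Vy_temp / (Vy_temp + Vz_temp + eps)   # soft mask built from the temporary spectrograms
    Cy = mask_y * Cx                               # CQT of the estimated dialogue signal
    Cz = (1.0 - mask_y) * Cx                       # CQT of the estimated background signal
    y = librosa.icqt(Cy, sr=sr, fmin=fmin, bins_per_octave=bins_per_octave)
    z = librosa.icqt(Cz, sr=sr, fmin=fmin, bins_per_octave=bins_per_octave)
    return y, z
```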
In the embodiment depicted in
The central server 12 includes means of executing computations, e.g. one or more processors, and computer readable media, e.g. non-volatile memory. The computer readable media can store processor executable instructions for performing the process 100 depicted in
The server 12 also includes a first modeling module 30 configured to obtain (in a manner such as that described in connection with elements 120, 130, and 140 of FIG. 1), from the spectrogram data structure Vg, a parametric spectrogram data structure {circumflex over (V)}py that models the spectrogram of the dialogue signal. The first modeling module 30 includes a pitch-shift modeling sub-module 32 configured to model a pitch shift operator P (in a manner such as that described in connection with element 120 of
In addition, the central server 12 includes an estimation module 50 configured to estimate the parameters of the parametric spectrogram data structures {circumflex over (V)}py and {circumflex over (V)}pz using the spectrogram data structure Vx. The estimation module 50 is configured to perform a first estimation (in a manner such as that described in connection with element 160 of
The central server 12 further includes a tracking module 60 configured to perform a tracking step, such as that described in connection with element 170 of
Furthermore, the central server 12 includes a filtering module 70 configured to implement Wiener filtering for determining the spectrogram data structure {circumflex over (V)}y of the dialogue signal data structure y(t) and the spectrogram data structure {circumflex over (V)}z of the background signal data structure z(t) from the optimized parameters in a manner such as that described in connection with element 200 of the process described by
The computer environment includes a computer 300, which includes a central processing unit (CPU) 310, a system memory 320, and a system bus 330. The system memory 320 includes both read only memory (ROM) 340 and random access memory (RAM) 350. The ROM 340 stores a basic input/output system (BIOS) 360, which contains the basic routines that assist in the exchange of information between elements within the computer, for example, during start-up. The RAM 350 stores a variety of information including an operating system 370, application programs 380, other programs 390, and program data 400. The computer 300 further includes secondary storage drives 410A, 410B, and 410C, which read from and write to secondary storage media 420A, 420B, and 420C, respectively. The secondary storage media 420A, 420B, and 420C may include but are not limited to flash memory, one or more hard disks, one or more magnetic disks, one or more optical disks (e.g. CDs, DVDs, and Blu-Ray discs), and various other forms of computer readable media. Similarly, the secondary storage drives 410A, 410B, and 410C may include solid state drives (SSDs), hard disk drives (HDDs), magnetic disk drives, and optical disk drives. In some implementations, the secondary storage media 420A, 420B, and 420C may store a portion of the operating system 370, the application programs 380, the other programs 390, and the program data 400.
The system bus 330 couples various system components, including the system memory 320, to the CPU 310. The system bus 330 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system bus 330 connects to the secondary storage drives 410A, 410B, and 410C via secondary storage drive interfaces 430A, 430B, and 430C, respectively. The secondary storage drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, programs, and other data for the computer 300.
A user may enter commands and information into the computer 300 through user interface device 440. User interface device 440 may be but is not limited to any of a microphone, a touch screen, a touchpad, a keyboard, and a pointing device, e.g. a mouse or a joystick. User interface device 440 is connected to the CPU 310 through port 450. The port 450 may be but is not limited to any of a serial port, a parallel port, a universal serial bus (USB), a 1394 bus, and a game port. The computer 300 may output various signals through a variety of different components. For example, in
The computer 300 may operate in a networked environment by utilizing connections to one or more devices within a network 500, including another computer, a server, a network PC, a peer device, or other network node. These devices typically include many or all of the components found in the example computer 300. For example, the example computer 300 depicted in
Comparative tests were performed to compare separation results of the process described in
A database of soundtracks was made for the testing. Soundtracks in the database were constructed by mixing a soundtrack containing dialogue (in English) with a soundtrack containing only music and audio effects. This way, the contribution of each component of the mixture signal is known exactly. The database can be, e.g., made of ten such soundtracks. In order to obtain a guide signal, each soundtrack was dubbed using the corresponding mixture signal as a time reference. All dubbings were recorded by the same male native English speaker.
The guide signals were used for both the process of the present invention and the second known process. Spectrograms were computed using a CQT with a minimum frequency fmin=40 Hz, a maximum frequency fmax=16000 Hz, and 48 bins per octave. In order to quantify the results obtained for each known process and for the process of the present invention, standard source separation metrics were computed. These metrics are the signal to distortion ratio (SDR), the signal to artifact ratio (SAR) and the signal to interference ratio (SIR).
Results are presented in
The differences between the third known process and the process of the present invention are less clear: differences in terms of SDR are not significant. Results in terms of SAR and SIR are roughly the opposite for the dialogue extraction task and the dialogue removal task. However, other qualitative metrics indicate an advantage for the process of the present invention. A blind listening test based on the MUSHRA protocol was performed, and listeners were asked to rate the “usability” of each sound for the dialogue extraction task, only for the results of the third known process and the process of the present invention. The results of the process of the present invention were globally preferred by the listeners. Moreover, it is worth noting that the process of the present invention does not require the tedious and costly pitch annotation required by the third known process.
As an alternative, other systems can implement the process of the present invention. The present implementation illustrates the particular case of the separation of a dialogue from a mixture signal by adapting the spectrogram of the guide signal in pitch, in synchronization, and in equalization, with an NMF method. However, the present process does not use a model that is specific to speech for the guide signal. The model used is generic and can thus be applied to broad classes of audio signals.
Consequently, the present process is also adapted to the separation from a mixture signal of any kind of specific contribution for which the user has at his disposal an audio guide signal. Such a guide signal can be another recording of the specific audio component of the mixture signal that can contain pitch differences, time misalignment and equalization differences. The present invention can model these differences and compensate for them during the separation process.
This way, instead of a voice, the specific contribution can also be the sound of a specific instrument in a music signal that mixes several instruments. The contribution of this specific instrument is played again and recorded to be used as a guide signal. Alternatively, the specific contribution can be a recording of the music that was used to create the soundtrack of an old movie. Such a recording generally has small playback speed differences (which imply both pitch differences and misalignment) and equalization differences with respect to the music component of the original soundtrack, caused by old analog recording devices. This recording can be used as a guide signal in the present process, in order to extract both the dialogue and the audio effects. A person skilled in the art will understand that the process of the document by L. Le Magoarou et al. does not permit the last two applications.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. It will be understood that changes and modifications may be made by those of ordinary skill within the scope of the following claims. In particular, the present invention covers further embodiments with any combination of features from different embodiments described above and below.
The terms used in the claims should be construed to have the broadest reasonable interpretation consistent with the foregoing description. For example, the use of the article “a” or “the” in introducing an element should not be interpreted as being exclusive of a plurality of elements. Likewise, the recitation of “or” should be interpreted as being inclusive, such that the recitation of “A or B” is not exclusive of “A and B,” unless it is clear from the context or the foregoing description that only one of A and B is intended. Further, the recitation of “at least one of A, B and C” should be interpreted as one or more of a group of elements consisting of A, B and C, and should not be interpreted as requiring at least one of each of the listed elements A, B and C, regardless of whether A, B and C are related as categories or otherwise. Moreover, the recitation of “A, B and/or C” or “at least one of A, B or C” should be interpreted as including any singular entity from the listed elements, e.g., A, any subset from the listed elements, e.g., A and B, or the entire list of elements A, B and C.
Acts and operations described herein can include the execution of microcoded instructions as well as the use of sequential logic circuits to transform data or to maintain it at locations in the memory system of the computer or in the memory systems of a distributed computing environment. Programs executing on a computer system or being executed by parts of a CPU can also perform acts and operations described herein. A “program” is any instruction or set of instructions that can execute on a computer, including a process, procedure, function, executable code, dynamic-linked library (DLL), applet, native instruction, engine, thread, or the like. A program, as contemplated herein, can also include a commercial software application or product, which may itself include several programs. However, while the invention can be described in the context of software, that context is not meant to be limiting. Those of skill in the art will appreciate that various acts and operations described herein can also be implemented in hardware.
Foreign Application Priority Data: 13 61792, Nov. 2013, FR (national).
Other Publications:

Virtanen, Tuomas, “Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, March 2007, pp. 1066-1074.

Smaragdis, Paris, et al., “Separation by ‘Humming’: User-Guided Sound Extraction from Monophonic Mixtures,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y., October 2009 (4 pages).

Le Magoarou, L., et al., “Text-Informed Audio Source Separation Using Nonnegative Matrix Partial Co-Factorization,” 2013 IEEE International Workshop on Machine Learning for Signal Processing, Southampton, UK, September 2013 (7 pages).

Févotte, C., et al., “Nonnegative Matrix Factorization with the Itakura-Saito Divergence, with Application to Music Analysis,” Neural Computation, vol. 21, no. 3, March 2009, pp. 793-830.

Durrieu, Jean-Louis, et al., “An Iterative Approach to Monaural Musical Mixture De-soloing,” International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 2009, pp. 105-108.
Publication: US 2015/0149183 A1, May 2015.