This application claims the benefit, under 35 U.S.C. §365 of International Application PCT/EP2014/061576, filed 4 Jun. 2014, which was published in accordance with PCT Article 21(2) on 11 Dec. 2014 under number WO2014/195359 in the English language and which claims the benefit of European patent application No. 13305757.0, filed 5 Jun. 2013.
The present disclosure generally relates to audio source separation for a wide range of applications such as audio enhancement, speech recognition, robotics, and post-production.
In a real-world situation, audio signals such as speech are perceived against a background of other audio signals with different characteristics. While humans are able to listen to and isolate individual speech in a complex acoustic mixture in order to follow one of several simultaneous discussions (known as the "cocktail party problem", where a number of people are talking simultaneously in a room, as at a cocktail party), audio source separation remains a challenging topic for machine implementation. Audio source separation, which aims to estimate the individual sources in a target comprising a plurality of sources, is an emerging research topic due to its potential applications to audio signal processing, e.g., automatic music transcription and speech recognition. A practical usage scenario is the separation of speech from a mixture of background music and effects, such as in a film or TV soundtrack. According to the prior art, such separation is guided by a 'guide sound', for example produced by a user humming a target sound marked for separation. Yet another prior-art method proposes the use of a musical score to guide the source separation of music in an audio mixture. According to the latter method, the musical score is synthesized, and the resulting audio signal is then used as a guide source that relates to a source in the mixture. However, it would be desirable to be able to take other sources of information into account for generating the guide audio source, such as textual information about a speech source that appears in the mixture.
The present disclosure tries to alleviate some of the inconveniences of prior-art solutions.
In the following, the terms 'audio signal', 'audio mix' and 'audio mixture' are used. They designate a mixture comprising several audio sources, among which is at least one speech component mixed with the other audio sources. Though the term 'audio' is used, the mixture can be any mixture comprising audio, such as a video mixed with audio.
The present disclosure aims at alleviating some of the inconveniences of the prior art by taking into account auxiliary information, such as text and/or a speech example, to guide the source separation.
To this end, the disclosure describes a method of audio source separation from an audio signal comprising a mix of a background component and a speech component, comprising a step of producing a speech example relating to a speech component in the audio signal; a step of estimating a first set of characteristics of the audio signal and of estimating a second set of characteristics of the produced speech example; and a step of obtaining an estimated speech component and an estimated background component of the audio signal by separation of the speech component from the audio signal through filtering of the audio signal using the first and the second set of estimated characteristics.
According to a variant embodiment of the method of audio source separation, the speech example is produced by a speech synthesizer.
According to a variant embodiment of the method, the speech synthesizer receives as input subtitles that are related to the audio signal.
According to a variant embodiment of the method, the speech synthesizer receives as input at least a part of a movie script related to the audio signal.
According to a variant embodiment of the method of audio source separation, the method further comprises a step of dividing the audio signal and the speech example into blocks, each block representing a spectral characteristic of the audio signal and of the speech example.
According to a variant embodiment of the method of audio source separation, the characteristics are at least one of:
tessitura;
prosody;
dictionary built from phonemes;
phoneme order;
recording conditions.
The disclosure also concerns a device for separating an audio source from an audio signal comprising a mix of a background component and a speech component, comprising the following means: a speech example producing means for producing a speech example relating to a speech component in said audio signal; a characteristics estimation means for estimating a first set of characteristics of the audio signal and a second set of characteristics of the produced speech example; and a separation means for separating the speech component of the audio signal by filtering of the audio signal using the characteristics estimated by the characteristics estimation means, to obtain an estimated speech component and an estimated background component of the audio signal.
According to a variant embodiment of the device according to the disclosure, the device further comprises division means for dividing the audio signal and the speech example into blocks, where each block represents a spectral characteristic of the audio signal and of the speech example.
More advantages of the disclosure will appear through the description of particular, non-restricting embodiments of the disclosure.
The embodiments will be described with reference to the following figures:
One of the objectives of the present disclosure is the separation of speech signals from background audio in single-channel or multichannel mixtures such as a movie audio track. For simplicity of explanation of the features of the present disclosure, the description hereafter concentrates on the single-channel case. The skilled person can easily extend the algorithm to the multichannel case, where a spatial model accounting for the spatial locations of the sources is added. The background audio component of the mixture comprises for example music, background speech, background noise, etc. The disclosure presents a workflow and an example algorithm where available textual information associated with the speech signal comprised in the mixture is used as auxiliary information to guide the source separation. Given the associated textual information, a sound that mimics the speech in the mixture (hereinafter referred to as the "speech example") is generated via, for example, a speech synthesizer or a human speaker. The mimicked sound is then time-synchronized with the mixture and incorporated in an NMF (Non-negative Matrix Factorization) based source separation system. State-of-the-art source separation has been briefly discussed above. Many approaches use a PLCA (Probabilistic Latent Component Analysis) modeling framework or a Gaussian Mixture Model (GMM), which are, however, less flexible than the NMF model for investigating the deeper structure of a sound source. The prior art also considers manual annotation of source activity, i.e. indicating when each source is active in a given time-frequency region of a spectrum. However, such manual annotation is difficult and time-consuming.
The disclosure also concerns a new NMF-based signal modeling technique, referred to as Non-negative Matrix Partial Co-Factorization or NMPCF, that can handle the structure of the audio sources and the recording conditions. A corresponding parameter estimation algorithm that jointly handles the audio mixture and the generated guide source (the speech example) is also disclosed.
The blocks are matrices comprising information about the audio signal, each matrix (or block) containing information about a specific characteristic of the audio signal, e.g. intonation, tessitura, or phoneme spectral envelopes. Each block models one spectral characteristic of the signal. These blocks are then estimated jointly in the so-called NMPCF framework described in the disclosure. Once estimated, they are used to compute the estimated sources.
From the combination of both sets of characteristics, the time-frequency variations between the speech example and the speech component in the audio mixture can be modeled.
In the following, a model is introduced in which the speech example shares linguistic characteristics with the audio mixture, such as the tessitura, the dictionary of phonemes and the phoneme order. The speech example is related to the mixture so that it can serve as a guide during the separation process. In a second step 31, the characteristics are jointly estimated through a combination of NMF and source-filter modeling on the spectrograms. In a third step 32, the source separation is performed using the characteristics obtained in the second step, thereby obtaining the estimated speech and the estimated background, classically through Wiener filtering.
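Both the joint estimation of step 31 and the filtering of step 32 operate on time-frequency representations of the two signals. As a minimal, purely illustrative sketch (not part of the original disclosure; the use of scipy and the window parameters are assumptions), the input power spectrograms V_X (mixture) and V_Y (speech example) could be computed as follows:

```python
import numpy as np
from scipy.signal import stft

def power_spectrogram(x, fs, n_fft=1024, hop=256):
    """Return the power spectrogram |STFT|^2 and the complex STFT of a signal."""
    _, _, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(X) ** 2, X

# V_X, X = power_spectrogram(mixture, fs)         # audio mixture
# V_Y, Y = power_spectrogram(speech_example, fs)  # synthesized or spoken speech example
```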
The previously discussed characteristics can be translated into mathematical terms by using an excitation-filter model of speech production combined with an NMPCF model, as described hereunder.
The excitation part of this model represents the tessitura and the prosody of speech such that:
The filter part of the excitation-filter model of speech production represents the dictionary of phonemes and their temporal distribution such that:
For the recording conditions 403 and 411, a stationary filter is used, denoted w_Y 411 for the speech example and w_S 403 for the audio mixture.
The background in the audio mixture is modeled by a matrix W_B 405, a dictionary of background spectral shapes, and the corresponding matrix H_B 406 representing the temporal activations.
Finally, the temporal mismatch 402 between the speech example and the speech part of the mixture is modeled by a matrix D (that can be seen as a Dynamic Time Warping (DTW) matrix).
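The disclosure does not detail how D is obtained; as a purely illustrative sketch, a binary alignment matrix could for instance be computed from a Dynamic Time Warping path over frame-wise features of the two signals, e.g. to initialize D before the joint estimation described below (the feature choice and the function name are hypothetical):

```python
import numpy as np

def dtw_sync_matrix(feat_example, feat_mix):
    """Binary alignment matrix D (example frames x mixture frames) from a DTW path.

    feat_example: (d, N) features of the speech example; feat_mix: (d, M) features of the
    mixture. Right-multiplying example activations by D warps them onto the mixture time axis.
    """
    N, M = feat_example.shape[1], feat_mix.shape[1]
    # Frame-to-frame Euclidean cost matrix
    cost = np.linalg.norm(feat_example[:, :, None] - feat_mix[:, None, :], axis=0)
    # Accumulated cost with the classical match / insertion / deletion recursion
    acc = np.full((N, M), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                       acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    # Backtrack the optimal path and mark it in D
    D = np.zeros((N, M))
    i, j = N - 1, M - 1
    D[i, j] = 1.0
    while i > 0 or j > 0:
        steps = [(a, b) for a, b in ((i - 1, j - 1), (i - 1, j), (i, j - 1)) if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda ab: acc[ab])
        D[i, j] = 1.0
    return D
```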
The two parts of the excitation-filter model of speech production can then be summarized by these two equations:
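The two equations themselves are not reproduced in the text. A plausible reconstruction, consistent with the factors described above, is sketched below; it assumes a dictionary W^E of excitation (pitch) spectral shapes representing the tessitura, shared between the speech example and the mixture, which is not explicitly named in the text:

```latex
\hat{V}_Y \approx \left( W^{E} H_Y^{E} \right) \odot \left( W_Y^{\phi} H_Y^{\phi} \right) \odot \left( w_Y \, i^{\top} \right)
\hat{V}_X \approx \left( W^{E} H_S^{E} \right) \odot \left( W_Y^{\phi} H_Y^{\phi} D \right) \odot \left( w_S \, i^{\top} \right) + W_B H_B
```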
where ⊙ denotes the entry-wise (Hadamard) product and i is a column vector whose entries are all one, reflecting that the recording condition is unchanged over time.
Parameter estimation can be derived according to either Multiplicative Update (MU) or Expectation Maximization (EM) algorithms. The example embodiment described hereafter is based on a derived MU parameter estimation algorithm in which the Itakura-Saito divergence between the spectrograms V_Y and V_X and their estimates V̂_Y and V̂_X is minimized (in order to get the best approximation of the characteristics), the quantity to be minimized being the so-called cost function (CF):
CF = d_IS(V_Y | V̂_Y) + d_IS(V_X | V̂_X)
where d_IS(V | V̂) = sum over all frequency bins f and time frames n of ( V_fn / V̂_fn - log( V_fn / V̂_fn ) - 1 ) is the Itakura-Saito ("IS") divergence.
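As a minimal numpy sketch (illustrative only, not part of the original disclosure), the cost function CF can be evaluated as follows; the small epsilon is an assumption added for numerical safety:

```python
import numpy as np

def is_divergence(V, V_hat, eps=1e-12):
    """Itakura-Saito divergence d_IS(V | V_hat), summed over all time-frequency bins."""
    R = (V + eps) / (V_hat + eps)
    return np.sum(R - np.log(R) - 1.0)

# CF = is_divergence(V_Y, V_Y_hat) + is_divergence(V_X, V_X_hat)
```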
Note that a possible constraint over the matrices W_Y^φ, w_Y and w_S can be set to allow only smooth spectral shapes in these matrices. This constraint takes the form of a factorization of the matrices by a matrix P that contains elementary smooth shapes (blobs), such that:
W_Y^φ = P E^φ, w_Y = P e_Y, w_S = P e_S
where P is a matrix of frequency blobs, and E^φ, e_Y and e_S are the encodings used to construct W_Y^φ, w_Y and w_S, respectively.
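The construction of P is not detailed in the disclosure; as a hedged illustration, P could for instance stack Gaussian-shaped frequency blobs regularly spaced along the frequency axis, so that any non-negative combination P E is a smooth spectral shape (the blob shape and spacing chosen below are assumptions):

```python
import numpy as np

def blob_matrix(n_freq, n_blobs, width=None):
    """Matrix P (n_freq x n_blobs) of smooth, Gaussian-shaped frequency blobs.

    Any non-negative combination P @ E of these columns is a smooth spectral envelope,
    which is one way to enforce the smoothness constraint on W_Y^phi, w_Y and w_S.
    """
    centers = np.linspace(0, n_freq - 1, n_blobs)
    width = width or (n_freq / n_blobs)
    f = np.arange(n_freq)[:, None]
    return np.exp(-0.5 * ((f - centers[None, :]) / width) ** 2)
```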
In order to minimize the cost function CF, its gradient is set to zero. To do so, the gradient is computed with respect to each parameter, and the derived multiplicative update (MU) rules are as follows.
To obtain the prosody characteristic 410 H_Y^E for the speech example:
To obtain the prosody characteristic 404 H_S^E for the audio mix:
To obtain the dictionary of phonemes W_Y^φ = P E^φ:
To obtain the characteristic 409 of the temporal distribution of phonemes H_Y^φ of the speech example:
To obtain characteristic D 402, the matrix of synchronization between the speech example and the audio mix:
To obtain the example channel filter w_Y = P e_Y:
To obtain the mixture channel filter w_S = P e_S:
To obtain characteristic H_B 406, representing the temporal activations of the background in the audio mix:
To obtain characteristic W_B 405, the dictionary of spectral shapes of the background in the audio mix:
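The individual update formulas are not reproduced here. For intuition only, the sketch below shows the classical multiplicative updates under the Itakura-Saito divergence for the background factors W_B and H_B of a plain model V ≈ W_B H_B; these are the standard IS-NMF rules, not the exact updates of the full joint model of this disclosure:

```python
import numpy as np

def mu_update_background(V, W, H, eps=1e-12):
    """One classical IS-NMF multiplicative update of the background factors W_B and H_B."""
    V_hat = W @ H + eps
    H *= (W.T @ (V * V_hat ** -2)) / (W.T @ V_hat ** -1 + eps)
    V_hat = W @ H + eps
    W *= ((V * V_hat ** -2) @ H.T) / (V_hat ** -1 @ H.T + eps)
    return W, H
```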
Then, once the model parameters are estimated (i.e. via the above mentioned equations), the STFT of the speech component in the audio mix can be reconstructed in the reconstruction function 44 via well-known Wiener filtering:

Ŝ_ij = ( V̂_S,ij / ( V̂_S,ij + V̂_B,ij ) ) X_ij

where A_ij is the entry value of matrix A at row i and column j, X is the STFT of the mixture, V̂_S is the speech-related part of V̂_X and V̂_B is its background-related part, thereby obtaining the estimated speech component 201. The STFT of the estimated background audio component 202 is then obtained by:

B̂_ij = X_ij - Ŝ_ij
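A minimal, purely illustrative numpy sketch of this reconstruction step, where V_S_hat and V_B_hat denote the estimated speech and background power spectrograms, X the complex STFT of the mixture, and the scipy inverse STFT call and window parameters are assumptions:

```python
import numpy as np
from scipy.signal import istft

def wiener_reconstruct(X, V_S_hat, V_B_hat, fs, n_fft=1024, hop=256, eps=1e-12):
    """Wiener-filter the mixture STFT into speech and background STFTs, then invert them."""
    mask = V_S_hat / (V_S_hat + V_B_hat + eps)
    S_hat = mask * X          # estimated speech STFT (component 201)
    B_hat = X - S_hat         # estimated background STFT (component 202)
    _, speech = istft(S_hat, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, background = istft(B_hat, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return speech, background
```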
A program for estimating the parameters can have the following structure:
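The original listing is not reproduced here. A purely illustrative skeleton of such a program is sketched below; it assumes the factors introduced above stored in a dictionary f (the keys W_E, H_Y_E, H_S_E, E_phi, H_Y_phi, D, e_Y, e_S, W_B, H_B and P are hypothetical names), the model composition sketched after the two equations above, and leaves the actual multiplicative update of each factor as a placeholder:

```python
import numpy as np

def _is_div(V, V_hat, eps=1e-12):
    """Itakura-Saito divergence, summed over all bins (one term of the cost function CF)."""
    R = (V + eps) / (V_hat + eps)
    return np.sum(R - np.log(R) - 1.0)

def model_spectrograms(f):
    """Compose V_Y_hat and V_X_hat from the current factors (cf. the equations sketched above)."""
    n_y = f["H_Y_phi"].shape[1]                        # frames of the speech example
    n_x = f["H_B"].shape[1]                            # frames of the mixture
    W_Y_phi = f["P"] @ f["E_phi"]                      # smooth dictionary of phoneme envelopes
    w_Y, w_S = f["P"] @ f["e_Y"], f["P"] @ f["e_S"]    # stationary channel filters (F x 1)
    V_Y_hat = (f["W_E"] @ f["H_Y_E"]) * (W_Y_phi @ f["H_Y_phi"]) * (w_Y @ np.ones((1, n_y)))
    V_X_hat = ((f["W_E"] @ f["H_S_E"]) * (W_Y_phi @ f["H_Y_phi"] @ f["D"]) * (w_S @ np.ones((1, n_x)))
               + f["W_B"] @ f["H_B"])
    return V_Y_hat, V_X_hat

def estimate_parameters(V_X, V_Y, f, n_iter=100):
    """Skeleton of the estimation loop: alternately apply the MU rules listed above."""
    for _ in range(n_iter):
        for name in ("H_Y_E", "H_S_E", "E_phi", "H_Y_phi", "D", "e_Y", "e_S", "H_B", "W_B"):
            pass  # the MU rule for factor `name`, derived from CF, would be applied here
        V_Y_hat, V_X_hat = model_spectrograms(f)
        cf = _is_div(V_Y, V_Y_hat) + _is_div(V_X, V_X_hat)  # monitor convergence of CF
    return f, V_Y_hat, V_X_hat
```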
As will be appreciated by one skilled in the art, aspects of the present principles can be embodied as a system, method or computer readable medium. Accordingly, aspects of the present principles can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code and so forth), or an embodiment combining hardware and software aspects, all of which can generally be referred to herein as a "circuit", "module" or "system". Furthermore, aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) can be utilized.
Thus, for example, it will be appreciated by those skilled in the art that the diagrams presented herein represent conceptual views of illustrative system components and/or circuitry embodying the principles of the present disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable storage media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
A computer readable storage medium can take the form of a computer readable program product embodied in one or more computer readable medium(s) and having computer readable program code embodied thereon that is executable by a computer. A computer readable storage medium as used herein is considered a non-transitory storage medium given the inherent capability to store the information therein as well as the inherent capability to provide retrieval of the information therefrom. A computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. It is to be appreciated that the following, while providing more specific examples of computer readable storage mediums to which the present principles can be applied, is merely an illustrative and not exhaustive listing as is readily appreciated by one of ordinary skill in the art: a portable computer diskette; a hard disk; a read-only memory (ROM); an erasable programmable read-only memory (EPROM or Flash memory); a portable compact disc read-only memory (CD-ROM); an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.