The present disclosure generally relates to audio source separation and in particular to separation of a singing voice from a mixture comprising a singing voice component and an accompaniment component.
Audio source separation allows separating individual sound sources from a noisy mixture. It is applied in audio/music signal processing and in audio/video post-production. A practical application is separating desired speech from background music and audible effects in the audio mix track of a movie or TV series for audio dubbing. Another practical application is extracting a voice from a noisy recording to help a speech recognition system or a robotic application, or isolating a singing voice from the accompaniment in a music mixture that comprises both, for audio remastering purposes or for karaoke-type applications.

Non-negative Matrix Factorization (NMF) is a well-known technique for audio source separation and has been successfully applied to various source separation systems in a human-supervised manner. In NMF-based source separation algorithms, a matrix V corresponding to the power spectrum of an audio signal (the matrix rows representing frequency indexes and the matrix columns representing time frame indexes) is decomposed into the product of a matrix W containing a spectral basis and a time activation matrix H describing when each basis spectrum is active. In the single-channel case, i.e. when only one audio track is used to separate several sources, the source spectral basis W is usually pre-learned from training segments for the different sources in the mixture and then used in a testing phase to separate the corresponding sources from the mixture. The training segments are chosen from an available (different) dataset, hummed, or specified manually through human intervention. In NMF-based source separation algorithms the model parameters (W, H) for each source are estimated; these model parameters W and H are then used to separate the sources, so a good estimation improves the source separation result.

The present disclosure tries to alleviate some of the inconveniences of prior solutions by using additional information to guide the source separation process.
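As a minimal, generic illustration of such an NMF decomposition (a textbook multiplicative-update routine for the Euclidean cost; matrix sizes, seed and iteration count are arbitrary assumptions, not values from the present disclosure):

```python
import numpy as np

def nmf_euclidean(V, K, n_iter=200, eps=1e-12):
    """Factorize a nonnegative F-by-N matrix V into W (F-by-K) and H (K-by-N)
    using the classic multiplicative updates for the Euclidean cost."""
    F, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps   # spectral basis
    H = rng.random((K, N)) + eps   # time activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update spectral basis
    return W, H
```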
In the following, the wording 'audio mix' or 'audio mixture' is used. The wording indicates a mixture comprising several audio sources mixed together, among which at least one desired audio source is to be separated. By "sources" is meant the different types of audio signals present in the audio mix, such as speech (human voice, spoken or sung), music (played by different musical instruments), and audible effects (footsteps, a door closing, etc.).
Though the wording ‘audio’ is used, the mixture can be any mixture comprising audio, such as an audio track of a video for example.
The present principles aim at alleviating some of the inconveniences of prior techniques by improving the source separation process through the use of specific auxiliary information that is related to the audio mixture. This auxiliary information comprises both musical score and song lyrics information. One or more guide audio signals are produced from this auxiliary information to guide the source separation. According to a particular, non-limiting embodiment of the present principles, NMF is used as the core of the source separation processing model.
To this end, the present principles comprise a method of audio separation from an audio mixture comprising a singing voice component and an accompaniment component, the method comprising: receiving the audio mixture; receiving symbolic digital musical score information of the singing voice in the received audio mixture; receiving symbolic digital lyrics information of the singing voice in the received audio mixture; determining at least one audio signal from both the received symbolic digital musical score information and the symbolic digital lyrics information; determining characteristics of the received audio mixture and of the at least one audio signal through nonnegative matrix factorization; and determining an estimated singing voice and an estimated accompaniment by applying a filtering of the audio mixture using the determined characteristics.
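Purely as a structural sketch of these steps (the helper names below are hypothetical placeholders, not an API defined by the disclosure; a real system would plug in an actual synthesizer and the NMF machinery detailed further on):

```python
# Hypothetical placeholders for the three claimed stages.
def synthesize_guides(score, lyrics):
    raise NotImplementedError("singing-voice, or speech plus score, synthesis")

def estimate_nmf_parameters(mixture, guides):
    raise NotImplementedError("joint NMF estimation of the characteristics")

def wiener_filter(mixture, theta):
    raise NotImplementedError("filtering with the estimated characteristics")

def separate_singing_voice(mixture, score, lyrics):
    guides = synthesize_guides(score, lyrics)          # step 1: guide signal(s)
    theta = estimate_nmf_parameters(mixture, guides)   # step 2: characteristics
    return wiener_filter(mixture, theta)               # step 3: voice + accompaniment
```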
According to a variant embodiment of the method of audio separation, the at least one audio signal is a single audio signal produced by a singing voice synthesizer from the received symbolic digital musical score information and from the received symbolic digital lyrics information.
According to a variant embodiment of the method of audio separation, the at least one audio signal comprises a first audio signal, produced by a speech synthesizer from the symbolic digital lyrics information, and a second audio signal, produced by a musical score synthesizer from the symbolic digital musical score information.
According to a variant embodiment of the method of audio separation, the characteristics of the at least one audio signal comprise at least one of a group comprising: temporal activations of pitches; and temporal activations of phonemes.
According to a variant embodiment of the method of audio separation, the nonnegative matrix factorization is done according to a Multiplicative Update rule.
According to a variant embodiment of the method of audio separation, the nonnegative matrix factorization is done according to Expectation Maximization.
The present principles also relate to a device for separation of a singing voice component and an accompaniment component from an audio mixture, the device comprising: a receiver interface for receiving the audio mixture, for receiving symbolic digital musical score information of the singing voice in the received audio mixture and for receiving symbolic digital lyrics information of the singing voice in the received audio mixture; a processing unit for determining at least one audio signal from both the received symbolic digital musical score information and the symbolic digital lyrics information, and for determining characteristics of the received audio mixture and of the at least one audio signal through nonnegative matrix factorization; and a filter for determining an estimated singing voice and an estimated accompaniment by filtering of the audio mixture using the determined characteristics.
According to a variant embodiment of the device, it further comprises a singing voice synthesizer for producing a single audio signal from the received symbolic digital musical score information and from the received symbolic digital lyrics information.
According to a variant embodiment of the device, it further comprises a speech synthesizer for producing a first audio signal from the symbolic digital lyrics information, and a musical score synthesizer for producing a second audio signal from the symbolic digital musical score information.
More advantages of the present principles will appear through the description of particular, non-restricting embodiments of the present principles.
The embodiments will be described with reference to the following figures:
The width of a time frame 'n' is typically 16 to 64 ms. The frequency bins 'f' typically cover the spectrum of a signal sampled at 16 to 44.1 kHz. The matrix V is then factorized into a basis matrix W (of size F-by-K) and a time activation matrix H (of size K-by-N), where K denotes the number of NMF components, via an NMF model parameter estimation 12, thus obtaining V≈W*H, where * denotes matrix multiplication. This factorization is here described for single-channel mixtures; however, its extension to multichannel mixtures is straightforward. Each column of the matrix W is associated with the spectral basis of an elementary audio component in the mixture. If the mixture contains several sources (e.g. music, speech, background noise), a subset of the elementary components represents each source. As an example, in a mixture comprising music, speech and background noise, let Cm, Cs and Cb be the numbers of elementary components for each source. Then the first Cm columns of W are the spectral bases of the music, the next Cs columns are the spectral bases of the speech, and the remaining Cb columns are for the noise, with K=Cm+Cs+Cb. Each row of H represents the activation of the corresponding spectral basis over time.
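For illustration, assuming scipy is available and a mono input signal, the spectrogram V and the per-source partition of W could be set up as follows (all sizes, counts and STFT parameters are assumptions, not values from the disclosure):

```python
import numpy as np
from scipy.signal import stft

# Build the F-by-N power spectrogram V of a mono signal; a 2048-sample frame
# at 44.1 kHz spans about 46 ms, within the 16-64 ms range mentioned above.
fs = 44100
x = np.random.randn(10 * fs)                  # placeholder 10 s signal
_, _, X = stft(x, fs=fs, nperseg=2048, noverlap=1024)
V = np.abs(X) ** 2                            # power spectrogram, F = 1025

# Partition a (pre-learned) basis W column-wise per source, K = Cm + Cs + Cb.
Cm, Cs, Cb = 20, 10, 5                        # assumed component counts
W = np.random.rand(V.shape[0], Cm + Cs + Cb)  # stands in for learned bases
Wm, Ws, Wb = W[:, :Cm], W[:, Cm:Cm + Cs], W[:, Cm + Cs:]
```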
In order to help estimate the values of the matrices W and H, some guiding information is needed; it is incorporated in an initialization step 12, where the spectral bases of the different sources, represented in W, are learned from training segments in which only a single considered type of source is present. Then the values of the matrices W and H are estimated from the mixture, via either a prior-art Expectation-Maximization (EM) algorithm or a prior-art Multiplicative Update (MU) algorithm, in a step 13. Next, the estimated source STFT coefficients are reconstructed in a step 14 via the well-known Wiener filtering:
$$\hat{S}_{j,fn} = \frac{[W_j H_j]_{fn}}{[W H]_{fn}}\, X_{fn}$$

where $\hat{S}_{j,fn}$ denotes the estimated STFT coefficient of source j at time frame n and frequency bin index f; W_j and H_j are the parts of the matrices W and H corresponding to source j; [W H]_{fn} is the value, at time frame n and frequency bin index f, of the model of the input matrix V; and X_{fn} is the STFT coefficient of the mixture at time frame n and frequency bin index f.
Finally, the time-domain estimated sources are reconstructed by applying the well-known inverse short-time Fourier transform (ISTFT), thereby obtaining separated sources 101 (e.g. the speech component of the audio mixture) and 102 (the background component of the audio mixture).
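A minimal sketch of this reconstruction stage (the Wiener mask of step 14 followed by the ISTFT), assuming scipy is available; the STFT parameters must match those used to build V and are illustrative here:

```python
import numpy as np
from scipy.signal import istft

def reconstruct_source(X, W, H, cols, eps=1e-12, fs=44100):
    """Wiener-style reconstruction of one source from the mixture STFT X.
    `cols` selects the columns of W (and rows of H) belonging to that source."""
    Vj = W[:, cols] @ H[cols, :]      # source spectrogram model [Wj Hj]
    Vtot = W @ H + eps                # full model [W H]
    S = (Vj / Vtot) * X               # apply the soft mask to the mixture STFT
    _, s = istft(S, fs=fs, nperseg=2048, noverlap=1024)
    return s                          # time-domain source estimate
```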
In an NMF parameter estimation, the parameter update rule is derived from the following cost function:
$$D(V \mid WH) = \sum_{f=1}^{F} \sum_{n=1}^{N} d\big([V]_{fn} \,\big|\, [WH]_{fn}\big) \qquad (1)$$
This cost function is to be minimized, so that the product of W and H comes close to V. d(·|·) is a scalar divergence, for which popular choices are the Euclidean distance and the Itakura-Saito (IS) divergence, and [X]_{fn} denotes the entry of a matrix X at frequency bin f and time frame n.
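For concreteness, the following sketch evaluates cost (1) under the IS divergence and performs the standard multiplicative updates that decrease it (a textbook IS-NMF routine under assumed F-by-N shapes, not code from the disclosure):

```python
import numpy as np

def is_cost(V, Vhat, eps=1e-12):
    """Cost (1) with d chosen as the Itakura-Saito divergence:
    d_IS(x|y) = x/y - log(x/y) - 1, summed over all (f, n) entries."""
    R = (V + eps) / (Vhat + eps)
    return float(np.sum(R - np.log(R) - 1.0))

def mu_update_is(V, W, H, eps=1e-12):
    """One round of the standard multiplicative updates for IS-NMF.
    Each factor is multiplied by a ratio of nonnegative terms, which keeps
    W and H nonnegative and decreases the IS cost in practice."""
    Vhat = W @ H + eps
    H *= (W.T @ (V / Vhat**2)) / (W.T @ (1.0 / Vhat) + eps)
    Vhat = W @ H + eps
    W *= ((V / Vhat**2) @ H.T) / ((1.0 / Vhat) @ H.T + eps)
    return W, H
```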
With regard to the model used by the present principles, the power spectrograms V_X of the mixture, V_M of the melodic example and V_L of the lyrics example are modeled as:

$$\hat{V}_X = (W_X^e H_X^e) \odot (W_X^\phi H_X^\phi) \odot (w_X^c\, i_X^T) + W_B H_B$$

$$\hat{V}_M = (W_X^e\, P\, H_X^e D_M) \odot (W_M^\phi H_M^\phi) \odot (w_M^c\, i_M^T)$$

$$\hat{V}_L = (W_L^e H_L^e) \odot (W_X^\phi H_X^\phi D_L) \odot (w_L^c\, i_L^T) \qquad (2)$$
where ⊙ denotes the Hadamard product (in mathematics, the Hadamard product, also known as the Schur product or the entrywise product, is a binary operation that takes two matrices of the same dimensions and produces another matrix in which each element ij is the product of the elements ij of the two original matrices) and i is a column vector whose entries are one when the recording condition is unchanged.
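As a shape-level illustration only (the nonnegativity and the sizes of all factor matrices are assumptions consistent with the parameter definitions that follow, not code from the disclosure), equation (2) can be composed as:

```python
import numpy as np

# Each model is excitation * filter * recording condition (Hadamard products),
# plus the accompaniment term for the mixture; `*` is the Hadamard product
# and `@` the matrix product.
def model_mixture(WXe, HXe, WXphi, HXphi, wXc, iX, WB, HB):
    return (WXe @ HXe) * (WXphi @ HXphi) * np.outer(wXc, iX) + WB @ HB

def model_melodic_example(WXe, P, HXe, DM, WMphi, HMphi, wMc, iM):
    return (WXe @ P @ HXe @ DM) * (WMphi @ HMphi) * np.outer(wMc, iM)

def model_lyrics_example(WLe, HLe, WXphi, HXphi, DL, wLc, iL):
    return (WLe @ HLe) * (WXphi @ HXphi @ DL) * np.outer(wLc, iL)
```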
V is a power spectrogram and V̂ is its model; we recall that the objective is to minimize the distance between the actual spectrogram and its model.
W_X^e, W_L^e, P, i_X, i_M and i_L are parameters that are fixed in advance; H_X^e, H_X^φ and W_X^φ are parameters that are shared between the mixture and the example signals generated according to the auxiliary information, and are to be estimated; the other parameters are not shared and are also to be estimated.

W_X^e is the redundant dictionary of pitches (tessitura) of the singing voice, shared with the melodic example.

P is a permutation matrix allowing a small pitch difference between the singing voice and the melodic example.

H_X^e is the matrix of temporal activations of the pitches for the singing voice, shared with the melodic example.

D_M is a synchronization matrix modeling the temporal mismatch between the singing voice and the melodic example.

W_L^e is the dictionary of pitches (tessitura) of the lyrics example.

H_L^e is the matrix of temporal activations of the pitches for the lyrics example.

W_X^φ is the dictionary of phonemes for the singing voice, shared with the lyrics example.

H_X^φ is the matrix of phoneme temporal activations for the singing voice, shared with the lyrics example.

D_L is a synchronization matrix modeling the temporal mismatch between the singing voice and the lyrics example.

W_M^φ is the dictionary of filters for the melodic example.

H_M^φ is the matrix of filter temporal activations for the melodic example.

w_X^c, w_M^c and w_L^c are the recording condition filters of the mixture, the melodic example and the lyrics example, respectively.

i_X, i_M and i_L are vectors of ones, because the recording conditions are time invariant.

W_B is the dictionary of characteristic spectral shapes for the accompaniment.

H_B is the matrix of temporal activations for the accompaniment.
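Purely as a plausible sketch of how some of the fixed parameters could be instantiated (the disclosure does not specify their construction; the semitone grid, the harmonic-comb shapes and all sizes below are assumptions):

```python
import numpy as np

F, N = 1025, 600                      # assumed spectrogram size
fs, nfft = 44100, 2048
pitches = 440.0 * 2.0 ** (np.arange(-24, 25) / 12.0)  # assumed tessitura grid

# W_X^e: one harmonic comb per pitch (redundant dictionary of pitches).
WXe = np.zeros((F, pitches.size))
freqs = np.arange(F) * fs / nfft
for k, f0 in enumerate(pitches):
    for h in range(1, int((fs / 2) // f0) + 1):
        WXe[np.argmin(np.abs(freqs - h * f0)), k] = 1.0 / h  # 1/h roll-off

# P: near-identity permutation allowing small pitch deviations.
P = np.eye(pitches.size)

# i_X: vector of ones (time-invariant recording condition).
iX = np.ones(N)
```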
To summarize, the parameters to estimate are:
$$\theta = \{H_X^e, D_M, H_L^e, W_X^\phi, H_X^\phi, D_L, W_M^\phi, H_M^\phi, w_X^c, w_M^c, w_L^c, W_B, H_B\} \qquad (3)$$
Estimation of the parameters θ is done by minimization of a cost function that is defined as follows:
$$C(\theta) = \lambda_X\, d_{IS}\big(V_X \,\big|\, \hat{V}_X(\theta)\big) + \lambda_M\, d_{IS}\big(V_M \,\big|\, \hat{V}_M(\theta)\big) + \lambda_L\, d_{IS}\big(V_L \,\big|\, \hat{V}_L(\theta)\big) \qquad (4)$$

where $d_{IS}$ is the Itakura-Saito ("IS") divergence, applied entrywise and summed as in cost (1): $d_{IS}(x \mid y) = x/y - \log(x/y) - 1$.
λ_X, λ_M and λ_L are scalars determining the relative importance of V_X, V_M and V_L during the estimation. The NMF parameter estimation can be carried out according to either the well-known Multiplicative Update (MU) rule or the well-known Expectation-Maximization (EM) algorithm. Once the model is estimated, the separated singing voice and the accompaniment (more precisely, their STFT coefficients) can be reconstructed via the well-known Wiener filtering (X(f,n) being the mixture's STFT):
Estimated singing voice: $\hat{S}(f,n) = \alpha(f,n)\, X(f,n)$

Estimated accompaniment: $\hat{A}(f,n) = (1 - \alpha(f,n))\, X(f,n)$ $\qquad$ (5)

where the Wiener gain α(f,n) is the ratio of the singing voice part of the mixture model to the full model, i.e. $\alpha(f,n) = \big[(W_X^e H_X^e) \odot (W_X^\phi H_X^\phi) \odot (w_X^c i_X^T)\big]_{fn} \big/ [\hat{V}_X]_{fn}$.
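A direct transcription of equation (5), with α computed as the voice-to-total model ratio stated above (array shapes are assumed to be F-by-N, and nonnegativity of all factors is assumed):

```python
import numpy as np

def reconstruct_voice_and_accompaniment(X, WXe, HXe, WXphi, HXphi, wXc, iX,
                                        WB, HB, eps=1e-12):
    """Wiener filtering of equation (5): S = alpha * X, A = (1 - alpha) * X."""
    Vvoice = (WXe @ HXe) * (WXphi @ HXphi) * np.outer(wXc, iX)  # voice model
    Vtotal = Vvoice + WB @ HB + eps                             # full model
    alpha = Vvoice / Vtotal                                     # Wiener gain
    return alpha * X, (1.0 - alpha) * X                         # voice, accomp.
```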
According to the variant embodiment in which a single guide signal is produced by a singing voice synthesizer from both the musical score information and the lyrics information, the power spectrograms V_X of the mixture and V_G of the guide are modeled as:
$$\hat{V}_X = (W_X^e H_X^e) \odot (W_X^\phi H_X^\phi) \odot (w_X^c\, i_X^T) + W_B H_B$$

$$\hat{V}_G = (W_X^e\, P\, H_X^e D_G) \odot (W_X^\phi H_X^\phi D_G) \odot (w_G^c\, i_G^T)$$

where D_G is the synchronization matrix modeling the temporal mismatch between the singing voice and the guide, and w_G^c and i_G play the same roles as above; both the pitch dictionary W_X^e and the phoneme dictionary W_X^φ are now shared between the mixture and the single guide.
This particular embodiment implies the usage of a more sophisticated synthesis system than in the previously described embodiment, since a singing voice synthesizer, producing a sung rendition of the lyrics at the pitches of the score, replaces the separate speech and musical score synthesizers.
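A minimal sketch of this single-guide model, in the same style as the earlier composition functions; note that the factor sharing and the role of D_G follow our reading of the truncated equation above and are assumptions:

```python
import numpy as np

# Guide model: the guide G shares the pitch dictionary W_X^e (through the
# permutation P) and the phoneme dictionary W_X^phi with the singing voice,
# both warped by the same synchronization matrix D_G (assumed reading).
def model_guide(WXe, P, HXe, DG, WXphi, HXphi, wGc, iG):
    return (WXe @ P @ HXe @ DG) * (WXphi @ HXphi @ DG) * np.outer(wGc, iG)
```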
As will be appreciated by one skilled in the art, aspects of the present principles can be embodied as a system, method or computer readable medium. Accordingly, aspects of the present principles can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code and so forth), or an embodiment combining hardware and software aspects that can all generally be referred to herein as a "circuit", "module" or "system". Furthermore, aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) can be utilized.
Thus, for example, it will be appreciated by those skilled in the art that the diagrams presented herein represent conceptual views of illustrative system components and/or circuitry embodying the principles of the present disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable storage media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
A computer readable storage medium can take the form of a computer readable program product embodied in one or more computer readable medium(s) and having computer readable program code embodied thereon that is executable by a computer. A computer readable storage medium as used herein is considered a non-transitory storage medium given the inherent capability to store the information therein as well as the inherent capability to provide retrieval of the information therefrom. A computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. It is to be appreciated that the following, while providing more specific examples of computer readable storage mediums to which the present principles can be applied, is merely an illustrative and not exhaustive listing as is readily appreciated by one of ordinary skill in the art: a portable computer diskette; a hard disk; a read-only memory (ROM); an erasable programmable read-only memory (EPROM or Flash memory); a portable compact disc read-only memory (CD-ROM); an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.