This invention concerns methods and apparatus for the attenuation or removal of unwanted sounds from recorded audio signals.
The introduction of unwanted sounds is a common problem encountered in audio recordings. These unwanted sounds may occur acoustically at the time of the recording, or be introduced by subsequent signal corruption. Examples of acoustic unwanted sounds include the drone of an air conditioning unit, the sound of an object striking or being struck, coughs, and traffic noise. Examples of subsequent signal corruption include electronically induced lighting buzz, clicks caused by lost or corrupt samples in digital recordings, tape hiss, and the clicks and crackle endemic to recordings on disc.
Current audio restoration techniques include methods for the attenuation or removal of continuous sounds such as tape hiss and lighting buzz, and methods for the attenuation or removal of short duration impulsive disturbances such as record clicks and digital clicks. A detailed exposition of hiss reduction and click removal techniques can be found in the book ‘Digital Audio Restoration’ by Simon J. Godsill and Peter J. W. Rayner, which in its entirety is incorporated herein by reference.
The invention in its various aspects provides a method and apparatus as defined in the appended independent claims.
Preferred or advantageous features of the invention are set out in dependent sub-claims.
In one aspect, the invention advantageously concerns itself with attenuating or eliminating the class of sounds that are neither continuous nor impulsive (i.e. of very short duration, such as 0.1 ms or less), and which current techniques cannot address. They are characterised by being localised both in time and in frequency. Preferably, the invention is applicable to attenuating or eliminating unwanted sounds of duration between 10 s and 1 ms, and particularly preferably between 2 s and 10 ms, or between 1 s and 100 ms.
Examples of such sounds include coughs, squeaky chairs, car horns, the sounds of page turns, the creaks of a piano pedal, the sounds of an object striking or being struck, short duration noise bursts (often heard on vintage disc recordings), acoustic anomalies caused by degradation to optical soundtracks, and magnetic tape drop-outs.
In a further aspect, the invention provides a method to perform interpolations that, in addition to being constrained to act upon a limited set of samples (constrained in time), are also constrained to act only upon one or more selected frequency bands, allowing the interpolated region within the band or bands to be attenuated or removed seamlessly and without adversely affecting the audio content outside of the selected band or bands.
Furthermore, standard interpolation techniques do not interpolate the noise content of the signal well. Methods exist that attempt to overcome this limitation, but they use a flawed assumption. A preferred embodiment of the invention thus provides an improved method for regenerating the noise content of the interpolated signal, for example by means of a template signal as described below. This, combined with the frequency band constraints, creates a powerful interpolation method that extends significantly the class of problems to which interpolation techniques can be applied.
In the preferred embodiment of the invention, a time/frequency spectrogram is provided. This is an invaluable aid in selecting the time constraints and the frequency bands for the interpolation, for example by specifying start and finish times and upper and lower frequency values which define a rectangle surrounding the unwanted sound or noise in the spectrogram. The methods of the invention may also advantageously apply to other time and/or frequency constraints, for example using variable time and/or frequency constraints which define portions of a spectrogram which are not rectangular.
In a preferred embodiment of the invention, the constrained region does not have to contain one simple frequency band; it can comprise several bands if necessary. In addition, it is not necessary for the unwanted signal samples to be contiguous in time; they can occupy several unadjoining regions. This is advantageous because successive interpolations of simple regions, which may be required to treat unwanted signal samples which are, for example, in the same or overlapping frequency bands and separated only by short time intervals, may give sub-optimal results due to dependencies built up between the interpolations. A single application of this embodiment of the invention may advantageously avoid this build up of dependencies by interpolating all the regions simultaneously.
In a preferred embodiment of the interpolation method of the invention, time and frequency constraints are selected which define a region of the audio recording containing the unwanted sound or noise (in which the unwanted signal is superimposed on the portion of the desired audio recording within the selected region) and which exclude the surrounding portion of the desired audio recording (the good signal). A mathematical model is then derived which describes the good data surrounding the unwanted signal. A second mathematical model is derived which describes the unwanted signal. This second model is constrained to have zero values outside the selected temporal region (outside the selected time constraints). Each of the models incorporates an independent excitation signal. The observed signal can be treated as the sum of the good signal plus the unwanted signal, with the good signal and the unwanted signal having unknown values in the selected temporal region. This can be expressed as a set of equations that can be solved analytically to find an interpolated estimate of the unknown good signal (within the selected region) that minimises the sum of the powers of the excitation signals.
In this embodiment of the invention, the relationship between the two models determines how much interpolation is applied at each frequency. By giving the model for the unwanted signal a spectrally-banded structure that follows the one or more selected frequency bands, this embodiment constrains the interpolation to affect the bands without adversely affecting the surrounding audio (subject to frequency resolution limits). A user parameter varies the relative intensities of the models in the bands, thus controlling how much interpolation is performed within the bands.
The preferred mathematical model to use in this embodiment is an autoregressive or “AR” model. However, an “AR” model plus “basis vector” model may also be used for either model (for either signal). These models are described in the book ‘Digital Audio Restoration’, the relevant pages of which are included below.
Because of the nature of the analytical solutions referred to above, the embodiment in the preceding paragraphs will not interpolate the noise content of the or each selected band or sub-band. The minimised excitation signals do not necessarily form ‘typical’ sequences for the models, and this can alter the perceived effect of each interpolation. This deficiency is most noticeable in noisy regions because the uncorrelated nature of noise means that the minimised excitation signal has too little power to be ‘typical’. The result of this may be an audible hole in the interpolated signal. This occurs wherever the interpolated signal spectrogram decays to zero due to inadequate excitation.
The conventional method to correct this problem proceeds on the assumption that the excitation signals driving the models are independent Gaussian white noise signals of a known power. The method therefore adds a correcting signal to the excitation signal in order to ensure that it is ‘white’ and of the correct power. Inherent inaccuracies in the models mean that, in practice, the excitation signals are seldom white. This method may therefore be inadequate in many cases.
A preferred implementation provided in a further aspect of the invention extends the equations for the interpolator to incorporate a template signal for the interpolated region. The solution for these extended equations converges on the template signal (as described below) in the frequency bands where the solution would otherwise have decayed to zero. A user parameter may advantageously be used to scale the template signal, adjusting the amount of the template signal that appears in the interpolated solution.
In this implementation, the template signal is calculated to be noise with the same spectral power as the surrounding good signal but with random phase. Analysis shows that this is equivalent to adding a non-white correcting factor to generate a more ‘typical’ excitation signal.
This eliminates a flaw in existing methods which manifests itself as a loss of energy in the interpolation such that the signal power spectrum decays inappropriately in parts of the interpolated region.
A different implementation could use an arbitrary template signal, in which case the interpolation would in effect replace the frequency bands in the original signal with their equivalent portions from the template signal.
A further, less preferred, embodiment of the invention applies a filter to split the signal into two separate signals: one approximating the signal inside a frequency band or bands (containing the unwanted sounds) and one approximating the signal outside the band or bands. Time and frequency constraints may be selected on a spectrogram in order to specify the portion(s) of the signal containing the unwanted sound, as described above. A conventional unconstrained (in frequency) interpolation can then be performed on the signal containing the unwanted sound(s) (the sub-band frequencies). Subsequently, the two signals can be combined to create a resulting signal that has had the interpolations confined to the band containing the unwanted sound. Ideally, the band-split filter is of the ‘linear phase’ variety, which ensures that the two signals can be summed coherently to create the interpolated signal. This method has one significant drawback in that the action of filtering spreads the unwanted sound in time. The time constraints of the interpolator must therefore widen to account for this spread, thereby affecting more of the audio than would otherwise be necessary. The preferred embodiment of the invention, as described previously, includes the frequency constraints as a fundamental part of the interpolation algorithm and therefore avoids this problem.
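By way of illustration only, the band-split approach just described might be sketched as follows. The linear-phase FIR design from scipy.signal and the generic interpolate_gap routine (standing in for any conventional unconstrained interpolator) are assumptions for this sketch, not features taken from the embodiment itself.

```python
import numpy as np
from scipy import signal

def bandsplit_interpolate(y, gap, lo_hz, hi_hz, fs, interpolate_gap, numtaps=255):
    """Interpolate a time gap only within [lo_hz, hi_hz] using a band-split filter."""
    h = signal.firwin(numtaps, [lo_hz, hi_hz], pass_zero=False, fs=fs)  # linear-phase FIR band-pass
    delay = (numtaps - 1) // 2
    inband = np.convolve(y, h)[delay:delay + len(y)]   # signal approximating the selected band
    outband = y - inband                               # complementary signal outside the band
    start, stop = gap
    # The filtering spreads the unwanted sound in time, so the time constraints
    # passed to the conventional interpolator must be widened by the filter delay.
    inband = interpolate_gap(inband, max(0, start - delay), min(len(y), stop + delay))
    return inband + outband                            # coherent recombination
```

Because both branches see the same linear-phase delay (compensated above), their sum reconstructs the original signal outside the treated region, which is the property relied on in the passage above.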
Specific embodiments of the invention will now be described by way of example with reference to the accompanying drawings, in which:
FIGS. 6 to 13 show spectrograms illustrating a second example of unwanted sound removal;
FIGS. 19 to 23 are reproductions of
Example 1 (referring to
Example 2 (FIGS. 6 to 13) shows an embodiment of the invention applied to the sound of a car horn that sounded and was recorded during the sound of a choir inhaling. The car horn sound is observed as comprising several distinct harmonics, the longest of which has a duration of about 40,000 samples (a little under one second). The sound of the indrawn breath has a strong noise-like characteristic and can be observed on the spectrogram as a general lifting of the noise floor. To eliminate the sound of the horn, each harmonic is marked as a separate sub-band and then replaced with audio that matches the surrounding breathy sound. Once all the harmonics have been marked and replaced, the resulting audio signal contains no audible residue from the car horn, and there is no audible degradation to the breath sound.
FIGS. 6 to 13 illustrate the removal of the unwanted car-horn sound in a series of steps, each using the same principles as the method illustrated in FIGS. 1 to 5. However, the car-horn comprises a number of distinct harmonics at different frequencies, each harmonic being sustained for a different period of time. Each harmonic is therefore removed individually.
The computer system will then display a time/frequency spectrogram of the audio (as in FIGS. 1 to 13). The spectrogram is displayed as a two-dimensional colour image in which the horizontal axis represents time, the vertical axis represents frequency, and the colour of each pixel represents the calculated spectral power at the corresponding time and frequency. The spectrogram powers can be estimated using successive overlapped windowed Discrete Fourier Transforms 40, see
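As a rough illustration (not taken from the embodiment itself), the spectrogram powers described above might be estimated as follows; the FFT length, hop size and dB scaling are assumed values.

```python
import numpy as np

def spectrogram_power(y, n_fft=1024, hop=256):
    """Spectral power (dB) from successive overlapped, windowed DFTs of y."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    power = np.empty((n_frames, n_fft // 2 + 1))
    for m in range(n_frames):
        frame = y[m * hop : m * hop + n_fft] * window   # overlapped, windowed frame
        spectrum = np.fft.rfft(frame)                   # DFT computed with an FFT
        power[m] = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    return power   # rows index time (frames), columns index frequency (bins)
```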
The following embodiment can either reduce the signal in the selected region or replace it with a signal template synthesised from the surrounding audio. The embodiment has two parameters that determine how much synthesis and reduction are applied.
This method for replacing the signal proceeds as follows:
1. Derive an AR model for the good signal outside the constrained region, using the following steps:
2. Postulate a signal that is constrained to lie in the selected frequency bands and derive an AR model for the unwanted signal from it, using the following steps:
3. Calculate a template signal that has a power spectrum that matches the good signal, but that has a randomised phase. Scale this synthetic signal depending on how much synthesis the user has requested. From the synthesised signal and the matrix representation of the good AR model, calculate the synthetic excitation.
4. Estimate the unwanted signal outside the time constraints. In this implementation that estimate is zero.
5. Use the combined equations to calculate an estimate for the unknown data. This estimate will fulfil the requirement that the interpolation is constrained to affect only those frequencies within the selected bands but not affect those outside the selected bands.
The implementation will then redisplay the spectrogram so that the operator can see the effect of the interpolation (
See below for a more detailed description of the equations used in each stage and diagrams of how they interact.
The flow diagram in
Model Assumptions
Sample Sets
The operator has selected a set T of N contiguous samples 60 from a discrete time signal, stored in an array of values y(t), 0≤t<N. From this region the operator has selected a subset of these samples to be interpolated. We define the set Tu as the subset of Nu sample times selected by the operator for interpolation. We define the set Tk as the subset of Nk sample times (within T but outside the subset Tu) not selected by the operator. The lengths of the two sets are related such that N=Nk+Nu. It is also desirable that there are at least twice as many samples in the set Tk as there are in Tu. Furthermore the operator has selected one or more frequency bands within which to apply the interpolation.
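A small sketch, using assumed variable names and sizes, of this bookkeeping: a block of N contiguous samples, a boolean mask marking the Nu sample times in Tu, and its complement Tk.

```python
import numpy as np

N = 4096
y = np.zeros(N)                      # the selected block of observed samples y(t)
selected = np.zeros(N, dtype=bool)   # True for sample times in Tu
selected[1500:1900] = True           # unwanted regions need not be contiguous
selected[2400:2500] = True

Tu = np.flatnonzero(selected)        # sample times to be interpolated
Tk = np.flatnonzero(~selected)       # known sample times
Nu, Nk = len(Tu), len(Tk)
assert Nk >= 2 * Nu                  # the recommendation that Tk hold at least twice as many samples
```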
Observation Model
The signal y(t) is assumed to be formed from the sum of two independent signals, the good signal x(t) and an unwanted signal w(t). Therefore we have the following model for the observations
y(t)=x(t)+w(t) (1)
or, in vector notation
y=x+w (2)
where
y=[y(0) . . . y(N−1)]T (3)
x=[x(0) . . . x(N−1)]T (4)
w=[w(0) . . . w(N−1)]T (5)
We can further partition these vectors into those elements corresponding to the set of sample times Tu and those corresponding to the set of sample times Tk.
yu=xu+wu (6)
yk=xk+wk (7)
where
yu=[y(t0) . . . y(tNu−1)]T,tj∈Tu (8)
xu=[x(t0) . . . x(tNu−1)]T,tj∈Tu (9)
wu=[w(t0) . . . w(tNu−1)]T,tj∈Tu (10)
yk=[y(t0) . . . y(tNk−1)]T,tj∈Tk (11)
xk=[x(t0) . . . x(tNk−1)]T,tj∈Tk (12)
wk=[w(t0) . . . w(tNk−1)]T,tj∈Tk (13)
Obviously both yu and yk are known as they form the observed signal values.
We stipulate that the values of x(t) and w(t) must be known a priori for the set of sample times Tk. Hence, in the case where the unwanted signal is zero in this region we get
wk=0 (14)
xk=yk (15)
We define our interpolation method as estimating the unknown values of xu.
Deriving the AR Model for the Good Signal
The basic form of an AR model is shown in
x(t)=−a1·x(t−1)− . . . −aP·x(t−P)+ex(t) (16)
or in its alternate form
ex(t)=x(t)+a1·x(t−1)+ . . . +aP·x(t−P) (17)
where the autoregressive model is specified by the coefficients ai (i=1 . . . P, P being the order of the model) and ex(t) defines an excitation sequence that drives the model.
In this case we have to estimate the coefficients of the model only from the known values of xk. It is sufficient for this purpose to create a new signal x1(t) that assumes the unknown values of x(t) are zero:
x1(t)=x(t) for t∈Tk, x1(t)=0 for t∈Tu (18)
Solving for the AR Coefficients
There are several methods for calculating the model coefficients. This example uses the covariance method as follows:
Equation 16 can now be reformulated into a matrix form as
which can be expressed more compactly in the following equation, with appropriate definitions of ex, x1, a and X1:
ex=x1+X1·a (20)
The values of a that minimise the excitation energy
Jx=exTex (21)
can be calculated jointly using the formula
a=−(X1TX1)−1X1Tx1 (22)
This minimum value of Jx should also be calculated (Jxmin) using equations 20 and 21.
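A hedged sketch of the covariance-method fit in equations 20 to 22, assuming x1 is the known signal with the unknown samples set to zero and P is a chosen model order.

```python
import numpy as np

def fit_ar_covariance(x1, P):
    """Return the AR coefficients a and the minimised excitation energy Jxmin."""
    x1 = np.asarray(x1, dtype=float)
    N = len(x1)
    # Rows t = P .. N-1 of the prediction matrix X1: [x1(t-1), ..., x1(t-P)]
    X1 = np.column_stack([x1[P - i : N - i] for i in range(1, P + 1)])
    x1_vec = x1[P:N]
    a = -np.linalg.solve(X1.T @ X1, X1.T @ x1_vec)   # equation 22
    ex = x1_vec + X1 @ a                             # excitation, equation 20
    Jx_min = float(ex @ ex)                          # minimised energy, equation 21
    return a, Jx_min
```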
Expressing the Model in Terms of the Known and Unknown Signal
Having calculated the model coefficients a, we can use equation 17 to express an alternative matrix representation of the model.
which can be expressed more compactly with an appropriate definition of A as
ex=A·x.
This matrix equation can be partitioned into two parts as
ex=Au·xu+Ak·xk (24)
where the matrix Au is a submatrix of A formed by taking the columns of A appropriate to the unknown data xu, and the matrix Ak is a submatrix of A formed by taking the columns of A appropriate to the known data xk.
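The matrix A and its column partition might be built as follows; this is a minimal sketch in which the index arrays Tu and Tk from the sample-set description are assumed to be available.

```python
import numpy as np

def build_A(a, N):
    """Matrix A such that ex = A x for the AR model with coefficients a (order P)."""
    a = np.asarray(a, dtype=float)
    P = len(a)
    A = np.zeros((N - P, N))
    for j in range(N - P):
        t = j + P
        A[j, t] = 1.0                  # coefficient of x(t)
        A[j, t - P : t] = a[::-1]      # coefficients aP ... a1 multiplying x(t-P) ... x(t-1)
    return A

# Column partition into the unknown and known sample times:
# A = build_A(a, N); Au, Ak = A[:, Tu], A[:, Tk]
```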
Deriving the AR Model for the Unwanted Signal
The model for the unwanted signal uses an AR model, as in the good signal model. Mathematically this is expressed as
w(t)=−b1·w(t−1)− . . . −bQ·w(t−Q)+ew(t) (25)
or in its alternate form
ew(t)=w(t)+b1·w(t−1)+ . . . +bQ·w(t−Q) (26)
where the autoregressive model is specified by the coefficients bi (i=1 . . . Q, Q being the order of the model) and ew(t) defines an excitation sequence that drives the model.
Solving for the AR Coefficients
The difficulty is in finding a model that adequately expresses the frequency constraints. One method is to create a hypothetical artificial waveform with the required band limited structure and then solve the model equations for this artificial waveform. Let this artificial waveform be w′(t). We can get a solution for the model coefficients purely by knowing the correlation function of this waveform:
rww(τ)=E{w′(t)w′(t−τ)} (27)
Create an artificial power spectrum W′(w̄) which has an amplitude of 1.0 inside the frequency bands and zero outside them. Taking the inverse Discrete Fourier Transform of this power spectrum will give a suitable estimate for rww(τ).
The filter coefficients can be found by the following equation
Furthermore the excitation power required for this artificial model can be calculated as:
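The displayed forms of equations 28 and 29 are not reproduced above. One standard way to carry out this step, shown here purely as an assumed illustration, is to solve the normal (Yule-Walker) equations for the correlation function rww and read off the corresponding excitation power; the band edges, model order Q, FFT length and sampling rate below are arbitrary.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def band_constrained_ar(bands, Q, n_fft=2048, fs=44100.0):
    """AR coefficients b and excitation power for a band-limited artificial spectrum."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    W = np.zeros_like(freqs)
    for lo, hi in bands:                        # e.g. bands = [(800.0, 1200.0)]
        W[(freqs >= lo) & (freqs <= hi)] = 1.0  # amplitude 1.0 inside the band(s), zero outside
    rww = np.fft.irfft(W, n=n_fft)              # estimate of the correlation function rww(tau)
    # Normal equations: sum_i b_i rww(|j-i|) = -rww(j), j = 1..Q
    b = -solve_toeplitz(rww[:Q], rww[1:Q + 1])
    sigma2_w = rww[0] + b @ rww[1:Q + 1]        # excitation power of the artificial model
    return b, sigma2_w
```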
Expressing the Model in Terms of the Known and Unknown Signal
Having calculated the model coefficients b, we can use equation 26 to express an alternative matrix representation of the model.
which can be expressed more compactly with an appropriate definition of B as
ew=B·w
This matrix equation can be partitioned into two parts as
ew=Bu·wu+Bk·wk (31)
with suitable definitions of Bk and Bu.
We now use equation 1 to express equation 31 in terms of y and x
ew=Bu·(yu−xu)+Bk·(yk−xk) (32)
In the case where wk=0 this collapses to
ew=Bu·(yu−xu) (33)
The Template Signal
We calculate the template signal s from the known good data xk as follows. We calculate the Discrete Fourier Transform X1(w̄) of the waveform x1(t) defined in equation 18. We then create a synthetic spectrum S1(w̄) that has the same amplitude as X1(w̄) but uses pseudo-random phases. This spectrum is then inverted using the inverse Discrete Fourier Transform to give the template signal s. This has to be subsequently filtered by the good signal model to give a template excitation es as follows:
es=As
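A minimal sketch of this template construction: the DFT of x1(t) with its amplitude kept and its phase randomised, inverted to give s, then filtered through the good-signal matrix A to give es. The random number generator and the handling of the DC/Nyquist bins are implementation details assumed for illustration.

```python
import numpy as np

def template_excitation(x1, A, rng=np.random.default_rng()):
    """Template signal s (matched power spectrum, random phase) and its excitation es = A s."""
    X1 = np.fft.rfft(x1)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=X1.shape)
    phase[0] = 0.0                              # keep the DC bin real
    phase[-1] = 0.0                             # and the Nyquist bin (even lengths)
    S1 = np.abs(X1) * np.exp(1j * phase)        # same amplitude as X1, pseudo-random phase
    s = np.fft.irfft(S1, n=len(x1))             # the template signal s
    es = A @ s                                  # template excitation, es = A s
    return s, es
```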
We hypothesise a new signal
Δx=x−λ·s, (34)
where λ is a user defined parameter that scales the template signal in order to increase or decrease its effect. This difference signal can itself be modelled by the good signal model.
Δe=ex−λes=AΔx (35)
This can be expanded into
Δe=Au·xu+Akxk−λAs (36)
The Interpolation Model
The diagram in
J=ΔeT·Δe+μ·ewT·ew
where μ is a user defined parameter that controls how much interpolation is performed in the frequency bands. This equation can be modified by substituting
Minimising this equation with respect to xu leads to the following estimate x̂u for xu:
x̂u=(AuTAu+μ′·BuTBu)−1(μ′·BuTBu·yu−AuTAk·xk+λ·AuT·es)
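With the quantities built up in the preceding steps, the estimate above reduces to a single linear solve. In this sketch, Au, Ak, Bu, xk, yu, es and the user parameters μ′ (mu_prime) and λ (lam) are assumed to be available from the earlier stages.

```python
import numpy as np

def interpolate_unknown(Au, Ak, Bu, xk, yu, es, mu_prime, lam):
    """Closed-form estimate of the unknown samples xu."""
    BtB = Bu.T @ Bu
    lhs = Au.T @ Au + mu_prime * BtB
    rhs = mu_prime * (BtB @ yu) - Au.T @ (Ak @ xk) + lam * (Au.T @ es)
    return np.linalg.solve(lhs, rhs)            # x_hat_u
```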
Background Reference
The following pages show copies of pages 86 to 89, 111, and 114 to 116 of the book ‘Digital Audio Restoration’ referenced above.
The transfer function for this model is
H(z)=B(z)/A(z)
where B(z)=Σj=0..Q bj·z−j and A(z)=1−Σi=1..P ai·z−i.
The model can be seen to consist of applying an IIR filter (see section 2.5.1) to the ‘excitation’ or ‘innovation’ sequence {en}, which is i.i.d. noise. Generalisations to the model could include the addition of additional deterministic input signals (the ARMAX model [114, 21]) or the inclusion of linear basis functions in the same way as for the general linear model:
y=x+Gθ
An important special case of the ARMA model is the autoregressive (AR) or ‘all-pole’ (since the transfer function has poles only) model in which B(z)=1. This model is used considerably throughout the text and is considered in the next section.
4.3 Autoregressive (AR) Modelling
A time series model which is fundamental to much of the work in this book is the autoregressive (AR) model, in which the data is modelled as the output of an all-pole filter excited by white noise. This model formulation is a special case of the innovations representation for a stationary random signal in which the signal {Xn} is modelled as the output of a linear time invariant filter driven by white noise. In the AR case the filtering operation is restricted to a weighted sum of past output values and a white noise innovations input {en}:
xn=Σi=1..P ai·xn−i+en (4.41)
The coefficients {ai; i=1 . . . P} are the filter coefficients of the all-pole filter, henceforth referred to as the AR parameters, and P, the number of coefficients, is the order of the AR process. The AR model formulation is closely related to the linear prediction framework used in many fields of signal processing (see e.g. [174, 119]). AR modelling has some very useful properties as will be seen later and these will often lead to simple analytical results where a more general model such as the ARMA model (see previous section) does not. In addition, the AR model has a reasonable basis as a source-filter model for the physical sound production process in many speech and audio signals [156, 187].
4.3.1 Statistical Modelling and Estimation of AR Models
If the probability distribution function pe(en) for the innovation process is known, it is possible to incorporate the AR process into a statistical framework for classification and estimation problems. A straightforward change of variable from xn to en gives us the distribution for xn conditional on the previous P data values as
Since the excitation sequence is i.i.d. we can write the joint probability for a contiguous block of N−P data samples xP+1 . . . xN conditional upon the first P samples x1 . . . xP as
This is now expressed in matrix-vector notation. The data samples x1, . . . , xN and parameters a1, a2, . . . , aP−1, aP are written as column vectors of length N and P, respectively
x=[x1 x2 . . . xN]T, a=[a1 a2 . . . aP−1 aP]T (4.44)
x is partitioned into x0, which contains the first P samples x1, . . . , xP, and x1 which contains the remaining (N−P) samples xP+1 . . . xN:
x0=[x1 x2 . . . xP]T, x1=[xP+1 . . . xN]T (4.45)
The AR modelling equation of (4.41) is now rewritten for the block of N data samples as
x1=G a+e (4.46)
where e is the vector of (N−P) excitation values and the ((N−P)×P) matrix G is given by
The conditional probability expression (4.43) now becomes
p(x1|x0,a)=pe(x1−Ga) (4.48)
and in the case of a zero-mean Gaussian excitation we obtain
Note that this introduces a variance parameter σe2 which is in general unknown. The p.d.f. given is thus implicitly conditional on σe2 as well as a and x0.
The form of the modelling equation of (4.46) looks identical to that of the general linear parametric model used to illustrate previous sections (4.1). We have to bear in mind, however, that G here depends upon the data values themselves, which is reflected in the conditioning of the distribution of x1 upon x0. It can be argued that this conditioning becomes an insignificant ‘end-effect’ for N>>P [155] and we can then make an approximation to obtain the likelihood for x:
p(x|a)≈p(x1|a,x0), N>>P (4.50)
How much greater than P the value of N must be will in fact depend upon the pole positions of the AR process. Using this result an approximate ML estimator for a can be obtained by maximisation w.r.t. a, from which we obtain the well-known covariance estimate for the AR parameters,
aCov=(GTG)−1GTx1 (4.51)
which is equivalent to a minimisation of the sum-squared prediction error over the block, E=Σi=P+1Nei2, and has the same form as the ML parameter estimate in the general linear model.
Consider now an alternative form for the vector model equation (4.46) which will be used in subsequent work for Bayesian detection of clicks and interpolation of AR data:
e=Ax (4.52)
where A is the ((N−P)×(N)) matrix defined as
The conditional likelihood for white Gaussian excitation is then rewritten as:
In order to obtain the exact (i.e. not conditional upon x0) likelihood we need the distribution p(x0|a), since
p(x|a)=p(x1|x0,a)p(x0|a)
In appendix C this additional term is derived, and the exact likelihood for all elements of x is shown to require only a simple modification to the conditional likelihood, giving:
While the exact likelihood is quite easy to incorporate in missing data or interpolation problems with known a, it is much more difficult to use for AR parameter estimation since the functions to maximise are non-linear in the parameters a. Hence the linearising approximation of equation (4.50) will usually be adopted for the likelihood when the parameters are unknown.
In this section we have shown how to calculate exact and approximate likelihoods for AR data, in two different forms: one as a quadratic form in the data x and another as a quadratic (or approximately quadratic) form in the parameters a. This likelihood will appear on many subsequent occasions throughout the book.
5.2.3.1 Pitch-Based Extension to the AR Interpolator
Vaseghi and Rayner [191] propose an extended AR model to take account of signals with long-term correlation structure, such as voiced speech, singing or near-periodic music. The model, which is similar to the long term prediction schemes used in some speech coders, introduces extra predictor parameters around the pitch period T, so that the AR model equation is modified to:
where Q is typically smaller than P. Least squares/ML interpolation using this model is of a similar form to the standard LSAR interpolator, and parameter estimation is straightforwardly derived as an extension of standard AR parameter estimation methods (see section 4.3.1). The method gives a useful extra degree of support from adjacent pitch periods which can only be obtained using very high model orders in the standard AR case. As a result, the ‘under-prediction’ sometimes observed when interpolating long gaps is improved. Of course, an estimate of T is required, but results are quite robust to errors in this. Veldhuis [192, chapter 4] presents a special case of this interpolation method in which the signal is modelled by one single ‘prediction’ element at the pitch period (i.e. Q=0 and P=0 in the above equation).
5.2.3.2 Interpolation with an AR+Basis Function Representation
A simple extension of the AR-based interpolator modifies the signal model to include some deterministic basis functions, such as sinusoids or wavelets. Often it will be possible to model most of the signal energy using the deterministic basis, while the AR model captures the correlation structure of the residual. The sinusoid+residual model, for example, has been applied successfully by various researchers, see e.g. [169, 158, 165, 66]. The model for xn with AR residual can be written as:
Here φi[n] is the nth element of the ith basis vector φi and rn is the residual, which is modelled as an AR process in the usual way. For example, with a sinusoidal basis we might take φ2i−1[n]=cos(ωi·nT) and φ2i[n]=sin(ωi·nT), where ωi is the ith sinusoid frequency. Another simple example of basis functions would be a d.c. offset or polynomial trend. These can be incorporated within exactly the same model and hence the interpolator presented here is a means for dealing also with non-zero mean or smooth underlying trends.
If we assume for the moment that the set of basis vectors {φi} is fixed and known for a particular data vector x then the LSAR interpolator can easily be extended to cover this case. The unknowns are now augmented by the basis coefficients, {ci}. Define c as a column vector containing the ci's and a (N×Q) matrix G such that x=Gc+r, where r is the vector of residual samples. The columns of G are the basis vectors, i.e. G=[φ1 . . . φQ]. The excitation sequence can then be written in terms of x and c as e=A(x−Gc), which is the same form as for the general linear model (see section 4.1). As before the solution can easily be obtained from least squares, ML and MAP criteria, and the solutions will be equivalent in most cases. We consider here the least squares solution which minimises eTe as before, but this time with respect to both x(i) and c, leading to the following estimate:
This extended version of the interpolator reduces to the standard interpolator when the number of basis vectors, Q, is equal to zero. If we back-substitute for c in (5.17), the following expression is obtained for x(i)
alone:
x(i)=−(A(i)T(I−AG(GTATAG)−1GTAT)A(i))−1(A(i)TA−(i)x−(i)−A(i)TAG(GTATAG)−1GTATA−(i)x−(i))
These two representations are equivalent to both the maximum likelihood (ML) and maximum a posteriori (MAP)1 interpolator under the same conditions as the standard AR interpolator, i.e. that no missing samples occur in the first P samples of the data vector. In cases where missing data does occur in the first P samples, a similar adaptation to the algorithm can be made as for the pure AR case. The modified interpolator involves some extra computation in estimating the basis coefficients, but as for the pure AR case many of the terms can be efficiently calculated by utilising the banded structure of the matrix A.
1 assuming a uniform prior distribution for the basis coefficients
We do not address the issue of basis function selection here. Multiscale and ‘elementary waveform’ representations such as wavelet bases may capture the non-stationary nature of audio signals, while a sinusoidal basis is likely to capture the character of voiced speech and the steady-state section of musical notes. Some combination of the two may well provide a good match to general audio. Procedures have been devised for selection of the number and frequency of sinusoidal basis vectors in the speech and audio literature [127, 45, 66] which involve various peak tracking and selection strategies in the discrete Fourier domain. More sophisticated and certainly more computationally intensive methods might adopt a time domain model selection strategy for selection of appropriate basis functions from some large ‘pool’ of candidates. A Bayesian approach would be a strong possibility for this task, employing some of the powerful Monte Carlo variable selection methods which are now available [65, 108]. Similar issues of iterative AR parameter estimation apply as for the standard AR interpolator in the AR plus basis function interpolation scheme.
5.2.3.2.1 Example: Sinusoid+AR Residual Interpolation
As a simple example of how the inclusion of deterministic basis vectors can help in restoration performance we consider the interpolation of a short section of brass music, which has a strongly ‘voiced’ character, see
are estimated rather crudely at each step by simply selecting the 25 frequencies in the DFT of the interpolated data which have largest magnitude. The number of iterations was 5.
5.2.3.3 Random Sampling Methods
A further modification to the LSAR method is concerned with the characteristics of the excitation signal. We notice that the LSAR procedure seeks to minimise the excitation energy of the signal, irrespective of its time domain autocorrelation. This is quite correct, and desirable mathematical properties result. However,
E=(x(i)−x(i)LS)TA(i)TA(i)(x(i)−x(i)LS)+ELS, E>ELS, (5.18)
where ELS is the excitation energy corresponding to the LSAR estimate x(i)LS. The positive definite matrix A(i)TA(i) can be factorised into ‘square roots’ by Cholesky or any other suitable matrix decomposition [86] to give A(i)TA(i)=MTM, where M is a non-singular square matrix. A transformation of variables u=M(x(i)−x(i)LS) then serves to de-correlate the missing data samples, simplifying equation (5.18) to:
E=uTu+ELS, (5.19)
from which it can be seen that the (non-unique) solutions with constant excitation energy correspond to vectors u with constant L2-norm. The resulting interpolant can be obtained by the inverse transformation x(i)=M−1u+x(i)LS.
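A small sketch, under assumed names, of the transformation just described: factor A(i)TA(i)=MTM by Cholesky decomposition, choose any u of the desired L2-norm (all such u give the same excitation energy E), and map it back to an interpolant.

```python
import numpy as np

def constant_energy_interpolant(A_i, x_ls, radius, rng=np.random.default_rng()):
    """Sample an interpolant whose excitation energy exceeds E_LS by radius**2."""
    M = np.linalg.cholesky(A_i.T @ A_i).T    # upper-triangular square root, M^T M = A_i^T A_i
    u = rng.standard_normal(len(x_ls))
    u *= radius / np.linalg.norm(u)          # any u with this norm gives the same energy
    return np.linalg.solve(M, u) + x_ls      # x_i = M^{-1} u + x_LS
```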
Number | Date | Country | Kind |
---|---|---|---|
0202386.9 | Feb 2002 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB03/00440 | 2/3/2003 | WO |