Embodiments described herein relate generally to speech processing systems and speech processing methods.
Reverberation is a process under which acoustic signals generated in the past reflect off objects in the environment and are observed simultaneously with acoustic signals generated at a later point in time. It is often necessary to understand speech in reverberant environments such as train stations and stadiums, large factories, concert and lecture halls.
It is possible to enhance a speech signal such that it is more intelligible in such environments.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which:
According to one embodiment, there is provided a speech intelligibility enhancing system for enhancing speech, the system comprising:
According to another embodiment, there is provided a speech intelligibility enhancing system for enhancing speech, the system comprising:
In an embodiment, the modification is applied to the frame of the speech received from the speech input by modifying the signal spectrum such that the frame of speech has a modified frame power.
In an embodiment, the prescribed frame power for each frame of inputted speech is calculated from the input frame power, the frame importance and the level of reverberation.
In an embodiment, the penalty term is:
where w is greater than 1, y is the prescribed frame power and x is the frame power of the extracted frame. In an embodiment, w=2.
In an embodiment, the prescribed frame power is calculated subject to λ being a function of l.
In an embodiment, the prescribed frame power is calculated subject to λ being a function of the measure of the frame importance. The term λ is parametrized such that it has a dependence on the frame importance.
The frame importance is a measure of the dissimilarity between the current extracted frame and one or more previous extracted frames. In an embodiment, the measure of the frame importance is a measure of the dissimilarity of the mel cepstrum of the extracted frame to that of the previous extracted frame.
In an embodiment, the contribution due to late reverberation is estimated by modelling the impulse response of the environment as a pulse train that is amplitude-modulated with a decaying function. The convolution of the section of this impulse response from time tl onwards and a section of the previously modified speech signal gives a model late reverberation signal frame. The contribution due to late reverberation to the frame power of the speech when reverbed is the power of the model late reverberation signal frame.
In an embodiment, the prescribed frame power is calculated from:
where y is the prescribed frame power, x is the frame power of the extracted frame, l is the contribution due to late reverberation, w is greater than 1, c1 and c2 are determined from a first and second boundary condition and b is a constant.
In an embodiment, the first boundary condition is:
y(α)=α
where α is the minimum value of the frame power obtained from sample speech data, and wherein the second boundary condition is:
y′(ψ)=l
where ∈(0,1) and ψ>>β, where β is the maximum value of the frame power obtained from sample speech data.
In an embodiment, the term λ is parametrized such that it has a dependence on the frame importance, and such that the crossing point of the prescribed frame power as a function of x and the function y=x is limited by β, where β is the maximum value of the frame power obtained from sample speech data and is the value of the crossing point at l=l̃. Furthermore, λ is parametrized such that the value of the crossing point for values of l below the critical value does not depend on the value of l and depends on the frame importance, and the value of the crossing point for values of l above the critical value does not depend on the value of l and depends on the frame importance.
In an embodiment, λ is calculated from:
λ=max(λ1,λ̃) for l≤l̃
λ=λ2 for l>l̃
wherein λ̃ is a constant determined such that the crossing point of the prescribed frame power as a function of x and the function y=x for l=l̃ and λ=λ̃ is β, and such that this is the maximum value of the crossing point for all values of l, and λ1 and λ2 are calculated as a function of the frame importance.
λ1 and λ2 are calculated such that the crossing point of the prescribed frame power as a function of x and the function y=x for all values of l is a value calculated as a function of the frame importance.
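By way of illustration only, the piecewise selection of λ described above may be sketched in Python as follows. The function name and argument names are hypothetical; λ1, λ2, λ̃ and the critical value l̃ are taken as already computed inputs, since their formulas are not reproduced in this sketch.

```python
def select_multiplier(l, l_crit, lam1, lam2, lam_tilde):
    """Pick the multiplier for one frame.

    l         -- late reverberation frame power
    l_crit    -- critical value of l (l-tilde in the text)
    lam1/lam2 -- frame-importance-dependent values (assumed given)
    lam_tilde -- constant bounding the crossing point (assumed given)
    """
    if l <= l_crit:
        # Below the critical value the multiplier is floored at lam_tilde.
        return max(lam1, lam_tilde)
    # Above the critical value only the second value is used.
    return lam2
```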
In an embodiment, the multiplier λ is calculated from:
λ=max(λν
λ=λ
where λ̃ corresponds to an upper bound for the prescribed frame power y(x=β, l=l̃, λ=λ̃)=β, wherein λ̃ is given by:
λν
where
λ
where
where s is a constant, ξ is the frame importance and the value of l̃ is calculated from
In an embodiment, step iii) comprises:
The signal gain applied to the frame may be the prescribed signal gain gi, where gi=√(yi/xi) is the square root of the ratio of the prescribed frame power yi to the power xi of the extracted frame.
Alternatively, the prescribed signal gain may be smoothed before it is applied, such that the applied signal gain g̈i is a smoothed gain.
In an embodiment, the rate of change of the modification is limited such that:
where i is the frame index, g̈i is the smoothed signal gain, i.e. the square root of the ratio of the modified frame power to the power of the extracted frame, gi is the square root of the ratio of the prescribed frame power to the power of the extracted frame, and ϕ, U and D are constants.
In an embodiment, the modification applied to the frame of the speech received from the speech input is calculated from:
g̈i=min(ui,gi) if gi>1
g̈i=max(di,gi) if gi≤1
where:
where s is a constant, ϕ is a constant, and ξ is the frame importance.
The value of ϕ for a frame may be selected from two or more values, based on some characteristic of the frame. The value of s may be different for the calculation of u and d.
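By way of illustration only, the gain limiting described above may be sketched in Python as follows. The function names are hypothetical. The limits ui and di are passed in as plain numbers, because their dependence on s, ϕ and ξ is not reproduced in this sketch.

```python
import math


def prescribed_gain(y, x):
    """Signal gain implied by prescribed frame power y and input frame power x."""
    return math.sqrt(y / x)


def limited_gain(g, u, d):
    """Apply the upper limit u when boosting and the lower limit d when attenuating.

    g -- prescribed signal gain for the current frame
    u -- upper limit for this frame (assumed already computed)
    d -- lower limit for this frame (assumed already computed)
    """
    if g > 1.0:
        return min(u, g)
    return max(d, g)
```

For example, a prescribed gain of 2.0 with an upper limit of 1.5 is applied as 1.5, while a prescribed gain of 0.5 with a lower limit of 0.8 is applied as 0.8.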
Step i) may comprise:
Step vi) may comprise:
In an embodiment, the threshold value is the correlation value where the target segment is the last segment, multiplied by Ω, where Ω∈(0,1).
According to another embodiment, there is provided a method of enhancing speech, the method comprising the steps of:
According to another embodiment, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform the method of enhancing speech.
The system 1 comprises a processor 3 comprising a program 5 which takes input speech and enhances the speech to increase its intelligibility. The storage 7 stores data that is used by the program 5. Details of the stored data will be described later.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input 15 for data relating to the speech to be enhanced. The input 15 may be an interface that allows a user to directly input data. Alternatively, the input may be a receiver for receiving data from an external storage medium or a network. The input 15 may receive data from a microphone for example.
Connected to the output module 13 is audio output 17. The audio output 17 may be a speaker for example.
In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, enhances the inputted speech in the manner which will be described with reference to
The system is configured to increase the intelligibility of speech under reverberation. The system modifies plain speech such that it has higher intelligibility in reverberant conditions.
In the presence of reverberation, multiple, delayed and attenuated copies of an acoustic signal are observed simultaneously. The phenomenon is more pronounced in enclosed environments, where the contained acoustic energy affects auditory perception until propagation attenuation and absorption in reflecting surfaces render the delayed signal copies inaudible. Similar to additive noise, high reverberation levels degrade intelligibility. The system is configured to apply a signal modification that mitigates the impact of reverberation on intelligibility.
In one embodiment, the system is configured to apply a modification, producing a modified frame power, based on an estimate of the contribution to the reverbed speech due to late reverberation.
Signal portions with low importance often have high energy. Reducing the power of these portions improves the detectability of adjacent sounds of higher importance and prominence. In an embodiment, the system takes account of the frame importance when applying the modification.
The system may be further configured to apply a time-scale modification.
A speech modification framework taking these aspects into consideration is described in relation to
In the framework, the input speech signal is split into overlapping frames for which frame importance evaluation is performed. In other words, each of the frames is characterized in terms of its information content. In parallel, a statistical model of late reverberation provides an estimate of the expected reverberant power at the resolution of the speech frame, i.e. the contribution to the frame power of the reverbed speech from late reverberation. An auditory distortion criterion is optimized to determine the frame-specific power gain adjustment. The criterion is composed of an auditory distortion measure and a penalty on the output power. The penalty term T is a function of the late reverberation power l, the power gain, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure above a critical value of the late reverberation power. λ is made a function of the frame importance. The estimate of the expected late reverberant power is included in the distortion measure as uncorrelated, additive noise. The criterion is used to derive the prescribed frame power, which is used to determine an optimal modification for a given frame. The frame importance, reverberation power and input power together are thus used to compute the optimal output power for a given frame.
When the late reverberation power is low, the distortion is the dominant term and the prescribed power gain, that is the ratio of the prescribed frame power to the power of the extracted frame, increases with late reverberation power, depending on the frame importance. Once the late reverberation power increases above a critical value, the penalty term starts to dominate, and the power gain starts to decrease with increasing late reverberation power, again depending on the frame importance.
In an embodiment, if the prescribed frame power is reduced from the input frame power and the late reverberation power is greater than the critical value, time warping is initiated. The time warp may be of the order of one pitch period and subject to smoothness constraints.
Blocks S101, S107 and S109 are part of the signal processing backbone. Steps S102 and S103 incorporate context awareness, including both acoustic properties of the environment and local speech statistics.
In an embodiment, the input speech signal is split into overlapping frames and each of these is characterized in terms of information content, or frame importance. In parallel, a statistical model of late reverberation provides an estimate of the expected reverberant power at the resolution of the speech frame. Optimizing a distortion criterion determines the locally optimal output power, referred to as prescribed frame power. Locally, the power of late reverberation is modelled as uncorrelated, additive noise. In the event that the ratio of the modified frame power to the power of the extracted frame is less than 1 and the late reverberant power is greater than the critical value, time warping, or slow-down, is initiated, subject to a smoothing constraint.
Step S101 is “Extract active speech frames”. This step comprises extracting overlapping frames from the speech signal x received from the speech input 15. The frames may be windowed, for example using a Hann window function.
Frames xi are output from the step S101.
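By way of illustration only, the extraction of overlapping, Hann-windowed frames may be sketched in Python as follows. The function names, the frame length and the hop size are hypothetical choices, and the detection of active speech is omitted from this sketch.

```python
import math


def hann_window(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def extract_frames(signal, frame_len, hop):
    """Split `signal` into overlapping windowed frames.

    Frames overlap whenever hop < frame_len. A trailing segment shorter
    than frame_len is dropped in this sketch.
    """
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + k] * window[k] for k in range(frame_len)])
    return frames
```

With a hop of half the frame length, each sample is covered by two consecutive frames.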
Step S102 is “Evaluate frame importance”. In this step, a measure of the frame importance is determined.
The frame importance characterizes the dissimilarity of the current frame to one or more previous frames. In an embodiment, the frame importance characterizes the dissimilarity to the adjacent previous frame. Low dissimilarity indicates less new information and therefore lower importance. Lower frame importance corresponds to higher redundancy. A frame with a low dissimilarity to previous frames, and thus high redundancy, has a low frame importance. Frame importance reflects the novelty of the frame and is used to limit the maximum boosting power.
The output of this step for each frame xi is the corresponding frame importance value ξi.
The frame importance is based on measuring the auditory domain dissimilarity between the current and one or more previous frames, for example by assessing the change between two consecutive frames in an auditory domain. In an embodiment, the frame importance is a measure of the dissimilarity of the mel cepstra of the frame to the previous frame. An estimate of the frame importance may be given by the normalized distance of the Mel frequency cepstral coefficients (MFCCs) in adjacent frames. In one embodiment, the frame importance is given by:
where mi represents the set of Mel frequency cepstral coefficients (MFCCs) derived from signal frame i, i.e. the MFCC vector at frame i.
The frame importance is a causal estimator, in other words it is not necessary for a future frame to be received in order to determine the frame importance of the current frame.
For the above relationship given in equation (1), ξi∈(0,1). This means that the frame importance parameter approximates the information content, where ξi→0 corresponds to low information content and ξi→1 corresponds to high information content.
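By way of illustration only, a normalized MFCC distance between adjacent frames may be sketched in Python as follows. The particular normalization used here (Euclidean distance divided by the sum of the two vector norms) is an assumption made for this sketch and is not asserted to be the relationship of equation (1). It is chosen because the triangle inequality bounds it between 0 and 1. The function name is hypothetical, and the MFCC vectors are taken as already computed.

```python
import math


def frame_importance(m_cur, m_prev):
    """Normalized distance between the MFCC vectors of adjacent frames.

    Assumed form: ||m_cur - m_prev|| / (||m_cur|| + ||m_prev||).
    Only the current and previous frames are needed, so the estimate is causal.
    """
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(m_cur, m_prev)))
    norm = math.sqrt(sum(a * a for a in m_cur)) + math.sqrt(sum(b * b for b in m_prev))
    if norm == 0.0:
        return 0.0  # two all-zero vectors carry no new information
    return diff / norm
```

Identical consecutive vectors give a value of 0 (high redundancy), while strongly differing vectors give values approaching 1.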
In this embodiment, the information content of a segment, or frame, is approximated with a simple estimator. The frame importance calculated is an approximation describing the information content on a continuous scale. Explicit probabilistic modelling is not used, however the adopted parameter space is capable of approximating the information content with a high resolution, i.e. with a continuous measure, as opposed to a binary classifier.
A rigorous estimation of the amount of information in the speech signal at a given time using probabilistic modelling and the notion of entropy can alternatively be used to determine a measure of the frame importance.
Step S103 is “Model late reverberation”.
Reverberation can be modelled as a convolution between the impulse response of the particular environment and the signal. The impulse response splits into three components: direct path, early reflections and late reverberation. Reverberation thus comprises two components: early reflections and late reverberation.
Early reflections have high power, depend on the geometry of the space and are individually distinguishable. They arrive within a short time window after the direct sound and are easily distinguishable when examining the room impulse response (RIR). Early reflections depend on the hall geometry and the position of the speaker and the listener. Early reflections arrive within a short interval, for example 50 ms, after the direct sound. Early reflections are not considered harmful to intelligibility, and in fact can improve intelligibility.
Late reverberation is diffuse in nature due to the large number of reflections and longer acoustic paths. It is the primary factor for reduced intelligibility due to masking between neighbouring sounds. This can be relevant for communication in places such as train stations and stadiums, large factories, concert and lecture halls. Identifying individual reflections is hard because their number increases while their magnitudes decrease. Late reverberation is considered more harmful to intelligibility because it is the primary cause of masking between different sounds in the speech signal. Late reverberation is the contribution of reflections arriving after the early reflections. Late reverberation is composed of delayed and attenuated replicas that have reflected more times than the early reflections. Late reverberation is thus diffuse and comprises a large number of reflections with diminishing magnitudes.
The late reverberation model in step S103 is used to assess the reverberant power that is considered to have a negative impact on intelligibility at a given time instant, i.e. that decreases intelligibility at a given time instant. The model outputs an approximation to the contribution to the reverbed speech frame due to late reverberation.
The boundary tl between early reflections and late reverberation in a RIR is the point where distinct reflections turn into a diffuse mixture. The value of tl is a characteristic of the environment. In an embodiment, tl is in the range 50 to 100 ms after the arrival of the sound following the direct path, i.e. the direct sound. tl seconds after the arrival of the direct sound, individual reflections become indistinguishable. This is thus the boundary between early reflections and late reverberation.
In step S103, the late reverberation is modelled, i.e. the contribution to the reverbed speech frame due to late reverberation is approximated. In one embodiment, the late reverberation can be modelled accurately to reproduce closely the acoustics of a particular hall. In alternative embodiments, simpler models that approximate the masking power due to late reverberation can be used, because the objective is power estimation of the late reverberation. Statistical models can be used to predict late reverberation power.
In an embodiment, the late reverberant part of the impulse response is modelled as a pulse train with an exponentially decaying envelope. In an embodiment, the Velvet Noise model can be used to model the contribution due to late reverberation.
The first plot shows an example acoustic environment, which is a hall with dimensions fixed to 20 m×30 m×8 m, the dimensions being width, length and height respectively. Length is shown on the vertical axis and width is shown on the horizontal axis. The speaker and listener locations are {10 m, 5 m, 3 m} and {10 m, 25 m, 1.8 m} respectively. These values are used to generate the model RIR used for illustration of an RIR in the second plot. For the late reverberation power modelling, the particular locations of the speaker and the listener are not used.
The second plot shows a room impulse response where the propagation delay and attenuation are normalized to the direct sound. Time is shown on the horizontal axis in seconds. The normalized room impulse response shown here is a model RIR based on knowledge of the intended acoustic environment, which is shown in the first plot. The model is generated with the image-source method, given the dimensions of the hall shown in the first plot and a target RT60.
The room impulse response may be measured, and the value of the boundary tl between early reflections and late reverberation and the reverberation time RT60 can be obtained from this measurement. The reverberation time RT60 is the time it takes late reverberation power to decay 60 dB below the power of the direct sound, and is also a characteristic of the environment.
The third plot shows the same normalised room impulse response model h̃ as the second plot, as well as the portion of the RIR corresponding to the late reverberation, discussed below. The late reverberation model is generated using the Velvet Noise model.
In one embodiment, the model of the late reverberation is based on the assumption that the power of late reverberation decays exponentially with time. Using this property, a model is implemented to estimate the power of late reverberation in a signal frame. A pulse train with appropriate density is generated using the framework of the Velvet Noise model, and is amplitude modulated with a decaying function.
The late reverberation room impulse response model is obtained as a product of the pulse train l[k] and the envelope e[k]:
h̃[k]=l[k]e[k] (2)
where e[k] is given by equation (5) below, and l[k] is a pulse train, and is given by equation (3) below:
where a[m] is a randomly generated sign of value +1 or −1, rnd(m) is a random number uniformly distributed between 0 and 1, “round” denotes rounding to an integer, Td is the average time in seconds between pulses and Ts is the sampling interval. u denotes a pulse with unit magnitude. This pulse train is the Velvet Noise model.
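By way of illustration only, a pulse train of this kind may be sketched in Python as follows. The pulse placement rule used here, k = round(m·Td/Ts + rnd(m)·(Td/Ts − 1)), is the commonly used Velvet Noise placement and is assumed for this sketch. The function name and the seed argument are hypothetical.

```python
import random


def velvet_noise(num_samples, td, ts, seed=0):
    """Sparse pulse train with one randomly signed unit pulse per grid period.

    td -- average time between pulses, in seconds
    ts -- sampling interval, in seconds
    """
    grid = td / ts  # average pulse spacing in samples
    if grid < 1.0:
        raise ValueError("td must be at least one sampling interval")
    rng = random.Random(seed)
    pulses = [0.0] * num_samples
    m = 0
    while True:
        # rnd(m) is uniform on [0, 1); the pulse stays inside its own period.
        k = round(m * grid + rng.random() * (grid - 1.0))
        if k >= num_samples:
            break
        pulses[k] = rng.choice((1.0, -1.0))  # a[m]: random sign
        m += 1
    return pulses
```

A density of 4000 pulses/second at a 16 kHz sampling rate corresponds to td=1/4000 and ts=1/16000, i.e. one pulse in every four samples on average.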
In an embodiment, the late reverberation pulse train is scaled. An initial value is chosen for the pulse density. In an embodiment, an initial value of greater than 2000 pulses/second is used. In an embodiment an initial value of 4000 pulses/second is used. The generated late reverberation pulse train is then scaled to ensure that its energy is the same as the part of a measured RIR corresponding to late reverberation. A recording of an RIR for the acoustic environment may be used to scale the late reverberation pulse train. It is not important where the speaker and listener are situated for the recording. The values of tl and RT60 can be determined from the recording. The energy of the part of the RIR after tl is also measured. The energy is computed as the sum of the squares of the values in the RIR after point tl. The amplitude of the late reverberation pulse train is then scaled so that the energy of the late reverberation pulse train is the same as the energy computed from the RIR.
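By way of illustration only, the energy matching described above may be sketched in Python as follows. The function name is hypothetical, and tl is passed as a sample index. Whether the scaling is applied before or after the decaying envelope is not fixed by this sketch; the function scales whichever sequence it is given.

```python
import math


def scale_to_rir_tail_energy(pulse_train, measured_rir, tl_samples):
    """Scale `pulse_train` so its energy equals that of the RIR after tl.

    Energy is the sum of squared sample values, as in the text.
    """
    target = sum(v * v for v in measured_rir[tl_samples:])
    current = sum(v * v for v in pulse_train)
    if current == 0.0:
        raise ValueError("pulse train has no energy to scale")
    gain = math.sqrt(target / current)
    return [gain * v for v in pulse_train]
```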
Any recorded RIR may be used as long as it is from the target environment. Alternatively, a model RIR can be used.
The continuous form of the decaying function, or envelope, is:
e(t)=10^(−3t/RT60) (4)
The discretized envelope is given by:
e[k]=10^(−3kTs/RT60) (5)
This relationship ensures a 60 dB power decay between the initial instant, t=0, which corresponds to the arrival of the direct path, and the reverberation time RT60. Ts is the sampling interval of the input speech signal, where:
Ts=1/fs (6)
and fs is the sampling frequency.
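By way of illustration only, the discretized envelope and its application to a pulse train may be sketched in Python as follows. The exponential form used here is assumed from the stated boundary behaviour, namely a value of 1 at the arrival of the direct path and a 60 dB power decay at the reverberation time RT60. The function names are hypothetical.

```python
def envelope(num_samples, rt60, fs):
    """Amplitude envelope falling by a factor of 1000 (60 dB in power) at RT60."""
    ts = 1.0 / fs  # sampling interval
    return [10.0 ** (-3.0 * k * ts / rt60) for k in range(num_samples)]


def late_rir_model(pulse_train, rt60, fs):
    """Amplitude-modulate a pulse train with the decaying envelope."""
    env = envelope(len(pulse_train), rt60, fs)
    return [p * e for p, e in zip(pulse_train, env)]
```

An amplitude of 10^−3 corresponds to a power of 10^−6, i.e. −60 dB relative to the direct path.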
The model of the late reverberation represents the portion of the RIR corresponding to late reverberation as a pulse train, of appropriate density, that is amplitude-modulated with a decaying function of the form given in (2).
An approximation to the late reverberation signal l̂, which is the noise caused by late reverberation, for the duration of the target frame is computed from:
where h̃ is the late reverberation room impulse response model, given in (2), i.e. the artificial, pulse-train-based impulse response, fs is the sampling frequency and the beginning of the target frame is associated with time index k=0.
Thus equation (5) is the envelope applied to the pulse train in (3) to generate h̃. From equation (5), at k=0, e[k]=1, meaning there is no decay for the direct path, which is used as the reference. At k=RT60/Ts, e[k]=10^−3, which in the power domain corresponds to −60 dB.
y[k−tlfs−n] corresponds to a point from the output “buffer”, i.e. the already modified signal corresponding to previous frames xp, where p<i. The convolution of h̃ from tl onwards and the signal history from the output buffer gives a sample or model realization of the late reverberation signal.
A sample-based late reverberation power estimate l is computed from l̂[k]. For a frame i, the value of l̂[k] for each value of k is determined, resulting in a set of values l̂, where each value corresponds to a value of k inside the frame.
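By way of illustration only, the convolution of the late part of the model impulse response with the output buffer may be sketched in Python as follows. The summation used here, l̂[k] = Σn h̃[tl·fs+n]·y[k−tl·fs−n], is assumed from the indexing described above. The function names are hypothetical, tl is passed as a sample count, and the use of the mean square as the frame power is an assumption of this sketch.

```python
def late_reverb_frame(h_late, history, tl_samples, frame_len):
    """Estimate the late reverberation signal for the current frame.

    h_late   -- model impulse response, index 0 being the direct path
    history  -- already modified output samples; history[-1] is time index -1
    Time index 0 is the first sample of the current frame.
    """
    estimate = []
    for k in range(frame_len):
        acc = 0.0
        for n in range(len(h_late) - tl_samples):
            idx = k - tl_samples - n  # time index of the output sample
            if idx >= 0:
                continue  # sample not yet in the output buffer
            pos = len(history) + idx
            if pos < 0:
                break  # older than the stored history
            acc += h_late[tl_samples + n] * history[pos]
        estimate.append(acc)
    return estimate


def frame_power(samples):
    """Mean square of the samples (assumed definition of frame power)."""
    return sum(v * v for v in samples) / len(samples)
```

Because only samples from the output buffer are used, the estimate for the current frame does not depend on the modification that is about to be applied to that frame.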
Values for RT60, tl, Td and fs may be stored in the storage 7 of the system shown in
Step S103 may be performed in parallel to step S102.
The following steps, S104 and S105, are directed to calculating a prescribed frame power that optimises the distortion criterion between the natural speech and the modified speech plus late reverberant power. In step S104, the frame power of the input speech signal and the estimated late reverberation signal are calculated. In step S105, the frame power values of the input speech signal xi and the late reverberation signal l̂i are used to calculate the prescribed frame power y that minimizes a distortion measure, subject to some penalty term which is a function of the late reverberant frame power l, the ratio of the prescribed frame power to the power of the input speech frame, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure above a critical value, and wherein λ is a function of the frame importance. The frame of input speech is then modified such that it has a modified frame power in step S107, by applying a signal gain. The modification is calculated from the prescribed frame power. The modification may be calculated by further applying a post-filtering and/or smoothing to the value of the signal gain calculated directly from the prescribed frame power.
A distortion measure is used to evaluate the instantaneous deviation, in practice approximated by a frame-based deviation, between a set of signal features in the perceptual domain obtained from clean speech and from modified reverberated speech. Minimizing the distortion provides the locally optimal modification parameters.
Step S104 is “Compute frame powers”. The frame power xi for each frame of the input speech signal xi is calculated. The frame power li for the late reverberation signal l̂i calculated in S103 is also calculated. The frame power for the late reverberation signal l̂i is the contribution li to the frame power of the reverbed speech due to late reverberation.
In an alternative embodiment, the fraction of the frame power of the input speech signal xi in each of two or more frequency bands is calculated, and the fraction of the frame power of the late reverberation signal l̂i calculated in S103 in each of the frequency bands is calculated. In an embodiment, the bands are linearly spaced on a mel scale. In an embodiment, the bands are non-overlapping. In an embodiment, there are 10 frequency bands.
In an embodiment, the bands of the input speech frame are ranked in order of descending power. In other words, for each frame, the order of the frequency bands in descending power is determined. The bands corresponding to a predetermined fraction of the total frame power in descending order are then determined. For example, the bands in which 90% of the total frame power is contained in descending order are determined. For example, in a first frame, 90% of the frame power may come from the n highest power bands. In a second frame, 90% of the frame power may come from the m highest power bands, the m highest power bands in the second frame being different to those in the first frame.
The frame power of the late reverberation signal is then determined as the total power in those bands determined for the corresponding input speech frame. For the above example, in the first frame, the late reverberant frame power is calculated as the power of the late reverberation signal in the n bands. In the second frame, the late reverberant frame power is calculated as the power of the late reverberation signal in the m bands. The frame power of the late reverberation signal is thus calculated by summing the band powers of the bands determined from the input speech frame.
The frame power of the input speech signal may then be calculated by summing the band powers for all the bands of the input speech frame, i.e. not just the determined bands. The frame power of the input speech signal is xi and the frame power of the late reverberation noise signal is li. In this embodiment, the late reverberation frame power is computed from certain spectral bands only. The spectral bands are determined for each frame by determining the spectral bands of the input speech frame corresponding to the highest powers, for example, the highest power spectral bands corresponding to a predetermined fraction of the frame power. This takes into account the different spectral energy distributions of different sounds.
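By way of illustration only, the band selection described above may be sketched in Python as follows. The function names are hypothetical, and the per-band powers of the input speech frame and of the late reverberation signal are taken as already computed.

```python
def select_dominant_bands(speech_band_powers, fraction=0.9):
    """Indices of the highest-power bands that together hold `fraction` of the power."""
    total = sum(speech_band_powers)
    order = sorted(range(len(speech_band_powers)),
                   key=lambda b: speech_band_powers[b], reverse=True)
    selected = []
    accumulated = 0.0
    for b in order:
        if accumulated >= fraction * total:
            break
        selected.append(b)
        accumulated += speech_band_powers[b]
    return selected


def frame_powers(speech_band_powers, reverb_band_powers, fraction=0.9):
    """Return (x, l) for one frame.

    x sums the speech power over all bands; l sums the late reverberation
    power over the bands selected from the speech frame only.
    """
    bands = select_dominant_bands(speech_band_powers, fraction)
    x = sum(speech_band_powers)
    l = sum(reverb_band_powers[b] for b in bands)
    return x, l
```

Because the selection is repeated for every frame, late reverberation power in bands where the current sound has little energy does not contribute to l for that frame.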
Step S105 is “Optimise frame output power”.
A prescribed frame power is calculated. The prescribed frame power minimizes a distortion measure, subject to some penalty term which is a function of l, the ratio of the prescribed frame power to the power of the input speech frame, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure above the critical value. The prescribed frame power is calculated subject to λ being a function of the frame importance.
In one embodiment, an iterative method is used to determine the prescribed frame power. For the first iteration, the distortion between the unmodified speech and the unmodified speech plus reverberation noise is evaluated, subject to the penalty term. This is output as the modified speech frame yi. This is then repeated, for the new modified speech frame yi. These steps are iterated, to find the prescribed frame power that reduces the distortion calculated, subject to the penalty term. In another embodiment, calculating a prescribed frame power value comprises using a searching algorithm to find a local minimum for the prescribed frame power, subject to the penalty term.
In one embodiment, there is a closed-form solution to the optimization problem. In this case an iterative search for the optimum prescribed frame power is not performed. In step S105 the values for frame importance, frame power of the input signal xi and frame power of the late reverberation signal li are inputted into an equation for the prescribed frame power, which corresponds to the solution of the optimization problem. There may be some further alteration to the signal gain calculated from the prescribed frame power before it is applied, for example a smoothing filter. The signal gain is applied in step S107. There is no iteration to determine the prescribed frame power in this case. The prescribed frame power is simply calculated from a pre-determined function. In this embodiment, the speech modification has low complexity.
A set of processing steps S105 to S107 in accordance with an embodiment in which there is a closed-form solution to the optimization problem are now described.
In these steps, the function for the prescribed frame power is determined by minimizing a distortion measure in the power domain, subject to a penalty term, wherein the penalty term is a function of l, the ratio of the prescribed frame power to the power of the input speech frame, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure above a critical value of l, and wherein λ is a function of the frame importance. In these steps, the prescribed power of the frame is calculated using a function which minimises the distortion criterion.
A composite criterion, comprising the distortion term and a power increase penalty, is used to prevent excessive increase in output power. To facilitate the analysis, late reverberation is locally, i.e., for the duration of the current frame, regarded as uncorrelated, additive noise. This is motivated by i) the time separation between the current frame and the period when the interfering speech was produced and ii) the long-term non-stationary nature of the speech signal. Late reverberation is thus considered as additive and uncorrelated with the signal, due to the differences in propagation time and noise.
Any composite distortion criterion for speech in noise having a distortion term and a power gain penalty, the power gain penalty being configured to decrease the power gain as the contribution to late reverberation increases above a critical value, can be used to determine a prescribed frame power in this step. A speech in noise criterion is used because late reverberation can be interpreted as additive uncorrelated non-stationary noise.
In one embodiment, a criterion composed of an auditory distortion measure and a constraint on the output power is used to derive the optimal prescribed modified frame power at a given time:
where x, y and l are the instantaneous powers of the waveforms x, y and l, in practice approximated by frame powers. Italic font is used to indicate the frame powers. Thus for a particular frame there is a value x, where x is the frame power of the original frame of speech signal. There is also a value of l, where l is the power of the noise in that frame, estimated in step S103. The prescribed modified power for the frame is denoted by y.
In equation (8), the penalty term T is
In general however, any penalty term T which is a function of l, the ratio of the prescribed frame power to the power of the input frame, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure above a critical value can be used. For example, the penalty term may be:
where w>1. In an embodiment,
Thus the first additive term in the criterion is the distortion in the instantaneous power dynamics. In an embodiment, the instantaneous late reverberation power in the power gain penalty term is raised to a power larger than unity. In an embodiment, the late reverberation power in the power gain penalty term is raised to a power 2. A power of 2 facilitates the mathematical analysis for calibrating the mapping function. An increase of l past a critical value causes the power gain penalty to outweigh the distortion, and induces an inversion in the modification direction.
For speech signals in a reverberant environment, the intelligibility is reduced because the late reverberation from earlier speech overlaps and masks the current speech. Increasing the power of the speech in order to increase the intelligibility also increases the amount of late reverberation caused, and thus can actually have a detrimental effect on the intelligibility. The penalty term acts to suppress the increase in power subject to the frame importance. Furthermore, above a critical value of late reverberation, the ratio of the modified frame power to the power of the extracted frame decreases with late reverberation. Thus for a particular input frame power and frame importance, as late reverberation increases but remains below the critical value, the prescribed frame power increases. As late reverberation increases further above the critical value, the prescribed frame power decreases. This self-suppressing behaviour allows the system to be used in highly reverberant environments.
The penalty term is configured to increase with l faster than the distortion measure above the critical value. Above the critical value of l, the ratio of the prescribed frame power to the input speech frame power decreases with increasing l.
β and α are bounds for the interval of interest. In other words, β and α bound the optimal operating range. In one embodiment, the parameter α is set to the minimum observed frame power in a sample data set of pre-recorded standard speech data, with normalised variance. In one embodiment, the upper bound β is the highest expected short-term power in the input speech. Alternatively, β is the maximum observed frame power in pre-recorded standard speech data.
fx(x|b) is the probability density function of the Pareto distribution with shape parameter b. The Pareto distribution is given by:
The value of b is obtained from a maximum likelihood estimation for the parameters of the (two-parameter) Pareto distribution fitted to a sample data set, for example the standard pre-recorded speech used to determine α and β. The Pareto distribution may be fitted off-line to variance-equalized speech data, and a value for b obtained. In one embodiment, b is less than 1.
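The off-line fit described above can be sketched as follows. This is a minimal illustration that assumes the classical two-parameter (Type I) Pareto distribution, for which the maximum likelihood estimates have a well-known closed form; the exact distribution used in the embodiment is the one given by the equation above, and the function name `fit_pareto_mle` is hypothetical.

```python
import math


def fit_pareto_mle(frame_powers):
    """Maximum likelihood fit of a classical (Type I) Pareto distribution.

    Assumption: the density is b * x_min**b / x**(b + 1) for x >= x_min.
    Returns (x_min, b), where x_min is the scale estimate (the smallest
    observed frame power) and b is the shape estimate.
    """
    powers = [p for p in frame_powers if p > 0.0]
    if len(powers) < 2:
        raise ValueError("need at least two positive frame powers")
    x_min = min(powers)
    log_sum = sum(math.log(p / x_min) for p in powers)
    if log_sum == 0.0:
        raise ValueError("all frame powers are identical; shape is undefined")
    b = len(powers) / log_sum
    return x_min, b
```

In such a sketch the frame powers would be taken from variance-equalized speech data, so that the fitted shape is comparable across recordings.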
Thus, in an embodiment, the parameter α may be set to the minimum observed frame powers in the data used for fitting fX(x|b) and the parameter β may be set to the maximum observed frame power in the data used to fit fX(x|b). Consistency between the estimates for α and β and the frame powers may be achieved when the utterances in the data used to fit fX(x|b) are the same power as the input speech signal. The power referred to here is a long-term power measured over several seconds, for example, measured over a time scale that is the same as the utterance duration.
In an embodiment, the values of β and α are scaled in real time. If the long-term variance of the input speech signal is not the same as that of the data to which the Pareto distribution is fitted, the parameters of the Pareto distribution are updated accordingly. The long-term variance of the input speech is thus monitored and the values of the parameters β and α are scaled with the ratio of the current input speech signal variance and the reference variance, i.e. that of the sample data. The variance is the long term variance, i.e. on a time scale of 2 or more seconds.
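The real-time scaling of the bounds can be illustrated with a short sketch. The function name `scale_bounds` and its arguments are hypothetical; the long-term variance of the input speech is assumed to be estimated elsewhere, on a time scale of two or more seconds.

```python
def scale_bounds(alpha_ref, beta_ref, input_variance, reference_variance):
    """Scale the bounds alpha and beta by the ratio of the current long-term
    input variance to the variance of the reference (sample) data."""
    if reference_variance <= 0.0:
        raise ValueError("reference variance must be positive")
    ratio = input_variance / reference_variance
    return alpha_ref * ratio, beta_ref * ratio
```

For example, if the input speech has twice the long-term variance of the sample data, both bounds are doubled.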
Values for b, α and β may be stored in the storage 7 of the system shown in
The first term under the integral in equation (8) is the distortion in the instantaneous power dynamics and the second term is the penalty on the power gain. This distortion criterion is used due to the flexibility and low complexity of the resulting modification. The late reverberant power l is included in the distortion term as additive noise. The term λ is a multiplier for the penalty term. The penalty term also includes a factor l². In general, the penalty term is a function of l, the ratio of the prescribed frame power to the input speech power y/x, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure above a critical value, and wherein λ is a function of the frame importance.
The solution in closed form for the minimum of the functional (8) found by using calculus of variations is:
where c1 and c2 are constants identified by setting the boundary conditions as:
y(α)=α (12)
where
Equation (11) is the solution for the case for w=2. The form of the solution for the more general case where w>1 is:
Where the penalty term is a function other than l raised to the power of w, the solution will have a different form.
The parametrization p(l) ensures that in the absence of reverberation, i.e. where y′(ψ)=1, the input-output (IO) relationship (11) passes the input unchanged, i.e. y=x.
The values for c1 and c2 are thus dependent on λ and are given by:
yi is the prescribed power of the modified speech frame. The prescribed signal gain, i.e. the prescribed modification, for a frame i is thus √(yi/xi), i.e. the square root of the ratio of the prescribed frame power to the power of the input frame.
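As a small illustration of this relationship, the gain can be computed from the two frame powers and applied to the samples of the frame. The names `prescribed_gain` and `apply_gain`, and the guard value `eps` against a zero input power, are assumptions made for this sketch only.

```python
import math


def prescribed_gain(y_i, x_i, eps=1e-12):
    """Signal gain for frame i: square root of prescribed power over input power.

    eps is a hypothetical guard against division by zero for silent frames.
    """
    return math.sqrt(y_i / max(x_i, eps))


def apply_gain(frame, gain):
    """Scale every sample of the (windowed) frame by the signal gain."""
    return [gain * sample for sample in frame]
```

Scaling the waveform by this gain multiplies the frame power by yi/xi, so the modified frame has the prescribed power.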
The integrand is a Lagrangian and λ is a Lagrange multiplier. The distortion criterion is subject to an explicit constraint, i.e. an equality or inequality. In an embodiment, the constraint is
for some value of Q. This prevents the power gain from growing excessively. Q drops out in the formulation of the Euler-Lagrange equation, and the constraint is thus implicit in equation (8). In order to incorporate the frame importance, the term λ is parametrized such that it has a dependence on the frame importance through υ. The frame importance is introduced to limit the increase of the gain. This avoids introducing the frame importance through Q, e.g. by making Q a function of the frame importance through υ, and determining the value of λ once the solution to the Euler-Lagrange equation is found. Calibration is also performed to determine the value for λ, as described below. Calibration is used to set the turning point in the gain with increase in late reverberation power.
A value for λ for each frame may be calculated as described below. The value of λ for the target frame i is calculated in step S105.
An increase in the late reverberation power induces an increase in the speech output power. This behaviour can lead to instability due to recursive increase of signal power. In other words, increasing the speech power in a reverberant environment also increases the power of the late reverberation. The penalty term prevents this recursive increase and instability. The penalty term means that there is a critical value of late reverberant power l̃, above which the power gain, i.e. the ratio of the prescribed frame power to the power of the extracted frame, starts to decrease.
If the critical value is too high, too much reverberation is generated. This is prevented by calibration of the system, described below. The calibration is realised by determining the expressions for λ below. During processing of the speech, a value of λ for each frame is calculated from the expressions.
For any value of late reverberant power l and multiplier λ there is a maximum boosting power (MBP). The MBP is the crossing point of the power mapping curve y(x), i.e. which provides the prescribed frame power, and the function y=x. An input speech power below the MBP is boosted and an input speech power above the MBP is suppressed.
As a result of the calibration, at low values of late reverberant power, the MBP is allowed to increase with increasing late reverberation power. There is also a dependence on the frame importance. Above the critical value of late reverberant power, the MBP decreases, again depending on the frame importance.
The calibration of the system and the derivation of the expressions for λ is described below.
The desired upper bound of the input-output power map is represented by a maximum boosting power β. As described above, β may be the maximum observed frame power in pre-recorded standard speech data for example. λ̃ is the Lagrange multiplier for which the input-output power map achieves this upper bound β at l=l̃, i.e. where:
y(x=β|l=l̃,λ=λ̃)=β (16)
For λ=λ̃, the MBP changes direction at l=l̃: for l<l̃ the MBP increases with l, and for l>l̃ the MBP decreases with increasing l.
Rearranging (16) along the powers of l gives the quadratic form:
Al²+Bl+C=0 (17)
The single root condition B²−4AC=0 identifies the turning point of the input-output power map. Solving (11) for λ gives:
Mapping curves for different reverberation power levels and for λ=λ̃ are shown in
The frame importance is also included in the calculation of λ. It prevents the MBP increase with late reverberant power below the critical value from exceeding a value νξ, and prevents too much suppression of a frame with a large amount of information content when the MBP is decreasing. An expression for λ is derived which provides a particular MBP. This is used to determine expressions for λ which control the increase and decrease of the MBP.
An expression for λ that achieves a particular MBP for any value of l is derived below.
Solving the expression:
y(x=υ,l,λ=λν)=υ (19)
for λ as for (16) yields the expression:
λν is the value of λ corresponding to a prescribed frame power y(x=ν,l, λ=λν)=ν. The fractional polynomial function (11), with derivative y′(ψ)≥0, is guaranteed to be monotonically increasing on x∈(α, ψ) for λ=λν, ν>α. Where λ=λν the MBP is fixed to the value ν, regardless of the late reverberant power l.
This formula can be used to calculate a value for λν
λν
In an embodiment, the sigmoid:
with slope s and range limits L=α and H=ρ is used to map ξ to a maximum boosting power νξ in the log domain.
This provides a smooth mapping between frame importance and MBP.
Where λ=λν
For the descent of the MBP, i.e. in the region l>1, an expression for λ
Where λ=λ
In an embodiment, the sigmoid:
with slope s and range limits L=α and H=νξ is used to map
to a maximum boosting power
This ensures that ν∈[α,νξ] and gives a lower-bounded input-output power map.
By introducing a dependence on ξ, through λ
Thus for each frame of the input speech signal, the value of λ̃ is calculated from (18). The critical value of the late reverberation power l̃ is then derived as
Although λ̃ depends on l through ρ, in practice, the exponential convergence rate in ρ→0 with the increase of l indicates that l̃ does not vary for large l. Thus in an alternative embodiment, a single reference value for λ̃ and l̃ can be used.
The constants used in the expressions for λ
For each inputted speech frame, if l≤l̃, where l̃ is the critical value calculated for that frame, the value for λ for the frame is calculated from:
λ=max(λν
If l>l̃, the value of λ for the frame is calculated from:
λ=λ
An input speech power below the MBP is boosted and an input speech power above the MBP is suppressed. In high reverberation, the MBP is reduced, leading to a larger suppression and a smaller boosting range of powers.
The value of λ for the target frame i is calculated using equation (27) or (28), depending on the value of l relative to the critical late reverberation power. Establishing a connection between the frame importance parameter ξ and λ provides the possibility for short-term power suppression or power boosting as a function of the redundancy in the speech signal.
Once a value for λ has been calculated for the frame, values for c1 and c2 can be calculated. These values can then be substituted into (11) to compute the prescribed frame power yi. The signal gain applied to the input speech signal can then be calculated from the prescribed frame power. In an embodiment, the modification is applied to the input speech signal by modifying the signal spectrum, using the signal gain gi. In this case a signal gain gi is calculated from the prescribed modified frame power.
In an embodiment, the signal gain calculated from the prescribed frame power is smoothed before being applied to the input speech signal. This is step S106.
The smoothed signal gain applied to the frame of the speech received from the speech input may be calculated from:
g̈i=min(u,gi) if gi>1
g̈i=max(d,gi) if gi≤1 (29)
where gi is the signal gain calculated from the prescribed frame power, with gi²=yi/xi, yi being the prescribed frame power and xi being the frame power of the speech received from the speech input, g̈i is the smoothed signal gain and where:
where s and ϕ are constants and ξi is the frame importance, and U and D are selected to give the upward and downward limit rates respectively. The operating rates converge to the limit rates with ξ.
The term U·gi^(1/ϕ), i.e. U multiplied by the ϕ-th root of gi, leads to greater power increase for weak transient components, without leading to excessive boosting elsewhere. If the input speech frame has a low frame power, and in particular if it has a high frame importance, for example a transient, the prescribed signal gain will be very high. In general this gives gi>>1. This term thus allows for a stronger gain for such transients. In an embodiment ϕ=3. In an alternative embodiment, there is a range of possible values for ϕ, and a value is selected for each frame depending on some characteristic of the frame. For example, ϕ=ϕ1 if over 50% of the spectral energy of a frame sits in a high-frequency region and ϕ=ϕ2 if over 50% of the spectral energy of a frame sits in a low-frequency region.
This form of smoothing has the effect of limiting the rate of change of the signal gain, without smearing frame importance across adjacent frames, such that:
D≤g̈i≤U·gi^(1/ϕ) (32)
By controlling the rate of change, the modified signal has less perceptual distortion.
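The limiting rule of equation (29) can be sketched as below. The operating rates u and d are defined by the equations given above and are simply taken as inputs here; the function name `limit_gain` is hypothetical.

```python
def limit_gain(g_i, u, d):
    """Limit the signal gain as in equation (29).

    g_i: signal gain calculated from the prescribed frame power.
    u:   upper operating rate, applied when the frame is boosted (g_i > 1).
    d:   lower operating rate, applied when the frame is suppressed (g_i <= 1).
    """
    if g_i > 1.0:
        return min(u, g_i)
    return max(d, g_i)
```

With illustrative values u=1.3 and d=0.4, a prescribed gain of 3.0 is limited to 1.3, a prescribed gain of 0.2 is limited to 0.4, and a prescribed gain of 1.1 passes unchanged.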
In an embodiment, there is a different rate for gi>1 and gi≤1, i.e. a different value of s for equation (30) and (31).
In an alternative embodiment, u is calculated from
In an alternative embodiment, the signal gain is instead smoothed using a relative constraint. Equations (29) and (32) above are replaced with equations (29a) and (32a) below:
Step S107 is “Modify speech frame”. The windowed waveform corresponding to the input speech frame is scaled by g̈i. The modification is thus the signal gain, calculated from equation (29) above for example. In an embodiment, the modification is applied to the input speech signal by modifying the signal spectrum, using the smoothed signal gain
In the above described embodiments, the prescribed frame power is derived by optimizing a distortion measure that models the effect of late reverberation, subject to a penalty term. The signal gain is then calculated from the prescribed frame power.
The modification utilizes an explicit model of late reverberation and optimizes the frame power for the impact of the late reverberation which is locally treated as additive noise in a distortion measure. Any arbitrary distortion criterion for speech in noise can be used for the modification.
The modification mitigates the impact of late reverberation. Late reverberation can be modelled statistically due to its diffuse nature. At a particular time instant, late reverberation can be seen as additive noise that, given the time offset to the generation instant, or the time separation to its origin, can be assumed to be uncorrelated with the direct or shortest path speech signal. Boosting the signal is an effective intelligibility-enhancing strategy for additive noise since it improves the detectability of the sound. Suppressing this boosting above a critical late reverberation noise prevents excessive reverberation.
In an embodiment, the modified speech frames are simply overlap-added at this point, and the resulting enhanced speech signal is output.
Further speech enhancement is achieved by introducing an additional modification dimension. Under reverberation, boosting the signal can be counter-productive, as the boosted signal generates more noise in the future. Overlap-masking between sounds caused by acoustic echoes is a major contributor to the loss in intelligibility. Time-scaling reduces the effective overlap-masking between closely-situated sounds. Extending portions of the signal by time scaling results in reduced masking in these portions from previous sounds, as the late reverberation power decays exponentially with time. This effect improves intelligibility but also reduces the transmission rate. Slowing down the signal reduces the overlap-masking between closely situated sounds and improves intelligibility, but also slows down the transfer of information.
In an embodiment in which the system is configured to apply a modification which produces a modified frame power and a subsequent time scale modification, the time scale modification is performed in step S108.
Step S108 is “Warp time scale”. In general, time scaling improves intelligibility by reducing overlap-masking among different sounds. The time-warping functionality searches for the optimal lag when extending the waveform. The method allows for local warping. Time warping occurs when the frame power is reduced below that of the unmodified input frame power and when the late reverberation power is above the critical value.
In this step, it is first determined whether the smoothed signal gain g̈i is less than 1 and whether l is greater than l̃. If both these conditions are fulfilled then, using the history of the output signal y, the correlation sequence ryy(k) for a frame i is computed as:
where T is the frame duration (in seconds). The value for T may be stored in the storage 7 of the system shown in
The optimal lag, k*, is then calculated from:
where the lag is a discrete time index, or sample index and K1 and K2 are the minimum and maximum lag of the search interval. In an embodiment, K1 and K2 are constants. In an embodiment, K1 is 0.003 fs and K2 is 0.02 fs. The optimal lag is identified by the highest peak in the correlation function.
The modified frames after the overlap and add process performed in step S109 of
In the time scale modification process, a new frame yi is output from step S107 of
All frames are overlap-added to the buffer in this manner. However, if the following conditions are met then the time will be warped around this point, in the manner described in the following steps, the following conditions being that 1) the smoothed signal gain is less than 1, 2) l is greater than l̃, and 3) the maximum correlation is greater than a threshold value. The time warp is thus only initiated when suppression occurs while in “descent” mode, i.e. when reverberation is high and l is greater than l̃. If suppression occurs when l≤l̃, for example due to low information content and high power of the frame, this will not be accompanied by time warp.
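The three conditions can be collected into a single test, sketched below. The function and argument names are hypothetical; the maximum correlation and the threshold are assumed to have been computed as described in the steps that follow.

```python
def should_time_warp(smoothed_gain, l_power, l_critical, max_correlation, threshold):
    """Return True only when all three time-warp conditions hold:
    1) the frame is suppressed (smoothed gain below 1),
    2) the late reverberation power exceeds the critical value,
    3) the maximum correlation exceeds the threshold.
    """
    return (
        smoothed_gain < 1.0
        and l_power > l_critical
        and max_correlation > threshold
    )
```

A suppressed frame in low reverberation (l at or below the critical value) therefore never triggers a time warp, whatever its correlation.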
In step S108, it is desired to determine a time scale modification amount that will time warp the signal without introducing discontinuities. This involves calculating the correlation, from equation (33), of the “last frame” of the signal with a target segment of the buffer signal, starting from k=K1 in equation (33). This is repeated for target segments corresponding to k=K1+1 to k=K2. This corresponds to step S702 of the time scale modification process.
The value of k corresponding to the maximum peak in the correlation function gives the optimum lag k*. This is determined in step S703 of the time scale modification process.
In step S704, it is determined whether the value of the maximum correlation is larger than a threshold value.
In an embodiment, the threshold value is the correlation value at a lag of k=0, i.e. of the last segment, multiplied by Ω, where Ω∈(0, 1). The correlation value at a lag of k=0 is the energy of the frame.
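The lag search and threshold test of steps S702 to S704 can be sketched as follows. This is an illustration under stated assumptions, not the correlation of equation (33) itself: the correlation at lag k is taken as the plain inner product between the last frame in the buffer and the segment of equal length starting k samples earlier, and the default Ω of 0.5 is a placeholder. The name `find_optimal_lag` is hypothetical.

```python
def find_optimal_lag(buffer, frame_len, k_min, k_max, omega=0.5):
    """Search lags k_min..k_max for the highest correlation peak.

    Returns (k_star, r_max, accept), where accept is True when r_max exceeds
    omega times the lag-0 correlation (the energy of the last frame).
    Assumption: unnormalised inner-product correlation.
    """
    if len(buffer) < frame_len + k_max:
        raise ValueError("buffer too short for the requested lag range")
    start = len(buffer) - frame_len
    last_frame = buffer[start:]

    def correlation(k):
        segment = buffer[start - k:start - k + frame_len]
        return sum(a * b for a, b in zip(last_frame, segment))

    energy = correlation(0)
    k_star = max(range(k_min, k_max + 1), key=correlation)
    r_max = correlation(k_star)
    return k_star, r_max, r_max > omega * energy
```

With fs=16 kHz, the search interval K1=0.003 fs to K2=0.02 fs corresponds to lags of 48 to 320 samples. For a strongly periodic buffer the returned lag equals the period, while for a transient the peak stays below the threshold and no warp is applied.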
In an embodiment, the threshold value corresponds to the condition that the time warp is only performed if the condition
is fulfilled. This condition prevents distortion due to attempting to warp a transient for example.
If the conditions are fulfilled, the time warping is applied. In another embodiment, the number of consecutive time-warps is limited to two, in order to prevent over-periodicity.
The buffer signal is then extracted from this point on, i.e. the segment of the buffer signal from k=k* to the end of the buffer is replicated in step S704, and this is overlap-added with the “last frame” from the point k=0 in step S705. In an embodiment, the overlap-add is on a scale twice as large as that of the frame-based processing. In an embodiment, the waveform extension is overlap-added using smooth complementary “half” windows in the overlap area.
This overlap-adding therefore results in left over, or extra, samples at the end of the buffered signal, containing the “last frame”. This is the signal extension or the time warp effect.
In S109 therefore, the waveform extension is extracted from the position identified by k* and overlap-added to the last frame using complementary windows of appropriate length. The waveform extension is overlap-added using smooth “half” windows in the overlap area. Finally the end of the extension is smoothed, using the original overlap-add window to prepare for the next frame.
Speech intelligibility in reverberant environments decreases with an increase in the reverberation time. This effect is attributed primarily to late reverberation, which can be modelled statistically and without knowledge of the exact hall geometry and positions of the speaker and the listener. The system described above uses a low-complexity speech modification framework for mitigating the effect of late reverberation on intelligibility. Distortion in the speech power dynamics, caused by late reverberation, triggers multi-modal modification comprising adaptive gain control and local time warping. Estimates of the late reverberation power allow for context-aware adaptation of the modification depth.
The system is adaptive to the environment, and provides multi-modal modification, i.e. gain control and local time scale modification, over a wide operating range. The system uses a distortion criterion. The closed-form minimizer of the distortion criterion is parameterized in terms of a continuous measure of frame importance, for more efficient use of signal power. The system operates with low delay and complexity, which allows it to address a wide range of applications. The modularity of the framework facilitates incremental sophistication of individual components.
Step S201 is “Extract frame xi”. This corresponds to step S101 shown in the framework in
In one embodiment, the duration of the frame is between 10 and 32 ms. For these frame durations, the signal can be considered stationary. In one embodiment, the duration of the frame is 25 ms.
In one embodiment, the frame overlap is 50%. A 50% frame overlap may reduce discontinuities between adjacent frames due to processing.
Any sampling frequency reasonable for speech signal processing can be used. In an embodiment the sampling frequency may be between 1 and 50 kHz. In an embodiment, the sampling frequency fs=16 kHz. In one embodiment, fs=8 kHz.
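Frame extraction with the example values above (25 ms frames, 50% overlap, fs=16 kHz) can be sketched as follows. The function name `extract_frames` is hypothetical, and windowing is left out of this sketch.

```python
def extract_frames(signal, fs=16000, frame_ms=25.0, overlap=0.5):
    """Split a signal into overlapping frames.

    With the defaults, each frame holds 400 samples and consecutive frames
    start 200 samples apart. Trailing samples that do not fill a whole
    frame are dropped in this sketch.
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))
    if hop < 1:
        raise ValueError("overlap too large for the chosen frame length")
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]
```

A 1000-sample signal at 16 kHz yields four frames in this sketch, starting at samples 0, 200, 400 and 600.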
Step S202 is “Compute frame importance”. This corresponds to step S102 in the framework shown in
The frame importance is a measure of the dissimilarity of the frame to the previous frame. In one embodiment, the frame importance is given by equation (1) above. The output from step S202 is ξi, the frame importance of the frame i.
In an embodiment, m contains MFCC orders 1 to 12.
Step S203 is “Calculate late reverberation signal”.
In an embodiment, a late reverberation signal is calculated by modelling the contribution of the late reverberation to the reverbed signal frame. In one embodiment, the late reverberation can be modelled accurately to reproduce closely the acoustics of a particular hall. In alternative embodiments, simpler models that approximate the masking power due to late reverberation can be used. Statistical models can be used to produce the late reverberation signal. In an embodiment, the Velvet Noise model can be used to model the contribution due to late reverberation. Any model that provides a late reverberation power estimate may be used.
In one embodiment, the late reverberation signal l̂ is calculated from equation (7) above. A sample-based late reverberation signal l̂ is computed. For a frame i, the value of l̂[k] for each value of k is determined, resulting in a set of values l̂, where each value corresponds to a value of k for the frame. An approximation to the masking signal l̂, which is the late reverberation, for the duration of the target frame is thus computed from equation (7) above.
This step corresponds to step S103 in the framework shown in
The reverberation time for the intended acoustic environment may be measured, and this measured value is used as the value of RT60. Alternatively, an estimated value based on previous studies of similar environments is used. Alternatively, the reverberation time can be derived from a model, for example, if the dimensions and the surface reflection coefficients are known.
In one embodiment, tl=90 ms. In one embodiment, tl=50 ms. In one embodiment, tl is extracted from a model RIR based on knowledge of the intended acoustic environment. Alternatively, tl is extracted from the measured RIR. Alternatively, an estimated value based on previous studies of similar environments is used.
Step S204 is “Compute powers”. In an embodiment, this corresponds to step S104 in
In one embodiment, the input signal frame power xi and late reverberation frame power li are calculated from the input signal xi and l̂i, output from step S203. The late reverberation frame power li is thus calculated from a model of the contribution of the late reverberation to the reverbed speech frame.
In an alternative embodiment, the input signal band powers and the late reverberation band powers are calculated from the input signal xi and l̂i, output from step S203. In other words the power in each of two or more frequency bands is calculated from the input signal xi and l̂i, output from step S203. These may be calculated by transforming the frame of the speech received from the speech input and the late reverberation signal into the frequency domain, for example using a discrete Fourier transform. Alternatively, the calculation of the power in each frequency band may be performed in the time domain using a filter-bank.
In an embodiment, the bands are linearly spaced on a mel scale. In an embodiment, the bands are non-overlapping. In an embodiment, there are 10 frequency bands.
The bands of the input speech frame are then ordered in order of descending power and the bands corresponding to a predetermined fraction of the total frame power in descending order are then determined. The frame power of the late reverberation signal is then determined as the sum of the powers in the bands determined for the corresponding input speech frame. The frame power of the late reverberation signal is thus calculated by summing the band powers of the bands determined from the input speech frame.
In this embodiment, the late reverberation frame power is computed from certain spectral regions only. The spectral regions are determined for each frame by determining the spectral regions of the input speech frame corresponding to the highest powers, for example, the highest power spectral regions corresponding to a predetermined fraction of the frame power. The input signal full band power xi can be calculated by summing the band powers.
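The band selection described above can be sketched as follows, taking the band powers as already computed. The function name and the default `fraction` of 0.8 are assumptions for illustration; the embodiment only requires some predetermined fraction of the total frame power.

```python
def late_reverb_power_in_dominant_bands(speech_band_powers, reverb_band_powers,
                                        fraction=0.8):
    """Sum the late reverberation power over the dominant speech bands.

    Bands of the input speech frame are visited in order of descending power
    until they account for `fraction` of the total frame power. Returns
    (late reverberation frame power, full-band speech power, selected bands).
    """
    total = sum(speech_band_powers)
    order = sorted(range(len(speech_band_powers)),
                   key=lambda band: speech_band_powers[band], reverse=True)
    selected = []
    accumulated = 0.0
    for band in order:
        if accumulated >= fraction * total:
            break
        selected.append(band)
        accumulated += speech_band_powers[band]
    l_power = sum(reverb_band_powers[band] for band in selected)
    return l_power, total, sorted(selected)
```

With speech band powers [5, 1, 3, 1] and a fraction of 0.8, only the first and third bands are selected, so the late reverberation power in the other two bands does not contribute to li.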
In an embodiment, a prescribed frame power yi is then calculated from a function of the input signal frame power xi, the measure of the frame importance and the late reverberation frame power li. The function is configured to decrease the ratio of the prescribed frame power to the power of the extracted input speech frame as the late reverberation frame power li increases above a critical value, l̃.
In an embodiment, a prescribed frame power is calculated that minimizes a distortion measure subject to a penalty term, T, wherein T is a function of l, the ratio of the prescribed frame power to the power of the extracted frame, and a multiplier λ, wherein the function is a non-linear function of l configured to increase with l faster than the distortion measure when the late reverberant power is greater than the critical late reverberation power, and wherein λ is parameterised in terms of the frame importance.
The distortion measure may be the first term under the integral in (8) for example. The penalty term is a penalty on power gain. In an embodiment, the penalty term is that given in (9), where w>1. In one embodiment, w=2.
Step S205 comprises the steps of “Calculate λ, c1 and c2”
The value of λ for each frame is calculated from:
λ=max(λν
λ=λ
where an expression for λ̃ is given in (18), a value for l̃ is calculated from the value of λ̃, an expression for λν
Values for β, α, ψ and are stored in the storage 7. In one embodiment, =0.9. In one embodiment, =0.001. Values for s, which may be required to calculate λ, are also stored in the storage 7. In an embodiment, s is between 1 and 50. In an embodiment, s=15. In an embodiment, s=28. In an embodiment the slopes, s, can be different for the regime in which the MBP is increasing, corresponding to l≤l̃, and the regime in which the MBP is decreasing, corresponding to l>l̃.
λν
Once the value of λ has been calculated for the frame, values for c1 and c2 are calculated using equations (14) and (15).
In step S206, the prescribed frame power yi is calculated from the values of xi, li, b, λi, c1 and c2. In an embodiment, the prescribed frame power that minimizes the distortion measure subject to the penalty term is calculated from:
where b is a constant and w>1. In one embodiment, w=2. A value for b is stored in the storage 7. In an embodiment, b is determined from the Pareto model of training data and may be roughly 0.0981 for example in the full band/single band scenario.
This corresponds to step S105 in the framework in
A modification is calculated using the prescribed frame power and applied to the frame of the speech xi received from the speech input.
In an embodiment, the modification applied to the frame of the speech xi received from the speech input is √{square root over (yi/xi)}.
In an embodiment, smoothing is applied to the modification. This is step S207. The smoothed signal gain may be calculated from (29). Values for U and D may be stored in the storage 7. In an embodiment, U=1.05 and D=0.95. In another embodiment, U=1.3 and D=0.4. In another embodiment, U=1.15 and D=0.15.
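The gain computation and its smoothing can be illustrated with the sketch below. Equation (29) is not reproduced in this section, so the smoothing rule here is an assumption: U and D are read as relative limits on how far the gain may rise or fall from one frame to the next. The function names and the `eps` guard are likewise hypothetical.

```python
import math


def raw_gain(y, x, eps=1e-12):
    """Amplitude gain sqrt(y / x) taking frame power x to prescribed power y."""
    return math.sqrt(y / max(x, eps))  # eps guards against a zero-power frame


def smooth_gain(gain, previous_gain, up=1.05, down=0.95):
    """Limit the frame-to-frame gain change.

    ASSUMPTION: the gain may grow by at most a factor `up` (U) and shrink
    by at most a factor `down` (D) relative to the previous smoothed gain.
    This stands in for equation (29), which is not shown here.
    """
    upper = up * previous_gain
    lower = down * previous_gain
    return min(max(gain, lower), upper)
```

Under this reading, values of U and D close to 1 give slow tracking of the locally optimal gain, while values further from 1 allow faster changes.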
The modified speech frame yi is generated by applying the modification in step S208. In an embodiment, the modification is applied by modifying the signal spectrum, using the signal gain or the smoothed signal gain.
In an embodiment, the modified speech frame is then overlap-added to the enhanced speech signal generated for previous frames in step S209, and the resultant signal is output from output 17.
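A minimal sketch of the overlap-add step is given below. The function name and the array layout (one modified frame per row) are assumptions. Any analysis or synthesis window is taken to have been applied to the frames beforehand.

```python
import numpy as np


def overlap_add(frames, hop):
    """Sum equally sized frames into one signal, advancing `hop` samples per frame."""
    frames = np.asarray(frames, dtype=float)
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + frame_len] += frame  # overlapping regions accumulate
    return out
```

With a 50% overlap (hop equal to half the frame length) and a periodic Hann window, the overlapping windows sum to one in the interior of the signal, so unmodified frames would be reconstructed without amplitude ripple. The choice of window is an illustration only and is not stated in the text.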
Alternatively, a time modification is included before the signal is output. In an embodiment, the time modification is a time warp.
In step S210, it is determined whether the smoothed signal gain is less than 1 and whether l is greater than l̃.
If one of these conditions is not fulfilled, no time scale modification is applied.
If both of these conditions are fulfilled, the maximum correlation and corresponding value of time lag, k* are calculated in step S211. The correlation value for each time lag k is calculated from (33). The maximum correlation value and the corresponding lag, k* are then determined, according to (34).
At this point, it is determined whether the maximum correlation value is above a threshold value, in step S212. In an embodiment, the threshold is a constant value. In another embodiment, the threshold is determined from (35). In an embodiment, Ω=⅔.
If the maximum correlation value is not above the threshold, no time modification is applied. If the maximum correlation is above the threshold, the next step is “Overlap add extension”. In this step, the waveform extension is extracted from the position identified by k* and overlap-added to the last frame.
In an embodiment, the number of consecutive time-warps is limited to two.
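The decision logic of steps S210 to S212 and the lag search can be sketched as follows. Equations (33) to (35) are not reproduced in this section, so a normalised cross-correlation is used here as a stand-in for (33), and the constant-threshold embodiment is assumed for S212. All function and parameter names are hypothetical, and the final overlap-add of the waveform extension is not shown.

```python
import numpy as np


def best_lag(buffer, seg_len, k_min, k_max):
    """Return (k*, max correlation) over lags k_min..k_max.

    ASSUMPTION: the most recent `seg_len` samples are compared with the
    segment that ends k samples earlier, using normalised correlation.
    """
    buffer = np.asarray(buffer, dtype=float)
    ref = buffer[-seg_len:]
    best_k, best_c = None, -np.inf
    for k in range(k_min, k_max + 1):
        start = buffer.size - seg_len - k
        if start < 0:
            break  # not enough history for this lag
        cand = buffer[start:start + seg_len]
        denom = np.sqrt(np.dot(ref, ref) * np.dot(cand, cand))
        c = float(np.dot(ref, cand) / denom) if denom > 0 else 0.0
        if c > best_c:
            best_k, best_c = k, c
    return best_k, best_c


def warp_allowed(smoothed_gain, l, l_crit, max_corr, threshold,
                 consecutive_warps, max_consecutive=2):
    """Combine the S210 and S212 conditions with the limit on consecutive warps."""
    return bool(smoothed_gain < 1.0 and l > l_crit
                and max_corr > threshold
                and consecutive_warps < max_consecutive)
```

For a strongly periodic segment, the search would be expected to return a lag equal to the period, with a correlation close to one.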
The enhanced speech is then output.
In general, the power of the input speech signal is reduced in regions with high redundancy. The masking of transient regions by late reverberation is in turn decreased. This can be measured using the frame importance-weighted SNR (iwSNR), in which the frame-based SNR is weighted by the frame importance. The performance of the system is identical to natural speech when the signal gain modification rates are fixed to unity, and quickly increases as these become more aggressive. The figure shown is for the case of RT60=1.8 s.
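One possible reading of the importance-weighted SNR is sketched below. The exact definition is not given in this section, so the normalised weighted average of per-frame SNR values in dB, the treatment of late reverberation power as the noise term, and all names are assumptions made for illustration.

```python
import numpy as np


def importance_weighted_snr(speech_power, reverb_power, importance, eps=1e-12):
    """Average per-frame SNR (dB), weighted by frame importance.

    ASSUMPTION: the weights are normalised to sum to one, and the "noise"
    in each frame is the late reverberation power.
    """
    s = np.asarray(speech_power, dtype=float)
    r = np.asarray(reverb_power, dtype=float)
    w = np.asarray(importance, dtype=float)
    snr_db = 10.0 * np.log10((s + eps) / (r + eps))  # eps avoids log of zero
    return float(np.sum(w * snr_db) / np.sum(w))
```

Under this definition, frames with low importance contribute little to the score, so reducing power in redundant regions costs little while the reduced masking of important frames raises the measure.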
A subjective test with five native UK English listeners was performed. Five people were sufficient to measure significant (p<0.05) intelligibility improvement over natural speech. The signal gain modification parameter settings are indicated by the position of the red ellipse in
Combining AGC with time warping (TW) allows for a further increase of iwSNR.
Adaptive gain control and time warping (AGCTW) is used to denote the system described in relation to
The AGCTW modified speech was modified based on a prescribed output power, which was calculated from a function of input power, late reverberation power and frame importance. The function minimizes a tailored distortion criterion from the domain of power dynamics subject to a penalty term. Under reverberation-induced suppression, a time warp prevents loss of information. Signal gain smoothing for enhanced perceptual impact is also applied. The method of modification is described in relation to
The parameter settings used are as follows. The training data used to fit fx(x|b) and to determine α and β was a British English recording comprising 720 sentences. The frame duration was 25 ms, and the frame overlap was 50%. tl was 50 ms and was 0.001. The search intervals K1 and K2 were 0.003 fs and 0.02 fs respectively. The sampling frequency was fs = 16 kHz and m contained MFCC orders 1 to 12. The pulse density in i was 2000 s⁻¹. J, the number of frequency bands, was set to 10, Ω was ⅔ and ψ was β4. The values for S, U and D were 15, 1.05 and 0.95 respectively. The relative constraints given in equations (29a) and (32a) were used.
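As a worked example, the settings above can be converted into sample counts. The sketch assumes that the search intervals are read as 0.003·fs and 0.02·fs samples and that tl is converted at the same sampling frequency. The function and key names are hypothetical.

```python
def derived_settings(fs=16_000, frame_ms=25, overlap=0.5,
                     t_l_ms=50, pulse_density=2000):
    """Convert the listed settings into sample counts (illustrative only)."""
    frame_len = fs * frame_ms // 1000          # 25 ms at 16 kHz
    hop = round(frame_len * (1.0 - overlap))   # 50% overlap
    return {
        "frame_len": frame_len,
        "hop": hop,
        "k1": round(0.003 * fs),               # shortest lag searched, in samples
        "k2": round(0.02 * fs),                # longest lag searched, in samples
        "t_l_samples": fs * t_l_ms // 1000,
        # Mean spacing between pulses, in samples, at the stated pulse density.
        "mean_pulse_spacing": fs / pulse_density,
    }
```

At 16 kHz this gives 400-sample frames with a 200-sample hop, a lag search from 48 to 320 samples (3 ms to 20 ms), and on average one pulse every 8 samples.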
Reverberation was simulated using a model RIR obtained with a source-image method. The hall dimensions were fixed to 20 m×30 m×8 m. The speaker and listener locations used for RIR generation were {10 m, 5 m, 3 m} and {10 m, 25 m, 1.8 m} respectively. The propagation delay and attenuation were normalized to the direct sound. Effectively, the direct sound is equivalent to the sound output from the speaker.
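For orientation, the direct path to which the delay and attenuation are normalised can be computed from the stated positions. The speed of sound (343 m/s) and the 16 kHz sampling frequency used to express the delay in samples are assumptions of this sketch, as is the function name.

```python
import math


def direct_path(source, listener, fs=16_000, speed_of_sound=343.0):
    """Return (distance in m, delay in s, delay in samples) of the direct sound."""
    distance = math.dist(source, listener)  # Euclidean distance
    delay_s = distance / speed_of_sound     # ASSUMPTION: c = 343 m/s
    return distance, delay_s, delay_s * fs


distance, delay_s, delay_samples = direct_path((10.0, 5.0, 3.0), (10.0, 25.0, 1.8))
```

The speaker and listener are roughly 20 m apart, which corresponds to a propagation delay of about 58 ms. Normalising to the direct sound removes this delay and the associated attenuation from the model RIR.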
AGCTW decreased the power by 31%, 30% and 29% respectively, averaged over all data.
Under reverberation, aggressive modifications may be detrimental. Slower tracking of the locally optimal power gain therefore produces smoother signals and enhances intelligibility. There is a gradual elongation of the modified waveforms as the reverberation time increases, and smoothness is also achieved with respect to the extent of time warping.
The signal duration gradually increases with RT60 up until saturation, to accommodate higher late reverberation power. Limiting the number of consecutive time-warps to two reduces over-periodicity. AGCTW has a low algorithmic delay due to the causality of the importance estimator. The method complexity is low, with late reverberation waveform computation as the most demanding task.
In an embodiment, real-time processing is achieved by accounting for the sparsity of h̃ from eq. (2). The model RIR is long, in order to reflect the reverberation time, so a direct convolution is slow. In practice, the pulse locations in the model for the late reverberation part of the RIR are known, and this can be used to reduce the number of operations.
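The saving from sparsity can be illustrated by a convolution that visits only the known pulse locations. This is a sketch under the assumption that the late part of the model RIR is available as a list of pulse positions and amplitudes. The function and argument names are hypothetical.

```python
import numpy as np


def sparse_convolve(signal, pulse_positions, pulse_amplitudes):
    """Convolve `signal` with a sparse impulse response.

    Each pulse adds one scaled, delayed copy of the signal, so the cost
    grows with the number of pulses rather than the full RIR length.
    """
    signal = np.asarray(signal, dtype=float)
    positions = np.asarray(pulse_positions, dtype=int)
    amplitudes = np.asarray(pulse_amplitudes, dtype=float)
    if positions.size == 0:
        return np.zeros(signal.size)
    out = np.zeros(signal.size + int(positions.max()))
    for p, a in zip(positions, amplitudes):
        out[p:p + signal.size] += a * signal
    return out
```

The result matches a dense convolution with an impulse response that is zero everywhere except at the pulse positions, while skipping the multiplications by zero.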
The signal modification framework described in relation to
Sufficiently high reverberation reduces speech intelligibility. Degradation of intelligibility can be encountered in large enclosed environments for example. It can affect public announcement systems and teleconferencing. Degradation of intelligibility is a more severe problem for the hard of hearing population.
Reverberation reduces modulation in the speech signal. The resulting smearing is seen as the source of intelligibility degradation.
Speech signal modification provides a platform for efficient and effective mitigation of the intelligibility loss.
The framework in
The modification is characterized by a low processing delay and a low complexity. In an embodiment, the most computationally costly operations are the search for the optimal lag k*, the MFCC computation in the frame redundancy estimator and the convolution with h̃ in equation (2).
The modification can significantly improve intelligibility in reverberant environments.
In some embodiments, the system implements context awareness in the form of adaptation to reverberation time RT60 and local speech signal redundancy. The system allows modification optimality as a result of using an auditory-domain distortion criterion in determining the depth of the speech modification. The system allows simultaneous and coherent modification along different signal dimensions allowing for reduced processing artefacts.
In some embodiments, the system is based on a general theoretical framework that facilitates method analysis.
In some embodiments, the system can be used for public announcements in enclosed spaces such as train stations, airports, lecture halls, tunnels and covered stadiums. Alternatively, the system can be used for teleconferencing or disaster prevention systems.
As described above,
The framework provides a unified and general framework that combines context-awareness with multi-modal modifications. These support good performance in a wide range of conditions. The information content, or importance, of a speech segment is measured, and this information is used when optimizing the modification.
Speech intelligibility in reverberant environments decreases due to overlap-masking caused by late reverberation. Similar to additive noise, stronger reverberation induces a higher degradation. For reverberation, speech modification at a given time affects reverberation at a later time. Taking into account the specifics of the problem, a tailored distortion criterion from the domain of power dynamics is minimized to determine the optimal output power. The closed form solution depends on the late reverberation power and is parametrized in terms of the redundancy in the speech signal enabling context-aware modification.
In some embodiments, power suppression due to excessive reverberation is assisted by a time warp to mitigate possible loss of intelligibility cues. Multi-modal modifications offer an extended operating range and reduction in processing distortions. The method results in a significant improvement over natural speech in moderate-to-severe reverberation conditions.
In some embodiments, overlapping frames are extracted from the input speech signal and labelled according to their importance. A model of late reverberation predicts the concurrent late reverberation power. The optimal full-band output power is computed from the input power, late reverberation power and frame importance. Frame-based estimates are used in place of instantaneous power. The output power is smoothed to prevent distortion. The modified signal frame is synthesized and added to the buffer. In case of power reduction, the time is warped, conditional on the late reverberant power.
In some embodiments, enhancement of speech intelligibility in reverberant environments is achieved by jointly modifying spectral and temporal signal characteristics. Adapting the degree of modification to external (acoustic properties of the environment) and internal (local signal redundancy) factors offers scalability and leads to a significant intelligibility gain with low level of processing artefacts.
The speech intelligibility enhancing systems described above achieve significant speech intelligibility improvement in reverberant environments. The speech modification is performed based on a distortion criterion, which allows good adaptation to the acoustic environment. The speech intelligibility enhancing systems have good generalization capabilities and performance. The operating range extends to environments with heavy reverberation. In some embodiments, the speech intelligibility enhancing systems utilise simultaneous and coherent gain control and time warp. In some embodiments, the speech intelligibility enhancing systems provide a parametric perceptually-motivated approach to smoothing the locally-optimal gain.
In some embodiments, speech intelligibility enhancing systems use multi-band processing in a part of the processing chain.
In some embodiments, the notion of information content of a segment is approximated by the frame importance. Remaining in a deterministic setting, the adopted parameter space is capable of generalising the information content with a high resolution.
In some embodiments, late reverberation is modelled as noise and a distortion criterion is optimised. A distortion criterion targeting reverberation may be used.
In some embodiments, time warping occurs during signal suppression. The extent of time warping adapts to both the local speech properties and the acoustic environment.
Due to its diffuse nature, late reverberation can be modelled statistically. At a particular instant late reverberation can be treated as additive noise, uncorrelated with the signal due to differences in propagation time. Boosting the signal creates more reverberation “noise”, whereas slowing down the signal reduces the overlap-masking, but also reduces the information transfer rate. In some embodiments, a combination of adaptive gain control and time warping during power suppression is provided. This may be effective in particular for environments with reverberation time below two seconds for example.
In some embodiments, the speech intelligibility enhancing systems are adaptive to the environment and provide multi-modal modification, i.e. both time warping and adaptive gain control. This extends the operating range. Use of high-resolution frame importance may lead to more efficient use of signal power. Parametric smoothing of the locally-optimal gain may be included, to allow for further tuning and processing constraints.
In some embodiments, the speech intelligibility enhancing systems provide low delay and complexity and allow for addressing a wide range of applications. Furthermore, the framework modularity facilitates incremental sophistication of individual components.
In some embodiments, apart from a short processing delay, the system is causal and therefore suitable for on-line applications.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
1605750.7 | Apr 2016 | GB | national |
Number | Date | Country | |
---|---|---|---|
20170287498 A1 | Oct 2017 | US |