1. Field of the Invention
The present invention relates generally to audio processing, and more particularly to processing an audio signal to suppress noise.
2. Description of Related Art
Currently, there are many methods for reducing background noise in an adverse audio environment. A stationary noise suppression system suppresses stationary noise by either a fixed or a varying number of dB. A fixed suppression system suppresses stationary or non-stationary noise by a fixed number of dB. The shortcoming of the stationary noise suppressor is that non-stationary noise will not be suppressed, whereas the shortcoming of the fixed suppression system is that it must suppress noise by a conservative level in order to avoid speech distortion at low signal-to-noise ratios (SNR).
Another form of noise suppression is dynamic noise suppression. A common type of dynamic noise suppression system is based on SNR. The SNR may be used to determine a degree of suppression. Unfortunately, SNR by itself is not a very good predictor of speech distortion due to the presence of different noise types in the audio environment. SNR is a ratio indicating how much louder speech is than noise. However, speech may be a non-stationary signal which may constantly change and contain pauses. Typically, speech energy, over a given period of time, will include a word, a pause, a word, a pause, and so forth. Additionally, stationary and dynamic noises may be present in the audio environment. As such, it can be difficult to accurately estimate the SNR. The SNR averages all of these stationary and non-stationary speech and noise components. There is no consideration in the determination of the SNR of the characteristics of the noise signal—only the overall level of noise. In addition, the value of the SNR can vary based on the mechanisms used to estimate the speech and noise, such as whether it is based on local or global estimates, and whether it is instantaneous or computed over a given period of time.
To overcome the shortcomings of the prior art, there is a need for an improved noise suppression system for processing audio signals.
The present technology provides a robust noise suppression system that may concurrently reduce noise and echo components in an acoustic signal while limiting the level of speech distortion. An acoustic signal may be received and transformed to cochlear-domain sub-band signals. Features, such as pitch, may be identified and tracked within the sub-band signals. Initial speech and noise models may then be estimated at least in part from a probability analysis based on the tracked pitch sources. Improved speech and noise models may be resolved from the initial speech and noise models, and noise reduction may be performed on the sub-band signals. An acoustic signal may be reconstructed from the noise-reduced sub-band signals.
In an embodiment, noise reduction may be performed by executing a program stored in memory to transform an acoustic signal from the time domain to cochlea-domain sub-band signals. Multiple sources of pitch may be tracked within the sub-band signals. A speech model and one or more noise models may be generated at least in part based on the tracked pitch sources. Noise reduction may be performed on the sub-band signals based on the speech model and one or more noise models.
A system for performing noise reduction in an audio signal may include a memory, a frequency analysis module, a source inference engine, and a modifier module. The frequency analysis module may be stored in the memory and executed by a processor to transform a time-domain acoustic signal to cochlea-domain sub-band signals. The source inference engine may be stored in the memory and executed by a processor to track multiple sources of pitch within a sub-band signal and to generate a speech model and one or more noise models based at least in part on the tracked pitch sources. The modifier module may be stored in the memory and executed by a processor to perform noise reduction on the sub-band signals based on the speech model and one or more noise models.
The present technology provides a robust noise suppression system that may concurrently reduce noise and echo components in an acoustic signal while limiting the level of speech distortion. An acoustic signal may be received and transformed to cochlear-domain sub-band signals. Features, such as pitch, may be identified and tracked within the sub-band signals. Initial speech and noise models may then be estimated at least in part from a probability analysis based on the tracked pitch sources. Improved speech and noise models may be resolved from the initial speech and noise models, and noise reduction may be performed on the sub-band signals. An acoustic signal may be reconstructed from the noise-reduced sub-band signals.
Multiple pitch sources may be identified in a sub-band frame and tracked over multiple frames. Each tracked pitch source (“track”) is analyzed based on several features, including pitch level, salience, and how stationary the pitch source is. Each pitch source is also compared to stored speech model information. For each track, a probability of being a target speech source is generated based on the features and comparison to the speech model information.
A track with the highest probability may be, in some cases, designated as speech and the remaining tracks are designated as noises. In some embodiments, there may be multiple speech sources, and a “target” speech may be the desired speech with other speech sources considered noise. Tracks with a probability over a certain threshold may be designated as speech. In addition, there may be a “softening” of the decision in the system. Downstream of the track probability determination, a spectrum may be constructed for each pitch track, and each track's probability may be mapped to gains through which the corresponding spectrum is added into the speech and non-stationary noise models. If the probability is high, the gain for the speech model will be 1 and the gain for the noise model will be 0, and vice versa.
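By way of a non-limiting illustration, the following Python sketch shows one way such a soft decision could be implemented; the function name, the identity mapping from probability to gain, and the example numbers are assumptions rather than a description of any particular embodiment.

```python
import numpy as np

def soft_assign_track(track_prob, track_spectrum, speech_model, noise_model):
    """Add a track's spectrum into the speech and noise models using gains
    derived from the track's speech probability (soft decision).

    track_prob      : probability (0..1) that the track is the target talker
    track_spectrum  : per-sub-band energy spectrum estimated for the track
    speech_model    : accumulated speech spectral model (modified in place)
    noise_model     : accumulated non-stationary noise model (modified in place)
    """
    speech_gain = float(np.clip(track_prob, 0.0, 1.0))  # high probability -> gain near 1
    noise_gain = 1.0 - speech_gain                       # complementary gain for the noise model
    speech_model += speech_gain * track_spectrum
    noise_model += noise_gain * track_spectrum
    return speech_model, noise_model

# Example: three tracked pitch sources in a 4-sub-band frame (hypothetical numbers).
speech = np.zeros(4)
noise = np.zeros(4)
tracks = [(0.9, np.array([4.0, 2.0, 1.0, 0.5])),   # likely the target talker
          (0.2, np.array([0.5, 1.5, 2.5, 1.0])),   # likely a distractor
          (0.05, np.array([0.1, 0.2, 0.3, 0.4]))]
for p, spec in tracks:
    soft_assign_track(p, spec, speech, noise)
```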
The present technology may utilize any of several techniques to provide an improved noise reduction of an acoustic signal. The present technology may estimate speech and noise models based on tracked pitch sources and probabilistic analysis of the tracks. Dominant speech detection may be used to control stationary noise estimations. Models for speech, noise and transients may be resolved into speech and noise. Noise reduction may be performed by filtering sub-bands using filters based on optimal least-squares estimation or on constrained optimization. These concepts are discussed in more detail below.
While the microphone 106 receives sound (i.e. acoustic signals) from the audio source 102, the microphone 106 also picks up noise 112. Although the noise 112 is shown coming from a single location in
Acoustic signals received by microphone 106 may be tracked, for example by pitch. Features of each tracked signal may be determined and processed to estimate models for speech and noise. For example, an audio source 102 may be associated with a pitch track with a higher energy level than the noise 112 source. Processing signals received by microphone 106 is discussed in more detail below.
Processor 202 may execute instructions and modules stored in a memory (not illustrated in
The exemplary receiver 200 may be configured to receive a signal from a communications network, such as a cellular telephone and/or data communication network. In some embodiments, the receiver 200 may include an antenna device. The signal may then be forwarded to the audio processing system 204 to reduce noise using the techniques described herein, and provide an audio signal to output device 206. The present technology may be used in one or both of the transmit and receive paths of the audio device 104.
The audio processing system 204 is configured to receive the acoustic signals from an acoustic source via the primary microphone 106 and process the acoustic signals. Processing may include performing noise reduction within an acoustic signal. The audio processing system 204 is discussed in more detail below. The acoustic signal received by primary microphone 106 may be converted into one or more electrical signals, such as, for example, a primary electrical signal and a secondary electrical signal. The electrical signal may be converted by an analog-to-digital converter (not shown) into a digital signal for processing in accordance with some embodiments. The primary acoustic signal may be processed by the audio processing system 204 to produce a signal with an improved signal-to-noise ratio.
The output device 206 is any device which provides an audio output to the user. For example, the output device 206 may include a speaker, an earpiece of a headset or handset, or a speaker on a conference device.
In various embodiments, the primary microphone is an omni-directional microphone; in other embodiments, the primary microphone is a directional microphone.
In operation, an acoustic signal is received from the primary microphone 106, is converted to an electrical signal, and the electrical signal is processed through transform module 305. The acoustic signal may be pre-processed in the time domain before being processed by transform module 305. Time-domain pre-processing may include applying input limiter gains, speech time stretching, and filtering using an FIR or IIR filter.
The transform module 305 takes the acoustic signals and mimics the frequency analysis of the cochlea. The transform module 305 comprises a filter bank designed to simulate the frequency response of the cochlea. The transform module 305 separates the primary acoustic signal into two or more frequency sub-band signals. A sub-band signal is the result of a filtering operation on an input signal, where the bandwidth of the filter is narrower than the bandwidth of the signal received by the transform module 305. The filter bank may be implemented by a series of cascaded, complex-valued, first-order IIR filters. Alternatively, other filters or transforms such as a short-time Fourier transform (STFT), sub-band filter banks, modulated complex lapped transforms, cochlear models, wavelets, etc., can be used for the frequency analysis and synthesis. The samples of the sub-band signals may be grouped sequentially into time frames (e.g. over a predetermined period of time). For example, the length of a frame may be 4 ms, 8 ms, or some other length of time. In some embodiments, there may be no frame at all. The results may include sub-band signals in a fast cochlea transform (FCT) domain.
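As a rough, non-limiting illustration of such a cascade of complex-valued first-order IIR filters, a Python sketch follows; the pole values, the number of bands, and the tap points are assumptions and not a description of the actual cochlea design.

```python
import numpy as np

def complex_iir_filterbank(x, poles):
    """Decompose a time-domain signal into complex sub-band signals using a
    cascade of first-order complex IIR filters (a rough cochlea-like analysis).

    x     : real-valued input signal
    poles : complex pole per sub-band, |pole| < 1; stage k is fed by the output
            of stage k-1, loosely mimicking the traveling-wave structure.
    """
    subbands = []
    stage_in = x.astype(np.complex128)
    for p in poles:
        y = np.zeros_like(stage_in)
        prev = 0.0 + 0.0j
        for n, sample in enumerate(stage_in):
            prev = sample + p * prev          # first-order recursion y[n] = x[n] + p*y[n-1]
            y[n] = prev
        subbands.append(y)
        stage_in = y                           # cascade: feed this stage's output to the next
    return np.array(subbands)                  # shape: (num_subbands, num_samples)

# Hypothetical 3-band analysis of a short test signal.
fs = 8000.0
t = np.arange(0, 0.01, 1.0 / fs)
signal = np.sin(2 * np.pi * 440.0 * t)
centers_hz = [2000.0, 1000.0, 500.0]           # high-to-low, as in a cochlear cascade
poles = [0.95 * np.exp(2j * np.pi * f / fs) for f in centers_hz]
bands = complex_iir_filterbank(signal, poles)
```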
The analysis path 325 may be provided with an FCT domain representation 302, hereinafter FCT 302, and optionally a high-density FCT representation 301, hereinafter HD FCT 301, for improved pitch estimation and speech modeling (and system performance). A high-density FCT may be a frame of sub-bands having a higher density than the FCT 302; a HD FCT 301 may have more sub-bands than FCT 302 within a frequency range of the acoustic signal. The signal path also may be provided with an FCT representation 304, hereinafter FCT 304, after implementing a delay 303. Using the delay 303 provides the analysis path 325 with a “lookahead” latency that can be leveraged to improve the speech and noise models during subsequent stages of processing. If there is no delay, the FCT 304 for the signal path is not necessary; the output of FCT 302 in the diagram can be routed to the signal path processing as well as to the analysis path 325. In the illustrated embodiment, the lookahead delay 303 is arranged before the FCT 304. As a result, the delay is implemented in the time domain in the illustrated embodiment, thereby saving memory resources as compared with implementing the lookahead delay in the FCT-domain. In alternative embodiments, the lookahead delay may be implemented in the FCT domain, such as by delaying the output of FCT 302 and providing the delayed output to the signal path. In doing so, computational resources may be saved compared with implementing the lookahead delay in the time-domain.
The sub-band frame signals are provided from transform module 305 to an analysis path 325 sub-system and a signal path sub-system. The analysis path 325 sub-system may process the signal to identify signal features, distinguish between speech components and noise components of the sub-band signals, and generate a modification. The signal path sub-system is responsible for modifying sub-band signals of the primary acoustic signal by reducing noise in the sub-band signals. Noise reduction can include applying a modifier, such as a multiplicative gain mask generated in the analysis path 325 sub-system, or applying a filter to each sub-band. The noise reduction may reduce noise and preserve the desired speech components in the sub-band signals.
Feature extraction module 310 of the analysis path sub-system 325 receives the sub-band frame signals derived from the acoustic signal and computes features for each sub-band frame, such as pitch estimates and second-order statistics. In some embodiments, a pitch estimate may be determined by feature extraction module 310 and provided to source inference engine 315. In some embodiments, the pitch estimate may be determined by source inference engine 315. The second-order statistics (instantaneous and smoothed autocorrelations/energies) are computed in feature extraction module 310 for each sub-band signal. For the HD FCT 301, only the zero-lag autocorrelations are computed and used by the pitch estimation procedure. The zero-lag autocorrelation may be a time sequence of the previous signal multiplied by itself and averaged. For the middle FCT 302, the first-order lag autocorrelations are also computed since these may be used to generate a modification. The first-order lag autocorrelations, which may be computed by multiplying the time sequence of the previous signal with a version of itself offset by one sample, may also be used to improve the pitch estimation.
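A simplified Python sketch of computing zero-lag and first-lag autocorrelations with leaky (exponential) smoothing is shown below; the smoothing constant and frame layout are illustrative assumptions.

```python
import numpy as np

def subband_second_order_stats(subband, alpha=0.9):
    """Compute smoothed zero-lag and first-lag autocorrelations for one complex
    sub-band signal over a frame.

    subband : complex sub-band samples for the current frame
    alpha   : smoothing constant for the leaky average (assumed value)
    """
    r0_inst = np.abs(subband) ** 2                       # zero-lag: signal times its own conjugate
    r1_inst = subband[1:] * np.conj(subband[:-1])        # first-lag: signal times itself offset by one sample
    # Leaky (exponentially weighted) averages over the frame.
    r0_smooth, r1_smooth = 0.0, 0.0 + 0.0j
    for v in r0_inst:
        r0_smooth = alpha * r0_smooth + (1.0 - alpha) * v
    for v in r1_inst:
        r1_smooth = alpha * r1_smooth + (1.0 - alpha) * v
    return r0_smooth, r1_smooth

# Hypothetical usage on a synthetic complex sub-band frame.
n = np.arange(64)
frame = np.exp(2j * np.pi * 0.1 * n)                     # a single complex tone
r0, r1 = subband_second_order_stats(frame)
```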
Source inference engine 315 may process the frame and sub-band second-order statistics and pitch estimates provided by feature extraction module 310 (or generated by source inference engine 315) to derive models of the noise and speech in the sub-band signals. Source inference engine 315 processes the FCT-domain energies to derive models of the pitched components of the sub-band signals, the stationary components, and the transient components. The speech, noise and optional transient models are resolved into speech and noise models. If the present technology is utilizing non-zero lookahead, source inference engine 315 is the component wherein the lookahead is leveraged. At each frame, source inference engine 315 receives a new frame of analysis path data and outputs a new frame of signal path data (which corresponds to an earlier relative time in the input signal than the analysis path data). The lookahead delay may provide time to improve discrimination of speech and noise before the sub-band signals are actually modified (in the signal path). Also, source inference engine 315 outputs a voice activity detection (VAD) signal (for each tap) that is internally fed back to the stationary noise estimator to help prevent over-estimation of the noise.
The modification generator module 320 receives models of the speech and noise as estimated by source inference engine 315. Modification generator module 320 may derive a multiplicative mask for each sub-band per frame. Modification generator module 320 may also derive a linear enhancement filter for each sub-band per frame. The enhancement filter includes a suppression backoff mechanism wherein the filter output is cross-faded with its input sub-band signals. The linear enhancement filter may be used in addition or in place of the multiplicative mask, or not used at all. The cross-fade gain is combined with the filter coefficients for the sake of efficiency. Modification generator module 320 may also generate a post-mask for applying equalization and multiband compression. Spectral conditioning may also be included in this post-mask.
The multiplicative mask may be defined as a Wiener gain. The gain may be derived based on the autocorrelation of the primary acoustic signal and an estimate of the autocorrelation of the speech (e.g. the speech model) or an estimate of the autocorrelation of the noise (e.g. the noise model). Applying the derived gain yields a minimum mean-squared error (MMSE) estimate of the clean speech signal given the noisy signal.
The linear enhancement filter is defined by a first-order Wiener filter. The filter coefficients may be derived based on the 0th and 1st order lag autocorrelation of the acoustic signal and an estimate of the 0th and 1st order lag autocorrelation of the speech or an estimate of the 0th and 1st order lag autocorrelation of the noise. In one embodiment, the filter coefficients are derived based on the optimal Wiener formulation using the following equations:
where rxx[0] is the 0th order lag autocorrelation of the input signal, rxx[1] is the 1st order lag autocorrelation of the input signal, rss[0] is the estimated 0th order lag autocorrelation of the speech, and rss[1] is the estimated 1st order lag autocorrelation of the speech. In the Wiener formulations, * denotes conjugation and |·| denotes magnitude. In some embodiments, the filter coefficients may be derived in part based on a multiplicative mask derived as described above. The coefficient β0 may be assigned the value of the multiplicative mask, and β1 may be determined as the optimal value for use in conjunction with that value of β0 according to the formula:
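The referenced equations are not reproduced in this text. For illustration only, one standard least-squares solution that is consistent with the stated notation (an assumption, not necessarily the exact published formulation) is:

```latex
% Assumed reconstruction of the first-order Wiener solution, in the notation above
% (beta_0, beta_1 = filter coefficients; * = conjugation; |.| = magnitude).
\begin{aligned}
\Delta  &= r_{xx}[0]^{2} - \left| r_{xx}[1] \right|^{2}, \\
\beta_0 &= \frac{r_{ss}[0]\, r_{xx}[0] - r_{ss}[1]\, r_{xx}[1]^{*}}{\Delta}, \qquad
\beta_1  = \frac{r_{ss}[1]\, r_{xx}[0] - r_{ss}[0]\, r_{xx}[1]}{\Delta}, \\
\beta_1 &= \frac{r_{ss}[1] - \beta_0\, r_{xx}[1]}{r_{xx}[0]}
\quad \text{(conditionally optimal } \beta_1 \text{ for a given mask value } \beta_0\text{).}
\end{aligned}
```

Under the same assumed formulation, constraining β1 to zero reduces β0 to rss[0]/rxx[0], which is consistent with the multiplicative Wiener gain described above.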
Applying the filter yields an MMSE estimate of the clean speech signal given the noisy signal.
The values of the gain mask or filter coefficients output from modification generator module 320 are time and sub-band signal dependent and optimize noise reduction on a per sub-band basis. The noise reduction may be subject to the constraint that the speech loss distortion complies with a tolerable threshold limit.
In some embodiments, the energy level of the noise component in the sub-band signal may be reduced to no less than a residual noise level, which may be fixed or slowly time-varying. In some embodiments, the residual noise level is the same for each sub-band signal; in other embodiments, it may vary across sub-bands and frames. Such a noise level may be based on a lowest detected pitch level.
Modifier module 330 receives the signal path cochlear-domain samples from transform block 305 and applies a modification, such as for example a first-order FIR filter, to each sub-band signal. Modifier module 330 may also apply a multiplicative post-mask to perform such operations as equalization and multiband compression. For Rx applications, the post-mask may also include a voice equalization feature. Spectral conditioning may be included in the post-mask. Modifier module 330 may also apply speech reconstruction at the output of the filter, but prior to the post-mask.
Reconstructor module 335 may convert the modified frequency sub-band signals from the cochlea domain back into the time domain. The conversion may include applying gains and phase shifts to the modified sub-band signals and adding the resulting signals.
Reconstructor module 335 forms the time-domain system output by adding together the FCT-domain sub-band signals after optimized time delays and complex gains have been applied. The gains and delays are derived in the cochlea design process. Once conversion to the time domain is completed, the synthesized acoustic signal may be post-processed or output to a user via output device 206 and/or provided to a codec for encoding.
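For illustration, a minimal Python sketch of this reconstruction is given below; the specific gains, delays, and the use of the real part of the summed output are assumptions made only for the example.

```python
import numpy as np

def reconstruct_time_domain(subbands, gains, delays):
    """Reconstruct a time-domain signal from complex sub-band signals by applying
    per-sub-band complex gains and integer sample delays and summing the results.

    subbands : array of shape (num_subbands, num_samples), complex
    gains    : complex gain per sub-band (derived in the cochlea design process)
    delays   : non-negative integer delay, in samples, per sub-band
    """
    num_bands, num_samples = subbands.shape
    out = np.zeros(num_samples)
    for k in range(num_bands):
        shifted = np.zeros(num_samples, dtype=np.complex128)
        d = int(delays[k])
        shifted[d:] = subbands[k, :num_samples - d]      # apply the per-band time delay
        out += np.real(gains[k] * shifted)               # apply complex gain, keep the real part
    return out

# Hypothetical usage with three complex sub-band signals.
bands = np.random.randn(3, 80) + 1j * np.random.randn(3, 80)
y = reconstruct_time_domain(bands, gains=np.array([1.0, 0.8 + 0.1j, 0.5]),
                            delays=np.array([0, 2, 5]))
```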
Post-processor module 340 may perform time-domain operations on the output of the noise reduction system. This includes comfort noise addition, automatic gain control, and output limiting. Speech time stretching may be performed as well, for example, on an Rx signal.
Comfort noise may be generated by a comfort noise generator and added to the synthesized acoustic signal prior to providing the signal to the user. Comfort noise may be a uniform constant noise not usually discernible to a listener (e.g., pink noise). This comfort noise may be added to the synthesized acoustic signal to enforce a threshold of audibility and to mask low-level non-stationary output noise components. In some embodiments, the comfort noise level may be chosen to be just above a threshold of audibility and may be settable by a user. In some embodiments, the modification generator module 320 may have access to the level of comfort noise in order to generate gain masks that will suppress the noise to a level at or below the comfort noise.
The system of
Source inference engine 315 receives second order statistics data from feature extraction module 310 and provides this data to polyphonic pitch and source tracker (tracker) 420, stationary noise modeler 428 and transient modeler 436. Tracker 420 receives the second order statistics and a stationary noise model and estimates pitches within the acoustic signal received by microphone 106.
Estimating the pitches may include estimating the highest level pitch, removing components corresponding to the pitch from the signal statistics, and estimating the next highest level pitch, for a number of iterations per a configurable parameter. First, for each frame, peaks may be detected in the FCT-domain spectral magnitude, which may be based on the 0th order lag autocorrelation and may further be based on a mean subtraction such that the FCT-domain spectral magnitude has zero mean. In some embodiments, the peaks must meet certain criteria, such as being larger than their four nearest neighbors and having a large enough level relative to the maximum input level. The detected peaks form the first set of pitch candidates. Subsequently, sub-pitches are added to the set for each candidate, i.e., f0/2, f0/3, f0/4, and so forth, where f0 denotes a pitch candidate. Cross correlation is then performed by adding the level of the interpolated FCT-domain spectral magnitude at harmonic points over a specific frequency range, thereby forming a score for each pitch candidate. Because the FCT-domain spectral magnitude is zero-mean over that range (due to the mean subtraction), pitch candidates are penalized if a harmonic does not correspond to an area of significant amplitude (because the zero-mean FCT-domain spectral magnitude will have negative values at such points). This ensures that frequencies below the true pitch are adequately penalized relative to the true pitch. For example, a 0.1 Hz candidate would be given a near-zero score (because it would be the sum of all FCT-domain spectral magnitude points, which is zero by construction).
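A simplified Python sketch of this harmonic cross-correlation scoring on a zero-mean spectrum follows; the frequency grid, candidate set, frequency range, and Gaussian test spectrum are illustrative assumptions.

```python
import numpy as np

def score_pitch_candidates(freqs_hz, spectrum, candidates_hz, f_max=4000.0):
    """Score pitch candidates by summing an interpolated, zero-mean spectral
    magnitude at each candidate's harmonic frequencies (harmonic cross-correlation).

    freqs_hz      : sub-band center frequencies (ascending)
    spectrum      : spectral magnitude per sub-band for the current frame
    candidates_hz : candidate pitch frequencies to score
    f_max         : highest harmonic frequency considered (assumed value)
    """
    zero_mean = spectrum - np.mean(spectrum)             # negative values penalize empty harmonics
    scores = []
    for f0 in candidates_hz:
        harmonics = np.arange(f0, f_max, f0)             # f0, 2*f0, 3*f0, ...
        score = np.sum(np.interp(harmonics, freqs_hz, zero_mean))
        scores.append(score)
    return np.array(scores)

# Hypothetical frame with energy at 200 Hz and its harmonics.
freqs = np.linspace(100.0, 4000.0, 128)
spec = np.exp(-0.5 * ((freqs[:, None] - np.arange(200, 4000, 200)[None, :]) / 30.0) ** 2).sum(axis=1)
cands = np.array([100.0, 200.0, 400.0, 66.7])            # includes sub-pitches of 200 Hz
print(score_pitch_candidates(freqs, spec, cands))        # 200 Hz scores highest
```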
The cross-correlation may then provide scores for each pitch candidate. Many candidates are very close in frequency (because of the addition of the sub-pitches f0/2, f0/3, f0/4, etc., to the set of candidates). The scores of candidates close in frequency are compared, and only the best one is retained. A dynamic programming algorithm is used to select the best candidate in the current frame, given the candidates in previous frames. The dynamic programming algorithm ensures the candidate with the best score is generally selected as the primary pitch, and helps avoid octave errors.
Once the primary pitch has been chosen, the harmonic amplitudes are computed simply using the level of the interpolated FCT-domain spectral magnitude at harmonic frequencies. A basic speech model is applied to the harmonics to make sure they are consistent with a normal speech signal. Once the harmonic levels are computed, the harmonics are removed from the interpolated FCT-domain spectral magnitude to form a modified FCT-domain spectral magnitude.
The pitch detection process is repeated, using the modified FCT-domain spectral magnitude. At the end of the second iteration, the best pitch is selected, without running another dynamic programming algorithm. Its harmonics are computed, and removed from the FCT-domain spectral magnitude. The third pitch is the next best candidate, and its harmonic levels are computed on the twice-modified FCT-domain spectral magnitude. This process is continued until a configurable number of pitches has been estimated. The configurable number may be, for example, three or some other number. As a last stage, the pitch estimates are refined using the phase of the 1st order lag autocorrelation.
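The following Python sketch illustrates the iterative estimate-and-remove loop under simplifying assumptions; in particular, the narrow Gaussian harmonic template is only a stand-in for the cochlea-response templates described later, and all constants are illustrative.

```python
import numpy as np

def iterative_pitch_extraction(freqs_hz, spectrum, candidate_grid_hz, num_pitches=3):
    """Iteratively pick the strongest pitch, estimate its harmonic amplitudes from
    the spectrum, subtract them, and repeat on the modified spectrum.

    Returns a list of (pitch_hz, harmonic_amplitudes) tuples.
    """
    residual = spectrum.astype(float).copy()
    results = []
    for _ in range(num_pitches):                         # configurable number of pitches, e.g. three
        zero_mean = residual - np.mean(residual)
        scores = [np.sum(np.interp(np.arange(f0, freqs_hz[-1], f0), freqs_hz, zero_mean))
                  for f0 in candidate_grid_hz]
        best = float(candidate_grid_hz[int(np.argmax(scores))])
        harmonics = np.arange(best, freqs_hz[-1], best)
        amps = np.interp(harmonics, freqs_hz, residual)  # sample the interpolated magnitude at harmonics
        results.append((best, amps))
        # Remove the estimated harmonics with a narrow template around each harmonic.
        for h, a in zip(harmonics, amps):
            template = a * np.exp(-0.5 * ((freqs_hz - h) / 30.0) ** 2)
            residual = np.maximum(residual - template, 0.0)
    return results

# Hypothetical usage: two harmonic sources, 200 Hz and 330 Hz.
freqs = np.linspace(100.0, 4000.0, 256)
spec = sum(np.exp(-0.5 * ((freqs - h) / 25.0) ** 2)
           for f0 in (200.0, 330.0) for h in np.arange(f0, 4000.0, f0))
pitches = iterative_pitch_extraction(freqs, spec, candidate_grid_hz=np.linspace(80.0, 400.0, 161))
```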
A number of the estimated pitches are then tracked by the polyphonic pitch and source tracker 420. The tracking may determine changes in frequency and level of the pitch over multiple frames of the acoustic signal. In some embodiments, a subset of the estimated pitches is tracked, for example the estimated pitches having the highest energy level(s).
The output of the pitch detection algorithm consists of a number of pitch candidates. The first candidate may be continuous across frames because it is selected by the dynamic programming algorithm. The remaining candidates may be output in order of salience, and therefore may not form frequency-continuous tracks across frames. For the task of assigning types to sources (talker associated with speech or distractor associated with noise), it is important to be able to deal with pitch tracks that are continuous in time, rather than collections of candidates at each frame. This is the goal of the multi-pitch tracking step, carried out on the per-frame pitch estimates determined by the pitch detection.
Given N input candidates, the algorithm outputs N tracks, immediately reusing a track slot when a track terminates and a new one is born. At each frame the algorithm considers the N! associations of the N existing tracks to the N new pitch candidates. For example, if N=3, tracks 1, 2, 3 from the previous frame can be continued to candidates 1, 2, 3 in the current frame in six ways: (1-1, 2-2, 3-3), (1-1, 2-3, 3-2), (1-2, 2-3, 3-1), (1-2, 2-1, 3-3), (1-3, 2-2, 3-1), (1-3, 2-1, 3-2). For each of these associations, a transition probability is computed to evaluate which association is the most likely. The transition probability is computed based on how close in frequency the candidate pitch is to the track pitch, the relative candidate and track levels, and the age of the track (in frames, since its beginning). The transition probabilities tend to favor continuous pitch tracks, tracks with larger levels, and tracks that are older than other ones.
Once the N! transition probabilities are computed, the largest one is selected, and the corresponding transition is used to continue the tracks into the current frame. A track dies when its transition probability to any of the current candidates is 0 in the best association (in other words, it cannot be continued into any of the candidates). Any candidate pitch that isn't connected to an existing track forms a new track with an age of 0. The algorithm outputs the tracks, their level, and their age.
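A toy Python sketch of this permutation-based association is shown below; the transition score model (frequency closeness in octaves plus an age bonus) is an assumption used only to make the example concrete.

```python
import numpy as np
from itertools import permutations

def associate_tracks(track_pitches, track_ages, cand_pitches):
    """Associate N existing pitch tracks with N new candidates by evaluating all N!
    permutations and keeping the one with the largest product of transition scores.

    A transition score of 0 (pitch jump too large) means the track cannot be
    continued into that candidate; the score model below is a toy assumption.
    """
    n = len(track_pitches)

    def transition_score(track_f, age, cand_f):
        jump = abs(np.log2(cand_f / track_f))             # closeness in frequency (octaves)
        if jump > 0.5:                                     # too far: the track cannot continue here
            return 0.0
        return np.exp(-4.0 * jump) * (1.0 + 0.1 * age)     # favor continuity and older tracks

    best_assoc, best_score = None, -1.0
    for perm in permutations(range(n)):                    # all N! associations
        scores = [transition_score(track_pitches[i], track_ages[i], cand_pitches[perm[i]])
                  for i in range(n)]
        total = float(np.prod(scores))
        if total > best_score:
            best_assoc, best_score = perm, total
    return best_assoc, best_score

# Hypothetical frame: three tracks and three new candidates.
assoc, score = associate_tracks(track_pitches=[210.0, 130.0, 320.0],
                                track_ages=[12, 3, 7],
                                cand_pitches=[128.0, 215.0, 333.0])
# assoc maps track index -> candidate index; a zero score means some track dies.
```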
Each of the tracked pitches may be analyzed to estimate the probability of whether the tracked source is a talker or speech source. The cues estimated and mapped to probabilities are level, stationarity, speech model similarity, track continuity, and pitch range.
The pitch track data is provided to buffer 422 and then to pitch track processor 424. Pitch track processor 424 may smooth the pitch tracking for consistent speech target selection. Pitch track processor 424 may also track the lowest-frequency identified pitch. The output of pitch track processor 424 is provided to pitch spectral modeler 426 and to compute modification filter module 450.
Stationary noise modeler 428 generates a model of stationary noise. The stationary noise model may be based on second order statistics as well as a voice activity detection signal received from pitch spectral modeler 426. The stationary noise model may be provided to pitch spectral modeler 426, update control module 432, and polyphonic pitch and source tracker 420. Transient modeler 436 may receive second order statistics and provide the transient noise model to transient model resolution 442 via buffer 438. The buffers 422, 430, 438, and 440 are used to account for the “lookahead” time difference between the analysis path 325 and the signal path.
Construction of the stationary noise model may involve a combined feedback and feed-forward technique based on speech dominance. For example, in one feed-forward technique, if the constructed speech and noise models indicate that the speech is dominant in a given sub-band, the stationary noise estimator is not updated for that sub-band. Rather, the stationary noise estimator is reverted to that of the previous frame. In one feedback technique, if speech (voice) is determined to be dominant in a given sub-band for a given frame, the noise estimation is rendered inactive (frozen) in that sub-band during the next frame. Hence, a decision is made in a current frame not to estimate stationary noise in a subsequent frame.
The speech dominance may be indicated by a voice activity detector (VAD) indicator computed for the current frame and used by update control module 432. The VAD may be stored in the system and used by the stationary noise modeler 428 in the subsequent frame. This dual-mode VAD prevents damage to low-level speech, especially high-frequency harmonics; this reduces the “voice muffling” effect frequently incurred in noise suppressors.
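By way of illustration, a simplified Python sketch of a stationary noise estimator gated by such feed-forward and feedback speech-dominance decisions follows; the leaky-average update rule and all constants are assumptions.

```python
import numpy as np

class StationaryNoiseEstimator:
    """Leaky-average stationary noise estimator whose update is frozen in sub-bands
    where speech is dominant: feedback uses the VAD from the previous frame, and
    feed-forward reverts to the previous estimate when the current models indicate
    speech dominance. Constants and shapes are illustrative assumptions.
    """
    def __init__(self, num_subbands, alpha=0.95):
        self.noise = np.zeros(num_subbands)
        self.prev_noise = np.zeros(num_subbands)
        self.prev_vad = np.zeros(num_subbands, dtype=bool)   # speech-dominant flags, previous frame
        self.alpha = alpha

    def update(self, subband_energy, speech_dominant_now):
        self.prev_noise = self.noise.copy()
        # Feedback: freeze sub-bands that were speech-dominant in the previous frame.
        update_mask = ~self.prev_vad
        leaky = self.alpha * self.noise + (1.0 - self.alpha) * subband_energy
        self.noise = np.where(update_mask, leaky, self.noise)
        # Feed-forward: if the current models say speech dominates, revert this frame's update.
        self.noise = np.where(speech_dominant_now, self.prev_noise, self.noise)
        self.prev_vad = speech_dominant_now.copy()
        return self.noise

# Hypothetical usage over two frames with 4 sub-bands.
est = StationaryNoiseEstimator(num_subbands=4)
est.update(np.array([1.0, 1.2, 0.8, 0.9]), np.array([False, False, True, False]))
est.update(np.array([1.1, 1.0, 5.0, 0.7]), np.array([False, True, True, False]))
```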
Pitch spectral modeler 426 may receive pitch track data from pitch track processor 424, a stationary noise model, a transient noise model, second order statistics, and optionally other data, and may output a speech model and a non-stationary noise model. Pitch spectral modeler 426 may also provide a VAD signal indicating whether speech is dominant in a particular sub-band and frame.
The pitch tracks (each comprising pitch, salience, level, stationarity, and speech probability) are used to construct models of the speech and noise spectra by the pitch spectral modeler 426. To construct models of the speech and noise, the pitch tracks may be reordered based on the track saliences, such that the model for the highest salience pitch track will be constructed first. An exception is that high-frequency tracks with a salience above a certain threshold are prioritized. Alternatively, the pitch tracks may be reordered based on the speech probability, such that the model for the most probable speech track will be constructed first.
In pitch spectral modeler 426, a broadband stationary noise estimate may be subtracted from the signal energy spectrum to form a modified spectrum. Next, the present system may iteratively estimate the energy spectra of the pitch tracks according to the processing order determined in the first step. An energy spectrum may be derived by estimating an amplitude for each harmonic (by sampling the modified spectrum), computing a harmonic template corresponding to the response of the cochlea to a sinusoid at the harmonic's amplitude and frequency, and accumulating the harmonic's template into the track spectral estimate. After the harmonic contributions are aggregated, the track spectrum is subtracted to form a new modified signal spectrum for the next iteration.
To compute the harmonic templates, the module uses a pre-computed approximation of the cochlea transfer function matrix. For a given sub-band, the approximation consists of a piecewise linear fit of the sub-band's frequency response where the approximation points are optimally selected from the set of sub-band center frequencies (so that sub-band indices can be stored instead of explicit frequencies).
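A minimal Python sketch of such a piecewise linear approximation, with breakpoints restricted to sub-band center frequencies so that only sub-band indices need to be stored, is shown below; the example response and breakpoint indices are illustrative assumptions.

```python
import numpy as np

def piecewise_linear_response(center_freqs_hz, full_response, breakpoint_indices):
    """Approximate one sub-band's magnitude response by a piecewise linear fit whose
    breakpoints are a subset of the sub-band center frequencies.

    center_freqs_hz    : center frequency of every sub-band (ascending)
    full_response      : true magnitude response of this sub-band, sampled at those centers
    breakpoint_indices : chosen sub-band indices serving as approximation points
    """
    bp_f = center_freqs_hz[breakpoint_indices]
    bp_r = full_response[breakpoint_indices]
    return np.interp(center_freqs_hz, bp_f, bp_r)        # evaluate the fit at all sub-band centers

# Hypothetical 64-band example: a bell-shaped response approximated with 6 breakpoints.
centers = np.geomspace(100.0, 8000.0, 64)
response = np.exp(-0.5 * (np.log2(centers / 1000.0) / 0.4) ** 2)
approx = piecewise_linear_response(centers, response, np.array([0, 20, 28, 33, 40, 63]))
```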
After the harmonic spectra are iteratively estimated, each spectrum is allocated in part to the speech model and in part to the non-stationary noise model, where the extent of the allocation to the speech model is dictated by the speech probability of the corresponding track, and the extent of the allocation to the noise model is determined as an inverse of the extent of the allocation to the speech model.
Noise model combiner 434 may combine stationary noise and non-stationary noise and provide the resulting noise to transient model resolution 442. Update control 432 may determine whether or not the stationary noise estimate is to be updated in the current frame, and provide the resulting stationary noise to noise model combiner 434 to be combined with the non-stationary noise model.
Transient model resolution 442 receives a noise model, speech model, and transient model and resolves the models into speech and noise. The resolution involves verifying that the speech model and noise model do not overlap, and determining whether the transient model is speech or noise. The noise and non-speech transient models are deemed noise, and the speech model and transient speech are deemed speech. The transient noise models are provided to repair module 462, and the resolved speech and noise models are provided to SNR estimator 444 as well as to the compute modification filter module 450. The speech model and the noise model are resolved to reduce cross-model leakage. The models are resolved into a consistent decomposition of the input signal into speech and noise.
SNR estimator 444 determines an estimate of the signal to noise ratio. The SNR estimate can be used to determine an adaptive level of suppression in the crossfade module 464. It can also be used to control other aspects of the system behavior. For example, the SNR may be used to adaptively change what the speech/noise model resolution does.
Compute modification filter module 450 generates a modification filter to be applied to each sub-band signal. In some embodiments, a filter such as a first-order filter is applied in each sub-band instead of a simple multiplier. Modification filter module 450 is discussed in more detail below with respect to
The modification filter is applied to the sub-band signals by module 460. After applying the generated filter, portions of the sub-band signal may be repaired at repair module 462 and then linearly combined with the unmodified sub-band signal at crossfade module 464. The transient components may be repaired by module 462 and the crossfade may be performed based on the SNR provided by SNR estimator 444. The sub-bands are then reconstructed at reconstructor module 335.
The filter coefficients β0 and β1 are computed based on speech models derived by the source inference engine 315, combined with a sub-pitch suppression mask (for example, by tracking the lowest speech pitch and suppressing the sub-bands below this minimum pitch by reducing the β0 and β1 values for those sub-bands), and crossfaded based on the desired noise suppression level. Alternatively, a VQOS approach may be used to determine the crossfade. The β0 and β1 values are then subjected to interframe rate-of-change limits and interpolated across frames before being applied to the cochlear-domain signals in the modification filter. For the implementation of the delay, one sample of cochlear-domain signals (a time slice across sub-bands) is stored in the module state.
To implement a first-order modification filter, the received sub-band signal is multiplied by β0 and also delayed by one sample. The signal at the output of the delay is multiplied by β1. The results of the two multiplications are summed and provided as the output s[k,t]. The delay, multiplications, and summation correspond to the application of a first-order linear filter. There may be N delay-multiply-sum stages, corresponding to an Nth order filter.
When applying a first-order filter in each sub-band instead of a simple multiplier, an optimal scalar multiplier (mask) may be used in the non-delayed branch of the filter. The filter coefficient for the delayed branch may be derived to be optimal conditioned on the scalar mask. In this way, the first-order filter is able to achieve a higher-quality speech estimate than using the scalar mask alone. The system can be extended to higher orders (an N-th order filter) if desired. Also, for an N-th order filter, the autocorrelations up to lag N may be computed in feature extraction module 310 (second-order statistics). In the first-order case, the zero-th and first-order lag autocorrelations are computed. This is a distinction from prior systems which rely solely on the zero-th order lag.
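For illustration, a Python sketch of the per-sub-band delay-multiply-sum filter is given below, together with one assumed way of choosing β1 conditioned on a scalar mask β0 (following the least-squares orthogonality condition on the delayed branch); all numeric values are illustrative.

```python
import numpy as np

def apply_first_order_filter(subband, b0, b1, state=0.0 + 0.0j):
    """Apply a first-order modification filter to one complex sub-band signal:
    s[k, t] = b0 * x[k, t] + b1 * x[k, t-1].  The one-sample delay state carries
    across frames, as in the delay-multiply-sum structure described above.
    """
    out = np.empty_like(subband)
    prev = state
    for t, x in enumerate(subband):
        out[t] = b0 * x + b1 * prev
        prev = x
    return out, prev                                      # return new state (last input sample)

def b1_given_mask(b0, rxx0, rxx1, rss1):
    """One assumed way to choose b1 optimally given a scalar mask b0, from the
    least-squares orthogonality condition on the delayed branch."""
    return (rss1 - b0 * rxx1) / rxx0

# Hypothetical per-frame usage for a single sub-band.
frame = np.exp(2j * np.pi * 0.05 * np.arange(32)) + 0.3 * np.random.randn(32)
b0 = 0.7                                                  # e.g., a Wiener-style multiplicative mask
b1 = b1_given_mask(b0,
                   rxx0=np.mean(np.abs(frame) ** 2),
                   rxx1=np.mean(frame[1:] * np.conj(frame[:-1])),
                   rss1=0.5 + 0.0j)                       # assumed speech first-lag autocorrelation
filtered, new_state = apply_first_order_filter(frame, b0, b1)
```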
Monaural features are extracted from the cochlea domain sub-band signals at step 615. The monaural features are extracted by feature extraction module 310 and may include second order statistics. Some features may include pitch, energy level, pitch salience, and other data.
Speech and noise models may be estimated for cochlea sub-bands at step 620. The speech and noise models may be estimated by source inference engine 315. Generating the speech model and noise model may include estimating a number of pitch elements for each frame, tracking a selected number of the pitch elements across frames, and selecting one of the tracked pitches as a talker based on a probabilistic analysis. The speech model is generated from the tracked talker. A non-stationary noise model may be based on the other tracked pitches and a stationary noise model may be based on extracted features provided by feature extraction module 310. Step 620 is discussed in more detail with respect to the method of
The speech model and noise models may be resolved at step 625. Resolving the speech model and noise model may be performed to eliminate any cross-leakage between the two models. Step 625 is discussed in more detail with respect to the method of
The sub-bands may be reconstructed at step 635. Reconstruction of the sub-bands may involve applying a series of delays and complex-multiply operations to the sub-band signals by reconstructor module 335. The reconstructed time-domain signal may be post-processed at step 640. Post-processing may consist of adding comfort noise, performing automatic gain control (AGC) and applying a final output limiter. The noise-reduced time-domain signal is output at step 645.
A speech source is identified by a probability analysis at step 715. The probability analysis identifies a probability that each pitch track is the desired talker based on each of several features, including level, salience, similarity to speech models, stationarity, and other features. A single probability for each pitch is determined based on the feature probabilities for that pitch, for example, by multiplying the feature probabilities. The speech source may be identified as the pitch track with the highest probability of being associated with the talker.
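A small Python sketch of combining feature probabilities by multiplication and selecting the talker follows; the feature values and track names are hypothetical.

```python
import numpy as np

def talker_probability(feature_probs):
    """Combine per-feature probabilities (level, salience, speech-model similarity,
    stationarity, pitch range, ...) into a single talker probability for one pitch
    track by multiplying them, as one simple combination rule."""
    return float(np.prod(feature_probs))

# Hypothetical feature probabilities for three pitch tracks.
tracks = {
    "track_a": [0.9, 0.8, 0.85, 0.9],    # strong, salient, speech-like, non-stationary
    "track_b": [0.6, 0.3, 0.4, 0.5],
    "track_c": [0.2, 0.1, 0.3, 0.9],
}
probs = {name: talker_probability(p) for name, p in tracks.items()}
talker = max(probs, key=probs.get)       # pitch track with the highest probability is the talker
```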
A speech model and noise model are constructed at step 720. The speech model is constructed in part based on the pitch track with the highest probability. The noise model is constructed based in part on the pitch tracks having a low probability of corresponding to the desired talker. Transient components identified as speech may be included in the speech model and transient components identified as non-speech transient may be included in the noise model. Both the speech model and the noise model are determined by source inference engine 315.
A speech model and noise model are resolved into speech and noise at step 810. Portions of a speech model may leak into a noise model, and vice-versa. The speech and noise models are resolved such that there is no leakage between the two.
A delayed time-domain acoustic signal may be provided to the signal path to allow additional time (look-ahead) for the analysis path to discriminate between speech and noise in step 815. By utilizing a time-domain delay in the look-ahead mechanism, memory resources are saved as compared to implementing the lookahead delay in the cochlear domain.
The steps discussed in
The above described modules, including those discussed with respect to
While the present invention is disclosed by reference to the preferred embodiments and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.
This application is a continuation of U.S. patent application Ser. No. 12/860,043, (now U.S. Pat. No. 8,447,596, issued May 21, 2013), filed Aug. 20, 2010, which claims the benefit of U.S. Provisional Application Ser. No. 61/363,638, filed Jul. 12, 2010, all of which are incorporated herein by reference.
Patent Publication: US 2013/0231925 A1, published Sep. 2013 (US). Related U.S. Application Data: Provisional Application No. 61/363,638, filed Jul. 2010; Parent Application Ser. No. 12/860,043, filed Aug. 2010; Child Application Ser. No. 13/859,186.