Embodiments of the invention relate to performing voice activity detection (VAD). In particular, embodiments of the invention relate to performing voice activity detection based on steered response power (SRP) values.
State-of-the-art smart home devices may use speech technology to enable users to control devices using their voice. Speech technology may include speech recognition and text-to-speech functionalities. These devices may need to operate well even in the presence of ambient noise, reverberation, acoustic echoes, and other disturbances. Typical speech recognition systems may use multi-microphone input and may enhance speech, suppress noise, remove echo and detect a direction of arrival (DOA) of the speaker. Noise cancelation typically requires identifying audio segments that do not contain speech and extracting noise characteristics from these segments. The extracted noise characteristics may then be used for noise cancelation.
A commonly used solution to enhance speech is the minimum variance distortionless response (MVDR) beamformer (BF), which requires the direction of arrival (DOA) of the speaker and the spatial characteristics of the noise (e.g., the power spectral density (PSD) matrix). Two main relevant techniques can be used: SRP to estimate the speaker DOA, and VAD to detect speech absence segments and estimate the noise PSD matrix. These two techniques usually act independently and have typical limitations.
VAD, also referred to as speech activity detection or speech detection, is a technique used to determine the presence or absence of human speech in audio samples. Typical VAD techniques include extracting features from the speech signal, making a binary decision regarding the presence or absence of speech, and smoothing the decisions along the time axis. The features may include the energy of the signal in each frequency, the periodicity of the signal in the frequency domain, the spectrum coefficients, etc.
Energy based VAD takes the energy of the signal as a feature. Usually, only the energy in speech frequencies is considered. The main drawback of energy based VAD is its low performance in low signal-to-noise ratio (SNR) cases. In high and intermediate SNR cases the energy based VAD performs well regardless of the directionality of the noise.
According to embodiments of the invention, there is provided a system and method for voice activity detection (VAD). Embodiments of the invention may include: obtaining audio frames from a multi-microphone array; calculating SRP values of the audio frames; calculating entropy levels of the SRP values; and determining whether an incoming audio frame contains voice activity based on the entropy levels.
According to embodiments of the invention, there is provided a system and method for speech recognition. Embodiments of the invention may include: obtaining audio frames sampled by a multi-microphone array; providing a vector of SRP values based on the audio frames, where each SRP value provides a probability of a speaker to be in a direction associated with the SRP value; calculating instantaneous entropy levels of the SRP values; and performing voice activity detection (VAD) of the audio frames based on the entropy levels.
According to some embodiments, determining whether an incoming audio frame contains voice activity may include: detecting a sequence of audio frames in which the entropy levels are substantially constant across the sequence of frames and denoting an entropy level of the sequence as a background entropy; and identifying an incoming audio frame as containing voice activity if the difference between a level of entropy of the incoming audio frame and the background entropy is larger than a first threshold, and as not containing voice activity otherwise.
According to some embodiments, detecting the sequence of audio frames in which entropy levels are substantially constant may include: for an incoming audio frame: finding a local minimum entropy level of the audio frames; finding a local maximum entropy level of the audio frames; and determining that the entropy levels of the set of audio frames are substantially constant if the difference between the local minimum entropy level and the local maximum entropy level is below a second threshold.
According to some embodiments, for a set of audio frames: finding the local minimum entropy level may include selecting the minimal value between the entropy level of an incoming audio frame and the previous local minimum entropy level determined for an audio frame previous to the incoming audio frame; and finding the local maximum entropy level may include selecting the maximum value between the entropy level of an incoming audio frame and the previous local maximum entropy level determined for an audio frame previous to the incoming audio frame.
According to some embodiments, one of the previous local minimum entropy level and the selected minimal value may be multiplied by a value larger than one, and one of the previous local maximum entropy level and the selected maximum value may be multiplied by a value smaller than one.
Embodiments of the invention may include performing single talk detection (STD) based on the entropy levels.
Embodiments of the invention may include: determining a global minimum of the entropy by finding a minimal value of the entropy levels in a predetermined time frame; determining that an audio frame contains speech originated from a single speaker if the difference between the level of entropy of the audio frame and the global minimum of the entropy is larger than a threshold; and determining that an audio frame contains speech originated from more than one speaker otherwise.
Embodiments of the invention may include performing noise cancelation by: characterizing noise parameters based on audio frames that do not contain voice activity; and using the noise parameters for performing noise cancelation.
According to some embodiments performing VAD may include: detecting a sequence of audio frames in which the entropy levels are substantially constant across the sequence of frames and denoting an entropy level of the sequence as a background entropy; and identifying a current audio frame as containing voice activity if the difference between a level of entropy of the current audio frame and the background entropy is larger than a first threshold, and as not containing voice activity otherwise.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
In the following description, various aspects of the present invention will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the present invention. However, it will also be apparent to one skilled in the art that the present invention may be practiced without the specific details presented herein. Furthermore, well-known features may be omitted or simplified in order not to obscure the present invention.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
Embodiments of the invention pertain, inter alia, to the technology of speech recognition. Embodiments may provide an improvement to speech recognition technology by, for example, improving VAD and STD. VAD may make it possible to distinguish between a sequence of audio samples or frames that contain speech and audio frames that do not contain speech. Audio frames that do not contain speech include only noise. Thus, those frames may be analyzed in order to characterize, categorize or otherwise describe noise parameters. The noise parameters extracted from the audio frames that do not contain speech may then be used for performing noise cancelation on the audio frames that do contain speech, thus enhancing noisy speech (e.g., enhancing the speech component of a recording including speech and noise) and improving the voice quality. An audio frame may be a data structure including a plurality of audio samples, e.g., an audio frame may include 512 audio samples, or another number of audio samples. Audio frames may be sequential in time and contiguous so that two adjacent frames in a series represent a continuous time segment from the original audio stream.
Embodiments of the invention may improve VAD performance, especially in cases of low SNRs using SRP values. An SRP value may provide an estimation of the probability (or pseudo probability) of the speaker to be in a certain direction. Embodiments of the invention may detect voiced (e.g., including human voice) audio segments based on changes in the directionality of the audio sources, which may provide a good distinction between noise and speech even in cases of low SNRs. As used herein the entropy may refer to a measure of disorder or uncertainty (similarly to information entropy), e.g., in the directionality of the background noise. Thus, according to embodiments of the invention, the entropy of SRP values may represent or provide a measure of the directionality of the background noise. In many scenarios, the entropy of SRP values of the background noise is typically piecewise constant over time, e.g., the entropy of the SRP values may remain constant or substantially constant or similar for time durations that are longer than a duration of a typical utterance of a speaker. Thus, in a time interval in which the entropy of the SRP values is constant or substantially constant (e.g., remains within a predetermined range, for example, ±10%), changes in the entropy of the SRP values may be attributed to the presence of speech. Embodiments of the invention may detect the typical behavior of the entropy of the SRP in noisy frames that do not contain speech, and may further detect changes in the entropy of the SRP values that probably occur due to the presence of speech. According to embodiments of the invention, the SRP behavior in noisy frames may be determined using the background value of the entropy of the SRP values. The entropy of the SRP values may be indicative of the directionality of the observed audio signals (e.g., the combination of noise and speech). 
A variation in the directionality with respect to the directionality of the noise may indicate speech samples or frames. Embodiments of the invention may detect speech even in the case of a moving noise source, since the directionality, as estimated using the entropy, may not change with the movement of the noise source, as opposed to the direction of the noise source, which may change.
Background noise usually exhibits a relatively constant pattern at the output of an SRP beamformer. Even when the noise is nonstationary, with fluctuating power or dynamic direction, this pattern may be slowly time-varying. This typical pattern of the SRP values for noisy frames may be transformed to a single value by, for example, measuring the entropy of the SRP values. According to embodiments of the invention, significant differences between the instantaneous entropy and the entropy associated with the noise may be attributed to the presence of speech in the audio frames. Thus, the entropy of the SRP values may be used as a feature for VAD decisions. Embodiments of the invention may provide an adaptive technique for estimating the typical noise entropy for arbitrary noise fields.
According to embodiments of the invention, the entropy of the SRP values may also be beneficial for performing STD. Frames that are dominated by a single speaker may be important for separately estimating the characteristics of that speaker, e.g., location and relative transfer function (RTF), which may be used for speaker separation tasks. In single-talk frames (e.g., including speech from one speaker only) the SRP values may be concentrated around the speaker DOA and thus may exhibit low entropy. When another speaker (or another directional or non-directional noise source) becomes active, the SRP values may be more spread relative to the single-talk frames and thus may produce higher entropy. According to embodiments of the invention, single-talk frames may be identified by determining local minimum values of the entropy measure.
Reference is made to
Microphone set or array 110 may include a plurality of microphones 112 arranged in any desired spatial configuration. The plurality of microphones may be arranged in a linear array, e.g., with I microphones along an x axis, a planar array, e.g., with I microphones along an x axis by J microphones along a y axis, or may be distributed about a perimeter of a shape, e.g., a perimeter of a circle (circular arrays). Microphone array 110 may provide multiple spatial samples of audio waves. Using microphone array 110 instead of a single microphone may improve the quality of the captured sound by taking advantage of the plurality of samples and using advanced techniques for noise cancelation.
According to embodiments of the invention, VAD may be determined using the multichannel signals sampled by microphone array 110. The samples may include speech in a noisy environment, and may be modelled in the short-time Fourier transform (STFT) domain as for example:
$$Y_i(m,k)=\begin{cases}X_i(m,k)+V_i(m,k), & \mathcal{H}_1\\ V_i(m,k), & \mathcal{H}_0\end{cases}\qquad\text{(Equation 1)}$$
Where Yi(m, k) denotes a sample of the ith microphone at time or frame number m and frequency k, Xi(m, k) denotes the speech component in sample Yi(m, k), and Vi(m, k) denotes the ambient noise in sample Yi(m, k). $\mathcal{H}_1$ and $\mathcal{H}_0$ denote the speech presence and absence hypotheses, respectively. According to embodiments of the invention, VAD may include determining the most likely hypothesis, e.g., $\mathcal{H}_1$ or $\mathcal{H}_0$, in each time or frame number m.
SRP calculation unit 120 may calculate SRP values (e.g., raw SRP values), e.g. for the audio samples or frames, or for each frame, and may provide, based on these values, the probability of a speaker (a person speaking) being located in any one of N directions (e.g., normalized SRP values). For example, the raw SRP values may be normalized (e.g., by dividing each SRP value by the summation of all the raw SRP values) to be summed to 1. Then, each normalized SRP value may be considered as a probability of the speaker to be in a direction associated with the SRP value. SRP calculation unit 120 may provide an N-length vector of probabilities (e.g. an ordered set of values). SRP calculation unit 120 may provide a direction of arrival (DOA) of the audio, e.g., based on the vector of probabilities. For example, in case of speech, SRP calculation unit 120 may provide a DOA of the voice and thus may point to the direction of the speaker.
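The normalization and DOA selection described above can be sketched as follows. This is a minimal illustration, not the implementation of SRP calculation unit 120: the function names `normalize_srp` and `estimate_doa` are hypothetical, and the clipping of negative raw values and the uniform fallback for an all-zero vector are assumptions that the text does not specify.

```python
def normalize_srp(raw_srp):
    """Scale raw SRP values so that they sum to 1 (one value per direction)."""
    # Assumption: negative raw values are clipped to zero so the result can be
    # read as a (pseudo) probability; the text does not specify this step.
    clipped = [max(v, 0.0) for v in raw_srp]
    total = sum(clipped)
    if total == 0.0:
        # Assumption: fall back to a uniform vector when no direction has energy.
        return [1.0 / len(clipped)] * len(clipped)
    return [v / total for v in clipped]


def estimate_doa(probabilities, angles_deg):
    """Return the examined angle whose normalized SRP value is largest."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return angles_deg[best]
```

For example, `normalize_srp([1.0, 3.0, 4.0, 0.0])` yields a vector that sums to one, and `estimate_doa` applied to that vector with angles `[0, 90, 180, 270]` picks the direction with the largest share.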
According to some embodiments of the invention, SRP may be calculated by the SRP-phase transform (PHAT) algorithm, which is an adaptation of the generalized cross correlation phase transform (GCC-PHAT) to an array of microphones in far-field scenarios. However, other algorithms may be used for calculating SRP.
According to some embodiments the SRP-PHAT algorithm may include calculating time smoothed cross-correlation between each two microphones for all i=1, . . . , N and j=i+1, . . . , N:
$$R_{i,j}(m,k)=\alpha R_{i,j}(m-1,k)+(1-\alpha)\,Y_i(m,k)\,Y_j^{*}(m,k),\qquad\text{(Equation 2)}$$
Where Ri,j(m, k) is the time smoothed cross-correlation between the ith and jth microphones at time index m and frequency k, * denotes a complex conjugate, and α is a smoothing or forgetting factor which may be determined empirically. In some embodiments α may be in the range of 0.9 up to, but not including, 1, since the weight (1−α) applied to the new observation becomes non-positive for α of 1 or more. Other values may be used.
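The recursion of Equation 2 can be sketched for a single microphone pair and a single frequency bin as below. The function name `update_cross_correlation` and the default `alpha=0.95` are illustrative assumptions; the check that α lies in [0, 1) is also an assumption, made so that the recursive average stays stable.

```python
def update_cross_correlation(r_prev, y_i, y_j, alpha=0.95):
    """One step of Equation 2 for one microphone pair and one frequency bin.

    r_prev: previous smoothed value R_ij(m-1, k)
    y_i, y_j: STFT coefficients Y_i(m, k) and Y_j(m, k)
    """
    # Assumption: alpha in [0, 1) keeps the recursive average stable.
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return alpha * r_prev + (1.0 - alpha) * y_i * complex(y_j).conjugate()
```

With a stationary input the smoothed value converges to the instantaneous product Yi·Yj*, and a larger α makes the estimate slower to react but less noisy.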
Next, a predefined set of DOAs may be examined. The DOA may be expressed as an angle θ relative to a known baseline direction. For example, in circular arrays a full panoramic space may be examined, e.g., DOAs of θ=0°, . . . , 360°, while DOAs of θ=0°, . . . , 180° may be examined for a linear microphone array. The interval of θ may be referred to as the resolution of the DOA measurement and may be determined based on the number of microphones in microphone array 110, e.g., the resolution may increase as the number of microphones increases. The resolution or the intervals of θ may be determined according to the computational power of the processor performing the calculations (e.g., processor 905 depicted in
When a directional signal originated from DOA θ is perceived by two microphones there may be an expected phase difference between the two observations in the frequency domain, since time-delay in the time domain is transformed to a phase difference in the frequency domain. The expected phase difference, Gi,j, may refer to the phase difference between the signals that would be perceived at the ith and jth microphones if a speaker would be active from DOA θ. These expected phase differences may be pre-calculated for each microphone pair and each DOA θ. For example, the expected phase difference, Gi,j, between the ith and jth microphones when the speaker is active from DOA θ may be calculated by:
Where K is the total number of examined frequencies, Ts is the sampling time, τ indicates imaginary number, and Ti,j(θ) is the expected time difference of arrival (TDOA) between the ith and jth microphones when the speaker is active from DOA θ. The expected TDOA, Ti,j, may refer to the difference in arrival time of the signal at two microphones, e.g., the ith and jth microphones. The expected TDOA, Ti,j, may also be pre-calculated for each microphone pair and each DOA θ. For example, for a uniform linear array (ULA), the TDOA, Ti,j, may equal:
where d is a physical distance between the ith and jth microphones and c is the sound velocity. It should be noted that Gi,j(k, θ) may be calculated in advance. The raw SRP values may be calculated by for example:
Where Q(m, θ) denotes the raw SRP value at time index m and angle θ, and ℜ{·} is a function extracting the real-value component of its operand. The raw SRP values may be normalized (e.g., by dividing each SRP value by the summation of all the SRP values) to a probability density function, for example:
Where
In the presence of a speaker and directional noise, SRP calculation unit 120 may detect high-energy sources in both directions, e.g., the direction of the speaker and the direction of the noise. In this case, it may be impossible to distinguish the speaker from the noise based on the SRP values alone. However, if the noise is non-directional, SRP calculation unit 120 may easily detect the direction of the speaker even in low SNR cases.
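The SRP-PHAT steps described above (expected TDOA, expected phase difference, and raw SRP per examined angle) can be sketched for a linear array as follows. This is a sketch under stated assumptions, not the formulas of the specification: the far-field TDOA d·cos(θ)/c, the phase term exp(−i·2πk·T/(K·Ts)), the PHAT weighting by division with |R|, the sound speed of 343 m/s, and all function names are assumptions chosen to match the symbols defined in the text.

```python
import cmath
import math

SOUND_SPEED = 343.0  # m/s, assumed value for the sound velocity c


def expected_tdoa(pos_i, pos_j, theta_deg):
    """Assumed far-field TDOA for microphones on a line: d * cos(theta) / c."""
    d = pos_j - pos_i
    return d * math.cos(math.radians(theta_deg)) / SOUND_SPEED


def expected_phase(k, n_freqs, sample_period, tdoa):
    """Assumed phase term G_ij(k, theta) = exp(-i * 2*pi * k * T / (K * Ts))."""
    return cmath.exp(-2j * math.pi * k * tdoa / (n_freqs * sample_period))


def raw_srp(cross_corr, mic_positions, angles_deg, n_freqs, sample_period):
    """cross_corr[(i, j)][k] holds R_ij(m, k) for one frame; returns Q(m, theta)."""
    values = []
    for theta in angles_deg:
        total = 0.0
        for (i, j), spectrum in cross_corr.items():
            tdoa = expected_tdoa(mic_positions[i], mic_positions[j], theta)
            for k, r in enumerate(spectrum):
                if abs(r) == 0.0:
                    continue  # no phase information in an empty bin
                g = expected_phase(k, n_freqs, sample_period, tdoa)
                # PHAT weighting: keep only the phase of R_ij, then compare it
                # with the expected phase and accumulate the real part.
                total += ((r / abs(r)) * g.conjugate()).real
        values.append(total)
    return values
```

In this sketch the expected phases can be computed once and cached for each microphone pair and angle, since they do not depend on the audio data.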
According to embodiments of the invention, the directionality of the sampled signal, reflected in the output of SRP calculation unit 120, may be almost constant for continuously active noise sources, and the directionality may significantly change only when speech is added. When the noise type is non-directional, the SRP values may be assumed to be approximately equal for all DOAs. For example, in circular microphone arrays the normalized SRP value may be approximately 1/M for any θ, where M denotes the number of examined angles (e.g., the number of θ values). When the noise type is directional,
VAD and STD unit 140 may identify the presence or absence of voice activity (e.g. speech) represented in audio samples or frames, and may determine if an audio sample includes or does not include speech. According to embodiments of the invention, VAD and STD unit 140 may obtain the probability density function,
As used herein, entropy may provide a measure of uncertainty in the DOA. For example, in case of directional noise, the level of uncertainty in the DOA may be considered low and the entropy may typically be low, while in case of non-directional noise, the level of uncertainty in the DOA may be considered high and the entropy may typically be high. For example, entropy may obtain its maximum value when
Beamforming is a well-known noise reduction technique that may exploit the spatial diversity of microphone arrays. Waves of the same frequency may be combined, either constructively or destructively, in order to enhance or cancel a wave coming from a certain direction. For example, waves of the same frequency recorded by microphones 112 may be multiplied by appropriate weights so that the noise is reduced and the desired speech is enhanced. For example, a delay and sum (D&S) beamformer may steer the array to the speaker direction while arbitrarily summing the noise components. A minimum variance distortionless response (MVDR) beamformer may whiten the noise and then employ a D&S beamformer. The MVDR beamformer requires two major information sets: the speaker position (e.g., the DOA) and the noise characteristics. To automatically learn the noise characteristics, audio frames that do not contain speech, and therefore contain only noise, should be identified. Thus, a reliable VAD is desirable. BF unit 130 may obtain or receive an audio signal such as audio samples or frames from microphone array 110, an indication of whether an audio frame contains or does not contain speech from VAD and STD unit 140, and a DOA of the audio from SRP calculation unit 120. BF unit 130 may reduce the ambient noise in the audio frames based on the speech indication and the DOA. Audio data may be received in a format other than audio frames, but in a typical embodiment audio frames are used as input when determining VAD. For example, BF unit 130 may calculate noise parameters such as the noise spatial characteristics, e.g., the power spectral density (PSD) matrix of the noise, based on audio frames that do not contain voice activity, and may use the noise spatial characteristics for performing noise cancelation.
For example, BF unit 130 may calculate weights that may be used to filter and sum the microphone signals, based on the noise PSD matrix and the steering vector (a vector that may represent the expected phase difference between each microphone signal and a reference microphone located in the assumed DOA of the speaker). BF unit 130 may calculate weights that may preserve the signal impinged from the assumed DOA of the speaker undistorted, while reducing as much as possible the ambient noise. For example, BF unit 130 may use the calculated weights to perform pre-whitening of the noise and then activate a D&S beamformer.
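The estimation of the noise PSD matrix from frames without voice activity can be sketched as below. The text does not specify the estimator; plain averaging of the outer products y·yᴴ over noise-only frames is an assumption, as is the function name `estimate_noise_psd`. The sketch handles one frequency bin at a time and does not compute the beamformer weights.

```python
def estimate_noise_psd(frames, vad_flags):
    """Average the outer products y * y^H over frames flagged as noise-only.

    frames: list of per-frame vectors of complex STFT coefficients, one entry
            per microphone, all for the same frequency bin.
    vad_flags: list of 0/1 values, where 1 means voice activity was detected.
    Returns an M x M matrix as nested lists, or None if no noise frame exists.
    """
    noise_frames = [y for y, flag in zip(frames, vad_flags) if flag == 0]
    if not noise_frames:
        return None
    n_mics = len(noise_frames[0])
    psd = [[0j] * n_mics for _ in range(n_mics)]
    for y in noise_frames:
        for a in range(n_mics):
            for b in range(n_mics):
                psd[a][b] += y[a] * complex(y[b]).conjugate()
    count = len(noise_frames)
    return [[value / count for value in row] for row in psd]
```

Frames flagged as speech are skipped entirely, which is why a reliable VAD decision matters for the quality of the resulting matrix.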
ASR unit 150 may obtain the processed audio frames from BF unit 130, e.g., the audio frames after noise cancelation, and may perform speech recognition. For example, ASR unit 150 may convert spoken words included in the voiced audio frames to text, and may perform other tasks that are required to understand the meaning of the words and the intention of the speaker.
According to one interpretation, entropy may be seen as or may be a measure of the amount of uncertainty of
and a substantially perfectly directional distribution,
The values in equations 7 and 8 may provide boundaries for possible entropy levels. In case of directional noise, the entropy may typically be low, as in equation 8, while in case of non-directional noise the entropy may typically be high, as in equation 7. According to equation 7 the possible maximum value of the entropy is log2 N. While the possible minimum value according to equation 8 equals zero, this corresponds to a theoretical case of an infinite number of microphones 112 in microphone array 110. In more realistic cases the possible minimum value is higher than zero and depends on the constellation of microphone array 110. For a purely directional source located in front of the array and a uniform linear microphone array 110, the observed beam pattern may be provided by:
Where M is the number of microphones, d is the distance between two adjacent microphones, λ is the speech wavelength (usually 30 cm) and θ denotes the examined angles relative to the longitudinal axis of the linear array. According to equation 9, the entropy decreases as M increases. The term in equation 9 may approach the Dirac delta as M approaches infinity. Specifically, the SRP value from the DOA of the speaker may approach infinity while the other values are zero.
Reference is now made to
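$$En(m)=-\sum_{\theta}p(m,\theta)\,\log_2 p(m,\theta)\qquad\text{(Equation 10)}$$
Where the sum runs over the examined angles θ and p(m, θ) denotes the normalized SRP value at time index m and angle θ.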
An entropy level of a current or incoming audio frame may be referred to herein as the instantaneous entropy level.
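The instantaneous entropy can be sketched as below, assuming it is the Shannon entropy in bits of the normalized SRP vector. This assumption is consistent with the stated bounds of log2 N for a uniform distribution and zero for a perfectly directional one; the function name `srp_entropy` is hypothetical.

```python
import math


def srp_entropy(probabilities):
    """Shannon entropy, in bits, of a normalized SRP vector for one frame."""
    # Directions with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)
```

A uniform vector over four directions gives 2 bits (log2 4), a vector concentrated on a single direction gives 0, and intermediate spreads fall between these two bounds.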
In operation 340 background entropy, Ēn, may be estimated or calculated. For example, a sequence (e.g. a series of frames ordered by time of recording, the frames being contiguous in time) of audio frames in which the entropy levels are substantially constant, or vary within a narrow predefined range, during or across the sequence of frames may be detected. An entropy level of the sequence may be designated or denoted as a background entropy, Ēn. For example, the background entropy may equal an average of the entropy level across or during the sequence. Other methods for deriving the background entropy, or the entropy of the sequence, may be used.
In some embodiments a local minimum, EnLmin(m), and a local maximum, EnLmax(m), of the instantaneous entropy En(m) may be tracked. In some embodiments, the local minimum may be estimated by selecting a minimum value between the instantaneous entropy, En(m), and the last value of the local minimum, EnLmin(m−1). The last value of the local minimum, EnLmin(m−1), or the selected minimum value may be multiplied by a value slightly larger than one, e.g., by (1+ϵ), where ϵ is a small constant (e.g., ≈10^−4) that may prevent EnLmin(m) from being trapped at a global minimum point. The local maximum may be estimated by selecting a maximum value between the instantaneous entropy, En(m), and the last value of the local maximum, EnLmax(m−1). The last value of the local maximum, EnLmax(m−1), or the selected maximum value may be multiplied by a value slightly smaller than one, e.g., by (1−ϵ), that may prevent EnLmax(m) from being trapped at a global maximum point. For example, the local minimum and maximum may be estimated by:
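$$E_n^{\mathrm{Lmin}}(m)=\min\{E_n^{\mathrm{Lmin}}(m-1),\,E_n(m)\}\cdot(1+\epsilon)\qquad\text{(Equation 11)}$$
$$E_n^{\mathrm{Lmax}}(m)=\max\{E_n^{\mathrm{Lmax}}(m-1),\,E_n(m)\}\cdot(1-\epsilon)\qquad\text{(Equation 12)}$$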
Other equations may be used, for example:
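$$E_n^{\mathrm{Lmin}}(m)=\min\{E_n^{\mathrm{Lmin}}(m-1)\cdot(1+\epsilon),\,E_n(m)\}\qquad\text{(Equation 13)}$$
$$E_n^{\mathrm{Lmax}}(m)=\max\{E_n^{\mathrm{Lmax}}(m-1)\cdot(1-\epsilon),\,E_n(m)\}\qquad\text{(Equation 14)}$$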
In equation 11 the smaller value among the instantaneous entropy, En(m), or the former or previous local minimum, EnLmin(m−1), (e.g., the last value of the local minimum as was determined for an audio frame immediately previous to the incoming audio frame) may be selected and multiplied by (1+ϵ), and in equation 12 the larger value among the instantaneous entropy, En(m), or the former or previous local maximum, EnLmax(m−1), (e.g., the last value of the local maximum as was determined for an audio frame immediately previous to the incoming audio frame) may be selected and multiplied by (1−ϵ). The local range of the entropy may be estimated by the distance between the local maximum and minimum, e.g.:
$$E_n^{\mathrm{Range}}(m)\equiv\left|E_n^{\mathrm{Lmax}}(m)-E_n^{\mathrm{Lmin}}(m)\right|,\qquad\text{(Equation 15)}$$
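The tracking of the local minimum, local maximum and local range in equations 11, 12 and 15 can be sketched as follows. The function names and the way the previous values are initialized by the caller are illustrative assumptions.

```python
EPSILON = 1e-4  # small constant, on the order of 10^-4 as suggested in the text


def update_local_extrema(en, prev_min, prev_max, eps=EPSILON):
    """Equations 11 and 12: update the local minimum and maximum of the entropy."""
    # The (1 + eps) factor lets the minimum drift upward over time, and the
    # (1 - eps) factor lets the maximum drift downward, so that neither value
    # stays trapped at an old extreme.
    local_min = min(prev_min, en) * (1.0 + eps)
    local_max = max(prev_max, en) * (1.0 - eps)
    return local_min, local_max


def local_range(local_min, local_max):
    """Equation 15: distance between the local maximum and the local minimum."""
    return abs(local_max - local_min)
```

When the instantaneous entropy stays constant for long enough, both trackers settle next to it and the local range shrinks to a small value proportional to ϵ, which is the condition used to declare the entropy substantially constant.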
The background entropy, Ēn, may be updated only in frames in which the local minimum, EnLmin(m), and maximum EnLmax(m), are close enough, e.g.:
Where β is a decay factor. For example, β may equal 0.9, or another value. The threshold ζ may equal 0.05, 0.1, or another value. Thus, if the absolute value of the difference between the local minimum, EnLmin(m), and the local maximum, EnLmax(m), is larger than the threshold, ζ, then it may be decided that the entropy is not substantially constant, and the background entropy, Ēn, should not be updated. If, however, the absolute value of the difference between the local minimum, EnLmin(m), and the local maximum, EnLmax(m), is lower than the threshold, ζ, then it may be decided that the entropy is substantially constant and the background entropy, Ēn, may be updated. Other equations may be used, for example:
Other methods may be used to determine if the entropy is substantially constant and to update the background entropy. For example, it may be determined that if the entropy does not change by more than a predetermined value, e.g., 0.1, during a predetermined time window, e.g., 1-2 seconds, then the entropy is substantially constant, and that the background entropy equals the average entropy in the time window. A value may be substantially constant if it varies within a predefined range across or during a certain time period.
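The gated update of the background entropy can be sketched as below. The exact update formula is not reproduced in the text above; a first-order recursive average weighted by the decay factor β is an assumption, as is the function name. The defaults β = 0.9 and ζ = 0.1 are the example values given in the text.

```python
def update_background_entropy(background, en, entropy_range, beta=0.9, zeta=0.1):
    """Update the background entropy only when the local range is below zeta.

    background: previous background entropy
    en: instantaneous entropy of the incoming frame
    entropy_range: local range, e.g. |local maximum - local minimum|
    """
    if entropy_range < zeta:
        # Assumption: recursive averaging with decay factor beta.
        return beta * background + (1.0 - beta) * en
    # Entropy is not substantially constant: keep the previous estimate.
    return background
```

Gating the update in this way keeps speech frames, during which the entropy fluctuates, from leaking into the noise estimate.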
In operation 350 an incoming audio frame may be identified as containing or not containing voice activity based on entropy, e.g. according to the difference between a level of entropy of the current or incoming audio frame (the instantaneous entropy) and the background entropy. The following example decision rule may be used:
Where VADSRP(m) is the SRP based VAD for time index m, and ηVAD is a threshold. For example, the threshold ηVAD may equal 0.05, 0.1, or other value. Thus, if VADSRP(m)=1, then an audio frame related to time index m may contain speech, and if VADSRP(m)=0, then the audio frame related to time index m may not contain speech. Thus, if the difference between the level of entropy of the current audio frame and the background entropy is larger or higher than a threshold, ηVAD, it may be determined that the current audio frame contains speech, as indicated in block 370, and if the difference between the level of entropy of the current audio frame and the background entropy is not larger than the threshold it may be determined that the current audio frame does not contain voice activity or speech, as indicated in block 360.
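The decision rule described above can be sketched as a single comparison. Using the absolute difference, so that a change of the entropy in either direction counts, is an assumption; the text only refers to "the difference". The function name `vad_srp` is hypothetical and the default threshold is one of the example values given in the text.

```python
def vad_srp(en, background, eta_vad=0.1):
    """Return 1 if the frame is judged to contain speech, 0 otherwise."""
    # Assumption: compare the absolute difference, so that both an increase
    # and a decrease of the entropy relative to the background are detected.
    return 1 if abs(en - background) > eta_vad else 0
```

For example, with a background entropy of 2.0, a frame with entropy 2.5 or 1.5 is flagged as speech, while a frame with entropy 2.05 is not.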
In some embodiments VADSRP(m) may be further refined, for example using other VAD methods. For example, a final VAD(m) decision may be made by using an OR operation between an energy-based VAD, VADE(m), and the SRP-based VAD, VADSRP(m):
$$VAD(m)=VAD_{E}(m)\ \text{OR}\ VAD_{SRP}(m)\qquad\text{(Equation 19)}$$
According to the decision rule of equation 19, it may be determined that an audio frame related to time index m contains speech if at least one of the energy-based VAD and the SRP-based VAD indicates that the audio frame contains speech. In case both the energy-based VAD and the SRP-based VAD indicate that the audio frame does not contain speech, it may be determined that the audio frame does not contain speech. It is noted that the energy-based VAD tends to indicate 'noise' even when speech is present in low SNR cases. However, the directionality of the observed signals changes when speech is present, even in low SNR cases. Thus, employing the SRP values to detect these changes in directionality according to embodiments of the invention may improve the VAD performance. Other VAD methods and operations may be used in conjunction with the SRP-based VAD disclosed herein.
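The combination rule of equation 19 can be sketched as below. The function names are hypothetical, and the per-frame energy-based decisions are assumed to be supplied by some other detector that is not shown here.

```python
def combine_vad(vad_energy, vad_srp_flag):
    """Equation 19: declare speech if either detector reports speech."""
    return 1 if (vad_energy or vad_srp_flag) else 0


def combine_vad_sequence(energy_flags, srp_flags):
    """Apply the OR rule frame by frame to two equally long flag sequences."""
    return [combine_vad(e, s) for e, s in zip(energy_flags, srp_flags)]
```

The OR rule favors detection over rejection: a frame missed by the energy-based detector at low SNR can still be recovered by the SRP-based detector, at the cost of also inheriting the false alarms of both.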
In operation 380 a global minimum of the entropy, EnGmin, may be estimated or calculated. For example, the global minimum of the entropy, EnGmin, may be the minimal value of the instantaneous entropy in a predetermined time frame or time window such as one hour, one day or one week, etc. In some embodiments, the global minimum of the entropy, EnGmin, may be estimated or calculated based on voiced audio frames in the time frame or time window. In some embodiments, the global minimum of the entropy, EnGmin, may be estimated or calculated based on all the audio frames in the time frame or time window. In operation 390 entropy-based STD may be determined, e.g., it may be determined if only one speaker is active in voiced audio frames, e.g. if the frames contain voice activity of one speaker. For example, STD may be performed based on the difference between a level of entropy of the current or incoming audio frame (the instantaneous entropy) and the global minimum of the entropy, EnGmin. The following example decision rule may be used:
Where STD(m) is the entropy-based STD value for time index m, and ηSTD is a threshold. For example, the threshold ηSTD may equal 0.05, 0.1, or another value. If STD(m)=1, it may be determined that only one speaker is active in the audio frame related to time index m, and if STD(m)=0, it may be determined that more than one speaker is active in the audio frame related to time index m. Thus, if the difference between the level of entropy of the current audio frame and the global minimum of the entropy is larger than the threshold ηSTD, it may be determined that the current audio frame contains speech originating from a single speaker, as indicated in block 394, and if the difference is not larger than (e.g., equal to or smaller than) the threshold ηSTD, it may be determined that the current audio frame contains speech originating from two or more speakers, as indicated in block 392.
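Operations 380 and 390 may be sketched, for illustration only, as a sliding-window minimum followed by a threshold test. The class `SlidingMinimum`, the function `entropy_std` and the window length expressed in frames are hypothetical. The comparison direction follows the wording of the rule above, and taking the difference as instantaneous entropy minus EnGmin is an assumption of this sketch.

```python
from collections import deque


class SlidingMinimum:
    """Track the minimum entropy over the last `window_frames` frames."""

    def __init__(self, window_frames: int):
        if window_frames < 1:
            raise ValueError("window_frames must be at least 1")
        self.window_frames = window_frames
        # (frame_index, entropy) pairs with strictly increasing entropy.
        self._candidates = deque()
        self._index = 0

    def update(self, entropy: float) -> float:
        """Add one frame's entropy and return the current window minimum."""
        # Older values that are not smaller can never be the minimum again.
        while self._candidates and self._candidates[-1][1] >= entropy:
            self._candidates.pop()
        self._candidates.append((self._index, entropy))
        # Drop the front candidate once it has left the window.
        if self._candidates[0][0] <= self._index - self.window_frames:
            self._candidates.popleft()
        self._index += 1
        return self._candidates[0][1]


def entropy_std(instantaneous_entropy: float,
                global_min_entropy: float,
                eta_std: float = 0.1) -> int:
    """Return STD(m): 1 for a single speaker, 0 for two or more speakers."""
    # ASSUMPTION: difference = instantaneous entropy minus EnGmin, compared
    # with "larger than" exactly as the rule is worded in the text.
    difference = instantaneous_entropy - global_min_entropy
    return 1 if difference > eta_std else 0


if __name__ == "__main__":
    tracker = SlidingMinimum(window_frames=3)
    for entropy in [0.9, 0.4, 0.7, 0.8, 0.6]:
        en_gmin = tracker.update(entropy)
        print(entropy, en_gmin, entropy_std(entropy, en_gmin))
```

In embodiments where EnGmin is estimated from voiced frames only, `update` would be called only for frames already classified as containing speech.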
In operation 362, noise parameters may be characterized based on audio frames that do not contain voice activity, e.g., audio frames that were recognized as not containing speech in operation 360. Frames that do not contain speech may be analyzed in order to characterize, categorize or otherwise describe noise parameters. For example, the noise parameters may include the noise spatial characteristics, e.g., the PSD matrix of the noise.
In operation 372, the noise parameters extracted from the audio frames that do not contain speech (e.g., in operation 362) may be used for performing noise cancelation on the audio frames that do contain speech (e.g., audio frames that were recognized as containing speech in operation 370). Noise cancelation may enhance noisy speech (e.g., by enhancing the speech component of a recording including speech and noise) and improve the voice quality. For example, the noise spatial characteristics may be used for performing noise cancelation. In some embodiments, weights may be calculated based on the noise PSD matrix and the steering vector, and used to filter and sum the microphone signals. For example, the weights may be calculated to preserve, undistorted, the signal impinging from the assumed DOA of the speaker, while reducing the ambient noise as much as possible. The calculated weights may be used to perform pre-whitening of the noise and then activate a D&S beamformer.
In operation 396, speaker characteristics may be estimated based on audio frames that include a single speaker. For example, the speaker characteristics may include location and an RTF. In operation 374, the speaker characteristics may be used for speaker separation tasks, for example using beamforming and other methods. In some embodiments, blocks 380, 390, 392 and 394 may be performed only for audio frames that contain speech.
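The weight calculation described for operation 372 may be illustrated with the following sketch for a single frequency bin. It uses the common closed-form MVDR expression w = Φ⁻¹d / (dᴴΦ⁻¹d), where Φ is the noise PSD matrix and d is the steering vector. Whether the embodiments use exactly this expression is not stated in this section, so it is an assumption, as are the function names and the example values.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [[complex(a) for a in row] + [complex(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("noise PSD matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mvdr_weights(noise_psd, steering):
    """ASSUMED form: w = Phi^-1 d / (d^H Phi^-1 d) for one frequency bin."""
    phi_inv_d = solve_linear(noise_psd, steering)
    denom = sum(d.conjugate() * x for d, x in zip(steering, phi_inv_d))
    return [x / denom for x in phi_inv_d]


def apply_weights(weights, mic_bins):
    """Filter and sum the microphone signals: y = w^H x."""
    return sum(w.conjugate() * x for w, x in zip(weights, mic_bins))


if __name__ == "__main__":
    # Hypothetical two-microphone example values, not taken from the text.
    noise_psd = [[2.0, 0.5 + 0.5j], [0.5 - 0.5j, 1.0]]
    steering = [1.0 + 0j, 1j]
    w = mvdr_weights(noise_psd, steering)
    # A signal arriving from the assumed DOA passes with unit gain.
    print(abs(apply_weights(w, steering)))
```

With an identity noise PSD matrix the weights reduce to the steering vector divided by the number of microphones, i.e., a plain D&S beamformer, which is consistent with the description of pre-whitening followed by D&S.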
To examine the entropy-based STD, two speakers were recorded with single-talk and double-talk sections in a noiseless background. The speakers were placed 1 meter from the microphone array with 180° between them. In
Reference is made to
Operating system 915 may be or may include any code segment (e.g., one similar to executable code 925 described herein) designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 900, for example, scheduling execution of software programs or enabling software programs or other modules or units to communicate. Operating system 915 may be a commercial operating system.
Memory 920 may be or may include, for example, a Random Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Memory 920 may be or may include a plurality of, possibly different memory units. Memory 920 may be a computer or processor non-transitory readable medium, or a computer non-transitory storage medium, e.g., a RAM.
Executable code 925 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 925 may be executed by controller 905 possibly under control of operating system 915. For example, executable code 925 may be an application that when executed performs VAD and STD as further described herein. Although, for the sake of clarity, a single item of executable code 925 is shown in
Storage device 930 may be any applicable storage system, e.g., a disk or a virtual disk used by a VM. Storage 930 may be or may include, for example, a hard disk drive, a floppy disk drive, a Compact Disk (CD) drive, a CD-Recordable (CD-R) drive, a Blu-ray disk (BD), a universal serial bus (USB) device or other suitable removable and/or fixed storage unit. Content or data may be stored in storage 930 and may be loaded from storage 930 into memory 920 where it may be processed by controller 905. In some embodiments, some of the components shown in
Input devices 935 may be or may include microphones, a mouse, a keyboard, a touch screen or pad or any suitable input device. It will be recognized that any suitable number of input devices may be operatively connected to computing device 900 as shown by block 935. Output devices 945 may include one or more displays or monitors, speakers and/or any other suitable output devices. It will be recognized that any suitable number of output devices may be operatively connected to computing device 900 as shown by block 945. Any applicable input/output (I/O) devices may be connected to computing device 900 as shown by input devices 935 and output devices 945. For example, a wired or wireless network interface card (NIC), a printer, a universal serial bus (USB) device or external hard drive may be included in input devices 935 and/or output devices 945.
Some embodiments of the invention may include an article such as a computer or processor non-transitory readable medium, or a computer or processor non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory, encoding, including or storing instructions, e.g., computer-executable instructions, which, when executed by a processor or controller, carry out methods disclosed herein. For example, an article may include a storage medium such as memory 920, computer-executable instructions such as executable code 925 and a controller such as controller 905.
The storage medium may include, but is not limited to, any type of disk, semiconductor devices such as read-only memories (ROMs) and/or random access memories (RAMs), flash memories, electrically erasable programmable read-only memories (EEPROMs) or any type of media suitable for storing electronic instructions, including programmable storage devices. For example, in some embodiments, memory 920 is a non-transitory machine-readable medium.
A system according to some embodiments of the invention may include components such as, but not limited to, a plurality of central processing units (CPU) or any other suitable multi-purpose or specific processors or controllers (e.g., controllers similar to controller 905), a plurality of input units, a plurality of output units, a plurality of memory units, and a plurality of storage units. A system according to some embodiments of the invention may additionally include other suitable hardware components and/or software components. In some embodiments, a system may include or may be, for example, a personal computer, a desktop computer, a laptop computer, a workstation, a server computer, a network device, or any other suitable computing device. For example, a system according to some embodiments of the invention as described herein may include one or more devices such as computing device 900.
Different embodiments are disclosed herein. Features of certain embodiments may be combined with features of other embodiments; thus certain embodiments may be combinations of features of multiple embodiments.
Embodiments of the invention may include an article such as a computer or processor readable non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory device encoding, including or storing instructions, e.g., computer-executable instructions, which when executed by a processor or controller, cause the processor or controller to carry out methods disclosed herein.
While the invention has been described with respect to a limited number of embodiments, these should not be construed as limitations on the scope of the invention, but rather as exemplifications of some of the preferred embodiments. Other possible variations, modifications, and applications are also within the scope of the invention.
This Application is a continuation of U.S. application Ser. No. 16/435,656, published as United States Application Publication Serial Number 2019/0385635 and entitled “SYSTEM AND METHOD FOR VOICE ACTIVITY DETECTION”, which in turn claims the benefit of U.S. Provisional Application Ser. No. 62/684,357, filed Jun. 13, 2018, and of U.S. Provisional Application Ser. No. 62/774,879, filed Dec. 4, 2018, all of which are hereby incorporated by reference in their entirety.
| Number | Date | Country |
|---|---|---|
| 62684357 | Jun 2018 | US |
| 62774879 | Dec 2018 | US |

| | Number | Date | Country |
|---|---|---|---|
| Parent | 16435656 | Jun 2019 | US |
| Child | 17340118 | | US |