Voice activity detection using a soft decision mechanism

Information

  • Patent Grant
  • 11670325
  • Patent Number
    11,670,325
  • Date Filed
    Thursday, May 21, 2020
  • Date Issued
    Tuesday, June 6, 2023
  • CPC
  • Field of Search
    • CPC
    • G10L25/78
    • G10L2025/783
    • G10L2025/786
    • G10L25/81
    • G10L25/84
    • G10L25/87
    • G10L25/90
    • G10L2025/903
    • G10L2025/906
    • G10L25/93
    • G10L2025/932
    • G10L2025/935
    • G10L2025/937
    • H04M9/10
  • International Classifications
    • G10L25/78
    • Term Extension
      336
Abstract
Voice activity detection (VAD) is an enabling technology for a variety of speech-based applications. Herein disclosed is a robust VAD algorithm that is also language independent. Rather than classifying short segments of the audio as either “speech” or “silence”, the VAD as disclosed herein employs a soft-decision mechanism. The VAD outputs a speech-presence probability, which is based on a variety of characteristics.
Description
BACKGROUND

Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. The main uses of VAD are in speech coding and speech recognition. VAD can facilitate speech processing, and can also be used to deactivate some processes during identified non-speech sections of an audio session. Such deactivation can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol (VOIP) applications, saving on computation and on network bandwidth.


SUMMARY

Voice activity detection (VAD) is an enabling technology for a variety of speech-based applications. Herein disclosed is a robust VAD algorithm that is also language independent. Rather than classifying short segments of the audio as either “speech” or “silence”, the VAD as disclosed herein employs a soft-decision mechanism. The VAD outputs a speech-presence probability, which is based on a variety of characteristics.


In one aspect of the present application, a method of detection of voice activity in audio data comprises obtaining audio data, segmenting the audio data into a plurality of frames, computing an activity probability for each frame from a plurality of features of each frame, comparing a moving average of activity probabilities to at least one threshold, and identifying speech and non-speech segments in the audio data based upon the comparison.


In another aspect of the present application, a method of detection of voice activity in audio data comprises obtaining a set of segmented audio data, wherein the segmented audio data is segmented into a plurality of frames, calculating a smoothed energy value for each of the plurality of frames, obtaining an initial estimation of a speech presence in a current frame of the plurality of frames, updating an estimation of a background energy for the current frame of the plurality of frames, estimating a speech presence probability for the current frame of the plurality of frames, incrementing a sub-interval index u modulo U of the current frame of the plurality of frames, and resetting a value of a set of minimum tracers.


In another aspect of the present application, a non-transitory computer readable medium having computer executable instructions for performing a method comprises obtaining audio data, segmenting the audio data into a plurality of frames, computing an activity probability for each frame from a plurality of features of each frame, comparing a moving average of activity probabilities to at least one threshold, and identifying speech and non-speech segments in the audio data based upon the comparison.


In another aspect of the present application, a non-transitory computer readable medium having computer executable instructions for performing a method comprises obtaining a set of segmented audio data, wherein the segmented audio data is segmented into a plurality of frames, calculating a smoothed energy value for each of the plurality of frames, obtaining an initial estimation of a speech presence in a current frame of the plurality of frames, updating an estimation of a background energy for the current frame of the plurality of frames, estimating a speech presence probability for the current frame of the plurality of frames, incrementing a sub-interval index u modulo U of the current frame of the plurality of frames, and resetting a value of a set of minimum tracers.


In another aspect of the present application, a method of detection of voice activity in audio data comprises obtaining audio data, segmenting the audio data into a plurality of frames, calculating an overall energy speech probability for each of the plurality of frames, calculating a band energy speech probability for each of the plurality of frames, calculating a spectral peakiness speech probability for each of the plurality of frames, calculating a residual energy speech probability for each of the plurality of frames, computing an activity probability for each of the plurality of frames from the overall energy speech probability, band energy speech probability, spectral peakiness speech probability, and residual energy speech probability, comparing a moving average of activity probabilities to at least one threshold, and identifying speech and non-speech segments in the audio data based upon the comparison.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a flowchart that depicts an exemplary embodiment of a method of voice activity detection.



FIG. 2 is a system diagram of an exemplary embodiment of a system for voice activity detection.



FIG. 3 is a flow chart that depicts an exemplary embodiment of a method of tracing energy values.





DETAILED DISCLOSURE

Most speech-processing systems segment the audio into a sequence of overlapping frames. In a typical system, a 20-25 millisecond frame is processed every 10 milliseconds. Such speech frames are long enough to perform meaningful spectral analysis and capture the temporal acoustic characteristics of the speech signal, yet they are short enough to give fine granularity of the output.
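As a concrete illustration of this framing step, a minimal NumPy-based sketch is shown below; the 25 millisecond window and 10 millisecond hop mirror the typical values mentioned above, while the function name and the use of NumPy are illustrative assumptions rather than anything prescribed by the patent.

```python
import numpy as np

def segment_frames(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a mono signal into overlapping frames (e.g. a 25 ms frame every 10 ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[t * hop_len : t * hop_len + frame_len]
                     for t in range(n_frames)])

# Example: 1 second of audio at 8 kHz -> 98 frames of 200 samples, hopped by 80 samples.
frames = segment_frames(np.random.randn(8000), sample_rate=8000)
```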


Having segmented the input signal into frames, features, as will be described in further detail herein, are identified within each frame, and each frame is classified as silence or speech. In another embodiment, the speech-presence probability is evaluated for each individual frame. A sequence of frames that are classified as speech frames (e.g. frames having a high speech-presence probability) is identified in order to mark the beginning of a speech segment. Alternatively, a sequence of frames that are classified as silence frames (e.g. having a low speech-presence probability) is identified in order to mark the end of a speech segment.


As disclosed in further detail herein, energy values over time can be traced and the speech-presence probability estimated for each frame based on these values. Additional information regarding noise spectrum estimation is provided by I. Cohen. Noise spectrum estimation in adverse environment: Improved Minima Controlled Recursive Averaging. IEEE Trans. on Speech and Audio Processing, vol. 11(5), pages 466-475, 2003, which is hereby incorporated by reference in its entirety. In the following description, a series of energy values E1, E2, . . . , ET, computed from each frame in the processed signal, is assumed. All Et values are measured in dB. Furthermore, for each frame the following parameters are calculated:

    • St—the smoothed signal energy (in dB) at time t.
    • τt—the minimal signal energy (in dB) traced at time t.
    • τt(u)—the backup values for the minimum tracer, for 1≤u≤U (U is a parameter).
    • Pt—the speech-presence probability at time t.
    • Bt—the estimated energy of the background signal (in dB) at time t.


For the first frame, S1, τ1, τ1(u) (for each 1≤u≤U), and B1 are initialized to E1, and P1=0. The index u is set to 1.


For each frame t>1, the method 300 is performed.


At 302 the smoothed energy value is computed and the minimum tracers (0<αS<1 is a parameter) are updated, exemplarily by the following equations:

St = αS·St-1 + (1 − αS)·Et
τt = min(τt-1, St)
τt(u) = min(τt-1(u), St)


Then at 304, an initial estimation is obtained for the presence of a speech signal on top of the background signal in the current frame. This initial estimation is based upon the difference between the smoothed power and the traced minimum power. The greater the difference between the smoothed power and the traced minimum power, the more probable it is that a speech signal exists. A sigmoid function









Σ(x; μ, σ) = 1/(1 + e^(σ·(μ−x)))










can be used, where μ,σ are the sigmoid parameters:

q=Σ(St−τt;μ,σ)


Next, at 306, the estimation of the background energy is updated. Note that in the event that q is low (e.g. close to 0), in an embodiment the background estimate is updated at a rate controlled by the parameter 0<αB<1. In the event that this probability is high, the previous estimate may be maintained:

β = αB + (1 − αB)·√q
Bt = β·Bt-1 + (1 − β)·St


The speech-presence probability is estimated at 308 based on the comparison of the smoothed energy and the estimated background energy (again, μ,σ are the sigmoid parameters and 0<αP<1 is a parameter):

p=Σ(St−Bt;μ,σ)
Pt = αP·Pt-1 + (1 − αP)·p


In the event that t is divisible by V (V is an integer parameter which determines the length of a sub-interval for minimum tracing), then at 310, the sub-interval index u modulo U (U is the number of sub-intervals) is incremented and the values of the tracers are reset at 312:







τt = min{ τt(υ) : 1 ≤ υ ≤ U }
τt(u) = St





In embodiments, this mechanism enables the detection of changes in the background energy level. If the background energy level increases (e.g. due to a change in the ambient noise), this change can be traced after about U·V frames.
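The tracing procedure of FIG. 3 (steps 302 through 312) can be sketched as the small tracker below. This is an illustrative reading of the equations above, not the patented implementation; the parameter values (αS, αB, αP, the sigmoid parameters μ and σ, and the sub-interval sizes U and V) are arbitrary placeholders.

```python
import math

def sigmoid(x, mu, sigma):
    """The soft-decision function Σ(x; μ, σ) = 1 / (1 + exp(σ·(μ − x)))."""
    return 1.0 / (1.0 + math.exp(sigma * (mu - x)))

class EnergyTracker:
    """Traces smoothed energy, minimum energy, background energy and the
    speech-presence probability for a stream of per-frame energies in dB."""

    def __init__(self, e1, alpha_s=0.7, alpha_b=0.9, alpha_p=0.8,
                 mu=3.0, sigma=2.0, U=8, V=25):
        self.alpha_s, self.alpha_b, self.alpha_p = alpha_s, alpha_b, alpha_p
        self.mu, self.sigma, self.U, self.V = mu, sigma, U, V
        # Initialization for the first frame (t = 1): S1 = tau1 = tau1(u) = B1 = E1, P1 = 0.
        self.S = self.B = self.tau = e1
        self.tau_u = [e1] * U
        self.P = 0.0
        self.u = 0          # sub-interval index (0-based here)
        self.t = 1

    def update(self, e_t):
        self.t += 1
        # 302: smoothed energy and minimum tracers.
        self.S = self.alpha_s * self.S + (1 - self.alpha_s) * e_t
        self.tau = min(self.tau, self.S)
        self.tau_u[self.u] = min(self.tau_u[self.u], self.S)
        # 304: initial speech-presence estimate from smoothed vs. minimum energy.
        q = sigmoid(self.S - self.tau, self.mu, self.sigma)
        # 306: background energy update (essentially frozen while q is high).
        beta = self.alpha_b + (1 - self.alpha_b) * math.sqrt(q)
        self.B = beta * self.B + (1 - beta) * self.S
        # 308: speech-presence probability from smoothed vs. background energy.
        p = sigmoid(self.S - self.B, self.mu, self.sigma)
        self.P = self.alpha_p * self.P + (1 - self.alpha_p) * p
        # 310-312: every V frames, advance the sub-interval and reset the tracers.
        if self.t % self.V == 0:
            self.u = (self.u + 1) % self.U
            self.tau = min(self.tau_u)
            self.tau_u[self.u] = self.S
        return self.P
```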



FIG. 1 is a flow chart that depicts an exemplary embodiment of a method 100 of voice activity detection. FIG. 2 is a system diagram of an exemplary embodiment of a system 200 for voice activity detection. The system 200 is generally a computing system that includes a processing system 206, storage system 204, software 202, communication interface 208 and a user interface 210. The processing system 206 loads and executes software 202 from the storage system 204, including a software module 230. When executed by the computing system 200, software module 230 directs the processing system 206 to operate as described herein in further detail in accordance with the method 100 of FIG. 1 and the method 300 of FIG. 3.


Although the computing system 200 as depicted in FIG. 2 includes one software module in the present example, it should be understood that one or more modules could provide the same operation. Similarly, while description as provided herein refers to a computing system 200 and a processing system 206, it is to be recognized that implementations of such systems can be performed using one or more processors, which may be communicatively connected, and such implementations are considered to be within the scope of the description.


The processing system 206 can comprise a microprocessor and other circuitry that retrieves and executes software 202 from storage system 204. Processing system 206 can be implemented within a single processing device but can also be distributed across multiple processing devices or sub-systems that cooperate in executing program instructions. Examples of processing system 206 include general purpose central processing units, application specific processors, and logic devices, as well as any other type of processing device, combinations of processing devices, or variations thereof.


The storage system 204 can comprise any storage media readable by processing system 206 and capable of storing software 202. The storage system 204 can include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Storage system 204 can be implemented as a single storage device but may also be implemented across multiple storage devices or sub-systems. Storage system 204 can further include additional elements, such as a controller capable of communicating with the processing system 206.


Examples of storage media include random access memory, read only memory, magnetic discs, optical discs, flash memory, virtual memory and non-virtual memory, magnetic cassettes, magnetic tape, magnetic disc storage or other magnetic storage devices, or any other medium which can be used to store the desired information and that may be accessed by an instruction execution system, as well as any combination or variation thereof, or any other type of storage medium. In some implementations, the storage media can be a non-transitory storage media. In some implementations, at least a portion of the storage media may be transitory. It should be understood that in no case is the storage media a propagated signal.


User interface 210 can include a mouse, a keyboard, a voice input device, a touch input device for receiving a gesture from a user, a motion input device for detecting non-touch gestures and other motions by a user, and other comparable input devices and associated processing elements capable of receiving user input from a user. Output devices such as a video display or graphical display can display an interface further associated with embodiments of the system and method as disclosed herein. Speakers, printers, haptic devices and other types of output devices may also be included in the user interface 210.


As described in further detail herein, the computing system 200 receives an audio file 220. The audio file 220 may be an audio recording of a conversation, which may exemplarily be between two speakers, although the audio recording may be any of a variety of other audio records, including multiple speakers, a single speaker, or an automated or recorded auditory message. The audio file may exemplarily be a .WAV file, but may also be other types of audio files, exemplarily in a pulse code modulation (PCM) format; an example may include linear pulse code modulated (LPCM) audio files, or any other type of compressed audio. Furthermore, the audio file is exemplarily a mono audio file; however, it is recognized that embodiments of the method as disclosed herein may also be used with stereo audio files. In still further embodiments, the audio file may be streaming audio data received in real time or near-real time by the computing system 200.


In an embodiment, the VAD method 100 of FIG. 1 exemplarily processes frames one at a time. Such an implementation is useful for on-line processing of the audio stream. However, a person of ordinary skill in the art will recognize that embodiments of the method 100 may also be useful for processing recorded audio data in an off-line setting as well.


Referring now to FIG. 1, the VAD method 100 may exemplarily begin at step 102 by obtaining audio data. As explained above, the audio data may be in a variety of stored or streaming formats, including mono audio data. At step 104, the audio data is segmented into a plurality of frames. It is to be understood that in alternative embodiments, the method 100 may alternatively begin by receiving audio data already in a segmented format.


Next, at 106, one or more of a plurality of frame features are computed. In embodiments, each of the features is a probability that the frame contains speech, or a speech probability. Given an input frame that comprises samples x1, x2, . . . , xF (wherein F is the frame size), one or more, and in an embodiment all, of the following features are computed.


At 108, the overall energy speech probability of the frame is computed. Exemplarily the overall energy of the frame is computed by the equation:







Ē = 10·log10( Σk=1..F (xk)² )







As explained above with respect to FIG. 3, the series of energy levels can be traced. The overall energy speech probability for the current frame, denoted as pE, can be obtained and smoothed given a parameter 0<α<1:

p̃E = α·p̃E + (1 − α)·pE
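A small sketch of this step is shown below; computing the frame energy in dB follows the formula above, while feeding it to an energy tracker such as the one sketched earlier and the value of α are assumptions about how pE would be produced in practice.

```python
import numpy as np

def frame_energy_db(frame):
    """Overall frame energy Ē = 10·log10(Σ x_k²); a small floor avoids log(0)."""
    return 10.0 * np.log10(np.sum(np.asarray(frame) ** 2) + 1e-12)

def smooth_probability(prev_smoothed, p_e, alpha=0.9):
    """Exponential smoothing: p̃E = α·p̃E + (1 − α)·pE."""
    return alpha * prev_smoothed + (1 - alpha) * p_e
```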


Next, at step 110, a band energy speech probability is computed. This is performed by first computing the temporal spectrum of the frame (e.g. by concatenating the frame to the tail of the previous frame, multiplying the concatenated frames by a Hamming window, and applying Fourier transform of order N). Let X0, X1, . . . , XN/2 be the spectral coefficients. The temporal spectrum is then subdivided into bands specified by a set of filters H0(b), H1(b), . . . , HN/2(b) for 1≤b≤M (wherein M is the number of bands; the spectral filters may be triangular and centered around various frequencies such that ΣkHk(b)=1). Further detail of one embodiment is exemplarily provided by I. Cohen, and B. Berdugo. Spectral enhancement by tracking speech presence probability in subbands. Proc. International Workshop on Hand-free Speech Communication (HSC'01), pages 95-98, 2001, which is hereby incorporated by reference in its entirety. The energy level for each band is exemplarily computed using the equation:







E(b) = 10·log10( Σk=0..N/2 Hk(b)·|Xk|² )







The series of energy levels for each band is traced, as explained above with respect to FIG. 3, and a speech probability p(b) is obtained for each band in the current frame. The band energy speech probability pB for the frame is then:







pB = (1/M)·Σb=1..M p(b)
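For illustration, the band-energy computation might be sketched as follows, where `filters` is assumed to be an M × (N/2+1) array of triangular filters whose rows sum to 1, and the per-band probabilities p(b) are assumed to come from tracking each band's energy series with a tracker like the one sketched earlier.

```python
import numpy as np

def band_energies_db(prev_frame, frame, filters):
    """Temporal spectrum of [prev_frame | frame] under a Hamming window, then
    per-band energies E(b) = 10·log10(Σ_k H_k(b)·|X_k|²)."""
    x = np.concatenate([prev_frame, frame])
    X = np.fft.rfft(x * np.hamming(len(x)))            # spectral coefficients X_0 .. X_{N/2}
    power = np.abs(X) ** 2
    return 10.0 * np.log10(filters @ power + 1e-12)    # one energy value per band

def band_speech_probability(per_band_probs):
    """p_B is the average of the per-band speech probabilities p(b)."""
    return float(np.mean(per_band_probs))
```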








At 112, a spectral peakiness speech probability is computed. A spectral peakiness ratio is defined as:






ρ = ( Σ{k : |Xk| > |Xk−1|, |Xk+1|} |Xk|² ) / ( Σk=0..N/2 |Xk|² )







The spectral peakiness ratio measures how much energy is concentrated in the spectral peaks. Most speech segments are characterized by vocal harmonics, therefore this ratio is expected to be high during speech segments. The spectral peakiness ratio can be used to disambiguate between vocal segments and segments that contain background noises. The spectral peakiness speech probability pP for the frame is obtained by normalizing ρ by a maximal value ρmax (which is a parameter), exemplarily in the following equations:







pP = ρ/ρmax
p̃P = α·p̃P + (1 − α)·pP
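A sketch of this computation is given below; it follows the ratio defined above, while the clipping of pP to 1 and the default ρmax are added safeguards and placeholders, not values taken from the patent.

```python
import numpy as np

def spectral_peakiness_probability(X, rho_max=0.5, alpha=0.9, prev_smoothed=0.0):
    """ρ = energy in spectral peaks / total spectral energy, mapped to p̃P."""
    power = np.abs(np.asarray(X)) ** 2
    # A bin k is a peak if its magnitude exceeds both neighbours (equivalent on power).
    peaks = (power[1:-1] > power[:-2]) & (power[1:-1] > power[2:])
    rho = power[1:-1][peaks].sum() / (power.sum() + 1e-12)
    p_p = min(rho / rho_max, 1.0)                      # normalize by ρ_max, clip to 1
    return alpha * prev_smoothed + (1 - alpha) * p_p
```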







At step 114, the residual energy speech probability for each frame is calculated. To calculate the residual energy, a linear prediction analysis is first performed on the frame. In the linear prediction analysis, given the samples x1, x2, . . . , xF, a set of linear coefficients a1, a2, . . . , aL (L is the linear-prediction order) is computed, such that the following expression, known as the linear-prediction error, is brought to a minimum:






ε = Σk=1..F ( xk − Σi=1..L ai·xk−i )²






The linear coefficients may exemplarily be computed using a process known as the Levinson-Durbin algorithm, which is described in further detail in M. H. Hayes. Statistical Digital Signal Processing and Modeling. J. Wiley & Sons Inc., New York, 1996, which is hereby incorporated by reference in its entirety. The linear-prediction error (relative to the overall frame energy) is high for noises such as ticks or clicks, while in speech segments (and also for regular ambient noise) the linear-prediction error is expected to be low. We therefore define the residual energy speech probability (pR) as:







pR = ( 1 − ε / Σk=1..F (xk)² )²
p̃R = α·p̃R + (1 − α)·pR
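One way to realize this step is sketched below; the linear-prediction coefficients are obtained with a hand-rolled Levinson-Durbin recursion over the frame's autocorrelation sequence, and the prediction order of 10 is an arbitrary placeholder rather than a value given in the patent.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the linear-prediction normal equations from autocorrelations r[0..order]."""
    a = [1.0]
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= (1.0 - k * k)
    return [-c for c in a[1:]], err     # prediction coefficients a_1..a_L and error ε

def residual_energy_probability(frame, order=10, alpha=0.9, prev_smoothed=0.0):
    """p_R = (1 − ε / Σ x_k²)², followed by exponential smoothing."""
    x = np.asarray(frame, dtype=float)
    # Autocorrelation lags 0..order.
    r = [float(np.dot(x, x))] + [float(np.dot(x[:-i], x[i:])) for i in range(1, order + 1)]
    _, eps = levinson_durbin(r, order)
    p_r = (1.0 - eps / (r[0] + 1e-12)) ** 2
    return alpha * prev_smoothed + (1 - alpha) * p_r
```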







After one or more of the features highlighted above are calculated, an activity probability Q for each frame can be calculated at 116 as a combination of the speech probabilities for the band energy (pB), overall energy (pE), spectral peakiness (pP), and residual energy (pR) computed as described above for each frame. The activity probability (Q) is exemplarily given by the equation:

Q = √( pB · max{ p̃E, p̃P, p̃R } )
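A direct transcription of this fusion rule is shown below; the function name is mine, and as noted next, this geometric combination is only one of several possible fusion schemes.

```python
import math

def activity_probability(p_b, p_e_smoothed, p_p_smoothed, p_r_smoothed):
    """Q = sqrt(p_B · max{p̃_E, p̃_P, p̃_R})."""
    return math.sqrt(p_b * max(p_e_smoothed, p_p_smoothed, p_r_smoothed))
```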


It should be noted that there are other methods of fusing the multiple probability values (four in our example, namely pB, pE, pP, and pR) into a single value Q. The given formula is only one of many alternative formulae. In another embodiment, Q may be obtained by feeding the probability values to a decision tree or an artificial neural network.


After the activity probability (Q) is calculated for each frame at 116, the activity probabilities (Qt) can be used to detect the start and end of speech in audio data. Exemplarily, a sequence of activity probabilities is denoted by Q1, Q2, . . . , QT. For each frame, let Q̂t be the average of the probability values over the last L frames:








Q̂t = (1/L)·Σk=0..L−1 Qt−k








The detection of speech or non-speech segments is carried out with a comparison at 118 of the average activity probability Q̂t to at least one threshold (e.g. Qmax, Qmin). The detection can be conceived as a state machine with two states, “non-speech” and “speech”:

    • Start from the “non-speech” state and t=1
    • Given the tth frame, compute Qt and update Q̂t
    • Act according to the current state:
      • If the current state is “non-speech”: check if Q̂t>Qmax. If so, mark the beginning of a speech segment at time (t−k), and move to the “speech” state.
      • If the current state is “speech”: check if Q̂t<Qmin. If so, mark the end of a speech segment at time (t−k), and move to the “non-speech” state.
    • Increment t and return to the second step.


Thus, at 120 the identification of speech or non-speech segments is based upon the above comparison of the moving average of the activity probabilities to at least one threshold. In an embodiment, Qmax therefore represents a maximum activity probability to remain in a non-speech state, while Qmin represents a minimum activity probability to remain in the speech state.
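The two-threshold hysteresis described above can be sketched as the small state machine below. The window length and the threshold values are placeholders, and segment boundaries are marked at the current frame for simplicity (the patent marks them slightly earlier, at time t−k).

```python
from collections import deque

class SpeechSegmenter:
    """Hysteresis state machine over the moving average of activity probabilities."""

    def __init__(self, window=10, q_max=0.7, q_min=0.3):
        self.window = deque(maxlen=window)    # the last L activity probabilities
        self.q_max, self.q_min = q_max, q_min
        self.in_speech = False
        self.segments = []                    # completed (start_frame, end_frame) pairs
        self._start = None

    def push(self, t, q_t):
        self.window.append(q_t)
        q_avg = sum(self.window) / len(self.window)   # Q̂_t over the last L frames
        if not self.in_speech and q_avg > self.q_max:
            self.in_speech, self._start = True, t     # start of a speech segment
        elif self.in_speech and q_avg < self.q_min:
            self.segments.append((self._start, t))    # end of a speech segment
            self.in_speech, self._start = False, None
        return self.in_speech
```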


In an embodiment, the detection process is more robust than previous VAD methods, as the detection process requires a sufficient accumulation of activity probabilities over several frames to detect start-of-speech, or conversely, enough contiguous frames with low activity probability to detect end-of-speech.


Traditional VAD methods are based on frame energy or on band energies. The system and method of the present application also take into consideration additional features, such as residual LP energy and spectral peakiness. In other embodiments, additional features may be used that help distinguish speech from noise, where noise segments are also characterized by high energy values:

    • Spectral peakiness values are high in the presence of harmonics, which are characteristic of speech (or music). Car noises and babble noises, for example, are not harmonic and therefore have low spectral peakiness; and
    • High residual LP energy is characteristic for transient noises, such as clicks, bangs, etc.


The system and method of the present application uses a soft-decision mechanism and assigns a probability to each frame, rather than classifying it as either 0 (non-speech) or 1 (speech). This:

    • obtains a more reliable estimation of the background energies; and
    • is less dependent on a single threshold for the classification of speech/non-speech, which leads to false recognition of non-speech segments if the threshold is too low, or false rejection of speech segments if it is too high. Here, two thresholds are used (Qmin and Qmax in the application), allowing for some uncertainty. The moving average of the Q values makes the system and method switch from speech to non-speech (or vice versa) only when the system and method are confident enough.


The functional block diagrams, operational sequences, and flow diagrams provided in the Figures are representative of exemplary architectures, environments, and methodologies for performing novel aspects of the disclosure. While, for purposes of simplicity of explanation, the methodologies included herein may be in the form of a functional diagram, operational sequence, or flow diagram, and may be described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance therewith, occur in a different order and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology can alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all acts illustrated in a methodology may be required for a novel implementation.


This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to make and use the invention. The patentable scope of the invention is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal languages of the claims.

Claims
  • 1. A computing system, comprising: a processor having an input port for receiving audio data; and a storage system comprising a storage medium comprising executable instructions, wherein the processor is configured to execute the executable instructions, that, when executed by the at least one processor, cause the at least one processor to: calculate an activity probability Q for the audio data based on values calculated based on energy features of the audio data; and output the activity probability Q to an external device, wherein the activity probability Q is given by the equation: Q = √(pB·max{p̃E, p̃P, p̃R}), where: PB is band energy speech probability; PE is overall energy speech probability; PP is spectral peakiness speech probability; and PR is residual energy speech probability; and whereby Q greater than the threshold indicates voice in the audio data.
  • 2. The computing system of claim 1, wherein the residual energy speech probability (PR) is obtained by:
  • 3. The computing system of claim 1, wherein the executable instructions, when executed by the processor, further cause the processor to: segment the audio data into a sequence of frames, calculate the activity probability for each frame in the sequence, wherein the activity probability corresponds to a probability that the frame contains speech; determine, frame-by-frame, a state of each frame in the sequence as either speech or non-speech by comparing a moving average of activity probabilities for a group of frames, including the frame, to a selected threshold, wherein the selected threshold for a particular frame depends on the determined state of a frame preceding the particular frame in the sequence; identify non-speech segments in the audio data based upon the determined states of the frames; and deactivate subsequent processing of the non-speech segments in the audio data.
  • 4. The computing system of claim 3, wherein the selected threshold for a frame following a non-speech frame is a maximum activity probability, which the moving average must exceed for the state of the frame to be determined as speech.
  • 5. The computing system of claim 3, wherein the selected threshold for a frame following a speech frame is a minimum activity probability, which the moving average must be below for the state of the frame to be determined as non-speech.
  • 6. The computing system of claim 3, wherein the activity probability for a frame is a combination of a plurality of different speech probabilities computed using the audio data of the frame.
  • 7. The computing system of claim 6, wherein the plurality of different speech probabilities comprises: an overall energy speech probability based on an overall energy of the audio data; a band energy speech probability based on an energy of the audio data contained within one or more spectral bands; a spectral peakiness speech probability based on an energy of the audio data that is concentrated in one or more spectral peaks; and a residual energy speech probability based on a residual energy resulting from a linear prediction of the audio data.
  • 8. The computing system of claim 7, wherein the overall energy speech probability, the band energy speech probability, the spectral peakiness probability and the residual energy speech probability each have a value between 0 and 1, wherein 0 corresponds to non-speech and 1 corresponds to speech.
  • 9. The computing system of claim 8, wherein the activity probability is the square root of the band energy speech probability multiplied by the largest of the overall energy probability, the spectral peakiness probability, and the residual energy probability.
  • 10. The computing system of claim 3, wherein each non-speech segment corresponds to audio data in one or more consecutive non-speech frames bordered in the sequence by speech frames.
  • 11. The computing system of claim 10, wherein each speech segment corresponds to audio data in one or more consecutive speech frames bordered in the sequence by non-speech frames.
  • 12. A method for identifying speech and non-speech segments in audio data, the method comprising: calculating an activity probability Q for the audio data based on values calculated based on energy features of the audio data; and outputting the activity probability Q to an external device, wherein the activity probability Q is given by the equation: Q = √(pB·max{p̃E, p̃P, p̃R}), where: PB is band energy speech probability; PE is overall energy speech probability; PP is spectral peakiness speech probability; and PR is residual energy speech probability; identifying segments in the audio data containing non-speech data according to the activity probability Q; and detecting voice activity by comparing Q to a threshold, whereby Q greater than the threshold indicates voice in the audio data.
  • 13. The method of claim 12, further comprising: segmenting the audio data into a sequence of frames; calculating the activity probability for each frame in the sequence, wherein the activity probability corresponds to a probability that the frame contains speech; determining, frame-by-frame, a state of each frame in the sequence as either speech or non-speech by comparing a moving average of activity probabilities for a group of frames, including the frame, to a selected threshold, wherein the selected threshold for a particular frame depends on the determined state of a frame preceding the particular frame in the sequence; and identifying non-speech segments in the audio data based upon the determined states of the frames.
  • 14. The method of claim 13, further comprising: deactivating subsequent processing of the non-speech segments in the audio data.
  • 15. The method of claim 13, wherein the selected threshold for a frame following a non-speech frame is a maximum activity probability, which the moving average must exceed for the state of the frame to be determined as speech.
  • 16. The method of claim 13, wherein the selected threshold for a frame following a speech frame is a minimum activity probability, which the moving average must be below for the state of the frame to be determined as non-speech.
  • 17. The method of claim 13, wherein the activity probability for a frame is a combination of a plurality of different speech probabilities computed using the audio data of the frame.
  • 18. The method of claim 17, wherein the plurality of different speech probabilities comprises: an overall energy speech probability based on an overall energy of the audio data; a band energy speech probability based on an energy of the audio data contained within one or more spectral bands; a spectral peakiness speech probability based on an energy of the audio data that is concentrated in one or more spectral peaks; and a residual energy speech probability based on a residual energy resulting from a linear prediction of the audio data.
  • 19. The method of claim 18, wherein the overall energy speech probability, the band energy speech probability, the spectral peakiness probability and the residual energy speech probability each have a value between 0 and 1, wherein 0 corresponds to non-speech and 1 corresponds to speech.
  • 20. The method of claim 18, wherein the activity probability is the square root of the band energy speech probability multiplied by the largest of the overall energy probability, the spectral peakiness probability, and the residual energy probability.
  • 21. The method of claim 13, wherein each non-speech segment corresponds to audio data in one or more consecutive non-speech frames bordered in the sequence by speech frames.
  • 22. The method of claim 13, wherein each speech segment corresponds to audio data in one or more consecutive speech frames bordered in the sequence by non-speech frames.
CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/959,743, filed on Apr. 23, 2018, which is a continuation of U.S. patent application Ser. No. 14/449,770, filed on Aug. 1, 2014, which claims the benefit of U.S. Provisional Application No. 61/861,178, filed Aug. 1, 2013. The contents of these applications are hereby incorporated by reference in their entirety.

US Referenced Citations (149)
Number Name Date Kind
4653097 Watanabe et al. Mar 1987 A
4864566 Chauveau Sep 1989 A
5027407 Tsunoda Jun 1991 A
5222147 Koyama Jun 1993 A
5638430 Hogan et al. Jun 1997 A
5805674 Anderson Sep 1998 A
5907602 Peel et al. May 1999 A
5946654 Newman et al. Aug 1999 A
5963908 Chadha Oct 1999 A
5999525 Krishnaswamy et al. Dec 1999 A
6044382 Martino Mar 2000 A
6145083 Shaffer et al. Nov 2000 A
6266640 Fromm Jul 2001 B1
6275806 Pertrushin Aug 2001 B1
6311154 Gersho Oct 2001 B1
6427137 Petrushin Jul 2002 B2
6480825 Sharma et al. Nov 2002 B1
6510415 Talmor et al. Jan 2003 B1
6587552 Zimmerman Jul 2003 B1
6597775 Lawyer et al. Jul 2003 B2
6915259 Rigazio Jul 2005 B2
7006605 Morganstein et al. Feb 2006 B1
7039951 Chaudhari et al. May 2006 B1
7054811 Barzilay May 2006 B2
7106843 Gainsboro et al. Sep 2006 B1
7158622 Lawyer et al. Jan 2007 B2
7212613 Kim et al. May 2007 B2
7299177 Broman et al. Nov 2007 B2
7386105 Wasserblat et al. Jun 2008 B2
7403922 Lewis et al. Jul 2008 B1
7539290 Ortel May 2009 B2
7657431 Hayakawa Feb 2010 B2
7660715 Thambiratnam Feb 2010 B1
7668769 Baker et al. Feb 2010 B2
7693965 Rhoads Apr 2010 B2
7778832 Broman et al. Aug 2010 B2
7822605 Zigel et al. Oct 2010 B2
7908645 Varghese et al. Mar 2011 B2
7940897 Khor et al. May 2011 B2
8036892 Broman et al. Oct 2011 B2
8073691 Rajakumar Dec 2011 B2
8112278 Burke Feb 2012 B2
8311826 Rajakumar Nov 2012 B2
8510215 Gutierrez Aug 2013 B2
8537978 Jaiswal et al. Sep 2013 B2
8554562 Aronowitz Oct 2013 B2
8913103 Sargin et al. Dec 2014 B1
9001976 Arrowood Apr 2015 B2
9237232 Williams et al. Jan 2016 B1
9368116 Ziv et al. Jun 2016 B2
9558749 Seeker-Walker et al. Jan 2017 B1
9584946 Lyren et al. Feb 2017 B1
20010026632 Tamai Oct 2001 A1
20020022474 Blom et al. Feb 2002 A1
20020099649 Lee et al. Jul 2002 A1
20030050780 Rigazio Mar 2003 A1
20030050816 Givens et al. Mar 2003 A1
20030097593 Sawa et al. May 2003 A1
20030147516 Lawyer et al. Aug 2003 A1
20030208684 Camacho et al. Nov 2003 A1
20040029087 White Feb 2004 A1
20040111305 Gavan et al. Jun 2004 A1
20040131160 Mardirossian Jul 2004 A1
20040143635 Galea Jul 2004 A1
20040167964 Rounthwaite et al. Aug 2004 A1
20040203575 Chin et al. Oct 2004 A1
20040225501 Cutaia Nov 2004 A1
20040240631 Broman et al. Dec 2004 A1
20050010411 Rigazio Jan 2005 A1
20050043014 Hodge Feb 2005 A1
20050076084 Loughmiller et al. Apr 2005 A1
20050125226 Magee Jun 2005 A1
20050125339 Tidwell et al. Jun 2005 A1
20050185779 Toms Aug 2005 A1
20060013372 Russell Jan 2006 A1
20060106605 Saunders et al. May 2006 A1
20060111904 Wasserblat et al. May 2006 A1
20060149558 Kahn Jul 2006 A1
20060161435 Atef et al. Jul 2006 A1
20060212407 Lyon Sep 2006 A1
20060212925 Shull et al. Sep 2006 A1
20060248019 Rajakumar Nov 2006 A1
20060251226 Hogan et al. Nov 2006 A1
20060282660 Varghese et al. Dec 2006 A1
20060285665 Wasserblat et al. Dec 2006 A1
20060289622 Khor et al. Dec 2006 A1
20060293891 Pathuel Dec 2006 A1
20070041517 Clarke et al. Feb 2007 A1
20070071206 Gainsboro et al. Mar 2007 A1
20070074021 Smithies et al. Mar 2007 A1
20070100608 Gable et al. May 2007 A1
20070124246 Lawyer et al. May 2007 A1
20070244702 Kahn et al. Oct 2007 A1
20070280436 Rajakumar Dec 2007 A1
20070282605 Rajakumar Dec 2007 A1
20070288242 Spengler Dec 2007 A1
20080010066 Broman et al. Jan 2008 A1
20080162121 Son Jul 2008 A1
20080181417 Pereg et al. Jul 2008 A1
20080195387 Zigel et al. Aug 2008 A1
20080222734 Redlich et al. Sep 2008 A1
20080312914 Rajendran et al. Dec 2008 A1
20090046841 Hodge Feb 2009 A1
20090119103 Gerl et al. May 2009 A1
20090119106 Rajakumar May 2009 A1
20090147939 Morganstein et al. Jun 2009 A1
20090247131 Champion et al. Oct 2009 A1
20090254971 Herz et al. Oct 2009 A1
20090319269 Aronowitz Dec 2009 A1
20100174534 Vos Jul 2010 A1
20100228656 Wasserblat et al. Sep 2010 A1
20100303211 Hartig Dec 2010 A1
20100305946 Gutierrez Dec 2010 A1
20100305960 Gutierrez Dec 2010 A1
20110026689 Metz et al. Feb 2011 A1
20110119060 Aronowitz May 2011 A1
20110191106 Khor et al. Aug 2011 A1
20110202340 Ariyaeeinia et al. Aug 2011 A1
20110213615 Summerfield et al. Sep 2011 A1
20110251843 Aronowitz Oct 2011 A1
20110255676 Marchand et al. Oct 2011 A1
20110282661 Dobry et al. Nov 2011 A1
20110282778 Wright et al. Nov 2011 A1
20110320484 Smithies et al. Dec 2011 A1
20120053939 Gutierrez et al. Mar 2012 A9
20120054202 Rajakumar Mar 2012 A1
20120072453 Guerra et al. Mar 2012 A1
20120232896 Taleb et al. Sep 2012 A1
20120253805 Rajakumar et al. Oct 2012 A1
20120254243 Zeppenfeld et al. Oct 2012 A1
20120263285 Rajakumar et al. Oct 2012 A1
20120265526 Yeldener et al. Oct 2012 A1
20120284026 Cardillo et al. Nov 2012 A1
20130163737 Dement et al. Jun 2013 A1
20130197912 Hayakawa et al. Aug 2013 A1
20130253919 Gutierrez et al. Sep 2013 A1
20130253930 Seltzer et al. Sep 2013 A1
20130300939 Chou et al. Nov 2013 A1
20140067394 Abuzeina Mar 2014 A1
20140074467 Ziv et al. Mar 2014 A1
20140074471 Sankar et al. Mar 2014 A1
20140142940 Ziv et al. May 2014 A1
20140142944 Ziv et al. May 2014 A1
20140278391 Braho Sep 2014 A1
20150025887 Sidi et al. Jan 2015 A1
20150055763 Guerra et al. Feb 2015 A1
20150249664 Talhami et al. Sep 2015 A1
20160217793 Gorodetski et al. Jul 2016 A1
20170140761 Secker-Walker et al. May 2017 A1
Foreign Referenced Citations (7)
Number Date Country
0598469 May 1994 EP
2004193942 Jul 2004 JP
2006038955 Sep 2006 JP
2000077772 Dec 2000 WO
2004079501 Sep 2004 WO
2006013555 Feb 2006 WO
2007001452 Jan 2007 WO
Non-Patent Literature Citations (11)
Entry
Baum, L.E., et al., “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains,” The Annals of Mathematical Statistics, vol. 41, No. 1, 1970, pp. 164-171.
Cheng, Y., “Mean Shift, Mode Seeking, and Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, No. 8, 1995, pp. 790-799.
Cohen, I., “Noise Spectrum Estimation in Adverse Environment: Improved Minima Controlled Recursive Averaging,” IEEE Transactions On Speech and Audio Processing, vol. 11, No. 5, 2003, pp. 466-475.
Cohen, I., et al., “Spectral Enhancement by Tracking Speech Presence Probability in Subbands,” Proc. International Workshop in Hand-Free Speech Communication (HSC'01), 2001, pp. 95-98.
Coifman, R.R., et al., “Diffusion maps,” Applied and Computational Harmonic Analysis, vol. 21, 2006, pp. 5-30.
Hayes, M.H., “Statistical Digital Signal Processing and Modeling,” J. Wiley & Sons, Inc., New York, 1996, 200 pages.
Hermansky, H., “Perceptual linear predictive (PLP) analysis of speech,” Journal of the Acoustical Society of America, vol. 87, No. 4, 1990, pp. 1738-1752.
Lailler, C., et al., “Semi-Supervised and Unsupervised Data Extraction Targeting Speakers: From Speaker Roles to Fame?,” Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM), Marseille, France, 2013, 6 pages.
Mermelstein, P., “Distance Measures for Speech Recognition—Psychological and Instrumental,” Pattern Recognition and Artificial Intelligence, 1976, pp. 374-388.
Schmalenstroeer, J., et al., “Online Diarization of Streaming Audio-Visual Data for Smart Environments,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, No. 5, 2010, 12 pages.
Viterbi, A.J., “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm,” IEEE Transactions on Information Theory, vol. 13, No. 2, 1967, pp. 260-269.
Related Publications (1)
Number Date Country
20200357427 A1 Nov 2020 US
Provisional Applications (1)
Number Date Country
61861178 Aug 2013 US
Continuations (2)
Number Date Country
Parent 15959743 Apr 2018 US
Child 16880560 US
Parent 14449770 Aug 2014 US
Child 15959743 US