The present invention generally relates to voice activity detection (VAD), and more particularly to a VAD system with adaptive thresholds.
Voice activity detection (VAD) is the detection of the presence or absence of human speech, and is used primarily in speech processing. VAD can be used to activate speech-based applications. VAD can also avoid unnecessary transmission by deactivating some processes during non-speech periods, thereby reducing communication bandwidth and power consumption.
Conventional VAD systems are prone to erroneous or unreliable detection, particularly in noisy environments. A need has thus arisen for a novel scheme that overcomes the drawbacks of conventional VAD systems.
In view of the foregoing, it is an object of the embodiment of the present invention to provide a voice activity detection (VAD) system with adaptive thresholds that is capable of adapting to a varying environment and overcoming noise, thereby outputting a reliable and accurate detection result.
According to one embodiment, a voice activity detection (VAD) system includes a voice frame detector and a voice detector. The voice frame detector detects a voice frame during which a voice signal is not silent. The voice detector detects presence of human speech according to the voice frame.
In one embodiment, the VAD system further includes a threshold update unit that updates an associated threshold for detecting the presence of human speech according to a result of human speech detection by the voice detector.
Specifically, the VAD system 100 of the embodiment may include a transducer 11, such as a microphone, configured to convert sound into a voice (electrical) signal (step 21).
The VAD system 100 may include a voice frame detector 12 coupled to receive the voice signal and configured to detect a voice frame during which the voice signal is not silent (step 22). In one embodiment, the voice frame detector 12 may adopt end-point detection (EPD) to determine end points of the voice signal between which the voice signal is not silent. In one embodiment, a point at which the amplitude (representing volume) of the voice signal is greater than a predetermined threshold is determined to be an end point. In another embodiment, a point at which the high-order difference (HOD) (representing slope) of the voice signal is greater than a predetermined threshold is determined to be an end point.
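The two end-point criteria described above can be illustrated with a minimal sketch. The function names `find_end_points` and `high_order_difference`, the use of the first and last threshold crossings as the end points, and the choice of difference order are assumptions made for illustration only; they are not taken from the embodiment.

```python
def find_end_points(values, threshold):
    """Return (start, end) indices of the first and last entries whose
    magnitude exceeds the threshold, or None if the input is silent.
    Treating the first and last crossings as end points is an assumption."""
    above = [i for i, x in enumerate(values) if abs(x) > threshold]
    return (above[0], above[-1]) if above else None


def high_order_difference(signal, order=1):
    """Apply the first difference `order` times (a slope-like measure).
    Each pass shortens the sequence by one sample."""
    diff = list(signal)
    for _ in range(order):
        diff = [diff[i + 1] - diff[i] for i in range(len(diff) - 1)]
    return diff


# Amplitude-based end points on a hypothetical voice signal.
signal = [0.0, 0.01, 0.5, -0.7, 0.3, 0.02, 0.0]
frame_bounds = find_end_points(signal, threshold=0.1)

# HOD-based end points: the same search applied to the differenced signal.
hod_bounds = find_end_points(high_order_difference(signal, order=1), threshold=0.1)
```

Note that indices returned for the HOD case refer to the differenced sequence, which is shorter than the original signal by `order` samples.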
The VAD system 100 of the embodiment may include a voice detector 13 configured to detect presence of human speech according to the voice frames (step 23).
In the embodiment, presence of human speech is detected (by the voice detector 13) when a value of similarity (or correlation) between voice frames is greater than an associated threshold. Specifically, an auto-correlation function is applied to the voice frames to determine an auto-correlation value representing similarity (or detecting pitch) between a voice frame and a delayed voice frame with a time lag. The auto-correlation function (ACF) may be expressed as follows:
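The expression itself is not reproduced in this text. A commonly used form of the auto-correlation function, offered here as an assumption rather than as the original equation (in particular, the upper summation limit n−1−τ is assumed), is:

```latex
\mathrm{ACF}(\tau) = \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)
```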
where τ is the time lag, s is the voice frame, and i=0, . . . , n−1.
In the embodiment, a normalized squared difference function is further applied to the voice frames (e.g., a voice frame and a delayed voice frame with a time lag) to determine a normalized squared difference value, and the normalized squared difference function (NSDF) may be expressed as follows:
In the embodiment, presence of human speech is detected when both the auto-correlation value is greater than a first threshold and the normalized squared difference value is greater than a second threshold.
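The expression itself is not reproduced in this text. A commonly used form of the normalized squared difference function, offered here as an assumption rather than as the original equation, uses the same symbols τ, s, i and n as the auto-correlation function:

```latex
\mathrm{NSDF}(\tau) =
  \frac{2 \sum_{i=0}^{n-1-\tau} s(i)\, s(i+\tau)}
       {\sum_{i=0}^{n-1-\tau} \left( s(i)^{2} + s(i+\tau)^{2} \right)}
```

Under this assumed form the value lies between −1 and 1, with values near 1 indicating that the voice frame and the delayed voice frame are highly similar.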
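The two-threshold decision can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the forms of `acf` and `nsdf` are the commonly used definitions (an assumption), and searching over a list of candidate lags in `speech_present` is likewise an assumption, as are all function and parameter names.

```python
import math


def acf(s, tau):
    """Auto-correlation of frame s with itself delayed by tau samples
    (assumed form: sum of s[i] * s[i + tau])."""
    return sum(s[i] * s[i + tau] for i in range(len(s) - tau))


def nsdf(s, tau):
    """Normalized squared difference value (assumed form, range -1 to 1).
    Returns 0.0 for an all-zero frame to avoid division by zero."""
    energy = sum(s[i] ** 2 + s[i + tau] ** 2 for i in range(len(s) - tau))
    return 2.0 * acf(s, tau) / energy if energy > 0 else 0.0


def speech_present(s, lags, first_threshold, second_threshold):
    """True when, at some candidate lag, the auto-correlation value exceeds
    the first threshold AND the NSDF value exceeds the second threshold."""
    return any(
        acf(s, tau) > first_threshold and nsdf(s, tau) > second_threshold
        for tau in lags
    )


# Hypothetical periodic (pitched) frame with a period of 20 samples.
frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
detected = speech_present(frame, lags=[20], first_threshold=50.0, second_threshold=0.9)
```

Requiring both conditions means a loud but aperiodic frame (high auto-correlation value at some lag, low NSDF) and a quiet periodic frame (high NSDF, low auto-correlation value) are both rejected.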
Specifically, the VAD system 100 of the embodiment may include a threshold update unit 14 configured to determine updated first and second thresholds. The threshold update unit 14 is activated by an activate signal from the voice detector 13, which is asserted when the presence of human speech is not detected.
According to the embodiment as described above, as the thresholds for detecting presence of human speech are adaptively determined, the VAD system 100 and the VAD method 200 can adapt to a varying environment and overcome noise, thereby outputting a reliable and accurate detection result.
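The gating of the threshold update by the detection result can be sketched as follows. The text does not specify how the updated thresholds are computed, so the update rule used here (exponential smoothing toward the observed non-speech value plus a margin) is purely a hypothetical choice, as are the names `update_threshold`, `process_frame`, `margin` and `alpha`.

```python
def update_threshold(current, noise_value, margin, alpha=0.1):
    """Hypothetical update rule: move the threshold a fraction `alpha` of
    the way toward (noise_value + margin). Not taken from the embodiment."""
    target = noise_value + margin
    return (1.0 - alpha) * current + alpha * target


def process_frame(acf_value, nsdf_value, thresholds, margins, alpha=0.1):
    """Return (speech_detected, thresholds). The thresholds are updated
    only when speech is NOT detected, i.e. when the activate signal
    would be asserted."""
    first, second = thresholds
    speech = acf_value > first and nsdf_value > second
    if not speech:  # activate signal asserted
        first = update_threshold(first, acf_value, margins[0], alpha)
        second = update_threshold(second, nsdf_value, margins[1], alpha)
    return speech, (first, second)


# Speech frame: thresholds are left unchanged.
result_speech = process_frame(100.0, 0.95, (50.0, 0.8), (10.0, 0.2))

# Non-speech frame: thresholds are re-estimated from the observed values.
result_noise = process_frame(20.0, 0.5, (50.0, 1.0), (10.0, 0.5), alpha=0.5)
```

Updating only on non-speech frames keeps the thresholds tied to the current background conditions, so that speech itself does not raise them.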
In the embodiment, the VAD system 100A may include an artificial intelligence (AI) engine 17, for example, an artificial neural network, configured to analyze the images captured by the image sensor 16, and to send analysis results to the controller 15, which then performs specific functions or applications according to the analysis results.
Specifically, the VAD system 100B may further include a voice recognition unit 18 configured to recognize spoken language (and even translate the spoken language into text), to recognize a speaker, or both, according to the voice frames (from the voice frame detector 12). The voice recognition unit 18 is activated only when the voice trigger signal (from the voice detector 13) becomes asserted.
The VAD system 100B of the embodiment may further include a face recognition unit 19 configured to recognize a human face from the images captured by the image sensor 16. The face recognition unit 19 is activated only when the image trigger signal (from the controller 15) becomes asserted.
Although specific embodiments have been illustrated and described, it will be appreciated by those skilled in the art that various modifications may be made without departing from the scope of the present invention, which is intended to be limited solely by the appended claims.