This application claims the benefit under 35 U.S.C. §119(a) of an Indian patent application filed on Sep. 5, 2012 in the Intellectual Property Office of India and assigned Serial No. 2761/DEL/2012, the entire disclosure of which is hereby incorporated by reference.
The present disclosure relates to the field of speech and audio processing. More particularly, the present disclosure relates to voice activity detection in a voice processing apparatus under adverse environmental conditions.
Recent growth in communication technologies and the concurrent development of powerful electronic devices have enabled the development of various multimedia related techniques. However, the use of many voice-enabled devices, systems, and communication technologies is limited due to issues related to battery life (or power consumption) of the device, accuracy, transmission, and storage cost. In audio processing and communication systems, the overall performance in terms of accuracy, computational complexity, memory consumption, and other factors greatly depends on the ability to discriminate a voice-based speech signal from a non-voice/noise signal present in an input audio signal under an adverse environment, where various kinds of noises exist.
Existing systems and methods have attempted to develop voice/speech activity detection, voice and non-voice detection, temporal and spectral features based systems, source-filter based systems, time-frequency domain based systems, audio-visual based systems, statistical based systems, entropy based systems, short-time spectral analysis systems, and speech endpoint/boundary detection for discriminating a voice signal portion and a non-voice signal portion by using feature information extracted from the input signal. However, it is difficult to detect and extract a voice signal portion, since the voice signal is usually corrupted by a wide range of background sounds and noises.
The existing systems and methods for voice/speech detection have many shortcomings, such as: (i) their performance may be diminished under highly non-stationary and low Signal-to-Noise Ratio (SNR) environments; (ii) they may be less robust under various types of background sound sources, including applause, laughter, crowd noises, cheering, whistling, explosive sounds, babble, train noise, car noise, and so on; (iii) they have less discriminative power in characterizing signal frames having periodic structured noise components; and (iv) fixing a peak amplitude threshold for computing the periodicity from the autocorrelation lag index is very difficult under different noises and noise levels.
Due to the above-mentioned reasons, the existing systems and methods fail to provide reliable detection when the level of background noise increases and the signal is corrupted by time-varying noise levels. Thus, the use of appropriate noise-robust features to characterize speech and non-speech signals is critical for all detection problems. Hence, there is a need for a system which achieves better detection performance at a low computational cost.
The above information is presented as background information only to assist with an understanding of the present disclosure. No determination has been made, and no assertion is made, as to whether any of the above might be applicable as prior art with regard to the present disclosure.
Aspects of the present disclosure are to address at least the above-mentioned problems and/or disadvantages and to provide at least the advantages described below. Accordingly, an aspect of the present disclosure is to provide a method and system to achieve robust voice activity detection under adverse environmental conditions.
Another aspect of the disclosure is to provide a method to determine endpoints of voice regions.
Another aspect of the disclosure is to provide a method to perform noise reduction and to improve the robustness of voice activity detection against different kinds of realistic noises at varying noise levels.
In accordance with an aspect of the present disclosure, a method for Voice Activity Detection (VAD) in adverse environmental conditions is provided. The method includes receiving an input signal from a source, classifying the input signal into a silent or non-silent signal block by comparing temporal feature information, sending the silent or non-silent signal block to a Voice Endpoint Storing (VES) module or total variation filtering module by comparing the temporal feature information to thresholds, determining endpoint information of a voice signal or non-voice signal, employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions, determining a noise floor in the total variation filtered signal domain, determining feature information in autocorrelation of the total variation filtered signal sequence, determining Binary-flag Storing, Merging and Deletion (BSMD) based on the duration threshold on the determined feature information by a BSMD module, determining a voice endpoint correction based on short-term temporal feature information after the determined BSMD, and outputting the input signal with the voice endpoint information.
Accordingly the disclosure provides a system for VAD in adverse environmental conditions. The system is configured for receiving an input signal from a source. The system is also configured for classifying the input signal into a silent or non-silent signal block by comparing temporal feature information. The system is also configured for sending the silent or non-silent signal block to a VES module or a total variation filtering module by comparing the temporal feature information to the thresholds. The system is also configured for determining endpoint information of a voice signal or a non-voice signal. The system is also configured for employing total variation filtering for enhancing speech features and suppressing noise levels in non-speech portions. Further, the system is configured for determining a noise floor in the total variation filtered signal domain, determining feature information in autocorrelation of the total variation filtered signal sequence. Furthermore, the system is configured for determining BSMD based on the duration threshold on the determined feature information. Furthermore, the system is configured for determining voice endpoint correction based on the short-term temporal feature information after the determined BSMD and outputting the input signal with the voice endpoint information.
In accordance with another aspect of the present disclosure, an apparatus for voice activity detection in adverse environmental conditions is provided. The apparatus includes an integrated circuit including a processor, and a memory having a computer program code within the integrated circuit. The memory and the computer program code are configured to, with the processor, cause the apparatus to receive an input signal from a source. The processor causes the apparatus to classify the input signal into a silent or non-silent signal block by comparing temporal feature information. The processor causes the apparatus to send the silent or non-silent signal block to a VES module or a total variation filtering module by comparing the temporal feature information to thresholds. Further, the processor causes the apparatus to determine endpoint information of a voice signal or a non-voice signal by the VES module or the total variation filtering module. Furthermore, the processor causes the apparatus to employ total variation filtering by the total variation filtering module for enhancing speech features and suppressing noise levels in non-speech portions. Furthermore, the processor causes the apparatus to determine a noise floor in the total variation filtered signal domain. Furthermore, the processor causes the apparatus to determine feature information in autocorrelation of the total variation filtered signal sequence. Furthermore, the processor causes the apparatus to determine BSMD based on the duration threshold on the determined feature information by a BSMD module. Furthermore, the processor causes the apparatus to determine voice endpoint correction based on short-term temporal feature information after the determined BSMD, and output the input signal with the voice endpoint information.
Other aspects, advantages, and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses various embodiments of the present disclosure.
The above and other aspects, features, and advantages of certain embodiments of the present disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
Throughout the drawings, like reference numerals will be understood to refer to like parts, components, and structures.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of various embodiments of the present disclosure as defined by the claims and their equivalents. It includes various specific details to assist in that understanding but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the various embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions may be omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the bibliographical meanings, but, are merely used by the inventor to enable a clear and consistent understanding of the present disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the present disclosure is provided for illustration purpose only and not for the purpose of limiting the present disclosure as defined by the appended claims and their equivalents.
It is to be understood that the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a component surface” includes reference to one or more of such surfaces.
The various embodiments herein achieve a method and system of a voice activity detection which can be used in a wide range of audio and speech processing applications. The proposed method accurately detects voice signal regions and determines endpoints of voice signal regions in audio signals under diverse kinds of background sounds and noises with varying noise levels.
Referring now to the drawings, and more particularly to
Referring to
In an embodiment, the VAD module 102 can be an integrated circuit, System-on-a-Chip (SoC or SOC), a communication device (e.g., mobile phone, Personal Digital Assistant (PDA), tablet), and the like.
Referring to
In an embodiment, the SBD module 201 includes a memory buffer, a plurality of programs, and a history of memory allocations. Further, the SBD module 201 sets a block length based on a buffer memory size of the processing device, and divides the input discrete-time signal received from the data acquisition module into equal-sized blocks of N×1 samples. The selection of an appropriate block length depends on the type of applications of interest, as well as on the memory size allocated for a scheduled task and other internal resources, such as processor power consumption, processor speed, memory, or I/O (Input/Output) of audio communication and processing devices.
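The block-division step of the SBD module can be sketched as follows. This is a minimal illustration; the function name and the policy of holding back trailing samples shorter than one block for the next acquisition cycle are assumptions, not specified in the disclosure:

```python
import numpy as np

def divide_into_blocks(signal, N):
    # Split a discrete-time signal into equal-sized N x 1 blocks.
    # Trailing samples shorter than N are held back (assumed policy).
    signal = np.asarray(signal, dtype=float)
    n_blocks = len(signal) // N
    blocks = [signal[i * N:(i + 1) * N] for i in range(n_blocks)]
    remainder = signal[n_blocks * N:]  # kept in the buffer for the next block
    return blocks, remainder
```

The block length N would be chosen from the buffer memory size and application constraints, as described above.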
Further, the SBD module 201 waits for a specific period of time for the audio data to be acquired sufficiently, and releases collected data for further processing as the memory buffer gets full. The SBD module 201 holds data for a short period of time until finishing the VAD process. The internal memories of the SBD module 201 will be refreshed periodically. The next block processing continues based on action variable information. The SBD module 201 maintains history information including a start and endpoint position of a block, a memory size, and action variable state information.
Referring to
At operation 304, the filtered signal is divided into signal frames, and at operation 305 the signal frames are classified as silent or non-silent frames using feature parameters extracted from the signal frame and the total variation residual under a wide range of background noises encountered in real-world applications. At operation 306, binary values generated by the voice/non-voice signal classification process are stored (e.g., 1: voice and 0: non-voice). At operation 307, signal frames are merged and deleted using duration information by processing the binary sequence information obtained for each signal block. At operation 308, the endpoint of a voice signal is determined by using the binary sequence information and energy envelope information. Further, at operation 309, the endpoints are corrected using feature parameters computed from a portion of the signal samples extracted around the endpoints determined in the previous operations. At operation 310, the voice endpoint information, or the input signal with the voice endpoint information, is output to speech-related technologies and systems. The various actions in method 300 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions illustrated in
Referring to
In an embodiment, the SNBC module 202 includes means for receiving an input signal block from a memory buffer, means for determining temporal feature parameters from the received signal block, means for determining silent blocks by comparing the extracted temporal feature parameters to the thresholds, means for determining endpoints of a non-silent signal block, and means for generating action variable information to send the signal block either to the VES module 206 or to the total variation filtering module 203.
Further, the SNBC module 202 is constructed using a Hierarchical Decision-Tree (HDT) scheme with a threshold. The SNBC module 202 extracts one or more temporal features (e.g., energy, zero crossing rate, and energy envelope) from an input signal block received from the SBD module 201. The temporal features may represent the varied nature of an audio signal and can be used to classify the input signal block. The HDT uses the feature information extracted from the input signal block and a threshold for detecting a silent signal block. The HDT sends a signal block, as an output, to the total variation filtering module 203 only when the feature information of the signal block is equal to or greater than the threshold. The method provides SFD for dividing a total variation filtered signal into consecutive signal frames.
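A first HDT stage of this kind can be sketched as below. The threshold values and the exact decision rule are assumptions for illustration, not those of the disclosure:

```python
import numpy as np

def classify_block(block, energy_thresh, zcr_thresh):
    # Hierarchical decision: a block is non-silent if either its short-time
    # energy or its zero-crossing rate reaches the corresponding threshold.
    block = np.asarray(block, dtype=float)
    energy = np.mean(block ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(block))) > 0)
    if energy >= energy_thresh or zcr >= zcr_thresh:
        return "non-silent"  # forwarded to the total variation filtering module
    return "silent"          # forwarded directly to the VES module
```

A silent decision here lets all downstream filtering and autocorrelation work be skipped, which is the computational saving the SNBC module aims for.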
The SFD module 205 receives a filtered signal block from the TVF module 203 and divides the received filtered signal into equal-sized overlapping short signal frames with a frame length of L samples. The frame length and the frame shift are adopted based on the system requirements. The SFD module 205 sends a signal frame to the SNFC module 208 according to the action variable information received from the succeeding modules. In another aspect of the HDT, the decision stage considers the signal block to be a silent block when the feature information is smaller than a threshold. In such a scenario, the SNBC module 202 directly sends action variable information to the VES module 206 without sending the action variable information to the other signal processing units. The main objective of the SNBC module 202 is to reduce computational cost and power consumption, since a long silent interval frequently occurs between two successive voice signal portions. The various actions in method 400 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in
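The overlapping framing performed by the SFD module can be sketched as follows; frame length and shift are illustrative parameters, chosen per system requirements as stated above:

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    # Divide a filtered signal block into equal-sized overlapping frames of
    # frame_len samples, advancing by frame_shift samples per frame.
    x = np.asarray(x, dtype=float)
    assert len(x) >= frame_len, "block shorter than one frame"
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    return np.stack([x[i * frame_shift:i * frame_shift + frame_len]
                     for i in range(n_frames)])
```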
Referring to
The total variation filtering technique is a process often used in digital image processing that has applications in noise removal. Total variation filtering is based on the principle that signals with excessive and possibly spurious details have high total variation, that is, the integral of the absolute gradient of the signal is high.
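For a 1-D signal y, total variation denoising finds the x minimizing ½‖y − x‖² + λ·Σ|x[n+1] − x[n]|. The disclosure does not specify a solver, so the sketch below uses iterative clipping (a standard dual-ascent method); it is one possible implementation, not the patent's:

```python
import numpy as np

def tv_denoise(y, lam, n_iter=200):
    # 1-D total variation denoising via iterative clipping (dual ascent).
    # Minimizes 0.5*||y - x||^2 + lam * sum(|x[n+1] - x[n]|).
    y = np.asarray(y, dtype=float)
    z = np.zeros(len(y) - 1)          # dual variable, one per first difference
    alpha = 4.0                       # >= max eigenvalue of D D^T
    for _ in range(n_iter):
        # x = y - D^T z, where D is the first-difference operator
        x = y - np.concatenate(([-z[0]], -np.diff(z), [z[-1]]))
        z = np.clip(z + np.diff(x) / alpha, -lam / 2.0, lam / 2.0)
    return x
```

Larger λ yields a flatter, piecewise-constant estimate, which is what suppresses noise in non-speech portions while preserving abrupt speech onsets.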
Referring to
Referring to
The experimental results on different noise types demonstrate that the total variation filtering technique can address the robustness of the traditional features. The capabilities of the total variation filtering technique can be observed from the energy envelopes extracted from the noisy signal and the total variation filtered signal.
Further, the total variation filtering technique improves noise reduction compared to existing filtering techniques, even if the input signal is a mixture of different background noise sources at varying amplitude levels, low-frequency voiced speech portions, and unvoiced portions, which often reduce the detection rates of most voice activity detection systems based on prior art techniques. The main advantage of using the total variation smoothing filter is that it preserves speech properties of interest in a different manner than conventional filtering techniques used for suppressing noise components.
Referring to
The SNFC module 208 comprises means for receiving total variation filtered signal frames, means for extracting temporal feature information from each signal frame, means for determining silent signal frames by comparing extracted feature information to thresholds, means for determining binary-flag information (e.g., 1: non-silent signal frame and 0: silent signal frame), and means for generating action variable information to send the signal block either to a VNFC module or to a BSMD module. The main objective of the SNFC module 208 is to reduce computational cost and power consumption where a silent portion frequently occurs between voice signal portions. Further, the SNFC module 208 with total variation filter feature information provides better discrimination of silent signal frames from non-silent signal frames.
The binary-flag information may include binary values of 0 (i.e., a False Statement) and 1 (i.e., a True Statement). Further, the decision tree of the HDT sends the binary-flag information value of 0, as an output, to a BSMD module without sending the signal frame to the VNFC module 207 for further signal processing. Otherwise, the input signal frame is further processed at the VNFC module 207 only when feature information extracted from the input signal frame is equal to or greater than thresholds. The various actions in method 900 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in
Referring to
Referring to
From
Referring to
The VNFC module 207 includes means for receiving a non-silent signal frame from the signal frame classification module, means for computing normalized one-sided autocorrelation of a non-silent signal frame, means for extracting autocorrelation feature information, means for determining a voice signal frame and a non-voice signal frame based on the extracted total variation residual and autocorrelation features by comparing features to thresholds, means for generating action variable information to send the voice signal frame to a BSMD module and to control a voice activity detection process. The VNFC module 207 classifies an input non-silence signal frame into a voice signal frame and a non-voice signal frame. Based on the classification results, the VNFC module 207 generates binary-flag information (e.g., binary-flag 0 for non-voice signal frame and binary-flag 1 for voice signal frame) to determine the endpoint of the voice signal activity portion.
The VNFC module 207 includes three major stages: autocorrelation computation, feature extraction, and decision. The classification method is implemented using a multi-stage HDT scheme with thresholds. The flowchart configuration of the multi-stage HDT can be redesigned according to the computational complexity and memory space involved in extracting feature parameters from the autocorrelation sequence of the non-silence signal frame.
In an embodiment, the VNFC module 207 first receives a non-silence signal frame with a number of signal samples. The VNFC module 207 then computes normalized one-sided autocorrelation of a non-silence signal frame represented as d[n]. The autocorrelation of the signal frame d[n] with length of N samples is computed using Equation (1):
In Equation (1), r denotes the autocorrelation sequence, and k denotes the lag of the autocorrelation sequence.
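Since the body of Equation (1) is not reproduced above, the sketch below assumes the standard normalized one-sided autocorrelation, normalized by r[0] so that r[0] = 1 and |r[k]| ≤ 1:

```python
import numpy as np

def normalized_autocorrelation(d):
    # One-sided autocorrelation of frame d[n], n = 0..N-1, normalized by r[0]
    # so that r[0] = 1 and |r[k]| <= 1 for all lags k (Cauchy-Schwarz).
    d = np.asarray(d, dtype=float)
    N = len(d)
    r = np.array([np.dot(d[:N - k], d[k:]) for k in range(N)])
    return r / r[0]
```

For a periodic frame, this sequence shows a strong peak at the lag equal to the pitch period, which is the property the VNFC features exploit.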
The feature information from the autocorrelation sequence is used to characterize signal frames. The periodicity feature of the autocorrelation may provide temporal and spectral characteristics of a signal to be processed. For example, periodicity in the autocorrelation sequence indicates that the signal is periodic. The autocorrelation function falls to zero for highly non-stationary signals. Voiced speech sound is periodically correlated, and other background sounds from noise sources are not (or are only weakly correlated). If a frame of a voiced sound signal is periodically correlated (or quasi-periodic), its autocorrelation function has the maximum peak value at the location of the pitch period of the voiced sound. In general, the autocorrelation function demonstrates the maximum peak within the lag value range corresponding to the expected pitch periods of 2 to 20 ms for voiced sounds. Conventional voice activity detection assumes that voiced speech may have a higher maximum autocorrelation peak value than the background noise frames. In an embodiment, the maximum autocorrelation peak value may be diminished, and the autocorrelation lag of the maximum peak may deviate from the threshold range, due to phoneme variations and different background sources including applause, laughter, car, train, crowd cheer, babble, thermal noise, and so on. The feature parameters that are extracted from the autocorrelation of the total variation filtered signal can have the ability to increase the robustness of the VAD process.
Further, the VNFC module 207 extracts the feature information comprising an autocorrelation lag index (or time lag) of a first zero crossing point of the autocorrelation function, a lag index of a minimum point of the autocorrelation function, an amplitude of a minimum point of the autocorrelation function, lag indices of local maxima points of the autocorrelation function, amplitudes of local maxima points, and decaying energy ratios. The extraction of feature information is done in a sequential manner according to the heuristic decision rules followed in the preferred HDT scheme.
The lag index (or time lag) of the first zero crossing point is used to characterize the frames with highly non-stationary noises (or transients). From various experimental results, it is noted that the lag index of the first zero crossing point of the autocorrelation sequence is less than a lag value of 4 for several types of noises.
The proposed method uses the lag index of the first zero crossing point feature to detect the noise frames. For a given autocorrelation sequence with a certain number of coefficients, the first zero crossing point is described as in Equation (2):
fzcp1 = first_zcp(r[m]), 0 ≤ m ≤ UL1    Equation (2)
In Equation (2), first_zcp(·) is the function that provides the lag index of the first zero crossing point (fzcp1), m denotes the autocorrelation lag index variable, and UL1 denotes the upper lag index value.
The proposed method determines the lag index of the first zero crossing point within a new autocorrelation sequence constructed from a certain number of autocorrelation values. Thus, the proposed method may reduce the computational cost of the feature extraction by examining only a few autocorrelation sequence values. In addition, power consumption, computational load, and memory consumption may be reduced when a particular type of noise constantly occurs.
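The first-zero-crossing search of Equation (2) can be sketched as follows; the sentinel value for "no crossing found" is an illustrative choice:

```python
def first_zero_crossing_lag(r, upper_lag):
    # Lag index of the first positive-to-nonpositive crossing of the
    # autocorrelation sequence r, searched over lags 0..upper_lag (UL1).
    for m in range(1, min(upper_lag + 1, len(r))):
        if r[m - 1] > 0 and r[m] <= 0:
            return m
    return -1  # no zero crossing found within the search range
```

Restricting the search to the first UL1 lags is what keeps this feature cheap, per the paragraph above.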
For a given range of an autocorrelation sequence, the lag index and amplitude of the minimum peak are computed using Equation (3):
[rmin_amp, rmin_lag] = min_amp_lag(r[m])    Equation (3)

In Equation (3), min_amp_lag(·) is the function which computes the minimum amplitude (rmin_amp) and its lag index (rmin_lag) of the autocorrelation sequence within the given lag interval, and m denotes the autocorrelation lag index variable.
In an embodiment, the lag index and amplitude of the minimum peak features are extracted from the autocorrelation sequence within a lag interval. These features are used to identify the types of noise signals having periodic structure components.
The proposed method includes extraction of the lag index and amplitude of the maximum peak of the autocorrelation sequence within a lag interval. These features are used to represent a voiced speech sound frame. The lag and maximum peak thresholds are used to distinguish voiced sound from other background sounds. For a given range of autocorrelation coefficients, the lag index and amplitude of the maximum peak are computed using Equation (4):
[rmax_amp, rmax_lag] = max_amp_lag(r[m])    Equation (4)

In Equation (4), max_amp_lag(·) is the function that outputs the maximum amplitude (rmax_amp) and its lag index (rmax_lag) of the autocorrelation sequence within the given lag interval, and m denotes the autocorrelation lag index variable.
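Equations (3) and (4) can be sketched as a pair of helpers; the exact lag bounds are parameters here, since the disclosure leaves their values to the decision-tree design:

```python
import numpy as np

def min_amp_lag(r, lo, hi):
    # Minimum amplitude and its lag index within lags [lo, hi] of r.
    seg = np.asarray(r[lo:hi + 1], dtype=float)
    k = int(np.argmin(seg))
    return seg[k], lo + k

def max_amp_lag(r, lo, hi):
    # Maximum amplitude and its lag index within lags [lo, hi] of r.
    seg = np.asarray(r[lo:hi + 1], dtype=float)
    k = int(np.argmax(seg))
    return seg[k], lo + k
```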
The proposed method utilizes the peak amplitude and its lag index information for reducing the computational cost of a VAD system by eliminating highly non-stationary noise frames having different noise levels. In order to reduce the number of noise frame detections, the proposed method uses decaying energy ratios.
In certain implementations, the feature extraction method computes decaying energy ratios by dividing the autocorrelation sequence into unequal blocks. For a given block of the autocorrelation sequence, the ith autocorrelation energy decaying ratio (τi) is computed using Equation (5):
In Equation (5), τi denotes the ith decaying energy ratio computed for autocorrelation lag indices ranging from Li to Ui, N denotes the total number of autocorrelation coefficients, and k denotes the autocorrelation lag variable.
Further, the decaying energy ratio lies between 0 and 1 and is a representative feature for distinguishing the voiced sounds from the background sounds, and noises. In most sound frames, the decaying energy ratios in the autocorrelation domain computed in the method described above can demonstrate a high robustness against a wide variety of background sounds and noises. In addition, the decaying energy ratios are computed in a computationally efficient manner.
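The body of Equation (5) is not reproduced above; the sketch below assumes an energy-ratio form (energy in the lag range [Li, Ui] over total autocorrelation energy), which yields a value in [0, 1] as described:

```python
import numpy as np

def decaying_energy_ratio(r, lo, hi):
    # tau_i: fraction of total autocorrelation energy contained in the
    # lag range [lo, hi]; lies in [0, 1] as described in the text.
    # The exact form of Equation (5) is an assumption here.
    r = np.asarray(r, dtype=float)
    total = float(np.sum(r ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sum(r[lo:hi + 1] ** 2) / total)
```

Because the autocorrelation of voiced speech decays slowly, its later lag blocks retain more energy than those of transient noise, which is what makes the ratios discriminative.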
In an embodiment, the method of constructing a decision tree takes the computational cost of each feature into consideration.
Further, from
In a proposed method, each feature extraction method receives the autocorrelation sequence with a number of autocorrelation coefficient values. The feature extraction method processes the input data according to the action variable information. Finally, the VNFC module 207 of the proposed method generates binary flag information (e.g., binary flag 0 for non-voice signal frame, and binary flag 1 for voice signal frame) and sends flag information to a BSMD module. The plots of feature patterns are shown for a comprehensive understanding and illustrating of the effectiveness in distinguishing a voice signal frame from a non-voice signal frame by using total variation autocorrelation feature information.
Referring to the
Based on the overlapping frame concept in VAD, the total number of missed and false signal frame detections may be reduced by using the information of possible duration of voiced speech regions. Further, in certain embodiments, the proposed method employs the minimum voiced speech duration and the interval between two successive voice signal portions. In an embodiment, the VAD system determines the feature smoothing process which can reduce the number of false and missed detections. In an embodiment, the VAD system may optionally configure the construction of embodiments depending on the applications. The mode of VAD triggering can be manually or automatically selected by a user. In a power saving mode, VAD applications may be disabled.
According to the disclosure, the method of merging replaces binary-flag 0 with binary-flag 1 when it identifies binary-flag 0 for signal frames within an interval from the previous endpoint of a voiced speech portion. In another aspect, binary-flag 1 is replaced with binary-flag 0 when signal frames detected as voice are surrounded by long runs of zeros on both the left and right sides and their total duration is less than the duration threshold.
In certain embodiments, the binary-flag merging/deletion is performed by using a set of instructions that counts the lengths of runs of ones and zeros, and continuously compares the count values with the thresholds. From various experiments, it was noticed that the merging and deletion methods of the proposed method may provide significantly better endpoint detection results. The main objective of the preferred method of merging is to avoid a discontinuity effect that is introduced due to the elimination of a set of signal samples of a single spoken word during the voice and non-voice classification process.
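The run-counting merge-then-delete pass can be sketched as below; the threshold names and the order of the two passes are assumptions for illustration:

```python
def merge_and_delete(flags, gap_thresh, min_dur):
    # Merging: fill runs of 0s shorter than gap_thresh lying between 1-runs.
    flags = list(flags)
    n = len(flags)
    i = 0
    while i < n:
        if flags[i] == 0:
            j = i
            while j < n and flags[j] == 0:
                j += 1
            if 0 < i and j < n and (j - i) < gap_thresh:
                flags[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    # Deletion: zero out isolated 1-runs shorter than min_dur frames.
    i = 0
    while i < n:
        if flags[i] == 1:
            j = i
            while j < n and flags[j] == 1:
                j += 1
            if (j - i) < min_dur:
                flags[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return flags
```

Merging repairs gaps inside a single spoken word; deletion removes short false bursts, matching the two objectives stated above.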
The main objective of the method of deletion is to remove a short-burst of some types of sounds that are falsely detected. Further, the VEDC is designed for accurately determining the endpoint (or boundary or onset/offset) of a voice signal portion and correcting using the feature information extracted from each sub-frame of the signal samples. The various actions in flowchart 1400 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in
Referring to
The VEDC module provides endpoint determination, signal framing, feature extraction and endpoint correction. The endpoints of all detected voiced signal portions are computed by processing the binary-flag sequence information and the values of frame length and frame shift. Further, the VEDC module provides endpoints in terms of either a sample index number or a sample time measured in milliseconds.
In an embodiment, the endpoint is corrected using a simple feature extraction and a threshold rule. During correction, processing of the signal frame is performed with a number of signal samples. The signal frame is extracted at the onset and offset of each voiced speech portion. During endpoint correction the signal frame is first divided into non-overlapping small frames. Then, the computation of energy of each sub-frame takes place and is finally compared with a threshold. The proposed method may provide an accurate determination of endpoints of voiced signal portions when the recorded/received audio signal with high signal-to-noise ratio mostly occurs in many realistic environments. The various actions in method 1700 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in
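The sub-frame energy rule for endpoint correction can be sketched as follows; the function name, sub-frame length, and threshold are illustrative assumptions:

```python
import numpy as np

def refine_onset(frame, sub_len, energy_thresh):
    # Divide the boundary frame into non-overlapping sub-frames, compare each
    # sub-frame's energy with a threshold, and return the sample offset of the
    # first sub-frame whose energy reaches the threshold.
    frame = np.asarray(frame, dtype=float)
    for i in range(len(frame) // sub_len):
        sub = frame[i * sub_len:(i + 1) * sub_len]
        if np.sum(sub ** 2) >= energy_thresh:
            return i * sub_len  # corrected onset, relative to the frame start
    return -1  # no sub-frame exceeded the threshold
```

The offset of a voiced portion would be refined symmetrically, scanning sub-frames from the end of the boundary frame.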
Referring to
The overall computing environment can be composed of multiple homogeneous and/or heterogeneous cores, multiple CPUs of different kinds, special media and other accelerators. The processing unit is responsible for processing the instructions of the algorithm. The processing unit receives commands from the control unit in order to perform its processing. Further, any logical and arithmetic operations involved in the execution of the instructions are computed with the help of the ALU. Further, the plurality of process units may be located on a single chip or over multiple chips.
The algorithm, comprising the instructions and code required for the implementation, is stored in the memory unit, the storage, or both. At the time of execution, the instructions may be fetched from the corresponding memory and/or storage and executed by the processing unit.
In the case of hardware implementations, various networking devices or external I/O devices may be connected to the computing environment to support the implementation through the networking unit and the I/O device unit.
The various embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements. The elements shown in
While the present disclosure has been shown and described with reference to various embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---
2761/DEL/2012 | Sep 2012 | IN | national |