In a smart-home application, for example, hands-free voice communication typically occurs in noisy far-field conditions. The desired talker's voice may be interfered with by competing talkers, a television (TV), a dishwasher, a vacuum cleaner, etc. Spatial processing systems, e.g., adaptive beamformers, may improve the signal-to-noise ratio. Robust system controls are essential to an effective adaptive spatial processing method. Naïve voice activity detection alone may be insufficient.
Noise clustering methods in the spatial domain are used in state-of-the-art control systems. The control systems dynamically track the inter-microphone phase profile of the various noises present in the environment. The inter-microphone frequency-dependent phase profile is the phase of the cross-power spectral density of the microphone signals and is a unique function of frequency for each source location relative to the microphones. It may be calculated by taking the phase of the time-averaged product of the Fourier transform of one microphone signal and the conjugated Fourier transform of the other microphone signal. The control systems may assume that noise sources are spatially non-moving, and fluctuations in the inter-microphone phase are used to detect the presence of long-term non-moving sources. Consequently, the control systems can wrongly classify a non-moving talker as a noise cluster and wrongly classify a moving noise source as a desired talker.
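The phase-profile calculation described above may be illustrated by the following minimal sketch. It is not taken from any particular control system: the function names `dft` and `phase_profile`, the frame-based averaging, and the use of a naive discrete Fourier transform (chosen only to keep the example dependency-free) are all assumptions made for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one real-valued frame (non-negative frequency bins only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def phase_profile(mic1, mic2, frame_len):
    """Phase of the time-averaged cross-power spectrum of two microphone signals.

    The cross spectrum X1[k] * conj(X2[k]) is averaged over consecutive,
    non-overlapping frames, and the phase of the average is returned per bin.
    """
    n_frames = min(len(mic1), len(mic2)) // frame_len
    if n_frames == 0:
        raise ValueError("need at least one full frame of samples")
    n_bins = frame_len // 2 + 1
    cross = [0j] * n_bins
    for f in range(n_frames):
        start = f * frame_len
        x1 = dft(mic1[start:start + frame_len])
        x2 = dft(mic2[start:start + frame_len])
        for k in range(n_bins):
            cross[k] += x1[k] * x2[k].conjugate()
    return [cmath.phase(c / n_frames) for c in cross]
```

In this sketch, a source whose signal reaches the second microphone d samples later than the first produces a phase of 2πkd/N at an excited bin k of an N-point frame, which is the location-dependent function of frequency referred to above. Bins that carry no signal energy yield numerically meaningless phase values and would need to be masked in practice.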
Embodiments are described that recognize and cluster acoustic sources based not only on whether they are moving or non-moving, but also on their identity determined using biometric features of a talker. For example, different acoustic scenarios may be identified such as: desired and spatially non-moving, desired and spatially moving, undesired and spatially non-moving, and undesired and spatially moving.
In one embodiment, the present disclosure provides a method including extracting, from input of multiple microphones, a hyperset of features of acoustic sources. The hyperset of features comprises one or more spatial features of the acoustic sources and one or more voice biometric features of the acoustic sources. The method also includes using the extracted hyperset of features to identify separable clusters associated with acoustic scenarios. The acoustic scenarios comprise a desired spatially non-moving talker, a desired spatially moving talker, an undesired spatially non-moving acoustic source, and an undesired spatially moving acoustic source. The method also includes classifying subsequent input of the multiple microphones as one of the acoustic scenarios using the hyperset of features.
In another embodiment, the present disclosure provides a non-transitory computer-readable medium having instructions stored thereon that are capable of causing or configuring a system to perform operations that include extracting, from input of multiple microphones, a hyperset of features of acoustic sources. The hyperset of features comprises one or more spatial features of the acoustic sources and one or more voice biometric features of the acoustic sources. The operations also include using the extracted hyperset of features to identify separable clusters associated with acoustic scenarios. The acoustic scenarios comprise a desired spatially non-moving talker, a desired spatially moving talker, an undesired spatially non-moving acoustic source, and an undesired spatially moving acoustic source. The operations also include classifying subsequent input of the multiple microphones as one of the acoustic scenarios using the hyperset of features.
In yet another embodiment, the present disclosure provides an apparatus that includes a feature extractor that extracts a hyperset of features of acoustic sources from input of multiple microphones. The hyperset of features comprises one or more spatial features of the acoustic sources and one or more voice biometric features of the acoustic sources. The apparatus also includes a clustering block that uses the extracted hyperset of features to identify separable clusters associated with acoustic scenarios. The acoustic scenarios include a desired spatially non-moving talker, a desired spatially moving talker, an undesired spatially non-moving acoustic source, and an undesired spatially moving acoustic source. The apparatus also includes a classifier that classifies subsequent input of the multiple microphones as one of the acoustic scenarios using the hyperset of features.
Embodiments are described of an apparatus and method for extracting a hyperset of features from the input of multiple microphones, using the extracted features to identify clusters associated with the four different types of acoustic sources 101-104, and classifying subsequent input as one of the four acoustic sources 101-104 using the hyperset of features. The hyperset of features includes both spatial features and voice biometric features of the acoustic sources. The spatial features may include, but are not limited to, phase information, frequency information, room impulse response, direction of arrival angle (azimuth, elevation), and 3-dimensional position coordinates. The voice biometric features may include, but are not limited to, pitch, Mel Frequency Cepstral Coefficients (MFCC), Line Spectral Frequencies (LSF), Higher Order Spectra (HOS) (e.g., bispectrum or trispectrum obtained by Higher Order Spectral Analysis), enrolled speaker presence likelihood, and voice biometric features extracted using machine learning or spectral analysis-based algorithms. In one embodiment, a voice biometric system generally uses the MFCC as a basic feature set from which probabilistic models are built to determine the enrolled speaker presence likelihood or uses the features extracted from a deep learning network.
As may be observed in
The values of the horizontal axis of
In contrast, advantageously, the values of the horizontal axis of
Thus, although the plots of
The clustering algorithm block 404 uses the extracted features to identify separable clusters associated with the four acoustic scenarios 101-104 of
The classifier 406 uses the clustering information provided by the clustering algorithm block 404 to classify subsequent hyperset features extracted from the input 401 into one of the four acoustic scenarios 101-104, which it provides as output 409 to a control block of a device. The device may be, for example, a robot or a beamformer, as described below in more detail. Once the clustering is determined from the training data, the centroid of each cluster in the hyper dimensional space can be extracted from the cluster. During the classification stage, a new frame of microphone data is processed to generate the hyperset features, and the distance (e.g., Euclidean or Itakura-Saito distance) between the calculated hyperset and the cluster centroid is calculated for all clusters. The new frame of microphone data is assigned to the cluster whose centroid is closest to the newly generated hyperset feature in the hyper dimensional space. In one embodiment, a processor (e.g., digital signal processor (DSP), microcontroller, or other programmable central processing unit (CPU)) performs the operations of the hyperset feature extractor 402, clustering algorithm block 404, and classifier 406. In other embodiments, dedicated hardware processing blocks may be employed to perform the operations.
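The centroid extraction and nearest-centroid assignment described above may be sketched as follows. The function names, the scenario labels, and the two-dimensional feature vectors (a hypothetical spatial fluctuation measure and a hypothetical enrolled speaker presence likelihood) are assumptions chosen for illustration; an actual hyperset would have many more dimensions, and the sketch uses only the Euclidean distance.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(feature, centroids):
    """Return the label of the cluster whose centroid is closest to `feature`."""
    return min(centroids, key=lambda label: euclidean(feature, centroids[label]))


# Hypothetical training clusters: [spatial fluctuation, enrolled-speaker likelihood].
training_clusters = {
    "desired_non_moving": [[0.1, 0.9], [0.2, 0.8]],
    "desired_moving": [[0.8, 0.9], [0.9, 0.8]],
    "undesired_non_moving": [[0.1, 0.1], [0.2, 0.2]],
    "undesired_moving": [[0.8, 0.1], [0.9, 0.2]],
}
centroids = {label: centroid(pts) for label, pts in training_clusters.items()}
```

With these assumed clusters, a new frame whose hyperset feature is `[0.7, 0.3]` would be assigned to the undesired, spatially moving cluster by `classify([0.7, 0.3], centroids)`, because that centroid is the nearest of the four.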
At block 502, input (e.g., input 401 of
At block 504, the hyperset of features extracted at block 502 is used (e.g., by the clustering algorithm block 404 of
At block 506, the hyperset of features is extracted from subsequent input and classified (e.g., by the classifier 406 of
At block 508, the classification (e.g., acoustic scenario 409 of
In an alternate embodiment, the classification is used to control a beamformer. For example, the classification may be used to control adaptation of filters of the beamformer, to adjust a step size of a matched filter to track movement of a desired spatially moving talker or to acquire a new desired spatially moving talker in response to detection of a keyword, or to adjust a step size of a noise canceller to track movement of an undesired spatially moving acoustic source. Use of the classification to control a beamformer may be described in more detail below with respect to the embodiment of
The spatial feature extractor 602, the biometric feature extractor 603, the speech recognition block 612, and the frame synchronization block 614 all receive the multi-microphone input 401. The spatial feature extractor 602 extracts spatial features from the input 401 and provides the extracted spatial features to the spatial clustering algorithm block 604 and the spatial statistics-based classifier 606. The biometric feature extractor 603 extracts biometric features from the input 401 and provides the extracted biometric features to the biometric clustering algorithm block 605 and the biometric-based classifier 607. The speech recognition block 612 provides an indication 613 to the spatial clustering algorithm block 604 whether a trigger word or phrase has been detected. The trigger word/phrase may indicate a change from a current desired spatially moving talker to a new desired spatially moving talker. The speech recognition block 612 may also identify which talker uttered the trigger word or phrase.
The biometric clustering algorithm block 605 uses the extracted biometric features to identify separable desired and undesired clusters associated, respectively, with the desired acoustic scenarios 101 and 102 and with the non-desired acoustic scenarios 103 and 104 of
The biometric classifier 607 uses the biometric clustering information provided by the biometric clustering algorithm block 605, in conjunction with the enrolled talker model 616 and the universal background model 618, to classify subsequent biometric features extracted from the input 401 and to provide a desired talker indication 623 that indicates either the desired acoustic scenarios 101 and 102 or the non-desired acoustic scenarios 103 and 104. The desired talker indication 623 may also identify the talker, which may also be used by the spatial clustering algorithm block 604 to label the spatial clusters. The desired talker indication 623 is provided to the spatial clustering algorithm block 604 and to the fusion logic 622.
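The disclosure does not specify how the enrolled talker model 616 and the universal background model 618 are combined. One conventional possibility, assumed here purely for illustration, is an average log-likelihood ratio between the two models compared against a threshold. In the sketch below each model is reduced to a single diagonal-covariance Gaussian given as a `(mean, variance)` pair; the function names, the single-Gaussian simplification, and the zero threshold are all hypothetical.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def desired_talker_indication(frames, enrolled, background, threshold=0.0):
    """Return (is_desired, llr) for a non-empty list of biometric feature frames.

    `enrolled` and `background` are (mean, variance) pairs standing in for an
    enrolled talker model and a universal background model (an assumption).
    The score is the frame-averaged log-likelihood ratio.
    """
    llr = sum(
        diag_gauss_loglik(f, *enrolled) - diag_gauss_loglik(f, *background)
        for f in frames
    ) / len(frames)
    return llr > threshold, llr
```

Under this assumption, frames that lie close to the enrolled model's mean yield a positive ratio and are indicated as a desired talker, while frames better explained by the broader background model yield a negative ratio.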
The spatial clustering algorithm block 604 uses the extracted spatial features, in conjunction with the trigger word indication 613 and the desired talker indication 623, to identify separable spatially moving and non-moving clusters associated, respectively, with the spatially moving acoustic scenarios 102 and 103 and with the spatially non-moving acoustic scenarios 101 and 104 of
The spatial classifier 606 uses the spatial clustering information provided by the spatial clustering algorithm block 604 to classify subsequent spatial features extracted from the input 401 and to provide a moving indication 621 that indicates either the spatially moving acoustic scenarios 102 and 103 or the spatially non-moving acoustic scenarios 101 and 104. The moving indication 621 is provided to the fusion logic 622.
The frame synchronization block 614 provides information to the fusion logic 622 that enables the fusion logic 622 to align the moving indications 621 received from the spatial statistics-based classifier 606 so that they are associated with corresponding frames of the desired talker indications 623 received from the biometric-based classifier 607. The fusion logic 622 uses the frame synchronization information received from the frame synchronization block 614 to synchronize the moving indication 621 and the desired talker indication 623 for a given frame to generate the acoustic scenario output 609 for the frame.
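The frame-aligned fusion described above may be sketched as follows. The representation of each indication as a mapping from a frame index to a boolean, the function name `fuse`, and the scenario label strings are assumptions made for illustration; the sketch simply pairs the two indications that share a frame index and looks up the resulting scenario.

```python
# Scenario labels keyed by (desired, moving); the label strings are illustrative.
SCENARIOS = {
    (True, False): "desired, spatially non-moving",
    (True, True): "desired, spatially moving",
    (False, False): "undesired, spatially non-moving",
    (False, True): "undesired, spatially moving",
}


def fuse(moving_by_frame, desired_by_frame):
    """Combine per-frame moving and desired-talker indications.

    Both arguments map a frame index to a boolean. Only frames for which both
    indications are available produce an acoustic scenario output.
    """
    fused = {}
    for frame in sorted(moving_by_frame.keys() & desired_by_frame.keys()):
        key = (desired_by_frame[frame], moving_by_frame[frame])
        fused[frame] = SCENARIOS[key]
    return fused
```

Keying both indications by frame index stands in for the frame synchronization information: a frame for which one classifier has not yet produced an indication is skipped rather than paired with a stale value from another frame.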
The hyperset feature-based classifier 400/600 receives the input from each of the microphones 1-4 and generates an acoustic scenario output 409/609 in accordance with the operation described above, e.g., with respect to
The signal filtering block 701 comprises multiple adaptive filters f1-f3, each of which receives the primary microphone as an input and attempts to extract the enrolled talker's speech so that it may be subtracted from an associated secondary microphone to produce the noise in the secondary microphone signal as an associated noise reference. A first summing node subtracts the output of filter f1 from a delayed version of microphone 2. The output of the first summing node is a first noise reference that is used by the speech adaptation control block 702 to adapt filter f1. A second summing node subtracts the output of filter f2 from a delayed version of microphone 3. The output of the second summing node is a second noise reference that is used to adapt filter f2. A third summing node subtracts the output of filter f3 from a delayed version of microphone 4. The output of the third summing node is a third noise reference that is used to adapt filter f3. Generally, the function of the signal filtering block 701 is to block the talker's speech and generate the noise references for the noise cancellation block 703.
The filters f1-f3 are controlled by control signals generated by the speech adaptation control block 702. Speech from a non-enrolled talker may be present in the input from the microphones, e.g., from a TV. It may be desirable to treat the TV speech as speech noise and remove it from the primary microphone. Advantageously, the desired talker indication 705 from the hyperset feature-based classifier 400/600 may enable the beamformer 700 to distinguish instances in which speech of an enrolled talker is present from instances in which speech noise is the only speech present, e.g., speech from a TV is present but not from an enrolled talker. The speech adaptation control block 702 controls the adaptive filters f1-f3 to adapt only in instances in which a desired talker's speech is present, which may enable the effective removal of speech noise (e.g., from a TV) from the primary microphone so that the speech noise is not present in the beamformer output 709.
The noise cancellation block 703 comprises multiple adaptive filters denoted b1, b2, and b3, which receive the associated noise references as an input from the respective summation nodes of the signal filtering block 701. A fourth summing node sums the outputs of adaptive filters b1-b3. A fifth summing node subtracts the output of the fourth summing node from a delayed version of the primary microphone signal to generate the beamformer output 709, which is used to adapt the filters b1-b3. The noise adaptation control block 704 controls the adaptive filters b1-b3 to adapt only when the enrolled talker's speech is not present, as indicated by the undesired talker/noise indication 707 provided by the hyperset feature-based classifier 400/600. The noise cancellation block 703 uses the noise generated by the signal filtering block 701 and cancels the noise from the primary microphone signal. The adaptive filters may be implemented either in the time domain or in the frequency domain. In the frequency domain approach, the time domain microphone signals are first transformed into the frequency domain using a fast Fourier transform/filter bank and the transformed signal from each filter bank output is processed separately.
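The gated adaptation described above may be illustrated with a single time-domain adaptive filter whose coefficient update is enabled or disabled by a control flag. The disclosure does not name an adaptation algorithm; the normalized least-mean-squares (NLMS) update, the class name `GatedNlmsFilter`, and the default step size below are assumptions chosen for this sketch.

```python
class GatedNlmsFilter:
    """Adaptive FIR filter (NLMS, assumed) whose adaptation can be gated per sample."""

    def __init__(self, taps, step=0.5, eps=1e-8):
        self.w = [0.0] * taps    # filter coefficients
        self.buf = [0.0] * taps  # most recent input samples, newest first
        self.step = step         # adaptation step size
        self.eps = eps           # regularization to avoid division by zero

    def process(self, x, d, adapt):
        """Filter input sample x, subtract from target d, and return the error.

        Coefficients are updated only when `adapt` is True, e.g. when a
        classifier indicates that the relevant source is present.
        """
        self.buf = [x] + self.buf[:-1]
        y = sum(w * b for w, b in zip(self.w, self.buf))
        e = d - y
        if adapt:
            norm = sum(b * b for b in self.buf) + self.eps
            gain = self.step * e / norm
            self.w = [w + gain * b for w, b in zip(self.w, self.buf)]
        return e
```

In terms of the arrangement described above, a filter such as f1 would be driven with the primary microphone signal as `x` and the delayed secondary microphone signal as `d`, with `adapt` asserted only while the desired talker indication is active, so that the returned error serves as a noise reference. A filter such as b1 would take a noise reference as `x` and would adapt only while the enrolled talker's speech is absent. The `step` attribute could likewise be adjusted at run time, for example to track a spatially moving source more quickly.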
In addition to controlling when to adapt the filters, the desired talker indication 705 and the undesired talker/noise indication 707 provided by the hyperset feature-based classifier 400/600 may be used by the speech adaptation control block 702 and the noise adaptation control block 704 to control other aspects of the beamformer 700. Examples include adjusting a step size of the matched filters f1-f3 to track movement of a desired spatially moving talker or to acquire a new desired spatially moving talker in response to detection of a keyword. Other examples include adjusting a step size of the noise cancelling filters b1-b3 to track movement of an undesired spatially moving acoustic source. Thus, because the hyperset feature-based classifier 400/600 may provide a high-quality acoustic scenario output 409/609 due to its high dimensionality by combining both the spatial features and the voice biometric features as described above, the beamformer 700 may provide an enhanced signal quality, e.g., a signal-to-noise ratio (SNR) improvement, in the output 709, over a conventional arrangement that uses only spatial features or only biometric features.
The clustering algorithm block 804 uses the extracted features to identify separable clusters associated with the four acoustic scenarios 101-104 of
The classifier 806 uses the clustering information provided by the clustering algorithm block 804 to classify subsequent hyperset features extracted from the inputs 401 and 801 into one of the four acoustic scenarios 101-104 which it provides as output 409 to a control block of a device. The addition of the extracted visual biometric features may provide an even higher dimensional space in which the four acoustic sources 101-104 are even more readily separable.
Data from non-moving sources will tend to cluster tightly, and conventionally the tight clusters have been treated as noise clusters. Data from a moving source does not tend to cluster and is conventionally treated as a talker. A drawback of the conventional approach is that a non-moving talker may be treated as noise or a moving noise source may be clustered as a desired source. The spatial feature-based algorithm alone does not identify specific talkers.
Embodiments have been described that utilize spatial features as well as voice biometric features to classify sound sources into four types of clusters: desired and spatially non-moving, desired and spatially moving, interferer and spatially non-moving, interferer and spatially moving. In one embodiment (e.g., of
It should be understood—especially by those having ordinary skill in the art with the benefit of this disclosure—that the various operations described herein, particularly in connection with the figures, may be implemented by other circuitry or other hardware components. The order in which each operation of a given method is performed may be changed, unless otherwise indicated, and various elements of the systems illustrated herein may be added, reordered, combined, omitted, modified, etc. It is intended that this disclosure embrace all such modifications and changes and, accordingly, the above description should be regarded in an illustrative rather than a restrictive sense.
Similarly, although this disclosure refers to specific embodiments, certain modifications and changes can be made to those embodiments without departing from the scope and coverage of this disclosure. Moreover, any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element.
Further embodiments, likewise, with the benefit of this disclosure, will be apparent to those having ordinary skill in the art, and such embodiments should be deemed as being encompassed herein. All examples and conditional language recited herein are intended for pedagogical objects to aid the reader in understanding the disclosure and the concepts contributed by the inventor to furthering the art and are construed as being without limitation to such specifically recited examples and conditions.
This disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments herein that a person having ordinary skill in the art would comprehend. Similarly, where appropriate, the appended claims encompass all changes, substitutions, variations, alterations, and modifications to the example embodiments herein that a person having ordinary skill in the art would comprehend. Moreover, reference in the appended claims to an apparatus or system or a component of an apparatus or system being adapted to, arranged to, capable of, configured to, enabled to, operable to, or operative to perform a particular function encompasses that apparatus, system, or component, whether or not it or that particular function is activated, turned on, or unlocked, as long as that apparatus, system, or component is so adapted, arranged, capable, configured, enabled, operable, or operative.
Finally, software can cause or configure the function, fabrication and/or description of the apparatus and methods described herein. This can be accomplished using general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs. Such software can be disposed in any known non-transitory computer-readable medium, such as magnetic tape, semiconductor, magnetic disk, or optical disc (e.g., CD-ROM, DVD-ROM, etc.), a network, wire line or another communications medium, having instructions stored thereon that are capable of causing or configuring the apparatus and methods described herein.