The present disclosure relates to suppression (e.g., attenuation or removal) of acoustic interference components in audio signals, and in particular to removal of components having known or determined speaker characteristics.
Speech communication systems in cars use microphones and loudspeakers that are built into the car cabin. Inherently, the microphones do not pick up only the desired speech signals of the passenger, but also the acoustic feedback of the loudspeaker output. Removal of components in microphone signals that come from the loudspeaker output is often referred to as “echo cancellation,” referring to an “echo” of the loudspeaker leaking into the microphone signal. Conventional systems for acoustic echo cancellation make use of knowledge of the signal to be cancelled/suppressed that is being played over the loudspeakers, generally in the form of the time domain driving signal for the loudspeaker, which may be referred to as a “reference signal.”
In some situations, a reference signal for the component to be cancelled is not available, thereby limiting the applicability of conventional echo cancellation approaches. Furthermore, in situations in which a desired signal represents a user's voice and the interfering signal is another's real or synthesized voice, cancellation of the interfering signal may be particularly challenging.
In a broad aspect, one or more examples described herein provide an effective way of cancelling or suppressing an interfering signal when a time domain reference signal is not available, and where both a desired signal and the interfering signals comprise voice signals. For example, it may be desirable to cancel a synthesized voice signal of a voice assistant that is captured in conjunction with a desired voice signal of a user. In a number of examples, certain characteristics of the interfering voice signal may be known (or determinable at runtime), and the cancellation procedure may be adapted to make use of those characteristics.
In some examples, the characteristics of the interfering signal form a numerical vector representation of the voice or the speaker of the voice, and such a representation may be referred to as an “embedding” of the voice. Such an embedding is then used as an input and/or configuration setting for an interference cancellation processing of a microphone signal. Such an approach may be largely agnostic to the characteristics of the desired speaker, which may be beneficial when substantially more information (i.e., data) is available to characterize the interfering voice than may be available to characterize the desired voice.
In some examples, the interfering voice is a synthesized voice, which may come from a fixed set of available output voices of a voice-based assistant. In such a situation, the embeddings of these voices may be precomputed and selected at runtime, for example, by way of a control signal from the voice assistant identifying the voice being used, or by way of an automated speaker recognition (e.g., from a closed set of available synthesizer voices).
In an example, the approach uses a machine learning system (e.g., implemented as a neural-network system, or implemented using some other machine-learning architecture) to suppress undesirable interference acoustic components. The machine-learning system may be configured according to a universal model that receives the sound characteristics information that causes the machine-learning system to perform acoustic suppression for the sound source corresponding to the received sound characteristics information. The machine-learning system thus acts as a source-aware system that behaves to suppress acoustic components according to the determined characteristics.
Thus, in certain variations, a method for acoustic interference suppression is provided that includes obtaining a multi-source sound signal sample combining multiple sound components from a plurality of sound sources in a sound environment, with the plurality of sound sources including one or more interfering sound sources produced by one or more loudspeakers in the sound environment, determining interfering sound characteristics for one or more sound signals that correspond to the one or more interfering sound sources, and suppressing at least one of the multiple sound components associated with the determined interfering sound characteristics for at least one of the one or more sound signals.
Embodiments of the method may include at least some of the features described in the present disclosure, including one or more of the following features.
At least one of the one or more interfering sound sources may include voice assistant audio produced using a voice assistant profile selected from a plurality of voice assistant profiles maintained by a text-to-speech voice assistant system.
Determining the interfering sound characteristics may include selecting from data storage, storing respective interfering sound characteristics for the plurality of voice assistant profiles, a selected interfering sound characteristics corresponding to the selected voice assistant profile.
Suppressing the at least one of the multiple sound components may include receiving, by an interference suppressing machine learning system trained to controllably suppress voice assistant sound components according to interfering sound characteristics respectively associated with the plurality of voice assistant profiles, the selected interfering sound characteristics corresponding to the selected voice assistant profile, and sound data representative of the multi-source sound signal sample. The suppressing may also include processing the sound data by the interference suppressing machine learning system according to the selected interfering sound characteristics to produce output data representative of an output sound sample in which at least one of the multiple sound components, associated with sound produced according to the selected voice assistant profile, is suppressed.
Processing the sound data with the interference suppressing machine learning system may include determining, by the machine learning system, filter coefficients for a time-varying linear filter based on at least part of the sound data and the selected interfering sound characteristics, and applying the determined filter coefficients to at least part of a representation of the multi-source sound signal sample to yield an interference suppressed output sound signal.
Processing the sound data with the interference suppressing machine learning system may include generating, by the interference suppressing machine learning system, predicted output representative of the output sound sample in which the at least one of the multiple sound components is suppressed.
Determining interfering sound characteristics may include receiving the one or more sound signals that produce the one or more interfering sound sources, and deriving from the one or more sound signals the determined interfering sound characteristics.
Suppressing the at least one of the multiple sound components may include receiving by an interference suppressing machine learning system, trained to controllably suppress the at least one of the multiple sound components according to the interfering sound characteristics, the derived interfering sound characteristics and sound data representative of the multi-source sound signal sample, and processing the sound data by the interference suppressing machine learning system according to the interference data to produce output data representative of an output sound sample in which the at least one of the multiple sound components is suppressed.
Processing the sound data with the interference suppressing machine learning system may include determining, by the interference suppression machine learning system, filter coefficients for a time-varying linear filter based on at least part of the sound data and the interference data, and applying the determined filter coefficients to at least part of a representation of the multi-source sound signal sample to yield an interference suppressed output sound signal.
Processing the sound data with the interference suppressing machine learning system may include generating, by the interference suppressing machine learning system, predicted output representative of the output sound sample in which the at least one of the multiple sound components is suppressed.
The determined interfering sound characteristics may include one or more of, for example, a vector representation in an embedding vector space for the one or more sound signals, spectral characteristics for the one or more sound signals, temporal characteristics for the one or more sound signals, and/or identifier data representative of identity of sources producing the one or more sound signals.
Suppressing the at least one of the multiple sound components may include suppressing the at least one of the multiple sound components without using a reference signal for any of the plurality of sound sources, to reduce echo effects caused by sound components produced by the one or more loudspeakers.
Suppressing the at least one of the multiple sound components may include receiving by the interference suppressing machine learning system interference data representative of the interfering sound characteristics, receiving sound data representative of the multi-source sound signal sample, and processing the sound data by the interference suppressing machine learning system according to the interference data to produce output data representative of an output sound sample in which the at least one of the multiple sound components is suppressed.
Receiving the sound data may include converting, by an acoustic transducer device, a raw sound sample into a time-domain sound signal, transforming the time-domain sound signal into a transformed domain representation, and extracting features from the transformed domain representation to produce the sound data.
Extracting the features may include deriving from the transformed domain representation one or more of, for example, complex signal spectra features, spectral magnitude features, log spectral magnitude features, log mel spectra features, and/or mel-frequency cepstral coefficients.
The predicted output may include one or more of, for example, a temporal representation of the output sound sample, and/or a spectral representation of the output sound sample.
Suppressing the at least one of the multiple sound components may include providing the interfering sound characteristics to a machine learning system executing a single machine learning model trained to controllably suppress acoustic interference sound components in input sound signals according to different sound characteristics for a plurality of interference sound signals associated with one or more speakers or one or more voice assistant profiles, and processing sound data, representative of the multi-source sound signal sample, provided as input to the machine learning system based on the determined interfering sound characteristics provided to the machine learning system to produce output data representative of an output sound sample in which the at least one of the multiple sound components is suppressed.
Providing the determined interfering sound characteristics to the machine learning system may include providing the determined interfering sound characteristics to one or more of, for example, an input layer of the machine learning system, and/or one or more intermediate hidden layers of the machine learning system.
In certain variations, an acoustic interference suppression system is provided that includes an audio acquisition section to obtain a multi-source sound signal sample combining multiple sound components from a plurality of sound sources in a sound environment, with the plurality of sound sources including one or more interfering sound sources produced by one or more loudspeakers in the sound environment, and an acoustic interference suppression controller in electrical communication with the audio acquisition section. The acoustic interference suppression controller is configured to determine interfering sound characteristics for one or more sound signals that correspond to the one or more interfering sound sources, and suppress at least one of the multiple sound components associated with the determined interfering sound characteristics for at least one of the one or more sound signals.
In certain variations, a non-transitory computer readable media is provided that stores a set of instructions, executable on at least one programmable device, to obtain a multi-source sound signal sample combining multiple sound components from a plurality of sound sources in a sound environment, with the plurality of sound sources including one or more interfering sound sources produced by one or more loudspeakers in the sound environment, determine interfering sound characteristics for one or more sound signals that correspond to the one or more interfering sound sources, and suppress, based on the determined interfering sound characteristics, at least one of the multiple sound components associated with respective sound characteristics matching the interfering sound characteristics for at least one of the one or more sound signals that produce the one or more interfering sound sources.
In certain variations, a computing apparatus is provided that includes an audio acquisition section to obtain a multi-source sound signal sample combining multiple sound components from a plurality of sound sources in a sound environment, with the plurality of sound sources including one or more interfering sound sources produced by one or more loudspeakers in the sound environment, and one or more programmable devices to perform acoustic interference suppression according to any of the method steps described above.
In certain variations, a non-transitory computer readable media is provided that is programmed with a set of computer instructions executable on a processor that, when executed, cause the performance of operations comprising any of the various method steps described above.
Embodiments of the above system, the apparatus, and/or the computer-readable media may include at least some of the features described in the present disclosure, and may be combined with any other embodiment, variation, or features of the method described herein.
Other features and advantages of the invention are apparent from the following description, and from the claims.
These and other aspects will now be described in detail with reference to the following drawings.
Like reference symbols in the various drawings indicate like elements.
The interference suppression framework described herein implements a source-aware acoustic interference platform, which adapts acoustic interference suppression behavior (in terms of the levels of attenuation applied to stationary and non-stationary noise components) in accordance with determination/identification of interfering sound sources that are reproduced on loudspeakers within a local sound environment (e.g., the cabin of a vehicle). Consequently, the framework is configured to remove such interfering sound sources (which are typically non-local or non-native to the sound environment), thus reducing (or altogether cancelling) “echo” effects that would otherwise be caused by re-transmitting (through one or more microphones deployed in the sound environment) components reconstructed from sound signals delivered to the loudspeakers. In some examples, the interfering sound sources are synthesized voice sources produced by a local or remote voice assistant, which may be explicitly identified or whose characteristics may be automatically determined. In various examples the interfering sound sources may be sound signals corresponding to remote speakers communicating with occupants of a vehicle.
The suppression of interfering sound sources can be performed using a machine learning system trained to suppress interfering sound components in accordance with interfering sound characteristics for the sound signals that result in the interfering sound sources. The sound characteristics are provided to the machine learning system, conditioning or actuating it to suppress or remove components corresponding to the interfering sound characteristics. A convenient way to determine the sound characteristics to provide to the machine learning system is through a data storage module (e.g., a lookup table) that maintains the sound characteristics for a known set of sound signals/sources. Examples of sound characteristics that can be used to modulate, condition, or actuate the machine learning system include numerical vectors (“embeddings”) representative of sound signal characteristics, spectral features of sound signals, identifier values that identify the particular source or signal (e.g., a known speaker, or one of several known voice assistant profiles) corresponding to the sound components that are to be suppressed, etc.
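For illustration only, the following Python sketch shows one way such a lookup-based selection of interfering sound characteristics could be organized; the source identifiers, embedding dimension, and placeholder values are hypothetical and do not form part of the disclosed system.

```python
import numpy as np

# Hypothetical precomputed sound characteristics (embedding vectors) for a
# closed set of known interfering sources; identifiers, dimension, and values
# are illustrative placeholders only.
_rng = np.random.default_rng(0)
SOUND_CHARACTERISTICS = {
    "tts_profile_british_female": _rng.standard_normal(128),
    "tts_profile_us_male": _rng.standard_normal(128),
    "known_remote_speaker_01": _rng.standard_normal(128),
}

def select_interfering_characteristics(source_id: str) -> np.ndarray:
    """Look up the stored conditioning vector for the identified interfering source."""
    return SOUND_CHARACTERISTICS[source_id]

# At runtime, a control signal (e.g., from the voice assistant or a speaker
# recognition stage) identifies the active source, and the corresponding
# characteristics are fed to the interference suppressing machine learning system.
embedding = select_interfering_characteristics("tts_profile_british_female")
```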
As will be discussed in greater detail below, the framework is implemented, in some embodiments, as a machine learning network that is trained (i.e., has its machine learning engine parameters optimized) to process a microphone audio signal and produce a clean audio output in which the contribution from acoustic interference components is suppressed based on information indicative of or characterizing the acoustic interference source(s) that is (are) to be suppressed. The machine learning implementation (which may be realized according to one or more different machine learning architectures) receives noisy sound segments and interference characterization information (such information may be a compact representation of the sound source, such as a numerical vector referred to as an “embedding,” an identifier identifying a source from a finite pool of sources, e.g., a known speaker or one of several known voice assistant profiles, or signal features such as log mel spectra features or mel-frequency cepstral coefficients, etc.). The machine learning network outputs data (e.g., filter coefficients or actual sound signal output) in which interference components are suppressed.
With reference to
As further illustrated in
The SSE system 120 includes a voice characteristics module 122 that is configured to determine, typically in compact form (e.g., the identity of a particular source associated with a sound signal, one or more spectral characteristics such as log mel spectra or mel-frequency cepstral features, an embedding in a vector space, etc.), interfering sound characteristics for one or more sound signals 102 that produce the one or more interfering sound sources. In examples where the sound signals are for known acoustic sources, such as sound signals produced by a voice assistant profile from a finite pool of voice assistant profiles, or a known speaker wishing to communicate with an occupant of the sound environment (a speaker in the car, in this case), a data storage device (e.g., a lookup table) may store the respective voice characteristics that are used to actuate or modulate a machine learning system that was trained based on such sound characteristics. In such examples, the node 130 may provide to the voice characteristics module a selection signal 131 identifying the particular speaking source (be it a known user or a known voice assistant profile) that causes the voice characteristics module 122 to output the appropriate stored sound characteristics that are to actuate the speaker-aware speech enhancement unit 124.
Alternatively, in some examples, a copy of the actual sound signal 102 produced by the node 130 and communicated to the loudspeaker 112 may also be provided to the voice characteristics module which, in such examples, is configured to derive (e.g., in real-time) the appropriate output to the unit 124. Though derivation of interfering sound characteristics may be more computationally costly (and may require a more complicated implementation), it has the advantage that for new, unknown sources (e.g., a new voice assistant profile, or a new speaker) the voice characteristics module 122 will provide inferred/derived voice characteristics that provide an approximation of the characteristics of the sound signal that is consistent with the sound characteristics that were used to train the system. For example, in embodiments in which the sound/voice characteristics module 122 is configured to produce an embedding vector, a machine learning system (implemented as a neural network, or any other appropriate machine learning system) is trained (optionally jointly with the training of a machine-learning implementation applied to the sound signal generated by the microphone 116) using a large corpus of recorded speech samples from multiple speakers, each of which is associated with an encoded vector (also referred to as an embedding). The encoded vectors define an embedding space representative of at least some voice characteristics for speakers in the large corpus (e.g., spectral domain characteristics). A target speaker, whether it is one of the speakers used to train the machine-learning implementation of the voice characteristics module 122 or a different speaker, provides a sample of speech that can be processed to a format compatible with the machine-learning system. The machine-learning implementation of the voice/sound characteristics module 122 uses the input provided (the representation of the speech sample provided by the arbitrary speaker) to generate (encode) an embedding vector that is representative of the arbitrary speaker's voice characteristics. The encoded embedding for the arbitrary speaker lies within the embedding space defined for the trained encoder.
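As a simplified illustration of such runtime derivation of an embedding (assuming, purely for illustration, a single linear projection standing in for a trained neural-network encoder), an embedding could be obtained roughly as follows:

```python
import numpy as np

def derive_voice_embedding(frame_features: np.ndarray,
                           projection: np.ndarray) -> np.ndarray:
    """Illustrative runtime derivation of a voice embedding.

    frame_features: (num_frames, num_features) spectral features of the
        interfering speech sample (e.g., log mel spectra).
    projection: (num_features, embedding_dim) weights of a trained encoder;
        a real encoder would typically be a neural network rather than a
        single linear projection -- this is a placeholder for the idea.
    """
    pooled = frame_features.mean(axis=0)          # summarize the utterance
    embedding = pooled @ projection               # map into the embedding space
    return embedding / (np.linalg.norm(embedding) + 1e-9)  # unit-normalize
```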
In another example embodiment, the voice/sound characteristics module may be configured to produce (whether using a data storage such as a lookup table to provide a stored value, or as a processing machine to compute an element from the sound signal) an identifier that represents the source, from a known pool of sources (one or more known voice assistant profiles used to produce synthetic speech of unique voice characteristics, or one or more known speakers). In this example, the identity of the sound source may be determined based on metadata included with the sound signal 102 provided to the voice characteristics module 122, or it may be a value selected based on the selection signal 131 generated by the node 130. In yet further example embodiments, the voice characteristics module 122 may be configured to generate (using algorithmic/filtering-based processing), according to the sample input provided to it via the signal 102, one or more features such as spectral domain features like complex signal spectra features, spectral magnitude features, log spectral magnitude features, log mel spectra features, mel-frequency cepstral features, etc. Here too, such features may previously have been pre-computed and stored on a data storage implementation of the voice characteristics module 122, and the appropriate features data may be outputted by the module 122 and provided to the unit 124 in accordance with a selection signal 131. Other types of features, whether time-domain features, frequency-domain features, or features extracted from some other type of representation of the sound signal sample fed to the voice characteristics module 122, may also be used. In the latter example embodiments, the extracted features represent the voice characteristics of the interfering source. Those extracted features can then be used by downstream components/modules to control the interference suppression operations of the system 100.
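For illustration, a sketch of such feature extraction (using the open-source librosa library; the frame sizes, mel band count, and cepstral order shown are arbitrary illustrative choices) might look like the following:

```python
import librosa
import numpy as np

def extract_interference_features(signal: np.ndarray, sample_rate: int = 16000):
    """Compute illustrative spectral features of an interfering sound signal."""
    # Complex STFT and magnitude / log-magnitude spectra.
    spectrum = librosa.stft(signal, n_fft=512, hop_length=256)
    magnitude = np.abs(spectrum)
    log_magnitude = np.log(magnitude + 1e-9)

    # Log mel spectra and mel-frequency cepstral coefficients.
    mel = librosa.feature.melspectrogram(y=signal, sr=sample_rate,
                                         n_fft=512, hop_length=256, n_mels=40)
    log_mel = librosa.power_to_db(mel)
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)

    return {"complex": spectrum, "magnitude": magnitude,
            "log_magnitude": log_magnitude, "log_mel": log_mel, "mfcc": mfcc}
```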
With continued reference to
Although not specifically illustrated in
As will be discussed in greater detail below in relation to
The interfering sound characteristics information, outputted by the voice/sound characteristics module 122, may dynamically change during a call or a session in accordance with changes to the speakers (real or synthesized) associated with the signal 102. For example, if another remote speaker joins an on-going call (e.g., with the speaker 114 in the car cabin 110) the sound characteristics generated by the voice characteristics module 122 (which may either provide previously stored voice characteristics in the form compatible with the unit 124, or may derive from the actual sound signal the appropriate interfering sound characteristics) will change to reflect that there is now a new or an additional interfering sound source generated by the loudspeaker device 112. The change to the sound characteristics information is provided to the unit 124 (e.g., to a machine-learning engine forming part of the unit 124) by the sound/voice characteristics module 122, to cause a change in the behavior of the machine learning engine (e.g., to produce different filter coefficients responsive to the change in the interfering sound characteristics, or to generate actual clean signal output representation). It is to be noted that in addition to the interfering sound characteristics information, the unit 124 may also be provided with other information to control its output, including, for example, a parameter indicating a type of call that is to be (or is being) conducted, an environment parameter identifying the environment in which the call is conducted, an aggressivity parameter to control the level of attenuation that is to be applied, context information (e.g., derived from sensors data collected by various sensors, such as a camera sensor, in the sound environment), and other parameters that help define the required acoustic interference suppression behavior at a given time instance.
As noted, in some examples, the machine learning system of the unit 124 receives the interfering sound characteristics information that indicates which sound sources may be causing acoustic interference (such as echo), along with extracted features representative of a block/segment of an audio sample, and produces in response to these inputs a set of filter coefficients that are applied to the audio sample. The audio sample processed by the generated filter coefficients may be the original noisy sample, or a sample on which some of the initial pre-processing operations, such as equalization, etc., have been performed. Alternatively, the filtering process, achieved by filter coefficients produced by the machine learning system of the unit 124, may be performed in the frequency domain, producing a resultant filtered (noise suppressed) signal from the input signal provided to the SSE system 120. It is to be noted that while the application of filter coefficients to noisy audio samples may be performed for every audio block/segment of the audio signals (e.g., 16 ms blocks), the computation of filter coefficient sets by the machine-learning system may only need to be performed intermittently (e.g., at regular or irregular intervals) so that the filter coefficients derived in response to a particular audio segment are applied to one or more consecutive sound segments. By computing filter coefficients intermittently instead of for every incoming sound block/segment, the computation cost (and resultant latency) can be reduced without excessively compromising the speech quality of the sound output signal.
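A minimal sketch of this intermittent, frequency-domain filtering (with a hypothetical coefficient_fn standing in for the trained machine learning system of the unit 124, and an arbitrary update interval) is shown below:

```python
import numpy as np

def suppress_with_intermittent_coefficients(noisy_stft: np.ndarray,
                                            coefficient_fn,
                                            update_every: int = 4) -> np.ndarray:
    """Apply ML-derived filter coefficients to STFT frames of a noisy signal.

    noisy_stft: complex spectrogram, shape (num_bins, num_frames).
    coefficient_fn: hypothetical callable wrapping the trained model; given a
        frame of features it returns per-bin gains H(t, w) in [0, 1].
    update_every: recompute coefficients only every few frames and reuse them
        for intervening frames, reducing computation cost and latency.
    """
    num_bins, num_frames = noisy_stft.shape
    gains = np.ones(num_bins)
    enhanced = np.empty_like(noisy_stft)
    for t in range(num_frames):
        if t % update_every == 0:                      # intermittent update
            gains = coefficient_fn(np.abs(noisy_stft[:, t]))
        enhanced[:, t] = gains * noisy_stft[:, t]      # time-varying linear filter
    return enhanced
```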
With continued reference to
Once output processing is performed, the resultant acoustic interference suppressed output 106 is communicated to the node 130 to either process a user command (if the speech is a response to a voice assistant) or to be transmitted to a far-end communication partner (not shown). There, a user located at the destination node may use a similar interference suppression system to the SSE system 120 to suppress any noise/interference generated at the destination according to the environment and interference sources at the destination. Sound/audio signals produced by the user(s) at the far-end location are processed and transmitted via a communication network and a downlink to the node 130 to be processed and reconstructed in the car cabin 110 in accordance with the processes described herein.
Some illustrative example use cases follow. In a first example use case, a driver is located in his/her car and talking to an in-vehicle navigation system or a personal assistant that is talking back to the person in a British woman's voice. The system may not have a good model for the driver, but has a good model for that British woman's voice, and the speech signal enhancement (SSE) system has been trained to cancel that voice assistant's acoustics as well as other known voices. The SSE may know which synthesis voice is being used, or at least the suitable characteristics (embedding) vector for the British woman voice assistant profile in use, and that characteristics vector can be provided as an input or a configuration of the SSE. In a second use case, the known personal assistant voice might come from outside the car, but so long as the system is aware that the voice assistant being used is of the particular known profile (namely, a woman's voice with a British accent), the interference suppression system (be it a machine learning system or otherwise) can suppress the voice assistant sound components. In a third use case, the voice assistant may not expressly provide information about the profile used (or no a priori knowledge is provided about the identity of a live speaker communicating with the driver), but a speaker recognition system (not illustrated) can determine the identity of the source (and consequently produce, or cause to be produced, appropriate sound characteristics) from a closed known set. In a fourth use case, the speaker is not in the known set, and the system 100 may need to include a new embedding (or some other type of sound characteristics) for the unknown speaker and use that. In this case, the derivation of the sound characteristics may require a bit of listening to the new speaker, but the result can subsequently be stored in a data storage so that there is no need to recompute sound characteristics for that new speaker (or voice assistant profile).
Additional details of the configuration and operation of a machine learning system to implement acoustic interference suppression operations are provided with reference to
The filter coefficients 232 for H(t,ω) produced by the DNN 220 are dynamically adaptable in response to specific dynamic audio signal characteristics (represented by the extracted features fed to the DNN 220), and the interfering sound characteristics. The DNN 220 can thus be implemented, in some embodiments, as a universal, adaptable interference suppression system module that does not need to retrieve (i.e., from a look-up table) different sets of coefficients for different required speakers or voice assistant profiles, but rather is implemented using a single set of optimized DNN coefficients, determined during the training stage, to dynamically produce during runtime filter coefficients that are applied to received sound signals. Alternatively, in various example embodiments, a machine-learning model for particular speakers or voice assistant profiles may be selected (e.g., based on the interfering sound characteristics information) and downloaded for use by the DNN 220. Furthermore, as noted, instead of producing filter coefficients for H(t,ω), the DNN may produce the actual predicted clean output responsive to the noisy input signal 104 and the interfering sound characteristics 123.
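By way of illustration only, a minimal Keras sketch of a frame-wise, characteristics-conditioned mask estimator follows; the layer sizes, activation functions, and fully connected topology are illustrative assumptions and do not represent the actual architecture of the DNN 220.

```python
import tensorflow as tf

# Illustrative dimensions (assumptions): 257 frequency bins per frame and a
# 128-dimensional interfering sound characteristics embedding.
num_bins, embedding_dim = 257, 128

features_in = tf.keras.Input(shape=(num_bins,), name="noisy_frame_features")
characteristics_in = tf.keras.Input(shape=(embedding_dim,),
                                    name="interfering_characteristics")

# The conditioning information is concatenated with the frame features, so a
# single set of network weights can adapt its suppression behavior per source.
x = tf.keras.layers.Concatenate()([features_in, characteristics_in])
x = tf.keras.layers.Dense(512, activation="relu")(x)
x = tf.keras.layers.Dense(512, activation="relu")(x)
mask_out = tf.keras.layers.Dense(num_bins, activation="sigmoid",
                                 name="filter_coefficients")(x)

mask_model = tf.keras.Model([features_in, characteristics_in], mask_out)
```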
The DNN 220 of
Examples of neural networks include convolutional neural network (CNN), feed-forward neural networks, recurrent neural networks (RNN, implemented, for example, using long short-term memory (LSTM) structures), etc. Feed-forward networks include one or more layers of perceptrons (learning nodes/elements) with connections to one or more portions of the input data. In a feedforward network, the connectivity of the inputs and layers of perceptrons is such that input data and intermediate data propagate in a forward direction towards the network's output. There are typically no feedback loops or cycles in the configuration/structure of the feed-forward network. Convolutional layers allow a network to efficiently learn features by applying the same learned transformation to subsections of the data. Other examples of learning engine approaches/architectures that may be used include generating an auto-encoder and using a dense layer of the network to correlate with probability for a future event through a support vector machine, constructing a regression or classification model, etc.
In various embodiments the learning engine implementations may include a trained learning engine (e.g., a neural network) and a corresponding coupled learning engine controller/adapter configured to determine and/or adapt the parameters (e.g., neural network weights) of the learning engine that would produce output representative of cleaned/filtered audio samples in response to noisy input audio data and/or noise reduction profile information. In such implementations, training data includes sets of input audio sample representations (clean samples and noisy samples, in a time-domain representation, a transformed domain representation, extracted features representation, or other types of representations) along with corresponding output data comprising desired filtered (cleaned) audio sample representations (such output data is referred to as target or label data). The input and output training data together define the ground truth for the machine learning model being implemented. After initial training of the learning engine, subsequent training may be intermittently performed (at regular or irregular periods). Upon completion of a training cycle by the adapter/controller coupled to a particular learning engine, the adapter provides data representative of updates/changes (e.g., in the form of parameter values/weights to be assigned to links of a neural-network-based learning engine) to cause the learning engine to be updated in accordance with the training cycle(s) completed. During training, neural network errors are computed based on the distance between the network output (responsive to the input data) and the desired target. Based on this error, the neural network weights are optimized with, for example, gradient descent processing to minimize the error.
More particularly, training the machine learning engine (e.g., the DNN 220) may be performed by using clean speech (audio) spectrum as the desired output to be produced by applying the filter coefficients outputs (generated by the learning engine) to the noisy input audio signal representation. This approach, therefore, requires knowledge of the clean speech sample (be it as a time domain representation or a spectral representation).
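For illustration, one training step under this formulation (assuming a mask-estimating model such as the sketch above, a mean-squared-error loss between the filtered noisy spectrum and the clean target spectrum, and gradient-based optimization) could be written as:

```python
import tensorflow as tf

def train_step(model, optimizer, noisy_mag, characteristics, clean_mag):
    """One illustrative optimization step for a mask-estimating model.

    noisy_mag / clean_mag: magnitude spectra of the noisy input and of the
    desired clean speech; characteristics: interfering sound characteristics.
    """
    with tf.GradientTape() as tape:
        coefficients = model([noisy_mag, characteristics], training=True)
        estimate = coefficients * noisy_mag            # apply predicted filter
        loss = tf.reduce_mean(tf.square(estimate - clean_mag))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```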
As further illustrated in
The DNN 534 thus receives voice characteristics data (applied to an input layer of the DNN 534, or to an intermediate layer) produced by the module 510, and noisy sound data which optionally may be processed by the input processor 532 performing at least some of the pre-processing operations discussed with respect to
Another way to obtain training samples would be to record the speech produced by local speakers in response to speech from a remote speaker(s) provided to the local speaker through headsets. The local speaker's and the remote speaker's speeches can then be replayed through loudspeakers deployed in a sound environment and the resultant combined speech produced through the loudspeakers can be recorded. This setup would thus provide the clean speech (local speaker's speech in isolation from the remote speaker's speech) and the noisy speech (the local speaker's speech combined with the interfering remote speaker's speech). This setup can be repeated for a large number of loudspeakers with different configurations of noise environments (e.g., in car cabins, office, home environments, etc., and with different configurations of deployed loudspeaker devices and microphones). Another way to obtain training samples is to have human speakers repeat scripted dialogues twice, the first time with the remote speaker's sound not produced within the noise environment (i.e., the local speaker would wear headsets), and the second time with the remote speaker's sound produced by loudspeaker devices deployed within the sound environment. Other ways to produce training samples (e.g., through artificial production of speech samples) can also be used.
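A simplified sketch of forming one (noisy, clean) training pair from such recordings follows; the mixing gain and the truncation-based alignment are illustrative simplifications rather than part of the disclosed procedures.

```python
import numpy as np

def make_training_pair(local_speech: np.ndarray,
                       interfering_playback: np.ndarray,
                       interference_gain: float = 0.5):
    """Form one illustrative (noisy, clean) training pair.

    local_speech: the desired talker recorded in isolation (the clean target).
    interfering_playback: the remote/synthesized speech as reproduced by a
        loudspeaker in the sound environment (e.g., re-recorded, or convolved
        with a measured room impulse response).
    """
    n = min(len(local_speech), len(interfering_playback))
    clean = local_speech[:n]
    noisy = clean + interference_gain * interfering_playback[:n]
    return noisy, clean
```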
With reference next to
As further illustrated in
The signal 303, representative of the selected voice assistant (TTS) profile from the N available profiles 334A-N, is provided to a selector (multiplexor) 322 of the SSE system 320, which is configured to select one of N available voice characteristics records that, when provided to the unit 324, cause (actuate or modulate) the unit to adaptively perform suppression operations on the signal 304 to suppress at least one interfering sound component from the multiple components of the signal 304. The speaker-aware speech enhancement unit 324 may be, in some implementations, a machine learning system in which a machine-learning engine, trained on training samples that include clean and noisy sound signal samples, along with voice characteristics of a format and configuration similar to that provided by the N TTS voice characteristics stored in a memory device of the SSE system 320, produces clean sound sample output that suppresses sound components associated with the selected TTS profile. For a particular sound profile, the voice characteristics may include, for example, an embedding vector from a vector space defined by a machine learning system that outputs compact vector representations (parameterized representations) of sound/voice samples. However, instead of processing sound samples to produce voice characteristics during runtime (which could be a computationally intensive effort), the embodiments of
As noted, the unit 324 may include a machine learning engine that produces output responsive to the voice characteristics from the selector 322. In various example embodiments, the machine learning engine (which may be implemented similarly to the DNN 220 of
It is to be noted that in some example embodiments the SSE system 320 may store pre-computed machine learning models (represented as the weights to be assigned to the various machine learning interconnections between the nodes of the machine learning engine). In such embodiments, one of the stored models, corresponding to a particular speaker or voice assistant profile currently providing the sound signals that are produced by the one or more loudspeaker devices in the sound environment, is downloaded to the machine learning engine in response to identification of the particular speaker or voice assistant profile.
With reference next to
In some examples, at least one of the one or more interfering sound sources may include voice assistant audio produced using a voice assistant profile selected from a plurality of voice assistant profiles maintained by a text-to-speech voice assistant system. In such examples, determining the interfering sound characteristics may include selecting from data storage (that stores respective interfering sound characteristics for the plurality of voice assistant profiles) a selected interfering sound characteristics corresponding to the selected voice assistant profile. Suppressing the at least one of the multiple sound components may include receiving, by an interference suppressing machine learning system trained to controllably suppress voice assistant sound components according to interfering sound characteristics respectively associated with the plurality of voice assistant profiles, the selected interfering sound characteristics corresponding to the selected voice assistant profile, and sound data representative of the multi-source sound signal sample, and processing the sound data by the interference suppressing machine learning system according to the selected interfering sound characteristics to produce output data representative of an output sound sample in which at least one of the multiple sound components, associated with sound produced according to the selected voice assistant profile, is suppressed.
Processing the sound data with the interference suppressing machine learning system may include determining, by the machine learning system, filter coefficients for a time-varying linear filter based on at least part of the sound data and the selected interfering sound characteristics, and applying the determined filter coefficients to at least part of a representation of the multi-source sound signal sample to yield an interference suppressed output sound signal. Processing the sound data with the interference suppressing machine learning system may include generating, by the interference suppressing machine learning system, predicted output representative of the output sound sample in which the at least one of the multiple sound components is suppressed.
The computation of interfering sound characteristics may be performed in real-time. Thus, in such examples, determining interfering sound characteristics may include receiving the one or more sound signals that produce the one or more interfering sound sources, and deriving from the one or more sound signals the determined interfering sound characteristics. Suppressing the at least one of the multiple sound components may include receiving, by an interference suppressing machine learning system trained to controllably suppress the at least one of the multiple sound components according to the interfering sound characteristics, the derived interfering sound characteristics and sound data representative of the multi-source sound signal sample, and processing the sound data by the interference suppressing machine learning system according to the interference data to produce output data representative of an output sound sample in which the at least one of the multiple sound components is suppressed. Processing the sound data with the interference suppressing machine learning system may include determining, by the interference suppression machine learning system, filter coefficients for a time-varying linear filter based on at least part of the sound data and the interference data, and applying the determined filter coefficients to at least part of a representation of the multi-source sound signal sample to yield an interference suppressed output sound signal. In some examples, processing the sound data with the interference suppressing machine learning system may include generating, by the interference suppressing machine learning system, predicted output representative of the output sound sample in which the at least one of the multiple sound components is suppressed.
The determined interfering sound characteristics may include one or more of, for example, a vector representation in an embedding vector space for the one or more sound signals, spectral characteristics for the one or more sound signals, temporal characteristics for the one or more sound signals, and/or identifier data representative of identity of sources producing the one or more sound signals. Suppressing the at least one of the multiple sound components may include suppressing the at least one of the multiple sound components without using a reference signal for any of the plurality of sound sources, to reduce echo effects caused by sound components produced by the one or more loudspeakers.
Suppressing the at least one of the multiple sound components may include receiving by the interference suppressing machine learning system interference data representative of the interfering sound characteristics, receiving sound data representative of the multi-source sound signal sample, and processing the sound data by the interference suppressing machine learning system according to the interference data to produce output data representative of an output sound sample in which the at least one of the multiple sound components is suppressed. In such examples, receiving the sound data may include converting, by an acoustic transducer device, a raw sound sample into a time-domain sound signal, transforming the time-domain sound signal into a transformed domain representation, and extracting features from the transformed domain representation to produce the sound data. Extracting the features may include deriving from the transformed domain representation one or more of, for example, complex signal spectra features, spectral magnitude features, log spectral magnitude features, log mel spectra features, and/or mel-frequency cepstral coefficients.
In some examples, the predicted output of the interference suppressing machine learning system may include one or more of, for example, a temporal representation of the output sound sample, or a spectral representation of the output sound sample.
Suppressing the at least one of the multiple sound components may include providing the interfering sound characteristics to a machine learning system executing a single machine learning model trained to controllably suppress acoustic interference sound components in input sound signals according to different sound characteristics for a plurality of interference sound signals associated with one or more speakers or one or more voice assistant profiles, and processing sound data, representative of the multi-source sound signal sample, provided as input to the machine learning system based on the determined interfering sound characteristics provided to the machine learning system to produce output data representative of an output sound sample in which the at least one of the multiple sound components is suppressed. Providing the determined interfering sound characteristics to the machine learning system may include providing the determined interfering sound characteristics to one or more of, for example, an input layer of the machine learning system, and/or one or more intermediate hidden layers of the machine learning system.
Implementations described herein, including implementations using neural networks, can be realized on any computing platform, including computing platforms that include one or more microprocessors, microcontrollers, and/or digital signal processors that provide processing functionality, as well as other computation and control functionality. The computing platform can include one or more CPUs, one or more graphics processing units (GPUs, such as NVIDIA GPUs), and may also include special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), a DSP processor, an accelerated processing unit (APU), an application processor, customized dedicated circuitry, etc., to implement, at least in part, the processes and functionality for the neural network, processes, and methods described herein. The computing platforms typically also include memory for storing data and software instructions for executing programmed functionality within the device. Generally speaking, a computer accessible storage medium may include any non-transitory storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium may include storage media such as magnetic or optical disks and semiconductor (solid-state) memories, DRAM, SRAM, etc. The various learning processes implemented through use of the neural networks may be configured or programmed using TensorFlow (a software library used for machine learning applications such as neural networks). Other programming platforms that can be employed include keras (an open-source neural network library) building blocks, NumPy (an open-source programming library useful for realizing modules to process arrays) building blocks, etc.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly or conventionally understood. As used herein, the articles “a” and “an” refer to one or to more than one (i.e., to at least one) of the grammatical object of the article. By way of example, “an element” means one element or more than one element. “About” and/or “approximately” as used herein when referring to a measurable value such as an amount, a temporal duration, and the like, encompasses variations of ±20% or ±10%, ±5%, or ±0.1% from the specified value, as such variations are appropriate in the context of the systems, devices, circuits, methods, and other implementations described herein. “Substantially” as used herein when referring to a measurable value such as an amount, a temporal duration, a physical attribute (such as frequency), and the like, also encompasses variations of ±20% or ±10%, ±5%, or ±0.1% from the specified value, as such variations are appropriate in the context of the systems, devices, circuits, methods, and other implementations described herein.
As used herein, including in the claims, “or” as used in a list of items prefaced by “at least one of” or “one or more of” indicates a disjunctive list such that, for example, a list of “at least one of A, B, or C” means A or B or C or AB or AC or BC or ABC (i.e., A and B and C), or combinations with more than one feature (e.g., AA, AAB, ABBC, etc.). Also, as used herein, unless otherwise stated, a statement that a function or operation is “based on” an item or condition means that the function or operation is based on the stated item or condition and may be based on one or more items and/or conditions in addition to the stated item or condition.
Additional details regarding the embodiments described herein are provided in Appendix A appended hereto.
A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent, and thus can be performed in an order different from that described.
This application claims the benefit of U.S. Provisional Application No. 63/431,382, filed Dec. 9, 2022, the contents of which are incorporated herein.