In one aspect, there are provided systems and methods for speech enhancement based on ultrasound.
In some embodiments, there is provided a method including receiving, by a machine learning model, first data corresponding to noisy audio including audio of a target speaker of interest proximate to a microphone; receiving, by the machine learning model, second data corresponding to articulatory gestures sensed by the microphone which also detected the noisy audio, wherein the second data corresponding to the articulatory gestures comprises one or more Doppler data indicative of Doppler associated with the articulatory gestures of the target speaker while speaking the audio; generating, by the machine learning model, a first set of features for the first data and a second set of features for the second data; combining, by the machine learning model, the first set of features for the first data and the second set of features for the second data to form an output representative of the audio of the target speaker that reduces, based on the combined first and second features, noise and/or interference related to at least one other speaker and/or related to at least one other source of audio; and providing, by the machine learning model, the output representative of the audio of the target speaker.
In some variations of the methods, systems, and computer program products, one or more of the following features can optionally be included in any feasible combination. The method may further include emanating, via a loudspeaker, ultrasound towards at least the target speaker, wherein the ultrasound is reflected by the articulatory gestures and detected by the microphone. The method may further include receiving an indication of an orientation of a user equipment including the microphone and the loudspeaker; and selecting, using the received indication, the machine learning model. The ultrasound includes a plurality of continuous wave (CW) single frequency tones. The articulatory gestures include gestures associated with the target speaker's speech including mouth gestures, lip gestures, tongue gestures, jaw gestures, vocal cord gestures, and/or gestures of other speech-related organs. The generating, by the machine learning model, the first set of features for the first data and the second set of features for the second data may further include using a first set of convolutional layers to provide feature embedding for the first data, wherein the first data is in a time-frequency domain, and using a second set of convolutional layers to provide feature embedding for the second data, wherein the second data is in the time-frequency domain. The first set of features and the second set of features are combined in the time-frequency domain while maintaining time alignment between the first and second set of features. The machine learning model includes one or more fusion layers to combine, in a frequency domain, the first set of features for the first data and the second set of features for the second data. A single stream of data (which is obtained from the microphone) is received and preprocessed to extract the first data comprising noisy audio and to extract the second data comprising the articulatory gestures. The phase of the output representative of the audio of the target speaker is phase corrected. During training of the machine learning model, a generator comprising the machine learning model is used to output a noise-reduced representation of audible speech of the target speaker, and a discriminator is used to receive as a first input the noise-reduced representation of audible speech of the target speaker, receive as a second input a noisy representation of audible speech of the target speaker, and output, using a cross-modal similarity metric, a cross-modal indication of similarity to train the machine learning model.
Implementations of the current subject matter can include, but are not limited to, systems and methods consistent with the present description, including one or more of the described features, as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations described herein. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a computer-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to ultrasound-based speech enhancement, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.
In the drawings,
Robust speech enhancement is a goal and a requirement of audio processing to enable, for example, human-human and/or human-machine interaction. Solving this task remains an open challenge, especially for practical scenarios involving a mixture of competing speakers and background noise.
In some embodiments disclosed herein, there are provided systems, methods, and articles of manufacture that use ultrasound sensing as a complementary modality to process (e.g., separate) a desired speaker's speech from interference and/or noise.
In some embodiments, a user equipment (e.g., a smartphone, mobile phone, IoT device, and/or other device) may emit an ultrasound signal and receive (1) the ultrasound reflections from the speaker's articulatory gestures and (2) the noisy speech from the speaker. The phrase articulatory gestures refers to the mouth, lips, tongue, jaw, vocal cords, and other speech-related organs associated with the articulation of speech. In some implementations, the use of the microphone at the user equipment to receive both the ultrasound reflections from the speaker's articulatory gestures and the noisy speech from the speaker may provide an advantage of synchronizing the two heterogeneous modalities (i.e., the speech and ultrasound modalities).
In some embodiments, the (1) ultrasound reflections from the speaker's articulatory gestures are received and processed to detect the Doppler shift of the articulatory gestures. In some embodiments, the noisy speech (which includes the speaker's speech as well as interference and/or noise such as from other speakers and sources of sound) is processed into a spectrogram. In other words, the target (or desired) speech is embedded in the noisy speech, which can make it difficult to discern the target speaker's speech.
In some embodiments, at least one machine learning (ML) model may be used to process the ultrasonic Doppler features (which correspond to the speaker's articulatory gestures) and the audible speech spectrogram (which includes the speaker's speech as well as interference and/or noise such as from other speakers and sources of sound) to output speech which has been enhanced by improving speech intelligibility and quality (e.g., by reducing if not eliminating some of the interference, such as noise caused by background speakers or other sources of sound). In other words, the ultrasonic Doppler features can be used by the ML model to correlate with the speaker's speech (and thus reduce or eliminate the noise or interference not associated with the speaker's speech and articulatory gestures).
In some embodiments, the at least one ML model may include an adversarially trained discriminator (e.g., based on a cross-modal similarity measurement network) that learns the correlation between the two heterogeneous feature modalities of the Doppler features and the audible speech spectrogram.
As noted, the user equipment 110 may receive at the microphone 150B the ultrasound and noisy speech and store (e.g., record) the received signals corresponding to the ultrasound and noisy speech for processing. During the voice recording, the user equipment may transmit (e.g., emit) inaudible ultrasound wave(s). This ultrasound transmission may be continuous during the voice recording phase. The transmitted ultrasound waves may be modulated by the speaker's articulatory gestures. For example, the speaker's 112 lip movement (which occurs, for example, within 18 inches of the microphone 150B) modulates the ultrasound waves, although other articulatory gestures (e.g., movement of the tongue, teeth, throat, and/or the like) may also modulate the ultrasound as well. The modulated ultrasound is then received by the microphone 150B along with the noisy speech (which includes the speech of the desired speaker 112 as well as the noise/interference 114A-D). Moreover, the received ultrasound 118A and received noisy speech 118B may be stored (e.g., recorded) for processing by at least one ML model 120. The ML model may be implemented at the user equipment 110. Alternatively, or additionally, the ML model may be implemented at another device (e.g., a server, cloud server, and/or the like). Although the received noisy speech includes the speech of the desired speaker 112 (as well as the noise/interference 114A-D), the received ultrasound for the most part only captures the targeted speaker's 112 articulatory gesture motion (which can be correlated with the speaker's 112 speech).
In some embodiments, the ML model 120 may comprise a deep neural network (DNN) system that captures the correlation between the received ultrasound's articulatory gestures 118A and the received noisy speech 118B, and this correlation may be used to enhance (e.g., denoise, which refers to reducing or eliminating noise and/or interference) the noisy speech to form the output 122 of enhanced speech. For example, the speaker's 112 speech may include the term "to." In this example, the received ultrasound sensed by the microphone 150B may include the articulatory gestures (e.g., lip and/or tongue movement), which can be correlated to the term "to" in the noisy speech that is also received by the microphone 150B. This correlation may be used to process the noisy speech so that the "to" can be enhanced, while the noise and interference are reduced and/or filtered/suppressed.
Before providing additional description about the system of
Human speech generation involves multiple articulators, such as the tongue, lips, jaw, teeth, vocal cords, and other speech-related organs. Coordinated movement of the articulators, such as lip protrusion and closure, tongue stretch and constriction, jaw angle change, and/or the like, may be used to at least in part define one or more phonological units (e.g., a phoneme in phonology and linguistics). If the articulatory gestures could be fully captured and interpreted, it would be possible to recover the speech signals using the articulatory gestures alone. In practice, however, it is challenging to capture the fine-grained gesture motion of all articulators by using only articulatory gestures: some of the articulators are close to each other, some articulators can be inside the mouth/throat (so it is hard to discriminate their motion), and the articulatory gestures can be fast and subtle. For example, an articulatory gesture may last between about 100 and 700 milliseconds (ms) and may involve less than about 5 centimeters (cm) of moving distance in the case of, for example, the lip and jaw. Rather than recover speech only using the articulatory gestures, the system of
The velocity of the speaker's 112 articulatory gestures can range, for example, from about −80 cm/second to about 80 cm/second (−160 to 160 cm/s of propagation path change). This can introduce a corresponding Doppler shift of, for example, about −100 Hertz (Hz) to about 100 Hz when the transmitted ultrasound signal's frequency is 20 kHz. Moreover, each articulatory gesture may correspond to a single phoneme lasting, for example, about 100 milliseconds (ms) to about 700 ms. To characterize the articulatory gestures, the short-term, high-resolution Doppler shift may be used, while being robust to multipath and frequency-selective fading, so that the signal features from the articulatory gestures alone can be identified or extracted.
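By way of a non-limiting illustration, the Doppler arithmetic above may be sketched as follows (the 343 m/s speed of sound and the example velocity are assumptions used only for illustration):

```python
# Doppler shift of ultrasound reflected from a moving articulator.
# A minimal sketch of the arithmetic described above; the speed of sound
# (c ~ 343 m/s at room temperature) is an assumed value.
def doppler_shift_hz(path_velocity_m_s: float, carrier_hz: float, c: float = 343.0) -> float:
    """Doppler shift (Hz) for a reflection whose propagation path length changes at path_velocity_m_s."""
    return (path_velocity_m_s / c) * carrier_hz

# Example from the text: +-80 cm/s articulator velocity => +-160 cm/s path change,
# which at a 20 kHz carrier gives roughly +-93 Hz, i.e., on the order of +-100 Hz.
print(doppler_shift_hz(1.6, 20_000))  # ~93.3 Hz
```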
In some embodiments, the ultrasound transmitted by the loudspeaker 150A may comprise a continuous wave (CW) ultrasound signal, such as multiple single-tone (e.g., single frequency) continuous waves (CWs) having linearly spaced frequencies. Although modulated CW signals (e.g., frequency modulated continuous wave, orthogonal frequency division multiplexing, and pseudo-noise (PN) sequences) may measure the impulse response to resolve multipath, they may suffer from a low sampling rate problem. A reason for this is that the modulation processes the signal in segments (e.g., chirp period or symbol period). Thus, each feature point of the modulated CW signal characterizes the motion within a whole segment, which is typically longer than 10 ms (960 samples) at a sampling rate of 96 kHz, so only about 10 to about 70 feature points can be output for each articulatory gesture with a typical duration of about 100 ms to about 700 ms, which may not be sufficient to represent the fine-grained instantaneous velocity of articulatory gesture motion. By comparison, each sampling point of a single-tone CW can generate one feature point (Doppler shift estimation) to represent the micro-motion with a duration of, for example, 0.01 ms (1/96000) at a sampling rate of 96 kHz. To further resolve the multipath effect and frequency selective fading, multiple single-tone CWs with equal frequency spacing may be combined, which may result in a transmitted ultrasound waveform T(t) = Σ_{i=1}^{N} A_i cos(2πf_i t), where N, A_i, and f_i denote the number of tones, and the amplitude and frequency of the i-th tone, respectively. And, to alleviate the spectral leakage across different tones when generating the spectrogram in a later processing stage, the short time Fourier transform (STFT) window size (e.g., 1024 points) may be chosen to span a full cycle of all the transmitted tones at a maximum sampling rate (e.g., 48 or 96 kHz via microphone 150B).
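By way of a non-limiting illustration, a multi-tone CW waveform of the form T(t) = Σ A_i cos(2πf_i t) may be generated as sketched below; the number of tones, tone amplitudes, starting tone index, and 96 kHz sampling rate are illustrative assumptions rather than values required by the disclosure:

```python
import numpy as np

fs = 96_000                      # sampling rate (Hz), assumed
n_tones = 8                      # number of single-frequency tones, assumed
spacing = fs / 1024              # 93.75 Hz spacing: every tone completes an integer
                                 # number of cycles within a 1024-sample STFT window
base_bin = 208                   # lowest tone = 208 * 93.75 Hz = 19.5 kHz, assumed (inaudible band)
amps = np.full(n_tones, 0.1)     # low amplitude to limit interference with speech harmonics

t = np.arange(fs) / fs           # one second of transmitted signal
freqs = spacing * (base_bin + np.arange(n_tones))
tx = (amps[:, None] * np.cos(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
```

Placing every tone on a multiple of fs/1024 is one way to realize the "full cycle within the STFT window" condition mentioned above.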
Despite the orthogonality in frequency, there may be mutual interference between the speech and the articulatory gesture ultrasound, which can cause ambiguity in the Doppler features. For example, harmonics of the speech may interfere with the Doppler features extracted from the articulatory gesture ultrasound. Specifically, the speech harmonics may interfere with the Doppler features due to non-linearity of the microphone hardware. In some embodiments, the amplitude of the transmitted ultrasound may be adjusted (e.g., decreased), such that the speech signal harmonics (which interfere with the ultrasound and its Doppler) are reduced (or eliminated). Moreover, when the speaker 112 speaks close to the microphone 150B, some of the phonemes (e.g., /p/ and /t/) may blow air into the microphone, which can generate high-volume noise. Rather than remove the corrupted, noisy speech samples, the ML model 120 may be used to characterize the sampling period corresponding to the specific phonemes (e.g., /p/ and /t/) causing the air-flow-related noise at the microphone.
Referring again to
In some embodiments, the noisy speech 118B is transformed into a time-frequency spectrogram, which serves as a first input to the ML model 120. In some embodiments, the Doppler shift features are, as noted, extracted from the received ultrasound (corresponding to the articulatory gestures) 118A, which serves as a second input to the ML model 120.
The ML model 120 may include one or more layers (which may form a "subnetwork" and/or a "block," such as a computational block) 304A that provide feature embedding of the received ultrasound 302A. Likewise, one or more layers 304B provide feature embedding of the received noisy speech spectrogram 302B. The embedding takes the input and generates a lower dimensional representation of the input. At 306, the ultrasound features (which are output by the one or more layers 304A and labeled "U-Feature") are fused (e.g., concatenated, combined, etc.) with the noisy speech features (which are output by the one or more layers 304B and labeled "S-Feature").
In the example of the speech feature embedding layers 304B, the input 302B is, for example, a time-frequency (T-F) domain amplitude spectrogram that is represented as S_noise^a ∈ ℝ^(1×T×F).
In the example of the ultrasound feature embedding layers 304A, the input 302A is represented as U_s ∈ ℝ^(T×F).
Referring again to the fusion layers 306, the output 308 of the fusion layers may be considered a mask (referred to herein as an "amplitude Ideal Ratio Mask," or aIRM). The mask provides a ratio between the magnitudes of the clean and noisy spectrograms by using the speech and ultrasound inputs. For example, the mask learns the ratio between the targeted (or desired) speaker's clean speech and the noisy speech. To illustrate further, for each time-frequency slot in the spectrogram, the mask provides a ratio between the targeted (or desired) speaker's clean speech and the noisy speech so that when the noisy speech is multiplied with the mask, the final output is only (or primarily) the cleaned speech of the targeted/desired speaker. The use of "ideal" refers to an assumption that the desired speaker's speech signal and the noise signal are independent and have known power spectra. The first set of layers (or subnetwork) of the fusion layers 306 provides two-stream feature embedding by using the noisy speech's T-F spectrogram and the concurrent ultrasound Doppler spectrogram and transforming the two-stream feature embedding (of the different ultrasound and speech modalities) into the same feature space while maintaining alignment in the time domain. The second set of layers provides a speech and ultrasound fusion subnetwork that concatenates the features of each stream in the frequency dimension along with a self-attention layer and a BiLSTM layer to further learn the intra- and inter-modal correlation in both the frequency domain and the time domain.
In the first set of layers of the fusion layers 306 for example, a self-attention layer (labeled “Self Att Fusion”) is applied to fuse the concatenated feature maps to let the multi-modal information “crosstalk” with each other. Here, the crosstalk means that the self-attention layers can assist the speech and Doppler features to learn the intra- and inter-modal correlation between each other effectively. The fused features are subsequently fed into the second set of layers including a bi-directional Long short-term memory (BiLSTM, labeled BiLSTM 600) layer followed by three fully connected (FC) layers. The resulting output 308 is a ratio mask (which corresponds to the ratio between targeted clean speech and the noisy speech) that is multiplied 310 with the original noisy amplitude spectrogram 302B to generate the amplitude-enhanced T-F spectrogram 312.
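By way of a non-limiting illustration, the ratio mask and its application at 310 may be sketched as follows; the clipping of the mask to [0, 1] and the epsilon term are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the amplitude ideal ratio mask (aIRM) described above.
# During training the target mask can be computed from paired clean/noisy
# spectrograms; at inference the network predicts the mask from its inputs.
def airm(clean_amp: np.ndarray, noisy_amp: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per time-frequency-bin ratio of clean to noisy magnitude (clipped to [0, 1], an assumption)."""
    return np.clip(clean_amp / (noisy_amp + eps), 0.0, 1.0)

def apply_mask(noisy_amp: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise multiplication corresponding to 310, yielding the amplitude-enhanced spectrogram 312."""
    return noisy_amp * mask
```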
To illustrate further, the ML model 120 aims to appropriately learn the frequency domain features of the speech and ultrasound modalities, and then fuse the speech and ultrasound modalities together to exploit the time-frequency domain correlation. The frequency domain of the ultrasound signal features represents a motion velocity (e.g., Doppler shift) of the articulatory gestures, while that of the speech sound represents frequency characteristics such as harmonics and consonants. As the sizes of the two modalities' feature maps are different, the two feature maps cannot simply be concatenated, so the two-stream embedding framework is used to transform the modalities into the same feature space. After the feature embedding, the feature maps of the two streams are concatenated, which may be represented as S_inf = concat(M_a^a, U_a^s), where S_inf ∈ ℝ^(C×T×F) denotes the concatenated feature map.
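By way of a non-limiting illustration, the two-stream embedding and fusion arrangement described above may be sketched in PyTorch as follows; the layer counts, channel sizes, attention configuration, and 600-unit hidden size (mirroring the "BiLSTM 600" label) are illustrative assumptions and not the specific architecture of the layers 304A, 304B, and 306:

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Sketch: conv embeddings for speech and ultrasound, frequency-axis concatenation,
    self-attention + BiLSTM + fully connected layers producing a ratio mask."""
    def __init__(self, f_speech: int, f_ultra: int, ch: int = 16, hidden: int = 600):
        super().__init__()
        # Two-stream feature embedding (stand-ins for layers 304B and 304A).
        self.speech_emb = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU())
        self.ultra_emb = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU())
        fused_dim = ch * (f_speech + f_ultra)   # concatenation along the frequency axis
        self.attn = nn.MultiheadAttention(fused_dim, num_heads=4, batch_first=True)
        self.bilstm = nn.LSTM(fused_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, 2 * hidden), nn.ReLU(),
            nn.Linear(2 * hidden, 2 * hidden), nn.ReLU(),
            nn.Linear(2 * hidden, f_speech), nn.Sigmoid())  # ratio mask bounded to [0, 1]

    def forward(self, speech_spec, ultra_spec):
        # speech_spec: (B, 1, T, F_speech), ultra_spec: (B, 1, T, F_ultra); time axes are aligned.
        s = self.speech_emb(speech_spec)                 # (B, C, T, F_speech)
        u = self.ultra_emb(ultra_spec)                   # (B, C, T, F_ultra)
        fused = torch.cat([s, u], dim=-1)                # concatenate along the frequency dimension
        b, c, t, f = fused.shape
        fused = fused.permute(0, 2, 1, 3).reshape(b, t, c * f)
        fused, _ = self.attn(fused, fused, fused)        # "crosstalk" between the two modalities
        fused, _ = self.bilstm(fused)
        mask = self.fc(fused)                            # (B, T, F_speech)
        return mask * speech_spec.squeeze(1)             # amplitude-enhanced spectrogram (cf. 310/312)
```

The sigmoid output bounds the predicted mask to [0, 1], which is one common (assumed) way to keep the masked amplitude no larger than the noisy amplitude.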
In some embodiments, a conditional generative adversarial network (cGAN) is used in training to denoise the output 308, such as the amplitude-enhanced T-F spectrogram. During the training of the ML model 120, the cGAN may be used to determine the weights of the ML model 120.
An element of the cGAN is the similarity metric used by the discriminator 404. Unlike traditional GAN applications (which compare between the same type of features), the cGAN is cross-modal, so the cGAN needs to discriminate between different modalities, such as whether the enhanced T-F speech spectrogram matches the ultrasound Doppler spectrogram (e.g., whether they are a “real” or “fake” pair). A cross-modal Siamese neural network may be used to address this issue. The Siamese neural network uses shared weights and model architecture while working in tandem on two different input vectors to compute comparable output vectors. Although a traditional Siamese neural network is used to measure the similarity between two inputs from the same modality (e.g., two images), to enable a cross-modal Siamese neural network, two separate subnetworks may be created as shown at
As shown in
The similarity measurement may be used as a discriminator 404 (
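By way of a non-limiting illustration, a cross-modal similarity network of the kind described above may be sketched as follows; the two separate embedding subnetworks, their layer sizes, and the use of cosine similarity as the comparable output are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSimilarity(nn.Module):
    """Sketch of a cross-modal similarity network usable as the discriminator 404:
    separate subnetworks embed the speech spectrogram and the ultrasound Doppler
    spectrogram into a shared space, and a cosine similarity scores the pair."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.speech_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        self.ultra_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, speech_spec, ultra_spec):
        # Returns a similarity in [-1, 1]; "real" (matching) pairs should score high,
        # "fake" (mismatched or poorly enhanced) pairs low.
        s = F.normalize(self.speech_net(speech_spec), dim=-1)
        u = F.normalize(self.ultra_net(ultra_spec), dim=-1)
        return (s * u).sum(dim=-1)
```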
In the example of
Referring to
Referring to
In some embodiments, there may be provided preprocessing to extract the speech and ultrasound from the output of the microphone 150B (or a stored version of the microphone's output). As noted, the microphone receives both the noisy speech (which includes the desired speaker's 112 speech of interest) and the ultrasound (which includes the sensed articulatory gestures). The preprocessing may extract from the microphone output the speech and ultrasound features.
In the case of the Doppler, the STFT 510 is applied, which allows the Doppler shift to be identified and extracted at 512 from the time-frequency bins of the STFT and provides the time-frequency ultrasound 302A. In the case of the speech audio, the filtered audio speech is resampled 514 and then the STFT 516 is applied to form the time-frequency noisy speech signal 302B. For example, the high-pass elliptic filter 504 may be used to isolate the signals above 16 kHz, where the ultrasound features are located. To extract the ultrasound sensing features within the T-F domain, the Doppler spectrogram induced by the articulatory gestures can be extracted and aligned with the speech spectrogram 302B. A consideration for this step is to balance the tradeoff between the time resolution and the frequency resolution of the STFT 516 under a limited sampling rate (e.g., 96 kHz maximum). To guarantee time alignment between the speech and ultrasound features, their hop lengths in the time domain may be the same. The STFT uses a hop length of 10 ms to guarantee 100 frames per second, resulting in about 10 to about 70 frames per articulatory gesture, which is sufficient to characterize the process of an articulatory gesture. Moreover, the frequency resolution (which is determined by the window length) may be as fine-grained as possible to capture the micro-Doppler effects introduced by the articulatory gestures, under the premise that the time resolution is sufficient. An 85 ms window length is the longest STFT window that remains shorter than the shortest duration of an articulatory gesture (e.g., about 100 ms). For example, with a 96 kHz sampling rate (or less), the STFT may be computed using a window length of 85 ms, a hop length of 10 ms, and an FFT size of 8192 points, which results in an 11.7 Hz frequency resolution. To mitigate the reflections from relatively static objects, the 3 central frequency bins of the STFT may be removed while leaving 8×2 (16) frequency bins corresponding to Doppler shifts of [−11.7×8, −11.7) and (11.7, 11.7×8] Hz. Moreover, a min-max normalization may be performed on the ultrasound Doppler spectrogram.
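By way of a non-limiting illustration, the preprocessing described above may be sketched as follows; the elliptic filter order, the 16 kHz speech sampling rate, and the exact Doppler bin bookkeeping are illustrative assumptions:

```python
import numpy as np
from scipy import signal

def preprocess(mic: np.ndarray, fs: int = 96_000, tone_freqs=None):
    """Sketch: split the single microphone stream into the noisy-speech spectrogram (302B)
    and the ultrasound Doppler spectrogram (302A). Filter order, speech sampling rate,
    and bin selection below are assumptions for illustration only."""
    hop = int(0.010 * fs)                       # 10 ms hop -> 100 frames per second
    win = int(0.085 * fs)                       # 85 ms window
    # Ultrasound branch: high-pass elliptic filter above ~16 kHz (cf. filter 504).
    sos = signal.ellip(8, 1, 60, 16_000, btype='highpass', fs=fs, output='sos')
    ultra = signal.sosfiltfilt(sos, mic)
    _, _, Z = signal.stft(ultra, fs=fs, window='hann',
                          nperseg=win, noverlap=win - hop, nfft=8192)   # ~11.7 Hz bins
    doppler = []
    for f0 in (tone_freqs if tone_freqs is not None else []):
        k = int(round(f0 / (fs / 8192)))        # FFT bin of the transmitted tone
        offsets = list(range(-9, -1)) + list(range(2, 10))  # drop 3 central bins, keep 8x2 bins
        doppler.append(np.abs(Z[[k + o for o in offsets], :]))
    doppler = np.mean(doppler, axis=0) if doppler else np.abs(Z)
    doppler = (doppler - doppler.min()) / (doppler.max() - doppler.min() + 1e-8)  # min-max norm
    # Speech branch: resample the audible band (16 kHz assumed) and compute its STFT
    # with the same 10 ms hop so the two spectrograms stay time-aligned.
    speech = signal.resample_poly(mic, 1, 6)    # 96 kHz -> 16 kHz
    hop_s = 160                                 # 10 ms at 16 kHz
    _, _, S = signal.stft(speech, fs=16_000, window='hann',
                          nperseg=4 * hop_s, noverlap=3 * hop_s)
    return np.abs(S), doppler
```

Keeping the hop length at 10 ms in both branches is what preserves the frame-level time alignment discussed above.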
Compared to other speech enhancement technology, the system of
At 805, a machine learning model may receive first data corresponding to noisy audio including audio of a target speaker of interest proximate to a microphone, in accordance with some embodiments. For example, the machine learning model 120 (see, e.g.,
At 810, the machine learning model may receive second data corresponding to articulatory gestures sensed by the microphone which also detected the noisy audio, wherein the second data corresponding to the articulatory gestures comprises one or more Doppler data indicative of Doppler associated with the articulatory gestures of the target speaker while speaking the audio, in accordance with some embodiments. For example, the machine learning model 120 (see, e.g.,
At 815, the machine learning model may generate a first set of features for the first data and a second set of features for the second data, in accordance with some embodiments. For example, the machine learning model may receive time-frequency data, such as the noisy audio spectrogram at 302B. In this example, the ML model 120 may process the received data into features. In some embodiments, the ML model may include a first set of convolutional layers, such as the feature embedding layers 304B. And the first set of convolutional layers may be used to provide feature embedding for the first data, wherein the first data is in the time-frequency domain. In the example of
At 820, the machine learning model may combine the first set of features for the first data and the second set of features for the second data to form an output representative of the audio of the target speaker that reduces, based on the combined first and second features, noise and/or interference related to at least one other speaker and/or related to at least one other source of audio, in accordance with some embodiments. For example, the fusion layer 306 of
At 830, the machine learning model may provide the output representative of the audio of the target speaker, in accordance with some embodiments. For example, the output may correspond to the time-frequency data, such as the time-frequency spectrogram 312 which has been enhanced by reducing noise and/or interference. Alternatively, or additionally, the output may correspond to phase-corrected speech, such as speech 326 and/or 330 described in the example of
In some embodiments, a loudspeaker, such as the loudspeaker 150A, may generate ultrasound towards at least the target speaker 112, such that the ultrasound is reflected by the articulatory gestures of the target speaker (e.g., while the target speaker is speaking and moving lips, mouth, and/or the like) and then detected (as ultrasound) by the microphone 150B. Although some of the examples refer to a microphone or a loudspeaker, a plurality of microphones and/or loudspeakers may be used as well. As noted above, the ultrasound may be generated as a plurality of continuous wave (CW) single frequency tones.
In some embodiments, an indication may be received. This indication may provide information regarding an orientation of a user equipment 110 as shown at the example of
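By way of a non-limiting illustration, the orientation-based model selection may be sketched as follows; the orientation labels and the model registry are hypothetical:

```python
# Sketch: selecting a machine learning model variant based on the reported device orientation.
# The orientation labels and checkpoint names below are hypothetical placeholders.
MODELS = {"portrait_bottom_mic": "model_portrait.pt",
          "landscape_top_mic": "model_landscape.pt"}

def select_model(orientation: str) -> str:
    """Return the checkpoint trained for the reported orientation (fallback is assumed)."""
    return MODELS.get(orientation, "model_default.pt")
```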
In some embodiments, preprocessing may be performed as described with respect to the example of
In some embodiments, phase correction may be performed. For example, phase correction of the output 312 of the ML model 120 may be performed as noted above with respect to the example of
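By way of a non-limiting illustration, one common way to return the amplitude-enhanced spectrogram 312 to a waveform is to reuse the noisy phase before an inverse STFT, as sketched below; the disclosed phase-correction stage may differ from this approach:

```python
import numpy as np
from scipy import signal

# Sketch of one common phase-handling option (noisy-phase reuse followed by an inverse STFT).
# The STFT parameters match the earlier preprocessing sketch and are assumptions.
def reconstruct(enhanced_amp: np.ndarray, noisy_complex: np.ndarray,
                fs: int = 16_000, nperseg: int = 640, hop: int = 160) -> np.ndarray:
    phase = np.angle(noisy_complex)
    _, audio = signal.istft(enhanced_amp * np.exp(1j * phase), fs=fs,
                            nperseg=nperseg, noverlap=nperseg - hop)
    return audio
```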
In some embodiments, the machine learning model 120 is trained using a conditional generative adversarial network. Referring to the example of
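By way of a non-limiting illustration, one adversarial training step in the spirit of the cGAN described herein may be sketched as follows; the hinge-style losses, the L1 reconstruction term and its weighting, the optimizers, and the use of the earlier TwoStreamFusion and CrossModalSimilarity sketches as generator and discriminator are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt,
               noisy_spec, ultra_spec, clean_spec, lam: float = 100.0):
    """Sketch of a single cGAN training step: generator = enhancement network,
    discriminator = cross-modal similarity network scoring (speech, ultrasound) pairs."""
    # Discriminator: matching (clean, ultrasound) pairs -> high similarity,
    # (enhanced, ultrasound) pairs -> low similarity.
    with torch.no_grad():
        fake = generator(noisy_spec, ultra_spec)
    d_loss = F.relu(1.0 - discriminator(clean_spec.unsqueeze(1), ultra_spec)).mean() \
           + F.relu(1.0 + discriminator(fake.unsqueeze(1), ultra_spec)).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: fool the discriminator while staying close to the clean target.
    enhanced = generator(noisy_spec, ultra_spec)
    g_loss = -discriminator(enhanced.unsqueeze(1), ultra_spec).mean() \
           + lam * F.l1_loss(enhanced, clean_spec)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```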
In some implementations, the current subject matter may be configured to be implemented in a system 900, as shown in
In view of the above-described implementations of subject matter this application discloses the following list of examples, wherein one feature of an example in isolation or more than one feature of said example taken in combination and, optionally, in combination with one or more features of one or more further examples are further examples also falling within the disclosure of this application:
One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively, or additionally, store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.
The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.
This application claims priority to U.S. Provisional Application No. 63/301,461 entitled “SINGLE-CHANNEL SPEECH ENHANCEMENT USING ULTRASOUND” and filed on Jan. 20, 2022, which is incorporated herein by reference in its entirety.
This invention was made with government support under CNS-1954608 awarded by the National Science Foundation. The government has certain rights in the invention.
International Application: PCT/US2023/061047, filed Jan. 20, 2023 (WO).
Provisional Application: 63/301,461, filed January 2022 (US).