This invention relates generally to systems and methods for providing high quality wireless or wired communications. More particularly, this invention relates to systems and methods for providing clear voice communications under noisy conditions.
Conventional technologies for voice communications face a challenge due to the fact that wireless or wired communications, e.g., cellular phone calls, are often carried out in noisy environments. Such phone calls commonly occur when people are walking on the street, riding in a subway, driving on a noisy highway, eating in a restaurant, or attending a party or an entertainment event such as a music festival. Clear communications under those noisy circumstances are often difficult to realize.
Recent technical developments also enable hands-free communications. However, hands-free wireless communication faces the same challenge of achieving clear communications under noisy circumstances. For these reasons, noise cancellation has become an urgent challenge, and many conventional technical solutions attempt to overcome such difficulties. These techniques include beamforming, statistical noise reduction, frequency-bin filtering, deep learning-based noise cancellation using a large amount of data recorded under different noisy environments, etc. However, these techniques generally operate effectively and reliably only to cancel stationary or known noises. Clear and noise-free communications are still not achievable under most circumstances, since most wireless communications occur in noisy environments where the noises are neither stationary nor known in advance but change dynamically, especially under conditions of very low signal-to-noise ratio (SNR). Therefore, an urgent need still exists in the art of voice communications to provide effective and practical methods and devices to cancel noises for daily wireless communications.
It is therefore an aspect of the present invention to provide a new and improved noise cancellation system implemented with new devices and methods to overcome these limitations and difficulties. Specifically, the noise cancellation system includes wearable devices with a vibration sensor and microphones to detect and track speech signals. In one embodiment, the vibration sensors include MEMS accelerometers and piezoelectric accelerometers for installation in earbuds, necklaces, and patches placed directly on the upper body, such as on the chest, for detecting vibrations. In another embodiment, the vibration sensor may be implemented as a laser-based vibration sensor, e.g., a vibrometer, for non-contact vibration sensing. The wearable device also includes a wireless transmitter/receiver to transmit and receive signals. The clear voice recovery system further includes a converter to convert the vibration sensor and/or microphone sensor signals into a probabilistic distribution of linguistic representation sequences (PDLs) by using a rapidly adapted recognition model. The PDLs are then mapped into a full band Mel cepstral (MCEP) sequence by applying a mapping module that is first developed and trained during the adaptation phase. The clear personal speech to be transmitted to the other parties through the wireless communication is recovered by a vocoder using the full band MCEPs, aperiodic features (AP), Voiced/Unvoiced (VUV) information, and F0.
Alternatively, the speaker's unique features, in the form of an embedding, are used together with the vibration sensor signals to convert the vibration signals into the full band Mel-spectrogram of the speech from that speaker. The speaker's clear speech is then recovered from the full band Mel-spectrogram using a seq2seq synthesizer trained offline with many different speakers. The conversion from the vibration sensor signals and the speaker features to the full band Mel-spectrogram is trained during the adaptation phase.
The vibration sensor signals are not affected by the noises one would encounter in daily life. The new and improved systems and methods disclosed in this invention are therefore robust and intelligible under any type of noisy environment, requiring only a few minutes of the user's speech, captured during an enrollment mode or during actual use under quiet conditions. The systems and methods disclosed in this invention are further implemented with flexible configurations that allow different modules to reside in different nodes of the wireless communication, including the wearable, a computing hub, e.g., a smartphone, or the cloud.
Additional embodiments for broader noise-removal tasks beyond earbuds may use an accurate far-field automated speech recognition engine (FF-ASR) for noisy and/or reverberant environments. The FF-ASR translates the speaker's voice into a PDL, which is then converted by the rest of the system to a clean voice of the same speaker for various online communications or offline noise removal of speech recordings.
These and other objects and advantages of the present invention will no doubt become obvious to those of ordinary skill in the art after having read the following detailed description of the preferred embodiment, which is illustrated in the various drawing figures.
The present invention will be described based on the preferred embodiments with reference to the accompanying drawings.
Furthermore, in a background blocking mode, the earbuds can be designed to block background sounds so that the user can hear the sound (voice, music, etc.) from the other parties through the communication channel via the sound speaker in the wearable device such as earbuds. The background sound blocking can be done mechanically or algorithmically. Specifically, the mechanical blocker may be implemented as adaptive rubber buds that fit the ear openings and canals of each individual person, and the algorithmic blocker may be implemented using an active noise cancellation algorithm.
In a semi-transparent mode when the speaker intends to provide reduced background sounds to the hearers, one may mix the background sounds of a reduced volume set by the earbud user with the synthesized speaker's voice.
In any of its operating modes, an acoustic echo cancellation module is always incorporated to prevent the sounds of the other parties through the communication channel from getting into the microphones, accelerometers, and other vibration sensitive sensors, as one would normally do in other earbud implementations.
The process flow as shown in
Specifically, Processing step 100 is a speech recovering flow, which takes the microphone and vibration sensor inputs and recovers noise-free speech via synthesis.
Processing step 110 is one or more microphones that sense acoustic waves and convert them into a sequence of digital values. It is typically an air conduction microphone. It is also used, together with Processing step 120, for the signal-to-noise ratio calculation (Processing step 3023).
Processing step 120 is a vibration sensor that senses the vibration signals on contact and converts them into a sequence of digital values. Often, it is a bone conduction sensor.
Processing step 110 and processing step 120 sense the signals in a synchronized way with marked time stamps. They are used in an offline training phase (Processing step 400), an adaptation phase (Processing step 750, 750PR, 750PS), and real time recovery phase (Processing step 600, 600PR, 600PS).
Processing step 130 and Processing step 150 are the same feature extraction module. They take a sequence of digital signal values, analyze them, and produce one or more sequences of feature vectors. The feature vectors can be Mel-Frequency Cepstral Coefficients (MFCCs). Specifically, Processing step 130 takes the input from the microphone and Processing step 150 takes the input from the vibration sensor.
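As a purely illustrative, non-limiting sketch of this kind of feature extraction, the following Python code uses the librosa library to convert a sequence of digital signal values into a sequence of MFCC vectors; the 25 ms window, 10 ms hop, and 13-coefficient settings are assumed typical values, not requirements of this description.

```python
# Illustrative sketch of the feature extraction in Processing steps 130/150.
# Frame size, hop length, and coefficient count are assumed typical values.
import numpy as np
import librosa

def extract_mfcc(signal: np.ndarray, sample_rate: int, n_mfcc: int = 13) -> np.ndarray:
    """Convert a sequence of digital signal values into a sequence of MFCC vectors."""
    frame_len = int(0.025 * sample_rate)   # 25 ms analysis window
    hop_len = int(0.010 * sample_rate)     # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop_len)
    return mfcc.T                          # shape: (num_frames, n_mfcc)
```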
Processing step 140 contains three modules, Processing step 150, Processing step 200, and Processing step 700, that take the input from the vibration sensor and the output features from Processing step 130 to generate speaker-independent full band Mel Cepstral Features (MCEP). Processing step 200 takes input from the output of Processing step 150 and optionally the output from Processing step 130 and produces a sequence of probabilistic vectors in the form of a speaker-independent PDL representation (e.g., PPG), such as phonetic piece vectors, grapheme vectors, or word piece vectors, based on a pre-trained model in processing step 270.
Processing step 700 takes the phonetic representation from Processing step 200 and generates a full band MCEP sequence, based on a model in processing step 770 trained during the adaptation phase. Variations are given in
Processing step 500 takes the output from Processing step 150, partial band speaker-dependent features, including F0, Aperiodic (AP), Voiced/Unvoiced info, of vibration signals, and adapts them into corresponding full band features. Training details are given in
Processing step 160 is the vocoder that takes the speaker-independent features from Processing step 700 in combination with the speaker-dependent features from Processing step 500 to generate speech wave signals.
As shown in
In one embodiment, the wearable hosts include both the microphone and vibration sensors to perform event trigger detection and SNR calculation. Furthermore, the signals are transmitted to the mobile computing hub via wireless or wired connections, depending on the configuration of the speech recovery system. The SNR is used to decide which channel, either the vibration sensors or the microphone sensors, is used for transmitting the signals to the hub. In one particular embodiment, the speaker embedding and Mel-spectrogram calculation, the mapping of the vibration sensor Mel-spectrogram to the microphone sensor Mel-spectrogram, as well as the seq2seq conversion to speech waves can be performed in the hub or can also be carried out by a remotely connected cloud server.
Specifically, in
Processing step 112A is one or more microphones that sense acoustic waves and convert them into a sequence of digital signal values. It is typically an air conduction microphone. Processing step 112A is typically able to sense full band speech signals of 16 KHz or even wider ones at 22 KHz or 44 KHz.
Processing step 122A is a vibration sensor that senses the vibration signals on contact and converts them into a sequence of digital values. It is normally an accelerometer or a bone conduction sensor. Processing step 122A is typically only able to sense partial band signals (often below 5 KHz).
Processing step 112A and Processing step 122A sense the signals in a synchronized way with marked time stamps. They perform the same tasks as Processing step 112 and Processing step 122 but they receive time-synchronized signals from many speakers during this training phase. The time-synchronized vibration signals can be simulated by a mapping network trained from mic signals to vibration signals of a small parallel data set, when there is not enough parallel data for training.
Processing step 132A and Processing step 152A are the same feature extraction module as Processing step 130. However, they take input sequences of digital values from different sensors.
The input sequences are analyzed, and one or more sequences of feature vectors are produced. The output feature vectors can be Mel-spectrograms that contain speaker-dependent information. Specifically, Processing step 132A takes the input from the microphone and Processing step 152A takes the input from the vibration sensor. In addition, the output features may also include F0 and/or voiced/unvoiced features.
Processing step 142A extracts speaker identity representation from the output signals of the Processing step 112A and Processing step 122A. The speaker identity can be represented by one or a combination of speaker embedding, i-vector, super vector, etc.
Processing step 702A is a model trainer that trains a neural network model (Processing step 175A) used for mapping from a speaker identity representation and partial band Mel-spectrogram to the full band signal representation of the same or very similar speaker. See
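Purely as a hedged sketch of the kind of mapping model that could be trained in Processing step 702A (not the exact architecture of the model in Processing step 175A), the Python code below conditions a recurrent network on a speaker identity vector and maps partial band Mel-spectrogram frames to full band frames; all layer sizes and the loss choice are illustrative assumptions.

```python
# Illustrative speaker-conditioned partial-band -> full-band Mel-spectrogram mapper.
# Dimensions are assumed values, not specified by this disclosure.
import torch
import torch.nn as nn

class PB2FBMelMapper(nn.Module):
    def __init__(self, pb_bins: int = 40, fb_bins: int = 80,
                 spk_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.rnn = nn.LSTM(pb_bins + spk_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, fb_bins)

    def forward(self, pb_mel: torch.Tensor, spk_embed: torch.Tensor) -> torch.Tensor:
        # pb_mel: (batch, frames, pb_bins); spk_embed: (batch, spk_dim)
        spk = spk_embed.unsqueeze(1).expand(-1, pb_mel.size(1), -1)
        out, _ = self.rnn(torch.cat([pb_mel, spk], dim=-1))
        return self.proj(out)              # (batch, frames, fb_bins)

# The trainer could then minimize, e.g., an L1 loss between predicted and
# reference full band Mel-spectrograms from the parallel training data.
```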
Processing step 162A is a neural network model trainer that trains a mapping model taking the Mel-spectrogram of a speech signal as the input and producing its corresponding speech waveforms. The input to the processing step is a Mel-spectrogram with speaker-dependent info. When the resulting model of processing step 185A is trained, given the Mel-spectrogram of a speaker, the output speech wave signal is full band and sounds like the speech from the same speaker.
For both processing steps 702A and 162A, their training data sets typically contain many diverse speakers.
Specifically, in
Processing step 115A is one or more microphones that sense acoustic waves and convert them into a sequence of digital values. It is typically an air conduction microphone. Processing step 115A is the same kind of mic as Processing step 112A for training.
Processing step 125A is a vibration sensor that senses the vibration signals on contact and converts them into a sequence of digital values. Processing step 125A is the same kind of vibration sensor as Processing step 122A for training. During the online recovery process, Processing step 125A only receives signals from the one who wears the device with the sensor.
Processing step 115A and processing step 125A sense the signals in a synchronized way with marked time stamps.
Processing step 145A extracts a speaker identity representation from the output signals of the Processing step 115A and Processing step 125A. The speaker identity can be represented by one or a combination of speaker embedding, i-vector, and super vector, in the same way as the processing step 142A.
Processing step 155A is the same feature extraction module as Processing step 152A and Processing step 130. It takes a sequence of digital values from the vibration sensor Processing step 125A, analyzes them, and produces one or more sequences of feature vectors. The feature vectors can be Mel-spectrograms that contain speaker-dependent information. In addition, it may also derive F0 and/or voiced/unvoiced dynamic features.
Processing step 705A is a mapper that uses the trained model of Processing step 175A and takes the speaker id from Processing step 145A and the partial band (PB) Mel-spectrogram from Processing step 155A as the input, and generates the full band Mel-spectrogram of the same or very similar speaker as trained in processing step 702A. Details of Processing step 705A are given in
Processing step 165A is a speech synthesizer (e.g., a neural network sequence-to-sequence, or seq2seq, synthesizer) that takes the full band (FB) Mel-spectrogram with the same speaker voice characteristics from Processing step 705A as the input and produces its corresponding speech wave signal sequence, using the trained model of Processing step 185A. The resulting speech wave signal is full band and sounds like the speech of the same speaker. The processing step can also be implemented by other vocoders, such as the Griffin-Lim algorithm after linearization.
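As one hedged example of the Griffin-Lim alternative mentioned above, the full band Mel-spectrogram can first be "linearized" back to an approximate magnitude STFT and then inverted by iterative phase reconstruction; the librosa calls below exist in recent librosa releases, and the parameter values are assumptions for illustration only.

```python
# Illustrative Griffin-Lim fallback for Processing step 165A: mel -> linear -> waveform.
import numpy as np
import librosa

def mel_to_wave_griffinlim(fb_mel: np.ndarray, sample_rate: int = 16000,
                           n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    # "Linearization": approximate the magnitude STFT from the Mel-spectrogram.
    linear_mag = librosa.feature.inverse.mel_to_stft(fb_mel, sr=sample_rate, n_fft=n_fft)
    # Iterative phase reconstruction with the Griffin-Lim algorithm.
    return librosa.griffinlim(linear_mag, n_iter=60, hop_length=hop_length)
```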
In
Processing step 705A in
Processing step 715A is a neural encoder that takes the partial band (PB) Mel-spectrogram from Processing step 155A and optionally combined info from Processing step 145A (speaker identity info) and Processing step 175A (the encoding component of the Mel-spectrogram mapping model) as its input, and produces a sequence of vector representations with speaker-independent linguistic info such as the PDL described before. The PB Mel-spectrogram from Processing step 155A, collected from the vibration sensor(s), focuses on low frequency bins, while the info from Processing step 145A and Processing step 175A can supplement info in the higher frequency bins and subsequently provide better precision in deriving the above-mentioned linguistic info (PDL).
Processing step 725A adapts the PB speaker-dependent dynamic acoustic features, such as F0, VUV, etc., to their full band correspondent features, which may make use of the speaker identity info from Processing step 145A and the Mel-spectrogram from Processing step 175A. The detailed description of the adapter is given in
Processing step 735A is a decoder that takes the speaker-independent linguistic info from the Processing step 715A, the speaker id from Processing step 145A, the Mel-spectrogram mapping model Processing step 175A (the component for decoding in the Mel-spectrogram mapping model), as well as the result from Processing step 725A as input, and generates the full band (FB) Mel-spectrogram that is then sent to the synthesizer to produce the speech wave signals of the same speaker.
The encoding and decoding components can be trained separately with different training data sets because the linguistic representation is speaker-independent, similar to PDLs in variant B. The key difference is that the output of the model is Mel-spectrogram.
Processing step 102B describes an offline training phase of another embodiment as variant B of the speech recovery process. It contains processing steps as indicated in
Processing step 112B is one or more microphones that sense acoustic waves and convert them into a sequence of digital values. It is typically an air conduction microphone. Processing step 112B is able to sense a full band speech signal of 16 KHz, and even wider bands, e.g., 22 KHz. It can be the same as Processing step 112A and Processing step 115A.
Processing step 122B is a brainwave sensor that senses the brainwave signals and converts them into a sequence of digital values. Processing step 122B can be EEG, ECG, Neuralink N1 sensor, or other sensors that can detect and convert brainwaves.
Processing step 112B and processing step 122B sense the signals in a synchronized way with marked time stamps. Such paired signal sequences from thousands, tens of thousands, or even millions of speakers are collected during this offline training phase.
Processing step 132B is a decoder that converts microphone signals to a sequence of PDL vector representations, such as the phonetic pieces or graphemes described before. This kind of linguistic info representation can be obtained by using any accurate speech recognition technology.
Processing step 142B is a module that computes speaker identification info such as a speaker embedding or i-vector. It takes one or both inputs from Processing step 112B and Processing step 122B for computing the speaker id info. This can be implemented using an auto-encoder neural network or an i-vector.
Processing step 152B is a brainwave transcription module that converts brainwave signals to a sequence of MFCC features. This can be done by a neural network model to establish a mapping using a parallel data set of brainwave signals and speech signals collected simultaneously.
Processing step 202B takes the MFCC sequences from Processing step 152B with their corresponding time-synchronized PDL representation to train a neural network model that can be used to transcribe the MFCC sequences to the PDL representation. Processing step 175B is the generated model. The details of Processing step 202B are given in
Processing step 162B is a sequence-to-sequence neural network synthesizer trainer. It trains a mapping model taking the PDLs and speaker-specific info (e.g., embedding) of a speech signal as its input and producing its corresponding speech wave signal sequence as its output. The PDL vector representation contains speaker-independent linguistic info, while the speaker embedding encodes the speaker-dependent info. When the model is trained, the resulting speech wave signal sounds like the speech from the same speaker as represented by the speaker embedding. Processing step 162B is different from Processing step 162A, which takes the Mel-spectrogram as its input. Processing step 162B may be trained independently in two stages: stage 1 converts from PDLs to Mel-spectrogram and stage 2 from Mel-spectrogram to speech wave signal. These two stages can also be trained jointly.
Processing step 105B describes an online phase of another embodiment as variant B of the speech recovery process with a brainwave sensor and microphone(s) as its input. It contains processing steps as indicated in
Processing step 115B is one or more microphones that sense acoustic waves and convert them into a sequence of digital values. It is typically an air conduction microphone. Processing step 115B is the same kind of microphone as Processing step 112B for training.
Processing step 125B is a brainwave sensor that senses the brainwave signals and converts them into a sequence of digital values. Processing step 125B can be EEG, ECG, Neuralink N1 sensor, or other sensors that can detect brainwaves. It is the same kind of sensor as Processing step 122B for training.
Processing step 115B and processing step 125B sense the signals in a synchronized way with marked time stamps in the same way as in training.
Processing step 145B extracts a speaker identity representation (e.g., embedding) from the output signals of Processing step 115B and Processing step 125B. The speaker identity can be represented by one or a combination of speaker embedding, i-vector, super vector, F0, and/or voiced/unvoiced features.
Processing step 155B is a brainwave transcription module that converts brainwave signals to a sequence of MFCC features. This functions the same way as Processing step 152B with the same trained model.
Processing step 205B transcribes the MFCC sequences from Processing step 155B to the PDL vector representation using the model 175B generated by Processing step 202B in the training process of Processing step 102B.
Processing step 165B is a sequence to sequence neural network synthesizer that takes the linguistic info (e.g., PDL) from Processing step 205B and the speaker id info (e.g., speaker embedding) from Processing step 145B as the input, given the model 185B generated by Processing step 162B in the training process of Processing step 102B, and produces its corresponding speech wave signal sequence. The resulting speech wave signal from Processing step 165B will sound the same as the speaker represented by the speaker id info from Processing step 145B. This is different from Processing step 165A, which takes the Mel-spectrogram as the input. Processing step 165B may take a two stage approach with the first stage converting from PDLs to Mel-spectrogram and the second stage from Mel-spectrogram to speech wave signal, consistent with its training phase in processing step 102B.
Processing steps 162B, 165B and model 185B can be partially implemented by neural networks and integrated with other non-neural vocoders, such as the Griffin-Lim algorithm.
Furthermore, the mobile computing hub platform (processing step 800) is implemented with features to indicate to the wearable platform whether it is in the call mode and to further process trigger word signals after they are detected by the triggering detector (processing step 302). Therefore, the mobile computing platform is able to adapt and recover signals transmitted from and to the wearable components and is further able to pass the synthesized or processed signals via the cellular network to the other communication parties. In addition to the system implementation as shown in
The Processing step 800 is the computing hub, which may host a set of core processing components for speech recovery (Processing step 600) and speaker-dependent feature adaptation (Processing step 500). After the speech recovery process, the resulting recovered synthesized speech is sent to the cloud via cellular transmission processing step 805.
Processing step 803 has the functionality of exchanging the signal and info with processing step 306 in the wearable unit of processing step 300, while the Processing step 805 is responsible for info exchange via any cellular connection with the cloud of processing step 900. Processing step 805 is used to transfer recovered speech signals, as well as code and model updates, including any code and models in processing step 800 and processing step 300. If privacy is highly required, personal biometric info can be kept in this unit of processing step 800 without going to the cloud by using a certain type of configuration.
Specifically, when the computing hub is in a mode of communication with another person, conference call, or even automated service call, processing step 800 will tell the wearable unit of processing step 300 that it is in a call-on mode via processing step 803.
The Processing step 800 may also perform additional processing of the signals after the trigger detector alerts any event, which may include a further verification whether the event is indeed contained in the sent signals. This process takes place before the signals are sent to processing steps 500 and 600 for processing.
Processing step 900 complements the cellular network functionalities between the hub (Processing step 800) and the other parties with additional base model training (processing step 400) as well as code and model updates (processing step 905).
Processing step 400 performs base model training, such as the models for the automatic speech recognition engine to obtain a linguistic representation (e.g., phonetic pieces) from the speech signals, the conversion from a linguistic representation to MCEP with speaker-dependent info in case the speech biometric is agreed to be placed in the cloud, and the neural network speech synthesizer. The resulting models will be sent to the hub (processing step 800). Its detailed functionalities will be described in
Processing step 403 is a collection of models associated with the cloud based trainer of processing step 400. The models are passed down to processing step 800 as processing step 405 and are used for extracting certain features (as in processing step 152B and processing step 155B), extracting speaker-dependent identity representations (as for processing step 142A and processing step 145A, and processing step 142B and processing step 145B), transcribing signals into linguistic representations (for processing step 200 and processing step 715A, and model 175B for processing step 205B), adapting speaker-dependent features (as for processing steps 725A and 735A, and as model 175B for processing step 205B), and synthesizing personal speech (as model 185A for processing step 165B) in some cases.
Processing step 905 manages the code version update as well as downloading trained models of a proper version to Processing step 800 to be used by the corresponding modules.
Furthermore, the mobile computing hub platform can further process the trigger event signals transmitted from and to the wearable component and also provide indications to the wearable unit whether it is in the call mode. In addition, the system can pass the processed signals via the cellular network to the cloud platform for further processing. The cloud platform of the system shown in
When the wearable unit has a powerful computing capability, the functional modules in the mobile computing hub can be shifted to the wearable unit and in the extreme case, the wearable unit serves as the mobile computing hub as well.
When the wearable unit has a weak computing capability, certain functionalities in the wearable unit can be shifted to the mobile computing hub or even cloud. In an extreme case, the wearable unit may only host the functionality of signal collection and transmission.
The important sub-modules in
All modules in processing step 300 in
Processing step 800L serves the purpose of transmitting the signals from the terminal processing step 300 and passing them to the cloud processing step 900L. It still maintains the mode info of the hub so that processing step 300 can behave accordingly.
Processing step 900L now takes the signals from processing step 300 via processing step 800L, and performs all the live adapting and recovery processes in addition to the base model training processing step 400 and the resulting models processing step 403, as well as version updating functions of processing step 905.
The modules in different processing steps of
As in
As a recap, based on the above descriptions and diagrams, the sequences of the data flow of the earbud-based system are described below. During normal use in the on-call mode, the smartphone provides an indication to the earbuds that the communication process is operating in an on-call mode. Both the microphones and accelerometers in the earbuds receive raw signals, calculate the SNR, and transmit them to the smartphone via a wireless or wired connection, such as a Bluetooth connection. In the meantime, the smartphone receives the microphone and accelerometer signals as well as the SNR values from the earbuds. The recovery module in the phone recovers clean personal speech waves from the raw signals and further sends them to the other parties via the cellular network. Alternatively, the phone may send the signals to the cloud, where the recovery module recovers the personal clean speech signals from the noise-contaminated signals.
On the other hand, when a normal operation of the communication system is in a trigger mode, i.e., the off-call mode, the signals from both microphones and accelerometers in earbuds are fed to the trigger event detection module continuously. In the meantime, the trigger event detection module in earbuds detects the trigger word event and the trigger module sends a detection signal to open the gates so that both microphone and accelerometer signals are transmitted to the smartphone via wireless connection, such as a Bluetooth connection. In one embodiment, the SNR value is also sent to the smartphone. The subsequent commands are then interpreted in the smartphone to perform corresponding functions according to the commands received.
Besides smartphones, smart watches may also serve as the mobile computing hub. A similar configuration can be made for the realization of the clear speech recovery.
In the embodiments described above, there are multiple phases of the system development and usage, and modes of operation.
Base Model Construction or Training Phase
The base model is constructed by collecting high quality clean speech from many speakers and speech from the previously described sensors.
The base model is trained in the servers in the Cloud or locally and downloaded into computing hubs, such as smartphones and smart watches.
The adapters in the smartphone or in the cloud personalize the speech based on the base model downloaded from the servers in the cloud or locally.
During a Normal use in the On-Call Mode of Live Communication
The recoverer module in the phone recovers clean speech waves of the speaker from the raw signals and further sends them to the other parties via the cellular network. Alternatively, the phone may send the signals to the cloud, where the recoverer module recovers the clean speech signals of the speaker from the noise-contaminated signals.
During a Normal use in the Trigger Mode (the Call Mode is Off)
An AEC (acoustic echo cancellation) module is normally included to remove the echo of the signals to the acoustic speakers in the earbuds (e.g., balanced armature) from the microphones and vibration sensors when there is an output sound signal coming from the other parties.
Corresponding to the hardware system setup in
Specifically, in one embodiment, the cloud-based offline training is to obtain a base model that generates an intermediate representation from speech features that provides intermediate probabilistic distribution of linguistic representations (PDL), e.g., Phonetic Posteriorgrams (PPG), encoded bottleneck features. The speech features are typically the features used in speech recognition, including MFCC (Mel-Frequency Cepstral Coefficients).
In one embodiment, the personal model adaptation process module trains the following models under a quiet condition and adapts them afterwards:
A real time application during live communications is carried out when an event is triggered by detected relevant signals for machine commands, or when the mobile computing hub signals that human communications are to be carried out. The process continues with recovery operations wherein the full band clean speech waves of the speaker are recovered and generated from raw vibration signals and/or partially noisy mic signals.
In a semi-transparent mode when the speaker intends to provide reduced background sounds to the hearers, one may mix the background sounds of a reduced volume set by the earbud user with the synthesized speaker's voice. The background sounds are from microphones.
More generally, for live noise-removal tasks beyond earbuds where the front-end signal collection sensors contain microphones for near-field or far-field speech signals, one embodiment may use an accurate far-field automated speech recognition engine (FF-ASR) in noisy conditions and/or reverberant environments to obtain the PDL from the speech of the intended speaker. The FF-ASR makes use of various noise-cancellation techniques, including beamforming, reverberation removal, etc., coupled with multiple speech channels or recordings via microphone arrays. The FF-ASR translates the speaker's voice into a PDL, which is then converted by the rest of the system to a clean voice of the same speaker for various live communications, such as Zoom and Google Meet. Similarly, for offline noise removal of speech recordings, one embodiment may follow the same process. In all these applications, the clean speech samples of the speaker intended for noise removal can be collected anytime his or her signals are clean as measured by SNR values. When multiple speakers are present, a speaker identification module can be used to segment the speech stream or recording into sections where a single speaker is present. Clean speech can also be retrieved as described in variant A for noise removal, and closely matching speaker embeddings can be obtained with short speech segments from the live streams or offline recordings.
According to the cloud based system as shown, the cloud-based training processes may be carried out by first providing a set of speech training data with transcriptions and a high quality speech recognition trainer, such as Kaldi. The speech recognition system is then trained offline with speech data from many speakers and augmented with the speech data from the user. In the case of earbud related embodiments, the speech data are mostly collected from the vibration sensors. An intermediate acoustic model and decoder are generated from the trained speech recognition system to produce an intermediate linguistic representation given the speech features in the data collected from the many speakers used for training the speech recognition system. The input representation for the acoustic model includes MFCC (Mel-Frequency Cepstral Coefficients), and the output intermediate representations focus on speaker-independent linguistic content, e.g., Phonetic Pieces, Phonetic Posteriorgrams (PPG), Graphemes, or encoded bottleneck features, instead of speaker-dependent characteristics.
The processing step 400 in
An entire speech recognition system is trained on speech data (processing step 412) from many speakers offline. The training may start with speech data from microphones and then be adapted with vibration sensor data. The speech data is processed by processing step 415 to get features, such as MFCC, which may mix full and partial frequency bands (MB). Speech data from the user of the system is augmented when available. The features are then trained by processing step 420 based on the linguistic model of Processing step 440, converted from annotated speech data, lexical data, and textual data. The linguistic model of Processing step 440 returns a PDL representation of the references in the annotated speech data. As a result of the training, the SI-PDL model of processing step 270 is produced.
This trainer of processing step 750 is used during both the base model training phase and the enrollment phase. During the base model training, the high quality clean speech training data may contain many different speakers or speakers of similar voices, while during the enrollment phase, only the speech data from the user of the system is used to ensure that the resulting speech sounds like the speaker. It may be located in the hub or in the cloud with the training data of aligned acoustic and vibration signals collected during the adaptation phase.
This trainer of processing step 750 is used in a quiet condition as indicated by the SNR value, for example, higher than 20 dB. The SNR estimation is given in processing step 302 of
The MCEP trainer of processing step 760 may be realized via an architecture of one or multiple layers of LSTMs or transformers with the above input and output, and produces a corresponding model, processing step 770.
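A minimal, non-limiting sketch of one such LSTM-based realization is given below: the input is a PDL (e.g., PPG) sequence and the output is a full band MCEP sequence. All dimensions and the L1 training loss are illustrative assumptions rather than specifics of processing step 760.

```python
# Illustrative LSTM mapping from PDL vectors (e.g., PPGs) to full band MCEP frames.
# All sizes and the training loss are assumed values for this sketch.
import torch
import torch.nn as nn

class PDL2MCEP(nn.Module):
    def __init__(self, pdl_dim: int = 144, mcep_dim: int = 60, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(pdl_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, mcep_dim)

    def forward(self, pdl_seq: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(pdl_seq)           # (batch, frames, hidden)
        return self.out(h)                 # (batch, frames, mcep_dim)

def train_step(model, optimizer, pdl_seq, target_mcep):
    """One adaptation step on a batch of aligned PDL and MCEP sequences."""
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(pdl_seq), target_mcep)
    loss.backward()
    optimizer.step()
    return loss.item()
```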
One may follow a similar procedure for trainer 705A to get the Mel-Spectrogram model of Processing step 175A.
This module Processing step 750PR is one embodiment of the MCEP adapter that takes the input from both accelerometer and microphones and combines them based on the SNR value before sending to the SI PDL decoder (recognizer). It is used for the enhanced enrollment phase. It may be located either in the hub or in the cloud, and trained offline or online in real time.
During the enhanced enrollment mode, the output full band MFCC (FB MFCC) and partial band MFCC (PB MFCC) from the two duplicated Processing steps 415 are combined in Processing step 780PR based on the SNR level to obtain better features as input to Processing step 200 (the SI PDL decoder) for a more accurately recognized PDL representation (CB PDL). Noises from different scenarios, such as street, cafe, room, news broadcast, music playing, and in-car, are collected offline or during real time use when the speaker is neither talking nor in the on-call mode. The noises are added to the speech from the microphones with known SNR.
The combiner (Processing step 780PR) may combine the FB MFCC and PB MFCC values linearly or with another function, with the weight as a function of SNR: the higher the SNR value, the heavier the weight on the FB MFCC (the channel with added noises). One may even train a neural network for better PDL recognition results. As a result, the Processing step 270PR (SI PDL model) can be improved.
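The following sketch shows one possible linear combination with an SNR-dependent weight; the mapping of SNR in dB to a weight in [0, 1] via a clipped linear ramp is an assumption for illustration only.

```python
# Illustrative SNR-weighted linear combination of full band and partial band MFCCs
# (Processing step 780PR). The SNR-to-weight mapping is an assumed example.
import numpy as np

def combine_mfcc(fb_mfcc: np.ndarray, pb_mfcc: np.ndarray, snr_db: float,
                 lo: float = 0.0, hi: float = 20.0) -> np.ndarray:
    # Map SNR in dB to a weight in [0, 1]: higher SNR -> heavier weight on FB MFCC.
    w = np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0)
    return w * fb_mfcc + (1.0 - w) * pb_mfcc
```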
When more accurate PDL output from Processing step 200 is used as input to MCEP model trainer (the Processing step 760), the MCEP model (Processing step 770PR) is better trained.
This module, Processing step 750PS, is another embodiment of the adapter that takes the input from both the accelerometer and microphones and sends it to two duplicated SI PDL decoders (i.e., the two processing steps 200) with respective models (processing step 270PS for the noise-added microphone channel, and processing step 270 for the vibration sensor channel). Their PDL results are combined in processing step 780PS and sent to the MCEP trainer (Processing step 760) for adaptation, resulting in a better MCEP model (Processing step 770PS). It may be located either in the hub or in the cloud, and trained offline or online in real time.
During the enhanced enrollment mode, the output full band MFCC (FB MFCC) and partial band MFCC (PB MFCC) from the two identical processing steps 415 are sent to the two identical processing steps 200 with their respective models: processing step 270PS for FB MFCC, and processing step 270 for PB MFCC. Their results are combined by PDL combiner (Processing step 780PS) based on SNR level for a more accurately recognized PDL representation (CB PDL), similar to the Pre-PDL version. The noises are collected and added to the speech from microphones just like in
The PDL combiner (Processing step 780PS) may combine the FB PDL and PB PDL values linearly or with another function, with the weight as a function of SNR and with normalization to a probabilistic distribution: the higher the SNR value, the heavier the weight on the FB PDL (the channel with added noises). Alternatively, one may also train a neural network for obtaining better PDL recognition results (CB PDL).
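A hedged sketch of such a combiner is shown below; the two PDL posteriors are interpolated with an SNR-dependent weight and re-normalized so each frame remains a probability distribution. The SNR-to-weight mapping is again an illustrative assumption.

```python
# Illustrative SNR-weighted combination of FB and PB PDL posteriors
# (Processing step 780PS), renormalized to a probability distribution per frame.
import numpy as np

def combine_pdl(fb_pdl: np.ndarray, pb_pdl: np.ndarray, snr_db: float,
                lo: float = 0.0, hi: float = 20.0) -> np.ndarray:
    # fb_pdl, pb_pdl: (num_frames, num_classes), each row summing to 1.
    w = np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0)
    cb = w * fb_pdl + (1.0 - w) * pb_pdl
    # Renormalization guards against numerical drift and is required if a
    # non-linear combination (e.g., a weighted geometric mean) is used instead.
    return cb / cb.sum(axis=1, keepdims=True)
```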
When more accurate PDL output from Processing step 780PS is used as input to MCEP model trainer (the Processing step 760), the MCEP model (Processing step 770PS) is better trained.
The processing step 500 takes the paired output from processing step 130 and processing step 150, and establishes the mapping of the features from the partial band to the full band.
As one embodiment, the F0 adapter takes the log of F0 from processing step 130 and the log of F0 from processing step 150 for the corresponding frames at time t, and computes the means and variances of the respective log(F0) values from the same set of speech. Given X(t), the log(F0) of a new frame at time t from the partial band signal, the log(F0) of its corresponding full band frame, Y(t), is estimated as:
Y(t) = (X(t) − u(X)) * d(Y)/d(X) + u(Y), where u(·) denotes the mean and d(·) the standard deviation of the respective log(F0) values.
Alternatively, this adapter can also be estimated by a neural network of one or more layers with X as the input and Y as the output.
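A brief sketch of the linear log(F0) transform above (mean and standard deviation matching between the partial band and full band statistics) follows; the handling of unvoiced frames is a simplification assumed for illustration.

```python
# Illustrative log(F0) adapter: map partial band F0 statistics to full band statistics.
# Unvoiced frames (F0 == 0) are passed through unchanged in this simplified sketch.
import numpy as np

def adapt_f0(f0_pb: np.ndarray, stats_pb: tuple, stats_fb: tuple) -> np.ndarray:
    # stats_* = (mean, std) of log(F0) over voiced frames, collected beforehand.
    u_x, d_x = stats_pb
    u_y, d_y = stats_fb
    f0_fb = np.zeros_like(f0_pb, dtype=float)
    voiced = f0_pb > 0
    log_x = np.log(f0_pb[voiced])
    f0_fb[voiced] = np.exp((log_x - u_x) * d_y / d_x + u_y)
    return f0_fb
```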
One embodiment of the VUV adapter may use a threshold of the probability of being a voiced or unvoiced frame for the partial band and full band signals, and establish a similar mapping. In addition, the mapping may use the neighboring PDL info. In one embodiment, the per-frame probability calculation can be made based on the power of the signals as well as the zero-crossing rate.
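One hedged way to derive the per-frame voiced/unvoiced probability from signal power and zero-crossing rate, as mentioned above, is sketched below; the logistic scoring, scaling constants, and threshold are assumptions and not prescribed by this description.

```python
# Illustrative per-frame voiced/unvoiced decision from frame power and zero-crossing rate.
# The scoring heuristic, constants, and threshold are assumed values for this sketch.
import numpy as np

def voiced_probability(frame: np.ndarray, power_ref: float = 1e-4) -> float:
    power = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # zero crossings per sample
    # Voiced frames tend to have high power and a low zero-crossing rate.
    score = np.log10(power / power_ref + 1e-12) - 5.0 * zcr
    return float(1.0 / (1.0 + np.exp(-score)))              # logistic squashing

def is_voiced(frame: np.ndarray, threshold: float = 0.5) -> bool:
    return voiced_probability(frame) > threshold
```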
One embodiment of the AP adapter may use AP value distributions of the partial band and full band signals to obtain a scaling function or neural network for the mapping from partial to full band values.
These mappings form the adapter models of processing step 502.
Processing step 725A is almost the same as processing step 500 without the AP mapping.
As another embodiment of Processing step 500, each adapter in Processing step 500S takes SNR as an additional input. The effect of the SNR values on each mapping in the model Processing step 502S depends on how robust the accelerometer is against the noise of different levels. For each mapping in Processing step 500S, one may train a neural network with SNR as its additional input and fine-tune the network. In real time use, the SNR value is estimated in the module described in Processing step 302 of
SNR=10*log10 p(speech)−10*log10 p(non-speech)
where p(x) is the averaged power of signal x.
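A compact sketch of this SNR estimate over speech and non-speech frames follows; which frames count as speech is assumed to come from the talking/not-talking decision described elsewhere in this disclosure, and the small epsilon is added only to avoid a log of zero.

```python
# Illustrative SNR estimate in dB from averaged powers of speech and non-speech frames.
import numpy as np

def estimate_snr_db(speech_frames: np.ndarray, noise_frames: np.ndarray) -> float:
    p_speech = np.mean(speech_frames ** 2) + 1e-12   # averaged power while talking
    p_noise = np.mean(noise_frames ** 2) + 1e-12     # averaged power while not talking
    return 10.0 * np.log10(p_speech) - 10.0 * np.log10(p_noise)
```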
Processing step 302 is an event trigger that detects the speech by the speaker wearing the device and decides: whether the speaker gives a voice command, or which SNR gated-signal to be sent to the computing hub. It is used during live communication in real time applications.
Processing step 3021 and processing step 3022 are feature extractors used for computing the SNR (processing step 3023) as well as trigger word detection (processing step 3024). Processing step 3021 extracts features from microphone(s), including signal energy level per frame for SNR computation as well as MFCCs and others for trigger word detection. Processing step 3022 extracts features from the accelerometers, including whether the speaker is talking or not talking in the current frame, as well as MFCCs and others for trigger word detection.
Processing step 3023 estimates the SNR as: (Ett−En)/En, where Ett is the energy while the speaker is talking, and En the energy while not talking, over one or more frames.
Processing step 3024 may be a deep learning model that detects the trigger word with the output from processing step 3021, processing step 3022, and the SNR value when the state is “off-call” from processing step 3025. If a voice command is detected, the command is returned and passed to the computing hub.
Processing step 3025 keeps track of the status from the computing hub: whether it is “on-call” or “off-call”, and communicates it to the other components in the Processing step 302 to decide which signal to pass to the hub. When the state is the on-call mode, if the SNR value is higher than a pre-set threshold, the microphone signal is sent out; if the SNR is below a pre-set threshold, the accelerometer signal is sent. The “on-call” state is when the speaker is talking to another person over the phone. The “off-call” state is when the speaker is not talking to anyone over the phone, but issuing a command to the computing hub, such as the phone, smart watch, or other wearables.
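The gating rule described here could look like the following hedged sketch; the threshold value and the channel handling are illustrative assumptions only.

```python
# Illustrative SNR gating for the on-call mode (Processing step 3025):
# pick the microphone channel at high SNR, otherwise the accelerometer channel.
# The threshold value is an assumed example.
SNR_THRESHOLD_DB = 15.0

def select_channel(state: str, snr_db: float, mic_signal, accel_signal):
    if state == "on-call":
        return mic_signal if snr_db >= SNR_THRESHOLD_DB else accel_signal
    return None   # in the off-call mode, only detected voice commands are forwarded
```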
Processing step 3020 is another embodiment of the event trigger processing step 302 coupling with processing step 750PR (in
Processing steps 3021, 3022, 3023, and 3024 have the same functionalities as their counterpart modules in processing step 302. Processing step 30250 is similar to processing step 3025 as it keeps track of the status from the computing hub and when the state is “off-call” it runs processing step 3024 to detect and send voice commands (VC) to the hub. However, when the state is “on-call”, it sends signals from both microphone(s) and accelerometer as well as the estimated SNR to the computing hub.
Therefore, with the functionality of the recoverer, clean personal speech signals are recovered from the signals received from the vibration sensors if the SNR is less than or equal to a thresholded SNR value. When the signals are passed to the recoverer, they go through multiple steps: the feature extraction component for obtaining PB features, such as MFCC; the PDL decoder for obtaining an intermediate representation of mostly linguistic content, such as PB PDLs; the MCEP decoder that maps the intermediate representation to a FB MCEP sequence; and finally the vocoder, which takes the speaker-dependent features adapted from the output of the feature extraction component, e.g., F0, AP, and the Voiced/Unvoiced indicator, together with the FB MCEP, to synthesize a personal speech wave.
The base recoverer module (processing step 600) recovers the clean speech in real time given the speech signal from the same speaker via the accelerometer, regardless of how noisy the speaking environment is. This module can be located either in the hub or in the cloud.
The base module processing step 600 is coupled with processing step 302 in
The processing steps 150, 200, 700, and 160 are the same as the ones described in processing step 100 (
This Pre-PDL recovery module, processing step 600PR, performs the same functionality as processing step 600, taking additional SNR info and the speech signal from the microphone(s) for more accurate PDL decoding, so that it may make use of the speech signal from the microphone(s) when the SNR is not too low, instead of making a binary thresholded decision.
This module processing step 600PR couples with the pre-PDL MCEP adapter (processing step 750PR in
For the purpose of providing technical references, the following is reference information for microphones and vibration sensors. Namely, for microphones, there are MEMS microphones and piezoelectric sensors; for vibration sensors, there are accelerometers, laser-based vibration sensors, and fiber optical vibration sensors; and for brainwave sensors, there are N1 sensors from Neuralink and Electroencephalography (EEG) sensors that can be implemented in the systems of this invention.
Although the present invention has been described in terms of the presently preferred embodiments, it is to be understood that such disclosure is not to be interpreted as limiting. Various alterations and modifications will no doubt become apparent to those skilled in the art after reading the above disclosure. Accordingly, it is intended that the appended claims be interpreted as covering all alterations and modifications as fall within the true spirit and scope of the invention.