The present application hereby incorporates the following documents by reference in their entireties: U.S. patent application Ser. No. 09/641,157, filed Aug. 17, 2000; and U.S. Pat. No. 6,795,807, issued on Sep. 21, 2004.
Artificial speech can be used to generate human speech from one or more inputs (e.g., written text, in response to a received spoken command, etc.). Artificial speech may be used to replace actual speech in a variety of contexts. For example, artificial speech may be used to generate spoken outputs so that text (e.g., text displayed on a screen) may be communicated to a visually impaired individual who could not otherwise read the text. Similarly, artificial speech may allow an individual with a speech impairment (e.g., an individual with amyotrophic lateral sclerosis (ALS); a laryngeally impaired individual, such as a laryngectomee; an individual who is hoarse; an individual with a speech disorder; etc.) to communicate with another individual when they would otherwise be unable to. Likewise, artificial speech may allow an individual who communicates using specific terminology (e.g., in a certain language) to effectively communicate with another individual who is not fluent in that terminology (e.g., by translating from one language to another). Artificial speech, therefore, is broadly applicable to a number of fields and may be used for a number of purposes.
Additionally, machine learning is a field of computing that is widely used across a variety of technical fields. In machine learning, a model is trained using one or more sets of “training data.” The training of the machine-learned model may take place in a supervised fashion (i.e., “supervised learning”), an unsupervised fashion (i.e., “unsupervised learning”), or in a semi-supervised fashion (i.e., “semi-supervised learning”) depending on whether the training data is labeled with a specified classification. After the training data is fed into a machine-learning training algorithm, a machine-learned model may be output by the machine-learning algorithm.
The trained machine-learned model may then be applied to a set of “test data” in order to generate an output. In this way, the trained machine-learned model may be applied to new data for which an output is not yet known. For example, a machine-learned model may include a classifier that is trained to determine whether an image includes a human or an animal. The machine-learned model may have been trained using labeled images of humans and labeled images of animals. Thereafter, at runtime, a new image (where it is unknown whether the new image includes a human or an animal) may be input into the machine-learned model in order to determine whether the image includes a human or an animal. Using the machine-learned model to make a prediction about test data is sometimes referred to as the “inference stage.”
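By way of a simplified illustration only (the classifier, feature values, and library choice below are assumptions made for this sketch and are not part of the disclosure), the training and inference stages described above might look like the following:

```python
# Minimal sketch of supervised training followed by inference, using
# scikit-learn as an assumed, illustrative library choice.
from sklearn.linear_model import LogisticRegression

# Labeled training data: feature vectors with known classes
# (e.g., 0 = "human", 1 = "animal"); the values are placeholders.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)       # training stage

# Inference stage: predict the class of a new, unlabeled example.
X_test = [[0.85, 0.15]]
print(model.predict(X_test))      # -> [0], i.e., "human"
```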
There exist a variety of types of machine-learned models. Each of the different types of machine-learned model may be used in various contexts in light of the strengths and weaknesses of each. One popular machine-learned model is an artificial neural network (ANN), which includes layers of artificial neurons. The layers of artificial neurons in an ANN are trained to identify certain features of an input (e.g., an input image, an input sound file, or an input text file). Each artificial neuron layer may be built upon sub-layers that are trained to identify sub-features of a given feature. ANNs having many neuron layers are sometimes referred to as “deep neural networks.”
The specification and drawings disclose embodiments that relate to devices for real-time speech output with improved intelligibility.
In a first aspect, the disclosure describes a device. The device includes a microphone configured to capture one or more frames of unintelligible speech from a user. The device also includes an analog-to-digital converter (ADC) configured to convert the one or more captured frames of unintelligible speech into a digital representation. In addition, the device includes a computing device. The computing device is configured to receive the digital representation from the ADC. The computing device is also configured to apply a machine-learned model to the digital representation to generate one or more frames with improved intelligibility. Further, the computing device is configured to output the one or more frames with improved intelligibility. Additionally, the device includes a digital-to-analog converter (DAC) configured to convert the one or more frames with improved intelligibility into an analog form. Still further, the device includes a speaker configured to produce sound based on the analog form of the one or more frames with improved intelligibility.
In a second aspect, the disclosure describes a method. The method includes capturing, by a microphone, one or more frames of unintelligible speech from a user. The method also includes converting, by an analog-to-digital converter (ADC), the one or more captured frames of unintelligible speech into a digital representation. In addition, the method includes receiving, by a computing device from the ADC, the digital representation. Further, the method includes applying, by the computing device, a machine-learned model to the digital representation to generate one or more frames with improved intelligibility. Additionally, the method includes outputting, by the computing device, the one or more frames with improved intelligibility. Yet further, the method includes converting, by a digital-to-analog converter (DAC), the one or more frames with improved intelligibility into an analog form. Even further, the method includes producing, by a speaker, sound based on the analog form of the one or more frames with improved intelligibility.
In a third aspect, the disclosure describes a non-transitory, computer-readable medium having instructions stored thereon. The instructions, when executed by a processor, cause the processor to execute a method. The method includes receiving a digital representation from an analog-to-digital converter (ADC). The digital representation was generated by the ADC from one or more frames of unintelligible speech from a user that were captured by a microphone. The method also includes applying a machine-learned model to the digital representation to generate one or more frames with improved intelligibility. In addition, the method includes outputting the one or more frames with improved intelligibility to a digital-to-analog converter (DAC). The DAC is configured to convert the one or more frames with improved intelligibility into an analog form. The DAC is also configured to output the analog form to a speaker that is configured to produce sound based on the analog form.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the figures and the following detailed description.
Example methods and systems are described herein. Any example embodiment or feature described herein is not necessarily to be construed as preferred or advantageous over other embodiments or features. The example embodiments described herein are not meant to be limiting. It will be readily understood that certain aspects of the disclosed systems and methods can be arranged and combined in a wide variety of different configurations, all of which are contemplated herein.
Furthermore, the particular arrangements shown in the figures should not be viewed as limiting. It should be understood that other embodiments might include more or less of each element shown in a given figure. In addition, some of the illustrated elements may be combined or omitted. Similarly, an example embodiment may include elements that are not illustrated in the figures.
Throughout this disclosure, the phrases “unintelligible speech,” “snippet of unintelligible speech,” “frames of unintelligible speech,” “sound corresponding to unintelligible speech,” etc. may be used. It is understood that these terms are to be construed broadly and are primarily used to contrast with phrases like “speech with improved intelligibility,” “snippet of speech with improved intelligibility,” “frames of speech with improved intelligibility,” etc. As such, a phrase such as “unintelligible speech” does not necessarily connote that the speech would not or could not be understood at all by any listener. Instead, the phrase “unintelligible speech” is merely meant to describe speech which may be improved using the devices and methods herein. Said a different way, while the devices and methods herein may transform “unintelligible speech” to “speech with improved intelligibility,” that does not necessarily mean that the “unintelligible speech” was completely incomprehensible to begin with. Thus, the phrase “unintelligible speech” may sometimes be used as a stand-in, where context dictates, for something akin to “speech with low intelligibility,” “speech which is difficult, but not impossible, to parse,” “speech that is difficult to understand,” etc.
There have been attempts in the past to improve impaired speech (e.g., for laryngectomees) using artificial speech techniques. It has proven difficult, however, to capture sufficient information about the specific speaker in order to recreate their own voice. In some cases, devices may be used to create a simulated glottal pulse. Such devices may include a manual ability to change frequency. Further, a small loudspeaker may be mounted in the mouth of an individual with impaired speech (e.g., on a denture of the individual) in order to provide for artificial speech. Additionally, some devices may vibrate the neck (e.g., in a controllable way so as to enable the user to change the pitch of the generated speech manually). Such alternative devices, however, may sound overly mechanical/unnatural. Even when a user manually changes the pitch, the sound produced may still not resemble the user's natural voice.
Various methods for alaryngeal speech enhancement (ASE) have been proposed (e.g., vector quantization and linear multivariate regression for parallel spectral conversion, the use of an air-pressure sensor to control F0 contours in addition to Gaussian mixture model based voice conversion, one-to-many eigenvoice conversion for ASE, a hybrid approach combining spectral subtraction with eigenvoice conversion, etc.). Many of these techniques, however, rely on statistical voice conversion methods based on Gaussian mixture models that are subject to over-smoothing of the converted spectra, which results in murky-sounding speech. Machine learning has also been applied to the problem of artificial speech. For example, deep neural network-based voice conversion for ASE has been proposed, but such techniques are also subject to over-smoothing. However, generative adversarial network (GAN)-based voice conversion might not suffer from the over-smoothing problem because GAN-based voice conversion may be capable of learning the gradients of the source and target speech distributions. As such, voice-conversion and speech-enhancement GANs may improve speech for both speakers exhibiting one or more speech impairments (e.g., ASE) and speakers without one or more speech impairments.
Example embodiments herein include devices and methods used to produce artificial speech from one or more snippets of unintelligible speech. The artificial speech output from devices herein may exhibit improved prosody and/or intelligibility when compared with the alternative techniques described above. Further, and crucially, the techniques described herein may allow for the conversion from snippets of unintelligible speech to snippets with improved intelligibility in “real-time” (e.g., less than about 50 ms, less than about 45 ms, less than about 35 ms, less than about 30 ms, less than about 25 ms, less than about 20 ms, less than about 15 ms, less than about 10 ms, etc.). Being able to reproduce artificial speech in real-time may enhance communication in a variety of ways. For example, eliminating delays in the conversion from unintelligible input to intelligible output may prevent adverse feedback effects on the speaker (e.g., due to a delay between the time at which they speak and when they hear themselves speak), may allow conversation to flow more naturally (e.g., when one speaker interjects into a conversation, another speaker can more quickly identify that such an interjection is occurring, which prevents speakers from speaking over one another), etc. In order to perform conversions in “real-time,” example embodiments described herein may include one or more computing devices (e.g., with one or more processors therein) that apply one or more machine-learned models in order to generate outputs with improved intelligibility. For example, a GAN may be applied to the snippet of unintelligible speech in order to generate the output with improved intelligibility.
The devices and methods described herein may be used to produce artificial speech with improved intelligibility for users with a variety of speech impairments (e.g., users with ALS, users who can only produce alaryngeal speech due to a laryngectomy, etc.). However, example embodiments may also be used to enhance intelligibility in other contexts (e.g., to improve intelligibility when a speaker has a strong accent or a particular dialect with which the listener is not familiar).
Some embodiments described herein include devices used to capture an input speech from a user, transform the input speech into speech with greater intelligibility, and then output the speech with greater intelligibility. Such devices may include a microphone, an analog-to-digital converter (ADC), a computing device, a digital-to-analog converter (DAC), an amplifier, and/or a speaker. For example, a device may include a microphone or other input device configured to capture a snippet (e.g., a frame or a series of frames) of unintelligible speech from a user. The snippet of unintelligible speech may then be sent to an ADC in order to convert the snippet of unintelligible speech into a digital form. In some embodiments, the digital form may include a mel-spectrogram (e.g., generated using a Fourier transform of the input snippet). Next, the digital form of the unintelligible speech may be transferred to a computing device for processing. Thereafter, the computing device (e.g., a processor of the computing device executing instructions stored within one or more memories of the computing device) may apply a machine-learned model (e.g., a GAN) to the digital form of the snippet of unintelligible speech in order to produce a snippet with improved intelligibility. The output of the machine-learned model may be provided in a similar form to the input. For example, the machine-learned model may be used to transform a mel-spectrogram with lower intelligibility into a mel-spectrogram with improved intelligibility. The output snippet with improved intelligibility (e.g., in a digital form) may be transferred from the computing device to a DAC (e.g., a DAC that includes a vocoder), where the output snippet is then converted to an analog form. The analog signal corresponding to the analog form of the output snippet with improved intelligibility may then be sent to an amplifier to amplify the analog signal. Finally, the amplified signal may be emitted into the surrounding environment using a speaker. As mentioned above, this entire process may occur in real-time. In this way, speech with enhanced intelligibility can be communicated rapidly.
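For illustration only, the following is a minimal sketch of the capture-to-playback flow described above. It assumes Python with the sounddevice and librosa libraries (one possible choice, not required by this disclosure), and the enhancement model and vocoder are placeholders rather than the actual machine-learned model:

```python
# Sketch of the capture -> digitize -> enhance -> synthesize -> play flow
# described above. The enhancement model and vocoder are placeholders, and
# all parameter values are assumptions for illustration.
import sounddevice as sd
import librosa

SR = 16000                      # sample rate (assumed)
FRAME_LEN = SR * 10 // 1000     # ~10 ms frames, within the 8 ms-12 ms range above

def process_snippet(snippet, enhancement_model, vocoder):
    # The ADC step is handled by the audio interface; `snippet` is already digital.
    mel = librosa.feature.melspectrogram(y=snippet, sr=SR, n_fft=400,
                                         hop_length=160, n_mels=40)
    improved_mel = enhancement_model(mel)   # e.g., a GAN generator
    return vocoder(improved_mel)            # vocoder/DAC inversion to a waveform

# Capture three ~10 ms frames, "enhance" them, and play the result.
snippet = sd.rec(3 * FRAME_LEN, samplerate=SR, channels=1, blocking=True).ravel()
out = process_snippet(
    snippet,
    enhancement_model=lambda m: m,          # identity stand-in for the real model
    vocoder=lambda m: librosa.feature.inverse.mel_to_audio(
        m, sr=SR, n_fft=400, hop_length=160))
sd.play(out, SR, blocking=True)
```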
As is illustrated based on the description above, example embodiments are configured to create alaryngeal speech enhancement (ASE) using one or more GANs. Such enhancement may occur using one-to-one, voice-unpaired conversions. In order to obtain training data to train the machine-learning models described herein (e.g., GANs), recordings from various example users may be captured. For example, both male and female laryngectomees capable of tracheo-esophageal puncture (TEP) speech, electronic larynx speech, and esophageal speech may be recorded (e.g., via one or more lavalier microphones using a web interface). For instance, example users may hold one or more microphones close to their lips to lessen mechanical noise (e.g., in the case of electronic larynx speech or esophageal speech) and be provided with prompts (e.g., 50 prompts) that they attempt to reproduce (e.g., randomly selected from one or more speech datasets, such as the ARCTIC dataset (“CMU ARCTIC databases for speech synthesis;” John Kominek, et al.; published in 2003) or the VCTK dataset (“CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit;” Junichi Yamagishi, et al.; The Centre for Speech Technology Research University of Edinburgh; Nov. 13, 2019)). The speakers' attempted reproductions may be recorded for use as training data. Additionally or alternatively, pre-operative recordings (e.g., prior to a laryngectomy) may be used to train the model (e.g., to train one or more discriminators in the case of a machine-learned model that includes a GAN). Some of the sample users may also be capable of multiple types of speech (e.g., both electronic larynx speech and esophageal speech). As such, multiple recordings of a single user using different types of speech may be captured.
As described herein, the machine-learned model may be trained to perform non-parallel voice conversion. Such a conversion can be reframed as an unpaired image-to-image translation that uses mel-spectrograms in place of images. For example, when using one or more GANs, one or more generators G may learn a mapping from the source speaker distribution, X, to the target speaker distribution, Y, such that G(X) is indistinguishable from Y to one or more discriminators D. One example unpaired image-to-image translation GAN technique is CycleGAN (“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks;” Jun-Yan Zhu, et al.; arXiv:1703.10593; originally published on Mar. 30, 2017), which employs an inverse mapping, F(Y), and an associated cycle-consistency loss that encourages F(G(X))≈X. Example embodiments described herein could be implemented using CycleGAN, for example.
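As a hedged illustration of the cycle-consistency idea only (assuming PyTorch and tiny placeholder layers in place of actual CycleGAN generators), the constraint that F(G(X)) approximately reconstructs X could be expressed as:

```python
# Sketch of the cycle-consistency constraint F(G(x)) ~= x described above,
# using PyTorch as an assumed framework. G and F_inv stand in for the forward
# and inverse generator networks; real CycleGAN generators are far larger.
import torch
import torch.nn as nn
import torch.nn.functional as nnf

n_mels = 80
G = nn.Linear(n_mels, n_mels)       # placeholder for G: X -> Y
F_inv = nn.Linear(n_mels, n_mels)   # placeholder for F: Y -> X

x = torch.randn(8, n_mels)          # a batch of source-speaker mel frames
cycle = F_inv(G(x))                 # map to the target domain and back
cycle_consistency_loss = nnf.l1_loss(cycle, x)   # encourage F(G(x)) ~= x
cycle_consistency_loss.backward()
```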
As described above, example embodiments represent improvements over alternative solutions because the conversion is done in real-time. For example, speech systems that operate in a sequential manner do not create prosody until an entire sentence is divided into elements of speech such as words and phonemes. Such schemes may rely on pre-programmed templates to create prosody. Use of pre-programmed templates might not be applicable in a real-time recreation of speech because such schemes may require an understanding of the word and context to be applied (e.g., thereby preventing real-time output, since the speaking of many sentences, phrases, or even words occurs over much longer timescales than “real-time”).
The following description and accompanying drawings will elucidate features of various example embodiments. The embodiments provided are by way of example, and are not intended to be limiting. As such, the dimensions of the drawings are not necessarily to scale.
As described herein, the speech-improvement device 110 may use a machine-learned model in order to reproduce an artificial voice that sounds similar to a voice of the user 102 prior to the user 102 developing one or more conditions (e.g., ALS) and/or prior to the user 102 experiencing one or more speech impairments (e.g., as a result of a laryngectomy). Further, the speech-improvement device 110 may allow the user 102 to readily alter the prosody of the artificial speech (e.g., using the machine-learned model) in order to convey calmness, levity, anger, friendship, authority, etc. through the timbre of their voice.
Each of the components of the speech-improvement device 110 may communicate with one or more other components within the speech-improvement device 110 using one or more communication interfaces 112. For example, the communication interfaces may include one or more wireless communication protocols (e.g., IEEE 802.15.1—BLUETOOTH®, IEEE 802.11—WIFI®, radio-frequency communication, infrared communication, etc.) and/or one or more wired protocols (e.g., ETHERNET, Universal Serial Bus (USB), communication via traces on a printed circuit board (PCB), communication over optical fiber, etc.). In some embodiments, as illustrated, the communication interface 112 of the speech-improvement device 110 may be a single communication interface 112 through which each component of the speech-improvement device 110 is interconnected and can communicate with other components of the speech-improvement device 110. In other embodiments, however, separate communication channels may be used to interconnect various components of the speech-improvement device 110. For example, the microphone 130 may be connected wirelessly to the ADC 140 using BLUETOOTH®, but the ADC 140 may be connected to the computing device 150 over a USB connection. Other communication types are also possible and are contemplated herein.
Further, in some embodiments, there may be multiple copies of one or more of the components and/or one or more of the components may be integrated into (or not integrated into) one of the other components in a different way than is illustrated. For example, in some embodiments there may be multiple (e.g., and different types of) power sources. For example, the speaker 190 and/or the microphone 130 may have a different power source 170 than the computing device 150. Additionally, the vocoder 162 may not be integrated into the DAC 160 in some embodiments, but instead a standalone component or integrated into the computing device 150, among other possibilities. In still other embodiments, the amplifier 180 could be integrated into the computing device 150 or into the speaker 190. Still further, one or more of the components of the speech-improvement device 110 may not be physically connected to one or more of the other components of the speech-improvement device 110 (e.g., the microphone 130 may be located remotely from the other components of the speech-improvement device 110 and connected by BLUETOOTH®). Lastly, in some embodiments, the speech-improvement device 110 may not include all of the components illustrated in
The microphone 130 may include a mechanical device that transduces sound (e.g., variations in air pressure) into a signal (e.g., an electrical signal, such as an electrical signal corresponding to variations in current or voltage over time). The microphone 130 may be used to capture one or more snippets of unintelligible speech. The microphone 130 may capture these snippets in one or more frames. For example, a frame may correspond to a sound recording that is 8 ms-12 ms in length and a snippet of unintelligible speech may be three frames in length. The length of time for a frame and/or the number of frames in a snippet may be set by the computing device 150 and communicated to the microphone 130, for example. Alternatively, the microphone 130 may continuously capture sound that the computing device 150 later breaks up into frames/snippets in order to analyze. Still further, in some embodiments, the time duration of each frame and/or the number of frames in a snippet may be held constant, while in other embodiments these values may be modulated over time (e.g., by the computing device 150). In various embodiments, the microphone 130 may include a condenser microphone, a dynamic microphone, a ribbon microphone, a piezoelectric microphone, a microelectromechanical systems (MEMS) microphone, etc. In some embodiments, the speech-improvement device 110 may also include one or more user inputs. For example, the speech-improvement device 110 may include one or more buttons, one or more switches, one or more touchscreens, etc. Such user inputs may be used to activate or otherwise control the microphone 130 and/or one or more other components of the speech-improvement device 110 (e.g., the computing device 150). For example, a button may be pressed by a user to begin capturing unintelligible speech from the user (e.g., by powering on the microphone). In such a way, energy (e.g., stored within a battery of the power source 170) can be conserved by refraining from powering/running the device when a user is not presently talking. Additionally or alternatively, unintelligible speech that was previously captured (e.g., and stored within a memory of the speech-improvement device 110) may be selected for improvement. For example, a selection of a snippet of unintelligible speech from among a group of snippets may be made via a user interface.
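For illustration of the framing described above (frame duration and frames-per-snippet treated as configurable values; the specific numbers below are assumptions within the stated 8 ms-12 ms range), a continuous recording could be segmented as follows:

```python
# Sketch of segmenting a continuous recording into frames and snippets,
# with frame duration and frames-per-snippet as configurable parameters.
import numpy as np

def to_snippets(audio, sample_rate=16000, frame_ms=10, frames_per_snippet=3):
    frame_len = sample_rate * frame_ms // 1000
    snippet_len = frame_len * frames_per_snippet
    n = len(audio) // snippet_len
    # Drop any trailing partial snippet and reshape into
    # (num_snippets, frames_per_snippet, frame_len).
    return audio[:n * snippet_len].reshape(n, frames_per_snippet, frame_len)

audio = np.random.randn(16000)        # one second of placeholder audio
snippets = to_snippets(audio)
print(snippets.shape)                 # -> (33, 3, 160)
```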
Additionally or alternatively, in some embodiments, the microphone 130 may be replaced by another device that includes multiple or alternative input devices. For example, if the user 102 has undergone a laryngectomy, the microphone 130 may be replaced by a neck vibrator/microphone combination that assists the user 102 in generating sound waves using the neck vibrator (e.g., to simulate a glottal pulse) and then detects the generated sound waves using a microphone. In such embodiments, the neck vibrator/microphone combination may include a transmitter (e.g., a radio-frequency transmitter) that communicates with the computing device 150 in order to provide information to the computing device 150 about a frequency and/or timing of the vibration used by the neck vibrator in order to assist in the generation of sound. Likewise, the microphone 130 may be replaced or augmented by an eye-gaze input device (e.g., when the user 102 suffers from ALS). Upon capturing the one or more snippets of unintelligible speech, the microphone 130 may communicate an analog signal corresponding to the one or more snippets of unintelligible speech to one or more other components of the speech-improvement device 110. For example, the microphone 130 may communicate the one or more captured snippets of unintelligible speech to the ADC 140 and/or the computing device 150.
The ADC 140 may be used to convert an analog signal (e.g., an analog signal corresponding to one or more snippets of unintelligible speech captured by the microphone 130) into a digital signal (e.g., a series of bits that can be analyzed by the computing device 150 and/or stored within a memory of the computing device 150). For example, the ADC 140 may include a 13-bit linear pulse-code modulation (PCM) Codec-Filter (e.g., the MOTOROLA® MC145483). A digital signal produced by the ADC 140 may be encoded and/or compressed (e.g., using a codec) such that it can be more readily analyzed by the computing device 150. For example, the digital signal may be converted to and/or stored (e.g., within a memory of the computing device 150) as a .MP3 file, a .WAV file, a .AAC file, or a .AIFF file.
The computing device 150 may be used to generate a signal (e.g., a digital signal) with improved intelligibility based on an input from the microphone 130 that corresponds to a snippet of unintelligible speech (e.g., based on a digital representation of the snippet of unintelligible speech received from the ADC 140). In some embodiments, the computing device 150 may include an NVIDIA® JETSON NANO®. The JETSON NANO® may consume relatively little power during operation (e.g., thereby making the computing device 150, and the corresponding speech-improvement device 110 as a whole, relatively portable). The JETSON NANO® may also contain the software and adequate random-access memory (RAM) for performing calculations in real-time. In one example, the computing device 150 may transmit a signal (e.g., a radio-frequency signal) to a vibrator of an electronic larynx to generate a glottal pulse. In various embodiments, the computing device 150 may include a variety of components (e.g., one or more processors, one or more memories, etc.). An example embodiment of the computing device 150 is shown and described with reference to
The processor 152 may be a central processing unit (CPU), in some embodiments. As such, the processor 152 may include one or more segments of on-board memory (e.g., one or more segments of volatile memory, such as one or more caches). Additionally or alternatively, the processor 152 may include one or more general purpose processors and/or one or more dedicated processors (e.g., application specific integrated circuits (ASICs), digital signal processors (DSPs), graphics processing units (GPUs), tensor processing units (TPUs), network processors, etc.). For example, the processor 152 may include a DSP configured to convert a digital signal (e.g., received by the computing device 150 from the ADC 140) corresponding to the one or more snippets of unintelligible speech (e.g., as captured by the microphone 130) into a mel-spectrogram or some other representation in the frequency domain. The processor 152 may also be configured to (e.g., when executing instructions stored within the data storage 154) apply a machine-learned model to the mel-spectrogram/other representation in the frequency domain in order to provide frequency information, to provide volume information, and/or to generate a digital signal having improved intelligibility.
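For illustration of the frequency-domain conversion described above, the following sketch converts a digital snippet into a log-mel-spectrogram. The librosa library and the parameter values are assumptions for this example only:

```python
# Sketch of converting a digital speech snippet to a (log-)mel-spectrogram.
import numpy as np
import librosa

sr = 16000
snippet = np.random.randn(sr // 2)    # 0.5 s of placeholder audio

mel = librosa.feature.melspectrogram(y=snippet, sr=sr,
                                     n_fft=1024, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)    # compress the dynamic range
print(log_mel.shape)                  # -> (80, num_frames)
```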
The data storage 154 may include volatile and/or non-volatile data storage devices (e.g., a non-transitory, computer-readable medium) and can be integrated in whole or in part with the processor 152. The data storage 154 may contain executable program instructions (e.g., executable by the processor 152) and/or data that may be manipulated by such program instructions to carry out the various methods, processes, or operations described herein. Alternatively, these methods, processes, or operations can be defined by hardware, firmware, and/or any combination of hardware, firmware, and software. In some embodiments, the data storage 154 may include a read-only memory (ROM) and/or a random-access memory (RAM). For example, the data storage 154 may include a hard drive (e.g., hard disk), flash memory, a solid-state drive (SSD), electrically erasable programmable read-only memory (EEPROM), dynamic random-access memory (DRAM), and/or static random-access memory (SRAM). It will be understood that other types of volatile or non-volatile data storage devices are possible and contemplated herein.
The network interface 156 may take the form of a wired connection (e.g., an ETHERNET connection) or a wireless connection (e.g., BLUETOOTH® or WIFI® connection). However, other forms of physical layer connections and other types of standard or proprietary communication protocols may be used over the network interface 156. Furthermore, the network interface 156 may include multiple physical interfaces. The network interface 156 may allow the computing device 150 to communicate with one or more devices (e.g., devices that are separate from the speech-improvement device 110). For example, the computing device 150 may communicate with an external server (e.g., over the public Internet) in order to retrieve updated firmware (e.g., which may include an improvement to the machine-learned model used by the processor 152 to generate signals having improved intelligibility) and/or store data (e.g., data captured by the microphone 130 of the speech-improvement device 110).
The input/output function 158 may facilitate interaction (e.g., using USB, WIFI®, radio-frequency, ETHERNET, etc.) between the computing device 150 (e.g., the processor 152 of the computing device 150) and other components of the speech-improvement device 110. For example, the input/output function 158 may allow the ADC 140 to provide digital signals corresponding to snippets of unintelligible speech to the computing device 150 for analysis and/or may allow the computing device 150 to provide digital signals corresponding to snippets of speech with improved intelligibility to the DAC 160 for eventual output (e.g., by the speaker 190).
The DAC 160 may be configured to convert a digital signal from the computing device 150 (e.g., generated by the computing device 150 by applying one or more machine-learned models to one or more snippets of unintelligible speech) to an analog signal (e.g., an analog signal representing sound). The digital signal from the computing device 150 may be stored in a compressed and/or encoded format. As such, the digital signal may be decompressed and/or decoded prior to the computing device 150 providing the digital signal to the DAC 160. In some embodiments, as illustrated in
The power source 170 may be configured to provide power to one or more other components of the speech-improvement device 110. In some embodiments, for example, the power source 170 may include a connection to an alternating current (AC) source (e.g., a wall socket) and/or one or more batteries (e.g., rechargeable batteries). The use of batteries in the power source 170 may allow the speech-improvement device 110 to be portable. Further, in some embodiments, the power source 170 may include a voltage regulator. The voltage regulator may be configured to modulate an input voltage (e.g., the voltage of a battery of the power source 170) to produce one or more appropriate output voltages for use by the other components of the speech-improvement device 110. Further, in embodiments where the power source 170 includes a connection to an AC source, the power source 170 may also include a converter (e.g., including a rectifier) that converts from AC to direct current (DC) to be used by the other components of the speech-improvement device 110.
The amplifier 180 may be configured to take an analog signal (e.g., an analog audio signal) and amplify it (e.g., produce another analog signal having increased magnitude). Thereafter, the speaker 190 may receive the amplified signal from the amplifier 180 and produce an audible sound using the amplified signal. In some embodiments, the speaker 190 may include an arrangement of multiple speakers within one or more housings (e.g., cabinets). For example, the speaker 190 may include one or more tweeters and/or one or more woofers in order to reproduce a wider range of frequencies.
At step 210, the method 200 may include a sound being produced that corresponds to unintelligible speech. Such a sound may be produced by a user (e.g., a user 102 of the speech-improvement device 110). For example, a laryngectomee, a user with ALS, or a user with a heavy accent may produce sound that corresponds to unintelligible speech (e.g., unintelligible as a result of a laryngectomy). It is understood that in other embodiments other sources of unintelligible speech are also possible and are contemplated herein. For example, a loudspeaker (e.g., attached to a television or a radio) may output sound corresponding to unintelligible speech (e.g., during playback of dialog within a movie). Such output sound may be garbled as a result of one or more defects in the loudspeaker, for instance.
At step 220, the method 200 may include a microphone (e.g., the microphone 130 shown and described with reference to
At step 230, the method 200 may include converting (e.g., by the ADC 140 shown and described with reference to
At step 240, the method 200 may include analyzing (e.g., by the computing device 150 shown and described with reference to
At step 250, the method 200 may include inverting the digital representation (e.g., the mel-spectrogram) from the computing device to produce an analog signal that can be used to produce sound. For example, the DAC 160 may invert the mel-spectrogram with improved intelligibility to generate an electrical signal (e.g., representing speech with improved intelligibility). Such an inversion may be performed by a vocoder (e.g., the vocoder 162) of the DAC 160, for instance. In some embodiments, the vocoder 162 may include (e.g., stored within a memory of the vocoder 162) a machine-learned model that is trained to invert mel-spectrograms into waveforms. For example, such a machine-learned model may include one or more GANs. At the end of step 250, the analog signal produced by the DAC may be passed to an amplifier (e.g., the amplifier 180 shown and described with reference to
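For illustration of the inversion step described above: the disclosure contemplates a learned (e.g., GAN-based) vocoder, whereas the sketch below uses a simpler Griffin-Lim-style inversion via librosa purely as a stand-in to show the mel-spectrogram-to-waveform step, not as the actual method:

```python
# Sketch of mel-spectrogram inversion back to a time-domain waveform.
# Griffin-Lim (via librosa) is an illustrative stand-in for the learned vocoder.
import numpy as np
import librosa

sr = 16000
audio_in = np.random.randn(sr)        # placeholder audio
mel = librosa.feature.melspectrogram(y=audio_in, sr=sr, n_fft=1024,
                                     hop_length=160, n_mels=80)

waveform = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                hop_length=160)
print(waveform.shape)
```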
At step 260, the method 200 may include amplifying the signal from the DAC and providing the amplified signal to a speaker (e.g., the speaker 190 shown and described with reference to
The machine-learned model 330 may include, but is not limited to: one or more ANNs (e.g., one or more convolutional neural networks, one or more recurrent neural networks, one or more Bayesian networks, one or more hidden Markov models, one or more Markov decision processes, one or more logistic regression functions, one or more suitable statistical machine-learning algorithms, one or more heuristic machine-learning systems, one or more GANs, etc.), one or more support vector machines, one or more regression trees, one or more ensembles of regression trees (i.e., regression forests), one or more decision trees, one or more ensembles of decision trees (i.e., decision forests), and/or some other machine-learning model architecture or combination of architectures. Regardless of the architecture used, the machine-learned model 330 may be trained to clarify consonants and/or vowels, to determine and/or enhance prosody, and/or to reconstruct a natural glottal pulse. For example, in the case of a GAN, one or more generators of the GAN may take an input signal (e.g., a digital version of a signal having low intelligibility captured from a user via a microphone) and produce an output signal (e.g., a digital version of a signal having higher intelligibility) by improving the consonant sounds, vowel sounds, prosody, and/or glottal pulse of the input signal.
As one example and as described herein, the machine-learned model 330 may include a GAN that, once trained, is configured to take one or more snippets of unintelligible speech (e.g., in a digital form, such as a sound file or a mel-spectrogram) and convert them into one or more snippets of speech with improved intelligibility (e.g., in a digital form, such as a sound file or a mel-spectrogram) that can be output (e.g., by a speaker upon the digital form being converted into an analog sound signal). It is understood that a variety of GAN architectures can be employed within embodiments herein. For example, the GANs described in CycleGAN-VC2 (“CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion;” Takuhiro Kaneko, et al.; arXiv:1904.04631; originally published on Apr. 9, 2019) and CycleGAN-VC3 (“CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion;” Takuhiro Kaneko, et al.; arXiv:2010.11672; originally published on Oct. 22, 2020) could be employed. In some embodiments, and as described further with reference to
Other potential GAN architectures may include those described in ConvS2S-VC (“ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion;” Hirokazu Kameoka, et al.; arXiv:1811.01609; originally published on Nov. 8, 2018), Gaussian mixture model-based (GMM-based) voice conversion (VC) using sprocket (“sprocket: Open-Source Voice Conversion Software;” Kazuhiro Kobayashi, et al.; Proc. Odyssey, pp. 203-210; June 2018), StarGAN-VC (“StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks;” Hirokazu Kameoka, et al.; arXiv:1806.02169; originally published on Jun. 6, 2018), AutoVC (“AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss;” Kaizhi Qian, et al.; arXiv:1905.05879; originally published on May 14, 2019), and cascaded automatic speech recognition/text-to-speech (ASR-TTS) models (“On Prosody Modeling for ASR+TTS based Voice Conversion;” Wen-Chin Huang, et al.; arXiv:2107.09477; originally published on Jul. 20, 2021). However, when applied to improving the intelligibility of input snippets related to certain users (e.g., users with ALS or users who had undergone a laryngectomy), such architectures may fail to converge when attempting to generate an output using the one or more generators and/or may provide outputs with lower intelligibility or lower prosody compared to the examples illustrated and described with reference to
The machine-learning training algorithm 320 may involve supervised learning, semi-supervised learning, reinforcement learning, and/or unsupervised learning. Similarly, the training data 310 may include labeled training data and/or unlabeled training data. Further, as described above with respect to the machine-learned model 330, a number of different machine-learning training algorithms 320 could be employed herein. If the machine-learned model 330 includes a GAN, the machine-learning training algorithm 320 may include training one or more discriminator ANNs and one or more generator ANNs based on the training data 310. The discriminator(s) may be trained to distinguish between snippets/signals (e.g., in a digital form such as a mel-spectrogram) with acceptable intelligibility (e.g., actual captured voices) and snippets/signals (e.g., in a digital form such as a mel-spectrogram) with inadequate intelligibility (e.g., garbled signals or signals indicating a heavy accent). The one or more generators, on the other hand, may be trained to produce, based on an input snippet, an output snippet/signal (e.g., in a digital form such as a mel-spectrogram) that fools the discriminator(s) (i.e., a snippet/signal for which the discriminator(s) cannot determine whether the snippet/signal has acceptable intelligibility or inadequate intelligibility). In other words, the one or more generators G may generate an output Y when applied to an input X (i.e., G(X)=Y), and the discriminator(s) may apply a further function D to the generator output Y to produce a decision Z (i.e., D(Y)=Z) that indicates whether Y appears to be a snippet/signal with acceptable intelligibility. An example generator is illustrated and described with reference to
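For illustration of the generator/discriminator interplay described above (D applied to G(X)), the following is a minimal sketch assuming PyTorch; the networks are tiny placeholders rather than the mel-spectrogram architectures contemplated in this disclosure:

```python
# Minimal adversarial training step: the discriminator learns to separate
# real (high-intelligibility) frames from generated frames, and the
# generator learns to fool the discriminator.
import torch
import torch.nn as nn

n_mels = 80
G = nn.Sequential(nn.Linear(n_mels, n_mels))              # placeholder generator
D = nn.Sequential(nn.Linear(n_mels, 1), nn.Sigmoid())     # placeholder discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x = torch.randn(16, n_mels)      # low-intelligibility mel frames
y = torch.randn(16, n_mels)      # high-intelligibility mel frames

# Discriminator step: real frames should score 1, generated frames 0.
d_loss = bce(D(y), torch.ones(16, 1)) + bce(D(G(x).detach()), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make D score generated frames as real.
g_loss = bce(D(G(x)), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```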
Additionally, the machine-learning training algorithm 320 may be tailored and/or altered based on the machine-learned model 330 to be generated (e.g., based on desired characteristics of the output machine-learned model 330). For example, when the machine-learned model 330 includes a GAN, the machine-learning training algorithm 320 may include applying two-step adversarial losses (e.g., Ladv2(GY→X, GX→Y, D′Y) and, analogously, Ladv2(GX→Y, GY→X, D′X)), identity-mapping losses, cycle-consistency losses, etc. Such a machine-learning training algorithm 320 may reflect similar training objectives to those undertaken in CycleGAN-VC2 (“CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion;” Takuhiro Kaneko, et al.; arXiv:1904.04631; originally published on Apr. 9, 2019) and/or CycleGAN-VC3 (“CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion;” Takuhiro Kaneko, et al.; arXiv:2010.11672; originally published on Oct. 22, 2020). For example, the following types of loss may be used when training the GAN:
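For reference, losses of the kinds named above are commonly formulated as follows in CycleGAN-style voice conversion. The notation and exact formulations below are drawn from the cited CycleGAN-VC2/VC3 literature as an assumed illustration, not reproduced from this disclosure:

```latex
% Standard CycleGAN-style objectives (assumed notation) for generators
% G_{X\to Y}, G_{Y\to X} and discriminators D_X, D_Y, D'_X, D'_Y:
\begin{align}
L_{adv}(G_{X\to Y}, D_Y) &= \mathbb{E}_{y\sim P_Y}[\log D_Y(y)]
  + \mathbb{E}_{x\sim P_X}[\log(1 - D_Y(G_{X\to Y}(x)))] \\
L_{cyc} &= \mathbb{E}_{x\sim P_X}\big[\lVert G_{Y\to X}(G_{X\to Y}(x)) - x\rVert_1\big]
  + \mathbb{E}_{y\sim P_Y}\big[\lVert G_{X\to Y}(G_{Y\to X}(y)) - y\rVert_1\big] \\
L_{id} &= \mathbb{E}_{y\sim P_Y}\big[\lVert G_{X\to Y}(y) - y\rVert_1\big]
  + \mathbb{E}_{x\sim P_X}\big[\lVert G_{Y\to X}(x) - x\rVert_1\big] \\
L_{adv2}(G_{X\to Y}, G_{Y\to X}, D'_X) &= \mathbb{E}_{x\sim P_X}[\log D'_X(x)]
  + \mathbb{E}_{x\sim P_X}[\log(1 - D'_X(G_{Y\to X}(G_{X\to Y}(x))))]
\end{align}
% The second-step adversarial loss is applied symmetrically with D'_Y for the
% Y -> X -> Y cycle.
```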
The training data 310 may be used to train the machine-learned model 330 using the machine-learning training algorithm 320. For example, the training data 310 may include labeled training data that is used to train the machine-learned model 330. In embodiments where the machine-learned model 330 includes one or more GANs, the training data 310 may include sound snippets with high intelligibility (e.g., used as an input to train one or more discriminators of a GAN) and/or sound snippets with low intelligibility (e.g., used as an input for training one or more generators of a GAN). For instance, the machine-learning training algorithm 320 may involve calculating, minimizing, and/or maximizing losses of the types described above.
Further, in some embodiments (e.g., and unlike the training described in CycleGAN-VC2 (“CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion;” Takuhiro Kaneko, et al.; arXiv:1904.04631; originally published on Apr. 9, 2019) and/or CycleGAN-VC3 (“CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion;” Takuhiro Kaneko, et al.; arXiv:2010.11672; originally published on Oct. 22, 2020)), the machine-learning training algorithm 320 may include a stop gradient function used to skip updating the discriminator(s) D during a step of the training (e.g., during gradient descent) when the discriminator loss is less than a predetermined threshold (e.g., 0.2). This may prevent mode collapse during training of the discriminator(s) D and stabilize the training of the GAN (e.g., by preventing the one or more discriminators from encouraging the one or more generators to collapse towards minor, unperceivable changes in order to fool the one or more discriminators).
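Continuing the placeholder names from the training-step sketch above, the conditional discriminator update described here might look like the following (the threshold value of 0.2 comes from the text; everything else is an assumption of the sketch):

```python
# Conditional discriminator update: skip updating D for a training step
# whenever the discriminator loss is already below the threshold, which
# approximates the stop-gradient behavior described above and can help
# prevent mode collapse. `d_loss` and `opt_d` are assumed to come from a
# standard adversarial training step (see the sketch above).
D_LOSS_THRESHOLD = 0.2

def maybe_update_discriminator(d_loss, opt_d):
    if d_loss.item() >= D_LOSS_THRESHOLD:
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
    # Otherwise the discriminator parameters are left untouched this step.
```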
In some embodiments, training the discriminator(s)/generator(s) may occur in a supervised fashion using labeled training data (e.g., sample snippets/signals that are labeled as corresponding to acceptable intelligibility) and/or using sample input training data for the generator(s). Further, in some embodiments, the machine-learning training algorithm 320 may enforce rules during the training of the machine-learned model 330 through the use of one or more hyperparameters.
Once the machine-learned model 330 is trained by the machine-learning training algorithm 320 (e.g., using the method of
While the same computing device (e.g., the computing device 150 of
As described above, example machine-learned models 330 implemented herein (e.g., to generate one or more snippets of speech with improved intelligibility from one or more snippets with lower intelligibility) may include one or more GANs. Such GANs may include one or more generators and one or more discriminators. A generator according to example embodiments is illustrated in
However, unlike the generator architecture of CycleGAN-VC2/VC3, the one-dimensional ResNet blocks (e.g., near a center of the generator) may be replaced with a series of continuous attention layers (e.g., as illustrated in
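For illustration of the kind of attention layer that could stand in for a one-dimensional ResNet block near the center of the generator, the following is a hedged sketch assuming PyTorch's built-in multi-head attention; the actual layer sizes, counts, and attention formulation of the generator are not specified here:

```python
# Sketch of a self-attention block with a residual connection, of the sort
# that could replace a 1-D ResNet block in a mel-spectrogram generator.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, channels=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                 # x: (batch, time, channels)
        attended, _ = self.attn(x, x, x)  # self-attention over the time axis
        return self.norm(x + attended)    # residual connection + normalization

x = torch.randn(2, 64, 256)               # (batch, time steps, channels)
print(AttentionBlock()(x).shape)           # -> torch.Size([2, 64, 256])
```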
The generator described above (and, potentially, the entire machine-learned model 330, such as the entire GAN) and/or the training algorithm used to train the generator (e.g., the machine-learning training algorithm 320) may provide a non-parallel voice conversion approach that is capable of producing naturally sounding speech with limited data (e.g., limited input from a user via the microphone 130). The generator(s)/machine-learned model 330 described herein may be capable of performing such a conversion without F0 modification, synthetic speaker data, or alignment procedures (e.g., by the introduction of an attention module in place of the 1-D latent ResNet module and/or based on a regularization parameter applied during training of the machine-learned model 330 in order to stabilize training). In addition to resulting in more naturally sounding speech, the machine-learned model 330 described herein (e.g., the generator of the GAN as illustrated in
A discriminator according to example embodiments is illustrated in
At step 502, the method 500 may include capturing, by a microphone (e.g., the microphone 130 shown and described with reference to
At step 504, the method 500 may include converting, by an ADC (e.g., the ADC 140 shown and described with reference to
At step 506, the method 500 may include receiving, by a computing device (e.g., the computing device 150 shown and described with reference to
At step 508, the method 500 may include applying, by the computing device (e.g., the computing device 150 shown and described with reference to
At step 510, the method 500 may include outputting, by the computing device (e.g., the computing device 150 shown and described with reference to
At step 512, the method 500 may include converting, by a DAC (e.g., the DAC 160 shown and described with reference to
At step 514, the method 500 may include producing, by a speaker (e.g., the speaker 190 shown and described with reference to
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using training data that includes one or more recordings or representations of natural speech.
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using training data that includes one or more recordings or representations of speech produced by a laryngectomee.
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using training data that includes one or more recordings or representations of alaryngeal speech. The one or more recordings or representations of alaryngeal speech may include recordings or representations of a predetermined type of alaryngeal speech, for example. Further, the predetermined type of alaryngeal speech used may be exhibited by the user (e.g., the user 102 using the speech-improvement device 110 to perform the method 500).
In some embodiments, the method 500 may be performed using a device that is configured such that a time delay between when the microphone captures one or more frames of unintelligible speech (e.g., at step 502) and when the speaker produces sound based on the analog form of the one or more frames with improved intelligibility (e.g., at step 514) is less than 50 ms.
In some embodiments of the method 500, the machine-learned model applied at step 508 may include a GAN.
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using a two-step adversarial loss technique.
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using training data that includes: one or more recordings or representations (e.g., mel-spectrograms) of one or more sounds from an ARCTIC dataset or VCTK dataset and/or one or more recordings or representations (e.g., mel-spectrograms) of one or more test subjects attempting to reproduce the one or more sounds from the ARCTIC dataset or VCTK dataset (e.g., through alaryngeal speech).
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using training data that includes one or more recordings of the user (e.g., the user 102 using a speech-improvement device 110 to perform the method 500) prior to the user undergoing a laryngectomy. The one or more recordings of the user prior to the user undergoing the laryngectomy may be contained within answering machine recordings, home video recordings, or web-archived video.
In some embodiments of the method 500, the machine-learned model applied at step 508 may be trained using training data that includes one or more recordings of a relative of the user (e.g., the user 102 using a speech-improvement device 110 to perform the method 500) or a person having similar characteristics as the user. For example, the training data may include one or more recordings of the person having similar characteristics as the user and the similar characteristics may include a same sex as the user, a similar age to the user (e.g., within 1 year of the user, within 2 years of the user, within 3 years of the user, within 4 years of the user, within 5 years of the user, within 6 years of the user, within 7 years of the user, within 8 years of the user, within 9 years of the user, or within 10 years of the user), a similar height to the user (e.g., within 1 cm of the user, within 2 cm of the user, within 3 cm of the user, within 4 cm of the user, within 5 cm of the user, within 6 cm of the user, within 7 cm of the user, within 8 cm of the user, within 9 cm of the user, or within 10 cm of the user), or a similar weight to the user (e.g., within 1 kg of the user, within 2 kg of the user, within 3 kg of the user, within 4 kg of the user, within 5 kg of the user, within 6 kg of the user, within 7 kg of the user, within 8 kg of the user, within 9 kg of the user, or within 10 kg of the user). Additionally or alternatively, the training data may include one or more recordings of a relative of the user (e.g., a sibling, a parent, or a child of the user).
In some embodiments of the method 500, step 510 may include outputting the one or more frames as a mel-spectrogram. In such embodiments, a DAC (e.g., the DAC used to perform step 512) may include a vocoder. Further, step 512 may include the vocoder generating an electrical signal based on the mel-spectrogram. Additionally, in some embodiments, the vocoder may, itself, include a machine-learned model (e.g., a machine-learned model that includes a GAN) that is trained to invert mel-spectrograms to waveforms.
In some embodiments of the method 500, the unintelligible speech from the user (e.g., captured by the microphone at step 502) may be a result of the user possessing a heavy accent or ALS. For example, the reason the speech from the user is unintelligible may be due to the user's heavy accent or the user having ALS.
In some embodiments of the method 500, the microphone used at step 502 may be configured to capture three frames of unintelligible speech from the user. For example, each of the three frames may be between 8 ms and 12 ms in length (i.e., duration). Further, the computing device used at step 508 to apply the machine-learned model to the digital representation may be configured to generate a single frame with improved intelligibility based on the three captured frames of unintelligible speech. The single frame with improved intelligibility may be between 8 ms and 12 ms in length.
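For illustration of the three-frames-in, one-frame-out cadence just described, the following is a hedged sketch; the conversion model is a placeholder, the frame size is an assumption within the stated 8 ms-12 ms range, and whether consecutive snippets overlap is not specified by the disclosure (a non-overlapping grouping is assumed here):

```python
# Sketch of streaming conversion that emits one improved frame for every
# three captured frames. `model` is a placeholder for the machine-learned
# conversion applied at step 508.
import numpy as np

SR = 16000
FRAME_LEN = SR * 10 // 1000           # ~10 ms frames

def stream(frames, model):
    """Yield one improved frame for every three captured frames."""
    window = []
    for frame in frames:
        window.append(frame)
        if len(window) == 3:
            yield model(np.concatenate(window))   # single improved frame
            window.clear()

frames = (np.random.randn(FRAME_LEN) for _ in range(9))
improved = list(stream(frames, model=lambda snippet: snippet[:FRAME_LEN]))
print(len(improved), improved[0].shape)           # -> 3 (160,)
```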
From the foregoing description it will be readily apparent that devices for real-time speech output with improved intelligibility that allow for more natural and more understandable speech have been developed. The naturalness may be provided by the inclusion of prosody; improvement of tonality; and an improved rendition of vowels, consonants, and other speech components. In some embodiments, devices described herein may be packaged to be worn or carried easily. Further, devices described herein may include and/or be powered by one or more batteries. The methods also described herein provide for processing speech in real-time to provide a more natural sounding output (e.g., based on an altered or impaired voice input). In certain cases where real-time speech is not necessary (e.g., when outside of the context of a live conversation, such as when a conversion after an utterance is acceptable), delayed speech with improved accuracy can also be provided (e.g., by applying a GAN that includes additional artificial neuron layers).
The present disclosure is not to be limited in terms of the particular embodiments described in this application, which are intended as illustrations of various aspects. Many modifications and variations can be made without departing from its scope, as will be apparent to those skilled in the art. Functionally equivalent methods and apparatuses within the scope of the disclosure, in addition to those described herein, will be apparent to those skilled in the art from the foregoing descriptions. Such modifications and variations are intended to fall within the scope of the appended claims.
The above detailed description describes various features and operations of the disclosed systems, devices, and methods with reference to the accompanying figures. The example embodiments described herein and in the figures are not meant to be limiting. Other embodiments can be utilized, and other changes can be made, without departing from the scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations.
With respect to any or all of the message flow diagrams, scenarios, and flow charts in the figures and as discussed herein, each step, block, operation, and/or communication can represent a processing of information and/or a transmission of information in accordance with example embodiments. Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, for example, operations described as steps, blocks, transmissions, communications, requests, responses, and/or messages can be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved. Further, more or fewer blocks and/or operations can be used with any of the message flow diagrams, scenarios, and flow charts discussed herein, and these message flow diagrams, scenarios, and flow charts can be combined with one another, in part or in whole.
A step, block, or operation that represents a processing of information can correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a step or block that represents a processing of information can correspond to a module, a segment, or a portion of program code (including related data). The program code can include one or more instructions executable by a processor for implementing specific logical operations or actions in the method or technique. The program code and/or related data can be stored on any type of computer-readable medium such as a storage device including RAM, a disk drive, a solid state drive, or another storage medium.
The computer-readable medium can also include non-transitory computer-readable media such as computer-readable media that store data for short periods of time like register memory and processor cache. The computer-readable media can further include non-transitory computer-readable media that store program code and/or data for longer periods of time. Thus, the computer-readable media may include secondary or persistent long term storage, like ROM, optical or magnetic disks, solid state drives, compact-disc read only memory (CD-ROM), for example. The computer-readable media can also be any other volatile or non-volatile storage systems. A computer-readable medium can be considered a computer-readable storage medium, for example, or a tangible storage device.
Moreover, a step, block, or operation that represents one or more information transmissions can correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions can be between software modules and/or hardware modules in different physical devices.
The particular arrangements shown in the figures should not be viewed as limiting. It should be understood that other embodiments can include more or less of each element shown in a given figure. Further, some of the illustrated elements can be combined or omitted. Yet further, an example embodiment can include elements that are not illustrated in the figures.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purpose of illustration and are not intended to be limiting, with the true scope being indicated by the following claims.