Various examples of the invention generally relate to techniques for creating an artificial voice for a patient with missing or impaired phonation but at least residual articulation function.
Voice disorders are known to affect 3% to 9% of the population in developed countries and manifest themselves in a range of symptoms collectively known as dysphonia: from hoarseness to a weak or distorted voice to a complete loss of voice, referred to as aphonia. Voice disorders can be functional or organic in origin. Organic voice disorders can be further classified as either structural or neurogenic. This invention deals primarily with severe structural dysphonia and aphonia, but its uses are not limited to these conditions.
In OECD countries, an estimated 60,000 patients per year cannot speak while on longer-term mechanical ventilation involving a tracheostomy; an estimated 12,000 patients per year lose their voice permanently after throat cancer surgery with a partial or total laryngectomy; and an estimated 4,000 thyroid surgeries per year result in severe and lasting speaking problems. Dysphonia or aphonia after thyroid surgery is typically due to vocal fold paresis, most often caused by damage to a recurrent laryngeal nerve.
Speech production consists of two processes: phonation, in technical terms the excitation of an acoustic oscillation by the vocal folds, and articulation, i.e. the filtering of the sound spectrum by the time-varying shape of the vocal tract. Shaping of the vocal tract is done with the velum (7), which opens or closes off the nasal cavity, the tongue (9), the upper (10a) and lower teeth (10b), as well as the upper (11a) and lower lip (11b).
Different situations leading to a partial or complete loss of phonation are shown schematically in
Current state-of-the art options for voice rehabilitation are limited. They include:
Tracheoesophageal puncture, shown in
Esophageal speech, visualized schematically in
Electrolarynx, or artificial larynx, shown in
Phonosurgery. In cases of unilateral paresis, phonosurgery attempts to adjust the immobilized vocal fold to a position that achieves the best compromise between glottis closure, which is needed for the ability to phonate, and sufficient air flow for breathing. This can be achieved with sutures, laser surgery, or filler materials such as silicone or hyaluronic acid.
Speech therapy. Speech therapists specialize in training patients' residual vocal capabilities through voice exercises. Many recurrent nerve injuries are transient, and the effects of a one-sided paresis can often be compensated by strengthening the contra-lateral vocal fold.
In many cases, none of these options offers a satisfactory solution. Patients on mechanical ventilation and patients with completely immobilized vocal folds or a surgically removed larynx often recover only a rudimentary ability to communicate, and even that with difficulty and/or a severely distorted voice.
Voicelessness and impaired communication have serious effects on patients and relatives, as well as on caregivers' ability to care for a patient. Voicelessness and the inability to communicate verbally have been associated with poor sleep, stress, anxiety, depression, social withdrawal, and reduced motivation of patients to participate in their care.
Next, an overview of related prior work is given:
While the available therapeutic options for voiceless patients are limited, a number of approaches to the problem of recognizing speech from measurements of the vocal tract, the larynx, the neck and facial musculature, and the lip and facial movements have been described in the literature. The field of research concerned with speech recognition without acoustic voice input is sometimes referred to as “silent speech” research. Below, key results of silent speech research as they relate to this invention, and the main differences, are summarized.
Radar sensing: Holzrichter et al. at Lawrence Livermore National Laboratory have developed a range of ideas around the recognition of speech from radar signals in combination with an acoustic microphone. Their fundamental patents date back to 1996. While Holzrichter's focus is on improving speech recognition from healthy speakers, he does mention prosthetic applications in passing, without describing them in detail. However, his primary objective is measurement of the vocal excitation function, i.e. vocal fold motion, rather than the vocal tract configuration—which is fundamentally different from the objective of techniques described herein. Accordingly, all of the described embodiments by Holzrichter et al. require at least partial phonation. See, e.g., Ng, Lawrence C., John F. Holzrichter, and P. E. Larson. Low Bandwidth Vocoding Using EM Sensor and Acoustic Signal Processing. No. UCRL-JC-145934. Lawrence Livermore National Lab., CA (US), 2001. Further see: Holzrichter, J. F. New ideas for speech recognition and related technologies. No. UCRL-ID-120310. Lawrence Livermore National Lab. (LLNL), Livermore, Calif. (United States), 2002. Further, see Holzrichter, John F., Lawrence C. Ng, and John Chang. “Real-time speech masking using electromagnetic-wave acoustic sensors.” The Journal of the Acoustical Society of America 134.5 (2013): 4237-4237. Also, see Jiao, Mingke, et al. “A novel radar sensor for the non-contact detection of speech signals.” Sensors 10.5 (2010): 4622-4633. See Li, Sheng, et al. “A 94-GHz millimeter-wave sensor for speech signal acquisition.” Sensors 13.11 (2013): 14248-14260.
Recently, Birkholz et al. have demonstrated silent phoneme recognition with microwave signals using two antennas attached on test subjects' cheek and below the chin. Measuring time-varying reflection and transmission spectra in the frequency range of 2-12 GHz, they achieved phoneme recognition rates in the range of 85% to 93% for a limited set of 25 phonemes. See, e.g., Birkholz, Peter, et al. “Non-invasive silent phoneme recognition using microwave signals.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 26.12 (2018): 2404-2411.
Ultrasound: Ultrasound imaging has been studied as input for speech recognition and synthesis, sometimes in combination with optical imaging of the lips. See, e.g., Hueber, Thomas, et al. “Eigentongue feature extraction for an ultrasound-based silent speech interface.” 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07. Vol. 1. IEEE, 2007; or Hueber, Thomas, et al. “Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips.” Speech Communication 52.4 (2010): 288-300; or Hueber T. Speech Synthesis from ultrasound and optical images of the speaker's vocal tract. Available at: https://www.neurones.espci.fr/ouisper/doc/report_hueber_ouisper.pdf. Accessed Oct. 16, 2018. Hueber, Denby et al. have filed patent applications for a speech recognition and reconstruction device consisting of a wearable ultrasound transducer and a way of tracking its location relative to the patient's head via mechanical means or a 3-axis accelerometer. They also describe image processing methods to extract tongue profiles from two-dimensional ultrasound images. See Denby, Bruce, and Maureen Stone. “Speech synthesis from real time ultrasound images of the tongue.” 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 1. IEEE, 2004; or Denby, Bruce, et al. “Prospects for a silent speech interface using ultrasound imaging.” 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. Vol. 1. IEEE, 2006.
Hueber's paper “Speech Synthesis from ultrasound and optical images of the speaker's vocal tract” describes the use of machine learning (sometimes also referred to as artificial intelligence) to translate vocal tract configurations into voice output. The concepts Hueber and Denby describe in their publications are limited to ultrasound imaging of the tongue and camera imaging of the lips, and always use ultrasound images as an intermediate processing step. Their experiments aimed at building a “silent speech interface” led Hueber et al. to the conclusion that “with some 60% of phones correctly identified [ . . . ], the system is not able to systematically provide an intelligible synthesis”.
McLoughlin and Song have used low-frequency ultrasound in a 20 to 24 kHz band to sense voice activity via the detection of the mouth state, i.e. the opening and closing of a test subject's lips. Even though their system provided only binary voice activity output, it required subject specific training of the detection algorithm. See, e.g., McLoughlin, Ian Vince. “The use of low-frequency ultrasound for voice activity detection.” Fifteenth Annual Conference of the International Speech Communication Association. 2014; and McLoughlin, Ian, and Yan Song. “Low frequency ultrasonic voice activity detection using convolutional neural networks.” Sixteenth Annual Conference of the International Speech Communication Association. 2015.
Lip and facial video: Zisserman's group at the University of Oxford, Shillingford et al., and Makino et al. have demonstrated combined audio-visual speech recognition using facial video in combination with audio speech data as input to deep neural network algorithms. While their results show that video data can be used to enhance the reliability of audio speech recognition, the approaches they describe are limited to recognizing the speech of healthy subjects with audible speech output. See e.g. Chung, Joon Son, and Zisserman, Andrew. “Lip Reading in Profile.” British Machine Vision Association and Society for Pattern Recognition. 2017; Shillingford, Brendan, et al. “Large-Scale Visual Speech Recognition”. INTERSPEECH. 2019; and Makino, Takaki, et al. “Recurrent Neural Network Transducer for Audio-Visual Speech Recognition”. IEEE Automatic Speech Recognition and Understanding Workshop. 2019.
Surface electromyography: Surface EMG of the neck and face has been tested for Human-Computer Interfaces (HCI), particularly in so-called Augmentative and Alternative Communication (AAC) for people with severe motor disabilities, e.g. due to spinal cord injury or amyotrophic lateral sclerosis (ALS), a motor neuron disease. Stepp's group at Boston University uses surface EMG as input for AAC interfaces: EMG signals from the neck or facial musculature are picked up and allow the patient to move a cursor on a screen. This can be used to design a “phonemic interface” in which the patient painstakingly chooses individual phonemes that are assembled into speech output. Among other aspects, this differs from this invention in that it is not real time. See, e.g., Janke, Matthias, and Lorenz Diener. “EMG-to-speech: Direct generation of speech from facial electromyographic signals.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.12 (2017): 2375-2385; or Denby, Bruce, et al. “Silent speech interfaces.” Speech Communication 52.4 (2010): 270-287; or Wand, Michael, et al. “Array-based Electromyographic Silent Speech Interface.” BIOSIGNALS. 2013. Further, see Stepp, Cara E. “Surface electromyography for speech and swallowing systems: measurement, analysis, and interpretation.” Journal of Speech, Language, and Hearing Research (2012); also see Hands, Gabrielle L., and Cara E. Stepp. “Effect of Age on Human-Computer Interface Control Via Neck Electromyography.” Interacting with computers 28.1 (2016): 47-54; also see Cler, Meredith J., et al. “Surface electromyographic control of speech synthesis.” 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2014.
Speech synthesis from surface EMG signals has been tried, albeit with limited success. Meltzner et al. succeeded in speech recognition in a 65-word vocabulary with a success rate of 86.7% for silent speech in healthy subjects. Meltzner's DARPA-funded study used a hidden Markov model (HMM) for speech recognition from surface EMG signals. The main differences from this invention are that it appears to be limited to a small vocabulary and does not include a versatile speech synthesis stage. See Meltzner, Geoffrey S., et al. “Speech recognition for vocalized and subvocal modes of production using surface EMG signals from the neck and face.” Ninth Annual Conference of the International Speech Communication Association. 2008.
Electrocorticography: Recently, two groups have demonstrated rudimentary speech recognition from brain signals via electrocorticography. Since this technique requires electrodes invasively placed on the surface of the cerebral cortex, it is only conceivable for severely disabled patients, such as advanced-stage ALS patients, for whom the risk of such a procedure could be justified. Moses et al. showed rudimentary speech recognition within a limited vocabulary using electrocorticography in combination with a context-sensitive machine learning approach. Akbari et al. also added machine learning based speech synthesis to reconstruct audible speech within a very limited vocabulary. Both approaches differ from this invention in their highly invasive nature. See, e.g., Akbari, Hassan, et al. “Towards reconstructing intelligible speech from the human auditory cortex.” Scientific reports 9.1 (2019): 1-12; and Moses, David A., et al. “Real-time decoding of question-and-answer speech dialogue using human cortical activity.” Nature communications 10.1 (2019): 1-14.
The following patent publications are known: DE 202004010342 U1; EP 2577552 B1; U.S. Pat. Nos. 5,729,694; 6,006,175; 7,162,415 B2; 7,191,105 B2; US 2012/0053931 A1.
It is an objective of the techniques described herein to restore in real time natural-sounding speech for patients who have impaired or missing phonation, but have retained at least a partial ability to articulate. This includes, but is not limited to, the three conditions described above: mechanical ventilation, laryngectomy, and recurrent nerve paresis.
A further objective of the techniques described herein is to match the restored voice to desired voice characteristics for a particular patient, for example, by matching the restored voice closely to the patient's voice prior to impairment. While a stationary bedside unit can be a solution in intensive care unit (ICU) settings, for most other settings it is an additional objective to provide a solution that is wearable, light-weight, and unobtrusive.
This need is met by the features of the independent claims. The features of the dependent claims define embodiments.
Unlike prostheses that attempt to restore, or mechanically substitute, a patient's ability to phonate, i.e. to produce an acoustic wave that is shaped into speech in the vocal tract, the approach of the techniques described herein is to measure the time-varying physical configuration of the vocal tract, i.e. to characterize the patient's attempted articulation, numerically synthesize a speech waveform from it in real time, and output this waveform as an acoustic wave via a loudspeaker.
Many of the techniques described herein can employ a process labeled “voice grafting” hereinafter. Voice grafting can be understood in terms of the source-filter model of speech production as follows: For a patient who has partially or completely lost the ability to phonate, but retained at least a partial ability to articulate, the techniques described herein computationally “graft” the patient's time varying filter function, i.e. articulation, onto a source function, i.e. phonation, which is based on the speech output of one or more healthy speakers, in order to synthesize natural sounding speech in real time. Real time can correspond to a processing delay of less than 0.5 seconds, optionally of less than 50 ms.
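In terms of the source-filter model, the voice grafting can be summarized by the following simplified short-time, frequency-domain relation; the notation is given here for illustration only and is not part of any specific embodiment:

```latex
% Simplified source-filter view of voice grafting:
% S - synthesized speech spectrum at time t
% E - excitation (source/phonation), derived from the healthy reference voice
% H - vocal-tract transfer function (filter/articulation), estimated from the
%     patient's measured vocal-tract configuration
S(f,t) = E_{\mathrm{reference}}(f,t)\cdot H_{\mathrm{patient}}(f,t)
```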
As a general rule, some of the examples described herein pertain to a training of a machine learning algorithm (i.e., creating respective training data and executing the training based on the training data); and some further examples described herein relate to inference provided by the trained machine learning algorithm. Sometimes, the methods can be limited to training; and sometimes, the methods can be limited to inference. It is also possible to combine training and inference.
Regarding the training: A respective method includes training a machine learning algorithm based on, firstly, one or more reference audio signals. The reference audio signals include a speech output of a reference text. The machine learning algorithm is, secondly, trained based on one or more vocal-tract signals. The one or more vocal-tract signals are associated with an articulation of the reference text by a patient.
In other words, it is possible to record the one or more reference audio signals as the acoustic output of one or more healthy speakers reading the reference text. The vocal-tract signal of the patient corresponds to the patient articulating the same reference text.
Training is performed in a training phase. After the training phase is completed, an inference phase commences. In the inference phase, it would then be possible to receive one or more further vocal-tract signals of the patient and convert the one or more further vocal-tract signals into an associated speech output based on the machine learning algorithm. Such further signals can be referred to as “live signals”, because they are received after training, without any ground truth (such as the reference text) being available.
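For purely illustrative purposes, the separation into a training phase and an inference phase can be sketched in Python as follows; the model architecture, feature dimensions, and all identifiers are exemplary assumptions and not limiting:

```python
# Minimal sketch (assumed names, shapes, and architecture): a model that maps
# frames of vocal-tract features to frames of acoustic features is trained on
# synchronized pairs, then applied to live vocal-tract frames during inference.
import torch
import torch.nn as nn

class VocalTractToAudio(nn.Module):
    def __init__(self, vt_dim=64, audio_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(vt_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, audio_dim)      # e.g. mel or MFCC frames

    def forward(self, vt_frames):                    # (batch, time, vt_dim)
        h, _ = self.rnn(vt_frames)
        return self.out(h)                           # (batch, time, audio_dim)

def train(model, paired_batches, epochs=10, lr=1e-3):
    """paired_batches yields synchronized (vocal-tract, reference-audio) tensors."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for vt, audio in paired_batches:
            opt.zero_grad()
            loss = loss_fn(model(vt), audio)
            loss.backward()
            opt.step()

@torch.no_grad()
def infer(model, live_vt_frames):                    # inference on "live signals"
    return model(live_vt_frames.unsqueeze(0)).squeeze(0)
```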
Such techniques as described above and below help to provide voice grafting that is based on machine learning techniques. The techniques can be implemented in a mobile computing device (also referred to as a user device). Accordingly, the voice grafting can be implemented in a wearable, lightweight, and unobtrusive manner. In other examples, e.g., in an intensive care unit setting, it would also be possible to implement the techniques on a computing device that is stationary at the bedside.
The techniques described herein can be implemented by methods, devices, systems, computer programs, computer-program products, or computer-readable storage media that execute respective program code.
It is to be understood that the features mentioned above and those yet to be explained below may be used not only in the respective combinations indicated, but also in other combinations or in isolation without departing from the scope of the invention.
Hereinafter, techniques of generating a speech output based on a residual articulation of the patient (voice grafting) are described. Various techniques are based on the finding that prior-art implementations of speech rehabilitation face certain restrictions and drawbacks. For example, for many patients they do not achieve the objective of restoring natural-sounding speech. Esophageal speech and speaking with the help of a speaking valve, voice prosthesis, or electrolarynx are difficult to learn for some patients and often result in distorted, unnatural speech. The need to hold and activate an electrolarynx device or to cover a tracheostoma or valve opening with a finger makes these solutions cumbersome and obtrusive. In-dwelling prostheses also carry the risk of fungal infections.
For example, details with respect to the electrolarynx are described in: Kaye, Rachel, Christopher G. Tang, and Catherine F. Sinclair. “The electrolarynx: voice restoration after total laryngectomy.” Medical Devices (Auckland, NZ) 10 (2017): 133.
For example, details with respect to a speaking valve are described in: Passy, Victor, et al. “Passy-Muir tracheostomy speaking valve on ventilator-dependent patients.” The Laryngoscope 103.6 (1993): 653-658. Also, see Kress, P., et al. “Are modern voice prostheses better? A lifetime comparison of 749 voice prostheses.” European Archives of Oto-Rhino-Laryngology 271.1 (2014): 133-140.
An overview of traditional voice prostheses is provided by: Reutter, Sabine. Prothetische Stimmrehabilitation nach totaler Kehlkopfentfernung-eine historische Abhandlung seit Billroth (1873). Diss. Universität Ulm, 2008.
Many different implementations of the voice grafting are possible. One implementation is shown schematically in
Step 1: Creating training data, i.e., training phase. A healthy speaker (25) providing the “target voice” reads a sample text (24) (also labelled reference text) out loud, while the voice output is recorded with a microphone (26). This can create one or more reference audio signals, or simply audio training data. The resulting audio training data (27), including the text and its audio recording, can be thought of as an “audio book”; in fact, it can also be an existing audio book recording or any other available speech corpus. (Step 1a,
The same text is then “read” by a patient with impaired phonation (29), while signals characterizing the patient's time-varying vocal tract configuration, i.e. the articulation, are recorded with suitable sensors (30), yielding vocal tract training data (31). If the patient is completely aphonic, “reading” the text here means silently “mouthing” it. To record the articulation, various options exist. For example, the patient's vocal tract is probed using electromagnetic and/or acoustic waves, and backscattered and/or transmitted waves are measured. The measurement setup as a whole is referred to as the “vocal tract sensors” and the measured signals as the “vocal tract signals”. (Step 1b,
Step 2: Training the algorithm (
Step 3: Using the voice prosthesis (
Thus, as a general rule, a corresponding method can include receiving one or more live vocal-tract signals of the patient and then converting the one or more live vocal-tract signals into one or more associated live audio signals including speech output, based on the machine learning algorithm.
For each step described above, a wide range of implementations is possible, which will be described below.
Step 1a: Creating the audio training data. The audio training data representing the “healthy voice” can come from a range of different sources. In the most straightforward implementation it is the voice of a single healthy speaker. The training data set can also come from multiple speakers. In one implementation, a library of recordings of speakers with different vocal characteristics could be used. In the training step, training data of a matching target voice would be chosen for the impaired patient. The matching could happen based on gender, age, pitch, accent, dialect, and other characteristics, e.g., defined by a respective patient dataset.
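Purely as an illustration of such a matching, a target voice could be selected from a library by a simple scoring over patient characteristics; the attributes and weights below are arbitrary example choices and not part of the claimed method:

```python
# Illustrative only: select the pre-recorded target voice from a library that
# best matches a patient dataset.  Attributes and weights are arbitrary examples.
def match_target_voice(patient, library):
    def score(voice):
        s = 0.0
        s += 2.0 if voice["gender"] == patient.get("gender") else 0.0
        s += 1.0 if voice["dialect"] == patient.get("dialect") else 0.0
        s -= 0.1 * abs(voice["age"] - patient["age"])              # years
        s -= 0.02 * abs(voice["pitch_hz"] - patient["pitch_hz"])   # mean pitch
        return s
    return max(library, key=score)
```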
Step 1b: Creating the vocal tract training data. The vocal tract training data can also come from a range of different sources. In the most straightforward implementation it comes from a single person, the same impaired patient who will use the voice prosthesis in Step 3. The vocal tract signal training data can also come from multiple persons, who do not all have to be impaired patients. For example, the training can be performed in two steps: a first training step can train the machine learning algorithm with a large body of training data consisting of audio data from multiple healthy speakers and vocal tract data from multiple healthy and/or impaired persons. A second training step can then re-train the network with a smaller body of training data containing the vocal tract signals of the impaired patient who will use the voice prosthesis.
Step 1c: Synchronizing audio and the vocal tract training data. As a general rule, the method may further include synchronizing a timing of the one or more reference audio signals with a further timing of the one or more vocal-tract signals. By synchronizing the timings, an accurate training of the machine learning algorithm is enabled. The timing of a signal can correspond to the duration between corresponding information content. For example, the timing of the one or more reference audio signals may be characterized by the time duration required to cover a certain fraction of the reference text by the speech output; similarly, the further timing of the one or more vocal-tract signals can correspond to the time duration required to cover a certain fraction of the reference text by the articulation.
Alternatively or additionally, it would also be possible that said synchronizing includes controlling a human-machine interface (HMI) to obtain temporal guidance from the patient when articulating the reference text. For example, the impaired patient could provide synchronization information by pointing at the part of the text being articulated, while the one or more vocal-tract signals are being recorded. Gesture detection or eye tracking may be employed. The position of an electronic pointer, e.g., a mouse cursor, could be analyzed. A third approach is to synchronize the audio training data and the vocal tract data computationally after recording, by selectively slowing down or speeding up one of the two data recordings. This requires both data streams to be annotated with timing cues. For the vocal tract signals, the subject recording the training set can provide these timing cues themselves by moving an input device, such as a mouse or a stylus, through the text at the speed of his or her reading. For the audio training data, the timing cues can be generated in a similar way, or by manually annotating the data after recording, or with the help of state-of-the-art speech recognition software. Thus, as a general rule, it would be possible that said synchronizing includes postprocessing at least one of the reference audio signal and a vocal-tract signal by changing a respective timing. In other words, it would be possible that said synchronizing is implemented electronically after recording of the one or more reference audio signals and/or the one or more vocal-tract signals, e.g., by selectively speeding up/accelerating or slowing down/decelerating the one or more reference audio signals and/or the one or more vocal-tract signals.
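As an illustrative sketch of the computational synchronization (assuming both recordings have been annotated with timing cues that mark the same positions in the reference text), one feature stream can be retimed onto the time base of the other:

```python
# Illustrative sketch: retime the vocal-tract feature frames onto the time base
# of the audio training data using timing cues, i.e. the times (in seconds) at
# which the same positions in the reference text were reached in each recording.
import numpy as np

def retime(vt_frames, vt_cues, audio_cues, frame_rate=100.0):
    """vt_frames: (n_frames, n_features) vocal-tract features sampled at frame_rate Hz.
    vt_cues / audio_cues: increasing cue times (s) marking the same text positions."""
    n_out = int(round(audio_cues[-1] * frame_rate))
    t_out = np.arange(n_out) / frame_rate               # target (audio) time axis
    # Map each target time to the corresponding source time via the cue pairs,
    # locally slowing down or speeding up the vocal-tract recording.
    t_src = np.interp(t_out, audio_cues, vt_cues)
    idx = np.clip((t_src * frame_rate).astype(int), 0, len(vt_frames) - 1)
    return vt_frames[idx]                               # (n_out, n_features)
```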
Step 2: Training the machine learning algorithm. A range of different algorithms commonly used in speech recognition and speech synthesis can be adapted to the task of transforming vocal tract signals into acoustic speech output. The transformation can either be done via intermediate representations of speech, or end-to-end, omitting any explicit intermediate steps. Intermediate representations can be, for example, elements of speech such as phonemes, syllables, or words, or acoustic speech parameters such as mel-frequency cepstral coefficients (MFCCs).
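For example, if MFCCs are chosen as the intermediate representation, acoustic training targets could be extracted from the reference audio as sketched below; the librosa-based implementation and the parameter values are illustrative assumptions only:

```python
# Sketch only: extract MFCC frames from the reference audio to serve as the
# intermediate acoustic representation that the network learns to predict.
# Library choice (librosa) and parameter values are illustrative assumptions.
import librosa

def audio_to_mfcc(wav_path, n_mfcc=13, hop_s=0.01):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=int(hop_s * sr))
    return mfcc.T       # (n_frames, n_mfcc): one feature vector per 10 ms frame
```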
Step 3: Using the voice prosthesis. The trained neural network can then be used to realize an electronic voice prosthesis, a medical device that can alternatively be referred to as a “voice graft”, an “artificial voice”, or a “voice transplant”. A wide range of implementations are possible for the voice prosthesis. In practice, the choice will depend on the target patient scenario, i.e. the type of vocal impairment, the patient's residual vocal capabilities, the therapy goals, and aspects of the target setting, such as in-hospital vs. at home; bedridden vs. mobile.
Based on the range of the implementation options for each step above, a wide range of embodiments of the techniques described herein is possible. We describe four preferred embodiments of the invention for different patient scenarios. It is understood that combinations of various aspects of these embodiments can also be advantageous in these and other scenarios and that more embodiments of the invention can be generated from the implementation options discussed above. Also, the preferred embodiments described can apply to scenarios other than the ones mentioned in the description.
For a bedridden patient with no laryngeal airflow, such as a patient who is mechanically ventilated through a cuffed tracheostomy tube, embodiment 1 is a preferred embodiment. Such patients generally have no residual voice output and are not able to whisper. Therefore, the combination of radar sensing to obtain robust vocal tract signals and a video camera to capture lip and facial movements is preferred.
The main elements of the corresponding voice prosthesis are shown in
Two or more antennas (36) are used to collect reflected and transmitted radar signals that encode the time-varying vocal tract shape. The antennas are placed in proximity to the patient's vocal tract, e.g. under the right and left jaw bone. To keep their position stable relative to the vocal tract, they can be attached directly to the patient's skin as patch antennas. Each antenna can send and receive modulated electromagnetic signals in a frequency band between 1 kHz and 12 GHz, optionally 1 GHz and 12 GHz, so that (complex) reflection and transmission can be measured. Possible modulations of the signal are: frequency sweep, stepped frequency sweep, pulse, frequency comb, frequency, phase, or amplitude modulation. In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. The video camera is mounted in front of the patient's face on a cantilever (62) attached to the patient bed. The same cantilever can support the loudspeaker (63) for acoustic speech output.
The computing device (58) contained in the bedside unit (60) locally provides the necessary computing power to receive signals from the signal processing electronics (57) and the video camera (48), run the machine learning algorithm, output acoustic waveforms to the audio amplifier (59), and communicate wirelessly with the portable touchscreen device (61) serving as the user interface. The machine learning algorithm uses a deep neural network to transform the pre-processed radar signals and the stream of video images into an acoustic waveform in real time. The acoustic waveform is sent via the audio amplifier (59) to the loudspeaker (63).
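A minimal sketch of such a real-time processing loop is given below; the frame rate, window length, and all interfaces to the sensors, the trained network, and the audio amplifier are assumptions for illustration:

```python
# Sketch of a real-time loop: gather the most recent radar and video feature
# frames, run the trained network on the current window, and hand the resulting
# audio chunk to the amplifier.  All interfaces and the frame rate are assumed.
import collections
import numpy as np

WINDOW = 30  # frames, e.g. 300 ms at an assumed 100 Hz frame rate

def run_realtime(get_radar_frame, get_video_frame, model, play_audio):
    buf = collections.deque(maxlen=WINDOW)
    while True:
        frame = np.concatenate([get_radar_frame(), get_video_frame()])
        buf.append(frame)
        if len(buf) == WINDOW:
            waveform_chunk = model(np.stack(buf))   # trained DNN on current window
            play_audio(waveform_chunk)              # to audio amplifier / loudspeaker
```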
The corresponding method for creating an artificial voice is as follows. An existing speech database is used to obtain audio training data for multiple target voices with different characteristics such as gender, age, and pitch. To create a corresponding body of vocal tract data, the sample text of the audio training data is read by a number of different speakers without speech impairment while their vocal tract signals are being recorded with the same radar sensor and video camera setup as for the eventual voice prosthesis. As the speakers read the sample text off a display screen, they follow the text along with an input stylus and timing cues are recorded. The timing cues are used to synchronize the vocal tract training data with the audio training data.
The audio training data sets of different target voices are separately combined with the synchronized vocal tract training data and used to train a deep neural network (DNN) algorithm to convert radar and video data into the target voice. This results in a number of different DNNs, one for each target voice. The voice prosthesis is pre-equipped with these pre-trained DNNs.
To deal with the subject-to-subject variation in vocal tract signals, a pre-trained DNN is re-trained for a particular patient before use. To this end, first the pre-trained DNN that best matches the intended voice for the patient is selected. Then, the patient creates a patient-specific set of vocal tract training data by mouthing an excerpt of the sample text that was used to pre-train the DNNs, while vocal tract data are being recorded. This second vocal tract training data set is synchronized and combined with the corresponding audio sample of the selected target voice. This smaller, patient-specific second set of training data is now used to re-train the DNN. The resulting patient-specific DNN is used in the voice prosthesis to transform the patient's vocal tract signal to voice output with the characteristics of the selected target voice.
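As an illustrative sketch (re-using the exemplary PyTorch model interface from above), the patient-specific re-training can load the pre-trained weights and adapt them on the smaller patient-specific data set, typically with a reduced learning rate; file names and hyperparameters are assumptions:

```python
# Sketch: re-train (fine-tune) a pre-trained DNN on the smaller, patient-specific
# training set.  Checkpoint names and hyperparameters are assumptions.
import torch

def retrain_for_patient(model, patient_batches, epochs=5, lr=1e-4):
    model.load_state_dict(torch.load("pretrained_target_voice.pt"))  # selected target voice
    opt = torch.optim.Adam(model.parameters(), lr=lr)                # reduced learning rate
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for vt, audio in patient_batches:    # patient's mouthed excerpt, synchronized
            opt.zero_grad()                  # with the selected target-voice audio
            loss = loss_fn(model(vt), audio)
            loss.backward()
            opt.step()
    torch.save(model.state_dict(), "patient_specific.pt")
```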
For a mobile patient with no laryngeal airflow, such as a patient whose larynx has been surgically removed, embodiment 2 is a preferred embodiment. Like the patient in embodiment 1, such patients also have no residual voice output and are not able to whisper. Therefore, the combination of radar sensing and a video camera to capture lip and facial movements is preferred in this case, too.
The main elements of the corresponding voice prosthesis are shown in
As in embodiment 1, two or more antennas (36) are used to collect reflected and transmitted radar signals that encode the time-varying vocal tract shape. The antennas are placed in proximity to the patient's vocal tract, e.g. under the right and left jaw bone. To keep their position stable relative to the vocal tract, they can be attached directly to the patient's skin. Each antenna can send and receive modulated electromagnetic signals in a frequency band between 1 kHz and 12 GHz, so that (complex) reflection and transmission can be measured. Possible preferred modulations of the signal are: frequency sweep, stepped frequency sweep, pulse, frequency comb, frequency, phase, or amplitude modulation.
In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. For portability the video camera is mounted in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset.
The portable touchscreen device (61) is also the computing device that locally provides the necessary computing power to receive the processed radar signals and the video images from the wireless transmitter (65), run the machine learning algorithm, output the acoustic speech waveforms via the built-in speaker (63), and provide the user interface on the touchscreen. The machine learning algorithm uses a deep neural network to transform the pre-processed radar signals and the stream of video images into an acoustic waveform in real time.
The corresponding method for creating an artificial voice is the same as in embodiment 1.
For a mobile patient with no laryngeal airflow, such as a patient whose larynx has been surgically removed, embodiment 3 is an alternative preferred embodiment. Instead of radar sensing, in this embodiment low-frequency ultrasound is used to characterize the time-varying shape of the vocal tract.
The main elements of the corresponding voice prosthesis are shown in
A low-frequency ultrasound loudspeaker (42) is used to emit ultrasound signals in the range of 20 to 30 kHz that are directed at the patient's mouth and nose. The ultrasound signals reflected from the patient's vocal tract are captured by an ultrasound microphone (45). The ultrasound loudspeaker and microphone are mounted in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset.
With this setup, the complex reflection coefficient can be measured as a function of frequency. The frequency dependence of the reflection or transmission is measured by sending signals in a continuous frequency sweep, or in a series of wave packets with stepwise increasing frequencies, or by sending a short pulse and measuring the impulse response in a time-resolved manner.
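In a simplified, noise-free picture (given here for illustration only), the relation between the emitted signal x(t), the received signal y(t), the complex reflection coefficient R(f), and the impulse response h(t) can be written as

```latex
% X(f), Y(f): Fourier transforms of the emitted and received signals
R(f) = \frac{Y(f)}{X(f)}, \qquad h(t) = \mathcal{F}^{-1}\{R(f)\}
```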
In addition, a video camera (48) captures a video stream of the patient's face, containing information about the patient's lip and facial movements. The video camera is mounted on the same cantilever (62) as the ultrasound loudspeaker and microphone.
As in embodiment 2, the portable touchscreen device (61) is also the computing device. It locally provides the necessary computing power to receive the ultrasound signals converted by the analog-to-digital converter (68) and the video images via the wireless transmitter (65), run the machine learning algorithm, output the acoustic speech waveforms via the built-in speaker (63), and provide the user interface on the touchscreen. The machine learning algorithm uses a DNN to transform the pre-processed ultrasound signals and the stream of video images into an acoustic waveform in real time.
The corresponding method for creating an artificial voice is the same as in embodiments 1 and 2.
For a mobile patient with residual voice output, such as residual phonation, a whisper voice, or a pure whisper without phonation, embodiment 4 is a preferred embodiment. For such a patient, the combination of an acoustic microphone to pick up the residual voice output and a video camera to capture lip and facial movements is preferred.
The main elements of the corresponding voice prosthesis are shown in
A microphone (52) capturing the acoustic signal of the residual voice and a video camera (48) capturing lip and facial movements are placed in front of the patient's face on a cantilever (62) worn by the patient like a microphone headset. The microphone and camera signals are sent to the computing device (58), which runs the machine learning algorithm and outputs the acoustic speech output via the audio amplifier (59) and a loudspeaker (63) that is also mounted on the cantilever in front of the patient's face. The machine learning algorithm uses a DNN to transform the acoustic and video vocal tract signals into an acoustic waveform in real time.
The corresponding method for creating an artificial voice differs from the previous embodiments. Since the residual voice depends strongly on the patient's condition and may even change over time, a patient-specific DNN algorithm is trained for each patient.
An existing speech database is used to obtain audio training data for a target voice that matches the patient in characteristics such as gender, age, and pitch. To create a corresponding body of vocal tract data, the sample text of the audio training data is read by the patient with the same microphone and video camera setup as for the eventual voice prosthesis. As the patient reads the sample text off a display screen, he or she follows the text along with an input stylus and timing cues are recorded. The timing cues are used to synchronize the vocal tract training data with the audio training data.
The combined training data set is used to train the DNN algorithm to transform the patient's vocal tract signals, i.e. residual voice and lip and facial movements, into acoustic speech output. If over time the patient's residual voice output changes enough to degrade the quality of the speech output, the algorithm can be re-trained by recording a new set of vocal tract training data.
Summarizing, at least the following examples have been described above.
EXAMPLE 1. A method, comprising:
EXAMPLE 2. The method of EXAMPLE 1,
EXAMPLE 3. The method of EXAMPLE 1 or 2, further comprising:
EXAMPLE 4. The method of EXAMPLE 3, wherein said synchronizing comprises:
EXAMPLE 5. The method of EXAMPLE 3 or 4, wherein said synchronizing comprises:
EXAMPLE 6. The method of any one of EXAMPLEs 3 to 5, wherein said synchronizing comprises:
EXAMPLE 7. The method of any one of EXAMPLEs 1 to 6,
EXAMPLE 8. The method of any one of EXAMPLEs 1 to 6,
EXAMPLE 9. The method of any one of the preceding EXAMPLEs,
EXAMPLE 10. The method of any one of the preceding EXAMPLEs, further comprising:
EXAMPLE 11. The method of EXAMPLE 10,
EXAMPLE 12. The method of EXAMPLE 10 or 11, further comprising:
EXAMPLE 13. The method of EXAMPLE 12, wherein the one or more sensors are selected from the group comprising: a lip camera; a facial camera; a headset microphone; an ultrasound transceiver; a neck or larynx surface electromyogram; and a radar transceiver.
EXAMPLE 14. The method of any one of EXAMPLEs 10 to 13, further comprising:
EXAMPLE 15. The method of any one of the preceding EXAMPLEs, wherein the patient is on mechanical ventilation through a tracheostomy, has undergone a partial or complete laryngectomy, or suffers from vocal fold paresis or paralysis.
EXAMPLE 16. The method of EXAMPLE 15, wherein the speech output of the reference text is provided by the patient prior to speech impairment.
EXAMPLE 17. A device comprising a control circuitry configured to:
EXAMPLE 18. The device of EXAMPLE 17, wherein the control circuitry is configured to execute the method of any one of the EXAMPLES 1 to 16.
Although the invention has been shown and described with respect to certain preferred embodiments, equivalents and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalents and modifications and is limited only by the scope of the appended claims.
For instance, various examples have been described with respect to certain sensors used to record one or more vocal tract signals. Depending on the patient's condition, residual vocal capabilities, the therapy goals, and the setting, different vocal tract sensors can be used. Preferably, they are unobtrusive and wearable, i.e. light-weight and compact, with low power consumption and wireless operation.
For further illustration, various examples have been described with respect to a trained machine learning algorithm. Depending on the computing power requirements and the available transmission bandwidth, modifications are possible: for example, the trained machine learning algorithm could be deployed locally (i.e., on a mobile computing device) or remotely, i.e., using a cloud computing service. The mobile computing device can be used to connect one or more sensors with a platform executing the machine learning algorithm. The mobile computing device can also be used to output, via a loudspeaker, one or more audio signals including speech output determined based on the machine learning algorithm.
For further illustration, various examples have been described in which multiple configurations of the machine learning algorithm are trained using varying speech characteristics and/or varying articulation characteristics. In this regard, many levels of matching the speech characteristic to the patient characteristic are conceivable: gender, age, pitch, accent or dialect, etc. The matching can be done by selecting from a “library” of configurations of the machine learning algorithm, by modifying an existing configuration, or by custom recording the voice of a “voice donor”.
For still further illustration, the particular type or set of sensors is not germane to the functioning of the subject techniques. Different sensor types are advantageous in different situations:

(i) Lip/facial cameras. A camera recording the motion of the lips and facial features will be useful in most cases, since these cues are available in most disease scenarios, are fairly information-rich (cf. lip reading), and are easy to pick up with a light-weight, relatively unobtrusive setup. A modified microphone headset with one or more miniature CCD cameras mounted on the cantilever may be used. Multiple CCD cameras or depth-sensing cameras, such as cameras using time-of-flight technology, may be advantageous to enable stereoscopic image analysis.

(ii) Radar transceiver. Short-range radar operating in the frequency range between 1 and 12 GHz is an attractive technology for measuring the internal vocal tract configuration. These frequencies penetrate several centimeters to tens of centimeters into tissue and are safe for continuous use at the extremely low average power levels (microwatts) required. The radar signal can be emitted into a broad beam and detected either with a single antenna or in a spatially (i.e. angularly) resolved manner with multiple antennas.

(iii) Ultrasound transceiver. Ultrasound can be an alternative to radar sensing in measuring the vocal tract configuration. At frequencies in the range of 1-5 MHz, ultrasound also penetrates and images the pertinent tissues well and can be operated safely in a continuous way. Ultra-compact, chip-based phased-array ultrasound transceivers are available for endoscopic applications. Ultrasound can also be envisioned to be used in a non-imaging mode.

(iv) Surface EMG sensors. Surface EMG sensors may provide complementary data to the vocal tract shape information, especially in cases where the extrinsic laryngeal musculature is present and active. In those cases, EMG may help by providing information on intended loudness (i.e. adding dynamic range to the speech output) and, more fundamentally, distinguishing speech from silence. The latter is a fundamental need in speech recognition, as the shape of the vocal tract alone does not reveal whether or not acoustic excitation (phonation) is present.

(v) Acoustic microphone. Acoustic microphones make sense as (additional) sensors in all cases with residual voice present. Note that in this context, "residual voice" may include a whispering voice. Whispering needs air flow through the vocal tract, but does not involve phonation (i.e. vocal fold motion). In many cases, picking up a whispered voice, perhaps in combination with observing lip motion, may be enough to reconstruct and synthesize natural sounding speech. In many scenarios, this would greatly simplify speech therapy, as it reduces the challenge from getting the patient to speak to teaching the patient to whisper. Microphones could attach to the patient's throat, under the mandible, or in front of the mouth (e.g. on the same headset cantilever as a lip/facial camera).
For still further illustration, various examples have been described in connection with using a machine learning algorithm to transform vocal-tract signals into audio signals associated with speech. It is not mandatory to use a machine learning algorithm; other types of algorithms may be used for the transformation.
1 anatomical structures involved in phonation: lungs (not shown), trachea, and larynx (“source”)
2 anatomical structures involved in articulation: vocal tract (“filter”)
3 trachea
4 larynx
4a glottis
5 epiglottis
6 pharynx
7 velum
8 oral cavity
9 tongue
10a upper teeth
10b lower teeth
11a upper lip
11b lower lip
12 nasal cavity
13 nostrils
14 esophagus
15 thyroid
16 recurrent laryngeal nerve
(a) Tracheostomy
17 tracheostomy for mechanical ventilation
17a tracheostomy tube
17b inflated cuff
(b) Laryngectomy
18 tracheostoma after laryngectomy
3 trachea
14 esophagus
(c) Recurrent nerve injury
19 laryngeal nerve injury after thyroidectomy
16 recurrent laryngeal nerve
16a nerve injury
(a) Tracheoesphageal puncture (TEP)
20 tracheoesophageal puncture and valve
21 finger
22 vibrations
(b) Esophageal speech
22 vibrations
(c) Electrolarynx
23 electrolarynx
22 vibrations
(a) Step 1a: Creating the audio training data
24 sample text
25 healthy speaker
26 microphone
27 audio training data
(b) Step 1b: Creating the vocal tract training data
28 display with sample text
29 impaired patient
30 vocal tract sensors
31 vocal tract training data
(c) Step 1c: Synchronizing audio and vocal tract training data
27 audio training data
31 vocal tract training data
(d) Step 2: Training the algorithm
27 audio training data
31 vocal tract training data
32 trained machine learning algorithm
(e) Step 3: Using the voice prosthesis
29 impaired patient
30 vocal tract sensors
32 trained machine learning algorithm
33 wireless connection
34 mobile computing device
35 acoustic speech output
(a) Microwave radar sensing
36 radar antenna
37 emitted radar signal
38 backscattered/transmitted radar signal
(b) Ultrasound sensing
39 ultrasound transducer
40 emitted ultrasound signal
41 backscattered ultrasound signal
(c) Low-frequency ultrasound
42 ultrasound loudspeaker
43 emitted ultrasound signal
44 reflected ultrasound signal
45 ultrasound microphone
(d) Lip and facial camera
46 ambient light
47 reflected light
48 video camera
(e) Surface electromyography
49 surface electromyography sensors (for extralaryngeal musculature)
50 surface electromyography sensors (for neck and facial musculature)
(f) Acoustic microphone
51 residual acoustic voice signal
52 acoustic microphone
(a) using elements of speech and MFCCs as intermediate representations of speech
70 vocal tract data: series of frames
71 data pre-processing
72 time series of feature vectors
73 speech recognition algorithm
74 elements of speech: phonemes, syllables, words
75 speech synthesis algorithm
76 mel-frequency cepstral coefficients
77 acoustic waveform synthesis
78 acoustic speech waveform
(b) using MFCCs as intermediate representations of speech
70 vocal tract data: series of frames
71 data pre-processing
72 time series of feature vectors
76 mel-frequency cepstral coefficients
77 acoustic waveform synthesis
78 acoustic speech waveform
79 deep neural network algorithm
(c) End-to-end machine learning algorithm using no intermediate representations of speech
70 vocal tract data: series of frames
71 data pre-processing
72 time series of feature vectors
78 acoustic speech waveform
80 end-to-end deep neural network algorithm
(d) End-to-end machine learning algorithm using no explicit pre-processing and no intermediate representations of speech
70 vocal tract data: series of frames
78 acoustic speech waveform
80 end-to-end deep neural network algorithm
36 radar antennas
48 video camera
53 bedridden patient
54 patient bed
55 power supply
56 radar transmission and receiving electronics
57 signal processing electronics
58 computing device
59 audio amplifier
60 bedside unit
61 mobile computing device with touchscreen
62 cantilever
63 loudspeaker
36 radar antennas
48 video camera
55 power supply
56 radar transmission and receiving electronics
57 signal processing electronics
61 mobile computing device with touchscreen
62 cantilever
63 loudspeaker
64 mobile patient
65 wireless transmitter and receiver
66 portable electronics unit
42 ultrasound loudspeaker
45 ultrasound microphone
48 video camera
55 power supply
57 signal processing electronics
61 mobile computing device with touchscreen
62 cantilever
63 loudspeaker
64 mobile patient
65 wireless transmitter and receiver
66 portable electronics unit
67 ultrasound waveform generator
68 analog-to-digital converter
48 video camera
52 microphone
55 power supply
58 computing device
59 audio amplifier
62 cantilever
63 loudspeaker
64 mobile patient
66 portable electronics unit
69 user interface
Number | Date | Country | Kind |
---|---|---|---|
102020110901.6 | Apr 2020 | DE | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2021/060251 | 4/20/2021 | WO |