This invention relates to identification of disorders by analyzing speech patterns of a subject. Speech patterns can indicate the presence of certain disorders including psychological, neurotraumatic, neurodegenerative, and neurodevelopmental disorders. Using depression as an example, if a person is experiencing a depressive episode, their vocal source, vocal tract, and other motor control components of speech may form certain sounds differently than they otherwise would in the absence of depression. These sounds can indicate whether the subject is experiencing depression. This can be useful in making a diagnosis, especially if the subject is remote and only able to talk with a practitioner via telephone or video.
A system that can identify neuromotor disorders from analyzing a patient's speech can increase accuracy, efficiency, and efficacy in making a diagnosis of the disorder. Changes in speech production that occur as a result of psychomotor slowing or other changes in a user's speech are difficult to detect without detailed analysis of the speech. Moreover, changes in a user's speech due to disorders such as depression can be subtle. Therefore, it can be difficult for a clinician to identify objectively whether a patient's speech indicates the presence of a disorder.
In addition, a system that identifies changes and problems in the way that a user's vocal tract articulates sound can improve detection of a neuromotor disorder and provide greater insight into the disorder. For example, depression can cause a person's vocal tract to produce sounds differently, e.g. to slow, slur, or produce less acoustic energy in particular ways or parts of speech. A system that can not only analyze the changes in the acoustic aspects of the speech, but can analyze the motion or position of elements of the vocal tract can provide a more accurate determination of the presence of a disorder such as depression.
In an embodiment, these and other advantages may be achieved, for example, by a method for measuring neuromotor coordination from speech. The method may include receiving an audio recording that includes spoken speech and computing feature coefficients from at least a portion of the spoken speech in the audio recording, the feature coefficients representing at least one characteristic of the at least a portion of the spoken speech in the audio recording. One or more vocal tract variables may be computed from the feature coefficients. The one or more vocal tract variables may represent a physical configuration of a vocal tract associated with at least one of the one or more sounds. The method may also include determining a measurement of a disorder based at least in part on a degree of correlation between two or more of the vocal tract variables.
One or more additional features may be included. For example, the feature coefficients may be cepstral coefficients which represent an audio power spectrum of the portion of the spoken speech, or may be formants. The vocal tract variables may be generated by a neural network, and the feature coefficients may be inputs to the neural network. The neural network may include stored parameters, which may include data from the Wisconsin X-Ray Microbeam database representing vocal tract variables associated with audio data.
The method may also include estimating a glottal state of the vocal tract, which may include performing acoustic measurements of the audio signal and/or providing the feature vectors to a neural network trained to estimate glottal vocal tract variables.
The method may also display an image of a vocal tract on a display device. The display device may be configured to play the audio recording and simultaneously animate the image of the vocal tract to display the physical configuration of the vocal tract of the speaker to provide visualization of where an articulatory deviation may occur due to a disorder.
The method may also associate the vocal tract variables with an utterance within the audio recording. Also, determining the measurement of the disorder may include correlating time-dependent functions of the vocal tract variables. The time-dependent correlation functions can include, for example, a channel-delay correlation matrix of the vocal tract variables and/or cepstral coefficients.
An eigenspectrum of the channel-delay correlation matrix can be generated. Magnitudes of eigenvalues within the eigenspectrum can indicate a disorder that affects speech. Thus, these magnitudes can be used in determining the measurement of the disorder.
Determining the measurement of the disorder may include computing changes in articulator kinematics as determined through phasing of coupled oscillatory models of articulatory gestures derived from vocal tract variables.
In another embodiment, a system for measuring neuromotor coordination from speech includes a processor configured to execute instructions stored on a non-transitory medium. The instructions may cause the processor to receive an audio recording that includes spoken speech; compute feature coefficients from at least a portion of the spoken speech in the audio recording, the feature coefficients representing at least one characteristic of the at least a portion of the spoken speech in the audio recording; compute, from the feature coefficients, one or more vocal tract variables representing a physical configuration of a vocal tract associated with at least one of the one or more sounds; and determine a measurement of a disorder based at least in part on a degree of correlation between two or more of the vocal tract variables.
The system may include one or more additional features. For example, the feature coefficients may be cepstral coefficients, which represent an audio power spectrum of the portion of the spoken speech, or may be formants that represent vocal tract resonances. The vocal tract variables may be generated by a neural network or other machine learning systems, and the feature coefficients may be inputs to the neural network or other machine learning systems. The neural network may include stored parameters, which may include data from the Wisconsin X-Ray Microbeam database representing vocal tract variables associated with audio data.
The system may also estimate a glottal state of the vocal tract, which may include performing acoustic measurements of the audio signal and/or providing the feature vectors to a neural network trained to estimate glottal vocal tract variables.
The system may also display an image of a vocal tract on a display device. The display device may be configured to play the audio recording and simultaneously animate the image of the vocal tract to display the physical configuration of the vocal tract of the speaker.
The system may also associate the vocal tract variables with an utterance within the audio recording. Also, determining the measurement of the disorder may include correlating time-dependent functions of the vocal tract variables. The time-dependent correlation functions can include, for example, a channel-delay correlation matrix of the vocal tract variables and/or cepstral coefficients.
An eigenspectrum of the channel-delay correlation matrix can be generated. Magnitudes of eigenvalues within the eigenspectrum can indicate a disorder that affects speech. Thus, these magnitudes can be used in determining the measurement of the disorder.
Determining the measurement of the disorder may include computing changes in articulator kinematics as determined through phasing of coupled oscillatory models of articulatory gestures derived from vocal tract variables.
Other features and advantages of the invention are apparent from the following description, and from the claims.
Like reference numbers in the drawings depict like elements.
Referring to
Conditions that affect the user's 104 speech include, but are not limited to, depression, autism, attention deficit hyperactivity disorder, strokes, oral cancer, laryngeal cancer, Huntington's disease, dementia, amyotrophic lateral sclerosis (ALS), or other types of apraxia or dysarthria. In embodiments, based on biomarkers in the waveform 102, system 100 can detect a speech disorder and/or provide a determination as to the condition that may be causing the speech disorder. This document will use depression as an example. However, it should be appreciated that system 100 can be configured to detect and make a determination as to the presence or cause of any type of speech disorder, including but not limited to those listed above.
In one or more embodiments, in addition to making a determination as to the presence and source of a speech condition, system 100 may include or provide information to a display 106 that can provide visual information or animations of the user's 104 speech to a clinician. This may aid the clinician in assessing and treating the user's 104 disorder.
Referring to
The system 100 may include an audio input 201 that receives the audio waveform 102. Depending on the format of the waveform 102, the system 100 may optionally contain an analog-to-digital converter (ADC) 202 to sample and convert the waveform 102 into a digital waveform if, for example, waveform 102 is not provided in digital format or if waveform 102 needs to be resampled.
The system 100 may include a feature extractor 204 that receives the digital version of waveform 102 and produces feature coefficients YM representing characteristics of at least a portion of the acoustic waveform 102. For example, the feature coefficients YM may include formants, mel-frequency cepstral coefficients, log-frequency band energy coefficients, other acoustic energy coefficients, or a combination thereof.
The feature coefficients YM produced by the feature extractor 204 are numerical vectors that each correspond to a time segment (e.g. a segment 205) of the waveform 102. In an example, the feature extractor 204 samples the waveform 102 at a sampling rate of 100 Hz and produces a sequence of feature coefficient vectors YM, one for each sample of the waveform 102. The elements of each feature vector are numerical representations of characteristics of the corresponding audio segment.
In an embodiment, the feature extractor 204 includes a short-time spectral analyzer that accepts the audio waveform 102, performs time windowing, Fourier analysis, and summation of energy over the ranges of the frequency bands.
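By way of illustration only, the following Python sketch shows one possible implementation of the feature extractor 204 computing mel-frequency cepstral coefficients at a 100 Hz frame rate using the librosa library. The function name, window sizes, and number of coefficients are assumptions made for the example and do not limit the embodiments.

```python
# Illustrative sketch of a feature extractor (element 204): mel-frequency
# cepstral coefficients computed at a 100 Hz frame rate. All names and
# parameter values are example assumptions, not part of the claimed system.
import librosa
import numpy as np

def extract_feature_coefficients(waveform: np.ndarray, sample_rate: int,
                                 n_mfcc: int = 13) -> np.ndarray:
    """Return an array of shape (n_frames, n_mfcc): one feature vector Y per 10 ms frame."""
    hop = sample_rate // 100      # 10 ms hop -> 100 feature vectors per second
    win = sample_rate // 40       # 25 ms analysis window
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc,
                                hop_length=hop, n_fft=win)
    return mfcc.T                 # rows are the feature coefficient vectors YM
```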
The system 100 also includes a vocal tract variable generator 208 that receives the feature coefficients YM and produces vocal tract variable vectors TVN. The vocal tract variables are numerical representations of a state of the user's 104 vocal tract 108 during articulation of the sounds in the waveform 102, specified in terms of, for example, the time-varying place (e.g. location along the oral cavity) and the time-varying manner (e.g. degree of constriction at that location) of articulation. For example, the vocal tract variables may include, but are not limited to, the constriction degree and location of the lips, tongue tip, tongue body, velum, and glottis. Other vocal tract variables that can be included may describe features or positions of the nasal cavity, buccal cavity, nostrils, epiglottis, trachea, hard palate, or any other element of a person's vocal tract.
In embodiments, these TV vectors provide a way for the system to derive biomarkers that are not constrained by the formant representation, but rather use the entire speech signal or a portion of the speech signal that is not directly mapped to the feature coefficient vectors. The vocal tract variable generator 208 may produce a TV vector for each sequence of feature coefficients Y that it receives, i.e. a TV vector for each sample of the waveform 102. Additionally or alternatively, the vocal tract variable generator 208 may generate a TV vector associated with a group of feature coefficients, i.e. a one-to-many mapping of TV vectors to feature coefficient vectors. For example, assuming that the user 104 articulated the word "No," the feature extractor may produce a feature coefficient vector Y for each sampled segment 205 of the waveform. Thus, there may be a sequence of feature coefficient vectors YN associated with the "N" sound of the word "No," and another sequence of feature coefficient vectors YO associated with the "O" sound of the word "No." In an embodiment, the vocal tract variable generator 208 may produce a TV vector that corresponds to the "N" sound (and represents the vocal tract position during utterance of the "N" sound) from the sequence of feature coefficient vectors YN, and another TV vector that corresponds to the "O" sound (and represents the vocal tract position during utterance of the "O" sound) from the sequence of feature coefficient vectors YO.
In some embodiments, the vocal tract variable generator 208 may use samples from the waveform, in place of or in addition to the feature coefficient vectors YM, to generate the TV vectors, as indicated by dotted line 206.
The vocal tract variable generator may be implemented by a neural network. In embodiments, the neural network may be trained using a database of vocal tract training data such as the Wisconsin X-Ray Microbeam (XRMB) database, which includes naturally spoken utterances along with XRMB cinematography of the mid-sagittal plane of the vocal tract with pellets placed at points along the vocal tract. In embodiments, the database includes trajectory data (referred to as pellet trajectories) recorded for the individual articulators: e.g. Upper Lip, Lower Lip, Tongue Tip, Tongue Blade, Tongue Dorsum, Tongue Root, Lower Front Tooth (Mandible Incisor), Lower Back Tooth (Mandible Molar). The TV vectors derived from these data may represent the way the articulators move during an utterance as opposed to the absolute positions of the individual articulators. Because the physical X-Y positions of the pellets may be closely tied to the anatomy of the individual speaker, the TV vectors may provide relative measures of the articulators that reduce or remove dependence on the speaker's anatomy. Thus, the TV vectors may specify the salient features of the vocal tract area function more directly than the pellet trajectories and are relatively speaker independent. In embodiments, the pellet trajectories are converted (by the vocal tract variable generator 208 or prior to training the vocal tract variable generator 208) to TV trajectories using geometric transformations.
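As an illustrative sketch only, the following PyTorch example shows one possible form of a network that maps a context window of feature coefficient vectors Y to a TV vector. The topology, context window length, and number of vocal tract variables are assumptions made for the example; the embodiments do not prescribe a particular architecture.

```python
# Illustrative sketch only: a small feed-forward network mapping a context
# window of feature coefficient vectors Y to a vocal tract variable vector TV.
# Layer sizes, window length, and TV count are example assumptions.
import torch
import torch.nn as nn

class TVGenerator(nn.Module):
    def __init__(self, n_mfcc: int = 13, context: int = 11, n_tvs: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mfcc * context, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_tvs),   # e.g. lip aperture, tongue tip constriction degree, ...
        )

    def forward(self, y_window: torch.Tensor) -> torch.Tensor:
        # y_window: (batch, context, n_mfcc); flatten the window into one input vector
        return self.net(y_window.flatten(start_dim=1))

# Training would pair acoustic frames with TV trajectories derived from the XRMB
# pellet data, e.g. by minimizing nn.MSELoss() between predicted and measured TVs.
```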
The system 100 may optionally include a glottal estimator 220 which may receive the feature coefficient vectors YM and produce glottal vocal tract variable vectors TVGQ that estimate articulation by or near the user's 104 glottis. This can be helpful to provide a more accurate model of the glottis if, for example, the audio waveform 102 was recorded without sensors placed near the user's 104 glottis. Glottal estimator 220 may use an aperiodicity, periodicity, and pitch detector that estimates the proportion of periodic energy and aperiodic energy in the speech signal 102 and/or the feature coefficient vectors Y along with the pitch period for the periodic component. In embodiments, glottal estimator 220 uses a time domain approach and is based on the distribution of the minima of the average magnitude difference function of the speech signal. If needed or desired, the glottal vocal tract variable vectors TVGQ can be used in conjunction with the vocal tract variables TVM to enhance the accuracy of glottal-related vocal tract variables. In some embodiments, the glottal estimator 220 may comprise a neural network or other machine learning module to produce the glottal vocal tract variable vectors.
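By way of example only, the following sketch illustrates a simplified average magnitude difference function (AMDF) pitch and periodicity estimate of the kind the glottal estimator 220 may employ. The frequency search range and the periodicity normalization are assumptions made for the example.

```python
# Simplified AMDF-based pitch and periodicity estimate for one analysis frame.
# Deep AMDF minima at lag T indicate periodic (voiced) speech with period T.
import numpy as np

def amdf_pitch(frame: np.ndarray, sample_rate: int,
               f_min: float = 60.0, f_max: float = 400.0):
    """Return (pitch_hz, periodicity) for one frame of speech samples."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    lags = np.arange(lag_min, lag_max)
    amdf = np.array([np.mean(np.abs(frame[:-lag] - frame[lag:])) for lag in lags])
    best = np.argmin(amdf)
    # Normalize the depth of the best minimum against the frame's mean level;
    # values near 1 suggest strongly periodic energy, near 0 aperiodic energy.
    periodicity = 1.0 - amdf[best] / (2.0 * np.mean(np.abs(frame)) + 1e-9)
    return sample_rate / lags[best], float(np.clip(periodicity, 0.0, 1.0))
```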
As noted above, the system may correlate the vocal tract activity with the sounds in the speech waveform 102. This is useful because human speech may contain hysteresis (for example, the way a sound is physically formed by the vocal tract can depend on the way the previous sound was physically formed). It can also be useful to correlate the vocal tract activity with the original waveform 102 so that they can be animated and played back in a time-synchronous manner. To correlate the timing of the vocal tract activity with the waveform 102, system 100 may include a time delay correlation module 222 that receives the feature coefficient vectors Y, the waveform 102, and/or the vocal tract variable vectors TV and performs a time delay correlation.
For each speech signal 102, the time delay correlation module 222 generates a channel-delay correlation matrix TDCM from the TV vectors and/or the feature coefficients Y using a time-delay embedding at a constant delay scale. For example, if the sample rate was 100 Hz, a delay scale of 7 samples would introduce delays into the signals in 70 ms increments. The time delay correlation matrix provides information about the mechanisms underlying the coordination level. Each time delay correlation matrix may have a dimensionality of (MN × MN), where M is the number of channels and N is the number of time delays per channel.
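For illustration only, the following sketch constructs a channel-delay correlation matrix from a set of TV trajectories by time-delay embedding each channel at a constant delay scale. The delay scale and number of delays are example values (7 samples at a 100 Hz rate corresponds to 70 ms increments, as described above).

```python
# Illustrative sketch: channel-delay correlation matrix from TV trajectories.
# Each channel is embedded at delays 0, 7, 14, ... samples, and the correlation
# of the stacked, delayed channels is computed. Values shown are examples only.
import numpy as np

def channel_delay_correlation(tv: np.ndarray, n_delays: int = 15,
                              delay_scale: int = 7) -> np.ndarray:
    """tv: (M, n_samples) array of TV trajectories.
    Returns the (M*N, M*N) correlation matrix, M channels with N delays each."""
    n_channels, n_samples = tv.shape
    max_shift = (n_delays - 1) * delay_scale
    rows = []
    for channel in tv:
        for d in range(n_delays):
            shift = d * delay_scale
            rows.append(channel[shift:n_samples - max_shift + shift])
    embedded = np.vstack(rows)        # shape (M*N, n_samples - max_shift)
    return np.corrcoef(embedded)      # channel-delay correlation matrix TDCM
```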
The system 100 also includes a disorder identification module 214 that processes the TDCM to determine whether a disorder is present. The disorder identification module 214 may generate a rank-ordered eigenspectrum 216 from the TDCM. In embodiments, the eigenspectrum may be an MN-dimensional feature vector. The eigenvalues in the spectrum may be ranked in order of magnitude (e.g. the rank 1 eigenvalue is the largest and the rank MN eigenvalue is the smallest).
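A minimal sketch of generating the rank-ordered eigenspectrum from the channel-delay correlation matrix follows; because the matrix is symmetric, a symmetric eigensolver may be used.

```python
# Illustrative sketch: rank-ordered eigenspectrum of the (symmetric) TDCM.
import numpy as np

def rank_ordered_eigenspectrum(tdcm: np.ndarray) -> np.ndarray:
    eigenvalues = np.linalg.eigvalsh(tdcm)   # real eigenvalues, ascending order
    return eigenvalues[::-1]                 # rank 1 (largest) first, rank MN last
```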
The disorder identification module may process the eigenspectra to determine a degree of correlation between two or more of the vocal tract variables. This degree of correlation may represent correlation of phase, rise time, fall time, slope, or other time-based characteristics of the vocal tract variables, and/or may include correlation of amplitude, peak-to-peak values, or other magnitude-based characteristics of the vocal tract variables. The degree of correlation between vocal tract variables can indicate the presence of a speech irregularity that may be caused by a neuromotor disorder. In embodiments, the disorder identification module may include functions that measure the degree of correlation between vocal tract variables by processing the vocal tract variables directly or by processing the eigenspectrum.
The eigenvalues may, in an embodiment, be proportional to the amount of correlation in the direction of their associated eigenvectors and can be used to identify a disorder. For example, depressed speech has few eigenvalues with significant magnitudes. Therefore, depressed speech can be identified by evaluating the eigenspectrum to determine if the recorded speech that generated the eigenspectrum includes the markers for depressed speech. One of skill in the art will recognize that depressed speech is used merely as an example, and that the eigenspectrum can be evaluated by the disorder identification module 214 to determine if the recorded speech matches markers for other types of disorders. The use of eigenspectra is only one of many ways to represent a change in vocal tract variable dynamics. For example, based on vocal tract variables, one can estimate the phasing relation across articulatory gestures as determined by a custom implementation of coupled oscillator planning and the associated Task Dynamics model of speech motor control to generate relevant speech kinematics. See, for example, A. C. Lammert et al., A Coupled Oscillator Planning Model Account of the Speech Articulatory Coordination Metric With Applications to Disordered Speech, 12th International Seminar on Speech Production, which is incorporated herein by reference in its entirety.
Yet another example of an approach to measure changes in vocal tract variable dynamics involves entropy measures of system dynamics.
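As one illustration of such a measure (an assumption made for the example rather than a prescribed formulation), a normalized Shannon entropy of the eigenspectrum indicates whether correlation is concentrated in a few directions or spread across many:

```python
# Illustrative only: normalized Shannon entropy of the eigenspectrum. Low values
# mean correlation is concentrated in a few eigen-directions (few significant
# eigenvalues); high values mean it is spread across many directions.
import numpy as np

def eigenspectrum_entropy(eigenvalues: np.ndarray) -> float:
    p = np.clip(eigenvalues, 0.0, None)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(len(eigenvalues)))  # in [0, 1]
```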
In embodiments, disorder identification module 214 may be implemented as a neural network. It can be trained with model eigenvalues that identify a particular disorder, such as depression. One skilled in the art will recognize that neural networks and training models for identifying a disorder from vocal characteristics may be complex. They may require not only the positions of articulatory features of the elements of the vocal tract, but also the transitory movements of those elements as they transition from sound to sound. For example, the previous sound and position of the vocal tract elements may affect articulation of the next sound and/or the positions that the vocal tract elements pass through to reach the next sound. The model may need to include such information so that the system can provide an accurate physical configuration of the vocal tract over time.
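As an illustrative sketch only, a simple classifier may be trained on eigenspectrum feature vectors; here a logistic regression stands in for the neural network described above, and the training arrays are hypothetical placeholders.

```python
# Illustrative sketch: a simple classifier over eigenspectrum feature vectors.
# A logistic regression stands in for the neural network described above;
# `eigenspectra` and `labels` are hypothetical training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_disorder_classifier(eigenspectra: np.ndarray, labels: np.ndarray):
    """eigenspectra: (n_recordings, MN) rank-ordered eigenvalue vectors;
    labels: 1 if the recording was scored as exhibiting the disorder, else 0."""
    model = LogisticRegression(max_iter=1000)
    model.fit(eigenspectra, labels)
    return model

# model.predict_proba(new_eigenspectrum.reshape(1, -1)) then yields a
# probability-like score for the presence of the disorder.
```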
In general, as described above, the system 100 is configured to analyze the speech of a user 104 and make a determination, based on the speech, as to whether a disorder is present. The system 100 can use feature coefficients Y representing qualities of the audio recording, TV vectors representing articulation of the user's 104 vocal tract, or both in the analysis to determine if a disorder is present. This provides advantages in that inclusion of the TV vectors, for example, can produce a more accurate determination of whether a disorder is present. Also, information about articulation of the vocal tract can be presented to a clinician for further analysis.
Referring to
In box 302, the system 100 may receive an audio recording (e.g. waveform 102) having one or more channels that include speech spoken by a user 104. In box 304, the system 100 may sample and extract audio features from the audio recording. Extracting the audio features may include generating formants (box 306), generating cepstral coefficients (box 307), or generating other variables that represent the audio within the recording.
In box 308, the system 100 may generate TV vectors that represent articulation and position of elements of the speaker's vocal tract. These variables may indicate the position and/or relative position or movement of articulatory vocal elements such as the lips, teeth, vocal folds, etc. In some embodiments, the system 100 may estimate TV vectors (box 316) related to glottal articulatory elements in the vocal tract.
In box 310, the system 100 may generate a time correlation matrix that provides time and delay information in relation to the TV vectors. The time correlation matrix may be useful in capturing temporal information related to the dynamics of the articulation of the vocal tract. In box 311, the system 100 may generate an eigenspectrum having eigenvalues that represent the user's 104 speech and can be used to identify, from the speech, whether a disorder is present. In box 312, the system may determine whether the speech indicates the presence of a possible mental disorder.
In box 314, the system 100 may display its findings regarding the presence of a mental disorder to a clinician. The system 100 may also provide an animation displaying the operation of the user's 104 vocal tract as the audio recording 102 is played.
Referring to
A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent, and thus can be performed in an order different from that described.
This invention was made with Government support under Grant No. FA8702-15-D-0001 awarded by the U.S. Air Force, and under Grant No. 1514544 awarded by the National Science Foundation. The Government has certain rights in the invention.