The present disclosure relates to automated detection of speech-based cognitive impairment symptoms, and in particular the use of domain adaptation strategies to extend the use of datasets across languages.
Generally, datasets of appropriate quality and size are necessary when employing machine learning (ML) in the healthcare domain, due to the high cost of errors and the need to ensure fair and unbiased decision-making.
Studies have considered machine learning-based approaches to detection of cognitive impairment, or of speech impairment, such as aphasia, as a symptom of cognitive impairment. However, these studies have been largely restricted to single-language settings, likely due to the dearth of medical speech datasets in multiple languages. As a result, much of the prior work on computational methods for detecting signs of aphasia, for example, focuses on ML models developed for a single language. Where multiple languages or cross-language evaluation have been considered, such studies were limited to non-speech-related features or to the patterns of a single feature.
In drawings which illustrate by way of example only embodiments of the present application,
Multi-language speech datasets are scarce and often have small sample sizes in the medical domain. This has an adverse impact on the development of ML-based approaches to detection of speech impairment as a symptom of certain types of cognitive impairment, particularly in the case of speakers of low-resource languages for which there are only small datasets of examples of impaired speech. As those skilled in the art understand, larger and more diverse datasets are required to produce an ML model with an acceptable error rate—due to the consequences of errors in diagnosis—and to reduce the effect of bias.
To mitigate this problem of data availability, various solutions have been proposed including the creation of novel sources of data, developing data-efficient algorithms, and employing domain adaptation from low-resource domains to resource-rich domains. For instance, Li et al. (Bai Li, Yi-Te Hsu, and Frank Rudzicz. Detecting dementia in Mandarin Chinese using transfer learning from a parallel corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1991-1997, 2019) used paired samples of movie and television subtitles data in Mandarin and English to train a regression model between independently engineered features for detection of dementia. However, this technique relies on paired data which is not always available, particularly for low-resource languages.
As can be seen from the discussion below, it was surprisingly discovered that out-of-domain, unpaired, single-speaker, healthy speech data may be used effectively to adapt an ML classifier model across languages, using the unpaired healthy speech data to train a mapping between a feature set in a source language (e.g., a low-resource language with sparser in-domain sample data) and in a target language (e.g., a language with richer resources available, with more plentiful in-domain data). This may prove particularly useful in the detection of speech-based cognitive impairment symptoms, because this approach generalizes more readily not only to other languages, but also to other use cases where paired data may not be available to train an ML model.
A high-level illustration of the system and process is depicted in
The healthy speech data may be considered to be out of domain, in that it is not necessarily collected for a medical or diagnostic application; rather, it may be sourced from any suitable corpus providing speech samples in the different languages. As will be seen from the examples below, the cross-language training dataset 25 may also include in-domain, impaired speech data in the source and target languages as well. However, the amount of impaired speech data in the cross-language training dataset 25 may be limited to only a small proportion of the dataset; for instance, less than 10%, and even less than 5%, as can be seen from the examples below. In both the ML classifier and domain adaptation training, both the healthy speech and impaired speech data may be unpaired data, again as can be seen in the examples below.
Input speech data from a speaker in a source language, here in the form of a transcript 10, is featurized to extract a feature set 15 based on linguistic elements of speech (step 60). The feature set 15 is provided as input to a domain adaptation system 20, which maps the input feature set 15 to a mapped feature set (step 65), which is then provided as input to the classifier system 30 executing the ML classifier model (step 70). The classifier system 30 outputs a result 40, classifying the input mapped feature set as healthy speech or impaired speech (step 75). It will be understood by those skilled in the art that the “healthy” speech classification may be considered as a “non-impaired” speech classification—that is to say, speech that is not identified by the classifier system 30 as impaired speech.
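A minimal sketch of this inference flow follows, in Python; the featurization callable, the fitted mapping object, and the trained classifier are assumptions standing in for the feature extraction step, the domain adaptation system 20, and the classifier system 30, and the transform(Xs=...) call follows the convention of the optimal transport library used in the examples below.

```python
import numpy as np

def classify_transcript(transcript, featurize, ot_map, clf):
    """Classify a source-language transcript as healthy or impaired speech.

    featurize: callable returning the linguistic feature set 15 (hypothetical)
    ot_map:    fitted domain adaptation mapping (system 20)
    clf:       classifier trained in the target language (system 30)
    """
    feats = np.asarray(featurize(transcript)).reshape(1, -1)  # step 60: featurize
    mapped = ot_map.transform(Xs=feats)                       # step 65: map features
    label = clf.predict(mapped)[0]                            # step 70: classify
    return "impaired" if label == 1 else "healthy"            # step 75: result 40
```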
These systems and methods may be advantageously implemented in a data processing system, such as the example system depicted in
Individual client systems 150 may be computer systems in clinical or consultation settings, communicating with the remote service 130 over a wide area network (e.g., the Internet). It is contemplated that users may implement clinic or patient management software locally, and that preferably, personally identifying information (PII) of a patient—which can include recorded utterances—is stored securely, and even locally, to avoid accidental dissemination of PII by third-party services. This is reflected in the example data processing system of
The clinical system 110 in this example is configured to receive the recorded speech (1), convert the recorded speech to text using a suitable speech recognition module 112 as would be known to those skilled in the art, and to provide the recognized speech (2) to a feature extraction module 114 which is configured to recognize the linguistic features of interest for use in classification, and to generate the feature set that will be used as input to a ML classifier system. In some implementations, a transcript of the subject’s speech may be produced manually, and the featurization may be carried out manually as well. The generated feature set is then provided (3) to the remote analysis service 130; this feature set data may be provided to the remote analysis service 130 anonymously, for example identified only using a patient identification number.
The remote analysis service 130 may implement both the domain adaptation system 20 and the classifier system 30 described with reference to
The various services and systems described above may be implemented using networked computer systems. The configuration of such servers and systems, including required processors and memory, operating systems, network communication subsystems, and the like, will be known to those skilled in the art. Communications between various components and elements of the data processing system may occur over private or public connections, preferably with adequate security safeguards as are known in the art. In particular, if communications take place over a public network such as the Internet, suitable encryption is employed to safeguard the privacy of data exchanged between the various components of the data processing system.
In the implementation and examples described below, the domain adaptation system was trained to effect cross-language transfer of an aphasic speech detection ML model trained in English (a resource-rich domain, with a larger available set of aphasic speech samples) to lower-resource languages, namely French and Mandarin. Aphasia is a form of speech impairment that affects speech production and/or comprehension. It typically occurs due to brain injury, most commonly from a stroke. Evaluation of speech is an important part of diagnosing aphasia and identifying sub-types. Aphasic speech exhibits several common patterns; e.g., omitting short words (“a”, “is”), using made-up words, etc. While it has been shown to be possible to detect aphasic speech using ML from patterns of linguistic features in spontaneous speech, the prior art has mostly been restricted to a single language, as noted above.
In implementation, it was found that by featurizing speech to identify distributions of linguistic elements in speech, and by employing domain adaptation to correlate or map the probability distribution of the linguistic elements from a source language to a target language, ML models developed using robust data in the target language could be effectively used to classify speech as healthy or impaired (in the examples, healthy or aphasic) in a source language, particularly when the source and target languages are significantly different (e.g., English versus Mandarin). In the examples described below, various optimal transport (OT) systems were employed. It was also found that in featurizing speech using distributions of linguistic elements of speech, rather than the linguistic elements themselves, the need for paired data to train either the ML classifier models or the domain adaptation systems was avoided or reduced. In these examples, distributions of selected parts of speech (POS) were employed as the feature set, as described below.
The use of different OT domain adaptation strategies was tested for the detection of aphasic speech in French and Mandarin speakers. A first set of speech datasets was obtained from AphasiaBank, a database of multimedia interactions for the study of communication in aphasia, maintained by the TalkBank Project and available at aphasia.talkbank.org (Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. AphasiaBank: Methods for studying discourse. Aphasiology, 25(11):1286-1307, 2011). In the dataset, aphasic speakers have various subtypes of aphasia including:
In these datasets, all participants perform multiple speech-based tasks, such as describing pictures, story-telling, free speech and discourse. All such tasks were manually transcribed into single transcripts for analysis of OT domain adaptation systems, following the CHAT protocol (Nan Bernstein Ratner. Brian MacWhinney, The CHILDES Project: Tools for analyzing talk (Hillsdale, NJ: Erlbaum, 1991). Language in Society, 22(2):307-313, 1993). The speech samples from AphasiaBank were classified into two classes, healthy and aphasic, where the aphasic class encompasses all subtypes mentioned above, using extracted linguistic features. The AphasiaBank transcripts thus constitute unpaired data, in that there are no corresponding transcripts in different languages for a single speaker; nor are there corresponding healthy and aphasic transcripts for a single speaker.
A larger dataset of multi-lingual transcripts, in English, French, and Mandarin, was used to train the domain adaptation systems. These transcripts were obtained from a dataset of TED talks (Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214-2218, 2012). In total, there were recordings available for 1178 talks, with various speaker accents and styles. To avoid biasing the domain adaptation system with paired data, overlap between speech transcripts of English and French/Mandarin was removed by dividing the talks into two sets and ensuring that the English transcripts for training the domain adaptation systems were obtained from the first set, while those for French/Mandarin were obtained from the second set. Further, experiments were performed to validate that this choice did not significantly affect results: it was found that using either fully paired or fully unpaired (as described here) data yielded results that were not statistically significantly different.
Additionally, similar to the methodology of Li et al. (2019), a larger dataset was created by dividing each narration into segments of 25 consecutive utterances (in this case, sentences or fully-coherent units of speech), to produce a larger number of transcripts. This segment length was chosen to maximize the data available, because it was observed that the features stabilized at this number of utterances. Differences in values of the eight parts of speech (POS) features (nouns, verbs, subordinating conjunctions, adjectives, adverbs, coordinating conjunctions, determiners, and pronouns, using definitions provided by Universal Dependencies, a collaborative project available at universaldependencies.org) taken from speech samples were compared for transcript lengths of 5, 25, 50, 75 and 100 utterances each. t-tests were computed between the features, and it was found that while 5 out of 8 features were significantly different between transcript lengths of 5 and 25, they stabilize for lengths greater than or equal to 25, i.e., there is no significant difference between lengths of 25 and 50 (lowest p-value 0.22), 50 and 75 (lowest p-value 0.32), or 75 and 100 (lowest p-value 0.59).
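A minimal sketch of this segmentation step follows, assuming utterances are provided as a list of strings; whether trailing partial segments were retained is not stated, so this version drops them.

```python
def segment_utterances(utterances, segment_len=25):
    """Divide a narration's utterances into consecutive segments of
    `segment_len` utterances, each treated as one transcript.
    Trailing partial segments are dropped (an assumption)."""
    return [utterances[i:i + segment_len]
            for i in range(0, len(utterances) - segment_len + 1, segment_len)]

# Example: a 60-utterance narration yields two 25-utterance transcripts.
segments = segment_utterances([f"utterance {k}" for k in range(60)])
```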
The statistics for the samples used in each language are shown in Table 1 below:
In Table 1, the number of participants (speakers) is indicated in parentheses.
Pre-processing and feature extraction
Since the transcripts provided in AphasiaBank include annotations for features such as repetitions, markers for incorrect word usage, etc., it was necessary to remove these annotations prior to use. The pylangacq library (Jackson L. Lee, Ross Burkholder, Gallagher B. Flinn, and Emily R. Coppess. Working with CHAT transcripts in Python. Technical Report TR-2016-02, Department of Computer Science, University of Chicago, 2016) was used for this purpose due to its capability to handle CHAT transcripts. Additional pre-processing steps included stripping the utterances of punctuation before POS-tagging.
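A brief sketch of this pre-processing follows; the file name is hypothetical, the reader interface follows pylangacq's documented API, and "PAR" is the CHAT participant code for the subject.

```python
import re
import pylangacq

# Parse a CHAT transcript; pylangacq resolves CHAT annotations
# (repetitions, error markers, etc.) when tokenizing.
reader = pylangacq.read_chat("subject01.cha")  # hypothetical file name
words = reader.words(participants="PAR")       # keep only the subject's words

# Strip punctuation tokens before POS-tagging.
tokens = [w for w in words if re.search(r"\w", w)]
text = " ".join(tokens)
```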
The proportions of the eight POS categories identified above were extracted over the whole of each individual transcript. Although aphasic speakers in AphasiaBank perform one speech task more than control speakers (providing details regarding their stroke), these eight features are agnostic to the total length and content of transcripts, and rely more on sentence complexity. These simple features were used because they are general and have been identified as important in prior work (Kathleen C Fraser, Frank Rudzicz, and Elizabeth Rochon. Using text and acoustic features to diagnose progressive aphasia and its subtypes. In INTERSPEECH, pages 2177-2181, 2013; Li et al. (2019); S Law, A Kong, L Lai, and C Lai. Production of nouns and verbs in picture naming and narrative tasks by Chinese speakers with aphasia. Procedia, social and behavioral sciences, 94, 2013) across languages. These features were extracted from all languages in the AphasiaBank samples.
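As an illustration, the eight POS proportions may be computed with a Universal Dependencies-style tagger; the use of spaCy here, and normalization by total non-punctuation token count, are assumptions rather than details drawn from the examples.

```python
from collections import Counter
import spacy

# The eight Universal Dependencies POS categories used as features.
POS_TAGS = ["NOUN", "VERB", "SCONJ", "ADJ", "ADV", "CCONJ", "DET", "PRON"]

nlp = spacy.load("en_core_web_sm")  # one tagger model per language

def extract_pos_features(text):
    """Return the proportion of each of the eight POS categories,
    normalized by total token count so the features are length-agnostic."""
    counts = Counter(tok.pos_ for tok in nlp(text) if not tok.is_punct)
    total = sum(counts.values()) or 1  # guard against empty transcripts
    return [counts[tag] / total for tag in POS_TAGS]
```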
To analyse the variance in features across languages, it was determined whether the features differ significantly between healthy and aphasic speakers in each language. It was observed that every feature varies significantly between healthy and aphasic speakers of English and Mandarin. It was therefore expected that raw, non-adapted cross-language transfer of models trained on English speech to Mandarin would lead to low performance. Table 2 sets out significant p-values corresponding to t-tests of the eight POS features between English and the other languages (after Bonferroni correction). In Table 2, an asterisk (*) indicates the feature is significantly different for both Mandarin and French; a plus symbol (+) indicates the feature is only significantly different between English and Mandarin; and a hash symbol (#) indicates the feature is only significantly different between English and French. A dash (-) indicates there is no significant difference.
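Such per-feature significance testing with a Bonferroni correction might be carried out as follows; the use of Welch's (unequal-variance) t-test is an assumption, as the test variant is not specified above.

```python
import numpy as np
from scipy.stats import ttest_ind

def bonferroni_ttests(group_a, group_b, feature_names):
    """Per-feature t-tests between two groups of feature vectors
    (rows = samples, columns = features), with each p-value multiplied
    by the number of tests and capped at 1.0 (Bonferroni correction)."""
    group_a, group_b = np.asarray(group_a), np.asarray(group_b)
    n_tests = len(feature_names)
    corrected = {}
    for j, name in enumerate(feature_names):
        _, p = ttest_ind(group_a[:, j], group_b[:, j], equal_var=False)
        corrected[name] = min(p * n_tests, 1.0)
    return corrected
```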
Due to the lack of baselines on the multilingual AphasiaBank dataset in prior work, new unilingual and multilingual baselines were established as follows:
Machine learning models were trained only on the English AphasiaBank (unpaired) dataset for classifying transcripts as healthy or aphasic.
Hyperparameters for the classification models were tuned using grid search with 10-fold cross-validation on the English AphasiaBank training set across all settings. Three classifiers were employed for the cross-linguistic classification task: a support-vector machine (SVM) with a radial basis function (RBF) kernel, with regularization parameter C = 0.1 and γ = 0.001; a Random Forest (RF) with 200 decision trees with a maximum depth of 2; and a Neural Network (NN) with 2 hidden layers of 100 units each (Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011). Since the training set was highly imbalanced as between the target English language and the source French/Mandarin languages (see the sample statistics in Table 1), the minority class was oversampled synthetically with the Synthetic Minority Oversampling Technique (SMOTE) (Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002) with k = 3. Prior to oversampling, the training set was normalized and scaled using the median and interquartile range, a common mechanism to center and scale-normalize data which is robust to outliers (Pedregosa et al. (2011)). The same median and interquartile range (obtained from the training set) were used to scale the evaluation set in each case.
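The training setup described above might be sketched as follows with scikit-learn and imbalanced-learn, assuming feature matrices X_train and X_eval and labels y_train have already been extracted.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Scale by median and interquartile range, fitting on the training set only;
# the same bounds are reused to scale the evaluation set.
scaler = RobustScaler().fit(X_train)
X_train_s, X_eval_s = scaler.transform(X_train), scaler.transform(X_eval)

# Oversample the minority class with SMOTE, k = 3 nearest neighbours.
X_bal, y_bal = SMOTE(k_neighbors=3).fit_resample(X_train_s, y_train)

# The three classifiers with the tuned hyperparameters reported above.
models = {
    "SVM": SVC(kernel="rbf", C=0.1, gamma=0.001),
    "RF": RandomForestClassifier(n_estimators=200, max_depth=2),
    "NN": MLPClassifier(hidden_layer_sizes=(100, 100)),
}
for model in models.values():
    model.fit(X_bal, y_bal)
```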
OT consists of finding the best transport strategy from one probability distribution function (PDF) to another. This is done by minimizing the total cost of transporting samples from the source distribution to the target distribution. Thus, there needs to be a metric to quantify the distances between samples in the two probability distributions, as well as solvers to solve the optimization problem of minimizing the total cost of transport, where cost is related to the distance between source and target. OT was selected for domain adaptation because the same features are extracted across languages, although their distributions (in terms of feature values, e.g. the proportion of nouns) vary from one language to another.
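For reference, the underlying optimization problem may be stated in its standard discrete form (a standard formulation from the OT literature, not reproduced from the examples): given source and target sample weights a and b, and a cost matrix C whose entry C_ij is the distance between source sample i and target sample j, OT seeks a transport plan γ solving

```latex
\min_{\gamma \in \Pi(a,b)} \sum_{i,j} \gamma_{ij} C_{ij},
\qquad
\Pi(a,b) = \left\{ \gamma \in \mathbb{R}_{+}^{n_s \times n_t} \;:\; \gamma \mathbf{1}_{n_t} = a,\; \gamma^{\top} \mathbf{1}_{n_s} = b \right\}.
```

The entropic-regularized variant (OT-EMD-R below) adds a term of the form λ Σ γ_ij log γ_ij to this objective, which smooths the transport plan and permits faster Sinkhorn-style solvers.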
Two OT training regimes were evaluated for both French and Mandarin, across three variants (solvers and distance functions) of OT. The selected variants of OT, whose parameters are described below, were OT-EMD, OT-EMD-R, and OT-Gaussian.
The PythonOT open source implementation of these algorithms was used (Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library, 2017, available at github.com/PythonOT/POT).
For OT-EMD, the method proposed by Ferradans et al. (Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853-1882, 2014) was used for out-of-sample mapping, to transport samples from one domain into the other, with the other parameters left at the defaults of the open source implementation. For OT-EMD-R, the entropic regularization parameter was set to 3, with all other parameters at their defaults. For OT-Gaussian, the weight for the linear OT loss was set to 1, and the maximum number of iterations was set to 20, with a stop threshold for iterations of 1E-05 and the other parameters at the defaults of the open source implementation.
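These three variants might be instantiated in PythonOT roughly as follows; mapping OT-Gaussian to POT's MappingTransport with a Gaussian kernel is an assumption based on the parameters named above (weight of the linear OT loss, iteration cap, stop threshold), and Xs/Xt stand for source- and target-language feature matrices.

```python
from ot.da import EMDTransport, SinkhornTransport, MappingTransport

# OT-EMD: exact earth mover's distance; out-of-sample points are mapped
# with the Ferradans et al. interpolation inside .transform().
ot_emd = EMDTransport()

# OT-EMD-R: entropic regularization parameter set to 3.
ot_emd_r = SinkhornTransport(reg_e=3)

# OT-Gaussian (assumed class): joint mapping estimation with a Gaussian
# kernel; mu weights the linear OT loss, tol is the stop threshold.
ot_gauss = MappingTransport(kernel="gaussian", mu=1, max_iter=20, tol=1e-5)

for mapper in (ot_emd, ot_emd_r, ot_gauss):
    mapper.fit(Xs=Xs, Xt=Xt)                 # learn source-to-target mapping
mapped_features = ot_emd_r.transform(Xs=Xs)  # transport source features
```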
OT mappings from each source language (French/Mandarin) to the target language (English) were learned for each algorithm, trained on language pairs (English-French/English-Mandarin) from the TED Talks (unpaired) dataset, according to two training regimes: in the first, the mappings were trained on the out-of-domain TED Talks data alone; in the second, the TED Talks data was supplemented with a small proportion of in-domain AphasiaBank data, as noted above.
Overlap between the portion of AphasiaBank used for learning OT mappings and the portion used for evaluating the classifiers was avoided by employing 2-fold cross-validation, in which one fold is included in the training set for OT and the other is reserved for evaluation. Since OT involves source and target domain probability estimation, it was expected that adding in-domain data, particularly that of speech-impaired participants, would improve results significantly.
Additionally, it was hypothesized that there would be an observable effect of speech diversity, in terms of accents, in the OT training set, since the literature shows that accents can have a significant effect on POS features (Elin Runnqvist, Tamar H Gollan, Albert Costa, and Victor S Ferreira. A disadvantage in bilingual sentence production modulated by syntactic frequency and similarity across languages. Cognition, 2(129):256-263, 2013). The TED Talks dataset covers a wide speaker demographic in terms of sex, age and accents. To study this effect, the English TED Talks dataset was manually annotated for the presence or absence of a North American accent: an annotator listened to the audio associated with each TED Talk and annotated whether the accent was 'North American' (NA) or 'other'. In cases where the accent was not clear, publicly available information regarding the speaker's nationality was referenced. In total, the English TED Talks dataset contained 373 NA-accented and 215 'other'-accented talks. The impact of increasing accent diversity in training OT algorithms, while keeping the size of the dataset constant, was evaluated, as discussed in the results below.
Task performance was evaluated primarily using macro-averaged F1 scores, a measure that is known to be robust to class imbalance. Area under the receiver operating characteristic (AUROC) scores were also determined, since they are often used as a general measure of performance irrespective of any particular threshold or operating point.
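Both measures are available in scikit-learn; in this sketch, y_true holds the reference labels, y_pred the predicted labels, and y_score the classifier's score for the aphasic class (names are illustrative).

```python
from sklearn.metrics import f1_score, roc_auc_score

# Macro-averaged F1 weights both classes equally, so it is robust to the
# healthy/aphasic class imbalance; AUROC summarizes ranking performance
# across all possible decision thresholds.
macro_f1 = f1_score(y_true, y_pred, average="macro")
auroc = roc_auc_score(y_true, y_score)
```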
It may be noted that zero standard deviations occur when a single class was predicted or when standard deviation < 0.01. The standard deviations were an artifact of the small sample sizes in the evaluation set (24 and 60 for French and Mandarin respectively). Highest F1 scores are shown in bold in
Baseline performance was found to vary significantly between languages, as can be seen in
For Mandarin, these results were very different; either baseline approach to adaptation hurt overall performance as compared to a unilingual baseline, often yielding models that would predict just a single class exclusively.
The difference in outcomes for English to French domain adaptation versus English to Mandarin was expected, as noted above. English and French have relatively similar grammatical patterns (e.g., subject, verb, object ordering), whereas Mandarin and English have a number of significant differences (e.g., reduplication, where a syllable or word is repeated to produce a modified meaning).
Among the OT variants (OT-EMD, OT-Gaussian, OT-EMD-R), performance was generally stronger than that of both the unilingual models and the baseline adaptation systems. In all but one case, the best-performing OT variant for a given model/language yielded a statistically significant improvement over the best baseline model according to a paired t-test, the notable exception being the SVM model on French text, which did not achieve statistical significance.
In general, EMD variants of OT (including both OT-EMD-R and OT-EMD) were found to perform better than the Gaussian variant.
As can be seen from
As noted above, the use of paired data was compared to the use of unpaired data, and it was found that the use of paired data in training the OT variants did not significantly improve performance for classification of French or Mandarin samples, over classification using OT variants trained using unpaired datasets. Table 3 below sets out F1-scores, which may be compared with the results of
Table 3 (F1-scores for OT variants trained using paired data): 83.22 ± 0.00, 84.71 ± 1.95, 83.22 ± 0.00 (French); 55.35 ± 3.38, 56.12 ± 2.01 (Mandarin).
To study the impact of data size on the aphasia detection task, an ablation study was performed in which the size of the English AphasiaBank dataset was artificially reduced by integer factors, while keeping the relative proportion of healthy and aphasic subjects the same. A 10-fold cross-validation was performed for an SVM classifier, with progressively less data. As can be seen in
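A sketch of such an ablation follows, assuming a feature matrix X and labels y for the English AphasiaBank data; the specific reduction factors shown are illustrative, as the original factors are not enumerated here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def subsample(X, y, factor, seed=0):
    """Reduce the dataset by an integer factor while preserving the
    relative proportion of healthy and aphasic subjects (stratified)."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        keep.extend(rng.choice(idx, size=len(idx) // factor, replace=False))
    keep = np.asarray(keep)
    return X[keep], y[keep]

for factor in (1, 2, 4, 8):  # illustrative reduction factors
    X_sub, y_sub = subsample(X, y, factor)
    scores = cross_val_score(SVC(kernel="rbf", C=0.1, gamma=0.001),
                             X_sub, y_sub, cv=10, scoring="f1_macro")
    print(factor, scores.mean())
```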
As described above, the effect of speech diversity, as measured by the frequency of various accents (NA vs other) in the speech data, on the performance of the various domain adaptation systems was evaluated. Table 4 below sets out the effect of speech diversity on the OT-EMD-R variant for SVM models in French and Mandarin.
Table 4 (F1-scores, OT-EMD-R with SVM): 87.48, 87.76 (French); 69.19, 71.83 (Mandarin).
As can be seen above, for both French and Mandarin, increasing the prevalence of non-North American accents in the OT domain adaptation task was found to improve downstream aphasia/non-aphasia classification performance by several F1 points (yielding a score of 87.48 for French and 69.19 for Mandarin). However, these results are not necessarily statistically significantly different from the best results set out in
The foregoing results demonstrate that POS features extracted from speech transcripts in different languages can be successfully mapped to a target language to aid in medical speech classification tasks, even where paired data is unavailable. In particular, in comparison to a multilingual baseline with a high-capacity autoencoder, it was found that domain adaptation strategies, in particular OT-based domain adaptation, can help enable strong predictive models for aphasia detection in low-resource languages that perform on par with the baseline for similar languages, and significantly better for dissimilar languages. Further, the foregoing results demonstrate that simple-to-extract text-based features—in this example, POS proportions—can be successfully transferred across languages to facilitate detection of speech impairment as a symptom of cognitive impairment, and specifically to improve aphasia detection in a cross-language evaluation setting.
While the examples above are mainly focused on optimal transport methods of domain adaptation, it is reasonable to expect from the demonstrated success of this technique that other state-of-the-art domain adaptation techniques, such as adversarial domain adaptation (Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167-7176, 2017), may likewise prove useful in cross-language transfer of ML models for detection of speech impairment as a symptom of cognitive impairment, and particularly aphasia.
Thus, in accordance with the examples and embodiments discussed above, there is provided a method of detecting speech impairment indicative of cognitive impairment in a subject, the method comprising obtaining a feature set from a subject’s speech data in a source language; applying a mapping to the extracted feature set to provide mapped features, wherein the mapping is defined by a domain adaptation system trained on a domain adaptation dataset comprising healthy speech data in the source language and a target language; providing the mapped features as input to a classifier trained in the target language to classify impaired speech and healthy speech; and obtaining a classification of the subject’s speech data as impaired speech or healthy speech from the classifier.
There is also provided a method of detecting speech impairment indicative of cognitive impairment from a speech sample, the method comprising generating a machine learning model for classifying input as impaired speech or healthy speech, the machine learning model trained on a first dataset in a target language, the first dataset comprising impaired speech data and healthy speech data; generating an optimal transport mapping of a feature set employed in the machine learning model from a source language to the target language using a second dataset, the second dataset comprising healthy speech data in the source and target languages; extracting the feature set from a subject's speech data provided in the source language; applying the optimal transport mapping on the extracted feature set to provide mapped features; providing the mapped features as input to the machine learning model to classify the input as impaired speech or healthy speech; and obtaining a classification of the input as impaired speech or healthy speech.
The examples and embodiments are presented only by way of example and are not meant to limit the scope of the subject matter described herein. Individual features of each example or embodiment presented above may be combined, in whole or in part, with individual features of other examples or embodiments. Some steps or acts in a process or method may be reordered or omitted, and features and aspects described in respect of one embodiment may be incorporated into other described embodiments. Variations of these examples will be apparent to those in the art and are considered to be within the scope of the subject matter described herein.
The data employed by the systems, devices, and methods described herein may be stored in one or more data stores. The data stores can be of many different types of storage devices and programming constructs, including but not limited to RAM, ROM, flash memory, programming data structures, programming variables, and so forth. Code adapted to provide the systems and methods described above may be provided on many different types of computer-readable media including but not limited to computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer hard drive, etc.) that contain instructions for use in execution by one or more processors to perform the operations described herein. The media on which the code may be provided is generally considered to be non-transitory or physical.
Computer components, software modules, engines, functions, and data structures may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. The data processing and computer systems described above may be provided in a single location, or may be distributed in a cloud environment, using techniques known to those skilled in the art. Those skilled in the art will know how to implement the particular examples described above in the systems and methods of
Use of any particular term should not be construed as limiting the scope or requiring experimentation to implement the claimed subject matter or embodiments described herein. Any suggestion of substitutability of the data processing or computer systems or environments for other implementation means should not be construed as an admission that the invention(s) described herein are abstract, or that the data processing systems or their components are non-essential to the invention(s) described herein.
A portion of the disclosure of this patent document contains material which is or may be subject to one or more of copyright, design, or trade dress protection, whether registered or unregistered. The rightsholder has no objection to the reproduction of any such material as portrayed herein through facsimile reproduction of this disclosure as it appears in the Patent Office records, but otherwise reserves all rights whatsoever.
This application claims priority to U.S. Provisional Application No. 62/941,829 filed Nov. 28, 2019, the entirety of which is incorporated herein by reference.
Filing Document: PCT/CA2020/051637; Filing Date: Nov. 27, 2020; Country: WO.