This document relates generally to speech analysis and more particularly to evaluating prosodic features of low entropy speech.
When assessing the proficiency of speakers in reading passages of connected text (e.g., analyzing a non-native speaker's ability to read aloud scripted (low entropy) text), certain dimensions of the speech are traditionally analyzed. For example, proficiency assessments often measure the reading accuracy of the speaker by considering reading errors at the word level, such as insertions, deletions, or substitutions of words compared to the reference text or script. Other assessments may measure the fluency of the speaker, determining whether the passage is well paced in terms of speaking rate and distribution of pauses, and whether it is free of disfluencies such as fillers or repetitions. Still other assessments may analyze the pronunciation of the speaker by determining whether the spoken words are pronounced correctly at the segmental level, such as at the level of individual phones.
While analyzing these dimensions of speech provides some data for assessing a speaker's ability, these dimensions are unable to provide a complete and accurate appraisal of the speaker's discourse capability.
In accordance with the teachings herein, systems and methods are provided for scoring speech. A speech sample is received, where the speech sample is associated with a script. The speech sample is aligned with the script. An event recognition metric of the speech sample is extracted, and locations of prosodic events are detected in the speech sample based on the event recognition metric. The locations of the detected prosodic events are compared with locations of model prosodic events, where the locations of model prosodic events identify expected locations of prosodic events of a fluent, native speaker speaking the script. A prosodic event metric is calculated based on the comparison, and the speech sample is scored using a scoring model based upon the prosodic event metric.
As another example, a system for scoring speech may include a processing system and one or more memories encoded with instructions for commanding the processing system to execute a method. In the method, a speech sample is received, where the speech sample is associated with a script. The speech sample is aligned with the script. An event recognition metric of the speech sample is extracted, and locations of prosodic events are detected in the speech sample based on the event recognition metric. The locations of the detected prosodic events are compared with locations of model prosodic events, where the locations of model prosodic events identify expected locations of prosodic events of a fluent, native speaker speaking the script. A prosodic event metric is calculated based on the comparison, and the speech sample is scored using a scoring model based upon the prosodic event metric.
As a further example, a non-transitory computer-readable medium may be encoded with instructions for commanding a processing system to execute a method. In the method, a speech sample is received, where the speech sample is associated with a script. The speech sample is aligned with the script. An event recognition metric of the speech sample is extracted, and locations of prosodic events are detected in the speech sample based on the event recognition metric. The locations of the detected prosodic events are compared with locations of model prosodic events, where the locations of model prosodic events identify expected locations of prosodic events of a fluent, native speaker speaking the script. A prosodic event metric is calculated based on the comparison, and the speech sample is scored using a scoring model based upon the prosodic event metric.
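For illustration only, the following minimal sketch walks through the flow described above using toy, pre-extracted per-syllable data; the detection rule (a simple power threshold), the model event locations, and the 0-to-4 score scale are assumptions made for the example, not the disclosed implementation.

```python
# Minimal end-to-end sketch of the described scoring flow, using toy data
# and placeholder rules. All names, values, and thresholds are illustrative.

# Per-syllable event recognition metrics, assumed already extracted from a
# speech sample that has been aligned with its script.
sample_metrics = [
    {"syllable": "to",  "power": 58.0, "duration": 0.11},
    {"syllable": "day", "power": 66.0, "duration": 0.22},
    {"syllable": "is",  "power": 57.0, "duration": 0.10},
    {"syllable": "sun", "power": 67.0, "duration": 0.24},
    {"syllable": "ny",  "power": 59.0, "duration": 0.13},
]

# Toy detection rule: flag a stress event when syllable power exceeds the
# sample mean (a real detector would use a trained classifier).
mean_power = sum(m["power"] for m in sample_metrics) / len(sample_metrics)
detected = [m["power"] > mean_power for m in sample_metrics]

# Model prosodic events: expected stress locations for a fluent, native
# speaker reading the same script (assumed to come from annotation).
model = [False, True, False, True, False]

# Prosodic event metric: proportion of syllables where the detected and
# model stress decisions agree.
prosodic_event_metric = sum(d == m for d, m in zip(detected, model)) / len(model)

# Toy scoring model: map the metric onto a 0-4 proficiency scale.
score = 4.0 * prosodic_event_metric
print(f"prosodic event metric = {prosodic_event_metric:.2f}, score = {score:.2f}")
```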
The prosodic speech feature scoring engine 102 examines the prosody of a received speech sample to generate a prosodic event metric that indicates the quality of prosody of the speech sample. The speech sample may take a variety of forms. For example, the speech sample may be a sample of a speaker that is speaking text from a script. The script may be provided to the speaker in written form, or the speaker may be instructed to repeat words, phrases, or sentences that are spoken to the speaker by another party. Such speech that largely conforms to a script may be referred to as low entropy speech, where the content of the low entropy speech sample is largely known prior to any scoring based on the association of the low entropy speech sample with the script.
The prosodic speech feature scoring engine 102 may be used to score the prosody of a variety of different speakers. For example, the prosodic speech feature scoring engine 102 may be used to examine the prosody of a non-native speaker's (e.g., a speaker whose first language is not English) reading of a script that includes English words. As another example, the prosodic speech feature scoring engine 102 may be used to score the prosody of a child or adolescent speaker (e.g., a speaker under 19 years of age), such as in a speech therapy class, to help diagnose shortcomings in a speaker's ability. As a further example, the prosodic speech feature scoring engine 102 may be used with fluent speakers for speech fine-tuning activities (e.g., improving the speaking ability of a political candidate or other orator).
The prosodic speech feature scoring engine 102 provides a platform for users 104 to analyze the prosodic ability displayed in a speech sample. A user 104 accesses the prosodic speech feature scoring engine 102, which is hosted via one or more servers 106, via one or more networks 108. The one or more servers 106 communicate with one or more data stores 110. The one or more data stores 110 may contain a variety of data that includes speech samples 112 and model prosodic events 114.
At 212, locations of prosodic events 214 in the speech sample 204 are detected based on the event recognition metrics 210. For example, the event recognition metrics 210 associated with a particular syllable may be examined to determine whether that syllable includes a prosodic event, such as a stress or a tone change. In another example, additional event recognition metrics 210 associated with syllables near the particular syllable being considered may be used to provide context for detecting the prosodic events. For example, event recognition metrics 210 from surrounding syllables may help in determining whether the tone of the speech sample 204 is rising, falling, or staying the same at the particular syllable.
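As a rough illustration of using surrounding syllables as context, the sketch below classifies the tone movement at each syllable from per-syllable pitch values; the pitch values and the 10 Hz threshold are illustrative assumptions, and a fuller detector would consider following syllables and additional metrics as well.

```python
# Sketch of context-based tone classification: compare each syllable's
# pitch with that of the preceding syllable to label the movement as
# rising, falling, or level. Values and threshold are illustrative.

pitches = [180.0, 185.0, 210.0, 205.0, 170.0]  # per-syllable pitch (Hz)

def tone_direction(pitches, i, threshold=10.0):
    """Classify tone movement at syllable i using its preceding neighbor."""
    prev_pitch = pitches[i - 1] if i > 0 else pitches[i]
    delta = pitches[i] - prev_pitch
    if delta > threshold:
        return "rising"
    if delta < -threshold:
        return "falling"
    return "level"

for i in range(len(pitches)):
    print(i, tone_direction(pitches, i))
```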
At 216, a comparison is performed between the locations of the detected prosodic events 214 and locations of model prosodic events 218. The model prosodic events 218 may be generated in a variety of ways. For example, the model prosodic event locations 218 may be generated from a human annotation of a fluent, native speaker speaking the script. The comparison at 216 is used to calculate a prosodic event metric 220. The prosodic event metric 220 can represent the magnitude of similarity of the detected prosodic events 214 to the model prosodic events 218. For example, the prosodic event metric may be based on a proportion of matching of stressed or accented syllables as identified in the detected prosodic event locations 214 and the model prosodic event locations 218. As another example, the prosodic event metric may be based on a proportion of matching of syllables having tone changes as identified in the detected prosodic event locations 214 and the model prosodic event locations 218. If the detected prosodic events 214 of the speech sample 204 are similar to the model prosodic events 218, then the prosody of the speech sample is deemed to be strong, which is represented in the prosodic event metric 220. If there is little matching of the detected prosodic event locations 214 and the model prosodic event locations 218, then the prosodic event metric 220 will identify a low quality of prosody in the speech sample.
The prosodic event metric 220 may be used alone as an indicator of the quality of the speech sample 204 or an indicator of the quality of prosody in the speech sample 204. Further, the prosodic event metric 220 may be provided as an input to a scoring model, where the speech sample is scored using the scoring model based at least in part upon the prosodic event metric.
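The disclosure does not fix a particular scoring model; purely as one hypothetical form, the sketch below combines the prosodic event metric with other assumed speech features in a weighted sum, with weights that an operational system would instead learn from human-scored responses.

```python
# Hypothetical scoring model: a weighted combination of the prosodic event
# metric with other assumed features. Feature names and weights are
# illustrative assumptions only.

def score_sample(prosodic_event_metric, fluency_metric, pronunciation_metric):
    weights = {"prosody": 1.5, "fluency": 1.0, "pronunciation": 1.5}
    return (weights["prosody"] * prosodic_event_metric
            + weights["fluency"] * fluency_metric
            + weights["pronunciation"] * pronunciation_metric)

print(score_sample(0.8, 0.7, 0.9))  # e.g., 3.25 on this toy scale
```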
Outputs from the automatic speech recognizer, such as the syllable-to-speech-sample matching and speech recognizer metrics 314 (e.g., outputs of the automatic speech recognizer 310 and internal variables used by the automatic speech recognizer 310), and the speech sample 304 are used to perform event recognition metric extraction at 316. For example, the event recognition metric extraction can extract attributes of the speech sample 304 at the syllable level to generate the event recognition metrics 318. Example event recognition metrics 318 can include a power measurement for each syllable, a pitch metric for each syllable, a silence measurement metric for each syllable, a syllable duration metric for each syllable, a word identity associated with a syllable, a dictionary stress associated with the syllable (e.g., whether a dictionary notes that a syllable is expected to be stressed), and a distance from the last syllable having a stress or tone event, as well as others.
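One way the per-syllable metrics listed above could be organized is a record per syllable, as in the sketch below; the field names, types, and units are assumptions for illustration, not the actual data structure.

```python
# Illustrative per-syllable record for event recognition metrics.
from dataclasses import dataclass

@dataclass
class SyllableMetrics:
    syllable: str               # syllable text from the alignment
    word: str                   # word identity the syllable belongs to
    power: float                # power measurement (e.g., dB)
    pitch: float                # pitch metric (e.g., mean F0 in Hz)
    silence_after: float        # silence following the syllable (seconds)
    duration: float             # syllable duration (seconds)
    dictionary_stress: bool     # whether a dictionary expects stress here
    dist_from_last_event: int   # syllables since the last stress/tone event

m = SyllableMetrics("day", "today", 66.0, 220.0, 0.05, 0.22, True, 2)
print(m)
```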
The prosodic event detector 410 may be implemented in a variety of ways. In one example, the prosodic event detector 410 comprises a decision tree classifier model that identifies locations of prosodic events 412 based on event recognition metrics 408. In one example, a decision tree classifier model is trained using a number of human-transcribed non-native spoken responses. Each of the responses is annotated for stress and tone labels for each syllable by a native speaker of English. A forced alignment process (e.g., via an automatic speech recognizer) is used to obtain word and phoneme time stamps. The words and phones are annotated to note tone changes (e.g., high to low, low to high, high to high, low to low, and no change), where those tone change annotations describe the relative pitch difference between the last syllable of an intonational phrase and the preceding syllable (e.g., a yes-no question usually ends in a low-to-high boundary tone). Tone changes may also be measured within a single syllable. The words and phones are similarly annotated to identify stressed and unstressed syllables, where stressed syllables are defined as bearing the most emphasis or weight within a clause or sentence. Correlations between the annotations and acoustic characteristics of the syllables (e.g., event recognition metrics) are then determined to generate the decision tree classifier model.
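A minimal sketch of training such a decision tree classifier, assuming scikit-learn is available and using toy stand-ins for the annotated training data described above, might look as follows.

```python
# Train a decision tree stress classifier from (toy) annotated syllables.
from sklearn.tree import DecisionTreeClassifier

# Feature vectors per syllable: [power, pitch, duration, dictionary_stress].
X = [
    [58.0, 180.0, 0.11, 0],
    [66.0, 220.0, 0.22, 1],
    [57.0, 175.0, 0.10, 0],
    [67.0, 230.0, 0.24, 1],
    [59.0, 185.0, 0.13, 0],
    [65.0, 215.0, 0.21, 1],
]
# Stress labels from native-speaker annotation (1 = stressed).
y = [0, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[64.0, 210.0, 0.20, 1]]))  # stress decision for a new syllable
```

A shallow tree keeps the learned rules inspectable, which fits the correlation-driven construction described above.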
In another example, annotations of the model speech sample can be determined via a crowd-sourcing operation, where a large number of people (e.g., more than 25), who may not be expert linguists, note their impressions of stresses and tone changes per syllable, and the collective opinions of the group are used to generate the Model Prosodic Event Data Structure. In a further example, the Model Prosodic Event Data Structure may be automatically generated by aligning the model speech sample with the script, extracting features of the sample, and identifying locations of prosodic events in the speech sample based on the extracted features.
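Purely as an illustration, turning such crowd-sourced judgments into model prosodic events could be done with a per-syllable majority vote, as sketched below with made-up annotator data.

```python
# Aggregate crowd-sourced per-syllable stress judgments by majority vote.
from collections import Counter

# Each inner list: one annotator's stress judgments per syllable (1 = stressed).
annotations = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
]

# Majority vote per syllable position across annotators.
model_events = [Counter(col).most_common(1)[0][0] for col in zip(*annotations)]
print(model_events)  # [0, 1, 0, 1, 0]
```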
Table 2 depicts an example Detected Prosodic Event Data Structure. At 508, a location comparator compares the locations of the detected prosodic events 504 with the locations of the model prosodic events 506 to generate matches and non-matches of prosodic events 510, such as on a per-syllable basis. Comparing the data contained in the data structures of Tables 1 and 2, the location comparator determines that the detected prosodic events match in the "Stressed" category 60% of the time (i.e., for 3 out of 5 records) and in the "Tone Change" category 100% of the time. At 512, a prosodic event metric generator determines a prosodic event metric 514 based on the determined matches and non-matches of prosodic events 510. Such a generation at 512 may be performed using a weighted average of the matches and non-matches data 510 or another mechanism (e.g., precision, recall, or an F-score such as an F1 score comparing the locations of the detected prosodic events 504 with the locations of the model prosodic events 506) to provide a prosodic event metric 514 indicative of the prosodic quality of the speech sample.
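A minimal sketch of the location comparison and F1-based metric generation, using illustrative per-syllable stress decisions, might look as follows.

```python
# Compare detected prosodic event locations against model locations and
# summarize the matches with precision, recall, and an F1 score. The two
# event sequences below are illustrative.

detected = [1, 0, 1, 1, 0]  # detected stress events per syllable
model    = [1, 1, 1, 0, 0]  # model (native-speaker) stress events

tp = sum(d == 1 and m == 1 for d, m in zip(detected, model))  # matches
fp = sum(d == 1 and m == 0 for d, m in zip(detected, model))  # spurious events
fn = sum(d == 0 and m == 1 for d, m in zip(detected, model))  # missed events

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```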
The prosodic event metric 514 may be an output in itself, indicating the prosodic quality of a speech sample. Further, the prosodic event metric 514 may be an input to a further data model for scoring an overall quality of the speech sample.
Examples have been used to describe the contents of this disclosure. The scope of this disclosure encompasses examples that are not explicitly described herein. For instance, in one such example, alignment between a script and a speech sample is performed on a word-by-word basis, in contrast to examples where such operations are performed on a syllable-by-syllable basis.
As another example, a disk controller 860 interfaces one or more optional disk drives to the system bus 852. These disk drives may be external or internal floppy disk drives such as 862, external or internal CD-ROM, CD-R, CD-RW, or DVD drives such as 864, or external or internal hard drives 866. As indicated previously, these various disk drives and disk controllers are optional devices.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 860, the ROM 856 and/or the RAM 858. Preferably, the processor 854 may access each component as required.
A display interface 868 may permit information from the bus 852 to be displayed on a display 870 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 872.
In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 873, or other input device 874, such as a microphone, remote control, pointer, mouse and/or joystick.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein, and may be provided in any suitable language such as C, C++, or Java, or any other suitable programming language. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of "a," "an," and "the" includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise. Further, as used in the description herein and throughout the claims that follow, the meaning of "each" does not require "each and every" unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of "and" and "or" include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase "exclusive or" may be used to indicate situations where only the disjunctive meaning may apply.
This application claims the benefit of U.S. Provisional Patent Application No. 61/467,498 filed on Mar. 25, 2011, the entire contents of which are incorporated herein by reference.