The technology described herein relates generally to speech scoring and more particularly to using structural events to score spontaneous speech responses.
In the last decade, research has begun on the automatic estimation of structural events (e.g., clause and sentence structure, disfluencies, and discourse markers) in spontaneous speech. Structural events have been used in natural language processing (NLP) applications, including parsing of speech transcriptions, information extraction (IE), machine translation, and extractive speech summarization.
However, the structural events in speech data have not been utilized in using automatic speech recognition (ASR) technology to assess speech proficiency. This type of ASR analysis has traditionally used cues derived at the word level, such as a temporal profile of spoken words. The information beyond the word level (e.g., clause/sentence structure of utterances and disfluency structure) has not been used to its full potential.
Systems and methods are provided for providing a score for a spontaneous speech response to a prompt. A transcription of the spontaneous speech response may be accessed. A plurality of clauses may be identified within the spontaneous speech response, where identifying a clause includes identifying a beginning boundary and an end boundary of the clause in the spontaneous speech response. The term “clause” encompasses different types of word groupings that represent a complete idea, including “sentences” and “T-Units.”
A plurality of disfluencies in the spontaneous speech response may be identified. Furthermore, a plurality of syntactic structures may be identified within each clause. One or more proficiency metrics may be calculated based on the plurality of identified clauses, the identified disfluencies, and the identified syntactic structures, and a score for the spontaneous speech response may be generated based on the one or more proficiency metrics and possibly other proficiency metrics available to the system.
As another example, a system for providing a score for a spontaneous speech response to a prompt may include one or more data processors and a computer-readable medium encoded with instructions for commanding the one or more data processors to execute a method. In the method, a transcription of the spontaneous speech response may be accessed. A plurality of clauses may be identified within the spontaneous speech response, where identifying a clause includes identifying a beginning boundary and an end boundary of the clause or sentence in the spontaneous speech response. A plurality of disfluencies in the spontaneous speech response may be identified. A plurality of syntactic structures within each clause may be identified. One or more proficiency metrics may be calculated based on the plurality of identified clauses, the identified disfluencies, and the identified syntactic structures. A score for the spontaneous speech response may be generated based on the one or more proficiency metrics.
As a further example, a computer-readable medium may be encoded with instructions for commanding one or more data processors to execute a method for providing a score for a spontaneous speech response to a prompt. In the method, a transcription of the spontaneous speech response may be accessed. A plurality of clauses may be identified within the spontaneous speech response, where identifying a clause includes identifying a beginning boundary and an end boundary of the clause in the spontaneous speech response. A plurality of disfluencies in the spontaneous speech response may be identified. One or more proficiency metrics may be calculated based on the plurality of identified clauses and the identified disfluencies, and a score for the spontaneous speech response may be generated based on the one or more proficiency metrics.
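The overall flow described above can be sketched in Python as follows. This is a minimal illustration only: the clause and disfluency identifiers here are toy stand-ins (splitting on sentence-final period tokens and flagging filled pauses), not the trained classifiers the disclosure contemplates, and the metric names anticipate those defined later in the description.

```python
FILLED_PAUSES = {"uh", "um", "er"}

def identify_clauses(tokens):
    """Toy stand-in for clause identification: split on sentence-final
    period tokens (a real system would use a boundary classifier)."""
    clauses, clause = [], []
    for tok in tokens:
        clause.append(tok)
        if tok == ".":
            clauses.append(clause)
            clause = []
    if clause:
        clauses.append(clause)
    return clauses

def identify_disfluencies(tokens):
    """Toy stand-in for disfluency identification: flag filled pauses."""
    return [i for i, t in enumerate(tokens) if t in FILLED_PAUSES]

def score_response(tokens):
    """Combine identified structural events into simple proficiency metrics."""
    clauses = identify_clauses(tokens)
    disfluencies = identify_disfluencies(tokens)
    n_words = sum(1 for t in tokens if t != ".")
    return {
        "MLC": n_words / len(clauses),            # mean length of clause
        "IPC": len(disfluencies) / len(clauses),  # disfluencies per clause
    }
```

A scoring model would then map such metrics, possibly together with pronunciation or other features, to an overall score.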
The spontaneous speech response 106 is provided to a speech scoring engine 108. The speech scoring engine 108 analyzes the spontaneous speech response 106 to generate a spontaneous speech response score 110 for the spontaneous speech response 106. For example, the speech scoring engine 108 may identify certain characteristics of the spontaneous speech response 106 and use those characteristics to calculate the score 110.
As shown at 306, clauses may also be identified using an automated process performed by a processor. For example, automated clause boundary identification may be performed using a classifier based on lexical and prosodic features around the word boundary. Typical lexical features may include co-occurrence of words or Part of Speech (POS) tags. Typical prosodic features may include the pause duration before the word boundary.
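A feature extractor for such a boundary classifier might be sketched as below. The feature names, the pause representation, and the 0.2-second threshold are illustrative assumptions, not values from the disclosure.

```python
def boundary_features(words, pos_tags, pauses, i):
    """Assemble lexical and prosodic features for the word boundary after
    word i. `pauses[i]` is the silence duration (seconds) before word i+1.
    All feature names here are illustrative, not from any particular toolkit."""
    feats = {
        "word": words[i],
        "next_word": words[i + 1] if i + 1 < len(words) else "</s>",
        # lexical/syntactic co-occurrence feature: POS bigram across the boundary
        "pos_bigram": (pos_tags[i],
                       pos_tags[i + 1] if i + 1 < len(pos_tags) else "</s>"),
        # prosodic features: raw pause duration and a binarized version
        "pause_before_next": pauses[i],
        "long_pause": pauses[i] > 0.2,  # assumed threshold
    }
    return feats
```

Each boundary's feature dictionary would then be fed to the classifier that decides whether a clause boundary occurs there.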
Disfluencies can further be sub-classified into several groups: silent pauses, filled pauses (e.g., uh and um), false starts, repetitions, and repairs. Repetitions and repairs are denoted “edit disfluencies,” each comprising a reparandum, an optional editing term, and a correction. The reparandum is the part of an utterance that a speaker wants to repeat or change, while the correction contains the speaker's correction. The editing term can be a filled pause (e.g., um) or an explicit expression (e.g., sorry). The interruption point (IP), occurring at the end of the reparandum, is where the fluent speech is interrupted to prepare for the correction. In the sentence “He(1) is(2) a(3) very(4) mad(5) er(6) very(7) bad(8) police(9) officer(10),” where each word is indexed by position, the IP is at position 5, the reparandum is “very mad,” the correction is “very bad,” and the editing term is “er.”
As shown at 406, disfluencies may also be identified using an automated process performed by a processor. For example, automated disfluency identification may be performed using a classifier based on lexical features, including co-occurrence of words; syntactic features, including co-occurrence of Part of Speech (POS) tags; and prosodic features, including pause duration, pitch, and the duration of the syllable or word around the word boundary. The following are examples of lexical and syntactic features for the classifier.
Word N-gram features: Given wi as the word token at position i: (wi), (wi−1, wi), (wi, wi+1), (wi−2, wi−1, wi), (wi, wi+1, wi+2), and (wi−1, wi, wi+1).
POS tag N-gram features: Given ti as the POS tag at position i: (ti), (ti−1, ti), (ti, ti+1), (ti−2, ti−1, ti), (ti, ti+1, ti+2), and (ti−1, ti, ti+1).
Filled pause adjacency: This feature has a binary value showing whether a filled pause such as uh or um was adjacent to the current word (wi).
Word repetition: This feature has a binary value showing whether the current word (wi) was repeated in the following 5 words or not.
Similarity: This feature has a continuous value which measures the similarity between the reparandum and the correction. Assuming that wi is the end of the reparandum, the start point and the end point of the reparandum and correction may be estimated, and the string edit distance between reparandum and correction may be calculated. The boundaries may be estimated as follows: first, if wi appears in the following 5 words, then its second occurrence is defined as the end of the correction; otherwise, wi+5 is defined as the end of the correction. Second, N, the length of the correction, is calculated, and wi−N+1 is defined as the start point of the reparandum. During the calculation of the string edit distance, a word fragment may be considered the same as any word whose initial character sequence matches it.
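The lexical features above (filled pause adjacency, word repetition, and reparandum/correction similarity) can be sketched as follows. The span estimation follows the heuristic just described; the interpretation that the correction begins immediately after wi is an assumption of this sketch.

```python
FILLED_PAUSES = {"uh", "um", "er"}

def word_repetition(tokens, i, window=5):
    """Binary feature: is the current word repeated in the following 5 words?"""
    return tokens[i] in tokens[i + 1 : i + 1 + window]

def filled_pause_adjacent(tokens, i):
    """Binary feature: is a filled pause adjacent to the current word?"""
    left = i > 0 and tokens[i - 1] in FILLED_PAUSES
    right = i + 1 < len(tokens) and tokens[i + 1] in FILLED_PAUSES
    return left or right

def edit_distance(a, b):
    """Standard Levenshtein distance over word sequences."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return dp[len(b)]

def similarity_feature(tokens, i, window=5):
    """Estimate reparandum/correction spans around wi = tokens[i] and return
    their edit distance (lower = more similar)."""
    following = tokens[i + 1 : i + 1 + window]
    if tokens[i] in following:
        end = i + 1 + following.index(tokens[i])   # second occurrence of wi
    else:
        end = min(i + window, len(tokens) - 1)     # fall back to wi+5
    correction = tokens[i + 1 : end + 1]           # assumed to start at wi+1
    n = len(correction)
    reparandum = tokens[max(0, i - n + 1) : i + 1]
    return edit_distance(reparandum, correction)
```

For the example sentence above, with i at “mad,” the repetition and filled-pause features fire as described in the text.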
Automated detection of clause boundaries and disfluencies may be performed using a classifier built on conditional models, such as a maximum entropy (MaxEnt) model or a Conditional Random Fields (CRF) model. Based on a variety of features, the structural event detection task can be generalized as:
Ê = argmax_E P(E|W)
Given that E denotes the between-word event sequence and W denotes the corresponding features, the goal is to find the event sequence that has the greatest probability, given the observed features.
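Under a model that factorizes over word boundaries, as a MaxEnt classifier applied independently at each boundary does, the argmax reduces to a per-boundary maximum, as in this sketch; a CRF would instead require sequence-level (e.g., Viterbi) decoding.

```python
def decode_events(posteriors):
    """Pick, at each word boundary, the event with the highest conditional
    probability P(E|W). `posteriors` is a list of dicts mapping event labels
    to probabilities, one dict per boundary (illustrative labels)."""
    return [max(p, key=p.get) for p in posteriors]
```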
Syntactic structure identification 452 may also be an automated process performed by a processor 456. For example, the Stanford Parser (open-source parsing software developed by Stanford University) may be utilized. The parser may take text input from the transcript of the voice recognition output. If the parser uses the voice recognition output directly, it may further rely on clause identification outputs to identify basic punctuation.
The speech scoring engine may also calculate proficiency metrics based on the identified clauses and disfluencies as shown at 226. These proficiency metrics may be based on structural event annotations, including clause boundaries and their types, disfluencies, as well as identified syntax. Some features measuring syntactic complexity and disfluency profile may also be calculated.
Because simple sentences (SS), independent clauses (I), and conjunct clauses (CC) each represent a complete idea, they are considered T-Units (T). Clauses that do not express a complete idea are dependent clauses (DEP), which include noun clauses (NC), relative clauses that function as adjectives (ADJ), adverbial clauses (ADV), and adverbial phrases (ADVP). The total number of clauses is the sum of the number of T-units (T), dependent clauses (DEP), and failed clauses (denoted as F). Therefore,
NT = NSS + NI + NCC
NDEP = NNC + NADJ + NADV + NADVP
NC = NT + NDEP + NF
Assuming Nw is the total number of words in the speech response (without pruning speech repairs), the following features are derived:
MLC=Nw/NC
DEPC=NDEP/NC
IPC=NIP/NC
where MLC is a mean length of clause metric, DEPC is a dependent clause frequency metric, and IPC is an interruption point frequency per clause metric.
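Given clause counts by type, these metrics can be computed directly, as in the following sketch; the dictionary keys are illustrative labels for the clause types defined above.

```python
def clause_metrics(counts, n_words, n_ip):
    """Compute MLC, DEPC, and IPC from clause counts. `counts` maps clause
    type labels (SS, I, CC, NC, ADJ, ADV, ADVP, F) to frequencies, following
    the NT / NDEP / NC breakdown above; `n_ip` is the interruption point count."""
    n_t = counts.get("SS", 0) + counts.get("I", 0) + counts.get("CC", 0)
    n_dep = sum(counts.get(k, 0) for k in ("NC", "ADJ", "ADV", "ADVP"))
    n_c = n_t + n_dep + counts.get("F", 0)
    return {
        "MLC": n_words / n_c,   # mean length of clause
        "DEPC": n_dep / n_c,    # dependent clause frequency
        "IPC": n_ip / n_c,      # interruption points per clause
    }
```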
Furthermore, the IPC feature may be adjusted. Disfluency is a complex behavior influenced by a variety of factors, such as proficiency level, speaking rate, and familiarity with the speaking content. The complexity of utterances is also an important factor in the disfluency pattern: complexity of expression, computed from the parse tree structure, may influence the frequency of disfluency. Because disfluency frequency may be influenced not only by test-takers' speaking proficiency but also by the difficulty of the speaking content, the IPC metric can be adjusted accordingly. For this purpose, the IPC can be normalized by dividing it by features related to the content's complexity, including MLC, DEPC, or both. Thus, the following elaborated disfluency-related features may be calculated:
IPCn1=IPC/MLC
IPCn2=IPC/DEPC
IPCn3=IPC/MLC/DEPC
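The normalized variants follow directly from the definitions above; this sketch assumes MLC and DEPC are nonzero.

```python
def normalized_ipc(ipc, mlc, depc):
    """Normalize interruption-point frequency by content-complexity features
    (assumes mlc and depc are nonzero)."""
    return {
        "IPCn1": ipc / mlc,
        "IPCn2": ipc / depc,
        "IPCn3": ipc / mlc / depc,
    }
```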
Syntactic structures are commonly expressed as “parse trees,” i.e., hierarchical structures of constituents within a sentence. For example, the sentence “he gave the book to his little sister” has three nominal constituents (“he,” “the book,” and “his little sister”) and a verbal constituent (“gave”). Furthermore, in most syntactic descriptions, the phrase “gave the book to his little sister” would itself be considered a verbal constituent phrase, containing the main verb “gave” and the two nominal constituents “the book” and “to his little sister.” Finally, the whole sentence would be considered yet another verbal or sentential constituent, comprising the constituent “he” as a subject and the rest (the “verb phrase”) as a second constituent of the entire sentential phrase.
The identification of syntactic structures as exemplified above is usually performed by either manual annotation by human experts or by automated systems, called syntactic parsers. Based on these identified syntactic structures or constituents, proficiency metrics may be derived, e.g., “frequency of nominal phrases per sentence”.
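As a rough sketch of deriving such a metric from parser output, the following counts labeled constituents in a bracketed parse string. The label set (e.g., NP for nominal phrases) follows Penn Treebank conventions; a real implementation would walk a proper tree structure rather than use a regular expression.

```python
import re

def count_label(bracketed, label):
    """Count constituents with the given label in a bracketed parse string,
    e.g. occurrences of '(NP ...)'. A rough regex sketch, not a full tree
    reader; it relies on a space or ')' following each constituent label."""
    return len(re.findall(r"\(" + re.escape(label) + r"[ )]", bracketed))
```

A metric such as “frequency of nominal phrases per sentence” would then be count_label(tree, "NP") divided by the number of sentences.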
The speech scoring engine may generate a spontaneous speech response score based on the proficiency metrics. For example, weights may be assigned to certain proficiency metrics. By combining those weights with calculated values for the proficiency metrics, an overall score for the spontaneous speech response may be generated. The overall score for a spoken response may be based totally or in part on features derived from the clause structures, disfluencies, and syntactic structures explicated above. In order to compute a score for a response, other features such as features related to pronunciation or other aspects of speech, may also be used together with the features mentioned in this application.
Certain proficiency metrics may be more highly correlated with high quality spontaneous speech responses than others. Such correlations may be determined by performing a manual (e.g., human) or other scoring of a set of spontaneous speech responses. Proficiency metrics for those responses may be calculated, and correlations between the proficiency metrics and the manual scores may be computed to determine which proficiency metrics correlate best with the scores. Based on these correlations, proficiency metrics may be selected, and a model may be generated based on those proficiency metrics to score spontaneous speech responses (e.g., a regression analysis may be performed using the scores and the selected proficiency metrics to determine proficiency metric weights for use in scoring spontaneous speech responses).
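A minimal sketch of this selection-and-fitting procedure, using Pearson correlation and a one-metric least-squares fit; a real system would select among many metrics and use a full multivariate regression.

```python
def pearson(xs, ys):
    """Pearson correlation between a proficiency metric and human scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def fit_weight(xs, ys):
    """One-metric least squares: score ≈ w * metric + b.
    Returns the weight w and intercept b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return w, my - w * mx
```

Metrics whose correlation with the human scores exceeds some threshold would be kept, and their fitted weights used to score new responses.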
A disk controller 760 interfaces one or more optional disk drives to the system bus 752. These disk drives may be external or internal floppy disk drives such as 762, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 764, or external or internal hard drives 766. As indicated previously, these various disk drives and disk controllers are optional devices.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 760, the ROM 756 and/or the RAM 758. Preferably, the processor 754 may access each component as required.
A display interface 768 may permit information from the bus 752 to be displayed on a display 770 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 772.
In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 772, or other input device 774, such as a microphone, remote control, pointer, mouse and/or joystick.
This written description uses examples to disclose the invention, including the best mode, and also to enable a person skilled in the art to make and use the invention. The patentable scope of the invention may include other examples. For example, the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, internet, combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate situations in which only the disjunctive meaning may apply.
This application claims priority to U.S. Provisional Patent Application No. 61/309,233, filed Mar. 1, 2010, entitled “Processor Implemented Systems and Methods for Measuring Syntactic Complexity Using Structural Events on Non-Native Spoken Data,” and to U.S. Provisional Patent Application No. 61/372,964, filed Aug. 12, 2010, entitled “Computing and Evaluating Syntactic Complexity Features for Spontaneous Speech of Non-Native Test Takers.” The entirety of these applications is herein incorporated by reference.
Number | Date | Country
---|---|---
61/309,233 | Mar. 1, 2010 | US
61/372,964 | Aug. 12, 2010 | US