The technology described herein relates generally to automated language assessment and more specifically to automatic assessment of spoken language in a multiparty conversation.
Assessment of a person's speaking proficiency is often performed in education and in other domains. One aspect of speaking proficiency is communicative competence, such as a person's ability to adequately converse with one or more interlocutors (who may be human dialog partners or computer programs designed to be dialog partners). The skills involved in contributing adequately, appropriately, and meaningfully to the pragmatic and propositional context and content of the dialog situation are often overlooked. Even in situations where conversational skills are assessed, the assessment is often performed manually, which is costly and time-consuming and lacks objectivity.
In accordance with the teachings herein, computer-implemented systems and methods are provided for automatically scoring spoken language in multiparty conversations. For example, a computer performing the scoring of multiparty conversations can receive a conversation between an examinee and at least one interlocutor. The computer can select a portion of the conversation. The portion includes one or more examinee utterances and one or more interlocutor utterances. The computer can assess the portion using one or more metrics, such as: a pragmatic metric for measuring a pragmatic fit of the one or more examinee utterances; a speech act metric for measuring a speech act appropriateness of the one or more examinee utterances; a speech register metric for measuring a speech register appropriateness of the one or more examinee utterances; and an accommodation metric for measuring a level of accommodation of the one or more examinee utterances. The computer can compute a final score for the portion of the conversation based on at least the one or more metrics applied.
The spoken conversation 100 is then converted into textual data at 110. In one embodiment, the conversion is performed by automatic speech recognition software, which is well known in the art. The conversion may also be performed manually (e.g., via human transcription) or by any other method known in the art.
Once converted, the conversation is processed by a feature computation module 120, which has access to both the original audio information and the converted textual information. The computation module 120 computes a set of features addressing, for example, pragmatic competence and other aspects of the examinee's conversational proficiency. In one embodiment, a pragmatic fit metric 130 is used to analyze the pragmatic adequacy of the examinee's utterances. A speech act appropriateness metric 140 may be used to analyze whether the examinee is appropriately using and interpreting speech acts. Since different sociolinguistic relationships may call for different speech patterns, a speech register appropriateness metric 150 may be used to analyze whether the examinee is speaking appropriately given his character's sociolinguistic relationship with the interlocutor(s). In addition, an accommodation metric 160 may be used to measure the degree to which the examinee accommodates the speech patterns of the interlocutor(s).
After the feature computation module 120 has analyzed the various features of the examinee's utterances, a scoring model 170 uses the results of the various metrics to predict a score reflecting an assessment of the examinee's communicative competence. Different weights may be applied to the metric results according to their perceived relative importance.
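By way of non-limiting illustration, the weighting described above can be sketched as a weighted average of the metric results. The function name `final_score`, the metric names, the 0-to-1 score range, and the weight values are all assumptions made for this example; the scoring model 170 is not limited to a linear combination.

```python
def final_score(metric_results, weights):
    """Combine metric results into one score using a weighted average.

    metric_results: dict mapping a metric name to its result (assumed 0..1).
    weights: dict mapping a metric name to its relative importance.
    Both the names and the value ranges are illustrative assumptions.
    """
    total_weight = sum(weights[name] for name in metric_results)
    if total_weight == 0:
        raise ValueError("at least one metric with a nonzero weight is required")
    weighted_sum = sum(weights[name] * value for name, value in metric_results.items())
    return weighted_sum / total_weight


# Hypothetical metric results for one portion of a conversation.
metric_results = {
    "pragmatic_fit": 0.8,
    "speech_act": 0.6,
    "speech_register": 1.0,
    "accommodation": 0.4,
}
# Hypothetical weights: pragmatic fit is treated as twice as important.
weights = {
    "pragmatic_fit": 2.0,
    "speech_act": 1.0,
    "speech_register": 1.0,
    "accommodation": 1.0,
}
score = final_score(metric_results, weights)
```

Normalizing by the total weight keeps the final score in the same range as the individual metric results, so the weights express relative importance only.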
At 310, each examinee utterance's context is determined. A context, for example, may be one or more immediately preceding utterances made by the interlocutor(s) and/or the examinee. The context may also include the topic or setting of the conversation or any other indication as to what utterance can be expected given that context.
At 320, one or more pragmatic models are identified based on the context of each examinee utterance. The context, which may be a preceding interlocutor utterance, helps the system determine what utterances are expected in that context. For example, if the context is the interlocutor saying, “How are you?”, an expected utterance may be, “I am fine.” Thus, based on the context, the system can determine which pragmatic model to use to analyze the pragmatic fit of the examinee's utterance in that context. The expected utterances may be predetermined by human experts or via supervised learning.
The pragmatic models may be implemented by any means. For example, a pragmatic model may involve calculating the edit distance between the examinee utterance and one or more expected utterances. Another example of a pragmatic model may involve using formal languages (e.g., regular expressions or context free grammars) that model one or more expected utterances.
At 330, the identified one or more pragmatic models, which are associated with a given context, are applied to the examinee's utterance associated with that same context. Extending the exemplary implementations discussed in the paragraph immediately above, this step may involve calculating an edit distance between the examinee's utterance and each expected utterance, and/or matching the examinee's utterance against each regular expression.
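As one hedged illustration of the regular-expression variant of step 330, the sketch below looks up the patterns associated with a context and reports which of them the examinee's utterance matches. The table `EXPECTED_PATTERNS`, the patterns themselves, and the helper names are hypothetical and chosen only for this example.

```python
import re

# Hypothetical table: normalized context -> regular expressions that model
# expected utterances in that context.
EXPECTED_PATTERNS = {
    "how are you?": [
        re.compile(r"^(i am|i'm) (fine|good|well|okay)\b", re.IGNORECASE),
        re.compile(r"^not (too )?bad\b", re.IGNORECASE),
    ],
}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def matching_patterns(context, utterance):
    """Return the source of every pattern for this context that the utterance matches."""
    patterns = EXPECTED_PATTERNS.get(normalize(context), [])
    cleaned = normalize(utterance)
    return [p.pattern for p in patterns if p.search(cleaned)]


matches = matching_patterns("How are you?", "I'm fine, thanks.")
no_matches = matching_patterns("How are you?", "The weather is nice.")
```

A context with no entry in the table yields an empty list, which a caller could treat as "no pragmatic model identified" for that utterance.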
At 340, the results of applying the pragmatic models are used to determine a pragmatic fit score for the portion of conversation from which the examinee's utterances are sampled. The pragmatic fit score for the portion of conversation selected may be determined, for example, based on scores given to individual examinee utterances in that portion of conversation (e.g., the pragmatic fit score may be an average of the scores of the individual examinee utterances). The score for each examinee utterance may, for example, be based on the results of one or more different pragmatic models applied to that examinee utterance (e.g., the score for an examinee utterance may be an average of the edit distance result and the regular expression result). The manner in which the result of a pragmatic model is determined depends on the nature of the model. Take, for example, the edit distance pragmatic model described above. Each expected utterance may have an associated correctness weight depending on how well the expected utterance fits in the given context. Based on the calculated edit distances between the examinee's utterance and each of the expected utterances, a best match is determined. The correctness weight of the best-matching expected utterance, for example, may then be the result of applying the edit distance model. The result of the regular expression model may similarly be based on the correctness weight associated with a best-matching regular expression.
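The edit distance model and the averaging described above can be sketched as follows. This is a minimal example under stated assumptions: the distance is a word-level Levenshtein distance, punctuation is stripped by a simple hypothetical `tokens` helper, and the expected utterances and correctness weights are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def tokens(text):
    """Lowercase, drop basic punctuation, and split into words (illustrative only)."""
    for mark in ".,?!":
        text = text.replace(mark, "")
    return text.lower().split()


def utterance_score(utterance, expected):
    """Return the correctness weight of the best-matching expected utterance.

    expected: list of (expected_utterance, correctness_weight) pairs.
    """
    best = min(expected, key=lambda pair: edit_distance(tokens(utterance), tokens(pair[0])))
    return best[1]


def portion_score(utterances, expected):
    """Average the per-utterance scores over a portion of the conversation."""
    scores = [utterance_score(u, expected) for u in utterances]
    return sum(scores) / len(scores)


# Hypothetical expected utterances for the context "How are you?".
expected = [("I am fine.", 1.0), ("Yes.", 0.2)]
pragmatic_fit = portion_score(["I am fine, thank you.", "Yes"], expected)
```

In this sketch the result depends only on which expected utterance is closest, not on how close it is; an implementation could additionally discount the weight as the distance grows, which the description above leaves open.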
At 410, each examinee utterance's context is determined. The context may be any indication as to what speech act can be expected given that context (e.g., one or more preceding utterances by the interlocutor and/or examinee). For a given examinee utterance, the context determined for the speech act analysis may or may not be the same as the context determined for the pragmatic fit analysis described above.
At 420, one or more speech act models are identified based on the context of each examinee utterance. The context helps the system determine what speech acts are expected. Thus, based on the context, the system can determine which speech act model to use to analyze the appropriateness of the examinee's speech act in that context.
The speech act models may be implemented by any means and focused on different linguistic features. For example, lexical choice, grammar, and intonation may all provide cues for speech acts. Thus, the identified speech act models may analyze any combination of linguistic features when comparing the examinee utterance with the expected speech acts. The model may utilize any linguistic comparison or extraction tools, such as formal languages (e.g., regular expressions or context free grammars) and speech act classifiers.
At 430, the identified one or more speech act models, which are associated with a given context, are applied to the examinee's utterance associated with that same context. Then at 440, the results of applying the speech act models are used to determine a speech act appropriateness score for the portion of conversation from which the examinee's utterances are sampled. The speech act appropriateness score for the portion of conversation selected may be determined, for example, based on scores given to individual examinee utterances in that portion of conversation (e.g., the speech act appropriateness score may be an average of the scores of the individual examinee utterances). The score for each individual examinee utterance may, for example, be based on the results of one or more speech act models applied to that examinee utterance (e.g., the score for an examinee utterance may be an average of the speech act model results). With respect to the result of an individual speech act model, in one embodiment the result is proportional to the correctness weight associated with each expected speech act.
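Steps 430 and 440 can be illustrated with the following sketch. It assumes a toy rule-based classifier driven by lexical cues; the cue table `SPEECH_ACT_CUES`, the speech act labels, and the correctness weights are hypothetical, and a trained speech act classifier could take the place of `classify_speech_act`.

```python
import re

# Hypothetical lexical cues, checked in order; the first match wins.
SPEECH_ACT_CUES = [
    ("question", re.compile(r"\?\s*$")),
    ("apology", re.compile(r"\b(sorry|apologi[sz]e)\b", re.IGNORECASE)),
    ("thanks", re.compile(r"\bthank(s| you)\b", re.IGNORECASE)),
    ("greeting", re.compile(r"^(hello|hi|good (morning|afternoon|evening))\b", re.IGNORECASE)),
]


def classify_speech_act(utterance):
    """Return the label of the first cue that matches, or 'statement' by default."""
    cleaned = utterance.strip()
    for label, pattern in SPEECH_ACT_CUES:
        if pattern.search(cleaned):
            return label
    return "statement"


def speech_act_score(utterance, expected_acts):
    """Return the correctness weight of the detected act, or 0.0 if it is unexpected.

    expected_acts: dict mapping an expected speech act label to its correctness weight.
    """
    return expected_acts.get(classify_speech_act(utterance), 0.0)


def portion_score(turns):
    """Average the per-utterance scores; turns is a list of (utterance, expected_acts)."""
    scores = [speech_act_score(utterance, expected) for utterance, expected in turns]
    return sum(scores) / len(scores)


# Hypothetical turns: each examinee utterance with the acts expected in its context.
turns = [
    ("I am sorry I am late.", {"apology": 1.0, "statement": 0.3}),
    ("It is raining.", {"thanks": 1.0, "statement": 0.5}),
]
appropriateness = portion_score(turns)
```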
Then at 540, based on the comparison results, a speech register appropriateness score for the selected conversation portion is determined. The speech register appropriateness score may be determined, for example, based on scores given to individual examinee utterances in that portion of conversation (e.g., the speech register appropriateness score may be an average of the scores of the individual examinee utterances). The score for each individual examinee utterance may, for example, be based on the results of one or more speech register models applied to that examinee utterance (e.g., the score for an examinee utterance may be an average of the speech register model results). With respect to the result of an individual speech register model, in one embodiment the result is proportional to the correctness weight associated with each expected speech register.
At 600, a portion of the conversation is identified. At 610, examinee utterances and interlocutor utterances are identified within the conversation portion. In one embodiment, a relationship between the examinee utterances and interlocutor utterances may also be identified so that each examinee utterance is compared to the proper corresponding interlocutor utterance(s). The relationship may be based on time (e.g., utterances within a time frame are compared), chronological sequence (e.g., each examinee utterance is compared with the preceding interlocutor utterance(s)), or other associations.
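The two example relationships mentioned above, chronological sequence and time frame, can be sketched as follows. The `Utterance` record, its fields, and the window length are assumptions made for this illustration only.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # "examinee" or "interlocutor" (assumed labels)
    start: float   # start time in seconds
    text: str


def pair_by_sequence(utterances):
    """Pair each examinee utterance with the interlocutor utterances
    spoken since the previous examinee utterance."""
    pairs = []
    pending = []
    for utt in sorted(utterances, key=lambda u: u.start):
        if utt.speaker == "interlocutor":
            pending.append(utt)
        else:
            pairs.append((utt, pending))
            pending = []
    return pairs


def pair_by_time(utterances, window):
    """Pair each examinee utterance with the interlocutor utterances
    that started no more than `window` seconds before it."""
    interlocutor = [u for u in utterances if u.speaker == "interlocutor"]
    pairs = []
    for utt in utterances:
        if utt.speaker == "examinee":
            nearby = [i for i in interlocutor if 0 <= utt.start - i.start <= window]
            pairs.append((utt, nearby))
    return pairs


# Hypothetical conversation portion.
conversation = [
    Utterance("interlocutor", 0.0, "Hello."),
    Utterance("interlocutor", 1.5, "How are you?"),
    Utterance("examinee", 3.0, "I am fine."),
    Utterance("interlocutor", 5.0, "Good."),
    Utterance("examinee", 9.0, "Yes."),
]
by_sequence = pair_by_sequence(conversation)
by_time = pair_by_time(conversation, window=2.0)
```

Under the sequence-based pairing, an examinee utterance that directly follows another examinee utterance receives an empty list; how such cases are handled is a design choice left open here.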
At 620, one or more linguistic features (e.g., grammatical construction, lexical choice, pronunciation, prosody, rate of speech, and intonation) of the examinee utterances are modeled, and the same or related linguistic features of the interlocutor utterances are similarly modeled. At 630, each examinee model is compared with one or more corresponding interlocutor models. For example, the examinee models and interlocutor models that are related to rate of speech are compared, and the models that are related to intonation are compared. In one embodiment, each model is also associated with an utterance, and the model for an examinee utterance is compared to the model for an interlocutor utterance associated with that examinee utterance. In another embodiment, comparison is made between an examinee model representing a linguistic pattern of the examinee's utterance over time, and an interlocutor model representing a linguistic pattern of the interlocutor's utterance over the same time period. Then at 640, based on the comparison results an accommodation score for the selected conversation portion is determined.
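As a minimal sketch of steps 620 through 640 for a single linguistic feature, the example below models rate of speech as words per second, compares each examinee utterance with its associated interlocutor utterance, and averages the similarities. The similarity measure, the 0-to-1 score range, and the sample data are assumptions; the description above does not prescribe a particular comparison.

```python
def speech_rate(text, duration_seconds):
    """Model rate of speech as words per second."""
    return len(text.split()) / duration_seconds


def similarity(a, b):
    """Return 1.0 for identical rates, approaching 0.0 as they diverge.

    This ratio-based measure is an illustrative assumption.
    """
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(a, b)


def accommodation_score(pairs):
    """Average the rate similarity over (examinee, interlocutor) utterance pairs.

    Each element of a pair is a (text, duration_seconds) tuple.
    """
    similarities = [
        similarity(speech_rate(*examinee), speech_rate(*interlocutor))
        for examinee, interlocutor in pairs
    ]
    return sum(similarities) / len(similarities)


# Hypothetical pairs of (examinee utterance, associated interlocutor utterance).
pairs = [
    (("I am fine thanks", 2.0), ("How are you today", 2.0)),
    (("Yes", 1.0), ("Do you like it", 2.0)),
]
accommodation = accommodation_score(pairs)
```

Other features named above, such as intonation or lexical choice, would need their own models and comparison functions, whose results could be averaged in the same way.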
A disk controller 960 interfaces one or more optional disk drives to the system bus 952. These disk drives may be external or internal floppy disk drives such as 962, external or internal CD-ROM, CD-R, CD-RW or DVD drives such as 964, or external or internal hard drives 966. As indicated previously, these various disk drives and disk controllers are optional devices.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 960, the ROM 956 and/or the RAM 958. Preferably, the processor 954 may access each component as required.
A display interface 968 may permit information from the bus 952 to be displayed on a display 970 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 973.
In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 972, or other input device 974, such as a microphone, remote control, pointer, mouse and/or joystick.
The invention has been described with reference to particular exemplary embodiments. However, it will be readily apparent to those skilled in the art that it is possible to embody the invention in specific forms other than those of the exemplary embodiments described above. The embodiments are merely illustrative and should not be considered restrictive. The scope of the invention is reflected in the claims, rather than the preceding description, and all variations and equivalents which fall within the range of the claims are intended to be embraced therein.
Applicant claims benefit pursuant to 35 U.S.C. §119 and hereby incorporates by reference the following U.S. Provisional Patent Application in its entirety: “AUTOMATED SCORING OF SPOKEN LANGUAGE IN MULTIPARTY CONVERSATIONS,” App. No. 61/806,001, filed Mar. 28, 2013.
Number | Date | Country
---|---|---
61806001 | Mar 2013 | US