This invention relates to automated assessment of speech skills.
Speech skills can be important, for example, in jobs that may require spoken interaction with customers. For example, a telephone call center agent may require good speech skills in order to interact with customers effectively. In some cases, a person may require good speech skills in a number of languages. Speech skills can include, for example, fluency, pronunciation accuracy, and appropriate speaking rate.
One way to evaluate the speech skills of a person is for another person to converse with that person to assess their skills. Another way is to provide the text of a passage to the person, and record the person reading the passage. This recording can later be evaluated by another person to assess the speech skills.
In a general aspect, an approach to evaluating a person's speech skills includes automatically processing speech of a person and text corresponding to some or all of the speech.
In another aspect, in general, a job application procedure includes collecting speech from an applicant, and using text corresponding to the collected speech to automatically assess speech skills of the applicant. The text may include text that is presented to the applicant and the speech collected from the applicant can include the applicant reading the presented text.
In another aspect, a computer system provides remote users with an assessment of their speech skills. The computer system can provide services to other parties, for example, as a hosted service to companies assessing the speech skills of job applicants.
Advantages of the approach can include one or more of the following.
An automated screening procedure for speech skills can be performed without requiring another person to listen to speech, either live or from a recording. Because a person is not required, automated systems (e.g., in an employment application kiosk) can be used to perform speech skills assessment that is used for screening purposes.
An automated speech skills assessment can be used to provide an initial ranking of speakers by a skills score. For example, this ranking can be used to select top scoring job applicants.
Other features and advantages of the invention are apparent from the following description, and from the claims.
Referring to
In order to assess the speech skills of the user 114, the system analyzes the recorded speech in conjunction with the text that was presented to the user. A variety of aspects of the speech input are evaluated in various examples of the system. The aspects can relate to various characteristics of the input that may indicate or be correlated with skill level. For example, words may be missing or incorrectly substituted with other words (i.e., reading errors), the user may restart reading portions of the text, and sections of the text may be omitted. Words may be read accurately, but be mispronounced. Reading rate may be irregular (i.e., not fluent), or may be significantly faster or slower than an average or typical reading rate. Intonation may not be appropriate to the text being read, for example, with pitch not matching a question in the text.
Referring to
As a first step to analysis of the speech, a transcript alignment procedure 130 is used to match the speech recording and the presented text. In some examples, a transcript alignment procedure described in co-pending application Ser. No. 10/384,273, titled “T
Therefore in some examples, the transcript alignment procedure is robust to portions of the text not being spoken, or being spoken so poorly that they cannot be matched to the corresponding text, and to repetitions and restarts of portions of the text, while still alignment data 132 providing timing information such as overall reading rate, local reading rate for different parts of the text, a degree of variation in reading rate, time alignment indicating the start time and end time of passages, sentences, words, or subword units (e.g., syllables or phonemes). Referring to
The skills scoring step 140 (see
Match scores of one or more granularities of speech units (e.g., sentences, words, syllables, phones) are computed based on the time alignment provided in the alignment data. For example, the match of the speech to phonetic models, for example, based on spectral characteristics is computed for each of the aligned phones (step 410). The scores for the individual units are then combined into an overall pronunciation score, as well as scores for various classes of units. For example, with acoustic match scores computed for aligned phonemes, a score for each of a set of classes of phonemes is computed (step 415). For example, classes of phonemes defined by a place of articulation (e.g., front, back, central, labial, dental, alveolar, post-alveolar/palatal, velar/glottal) and/or degree of stricture (e.g., close, close-mid, open-mid, open, stop, affricate, nasal, fricative, approximant, lateral approximant) are used to determine a score for each class. The scores may be presented in a visual form in two dimensions with the scores indicated by color, as shown in
A reading rate is computed from the alignment data (step 420). For example, the overall reading rate as compared to an average or typical rate for the passage, as well local reading rate for different portions of the passage and variability in reading rate are calculated. From this, fluency, uniformity of reading rate, or match of reading rate to a model of appropriate reading rate (or reading rate variation) for the text are used to compute fluency and reading rate scores (step 425). Other forms of appropriate prosodies, including appropriate pitch variation, can also be me measured.
Discontinuities in the reading of the text, for example, due to restarts or to skipped portions are detected in the alignment data (step 430). Based on these detections, a score representative of a degree of continuity of the reading (e.g., lack of restarting, missing words, etc.) is computed (step 435).
In some examples, an overall score that combines various individual scores (e.g., pronunciation, fluency, continuity) is computed in the skills scoring module. For example, the overall score provides a way to rank different users of the system.
In some examples of such a system, a match to the phonetic models is scored in step 410 based on a wordspotting approach in which the text is divided into a number of words or phrases, and each word or phrase is associated with a detection score in the speech as well as the detected start and end time for the word or phrase, or is determined to be missing from the transcript in an appropriate sequence with the other words or phrases.
An overall match score is then computed as follows:
The terms in this expression are defined as follow
n is the number of phrases in the script
si is the score for the i-th phrase as determined by the word spotting engine
p is the number of missed phonemes (see below), 0≦p≦n
q is the number of bad phonemes (see below), 0≦q≦n
α is the penalty for a missed phoneme, typically 3
β is the penalty for a bad phoneme, typically 1
A missed phoneme is a phoneme that occurs in the script but is not found by the engine when it processes the specific media file. A bad phoneme is a phoneme whose average score falls below a certain threshold.
In some examples of such a system, a fluency score is determined as the ratio of the sum of the durations for each phrase in the script over the entire duration of the script, computed as follows:
The terms in this expression are defined as follows:
Skills assessments for the specific skills or characteristics are optionally combined, for example, by a predetermined weighting or by a non-linear combination, to yield an overall skill assessment for the user.
In some examples of such a system, a global score is computed as a linear combination (e.g., weighted average) of pronunciation and fluency scores as follows:
S
G
=λS
P+(1−λ)SF
Where λ is the weighting factor that ranges from 0 to 1, typically ⅔. In other examples, the global score could also be computed as a non-linear function. In some examples, the global score is a linear or non-linear combination of one or more of pronunciation score, fluency score, speaking rate score (derived from but not necessarily equal to the speaking rate), and continuity score.
In some examples, particular portions of a presentation text have been previously identified as being particularly indicative of a user's speech skills. These portions may be identified by a linguistic expert, or may be identified based on statistical techniques. As an example of a statistical technique, a corpus of recorded passages may be associated with skill scores assigned by listeners to the passages. A statistical approach is then used to weight different portions and/or different characteristics to best match the listener generated scores. In this way, certain passages may be more relied upon than others. Rather than weighting, portions of the text to be relied upon are selected based on the listener's data.
The skills assessment system may be integrated into a number of different overall applications. Referring to
In some examples, the skills evaluation is performed in a hosted system that provides a service to other entities. For example, a company may contract with a hosted system service to evaluate the speech skills of job applicants to that company. For example, the company may provide the recordings of the job applicants to the service, or provide a way for the job applicants to directly provide their speech to the service. The service may evaluate the speech in a full automated manner using the system described above, or may perform a combination of automated and manual evaluation of the speech. If there is a manual component to the evaluation, data such as the alignment data may be used as an aid to the manual component. For example, portions of the speech corresponding to particular passages in the text may be played to a listener that evaluates the skills.
Referring to
In some examples, the speech skills evaluation is performed repeatedly, for example, in an on-going testing mode. For example, an employee in a call center may be tested periodically, or at random, during their employment.
In some examples, rather than the user reading a presentation text, the speech that is evaluated corresponds to a scripted portion of an interaction. For example, a call center telephone agent may answer the telephone with a standard greeting, or may describe a product with a scripted description, and a corresponding portion of a logged telephone call is used for the speech skills assessment.
In some examples, the skills assessment is used for multiple languages with one user or in a non-native language for the user.
Embodiments of the approaches described above can be implemented in software, for example, in a stored program. The software can include instructions embodied on a computer-readable medium, such as on a magnetic or optical disk or on a network communication link. The instructions can include machine instructions, interpreter statements, scripts, high-level program language statements, or object code. Computer implemented embodiments can include client and server components, for example, with an interface being hosted in a client component and analysis components being hosted in a server component.
It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application No. 60/941,783, filed Jun. 4, 2007, which is incorporated herein by reference. This application is also related to U.S. Pat. No. 7,231,351, titled “TRANSCRIPT ALIGNMENT,” issued on Jun. 12, 2007, which is incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
60941783 | Jun 2007 | US |