Phonetic distance measurement system and related methods

Description

FIELD OF THE INVENTION

The present invention relates to the quantification of acoustic dissimilarity between phonetic elements, and more particularly, to the determination of phonetic distance in the context of speech recognition engines.

BACKGROUND OF THE INVENTION

Conceptually, phonetic “distance” is an attempt to quantify the acoustic dissimilarity between phonetic elements, such as phonemes or words. The phonetic distance between similar sounding phonetic elements is less than the phonetic distance between disparate sounding phonetic elements.

Phonetic distance matrices have been compiled listing the phonetic distance between each pair of phonetic elements within a phonetic element set. For instance, the set can include all the phonemes used in a given spoken language, such as English. The phonetic distance matrix is used as a tool in a variety of applications. For example, the phonetic distance matrix can be used to evaluate the grammar of a speech recognition engine. Grammar paths can be selected with sufficient acoustic separation to minimize recognition errors during subsequent operation of the speech recognition engine.

Currently, phonetic distance is estimated based on knowledge of the physiological mechanisms underlying human pronunciation. Conventionally, the phonetic distances thus estimated are rated on a 0-10 scale.

SUMMARY OF THE INVENTION

In view of the foregoing, it is an object of the present invention to provide an improved system and method for measuring phonetic distances. Another object of the present invention is to provide a system and method for measuring phonetic distances that can take into account the unique characteristics of particular speech recognition engines and/or speakers.

According to an embodiment of the present invention, a phonetic distance measurement system includes a reference file, a recognized speech file, a comparison module configured to determine a plurality of error occurrences by comparing the recognized speech file and the reference file, an error rate module configured to determine a plurality of error rates corresponding to the plurality of error occurrences, and a measurement module configured to determine a plurality of phonetic distances as a function of the plurality of error rates.

According to a method aspect, a method of generating a phonetic distance matrix includes determining a plurality of error occurrences by comparing a recognized speech file with a reference file and determining a plurality of error rates corresponding to the plurality of error occurrences. A plurality of phonetic distances are determined as a function of the plurality of error rates, and a phonetic distance matrix is output based on the plurality of phonetic distances.

According to another aspect of the present invention, the errors for which occurrences and rates are determined by the system and method include substitution, insertion and deletion.

According to a further aspect of the present invention, phonetic distances are normalized to minimize the total separation between the phonetic distance matrix and an existing phonetic distance matrix, for instance, by using a mapping function with three normalization coefficients.

According to additional aspects of the present invention, the error rates and/or phonetic distances determined by the present invention are used in further applications, such as grammar selection for speech recognition engines, and language training and evaluation.

These and other objects, aspects and advantages of the present invention will be better understood in view of the drawings and following detailed description of preferred embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic overview of a phonetic distance measurement system, according to an embodiment of the present invention; and

FIG. 2 is a flow diagram of a method of generating a phonetic distance matrix, according to a method aspect of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Referring to FIG. 1, according to an embodiment of the present invention, a phonetic distance measurement system 10 includes at least one recognized speech file 12, at least one reference file 14, a comparison module 16, an error rate module 18, and a measurement module 20. The systems and methods herein are realized by at least one processor executing machine-readable code. Inputs to and outputs of the system or method are stored, at least temporarily, in some form of machine-readable memory. However, the present invention is not necessarily limited to particular processor types, numbers or designs, to particular code formats or languages, or to particular hardware or software memory media.

The recognized speech file 12 is generated using a speech recognition engine to process a speech audio file. The results of this processing are also referred to as a hypothesis. The reference file 14 contains the actual material spoken when recording the speech audio file and is used to identify errors in the recognized speech file 12.

In general, recognition errors fall into three categories: substitution errors, insertion errors and deletion errors. A substitution error occurs when a phonetic element is incorrectly recognized as another phonetic element. Although the operations of a speech recognition engine underlying the occurrence of insertion and deletion errors are more complex, an insertion error can be said to occur where the hypothesis contains a phonetic element where there should not be any phonetic element. A deletion error can be said to occur where the hypothesis is missing a phonetic element where a phonetic element ought to be.

The comparison module 16 is configured to compare the recognized speech file 12 and the reference file 14 to identify recognition errors. For each phonetic element, the comparison module 16 identifies all occurrences of substitution errors, insertion errors and deletion errors. For each phonetic element, the comparison module 16 further identifies how many times each other phonetic element was erroneously substituted.
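By way of a non-limiting illustration, the comparison can be sketched as a minimum edit distance alignment between the reference and the hypothesis. The description does not state which alignment technique the comparison module 16 uses, so the Levenshtein alignment with unit costs, the function name `count_errors` and the `EMPTY` marker below are all assumptions.

```python
from collections import Counter

EMPTY = None  # hypothetical marker for "no phonetic element"


def count_errors(reference, hypothesis):
    """Align two element sequences and count (reference, hypothesis) error pairs.

    Substitutions are keyed (ref, hyp), deletions (ref, EMPTY) and
    insertions (EMPTY, hyp). Assumes unit cost for every error type.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end to recover which errors were made.
    errors = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            reference[i - 1] != hypothesis[j - 1]
        ):
            if reference[i - 1] != hypothesis[j - 1]:
                errors[(reference[i - 1], hypothesis[j - 1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors[(reference[i - 1], EMPTY)] += 1  # deletion
            i -= 1
        else:
            errors[(EMPTY, hypothesis[j - 1])] += 1  # insertion
            j -= 1
    return errors
```

Keying every error as a (reference, hypothesis) pair keeps substitution counts direction dependent and lets insertions and deletions share the same structure.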

The recognized speech file 12 and reference file 14 can, themselves, include the phonetic elements for which the phonetic distance is to be determined. Alternately, the phonetic elements can be derived from the recognized speech file 12 and/or the reference file 14 using a dictionary 22. For instance, where the phonetic distance between phonemes is to be determined and the recognized speech and reference files 12, 14 consist of words, the comparison module 16 can determine the phonemes using the dictionary 22 listing the phonemes which comprise each word. It will be appreciated that the present invention is not necessarily limited to any particular language or phonetic elements thereof.
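As a minimal sketch of the dictionary lookup, word-level files can be expanded into phoneme sequences before comparison. The dictionary entries, phoneme symbols and the function name `to_phonemes` are hypothetical and stand in for whatever dictionary 22 actually contains.

```python
# Hypothetical pronunciation dictionary; entries are illustrative only.
DICTIONARY = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "sat": ["S", "AE", "T"],
}


def to_phonemes(words, dictionary=DICTIONARY):
    """Expand a word sequence into a flat phoneme sequence.

    Raises KeyError for a word that is missing from the dictionary.
    """
    phonemes = []
    for word in words:
        phonemes.extend(dictionary[word.lower()])
    return phonemes


reference_phonemes = to_phonemes(["cat", "sat"])
hypothesis_phonemes = to_phonemes(["cap", "sat"])
```

Here the two sequences differ only in the third phoneme, so a word-level error becomes a single phoneme substitution.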

In Tables 1-3, a fictitious phonetic element set has five phonetic elements (1-5) of the same type, such as phonemes, sub-phonemes or words. The data in the following tables is used for illustrative purposes only, and is not intended to reflect any actual phonetic element set or experimental data. In Table 1, the error occurrences for each phonetic element pair_i,j are shown, including substitution, insertion and deletion error occurrences, as determined by the comparison module 16. The total reference occurrences of each phonetic element, and the total reference occurrences of all phonetic elements, tabulated from the reference file, are also shown.

Each of the Error Occurrences_i,j represents the number of times the phonetic element i is erroneously replaced by the phonetic element j. For example, Error Occurrences_1,2 is the number of times phonetic element 2 is erroneously substituted for phonetic element 1.

For analytical purposes, insertion and deletion errors can be considered as special types of substitution errors. An insertion error can be considered as the substitution of an actual phonetic element (any of the phonetic elements 1-5, in the current example) for a non-existent or “empty” phonetic element. Likewise, a deletion error can be considered as the substitution of the empty phonetic element for an actual phonetic element.

The empty phonetic element can be considered as phonetic element “6”. Accordingly, Error Occurrences_1,6 is the number of times the empty phonetic element is erroneously substituted for the phonetic element 1 (i.e., a deletion error). Similarly, Error Occurrences_6,1 is the number of times the phonetic element 1 is erroneously substituted for the empty phonetic element (i.e., an insertion error).

Alternately, insertion and deletion can be treated as special error occurrences referenced by the actual phonetic element involved. In this case, Error Occurrences_1,ins represents the number of times phonetic element 1 is erroneously added and Error Occurrences_1,del represents the number of times phonetic element 1 is erroneously omitted. Both treatments of insertion and deletion errors appear in the following tables.
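The two treatments can be sketched as two views of one set of counts. The counts below are hypothetical, and the helper names `deletions` and `insertions` are assumptions introduced only for illustration.

```python
EMPTY = 6  # index the text assigns to the "empty" phonetic element

# Hypothetical counts keyed (i, j): element i erroneously replaced by j.
error_occurrences = {
    (1, 2): 30,      # substitution: 2 recognized where 1 was spoken
    (1, EMPTY): 4,   # deletion of element 1
    (EMPTY, 1): 7,   # insertion of element 1
}


def deletions(counts, element):
    """Error Occurrences_element,del under the second treatment."""
    return counts.get((element, EMPTY), 0)


def insertions(counts, element):
    """Error Occurrences_element,ins under the second treatment."""
    return counts.get((EMPTY, element), 0)
```

Storing the counts once under the empty-element treatment and deriving the per-element view on demand avoids keeping two copies that could drift apart.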

TABLE 1

Error Occurrences

Error Type        Substitution                  Deletion    Insertion   Per Element
i\j               1     2     3     4     5     “6” (del)   ins         Reference Occurrences
1                 -     30    0     10    20    0           10          640
2                 20    -     10    30    10    0           0           890
3                 0     20    -     0     30    10          10          530
4                 20    20    0     -     0     0           0           440
5                 40    0     20    10    -     10          20          1120
“6” (Insertion)   10    0     10    0     20    -           -           -

Total Reference Occurrences: 3620

In the context of speech recognition engines, substitution error occurrences between any two phonetic elements can be asymmetrical. The term “phonetic element pair” as used herein is direction dependent, unless otherwise indicated. For instance, the characteristics of the phonetic element pair_1,2 can be distinct from those of the phonetic element pair_2,1. This phenomenon can be seen in Table 1, for example, where phonetic element 2 is substituted thirty times for phonetic element 1, but phonetic element 1 is substituted for phonetic element 2 only twenty times.
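The asymmetry can be checked directly against the substitution counts of Table 1. In the following sketch the dictionary layout and the function name `asymmetric_pairs` are assumptions; only the counts come from the table.

```python
# Substitution counts from Table 1, keyed (i, j): i replaced by j.
SUBSTITUTIONS = {
    (1, 2): 30, (1, 3): 0, (1, 4): 10, (1, 5): 20,
    (2, 1): 20, (2, 3): 10, (2, 4): 30, (2, 5): 10,
    (3, 1): 0, (3, 2): 20, (3, 4): 0, (3, 5): 30,
    (4, 1): 20, (4, 2): 20, (4, 3): 0, (4, 5): 0,
    (5, 1): 40, (5, 2): 0, (5, 3): 20, (5, 4): 10,
}


def asymmetric_pairs(counts):
    """Return {(i, j): (count_ij, count_ji)} for i < j where the counts differ."""
    return {
        (i, j): (n, counts[(j, i)])
        for (i, j), n in counts.items()
        if i < j and n != counts[(j, i)]
    }


pairs = asymmetric_pairs(SUBSTITUTIONS)
```

For the illustrative data, most pairs differ by direction, which is why the matrices in this description are not assumed to be symmetric.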

The error rate module 18 is configured to calculate a substitution error rate for each phonetic element pair, as well as insertion and deletion error rates for each phonetic element, as a function of the total number of phonetic element occurrences within the reference file 14. For example:

Error Rate_i,j=(Error Occurrences_i,j)/(Total Reference Occurrences).

In Table 2, error rates are shown based on the number of errors and occurrences in Table 1.

TABLE 2

Error Rates

Error Type        Substitution                            Deletion    Insertion
i\j               1       2       3       4       5       “6” (del)   ins
1                 -       0.008   0.000   0.003   0.006   0.000       0.003
2                 0.006   -       0.003   0.008   0.003   0.000       0.000
3                 0.000   0.006   -       0.000   0.008   0.003       0.003
4                 0.006   0.006   0.000   -       0.000   0.000       0.000
5                 0.011   0.000   0.006   0.003   -       0.003       0.006
“6” (Insertion)   0.003   0.000   0.003   0.000   0.006   -           -
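The error rate calculation can be sketched as follows for a handful of Table 1 entries. The dictionary keys (including the `"del"` and `"ins"` labels) and the function name `error_rates` are assumptions; the counts and the total of 3620 come from Table 1.

```python
TOTAL_REFERENCE_OCCURRENCES = 3620  # from Table 1

# A few Table 1 counts, keyed (i, j); "del" and "ins" mark the special columns.
error_occurrences = {
    (1, 2): 30,
    (2, 1): 20,
    (5, 1): 40,
    (3, "del"): 10,
    (5, "ins"): 20,
}


def error_rates(counts, total):
    """Divide each error count by the total reference occurrences."""
    if total <= 0:
        raise ValueError("total reference occurrences must be positive")
    return {pair: n / total for pair, n in counts.items()}


rates = error_rates(error_occurrences, TOTAL_REFERENCE_OCCURRENCES)
for pair, rate in rates.items():
    print(pair, f"{rate:.3f}")
```

Rounded to three decimal places, these rates agree with the corresponding Table 2 entries.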

The measurement module 20 is configured to determine the phonetic distance between the phonetic element pairs as a function of the corresponding error rate:

Phonetic Distance_i,j=f(Error Rate_i,j)

To reflect that a greater phonetic distance will result in lower recognition error, the phonetic distance can be set to vary inversely with the error rate. For example:

Phonetic Distance_i,j ∝ 1/(Error Rate_i,j);
Or:
Phonetic Distance_i,j ∝ 1/((Error Rate_i,j)² + Error Rate_i,j);
Or:
Phonetic Distance_i,j ∝ 1/log(Error Rate_i,j);
Or:
Phonetic Distance_i,j ∝ 1/e^(Error Rate_i,j).
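For illustration only, the four exemplary relationships can be written as small functions. The proportionality constant `k`, the function names and the choice of the natural logarithm are assumptions; the description leaves them open, and it also does not say how an error rate of exactly zero is handled (the first three forms are undefined there).

```python
import math


def inverse(rate, k=1.0):
    return k / rate


def inverse_quadratic(rate, k=1.0):
    return k / (rate ** 2 + rate)


def inverse_log(rate, k=1.0):
    # Natural log is an assumption; the result is negative for 0 < rate < 1.
    return k / math.log(rate)


def inverse_exp(rate, k=1.0):
    return k / math.exp(rate)


rate = 30 / 3620  # Error Rate_1,2 from Table 1
for f in (inverse, inverse_quadratic, inverse_log, inverse_exp):
    print(f.__name__, f(rate))
```

With a positive `k`, each of the four forms decreases as the error rate rises within the open interval (0, 1), which is the inverse variation the text calls for.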

While the phonetic distance could be set equal to the inverse of the error rate, or some other function thereof, it is advantageous for analytical purposes to normalize the phonetic distance values. A potentially valuable way to normalize the values is to minimize the separation from an existing phonetic distance matrix; for instance, a phonetic distance matrix determined using physiological considerations and/or an earlier phonetic distance matrix determined by averaging repeated iterations of the present invention with different speakers, speech recognition engines and/or reference texts.

The separation between the earlier existing phonetic distance matrix and current phonetic distance matrices can be expressed as:

L(α₁, α₂, α₃) = Σ_i,j (Existing Phonetic Distance_i,j − Current Phonetic Distance_i,j)²; where α₁, α₂ and α₃ are normalization coefficients independent of i and j.

The values of the coefficients α₁, α₂ and α₃ can be determined by minimizing the separation between the existing phonetic distance matrix and the current phonetic distance matrix, L(α₁, α₂, α₃).

The normalized current phonetic distance matrix can then be calculated using the following exemplary mapping function:

Phonetic Distance_i,j = α₁ + (α₂/(Error Rate_i,j − α₃)).
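One possible way to determine the three coefficients is sketched below. The description does not prescribe a minimization technique; this sketch assumes a grid search over candidate values of α₃ combined with closed-form least squares for α₁ and α₂, and the data, candidate grid and function names are hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a1 + a2 * x; returns (a1, a2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a2 = sxy / sxx
    return mean_y - a2 * mean_x, a2


def fit_mapping(rates, existing, a3_candidates):
    """Minimize L(a1, a2, a3) = sum((existing - (a1 + a2 / (rate - a3))) ** 2).

    For each candidate a3 the model is linear in a1 and a2, so those two are
    solved in closed form and the candidate with the smallest L is kept.
    """
    best = None
    for a3 in a3_candidates:
        xs = [1.0 / (r - a3) for r in rates]
        a1, a2 = fit_linear(xs, existing)
        loss = sum((d - (a1 + a2 * x)) ** 2 for d, x in zip(existing, xs))
        if best is None or loss < best[0]:
            best = (loss, a1, a2, a3)
    return best


# Hypothetical data: error rates and the distances an existing matrix assigns.
rates = [0.000, 0.003, 0.006, 0.008, 0.011]
true_a1, true_a2, true_a3 = 1.0, 0.09, -0.01
existing = [true_a1 + true_a2 / (r - true_a3) for r in rates]

# Candidates below the smallest rate keep (rate - a3) away from zero.
candidates = [-0.001 * k for k in range(1, 51)]
loss, a1, a2, a3 = fit_mapping(rates, existing, candidates)
```

Because the synthetic distances were generated from the mapping function itself, the search recovers the generating coefficients with a loss close to zero; real matrices would leave a residual.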

Alternately, the phonetic distances could simply be normalized to a scale of 0-10 by the measurement module 20 so that the general magnitude of the measurements is more familiar to those accustomed to the previous phonetic distance measurements based on physiological considerations. In Table 3, a phonetic distance matrix is shown based on the assumption that the maximum error rate (Error Rate_5,1 in Table 2) corresponds to a phonetic distance of 1, while the minimum error rate (the 0.000 error rates in Table 2) corresponds to a phonetic distance of 10, with the remaining error rates falling linearly therebetween. Where i=j the phonetic distance is set to 0.

TABLE 3

Phonetic Distances

                                                Deletion    Insertion
i\j               1     2     3     4     5     “6” (del)   ins
1                 0.0   3.3   10.0  7.8   5.5   10.0        7.8
2                 5.5   0.0   7.8   3.3   7.8   10.0        10.0
3                 10.0  5.5   0.0   10.0  3.3   7.8         7.8
4                 5.5   5.5   10.0  0.0   10.0  10.0        10.0
5                 1.0   10.0  5.5   7.8   0.0   7.8         5.5
“6” (Insertion)   7.8   10.0  7.8   10.0  5.5   0.0         -
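The linear mapping used for Table 3 can be sketched as follows, working from the unrounded error rates (the Table 1 counts divided by 3620). The function name `linear_distance` and its `near` and `far` parameters are assumptions.

```python
def linear_distance(rate, max_rate, near=1.0, far=10.0):
    """Map rate 0 to `far` and max_rate to `near`, linearly in between."""
    return far - (far - near) * (rate / max_rate)


TOTAL = 3620  # total reference occurrences from Table 1
max_rate = 40 / TOTAL  # Error Rate_5,1, the largest in Table 2

for count in (0, 10, 20, 30, 40):
    print(count, linear_distance(count / TOTAL, max_rate))
```

Counts of 0, 10, 20, 30 and 40 map to approximately 10, 7.75, 5.5, 3.25 and 1, which Table 3 presents to one decimal place (7.75 as 7.8 and 3.25 as 3.3). The diagonal entries are set to 0 separately, as stated above.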

According to a method aspect of the present invention (beginning at block 100), a speech audio file is generated by recording a speaker reading a reference text, corresponding to the reference file (block 102). A speech recognition engine processes the speech audio file to generate the recognized speech file (block 104).

The recognized speech file is compared to the reference file to determine error occurrences (block 106). Error occurrences can include substitution errors between phonetic element pairs, as well as insertion and deletion errors for phonetic elements. Error rates are calculated for the error occurrences as a fraction of the total occurrences of the corresponding phonetic elements in the reference file (block 108).

Phonetic distances are determined as a function of the error rates; preferably, as inversely proportional to the error rates (block 110). The phonetic distances are normalized as desired and a phonetic distance matrix is outputted (block 112). The method ends at block 114.

In the above example, the phonetic distance matrix is generated based on a recognized speech file produced by a single speech recognition engine from the speech of a single speaker. It will be appreciated that a plurality of recognized speech files from a plurality of speakers and/or speech recognition engines can be utilized to generate a phonetic distance matrix that is less dependent on a particular speaker and/or speech recognition engine. A speech recognition engine independent phonetic distance matrix can be generated as a zero-order approximation from a speech recognition engine dependent phonetic distance matrix.
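A minimal sketch of combining several recognized speech files follows. The description does not say how multiple files are combined; summing the error counts and the reference totals before dividing is an assumption, as are the function name `pooled_error_rates` and the example data.

```python
from collections import Counter


def pooled_error_rates(runs):
    """Pool error counts over several (counts, total_reference) runs.

    Each run is one recognized speech file compared with its reference file.
    Pooling by summing counts and totals is an assumption; the text does not
    say how multiple files are combined.
    """
    pooled = Counter()
    total = 0
    for counts, reference_total in runs:
        pooled.update(counts)
        total += reference_total
    return {pair: n / total for pair, n in pooled.items()}


# Hypothetical runs for two speakers (or two engines).
run_a = ({(1, 2): 30, (2, 1): 20}, 3620)
run_b = ({(1, 2): 10, (5, 1): 15}, 1380)
rates = pooled_error_rates([run_a, run_b])
```

Pooling counts weights each run by the amount of reference material it covers; averaging per-run matrices instead would weight every speaker or engine equally.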

The system and method of the present invention advantageously provide an empirical method for measuring phonetic distances. Unlike the earlier, physiological approach, the phonetic distances measured by the present invention can take into account the practical dependency of phonetic distance upon particular speech recognition engines, as well as upon particular speakers. Additionally, the present invention can also take into account insertion and deletion errors, which the physiological approach to phonetic distance could not directly address.

It will be appreciated that the ability of the present invention to take into consideration the performance of a particular speech recognition engine significantly improves the usefulness of the phonetic distance matrix in grammar selection for the speech recognition engine. As used herein, grammar selection for a speech recognition engine can refer to selections pertaining to a general grammar for a speech recognition engine, as well as to selections pertaining to specialized grammars of one or more specific speech recognition application systems using a speech recognition engine.
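As one hypothetical use in grammar selection, candidate grammar entries can be screened for insufficient separation. The description does not define a distance between multi-element entries; the mean per-position distance over equal-length sequences used below is an assumption, as are the grammar entries, the threshold and the function names. The distances are taken from Table 3.

```python
from itertools import combinations


def sequence_distance(a, b, distances):
    """Mean per-position phonetic distance; equal-length sequences only."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(distances[(x, y)] for x, y in zip(a, b)) / len(a)


def confusable_entries(grammar, distances, threshold):
    """Pairs of grammar entries whose separation falls below `threshold`."""
    return [
        (w1, w2)
        for (w1, p1), (w2, p2) in combinations(grammar.items(), 2)
        if len(p1) == len(p2) and sequence_distance(p1, p2, distances) < threshold
    ]


# Distances taken from Table 3 for elements 1, 2 and 5 (0 on the diagonal).
distances = {
    (1, 1): 0.0, (1, 2): 3.3, (1, 5): 5.5,
    (2, 1): 5.5, (2, 2): 0.0, (2, 5): 7.8,
    (5, 1): 1.0, (5, 2): 10.0, (5, 5): 0.0,
}
# Hypothetical grammar entries spelled with those elements.
grammar = {"alpha": [1, 2, 5], "beta": [5, 2, 5], "gamma": [2, 5, 1]}
flagged = confusable_entries(grammar, distances, threshold=2.0)
```

Because the distances are direction dependent, this sketch treats the first entry of each pair as the spoken one; a fuller check would evaluate both directions.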

Additionally, the ability of the present invention to take into account the speaker dependency of the phonetic distance matrix allows the phonetic distance matrix to be used as a valuable language training and evaluation tool. For example, according to a further aspect of the present invention, a particular speaker reads the reference text and the speech audio file is recorded. A speaker-dependent phonetic distance matrix is generated as described above. This speaker-dependent phonetic distance matrix is compared to a reference phonetic distance matrix, for instance, a phonetic distance matrix corresponding to ideal pronunciation. An evaluation module compares the speaker-dependent and reference phonetic distance matrices and outputs a comparison matrix with the speaker-dependent phonetic distances as a function of the reference phonetic distances. The speaker can thereby readily identify phonetic elements for which pronunciation requires the most improvement. Alternately, these phonetic elements can be automatically identified for the speaker.
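The evaluation step can be sketched as follows. The description says only that the comparison matrix expresses the speaker-dependent distances as a function of the reference distances; the ratio used here is one assumed choice, and the phoneme symbols, distance values and function names are hypothetical.

```python
def comparison_matrix(speaker, reference):
    """Ratio of speaker-dependent to reference distance for each pair.

    Using a ratio is an assumption; the text only says the output expresses
    the speaker-dependent distances as a function of the reference distances.
    Pairs with a reference distance of 0 are skipped.
    """
    return {
        pair: speaker[pair] / ref
        for pair, ref in reference.items()
        if ref > 0 and pair in speaker
    }


def weakest_pairs(comparison, count=3):
    """Pairs whose ratio is lowest, i.e. most confusable relative to reference."""
    return sorted(comparison, key=comparison.get)[:count]


# Hypothetical distances on the 0-10 scale.
reference = {("L", "R"): 6.0, ("B", "V"): 5.0, ("S", "Z"): 4.0}
speaker = {("L", "R"): 1.5, ("B", "V"): 4.5, ("S", "Z"): 4.0}

ratios = comparison_matrix(speaker, reference)
focus = weakest_pairs(ratios, count=1)
```

A ratio well below 1 marks a pair that the speaker separates less clearly than the reference pronunciation does, which corresponds to the automatic identification of elements needing improvement.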

Moreover, the speaker-dependent phonetic distance and/or comparison matrices can be compared with earlier matrices generated for the same speaker. Thus, the speaker can readily see improvement or degradation made over time, both with regard to specific phonemes and in general.

The present invention can also allow optimization of a grammar for distinct groups of speakers. For example, a specialized phonetic distance matrix can be developed using audio files of speakers representative of a particular accent (e.g., Boston-English or Korean-English) or other unique characteristics. This specialized phonetic distance matrix can then be used in grammar selection to increase recognition accuracy for the group.

It will be appreciated that the error rates determined by the present invention, or some other function of the error rate other than or in addition to a phonetic distance, can also be used to facilitate grammar selection and language training and evaluation, as well as other applications. For instance, an error rate matrix for a particular speech recognition engine, such as in Table 2, can be used in grammar selection. Likewise, an error rate matrix for a particular individual, preferably compared to optimal error rates, can be used in language training and evaluation.

In general, the foregoing description is provided for exemplary and illustrative purposes; the present invention is not necessarily limited thereto. Rather, those skilled in the art will appreciate that additional modifications, as well as adaptations for particular circumstances, will fall within the scope of the invention as herein shown and described and the claims appended hereto.

Claims

1. A method of generating a phonetic distance matrix comprising: determining, for each of a plurality of phonemes occurring in the reference file, a plurality of phoneme error occurrences by comparing a recognized speech file with a reference file, the recognized speech file generated by processing at least one audio file of recorded speech with a speech recognition engine, the reference file representing the actual contents of the recorded speech; determining, for each of the plurality of phonemes occurring in the reference file, a plurality of phoneme error rates corresponding to the plurality of phoneme error occurrences; generating a plurality of phonetic distances as a function of the plurality of phoneme error rates, the plurality of phonetic distances being inversely proportional to the plurality of phoneme error rates; and outputting a phonetic distance matrix based on the generated plurality of phonetic distances, the phonetic distance matrix including generated phonetic distances between each of the plurality of phonemes;
2. The method of claim 1, wherein the plurality of phoneme error occurrences includes a plurality of phoneme substitution, insertion and deletion error occurrences, the plurality of phoneme error rates includes a corresponding plurality of phoneme substitution, insertion and deletion error rates, and the generated plurality of phonetic distances further includes phonetic distances between each of the plurality of phonemes and insertion and deletion.
3. The method of claim 1, wherein determining the plurality of phoneme error rates includes dividing the plurality of phoneme error occurrences by a total number of phoneme occurrences in the reference file.
4. The method of claim 1, wherein normalizing the phonetic distances to minimize the total separation between the phonetic distance matrix and an existing phonetic distance matrix includes using a mapping function with three normalization coefficients.
5. The method of claim 4, wherein the mapping function is: Phonetic Distance_i,j = α₁ + (α₂/(Error Rate_i,j − α₃)); wherein i and j are indices of the phonemes, and α₁, α₂ and α₃ are the three normalization coefficients.
6. The method of claim 5, wherein the separation between the phonetic distance matrix and the existing phonetic distance matrix is defined as: L(α₁, α₂, α₃) = Σ_i,j (Existing Phonetic Distance_i,j − Phonetic Distance_i,j)².
7. The method of claim 1, further comprising generating the recognized speech file by processing, with a speech recognition engine, an audio file of a speaker reading contents of the reference file.
8. The method of claim 7, further comprising generating the audio file.
9. The method of claim 1, wherein determining a plurality of phoneme error occurrences includes comparing a plurality of recognized speech and reference files.
10. The method of claim 9, wherein the plurality of recognized speech files correspond to audio files of a plurality of different speakers.
11. The method of claim 9, wherein the plurality of recognized speech files are generated by a plurality of different speech recognition engines.
12. A phonetic distance measurement system comprising: a reference file; a recognized speech file generated by processing an audio file of a speaker reading contents of the reference file; a comparison module configured to determine, for each of a plurality of phonemes occurring in the reference file, a plurality of phoneme error occurrences by comparing the recognized speech file and the reference file; an error rate module configured to determine, for each of the plurality of phonemes, a plurality of phoneme error rates corresponding to the plurality of phoneme error occurrences; and a measurement module configured to generate a plurality of phonetic distances between each of the plurality of phonemes as a function of the plurality of phoneme error rates, the plurality of phonetic distances being inversely proportional to the plurality of phoneme error rates;
13. The system of claim 12, further comprising a dictionary, wherein the comparison module is further configured to access the dictionary to identify phonemes in the reference file and recognized speech file prior to determining the plurality of phoneme error occurrences.
14. The system of claim 12, wherein the plurality of phoneme error occurrences the comparison module is configured to determine include phoneme substitution error occurrences, phoneme insertion error occurrences and phoneme deletion error occurrences.
15. The system of claim 14, wherein the comparison module is further configured to identify the plurality of phoneme substitution error occurrences by corresponding pairs of phonemes and the phoneme insertion and deletion error occurrences by individual corresponding phonemes.

Related Publications (1)

	Number	Date	Country
	20100332230 A1	Dec 2010	US
