Automatic continuous speech recognition system employing dynamic programming

Information

  • Patent Grant
  • 4059725
  • Patent Number
    4,059,725
  • Date Filed
    Tuesday, December 21, 1976
  • Date Issued
    Tuesday, November 22, 1977
Abstract
A speech pattern recognition system for continuous speech is disclosed. The system includes calculating means which calculates similarity measures between an input pattern and all of the series of patterns including reference word-patterns arranged in all possible orders through a pattern matching process without resorting to a segmentation process. The reference pattern which provides the maximum similarity measure is adopted as the recognized result.
Description

BACKGROUND OF THE INVENTION
This invention relates to a system for automatically recognizing continuous speech composed of continuously spoken words.
Voice recognition systems have been much in demand as input devices for putting data and programs into electronic computers, and practical systems for automatically recognizing speech are expected.
As is known in the art, it is possible to present a voice pattern with a sequence of Q-dimensional feature vectors. In a conventional voice recognition system, such as the one described in the P. Denes and M. V. Mathews article entitled "Spoken Digit Recognition Using Time-frequency Pattern Matching" (The Journal of Acoustical Society of America, Vol. 32, No. 11, November 1960) and another article by H. A. Elder entitled "On the Feasibility of Voice Input to an On Line Computer Processing System" (Communication of ACM, Vol. 13, No. 6, June 1970), the pattern matching is applied to the corresponding feature vectors of a reference pattern and of a pattern to be recognized. More particularly, the similarity measure between these patterns is calculated based on the total sum of the quantities representative of the similarity between the respective feature vectors appearing at the corresponding positions in the respective sequences. It is, therefore, impossible to achieve a reliable result of recognition in those cases where the positions of the feature vectors in one sequence vary relative to the positions of the corresponding feature vectors in another sequence. For example, the speed of utterance of a word often varies as much as 30 percent in practice. The speed variation results in a poor similarity measure even between the voice patterns for the same word spoken by the same person. Furthermore, for a conventional voice recognition system, a series of words must be uttered word by word thereby inconveniencing the speaking person and reducing the speed of utterance.
In order to recognize continuous speech composed of continuously spoken words, each voice pattern for a word must separately be recognized. A proposal to meet this demand has been made in the U.S. Pat. No. 3,816,722 entitled "COMPUTER FOR CALCULATING THE SIMILARITY BETWEEN PATTERNS AND PATTERN RECOGNITION SYSTEM COMPRISING THE SIMILARITY COMPUTER" filed jointly by the inventor of this case. In this system, continuous speech is separated word by word by using the dynamic programming. However, separation of continuous speech into words (segmentation) is not yet well established.
SUMMARY OF THE INVENTION
It is, therefore, an object of this invention to provide a continuous speech recognition system in which the separation of continuous speech can be well achieved.
According to a basic aspect of this invention, similarity measures between an input pattern and all of the reference patterns, each constituted by a series of reference word-patterns arranged in one of all possible orders, are calculated through the pattern matching process without resorting to a segmentation process. The reference pattern which provides the maximum similarity measure is adopted as a recognized result. The continuous speech pattern recognition process with respect to all the reference patterns is performed through three steps so as to achieve a practical operational speed.
In the first step, similarity measures are calculated between the reference patterns and a partial pattern of the input pattern extending from a given time point (a start point) to another given time point (an end point). Thus, the maximum similarity measure and the word corresponding thereto are obtained as a partial similarity measure and a partial recognized result for the partial pattern, respectively. The partial similarity measures and the partial recognized results are then stored at the start and end points. The step is repeated for all the start and end points to form a table stored in the memory device.
In the second step, referring to the table, the partial pattern series is selected so that a plurality of partial patterns included therein are laid without overlapping and spacing and that the sum of the partial similarity measures in the series is maximized.
In the final step, the partial recognized results corresponding to the partial patterns obtained in the second step are extracted from the table to provide a final result.
The calculations in the first and second steps are achieved in practice in accordance with the "Dynamic Programming" described on pp. 3-29 of a book entitled "Applied Dynamic Programming" published in 1962 by Princeton University Press.





BRIEF DESCRIPTION OF THE DRAWINGS
The features and advantages of this invention will be understood from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings, wherein:
FIG. 1 schematically shows two voice patterns for the same continuous speech;
FIG. 2 is a graph for explaining the principles of this invention;
FIGS. 3A and 3B show input patterns for explaining the principles of this invention;
FIG. 4 is a graph for explaining the principles of this invention;
FIG. 5 is a block diagram of a first embodiment of this invention;
FIG. 6 is a block diagram of a partial similarity measure calculating unit used in the first embodiment;
FIG. 7 is a graph for explaining the principles of a second embodiment of this invention shown in FIG. 8;
FIG. 8 is a block diagram of the second embodiment of this invention;
FIG. 9 is a graph for explaining the principles of the second embodiment; and
FIG. 10 is a block diagram of a second calculator used in the second embodiment.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring to FIGS. 1 to 4, the principle of a continuous speech recognition system according to this invention will be described.
As mentioned hereinabove, a speech pattern can be represented by a time sequence of Q-dimensional feature vectors. An input speech pattern may be given by
A = a.sub.1, a.sub.2 -- a.sub.i -- a.sub.I (1)
where a.sub.i stands for a feature vector for the i-th time point in the input pattern, and may be given by
a.sub.i = (a.sub.1i, a.sub.2i, -- a.sub.Qi)
Assume that the continuous speech to be recognized is a series of numerals (0, 1, 2, . . . n . . . 9), and a reference word-pattern B.sup.n (n = 0 to 9) is composed of J.sub.n -number of feature vectors. The reference word-pattern B.sup.n may be given by
B.sup.n = b.sub.1.sup.n, b.sub.2.sup.n, -- b.sub.j.sup.n -- b.sub.Jn.sup.n (2)
where b.sub.j.sup.n stands for a feature vector for the j-th time point in the reference word-pattern and may be given by
b.sub.j.sup.n = (b.sub.1j.sup.n, b.sub.2j.sup.n -- b.sub.Qj.sup.n)
J.sub.n corresponds to a time period of the reference word-pattern.
For simplicity, the reference word-pattern and its feature vector are represented by
B = b.sub.1, b.sub.2, -- b.sub.j -- b.sub.J (3)
b.sub.j = (b.sub.1j, b.sub.2j, -- b.sub.Qj)
In the continuous speech recognition system of this invention, the reference pattern for the continuous speech is represented as a series of the reference word-patterns for the numerals which are pronounced word by word. In other words, a pattern for the continuous speech by continuously pronouncing numerals [n(1), n(2), . . . n(x), . . . n(Y)] is represented by a pattern series composed of the respective reference word-pattern as follows:
B = B.sup.n(1) +B.sup.n(2) --+B.sup.n(x) --+B.sup.n(Y) (4)
and, in the case of a series of reference word-patterns B.sup.m and B.sup.n,
B = B.sup.m +B.sup.n = b.sub.1.sup.m, b.sub.2.sup.m,--b.sub.Jm.sup.m, b.sub.1.sup.n, b.sub.2.sup.n,--b.sub.Jn.sup.n (5)
For simplicity, the reference pattern B and its feature vector are represented by
B = b.sub.1, b.sub.2 -- b.sub.j -- b.sub.J
b.sub.j = (b.sub.1j, b.sub.2j -- b.sub.Qj) (6)
The components of the vector may be the samples of the outputs, Q in number, of a Q-channel spectrum analyzer sampled at a time point. The feature vectors a.sub.i and b.sub.i situated at the corresponding time positions in the respective sequences for the same speech do not necessarily represent one and the same phoneme, because the speeds of utterance may differ even though the words are spoken by the same person. For example, assume that the patterns A and B are both for a series of phonemes /san-ni-go/ (Japanese numerals for "three-two-five" in English). A vector a.sub.i at a time position 20 represents a phoneme /s/ while another vector b.sub.i at the corresponding time point 20' represents a different phoneme /a/. With a conventional method of calculating the similarity measure using the summation over i of the correlation coefficients .gamma.(a.sub.i, b.sub.i)'s, the example depicted in FIG. 1 gives only small similarity, which might result in misrecognition of the pattern in question.
Generally, the duration of each phoneme can vary considerably during actual utterance without materially affecting the meaning of the spoken words. A measure is therefore used for the pattern matching which will not be affected by the variation.
Referring to FIG. 2, the sequences of the feature vectors are arranged along the abscissa i and the ordinate j, respectively. According to the invention, a scalar product of the feature vectors a.sub.i and b.sub.j is used as the similarity measure s(a.sub.i, b.sub.j) between the feature vectors a.sub.i and b.sub.j as follows: ##EQU1##
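As an illustration only, the frame-level similarity described here can be sketched in Python as a plain scalar product over the Q components. The function names `frame_similarity` and `similarity_matrix` and the toy vectors are hypothetical and are not taken from the patent.

```python
def frame_similarity(a_i, b_j):
    """Scalar product of two Q-dimensional feature vectors, s(a_i, b_j)."""
    if len(a_i) != len(b_j):
        raise ValueError("feature vectors must have the same dimension Q")
    return sum(x * y for x, y in zip(a_i, b_j))


def similarity_matrix(A, B):
    """Table of s(a_i, b_j) for every pair (i, j), indexed from 0 here."""
    return [[frame_similarity(a, b) for b in B] for a in A]


# Toy patterns with Q = 2 (assumed values, for illustration only).
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
s = similarity_matrix(A, B)
```

The table `s` plays the role of the grid of FIG. 2, with the input pattern along one axis and the reference pattern along the other.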
According to the principle of this invention, the similarity measure S(A, B) between the input and reference patterns A and B is calculated with respect to all of the reference patterns, and the number Y of the reference word-patterns and the corresponding numerals n(1), n(2), . . . n(X), . . . n(Y) by which the maximum similarity measure ##EQU2## is obtained are calculated to obtain a recognized result.
It is noted that direct calculation of the maximum similarity measure represented by expression (8) requires a vast amount of time and adversely affects the cost and the operation time of the system to a certain extent. According to the invention, it has been found as shown in FIG. 3A that the input pattern A is divided by a partial pattern from a given time point (a start point) l to the next given time point (an end point) m, which may be represented by
A (l,m) = a.sub.l+1, a.sub.l+2, -- a.sub.i -- a.sub.m (9)
In other words, as shown in FIG. 3B, the input pattern A is divided into Y sets of partial patterns having (Y-1) of breaking points l(1), l (2), . . . l (X), . . . l (Y-1).
A = A(0, l.sub.(1)) + A(l.sub.(1), l.sub.(2)) -- + A(l.sub.(x-1), l.sub.(x)) -- + A(l.sub.(Y-1), l.sub.(Y)) (10)
By substituting expressions (4) and (10) into expression (8), the following expression (11) is obtained. ##EQU3## The maximization of the right term of expression (11) is carried out in two processes; one is the maximization with respect to the numerals n(1), n(2), . . . n(Y), and the other is the maximization with respect to the number Y and the breaking points l(1), l(2) . . . l(Y-1). This is shown by ##EQU4##
According to the invention, a time-normalized similarity measure is used as the similarity measure S(A(l.sub.(x-1), l.sub.(x)), B.sup.n(x)). More definitely, the partial pattern A(l.sub.(x-1), l.sub.(x)) is represented generally by
C = A(l.sub.(x-1), l.sub.(x)) = c.sub.1, c.sub.2 -- c.sub.i -- c.sub.I (13)
In the case where the reference pattern B.sup.n(x) is represented as in expression (3), the similarity measure S(C, B) is represented by ##EQU5## where j = j(i). It is found that the dynamic programming is conveniently applicable to the calculation of the similarity measure of expression (14). Thus, the calculation of recurrence coefficients or cumulative quantities representative of the similarity given by ##EQU6## is carried out, starting from the initial condition
j = 1
g(1,1) = s(c.sub.1, b.sub.1) (16)
and arriving at the ultimate recurrence coefficient g(I, J) for i = I and j = J. It is to be understood that, according to the recurrence formula (15), the ultimate recurrence coefficient g(I, J) is the result of the calculation of expression (14), i.e.,
S(C, B) = g(I, J) (17)
It is noteworthy that the speed of pronunciation differs 30 percent at most in practice. The vector b.sub.j which is correlated to a vector c.sub.i is therefore one of the vectors positioned in the neighbourhood of the vector b.sub.i. Consequently, it is sufficient to calculate the expression (14) or (15) for the possible correspondences (i, j)'s satisfying
j -r .ltoreq. i .ltoreq. j + r (18)
which is herein called the normalization window. The integer r may be predetermined to be about 30 percent of the number I or J. The provision of the normalization window given by expression (18) corresponds to restriction of the calculation of the expression (14) or (15) within a domain placed between two straight lines 25 and 25 shown in FIG. 2 as
j = i + r and j = i - r
with the boundary inclusive.
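A minimal Python sketch of this time-normalized matching follows. It assumes that recurrence (15) has the three-predecessor form that expression (28) and the signals g.sub.1, g.sub.2 and g.sub.3 describe later in the text, namely g(i, j) = s(c.sub.i, b.sub.j) + max[g(i-1, j), g(i-1, j-1), g(i-1, j-2)]. The names `dot` and `time_normalized_similarity` are hypothetical, and cells that no admissible path reaches are marked with minus infinity, a choice made for this sketch only.

```python
NEG = float("-inf")  # marks cells that no admissible path reaches


def dot(u, v):
    """Scalar product used as the frame similarity s(c_i, b_j)."""
    return sum(x * y for x, y in zip(u, v))


def time_normalized_similarity(C, B, r):
    """Return g(I, J) for partial pattern C and reference B (lists of vectors)."""
    if not C or not B:
        raise ValueError("both patterns need at least one feature vector")
    I, J = len(C), len(B)
    # g is 1-based in both indices to mirror the text; row and column 0 stay NEG.
    g = [[NEG] * (J + 1) for _ in range(I + 1)]
    g[1][1] = dot(C[0], B[0])  # initial condition (16)
    for i in range(2, I + 1):
        # Normalization window (18): only j with i - r <= j <= i + r.
        for j in range(max(1, i - r), min(J, i + r) + 1):
            # Assumed predecessors: g(i-1, j), g(i-1, j-1), g(i-1, j-2).
            best = max(g[i - 1][k] for k in (j, j - 1, j - 2) if k >= 0)
            if best > NEG:
                g[i][j] = dot(C[i - 1], B[j - 1]) + best
    return g[I][J]


ref = [[1, 0], [0, 1]]
slow = [[1, 0], [1, 0], [0, 1], [0, 1]]  # same word uttered at half speed
same = time_normalized_similarity([[1, 0], [0, 1], [1, 0]],
                                  [[1, 0], [0, 1], [1, 0]], r=1)
stretched = time_normalized_similarity(slow, ref, r=2)
blocked = time_normalized_similarity(slow, ref, r=1)
```

In this toy case the half-speed utterance still aligns fully when r = 2, while r = 1 leaves the end cell (I, J) outside the window, so no path reaches it.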
As described above, the similarity measure between the partial pattern C and the reference word-pattern B can be calculated by calculating the recurrence expression (15) under conditions of expressions (16) and (18) from i = j = 1 to i = I and j = J. This means that it is possible to calculate the similarity measure S(A(l.sub.(x-1), l.sub.(x)), B.sup.n) or S(A(l,m), B) between the partial pattern A(l.sub.(x-1), l.sub.(x)) or A(l,m) and the reference word pattern B, and to calculate the similarity measures S(A(l,m), B.sup.0) to S(A(l,m), B.sup.9). The partial similarity measure S<l,m > is defined as the maximum one in these similarity measures S(A(l,m), B.sup.0) to S(A(l,m), B.sup.9). The partial recognized result n<l,m> is the numeral by which the partial similarity measure S<l,m> is obtained. In other words, these relationships are given by
S<l,m> = max [S(A(l,m), B.sup.n)]
n<l,m> = argmax [S(A(l,m), B.sup.n)]
These partial similarity measures and partial recognized results are calculated in regard to all the combinations of l and m under the condition of l < m and stored. The above-mentioned process for providing the partial similarity measures and the partial recognized results is the first step in the recognizing process according to the invention.
In practice, the calculation of expression (15) from j = 1 to j = J.sub.n gives all of the similarity measures S(A(l,m), B.sup.n) with respect to a certain point of l and to all of m's satisfying
l + J.sub.n - r - 1 .ltoreq. m .ltoreq. l + J.sub.n + r - 1 (19)
The end point m of the partial pattern A(l,m) for normally spoken words is always included in a domain defined by expression (19). Therefore, the partial similarity measures need not be calculated out of the domain.
The reason will be described with reference to FIG. 4 wherein the abscissa stands for the start point l and the ordinate for the end point m. Cross-points (l,m) in the l-m plane correspond to the combinations of the points l and m (both are integers). The domain of point m in which the similarity measure S(A(l,m), B.sup.n) with respect to a certain point of l is calculated is defined by expression (19). The domain depends upon the time period J.sub.n of the reference word pattern, and is given by
l + min[J.sub.n ] - r - 1.ltoreq.m.ltoreq.l + max[J.sub.n ] + r - 1 (20)
which covers the domain lying between two straight lines 32 and 33. The similarity measures S(A(l,m), B.sup.n) are calculated within the domain to obtain the partial similarity measures S<l,m> and the partial recognized results n<l,m>.
Incidentally, the straight lines 32 and 33 are given by
m = l + min[J.sub.n ] - r - 1, and
m = l + max[J.sub.n ] + r - 1,
respectively.
Thus, the partial similarity measures and the partial recognized results are calculated with respect to all the points (l,m) within the hatched domain in FIG. 4. Further, as described above, the calculation of similarity measures S(A(l,m), B.sup.n) with respect to all of m's and to a certain point of l is completed by calculating the expression (15). The similar calculations with respect to all of the reference word patterns give the similarity measures corresponding to the range of line 10 (or with respect to a certain point l), whereby the partial similarity measures and the partial recognized results within the range of line 10 can be obtained. By calculating with respect to all of l's, the partial similarity measure and the partial recognized results within the hatched domain are obtained. The calculation only within the restricted domain makes it possible to reduce its calculation amount.
The first step may be summarized to the following seven sub-steps:
(1-1) Reset S<l,m> to 0, and
set n to 0;
(1-2) Set l to 1;
(1-3) Calculate expression (15) within a domain of m's
satisfying the expression (19) to obtain S(A(l,m), B.sup.n) = g(m,J.sub.n), and set m to l + J.sub.n - r - 1;
(1-4) When S(A(l,m), B.sup.n).ltoreq.S<l,m>, jump to (1-5);
When S(A(l,m), B.sup.n)>S<l,m>, set S(A(l,m), B.sup.n) and n as
S<l,m> and n<l,m>, respectively;
(1-5) Set m to m + 1.
When m.ltoreq.l+J.sub.n +r-1, jump to (1-4), and when m>l+J.sub.n +r-1, jump to (1-6);
(1-7) Set n to n + 1.
When n .ltoreq. 9, jump to (1-2), and
when n > 9, finish.
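The first step can be sketched in Python as follows, under several assumptions that are not part of the patent text: recurrence (15) is taken in the three-predecessor form suggested by expression (28); start points are 0-based so that A(l, m) is the slice `A[l:m]`; end points are limited by the window (18) rather than by the exact bounds of expression (19); and a missing table entry stands for "not yet set" instead of a reset to 0. All function and variable names are hypothetical.

```python
NEG = float("-inf")


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def end_point_scores(A, l, B, r):
    """One DP pass from start point l; returns {m: S(A(l, m), B)}."""
    J = len(B)
    I = min(len(A) - l, J + r)  # frames after l, capped by the window
    if I < 1:
        return {}
    prev = [NEG] * (J + 1)      # g(i-1, .), 1-based in j
    prev[1] = dot(A[l], B[0])   # g(1, 1) = s(c_1, b_1)
    scores = {}
    if J == 1:
        scores[l + 1] = prev[1]
    for i in range(2, I + 1):
        cur = [NEG] * (J + 1)
        for j in range(max(1, i - r), min(J, i + r) + 1):
            best = max(prev[k] for k in (j, j - 1, j - 2) if k >= 1)
            if best > NEG:
                cur[j] = dot(A[l + i - 1], B[j - 1]) + best
        if cur[J] > NEG:
            scores[l + i] = cur[J]  # g(i, J) is the score for end point m = l + i
        prev = cur
    return scores


def build_partial_table(A, refs, r):
    """Return (S, N): best score and best word for every reachable (l, m)."""
    S, N = {}, {}
    for n, B in refs.items():          # outer loop over words, as in (1-7)
        for l in range(len(A)):        # every start point
            for m, score in end_point_scores(A, l, B, r).items():
                if score > S.get((l, m), NEG):   # comparison of (1-4)
                    S[(l, m)] = score
                    N[(l, m)] = n
    return S, N


# Toy vocabulary of two "words" and an input made of word 0 followed by word 1.
refs = {0: [[1, 0], [1, 0]], 1: [[0, 1], [0, 1]]}
A = [[1, 0], [1, 0], [0, 1], [0, 1]]
S, N = build_partial_table(A, refs, r=1)
```

A single pass from a start point l yields the scores for several end points m at once, which is the property the sub-steps exploit.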
The second step of the recognition process according to the invention for the maximization calculation with respect to the number Y and the breaking points l.sub.(1), l.sub.(2) -- l.sub.(x) -- l.sub.(Y-1) in expression (12) by using the dynamic programming will be described.
The expression (12) may be rewritten by using the partial similarity measure S<l.sub.(x-1), l.sub.(x) > to ##EQU7## Assume that l.sub.(Y) = m and ##EQU8## the expression (22) may be represented by T(I) (I stands for the time period of the input pattern A). Further, the expression (22) may be represented by ##EQU9## This shows that the dynamic programming is conveniently applicable to the calculation of the expression (22). Thus, the calculation of the recurrence expression given by ##EQU10## is carried out, starting from the initial condition
T(0) = 0 (24)
and arriving at T(I) for m = I. In the calculation of the recurrence expression (23), ##EQU11## where an operator "argmax" stands for "h" for which the expression in the square bracket [ ] has a maximum value, is obtained with respect to m's of 1 to I and stored. Then, recognized breaking point l.sub.(x) representative of optimum value of l.sub.(x) in expression (21) is obtained by calculating a recurrence expression given by
l.sub.(x) = h(l.sub.(x+1)) (26)
from the initial condition l.sub.(Y) = I to the ultimate recurrence coefficient l.sub.(0) = 0. The number Y representative of the word number in the input pattern is obtained as the number of steps x taken until l.sub.(x) becomes 0.
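The second step can be sketched in Python as below. The sketch assumes that the recurrence takes the form T(m) = max over h < m of [S<h, m> + T(h)], with h(m) the maximizing h, which is how the surrounding text describes it; the equation images themselves are not available here. The function name `segment`, the dictionary layout of the table and the toy scores are hypothetical.

```python
NEG = float("-inf")


def segment(S, I):
    """Return (T(I), breaking points [l_(0)=0, ..., l_(Y)=I]) from table S."""
    T = [NEG] * (I + 1)
    T[0] = 0.0                 # initial condition T(0) = 0
    h = [None] * (I + 1)       # h[m]: best start point for a word ending at m
    for m in range(1, I + 1):
        for start in range(m):
            if (start, m) in S and T[start] > NEG:
                cand = S[(start, m)] + T[start]
                if cand > T[m]:
                    T[m], h[m] = cand, start
    if T[I] == NEG:
        raise ValueError("no chain of partial patterns covers the input")
    # Backtrace l_(x) = h(l_(x+1)), starting from the end of the input.
    points = [I]
    while points[-1] != 0:
        points.append(h[points[-1]])
    points.reverse()
    return T[I], points


# Toy table of partial similarity measures S<l, m> (assumed values).
S = {(0, 2): 2.0, (0, 3): 2.0, (2, 4): 2.0, (3, 4): 0.5, (0, 4): 3.0}
best, points = segment(S, I=4)
Y = len(points) - 1            # number of words found
```

Here two words covering frames 0 to 2 and 2 to 4 together outscore the single word spanning the whole input, so the backtrace returns one interior breaking point.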
In the final step of the recognition process, a final recognized result having Y numerals given by
n <l.sub.(x-1), l.sub.(x)>, x = 1.about.Y (27)
is obtained by referring to the recognized breaking point l.sub.(x) obtained in the second step and to the partial recognized results n<l,m> obtained in the first step.
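The final step is a table lookup, which the following Python sketch illustrates. The name `read_out`, the dictionary layout of the table and the toy values are assumptions made for illustration.

```python
def read_out(points, N):
    """Map consecutive breaking points to the words stored in table N."""
    return [N[(points[x - 1], points[x])] for x in range(1, len(points))]


# Toy inputs: breaking points from the second step and partial results n<l, m>.
points = [0, 2, 4]
N = {(0, 2): 3, (2, 4): 5, (0, 4): 9}
digits = read_out(points, N)
```

Only the entries of N that lie on the chosen chain of breaking points are read; the entry for (0, 4) is ignored in this example.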
Referring to FIG. 5, a first embodiment of this invention comprises an input equipment 61 for analyzing an input continuous speech into an input pattern A of a sequence of feature vectors a.sub.i 's. The input equipment comprises a microphone and a Q-channel spectrum analyzer composed of a group of channel filters and analogue/digital (A/D) converters. The Q-channel spectrum analyzer may be of the type referred to as "Frequency Selectors" shown in FIG. 3 of page 60 in an article entitled "Automatic Word Recognition" described in IEEE Spectrum Vol. 8, No. 8 (August, 1971). The input speech is converted to an electric signal by the microphone. The signal is then applied to the channel filters in which the input signal frequency is divided into Q-channel signals. These signals are supplied to the A/D converters in which they are sampled in synchronism with an analyzing clock and converted to digital signals representative of a feature vector a.sub.i = (a.sub.1i, a.sub.2i . . . a.sub.Qi). This input equipment 61 also generates start and final signals representing start and final time points (i = 1 and I) of the input pattern when an amplitude of the input signal first and finally exceeds a predetermined threshold level. The time point i of the input pattern A is counted by a counter (not shown) in the equipment 61 by resetting it to 1 in response to the start signal and counting it in synchronism with the analyzing clock. A counting number of the counter at a time point of the final signal U is defined as a time period I of the input pattern A. The final signal U and the time-period representing signal I.sub.1 are supplied to a controller 60.
The feature vector a.sub.i is in turn supplied from the input equipment 61 to an input pattern buffer memory 62. The memory 62 has a capacity sufficient to store all the feature vectors a.sub.i 's in the input pattern A, i.e., a.sub.1, a.sub.2, . . . a.sub.i . . . a.sub.I.
Reference word patterns B.sup.n (n = 0-9) are stored in a reference pattern memory 63. The reference word pattern B designated by a signal n from the controller 60 is read out. A partial similarity measure calculating unit 64 calculates through sub-step (1-3) the similarity measure S(A(l, m), B) between a partial pattern A(l, m) of the input pattern A and the reference word pattern B. A partial recognition unit 65 performs processing of the sub-steps (1-4) and (1-5) with reference to the similarity measures S(A(l, m), B) obtained by the unit 64 to provide the partial similarity measures S <l, m> and the partial recognized results n <l, m>. The partial similarity measure buffer memory 66 stores the partial similarity measures S <l, m> with respect to l and m. A partial recognized result buffer memory 67 stores the partial recognized results n <l, m> with respect to l and m. A final calculating unit 68 performs the calculation of the second step described above. A final recognition unit 69 performs the recognition of the final step described above. The controller 60 controls the operations for various parts of the first embodiment.
The partial similarity measures S <l, m> stored in the memory 66 are reset by a control signal Cl from the controller 60. The memory 63 reads out the reference word pattern B(B.sup.0, B.sup.1, B.sup.2 . . . or B.sup.9) designated by the signal n. The controller 60 supplies a signal representative of the start time point in the partial pattern A(l, m) to the units 64 and 65. The unit 64 calculates the similarity measures S(A(l, m), B.sup.n) between the partial pattern A(l, m) of the input pattern A and the reference word pattern B by performing the processes of the sub-step (1-3). The unit 65 is supplied with the similarity measures S(A(l, m), B.sup.n) from unit 64, the signals n, l, and m from the controller 60, and the stored similarity measure S <l, m> from the memory 66. The unit 65 performs the processes of sub-steps (1-4) and (1-5), and updates the contents S <l, m> and n <l, m> of the memories 66 and 67, respectively, in accordance with the result thereof. When the process with respect to all of l's has been completed, the signal n from the controller 60 is in turn varied by 1 from 0 to 9. The completion of the process with respect to n up to 9 means the completion of the process of the first step. At the time of the completion of the first step, the controller 60 generates and supplies to the unit 68 signals U and I2 representative of the completion of the first step and the time period (the number of the feature vectors) of the input pattern, respectively.
The unit 68 is supplied with the partial similarity measures S <l, m> and calculates the recurrence expression (23) to provide h(m) with reference to the partial similarity measures S <l, m >. The unit 68 also calculates the recurrence expression (26) referring to h(m) to obtain the number Y of the words and breaking points l.sub.(x).
The unit 69 finally recognizes the input pattern A as expression (27) referring to l.sub.(x) and n < l, m >.
The partial similarity measure calculating unit 64 will be described in detail with reference to FIG. 6. The unit 64 is composed of a reference word pattern buffer memory 641, a similarity measure calculator 642, a similarity measure memory 643, a recurrence calculator 644, an operational register 645 and a controller 646.
The reference word pattern B from the memory 63 (FIG. 5) is stored in the buffer memory 641. The calculator 642 is supplied with the input pattern A from the memory 62 and the reference word pattern B from the memory 641, and calculates the similarity measure ##EQU12## between the feature vectors a.sub.i and b.sub.j with respect to all the combinations (i, j)'s satisfying l.ltoreq.i.ltoreq.I and 1.ltoreq.j.ltoreq.J. The obtained similarity measures S(a.sub.i, b.sub.j) are stored in the memory 643. The memory 643 reads out the similarity measure S(a.sub.i, b.sub.j) corresponding to the combination (i, j) designated by signals i.sub.1 and j.sub.1 as a signal S. The calculator 644 is supplied with the signal S from the memory 643 and signals g.sub.1, g.sub.2, and g.sub.3 from the register 645, and makes sum g.sub.0 of the signal S and a maximum one in the signals g.sub.1, g.sub.2, and g.sub.3, i.e., calculates the following expression:
g.sub.0 = S (a.sub.i, b.sub.j) + max[g.sub.1, g.sub.2, g.sub.3 ] (28)
The sums g.sub.0 's are in turn stored in the register 645.
A control signal YS from the controller 646 is supplied to the register 645, whereby the initial condition represented by expression (16) is set. The controller 646 generates the signal j.sub.1 increasing in turn from j = 1 and the signal i.sub.1 increasing within the domain of expression (19). The similarity measure S (a.sub.i, b.sub.j) corresponding to the combination (i, j) designated by the signals i.sub.1 and j.sub.1 is read out from the memory 643. The register 645 produces in response to the combination (i, j) designated by the signals i.sub.1 and j.sub.1 the signals g.sub.1, g.sub.2 and g.sub.3 given by
g.sub.1 = g(i-1, j)
g.sub.2 = g(i-1, j-1 )
g.sub.3 = g(i-1, j-2),
respectively. The calculator 644 calculates the expression (28) referring to the signals S, g.sub.1, g.sub.2, and g.sub.3. This means that the calculator 644 calculates the recurrence expression (15). The obtained g.sub.0 is written in the register 645. By varying i and j to i = I and j = J.sub.n, the similarity measure S(A(l, m), B.sup.n) = g(I, J) can be obtained.
The following recurrence expression may be employed instead of the recurrence expression (15): ##EQU13##
In the first embodiment, the memories 66 and 67 require a great amount of capacity. The capacity MA corresponds to the area of the hatched domain in FIG. 4, and is given by
MA .apprxeq. I .times. (max[J.sub.n ] - min[J.sub.n ] + 2 r)
In the case of the speech of numerals, the following examples are typical:
max[J.sub.n ]=25, min[J.sub.n ]=15
r = 7, I = 70 (in case of 4-digit)
In the above case, the memory capacity MA is 1680 words for each memory. Therefore, the memories 66 and 67 necessitate the memory capacity of 3360 (=1680 .times. 2) words.
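The arithmetic behind these figures can be checked with a short Python sketch. The function name `table_capacity` and its parameter names are hypothetical; the formula and the numbers are the ones stated above.

```python
def table_capacity(I, j_max, j_min, r):
    """Approximate word count of one table: MA = I x (max[Jn] - min[Jn] + 2r)."""
    return I * (j_max - j_min + 2 * r)


# Typical values quoted for 4-digit numeral speech.
ma = table_capacity(I=70, j_max=25, j_min=15, r=7)
total = 2 * ma  # memories 66 and 67 each hold one table
```

With these values the band of admissible end points is 25 - 15 + 14 = 24 frames wide for each of the 70 start points.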
In the first embodiment, the partial similarity measures are calculated along the line 10 (FIG. 4) on which l is constant in the first step, and the obtained partial similarity measures are employed along the line 11 (FIG. 4) on which m is constant in the second step. This requires the storage of all of s<l, m> within the hatched domain. On the other hand, it is noted that in the calculation of the expression (25), the partial similarity measure s<h, m> with respect to the one point of m is along the line 11. To reduce the memory capacity, a second embodiment of the invention employs this fact.
More definitely, in the second embodiment, the similarity measures s<l, m> are calculated along the line 11 in the first step. When the partial similarity measure s <l, m> with respect to one point of m has been calculated, the recurrence expression (23) for the second step is performed. Therefore, the memory for storing the partial similarity measures S<l, m> requires a small amount of capacity for which S<l, m> within the range of the line 11 can be stored. Further, the similar reduction of capacity is applicable to the memory for storing the partial recognized results n<l, m>.
For this modification, it is necessary for the calculation of the recurrence expression (15) to reverse its time axes i and j, i.e., initial condition
g(I, J) = s (c.sub.I, b.sub.J) (29)
recurrence expression ##EQU14## The calculation of the expressions (29) and (30) will be described referring to FIG. 7. A point 40 is of (m, J) and gives the initial condition (29). Straight lines 42 and 43 correspond to the expression (18), and are represented by
j + m - Jn - r .ltoreq. i .ltoreq. j + m - Jn + r (31)
Therefore, the calculation of the recurrence expression (30) is carried out within a domain placed between the lines 42 and 43 satisfying the expression (31). When the calculation has arrived at j = 1, all of g(l, 1) with respect to i = l in a range 21 are obtained.
g(l, 1) = S (A(l, m), B.sup.n)
In this case, the range of l is defined by points 44 and 45, i.e., corresponds to that of the expression (31) of j = 1, as represented by
1 + m - Jn - r .ltoreq. l .ltoreq. 1 + m - Jn + r (32)
The similar calculations are performed with all of B.sup.n (n = 0 .about. 9), whereby the partial similarity measures S<l, m> are obtained with respect to all of l's satisfying the expression (32).
S <l, m > = max [S (A(l, m), B.sup.n)] (33)
Further, the partial recognized results n<l, m> are obtained
n<l, m> = argmax [S (A(l, m), B.sup.n)] (34)
Thus, the partial similarity measures S<l, m> and the partial recognized results n<l, m> can be obtained along the line 11 (FIG. 4).
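A Python sketch of this time-reversed calculation follows. It assumes that recurrence (30) mirrors the forward one, that is g(i, j) = s(c.sub.i, b.sub.j) + max[g(i+1, j), g(i+1, j+1), g(i+1, j+2)], since the equation image is not available here. It also follows the convention of expression (9), under which a word whose first frame is the i-th frame has start point l = i - 1. The names `dot` and `start_point_scores` are hypothetical.

```python
NEG = float("-inf")


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def start_point_scores(A, m, B, r):
    """One backward DP pass from end point m; returns {l: S(A(l, m), B)}."""
    J = len(B)
    lowest = max(1, 1 + m - J - r)       # earliest frame allowed by (31) at j = 1
    nxt = [NEG] * (J + 2)                # g(i+1, .), 1-based in j
    nxt[J] = dot(A[m - 1], B[J - 1])     # initial condition (29): g(m, J)
    scores = {}
    if J == 1:
        scores[m - 1] = nxt[1]
    for i in range(m - 1, lowest - 1, -1):   # frames m-1, m-2, ... (1-based)
        cur = [NEG] * (J + 2)
        for j in range(1, J + 1):
            # Window (31): j + m - J - r <= i <= j + m - J + r.
            if abs((i - m) - (j - J)) > r:
                continue
            best = max(nxt[k] for k in (j, j + 1, j + 2) if k <= J)
            if best > NEG:
                cur[j] = dot(A[i - 1], B[j - 1]) + best
        if cur[1] > NEG:
            scores[i - 1] = cur[1]       # word starts at frame i, so l = i - 1
        nxt = cur
    return scores


A = [[1, 0], [1, 0], [0, 1], [0, 1]]
B = [[0, 1], [0, 1]]
scores = start_point_scores(A, m=4, B=B, r=1)
```

One backward pass yields the scores for every admissible start point l of a fixed end point m, which is what allows the table to be filled one column at a time.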
Referring to the partial similarity measures S<l, m> and the partial recognized results n<l, m>, the recurrence expression (23) is calculated to provide T(m) and N(m), which are given by
N(m) = n < h(m), m > (35)
In the final recognition step, the recognition is performed by referring to h(m) and N(m). In other words, the calculation of recurrence expression given by ##EQU15## is carried out, from the initial condition
m = I (37)
to m = 0.
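The column-by-column processing of the second embodiment can be sketched in Python as below. The sketch assumes that the final recurrence reads out N(m) and then steps from m to h(m) until m reaches 0, which is an interpretation of the surrounding text rather than a transcription of expression (36). The names `advance` and `read_back` and the toy columns are hypothetical.

```python
NEG = float("-inf")


def advance(T, h, N, m, S_col, n_col):
    """Update T(m), h(m), N(m) from one column {l: S<l, m>}, {l: n<l, m>}."""
    best, arg = NEG, None
    for l, score in S_col.items():
        if l in T and T[l] + score > best:
            best, arg = T[l] + score, l
    if arg is not None:
        T[m], h[m], N[m] = best, arg, n_col[arg]


def read_back(h, N, I):
    """Collect N(m) while stepping m -> h(m) from m = I down to m = 0."""
    words, m = [], I
    while m != 0:
        words.append(N[m])
        m = h[m]
    words.reverse()
    return words


# Toy run: only the column for the current m is held at any time.
T, h, N = {0: 0.0}, {}, {}
columns = {
    2: ({0: 2.0}, {0: 3}),
    3: ({0: 2.0, 1: 1.0}, {0: 3, 1: 7}),
    4: ({2: 2.0, 1: 2.0}, {2: 5, 1: 5}),
}
for m in sorted(columns):
    S_col, n_col = columns[m]
    advance(T, h, N, m, S_col, n_col)
result = read_back(h, N, I=4)
```

Because each column is consumed as soon as it is produced, only T(m), h(m) and N(m) persist across values of m, which reflects the memory saving this embodiment aims at.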
Referring to FIG. 8, a second embodiment comprises an input equipment 71, an input pattern buffer memory 72, and a reference pattern memory 73 identical to the input equipment 61, the memories 62 and 63 (FIG. 5), respectively. A reference pattern time period memory 74 stores the time periods Jn (n = 0.about.9), and reads out the designated time period J in response to the signal n. A first calculator 75 calculates the recurrence expression (30). A first operable register 76 stores g(i, j) of expression (30). A memory 77 stores the partial similarity measures S<l, m> of expression (33). A memory 78 stores the partial recognized results n<l, m> of expression (34). The memories 77 and 78 have one-dimensional addresses. At the address "l" of the memories 77 and 78, the partial similarity measure S<l, m> and the partial recognized results n<l, m> are stored, respectively. A comparator 79 calculates the partial similarity measure S< l, m>.
A second calculator 80 performs calculations of the recurrence expressions (23) and (35). A memory 81 stores T(m) obtained from the expression (23). A memory 82 stores h(m). A memory 83 stores N(m) obtained from the expression (35). A recognition unit 84 calculates the expressions (36) and (37) to obtain the final result. A controller 70 controls the operations of various parts.
When the end of the input pattern A is detected by the input equipment 71, the signal I representative of the time period of the input pattern A is supplied to the controller 70. The controller 70 sets to 1 a counter, which is installed therein and generates a counting value representative of the ending point m of the partial pattern A(l, m). When the end point m has a small value or, in other words, when the point m lies in the left-hand region with respect to a point 90 of FIG. 9, the corresponding start point lies in a negative range. Therefore, the operation of this system is not initiated until the value of m exceeds the point 90. The value of the point 90 on the i-axis is defined by
m = min[Jn] -r + l (38)
After m exceeds the value defined by the expression (38), the following operations are carried out for each value of m. The controller 70 varies the signal n over the numerals 0 to 9. In accordance with the designated n, the processes of the first and second steps are performed.
The signal J representative of the time period J.sub.n of the reference word pattern B.sup.n is generated in accordance with the signal n. At the same time, the reference word pattern B.sup.n is read out from the memory 73 as the signal B. The signal B is read out in time-reversed fashion, i.e., in the order b.sub.J, b.sub.J-1 . . . b.sub.2, b.sub.1. The feature vector a.sub.i (i .ltoreq. m) is then read out from the memory 72 and is supplied to the first calculator 75, in which the calculation of the recurrence expression (30) is achieved with the register 76 used as a sub-memory. The first calculator 75 and the register 76 may be those of the partial similarity measure calculating unit 64 shown in FIG. 5. When the calculation of the recurrence expression (30) is completed, the similarity measures S(A(l, m), B.sup.n) with respect to the domain given by the expression (32) are obtained.
The content of the memory 77 is reset to 0 by the signal cl.sub.1 from the controller 70 before the start of the operation. Every similarity measure S(A(l, m), B.sup.n) is compared with the content of the memory 77, and the greater one is written in the memory 77, whereby the calculation of the expression (33) is carried out. More specifically, the S(A(l, m), B.sup.n) designated by the signal lm from the controller 70 is read out from the register 76 as the signal g o, and the S<l, m> stored at address "l" of the memory 77 is read out as the signal S.sub.1. The comparator 79 generates a write-in pulse wp.sub.1 only when the signal g o is greater than the signal S.sub.1. When the write-in pulse wp.sub.1 is obtained, the signal g o, i.e., S(A(l, m), B.sup.n), is written in at address "l" of the memory 77 as a new S<l, m>, and the n corresponding thereto is written in at address "l" of the memory 78 as a new n<l, m>. Therefore, when the calculation has been performed by varying l within the domain of the expression (32) and varying n from 0 to 9, the partial similarity measures S<l, m> and the partial recognized results n<l, m> are stored in the memories 77 and 78, respectively. The timing signal t.sub.1 is then generated from the controller 70 and is supplied to the second calculator 80. The content of the memory 81 is set to 0 by the signal cl.sub.2 generated in the controller 70 at the start point of the input pattern. The second calculator 80 calculates the expression (23) with respect to m designated by signal m.sub.1 from the controller 70 in response to the timing signal t.sub.1.
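The compare-and-write operation of the comparator 79 may be pictured, purely as an illustrative sketch, by the following Python fragment. The dictionaries S_mem and n_mem stand in for the memories 77 and 78, and scores stands in for the values S(A(l, m), B.sup.n) read from the register 76. The function name update_partial and all numerical values are hypothetical.

```python
def update_partial(S_mem, n_mem, l_values, scores, n):
    """Fold the scores of reference word n into the running maximum per start point l.

    S_mem[l] plays the role of memory 77 (S<l, m>), n_mem[l] of memory 78
    (n<l, m>); scores[l] stands for S(A(l, m), B^n) for the current word n.
    """
    for l in l_values:
        if scores[l] > S_mem[l]:      # write-in pulse only when strictly greater
            S_mem[l] = scores[l]
            n_mem[l] = n


# Hypothetical scores for one end point m, three start points and two words.
l_values = [2, 3, 4]
S_mem = {l: 0.0 for l in l_values}    # reset to 0 before the operation
n_mem = {l: None for l in l_values}
update_partial(S_mem, n_mem, l_values, {2: 0.4, 3: 0.9, 4: 0.1}, n=0)
update_partial(S_mem, n_mem, l_values, {2: 0.7, 3: 0.5, 4: 0.1}, n=1)
print(S_mem)   # {2: 0.7, 3: 0.9, 4: 0.1}
print(n_mem)   # {2: 1, 3: 0, 4: 0}
```

Because the memory is reset to 0 and the comparison is strict, the sketch presumes similarity measures greater than zero, and a tie leaves the earlier word in place.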
The second calculator 80 will be described with reference to FIG. 10. The second calculator 80 comprises a controller 800, a register 810, an adder 820, and a comparator 830. The register 810 is set to 0 in response to a reset signal cl.sub.3 generated in the controller 800 timed under the timing signal t.sub.1. An address designating signal h is varied within the domain of the expression (32). The content at address "h" of the memory 77, i.e., S<h, m>, and the content at address "h" of the memory 78, i.e., n<h, m>, are read out as the signals S.sub.2 and n.sub.2, respectively. The content at address "h" of the memory 81, i.e., T(h), is obtained as the signal T.sub.1. The signals T.sub.1 and S.sub.2 are added to each other in the adder 820, whereby the result, i.e., (S<h, m> + T(h)), is obtained as a signal X. The register 810 and the comparator 830 operate similarly to the memory 77 and the comparator 79, i.e., they carry out the maximization of the expression (23). In other words, the output X from the adder 820 is compared with the content Z of the register 810, and a write-in pulse wp.sub.2 is generated only when X > Z. In response to the write-in pulse wp.sub.2, the signal X is written in the register 810 and at address "m" of the memory 81. The signals h and n.sub.2 are written in the memories 82 and 83, respectively, at address "m". Thus, T(m), h(m) and N(m) are stored at address "m" in the memories 81, 82 and 83, respectively.
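One pass of the second calculator 80 for a fixed end point m may be sketched in Python as follows, on the assumption that the expression (23) has the form given in claims 4 and 5, T(m) = Max [S<h, m> + T(h)]. The function name second_calculator, the list-based memories and the numerical values are hypothetical and serve only to illustrate the register-and-comparator operation.

```python
def second_calculator(m, h_values, S_mem, n_mem, T, h_of, N_of):
    """For one end point m, keep the start point h that maximizes S<h, m> + T(h)."""
    Z = 0.0                          # register 810, reset to 0
    for h in h_values:               # address signal h swept over its domain
        X = S_mem[h] + T[h]          # adder 820: S<h, m> + T(h)
        if X > Z:                    # comparator 830: write-in pulse when X > Z
            Z = X
            T[m] = X                 # memory 81 holds T(m)
            h_of[m] = h              # memory 82 holds h(m)
            N_of[m] = n_mem[h]       # memory 83 holds N(m) = n<h(m), m>


# Hypothetical state for an input of length I = 5, evaluated at m = 5.
I = 5
T = [0.0] * (I + 1)                  # memory 81, cleared at the start of the input
h_of = [None] * (I + 1)
N_of = [None] * (I + 1)
T[2], T[3] = 1.0, 1.5                # values assumed to come from earlier passes
S_mem = {2: 0.75, 3: 0.5}            # S<h, 5> for start points h = 2 and h = 3
n_mem = {2: 7, 3: 1}                 # n<h, 5> for the same start points
second_calculator(5, [2, 3], S_mem, n_mem, T, h_of, N_of)
print(T[5], h_of[5], N_of[5])        # 2.0 3 1
```

Here the start point h = 3 is retained because 0.5 + 1.5 exceeds 0.75 + 1.0, even though the word ending at m scores lower on its own.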
Thus, the calculation with respect to a given point of m is completed. Then, the signal m is increased by 1, and a similar calculation is repeated until the signal m becomes I.
When the calculation with respect to m = I is completed, the recognition unit 84 starts to operate in response to the timing signal t.sub.2 from the controller 70. In this state, the data h(m) and N(m) stored in the memories 82 and 83 correspond to those with respect to m = 1.about.I. The recognition unit 84 comprises means for generating a signal m.sub.3 designating m. The signal m.sub.3 is supplied to the memories 82 and 83, which supply h(m) and N(m) to the unit 84. The unit 84 calculates the expressions (36) and (37) and generates N as the final recognition result.
Claims
  • 1. A speech pattern recognition system for continuous speech composed of a series of words pronounced word by word, said system comprising:
  • means for producing from said continuous speech an input pattern A representative of time sequences of feature vectors a.sub.1, a.sub.2 - - - a.sub.i, - - - a.sub.I ;
  • first memory means for storing said input pattern A;
  • second memory means for storing n reference word patterns B.sup.n, each represented by time sequences of feature vectors b.sub.1.sup.n, b.sub.2.sup.n, - - -b.sub.j.sup.n, - - - b.sub.Jn.sup.n ;
  • means for reading out a partial pattern A(l,m) which is a part of said input pattern A extending from a time point l to another time point m (1.ltoreq.l<m.ltoreq.I), said partial pattern A(l,m) being represented by time sequence of feature vectors a.sub.l+1, a.sub.l+2, - - - a.sub.i, - - - a.sub.m ;
  • first means for calculating through dynamic programming similarity measures S(A(l,m), B.sup.n) between said partial pattern A(l,m) and said reference word pattern B.sup.n ;
  • means for extracting the maximum value of the partial similarity measures S<l,m> with respect to n words;
  • means for providing a partial recognized result n<l,m> which is a word in said n words and by which said partial similarity measure S<l,m> is obtained;
  • third memory means for storing said partial similarity measure S<l,m> and said partial recognized result n<l,m> obtained with respect to said time points l and m;
  • means for dividing said input pattern A to Y partial patterns A(l.sub.(x-1), l.sub.(x)) (x = 1, 2, 3, - - - Y), said input pattern A being composed of Y words and having (Y - 1) breaking points l.sub.(1), l.sub.(2) - - - l.sub.(x) - - - l.sub.(Y-1) ;
  • means responsive to said partial similarity measure S<l,m> and said partial recognized result n<l,m> for reading out the partial similarity measures S<0,l.sub.(1) >, S<l.sub.(1),l.sub.(2) >, - - - S<l.sub.(x-1), l.sub.(x) >, - - - S<l.sub.(Y-1), l.sub.(Y) > with respect to the combinations (0,l.sub.(1)), (l.sub.(1), l.sub.(2)), - - - (l.sub.(x-1), l.sub.(x)), - - - (l.sub.(Y-1), l.sub.(Y)) of said breaking points;
  • second means for calculating the maximum value of the sum of said partial similarity measures S<0,l.sub.(1) >+S<l.sub.(1), l.sub.(2) >+- - - + S<l.sub.(x-1), l.sub.(x) > - - - + S<l.sub.(Y-1), l.sub.(Y) >; and
  • means responsive to said second calculating means and said partial recognized result n<l,m> for providing Y words.
  • 2. A speech pattern recognition system as recited in claim 1 wherein said first means for calculating similarity measures comprises:
  • recurrence coefficient calculating means for successively calculating recurrence coefficients g(i, j) for each similarity quantity s(c.sub.i, b.sub.j) defined as ##EQU16## starting from the initial condition
  • j = 1
  • g(1, 1) = s(c.sub.1, b.sub.1)
  • and arriving at the ultimate recurrence coefficient g(I, J) for i = I and j = J within a domain of m's satisfying the expression
  • l + J.sub.n + r - 1 .ltoreq. m .ltoreq. l + J.sub.n + J.sub.n + r - 1
  • to obtain said similarity measures.
  • 3. A speech pattern recognition system as recited in claim 1 wherein said first means for calculating similarity measures comprises:
  • recurrence coefficient calculating means for successively calculating recurrence coefficients g(i, j) for each similarity quantity s(c.sub.i, b.sub.j) defined as ##EQU17## starting from the initial condition
  • j = 1
  • g(1, 1) = s(c.sub.1, b.sub.1)
  • and arriving at the ultimate recurrence coefficient g(I, J) for i = I and j = J, and
  • a first register for storing said recurrence coefficients g(i, j).
  • 4. A speech pattern recognition system as recited in claim 2 wherein said second means for calculating the maximum value of the sum of said partial similarity measures comprises:
  • first recurrence calculating means for calculating the expression
  • T(m) = Max [S <h, m> + T(h) ] h<m
  • where m = 1.about.I,
  • starting from the initial condition T(0) = 0 and continuing to T(I) for m = I to provide h(m) defined as
  • h(m) = argmax [S <h, m> + T(h)] h < m
  • where the operator "argmax" stands for "h" for which the expression in the square bracket [ ] has a maximum value, with reference to said partial similarity measures S <l, m>, and
  • second recurrence calculating means for calculating the expression
  • l.sub.(x) = h(l.sub.(x+1))
  • from the initial condition l.sub.(Y) = I to the ultimate recurrence coefficient l.sub.(0) = 0 referring to h(m) to obtain the number Y of the words and breaking points l.sub.(x).
  • 5. A speech pattern recognition system as recited in claim 3 wherein said second means for calculating the maximum value of the sum of said partial similarity measures comprises:
  • recurrence calculating means for calculating the expression
  • T(m) = Max [S <h, m> + T(h)] h<m
  • where m = 1.about.I,
  • starting from the initial condition T(0) = 0 and continuing to T(I) for m = I to provide T(m), N(m) and h(m) defined as
  • N(m) = n <h(m), m>
  • h(m) = argmax [S <h, m> + T(h)] h < m
  • where the operator "argmax" stands for "h" for which the expression in the square bracket [ ] has a maximum value, with reference to said partial similarity measures S <l, m>, and
  • fourth memory means for storing the calculated values T(m), N(m) and h(m).
Priority Claims (2)
Number Date Country Kind
50-29891 Mar 1975 JA
50-132003 Oct 1975 JA
Parent Case Info

This is a Continuation, of application Ser. No. 665,759, filed Mar. 11, 1976, now abandoned.

US Referenced Citations (3)
Number Name Date Kind
3700815 Doddington et al. Oct 1972
3816722 Sakoe Jun 1974
3943295 Martin Mar 1976
Continuations (1)
Number Date Country
Parent 665759 Mar 1976