Not Applicable
Not Applicable
This invention relates to a computer-implemented method for computing similarity measures between input patterns and stored patterns, wherein both the input patterns and the stored patterns are derived from data collected from speech, images, video, signals, static physical entities or moving physical entities. The similarity measures obtained can then be used to classify the input patterns as similar to one of the classes of the stored patterns. For example, if the input patterns are derived from speech, the method can classify segments of the speech into words by detecting high similarity measures between the input speech and stored exemplar words. In other applications, the method can identify faces in images, classify human actions in video, or even classify patterns of weather, the human genome, etc. In all the applications, both input and stored patterns are converted into arrays of vectors, which are then classified by an algorithm that computes their mutual similarity measures. Our invention is an algorithm that we call VARIS, which stands for “Vector Array Recognition by Indexing and Sequencing”. VARIS has many advantages over currently widely used classification methods such as “Hidden Markov Models” (HMM) or “Dynamic Time Warping” (DTW). HMM and DTW can be used only to classify patterns such as speech, which are represented by one dimensional arrays of vectors, and have exponential complexity even in just two dimensions, whereas VARIS can classify arrays of vectors of any dimension with polynomial computational complexity. VARIS has many other advantages over HMM, such as ease of training, which makes it easy to adapt each speech recognizer to any speaker, with any accent and in any language. Recognition rates are much higher and recognition is much faster.
Adaptive Speech Recognition—In Title:
1. U.S. Pat. No. 7,996,218 Aug. 9, 2011 User adaptive recognition method and apparatus
2. U.S. Pat. No. 7,003,460 Feb. 21, 2006 Method and apparatus for an adaptive speech recognition system utilizing HMM models
3. U.S. Pat. No. 6,662,160 Dec. 9, 2003 Adaptive speech recognition method with noise compensation
4. U.S. Pat. No. 6,418,411 Jul. 9, 2002 Method and system for adaptive speech recognition in noisy environment
5. U.S. Pat. No. 6,278,968 Aug. 21, 2001 Method and apparatus for adaptive speech recognition hypothesis construction and selection in a spoken language translation system
6. U.S. Pat. No. 6,044,343 Mar. 28, 2000 Adaptive speech recognition with selective input data to a speech classifier
7. U.S. Pat. No. 5,774,841 Jun. 30, 1998 Real-time reconfigurable adaptive speech recognition command and control apparatus and method
8. U.S. Pat. No. 5,170,432 Dec. 8, 1992 Method of speaker adaptive speech recognition
Additional Adaptive Speech Recognition—In Abstract
1. U.S. Pat. No. 6,208,964 Mar. 27, 2001 Method and apparatus for providing unsupervised adaptation of transcription
2. U.S. Pat. No. 4,720,802 Jan. 19, 1988 Noise compensation arrangement
In all these inventions the authors use either HMM (Hidden Markov Models) or DTW (Dynamic Time Warping) as the basic recognition approach. As will be elaborated below, these methods are entirely different from my invented method. Similarly, the most relevant patent applications were also on HMM principles:
20110066426 Real time speaker adaptive speech recognition apparatus and method
20060200347 User adaptive speech recognition method and apparatus
20060184360 Adaptive multi-pass speech recognition system
20030187645 Automatic detection of change in speaker in speaker adaptive speech recognition system.
Hence, I find that my invention is entirely novel with respect to other inventors. The only invention that bears some similarity to a part of my current invention is my own U.S. Pat. No. 7,366,645 B2, which describes the old version of RISq (Recognition by Indexing and Sequencing). However, RISq can recognize only one dimensional arrays of multidimensional vectors and was developed for human activity recognition [1]. The new invention is called VARIS (Vector Array Recognition by Indexing and Sequencing) and deals with multidimensional arrays of multidimensional vectors. VARIS employs an improved version of RISq, called 1D algorithm, which includes several non-obvious innovations, and applies it recursively. The recursive application is entirely non-obvious and took me years to invent and develop. It reduces the array's dimension by one in each recursive iteration, finally yielding similarity measures between multidimensional arrays. The extension to multidimensional arrays opens a whole new space for multidimensional vector array pattern recognition. As far as I have investigated, my method is the only one that solves this enormously difficult problem with polynomial computational complexity. Measuring the similarity of multidimensional arrays makes it possible, for the first time, to recognize physical phenomena such as videos invariantly to their speed and distortion.
Description of VARIS (Vector Array Recognition by Indexing and Sequencing):
VARIS (U.S. provisional patent application 61/573,208) is a methodology for exemplar-based detection and recognition/classification of signals that are represented by multidimensional arrays of vectors, which can also be described as tensors or as multidimensional sequences of vectors. An array of dimension n is equivalent to a tensor of order n+1.
Such arrays can represent many kinds of physical signals. For example, speech can be represented by a 1D array of vectors, which is equivalent to a 1D temporal sequence of vectors or to a tensor of order 2. Every vector in the sequence represents the spectrum of a very short segment of the speech sound waveform. Among many other applications, 1D arrays are also useful in describing human actions, gestures and other activities. Images are represented by 2D arrays (tensors of order 3). Every vector in the 2D array represents the properties, such as color and brightness, of one pixel of the image. Videos can be described by 3D arrays (tensors of order 4). 3D arrays are also useful in describing complex phenomena such as weather patterns, earthquakes, etc. Higher array dimensions can describe even more complex physical phenomena.
In this patent application, I describe a computer methodology that detects and classifies such signals with robustness to interference, geometric distortions and incomplete data. VARIS is actually a multidimensional extension of a 1D algorithm, which is an improvement of a method called RISq (Recognition by Indexing and Sequencing). A preliminary version of RISq was invented by me in 2000 and patented in 2008 [6]. RISq is designed to recognize 1D arrays of vectors and is described in detail in the following section.
VARIS recognizes multidimensional arrays by applying a 1D algorithm recursively to the input array, each time on another dimension of the array. VARIS achieves better generic detection by incorporating in each class several exemplars that represent different instances of the same class. For example, in face detection, one can store many types of faces as exemplars, which makes it possible to detect large variations in face appearance. The detection/recognition is further improved by introducing new similarity measures that penalize incompatible input-exemplar vector pairs in the matched arrays. An additional significant improvement is achieved by our new compounding approach, in which each exemplar is divided into components in a way that allows new exemplars to be created by compounding parts from several exemplars of the same class. Experimental results comparing the performance of VARIS with 3 of the best face detection methods are presented in [4][5]. The comparison shows that VARIS performs better both in recognition rates and in false detection rates.
Description of RISq:
RISq is a method for detection or recognition of 1D arrays of multidimensional vectors [6]. A more advanced version of RISq, called 1D algorithm, is used in VARIS. Three innovative additions that are included in 1D algorithm are described in the section on speech recognition. The problem of detecting and classifying patterns that are expressed by arrays of vectors is different from, and more difficult than, the classification of single vectors by classical Pattern Recognition (PR). The classical approach to detecting and classifying patterns composed of vector arrays was to concatenate all the vectors in the array into one long vector that could be recognized by classical PR methods. This approach is not practical because physical patterns such as speech, images or video usually vary in time and therefore cannot be effectively represented by the rigid vectors that classical PR requires. 1D signals such as speech are usually represented by a 1D array of multidimensional vectors (tensors of order 2). Each vector in the array represents a sample of the speech. Methods such as Hidden Markov Models (HMM) or Dynamic Time Warping (DTW) were developed for detection and classification of 1D arrays. DTW is rarely used today because HMM produces much better results. In the following paragraphs we elucidate the differences between RISq and DTW. As demonstrated in [2][3], RISq achieves even slightly better results than HMM in recognition of speech.
HMM is a parametric method and therefore needs rigorous training by a complex algorithm called Expectation Maximization (EM) in order to quantify the parameters of each model. In contrast, 1D algorithm is a non-parametric method and needs only one exemplar per class for training. This is the reason that 1D algorithm can be very easily adapted to different speakers, languages or accents. 1D algorithm is based on the k-Nearest Neighbors (kNN) approach, in which classification is performed by estimating the posterior probability, which corresponds to the similarity of each vector in the array to the exemplar vectors in its neighborhood. In our opinion, non-parametric methods have a significant advantage over parametric methods because one does not need to assume any functional description for the underlying probability distributions. In practice, distributions of signals such as speech or imagery are quite complex and unknown a priori. Assuming a functional description that does not fit the actual data could result in low recognition rates and high false positive rates (false alarm rates). In addition, the non-parametric structure of 1D algorithm is very easy to train because it does not make any attempt at building statistical models of pattern classes. Instead, training is performed by simply storing one or more exemplar arrays per class in the 1D algorithm's database.
After training is performed, an unknown input array can be classified using a two-step algorithm. The first step is indexing, which consists of identifying, for each input vector, a number of exemplar vectors that are closest to it and assigning them weights that are proportional to their mutual similarity measure. The second step is sequencing, which uses dynamic programming to find the maximally weighted bipartite graph matching between the input array and each exemplar array, while respecting a sequencing constraint. According to the sequencing constraint, if vectors i and j in the input sequence are matched with vectors k and l in the exemplar sequence and i<j, then k must be smaller than l. The aggregate scores of the bipartite graph matching to each exemplar array are compared, and the input array is classified as a member of the class of the exemplar array with the highest score.
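The two steps described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the Gaussian similarity kernel, the choice of k, and all function names (`similarity`, `index_all`, `sequence_score`, `classify`) are assumptions introduced only for illustration.

```python
import math

def similarity(u, v, scale=1.0):
    """Similarity in (0, 1]; the Gaussian kernel and its scale are assumptions."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * scale ** 2))

def index_all(inp, exemplars, k=3):
    """Indexing step: for every input vector, keep weights only for its k most
    similar exemplar vectors, pooled over all stored exemplar arrays."""
    weights = {label: [[0.0] * len(ex) for _ in inp]
               for label, ex in exemplars.items()}
    for i, u in enumerate(inp):
        pool = [(similarity(u, v), label, j)
                for label, ex in exemplars.items()
                for j, v in enumerate(ex)]
        pool.sort(key=lambda t: t[0], reverse=True)
        for s, label, j in pool[:k]:
            weights[label][i][j] = s
    return weights

def sequence_score(weights):
    """Sequencing step: maximum-weight one-to-one matching that preserves order
    (i < j implies k < l), found by dynamic programming. Vectors on either
    side may stay unmatched, so the match may be partial."""
    m = len(weights)
    n = len(weights[0]) if m else 0
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(best[i - 1][j],          # input vector i unmatched
                             best[i][j - 1],          # exemplar vector j unmatched
                             best[i - 1][j - 1] + weights[i - 1][j - 1])
    return best[m][n]

def classify(inp, exemplars, k=3):
    """Return the label of the exemplar array with the highest aggregate score."""
    weights = index_all(inp, exemplars, k)
    scores = {label: sequence_score(w) for label, w in weights.items()}
    return max(scores, key=scores.get), scores
```

In this sketch, an input that skips one of the exemplar's vectors is still matched to the remaining ones in order, which mirrors the partial matching allowed by the sequencing constraint.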
Description of VARIS as a Recursive Application of 1D Algorithm:
The VARIS algorithm was developed for vector arrays with 2 or more dimensions (tensors of order 3 or more). The 1D algorithm, which was designed for optimal matching of 1D arrays of vectors, is applied recursively by VARIS, each time on another dimension. The result of each application is an array that is smaller by one dimension. For example, 2D arrays are arranged as a 2D matrix of vectors. There are two options for executing the VARIS algorithm in two dimensions, which differ in the order in which the rows and the columns are processed. One could start in the first phase by matching the columns (or the rows) of the 2D input array to all the columns (rows) of the exemplars, which are also 2D arrays. In the first phase, the 1D algorithm finds the optimal similarity scores of each column of the input 2D array with all the columns of the exemplars. This task is not insurmountable because most of the exemplar columns do not have vectors that are close enough to the input vectors to be indexed. In the second phase, each column of the input and of the exemplars is collapsed into a node in a 1D array. The input 2D array is thus reduced to a 1D array, and each 2D exemplar is similarly reduced to a 1D array. Because the 1D algorithm found similarity scores between each input column and all the columns of the exemplars in the first phase, each input node has similarity scores to each of the nodes of the exemplars. Next, the 1D algorithm is applied again, with the goal of finding the collapsed 1D exemplar array that best matches the collapsed 1D input array. The 1D algorithm now finds the optimal aggregate similarity scores of the input array with each of the exemplars. The input array is then classified as a member of the class of the exemplar with the highest similarity score.
The application of the 1D algorithm finds the optimal mutual similarity score between two 1D arrays while allowing any warping that abides by the sequencing constraint. In this 2D example, the sequencing constraint in the 1D algorithm's optimization is applied twice. In the first phase, the row number of each vector in the columns of the input and the exemplars' arrays serves as its 1D “timing” within the column. In the second phase, the “timing” is the column number of each collapsed column of both the input and the exemplars' arrays. This two-tier sequencing allows a wide range of 2D warping, which still preserves the topology of the matches of the input array with the exemplar arrays. This makes it possible to recognize humans and objects that are depicted from widely diverging viewpoints. The segmentation of the 2D input and exemplar images into arrays of independent vectors, where each vector represents a small image patch, enables VARIS to include in the aggregate similarity score only patches that belong to the subject and to reject patches of the background and of other objects. Therefore VARIS, which is based on recursive application of the 1D algorithm, is more robust in conditions of partial occlusion and missing image data.
To improve VARIS's generalization in detection and classification, I included in each class several exemplars, which represent as much as possible the variety of members within each class. Very recently I introduced a new approach, which I call “exemplar compounding” [3][4][5]. In order to achieve even more flexibility and adaptability in each class, I developed algorithms that construct optimal 2D exemplars composed of patches of different members of the same class. Compounding provides more flexibility and higher similarity scores with relatively smaller exemplar sets. In experiments on person detection in images, VARIS with compounded exemplars achieved better results than any state-of-the-art (SOA) detection method, including VARIS with uncompounded exemplars. I also developed compounding algorithms for speech [3], in which word exemplars are composed of partial utterances of several people.
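The two-phase procedure for 2D arrays can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the indexing step is omitted (every vector pair is scored), the Gaussian similarity is an assumed kernel, grids are stored as `grid[row][column]`, and the names `sequence_score`, `column_similarity` and `varis_2d_score` are hypothetical.

```python
import math

def sequence_score(weights):
    """Order-preserving maximum-weight matching by dynamic programming."""
    m = len(weights)
    n = len(weights[0]) if m else 0
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + weights[i - 1][j - 1])
    return best[m][n]

def vector_similarity(u, v):
    """Assumed Gaussian similarity between two vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / 2.0)

def column_similarity(col_a, col_b):
    """Phase 1: match two columns; the row number acts as the 1D 'timing'."""
    weights = [[vector_similarity(u, v) for v in col_b] for u in col_a]
    return sequence_score(weights)

def columns(grid):
    """Split grid[row][col] into a list of columns."""
    return [list(col) for col in zip(*grid)]

def varis_2d_score(inp, exemplar):
    """Phase 2: collapse each column into a node and match the node sequences;
    the column number acts as the 'timing'."""
    in_cols, ex_cols = columns(inp), columns(exemplar)
    node_scores = [[column_similarity(a, b) for b in ex_cols] for a in in_cols]
    return sequence_score(node_scores)
```

Each call to `sequence_score` removes one array dimension, so an n-dimensional version would repeat the same collapse n times.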
Claims 1-8 describe the general multidimensional VARIS method. Claims 15-19 describe VARIS applied to 2D arrays, which is useful for tasks such as face detection or object recognition in imagery. Claims 9-14 describe my new approach of using VARIS for 1D arrays, a task especially suited to adaptive continuous speech recognition. These claims include three innovative additions. The first is the introduction of negative similarity scores for segments that do not match the exemplars. This is an important addition that noticeably reduces the false positive rates, because there are many cases where different words with similar features unjustifiably get high similarity scores. The second innovation is the segmentation of continuous speech by matching the speech stream with overlapping segments whose lengths correspond to different word groups. The problem of continuous speech segmentation is very difficult because it is impossible to segment the speech by detecting silent periods, which are absent in continuous speech. Here both HMM and DTW fail, because these methods require the beginning and end of each word one intends to classify to be specified. The third innovation is a method for segmenting the phonemes in the stored and input words and pruning all the words that have even a single non-matching phoneme. This further reduces the false alarm rates. The first and second innovations are also included in the multidimensional VARIS claims.
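As a rough illustration of the second innovation, the following Python sketch enumerates overlapping candidate segments of several lengths over a stream of speech frames. The function name, the frame-based units and the hop size are assumptions made for the example; the claims do not prescribe these particular values.

```python
def overlapping_segments(num_frames, lengths, hop):
    """Yield (start, end) frame ranges of each candidate length, advancing by
    `hop` frames so that consecutive segments overlap whenever hop < length."""
    for length in lengths:
        last_start = max(num_frames - length, 0)
        for start in range(0, last_start + 1, hop):
            yield start, min(start + length, num_frames)

# Example: a 10-frame stream, candidate lengths of 4 and 6 frames, hop of 2.
candidates = list(overlapping_segments(10, [4, 6], 2))
```

Each candidate segment would then be scored against the stored exemplars, so no silence-based endpoint detection is needed.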
Differences between DTW (Dynamic Time Warping) and 1D algorithm
DTW is a method for recognizing an input 1D sequence of vectors as most similar to one of a set of stored exemplar 1D sequences of vectors. The DTW algorithm optimizes the sum of vector distances between the two sequences using Dynamic Programming (DP). The algorithm allows time warping, in which the sequences are shrunk or stretched to improve matching with the other sequence. This provides more flexible matching in cases where the timing of the two sequences is not compatible. The warping of an input sequence with M vectors, matched to an exemplar sequence of N vectors, can be represented by a path in a rectangular lattice that has M rows and N columns. Each junction (k,l) has a cost that is proportional to the distance between the k-th input vector and the l-th exemplar vector. DTW tries to find a monotonically connected path from (1,1) to (M,N) that has minimum cumulative cost (distance). The requirement of monotonic paths means that every piece of the path, say from junction (k,l) to (h,j), must have k−h≤0 and l−j≤0. DTW has many uses, mostly in speech recognition. In speech, the input and the exemplars are sequences of vectors composed of Mel Frequency Cepstrum Coefficients (MFCCs), which are derived by sampling and processing signals of speech utterances. Although there are a few similarities between DTW and 1D algorithm, there are significant differences between the two. The two methods are superficially similar because both match vector sequences while allowing warping and both use DP to find the optimal matching score. But here the similarity ends. First, DTW regards the absolute vector distances as an inverse similarity measure and uses DP to find the match between the sequences that has the minimal cumulative distance, subject to the requirement that the sequences be matched along a connected and monotonic path.
On the other hand, 1D algorithm performs bipartite graph matching, which allows partial matching of both input and exemplar sequences in any sequential configuration, i.e. the path can be disconnected. This provides much more flexibility in warping and noticeably improves the recognition rates of 1D algorithm and VARIS. Such a process is feasible because both RISq and VARIS accumulate vector similarities, not distances. RISq and VARIS also penalize dissimilarities, while DTW uses only distance penalties. The similarity-dissimilarity scoring in 1D algorithm and in VARIS provides a substantial improvement in performance. Mathematically, similarity and distance have a nonlinear, inverse relation. However, minimizing accumulated distances is not the same as maximizing accumulated similarities. In addition, DTW (like HMM) must determine the endpoints of both sequences, whereas 1D algorithm and VARIS do not have to. This difference alone could have a large effect on the recognition rates, because DTW must apply pre-segmentation to locate these endpoints, a process that is very error prone, especially in continuous speech. Another substantial difference is that with DTW one has to match all the exemplars serially, one by one, in order to find the best match. This makes DTW much slower, especially if it has many exemplars (HMM has the same problem). In contrast, VARIS is a parallel method that employs indexing and matches all the exemplars at once. Furthermore, a most important difference is that VARIS can recognize multidimensional sequences with polynomial complexity at any dimension, whereas DTW can deal only with 1D sequences and has NP-complete complexity even with 2D sequences (HMM also has exponential complexity in 2D). To conclude, all these differences result in much better performance of 1D algorithm and VARIS in recognition rates, in false alarm rates and in computation time requirements.
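For comparison, the DTW lattice recursion described above can be sketched in Python as follows. This is the textbook form of DTW with the common three-step pattern (down, right, diagonal) and Euclidean distance; both choices, and the function name `dtw_cost`, are assumptions, since the text does not fix a step pattern or a distance.

```python
import math

def dtw_cost(inp, exemplar):
    """Minimum cumulative distance over connected, monotonic paths
    from junction (1,1) to junction (M,N) of the M x N lattice."""
    m, n = len(inp), len(exemplar)
    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for k in range(1, m + 1):
        for l in range(1, n + 1):
            d = math.dist(inp[k - 1], exemplar[l - 1])  # junction cost
            cost[k][l] = d + min(cost[k - 1][l],        # repeat exemplar vector
                                 cost[k][l - 1],        # repeat input vector
                                 cost[k - 1][l - 1])    # advance both
    return cost[m][n]
```

Because the path is connected and must run from the first junction to the last, every input and every exemplar vector contributes a distance term. An exemplar vector with no counterpart in the input therefore always adds cost, whereas a matching that permits disconnected paths can simply leave it unmatched.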
This application claims the benefit of a provisional patent application: Ser. No. 61/573,208
Number | Date | Country
---|---|---
61573208 | Sep 2011 | US