The present invention relates generally to data clustering, and to methods and arrangements for facilitating the same, such as in the context of enrolling target speakers in a speaker verification system. The present invention relates more particularly to the partitioning of a set of multidimensional data points into classes.
It is generally desired that the classes, when modeled with Gaussian densities for example, can be used to construct a probability density for the data. Additional data obtained in the same way as the original set should be judged highly likely according to the constructed density. Clustering is a fundamental data analysis tool and is the basis for many approaches to pattern recognition. Among other things, this process facilitates analyzing the areas of the data space that are the most concentrated with points, while allowing one to determine which points may be outliers (i.e., data points that result from noise and do not give information about the process or system being modeled). It also forms the basis for a compact representation of the data.
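By way of illustration only, the following minimal sketch shows how classes produced by a clustering step might be modeled with Gaussian densities and combined into a single probability density, so that additional data can be scored. The diagonal-covariance assumption, the weighting of each class by its share of the points, the variance floor, and all function names (`fit_diag_gaussian`, `log_gaussian`, `mixture_log_density`) are assumptions made for this sketch and are not taken from the description above.

```python
import math


def fit_diag_gaussian(points):
    """Estimate per-dimension mean and variance for one class of points."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    # Assumption: a small floor keeps variances positive for degenerate classes.
    var = [max(sum((p[k] - mean[k]) ** 2 for p in points) / n, 1e-6)
           for k in range(dim)]
    return mean, var


def log_gaussian(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def mixture_log_density(x, classes):
    """Log density of x under a mixture with one Gaussian per class.

    Assumption: each class is weighted by its fraction of all points.
    """
    total = sum(len(c) for c in classes)
    terms = []
    for c in classes:
        mean, var = fit_diag_gaussian(c)
        terms.append(math.log(len(c) / total) + log_gaussian(x, mean, var))
    # Log-sum-exp for numerical stability.
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Under these assumptions, a point lying close to one of the classes receives a higher log density than a point lying far from all of them, which is the sense in which additional data "should be judged highly likely" by the constructed density.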
Clustering is usually a very time-consuming process requiring many iterative passes over the data. Generally, the clustering problem is handled by a clustering technique such as K-means or LBG (see Y. Linde, A. Buzo, R. M. Gray, “An Algorithm for Vector Quantizer Design,” IEEE Trans. Commun., vol. 28, pp. 84-95, January 1980). K-means starts with an initial seed of classes and iteratively re-clusters the data and re-estimates the centroids. The effectiveness of this method depends on the quality of the seed. LBG does not require a seed, but starts with one cluster for all of the data. It then uses a random criterion to generate new centroids based on the current set (initially one). K-means is applied after the new set of centroids is constructed, and the process is repeated on the new set. In K-means, the requirement for a good seed is strong, which in turn means that a great deal of prior information about the data is needed. The iterative reclusterings are also time-consuming. LBG has a random component, which makes it potentially unstable in the sense that quite different models can result from two independent LBG clusterings of the same data.
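For reference, the iterative re-clustering and centroid re-estimation performed by K-means can be sketched as follows. This is a minimal illustration of the conventional technique, not of the present invention; the squared Euclidean distance, the iteration cap, the handling of empty clusters, and the names `assign` and `kmeans` are assumptions of the sketch.

```python
def assign(points, centroids):
    """Group each point with its nearest centroid (squared Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        clusters[nearest].append(p)
    return clusters


def kmeans(points, seed_centroids, max_iterations=20):
    """Iterate from the given seed until the centroids stop changing."""
    centroids = [tuple(c) for c in seed_centroids]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    # Final assignment so the returned clusters match the returned centroids.
    return centroids, assign(points, centroids)
```

Each pass of the loop visits every point, which is the source of the repeated passes over the data noted above, and the result is entirely determined by the seed that the caller supplies.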
In view of the foregoing, a need has been recognized in connection with improving upon the shortcomings and disadvantages associated with conventional data clustering methods and arrangements.
In accordance with at least one presently preferred embodiment of the present invention, clustering problems are solved in an efficient, deterministic manner with a recursive procedure to be discussed below.
In summary, the present invention provides, in one aspect, an apparatus for facilitating data clustering, the apparatus comprising: an arrangement for obtaining input data; and an arrangement for creating a predetermined number of non-overlapping subsets of the input data; the arrangement for creating a predetermined number of non-overlapping subsets being adapted to split the input data recursively.
In another aspect, the present invention provides a method of facilitating data clustering, the method comprising the steps of: obtaining input data; and creating a predetermined number of non-overlapping subsets of the input data; the step of creating a predetermined number of non-overlapping subsets comprising splitting the input data recursively.
Furthermore, the present invention provides, in an additional aspect, a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for facilitating data clustering, the method comprising the steps of: obtaining input data; and creating a predetermined number of non-overlapping subsets of the input data; the step of creating a predetermined number of non-overlapping subsets comprising splitting the input data recursively.
For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention will be pointed out in the appended claims.
A recursive procedure for solving clustering problems, in accordance with at least one presently preferred embodiment of the present invention, is discussed herebelow.
With reference to
xi={r1, r2, . . . , r
As shown in
where <,> indicates the dot product.
The clustering is preferably initialized with Xa=X. After Xa is split into Xb and Xc, each of these in turn becomes the input to the procedure. Thus, Xb and Xc are each split in the same way into two subsets. The procedure is repeated until the desired number of subsets has been created. These splitting procedures are illustrated schematically in the accompanying drawings.
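The recursion described above (initialize with Xa=X, split Xa into Xb and Xc, feed each back into the procedure, stop at the desired number of subsets) can be sketched as follows. The splitting rule used here is a stand-in chosen for illustration only: each point is centered on the subset mean and its dot product with a direction vector is compared against zero, with the direction taken along the coordinate axis of largest variance. That rule, the first-in-first-out order in which subsets are split, and all names (`max_variance_axis`, `split`, `recursive_cluster`) are assumptions of this sketch and should not be read as the splitting function of the invention.

```python
from collections import deque


def dot(u, v):
    """Dot product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_of(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def max_variance_axis(points, mean):
    """Hypothetical direction choice: unit vector along the axis of largest variance."""
    dim = len(mean)
    spread = [sum((p[k] - mean[k]) ** 2 for p in points) for k in range(dim)]
    best = max(range(dim), key=lambda k: spread[k])
    return tuple(1.0 if k == best else 0.0 for k in range(dim))


def split(xa, choose_direction=max_variance_axis):
    """Split Xa into Xb and Xc by the sign of <x - mean, direction>."""
    m = mean_of(xa)
    e = choose_direction(xa, m)
    xb, xc = [], []
    for p in xa:
        centered = tuple(a - b for a, b in zip(p, m))
        (xb if dot(centered, e) <= 0 else xc).append(p)
    return xb, xc


def recursive_cluster(points, num_subsets):
    """Split repeatedly until num_subsets non-overlapping subsets exist."""
    pending = deque([list(points)])  # Xa = X initially
    finished = []                    # subsets that cannot be split further
    while pending and len(pending) + len(finished) < num_subsets:
        xa = pending.popleft()
        xb, xc = split(xa)
        if not xb or not xc:
            finished.append(xa)      # e.g. a single point or identical points
            continue
        pending.extend([xb, xc])
    return finished + list(pending)
```

Because no step in this sketch depends on a random choice or an externally supplied seed, two runs over the same data yield the same subsets. Fewer than `num_subsets` subsets are returned only when the data cannot be split further.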
Among the advantages of the method presented hereinabove are the following:
A practical application of the techniques discussed and contemplated herein is in the enrollment of target speakers in a speaker verification system. In this case, if it is desired that models be built as quickly as possible, then the techniques described and contemplated herein can significantly reduce training time. An example of a speaker verification system that may readily employ the embodiments of the present invention is discussed in U. V. Chaudhari, J. Navratil, S. H. Maes, and R. Gopinath, “Transformation Enhanced Multi-Grained Modeling for Text-Independent Speaker Recognition,” ICSLP 2000, pp. II.298-II.301.
It is to be understood that the present invention, in accordance with at least one presently preferred embodiment, includes an arrangement for obtaining input data; and an arrangement for creating a predetermined number of non-overlapping subsets of the input data, which together may be implemented on at least one general-purpose computer running suitable software programs. These may also be implemented on at least one Integrated Circuit or part of at least one Integrated Circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both.
If not otherwise stated herein, it is to be assumed that all patents, patent applications, patent publications and other publications (including web-based publications) mentioned and cited herein are hereby fully incorporated by reference herein as if set forth in their entirety herein.
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.
Number | Date | Country |
---|---|---|
20030158853 A1 | Aug 2003 | US |