Claims
- 1. A method for performing clustering within a relational database management system to group a set of n data points into a set of k clusters, each data point having a dimensionality p, the method comprising the steps of: establishing a plurality of first tables, C1 through Ck, each table having p columns and 1 row, for the storage of means values, each one of tables C1 through Ck representing a cluster; establishing a second table, R, having p columns and 1 row, for the storage of covariance values; establishing a third table, W, having k+1 columns and 1 row, for the storage of weight values; and executing a series of SQL commands implementing an Expectation-Maximization clustering algorithm to iteratively update the means values, covariance values and weight values stored in said first, second and third tables.
- 2. The method for performing clustering within a relational database management system in accordance with claim 1, wherein said step of executing a series of SQL commands implementing an Expectation-Maximization clustering algorithm to iteratively update the means values, covariance values and weight values stored in said first, second and third tables continues until a specified number of iterations has been performed.
- 3. The method for performing clustering within a relational database management system in accordance with claim 1, wherein said first, second and third tables represent matrices.
- 4. The method for performing clustering within a relational database management system in accordance with claim 3, wherein said third table, R, represents a diagonal matrix.
- 5. The method for performing clustering within a relational database management system in accordance with claim 1, wherein: k≲p; and p<<n.
- 6. The method for performing clustering within a relational database management system in accordance with claim 5, wherein: p≲100; and k≲100.
- 7. The method for performing clustering within a relational database management system in accordance with claim 3, further comprising the steps of: establishing a fourth table, Z, having p columns and n rows, for the storage of dimensionality values p for each data point n; and establishing a fifth table, Y, having 1 column and p * n rows, for the storage of values; and wherein said step of executing a series of SQL commands implementing an Expectation-Maximization clustering algorithm to iteratively update the means values, covariance values and weight values stored in said first, second and third tables includes the steps of: for each of said n data points, calculating Mahalanobis distances using a vertical approach wherein said Mahalanobis distances are calculated by using SQL aggregate functions joining tables Y, C and R; and for each of said n data points, calculating means values, covariance values and weight values using a horizontal approach.
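As an illustration of the vertical distance step recited in claim 7, the following is a minimal sketch, not the claimed implementation. It uses Python's standard `sqlite3` module so the SQL can be run as is. Several things here are assumptions made only for the example: the index columns `rid` (point), `i` (cluster) and `v` (dimension), the storage of the cluster means and the diagonal of R in vertical form so they can be joined with Y, and the toy data values. The claims themselves do not specify these details.

```python
import sqlite3

# Hypothetical toy data: n = 4 points, p = 2 dimensions, k = 2 clusters.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
means = [(1.0, 2.0), (8.0, 9.0)]   # assumed current cluster means
variances = (4.0, 1.0)             # assumed diagonal of the covariance matrix R

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Vertical layout (an assumption): one row per (index, dimension) pair.
cur.execute("CREATE TABLE Y (rid INTEGER, v INTEGER, val REAL)")
cur.execute("CREATE TABLE C (i INTEGER, v INTEGER, val REAL)")
cur.execute("CREATE TABLE R (v INTEGER, val REAL)")

cur.executemany(
    "INSERT INTO Y VALUES (?, ?, ?)",
    [(rid, v, x) for rid, pt in enumerate(points, 1) for v, x in enumerate(pt, 1)],
)
cur.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(i, v, m) for i, mu in enumerate(means, 1) for v, m in enumerate(mu, 1)],
)
cur.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [(v, r) for v, r in enumerate(variances, 1)],
)

# Squared Mahalanobis distance under a diagonal covariance matrix:
#   d(rid, i) = SUM over v of (y_v - c_iv)^2 / r_v
# computed with a SQL aggregate over the join of Y, C and R.
rows = cur.execute(
    """
    SELECT Y.rid, C.i,
           SUM((Y.val - C.val) * (Y.val - C.val) / R.val) AS d
    FROM Y
    JOIN C ON Y.v = C.v
    JOIN R ON Y.v = R.v
    GROUP BY Y.rid, C.i
    ORDER BY Y.rid, C.i
    """
).fetchall()

# Map (point id, cluster id) -> squared distance.
distances = {(rid, i): d for rid, i, d in rows}
for (rid, i), d in distances.items():
    print(f"point {rid}, cluster {i}: {d:.4f}")
```

Because every dimension is a row rather than a column, the same query text works for any p and k. The trade-off is that the join produces n * k * p intermediate rows before aggregation.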
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is related to the following U.S. Patent Applications, filed on even date herewith:
U.S. patent application Ser. No. 09/747,856 by Paul Cereghini and Carlos Ordonez and entitled “METHOD FOR PERFORMING CLUSTERING IN VERY LARGE DATABASES,” the disclosure of which is incorporated by reference herein.
U.S. patent application Ser. No. 09/747,857 by Paul Cereghini and Carlos Ordonez and entitled “VERTICAL IMPLEMENTATION OF EXPECTATION-MAXIMIZATION ALGORITHM IN SQL FOR PERFORMING CLUSTERING IN VERY LARGE DATABASES.”
US Referenced Citations (5)

| Number | Name | Date | Kind |
| --- | --- | --- | --- |
| 6115708 | Fayyad et al. | Sep 2000 | A |
| 6226334 | Olafsson | May 2001 | B1 |
| 6345265 | Thiesson et al. | Feb 2002 | B1 |
| 6374251 | Fayyad et al. | Apr 2002 | B1 |
| 6449612 | Bradley et al. | Sep 2002 | B1 |