Claims
- 1. A database management system for in-database clustering, comprising:
a first data table and a second data table, each data table including a plurality of rows of data; means for building an enhanced K-means clustering model using the first data table; and means for applying the enhanced K-means clustering model using the second data table to generate apply output data.
- 2. The database management system of claim 1, wherein the means for building an enhanced K-means clustering model comprises:
means for initializing centroids of the clusters of the clustering model; means for finding, for each data record, a cluster closest to the data record; and means for updating cluster centroids and histograms based on the new record assignments.
- 3. The database management system of claim 2, wherein the means for finding, for each data record, a cluster closest to the data record comprises:
means for computing a Euclidean distance between each data record and each centroid; means for selecting a winning cluster for each data record; and means for assigning each data record to its winning cluster.
- 4. The database management system of claim 3, wherein the means for finding, for each data record, a cluster closest to the data record further comprises:
means for computing a dispersion for each cluster; and means for computing a total error.
- 5. The database management system of claim 4, wherein the means for computing a dispersion for each cluster comprises:
means for computing an average distance of members of the cluster to a centroid of the cluster.
- 6. The database management system of claim 5, wherein the means for updating cluster centroids and histograms based on the new record assignments comprises:
means for computing a mean value per attribute for each cluster.
- 7. The database management system of claim 6, wherein the means for initializing centroids of the clusters of the clustering model comprises:
means for seeding the centroids with a centroid of the parent cluster, which is a centroid of all points to be partitioned; and means for perturbing an attribute having a highest dispersion.
- 8. A database management system for in-database clustering, comprising:
a first data table and a second data table, each data table including a plurality of rows of data; means for building a hierarchical K-means clustering model for building a binary tree of clusters using the first data table; and means for applying the hierarchical K-means clustering model using the second data table to generate apply output data.
- 9. The database management system of claim 8, wherein the means for building a hierarchical K-means clustering model comprises:
means for creating a root node containing training data; means for choosing nodes to be split; means for, for each node to be split, splitting data associated with the node into a plurality of clusters, wherein new nodes associated with the plurality of clusters are generated; means for recording the new nodes in a tree structure; and means for refining centroids and histograms for the new nodes.
- 10. The database management system of claim 9, wherein the means for choosing nodes to be split comprises:
means for choosing nodes to be split for a balanced tree; and means for choosing nodes to be split for an unbalanced tree.
- 11. The database management system of claim 10, wherein the means for choosing nodes to be split for a balanced tree comprises:
means for choosing splits on all nodes in a level if a resulting number of leaves does not exceed a maximum number of leaves allowed; and means for ranking nodes by dispersion and choosing as many splits as are possible in order of dispersion without exceeding the maximum number of clusters allowed, if splitting on all nodes in a level is not possible.
- 12. The database management system of claim 11, wherein the means for choosing nodes to be split for an unbalanced tree comprises:
means for splitting a node having a largest dispersion.
- 13. A method for in-database clustering in a database management system, comprising the steps of:
receiving a first data table and a second data table, each data table including a plurality of rows of data; building an enhanced K-means clustering model using the first data table; and applying the enhanced K-means clustering model using the second data table to generate apply output data.
- 14. The method of claim 13, wherein the step of building an enhanced K-means clustering model comprises the steps of:
initializing centroids of the clusters of the clustering model; finding, for each data record, a cluster closest to the data record; and updating cluster centroids and histograms based on the new record assignments.
- 15. The method of claim 14, wherein the step of finding, for each data record, a cluster closest to the data record comprises the steps of:
computing a Euclidean distance between each data record and each centroid; selecting a winning cluster for each data record; and assigning each data record to its winning cluster.
- 16. The method of claim 15, wherein the step of finding, for each data record, a cluster closest to the data record further comprises the steps of:
computing a dispersion for each cluster; and computing a total error.
- 17. The method of claim 16, wherein the step of computing a dispersion for each cluster comprises the step of:
computing an average distance of members of the cluster to a centroid of the cluster.
- 18. The method of claim 17, wherein the step of updating cluster centroids and histograms based on the new record assignments comprises the step of:
computing a mean value per attribute for each cluster.
- 19. The method of claim 18, wherein the step of initializing centroids of the clusters of the clustering model comprises the steps of:
seeding the centroids with a centroid of the parent cluster, which is a centroid of all points to be partitioned; and perturbing an attribute having a highest dispersion.
- 20. A method for in-database clustering in a database management system, comprising the steps of:
receiving a first data table and a second data table, each data table including a plurality of rows of data; building a hierarchical K-means clustering model for building a binary tree of clusters using the first data table; and applying the hierarchical K-means clustering model using the second data table to generate apply output data.
- 21. The method of claim 20, wherein the step of building a hierarchical K-means clustering model comprises the steps of:
creating a root node containing training data; choosing nodes to be split; for each node to be split, splitting data associated with the node into a plurality of clusters, wherein new nodes associated with the plurality of clusters are generated; recording the new nodes in a tree structure; and refining centroids and histograms for the new nodes.
- 22. The method of claim 21, wherein the step of choosing nodes to be split comprises the steps of:
choosing nodes to be split for a balanced tree; and choosing nodes to be split for an unbalanced tree.
- 23. The method of claim 22, wherein the step of choosing nodes to be split for a balanced tree comprises the steps of:
choosing splits on all nodes in a level if a resulting number of leaves does not exceed a maximum number of leaves allowed; and ranking nodes by dispersion and choosing as many splits as are possible in order of dispersion without exceeding the maximum number of clusters allowed, if splitting on all nodes in a level is not possible.
- 24. The method of claim 23, wherein the step of choosing nodes to be split for an unbalanced tree comprises the step of:
splitting a node having a largest dispersion.
- 25. A system for in-database clustering in a database management system, comprising:
a processor operable to execute computer program instructions; a memory operable to store computer program instructions executable by the processor; and computer program instructions stored in the memory and executable to perform the steps of:
receiving a first data table and a second data table, each data table including a plurality of rows of data; building an enhanced K-means clustering model using the first data table; and applying the enhanced K-means clustering model using the second data table to generate apply output data.
- 26. The system of claim 25, wherein the step of building an enhanced K-means clustering model comprises the steps of:
initializing centroids of the clusters of the clustering model; finding, for each data record, a cluster closest to the data record; and updating cluster centroids and histograms based on the new record assignments.
- 27. The system of claim 26, wherein the step of finding, for each data record, a cluster closest to the data record comprises the steps of:
computing a Euclidean distance between each data record and each centroid; selecting a winning cluster for each data record; and assigning each data record to its winning cluster.
- 28. The system of claim 27, wherein the step of finding, for each data record, a cluster closest to the data record further comprises the steps of:
computing a dispersion for each cluster; and computing a total error.
- 29. The system of claim 28, wherein the step of computing a dispersion for each cluster comprises the step of:
computing an average distance of members of the cluster to a centroid of the cluster.
- 30. The system of claim 29, wherein the step of updating cluster centroids and histograms based on the new record assignments comprises the step of:
computing a mean value per attribute for each cluster.
- 31. The system of claim 30, wherein the step of initializing centroids of the clusters of the clustering model comprises the steps of:
seeding the centroids with a centroid of the parent cluster, which is a centroid of all points to be partitioned; and perturbing an attribute having a highest dispersion.
- 32. A system for in-database clustering in a database management system, comprising:
a processor operable to execute computer program instructions; a memory operable to store computer program instructions executable by the processor; and computer program instructions stored in the memory and executable to perform the steps of:
receiving a first data table and a second data table, each data table including a plurality of rows of data; building a hierarchical K-means clustering model for building a binary tree of clusters using the first data table; and applying the hierarchical K-means clustering model using the second data table to generate apply output data.
- 33. The system of claim 32, wherein the step of building a hierarchical K-means clustering model comprises the steps of:
creating a root node containing training data; choosing nodes to be split; for each node to be split, splitting data associated with the node into a plurality of clusters, wherein new nodes associated with the plurality of clusters are generated; recording the new nodes in a tree structure; and refining centroids and histograms for the new nodes.
- 34. The system of claim 33, wherein the step of choosing nodes to be split comprises the steps of:
choosing nodes to be split for a balanced tree; and choosing nodes to be split for an unbalanced tree.
- 35. The system of claim 34, wherein the step of choosing nodes to be split for a balanced tree comprises the steps of:
choosing splits on all nodes in a level if a resulting number of leaves does not exceed a maximum number of leaves allowed; and ranking nodes by dispersion and choosing as many splits as are possible in order of dispersion without exceeding the maximum number of clusters allowed, if splitting on all nodes in a level is not possible.
- 36. The system of claim 35, wherein the step of choosing nodes to be split for an unbalanced tree comprises the step of:
splitting a node having a largest dispersion.
- 37. A computer program product for in-database clustering in a database management system, comprising:
a computer readable medium; computer program instructions, recorded on the computer readable medium, executable by a processor, for performing the steps of:
receiving a first data table and a second data table, each data table including a plurality of rows of data; building an enhanced K-means clustering model using the first data table; and applying the enhanced K-means clustering model using the second data table to generate apply output data.
- 38. The computer program product of claim 37, wherein the step of building an enhanced K-means clustering model comprises the steps of:
initializing centroids of the clusters of the clustering model; finding, for each data record, a cluster closest to the data record; and updating cluster centroids and histograms based on the new record assignments.
- 39. The computer program product of claim 38, wherein the step of finding, for each data record, a cluster closest to the data record comprises the steps of:
computing a Euclidean distance between each data record and each centroid; selecting a winning cluster for each data record; and assigning each data record to its winning cluster.
- 40. The computer program product of claim 39, wherein the step of finding, for each data record, a cluster closest to the data record further comprises the steps of:
computing a dispersion for each cluster; and computing a total error.
- 41. The computer program product of claim 40, wherein the step of computing a dispersion for each cluster comprises the step of:
computing an average distance of members of the cluster to a centroid of the cluster.
- 42. The computer program product of claim 41, wherein the step of updating cluster centroids and histograms based on the new record assignments comprises the step of:
computing a mean value per attribute for each cluster.
- 43. The computer program product of claim 42, wherein the step of initializing centroids of the clusters of the clustering model comprises the steps of:
seeding the centroids with a centroid of the parent cluster, which is a centroid of all points to be partitioned; and perturbing an attribute having a highest dispersion.
- 44. A computer program product for in-database clustering in a database management system, comprising:
a computer readable medium; computer program instructions, recorded on the computer readable medium, executable by a processor, for performing the steps of:
receiving a first data table and a second data table, each data table including a plurality of rows of data; building a hierarchical K-means clustering model for building a binary tree of clusters using the first data table; and applying the hierarchical K-means clustering model using the second data table to generate apply output data.
- 45. The computer program product of claim 44, wherein the step of building a hierarchical K-means clustering model comprises the steps of:
creating a root node containing training data; choosing nodes to be split; for each node to be split, splitting data associated with the node into a plurality of clusters, wherein new nodes associated with the plurality of clusters are generated; recording the new nodes in a tree structure; and refining centroids and histograms for the new nodes.
- 46. The computer program product of claim 45, wherein the step of choosing nodes to be split comprises the steps of:
choosing nodes to be split for a balanced tree; and choosing nodes to be split for an unbalanced tree.
- 47. The computer program product of claim 46, wherein the step of choosing nodes to be split for a balanced tree comprises the steps of:
choosing splits on all nodes in a level if a resulting number of leaves does not exceed a maximum number of leaves allowed; and ranking nodes by dispersion and choosing as many splits as are possible in order of dispersion without exceeding the maximum number of clusters allowed, if splitting on all nodes in a level is not possible.
- 48. The computer program product of claim 47, wherein the step of choosing nodes to be split for an unbalanced tree comprises the step of:
splitting a node having a largest dispersion.
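By way of illustration only, and not as part of the claims, the build steps recited above (seeding and perturbing centroids, finding the winning cluster by Euclidean distance, computing dispersion and total error, updating centroids as per-attribute means, and choosing nodes to split) can be sketched in Python. Everything not recited in the claims is an assumption made for this sketch: the function names, the total error taken as the sum of distances to winning centroids, per-attribute dispersion taken as mean absolute deviation, the perturbation factor `eps`, the handling of empty clusters, and the simplification that all current leaves sit on the level being split. Histograms are omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def seed_children(points, eps=0.5):
    """Seed two centroids at the parent centroid, then perturb the
    attribute with the highest dispersion (assumed: mean abs. deviation;
    eps is a hypothetical perturbation factor)."""
    n = len(points)
    cols = list(zip(*points))
    parent = [sum(col) / n for col in cols]
    spread = [sum(abs(v - m) for v in col) / n for col, m in zip(cols, parent)]
    j = max(range(len(parent)), key=spread.__getitem__)
    low, high = list(parent), list(parent)
    low[j] -= eps * spread[j]
    high[j] += eps * spread[j]
    return [low, high]

def assign(records, centroids):
    """Pick the winning (closest) cluster per record; also return the
    dispersion per cluster (average member-to-centroid distance) and a
    total error (assumed: sum of distances to winning centroids)."""
    k = len(centroids)
    winners, dist_sum, counts = [], [0.0] * k, [0] * k
    for rec in records:
        dists = [euclidean(rec, c) for c in centroids]
        w = min(range(k), key=dists.__getitem__)
        winners.append(w)
        dist_sum[w] += dists[w]
        counts[w] += 1
    dispersion = [s / c if c else 0.0 for s, c in zip(dist_sum, counts)]
    return winners, dispersion, sum(dist_sum)

def update_centroids(records, winners, centroids):
    """Recompute each centroid as the mean value per attribute of its
    members; an empty cluster keeps its old centroid (assumption)."""
    updated = []
    for idx, old in enumerate(centroids):
        members = [r for r, w in zip(records, winners) if w == idx]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))
    return updated

def choose_splits(leaf_dispersion, max_leaves, balanced=True):
    """Return the leaf ids to split next, assuming binary splits (each
    split adds one leaf) and that every current leaf is on the level
    being considered. leaf_dispersion maps leaf id -> dispersion."""
    n = len(leaf_dispersion)
    ranked = sorted(leaf_dispersion, key=leaf_dispersion.get, reverse=True)
    if not balanced:
        # Unbalanced tree: split only the node with the largest dispersion.
        return ranked[:1] if n < max_leaves else []
    if 2 * n <= max_leaves:
        # Balanced tree: splitting every node on the level still fits.
        return ranked
    # Otherwise take as many splits as fit, in order of dispersion.
    return ranked[:max(0, max_leaves - n)]
```

In this sketch one build iteration would call `assign` followed by `update_centroids`, repeating until the total error stops improving; the stopping rule itself is not recited in the claims and is left to the caller.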
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The benefit under 35 U.S.C. §119(e) of provisional application No. 60/379,118, filed May 10, 2002, is hereby claimed.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60379118 | May 2002 | US |