Claims
- 1. A method for analyzing data in a computer-implemented data mining system, comprising:
(a) accessing data from a database in the computer-implemented data mining system; and (b) performing an Expectation-Maximization (EM) algorithm in the computer-implemented data mining system to create a Gaussian Mixture Model for the accessed data, wherein the EM algorithm generates an output that describes clustering in the data by computing a mixture of probability distributions fitted to the accessed data.
- 2. The method of claim 1, wherein the EM algorithm is performed iteratively to successively improve a solution for the Gaussian Mixture Model.
- 3. The method of claim 2, wherein the EM algorithm terminates when the solution becomes stable.
- 4. The method of claim 2, wherein the solution is measured by a statistical quantity.
- 5. The method of claim 2, wherein the EM algorithm begins with an approximation to the solution.
- 6. The method of claim 2, wherein the EM algorithm uses a log ratio of likelihood to determine whether the solution has improved.
- 7. The method of claim 1, wherein the EM algorithm skips variables in the accessed data whose covariance is null and rescales the data's dimensionality accordingly.
- 8. The method of claim 1, wherein the EM algorithm uses a reciprocal of a Mahalanobis distance to approximate responsibilities in the accessed data.
- 9. The method of claim 1, wherein the EM algorithm generates random numbers from a uniform (0,1) distribution for a means for the accessed data.
- 10. The method of claim 1, wherein the EM algorithm calculates a log-likelihood of the accessed data.
- 11. The method of claim 1, wherein the EM algorithm uses an intercluster distance to distinguish segments in the accessed data.
- 12. The method of claim 1, wherein the EM algorithm uses identical covariance matrices for all clusters in the accessed data.
- 13. The method of claim 1, wherein the EM algorithm formulates the Gaussian Mixture Model so that variables are independent of one another.
- 14. The method of claim 1, wherein the EM algorithm is performed using different numbers of clusters in the accessed data, keeping track of a log-likelihood and a total number of parameters.
- 15. The method of claim 1, wherein the EM algorithm calculates linear discriminants that highlight significant differences between the segments in the accessed data.
- 16. The method of claim 1, wherein the EM algorithm accelerates matrix products by only computing products that do not become zero.
- 17. The method of claim 1, wherein the EM algorithm computes responsibilities and log-likelihood in an Expectation step only and updates parameters in a Maximization step only.
- 18. The method of claim 1, wherein the EM algorithm scales log-likelihood with n, and excludes variables for which distances are above some threshold.
- 19. The method of claim 1, wherein the EM algorithm implements a set of additions to a weight matrix that rectify weights that do not sum to equality because of user constraints.
- 20. A computer-implemented data mining system for analyzing data, comprising:
(a) a computer; (b) logic, performed by the computer, for:
(1) accessing data stored in a database; and (2) performing an Expectation-Maximization (EM) algorithm to create a Gaussian Mixture Model for the accessed data, wherein the EM algorithm generates an output that describes clustering in the data by computing a mixture of probability distributions fitted to the accessed data.
- 21. The system of claim 20, wherein the EM algorithm is performed iteratively to successively improve a solution for the Gaussian Mixture Model.
- 22. The system of claim 21, wherein the EM algorithm terminates when the solution becomes stable.
- 23. The system of claim 21, wherein the solution is measured by a statistical quantity.
- 24. The system of claim 21, wherein the EM algorithm begins with an approximation to the solution.
- 25. The system of claim 21, wherein the EM algorithm uses a log ratio of likelihood to determine whether the solution has improved.
- 26. The system of claim 20, wherein the EM algorithm skips variables in the accessed data whose covariance is null and rescales the data's dimensionality accordingly.
- 27. The system of claim 20, wherein the EM algorithm uses a reciprocal of a Mahalanobis distance to approximate responsibilities in the accessed data.
- 28. The system of claim 20, wherein the EM algorithm generates random numbers from a uniform (0,1) distribution for a means for the accessed data.
- 29. The system of claim 20, wherein the EM algorithm calculates a log-likelihood of the accessed data.
- 30. The system of claim 20, wherein the EM algorithm uses an intercluster distance to distinguish segments in the accessed data.
- 31. The system of claim 20, wherein the EM algorithm uses identical covariance matrices for all clusters in the accessed data.
- 32. The system of claim 20, wherein the EM algorithm formulates the Gaussian Mixture Model so that variables are independent of one another.
- 33. The system of claim 20, wherein the EM algorithm is performed using different numbers of clusters in the accessed data, keeping track of a log-likelihood and a total number of parameters.
- 34. The system of claim 20, wherein the EM algorithm calculates linear discriminants that highlight significant differences between the segments in the accessed data.
- 35. The system of claim 20, wherein the EM algorithm accelerates matrix products by only computing products that do not become zero.
- 36. The system of claim 20, wherein the EM algorithm computes responsibilities and log-likelihood in an Expectation step only and updates parameters in a Maximization step only.
- 37. The system of claim 20, wherein the EM algorithm scales log-likelihood with n, and excludes variables for which distances are above some threshold.
- 38. The system of claim 20, wherein the EM algorithm implements a set of additions to a weight matrix that rectify weights that do not sum to equality because of user constraints.
- 39. An article of manufacture embodying logic for analyzing data in a computer-implemented data mining system, the logic comprising:
(a) accessing data from a database in the computer-implemented data mining system; and (b) performing an Expectation-Maximization (EM) algorithm in the computer-implemented data mining system to create a Gaussian Mixture Model for the accessed data, wherein the EM algorithm generates an output that describes clustering in the data by computing a mixture of probability distributions fitted to the accessed data.
- 40. The article of manufacture of claim 39, wherein the EM algorithm is performed iteratively to successively improve a solution for the Gaussian Mixture Model.
- 41. The article of manufacture of claim 40, wherein the EM algorithm terminates when the solution becomes stable.
- 42. The article of manufacture of claim 40, wherein the solution is measured by a statistical quantity.
- 43. The article of manufacture of claim 40, wherein the EM algorithm begins with an approximation to the solution.
- 44. The article of manufacture of claim 40, wherein the EM algorithm uses a log ratio of likelihood to determine whether the solution has improved.
- 45. The article of manufacture of claim 39, wherein the EM algorithm skips variables in the accessed data whose covariance is null and rescales the data's dimensionality accordingly.
- 46. The article of manufacture of claim 39, wherein the EM algorithm uses a reciprocal of a Mahalanobis distance to approximate responsibilities in the accessed data.
- 47. The article of manufacture of claim 39, wherein the EM algorithm generates random numbers from a uniform (0,1) distribution for a means for the accessed data.
- 48. The article of manufacture of claim 39, wherein the EM algorithm calculates a log-likelihood of the accessed data.
- 49. The article of manufacture of claim 39, wherein the EM algorithm uses an intercluster distance to distinguish segments in the accessed data.
- 50. The article of manufacture of claim 39, wherein the EM algorithm uses identical covariance matrices for all clusters in the accessed data.
- 51. The article of manufacture of claim 39, wherein the EM algorithm formulates the Gaussian Mixture Model so that variables are independent of one another.
- 52. The article of manufacture of claim 39, wherein the EM algorithm is performed using different numbers of clusters in the accessed data, keeping track of a log-likelihood and a total number of parameters.
- 53. The article of manufacture of claim 39, wherein the EM algorithm calculates linear discriminants that highlight significant differences between the segments in the accessed data.
- 54. The article of manufacture of claim 39, wherein the EM algorithm accelerates matrix products by only computing products that do not become zero.
- 55. The article of manufacture of claim 39, wherein the EM algorithm computes responsibilities and log-likelihood in an Expectation step only and updates parameters in a Maximization step only.
- 56. The article of manufacture of claim 39, wherein the EM algorithm scales log-likelihood with n, and excludes variables for which distances are above some threshold.
- 57. The article of manufacture of claim 39, wherein the EM algorithm implements a set of additions to a weight matrix that rectify weights that do not sum to equality because of user constraints.
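For illustration only, the iterative EM procedure recited in the claims above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `fit_gmm_em`, its parameters, the tolerance, and the small variance floor are all hypothetical. It assumes independent variables (diagonal covariances), scales uniform (0,1) random numbers to the data range to obtain starting means, computes responsibilities and the log-likelihood in the Expectation step, updates parameters in the Maximization step, and stops when the change in log-likelihood (the log of the likelihood ratio between successive iterations) becomes small.

```python
import numpy as np


def fit_gmm_em(X, k, max_iter=200, tol=1e-6, seed=0):
    """Fit a diagonal-covariance Gaussian mixture with EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Initial approximation (assumption): uniform (0,1) draws scaled to the data range.
    lo, hi = X.min(axis=0), X.max(axis=0)
    means = lo + rng.uniform(0.0, 1.0, size=(k, d)) * (hi - lo)
    variances = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)

    prev_ll = -np.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E step: responsibilities and log-likelihood only.
        diff = X[:, None, :] - means[None, :, :]              # shape (n, k, d)
        log_pdf = -0.5 * (
            np.sum(diff ** 2 / variances, axis=2)
            + np.sum(np.log(2.0 * np.pi * variances), axis=1)
        )                                                      # shape (n, k)
        log_joint = log_pdf + np.log(weights)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)      # shape (n,)
        resp = np.exp(log_joint - log_norm[:, None])
        ll = log_norm.sum()

        # Stop when the solution is stable: the log-likelihood difference
        # equals the log of the likelihood ratio between iterations.
        if abs(ll - prev_ll) < tol * max(1.0, abs(ll)):
            break
        prev_ll = ll

        # M step: parameter updates only.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = resp.T @ X / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / nk[:, None] + 1e-6

    return weights, means, variances, ll


if __name__ == "__main__":
    # Synthetic data standing in for rows accessed from a database.
    rng = np.random.default_rng(1)
    data = np.vstack([
        rng.normal(0.0, 1.0, size=(100, 2)),
        rng.normal(10.0, 1.0, size=(100, 2)),
    ])
    w, mu, var, loglik = fit_gmm_em(data, k=2)
    print("weights:", w)
    print("means:", mu)
    print("log-likelihood:", loglik)
```

Running the same function for several values of `k` and recording the returned log-likelihood alongside the parameter count, which for this diagonal model is `(k - 1) + 2 * k * d`, is one way to compare models with different numbers of clusters.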
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is related to the following co-pending and commonly assigned patent applications:
[0002] Application Ser. No. ______, filed on same date herewith, by Paul M. Cereghini and Scott W. Cunningham, and entitled “ARCHITECTURE FOR A DISTRIBUTED RELATIONAL DATA MINING SYSTEM,” attorneys' docket number 9141;
[0003] Application Ser. No. _______, filed on same date herewith, by Mikael Bisgaard-Bohr and Scott W. Cunningham, and entitled “ANALYSIS OF RETAIL TRANSACTIONS USING GAUSSIAN MIXTURE MODELS IN A DATA MINING SYSTEM,” attorneys' docket number 9142; and
[0004] Application Ser. No. _______, filed on same date herewith, by Mikael Bisgaard-Bohr and Scott W. Cunningham, and entitled “DATA MODEL FOR ANALYSIS OF RETAIL TRANSACTIONS USING GAUSSIAN MIXTURE MODELS IN A DATA MINING SYSTEM,” attorneys' docket number 9684; all of which applications are incorporated by reference herein.