The present disclosure is related to data mining, and in particular to a data mining interest generator for identifying associations in large sets of data.
Association rule mining (ARM) is an important feature in knowledge discovery, as association rules identify relationships between data in large data collections. Knowledge discovery has many successful applications to various domains, such as market analysis, Web information processing, recommendation systems, log analysis, bioinformatics, etc.
Knowledge or data mining focuses on the discovery of unknown properties hidden in large sets of data. With the rise of knowledge discovery in databases (KDD) (an interdisciplinary field of computer science with applications to market basket analysis, Web information processing, recommendation systems, log analysis, bioinformatics, etc.), more and more techniques of machine learning and statistics are being applied to ARM, for the purpose of detecting latent relations between objects or concepts.
As a simplified example, in supermarkets it is observed that a customer who buys onions and salad cream is likely to buy potatoes. This fact is briefly denoted by the association rule {onions, salad cream}→{potatoes}. In KDD, association rule mining evaluates the confidence and interest of a candidate rule to explore the valuable relations among variables.
A method includes obtaining, at one or more processors, data comprising multiple variables corresponding to multiple samples in a very large dataset, defining, via the one or more processors, multiple sets of variables occurring in the samples comprising a set of x variables and a set of y variables, where the intersection of the sets is zero, for each set of variables, determining, via the one or more processors, a support for each set and a union of each set, determining, via the one or more processors, an interest for each of the multiple association rules of the sets of variables, and determining, via the one or more processors, a chi squared interest (χ2 interest) for each association to identify related sets of variables, including almost exclusive relationships.
A computer implemented system includes a non-transitory memory storage comprising instructions and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to obtain, via the one or more processors at a programmed computer, data comprising multiple variables corresponding to multiple samples in a very large dataset, define, via the one or more processors, multiple sets of variables occurring in the samples comprising a set of x variables and a set of y variables, where the intersection of the sets is zero, for each set of variables, determine, via the one or more processors, a support for each set and a union of each set, determine, via the one or more processors, an interest for each of the multiple association rules of the sets of variables, and determine, via the one or more processors, a chi squared interest (χ2 interest) for each association to identify related sets of variables, including almost exclusive relationships.
A non-transitory computer readable media storing computer instructions that when executed by one or more processors, cause the one or more processors to perform the steps of obtaining, via the one or more processors, data comprising multiple variables corresponding to multiple samples in a very large dataset, defining, via the one or more processors, multiple sets of variables occurring in the samples comprising a set of x variables and a set of y variables, where the intersection of the sets is zero, for each set of variables, determining, via the one or more processors, a support for each set and a union of each set, determining, via the one or more processors, an interest for each of the multiple association rules of the sets of variables, and determining, via the one or more processors, a chi squared interest (χ2 interest) for each association to identify related sets of variables, including almost exclusive relationships.
In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical and electrical changes may be made without departing from the scope of the present invention. The following description of example embodiments is, therefore, not to be taken in a limited sense, and the scope of the present invention is defined by the appended claims.
The functions or algorithms described herein may be implemented in software in one embodiment. The software may consist of computer executable instructions stored on computer readable media or computer readable storage devices such as one or more non-transitory memories or other types of hardware based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.
Current forms of association rule mining (ARM) utilize programmed computers to evaluate a confidence and interest of a candidate rule, to explore the valuable relations among variables in very large datasets having many thousands if not millions of entries. Associations may be hidden in such large sets of data and are imperceptible to humans. Candidate rules for a data set may be obtained in many different ways, and may involve single items or sets of items. One example way to develop candidate rules is to simply perform a brute force analysis of the data, sorting the items in the data by frequency of occurrence or even alphabetically, and creating a candidate rule for each pair of items. For instance, using the simplified example referenced in the background, if the items correspond to items purchased in a grocery store, candidate rules may start utilizing a sorted list that starts with apples and artichokes. In other words, when someone purchases apples, how often do they also buy other items in the list: artichokes, or bananas, or cherries, etc. Further candidates may also be explored that involve sets of items. If someone buys apples and cinnamon, are they also likely to buy butter or flour, or butter and flour, or a prepared pie crust?
While uses of ARM are described with respect to simplified sets of data to facilitate understanding of the inventive subject matter, it should be recognized that many different types of data sets may be analyzed that may have many different associations that are generally not perceptible by humans. Some associations may be almost exclusive, generally meaning that if someone buys one product, they hardly ever buy another product. Prior methods of analyzing proposed association rules have not been able to discern such almost exclusive relationships.
In further detail, once candidate rules have been generated at 120, ARM may be used to evaluate the confidence and interest of each candidate rule. For example, let x be a set of variables; its support is usually defined as the proportion of observations of x in the whole data. That is, supp(x)=Nx/n, where Nx is the number of observations of x in a sample with size n.
The support for each set of variables may then be used to obtain the confidence and interest of each candidate rule. For clarity, x∪y is denoted by xy if x∩y=0. In other words, xy is the union of x and y if x and y contain no common variables. For any association rule x→y, meaning if x occurs, is y also likely to occur, where x, y are two sets of variables satisfying x∩y=0, its confidence conf(x→y)=supp(xy)/supp(x) is actually the estimate of the conditional probability P(y|x), the probability of y given x.
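As a non-limiting illustration, the support and confidence computations above may be sketched in Python as follows; the toy transaction data, item names, and helper names are assumptions for illustration only:

```python
# Toy transaction data; each transaction is a set of purchased items.
transactions = [
    {"onions", "salad cream", "potatoes"},
    {"onions", "potatoes"},
    {"onions", "salad cream", "potatoes", "bread"},
    {"bread", "butter"},
    {"onions", "bread"},
]
n = len(transactions)

def supp(itemset):
    """supp(x) = Nx / n: fraction of transactions containing every item in x."""
    return sum(itemset <= t for t in transactions) / n

def conf(x, y):
    """conf(x -> y) = supp(x U y) / supp(x), the estimate of P(y | x)."""
    return supp(x | y) / supp(x)
```

For example, `conf({"onions", "salad cream"}, {"potatoes"})` estimates how likely a purchase of potatoes is, given a purchase of onions and salad cream, from the observed transactions.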
A conventional measure of interest (or lift) of a rule x→y is defined by
interest(x→y)=supp(xy)/(supp(x)·supp(y))  (1)
The rules with large interest are usually desirable in practice. Since (1) is simple in computation, it is widely used in ARM. However, sometimes (1) lacks rationality.
In an extreme case, supp(xy)=supp(x)=supp(y)=k/n, where k is the number of observations in a sample of size n, so that the prior measure of interest gives interest(x→y)=(k/n)/((k/n)·(k/n))=n/k. It means that the relationship between x and y is determinate in the observations. For convenience, such x, y are called binded.
Using the prior measure of interest (1), it is hard to give a rational interpretation to the binded phenomenon that interest(x→y)=n/k decreases when k increases.
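The binded behavior of the conventional interest measure can be demonstrated with a short Python sketch; the helper name `lift` and the counts are illustrative assumptions:

```python
def lift(n_xy, n_x, n_y, n):
    """Conventional interest (lift) per equation (1):
    interest(x -> y) = supp(xy) / (supp(x) * supp(y)) = (Nxy * n) / (Nx * Ny)."""
    return (n_xy * n) / (n_x * n_y)

# Binded case: Nxy = Nx = Ny = k gives lift = n / k, so the measure
# *decreases* as the binded pair is observed more often -- the
# counterintuitive behavior noted above.
rare = lift(10, 10, 10, 1000)        # binded pair seen 10 times in 1000
frequent = lift(500, 500, 500, 1000)  # binded pair seen 500 times in 1000
```

The rarely observed binded pair receives a much larger lift than the frequently observed one, even though the frequent pair is better supported by the data.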
In various embodiments of the present subject matter, a new measure of interestingness, referred to as chi squared interest (χ2 interest) is induced from a likelihood ratio, and may be interpreted by a Kullback-Leibler divergence, which is a measure of the difference between two two-point distributions. A distinguishing feature of the new measure of interestingness is its bias to the high-frequency association rules, which are those association rules that occur or are observed very often in a dataset. At the same time, it is capable of finding out the “almost exclusive” relationships between objects, which prior measures failed to provide. An almost exclusive relationship refers to a very low association between two sets of variables. In other words, observations will rarely include both sets of variables.
In one embodiment, it is assumed that the number of observations of both x and y in all samples (e.g., sale transactions, or sentences in a corpus) is binomially distributed, denoted by Nxy~B(n, θ), where θ is an unknown probability parameter of observing xy in a sample.
When Nxy is the total number of observations of xy in a sample of size n, the following is the likelihood function of the parameter θ:
L(θ|Nxy)=θ^Nxy·(1−θ)^(n−Nxy)  (2)
The likelihood function (2) is usually denoted by L(θ) for simplicity. Equation (2) is a unimodal function of θ, and the maximum likelihood estimate (MLE) of θ is θ̂=Nxy/n.
If x, y are independent, then θ can also be estimated by p=NxNy/n², and the likelihood ratio L(θ̂)/L(p)≥1 is close to 1. Otherwise, this ratio should be much bigger than 1.
When n is sufficiently large,
χ2=2[ln L(θ̂|Nxy)−ln L(p|Nxy)]~χ2(1)  (3)
The random variable χ2 varies in [0,+∞). In detail, χ2 is constructed from the random variables Nxy, Nx and Ny as follows:
χ2=2[Nxy ln(θ̂/p)+(n−Nxy) ln((1−θ̂)/(1−p))], where θ̂=Nxy/n and p=NxNy/n²  (4)
The variable defined by equation (4) is a χ2-interest, whose value measures the objective belief about the association rule x→y. In Neyman-Pearson hypothesis testing theories, at the given significance level α, the critical region for rejecting the null hypothesis H0 that x, y are independent is R=[cα,+∞), where cα is the upper α-quantile of the χ2(1) distribution. For example, c0.01≈6.635. Thus, a value of chi-squared interest greater than approximately 6.635 is considered a high value. Values at about this level and higher signify higher and higher reliability of corresponding association rules.
It means that there is likely an association rule between x and y if the observations Nxy=nxy, Nx=nx and Ny=ny make the value of equation (4) lie in the critical region R. And the bigger the χ2-value, the more probable it is that x, y are not independent.
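The χ2-interest statistic of equation (4) and the critical-region test may be sketched in Python as follows; the function names and boundary guards are illustrative assumptions:

```python
import math

def chi2_interest(n_xy, n_x, n_y, n):
    """Chi-squared interest via the log-likelihood ratio of equations (2)-(4).

    theta_hat = Nxy / n is the MLE of theta; p = Nx * Ny / n**2 is the
    estimate of theta under the null hypothesis that x and y are independent.
    """
    theta_hat = n_xy / n
    p = (n_x * n_y) / n ** 2
    # Boundary guards: 0 * log(0) is treated as 0.
    t1 = n_xy * math.log(theta_hat / p) if n_xy > 0 else 0.0
    t2 = (n - n_xy) * math.log((1 - theta_hat) / (1 - p)) if n_xy < n else 0.0
    return 2 * (t1 + t2)

# A rule is deemed significant at level 0.01 when the value lies in the
# critical region R = [6.635, +infinity).
CRITICAL_0_01 = 6.635

def is_significant(n_xy, n_x, n_y, n):
    return chi2_interest(n_xy, n_x, n_y, n) >= CRITICAL_0_01
```

When the observed joint count matches the independence expectation (Nxy = NxNy/n) the statistic is 0; binded observations drive it far into the critical region.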
The χ2-interest of a rule x→y is defined by the observed value of the statistic of equation (4):
χ2-interest(x→y)=2[nxy ln(θ̂/p)+(n−nxy) ln((1−θ̂)/(1−p))]  (5)
where θ̂=nxy/n and p=nxny/n² are computed from the observed counts.
Apparently, χ2-interest(x→y)=χ2-interest(y→x), since equation (4) is symmetric in x and y.
When (x, y) are binded, i.e., nxy=nx=ny=k, where k=1, 2, . . . , n, then by (4), the χ2-interest value is thus:
χ2=2[k ln(n/k)+(n−k) ln(n/(n+k))]
For any fixed n, setting t=k/n, this value equals 2n[t ln(1/t)+(1−t) ln(1/(1+t))], which is a unimodal function of t, as illustrated in the accompanying drawings.
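The unimodal shape of the binded χ2-interest can be checked numerically; the following Python sketch assumes the substituted binded form above, with an arbitrary sweep size n=1000 for illustration:

```python
import math

def binded_chi2(k, n):
    """Chi-squared interest of a binded pair (nxy = nx = ny = k),
    obtained by substituting theta_hat = k/n and p = k**2/n**2 into (4)."""
    return 2 * (k * math.log(n / k) + (n - k) * math.log(n / (n + k)))

# Sweep t = k/n over (0, 1): the value rises to a single peak, then falls.
n = 1000
values = [binded_chi2(k, n) for k in range(1, n)]
diffs = [b - a for a, b in zip(values, values[1:])]
```

The sequence of differences changes sign exactly once, confirming the single-peak (unimodal) behavior over t.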
For each set of variables at 430, method 400 determines a support for each set and a union of each set, and at 440, an interest for each of the multiple association rules of the sets of variables. At 450, a chi squared interest is determined for each association to identify related sets of variables, including almost exclusive relationships.
One virtue of χ2-interest is that this concept comes from frequentist statistics, with a well specified distribution in applications. As long as the sample size is sufficiently large, the χ2-interest of x→y makes sense, in the aspect of measuring the degree of non-independency between x and y. The discussed example of binded rules shows that χ2-interest coincides with intuition regarding the interest measurement, as illustrated in graph form in the accompanying drawings.
Let u=supp(xy) and v=supp(x)·supp(y); then interest(x→y)=u/v and the χ2-interest is
χ2=2n[u ln(u/v)+(1−u) ln((1−u)/(1−v))]  (6)
In fact, equation (6) can be further interpreted by means of the Kullback-Leibler divergence, a measure of the dissimilarity between two distinct distributions.
χ2=2nDKL(U∥V)  (7)
where U~u·1+(1−u)·0 (a two-point distribution), V~v·1+(1−v)·0, and DKL(U∥V) is the Kullback-Leibler divergence between U and V. If u is close to v, then the value of equation (7) is close to 0.
Writing w=u/v for the interest of x→y, equation (6) may be rewritten in terms of u and w as
χ2=2n[u ln w+(1−u) ln((1−u)/(1−u/w))]  (8)
For any fixed u (or w), (8) is a monotonic function of w (or u). The χ2-interest surface in u, w is illustrated in the accompanying drawings.
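The Kullback-Leibler interpretation can be verified numerically; the following Python sketch (with counts chosen arbitrarily for illustration) checks that the count-based statistic of equation (4) agrees with 2n·DKL(U∥V):

```python
import math

def chi2_counts(n_xy, n_x, n_y, n):
    """Equation (4): chi-squared interest from raw counts."""
    th, p = n_xy / n, n_x * n_y / n ** 2
    return 2 * (n_xy * math.log(th / p) + (n - n_xy) * math.log((1 - th) / (1 - p)))

def kl_two_point(u, v):
    """DKL(U || V) between two-point (Bernoulli) distributions with
    success probabilities u and v."""
    return u * math.log(u / v) + (1 - u) * math.log((1 - u) / (1 - v))

# Illustrative counts: u = supp(xy), v = supp(x) * supp(y).
n, n_x, n_y, n_xy = 1000, 200, 150, 60
u, v = n_xy / n, (n_x / n) * (n_y / n)
```

Both routes compute the same quantity, so the divergence form gives an information-theoretic reading of the statistic: 2n times the distance from the observed joint frequency to its independence prediction.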
Another interesting kind of knowledge mined by χ2-interest is the "almost exclusive" relationship between objects of concern. For instance, in Table 700, "small" is a significant "almost exclusive" feature value of "area" in IPKB. These kinds of facts are usually ignored by traditional ARM.
A Table 800 in
That is, the confidence of x→y is too small. It means that, in general, the customer who buys {rolls/buns; yogurt} does not buy {white wine}. Moreover, there is no antecedent of {rolls/buns; yogurt} that contains the variable of {white wine}. Thus, the combination of χ2-interest and confidence can be used to detect almost exclusive relationships.
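Combining the two signals as described above may be sketched as follows; the classifier name, the 0.05 confidence floor, and the toy counts are illustrative assumptions, not part of any claim:

```python
import math

def chi2_interest(n_xy, n_x, n_y, n):
    """Chi-squared interest (equation (4)); boundary terms treated as 0."""
    th, p = n_xy / n, n_x * n_y / n ** 2
    t1 = n_xy * math.log(th / p) if n_xy else 0.0
    t2 = (n - n_xy) * math.log((1 - th) / (1 - p)) if n_xy < n else 0.0
    return 2 * (t1 + t2)

def classify_rule(n_xy, n_x, n_y, n, crit=6.635, conf_floor=0.05):
    """Combine chi-squared interest with confidence:
    significant + low confidence    -> 'almost exclusive'
    significant + higher confidence -> 'positive association'
    otherwise                       -> 'independent' (not significant)."""
    chi2 = chi2_interest(n_xy, n_x, n_y, n)
    conf = n_xy / n_x
    if chi2 < crit:
        return "independent"
    return "almost exclusive" if conf <= conf_floor else "positive association"
```

A pair observed together far less often than independence predicts still produces a large χ2-interest; the low confidence then distinguishes the almost exclusive case from a positive association.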
Some 2-term antecedents of y={whole milk} extracted from the public database of Groceries, associated with χ2-interest and interest values, are listed as shown in table 1000 in the accompanying drawings.
Based on likelihood ratio, the use of χ2-interest provides a well-defined measurement of interestingness for the association rule xy, which evaluates the degree of non-independency between x and y. If the sample size is sufficiently large, the χ2-interest is χ2(1) distributed, and can be further interpreted by a Kullback-Leibler divergence.
The properties and advantages of χ2-interest include a bias to high-frequency observations, relationship to interest, etc. The χ2-interest is capable of mining the rules indicating the “almost exclusive” relation.
One example computing device in the form of a computer 1200 may include a processing unit 1202, memory 1203, removable storage 1210, and non-removable storage 1212. Although the example computing device is illustrated and described as computer 1200, the computing device may be in different forms in different embodiments. For example, the computing device may instead be a smartphone, a tablet, smartwatch, or other computing device including the same or similar elements as illustrated and described with regard to the accompanying drawings.
Memory 1203 may include volatile memory 1214 and/or non-volatile memory 1208. Computer 1200 may include—or have access to a computing environment that includes—a variety of computer-readable media, such as volatile memory 1214 and/or non-volatile memory 1208, removable storage 1210, and/or non-removable storage 1212. Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.
Computer 1200 may include or have access to a computing environment that includes input 1206, output 1204, and a communication connection 1216. Output 1204 may include a display device, such as a touchscreen, that also may serve as an input device. The input 1206 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 1200, and other input devices. The computer may operate in a networked environment using the communication connection 1216 to connect to one or more remote computers, such as database servers. The remote computer may include a personal computer (PC), server, router, network PC, a peer device or other common network node, or the like. The communication connection 1216 may include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, WiFi, Bluetooth, or other networks.
Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 1202 of the computer 1200. A program 1218 comprises computer-readable instructions for interest data-mining, as discussed in any of the embodiments herein.
1. In example 1, a method includes obtaining, at one or more processors, data comprising multiple variables corresponding to multiple samples in a very large dataset, defining, via the one or more processors, multiple sets of variables occurring in the samples comprising a set of x variables and a set of y variables, where the intersection of the sets is zero, for each set of variables, determining, via the one or more processors, a support for each set and a union of each set, determining, via the one or more processors, an interest for each of the multiple association rules of the sets of variables, and determining, via the one or more processors, a chi squared interest (χ2 interest) for each association to identify related sets of variables, including almost exclusive relationships.
2. The method of example 1 wherein x∪y is denoted by xy, if x∩y=0 and wherein the chi-squared interest is stored in a memory in association with each variable.
3. The method of example 2 wherein support for x is defined as supp(x)=Nx/n, where Nx is the number of observations of x in a sample with size n.
4. The method of example 3 wherein support for y is defined as supp(y)=Ny/n, where Ny is the number of observations of y in a sample with size n.
5. The method of example 4 wherein for any association rule of xy, its confidence conf(xy)=supp(xy)/supp(x).
6. The method of example 5 wherein the χ2-interest of a rule x→y is defined by: χ2=2n[u ln(u/v)+(1−u) ln((1−u)/(1−v))], where u=supp(xy) and v=supp(x)·supp(y).
7. The method of example 5 and further wherein a combination of high χ2-interest with a low confidence is representative of an almost exclusive relationship.
8. The method of example 7 wherein conf(x→y)>0.05 is indicative of a positive association between x and y where the χ2-interest is high.
9. The method of any of examples 1-8 and further comprising generating a graphical output having lines drawn between associations of each set of variables, wherein the sets of variables are generally arranged in a circle with the length of the lines connecting the sets of variables being proportional to the χ2-interest between the sets of variables.
10. In example 10, a computer implemented system includes a non-transitory memory storage comprising instructions and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to obtain, via the one or more processors, data comprising multiple variables corresponding to multiple samples in a very large dataset, define, via the one or more processors, multiple sets of variables occurring in the samples comprising a set of x variables and a set of y variables, where the intersection of the sets is zero, for each set of variables, determine, via the one or more processors, a support for each set and a union of each set, determine, via the one or more processors, an interest for each of the multiple association rules of the sets of variables, and determine, via the one or more processors, a chi squared interest (χ2 interest) for each association to identify related sets of variables, including almost exclusive relationships.
11. The system of example 10 wherein x∪y is denoted by xy, if x∩y=0, support for x is defined as supp(x)=Nx/n, where Nx is the number of observations of x in a sample with size n, and support for y is defined as supp(y)=Ny/n, where Ny is the number of observations of y in a sample with size n.
12. The system of example 11 wherein for any association rule of xy, its confidence conf(xy)=supp(xy)/supp(x).
13. The system of example 12 wherein the χ2-interest of a rule x→y is defined by: χ2=2n[u ln(u/v)+(1−u) ln((1−u)/(1−v))], where u=supp(xy) and v=supp(x)·supp(y).
14. The system of example 13 wherein a combination of high χ2-interest with a low confidence is representative of an almost exclusive relationship and wherein conf(x→y)>0.05 is indicative of a positive association between x and y where the χ2-interest is high.
15. The system of any of examples 10-14 and further comprising a display device coupled to the processor, and wherein the operations further comprise generating a graphical output for display on the display device having lines drawn between associations of each set of variables, wherein the sets of variables are generally arranged in a circle with the length of the lines connecting the sets of variables being proportional to the χ2-interest between the sets of variables.
16. In example 16, a non-transitory computer readable media storing computer instructions that when executed by one or more processors, cause the one or more processors to perform the steps of obtaining, via the one or more processors, data comprising multiple variables corresponding to multiple samples in a very large dataset, defining, via the one or more processors, multiple sets of variables occurring in the samples comprising a set of x variables and a set of y variables, where the intersection of the sets is zero, for each set of variables, determining, via the one or more processors, a support for each set and a union of each set, determining, via the one or more processors, an interest for each of the multiple association rules of the sets of variables, and determining, via the one or more processors, a chi squared interest (χ2 interest) for each association to identify related sets of variables, including almost exclusive relationships.
17. The non-transitory computer readable storage media of example 16 wherein x∪y is denoted by xy, if x∩y=0, support for x is defined as supp(x)=Nx/n, where Nx is the number of observations of x in a sample with size n, support for y is defined as supp(y)=Ny/n, where Ny is the number of observations of y in a sample with size n, and wherein for any association rule of xy, its confidence conf(xy)=supp(xy)/supp(x).
18. The non-transitory computer readable storage media of example 17 wherein the χ2-interest of a rule x→y is defined by: χ2=2n[u ln(u/v)+(1−u) ln((1−u)/(1−v))], where u=supp(xy) and v=supp(x)·supp(y).
19. The non-transitory computer readable storage media of example 18 wherein a combination of high χ2-interest with a low confidence is representative of an almost exclusive relationship and wherein conf(x→y)>0.05 is indicative of a positive association between x and y where the χ2-interest is high.
20. The non-transitory computer readable storage media of any of examples 16-19 wherein the operations further comprise generating a graphical output for a display device having lines drawn between associations of each set of variables, wherein the sets of variables are generally arranged in a circle with the length of the lines connecting the sets of variables being proportional to the χ2-interest between the sets of variables.
Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.