DATA MINING BY DETERMINING PATTERNS IN INPUT DATA

Information

  • Patent Application
  • 20070219992
  • Publication Number
    20070219992
  • Date Filed
    February 06, 2007
  • Date Published
    September 20, 2007
Abstract
Methods and apparatus, including computer program products, implement and use techniques for pattern detection in input data containing several transactions, each transaction having at least one item. Filter conditions for interesting patterns are received, and a first set of filter conditions applicable in connection with generation of candidate patterns is determined. An evaluated candidate pattern is selected as a parent candidate pattern, and evaluation information about the parent candidate pattern is maintained. Child candidate patterns are generated by extending the parent candidate pattern and taking into account the first set of filter conditions. The child candidate patterns are evaluated with respect to the input data together in sets of similar candidate patterns and based on the evaluation information about the parent candidate pattern. At least one child candidate pattern successfully passing the evaluation step is recursively used as a parent candidate pattern.
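
As an illustration only, the recursive parent/child scheme summarized in the abstract might be sketched as follows. This is a minimal sketch, not the claimed implementation: it assumes patterns are plain item sets (no sequences), that the only generation-time filter is a maximum pattern length, and that the only evaluation criterion is a minimum support count. The names `mine_patterns`, `min_support`, `max_len` and `batch_size` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def mine_patterns(
    transactions: Sequence[FrozenSet[str]],
    min_support: int,
    max_len: int,
    batch_size: int = 4,
) -> Dict[Pattern, int]:
    """Return every item-set pattern with support >= min_support.

    Children of a parent are evaluated together in batches of up to
    batch_size "similar" candidates (same parent, one added item each),
    and only against the transactions that already support the parent.
    """
    items = sorted({item for t in transactions for item in t})
    results: Dict[Pattern, int] = {}

    def extend(parent: Pattern, parent_tids: List[int]) -> None:
        # Generation-time filter (assumed): stop at the maximum length.
        if len(parent) >= max_len:
            return
        last = parent[-1] if parent else None
        # Only add items greater than the last one to avoid duplicates.
        new_items = [i for i in items if last is None or i > last]
        for start in range(0, len(new_items), batch_size):
            batch = new_items[start:start + batch_size]
            tids: Dict[str, List[int]] = {item: [] for item in batch}
            # One pass over the parent's supporting transactions
            # evaluates the whole batch of sibling candidates.
            for tid in parent_tids:
                t = transactions[tid]
                for item in batch:
                    if item in t:
                        tids[item].append(tid)
            for item in batch:
                if len(tids[item]) >= min_support:
                    child = parent + (item,)
                    results[child] = len(tids[item])
                    # A successful child becomes a parent in turn.
                    extend(child, tids[item])

    extend((), list(range(len(transactions))))
    return results
```

Calling `mine_patterns` with a list of `frozenset` transactions returns a dictionary from pattern tuples to support counts. Because each child inherits the list of transactions supporting its parent, deeper levels of the recursion scan progressively fewer transactions.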
Description

DESCRIPTION OF THE DRAWINGS


FIG. 1 schematically shows a computing system that may be used for data mining in accordance with one embodiment of the invention.



FIG. 2 shows a flowchart of a method where candidate patterns are evaluated with respect to input data in sets of similar candidate patterns in accordance with one embodiment of the invention.



FIGS. 3a, 3b and 3c show examples of generation of similar candidate patterns based on a common parent pattern in accordance with one embodiment of the invention.



FIG. 4 shows a flowchart of a method for extending a parent pattern into child patterns in accordance with one embodiment of the invention.



FIG. 5 shows a more detailed flowchart of a method for evaluating candidate patterns in sets of similar candidate patterns in accordance with one embodiment of the invention.



FIGS. 6a, 6b and 6c show procedures for storing evaluation information of parent candidate patterns in accordance with one embodiment of the invention.



FIGS. 7a, 7b, 7c and 7d show data structures for compressing input data efficiently in binary format in accordance with one embodiment of the invention.



FIG. 8 shows a flowchart of a method for compressing data comprised in a set of transactions into a specific data structure in accordance with one embodiment of the invention.



FIGS. 9a and 9b show flowcharts of further methods for compressing data comprised in a set of transactions in accordance with one embodiment of the invention.



FIGS. 10a, 10b and 10c show a flowchart of a method for verifying association rules with respect to compressed input data and details for the method in accordance with one embodiment of the invention.



FIG. 11 shows a flowchart of a method for verifying association rules in sets of similar rules with respect to compressed input data in accordance with one embodiment of the invention.



FIGS. 12a, 12b and 12c show a more detailed flowchart of a method for verifying association rules in sets of similar association rules with respect to compressed input data and details for the method in accordance with one embodiment of the invention.



FIGS. 13a, 13b and 13c show a flowchart of a method for verifying sequence rules with respect to compressed input data and details for the method in accordance with one embodiment of the invention.



FIG. 14 schematically shows dynamic memory management applicable to data mining applications in accordance with one embodiment of the invention.


Claims
  • 1. A computer-implemented method for detecting patterns in input data containing a plurality of transactions, each transaction having at least one item, the method comprising: receiving filter conditions for interesting patterns; determining, based on the received filter conditions, a first set of filter conditions applicable in connection with generation of candidate patterns; selecting an evaluated candidate pattern as a parent candidate pattern and maintaining evaluation information about the parent candidate pattern; generating child candidate patterns by extending the parent candidate pattern and taking into account the first set of filter conditions; evaluating the child candidate patterns with respect to the input data together in sets of similar candidate patterns and based on the evaluation information about the parent candidate pattern, each set having up to a predetermined number of similar candidate patterns and at least one set having at least two similar candidate patterns; and recursively using at least one child candidate pattern successfully passing the evaluation step as a parent candidate pattern.
  • 2. The method of claim 1, wherein candidate patterns in each set of similar candidate patterns differ from each other by respective one item added to a common parent candidate pattern.
  • 3. The method of claim 1, wherein the step of generating child candidate patterns includes one or more of the following steps: adding a new item to the parent candidate pattern's first item set; adding a new item to the parent candidate pattern's last item set; and appending a new item set consisting of one item to the parent candidate pattern.
  • 4. The method of claim 1, wherein the predetermined number is dependent on characteristics of the computing system on which the computer-implemented method is executed.
  • 5. The method of claim 1, further comprising: computing statistical measures based on the input data for use in at least one of the generation and evaluation steps, the statistical measures including at least one of: item pair statistics and weight statistics.
  • 6. The method of claim 5, further comprising: restricting the search space of the candidate patterns based on the statistical measures when applying said first set of filter conditions.
  • 7. The method of claim 5, further comprising: determining at least one of the following based on the statistical measures: which child candidate patterns to extend and the order in which to extend child candidate patterns.
  • 8. The method of claim 1, wherein said filter conditions include at least one condition based on one or more of: weight, total weight with respect to input data, average weight of supporting transactions, weight of a rule body, weight of a rule head, total weight of a rule head with respect to input data, total weight of a rule body with respect to input data, and accessible additional total weight.
  • 9. The method of claim 1, further comprising: providing data structures representing sets of transactions in the input data, the data structures including a list of identifiers of different items in a set of transactions, information indicating the number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, the bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions; and evaluating the candidate patterns using bit map operations on the bit field information.
  • 10. The method of claim 1, further comprising: maintaining data structures representing transactions in the input data, evaluated candidate patterns, evaluation information of evaluated candidate patterns, candidate patterns to be evaluated, and result patterns; and dynamically determining which data structures to keep in memory and which data structures to place on disk during generation and evaluation of child candidate patterns based on available total memory and usage of the data structures.
  • 11. The method of claim 10, further comprising: indicating for at least first data structures whether the first data structures should be prioritized when determining which data structures to keep in memory.
  • 12. The method of claim 10, further comprising: indicating for at least second data structures the latest fetching time from disk for determining which data structures to keep in memory based on the latest fetching times.
  • 13. The method of claim 1, further comprising maintaining evaluation information of the parent candidate pattern in one of the following formats: a first bit field indicating input data events contributing support for the parent candidate pattern, the length of the first bit field being equal to the number of input data events; a second bit field indicating input data events contributing support for the parent candidate pattern, the length of the second bit field being equal to the number of input data events contributing support to a further parent of the parent candidate pattern; and information about the number of input data events between two subsequent input data events contributing to support of the parent candidate pattern; wherein an input data event is one of the following: a transaction and a group of transactions.
  • 14. The method of claim 13, further comprising: choosing the format for evaluation information of the parent candidate pattern based on the support of the parent candidate pattern.
  • 15. The method of claim 1, further comprising: maintaining evaluation information of said sets of child candidate patterns in the evaluation step in bit fields indicating input data events contributing to support of the respective child candidate patterns, wherein an input data event is one of: a transaction and a group of transactions.
  • 16. The method of claim 15, wherein the length of the bit fields is equal to the number of input data events.
  • 17. The method of claim 15, wherein the number of bit fields per set of child candidate patterns is the number of input data events contributing support for the respective parent pattern.
  • 18. The method of claim 1, further comprising: determining, based on the received filter conditions, a second set of filter conditions applicable in connection with evaluation of the child candidate patterns; and taking into account said second set of filter conditions in connection with evaluation of the child candidate patterns.
  • 19. The method of claim 18, further comprising: determining, based on the received filter conditions, a third set of filter conditions applicable during determination of result patterns; taking into account said third set of filter conditions in connection with evaluation of the child candidate patterns; and outputting validly evaluated candidate patterns passing said third set of filter conditions as result patterns.
  • 20. A computer program product comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to: receive filter conditions for interesting patterns; determine, based on the received filter conditions, a first set of filter conditions applicable in connection with generation of candidate patterns; select an evaluated candidate pattern as a parent candidate pattern and maintain evaluation information about the parent candidate pattern; generate child candidate patterns by extending the parent candidate pattern and taking into account the first set of filter conditions; evaluate the child candidate patterns with respect to the input data together in sets of similar candidate patterns and based on the evaluation information about the parent candidate pattern, each set having up to a predetermined number of similar candidate patterns and at least one set having at least two similar candidate patterns; and recursively use at least one child candidate pattern successfully passing the evaluation step as a parent candidate pattern.
  • 21. The computer program product of claim 20, wherein candidate patterns in each set of similar candidate patterns differ from each other by respective one item added to a common parent candidate pattern.
  • 22. The computer program product of claim 20, wherein generating child candidate patterns includes one or more of: adding a new item to the parent candidate pattern's first item set; adding a new item to the parent candidate pattern's last item set; and appending a new item set consisting of one item to the parent candidate pattern.
  • 23. The computer program product of claim 20, wherein the predetermined number is dependent on characteristics of the computing system on which the computer-implemented method is executed.
  • 24. The computer program product of claim 20, further causing the computer to: compute statistical measures based on the input data for use in at least one of the generation and evaluation steps, the statistical measures including at least one of: item pair statistics and weight statistics.
  • 25. The computer program product of claim 24, further causing the computer to: restrict the search space of the candidate patterns based on the statistical measures when applying said first set of filter conditions.
  • 26. The computer program product of claim 20, wherein said filter conditions include at least one condition based on one or more of: weight, total weight with respect to input data, average weight of supporting transactions, weight of a rule body, weight of a rule head, total weight of a rule head with respect to input data, total weight of a rule body with respect to input data, and accessible additional total weight.
  • 27. The computer program product of claim 20, further causing the computer to: provide data structures representing sets of transactions in the input data, the data structures including a list of identifiers of different items in a set of transactions, information indicating the number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, the bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions; and evaluate the candidate patterns using bit map operations on the bit field information.
  • 28. The computer program product of claim 20, further causing the computer to: maintain data structures representing transactions in the input data, evaluated candidate patterns, evaluation information of evaluated candidate patterns, candidate patterns to be evaluated, and result patterns; and dynamically determine which data structures to keep in memory and which data structures to place on disk during generation and evaluation of child candidate patterns based on available total memory and usage of the data structures.
  • 29. The computer program product of claim 20, further causing the computer to maintain evaluation information of the parent candidate pattern in one of the following formats: a first bit field indicating input data events contributing support for the parent candidate pattern, the length of the first bit field being equal to the number of input data events; a second bit field indicating input data events contributing support for the parent candidate pattern, the length of the second bit field being equal to the number of input data events contributing support to a further parent of the parent candidate pattern; and information about the number of input data events between two subsequent input data events contributing to support of the parent candidate pattern; wherein an input data event is one of the following: a transaction and a group of transactions.
  • 30. The computer program product of claim 1, further causing the computer to: maintain evaluation information of said sets of child candidate patterns in the evaluation step in bit fields indicating input data events contributing to support of the respective child candidate patterns, wherein an input data event is one of: a transaction and a group of transactions.
  • 31. The computer program product of claim 30, wherein the length of the bit fields is equal to the number of input data events.
  • 32. The computer program product of claim 30, wherein the number of bit fields per set of child candidate patterns is the number of input data events contributing support for the respective parent pattern.
  • 33. The computer program product of claim 20, further causing the computer to: determine, based on the received filter conditions, a second set of filter conditions applicable in connection with evaluation of the child candidate patterns; and take into account said second set of filter conditions in connection with evaluation of the child candidate patterns.
  • 34. The computer program product of claim 33, further causing the computer to: determine, based on the received filter conditions, a third set of filter conditions applicable during determination of result patterns; take into account said third set of filter conditions in connection with evaluation of the child candidate patterns; and output validly evaluated candidate patterns passing said third set of filter conditions as result patterns.
  • 35. A computer system for detecting patterns in input data containing a plurality of transactions, each transaction having at least one item, the computer system comprising: means for receiving filter conditions for interesting patterns; means for determining, based on the received filter conditions, a first set of filter conditions applicable in connection with generation of candidate patterns; means for selecting an evaluated candidate pattern as a parent candidate pattern and maintaining evaluation information about the parent candidate pattern; means for generating child candidate patterns by extending the parent candidate pattern and taking into account the first set of filter conditions; means for evaluating the child candidate patterns with respect to the input data together in sets of similar candidate patterns and based on the evaluation information about the parent candidate pattern, each set having up to a predetermined number of similar candidate patterns and at least one set having at least two similar candidate patterns; and means for recursively using at least one child candidate pattern successfully passing the evaluation as a parent candidate pattern.
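
The bit field ideas recited in claims 9, 13 and 14 can be illustrated with a short sketch. It is an assumption-laden illustration rather than the claimed data structures: each input data event is taken to be a single transaction, a Python integer stands in for a bit field, and all function names and the `sparse_ratio` threshold are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence


def item_bitfields(transactions: Sequence[FrozenSet[str]]) -> Dict[str, int]:
    """Map each item to an int whose bit k is set when transaction k contains it."""
    fields: Dict[str, int] = {}
    for k, t in enumerate(transactions):
        for item in t:
            fields[item] = fields.get(item, 0) | (1 << k)
    return fields


def support_count(bits: int) -> int:
    """Number of transactions marked as supporting in the bit field."""
    return bin(bits).count("1")


def child_support(parent_bits: int, item_bits: int) -> int:
    """Bit field of a child formed by adding one item to the parent (bitwise AND)."""
    return parent_bits & item_bits


def to_gaps(bits: int) -> List[int]:
    """Encode a bit field as the count of non-supporting events before each supporting one."""
    gaps: List[int] = []
    run = 0
    k = 0
    while bits >> k:
        if (bits >> k) & 1:
            gaps.append(run)
            run = 0
        else:
            run += 1
        k += 1
    return gaps


def from_gaps(gaps: List[int]) -> int:
    """Inverse of to_gaps."""
    bits = 0
    pos = 0
    for g in gaps:
        pos += g
        bits |= 1 << pos
        pos += 1
    return bits


def choose_format(bits: int, n_events: int, sparse_ratio: float = 0.1) -> str:
    """Pick a storage format from the support; the threshold is an assumed heuristic."""
    return "gaps" if support_count(bits) < sparse_ratio * n_events else "bitfield"
```

A full-length bit field costs one bit per input data event regardless of support, whereas the gap list grows only with the number of supporting events, which is one plausible reason to choose the format based on the support of the parent pattern.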
Priority Claims (2)
Number      Date      Country  Kind
EP06111138  Mar 2006  EP       regional
EP06121743  Oct 2006  EP       regional