INPUT DATA STRUCTURE FOR DATA MINING

Information

  • Patent Application
  • Publication Number
    20070220030
  • Date Filed
    February 06, 2007
  • Date Published
    September 20, 2007
Abstract
Methods and apparatus, including computer program products, implementing and using techniques for compressing data included in several transactions. Each transaction has at least one item. A unique identifier is assigned to each different item and, if taxonomy is defined, to each different taxonomy parent. Sets of transactions are formed from the several transactions. The sets of transactions are stored using a computer data structure including: a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions. A data structure for compressing data included in a set of transactions is also provided.
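The structure described in the abstract can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions, not the implementation disclosed in the application: the names `TransactionSet`, `assign_identifiers` and `compress`, the choice of one integer bitmask per item as the bit field layout, and the default set size of 64 are all hypothetical. Taxonomy parents are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class TransactionSet:
    """One compressed set of transactions (hypothetical layout)."""
    item_ids: List[int]    # sorted identifiers of the different items in the set
    num_ids: int           # number of identifiers in the list
    num_transactions: int  # number of transactions in the set
    bit_fields: List[int]  # bit_fields[k] has bit t set if item_ids[k] is in transaction t


def assign_identifiers(transactions: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign a unique integer identifier to each different item (first-seen order)."""
    ids: Dict[str, int] = {}
    for transaction in transactions:
        for item in transaction:
            ids.setdefault(item, len(ids))
    return ids


def compress(transactions: Sequence[Sequence[str]],
             ids: Dict[str, int],
             set_size: int = 64) -> List[TransactionSet]:
    """Form sets of at most `set_size` transactions and store each as bit fields."""
    sets: List[TransactionSet] = []
    for start in range(0, len(transactions), set_size):
        chunk = transactions[start:start + set_size]
        masks: Dict[int, int] = {}
        for t, transaction in enumerate(chunk):
            for item in transaction:
                # Set bit t in the bit field of this item.
                masks[ids[item]] = masks.get(ids[item], 0) | (1 << t)
        item_ids = sorted(masks)
        sets.append(TransactionSet(item_ids, len(item_ids), len(chunk),
                                   [masks[i] for i in item_ids]))
    return sets
```

For example, compressing the three transactions `[["bread", "milk"], ["milk"], ["bread", "butter"]]` with `set_size=2` yields two sets. In the first, "bread" (identifier 0) has bit field `0b01` and "milk" (identifier 1) has bit field `0b11`. The second set holds one transaction, with identifiers 0 and 2.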
Description

DESCRIPTION OF THE DRAWINGS


FIG. 1 schematically shows a computing system that may be used for data mining in accordance with one embodiment of the invention.



FIG. 2 shows a flowchart of a method where candidate patterns are evaluated with respect to input data in sets of similar candidate patterns in accordance with one embodiment of the invention.



FIGS. 3a, 3b and 3c show examples of generation of similar candidate patterns based on a common parent pattern in accordance with one embodiment of the invention.



FIG. 4 shows a flowchart of a method for extending a parent pattern into child patterns in accordance with one embodiment of the invention.



FIG. 5 shows a more detailed flowchart of a method for evaluating candidate patterns in sets of similar candidate patterns in accordance with one embodiment of the invention.



FIGS. 6a, 6b and 6c show procedures for storing evaluation information of parent candidate patterns in accordance with one embodiment of the invention.



FIGS. 7a, 7b, 7c and 7d show data structures for compressing input data efficiently in binary format in accordance with one embodiment of the invention.



FIG. 8 shows a flowchart of a method for compressing data comprised in a set of transactions into a specific data structure in accordance with one embodiment of the invention.



FIGS. 9a and 9b show flowcharts of further methods for compressing data comprised in a set of transactions in accordance with one embodiment of the invention.



FIGS. 10a, 10b and 10c show a flowchart of a method for verifying association rules with respect to compressed input data and details for the method in accordance with one embodiment of the invention.



FIG. 11 shows a flowchart of a method for verifying association rules in sets of similar rules with respect to compressed input data in accordance with one embodiment of the invention.



FIGS. 12a, 12b and 12c show a more detailed flowchart of a method for verifying association rules in sets of similar association rules with respect to compressed input data and details for the method in accordance with one embodiment of the invention.



FIGS. 13a, 13b and 13c show a flowchart of a method for verifying sequence rules with respect to compressed input data and details for the method in accordance with one embodiment of the invention.



FIG. 14 shows schematically dynamical memory management applicable to data mining applications in accordance with one embodiment of the invention.


Claims
  • 1. A computer data structure for compressing data comprised in a set of transactions, each transaction having at least one item, the computer data structure comprising: a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions.
  • 2. The computer data structure of claim 1, wherein said list of identifiers further comprises identifiers of different taxonomy parents of the different items, and said bit field information indicates presence of the different items and of the different taxonomy parents in the set of transactions.
  • 3. The computer data structure of claim 1, wherein said bit field information further comprises one bit for each item-transaction pair, the size of the bit field being the number of identifiers times number of transactions in the set.
  • 4. The computer data structure of claim 1, wherein the set contains a predetermined number of transactions, said predetermined number being dependent on hardware.
  • 5. The computer data structure of claim 1, wherein the set of transactions belongs to a transaction group and each transaction has ordering information, said data structure further comprising: information indicating number of transactions in the transaction group, and information about the ordering information of the different transactions.
  • 6. The computer data structure of claim 5, wherein said information about the ordering information indicates differences between transactions.
  • 7. The computer data structure of claim 5, further comprising information indicating the total number of items in the set of transactions.
  • 8. The computer data structure of claim 1, further comprising at least one of the following: weight statistics for said different items, and accumulated weight statistics for said set of transactions.
  • 9. A computer-implemented method for compressing data included in a plurality of transactions, each transaction having at least one item, said method comprising: assigning a unique identifier to each different item and, if taxonomy is defined, to each different taxonomy parent, forming sets of transactions from the plurality of transactions, and storing said sets of transactions using a computer data structure including: a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions.
  • 10. The method of claim 9, further comprising: determining item frequencies and, if taxonomy is defined, taxonomy parent frequencies before assigning said unique identifiers, and discarding items having item frequency and, if present, taxonomy parent frequency less than a predefined frequency, thereby producing remaining items and remaining transactions, wherein said unique identifiers are assigned to each different remaining item and to each different remaining possible taxonomy parent.
  • 11. The method of claim 9, further comprising ordering items and identifiers in said data structures in accordance with said identifiers.
  • 12. The method of claim 9, wherein each set of transactions contains a predetermined number of transactions, said predetermined number being dependent on hardware.
  • 13. The method of claim 12, further comprising discarding transactions having less remaining items than a predefined number before forming said sets of transactions.
  • 14. The method of claim 12, further comprising ordering said remaining transactions based on similarity thereof before said step of forming sets.
  • 15. The method of claim 9, wherein each set of transactions represents a transaction group, identified by each transaction within the group carrying a same transaction group identifier, each transaction having ordering information and said computer data structures further comprising information indicating number of transactions in the transaction group, and information about the ordering information of the different transactions.
  • 16. The method of claim 15, further comprising discarding sets of transactions having less remaining items than a predefined number of items or less transactions than a predefined number of transactions.
  • 17. A computer-implemented method for detecting patterns in input data containing a plurality of transactions, each transaction having at least one item and items possibly having taxonomy parents, the method comprising: providing input data in computer data structures having a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions; and evaluating a candidate pattern using bit map operations on the bit field information of the computer data structures.
  • 18. The method of claim 17, wherein said step of providing input data contains at least one of the following: reading said data structures from a storage medium; and processing input data to form said data structures.
  • 19. The method of claim 17, further comprising taking into account evaluation information of a parent candidate pattern of said candidate pattern when evaluating said candidate pattern.
  • 20. The method of claim 19, wherein said evaluation information of said parent candidate pattern is taken into account by evaluating said candidate pattern only with respect to transactions supporting said parent candidate pattern.
  • 21. The method of claim 17, further comprising determining whether items defined by positive item constraints are present in transactions in connection with evaluating the candidate pattern.
  • 22. The method of claim 17, further comprising evaluating a set of similar candidate patterns, said set containing at least two candidate patterns, together with respect to the computer data structures.
  • 23. The method of claim 22, further comprising determining presence of common items of the set of similar candidate patterns in said computer data structures, and determining presence of non-common items of the set of similar candidate patterns in transactions of said computer data structures having said common items.
  • 24. The method of claim 17, further comprising determining whether items occur in a same order in the candidate pattern and in transactions of said computer data structures.
  • 25. A computer program product comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to: assign a unique identifier to each different item and, if taxonomy is defined, to each different taxonomy parent, form sets of transactions from the plurality of transactions, and store said sets of transactions using a computer data structure including: a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions.
  • 26. The computer program product of claim 25, further causing the computer to: determine item frequencies and, if taxonomy is defined, taxonomy parent frequencies before assigning said unique identifiers, and discard items having item frequency and, if present, taxonomy parent frequency less than a predefined frequency, thereby producing remaining items and remaining transactions, wherein said unique identifiers are assigned to each different remaining item and to each different remaining possible taxonomy parent.
  • 27. The computer program product of claim 25, further causing the computer to order items and identifiers in said data structures in accordance with said identifiers.
  • 28. The computer program product of claim 25, wherein each set of transactions contains a predetermined number of transactions, said predetermined number being dependent on hardware.
  • 29. The computer program product of claim 25, wherein each set of transactions represents a transaction group, identified by each transaction within the group carrying a same transaction group identifier, each transaction having ordering information and said computer data structures further comprising information indicating number of transactions in the transaction group, and information about the ordering information of the different transactions.
  • 30. A computer program product comprising a computer useable medium including a computer readable program, wherein the computer readable program when executed on a computer causes the computer to: provide input data in computer data structures having a list of identifiers of different items in the set of transactions, information indicating number of identifiers in the list, and bit field information indicating presence of the different items in the set of transactions, said bit field information being organized in accordance with the list for facilitating evaluation of patterns with respect to the set of transactions; and evaluate a candidate pattern using bit map operations on the bit field information of the computer data structures.
  • 31. The computer program product of claim 30, wherein the instructions to provide input data contain instructions to do at least one of the following: read said data structures from a storage medium; and process input data to form said data structures.
  • 32. The computer program product of claim 30, further causing the computer to take into account evaluation information of a parent candidate pattern of said candidate pattern when evaluating said candidate pattern.
  • 33. The computer program product of claim 30, further causing the computer to determine whether items defined by positive item constraints are present in transactions in connection with evaluating the candidate pattern.
  • 34. The computer program product of claim 30, further causing the computer to evaluate a set of similar candidate patterns, said set containing at least two candidate patterns, together with respect to the computer data structures.
  • 35. The computer program product of claim 30, further causing the computer to determine whether items occur in a same order in the candidate pattern and in transactions of said computer data structures.
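
The bit map evaluation recited in claims 17, 19 and 20 can be illustrated with a short sketch. The Python below is a minimal illustration under assumptions and is not taken from the application: the function names `support_mask` and `count_support`, the representation of each item's bit field as one integer with bit t standing for transaction t, and the use of an optional `parent_mask` to carry the parent pattern's evaluation information are all hypothetical choices.

```python
from typing import Optional, Sequence


def support_mask(item_ids: Sequence[int],
                 bit_fields: Sequence[int],
                 num_transactions: int,
                 pattern: Sequence[int],
                 parent_mask: Optional[int] = None) -> int:
    """Return a bitmask of the transactions in one set that contain every pattern item.

    If `parent_mask` is given, only transactions supporting the parent
    pattern are considered (assumed reading of claims 19 and 20).
    """
    # Start from all transactions in the set, or from the parent's supporters.
    mask = (1 << num_transactions) - 1 if parent_mask is None else parent_mask
    for item in pattern:
        if item not in item_ids:
            return 0  # item absent from this set: no transaction can match
        # Bitwise AND keeps only transactions that also contain this item.
        mask &= bit_fields[item_ids.index(item)]
        if mask == 0:
            return 0
    return mask


def count_support(mask: int) -> int:
    """Count the transactions flagged in a support bitmask."""
    return bin(mask).count("1")
```

As a usage example, take a set of three transactions with `item_ids = [0, 1, 2]` and `bit_fields = [0b101, 0b011, 0b100]`. The pattern `[0]` gives the mask `0b101`, a support of 2. Evaluating the child pattern `[0, 1]` with `parent_mask=0b101` gives `0b001`, a support of 1. In this layout the bit field holds 3 × 3 = 9 bits, matching the size described in claim 3.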
Priority Claims (2)
Number      Date      Country  Kind
EP06111140  Mar 2006  EP       regional
EP06121742  Oct 2006  EP       regional