This invention relates generally to filtering data. More particularly, this invention relates to determining filtering categories that will filter data efficiently.
Large datasets with large numbers of associated categories are difficult to navigate quickly. In some cases, filtering on certain categories will only eliminate one or two records from the dataset. Prior art techniques generally provide a list of categories and attributes to filter on without indicating or determining how the filters will affect the resulting dataset. In many cases, the prior art provides a predetermined hierarchy of categories to which records are indexed.
In view of the foregoing, it would be highly desirable to provide enhanced techniques for determining which categories will filter data efficiently.
The invention includes a computer readable storage medium with executable instructions to retrieve a dataset from a data source, where the dataset includes a first set of categories. A data structure that represents the dataset is built. A first set of merit values for the first set of categories is calculated. The first set of categories is ordered based on a criterion. The first set of categories is returned.
The invention also includes a computer readable storage medium with executable instructions to retrieve a dataset from a data source. The dataset is reordered by successively grouping on each category in a first set of categories. An enumeration tree is built. A set of merit values for the first set of categories is calculated. A second set of categories is determined, where the merit values meet a criterion. The second set of categories is returned.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The following terminology is used while disclosing embodiments of the invention:
An attribute is any non-null value in a dataset.
An attribute combination is a set or subset of attributes associated with a particular record in a dataset.
An attribute count is the number of times that a distinct attribute appears in a single category.
An attribute count data structure is a data structure (e.g., a temporary reference table, a list, a hash table, or a tree) that stores the attribute counts for all attributes in a dataset. This data structure is an optional component of the categorical filtering process described within.
A category comprises a group of correlated attributes. A category is defined by similar locations of attributes in a data source. For example, a category is a column in a database table or spreadsheet, a set of fields sharing the same tag in an XML file, or a set of fields with a shared relative location within a hierarchical data source.
Common leading attributes are the set of attributes shared between two records that come before the first differentiating attribute in the category order.
Entropy is a measure from information theory. It describes how attributes in a category are distributed. This well known measure is related to the randomness of the distribution of the attributes.
An enumeration tree is a data structure with nodes connected by edges. An enumeration tree may represent a dataset with data and metadata obtained from a dataset.
A filter comprises one or more attributes belonging to the same category that have been specified as the required value(s) for that category.
Merit value or merit is a measure of how efficient a category is in filtering data.
A nodal attribute count is a count stored in an enumeration tree node that tracks how many times an attribute appears at the end of the preceding sequence of parent node attributes in a dataset. All nodal attribute counts for a given attribute sum to the associated attribute count.
A memory 110 is also connected to the bus 106. The memory 110 stores executable instructions to implement operations of the invention. In an embodiment, the executable instructions include one or more of the following modules: an operating system module 112, a data access module 114, a data structure module 116, a category calculating module 118 and an optional Graphical User Interface (GUI) module 120.
The operating system module 112 includes executable instructions to handle various system services, such as file services, or to perform hardware dependent tasks.
The data access module 114 includes executable instructions to modify a data source query (e.g., a Structured Query Language (SQL) query, a MultiDimensional eXpressions (MDX) query, a Data Mining Extensions (DMX) query) to include specified filters. The data access module 114 also includes executable instructions to apply the generated data source query to an underlying data source, which may form a portion of computer 100 or may be accessed as a separate networked machine through the network interface circuit 108.
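By way of illustration only, the modification of a data source query to include specified filters may be sketched as follows for a SQL data source. The function name add_filters, the parameter names and the use of "?" placeholders are assumptions made for this sketch and do not appear in the described embodiments. The sketch further assumes that the base query has no existing WHERE clause, that attributes within one filter are joined with "or" (as described for multi-attribute filters), and that filters on different categories are joined with "and".

```python
def add_filters(base_query, filters, known_categories):
    """Append filter conditions to a SQL query string.

    filters maps a category (column) name to the list of attribute
    values selected for that category. Returns (sql, params).
    Hypothetical helper: all names here are illustrative assumptions.
    """
    clauses, params = [], []
    for category, values in filters.items():
        # Column names cannot be bound as parameters, so they are
        # checked against the known categories before being quoted.
        if category not in known_categories:
            raise ValueError(f"unknown category: {category}")
        if not values:
            continue
        # Attributes of one filter are alternatives: join with OR.
        ors = " OR ".join(f'"{category}" = ?' for _ in values)
        clauses.append(f"({ors})")
        params.extend(values)
    if not clauses:
        return base_query, params
    # Assumption: filters on different categories must all hold (AND).
    return f"{base_query} WHERE " + " AND ".join(clauses), params


sql, params = add_filters("SELECT * FROM t", {"B": ["b2", "b3"]}, {"A", "B"})
print(sql)     # SELECT * FROM t WHERE ("B" = ? OR "B" = ?)
print(params)  # ['b2', 'b3']
```

Binding the attribute values as parameters, rather than concatenating them into the query text, is a design choice of the sketch and is not required by the embodiments.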
The data structure module 116 includes executable instructions to build an enumeration tree data structure. This module also includes instructions to parse the enumeration tree in accordance with an embodiment of the invention.
The category calculating module 118 includes executable instructions to determine the categories that will efficiently filter the dataset and to organize the category information. In an embodiment, the category information is passed to the GUI module 120. In another embodiment, the category information is passed to another process.
The GUI module 120 is an optional component and may rely upon standard techniques to produce graphical components of a user interface, e.g., windows, icons, buttons, menus and the like. The GUI module 120 displays the successive sets of filtering categories, the filtered dataset results and the like to the user.
The executable modules stored in memory 110 are exemplary. It should be appreciated that the functions of the modules may be combined. In addition, the functions of the modules need not be performed on a single machine. Instead, the functions may be distributed across a network, if desired. Indeed, the invention is commonly implemented in a client-server environment with various components being implemented at the client-side and/or the server-side. It is the functions of the invention that are significant, not where they are performed or the specific manner in which they are performed.
In one embodiment, the data structure module 116 then optionally reorders the categories in an ascending order of number of distinct attributes 206. In an embodiment, if multiple categories have the same number of distinct attributes, they are grouped in their original order. Note that the categories need not be physically reordered. In an embodiment, it is determined whether it is more efficient to physically reorder the categories or number them so they appear to be reordered in interactions with the data access module 114. This is an optional operation; no operation in the process is dependent on this operation 206.
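As a minimal sketch of operation 206, assuming the dataset is held as a list of dictionaries keyed by category name (a representation chosen only for this example), the categories may be ordered by their number of distinct attributes without physically moving any data. The function name order_categories is hypothetical.

```python
def order_categories(records, categories):
    """Return category names in ascending order of distinct attribute count.

    Null values (None) are not attributes and are not counted.
    """
    def distinct_count(category):
        return len({r[category] for r in records if r.get(category) is not None})

    # sorted() is stable, so categories with equal counts keep
    # their original relative order.
    return sorted(categories, key=distinct_count)


records = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a3", "B": "b2", "C": None},
]
print(order_categories(records, ["A", "B", "C"]))  # ['C', 'B', 'A']
```

Because only a list of category names is returned, the records themselves are left untouched, which corresponds to numbering the categories so that they appear reordered.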
The next processing operation is to reorder the records by grouping the attributes in a descending order 208. The data structure module 116 begins by grouping on the lead category and progressing through the order.
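One possible reading of operation 208 is a multi-key sort in which the lead category is the most significant key, so that records sharing leading attributes end up adjacent. The sketch below follows that reading. The function name group_records, the list-of-dictionaries representation and the placement of null values after non-null values are assumptions of this example, and string-valued attributes are assumed.

```python
def group_records(records, category_order, descending=True):
    """Sort records so that those sharing leading attributes are adjacent.

    Assumes attributes are strings; None marks a null value.
    """
    def key(record):
        # The (is_non_null, value) pair keeps None comparable with strings.
        return tuple(
            (record.get(c) is not None, record.get(c) or "")
            for c in category_order
        )

    return sorted(records, key=key, reverse=descending)


records = [
    {"C": "c1", "B": "b2"},
    {"C": "c2", "B": "b1"},
    {"C": "c1", "B": "b1"},
]
for r in group_records(records, ["C", "B"]):
    print(r)
```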
Once this restructuring is complete, the data structure module 116 builds the enumeration tree 210. The data structure is tree-based, consisting of a single tree or a plurality of trees; one root node exists for each distinct attribute in the lead category.
The data structure module 116 begins by selecting the leading category attribute of the first record and sets it as the root node attribute. In one embodiment, the nodes of the enumeration tree contain an attribute, an ID for the parent node and a nodal attribute count. The remainder of the record is added to the tree as a series of nodes—one for each attribute—creating a single branch. The data structure module 116 tracks the last record added to the enumeration tree. This record information is used in adding subsequent records to the enumeration tree.
To add more records to the enumeration tree, the data structure module 116 selects the next record in the dataset. This record is compared to the previously added record to check for common leading attributes. All common leading attributes share the same node, much like in a prefix tree, and a nodal attribute count tracks how many records are sharing a single node. The remaining attributes of the record are added as a sub-branch beginning at the node of the last common leading attribute. If there are no common leading attributes for a record and the previously added record, then a new root node is created. Note that null values are not stored in the enumeration tree. Records with null values in the leading category select their root node attribute from the first category with a non-null value.
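The record-by-record construction described above may be sketched as follows. This is an illustrative sketch rather than the implementation of the data structure module 116: the Node class, the flat list of nodes addressed by index, and the use of a (category, value) pair as the node attribute are assumptions of the example. The records are assumed to be already grouped, since each record is compared only with the previously added record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    attribute: tuple        # (category, value) pair stored at this node
    parent: Optional[int]   # index of the parent node, None for a root
    count: int = 0          # nodal attribute count


def build_enumeration_tree(records, category_order):
    """Build the tree as a flat list of nodes from pre-grouped records."""
    nodes = []
    prev_path = []  # node indices of the previously added record
    for record in records:
        # Null values are skipped, so they are never stored in the tree.
        attrs = [(c, record[c]) for c in category_order
                 if record.get(c) is not None]
        # Count the common leading attributes with the previous record.
        shared = 0
        while (shared < len(attrs) and shared < len(prev_path)
               and nodes[prev_path[shared]].attribute == attrs[shared]):
            shared += 1
        path = prev_path[:shared]
        for node_id in path:
            nodes[node_id].count += 1  # one more record shares this node
        # The remaining attributes form a new sub-branch (or a new root).
        for attr in attrs[shared:]:
            parent = path[-1] if path else None
            nodes.append(Node(attr, parent, 1))
            path.append(len(nodes) - 1)
        prev_path = path
    return nodes


records = [
    {"C": "c1", "B": "b1", "A": "a1"},
    {"C": "c1", "B": "b1", "A": "a2"},
    {"C": "c1", "B": "b2", "A": "a1"},
    {"C": "c2", "B": "b1", "A": "a1"},
]
tree = build_enumeration_tree(records, ["C", "B", "A"])
for i, node in enumerate(tree):
    print(i, node)
```

In this example the node for c1 is shared by three records and the node for b1 beneath it by two, while c2 starts a second root.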
The data structure module 116 selects the first attribute 306 from the leftmost column of the temporary table 304 and sets it as the root node 309. The first branch of the enumeration tree is created 307 resulting in the tree 308. Note that null values are not stored in the enumeration tree.
From there, a branch is created for each further record with the root node 309 attribute c1 in its leftmost column, 310. The record 305 is compared to the previously added record to determine the common leading attributes. The nodes of the common leading attributes, in this case attribute c1 and node 311, are shared. The nodal attribute count, depicted as a superscript (e.g., 313), is incremented by 1. And, as illustrated, the remainder of record 305 is stored as a sub-branch in the enumeration tree 312.
The remainder of the data structure 315 is created by repeating this process for the remaining distinct attributes of the leftmost column 314. This completes the enumeration tree 316. As previously mentioned, all common leading attributes use the same node, hence the shared node 318.
In one embodiment, the categories are ordered by ascending merit and in another embodiment, by descending merit. In one embodiment, the categories are ordered by merit and another value derived from the data or associated metadata. In one embodiment, the category attributes are ordered by ascending frequency and in another embodiment, descending frequency. Other embodiments include, but are not limited to, ordering category attributes alphabetically, numerically, according to a user specification submitted via the GUI module 120 or leaving the attributes unordered.
The application then queries the data source 406 using the specified filters and retrieves the applicable dataset and categories 408. The application can then optionally store the dataset and categories or pass them on to another process or the GUI module 120, 409. The application may pass on any subset of the original set of categories. This subset may be based on a criterion set by the data structure module 116 (e.g., highest merit, lowest merit, closest to a target value), a similar criterion specified by the GUI module 120 or a request from the user.
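The selection of a subset of categories according to a criterion may be sketched as follows. The function name select_categories, the criterion labels and the limit parameter are hypothetical and serve only to illustrate the three example criteria named above (highest merit, lowest merit, closest to a target value).

```python
def select_categories(merits, criterion="highest", limit=3, target=None):
    """Return up to `limit` category names chosen by a merit criterion.

    merits maps a category name to its merit value.
    """
    if criterion == "highest":
        key = lambda c: -merits[c]
    elif criterion == "lowest":
        key = lambda c: merits[c]
    elif criterion == "closest":
        if target is None:
            raise ValueError("criterion 'closest' requires a target value")
        key = lambda c: abs(merits[c] - target)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    # Ties keep the insertion order of `merits` because sorted() is stable.
    return sorted(merits, key=key)[:limit]


merits = {"A": 0.1, "B": 0.5, "C": 0.3}
print(select_categories(merits, "highest", limit=2))               # ['B', 'C']
print(select_categories(merits, "closest", limit=1, target=0.25))  # ['C']
```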
The computer 100 waits for the user, or another agent, to select a filter 410. If a filter is selected (410—Yes), then the category calculating module 118 accepts the filter 412 and rebuilds the enumeration tree 414. In rebuilding the enumeration tree, the data structure module 116 copies the branches with an attribute of the selected filter in the associated category from the current enumeration tree. The process then cycles through operations 402 through 409 again, this time querying for a filtered dataset during operation 406 using the specified filter. If the filter is made up of more than one attribute, then an “or” statement is used in the query. If a filter is not selected (410—No), then the process stops until one is selected.
The first sub-operation of calculating merit, calculating the attribute counts 502, is optional. In an embodiment, to calculate the attribute counts 502 the data structure module 116 parses the enumeration tree while the category calculating module 118 sums the nodal attribute counts for each distinct node attribute. Calculating the attribute counts first provides a data structure (e.g., a temporary reference table, a list, a hash table, or a tree) from which the category calculating module 118 can retrieve summary data for future calculations. In another embodiment, this sub-operation is not performed and the category calculating module 118 requests that the data structure module 116 parse the enumeration tree to derive specific attribute counts every time one is required.
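A minimal sketch of sub-operation 502 follows, assuming the enumeration tree is available as a list of (attribute, parent index, nodal attribute count) tuples in which each attribute is a (category, value) pair. That representation and the function name attribute_counts are assumptions of the example. A hash table is used here as the attribute count data structure.

```python
from collections import Counter


def attribute_counts(nodes):
    """Sum the nodal attribute counts for each distinct node attribute."""
    counts = Counter()
    for attribute, _parent, nodal_count in nodes:
        counts[attribute] += nodal_count
    return counts


nodes = [
    (("C", "c1"), None, 3),
    (("B", "b1"), 0, 2),
    (("A", "a1"), 1, 1),
    (("A", "a2"), 1, 1),
    (("B", "b2"), 0, 1),
    (("A", "a1"), 4, 1),
    (("C", "c2"), None, 1),
    (("B", "b1"), 6, 1),
    (("A", "a1"), 7, 1),
]
counts = attribute_counts(nodes)
print(counts[("B", "b1")])  # 3
print(counts[("A", "a1")])  # 3
```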
The next sub-operation is to calculate the entropy (E) of the categories 504 using an entropy formula, such as:
$$E = -K \sum_{i=1}^{n} p(\mathrm{cat}_i) \log p(\mathrm{cat}_i)$$
where:
K is an optional constant;
n is the number of distinct attributes in the category;
log is the logarithm function, the base of which varies with different embodiments and may include the natural logarithm, common logarithm, binary logarithm or indefinite logarithms;
cat_i is the ith distinct attribute in the category; and
p(cat_i) is the probability that an attribute is cat_i, which is equivalent to the number of times cat_i occurs divided by the number of records in the dataset. The values used to calculate p(cat_i) are retrieved from the attribute count data structure constructed in the previous sub-operation 502 or derived from the enumeration tree when required.
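The entropy calculation may be sketched as follows, assuming the standard Shannon form in which E is minus K times the sum, over the n distinct attributes, of p log p. The function name category_entropy, the default of K = 1 and the default binary logarithm are assumptions of this example.

```python
import math


def category_entropy(attribute_counts, num_records, k=1.0, base=2):
    """Entropy of one category from its attribute counts.

    attribute_counts maps each distinct attribute of the category to
    the number of times it occurs; num_records is the number of
    records in the dataset (including records with a null value).
    """
    total = 0.0
    for count in attribute_counts.values():
        p = count / num_records  # probability of this attribute
        if p > 0:
            total += p * math.log(p, base)
    return -k * total


print(category_entropy({"b1": 3, "b2": 1}, num_records=4))
```

Because each probability is taken relative to the number of records in the dataset, the probabilities of a category that contains null values sum to less than one.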
The next sub-operation is to calculate category coverage 506. Category coverage is the percentage of records in the dataset that have an attribute (i.e., a non-null value) in the category. In one embodiment, the category calculating module 118 retrieves the attribute counts from the attribute count data structure and the number of records in the dataset from the data source. In another embodiment, the attribute counts are derived from the enumeration tree. The category entropies are then multiplied by the corresponding category coverage values 508.
The next sub-operation is to normalize the product from the previous sub-operation 510. Normalization may be performed by dividing the entropy-coverage product by a normalizing value z that is correlated with n, the number of distinct attributes in the category. In one embodiment z is monotonic in n. In one embodiment where z is monotonic in n, z is super linear in n. In one embodiment where z is super linear in n, z is equal to n log(n). Examples of the logarithm's base include 2, e (i.e., 2.718281828, where log_e is denoted ln) and 10. In one embodiment where z is monotonic in n, z is linear in n. In one embodiment where z is linear in n, z is equal to n. The value of n is determined from the attribute count data structure or from parsing the enumeration tree.
The result of normalization is the merit value (M). Note that the merit value is proportional to entropy and coverage, and inversely proportional to the number of distinct attributes in the category.
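Sub-operations 504 through 510 may be combined into a single merit calculation, sketched below under stated assumptions: the entropy takes the standard Shannon form, coverage is expressed as a fraction of the records rather than a percentage, and a category for which z would be zero (n equal to 0, or n equal to 1 with z = n log(n)) is assigned a merit of zero. The function name merit, its parameters and that zero-handling rule are not taken from the described embodiments.

```python
import math


def merit(attribute_counts, num_records, normalizer="nlogn", k=1.0, base=2):
    """Merit M = (entropy * coverage) / z for one category.

    attribute_counts maps each distinct attribute of the category to
    its count; normalizer selects z = n * log(n) or z = n.
    """
    n = len(attribute_counts)
    if n == 0:
        return 0.0
    # Sub-operation 504: entropy of the category.
    entropy = -k * sum(
        (c / num_records) * math.log(c / num_records, base)
        for c in attribute_counts.values() if c > 0
    )
    # Sub-operation 506: share of records with a non-null value.
    coverage = sum(attribute_counts.values()) / num_records
    # Sub-operation 510: normalizing value z, monotonic in n.
    z = n if normalizer == "n" else n * math.log(n, base)
    if z == 0:
        return 0.0  # assumption: a single-attribute category has no merit
    # Sub-operation 508 (product) followed by the normalization.
    return entropy * coverage / z


counts = {"a1": 1, "a2": 1, "a3": 1, "a4": 1}
print(merit(counts, num_records=4))                  # z = n log(n)
print(merit(counts, num_records=4, normalizer="n"))  # z = n
```

With four equally frequent attributes over four records, the entropy is 2 bits and the coverage is 1, so the two normalizers divide the product by 8 and by 4 respectively.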
After the data structure module 116 builds the enumeration tree, the category calculating module 118 takes over, periodically sending requests to the data structure module 116 to parse the enumeration tree for information. The first operation 502 is to calculate attribute counts. This operation 502 is optional. The attribute counts for the dataset 600 are:
Entropy values are then calculated as per operation 504:
The next operation is to calculate coverage values 506:
Then the entropy and coverage values are multiplied 508:
The next operation is to normalize the products of the previous operation 510:
Then the categories are ordered 404, in this case by descending merit:
The next operation, which is optional, is to order the attributes of each category 405 of
The data access module 114 then queries the data source for the dataset 600, 406 of
When a filter is selected, the data structure module 116 accepts that filter and rebuilds the enumeration tree by copying the relevant branches into a new enumeration tree. For example, if the selected filter is B=b2, the enumeration tree 700 of
The category calculating module 118 then performs the set of processing operations 402 of
The categories are then ordered (404 of
And the category attributes are optionally ordered (405 of
The data access module 114 then queries the data source (406 of
The categories and dataset may be displayed by the GUI module 120 in accordance with any number of techniques, including those described in U.S. application Ser. No. 11/555,206 filed Oct. 31, 2006, which is incorporated by reference herein in its entirety.
Embodiments of the invention include a computer readable storage medium storing executable instructions. The computer readable storage medium includes instructions to retrieve a dataset from a data source. The dataset includes a set of records and a set of categories. The instructions include instructions to reorder the set of records by successively grouping on each category in the set of categories. The instructions include instructions to build an enumeration tree. In an embodiment, a category of the computer readable medium includes a set of attributes. In an embodiment, the computer readable medium additionally includes executable instructions to calculate a count of distinct attributes in each category in the set of categories and reorder the categories by ascending order of the count of distinct attributes. In an embodiment, the computer readable medium additionally includes executable instructions to accept a filter, copy a set of applicable branches from the enumeration tree, wherein an applicable branch of the set of applicable branches complies with the filter, and build a new enumeration tree using the set of applicable branches.
An embodiment of the present invention relates to a computer storage product with a computer-readable medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
This application is a continuation of U.S. patent application Ser. No. 11/555,234 filed Oct. 31, 2006, which is related to commonly owned U.S. patent application Ser. No. 11/555,206 filed Oct. 31, 2006. The contents of each of the two applications are incorporated herein by reference.
Relation | Number | Date | Country
---|---|---|---
Parent | 11555234 | Oct 2006 | US
Child | 12347593 | | US