The present invention relates to methods and systems for partitioning datasets for data analysis, and in particular, to the analysis of retail sales information for product forecasting and pricing, allocation, and inventory management determinations.
Clustering is one of the most useful tasks in data mining applications for discovering groups and identifying interesting distributions and patterns in the underlying data. Clustering algorithms are utilized to partition a given dataset into groups, or clusters, such that the data points in a cluster are more similar to each other than points in different clusters. Partitioning a given dataset into several groups with similar attributes is of interest for various applications. In the retail environment, examples include partitioning a range of stores into groups with similar gross margin dollars, or grouping products based on their weekly sales. Partitioning datasets can simplify sales analysis and forecasting, particularly when data is missing or contains inaccuracies. It is often easier to obtain good forecasts for the aggregate sales from all items in a store or from all items in a product line than for each individual item in the store.
It is generally desirable to form clusters of similar size containing items with similar attributes. Thus, the main concern in the clustering process is to reveal the organization of patterns into “sensible” groups, which allow the user to discover similarities and differences, as well as to derive useful conclusions about them.
Various clustering algorithms have been developed for partitioning datasets. However, these clustering algorithms typically require iterative optimization techniques. Consequently, these algorithms tend to be computationally intensive, and their application is not feasible in many practical cases, where suboptimal but fast algorithms are preferred.
In the following description, reference is made to the accompanying drawings that form a part hereof and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable one of ordinary skill in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical, optical, and electrical changes may be made without departing from the scope of the present invention. The following description is, therefore, not to be taken in a limited sense, and the scope of the present invention is defined by the appended claims.
As stated earlier, various clustering algorithms have been developed for partitioning datasets. These clustering algorithms typically minimize the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances. They require iterative optimization techniques and are consequently often computationally expensive, which makes them infeasible for many practical cases where suboptimal but fast algorithms are preferred.
Intuitively, one may partition a sorted dataset by defining the cutting points at the largest gaps (highest jumps in the attribute). This strategy, referred to as the largest gap strategy, leads to groups with similar attributes; it may, however, result in groups of widely different sizes. For illustration,
A similar partitioning method, referred to as the equal range strategy, attempts to partition items into groups with similar range (similar variation of the attribute). This strategy often leads to the same shortcomings of the largest gap strategy shown in
As described above, partitioning problems are often caused by the unusual items at the tails of the dataset,
The tails can be identified as a percentage of the total number of items, as a percentage of the range of the dataset, or using a three-sigma technique. Analysis of these techniques showed that the tails can be best defined based on their range, using the following relations:
where k is the number of desired partitions and c is an empirical factor (2<c<4 leads to satisfactory results).
After partitioning off the two tails into groups 1 and 2, represented by segments 202 and 211, the usual items can be simply clustered into groups of equal size, represented by segments 203 through 210.
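The tail-and-body scheme described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name and the `low_cut` and `high_cut` thresholds are hypothetical, and the caller is assumed to supply the thresholds (for example from a range-based rule), since the relations used to derive them are not reproduced here.

```python
from typing import List


def partition_with_tails(values: List[float], k: int,
                         low_cut: float, high_cut: float) -> List[List[float]]:
    """Return k groups: low tail, k-2 equal-size groups of usual items, high tail.

    low_cut and high_cut are hypothetical, caller-supplied thresholds that
    separate the unusual items (tails) from the usual items.
    """
    if k < 3:
        raise ValueError("k must be at least 3 (two tails plus one body group)")
    if low_cut >= high_cut:
        raise ValueError("low_cut must be smaller than high_cut")

    data = sorted(values)
    low_tail = [v for v in data if v < low_cut]
    high_tail = [v for v in data if v > high_cut]
    body = [v for v in data if low_cut <= v <= high_cut]

    # Split the usual items into k-2 groups whose sizes differ by at most one.
    m = k - 2
    n = len(body)
    body_groups = [body[i * n // m:(i + 1) * n // m] for i in range(m)]

    return [low_tail] + body_groups + [high_tail]


groups = partition_with_tails([1, 50, 51, 52, 53, 54, 55, 56, 57, 100],
                              k=6, low_cut=10, high_cut=90)
```

In this sketch the tails are returned as the first and last groups; the numbering of the groups is a presentation choice and does not affect the partition itself.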
The partitioning system described below provides a fast, simple and flexible method for partitioning of a dataset. More specifically, this method:
The equal curve-length strategy, illustrated in
In this method, the total length of distribution curve 301 is first calculated, and then the curve is divided into k equal pieces, where k is the number of the partitions. In
$$\text{Length} = f(\text{range}, \text{count}) \approx \sqrt{(\text{range})^2 + (\text{count})^2} \qquad \text{(EQN. 2)}$$
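A minimal Python sketch of the equal curve-length idea is given below. It assumes the distribution curve is the sorted attribute plotted against item rank, so that each step between neighboring items contributes a segment of length sqrt(gap² + 1²), which is EQN. 2 applied with a count of one. The function name and this piecewise-linear treatment are illustrative assumptions, not a definitive implementation.

```python
import math
from typing import List


def equal_curve_length_partition(values: List[float], k: int) -> List[List[float]]:
    """Assign sorted items to k groups covering equal lengths of the curve."""
    if k < 1 or not values:
        raise ValueError("need k >= 1 and a non-empty dataset")

    data = sorted(values)

    # Cumulative curve length at each item; each step advances the count
    # by one item and the attribute by the gap to the previous item.
    cum = [0.0]
    for prev, cur in zip(data, data[1:]):
        cum.append(cum[-1] + math.hypot(cur - prev, 1.0))
    total = cum[-1]

    groups: List[List[float]] = [[] for _ in range(k)]
    for value, s in zip(data, cum):
        # Index of the curve piece (each of length total / k) containing s.
        idx = 0 if total == 0 else min(k - 1, int(s * k / total))
        groups[idx].append(value)
    return groups


halves = equal_curve_length_partition([0, 1, 2, 3], k=2)
```

Note that in this sketch a group can come out empty when a single gap is longer than one k-th of the total curve length, because no item then falls on that piece of the curve.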
The magnitude of the range variable in EQN. 2 depends on the scale of the attribute in hand, e.g., gross margin dollar. Consequently, the length function and the resulting partitions also depend on the scale of the attribute. This potential shortcoming is avoided by normalizing the attributes. Generally, good partitions are obtained when the overall range of the attribute is of the same size as the total number of the items to be grouped. This can be done using the following normalization formula:
where attr is the attribute in hand, N is the total number of items to be partitioned and K is a constant parameter defining the relative scale of the attribute versus the number of items, which is unity (K=1) here.
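The sketch below shows one normalization that is consistent with this description: it rescales the attribute so that its overall range equals K times the number of items N. The exact expression, `K * N * (attr - min) / (max - min)`, and the function name are assumptions made for illustration and may differ from the formula intended in the original.

```python
from typing import List


def normalize_attribute(values: List[float], K: float = 1.0) -> List[float]:
    """Rescale values so that their range equals K * N (assumed form)."""
    if not values:
        raise ValueError("dataset must not be empty")
    n = len(values)
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All items share the same attribute value; nothing to scale.
        return [0.0 for _ in values]
    return [K * n * (v - lo) / span for v in values]


scaled = normalize_attribute([10, 20, 30, 50], K=1.0)
```

With K=1 the normalized range matches the item count, so range and count carry comparable weight when the normalized values are used in the length calculation of EQN. 2.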
The relative importance of range (the similarity of items within a group) and count (the size of the group) can be controlled using the tuning parameter K. This feature allows customization of the method for particular applications. Large values of K (K>>1) will result in groups of relatively equal range, while small values of K (0<K<1) tend to generate groups of relatively equal size. As an illustration,
As an optional step, the partitioning method can be further improved by blending the equal curve-length and largest gap strategies described above. In this approach, the dataset is first partitioned into k−1 preliminary groups using the equal curve-length strategy. The final partitions are then defined at the largest gap within each preliminary group. This partitioning strategy takes into account the number of items per group, the range or similarity of the items within the group, as well as the gaps in the dataset, and hence produces the best partitions among the methods described above.
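One possible reading of this blended strategy is sketched below in Python. It assumes that each of the k−1 preliminary groups contributes exactly one cut, placed at its largest internal gap, and that the preliminary boundaries themselves are then discarded, which yields k final groups. The function name and the rank-based curve length (a count of one per step in EQN. 2) are illustrative assumptions.

```python
import math
from typing import List


def blended_partition(values: List[float], k: int) -> List[List[float]]:
    """Equal curve-length preliminary groups, then cuts at the largest gaps."""
    data = sorted(values)
    n = len(data)
    if k < 2 or n < 2:
        raise ValueError("need k >= 2 and at least two items")

    # Step 1: assign each item to one of k-1 preliminary groups that
    # cover equal lengths of the distribution curve.
    m = k - 1
    cum = [0.0]
    for prev, cur in zip(data, data[1:]):
        cum.append(cum[-1] + math.hypot(cur - prev, 1.0))
    total = cum[-1]
    prelim = [min(m - 1, int(s * m / total)) for s in cum]

    # Step 2: inside each preliminary group, cut just before the item
    # that follows the largest gap.
    cuts = []
    for g in range(m):
        idx = [i for i in range(n) if prelim[i] == g]
        if len(idx) < 2:
            continue  # a group with fewer than two items has no internal gap
        cuts.append(max(idx[1:], key=lambda i: data[i] - data[i - 1]))

    # Step 3: the cuts define the final groups.
    bounds = [0] + cuts + [n]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


final = blended_partition([1, 2, 3, 10, 11, 12, 20, 21, 30, 31], k=3)
```

Under this reading, a preliminary group with fewer than two items contributes no cut, so the sketch can return fewer than k groups for very small or very uneven datasets.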
The total length of the distribution curve is first calculated, for example through use of EQN. 2 provided above, and then the curve is divided into k equal pieces, where k is the number of the partitions, as shown in step 609. Finally, in step 611, the selected data is partitioned into k groups corresponding to the curve portions determined in step 609.
The Figures and description of the invention provided above describe a partitioning system that provides a fast, simple, and flexible method for partitioning a dataset. More specifically, the described method does not rely on any iteration or optimization, and hence requires little computational effort; is based on a straightforward algorithm that can be implemented in different situations; and can be easily customized for a particular application by changing a tuning parameter.
The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
Number | Date | Country |
---|---|---|
20090012979 A1 | Jan 2009 | US |