The present disclosure relates to methods for imputing missing data elements or values in data sets, generally, and retail data sets in particular, which are an important prerequisite for use in a variety of decision-support applications in a retail supply chain which decision-support applications are premised on the availability of complete relevant data with no missing data elements. More particularly, the present disclosure relates to a system and method for multiple imputation of missing data elements in retail data sets based on the multi-dimensional, tensor representation of these data sets.
The imputation of missing data elements in retail data sets is an important prerequisite for using these retail data sets in a variety of decision-support applications of interest to retail supply-chain entities such as consumer-product manufacturers, retail chains and individual retail stores. This prerequisite invariably arises because, in practice, decision-support applications require the relevant data sets to be complete, with no missing values, whereas it is often difficult or even impossible, for various reasons, to obtain such complete retail data sets. Examples of relevant decision-support applications include, but are not limited to, product demand forecasting, inventory optimization, strategic product pricing, product-line rationalization, and promotion planning.
Some retail data sets have a particular multi-dimensional structure and although this structure is common to many decision-support applications, it is often not explicitly specified or exploited in the method steps of the current modeling and analysis.
Two particular limitations of the prior art techniques that may be used for the imputation of missing data elements in retail data sets include: First, in the prior art, these missing data elements are typically replaced by certain point estimates for their relevant imputed values, and therefore, the complete data set resulting from this replacement does not capture the natural variability which would have resulted if these missing data elements had been actually recorded instead of being imputed, and as a consequence, this will lead to a statistical bias in any subsequent analysis using the complete data set; Second, the imputation procedures that are used in the prior art typically ignore any data correlations along the various data set dimensions, or may only consider these data correlations along a single dimension of the retail data set.
In a prior art embodiment of a retail sales data set that is commonly found in many decision-support applications, there is considered a time-series sequence of various specific quantities such as unit-sales, unit-prices, stock levels, delivery levels, unsold goods, discards, etc., for a specific time-period of interest, over a collection of products in a specified retail category of interest, and simultaneously over a collection of stores in the particular market geography of interest. For instance, in typical retail sales data sets, the typical time period for this reporting may be weekly, and data may be collected in a sequence of several months to several years over hundreds of products and stores.
In essence, therefore, these retail data sets have a multi-dimensional structure, in which the specific quantities of interest mentioned above are measured and reported for a set of relevant products (whose elements are indexed by “p”), a set of relevant stores (whose elements are indexed by “s”), and a set of consecutive time-periods (whose elements are indexed by “t”), or equivalently, over a set of (p,s,t) combinations.
The use of multi-product and multi-store data, as described above, is of considerable value for any statistical analysis of interest in decision-support applications, even when, as is often the case, the specific focus of the statistical-modeling or decision-support application is confined to a single product, or to a small set of target products of interest. Specifically, even in this case, there may be examined data across multiple stores, or across the entire retail category, so that, for instance, while building statistical models, the data may be pooled across the stores to reduce the estimation errors for the model parameters. However, the inherent difficulty in acquiring this multi-dimensional data across the product, store and time-period dimensions invariably leads to these data sets having many missing data elements, which occur for specific combinations (p,s,t) of product “p”, store “s” and time-period “t” in the data set.
In the retail environment, the presence of missing data elements for a particular (p,s,t) combination may be ascribable to a variety of reasons, such as certain privacy and confidentiality issues in acquiring relevant data elements, or, what is more likely in practice, certain process errors in the data logging, reporting or integration required for compiling and assembling the required retail data set.
It would be highly desirable to provide multi-product, multi-store and multi-time period data sets for demand modeling, that addresses a pervasive limitation that arises, in this regard, due to the invariable presence of missing data records and missing data elements in the relevant sales data sets for specific combinations of product “p”, store “s” and time-period “t”.
There are now considered some of the limitations of the prior art for the handling, specification and imputation of the missing data elements.
Generally, the prior art for missing value imputation in data sets has been developed in the context of statistical analysis in the presence of missing data, as reviewed by R. Little and D. Rubin, “Statistical Analysis with Missing Data,” 2nd Edition, Wiley and Sons, 2002, wherein, in general terms, the approaches are based on classifying the mechanism that is responsible for the pattern of missing values in the data sets. For instance, these missing value patterns would be termed “Missing Completely At Random” (or MCAR) if it is assumed that the probability of a given record having a missing data element is the same for all records (that is, the pattern of missing values is completely independent of the remaining variables and factors in the data set, so that excluding any data records with these missing data elements from the data set, as in the “record deletion” approach described below, does not lead to any statistical bias in the retained data records used for the demand modeling analysis). Although the MCAR assumption may be tenable for certain types of missing values in retail data sets, in most cases, the pattern of missing values will depend on other observed factors within the data set, and the resulting missing value patterns would be termed “Missing At Random” (or MAR). The remaining cases, wherein the pattern of missing values may depend on unobserved factors, or even on the magnitude of the missing value itself, are difficult to analyze and require explicit modeling.
One of the most common approaches in the prior art for handling missing data elements is to simply omit, ignore and exclude them; however, for many statistical methods that require a complete set of data elements for each data record that is used in the analysis, this approach is equivalent to deleting the entire record, which would even include many data elements that are non-missing. For instance, if the relevant record corresponded to the unit-sales for all the products in a given store, then the entire set of data elements would be excluded if the unit-sales for just a single product is missing; this is often referred to as the so-called “record deletion” approach in statistical analysis (equivalently, this is also referred to as the “complete case” approach). It can be readily seen that this “record deletion” approach leads to a significant reduction in the data set size, including the exclusion of valid and non-missing data elements in the retail data set which may have been acquired at considerable effort and expense. Furthermore, it can also lead to significant statistical bias, as mentioned earlier, when the pattern of missing data elements depends on the values of the other data elements in the same data records, corresponding to the MAR case described earlier.
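As a minimal illustration of the data loss described above, the following Python sketch applies record deletion to a small array. The array `unit_sales`, its values, and the function name `record_deletion` are hypothetical and are not taken from any data set discussed herein; each row is assumed to be one record across three products.

```python
import numpy as np

# Hypothetical unit-sales records: rows are records, columns are products.
# np.nan marks a missing data element.
unit_sales = np.array([
    [12.0, 7.0, 3.0],
    [10.0, np.nan, 4.0],
    [11.0, 6.0, np.nan],
    [9.0, 8.0, 5.0],
])

def record_deletion(records):
    """Keep only the records (rows) that have no missing data element."""
    complete = ~np.isnan(records).any(axis=1)
    return records[complete]

kept = record_deletion(unit_sales)

# Count the valid (non-missing) values that are thrown away together
# with the incomplete records.
discarded_valid = int((~np.isnan(unit_sales)).sum() - kept.size)
```

In this toy case, two missing elements cause two whole records to be dropped, so four valid values are discarded along with them.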
An alternative approach to “record deletion” that is also widely used in the prior art, and that does not have the deficiency of having to discard the entire record including the valid data elements, is single imputation, which in its simplest form consists of replacing the missing data elements in the sales data set by statistical estimates such as the mean value, either taken globally, or taken along some marginal dimension of the data set, and in this way obtaining a “complete” data set with the missing data elements filled in suitably. For example, a missing value for the data element corresponding to a certain (p,s,t) combination can be imputed by averaging the corresponding values over the other stores for the same (p,t) combination, or equivalently, across the store dimension, keeping (p,t) fixed. A similar approach can also be taken across the time dimension, that is, by averaging the corresponding values over time for the same (p,s) combination. However, this simplest approach of imputing a missing value by replacing it with the corresponding mean value over the remaining non-missing data values along one or more dimensions of the data set has the major disadvantage that it deflates the variance and distorts the correlations for the measured quantity in the “complete” data set with these “mean-imputed” values.
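The variance deflation noted above can be seen in a small simulation. The following Python sketch is illustrative only: the tensor sizes, the 30% missing rate, and the function name `mean_impute_over_stores` are assumptions made for the example. It fills each missing (p,s,t) entry with the mean over stores for the same (p,t), and compares the across-store variance before and after.

```python
import numpy as np

def mean_impute_over_stores(tensor):
    """Fill missing (p, s, t) entries with the mean over stores for the same (p, t)."""
    store_mean = np.nanmean(tensor, axis=1, keepdims=True)  # shape (P, 1, T)
    return np.where(np.isnan(tensor), store_mean, tensor)

rng = np.random.default_rng(0)

# Hypothetical complete product x store x time tensor.
full = rng.normal(loc=100.0, scale=10.0, size=(4, 50, 6))

# Remove roughly 30% of the entries completely at random.
missing = rng.random(full.shape) < 0.3
observed = full.copy()
observed[missing] = np.nan

imputed = mean_impute_over_stores(observed)

# Average across-store variance of the observed values versus the
# variance in the mean-imputed "complete" data set.
var_observed = np.nanvar(observed, axis=1).mean()
var_imputed = imputed.var(axis=1).mean()
```

Because every imputed value sits exactly at the mean of its (p,t) slice, it adds nothing to the sum of squared deviations while increasing the count, so `var_imputed` comes out smaller than `var_observed`.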
More sophisticated methods for missing value imputation attempt to retain the naturally-occurring variance and correlation structures in the “complete” data set with the imputed values. The most widely used approach is based on multiple imputation, as reviewed by J. L. Schafer, “Analysis of Incomplete Multivariate Data,” Chapman and Hall, London (1997), wherein, instead of a single set of imputed values for the missing data elements, multiple complete data sets are created, each of which contains a representative sample for the missing values with any variability or noise “added back in,” and these multiple complete data sets are then used in subsequent analysis or decision-support procedures in suitable ways.
It would be highly desirable to provide an improved method for the specification or imputation of missing data elements in the retail data sets.
In one aspect, there is provided a multiple imputation system, method and computer program product for multidimensional retail data sets in which the correlation structures along the individual dimensions are not considered individually and separately, but are incorporated simultaneously as part of an overall multi-dimensional correlation structure.
In one embodiment, there is considered a system and method and computer program product for imputation of missing data elements in retail data sets that includes processing a correlation structure across multiple cross sections that are found in retail data sets. In one embodiment, rather than imposing smoothness requirements on the time dimension, it is assumed that the measurements in the time dimension are independent. In a further aspect, any smoothness requirements can always be incorporated by using lagged variables in the auxiliary data features along the time dimension. Furthermore, the estimation procedures described in the methodology of a further embodiment, are quite different from the estimation procedures used in the prior art for multiple imputation, and provide more generality and scalability for large data sets.
In one aspect, the system and method for multiple imputations apply to retail sales data sets that comprise quantities measured over multiple dimensions, which typically include a plurality of products, a plurality of stores, and a plurality of time-period values, or equivalently over a range of (p,s,t) values, wherein these retail data sets have missing data elements, ascribable to various causes, for certain (p,s,t) combinations in this range.
Accordingly, in one embodiment, there is provided a computer-implemented method for multiple imputation for retail data sets with missing data elements. The method comprises receiving an original data set including elements including a plurality of retail products, a plurality of retail stores or chains, and a plurality of time-periods, with the retail products, retail stores and the time-periods; identifying and encoding the missing data elements in the original data set with dummy indicator variables corresponding to specific product, store and time-period combinations; obtaining a joint probability distribution of the magnitudes of the missing data elements in the original data set; generating a plurality of complete data sets corresponding to the original data set, wherein each complete data set in the plurality of complete data sets corresponds to the original data set with its non-missing values intact, and replacing, in each of the complete data sets, missing values indicated by the dummy variables with a sampled set of values from the joint probability distribution for the missing values obtained, wherein a programmed processor device performs one or more of the receiving, identifying and encoding, obtaining, generating and replacing.
In one embodiment, a system for multiple imputation of data values for retail data sets with missing data elements comprises: at least one processor device; and at least one memory device connected to the processor, wherein the processor is programmed to perform a method, the method comprising: receiving an original data set including elements including a plurality of retail products, a plurality of retail stores or chains, and a plurality of time-periods, with the retail products, retail stores and the time-periods; identifying and encoding the missing data elements in the original data set with dummy indicator variables corresponding to specific product, store and time-period combinations; obtaining a joint probability distribution of the magnitudes of the missing data elements in the original data set; generating a plurality of complete data sets corresponding to the original data set, wherein each complete data set in the plurality of complete data sets corresponds to the original data set with its non-missing values intact, and, replacing, in each of the complete data sets, missing values indicated by the dummy variables with a sampled set of values from the joint probability distribution for the missing values obtained.
A computer program product is provided for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method. The method is the same as listed above.
The accompanying drawings are included to provide a further understanding of the present invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the accompanying description, serve to explain the principles of the invention. In the drawings,
A system, method and computer program product provides for accurate multiple imputation of missing data elements in retail data sets. As missing data elements are invariably present in these retail data sets, the specification or imputation of these missing data elements yields a “complete” data set for subsequent data analysis and modeling for various decision-support applications of interest based on this data.
That is, in one embodiment, there are implemented fast, scalable imputation methods suitable for large data sets, to obtain multiple complete data sets in which the original missing values are replaced by various imputed values, by using the method steps described herein.
One or more retail sales data sets are then obtained at 15, for example, by accessing a memory storage device such as a database, which data sets are used for performing the relevant demand modeling analysis. For the set of relevant products, the analysis data set may include an aggregate retail-sales data set including, but not limited to: a set of time series for the unit sales and unit price over multiple stores.
In a further aspect, at 20, auxiliary data sets are obtained or accessed that include relevant information pertaining to the product and/or store attributes for the products and stores included as well as certain non-primary and auxiliary data, which may comprise, while not being limited to: any information pertaining to the introduction or withdrawal of products in certain stores during certain periods, or to any overstocking or lack of product inventory of products in certain stores during certain periods. This resulting data set contains missing data values for certain combinations of product, store, and time periods.
Then, the performing of the methodology described herein at step 25 results in a plurality of complete data sets with sampled estimates for the relevant missing values, with this plurality of multiple imputed data sets being used for subsequent statistical modeling and analysis for the client decision-support application.
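One possible way to arrange such a data set for the steps that follow is sketched below in Python. The record layout `(p, s, t, value)`, the integer indexing of products, stores and time periods, and the function name `records_to_tensor` are assumptions made for illustration; the indicator array uses 1 for a recorded entry and 0 for a missing one.

```python
import numpy as np

def records_to_tensor(records, n_products, n_stores, n_periods):
    """Arrange (p, s, t, value) records into a tensor plus a 0/1 indicator."""
    tensor = np.full((n_products, n_stores, n_periods), np.nan)
    for p, s, t, value in records:
        tensor[p, s, t] = value
    observed = (~np.isnan(tensor)).astype(int)  # 1 = recorded, 0 = missing
    return tensor, observed

# Hypothetical records; any (p, s, t) combination not listed is missing.
records = [(0, 0, 0, 5.0), (0, 1, 0, 7.0), (1, 0, 1, 2.0)]
tensor, observed = records_to_tensor(records, n_products=2, n_stores=2, n_periods=2)

# The (p, s, t) combinations that need to be imputed.
missing_combinations = [tuple(int(i) for i in idx) for idx in np.argwhere(observed == 0)]
```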
In one example embodiment, the table 50 shown in
In one embodiment, Table 50 shown in
As known, CP decompositions factorize a tensor $R \in \mathbb{R}^{I \times J \times K}$ into a sum of component rank-one tensors 62a, 62b, . . . , 62D. In the computations, $U \in \mathbb{R}^{I \times D}$ denotes the aggregated matrix corresponding to the first factor, so that $u_i$ is the $D$-dimensional vector of the $i$th row of $U$ for $i = 1, \ldots, I$. Let $V \in \mathbb{R}^{J \times D}$ and $T \in \mathbb{R}^{K \times D}$ be similarly defined. Then, each entry $r_{ijk}$ in $R$ is defined as $r_{ijk} = u_i \cdot v_j \cdot t_k$, where, as shown in
As described herein with respect to
where $\delta_{ijk} = 1$ if $r_{ijk}$ is observed and 0 otherwise, and $m_{ijk}$ and $\tau$ are the mean and variance of the Gaussian distribution. In particular, the mean tensor $M = [m_{ijk}]$ has a CP decomposition in terms of matrices $U$, $V$, $T$, i.e.,
The latent factors ui 80, vj 82, and tk 84 are generated from multivariate normal distributions ui 70, vj 72 and tk 74:
$N$ denotes a normal distribution, and the model parameters are denoted $\mu_u$ 90, $\Sigma_u$ 91, $\mu_v$ 92, $\Sigma_v$ 93, $\mu_t$ 95, $\Sigma_t$ 96 and $\tau$ 98. The latent factors 80, 82, 84 are generated by one or more programmed processing units of a computing system according to the following method:
1. For each $i$, $[i]_1^I$ ($[i]_1^I$ is defined as $i = 1, \ldots, I$), generate $u_i \sim N(\mu_u, \Sigma_u)$.
2. For each $j$, $[j]_1^J$, generate $v_j \sim N(\mu_v, \Sigma_v)$.
3. For each $k$, $[k]_1^K$, generate $t_k \sim N(\mu_t, \Sigma_t)$.
4. For each non-missing entry $(i, j, k)$, generate $r_{ijk} \sim N(u_i \cdot v_j \cdot t_k, \tau)$, where $u_i \cdot v_j \cdot t_k = \sum_{d=1}^{D} u_{id} v_{jd} t_{kd}$.
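The four generative steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `sample_pptf`, the dictionary layout of the parameters, and the example sizes are hypothetical, and the sketch draws a value for every (i, j, k) so that a missing-entry mask can be applied afterwards.

```python
import numpy as np

def sample_pptf(I, J, K, mu, Sigma, tau, rng):
    """Draw latent factors and a full tensor from the generative steps 1-4."""
    U = rng.multivariate_normal(mu["u"], Sigma["u"], size=I)  # step 1, shape (I, D)
    V = rng.multivariate_normal(mu["v"], Sigma["v"], size=J)  # step 2, shape (J, D)
    T = rng.multivariate_normal(mu["t"], Sigma["t"], size=K)  # step 3, shape (K, D)
    # Mean tensor: m_ijk = sum_d u_id * v_jd * t_kd.
    M = np.einsum("id,jd,kd->ijk", U, V, T)
    # Step 4: Gaussian noise with variance tau around the mean.
    R = M + np.sqrt(tau) * rng.standard_normal(M.shape)
    return U, V, T, M, R

D = 3
rng = np.random.default_rng(0)
mu = {name: np.zeros(D) for name in ("u", "v", "t")}
Sigma = {name: np.eye(D) for name in ("u", "v", "t")}
U, V, T, M, R = sample_pptf(I=4, J=5, K=6, mu=mu, Sigma=Sigma, tau=0.1, rng=rng)
```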
Given the generative model, the likelihood function of PPTF is as follows:
where $\Theta = \{\mu_u, \Sigma_u, \mu_v, \Sigma_v, \mu_t, \Sigma_t, \tau\}$ denotes all the model parameters.
Given R 99, one embodiment includes obtaining the model parameters Θ such that p(R|Θ) is maximized. A general approach is to use the expectation maximization (EM) algorithm, which is reviewed in R. Neal and G. Hinton, “A view of the EM algorithm that justifies incremental, sparse, and other variants,” Learning in Graphical Models, M. Jordan, Ed. MIT Press, 1998. In EM, the posterior over the latent variables p(U,V,T|R,Θ) is calculated in the E-step, and the model parameters Θ are estimated in the M-step. However, the calculation of the posterior for PPTF is intractable, implying that a direct application of EM is not feasible. Therefore, one embodiment is based on a variational EM algorithm to obtain the model parameters. Variational inference is reviewed in M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, 2008. In particular, a fully factorized distribution q(U,V,T|Θ′) is introduced as an approximation of the true posterior p(U,V,T|R,Θ):
where $\Theta' = \{m_{u_i}, m_{v_j}, m_{t_k}, w_{u_i}, w_{v_j}, w_{t_k}, [i]_1^I, [j]_1^J, [k]_1^K\}$ are variational parameters. All variational parameters are $D$-dimensional vectors, and $\mathrm{diag}(w_{u_i})$ denotes a square matrix with $w_{u_i}$ on the diagonal.
Given q(U,V,T|Θ′), applying Jensen's inequality (described by M. Wainwright and M. Jordan in “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, 2008) yields a lower bound to the original log-likelihood of R:
log p(R|Θ)≧Eq[log p(U,V,T,R|Θ)]−Eq[log q(U,V,T|Θ′)].
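The inequality above follows from a short standard argument, written out here for reference under the assumption that q(U,V,T|Θ′) is positive wherever the joint density is positive. The joint density is multiplied and divided by q, and Jensen's inequality is applied to the concave logarithm:

```latex
\begin{aligned}
\log p(R \mid \Theta)
  &= \log \int q(U,V,T \mid \Theta')\,
     \frac{p(U,V,T,R \mid \Theta)}{q(U,V,T \mid \Theta')}\, dU\, dV\, dT \\
  &\ge \int q(U,V,T \mid \Theta')\,
     \log \frac{p(U,V,T,R \mid \Theta)}{q(U,V,T \mid \Theta')}\, dU\, dV\, dT \\
  &= E_q[\log p(U,V,T,R \mid \Theta)] - E_q[\log q(U,V,T \mid \Theta')].
\end{aligned}
```

The gap between the two sides is the Kullback-Leibler divergence from q to the true posterior, so the bound is tight exactly when q equals p(U,V,T|R,Θ).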
Denoting the lower bound using L(Θ, Θ′), L(Θ, Θ′) is expanded as:
The first term is given by
and the terms
have a similar form.
For Eq[log p(R|Θ,U,V,T)], there is computed
where $H$ is the total number of non-missing entries in the tensor, and $E_q[u_{id} v_{jd} t_{kd}]$ and $E_q[(\sum_d u_{id} v_{jd} t_{kd})^2]$ are given as follows:
$E_q[u_{id} v_{jd} t_{kd}] = m_{u_i d}\, m_{v_j d}\, m_{t_k d},$
and
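where $m_{v_j}^2$ is the elementwise square of $m_{v_j}$ (and likewise for $m_{t_k}^2$), $\circ$ is the elementwise product, $m_{jk} = m_{v_j} \circ m_{t_k}$, and $\Sigma^{-1}_{u,dd}$ is the $d$th element on the diagonal of $\Sigma_u^{-1}$.
For $m_{v_j}$ and $w_{v_j}$, there is computed
where $m_{ik} = m_{u_i} \circ m_{t_k}$.
For $m_{t_k}$ and $w_{t_k}$, there is computed
where $m_{ij} = m_{u_i} \circ m_{v_j}$.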
Thus, the variational E step in
where H is the total number of non-missing entries in the tensor 99. Variational M step in
In one embodiment, to predict the entry (i,j,k) using point estimate, there is computed
A maximum a posteriori (MAP) estimate is used to estimate $\{\hat{u}_i, \hat{v}_j, \hat{t}_k\}$. The MAP estimate is reviewed in M. DeGroot, Optimal Statistical Decisions, McGraw-Hill, 1970. It maximizes the posterior distribution of a random variable given its prior and the observations. In particular, for PPTF, there is computed:
For multiple imputation, an approximation $\hat{M}$ is constructed for the mean tensor using $\hat{m}_{ijk} = \hat{u}_i \cdot \hat{v}_j \cdot \hat{t}_k$. Then, if $r_{ijk}$ is missing, multiple samples of $r_{ijk}$ can be drawn from the univariate normal $N(\hat{m}_{ijk}, \tau)$.
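The sampling step described above can be sketched in Python as follows. The function name `multiple_imputations` and its arguments are hypothetical, and the factor estimates passed in are simply assumed to be available (here they are random stand-ins, not MAP estimates). Each complete data set keeps the recorded values intact and fills the missing entries with independent draws from the normal distribution with the estimated mean and variance τ.

```python
import numpy as np

def multiple_imputations(R, observed, U_hat, V_hat, T_hat, tau, n_sets, rng):
    """Return n_sets complete tensors; recorded entries are left unchanged."""
    # Estimated mean tensor: m_ijk = sum_d u_id * v_jd * t_kd.
    M_hat = np.einsum("id,jd,kd->ijk", U_hat, V_hat, T_hat)
    complete_sets = []
    for _ in range(n_sets):
        draws = rng.normal(loc=M_hat, scale=np.sqrt(tau))  # tau is a variance
        complete_sets.append(np.where(observed == 1, R, draws))
    return np.stack(complete_sets)

rng = np.random.default_rng(0)
U_hat = rng.normal(size=(3, 2))   # stand-in factor estimates
V_hat = rng.normal(size=(4, 2))
T_hat = rng.normal(size=(5, 2))
truth = np.einsum("id,jd,kd->ijk", U_hat, V_hat, T_hat)
observed = (rng.random(truth.shape) < 0.7).astype(int)
R = np.where(observed == 1, truth, np.nan)

sets = multiple_imputations(R, observed, U_hat, V_hat, T_hat, tau=0.25, n_sets=4, rng=rng)
```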
The method steps 150 for multiple imputation are illustrated in
where α−1 is the precision for the Gaussian distribution and
As a Bayesian model, BPTF maintains prior distributions over U,V,T,α. In particular, BPTF model assumes multivariate normal priors over ui, vj, and tk:
Here μ denotes the mean and Λ denotes the precision matrix for the factors. In one embodiment, the latent factors 280, 282, 284 are generated by one or more programmed processing units of a computing system according to the following generative process of BPTF:
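The programmed method continues by letting $\Theta_u = (\mu_u, \Lambda_u)$, $\Theta_v = (\mu_v, \Lambda_v)$, $\Theta_t = (\mu_t, \Lambda_t)$. The parameters $\Theta_u$, $\Theta_v$, $\Theta_t$ for each factor also have normal-Wishart hyperpriors. In particular, for some fixed hyperparameters $\mu_0 \in \mathbb{R}^D$ and $W_0 \in \mathbb{R}^{D \times D}$ with $W_0 > 0$, there is defined:
$$
\begin{aligned}
p(\Theta_u \mid \mu_0, W_0) &= p(\mu_u, \Lambda_u) = p(\mu_u \mid \Lambda_u)\, p(\Lambda_u) = N(\mu_u \mid \mu_0, (c_0 \Lambda_u)^{-1})\, W(\Lambda_u \mid W_0, \nu_0), \\
p(\Theta_v \mid \mu_0, W_0) &= p(\mu_v, \Lambda_v) = p(\mu_v \mid \Lambda_v)\, p(\Lambda_v) = N(\mu_v \mid \mu_0, (c_0 \Lambda_v)^{-1})\, W(\Lambda_v \mid W_0, \nu_0), \\
p(\Theta_t \mid \mu_0, W_0) &= p(\mu_t, \Lambda_t) = p(\mu_t \mid \Lambda_t)\, p(\Lambda_t) = N(\mu_t \mid \mu_0, (c_0 \Lambda_t)^{-1})\, W(\Lambda_t \mid W_0, \nu_0),
\end{aligned}
$$
where $W(\cdot \mid W_0, \nu_0)$ is the Wishart distribution with $\nu_0$ degrees of freedom and a $D \times D$ scale matrix $W_0$. In addition, $\alpha$ has a Gamma prior: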
p(α)˜W(α|
The likelihood conditioned on the hyperparameters can be written as:
p(R|μ0,W0,v0,
The distribution of an unknown entry rijk given the observable tensor R is obtained from
p(rijk|R,Θ0)=∫U,V,T∫Θ
Sampling from this posterior distribution will provide the required multiple imputations of the missing entries. However, since direct computation of the integral is intractable, one embodiment uses a sampling-based method for approximating the posterior distribution as needed.
Since Θu={μu, Λu} is conditionally independent of all other variables given U, its conditional probability is given by:
Similarly, the conditional distribution for Θv={μv, Λv} is given by:
The conditional distribution for Θt={μt, Λt} is given by:
The conditional distribution of the matrix U factorizes over individual components ui, which are conditionally independent of Θv and Θt. Hence, there is computed:
Similarly, the conditional distribution for V is given by
The conditional distribution for T is given by
The conditional distribution of α is given by
The method steps 300 based on the MCMC algorithm require cyclically sampling, according to loop index “g”, the parameters (Θu, Θv, Θt, α) at 305 according to equations (15)-(17), and the factors (U,V,T) at 310 according to equations (18)-(20). After numerous cycles, the MCMC algorithm converges to the stationary distribution, which can be regarded as the true posterior, and from which samples can be obtained for the following potential requirements:
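Equations (15)-(20) are not reproduced in this text. As a rough sketch of what a single factor update inside such a cycle may look like, the following Python code draws one row of U from the Gaussian conditional that follows, by standard conjugacy, from a normal prior with mean μu and precision Λu together with Gaussian observation noise. The function name `sample_u_i` and its arguments are hypothetical, treating α as the noise precision is an assumption of this sketch, and the update is assumed rather than confirmed to correspond to the factor sampling at 310.

```python
import numpy as np

def sample_u_i(r_i, observed_i, V, T, alpha, mu_u, Lambda_u, rng):
    """Draw u_i given V, T, the noise precision alpha and the prior (mu_u, Lambda_u).

    r_i and observed_i are the J x K slices of the data tensor and of the
    0/1 indicator for row i.
    """
    j_idx, k_idx = np.nonzero(observed_i)
    Q = V[j_idx] * T[k_idx]            # rows are v_j * t_k (elementwise), shape (n_obs, D)
    r = r_i[j_idx, k_idx]
    precision = Lambda_u + alpha * Q.T @ Q
    cov = np.linalg.inv(precision)
    cov = 0.5 * (cov + cov.T)          # remove floating-point asymmetry
    mean = cov @ (Lambda_u @ mu_u + alpha * Q.T @ r)
    return rng.multivariate_normal(mean, cov), mean, cov

rng = np.random.default_rng(0)
D, J, K = 2, 4, 3
V = rng.normal(size=(J, D))
T = rng.normal(size=(K, D))
u_true = np.array([1.5, -0.5])
r_i = np.einsum("d,jd,kd->jk", u_true, V, T)   # noise-free observations for one row
observed_i = np.ones((J, K), dtype=int)

sample, mean, cov = sample_u_i(r_i, observed_i, V, T, alpha=1e8,
                               mu_u=np.zeros(D), Lambda_u=np.eye(D), rng=rng)
```

With noise-free data and a very large noise precision, the conditional mean should land close to the factor row that generated the data, which gives a simple sanity check on the update.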
(1) To obtain independent estimates for the factors {U(g),V(g),T(g)}, vide
(2) To obtain independent estimates of M; in this respect L samples are taken vide
(3) To construct multiple imputations for the missing values reference is now had to the method 350 shown
Thus, vide
More particularly,
As an example illustrating the particular embodiments, there is now described the application of the methodology in the context of a sales data set with missing data elements for a retail category corresponding to a household staple grocery with products having a retail shelf life of about a week.
In the example, a “real-world” sales data set is used comprising, for example, the unit-sales and unit-price data for the product category (e.g., provided as a computer file) which contains weekly-aggregated sales data on 333 products with unique UPC codes in the category, wherein UPC stands for Universal Product Code, which is a barcode-implemented product identifier that is commonly used for tracking products in retail stores, and this sales data is collected from 146 stores whose TDLinx codes were within the same metropolitan market geography, over a 3 year period from 2006 to 2009, wherein TDLinx is a location-based code, which was developed by Nielsen (http://en-us.nielsen.com) to specify a unique retail channel, such as an individual store, retail outlet or retail sales account. Each record in this data set, therefore, contains separate fields with the UPC code, TDLinx code, week index, unit sales and unit price information, for each (product, store, and week) or (p,s,t) combination for which the aggregated sales data is reported. As noted, the missing data elements for a particular (p,s,t) combination may arise due to a variety of causes including product introduction delays, product withdrawals, process errors in the data collection and logging etc., and many of these causes can be in fact identified by examining the pattern of missing values in the data set. In addition to the sales data set for the product category, some partial auxiliary data was also available on store promotions, inventory stock-outs and coupon redemptions, and this auxiliary data can be joined to the sales data, to support various extensions of the analysis that incorporate these auxiliary data elements according to further embodiments.
Furthermore, additional detailed information on the various individual attributes for the products in the sales data set can be obtained from a product master-data file, which contains information such as brand, packaging and product type. Finally, since the product category under study corresponds to an example “processed-food” category, additional data on the health-benefits, nutritional composition and product quality can also be ascertained from the product label information in public-domain databases. These auxiliary data elements can be incorporated into the method steps described according to the various embodiments herein, for instance, to identify sets of products that are similar to the products that are of particular interest; the retail sales data elements for these additional products can be included in the enhanced data set for carrying out the multiple imputation of the missing data elements, specifically enhancing the results of this multiple imputation for the products that are of particular interest.
Finally, detailed information on the store demographics and characteristics can also be obtained by combining data from public and private databases for the store dimension. These auxiliary data elements can be incorporated into the method steps described according to the various embodiments herein, for instance, to identify sets of stores that are similar to the stores that are of particular interest; the data elements in these additional stores can be included in the enhanced data set for carrying out the multiple imputation of the missing data elements, specifically for the stores that are of particular interest.
It can be noted that the use of any auxiliary data can even be solely for the purpose of missing data imputation, and once this imputation has been completed this auxiliary data need not be required or provided for the subsequent statistical modeling. Therefore, the use of tensor-based approaches incorporating auxiliary data may be used for missing data imputation, even in situations where it may be impossible to share the auxiliary data with the entities responsible for the subsequent statistical modeling. As an example, consider a retail chain with multiple stores, in which each store is interested in demand modeling based on its sales data, although many of these stores have data sets with missing data elements. The retail chain can, in this situation, collect the individual store data sets, and perform a multiple imputation for the missing values, using a tensor-based approach incorporating the data from all the stores. Finally, each store can be provided with its relevant subset from each multiple imputation data set, to obtain corresponding multiple imputation data sets for use in its demand modeling requirements as it sees fit, without needing to ever have access to the data from the other stores. It can be readily surmised that having access to any auxiliary data, through the parent retail chain in this case, will considerably improve the quality of the multiple imputation data sets for each store, over what would be possible with the alternative of each store using only its own data for this purpose.
Given the sales data set described above, the method steps of the PPTF or BPTF algorithms as described previously for multiple imputation can be directly implemented. The particular embodiment described herein uses various techniques for generating random sequences from the various probability distributions encountered in the descriptions therein; for instance, the Box-Muller transform as described in G. Box and M. Muller, “A Note on the Generation of Random Normal Deviates,” The Annals of Mathematical Statistics, Vol. 29, No. 2, 1958, for random sampling from a Gaussian distribution; and the Bartlett-decomposition algorithm described in W. Smith and R. Hocking, “Algorithm AS 53: Wishart Variate Generator,” Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 21, No. 3, pp. 341-345, 1972, for sampling from a Wishart distribution. The techniques for generating random sequences are used in steps (15)-(21) in the method steps shown and described in
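For orientation, the two sampling techniques cited above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation of the embodiment: the function names `box_muller` and `wishart_bartlett`, the example scale matrix and the degrees of freedom are all assumptions, and the degrees of freedom are assumed to be at least the matrix dimension.

```python
import numpy as np

def box_muller(n, rng):
    """Return n standard normal deviates built from pairs of uniform deviates."""
    m = (n + 1) // 2
    u1 = 1.0 - rng.random(m)          # in (0, 1], which avoids log(0)
    u2 = rng.random(m)
    radius = np.sqrt(-2.0 * np.log(u1))
    angle = 2.0 * np.pi * u2
    z = np.concatenate([radius * np.cos(angle), radius * np.sin(angle)])
    return z[:n]

def wishart_bartlett(scale, dof, rng):
    """Draw one Wishart(scale, dof) matrix via the Bartlett decomposition."""
    D = scale.shape[0]
    L = np.linalg.cholesky(scale)     # scale = L @ L.T
    A = np.zeros((D, D))
    for i in range(D):
        # Diagonal: square root of a chi-square variate with dof - i degrees
        # of freedom (i counted from 0); below the diagonal: standard normals.
        A[i, i] = np.sqrt(rng.chisquare(dof - i))
        A[i, :i] = box_muller(i, rng)
    LA = L @ A
    return LA @ LA.T

rng = np.random.default_rng(1)
z = box_muller(20001, rng)

scale = np.array([[1.0, 0.3], [0.3, 2.0]])
dof = 5
draws = np.stack([wishart_bartlett(scale, dof, rng) for _ in range(4000)])
```

A quick sanity check is that the average of many Wishart draws should be close to the degrees of freedom times the scale matrix.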
Various results are now described that were obtained using the method steps of one embodiment for multiple imputation on a data set which contains the unit-price and unit-sales tensor for a set of 19 products in 10 stores during a three-year period (August 2006 to August 2009). In summary, this tensor data set has the dimensions 19×10×156, and contains 28406 non-missing entries. In one embodiment, the method is used to either predict or impute the missing data values in this data set.
In general, the accuracy of the procedures for obtaining multiple imputation estimates of the missing values in a data set cannot be assessed in a straightforward way, since these imputed values cannot be compared with the true value, which by definition is missing and unknown. Therefore, in order to evaluate the accuracy, one approach is to set some of the non-missing values to be missing in some random fashion in the data set, and then carry out the multiple imputation procedures to obtain estimates for these pseudo “missing values”, which may be compared with the corresponding known values. In one embodiment, therefore, for illustrative purposes, some fraction of the non-missing elements in the tensor data set are also randomly designated as missing, even though the corresponding original values are known, and these pseudo “missing values” are estimated by the multiple imputation procedures; the comparison of the imputed value or values with the original value for these pseudo “missing values” provides a means for quantitatively evaluating the accuracy of the imputed values. For notational purposes, and in conformance with standard usage in statistical modeling procedures, the set of pseudo “missing values” is termed the test set (whose values are known but presumed to be missing), and the set of remaining non-missing values is termed the training set.
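The construction of the training and test sets described above can be sketched in Python as follows. The function name `hold_out`, the 20% hold-out fraction and the toy tensor size are assumptions made for illustration; the indicator arrays use 1 for membership in the respective set.

```python
import numpy as np

def hold_out(observed, fraction, rng):
    """Split the recorded entries into a training indicator and a test indicator."""
    idx = np.argwhere(observed == 1)                 # coordinates of recorded entries
    n_test = int(round(fraction * len(idx)))
    chosen = rng.choice(len(idx), size=n_test, replace=False)
    test = np.zeros_like(observed)
    test[tuple(idx[chosen].T)] = 1                   # pseudo "missing values"
    train = observed - test                          # remaining recorded entries
    return train, test

rng = np.random.default_rng(0)
observed = (rng.random((5, 4, 6)) < 0.8).astype(int)
train, test = hold_out(observed, fraction=0.2, rng=rng)
```

The imputation procedure is then run with `train` as the indicator of recorded entries, and the imputed values at the positions flagged in `test` are compared with the known originals.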
The multiple imputation approach can be used to obtain the point estimate of each missing value by simply averaging the corresponding imputed values across the multiple imputation data sets; furthermore, the estimated variance of this point estimate can also be obtained from these multiple imputed values, and can be used to obtain a confidence interval for the point estimate of the given missing value. A small estimated variance indicates that the model used for the multiple imputation procedure is quite effective in the imputation of the specific missing value. A large estimated variance, on the other hand, indicates that the model used for the multiple imputation procedure is not very effective in the imputation of the specific missing value. An important question that can be addressed using multiple imputation is whether the predictions with high confidence are in fact more accurate than the predictions with low confidence; this can be ascertained by computing the associated confidence values for each pseudo “missing value” entry. Therefore, the pseudo “missing values” are sorted based on the standard deviation of the point estimate computed from the multiple imputation results as described above. The sorted values are then divided into five separate partitions, each containing 20% of the test set values: the first partition contains the 20% of the entries with the lowest standard deviation (that is, the highest confidence) for the imputed values, and so on, with the last partition containing the 20% of the entries with the largest standard deviation for the imputed values. For each of these partitions, the root-mean-square error (RMSE) is obtained, which is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{x}_i\right)^2}$$
where $x_i$ and $\hat{x}_i$ are the actual value and the imputed value for the $i$th entry, respectively, and $n$ is the total number of entries in the given partition.
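By way of illustration only, the evaluation described above (point estimate, standard deviation, sorting by confidence, and RMSE for each partition) may be sketched as follows. The function names summarize, rmse and rmse_by_confidence and the list-based data layout are hypothetical; the sketch assumes at least two imputed values per entry so that a sample standard deviation is defined.

```python
import math
import statistics


def summarize(draws):
    """Point estimate (mean) and standard deviation of one entry's imputed values."""
    return statistics.fmean(draws), statistics.stdev(draws)


def rmse(actual, imputed):
    """Root-mean-square error between actual values and imputed point estimates."""
    n = len(actual)
    return math.sqrt(sum((x - xh) ** 2 for x, xh in zip(actual, imputed)) / n)


def rmse_by_confidence(actual, draws_per_entry, n_parts=5):
    """RMSE of each confidence partition, from lowest to highest standard deviation.

    actual[i] is the known value of test entry i, and draws_per_entry[i] holds
    its imputed values across the multiple imputation data sets.
    """
    if len(actual) < n_parts:
        raise ValueError("need at least one test entry per partition")
    stats = [summarize(draws) for draws in draws_per_entry]
    order = sorted(range(len(actual)), key=lambda i: stats[i][1])
    results = []
    for part in range(n_parts):
        lo = part * len(order) // n_parts
        hi = (part + 1) * len(order) // n_parts
        idx = order[lo:hi]
        results.append(rmse([actual[i] for i in idx],
                            [stats[i][0] for i in idx]))
    return results
```

If the high-confidence predictions are in fact more accurate, the returned RMSE values would be expected to increase from the first partition to the last.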
Therefore, the results from the multiple imputation can be used to provide an indication of the accuracy of the imputed values in the resulting data sets, by obtaining the corresponding confidence values, or equivalently, by evaluating the variance of these values across the resulting multiple imputation data sets. This result provides one justification for obtaining multiple imputation data sets, since they also provide information on the associated accuracy of the imputed values, which may not be available from just a single imputation data set containing the point estimates. This also confirms the utility of having multiple complete imputed data sets for the subsequent statistical modeling to be performed, which as a result will provide models that reflect the true variability of the missing values that might be encountered in a hypothetical complete data set in which these values had not been missing.
In principle, it is clear that the confidence score described above (which, to reiterate, is equivalent to the corresponding standard deviation of the samples drawn from the posterior distribution in the BPTF procedure) can be provided even in the case when the sample values are averaged to obtain the point estimate. However, when provided in this form, these confidence scores cannot be directly used in any subsequent statistical modeling and analysis, whereas the multiple imputation data sets can always be used individually for any subsequent statistical modeling and analysis. Subsequently, the respective individual results from the statistical modeling and analysis on the multiple data sets can be finally averaged, so that in this way, the intrinsic variability of the estimates for the missing data values that is provided by the multiple imputation procedure can be suitably incorporated into the subsequent statistical modeling and analysis.
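By way of illustration only, the averaging of individual analysis results over the multiple imputation data sets may be sketched as follows. The analysis shown (a least-squares slope of unit sales on unit price) and the names ols_slope and pooled_estimate are hypothetical stand-ins for whatever subsequent statistical modeling is performed; the spread returned alongside the average is simply the between-data-set standard deviation, which is an assumption of this sketch rather than a pooling rule stated in the disclosure.

```python
import statistics


def ols_slope(data_set):
    """Least-squares slope of y on x for one completed data set of (x, y) pairs."""
    xs = [x for x, _ in data_set]
    ys = [y for _, y in data_set]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def pooled_estimate(completed_data_sets, analysis):
    """Run the analysis on each imputed data set and average the results.

    Returns the averaged estimate and the standard deviation of the
    individual estimates across the imputed data sets.
    """
    estimates = [analysis(data_set) for data_set in completed_data_sets]
    return statistics.fmean(estimates), statistics.stdev(estimates)
```

For example, pooled_estimate(data_sets, ols_slope) would apply the same regression to each completed data set in data_sets and report the averaged slope together with its variability across the imputations.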
Via the system and method described herein, much greater accuracy and statistical reliability are obtained by simultaneously considering the multi-dimensional dependencies and correlations present in the retail data set.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a system, apparatus, or device running an instruction.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While the disclosure has been described in terms of specific embodiments, it is evident in view of the foregoing description that numerous alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the disclosure is intended to encompass all such alternatives, modifications and variations which fall within the scope and spirit of the disclosure and the following claims.
Entry |
---|
Schafer, J. L., & Olsen, M. K., “Multiple imputation for multivariate missing-data problems: A data analyst's perspective”, Multivariate behavioral research, The Pennsylvania State University, Mar. 9, 1998, pp. 1-42. |
Mayfield, C. et al., “A Statistical Method for Integrated Data Cleaning and Imputation”, Purdue University, Computer Science Technical Reports, 09-008, 2009, pp. 1-14. |
Acock, A., “Working With Missing Values”, J. Marriage and Family, vol. 67, Nov. 2005, pp. 1012-1028. |
Salakhutdinov et al., “Restricted Boltzmann Machines for Collaborative Filtering”, Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. |
Salakhutdinov et al., “Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo”, Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. |
Buchanan et al., “Damped Newton Algorithms for Matrix Factorization with Missing Data”, Department of Engineering Science, Oxford University, UK, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, 2005. |
Kolda et al., “Tensor Decompositions and Applications”, SIAM Review, Jun. 10, 2008, pp. 1-47. |
Chi et al., “Probabilistic Polyadic Factorization and Its Application to Personalized Recommendation”, CIKM'08, Oct. 26-30, 2008, Napa Valley, California, USA, pp. 941-950. |
Chu et al., “Probabilistic Models for Incomplete Multi-dimensional Arrays”, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA, vol. 5 of JMLR: W&CP 5, 2009, pp. 89-96. |
Xiong et al., “Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization”, Machine Learning Department, Carnegie Mellon University; Robotics Institute, Carnegie Mellon University; Language Technology Institute, Carnegie Mellon University, 2010. |
Su et al., “Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box”, Journal of Statistical Software, http://www.jstatsoft.org, 2010. |
Andrieu et al., “An introduction to MCMC for machine learning”, Machine Learning, vol. 50, 2003, pp. 5-43, Kluwer Academic Publishers. |
Smith et al., “Algorithm AS 53: Wishart Variate Generator”, Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 21, No. 3, 1972, pp. 341-345. |
Schafer, “Analysis of Incomplete Multivariate Data,” Chapman and Hall, London (1997). |
Little et al., “Statistical Analysis with Missing Data,” 2nd Edition, Wiley and Sons, 2002. |
Box et al., “A Note on the Generation of Random Normal Deviates”, The Annals of Mathematical Statistics, vol. 29, No. 2, Jan. 31, 1958. |
Number | Date | Country | |
---|---|---|---|
20130036082 A1 | Feb 2013 | US |