Varying-coefficient regression models often yield superior fits to empirical data by allowing parameters to vary as functions of some environmental variables. Very often in varying-coefficient models, the coefficients have an unknown functional form which is estimated nonparametrically. However, such varying-coefficient models with a large number of mixed-type varying-coefficient variables tend to be challenging for conventional nonparametric smoothing methods.
In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other implementations may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims. It is to be understood that features of the various embodiments described herein may be combined with each other, unless specifically noted otherwise.
Estimating the aggregated market demand for a product in a dynamic market is intrinsically important to manufacturers and retailers. The historical practice of using business expertise to make decisions is subjective, irreproducible and difficult to scale up to a large number of products. The disclosed systems and methods provide a scientifically sound approach to accurately price a large number of products while offering a reproducible and real-time solution.
Further input to the pricing module 12 is provided by a modeling module 100. The modeling module 100 receives historical market data 14, for example, and uses the market data 14 to calculate prediction models for the pricing module 12. In some implementations, an estimate of the aggregated market demand is used by the pricing module 12 in determining product pricing 30. Thus, in the illustrated example system 10, the modeling module 100 is configured to calculate a demand prediction model that quantifies product demand under different price points for each product based on the historical market data 14.
The various functions, processes, methods, and operations performed or executed by the system 10 and modeling module 100 can be implemented as the program instructions 122 (also referred to as software or simply programs) that are executable by the processor 112 and various types of computer processors, controllers, central processing units, microprocessors, digital signal processors, state machines, programmable logic arrays, and the like. In some implementations, the computer system 110 may be networked (using wired or wireless networks) with other computer systems, and the various components of the system 110 may be local to the processor 112 or coupled thereto via a network.
In various implementations the program instructions 122 may be stored in the memory 120 or any non-transient computer-readable medium for use by or in connection with any computer-related system or method. A computer-readable medium can be an electronic, magnetic, optical, or other physical device or means that can contain or store a computer program for use by or in connection with a computer-related system, method, process, or procedure. Programs can be embodied in a computer-readable medium for use by or in connection with an instruction execution system, device, component, element, or apparatus, such as a system based on a computer or processor, or other system that can fetch instructions from an instruction memory or storage of any appropriate type. A computer-readable medium can be any structure, device, component, product, or other means that can store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
In certain implementations, the modeling module 100 is configured to model demand as a function of price (e.g., linear regression), but allow the model parameters to vary with product features and other variables. Varying-coefficient regression models often yield superior fits to empirical data by allowing parameters to vary as functions of some environmental variables. Very often in varying-coefficient models, the coefficients have an unknown functional form which is estimated nonparametrically.
In systems where the modeling module 100 is configured to predict demand, there can be many varying-coefficient variables with mixed types. Specifically, in predicting product demand, the variables can include various product features and environmental variables like time and location. The regression coefficients are thus functions of high-dimensional covariates, which need to be estimated based on data. Here, the interaction among product features is complex. It is unrealistic to assume that their effects are additive, and it is difficult to specify a functional form that characterizes their joint effects on the regression parameters. Given these practical constraints, the modeling module 100 is configured to provide a data-driven approach for estimating high-dimensional non-additive functions.
Classification and regression trees (“CART”) refers to a tree-based modeling approach used for high-dimensional classification and regression. Such tree-based methods handle the high-dimensional prediction problems in a scalable way and incorporate complex interactions. Single-tree based learning methods, however, tend to be unstable, and a small perturbation to the data may lead to a dramatically changed model.
In terms of the pricing system example illustrated in
In certain implementations, the particular partitioning or splitting of the parent data 300 based on the partition variable is determined by evaluating several possible data splits. In
In
Referring back to
Additional aspects of the disclosed systems and methods are described in further detail as follows. For example, let y be the response variable 202 and let x∈R^p denote the vector of predictors 204, and assume that a parametric relationship is available between y and x for any given value of the varying-coefficient, or partition, variable vector s∈R^q, where p and q are the numbers of predictor variables and partition variables, respectively. The regression relationship between y and x varies under different values of s. The idea of partitioning the space of the varying-coefficient, or partition, variables s and then imposing a parametric form familiar to the subject matter area within each partition conforms with the general notion of conditioning on the partition variables s. Let (s_i′, x_i′, y_i) denote the measurements on subject i, where i=1, . . . , n and n is the number of subjects. Here, the partition variable is s_i=(s_{i1}, s_{i2}, . . . , s_{iq})′ and the regression variable is x_i=(x_{i1}, x_{i2}, . . . , x_{ip})′, and overlap is allowed between the two sets of variables. The varying-coefficient linear model specifies that
$$y_i = f(x_i, s_i) + \varepsilon_i = x_i'\beta(s_i) + \varepsilon_i, \qquad (1)$$
where the regression coefficients β(s_i) are modeled as functions of s_i.
In model (1), the key interest is to estimate the multivariate coefficient surface β(s_i). The disclosed estimation method allows for a high-dimensional varying-coefficient vector s_i. Examples of the tree-based method approximate β(s_i) by a piecewise constant function. An example of the proposed tree-based varying-coefficient model is

$$y_i = \sum_{m=1}^{M} \pi_m(s_i)\, x_i'\beta_m + \varepsilon_i, \qquad (2)$$

where π_m(s_i)∈{0, 1} with Σ_{m=1}^{M} π_m(s)=1 for any s∈R^q. The error terms ε_i are assumed to have zero mean and homogeneous variance σ². The disclosed method can be readily generalized to models with heterogeneous errors. The M-dimensional vector of weights π(s)=(π_1(s), π_2(s), . . . , π_M(s)) is regarded as a mapping from s∈R^q to the collection of M-tuples

$$\left\{ (\pi_1, \ldots, \pi_M) : \pi_m \in \{0, 1\},\ \sum_{m=1}^{M} \pi_m = 1 \right\}. \qquad (3)$$
The partitioned regression model (2) can be treated as an extension of regression trees: it reduces to an ordinary regression tree when the regression vector x_i includes only the constant 1.
The collection of binary variables π_m(s) defines a partition of the space R^q: C_m = {s | π_m(s) = 1}, and the constraints in (3) are equivalent to C_m ∩ C_{m′} = ∅ for any m ≠ m′ and ∪_{m=1}^{M} C_m = R^q. Hence the partitioned regression model (2) can be reformulated as

$$y_i = \sum_{m=1}^{M} I(s_i \in C_m)\, x_i'\beta_m + \varepsilon_i, \qquad (4)$$
where I(·) denotes the indicator function, with I(c)=1 if event c is true and zero otherwise. The implied varying-coefficient function is thus

$$\beta(s) = \sum_{m=1}^{M} \beta_m I(s \in C_m),$$

a piecewise constant function in R^q. In the terminology of recursive partitioning, the set C_m is a child data node referred to as a terminal node or leaf node, which defines the ultimate grouping of the observations (for example, the first and second child nodes 301, 302).
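To illustrate the piecewise-constant coefficient surface, the following Python sketch (illustrative only; the node representation, function names, and example values are assumptions rather than part of the disclosed system) evaluates β(s) by returning the coefficient vector of the terminal node that contains s:

```python
import numpy as np

def coefficient_surface(s, nodes):
    """Evaluate the piecewise-constant varying-coefficient function beta(s).

    `nodes` is a list of (membership_fn, beta_m) pairs, where membership_fn(s)
    returns True when s falls in terminal node C_m.  Exactly one node should
    claim each s, mirroring the constraints on pi_m(s) in (3).
    """
    for in_node, beta_m in nodes:
        if in_node(s):
            return beta_m
    raise ValueError("s is not covered by any terminal node")

# Example with two terminal nodes split on the first partition variable:
nodes = [
    (lambda s: s[0] <= 0.5, np.array([1.0, -2.0])),   # C_1: intercept 1, slope -2
    (lambda s: s[0] > 0.5,  np.array([0.5, -1.0])),   # C_2: intercept 0.5, slope -1
]

x = np.array([1.0, 3.0])                     # regression vector (includes the constant 1)
s = np.array([0.7])                          # partition variable
y_hat = x @ coefficient_surface(s, nodes)    # prediction x' beta(s)
```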
Before addressing the determination of M, the estimation of the partition and the regression coefficients is considered. The usual least squares criterion for (4) leads to the following estimators of (C_m, β_m), defined as minimizers of the sum of squared errors (SSE):

$$\left(\hat{C}_m, \hat{\beta}_m\right)_{m=1}^{M} = \arg\min \sum_{m=1}^{M} \sum_{i:\, s_i \in C_m} \left( y_i - x_i'\beta_m \right)^2. \qquad (5)$$
In the above, the estimation of β_m is nested in that of the partitions: β̂_m(C_m) is a consistent estimator of β_m given the partitions. The estimator could be a least squares estimator, a maximum likelihood estimator, or an estimator defined by estimating equations. The following least squares estimator is an example:

$$\hat{\beta}_m(C_m) = \left( \sum_{i:\, s_i \in C_m} x_i x_i' \right)^{-1} \sum_{i:\, s_i \in C_m} x_i y_i, \qquad (6)$$
in which the minimization criterion is essentially based on the observations in node C_m only. Thus, the regression parameters β_m are "profiled" out to give

$$SSE(C_1, \ldots, C_M) = \sum_{m=1}^{M} \sum_{i:\, s_i \in C_m} \left( y_i - x_i'\hat{\beta}_m(C_m) \right)^2. \qquad (7)$$
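A minimal numpy sketch of the within-node least squares fit in (6) and the profiled SSE in (7) is given below; the function names and the boolean-mask node representation are illustrative assumptions rather than the disclosed implementation:

```python
import numpy as np

def node_fit(X, y, idx):
    """Least squares fit of beta_m using only observations in node C_m.

    `idx` is a boolean mask selecting the observations with s_i in C_m.
    Returns the estimated coefficients and the node's contribution to SSE.
    """
    X_m, y_m = X[idx], y[idx]
    # lstsq is preferred over an explicit matrix inverse for numerical stability
    beta_m, *_ = np.linalg.lstsq(X_m, y_m, rcond=None)
    residuals = y_m - X_m @ beta_m
    return beta_m, float(residuals @ residuals)

def total_sse(X, y, node_masks):
    """Profiled SSE in (7): sum of within-node SSEs over all terminal nodes."""
    return sum(node_fit(X, y, idx)[1] for idx in node_masks)
```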
By definition, the sets Ĉ_m comprise an optimal partition of the space spanned by the partitioning variables s, where the "optimality" is with respect to the least squares criterion. The search for the optimal partition is of combinatorial complexity, and finding the globally optimal partition is challenging even for a moderate-sized dataset. The tree-based algorithm is an approximate solution to the optimal partitioning problem and is scalable to large-scale datasets. For simplicity, the present disclosure focuses on implementations having binary trees that employ "horizontal" or "vertical" partitions of the feature space and are stage-wise optimal. As noted above, alternative implementations are envisioned where data are partitioned into more than two child nodes.
An example tree-growing process, referred to herein as the iterative “Part Reg” process, adopts a breadth-first search and is disclosed in the following pseudo code.
Require: n0—the minimum number of observations in a terminal node, and M—the desired number of terminal nodes.
1. Initialize the current number of terminal nodes l=1 and the root node C_1=R^q.
2. While l<M, loop:
 (a) For each current terminal node C_m, m=1, . . . , l, and each partition variable j=1, . . . , q, find the candidate binary split of C_m on variable j into child nodes C_{m,L} and C_{m,R}, each containing at least n0 observations, that maximizes the reduction in SSE,
  ΔSSE_{m,j} = max{SSE(C_m) − SSE(C_{m,L}) − SSE(C_{m,R})}.
 (b) Split the terminal node attaining the largest ΔSSE_{m,j} over all (m, j) into its two child nodes, and set l=l+1.
The breadth-first search cycles through all terminal nodes at each step to find the optimal split, and stops when the number of terminal nodes reaches the desired value M. The reduction of SSE is used as the criterion to decide which variable to split on. For a single tree, the stopping criterion is either that the size of a resulting child node would fall below the threshold n0 or that the number of terminal nodes has reached M. The minimum node size n0 needs to be specified with respect to the complexity of the regression model, and should be large enough to ensure that the regression function in each node is estimable with high probability. The number of terminal nodes M, which is a measure of model complexity, controls the "bias-variance tradeoff." A simplified sketch of this growing loop is given below.
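The following sketch illustrates the breadth-first greedy growing loop. It is a simplified illustration under stated assumptions, not the disclosed implementation: it handles only ordered or continuous partition variables, represents nodes as boolean masks, and omits the categorical-splitting and quantile-thresholding refinements described below.

```python
import numpy as np

def node_sse(X, y, idx):
    """SSE of the least squares fit using only the observations in `idx`."""
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    r = y[idx] - X[idx] @ beta
    return float(r @ r)

def grow_tree(X, y, S, M, n0):
    """Greedy breadth-first growth of the partition, as in the loop above.

    X : (n, p) regression variables; S : (n, q) ordered partition variables.
    Returns a list of boolean masks, one per terminal node.
    """
    nodes = [np.ones(len(y), dtype=bool)]            # C_1 = R^q
    while len(nodes) < M:
        best = None                                  # (delta_sse, node index, left, right)
        for m, idx in enumerate(nodes):
            sse_parent = node_sse(X, y, idx)
            for j in range(S.shape[1]):
                # candidate cuts at the distinct values (excluding the maximum)
                for cut in np.unique(S[idx, j])[:-1]:
                    left = idx & (S[:, j] <= cut)
                    right = idx & (S[:, j] > cut)
                    if left.sum() < n0 or right.sum() < n0:
                        continue
                    delta = sse_parent - node_sse(X, y, left) - node_sse(X, y, right)
                    if best is None or delta > best[0]:
                        best = (delta, m, left, right)
        if best is None:                             # no admissible split remains
            break
        _, m, left, right = best
        nodes[m:m + 1] = [left, right]               # replace the parent with its children
    return nodes
```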
In the example tree-growing process disclosed above, the modeling module 100 is configured to cycle through the partition variables at each iteration and consider all possible binary splits based on each variable. The candidate splits depend on the type of the variable. For an ordered or a continuous variable, the distinct values of the variable are sorted, and "cuts" are placed between any two adjacent values to form partitions. Hence for an ordered variable with L distinct values, there are L−1 possible splits, which can be huge for a continuous variable in large-scale data. Thus a threshold Lcont (500, for instance) is specified, and only splits at the Lcont equally spaced quantiles of the variable are considered if the number of distinct values exceeds Lcont+1. An alternative way of speeding up the calculation is to use an updating algorithm that "updates" the regression coefficients as the split point is changed, which is computationally more efficient than recalculating the regression every time. The example disclosed above adopts the former approach for its algorithmic simplicity, as sketched below.
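The candidate-cut generation for an ordered or continuous partition variable, including the Lcont quantile thresholding, might be sketched as follows (function and parameter names are illustrative assumptions):

```python
import numpy as np

def candidate_cuts(values, L_cont=500):
    """Candidate split points for an ordered or continuous partition variable.

    If the variable has at most L_cont + 1 distinct values, a cut is placed
    between every pair of adjacent distinct values; otherwise only the
    L_cont equally spaced quantiles are used, as described above.
    """
    distinct = np.unique(values)
    if len(distinct) <= L_cont + 1:
        return (distinct[:-1] + distinct[1:]) / 2.0   # midpoints between adjacent values
    probs = np.linspace(0, 1, L_cont + 2)[1:-1]       # interior quantile levels
    return np.unique(np.quantile(values, probs))
```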
Three examples of methods for splitting data based on a categorical partition variable, such as illustrated in block 208, are as follows.
1. Exhaustive search. All possible partitions of the factor levels into two disjoint sets are considered. For a categorical variable with L categories, an exhaustive procedure will attempt 2^{L−1}−1 possible splits.
2. Category ordering. The exhaustive search is computationally intensive for a categorical variable with a large number of categories, so the categories are ordered to alleviate the computational burden. In the partitioned regression context, let β̂_l denote the least squares estimate of β based on the observations in the l-th category, so that the fitted model in the l-th category is x′β̂_l. A strict ordering of the x′β̂_l as functions of x may not exist, thus an approximate solution is used in some implementations: the L categories are ordered using the fitted models x′β̂_l evaluated at a representative value of x, such as the sample mean.
3. Gradient descent. The idea of ordering the categories ignores any partitions that do not conform with the current ordering, and is not guaranteed to reach a stage-wise optimal partition. A third process starts with a random partition of the L categories into two nonempty and non-overlapping groups, then cycles through all the categories and flips the group membership of each category. The L group assignments resulting from flipping each individual category are compared in terms of the reduction in SSE. The grouping that maximizes the reduction in SSE is chosen as the current assignment, and iteration continues until the algorithm converges, as sketched below. This algorithm performs a gradient descent on the space of possible assignments, where any two assignments are considered adjacent or reachable if they differ by only one category. The gradient descent algorithm is guaranteed to converge to a local optimum, thus multiple random starting points can be chosen in the hope of reaching the global optimum. If the criterion is locally convex near the initial assignment, then this search algorithm has polynomial complexity in the number of categories.
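A sketch of the gradient-descent (category-flipping) search follows. It is illustrative only: the dictionary-based group assignment and function names are assumptions, and a production implementation would reuse the node fitting routines above.

```python
import numpy as np

def node_sse(X, y, idx):
    """SSE of the least squares fit on the observations selected by `idx`."""
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    r = y[idx] - X[idx] @ beta
    return float(r @ r)

def flip_search(X, y, cats, seed=None):
    """Gradient-descent search over two-group assignments of the categories.

    `cats` holds the category label of each observation.  Starting from a
    random nonempty split, all single-category flips of the current assignment
    are compared, the flip with the largest SSE reduction is adopted, and the
    process repeats until no flip improves the criterion (a local optimum).
    """
    rng = np.random.default_rng(seed)
    levels = np.unique(cats)
    in_left = dict(zip(levels, rng.integers(0, 2, len(levels)).astype(bool)))
    if all(in_left.values()) or not any(in_left.values()):
        in_left[levels[0]] = not in_left[levels[0]]   # ensure both groups are nonempty

    def split_sse(assign):
        left = np.array([assign[c] for c in cats])
        if left.all() or (~left).all():
            return np.inf                             # degenerate split: reject
        return node_sse(X, y, left) + node_sse(X, y, ~left)

    current = split_sse(in_left)
    while True:
        trials = []
        for lev in levels:                            # evaluate all L single-category flips
            trial = dict(in_left)
            trial[lev] = not trial[lev]
            trials.append((split_sse(trial), trial))
        best_sse, best_assign = min(trials, key=lambda t: t[0])
        if best_sse >= current:                       # no flip reduces the SSE: local optimum
            break
        in_left, current = best_assign, best_sse
    return in_left, current
```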
Two strategies are used in certain implementations: a default algorithm that combines the exhaustive search, gradient descent, and category ordering, and an ordering approach that always orders the categories.
Default. In the default tree-growing algorithm, a lower and an upper bound on the number of categories are specified, namely Lmin and Lmax. When the number of categories is less than or equal to the lower bound, an exhaustive search is performed; when Lmin<L≦Lmax, gradient descent is performed with a random starting point; and when the number of categories exceeds Lmax, the categories are ordered and the variable is treated as ordinal. Example implementations use this tree-growing algorithm with Lmin=5 and Lmax=40; the dispatch logic is sketched after this list.
Ordering. In the ordering approach, the categorical variable is ordered irrespective of the number of categories (i.e., Lmax=2). The ordering approach is much faster than the default algorithm.
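The dispatch between the three categorical splitting methods under the default strategy can be expressed compactly. The thresholds shown follow the example values Lmin=5 and Lmax=40 above; the function name is an illustrative assumption.

```python
def categorical_split_strategy(num_levels, L_min=5, L_max=40):
    """Choose the categorical splitting method used by the default algorithm.

    Exhaustive search for few categories, gradient descent (flip search) for a
    moderate number, and category ordering when there are too many levels.
    """
    if num_levels <= L_min:
        return "exhaustive"
    if num_levels <= L_max:
        return "gradient_descent"
    return "ordering"

# Example: a 12-level categorical variable would use the flip search
assert categorical_split_strategy(12) == "gradient_descent"
```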
At every stage of the tree-growing process, the algorithm cycles through the partition variables to find the optimal splitting variable (block 206).
Choice of tuning parameters. The proposed iterative "Part Reg" process disclosed above involves two tuning parameters: the minimum node size n0 and the number of final partitions M. In theory, one can start with a candidate set of values for the two tuning parameters (n0, M) and then use K-fold cross-validation to choose the best tuning parameters. Here, the number of combinations might be large, which adds to the computational complexity. Example implementations fix the minimum node size at some reasonable value depending on the application and sample size, and then choose the number of terminal nodes by the risk measure on a test sample. Let (s_i′, x_i′, y_i), i=n+1, . . . , N denote the observations in the test data, let (β̂_m, Ĉ_m) denote the estimated regression coefficients and partitions from the training sample, and let ℳ denote the set of tree sizes that are searched through; then M is chosen by minimizing the out-of-sample least squares,

$$\hat{M} = \arg\min_{M \in \mathcal{M}} \sum_{i=n+1}^{N} \left( y_i - \sum_{m=1}^{M} x_i'\hat{\beta}_m I\!\left(s_i \in \hat{C}_m\right) \right)^2. \qquad (8)$$
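Selecting M by the out-of-sample criterion (8) might be sketched as follows; fit_tree and predict are placeholders for the tree-growing and prediction routines sketched earlier, not functions of any particular library.

```python
import numpy as np

def choose_M(candidate_Ms, fit_tree, predict,
             X_train, y_train, S_train, X_test, y_test, S_test):
    """Choose the number of terminal nodes M by out-of-sample least squares.

    `fit_tree(X, y, S, M)` returns a fitted partition/coefficient model and
    `predict(model, X, S)` returns x_i' beta_hat(s_i) for each test row.
    """
    risks = {}
    for M in candidate_Ms:
        model = fit_tree(X_train, y_train, S_train, M)
        resid = y_test - predict(model, X_test, S_test)
        risks[M] = float(resid @ resid)          # out-of-sample squared error loss
    return min(risks, key=risks.get), risks
```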
As noted above, the varying-coefficient linear model is used in predicting demand in certain implementations of the system 10. In one example implementation, sales units and log-transformed sales units are plotted against price, and the log-transformed demand is modeled as
$$\log(y_i) = \beta_0(s_i) + \beta_1(s_i)\, x_i + \varepsilon_i, \qquad (9)$$
which is estimated via the tree-based method. The minimum node size in the tree model is fixed at n0=10. The tuning parameter M is chosen by minimizing the squared error loss on a test sample, and the L2 risk is evaluated on both the training and test samples.
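For concreteness, the data preparation for the log-demand model (9) might look like the following; the numeric values are made up for illustration and are not taken from the historical market data 14.

```python
import numpy as np

# Illustrative setup for the log-demand model (9): the regression vector is
# x_i = (1, price_i)', the response is log(units_i), and s_i would collect the
# product features and environmental variables used as partition variables.
price = np.array([9.99, 12.49, 10.99, 8.49])
units = np.array([120.0, 80.0, 95.0, 150.0])

X = np.column_stack([np.ones_like(price), price])   # (1, price) regression variables
y = np.log(units)                                   # log-transformed sales units
# S would hold the partition variables (features, time, location) for each row.
```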
The disclosed methods and systems primarily focus on varying-coefficient linear regression estimated with a least squares criterion. However, the methodology readily generalizes to nonlinear and generalized linear models with a wide range of loss functions. More robust loss functions, or likelihood-based criteria for non-Gaussian data, are also appropriate, as illustrated in the sketch below.
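As one illustration of swapping the least squares criterion for a more robust loss, the within-node fit can minimize an arbitrary loss function. The sketch below uses scipy's general-purpose optimizer and a Huber loss; it is an assumption about one possible generalization, not the disclosed method.

```python
import numpy as np
from scipy.optimize import minimize

def node_fit_generic(X, y, idx, loss):
    """Fit the node coefficients by minimizing an arbitrary loss (e.g., a robust
    or likelihood-based criterion) in place of the least squares SSE."""
    def objective(beta):
        return float(np.sum(loss(y[idx] - X[idx] @ beta)))
    beta0 = np.zeros(X.shape[1])
    result = minimize(objective, beta0, method="BFGS")
    return result.x, result.fun

# Huber loss as one robust alternative to squared error
def huber(r, c=1.345):
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * (np.abs(r) - 0.5 * c))
```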
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.