This application is a non-provisional of U.S. Application No. 62/556,133, entitled “LEARNING ORDINAL REGRESSION MODEL VIA DIVIDE-AND-CONQUER TECHNIQUE” filed Sep. 8, 2017, of which the full disclosure is incorporated herein by reference for all purposes.
As users increasingly access content electronically and conduct transactions electronically over the Internet, content providers are presented with the problem of processing extremely large amounts of user data in an efficient and intelligent manner to improve the way in which content is delivered to these users. Processing and analyzing user data is critical for training models to predict user behavior using limited amounts of historical data as input. Many content providers specifically select content for certain pages or other interfaces to be displayed to particular users. For example, a user might search for information about a keyword through a search engine. When a results page is returned to the user that includes search results relating to that keyword, content that may be of interest to the user and relevant to the search can be included with the results page that relates to the keyword and/or search results. Often, the content includes a hypertext link or other user-selectable element that enables the user to navigate to another page or display relating to the advertisement.
In conventional approaches, large amounts of data may be stored and analyzed using a single computer equipped with sufficient processing power, which can be costly, inefficient, or inaccurate because of biases trained into the model. Other conventional approaches may divide the data into more digestible blocks for training individual models, and then the individual models may be averaged or merged. However, averaging or merging individual models often introduces unnecessary variance and bias, which also results in inaccurate models of user behavior. For example, a user may be associated with a category when the user visits a page, performs a search, or views content associated with that category. For example, a user viewing a page of content relating to cameras may be associated with a camera category and thus may receive advertising relating to cameras. However, the user may have been looking for something only tangentially related to cameras, or might have only visited a camera page once for a particular reason. Thus, conventional approaches do not optimally reflect the interests of various users, and do not allow advertisers to easily determine the appropriate users, or categories of users, to target.
Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
Systems and methods in accordance with various embodiments of the present disclosure may overcome one or more of the aforementioned and other deficiencies experienced in conventional approaches to processing and analyzing large amounts of data, and to training models that predict behavior using a limited amount of data as input. In particular, various embodiments provide ordinal regression models to establish functional relationships between predictors and ordinal outcomes, that is, outcomes which are categorical and have a ranked order. Ordinal regression models use a form of regression analysis to predict an ordinal variable, which is a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. For example, ordinal regression may be used to model human levels of preference or ratings on a scale, e.g., a scale of 1-5, with 1 being “poor” and 5 being “excellent.” Ordinal regression models may also be used in information retrieval to classify or rank information. Despite the broad applicability of such models, developing efficient techniques to train them can be difficult, costly, and time-consuming, and the resulting models can be inaccurate. According to an embodiment, training an ordinal regression model may be reduced to a binary classification problem, which facilitates the usage of readily available, powerful binary classifiers. Embodiments of the present disclosure provide a systematic reduction technique and improve the structure and properties of the ordinal model trained from the binary data.
However, the reduction procedure necessitates an expansion of the original training data, where the training data increases to K−1 times its original size, with K being the number of ordinal outcomes. In the era of big data, where training sets are usually large scale in nature, such expansion can introduce computational challenges and may even make it infeasible to train the model on a single machine. Embodiments of the present disclosure provide a divide-and-conquer (DC) algorithm. The DC algorithm of the present invention, in an embodiment, distributes the expanded data across a cluster of machines, trains logistic classification models in parallel, and then combines them at the end of the training phase to create a single ordinal model. The training scheme removes the need for synchronization between the parallel learning algorithms during the training period, making training on large datasets technically feasible without the use of supercomputers or computers with specific processing capabilities. Other advantages include improvements in cost reduction, efficiency, and accuracy. Embodiments of the present invention establish the consistency and asymptotic normality of the model learned using the DC algorithm, and provide improved estimation and prediction performance compared to existing techniques for training models with large datasets.
According to various embodiments of the invention, the service provider system 102 may interface with the network 120 through a data interface 104. The data interface 104 may be in communication with an external entity that collects raw ordinal data associated with a plurality of users. The data interface 104 may initially process the large amounts of raw ordinal data to divide them into data blocks, which can be referenced, stored, and/or indexed in data blocks database 116. In some embodiments the raw ordinal data may be collected by the service provider system 102 and processed by the data interface 104. The service provider system 102 may include a processor 110 and a memory storing executable instructions that perform specific functions. For example, the service provider system may include a coefficient module 106 that is configured with executable code to compute regression coefficients and de-biased coefficient vectors for each data block identified in data blocks database 116 trained on each machine 130A, 130B, and 130C. The coefficients and coefficient vectors may be stored in coefficient data 114 for the variance module 108 to use to calculate a robust inverse variance for each data block identified in data blocks database 116 trained on each machine 130A, 130B, and 130C. The service provider system 102 may then include a model summation module 112 that utilizes the regression coefficients, inverse variances, and de-biased coefficient vectors for each data block identified in data blocks database 116 trained on each machine 130A, 130B, and 130C to create a single model by summing them in a weighted fashion according to embodiments of the present invention.
According to various embodiments of the invention, the service provider system 102 may include machines 130A, 130B, and 130C. The service provider system 102 may also include modules that are enabled to collect and aggregate raw ordinal data associated with a plurality of users. In some embodiments, each machine 130A, 130B, and 130C may have a corresponding coefficient module, variance module, and coefficient data database. The service provider system 102 may communicate with the machines 130A, 130B, and 130C to transmit data blocks that have been divided by the data interface 104. Machines 130A, 130B, and 130C may then individually, separately, and independently generate a model that is trained on their corresponding data block. For example, each machine 130A, 130B, and 130C may calculate, for its corresponding data block, the regression coefficient, the inverse variance, and the de-biased coefficient vector. The machines 130A, 130B, and 130C may then transmit the generated models back to the service provider system 102 for a weighted summation of the models by the model summation module 112. In some embodiments, the service provider system 102, or another third party entity, may also have a user information database, which may be cross-referenced with raw ordinal data that is divided into data blocks referenced, stored, indexed, or identified in the data blocks database 116. The models generated by each machine 130A, 130B, and 130C training on each data block and summed by the model summation module 112 may be used to create predictive models of user behavior, advertising campaigns, marketing campaigns, media content trends, etc.
The ordinal user data may be collected, aggregated, and analyzed to establish functional relationships between predictors and ordinal outcomes. To illustrate, ordinal models may be trained to determine the relationship between events, such as a view of an advertisement, a click on the advertisement, a view of a product featured in the advertisement, and/or a purchase transaction for that product. In this example, depending on the advertising campaign, the advertising service may be getting paid to show impressions of ads, so its goal may be to increase the presence of its ads. Alternatively, the goals of the ad campaign may be to increase the number of clicks, click-throughs (purchases made through clicks on the ad), or purchases outside of the clicks. Other goals may include driving purchases of very specific products, specific vendors, etc. As such, the application for which the ordinal data is being analyzed determines the structure of the predictions to which the ordinal model is applied. In existing technology, ordinary binary classification predicts only two categories; embodiments of the present invention improve on such classification because outcomes can be ranked and distinguished into more than two categories. For example, a purchase of a product is better than a click on an advertisement, and a click is better than just a view of the advertisement.
However, in processing and analyzing large amounts of data, many technical obstacles arise, such as limitations in memory or storage, processing capabilities, accuracy of the data analysis, and efficiency/speed of the analysis. As such, according to an embodiment of the present invention, at 204, the large set of data may be divided into M portions, where each portion is processed and analyzed in parallel. The advantages of dividing the large set of data into portions are two-fold: first, each portion is a more manageable size to process and analyze, and second, parallel processing reduces the overall processing time for the entire set of data. Each portion may be transmitted to a separate computing resource in a fleet or plurality of computing resources for processing. In some embodiments, the set of data may be divided equally or based on size for particular computing resources and their processing capability or availability. The computing resources may include machines, servers, cloud resources, virtual resources, or any other suitable computing or processing resource, etc. In various embodiments, the computing resources may be in communication with and in synchronization with each other. In other embodiments, the computing resources may be running separately and independently from each other.
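For illustration, the dividing step may be sketched in Python as follows; the randomly generated arrays below are placeholders rather than actual user data, and the portion count M is merely an example.

```python
import numpy as np

# Placeholder data standing in for the full set of raw ordinal data.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(1000, 5))        # 1,000 instances, 5 features
y_full = rng.integers(1, 6, size=1000)     # ordinal labels in {1, ..., 5}

M = 20                                     # number of computing resources (example)
portions = list(zip(np.array_split(X_full, M),
                    np.array_split(y_full, M)))
# Each (X_m, y_m) portion can now be sent to a separate computing resource.
```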
At 206, for each portion of data, the computing system may calculate an estimate of regression coefficients. A regression coefficient is a value that represents the rate of change of a variable as a function of changes in another. For example, in a linear equation, the regression coefficient is the slope of the line. According to embodiments of the invention, the estimated regression coefficient may be a constant value indicating a rate of change in the raw ordinal data, and may be calculated to include a penalty factor. Adding a penalty factor into the regression coefficient estimation reduces the effect of outliers in the set of data.
At 208, for each portion of data, the computing system then calculates a robust inverse variance, which aggregates two or more random variables to minimize the variance of a weighted average. Inverse variance may be used in combining results from independent measurements where significant variances may exist.
At 210, the system then calculates the de-biased coefficient vector for each portion of data, which may be an approximation of the regression coefficient when there is no penalty factor. The de-biased coefficient vector and robust inverse variance may be obtained for each portion of data in parallel. Lastly, at 212, the robust inverse variances and de-biased coefficient vectors for the portions of data are summed to obtain a robust inverse variance weighted average over the entire set of data.
Embodiments of the overview method described in
Thus, learning to rank ordinal outcomes is an important task in many applications where outcomes are categorical and ordered in nature. For example, customer rating may be categorized with the following options and order: highly satisfied, satisfied, neutral, dissatisfied, and highly dissatisfied. The natural ordering of outcomes distinguishes ordinal regression from general multinomial regression where outcomes are categorical but nominal in nature.
According to various embodiments, a binary classification approach for training models may be used to obtain ordinal outcomes. As an example, a content provider or retail provider may consider the satisfaction level of a user for a product, with five possible levels. By asking the question “is the satisfaction level for the user greater than level k,” the provider obtains a binary classification problem for a fixed k, since the answer is binary, for example, yes or no (e.g., 1 or 0). By varying k=1, 2, 3, 4 for each user, the provider obtains four different binary classification problems. According to various embodiments, the main advantage of reducing an ordinal regression problem to a binary classification problem is that it facilitates the usage of well-tuned binary classifiers available in standard libraries.
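For illustration, such a question-based reduction may be sketched in Python as follows (the function name and the five-level scale are examples only):

```python
# Turn a 5-level satisfaction label into four "greater than level k?" answers.
def to_binary_labels(level, K=5):
    return [1 if level > k else 0 for k in range(1, K)]

print(to_binary_labels(3))  # [1, 1, 0, 0] -> greater than levels 1 and 2 only
```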
A number of algorithms may be implemented to train ordinal regression models, with the algorithms sharing the property of being inspired by or adapted from binary classification approaches. Existing technologies include the systematic ordinal-to-binary classification reduction technique, in which a reduction and training scheme is implemented so that all the binary classification problems are solved jointly to obtain a single binary classifier. A simple step is then used to convert the binary outputs to an ordinal rank, which also leads to an immediate generalization analysis.
While this reduction technique is efficient and only one binary classifier needs to be trained on the expanded classification data, the reduction step involves a necessary expansion of the instance space, i.e., it artificially expands the training set to K−1 times its original size, where K is the number of ordinal outcomes. In the era of big data, where training sets are usually quite large, it might be impossible to store the expanded data on a single machine. Even if storage is not an issue, since the data cannot be loaded into main memory, the computational time required to train a model, by reading data in chunks from secondary storage and iterating, may be substantially higher and might not be within acceptable time limits. An ad-hoc solution may include down-sampling the expanded data (especially if the classes are highly imbalanced); however, such methods are not governed by strong theory and can lead to loss of important information.
Divide-and-conquer methods according to embodiments of the present invention may be used for fitting logistic classification models. DC methods for logistic classification may include partitioning the full dataset into M separate parts, obtaining coefficient estimates from each part, and combining the M sets of estimates to get the final result. DC methods according to various embodiments may be more efficient because no synchronization is required between individual learning algorithms, leading to true parallel learning. DC methods according to various embodiments greatly reduce the computation time required by other existing methods, such as Newton's method, stochastic gradient descent, and mini-batch gradient descent.
Existing DC methods simply take the average of the estimates as

$$\hat{\theta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\theta}_m,$$

where $\hat{\theta}_m$ is the linear classifier coefficient estimate from the m-th data partition. However, existing DC methods have been shown to produce high variance in the combined estimator, whereas DC methods of the present invention reduce and minimize variance.
Other existing DC methods calculate an inverse variance weighted average (IVWA) of the separate estimates as

$$\hat{\theta} = \Big(\sum_{m=1}^{M}\hat{\Sigma}_m^{-1}\Big)^{-1}\sum_{m=1}^{M}\hat{\Sigma}_m^{-1}\,\hat{\theta}_m,$$

where $\hat{\Sigma}_m$ is the estimated variance-covariance matrix of $\hat{\theta}_m$. For logistic regression, $\hat{\Sigma}_m^{-1} = X_m^{\mathsf T} V_m(\hat{\theta}_m)\, X_m$, where $X_m$ is the m-th block feature matrix and $V_m(\theta)$ is a diagonal matrix with diagonal elements $v_m(\theta)_{i,i} = \sigma(x_{m,i}^{\mathsf T}\theta)\{1-\sigma(x_{m,i}^{\mathsf T}\theta)\}$, where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function. This estimator provides theoretical efficiency in the sense that the DC estimator can achieve the smallest variance possible, namely the variance achieved by the benchmark of directly training on the full data. However, due to overfitting resulting from a lack of regularization, the empirical results usually show larger variance than the benchmark.
To enforce sparsity, a majority voting (MV) method may select the most frequently identified features from the lasso regressions across all data divisions. The majority voting method returns nonzero estimates only for features that are identified in a majority of the data parts, and sets the rest to zero. In the combination step, the selected features correspond to a column-wise slicing $A$ of the identity matrix $I_D$, keeping only the columns corresponding to features whose lasso estimates are nonzero in more than a fraction $v$ of the data parts, for some voting threshold $v$; the inverse variance weights in the combination are estimated by plugging in the lasso estimates $\hat{\theta}_m$. Due to the sparseness of $\hat{\theta}_m$, this method is numerically robust. However, it requires tuning of two parameters: the lasso regularization parameter and the voting threshold $v$. Additionally, the combined estimator $\hat{\theta}$ is biased due to the biasedness of $\hat{\theta}_m$, $m = 1, \ldots, M$.
For a binary logistic regression, with instance $x \in \mathbb{R}^D$ and label $y \in \{0,1\}$, the binary classifier $f(x)$ may be parameterized by $\beta \in \mathbb{R}^D$, i.e., $f(x) = x^{\mathsf T}\beta$. The loss (or negative log likelihood) function of a training dataset may be represented by:

$$L(\beta) = -\sum_{i=1}^{N}\Big[y_i\log\sigma(x_i^{\mathsf T}\beta) + (1-y_i)\log\{1-\sigma(x_i^{\mathsf T}\beta)\}\Big], \quad (1)$$

where $N$ is the training sample size, and the estimated coefficient vector $\hat{\beta}$ is the minimizer of Equation (1).
A K-class ordinal regression problem may be defined by an instance $x \in \mathcal{X} \subseteq \mathbb{R}^D$ and label $y \in \mathcal{Y} = \{1, 2, \ldots, K\}$, with the natural ordering $1 \le 2 \le \cdots \le K$. In this example, the objective may be to learn a ranking rule $r: \mathcal{X} \to \mathcal{Y}$ which minimizes a cost function $C_{y,r(x)}$ in expectation over the joint distribution of $X$ and $Y$. Each instance and label pair $(x_i, y_i)$ may be reduced to binary classification pairs (along with the introduction of weights) by the following equations:
$$x_{ik} = (x_i^{\mathsf T}, e_k^{\mathsf T})^{\mathsf T} \in \mathbb{R}^{D+K-1}, \qquad y_{ik} = \mathbb{1}[k < y_i], \qquad w_{ik} = |C_{y_i,k} - C_{y_i,k+1}|, \quad (2)$$

for $k = 1, \ldots, K-1$, where $C_{y,k}$ may be the loss for assigning an outcome of $k$ when the actual value is $y$, and $e_k$ is the standard basis vector in dimension $K-1$. As a result, the original sample size expands from $N$ to $(K-1)N$. Subsequently, a logistic classifier $f(\cdot)$ may be trained on the expanded training set by minimizing the new loss function, which may be represented as:

$$L(\theta) = -\sum_{i=1}^{N}\sum_{k=1}^{K-1} w_{ik}\Big[y_{ik}\log\sigma(x_{ik}^{\mathsf T}\theta) + (1-y_{ik})\log\{1-\sigma(x_{ik}^{\mathsf T}\theta)\}\Big]. \quad (3)$$
Equation (3) may be viewed as the loss (negative log likelihood) of a set of training data with sample size $\tilde{N} = (K-1)N$, feature dimension $\tilde{D} = D + K - 1$, and sample weights specified by $w_{ik}$. The solution to Equation (3) may lead to a classifier $f(\cdot)$ of the form $f(\cdot) = (g(\cdot), b_1, b_2, \ldots, b_{K-1})$, where $g$ is defined by a parameter vector $\beta \in \mathbb{R}^D$ ($g(x) = x^{\mathsf T}\beta$) and $\{b_1, b_2, \ldots, b_{K-1}\}$ are bias terms. As such, $f(\cdot)$ may be represented as a linear function with parameter $\theta \in \mathbb{R}^{D+K-1}$, $\theta = [\beta^{\mathsf T}, b_1, \ldots, b_{K-1}]^{\mathsf T}$, with $f(x_k) = x_k^{\mathsf T}\theta = x^{\mathsf T}\beta + b_k$, where $x_k$ denotes the expanded copy of an instance $x$ at level $k$. When $C_{y,r(x)}$ is convex, the bias terms are rank monotone such that $b_1 \ge b_2 \ge \cdots \ge b_{K-1}$, and therefore $f(x_1) \ge f(x_2) \ge \cdots \ge f(x_{K-1})$. This justifies the ranking rule of predicting the rank of a new instance $x^* \in \mathbb{R}^D$ by:

$$r(x^*) = 1 + \sum_{k=1}^{K-1} \mathbb{1}\big[f(x^*_k) > 0\big]. \quad (4)$$
Here, the convex absolute loss $C_{y,r(x)} = |y - r(x)|$ may be used in the reduction to binary classification to ensure that the biases are rank monotone. As a result, $w_{ik} = |C_{y_i,k} - C_{y_i,k+1}| = 1$ for all $i, k$.
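For illustration, the reduction of Equation (2) and the ranking rule of Equation (4) may be sketched in Python as follows; the function names are examples, and the absolute cost discussed above is assumed so that all weights equal 1.

```python
import numpy as np

def expand_ordinal(X, y, K):
    """Expand each (x_i, y_i), y_i in {1, ..., K}, into K-1 weighted binary
    instances following Equation (2); with the absolute cost all weights equal 1."""
    N, D = X.shape
    rows, labels, weights = [], [], []
    for i in range(N):
        for k in range(1, K):
            e_k = np.zeros(K - 1)
            e_k[k - 1] = 1.0                          # standard basis vector e_k
            rows.append(np.concatenate([X[i], e_k]))  # x_ik = (x_i^T, e_k^T)^T
            labels.append(1 if y[i] > k else 0)       # y_ik = 1[k < y_i]
            weights.append(1.0)                       # w_ik = |C_{y,k} - C_{y,k+1}| = 1
    return np.array(rows), np.array(labels), np.array(weights)

def predict_rank(theta, x, K):
    """Ranking rule of Equation (4): r(x) = 1 + sum_k 1[x^T beta + b_k > 0]."""
    D = len(x)
    beta, biases = theta[:D], theta[D:]
    return 1 + sum(1 for k in range(K - 1) if x @ beta + biases[k] > 0)
```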
According to various embodiments, a consequence of the reduction technique is the expansion of the training set, as evidenced in Equation (2): the training set increases to K−1 times its original size. Even for moderately large datasets, such expansion may lead to greater computational burden. It might become cumbersome to store the expanded dataset on a single machine or computing resource, or at least to load it into main memory, which may lead to a substantial increase in training time. Alternatively, if the full training dataset were partitioned and trained on the individual parts, each part may have insufficient sample size to yield a stable coefficient estimate due to overfitting, resulting in poor quality of the combined estimate. Additionally, regularized methods that prevent overfitting usually give biased estimates, such that the combined estimate is also biased. Embodiments of the present invention resolve the technical problems that arise from using existing regularized methods by using a robust inverse variance weighted average (RIVWA) method.
The DC estimator methods according to embodiments of the present invention provide estimation consistency and asymptotic normality properties comparable to the benchmark method of training on the full data. As discussed, a major challenge for the regularized estimators of existing DC methods is that they are biased; thus, the combined estimate often lacks theoretical guarantees in terms of consistency. Some existing approaches use a bootstrap subsampling idea to estimate and correct for the bias of DC estimators. In contrast, the method according to various embodiments derives the closed-form expression of the bias of $\ell_1$-regularized logistic regression, and directly corrects the bias within each data part before combining the results.
Another challenge with existing techniques lies in processing large amounts of data, because the data can be so large that it usually cannot be stored on a single machine or computing resource on which to train the models. In existing DC techniques, when processing a large amount of data (e.g., ten gigabytes of data), a portion of that data may be taken out to train the model (e.g., one gigabyte out of ten). However, with existing DC techniques, using only one gigabyte of the entire ten gigabytes of data for training can lead to inefficient training, where significant portions of data are discarded and the discarded portion (e.g., the remaining nine gigabytes of data) may contain important signals in terms of predictability.
In an illustrative example, a service provider (e.g., media content provider, online marketplace provider, insurance provider, etc.) may wish to train models to predict user behavior in response to advertisements and/or customized content. The service provider may receive large amounts of data daily, which may be correlated to reduce the amount of data to be used in training the model. For example, the service provider may receive 10-20 GB of training data and, after correlating the total 10-20 GB, retain around 1 GB of the total data to train the model on a single computing resource, and then that model may be pushed into production. According to various embodiments, the service provider may not wish to correlate the entire set of 10-20 GB of data into just 1 GB of data. Accordingly, the service provider may have a number of computing resources (e.g., 20 machines) on which models can be trained. The entire set of data may be randomly or uniformly split across the number of computing resources that the service provider has available. For example, 20 GB of data may be divided evenly among the 20 machines such that each machine uses 1 GB of data to train a model. Each machine replicates the training process for its 1 GB of data, and the training can be done independently, with or without synchronization with the other machines. Without requiring the machines to synchronize with each other, there is no communication or overhead between the machines while the training process is executing, which results in efficient parallel ordinal regression model training. Each machine generates a model when it has completed the training process on its portion of data. After each machine has completed the training to generate a model, the service provider can combine the models in a weighted fashion, using an estimate of the variance of each model from each machine to compute variance matrices, which provide a weighting coefficient for each model. The models may then be combined in a linear fashion, where each model has a different weighting coefficient or factor. As a result, the summed model according to embodiments of the invention can, with significantly reduced processing time, replace the single model that would have been generated by a single machine training on the entire set of data.
First, the method according to various embodiments divides the full ordinal data into M parts (adopting the convention of dividing the original training data and then expanding; in another embodiment, the data may be first expanded and then divided). In some embodiments, the data is divided equally; however, the data may also be divided based on size or allocated to specific machines based on various parameters. When the data is divided equally, each part contains $n = N/M$ instances of the original training set. According to other embodiments, the service provider may divide the entire set of data into portions of different sizes, based on available computing resources, the computing or processing capabilities of the resources, etc. Smaller portions of data may have smaller inverse variance, and thus may be weighted less and have a lesser effect on the final summed model. According to various embodiments, the feature dimension D may be fixed so that coefficient estimates from separate parts can be combined, and so that storing the inverse variances of dimension D×D is feasible. According to some embodiments, the method may select an M that is not too large, to ensure n>D, and not too small, to ensure the benefits of embodiments of the invention are realized.
Next, according to various embodiments, $(Y_m, X_m)$ denotes the m-th part after the expansion by Equation (2), with instance space dimension $\tilde{n} = (K-1)n$ and feature space dimension $\tilde{D} = D + K - 1$. Thus, $X_m$ may be an $\tilde{n}\times\tilde{D}$ matrix where each row is an expanded instance $x_{ik}$, i.e., $X_m = [x_{m,11}, \ldots, x_{m,1K-1}, \ldots, x_{m,n1}, \ldots, x_{m,nK-1}]^{\mathsf T}$, and $Y_m$ may be a vector of length $\tilde{n}$, i.e., $Y_m = [y_{m,11}, \ldots, y_{m,1K-1}, \ldots, y_{m,n1}, \ldots, y_{m,nK-1}]^{\mathsf T}$. In some embodiments, an iterator $l = 1, \ldots, \tilde{n}$ may be used to iterate through $X_m$ and $Y_m$. For each of the data blocks, the method may consider the $\ell_1$-regularized logistic regression, which employs the lasso penalty on the loss function to learn the coefficient vector $\theta$:

$$\hat{\theta}_m = \operatorname*{arg\,min}_{\theta}\; L_m(\theta) + \lambda\|\theta\|_1, \quad (5)$$
where $L_m(\theta)$ denotes the loss of Equation (3) evaluated on the m-th data block, $\|\cdot\|_1$ is the $\ell_1$ norm, and $\lambda$ is the penalty factor. Equation (5) results in a sparse estimate $\hat{\theta}_m$ of the regression coefficients for the m-th block. According to various embodiments, for example, the Python library sklearn with the liblinear solver may be used to obtain $\hat{\theta}_m$.
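For illustration, such a per-block fit may be sketched with sklearn as follows; the mapping between sklearn's C parameter and the penalty factor λ depends on the solver's loss-scaling convention, so the inverse relationship shown is an approximation, and the intercept is disabled because the bias terms b_k are already part of the expanded features.

```python
from sklearn.linear_model import LogisticRegression

def fit_block(X_m, Y_m, w_m, lam):
    """l1-regularized logistic regression (Equation (5)) on one expanded block."""
    clf = LogisticRegression(penalty="l1", solver="liblinear",
                             C=1.0 / lam,          # approximate inverse of the penalty factor
                             fit_intercept=False)  # bias terms live in the expanded features
    clf.fit(X_m, Y_m, sample_weight=w_m)           # sample weights w_ik from the reduction
    return clf.coef_.ravel()                       # sparse estimate of theta_m
```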
Subsequently, according to various embodiments, the method may then calculate a robust inverse variance for each data block, represented by:

$$\hat{\Sigma}_m^{-1} = X_m^{\mathsf T}\, V_m(\hat{\theta}_m)\, X_m, \quad (6)$$

where $V_m(\theta)$ is an $\tilde{n}\times\tilde{n}$ diagonal matrix with diagonal elements $v_{l,l} = \sigma(x_{m,l}^{\mathsf T}\hat{\theta}_m)\{1-\sigma(x_{m,l}^{\mathsf T}\hat{\theta}_m)\}$, $l = 1, \ldots, \tilde{n}$. Using the same diagonal variance matrix, a de-biased coefficient vector may be calculated and represented by:

$$\hat{\theta}_m^{c} = \hat{\theta}_m + \big(X_m^{\mathsf T} V_m(\hat{\theta}_m)\, X_m\big)^{-1} X_m^{\mathsf T}\,(Y_m - \hat{Y}_m), \quad (7)$$

where $\hat{Y}_m = [\hat{y}_{m,1}, \ldots, \hat{y}_{m,\tilde{n}}]^{\mathsf T}$ with $\hat{y}_{m,l} = \sigma(x_{m,l}^{\mathsf T}\hat{\theta}_m)$, $l = 1, \ldots, \tilde{n}$. The de-biased coefficient vector $\hat{\theta}_m^{c}$ may be an approximation to the coefficient estimated when $\lambda = 0$ (i.e., no penalty). Equation (7) provides a convenient way to quickly compute $\hat{\theta}_m^{c}$ instead of solving Equation (5) at $\lambda = 0$. As such, for each data block, after computing $\hat{\theta}_m$ (i.e., the estimate of regression coefficients) in Equation (5), the $\hat{\theta}_m^{c}$ (i.e., the de-biased coefficient vector) may be calculated in Equation (7), and the $\hat{\Sigma}_m^{-1}$ (i.e., the robust inverse variance) may be calculated in Equation (6). The estimate of regression coefficients $\hat{\theta}_m$, de-biased coefficient vector $\hat{\theta}_m^{c}$, and robust inverse variance $\hat{\Sigma}_m^{-1}$ may be obtained for each data block, $m = 1, \ldots, M$, in parallel.
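For illustration, the per-block computations of Equations (6) and (7) may be sketched in Python as follows; this is an illustrative sketch assuming the one-step Newton form of the de-biased vector described above, with sample weights omitted since they equal 1 under the absolute cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_quantities(X_m, Y_m, theta_m):
    """Robust inverse variance (Eq. (6)) and de-biased coefficients (Eq. (7)),
    computed from the lasso estimate theta_m of one expanded data block."""
    p = sigmoid(X_m @ theta_m)               # hat{Y}_m, predicted probabilities
    v = p * (1.0 - p)                        # diagonal of V_m(theta_m)
    inv_var = X_m.T @ (v[:, None] * X_m)     # X_m^T V_m X_m
    # One Newton step from the lasso estimate toward the unpenalized fit.
    theta_c = theta_m + np.linalg.solve(inv_var, X_m.T @ (Y_m - p))
    return inv_var, theta_c
```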
Lastly, the values from Equations (6) and (7), using the regression coefficient from Equation (5), may be combined to obtain the RIVWA estimate $\hat{\theta}$, represented by:

$$\hat{\theta} = \Big(\sum_{m=1}^{M}\hat{\Sigma}_m^{-1}\Big)^{-1}\sum_{m=1}^{M}\hat{\Sigma}_m^{-1}\,\hat{\theta}_m^{c}. \quad (8)$$

Note that when estimating the inverse variance weights, the sparse regularized estimates $\hat{\theta}_m$ may be plugged in to avoid overfitting and to ensure numerical robustness of $\hat{\Sigma}_m^{-1}$; the computation of $\hat{\theta}_m$ thus serves to stabilize the robust inverse variances. However, the average is taken across the de-biased estimates $\{\hat{\theta}_m^{c}\}_{m=1}^{M}$ to provide unbiasedness and consistency of $\hat{\theta}$. Equation (7) also provides a direct and simple way to use the penalized $\hat{\theta}_m$ to compute the unpenalized counterpart $\hat{\theta}_m^{c}$ corresponding to $\lambda = 0$. $M$ may be the total number of subsets of data, with each subset of data being processed by an individual computing resource. As such, $m$ identifies the subset of data, and the robust inverse variance is calculated for each subset or portion of data.
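For illustration, the combination step of Equation (8) may be sketched as follows, taking the per-block robust inverse variances and de-biased coefficient vectors as inputs:

```python
import numpy as np

def rivwa_combine(block_results):
    """block_results: list of (inv_var, theta_c) pairs, one per data block."""
    total_inv_var = sum(iv for iv, _ in block_results)
    weighted_sum = sum(iv @ tc for iv, tc in block_results)
    # Solve (sum_m inv_var_m) theta = sum_m inv_var_m theta_c_m for theta.
    return np.linalg.solve(total_inv_var, weighted_sum)
```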
In contrast, the classic inverse variance weighted average (IVWA) uses the unregularized coefficient estimates in the calculation of the inverse variance weights, which often leads to overfitting in individual data parts, resulting in predicted probabilities very close to the boundary (i.e., 0 or 1) and thus producing inaccurate estimates of the parameter variance matrices $\hat{\Sigma}_m$.
According to various embodiments, the consistency and asymptotic normality properties of the RIVWA estimator $\hat{\theta}$ of Equation (8) may be established as follows. First, the asymptotic properties of $\{\hat{\theta}_m^{c}\}_{m=1}^{M}$ from the individual parts may be obtained; then it can be shown that the same properties apply to the combined estimator $\hat{\theta}$. Here $\theta_0$ may be the unknown true underlying coefficient, which is the limiting value of the coefficient obtained from the benchmark $\hat{\theta}_{BM}$ as $\tilde{N}\to\infty$. In an embodiment, for the m-th block (e.g., subset or portion of data), $\hat{\theta}_m^{c}$ in Equation (7) may be consistent, i.e., $\hat{\theta}_m^{c}\to\theta_0$ in probability, and asymptotically normally distributed, i.e., $\sqrt{\tilde{n}}\,(\hat{\theta}_m^{c}-\theta_0)\to N(0, \Sigma_m(\theta_0))$ in distribution, where $\Sigma_m(\theta_0)$ is the corresponding asymptotic variance-covariance matrix. In another embodiment, the combined estimator $\hat{\theta}$ in Equation (8) may have the same consistency and asymptotic normality properties as the benchmark estimator $\hat{\theta}_{BM}$, i.e., $\hat{\theta}\to\theta_0$ in probability and $\sqrt{\tilde{N}}\,(\hat{\theta}-\theta_0)\to N(0, \Sigma(\theta_0))$ in distribution, with $\Sigma(\theta_0)$ the corresponding asymptotic variance-covariance matrix.
Embodiments of the present invention utilize a logistic model, and also show that the parameters introduced by the data expansion enjoy the same properties, which may be valuable for outcome prediction and other applications of ordinal regression. In addition, using results from the combined estimator $\hat{\theta}$ in Equation (8), statistical tests may be conducted on the bias terms $b_1, \ldots, b_{K-1}$ to compare the mean expected probabilities of different outcome levels. For each block of data, regression coefficient estimates and variance estimates may be calculated to provide a weighting value for each model from each machine, and then the models are combined in a linear fashion.
The following examples illustrate applying the model to various different types of datasets. Here, the DC method according to embodiments of the invention was trained on a public insurance dataset and various additional datasets. Models trained implementing the DC method of the present invention are compared with a benchmark and with results from prior DC methods.
For example, the following methods are compared: the benchmark (BM) trained on the full expanded data; the simple average (SA), inverse variance weighted average (IVWA), and majority voting (MV) DC methods described above; the RIVWA DC method according to embodiments of the present invention; and the stochastic methods FTRL and mini-batch gradient descent (MBGD).
BM, FTRL, and MBGD utilize in-memory access of the full training data. FTRL and MBGD are stochastic methods that require synchronization and data access at every iteration; thus, these methods cannot be executed in parallel and are considered single-machine, in-memory training methods. For FTRL, all the instances are iterated; for MBGD, the mini-batches are 1/100 of the full training data. For BM, results are reported after subsampling of the expanded dataset. BM does not involve regularization; otherwise, it would not have a limiting distribution.
For performance evaluation, the following metrics are reported:
(1) the absolute difference $d_1$ and squared difference $d_2$ of an estimated coefficient vector $\hat{\theta}$ from that of the benchmark $\hat{\theta}_{BM}$, $d_1(\hat{\theta}, \hat{\theta}_{BM}) = \|\hat{\theta} - \hat{\theta}_{BM}\|_1$ and $d_2(\hat{\theta}, \hat{\theta}_{BM}) = \|\hat{\theta} - \hat{\theta}_{BM}\|_2^2$;
(2) the absolute prediction loss of an estimated $\hat{\theta}$ evaluated on the testing set, defined as $\mathrm{abs\_loss}(Y_{test}, \hat{\theta}) = \frac{1}{|Y_{test}|}\sum_i |y_i - \hat{y}_i|$, where $y_i$ is the true ordinal label of the i-th instance in $Y_{test}$ and $\hat{y}_i$ is its predicted label given $\hat{\theta}$;
(3) computation time in seconds, including the time used to read data from divided parts. Time of DC methods is calculated by the maximum time across all parallel procedures plus time used to combine the results. The above metrics are also compared at different choices of M.
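For illustration, these metrics may be sketched in Python as follows (the names are examples; predicted labels would be produced by the ranking rule of Equation (4)):

```python
import numpy as np

def d1(theta_hat, theta_bm):
    return np.sum(np.abs(theta_hat - theta_bm))    # l1 distance to the benchmark

def d2(theta_hat, theta_bm):
    return np.sum((theta_hat - theta_bm) ** 2)     # squared l2 distance

def abs_loss(y_true, y_pred):
    """Mean absolute difference between true and predicted ordinal labels."""
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))
```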
Tuning parameters are selected to minimize the absolute prediction loss on the validation set: for the tuning parameter $\lambda$, the value $\lambda^*$ is selected such that $\lambda^* = \arg\min_{\lambda}\,\mathrm{abs\_loss}(Y_{valid}, \hat{\theta}_{\lambda})$, where $\hat{\theta}_{\lambda}$ is the coefficient vector obtained at tuning value $\lambda$. A grid search is used over $\lambda \in \{10^{-4}, 10^{-3}, \ldots, 10^{3}\}$.
Table 1 below shows the results on an insurance dataset. Number of divisions M=100 for DC methods. d1: absolute differences between coefficient estimates of other methods and the benchmark; d2: squared differences between coefficient estimates of other methods and the benchmark; change in abs_loss with respect to benchmark (%): the relative percentage change in absolute prediction loss with respect to benchmark (the smaller the better); time: computation time in seconds. Results are averaged across 10 repeated experiments and reported as mean±sd. Best results are highlighted in bold.
The outcome of interest is an 8-level ordinal rating related to some undisclosed decision associated with an application in an insurance service provider. The dataset contains 59,381 labeled instances and has 144 features. The dataset was randomly split into 60% for training, 10% for validation, and 30% for testing.
Table 1 shows the results of the comparison across different methods. Results in the table are reported as the mean and standard deviation from 10 replications of randomized train-validation-test splits. Not only does the RIVWA method of the present invention produce the closest coefficient estimates to the benchmark in terms of $d_1$ and $d_2$, it also achieves very good prediction performance in terms of $\mathrm{abs\_loss}(Y_{test}, \hat{\theta})$. Additionally, the computational time is less than 1/100 that of the benchmark method, and similar to other DC methods.
Another illustration of the DC techniques of the present invention uses a popular public movie rating dataset containing 20,000,263 movie ratings by 138,493 users of 27,278 movies from 1995-2015. The following features are used for modeling: user ID, movie ID, rating year, movie year, genre categories, user tags, and genome tags with relevance above 0.8.
Different from the insurance dataset, where the feature space dimension is fixed and small, the total number of features in this example is much larger than N and highly sparse. In order to estimate the variance of the coefficients, the hashing trick is applied so that the features are reduced to a space of fixed dimension, which is fixed at $2^{10} = 1{,}024$. Having a fixed feature space with lower dimension may be important for all types of inverse variance weighted methods, i.e., IVWA, MV, and RIVWA, because the inverse variance matrix from each data division, which is a $\tilde{D}\times\tilde{D}$ matrix, must be stored in memory. If $\tilde{D}$ grows with $\tilde{N}$, the challenge of storing $\tilde{D}\times\tilde{D}$ weighting matrices compromises the scalability of inverse variance weighted DC methods.
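For illustration, such a hashing step may be sketched with sklearn's FeatureHasher as follows; the raw feature dictionaries shown are examples only:

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**10, input_type="dict")  # fixed dimension of 1,024
raw = [{"user_id=7": 1, "movie_id=42": 1, "genre=Drama": 1},
       {"user_id=9": 1, "movie_id=42": 1, "genre=Comedy": 1}]
X_hashed = hasher.transform(raw)   # sparse matrix with 1,024 hashed columns
```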
In this illustrative example, each movie has a score from 0.5 to 5.0 with 0.5 increments (10 ordinal levels). The data expansion therefore increases the data size 9 times. Here, a subsample of around 1,000,000 instances is used, which expands to 9,000,000 binary instances. Similarly, the data processing system may split the data into training, validation, and testing sets, with 60% for training, 10% for validation, and 30% for testing. Thus the expanded training sample size is $\tilde{N} \approx 6{,}000{,}000$. The training data is divided into M=1000 parts. Results are shown in Table 2, where similar performance metrics as before are reported and compared between the different types of DC methods and the benchmark.
While an ordinal regression ranking problem naturally leads to an expanded dataset, the DC technique according to embodiments may be applicable to general logistic regression for binary classification problems. The DC technique according to embodiments may be applied to a public advertisement dataset, which only has two outcomes: conversion from a click versus no conversion. The dataset has 15,898,883 instances. Due to memory limit restrictions when computing BM, the dataset was randomly sampled to N=2,120,698 instances for training, 212,070 for validation, and 848,277 for testing. There are 8 continuous count features and 9 categorical features in this data. Initially, the method bucketizes the counts into the nearest integer of their natural logged values. Then the same hashing trick as in the movie ratings dataset may be applied by mapping the features into a space of size D=1,024. Model tuning parameters may be selected to maximize the AUC.
Table 2 shows results for BM and the DC methods at M=200 and M=1000. Since the outcome is binary, both the AUC and the absolute prediction loss are reported. When M=200, the prediction performance may be very similar for all types of DC methods. However, when M=1000, only RIVWA and SA preserve performance comparable to BM. In all cases, RIVWA has the smallest deviation from BM in terms of coefficient estimation.
Table 2 below provides the results on the movie rating dataset, the advertising conversion dataset, and an e-commerce advertising funnel dataset. The smaller the $d_1(\hat{\theta}, \hat{\theta}_{BM})$, $d_2(\hat{\theta}, \hat{\theta}_{BM})$, and percentage change in abs_loss, the better. The larger the AUC, the better. The AUC is only available for the advertising dataset because it has binary (two-level) outcomes.
Table 2 also illustrates the results on an e-commerce advertisement dataset from an online marketplace service provider. The e-commerce dataset consists of 3 ordinal levels: an ad impression on a publisher website which did not lead to a click (k=1), an ad impression which led to a click but did not lead to any product purchase (k=2) and an ad impression which led to a click followed by a product purchase (k=3). For example, a purchase may be valued more than a click, which may be valued more than an impression which did not lead to a click (thus, a natural ordinal ranking is induced).
In this example, for training, the impression, click, and purchase data are collected over a period of 1 week (with 2 day click attribution and 7 day purchase attribution). A single day's data may be used for validation and another day's data for testing. The number of instances and features are in the millions and the same hashing function is applied to project the original feature space into a fixed dimensional space. The training data may be randomly divided into M=1000 parts.
Table 2 shows the results of the different DC methods, including the RIVWA of the present invention, as well as the benchmark. It can be seen that RIVWA has improved prediction accuracy in terms of absolute prediction loss, and its estimate is close to that of the benchmark. Although IVWA yields performance similar to the DC method according to various embodiments in terms of d1 and d2, it has a much larger prediction loss compared to the RIVWA of the present invention.
RIVWA consistently shows improved performance in terms of parameter estimation, as supported by the theoretical results. The RIVWA DC method according to various embodiments also provides good prediction. Parameter estimation is important to downstream usage of the estimated coefficients for purposes such as ranking prediction, calibration, and estimation of probabilities. Thus, ensuring that coefficient estimates from divide-and-conquer methods are as close as possible to the benchmark is important. A limitation of the method, as with all variance-dependent DC methods, is that it requires the feature space to be fixed in dimension. For feature spaces that are not known in advance, a hashing function is applied for dimension reduction. Hashing can improve results when the original feature space is very large and the occurrence of features is sparse.
Various embodiments of the present invention provide a DC-based algorithm to overcome the scalability issue in training an ordinal regression model. The RIVWA DC method according to various embodiments may not be tied to any specific property of ordinal regression. Instead, the DC algorithm according to various embodiments applies to any logistic classification problem where the size of the training set is too big to train a model on a single machine. The motivation for considering the ordinal regression problem is that the ordinal-to-binary reduction method necessarily expands the data set, on the order of the number of ordinal outcomes, and can make even a moderately sized training set too large.
Various embodiments of the present invention provide a method to divide the expanded binary classification dataset, resulting from the reduction step of existing technology, train individual regularized logistic classifiers on the blocks of data, and combine the classifiers in an efficient way to get an ordinal regression model. The DC method according to various embodiments removes the need for synchronization between learning algorithms on the data blocks and thus, the learning algorithms can run in parallel, on distributed data frameworks.
According to various embodiments, the model coefficients from the DC method of the present invention are consistent with those of the logistic model trained on a single machine on the entire dataset, and are asymptotically normally distributed. Furthermore, comparing the model produced by the DC technique of the present invention on multiple datasets, improvements are shown in estimation and prediction performance, as well as reductions in training time, over other existing methods.
The illustrative environment includes at least one application server 508 and a plurality of resources, servers, hosts, instances, routers, switches, data stores, and/or other such components defining what will be referred to herein as a data plane 540, although it should be understood that resources of this plane are not limited to storing and providing access to data. It should be understood that there can be several application servers, layers, or other elements, processes, or components, which may be chained or otherwise configured, which can interact to perform tasks such as obtaining data from an appropriate data store. As used herein the term “data store” refers to any device or combination of devices capable of storing, accessing, and retrieving data, which may include any combination and number of data servers, databases, data storage devices, and data storage media, in any standard, distributed, or clustered environment. The application server can include any appropriate hardware and software for integrating with the data store as needed to execute aspects of one or more applications for the client device, handling a majority of the data access and business logic for an application. The application server provides admission control services in cooperation with the data store, and is able to generate content such as text, graphics, audio, and/or video to be transferred to the user, which may be served to the user by the Web server in the form of HTML, XML, or another appropriate structured language in this example. In some embodiments, the Web server 506, application server 508 and similar components can be considered to be part of the data plane. The handling of all requests and responses, as well as the delivery of content between the client device 502 and the application server 508, can be handled by the Web server. It should be understood that the Web and application servers are not required and are merely example components, as structured code can be executed on any appropriate device or host machine as discussed elsewhere herein.
The data stores of the data plane 540 can include several separate data tables, databases, or other data storage mechanisms and media for storing data relating to a particular aspect. For example, the data plane illustrated includes mechanisms for storing production data 512 and user information 416, which can be used to serve content for the production side. The data plane also is shown to include a mechanism for storing log data 514, which can be used for purposes such as reporting and analysis of the user data, including gathering and aggregating the large amounts of data from multiple users on the network. It should be understood that there can be many other aspects that may need to be stored in a data store, such as for page image information and access right information, which can be stored in any of the above listed mechanisms as appropriate or in additional mechanisms in the data plane 540. The data plane 540 is operable, through logic associated therewith, to receive instructions from the application server 508 and obtain, update, or otherwise process data, instructions, or other such information in response thereto. In one example, a user might submit a search request for a certain type of item. In this case, components of the data plane might access the user information to verify the identity of the user, gather user information, and access the catalog detail information to obtain information about items of that type. The information then can be returned to the user, such as in a results listing on a Web page that the user is able to view via a browser on the user device 502. Information for a particular item of interest can be viewed in a dedicated page or window of the browser.
Each server typically will include an operating system that provides executable program instructions for the general administration and operation of that server, and typically will include a computer-readable medium storing instructions that, when executed by a processor of the server, enable the server to perform its intended functions. Suitable implementations for the operating system and general functionality of the servers are known or commercially available, and are readily implemented by persons having ordinary skill in the art, particularly in light of the disclosure herein.
The environment in one embodiment is a distributed computing environment utilizing several computer systems and components that are interconnected via communication links, using one or more computer networks or direct connections. However, it will be appreciated by those of ordinary skill in the art that such a system could operate equally well in a system having fewer or a greater number of components than are illustrated in
An environment such as that illustrated in
As discussed above, the various embodiments can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices which can be used to operate any of a number of applications. User or client devices can include any of a number of general purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system also can include a number of workstations running any of a variety of commercially-available operating systems and other known applications for purposes such as development and database management. These devices also can include other electronic devices, such as dummy terminals, thin-clients, gaming systems, and other devices capable of communicating via a network.
Various aspects also can be implemented as part of at least one service or Web service, such as may be part of a service-oriented architecture. Services such as Web services can communicate using any appropriate type of messaging, such as by using messages in extensible markup language (XML) format and exchanged using an appropriate protocol such as SOAP (derived from the “Simple Object Access Protocol”). Processes provided or executed by such services can be written in any appropriate language, such as the Web Services Description Language (WSDL). Using a language such as WSDL allows for functionality such as the automated generation of client-side code in various SOAP frameworks.
Most embodiments utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially-available protocols, such as TCP/IP, OSI, FTP, UPnP, NFS, CIFS, and AppleTalk. The network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, a public switched telephone network, an infrared network, a wireless network, and any combination thereof.
In embodiments utilizing a Web server, the Web server can run any of a variety of server or mid-tier applications, including HTTP servers, FTP servers, CGI servers, data servers, Java servers, and business application servers. The server(s) also may be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more Web applications that may be implemented as one or more scripts or programs written in any programming language, such as Java®, C, C# or C++, or any scripting language, such as Perl, Python, or TCL, as well as combinations thereof. The server(s) may also include database servers, including without limitation those commercially available from Oracle®, Microsoft®, Sybase®, and IBM®.
The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers or remote from any or all of the computers across the network. In a particular set of embodiments, the information may reside in a storage-area network (“SAN”) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers, or other network devices may be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and at least one output device (e.g., a display device, printer, or speaker). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as random access memory (“RAM”) or read-only memory (“ROM”), as well as removable media devices, memory cards, flash cards, etc.
Such devices also can include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above. The computer-readable storage media reader can be connected with, or configured to receive, a computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and various devices also typically will include a number of software applications, modules, services, or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or Web browser. It should be appreciated that alternate embodiments may have numerous variations from that described above. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software (including portable software, such as applets), or both. Further, connection to other computing devices such as network input/output devices may be employed.
Storage media and computer readable media for containing code, or portions of code, can include any appropriate media known or used in the art, including storage media and communication media, such as but not limited to volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer readable instructions, data structures, program modules, or other data, including RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a system device. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.