The presently disclosed embodiments relate to cloud services, and more particularly to methods and systems for modeling cloud user behavior.
Cloud computing has emerged as one of the best methods for companies to revamp and enhance their IT infrastructures. Accordingly, there has been a proliferation of cloud-based service providers in recent years. However, a particular service offering from a particular provider may have a different level of acceptability to different user (or customer) groups depending on the users' preferences. The related art fails to provide a reliable technique to understand different cloud user groups and their behavior, in terms of acceptability of the providers' offerings based on users' preferences. Cloud-based service providers nevertheless need to understand the different user groups and their behavior, so that they can target offerings to different user groups according to users' preferences.
The cloud service offerings from these providers are not standardized. Due to this lack of standardization, similar offerings from different providers have different performance and cost implications, and customers are unable to compare the service offerings properly. Currently, customers are responsible for engaging in consultations with cloud service providers to identify an acceptable offering. With the increased focus on standardization of computing clouds, it is important to understand which cloud offering is a better fit for which user group, and accordingly create standards for the different user groups.
However, the problem of modeling user behavior and finding different user groups based on user preferences may be difficult because the set of users is heterogeneous, often spanning scale (e.g., enterprise vs. small scale customers), economy (e.g., emerging vs. developed markets), geography, and time (e.g., for office-use in day time vs. personal use at nights). For example, in an emerging economy, or with regard to small and medium businesses (SMBs), the users may be less performance-savvy and more cost-concerned, whereas in a developed economy or with regard to large enterprises, the users may have a stronger preference for performance. This issue is exacerbated since the user groups, and the commonalities and differences of their behaviors, are typically unknown and need to be dynamically learned online from the user preferences. The preferences of a user can further change over time (e.g., a performance-savvy customer becoming cost conscious after a month or a year, etc.).
Related art methods of modeling user behaviors are based on prior knowledge of the different user groups and the behaviors within each group. These methods involve segregation of user behavior into these known groups. However, cloud user groups and their behavior patterns are not known beforehand. Therefore, the related art methods are inapplicable for identification of cloud user groups. It may therefore be beneficial to systematically model the cloud users' behavior in terms of their preferences without any prior knowledge (or with only limited knowledge) of the clusters, then classify the users into different clusters and also predict behavior of new users.
Thus, some embodiments take into account preference data from different cloud users, including ranking of different preference parameters related to cloud service offerings. The user groups can be determined based on fitting mixture models on the preference observations. In some embodiments, a preference is constituted by anything that can characterize high-level requirements, such as demands on performance, cost, security, availability, etc.
In one aspect, the present disclosure provides a method for identifying a plurality of clusters from a plurality of users using at least one cloud service. The method includes obtaining user preferences for the plurality of users, and then estimating at least one parameter of a distance-based model by the Expectation-Maximization (EM) algorithm for a specific number of clusters (G) and computing a Bayesian Information Criterion (BIC) with the at least one estimated parameter for the specific number of clusters (G). The method includes repeating the estimating and computing steps using a different value of G. Thereafter, the method includes comparing BICs obtained for various values of G and selecting the model with the highest BIC as the best model, wherein the best model includes the plurality of clusters. The method also includes using estimated latent variables of the best model to build a classifier, and classifying each user into a cluster of the best model using the classifier.
In another aspect, the present disclosure provides a method for identifying a plurality of clusters from a plurality of users using at least one cloud service. The method includes obtaining user preferences for the plurality of users, and then estimating at least one parameter of a distance-based model by the EM algorithm for a specific number of clusters (G) and computing BIC with the at least one estimated parameter for the specific number of clusters (G). The method includes repeating the estimating and computing steps using a different value of G. Thereafter, the method includes comparing BICs obtained for various values of G and selecting the model with highest BIC as the best model, wherein the best model includes the plurality of clusters. Next, the method includes using estimated latent variables of the best model to build a classifier, classifying each user into a cluster of the best model using the classifier and determining ranking preference of each cluster in the best model. Further, the method includes obtaining user preferences of a new user, predicting the cluster of the new user using the classifier and characterizing the new user based on the predicted cluster. The method also includes repeating the method steps periodically based on updated user preferences.
In another aspect, the present disclosure provides an apparatus for identifying clusters from a plurality of users using cloud services. The apparatus includes a memory and a processor coupled to the memory. The processor is configured to execute the steps of obtaining user preferences for the plurality of users, estimating at least one parameter of a distance-based model by the EM algorithm for a specific number of clusters (G) and computing BIC with the at least one estimated parameter for the specific number of clusters (G). Next, the processor is configured to repeat the estimating and computing steps using a different value of G, compare BICs obtained for various values of G, select the model with highest BIC as the best model, and use estimated latent variables of the best model to build a classifier. The processor is also configured to classify each user into a cluster of the best model using the classifier.
In a further aspect, the present disclosure provides a system for identifying clusters from a plurality of users using cloud services. The system includes a behavior collection module configured to obtain user preferences for the plurality of users. The system further includes an EM module configured to estimate at least one parameter of a distance-based model by the EM algorithm for various values of G (number of clusters). The system includes a selection module configured to compute BIC with the at least one estimated parameter obtained from the EM module for various G, compare BICs obtained for various values of G, select the model with highest BIC as the best model, wherein the best model comprises the plurality of clusters; and use estimated latent variables of the best model to build a classifier. The system also includes a characterization module configured to classify each user into a cluster of the best model using the classifier and determine ranking preference of each cluster.
In a yet further aspect, the present disclosure provides a computer readable carrier including processing instructions adapted to cause a processor to execute the method for identifying a plurality of clusters from a plurality of users using at least one cloud service. The method includes obtaining user preferences for the plurality of users. The method includes estimating at least one parameter of a distance-based model by the EM algorithm for a specific number of clusters (G) and computing BIC with the at least one estimated parameter for the specific number of clusters (G). The method includes repeating the estimating and computing steps using a different value of G. Thereafter, the method includes comparing BICs obtained for various values of G and selecting the model with highest BIC as the best model, wherein the best model includes the plurality of clusters. The method also includes using estimated latent variables of the best model to build a classifier and classifying each user into a cluster of the best model using the classifier.
In another aspect, the present disclosure provides a computer readable carrier that includes processing instructions adapted to cause a processor to execute the method for identifying a plurality of clusters from a plurality of users using at least one cloud service. The method includes obtaining user preferences for the plurality of users. The method includes estimating at least one parameter of a distance-based model by the EM algorithm for a specific number of clusters (G) and computing BIC with the at least one estimated parameter for the specific number of clusters (G). The method includes repeating the estimating and computing steps using a different value of G. Thereafter, the method includes comparing BICs obtained for various values of G and selecting the model with highest BIC as the best model, wherein the best model includes the plurality of clusters. Next, the method includes using estimated latent variables of the best model to build a classifier, classifying each user into a cluster of the best model using the classifier and determining ranking preference of each cluster in the best model. Further, the method includes obtaining user preferences of a new user, predicting the cluster of the new user using the classifier and characterizing the new user based on the predicted cluster. The method also includes repeating the method steps periodically based on updated user preferences.
In a further aspect, the disclosure provides a method for identifying a plurality of clusters from a plurality of users using at least one cloud service. Each cluster includes at least one user from the plurality of users. The method includes obtaining user preferences for the plurality of users. Next, the method includes estimating at least one parameter for each distance-based model in a plurality of distance-based models, wherein each distance-based model includes a different number of clusters (G). Thereafter, the method selects a best model from the plurality of distance-based models based on the estimated value of the at least one parameter, and the method classifies each user into a cluster of the best model.
In a yet further aspect, the disclosure provides a system for identifying a plurality of clusters from a plurality of users using at least one cloud service. Each cluster includes at least one user from the plurality of users. The system includes a behavior collection module configured to obtain user preferences for the plurality of users. The system further includes an estimating module configured to estimate at least one parameter for each distance-based model in a plurality of distance-based models, wherein each distance-based model includes a different number of clusters (G). The system also includes a selection module configured to select a best model from the plurality of distance-based models based on the estimated value of the at least one parameter. The system further includes a characterization module configured to classify each user into a cluster of the best model and determine ranking preference of each cluster.
The following detailed description is provided with reference to the figures. Exemplary, and in some cases preferred, embodiments are described to illustrate the disclosure, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art will recognize a number of equivalent variations in the description that follows.
Definitions of one or more terms that will be used in this disclosure are described below.
As used herein, a “cloud” refers to a set of hardware, networks, storage, services, and interfaces that combine to deliver aspects of computing as a service. Accordingly, “cloud computing” refers to distributed computing over a network, which entails the ability to run a program or application on many connected computers at the same time. Further, “cloud services” refers to network-based services, which appear to be provided by real server hardware, but are in fact served up by virtual hardware, simulated by software running on one or more real machines. Such virtual servers do not physically exist and can therefore be moved around and scaled up or down on the fly without affecting the end user. Moreover, “cloud service provider” refers to a service provider that offers customers storage or software services via a cloud. Accordingly, “cloud service offerings” are various cloud services provided by cloud service providers to their customers, wherein the cloud services may be customized by a customer or preset by the cloud service provider.
An “expectation-maximization (EM) algorithm” is an iterative method for finding maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables.
The term “Rank Clustering” refers to a modeling method that takes preference data from different users and determines the clusters. A “cluster” refers to a group of users with similar requirements and/or similar behavior. The “preference data” captures a user's requirements and behavior, and is the ranking of different parameters by different users. The ranking may be obtained by using surveys or by monitoring user behavior, including the services users buy, how they use them, when they upgrade or downgrade, and so on.
The disclosure generally relates to modeling cloud user behavior. Modeling user behavior and finding different clusters based on user preferences, with no prior knowledge of clusters, can be problematic. To address challenges in the related art, some of the disclosed embodiments provide methods and systems for modeling cloud user behavior. The disclosed embodiments use unsupervised learning where no prior knowledge (or limited prior knowledge) of the user groups exists, and everything (or a significant amount of the data) is estimated from scratch.
Some embodiments provide a system for identifying clusters from a plurality of users using cloud services. The system includes a behavior collection module configured to obtain user preferences for the plurality of users. The system further includes an EM module configured to estimate at least one parameter of a distance-based model by the EM algorithm for various values of G (number of clusters). The system also includes a selection module configured to compute a Bayesian Information Criterion (BIC) with the at least one estimated parameter obtained from the EM module for various G, compare BICs obtained for various values of G, select the model with the highest BIC as the best model, wherein the best model includes the plurality of clusters; and use estimated latent variables of the best model to build a classifier. Finally, the system includes a characterization module configured to classify each user into a cluster of the best model using the classifier and determine ranking preference of each cluster.
A plurality of users 110-116 may obtain the one or more cloud services 102-108 over a network 118. The plurality of users 110-116 may obtain the one or more cloud services 102-108 through a third party recommendation system or a marketplace, which enables inter-operability among different service providers. The plurality of users 110-116 includes various types of users, including users spanning across scale (e.g., enterprise vs. small scale customers), economy (e.g., emerging vs. developed markets), geography, and time (e.g., for office-use in day time vs. personal use at nights). The network 118 includes, but is not limited to, the Internet, LAN, MAN, WAN, or the like.
The user preferences may vary based on specific requirements of various types of users. For example, in an emerging economy or for small and medium businesses, the customers may be less performance-savvy and more cost-concerned. However, in a developed market, the customers may have a stronger preference for performance. Further, a user's preferences can change over time (e.g., a performance-savvy customer becoming cost conscious after a month or a year, etc.). The user preferences are collected using a typical ranking of the high-level requirements from the users. The user preferences are explained in further detail in conjunction with
The system 200 further includes a rank clustering module 204 configured to obtain user preferences from the behavior collection module 202 to determine one or more clusters 206-210. The rank clustering module 204 divides a set of rank observations into meaningful clusters, such that the patterns that distinguish one cluster from another can be observed. Some standard ranking models assume that the set of users is homogeneous. However, the cloud users are typically heterogeneous in nature. The rank clustering module 204 models a heterogeneous set of users by assuming that the set of users is composed of a finite number of homogeneous sub-groups. The distribution of rankings within the sub-groups is modeled using one of the standard models for rankings. Rank clustering attempts to identify groups of users with a typical preference behavior. The rank clustering module 204 dynamically determines the one or more cloud clusters 206-210 based on the users' preferences. The rank clustering module 204 is explained in further detail in conjunction with
The system 200 further includes a targeted offerings module 212 configured to send targeted offers to users in the one or more cloud clusters 206-210. The targeted offerings module 212 enables cloud service providers to target the cloud service offerings according to the clusters and their typical requirements. Further, if a new user's preferences are unknown, then the user's background information (e.g., location of the user) is used to determine the appropriate cluster and the corresponding preference is used as the new user's requirement. For example, if a new SMB user is unaware of their preferences for a cloud service, the targeted offerings module 212 guides the user with messages, such as “users similar to you have preferred low cost and high performance”, etc.
The rank clustering module 204 performs cluster analysis of observations r. The rank clustering module 204 uses mixtures of distance-based models for modeling heterogeneous populations. The distance-based models for rankings have two parameters, a central ranking R and a measure of precision λ; the probability of a ranking occurring is large for rankings close to the central ranking and is small for rankings far away from the central ranking. The probability that an observation comes from a cluster g is πg. In such a set-up, the probability of a ranking r occurring is given by equation (1) below:
$$f(r \mid R, \lambda) = C(\lambda)\,\exp[-\lambda\, d(r, R)] \qquad (1)$$
where C(λ) is a normalizing constant that makes f a probability distribution, and d(r, R) is the distance between two rankings r and R.
The distance is defined using Spearman's definition of the distance between two rankings, as described below. If r=(r1, r2, . . . , rM) and s=(s1, s2, . . . , sM) are two rankings of M objects, where rj and sj are the ranks given to object j, then the distance between the rankings is provided by equation (2) below:
$$d(r, s) = \left[\sum_{i=1}^{M} (r_i - s_i)^2\right]^{1/2} \qquad (2)$$
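By way of non-limiting illustration, equations (1) and (2) may be sketched in Python as follows. The function names are hypothetical and do not appear in the disclosure. The normalizing constant C(λ) is computed here by brute-force enumeration of all M! rankings, which is an assumption made for clarity and is only practical for a small number of objects M.

```python
import math
from itertools import permutations


def spearman_distance(r, s):
    """Equation (2): Euclidean distance between two rank vectors."""
    if len(r) != len(s):
        raise ValueError("rankings must cover the same number of objects")
    return math.sqrt(sum((ri - si) ** 2 for ri, si in zip(r, s)))


def normalizing_constant(lam, num_objects):
    """C(lambda): reciprocal of the sum of exp(-lambda * d) over all rankings.

    Assumption: brute-force enumeration of all M! rankings. The sum does not
    depend on the central ranking (relabeling the objects maps any central
    ranking to the identity), so the identity ranking is used as reference.
    """
    identity = tuple(range(1, num_objects + 1))
    total = sum(
        math.exp(-lam * spearman_distance(p, identity))
        for p in permutations(identity)
    )
    return 1.0 / total


def ranking_probability(r, central, lam):
    """Equation (1): f(r | R, lambda) = C(lambda) * exp(-lambda * d(r, R))."""
    c = normalizing_constant(lam, len(r))
    return c * math.exp(-lam * spearman_distance(r, central))
```

In this sketch, a ranking equal to the central ranking receives the largest probability, and the probabilities of all M! rankings sum to one for any non-negative precision.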
The goal of the rank clustering module 204 is to:
1. Find the right number of clusters G.
2. Estimate the parameters πg, λg and Rg within each cluster.
A population may be assumed to include G clusters. The probability that an observation comes from a cluster g is πg and, given that the observation belongs to the cluster g, it is generated from a distance-based model with central ranking Rg and precision λg. Then, the model of ranking for this population is defined by equation (3) below:
$$f(r) = \sum_{g=1}^{G} \pi_g\, C(\lambda_g)\,\exp[-\lambda_g\, d(r, R_g)] \qquad (3)$$
Thus, the log-likelihood of a dataset r=(r1, r2, . . . , rn) including n rankings of M objects is provided by equation (4) below:
$$l(R, \lambda, \pi \mid r) = \sum_{i=1}^{n} \log\left\{ \sum_{g=1}^{G} \pi_g\, C(\lambda_g)\,\exp[-\lambda_g\, d(r_i, R_g)] \right\} \qquad (4)$$
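As a minimal sketch of equation (4), the mixture log-likelihood may be evaluated as shown below. The helper names and the example values are hypothetical, and the normalizing constant is again obtained by enumerating all M! rankings, which is an assumption suited only to small M.

```python
import math
from itertools import permutations


def distance(r, s):
    """Spearman distance of equation (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))


def norm_const(lam, m):
    """C(lambda) by brute-force enumeration (assumption: small m)."""
    ref = tuple(range(1, m + 1))
    return 1.0 / sum(math.exp(-lam * distance(p, ref)) for p in permutations(ref))


def mixture_log_likelihood(rankings, pis, lambdas, centrals):
    """Equation (4): sum over observations of the log of the mixture density."""
    m = len(rankings[0])
    consts = [norm_const(lam, m) for lam in lambdas]
    total = 0.0
    for r in rankings:
        density = sum(
            pi * c * math.exp(-lam * distance(r, central))
            for pi, c, lam, central in zip(pis, consts, lambdas, centrals)
        )
        total += math.log(density)
    return total
```

With a single cluster and a precision of zero, every ranking is equally likely, so each observation contributes log(1/M!) to the total; this provides a simple sanity check for the sketch.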
The rank clustering module 204 estimates the parameters πg, λg and Rg by applying the EM algorithm to the log-likelihood.
Further, some constraints may be placed on the precision parameters so as to derive some modeling families that lead to non-singular estimation. There are a few ways in which the precision parameters may be constrained in the distance-based model, and this aspect provides a large range of modeling flexibility. Accordingly, the rank clustering module 204 considers models with the following constraints on the precision parameters:
The rank clustering module 204 further includes an EM module 404, a selection module 406, a characterization module 408 and a prediction module 410. The EM module 404 executes the EM algorithm on the data stored in the storage 402 to estimate the precision parameters. Next, the selection module 406 uses an information-theoretic criterion for choosing the best model based on the estimated parameters obtained from the EM module 404. Then, the characterization module 408 is used to characterize present users' preferences based on their group/cluster membership. The prediction module 410 is used to predict the group/cluster of new users.
If a new user's background information is not known, the prediction module 410 predicts their group/cluster by using preference data of the new user. However, if a new user's preferences are not known, then the characterization module 408 predicts their group/cluster by using the new user's background information. Therefore, the characterization module 408 and the prediction module 410 are complementary to each other.
Each of the EM module 404, the selection module 406, the characterization module 408 and the prediction module 410 is explained in further detail in conjunction with
At step 504, the EM module 404 takes a number of clusters in the data set as G, and then at step 506 performs the EM algorithm to estimate the parameters πg, λg and Rg as described below (steps 1-7). The EM algorithm also uses latent variables z, which record the cluster membership of each observation. The latent variable z=(z1, z2, . . . , zn) is defined such that zig=1 if the ith observation belongs to cluster g, and zig=0 otherwise.
Step 1—Initialize λg, πg, Rg>0, g=1, . . . , G
Step 2—Repeat E Step (defined by equation 6 below) and M step (defined by equations 7, 8, 9, 10, 11) until likelihood converges as defined by equation (5) below.
$$\left| L^{(t+1)} - L^{(t)} \right| \le \varepsilon \qquad (5)$$
Step 3 (E step)
Step 4 (M step)—For each g, update πg, λg and Rg so as to maximize the likelihood:
For clusters with unrestricted λg values,
where,
the left-hand side summation is taken over all possible rankings r.
For clusters with identical λg=λ values,
where,
summation is taken over all those g for which clusters are restricted to have equal precision.
Step 5—Likelihood,
$$L^{(t+1)} = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g^{(t+1)}\, C(\lambda_g^{(t+1)})\,\exp[-\lambda_g^{(t+1)}\, d(r_i, R_g^{(t+1)})] \qquad (11)$$
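Step 6—The complete data likelihood is given by equation (12) below.
$$L_c(R, \lambda, \pi \mid r, z) = \prod_{i=1}^{n} \prod_{g=1}^{G} \left[\pi_g\, C(\lambda_g)\,\exp[-\lambda_g\, d(r_i, R_g)]\right]^{z_{ig}} \qquad (12)$$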
Step 7—The complete data log-likelihood is given by equation (13) below.
$$l_c(R, \lambda, \pi \mid r, z) = \sum_{i=1}^{n} \sum_{g=1}^{G} z_{ig}\left[\log \pi_g + \log C(\lambda_g) - \lambda_g\, d(r_i, R_g)\right] \qquad (13)$$
Finally, applying the EM algorithm to the complete data log-likelihood provides estimates of the values of πg, λg, and Rg.
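Part of one EM iteration over the steps above may be sketched as follows. This sketch rests on assumptions: the E step is taken to be the conventional mixture-model posterior membership probability, and the update of πg is taken to be the conventional average of the membership probabilities; neither formula is reproduced in the text above. The updates of λg and Rg are omitted because their equations are likewise not reproduced here. All function names are hypothetical.

```python
import math
from itertools import permutations


def distance(r, s):
    """Spearman distance of equation (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))


def norm_const(lam, m):
    """C(lambda) by brute-force enumeration (assumption: small m)."""
    ref = tuple(range(1, m + 1))
    return 1.0 / sum(math.exp(-lam * distance(p, ref)) for p in permutations(ref))


def e_step(rankings, pis, lambdas, centrals):
    """Assumed E step: z[i][g] is the posterior probability that
    observation i was generated by cluster g under the current parameters."""
    m = len(rankings[0])
    consts = [norm_const(lam, m) for lam in lambdas]
    z = []
    for r in rankings:
        weights = [
            pi * c * math.exp(-lam * distance(r, central))
            for pi, c, lam, central in zip(pis, consts, lambdas, centrals)
        ]
        total = sum(weights)
        z.append([w / total for w in weights])
    return z


def update_mixing_proportions(z):
    """Assumed update of pi_g: mean membership probability of cluster g."""
    n = len(z)
    return [sum(row[g] for row in z) / n for g in range(len(z[0]))]
```

Each row of the returned matrix sums to one, and a ranking that coincides with the central ranking of a cluster receives most of its membership weight from that cluster when the precisions are equal.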
At step 508, the selection module 406 selects a model for the user behavior. The selection module 406 computes the Bayesian Information Criterion (BIC) for the specified G with the estimated parameters obtained from the EM module 404. The BIC provides an approximation to the Bayes factor for model selection; it is twice the maximized log-likelihood minus a penalty term, as shown by equation (14) below.
$$\mathrm{BIC} = 2\,l(\hat{\theta}) - p \log n \qquad (14)$$
where θ is the set of parameters of the model, θ̂ is its estimate, p is the number of free parameters to be estimated, and n is the number of observed rankings.
Next, at step 510, the selection module 406 evaluates equation (14) for different values of G and compares the corresponding BICs obtained for all values of G. Then, at step 512, the selection module 406 takes the model with the highest BIC as the best one, and at step 514, the selection module 406 uses the estimated latent variables of the best model to build the classifier. Suppose the best model thus chosen has the estimated parameters λ̂ and R̂. Further, the latent variables are also estimated as ẑ, which is used to define the classifier. ẑ is an n×Ĝ matrix, where Ĝ is the number of clusters in the best model. The classifier is used to characterize/classify existing users and new users to specific clusters.
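The BIC comparison of steps 508 to 512 may be illustrated with the following minimal sketch. The function names, the log-likelihood values, and the free-parameter counts in the usage example are hypothetical placeholders, not results from the disclosure.

```python
import math


def bic(max_log_likelihood, num_free_params, num_observations):
    """Equation (14): twice the maximized log-likelihood minus p * log(n)."""
    return 2.0 * max_log_likelihood - num_free_params * math.log(num_observations)


def select_best_model(candidates, num_observations):
    """Pick the number of clusters G whose fitted model has the highest BIC.

    candidates maps G to a pair (maximized log-likelihood, number of free
    parameters), both assumed to come from an EM fit for that G.
    """
    scores = {
        g: bic(log_lik, p, num_observations)
        for g, (log_lik, p) in candidates.items()
    }
    best_g = max(scores, key=scores.get)
    return best_g, scores


# Hypothetical fits for G = 1, 2, 3 on n = 30 rankings.
example_candidates = {1: (-60.0, 1), 2: (-50.0, 3), 3: (-49.5, 5)}
best_g, example_scores = select_best_model(example_candidates, 30)
```

In this example the model with three clusters has the largest log-likelihood, but its additional parameters are penalized, so the two-cluster model attains the highest BIC.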
Thereafter, at step 516, the characterization module 408 determines ranking preference of each cluster that characterizes present users' preferences based on their cluster membership. Each row of ẑ corresponding to the best model has Ĝ elements. If in the ith row, the maximum value of ẑ occurs at the jth position, then the characterization module 408 assigns observation i to the cluster j.
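The row-wise assignment rule of step 516 may be sketched as below. The function name is hypothetical, cluster indices are zero-based in this sketch, and ties are resolved in favor of the lowest index, which is an assumption not addressed in the disclosure.

```python
def assign_clusters(z_hat):
    """Assign each observation to the cluster with the largest entry in its
    row of the estimated latent-variable matrix (zero-based cluster index)."""
    assignments = []
    for row in z_hat:
        # max() returns the first index attaining the maximum, so ties go
        # to the lowest-numbered cluster (assumption).
        assignments.append(max(range(len(row)), key=row.__getitem__))
    return assignments
```

For instance, a matrix with rows (0.1, 0.9) and (0.7, 0.3) assigns the first observation to cluster index 1 and the second to cluster index 0.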
When a new dataset r̃ is received at step 518, the prediction module 410 computes the corresponding estimate of ẑ with Ĝ, r̃, λ̂ and R̂. Thereafter, the prediction module 410 uses the classifier at step 520 to assign the observation to a corresponding cluster based on the new estimate of z at step 522. The prediction module 410 is explained in further detail in conjunction with
The user preferences may change over time. In such situations, the method 500 is repeated periodically to update the model (e.g., once every month, given that cloud users' preferences change over a period of at least months or years, and a given customer group's preferences often change over a decade). However, the method 500 can also be repeated at a finer granularity, depending on the system designer's choice and the frequency of changes in customers' preferences. The method 500 has a complexity of O(log N), where N is the total number of cloud users whose preferences are taken as input to perform the behavior modeling.
For example, in graph 600, out of 30 rankings, performance received the largest number of highest preferences (i.e., 19), and location received none. Cost has the highest preference in five instances, while energy and security each have three instances of highest preference. This suggests that the cluster includes gold customers from a big enterprise in a developed economy. Similarly, the graph 602 represents a cluster including government users where security is the key requirement, the graph 604 represents a cluster including green energy companies where energy is the key requirement, and the graph 606 represents SMBs or customers from developing countries where cost is the key requirement.
It will be appreciated that several of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.