Most multimedia providers attempt in some form or fashion to provide suggestions to their subscribers in hopes of increasing a subscriber's consumption of multimedia content. Since subscriptions are generally tied to revenue models that yield more monetary gain with increased use, a system that can provide relevant suggestions to a user can dramatically increase sales. Typical systems employ techniques that use historical data associated with a user or groups of users to determine what they might like to view next. However, when these types of systems do not have access to historical data, they tend to slow down and suggest wildly irrelevant selections, forcing a user to progress through a long series of questions or suggestions before the system has “learned” the user. This often frustrates users, who quit using the system and seek out other means to find multimedia content to watch.
A method for providing recommendations to a user in a setting where the expected gain or value of a multimedia content suggestion is initially unknown is created using an adaptive process based on submodular maximization. This provides an efficient approach for making suggestions to a user in fewer steps, causing less aggravation to the user. The method is referred to as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle.
In one embodiment, user preferences are elicited in a recommender system for multimedia content. The method presented includes the first near-optimal technique for learning how to elicit user preferences while eliciting them. Initially, the method has an uncertain model of the world based on how users tend to answer questions. When a new user uses the method, it elicits the preferences of that user based on a combination of the existing model and exploration, asking questions that may not be optimal but that allow the method to learn how to better elicit preferences. The more users use the method, the better it becomes at preference elicitation, ultimately behaving near optimally within a short amount of time.
The above presents a simplified summary of the subject matter in order to provide a basic understanding of some aspects of subject matter embodiments. This summary is not an extensive overview of the subject matter. It is not intended to identify key/critical elements of the embodiments or to delineate the scope of the subject matter. Its sole purpose is to present some concepts of the subject matter in a simplified form as a prelude to the more detailed description that is presented later.
To the accomplishment of the foregoing and related ends, certain illustrative aspects of embodiments are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the subject matter can be employed, and the subject matter is intended to include all such aspects and their equivalents. Other advantages and novel features of the subject matter can become apparent from the following detailed description when considered in conjunction with the drawings.
The subject matter is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the subject matter. It can be evident, however, that subject matter embodiments can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the embodiments.
Maximization of submodular functions has wide applications in machine learning and artificial intelligence, such as social network analysis, sensor placement, and recommender systems. The problem of adaptive submodular maximization is discussed in detail below and is a variant of submodular maximization where each item has a state and this state is revealed when the item is chosen. The goal is to learn a policy that maximizes the expected return for choosing K items. Adaptive submodular maximization has been traditionally studied in a setting where the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. This is the first method where the model is initially unknown, and it is learned by interacting repeatedly with the environment. The concepts of adaptive submodular maximization and bandits are brought together, and the result is an efficient solution to the problem.
The recommendation engine 106 interacts with the response analyzer 104 to adaptively determine subsequent maximized diverse user inquiries based on prior user inputs to efficiently learn user preferences. The response analyzer 104 interacts with a user 112 to pose inquiries and receive responses from the user 112. Information derived from the user interactions facilitates constructing the recommendation engine 106. The technique utilized by the recommender 102 allows for diverse questions to be asked of the user 112 in order to ascertain preferences as quickly as possible. This greatly reduces user frustration when using the recommendation system 100 for the first time. The questions posed to the user 112 are vetted by the technique to maximize the value of each question in relation to establishing the user's preferences in as few questions as possible. This means users can avoid responding to a long list of canned questions such as “what is your gender, age, location, income, prior listening/watching habits, etc.”
For example, it could be determined that, out of 100 types of music genres, a majority of users prefer one of three types: pop, country, or rock. Thus, a first question with the greatest chance of finding a user's likes can be directed to which of these three genre types the user prefers, greatly narrowing down subsequent questions to the user. It is also possible that the user could respond with “none,” which means the assumption was incorrect. However, the question about the three genre types has the highest preference determination value in that it has a high probability of quickly narrowing down the likes of the user and, therefore, is worth the risk that the user might respond with “none of the above.” The technique then continues to determine further questions that most rapidly lead to a proper recommendation. The method to employ this technique is discussed in detail as follows.
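As a toy illustration of this vetting step (the genre probabilities, candidate questions, and function names below are hypothetical and are not part of the described technique, which instead uses the optimistic indices detailed later), a candidate first question can be scored by how likely a random user is to like at least one of the genres it asks about:

# Toy sketch: pick the first question with the highest chance of a useful answer.
genre_prob = {"pop": 0.35, "country": 0.25, "rock": 0.22, "jazz": 0.05, "opera": 0.02}

def answer_probability(question):
    """Probability that a user likes at least one genre in the question,
    assuming (for this toy example only) independent genre preferences."""
    miss = 1.0
    for g in question:
        miss *= 1.0 - genre_prob[g]
    return 1.0 - miss

questions = [("pop", "country", "rock"), ("jazz", "opera"), ("rock", "jazz")]
best = max(questions, key=answer_probability)
print(best, round(answer_probability(best), 2))   # ('pop', 'country', 'rock') 0.62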
Four aspects of the method are explained. First, a model is used in which the expected gain of choosing an item can be learned efficiently. The main assumption in the model is that the state of each item is distributed independently of the other states. Second, Optimistic Adaptive Submodular Maximization (OASM), a bandit approach that selects items with the highest upper confidence bound on the expected gain, is shown. This approach is computationally efficient and easy to implement. Third, the expected cumulative regret of the approach is proven to increase logarithmically with time. The regret bound captures an inherent property of adaptive submodular maximization: earlier mistakes are more costly than later ones. Finally, the method is applied to a real-world preference elicitation problem, showing that non-trivial policies can be learned from just a few hundred interactions with the problem.
In adaptive submodular maximization, the objective is to maximize, under constraints, a function of the form:
ƒ: 2^I × {−1,1}^L → ℝ, (1)
where I={1, . . . , L} is a set of L items and 2^I is its power set. The first argument of ƒ is a subset of chosen items A⊂I. The second argument is the state φ ∈ {−1,1}^L of all items. The i-th entry of φ, φ[i], is the state of item i. The state φ is drawn i.i.d. from some probability distribution P(Φ). The reward for choosing items A in state φ is ƒ(A, φ). For simplicity of exposition, assume that ƒ(∅, φ)=0 in all φ. In problems of interest, the state is only partially observed. To capture this phenomenon, the notion of observations is introduced. An observation is a vector y ∈ {−1,0,1}^L whose non-zero entries are the observed states of items. It is said that y is an observation of state φ, written φ˜y, if y[i]=φ[i] in all non-zero entries of y. Alternatively, the state φ can be viewed as a realization of y, one of many. It is denoted by dom(y)={i: y[i]≠0} the set of observed items in y and by φ<A> the observation of items A in state φ. A partial ordering on observations is defined: it is written y′ ⪰ y if y′[i]=y[i] in all non-zero entries of y, that is, y′ is a more specific observation than y. In terminology of the art, y is a subrealization of y′.
The notation is illustrated on a simple example. Let φ=(1,1,−1) be a state, and y1=(1,0,0) and y2=(1,0,−1) be observations. Then all of the following claims are true:
φ˜y1, φ˜y2, y2 ⪰ y1, dom(y2)={1,3}, φ<{1,3}>=y2, φ<dom(y1)>=y1.
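The following is a small illustrative sketch of this notation in code (the helper names are invented for the example, and the indexing is zero-based rather than one-based):

import numpy as np

phi = np.array([1, 1, -1])        # the state from the example
y1  = np.array([1, 0, 0])         # an observation
y2  = np.array([1, 0, -1])        # a more specific observation

def dom(y):
    """Observed items dom(y) = {i : y[i] != 0} (zero-based indices)."""
    return set(np.flatnonzero(y))

def is_realization(phi, y):
    """phi ~ y: y agrees with phi on all observed (non-zero) entries."""
    mask = y != 0
    return bool(np.all(phi[mask] == y[mask]))

def observe(phi, items):
    """phi<A>: the observation of items A in state phi."""
    y = np.zeros_like(phi)
    for i in items:
        y[i] = phi[i]
    return y

def more_specific(y_a, y_b):
    """y_a is at least as specific as y_b: y_a agrees with y_b wherever y_b is observed."""
    mask = y_b != 0
    return bool(np.all(y_a[mask] == y_b[mask]))

assert is_realization(phi, y1) and is_realization(phi, y2)
assert more_specific(y2, y1)
assert dom(y2) == {0, 2}                      # items 1 and 3 in one-based notation
assert np.array_equal(observe(phi, {0, 2}), y2)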
The goal is to maximize the expected value of ƒ by adaptively choosing K items. This problem can be viewed as a K-step game, where at each step an item is chosen according to some policy π and then its state is observed. A policy π: {−1,0,1}^L → I is a function from observations y to items. The observations represent the past decisions and their outcomes. A k-step policy in state φ, πk(φ), is the collection of the first k items chosen by policy π. The policy is defined recursively as:
πk(φ) = πk−1(φ) ∪ {π[k](φ)},  π[k](φ) = π(φ<πk−1(φ)>),  π0(φ) = ∅, (2)
where π[k](φ) is the k-th item chosen by policy π in state φ. The optimal K-step policy satisfies:

π* = arg max_π E_φ[ƒ(πK(φ), φ)]. (3)
In general, the problem of computing π* is NP-hard. However, near-optimal policies can be computed efficiently when the maximized function has a diminishing return property. Formally, it is required that the function is adaptive submodular and adaptive monotonic.
Definition 1. Function ƒ is adaptive submodular if:

E_φ[ƒ(A∪{i}, φ) − ƒ(A, φ) | φ˜yA] ≧ E_φ[ƒ(B∪{i}, φ) − ƒ(B, φ) | φ˜yB]

for all items i ∈ I\B and observations yB ⪰ yA, where A=dom(yA) and B=dom(yB).

Definition 2. Function ƒ is adaptive monotonic if:

E_φ[ƒ(A∪{i}, φ) − ƒ(A, φ) | φ˜yA] ≧ 0

for all items i ∈ I\A and observations yA, where A=dom(yA).
In other words, the expected gain of choosing an item is always non-negative and does not increase as the observations become more specific. Let πg be the greedy policy for maximizing ƒ, a policy that always selects the item with the highest expected gain:

πg(y) = arg max_{i ∉ dom(y)} gi(y), (4)

where:

gi(y) = E_φ[ƒ(dom(y)∪{i}, φ) − ƒ(dom(y), φ) | φ˜y] (5)
is the expected gain of choosing item i after observing y. Then, πg is a (1−1/e)-approximation to π*, that is, E_φ[ƒ(πKg(φ), φ)] ≧ (1−1/e) E_φ[ƒ(πK*(φ), φ)], if ƒ is adaptive submodular and adaptive monotonic. An observation y is referred to as a context if it can be observed under the greedy policy πg; specifically, if there exist k and φ such that y=φ<πkg(φ)>.
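A minimal sketch of such a greedy policy is given below, assuming the model P(Φ) is known. The expected gain of Equation 5 is approximated here by Monte Carlo sampling, and the reward function ƒ and the state sampler are placeholders supplied by the caller rather than elements of the source:

import numpy as np

def expected_gain(f, y, i, sample_state, n_samples=1000):
    """Monte Carlo estimate of g_i(y): the average marginal gain of adding
    item i, over states phi consistent with the observation y."""
    A = set(np.flatnonzero(y))
    total = 0.0
    for _ in range(n_samples):
        phi = sample_state(y)              # draw phi ~ y from the known P(Phi)
        total += f(A | {i}, phi) - f(A, phi)
    return total / n_samples

def greedy_policy(f, sample_state, L, K):
    """Run pi^g for K steps against a randomly drawn hidden state."""
    phi = sample_state(np.zeros(L, dtype=int))   # no constraints: a draw from P(Phi)
    y = np.zeros(L, dtype=int)                   # nothing observed yet
    for _ in range(K):
        candidates = [i for i in range(L) if y[i] == 0]
        i_star = max(candidates, key=lambda i: expected_gain(f, y, i, sample_state))
        y[i_star] = phi[i_star]                  # the state of the chosen item is revealed
    return set(np.flatnonzero(y)), phi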
Adaptive Submodularity in Bandit Setting
The greedy policy πg can be computed only if the objective function ƒ and the distribution of states P(Φ) are known, because both of these quantities are needed to compute the marginal benefit gi(y) (Equation 5). In practice, the distribution P(Φ) is often unknown, for instance in a newly deployed sensor network where the failure rates of the sensors are unknown. A natural variant of adaptive submodular maximization is explored that can model such problems. The distribution P(Φ) is assumed to be unknown and is learned by interacting repeatedly with the problem.
Recommendation Engine
The problem of learning P(Φ) can be cast in many ways. One approach is to directly learn the joint P(Φ). This approach is not practical for two reasons. First, the number of states φ is exponential in the number of items L. Second, the state of the problem is observed only partially. As a result, it is generally impossible to identify the distribution that generates φ. Another possibility is to learn the probability of individual states φ[i] conditioned on context, that is, on the observations y that can arise under the greedy policy πg in up to K steps. This is impractical because the number of contexts is exponential in K.
Clearly, additional structural assumptions are necessary to obtain a practical solution. It is assumed that the states of items are independent of the context in which the items are chosen. In particular, the state φ[i] of each item i is drawn independently from a Bernoulli distribution with mean pi. In this setting, the joint probability distribution factors as:

P(φ) = ∏_{i=1}^{L} Pi(φ[i]), (6)
and the problem of learning P(Φ) reduces to estimating L parameters, the means of the Bernoulli distributions. A natural question is how restrictive the independence assumption is. It is argued that this assumption is fairly natural in many applications. For instance, consider a sensor network where the sensors fail at random due to manufacturing defects. The failures of these sensors are independent of each other and, thus, can be modeled in this framework. To validate the assumption, an experiment is conducted below showing that it does not greatly affect the performance of the method on a real-world problem. Correlations obviously exist and are discussed below.
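The snippet below is an illustrative sketch of this factored model (the means are invented for the example): sampling a state draws each entry independently, and learning P(Φ) reduces to counting how often each item is observed in state 1.

import numpy as np

rng = np.random.default_rng(0)
L = 5
p = np.array([0.8, 0.6, 0.5, 0.2, 0.1])   # hypothetical per-item means p_i

def sample_state(p, rng):
    """Draw a state phi in {-1, 1}^L with P(phi[i] = 1) = p[i], independently per item."""
    return np.where(rng.random(len(p)) < p, 1, -1)

# With fully observed sample states, learning P(Phi) is just counting.
states = np.array([sample_state(p, rng) for _ in range(10_000)])
p_hat = (states == 1).mean(axis=0)
print(np.round(p_hat, 2))   # close to p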
Based on the independence assumption, the expected gain (Equation 5) can be rewritten as:

gi(y) = pi ḡi(y), (7)

where:

ḡi(y) = E_φ[ƒ(dom(y)∪{i}, φ) − ƒ(dom(y), φ) | φ˜y, φ[i]=1] (8)
is the expected gain when item i is in state 1. For simplicity of exposition, it is assumed that the gain is zero when the item is in state −1.
In general, the
The learning problem comprises n episodes. In episode t, K items are adaptively chosen according to some policy πt, which may differ from episode to episode. The quality of the policies is measured by the expected cumulative K-step return E_{φ1, . . . , φn}[Σ_{t=1}^{n} ƒ(πKt(φt), φt)].
In maximum coverage problems, the greedy policy πg is a good surrogate for the optimal policy π* because it is a (1−1/e)−approximation to π*.
Table 1: Technique 1 (OASM). The pseudocode comprises four parts: initialization of the per-item statistics; K-step maximization in each episode; at each step, choosing the item with the highest index; and updating the statistics with the observed states.
The technique is designed based on the optimism in the face of uncertainty principle, a strategy that is at the core of many bandit approaches. More specifically, it is a greedy policy where the expected gain gi(y) (Equation 7) is replaced by its optimistic estimate. The technique adaptively maximizes a submodular function in an optimistic fashion and therefore is referred to as Optimistic Adaptive Submodular Maximization (OASM).
The pseudocode of the method is given in Table 1: Technique 1 above. In each episode, the function ƒ is maximized in K steps. At each step, the index (p̂i,Ti(t−1) + ct,Ti(t−1)) ḡi(y) is computed for each item i that has not yet been chosen, and the item with the highest index is selected. Here p̂i,s = (1/s) Σ_{z=1}^{s} 1{φτ(i,z)[i]=1} is the empirical probability that item i is observed in state 1 and ct,s is a confidence radius, where s is the number of times that item i is chosen and τ(i,z) is the index of the episode in which item i is chosen for the z-th time. In episode t, s is set to Ti(t−1), the number of times that item i is selected in the first t−1 episodes. The radius ct,s is designed such that each index is, with high probability, an upper bound on the corresponding gain. The index enforces exploration of items that have not been chosen very often. As the number of past episodes increases, all confidence intervals shrink and the method starts exploiting the most profitable items. The log(t) term in the radius guarantees that each item is explored infinitely often as t→∞, to avoid linear regret.
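A hedged sketch of this episodic loop is shown below. It assumes a UCB1-style radius ct,s = √(2 log t / s), which is consistent with the log(t) behavior described above but is not necessarily the exact radius used, and the function gbar(y, i) stands in for the known gain of Equation 8:

import math
import numpy as np

class OASM:
    """Sketch of the optimistic greedy loop; gbar(y, i) is the known gain
    of Equation 8 and is supplied by the caller."""

    def __init__(self, L, K, gbar):
        self.L, self.K, self.gbar = L, K, gbar
        self.counts = np.zeros(L)       # T_i: number of times item i was chosen
        self.successes = np.zeros(L)    # number of times item i was observed in state 1

    def _index(self, y, i, t):
        s = self.counts[i]
        if s == 0:
            return float("inf")         # force each item to be tried at least once
        p_hat = self.successes[i] / s                      # empirical probability of state 1
        radius = math.sqrt(2.0 * math.log(max(t, 2)) / s)  # assumed UCB1-style radius c_{t,s}
        return (p_hat + radius) * self.gbar(y, i)

    def run_episode(self, t, reveal_state):
        """Adaptively choose K items in episode t. reveal_state(i) returns
        the state of item i in {-1, 1} for the current episode."""
        y = np.zeros(self.L, dtype=int)
        for _ in range(self.K):
            candidates = [i for i in range(self.L) if y[i] == 0]
            i_star = max(candidates, key=lambda i: self._index(y, i, t))
            state = reveal_state(i_star)
            y[i_star] = state
            self.counts[i_star] += 1
            self.successes[i_star] += (state == 1)
        return y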
Approach OASM has several notable properties. First, it is a greedy method. Therefore, the policies can be computed very fast. Second, it is guaranteed to behave near optimally as the estimates of the gain gi(y) become more accurate. Finally, the technique learns only L parameters and, therefore, is quite practical. Specifically, note that if an item is chosen in one context, it helps in refining the estimate of the gain gi(·) in all other contexts.
Analysis
An upper bound on the expected cumulative regret of approach OASM in n episodes is shown. Before the main result is presented, notation used in the analysis is defined. It is denoted by i*(y)=πg(y) the item chosen by the greedy policy πg in context y. Without loss of generality, it is assumed that this item is unique in all contexts. The hardness of discriminating between items i and i*(y) is measured by a gap between the expected gains of the items:
Δi(y) = gi*(y)(y) − gi(y). (11)
The analysis is based on counting how many times the policies πt and πg choose a different item at step k. Therefore, several variables are defined that describe the state of the problem at this step. It is denoted by Yk(π) = ∪_φ {φ<πk−1(φ)>} the set of all possible observations after policy π is executed for k−1 steps. It is written Yk = Yk(πg) and Ykt = Yk(πt) when the policies πg and πt are referred to, respectively. Finally, it is denoted by Yk,i = Yk ∩ {y: i ≠ i*(y)} the set of contexts where item i is suboptimal at step k.
The main result is Theorem 1. The terms item and arm are treated as synonyms, and whichever is more appropriate in a given context is used.
Theorem 1. The expected cumulative regret of approach OASM in n episodes is bounded by the sum of Σ_{i=1}^{L} li Σ_{k=1}^{K} αi,k Gk and a term that does not grow with n, where Gk = (K−k+1) max_{y∈Yk} gi*(y)(y) is an upper bound on the expected loss of deviating from the greedy policy πg at step k; li,k is the number of pulls after which arm i is not likely to be pulled suboptimally at step k, and grows logarithmically with n; li = maxk li,k; and

αi,k = [li,k − max_{j<k} li,j]+ / li

is a weight that associates the regret of arm i to step k such that Σk=1K αi,k = 1.
Proof. The theorem is proved in three steps. First, the regret in episode t is associated with the first step where the policy πt selects a different item from the greedy policy πg. For simplicity, suppose that this step is step k. Then the regret in episode t can be written as:

ƒ(πKg(φt), φt) − ƒ(πKt(φt), φt) = [Fk→g(φt) + ƒ(πk−1g(φt), φt)] − [Fk→t(φt) + ƒ(πk−1t(φt), φt)] = Fk→g(φt) − Fk→t(φt),

where the last equality is due to the assumption that π[j]g(φt)=π[j]t(φt) for all j<k; and Fk→g(φt) and Fk→t(φt) are the gains of the policies πg and πt, respectively, in state φt from step k forward. In practice, the first step where the policies πt and πg choose a different item is unknown, because πg is unknown. In this case, the regret can be written as:

Σ_{i=1}^{L} Σ_{k=1}^{K} 1i,k,t(φt) [Fk→g(φt) − Fk→t(φt)],

where:

1i,k,t(φ) = 1{∀j<k: π[j]t(φ)=π[j]g(φ), π[k]t(φ)≠π[k]g(φ), π[k]t(φ)=i} (15)
is the indicator of the event that the policies πt and πg choose the same first k−1 items in state φ, disagree in the k-th item, and i is the k-th item chosen by πt. The commas in the indicator function represent logical conjunction.
Second, the expected loss associated with choosing the first different item at step k is bounded by the probability of this event and an upper bound on the expected loss, Gk, which does not depend on πt and φt. Based on this result, the expected cumulative regret is bounded from above by:

Σ_{i=1}^{L} Σ_{k=1}^{K} Gk Σ_{t=1}^{n} E_{φt}[1i,k,t(φt)].
Finally, motivated by the analysis of UCB1, the indicator 1i,k,t(φt) is rewritten as:
1i,k,t(φt) = 1i,k,t(φt) 1{Ti(t−1) ≦ li,k} + 1i,k,t(φt) 1{Ti(t−1) > li,k}, (17)
where li,k is a problem-specific constant. li,k is chosen such that arm i at step k is pulled suboptimally only a constant number of times in expectation after li,k pulls. Based on this result, the regret corresponding to the events 1{Ti(t−1)>li,k} is bounded by a quantity that does not grow with the number of episodes n.
On the other hand, the regret associated with the events 1{Ti(t−1)≦li,k} is trivially bounded by Σi=1L Σk=1K Gk li,k. A tighter upper bound is proved below:

Σ_{i=1}^{L} Σ_{k=1}^{K} Gk Σ_{t=1}^{n} E_{φt}[1i,k,t(φt) 1{Ti(t−1) ≦ li,k}] ≦ Σ_{i=1}^{L} li Σ_{k=1}^{K} αi,k Gk.

The last inequality can be proved as follows. The upper bound on the expected loss at step k, Gk, is monotonically decreasing with k, and therefore G1 ≧ G2 ≧ . . . ≧ GK. So, for any given arm i, the highest cumulative regret subject to the constraint Ti(t−1) ≦ li,k at step k is achieved as follows. The first li,1 mistakes are made at the first step, [li,2 − li,1]+ mistakes are made at the second step, [li,3 − max{li,1, li,2}]+ mistakes are made at the third step, and so on. Specifically, the number of mistakes attributed to step k is [li,k − max_{j<k} li,j]+ = αi,k li, and the regret of these mistakes is at most Gk αi,k li; summing over steps k and arms i yields the claimed bound.
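As a purely hypothetical numeric illustration of this counting argument (the values of li,k below are invented for the example), consider a single arm i with K = 3:

% Hypothetical constants for one arm i with K = 3 steps:
%   l_{i,1} = 5, \quad l_{i,2} = 3, \quad l_{i,3} = 8, \qquad l_i = \max_k l_{i,k} = 8.
\[
[l_{i,1}]^{+} = 5, \qquad
[l_{i,2} - l_{i,1}]^{+} = [3 - 5]^{+} = 0, \qquad
[l_{i,3} - \max\{l_{i,1}, l_{i,2}\}]^{+} = [8 - 5]^{+} = 3,
\]
\[
\alpha_{i,1} = \tfrac{5}{8}, \qquad \alpha_{i,2} = 0, \qquad \alpha_{i,3} = \tfrac{3}{8},
\qquad \alpha_{i,1} + \alpha_{i,2} + \alpha_{i,3} = 1,
\]
% so the contribution of arm i to the bound is
\[
l_i \sum_{k=1}^{3} \alpha_{i,k} G_k = 5G_1 + 3G_3
\;\le\; G_1 l_{i,1} + G_2 l_{i,2} + G_3 l_{i,3} = 5G_1 + 3G_2 + 8G_3 .
\]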
Approach OASM mimics the greedy policy πg. Therefore, it was decided to prove Theorem 1 based on counting how many times the policies πt and πg choose a different item. The proof has three parts. First, associate the regret in episode t with the first step where the policy πt chooses a different item from πg. Second, bound the expected regret in each episode by the probability of deviating from the policy πg at step k and an upper bound on the associated loss Gk, which depends only on k. Finally, divide the expected cumulative regret into two terms, before and after item i at step k is selected a sufficient number of times li,k, and then set li,k such that both terms are O(log n). It is stressed that the proof is relatively general. In the rest of the proof, it is only assumed that ƒ is adaptive submodular and adaptive monotonic.
The regret bound has several notable properties. First, it is logarithmic in the number of episodes n, through the problem-specific constants li,k. So, a classical result from the bandit literature is recovered. Second, the bound is polynomial in all constants of interest, such as the number of items L and the number of maximization steps K in each episode. It is stressed that it is not linear in the number of contexts in YK at step K, which is exponential in K. Finally, note that the bound captures the shape of the optimized function ƒ. In particular, because the function ƒ is adaptive submodular, the upper bound on the gain of the policy πg from step k forward, Gk, decreases as k increases. As a result, earlier deviations from πg are penalized more than later ones.
Experiments
The approach is evaluated on a preference elicitation problem in a movie recommendation domain. This problem is cast as asking K yes-or-no movie-genre questions. The users and their preferences are extracted from the MovieLens dataset, a dataset of about six thousand users and one million movie ratings. The 500 most rated movies were chosen from the dataset. Each movie l is represented by a feature vector xl such that xl[i]=1 if the movie belongs to genre i and xl[i]=0 if it does not. The preference of user j for genre i is measured by tf-idf, a popular importance score in information retrieval. In particular, it is defined as the product #(j, i) log(nu/#(•, i)),
where #(j, i) is the number of movies from genre i rated by user j, nu is the number of users, and #(•, i) is the number of users that rated at least one movie from genre i. Intuitively, this score prefers genres that are often rated by the user but rarely rated overall. Each user j is represented by a genre preference vector φ such that φ[i]=1 when genre i is among the five most favorite genres of the user. These genres cover on average 25% of the selected movies. In Table 2, several popular genres from the selected dataset are shown. These include eight movie genres that cover the largest number of movies in expectation.
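A hedged sketch of this preference construction is shown below; the input data structures and the exact tf-idf form #(j, i)·log(nu/#(•, i)) are assumptions of the sketch rather than a verbatim restatement of the source:

import math
from collections import defaultdict

def genre_preferences(ratings, genres, top=5):
    """ratings: user id -> set of rated movie ids (hypothetical input).
    genres:  movie id -> set of genre ids (hypothetical input).
    Returns, for each user, the `top` genres by the tf-idf-style score."""
    n_u = len(ratings)
    users_per_genre = defaultdict(set)     # users that rated >= 1 movie of each genre
    for j, movies in ratings.items():
        for l in movies:
            for i in genres[l]:
                users_per_genre[i].add(j)
    prefs = {}
    for j, movies in ratings.items():
        score = defaultdict(float)
        for l in movies:
            for i in genres[l]:
                score[i] += 1.0            # counts #(j, i)
        for i in score:
            score[i] *= math.log(n_u / len(users_per_genre[i]))   # idf term
        prefs[j] = set(sorted(score, key=score.get, reverse=True)[:top])
    return prefs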
The reward for asking user φ questions A is:

ƒ(A, φ) = (1/500) Σ_{l=1}^{500} 1{∃ i ∈ A: xl[i]=1, φ[i]=1},

the fraction of movies that belong to at least one genre i that is preferred by the user and queried in A. The function ƒ captures the notion that knowing more preferred genres is better than knowing fewer. It is submodular in A for any given preference vector φ, and therefore adaptive submodular in A when the preferences are distributed independently of each other (Equation 6). In this setting, the expected value of ƒ can be maximized near optimally by a greedy policy (Equation 4).
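A minimal sketch of this coverage reward is given below (the function and variable names are illustrative only); the small example at the end also hints at the diminishing-return property relied on above:

def coverage_reward(A, preferred_genres, movie_genres):
    """f(A, phi): fraction of movies covered by queried, preferred genres.

    A                -- set of genre ids asked about
    preferred_genres -- set of genre ids i with phi[i] = 1
    movie_genres     -- list of genre-id sets, one per movie (the x_l vectors)
    """
    hit = A & preferred_genres
    covered = sum(1 for g in movie_genres if g & hit)
    return covered / len(movie_genres)

# Example: adding a genre never decreases coverage, and the marginal gain
# shrinks as A grows.
movies = [{0}, {0, 1}, {2}, {1, 2}]
print(coverage_reward({0}, {0, 2}, movies))      # 0.5
print(coverage_reward({0, 2}, {0, 2}, movies))   # 1.0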
In the first experiment, it is shown that the assumption on P(Φ) (Equation 6) is not very restrictive in this domain. Three greedy policies for maximizing ƒ that know P(Φ) are compared; they differ in how the expected gain of choosing items is estimated. The first policy πg makes no assumption on P(Φ) and computes the gain according to Equation 5. The second policy πfg assumes that the distribution P(Φ) is factored and computes the gain using Equation 7. Finally, the third policy πdg computes the gain according to Equation 8, essentially ignoring the stochasticity of the problem. All policies are applied to all users in the dataset for all K≦L and their expected returns are reported in the drawings.
In the second experiment, how the OASM policy πt improves over time is studied. In each episode t, a new user φt is randomly chosen and then the policy πt asks K questions. The expected return of πt is compared to two offline baselines, πfg and πdg. The policies πfg and πdg can be viewed as upper and lower bounds on the expected return of πt, respectively. The results are shown in graphs 302-306 of example 300 in the drawings.
The methods described above use adaptive submodular maximization in a setting where the model of the world is initially unknown. The methods include an efficient bandit technique for solving the problem, and it is proven that the expected cumulative regret increases only logarithmically with time. This is an example of reinforcement learning (RL) for adaptive submodularity. The main difference from the typical RL setting is that near-optimal policies can be learned without estimating the value function. Learning of value functions is typically hard, even when the model of the problem is known. This is not necessary in the present problem and, therefore, very efficient learning methods are obtained.
It was assumed that the states of items are distributed independently of each other. In the experiments, this assumption was less restrictive than expected. Nevertheless, the methods can be utilized under less restrictive assumptions. In preference elicitation, for instance, the answers to questions are likely to be correlated due to many factors, such as the user's preferences, the user's mood, and the similarity of the questions. The methods above are quite general and can be extended to more complex models. Such a generalization would comprise three major steps: choosing a model, deriving a corresponding upper confidence bound on the expected gain, and finally proving an equivalent of Theorem 1.
It is assumed that the expected gain of choosing an item (Equation 7) can be written as a product of some known gain function (Equation 8) and the probability of the item's state. This assumption is quite natural in maximum coverage problems but may not be appropriate in other problems, such as generalized binary search. The upper bound on the expected regret at step k can be loose in practice because it is obtained by maximizing over all contexts. In general, it is difficult to prove a tighter bound. Such a bound would have to depend on the probability of making a mistake in a specific context at step k, which depends on the policy in that episode and, indirectly, on the progress of learning in all earlier episodes.
In view of the exemplary systems shown and described above, methodologies that can be implemented in accordance with the embodiments will be better appreciated with reference to the flow charts of the drawings.
What has been described above includes examples of the embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the embodiments, but one of ordinary skill in the art can recognize that many further combinations and permutations of the embodiments are possible. Accordingly, the subject matter is intended to embrace all such alterations, modifications and variations. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.