There are various situations in which correlating an interest with another interest can be useful. For example, at some e-commerce sites, shoppers receive recommendations based on previous purchases. A shopper who has purchased Disney-branded video games, for instance, may receive a suggestion to purchase Disney-branded toys as well. Relevant suggestions of this kind may generate increased sales.
The generation of such suggestions can involve relating one type of interest (e.g., Disney-branded video games) with another type of interest (e.g., Disney-branded toys). There are a variety of ways to relate different interests with one another.
One approach is to use ontology-based distances.
Another approach is based on attributes.
Another approach involves tagging. In this approach, a specific item (e.g., the animated Disney movie “Aladdin”) is associated with keywords and key phrases (i.e., “tags”), such as “Disney,” “animation,” “fairy tale,” etc. In this example, a user's interest in the film “Aladdin” can be estimated from the number of its tags in which the user has already shown an interest. For instance, based on the above tagging scheme, an e-commerce site may assume that a user with demonstrated interests in animation, fairy tale and Disney films would be much more interested in “Aladdin” than a user who has shown an interest in animation but in none of the other tags.
These approaches, while effective in some applications, have weaknesses. They involve the creation of ontologies, domains, attributes, tags and/or other frameworks for each topic or concept. Human intervention is typically required to construct, maintain and update such frameworks. Some products, such as movies, are more easily structured as ontologies, domains and attributes than others. Additionally, the above approaches typically require collecting at least some user data that strongly relates to the sought-after interest. It may be difficult, for example, to estimate a user's interest in Disney movies if data about the user's media and movie preferences has not been gathered.
Accordingly, alternative techniques for predicting a user's interests would be desirable.
Broadly speaking, the present invention relates to techniques for predicting an interest of a user.
One aspect of the invention pertains to determining interest in an object of interest in a given situation. In the given situation, interest in a first object of interest is unknown but interest in a second object of interest is known. Data is obtained. Data can, for example, include documents from the Internet or other forms of information from a network or database. In one embodiment, data is searched to find occurrences of the first object of interest and the second object of interest. The number of joint occurrences of the first object of interest and the second object of interest in the data is determined. Based on this number, at least one correlation value is determined. These correlation values represent the relationship between the first and second objects of interest and may, for example, relate to conditional probability, co-occurrence, correlation or other kinds of relationships. Based on the one or more correlation values, an interest value for the first object of interest is determined. The interest value indicates the interest in the first object of interest in the given situation.
An advantage of the above aspect is that it can determine a relationship between an unknown interest in a first object of interest and a known interest in a second object of interest, even when the objects of interest are not in the same domain and are not obviously related. By contrast, some conventional techniques for interest prediction depend on a strong, pre-existing relationship between the known interest (e.g., Disney movies) and the unknown interest (e.g., animated films in general).
The invention can be implemented in numerous ways, including, for example, a method, an apparatus, a computer readable medium, and a computing system (e.g., one or more computing devices). Several embodiments of the invention are discussed below.
Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:
Broadly speaking, the present invention relates to techniques for predicting an interest of a user.
One aspect of the invention pertains to determining interest in an object of interest. Interest in a first object of interest is unknown but interest in a second object of interest is known. The first and second objects of interest can belong to entirely different domains and/or categories. Data is then obtained. This data, for example, can include documents from the Internet or other forms of information from a network or database. In one embodiment, data is searched to find occurrences of the first object of interest and the second object of interest. The number of joint occurrences of the first object of interest and the second object of interest in the data is determined. Based on this number, at least one correlation value is determined. These correlation values represent the relationship between the first and second objects of interest and may, for example, relate to conditional probability, co-occurrence, correlation or other kinds of relationships. Based on the one or more correlation values, an interest value for the first object of interest is determined. The interest value indicates the interest in the first object of interest.
It can be desirable to determine the degree of interest that a person has in a first object of interest, based on the person's interest in a second object of interest. This can be easier, as noted earlier, if the known interest in the second object of interest (e.g., Disney movies) is obviously connected with, and therefore readily helps determine, an interest in the first object of interest (e.g., someone who likes Disney movies probably likes animated family movies in general). But sometimes data on such obviously related interests is scarce or unavailable. This can pose a problem for ontology-, attribute- and tag-based approaches. As noted earlier, ontology-based approaches can require that the interests belong to the same domain or be part of the same predetermined tree or framework. Attribute-based and tagging-based approaches are less compatible with objects of interest that have fewer natural connections between them. It is easy, for instance, to link movie interests together by director or genre, but much more difficult to link interests in highly disparate fields, such as video games and classical music.
It will be appreciated that the invention can predict an interest in a first object of interest based on a known interest in a second object of interest, even when little or no data has been collected in direct connection with that first object and even when the first and second objects of interest are not part of the same domain. For the purposes of this application, when multiple objects are part of the same domain, it means that one is a feature and/or aspect of the other, or that both are features and/or aspects of the same item. By way of example, every movie has a director, multiple actors and a genre. Therefore, the objects “director,” “actor” and “genre” are part of the domain “movies.” Another way of understanding a domain is as a tree-like structure, in which each node can be a parent to children nodes. (The term “tree-like structure” is defined as a hierarchical tree structure of linked data nodes, as is commonly understood by those of ordinary skill in the art.) An example of such a structure is provided in
In one embodiment, the prediction of an interest in the first object of interest is further informed by considering the situation. For example, a particular person may be known to enjoy relaxing activities when at home during the evening, such as listening to classical music, playing video games or reading newspapers. At the office in the morning, the person may be more interested in productivity tools, such as time management programs or spreadsheet applications. Such situation-aware data can be accumulated and factored into the interest prediction process. In another embodiment, the interest prediction process is not situation-aware.
In one embodiment, data is obtained that contains joint occurrences of the first and second objects of interest. This data may include web pages and/or data items on the Internet that contain keywords or phrases relating to the two objects. The number of joint occurrences of the objects in such data is determined. Based on this number, a correlation value is determined that indicates a correlation between the first object of interest and the second object of interest. Based on this correlation value, an interest value for the unknown first object of interest is determined.
Situation-based interest rating components 210 represent the interests of one or more users, given various situations 1 through N. Components 210 are separated into rows. Component 210a, for instance, indicates that the interest of the user in the second object 220b is V1 when the user is in situation 1. V1 indicates the intensity of the user's interest in the second object. It should be appreciated that the interests in first object 220a are unknown for all situations 1 through N, as indicated by the column of x's. To use a simple example, if situation 1 represents “at home in the evening,” first object 220a represents “pop music,” second object 220b represents “movies,” and V1=4 out of a range of 0 to 5, then component 210a indicates that the user on average has a relatively high degree of interest in movies when the user is at home in the evening, but has an unknown level of interest in pop music at the same time and location. Components 210 may be derived from a data log that tracks a user's behavior, e.g., through observation of the user's utilization of the Internet, a device, various applications, etc. Although components 210 contain information pertaining to a situation, this is not a requirement, and components 210 could contain only information relating to interests and/or other information unrelated to a person's situation.
In step 202 of
In accordance with step 204 of
Joint occurrence data components 226 include references to data items 228, which are part of data 212. At least some of data items 228 are individual web pages and/or files. Joint occurrence data components 226 indicate whether the first object 220a, the second object 220b or both appear in a particular data item. To use a simple example, if a=1, b=1, c=1, d=0, first object 220a is “pop music” and second object 220b is “movies,” then component 226a indicates that data item 1 contains references to both pop music and movies, but data item 2 contains references only to pop music. In the illustrated embodiment, values such as a, b, c and d can only be 0 or 1, and thus only take into account whether there are any references at all to the first and second objects in the respective data items 228. This, however, is not a requirement. Joint occurrence data components 226 may be computed using a variety of techniques, depending on the needs of a particular application. For instance, joint occurrence data components 226 can also identify how many occurrences of each object took place in each data item. What amounts to an “occurrence” or “joint occurrence” may vary from application to application. In some embodiments, an “occurrence” may refer to the appearance of one or more keywords, concepts or key phrases appearing in one of the data items 228. Various other metrics may be used to measure the degree to which a particular object of interest occurs or is represented in a particular data item.
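By way of illustration only, the binary form of the joint occurrence data components described above can be sketched as follows. The function names, the keyword lists, the two sample data items and the use of simple substring matching are all assumptions made for this sketch; they are not taken from the described embodiments, which may measure occurrences in other ways.

```python
def occurs(keywords, text):
    """Return 1 if any keyword appears in the text, else 0 (binary occurrence)."""
    lowered = text.lower()
    return int(any(keyword.lower() in lowered for keyword in keywords))


def joint_occurrence_rows(data_items, first_keywords, second_keywords):
    """Build one (first_object, second_object) indicator pair per data item."""
    return [
        (occurs(first_keywords, text), occurs(second_keywords, text))
        for text in data_items
    ]


def count_joint(rows):
    """Count data items in which both objects of interest occur."""
    return sum(1 for first, second in rows if first and second)


# Hypothetical data items standing in for web pages or files.
data_items = [
    "Pop music soundtracks dominate this summer's movies.",
    "A short history of pop music charts.",
]
rows = joint_occurrence_rows(data_items, ["pop music"], ["movie"])
print(rows)               # one (a, b) style pair per data item
print(count_joint(rows))  # number of joint occurrences
```

In this sketch the first data item refers to both objects and the second refers only to pop music, which corresponds to the a=1, b=1, c=1, d=0 example above. Replacing the 0/1 indicator with a count of matches would give the variant in which the number of occurrences per data item is tracked.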
In step 206 of
Interest value predictor 216 receives situation-based interest rating components 210 and the one or more correlation values from joint occurrence data components 226. As indicated by step 208 of
Interest value predictor 216 may compute interest value 218 in a variety of ways. For instance, interest value 218 may be computed using a simple weighted sum formula. The “weights” in this weighted sum formula may be the correlation values. Thus, for situation-based interest rating component 210a, which only has 1 known object of interest (i.e., second object 220b), the interest value for first object 220a=V1 (the interest value for second object 220b), since there is only 1 weighted value in the formula. Interest value predictor 216 may also use a weighted sum when there are known interest values for multiple objects of interest and/or multiple correlation values. An example of this approach is described in connection with
In other embodiments, interest value predictor 216 bases the interest value 218 on the correlation value. To use a simple example, assume V1=interest in first object 220a, V2=interest in second object 220b and C=correlation value relating first object 220a and second object 220b. Assume further that V2 is known, V1 is unknown and that interest value predictor 216 is predicting an interest value 218 that indicates an interest in first object 220a, i.e., V1. Interest value predictor 216 may estimate V1 according to the exemplary scheme below:
The above scheme indicates that V1 may be calculated based on V2 when C reaches a specific predetermined value. V1 is computed in different ways based on V2 and C, depending on the range of predetermined values that C falls into. Additionally, if C falls below a particular predetermined value, V1 is not determined, because C appears to indicate that V2 is not a dependable indicator of V1. Various formulas, algorithms, conditions and/or predetermined values may be used to relate interest values for first object 220a and second object 220b.
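A range-based scheme of the kind just described might, purely as an illustration, take the following shape. The two threshold values and the formula used in each range are hypothetical and are not the predetermined values of any described embodiment; they only show how an estimate can depend on the range into which C falls and be withheld when C is too low.

```python
from typing import Optional

# Hypothetical predetermined values, chosen only for this sketch.
HIGH_THRESHOLD = 0.7
LOW_THRESHOLD = 0.3


def estimate_v1(v2: float, c: float) -> Optional[float]:
    """Estimate the unknown interest V1 from the known interest V2 and correlation C."""
    if c >= HIGH_THRESHOLD:
        # Strong correlation: assume V1 simply mirrors V2.
        return v2
    if c >= LOW_THRESHOLD:
        # Moderate correlation: assume V1 is V2 scaled down by C.
        return c * v2
    # Weak correlation: V2 is not treated as a dependable indicator of V1.
    return None


print(estimate_v1(4.0, 0.9))
print(estimate_v1(4.0, 0.5))
print(estimate_v1(4.0, 0.1))
```

Returning no value in the lowest range, rather than a small number, keeps an undetermined interest distinct from a low one.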
It should be appreciated that the method illustrated in
As noted earlier, interest values for first object 220a may be based on the interest values for more than one interest object. In
Initially, situation-based interest rating components 302 are obtained by computing device 316. Components 302 associate various situations 318 with objects of interest 324. The various objects of interest 324a, 324b and 324c are pop music, classical music and jazz music, respectively. In the illustrated embodiment, components 302 are situation-aware and provide information relating to various situations, but this is not a requirement. Components 302 can also be limited to information that does not relate to situations, contexts and/or external circumstances.
Each situation 318 is characterized by two context variables 320a and 320b and their associated context values. The context variables 320a and 320b represent time and place, respectively. Each context variable 320a and 320b has various possible context values. The possible context values for context variable 320a are morning, midday and evening. The possible context values for context variable 320b are work and home.
Each situation-based interest rating component 302 indicates the interests of a user in a variety of objects of interest when the user is in a particular situation. For instance, situation-based interest rating component 302a indicates that a user, on average, has an interest rated at 1.3 in pop music and 4.2 in jazz music when he is at home in the morning. Each of these interest values is from a range of values between 0 and 5, although any range of values may be used. The interest of the user in classical music is unknown in any situation, as indicated by the “x's” in the column for classical music. The above interest values are derived from data accumulated by computing device 316 about the user.
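For illustration, a situation-based interest rating component such as 302a could be represented in software along the following lines. The class name, field names and the use of None to stand for an unknown value (the “x” entries) are assumptions of this sketch rather than features of the described embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A situation is modeled here as (time, place), mirroring context variables 320a and 320b.
Situation = Tuple[str, str]


@dataclass
class RatingComponent:
    """One row of ratings: a situation plus an interest value per object of interest."""
    situation: Situation
    ratings: Dict[str, Optional[float]]  # None marks an unknown interest value

    def known(self) -> Dict[str, float]:
        """Objects of interest whose interest values are known."""
        return {name: value for name, value in self.ratings.items() if value is not None}

    def unknown(self) -> List[str]:
        """Objects of interest whose interest values still need to be predicted."""
        return [name for name, value in self.ratings.items() if value is None]


# Values taken from the example above: at home in the morning.
component_302a = RatingComponent(
    situation=("morning", "home"),
    ratings={"pop music": 1.3, "classical music": None, "jazz music": 4.2},
)
print(component_302a.known())
print(component_302a.unknown())
```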
The computing device 316, using search engine 308, then obtains text-based data items from Internet 304. A text-based data item can include any kind of data type that includes words, such as a web page, document, audio, video or text file, etc. Internet 304 includes a network of numerous routers, servers, clients and/or other devices, such as nodes 306a-c. Search engine 308 may be a private search engine or any commonly known, publicly accessible search engine on the Internet 304, such as Yahoo! or Google. Search engine 308 conducts a search of Internet 304 using search terms. Each search term can include one or more keywords associated with the known interest objects of situation-based interest rating component 302a, i.e., pop music and jazz music. The exact way in which the search is made and/or keywords are submitted to search engine 308 can vary, depending on the needs of a particular application. In certain instances, a single interest object (e.g., running) may result in the use of one or more keywords that reflect various aspects of the interest object (e.g., jogging, run, marathon, etc.). Data items acquired through the search may be subjected to additional processing steps. For example, stop words (e.g., I, is, etc.) may be removed and/or keywords in the data items may be stemmed, e.g., a word such as “running” may be converted to its root, “run.” In particular embodiments, the search extends to the entire Internet 304. In other embodiments, the search is restricted to one or more nodes, servers, databases, domains and/or sites on a private network and/or Internet 304.
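The additional processing steps mentioned above can be sketched, by way of example only, as follows. The stop-word list and the suffix-stripping rule are deliberately small, hypothetical stand-ins; a practical system would more likely rely on an established stemming algorithm and a fuller stop-word list.

```python
import re

# Small illustrative stop-word list (an assumption of this sketch).
STOP_WORDS = {"i", "is", "a", "the", "and", "in", "to", "of"}


def naive_stem(word):
    """Toy stemmer: strip a trailing 'ing' or 's' and undouble a final consonant."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "running" -> "runn" -> "run"
            if suffix == "ing" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Running is fun and I love marathons"))
```

Normalizing both the search keywords and the data items in the same way lets variants such as “running” and “run” count as occurrences of the same interest object.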
In response to the queries, computing device 316 receives first and second groups of text-based data items, respectively. The first group includes A1 data items that each contain at least one occurrence of the keywords related to “pop music.” The second group includes A2 data items that each contain at least one occurrence of the keywords related to “jazz music.” Among the A1 and A2 data items, there are B1 and B2 data items, respectively, that also each contain at least one occurrence of keywords related to “classical music.”
Afterward, classical-pop and classical-jazz correlation values are determined and correlation data 310 is generated. The classical-pop correlation value is calculated by dividing the number of data items having joint occurrences of “pop music” and “classical music” keywords (i.e., B1) by the number of data items having at least one occurrence of “pop music” keywords (i.e., A1). Hence, the classical-pop correlation value is B1/A1. Calculated in an analogous manner, the classical-jazz correlation value is B2/A2. These values form correlation data 310, which is sent to interest value predictor 312.
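The ratios B1/A1 and B2/A2 can be illustrated with a short sketch. The counts used below are hypothetical and do not come from any actual search; the function name and the handling of an empty result set are likewise assumptions.

```python
def conditional_correlation(joint_count, base_count):
    """Return joint_count / base_count, or None when no data items matched the base keywords."""
    if base_count == 0:
        return None  # no data items for the known object, so no ratio can be formed
    return joint_count / base_count


# Hypothetical counts: A1/A2 = items mentioning pop/jazz; B1/B2 = those also mentioning classical.
A1, B1 = 200, 50
A2, B2 = 100, 60

correlation_data = {
    "pop music": conditional_correlation(B1, A1),   # classical-pop correlation value
    "jazz music": conditional_correlation(B2, A2),  # classical-jazz correlation value
}
print(correlation_data)
```

With these illustrative counts, classical music co-occurs more often with jazz music than with pop music, so the known interest in jazz would carry more weight in a subsequent prediction.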
Interest value predictor 312 generates the interest value 314 based on correlation data 310 and the existing interest values for pop music and jazz music in situation-based interest rating component 302a. This interest value will replace the unknown value for classical music in component 302a. Interest value 314 may be calculated in a variety of ways, depending on the needs of a particular application. This calculation, for example, may involve the weighted sum formula below:
In the above exemplary equation, P is the predicted interest value for a specific interest object j, given a situation s. Pr refers to the conditional probability that interest object j will occur when interest object i occurs (e.g., the classical-pop and classical-jazz correlation values). In particular embodiments, Pr could involve co-occurrence, Pearson correlation, cosine correlation and/or other types of relationships between various interest objects. V refers to the interest value for the interest object i, given the situation s.
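One weighted sum consistent with these definitions is the normalized form P(j, s) = [sum over i of Pr(j|i) x V(i, s)] / [sum over i of Pr(j|i)]. The normalization is an assumption of this sketch, chosen because it reduces to the single known interest value when only one weighted term is present, as in the earlier single-object example. The correlation values below are hypothetical; the interest values 1.3 and 4.2 are those of component 302a.

```python
def predict_interest(known_values, correlations):
    """Normalized weighted sum of known interest values, weighted by correlation values.

    known_values: interest value V(i, s) for each known interest object i.
    correlations: correlation value Pr(j|i) between the unknown object j and each i.
    """
    weighted_total = 0.0
    weight_total = 0.0
    for interest_object, value in known_values.items():
        weight = correlations.get(interest_object)
        if weight is None:
            continue  # no correlation value available for this object
        weighted_total += weight * value
        weight_total += weight
    if weight_total == 0:
        return None  # nothing to base a prediction on
    return weighted_total / weight_total


# Known interests at home in the morning (component 302a); weights are hypothetical.
known = {"pop music": 1.3, "jazz music": 4.2}
weights = {"pop music": 0.25, "jazz music": 0.6}
predicted_classical = predict_interest(known, weights)
print(round(predicted_classical, 2))
```

Because jazz music carries the larger weight in this illustration, the predicted interest in classical music falls closer to the jazz rating than to the pop rating.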
Interest value predictor 312 may use the above or different prediction equations to fill in the unknown interest values for classical music in one or more of interest rating components 302. Additionally, the methods described in this application may be modified and/or combined with other methods for predicting interest values, such as those described in the following three patent applications: U.S. patent application Ser. No. 12/343,392, entitled “Rating-based Interests in Computing Environments and Systems”; U.S. patent application Ser. No. 12/343,393, entitled “Semantics-based Interests in Computing Environments and Systems”; and U.S. patent application Ser. No. 12/343,395, entitled “Context-based Interests in Computing Environments and Systems.” (These three patent applications are incorporated herein in their entirety for all purposes.) For example, computing device 316 may determine some of the unknown interest values in situation-based interest rating components 302 based on interest values 314 and correlation data 310. As a result, at least some situation-based interest rating components 302 will have interest values for interest object 324b as well as for at least one of interest objects 324a and 324c. Afterward, other unknown interest values in components 302 may be determined using the techniques of the aforementioned applications. Additionally, context variables, context values, situations, situation-based interest rating components, prediction equations, computing devices and/or other aspects of the present application may be modified according to the features described in these applications.
The various aspects, features, embodiments or implementations of the invention described above can be used alone or in various combinations. The many features and advantages of the present invention are apparent from the written description and, thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, the invention should not be limited to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.