A recommendation system is an information filtering system that uses a user's profile to make inferences about other goods and/or services that are likely to interest that user. For example, the user's demographic information and/or history of previous purchases can be used to recommend other products that the user may want to consider purchasing. Recommendation systems are often used by ecommerce stores that sell goods over the Internet in order to increase the volume of sales.
Recommendation systems are typically based on statistical inference engines. Describing the behavior and evaluating the accuracy of a recommendation system can be a complex task, since recommendation systems typically produce large sets of rules or inference patterns which can be hard to present in an intuitive form.
Various technologies and techniques are disclosed for calculating and evaluating the behavior of recommendation systems. Accuracy measures are computed for a plurality of items in a real recommendation system, an ideal recommendation system, and a popularity-based baseline recommendation system. The accuracy measures for the plurality of items are presented to a user so the user can evaluate a performance of the real recommendation system in comparison to the ideal recommendation system and the popularity-based baseline recommendation system. The accuracy measures can be presented in an interactive graph.
In one implementation, techniques for generating an interactive graph that can be used by a user to conduct a performance evaluation of a real recommendation system are described. Accuracy measures are computed for a plurality of items in a real recommendation system, an ideal recommendation system, and a popularity-based baseline recommendation system. An interactive graph is generated in descending order by the accuracy measures for the plurality of items in the real recommendation system, the ideal recommendation system, and the popularity-based baseline recommendation system. The graph is displayed so a user can interact with the graph to conduct a performance evaluation of the real recommendation system in comparison to the ideal recommendation system and the popularity-based baseline recommendation system.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The technologies and techniques herein may be described in the general context as an application that evaluates the behavior and/or accuracy of a recommendation system, but the technologies and techniques also serve other purposes in addition to these.
In one implementation, techniques are described for evaluating the behavior and/or accuracy of a recommendation system. The term “recommendation system” as used herein is meant to include a mechanism for proposing a meaningful addition to a set, based on the existing set and additional information about the set. The terms “item” and “items” as used herein are meant to include members of a set, such as good(s) and/or service(s). In one implementation, behavior information is collected, and the information is presented in an interactive diagram for analysis. A metric is provided that enables the performance of different recommendation systems to be compared for the same problem or data.
In general, in order to evaluate a statistical inference model used by a recommendation system, a separate data set is needed that is statistically similar to the one used in detecting the patterns. This data set is typically called “a holdout dataset” or “holdout set”. Techniques are described herein for evaluating the accuracy and behavior of one or more recommendation systems over a holdout (witness) set.
A recommendation system, as mentioned previously, can be thought of as a system which produces a fixed number of recommendations (items) for each individual user, based on the profile information of that user. The profile information generally includes a history of previous items of interest for the user (previous purchases). It may also include demographic information (attributes, such as Gender, Location etc.) of the user. The recommendation system can be thought of as a function which takes as input all the information about the user known to the system and produces (as output) a set of N recommended items.
A simple recommendation system can be implemented by returning the N most popular items in the item space (in the ecommerce metaphor, this means the catalog items that generated the most sales over time). In one implementation, such a basic system is used as a baseline in making comparisons with a recommendation system being evaluated.
An ideal recommendation system is a hypothetical system which produces perfect recommendations, such that all recommendations would be interesting to the end user. In one implementation of the system for evaluating recommendation systems that is described herein, the performance of any recommendation system is considered to be somewhere between the very simple system based on popularity and a hypothetical ideal system. In such an implementation, the reasons for such boundaries include the fact that (1) the performance of the ideal system cannot be exceeded, by definition of the ideal system; and (2) any performance worse than the popularity system may not warrant implementing a more complex system. Some techniques for evaluating recommendation systems will now be described in the figures that follow.
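The popularity-based baseline described above can be sketched as follows. This is a minimal illustration, not the implementation described herein: the names `top_n_popular` and `popularity_recommender`, the representation of each user's history as a set of item identifiers, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def top_n_popular(purchase_histories: Iterable[Set[str]], n: int) -> List[str]:
    """Return the n items that appear in the most purchase histories."""
    counts = Counter(item for history in purchase_histories for item in history)
    # Sort by descending count, then by name, so ties are broken deterministically
    # (the tie-breaking rule is an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:n]]


def popularity_recommender(
    training_histories: Iterable[Set[str]], n: int
) -> Callable[[Set[str]], List[str]]:
    """Build a baseline recommender that ignores the user's profile."""
    top = top_n_popular(training_histories, n)

    def recommend(user_items: Set[str]) -> List[str]:
        # Every user receives the same N recommendations.
        return list(top)

    return recommend
```

Because the baseline ignores the profile, any system that uses profile information is expected to do at least as well as it; that expectation is what makes it a useful lower reference point.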
Evaluation process 106 is responsible for evaluating the accuracy and/or behavior of a particular recommendation system when compared to an ideal system and/or other recommendation systems. The evaluation is performed with regard to one or more measures. Such measures include: the number of correct recommendations, the cumulative value of correct recommendations, and the percent of correct recommendations. These measures can be computed individually for each item in the item space and aggregated over the holdout set.
Returning to the ecommerce metaphor, take any product in the catalog as an example. The number of correct recommendations for a given item can be the number of times the item was recommended and the recommendation was useful. The cumulative value of correct recommendations for a given item can be the nominal value of the item times the number of times the item was recommended and the recommendation was useful. The percent of correct recommendations for a given item may be defined as the ratio (by definition, no greater than 1) between the number of correct recommendations generated by a system and the number generated by an ideal recommendation system.
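As a rough illustration of the three measures just described, the sketch below derives the cumulative value and the percent of correct recommendations from per-item counts of correct recommendations. The function names, the dictionary layout, and the choice to report 0.0 when the ideal count for an item is zero are assumptions of this example.

```python
from typing import Dict


def cumulative_value(correct: Dict[str, int], price: Dict[str, float]) -> Dict[str, float]:
    """Nominal value of each item times its number of correct recommendations."""
    return {item: count * price[item] for item, count in correct.items()}


def percent_correct(correct: Dict[str, int], ideal: Dict[str, int]) -> Dict[str, float]:
    """Per-item ratio of correct recommendations to those of the ideal system."""
    return {
        # An item the evaluated system never got right counts as 0 correct.
        item: (correct.get(item, 0) / ideal_count) if ideal_count else 0.0
        for item, ideal_count in ideal.items()
    }
```

For instance, an item with 3 correct recommendations out of an ideal 5 would have a percent of correct recommendations of 0.6.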
Evaluation process 106 can work with any metric that can be defined and computed for each individual item in the item space, both for an arbitrary recommendation system and for an ideal system, over a holdout set. In one implementation, a measure used by evaluation process 106 also needs to satisfy the following conditions: (1) The measure only takes non-negative values (>=0); (2) The measure is maximized by an ideal model (for any item, the measure as computed for an arbitrary recommendation system is less than or equal to the measure as computed for an ideal system); and (3) The measure increases monotonically with accuracy (i.e. if the measure computed for Recommendation System 1 for one item is larger in value than the measure computed for Recommendation System 2 for the same item, then Recommendation System 1 produces better accuracy than Recommendation System 2 for that particular item). In other implementations, some or all of the above conditions do not have to be met. The evaluation process 106 is described in further detail in
Interactive diagram tool 108 is responsible for displaying an interactive chart that shows the results of the evaluation process 106. Interactive diagram tool 108 is described in further detail in
Turning now to
When comparing multiple recommendation systems, one extra computation of the accuracy measure can be added for each item in the item-space (describing the accuracy of the respective recommendation system). In such a scenario, the computation has a fixed parameter, the number of recommendations that the system returns for each request. Consequently, the popularity based system is based on the top N most popular items.
The holdout set can be selected in at least two different ways, and the computation of the accuracy measure depends on the properties of the holdout set. For example, a holdout set can be used that represents a random sample of the data (e.g. prior sales across all customers). Usage of this type of holdout set for computing accuracy measures is described in
Note that in one implementation, in both cases, for the ecommerce metaphor, the holdout set contains users and items purchased by these users (plus, possibly demographic or other information used by the system).
Once the accuracy measures have been calculated, then a graph can be generated in descending order by values of the accuracy measure (stage 156), as is described in further detail in
A user record is then retrieved (stage 204). If the user did not purchase anything (the item set for the user is empty) (decision point 206), then this user is skipped and the next user is retrieved, if any (decision point 226). If the user did purchase something (decision point 206), then the item set of the current user is considered as a set of items, S={I1, . . . Ip} (stage 208). An item is retrieved from the item set (I from S) (stage 210). The accuracy measure associated with the ideal model for the item I is incremented (stage 212). Since I actually appeared in the holdout set for the current user, it is a valid recommendation and the ideal system would have made it.
If the item I is one of the top N most popular items in the data seen by the system during training (decision point 214), then the accuracy measure associated with the popularity based model for the item I is incremented (stage 216).
The current item is then eliminated from the set (stage 218), such as by constructing the set as S′=S−{I}. A recommendation request is executed (stage 220) to see what type of recommendation the system will make based on the current items in the set (now that the item has been removed). If the item I appears as a recommendation, then the accuracy measure associated with the real system being evaluated is incremented (stage 222). One reason for incrementing this accuracy measure for the real system is because the current user actually purchased S′ and I. A perfect recommendation system would therefore recommend item I for the current customer, based on S′ and the other customer info.
The next item is retrieved, if any are present (decision point 224), and the stages repeat accordingly to evaluate the additional items for the current user. Once all of the items have been evaluated for the current user (decision point 224), then the next user is retrieved, if any (decision point 226). Once all items have been evaluated for all users, then the process ends at end point 228.
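The loop over users and items described in stages 204 through 228 can be sketched as below, using the count of correct recommendations as the accuracy measure. This is an illustrative reading of the prose, not a definitive implementation: the name `evaluate_leave_one_out`, the `recommend` callable standing in for the recommendation request, and the use of string item identifiers are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical stand-in for a recommendation request: given the remaining
# items of a user, return the list of recommended items.
Recommender = Callable[[Set[str]], List[str]]


def evaluate_leave_one_out(
    holdout: Iterable[Set[str]],
    recommend: Recommender,
    top_popular: Set[str],
) -> Dict[str, Counter]:
    """Compute per-item counts for the ideal, popularity, and real systems."""
    ideal, popularity, real = Counter(), Counter(), Counter()
    for items in holdout:                      # stage 204: next user record
        if not items:                          # decision point 206: no purchases
            continue
        for item in items:                     # stage 210: I from S
            ideal[item] += 1                   # stage 212: ideal would recommend I
            if item in top_popular:            # decision point 214
                popularity[item] += 1          # stage 216
            remaining = items - {item}         # stage 218: S' = S - {I}
            if item in recommend(remaining):   # stage 220: recommendation request
                real[item] += 1                # stage 222: real system was correct
    return {"ideal": ideal, "popularity": popularity, "real": real}
```

Note that the recommendation request is issued once per (user, item) pair, since each request is based on the user's item set with a different item removed.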
In one implementation, with the process described in
Turning now to
A user record is retrieved (stage 304), which would be for a user with previous purchases, as used in training the system. If the user did not purchase anything (the item set for the user is empty) (decision point 306), then this user is skipped and the next user record is retrieved, if any (decision point 324).
The item set of the current user is considered as a set of items (stage 308), such as S={I1, . . . Ip} representing the new purchases by that user. A recommendation request is executed based on the user's information and previous purchase history (the information used in training the recommendation system) (stage 310). An item I in set S is retrieved (stage 312). The accuracy measure associated with the ideal model for item I is incremented (stage 314). This is because item I actually appeared in the holdout set for the current user, and therefore it is a valid recommendation and the ideal system would have made it. If item I is one of the top N most popular items in the data seen by the system during training (decision point 316), then the accuracy measure associated with the popularity based model for item I is incremented (stage 318).
If item I appears in the recommendations, then the accuracy measure associated with the real system being evaluated is incremented (stage 320). The reason for incrementing the accuracy measure associated with the real system is that the system's recommendation was correct: the user actually purchased item I. The next item in the item set for the current user is retrieved (stage 312) and the stages repeat accordingly to evaluate the additional items for the current user. Once all of the items have been evaluated for the current user (decision point 322), then the next user is retrieved, if any (decision point 324). Once all items have been evaluated for all users, then the process ends at end point 326.
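The variant described in stages 304 through 326 differs mainly in that one recommendation request is issued per user, based on the previously known history, and the new purchases are checked against it. A minimal sketch follows; the name `evaluate_new_purchases`, the pairing of each holdout record as (previous history, new purchases), and the `recommend` callable are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical stand-in for a recommendation request based on prior history.
Recommender = Callable[[Set[str]], List[str]]


def evaluate_new_purchases(
    holdout: Iterable[Tuple[Set[str], Set[str]]],
    recommend: Recommender,
    top_popular: Set[str],
) -> Dict[str, Counter]:
    """Compute per-item counts when the holdout holds new purchases of known users."""
    ideal, popularity, real = Counter(), Counter(), Counter()
    for history, new_items in holdout:         # stage 304: next user record
        if not new_items:                      # decision point 306: nothing purchased
            continue
        recommended = set(recommend(history))  # stage 310: one request per user
        for item in new_items:                 # stage 312: I from S
            ideal[item] += 1                   # stage 314
            if item in top_popular:            # decision point 316
                popularity[item] += 1          # stage 318
            if item in recommended:
                real[item] += 1                # stage 320
    return {"ideal": ideal, "popularity": popularity, "real": real}
```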
In one implementation, with the process described in
As noted earlier, depending on the type of holdout set that is available to use in the computations, the process of
An aggregate measure is defined for each system (including ideal and popularity-based) (stage 364). In one implementation, an aggregate measure is defined for each system using the following formula:
In one implementation, after computing the aggregate measure, the following quantities are available:
A_Ideal=A(Ideal System)
A_Popularity=A(Popularity Based System)
A_1=A (Recommendation System 1 being evaluated)
A_i =A(Recommendation System i being evaluated, if multiple systems)
A lift measure (stage 366) and a coverage measure (stage 368) are optionally defined for each recommendation system being evaluated to capture the accuracy and performance information. In one implementation, the lift measure is calculated as follows:
The lift measure provides a ratio of improvement of the accuracy offered by a recommendation system (improvement measured against the baseline provided by the popularity based model) and can be useful in comparing different recommendation systems.
In one implementation, a coverage measure is calculated as follows:
In one implementation, the coverage measure provides an absolute measure of the accuracy of a system. It can be used in comparing multiple systems, but it is mostly intended as an absolute measure.
In one implementation, the Lift for the popularity based model is 1, and the coverage for the ideal model is also 1.
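The formulas for the aggregate, lift, and coverage measures are not reproduced in this text, so the sketch below is only one plausible reading. It assumes the aggregate measure is the sum of the per-item accuracy measure across the item space, that lift is the ratio of a system's aggregate to the popularity-based aggregate, and that coverage is the ratio of a system's aggregate to the ideal aggregate. These assumed forms are at least consistent with the statement that the lift of the popularity-based model is 1 and the coverage of the ideal model is 1. The function names are hypothetical.

```python
from typing import Mapping


def aggregate(measure: Mapping[str, float]) -> float:
    """Assumed aggregate A(system): sum of the per-item accuracy measure."""
    return sum(measure.values())


def lift(a_system: float, a_popularity: float) -> float:
    """Assumed lift: improvement relative to the popularity-based baseline."""
    # Undefined when the baseline aggregate is zero; this sketch does not handle it.
    return a_system / a_popularity


def coverage(a_system: float, a_ideal: float) -> float:
    """Assumed coverage: fraction of the ideal system's aggregate that is achieved."""
    return a_system / a_ideal
```

Under these assumptions, a system with an aggregate of 6 against a popularity aggregate of 3 and an ideal aggregate of 9 would have a lift of 2 and a coverage of about 0.67.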
Diagram 400 presents the performance of a recommendation system as compared to an ideal system and a popularity-based one. Curve 402, on top of the other curves, shows the performance of an ideal system. Curve 404 shows the performance of a popularity-based system. Curve 406 represents the performance of the recommendation system to be evaluated. For example, the accuracy measure for a long-sleeve logo jersey is indicated for the ideal system, the system being evaluated, and for the popularity-based system.
The area bounded by one of the curves and the vertical and horizontal axes represents a measure of the overall performance of the recommendation system that generated that curve.
The X (horizontal) axis of the diagram represents the item space. All the items in the space are serialized along this axis, in descending order of the accuracy metric as computed by an ideal recommendation system.
For example, in the ecommerce metaphor, assume that a retailer offers Coca Cola, bread and computer mice. Let's also assume that an ideal system recommends bread 5 times, Coca Cola 3 times and a computer mouse only once. Then the X axis will contain: Bread, Coca Cola, Computer Mouse in this order.
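The ordering in this example can be reproduced with a short sketch. The names `x_axis_order` and `series`, and the alphabetical tie-breaking for items with equal ideal counts, are assumptions for illustration only.

```python
from typing import Dict, List


def x_axis_order(ideal: Dict[str, int]) -> List[str]:
    """Order items by descending ideal-system measure (ties broken by name)."""
    return sorted(ideal, key=lambda item: (-ideal[item], item))


def series(order: List[str], measure: Dict[str, int]) -> List[int]:
    """Y values of one system's curve, aligned to the shared X-axis order."""
    return [measure.get(item, 0) for item in order]


ideal_counts = {"Coca Cola": 3, "Computer Mouse": 1, "Bread": 5}
order = x_axis_order(ideal_counts)
```

Each system's curve is then obtained by calling `series(order, ...)` with that system's per-item measure, so that all curves share the same item ordering along the horizontal axis.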
The Y (vertical) axis of the diagram represents the calculated metric for each individual item (from the X-axis).
In the context of the diagram, the overall accuracy of a recommendation system can be equated with the area under the curve drawn by the measure over the total item space (effectively, the sum of the accuracy measure computed for the system across the item space).
The sorting of items in descending order on the X-axis displays the item space and the accuracy measures in a form that makes it intuitive to evaluate the “Long tail” distribution (a term used in ecommerce to describe distributions such as Zipf, Power, Pareto and others).
Once generated, diagram 400 can allow one or more interaction scenarios. Since any point on the X (horizontal) axis represents an item, and the chart contains the various accuracy measurements for that item, by selecting or clicking on a point in the chart, the following information is displayed, which effectively details the behavior of the system for that item (using ML Mountain Tire as a non-limiting example):
In one implementation, the diagram allows ‘zooming in’ to explore different areas of the distribution of values. In another implementation, the diagram can be computed for a specified number of recommendations (the N parameter in the computation algorithm).
By analyzing the data shown in diagram 400, managers and other users can conduct a performance evaluation of how well their recommendation system is performing, and/or can determine some adjustments to make to the recommendation system. For example, the user can see how one recommendation system performs compared to another (in case the user wishes to switch to the other, for example), whether the system already in use is providing any value (i.e. whether there are increased sales and profits based upon the recommendations), and/or whether a certain item should be kept in inventory because the item is generating profits based upon recommendations from the real recommendation system.
Turning now to
As shown in
Additionally, device 500 may also have additional features/functionality. For example, device 500 may also include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in
Computing device 500 includes one or more communication connections 514 that allow computing device 500 to communicate with other computers/applications 515. Device 500 may also have input device(s) 512 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 511 such as a display, speakers, printer, etc. may also be included. These devices are well known in the art and need not be discussed at length here.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. All equivalents, changes, and modifications that come within the spirit of the implementations as described herein and/or by the following claims are desired to be protected.
For example, a person of ordinary skill in the computer software art will recognize that the examples discussed herein could be organized differently on one or more computers to include fewer or additional options or features than as portrayed in the examples.