The present invention relates to the field of network communication technologies, and more particularly, to a recommendation method and system based on collaborative filtering.
A recommendation system is an intelligent agent system proposed to solve the information overload problem. From a large quantity of information, it automatically recommends to a user resources that match the user's interests, preferences, or demands. With the popularization and rapid development of the Internet, recommendation systems have been widely applied in various fields, especially in electronic commerce, where they are the subject of a growing body of research and applications. Currently, almost all large electronic commerce web sites, such as Amazon, CDNOW, eBay, and Dangdang online bookstore, use recommendation systems in various forms and to different extents. Collaborative filtering is one technology that has been successfully applied in current recommendation systems.
Collaborative filtering algorithms mainly include a user-based collaborative filtering algorithm and an item-based collaborative filtering algorithm. Inputs of both algorithms are matrixes of ratings made by users for items, as shown in Table 1:
The rating made by a user for an item may be explicitly obtained, for example, through a rating operation performed by the user on the item; and may also be implicitly obtained, for example, by calculation with a rating function constructed through behaviors of the user like searching, browsing, and purchasing of the item. The vectors formed in each row of the matrix represent the rating vectors of the user corresponding to the row for each item.
The user-based collaborative filtering algorithm uses similarities between the item ratings made by users to recommend to each user items that the user may be interested in. For example, for a current user U, the system uses the rating records of the user U and a particular similarity function to find the k users whose rating behavior is closest to that of the user U; these users form the closest neighbor set of the user U. The system then collects the items that the neighboring users have rated but the user U has not, to generate a candidate recommendation set, calculates a predicted rating made by the user U for each item i in the candidate recommendation set, and takes the N items with the highest predicted ratings as a Top-N recommendation set of the user U.
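The user-based procedure described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the cosine similarity, the similarity-weighted average used for the predicted rating, and the names `cosine` and `user_based_top_n` are all assumptions chosen for the example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse rating dicts (unrated items count as 0)."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def user_based_top_n(ratings, target, k=2, n=1):
    """ratings: {user: {item: rating}}. Returns up to n recommended items."""
    # Closest neighbor set: the k users most similar to the target user.
    sims = {u: cosine(ratings[target], r) for u, r in ratings.items() if u != target}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    # Candidate set: items rated by neighbors but not by the target user.
    candidates = {i for u in neighbors for i in ratings[u]} - set(ratings[target])
    predicted = {}
    for item in candidates:
        raters = [u for u in neighbors if item in ratings[u]]
        weight = sum(abs(sims[u]) for u in raters)
        if weight:
            # Assumed prediction rule: similarity-weighted average of neighbor ratings.
            predicted[item] = sum(sims[u] * ratings[u][item] for u in raters) / weight
    return sorted(predicted, key=predicted.get, reverse=True)[:n]
```

The neighbor count `k` and the list length `n` correspond to the k and N of the text; the toy user and item names in any call are hypothetical.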
The item-based collaborative filtering algorithm compares a similarity between items, and recommends an item that has not been rated according to a set of items already rated by the current user. The similarity between the items is more stable than the similarity between the users, so the similarity between the items can be offline calculated and stored and periodically updated. Thus, relative to the user-based collaborative filtering algorithm, the item-based collaborative filtering algorithm has a high recommendation precision and a good real-time performance. The item-based collaborative filtering algorithm, after being optimized, may achieve higher recommendation accuracy and a better effect, and better conform to the demands of the user.
A basic processing procedure of item-based collaborative recommendation is divided into two parts: offline similarity calculation and online recommendation.
The offline similarity calculation procedure in
On the basis of calculating and storing the similarity between the different items in advance, as shown in
where PU,i denotes a predicted rating made by a target user U for an item i, sim(j,i) denotes a similarity between an item j and an item i, and RU,j denotes an actual rating made by the user U for the item j. In step 15, according to a predicted rating result, the top N items with the highest rating are taken as a recommendation result for the target user.
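The prediction formula itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the weighted-sum form commonly used in item-based collaborative filtering, in which the similarity-weighted sum of the ratings RU,j is divided by the sum of the absolute similarities. The function names and the pair-keyed similarity dictionary are assumptions made for the example.

```python
def predict_rating(user_ratings, item_sim, item):
    """Assumed weighted-sum prediction of P(U,i) from the user's ratings R(U,j).

    user_ratings: {item_j: rating}; item_sim: {(item_a, item_b): similarity}.
    """
    num = 0.0
    den = 0.0
    for j, r in user_ratings.items():
        # Similarities are symmetric, so look the pair up in either order.
        s = item_sim.get((j, item), item_sim.get((item, j), 0.0))
        num += s * r
        den += abs(s)
    return num / den if den else 0.0


def top_n(user_ratings, item_sim, candidates, n):
    """Return the n candidate items with the highest predicted rating."""
    scored = {i: predict_rating(user_ratings, item_sim, i) for i in candidates}
    return sorted(scored, key=scored.get, reverse=True)[:n]
```

Because the similarities are calculated offline, the online step reduces to dictionary lookups and one pass over the items the target user has rated.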
In the procedure of the item-based collaborative filtering algorithm, the similarity between the items has a critical influence on the final recommendation result. In the conventional item-based collaborative filtering recommendation algorithm, the calculation of the similarity between the items does not consider the difference between user groups of different preferences. The similarity between the items is calculated based on the whole user rating matrix, so for all users the similarity between the same two items is identical. However, in practice, user groups of different preferences generally hold different views on the same two items; ignoring this difference inevitably reduces the recommendation accuracy and degrades the recommendation quality.
To improve the accuracy of recommendation and conform to the user preference, the present invention is directed to a recommendation method and system based on collaborative filtering.
A recommendation method based on collaborative filtering includes the following steps. A target user ID is acquired. A user group ID corresponding to the target user ID is looked up. A similarity between items is acquired, where the similarity between items is determined according to a user-item rating matrix corresponding to the user group ID. An item is recommended to a target user according to the similarity between the items.
A recommendation system based on collaborative filtering includes a recommendation control module, a set-to-be-recommended determination module, and a recommendation generation module. The recommendation control module is configured to acquire a target user ID, and invoke the set-to-be-recommended determination module and the recommendation generation module to recommend an item to a target user corresponding to the target user ID. The set-to-be-recommended determination module is configured to search a user group ID corresponding to the target user ID, acquire a similarity between items, where the similarity between items is determined according to a user-item rating matrix corresponding to the user group ID, determine, according to the similarity between the items, a set to be recommended, or acquire a set of hotspot items, where the set of hotspot items is determined according to the user-item rating matrix corresponding to the user group ID, and use the set of the hotspot items as the set to be recommended. The recommendation generation module is configured to recommend an item in the set to be recommended to a user.
With the recommendation method and system based on collaborative filtering according to embodiments of the present invention, users are firstly grouped, so that preferences of the users in each user group are substantially the same, and item similarity information contained in each user group is utilized to recommend items for the users, thereby improving the accuracy of recommendation and realizing personalized recommendation.
The technical solutions of the present invention are described in detail below with reference to the embodiments and the accompanying drawings.
In an embodiment, the present invention provides a recommendation method based on collaborative filtering, where the method includes the following steps. Firstly, users are grouped based on a user-item rating matrix, and each user group only includes data of ratings made by the users for all items in the group. Then, a similarity between items is independently calculated for each user group. Finally, recommendation is performed for a target user based on the similarity calculated in the group to which the target user belongs.
In an embodiment, the present invention provides a recommendation system based on collaborative filtering. The system includes: a recommendation control module, configured to acquire a target user identifier, and invoke a set-to-be-recommended determination module and a recommendation generation module to recommend an item to a target user corresponding to the target user identifier; the set-to-be-recommended determination module, configured to search a user group identifier corresponding to the target user identifier, acquire a similarity between items, where the similarity between items is determined according to a user-item rating matrix corresponding to the user group identifier, determine, according to the similarity between the items, the set to be recommended, or acquire a set of hotspot items, where the set of hotspot items is determined according to the user-item rating matrix corresponding to the user group identifier, and use the set of the hotspot items as the set to be recommended; and the recommendation generation module, configured to recommend an item in the set to be recommended to a user. The foregoing is described in detail as follows.
The system master data set mainly includes: user-item rating matrix data, which is specifically rating data for different items generated by each user in a service use process; and user basic information data, which is specifically basic attribute information describing the user, including geography, occupation, gender, age, and education level.
The system computation data set mainly includes: user group data, including a result of grouping users based on the user-item rating matrix data, where each user is corresponding to one group, and each group is corresponding to one group center; a user group item hotspot level database, configured to record a hotspot item and a hotspot level corresponding to each user group generated based on the user grouping result, where the hotspot items are the top M (M is not smaller than N) items rated the most, and the hotspot level of the hotspot item is an average of obtained ratings of the item; and a user group item similarity database, configured to record a similarity between items corresponding to each user group generated based on the user grouping result.
The function of each module in the recommendation system and interaction between the modules are introduced in detail below. The modules in the recommendation system are not all necessary, and a part of the modules can be added or subtracted correspondingly according to the requirements of the function or performance.
The recommendation control module 51 is a main control module of an online recommendation part, and is capable of invoking other modules after receiving a user ID (that is, a target user ID) to be recommended, so as to complete the whole recommendation process.
The set-to-be-recommended determination module 54 is configured to, after determining the corresponding target user according to the user ID to be recommended, find a set of neighboring items of a target user rating item by locating a user group to which a target user belongs, or find a set of hotspot items corresponding to the user group, obtain a set to be recommended, and use this set as a basis of next computation of the rating prediction module 53. The set-to-be-recommended determination module 54 may be further divided into the group-of-user determination module 541 and the set-of-items-to-be-recommended determination module 542. The group-of-user determination module 541 is configured to determine a user group to which a user belongs, and may locate, according to the target user ID, the user group to which the target user belongs, or determine, according to a categorizer, the user group to which the target user belongs. The set-of-items-to-be-recommended determination module 542 is configured to determine a set of items to be recommended in the group to which the target user belongs, and may obtain the set to be recommended through the set of the neighboring items of the target user rating item or the set of hotspot items corresponding to the user group. If the number of items in the set to be recommended is smaller than N, the distances between the target user and other groups are calculated. In a group with the closest distance, the process of determining the set to be recommended is continuously performed, until the number of recommended items is larger than or equal to N, or until all the user groups are traversed.
The rating prediction module 53 is mainly configured to perform a similar item-based rating prediction or a hotspot item-based rating prediction in the set of the items to be recommended obtained by the set-to-be-recommended determination module 54, so as to obtain a predicted rating made by the target user for the items to be recommended. This module may further be divided into the similar item rating prediction module 531 and the hotspot item rating prediction module 532. The similar item rating prediction module 531 calculates the predicted rating according to a similarity between similar items, for example, calculates the predicted rating according to the following formula:
where PU,i denotes a predicted rating made by a target user U for an item i, sim(j,i) denotes a similarity between an item j and an item i, and RU,j denotes an actual rating made by the user U for the item j. The hotspot item rating prediction module 532 is configured to calculate the hotspot-item-based predicted rating. For example, the hotspot level of a hotspot item is taken as the predicted rating of the hotspot item. In other embodiments of the present invention, the set of the items to be recommended may also be directly recommended to the user without further rating prediction on the set of the items to be recommended.
The recommendation generation module 52 is mainly configured to use the top N items with the highest rating as a recommendation result for the target user, according to a predicted rating made by the rating prediction module 53 for each item in the set of the items to be recommended.
The user grouping module 57 is configured to group users according to a user-item rating matrix of all the users stored in the user-item rating matrix library S55 in the database S5, obtain a grouping result of all the users and a group center of each group, and store the grouping result and the group center in the user group library S52 of the database S5.
The categorizer generation module 58 is configured to, according to the user grouping result, construct and store a categorizer that uses, as categorizing features, the basic information of each user in each user group stored in the user basic information base S51 in the database S5. In other embodiments of the present invention, a suitable percentage may be chosen according to the number of existing users, several users may be randomly selected from each user group based on this percentage, and the basic information of the selected users may be used as categorizing training set data.
The item hotspot level calculation module 59 is configured to, according to the user grouping result and the user-item rating matrix, independently find several items with the most rating, that is, hotspot items, from each user group, calculate an obtained rating mean, that is, a hotspot level, and store the hotspot level in the user group item hotspot level library S53 of the database S5.
The item similarity calculation module 60 is configured to, according to the user grouping result and the user-item rating matrix, independently calculate a similarity between items in each user group, and store the similarity in the user group item similarity library S54 of the database S5.
In other embodiments of the present invention, the set-of-items-to-be-recommended determination module 542 may use data stored in the item hotspot level calculation module 59 and the item similarity calculation module 60 simultaneously to determine the set of the items to be recommended for the user group to which the target user belongs, and may also use data stored in any of the above two modules upon requirements to determine the set of the items to be recommended for the user group to which the target user belongs.
The timer 56 is configured to periodically trigger the user grouping module 57, the categorizer generation module 58, the item hotspot level calculation module 59, and the item similarity calculation module 60 to process a master data set, including an updated master data set. In other embodiments of the present invention, the module is an optional module.
It can be known according to the description of the recommendation system that, when executing a specific operation, the recommendation system may be divided into two parts, namely, an offline part and an online part. The offline part periodically triggers, through the timer 56, the user grouping module 57, the categorizer generation module 58, the item hotspot level calculation module 59, and the item similarity calculation module 60; these modules may also be triggered manually. Triggering those modules mainly provides data for computation of the online part, which reduces the amount of online calculation and increases the recommendation speed, so as to achieve the purpose of real-time recommendation. Required data is stored in the database S5. The online part is mainly configured to accomplish online recommendation for the target user. Its key steps are obtaining the group to which the target user belongs, the set of the items to be recommended, and the predicted ratings for the items to be recommended; its main task is to find the set of items that best match the interest of the target user and to predict ratings for the items in that set before recommendation.
In step S101, a rating made by each user for each item is acquired.
In step S102, a user-item rating matrix is established according to the user item rating. The established user-item rating matrix is shown in Table 2.
In step S103, users are grouped so as to obtain several user groups and a group center of each user group.
In this embodiment, a k-means clustering algorithm based on a similarity between users is provided to group all users. In other embodiments of the present invention, a variety of grouping methods, such as manual grouping, machine grouping, and manual-machine combined grouping can be adopted.
Grouping all the users by the k-means clustering algorithm based on the similarity between the users includes the following steps. In step (1), a category quantity k and an error precision e are defined, and k users M1, M2, . . . , Mk are randomly selected as initial group centers, where M1, M2, . . . , Mk respectively correspond to categories C1, C2, . . . , Ck. In step (2), for each user U, a distance d(U,Mi)=1−sim(U,Mi),i=1, 2, . . . , k between the user and each group center is calculated, where sim(U,Mi) refers to a similarity between the user U and a group center Mi. The user is categorized into the group whose group center is at the closest distance from the user, and a diversity
is calculated, where t refers to the number of iteration times. In step (3), a new clustering center
is calculated, where ∥U∥ refers to a modulus length of a rating vector of the user U, and ∥Ci∥ refers to the total number of users in a category Ci. In step (4), steps (2) and (3) are repeated until |E(t+1)−E(t)|<e. Each group is allocated one user group ID, and a final group center of each user group is recorded. In this embodiment, as an example, all the users are divided into two user groups. Table 3 is a list of user groups.
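Steps (1) to (4) can be sketched as below. The formulas for the diversity E(t) and for the new clustering center are not reproduced in this text, so the sketch makes two labeled assumptions: E(t) is taken as the sum of the distances from each user to the assigned center, and the new center is taken as the mean of the length-normalized rating vectors of the group members (consistent with the definitions of ∥U∥ and ∥Ci∥, but not confirmed). The initial centers are passed in rather than drawn at random so that the example is reproducible; the function names are hypothetical.

```python
from math import sqrt


def cos_sim(a, b):
    """Cosine similarity of two dense rating vectors (assumed similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_users(vectors, initial_centers, e=1e-6, max_iter=100):
    """vectors: {user: rating list}. Returns ({group index: [users]}, centers)."""
    centers = [list(c) for c in initial_centers]  # step (1): k initial centers
    prev_error = None
    groups = {}
    for _ in range(max_iter):
        # Step (2): assign each user to the closest center, d = 1 - sim.
        groups = {i: [] for i in range(len(centers))}
        error = 0.0
        for user, vec in vectors.items():
            dists = [1.0 - cos_sim(vec, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            groups[best].append(user)
            error += dists[best]  # assumed form of the diversity E(t)
        # Step (3): assumed center update, mean of normalized member vectors.
        for i, members in groups.items():
            if not members:
                continue  # keep the old center for an empty group
            acc = [0.0] * len(centers[i])
            for u in members:
                norm = sqrt(sum(x * x for x in vectors[u])) or 1.0
                for d, x in enumerate(vectors[u]):
                    acc[d] += x / norm
            centers[i] = [x / len(members) for x in acc]
        # Step (4): stop when |E(t+1) - E(t)| < e.
        if prev_error is not None and abs(error - prev_error) < e:
            break
        prev_error = error
    return groups, centers
```

The returned group indices play the role of the user group IDs, and the returned centers are the final group centers that are later used to measure the distance between a target user and other groups.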
Group centers corresponding to User group 1 and User group 2 are shown in Table 4.
In step S201, a user group ID uniquely identifying each user group is acquired.
In step S202, a user-item rating matrix corresponding to all users in a corresponding user group is acquired according to the user group ID.
In step S203, a similarity between items in the user-item rating matrix corresponding to the user group is calculated and saved.
In other embodiments of the present invention, the similarity between the items may adopt a cosine similarity, a Pearson correlation coefficient, or a modified cosine similarity. In this embodiment, if the cosine similarity is adopted, the similarity between the items corresponding to each user group is obtained, as shown in Table 5 and Table 6.
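A minimal sketch of steps S201 to S203 with the cosine similarity is given below. The data layout (a nested dictionary of ratings and a dictionary mapping each group ID to its members) and the function names are assumptions made for the example; unrated items are treated as zero when the vector lengths are computed.

```python
from itertools import combinations
from math import sqrt


def item_similarities(group_ratings):
    """group_ratings: {user: {item: rating}} for one user group.

    Returns {(item_a, item_b): cosine similarity} with item_a < item_b.
    """
    # Turn the user rows into item columns.
    columns = {}
    for user, row in group_ratings.items():
        for item, r in row.items():
            columns.setdefault(item, {})[user] = r
    sims = {}
    for i, j in combinations(sorted(columns), 2):
        ci, cj = columns[i], columns[j]
        dot = sum(ci[u] * cj[u] for u in ci.keys() & cj.keys())
        ni = sqrt(sum(v * v for v in ci.values()))
        nj = sqrt(sum(v * v for v in cj.values()))
        sims[(i, j)] = dot / (ni * nj) if ni and nj else 0.0
    return sims


def similarities_per_group(ratings, groups):
    """groups: {group_id: [user, ...]}. One similarity table per user group."""
    return {
        gid: item_similarities({u: ratings[u] for u in members})
        for gid, members in groups.items()
    }
```

Each group only sees the ratings of its own members, so the same pair of items can receive a different similarity in each group, and the per-group tables are smaller than a single table built from all users.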
In step S204, it is determined whether all the user groups have been traversed. If the traversal is not completed, the procedure returns to step S201 and continues until all the user groups are traversed. If the traversal is completed, the procedure ends.
In step S301, a user group ID uniquely identifying each user group is acquired.
In step S302, a user-item rating matrix corresponding to each user in a corresponding user group is acquired according to the user group ID.
In step S303, hotspot levels of hotspot items in the user-item rating matrix corresponding to the user group are calculated.
The hotspot items are the top several items rated the most, and the hotspot level of the item is an average of the obtained ratings of the item. In this embodiment, as an example, two hotspot items are taken from each user group, and the hotspot items and hotspot levels of the items corresponding to each user group are shown in Table 7 and Table 8.
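The hotspot calculation for one user group can be sketched as follows: count how many ratings each item received, keep the most-rated items, and use the mean rating of each as its hotspot level. The function name and the tie-breaking rule (by item identifier) are assumptions made for the example.

```python
from collections import defaultdict


def hotspot_items(group_ratings, m):
    """group_ratings: {user: {item: rating}} for one user group.

    Returns {item: hotspot level} for the m items rated the most.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in group_ratings.values():
        for item, r in row.items():
            counts[item] += 1
            totals[item] += r
    # Most-rated first; ties are broken by item identifier (an assumption).
    most_rated = sorted(counts, key=lambda i: (-counts[i], i))[:m]
    # Hotspot level: average of the ratings the item obtained in this group.
    return {i: totals[i] / counts[i] for i in most_rated}
```

For a target user without any rating record, the hotspot level can then serve directly as the predicted rating of the hotspot item.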
In step S304, it is determined whether all the user groups have been traversed. If the traversal is not completed, the procedure returns to step S301 and continues until all the user groups are traversed. If the traversal is completed, the procedure ends.
In step S401, IDs of users occupying a preset proportion a % of the total number of users of each user group are randomly selected.
In step S402, basic attributes of the selected users are acquired.
In step S403, features of the basic attributes of the selected users are analyzed to construct the categorizer. In the embodiments of the present invention, a variety of methods such as decision tree and neural network may be adopted to construct the categorizer.
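Steps S401 to S403 can be illustrated with the sketch below. The sampling follows the text; the categorizer, however, is a deliberately simple stand-in that assigns a new user to the group of the sampled user sharing the most attribute values. It is not a decision tree or a neural network and is used here only to keep the example self-contained. All names and the attribute layout are assumptions.

```python
import random
from math import ceil


def sample_training_set(groups, attributes, percent, seed=0):
    """Randomly pick `percent` % of each group's users (at least one per group).

    groups: {group_id: [user, ...]}; attributes: {user: {attribute: value}}.
    Returns a list of (attribute dict, group_id) training pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    training = []
    for gid, members in groups.items():
        if not members:
            continue
        size = max(1, ceil(len(members) * percent / 100))
        for user in rng.sample(members, size):
            training.append((attributes[user], gid))
    return training


def categorize(training, target_attrs):
    """Stand-in categorizer: group of the sampled user with the most matching attributes."""
    def overlap(attrs):
        return sum(1 for key, value in target_attrs.items() if attrs.get(key) == value)

    _, best_gid = max(training, key=lambda pair: overlap(pair[0]))
    return best_gid
```

In a fuller implementation the training pairs would be fed to a decision tree or neural network learner, as the text suggests; the sampling step would stay the same.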
The procedures as shown in
In step S501, an ID of a user to be recommended is determined. Generally, the user is referred to as a target user, that is, the ID of the target user is acquired.
In step S502, it is determined, according to the target user ID, whether the corresponding target user is in a user group. If the corresponding target user is in the user group, step S503 is performed; otherwise, step S504 is performed.
In step S503, a user group ID corresponding to the target user is acquired.
In step S504, a basic attribute of the target user is acquired.
In step S505, a categorizer is utilized to categorize the target user into a certain corresponding user group, so as to acquire a corresponding user group ID.
In step S506, it is determined whether the target user has an item rating record. If the target user has an item rating record, step S507 is performed; otherwise, step S508 is performed.
In step S507, based on an item similarity and a user item rating in the user group to which the target user belongs, items that have a high similarity to an item with a high user rating and that the target user does not rate are selected as a set to be recommended, that is, a set of similar items to be recommended is determined.
In step S508, a predicted rating made by the target user for a hotspot item of the user group to which the target user belongs is calculated. In this embodiment, the number of hotspot items may be required to be not smaller than N.
In step S509, it is determined whether the number of items in the set to be recommended is not smaller than N. If the number of items in the set to be recommended is smaller than N, step S511 is performed; if the number of items in the set to be recommended is larger than or equal to N, step S510 is performed.
In step S510, a predicted rating made by the target user for each item in the set to be recommended is calculated.
In step S511, distances between the target user and group centers of other user groups are calculated, and a set to be recommended is selected from other groups at the closest distance from the target user and is united with the set to be recommended in the foregoing step, until the number of the items in the set to be recommended is not smaller than N , or until all the user groups are traversed.
In step S512, N items with the highest predicted rating are recommended to the target user as recommended items.
In this embodiment, steps S504 and S505 handle the case in which a new target user is not in any existing user group: the new user is first assigned to a group, and recommendation is then performed. It can be foreseen that, when new target users need not be considered, steps S504 and S505 are optional steps. Step S506 selects between two recommendation procedures, one for a target user who has a rating record and one for a target user who has none. In other embodiments of the present invention, only one of the two recommendation procedures may be adopted. Steps S507, S508, and S510 likewise provide two recommendation algorithms. It can be foreseen that, in other embodiments of the present invention, either one of the two recommendation algorithms may be adopted. Steps S509 and S511 show a procedure in which, when the number of the items in the set to be recommended is smaller than N, the set to be recommended is extended from an adjacent user group. It can be foreseen that, in other embodiments of the present invention, if the number of the items in the set to be recommended is not limited, steps S509 and S511 are optional steps. Step S510 is a step for increasing recommendation accuracy. In other embodiments of the present invention, if the set to be recommended is directly recommended to the user, step S510 is an optional step. In conclusion, the steps in the procedure of the method according to this embodiment can be flexibly adjusted: some of the steps may be adopted while others are skipped according to the required recommendation accuracy, and each such variant is capable of increasing the recommendation accuracy.
In step S601, a target user ID is acquired to determine a corresponding target user.
In the embodiment of the present invention, the target user is provided by a service invoking party. The service invoking party provides the target user ID, and intends to acquire a list of recommended items of the target user. It is assumed that User 7 is the target user, and Table 9 is a user-item rating matrix.
In step S602, an ID of a user group to which the target user belongs is acquired. In this embodiment, it can be known from Table 3 that User 7 belongs to User group 2. If the target user is a new user, the user needs to be categorized according to the user's basic information so as to acquire an ID of a user group to which the new user belongs.
In step S603, a set to be recommended is determined. Firstly, items with high ratings made by User 7 are taken, and a rating made by User 7 larger than or equal to 4 is regarded as a standard for high ratings. For example, items with ratings larger than or equal to 4 are Item 4, Item 7, and Item 8. Next, items, which are with a high similarity to Item 4, Item 7, and Item 8 (the high similarity means that a mean of similarities between a selected item and Item 4, Item 7, and Item 8 is larger than 0.5) and are not rated by User 7, are obtained by searching Table 6 of the above embodiment as a set to be recommended. That is, the set to be recommended includes Item 6 and Item 3. When the number of items in a set of items to be recommended is not smaller than N (in this embodiment, it is assumed that N is equal to 1), the set to be recommended has two items, which satisfies the condition that the number of the items is not smaller than 1.
If the number of the items in the set to be recommended is smaller than 1, distances between the target user and other group centers need to be calculated, so as to find the closest user group and select the set to be recommended from this user group, until the total number of the items in the set to be recommended is not smaller than 1, or until all the user groups are traversed.
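The selection of the set to be recommended, including the fallback to other user groups, can be sketched as follows. The high-rating standard of 4 and the mean-similarity threshold of 0.5 are taken from this embodiment; the function names, the pair-keyed similarity tables, and the precomputed distance dictionary are assumptions made for the example.

```python
def candidates_in_group(user_ratings, item_sim, high=4, threshold=0.5):
    """Unrated items whose mean similarity to the user's highly rated items exceeds threshold."""
    liked = [i for i, r in user_ratings.items() if r >= high]
    if not liked:
        return set()
    known_items = {x for pair in item_sim for x in pair}
    result = set()
    for item in known_items - set(user_ratings):
        total = sum(
            item_sim.get((item, j), item_sim.get((j, item), 0.0)) for j in liked
        )
        if total / len(liked) > threshold:
            result.add(item)
    return result


def build_candidate_set(user_ratings, own_group, group_sims, distances, n):
    """Start in the user's own group, then visit other groups from nearest to farthest.

    group_sims: {group_id: similarity table}; distances: {group_id: distance
    between the target user and that group's center}.
    """
    others = sorted((g for g in group_sims if g != own_group), key=distances.get)
    candidates = set()
    for gid in [own_group] + others:
        candidates |= candidates_in_group(user_ratings, group_sims[gid])
        if len(candidates) >= n:
            break  # enough items; otherwise continue until all groups are traversed
    return candidates
```

The loop ends either when at least N items have been collected or when every user group has been visited, which mirrors the two stopping conditions stated above.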
If the target user has no rating record, a predicted rating made by the target user for a hotspot item of a group to which the target user belongs is calculated. The predicted rating may refer to results in Tables 7 and 8 according to the above embodiment.
In step S604, the predicted rating is calculated.
A formula
is employed for calculation, where PU,i denotes a predicted rating made by a target user U for Item i, sim(j,i) denotes a similarity between Item j and Item i, and RU,j denotes an actual rating made by the user U for Item j. According to the formula, predicted ratings made by User 7 for items to be recommended are shown in Table 10.
In step S605, items satisfying the above condition are recommended to the user. According to Table 10, finally, Item 3 is recommended to User 7.
According to the embodiments of the present invention, a recommendation method and system based on collaborative filtering are provided. In an offline processing procedure of the method, firstly, users are grouped with user item rating data, then a similarity between items is independently calculated in each user group, and a categorizer is established according to a grouping result, so that new users may also be appropriately categorized. During online recommendation, it is needed to acquire a group to which a target user belongs, and the similarity between the items related to the group is utilized to perform item-based collaborative filtering recommendation for the target user, or a hotspot level of a hotspot item related to the group is utilized to perform recommendation for the target user. Compared with a conventional collaborative recommendation procedure, in the present invention, users are firstly grouped, so that preferences of the users in each user group are substantially the same, and item similarity information contained in each user group is utilized to recommend items for the users, thereby improving the accuracy of recommendation and realizing personalized recommendation. Meanwhile, the calculating the similarity after grouping also increases the calculation speed of offline processing.
It will be apparent to persons skilled in the art that various modifications and variations can be made to the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of the present invention provided they fall within the scope of the following claims and their equivalents.
Persons of ordinary skill in the art should understand that all or a part of the steps of the method according to the embodiments of the present invention may be implemented by a program instructing relevant hardware. The program may be stored in a computer readable storage medium. When the program is run, the steps of the method according to the embodiments of the present invention are performed. The storage medium may be any medium that is capable of storing program codes, such as a ROM, a RAM, a magnetic disk, and an optical disk.
Number | Date | Country | Kind |
---|---|---|---|
200810216517.9 | Sep 2008 | CN | national |
This application is a continuation of International Application No. PCT/CN2009/073275, filed on Aug. 14, 2009, which claims priority to Chinese Patent Application No. 200810216517.9, filed on Sep. 27, 2008, both of which are hereby incorporated by reference in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2009/073275 | Aug 2009 | US |
Child | 13072155 | | US |