By counting pair-wise co-occurrences of items (or terms) in user buckets (alternatively called transactions), one can measure the affinity between item pairs. More particularly, an affinity is a measure of association between different items. A person may want to know an affinity among items in order to identify or better understand possible correlations or relationships between items such as events, interests, people or products. An affinity may be useful to predict preferences. For instance, an affinity may be used to predict that a person interested in one subject matter is also likely to be interested in another subject matter, in order to make an item-based recommendation.
Taking music as an example, a music recommendation engine may recognize that people who downloaded Song A also downloaded Song B. Therefore, a user X who has downloaded Song A may also be interested in downloading Song B, and Song B is recommended to user X.
Two commonly used measures are:
Affinity(A, B) = P(B|A) = N(A∩B)/N(A)
Lift(A, B) = P(B|A)/P(B) = N(A∩B)/(N(A)*N(B))
While affinity is a measure of the prevalence of a second item in association with a first item, lift is a measure of the same association that also takes into account the popularity of the second item. Put another way, lift is a measure of the extent to which the conditional probability of the second item occurring differs from the overall unconditional probability of the second item occurring. Thus, for example, when lift is considered, a very popular “other item” will not skew the recommendation.
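The two measures above can be computed directly from co-occurrence counts. The following is a minimal single-machine sketch (not the distributed method described later); the function name and input shape are illustrative, and lift follows the document's pair count/(item1 count * item2 count) form:

```python
from itertools import combinations
from collections import Counter

def affinity_lift(buckets):
    """For each ordered item pair (a, b) seen together in some bucket,
    return (affinity, lift), where affinity = N(a and b)/N(a) and
    lift = N(a and b)/(N(a)*N(b)) as in the formulas above."""
    item_count = Counter()
    pair_count = Counter()
    for items in buckets:
        unique = sorted(set(items))          # count each item once per bucket
        item_count.update(unique)
        pair_count.update(combinations(unique, 2))
    stats = {}
    for (a, b), n_ab in pair_count.items():
        lift = n_ab / (item_count[a] * item_count[b])
        stats[(a, b)] = (n_ab / item_count[a], lift)
        stats[(b, a)] = (n_ab / item_count[b], lift)
    return stats
```

For example, with buckets [["A","B"], ["A","B"], ["A","C"]], Affinity(B, A) is 2/2 = 1.0 while Affinity(A, B) is only 2/3, reflecting the asymmetry of the measure.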
For a web-scale data set, items may number in the millions and users may number in the tens or even hundreds of millions.
In accordance with an aspect, a computer-implemented method determines pair-wise item affinity based on transaction records tangibly embodied in at least one computer-readable medium, each transaction record including an indication of a bucket and an indication of an item transacted corresponding to that bucket. The method comprises a Phase 1 bucket filtering to determine a total number of potential item pairs across all the buckets in a partition and a total count of unique items for that partition. In a Phase 2 item count, a count of the number of appearances of each item in all the buckets collectively is determined and, for each item, that item is encoded based at least in part on the item distribution determined in Phase 1.
A Phase 3 bucket materialization includes, for each bucket, collecting into one record all item codes for items transacted in correspondence with that bucket. For each bucket, the one record for that bucket is processed to determine a number of item pairs that can be generated for that bucket, and that bucket is encoded based at least in part on the item pair distribution determined in Phase 1.
In a Phase 4 pair count and affinity/lift calculation, pairs of item codes are generated, and affinity statistics are generated based on the generated pairs of item codes. The generated pairs of item codes and affinity statistics are stored in a tangible computer-readable medium.
The phases are ideally suited to be carried out by a computing system in a map-reduce configuration.
The inventor has realized that, for affinity and lift determinations, it can be infeasible to calculate all the pair-wise measurements on a single machine due to memory and time constraints. The inventor has thus realized that a parallel platform (for example, a map/reduce paradigm) may be used to make the affinity and lift determinations in a very efficient way. Furthermore, the determinations may be optimized by selectively utilizing integer or other coding such as described in U.S. Pat. No. 6,873,996, assigned to Yahoo! Inc., the assignee of the present patent application.
For example, referring to
We now describe the map-reduce processing 106 broadly, in conjunction with
Various aspects of the map-reduce processing 106, relative to examples of determining affinity and lift, are described in greater detail later, with reference to a particular illustrative example. In the description, a “bucket” is a broad term for a category to which items may correspond, for affinity purposes, such as a “user” or other tangible entity who has transacted a particular actual tangible item. Throughout this description, we refer to a <bucket, item> pair as an indication of one actual tangible transaction of the “item” by the “bucket.”
Referring now to
For example, in order to determine affinity and lift for each possible pair of items, in accordance with an aspect, at least four stages of map/reduce jobs may be utilized. (The affinity/lift determination can also be implemented in a separate map/reduce job by joining the output from the item count stage.) The complexity of the bucket filtering, item count and bucket materialization stages is linear in the number of <bucket, item> pairs in the input. However, the pair count determination is more complex.
For example, a bucket having 100 corresponding item transactions generates 100!/(98!*2!) = 4,950 pairs of items, while a bucket with 10 corresponding item transactions generates only 45 pairs. So a task taking 100 buckets with 495,000 total pairs will take roughly 110 times longer to finish than a task taking 100 buckets with 4,500 total pairs. If the amount of work during pair generation is not distributed well among all the processes, a few processes may take most of the time, which can essentially eliminate any benefits introduced by using a map/reduce platform. Thus, it can be significant to distribute all the buckets in a balanced way such that each process gets a similar amount of work.
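The quadratic growth described above can be checked with a short calculation; the function name here is illustrative:

```python
from math import comb

def total_pairs(bucket_sizes):
    """Total number of item pairs generated across buckets: sum of C(n, 2)."""
    return sum(comb(n, 2) for n in bucket_sizes)

# A bucket of 100 items yields C(100, 2) = 4,950 pairs; a bucket of 10
# items yields only C(10, 2) = 45 pairs. One hundred buckets of each size
# yield 495,000 versus 4,500 total pairs, a roughly 110x workload gap.
heavy = total_pairs([100] * 100)
light = total_pairs([10] * 100)
```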
Also, the bucket and item indications may be arbitrarily long strings, which are computationally expensive to manipulate (such as to compare one string to another string) and store. By using an integer or other easily-manipulable code to represent the strings, a lot of disk and memory space can be saved and, also, computational performance may be increased.
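The dictionary-encoding idea can be sketched as follows; this is a hypothetical single-machine illustration (the distributed version, described later, assigns codes per partition), and the names are not from the source:

```python
def encode_items(records):
    """Replace each distinct item string with a compact integer code.

    records: iterable of (bucket, item) pairs with arbitrary string items.
    Returns the encoded (bucket, item code) pairs and the code dictionary,
    so later stages compare and store small integers instead of long strings.
    """
    code_of = {}
    encoded = []
    for bucket, item in records:
        if item not in code_of:
            code_of[item] = len(code_of)   # next unused integer
        encoded.append((bucket, code_of[item]))
    return encoded, code_of
```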
Still referring to
A third phase 112 of the map/reduce processing includes bucket materialization. More specifically, for each bucket, all item integer codes for items transacted in correspondence with that bucket are collected into one record for that bucket. Further, for each bucket, the one record for that bucket is processed to determine a number of item pairs that can be generated for that bucket, and that bucket is integer-encoded based at least in part on the determined number of item pairs that can be generated for that bucket.
In a fourth phase 114 of the map/reduce processing, a pair count and affinity/lift determination is carried out. In particular, pairs of item codes are generated, and affinity statistics are determined based on generated pairs of item codes.
We now describe, more specifically, an example affinity/lift determination solution using a map-reduce architecture such as that shown in overview in
As mentioned above, a first map-reduce phase (Phase 1) is a bucket filtering and distribution determination phase. Thus for example, purpose(s) of this phase may include the following:
Thus, as illustrated in
As also mentioned above, Phase 1 may also include a determination of items and pairs across processing partitions. For example, as illustrated in
In addition to illustrating the partitioning,
As can further be seen from
We now discuss an example of a second map-reduce phase (Phase 2) which is an item count and integer encoding phase. More specifically, purpose(s) of this phase may include the following:
Thus, as illustrated in
Phase 2 processing may also include a determination of allocation of items and pairs across processing partitions. For example, as illustrated in
With regard to the Phase 2 reduce 504 processing, the reduce processing determines the number of appearances for an item. Also, using the item distribution information from Phase 1, the start/end range of the items in each partition is known. An in-range integer number is used to encode each item. Using the example discussed above with respect to Phase 1 (in which the total item counts per partition are 500, 2500, 5000 and 12000, respectively), the ranges of integers reserved for the first, second, third and fourth partitions are [0-499], [500-2999], [3000-7999], and [8000-19999], respectively. So the item representation in string form is converted into a much more compact representation in integer form. The reducer outputs two sets of data in the form <bucket, item code> and <item code, item, item count>. In one example, each set of data goes to a separate file.
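The reserved ranges above are simply cumulative sums of the per-partition item counts. A small sketch of that bookkeeping (function name illustrative):

```python
from itertools import accumulate

def partition_ranges(counts):
    """Given per-partition unique-item counts, reserve a disjoint integer
    range for each partition so each reducer can assign item codes
    independently without collisions."""
    ends = list(accumulate(counts))          # running totals
    starts = [0] + ends[:-1]
    return [(s, e - 1) for s, e in zip(starts, ends)]
```

For the example counts of 500, 2500, 5000 and 12000, this yields [0-499], [500-2999], [3000-7999] and [8000-19999].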
A third map-reduce phase (Phase 3) is a bucket materialization phase. More specifically, purpose(s) of this phase may include the following:
As illustrated in
We now describe Phase 3 reduce processing, with reference to
Thus, for each bucket, the reducer 504 output takes the form of <bucket code, item code 1, item code 2, . . . , item code K>. As a result, both bucket and item are represented by using integers.
A fourth map-reduce phase (Phase 4) is a pair-count phase. The Phase 4 processing may accomplish a customized split based on a bucket code, such that buckets may be distributed to mappers so that each mapper generates a similar number of pairs. For example, the Phase 4 map processing may be such that the workload allocated to each mapper is calculated as the total number of pairs divided by the number of mappers. Then, for each mapper, buckets are accumulated until the number of pairs accumulated since the start bucket is greater than or equal to the allocated workload.
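The customized split described above can be sketched as a greedy accumulation over bucket sizes; this is one plausible reading of the scheme, with illustrative names:

```python
from math import comb

def split_buckets(bucket_sizes, n_mappers):
    """Assign contiguous runs of buckets to mappers so each run generates
    roughly total_pairs / n_mappers item pairs. Returns (start, end)
    bucket-index ranges, inclusive, one per mapper."""
    total = sum(comb(n, 2) for n in bucket_sizes)
    target = total / n_mappers               # allocated workload per mapper
    splits, acc, start = [], 0, 0
    for i, n in enumerate(bucket_sizes):
        acc += comb(n, 2)
        if acc >= target and len(splits) < n_mappers - 1:
            splits.append((start, i))        # close out this mapper's run
            start, acc = i + 1, 0
    splits.append((start, len(bucket_sizes) - 1))
    return splits
```

With one 100-item bucket and eleven 10-item buckets split across two mappers, the heavy bucket alone fills the first mapper's share, and all the small buckets go to the second.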
Referring to
In the Phase 4 processing illustrated in
1, 1, 2, 3, 4, 5, 6
The output 804a of the first mapper 801a (encoding each pair using the following matrix) is as follows:
1, 1
2, 1
3, 1
4, 1
5, 1
6, 1
7, 1
8, 1
9, 1
10, 1
11, 1
12, 1
13, 1
14, 1
15, 1
The input 804b to the second mapper 801b is
16, 1, 2, 4, 6
22, 2, 4, 5
The output 804b of the second mapper 801b is as follows:
1, 1
3, 1
5, 1
7, 1
9, 1
14, 1
7, 1
8, 1
13, 1
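The pair-encoding matrix is not reproduced here, but the example codes are consistent with a row-by-row numbering of the upper triangle of the item-pair matrix. The following formula is one reconstruction matching the codes above (with K = 6 items in the example); it is an inference from the example, not taken from the source:

```python
def pair_code(i, j, K):
    """Encode item pair (i, j), with 1 <= i < j <= K, as a single integer
    by numbering the upper-triangular pair matrix row by row:
    (1,2)->1, (1,3)->2, ..., (1,K)->K-1, (2,3)->K, ..., (K-1,K)->K(K-1)/2."""
    return (i - 1) * (2 * K - i) // 2 + (j - i)

# Bucket 16 holds items 1, 2, 4, 6; its six pairs encode to 1, 3, 5, 7, 9, 14,
# matching the second mapper's output above.
```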
A two-dimensional block partition is carried out using the pair code, so each reducer of the Phase 4 processing receives pairs in a particular item code range, e.g., the first item code in range [x1-x2], the second item code in range [y1-y2], and so on.
For example, the reduce stage may carry out a pair count and affinity/lift calculation. In one example, first, the appearance of each pair is counted. Then, the pair code is mapped back to the item code, and the result is in the form <item code 1, item code 2, pair count>. By loading relevant other output from Phase 2, an item count look up and affinity/lift determination may be performed as follows:
Aff1 = pair count/item1 count
Aff2 = pair count/item2 count
Lift1 = Lift2 = pair count/(item1 count*item2 count)
Two rules can be output: <item1, item2, aff1, lift1> and <item2, item1, aff2, lift2>.
Only the relevant item code, item and item count need be loaded.
For example, referring to
1, I1, 2
2, I2, 3
So the affinity and lift can be determined as follows:
Affinity(I1, I2)=2/2=1
Affinity(I2, I1)=2/3=0.667
Lift(I1,I2)=Lift(I2,I1)=2/(2*3)=0.333
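The worked numbers above follow directly from the Phase 4 formulas; as a quick check (variable names illustrative):

```python
# Phase 2 lookup gives the item counts: <1, I1, 2> and <2, I2, 3>.
pair_count = 2                       # pair (I1, I2) appears in two buckets
item1_count, item2_count = 2, 3

aff1 = pair_count / item1_count                    # Affinity(I1, I2) = 2/2 = 1.0
aff2 = pair_count / item2_count                    # Affinity(I2, I1) = 2/3 ~ 0.667
lift = pair_count / (item1_count * item2_count)    # Lift = 2/(2*3) ~ 0.333
```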
For example, the output for pair 1 will be:
I1, I2, 1, 0.333
I2, I1, 0.667, 0.333
For simplicity of illustration, the remaining determined affinity/lift indications are not shown.
It can thus be seen that, with the use of a map-reduce parallel platform, pairwise affinity and lift determinations can be efficiently carried out. Furthermore, by using integer-encoding, computationally intensive operations of the map-reduce processing can be made more efficient.
Embodiments of the present invention may be employed to facilitate affinity/lift determinations in any of a wide variety of computing contexts. For example, as illustrated in
According to various embodiments, applications may be executed locally, remotely or a combination of both. The remote aspect is illustrated in
The various aspects of the invention may also be practiced in a wide variety of network environments (represented by network 1012) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of computer-readable media, and may be executed according to a variety of computing models including, for example, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.