Given information about Internet users, such as what search terms they have entered, behavioral targeting is often performed, such as to send advertisements tailored to specific groups of users. Web personalization also is based on user-specific information, as are other technologies.
As there are too many different users to treat each one individually, users are clustered according to similarities found from such information. In user clustering, users are classically represented by their previous activities such as their search queries or clicked URLs. However, it is a challenging task to cluster millions of users, due to the high complexity of classical clustering algorithms.
Such applications are also interested in temporal clustering, such as to cluster users based on their activities in the last month. However, known temporal clustering techniques (e.g., based upon streaming data) are inadequate in that they are inefficient and inflexible, and cannot cluster users in a discrete time window of any specified length. For example, streaming data techniques are unable to cluster a large number of users according to their activities on every weekend of last month, or in some other discrete time window.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which users are clustered together based on MinHash computations that produce signatures corresponding to users' internet-related activities. In one aspect, users are clustered together on the basis of having similar signature sets, e.g., based on commonality of signatures therein. The signature sets and/or clusters may be associated with timestamps or the like, whereby clusters may be determined for a given discrete time window or set of discrete time windows.
In one aspect, the signature set of one user is determined by performing the MinHash computations for a user's activities relative to a number of (e.g., twenty to thirty) permutations of combined internet-related data for a plurality of (e.g., all) users. To facilitate efficient processing, existing, prior signature sets of a user are incrementally updated as each new signature set is computed (e.g., daily). To further facilitate efficient processing, the MinHash computations for users are partitioned among parallel computing machines.
In one aspect, the timestamps may be used to selectively determine a cluster based on a continuous time, a time window or set of time windows. For example, an advertiser can determine which users were clustered together on the past ten weekends (had similar signature sets on Saturdays and Sundays only).
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards efficiently clustering a large number of users/objects in a discrete or continuous time window. In one aspect, this is accomplished by parallel computation using a MinHash clustering algorithm with an efficient time stamp merging module. As will be understood, such clustering technology provides significant benefits in behavioral targeting, social network mining, personalization research as well as related applications.
It is understood that any of the examples described herein are only examples. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data processing in general.
As shown in
As shown in
To this end, given a set of activities, random permutations are used to calculate MinHash signatures for users. By way of example, consider that the following comprises the set of activities typed in by users:
[xbox, car, Halo3, laptop]
with the activities of user1=[xbox, Halo3],
and the activities of user2=[Halo3, laptop].
In a first round of permutations/minwise hashing the set of activities is reordered as:
[Halo3, car, laptop, xbox].
Because Halo3 is first in this ordering and each user has Halo3 as an activity, the MinHash signature of user1, mh(user1)=Halo3, and the MinHash signature of user2, mh(user2)=Halo3.
A second round permutation/minwise hash reorders the activities as:
[car, laptop, xbox, Halo3].
This time, the first activity in the ordering among those entered by user1 is “xbox”, and thus the MinHash signature of user1, mh(user1)=xbox. The first activity in the ordering among those entered by user2 is “laptop”, and thus mh(user2)=laptop.
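The two rounds above can be reproduced with a short sketch. The function name `minhash_signature` and the variable names are hypothetical, chosen only for illustration; the orderings are the two given in the example rather than randomly generated ones.

```python
def minhash_signature(activities, ordering):
    """Return the first item of `ordering` that is among the user's activities."""
    activity_set = set(activities)
    for item in ordering:
        if item in activity_set:
            return item
    return None  # the user has no activity that appears in this ordering


user1 = ["xbox", "Halo3"]
user2 = ["Halo3", "laptop"]

# The two reorderings of [xbox, car, Halo3, laptop] used in the example.
round1 = ["Halo3", "car", "laptop", "xbox"]
round2 = ["car", "laptop", "xbox", "Halo3"]

# One signature per round; together they form each user's signature set.
signatures = {
    name: [minhash_signature(acts, ordering) for ordering in (round1, round2)]
    for name, acts in (("user1", user1), ("user2", user2))
}
```

Here `signatures["user1"]` holds Halo3 and xbox, and `signatures["user2"]` holds Halo3 and laptop, matching the values derived in the text.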
The one or more MinHash signatures computed for each user make up a signature set for that user. Given two users, the ratio of the number of MinHash signatures shared between their signature sets to the number of permutations approximates the similarity between the users:
$\Pr(mh_i(u) = mh_i(v)) = \mathrm{sim}(u, v)$
Mathematically, this may be set forth as:
Let $H = \{h_k \mid k = 1, 2, \ldots, c\}$ be a set of min-wise independent permutations, i.e., $\Pr(\min\{h_k(A)\} = h_k(a_j)) = 1/|A|$ ($c$ is twenty in one implementation).
Define the min-wise hash function:
$mh_k(u_i) = \arg\min \{h_k(u_i) \mid u_i \subset A\}$
Then $\mathrm{sim}(u_i, u_j) = |u_i \cap u_j| / |u_i \cup u_j| = \Pr(mh_k(u_i) = mh_k(u_j))$, where $\Pr(mh_k(u_i) = mh_k(u_j))$ is approximated by $|\{g \mid mh_g(u_i) = mh_g(u_j),\ g = 1, 2, \ldots, c\}| / c$.
Thus, similar users get hashed to the same bucket while dissimilar ones do not.
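As a minimal sketch of this estimate, the following code draws c random orderings of the attribute set, computes a signature set per user, and compares the fraction of matching signatures with the exact set-overlap similarity. All function names are hypothetical. Python's seeded `random.Random` stands in for a min-wise independent permutation family, and grouping users whose entire signature sets match is only one possible bucketing rule, assumed here for illustration.

```python
import random
from collections import defaultdict


def make_permutations(attributes, c=20, seed=0):
    """Build c random orderings of the attribute universe as rank lookups."""
    rng = random.Random(seed)
    perms = []
    for _ in range(c):
        order = sorted(attributes)  # sort first so the result is reproducible
        rng.shuffle(order)
        perms.append({attr: rank for rank, attr in enumerate(order)})
    return perms


def signature_set(user_attrs, perms):
    """For each permutation, keep the user's lowest-ranked attribute.

    Assumes user_attrs is non-empty and drawn from the attribute universe.
    """
    return [min(user_attrs, key=rank.get) for rank in perms]


def estimated_similarity(sig_u, sig_v):
    """Fraction of permutations on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_u, sig_v) if a == b)
    return matches / len(sig_u)


def jaccard(u, v):
    """Exact similarity |u ∩ v| / |u ∪ v|, for comparison."""
    u, v = set(u), set(v)
    return len(u & v) / len(u | v)


def bucket_by_signature(signature_sets):
    """Group users whose whole signature sets are identical (an assumed rule)."""
    buckets = defaultdict(list)
    for user, sig in signature_sets.items():
        buckets[tuple(sig)].append(user)
    return dict(buckets)


universe = ["xbox", "car", "Halo3", "laptop"]
perms = make_permutations(universe, c=20)
users = {
    "user1": {"xbox", "Halo3"},
    "user2": {"Halo3", "laptop"},
    "user3": {"xbox", "Halo3"},
}
sigs = {name: signature_set(attrs, perms) for name, attrs in users.items()}
estimate = estimated_similarity(sigs["user1"], sigs["user2"])
exact = jaccard(users["user1"], users["user2"])
buckets = bucket_by_signature(sigs)
```

With only twenty permutations the estimate is coarse; it converges toward the exact value as c grows. Users with identical attribute sets, such as user1 and user3 here, always receive identical signature sets and therefore share a bucket.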
To summarize the upper portion of
$mh_{[t,\,t+k]}(u) = \min\{mh_s(u),\ s = t, t+1, t+2, \ldots, t+k\}$.
In this way, the users' activities may be regularly (e.g., daily) hashed and efficiently merged, and the incremental MinHash allows for user input of a discrete time window, e.g., every weekend in the past year, or the past 3 days, and so forth.
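The merge over a time window can be sketched as follows. This is an illustration under stated assumptions, not the described implementation: each day's signature set is assumed to be stored as a list of c integer hash values (one per permutation, with the same permutations used every day), and the names `merge_signatures` and `weekend_days` are hypothetical.

```python
from datetime import date, timedelta


def merge_signatures(daily, days):
    """Merge per-day signature sets by taking the minimum per permutation.

    daily maps a date to a list of c integer hash values for one user.
    days may be any collection of dates, contiguous or not.
    """
    selected = [daily[d] for d in days if d in daily]
    if not selected:
        return None  # the user had no activity in the requested window
    return [min(column) for column in zip(*selected)]


def weekend_days(start, end):
    """List every Saturday and Sunday between start and end, inclusive."""
    days = []
    current = start
    while current <= end:
        if current.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
            days.append(current)
        current += timedelta(days=1)
    return days


# Hypothetical stored signatures for one user, with c = 3 for brevity.
daily = {
    date(2024, 6, 1): [5, 2, 9],  # Saturday
    date(2024, 6, 2): [3, 7, 9],  # Sunday
    date(2024, 6, 3): [0, 8, 4],  # Monday
    date(2024, 6, 8): [4, 1, 8],  # Saturday
}

weekends = merge_signatures(daily, weekend_days(date(2024, 6, 1), date(2024, 6, 9)))
first_three_days = merge_signatures(
    daily, [date(2024, 6, 1), date(2024, 6, 2), date(2024, 6, 3)]
)
```

Because the minimum over a union of days equals the minimum of the per-day minima, a window of any shape can be answered from the stored daily signatures without rehashing the underlying activities.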
In the lower portion of
Turning to a detailed explanation of parallel MinHash clustering in a flexible time window, let $U = \{u_i,\ i = 1, 2, \ldots\}$ represent a set of objects to process and $A = \{a_j,\ j = 1, 2, \ldots\}$ represent the set of attributes that represent the objects. Each object at time stamp $t$ is represented by a set of attributes $C_{u_i}(t) = \{a_{i1}, a_{i2}, \ldots\}$, where $C_{u_i}(t)$ is a subset of $A$, $i = 1, 2, \ldots$. In this scheme, $i$ is treated as the unique identifier (ID) of $u_i$ and $j$ as the unique ID of $a_j$. IDs for newly appearing objects or attributes are incrementally assigned.
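One simple way to realize incremental ID assignment is a lookup table that hands out the next unused integer whenever an unseen object or attribute appears. The class name `IdRegistry` and its methods are hypothetical, introduced only to illustrate the idea.

```python
class IdRegistry:
    """Assigns consecutive integer IDs, starting at 1, in order of first appearance."""

    def __init__(self):
        self._ids = {}

    def get_id(self, key):
        """Return the existing ID for key, or assign the next one if key is new."""
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def __len__(self):
        return len(self._ids)


user_ids = IdRegistry()
attribute_ids = IdRegistry()

# Time t: two known users and three known attributes.
for user in ("alice", "bob"):
    user_ids.get_id(user)
for attribute in ("xbox", "Halo3", "laptop"):
    attribute_ids.get_id(attribute)

# Time t+1: a new user and a new attribute receive the next IDs.
new_user_id = user_ids.get_id("carol")
new_attribute_id = attribute_ids.get_id("car")
```

Existing IDs never change when new entries arrive, so signatures computed on earlier days remain comparable with later ones.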
Consider that at time t there is a collection of n objects and a collection of m attributes. If a new user and a new attribute appear at time t+1, n+1 and m+1 are incrementally assigned as IDs for the new user and new attribute, respectively. A parallel MinHash clustering algorithm in a flexible time window is set forth below and visually represented in
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 310 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 310 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 310. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 330 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 331 and random access memory (RAM) 332. A basic input/output system 333 (BIOS), containing the basic routines that help to transfer information between elements within computer 310, such as during start-up, is typically stored in ROM 331. RAM 332 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 320. By way of example, and not limitation,
The computer 310 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 310 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 380. The remote computer 380 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 310, although only a memory storage device 381 has been illustrated in
When used in a LAN networking environment, the computer 310 is connected to the LAN 371 through a network interface or adapter 370. When used in a WAN networking environment, the computer 310 typically includes a modem 372 or other means for establishing communications over the WAN 373, such as the Internet. The modem 372, which may be internal or external, may be connected to the system bus 321 via the user input interface 360 or other appropriate mechanism. A wireless networking component 374 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 310, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 399 (e.g., for auxiliary display of content) may be connected via the user interface 360 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 399 may be connected to the modem 372 and/or network interface 370 to allow communication between these systems while the main processing unit 320 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.