The advent of social networking sites on the Internet has led an unprecedented number of registered users to engage in activities such as commenting on, liking, and re-sharing content, as well as interacting with each other to share thoughts. The exponential growth of information repositories and the diversity of users on these social networking sites present great challenges.
The following detailed description references the drawings, wherein:
A user of a social network may have certain interests, such as products, events, items, etc., as well as connections to other people. These connections may be formally established through a direct connection or informally established. An informally established connection may be between users that are connected through a third user, connected through a similar interest, connected through an action such as commenting on the same page, etc. A mutual bidirectional interaction is an action by the user that is influenced by both the user's individual interests and the user's connections.
For example, a first user may make a decision with respect to a first product based on her own interest in the first product and/or based on a second user's opinion. The opinion of the second user may be expressed as a comment on the social network, a message from the second user to the first user, an endorsement by the second user (a like, a thumbs up, etc.), etc. The first and second user may also be connected on the social network. Accordingly, the connection between the first user and the second user may be a mixture of their prior impressions of each other and their similar interests in product(s), such as the first product. The widespread social phenomenon of homophily suggests that socially acquainted users tend to behave similarly. The homophily social effect is also called the theory of "birds of a feather flock together"—people tend to follow the behaviors of their friends, and people tend to create relationships with other people who are already similar to them.
Determining the likelihood of a connection between the first user and the second user may be helpful in discovering similar interests for product recommendation. Moreover, if two users have similar interests, there may be a high likelihood of a connection between them. With the dramatic growth and great success of many large-scale online social networking services, social media establishes connections between companies and users. Tracking the data created by users on social networks may allow companies to gain feedback and insight into users' interests.
Recommending products to consumers could not only enhance revenue and profit, but also help commercial companies to understand consumers' interests and market demand. Moreover, discovering potentially valuable consumers through the connections of users on social media can aid companies in better decision making, and ultimately benefit product recommendation. The system for user interest and relationship determination leverages the bidirectional interactions between users' preferences and user-user connections in big social media and performs simultaneous user interest recommendation and connection discovery.
An example method for user interest and relationship determination may include distributing a first set of pairs and a second set of pairs to a plurality of data nodes, wherein each pair in the first set of pairs is of a user of a social network and a product on the social network and each pair in the second set of pairs defines a connection between users on the social network. The method may also include calculating, on a first data node belonging to the plurality, a first probability of a first user's interest in a first product based on a first observable factor and a first latent factor, wherein the first user and the first product belong to a first pair from the first set of pairs. The method may also include calculating, on a second data node, a second probability of a likelihood of a relationship between the first user and a second user of the social network, based on a second observable factor and a second latent factor, wherein the first user and the second user belong to a second pair from the second set of pairs. The method may also include determining, based on the first probability and the second probability, a most likely interest of the first user and a most likely relationship of the first user and predicting a potential interest of the first user based on the most likely interest and the most likely relationship.
Memory 104 stores instructions to be executed by processor 102, including instructions for distributor 110, first calculator 112, second calculator 114, output generator 116, interest and relationship determiner 118, potential interest and relationship predictor 120, and/or other components. According to various implementations, user interest and relationship determination system 100 may be implemented in hardware and/or a combination of hardware and programming that configures hardware.
Processor 102 may execute instructions of distributor 110 to distribute a first set of pairs and a second set of pairs to a plurality of data nodes. A data node stores data in the file system. Each set of pairs may include any number of pairs. Each pair in the first set of pairs may be of a user of a social network and an interest of the user on the social network. Interests may include products, events, items, etc. Each pair in the second set of pairs may define a connection between users on the social network. The connection may be a direct connection or an indirect connection. An indirect connection may be between users that are connected through a third user, connected through a similar interest, connected through activities, such as commenting on the same page, etc.
The first pair and the second pair may be used as a first input key and a second input key, respectively, for a map function. A first observable factor and a first latent factor may be used as values for the first input key. A second observable factor and a second latent factor may be used as values for the second input key.
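As an illustrative sketch, the pair-to-key layout described above might look like the following. The tuple structure, the `"Y"`/`"S"` tags, and the `factors` lookup are assumptions made for illustration, not details from the source.

```python
# Illustrative sketch: build map-phase input records from the two sets of
# pairs. Each pair becomes an input key; its observable and latent factors
# become the values for that key.

def build_map_inputs(user_interest_pairs, user_user_pairs, factors):
    """Return (input key, input value) records for the map function.

    factors maps each pair to its (observable factor, latent factor) tuple.
    """
    records = []
    for i, j in user_interest_pairs:        # first set: <user i, interest j>
        records.append((("Y", i, j), factors[(i, j)]))
    for i, k in user_user_pairs:            # second set: <user i, user k>
        records.append((("S", i, k), factors[(i, k)]))
    return records
```

Tagging each key with `"Y"` or `"S"` lets a single map function distinguish user-interest pairs from user-user pairs.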
Distributor 110 may distribute the first and second sets of pairs using a distributed data processing framework. Distributor 110 may distribute each pair in the first set of pairs and the second set of pairs to a plurality of data nodes. Each data node in the plurality of data nodes may process a pair. One example framework is the Apache™ Hadoop® framework, which allows for the scalable, parallel, and distributed computing of large data sets across clusters of computers using programming models such as MapReduce. Hadoop® consists of two layers: a data storage layer, the Hadoop Distributed File System, and a data processing layer, the MapReduce framework. The MapReduce framework adopts a master-slave architecture consisting of one master node and multiple slave nodes in the cluster. The master node generally serves as the JobTracker and each slave node generally serves as a TaskTracker.
Distributor 110 may also use a MapReduce programming technique. MapReduce is based on two functions: Map and Reduce. The Map function applies a user-defined function to each key-value pair <input key; input value> in the input data. The result of the map function may be a list of intermediate key-value pairs, sorted and grouped by key (i.e., list[<map key; map value>]), and passed as input to the Reduce function. The Reduce function applies a second user-defined function to the intermediate key and its associated values (i.e., <map key; list[map value]>), and produces the final aggregated result [<output key; output value>].
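The Map/Reduce flow just described can be modeled in a few lines of single-process Python. This is a didactic sketch of the semantics only, not the distributed framework itself.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Model of the flow above: Map emits intermediate key-value pairs,
    which are grouped by key and passed to Reduce for aggregation."""
    intermediate = defaultdict(list)
    for key, value in records:
        # Map: emit list[<map key; map value>] for each input pair
        for mk, mv in map_fn(key, value):
            intermediate[mk].append(mv)
    # Reduce: <map key; list[map value]> -> [<output key; output value>]
    return [reduce_fn(mk, mvs) for mk, mvs in sorted(intermediate.items())]
```

For example, a word count fits this shape: the map function emits `(word, 1)` for each word, and the reduce function sums the values per key.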
MapReduce may utilize a distributed file system from which the Map instances retrieve the input. An example distributed file system is the Hadoop Distributed File System (HDFS). HDFS is a chunk-based distributed file system that supports fault-tolerance by data partitioning and replication.
Processor 102 may execute instructions of first calculator 112 to calculate, on a first data node, a first probability of a first user's interest in a first interest based on a first observable factor and a first latent factor. An observable factor may be historical information corresponding to a user. For example, observable factors may include a user's registered data, a user's behavioral data, etc. A latent factor is information corresponding to user interactions between connections and interests. Latent factors are usually implicit and/or hidden and are thus unobservable. The first user and the first product may belong to a first pair from the first set of pairs (e.g., as discussed in reference to distributor 110). The first pair may be used as an input key for a map function. The first observable factor and the first latent factor may be used as values for the first input key. For example, the map key for the first data node may be the user-interest pair <i; j>. The value for the map key may be the product of the observable and latent factors φ·φh for <i; j>.
Processor 102 may execute instructions of second calculator 114 to calculate, on a second data node, a second probability of a likelihood of a relationship between the first user and a second user based on a second observable factor and a second latent factor. The first user and the second user belong to a second pair from the second set of pairs (e.g., as discussed in reference to distributor 110). The second pair may be used as an input key for a map function. The second observable factor and the second latent factor may be used as values for the second input key. For example, the map key may be the user-user pair <i; k>. The value for the map key may be the product of the observable and latent factors φ′·φ′h for <i; k>.
Processor 102 may execute instructions of output generator 116 to generate, based on the first probability and the second probability, a triplet. The triplet may be the output key of a map function. The value of the output key may be the product of probability distributions Yij·Sik. The triplet may be a user-interest-user triplet <i, j, k>. The triplet may include two users from the social network and a product that at least one of the two users has expressed interest in on the social network. Output generator 116 may determine a probability distribution of the first user's interest in the first product and the relationship between the first user and the second user.
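For illustration, the map output for one triplet might be formed as below; the function and variable names are assumptions, not from the source.

```python
def emit_triplet(i, j, k, p_yij, p_sik):
    """Output a user-interest-user triplet <i, j, k> as the map output key,
    with the product of the two probability distributions as its value."""
    return ((i, j, k), p_yij * p_sik)
```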
Output generator 116 may incorporate a mutual latent random graph (MLRG) that incorporates the interactions between users' interests and users' connections. The MLRG may incorporate shared latent factors and coupled models to encode users' interests Yij (user i's interest in product j) and user-user connections Sik (the connection between user i and user k). Output generator 116 may express the probability distribution of Yij as Yij˜p(φ·φh,θ), with θ representing any corresponding parameters. The expression may include an assumption that certain observable factors (φ) exist and certain latent factors (φh) exist. Output generator 116 may express the probability distribution of Sik as Sik˜p(φ′·φ′h,Ω), with Ω representing any corresponding parameters. The expression may include an assumption that certain observable factors (φ′) exist and certain latent factors (φ′h) exist. Importantly, both φh and φ′h may capture the bidirectional interactions between interests and connections.
The four factors φ, φh, φ′, φ′h can be instantiated in different ways. Each factor may be defined as the exponential family of an inner product over sufficient statistics (feature functions) and corresponding parameters. Each factor may be a clique template whose parameters are tied. More specifically, the factors may be defined as:
φ = exp{Σα θα·fα(i, j)}
φh = exp{Σβ θβ·gβ(i, j, h)}
φ′ = exp{Σγ Ωγ·f′γ(i, k)}
φ′h = exp{Σδ Ωδ·g′δ(i, k, h)}
where fα are feature functions (sufficient statistics) over the observable data for user-interest pair <i, j>, gβ are feature functions over the shared latent variables h, θα and θβ are the corresponding parameters, and f′γ, g′δ, Ωγ, and Ωδ are the analogous feature functions and parameters for user-user pair <i, k>.
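Under this exponential-family definition, each factor reduces to the exponential of an inner product between parameters and feature values. A minimal sketch, with the feature representation assumed to be a flat list of numbers:

```python
import math

def exp_family_factor(params, features):
    """Exponential-family factor: exp of the inner product of the
    parameters (e.g., theta or omega entries) and the sufficient
    statistics (feature values)."""
    return math.exp(sum(p * f for p, f in zip(params, features)))
```

With all parameters at zero, every factor evaluates to 1, i.e., the factor is uninformative until learning moves the parameters.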
In other words, a map function may involve calculating probability distributions on data nodes in parallel (e.g., as discussed in reference to first calculator 112 and second calculator 114) and generating the triplet product of probability distributions Yij·Sik (as discussed in reference to output generator 116). Each data node may calculate the probability distribution Yij˜p(φ·φh, θ) and the probability distribution Sik˜p(φ′·φ′h, Ω). This process may be repeated until a convergence occurs.
The probability distribution Yij may be calculated as:
p(Yij) = (1/Z)·φ·φh   (5)
Similarly, the probability distribution Sik may be calculated as:
p(Sik) = (1/Z′)·φ′·φ′h   (6)
In equations (5) and (6) above, θ and Ω denote the sets of parameters associated with the interest factors and the connection factors, respectively, and Z and Z′ are normalization constants.
Processor 102 may execute instructions of interest and relationship determiner 118 to determine, based on the first probability and the second probability, a most likely interest of the first user and/or a most likely relationship of the first user. A triplet (e.g., as discussed in reference to output generator 116) may be used as an input key for a reduce function. A probability distribution and/or the product of probability distributions Yij·Sik may be used as values for the input key for the reduce function. Interest and relationship determiner 118 may merge a result of processing by the plurality of data nodes (e.g., as discussed in reference to distributor 110) using the triplet (e.g., as discussed in reference to output generator 116) as a key so that all values using the same triplet are grouped together.
Interest and relationship determiner 118 may determine the most likely interest of the first user and the most likely relationship of the first user as an output of the reduce function. An output key for the output of the reduce function may be an objective function (θ,Ω). The value for the output key may be updated and optimized parameters θ and Ω. Interest and relationship determiner 118 may maximize an objective function corresponding to the triplet. A first parameter of the objective function may correspond to the most likely interest of the first user and a second parameter of the objective function may correspond to the most likely relationship of the first user. The objective function may be maximized using a data mining algorithm, such as stochastic gradient descent.
A data mining algorithm (such as stochastic gradient descent) may be performed with respect to θ with Ω fixed, and θ may be updated. A data mining algorithm (such as stochastic gradient descent) may then be performed with respect to Ω with θ fixed, and Ω may be updated. This process may be repeated until a convergence occurs.
Stochastic gradient descent (SGD) may loop over all the observations and update the parameters θ and Ω by moving in the direction defined by the negative gradient. Each data node (e.g., as discussed in reference to first calculator 112 and second calculator 114) may compute and optimize with respect to either Yij or Sik in the Map phase, and the results may be combined in a Reduce phase to optimize both parameters θ and Ω globally. After distributed SGD learning, the optimized parameters can be obtained, and joint recommendation of interest and friendship can be achieved by computing the most likely Yij or Sik, respectively.
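The alternating update scheme described above, where each parameter block is moved against its gradient while the other is held fixed, might be sketched as follows on a toy objective. The learning rate and step count are arbitrary choices for illustration.

```python
def alternating_sgd(grad_theta, grad_omega, theta, omega, lr=0.1, steps=200):
    """Alternate gradient steps: update theta with omega fixed, then omega
    with theta fixed, moving in the negative-gradient direction."""
    for _ in range(steps):
        theta -= lr * grad_theta(theta, omega)   # omega held fixed
        omega -= lr * grad_omega(theta, omega)   # theta held fixed
    return theta, omega
```

Minimizing the toy objective (theta - 1)^2 + (omega + 2)^2 with this loop drives theta toward 1 and omega toward -2.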
In other words, the reduce function may include calculating the objective function (θ,Ω) and updating all parameters on a master node. The master node may calculate and maximize the objective function (θ,Ω). The master node may update and optimize the parameters (θ,Ω) such that (θ*, Ω*)=arg max (θ,Ω).
After stochastic gradient descent (SGD) for distributed MapReduce learning, an optimized θ and Ω of the MLRG may be obtained. The optimized parameters θ and Ω may be used to discover user interest and infer user-user friendship. More specifically, given the testing social media data, the inference may find the most likely types of user interest and corresponding user-user relationship labels that have the maximum posterior probability. This can be accomplished by performing the model inference of the MLRG. Performing the model inference may include predicting the labels of user interest and user-user friendship by finding the maximum a posteriori (MAP) user interest labeling assignment and corresponding user-user friendship labeling assignment that have the largest marginal probability according to equations (5) and (6) described above.
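The MAP inference step amounts to picking, for each user, the interest label and the friendship label with the largest marginal probability. A sketch, with the label sets and probability lookups assumed for illustration:

```python
def map_assignments(interest_labels, friendship_labels, p_interest, p_friend):
    """Return the maximum a posteriori interest label and friendship label,
    i.e., those with the largest marginal probabilities."""
    best_interest = max(interest_labels, key=p_interest)
    best_friend = max(friendship_labels, key=p_friend)
    return best_interest, best_friend
```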
The overall MapReduce processing of the user interest and relationship determination system may be summarized as follows. Each processing job may be broken down into as many Map tasks as there are input data blocks and one or more Reduce tasks. A master node may select idle workers (data nodes) and may assign each data node a map or a reduce task according to the stage. Before starting the Map task, an input file may be loaded on the distributed file system. At loading, the file may be partitioned into multiple data blocks of the same size. One example size of a data block may be 64 MB. Each block may be triplicated for fault-tolerance. Each block may also be assigned to a mapper, a worker which is assigned a map task, and the mapper may apply a map function (Map()) to each record in the data block.
The intermediate outputs produced by the mappers may be sorted locally to group key-value pairs sharing the same key. After the local sort, a combine function (Combine()) may be applied to perform pre-aggregation on the grouped key-value pairs so that the communication cost taken to transfer all the intermediate outputs to reducers is minimized. Then the mapped outputs may be stored in local disks of the mappers, partitioned into R partitions, where R is the number of Reduce tasks in the MapReduce job. This partitioning may be done by a hash function, e.g., hash(key) mod R.
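The hash partitioning just described can be sketched with a stable hash; CRC32 is used here purely for determinism (Hadoop's default partitioner instead uses the key's own hash code):

```python
import zlib

def partition(key, num_reducers):
    """Assign an intermediate key to one of R reduce tasks via
    hash(key) mod R, using a deterministic CRC32 hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers
```

Because the hash is a pure function of the key, every record with the same map key lands in the same partition, and therefore at the same reducer.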
When all Map tasks are completed, the MapReduce scheduler may assign Reduce tasks to workers. The intermediate results may be shuffled and assigned to reducers via the HTTPS protocol. Since all mapped outputs may already be partitioned and stored in local disks, each reducer may perform the shuffling by simply pulling its partition of the mapped outputs from the mappers. Put another way, each record of the mapped outputs may be assigned to only a single reducer by a one-to-one shuffling strategy. Note that this data transfer may be performed by the reducers pulling intermediate results. A reducer may read the intermediate results and merge them by the intermediate keys, i.e., the map keys, so that all values of the same key are grouped together. The grouping may be done by external merge-sort. Each reducer may also apply a reduce function (Reduce()) to the intermediate values for each map key it encounters. The output of the reducers may be stored and triplicated in the file system.
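The reducer-side merge that groups all values sharing a key can be sketched as a sort followed by grouping (an in-memory stand-in for the external merge-sort described above):

```python
from itertools import groupby
from operator import itemgetter

def merge_by_key(mapped_outputs):
    """Merge intermediate (map key, value) pairs so that all values of the
    same key are grouped together, as a reducer's merge step does."""
    ordered = sorted(mapped_outputs, key=itemgetter(0))
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=itemgetter(0))]
```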
The number of Map tasks may not depend on the number of nodes, but may be based on the number of input blocks. Each block may be assigned to a single Map task. However, all Map tasks do not need to be executed simultaneously, and neither do all Reduce tasks. The MapReduce framework may execute tasks based on a runtime scheduling scheme. In other words, MapReduce may not build any execution plan that specifies which tasks will run on which nodes before execution.
With the runtime scheduling, MapReduce may achieve fault tolerance by detecting failures and reassigning tasks of failed nodes to other healthy nodes in the cluster. Nodes which have completed their tasks may be assigned another input block. This scheme naturally achieves load balancing in that faster nodes will process more input chunks and slower nodes will process fewer inputs in the next wave of execution. Furthermore, a MapReduce scheduler may utilize a speculative and redundant execution. Tasks on straggling nodes may be redundantly executed on other idle nodes that have finished their assigned tasks, although the tasks are not guaranteed to end earlier on the newly assigned nodes than on the straggling nodes. Map and Reduce tasks may be executed with no communication between other tasks.
Thus, there is no contention arising from synchronization and no communication cost between tasks during a MapReduce job execution.
An example architecture for the user interest and relationship determination system 100 may exploit Extraction-Transformation-Loading (ETL) technology to bring heterogeneous (structured and unstructured) big social data into the data storage layer. An example storage layer may include a relational database management system (RDBMS), a NoSQL database management system, and logs of social media data. The architecture may also include server-based tools designed to transfer data between Hadoop and relational databases. Example tools may include the Sqoop2™ system (from Cloudera™), the MongoDB connector™ (from MongoDB, Inc.), and Flume™ (from Apache™) to transfer the RDBMS, NoSQL, and log data, respectively, to the joint recommender layer for distributed analysis. Sqoop2 is a tool designed for transferring bulk data between Hadoop and structured data stores such as relational databases. The MongoDB connector™ is a plugin for Hadoop™ that provides the ability to use MongoDB™ as an input source and/or an output destination. Flume™ is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications. The joint recommender layer may consist of a data model storing rich social information and a joint recommender engine for MLRGs and advanced MapReduce learning.
Processor 102 may execute instructions of potential interest and relationship predictor 120 to predict a potential interest of the first user and/or a potential relationship between the first user and a user of the social network based on the most likely interest and the most likely relationship.
Method 200 may start at block 202 and continue to block 204, where the method may include distributing a first set of pairs and a second set of pairs to a plurality of data nodes. Each pair in the first set of pairs may be of a user of a social network and a product on the social network. Each pair in the second set of pairs may define a connection between users on the social network. A first pair from the first set of pairs and a second pair from the second set of pairs may be used as a first input key and a second input key, respectively, for a map function. A first observable factor and a first latent factor may be used as values for the first input key. A second observable factor and a second latent factor may be used as values for the second input key. At block 206, the method may include calculating, on a first data node belonging to the plurality of data nodes, a first probability of a first user's interest in a first product based on a first observable factor and a first latent factor. The first user and the first product belong to a first pair from the first set of pairs.
At block 208, the method may include calculating, on a second data node, a second probability of a likelihood of a relationship between the first user and a second user, based on a second observable factor and a second latent factor. The first user and the second user belong to a second pair from the second set of pairs. At block 210, the method may include determining, based on the first probability and the second probability, a most likely interest of the first user and a most likely relationship of the first user. At block 212, the method may include predicting a potential interest of the first user based on the most likely interest and the most likely relationship. The method may also include predicting a potential relationship between the first user and another user of the social network based on the most likely interest and the most likely relationship. Method 200 may eventually continue to block 214, where method 200 may stop.
Memory 304 stores instructions to be executed by processor 302 including instructions for a first probability calculator 308, a second probability calculator 310, an interest and relationship determiner 312, a triplet generator 314 and an interest and relationship predictor 316. The components of system 300 may be implemented in the form of executable instructions stored on at least one machine-readable storage medium of system 300 and executed by at least one processor of system 300. The machine-readable storage medium may be non-transitory. Each of the components of system 300 may be implemented in the form of at least one hardware device including electronic circuitry for implementing the functionality of the component.
Processor 302 may execute instructions of first probability calculator 308 to calculate, on a first data node, a first probability of a first user's interest in a first product based on a first observable factor and a first latent factor. The first user and the first product may be used as a first input key for a map function. The first user and the second user may be used as a second input key for the map function. A first observable factor and a first latent factor may be used as values for the first input key. A second observable factor and a second latent factor may be used as values for the second input key. Processor 302 may execute instructions of second probability calculator 310 to calculate, on a second data node, a second probability of a likelihood of a relationship between the first user and a second user based on a second observable factor and a second latent factor. Processor 302 may execute instructions of interest and relationship determiner 312 to determine, based on the first probability and the second probability, a most likely interest of the first user and a most likely relationship of the first user.
Processor 302 may execute instructions of triplet generator 314 to generate, based on the first probability and the second probability, a triplet including two users from the social network and a product that at least one of the two users has expressed interest in on the social network. Processor 302 may execute instructions of an interest and relationship predictor 316 to predict a potential interest of the first user and/or a potential relationship of the first user to another user on the social network based on the most likely interest and the most likely relationship.
Processor 402 may be at least one central processing unit (CPU), microprocessor, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 404. In the example illustrated in
Machine-readable storage medium 404 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, machine-readable storage medium 404 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Machine-readable storage medium 404 may be disposed within system 400, as shown in
Referring to
Probability determine instructions 408, when executed by a processor (e.g., 402), may cause system 400 to determine, on the plurality of data nodes, a probability distribution of a first user's interest in a first product and a relationship between the first user and a second user. The probability may be based on an observable factor and a latent factor. Triplet generate instructions 410, when executed by a processor (e.g., 402), may cause system 400 to generate, based on the probability distribution, a triplet including two users from the social network and a product that at least one of the two users has expressed interest in on the social network. Most likely interest and relationship determine instructions 412, when executed by a processor (e.g., 402), may cause system 400 to determine, based on the probability distribution, a most likely interest of the first user and a most likely relationship of the first user. Potential interest and relationship predict instructions 414, when executed by a processor (e.g., 402), may cause system 400 to predict a potential interest of the first user and/or a potential relationship between the first user and another user of the social network based on the most likely interest and the most likely relationship.
The foregoing disclosure describes a number of examples for user interest and relationship determination. The disclosed examples may include systems, devices, computer-readable storage media, and methods for user interest and relationship determination. For purposes of explanation, certain examples are described with reference to the components illustrated in
Further, the sequence of operations described in connection with
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2016/073690 | 2/5/2016 | WO | 00 |