For purposes of analyzing relatively large datasets, it may be beneficial to represent the data in the form of a graph. The graph contains vertices and edges that connect the vertices. The vertices may represent random variables, and a given edge may represent a correlation between a pair of vertices that are connected by the edge. A graph may be quite large, in that the graph may contain thousands to billions of vertices.
Relations between objects may be modeled using a graph. In this manner, a graph has vertices (also called “graph nodes” or “nodes”) and lines, or edges, which interconnect the vertices. A graph may be used to compactly represent the joint distribution of a set of random variables. In this manner, each vertex may represent one of the random variables, and the edges encode correlations between the random variables.
As a more specific example, the graph may be a graph of Internet, or "web," domains, which may be used to identify malicious websites. In this manner, each vertex may be associated with a particular web domain and have an associated binary random variable. For this example, a given random variable may be assigned either a "1" (for a malicious domain) or a "0" (for a domain that is not malicious). Although some of the web domains may be directly observed and thus may be known to be malicious or non-malicious domains, such direct observations may not be available for a large number of the domains that are associated with the graph. For purposes of inferring properties of a graph, such as the above-described example graph, a process called "graph inference" may be used.
In general, graph inference involves estimating a joint distribution of random variables when direct sampling of the joint distribution cannot be performed or where such direct sampling is difficult. Graph inference may be performed using a graph inference algorithm, which estimates random variable assignments based on the conditional distributions of the random variables. In this context, a "random variable assignment" or "assignment" refers to a value that is determined or estimated for a random variable (and thus, for a corresponding vertex). The graph inference algorithm may undergo multiple iterations (thousands of iterations, for example), with each iteration providing estimates for all of the random variable assignments. The estimates ideally improve with each iteration, and eventually, the estimated assignments converge. In this context, "convergence" of the assignments refers to the assignment estimation reaching a stable solution, such as (as examples) the probability of each assignment exceeding a threshold, the number of assignments that change between successive iterations falling below a threshold, and so forth. Given the large number of iterations and the relatively large number of vertices (thousands to billions of vertices, for example), it may be advantageous for the graph inference to be performed in a parallel processing manner by a multiple processor-based computing system.
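To make the iteration-and-convergence structure concrete, the following Python sketch runs repeated sweeps and stops when the number of assignments that change between successive iterations falls below a threshold; this is only one of the example convergence criteria mentioned above. The function name estimate_assignments, the threshold, and the iteration cap are illustrative assumptions rather than elements of any particular implementation.

```python
# Minimal sketch of an iterative graph inference loop (assumed names and thresholds).

CHANGE_THRESHOLD = 10      # assumed: "stable" once fewer than this many assignments change
MAX_ITERATIONS = 10_000    # assumed safety bound on the number of iterations

def run_inference(assignments, estimate_assignments):
    """assignments: dict mapping vertex id -> current assignment (e.g., 0 or 1).
    estimate_assignments: callable performing one full sweep, updating assignments in place."""
    for _ in range(MAX_ITERATIONS):
        previous = dict(assignments)
        estimate_assignments(assignments)        # one iteration: estimate every assignment
        changed = sum(1 for v in assignments if assignments[v] != previous[v])
        if changed < CHANGE_THRESHOLD:           # assignments have converged
            break
    return assignments
```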
One type of multiple processor-based computing system employs a non-uniform memory access (NUMA) architecture. In the NUMA architecture, processing nodes have local memories. In this context, a “processing node” (not to be confused with a node of a graph) is an entity that is constructed to perform arithmetic and logical operations, and in accordance with example implementations, a given processing node may execute machine executable instructions. More specifically, in accordance with example implementations, a given processing node may contain at least one central processing unit (CPU), which is constructed to decode and execute machine executable instructions and perform arithmetic and logical operations in response thereto. The “local memory” of a processing node refers to a memory that is located closer to the processing resources of the processing node in terms of interconnects, signal traces, distance and so forth, than other processing nodes, such that the processing node may access its local memory with less latency than other memories, which are external to the node and are called “remote memories” herein. Accesses by a given processing node to write data in and/or read data from a remote memory are referred to herein as “remote accesses” or “remote memory accesses.” The remote memories for a given processing node include memories shared with other processing nodes, as well as the local memories of other processing nodes. As a more specific example, a NUMA architecture computer system may be formed from multicore processor packages (multicore CPU packages, for example), or “sockets,” where each socket has its own local memory, which may be accessed by the processing cores of the socket. A socket is one example of a processing node, in accordance with some implementations.
One way to divide the task of performing graph inference is to partition the graph across all of the processing nodes such that each processing node estimates assignments for an assigned subset of the graph's vertices. However, the graph and its underlying data may not be strictly partitionable, which means that the vertices (and corresponding assignment determinations) may not be cleanly divided among the processing nodes. In this manner, to estimate assignments for a given assigned subset of vertices, a given processing node may consider assignments for vertices that are not part of this subset. As a result, the processing node may incur remote memory accesses to read the assignments for vertices outside of the assigned subset from a non-local memory (a memory that is local to another processing node, for example). For sparse graphs, a large portion of the execution time for performing graph inference may be attributed to remote and local memory accesses.
In accordance with example implementations that are described herein, for purposes of performing graph inference on a graph using a system that contains multiple processing nodes, the inference processing is partitioned across the processing nodes. In this manner, each processing node is assigned a different partition of the graph for the inference processing and, as a result, is assigned a set of vertices and corresponding edges of the graph. Each processing node also maintains a copy of a vertex table in its local memory. In accordance with example implementations, the local copy of the vertex table identifies all of the vertices of the graph (including the vertices that are not part of the assigned graph partition) and corresponding assignments for the vertices. Due to its local copy of the vertex table, a given processing node may determine and update the assignments for its assigned subset of vertices without incurring remote memory accesses. The assignments for vertices other than the assigned subset of vertices are determined by the other processing nodes. Although the assignments for these other vertices may be temporarily stale, or not current, in the local copy of the vertex table, these assignments allow the processing node to proceed with the graph inference while allowing the remote memory accesses that update these assignments to be performed in a more controlled, efficient manner.
As a more specific example,
For the example implementation in which the graph inference engine 110 generates a graph to identify malicious domains, the graph, in general, may have vertices that are interconnected by edges. For this example implementation, a given vertex is associated with a web domain and has an associated random variable, and the random variable may have a binary state: a "1" value to indicate a non-malicious domain and a "0" value to indicate a malicious domain. The edges contain information about correlations between the vertices connected by the edges. The input data 150 may represent direct observations about the vertices, i.e., some web domains are known to be malicious, and other domains are known not to be malicious. The input data 150 may further represent observed correlations between web domains.
In accordance with example implementations, the graph inference engine 110 is a multiple processing node machine. In this context, a “machine” refers to an actual, physical machine, which is formed from multiple central processing units (CPUs) or “processing cores,” and actual machine executable instructions, or “software.” A given processing core is a unit that is constructed to read and execute machine executable instructions. In accordance with example implementations, the graph inference engine 110 may contain one or multiple CPU semiconductor packages, where each package contains multiple processing cores (CPU cores, for example).
More specifically, in accordance with example implementations, the graph inference engine 110 includes S processing nodes 120 (processing nodes 120-1, 120-2 . . . 120-S, being depicted in
In accordance with example implementations, the graph inference processing is partitioned among the processing nodes 120. The "partitioning" of the graph refers to subdividing the vertices of the graph among the processing nodes 120 such that each node 120 is assigned the task of determining random variable assignments for a different subset of vertices of the graph. The number of vertices per partition may be the same or may vary among the partitions, depending on the particular implementation. Moreover, the graph inference engine 110 may contain more than S processing nodes 120, in that one or multiple other processing nodes of the graph inference engine 110 may not be employed for purposes of executing the graph inference algorithm. The partitioning assignments may be determined by a user or may be determined by the graph inference engine 110, depending on the particular implementation. Each processing node 120 includes a worker engine (herein called a "worker 130"). As described further below, each worker 130 processes a partition of the graph by determining assignments for the vertices of the partition in a series of iterations.
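As a simple illustration of such a partitioning, the sketch below divides the vertex identifiers among the processing nodes in round-robin fashion; this is only one possible policy, since, as noted above, the partition sizes may vary and the assignments may instead be chosen by a user or by the graph inference engine 110.

```python
def partition_vertices(vertex_ids, num_processing_nodes):
    """Divide the graph's vertices among processing nodes, round-robin (one simple policy)."""
    partitions = [[] for _ in range(num_processing_nodes)]
    for i, v in enumerate(vertex_ids):
        partitions[i % num_processing_nodes].append(v)
    return partitions

# Example: eight vertices split across S = 3 processing nodes.
# partition_vertices(range(8), 3) -> [[0, 3, 6], [1, 4, 7], [2, 5]]
```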
As depicted in
As also depicted in
Referring to
The processing cores 212 experience relatively rapid access times to the local memory 214 of their processing node 120, as compared to, for example, the times to access a remote memory, such as the memory 214 of another processing node 120. In this manner, access to a memory 214 of another processing node 120 occurs through a memory hub 220 or other interconnect, which introduces memory access delays. In accordance with example implementations, each processing node 120 may contain a memory controller (not shown) to control bus signaling for a remote memory access.
In accordance with example implementations, the graph inference engine 110 executes a Gibbs sampling-based graph inference algorithm (also called “Gibbs sampling” herein). With Gibbs sampling, the worker 130 may determine an assignment (called “a”) for a given vertex by sampling the conditional probability distributions for the random variable to determine an instance (i.e., the assignment a) of the random variable. The conditional probability samples are based on the assignments for the neighboring vertices (or “neighbors”), and the edge information connecting the vertex to these neighbors.
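The sketch below shows one way such a conditional sample might be drawn for a binary random variable, assuming that each edge carries a single weight expressing how strongly its two endpoints agree. The agreement-style factor and the function and parameter names are assumptions made for illustration; the actual conditional distributions depend on the model encoded in the edge information.

```python
import math
import random

def sample_assignment(vertex, assignments, neighbors, edge_weight):
    """Draw a new 0/1 assignment for `vertex` from its conditional distribution.

    assignments: dict vertex -> current 0/1 assignment
    neighbors:   dict vertex -> list of neighboring vertices
    edge_weight: dict (vertex, neighbor) -> assumed pairwise agreement weight
    """
    scores = []
    for value in (0, 1):
        s = 0.0
        for n in neighbors[vertex]:
            w = edge_weight[(vertex, n)]
            s += w if assignments[n] == value else -w   # reward agreement with each neighbor
        scores.append(s)
    # Conditional probability that the vertex takes the value 1, then sample from it.
    p1 = 1.0 / (1.0 + math.exp(scores[0] - scores[1]))
    return 1 if random.random() < p1 else 0
```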
The Gibbs sampling-based graph inference algorithm is performed in multiple iterations (hundreds, thousands or even more iterations), with a full sweep of the graph being made during each iteration to determine the assignments for all of the vertices. Due to the parallel processing, for each iteration, a given worker 130 determines assignments for all of the vertices of its assigned partition. To update a given vertex v, the worker 130 reads the current assignments of the neighbors of the vertex v, reads the corresponding edge information (for the edges connecting the vertex v to its neighbors), determines the assignment for the vertex v based on the sampled conditional probability distributions and then updates the assignment for the vertex v accordingly.
In accordance with example implementations, the graph partition table 124 identifies the vertices of the assigned partition and the locations of the associated edge information. More specifically, in accordance with example implementations, the schema of the graph partition table 124 may be represented by "G<vi, vj, f>," where "vi" and "vj" represent two vertices on an edge in G; and "f" represents a pointer to information that is stored on the edge. In accordance with example implementations, data representing the edge information is also stored in the local memory 214.
In accordance with example implementations, the schema of the vertex table copy 126 is “V<vi, a>,” where “vi” represents a vertex identity, and “a” represents its assignment.
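One hypothetical in-memory rendering of these two schemas is sketched below; the field names follow G<vi, vj, f> and V<vi, a>, and the use of a list index as the f "pointer" and of a single weight per edge are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EdgeRow:
    vi: int   # first vertex on the edge
    vj: int   # second vertex on the edge
    f: int    # "pointer" (here, a list index) to the edge information in local memory

# Graph partition table 124 for one processing node: rows G<vi, vj, f> for the edges
# of the node's assigned partition.
GraphPartitionTable = List[EdgeRow]

# Vertex table copy 126: every vertex of the graph and its current assignment a,
# replicated in the local memory of each processing node.
VertexTable = Dict[int, int]

# Edge information referenced by the f pointers (a single weight per edge is assumed
# here purely for illustration).
EdgeInfo = List[float]
```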
Thus, to update a given vertex assignment a for a vertex v in the vertex table copy 126, the worker 130 first reads from the graph partition table 124 the neighbors of the vertex v (possibly including neighbors that are not part of the graph partition assigned to the worker 130), reads the edge information for the corresponding edges based on the pointers from the graph partition table 124, reads the current assignments of the neighbors from the vertex table copy 126, and then, after determining the new assignment a, writes to the vertex table copy 126 to modify the copy 126 to reflect the updated assignment. In this context, "modifying" the vertex table copy 126 refers to overwriting the assignments of the copy 126 that have changed, or been updated. Due to the storage of the vertex table copy 126 in the local memory of each processing node 120, the above-described memory accesses are local. In other words, the threads executing on the processing node 120, such as the threads executing the worker 130, access memory associated with the local processing node 120 for purposes of determining the assignments for the vertices of the assigned partition. Because the vertex table is replicated in the local memories of all of the processing nodes 120, in accordance with example implementations, operations involving updating the vertex assignments involve local reads and writes. The updates to the vertex table copies 126 on the other processing nodes 120 involve remote memory operations, as further described below.
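Put together, the per-vertex update might look like the following sketch, which reuses the hypothetical table layouts and the sample_assignment function from the earlier sketches; every read and write touches only the tables held in the processing node's local memory.

```python
def update_vertex(v, partition_table, vertex_table, edge_info, sample_assignment):
    """Determine and record a new assignment for vertex v using only local-memory tables."""
    # 1. Read v's neighbors and the pointers to edge information from the graph partition table.
    rows = [row for row in partition_table if row.vi == v]
    neighbors = {v: [row.vj for row in rows]}
    weights = {(v, row.vj): edge_info[row.f] for row in rows}
    # 2. Read the neighbors' current assignments from the local vertex table copy
    #    (assignments owned by other processing nodes may be temporarily stale), then
    # 3. determine the new assignment and overwrite it in the local copy.
    new_a = sample_assignment(v, vertex_table, neighbors, weights)
    vertex_table[v] = new_a
    return new_a
```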
There are two ways to update remote vertex table copies 126 when the copies appear on all of the processing nodes 120. In the first way, the worker 130 may push updates to the other processing nodes 120; in the second way, the worker 130 may pull updates from the other processing nodes 120. For push updates, a worker 130 that updates an assignment may push, or write, the corresponding update to the vertex table copies 126 that are stored on the other processing nodes 120. For pull updates, the worker 130 pulls, or reads, any vertex assignment updates from the vertex table copies 126 stored on the other processing nodes 120.
A potential advantage of the push update strategy is that if there are no updates, there is nothing to push, and hence, no remote memory accesses are incurred. This may be particularly useful for iterative graph inference algorithms, such as the Gibbs sampling inference algorithm, because it is often the case that the vertex assignments converge as the algorithm proceeds. Although the push strategy incurs remote writes, the updates may be queued, or accumulated, so that multiple updates may be written at one time, thereby more effectively controlling the consumption of memory bandwidth. Which of the two strategies, push or pull, achieves better performance may depend on such factors as how soon the graph converges and the remote read and write bandwidth ratios.
Regardless of whether push or pull updates are used, the batch size that is associated with these updates may be varied, depending on the particular implementation. In this context, the "batch size" generally refers to the size of the update data, such as the number of updates accumulated before the updates are pushed to, or pulled from, a remote processing node 120. In this manner, in accordance with some implementations, on one extreme, a push/pull update may occur on a given processing node after each vertex is updated. On the other extreme, the push/pull update may occur at the end of a particular iteration of the Gibbs sampling graph inference algorithm, or even after several iterations, to push or pull the updates to or from the other copies of the vertex table.
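A push-style batcher might be sketched as follows, under the simplifying assumption that each remote vertex table copy can be written directly as a shared dictionary; on a NUMA machine these writes would land in memory that is local to the other processing nodes. The class and method names are assumptions made for illustration.

```python
class PushBatcher:
    """Accumulates local assignment updates and pushes them to the remote vertex table
    copies once a batch size threshold is reached (assumed strategy and names)."""

    def __init__(self, batch_size, remote_vertex_tables):
        self.batch_size = batch_size
        self.remote_vertex_tables = remote_vertex_tables  # vertex table copies on other nodes
        self.pending = {}                                  # vertex id -> new assignment

    def record_update(self, vertex, assignment):
        self.pending[vertex] = assignment
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # One remote write per node per batch, instead of one remote write per vertex update.
        for table in self.remote_vertex_tables:
            table.update(self.pending)
        self.pending.clear()
```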
An advantage of a relatively small batch size is that the copy of the vertex table is refreshed more frequently, which may lead to relatively fewer iterations for convergence. A potential advantage of a relatively larger batch size is that memory bandwidth may be used more efficiently, which may lead to better throughput (or less time to complete one iteration).
Thus, the batch sizes may be a function of the following: 1.) how soon the graph converges; and 2.) the memory bandwidth. The tradeoffs between batch size and push and pull updates are summarized below:
Thus, referring to
Referring to
The next iteration then begins with the worker 130 reading (block 412) the current assignments a of the neighbors of the vertex from the vertex table copy 126. Next, the worker 130 determines (block 416) the new assignment a of the vertex based at least in part on the assignments a of the neighbors.
Using the new assignment, the worker 130 updates (block 420) the local copy of the vertex table. The worker 130, however, accumulates the updates for the other processing nodes (i.e., the updates for the vertex table copies 126 stored in the local memories of the other processing nodes). In this manner, the worker 130 determines (decision block 424) whether the accumulated updates for the other processing nodes have reached a predefined update batch size threshold. In this manner, the "batch size threshold" for this example refers to the number of vertex updates that the worker 130 accumulates before the updates are communicated to the other (remote) processing nodes. For example, if the batch size is three, the worker 130 accumulates the updates until the number of updates equals three and then pushes the new updated values for the three vertices to the remote processing nodes at one time. Therefore, if, pursuant to decision block 424, the worker 130 determines that the batch size has been reached, then the worker 130 pushes (block 432) the accumulated updates to the other processing node(s), in accordance with example implementations. The worker 130 then determines (decision block 428) whether another vertex assignment a remains to be updated in the current iteration. In other words, the worker 130 determines whether any more vertices remain to be processed in the current iteration, and if so, control returns to block 408. Otherwise, the iteration is complete, and the assignments for all of the vertices for the most recent iteration have been determined.
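The sweep just described might be expressed roughly as in the sketch below, which combines the earlier update_vertex and PushBatcher sketches; flushing any remaining updates at the end of the iteration is an added assumption rather than a step taken from the flow described above.

```python
def run_iteration(assigned_vertices, partition_table, vertex_table, edge_info,
                  sample_assignment, batcher):
    """One sweep over the worker's assigned partition of the graph."""
    for v in assigned_vertices:
        new_a = update_vertex(v, partition_table, vertex_table, edge_info, sample_assignment)
        batcher.record_update(v, new_a)   # pushed to remote copies once the batch threshold is hit
    batcher.flush()                       # assumed: drain any remaining updates at iteration end
```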
Next, the worker 130 determines (decision block 436) whether convergence has occurred and, if not, control returns to block 408. As described above, convergence generally occurs when the assignments are deemed to be stable, and determining convergence may involve communications among the processing nodes, as convergence may be globally determined for the graph inference. In accordance with further example implementations, a given processing node may determine whether the assignments for its assigned partition have converged independently from the convergence of any other partition. Regardless of how convergence is determined, after a determination that convergence has occurred (decision block 436), the graph inference algorithm is complete.
In accordance with example implementations, the worker 130 may be formed from machine executable instructions that are executed by one or more of the processor cores 212 (see
In accordance with further example implementations, the worker 130 may be constructed as a hardware component that is formed from dedicated hardware (one or more integrated circuits that contain logic that is configured to perform a graph inference algorithm). Thus, the worker 130 may take on one or many different forms and may be based on software and/or hardware, depending on the particular implementation.
Other implementations are contemplated, which are within the scope of the appended claims. For example, in accordance with further example implementations, the graph inference engine 110 (see
While the present techniques have been described with respect to a number of embodiments, it will be appreciated that numerous modifications and variations may be applicable therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the scope of the present techniques.