The present disclosure relates generally to relation learning, and more specifically related to methods and systems for graph representation learning using MAGNA.
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
The introduction of the self-attention mechanism has pushed the state of the art in many domains, including graph representation learning. Graph Attention Network (GAT) and related models developed attention mechanisms for Graph Neural Networks (GNNs), which compute attention scores between nodes connected by an edge, allowing the model to attend to messages of a node's direct neighbors according to their attention scores.
However, such attention mechanism does not account for nodes that are not directly connected but provide important network context. Therefore, an unaddressed need exists in the art to address the aforementioned deficiencies and inadequacies.
In certain aspects, the present disclosure relates to a system. In certain embodiments, the system includes a computing device, and the computing device has a processor and a storage device storing computer executable code. The computer executable code, when executed at the processor, is configured to:
provide an incomplete knowledge graph comprising a plurality of nodes and a plurality of edges, each of the edges connecting two of the plurality of nodes;
calculate an attention matrix of the incomplete knowledge graph based on one-hop attention between any two of the plurality of the nodes that are connected by one of the plurality of the edges;
calculate multi-head diffusion attention for any two of the plurality of nodes from the attention matrix;
obtain updated embedding of the incomplete knowledge graph using the multi-head diffusion attention; and
update the incomplete knowledge graph to obtain an updated knowledge graph based on the updated embedding.
In certain embodiments, the computer executable code is configured to calculate the attention matrix by:
calculating an attention score $s_{i,k,j}^{(l)}$ for an edge $(v_i, r_k, v_j)$ by $s_{i,k,j}^{(l)} = \mathrm{LeakyReLU}\big(v_a^{(l)} \tanh\big(W_h^{(l)} h_i^{(l)} \,\|\, W_t^{(l)} h_j^{(l)} \,\|\, W_r^{(l)} r_k\big)\big)$ (equation (1)), wherein $v_i$ and $v_j$ are nodes $i$ and $j$, $r_k$ is a type of the edge between the nodes $i$ and $j$, $h_i^{(l)}$ and $h_j^{(l)}$ are embeddings of the nodes $i$ and $j$ at the $l$-th layer, and $W_h^{(l)}$, $W_t^{(l)}$, $W_r^{(l)}$, and $v_a^{(l)}$ are learnable parameters of the $l$-th layer;
obtaining an attention score matrix $S^{(l)}$ by: $S^{(l)}_{i,j} = s^{(l)}_{i,k,j}$ if the edge $(v_i, r_k, v_j)$ appears in $\mathcal{G}$, and $S^{(l)}_{i,j} = -\infty$ otherwise (equation (2)), wherein $\mathcal{G}$ is the knowledge graph; and
calculating the attention matrix $A^{(l)}$ by: $A^{(l)} = \mathrm{softmax}(S^{(l)})$.
In certain embodiments, the computer executable code is configured to calculate the multi-head diffusion attention by:
calculating a multi-hop attention matrix $\mathcal{A}$ by: $\mathcal{A} = \sum_{hop=0}^{\infty} \theta_{hop} A^{hop}$ (equation (3)), wherein hop is a positive integer in a range of 2-20, and $\theta_{hop}$ is an attention decay factor; and
calculating the multi-head diffusion attention by: $\mathrm{AttDiffusion}(\mathcal{G}, H^{(l)}, \Theta) = \mathcal{A} H^{(l)}$ (equation (4)), wherein $\mathcal{G}$ is the knowledge graph, $\Theta$ represents parameters for equation (1), and $H^{(l)}$ is input entity embedding of the $l$-th layer.
In certain embodiments, the $\mathcal{A} H^{(l)}$ is approximated by:
letting $Z^{(0)} = H^{(l)}$, $Z^{(k+1)} = (1-\alpha) A Z^{(k)} + \alpha Z^{(0)}$ (equation (5)), wherein $0 \le k \le K$, and $\theta_{hop} = \alpha(1-\alpha)^{hop}$; and
defining the $\mathcal{A} H^{(l)}$ as $Z^{(K)}$.
In certain embodiments, the hop is equivalent to K, and the hop and the K are each a positive integer in a range of 2-12. In certain embodiments, the hop and the K are in a range of 3-10. In certain embodiments, the hop and the K are 6, 7, or 8. In certain embodiments, l is a positive integer in a range of 2-24. In certain embodiments, l is 3, 6, 12, 18, or 24. In certain embodiments, l is 3, 6, or 12.
In certain embodiments, the computer executable code is configured to obtain the updated embedding of the incomplete knowledge graph by: performing sequentially a first layer normalization and addition, a feed forward, and a second layer normalization and addition on the multi-head diffusion attention.
In certain embodiments, the feed forward is performed using a two-layer feed forward network. The two-layer feed forward network may be a two-layer multilayer perceptron (MLP).
In certain embodiments, the computer executable code is further configured to, after obtaining the updated embedding: calculate a loss function based on the updated embedding and labels of the nodes and edges of the incomplete knowledge graph, and adjust parameters for calculating the attention matrix, calculating the multi-head diffusion attention, and obtaining the updated embedding.
In certain embodiments, the computer executable code is configured to perform the steps of calculating the attention matrix, calculating the multi-head attention diffusion, obtaining the updated embedding, calculating the loss function, and adjusting the parameters iteratively for a plurality of times, and update the incomplete knowledge graph using the updated embedding obtained after the plurality of times of iterations.
In certain embodiments, the computer executable code is configured to update the incomplete knowledge graph by: predicting new features of the plurality of the nodes or predicting new edges based on the updated embedding, and adding the new features to the nodes or adding the new edges to the incomplete knowledge graph.
In certain embodiments, the computer executable code is further configured to, when the updated knowledge graph comprises a plurality of consumers and a plurality of products: recommend a product to a consumer when the product and the consumer are linked by an edge in the updated knowledge graph, and the edge indicates interest of the consumer in the product.
In certain embodiments, the computer executable code is further configured to, when the updated knowledge graph comprises a plurality of customers: provide credit scores for the plurality of customers based on features of the customers in the knowledge graph.
In certain aspects, the present disclosure relates to a method. In certain embodiments, the method includes:
providing, by a computing device, an incomplete knowledge graph comprising a plurality of nodes and a plurality of edges, each of the edges connecting two of the plurality of nodes;
calculating, by the computing device, an attention matrix of the incomplete knowledge graph based on one-hop attention between any two of the plurality of the nodes that are connected by one of the plurality of the edges;
calculating, by the computing device, multi-head diffusion attention for any two of the plurality of nodes from the attention matrix;
obtaining, by the computing device, updated embedding of the incomplete knowledge graph using the multi-head diffusion attention; and
updating, by the computing device, the incomplete knowledge graph to obtain an updated knowledge graph based on the updated embedding.
In certain embodiments, the step of calculating the attention matrix comprises:
calculating an attention score $s_{i,k,j}^{(l)}$ for an edge $(v_i, r_k, v_j)$ by $s_{i,k,j}^{(l)} = \mathrm{LeakyReLU}\big(v_a^{(l)} \tanh\big(W_h^{(l)} h_i^{(l)} \,\|\, W_t^{(l)} h_j^{(l)} \,\|\, W_r^{(l)} r_k\big)\big)$ (equation (1)), wherein $v_i$ and $v_j$ are nodes $i$ and $j$, $r_k$ is a type of the edge between the nodes $i$ and $j$, $h_i^{(l)}$ and $h_j^{(l)}$ are embeddings of the nodes $i$ and $j$ at the $l$-th layer, and $W_h^{(l)}$, $W_t^{(l)}$, $W_r^{(l)}$, and $v_a^{(l)}$ are learnable parameters of the $l$-th layer;
obtaining an attention score matrix $S^{(l)}$ by: $S^{(l)}_{i,j} = s^{(l)}_{i,k,j}$ if the edge $(v_i, r_k, v_j)$ appears in $\mathcal{G}$, and $S^{(l)}_{i,j} = -\infty$ otherwise (equation (2)), wherein $\mathcal{G}$ is the knowledge graph; and
calculating the attention matrix $A^{(l)}$ by: $A^{(l)} = \mathrm{softmax}(S^{(l)})$.
In certain embodiments, the step of calculating the multi-head diffusion attention comprises:
calculating a multi-hop attention matrix $\mathcal{A}$ by: $\mathcal{A} = \sum_{hop=0}^{\infty} \theta_{hop} A^{hop}$ (equation (3)), wherein hop is a positive integer in a range of 2-12, and $\theta_{hop}$ is an attention decay factor; and
calculating the multi-head diffusion attention by: $\mathrm{AttDiffusion}(\mathcal{G}, H^{(l)}, \Theta) = \mathcal{A} H^{(l)}$ (equation (4)), wherein $\mathcal{G}$ is the knowledge graph, $\Theta$ represents parameters for equation (1), and $H^{(l)}$ is input entity embedding of the $l$-th layer.
In certain embodiments, the $\mathcal{A} H^{(l)}$ is approximated by:
letting $Z^{(0)} = H^{(l)}$, $Z^{(k+1)} = (1-\alpha) A Z^{(k)} + \alpha Z^{(0)}$ (equation (5)), wherein $0 \le k \le K$, and $\theta_{hop} = \alpha(1-\alpha)^{hop}$; and
defining the $\mathcal{A} H^{(l)}$ as $Z^{(K)}$.
In certain embodiments, the hop is equivalent to K, and the hop and the K are each a positive integer in a range of 2-12. In certain embodiments, the hop and the K are in a range of 3-10. In certain embodiments, the hop and the K are 6, 7, or 8. In certain embodiments, l is a positive integer in a range of 2-24. In certain embodiments, l is 3, 6, 12, 18, or 24. In certain embodiments, l is 3, 6, or 12.
In certain embodiments, the step of obtaining the updated embedding of the incomplete knowledge graph comprises: performing sequentially a first layer normalization and addition, a feed forward, and a second layer normalization and addition on the multi-head diffusion attention.
In certain aspects, the present disclosure relates to a non-transitory computer readable medium storing computer executable code. The computer executable code, when executed at a processor of a computing device, is configured to perform the method described above.
These and other aspects of the present disclosure will become apparent from the following description of the preferred embodiment taken in conjunction with the following drawings and their captions, although variations and modifications therein may be effected without departing from the spirit and scope of the novel concepts of the disclosure.
The accompanying drawings illustrate one or more embodiments of the disclosure and together with the written description, serve to explain the principles of the disclosure. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like elements of an embodiment.
The present disclosure is more particularly described in the following examples that are intended as illustrative only since numerous modifications and variations therein will be apparent to those skilled in the art. Various embodiments of the disclosure are now described in detail. Referring to the drawings, like numbers indicate like components throughout the views. As used in the description herein and throughout the claims that follow, the meaning of “a”, “an”, and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Moreover, titles or subtitles may be used in the specification for the convenience of a reader, which shall have no influence on the scope of the present disclosure. Additionally, some terms used in this specification are more specifically defined below.
The terms used in this specification generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. It will be appreciated that same thing can be said in more than one way. Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein, nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. The use of examples anywhere in this specification including examples of any terms discussed herein is illustrative only, and in no way limits the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure is not limited to various embodiments given in this specification.
As used herein, the term “module” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC); an electronic circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor (shared, dedicated, or group) that executes code; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip. The term module may include memory (shared, dedicated, or group) that stores code executed by the processor.
The term “code”, as used herein, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, and/or objects. The term shared, as used above, means that some or all code from multiple modules may be executed using a single (shared) processor. In addition, some or all code from multiple modules may be stored by a single (shared) memory. The term group, as used above, means that some or all code from a single module may be executed using a group of processors. In addition, some or all code from a single module may be stored using a group of memories.
The term “interface”, as used herein, generally refers to a communication tool or means at a point of interaction between components for performing data communication between the components. Generally, an interface may be applicable at the level of both hardware and software, and may be uni-directional or bi-directional interface. Examples of physical hardware interface may include electrical connectors, buses, ports, cables, terminals, and other I/O devices or components. The components in communication with the interface may be, for example, multiple components or peripheral devices of a computer system.
In certain aspects, the present disclosure provides a Multi-hop Attention Graph Neural Network (MAGNA) to incorporate multi-hop context information into attention computation, enabling long-range interactions at every layer of the GNN. In certain embodiments, to compute attention between nodes that are not directly connected, MAGNA diffuses the attention scores across the network, which increases the “receptive field” for every layer of the GNN. Unlike previous approaches, MAGNA uses a diffusion prior on attention values, to efficiently account for all paths between the pair of disconnected nodes. This helps MAGNA capture large-scale structural information in every layer, and learn more informative attention. Experimental results on node classification as well as knowledge graph completion benchmarks show that MAGNA achieves state-of-the-art results: MAGNA achieves up to 5.7% relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. MAGNA also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion, MAGNA advances state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be the given knowledge graph, where $\mathcal{V}$ is the set of $N_n$ nodes, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of $N_e$ edges connecting $M$ pairs of nodes in $\mathcal{V}$. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ are associated with their type mapping functions: $\phi: \mathcal{V} \rightarrow \mathcal{T}$ and $\psi: \mathcal{E} \rightarrow \mathcal{R}$, where $\mathcal{T}$ and $\mathcal{R}$ denote the sets of node labels and edge/relation types, respectively. The MAGNA application supports learning on heterogeneous graphs with multiple elements in $\mathcal{R}$.
A general graph neural network (GNN) approach learns an embedding that maps nodes and/or edge types into a continuous vector space. Let $X \in \mathbb{R}^{N_n \times d}$ denote the matrix of input node features, where $d$ is the input feature dimension.
MAGNA builds on GNNs, while bringing together the benefits of Graph Attention and Diffusion techniques. The core of MAGNA is Multi-hop Attention Diffusion, a principled way to learn attention between any pair of nodes in a scalable way, taking into account the graph structure and enabling multi-hop context-dependent attention directly.
The key challenge here is how to allow for flexible but scalable context-dependent multi-hop attention, where any node can influence embedding of any other node in a single GNN layer (even if they are far away in the underlying network). Simply learning attention score over all node pairs is infeasible, and would lead to overfitting and poor generalization.
In certain aspects, attention diffusion is introduced by computing multi-hop attention directly at each block of the MAGNA based on attention scores in each MAGNA block. The input to the attention diffusion operator is a set of triples (vi, rk, vj) that are currently available in a knowledge graph, where vi, vj are nodes and rk is the edge type. MAGNA first computes the attention scores on all available edges. The attention diffusion module of MAGNA then computes the attention values between pairs of nodes that are not directly connected by an edge, based on the edge attention scores, via a diffusion process. The attention diffusion module can then be used as a component in the MAGNA architecture.
To compute the attention diffusion, in the first stage, MAGNA calculates one-hop edge attention, i.e., attention between nodes connected by edges. The MAGNA includes one or multiple blocks, and each MAGNA block is also named a MAGNA layer. At each MAGNA block (or layer) l, a vector message is computed for each triple (vi, rk, vj). To compute the representation of vj at block or layer l+1, all messages from triples incident to vj are aggregated into a single message, which is then used to update the embedding of vj at block or layer l+1.
The disclosure first computes an attention score s of an edge (vi, rk, vj) by:
$$s_{i,k,j}^{(l)} = \mathrm{LeakyReLU}\Big(v_a^{(l)} \tanh\big(W_h^{(l)} h_i^{(l)} \,\|\, W_t^{(l)} h_j^{(l)} \,\|\, W_r^{(l)} r_k\big)\Big) \quad (1)$$
where $W_h^{(l)}$, $W_t^{(l)}$, and $W_r^{(l)}$ are trainable weight matrices of the $l$-th layer, $v_a^{(l)}$ is a trainable weight vector, $h_i^{(l)}$ and $h_j^{(l)}$ are the embeddings of nodes $v_i$ and $v_j$ at the $l$-th layer, $r_k$ is the embedding of the relation type, and $\|$ denotes concatenation.
Applying equation (1) on each edge of the graph $\mathcal{G}$, the disclosure obtains an attention score matrix $S^{(l)}$:
$$S^{(l)}_{i,j} = \begin{cases} s^{(l)}_{i,k,j}, & \text{if the edge } (v_i, r_k, v_j) \text{ appears in } \mathcal{G} \\ -\infty, & \text{otherwise} \end{cases} \quad (2)$$
Subsequently, the disclosure obtains the attention matrix $A^{(l)}$ by performing a row-wise softmax over the score matrix $S^{(l)}$: $A^{(l)} = \mathrm{softmax}(S^{(l)})$. The attention matrix $A^{(l)}$ is the one-hop attention, and $A^{(l)}_{ij}$ denotes the attention value at the $l$-th layer when aggregating the message from node $j$ to node $i$.
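The following is a minimal NumPy sketch, provided only for illustration, of how equations (1) and (2) and the row-wise softmax can be realized for a toy graph; the weight shapes, random initialization, and edge list are assumptions and not the claimed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_rel, d = 5, 2, 8
H = rng.normal(size=(n_nodes, d))            # node embeddings h_i^(l)
R = rng.normal(size=(n_rel, d))              # relation (edge type) embeddings r_k
W_h, W_t, W_r = (rng.normal(size=(d, d)) for _ in range(3))
v_a = rng.normal(size=3 * d)                 # scoring vector v_a^(l)
edges = [(0, 0, 1), (1, 0, 2), (2, 1, 3), (3, 0, 4), (4, 1, 0)]   # (i, k, j) triples

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Equation (2): existing edges get a score from equation (1), all others get -inf.
S = np.full((n_nodes, n_nodes), -np.inf)
for i, k, j in edges:
    feat = np.concatenate([W_h @ H[i], W_t @ H[j], W_r @ R[k]])   # "||" concatenation
    S[i, j] = leaky_relu(v_a @ np.tanh(feat))                     # equation (1)

# Row-wise softmax over S: non-edges (score -inf) receive zero attention.
mask = np.isfinite(S)
S_safe = np.where(mask, S, -1e30)
A = np.exp(S_safe - S_safe.max(axis=1, keepdims=True)) * mask
A = A / A.sum(axis=1, keepdims=True).clip(min=1e-12)
print(A.sum(axis=1))   # each row with at least one outgoing edge sums to 1
```

In a full implementation the score matrix would be kept sparse, since only entries corresponding to existing edges are ever finite.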
In the second stage, MAGNA calculates attention diffusion for multi-hop neighbors based on the one-hop edge attention matrix $A^{(l)}$ at the $l$-th block or layer. In this stage, the disclosure enables attention between nodes that are not directly connected in the knowledge graph network. The enablement is achieved via the following attention diffusion procedure, where the disclosure first computes the attention scores of multi-hop neighbors via graph diffusion based on the powers of the one-hop attention matrix $A$:
$$\mathcal{A} = \sum_{i=0}^{\infty} \theta_i A^i \quad (3)$$
where $\sum_{i=0}^{\infty} \theta_i = 1$ and $\theta_i > 0$.
Here $\mathcal{A}$ is the multi-hop attention matrix, $A^i$ is the $i$-th power of the one-hop attention matrix $A$ (the one-hop attention matrix $A^{(l)}$ at the $l$-th MAGNA block), $i$ is an integer equal to or greater than 0 (kindly differentiate the power $i$ here from the node $i$ disclosed in other parts of the disclosure), and $\theta_i$ is the attention decay factor with $\theta_i > \theta_{i+1}$. The powers of the attention matrix, $A^i$, give the number of relation paths between two nodes of length up to $i$, increasing the receptive field of the attention. For example, a two-hop attention can be calculated as $\mathcal{A} = \theta_0 I + \theta_1 A + \theta_2 A^2$, and a three-hop attention can be calculated as $\mathcal{A} = \theta_0 I + \theta_1 A + \theta_2 A^2 + \theta_3 A^3$, where $I$ is the identity matrix. As shown by equation (3), the mechanism allows the attention between two nodes to depend not only on their previous layer representations, but also on the paths between the nodes, effectively creating attention shortcuts between nodes that are not directly connected. Attention through each path is also weighted differently, depending on $\theta$ and the path length $i$. In certain embodiments, the disclosure utilizes the geometric distribution $\theta_i = \alpha(1-\alpha)^i$, where $\alpha \in (0, 1]$. The choice is based on the inductive bias that nodes further away should be weighted less in message aggregation, and nodes with different relation path lengths to the target node are sequentially weighted in an independent manner. In addition, notice that if we define $\theta_0 = \alpha \in (0, 1]$ and $A^0 = I$, then equation (3) gives the Personalized PageRank (PPR) procedure on the graph with the attention matrix $A$ and teleport probability $\alpha$. Hence the diffused attention weight $\mathcal{A}_{ij}$ can be seen as the influence of node $j$ to node $i$.
We can also view $\mathcal{A}_{ij}$ as the attention value of node $j$ to node $i$, since $\sum_{j=1}^{N_n} \mathcal{A}_{ij} = 1$. Based on the multi-hop attention matrix $\mathcal{A}$, the graph attention diffusion operation is defined as:
$$\mathrm{AttDiffusion}(\mathcal{G}, H^{(l)}, \Theta) = \mathcal{A} H^{(l)} \quad (4)$$
Here $\mathcal{G}$ represents the knowledge graph, $H^{(l)}$ is the embedding of the nodes at the $l$-th MAGNA block, and $\Theta$ is the set of parameters for computing attention. Thanks to the diffusion process defined in equation (3), MAGNA uses the same number of parameters as if we were only computing attention between nodes connected via edges. In other words, the learnable parameters $\Theta$ are the same as, or are equivalent to, the learnable parameters $W_h^{(l)}$, $W_t^{(l)}$, $W_r^{(l)}$, and $v_a^{(l)}$ in equation (1). This ensures runtime efficiency as well as good model generalization.
The product $\mathcal{A} H^{(l)}$ is the attention diffusion the disclosure aims to obtain. However, for large graphs, computing the exact attention diffusion matrix using equation (3) may be prohibitively expensive, due to computing the powers $A^i$ of the attention matrix $A$. To remove this bottleneck, in certain embodiments, the present disclosure provides an approximate computation of the attention diffusion $\mathcal{A} H^{(l)}$. Particularly, the disclosure lets $H^{(l)}$ be the input entity embedding of the $l$-th block or layer ($H^{(0)} = X$) and $\theta_i = \alpha(1-\alpha)^i$. Since MAGNA only requires aggregation via $\mathcal{A} H^{(l)}$, the disclosure approximates $\mathcal{A} H^{(l)}$ by defining a sequence $Z^{(K)}$ which converges to the true value of $\mathcal{A} H^{(l)}$ as $K \rightarrow \infty$:
$$Z^{(0)} = H^{(l)}, \qquad Z^{(k+1)} = (1-\alpha) A Z^{(k)} + \alpha Z^{(0)}, \quad \text{where } 0 \le k \le K \quad (5)$$
Proposition 1. With the sequence defined by equation (5), $\lim_{K \rightarrow \infty} Z^{(K)} = \mathcal{A} H^{(l)}$, so that $Z^{(K)}$ approximates the attention diffusion $\mathcal{A} H^{(l)}$.
Using the above approximation by equation (5) to replace the calculations by equations (3) and (4), the complexity of the attention computation with diffusion is still O(|E|), with a constant factor corresponding to the number of hops K. In certain embodiments, the disclosure defines the value of K in a range of 3≤K≤10, which results in good model performance. Many real-world graphs exhibit the small-world property, in which case even a smaller value of K is sufficient. For graphs with larger diameter, the disclosure chooses a larger K and lowers the value of α.
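As an illustrative, non-limiting sketch (assuming a NumPy setting and a dense attention matrix only for brevity), the recursion of equation (5) is a short loop whose result can be checked against a long truncation of the series in equation (3):

```python
import numpy as np

def attention_diffusion(A, H, alpha=0.1, K=6):
    """Approximate (sum_hop alpha*(1-alpha)^hop * A^hop) @ H via equation (5)."""
    Z = H.copy()                                  # Z^(0) = H^(l)
    for _ in range(K):
        Z = (1.0 - alpha) * A @ Z + alpha * H     # Z^(k+1) = (1-a) A Z^(k) + a Z^(0)
    return Z

# Sanity check on a toy graph: the recursion matches a long truncation of equation (3).
rng = np.random.default_rng(1)
A = rng.random((6, 6)); A /= A.sum(axis=1, keepdims=True)   # row-stochastic attention
H = rng.normal(size=(6, 4))
alpha = 0.1
series = sum(alpha * (1 - alpha) ** i * np.linalg.matrix_power(A, i)
             for i in range(300)) @ H
print(np.allclose(attention_diffusion(A, H, alpha, K=300), series, atol=1e-6))  # True
```

In practice the attention matrix is sparse, so each recursion step costs on the order of O(|E|) operations and the total cost of the approximation is O(K·|E|), matching the complexity discussion above.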
Using the above-described computation of multi-hop attention diffusion by equations (1), (2), and (5), or alternatively by equations (1) to (4), the present disclosure provides a direct multi-hop attention based GNN architecture, i.e., the MAGNA.
As shown in the drawings, each MAGNA block first applies multi-head graph attention diffusion to a layer-normalized input:
$$\hat{H}^{(l)} = \mathrm{MultiHead}(\mathcal{G}, \tilde{H}^{(l)}) = \Big(\big\Vert_{i=1}^{M} \mathrm{head}_i\Big) W_o \quad (6)$$
where $\mathrm{head}_i = \mathrm{AttDiffusion}(\mathcal{G}, \tilde{H}^{(l)}, \Theta_i)$ and $\tilde{H}^{(l)} = \mathrm{LayerNorm}(H^{(l)})$.
Here $\Vert$ denotes concatenation, $\Theta_i$ are the parameters in equation (1) for the $i$-th head ($1 \le i \le M$), and $W_o$ represents a parameter matrix. Since the disclosure calculates the attention diffusion in a recursive way in equation (5), the disclosure adds layer normalization, which is helpful to stabilize the recurrent computation procedure. As the last step of the multi-head graph attention diffusion layer 202, an addition is performed by: $\hat{H}^{(l+1)} = \hat{H}^{(l)} + H^{(l)}$.
As shown in the drawings, the output of the multi-head graph attention diffusion layer is then passed through a deep aggregation (a two-layer feed forward network) with layer normalization and a residual connection:
$$H^{(l+1)} = W_2^{(l)} \, \mathrm{ReLU}\big(W_1^{(l)} \, \mathrm{LayerNorm}(\hat{H}^{(l+1)})\big) + \hat{H}^{(l+1)} \quad (7)$$
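To make the data flow of equations (5)-(7) concrete, the following hedged NumPy sketch strings one MAGNA block together (layer normalization, per-head attention diffusion, concatenation, residual additions, and the two-layer feed forward aggregation). The per-head attention matrices are passed in as random placeholders, whereas in the actual model they come from equations (1)-(2); a row-vector convention is used, so weights multiply on the right.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    return (X - X.mean(axis=-1, keepdims=True)) / np.sqrt(X.var(axis=-1, keepdims=True) + eps)

def diffuse(A, Z0, alpha=0.1, K=6):
    Z = Z0.copy()
    for _ in range(K):
        Z = (1 - alpha) * A @ Z + alpha * Z0          # equation (5)
    return Z

def magna_block(H, head_attn, W_o, W_1, W_2, alpha=0.1, K=6):
    """H: (n, d) node embeddings; head_attn: list of per-head one-hop attention matrices."""
    H_tilde = layer_norm(H)                                          # H~ = LayerNorm(H)
    heads = [diffuse(A_m, H_tilde, alpha, K) for A_m in head_attn]   # AttDiffusion per head
    H_hat = np.concatenate(heads, axis=-1) @ W_o                     # equation (6)
    H_hat = H_hat + H                                                # first residual addition
    ff = np.maximum(layer_norm(H_hat) @ W_1, 0.0) @ W_2              # two-layer MLP with ReLU
    return ff + H_hat                                                # equation (7)

rng = np.random.default_rng(2)
n, d, M = 5, 8, 2
H = rng.normal(size=(n, d))
head_attn = []
for _ in range(M):
    A_m = rng.random((n, n)); A_m /= A_m.sum(axis=1, keepdims=True)  # placeholder attention
    head_attn.append(A_m)
W_o = 0.1 * rng.normal(size=(M * d, d))
W_1 = 0.1 * rng.normal(size=(d, 2 * d))
W_2 = 0.1 * rng.normal(size=(2 * d, d))
print(magna_block(H, head_attn, W_o, W_1, W_2).shape)   # (5, 8)
```

Stacking several such blocks yields the multi-block MAGNA architecture described further below.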
In reviewing the above description, MAGNA is different from GAT, and it generalizes GAT. MAGNA extends GAT via the diffusion process. The feature aggregation in GAT is $H^{(l+1)} = \sigma(A H^{(l)} W^{(l)})$, where $\sigma$ represents the activation function. We can divide the GAT layer into two components as follows: (1) attention-weighted aggregation over direct neighbors, $A H^{(l)} W^{(l)}$; and (2) a shallow non-linear activation $\sigma$.
In component (1), MAGNA removes the restriction of attending to direct neighbors, without requiring additional parameters, as $\mathcal{A}$ is induced from $A$. For component (2), MAGNA uses layer normalization and deep aggregation, which achieves significant gains according to ablation studies in Table 1 described later in the Experiments section of the disclosure. Compared to the “shallow” activation function elu in GAT, we can view deep aggregation (i.e., a two-layer MLP) as a learnable deep activation function, as a two-layer MLP can approximate many different functions.
In this section, we investigate the benefits of MAGNA from the viewpoint of discrete signal processing on graphs (Sandryhaila & Moura, Discrete signal processing on graphs: graph Fourier transform, ICASSP, 2013). Our first result demonstrates that MAGNA can better capture large-scale structural information. Our second result explores the relation between MAGNA and Personalized PageRank (PPR).
SPECTRAL PROPERTIES OF GRAPH ATTENTION DIFFUSION. We view the attention matrix $A$ of GAT and the diffused attention matrix $\mathcal{A}$ of MAGNA as weighted adjacency matrices, and apply the graph Fourier transform and spectral analysis to show the effect of MAGNA as a graph low-pass filter, being able to more effectively capture large-scale structure in graphs. By equation (3), the sum of each row of either $\mathcal{A}$ or $A$ is 1. Hence the normalized graph Laplacians are $\hat{L}_{sym} = I - \mathcal{A}$ and $L_{sym} = I - A$ for $\mathcal{A}$ and $A$, respectively. We can get the following proposition:
Proposition 2. Let $\hat{\lambda}_i^g$ and $\lambda_i^g$ be the $i$-th eigenvalues of $\hat{L}_{sym}$ and $L_{sym}$, respectively. Then, with $\theta_i = \alpha(1-\alpha)^i$,
$$\hat{\lambda}_i^g = \frac{\lambda_i^g}{\lambda_i^g + \frac{\alpha}{1-\alpha}} \quad (9)$$
We additionally have $\lambda_i^g \in [0, 2]$. Equation (9) shows that when $\lambda_i^g$ is sufficiently small, $\hat{\lambda}_i^g > \lambda_i^g$, whereas for large $\lambda_i^g$, $\hat{\lambda}_i^g < \lambda_i^g$. This relation indicates that the use of $\mathcal{A}$ increases smaller eigenvalues and decreases larger eigenvalues. The low-pass effect increases with smaller $\alpha$.
The eigenvalues of the low-frequency signals describe the large-scale structure in the graph and have been shown to be crucial in graph tasks. As $\lambda_i^g \in [0, 2]$, the reciprocal form in equation (9) amplifies the ratio of lower eigenvalues to the sum of all eigenvalues. In contrast, high eigenvalues, which correspond to noise, are suppressed.
PERSONALIZED PAGERANK MEETS GRAPH ATTENTION DIFFUSION. We can also view the attention matrix $A$ as a random walk matrix on the graph $\mathcal{G}$, since $\sum_{j=1}^{N_n} A_{ij} = 1$ and $A_{ij} \ge 0$. The personalized PageRank (PPR) matrix with teleport probability $\alpha$ is defined as:
$$A_{ppr} = \alpha \big(I - (1-\alpha) A\big)^{-1} \quad (10)$$
Using the power series expansion for the matrix inverse, we obtain:
$$A_{ppr} = \alpha \sum_{i=0}^{\infty} (1-\alpha)^i A^i = \sum_{i=0}^{\infty} \alpha (1-\alpha)^i A^i \quad (11)$$
Comparing to the diffusion equation (3) with $\theta_i = \alpha(1-\alpha)^i$, we have the following proposition:
Proposition 3. Graph attention diffusion defines a personalized PageRank with parameter $\alpha \in (0, 1]$ on $\mathcal{G}$ with transition matrix $A$, i.e., $\mathcal{A} = A_{ppr}$.
The parameter $\alpha$ in MAGNA is equivalent to the teleport probability of PPR. PPR provides a good relevance score between nodes in a weighted graph (the weights coming from the attention matrix $A$). In summary, MAGNA places a PPR prior over node pairwise attention scores: the diffused attention between node $i$ and node $j$ depends on the attention scores on the edges of all paths between $i$ and $j$.
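A quick numerical illustration of Proposition 3 on a toy attention matrix (an assumed NumPy sketch, with randomly generated values standing in for learned attention) shows that the geometric-series diffusion of equation (3) with $\theta_i = \alpha(1-\alpha)^i$ coincides with the PPR matrix of equation (10):

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 6, 0.15
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)                        # row-stochastic one-hop attention

A_ppr = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A)             # equation (10)
A_diffused = sum(alpha * (1 - alpha) ** i * np.linalg.matrix_power(A, i)
                 for i in range(500))                                   # truncated equation (3)

print(np.allclose(A_ppr, A_diffused, atol=1e-8))          # True: the diffused matrix equals A_ppr
```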
Implementation of the Present Disclosure in a Computing Device
The present disclosure relates to computer systems. As depicted in the drawings, computer components may include physical hardware components, which are shown as solid line blocks, and virtual software components, which are shown as dashed line blocks. One of ordinary skill in the art would appreciate that, unless otherwise indicated, these computer components may be implemented in, but not limited to, the forms of software, firmware or hardware components, or a combination thereof.
The apparatuses, systems and methods described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
The present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the present disclosure are shown. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present disclosure to those skilled in the art.
The processor 312 may be a central processing unit (CPU) which is configured to control operation of the computing device 310. The processor 312 can execute an operating system (OS) or other applications of the computing device 310. In certain embodiments, the computing device 310 may have more than one CPU as the processor, such as two CPUs, four CPUs, eight CPUs, or any suitable number of CPUs. The memory 314 can be a volatile memory, such as the random-access memory (RAM), for storing the data and information during the operation of the computing device 310. In certain embodiments, the memory 314 may be a volatile memory array. In certain embodiments, the computing device 310 may run on more than one memory 314. The storage device 316 is a non-volatile data storage medium for storing the OS (not shown) and other applications of the computing device 310. Examples of the storage device 316 may include non-volatile memory such as flash memory, memory cards, USB drives, hard drives, floppy disks, optical drives, solid-state drives, or any other types of data storage devices. In certain embodiments, the computing device 310 may have multiple storage devices 316, which may be identical storage devices or different types of storage devices, and the applications of the computing device 310 may be stored in one or more of the storage devices 316 of the computing device 310.
In these embodiments, the processor 312, the memory 314, and the storage device 316 are components of the computing device 310, such as a server computing device. In other embodiments, the computing device 310 may be a distributed computing device and the processor 312, the memory 314, and the storage device 316 are shared resources from multiple computing devices in a pre-defined area.
The storage device 316 includes, among other things, a multi-hop attention graph neural network (MAGNA) application 318 and a knowledge graph 332. The MAGNA application 318 is configured to train its model structure using labels of the knowledge graph 332, and make predictions to improve or complete the knowledge graph 332. The knowledge graph 332 is optional for the computing device 310, as long as the knowledge graph stored in other devices is accessible to the MAGNA application 318.
As shown in the drawings, the MAGNA application 318 includes, among other things, a data preparation module 320, MAGNA blocks 322, a loss function module 324, a prediction module 326, a function module 328, and an interface 330.
The data preparation module 320 is configured to prepare training samples or prediction samples, and send the prepared training samples or prediction samples to the MAGNA blocks. The knowledge graph 332 may have from over one thousand nodes up to several hundred thousand nodes, but the types of edges generally are limited, such as one type of edge (Yes and No) or several types of edges. Features of the nodes are stored in the knowledge graph 332, which could be age, gender, location, education, etc. of a customer when the nodes include customers. Edges or relations between nodes are stored in the knowledge graph 332; for example, a customer node and a product node may have the relation of browsing or purchasing. In certain embodiments, the knowledge graph 332 may not be a complete knowledge graph, and it may lack features of nodes or lack edges between certain related nodes. Under this situation, both the training samples and the prediction samples may be the knowledge graph 332. Known labels of nodes and edges between the nodes in the knowledge graph 332 are used for the training of the MAGNA application 318. After training, the well-trained MAGNA application 318 can then be used to obtain features of more nodes in the knowledge graph 332, or be used to complete the knowledge graph 332. In certain embodiments, the data preparation module 320 may prepare the knowledge graph 332 by embedding the nodes and edges into vectors. Particularly, let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be the given knowledge graph 332, where $\mathcal{V}$ is the set of $N_n$ nodes, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of $N_e$ edges connecting $M$ pairs of nodes in $\mathcal{V}$. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ are associated with their type mapping functions: $\phi: \mathcal{V} \rightarrow \mathcal{T}$ and $\psi: \mathcal{E} \rightarrow \mathcal{R}$, where $\mathcal{T}$ and $\mathcal{R}$ denote the sets of node types (labels) and edge/relation types. The MAGNA application 318 supports learning on heterogeneous graphs with multiple elements in $\mathcal{R}$.
The MAGNA blocks 322 are configured to, upon receiving the training knowledge graph or the knowledge graph for prediction from the data preparation module 320, train the MAGNA blocks 322 and the classifier or KG completion module 326, or use the well-trained MAGNA blocks 322 and the classifier or KG completion module 326 to make predictions. The MAGNA blocks 322 may include one or multiple MAGNA blocks 3220 that each have the same block structure.
At the start of a training of the MAGNA application 318, the node embedding H(0) is available to both the first layer normalization module 3221 and the first addition module 3223 of the first MAGNA block. After operation of the l-th MAGNA block, the outputted node embedding for that block, i.e., H(l), is available to both the first layer normalization module 3221 and the first addition module 3223 of the next MAGNA block. When the current MAGNA block is the last MAGNA block, the outputted node embedding is provided to the classifier or KG completion module 326.
The first layer normalization module 3221 is configured to, upon receiving the inputted node embedding $H^{(l)}$ at the $l$-th block, perform layer normalization on the inputted node embedding $H^{(l)}$ to obtain a first normalized embedding $\tilde{H}^{(l)}$, and send the first normalized embedding to the multi-head attention diffusion module 3222. The first layer normalization is defined as: $\tilde{H}^{(l)} = \mathrm{LayerNorm}(H^{(l)})$.
The multi-head attention diffusion module 3222 is configured to, upon receiving the first normalized embedding $\tilde{H}^{(l)}$, compute an attention diffusion for each head, aggregate the attention diffusions of all the heads to obtain the node embedding with aggregated attention diffusion $\hat{H}^{(l)}$, and send the node embedding with aggregated attention diffusion $\hat{H}^{(l)}$ to the first addition module 3223. The attention diffusion for head $i$ is calculated by $\mathrm{head}_i = \mathrm{AttDiffusion}(\mathcal{G}, \tilde{H}^{(l)}, \Theta_i)$, which can be calculated from the equations (1), (2), (3), and (4) using the first normalized embedding $\tilde{H}^{(l)}$ and model parameters of the multi-head attention diffusion module 3222. In certain embodiments, the attention diffusion for head $i$ is approximately calculated using equation (5) instead of using equations (3) and (4). When the attention diffusions for all the heads are available, the attention diffusions are concatenated by equation (6): $\hat{H}^{(l)} = \mathrm{MultiHead}(\mathcal{G}, \tilde{H}^{(l)}) = (\Vert_{i=1}^{M} \mathrm{head}_i) W_o$.
The first addition module 3223 is configured to, upon receiving the aggregated attention diffusion $\hat{H}^{(l)}$, add the aggregated attention diffusion to the inputted embedding to obtain the embedding with first addition $\hat{H}^{(l+1)}$, and send the embedding with the first addition $\hat{H}^{(l+1)}$ to the second layer normalization module 3224 and the second addition module 3226. The inputted embedding could be the node embedding from a previous MAGNA block, or the initial embedding $H^{(0)}$ if the current MAGNA block 3220 is the first of the MAGNA blocks. The addition step is defined as: $\hat{H}^{(l+1)} = \hat{H}^{(l)} + H^{(l)}$.
The second layer normalization module 3224 is configured to, upon receiving the embedding with the first addition $\hat{H}^{(l+1)}$, normalize the embedding to obtain a second normalized embedding, and send the second normalized embedding to the feed forward module 3225. The layer normalization is performed by $\mathrm{LayerNorm}(\hat{H}^{(l+1)})$.
The feed forward module 3225 is configured to, upon receiving the second normalized embedding $\mathrm{LayerNorm}(\hat{H}^{(l+1)})$ from the second layer normalization module 3224, perform feed forward to obtain the feed forward embedding $W_2^{(l)} \mathrm{ReLU}(W_1^{(l)} \mathrm{LayerNorm}(\hat{H}^{(l+1)}))$, and send the feed forward embedding to the second addition module 3226.
The second addition module 3226 is configured to, upon receiving the embedding with the first addition $\hat{H}^{(l+1)}$ from the first addition module 3223 and the feed forward embedding $W_2^{(l)} \mathrm{ReLU}(W_1^{(l)} \mathrm{LayerNorm}(\hat{H}^{(l+1)}))$ from the feed forward module 3225, perform an addition of the two to obtain the updated node embedding $H^{(l+1)}$ by: $H^{(l+1)} = W_2^{(l)} \mathrm{ReLU}(W_1^{(l)} \mathrm{LayerNorm}(\hat{H}^{(l+1)})) + \hat{H}^{(l+1)}$ (equation (7)), such that the updated node embedding $H^{(l+1)}$ is available to the (l+1)-th MAGNA block, or available to the loss function module 324 when the current block is the last MAGNA block.
Referring back to the drawings, the loss function module 324 is configured to, upon receiving the output embedding from the last of the MAGNA blocks 322, calculate a loss function by comparing the output embedding with ground truth labels of the nodes and edges of the knowledge graph 332, use the loss function to adjust parameters of the MAGNA blocks 322, and, when the training converges or a predetermined number of iterations is reached, notify the prediction module 326 that the model is well trained.
The prediction module 326 is configured to, upon receiving the notice from the loss function module 324 that the model is well trained, use the well-trained MAGNA blocks 322 to classify the nodes that do not have a classified type, or predict relations between the nodes that are not linked by edges, and add the new node classifications, and/or new edges, and/or new edge relations to the knowledge graph 332 such that the knowledge graph 332 is updated with more information. The updated knowledge graph 332 is available to the function module 328. In certain embodiments, the prediction module 326 is a decoder of a transformer in the field. In certain embodiments, the decoder is a classifier.
The function module 328 is configured to, when the knowledge graph 332 is updated, use the updated knowledge graph to perform certain functions. For example, when the knowledge graph 332 is a customer and product knowledge graph, the knowledge graph 332 may be used to recommend a product to one or more customers when there is a prediction that the customers will be interested in the product, which may be indicated by an edge or a relation linking the product and the customers. In certain embodiments, the nodes of the knowledge graph 332 may be customers, and each customer may be classified as a high credit score customer or a low credit score customer. By updating the classification of the customers as belonging to high credit or low credit via the prediction module 326, credit information of more customers is available, and the credit information of the customers may be used by loan companies.
In certain embodiments, the function module 328 is configured to perform the above function automatically or in a predefined time interval, or when triggered by an update of the knowledge graph 332. For example, after the update of the knowledge graph 332, the function module 328 would look for more linked relations between products and customers, and the function module 328 would subsequently push the products to the corresponding customers when the updated relations indicate that the customers are interested in the products.
The interface 330 is configured to provide an interface for an administrator of the MAGNA application 318 to train the MAGNA blocks 322 and optionally the loss function module 324, and adjust model parameters, or is configured to provide an interface for the administrator to use the MAGNA application 318 to obtain and use the updated knowledge graph 332 for certain functions.
Kindly note that i in different context of the present disclosure may have different meanings. For example, the i in vi means the i-th node, the i in θi and Ai is a positive integer indicating the path length of edges between nodes, the i in headi represents the i-th head for calculating attention diffusion.
As shown in the drawings, at procedure 402, the data preparation module 320 initializes the embedding H(0) of the nodes of the knowledge graph 332, and provides the initialized embedding to the first layer normalization module 3221 and the first addition module 3223 of the first MAGNA block 3220.
At procedure 404, the first layer normalization module 3221 of the first MAGNA block 3220 performs layer normalization on the initialized embedding to obtain the normalized embedding $\tilde{H}^{(0)}$, i.e., $\tilde{H}^{(0)} = \mathrm{LayerNorm}(H^{(0)})$, and sends the normalized embedding to the multi-head attention diffusion module 3222.
At procedure 406, upon receiving the normalized embedding $\tilde{H}^{(0)}$, the multi-head attention diffusion module 3222 calculates an attention score matrix, where each entry of the attention score matrix is an attention score of an edge. The attention score of one of the edges $(v_i, r_k, v_j)$ is calculated using: $s_{i,k,j}^{(0)} = \mathrm{LeakyReLU}(v_a^{(0)} \tanh(W_h^{(0)} h_i^{(0)} \| W_t^{(0)} h_j^{(0)} \| W_r^{(0)} r_k))$ (equation (1)), and the attention score matrix $S^{(0)}$ is summarized by equation (2), with $S^{(0)}_{i,j} = s^{(0)}_{i,k,j}$ if the edge $(v_i, r_k, v_j)$ appears in the knowledge graph 332 and $S^{(0)}_{i,j} = -\infty$ otherwise.
Accordingly, the attention score matrix includes an attention score for each of the edges that currently exist in the knowledge graph 332.
At procedure 408, the multi-head attention diffusion module 3222 performs softmax on the attention score matrix $S^{(0)}$ to obtain the attention matrix $A^{(0)}$, where $A^{(0)} = \mathrm{softmax}(S^{(0)})$.
At procedure 410, the multi-head attention diffusion module 3222 calculates the graph attention diffusion using the attention matrix $A^{(0)}$. In certain embodiments, to increase the calculation speed, the graph attention diffusion is approximately calculated. Specifically, by defining $Z^{(0)} = \tilde{H}^{(0)}$ and $Z^{(k+1)} = (1-\alpha) A Z^{(k)} + \alpha Z^{(0)}$, then $Z^{(K)}$ approximates $\mathcal{A}\tilde{H}^{(0)}$. Here $\alpha$ is a predefined constant in a range of 0-0.5. In certain embodiments, $\alpha$ is in a range of 0.05 to 0.25. In certain embodiments, $\alpha$ is in a range of 0.1 to 0.2. In certain embodiments, $\alpha$ is 0.1 or 0.15. Further, $0 \le k \le K$, and K is a positive integer in a range of 3-10. In certain embodiments, K is in a range of 4-8. In certain embodiments, K is 6. The values of $\alpha$ and K may vary according to the size and features of the knowledge graph 332. For example, assume that $\alpha$ is 0.1 and K is 6; then $Z^{(1)} = 0.9 A Z^{(0)} + 0.1 Z^{(0)}$, $Z^{(2)} = 0.9 A Z^{(1)} + 0.1 Z^{(0)}$, $Z^{(3)} = 0.9 A Z^{(2)} + 0.1 Z^{(0)}$, $Z^{(4)} = 0.9 A Z^{(3)} + 0.1 Z^{(0)}$, $Z^{(5)} = 0.9 A Z^{(4)} + 0.1 Z^{(0)}$, $Z^{(6)} = 0.9 A Z^{(5)} + 0.1 Z^{(0)}$, and $Z^{(6)}$ is the graph attention diffusion, which is the approximation of $\mathcal{A}\tilde{H}^{(0)}$. In certain embodiments, the calculation of the graph attention diffusion can also be performed using the equations (3) and (4). However, because $Z^{(1)}$, $Z^{(2)}$, $Z^{(3)}$, $Z^{(4)}$, $Z^{(5)}$, and $Z^{(6)}$ are calculated recursively, the calculation is much faster than the calculation using the equations (3) and (4).
At procedure 412, the procedures 404-410 are performed for each head to obtain the graph attention diffusion $\mathcal{A}\tilde{H}^{(0)}$ (i.e., $Z^{(K)}$) for each of the heads, and the multi-head attention diffusion module 3222 summarizes the graph attention diffusions for all the heads to obtain the aggregated attention diffusion $\hat{H}^{(0)}$. That is, $\hat{H}^{(0)} = \mathrm{MultiHead}(\mathcal{G}, \tilde{H}^{(0)}) = (\Vert_{m=1}^{M} \mathrm{head}_m) W_o$ (equation (6)). After obtaining the aggregated attention diffusion $\hat{H}^{(0)}$, the multi-head attention diffusion module 3222 further sends the multi-head attention diffusion to the first addition module 3223.
At procedure 414, upon receiving the multi-head attention diffusion from the multi-head attention diffusion module 3222, the first addition module 3223 adds the aggregated attention diffusion $\hat{H}^{(0)}$ to the initial embedding $H^{(0)}$ to obtain the added attention diffusion $\hat{H}^{(1)}$ by: $\hat{H}^{(1)} = \hat{H}^{(0)} + H^{(0)}$, and sends the added attention diffusion to the second layer normalization module 3224.
At procedure 416, upon receiving the added attention diffusion, the second layer normalization module 3224 performs layer normalization on the added attention diffusion to obtain normalized attention diffusion: LayerNorm(Ĥ(1)), and sends the normalized attention diffusion to the feed forward module 3225.
At procedure 418, upon receiving the normalized attention diffusion from the second layer normalization module 3224, the feed forward module 3225 performs feed forward on the normalized attention diffusion to obtain the feed forward attention, and sends the feed forward attention to the second addition module 3226. The feed forward attention is $W_2^{(0)} \mathrm{ReLU}(W_1^{(0)} \mathrm{LayerNorm}(\hat{H}^{(1)}))$.
At procedure 420, upon receiving the feed forward attention from the feed forward module 3225, the second addition module 3226 adds the added attention diffusion to the feed forward attention to obtain the updated embedding $H^{(1)}$, that is: $H^{(1)} = W_2^{(0)} \mathrm{ReLU}(W_1^{(0)} \mathrm{LayerNorm}(\hat{H}^{(1)})) + \hat{H}^{(1)}$ (equation (7)). After obtaining the updated embedding $H^{(1)}$, the second addition module 3226 sends the updated embedding to the second of the MAGNA blocks 322. In certain embodiments, the feed forward module 3225 uses a two-layer feed forward network, such as a two-layer MLP.
At procedure 422, the second of the MAGNA blocks repeats the procedures 404-420, but in the procedure 404, the input is not the initial embedding H(0), but the updated embedding H(1). In certain embodiments, the MAGNA blocks 322 includes 2-10 blocks. In certain embodiments, the MAGNA blocks 322 includes 3-6 blocks. The number of blocks may depend on size and features of the knowledge graph 332. The procedures 404-420 are performed for each of the blocks, so as to update the embeddings. When the number of blocks is L, the blocks would be block 0 (or layer 0), block 1 (or layer 1), . . . , block (L−1) (or layer (L−1)), and the final output of the embedding would be H(L). After obtaining the output of the embedding H(L), the MAGNA blocks 322 further sends the output embedding to the loss function module 324.
At procedure 424, upon receiving the output embedding from the MAGNA blocks 322, the loss function module 324 calculates the loss function by comparing the output embedding with the ground truth labels of the knowledge graph 332, and uses the loss function to adjust parameters of the MAGNA blocks 322. In certain embodiments, the loss function is a cross entropy loss.
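As a hedged sketch of the cross-entropy loss mentioned above, assuming for illustration a linear classifier on top of the output embeddings (the actual classifier or KG completion module 326 and the gradient update would normally be handled by an automatic differentiation framework):

```python
import numpy as np

def cross_entropy_loss(H_out, W_cls, labels):
    """H_out: (n, d) output node embeddings; W_cls: (d, C) classifier; labels: (n,) ints."""
    logits = H_out @ W_cls
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    n = H_out.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

rng = np.random.default_rng(4)
H_out = rng.normal(size=(5, 8))            # output of the last MAGNA block (toy values)
W_cls = rng.normal(size=(8, 3))            # hypothetical 3-class node classifier
labels = np.array([0, 2, 1, 1, 0])         # ground truth labels of the labeled nodes
print(cross_entropy_loss(H_out, W_cls, labels))
```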
At procedure 426, the MAGNA application 318 performs the procedures 404-424 iteratively using the updated embedding from the previous iteration, until the training has been repeated for a predetermined number of times, or until the model parameters converge.
As shown in the drawings, at procedure 502, the well-trained MAGNA application 318 predicts features for the nodes of the knowledge graph 332 that lack those features, together with probabilities or confidences of the predicted features.
The probabilities of the predicted features of the nodes may vary, and at procedure 504, the prediction module 326 ranks the nodes based on probabilities or confidence of their predicted features from high to low.
At procedure 506, the prediction module 326 selects the nodes at the top of the rank, and adds the predicted features to the selected nodes. By adding the feature values to the nodes that do not have that feature, the knowledge graph 332 is more complete.
As shown in the drawings, at procedure 512, the well-trained MAGNA application 318 predicts new edges between nodes of the knowledge graph 332 that are not currently linked by edges, together with probabilities or confidences of the predicted new edges.
The probabilities of the predicted edges may vary, and at procedure 514, the prediction module 326 ranks the newly predicted edges based on probabilities or confidence of the new edges.
At procedure 516, the prediction module 326 selects the new edges at the top of the rank, and adds the predicted new edges to the knowledge graph. By adding the new edges that do not exist before, the knowledge graph 332 is more complete.
In certain aspects, the present disclosure provides methods of using the completed knowledge graph. In certain embodiments, the method is credit evaluation, and the method may include: completing the knowledge graph as described above, and providing credit scores for customers based on features of the customers in the completed knowledge graph.
We evaluate MAGNA on two classical tasks. (1) On node classification we achieve an average of 5.7% relative error reduction; (2) on knowledge graph completion, we achieve 7.1% relative improvement in the Hit at 1 metric. We compare with numbers reported by baseline papers when available.
Datasets. We employ four benchmark datasets for node classification: (1) standard citation network benchmarks Cora, Citeseer and Pubmed; and (2) a benchmark dataset ogbn-arxiv on 170 K nodes and 1.2 M edges from the Open Graph Benchmark. We follow the standard data splits for all datasets.
Baselines. We compare against a comprehensive suite of state-of-the-art GNN methods including: GCNs, Chebyshev filter based GCNs, DualGCN, JKNet, LGCN, Diffusion-GCN, APPNP, Graph U-Nets (g-U-Nets), and GAT.
Experimental Setup. For datasets Cora, Citeseer and Pubmed, we use 6 MAGNA blocks with hidden dimension 512 and 8 attention heads. For the large-scale ogbn-arxiv dataset, we use 2 MAGNA blocks with hidden dimension 128 and 8 attention heads.
Results. We report node classification accuracies on the benchmarks. Results are summarized in Table 1 of the drawings.
Ablation study. We report in Table 1 the model performance after removing each component of MAGNA (layer normalization, attention diffusion and deep aggregation feed forward layers) from every layer of MAGNA. Note that the model is equivalent to GAT without these three components. We observe that both diffusion and layer normalization play a crucial role in improving the node classification performance for all datasets. While layer normalization alone does not benefit GNNs, its use in conjunction with the attention diffusion module significantly boosts MAGNA's performance. Since MAGNA computes many attention values, layer normalization is crucial in ensuring training stability. Meanwhile, we also remove both layer normalization and the deep aggregation feed forward layer, and keep only the attention diffusion layer (see the next-to-last row of Table 1). Compared to GAT, attention diffusion, which allows multi-hop attention in each layer, still benefits the performance of node classification.
Datasets. We evaluate MAGNA on standard benchmark knowledge graphs: WN18RR and FB15K-237.
Baselines. We compare MAGNA with state-of-the-art baselines, including (1) translational distance based models: TransE and its latest extension RotatE, OTE, and ROTH; (2) semantic matching based models: ComplEx, QuatE, CoKE, ConvE, DistMult, and TuckER; (3) GNN-based models: R-GCN, SACN, and A2N.
Training procedure. We use the standard training procedure used in previous KG embedding models. We follow an encoder-decoder framework. The encoder applies the proposed MAGNA model to compute the entity embeddings. The decoder then makes link prediction given the embeddings outputted from the MAGNA, and existing decoders in prior models can be applied. To show the power of MAGNA, we employ the DistMult decoder, a simple decoder without extra parameters.
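For illustration, the DistMult decoder scores a triple by a three-way element-wise product. The sketch below, with assumed embedding sizes and randomly initialized vectors standing in for the MAGNA encoder output, shows how candidate tails can be scored and ranked for the MRR/Hits@K evaluation:

```python
import numpy as np

def distmult_score(e_head, w_rel, e_tail):
    """DistMult: score(h, r, t) = sum_k e_h[k] * w_r[k] * e_t[k]; higher is more plausible."""
    return np.sum(e_head * w_rel * e_tail, axis=-1)

rng = np.random.default_rng(5)
entity_emb = rng.normal(size=(10, 16))     # entity embeddings from the MAGNA encoder (assumed)
rel_emb = rng.normal(size=(3, 16))         # one embedding vector per relation type

# Score every candidate tail for the query (head=0, relation=1, ?) and rank them.
scores = distmult_score(entity_emb[0], rel_emb[1], entity_emb)
ranking = np.argsort(-scores)              # descending order, used for MRR / Hits@K
print(ranking[:5])
```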
Evaluation. We use the standard split for the benchmarks, and the standard testing procedure of predicting tail (head) entity given the head (tail) entity and relation type. We exactly follow the evaluation used by all previous works, namely the Mean Reciprocal Rank (MRR), Mean Rank (MR), and hit rate at K (H@K).
Results. MAGNA achieves new state-of-the-art in knowledge graph completion on all four metrics, as shown in the drawings.
Here we present (1) the spectral analysis results, (2) effect of the hyper-parameters on MAGNA performance, and (3) attention distribution analysis to show the strengths of MAGNA.
Spectral Analysis. Why does MAGNA work for node classification? We compute the eigenvalues $\lambda_i^g$ of the graph Laplacian of the attention matrix $A$, and compare them to the eigenvalues $\hat{\lambda}_i^g$ of the graph Laplacian of the diffused attention matrix $\mathcal{A}$.
MAGNA Model Depth. We conduct experiments by varying the number of GCN, GAT and our MAGNA layers to be 3, 6, 12, 18 and 24 for node classification on Cora. The results are provided in the drawings.
Effect of K and α.
We also observe that the accuracy drops significantly for larger α>0.25. This is because small α increases the low-pass effect (see the spectral analysis above).
Attention Distribution. Last, we also analyze the learned attention scores of GAT and MAGNA. We first define a discrepancy metric Δi over the attention matrix A for node vi, which measures how much the learnt attention distribution of node vi deviates from an uninformative uniform distribution score Ui (Shanthamallu et al., 2020). A large Δi indicates more meaningful attention scores.
We proposed Multi-hop Attention Graph Neural Network (MAGNA), which brings together benefits of graph attention and diffusion techniques in a single layer through attention diffusion, layer normalization and deep aggregation. MAGNA enables context-dependent attention between any pair of nodes in the graph in a single layer, enhances large-scale structural information, and learns more informative attention distribution. MAGNA improves over all state-of-the-art methods on the standard tasks of node classification and knowledge graph completion.
The foregoing description of the exemplary embodiments of the disclosure has been presented only for the purposes of illustration and description and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching.
The embodiments were chosen and described in order to explain the principles of the disclosure and their practical application so as to enable others skilled in the art to utilize the disclosure and various embodiments and with various modifications as are suited to the particular use contemplated. Alternative embodiments will become apparent to those skilled in the art to which the present disclosure pertains without departing from its spirit and scope. Accordingly, the scope of the present disclosure is defined by the appended claims rather than the foregoing description and the exemplary embodiments described therein.
This application claims priority to and the benefit of, pursuant to 35 U.S.C. § 119(e), U.S. provisional patent application Ser. No. 63/082,096, filed Sep. 23, 2020, titled “METHOD AND SYSTEM FOR REPRESENTATION LEARNING ON RELATION STRUCTURE BY GRAPH DIFFUSION TRANSFORMER” by Guangtao Wang, Zhitao Ying, Jing Huang, and Jurij Leskovec, which is incorporated herein in its entirety by reference. Kindly note that the graph diffusion transformer in the above provisional application is equivalent to multi-hop attention graph neural network (MAGNA) discussed in this disclosure. Some references, which may include patents, patent applications and various publications, are cited and discussed in the description of this disclosure. The citation and/or discussion of such references is provided merely to clarify the description of the present disclosure and is not an admission that any such reference is “prior art” to the disclosure described herein. All references cited and discussed in this specification are incorporated herein by reference in their entirety and to the same extent as if each reference were individually incorporated by reference.