This application claims priority to Chinese Patent Application No. 202311788004.6, filed on Dec. 22, 2023, which is hereby incorporated by reference in its entirety.
One or more embodiments of this specification relate to the field of computer technologies, and in particular, to methods and apparatuses for storing graph data of a relationship network graph.
A relationship network graph is referred to as a graph for short. A graph is a structure for representing association relationships between objects and is described using vertices and edges. The vertices are also referred to as nodes and are used to represent objects. The edges are also referred to as connecting edges and are used to represent relationships between objects. The connecting edges are further classified into undirected connecting edges and directed connecting edges. If a connecting edge between two nodes has no direction, the connecting edge is referred to as an undirected connecting edge. If a connecting edge from one node to another node has a direction, the connecting edge is referred to as a directed connecting edge. Generally, a node corresponds to one or more node attributes, and a connecting edge corresponds to one or more edge attributes. A specific value of a node attribute or an edge attribute may be private data.
Graph analysis is a series of complex computations performed on the objects, the relationships, and their attributes included in graph data. As graph data scales up, graph analysis performance often fails to meet requirements, and an efficient graph data management method is crucial to improving graph analysis performance. Graph data management mainly relies on the storage of the graph data of a relationship network graph.
One or more embodiments of this specification describe methods and apparatuses for storing graph data of a relationship network graph. The methods and apparatuses can implement efficient graph data management, thereby improving graph analysis performance.
According to a first aspect, a method for storing graph data of a relationship network graph is provided. The relationship network graph includes a directed connecting edge between nodes, and the method includes the following: connection relationship information between any two nodes in the relationship network graph is acquired; based on the connection relationship information, a first mapping relationship between an identifier of each node in the relationship network graph and a node identifier of an outgoing edge-connected node of the node is stored in a compressed sparse row format, and a second mapping relationship between the identifier of each node and a node identifier of an incoming edge-connected node of the node is stored in a compressed sparse column format; a set of attribute information in the relationship network graph is acquired, where the set of attribute information includes several node attributes, several edge attributes, and/or several pieces of temporary information; and each attribute value of the same attribute in the set of attribute information is stored in continuous space by means of column storage.
In some possible implementations, the connection relationship information includes one of the following: an adjacency matrix or an adjacency table.
In some possible implementations, the first mapping relationship between the identifier of each node in the relationship network graph and the node identifier of the outgoing edge-connected node of the node is stored in the compressed sparse row format, including: a node identifier of each target node is stored in a first array; where node identifiers of target nodes corresponding to the same node are continuously arranged; and a location index of the first target node of the same node in the first array is stored in a second array.
Further, node identifiers of nodes in the relationship network graph are sequentially incremented by 1 from an initial value, and location indexes corresponding to different nodes are stored in the second array based on a value sequence of the node identifiers of the nodes; and the method further includes the following: when a target node corresponding to an outgoing edge of a first node is queried, based on a node identifier of the first node, a first location index of the first target node of the first node is acquired from the second array, and a second location index of the first target node of a second node is acquired from the second array; where a node identifier of the second node is 1 greater than the node identifier of the first node; a first index set is determined based on the first location index and the second location index; where the first index set includes each index between the first location index and the second location index, and does not include the second location index; and an identifier of a target node corresponding to each index in the first index set is acquired from the first array, and the identifier of the target node is used as an identifier of each target node corresponding to the outgoing edge of the first node.
In some possible implementations, the second mapping relationship between the identifier of each node and the node identifier of the incoming edge-connected node of the node is stored in the compressed sparse column format, including: a node identifier of each start node is stored in a third array; where node identifiers of start nodes corresponding to the same node are continuously arranged; and a location index of the first start node of the same node in the third array is stored in a fourth array.
Further, node identifiers of nodes in the relationship network graph are sequentially incremented by 1 from an initial value, and location indexes corresponding to different nodes are stored in the fourth array based on a value sequence of the node identifiers of the nodes; and the method further includes the following: when a start node corresponding to an incoming edge of a first node is queried, based on a node identifier of the first node, a third location index of the first start node of the first node is acquired from the fourth array, and a fourth location index of the first start node of a second node is acquired from the fourth array; where a node identifier of the second node is 1 greater than the node identifier of the first node; a second index set is determined based on the third location index and the fourth location index; where the second index set includes each index between the third location index and the fourth location index, and does not include the fourth location index; and an identifier of a start node corresponding to each index in the second index set is acquired from the third array, and the identifier of the start node is used as an identifier of each start node corresponding to the incoming edge of the first node.
In some possible implementations, the storing in continuous space by means of column storage includes the following: indication information indicating whether to perform storage in a disk is extracted based on configuration information of a target attribute; if the indication information indicates to perform storage in a disk, each attribute value of the target attribute is stored in continuous space of the disk; or if the indication information indicates not to perform storage in a disk, each attribute value of the target attribute is stored in continuous space of a memory.
In some possible implementations, the method further includes the following: in a process of performing data analysis on the relationship network graph, a node identifier of an outgoing edge-connected node of a first node is acquired based on the first mapping relationship, or a node identifier of an incoming edge-connected node of a second node is acquired based on the second mapping relationship.
According to a second aspect, an apparatus for storing graph data of a relationship network graph is provided. The relationship network graph includes a directed connecting edge between nodes, and the apparatus includes the following: a first acquisition unit, configured to acquire connection relationship information between any two nodes in the relationship network graph; a first storage unit, configured to, based on the connection relationship information acquired by the first acquisition unit, store a first mapping relationship between an identifier of each node in the relationship network graph and a node identifier of an outgoing edge-connected node of the node in a compressed sparse row format, and store a second mapping relationship between the identifier of each node and a node identifier of an incoming edge-connected node of the node in a compressed sparse column format; a second acquisition unit, configured to acquire a set of attribute information in the relationship network graph, where the set of attribute information includes several node attributes, several edge attributes, and/or several pieces of temporary information; and a second storage unit, configured to store each attribute value of the same attribute in the set of attribute information acquired by the second acquisition unit in continuous space by means of column storage.
According to a third aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and when the computer program is executed in a computer, the computer is enabled to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a storage and a processor. The storage stores executable code, and when executing the executable code, the processor implements the method of the first aspect.
According to the methods and the apparatuses provided in the embodiments of this specification, first, connection relationship information between any two nodes in a relationship network graph is acquired; then, based on the connection relationship information, a first mapping relationship between an identifier of each node in the relationship network graph and a node identifier of an outgoing edge-connected node of the node is stored in a compressed sparse row format, and a second mapping relationship between the identifier of each node and a node identifier of an incoming edge-connected node of the node is stored in a compressed sparse column format; subsequently, a set of attribute information in the relationship network graph is acquired, where the set of attribute information includes several node attributes, several edge attributes, and/or several pieces of temporary information; and finally, each attribute value of the same attribute in the set of attribute information is stored in continuous space by means of column storage. It can be seen from the above-mentioned description that, in the embodiments of this specification, outgoing-edge information and incoming-edge information of a node are respectively stored in the compressed sparse row format and the compressed sparse column format so as to compress as much space as possible, and increase an edge traversal speed. In addition, unlike common application of the compressed sparse row format and compressed sparse column format, the two formats are used only to store basic information, that is, node identifiers. Other information such as vertex attributes, edge attributes, and temporary information in a graph analysis process is structurally fused by means of column storage so as to improve memory access efficiency of the attributes. In summary, efficient graph data management can be implemented, thereby improving graph analysis performance.
To describe the technical solutions in the embodiments of this application more clearly, the following briefly describes the accompanying drawings needed for describing the embodiments. Clearly, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from these accompanying drawings without creative efforts.
The following describes the solutions provided in this specification with reference to the accompanying drawings.
Referring to
The graph data further includes two edge attributes. Corresponding to the two edge attributes, each connecting edge has its own attribute values. For example, the connecting edge from node 0 to node 1 has an attribute value of 10 corresponding to a first edge attribute, and has an attribute value of “red” corresponding to a second edge attribute; and a connecting edge from node 1 to node 4 has an attribute value of 40 corresponding to the first edge attribute, and has an attribute value of “green” corresponding to the second edge attribute.
In the embodiments of this specification, a corresponding solution is proposed for storing the connection relationship information and the other information that are included in the graph data so as to implement efficient graph data management, thereby improving graph analysis performance.
The graph data can be stored in multiple forms, for example, an adjacency matrix, an adjacency table, a compressed sparse row (CSR), and a compressed sparse column (CSC).
Generally, if the quantity of elements with a value of 0 in a matrix is far greater than the quantity of nonzero elements, and the nonzero elements are irregularly distributed, the matrix is a sparse matrix. An adjacency matrix is usually a sparse matrix.
In the embodiments of this specification, because the graph is sparse, the CSR and the CSC are used to respectively store outgoing-edge information and incoming-edge information of nodes so as to compress as much space as possible and increase an edge traversal speed. Unlike common applications of these formats, however, the CSR and the CSC are used only to store basic information, such as a node identifier of a start node and a node identifier of a target node. The other information, such as the node attributes, the edge attributes, and the temporary information in the graph analysis process, is structurally fused by means of column storage so as to improve memory access efficiency of the attributes.
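As an illustrative, non-limiting sketch, the sparsity of an adjacency matrix can be measured as the share of nonzero elements. The 5-node matrix below and the function name nonzero_fraction are hypothetical and are used only for illustration. The count of compressed cells assumes a layout with one identifier per connecting edge, one offset per node, and one closing offset.

```python
# Hypothetical 5-node directed graph; row = start node, column = target node.
ADJACENCY_MATRIX = [
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]


def nonzero_fraction(matrix):
    """Return the share of matrix elements that are nonzero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


num_nodes = len(ADJACENCY_MATRIX)
num_edges = sum(sum(row) for row in ADJACENCY_MATRIX)

# Cells needed by the full matrix.
matrix_cells = num_nodes * num_nodes

# Assumed compressed layout: one identifier per edge, one offset per node,
# and one closing offset.
compressed_cells = num_edges + num_nodes + 1

print(nonzero_fraction(ADJACENCY_MATRIX), matrix_cells, compressed_cells)
```

The matrix grows with the square of the quantity of nodes, whereas the assumed compressed layout grows with the quantity of nodes plus the quantity of connecting edges, so the difference widens as the graph scales up.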
First, in step 31, the connection relationship information between any two nodes in the relationship network graph is acquired. It can be understood that the connection relationship information indicates whether a directed connecting edge exists between any two nodes.
In some examples, the connection relationship information includes one of the following: an adjacency matrix or an adjacency table.
In the examples, the adjacency matrix can be considered as a two-dimensional array that is used to store data of relationships between nodes. The adjacency table is a chained-storage method for a graph, and a data structure of the adjacency table includes two parts: a node and the adjacent points of the node.
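As an illustrative, non-limiting sketch, the two forms of connection relationship information can be built from a list of directed connecting edges as follows. The connecting edges 0→1, 0→2, and 1→4 appear in this description; the connecting edges 2→3 and 3→4, the quantity of nodes, and all function names are assumptions made only for illustration. A Python list stands in for the chained storage of the adjacency table.

```python
# Edges 0->1, 0->2 and 1->4 appear in the description; 2->3 and 3->4 are
# assumed here only to make the example fuller.
NUM_NODES = 5
EDGES = [(0, 1), (0, 2), (1, 4), (2, 3), (3, 4)]


def to_adjacency_matrix(num_nodes, edges):
    """Return a 0/1 matrix; row = start node, column = target node."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def to_adjacency_table(num_nodes, edges):
    """Return a list in which entry i holds the target nodes of node i."""
    table = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        table[src].append(dst)
    return table


adjacency_matrix = to_adjacency_matrix(NUM_NODES, EDGES)
adjacency_table = to_adjacency_table(NUM_NODES, EDGES)

for node, targets in enumerate(adjacency_table):
    print(node, "->", targets)
```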
Then, in step 32, based on the connection relationship information, the first mapping relationship between the identifier of each node in the relationship network graph and the node identifier of the outgoing edge-connected node of the node is stored in the compressed sparse row format, and the second mapping relationship between the identifier of each node and the node identifier of the incoming edge-connected node of the node is stored in the compressed sparse column format. It can be understood that, in the embodiments of this specification, both outgoing-edge information of a node and incoming-edge information of the node are stored, thereby helping increase an edge traversal speed.
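As an illustrative, non-limiting sketch, step 32 can be realized by a counting pass followed by a placement pass over the connecting edges. The function name build_csr_csc, the example edge list, and the extra closing entry at the end of each offset array (equal to the quantity of connecting edges, so that the range of the last node is also bounded) are assumptions of this sketch.

```python
def build_csr_csc(num_nodes, edges):
    """Return (out_targets, out_offsets, in_sources, in_offsets).

    out_targets / out_offsets hold outgoing edges (compressed sparse row);
    in_sources / in_offsets hold incoming edges (compressed sparse column).
    Only node identifiers are stored; attributes are kept elsewhere.
    """
    out_offsets = [0] * (num_nodes + 1)
    in_offsets = [0] * (num_nodes + 1)

    # Counting pass: number of outgoing and incoming edges per node.
    for src, dst in edges:
        out_offsets[src + 1] += 1
        in_offsets[dst + 1] += 1

    # Prefix sums turn the counts into start positions.
    for i in range(num_nodes):
        out_offsets[i + 1] += out_offsets[i]
        in_offsets[i + 1] += in_offsets[i]

    # Placement pass: write each identifier into the next free slot of its node.
    out_targets = [0] * len(edges)
    in_sources = [0] * len(edges)
    out_cursor = out_offsets[:-1]  # slicing copies the list
    in_cursor = in_offsets[:-1]
    for src, dst in edges:
        out_targets[out_cursor[src]] = dst
        out_cursor[src] += 1
        in_sources[in_cursor[dst]] = src
        in_cursor[dst] += 1

    return out_targets, out_offsets, in_sources, in_offsets


# Hypothetical 5-node graph with node identifiers 0..4.
EDGES = [(0, 1), (0, 2), (1, 4), (2, 3), (3, 4)]
out_targets, out_offsets, in_sources, in_offsets = build_csr_csc(5, EDGES)
print(out_targets, out_offsets)
print(in_sources, in_offsets)
```

In this sketch, both passes are linear in the quantity of connecting edges, and the identifiers of the nodes connected to the same node end up in adjacent positions.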
In some examples, the first mapping relationship between the identifier of each node in the relationship network graph and the node identifier of the outgoing edge-connected node of the node is stored in the compressed sparse row format, including: a node identifier of each target node is stored in a first array; where node identifiers of target nodes corresponding to the same node are continuously arranged; and a location index of the first target node of the same node in the first array is stored in a second array.
In the examples, the above-mentioned first mapping relationship is stored by using the first array and the second array.
Further, node identifiers of nodes in the relationship network graph are sequentially incremented by 1 from an initial value, and location indexes corresponding to different nodes are stored in the second array based on a value sequence of the node identifiers of the nodes; and the method further includes the following: when a target node corresponding to an outgoing edge of a first node is queried, based on a node identifier of the first node, a first location index of the first target node of the first node is acquired from the second array, and a second location index of the first target node of a second node is acquired from the second array; where a node identifier of the second node is 1 greater than the node identifier of the first node; a first index set is determined based on the first location index and the second location index; where the first index set includes each index between the first location index and the second location index, and does not include the second location index; and an identifier of a target node corresponding to each index in the first index set is acquired from the first array, and the identifier of the target node is used as an identifier of each target node corresponding to the outgoing edge of the first node.
For example, referring to
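As an illustrative, non-limiting sketch, the query of target nodes corresponding to outgoing edges can be written as follows. The array contents describe a hypothetical 5-node graph with node identifiers starting from 0, and the closing entry at the end of the second array is an assumption that bounds the range of the last node.

```python
# Hypothetical contents for a 5-node graph with edges
# 0->1, 0->2, 1->4, 2->3 and 3->4.
FIRST_ARRAY = [1, 2, 4, 3, 4]       # node identifiers of target nodes
SECOND_ARRAY = [0, 2, 3, 4, 5, 5]   # location index of each node's first target


def outgoing_targets(node_id, first_array, second_array):
    """Return the target nodes of the outgoing edges of node_id."""
    first_location_index = second_array[node_id]
    # The node whose identifier is 1 greater marks where the range ends.
    second_location_index = second_array[node_id + 1]
    # Half-open range: the second location index itself is excluded.
    first_index_set = range(first_location_index, second_location_index)
    return [first_array[i] for i in first_index_set]


for node in range(5):
    print(node, outgoing_targets(node, FIRST_ARRAY, SECOND_ARRAY))
```

A node without outgoing edges has equal first and second location indexes, so the first index set is empty and no separate check is needed.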
In some examples, the second mapping relationship between the identifier of each node and the node identifier of the incoming edge-connected node of the node is stored in the compressed sparse column format, including: a node identifier of each start node is stored in a third array; where node identifiers of start nodes corresponding to the same node are continuously arranged; and a location index of the first start node of the same node in the third array is stored in a fourth array.
In the examples, the above-mentioned second mapping relationship is stored by using the third array and the fourth array.
Further, node identifiers of nodes in the relationship network graph are sequentially incremented by 1 from an initial value, and location indexes corresponding to different nodes are stored in the fourth array based on a value sequence of the node identifiers of the nodes; and the method further includes the following: when a start node corresponding to an incoming edge of a first node is queried, based on a node identifier of the first node, a third location index of the first start node of the first node is acquired from the fourth array, and a fourth location index of the first start node of a second node is acquired from the fourth array; where a node identifier of the second node is 1 greater than the node identifier of the first node; a second index set is determined based on the third location index and the fourth location index; where the second index set includes each index between the third location index and the fourth location index, and does not include the fourth location index; and an identifier of a start node corresponding to each index in the second index set is acquired from the third array, and the identifier of the start node is used as an identifier of each start node corresponding to the incoming edge of the first node.
For example, referring to
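As an illustrative, non-limiting sketch, the query of start nodes corresponding to incoming edges mirrors the outgoing-edge query, with the third array and the fourth array taking the place of the first array and the second array. The array contents describe a hypothetical 5-node graph, and the closing entry at the end of the fourth array is an assumption of this sketch.

```python
# Hypothetical contents for a 5-node graph with edges
# 0->1, 0->2, 1->4, 2->3 and 3->4.
THIRD_ARRAY = [0, 0, 2, 1, 3]       # node identifiers of start nodes
FOURTH_ARRAY = [0, 0, 1, 2, 3, 5]   # location index of each node's first start node


def incoming_sources(node_id, third_array, fourth_array):
    """Return the start nodes of the incoming edges of node_id."""
    third_location_index = fourth_array[node_id]
    fourth_location_index = fourth_array[node_id + 1]
    # Half-open range: the fourth location index itself is excluded.
    second_index_set = range(third_location_index, fourth_location_index)
    return [third_array[i] for i in second_index_set]


for node in range(5):
    print(node, incoming_sources(node, THIRD_ARRAY, FOURTH_ARRAY))
```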
Subsequently, in step 33, the set of attribute information in the relationship network graph is acquired, where the set of attribute information includes several node attributes, several edge attributes, and/or several pieces of temporary information. It can be understood that the temporary information can be an intermediate result generated in a data analysis process.
In the embodiments of this specification, according to a specific application scenario, the set of attribute information can have all of the node attributes, the edge attributes, and the temporary information, or the set of attribute information can have only the node attributes and the temporary information but no edge attributes, or the set of attribute information can have only the edge attributes and the temporary information but no node attributes. There can be many specific cases, and details are omitted here for simplicity.
Finally, in step 34, each attribute value of the same attribute in the set of attribute information is stored in continuous space by means of column storage. It can be understood that the same attribute corresponds to one column or one field in a data table.
In the embodiments of this specification, the node attributes, the edge attributes, and the temporary information in a computing process are managed by using column storage-based structured fusion. The compact arrangement of each attribute in the column storage-based structured fusion not only improves memory access efficiency but also facilitates management.
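As an illustrative, non-limiting sketch, node attributes, edge attributes, and temporary information can be held in one structure in which each attribute is one column. The class name ColumnStore, the attribute names node_score and visited, and their values are hypothetical. The standard array module is used here because it keeps numeric values in one contiguous buffer; other value types fall back to a Python list in this sketch.

```python
from array import array


class ColumnStore:
    """One column per attribute; index i of a column describes item i."""

    def __init__(self):
        self._columns = {}

    def add_column(self, name, values, typecode=None):
        if typecode is not None:
            # Numeric values are packed into one contiguous buffer.
            self._columns[name] = array(typecode, values)
        else:
            # Fallback for non-numeric values in this sketch.
            self._columns[name] = list(values)

    def column(self, name):
        return self._columns[name]


store = ColumnStore()
# Hypothetical node attribute: one value per node, indexed by node identifier.
store.add_column("node_score", [0.5, 0.1, 0.9, 0.3, 0.7], typecode="d")
# Edge attribute: one value per connecting edge.
store.add_column("edge_c1", [10, 20, 40], typecode="q")
# Temporary information generated during analysis: one flag per node.
store.add_column("visited", [0] * 5, typecode="b")

store.column("visited")[2] = 1          # mark node 2 during a traversal
total_c1 = sum(store.column("edge_c1"))  # scan touches only one column
print(total_c1, list(store.column("visited")))
```

A computation that reads a single attribute, such as the sum above, scans one column and does not touch the values of the other attributes.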
In some examples, the storing in continuous space by means of column storage includes the following: indication information indicating whether to perform storage in a disk is extracted based on configuration information of a target attribute; if the indication information indicates to perform storage in a disk, each attribute value of the target attribute is stored in continuous space of the disk; or if the indication information indicates not to perform storage in a disk, each attribute value of the target attribute is stored in continuous space of a memory.
In the examples, some attribute values can be stored in a disk to reduce memory occupation in the graph analysis process.
In the embodiments of this specification, a user can add configuration information for each attribute through a definition mechanism similar to a table schema. As some examples, Table 1 shows configuration information of two edge attributes.
Referring to Table 1, the two edge attributes are named c1 and c2, their types are an integer type (int) and a string type (string), and their indication information is false and true, respectively, which indicates that cold storage is not supported for c1 and is supported for c2. Cold storage is storage in a disk, which can reduce memory occupation in the graph analysis process.
In the embodiments of this specification, a graph analysis task stores each attribute value of each attribute in one piece of continuous space of a disk or one piece of continuous space of a memory based on the configuration information. As some examples, Table 2 shows attribute values of the two edge attributes corresponding to different connecting edges.
Referring to Table 2, src represents a start node of a connecting edge, and dst represents a target node of a connecting edge. Corresponding to a connecting edge from node 0 to node 1, an attribute value of the edge attribute c1 is 10, and an attribute value of the edge attribute c2 is red. Corresponding to a connecting edge from node 0 to node 2, an attribute value of the edge attribute c1 is 20, and an attribute value of the edge attribute c2 is white. It can be understood that one attribute corresponds to one column in a data table, and column storage can be used to improve memory access efficiency in data analysis.
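As an illustrative, non-limiting sketch, the configuration information of the two edge attributes can be expressed as a small schema, and the indication information can be used to decide where a column is stored. The class name AttributeConfig, the field name cold, the function name store_column, and the file layout (one value per line in a single file) are assumptions of this sketch.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class AttributeConfig:
    name: str
    type: str
    cold: bool  # hypothetical field name for the indication information


# Configuration of the two edge attributes c1 and c2.
SCHEMA = [
    AttributeConfig(name="c1", type="int", cold=False),
    AttributeConfig(name="c2", type="string", cold=True),
]

VALUES = {"c1": [10, 20, 40], "c2": ["red", "white", "green"]}


def store_column(config, values, directory, memory_columns):
    """Store one column on disk or in memory; return the file path or None."""
    if config.cold:
        # All values of the attribute are written one after another to one file.
        path = os.path.join(directory, config.name + ".col")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(str(value) for value in values))
        return path
    memory_columns[config.name] = list(values)
    return None


memory_columns = {}
workdir = tempfile.mkdtemp()
locations = {
    config.name: store_column(config, VALUES[config.name], workdir, memory_columns)
    for config in SCHEMA
}
print(locations, memory_columns)
```

In this sketch, c1 remains in memory and c2 is written to a file, so that only the attribute that does not support cold storage occupies memory during analysis.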
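As an illustrative, non-limiting sketch, rows of the kind shown in Table 2 can be transposed into one column per attribute. The third row (the connecting edge from node 1 to node 4) is taken from the earlier description of the edge attributes. Sorting the rows by src and then by dst, so that a row position matches the position of the connecting edge in an outgoing-edge array, is an assumption of this sketch, as are the function and variable names.

```python
# Rows in the form (src, dst, c1, c2).
ROWS = [
    (0, 2, 20, "white"),
    (1, 4, 40, "green"),
    (0, 1, 10, "red"),
]

# Assumed ordering: by start node, then by target node.
rows = sorted(ROWS, key=lambda row: (row[0], row[1]))

# Transpose the rows so that each attribute becomes its own column.
src_col, dst_col, c1_col, c2_col = (list(column) for column in zip(*rows))


def edge_position(src, dst):
    """Return the row position of the connecting edge from src to dst."""
    for position, (s, d) in enumerate(zip(src_col, dst_col)):
        if s == src and d == dst:
            return position
    raise KeyError((src, dst))


position = edge_position(0, 2)
print(c1_col[position], c2_col[position])
print(sum(c1_col))  # aggregates over c1 without reading c2
```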
In some examples, the method further includes the following: in a process of performing data analysis on the relationship network graph, a node identifier of an outgoing edge-connected node of a first node is acquired based on the first mapping relationship, or a node identifier of an incoming edge-connected node of a second node is acquired based on the second mapping relationship.
In the examples, because both outgoing-edge information of a node and incoming-edge information of the node are stored, the outgoing-edge information or the incoming-edge information can be flexibly selected in a data analysis process so as to increase an edge traversal speed.
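As an illustrative, non-limiting sketch, the same traversal routine can follow outgoing edges or incoming edges depending on which pair of arrays is passed in. The breadth-first traversal, the function names, and the array contents (a hypothetical 5-node graph with a closing entry at the end of each offset array) are assumptions of this sketch; the visited flags are an instance of temporary information.

```python
from collections import deque

# Hypothetical 5-node graph with edges 0->1, 0->2, 1->4, 2->3 and 3->4.
OUT_TARGETS = [1, 2, 4, 3, 4]
OUT_OFFSETS = [0, 2, 3, 4, 5, 5]
IN_SOURCES = [0, 0, 2, 1, 3]
IN_OFFSETS = [0, 0, 1, 2, 3, 5]


def neighbors(node, ids, offsets):
    """Return the identifiers stored for node in the given pair of arrays."""
    return ids[offsets[node]:offsets[node + 1]]


def breadth_first(start, ids, offsets):
    """Visit nodes reachable from start along the given edge direction."""
    visited = [False] * (len(offsets) - 1)  # temporary information per node
    visited[start] = True
    order = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other in neighbors(node, ids, offsets):
            if not visited[other]:
                visited[other] = True
                order.append(other)
                queue.append(other)
    return order


# Forward traversal uses the outgoing-edge arrays.
forward = breadth_first(0, OUT_TARGETS, OUT_OFFSETS)
# Backward traversal uses the incoming-edge arrays.
backward = breadth_first(4, IN_SOURCES, IN_OFFSETS)
print(forward, backward)
```

Without the incoming-edge arrays, the backward traversal would have to scan every connecting edge to find the start nodes of a given node.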
According to the methods provided in the embodiments of this specification, first, connection relationship information between any two nodes in a relationship network graph is acquired; then, based on the connection relationship information, a first mapping relationship between an identifier of each node in the relationship network graph and a node identifier of an outgoing edge-connected node of the node is stored in a compressed sparse row format, and a second mapping relationship between the identifier of each node and a node identifier of an incoming edge-connected node of the node is stored in a compressed sparse column format; subsequently, a set of attribute information in the relationship network graph is acquired, where the set of attribute information includes several node attributes, several edge attributes, and/or several pieces of temporary information; and finally, each attribute value of the same attribute in the set of attribute information is stored in continuous space by means of column storage. It can be seen from the above-mentioned description that, in the embodiments of this specification, outgoing-edge information and incoming-edge information of a node are respectively stored in the compressed sparse row format and the compressed sparse column format so as to compress as much space as possible, and increase an edge traversal speed. In addition, unlike common application of the compressed sparse row format and compressed sparse column format, the two formats are used only to store basic information, that is, node identifiers. Other information such as vertex attributes, edge attributes, and temporary information in a graph analysis process is structurally fused by means of column storage so as to improve memory access efficiency of the attributes. In summary, efficient graph data management can be implemented, thereby improving graph analysis performance.
According to some embodiments of another aspect, an apparatus for storing graph data of a relationship network graph is further provided. The relationship network graph includes a directed connecting edge between nodes, and the apparatus is configured to execute the methods provided in the embodiments of this specification.
Optionally, as some embodiments, the connection relationship information includes one of the following: an adjacency matrix or an adjacency table.
Optionally, as some embodiments, the first storage unit 62 includes the following: a first storage subunit, configured to store a node identifier of each target node in a first array; where node identifiers of target nodes corresponding to the same node are continuously arranged; and a second storage subunit, configured to store, in a second array, a location index of the first target node of the same node in the first array obtained by the first storage subunit.
Further, node identifiers of nodes in the relationship network graph are sequentially incremented by 1 from an initial value, and location indexes corresponding to different nodes are stored in the second array based on a value sequence of the node identifiers of the nodes; and the apparatus further includes the following: a first query unit, configured to: when a target node corresponding to an outgoing edge of a first node is queried, based on a node identifier of the first node, acquire a first location index of the first target node of the first node from the second array, and acquire a second location index of the first target node of a second node from the second array; where a node identifier of the second node is 1 greater than the node identifier of the first node; determine a first index set based on the first location index and the second location index; where the first index set includes each index between the first location index and the second location index, and does not include the second location index; and acquire, from the first array, an identifier of a target node corresponding to each index in the first index set, and use the identifier of the target node as an identifier of each target node corresponding to the outgoing edge of the first node.
Optionally, as some embodiments, the first storage unit 62 includes the following: a third storage subunit, configured to store a node identifier of each start node in a third array; where node identifiers of start nodes corresponding to the same node are continuously arranged; and a fourth storage subunit, configured to store, in a fourth array, a location index of the first start node of the same node in the third array.
Further, node identifiers of nodes in the relationship network graph are sequentially incremented by 1 from an initial value, and location indexes corresponding to different nodes are stored in the fourth array based on a value sequence of the node identifiers of the nodes; and the apparatus further includes the following: a second query unit, configured to: when a start node corresponding to an incoming edge of a first node is queried, based on a node identifier of the first node, acquire a third location index of the first start node of the first node from the fourth array, and acquire a fourth location index of the first start node of a second node from the fourth array; where a node identifier of the second node is 1 greater than the node identifier of the first node; determine a second index set based on the third location index and the fourth location index; where the second index set includes each index between the third location index and the fourth location index, and does not include the fourth location index; and acquire, from the third array, an identifier of a start node corresponding to each index in the second index set, and use the identifier of the start node as an identifier of each start node corresponding to the incoming edge of the first node.
Optionally, as some embodiments, the second storage unit 64 includes the following: an extraction subunit, configured to extract, based on configuration information of a target attribute, indication information indicating whether to perform storage in a disk; a fifth storage subunit, configured to, if the indication information obtained by the extraction subunit indicates to perform storage in a disk, store each attribute value of the target attribute in continuous space of the disk; or a sixth storage subunit, configured to, if the indication information obtained by the extraction subunit indicates not to perform storage in a disk, store each attribute value of the target attribute in continuous space of a memory.
Optionally, as some embodiments, the apparatus further includes the following: an analysis unit, configured to, in a process of performing data analysis on the relationship network graph, acquire a node identifier of an outgoing edge-connected node of a first node based on the first mapping relationship, or acquire a node identifier of an incoming edge-connected node of a second node based on the second mapping relationship.
According to the apparatuses provided in the embodiments of this specification, first, the first acquisition unit 61 acquires connection relationship information between any two nodes in a relationship network graph; then, based on the connection relationship information, the first storage unit 62 stores a first mapping relationship between an identifier of each node in the relationship network graph and a node identifier of an outgoing edge-connected node of the node in a compressed sparse row format, and stores a second mapping relationship between the identifier of each node and a node identifier of an incoming edge-connected node of the node in a compressed sparse column format; subsequently, the second acquisition unit 63 acquires a set of attribute information in the relationship network graph, where the set of attribute information includes several node attributes, several edge attributes, and/or several pieces of temporary information; and finally, the second storage unit 64 stores each attribute value of the same attribute in the set of attribute information in continuous space by means of column storage. It can be seen from the above-mentioned description that, in the embodiments of this specification, outgoing-edge information and incoming-edge information of a node are respectively stored in the compressed sparse row format and the compressed sparse column format so as to compress as much space as possible, and increase an edge traversal speed. In addition, unlike common application of the compressed sparse row format and compressed sparse column format, the two formats are used only to store basic information, that is, node identifiers. Other information such as vertex attributes, edge attributes, and temporary information in a graph analysis process is structurally fused by means of column storage so as to improve memory access efficiency of the attributes. 
In summary, efficient graph data management can be implemented, thereby improving graph analysis performance.
According to some embodiments of another aspect, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program, and when the computer program is executed in a computer, the computer is enabled to perform the method described with reference to
According to some embodiments of still another aspect, a computing device is further provided, including a storage and a processor. The storage stores executable code, and when executing the executable code, the processor implements the method described with reference to
A person skilled in the art should be aware that in the above-mentioned one or more examples, functions described in this application can be implemented by hardware, software, firmware, or any combination thereof. When implemented by using software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or one or more pieces of code on a computer-readable medium.
The above-mentioned specific implementations further describe in detail the objectives, technical solutions, and beneficial effects of this application. It should be understood that the descriptions above are merely specific implementations of this application and are not intended to limit the protection scope of this application. Any modifications, equivalent replacements, or improvements made on the basis of the technical solutions of this application shall fall within the protection scope of this application.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202311788004.6 | Dec 2023 | CN | national |