This application claims priority under 35 U.S.C. § 119 or 365 to European Application No. 22306928.7, filed Dec. 16, 2022. The entire contents of the above application are incorporated herein by reference.
The disclosure relates to the field of computer programs and systems, and more specifically to a method, system and program for storing RDF graph data in a graph database comprising a set of RDF tuples.
A number of systems and programs are offered on the market for the design, the engineering and the manufacturing of objects. CAD is an acronym for Computer-Aided Design, e.g., it relates to software solutions for designing an object. CAE is an acronym for Computer-Aided Engineering, e.g., it relates to software solutions for simulating the physical behavior of a future product. CAM is an acronym for Computer-Aided Manufacturing, e.g., it relates to software solutions for defining manufacturing processes and operations. In such computer-aided design systems, the graphical user interface plays an important role as regards the efficiency of the technique. These techniques may be embedded within Product Lifecycle Management (PLM) systems. PLM refers to a business strategy that helps companies to share product data, apply common processes, and leverage corporate knowledge for the development of products from conception to the end of their life, across the concept of the extended enterprise. The PLM solutions provided by Dassault Systèmes (under the trademarks CATIA, ENOVIA and DELMIA) provide an Engineering Hub, which organizes product engineering knowledge, a Manufacturing Hub, which manages manufacturing engineering knowledge, and an Enterprise Hub which enables enterprise integrations and connections into both the Engineering and Manufacturing Hubs. Altogether, the system delivers an open object model linking products, processes, and resources to enable dynamic, knowledge-based product creation and decision support that drives optimized product definition, manufacturing preparation, production and service.
Furthermore, a number of database management solutions are offered for applying the above design systems and programs to in-memory databases, i.e., purpose-built databases that rely primarily on memory for data storage, in contrast to databases that store data on disk or SSDs. Among such database management solutions, the solutions related to graph databases, for example RDF graph databases, are of particular interest due to their great flexibility in data modeling and data storage. In general applications, RDF graph databases need to be capable of handling large datasets of billions of tuples with sizes of terabytes (e.g., the Microsoft Academic Knowledge Graph, retrieved May 10, 2022, from makg.org/rdf-dumps, with over 8 billion triples requiring 1.2 TB of storage in the standard TTL format). Known solutions in the art propose data compression techniques:
(en.wikipedia.org/wiki/Data_compression) to reduce the required size of storage (e.g., in a memory or on a disk) for such graphs. Such compression techniques serve to reduce storage cost (e.g., hardware cost) and environmental footprint in various applications, notably in cloud deployment.
Document ÁLVAREZ-GARCÍA, S., et al., “Compressed vertical partitioning for efficient RDF management.”, Knowledge and Information Systems, 2015, vol. 44, no 2, p. 439-474, discloses an RDF indexing technique that supports SPARQL query resolution in compressed space. The disclosed technique, called k2-triples, uses the predicate to vertically partition the dataset into disjoint subsets of pairs (subject, object), one per predicate. These subsets are represented as binary matrices of subjects×objects in which 1-bits mean that the corresponding triple exists in the dataset. This model results in very sparse matrices, which are efficiently compressed using k2-trees.
Document CHATTERJEE, A., et al., “Exploiting topological structures for graph compression based on quadtrees.”, In: 2016 Second International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN). IEEE, 2016. p. 192-197, discloses algorithms that take into consideration the properties of graphs, and perform compression based on quadtrees. Furthermore, techniques to both compress data and also perform queries on the compressed data itself are introduced and discussed in detail.
Document NELSON, M., et al., “Queryable compression on streaming social networks.”, In: 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017. p. 988-993, discloses the use of a novel data structure for streaming graphs that is based on an indexed array of compressed binary trees that builds the graph directly without using any temporary storage structures. The data structure provides fast access methods for edge existence (does an edge exist between two nodes?), neighbor queries (list a node's neighbors), and streaming operations (add/remove nodes/edges).
Within this context, there is still a need for an improved method for storing RDF graph data in a graph database comprising a set of RDF tuples.
It is therefore provided a computer-implemented method of storing RDF graph data in a graph database comprising a set of RDF tuples. The method comprises obtaining one or more adjacency matrices wherein each adjacency matrix represents a group of tuples of the graph database comprising a same predicate. The method further comprises storing, for each of the one or more adjacency matrices, a data structure comprising an array. The array comprises one or more indices each pointing to a sub-division of the adjacency matrix, and/or one or more elements each representing a group of tuples of the RDF graph database of a respective sub-division of the adjacency matrix.
The method may comprise one or more of the following:
It is further provided a computer program comprising instructions for performing the method.
It is further provided a computer readable storage medium having recorded thereon the computer program.
It is further provided a system comprising a processor coupled to a memory, the memory having recorded thereon the computer program.
Non-limiting examples will now be described in reference to the accompanying drawings, where
With reference to the flowchart of
Such a method provides an improved solution for storing an RDF graph database, firstly by obtaining one or more adjacency matrices representing the tuples of the database. Specifically, the tuples may be triples as each adjacency matrix may represent only one graph. Such an obtention of the one or more adjacency matrices constitutes a vertical partitioning of said database. As known per se, an adjacency matrix is a binary matrix (i.e., a matrix whose elements take two values, e.g., 0 and 1) of a size related to the number of subjects, predicates, and/or objects, in which 1-bits mean that the corresponding triple (e.g., of a respective predicate for the adjacency matrix) exists in the RDF dataset. The size of the adjacency matrix for a predicate may be the number of subjects times the number of objects of RDF tuples for said predicate. The method, secondly, provides an improved solution by storing each of said one or more adjacency matrices as a data structure comprising an array of one or more indices, where each index points to a sub-division of the adjacency matrix, and/or one or more elements each representing a group of tuples of the RDF graph database of a respective sub-division of the adjacency matrix. Such an array enables the method to store data related to one or more sub-divisions of each adjacency matrix. Thereby, the method improves storing RDF graph data in a graph database by storing a data structure in said database which efficiently compresses adjacency matrices. By “storing RDF graph data in a graph database” it is meant storing data modeled as RDF graph(s) (i.e., an RDF dataset) in a graph database. RDF graphs may be one or more graphs of data which represent/model respective data. Databases and RDFs are further explained hereinbelow. On the other hand, the method provides such a capability while allowing a dynamic database (i.e., one where read/write operations are possible on the stored database), in contrast to methods which are merely designed for static, i.e., read-only, storage.
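For illustration, a minimal sketch of such a vertical partitioning is given below, assuming already dictionary-encoded integer terms and a hypothetical Triple record (the type and function names are illustrative assumptions, not part of this disclosure):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical integer-encoded triple (terms already dictionary-encoded).
struct Triple { std::uint32_t subject, predicate, object; };

// One adjacency matrix per predicate, sketched here as a sparse list of
// (subject, object) coordinates; the listed pairs are the 1-bits of the matrix.
using SparseMatrix = std::vector<std::pair<std::uint32_t, std::uint32_t>>;

std::map<std::uint32_t, SparseMatrix> verticalPartition(const std::vector<Triple>& triples) {
    std::map<std::uint32_t, SparseMatrix> byPredicate;
    for (const Triple& t : triples)
        byPredicate[t.predicate].emplace_back(t.subject, t.object);  // set bit (s, o)
    return byPredicate;
}
```

Each per-predicate list of (subject, object) pairs corresponds to the 1-bits of the adjacency matrix for that predicate.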
In examples where the method presents a tree-like data structure, as discussed above, each intermediate node may be a 2-dimensional fixed-size array. Such fixed-size 2D arrays respective to the intermediate nodes may form at least partially said array comprised in the data structure. More specifically, such fixed-size 2D arrays may comprise one or more indices each pointing to a sub-division of the adjacency matrix. By “database” it is meant any collection of data (i.e., information) organized for search and retrieval (e.g., a graph-oriented database). As known in the art, a graph-oriented database is an object-oriented database using graph theory, therefore with nodes and arcs, allowing data to be represented and stored. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority, contrary to other database models (e.g., relational databases) that link the data by implicit connections. When stored on a memory (e.g., a persistent memory), the graph database allows a rapid search and retrieval by a computer. Especially, graph databases are structured for fast retrieval, modification, and deletion of relationships in conjunction with various data-processing operations. A graph-oriented database is also referred to as a graph database; the expressions “graph-oriented database” and “graph database” are synonymous.
In examples, the graph database may be an RDF graph database. RDF graphs are a traditional data model used for the storage and retrieval of graphs. An RDF graph is a directed, labeled graph data format. Such a format is widely used for representing information on the Web. A standard specification has been published by W3C to specify the RDF representation of information as graphs, see for example “RDF 1.1 Concepts and Abstract Syntax”, W3C Recommendation 25 Feb. 2014 (or additionally the draft version RDF-star). An RDF graph database may have billions of tuples; one example is the Uniprot dataset, a resource of protein sequence and functional information.
The core structure of the abstract syntax used is a set of tuples, each comprising a predicate. A set of such RDF tuples is called an RDF graph.
In examples, an RDF tuple may comprise three or four elements comprising nodes and edges. In examples, each RDF tuple (or elements of each RDF tuple) may be a triple comprising a subject, a predicate, and an object. In such examples, an RDF graph may be visualized as a node and a directed-arc diagram, in which each triple is represented as a node-arc-node link. Alternatively, an RDF triple may be visualized by two nodes, which are the subject and the object and an arc connecting them, which is the predicate.
In examples, the RDF tuple may be an RDF quad. An RDF quad may be obtained by adding a graph label to an RDF triple. In such examples, an RDF tuple includes the RDF graph. A standard specification has been published by W3C to specify RDF Quads (also referred to as N-Quads), see for example “RDF 1.1 N-Quads, A line-based syntax for RDF datasets”, W3C Recommendation 25 Feb. 2014. An RDF quad may be obtained by adding a graph name to an RDF triple. A graph name may be either empty (i.e., for a default or unnamed graph) or an IRI (i.e., a graph IRI). In examples, a predicate of the graph may have a same IRI as the graph IRI. The graph name of each quad is the graph that the quad is part of in a respective RDF dataset. An RDF dataset, as known per se (e.g., see www.w3.org/TR/rdf-sparql-query/#rdfDataset), represents a collection of graphs. Hereinafter, the term RDF tuple (or tuple) indifferently refers to an RDF triple or an RDF quad, unless the use of one or the other is explicitly mentioned.
Possible optimizations for a query engine of a graph database are impacted by the assumption that the graph database is interacting with an Open World or a Closed World. As known per se, in a formal system of logic used for knowledge representation, the open-world assumption (OWA) is the assumption that the truth value of a statement may be true irrespective of whether or not it is known to be true. It is the opposite of the closed-world assumption, which holds that any statement that is true is also known to be true. On the other hand, Closed World Systems require a place to put everything (e.g., a slot on a frame, a field on an OO class, or a column in a DB). The OWA assumes incomplete information by default, which intentionally underspecifies and allows others to reuse and extend the data. The Semantic Web is a vision of a computer-understandable web, with distributed knowledge and data in a reusable form, and RDF, the W3C recommendation for the Semantic Web, follows the Open World Assumption. It allows a greater flexibility in data modeling and data storage. Yet the constraints of a Closed World Assumption, as in the relational model with SQL, are useful for query optimizations since they provide more information on how the data is stored. In examples, the query is a SPARQL query. SPARQL is the W3C recommendation for querying RDF data and is a graph-matching language built on top of patterns of RDF tuples. By a “pattern of RDF tuples” it is meant a pattern/template formed by an RDF graph. In other words, a pattern of RDF tuples is an RDF graph (i.e., a set of RDF triples) where the subject, predicate, object, or label of the graph can be replaced by a variable (for a query). SPARQL is a query language for RDF data able to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL is mainly based on graph homomorphism. A graph homomorphism is a mapping between two graphs that respects their structure. More concretely, it is a function between the vertex sets of two graphs that maps adjacent vertices to adjacent vertices.
SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports aggregation, subqueries, negation, creating values by expressions, extensible value testing, and constraining queries by source RDF graph. This means SPARQL queries need to answer the eight different triple patterns possible in SPARQL. Such eight triple patterns include (S, P, O), (S, ?P, O), (S, P, ?O), (S, ?P, ?O), (?S, P, O), (?S, ?P, O), (?S, P, ?O), and (?S, ?P, ?O), in which variables are preceded in the pattern by the symbol ?. Variables are the output of triple patterns and may be the output of the SPARQL query. In some examples, variables may be the output of a SELECT query. The output of a SPARQL query may be built using the variables (e.g., aggregators like summation). Variables in a query may be used to build a graph homomorphism (i.e., intermediary nodes necessary to get the result of the query). In some examples, variables in a query may be neither used for output nor as intermediary results. A Basic Graph Pattern (BGP) may be one of the eight triple patterns explained above. Additionally, a BGP may be a quad pattern, by additionally having the label of the graph as a query variable. In particular examples where the method obtains one or more adjacency matrices each as representations of groups of tuples, the subject and object may be queried on one adjacency matrix. In other words, in these particular examples, the BGPs may be any of (S,O), (S,?O), (?S,O), and (?S,?O). SPARQL may build more complex queries by joining the results of several BGPs and possibly other operators. Thus, competitive SPARQL engines require, at least, fast triple pattern resolution and efficient join methods. Additionally, query optimizers are required to build efficient execution plans that minimize the number of intermediate results to be joined in the BGP.
In examples, the graph database has an existing triple store. A triple store (also referred to as an RDF store) is a purpose-built database for the storage and retrieval of triples through semantic queries, as known in the art. A triple store can at least answer the eight basic triple patterns of SPARQL described above. It may also answer filtering constraints (e.g., “x>5”) along with the triple patterns. Such a triple store is considered to be the storage engine on which a SPARQL query is executed by a query engine. A storage engine (also called “database engine”) is an underlying software component that a database management system (DBMS) uses to Create, Read, Update and Delete (CRUD) data from a database, as known in the art.
Back to
In step S20, the method comprises storing, for each of the one or more adjacency matrices, a data structure comprising an array. In examples, said array of the stored data structure is a 2D array of size 2^n×2^n or a 1D array of size 2^(2n), n being a positive integer. In particular examples, the value of n may be equal to 2. This choice of n makes the nodes small enough to be easily aligned to the cache line size of modern processors, thereby enabling the data structure to fit into a cache line. Furthermore, such a choice provides algorithms (like Test, Set, Reset, etc. as discussed below) with a base-16 logarithmic complexity rather than the base-4 logarithmic complexity obtained for n=1.
In examples, the storing of the data structure may comprise allocating a slot, thereby obtaining a root index. Said root index represents a location of the tree data structure in a memory. The storing may further comprise setting the array at the root index to an array of coordinate pairs. In other words, and according to these examples, an empty data structure may be created (e.g., in a memory) with an empty root node of type V32 (which is explained below), while the location of this first node is represented by the “root index”, so as to enable the method to locate this first node in memory.
As discussed above, the stored array comprises one or more indices each pointing to a sub-division of the adjacency matrix, and/or one or more elements each representing a group of tuples of the RDF graph database of a respective sub-division of the adjacency matrix. In other words, the stored array may store indices which define a relative location of a sub-division of each adjacency matrix and/or a direct representation of a group of tuples of the RDF graph database of a respective sub-division of the adjacency matrix. In examples, each of the one or more indices may be of 32-bit size. In such examples, and when the value of the positive integer n discussed above is equal to 2, the stored array is of a maximal size of 2^n×2^n×32 bits=64 bytes. Such a stored array may be equivalently referred to as a node in a tree-like structure, which is discussed below.
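As an illustration of this sizing only, a minimal C++ sketch of such a 64-byte array of 32-bit indices for n=2 could be (the type and field names are illustrative assumptions, not part of this disclosure):

```cpp
#include <cstdint>

constexpr unsigned kN     = 2;                   // n = 2 as in the examples above
constexpr unsigned kWidth = 1u << kN;            // 2^n = 4
constexpr unsigned kSlots = kWidth * kWidth;     // 2^(2n) = 16 slots

// One intermediate node: 16 x 32-bit indices = 64 bytes, i.e., one cache line.
struct alignas(64) Node {
    std::uint32_t child[kSlots];                 // 0 (or another sentinel) = unallocated
};
static_assert(sizeof(Node) == 64, "node matches a 64-byte cache line");
```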
In examples, the database stored according to the method may represent a tree-like data structure configured to represent each adjacency matrix. In such examples, the data structure may comprise one or more nodes, each node being connected to one or more nodes and/or one or more leaves. The tree-like data structure may be (similar to) a quadtree data structure. A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are the two-dimensional analog of octrees and are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions (see en.wikipedia.org/wiki/Quadtree).
In examples, the method may further comprise implementing one or more of the following functions on the RDF graph database. Each of the following functions may be configured to apply on each adjacency matrix of the one or more adjacency matrices: a function (called “Test” function) which is configured to check if a given cell of the adjacency matrix is set to a first specified value, a function (called “Set” function) which is configured to set a given cell of the adjacency matrix to a first specified value, a function (called “Reset” function) which is configured to set a given cell of the adjacency matrix to a second specified value, a function (called “ScanAll” function) which is configured to output respective coordinates of all cells of the adjacency matrix with a first specified value, a function (called “ScanRow” function) which is configured to output respective coordinates of cells of a given row in the adjacency matrix with a first specified value, and/or a function (called “ScanColumn” function) which is configured to output respective coordinates of cells of a given column in the adjacency matrix with a first specified value.
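As an illustration only, the six functions above could be exposed through an interface such as the following sketch; the signatures and the callback-based scan style are assumptions, only the function names come from the examples above:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical interface over one adjacency matrix (one predicate); the
// callback receives the (row, column) coordinates of each matching cell.
class AdjacencyMatrix {
public:
    virtual bool test(std::uint32_t row, std::uint32_t col) const = 0;   // "Test"
    virtual void set(std::uint32_t row, std::uint32_t col) = 0;          // "Set"
    virtual void reset(std::uint32_t row, std::uint32_t col) = 0;        // "Reset"
    virtual void scanAll(const std::function<void(std::uint32_t, std::uint32_t)>& visit) const = 0;
    virtual void scanRow(std::uint32_t row,
                         const std::function<void(std::uint32_t, std::uint32_t)>& visit) const = 0;
    virtual void scanColumn(std::uint32_t col,
                            const std::function<void(std::uint32_t, std::uint32_t)>& visit) const = 0;
    virtual ~AdjacencyMatrix() = default;
};
```

Under such a sketch, and assuming subjects index rows and objects index columns, the BGPs discussed above would map naturally: (S,O) to Test, (S,?O) to ScanRow, (?S,O) to ScanColumn, and (?S,?O) to ScanAll.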
In examples, each of the one or more elements of the array comprises a set of one or more coordinate pairs wherein each pair represents respective coordinates in the adjacency matrix. Such a representation of the respective coordinates in the adjacency matrix may be a direct representation of said coordinates (e.g., in the V32 subtype) or an indirect representation of said coordinates (e.g., in V16 and V8, using a base and an offset with respect to said base).
Examples of memory layouts of an element discussed above are now discussed. By a “memory layout” or a “layout” it is meant an organized manner in which data are stored in a memory or in a file. As discussed herein below, each example (or kind) of memory layout may use indices of a different size from another example of memory layout, for example indices of 32, 16, or 8 bits. Each index may be associated to one coordinate in a pair of coordinates. In examples where the stored data structure represents a tree data structure (e.g., a quadtree data structure), each kind of memory layout may represent a kind of leaf node in said tree. Integrating different memory layouts (each with a different index size) in the stored database constitutes an improved solution for storing graph data in a graph database by adapting the stored data structure (or equivalently a leaf node of a corresponding tree) to the volume (i.e., size) of inserted data and therefore keeping good query time without sacrificing space efficiency. The method may set an ordering for storing the data structure according to an order, e.g., an optimized insertion order based on the Morton order (or Morton code/encoding) as known in the field. Insertion (i.e., the insertion of subject and object) according to the Morton encoding minimizes the number of modifications of intermediate nodes.
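For illustration, a minimal sketch of Morton (Z-order) encoding of a (row, column) pair is given below, assuming 32-bit coordinates; the helper names are hypothetical:

```cpp
#include <cstdint>

// Morton (Z-order) code: interleave the bits of row and column so that
// coordinates that are close in the matrix stay close in the key, which
// keeps insertions local to few intermediate nodes.
std::uint64_t mortonEncode(std::uint32_t row, std::uint32_t col) {
    auto spread = [](std::uint64_t v) {          // insert a 0 bit between each bit of v
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFull;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFull;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0Full;
        v = (v | (v << 2))  & 0x3333333333333333ull;
        v = (v | (v << 1))  & 0x5555555555555555ull;
        return v;
    };
    return (spread(row) << 1) | spread(col);     // row bits in odd, column bits in even positions
}
```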
In examples, an element of the one or more elements comprises a data structure of 32-bit indices with a layout, called V32 layout. Each 32-bit index is of a size of 32 bits. The size may be equivalently referred to as width, or bitfield width herein. By “a data structure of 32-bit indices” it is meant a data structure comprising a vector/array of a plurality of 32-bit integers each serving as an index. The V32 layout may in particular comprise one or more pairs of coordinates comprising a row and a column of the adjacency matrix. In such cases, each of the 32-bit indices is a coordinate in a pair of coordinates, i.e., an index represents a row or a column of the adjacency matrix. In yet other words, each row index is of size of 32 bits, and each column index is of size of 32 bits.
The V32 layout may be further defined by a fixed maximal capacity which indicates the maximum size of the data structure or equivalently the number of integers of the plurality. In examples, the size of the layout may be smaller than the size of the stored one or more indices (i.e., the size of the intermediate node), for example in the form of a (1D/2D) array as discussed above.
In some examples, the V32 layout may comprise a tag representing the size and/or the maximal size (corresponding to the fixed maximal size) of the data structure (i.e., the vector representing the data structure). The tag may further represent a type of the data structure (i.e., a V32 type). Alternatively, the method may determine the tag corresponding to the element. For example, in a tree-like structure the method may determine said tag based on the respective depth of a respective node to the element in the tree. Such alternatives constitute an even more improved solution by freeing the size assigned to store the tag in the V32 layout (e.g., in a header), thereby increasing the maximal capacity of the layout, which is equivalent to optimizing the required storage.
In examples of V32 data structure which is discussed below in reference to
In examples, an element of the one or more elements comprises a data structure of 16-bit indices with a layout, called V16 layout. Each 16-bit index is of a size of 16 bits. As discussed above, by “a data structure of 16-bit indices” it is meant a data structure comprising a vector/array of a plurality of 16-bit integers each serving as an index. The V16 layout may in particular comprise one or more pairs of coordinates comprising a base column, a base row, an offset column, and an offset row of the adjacency matrix. In such cases, each of the 16-bit indices represents one coordinate in a pair of coordinates, i.e., represents a base column, a base row, an offset column, or an offset row of the adjacency matrix. In yet other words, each of the base column index, the base row index, the offset column index, and the offset row index is of a size of 16 bits.
The V16 layout may be further defined by a fixed maximal capacity which indicates the maximum size of the data structure or equivalently the number of integers of the plurality. The maximal capacity of V16 layout may be same as the maximal capacity of V32 layout as discussed above.
In some examples, the V16 layout may comprise a tag representing the size and/or the maximal size (corresponding to the fixed maximal size) of the data structure. The tag may further represent a type of the data structure (i.e., a V16 type). Alternatively, the method may determine the tag corresponding to the element. For example, in a tree-like structure the method may determine said tag based on the respective depth of a respective node to the element in the tree. Such alternatives constitute improved solutions as discussed for V32 above.
In examples of V16 data structure which is discussed below in reference to
In examples, an element of the one or more elements comprises a data structure of 8-bit indices with a layout, called V8 layout. Each 8-bit index is of a size of 8 bits. As discussed above, by “a data structure of 8-bit indices” it is meant a data structure comprising a vector/array of a plurality of 8-bit integers each serving as an index. The V8 layout may in particular comprise one or more pairs of coordinates comprising a base column, a base row, an offset column, and an offset row of the adjacency matrix. For each leaf of a V8 layout there is one base row, one base column, and one or more offset row and offset column pairs. In such cases, each of the 8-bit indices represents one coordinate in a pair of coordinates, i.e., represents a base column, a base row, an offset column, or an offset row of the adjacency matrix. In yet other words, each of the base column index, the base row index, the offset column index, and the offset row index is of a size of 8 bits.
The V8 layout may be further defined by a fixed maximal capacity which indicates the maximum size of the data structure or equivalently the number of integers of the plurality. The maximal capacity of V8 layout may be same as the maximal capacity of V32 and V16 layouts as discussed above.
In some examples, the V8 layout may comprise a tag representing the size and/or the maximal size (corresponding to the fixed maximal size) of the data structure. The tag may further represent a type of the data structure (i.e., a V8 type). Alternatively, the method may determine the tag corresponding to the element. For example, in a tree-like structure the method may determine said tag based on the respective depth of a respective node to the element in the tree. Such alternatives constitute improved solutions as discussed for V32 and V16 above.
In examples of V8 data structure which is discussed below in reference to
Specifically, the V32, V16, and V8 layouts may be considered as variations of a same pattern with a same maximal capacity, i.e., V<nn> where nn is 32, 16, or 8. The parameter nn may also be presented simply as m. Values of nn other than 32, 16, and 8 may be used. V<nn> is a vector of coordinates, where one or more coordinate offsets are stored using nn bits and a base coordinate is stored using 32−nn bits. The 32-bit coordinates are extracted by combining the base and an offset (i.e., according to the bitwise operation (base<<nn)|offset). In examples of a V<nn> layout, given a row-column pair (each a 32-bit index), the lowest nn bits of the row and the column are stored as offsets, while the upper bits of the row and the column are shared (i.e., the base). In the case of the V32 layout, there is no need to store the upper bits because there are 0 bits to share.
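A minimal sketch of this base/offset packing, assuming 32-bit coordinates (the template and helper names are illustrative, not part of this disclosure):

```cpp
#include <cstdint>

// V<nn> coordinate packing: the low nn bits of a 32-bit coordinate are stored
// per pair as an offset, the upper 32-nn bits are shared once as the base.
template <unsigned NN>
struct Packed {
    static std::uint32_t base(std::uint32_t coord) {
        if constexpr (NN == 32) return 0;                 // V32: nothing to share
        else return coord >> NN;
    }
    static std::uint32_t offset(std::uint32_t coord) {
        if constexpr (NN == 32) return coord;
        else return coord & ((1u << NN) - 1u);
    }
    static std::uint32_t combine(std::uint32_t b, std::uint32_t off) {
        if constexpr (NN == 32) return off;
        else return (b << NN) | off;                      // (base << nn) | offset
    }
};
// Example: for V16, a row 0x0012ABCD has base 0x0012 and offset 0xABCD.
```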
In examples for a general V<nn> type, where nn is the number of bits of the indices stored in the data structure, said data structure may further comprise a tag representing the size, the maximal size, and/or a type of the data structure (i.e., a V<nn> type or the value of nn), similarly to the examples of V32, V16, and V8 discussed above. Alternatively, the method may determine said tag corresponding to the element. In examples (which are applicable to any of V32, V16, and V8 discussed above), the determining of the tag may be based on the pair of coordinates. In examples, the method may determine the tag by the size of the translated submatrix that can be held by the node and/or the number of pairs of coordinates that are to be stored in the node. For example, in a tree-like structure the method may determine said tag based on the respective depth of a respective node to the element in the tree (which may be related to the translated submatrix and/or the number of pairs of coordinates to be stored mentioned above). As discussed above, such alternatives constitute an even more improved solution by freeing the size assigned to store the tag (e.g., in a header), thereby increasing the maximal capacity of the layout, which is equivalent to optimizing the required storage.
In examples, which are combinable with any of the examples discussed above, the stored data structure may further comprise a bit array or bitmap representation of a group of RDF tuples in a respective sub-division of the adjacency matrix. As known per se, by a “bitmap” it is meant a mapping from one system such as integers to bits. It is also known as a bitmap index or a bit array. This constitutes an improved solution as the bitmap representation forms a representation of the one or more elements with a smaller size. Thus, the examples of the method comprising the bit array further optimize the memory cost.
The method is computer-implemented. This means that steps (or substantially all the steps) of the method are executed by at least one computer, or any system alike. Thus, steps of the method are performed by the computer, possibly fully automatically, or, semi-automatically. In examples, the triggering of at least some of the steps of the method may be performed through user-computer interaction. The level of user-computer interaction required may depend on the level of automatism foreseen and put in balance with the need to implement user's wishes. In examples, this level may be user-defined and/or pre-defined.
A typical example of computer-implementation of a method is to perform the method with a system adapted for this purpose. The system may comprise a processor coupled to a memory and a graphical user interface (GUI), the memory having recorded thereon a computer program comprising instructions for performing the method. The memory may also store a database. The memory is any hardware adapted for such storage, possibly comprising several physical distinct parts (e.g., one for the program, and possibly one for the database).
The client computer of the example comprises a central processing unit (CPU) 1010 connected to an internal communication BUS 1000, a random access memory (RAM) 1070 also connected to the BUS. The client computer is further provided with a graphical processing unit (GPU) 1110 which is associated with a video random access memory 1100 connected to the BUS. Video RAM 1100 is also known in the art as a frame buffer. A mass storage device controller 1020 manages accesses to a mass memory device, such as hard drive 1030. Mass memory devices suitable for tangibly embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; and magneto-optical disks. Any of the foregoing may be supplemented by, or incorporated in, specially designed ASICs (application-specific integrated circuits). A network adapter 1050 manages accesses to a network 1060. The client computer may also include a haptic device 1090 such as a cursor control device, a keyboard or the like. A cursor control device is used in the client computer to permit the user to selectively position a cursor at any desired location on display 1080. In addition, the cursor control device allows the user to select various commands, and input control signals. The cursor control device includes a number of signal generation devices for inputting control signals to the system. Typically, a cursor control device may be a mouse, the button of the mouse being used to generate the signals. Alternatively or additionally, the client computer system may comprise a sensitive pad, and/or a sensitive screen.
The computer program may comprise instructions executable by a computer, the instructions comprising means for causing the above system to perform the method. The program may be recordable on any data storage medium, including the memory of the system. The program may for example be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The program may be implemented as an apparatus, for example a product tangibly embodied in a machine-readable storage device for execution by a programmable processor. Method steps may be performed by a programmable processor executing a program of instructions to perform functions of the method by operating on input data and generating output. The processor may thus be programmable and coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. The application program may be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired. In any case, the language may be a compiled or interpreted language. The program may be a full installation program or an update program. Application of the program on the system results in any case in instructions for performing the method. The computer program may alternatively be stored and executed on a server of a cloud computing environment, the server being in communication across a network with one or more clients. In such a case a processing unit executes the instructions comprised by the program, thereby causing the method to be performed on the cloud computing environment.
Implementations of the hereinabove discussed examples of the method are now discussed.
The implementations are related to RDF databases which need to be able to handle very large datasets. RDF is the W3C standard to represent knowledge graphs. Knowledge graphs can have billions of triples (e.g., MAG, Uniprot). The explosion in the amount of available RDF data and consequently the size of graph databases justify a need to explore, query and understand such data sources.
Reducing the storage size of such graphs (in memory or on a disk) is important to reduce cost and environmental footprint in cloud deployment. That is why data compression (en.wikipedia.org/wiki/Data_compression) and more precisely, compression of graphs (i.e., compression of graphs data represented by RDF triples) is a key concept in exploiting the implementations.
The implementations involve using vertical partitioning to represent a graph. Using this scheme to represent RDF graphs leads to a representation where each predicate is itself a graph where subjects and objects are the nodes and the predicate the edges. These graphs can then be seen as n adjacency matrices. Therefore, compressing graphs means using a data structure to efficiently compress adjacency matrices. In this direction, the implementations further involve dictionary encoding (en.wikipedia.org/wiki/Dictionary_coder) where input values are encoded into integer keys as a first step to data compression. This means that the adjacency matrices manipulate integers instead of strings.
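As a hedged illustration of such dictionary encoding (the class and member names are hypothetical; any real implementation may differ), RDF terms could be mapped to integer keys as follows:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary encoder: each distinct RDF term (IRI or literal)
// is mapped to an integer key, so the adjacency matrices manipulate
// integers instead of strings.
class Dictionary {
public:
    std::uint32_t encode(const std::string& term) {
        auto it = keyOf_.find(term);
        if (it != keyOf_.end()) return it->second;
        std::uint32_t key = static_cast<std::uint32_t>(terms_.size());
        keyOf_.emplace(term, key);
        terms_.push_back(term);
        return key;
    }
    const std::string& decode(std::uint32_t key) const { return terms_[key]; }
private:
    std::unordered_map<std::string, std::uint32_t> keyOf_;
    std::vector<std::string> terms_;
};
```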
The implementations are specifically related to the field of data structures used to represent adjacency matrices of integers with the objective of compressing very big RDF graphs. This takes place in the field of space-efficient graph representations. In the implementations, the graph (i.e., the graph database) may be in particular dynamic which means that it can be modified and is not read-only. The stored database according to the implementations may be queried without decompressing the data stored in the database.
More specifically, the implementations provide a space-efficient representation of dynamic RDF graphs, represented as adjacency matrices of integers, that scales to very large matrices. This means that the elapsed time for traversals on graph databases stored as such does not deteriorate and that the representation is still space-efficient (compared to read-optimized data structures).
The implementations in particular disclose a data structure called XAMTree which takes place in the general field of quad-tree data structures (in the sense that it partitions a two-dimensional space by subdividing it into four quadrants) and shares some traits of Patricia Tries (en.wikipedia.org/wiki/Radix_tree#PATRICIA), like the K2-Tree as described in the article ÁLVAREZ-GARCÍA, S., et al., “Compressed vertical partitioning for efficient RDF management.”, Knowledge and Information Systems, 2015, vol. 44, no 2, p. 439-474. Quad-trees are interesting for efficient spatial search, and Patricia trees are interesting for the variability of their internal nodes.
A XAMTree according to the implementations is constituted of nodes which fan out across the two dimensions of an adjacency matrix (row and column). Each node is constituted of 2^n×2^n cells and consumes n bits across each of the two axes. The implementations provide several kinds of leaf nodes for a XAMTree, and an optimized structure of the tree in exploiting each kind according to the volume of inserted data. Therefore the implementations provide good query time without sacrificing space efficiency. Such an optimized insertion order (i.e., the order according to which each kind of leaf node should be used) is according to the Morton order (en.wikipedia.org/wiki/Z-order_curve). The implementations also take advantage of novel CPU instructions called AVX, used in relational query engines as for example described in KERSTEN, T., et al., “Everything you always wanted to know about compiled and vectorized queries but were afraid to ask.”, Proceedings of the VLDB Endowment, 2018, vol. 11, no 13, p. 2209-2222. This further improves the query elapsed time in the implementations.
The data structure according to the implementations is organized in such a way that it limits the number of memory areas modified by a write and respects the cache lines (to avoid cache misses). Furthermore, the XAMTree has an optimized insertion order (i.e., a Morton encoding) and rebuilding a XAMTree in its ideal optimized state is straightforward. Contrary to the dynamic K2Tree, the algorithmic complexities for a XAMTree have an upper bound of space usage at worst O(n log_m n), where n is the number of bits set in the adjacency matrix, and the order of insertions does not penalize the XAMTree compared to the dynamic K2. While basic access methods have the same worst-case algorithmic complexities as a static K2Tree, the topology of the adjacency matrix further improves the performance of a XAMTree data structure, as the XAMTree is not a balanced tree. Additionally, the XAMTree is less sensitive to the topology of the adjacency matrix than a dynamic K2Tree (DK2).
According to the implementations, a XAMTree is a data structure mixing a tree similar to a QuadTree, leaves storing vectors of coordinate pairs, and leaves representing a bitmap. As known (e.g., from Wikipedia, en.wikipedia.org/wiki/Quadtree), a QuadTree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. The data associated with a leaf cell varies by application, but the leaf cell represents a “unit of interesting spatial information”.
The implementations implement an adjacency matrix answering to the following functions, which are the requirements to use it to represent RDF graphs' data.
When “r,c” are the coordinates of an adjacency matrix represented by the QuadTree, setting a bit means the corresponding cell of the matrix has a value of “true” (and resetting a value of “false”):
The implementations further include several bitwise operations on an unsigned integer as listed in the following Table 1.
A XAMTree is a tree data structure composed of three kinds of nodes: intermediate (i.e., non-terminal) nodes, vector leaves or leaves with a vector of coordinates, and bitmap leaves or leaves with a bitmap. Each of the three kinds of nodes is discussed hereinbelow.
An intermediate node TN of a XAMTree is made of a 2d array of indices of size 2^n×2^n pointing to optional sub-elements. Each slot of this array may specify a sub-item. n must be a strictly positive integer. A TN may be represented as a 1d array of size 2^(2n) where the row and column relative to a TN can be encoded using the ((row*2^n)+column) or ((column*2^n)+row) convention. The implementations exploit a 1d array representation and encode the representation as ((row*2^n)+column).
Hereinbelow, n is a constant and is referred to as δ. Furthermore, the implementations use two constants, TNWIDTH as a shortcut for 2^δ (or equivalently the bitwise left shift 1<<δ) and TNSIZE as a shortcut for TNWIDTH^2 or 2^(2δ).
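For illustration, the constants and the cell-addressing convention could be sketched as follows for δ=2; the function name tnCell and the levelShift parameter are assumptions about how the coordinate bits are consumed, not mandated by the implementations:

```cpp
#include <cstdint>

constexpr unsigned kDelta   = 2;                    // δ
constexpr unsigned kTNWidth = 1u << kDelta;         // TNWIDTH = 2^δ
constexpr unsigned kTNSize  = kTNWidth * kTNWidth;  // TNSIZE = TNWIDTH^2 = 2^(2δ)

// Cell position inside a TN using the ((row * 2^δ) + column) convention.
// levelShift is the number of low-order coordinate bits consumed by deeper
// levels (an assumption about the traversal order, for illustration only).
constexpr unsigned tnCell(std::uint32_t row, std::uint32_t col, unsigned levelShift) {
    std::uint32_t r = (row >> levelShift) & (kTNWidth - 1u);
    std::uint32_t c = (col >> levelShift) & (kTNWidth - 1u);
    return (r << kDelta) + c;
}
```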
The implementations set δ=2. Such a value is an optimized value as it makes the nodes of the XAMTree small enough to be easily aligned to the cache line size of modern processors, while resulting in algorithms with a base-16 rather than base-4 logarithmic complexity (the latter being the case for δ=1). In variations, the implementations may set other values for δ.
The implementations set 32-bit indices for a TN to have nodes with a limited size of 64 bytes (512 bits=2^2×2^2×32) while still being able to create adjacency matrices with several billions of bits. In variations, the implementations may use 64-bit indices, which however leads to increased storage.
By convention, the implementations set the index to 0 in order to specify an unallocated node, meaning to indicate that there are no more sub-elements of the adjacency matrix with cells valued to “true” when the matrix is split (as known in the field, for example from the already cited document ÁLVAREZ-GARCÍA, S., et al., “Compressed vertical partitioning for efficient RDF management.”, Knowledge and Information Systems, 2015, vol. 44, no 2, p. 439-474). In variations, the implementations may set values other than 0 as the index value to specify an unallocated node. Hereinbelow, said value (either being 0 or other variations) is referred to as UNALLOCATED_NODE.
Leaves with a Vector of Coordinates (Vnn)
A vector leaf Vnn is an element of a XAMTree with the exact same size as an intermediate node TN described above. The implementations identify each Vnn leaf with an integer of a specific value at the beginning of the structure of the corresponding element of the Vnn leaf in the XAMTree. For example, the implementations may be set to identify Vnn leaves with a 32-bit integer of value 0xffffffff at the beginning of the corresponding structure of each Vnn leaf in the XAMTree. In variations, the implementations may use other arbitrary values. Hereinbelow, said arbitrary index value is referred to as VECTOR_LEAF. The implementations may set the value VECTOR_LEAF as an invalid index for the intermediate nodes TN.
Vector leaves are fixed maximal capacity adjacency matrices that fit into a tree node. In other words, Vnn leaves may store r,c pairs up to said maximal capacity. The implementations may set different layouts (i.e., memory layouts) for Vnn nodes. Each layout defines a subtype for vector leaves with a shared maximal capacity among the subtypes, imposed by the fact that the implementations require a Vnn to have an equal or smaller size than a TN (i.e., to be able to fit in the size of a TN).
The implementations define three subtypes as V32, V16, and V8 with a respective 32-, 16- or 8-bit index size (which, in combination with a shared maximal size for all subtypes, corresponds to a maximal capacity of r,c pairs to be stored in each subtype), and a current size (corresponding to the number of slots in use). According to the implementations, each subtype may have a base pair r,c and/or a vector of variable length depending on the subtype (resp. 64 bits, 32 bits and 16 bits for each of V32, V16, and V8). The length refers to the number of used row-column coordinates in the data structure. Said length varies according to the type of V<nn> data structure. Theoretically, said length varies between 1 and the maximal length of a Vnn. In other words, for a 64-byte leaf node size, it is up to 7 for V32, up to 13 for V16, and up to 24 or 25 for V8 depending on the implementation. The base pair r,c (or explicitly rbase, cbase) is used to define each pair of coordinates, for example in combination with an offset value (e.g., roffset, coffset) in the layout (e.g., using an operation comprising rbase+roffset, cbase+coffset).
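As an illustration of how the capacities cited above can arise, the following sketch assumes a 64-byte leaf and an 8-byte header (one possible header layout, not mandated by the implementations):

```cpp
#include <cstdint>

constexpr unsigned kLeafBytes   = 64;   // leaf has the same size as a TN
constexpr unsigned kHeaderBytes = 8;    // assumed: 32-bit tag + 32-bit current size

// Capacity in (r, c) pairs for a V<nn> leaf under these assumptions.
constexpr unsigned capacity(unsigned nn) {
    unsigned pairBytes = 2 * nn / 8;                          // one (r, c) pair uses 2*nn bits
    unsigned baseBytes = (nn == 32) ? 0 : 2 * (32 - nn) / 8;  // shared (rbase, cbase)
    return (kLeafBytes - kHeaderBytes - baseBytes) / pairBytes;
}
static_assert(capacity(32) == 7,  "V32: up to 7 pairs");
static_assert(capacity(16) == 13, "V16: up to 13 pairs");
static_assert(capacity(8)  == 25, "V8: up to 24-25 pairs depending on the layout");
```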
According to the
The determining of the tag as discussed above may in particular, at least partially, comprise calculations similar to
The implementations may use 32-, 16- or 8-bit r,c values according to the depth of the vector leaf. In other words, the deeper the VECTOR_LEAF is created, the smaller the difference between two r,c pairs, and therefore the fewer bits the implementations need to represent this value. On the other hand, by using 8-, 16- and 32-bit basic arithmetic the implementations keep the extraction of the data simple and efficient.
Examples of memory layouts of various coordinate vector leaves for δ=2, a 32-bit index and a 2^32×2^32 matrix are now discussed in reference to
In variations of the implementations, instead of one VECTOR_LEAF tag as discussed above, it is also possible to use a VECTOR_LEAVES range, and to encode the subtype and size in this range in order to increase the V16 and V8 subtype capacity.
Leaves with Bitmap (BL)
In the implementations, and after a certain depth in the tree (i.e., XAMTree), the storage cost of tree nodes or of vector leaves is greater than the cost of a bitmap representation of the data. The implementations may set a BLDEPTH value as a XAMTree depth where only BL can be found.
For example, when δ=2 and the index is 32 bits, a TN costs 64 bytes while the last two levels of the tree represent a submatrix of 16×16 bits. Thereby, using bitmap leaves improves the storage size. The implementations may use a 256-bit bitmap that fits in 32 bytes to represent the two last levels of the tree. In cases where there are less than a few bits set in this submatrix, using a BL instead of a vector leaf may reduce the memory cost by a factor of 2, but if the antepenultimate vector (i.e., the vector corresponding to the last levels of the tree) is saturated the implementations may create up to 16 leaves. The implementations do not allocate bitmap leaves in the same way as TN and Vnn, so the indices in the last TN do not point to the same thing as bitmap leaves. In particular, the implementations allocate the bitmap layer (i.e., the layers represented by bitmaps) in a homogeneous way and without tagging. The implementations exploit bitmap leaves as a safeguard preventing huge memory usage.
The implementations identify, for each case of δ and width of the index, the depth at which the tree of TNs includes only bitmaps. For example, when δ=2 and the width of the index is 32 bits, each level of the tree consumes δ bits of each coordinate, thereby BLDEPTH=indexbits−2*δ is 32−2−2=28 bits. The implementations may represent the last level of a XAMTree as a bitmap. The implementations may alternatively define a bitmap for the two last levels for δ=1 or δ=2, because the bitmap is smaller than a tree node. In variations where δ>2, the implementations may only define bitmap leaves for the last level of the XAMTree.
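A minimal sketch of such a bitmap leaf for δ=2 is given below (the struct and member names are illustrative assumptions): a 16×16 submatrix stored as a 256-bit bitmap, i.e., 32 bytes.

```cpp
#include <array>
#include <cstdint>

// Bitmap leaf sketch for δ=2: the two last levels of the tree, i.e., a 16x16
// submatrix, stored as a 256-bit bitmap.
struct BitmapLeaf {
    std::array<std::uint64_t, 4> bits{};                 // 4 * 64 = 256 bits

    static unsigned cell(std::uint32_t r, std::uint32_t c) {
        return ((r & 15u) << 4) | (c & 15u);             // low 4 bits of row and column
    }
    bool test(std::uint32_t r, std::uint32_t c) const {
        unsigned i = cell(r, c);
        return (bits[i >> 6] >> (i & 63u)) & 1u;
    }
    void set(std::uint32_t r, std::uint32_t c) {
        unsigned i = cell(r, c);
        bits[i >> 6] |= (std::uint64_t{1} << (i & 63u));
    }
};
static_assert(sizeof(BitmapLeaf) == 32, "256-bit bitmap fits in 32 bytes");
```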
The implementations may construct a XAMTree by two arrays of fixed size elements as follows:
The implementations may create and store the two arrays either in memory (a memory heap for instance) or in a file. In examples where the size of a bitmap element is the same as the node size, the implementations may not create the BM array. The implementations may take into account other criteria for not performing the optimization using BM.
The implementations may manage the access to NV and/or BM via the following functions summarized in Table 2.
Hereinbelow, the notation array[index] represents get and set operations in BM and NV. The implementations may set a header descriptor which contains the root index.
First, an updatable (i.e., dynamic) version of the XAMTree is presented and then a read-only version is detailed.
The implementations may create an empty XAMTree according to the following steps:
The implementations comprise the following functions, listed in Table 3, in order to answer SPARQL queries on the stored database.
The implementations further implement the following functions as listed in Table 4.
Details of the algorithm of the implementations for the above functions are now discussed.
Parameters: r, c a row and a column to check
Parameters: r, c a row and a column to set
The mutation in sub-step b. is what makes the XAMTree asymmetric (i.e., with different subtypes throughout the tree) and prevents degenerations as seen in the K2-Tree. In other words, if there is a locality issue (e.g., values scattered over an area of the matrix, preventing the benefit of a sparse matrix), then only the nodes of this area are impacted thanks to this mutation (and not the whole topology of the tree as in the K2Tree).
Parameters: r, c a row and a column to reset
Non-recursive algorithms for scanAll, scanRow and scanColumn are now discussed. These algorithms use a fixed-size stack.
These algorithms are efficient scan algorithms because the visitor version (en.wikipedia.org/wiki/Visitor_pattern) does not need heap allocation (en.wikipedia.org/wiki/Memory_management#HEAP), an abstract iterator (en.wikipedia.org/wiki/Iterator_pattern) can be implemented in various languages with only one heap allocation, and a concrete iterator can be implemented without heap allocation. Therefore, the implementations may use a visitor or iterator pattern instead of a recursive algorithm to improve the heap allocation. As known in the field, despite the fact that all the presented iterator algorithms may be expressed recursively as long as there is a generator concept built into the programming language, it is rather difficult to write an iterator from a recursive algorithm without some means provided by the programming language. On the other hand, while obtaining a recursive visitor, or a visitor implemented from an iterator, is trivial, the obtention of a recursive iterator without a generator concept is complex. In the implementations, the size of the stack object is fixed, which allows allocating it without dynamic allocation. This improves over an implementation of an abstract iterator requiring at least one allocation. Furthermore, it is always possible to obtain a recursive version of this algorithm.
According to the implementations, the algorithm iterates over all leaves of the XAMTree using a breadth first traversal of the internal node TN.
Let us define a stack of ScanContext with the following data and operations:
ScanContext is a record which contains a “current_node” index and a “position” in the current node. “current_node” is the index of the node or leaf in the NV array or in the BM array to be scanned. The “position” index indicates the position to process if the current_node is a tree node. The index of “position” can be an integer or a pair of integers, or deduced from baseRC (which is described below). A ScanContext record may also keep some other cached data such as a pointer to NV[current_node], in which case current_node may be optimized. Stack is a record, which contains:
In this algorithm, the input parameter is a row r to scan and the algorithm iterates over all leaves of the XAMTree using a breadth-first traversal (en.wikipedia.org/wiki/Breadth-first_search) of the internal node TN. This algorithm is a refinement of scanAll discussed above.
The implementations use a stack similar to the stack used in scanAll and associate to the stack a _filterRC field. _filterRC is initialized by PartialEncode(r, δ)<<δ. Specifically, _filterRC is initially set to Encode_δ(r, 0); then, during the XAMTree scan, the implementations perform a logical rotate to the left (ROL) by 2δ at the beginning of the scanning of a node (TN) and a logical rotate to the right (ROR) by 2δ when the scan of a node is finished. As known per se, the bitwise operation ROL(n) performs the following bitwise operation to transform a source array of bits (Source) into a destination array of bits (Dest):
The implementations extend the Stack described in ScanAll with the following operations:
In this algorithm, the input parameter is a column c to scan and the algorithm iterates over all leaves of the XAMTree using a breadth-first traversal of the internal node TN. This algorithm is a refinement of scanRow discussed above.
The implementations use the same stack as used in scanRow discussed above, but _filterRC is initialized by PartialEncode(c, δ) (i.e., initially set to Encode_δ(0, c)). The implementations extend the Stack described in ScanRow with the following operation:
The implementations may also comprise a read only version which is slightly different from the updatable (i.e., dynamic) version.
In the read-only version, the data structure does not need any way to be updated (i.e., no set and reset operations are available). Furthermore, the implementations may store (i.e., create) a read-only version of the graph database by scanning a preexisting adjacency matrix, or from a XAMTree. The read-only variations of the implementations further optimize the memory usage for the storage.
The read-only version of the XAMTree is called ROXAMTree (Read-Only XAMTree).
In the dynamic version, the 2^(2δ) index slots defined by the fixed size of the tree nodes (i.e., a TN) are rarely all used. Usually, the TN is sparse. Avoiding the index slots getting full is important for an updatable data structure in order to keep O(1) updates when a new node or leaf has to be created.
The implementations of the method may have a read-only implementation which avoids storing the full set of index slots and keeps the sparsity information of a TN by using a bitmask (see en.wikipedia.org/wiki/Mask_(computing) for bitmask) that represents the used indices. This bitmask is hereinbelow referred to as mn. For a given XAMTree with δ=2, a 16-bit integer is enough to store this mask. For example, a tree node of 32-bit integers may use only 3 cells (e.g., the cells located at positions 4, 7, and 8). Thereby, the implementations only need 14 bytes (i.e., 2+3*4) to keep the information instead of 64 bytes (4*16). Similarly, a 16-bit bitmask may contain 0b0000′0001′1001′0000 (2^4+2^7+2^8) with indices 4, 7 and 8 in use. The position of the index associated to a cell can be computed using a bit population count up to the tested cell (i.e., popcount(mn & (2^cell−1))), which returns 0, 1, and 2 for the cells 4, 7 and 8, respectively.
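A minimal sketch of this mn bitmask and rank computation is given below (the struct layout and names are assumptions; std::popcount requires C++20):

```cpp
#include <bit>       // std::popcount (C++20)
#include <cstdint>

// Read-only node sketch: a 16-bit bitmask mn marks the used cells of a TN and
// only the used 32-bit indices are stored, packed in cell order.
struct RONode {
    std::uint16_t mn;                  // e.g. 0b0000'0001'1001'0000 for cells 4, 7, 8
    const std::uint32_t* packed;       // the packed indices assumed to follow the mask

    bool used(unsigned cell) const { return (mn >> cell) & 1u; }

    // Rank of the index for a used cell: popcount(mn & (2^cell - 1)).
    std::uint32_t indexOf(unsigned cell) const {
        return packed[std::popcount(static_cast<unsigned>(mn & ((1u << cell) - 1u)))];
    }
};
// indexOf(4), indexOf(7), indexOf(8) read packed[0], packed[1], packed[2].
```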
But it is possible to make a better use of the memory space. Like a XAMTree, a ROXAMTree has two families of sub-elements: tree nodes and leaves. For instance, at BLDEPTH, the implementations may not replace bitmap leaves with other types of leaves. As a consequence, in the implementations the indices can be omitted and the bitmap leaves are directly placed after mn.
For levels other than BLDEPTH, a tree node mixing vector leaves and tree nodes is frequent; in order to avoid having to keep indices pointing to these leaves, the implementations may use a second bitmask (called ml). Using the second bitmask helps to avoid storing the indices for all vector leaves.
In the previous example, the tree node uses the cells with indices 4, 7, and 8, but 7 is a vector leaf. The ml bitmask will then contain 0b0000′0000′1000′0000 (2^7), telling that just after the mask and (if any) the TN indices there is a coordinate vector. In the implementations, the mn bitmask marks all sub-elements, while the ml mask marks the leaves. It is trivial to obtain the tree node bitmask by performing an exclusive or between mn and ml.
In the read-only version, the implementations may also simplify vector nodes. One of the main constraints of the memory layouts (e.g., V32, V16, V8) discussed above for the XAMTree is to keep them simple, so that they are easily updatable and fit in the tree-node size.
In the case of a read-only data structure, the implementations may only have to var-int serialize the deltas of the sorted Morton-encoded row-column pairs (e.g., according to the method for compressing RDF tuples disclosed in European Patent Application No. 21306839.8 by Dassault Systèmes, which is incorporated herein by reference). In order to keep the first value as small as possible, instead of initializing it at 0, the implementations use the Morton-encoded coordinate of the node. Computing this coordinate is not an overhead because it has to be computed in order to properly scan BLs.
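For illustration only (this is a generic delta + var-int sketch, not the specific method of the cited European application; the function names are hypothetical):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// LEB128-style var-int: 7 payload bits per byte, high bit marks continuation.
void appendVarint(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) { out.push_back(std::uint8_t(v) | 0x80); v >>= 7; }
    out.push_back(std::uint8_t(v));
}

// Sort the Morton-encoded (row, column) pairs, then var-int serialize the
// deltas. nodeBaseCode is the Morton code of the node's own coordinates,
// assumed not larger than the smallest stored code, so deltas stay small.
std::vector<std::uint8_t> serializeLeaf(std::vector<std::uint64_t> mortonCodes,
                                        std::uint64_t nodeBaseCode) {
    std::sort(mortonCodes.begin(), mortonCodes.end());
    std::vector<std::uint8_t> out;
    std::uint64_t previous = nodeBaseCode;            // not 0, as discussed above
    for (std::uint64_t code : mortonCodes) {
        appendVarint(out, code - previous);           // delta encoding
        previous = code;
    }
    return out;
}
```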
The implementations are based on tests examining the capacity of the method for storing a graph database, for example on the Microsoft Academic Graph (www.microsoft.com/en-us/research/project/open-academic-graph/), which is a dataset of around 8 billion RDF triples. Such tests have observed that the disclosed storing method outperforms the dynamic K2 tree (which experiences a degradation in terms of performance and a write amplification) and an alternative solution using off-the-shelf data structures which exploits classical B+ Trees (which experiences a significant increase in the storage cost of the database).
A first test according to the following steps was conducted in order to compare XAMTree according to this disclosure with a dynamic version of K2-tree as described in BRISABOA, Nieves R. et al. “Compressed representation of dynamic binary relations with applications.” Information Systems, 2017, vol. 69, p. 106-123 which is incorporated herein by reference:
All operations of the above steps were done on the same hardware for both data structures, XAMTree and K2Tree; thereby the results are not dependent on hardware.
The results are provided in the following Table 5:
In the results presented in Table 5, the comparison of the values of the second and the third column is more important than the absolute value of each of them. These results show that the XAMTree storage cost is in the same order of magnitude as with the dynamic k2-tree, while achieving better query times for big queries. Note that this experiment is run with a dataset more suitable for the K2Tree, which means a K2Tree is usable and its lower storage cost can be preferred in regard to the gain in elapsed time (20% in the experiment).
The small query (q0.sparql) used in the test is as follows:
A second test was run with a bigger dataset, the Microsoft Academic Knowledge Graph (MAG) dataset (as provided in makg.org/rdf-dumps), which is made of around 8 billion triples. The second test results showed that the XAMTree data structure enables importing the MAG in about 20 hours (an ingestion throughput of 113,428 triples/second), while with the dynamic K2tree the import was not finished after 40 hours on the same hardware. Thereby, the K2Tree did not pass the criterion of a throughput equal to or above 50,000 triples/second, which was set as a failure threshold for the test. The results of the second test show that while XAMTree scales with the size of the graph database, the (dynamic) K2Tree does not.