The present disclosure relates to techniques for processing logical graphs. More specifically, the disclosure relates to supporting fast and memory-efficient graph mutations that are suitable for distributed graph computing environments.
A graph is a mathematical structure used to model relationships between entities. A graph consists of a set of vertices (corresponding to entities) and a set of edges (corresponding to relationships). When data for a specific application has many relevant relationships, the data may be represented by a graph.
Graph processing systems can be split into two classes: graph analytics and graph querying. Graph analytics systems have a goal of extracting information hidden in the relationships between entities, by iteratively traversing relevant subgraphs or the entire graph. Graph querying systems have a different goal of extracting structural information from the data, by matching patterns on the graph topology.
Graph pattern matching refers to finding subgraphs, in a given directed graph, that are homomorphic to a target pattern. If the target pattern is (a)→(b)→(c)→(a), then corresponding graph walks or paths may include the following vertex sequences:
There exist challenges to supporting a mutable graph stored in an in-memory graph database or graph processing engine while providing snapshot isolation guarantees and maintaining analytical performance on the graph. Particularly, when supporting a distributed graph, processing load and memory consumption should remain balanced across nodes for optimal analytical performance. This load balancing becomes increasingly difficult as the number of graph mutations increases.
The example embodiment(s) of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Data structures and methods are described for supporting mutations on a distributed graph in a fast and memory-efficient manner. Nodes in a distributed graph processing system may store graph information such as vertices, edges, properties, vertex keys, vertex degree counts, and other information in graph arrays, which are divided into shared arrays and delta logs. In some implementations, graph arrays can be divided into fixed-size segments to enable localized consolidation. In some implementations, edges may be tracked in both forward and reverse directions to accelerate analytic performance at the cost of additional memory.
Shared arrays represent original graph information that is shared with all nodes, whereas delta logs represent local mutations, e.g. updates, additions, or deletions applied to the shared arrays by a local node. Iterators may be provided to logically access the graph arrays as unified single arrays that represent the reconstructed mutated graph, that is, the shared arrays with the delta logs applied on the fly.
Nodes may be responsible for mutually exclusive sets of vertices, except for high degree “ghost” vertices that are replicated on every node. The assignment of vertices to nodes may distribute vertices of the same degree approximately uniformly, thereby providing an approximately even distribution of work and memory consumption across the nodes. The distribution may utilize a hash function as a source of randomness to approximate uniform balancing.
To identify and reference the vertices, a dictionary may be provided at each node to map vertex keys to tuples of node and internal table/index identifiers. The dictionary may include a shared map that is duplicated among nodes and a local map for mapping updates by the local node, analogous to the delta logs for the graph information described above.
The shared arrays on a local node are accessible directly from remote nodes without requiring a copy or replication operation. For example, the shared arrays may be accessed using remote direct memory access (RDMA), shared references, message passing, or similar techniques. Since the shared arrays may be proportionally large compared to the delta logs and may remain at each node without replication, memory footprint and replication overhead may be minimized at each node.
Local delta logs may be replicated or copied between nodes when reconstructing the mutated distributed graph, e.g. prior to executing analytic tasks. Mutations may be supported at both the entity and table levels. To minimize reconstruction overhead, periodic delta log consolidation may occur at multiple levels to limit the size of the delta logs, thereby preserving analytic performance. Consolidation at the table level may also trigger rebalancing of vertices across the nodes to preserve an even distribution of work and memory consumption.
As shown in
To divide the graph processing workload and memory footprint evenly across the nodes, a hash function may be utilized as a source of randomness to approximate a uniform balancing of the vertices of a graph to a responsible node or data owner. For example, using the hash function may provide an approximately uniform distribution for vertices of each degree, resulting in vertices with 1 edge approximately uniformly distributed across nodes, vertices with 2 edges approximately uniformly distributed across nodes, vertices with 3 edges approximately uniformly distributed across nodes, and so on. Each node is then responsible for tracking mutations applied to its respective vertices assigned by the hash function. Thus, each vertex may be assigned to a single node or data owner, except for ghost vertices, or vertices exceeding a threshold degree, which are tracked at each node independently to reduce communication overhead.
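For illustration only, the following minimal Python sketch shows one way such a hash-based assignment could be approximated; the number of nodes, the degree threshold, and names such as assign_owner are assumptions for illustration rather than part of the described system.

```python
import hashlib

NUM_NODES = 4                      # e.g. nodes 110A-110D
GHOST_DEGREE_THRESHOLD = 10_000    # assumed threshold for ghost vertices

def assign_owner(vertex_key: str, degree: int):
    """Return the owning node index, or None for ghost vertices replicated on every node."""
    if degree > GHOST_DEGREE_THRESHOLD:
        return None                # ghost vertex: tracked at each node independently
    digest = hashlib.sha256(vertex_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# Because the hash is independent of degree, vertices of each degree class end up
# spread approximately uniformly across the nodes.
print(assign_owner("A", degree=3), assign_owner("B", degree=2))
```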
The assignments of vertices to nodes may be stored in dictionary 150A for local node 110A. Each entry in shared map 152 may map an externally referenceable, user-specified vertex key to an internally identifying vertex tuple. For example, the tuple may include a machine or node identifier, a vertex table identifier, and a vertex table index. In an example tuple, the node identifier may select from nodes 110A-110D, e.g. remote node 110B, the vertex table identifier may select from a specific graph array 140B, and the vertex table index may select an index within shared array 142B of the selected specific graph array 140B.
The assignments of vertices to nodes may be stored in dictionary 150B for remote node 110B. As shown in
The index associated with a vertex is a numerical identifier unique within each node, which is referred to as a physical vertex index. Initially, when a table is created, the index is in the range [0, V-1], where V is the number of vertices during table loading. In an implementation, the physical vertex index of a valid vertex remains persistent across different snapshots of the graph as long as the vertex is not deleted, and can therefore be used as an index into property arrays. Whenever new vertices are added, they may take the place of previously deleted vertices, which is referred to herein as vertex compensation.
A deleted bitset array indicates, for each physical vertex index, whether the corresponding vertex is deleted in the current snapshot. In an implementation, one such deleted bitset array is created per snapshot. Keeping the physical vertex indices stable across snapshots makes it possible to minimize disruptive changes for the edges. Deleted bitset arrays may be provided for edge tables as well. To reduce the memory footprint of deleted bitset arrays, a deleted bitset array may remain unallocated at each node until a deletion is introduced, and/or compressed representations such as run-length encoded (RLE) arrays may be used, since the deleted bitset array may be sparse with a relatively small number of deletions.
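A minimal sketch of such a lazily allocated deleted bitset follows, using a plain Python set in place of a compressed (e.g. RLE) representation; the class and method names are illustrative assumptions.

```python
class DeletedBitset:
    def __init__(self):
        self._deleted = None          # unallocated until the first deletion occurs

    def mark_deleted(self, index: int):
        if self._deleted is None:     # lazy allocation on first deletion
            self._deleted = set()
        self._deleted.add(index)

    def is_deleted(self, index: int) -> bool:
        return self._deleted is not None and index in self._deleted

snapshot_deletions = DeletedBitset()   # one instance per snapshot
snapshot_deletions.mark_deleted(0)
print(snapshot_deletions.is_deleted(0), snapshot_deletions.is_deleted(1))
```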
When local node 110A needs to perform updates to shared map 152, the updates are stored separately in local map 154A. For example, if a new vertex is created or an existing vertex is modified or deleted, e.g. by mapping to null or another reserved identifier, the mapping update for the affected vertex is stored in local map 154A. In this manner, shared map 152 can be preserved to reference the original distributed graph without mutations. When local node 110A needs to perform a vertex lookup for the distributed graph with mutations applied, the vertex key is first queried in local map 154A. This query may be omitted if local map 154A is empty. If the vertex key is not found in local map 154A, the vertex key is queried in shared map 152.
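The two-level lookup described above may be sketched as follows; the DELETED sentinel stands in for null or another reserved identifier, and all names and example mappings are illustrative assumptions.

```python
DELETED = object()   # stands in for null or another reserved identifier

def lookup_vertex(key, local_map: dict, shared_map: dict):
    if local_map and key in local_map:      # query the local map first (skipped when empty)
        entry = local_map[key]
        return None if entry is DELETED else entry
    return shared_map.get(key)              # fall back to the original shared mapping

shared_map = {"A": ("node110A", "vtab0", 0), "B": ("node110A", "vtab0", 1)}
local_map = {"C": ("node110A", "vtab0", 2), "A": DELETED}   # added "C", deleted "A"
print(lookup_vertex("C", local_map, shared_map))   # found in the local map
print(lookup_vertex("B", local_map, shared_map))   # falls back to the shared map
print(lookup_vertex("A", local_map, shared_map))   # deleted -> None
```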
Graph arrays 140A and 140B may correspond to any type of graph array, such as a vertex array, an edge array, a property array, a degree array, or any other type of information pertaining to the distributed graph. Each node may include the multiple graph arrays needed to store the vertices, edges, properties, and other related graph data assigned to that node. Referring to graph array 140A, each graph array may include two portions: shared array 142A and delta logs 144A.
Shared array 142A corresponds to data for the original distributed graph without any mutations applied. Since the content of the shared array depends on the vertices tracked by each node, each node contains its own independent shared array. The shared arrays are shared in the sense that the shared arrays form a baseline graph that is shared across snapshots of mutated graphs. The shared arrays are also shared in the sense that other nodes can access the shared arrays directly without performing a replication or copy operation. Thus, remote node 110B can access shared array 142A without performing a copy, and local node 110A can access shared array 142B without performing a copy. For example, the shared arrays may be exposed to remote nodes via RDMA or other techniques. In this manner, memory footprint can be distributed evenly among the nodes while reducing duplication overhead.
To provide snapshot isolation guarantees, both the original distributed graph and the mutated distributed graph should be accessible at any given time. Accordingly, mutations to the distributed graph are stored separately as delta logs, which reflect the mutations to the original distributed graph. When local node 110A needs to access the original distributed graph, it can be accessed directly from the shared array 142A. When local node 110A needs to access the mutated distributed graph, it can be reconstructed by applying delta logs 144A on the fly to shared array 142A without directly modifying shared array 142A. Delta logs 144A may include update map 146A, which includes updates or deletions for shared array 142A, append array 148A, which includes new entries for shared array 142A, and deleted bitset 149A, which may include deletions for existing entries in shared array 142A or append array 148A. When there are no modifications, then update map 146A may be empty. Similarly, when there are no new additions, then append array 148A may be empty.
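The following sketch illustrates, under assumed Python data structures, how a graph array split into a shared array and delta logs can be read and iterated with the deltas applied on the fly, leaving the shared array untouched; the field and method names are assumptions for illustration.

```python
class GraphArray:
    def __init__(self, shared):
        self.shared = shared        # original, immutable baseline (e.g. 142A)
        self.update_map = {}        # index -> new value            (e.g. 146A)
        self.append = []            # new entries past the baseline (e.g. 148A)
        self.deleted = set()        # deleted indices               (e.g. 149A)

    def __len__(self):
        return len(self.shared) + len(self.append)

    def get(self, i):
        """Read index i of the mutated snapshot without touching the shared array."""
        if i in self.deleted:
            return None
        if i in self.update_map:
            return self.update_map[i]
        if i < len(self.shared):
            return self.shared[i]
        return self.append[i - len(self.shared)]

    def iter_mutated(self):
        """Iterator presenting the graph array as one logical, mutated array."""
        for i in range(len(self)):
            value = self.get(i)
            if value is not None:
                yield i, value

arr = GraphArray(shared=[10, 20, 30])
arr.update_map[1] = 25      # update
arr.append.append(40)       # addition
arr.deleted.add(0)          # deletion
print(list(arr.iter_mutated()))   # [(1, 25), (2, 30), (3, 40)]
```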
Since the graph is distributed, local node 110A may refer to dictionary 150A and determine that a vertex to be looked up is maintained on a remote node, for example remote node 110B. In this example, for each graph array 140B, the shared array 142B can be accessed directly, whereas the delta logs 144B may be replicated or copied over network 160 and applied on the fly to shared array 142B to reconstruct the mutated graph at local node 110A. Since only the delta logs 144B are copied while the shared array 142B remains in place, replication overhead over network 160 can thereby be reduced, especially when the delta logs are relatively small compared to the shared array. To maintain this performance benefit, delta logs may be constrained by a size threshold. When the size threshold is exceeded, a consolidation operation may be carried out to create a new snapshot while applying and emptying the delta logs, as described in further detail below.
Note that the vertex keys in shared array 142A and 142B are only shown for illustrative purposes, as the actual arrays may be single dimensional (1-D) arrays that include edge table offsets, which point to indexes in separate edge tables using a compressed sparse row (CSR) format, as described below in
Besides vertex tables, each node may also store graph arrays for edge tables, for each property N, for vertex degrees, and for other graph data. For example, node 110A may also store an edge table with two indexes to track forward edges for vertices “A” and “B”, and N property tables each having two indexes to store the N properties for vertices “A” and “B”. As discussed above, in some implementations, reverse edges may be tracked as well.
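For illustration, a minimal CSR sketch with assumed data is shown below: a 1-D array of edge-table offsets per vertex and a destination array for forward edges, with a property array indexed by the same physical vertex index. The variable names and the concrete edges are assumptions.

```python
begin = [0, 1, 2]            # vertex i owns destinations[begin[i]:begin[i+1]]
destinations = [1, 0]        # assumed edges: vertex "A" -> "B", vertex "B" -> "A"

def neighbors(vertex_index: int):
    start, end = begin[vertex_index], begin[vertex_index + 1]
    return destinations[start:end]

# Property arrays are indexed by the same physical vertex index.
property_0 = ["propA0", "propB0"]
for v in range(len(begin) - 1):
    print(v, property_0[v], neighbors(v))
```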
For example, referring to the example shown in
For example, as shown in
For example, referring to the update map, edge ID 7 and edge ID 8 are added to vertex ID 1, whereas edge ID 9 is added to vertex ID 3. Thus, vertex ID 1 now has edges with destination IDs 7 and 8, whereas vertex ID 3 now has edges with destination IDs 0, 1, and 5. This can be visualized by referring to
Using the example shown in
Besides the deletion of vertex ID 0, the delta logs in
Further, in the delta logs, new edge IDs 4 through 6 are added by an append array, wherein the CSR format indicates that vertex ID 4 has one forward edge, or edge ID 4, and vertex ID 5 has no edges. The vertex table in
The delta logs in
Block 910 generates, on local node 110A, a representation in memory 130A for a graph distributed on a plurality of nodes including local node 110A and remote nodes 110B-110D, the graph comprising a plurality of vertices connected by a plurality of edges, wherein each of the plurality of edges is directed from a respective source vertex to a respective destination vertex. As discussed above, a hash-based distribution may be used to assign to each node a mutually exclusive set of vertices and associated edges, except for ghost vertices that are duplicated at each node.
Block 912 generates at least one graph array 140A, each comprising: shared array 142A accessible by the remote nodes 110B-110D, and one or more delta logs 144A comprising at least one of: update map 146A comprising updates to shared array 142A by local node 110A, and append array 148A comprising new entries to shared array 142A by local node 110A. As discussed above, the graph arrays with delta logs are general data structures that can represent various types of data such as vertex tables in CSR format, edge tables in CSR format, vertex keys, property tables for vertices, property tables for edges, vertex degree counts, and other data.
Block 914 generates dictionary 150A comprising shared map 152 for mapping vertex keys to a tuple, wherein shared map 152 is duplicated on remote nodes 110B-110D, and wherein the tuple comprises: a node identifier of nodes 110A-110D, a vertex table identifier of one of the graph arrays 140A, and a vertex index within the identified vertex table. Using the shared map 152, nodes 110A-110D can reference the node and table locations for vertices referenced by a vertex key. Further, the dictionary 150A includes local map 154A for updates to shared map 152 by local node 110A. The updates may include, for example, mappings of new vertices assigned to local node 110A.
Referring to
For each vertex table across the nodes, block 1012 accesses the shared arrays of the graph arrays associated with the vertex table, and replicates the append arrays, update maps, and deleted bitsets associated with the vertex table. The graph arrays may include, for example, a vertex table, vertex property arrays, a vertex key array, and a degree array. For example, in the case of graph array 140B corresponding to a vertex table, the shared array 142B is accessed by reference, whereas the delta logs 144B are replicated, including update map 146B, append array 148B, and deleted bitset 149B.
For each vertex table from the data of block 1012, block 1014 propagates vertex deletions to respective data owner nodes for dictionary updates, and to all nodes for updating deleted bitsets of ghost vertices. For example, vertices indicated as deleted in the deleted bitsets replicated from block 1012 are propagated to their respective data owner for updating local node metadata. For example, dictionary entries at each node may be set to a null or another reserved value to indicate vertex keys that have been deleted. In the case of ghost vertices, the ghost vertices may occupy a reserved area at the top or head of the vertex tables, and these entries may be marked as deleted for each node, e.g. by setting the corresponding deleted bitset value.
For each vertex table from the data of block 1012, block 1016 uses the vertex append array to replace deleted entries in the vertex table or append to the end of the array if no deleted entries are available. For example, in the case of graph array 140B corresponding to a vertex table, the append array 148B is processed to replace vertex entries in shared array 142A that are marked as deleted in deleted bitset 149A, and the corresponding bits are unset to indicate the entries are no longer deleted. Once shared array 142A runs out of deleted entries to replace, the remaining new entries from append array 148B are added to append array 148A. Further, the replicated data of other associated tables, such as the vertex property arrays and the vertex key array, is applied in the same manner to update their respective arrays in associated graph arrays 140A.
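A simplified sketch of this vertex compensation step follows; it reuses deleted slots before appending, writes into plain Python lists standing in for the local arrays, and uses assumed names, so it should be read as a hedged illustration rather than the exact procedure.

```python
def apply_vertex_appends(shared, deleted: set, local_append: list, new_entries: list):
    free_slots = sorted(deleted)
    for entry in new_entries:
        if free_slots:
            slot = free_slots.pop(0)
            shared[slot] = entry          # compensation: reuse a deleted physical index
            deleted.discard(slot)         # unset the bit: the slot is live again
        else:
            local_append.append(entry)    # no deleted slots left: append to the end
    return shared, deleted, local_append

shared, deleted, local_append = ["A", "B", "C"], {1}, []
apply_vertex_appends(shared, deleted, local_append, ["D", "E"])
print(shared, deleted, local_append)   # ['A', 'D', 'C'] set() ['E']
```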
For each vertex table from the data of block 1012, block 1018 uses the update maps to update entries in the vertex property tables. For example, in the case of graph array 140B corresponding to a vertex property array, the updates are applied locally to shared array 142A of graph array 140A corresponding to the same vertex property array. This is repeated for all vertex properties. After block 1010 is completed, vertex additions, deletions, and modifications are implemented in a new graph defined by the updated data structures in local node 110A.
Referring to
For each vertex and edge table across the nodes, block 1022 accesses the shared arrays of the graph arrays associated with the vertex and edge tables, and replicates the append arrays, update maps, and deleted bitsets associated with the edge table. Process 1000 may have already been applied previously, in which case the vertex tables and associated data are already replicated at local node 110A. In this case, the graph arrays may include, for example, an edge table and edge property arrays. For example, in the case of graph array 140B corresponding to an edge table, the shared array 142B is accessed by reference, whereas the delta logs 144B are replicated, including update map 146B, append array 148B, and deleted bitset 149B.
For each vertex and edge table from the data of block 1022, block 1024 propagates edge deletions from the replicated data of block 1022 to respective data owner nodes based on source vertex for updating deleted bitset arrays. The source vertex of an edge determines the data owner for forward edges. For example, edges indicated as deleted in the deleted bitsets replicated from block 1022 are propagated to their respective data owner for updating local node metadata. For example, deleted bitset array indexes for edge tables at each node may be set to indicate edges that have been deleted.
For each vertex and edge table from the data of block 1022, block 1026 propagates new edge additions from edge append array to respective data owner nodes based on source vertex. For example, in the case of graph array 140B corresponding to an edge table, the append array 148B is processed to send new edges to the respective data owner, which can be determined based on the source vertex indicated in the vertex table.
For each vertex table and edge table from the data of block 1022, block 1028 generates a mutated compressed graph representation by iterating concurrently over the vertex tables of the received new edge additions from block 1026 and of the original delta logs previously replicated in block 1022. For example, both the original delta logs and the new edge additions may be formatted into CSR format. By merging both CSRs concurrently, all edges for a given vertex may be grouped together in the merged CSR. In this manner, the vertex and edge tables can be reconstructed correctly in the mutated compressed graph representation.
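One way such a concurrent per-vertex merge of two CSRs could look is sketched below with assumed inputs: for each vertex, the edges from the first CSR and from the second CSR are emitted together so that the vertex's neighborhood stays contiguous in the merged CSR. The function name and example data are illustrative.

```python
def merge_csrs(begin_a, dest_a, begin_b, dest_b):
    num_vertices = len(begin_a) - 1          # both CSRs cover the same vertex range
    merged_begin, merged_dest = [0], []
    for v in range(num_vertices):
        # edges of v from the first CSR, then from the second CSR
        merged_dest.extend(dest_a[begin_a[v]:begin_a[v + 1]])
        merged_dest.extend(dest_b[begin_b[v]:begin_b[v + 1]])
        merged_begin.append(len(merged_dest))
    return merged_begin, merged_dest

# Vertex 0 has edge ->1 in CSR A and ->2 in CSR B; vertex 1 has ->0 only in CSR B.
print(merge_csrs([0, 1, 1], [1], [0, 1, 2], [2, 0]))
# ([0, 2, 3], [1, 2, 0])
```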
For each vertex table and edge table from the data of block 1022, block 1030 uses the update maps to update entries in the edge property tables. For example, in the case of graph array 140B corresponding to an edge property array, the updates are applied locally to shared array 142A of graph array 140A corresponding to the same edge property array. This is repeated for all edge properties. After block 1020 is completed, the forward CSR of the edge table is rebuilt. As discussed above, process 1001 may be repeated to handle the reverse CSR as well. After both process 1000 and 1001 are completed, the new graph is reconstructed at local node 110A with all mutations implemented, which may correspond to a new snapshot.
The above examples have described element-level mutations, or mutations that affect one element at a time. Table-level mutations can also be supported, such as when loading from a file or another data source to add new tables or delete existing tables. In this case, existing entity tables that are not deleted by the table-level mutation can be retained in the new graph, remaining at the same index in a table array and using the same table ID. The shared arrays are accessed by reference, and the delta logs are replicated without modification.
When the mutations delete a table, the table is not actually deleted, but is set as a tombstone table, or a special indicator that the table is empty and should not be accessed. This is to preserve the ordering of table IDs in the table array.
When the mutations add a new table, a loading pipeline is utilized to read the graph from a file or another data source, first by reading the vertex and edge tables, and second by storing the tables in intermediate data structures that are usable to reconstruct the graph. The pipeline is modified to prevent any existing tables from being added to the intermediate data structures. This generalization allows the edge tables to be reconstructed by reading vertex information from the intermediate structures for new vertex tables, or from the original graph for existing vertex tables. Vertex tables can be reconstructed normally without modifications to the pipeline. The new tables replace any tombstone tables, if available, or are otherwise appended to the end of the table array.
As discussed above, when the delta logs exceed a size threshold, a consolidation action may be triggered to apply and clear the delta logs. The size threshold may, for example, be set as a ratio of the size of the shared arrays, or by other criteria. By keeping the delta logs below the size threshold, reconstruction overhead can be kept to a minimum to preserve high analytical performance. Since consolidation is an expensive operation, consolidation may be triggered at various levels to delay large scale consolidation operations.
Consolidation may occur at the array level. In this case, each node can consolidate its own graph arrays, regardless of the data type (vertex, edge, property, etc.) and without coordination with other nodes. However, consolidation of CSR or other compressed formats may be avoided due to the reconstruction overhead. For example, referring to graph array 140A, the delta logs 144A may be applied to shared array 142A by adding the new entries from append array 148A and applying the changes from update map 146A and deleted bitset 149A. Once the new graph array 140A is thereby consolidated, the delta logs 144A may be emptied.
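A minimal sketch of array-level consolidation under assumed data follows; for brevity it compacts deleted entries away, whereas the scheme described above keeps physical indices stable through vertex compensation, so it should be read only as an illustration of applying and then emptying the delta logs.

```python
def consolidate(shared, update_map, append, deleted):
    new_shared = []
    for i in range(len(shared) + len(append)):
        if i in deleted:
            continue                                  # drop deleted entries
        if i in update_map:
            new_shared.append(update_map[i])          # apply updates
        elif i < len(shared):
            new_shared.append(shared[i])              # keep baseline entries
        else:
            new_shared.append(append[i - len(shared)])  # fold in appended entries
    return new_shared, {}, [], set()                  # delta logs are emptied

print(consolidate([10, 20, 30], {1: 25}, [40], {0}))
# ([25, 30, 40], {}, [], set())
```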
Consolidation may occur at the CSR level. In this case, the vertex and edge tables and edge property arrays are reconstructed, and any multiple edge neighborhoods are consolidated into a single edge neighborhood. The deleted bitsets for the vertices and edges may also be emptied. This operation does not require coordination with other nodes and can be performed independently for both forward and reverse edges. This consolidation may be especially useful when many edges are added or modified on a specific node.
Consolidation may occur at the table level. This is equivalent to applying table level mutations, as described above. Thus, this operation cannot be done independently on a single node and coordination with other nodes is required. All delta logs on all nodes are empty after this operation. Since this operation may result in numerous modifications to the graph, it may be efficient to perform rebalancing at the same time of table level consolidation, as discussed below.
Consolidation may also occur at a segment level. A graph may be divided into segments or chunks of a fixed size, which are identified using a segment array. Segments may therefore function as a fixed size portion of a graph array. Consolidation may then be triggered on a per-segment basis proceeding similarly to the array level consolidation described above, thereby allowing for finer consolidation granularity for more localized consolidation compared to the table level.
When significant mutations are applied to a graph, the graph may become unbalanced, thereby skewing the original load balanced distribution across the nodes. In this case, a rebalancing operation may be carried out to provide a new load balanced distribution across the nodes. Since this operation is expensive, it may be carried out at the same time as table level consolidation when a threshold number of mutations (e.g. edge additions) are applied to existing vertices, as discussed above. The rebalancing may be carried out by repeating the distribution of vertices using the hash function as described above.
When the mutations are primarily for newly added vertices, then a separate rebalancing may be triggered for the new vertices only. In this case, before the new vertices are applied to a local node, the nodes can exchange the new vertices among themselves so that the additions are balanced. For example, by using vertex degree arrays, the degree of existing vertices and new vertices can be compared, and the new vertices can be assigned in a balanced manner across the nodes. When vertex degree arrays are not available, the degrees may be calculated on demand. The nodes can then exchange dictionary mapping information to reference and locate the new vertices on each node.
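The following sketch illustrates one plausible balancing heuristic for new vertices using per-node degree totals; the greedy strategy, node names, and degree values are assumptions for illustration rather than the described mechanism.

```python
def assign_new_vertices(node_degree_totals: dict, new_vertex_degrees: dict):
    assignment = {}
    # place the largest new vertices first for a tighter balance
    for key, degree in sorted(new_vertex_degrees.items(), key=lambda kv: -kv[1]):
        target = min(node_degree_totals, key=node_degree_totals.get)
        assignment[key] = target
        node_degree_totals[target] += degree
    return assignment   # exchanged between nodes as dictionary mapping updates

totals = {"node110A": 120, "node110B": 80, "node110C": 100, "node110D": 95}
new_vertices = {"X": 40, "Y": 10, "Z": 25}
print(assign_new_vertices(totals, new_vertices))
```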
When vertices have a degree that exceeds a threshold, then the vertices may be treated as ghost vertices, or vertices that are duplicated at each node. This helps to avoid excessive communication overhead between nodes. For example, a reserved portion at the top or head of the vertex and edge arrays may be used to store the ghost vertices at each node. The degree threshold may be set lower for initial table loading and higher for mutations, since a later conversion of a normal vertex into a ghost vertex, referred to as a ghost promotion, may be an expensive process.
To perform ghost promotion, table level consolidation of vertex tables may be carried out as described above. However, since this operation incurs significant overhead, the reserved portion in the arrays may instead be used to add new ghost vertices. If empty or previously deleted ghost vertex entries are available, the new ghosts can replace these entries. Otherwise, if the reserved portion becomes full, then entries for normal vertices in the array may be relocated to expand the reserved portion.
Since each node replicates the ghost vertices, the ghost promotion also needs to be broadcast to the other nodes. For example, the data owner or local node of a vertex V that is promoted to a ghost vertex G may send a promotion message to remote nodes, which includes the vertex ID V, the ghost vertex ID G, associated properties of V, and edges from V to vertices owned by the remote nodes. The remote nodes create a ghost replica G with the origin vertex V and add the properties. The remote nodes add to G any edges from V to local vertices. The remote nodes update reverse edges from a local vertex to V by changing the destination vertex to G, which may be represented in update maps for the edge arrays. The local node then deletes the edges of V, since the edges are now associated with G. At this point, the ghost promotion is propagated to all the nodes. This promotion may be especially useful for table level mutations, as the ghost promotion can occur prior to adding numerous edges to a vertex.
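A hedged sketch of how a remote node might handle such a promotion message is shown below, following the steps described above (create the replica, attach locally owned edges, redirect reverse edges); the message fields, data structures, and names are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PromotionMessage:
    origin_vertex: int                     # vertex ID V at the data owner
    ghost_id: int                          # ghost vertex ID G
    properties: dict
    edges_to_receiver: list = field(default_factory=list)   # destinations owned by the receiving node

def handle_promotion(msg, local_ghosts: dict, local_forward_edges: dict, reverse_edges: dict):
    # create the ghost replica G carrying the origin vertex and its properties
    local_ghosts[msg.ghost_id] = dict(msg.properties, origin=msg.origin_vertex)
    # add to G the edges from V to locally owned destination vertices
    local_forward_edges.setdefault(msg.ghost_id, []).extend(msg.edges_to_receiver)
    # redirect reverse edges from local vertices: destination V becomes G
    for local_vertex, destinations in reverse_edges.items():
        reverse_edges[local_vertex] = [msg.ghost_id if d == msg.origin_vertex else d
                                       for d in destinations]

ghosts, forward, reverse = {}, {}, {3: [7, 2], 5: [7]}
handle_promotion(PromotionMessage(7, 1_000_000, {"name": "hub"}, [3, 5]), ghosts, forward, reverse)
print(ghosts, forward, reverse)
```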
Embodiments of the present invention are used in the context of database management systems (DBMSs). Therefore, a description of an example DBMS is provided.
Generally, a server, such as a database server, is a combination of integrated software components and an allocation of computational resources, such as memory, a node, and processes on the node for executing the integrated software components, where the combination of the software and computational resources are dedicated to providing a particular type of function on behalf of clients of the server. A database server governs and facilitates access to a particular database, processing requests by clients to access the database.
A database comprises data and metadata that is stored on a persistent memory mechanism, such as a set of hard disks. Such data and metadata may be stored in a database logically, for example, according to relational and/or object-relational database constructs.
Users interact with a database server of a DBMS by submitting to the database server commands that cause the database server to perform operations on data stored in a database. A user may be one or more applications running on a client computer that interact with a database server. Multiple users may also be referred to herein collectively as a user.
A database command may be in the form of a database statement. For the database server to process the database statements, the database statements must conform to a database language supported by the database server. One non-limiting example of a database language that is supported by many database servers is SQL, including proprietary forms of SQL supported by such database servers as Oracle (e.g., Oracle Database 11g). SQL data definition language (“DDL”) instructions are issued to a database server to create or configure database objects, such as tables, views, or complex types. Data manipulation language (“DML”) instructions are issued to a DBMS to manage data stored within a database structure. For instance, SELECT, INSERT, UPDATE, and DELETE are common examples of DML instructions found in some SQL implementations. SQL/XML is a common extension of SQL used when manipulating XML data in an object-relational database.
Generally, data is stored in a database in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are typically referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object-oriented databases, the data containers are typically referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology. Systems that implement the present invention are not limited to any particular type of data container or database architecture. However, for the purpose of explanation, the examples and the terminology used herein shall be that typically associated with relational or object-relational databases. Thus, the terms “table”, “row” and “column” shall be used herein to refer respectively to the data container, record, and field.
Query Optimization and Execution Plans
Query optimization generates one or more different candidate execution plans for a query, which are evaluated by the query optimizer to determine which execution plan should be used to compute the query.
Execution plans may be represented by a graph of interlinked nodes, each representing a plan operator or row source. The hierarchy of the graph (i.e., a directed tree) represents the order in which the execution plan operators are performed and how data flows between each of the execution plan operators.
An operator, as the term is used herein, comprises one or more routines or functions that are configured for performing operations on input rows or tuples to generate an output set of rows or tuples. The operations may use interim data structures. The output set of rows or tuples may be used as input rows or tuples for a parent operator.
An operator may be executed by one or more computer processes or threads. Referring to an operator as performing an operation means that a process or thread executing functions or routines of the operator is performing the operation.
A row source performs operations on input rows and generates output rows, which may serve as input to another row source. The output rows may be new rows, and/or a version of the input rows that has been transformed by the row source.
A match operator of a path pattern expression performs operations on a set of input matching vertices and generates a set of output matching vertices, which may serve as input to another match operator in the path pattern expression. The match operator performs logic over multiple vertices/edges to generate the set of output matching vertices for a specific hop of a target pattern corresponding to the path pattern expression.
An execution plan operator generates a set of rows (which may be referred to as a table) as output and execution plan operations include, for example, a table scan, an index scan, sort-merge join, nested-loop join, filter, and importantly, a full outer join.
A query optimizer may optimize a query by transforming the query. In general, transforming a query involves rewriting a query into another semantically equivalent query that should produce the same result and that can potentially be executed more efficiently, i.e. one for which a potentially more efficient and less costly execution plan can be generated. Examples of query transformation include view merging, subquery unnesting, predicate move-around and pushdown, common subexpression elimination, outer-to-inner join conversion, materialized view rewrite, and star transformation.
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example,
Computer system 1000 also includes a main memory 1006, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 1002 for storing information and instructions to be executed by processor 1004. Main memory 1006 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1004. Such instructions, when stored in non-transitory storage media accessible to processor 1004, render computer system 1000 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 1000 further includes a read only memory (ROM) 1008 or other static storage device coupled to bus 1002 for storing static information and instructions for processor 1004. A storage device 1010, such as a magnetic disk, optical disk, or solid-state drive is provided and coupled to bus 1002 for storing information and instructions.
Computer system 1000 may be coupled via bus 1002 to a display 1012, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 1014, including alphanumeric and other keys, is coupled to bus 1002 for communicating information and command selections to processor 1004. Another type of user input device is cursor control 1016, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1004 and for controlling cursor movement on display 1012. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 1000 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 1000 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1000 in response to processor 1004 executing one or more sequences of one or more instructions contained in main memory 1006. Such instructions may be read into main memory 1006 from another storage medium, such as storage device 1010. Execution of the sequences of instructions contained in main memory 1006 causes processor 1004 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 1010. Volatile media includes dynamic memory, such as main memory 1006. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1002. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 1004 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 1000 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 1002. Bus 1002 carries the data to main memory 1006, from which processor 1004 retrieves and executes the instructions. The instructions received by main memory 1006 may optionally be stored on storage device 1010 either before or after execution by processor 1004.
Computer system 1000 also includes a communication interface 1018 coupled to bus 1002. Communication interface 1018 provides a two-way data communication coupling to a network link 1020 that is connected to a local network 1022. For example, communication interface 1018 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1018 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1018 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
Network link 1020 typically provides data communication through one or more networks to other data devices. For example, network link 1020 may provide a connection through local network 1022 to a host computer 1024 or to data equipment operated by an Internet Service Provider (ISP) 1026. ISP 1026 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 1028. Local network 1022 and Internet 1028 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 1020 and through communication interface 1018, which carry the digital data to and from computer system 1000, are example forms of transmission media.
Computer system 1000 can send messages and receive data, including program code, through the network(s), network link 1020 and communication interface 1018. In the Internet example, a server 1030 might transmit a requested code for an application program through Internet 1028, ISP 1026, local network 1022 and communication interface 1018.
The received code may be executed by processor 1004 as it is received, and/or stored in storage device 1010, or other non-volatile storage for later execution.
A computer system process comprises an allotment of hardware processor time, and an allotment of memory (physical and/or virtual), the allotment of memory being for storing instructions executed by the hardware processor, for storing data generated by the hardware processor executing the instructions, and/or for storing the hardware processor state (e.g. content of registers) between allotments of the hardware processor time when the computer system process is not running. Computer system processes run under the control of an operating system, and may run under the control of other programs being executed on the computer system.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.
Software system 1100 is provided for directing the operation of computing device 1000. Software system 1100, which may be stored in system memory (RAM) 1006 and on fixed storage (e.g., hard disk or flash memory) 1010, includes a kernel or operating system (OS) 1110.
The OS 1110 manages low-level aspects of computer operation, including managing execution of processes, memory allocation, file input and output (I/O), and device I/O. One or more application programs, represented as 1102A, 1102B, 1102C . . . 1102N, may be “loaded” (e.g., transferred from fixed storage 1010 into memory 1006) for execution by the system 1100. The applications or other software intended for use on device 1000 may also be stored as a set of downloadable computer-executable instructions, for example, for downloading and installation from an Internet location (e.g., a Web server, an app store, or other online service).
Software system 1100 includes a graphical user interface (GUI) 1115, for receiving user commands and data in a graphical (e.g., “point-and-click” or “touch gesture”) fashion. These inputs, in turn, may be acted upon by the system 1100 in accordance with instructions from operating system 1110 and/or application(s) 1102. The GUI 1115 also serves to display the results of operation from the OS 1110 and application(s) 1102, whereupon the user may supply additional inputs or terminate the session (e.g., log off).
OS 1110 can execute directly on the bare hardware 1120 (e.g., processor(s) 1004) of device 1000. Alternatively, a hypervisor or virtual machine monitor (VMM) 1130 may be interposed between the bare hardware 1120 and the OS 1110. In this configuration, VMM 1130 acts as a software “cushion” or virtualization layer between the OS 1110 and the bare hardware 1120 of the device 1000.
VMM 1130 instantiates and runs one or more virtual machine instances (“guest machines”). Each guest machine comprises a “guest” operating system, such as OS 1110, and one or more applications, such as application(s) 1102, designed to execute on the guest operating system. The VMM 1130 presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems.
In some instances, the VMM 1130 may allow a guest operating system to run as if it is running on the bare hardware 1120 of device 1000 directly. In these instances, the same version of the guest operating system configured to execute on the bare hardware 1120 directly may also execute on VMM 1130 without modification or reconfiguration. In other words, VMM 1130 may provide full hardware and CPU virtualization to a guest operating system in some instances.
In other instances, a guest operating system may be specially designed or configured to execute on VMM 1130 for efficiency. In these instances, the guest operating system is “aware” that it executes on a virtual machine monitor. In other words, VMM 1130 may provide para-virtualization to a guest operating system in some instances.
The above-described basic computer hardware and software is presented for purpose of illustrating the basic underlying computer components that may be employed for implementing the example embodiment(s). The example embodiment(s), however, are not necessarily limited to any particular computing environment or computing device configuration. Instead, the example embodiment(s) may be implemented in any type of system architecture or processing environment that one skilled in the art, in light of this disclosure, would understand as capable of supporting the features and functions of the example embodiment(s) presented herein.
Although some of the figures described in the foregoing specification include flow diagrams with steps that are shown in an order, the steps may be performed in any order, and are not limited to the order shown in those flowcharts. Additionally, some steps may be optional, may be performed multiple times, and/or may be performed by different components. All steps, operations and functions of a flow diagram that are described herein are intended to indicate operations that are performed using programming in a special-purpose computer or general-purpose computer, in various embodiments. In other words, each flow diagram in this disclosure, in combination with the related text herein, is a guide, plan or specification of all or part of an algorithm for programming a computer to execute the functions that are described. The level of skill in the field associated with this disclosure is known to be high, and therefore the flow diagrams and related text in this disclosure have been prepared to convey information at a level of sufficiency and detail that is normally expected in the field when skilled persons communicate among themselves with respect to programs, algorithms and their implementation.
In the foregoing specification, the example embodiment(s) of the present invention have been described with reference to numerous specific details. However, the details may vary from implementation to implementation according to the requirements of the particular implementation at hand. The example embodiment(s) are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
This application is related to U.S. patent application Ser. No. 17/194,165 titled “Fast and memory efficient in-memory columnar graph updates preserving analytical performance” filed on Mar. 5, 2021 by Damien Hilloulin et al., and U.S. patent application Ser. No. 17/479,003 titled “Practical method for fast graph traversal iterators on delta-logged graphs” filed on Sep. 20, 2021 by Damien Hilloulin et al., which are incorporated herein by reference.