Appendix A contains source code for an exemplary implementation of an embodiment of a pattern query language (PQL) parser and lexer.
In a database management system, a graph has one or more nodes (or vertices) that are connected by one or more edges (or links). Each node may have a type or class and at least one value associated with it. A graph database refers to a collection of data that is stored in a graph data structure implemented in a database management system.
Analysts often need to look for patterns in data that can be represented as subgraphs. Using a pattern as a query mechanism is called subgraph isomorphism, graph pattern matching, or pattern query.
To reduce the search time needed for a particular pattern query, it may be desirable to produce an optimal deconstruction of the pattern query. Currently, such deconstruction is accomplished through a combination of visual inspection, experience, experimentation, or other techniques. However, such ad hoc approaches are not practical for complex pattern query searches. For example, although visual cues may be evident from the visual depiction of the pattern, in practice many pattern queries are prepared in pattern query language (PQL) and the PQL representation of the pattern query does not provide such visual cues. Also, it is not desirable to require an analyst to have expertise in query plan optimization.
An embodiment of a system and method optimizes pattern query searches on a graph database. An embodiment of the system and method may be implemented in, for example, 21st Century Technologies' LYNXeon Server™ and LYNXeon Analyst Studio™. The LYNXeon Server enables the ingest of existing databases, graph storage, search and sophisticated analytics, including graph pattern search capabilities that use an embodiment of a pattern query language (PQL) to express link-oriented queries. The LYNXeon Analyst Studio is a workstation tool for accessing the LYNXeon Server platform (i.e., the “LYNXeon platform”). Embodiments of the LYNXeon platform may use a structured query language (SQL)-compliant object-relational database management system, such as PostgreSQL, to provide persistent storage for node and edge data of a graph database.
Appendix A contains source code for an exemplary implementation of a PQL parser and lexer for expressing graph pattern searches. Additional description of embodiments of the LYNXeon platform, including an exemplary graph schema parser and lexer and graph normal form schema generator, can be found in U.S. patent application Ser. No. 11/590,070, entitled “Segment Matching Search System and Method,” filed on Oct. 30, 2006, and incorporated herein by reference in its entirety.
An embodiment of the LYNXeon platform may implement graph pattern matching and other operations on a graph as Java operations on graph elements stored in a memory. The main data structure is a tuple table made up of tuple columns. Each tuple column in the tuple table stores candidates for a specific node in the graph pattern. The tuple columns are filled from left to right as segments are matched as part of the graph pattern matching. A segment, in current embodiments, has two nodes and an edge. In embodiments, each row in the tuple table contains node IDs for a single pattern match. Tuple columns may use two main operation types: streaming and filtering. A streaming operation fills a column with new candidate nodes. All columns except the first may be populated as a function of previous columns. For example, a tuple column may be populated based on the existence of edges between new candidates and candidates in a previous column. A filtering operation removes from a tuple column those candidates that fail an additional pattern constraint, such as candidates that lack an edge to another node in the match, candidates that are not distinct from other nodes in the match, or candidates that do not meet attribute constraints. In embodiments, the tuple table holds a subset of all potential matches in the memory. Those of ordinary skill in the art will appreciate that the amount of available memory constrains the effective size of the graph database and the performance of pattern searches.
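The streaming and filtering operations on a tuple table can be illustrated with a minimal sketch. The sketch is written in Python for brevity rather than Java, and the function names `stream_column` and `filter_rows`, the adjacency-set graph, and the sample node IDs are all hypothetical; it is not the LYNXeon implementation.

```python
from typing import Callable, Dict, List, Set, Tuple

# One row of the tuple table: node IDs for one (partial) pattern match,
# with one entry per tuple column filled so far.
Row = Tuple[int, ...]


def stream_column(rows: List[Row], prev_col: int,
                  edges: Dict[int, Set[int]]) -> List[Row]:
    """Streaming: add a column by extending each row with every node
    that has an edge to the candidate stored in column prev_col."""
    return [row + (cand,)
            for row in rows
            for cand in sorted(edges.get(row[prev_col], ()))]


def filter_rows(rows: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    """Filtering: drop rows whose candidates fail an extra constraint."""
    return [row for row in rows if keep(row)]


# Hypothetical undirected graph stored as adjacency sets.
edges = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

table = [(n,) for n in sorted(edges)]    # first column: every node
table = stream_column(table, 0, edges)   # second column: neighbors of column 0
table = stream_column(table, 1, edges)   # third column: neighbors of column 1
# Distinctness constraint: the third node must differ from the first.
table = filter_rows(table, lambda r: r[2] != r[0])
```

Each surviving row is one match of a three-node straight line path. Because every intermediate row is held in memory, the sketch also shows why available memory bounds the size of graph that can be searched this way.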
An embodiment of a system and method for optimizing pattern query searches on a graph database provides improvements to the pattern query algorithm of an embodiment of the LYNXeon platform. An embodiment of the system and method implements graph pattern matching by translating PQL statements into a planned sequence of SQL operations (i.e., SQL commands, SQL expressions, and SQL statements). An embodiment of the system and method may decrease execution time of pattern searches by overcoming inefficiencies and pathological cases incurred by relying exclusively on database planner translations and the search methodologies of the current embodiment of the LYNXeon platform.
When a pattern contains branches or cycles, search performance degrades sharply compared to the case where the pattern is a straight line path, i.e., straight line nodes joined by edges. A branch is a path within the graph in which all nodes but the end node of the path have a degree of two in the graph. For example, in the pattern query 100 of
For example, the pattern query 100 shown in
The following shows exemplary SQL expressions generated by the PQL-expressed query shown above.
Referring again to
In the example of
An embodiment of the system and method for optimizing pattern query searches on a graph database may include a pattern query optimizer 64 (shown in
The following shows exemplary optimized SQL expressions that implement the search plan noted above that visits all the constrained nodes first.
Referring again to
Specifically,
Referring to
In block 442 of
In block 444 of
In block 446 of
In block 450 of
1. The node is exported in the PQL statement. For example, the node type is identified as a result of the PQL statement.
2. The node is qualified by an attribute or value.
3. The node is included in a constraint expression.
4. The paths formed by splitting at the node are structurally equivalent.
The following pseudo-code shows an exemplary algorithm of identifying branches and cycles within a pattern query and decomposing each identified branch and cycle into equivalent straight line paths.
After deconstruction of the pattern query into subpattern queries (block 440), each subpattern query may be converted (block 440) to an equivalent SQL expression that includes all exported, constrained, and joined nodes. The SQL expression for each subpattern query may be executed (e.g., by segment matching and searching for a match using the SQL expression) and the search results may be stored in a temporary table (block 450). A final SQL expression may be used to join the subpattern queries and save the exported nodes and values (block 460).
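As an illustration only, and not the pseudo-code referred to above, one way to break a pattern into straight line paths is to split it at every node whose degree is not two. The sketch below assumes an undirected pattern with no self-loops or parallel edges; the function name `straight_line_paths` and the node labels are hypothetical, and the sketch ignores exported, qualified, and constrained nodes.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable


def straight_line_paths(edges: List[Tuple[Node, Node]]) -> List[List[Node]]:
    """Split an undirected pattern at nodes whose degree is not two."""
    adj: Dict[Node, Set[Node]] = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Path ends and branch points: every node with degree other than two.
    split = {n for n in adj if len(adj[n]) != 2}
    used: Set[frozenset] = set()
    paths: List[List[Node]] = []

    def walk(start: Node, nxt: Node) -> List[Node]:
        """Follow degree-two nodes until a split node or the start recurs."""
        path = [start, nxt]
        used.add(frozenset((start, nxt)))
        while path[-1] not in split and path[-1] != start:
            prev, cur = path[-2], path[-1]
            step = next(n for n in adj[cur] if n != prev)
            used.add(frozenset((cur, step)))
            path.append(step)
        return path

    # First pass starts at split nodes; the second pass picks up cycles
    # made only of degree-two nodes, broken at an arbitrary node.
    for starts in (sorted(split, key=repr), sorted(adj, key=repr)):
        for s in starts:
            for n in sorted(adj[s], key=repr):
                if frozenset((s, n)) not in used:
                    paths.append(walk(s, n))
    return paths


# A Y-shaped pattern: A-B-C with two branches C-D and C-E.
branch_paths = straight_line_paths(
    [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
```

For the Y-shaped pattern the sketch yields the paths A-B-C, C-D, and C-E, which share node C as the join point. A cycle is returned as a path that starts and ends at the same node.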
Cardinality may be used to improve the performance of pattern searches.
As shown above, a pattern query may be optimized by deconstructing branches and cycles into subpattern queries. As illustrated below, a pattern query may further be optimized by eliminating structurally-equivalent subpattern queries.
An embodiment of the system and method may automatically identify cardinality by identifying equivalent paths within a pattern. Specifically, an embodiment of the system and method may first sort the paths by end nodes (e.g., P1 and P5 in
For example, the pattern query 600 of paths in
1. Paths have the same number of nodes.
2. Nodes in the same position on the path are of the same type.
3. Nodes in the same position on the path have the same qualifications.
4. Only the start and end nodes are shared.
5. None of the non-shared nodes between the paths are exported.
6. Any shared nodes are shared with all paths in the same position.
7. One node is shared between all paths.
8. Non-shared nodes in the paths are not used in PQL value or constraint expressions.
8. Non-shared nodes in the paths are not used in PQL value or constraint expressions.
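Assuming the numbered conditions above are read as tests for treating two paths as equivalent, a subset of them (conditions 1, 2, 3, 5, and 8) might be checked as in the following sketch. The `PatternNode` record, its fields, and the function `paths_equivalent` are hypothetical, and the conditions on which nodes are shared (4, 6, and 7) are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PatternNode:
    # Hypothetical record; the source does not define a node structure.
    name: str                            # pattern variable; equal names mean a shared node
    node_type: str
    qualification: Optional[str] = None  # attribute or value qualifier, if any
    exported: bool = False
    in_expression: bool = False          # used in a PQL value or constraint expression


def paths_equivalent(p: List[PatternNode], q: List[PatternNode]) -> bool:
    """Check conditions 1, 2, 3, 5, and 8 position by position."""
    if len(p) != len(q):                          # condition 1
        return False
    for a, b in zip(p, q):
        if a.node_type != b.node_type:            # condition 2
            return False
        if a.qualification != b.qualification:    # condition 3
            return False
        if a.name != b.name:                      # a non-shared node
            if a.exported or b.exported:              # condition 5
                return False
            if a.in_expression or b.in_expression:    # condition 8
                return False
    return True


# Two hypothetical paths that share their start and end nodes.
path_one = [PatternNode("p", "Person", exported=True),
            PatternNode("f1", "Flight"),
            PatternNode("a", "Airport")]
path_two = [PatternNode("p", "Person", exported=True),
            PatternNode("f2", "Flight"),
            PatternNode("a", "Airport")]
same = paths_equivalent(path_one, path_two)
```

Here `same` is true because the two paths differ only in a non-exported, unqualified middle node of the same type.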
The following shows an exemplary expression of the pattern query 600 illustrated in
The pattern described in
The following shows an exemplary expression of a pattern query that has been converted into SQL, e.g., the optimized SQL expression for the pattern query with the cardinality constraint.
An embodiment of the system and method may also optimize pattern query searches on a graph database by joining nodes with attribute values that satisfy some nearness criteria. Nearness, also known as inferred edges, is the linking of two nodes whose attributes meet some nearness criteria. For example, two flights that take off within a certain amount of time of one another are considered linked. The naive approach compares every flight to every other flight, which has O(n²) complexity for n flights.
Performance may be improved by joining nodes with attribute values that satisfy a nearness criterion of |value1 - value2| <= delta, i.e., values that are within a certain delta of one another. An embodiment of the system and method may first sort the nodes by the attribute values being considered, and then iterate over the sorted nodes to link any contiguous nodes that meet the nearness criterion. This approach may reduce the complexity for numeric attributes to O(n log(n)) or O(n), depending on the sorting algorithm used.
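The sort-then-scan approach can be sketched as follows. The function name `link_near`, the flight labels, and the departure times are hypothetical, and the sketch links every pair within delta by scanning forward from each sorted node, so its running time also grows with the number of links produced.

```python
from typing import List, Tuple


def link_near(nodes: List[Tuple[str, float]],
              delta: float) -> List[Tuple[str, str]]:
    """Link nodes whose numeric attribute values differ by at most delta."""
    # Sort by attribute value so near values become contiguous.
    ordered = sorted(nodes, key=lambda item: item[1])
    links: List[Tuple[str, str]] = []
    for i in range(len(ordered)):
        j = i + 1
        # Scan forward only while the later value stays within delta;
        # every value beyond that point is farther away still.
        while j < len(ordered) and ordered[j][1] - ordered[i][1] <= delta:
            links.append((ordered[i][0], ordered[j][0]))
            j += 1
    return links


# Hypothetical flights with departure times in minutes after midnight.
flights = [("F1", 600), ("F2", 605), ("F3", 650), ("F4", 612)]
inferred = link_near(flights, delta=10)
```

With a delta of 10 minutes the sketch links F1 to F2 and F2 to F4. F1 and F4 are 12 minutes apart and F3 is far from all others, so neither produces a link.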
The exemplary system shown in
Although the computer system 79 is shown with various components, one skilled in the art will appreciate that the computer system 79 can contain additional or different components. In addition, although aspects of an implementation consistent with the method for optimizing pattern query searches on a graph database are described as being stored in memory, one skilled in the art will appreciate that these aspects can also be stored on or read from other types of computer program products or computer-readable media, such as secondary storage devices, including hard disks, floppy disks, or CD-ROM; or other forms of RAM or ROM. The computer-readable media may include instructions for controlling the computer system 79 to perform a particular method.
While the foregoing has been made with reference to a particular embodiment of the invention, it will be appreciated by those skilled in the art that changes in this embodiment may be made without departing from the principles and spirit of the invention.
This application claims the priority of U.S. Provisional Application Ser. No. 61/262,917, entitled “Pattern Query Optimizer and Method of Using Same” and filed Nov. 19, 2009, which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
61262917 | Nov 2009 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 12950582 | Nov 2010 | US
Child | 13856960 | | US