1. Field of Invention
The present invention relates generally to the field of indexing. More specifically, the present invention is related to indexing JavaScript Object Notation (JSON) documents.
2. Discussion of Related Art
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of the JavaScript programming language. A growing number of web-based applications exchange and/or store information in JSON format. Indexing and searching JSON data are critical for those applications.
JSON is built on two structures: (1) a collection of name/value pairs; and (2) an ordered list of values. The formal definition of a JSON value is given below.
For example, the following is a valid JSON object. By convention, objects are enclosed within “{ }” and arrays are enclosed within “[ ]”. Also, strings are quoted and numbers are not quoted. It is important to understand that field names are unique within an object.
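The conventions above can be illustrated with a short Python sketch. The sample document is hypothetical and is not the example from the original disclosure; it only shows how objects, arrays, quoted strings, and unquoted numbers are distinguished when a JSON text is parsed.

```python
import json

# Hypothetical sample document (an assumption for illustration only).
text = '{"name": "server-01", "cpus": 8, "tags": ["web", "prod"], "location": {"rack": 12}}'

doc = json.loads(text)

# Objects become dicts, arrays become lists, quoted strings become str,
# and unquoted numbers become int or float.
kinds = {key: type(value).__name__ for key, value in doc.items()}
print(kinds)
```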
Embodiments of the present invention are an improvement over prior art JSON indexing and searching methods.
The present invention provides for a method of encoding JavaScript Object Notation (JSON) documents in an inverted index, wherein the method comprises the steps of: generating a tree representation of a JSON document; shredding the JSON document into a list of <value, path, type, jdewey> tuples for each atom node, n, in the tree, where value is a label associated with n, path is a concatenation of node labels associated with ancestors of n, starting from a root of the tree, type is a description of a type of value, and jdewey of n is a partial Dewey code of its closest ancestor array node, if one exists, or empty, otherwise; and building an inverted index using <path, type, value> as index term, and jdewey as payload.
The present invention also provides for a method to search the above-mentioned inverted index, wherein the method further comprises the steps of: receiving a search query and constructing a parse tree from said received search query; generating a first evaluation tree from the constructed parse tree to identify a set of candidate JSON documents that match the search query; generating a second evaluation tree from the constructed parse tree to identify a subset of the set of candidate JSON documents that exactly match the search query; and evaluating the received search query based on the parse tree, first evaluation tree, and second evaluation tree, and outputting results of the evaluation. By using two evaluation trees, searching is accomplished via a first phase that identifies potential matching JSON documents using the index without accessing the payload and via a second phase that computes the exact matching JSON documents using said payload.
While this invention is illustrated and described in a preferred embodiment, the invention may be produced in many different configurations. There is depicted in the drawings, and will herein be described in detail, a preferred embodiment of the invention, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and the associated functional specifications for its construction and is not intended to limit the invention to the embodiment illustrated. Those skilled in the art will envision many other possible variations within the scope of the present invention.
Similar to XML, JSON documents are hierarchical. JSON documents can represent data structures and associative arrays (called objects), wherein the data associated with the data structures and/or data associated with the objects may be associated with tangible items. For example, data associated with a JSON representation could be that of an object that describes a physical server.
There exist extensive works on indexing/searching XML data. See, for example, the paper to Kaushik et al. entitled, “On the Integration of Structure Indexes and Inverted Lists”, the paper to Chien et al. entitled, “Efficient Structural Joins on Indexed XML Documents”, the paper to Jiang et al. entitled, “Holistic twig joins on indexed XML documents”, the paper to Yang et al. entitled, “Virtual Cursors for XML Joins”, the paper to Jiang et al. entitled, “Efficient Processing of XML Twig Queries with OR-Predicates”, and the paper to Fontoura et al. entitled, “Optimizing Cursor Movement in Holistic Twig Joins”.
Compared with XPath, the searching of JSON documents is much simplified. When searching XML documents using XPath, the results have to include all possible matching nodes within the document. In contrast, the present invention's method only returns matches at JSON document level. A matching document is returned exactly once no matter how many times a search query matches within a document. This simplifies the present invention's search algorithm significantly since it does not have to maintain stacks for matching nodes within a document and enumerate all possible combinations of them. Also, the search language requires the specification of an exact JSON structure. As a result, all the complexities of dealing with different XPath axes disappear. Returning matches at document level is desirable for many applications. For example, in faceted search, each matching document is counted exactly once per facet. Further, field names are unique within an object in JSON. This is taken advantage of by using “partial” Dewey codes when indexing JSON documents. In contrast, XML indexes have to use full Dewey codes or their equivalent. This saves space for storing the index and also allows for the optimization of certain types of search queries. Since JSON is truly self-describing (no need for DTD or schema as in the case of XML), atomic values can be indexed appropriately according to their types (e.g., for range query).
Indexing JSON Documents in an Inverted Index:
JSON documents are indexed in an inverted index since it is well suited for search over semi-structured data. Given a document in JSON format, a tree representation of the document is created as follows. First, an artificial root node is created and labeled with “/”. Next, a look-up is done at the top level of the JSON document. If it is an atomic value, a child node (referred to as atom node) labeled with the value is added to the root. If it is an object, for each object field, a child node (referred to as field node) labeled with the field name is added to the root. Otherwise, it is an array. For each array element, a child node (referred to as array node) labeled with “$” is added to the root. For the latter two cases, the present invention's method descends to each child node and constructs the rest of the tree recursively based on the lower levels of the JSON document. Given a
The tree representation of d is depicted in
To index a JSON document, a “partial” Dewey code called jdewey is associated with each atom node in the tree. A jdewey code is calculated as follows. First, all array nodes are encoded using multi-part Dewey decimals. The jdewey code for an atom node is the Dewey code of its closest ancestor array node, if one exists, or empty, otherwise. The jdewey codes for d are listed under the atom nodes in
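The tree construction and jdewey assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names shred and json_type, the "/" separator used when concatenating labels into a path, the type names, and the 1-based numbering of array elements (one Dewey part per enclosing array level) are all choices made for this sketch.

```python
def json_type(value):
    # Hypothetical type names. bool is tested before int because
    # bool is a subclass of int in Python.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def shred(doc):
    """Return one (value, path, type, jdewey) tuple per atom in doc."""
    tuples = []

    def walk(node, labels, dewey):
        if isinstance(node, dict):
            # Object: one field node per field name.
            for field_name, child in node.items():
                walk(child, labels + [field_name], dewey)
        elif isinstance(node, list):
            # Array: one array node labeled "$" per element; the Dewey
            # code gains one part (assumed 1-based) at each array level.
            for position, child in enumerate(node, start=1):
                walk(child, labels + ["$"], dewey + (position,))
        else:
            # Atom: path is built from the ancestor labels, and jdewey is
            # the code of the closest ancestor array node (empty if none).
            path = "/" + "/".join(labels)
            jdewey = ".".join(str(part) for part in dewey)
            tuples.append((node, path, json_type(node), jdewey))

    walk(doc, [], ())
    return tuples


sample = {"A": [{"B": "b1", "C": "c1"}, {"B": "b2"}], "D": 7}
for item in shred(sample):
    print(item)
```

In this sketch the two atoms inside the first element of A share the code "1", the atom in the second element gets "2", and the atom under D, which has no ancestor array node, gets an empty code.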
A typical inverted index is organized as a list of ordered index terms. Each term points to a posting list and each posting is a <d, plist> pair, where d is the document ID and plist is an ordered list of positions within the document. Optionally, one can store more information in a payload associated with each position. To build an inverted index, for each <value, path, type, jdewey> tuple generated from a JSON document, written here as <val, p, t, j>, an index term that is the concatenation of p, t, and val is created if one does not already exist. By putting val as the last part of the term, range predicates are supported on the inverted index. The identifier of the document is inserted into the posting list of the index term if it is not there yet, and a new document position is added with j as the payload. The following depicts the layout of an index after d is indexed. Note that the jdewey codes in each payload are also kept in order.
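The index-building step can be sketched in memory as shown below. Several details are assumptions made for illustration: the function name build_index is hypothetical, the index term is modeled as a (path, type, value) tuple standing in for the concatenation of p, t, and val, and each jdewey code is held as a tuple of integers so that payloads sort numerically.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps doc_id -> list of (value, path, type, jdewey) tuples.

    Returns term -> {doc_id: [jdewey payloads in order]}.
    """
    index = defaultdict(dict)
    for doc_id in sorted(docs):                      # postings ordered by doc ID
        for value, path, type_name, jdewey in docs[doc_id]:
            term = (path, type_name, value)          # value is the last part
            index[term].setdefault(doc_id, []).append(jdewey)
    for postings in index.values():
        for payloads in postings.values():
            payloads.sort()                          # keep jdewey codes in order
    return dict(index)


docs = {
    1: [("b1", "/A/$/B", "string", (2,)), ("b1", "/A/$/B", "string", (1,))],
    2: [("b1", "/A/$/B", "string", (1,)), ("c1", "/A/$/C", "string", (1,))],
}
index = build_index(docs)
print(index[("/A/$/B", "string", "b1")])
```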
Searching JSON Documents:
As the present invention involves searching at JSON document level, a search returns a list of IDs of the matching JSON documents, not nodes in documents. For simplicity, in this section, queries containing only equality predicates are discussed. Later, this algorithm is extended to support other kinds of predicates. Also, the algorithm described in this section is optimized for inverted indexes (e.g., Lucene) that store posting lists in two separate files, one for document IDs and one for payloads. Such a design often makes conjunctive queries more efficient because, most of the time, relatively few documents qualify in a query, so it is better to keep the document-level index small. Later, an alternative implementation is discussed for the case in which the posting list is stored in a single file.
Query Language and Parse Tree
A simple and intuitive language is defined to search indexed JSON documents. Consider the following two example queries:
Also, consider the following two JSON documents:
To qualify as a match, a document has to match both the JSON structures as well as the Boolean constraints specified in the query. For example, query P (“&&” specifies conjunction) matches d1, but not d2. The reason is that d2 doesn't have the proper B and C fields within the same JSON object. On the other hand, query Q matches both d1 and d2, since it doesn't require the B field and the C field to be in the same JSON object.
From a search query, a parse tree is first constructed. For example, the parse tree for query P is given in
The doc-tree and jdewey-tree for query P are given in
Evaluating a Query
To evaluate a query, a method EvaluateQuery( ) is called in
EvaluateQuery( ) is now discussed. The method first opens the cursors in the atom nodes in the doc-tree. Specifically, for each atom node in a doc-tree, the corresponding atom node n is located in the original parse tree. A path p is computed for n in a way similar to indexing a JSON document, by concatenating the labels of all ancestors of n (starting from root). Both AND nodes and OR nodes are ignored when computing the path. Finally, an index term is generated by concatenating p, the type and the atomic value associated with node n. For example, the index term for nodes “b1” and “c1” in
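The derivation of an index term from a parse-tree atom node can be sketched as follows. The ParseNode class, its kind values, the "/" path separator, and the shape of the sample tree are hypothetical; the sample tree is not the parse tree of query P from the figures, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseNode:
    kind: str                          # "root", "field", "array", "and", "or", "atom"
    label: object = None
    parent: Optional["ParseNode"] = None


def index_term(atom):
    """Build (path, type, value) for an atom node, skipping AND/OR ancestors."""
    labels = []
    node = atom.parent
    while node is not None:
        if node.kind in ("field", "array"):   # AND, OR and root add no label
            labels.append(node.label)
        node = node.parent
    path = "/" + "/".join(reversed(labels))
    is_number = isinstance(atom.label, (int, float)) and not isinstance(atom.label, bool)
    return (path, "number" if is_number else "string", atom.label)


# Hypothetical parse tree: / -> A -> $ -> AND -> { B -> "b1", C -> "c1" }
root = ParseNode("root", "/")
a = ParseNode("field", "A", root)
elem = ParseNode("array", "$", a)
conj = ParseNode("and", "&&", elem)
b1 = ParseNode("atom", "b1", ParseNode("field", "B", conj))
c1 = ParseNode("atom", "c1", ParseNode("field", "C", conj))
print(index_term(b1), index_term(c1))
```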
The cursor is then opened in node n on the posting list in the inverted index whose term matches the generated one. Once opened, the cursor iterates through postings in the posting list. A skipTo(target) call on the cursor moves it to a posting in which the document ID is larger than or equal to target. A next( ) call returns the document ID in the posting next to the one that the cursor is currently on. EvaluateQuery( ) then calls the main method GetNextMatch( ) in
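The cursor operations described above can be sketched with a sorted list of document IDs. The class name PostingCursor and the current() helper are assumptions of this sketch, as is the choice to make the cursor forward-only and to return None once the list is exhausted.

```python
from bisect import bisect_left


class PostingCursor:
    """Forward-only cursor over a sorted list of IDs (illustrative only)."""

    def __init__(self, ids):
        self._ids = sorted(ids)
        self._pos = 0

    def current(self):
        # None signals that the cursor has run off the end of the list.
        return self._ids[self._pos] if self._pos < len(self._ids) else None

    def skip_to(self, target):
        # Move to the first ID >= target, never moving backward.
        self._pos = bisect_left(self._ids, target, lo=self._pos)
        return self.current()

    def next(self):
        # Move to the posting after the current one and return its ID.
        if self._pos < len(self._ids):
            self._pos += 1
        return self.current()


cursor = PostingCursor([2, 5, 9])
print(cursor.skip_to(4), cursor.next(), cursor.next())
```

The same interface applies in the second phase, where the IDs are jdewey codes from the payload rather than document IDs.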
Once a candidate is found, the method opens the cursors in the atom nodes in the jdewey-tree. Specifically, for each atom node m in the jdewey-tree, the corresponding node m′ is located in the doc-tree, and the posting that the cursor in m′ is currently on is obtained. The cursor is then opened in m to iterate through the positions of that posting. A skipTo(target) call moves the cursor to a position in which the jdewey code (in the payload) is larger than or equal to target. A next( ) call returns the jdewey code in the next position. Subsequently, EvaluateQuery( ) makes the same GetNextMatch( ) call again on the jdewey-tree to check if the candidate is a true match based on the jdewey codes. Finally, it outputs the candidate if a true match is found.
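The two-phase idea can be illustrated with a deliberately simplified sketch. It does not implement GetNextMatch( ) or the evaluation trees; it swaps in plain set intersection and handles only one query shape, a conjunction of terms that must all occur in the same array element (that is, with equal jdewey codes). The function name two_phase_search and the sample index are hypothetical.

```python
def two_phase_search(index, terms):
    """index maps term -> {doc_id: [jdewey codes]}; terms is a list of terms.

    Returns (candidates, matches) for a conjunction whose terms must share
    a jdewey code. A simplified stand-in for the two evaluation trees.
    """
    if not terms:
        return [], []
    # Phase 1: use document IDs only, without touching the payloads.
    doc_sets = [set(index.get(term, {})) for term in terms]
    candidates = sorted(set.intersection(*doc_sets))
    # Phase 2: for each candidate, compare the jdewey codes in the payloads.
    matches = []
    for doc_id in candidates:
        code_sets = [set(index[term][doc_id]) for term in terms]
        if set.intersection(*code_sets):
            matches.append(doc_id)
    return candidates, matches


term_b = ("/A/$/B", "string", "b1")
term_c = ("/A/$/C", "string", "c1")
sample_index = {
    term_b: {1: [(1,)], 2: [(1,)]},
    term_c: {1: [(1,)], 2: [(2,)]},   # in doc 2 the fields sit in different elements
}
print(two_phase_search(sample_index, [term_b, term_c]))
```

In this sample, both documents survive the first phase, but only document 1 survives the second, because in document 2 the two values lie in different array elements.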
The main method GetNextMatch( ) is now discussed. Each node in the evaluation trees has two variables, cur and target. Both cur and target are of the generic base type ID. Again, they are instantiated as document ID in a doc-tree and as jdewey code in a jdewey-tree. As will be seen later, variable cur is always propagated bottom-up whereas variable target is always propagated top-down. If GetNextMatch( ) is called for the first time, it calls an InitCur( ) method to initialize cur in each node. The InitCur( ) method in
If GetNextMatch( ) is not called for the first time, a previous match must have been returned. The method calls UnitAdvance( ) to move target in the root node by a single unit. If target is a document ID, UnitAdvance( ) simply adds one to it. Otherwise, target is a jdewey code, and UnitAdvance( ) adds one to the last part of the code. The method then continues in a loop. If cur in root is a null value, there are no more matches and the method returns a null ID. Otherwise, the method makes a CheckMatch( ) call on the root node, which fulfills two tasks. First, it returns a Boolean value indicating whether a match is found or not. Second, it populates a list lessThanList, including all atom nodes whose cur is less than target. If a match is found, cur in root has the matching ID and is returned. If not, the lessThanList is not empty. The method picks a random node n in that list, and moves cur to the next ID from the cursor that is larger than or equal to target. It then calls PropagateCurUp( ) on node n to propagate cur all the way up to the root node. PropagateCurUp( ) in
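The unit-advance step on the two kinds of ID can be sketched as below. The representation is an assumption of this sketch: a document ID is a Python int and a jdewey code is a non-empty tuple of ints. The behavior for an empty jdewey code is not stated in the text, so the sketch rejects it rather than guess.

```python
def unit_advance(target):
    """Return the smallest ID strictly after target, in the sense described above."""
    if isinstance(target, int):
        # Document ID: simply add one.
        return target + 1
    if not target:
        # Not covered by the description; refuse instead of inventing a rule.
        raise ValueError("empty jdewey code cannot be advanced in this sketch")
    # jdewey code: add one to the last part only.
    *head, last = target
    return (*head, last + 1)


print(unit_advance(7), unit_advance((1, 3)))
```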
CheckMatch( ) is now discussed with respect to
Note that jdewey codes are propagated in the evaluation tree slightly differently from document IDs. When a jdewey code is propagated up from an array node, the last part of the code is stripped off. This is done by customizing the “=” operator in
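The stripping step can be sketched as follows, assuming a jdewey code is held as a tuple of integers. The function name propagate_up_from_array is hypothetical, and the sketch shows only the effect of the step, not how the “=” operator is customized in the figures.

```python
def propagate_up_from_array(code):
    """Drop the last part of a jdewey code when passing above an array node."""
    return code[:-1]


# Two atoms in different elements of an inner array (codes 1.2 and 1.3)
# carry the same code, 1, once they are propagated above that array node.
left = propagate_up_from_array((1, 2))
right = propagate_up_from_array((1, 3))
print(left, right, left == right)
```

After stripping, the codes identify the enclosing element of the outer array, so nodes higher in the tree compare positions at their own nesting level.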
The algorithm is now illustrated through an example. Suppose there exists the following three JSON documents:
After indexing those documents, the index entries look like the following.
Consider the two queries shown earlier:
Suppose that query P, given earlier, is evaluated on the index. Note that only document d3 matches the query exactly. To evaluate P, GetNextMatch( ) is first called on the doc-tree given in
The method goes back to the GetNextMatch( ) call on the doc-tree again. Eventually, the doc-tree becomes
Optimization
For certain queries, a jdewey-tree can be simplified while preserving the correctness of query evaluation. Given a jdewey-tree, a breadth-first traversal of the tree can be made. Every time an array node is encountered, a check is made to see if the node has any AND node among its descendants. If not, the sub-tree rooted at the array node is completely eliminated. For example, the jdewey-tree in
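The pruning rule can be sketched as a breadth-first pass over a small tree structure. The Node class, its kind values, and the sample tree are hypothetical, and the sketch assumes that the root of the jdewey-tree is not itself an array node.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "root", "field", "array", "and", "or", "atom"
    label: str = ""
    children: list = field(default_factory=list)


def has_and_descendant(node):
    return any(c.kind == "and" or has_and_descendant(c) for c in node.children)


def simplify(root):
    """Remove every array-rooted sub-tree that has no AND node below it."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Keep a child unless it is an array node without any AND descendant.
        node.children = [
            child for child in node.children
            if not (child.kind == "array" and not has_and_descendant(child))
        ]
        queue.extend(node.children)
    return root


tree = Node("root", "/", [
    Node("field", "A", [Node("array", "$", [
        Node("and", "&&", [Node("atom", "b1"), Node("atom", "c1")]),
    ])]),
    Node("field", "D", [Node("array", "$", [Node("atom", "d1")])]),
])
simplify(tree)
print([len(child.children) for child in tree.children])
```

In the sample, the array node under A is kept because an AND node lies below it, while the array node under D is removed.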
Extensions
In this section, some extensions to the core algorithm of the previous section are described.
Non-Equality Predicates
The search runtime is not limited to equality predicates. Consider the following queries:
Queries R and S each have a range predicate and query T has a wildcard predicate. Both types of queries are supported through a rewrite of the parse tree. For example, if an atom node in the parse tree is associated with a range, all index terms in the inverted index that fall into the range are identified. Then, the original atom node is replaced with an OR node. For each identified index term, a corresponding atom node is added under the OR node. For example, the original parse tree (
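The range rewrite can be sketched with a sorted list of index terms and a binary search. Assumptions in this sketch: terms are (path, type, value) tuples kept in sorted order, the range is inclusive at both ends, and the rewritten node is represented as a plain dictionary. The function name expand_range and the sample terms are hypothetical.

```python
from bisect import bisect_left, bisect_right


def expand_range(sorted_terms, path, type_name, low, high):
    """Replace a range atom by an OR node over the index terms in [low, high]."""
    # Because the value is the last part of the term, all terms for one
    # path and type are contiguous and ordered by value.
    start = bisect_left(sorted_terms, (path, type_name, low))
    stop = bisect_right(sorted_terms, (path, type_name, high))
    return {
        "kind": "or",
        "children": [{"kind": "atom", "term": term} for term in sorted_terms[start:stop]],
    }


terms = sorted([
    ("/age", "number", 25),
    ("/age", "number", 31),
    ("/age", "number", 40),
    ("/name", "string", "bob"),
])
rewritten = expand_range(terms, "/age", "number", 30, 40)
print([child["term"][2] for child in rewritten["children"]])
```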
An Alternative One-Pass Implementation
The algorithm in the previous section is optimized for inverted indexes that store payloads separately from the document IDs in the posting list. For inverted indexes that store them together, a single-pass algorithm is used by directly calling GetNextMatch( ). Two changes need to be made. First, all ID values are instantiated to <document ID, jdewey code>. Second, UnitAdvance( ) adds one to the document ID part of target and sets the jdewey code to empty.
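The composite ID used by the one-pass variant can be sketched with Python tuples, which compare element by element. The function name unit_advance_one_pass and the tuple representation (an int document ID paired with a jdewey tuple, with the empty tuple standing for an empty code) are assumptions of this sketch.

```python
def unit_advance_one_pass(target):
    """Advance a (doc_id, jdewey) pair to the start of the next document."""
    doc_id, _jdewey = target
    # Add one to the document ID and reset the jdewey code to empty.
    return (doc_id + 1, ())


current = (3, (1, 2))
advanced = unit_advance_one_pass(current)
# The empty code sorts before every non-empty code of the same document,
# so the advanced target precedes all positions in document 4.
print(advanced, current < advanced, advanced <= (4, (1,)))
```

Resetting the jdewey code to empty means that a subsequent skip lands on the first position of the next document, so a document that has already matched is not reported twice.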
Additionally, the present invention provides for an article of manufacture comprising computer readable program code contained within implementing one or more modules to implement a method to index and search JavaScript Object Notation (JSON) documents. Furthermore, the present invention includes a computer program code-based product, which is a storage medium having program code stored therein which can be used to instruct a computer to perform any of the methods associated with the present invention. The computer storage medium includes any of, but is not limited to, the following: CD-ROM, DVD, magnetic tape, optical disc, hard drive, floppy disk, ferroelectric memory, flash memory, ferromagnetic memory, optical storage, charge coupled devices, magnetic or optical cards, smart cards, EEPROM, EPROM, RAM, ROM, DRAM, SRAM, SDRAM, or any other appropriate static or dynamic memory or data storage devices.
The present invention also provides for an article of manufacture having computer usable medium storing computer readable program code implementing a computer-based method of encoding JavaScript Object Notation (JSON) documents in an inverted index, wherein the medium comprises: computer readable program code generating a tree representation of a JSON document; computer readable program code shredding said JSON document into a list of <value, path, type, jdewey> tuples for each atom node, n, in said tree, where value is a label associated with n, path is a concatenation of node labels associated with ancestors of n, starting from a root of said tree, type is a description of a type of value, and jdewey of n is a partial Dewey code of its closest ancestor array node, if one exists, or empty, otherwise; and computer readable program code building an inverted index using <path, type, value> as index term, and jdewey as payload.
The present invention also provides for an article of manufacture having computer usable medium storing computer readable program code implementing a computer-based method to search the above-mentioned inverted index, wherein the medium comprises: computer readable program code receiving a search query and constructing a parse tree from said received search query; computer readable program code generating a first evaluation tree from said constructed parse tree to identify a set of candidate JSON documents that match said search query; computer readable program code generating a second evaluation tree from said constructed parse tree to identify a subset of said set of candidate JSON documents that exactly match said search query; and computer readable program code evaluating said received search query based on said parse tree, first evaluation tree, and second evaluation tree, and outputting results of said evaluation.
The present invention also provides a computer-based system 2202, as shown in
A system and method have been shown in the above embodiments for the effective implementation of a method to index and search JavaScript Object Notation (JSON) objects. While various preferred embodiments have been shown and described, it will be understood that there is no intent to limit the invention by such disclosure, but rather, it is intended to cover all modifications falling within the spirit and scope of the invention, as defined in the appended claims. For example, the present invention should not be limited by software/program, computing environment, or specific computing hardware.
The above enhancements are implemented in various computing environments. For example, the present invention may be implemented on a conventional PC or equivalent, multi-nodal system (e.g., LAN) or networking system (e.g., Internet, WWW, wireless web). All programming and data related thereto are stored in computer memory, static or dynamic, and may be retrieved by the user in any of: conventional computer storage, display (e.g., CRT, LCD, etc.) and/or hardcopy (e.g., printed) formats. The programming of the present invention may be implemented by one having ordinary skill in the art of script programming languages, e.g., JavaScript.