1. Field
This disclosure relates to executing a faceted search within a semi-structured database using a Bloom filter.
2. Description of the Related Art
Databases can store and index data in accordance with a structured data format (e.g., Relational Databases for normalized data queried by Structured Query Language (SQL), etc.), a semi-structured data format (e.g., XMLDBs for Extensible Markup Language (XML) data, RethinkDB for JavaScript Object Notation (JSON) data, etc.) or an unstructured data format (e.g., Key Value Stores for key-value data, ObjectDBs for object data, Solr for free text indexing, etc.). In structured databases, any new data objects to be added are expected to conform to a fixed or predetermined schema (e.g., a new Company data object may be required to be added with Name, Industry and Headquarters values, a new Bibliography data object may be required to be added with Author, Title, Journal and Date values, and so on). By contrast, in unstructured databases, new data objects can be added verbatim, so similar data objects can be added via different formats which may cause difficulties in establishing semantic relationships between the similar data objects.
Semi-structured databases share some properties with both structured and unstructured databases (e.g., similar data objects can be grouped together as in structured databases, while the various values of the grouped data objects are allowed to differ which is more similar to unstructured databases). Semi-structured database formats use a document structure that includes a plurality of nodes arranged in a tree hierarchy. The document structure includes any number of data objects that are each mapped to a particular node in the tree hierarchy, whereby the data objects are indexed either by the name of their associated node (i.e., flat-indexing) or by their unique path from a root node of the tree hierarchy to their associated node (i.e., label-path indexing). The manner in which the data objects of the document structure are indexed affects how searches (or queries) are conducted.
An example relates to a method of performing a search within a semi-structured database that is storing a set of documents, each document in the set of documents being organized with a tree-structure that contains a plurality of nodes, the plurality of nodes for each document in the set of documents including a root node and at least one non-root node, each of the plurality of nodes including a set of node-specific data entries. The example method may include executing, among the set of documents, a first query to determine a first list of nodes that each include at least one node-specific data entry that satisfies the first query. The example method may further include initializing a Bloom filter with the first list of nodes, and filtering a list of candidate nodes for a second query based on the Bloom filter. The example method may further include executing, in conjunction with a faceted search procedure of the set of documents, the second query that uses the filtered list of candidate nodes as a facet to determine a second list of nodes that each includes one or more node-specific data entries from the facet that satisfy the second query.
Another example relates to a server that is configured to perform a search within a semi-structured database that is storing a set of documents, each document in the set of documents being organized with a tree-structure that contains a plurality of nodes, the plurality of nodes for each document in the set of documents including a root node and at least one non-root node, each of the plurality of nodes including a set of node-specific data entries. The server may include means for executing, among the set of documents, a first query to determine a first list of nodes that each include at least one node-specific data entry that satisfies the first query, means for initializing a Bloom filter with the first list of nodes, means for filtering a list of candidate nodes for a second query based on the Bloom filter, and means for executing, in conjunction with a faceted search procedure of the set of documents, the second query that uses the filtered list of candidate nodes as a facet to determine a second list of nodes that each includes one or more node-specific data entries from the facet that satisfy the second query.
Another example relates to a server that is configured to perform a search within a semi-structured database that is storing a set of documents, each document in the set of documents being organized with a tree-structure that contains a plurality of nodes, the plurality of nodes for each document in the set of documents including a root node and at least one non-root node, each of the plurality of nodes including a set of node-specific data entries. The server may include logic configured to execute, among the set of documents, a first query to determine a first list of nodes that each include at least one node-specific data entry that satisfies the first query, logic configured to initialize a Bloom filter with the first list of nodes, logic configured to filter a list of candidate nodes for a second query based on the Bloom filter, and logic configured to execute, in conjunction with a faceted search procedure of the set of documents, the second query that uses the filtered list of candidate nodes as a facet to determine a second list of nodes that each includes one or more node-specific data entries from the facet that satisfy the second query.
Another example relates to a non-transitory computer-readable medium containing instructions stored thereon, which, when executed by a server that is configured to perform a search within a semi-structured database that is storing a set of documents, each document in the set of documents being organized with a tree-structure that contains a plurality of nodes, the plurality of nodes for each document in the set of documents including a root node and at least one non-root node, each of the plurality of nodes including a set of node-specific data entries, cause the server to perform operations. The instructions stored on the non-transitory computer-readable medium may include at least one instruction to cause the server to execute, among the set of documents, a first query to determine a first list of nodes that each include at least one node-specific data entry that satisfies the first query, at least one instruction to cause the server to initialize a Bloom filter with the first list of nodes, at least one instruction to cause the server to filter a list of candidate nodes for a second query based on the Bloom filter and at least one instruction to cause the server to execute, in conjunction with a faceted search procedure of the set of documents, the second query that uses the filtered list of candidate nodes as a facet to determine a second list of nodes that each includes one or more node-specific data entries from the facet that satisfy the second query.
A more complete appreciation of embodiments of the disclosure will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings which are presented solely for illustration and not limitation of the disclosure, and in which:
Aspects of the disclosure are disclosed in the following description and related drawings directed to specific embodiments of the disclosure. Alternate embodiments may be devised without departing from the scope of the disclosure. Additionally, well-known elements of the disclosure will not be described in detail or will be omitted so as not to obscure the relevant details of the disclosure.
The words “exemplary” and/or “example” are used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” and/or “example” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments of the disclosure” does not require that all embodiments of the disclosure include the discussed feature, advantage or mode of operation.
Further, many embodiments are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that various actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions described herein can be considered to be embodied entirely within any form of computer readable storage medium having stored therein a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects of the disclosure may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the embodiments described herein, the corresponding form of any such embodiments may be described herein as, for example, “logic configured to” perform the described action.
A client device, referred to herein as a user equipment (UE), may be mobile or stationary, and may communicate with a wired access network and/or a radio access network (RAN). As used herein, the term “UE” may be referred to interchangeably as an “access terminal” or “AT”, a “wireless device”, a “subscriber device”, a “subscriber terminal”, a “subscriber station”, a “user terminal” or UT, a “mobile terminal”, a “mobile station” and variations thereof. In an embodiment, UEs can communicate with a core network via a RAN, and through the core network the UEs can be connected with external networks such as the Internet. Of course, other mechanisms of connecting to the core network and/or the Internet are also possible for the UEs, such as over wired access networks, WiFi networks (e.g., based on IEEE 802.11, etc.) and so on. UEs can be embodied by any of a number of types of devices including but not limited to cellular telephones, personal digital assistants (PDAs), pagers, laptop computers, desktop computers, PC cards, compact flash devices, external or internal modems, wireless or wireline phones, and so on. A communication link through which UEs can send signals to the RAN is called an uplink channel (e.g., a reverse traffic channel, a reverse control channel, an access channel, etc.). A communication link through which the RAN can send signals to UEs is called a downlink or forward link channel (e.g., a paging channel, a control channel, a broadcast channel, a forward traffic channel, etc.). As used herein the term traffic channel (TCH) can refer to either an uplink/reverse or downlink/forward traffic channel.
Referring to
The Internet 175, in some examples, includes a number of routing agents and processing agents (not shown in
Referring to
While internal components of UEs such as UEs 200A and 200B can be embodied with different hardware configurations, a basic high-level UE configuration for internal hardware components is shown as platform 202 in
Accordingly, an embodiment of the disclosure can include a UE (e.g., UE 200A, 200B, etc.) including the ability to perform the functions described herein. As will be appreciated by those skilled in the art, the various logic elements can be embodied in discrete elements, software modules executed on a processor or any combination of software and hardware to achieve the functionality disclosed herein. For example, the ASIC 208, the memory 212, the API 210 and the local database 214 may all be used cooperatively to load, store and execute the various functions disclosed herein and thus the logic to perform these functions may be distributed over various elements. Alternatively, the functionality could be incorporated into one discrete component. Therefore, the features of UEs 200A and 200B in
The wireless communications between UEs 200A and/or 200B and the RAN 120 can be based on different technologies, such as CDMA, W-CDMA, time division multiple access (TDMA), frequency division multiple access (FDMA), Orthogonal Frequency Division Multiplexing (OFDM), GSM, or other protocols that may be used in a wireless communications network or a data communications network. As discussed in the foregoing and known in the art, voice transmission and/or data can be transmitted to the UEs from the RAN using a variety of networks and configurations. Accordingly, the illustrations provided herein are not intended to limit the embodiments of the disclosure and are merely to aid in the description of aspects of embodiments of the disclosure.
Referring to
In a further example, the logic configured to receive and/or transmit information 305 can include sensory or measurement hardware by which the communications device 300 can monitor its local environment (e.g., an accelerometer, a temperature sensor, a light sensor, an antenna for monitoring local RF signals, etc.). The logic configured to receive and/or transmit information 305 can also include software that, when executed, permits the associated hardware of the logic configured to receive and/or transmit information 305 to perform its reception and/or transmission function(s). However, in various implementations, the logic configured to receive and/or transmit information 305 does not correspond to software alone, and the logic configured to receive and/or transmit information 305 relies at least in part upon hardware to achieve its functionality.
The communications device 300 of
The communications device 300 of
The communications device 300 of
The communications device 300 of
Referring to
Generally, unless stated otherwise explicitly, the phrase “logic configured to” as used throughout this disclosure is intended to invoke an embodiment that is at least partially implemented with hardware, and is not intended to map to software-only implementations that are independent of hardware. Also, it will be appreciated that the configured logic or “logic configured to” in the various blocks are not limited to specific logic gates or elements, but generally refer to the ability to perform the functionality described herein (either via hardware or a combination of hardware and software). Thus, the configured logics or “logic configured to” as illustrated in the various blocks are not necessarily implemented as logic gates or logic elements despite sharing the word “logic.” Other interactions or cooperation between the logic in the various blocks will become clear to one of ordinary skill in the art from a review of the embodiments described below in more detail.
The various embodiments may be implemented on any of a variety of commercially available server devices, such as server 400 illustrated in
Databases can store and index data in accordance with a structured data format (e.g., Relational Databases for normalized data queried by Structured Query Language (SQL), etc.), a semi-structured data format (e.g., XMLDBs for Extensible Markup Language (XML) data, RethinkDB for JavaScript Object Notation (JSON) data, etc.) or an unstructured data format (e.g., Key Value Stores for key-value data, ObjectDBs for object data, Solr for free text indexing, etc.). In structured databases, any new data objects to be added are expected to conform to a fixed or predetermined schema (e.g., a new Company data object may be required to be added with “Name”, “Industry” and “Headquarters” values, a new Bibliography data object may be required to be added with “Author”, “Title”, “Journal” and “Date” values, and so on). By contrast, in unstructured databases, new data objects are added verbatim, which permits similar data objects to be added via different formats, which in turn causes difficulties in establishing semantic relationships between the similar data objects.
Examples of structured database entries for a set of data objects may be configured as follows:
whereby “Name”, “Industry” and “Headquarters” are predetermined values that are associated with each “Company”-type data object stored in the structured database, or
whereby “Author”, “Title”, “Journal” and “Date” are predetermined values that are associated with each “Bibliography”-type data object stored in the structured database.
Examples of unstructured database entries for the set of data objects may be configured as follows:
As will be appreciated, the structured and unstructured databases in Tables 1 and 3 and in Tables 2 and 4 store substantially the same information, with the structured database having a rigidly defined value format for the respective class of data object while the unstructured database does not have defined values associated for data object classes.
Semi-structured databases share some properties with both structured and unstructured databases (e.g., similar data objects can be grouped together as in structured databases, while the various values of the grouped data objects are allowed to differ which is more similar to unstructured databases). Semi-structured database formats use a document structure that includes a set of one or more documents that each have a plurality of nodes arranged in a tree hierarchy. The plurality of nodes are generally implemented as logical nodes (e.g., the plurality of nodes can reside in a single memory and/or physical device), although it is possible that some of the nodes are deployed on different physical devices (e.g., in a distributed server environment) so as to qualify as both distinct logical and physical nodes. Each document includes any number of data objects that are each mapped to a particular node in the tree hierarchy, whereby the data objects are indexed either by the name of their associated node (i.e., flat-indexing) or by their unique path from a root node of the tree hierarchy to their associated node (i.e., label-path indexing). The manner in which the data objects of the document structure are indexed affects how searches (or queries) are conducted.
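The difference between flat-indexing and label-path indexing can be illustrated with a minimal Python sketch. The tuple-based document, the "Examiner" branch, the sample values and the `build_indexes` helper are all hypothetical and chosen only for illustration; an actual semi-structured database would typically index byte ranges rather than raw values.

```python
from collections import defaultdict

# Hypothetical document: each node is (name, children) or (name, text value).
doc = ("Patent", [
    ("Inventor", [("Name", [("First", "Alice"), ("Last", "Brown")])]),
    ("Examiner", [("Name", [("First", "Carol"), ("Last", "Smith")])]),
])

def build_indexes(node, prefix="", flat=None, by_path=None):
    """Walk the tree once, indexing each leaf value by node name and by full path."""
    flat = defaultdict(list) if flat is None else flat
    by_path = defaultdict(list) if by_path is None else by_path
    name, content = node
    path = f"{prefix}/{name}"
    if isinstance(content, list):
        # Interior node: recurse into each child with the extended path.
        for child in content:
            build_indexes(child, path, flat, by_path)
    else:
        # Leaf node: flat-indexing keys on the node name alone,
        # label-path indexing keys on the unique path from the root.
        flat[name].append(content)
        by_path[path].append(content)
    return flat, by_path

flat_index, label_path_index = build_indexes(doc)
print(flat_index["Last"])                                # ['Brown', 'Smith']
print(label_path_index["/Patent/Inventor/Name/Last"])    # ['Brown']
```

In this sketch the flat index cannot distinguish an inventor's last name from an examiner's last name without further work, whereas the label-path index resolves the distinction directly from the key.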
To put the document depicted in
The document structure of a particular document in a semi-structured database can be indexed in accordance with a flat-indexing protocol or a label-path protocol. For example, in the flat-indexing protocol (sometimes referred to as a “node indexing” protocol) for an XML database, each node is indexed with a document identifier at which the node is located, a start-point and an end-point that identifies the range of the node, and a depth that indicates the node's depth in the tree hierarchy of the document (e.g., in
whereby each number represents a location of the document structure that can be used to define the respective node range, as shown in Table 8 as follows:
Accordingly, the “Inventor” context path 605A of
When a node stores a value, the value itself can have its own index. Accordingly, the value of “Brown” 650A as shown in
The flat-indexing protocol uses a brute-force approach to resolve paths. In an XML-specific example, an XPath query for /Patent/Inventor/Name/Last would require separate searches to each node in the address (i.e., “Patent”, “Inventor”, “Name” and “Last”), with the results of each query being joined with the results of each other query, as follows:
Label-path indexing is described in a publication by Goldman et al. entitled “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases”. Generally, label-path indexing is an alternative to flat-indexing, whereby the path to the target node is indexed in place of the node identifier of the flat-indexing protocol, as follows:
whereby each number represents a location of the document structure that can be used to define the respective node range, and each letter label (A through I) identifies a context path to a particular node or value, as shown in Table 11 as follows:
Accordingly, with respect to Tables 10-11, the “Inventor” node 605A of
More detailed XML descriptions will now be provided. At the outset, certain XML terminology is defined as follows:
In Table 9 with respect to the flat-indexed protocol, it will be appreciated that the XPath query directed to /Patent/Inventor/Name/Last required four separate lookups for each of the nodes “Patent”, “Inventor”, “Name” and “Last”, along with three joins on the respective lookup results. By contrast, a similar XPath query directed to /Patent/Inventor/Name/Last using the label-path indexing depicted in Tables 10-11 would have a compiled query of lookup (E) based on the path /Patent/Inventor/Name/Last being defined as path “E”.
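The contrast between the two compiled queries can be sketched in Python. This is a minimal illustration under stated assumptions: the (start, end, depth) ranges, the "Examiner" branch and the helper names are hypothetical and are not taken from the tables; only the path /Patent/Inventor/Name/Last and its label "E" come from the description above.

```python
# Hypothetical flat index: node name -> list of (start, end, depth) ranges.
flat_index = {
    "Patent":   [(0, 100, 0)],
    "Inventor": [(5, 45, 1)],
    "Examiner": [(50, 95, 1)],
    "Name":     [(10, 40, 2), (55, 90, 2)],
    "Last":     [(22, 38, 3), (70, 88, 3)],
}

# Hypothetical label-path index: context-path label -> ranges of matching nodes.
path_labels = {"/Patent/Inventor/Name/Last": "E"}
label_path_index = {"E": [(22, 38, 3)]}

def contains(parent, child):
    """True if child lies inside parent's range and is exactly one level deeper."""
    return parent[0] < child[0] and child[1] < parent[1] and child[2] == parent[2] + 1

def flat_query(path):
    """Resolve a path with one lookup per step and one join per parent/child pair."""
    steps = path.strip("/").split("/")
    current = flat_index.get(steps[0], [])
    lookups, joins = 1, 0
    for name in steps[1:]:
        children = flat_index.get(name, [])
        lookups += 1
        # Join: keep only children nested directly inside a surviving parent.
        current = [c for c in children if any(contains(p, c) for p in current)]
        joins += 1
    return current, lookups, joins

def label_path_query(path):
    """Resolve a path with a single lookup on its context-path label."""
    return label_path_index.get(path_labels.get(path), [])

print(flat_query("/Patent/Inventor/Name/Last"))        # ([(22, 38, 3)], 4, 3)
print(label_path_query("/Patent/Inventor/Name/Last"))  # [(22, 38, 3)]
```

Both calls return the same node range, but the flat query performs four lookups and three joins while the label-path query performs one lookup.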
Generally, the label-path indexing protocol is more efficient for databases with a relatively low number of context paths for a given node name (e.g., less than a threshold such as 100), with the flat-indexing protocol overtaking the label-path indexing protocol in terms of query execution time as the number of context paths increases.
A number of different example XML document structures are depicted below in Table 12 including start and end byte offsets:
whereby each number represents a location of the document structure that can be used to define the respective node range, and each letter label identifies a context path to a particular node or value as depicted in
Next, a flat simple content index for the documents depicted in Table 12 is as follows:
Next, a flat element index for the documents depicted in Table 12 is as follows,
Faceted searching is one type of search that can be conducted within a semi-structured database. In a faceted search, search criteria (or facets) configured to satisfy a series of search queries are modified in a recursive manner so as to narrow the number of search results (or nodes) returned to a client device that initiated the faceted search. For example, a first query can be performed on a book database with “fiction novels” to obtain a list of search results. This list of search results can then be used as a filter (or facet) for a second query of “1980s books” so that the second query returns a list of fiction books from the 1980s. However, executing recursive search queries on the semi-structured database is costly in terms of resource consumption, as is joining the search results from the series of search queries so as to exclude nodes that are not part of each search query's search results.
Upon being prompted with the facet prompt screen 800, the user of the client device may select one or more of the illustrated search filters (e.g., “2002” and “Agatha Christie”, etc.) to narrow down the first list of nodes, in response to which the semi-structured database server 170 executes a second query by searching the set of documents in the semi-structured database to determine a second list of nodes that each include one or more node-specific data entries (or values) that satisfy the second search query, in block 710. Upon obtaining the second list of nodes, the semi-structured database server 170 executes a join operation on the first and second lists of nodes in order to provide a reduced set of search results to the user of the client device, in block 715. The join operation of block 715 functions to exclude search results in the second list of nodes which do not match any corresponding search result in the first list of nodes, such that the first list of nodes can be characterized as a facet (or filter) of the second query.
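The join-based faceted search described above can be sketched in Python as follows. The `books` records, the node identifiers and the `run_query` helper are hypothetical stand-ins for the set of documents and the query engine; the point of the sketch is that the second query scans every document and the join then discards part of its output.

```python
# Hypothetical node-specific data entries keyed by node identifier.
books = {
    1: {"genre": "fiction", "year": 1984},
    2: {"genre": "fiction", "year": 2002},
    3: {"genre": "history", "year": 1987},
    4: {"genre": "fiction", "year": 1989},
}

def run_query(nodes, predicate):
    """Scan the given node identifiers and keep those whose entry satisfies the predicate."""
    return [nid for nid in nodes if predicate(books[nid])]

# First query over the full set of documents.
first_list = run_query(books, lambda b: b["genre"] == "fiction")

# Second query also searches the full set of documents.
second_list = run_query(books, lambda b: 1980 <= b["year"] <= 1989)

# Join: exclude second-query results that have no match in the first list.
first_set = set(first_list)
joined = [nid for nid in second_list if nid in first_set]

print(first_list)   # [1, 2, 4]
print(second_list)  # [1, 3, 4]
print(joined)       # [1, 4]
```

Here node 3 is found by the second query and then thrown away by the join, which is the wasted work that grows with the size of the database.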
As will be appreciated, caching a large number of search results for use as a facet in block 705 of
Referring to
Similar to the example described above with respect to
In one embodiment, as shown in
In an example, to initialize the Bloom filter at block 905, an array of m bits is generated which, at first, are each initialized to a first logic state (e.g., a de-asserted state, such as “0”). Each node in the first list of nodes is iteratively added to the array of m bits by applying k independent hash functions to node-specific data (e.g., the node's identifier or path), and using each of the resulting k hash values to address and assert (e.g., set to a second logic value, such as “1”) a bit within the array of bits. For each node added, up to k bits within the array will be asserted (fewer if two hash values address the same bit or if a bit was already asserted by an earlier node). When this is completed, the Bloom filter is said to be initialized with the first list of nodes. A candidate node can be tested to determine whether the candidate node is already part of the first list of nodes by applying the k hash functions to the node-specific data of the candidate node and then comparing each resulting hash address value to the Bloom filter. If any of the k bits for the candidate node is set to the first logic value (or de-asserted), the candidate node can be ruled out as being part of the first list of nodes. If each of the k bits is asserted, there is a relatively high likelihood that the candidate node is already part of the first list of nodes, although this is not guaranteed since the k bits may have been asserted based on the hash values of other nodes in the first list of nodes. Bloom filters are well-known in the art, and will not be described in further detail for the sake of brevity.
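The initialization and membership test described above can be sketched in Python. This is an illustrative sketch only: the `BloomFilter` class, the values of m and k, and the use of index-salted SHA-256 digests as a stand-in for k independent hash functions are assumptions, and one byte is used per bit purely for readability.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch with m bit positions and k hash addresses per item."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # every position starts de-asserted (0)

    def _positions(self, item):
        # Assumption: salting SHA-256 with the hash index approximates
        # k independent hash functions over the node-specific data.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Assert the bit addressed by each of the k hash values.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any de-asserted bit rules the item out; all asserted means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

# Hypothetical first list of nodes, identified here by label path.
first_list = ["/Patent/Inventor/Name/Last", "/Patent/Inventor/Name/First"]
bloom = BloomFilter(m=1024, k=3)
for node in first_list:
    bloom.add(node)

print(all(bloom.might_contain(node) for node in first_list))  # True
```

A node that was added is never ruled out, so the filter produces no false negatives. For n added nodes, the false-positive probability is commonly approximated as (1 - e^(-kn/m))^k, which is why m and k are usually sized to the expected length of the first list of nodes.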
Referring back to
After the list of candidate nodes is filtered at block 910, in block 915, the semi-structured database server 170 executes the second query by using the filtered list of candidate nodes (as opposed to the entire list of candidate nodes and/or all nodes in the set of documents) as a facet to determine a second list of nodes that each includes one or more node-specific data entries from the facet that satisfy the second query.
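The sequence of initializing the Bloom filter, filtering the candidate nodes and executing the second query over the filtered list might be sketched in Python as follows. The `books` records, the filter parameters `M` and `K`, and the helper names are hypothetical, and index-salted SHA-256 is assumed as a stand-in for the k hash functions.

```python
import hashlib

M, K = 4096, 4  # hypothetical filter size in bits and number of hash functions

def positions(node_id):
    """Yield the K bit addresses for a node identifier."""
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{node_id}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % M

def init_filter(nodes):
    """Build a bit array with the addressed bits asserted for every node."""
    bits = bytearray(M)
    for node_id in nodes:
        for pos in positions(node_id):
            bits[pos] = 1
    return bits

def might_contain(bits, node_id):
    return all(bits[pos] for pos in positions(node_id))

# Hypothetical node-specific data entries keyed by node identifier.
books = {
    1: {"genre": "fiction", "year": 1984},
    2: {"genre": "fiction", "year": 2002},
    3: {"genre": "history", "year": 1987},
    4: {"genre": "fiction", "year": 1989},
}

# First query, then initialize the Bloom filter with its results.
first_list = [nid for nid, b in books.items() if b["genre"] == "fiction"]
bloom = init_filter(first_list)

# Filter the candidate nodes for the second query through the Bloom filter.
candidates = list(books)
facet = [nid for nid in candidates if might_contain(bloom, nid)]

# Execute the second query against the facet only, not the full candidate list.
second_list = [nid for nid in facet if 1980 <= books[nid]["year"] <= 1989]
print(second_list)
```

Every node from the first list survives the filtering, so the second query cannot miss a valid result; a node outside the first list can occasionally survive as a false positive, so the printed list is guaranteed to contain nodes 1 and 4 but is not guaranteed to contain only those nodes.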
As will be appreciated, the filtering performed in block 910 may result in a few false positives in the filtered list of candidate nodes, for example, due to the Bloom filter coincidentally having a set of asserted bits in the bit array that align with one or more nodes in the list of candidate nodes which are not actually part of the first list of nodes. In some examples, in block 920, the semi-structured database server 170 can error-check the second list of nodes by comparing the second list of nodes with the first list of nodes (e.g., via a join operation). The error-checking in block 920 (e.g., a join operation) can be similar in some respects to block 715, except that the error-checking of block 920 is most likely conducted with less information by virtue of the filtering that occurs at block 910. This may reduce the overall resource consumption of the error-checking of block 920 at the semi-structured database server 170. While not shown explicitly in
At this point of the process of
In one example, in block 1050, the semi-structured database server 170 updates (or re-initializes) the Bloom filter using the second list of nodes provided to the client device. For example, the Bloom filter is initialized to the first list of nodes at block 1020, but the second list of nodes is likely to be smaller than the first list of nodes. Accordingly, updating or re-initializing the Bloom filter to the second list of nodes at block 1050 will help to narrow any further queries issued in association with the faceted search procedure. After block 1045, if the user determines to specify additional search parameters in a new query for further narrowing the search results in association with the faceted search procedure, the process returns to block 1025, after which blocks 1025-1050 can repeat until the faceted search procedure terminates. For example, the filtering of block 1030 for the new query can be based upon the updated Bloom filter from block 1050. If performed, the error-check of block 1040 can compare a current list of nodes obtained from execution of the search query at block 1035 to one or more lists of nodes returned for each previous query conducted in association with the faceted search procedure. For example, for an Nth search query in the process depicted in
While the processes are described as being performed by the semi-structured database server 170, as noted above, the semi-structured database server 170 can be implemented as a client device, a network server, an application that is embedded on a client device and/or network server, and so on. Hence, the apparatus that executes the processes in various example embodiments is intended to be interpreted broadly.
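The repeated narrowing described above, in which the Bloom filter is re-initialized from each new list of nodes and an optional error-check removes false positives, can be sketched in Python. The `records` data, the `faceted_search` function, the integer bitmask representation and the parameters `M` and `K` are assumptions made for illustration; the error-check here compares only against the immediately preceding list of nodes.

```python
import hashlib

M, K = 2048, 3  # hypothetical filter size in bits and number of hash functions

def _positions(node_id):
    """Return the K bit addresses for a node identifier (salted SHA-256 stand-in)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{node_id}".encode("utf-8")).digest()[:8], "big") % M
        for i in range(K)
    ]

def build_filter(nodes):
    """(Re-)initialize a Bloom filter, held as an integer bitmask, from a list of nodes."""
    mask = 0
    for node_id in nodes:
        for pos in _positions(node_id):
            mask |= 1 << pos
    return mask

def passes(mask, node_id):
    return all((mask >> pos) & 1 for pos in _positions(node_id))

def faceted_search(records, predicates):
    """Apply each query predicate in turn, rebuilding the filter after every round."""
    results = [nid for nid, rec in records.items() if predicates[0](rec)]
    for predicate in predicates[1:]:
        mask = build_filter(results)                           # filter from latest results
        facet = [nid for nid in records if passes(mask, nid)]  # filter the candidates
        hits = [nid for nid in facet if predicate(records[nid])]
        previous = set(results)
        results = [nid for nid in hits if nid in previous]     # error-check via join
    return results

# Hypothetical node-specific data entries keyed by node identifier.
records = {
    1: {"genre": "fiction", "year": 1984, "author": "A"},
    2: {"genre": "fiction", "year": 2002, "author": "A"},
    3: {"genre": "history", "year": 1987, "author": "B"},
    4: {"genre": "fiction", "year": 1989, "author": "B"},
}
queries = [
    lambda r: r["genre"] == "fiction",
    lambda r: 1980 <= r["year"] <= 1989,
    lambda r: r["author"] == "A",
]
print(faceted_search(records, queries))  # [1]
```

Because each round's error-check joins against a list that has already been narrowed by the filter, the join operates on less data than a join over unfiltered query results would.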
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The methods, sequences and/or algorithms described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal (e.g., UE). In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
While the foregoing disclosure shows illustrative embodiments of the disclosure, it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the embodiments of the disclosure described herein need not be performed in any particular order. Furthermore, although elements of the disclosure may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
The present application for patent claims the benefit of U.S. Provisional Application No. 62/180,947, entitled “EXECUTING A FACETED SEARCH WITHIN A SEMI-STRUCTURED DATABASE USING A BLOOM FILTER”, filed Jun. 17, 2015, assigned to the assignee hereof, and expressly incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
62180947 | Jun 2015 | US |