This disclosure relates in general to the field of data processing and storage, and more particularly, though not exclusively, to database management using relational interval trees with distinct boundaries.
A database typically includes a collection of data records with various fields containing the respective attributes of each data record. Further, database indexes are often created to enable certain fields of the database to be queried more efficiently. It can be challenging, however, to efficiently query a database based on fields within the database that contain intervals.
According to one aspect of the present disclosure, a searchable interval associated with a database is defined, and an interval tree corresponding to the searchable interval is generated. The interval tree comprises a plurality of interval nodes arranged in a plurality of tiers, wherein the plurality of interval nodes correspond to a plurality of tree intervals that collectively span the searchable interval, and wherein the plurality of tree intervals comprise interval boundaries that are distinct among the plurality of tiers. A database query comprising an interval query parameter is then received, and one or more responsive data records corresponding to the database query are identified, wherein the one or more responsive data records are identified from a plurality of data records in the database based on the interval tree and the interval query parameter. The one or more responsive data records are then retrieved from the database, and a response to the database query is generated based on the one or more responsive data records.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely as hardware, entirely as software (including firmware, resident software, micro-code, etc.), or as a combination of software and hardware implementations, all of which may generally be referred to herein as a “circuit,” “module,” “component,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer readable media may be utilized. The computer readable media may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an appropriate optical fiber with a repeater, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, CII, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on a user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider), or in a cloud computing environment, or offered as a service such as a Software as a Service (SaaS).
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that when executed can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions when stored in the computer readable medium produce an article of manufacture including instructions which when executed, cause a computer to implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable instruction execution apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatuses, or other devices, to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Example embodiments that may be used to implement the features and functionality of this disclosure will now be described with more particular reference to the attached FIGURES.
In the illustrated embodiment, for example, computing system 100 includes cloud-based servers and/or mainframes 110, client devices 140a-c, and one or more networks 150. Cloud servers 110 include one or more computing applications 120 and a database management system (DBMS) 130. Computing applications 120 and database management system (DBMS) 130 may be executed on one or more processors within cloud servers 110, and their underlying data (e.g., databases 134 and/or interval trees 135) may be stored on one or more memory components and/or data storage devices within cloud servers 110. Client devices 140a-c can include any type of device that communicates or interacts with cloud servers 110, such as mobile devices, laptops, desktops, kiosks, and/or other datacenter servers, mainframes, or processing devices. Moreover, components of computing system 100—such as cloud servers 110 and client devices 140—may communicate with each other over network 150. Database management system (DBMS) 130 includes a database management engine 132, along with one or more databases 134, and one or more relational interval trees 135 with distinct boundaries. Database management engine 132 manages access to the databases 134, which includes processing and responding to database queries that are received from applications 120 and/or client devices 140. Moreover, database management engine 132 uses interval trees 135 with distinct boundaries to process interval queries more efficiently, as described further below.
Large enterprises, such as businesses and other organizations, typically generate and/or consume massive volumes of data throughout the ordinary course of business. For example, a business often runs a variety of computing applications that are designed to facilitate or streamline regular business operations, such as applications that provide certain services for the benefit of customers, employees, and/or the business itself. These types of computing applications generally consume and/or generate large volumes of data throughout the course of their operation. Further, numerous reports associated with this data are often generated by a business on a regular basis. In this manner, large volumes of database queries are continuously received from various sources that need access to certain data and/or reports, such as applications, system administrators, employees, customers, and so forth. Accordingly, enterprises must be capable of processing database queries in an efficient manner.
A database typically includes a collection of data records with various fields containing the respective attributes of each data record. Moreover, database indexes are often created to enable certain fields of the database to be queried more efficiently. It can be challenging, however, to efficiently query the database based on fields that contain intervals. For example, in some cases, the database may include fields that store certain types of intervals associated with the data records, such as time intervals, geographical or distance intervals, financial intervals, and/or any other type of numerical interval.
For example, time-series data typically contains data records that each have a defined validation interval, such as a start_time and end_time. Moreover, queries for these data records may specify a ‘time’ property, which can be used to identify valid or responsive data records that match the following search criteria:
rec.start_time<=‘time’
AND ‘time’<rec.end_time
However, this type of query may require a full scan of the database since the condition only requires a partial match (e.g., less than or equal) rather than an exact match (e.g., equal).
Accordingly, in some cases, a relational interval tree (RIT) may be leveraged to process interval queries more efficiently. A relational interval tree, however, typically has collisions among the node boundaries of different tiers, which can result in poor performance in certain circumstances. Accordingly, this disclosure presents various embodiments of relational interval trees with distinct boundaries, which greatly improve the speed of interval queries, while also eliminating the performance drawbacks associated with interval trees whose boundaries collide across different tiers.
For example, a relational interval tree can be implemented using the following methods, where the input arguments can be integers, longs, or timestamps (e.g., for time-series data):
int getFork(startValue, endValue)
int[]getForksForValue(value)
The getFork( ) method is used for storing data records in a database—its input arguments represent an interval associated with a particular data record, while its output represents a corresponding “fork” node in an interval tree, which is stored or referenced in an indexed field of the data record for the purpose of responding to interval queries.
The getForksForValue( ) method is used for responding to interval queries—its input argument is an interval value associated with a search parameter from an interval query, while its output is an array of all possible fork values from the interval tree that may contain data records with matching intervals.
The relationship between these methods can be represented as follows:
getForksForValue(value) CONTAINS getFork(startValue, endValue)
in case that value IS BETWEEN startValue AND endValue.
Accordingly, an interval query associated with these data records can be enriched using a fork lookup:
rec.fork IN getForksForValue('time')
AND rec.start_time<=‘time’
AND ‘time’<rec.end_time
In this manner, assuming rec.fork is an indexed field, a subset of the data records is initially preselected based on the potential fork values, and that subset of data records is then evaluated one-by-one to identify matching records with an appropriate start_time and end_time.
This usage applies regardless of whether the underlying relational interval tree has colliding boundaries or distinct boundaries, as the primary difference between those types of interval trees is based on how the getFork( ) and getForksForValue( ) methods are implemented. For example, as discussed further below, an interval tree with distinct boundaries provides a more uniform distribution of fork values used to cache the data records, which reduces the number of data records that are preselected and checked one-by-one when responding to interval queries.
Additional details and embodiments associated with distinct-boundary relational interval trees are described throughout this disclosure in connection with the remaining FIGURES.
In general, elements of computing system 100, such as “systems,” “servers,” “mainframes,” “devices,” “clients,” “networks,” “computers,” and any components thereof, may be used interchangeably herein and refer to computing devices operable to receive, transmit, process, store, or manage data and information associated with computing system 100. Moreover, as used in this disclosure, the term “computer,” “processor,” “processor device,” or “processing device” is intended to encompass any suitable processing device. For example, elements shown as single devices within computing system 100 may be implemented using a plurality of computing devices and processors, such as server pools comprising multiple server computers. Further, any, all, or some of the computing devices may be adapted to execute any operating system, including Linux, other UNIX variants, Microsoft Windows, Windows Server, Mac OS, Apple iOS, Google Android, etc., as well as virtual machines adapted to virtualize execution of a particular operating system, including customized and/or proprietary operating systems.
Moreover, elements of computing system 100 (e.g., cloud servers 110, applications 120, database management system 130, client devices 140a-c, network 150, and so forth) may each include one or more processors, computer-readable memory, and one or more communication interfaces, among other features and hardware. Servers and mainframes may include any suitable software component or module, or computing device(s) capable of hosting and/or serving software applications and services, including distributed, enterprise, or cloud-based software applications, data, and services. For instance, one or more of the described components of computing system 100 may be at least partially (or wholly) cloud-implemented, “fog”-implemented, web-based, or distributed for remotely hosting, serving, or otherwise managing data, software services, and applications that interface, coordinate with, depend on, or are used by other components of computing system 100. In some instances, elements of computing system 100 may be implemented as some combination of components hosted on a common computing system, server, server pool, or cloud computing system, and that share computing resources, including shared memory, processors, and interfaces.
Further, the network(s) 150 used to communicatively couple the components of computing system 100 may be implemented using any suitable computer communication or network technology for facilitating communication between the participating components. For example, one or a combination of local area networks, wide area networks, public networks, the Internet, cellular networks, Wi-Fi networks, short-range networks (e.g., Bluetooth or ZigBee), and/or any other wired or wireless communication medium may be utilized for communication between the participating devices, among other examples.
While
Additional embodiments and functionality associated with the implementation of computing system 100 are described further in connection with the remaining FIGURES. Accordingly, it should be appreciated that computing system 100 of
The flowchart may begin at block 202 by defining the boundaries of the searchable interval associated with the interval tree. In particular, the minimum and maximum values of the searchable interval are defined up front in order to limit the size and scope of the interval tree. For example, with respect to time-series data (e.g., data associated with temporal intervals), the minimum time could be set to Jan. 1, 1970 (e.g., the beginning of UNIX epoch time) and the maximum time could be set to Dec. 31, 2200 (e.g., an arbitrary time in the future). This will enable data records to be stored and searched over a 230-year time period using the interval tree.
The flowchart may then proceed to block 204 to define the minimum granularity of the interval tree. In particular, the interval tree algorithm needs to define a minimum fork granularity, which represents the maximum distance of interval values that produces the same result for getForksForValue(value). Low granularity means that a single fork node will be used to store more data records, while high granularity means that getForksForValue(value) will return more fork nodes for the subsequent search of the indexed data records. The optimal choice for granularity depends on the typical interval size associated with the data records. In some embodiments, the optimal granularity can be determined based on performance tests, machine learning analysis, and so forth. The decision about granularity is defined up front.
The flowchart may then proceed to block 206 to compute a domain space based on the searchable interval (block 202) and the minimum granularity (block 204). For example, assuming the minimum interval boundary is defined as Jan. 1, 1970, the maximum interval boundary is defined as Dec. 31, 2200, and the minimum granularity is defined as one hour, then the corresponding domain space has a size of 2,014,800 values, which represents the number of hours in 230 years. With respect to a particular domain space, the ‘domain-value’ of any given ‘value’ within the searchable interval is defined as
If the difference between two interval values is greater than defined granularity, those interval values have different domain-values. The collection of all domain-values in the domain space are in the interval from
The flowchart may then proceed to block 208 to generate a tier of nodes in the interval tree based on the computed domain space (block 206). In particular, each tier of interval nodes is generated based on a particular domain space, and each interval node in a given tier corresponds to one of the domain-values in that domain space. Moreover, a ‘tierN-value’ is assigned to each interval node in a given tier based on its corresponding domain-value. The tierN-value is a simple mapping of the domain-value in order to avoid collisions among the values assigned to interval nodes of different tiers. In this manner, each interval node in the resulting tree will be assigned with a unique value that serves as its corresponding “fork” node value. In some embodiments, for example, the tierN-value could be defined as tierN-value=domain-value*2N+N, where N represents the number of the associated tier. In other embodiments, however, any type of mapping can be used that avoids collisions of the interval node values across different tiers (e.g., shifting the domain-values in each tier by the maximum value of the corresponding domain space).
Each successive tier of interval nodes is generated in the manner described above but using a different domain space. In particular, the domain space for each successive tier is computed by doubling the prior granularity and shifting the minimum interval boundary by half of the prior granularity. Successive tiers of interval nodes are continuously generated in this manner until the granularity of the domain space for a generated tier equals or exceeds the size of the defined searchable interval.
For example, during the initial iteration of block 208, a 1st tier of interval nodes is generated based on the initial domain space (e.g., block 206) computed from the searchable interval and granularity that were originally defined at the outset (e.g., blocks 202, 204). In particular, the initial domain space was computed over an interval from Jan. 1, 1970-Dec. 31, 2200 using a granularity of one hour, resulting in a domain space of 2,014,800 values. Accordingly, each interval node in the 1st tier corresponds to a particular domain-value in that domain space. Further, based on the tierN mapping described above, each interval node in the 1st tier is assigned with a tier1-value computed as tier1-value=domain-value*2+1, which effectively maps the domain-values to odd numbers.
The 2nd tier is generated using a different domain space than the 1st tier. As described above, for example, the domain space for the 2nd tier is computed by shifting the minimum interval boundary by half of the granularity (e.g., 0.5 hours) and doubling the granularity (2 hours). Thus, the minimum interval boundary is now Dec. 31, 1969 at 23:30, the maximum interval boundary remains the same, and the granularity is 2 hours. Accordingly, the domain space for the 2nd tier is computed over the interval from Dec. 31, 1969 at 23:30-Dec. 31, 2200 using a granularity of two hours, resulting in a domain space of 1,007,400 values. Further, each interval node in the 2nd tier is assigned with a tier2-value computed as tier2-value=domain-value*4+2, which effectively maps the domain-values to numbers divisible by two but not four.
Successive tiers are continuously generated in this manner until the granularity of a generated tier covers the entire searchable interval (e.g., the [minimum, maximum] range defined at block 202). In this example, the 22nd tier will have a granularity of 2,097,152 hours (239 years), which is just larger than the size of the original searchable interval (230 years).
Thus, during each iteration of the algorithm, a domain space is computed (block 206) and a tier of interval nodes is generated based on the domain space (block 208). The flowchart then proceeds to block 210 to determine whether the granularity associated with the current tier equals or exceeds the size of the searchable interval. If the granularity is less than the size of the searchable interval, the minimum interval boundary is shifted by half the granularity and the granularity is doubled (block 212), a new domain space is computed (block 206), and a successive tier of interval nodes is generated based on the new domain space (block 208).
The flowchart may continue cycling through blocks 206-212 in this manner to generate successive tiers of interval nodes—with each successive tier corresponding to a domain space with a doubled granularity and shifted domain interval—until the granularity of a generated tier equals or exceeds the size of the defined searchable interval.
Once it is determined at block 210 that the granularity of the current tier equals or exceeds the size of the searchable interval, no further tiers need to be generated, and the flowchart proceeds to block 214 to output the resulting interval tree. The resulting interval tree can then be used to process interval-based database queries in an efficient manner (e.g., as described further below in connection with
Notably, the resulting interval tree generated using the approach described above has distinct boundaries among the nodes of different tiers. By way of example, an interval tree generated using the approach described above is visualized in
For comparison, an interval tree with colliding boundaries among the nodes of different tiers is visualized in
To illustrate the difference between the interval trees of
With respect to the interval tree of
tier 1: fork nodes 47, 49;
tier 2: fork nodes 46, 50;
tier 3: fork nodes 44, 52; and
tier 4: fork nodes 40, 56.
As a result, the 5th tier is the lowest tier in which the startValue (47) and endValue (50) share the same fork node, which is fork node 48. Thus, even though the interval [47, 50] only has a size of three, it is stored on the 5th tier of the interval tree, which has a granularity of 32. This means that tiers of the interval tree with large granularities will contain data records with much smaller intervals in some cases, which increases the number of data records that have to be searched one-by-one for those tiers (even when the interval is very small).
On the other hand, with respect to the interval tree of
Thus, due to the distinct boundaries among the tiers in the interval tree of
On the other hand, with respect to the interval tree of
Example code for implementing a relational interval tree with distinct boundaries is provided below in Appendix A, which is fully incorporated by reference into this disclosure. The example code is written in the Java programming language.
In the example code, the interval tree is a precomputed table represented as an array of tier data structures, where each tier data structure in the array defines a particular tier of the tree. Thus, the number of elements in the array is equivalent to the number of tiers in the interval tree, which is a small memory footprint. Accordingly, the interval tree requires only nominal additional memory overhead in exchange for significantly improving the response time for interval queries.
The example code includes implementations of the getFork( ) and getForksForValue( ) methods (which are discussed above):
int getFork(startValue, endValue)
int[] getForksForValue(value)
The getFork(startValue, endValue) method returns the tierN-value for ‘startValue’ and ‘endValue’ from the lowest tier (e.g., the lowest value of N) in which those parameters are mapped to the same tierN-value. This method can be optimized to only search tiers whose corresponding granularity is larger than the difference between the ‘startValue’ and ‘endValue’ parameters.
The getForksForValue(value) method returns an array of all tierN-values corresponding to the ‘value’ parameter. Accordingly, the number of elements in the resulting array is equivalent to the number of tiers in the interval tree. The number of tiers in the interval tree, and thus the size of the resulting array returned by this method, are dictated by the size of the searchable interval and minimum granularity that are defined at the outset.
The attributes of the interval tree generated by the example code in Appendix A are shown below in TABLE 1. In the example code, the interval tree is generated based on a searchable interval of 239 years (e.g., starting from the beginning of UNIX Epoch time) and a minimum granularity of 15 minutes, which results in an interval tree with 24 tiers. Accordingly, the interval tree is represented by an array of tier data structures that contains 24 elements, each of which corresponds to a particular tier of the interval tree. The fork values do not start at value 1 in order to avoid using negative numbers during the pre-calculations—the starting value does not matter since the numbers are still 32-bit integers.
The flowchart may begin at block 302 by defining a searchable interval associated with a database on a data storage device. In some embodiments, for example, the searchable interval may be defined or selected such that the data records within the database all have corresponding data intervals that are within the defined searchable interval.
The flowchart may then proceed to block 304 to generate an interval tree corresponding to the searchable interval. In some embodiments, for example, the interval tree may be generated using the approach described above in connection with
Moreover, the interval tree may be represented in memory using any suitable data structure(s), including arrays, linked lists, and so forth. Further, in some embodiments, other types of graph-based structures may be used instead of an interval tree.
The flowchart may then proceed to block 306 to identify an interval tree node corresponding to each database record, and then to block 308 to store a reference to the corresponding tree node in each database record.
The flowchart may then proceed to block 310 to determine whether a database query has been received. If no database query has been received, the flowchart remains at block 310 until a query is received.
Once a database query has been received, the flowchart may then proceed to block 312 to identify tree nodes corresponding to an interval parameter in the query.
The flowchart may then proceed to block 314 to identify database records corresponding to the identified tree nodes.
The flowchart may then proceed to block 316 to compare intervals of the identified database records to the query interval parameter.
Finally, the flowchart may then proceed to block 318 to return the matching database records.
At this point, the flowchart may be complete. In some embodiments, however, the flowchart may restart and/or certain blocks may be repeated. For example, in some embodiments, the flowchart may repeat blocks 310 to 318 to continue receiving and processing database queries. Alternatively, or additionally, the flowchart may restart at block 302 to continue defining searchable intervals, generating interval trees, and processing interval queries for other databases.
It should be appreciated that the flowcharts and block diagrams in the FIGURES illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order or alternative orders, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as suited to the particular use contemplated.