The present invention generally relates to database access, and more specifically to a method for abstracting database access.
Large amounts of information become available as a consequence of the collection and analysis of more and more time series data. As newer data are collected, the older data are typically moved to larger and less frequently accessed storage units. Generally, the older time series data build up over time, become quite large, and are often referred to as Big Data, with the larger storage unit referred to as a Big Data repository. The newer time series data generally remain relatively small and can be referred to as Small Data stored in a Small Data repository. The difference in age between the Big Data and the Small Data leads to different usage characteristics. For example, the more recent Small Data are typically accessed and used more frequently than the older Big Data.
Many applications require Big Data repositories for storing and mining massive quantities of historical time series data. As Big Data technologies become more prevalent, increasing numbers of applications will require combinations of both big and small data repositories functioning in tandem, with the small data repositories storing the most frequently accessed data points, such as those recently added or updated. This is because Big Data repositories are very effective at enabling deep analytics on large volumes of data, but the analytics typically execute in batch and thus do not provide real-time or near-real-time access to the data.
However, combining big data and small data repositories within a single infrastructure presents a challenge when a user desires to execute queries and/or analytics. Traditionally, as shown in
Therefore, there is a need for a system and method that provide a single data access interface regardless of where the time series data are stored, and it is to this need that embodiments of the present invention are primarily directed.
Embodiments of the present invention are constructed to overcome the aforementioned deficiencies. The embodiments provide a single data access method for time series data stored in different locations or under different data structures.
The embodiments also provide a method for providing seamless access to time series data located in multiple data storage units. The method includes receiving a first data request for data from a user device, and parsing, by a query interface controller, the first data request to identify a location of the data. The method also includes formulating, by the query interface controller, at least one second data request and sending the at least one second data request to at least one data storage unit. The time series data are received from the at least one data storage unit and are sent to the user device.
Another illustrative embodiment provides an apparatus for providing seamless access to time series data located in multiple data storage units. The apparatus includes a user interface controller for receiving a first data request from a user device and a query interface controller for parsing the first data request and identifying multiple data storage units. The query interface controller is capable of formulating an appropriate data request for each of the multiple data storage units. The apparatus also includes an input/output (I/O) controller for sending the data requests to the data storage units and receiving data from each data storage unit. The query interface controller then combines the multiple query results, and finally the user interface controller sends the complete time series data to the user device.
The foregoing and other objects, features, aspects and advantages of the present invention will become better understood from a careful reading of a detailed description provided herein below with appropriate reference to the accompanying drawings.
The present invention can be understood in more detail by reading the subsequent detailed description in conjunction with the examples and references made to the accompanying drawings, wherein:
Embodiments of the present invention provide a capability that abstracts the details of the underlying data stores away from the end user, so as to eliminate the need for the user to know the format in which the data are stored. The embodiments utilize a common interface that is positioned atop the data repositories and is capable of receiving queries, parsing them to determine their data requirements, executing the queries against the appropriate repository or repositories, and combining any results that straddle the small and big time series data stores.
Aspects of the illustrative embodiments work by building a query interface layer that can sit atop different data stores. This layer receives queries, parses them to determine which repositories are most likely to hold the relevant data, and then executes the queries against the relevant data stores. The layer joins the results (if the query is run against more than one data store) and returns them. The query interface controller embodiments use metadata about each repository that defines the structure and attributes of the time series data stored in that repository, in order to determine which repository or repositories hold the data being requested by the user.
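The metadata lookup described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each repository's metadata lists the indicators it stores and the range of days it covers; the names `RepositoryMetadata`, `repositories_for`, and `CATALOG`, and the sample values, are hypothetical and do not appear in the embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryMetadata:
    """Hypothetical description of what one repository holds."""
    name: str
    indicators: frozenset  # names of the time series stored there
    oldest_day: int        # oldest day covered, as a day number
    newest_day: int        # newest day covered, as a day number


def repositories_for(catalog, indicator, first_day, last_day):
    """Return the names of repositories whose metadata overlaps the request."""
    matches = []
    for meta in catalog:
        holds_indicator = indicator in meta.indicators
        # Two inclusive ranges overlap when each starts before the other ends.
        overlaps = meta.oldest_day <= last_day and first_day <= meta.newest_day
        if holds_indicator and overlaps:
            matches.append(meta.name)
    return matches


# Assumed sample catalog: the small store covers the latest 30 days.
CATALOG = [
    RepositoryMetadata("small", frozenset({"temperature", "pressure"}), 971, 1000),
    RepositoryMetadata("big", frozenset({"temperature", "pressure"}), 0, 970),
]
```

With this catalog, a request for "temperature" over days 980 to 1000 resolves to the small repository alone, while a request over days 940 to 1000 resolves to both repositories.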
The query interface layer uses this metadata to become aware of what data are available and in which repository they may be stored. When the query layer parses a query, it can use the parameters of the query to determine which repository, or repositories, house the time series data being requested. For example, if a query requests daily averages of an indicator over the prior three weeks and the query interface layer knows that the small data repository houses the indicator data created over the last month, the actual query can be executed in the small data repository alone.
Alternatively, if the query requests daily averages of the indicator over the past two months, the query interface layer would know to pull the most recent month from the small data repository and the prior month from the big data repository. The results would then be combined in the query interface layer before finally being returned to the requester.
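The three-week and two-month examples above can be sketched as follows. This is a minimal illustration, assuming the small data repository holds a fixed trailing window of 30 days; the function names `split_by_age` and `combine`, the `small_window_days` parameter, and the "prefer the small repository on overlap" rule are assumptions made for the sketch rather than details taken from the embodiments.

```python
from datetime import date, timedelta


def split_by_age(start, end, today, small_window_days=30):
    """Split the inclusive range [start, end] at the small-repository cutoff.

    Returns a dict mapping a repository label to an inclusive (first, last)
    date pair; a repository is absent when none of the range falls in it.
    """
    cutoff = today - timedelta(days=small_window_days)  # oldest day in small store
    parts = {}
    if start < cutoff:
        parts["big"] = (start, min(end, cutoff - timedelta(days=1)))
    if end >= cutoff:
        parts["small"] = (max(start, cutoff), end)
    return parts


def combine(big_rows, small_rows):
    """Merge per-day (date, value) results from both repositories in date order."""
    merged = dict(big_rows)
    merged.update(small_rows)  # assumed rule: keep the small-store copy on overlap
    return sorted(merged.items())
```

Under these assumptions, a query whose range lies entirely within the trailing window yields a single sub-range for the small repository, while a longer query yields two adjacent, non-overlapping sub-ranges whose results `combine` joins before they are returned to the requester.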
The embodiments of the present invention address the challenge of using multiple time series data repositories to meet the different data needs of a single system. Both small and big data repositories may be required within one infrastructure, to serve very different purposes. Small data repositories give very fast access to limited amounts of data. Big Data repositories allow users to store hundreds of terabytes of data or more, but provide only batch analytic execution on that data. If multiple such data repositories are used within a single system, a significant challenge arises with respect to how end users (and other systems) will interact with those multiple repositories.
Users who wish to analyze the stored data conventionally need special insight into the data repositories to know which time series data are stored where. Embodiments of the present invention solve that problem by creating a layer that sits atop the many repositories to provide an interface to receive and parse queries, distribute the queries to the right repositories, and then combine the results where the queries span both the small and the big data stores.
The query interface 202 checks if the data reside in more than one data location, step 306. If the time series data are spread across more than one location, the query interface 202 formulates multiple queries, one for each data storage unit, step 308, and sends the queries to the different data storage units, step 310. The query interface 202 receives query results back from each data storage unit, step 312, merges the query results, step 313, and then assembles and displays or forwards the queried data to the user, step 314. Because the queries are sent to multiple time series data storage units, the responses from these storage units may not arrive simultaneously. The query interface 202 may send or forward partial results to the user before all the results are received.
If the desired data are not spread across multiple locations, the query interface 202 checks if the data are Big Data, step 316. If the data are Big Data, the query interface 202 formulates the query for the Big Data storage unit, step 318, and sends the query to the Big Data storage unit, step 320. After the queried data are received back from the Big Data storage unit, step 322, the query interface 202 proceeds to display or forward the data to the user, step 314.
If the desired data are not spread in multiple locations and are not Big Data, the query interface 202 formulates the query for the Small Data storage unit, step 324, and sends the query to the Small Data storage unit, step 326. After the queried data is received back from the Small Data storage unit, step 328, the query interface 202 proceeds to display or forward the time series data to the user, step 314.
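The decision flow of steps 306 through 328 can be summarized in a short sketch. It is offered only as an illustration under stated assumptions: `run_query`, the `locate` callback, and the `stores` mapping with "big" and "small" labels are hypothetical names, sub-queries are run sequentially for simplicity, and rows are assumed to be (timestamp, value) pairs that can be merged by sorting.

```python
def run_query(request, locate, stores):
    """Route a request following the decision flow described above.

    `locate` returns the labels of the stores that hold the requested data;
    `stores` maps each label to a callable that executes a sub-query and
    returns a list of (timestamp, value) rows. Both are hypothetical.
    """
    locations = locate(request)
    if len(locations) > 1:                       # step 306: more than one location
        results = []
        for label in locations:                  # steps 308-312: one query per store
            results.append(stores[label](request))
        # step 313: merge the partial results in timestamp order
        rows = sorted(row for part in results for row in part)
    elif locations == ["big"]:                   # step 316: Big Data only
        rows = stores["big"](request)            # steps 318-322
    else:                                        # otherwise Small Data only
        rows = stores["small"](request)          # steps 324-328
    return rows                                  # step 314: display or forward
```

A real implementation would likely issue the sub-queries concurrently and could forward partial results as they arrive; the sequential loop here is chosen only to keep the control flow easy to follow.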
Although
The I/O controller 406 sends the newly formulated queries to each time series data location and receives the data back from each data location. When the data are received from multiple data storage units, the data that are received first can be stored in the storage unit 410 until all the data are received. After all the time series data are received, the query interface controller unit 408 assembles all the received data and the user interface controller unit 404 presents them to the user. The information on the data location can also be saved in the storage unit 410.
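The buffering behavior described above, in which early responses are held until every data storage unit has replied, might be sketched as follows. The class name `ResultBuffer` and its methods are hypothetical, and the in-memory dictionary merely stands in for the storage unit 410; rows are assumed to be (timestamp, value) pairs.

```python
class ResultBuffer:
    """Holds partial results until every expected storage unit has replied."""

    def __init__(self, expected_sources):
        self._pending = set(expected_sources)  # units that have not replied yet
        self._parts = {}                       # rows received so far, per unit

    def add(self, source, rows):
        """Store one unit's rows; return True once all units have replied."""
        if source not in self._pending:
            raise KeyError(f"unexpected or duplicate source: {source}")
        self._pending.discard(source)
        self._parts[source] = list(rows)
        return not self._pending

    def assemble(self):
        """Combine all buffered rows in timestamp order."""
        if self._pending:
            raise RuntimeError(f"still waiting for: {sorted(self._pending)}")
        return sorted(row for rows in self._parts.values() for row in rows)
```

In this sketch the caller would invoke `add` as each response arrives and call `assemble` once `add` returns True, at which point the combined data can be handed to the user interface controller.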
Embodiments of the present invention provide a major level of simplification for users who are required to interface with such systems. Prior to the present invention, users would be required to develop multiple distinct paths to integrate with each repository, and to know a priori what data are found in each. The present invention eliminates a significant level of complexity for anyone needing to build or interact with a system that requires different tiers of time series data storage. From a commercial perspective, the embodiments greatly simplify the deployment of systems that include multiple time series data repositories. Such a feature provides a significant commercial sales advantage over competing systems.
Although the present invention has been described with reference to the preferred embodiments, it will be understood that the invention is not limited to the details thereof. Various substitutions and modifications have been suggested in the foregoing description, and others will occur to those of ordinary skill in the art. For example, the data may be stored in more than two different locations. Therefore, all such substitutions and modifications are intended to be embraced within the scope of the invention as defined in the appended claims. It is understood that features shown in different figures can be easily combined within the scope of the invention.