Data can be stored in relational tables of a relational database management system (RDBMS). A database query, such as a Structured Query Language (SQL) query, can be submitted to an RDBMS to access (read or write) data contained in relational table(s) stored in the RDBMS.
As the amount of data that is generated has become increasingly large, different data storage architectures have been proposed or implemented for storing large amounts of data in a computationally less intensive manner. An example of such a storage architecture for storing large amounts of data (also referred to as “big data”) is the Hadoop framework used for storing big data across a distributed arrangement of storage nodes.
Some embodiments are described with respect to the following figures.
Although storage architectures designed for storing “big data” allow for storage of relatively large amounts of data in a manner that consumes fewer computation resources than traditional relational database management systems (RDBMSs), such big data storage architectures may be associated with various issues. “Big data” can refer to any relatively large collection of data that may not be practically stored and processed by a traditional RDBMS.
An example of a storage architecture for storing and processing big data is the Hadoop framework, which is able to store a collection of data across multiple storage nodes. An issue associated with the Hadoop framework is that a programming interface to a Hadoop storage system (a storage system according to the Hadoop framework) may be inconvenient and not user-friendly. The programming interface to a Hadoop storage system can be according to a MapReduce programming model, which includes a map procedure and a reduce procedure. A map procedure specifies a map task, and a reduce procedure specifies a reduce task, where the map and reduce tasks are executable by computing nodes used for storing a collection of data. Map tasks specified by the map procedure process corresponding segments of input data to produce intermediate results. The intermediate results are then provided to reduce tasks specified by the reduce procedure. The reduce tasks process and merge the intermediate results to provide an output.
The map and reduce procedures can be user-defined functions. Having to develop map and reduce procedures for accessing data stored by the Hadoop storage system adds a layer of complexity to the programming interface for the Hadoop storage system.
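The flow described above can be sketched in a few lines. This is a minimal, single-process illustration of the MapReduce model (word count as the example job); the function names and the in-process driver are hypothetical stand-ins, since a real framework distributes the map and reduce tasks across computing nodes.

```python
from collections import defaultdict

def map_task(segment):
    # Each map task processes one segment of input data and
    # emits intermediate (key, value) pairs.
    for word in segment.split():
        yield (word, 1)

def reduce_task(key, values):
    # Each reduce task merges the intermediate values for one key.
    return (key, sum(values))

def run_job(segments):
    # Shuffle step: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for segment in segments:
        for key, value in map_task(segment):
            grouped[key].append(value)
    return dict(reduce_task(k, vs) for k, vs in grouped.items())

print(run_job(["a b a", "b a"]))  # {'a': 3, 'b': 2}
```

Even in this toy form, both the map and reduce procedures are user-defined functions, which illustrates the extra development burden noted above.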
A further issue associated with a Hadoop storage system is that the Hadoop storage system is suitable for analytical database applications, but not for operational database applications. An analytical database application refers to a database application that performs processing of a collection of data for analysis purposes; such an application can be performed in an offline manner, such as in batch jobs executed during off-peak time periods.
In contrast, operational database applications are online applications that perform processing of a collection of data in response to queries for the data, where the responses to the queries are provided to requesting entities in an online manner. In other words, the responses are provided back to the requesting entities within a target time interval from receipt of the queries, while the requesting entities remain online with respect to the data processing system that processes the queries. A target time interval during which a response is expected to be returned in response to a query can be governed by a service level agreement (SLA) or another target goal specified for an operational database application. In some examples, an operational database application includes real-time On-Line Transaction Processing (OLTP), which involves inserting, deleting, and updating of data for transactions in real-time in response to requests from requesting entities, which can include users or machines. Obtaining results for transactions in “real-time” can refer to obtaining results while a requesting entity that submitted the transactions remains online with respect to the data processing system and waits for the results, where the requesting entity expects the results to be returned within some expected time interval.
An RDBMS, also referred to as a “relational database system,” provides a more user-friendly programming interface and is able to support operational database applications. The programming interface of a relational database system can include a database query interface, such as a Structured Query Language (SQL) query interface that allows users or machines to submit SQL queries to the database system to access data stored in relational tables of the database system. The syntax of SQL queries is well defined, and database users are familiar with SQL queries.
Experienced database users can formulate SQL queries to perform many different operations on a collection of data, where the operations can include data insert operations, data delete operations, data update operations, data join operations (where data of two or more relational tables can be joined), data merge operations, and so forth.
An issue associated with a relational database system is that the relational database system may not be able to effectively store and process a large amount of data without deployment of large amounts of computation resources. Thus, it may not be practical to use traditional relational database systems for storing a large collection of data.
In accordance with some implementations, a hybrid data storage arrangement is provided that combines query processing features of a relational database system with storage features of a big data storage architecture.
The hybrid data storage arrangement further includes an abstraction layer 108 between the database query engines 102 and the distributed file system 104. The abstraction layer 108 is able to read and write data of the distributed file system in response to a database query received by a database query engine 102.
The hybrid data storage arrangement of
Operational database applications, such as real-time OLTP, can be supported by the hybrid data storage arrangement of
The database query engines 102 and storage managers 114 are part of a database storage layer 101. The database storage layer 101 also includes other features available from relational database systems. For example, the database storage layer 101 can include a query optimizer, which is able to develop a query plan for performing operations on data in response to receiving a database query. The query plan can specify various operations to be performed, including, as examples, a read operation, a join operation, a merge operation, an update operation, and so forth. In response to a database query, the query optimizer can develop several candidate query plans for executing the database query, and can select the best (in terms of operation speed, efficiency, etc.) from among the candidate query plans to use.
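The selection among candidate query plans can be sketched as a cost-based comparison. The candidate plans and their cost estimates below are hypothetical; a real query optimizer derives cost estimates from statistics about the stored data.

```python
def select_plan(candidate_plans):
    # Pick the candidate with the lowest estimated cost
    # (a proxy for operation speed and efficiency).
    return min(candidate_plans, key=lambda p: p["estimated_cost"])

candidates = [
    {"name": "full-scan then join", "estimated_cost": 950.0},
    {"name": "index lookup then join", "estimated_cost": 120.0},
    {"name": "join then filter", "estimated_cost": 400.0},
]
best = select_plan(candidates)
print(best["name"])  # index lookup then join
```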
The database storage layer 101 can also implement concurrency control, which ensures that inconsistent operations are not performed on data that is being concurrently accessed in a distributed arrangement. For example, concurrency control can ensure that two concurrent operations do not write different values to the same data record.
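One simple way to picture concurrency control is per-record locking, sketched below; all names are hypothetical, and real database systems use richer schemes (e.g. two-phase locking or multiversion concurrency control) than this minimal illustration.

```python
import threading
from collections import defaultdict

class RecordStore:
    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)

    def write(self, record_id, value):
        # Serialize writes to the same record so two concurrent
        # operations cannot interleave on one data record.
        with self._locks[record_id]:
            self._data[record_id] = value

    def read(self, record_id):
        with self._locks[record_id]:
            return self._data.get(record_id)

store = RecordStore()
threads = [threading.Thread(target=store.write, args=("r1", v)) for v in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# One write wins cleanly; the record never holds a torn or mixed value.
print(store.read("r1") in (1, 2))  # True
```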
Although the abstraction layer 108 is depicted as being outside of the database storage layer 101 in
An index 110 is defined on one or multiple attributes. A data record can include multiple attributes. For example, a data record stored by an enterprise (e.g. business concern, government agency, educational organization) can include the following attributes: employee identifier, employee name, department name, job role, manager, etc. An index maps different values of at least one attribute to different locations that store data containing the respective values of the at least one attribute. For example, if an index is defined on an employee identifier, then the index can map each unique value of the employee identifier to respective location(s) that store data records that contain the unique value of the employee identifier. In response to a database query, a database query engine 102 is able to identify at least one location storing data responsive to the database query by accessing the index. In some examples, the indexes 110 can be B-tree indexes, or other types of indexes.
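The index-lookup idea can be illustrated with a plain dictionary standing in for the index structure. The record contents and location strings below are hypothetical; a B-tree index would organize the same value-to-location mapping for efficient range and point lookups.

```python
from collections import defaultdict

records = {
    # location -> data record
    "node1/page7": {"employee_id": 42, "employee_name": "Ada", "department": "R&D"},
    "node2/page3": {"employee_id": 7,  "employee_name": "Lin", "department": "Sales"},
    "node1/page9": {"employee_id": 42, "employee_name": "Ada", "department": "R&D"},
}

# Build an index on the employee_id attribute: each unique value
# maps to the location(s) of records containing that value.
index = defaultdict(list)
for location, record in records.items():
    index[record["employee_id"]].append(location)

# Answer a query via the index, without scanning every record.
print(sorted(index[42]))  # ['node1/page7', 'node1/page9']
```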
Using the indexes 110 for processing database queries improves query processing efficiency, since locations storing requested data can be identified by accessing the indexes 110, without having to scan data stored at the storage nodes 106 (which can take a relatively long time due to the large amount of stored data and the relatively slow access speeds of the persistent storage media used to implement the storage nodes 106). Also, by using the indexes 110, multiple concurrent data retrieval tasks can have random access to requested data.
A buffer pool 112 can include one or multiple buffers for caching recently read data (data retrieved from a storage node 106, which can be implemented with persistent storage media, such as disk-based media or persistent solid state media). A first access of data causes the data to be cached in the buffer pool 112. A subsequent read of the same data can be satisfied from the buffer pool 112. If a buffer pool 112 becomes full, then a replacement technique can be used to evict data from the buffer pool 112. For example, the replacement technique can be a least recently used (LRU) technique, in which a least recently used data is evicted from the buffer pool 112. In other examples, other replacement techniques can be used.
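The caching behavior described above can be sketched with an LRU-evicting buffer pool. The capacity and the backing-store read function below are hypothetical; an `OrderedDict` keeps the cached pages in recency order.

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, read_from_storage):
        self.capacity = capacity
        self.read_from_storage = read_from_storage  # slow path to a storage node
        self.cache = OrderedDict()  # page_id -> page data, oldest first

    def get(self, page_id):
        if page_id in self.cache:
            # Cache hit: mark the page most recently used.
            self.cache.move_to_end(page_id)
            return self.cache[page_id]
        # Cache miss: fetch from persistent storage, then cache it.
        data = self.read_from_storage(page_id)
        self.cache[page_id] = data
        if len(self.cache) > self.capacity:
            # Pool is full: evict the least recently used page.
            self.cache.popitem(last=False)
        return data

pool = BufferPool(2, read_from_storage=lambda pid: f"data-{pid}")
pool.get("p1"); pool.get("p2"); pool.get("p1"); pool.get("p3")
print(list(pool.cache))  # ['p1', 'p3'] -- p2 was least recently used
```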
The buffer pool 112 can also include one or multiple buffers that implement(s) a write-ahead-log that stores updated data. For example, a database query can cause the writing of data (e.g. inserting a new data record or updating a data record). The data updated by the write can be stored in the write-ahead-log, without immediately writing the updated data (also referred to as “dirty data”) to persistent storage media of the storage nodes 106. Dirty data evicted from the buffer pool 112 can be synchronized to the distributed file system 104 (to update the respective data stored in the corresponding storage node(s) 106), and the corresponding entry in the write-ahead-log can be removed.
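The deferral of writes can be sketched as follows: updates land in a write-ahead log first and are synchronized to persistent storage only when the dirty data is evicted. The class and a dictionary standing in for the storage nodes are hypothetical simplifications.

```python
class WriteAheadLog:
    def __init__(self, persistent_store):
        self.persistent_store = persistent_store  # dict stands in for storage nodes
        self.log = {}  # page_id -> dirty (updated) data

    def write(self, page_id, data):
        # Record the update in the log; do not touch persistent storage yet.
        self.log[page_id] = data

    def evict(self, page_id):
        # Synchronize dirty data to storage, then drop the log entry.
        self.persistent_store[page_id] = self.log.pop(page_id)

store = {}
wal = WriteAheadLog(store)
wal.write("p1", "new value")
print("p1" in store)   # False: the update is only in the log so far
wal.evict("p1")
print(store["p1"])     # new value
```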
The buffer pool 112 can be implemented in higher-speed memory, which can be accessed more quickly than the persistent storage media of the storage nodes 106.
As shown in
The distributed file system 104 does not support use of the indexes 110 and buffer pools 112. Accordingly, without the database storage layer 101 (including the database query engines 102 and storage managers 114), the distributed file system 104 would not be able to support data operations at a sufficiently high throughput to support operational database applications. However, by tying together the database storage layer 101 and the distributed file system 104 using the abstraction layer 108, an arrangement that provides the computation efficiency of the distributed file system 104 and advanced features of the database storage layer 101 can be provided. The distributed file system 104 can offer scalability (to allow the capacity of the distributed storage system to be expanded easily), reliability (to provide reliable access of data), and availability (to provide fault tolerance in the face of machine faults or errors). The advanced features of the database storage layer 101 include index management, buffer pool management, query optimization, and concurrency control.
In specific examples, the hybrid data storage arrangement includes an operational SQL-on-Hadoop arrangement (which includes a database query engine that supports SQL queries and a Hadoop Distributed File System (HDFS)).
In accordance with some implementations, by using the abstraction layer 108, customization of the database storage layer 101 for use with the distributed file system 104 does not have to be performed. From the perspective of the database storage layer 101, the storage system that stores data appears to be that of a relational database system. As a result, substantial re-engineering of the database storage layer 101 does not have to be performed.
As discussed further below, the abstraction layer 108 can be provided at any one of several levels of the hybrid data storage arrangement above the distributed file system 104. The abstraction layer 108 can also be referred to as a virtual file system (VFS).
The abstraction layer 108 allows a client (e.g. the database storage layer 101) to access different types of physical file systems (e.g. HDFS, Ceph File System) in a uniform way. For example, the abstraction layer 108 can access local and network storage devices transparently without the client noticing the difference. The abstraction layer 108 can bridge applications with physical file systems, to allow applications to access files without having to know about the behavior of file system they are accessing. Also, the presence of the abstraction layer 108 allows for support of new types of distributed file systems in the hybrid data storage arrangement without having to modify the database storage layer 101. The abstraction layer 108 abstracts the distributed file system by hiding details of the distributed file system from the database query engines 102.
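The uniform-access idea can be sketched with an abstract interface and interchangeable backends. The class names below are hypothetical stand-ins (an in-memory backend for a local file system, and a fake sharded backend standing in for a distributed file system such as HDFS); client code works against the abstraction without knowing which backend it is using.

```python
from abc import ABC, abstractmethod

class VirtualFileSystem(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class InMemoryFS(VirtualFileSystem):
    # Stand-in for a local file system backend.
    def __init__(self):
        self.files = {}
    def read(self, path):
        return self.files[path]
    def write(self, path, data):
        self.files[path] = data

class FakeDistributedFS(VirtualFileSystem):
    # Stand-in for a distributed backend; shards files across
    # hypothetical "nodes" behind the same interface.
    def __init__(self, nodes=3):
        self.nodes = [dict() for _ in range(nodes)]
    def _node_for(self, path):
        return self.nodes[hash(path) % len(self.nodes)]
    def read(self, path):
        return self._node_for(path)[path]
    def write(self, path, data):
        self._node_for(path)[path] = data

def copy_file(src_fs, dst_fs, path):
    # Client code is uniform: it never sees the backend's details.
    dst_fs.write(path, src_fs.read(path))

local, dfs = InMemoryFS(), FakeDistributedFS()
local.write("/tmp/a", b"hello")
copy_file(local, dfs, "/tmp/a")
print(dfs.read("/tmp/a"))  # b'hello'
```

A new distributed file system can then be supported by adding one more backend class, without modifying the client.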
The abstraction layer 108 also allows interaction between the database query engines 102 and the distributed file system 104 without bypassing certain features (such as the indexes 110 and buffer pools 112) of the database storage layer 101. Without the abstraction layer 108, the database query engines 102 would have to be modified to directly access the interface (in the form of an application programming interface or API, for example) of the distributed file system 104. In that case, the indexing and buffer pool features of the database storage layer 101 would be bypassed.
A database query engine 102 receives (at 204) a database query for access of data. The database query engine 102 processes (at 206) the database query using an index 110, and using a buffer pool 112. The database query engine 102 then submits (at 208) commands corresponding to the database query to the abstraction layer 108 to cause the abstraction layer 108 to read and write data of the distributed file system 104 in response to the commands.
The commands are provided by the database query engine 102 based on a query plan produced by a query optimizer of the database storage layer 101. The commands can include a command to insert data, a command to delete data, a command to update data, a command to join data, a command to merge data, and so forth.
As noted above, the abstraction layer 108 can be provided at one of several levels above the distributed file system 104. In some implementations, the abstraction layer 108 can be included in just the database storage layer 101. More specifically, the abstraction layer 108 can be part of the storage manager 114 level in
In the ensuing discussion, it is assumed that the distributed file system 104 is the HDFS, and that an HBase database is implemented on the HDFS. HBase refers to an open source, non-relational distributed database that stores data in HBase tables. Although a specific database architecture (HBase) and distributed file system (HDFS) is assumed in the ensuing discussion, it is noted that techniques or mechanisms according to some implementations can be applied to other types of database architectures and distributed file systems.
In accordance with some implementations, the abstraction layer 108 is abstracted at the page level to allow page access of data stored in the storage system 302. A page (also referred to as a “block”) can refer to a container of data having a specified size. An HBase database is implemented as a key-value store. In a key-value store, each database record has a primary key and a collection of one or multiple values. In accordance with some implementations, the abstraction layer 108 of
In the
The set of APIs 306 can be invoked in response to commands from the database query engine 102 that are generated in response to a database query. The commands access respective pages, whose page IDs are provided in the invocation of APIs from the set 306. Invocation of the APIs from the set 306 causes the abstraction layer 108 to produce further commands that are submitted to the storage system 302 to access respective data records of the HBase database using page IDs as keys.
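The page-on-key-value idea can be sketched as follows: page reads and writes become key-value operations whose key is the page ID. The store class, page size, and the dictionary standing in for an HBase-like key-value store are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; a page is a fixed-size container of data

class PageStore:
    def __init__(self):
        self.kv = {}  # stands in for an HBase-like key-value store

    def write_page(self, page_id, data):
        assert len(data) <= PAGE_SIZE
        # The page ID is used directly as the record's primary key.
        self.kv[page_id] = data

    def read_page(self, page_id):
        return self.kv.get(page_id, b"")

store = PageStore()
store.write_page(("table_t", 12), b"row data...")
print(store.read_page(("table_t", 12)))  # b'row data...'
```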
An HBase table region managed by a region server 304 can be further split into respective portions that are stored as files of the HDFS 104. The files of the HDFS 104 are referred to as HFiles.
Each region server 304 can also include a local buffer pool for buffering portions of an HFile that has been retrieved from a storage node 106. In addition, an updated portion of an HFile can be logged to a local write-ahead-log of the region server 304. Dirty data of the local write-ahead-log of the region server 304 can be synchronized with the HDFS 104.
The local buffer pool and local write-ahead-log of each region server 304 are similar to the respective buffer pool 112 of a corresponding storage manager 114 (discussed above). In the arrangement of
The set of APIs 404 in the VFS interface layer 402 are mapped to a set of APIs 406 of the distributed file system 104, using a mapping 408 that is also part of the abstraction layer 108. The mapping can be in the form of various procedures that translate page accesses due to invocation of the set of APIs 404 in the VFS interface layer 402 to accesses of data at a different granularity (smaller blocks or larger blocks) as provided by the set of APIs 406 of the distributed file system 104.
In response to a query received by a database query engine 102, the database query engine 102 produces commands to access data. These commands cause invocation of API(s) of the set of APIs 404 in the VFS interface layer 402. The invoked API(s) are mapped by the mapping 408 to respective API(s) of the set of the APIs 406 in the distributed file system 104. The mapped API(s) of the set of APIs 406 is (are) executed to access data in the storage nodes 106.
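The granularity translation performed by the mapping 408 can be sketched as follows: a page-granularity read is translated into an access at the file system's larger block granularity, and the requested page is sliced out of the fetched block. The sizes and function names below are hypothetical.

```python
PAGE_SIZE = 4 * 1024        # granularity the database layer requests
DFS_BLOCK_SIZE = 64 * 1024  # granularity the file system actually serves

def read_page(dfs_read_block, page_id):
    # Locate the larger block holding the page, fetch it, and slice
    # out the requested page.
    byte_offset = page_id * PAGE_SIZE
    block_id, offset_in_block = divmod(byte_offset, DFS_BLOCK_SIZE)
    block = dfs_read_block(block_id)
    return block[offset_in_block:offset_in_block + PAGE_SIZE]

# Fake backend: block i is filled with the byte value i.
fake_dfs = lambda block_id: bytes([block_id]) * DFS_BLOCK_SIZE
page = read_page(fake_dfs, page_id=17)
# page 17 -> byte offset 69632 -> block 1, offset 4096 within the block
print(len(page), page[0])  # 4096 1
```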
Further details regarding the buffer pools 112 discussed in connection with
The buffer pool 112 can be accompanied by a corresponding array of data structures referred to as buffer descriptors, with each buffer descriptor recording information about a corresponding page (e.g. the page's tag, the usage frequency of the page, the last access time of the page, whether the page is dirty (updated), and so forth).
When a process of a database query engine 102 requests access of a specific page, if the page is already cached in the buffer pool 112, then the corresponding buffered page is pinned (the page is locked to prevent another process from accessing the page). If the page is not cached in the buffer pool 112, then it is determined whether a free page slot exists in the buffer pool 112 for storing the page. If no slots are free, the process selects a page to evict from the buffer pool 112 to make space for the requested page. If the page to be evicted is dirty, the dirty page is written to the distributed file system 104.
Deciding which page to remove from the buffer pool to make space for a new page can use a replacement technique such as an LRU technique. A timestamp when each page was last used is kept in the corresponding buffer descriptor in order for the system to determine which page is least recently used. Another way to implement LRU is to keep pages sorted in order of recent access. In other examples, other types of replacement techniques can be used.
The memory 506 can be implemented as one or multiple non-transitory computer-readable or machine-readable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2013/075651 | 12/17/2013 | WO | 00