The present disclosure generally relates to networked database management systems (DBMS) and supporting infrastructure. More particularly, the present disclosure relates to computer software application access to resources, such as memory and disk. More particularly still, the present disclosure relates to efficient access by a database application to memory and storage and a database network topology to support the same. Even more particularly still, the present disclosure relates to efficiently aligned direct access to network and disk structures by a database application, for example, accessing NVME (Non-Volatile Memory Express) or SATA (Serial Advanced Technology Attachment) directly through a kernel bypass, as well as a computer software architecture to accomplish the same.
A DBMS is a suite of computer programs that are designed to manage a database, which is a large set of structured data. In particular, a DBMS is designed to quickly access and analyze large amounts of stored data. Most modern DBMS systems comprise multiple computers (nodes). The nodes generally communicate via a network, which will use a network protocol, such as HTTP or raw TCP/IP. Information exchanged between nodes is carried in packets, the specific format of which will be determined by the specific protocol used by the network. The data wrapped in a packet will generally be compressed to the greatest extent possible to preserve network bandwidth. Accordingly, when a packet has been received, its data will have to be reformatted for use by the receiving node. A variety of DBMSs and the underlying infrastructure to support them are well known in the art. Database input/output (“I/O”) systems comprise processes and threads that identify, read, and write blocks of data from storage; e.g., spinning magnetic disk drives, network storage, flash drives, or cloud storage.
Like many software systems, DBMSs evolved from standalone computers, to sophisticated client/server setups, to cloud systems. An example of a cloud-based DBMS is depicted in
Generally, DBMSs operate on computer systems (whether standalone, client/server, or cloud) that incorporate operating systems. Operating systems, which are usually designed to work across a wide variety of hardware, utilize device drivers to abstract the particular functions of hardware components, such as, for example, disk controllers and network interface cards. As drivers are generally accessed through an operating system, such accesses will typically entail significant resource overhead, such as a mode switch (i.e., a switch from executing application logic to operating system logic) or a context switch (i.e., the pausing of one task to perform another). Such switches are typically time-consuming, sometimes on the order of milliseconds of processor time.
Data stored in a DBMS is usually stored redundantly, using, for example, a RAID controller, Storage Area Network (“SAN”) system, or dispersed data storage. In addition, other measures are usually taken to ensure that data is stored correctly. For example, many DBMSs utilize a write log. A write log, which is generally written before the actual database is updated, contains a record of all changes to the database, so that a change can be easily backed out or, in the case of a transaction processing failure, redone as needed. In addition, writing to the log prior to committing the data guarantees that committed transactions can be preserved; i.e., properly written to disk. Under prior art methods, disk commits are lengthy procedures, often taking milliseconds. In addition, prior art systems utilizing traditional disk systems must write a complete block, and disk log records will rarely occupy an exact multiple of a block of data. Given that writing a block of data can be lengthy, prior art database systems generally buffer the log and write it to disk only periodically. Otherwise, if the log were written after each modification, the DBMS would be severely limited in transaction processing speed.
The process of buffering log writes—sometimes known as “boxcarring”—can reduce the number of transactions that a system must track and commit. However, there are penalties in user response time, lock contention, and memory usage. In addition, boxcarring can complicate system recovery.
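By way of non-limiting illustration, the following sketch shows one way boxcarring might be implemented. The GroupCommitLog class, the 4 KiB block threshold, and the stubbed-out disk write are assumptions for illustration, not the disclosed implementation.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch of "boxcarring": commit records from several
// transactions accumulate in one buffer and are written (and acknowledged)
// together, so a single block-sized disk write covers many transactions.
// GroupCommitLog and the 4 KiB threshold are hypothetical.
class GroupCommitLog {
public:
    static constexpr std::size_t kBlockSize = 4096;

    // Queue a commit record plus the acknowledgment to run once durable.
    void commit(const std::string& record, std::function<void()> ack) {
        buf_.insert(buf_.end(), record.begin(), record.end());
        acks_.push_back(std::move(ack));
        if (buf_.size() >= kBlockSize) flush();   // one write, many commits
    }

    void flush() {
        write_block_to_disk(buf_.data(), buf_.size());
        for (auto& ack : acks_) ack();            // every boxcarred transaction
        buf_.clear();                             // is acknowledged at once
        acks_.clear();
    }

private:
    static void write_block_to_disk(const char*, std::size_t) { /* stub */ }

    std::vector<char> buf_;
    std::vector<std::function<void()>> acks_;
};

int main() {
    GroupCommitLog log;
    log.commit("txn-1 commit", [] { /* acknowledge transaction 1 */ });
    log.commit("txn-2 commit", [] { /* acknowledge transaction 2 */ });
    log.flush();                                  // drain the partial boxcar
}
```

The penalties noted above are visible even in this sketch: an early transaction's acknowledgment waits on later arrivals, and the buffer holds memory until it is flushed.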
In certain cases, a DBMS can use Remote Direct Memory Access (RDMA) to transfer data between two nodes (computers) of the DBMS. RDMA is a technology that allows a network interface card (NIC) to transfer data directly to or from memory of a remote node without occupying the central processing unit (CPU) of either node. It should be noted that the term network interface card, as used herein, includes all network interfaces, including chips and units built directly into processors. By way of example, a remote client can register a target memory buffer and send a description of the registered memory buffer to the storage server. The remote client then issues a read or write request to the storage server. If the request is a write request, the storage server performs an RDMA read to load data from the target memory buffer into the storage server's local memory. The storage server then causes a disk controller to write the target data to a storage disk and, once the write is complete, generates and sends a write confirmation message to the remote client. On the other hand, if the request is a read request, the storage server uses the disk controller to perform a block-level read from disk and loads the data into its local memory. The storage server then performs an RDMA write to place the data directly into an application memory buffer of the remote computer. After an RDMA operation completes, the remote client deregisters the target memory buffer from the RDMA network to prevent further RDMA accesses. Using RDMA increases data throughput, decreases the latency of data transfers, and reduces load on the storage server's and remote client's CPUs during data transfers. Examples of RDMA capable networks include InfiniBand, iWARP, RoCE (RDMA over Converged Ethernet), and Omni-Path.
In operation, the driver 30 in the node 10 writes a descriptor for a location of the pinned buffer 20 to the NIC 40. The driver 70 in the node 50 writes a descriptor for the location of the pinned buffer 60 to the NIC 80. The driver 30 works with the operating system, as well as other software and hardware on the node 10 to guarantee that the buffer 20 is locked into physical memory; i.e., “pinned.” The NIC 40 reads data from the pinned buffer 20 and sends the read data on the network 90. The network 90 passes the data to the NIC 80 of the host 50. The NIC 80 writes data to the pinned buffer 60.
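For concreteness, the sketch below registers and deregisters a pinned buffer using the libibverbs API, one common interface to RDMA capable NICs; this is an assumption for illustration rather than the disclosed implementation, and queue-pair setup and the exchange of buffer descriptors with the peer are omitted.

```cpp
#include <infiniband/verbs.h>
#include <cstdio>
#include <cstdlib>

// Registers ("pins") a buffer for RDMA. Registration locks the pages into
// physical memory and yields keys (lkey/rkey) that the local and remote
// NICs use to address the buffer. Compile with -libverbs.
int main() {
    int num = 0;
    ibv_device** devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { std::fprintf(stderr, "no RDMA device\n"); return 1; }

    ibv_context* ctx = ibv_open_device(devs[0]);
    ibv_pd* pd = ibv_alloc_pd(ctx);                 // protection domain

    const size_t len = 1 << 20;                     // 1 MiB buffer
    void* buf = std::malloc(len);

    // Pin the buffer and grant the NIC local and remote access.
    ibv_mr* mr = ibv_reg_mr(pd, buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { std::fprintf(stderr, "registration failed\n"); return 1; }
    std::printf("pinned %zu bytes, lkey=%u rkey=%u\n", len, mr->lkey, mr->rkey);

    // ... queue-pair setup and RDMA reads/writes would go here ...

    ibv_dereg_mr(mr);                               // unpin: no further RDMA access
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    std::free(buf);
    return 0;
}
```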
Serial Advanced Technology Attachment (SATA) is a computer bus interface that connects host bus adapters to mass storage devices such as hard drives, optical drives, and solid state drives (SSDs). While SATA works with SSDs, it is not designed to allow for the significant level of parallelism that SSDs are capable of. NVMe, also known as NVM (Non-Volatile Memory) Express or the Non-Volatile Memory Host Controller Interface Specification (NVMHCI), is a logical device interface specification for accessing non-volatile storage media attached via the PCI Express (PCIe) bus. NVMe allows the parallelism of modern SSDs to be effectively utilized by node hardware and software.
Microprocessor architectures generally operate on data types of a fixed width, as the registers within the microprocessor will all be of a fixed width. For example, many modern processors include either 32-bit or 64-bit wide registers. In order to achieve maximal efficiency, data must, prior to being operated on, be properly aligned in memory along boundaries defined by that fixed width.
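The short example below makes the alignment requirement concrete using C++'s alignas and alignof; the Record layout is purely illustrative.

```cpp
#include <cstdint>
#include <iostream>

// A 64-bit processor loads an int64_t in a single access only when it sits
// on an 8-byte boundary. alignas/alignof make the requirement explicit.
struct alignas(8) Record {
    std::int64_t id;      // naturally aligned on an 8-byte boundary
    std::int32_t count;
    std::int32_t pad;     // explicit padding keeps sizeof(Record) a multiple of 8
};

int main() {
    std::cout << "alignof(std::int64_t) = " << alignof(std::int64_t) << '\n'
              << "sizeof(Record) = " << sizeof(Record) << '\n';
}
```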
Storage and network bandwidth are substantial drivers of cost for DBMS systems. Accordingly, prior art DBMS systems tend to optimize data handling to conserve storage and network bandwidth. For example, a prior art DBMS system may include a data manipulation step that effectively compresses data prior to storing it or transmitting it via the network, and a decompression step when reading data from storage or the network.
Accordingly, it is an object of this disclosure to provide an infrastructure for a DBMS, and an apparatus and method that operate more efficiently than prior art systems.
Another object of the disclosure is to provide an efficient network infrastructure.
Another object of the disclosure is to provide a network infrastructure for a DBMS that allows a database application to directly access pinned RDMA memory.
Another object of the disclosure is to provide a network infrastructure for a DBMS that manages pinned RDMA memory.
Another object of the disclosure is to provide a storage infrastructure for a DBMS that allows a database application to directly access storage buffers used for disk access.
Another object of the disclosure is to provide a storage infrastructure for a DBMS that manages pinned DMA memory.
Another object of the disclosure is to provide a DBMS infrastructure that allows an application to directly access NVME drives.
Another object of the disclosure is to provide a DBMS infrastructure that allows an application to directly access SATA drives.
Another object of the disclosure is to provide a DBMS infrastructure whereby data is stored in a format that is usable by a processor with minimal adjustment.
Another object of the disclosure is to provide a DBMS infrastructure whereby network data is maintained in a format that is usable by a processor with minimal adjustment.
Other advantages of this disclosure will be clear to a person of ordinary skill in the art. It should be understood, however, that a system or method could practice the disclosure while not achieving all of the enumerated advantages, and that the protected disclosure is defined by the claims.
A networked database management system along with the supporting infrastructure is disclosed. The disclosed DBMS is capable of handling enormous amounts of data—an Exabyte or more—and accessing each record within the database frequently. In one embodiment the disclosed DBMS comprises a first high speed storage cluster including a first plurality of storage nodes. Each storage node includes a server and one or more storage drives that operate at a high performance level. The first high speed storage cluster also includes a first switch. The DBMS also comprises a second high speed storage cluster including a second plurality of storage nodes. Each storage node in this second cluster also includes a server and one or more storage drives, which operate at a lower performance level. The second high speed storage cluster also includes a second switch. The DBMS also comprises an index cluster including a plurality of index nodes and a third switch. The first switch is operatively coupled to the third switch by a high speed RDMA capable link, and the second switch is operatively coupled to the third switch by a high speed RDMA capable link.
In a separate embodiment, the disclosed DBMS includes an application with a drive access class. The drive access class includes an NVME drive access subclass and a SATA drive access subclass. The NVME drive access subclass allows the application to directly interface with NVME drives, and the SATA drive access subclass allows the application to directly interface with SATA drives. In addition, NVRAM technologies are now viable, and a specific subclass is allocated to optimize access to drives utilizing NVRAM technology. As future storage technologies are introduced, additional optimized storage access subclasses can be developed.
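A minimal sketch of how such a class hierarchy might look follows; the names (DriveAccess, NvmeDriveAccess, SataDriveAccess, NvramDriveAccess) are hypothetical, and the method bodies, which would issue commands to the controllers directly from user space, are elided.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical drive access abstraction: one base class, one subclass per
// storage technology, so new technologies slot in as further subclasses.
class DriveAccess {
public:
    virtual ~DriveAccess() = default;
    virtual void read(std::uint64_t lba, void* buf, std::size_t len) = 0;
    virtual void write(std::uint64_t lba, const void* buf, std::size_t len) = 0;
};

class NvmeDriveAccess : public DriveAccess {   // NVMe submission/completion queues
public:
    void read(std::uint64_t, void*, std::size_t) override { /* elided */ }
    void write(std::uint64_t, const void*, std::size_t) override { /* elided */ }
};

class SataDriveAccess : public DriveAccess {   // SATA command issue
public:
    void read(std::uint64_t, void*, std::size_t) override { /* elided */ }
    void write(std::uint64_t, const void*, std::size_t) override { /* elided */ }
};

class NvramDriveAccess : public DriveAccess {  // load/store-addressable NVRAM
public:
    void read(std::uint64_t, void*, std::size_t) override { /* elided */ }
    void write(std::uint64_t, const void*, std::size_t) override { /* elided */ }
};
```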
For example, a node in a database management system will include a drive controller, such as a SATA drive controller that communicates with a drive, such as a SATA hard drive or a SATA solid state drive. An application running on the node will utilize a SATA drive access class, or equivalent code abstraction (such as a function set), to access the SATA drive. The application will create a pinned memory buffer, which may in certain circumstances be an RDMA memory buffer. The application can, for example, establish a queue using the pinned memory buffer; i.e., using the pinned memory buffer as the queue elements, with the queue having a plurality of fixed size entries. The application will then directly access the SATA drive using the SATA drive access class, and write data from one of the fixed size entries to the SATA drive.
In a further embodiment, the node will include a network interface controller, which is adapted to receive data into one of the fixed size data entries in the queue. For example, the network interface controller can use RDMA to copy data into the fixed size data entry as explained herein. Further, the application can then write the received data onto a SATA drive (solid state, spinning magnetic, etc.) from the fixed size entry using the SATA drive access class without making any operating system calls.
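One plausible shape for such a queue is sketched below: a ring of fixed-size entries carved out of the pinned buffer, with the producer (NIC or drive controller) filling entries at the tail and the application consuming at the head. The class name, the 4 KiB entry size, and the single-threaded design are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Ring of fixed-size entries over a pinned buffer. Not thread-safe: a real
// implementation would synchronize head/tail (e.g., with atomics).
class PinnedQueue {
public:
    static constexpr std::size_t kEntrySize = 4096;  // one fixed-size entry

    PinnedQueue(std::uint8_t* pinned, std::size_t entries)
        : base_(pinned), entries_(entries) {}

    // Entry the producer (NIC/controller) should fill next; nullptr if full.
    std::uint8_t* producer_slot() {
        if (tail_ - head_ == entries_) return nullptr;
        return base_ + (tail_ % entries_) * kEntrySize;
    }
    void publish() { ++tail_; }                      // entry now readable

    // Entry the application should consume next; nullptr if empty.
    std::uint8_t* consumer_slot() {
        if (head_ == tail_) return nullptr;
        return base_ + (head_ % entries_) * kEntrySize;
    }
    void release() { ++head_; }     // data no longer needed; may be overwritten

private:
    std::uint8_t* base_;
    std::size_t entries_;
    std::uint64_t head_ = 0, tail_ = 0;
};

int main() {
    static std::uint8_t storage[8 * PinnedQueue::kEntrySize]; // stands in for pinned memory
    PinnedQueue q(storage, 8);
    if (auto* slot = q.producer_slot()) { slot[0] = 0x42; q.publish(); }
    if (auto* slot = q.consumer_slot()) { (void)slot; q.release(); } // consume in place
}
```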
In another embodiment, the disclosed DBMS incorporates a node. The node includes a network interface card including a controller. A DBMS application executing on the node coordinates with the controller to allocate a pinned memory buffer that the DBMS application accesses directly, allowing it to manipulate received network data without intermediate copies.
In another embodiment, the disclosed DBMS incorporates a node. The node includes a drive controller. A DBMS application allocates a pinned memory buffer, and communicates this buffer to the drive controller, which allows the DBMS application to access and manipulate drive data with minimal overhead.
Although the characteristic features of this disclosure will be particularly pointed out in the claims, the invention itself, and the manner in which it may be made and used, may be better understood by referring to the following description taken in connection with the accompanying drawings forming a part hereof, wherein like reference numerals refer to like parts throughout the several views and in which:
This application discloses a number of infrastructure improvements for use in a database system, along with a DBMS utilizing those improvements. The infrastructure is adapted to allow the database system to scale to a size of an Exabyte or even larger, and allow each record stored in the database to be accessed quickly and numerous times per day. Such a database will be useful for many tasks; for example, every bit of data exchanged over a company's network can be stored in such a database and analyzed to determine, for example, the mechanism of a network intrusion after it has been discovered.
One issue that such a database must overcome is data access latency. There are numerous sources of data access latency, including, for example, network accesses, disk accesses, and memory copies. These latencies are exacerbated in distributed systems.
Turning to the Figures, and to
A blazing storage node 101 may include, for example, an array of NVDIMM (Non-Volatile Dual Inline Memory Module) (a type of NVRAM) storage, such as that marketed by Hewlett Packard Enterprise, or any other extremely fast storage, along with appropriate controllers to allow for full speed access to such storage. For example, DRAM with a write ahead log implemented on an NVMe drive could be utilized to implement blazing storage. Specifically, write-ahead logging is used to log updates to an in-memory (DRAM) data structure. As log entries are appended to the end of the log, they are flushed to an NVMe drive when the size of the in-memory log entries nears or exceeds the size of a solid state memory page, or after a configured timeout threshold, such as ten seconds, has been reached. This guarantees an upper bound on the amount of data that is lost if a power outage or system crash should occur; it also allows the entire in-memory structure to be rebuilt from the disk log after a restart. Since the log is written sequentially to the SSD, write amplification on the solid state drive is minimized.
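The flush policy described above might be sketched as follows; the class name, the 4 KiB page size, and the stubbed NVMe write are assumptions for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Log updates accumulate in DRAM and are flushed to the NVMe log when the
// pending bytes reach a solid state memory page (4 KiB assumed) or the ten
// second timeout elapses, bounding the data at risk in a crash.
class WriteAheadLog {
public:
    void append(const char* entry, std::size_t len) {
        pending_.insert(pending_.end(), entry, entry + len);
        if (pending_.size() >= kPageSize || timed_out()) flush();
    }

private:
    static constexpr std::size_t kPageSize = 4096;
    static constexpr std::chrono::seconds kTimeout{10};

    bool timed_out() const {
        return std::chrono::steady_clock::now() - last_flush_ >= kTimeout;
    }
    void flush() {
        flush_to_nvme(pending_.data(), pending_.size()); // sequential append
        pending_.clear();
        last_flush_ = std::chrono::steady_clock::now();
    }
    static void flush_to_nvme(const char*, std::size_t) { /* drive write stub */ }

    std::vector<char> pending_;
    std::chrono::steady_clock::time_point last_flush_ =
        std::chrono::steady_clock::now();
};
```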
In addition, NVRAM technology can also be utilized to implement blazing storage. In such a case, the in-memory structures would be preserved in the event of a power outage. A hot storage node 111 may include, for example, one or more solid state NVME drives, along with appropriate controllers to allow for full speed access to such storage. A warm storage node 121 may include, for example, one or more solid state SATA drives, along with appropriate controllers to allow for full speed access to such storage.
Each index node 131 will also include storage, which will generally comprise high performance storage such as Solid State SATA drives or higher performance storage devices. Generally, the index nodes 131 will store the relational database structure, which may comprise, for example, a collection of tables and search keys.
To allow for information exchange as fast as possible, certain of the clusters are connected via high speed, RDMA capable links 108. In particular, the index cluster 135 is connected to the storage clusters 105, 115, 125 by high speed, RDMA capable links 108. On the other hand, the storage clusters 105, 115, 125 are connected to one another by standard (non-RDMA capable) high performance network links 109, such as 10 Gbps Ethernet.
As discussed above, InfiniBand is an example of a high speed, RDMA capable link. Importantly, such links allow different nodes in each cluster to exchange information rapidly; as discussed above, information from one node is inserted into the memory of another node without consuming processor cycles of the target node. The blazing storage cluster 105 also comprises a high speed switch 103. Each blazing storage node 101 is operatively coupled to the high speed switch 103 through a high speed, RDMA capable link 108. Similarly, each hot storage node 111 is coupled to a high speed switch 113 through a high speed, RDMA capable link 108, and each warm storage node 121 is coupled to the high speed switch 123 through a high speed, RDMA capable link 108. Similarly, the high speed switches 103, 113, 123 coupled to each storage cluster 105, 115, 125 are each coupled to the high speed switch 133 of the index cluster 135 by a high speed, RDMA capable link 108.
Turning to
Similarly, an index node 300 includes an RDMA capable network interface card 306, which communicates with other devices over an RDMA capable network 210. The network interface card 306 receives data directly into a pinned memory buffer 302. An index app 304 directly accesses the pinned memory buffer 302, which can be managed as a queue 320.
Each entry in the queues 220, 320 will be of a fixed memory size, such as, for example, 4 kilobytes. In a preferred embodiment of the disclosed DBMS and associated infrastructure, the size of a queue entry will be set to the maximal size of any network message expected to be passed between nodes. As data is received by the node (data or index), the corresponding application (database or index) directly operates on the information without copying it. As data is used and no longer needed, a queue pointer is advanced, so that the no longer needed data can be overwritten. Accordingly, the application directly manages the RDMA buffer. This is in contrast to prior art systems, where the RDMA buffers are managed by a driver and accessed through the operating system.
The steps by which network data is received by the disclosed DBMS system are illustrated in
Turning to
Each entry in the queue 420 will be of a fixed memory size, such as, for example, 4 kilobytes. In a preferred embodiment of the disclosed DBMS and associated infrastructure, the size of a queue entry will be set to the same as, or an integer multiple of, the page size of the storage drives used by a particular node. As data is read from storage 410, the application 404 directly operates on the data without copying it. As data is used and no longer needed, a queue pointer is advanced, so that the no longer needed data can be overwritten. Accordingly, the application directly manages the DMA buffer. This is in contrast to prior art DBMS systems, where DMA buffers are managed by a driver and accessed through the operating system.
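As a rough analogue, the sketch below uses POSIX O_DIRECT to read one page-aligned entry from a drive into an aligned buffer, bypassing the kernel's page cache; the disclosed system goes further and bypasses the kernel entirely, so this is an approximation only. The file path and 4 KiB page size are illustrative.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// O_DIRECT requires the buffer, file offset, and length to be aligned to
// the device's block/page size: the same rule the queue entries above follow.
int main() {
    const size_t kPage = 4096;                       // assumed drive page size
    void* buf = nullptr;
    if (posix_memalign(&buf, kPage, kPage) != 0) return 1;    // aligned entry

    int fd = open("/path/to/datafile", O_RDONLY | O_DIRECT);  // illustrative path
    if (fd < 0) { std::perror("open"); return 1; }

    ssize_t n = pread(fd, buf, kPage, 0);            // DMA lands in buf directly
    if (n < 0) std::perror("pread");
    else std::printf("read %zd bytes into the aligned buffer\n", n);

    close(fd);
    std::free(buf);
    return 0;
}
```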
The steps by which data is read by a node from storage into memory for the disclosed DBMS system are set forth in
While the disclosed DBMS system's applications directly managing RDMA buffers will substantially improve performance versus prior art implementations, additional improvements can still be made. In particular, prior art systems typically maintain data in a packed, or compressed, format. For example, to conserve disk storage and network bandwidth, most systems will pack Boolean type data into a single bit. In addition, many systems will actually apply simple compression before transmitting data via a network or committing it to disk storage. Prior art systems do this to ensure that network bandwidth and persistent storage—both of which are scarce resources—are used efficiently.
An example of a prior art data structure 500 is shown in
The prior art data structure 500 efficiently stores 78 data entries in a mere 320 bits; in fact, not a single bit in the prior art data structure 500 is unused. However, each piece of data in the prior art data structure, with the exception of the header, will require multiple operations prior to being used. For example, long word 503 must be copied to a separate location, and the upper 32 bits masked off, prior to being copied into a register and operated on. Each of these operations will use precious processor time in the interest of conserving memory, network bandwidth, and disk storage.
The disclosed DBMS system, however, is not optimized to minimize network usage or disk storage. Rather, the disclosed DBMS system is optimized to maximize performance. Accordingly, each data entry is stored in a 64-bit memory location, as depicted in a simplified fashion in
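The contrast can be made concrete with a small example; the field values are arbitrary. In the packed layout two 32-bit values share one 64-bit long word, so reading either costs a shift and a mask, while in the performance-oriented layout each value owns an aligned 64-bit slot and is register-ready as stored.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Packed layout: values a and b share a single 64-bit long word.
    std::uint64_t packed = (std::uint64_t{7} << 32) | 42u;
    auto a = static_cast<std::uint32_t>(packed & 0xFFFFFFFFu); // mask off upper 32 bits
    auto b = static_cast<std::uint32_t>(packed >> 32);         // shift before use

    // Performance layout: each value occupies its own aligned 64-bit slot.
    std::uint64_t slots[2] = {42, 7};                          // no unpacking needed

    std::cout << a << ' ' << b << " vs " << slots[0] << ' ' << slots[1] << '\n';
}
```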
Another way in which the disclosed DBMS system optimizes performance is by the various DBMS applications directly accessing NVME, NVRAM and SATA drives. This is done through an abstraction layer, which is generally illustrated in
Storage access services include opening a file, reading a file, or writing a file. Generally, such services are provided through an operating system, which utilizes a device specific driver to interface with the particular device. Operating system calls are time-consuming, and thereby decrease the performance of DBMS systems that utilize them. Rather than suffering such performance penalties, the disclosed DBMS system directly accesses NVRAM, NVME, and SATA controllers, as disclosed herein.
The foregoing description of the disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. The description was selected to best explain the principles of the present teachings and practical application of these principles to enable others skilled in the art to best utilize the disclosure in various embodiments and various modifications as are suited to the particular use contemplated. It is intended that the scope of the disclosure not be limited by the specification, but be defined by the claims set forth below. In addition, although narrow claims may be presented below, it should be recognized that the scope of this invention is much broader than presented by the claim(s). It is intended that broader claims will be submitted in one or more applications that claim the benefit of priority from this application. Insofar as the description above and the accompanying drawings disclose additional subject matter that is not within the scope of the claim or claims below, the additional inventions are not dedicated to the public and the right to file one or more applications to claim such additional inventions is reserved.
This application claims the benefit and priority of U.S. Patent Application No. 62/403,328, entitled “APPLICATION DIRECT ACCESS TO NETWORK RDMA MEMORY,” filed Oct. 3, 2016, which is hereby incorporated by reference in its entirety. This application also claims the benefit and priority of U.S. Patent Application No. 62/403,231, entitled “HIGHLY PARALLEL DATABASE MANAGEMENT SYSTEM,” filed Oct. 3, 2016, which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
62/403,328 | Oct 2016 | US
62/403,231 | Oct 2016 | US