The present invention relates to a data processing system, and more specifically, to high-speed synchronous writes to persistent storage in a data processing system.
Overall computer system performance is affected by each of the key elements of the structure of the computer system, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), and the type and structure of the memory interconnect interface(s).
High-availability computer systems present challenges related to overall system reliability due to customer expectations that new computer systems will markedly surpass existing systems in regard to mean-time-between-failure (MTBF), in addition to supporting additional functions, increased performance, increased storage, lower operating costs, etc. Other frequent customer requirements further exacerbate design challenges, and include such items as ease of upgrade and reduced system environmental impact, such as space, power, and cooling.
Most contemporary computer systems perform some type of logging to store data for use, for example, during restart and/or recovery processing. Typically, logging is performed in a synchronous manner and the operation of the application requesting the logging is interrupted until an I/O to write the log data to persistent storage is completed by an I/O subsystem. The processor initiates a write command to an I/O subsystem and suspends operation of the application until the processor receives a notification that the write command has completed.
An embodiment is a system that includes a memory configured to provide a write requestor with a direct write programming interface to a disk device. The memory includes a first persistent memory that includes memory locations and that is configured for designating at least a portion of the memory locations as central processing unit (CPU) load storable memory. The first persistent memory is also configured for receiving write data from the write requestor, for storing the write data in the CPU load storable memory, and for returning a write completion message to the write requestor in response to the storing completing. The memory also includes a second persistent memory that includes the disk device, and a controller in communication with the first persistent memory and the second persistent memory. The controller is configured for detecting the storing of the write data to the CPU load storable memory in the first persistent memory. The controller is also configured for copying the write data to the second persistent memory in response to detecting the storing of the write data.
Another embodiment is a method that includes providing a write requestor with a direct write programming interface to a disk device. The providing includes designating at least a portion of a first persistent memory as CPU load storable memory, receiving write data from the write requestor, and storing the write data into the CPU load storable memory. A write completion message is returned to the write requestor in response to the storing completing. The storing of write data to the CPU load storable memory is detected by a controller that is in communication with the first persistent memory and a second persistent memory. The second persistent memory includes the disk device. The write data is copied to a predetermined location in the second persistent memory in response to the detecting. The copying is performed by the controller and is in response to the detecting.
A further embodiment is a computer program product that includes a tangible storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method. The method includes providing a write requestor with a direct write programming interface to a disk device. The providing includes designating at least a portion of a first persistent memory as CPU load storable memory, receiving write data from the write requestor, and storing the write data into the CPU load storable memory. A write completion message is returned to the write requestor in response to the storing completing. The storing of write data to the CPU load storable memory is detected by a controller that is in communication with the first persistent memory and a second persistent memory. The second persistent memory includes the disk device. The write data is copied to a predetermined location in the second persistent memory in response to the detecting. The copying is performed by the controller and is in response to the detecting.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with the advantages and the features, refer to the description and to the drawings.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
High-speed synchronous writes to persistent storage are performed in accordance with exemplary embodiments described herein. A new programming model is used to accelerate the speed of synchronous writes of data, such as log data, to persistent storage.
Contemporary implementations of persistent storage may be characterized by aspects such as persistence, programming model, latency of write completion, and capacity. In terms of persistence, records of activity that need to be saved to persistent storage (e.g., log data) may be saved to media such as non-volatile (battery-backed) dynamic random access memory (DRAM), flash memory (e.g., a solid state drive or “SSD”), magnetic disk (e.g., a hard disk drive or “HDD”), or tape. Non-volatile DRAM provides the lowest latency of write completion because it is directly byte-storable by a central processing unit (CPU). However, non-volatile DRAM is not practical for the persistent storage of large volumes of data (e.g., gigabytes or terabytes of persistently maintained logs, records, etc.). In addition, non-volatile DRAM only provides temporal persistence (i.e., until the battery dies or capacitance is lost). Disk devices, such as SSDs and HDDs, do provide the needed capacities for large volumes of data, and they are characterized by long-term persistence. However, when compared to non-volatile DRAM, disk devices have a longer programming path length and a longer latency of write completion. This is because disk devices are not byte addressable and they require programming of a controller that then manages block transfers (via direct memory access or “DMA”) from the host memory to the device.
Embodiments described herein provide persistent storage that is a mix of non-volatile DRAM and flash memory (SSD) to provide both lower latency and simplified programming (e.g., using a direct CPU store) with disk capacities that support large volumes of data. Embodiments described herein include a new programming model for existing SSDs (or other type of disk devices) that maximizes the performance of applications that require confirmation that synchronous writes to disk have completed (e.g., for integrity, reliability) before continuing on to the next instruction(s). From the perspective of a write requestor (e.g., an application, a device driver), the new programming model provides a direct write programming interface (e.g., memory mapped CPU byte stores) to a disk device.
In an embodiment, the persistent storage includes both a non-volatile DRAM and a flash memory. Instead of the programming model requiring block input/output (I/O) write setup through a disk controller, all or a portion of the DRAM is memory mapped and made direct CPU store addressable to an application program. The DRAM is made non-volatile by providing battery backup or simply by containing enough capacitance that it is guaranteed to support draining to the flash memory. In one embodiment, the non-volatile DRAM is remapped to correspond to a different range of flash memory logical blocks on demand. In another embodiment, the non-volatile DRAM is used in a circular buffer fashion by updating start and end pointers within the buffer for new writes, and as soon as the pointers are updated, the contents are spilled to the flash memory. This allows the application program to perform its writes (e.g., log data writes) to persistent storage at CPU store-to-memory speed, without having to make a system call to the operating system to initiate an I/O operation and await completion. As soon as the CPU stores are completed, the log data write is complete and in persistent storage, and the application program can continue processing.
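The contrast between the two programming models can be pictured as a minimal sketch. All of the names below are hypothetical illustrations (a Python list stands in for the disk and a bytearray stands in for the memory-mapped non-volatile DRAM window); this is not the claimed interface itself.

```python
# Hypothetical sketch: conventional block-I/O write path vs. direct CPU store path.

def write_via_block_io(data, disk):
    """Conventional path: the host builds an I/O request, the disk controller
    performs a DMA block transfer, and the application is suspended until a
    completion interrupt arrives."""
    request = {"op": "write", "payload": bytes(data)}  # I/O setup through the OS
    disk.append(request["payload"])                    # controller DMAs the block
    return "interrupt: write complete"                 # application resumes here

def write_via_cpu_store(data, mapped_window):
    """Direct path described above: the CPU stores bytes straight into the
    memory-mapped non-volatile DRAM; the write is persistent as soon as the
    stores finish, with no system call and no wait for an interrupt."""
    mapped_window[0:len(data)] = data                  # direct CPU byte stores
    return "store complete"
```

In the direct-store path the application can continue to its next instruction immediately after the stores complete, which is the latency advantage the embodiment targets.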
Embodiments described herein eliminate performance and throughput bottlenecks that may be caused by contemporary methods for performing synchronous writes to persistent storage. Embodiments may be used for application scenarios that require some amount of data to be written synchronously and require acknowledgement that the write has completed successfully into persistent storage prior to the application continuing. Examples include, but are not limited to database logs, file system journal logs, intent logs, security logs, and compliance audit logs. Another example is trace files that capture application/operating system (OS) execution paths/history for performance or first-failure-data-capture analysis.
As used herein, the term “synchronous store” or “synchronous write” refers to a store, or write, operation that must be completed to persistent memory before the application requesting the store operation can initiate the next instruction in the application.
As used herein, the term “persistent data” refers to data that will exist after execution of the program has completed. As used herein, the term “persistent storage” refers to a storage device (e.g., a non-volatile DRAM, a disk drive, a flash drive, etc.) where persistent data is stored.
As used herein, the term “non-volatile DRAM” refers to a DRAM that retains its data when the power is turned off. In an embodiment, the DRAM is only required to retain its data after a power interruption long enough to ensure that all its contents can be written to the backing persistent memory (e.g., flash memory).
As used herein, the term “memory mapped” or “memory mapped file” refers to a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of the non-volatile DRAM and that can be referenced by the OS through a file descriptor. Once present, this correlation between the non-volatile DRAM and the memory space in the virtual memory permits applications executing on a CPU to treat the mapped portions as if they are primary memory. Memory mapped storage is an example of a CPU load storable memory. Programmed I/O (PIO) is an example of a method of transferring data between the applications and the memory mapped portions of the non-volatile DRAM. In PIO, the CPU (or application executing on the CPU) is responsible for executing instructions that transfer data to/from the non-volatile DRAM at the memory mapped locations. The term “memory mapped I/O” or “MMIO” refers to I/O that is accessible via CPU loads/stores in order to transfer data from/to an I/O device. Thus, the terms MMIO and PIO refer to the same thing and are used herein interchangeably. PIO and MMIO are contrasted with direct memory access (DMA), in which a subsystem within the processor accesses system memory located on a storage device independently of the CPU. With DMA, the CPU initiates a data store, performs other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation has been completed. Once the CPU receives the interrupt, the application requesting the synchronous data store can continue processing.
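The PIO style of transfer described above can be sketched with a standard memory-mapping API. The snippet below uses Python's `mmap` module, with a temporary file standing in for the memory-mapped window of non-volatile DRAM; the function name and sizes are illustrative assumptions.

```python
import mmap
import os
import tempfile

def pio_store(record: bytes) -> bytes:
    """Store a record into a memory-mapped region with plain CPU stores and
    read it straight back. A temporary file stands in for the non-volatile
    DRAM window; no block-I/O request is constructed by the application."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, 4096)                  # size the mapped window
        with mmap.mmap(fd, 4096) as window:
            window[0:len(record)] = record      # PIO: direct byte stores
            window.flush()                      # push the stores to the backing medium
            return bytes(window[0:len(record)]) # load the bytes back via the mapping
    finally:
        os.close(fd)
        os.remove(path)
```

The application-visible interface is ordinary loads and stores to a mapped address range, which is what allows the write requestor to avoid the system-call and interrupt path of DMA-based block I/O.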
In an embodiment, elements (i.e., the non-volatile DRAM 104, flash memory 106 and micro-controller 108) of the memory system 110 shown in
Referring to
The system in
The right side of
The left side of
The system in
The right side of
The left side of
In the embodiment shown in
Referring to
The window LBA start pointer 506 points to the starting location of the backing store 502 and the window LBA end pointer 508 points to the last location in the backing store 502. In an embodiment, both the window LBA start pointer 506 and the window LBA end pointer 508 are stored as programmable entities (e.g., stored in a register or memory location on the flash memory device) that are programmed as part of system initialization when the LBA range for the log device is initially being “carved out” and allocated for use as the logical disk (i.e., log device 310 in
The window memory start pointer 510 points to the beginning location of the log memory window 504 (the start of the memory mapped portion of the non-volatile DRAM) and the window memory end pointer 516 points to the ending location of the log memory window 504 (the end of the memory mapped portion of the non-volatile DRAM). In an embodiment, both the window memory start pointer 510 and the window memory end pointer 516 are stored as programmable entities (e.g., stored in a register or memory location on the DRAM device) that are programmed at system initialization to define the circular DRAM buffer to which the application or device driver will write the data. The locations between the active record start pointer 512 and the active record end pointer 514 are the locations in the log memory window 504 where the next segment of log data, received via the PIO window interface, will be stored. In an embodiment, the terms “write segment” and “log segment” refer to the data bits that are written to the log by a single write command from an application. The spill start pointer 518 points to the location in the backing store 502 where the next log segment will be stored when the log segment is copied from the log memory window 504 to the backing store 502. Also shown in
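The interplay of these pointers can be sketched as a small circular-buffer model. The class below is a hypothetical illustration (simplified names, a bytearray in place of the DRAM window and another in place of the flash backing store); it shows a write segment landing between the active record pointers, the active record start pointer advancing, and the controller spilling the segment to the backing store at the spill start location.

```python
class LogMemoryWindow:
    """Sketch of the circular DRAM buffer and its pointers (names hypothetical,
    numeric reference labels such as 504/502 omitted)."""

    def __init__(self, window_size: int):
        self.buf = bytearray(window_size)       # memory-mapped non-volatile DRAM window
        self.active_start = 0                   # active record start pointer
        self.active_end = 0                     # active record end pointer
        self.spill_start = 0                    # next spill location in the backing store
        self.backing = bytearray(window_size * 8)  # flash backing store stand-in

    def write_segment(self, data: bytes) -> None:
        # Application performs direct CPU stores into the window, wrapping
        # circularly if the segment runs past the window end.
        for i, b in enumerate(data):
            self.buf[(self.active_start + i) % len(self.buf)] = b
        self.active_end = (self.active_start + len(data)) % len(self.buf)
        # Moving the active record start pointer forward is what signals the
        # controller that a completed segment is ready to spill.
        seg_start = self.active_start
        self.active_start = self.active_end
        self._controller_spill(seg_start, len(data))

    def _controller_spill(self, seg_start: int, length: int) -> None:
        # Controller copies the segment from the window to the backing store
        # at the spill start location, then advances the spill pointer.
        for i in range(length):
            self.backing[self.spill_start + i] = self.buf[(seg_start + i) % len(self.buf)]
        self.spill_start += length
```

In this sketch the application's write is complete as soon as its stores and the pointer update finish; the controller's copy to the backing store proceeds from there.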
Following is a process for performing a log write using the backing store 502 and the log memory window 504 shown in
As shown in
In an embodiment, the host processor (e.g., the application or the device driver) and/or the controller must check for and compensate for buffer wrap whenever updating/moving their respective pointers forward. In an embodiment, the controller is also responsible for enforcing that the backing store spill location cannot go beyond the window LBA end pointer 508. In another embodiment, the non-volatile DRAM is remapped to correspond to a different range of flash memory logical blocks on demand. In this embodiment, the spill start pointer 518 also becomes an element that is programmable by the host processor. At the start of each new disk write, the host processor programs a new starting location in the backing store 502 for the next write. An MMIO store is performed to store the desired spill start pointer 518 value prior to moving the active record start pointer 512 from its current location to the location pointed to by the active record end pointer 514, as described in the previous example. As described previously, moving the active record start pointer 512 triggers the controller to initiate the write from the log memory window 504 on the non-volatile DRAM to a spill location on the backing store 502.
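The remap-on-demand variant can be sketched as follows. The class and method names are hypothetical; the point illustrated is only the ordering: the host first programs a new spill start location via an MMIO store, and the subsequent pointer update triggers the controller copy to that location.

```python
class RemappableLog:
    """Sketch of the on-demand remapping embodiment (names hypothetical)."""

    def __init__(self, window_size: int, backing_size: int):
        self.window = bytearray(window_size)    # memory-mapped DRAM window stand-in
        self.backing = bytearray(backing_size)  # flash backing store stand-in
        self.spill_start = 0                    # programmable spill start pointer

    def program_spill_start(self, offset: int) -> None:
        # MMIO store of the desired spill start value, performed by the host
        # before the active record start pointer is moved.
        self.spill_start = offset

    def write_and_spill(self, data: bytes) -> None:
        self.window[0:len(data)] = data         # direct CPU stores into the window
        # Moving the active record start pointer (modeled implicitly here)
        # triggers the controller copy to the programmed spill location.
        self.backing[self.spill_start:self.spill_start + len(data)] = data
```

Because the spill start location is reprogrammed per write, consecutive segments can land in arbitrary, non-contiguous ranges of the flash logical block address space rather than advancing circularly.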
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Further, as will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.