At least one embodiment of the present invention pertains to a virtual machine environment in which multiple virtual machines share access to non-volatile solid-state memory.
Virtual machine data processing environments are commonly used today to improve the performance and utilization of multi-core/multi-processor computer systems. In a virtual machine environment, multiple virtual machines share the same physical hardware, such as memory and input/output (I/O) devices. A software layer called a hypervisor, or virtual machine manager, typically provides the virtualization, i.e., enables the sharing of hardware.
A virtual machine can provide a complete system platform which supports the execution of a complete operating system. One of the advantages of virtual machine environments is that multiple operating systems (which may or may not be the same type of operating system) can coexist on the same physical platform. In addition, a virtual machine can have an instruction set architecture that is different from that of the physical platform on which it is implemented.
It is desirable to improve the performance of any data processing system, including one which implements a virtual machine environment. One way to improve performance is to reduce the latency and increase the random access throughput associated with accessing a processing system's memory. In this regard, flash memory, and NAND flash memory in particular, has certain very desirable properties. Flash memory generally has a very fast random read access speed compared to that of conventional disk drives. Also, flash memory is substantially cheaper than conventional DRAM and is not volatile like DRAM.
However, flash memory also has certain characteristics that make it unfeasible simply to replace the DRAM or disk drives of a computer with flash memory. In particular, a conventional flash memory is typically a block access device, which can receive only one command (e.g., a read or a write) at a time from the host. Consequently, it can become a bottleneck in applications where low latency and/or high throughput is needed.
In addition, while flash memory generally has superior read performance compared to conventional disk drives, its write performance has to be managed carefully. One reason for this is that each time a unit (write block) of flash memory is written, a larger unit (erase block) of the flash memory must first be erased, and the size of an erase block is typically much larger than that of a write block. These characteristics add latency to write operations. Furthermore, flash memory tends to wear out after a finite number of erase operations.
When memory is shared by multiple virtual machines in a virtualization environment, it is important to provide adequate fault containment for each virtual machine. Further, it is important to provide for efficient memory sharing by virtual machines. Normally these functions are provided by the hypervisor, which increases the complexity and code size of the hypervisor.
One or more embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
References in this specification to “an embodiment”, “one embodiment”, or the like, mean that the particular feature, structure or characteristic being described is included in at least one embodiment of the present invention. Occurrences of such phrases in this specification do not necessarily all refer to the same embodiment; however, such occurrences are not necessarily mutually exclusive either.
A system and method of providing multiple virtual machines with shared access to non-volatile solid-state memory are described. As described in greater detail below, a processing system that includes multiple virtual machines can include or access a non-volatile solid-state memory (NVSSM) subsystem which includes raw flash memory to store data persistently. Some examples of non-volatile solid-state memory are flash memory and battery-backed DRAM. The NVSSM subsystem can be used as, for example, the primary persistent storage facility of the processing system and/or the main memory of the processing system.
To make use of flash's desirable properties in a virtual machine environment, it is important to provide adequate fault containment for each virtual machine. Therefore, in accordance with the technique introduced here, a hypervisor can implement fault isolation among the virtual machines by configuring each virtual machine to have exclusive write access to a separate portion of the NVSSM subsystem.
Further, it is desirable to provide for efficient memory sharing of flash by the virtual machines. Hence, the technique introduced here avoids the bottleneck normally associated with accessing flash memory through a conventional serial interface, by instead using remote direct memory access (RDMA) to move data to and from the NVSSM subsystem. The techniques introduced here thus allow the advantages of flash memory to be obtained without incurring the latency and loss of throughput normally associated with a serial command interface between the host and the flash memory.
Both read and write accesses to the NVSSM subsystem are controlled by each virtual machine, and more specifically, by an operating system of each virtual machine (where each virtual machine has its own separate operating system), which in certain embodiments includes a log structured, write out-of-place data layout engine. The data layout engine generates scatter-gather lists to specify the RDMA read and write operations. At a lower level, all read and write access to the NVSSM subsystem can be controlled from an RDMA controller in the processing system, under the direction of the operating systems.
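By way of a purely illustrative sketch, a scatter-gather list can be thought of as an ordered set of segments, each identified by a steering tag (STag) for a registered memory region, an offset within that region, and a length. The C type and field names below are hypothetical and are not taken from any particular RDMA implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor for one segment of an RDMA scatter- or gather-list. */
struct sg_entry {
    uint32_t stag;    /* steering tag identifying a registered memory region */
    uint64_t offset;  /* byte offset within that region                      */
    uint32_t length;  /* number of bytes in this segment                     */
};

/* A scatter-gather list is simply an ordered array of such segments. */
struct sg_list {
    struct sg_entry entries[16];
    unsigned        count;
};

/* Append one segment; returns 0 on success, -1 if the list is full. */
static int sg_add(struct sg_list *l, uint32_t stag, uint64_t off, uint32_t len)
{
    if (l->count >= 16)
        return -1;
    l->entries[l->count++] = (struct sg_entry){ .stag = stag, .offset = off, .length = len };
    return 0;
}

int main(void)
{
    struct sg_list gather  = { .count = 0 };  /* source segments in host memory    */
    struct sg_list scatter = { .count = 0 };  /* destination segments in the NVSSM */

    sg_add(&gather,  0x10, 0,      4096);     /* 4 kB block in the buffer cache    */
    sg_add(&scatter, 0x20, 131072, 4096);     /* its destination in NVSSM memory   */

    printf("gather segments: %u, scatter segments: %u\n", gather.count, scatter.count);
    return 0;
}
```

In an RDMA write, the gather list describes source segments in host memory and the scatter list describes destination segments in the NVSSM subsystem; for an RDMA read, the roles are reversed.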
The technique introduced here supports compound RDMA commands; that is, one or more client-initiated operations such as reads or writes can be combined by the processing system into a single RDMA read or write, respectively, which upon receipt at the NVSSM subsystem is decomposed and executed as multiple parallel or sequential reads or writes, respectively. The multiple reads or writes executed at the NVSSM subsystem can be directed to different memory devices in the NVSSM subsystem, which may include different types of memory. For example, in certain embodiments, user data and associated resiliency metadata (such as Redundant Array of Inexpensive Disks/Devices (RAID) data and checksums) are stored in flash memory in the NVSSM subsystem, while associated file system metadata are stored in non-volatile DRAM in the NVSSM subsystem. This approach allows updates to file system metadata to be made without having to incur the cost of erasing flash blocks, which is beneficial since file system metadata tends to be frequently updated. Further, when a sequence of RDMA operations is sent by the processing system to the NVSSM subsystem, completion status may be suppressed for all of the individual RDMA operations except the last one.
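As a minimal sketch of the decomposition side of such a compound operation (the device numbers, structure names, and console output below are invented for illustration and do not represent the controller's actual behavior), a compound RDMA write arriving at the NVSSM subsystem can be split into per-device writes while reporting only a single completion:

```c
#include <stdint.h>
#include <stdio.h>

/* One destination segment of a compound RDMA write (hypothetical layout). */
struct dest_seg {
    unsigned device;   /* flash or NV-DRAM device the segment targets */
    uint64_t offset;   /* byte offset within that device              */
    uint32_t length;   /* segment length in bytes                     */
};

/* Stand-in for issuing one write to one memory device. */
static void device_write(const struct dest_seg *s)
{
    printf("write %u bytes to device %u at offset %llu\n",
           (unsigned)s->length, s->device, (unsigned long long)s->offset);
}

/* Decompose a compound write into per-device writes; report one completion. */
static void execute_compound_write(const struct dest_seg *segs, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        device_write(&segs[i]);          /* could equally be issued in parallel */
    printf("completion: %u segments done\n", n);    /* single completion status */
}

int main(void)
{
    /* Three host-side writes combined into one compound RDMA write that lands
     * on two flash devices and one NV-DRAM device.                            */
    struct dest_seg segs[] = {
        { .device = 0, .offset = 0,    .length = 4096 },  /* user data         */
        { .device = 1, .offset = 4096, .length = 4096 },  /* resiliency data   */
        { .device = 7, .offset = 128,  .length = 256  },  /* file metadata     */
    };
    execute_compound_write(segs, 3);
    return 0;
}
```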
The techniques introduced here have a number of possible advantages. One is that the use of an RDMA semantic to provide virtual machine fault isolation improves performance and reduces the complexity of the hypervisor for fault isolation support. It also allows the virtual machines to bypass the hypervisor completely and perform I/O operations themselves once the hypervisor has set up virtual machine access to the NVSSM subsystem, thus further improving performance and reducing overhead on the core for “domain 0”, which runs the hypervisor.
Another possible advantage is the performance improvement achieved by combining multiple I/O operations into a single RDMA operation. This includes support for data resiliency by supporting multiple data redundancy techniques using RDMA primitives. Yet another possible advantage is improved support for virtual machine data sharing through the use of RDMA atomic operations. Still another possible advantage is the extension of flash memory (or other NVSSM memory) to support filesystem metadata for a single virtual machine and for shared virtual machine data. Another possible advantage is support for multiple flash devices behind a node supporting virtual machines, by extending the RDMA semantic. Further, the techniques introduced above allow shared and independent NVSSM caches and permanent storage in NVSSM devices under virtual machines.
As noted above, in certain embodiments the NVSSM subsystem includes “raw” flash memory, and the storage of data in the NVSSM subsystem is controlled by an external (relative to the flash device), log structured data layout engine of a processing system which employs a write anywhere storage policy. By “raw”, what is meant is a memory device that does not have any on-board data layout engine (in contrast with conventional flash SSDs). A “data layout engine” is defined herein as any element (implemented in software and/or hardware) that decides where to store data and locates data that is already stored. “Log structured”, as the term is defined herein, means that the data layout engine lays out its write patterns in a generally sequential fashion (similar to a log) and performs all writes to free blocks.
The NVSSM subsystem can be used as the primary persistent storage of a processing system, or as the main memory of a processing system, or both (or as a portion thereof). Further, the NVSSM subsystem can be made accessible to multiple processing systems, one or more of which implement virtual machine environments.
In some embodiments, the data layout engine in the processing system implements a “write out-of-place” (also called “write anywhere”) policy when writing data to the flash memory (and elsewhere), as described further below. In this context, writing out-of-place means that whenever a logical data block is modified, that data block, as modified, is written to a new physical storage location, rather than overwriting it in place. (Note that a “logical data block” managed by the data layout engine in this context is not the same as a physical “block” of flash memory. A logical block is a virtualization of physical storage space, which does not necessarily correspond in size to a block of flash memory. In one embodiment, each logical data block managed by the data layout engine is 4 kB, whereas each physical block of flash memory is much larger, e.g., 128 kB.) Because the flash memory does not have any internal data layout engine, the external write-out-of-place data layout engine of the processing system can write data to any free location in flash memory. Consequently, the external write-out-of-place data layout engine can write modified data to a smaller number of erase blocks than if it had to rewrite the data in place, which helps to reduce wear on flash devices.
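A simplified sketch of the write-out-of-place policy follows; the block counts, the flat logical-to-physical map, and the omission of free-space cleaning are assumptions made only for illustration and do not describe the data layout engine itself.

```c
#include <stdio.h>

#define LOGICAL_BLOCKS   8           /* 4 kB logical blocks tracked by the map */
#define PHYSICAL_BLOCKS 32           /* free 4 kB slots available in flash     */

static int      l2p[LOGICAL_BLOCKS]; /* logical-to-physical map (-1 = unmapped) */
static unsigned next_free;           /* log head: next free physical slot       */

/* Write-out-of-place: a modified logical block is never overwritten in place;
 * it is appended at the log head and the map is updated to point there.       */
static int write_block(unsigned logical)
{
    if (next_free >= PHYSICAL_BLOCKS)
        return -1;                   /* out of free space (cleaning not shown)  */
    l2p[logical] = (int)next_free++;
    return l2p[logical];
}

int main(void)
{
    for (unsigned i = 0; i < LOGICAL_BLOCKS; i++)
        l2p[i] = -1;

    write_block(3);                  /* first write of logical block 3          */
    int before = l2p[3];
    write_block(3);                  /* update: same logical block, new slot    */
    printf("logical block 3 moved from physical slot %d to %d\n", before, l2p[3]);
    return 0;
}
```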
Refer now to
The NVSSM subsystem 26 can be within the same physical platform/housing as that which contains the virtual machines 4, although that is not necessarily the case. In some embodiments, the virtual machines 4 and the NVSSM subsystem 26 may all be considered to be part of a single processing system; however, that does not mean the NVSSM subsystem 26 must be in the same physical platform as the virtual machines 4.
In one embodiment, the processing system 2 is a network storage server. The storage server may provide file-level data access services to clients (not shown), such as commonly done in a NAS environment, or block-level data access services such as commonly done in a SAN environment, or it may be capable of providing both file-level and block-level data access services to clients.
Further, although the processing system 2 is illustrated as a single unit in
By assigning a “particular portion”, what is meant is assigning a particular portion of the memory space of the NVSSM subsystem 26, which does not necessarily mean assigning a particular physical portion of the NVSSM subsystem 26. Nonetheless, in some embodiments, assigning different portions of the memory space of the NVSSM subsystem 26 may in fact involve assigning distinct physical portions of the NVSSM subsystem 26.
The use of an RDMA semantic in this way to provide virtual machine fault isolation improves performance and reduces the overall complexity of the hypervisor 11 for fault isolation support.
In operation, once each virtual machine 4 has received its STag(s) from the hypervisor 11, it can access the NVSSM subsystem 26 by communicating through the RDMA controller 12, without involving the hypervisor 11. This technique, therefore, also improves performance and reduces overhead on the processor core for “domain 0”, which runs the hypervisor 11.
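The following sketch (with hypothetical structure and field names) illustrates the kind of range check an RDMA controller could apply against a hypervisor-assigned STag; it is this check, rather than hypervisor involvement on every I/O, that confines each virtual machine to its own portion of the NVSSM memory space.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical registration record kept by the RDMA controller for one STag. */
struct stag_entry {
    uint32_t stag;     /* steering tag handed to the virtual machine           */
    uint64_t base;     /* start of the NVSSM region the STag grants access to  */
    uint64_t length;   /* size of that region                                  */
    int      writable; /* whether the owning virtual machine may write with it */
};

/* Reject any access outside the region bound to the STag; this is what keeps
 * a faulty virtual machine confined to its own portion of the NVSSM.          */
static int access_allowed(const struct stag_entry *e,
                          uint64_t offset, uint64_t len, int is_write)
{
    if (is_write && !e->writable)
        return 0;
    return offset >= e->base && offset + len <= e->base + e->length;
}

int main(void)
{
    /* Hypervisor-assigned STags: each virtual machine gets exclusive write
     * access to a disjoint 1 GB slice of the NVSSM memory space.              */
    struct stag_entry vm0 = { .stag = 0x100, .base = 0,        .length = 1u << 30, .writable = 1 };
    struct stag_entry vm1 = { .stag = 0x101, .base = 1u << 30, .length = 1u << 30, .writable = 1 };

    printf("vm0 write inside its slice: %s\n",
           access_allowed(&vm0, 4096, 4096, 1) ? "allowed" : "denied");
    printf("vm1 write into vm0's slice: %s\n",
           access_allowed(&vm1, 4096, 4096, 1) ? "allowed" : "denied");
    return 0;
}
```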
The hypervisor 11 includes an NVSSM data layout engine 13 which can control RDMA operations and is responsible for determining the placement of data and flash wear-leveling within the NVSSM subsystem 26, as described further below. This functionality includes generating scatter-gather lists for RDMA operations performed on the NVSSM subsystem 26. In certain embodiments, at least some of the virtual machines 4 also include their own NVSSM data layout engines 46, as illustrated in
In one embodiment, as illustrated in
In addition, in certain embodiments, data consistency is maintained by providing remote locks in the NVSSM subsystem 26. More particularly, this is achieved by causing each virtual machine 4 to access the remote-lock memory in the NVSSM subsystem 26 through the RDMA controller only by using atomic memory access operations. This alleviates the need for a distributed lock manager and simplifies fault handling, since the locks and the data reside in the same memory. Any number of atomic operations can be used; two specific examples, which can be used to support all other atomic operations, are compare-and-swap and fetch-and-add.
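As a minimal illustration of such remote locking, assuming a single 64-bit lock word and simulating the RDMA atomic locally, a lock in NVSSM memory can be acquired and released using compare-and-swap alone; the function names below are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Lock word kept in the NVSSM subsystem's memory; 0 means free, otherwise it
 * holds the identifier of the virtual machine that currently owns the lock.   */
static uint64_t lock_word;

/* Stand-in for an RDMA atomic compare-and-swap against NVSSM memory: if the
 * word equals 'expect', replace it with 'swap'; always return the value that
 * was observed, as the RDMA atomic operation would.                           */
static uint64_t rdma_compare_and_swap(uint64_t *remote, uint64_t expect, uint64_t swap)
{
    uint64_t observed = *remote;
    if (observed == expect)
        *remote = swap;
    return observed;
}

static int try_acquire(uint64_t vm_id)
{
    return rdma_compare_and_swap(&lock_word, 0, vm_id) == 0;
}

static void release_lock(uint64_t vm_id)
{
    rdma_compare_and_swap(&lock_word, vm_id, 0);
}

int main(void)
{
    printf("VM 1 acquires: %s\n", try_acquire(1) ? "yes" : "no");  /* succeeds */
    printf("VM 2 acquires: %s\n", try_acquire(2) ? "yes" : "no");  /* fails    */
    release_lock(1);
    printf("VM 2 acquires: %s\n", try_acquire(2) ? "yes" : "no");  /* succeeds */
    return 0;
}
```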
From the above description, it can be seen that the hypervisor 11 generates STags to control fault isolation of the virtual machines 4. In addition, the hypervisor 11 can also generate STags to implement a wear-leveling scheme across the NVSSM subsystem 26 and/or to implement load balancing across the NVSSM subsystem 26, and/or for other purposes.
The processors 21 include central processing units (CPUs) of the processing system 2 and, thus, control the overall operation of the processing system 2. In certain embodiments, the processors 21 accomplish this by executing software or firmware stored in memory 22. The processors 21 may be, or may include, one or more programmable general-purpose or special-purpose microprocessors, digital signal processors (DSPs), programmable controllers, application specific integrated circuits (ASICs), programmable logic devices (PLDs), or the like, or a combination of such devices.
The memory 22 is, or includes, the main memory of the processing system 2. The memory 22 represents any form of random access memory (RAM), read-only memory (ROM), flash memory, or the like, or a combination of such devices. In use, the memory 22 may contain, among other things, multiple operating systems 40, each of which is (or is part of) a virtual machine 4. The multiple operating systems 40 can be different types of operating systems or different instantiations of one type of operating system, or a combination of these alternatives.
Also connected to the processors 21 through the interconnect 23 are a network adapter 24 and an RDMA controller 25. The RDMA controller 25 is henceforth referred to as the “host RDMA controller” 25. The network adapter 24 provides the processing system 2 with the ability to communicate with remote devices over the network 3 and may be, for example, an Ethernet, Fibre Channel, ATM, or Infiniband adapter.
The RDMA techniques described herein can be used to transfer data between host memory in the processing system 2 (e.g., memory 22) and the NVSSM subsystem 26. Host RDMA controller 25 includes a memory map of all of the memory in the NVSSM subsystem 26. The memory in the NVSSM subsystem 26 can include flash memory 27 as well as some form of non-volatile DRAM 28 (e.g., battery backed DRAM). Non-volatile DRAM 28 is used for storing filesystem metadata associated with data stored in the flash memory 27, to avoid the need to erase flash blocks due to updates of such frequently updated metadata. Filesystem metadata can include, for example, a tree structure of objects, such as files and directories, where the metadata of each of these objects recursively has the metadata of the filesystem as if it were rooted at that object. In addition, filesystem metadata can include the names, sizes, ownership, access privileges, etc. for those objects.
As can be seen from
In the basic operation of the NVSSM subsystem 26, data is scheduled into the NAND flash devices by one or more data layout engines located external to the NVSSM subsystem 26, which may be part of the operating systems 40 or the hypervisor 11 running on the processing system 2. An example of such a data layout engine is described in connection with
In the illustrated embodiment, the NVSSM subsystem 26 also includes a switch 34, where each flash controller 33 is coupled to the interconnect 31 by the switch 34.
The NVSSM subsystem 26 further includes a separate battery backed DRAM DIMM coupled to each of the flash controllers 33, implementing the non-volatile DRAM 28. The non-volatile DRAM 28 can be used to store file system metadata associated with data being stored in the flash devices 32.
In the illustrated embodiment, the NVSSM subsystem 26 also includes another non-volatile (e.g., battery-backed) DRAM buffer DIMM 36 coupled to the switch 34. DRAM buffer DIMM 36 is used for short-term storage of data to be staged from, or destaged to, the flash devices 32. A separate DRAM controller 35 (e.g., FPGA) is used to control the DRAM buffer DIMM 36 and to couple the DRAM buffer DIMM 36 to the switch 34.
In contrast with conventional SSDs, the flash controllers 33 do not implement any data layout engine; they simply interface the specific signaling requirements of the flash DIMMs 32 with those of the host interconnect 31. As such, the flash controllers 33 do not implement any data indirection or data address virtualization for purposes of accessing data in the flash memory. All of the usual functions of a data layout engine (e.g., determining where data should be stored and locating stored data) are performed by an external data layout engine in the processing system 2. Due to the absence of a data layout engine within the NVSSM subsystem 26, the flash DIMMs 32 are referred to as “raw” flash memory.
Note that the external data layout engine may use knowledge of the specifics of data placement and wear leveling within flash memory. This knowledge and functionality could be implemented within a flash abstraction layer, which is external to the NVSSM subsystem 26 and which may or may not be a component of the external data layout engine.
Logically “under” the file system manager 41, to allow the processing system 2 to communicate over the network 3 (e.g., with clients), the operating system 40 also includes a network stack 42. The network stack 42 implements various network protocols to enable the processing system to communicate over the network 3.
Also logically under the file system manager 41, to allow the processing system 2 to communicate with the NVSSM subsystem 26, the operating system 40 includes a storage access layer 44, an associated storage driver layer 45, and may include an NVSSM data layout engine 46 disposed logically between the storage access layer 44 and the storage drivers 45. The storage access layer 44 implements a higher-level storage redundancy algorithm, such as RAID-3, RAID-4, RAID-5, RAID-6 or RAID-DP. The storage driver layer 45 implements a lower-level protocol.
The NVSSM data layout engine 46 can control RDMA operations and is responsible for determining the placement of data and flash wear-leveling within the NVSSM subsystem 26, as described further below. This functionality includes generating scatter-gather lists for RDMA operations performed on the NVSSM subsystem 26.
It is assumed that the hypervisor 11 includes its own data layout engine 13 with functionality such as described above. However, a virtual machine 4 may or may not include its own data layout engine 46. In one embodiment, the functionality of any one or more of these NVSSM data layout engines 13 and 46 is implemented within the RDMA controller.
If a particular virtual machine 4 does include its own data layout engine 46, then it uses that data layout engine to perform I/O operations on the NVSSM subsystem 26. Otherwise, the virtual machine uses the data layout engine 13 of the hypervisor 11 to perform such operations. To facilitate explanation, the remainder of this description assumes that virtual machines 4 do not include their own data layout engines 46. Note, however, that essentially all of the functionality described herein as being implemented by the data layout engine 13 of the hypervisor 11 can also be implemented by a data layout engine 46 in any of the virtual machines 4.
The storage driver layer 45 controls the host RDMA controller 25 and implements a network protocol that supports conventional RDMA, such as FCVI, InfiniBand, or iWarp. Also shown in
Both read access and write access to the NVSSM subsystem 26 are controlled by the operating system 40 of a virtual machine 4. The techniques introduced here use conventional RDMA techniques to allow efficient transfer of data to and from the NVSSM subsystem 26, for example, between the memory 22 and the NVSSM subsystem 26. It can be assumed that the RDMA operations described herein are generally consistent with conventional RDMA standards, such as InfiniBand (InfiniBand Trade Association (IBTA)) or IETF iWarp (see, e.g.: RFC 5040, A Remote Direct Memory Access Protocol Specification, October 2007; RFC 5041, Direct Data Placement over Reliable Transports; RFC 5042, Direct Data Placement Protocol (DDP)/Remote Direct Memory Access Protocol (RDMAP) Security IETF proposed standard; RFC 5043, Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation; RFC 5044, Marker PDU Aligned Framing for TCP Specification; RFC 5045, Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct Data Placement Protocol (DDP); RFC 4296, The Architecture of Direct Data Placement (DDP) and Remote Direct Memory Access (RDMA) on Internet Protocols; RFC 4297, Remote Direct Memory Access (RDMA) over IP Problem Statement).
In an embodiment according to
In one embodiment consistent with
For each granular subset of the NVSSM memory 26, the NVSSM subsystem 26 also provides to the host RDMA controller 25 an RDMA STag and the location of a lock used for accesses to that granular memory subset; the host RDMA controller 25 then provides the STag to the NVSSM data layout engine 13 of the hypervisor 11.
If multiple processing systems 2 are sharing the NVSSM subsystem 26, then each processing system 2 may have access to a different subset of memory in the NVSSM subsystem 26. In that case, the STag provided in each processing system 2 identifies the appropriate subset of NVSSM memory to be used by that processing system 2. In one embodiment, a protocol which is external to the NVSSM subsystem 26 is used between processing systems 2 to define which subset of memory is owned by which processing system 2. The details of such protocol are not germane to the techniques introduced here; any of various conventional network communication protocols could be used for that purpose. In another embodiment, some or all of memory of DIMM 28 is mapped to an RDMA STag for each processing system 2 and shared data stored in that memory is used to determine which subset of memory is owned by which processing system 2. Furthermore, in another embodiment, some or all of the NVSSM memory can be mapped to an STag of different processing systems 2 to be shared between them for read and write data accesses. Note that the algorithms for synchronization of memory accesses between processing systems 2 are not germane to the techniques being introduced here.
In the embodiment of
In one embodiment consistent with
In the embodiment of
During normal operation, the NVSSM data layout engine 13 (
As noted above, the hypervisor 11 includes an NVSSM data layout engine 13, which can be implemented in an RDMA controller 53 of the processing system 2, as shown in
The single RDMA data access 52 includes a scatter-gather list, where the NVSSM data layout engine 13 generates the list for the NVSSM subsystem 26 and the file system manager 41 of a virtual machine generates the list for processing system internal memory (e.g., buffer cache 6). A scatter list or a gather list can specify multiple memory segments at the source or destination (whichever is applicable). Furthermore, a scatter list or a gather list can specify memory segments that are in different subsets of memory.
In the embodiment of
The processing system 2 can initiate a sequence of related RDMA reads or writes to the NVSSM subsystem 26 (where any individual RDMA read or write in the sequence can be a compound RDMA operation as described above). Thus, the processing system 2 can convert any combination of one or more client-initiated reads or writes or any other data or metadata operations into any combination of one or more RDMA reads or writes, respectively, where any of those RDMA reads or writes can be a compound read or write, respectively.
In cases where the processing system 2 initiates a sequence of related RDMA reads or writes (or any other data or metadata operations) to the NVSSM subsystem 26, it may be desirable to suppress completion status for all of the individual RDMA operations in the sequence except the last one. In other words, if a particular RDMA read or write is successful, “completion” status is not generated by the NVSSM subsystem 26 unless that operation is the last one in the sequence. Such suppression can be done by using conventional RDMA techniques. “Completion” status received at the processing system 2 means that the written data is in the NVSSM subsystem memory, or that read data from the NVSSM subsystem is in processing system memory (for example, in buffer cache 6) and valid. In contrast, “completion failure” status indicates that there was a problem executing the operation in the NVSSM subsystem 26 and, in the case of an RDMA write, that the state of the data in the NVSSM locations targeted by the RDMA write is undefined, while the data in the processing system from which it was being written remains intact. Failure status for a read means that the data is still intact in the NVSSM subsystem but the state of the processing system memory is undefined. Failure also results in invalidation of the STag that was used by the RDMA operation; however, the connection between the processing system 2 and the NVSSM subsystem 26 remains intact and can be used, for example, to generate a new STag.
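The sketch below (with a hypothetical work-request structure rather than any specific RDMA API) illustrates the suppression idea: in a related sequence, only the final operation is marked to generate completion status, so a single completion stands for the whole sequence while an error on any operation would still surface.

```c
#include <stdio.h>

/* Hypothetical work request: one RDMA read or write in a related sequence. */
struct work_request {
    const char *name;
    int         signaled;  /* 1 = this request generates completion status */
};

/* Post a sequence in which only the final request is signaled, so one
 * completion stands for the whole sequence.                               */
static void post_sequence(struct work_request *wr, int n)
{
    for (int i = 0; i < n; i++) {
        wr[i].signaled = (i == n - 1);
        printf("post %-18s signaled=%d\n", wr[i].name, wr[i].signaled);
    }
}

int main(void)
{
    struct work_request seq[] = {
        { "write user data",     0 },
        { "write RAID metadata", 0 },
        { "write file metadata", 0 },
    };
    post_sequence(seq, 3);
    return 0;
}
```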
In certain embodiments, MSI-X (message signaled interrupts (MSI) extension) is used to indicate an RDMA operation's completion and to direct interrupt handling to a specific processor core, for example, a core where the hypervisor 11 is running or a core where a specific virtual machine is running. Moreover, the hypervisor 11 can direct MSI-X interrupt handling to the core which issued the I/O operation, thus improving efficiency, reducing latency for users, and reducing the CPU burden on the hypervisor core.
Reads or writes executed in the NVSSM subsystem 26 can also be directed to different memory devices in the NVSSM subsystem 26. For example, in certain embodiments, user data and associated resiliency metadata (e.g., RAID parity data and checksums) are stored in raw flash memory within the NVSSM subsystem 26, while associated file system metadata is stored in non-volatile DRAM within the NVSSM subsystem 26. This approach allows updates to file system metadata to be made without incurring the cost of erasing flash blocks.
This approach is illustrated in
The file system manager 41 in the processing system 2 initially stores the write data 63 in a source memory 60, which may be memory 22 (
Accordingly, the file system manager 41 causes the NVSSM data layout manager 46 to initiate an RDMA write, to write the data 63 from the processing system buffer cache 6 into the NVSSM subsystem 26. To initiate the RDMA write, the NVSSM data layout engine 13 generates a gather list 65 including source pointers to the buffers in source memory 60 where the write data 63 resides and where file system manager 41 generated corresponding RAID metadata and file metadata, and the NVSSM data layout engine 13 generates a corresponding scatter list 64 including destination pointers to where the data 63 and corresponding RAID metadata and file metadata shall be placed at NVSSM 26. In the case of an RDMA write, the gather list 65 specifies the memory locations in the source memory 60 from where to retrieve the data to be transferred, while the scatter list 64 specifies the memory locations in the NVSSM subsystem 26 into which the data is to be written. By specifying multiple destination memory locations, the scatter list 64 specifies multiple individual write accesses to be performed in the NVSSM subsystem 26.
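Purely for illustration, and with invented NVSSM address ranges, the following sketch shows how destination pointers in such a scatter list might be chosen so that bulk data and resiliency metadata land in flash while the small, frequently updated file system metadata is directed to the NVSSM subsystem's non-volatile DRAM:

```c
#include <stdint.h>
#include <stdio.h>

enum item_kind { USER_DATA, RESILIENCY_METADATA, FILE_METADATA };

/* Assumed NVSSM address map: one flash region and one NV-DRAM region. */
#define FLASH_BASE   0x00000000ULL
#define NVDRAM_BASE  0x80000000ULL

static uint64_t flash_cursor  = FLASH_BASE;   /* next free flash location   */
static uint64_t nvdram_cursor = NVDRAM_BASE;  /* next free NV-DRAM location */

/* Pick a destination for one item of the scatter list: bulk data and
 * resiliency metadata go to flash; file system metadata goes to NV-DRAM so
 * that frequent updates do not force flash erase cycles.                    */
static uint64_t place(enum item_kind kind, uint32_t length)
{
    uint64_t *cursor = (kind == FILE_METADATA) ? &nvdram_cursor : &flash_cursor;
    uint64_t dest = *cursor;
    *cursor += length;
    return dest;
}

int main(void)
{
    printf("user data           -> 0x%08llx\n", (unsigned long long)place(USER_DATA, 4096));
    printf("resiliency metadata -> 0x%08llx\n", (unsigned long long)place(RESILIENCY_METADATA, 64));
    printf("file metadata       -> 0x%08llx\n", (unsigned long long)place(FILE_METADATA, 256));
    return 0;
}
```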
The scatter-gather list 64, 65 can also include pointers for resiliency metadata generated by the virtual machine 4, such as RAID metadata, parity, checksums, etc. The gather list 65 includes source pointers that specify where such metadata is to be retrieved from in the source memory 60, and the scatter list 64 includes destination pointers that specify where such metadata is to be written to in the NVSSM subsystem 26. In the same way, the scatter-gather list 64, 65 can further include pointers for basic file system metadata 67, which specifies the NVSSM blocks where file data and resiliency metadata are written in NVSSM (so that the file data and resiliency metadata can be found by reading file system metadata). As shown in
As is well known, flash memory is laid out in terms of erase blocks. Any time a write is performed to flash memory, the entire erase block or blocks that are targeted by the write must be first erased, before the data is written to flash. This erase-write cycle creates wear on the flash memory and, after a large number of such cycles, a flash block will fail. Therefore, to reduce the number of such erase-write cycles and thereby reduce the wear on the flash memory, the RDMA controller 12 can accumulate write requests and combine them into a single RDMA write, so that the single RDMA write substantially fills each erase block that it targets.
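A minimal sketch of that accumulation follows, assuming 4 kB write blocks and 128 kB erase blocks (consistent with the example sizes given earlier) and omitting error handling and partial flushes on timeout or shutdown.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WRITE_BLOCK 4096u              /* 4 kB logical write block   */
#define ERASE_BLOCK (128u * 1024u)     /* 128 kB flash erase block   */
#define SLOTS       (ERASE_BLOCK / WRITE_BLOCK)

/* Accumulate incoming 4 kB writes and issue them as one RDMA write only
 * once a whole erase block's worth of data has been gathered.            */
struct write_accumulator {
    uint8_t  buf[ERASE_BLOCK];
    unsigned filled;                   /* number of 4 kB slots in use  */
};

static void flush(struct write_accumulator *a)
{
    printf("single RDMA write of %u bytes (%u blocks)\n",
           a->filled * WRITE_BLOCK, a->filled);
    a->filled = 0;
}

static void submit(struct write_accumulator *a, const uint8_t *block)
{
    memcpy(a->buf + (size_t)a->filled * WRITE_BLOCK, block, WRITE_BLOCK);
    if (++a->filled == SLOTS)
        flush(a);                      /* the erase block is now full  */
}

int main(void)
{
    static struct write_accumulator acc;      /* zero-initialized      */
    uint8_t block[WRITE_BLOCK] = { 0 };

    for (unsigned i = 0; i < SLOTS; i++)      /* 32 small writes, one flush */
        submit(&acc, block);
    return 0;
}
```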
In certain embodiments, the RDMA controller 12 implements a RAID redundancy scheme to distribute data for each RDMA write across multiple memory devices within the NVSSM subsystem 26. The particular form of RAID and the manner in which data is distributed in this respect can be determined by the hypervisor 11, through the generation of appropriate STags. The RDMA controller 12 can present to the virtual machines 4 a single address space which spans multiple memory devices, thus allowing a single RDMA operation to access multiple devices but having a single completion. The RAID redundancy scheme is therefore transparent to each of the virtual machines 4. One of the memory devices in a flash bank can be used for storing checksums, parity and/or cyclic redundancy check (CRC) information, for example. This technique also can be easily extended by providing multiple NVSSM subsystems 26 such as described above, where data from a single write can be distributed across such multiple NVSSM subsystems 26 in a similar manner.
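The following toy example is one way such a distribution could look; the device count, stripe size, and RAID-4-style XOR parity are assumptions chosen only for illustration, not a description of the redundancy scheme actually selected by the hypervisor 11.

```c
#include <stdint.h>
#include <stdio.h>

#define DEVICES 4     /* three data devices plus one parity device (assumed) */
#define STRIPE  8     /* bytes stored per device in this toy stripe          */

/* Distribute one stripe of data across the data devices and accumulate XOR
 * parity onto the last device; the whole stripe could then be carried by a
 * single compound RDMA write with a single completion.                      */
static void build_stripe(const uint8_t *data, uint8_t devs[DEVICES][STRIPE])
{
    for (unsigned b = 0; b < STRIPE; b++)
        devs[DEVICES - 1][b] = 0;                    /* clear parity device  */
    for (unsigned d = 0; d < DEVICES - 1; d++)
        for (unsigned b = 0; b < STRIPE; b++) {
            devs[d][b] = data[d * STRIPE + b];       /* place the data       */
            devs[DEVICES - 1][b] ^= devs[d][b];      /* accumulate parity    */
        }
}

int main(void)
{
    uint8_t data[(DEVICES - 1) * STRIPE];
    for (unsigned i = 0; i < sizeof(data); i++)
        data[i] = (uint8_t)i;

    uint8_t devs[DEVICES][STRIPE];
    build_stripe(data, devs);

    /* Byte 0 of parity is data[0] ^ data[8] ^ data[16] = 0 ^ 8 ^ 16 = 24. */
    printf("parity byte 0 = 0x%02x\n", (unsigned)devs[DEVICES - 1][0]);
    return 0;
}
```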
If the requested data resides in the NVSSM subsystem 26, the NVSSM data layout manager 46 generates a gather list 85 for NVSSM subsystem 26 and the file system manager 41 generates a corresponding scatter list 84 for buffer cache 6, first to retrieve file metadata. In one embodiment, the file metadata is retrieved from the NVSSM's DRAM 28. In one RDMA read, file metadata can be retrieved for multiple file systems and for multiple files and directories in a file system. Based on the retrieved file metadata, a second RDMA read can then be issued, with file system manager 41 specifying a scatter list and NVSSM data layout manager 46 specifying a gather list for the requested read data. In the case of an RDMA read, the gather list 85 specifies the memory locations in the NVSSM subsystem 26 from which to retrieve the data to be transferred, while the scatter list 84 specifies the memory locations in a destination memory 80 into which the data is to be written. The destination memory 80 can be, for example, memory 22. By specifying multiple source memory locations, the gather list 85 can specify multiple individual read accesses to be performed in the NVSSM subsystem 26.
The gather list 85 also specifies memory locations in the NVSSM subsystem 26 from which file system metadata for the first RDMA read, and resiliency metadata (e.g., RAID metadata, checksums, etc.) and file system metadata for the second RDMA read, are to be retrieved. As indicated above, these various different types of data and metadata can be retrieved from different locations in the NVSSM subsystem 26, including different types of memory (e.g., flash 27 and non-volatile DRAM 28).
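A hypothetical sketch of that two-step read follows; the metadata record layout and function names are invented for illustration. The first RDMA read retrieves file metadata from the NVSSM's non-volatile DRAM, and the block locations it yields drive the gather list of the second RDMA read against flash.

```c
#include <stdint.h>
#include <stdio.h>

/* Invented file metadata record: where a file's blocks live in flash. */
struct file_meta {
    uint64_t flash_offset[4];   /* locations of up to four 4 kB data blocks */
    uint32_t block_count;
};

/* Stand-in for the first RDMA read: gather file metadata from NV-DRAM. */
static void rdma_read_metadata(struct file_meta *out)
{
    *out = (struct file_meta){ .flash_offset = { 0x20000, 0x64000 }, .block_count = 2 };
}

/* Stand-in for the second RDMA read: the gather list against flash is built
 * from the block locations recovered by the first read.                     */
static void rdma_read_data(const struct file_meta *m)
{
    for (uint32_t i = 0; i < m->block_count; i++)
        printf("gather 4096 bytes from flash offset 0x%llx\n",
               (unsigned long long)m->flash_offset[i]);
}

int main(void)
{
    struct file_meta meta;
    rdma_read_metadata(&meta);   /* step 1: file metadata from NV-DRAM */
    rdma_read_data(&meta);       /* step 2: file data from flash       */
    return 0;
}
```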
Note that one benefit of using the RDMA semantic is that even for data block updates there is a potential performance gain. For example, referring to
Next, at 1002 the virtual machine (“VM”) determines whether it has a write lock (write ownership) for the targeted portion of memory in the NVSSM subsystem 26. If it does have write lock for that portion, the process continues to 1003. If not, the process continues to 1007, which is discussed below.
At 1003, the file system manager 41 (
If, at 1002, the write is for a portion of memory (i.e. NVSSM subsystem 26) that is shared between multiple virtual machines 4, and the writing virtual machine does not have write lock for that portion of memory, then at 1007 the process waits until the write lock for that portion of memory is available to that virtual machine, and then proceeds to 1003 as discussed above.
The write lock can be implemented by using an RDMA atomic operation to the memory in the NVSSM subsystem 26. The semantic and control of the shared memory accesses follow the hypervisor's shared memory semantic, which in turn may be the same as the virtual machines' semantic. Thus, when a virtual machine acquires the write lock, and when it releases it, is defined by the hypervisor using standard operating system calls.
Initially, at 1121 the NVSSM data layout engine 13 creates a gather list specifying locations in the NVSSM subsystem 26 where the data to be read resides. At 1122 the file system manager 41 creates a scatter list specifying locations in host memory (e.g., memory 22) to which the read data is to be written. At 1123 the operating system 40 sends an RDMA Read operation with the scatter-gather list to the RDMA controller (which in the embodiment of
It will be recognized that the techniques introduced above have a number of possible advantages. One is that the use of an RDMA semantic to provide virtual machine fault isolation improves performance and reduces the complexity of the hypervisor for fault isolation support. It also allows the virtual machines to bypass the hypervisor completely, thus further improving performance and reducing overhead on the core for “domain 0”, which runs the hypervisor.
Another possible advantage is the performance improvement achieved by combining multiple I/O operations into a single RDMA operation. This includes support for data resiliency by supporting multiple data redundancy techniques using RDMA primitives.
Yet another possible advantage is improved support for virtual machine data sharing through the use of RDMA atomic operations. Still another possible advantage is the extension of flash memory (or other NVSSM memory) to support filesystem metadata for a single virtual machine and for shared virtual machine data. Another possible advantage is support for multiple flash devices behind a node supporting virtual machines, by extending the RDMA semantic. Further, the techniques introduced above allow shared and independent NVSSM caches and permanent storage in NVSSM devices under virtual machines.
Thus, a system and method of providing multiple virtual machines with shared access to non-volatile solid-state memory have been described.
The methods and processes introduced above can be implemented in special-purpose hardwired circuitry, in software and/or firmware in conjunction with programmable circuitry, or in a combination of such forms. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), etc.
Software or firmware to implement the techniques introduced here may be stored on a machine-readable medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “machine-readable medium”, as the term is used herein, includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant (PDA), manufacturing tool, any device with a set of one or more processors, etc.). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), etc.
Although the present invention has been described with reference to specific exemplary embodiments, it will be recognized that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense.