There are many known ways for operating systems to manage block-based storage devices such as disk drives, virtual disks, storage area network (SAN) disks, etc. Typically, an operating system provides a storage stack, which may include a file system and one or more layers and drivers intermediating exchanges between the file system and a storage device. The file system provides organization and structure to data stored in the storage device, other layers of the storage stack handle exchanges between the file system and the storage device, and the storage device stores the data in blocks and provides related storage management functionality. For example, an operating system might have an ext3 file system, a SCSI (Small Computer System Interface) subsystem, and a SCSI disk drive, cooperating in known fashion.
Recently, virtual devices have become a common substitute for hardware storage devices such as hard drives. Most implementations of virtual disks or virtual storage devices use a special type of container or file, such as the Virtual Hard Disk (VHD) format, the Virtual Machine Disk (VMDK) format, the Virtual Desktop Infrastructure (VDI) format, and others, that acts as the backing store for a corresponding virtual disk. The term “storage device” is used herein to refer to both physical and virtual block-based storage devices.
Certain usage scenarios of storage devices, both virtual and non-virtual, give rise to inefficiencies. For instance, oftentimes a storage device is called upon to store data that may or may not require persistence across events such as operating system crashes, operating system reboots, storage device duplication, backups, etc. However, previous storage devices and supporting operating system storage stacks have treated all stored data as equivalent. For example, a video editing application might have a large storage space reserved for “scratch” temporary storage of data.
Consider a machine with an operating system. The operating system may have a paging or swap file. To free up memory, code and data that are not in use by the operating system may be written to the swap file, which is usually stored on a disk (in this example, the “disk” could also be a virtual disk, or any other block-based device). The data in the swap file may be faulted back into memory as necessary. When the machine is rebooted, the contents of the swap file usually become irrelevant, as the file's content is temporary. However, operating systems have treated I/O (input/output) to the operating system's swap file in nearly the same way all other disk I/O has been treated. That is, the operating system may ensure, without regard for the nature of data being stored: that writes to the swap file are stored to disk, that swap file I/O is properly ordered with other I/O transactions, etc. In addition, the swap file on the disk might be treated in the same way as any other data on that disk. For instance, the swap file is backed up when the disk is backed up, and the swap file is transferred over a network when the disk is copied across the network (e.g., when a virtual machine (VM) is replicated or migrated).
Generally, storage systems treat all data as equivalent and fail to address various storage-related inefficiencies. Techniques described herein relate to enabling differentiated storage for block-based storage devices.
The following summary is included only to introduce some concepts discussed in the Detailed Description below. This summary is not comprehensive and is not intended to delineate the scope of the claimed subject matter, which is set forth by the claims presented at the end.
A computing device manages access to a block-based storage device. The computing device has an operating system with a storage stack. The storage stack may have a file system, a device driver driving the block-based storage device, and a storage component (described below) intermediating between the device driver and the file system. The file system may receive a request to tag a file that is managed by the file system and is stored on the storage device. In response the file system requests the storage component to tag blocks corresponding to the file. The device driver forwards or translates the request from the storage component to the storage device. In turn, the storage device stores indicia of the blocks. Data stored in the identified blocks may receive differentiated treatment, by the storage device and/or the operating system, such as a particular choice of backing store, preferential handling, or others.
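The layered flow summarized above can be sketched as a chain of cooperating components. All class and method names below are hypothetical illustrations of the described architecture, not an actual operating system API.

```python
# Minimal sketch of the tag-request path: file system -> storage
# component -> device driver -> storage device. Names are hypothetical.

class StorageDevice:
    """Block-based storage device that stores indicia of tagged blocks."""
    def __init__(self):
        self.tagged_blocks = set()

    def tag_blocks(self, blocks):
        self.tagged_blocks.update(blocks)   # device stores the indicia


class DeviceDriver:
    """Forwards (or translates) tag requests to the storage device."""
    def __init__(self, device):
        self.device = device

    def tag_blocks(self, blocks):
        self.device.tag_blocks(blocks)


class StorageComponent:
    """Intermediates between the file system and the device driver."""
    def __init__(self, driver):
        self.driver = driver

    def tag_blocks(self, blocks):
        self.driver.tag_blocks(blocks)


class FileSystem:
    """Maps file names to blocks and propagates tag requests downward."""
    def __init__(self, component, file_table):
        self.component = component
        self.file_table = file_table        # file name -> list of blocks

    def tag_file(self, name):
        # A request to tag a file becomes a request to tag its blocks.
        self.component.tag_blocks(self.file_table[name])


# Example: tagging a swap file whose data occupies blocks 100-103.
device = StorageDevice()
fs = FileSystem(StorageComponent(DeviceDriver(device)),
                {"pagefile.sys": [100, 101, 102, 103]})
fs.tag_file("pagefile.sys")
```

Once the device holds the indicia, any layer that can see them may apply differentiated treatment to I/O targeting those blocks.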
Many of the attendant features will be explained below with reference to the following detailed description considered in connection with the accompanying drawings.
The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein like reference numerals are used to designate like parts in the accompanying description.
Embodiments discussed below relate to differentiated storage in block-based storage devices. Discussion will begin with an architectural overview. General processes for setting up and implementing differentiated storage will be described next. Implementation details for different storage standards will then be described, followed by discussion of usage scenarios and performance enhancements for differentiated storage.
As noted above, the block-based storage devices 106 may be either hardware devices or virtual devices. A hardware storage device, such as a disk drive or flash drive, will have an interface to communicate with the host computing device via a physical bus, a wireless link, etc. Virtual storage devices may connect through a virtual bus or other hypervisor-provided communication channel. A storage device can also be a SAN (storage area network) disk provided via a protocol such as iSCSI (Internet SCSI). In any case, the operating system 100 will provide necessary interfaces and drivers for communicating with the storage devices.
At step 144, the propagated (perhaps translated) tag request is received at a storage layer 104 below the file system. For example, the storage layer 104 may have a storage system module 142, which in this description represents any component found in a storage stack of an operating system. For example, the storage system module 142 might be a disk virtualization component that parses virtual disk files (e.g., VHD, VMDK, VDI, etc.) and provides them as virtual disk drives. The storage system module 142 can be implemented as a special device driver, a shim in the operating system's storage stack, part of a SCSI layer or subsystem connecting SCSI clients and targets, etc. In any case, the storage system module 142, at step 144, receives the tag request. Because in some implementations differentiated storage might not be supported at lower levels of the storage stack such as a device driver or the target storage device, the storage system module 142 may check down the stack for support for the tagging request. In a SCSI implementation, for example, this might involve sending a vital product data (VPD) request to the target storage device's device driver 146, which in turn may query the target storage device 106. The storage system module 142 then checks the VPD data to determine if differentiated storage is supported. Note that this compatibility check is not required; an error handling process, for example, can deal with any incompatibility faults. Ignoring possible incompatibility may be particularly feasible in implementations where lack of differentiated storage support only results in the default action of storing data in an ordinary undifferentiated manner.
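The capability check described above might be sketched as follows: issue a SCSI INQUIRY for the Supported VPD Pages page (page code 0x00) and look for a vendor-specific page that advertises differentiated-storage support. The 0xC0 page code is a hypothetical assumption for illustration; a real device would define its own.

```python
SUPPORTED_VPD_PAGES = 0x00
DIFF_STORAGE_VPD_PAGE = 0xC0      # hypothetical vendor-specific page code

def build_inquiry_cdb(page_code, alloc_len=252):
    """Build a 6-byte SCSI INQUIRY CDB with the EVPD bit set."""
    return bytes([
        0x12,                     # INQUIRY opcode
        0x01,                     # EVPD=1: request a VPD page
        page_code,
        (alloc_len >> 8) & 0xFF,  # allocation length (MSB)
        alloc_len & 0xFF,         # allocation length (LSB)
        0x00,                     # control byte
    ])

def supports_differentiated_storage(vpd_page0):
    """Check a Supported VPD Pages response for the vendor page.

    Per the SCSI Primary Commands layout, byte 3 is the page length
    and the supported page codes follow starting at byte 4.
    """
    page_length = vpd_page0[3]
    supported = vpd_page0[4:4 + page_length]
    return DIFF_STORAGE_VPD_PAGE in supported

# Example response from a device listing pages 0x00, 0x83, and 0xC0.
response = bytes([0x00, 0x00, 0x00, 0x03, 0x00, 0x83, 0xC0])
```

If the vendor page is absent, the module can simply fall back to ordinary undifferentiated storage, as noted above.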
The storage system module 142 may translate the received request into a format suitable for the next layer of the storage stack. For example, the tagging request may be issued as a SATA or SCSI command (e.g., a new command, a new parameter of an existing command such as a SCSI “mode select”, etc.). The storage system module 142 then sends the tag request down the storage stack, which, either directly or indirectly, is received by the device driver 146 which passes the request or command to the target storage device 106 for implementation.
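One way the translation into a SCSI command might look is sketched below: a vendor-specific CDB whose data-out buffer carries the block extents to tag. The opcode (0xC1) and payload layout are assumptions for illustration only; a real implementation would define its own command format (or reuse an existing command such as a mode select, as noted above).

```python
import struct

TAG_BLOCKS_OPCODE = 0xC1          # hypothetical vendor-specific opcode

def build_tag_request(extents):
    """Return (cdb, data_out) tagging a list of (start_lba, count) extents.

    The data-out buffer holds one 12-byte descriptor per extent:
    an 8-byte starting LBA followed by a 4-byte block count.
    """
    data_out = b"".join(struct.pack(">QI", lba, count)
                        for lba, count in extents)
    cdb = struct.pack(">BBHB11x",   # pad to a 16-byte CDB
                      TAG_BLOCKS_OPCODE,
                      0x00,          # reserved/flags
                      len(extents),  # number of extent descriptors
                      0x00)
    return cdb, data_out

# Example: tag two extents, 256 blocks at LBA 4096 and 128 at LBA 81920.
cdb, payload = build_tag_request([(4096, 256), (81920, 128)])
```

The device driver 146 would pass such a command through unchanged, leaving interpretation to the target storage device 106.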
To summarize, the storage system module 142 may be any component of the operating system that intermediates exchanges of storage requests, including tagging requests, between initiators/clients and storage devices. The storage system module 142 may or may not include multiple discrete storage layers, depending on implementation. The storage system module 142 may provide an interface between user space and the kernel. The storage system module 142 may also function as a traffic director, routing exchanges between storage devices and initiators, possibly translating between APIs or protocols as exchanges are passed to and from storage devices. The storage system module may perform other functions besides handling I/O requests, such as managing command queues, handling errors, managing power for storage devices, etc.
Indicia of the target blocks may be maintained at any of one or more places in the storage stack, including the target storage device, and no particular element of the storage stack is required to maintain indicia of the target blocks. That is, step 144 and step 148, to the extent they are performed, may be performed anywhere in a path through the storage stack from the file system to the target storage device.
As noted above, differentiated storage decisions and operations may be performed at any stage in a path through the storage stack to the target storage device where indicia of the tagged blocks is stored. In one embodiment, the storage system module 142 stores the set of block identifiers 174. If the storage system module 142 implements virtual disks, then the storage system module may make choices regarding which backing store to use, which virtual disk file/container to use, etc.
Embodiments may be implemented where indicia of the tagged blocks is not persisted and may be safely lost if the host machine is shut down, crashes, or otherwise loses state information. Note that the term “host machine”, as used herein, refers to both physical machines and virtual machines. Consider a SCSI-based embodiment where region or block tagging is used for the operating system's swap file. To use the tagging feature, the operating system opens a swap file shortly after its boot process starts. The operating system issues a trim or unmap command for the swap file, which logically discards any previous data in the swap file. That command flows down through the file system and any intermediary storage layers to the target storage device where the trim or unmap command is executed. The operating system then issues a file system control (fsctl) command directed to the swap file to indicate that the swap file is a special file (e.g., a file that will have a special storage contract). The storage stack may perform various internal management operations such as issuing a SCSI inquiry command, checking the target device's VPD data, issuing a mode-sense command, etc. Various management operations may be performed, such as selecting or creating a backing store specifically for the swap file (e.g., a separate VHD) and storing a list of relevant blocks. For efficiency the blocks may be encoded as a linked list where each node in the list identifies a starting block and a length. When writes to the swap file by the file system (or memory manager) are issued, a block to be written is handled as described above. In the event of a crash of the host machine, ordinary untagged blocks persist. If the backing store holding the blocks for the tagged swap file is non-durable, there is no problem because the swap file contents will have become moot.
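The extent encoding mentioned above, where each node identifies a starting block and a length, can be sketched as follows. The helper name is hypothetical; only the run-length idea comes from the description.

```python
def block_list_to_extents(blocks):
    """Collapse a list of block numbers into (start, length) extents.

    Contiguous runs of blocks become a single extent, so a large
    swap file typically reduces to a short list of nodes.
    """
    extents = []
    for block in sorted(blocks):
        if extents and block == extents[-1][0] + extents[-1][1]:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # extend the current run
        else:
            extents.append((block, 1))          # start a new run
    return extents

# Example: blocks 10-12 and 50-51 collapse into two extents.
extents = block_list_to_extents([10, 11, 12, 50, 51])
```

A linked list of such nodes keeps the per-file bookkeeping small regardless of how many individual blocks the swap file spans.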
To elaborate, by identifying the extents of the swap file within various virtual disk files (e.g., VHD files) attached to a machine, and by passing that information down the storage stack to the virtual disk, it becomes possible to identify paging I/O and treat it differently than other I/O that might be destined for the same storage device. When the machine is a virtual machine, this can be done for any guest operating system, for example, as part of a guest operating system's virtualization (i.e., enlightenment) integration services. In some versions of the Microsoft Windows operating system, existing integration services in the file system layers and the block storage layers can be modified. Converting such operating system features into a custom SCSI CDB (Command Descriptor Block) is a convenient way to pass tagging functionality down through any lower layers of the virtual disk or storage stack.
Within a disk virtualization stack (e.g., a VHD stack), swap file extents can be tagged as unnecessary for replication. In one embodiment, the disk virtualization stack creates a separate VHD file, for instance named “pagefile-[unique-identifier].vhdx.” This separate VHD file would receive all swap file I/O for the operating system. The VHD file may be dynamically expanding, with the same dimensions as the VHD from which it was derived (e.g., same block size, same virtual disk size, etc.). Once this secondary swap file VHD is open, all the corresponding ranges in the primary VHD may be trimmed, so that the total size on disk for the two VHDs is the same as the size on disk for a single VHD, plus an extra set of VHD metadata for the swap file VHD.
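The space accounting described above can be illustrated with a simplified allocation model: once the swap file's ranges are redirected to the secondary VHD, the same ranges are trimmed from the primary VHD, so the combined allocation matches a single VHD. The set-based bitmap here is an assumption for illustration, not how VHD allocation tables actually work.

```python
def trim_ranges(allocated_blocks, extents):
    """Remove every block covered by the (start, length) extents."""
    trimmed = set(allocated_blocks)
    for start, length in extents:
        trimmed -= set(range(start, start + length))
    return trimmed

# Primary VHD with 1000 allocated blocks; the swap file occupies
# two extents totaling 100 blocks.
primary = set(range(0, 1000))
swap_extents = [(100, 50), (600, 50)]

# After redirecting swap I/O, trim those ranges from the primary VHD.
primary_after = trim_ranges(primary, swap_extents)
swap_vhd = {b for s, n in swap_extents for b in range(s, s + n)}
```

The union of the trimmed primary and the swap VHD covers exactly the original allocation, so total size on disk stays the same apart from the extra set of VHD metadata noted above.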
Building a new VHD each time a machine boots would be feasible but would increase the boot time. To optimize, the swap file VHD may be left in place between boots (e.g., when a machine shuts down), with its contents possibly being trimmed for space usage and security reasons.
When a host (physical or virtual) hosting an operating system crashes, crash data is written into the swap file and the host is rebooted. Preserving this data may be helpful for diagnostics. Therefore, in some embodiments, page file data is preserved unless the operating system determines that the host shut down cleanly. This might be as simple as trimming the data if the host shut down completely and leaving it in place if the host reboots itself. This might also be a helpful performance optimization. In any case, another custom CDB may be sent through the stack when writing a crash dump, thus indicating that the tagged data should be preserved.
In embodiments where the operating system's immediate host is a virtual machine, by splitting the paging data into a separate VHD file, whole-VM snapshots can continue to work as expected, with a differencing disk chain created for the swapping VHDs just as such chains are created for other VHD files. Storage migration may work in a similar fashion.
By splitting the swap file into a separate VHD file, separate caching policies can be applied. Instead of forcing all writes through to the media, it becomes possible to allow writes to be cached in host RAM and lazily written to the VHD, if written at all. This can reduce the load on the underlying storage subsystem and can make reads from the page file less expensive when the data to be read happens to still be in RAM. This would effectively extend the guest operating system's file system cache into the host machine's RAM, which would make it possible to trim that cache without the guest's cooperation. This might make it possible to assign less total RAM to the virtual machine, as paging I/O could be (with correct administration of RAM allocation) made to be statistically cheaper, reducing the RAM needed within the VM for file caching.
In another embodiment, tagging of a region by software can be used to provide quality of service features. While deciding which part of a storage device will store a tagged region can be useful, performance or quality of service features may also be implemented to take advantage of region tagging. In one embodiment, a storage device may provide differentiated levels of throughput, latency, transactions per second, etc., based on whether blocks are in a described or tagged region. Other functions of the storage device may also take into account block tagging. For example, operations related to flushing data from volatile cache storage to non-volatile media, error checking, access priority, or others may be performed in a manner that allows a storage device to provide differentiated performance with respect to tagged blocks. Storage performance may also be implemented in the storage stack, for example in a SCSI subsystem, which may prioritize paths, regulate bus bandwidth, and so forth based on whether storage data corresponds to a tagged region.
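The quality-of-service idea above might be sketched as an I/O dispatcher that serves untagged requests ahead of requests falling in tagged regions. The queueing policy and names are illustrative assumptions only; real devices and subsystems would apply far more elaborate scheduling.

```python
import heapq
import itertools

class QosDispatcher:
    """Serve untagged I/O before I/O that falls in tagged block ranges."""
    def __init__(self, tagged_extents):
        self.tagged = tagged_extents          # list of (start, length)
        self.queue = []
        self.counter = itertools.count()      # FIFO tie-breaker

    def is_tagged(self, lba):
        return any(s <= lba < s + n for s, n in self.tagged)

    def submit(self, lba):
        # Tagged regions get the lower priority class (served later).
        priority = 1 if self.is_tagged(lba) else 0
        heapq.heappush(self.queue, (priority, next(self.counter), lba))

    def next_request(self):
        return heapq.heappop(self.queue)[2]

# Example: blocks 1000-1099 are tagged (e.g., a swap file region).
d = QosDispatcher([(1000, 100)])
for lba in (1050, 5, 1020, 7):    # mix of tagged and untagged requests
    d.submit(lba)
order = [d.next_request() for _ in range(4)]
```

Here the two untagged requests (LBAs 5 and 7) are dispatched first, and the tagged requests follow in submission order, one simple way a stack could deprioritize paging I/O relative to ordinary data.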
Embodiments and features discussed above can be realized in the form of information stored in volatile or non-volatile computer or device readable apparatuses, with such information able to configure the computing device 298, when operating, to perform the embodiments described herein. These apparatuses may include apparatuses such as optical storage (e.g., compact-disk read-only memory (CD-ROM)), magnetic media, holographic storage, flash read-only memory (ROM), or other devices for storing digital information. The stored information can be in the form of machine executable instructions (e.g., compiled executable binary code), source code, bytecode, or other information that can be used to enable or configure computing devices to perform the embodiments described herein. This is also deemed to include at least volatile memory such as random-access memory (RAM) and/or virtual memory storing information such as central processing unit (CPU) instructions during execution of software carrying out an embodiment, as well as non-volatile devices storing information that allows a program or executable to be loaded and executed.