Unless otherwise indicated herein, the approaches described in this section are not admitted to be prior art by inclusion in this section.
Virtualization allows the abstraction and pooling of hardware resources to support virtual machines (VMs) in a virtualized computing environment. For example, through server virtualization, virtualized computing instances such as VMs running different operating systems may be supported by the same physical machine (e.g., referred to as a “host”). Each VM is generally provisioned with virtual resources to run an operating system and applications. The virtual resources may include central processing unit (CPU) resources, memory resources, storage resources, network resources, etc.
In a distributed storage system, storage resources of a cluster of hosts may be aggregated to form a single shared pool of storage. VMs supported by the hosts within the cluster may then access the pool to store data. The data is stored and managed in the form of data containers called objects or storage objects. An object is a logical volume that has its data and metadata distributed in the distributed storage system. A virtual disk of a VM running on a host may also be an object and is typically represented as a file in a file system of the host.
Snapshotting is a storage feature that allows for the creation of snapshots, which are point-in-time read-only copies of storage objects. Snapshots are commonly used for data backup, archival, and protection (e.g., crash recovery) purposes. Copy-on-write (COW) snapshotting is an efficient snapshotting implementation that generally involves (1) maintaining, for each storage object, a first B+ tree (referred to as a “logical map”) that keeps track of the storage object's state, and (2) at the time of taking a snapshot of the storage object, making the storage object's logical map immutable/read-only, designating this immutable logical map as the logical map of the snapshot, and creating a new logical map (e.g., a second B+ tree) for a running point (i.e., live) version of the storage object that includes a single root node of the second B+ tree pointing to the first level nodes of the first B+ tree (which allows the two B+ trees to share the same logical block address (LBA)-to-physical block address (PBA) mappings).
If a write is subsequently made to the storage object that results in a change to a particular LBA-to-PBA mapping, copies are created of the leaf node in the snapshot's logical map (i.e., the first B+ tree) that holds the changed LBA-to-PBA mapping, as well as of any internal nodes between that leaf node and the root node, and the new logical map of the running point version of the storage object is updated to point to the newly-created node copies, thereby separating it from the snapshot's logical map along that particular tree branch. The foregoing steps are then repeated as needed for further snapshots of, and modifications to, the storage object.
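For illustration purposes only, the following simplified Python sketch models the copy-on-write behavior described above with a two-level logical map, in which a snapshot and the running point share leaf nodes until a write forces a copy. The class names, the leaf_index parameter, and the use of plain dictionaries are simplifying assumptions rather than features of any particular implementation; a real logical map would locate the affected leaf node by searching a multi-level B+ tree.

class LeafNode:
    def __init__(self, lba_to_pba=None):
        self.lba_to_pba = dict(lba_to_pba or {})  # LBA -> PBA mappings
        self.immutable = False                    # True once shared with a snapshot

class RootNode:
    def __init__(self, leaves):
        self.leaves = list(leaves)                # first-level (leaf) nodes

def take_snapshot(running_root):
    # Freeze the current logical map and return (snapshot_root, new_running_root).
    for leaf in running_root.leaves:
        leaf.immutable = True                     # snapshot's logical map is read-only
    # The new running-point root initially points at the same leaf nodes,
    # so the two trees share all LBA-to-PBA mappings.
    return running_root, RootNode(running_root.leaves)

def write(running_root, lba, new_pba, leaf_index):
    # Apply a write to the running point, copying a shared leaf node if needed.
    leaf = running_root.leaves[leaf_index]
    if leaf.immutable:                            # leaf is shared with a snapshot
        leaf = LeafNode(leaf.lba_to_pba)          # copy-on-write of the leaf node
        running_root.leaves[leaf_index] = leaf    # re-point the running-point root
    leaf.lba_to_pba[lba] = new_pba

# Example: one leaf covering LBA1..LBA3, take a snapshot, then overwrite LBA2.
root = RootNode([LeafNode({1: 10, 2: 11, 3: 12})])
snapshot_root, root = take_snapshot(root)
write(root, lba=2, new_pba=99, leaf_index=0)
assert snapshot_root.leaves[0].lba_to_pba[2] == 11  # snapshot is unchanged
assert root.leaves[0].lba_to_pba[2] == 99           # running point sees the new PBA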
For a file system that supports multiple snapshots, data blocks may be shared among such snapshots, leading to inefficiencies associated with tracking and updating PBAs of such shared data blocks. Additional improvements are still needed to further enhance Input/Output (I/O) efficiencies.
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the drawings, can be arranged, substituted, combined, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein. Although the terms “first” and “second” are used throughout the present disclosure to describe various elements, these elements should not be limited by these terms. These terms are used to distinguish one element from another. For example, a first element may be referred to as a second element, and vice versa.
In the disclosure, a “COW B+ tree” may refer to a B+ tree data structure that keeps track of a storage object's state and that supports copy-on-write snapshotting: at the time of taking a snapshot of the storage object, the storage object's logical map is made immutable/read-only and designated as the logical map of the snapshot, and a root node of another COW B+ tree data structure, created for a running point version of the storage object, is made to point to the first-level nodes of the original B+ tree data structure. A “delta mapping” may refer to a mapping from a logical block address to one or more physical block addresses that is created in response to a new write operation performed on a running point of a storage object. A “delta mapping table” may refer to a table including one or more delta mappings. The term “in batches” may refer to being in a small quantity or group at a time.
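As an informal illustration of the “delta mapping” and “delta mapping table” terms defined above, the following minimal Python sketch records LBA-to-PBA mappings produced by new writes to a running point and returns them in LBA order. The class, method and variable names are hypothetical and are not part of the disclosure.

class DeltaMappingTable:
    def __init__(self):
        self._deltas = {}                  # LBA -> PBA written to the running point

    def record_write(self, lba, pba):
        # Record a delta mapping for a new write to the running point.
        self._deltas[lba] = pba

    def in_lba_order(self):
        # Yield the delta mappings ordered by LBA, e.g., for batched processing.
        for lba in sorted(self._deltas):
            yield lba, self._deltas[lba]

table = DeltaMappingTable()
table.record_write(150, 233)
table.record_write(50, 43)
print(list(table.in_lba_order()))          # [(50, 43), (150, 233)]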
In the example in
It should be understood that a “virtual machine” running on a host is merely one example of a “virtualized computing instance” or “workload.” A virtualized computing instance may represent an addressable data compute node or isolated user space instance. In practice, any suitable technology may be used to provide isolated user space instances, not just hardware virtualization. Other virtualized computing instances may include containers (e.g., running within a VM or on top of a host operating system without the need for a hypervisor or separate operating system or implemented as an operating system level virtualization), virtual private servers, client computers, etc. Such container technology is available from, among others, Docker, Inc. The VMs may also be complete computational environments, containing virtual equivalents of the hardware and software components of a physical computing system. The term “hypervisor” may refer generally to a software layer or component that supports the execution of multiple virtualized computing instances, including system-level software in guest VMs that supports namespace containers such as Docker, etc.
Hypervisor 114 may be implemented using any suitable virtualization technology, such as VMware ESX® or ESXi™ (available from VMware, Inc.), Kernel-based Virtual Machine (KVM), etc. Hypervisor 114 may also be a “type 2” or hosted hypervisor that runs on top of a conventional operating system on host 110.
Hypervisor 114 maintains a mapping between underlying hardware 112 and virtual resources allocated to respective VMs 131 and 132. Hardware 112 includes suitable physical components, such as central processing unit(s) or processor(s) 120; memory 122; physical network interface controllers (NICs) 124; storage resource(s) 126, storage controller(s) 128 to provide access to storage resource(s) 126, etc. Virtual resources are allocated to each VM to support a guest operating system (OS) and applications (not shown for simplicity). For example, corresponding to hardware 112, the virtual resources may include virtual CPU, guest physical memory (i.e., memory visible to the guest OS running in a VM), virtual disk, virtual network interface controller (VNIC), etc.
In practice, storage controller 128 may be any suitable controller, such as a redundant array of independent disks (RAID) controller (e.g., in a RAID-0 or RAID-1 configuration), etc. Host 110 may include any suitable number of storage resources in the form of physical storage devices, drives or disks. Each physical storage resource may be housed in or directly attached to host 110. Example physical storage resources include solid-state drives (SSDs), Universal Serial Bus (USB) flash drives, etc. In particular, SSDs are gaining popularity in modern storage systems due to their relatively high performance and affordability. Depending on the desired implementation, each SSD may include a high-speed interface connected to a controller chip and multiple memory elements.
To implement Software-Defined Storage (SDS) in virtualized computing environment 100, host 110 and other hosts may be configured as a cluster. This way, all the hosts may aggregate their storage resources to form distributed storage system 190 that represents a shared pool of one or more storage resources 126. Distributed storage system 190 may employ any suitable technology, such as Virtual Storage Area Network (VSAN™) available from VMware, Inc. For example, host 110 and other hosts may aggregate respective storage resources into an “object store” (also known as a datastore or a collection of datastores). The object store represents a logical aggregated volume to store any suitable VM data relating to VMs 131 and 132, such as virtual machine disk (VMDK) objects, snapshot objects, swap objects, home namespace objects, etc. Any suitable disk format may be used, such as VM file system leaf level (VMFS-L), VSAN on-disk file system, etc. Distributed storage system 190 is accessible by hosts 110 via physical network 105.
In some embodiments, hypervisor 114 supports storage stack 116, which processes I/O requests that it receives. Storage stack 116 may include file system component 118 and copy-on-write (COW) snapshotting component 132. File system component 118 is configured to manage the storage of data in distributed storage system 190 and write data modifications to distributed storage system 190. File system component 118 can also accumulate multiple small write requests directed to different LBAs of a storage object in an in-memory buffer and, once the buffer is full, write out all of the accumulated write data (collectively referred to as a “segment”) via a single, sequential write operation.
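For illustration purposes only, the following is a minimal Python sketch, under the assumption of a fixed-capacity in-memory buffer, of how a file system component might accumulate small writes directed to different LBAs and write them out as a single sequential “segment.” The names and the capacity policy are hypothetical.

class SegmentBuffer:
    def __init__(self, capacity, flush_fn):
        self.capacity = capacity           # number of buffered writes per segment
        self.flush_fn = flush_fn           # performs the single sequential write
        self.pending = []                  # accumulated (lba, data) pairs

    def write(self, lba, data):
        self.pending.append((lba, data))
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)    # one sequential write for the whole segment
            self.pending = []

segment_writes = []
buf = SegmentBuffer(capacity=3, flush_fn=segment_writes.append)
for lba in (7, 42, 300):                   # three small writes to different LBAs
    buf.write(lba, b"data")
print(len(segment_writes))                 # 1: the writes went out as one segment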
COW snapshotting component 132 of storage stack 116 is configured to create snapshots of the storage objects supported by distributed storage system 190 by manipulating, via a copy-on-write mechanism, logical maps that keep track of the storage objects' states. A plurality of B+ tree data structures may be used in maintaining these logical maps.
Leaf node 212 is configured to maintain mappings of a first set of LBAs associated with a first memory page to a first set of PBAs. Assuming the first memory page includes 100 contiguous logical blocks and each of the logical blocks has its own block address mapping to one physical block address, leaf node 212 is configured to maintain mappings of LBA1 (i.e., logical block address of 1) through LBA100 (i.e., logical block address of 100) to their corresponding physical block addresses. For example, as illustrated in
Similarly, leaf node 213 is configured to maintain mappings of a second set of LBAs associated with a second memory page to a second set of PBAs, and leaf node 214 is configured to maintain mappings of a third set of LBAs associated with a third memory page to a third set of PBAs. Assuming the second memory page and the third memory page both include 100 contiguous logical blocks and each of the logical blocks has its own block address mapping to one physical block address, leaf node 213 is configured to maintain mappings of LBA101 (i.e., logical block address of 101) through LBA200 (i.e., logical block address of 200) to their mapped physical block addresses, and leaf node 214 is configured to maintain mappings of LBA201 (i.e., logical block address of 201) through LBA300 (i.e., logical block address of 300) to their mapped physical block addresses.
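For illustration purposes only, the following short Python sketch builds three leaf-node mappings covering LBA1-LBA100, LBA101-LBA200 and LBA201-LBA300, respectively. The PBA values are arbitrary placeholders and do not correspond to the values used in the examples below.

def make_leaf(first_lba, first_pba, count=100):
    # Map `count` contiguous LBAs starting at first_lba to contiguous PBAs.
    return {first_lba + i: first_pba + i for i in range(count)}

leaf_212 = make_leaf(first_lba=1,   first_pba=1001)  # LBA1..LBA100
leaf_213 = make_leaf(first_lba=101, first_pba=2001)  # LBA101..LBA200
leaf_214 = make_leaf(first_lba=201, first_pba=3001)  # LBA201..LBA300
assert len(leaf_212) == len(leaf_213) == len(leaf_214) == 100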
For example, as illustrated in
In response to creating the first snapshot of the storage object, at the beginning point in time, t0, a conventional COW snapshotting component is configured to make the mappings maintained by leaf nodes 212, 213 and 214 immutable/read-only and designate these mappings as the logical map of the first snapshot. In addition, the COW snapshotting component is also configured to create second root node 221 and make second root node 221 point to the mappings maintained by leaf nodes 212, 213 and 214 through internal nodes (not shown for simplicity). Second root node 221 is configured to track the live version of the storage object in response to new writes to the storage object after the first snapshot is created.
In response to a first new write to LBA50, at a first point in time, t1, the COW snapshotting component is configured to copy all mappings maintained by leaf node 212 to generate leaf node 212′, update the single mapping of LBA50 to PBA45 to the new mapping of LBA50 to PBA43, and make second root node 221 point to leaf node 212′.
In response to a second new write to LBA150, at a second point in time, t2, the COW snapshotting component is configured to copy all mappings maintained by leaf node 213 to generate leaf node 213′, update the single mapping of LBA150 to PBA235 to the new mapping of LBA150 to PBA233, and make second root node 221 point to leaf node 213′. Assuming the COW snapshotting component only has two in-memory cache pages for buffering these updated mappings, the two in-memory cache pages are now fully occupied by the updated mappings maintained by leaf nodes 212′ and 213′.
In response to a third new write to LBA250, at a third point in time, t3, because the in-memory cache pages are full, the COW snapshotting component is configured to evict the mappings maintained by leaf node 212′, which causes a first write I/O cost. The COW snapshotting component is configured to copy all mappings maintained by leaf node 214 to generate leaf node 214′, update the single mapping of LBA250 to PBA420 to the new mapping of LBA250 to PBA423, and make second root node 221 point to leaf node 214′.
In response to a fourth new write to LBA50, at a fourth point in time, t4, because the in-memory cache pages are full, the COW snapshotting component is configured to evict the mappings maintained by leaf node 213′, which causes a second write I/O cost. The COW snapshotting component is configured to load the mappings maintained by leaf node 212′ to generate leaf node 212″, update the single mapping of LBA50 to PBA43 to the new mapping of LBA50 to PBA40, and make second root node 221 point to leaf node 212″.
In response to a fifth new write to LBA151, at a fifth point in time, t5, because the in-memory cache pages are full, the COW snapshotting component is configured to evict the mappings maintained by leaf node 214′, which causes a third write I/O cost. The COW snapshotting component is configured to load the mappings maintained by leaf node 213′ to generate leaf node 213″, update the single mapping of LBA151 to PBA200 to the new mapping of LBA151 to PBA223, and make second root node 221 point to leaf node 213″.
In response to a sixth new write to LBA249, at a sixth point in time, t6, because the in-memory cache pages are full, the COW snapshotting component is configured to evict the mappings maintained by leaf node 212″, which causes a fourth write I/O cost. The COW snapshotting component is configured to load the mappings maintained by leaf node 214′ to generate leaf node 214″, update the single mapping of LBA249 to PBA430 to the new mapping of LBA249 to PBA433, and make second root node 221 point to leaf node 214″.
Lastly, the dirty mappings maintained by leaf nodes 213″ and 214″ are flushed, which causes a fifth write I/O cost and a sixth write I/O cost, respectively. In response to creating a second snapshot of the storage object, the COW snapshotting component is configured to make the mappings maintained by leaf nodes 212″, 213″ and 214″ immutable/read-only and designate these mappings as the logical map of the second snapshot. In addition, the COW snapshotting component is also configured to create a new root node (not shown in
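For illustration purposes only, the following simplified Python simulation reproduces the write I/O cost of the conventional sequence described above, under the assumptions of two in-memory cache pages, oldest-first eviction of dirty pages, and one dirty cache page per modified leaf node. The helper names are hypothetical, and only write I/O costs (evictions plus the final flush) are counted, not the read I/O of loading evicted mappings back from storage.

from collections import OrderedDict

CACHE_PAGES = 2
cache = OrderedDict()              # leaf-node name -> its dirty LBA-to-PBA updates
write_io_cost = 0

def touch(leaf, lba, pba):
    # Bring a leaf node's mappings into the cache and apply one updated mapping.
    global write_io_cost
    if leaf not in cache and len(cache) >= CACHE_PAGES:
        cache.popitem(last=False)  # evict the oldest dirty cache page...
        write_io_cost += 1         # ...which costs one write I/O
    cache.setdefault(leaf, {})[lba] = pba

# t1..t6: the six new writes described above (PBA values taken from the text).
touch("212'", 50, 43)
touch("213'", 150, 233)
touch("214'", 250, 423)
touch('212"', 50, 40)
touch('213"', 151, 223)
touch('214"', 249, 433)

write_io_cost += len(cache)        # finally flush the remaining dirty cache pages
print(write_io_cost)               # 6 write I/O costs in total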
In some embodiments, to create a first snapshot of the storage object at time t0′, in conjunction with
For illustration purposes, in some embodiments, still assuming COW snapshotting component 132 only has two in-memory cache pages for buffering delta mappings maintained by leaf node 322, the delta mappings maintained by leaf node 322 may be cached in a first in-memory cache page of the two in-memory cache pages. In some embodiments, additional new writes to the storage object may continue and delta mappings associated with these new writes may be continuously maintained by leaf node 322.
In some embodiments, COW snapshotting component 132 is configured to create leaf nodes of the second COW B+ tree data structure in batches based on an order of the LBAs in the delta mappings maintained by leaf node 322. The creation of leaf nodes in batches will be further described below in detail.
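For illustration purposes only, the following minimal Python sketch shows one way, under the assumption that each leaf node of the second COW B+ tree holds a fixed number of delta mappings, in which buffered delta mappings could be grouped into leaf nodes in batches based on the order of their LBAs. The function name, the mappings_per_leaf parameter and the example values are hypothetical.

def build_leaf_nodes_in_batches(delta_mappings, mappings_per_leaf=100):
    # Group LBA-ordered delta mappings into leaf-node-sized batches.
    ordered = sorted(delta_mappings.items())         # order the deltas by LBA
    leaves = []
    for start in range(0, len(ordered), mappings_per_leaf):
        batch = ordered[start:start + mappings_per_leaf]
        leaves.append(dict(batch))                   # one new leaf node per batch
    return leaves

# Example: delta mappings for scattered LBAs become two leaf nodes.
deltas = {250: 423, 50: 40, 150: 233, 151: 223, 249: 433}
for leaf in build_leaf_nodes_in_batches(deltas, mappings_per_leaf=3):
    print(sorted(leaf))                              # [50, 150, 151] then [249, 250]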
Process 400 may start with block 410 “create first root node of first B+ tree data structure.” For example, in conjunction with
In some embodiments, in conjunction with
In some embodiments, in conjunction with
In some embodiments, in conjunction with
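For illustration purposes only, the following Python sketch composes the pieces described above into one plausible end-to-end flow: the first logical map is frozen as the snapshot's logical map, new writes are buffered as delta mappings, and leaf nodes of the second COW B+ tree are then created in batches ordered by LBA. This composition is an assumption made for illustration and does not correspond to the literal blocks of process 400.

def snapshot_and_apply_writes(first_tree_leaves, new_writes, mappings_per_leaf=100):
    snapshot_leaves = first_tree_leaves              # treated as immutable/read-only
    delta_table = {}                                 # delta mapping table
    for lba, pba in new_writes:                      # new writes to the running point
        delta_table[lba] = pba
    ordered = sorted(delta_table.items())            # order the deltas by LBA
    second_tree_leaves = [
        dict(ordered[i:i + mappings_per_leaf])       # create leaf nodes in batches
        for i in range(0, len(ordered), mappings_per_leaf)
    ]
    return snapshot_leaves, second_tree_leaves

snapshot, new_leaves = snapshot_and_apply_writes(
    first_tree_leaves=[{1: 10, 2: 11}],
    new_writes=[(2, 99), (1, 88)],
    mappings_per_leaf=2,
)
print(new_leaves)                                    # [{1: 88, 2: 99}]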
The above examples can be implemented by hardware (including hardware logic circuitry), software or firmware or a combination thereof. The above examples may be implemented by any suitable computing device, computer system, etc. The computer system may include processor(s), memory unit(s) and physical NIC(s) that may communicate with each other via a communication bus, etc. The computer system may include a non-transitory computer-readable medium having stored thereon instructions or program code that, when executed by the processor, cause the processor to perform processes described herein with reference to
The techniques introduced above can be implemented in special-purpose hardwired circuitry, in software and/or firmware in conjunction with programmable circuitry, or in a combination thereof. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), and others. The term “processor” is to be interpreted broadly to include a processing unit, ASIC, logic unit, programmable gate array, etc.
The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or any combination thereof.
Those skilled in the art will recognize that some aspects of the embodiments disclosed herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computing systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of this disclosure.
Software and/or firmware to implement the techniques introduced here may be stored on a non-transitory computer-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “computer-readable storage medium”, as the term is used herein, includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant (PDA), mobile device, manufacturing tool, any device with a set of one or more processors, etc.). A computer-readable storage medium may include recordable/non-recordable media (e.g., read-only memory (ROM), random access memory (RAM), magnetic disk or optical storage media, flash memory devices, SSDs, etc.).
The drawings are only illustrations of an example, wherein the units or procedures shown in the drawings are not necessarily essential for implementing the present disclosure. Those skilled in the art will understand that the units in the device in the examples can be arranged in the device in the examples as described, or can alternatively be located in one or more devices different from that in the examples. The units in the examples described can be combined into one module or further divided into a plurality of sub-units.