Unless otherwise indicated herein, the approaches described in this section are not admitted to be prior art by inclusion in this section.
A virtualization software suite (vSphere Suite) for implementing and managing virtual infrastructures in a virtualized computing environment may include a hypervisor (ESXi) that implements virtual machines (VMs) on one or more physical hosts (hosts), a virtual storage area network (vSAN) software that aggregates local storage to form a shared datastore for a cluster of physical hosts, and a server management software (vCenter) that centrally provisions and manages virtual datacenters, VMs, hosts, clusters, datastores, and virtual networks.
The vSAN software uses the concept of a disk group as a container for solid-state drives (SSDs) and non-SSDs, such as hard disk drives (HDDs). On each host (node) in a vSAN cluster, the local drives of the host are organized into one or more disk groups. Each disk group includes one SSD that serves as read cache and write buffer (e.g., a cache tier), and one or more SSDs or non-SSDs that serve as permanent storage (e.g., a capacity tier). The aggregate of the disk groups from all the nodes forms a vSAN datastore distributed and shared across the nodes of the vSAN cluster.
The vSAN software stores and manages data in the form of data containers called objects. An object is a logical volume that has its data and metadata distributed across a vSAN cluster. For example, every virtual machine disk (VMDK) is an object, as is every snapshot. A virtual machine (VM) is provisioned on a vSAN datastore as a VM home namespace object, which stores metadata files of the VM including descriptor files for the VM's VMDKs. A user may operate a VM to perform input/output (I/O) operations to VMDKs of the VM. When a performance issue (e.g., a latency of the I/O operations) occurs due to one or more outstanding inputs/outputs (I/Os), network hardware issues, network congestion, or disk slowness, the user may manually trigger a diagnostic operation on the VM. The diagnostic operation may collect trace data associated with I/Os generated by the VM at each layer of the vSAN cluster and break down latencies between layers of the vSAN cluster based on the collected trace data.
However, this approach is not adequate because the performance issue may be temporary and the trace data cannot represent the state when the performance issue occurred. Even assuming trace data is constantly collected for potential diagnostics in the future, constantly collecting and persisting the trace data will significantly impact the performance of the entire vSAN datastore.
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting.
Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the drawings, can be arranged, substituted, combined, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
In the following detailed description, an “object” is composed of one or more components distributed in an object store, such as a vSAN object store. A “component” is an entity that holds data for the object. A “client” may refer to a host which sends one or more I/Os to an object. An “object owner” may refer to a host which coordinates the I/Os to the object in an object store so that the object owner owns the object. A “component owner” may refer to a host which owns a component. Generally, the data held by a component is persisted on one or more disks of the component owner. “Sampling an I/O” may refer to an object owner configured to tag an I/O to an object owned by the object owner, record timepoints at each layer of a data path of this I/O and collect the recorded timepoints data. “Sampling an I/O” may be interchangeably used with “I/O sampling,” “object I/O sampling,” “I/O(s) . . . is(are) sampled” in the description. “Trace data” may refer to data associated with timepoints of a tagged I/O to an object at each layer of a data path of the tagged I/O.
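The terminology above implies a simple containment hierarchy: an object is coordinated by its object owner and composed of components, each held by a component owner. A minimal sketch of that hierarchy follows; the class and field names are hypothetical and not part of the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """An entity that holds data for an object; its data is persisted
    on one or more disks of the component owner."""
    component_id: str
    owner_host: str            # the component owner

@dataclass
class VsanObject:
    """A logical volume whose data and metadata are distributed across
    a vSAN cluster as one or more components."""
    object_id: str
    owner_host: str            # the object owner, which coordinates I/Os
    components: list = field(default_factory=list)
```

For example, a VMDK object owned by one host may have components held on two other hosts, which is the situation the sampling process below must coordinate across.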
Challenges relating to performing object I/O sampling for performance diagnosis will now be explained in more detail using
In the example in
Each host 110A/110B/110C in cluster 105 includes suitable hardware 112A/112B/112C and executes virtualization software such as hypervisor 114A/114B/114C to maintain a mapping between physical resources and virtual resources assigned to various virtual machines. For example, Host-A 110A supports VM1 131 and VM2 132; Host-B 110B supports VM3 133 and VM4 134; and Host-C 110C supports VM5 135 and VM6 136. In practice, each host 110A/110B/110C may support any number of virtual machines, with each virtual machine executing a guest operating system (OS) and applications. Hypervisor 114A/114B/114C may also be a “type 2” or hosted hypervisor that runs on top of a conventional operating system (not shown) on host 110A/110B/110C.
Although examples of the present disclosure refer to “virtual machines,” it should be understood that a “virtual machine” running within a host is merely one example of a “virtualized computing instance” or “workload.” A virtualized computing instance may represent an addressable data compute node or isolated user space instance. In practice, any suitable technology may be used to provide isolated user space instances, not just hardware virtualization. Other virtualized computing instances may include containers (e.g., running on top of a host operating system without the need for a hypervisor or separate operating system such as Docker, etc.; or implemented as an operating system level virtualization), virtual private servers, client computers, etc. The virtual machines may also be complete computation environments, containing virtual equivalents of the hardware and software components of a physical computing system.
Hardware 112A/112B/112C includes any suitable components, such as processor 120A/120B/120C (e.g., central processing unit (CPU)); memory 122A/122B/122C (e.g., random access memory); network interface controllers (NICs) 124A/124B/124C to provide network connection; storage controller 126A/126B/126C that provides access to storage resources 128A/128B/128C, etc. Corresponding to hardware 112A/112B/112C, virtual resources assigned to each virtual machine may include virtual CPU, virtual memory, virtual machine disk(s), virtual NIC(s), etc.
Storage controller 126A/126B/126C may be any suitable controller, such as a redundant array of independent disks (RAID) controller, etc. Storage resource 128A/128B/128C may represent one or more disk groups. In practice, each disk group represents a management construct that combines one or more physical disks, such as hard disk drive (HDD), solid-state drive (SSD), solid-state hybrid drive (SSHD), peripheral component interconnect (PCI) based flash storage, serial advanced technology attachment (SATA) storage, serial attached small computer system interface (SAS) storage, Integrated Drive Electronics (IDE) disks, Universal Serial Bus (USB) storage, etc.
Through storage virtualization, hosts 110A/110B/110C in cluster 105 aggregate their storage resources 128A/128B/128C to form distributed storage system 150, which represents a shared pool of storage resources. For example, in
In virtualized computing environment 100, management entity 160 provides management functionalities to various managed objects, such as cluster 105, hosts 110A/110B/110C, virtual machines 131, 132, 133, 134, 135 and 136, etc. Management entity 160 may include vSAN diagnostic service 162.
In some embodiments, a user operating on VM3 133 may determine that VM3 133 has performance issues (e.g., latencies) and the user may send a vSAN diagnostic service request for VM3 133 from user terminal 170 to vSAN diagnostic service 162. In response to receiving the vSAN diagnostic service request, vSAN diagnostic service 162 is configured to manage vSAN modules 116A, 116B and 116C to fulfill the vSAN diagnostic service request.
Conventionally, in response to receiving the vSAN diagnostic service request, management entity 160 is configured to query vSAN module 116A on primary node host-A 110A. vSAN module 116A is configured to coordinate vSAN modules 116B and 116C to start sampling one or more I/Os and collect trace data of the I/Os associated with VM3 133. In response to vSAN modules 116A/116B/116C collecting the trace data, vSAN module 116A may obtain the trace data from vSAN modules 116B/116C and aggregate and persist the trace data in object store 152. vSAN diagnostic service 162 is then configured to retrieve the persisted data from object store 152, perform a performance and/or latency diagnosis based on the retrieved persisted data and transmit a diagnostic result to user terminal 170 for the user's review.
However, the conventional approach above has adoption issues. For example, the performance issues of VM3 133 may be temporary. Therefore, trace data collected after the performance issues occurred cannot reflect the state of VM3 133 when the performance issues occurred. Accordingly, a diagnosis based on such trace data generally leads to a diagnostic result which is not associated with the actual performance issues.
The adoption issues above may be overcome by constantly collecting trace data. Therefore, when the performance issues occur, vSAN diagnostic service 162 may be configured to perform a performance and/or latency diagnosis based on the previously collected trace data. However, this approach also has problems. For example, constantly collecting trace data and persisting the collected trace data in object store 152 may impact the overall performance of distributed storage system 150, such as a shortage of storage space or latencies caused by persisting significant amounts of collected trace data while distributed storage system 150 still needs to support other normal I/O operations in virtualized computing environment 100.
In some embodiments, in conjunction with
In some embodiments, primary node host-A 110A is configured to periodically call an application program interface (API) of any secondary nodes to fetch the trace data collected by host-B 110B, host-C 110C and other hosts in cluster 105. Primary node host-A 110A is also configured to create an object on primary node host-A 110A to hold raw I/O trace data and/or aggregated I/O trace data.
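The periodic fetch described above may be sketched as follows. This is an illustrative sketch only: the disclosure does not name the API, so `secondary_apis` stands in for per-host trace-fetch callables, and the interval is an arbitrary example value:

```python
import time

def poll_secondary_nodes(secondary_apis, fetch_interval_s=60.0, rounds=1):
    """Primary node's periodic fetch of trace data from secondary nodes.

    `secondary_apis` maps a host name to a zero-argument callable standing
    in for that secondary node's trace-fetch API (hypothetical; the real
    API is not named in the disclosure). Returns all fetched trace records,
    which the primary node would then hold in an object it creates.
    """
    collected = []
    for round_index in range(rounds):
        for host, fetch in secondary_apis.items():
            # Each call returns that host's trace records since the last fetch.
            collected.extend(fetch())
        if round_index < rounds - 1:
            time.sleep(fetch_interval_s)   # wait before the next polling round
    return collected
```

In practice the primary node would append the fetched records to the object it created for raw and/or aggregated I/O trace data.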
In some embodiments, for illustration only, the trace data may include, but is not limited to, a first timestamp of receiving a tagged I/O at the object owner, a second timestamp of completing the tagged I/O at the object owner, a third timestamp of receiving the tagged I/O at a first component owner, a fourth timestamp of completing the tagged I/O at the first component owner, a fifth timestamp of receiving the tagged I/O at a second component owner and a sixth timestamp of completing the tagged I/O at the second component owner.
More specifically, in some embodiments, for illustration only, the trace data may further include, but is not limited to, a seventh timestamp of receiving the tagged I/O at a first disk of the first component owner, an eighth timestamp of completing the tagged I/O at the first disk of the first component owner, a ninth timestamp of receiving the tagged I/O at a second disk of the second component owner and a tenth timestamp of completing the tagged I/O at the second disk of the second component owner.
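For illustration, the ten timepoints enumerated above could be modeled as a record from which per-layer latencies are derived; the field and method names here are hypothetical, as the disclosure enumerates only the timepoints, not a schema:

```python
from dataclasses import dataclass

@dataclass
class TraceRecord:
    """Timestamps (in seconds) recorded for one tagged I/O along its data path."""
    owner_recv: float   # first timestamp: tagged I/O received at object owner
    owner_done: float   # second: completed at object owner
    comp1_recv: float   # third: received at first component owner
    comp1_done: float   # fourth: completed at first component owner
    comp2_recv: float   # fifth: received at second component owner
    comp2_done: float   # sixth: completed at second component owner
    disk1_recv: float   # seventh: received at first disk of first component owner
    disk1_done: float   # eighth: completed at first disk of first component owner
    disk2_recv: float   # ninth: received at second disk of second component owner
    disk2_done: float   # tenth: completed at second disk of second component owner

    def layer_latencies(self) -> dict:
        """Break down per-layer latencies from the recorded timepoints."""
        return {
            "object_owner": self.owner_done - self.owner_recv,
            "component_1": self.comp1_done - self.comp1_recv,
            "component_2": self.comp2_done - self.comp2_recv,
            "disk_1": self.disk1_done - self.disk1_recv,
            "disk_2": self.disk2_done - self.disk2_recv,
        }
```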
At block 210 in
At block 220 in
At block 220 in
At block 230 in
hCPU[now-2 min] refers to the CPU usage of host-B 110B two minutes prior to the present time point, hCPU[now-1 min] refers to the CPU usage of host-B 110B one minute prior to the present time point, hCPU[now] refers to the CPU usage of host-B 110B at the present time point, and maxCPU refers to the total CPU capacity of host-B 110B.
Similarly, hMEM[now-2 min] refers to the memory usage of host-B 110B two minutes prior to the present time point, hMEM[now-1 min] refers to the memory usage of host-B 110B one minute prior to the present time point, hMEM[now] refers to the memory usage of host-B 110B at the present time point, and maxMEM refers to the total memory capacity of host-B 110B.
The sample score of the object owner is the maximum of (samplescoreCPU+samplescoreMEM) and 0.
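A minimal sketch of this calculation follows. Because the closed form of samplescoreCPU and samplescoreMEM is not reproduced above, this sketch assumes each is the normalized drop in usage over the last two minutes, which matches the stated property that a host whose CPU or memory resources have recently been freed up scores higher; only the final max(samplescoreCPU + samplescoreMEM, 0) step is taken directly from the description:

```python
def sample_score(hcpu, maxcpu, hmem, maxmem):
    """Resource-aware sample score for a host (object or component owner).

    hcpu/hmem are [usage 2 min ago, usage 1 min ago, usage now].
    The per-resource scores below are assumptions: each is the normalized
    usage drop over the last two minutes, so a host that is freeing up
    resources scores higher and a host whose load is rising scores lower.
    """
    score_cpu = (hcpu[0] - hcpu[2]) / maxcpu   # assumed form of samplescoreCPU
    score_mem = (hmem[0] - hmem[2]) / maxmem   # assumed form of samplescoreMEM
    # Per the description, the host's sample score is
    # max(samplescoreCPU + samplescoreMEM, 0).
    return max(score_cpu + score_mem, 0.0)
```

For example, a host whose CPU usage fell from 80% to 60% and whose memory usage fell from 50% to 40% would score higher than a host whose usage rose over the same window (the latter is clamped to 0).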
In some embodiments, block 230 is followed by block 240. At block 240 in
In some embodiments, host-B 110B is configured to ask a component owner (e.g., host-C 110C) to calculate a sample score of the component owner. In some embodiments, the sample score of the component owner may be calculated based on the following. Similarly, assuming there are multiple component owners, a corresponding sample score for each component owner may also be calculated based on the following:
hCPU[now-2 min] refers to the CPU usage of host-C 110C two minutes prior to the present time point, hCPU[now-1 min] refers to the CPU usage of host-C 110C one minute prior to the present time point, hCPU[now] refers to the CPU usage of host-C 110C at the present time point, and maxCPU refers to the total CPU capacity of host-C 110C.
Similarly, hMEM[now-2 min] refers to the memory usage of host-C 110C two minutes prior to the present time point, hMEM[now-1 min] refers to the memory usage of host-C 110C one minute prior to the present time point, hMEM[now] refers to the memory usage of host-C 110C at the present time point, and maxMEM refers to the total memory capacity of host-C 110C.
The sample score of the component owner is the maximum of (samplescoreCPU+samplescoreMEM) and 0. In some embodiments, object owner host-B 110B is configured to obtain the sample score of the component owner host-C 110C after host-C 110C calculates its own sample score.
In some embodiments, block 240 is followed by block 250. At block 250 in
“object owner sample score” is the sample score calculated at block 230, “component owner sample score” is the sample score obtained at block 240, “curtime” refers to the present time point, “1stTime” refers to the time point at which host-B 110B performed the latest I/O sampling, “maximum sampling time interval” refers to the maximum sampling time interval included in the resource-aware I/O sampling configuration, and α, β and γ are weight parameters adjustable by an administrator. Block 250 may be followed by block 260.
In some embodiments, host-B 110B is configured to determine whether the weighted sample score calculated at block 250 is equal to or greater than a predetermined sample score. The predetermined sample score may be 7. In response to determining that the weighted sample score calculated at block 250 is equal to or greater than the predetermined sample score, block 260 is followed by block 270 and host-B 110B is configured to sample the I/O.
In some embodiments, in response to determining that the weighted sample score calculated at block 250 is less than a predetermined sample score, block 260 is followed by block 210. Accordingly, host-B 110B is configured not to sample the I/O. In addition, when the next I/O is received by host-B 110B at block 210, process 200 will repeat to determine whether to sample this next I/O.
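The sampling decision at blocks 250 through 270 may be sketched as follows. The exact combination of the weight parameters is not reproduced above, so a linear weighted sum is assumed, with the elapsed time since the latest sampling normalized by the maximum sampling time interval; the parameter names and the example threshold are illustrative:

```python
def should_sample(owner_score, component_score, curtime, last_sample_time,
                  max_interval, alpha=1.0, beta=1.0, gamma=1.0, threshold=1.0):
    """Decide whether the object owner samples the current I/O.

    owner_score / component_score: sample scores from the object owner
    and component owner. alpha, beta, gamma: administrator-adjustable
    weights. The linear form below is an assumption; the disclosure
    states only which inputs the weighted sample score combines.
    """
    weighted = (alpha * owner_score
                + beta * component_score
                + gamma * (curtime - last_sample_time) / max_interval)
    # Sample the I/O only when the weighted score reaches the
    # predetermined sample score; otherwise skip it and wait for
    # the next I/O.
    return weighted >= threshold
```

Note how the elapsed-time term guarantees that sampling eventually triggers even on busy hosts: once the time since the latest sampling approaches the maximum sampling interval, that term alone pushes the weighted score toward the threshold.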
Based on the above approaches, not every I/O is sampled. Therefore, the trace data associated with the I/Os may be selectively collected for diagnosing a future performance issue. Compared to the conventional approach, in which trace data is collected after the performance issue occurs, the above approaches are more likely to reflect the state when the performance issue occurs, so a more accurate diagnostic result may be obtained.
In addition, based on the above approaches, the sample score of the object owner at block 230 will be higher when more of the object owner's CPU or memory resources have recently been freed up. Similarly, the sample score of a component owner at block 240 will also be higher when more of the component owner's CPU or memory resources have recently been freed up. Therefore, in conjunction with
In some embodiments, the aggregation of trace data may include, but is not limited to, calculating a latency between any two of the timestamps (i.e., the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth and tenth timestamps) described above. Moreover, assuming multiple latencies are obtained, the aggregation may also include, but is not limited to, performing average and standard deviation operations on these latencies. In addition, the aggregation may also include, but is not limited to, identifying the maximum latency or the minimum latency among the multiple latencies.
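The aggregation operations above can be sketched as follows; the function and key names are illustrative only:

```python
from statistics import mean, stdev

def aggregate_latencies(latencies):
    """Aggregate per-I/O latencies (each computed between two of the
    recorded timestamps) into the summary statistics named above:
    average, standard deviation, maximum and minimum."""
    return {
        "avg": mean(latencies),
        "stddev": stdev(latencies) if len(latencies) > 1 else 0.0,
        "max": max(latencies),
        "min": min(latencies),
    }
```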
At block 310, in some embodiments and in conjunction with
At block 320, in some embodiments and in conjunction with
At block 330, in some embodiments and in conjunction with
At block 340, in some embodiments and in conjunction with
hCPU[now-2 min] refers to the CPU usage of host-A 110A two minutes prior to the present time point, hCPU[now-1 min] refers to the CPU usage of host-A 110A one minute prior to the present time point, hCPU[now] refers to the CPU usage of host-A 110A at the present time point, and maxCPU refers to the total CPU capacity of host-A 110A. Similarly, hMEM[now-2 min] refers to the memory usage of host-A 110A two minutes prior to the present time point, hMEM[now-1 min] refers to the memory usage of host-A 110A one minute prior to the present time point, hMEM[now] refers to the memory usage of host-A 110A at the present time point, and maxMEM refers to the total memory capacity of host-A 110A.
In some embodiments, block 340 is followed by block 350. At block 350, in some embodiments and in conjunction with
As described above, in conjunction with
At block 410, in some embodiments and in conjunction with
Similarly, in some embodiments, host-A 110A is configured to obtain which objects are owned by host-C 110C and which components are owned by host-C 110C based on the query from host-A 110A to host-C 110C. Assuming host-C 110C does not own any object but owns a second component, host-A 110A is also configured to obtain a third time point that host-C 110C sampled the latest I/O to the second component based on the query. Block 410 may be followed by block 420.
At block 420, in some embodiments and in conjunction with
“now” refers to the present time point on host-B 110B, “1stTime(1stobject)” refers to the time point that host-B 110B sampled the latest I/O to the first object owned by host-B 110B, and α is a weighting parameter. In some embodiments, the first persist score is inversely proportional to a first time difference between the present time point on host-B 110B and the first time point.
Similarly, in some embodiments, host-A 110A is also configured to calculate a second persist score for the first component owned by host-B 110B. The second persist score is calculated based on:
“now” refers to the present time point on host-B 110B, “1stTime(1stcomponent)” refers to the time point that host-B 110B sampled the latest I/O to the first component owned by host-B 110B, and β is a weighting parameter.
In addition, in some embodiments, host-A 110A is also configured to calculate a third persist score for the second component owned by host-C 110C. The third persist score is calculated based on:
“now” refers to the present time point on host-C 110C, “1stTime(2ndcomponent)” refers to the time point that host-C 110C sampled the latest I/O to the second component owned by host-C 110C, and β is a weighting parameter. In some embodiments, the second persist score is inversely proportional to a second time difference between the present time point on host-B 110B and the second time point, and the third persist score is inversely proportional to a third time difference between the present time point on host-C 110C and the third time point. Block 420 may be followed by block 430.
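A minimal sketch of the persist-score calculation follows. Because the closed form is not reproduced above, a simple weighted reciprocal of the elapsed time since the latest sampling is assumed, which satisfies the stated inverse proportionality; the function name and the epsilon guard are illustrative:

```python
def persist_score(now, last_sample_time, weight=1.0, eps=1e-9):
    """Persist score for an object or component.

    The score is inversely proportional to the time since the host last
    sampled an I/O to that object or component, scaled by a weighting
    parameter (alpha for objects, beta for components). The reciprocal
    form and the eps guard against a zero elapsed time are assumptions.
    """
    return weight / (now - last_sample_time + eps)
```

Under this form, an object or component that was sampled more recently receives a higher persist score than one whose latest sampling is further in the past.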
At block 430, in some embodiments and in conjunction with
At block 440, in some embodiments and in conjunction with
The techniques introduced above can be implemented in special-purpose hardwired circuitry, in software and/or firmware in conjunction with programmable circuitry, or in a combination thereof. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), and others. The term ‘processor’ is to be interpreted broadly to include a processing unit, ASIC, logic unit, or programmable gate array etc.
The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or any combination thereof.
Those skilled in the art will recognize that some aspects of the embodiments disclosed herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computing systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and or firmware would be well within the skill of one of skill in the art in light of this disclosure.
Software and/or firmware to implement the techniques introduced here may be stored on a non-transitory computer-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “computer-readable storage medium”, as the term is used herein, includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant (PDA), mobile device, manufacturing tool, any device with a set of one or more processors, etc.). A computer-readable storage medium may include recordable/non-recordable media (e.g., read-only memory (ROM), random access memory (RAM), magnetic disk or optical storage media, flash memory devices, etc.).
It will be understood that although the terms “first,” “second,” “third” and so forth are used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, within the scope of the present disclosure, a first element may be referred to as a second element, and similarly a second element may be referred to as a first element. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
The drawings are only illustrations of an example, wherein the units or procedure shown in the drawings are not necessarily essential for implementing the present disclosure. Those skilled in the art will understand that the units in the device in the examples can be arranged in the device in the examples as described, or can be alternatively located in one or more devices different from that in the examples. The units in the examples described can be combined into one module or further divided into a plurality of sub-units.
| Number | Date | Country | Kind |
|---|---|---|---|
| PCT/CN2023/070821 | Jan 2023 | WO | international |
The present application claims the benefit of Patent Cooperation Treaty (PCT) Application No. PCT/CN2023/070821, filed Jan. 6, 2023, which is incorporated herein by reference.