The subject matter of this disclosure is generally related to electronic data storage systems and more particularly to shared memory in such systems.
High capacity data storage systems such as storage area networks (SANs) are used to maintain large data sets and contemporaneously support multiple users. A storage array, which is an example of a SAN, includes a network of interconnected compute nodes that manage access to data stored on arrays of drives. The compute nodes access the data in response to input-output commands (IOs) from host applications that typically run on servers known as “hosts.” Examples of host applications may include, but are not limited to, software for email, accounting, manufacturing, inventory control, and a wide variety of other business processes. The IO workload on the storage array is normally distributed among the compute nodes such that individual compute nodes are each able to respond to IOs with no more than a target level of latency. However, unbalanced IO workloads and resource allocations can result in some compute nodes being overloaded while other compute nodes have unused memory and processing resources.
In accordance with some implementations an apparatus comprises: a data storage system comprising: a plurality of non-volatile drives; and a plurality of interconnected compute nodes that present at least one logical production volume to hosts and manage access to the drives, each of the compute nodes comprising a local memory and being configured to allocate a portion of the local memory to a shared memory that can be accessed by each of the compute nodes, the shared memory comprising cache slots that are used to store data for servicing input-output commands (IOs); wherein a first one of the compute nodes is configured to create donor cache slots that are available for donation to other ones of the compute nodes, a second one of the compute nodes is configured to generate a message that indicates a need for donor cache slots, and the first compute node is configured to provide at least some of the donor cache slots to the second compute node in response to the message, whereby the second compute node acquires remote donor cache slots without searching for candidates in remote portions of the shared memory.
In accordance with some implementations a method for acquiring remote donor cache slots without searching for candidates in remote portions of a shared memory in a data storage system comprising a plurality of non-volatile drives and a plurality of interconnected compute nodes that present at least one logical production volume to hosts and manage access to the drives, each of the compute nodes comprising a local memory and being configured to allocate a portion of the local memory to the shared memory that can be accessed by each of the compute nodes, the shared memory comprising cache slots that are used to store data for servicing input-output commands (IOs) comprises: a first one of the compute nodes creating donor cache slots that are available for donation to other ones of the compute nodes; a second one of the compute nodes generating a message that indicates a need for donor cache slots; and the first compute node providing at least some of the donor cache slots to the second compute node in response to the message.
In accordance with some implementations a computer-readable storage medium stores instructions that when executed by a compute node cause the compute node to perform a method for acquiring remote donor cache slots without searching for candidates in remote portions of a shared memory in a data storage system comprising a plurality of non-volatile drives and a plurality of interconnected compute nodes that present at least one logical production volume to hosts and manage access to the drives, each of the compute nodes comprising a local memory and being configured to allocate a portion of the local memory to the shared memory that can be accessed by each of the compute nodes, the shared memory comprising cache slots that are used to store data for servicing input-output commands (IOs), the method comprising: a first one of the compute nodes creating donor cache slots that are available for donation to other ones of the compute nodes; a second one of the compute nodes generating a message that indicates a need for donor cache slots; and the first compute node providing at least some of the donor cache slots to the second compute node in response to the message.
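For purposes of illustration only, the donation exchange summarized above can be modeled with a short Python sketch using one toy in-process object per compute node; the class and method names (ComputeNode, create_donor_slots, request_donation) are hypothetical and are not part of this disclosure.

# Toy model of the donor cache slot exchange described above.
class ComputeNode:
    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots   # locally available cache slots
        self.donor_slots = []          # slots reserved for donation to peers

    def create_donor_slots(self, count):
        # A lightly loaded node sets aside free cache slots as donor slots.
        for _ in range(min(count, len(self.free_slots))):
            self.donor_slots.append(self.free_slots.pop())

    def request_donation(self, peers, needed):
        # A cache-slot-starved node signals its need; donors respond with
        # already-reserved slots, so the requester never searches remote
        # portions of shared memory for candidates.
        acquired = []
        for peer in peers:
            while peer.donor_slots and len(acquired) < needed:
                acquired.append(peer.donor_slots.pop())
        return acquired

first = ComputeNode("first", free_slots=list(range(100)))
second = ComputeNode("second", free_slots=[])
first.create_donor_slots(10)
print(len(second.request_donation([first], needed=4)))  # prints 4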
All examples, aspects, and features mentioned in this disclosure can be combined in any technically possible way. The terminology used in this disclosure is intended to be interpreted broadly within the limits of subject matter eligibility. The terms “disk” and “drive” are used interchangeably herein and are not intended to refer to any specific type of non-volatile electronic storage media. The terms “logical” and “virtual” are used to refer to features that are abstractions of other features, e.g. and without limitation abstractions of tangible features. The term “physical” is used to refer to tangible features that possibly include, but are not limited to, electronic hardware. For example, multiple virtual computers could operate simultaneously on one physical computer. The term “logic,” if used herein, refers to special purpose physical circuit elements, firmware, software, and computer instructions that are stored on a non-transitory computer-readable medium and implemented by multi-purpose tangible processors, alone or in any combination. Aspects of the inventive concepts are described as being implemented in a data storage system that includes host servers and a storage array. Such implementations should not be viewed as limiting. Those of ordinary skill in the art will recognize that there are a wide variety of implementations of the inventive concepts in view of the teachings of the present disclosure.
Some aspects, features, and implementations described herein may include machines such as computers, electronic components, optical components, and processes such as computer-implemented procedures and steps. It will be apparent to those of ordinary skill in the art that the computer-implemented procedures and steps may be stored as computer-executable instructions on a non-transitory computer-readable medium. Furthermore, it will be understood by those of ordinary skill in the art that the computer-executable instructions may be executed on a variety of tangible processor devices, i.e. physical hardware. For practical reasons, not every step, device, and component that may be part of a computer or data storage system is described herein. Those of ordinary skill in the art will recognize such steps, devices, and components in view of the teachings of the present disclosure and the knowledge generally available to those of ordinary skill in the art. The corresponding machines and processes are therefore enabled and within the scope of the disclosure.
The storage array 100, which is depicted in a simplified data center environment with two host servers 103 that run host applications, is one example of a storage area network (SAN). The host servers 103 may be implemented as individual physical computing devices, virtual machines running on the same hardware platform under control of a hypervisor, or in containers on the same hardware platform. The storage array 100 includes one or more bricks 104. Each brick includes an engine 106 and one or more drive array enclosures (DAEs) 108. Each engine 106 includes a pair of interconnected compute nodes 112, 114 that are arranged in a failover relationship and may be referred to as “storage directors” or simply “directors.” Although it is known in the art to refer to the compute nodes of a SAN as “hosts,” that naming convention is avoided in this disclosure to help distinguish the network server hosts 103 from the compute nodes 112, 114. Nevertheless, the host applications could run on the compute nodes, e.g. on virtual machines or in containers. Each compute node includes resources such as at least one multi-core processor 116 and local memory 118. The processor may include central processing units (CPUs), graphics processing units (GPUs), or both. The local memory 118 may include volatile media such as dynamic random-access memory (DRAM), non-volatile memory (NVM) such as storage class memory (SCM), or both. Each compute node allocates a portion of its local memory 118 to a shared memory 210 that can be accessed by each of the compute nodes, e.g. via direct memory access (DMA) or remote direct memory access (RDMA).
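As a loose sketch of the allocation just described, each compute node can be modeled as contributing a portion of its local memory to an array-wide shared memory. The sizes, field names, and director names below are assumptions made for illustration.

# Hypothetical sketch: each compute node contributes part of its local
# memory to a shared memory visible to every node in the array.
from dataclasses import dataclass, field

@dataclass
class LocalMemory:
    total_gib: int
    shared_gib: int  # portion allocated to the array-wide shared memory

    @property
    def private_gib(self):
        return self.total_gib - self.shared_gib

@dataclass
class SharedMemory:
    contributions: dict = field(default_factory=dict)

    def size_gib(self):
        return sum(self.contributions.values())

shared = SharedMemory()
for node, mem in {"dir1": LocalMemory(512, 384), "dir2": LocalMemory(512, 384)}.items():
    shared.contributions[node] = mem.shared_gib

print(shared.size_gib())  # aggregate shared memory across both directors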
Data associated with instances of a host application running on the hosts 103 is maintained persistently on the managed drives 101. The managed drives 101 are not discoverable by the hosts 103, but the storage array creates logical storage devices known as production volumes 140, 142 that can be discovered and accessed by the hosts. Without limitation, a production volume may alternatively be referred to as a storage object, source device, production device, or production LUN, where the logical unit number (LUN) is a number used to identify logical storage volumes in accordance with the small computer system interface (SCSI) protocol. From the perspective of the hosts 103, each production volume 140, 142 is a single drive having a set of contiguous fixed-size logical block addresses (LBAs) on which data used by the instances of the host application resides. However, the host application data is stored at non-contiguous addresses on various managed drives 101, e.g. at ranges of addresses distributed on multiple drives or multiple ranges of addresses on one drive. The compute nodes maintain metadata that maps between the production volumes 140, 142 and the managed drives 101 in order to process IOs from the hosts.
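The mapping metadata can be pictured as a table from production volume extents to managed drive addresses. The sketch below is purely illustrative; the fixed extent size, drive identifiers, and the resolve function are assumptions, not the metadata structures actually used by the storage array.

# Illustrative sketch: contiguous LBAs on a production volume map to
# non-contiguous extents on managed drives.
EXTENT_BLOCKS = 256  # hypothetical fixed extent size in logical blocks

# production volume extent index -> (managed drive id, drive start address)
volume_map = {
    0: ("drive17", 40960),
    1: ("drive03", 8192),
    2: ("drive17", 122880),
}

def resolve(lba):
    # Translate a host-visible LBA into a managed-drive address.
    extent, offset = divmod(lba, EXTENT_BLOCKS)
    drive, base = volume_map[extent]
    return drive, base + offset

print(resolve(300))  # -> ('drive03', 8236)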
Donor cache slots are created and held in reserve by compute nodes based on operational status. Each compute node 300, 302, 304, 306 maintains operational status metrics 338, 340, 342, 344 such as one or more of recent cache slot allocation rate, current number of write-pending or dirty cache slots, current depth of local shared slot queues, and recent fall-through time (FTT). The recent cache slot allocation rate indicates how many local cache misses occurred within a predetermined window of time, e.g. the past S seconds or M minutes. The current number of write-pending (WP) or dirty cache slots indicates how many of the local cache slots contain changed data that must be destaged to the managed drives before the associated cache slot can be recycled. A smaller number indicates better suitability for creation of donor cache slots. The current depth of the local shared slot queues indicates the number of free cache slots required to service new IOs. The depth of the local shared slot queues also indicates the state of the race condition that exists between worker thread recycling and IO workload. A smaller depth indicates better suitability for creation of donor cache slots. Recent FTT indicates the average time that back-end tracks (BE TRKs) are resident in the local cache slots before being recycled, e.g. the time between being written to a cache slot and being flushed or destaged from the cache slot by a worker thread. A larger FTT indicates better suitability for creation of donor cache slots.
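By way of a hedged example, the four metrics can be combined into a heuristic that sizes a node's donor pool. The thresholds and weights in the following sketch are invented for illustration and are not values taken from this disclosure.

# Hypothetical heuristic: derive a donor cache slot count from the
# operational status metrics described above. Thresholds are illustrative.
def donor_slot_count(alloc_rate, wp_slots, queue_depth, ftt_ms,
                     total_slots, avg_ftt_ms):
    # A high miss rate or a deep free-slot queue indicates the node
    # needs its own cache; donate nothing.
    if alloc_rate > 1000 or queue_depth > 64:
        return 0
    # Too many write-pending slots means destage work is backed up.
    if wp_slots > 0.5 * total_slots:
        return 0
    # A fall-through time above the array average suggests spare capacity:
    # donate a fraction of slots proportional to the FTT surplus, capped.
    surplus = max(0.0, ftt_ms / avg_ftt_ms - 1.0)
    return int(min(0.10, 0.05 * surplus) * total_slots)

print(donor_slot_count(alloc_rate=200, wp_slots=1000, queue_depth=8,
                       ftt_ms=900, total_slots=100_000, avg_ftt_ms=300))  # 10000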
The operational status metrics 338, 340, 342, 344 are captured and written by the worker thread of each compute node to the shared memory 210 and used to calculate how many donor cache slots, if any, to create. In the illustrated example, compute nodes 300, 302, and 304 each generate a different quantity of donor cache slots 332, 334, 336 based on local operational status, while cache-slot-starved compute node 306 has no donor cache slots. The operational status information and donor cache slot information collectively form part of the Cache_Donation_Source Board-Mask. The cache-slot-starved compute node 306 generates a cache slot donation target message 346 that is broadcast to the other compute nodes 300, 302, 304. The message may be broadcast by writing to the Cache_Donation_Source Board-Mask. In response to the message, one or more of the potential remote cache slot donor compute nodes provides remote cache slots to the cache-slot-starved compute node 306. In the illustrated example, compute node 302 is shown donating remote cache slots to compute node 306. Donation of remote cache slots may include providing pointers to the locations of the remote cache slots in the shared memory. The remote cache slots can be accessed by the cache-slot-starved compute node 306 using DMA or RDMA. The local worker thread for the donated remote cache slots, e.g. worker thread (WT) 326 for the remote cache slots donated by compute node 302, eventually recycles the donated remote cache slots.
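The Cache_Donation_Source Board-Mask exchange can be pictured as a shared region that every node reads and writes. The following toy model is an assumption-laden sketch; the dictionary layout, node names, and pointer values stand in for shared-memory structures that would actually be accessed via DMA or RDMA.

# Toy model of the donation exchange through a shared board-mask.
board_mask = {
    # donor node -> shared-memory pointers to its reserved donor slots
    "dir302": [0x7f30_0000, 0x7f30_2000, 0x7f30_4000],
    # starved nodes post their need here for all nodes to see
    "targets": {},
}

def broadcast_need(node, count):
    # The starved node advertises how many donor slots it needs.
    board_mask["targets"][node] = count

def donate(donor, target):
    # A donor answers by handing over pointers; the target then accesses
    # the remote slots directly, without searching remote portions of
    # shared memory for candidates.
    need = board_mask["targets"].get(target, 0)
    granted, board_mask[donor] = board_mask[donor][:need], board_mask[donor][need:]
    board_mask["targets"][target] = need - len(granted)
    return granted

broadcast_need("dir306", 2)
print([hex(p) for p in donate("dir302", "dir306")])  # two donated pointers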
The number of cache slots to be queued as donor slots is limited to avoid degrading performance of the donor compute node. Capability to donate cache slots is based on per-director cache statistics, e.g. eliminating as donor candidates directors that have more than a predetermined number of write-pending (WP) slots, are above an 85% out-of-pool (dirty) slots limit, or have a local FTT that is below a predetermined level relative to the storage array average FTT for a specific segment. Per-director DSA statistics, predetermined pass/fail criteria for each emulation on the director, maximum work queue depth or some other indicator of spare cycles, and per-slice DSA statistics for the remaining emulations may also be used. Director cache statistics are not necessarily static, so the number of donor cache slots maintained by a director may be dynamically adjusted.
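These elimination criteria reduce to a per-director eligibility predicate. A minimal sketch follows, assuming placeholder threshold values for the WP and FTT criteria; only the 85% dirty-slot limit is stated in this disclosure, and each criterion is treated as an independent disqualifier.

# Per-director donor eligibility check mirroring the criteria above.
WP_LIMIT = 50_000          # hypothetical predetermined WP slot count
DIRTY_LIMIT = 0.85         # the stated 85% out-of-pool (dirty) slots limit
FTT_RATIO_FLOOR = 0.75     # hypothetical floor relative to array average

def can_donate(wp_slots, dirty_fraction, local_ftt_ms, array_avg_ftt_ms):
    if wp_slots > WP_LIMIT:
        return False                      # too much unwritten dirty data
    if dirty_fraction > DIRTY_LIMIT:
        return False                      # above the out-of-pool limit
    if local_ftt_ms < FTT_RATIO_FLOOR * array_avg_ftt_ms:
        return False                      # cache is recycling too fast
    return True

print(can_donate(wp_slots=12_000, dirty_fraction=0.40,
                 local_ftt_ms=800, array_avg_ftt_ms=600))  # True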
Specific examples have been presented to provide context and convey inventive concepts. The specific examples are not to be considered as limiting. A wide variety of modifications may be made without departing from the scope of the inventive concepts described herein. Moreover, the features, aspects, and implementations described herein may be combined in any technically possible way. Accordingly, modifications and combinations are within the scope of the following claims.