The present invention relates to methods and systems for repurposing a fraction of system-level over-provisioned (OP) space into a temporary hot spare, and more particularly relates to repurposing a fraction of system-level OP space on solid-state drives (SSDs) into a temporary hot spare.
A storage system with a plurality of storage units typically employs data redundancy techniques (e.g., RAID) to allow the recovery of data in the event one or more of the storage units fails. While data redundancy techniques address how to recover lost data, a remaining problem is where to store the recovered data. One possibility is to wait until the failed storage unit has been replaced or repaired before storing the recovered data on the restored storage unit. However, in the time before the failed storage unit has been restored, the storage system experiences a degraded mode of operation (e.g., more operations are required to compute error-correction blocks; when data on the failed storage unit is requested, the data must first be rebuilt, etc.). Another possibility is to reserve one of the storage units as a hot spare, and store the recovered data onto the hot spare. While a dedicated hot spare minimizes the time in which the storage system experiences a degraded mode of operation, a hot spare increases the hardware cost of the storage system.
Techniques are provided below for storing recovered data (in the event of a storage unit failure) prior to the restoration of the failed drive and without using a dedicated hot spare.
In accordance with one embodiment, lost data (i.e., data that is lost as a result of the failure of a storage unit) is recovered (or rebuilt) on system-level over-provisioned (OP) space, rather than on a dedicated hot spare. The storage space of a storage unit (e.g., an SSD) typically includes an advertised space (i.e., space that is part of the advertised capacity of the storage unit) and a device-level OP space (i.e., space that is reserved to perform maintenance tasks such as device-level garbage collection). The system-level OP space may be formed on a portion of the advertised space on each of a plurality of storage units and is typically used for system-level garbage collection. The system-level OP space may increase the system-level garbage collection efficiency, which reduces the system-level write amplification. If there is a portion of the system-level OP space not being used by the system-level garbage collection, such portion of the system-level OP space can be used by the device-level garbage collection. Hence, the system-level OP space may also increase the device-level garbage collection efficiency, which reduces the device-level write amplification.
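By way of illustration only, the relationship between the advertised space, the device-level OP space and the system-level OP space may be sketched in Python as follows; the class name SsdSpace and the ten-percent fraction are assumptions made solely for this sketch and are not part of any embodiment.

    from dataclasses import dataclass

    SYSTEM_OP_FRACTION = 0.10  # assumed fraction of advertised space designated as system-level OP

    @dataclass
    class SsdSpace:
        advertised_gb: float   # capacity advertised to the host
        device_op_gb: float    # reserved by the SSD for device-level maintenance tasks

        @property
        def system_op_gb(self) -> float:
            # portion of the advertised space set aside as system-level OP space
            return self.advertised_gb * SYSTEM_OP_FRACTION

    # Three SSDs of 80 GB advertised / 20 GB device-level OP each; the system-level
    # OP spaces together form a pool available to the storage system controller.
    ssds = [SsdSpace(advertised_gb=80, device_op_gb=20) for _ in range(3)]
    system_op_pool_gb = sum(s.system_op_gb for s in ssds)   # 24.0 GB in this sketch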
Upon the failure of a storage unit, a portion of the system-level OP space may be repurposed as a temporary hot spare, trading off system-level garbage collection efficiency (and possibly device-level garbage collection efficiency) for a shortened degraded mode of operation (as compared to waiting for the repair and/or replacement of the failed drive). The recovered or rebuilt data may be saved on the temporary hot spare (avoiding the need for a dedicated hot spare). After the failed storage unit has been repaired and/or replaced, the rebuilt data may be copied from the temporary hot spare onto the restored storage unit, and the storage space allocated to the temporary hot spare may be returned to the system-level OP space.
In accordance with one embodiment, a method is provided for a storage system having a plurality of solid-state drives (SSDs). Each of the SSDs may have an advertised space and a device-level OP space. For each of the SSDs, a controller of the storage system may designate a portion of the advertised space as a system-level OP space, thereby forming a collection of system-level OP spaces. In response to the failure of one of the SSDs, the storage system controller may repurpose a portion of the collection of system-level OP spaces into a temporary spare drive, rebuild data of the failed SSD, and store the rebuilt data onto the temporary spare drive. The temporary spare drive may be distributed across the SSDs that have not failed.
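One possible way to distribute such a temporary spare drive across the surviving SSDs is sketched below in Python; the function name, the drive identifiers and the equal-share policy are hypothetical and serve only to illustrate the idea.

    def allocate_distributed_spare(system_op_gb_per_ssd, failed_ssd, needed_gb):
        """Lend an equal share of system-level OP space from each surviving SSD."""
        survivors = [ssd for ssd in system_op_gb_per_ssd if ssd != failed_ssd]
        share_gb = needed_gb / len(survivors)
        layout = {}
        for ssd in survivors:
            if share_gb > system_op_gb_per_ssd[ssd]:
                raise RuntimeError(f"not enough system-level OP space on {ssd}")
            layout[ssd] = share_gb   # portion of this SSD lent to the temporary spare
        return layout

    # Example: SSD "108b" fails; its data is rebuilt onto OP space borrowed
    # from SSDs "108a" and "108c".
    op_space = {"108a": 8.0, "108b": 8.0, "108c": 8.0}
    print(allocate_distributed_spare(op_space, failed_ssd="108b", needed_gb=12.0))
    # {'108a': 6.0, '108c': 6.0}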
These and other embodiments of the invention are more fully described in association with the drawings below.
In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. Description associated with any one of the figures may be applied to a different figure containing like or similar components/steps. While the flow diagrams each present a series of steps in a certain order, the order of the steps may be changed.
Storage system 104 may comprise storage system controller 106 and a plurality of storage units 108a-108c. While three storage units 108a-108c are depicted, a greater or fewer number of storage units may be present. In a preferred embodiment, each of the storage units is a solid-state drive (SSD). Storage system controller 106 may include a processor and memory (not depicted). The memory may store computer-readable instructions, which when executed by the processor, cause the processor to perform data redundancy and/or recovery operations on storage system 104 (described below). Storage system controller 106 may also act as an intermediary agent between host device 102 and each of the storage units 108a-108c, such that requests of host device 102 are forwarded to the proper storage unit(s), and data retrieved from the storage unit(s) is organized in a logical manner (e.g., data blocks are assembled into a data stripe) before being returned to host device 102.
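Purely for illustration, the assembly of data blocks into a data stripe may be sketched as follows; the unit identifiers and block contents are hypothetical.

    def assemble_stripe(blocks_by_unit, stripe_index):
        """Gather the block each storage unit holds for a stripe and join them in unit order."""
        return b"".join(blocks_by_unit[unit][stripe_index] for unit in sorted(blocks_by_unit))

    blocks_by_unit = {"108a": [b"AA"], "108b": [b"BB"], "108c": [b"CC"]}
    print(assemble_stripe(blocks_by_unit, stripe_index=0))   # b'AABBCC'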
Each of the storage units may include an SSD controller (which is separate from storage system controller 106) and a plurality of flash modules. For example, storage unit 108a may include SSD controller 110a, and two flash modules 112a, 114a. Storage unit 108b may include SSD controller 110b, and two flash modules 112b, 114b. Similarly, storage unit 108c may include SSD controller 110c, and two flash modules 112c, 114c. While each of the SSDs is shown with two flash modules for ease of illustration, it is understood that each SSD may contain many more flash modules. In one embodiment, a flash module may include one or more flash chips.
The SSD controller may perform flash management tasks, such as device-level garbage collection (e.g., garbage collection which involves copying blocks within one SSD). The SSD controller may also implement data redundancy across the flash modules within the SSD. For example, one of the flash modules could be dedicated for storing error-correction blocks, while the remaining flash modules could be dedicated for storing data blocks.
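As a simplified illustration of redundancy across flash modules, the following Python sketch protects two hypothetical data modules with a single XOR parity module; an actual SSD controller may use a different error-correction scheme.

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, byte in enumerate(blk):
                out[i] ^= byte
        return bytes(out)

    data_module_a = b"\x01\x02\x03\x04"
    data_module_b = b"\x10\x20\x30\x40"
    parity_module = xor_blocks([data_module_a, data_module_b])   # error-correction module

    # If data_module_b is lost, it can be rebuilt from the surviving module and parity.
    assert xor_blocks([data_module_a, parity_module]) == data_module_b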
SSD controller 110a may access any storage space within SSD 108a (i.e., advertised space 216a and device-level OP space 218a). SSD controller 110b may access any storage space within SSD 108b (i.e., advertised space 216b and device-level OP space 218b). Similarly, SSD controller 110c may access any storage space within SSD 108c (i.e., advertised space 216c and device-level OP space 218c). In contrast to the SSD controllers, storage system controller 106 may access the advertised space across the SSDs (i.e., advertised space 216a, advertised space 216b and advertised space 216c), but may not have access to the device-level OP space (i.e., device-level OP space 218a, device-level OP space 218b and device-level OP space 218c). Similar to storage system controller 106, host device 102 may access (via storage system controller 106) the advertised space across the SSDs (i.e., advertised space 216a, advertised space 216b and advertised space 216c), but may not have access to the device-level OP space (i.e., device-level OP space 218a, device-level OP space 218b and device-level OP space 218c).
The OP percentage of an SSD is typically defined as the device-level OP storage capacity divided by the advertised storage capacity. For example, in an SSD with 80 GB advertised storage capacity and 20 GB device-level OP storage capacity, the device-level OP percentage would be 20 GB/80 GB or 25%. Continuing with this example, if each of the SSDs in storage system 104 has 80 GB of advertised storage capacity and 20 GB of device-level OP storage capacity, then the advertised storage capacity of storage system 104 would be 240 GB and the device-level OP percentage would be 60 GB/240 GB or 25%.
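The arithmetic of this example may be restated as the following short Python sketch.

    advertised_gb = 80
    device_op_gb = 20
    device_op_percent = device_op_gb / advertised_gb * 100                  # 25.0%

    num_ssds = 3
    system_advertised_gb = num_ssds * advertised_gb                         # 240 GB
    system_device_op_gb = num_ssds * device_op_gb                           # 60 GB
    system_op_percent = system_device_op_gb / system_advertised_gb * 100    # 25.0%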
SSD controller 110a may access any storage space within SSD 108a (i.e., advertised space 316a, system-level OP space 320a and device-level OP space 218a). SSD controller 110b may access any storage space within SSD 108b (i.e., advertised space 316b, system-level OP space 320b and device-level OP space 218b). Similarly, SSD controller 110c may access any storage space within SSD 108c (i.e., advertised space 316c, system-level OP space 320c and device-level OP space 218c). In contrast to the SSD controllers, storage system controller 106 may access the advertised space and system-level OP space across the SSDs (i.e., advertised space 316a, advertised space 316b, advertised space 316c, system-level OP space 320a, system-level OP space 320b and system-level OP space 320c), but may not have access to the device-level OP space (i.e., device-level OP space 218a, device-level OP space 218b and device-level OP space 218c). In contrast to storage system controller 106, host device 102 may access (via storage system controller 106) the advertised space across the SSDs (i.e., advertised space 316a, advertised space 316b and advertised space 316c), but may not have access to the system-level OP space across the SSDs (i.e., system-level OP space 320a, system-level OP space 320b and system-level OP space 320c) or the device-level OP space across the SSDs (i.e., device-level OP space 218a, device-level OP space 218b and device-level OP space 218c).
The system-level OP space may be used by storage system controller 106 to perform system-level garbage collection (e.g., garbage collection which involves copying blocks from one storage unit to another storage unit). The system-level OP space may increase the system-level garbage collection efficiency, which reduces the system-level write amplification. If there is a portion of the system-level OP space not being used by the system-level garbage collection, such portion of the system-level OP space can be used by the device-level garbage collection. Hence, the system-level OP space may also increase the device-level garbage collection efficiency, which reduces the device-level write amplification. However, in a failure mode (e.g., failure of one or more of the SSDs), a portion of the system-level OP space may be repurposed as a temporary hot spare drive (as shown in the accompanying drawings).
In one embodiment, the amount of system-level OP space that is repurposed may be the number of failed SSDs multiplied by the advertised capacity (e.g., 216a, 216b, 216c) of each of the SSDs (assuming that all the SSDs have the same capacity). In another embodiment, the amount of system-level OP space that is repurposed may be the sum of each of the respective advertised capacities (e.g., 216a, 216b, 216c) of the failed SSDs. In another embodiment, the amount of system-level OP space that is repurposed may be equal to the amount of space needed to store all the rebuilt data. In yet another embodiment, system-level OP space may be re-purposed on the fly (i.e., on an as-needed basis). For instance, a portion of the system-level OP space may be re-purposed to store one rebuilt data block, then another portion of the system-level OP space may be re-purposed to store another rebuilt data block, and so on.
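The first three sizing alternatives above may be sketched in Python as follows (the on-the-fly alternative, which repurposes space block by block, is not shown); the function and policy names are hypothetical.

    def repurposed_op_gb(failed_capacities_gb, advertised_gb_per_ssd=None,
                         rebuilt_data_gb=None, policy="per_failed_drive"):
        if policy == "per_failed_drive":
            # number of failed SSDs multiplied by a uniform advertised capacity
            return len(failed_capacities_gb) * advertised_gb_per_ssd
        if policy == "sum_of_failed_capacities":
            # sum of the respective advertised capacities of the failed SSDs
            return sum(failed_capacities_gb.values())
        if policy == "rebuilt_data_only":
            # only as much space as the rebuilt data actually occupies
            return rebuilt_data_gb
        raise ValueError("unknown policy")

    failed = {"108b": 80}   # hypothetical failed SSD and its advertised capacity in GB
    print(repurposed_op_gb(failed, advertised_gb_per_ssd=80))                        # 80
    print(repurposed_op_gb(failed, policy="sum_of_failed_capacities"))               # 80
    print(repurposed_op_gb(failed, rebuilt_data_gb=52, policy="rebuilt_data_only"))  # 52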
As mentioned above, repurposing the system-level OP space may increase the system-level write amplification (and lower the efficiency of system-level garbage collection). Therefore, in some embodiments, there may be a limit on the maximum amount of system-level OP space that can be repurposed, and this limit may be dependent on the write amplification of the system-level garbage collection. If the system-level write amplification is high, the limit may be decreased (i.e., more system-level OP space can be reserved for garbage collection). If, however, the system-level write amplification is low, the limit may be increased (i.e., less system-level OP space can be reserved for garbage collection).
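One possible way to derive such a limit from the observed system-level write amplification is sketched below; the thresholds and fractions are invented solely for illustration.

    def max_repurposable_fraction(system_write_amplification):
        """Fraction of the system-level OP space that may be lent out as a spare."""
        if system_write_amplification >= 3.0:   # high WA: keep most OP space for garbage collection
            return 0.25
        if system_write_amplification >= 1.5:   # moderate WA
            return 0.50
        return 0.75                             # low WA: most OP space may be repurposed

    system_op_gb = 24
    limit_gb = system_op_gb * max_repurposable_fraction(system_write_amplification=1.2)   # 18.0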
It is noted that in some instances, the amount of data that needs to be rebuilt may exceed the amount of system-level OP space that can be repurposed. In such cases, the data of some of the failed storage unit(s) may be rebuilt and stored on temporary spare drive(s), while the data of the other failed storage unit(s) remains unrecovered, forcing the storage system to temporarily operate in a degraded mode with respect to those unit(s).
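One possible policy for choosing which failed units to rebuild in such a situation is sketched below; the greedy smallest-first ordering is an assumption made only for this sketch.

    def plan_rebuilds(failed_unit_sizes_gb, repurposable_gb):
        """Decide which failed units to rebuild onto temporary spares and which stay degraded."""
        rebuilt, degraded, remaining = [], [], repurposable_gb
        # Rebuild the smallest units first so that as many units as possible leave degraded mode.
        for unit, size in sorted(failed_unit_sizes_gb.items(), key=lambda kv: kv[1]):
            if size <= remaining:
                rebuilt.append(unit)
                remaining -= size
            else:
                degraded.append(unit)
        return rebuilt, degraded

    print(plan_rebuilds({"108a": 80, "108c": 40}, repurposable_gb=60))   # (['108c'], ['108a'])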
In step 504 (during a normal mode of operation of storage system 304), the system-level OP space may be used by storage system controller 106 to perform system-level garbage collection more efficiently (i.e., by reducing write amplification).
Subsequent to step 504 and prior to step 506, storage system 304 may enter a failure mode (e.g., one of the storage units may fail). At step 506, storage system controller 106 may repurpose a fraction of the system-level OP space as a temporary hot spare. At step 508, storage system controller 106 may rebuild data of the failed storage unit. At step 510, storage system controller 106 may store the rebuilt data on the temporary hot spare. At step 512, the failed storage unit may be restored, either by being replaced or by being repaired. At step 514, storage system controller 106 may copy the rebuilt data from the temporary hot spare onto the restored storage unit. At step 516, storage system controller 106 may convert the temporary hot spare drive back into system-level OP space. Storage system 304 may then resume a normal mode of operation, in which system-level OP space is used to more efficiently perform system-level garbage collection (step 504).
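The sequence of steps 506 through 516 may be traced with the following Python sketch, which uses an in-memory stand-in for the storage system; every function and variable name is a hypothetical placeholder rather than an actual controller interface.

    def handle_failure(system_op_pool, lost_addresses, rebuild_block):
        # Step 506: repurpose part of the system-level OP pool as a temporary spare.
        assert len(lost_addresses) <= system_op_pool["free_blocks"]
        system_op_pool["free_blocks"] -= len(lost_addresses)
        # Steps 508-510: rebuild the lost blocks and store them on the temporary spare.
        return {addr: rebuild_block(addr) for addr in lost_addresses}

    def restore_drive(system_op_pool, temporary_spare):
        # Step 514: copy the rebuilt data onto the restored storage unit.
        restored_drive = dict(temporary_spare)
        # Step 516: return the borrowed space to the system-level OP pool.
        system_op_pool["free_blocks"] += len(temporary_spare)
        return restored_drive

    pool = {"free_blocks": 100}
    spare = handle_failure(pool, lost_addresses=[0, 1], rebuild_block=lambda a: b"rebuilt-%d" % a)
    drive = restore_drive(pool, spare)    # step 512 (restoration) is assumed to have occurred
    assert pool["free_blocks"] == 100     # OP space fully returned after step 516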
It is noted that the embodiment of
In the arrangement, error-correction blocks are labeled with reference labels that begin with the letter “P”, “Q” or “R”; data blocks are labeled with reference labels that begin with the letter “d”; OP blocks are labeled with reference labels that begin with the string “OP”; and spare blocks are labeled with reference labels that begin with the letter “S”.
Each row of error-correction blocks and data blocks may belong to one data stripe (or "stripe" for short). For example, stripe 0 may include data blocks d.00, d.01, d.02, d.03 and d.04, and error-correction blocks P.0, Q.0 and R.0. If three or fewer of the blocks (i.e., data and error-correction blocks) are lost, the remaining blocks in the data stripe (i.e., data and error-correction blocks) may be used to rebuild the lost blocks. The specific techniques to rebuild blocks are known in the art and will not be described further herein. Since each stripe contains three parity blocks, the redundancy scheme is known as "triple parity". While the example employs triple parity, it is understood that other levels of parity may be employed without departing from the spirit of the invention.
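As a simplified illustration of stripe-based rebuild, the following Python sketch recovers a single lost data block of stripe 0 from the surviving blocks using the XOR (P) parity alone; a triple-parity scheme additionally maintains the Q and R syndromes (e.g., Reed-Solomon based) so that up to three lost blocks can be rebuilt, and that math is omitted here.

    def xor_all(blocks):
        out = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, byte in enumerate(blk):
                out[i] ^= byte
        return bytes(out)

    data = [b"\x11\x11", b"\x22\x22", b"\x33\x33", b"\x44\x44", b"\x55\x55"]   # d.00-d.04
    p_parity = xor_all(data)                                                   # P.0

    lost_index = 2                    # suppose d.02 resided on the failed SSD
    survivors = [blk for i, blk in enumerate(data) if i != lost_index]
    assert xor_all(survivors + [p_parity]) == data[lost_index]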
Certain blocks of the arrangement are illustrated with a horizontal line pattern. These blocks will be the primary focus of the operations described in the subsequent figures.
In response to the failure of SSD 4, OP blocks may be repurposed into a temporary spare drive so that the contents of the failed drive may be rebuilt on the spare drive. An arrangement of blocks after OP blocks have been repurposed into spare blocks is depicted in the accompanying drawings.
In the example of
In the example of
In the example of
While the embodiments above have described re-purposing a fraction of the system-level OP space as a temporary hot spare, it is possible, in some embodiments, to re-purpose a fraction of the system-level OP space for other purposes, such as for logging data, caching data, storing a process core dump and storing a kernel crash dump. More generally, it is possible to re-purpose a fraction of the system-level OP space for any use case, as long as the use is for a short-lived “emergency” task that is higher in priority than garbage collection efficiency.
As is apparent from the foregoing discussion, aspects of the present invention involve the use of various computer systems and computer readable storage media having computer-readable instructions stored thereon.
Computer system 1700 includes a bus 1702 or other communication mechanism for communicating information, and a processor 1704 coupled with the bus 1702 for processing information. Computer system 1700 also includes a main memory 1706, such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 1702 for storing information and instructions to be executed by processor 1704. Main memory 1706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1704. Computer system 1700 further includes a read only memory (ROM) 1708 or other static storage device coupled to the bus 1702 for storing static information and instructions for the processor 1704. A storage device 1710, which may be one or more of a floppy disk, a flexible disk, a hard disk, flash memory-based storage medium, magnetic tape or other magnetic storage medium, a compact disk (CD)-ROM, a digital versatile disk (DVD)-ROM, or other optical storage medium, or any other storage medium from which processor 1704 can read, is provided and coupled to the bus 1702 for storing information and instructions (e.g., operating systems, applications programs and the like).
Computer system 1700 may be coupled via the bus 1702 to a display 1712, such as a flat panel display, for displaying information to a computer user. An input device 1714, such as a keyboard including alphanumeric and other keys, is coupled to the bus 1702 for communicating information and command selections to the processor 1704. Another type of user input device is cursor control device 1716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1704 and for controlling cursor movement on the display 1712. Other user interface devices, such as microphones, speakers, etc. are not shown in detail but may be involved with the receipt of user input and/or presentation of output.
The processes referred to herein may be implemented by processor 1704 executing appropriate sequences of computer-readable instructions contained in main memory 1706. Such instructions may be read into main memory 1706 from another computer-readable medium, such as storage device 1710, and execution of the sequences of instructions contained in the main memory 1706 causes the processor 1704 to perform the associated actions. In alternative embodiments, hard-wired circuitry or firmware-controlled processing units (e.g., field programmable gate arrays) may be used in place of or in combination with processor 1704 and its associated computer software instructions to implement the invention. The computer-readable instructions may be rendered in any computer language including, without limitation, C#, C/C++, Fortran, COBOL, PASCAL, assembly language, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (CORBA), Java™ and the like. In general, all of the aforementioned terms are meant to encompass any series of logical steps performed in a sequence to accomplish a given purpose, which is the hallmark of any computer-executable application. Unless specifically stated otherwise, it should be appreciated that throughout the description of the present invention, use of terms such as “processing”, “computing”, “calculating”, “determining”, “displaying” or the like, refers to the action and processes of an appropriately programmed computer system, such as computer system 1700 or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within its registers and memories into other data similarly represented as physical quantities within its memories or registers or other such information storage, transmission or display devices.
Computer system 1700 also includes a communication interface 1718 coupled to the bus 1702. Communication interface 1718 provides a two-way data communication channel with a computer network, which provides connectivity to and among the various computer systems discussed above. For example, communication interface 1718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, which itself is communicatively coupled to the Internet through one or more Internet service provider networks. The precise details of such communication paths are not critical to the present invention. What is important is that computer system 1700 can send and receive messages and data through the communication interface 1718 and in that way communicate with hosts accessible via the Internet.
Thus, methods and systems for repurposing system-level OP space into temporary spare drive(s) have been described. It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.