The present invention generally relates to a leveling processing technique for data stored in flash memories constituting storage media for a storage apparatus.
When rewriting a flash memory, it is necessary to first perform an operation called “erasing” on a block, which is the memory unit of the flash memory, and then write the new data to the block. Because of physical limitations, each block can withstand only a limited number of erase operations: approximately 5,000 erases for a Multi Level Cell (MLC) type flash memory and approximately 100,000 erases for a Single Level Cell (SLC) type flash memory.
As data is rewritten, the number of erases comes to vary from block to block, so the flash memory cannot be used efficiently. A technique called “wear leveling” equalizes this imbalance. Among the various wear leveling systems, a representative one is “Hot-Cold (HC) wear leveling,” which swaps data between “Hot” blocks, whose number of erases is large, and “Cold” blocks, whose number of erases is small (see Non-patent Document 1).
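Purely as an illustration (and not the algorithm of Non-patent Document 1 itself), an HC swap can be sketched as follows in Python; the block representation and the gap threshold are assumptions made for this sketch.

```python
# Minimal Hot-Cold (HC) wear-leveling sketch (illustrative only).
# Each block is modeled as a dict holding its erase count and data.

def hc_wear_level(blocks, gap_threshold=1000):
    """Swap data between the most-erased ("Hot") and least-erased
    ("Cold") block once their erase counts diverge too far."""
    hot = max(blocks, key=lambda b: b["erases"])
    cold = min(blocks, key=lambda b: b["erases"])
    if hot["erases"] - cold["erases"] >= gap_threshold:
        # Rarely rewritten (cold) data is parked on the worn block;
        # frequently rewritten data moves to the fresher block.
        hot["data"], cold["data"] = cold["data"], hot["data"]
        # Rewriting a block implies erasing it first.
        hot["erases"] += 1
        cold["erases"] += 1

blocks = [{"erases": 4800, "data": "frequently rewritten"},
          {"erases": 120, "data": "rarely rewritten"}]
hc_wear_level(blocks)
print(blocks)
```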
In these wear leveling systems, data is leveled within a flash memory package equipped with a plurality of flash memory blocks.
Furthermore, a wear leveling system has been proposed in which a plurality of flash memory modules in a storage apparatus is treated as one group, and the above-described wear leveling is conducted across that group (see Patent Document 1).
[Non-patent Document 1] “On Efficient Wear Leveling for Large-Scale Flash-Memory Storage Systems,” http://www.cis.nctu.edu.tw/~lpchang/papers/crm_sac07.pdf
In the system described in Patent Document 1, suppose a flash memory module (flash memory package) fails and is replaced with a new flash memory module. If blocks with a small number of erases are then selected from the flash memory modules as wear leveling object blocks, the selected blocks may be concentrated in the flash memories of the new module and, as a result, the data in the flash memory modules after the replacement may not be sufficiently leveled.
In other words, when a flash memory module is replaced in, or added to, a plurality of flash memory modules in the conventional art, the life of the flash memories may vary from module to module due to the imbalance in the number of erases.
The present invention was devised in light of the problem of the conventional art described above, and it is an object of the invention to provide a storage apparatus, and a data control method for it, that enable efficient leveling among a plurality of flash memory packages including a newly added substitute flash memory package.
In order to achieve the above-described object, the present invention is characterized in that a property of the data in a plurality of flash memory packages is treated as an attribute, and the data is migrated between the flash memory packages based on that attribute so that the blocks selected for leveling are not concentrated in any one of the flash memory packages, including a newly added substitute flash memory package.
The present invention can efficiently perform leveling among a plurality of flash memory packages including a newly added substitute flash memory package.
According to the present embodiment, a property of the data in a plurality of flash memory packages is treated as an attribute, and the data is migrated between the flash memory packages based on that attribute so that, when leveling is performed, the selected blocks are not concentrated in any one of the flash memory packages, including a newly added substitute flash memory package.
A storage apparatus 100 serving as a storage subsystem is constituted from a plurality of storage controllers 110, internal bus networks 120, flash memory packages 130, and a service processor (SVP) 140.
The storage controller 110 is constituted from a channel I/F 111 for connection to a host 300 via, for example, Ethernet (registered trademark) or Fibre Channel, a CPU (Central Processing Unit) 112 for processing I/O (input/output), a memory (MEM) 113 for storing programs and control information, an I/F 114 for connection to a bus inside the storage subsystem, and a network interface card (NIC) 115 for connection to the service processor 140. Incidentally, PCI-Express is used as the I/F 114 in this embodiment, but an I/F such as SAS (Serial Attached SCSI) or Fibre Channel, or a network such as Ethernet, may be used instead.
The internal bus network 120 is constituted from a switch that can be connected to, for example, PCI-Express. Incidentally, a bus-type network may be used as the internal bus network 120, if necessary.
Each flash memory package (hereinafter referred to as the “FMPK”) 130 is constituted from a plurality of flash memories 132 and a flash memory adapter (FMA) 131 for controlling access to data in the flash memories 132 based on access from the internal I/F 114. This FMPK 130 may be a flash memory package accessed as memory, or a flash memory package, such as a Solid State Disk (SSD), that has a disk I/F for, for example, Fibre Channel or SAS.
The service processor (SVP) 140 loads the programs that should run on the storage controller 110 into the storage controller 110, initializes the storage subsystem, and manages it. The service processor 140 is constituted from a processor 141, a memory 142, a disk 143 for storing an OS (Operating System) and a microcode program for the storage controller 110, a network interface card (NIC) 144 for connection to the storage controller 110, and a network interface card (NIC) 145, such as Ethernet, for connection to an external management console 500.
This storage apparatus 100 is connected to the host 300 via a SAN (Storage Area Network) 200 and is also connected to the management console 500 via a LAN (Local Area Network) 400.
The host 300 is a server computer and contains a CPU 301, a memory (MEM) 302, and a disk (HDD) 303. The host 300 also has a host bus adapter (HBA) 304 for, for example, SCSI (Small Computer System Interface) data transfer to/from the storage apparatus 100.
The SAN 200 uses a protocol according to which SCSI commands can be transferred. For example, protocols such as Fibre Channel, iSCSI, SCSI over Ethernet, or SAS can be used. In this embodiment, a Fibre Channel network is used.
The management console 500 is a server computer and contains a CPU 501, a memory (MEM) 502, and a disk (HDD) 503. The management console 500 also has a network interface card (NIC) 504 capable of communicating with the service processor 140 according to TCP/IP (Transmission Control Protocol/Internet Protocol); the NIC 504 connects to a network, such as an Ethernet network, that enables communications between the server and a client.
The LAN 400 operates according to an IP (Internet Protocol) based protocol such as TCP/IP and is connected to the network interface card (NIC) 145 via a network, such as an Ethernet network, that enables communications between the server and a client.
The storage controller 110 executes the microcode program 160 provided by the service processor (SVP) 140. The microcode program 160 is delivered by a maintenance person loading a memory medium, such as a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), or a USB (Universal Serial Bus) memory, into the service processor (SVP) 140.
In this situation, the storage controller 110 constitutes a leveling processing unit for managing data in each block of a plurality of FMPKs 130 according to the microcode program 160 and performs leveling processing on data in blocks belonging to leveling object devices.
The microcode program 160 holds the following management information in the memory of the storage controller 110: a PDEV-FMPK table 166 showing the correspondence relationship between FMPKs and the physical devices (hereinafter referred to as “PDEVs”) that are the management units for FMPKs; a RAID group table 161 that defines data protection units for groups of PDEVs 133; a PDEV format table 162 that defines a data area and a user area for the flash memories existing in the PDEVs; a column device (hereinafter referred to as “CDEV”) table 163 that defines the range of wear leveling for groups of PDEVs 133; an LDEV SEG-PDEV BLK mapping table (hereinafter referred to as the “L_SEG-P_BLK mapping table”) 164 showing the mapping relationship between address spaces in LDEVs and address spaces in PDEVs; an inter-PDEV wear leveling behavior bit 168 showing the type of wear leveling control behavior; and a WL (Wear Leveling) object block list 169 listing the blocks whose data is to be migrated when wear leveling is performed among FMPKs.
Furthermore, the microcode program 160 has the following processing units: an I/O processing unit (I/O operations) 167; an intra-PDEV wear leveling processing unit (WL inside PDEV) 165 for performing wear leveling processing (also called “smoothing” or “leveling processing”) on the number of erases among the flash memory blocks within a PDEV 133; and an inter-PDEV wear leveling processing unit (WL among PDEVs) 190 for performing wear leveling processing on the number of erases of the flash memories among the PDEVs 133 defined by CDEVs 136. The microcode program 160 executes this processing whenever necessary; the details will be explained later.
Besides the processing described above, the microcode program 160 may perform processing that the storage apparatus 100 should be in charge of, for example, managing the configuration of the storage apparatus 100 and protecting data with a Redundant Array of Independent Disks (RAID).
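The text defines only the roles of these tables, not their field layouts; as a non-authoritative reading aid for the steps that follow, one possible in-memory shape is sketched below in Python, where every field name is an assumption.

```python
# Illustrative shapes for the management information named above.
# Field layouts are assumptions; the text defines only each table's role.
from dataclasses import dataclass, field

@dataclass
class PdevFmpkEntry:          # PDEV-FMPK table 166
    pdev: int                 # PDEV number (slot number)
    fmpk: int                 # FMPK number

@dataclass
class SegmentMapping:         # one row of the L_SEG-P_BLK mapping table 164
    ldev: int
    segment: int              # LDEV-side address space
    pdev: int
    blk: int                  # PDEV-side address space
    attribute: str = "L"      # "H" (high access) or "L" (low access)
    lock: str = "-"           # "-" means not locked
    moved: str = "-"          # "-" means not moved

@dataclass
class Cdev:                   # one row of the column device table 163
    cdev: int
    pdevs: list = field(default_factory=list)  # wear-leveling range

wl_behavior_bit = "L"         # inter-PDEV WL behavior bit 168: "L" or "H"
wl_object_block_list = []     # WL object block list 169

print(SegmentMapping(ldev=0, segment=0, pdev=1, blk=42))
```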
The microcode program 160 manages the FMPKs 130, for example, as follows. It first manages the logical storage areas of the flash memories 132 belonging to the FMPKs 130 in units called PDEVs 133, which are logical management units. It then constructs a plurality of RAID groups (RGs) 134 out of a plurality of PDEVs 133 and protects the data in the flash memories 132 within each RG. A stripe line 137 extending across a plurality of PDEVs 133 in a fixed management unit (for example, 256 KB) can be used as a unit for managing data.
The stripe line 137 is the data migration unit when wear leveling is performed within a PDEV 133 or among PDEVs 133 as described later; specifically, when wear leveling is performed among RGs, data is migrated in units of stripe lines. Furthermore, when wear leveling is performed among PDEVs 133 as described later, CDEVs 136 that define groups of PDEVs 133 are defined, and those CDEVs 136 constitute the leveling object devices.
The microcode program 160 manages data for each RG and performs wear leveling within the CDEV 136, thereby protecting storage areas and improving availability. A plurality of logical devices (hereinafter referred to as “LDEVs”) 135, which are logical storage spaces, is prepared on the CDEVs 136 in the storage apparatus 100; each LDEV 135 is constructed across a plurality of CDEVs 136. Each LDEV 135 serves as a logical unit for the host 300 and handles SCSI read and write processing from the host 300, using the WWN (World Wide Name) and LU number assigned to it by the microcode program 160.
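As a sketch only, assuming simple round-robin placement of 256 KB management units across the PDEVs of an RG (the text does not state the placement rule, and parity rotation is ignored), the stripe-line arithmetic could look like this:

```python
# Hypothetical address arithmetic for stripe lines 137 across PDEVs.
STRIPE_UNIT = 256 * 1024          # the 256 KB management unit from the text

def locate(byte_offset, num_pdevs):
    """Map a byte offset to (stripe line, PDEV index, offset in unit),
    assuming round-robin striping with no parity rotation."""
    unit_no = byte_offset // STRIPE_UNIT
    return (unit_no // num_pdevs,          # which stripe line 137
            unit_no % num_pdevs,           # which PDEV 133 in the RG
            byte_offset % STRIPE_UNIT)     # offset within the unit

print(locate(1_000_000, 4))               # -> (0, 3, 213568)
```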
The SVP 140 has an OS as well as a management program 142 and a GUI (Graphical User Interface) 141 that are used by the maintenance person to give operational instructions to the microcode program 160.
After the host 300 uses an OS 310 to recognize the volumes of the logical units (LUs) mentioned above, it creates and formats a device file; the device file can then be accessed by applications 320. A common OS such as UNIX (a registered trademark of The Santa Cruz Operation, Inc.) or Windows (Microsoft's registered trademark) can be used as the OS 310.
Each time the microcode program 160 executes the processing for erasing data in a block prior to rewriting the block, the number of erases is recorded as an accumulated count in the “number of erases” field 4003.
The size of a segment is equal to that of a block (for example, 256 KB) in a flash memory 132, although a segment may also be constituted from a plurality of blocks. To determine the attribute 7006 of each segment, the microcode program 160 periodically measures the write throughput of the data belonging to the segments (blocks) in each PDEV 133, calculates the average of the maximum and minimum measured values, and uses this average as the threshold value for the write access frequency.
If the measured write throughput of the data in a segment (block) is equal to or larger than the threshold value, the microcode program 160 recognizes that segment (block) as a high-access segment (block) and gives it the high access (H) attribute; if the measured value is smaller than the threshold value, the microcode program 160 recognizes it as a low-access segment (block) and gives it the low access (L) attribute. The microcode program 160 records the result, high access (H) or low access (L), in the “attribute” field 7006 of the mapping table 164.
The above-described method of determining the attribute 7006 is only one example; other methods may be used as long as frequently accessed data can be defined as “high-access” data and infrequently accessed data as “low-access” data. For example, the write throughput is used as the frequency information in this embodiment, but the number of erases per second for each block may be used instead, with an average erase frequency calculated from it to determine whether the attribute is high-access or low-access. The initial state of the “Lock” field when an LDEV 135 is created may be set to “-”, which means the LDEV 135 is not locked at the time of its allocation, and the initial state of the “Moved” field may be set to “-”, which means the relevant segment has not been moved.
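A minimal sketch of the threshold rule described above follows; the throughput figures are invented for illustration.

```python
# Attribute decision from the text: the threshold is the average of the
# maximum and minimum write throughput measured per segment (block).
def assign_attributes(write_throughput):      # {segment: measured MB/s}
    threshold = (max(write_throughput.values()) +
                 min(write_throughput.values())) / 2
    return {seg: "H" if tput >= threshold else "L"
            for seg, tput in write_throughput.items()}

print(assign_attributes({0: 50.0, 1: 3.0, 2: 30.0}))
# threshold = 26.5 -> {0: 'H', 1: 'L', 2: 'H'}
```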
The configurations and the management information according to this embodiment have been described above.
Control and operations will be explained below, using the configurations and the management information described above.
The maintenance person first installs FMPKs 130 into slots provided in the storage apparatus 100 and then decides the correspondence relationship between the FMPKs 130 and PDEVs 133. The slot number is used as the PDEV number, and this correspondence relationship between the FMPKs 130 and the PDEVs 133 is stored in the PDEV-FMPK table 166 (step 9001).
Next, the maintenance person decides the RG numbers, selects the PDEVs 133 to be included in the RGs, and creates the RGs using the management console 500; this relationship is stored in the RAID group table 161 (step 9002). The maintenance person then formats the PDEVs 133; after formatting of the PDEVs 133 is completed, the microcode program 160 creates the PDEV format table 162 (step 9003).
Subsequently, the maintenance person creates CDEVs belonging to a leveling object device for performing wear leveling in the PDEV 133 group (step 9004). This correspondence relationship is stored via the service processor (SVP) 140 in the column device table 163.
The maintenance person then creates the LDEVs 135 (step 9005). Finally, the maintenance person creates an LDEV-LU mapping table as the processing for disclosing the LDEVs 135 to the host 300, and this correspondence relationship is recorded via the microcode program 160 in the mapping table 8000 (step 9006).
The initialization process operated by the maintenance person has been described above; however, the operation to create the LDEVs 135 (step 9005) and the operation to create the mapping table 8000 (step 9006) may instead be performed by an administrator who generally manages the storage system (hereinafter referred to as the “administrator”).
Step 10001: the management program 142 of the service processor (SVP) 140 requests the microcode program 160 to create an LDEV 135 with the capacity input by the maintenance person or the administrator.
Step 10002: the microcode program 160 checks, by referring to the PDEV format table 162, whether there are enough free blocks to provide the specified capacity. If there are, the microcode program 160 proceeds to step 10003; if not, it proceeds to step 10007.
Step 10003: the microcode program 160 obtains blocks corresponding to the number of segments for the specified capacity and manages the obtained blocks by setting “Allocated” in the “Status” field 4004 of the table 162.
Step 10004: the microcode program 160 assigns an LDEV number to the obtained blocks, gives segment numbers to the allocated blocks, and adds them to the L_SEG-P_BLK mapping table 164.
Step 10005: the microcode program 160 notifies the service processor (SVP) 140 that the LDEV 135 was successfully created.
Step 10006: the service processor (SVP) 140 notifies the administrator via the GUI that the LDEV 135 was successfully created.
Step 10007: the microcode program 160 notifies the service processor (SVP) 140 that the creation of the LDEV 135 failed.
Step 10008: the service processor (SVP) 140 notifies the administrator via the GUI that the creation of the LDEV 135 failed.
Then, the above-described processing terminates.
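A non-authoritative sketch of this creation flow, with simplified table shapes and a hypothetical segment size, might be:

```python
# Sketch of steps 10001-10008: reserve enough free blocks for the
# requested capacity, or report failure. Table shapes are assumptions.
SEGMENT_SIZE = 256 * 1024                          # one block per segment

def create_ldev(pdev_format_table, capacity_bytes, ldev_no):
    need = -(-capacity_bytes // SEGMENT_SIZE)      # segments required
    free = [b for b in pdev_format_table if b["status"] == "Free"]
    if len(free) < need:
        return None                                # steps 10007-10008: failed
    mapping = []
    for seg_no, blk in enumerate(free[:need]):     # steps 10003-10004
        blk["status"] = "Allocated"
        mapping.append({"ldev": ldev_no, "segment": seg_no,
                        "pdev": blk["pdev"], "blk": blk["blk"]})
    return mapping                                 # steps 10005-10006: success

table = [{"pdev": 0, "blk": i, "status": "Free"} for i in range(8)]
print(create_ldev(table, 1024 * 1024, ldev_no=1))  # 4 segments allocated
```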
Step 11001: the microcode program 160 obtains the access LBA of the target LU from a SCSI write command issued by the host 300 and obtains the corresponding LDEV number 8004 from the mapping table 8000.
Step 11002: the microcode program 160 enters the wait state (Wait) for several microseconds.
Step 11003: the microcode program 160 reads old data and parity data from blocks on the same stripe line 137 based on the L_SEG-P_BLK mapping table 164.
Step 11004: the microcode program 160 updates the old data, which has been read, with new data.
Step 11005: the microcode program 160 creates new parity data from the updated data and the old parity data.
Step 11006: the microcode program 160 allocates a new block (BLK). When the new BLK is allocated to a stripe line selected from the stripe lines on the RAID, the other corresponding BLKs are also moved to the same stripe line. This allocation processing is described later in detail.
Step 11007: the microcode program 160 writes the new data and parity data to the allocated BLK.
Step 11008: the microcode program 160 updates the L_SEG-P_BLK mapping table 164 so that the content of the updated segment matches the new block, and also refers to the WL object block list 169.
Step 11009: the microcode program 160 unlocks the “lock” (7007).
Step 11010: the microcode program 160 performs post-processing on the original block. Details of this post-processing will be explained below.
Then, the above-described processing terminates.
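Steps 11003 to 11005 follow the usual RAID read-modify-write pattern; assuming XOR parity (which the text implies but does not state), the parity update is:

```python
# New parity = old parity XOR old data XOR new data (RAID-5 style RMW).
def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    assert len(old_data) == len(new_data) == len(old_parity)
    return bytes(p ^ o ^ n
                 for p, o, n in zip(old_parity, old_data, new_data))

print(update_parity(b"\x0f\xf0", b"\xff\x00", b"\xaa\xaa").hex())  # 5a5a
```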
Step 17001: the microcode program 160 checks, by referring to the “PDEV number” field 4001 and the “BLK number” field of the relevant block, whether the number of erases (Num of Erases) 4003 is less than the maximum number of erases for the flash memory 132 of that block (for example, 5,000 in the case of MLC). If the number of erases is less than the maximum, the microcode program 160 proceeds to step 17002; if it is equal to or more than the maximum, the microcode program 160 proceeds to step 17005.
Step 17002: the microcode program 160 erases the data in the block of the flash memory 132.
Step 17003: the microcode program 160 increments the number of erases 4003 by one.
Step 17004: the microcode program 160 changes the state of the relevant block to “Free.”
Step 17005: the microcode program 160 manages the relevant block by changing the state of the block to “Broken” which means the block cannot be used.
Then, the above-described processing terminates.
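A compact sketch of this post-processing, using the 5,000-erase MLC limit from the text and an assumed block structure, might be:

```python
# Steps 17001-17005: erase the original block if it still has life left,
# otherwise retire it as "Broken".
MAX_ERASES_MLC = 5000                        # example limit from the text

def post_process_block(block):
    if block["erases"] < MAX_ERASES_MLC:     # step 17001
        block["data"] = None                 # step 17002: erase the data
        block["erases"] += 1                 # step 17003: count the erase
        block["status"] = "Free"             # step 17004: reusable again
    else:
        block["status"] = "Broken"           # step 17005: unusable
    return block

print(post_process_block({"erases": 4999, "data": b"x", "status": "Allocated"}))
```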
The read processing is performed as follows.
Step 12001: the microcode program 160 reads the object data into the cache based on the L_SEG-P_BLK mapping table 164.
Then, the above-described processing terminates.
Details of the new BLK allocation processing of step 11006 are as follows:
Step 13001: the microcode program 160 refers to the “Status” field in the PDEV format table 162 and checks the free-block status of the PDEVs 133.
Step 13002: the microcode program 160 refers to the column device table 163 and checks whether the number of free BLKs in the relevant CDEV 136 has fallen to or below a threshold value.
In the above situation, the microcode program 160 proceeds to step 13005 because an increase in the number of free BLKs in other packages can be expected after a substitute FMPK 130 is added in place of an already used, implemented real FMPK 130 and the PDEVs 133 belonging to the added substitute FMPK 130 are registered. Incidentally, the threshold value used in step 13002 may be decided by the administrator or the maintenance person, or at the time of factory shipment.
Step 13003: the microcode program 160 selects a block from the PDEVs 133 in the FMPK 130. When selecting a block for wear leveling, a block selection algorithm such as the Dual Pool algorithm of Non-patent Document 1, an HC algorithm, or another algorithm can be used.
Step 13004: the microcode program 160 refers to the behavior bit 168 indicating the type of wear leveling in the CDEV 136 and decides the wear leveling algorithm for this storage system. If the behavior bit 168 indicates the wear leveling of the low-access type (“L”), the microcode program 160 proceeds to step 13006; or if the behavior bit 168 indicates the wear leveling of the high-access type (“H”), the microcode program 160 proceeds to step 13007.
Step 13005: the microcode program 160 determines that there is no free BLK in the column device CDEV and requests the administrator or the maintenance person, via the service processor (SVP) 140, to add a new PDEV 133 to the CDEV 136, using, for example, a GUI screen, SNMP (Simple Network Management Protocol), or mail.
Step 13006: the microcode program 160 performs the low-access-type wear leveling in the CDEV 136 using asynchronous I/O, i.e., in the background. Details of this processing will be explained later.
Step 13007: the microcode program 160 performs the high-access-type wear leveling in the CDEV 136 using asynchronous I/O, i.e., in the background. Details of this processing will be explained later.
Step 13008: the microcode program 160 allocates a new BLK from the free segments of the added PDEV 133 in the PDEV format table 162.
Then, the above-described processing terminates.
Incidentally, the above flow illustrates the allocation processing; however, the free blocks in the CDEV 136 may also be checked (step 13002) periodically in the background, independently of this processing, in order to prompt the addition of a new FMPK 130.
In this example, it is assumed that the storage controller 110, including the microcode program 160, serves as the leveling processing unit and executes all the processing. However, if the flash memory adapter (FMA) 131 of the FMPKs 130 is configured so that it can manage the free blocks in the PDEV format table 162, part of this processing may be offloaded to the FMA 131.
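A simplified, non-authoritative sketch of the branching in steps 13002 to 13008 follows; the block selection of step 13003 is omitted, and the direction of the threshold comparison is an assumption.

```python
# Branch structure of the allocation flow (steps 13002-13008), simplified.
def allocate_blk(cdev_free_blks, threshold, behavior_bit,
                 do_low_wl, do_high_wl, request_new_pdev, take_free_blk):
    if cdev_free_blks <= threshold:          # step 13002: BLKs running out
        request_new_pdev()                   # step 13005: ask for a new PDEV
    elif behavior_bit == "L":                # step 13004: WL type decision
        do_low_wl()                          # step 13006: background WL
    else:
        do_high_wl()                         # step 13007: background WL
    return take_free_blk()                   # step 13008: allocate new BLK

blk = allocate_blk(10, 4, "L",
                   do_low_wl=lambda: print("low-access WL in background"),
                   do_high_wl=lambda: print("high-access WL in background"),
                   request_new_pdev=lambda: print("request PDEV addition"),
                   take_free_blk=lambda: {"pdev": 3, "blk": 42})
print(blk)
```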
Step 14001: the microcode program 160 refers to the column device table 163 and selects a group of blocks to be moved.
Step 14002: the microcode program 160 checks if any block remains unmoved in the block group selected in step 14001. If there is any unmoved block, the microcode program 160 proceeds to step 14003; or if all the blocks have been moved, the microcode program 160 terminates the processing.
Step 14003: the microcode program 160 checks whether the block to be moved has already been moved, by checking whether the “Moved” field 7008 in the L_SEG-P_BLK mapping table 164 is “-”. If the block has not been moved, the microcode program 160 proceeds to step 14004; if it has already been moved, the microcode program 160 returns to step 14002.
Step 14004: the microcode program 160 allocates a destination block from the newly added PDEV 133 for storing the blocks.
Step 14005: the microcode program 160 migrates data of the block to be moved to the allocated destination block.
Step 14006: the microcode program 160 replaces, in the L_SEG-P_BLK mapping table 164, the block information of the segment (segment number 7004) to which the source block belongs with that of the destination block.
Step 14007: the microcode program 160 resets the value in the “Moved” field 7008 to “-” to indicate that the operation on the object block has been completed, moves the pointer 18003 of the WL object block list 169 to the next object block, and returns to step 14002.
In this embodiment, it is assumed that the microcode program 160 executes all the processing. However, if the flash memory adapter (FMA) 131 of the FMPKs 130 is configured so that it can manage the free blocks in the PDEV format table 162, part of this processing may likewise be offloaded to the FMA 131.
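A minimal sketch of this migration loop follows; the table shapes and the handling of the “Moved” flag and the list pointer are simplified assumptions.

```python
# Migration loop of steps 14001-14007 (simplified).
def migrate_blocks(wl_object_blocks, mapping, alloc_dest, read_blk, write_blk):
    for src in wl_object_blocks:             # step 14002: any block left?
        seg = mapping[src]
        if seg["moved"] != "-":              # step 14003: skip moved blocks
            continue
        dest = alloc_dest()                  # step 14004: block on added PDEV
        write_blk(dest, read_blk(src))       # step 14005: migrate the data
        seg["pdev"], seg["blk"] = dest       # step 14006: remap the segment
        seg["moved"] = "moved"               # step 14007 (simplified): mark
                                             # done and advance the list

mapping = {("p0", 7): {"pdev": "p0", "blk": 7, "moved": "-"}}
migrate_blocks([("p0", 7)], mapping,
               alloc_dest=lambda: ("p1", 0),
               read_blk=lambda b: b"old data",
               write_blk=lambda dest, data: None)
print(mapping)
```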
The advantage of the low-access-type processing is that the number of free blocks in the migration-source PDEV increases, and further wear leveling can then be performed using the high-access data existing in the remaining segments.
The advantage of the high-access-type processing is that the high-access data can be expected to be migrated along with write I/O from the host, so the number of I/O operations at the time of migration can be reduced.
In the case of the low access type, the low-access data in a block 16004 having the low access attribute in a physical device (PDEV) 16001 is migrated to an additional package (substitute package) 16002 while the high-access data remains in the PDEV 16001; the number of free blocks therefore increases and the effect of wear leveling is enhanced.
In the case of the high access type, the high-access data in a block 16005 having the high access attribute in the PDEV 16001 is migrated to the additional package (substitute package) 16002 while the low-access data remains in the PDEV 16001. As a result, the effect of wear leveling in the additional package 16002 is enhanced and the package can be replaced quickly.
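The two policies thus differ only in which attribute is selected for migration to the substitute package; purely as an illustration:

```python
# Select migration candidates by attribute: behavior bit "L" moves
# low-access data to the substitute package, "H" moves high-access data.
def blocks_to_migrate(segments, behavior_bit):
    return [s for s in segments if s["attribute"] == behavior_bit]

segs = [{"blk": 16004, "attribute": "L"}, {"blk": 16005, "attribute": "H"}]
print(blocks_to_migrate(segs, "L"))   # low access type: block 16004 moves
```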
According to this embodiment as described above, the storage controller 110 manages data in each block of a plurality of FMPKs 130 based on the attribute of the relevant block according to the microcode program 160 and performs the leveling processing on data in blocks belonging to the leveling object device(s).
The storage controller 110 can perform the leveling processing on data in blocks belonging to the leveling object device(s) by, for example, allocating a PDEV 133 with a small number of erases to an LDEV 135 with high write access frequency and allocating a PDEV 133 with a large number of erases to an LDEV 135 with low write access frequency.
The microcode program 160 measures the write access frequency of the data in each block of the real FMPKs 130 that are already in use, gives the high access attribute to blocks containing data whose measured write access frequency is larger than a threshold value, and gives the low access attribute to blocks containing data whose measured write access frequency is smaller than the threshold value. If the real FMPKs 130 lack free blocks, the microcode program 160 controls the migration of the data in each block based on these attributes, so that the leveling can be performed efficiently among a plurality of FMPKs 130 including a newly added FMPK 130.
Specifically speaking, suppose the real FMPKs 130 lack free blocks and the microcode program 160 selects, as the leveling object device, a CDEV 136 spanning a real FMPK 130 and an added substitute FMPK 130. If the attribute of a block in the real FMPK 130 belonging to the leveling object device is the high access attribute, the microcode program 160 migrates the data whose access frequency exceeds the threshold value from that block to a block in the substitute FMPK 130; if the attribute is the low access attribute, the microcode program 160 migrates the data whose access frequency is below the threshold value from that block to a block in the substitute FMPK 130. As a result, the leveling can be performed efficiently among a plurality of FMPKs 130 including the newly added FMPK 130.
According to this embodiment, it is possible to efficiently perform leveling among a plurality of FMPKs 130 including a newly added FMPK 130.
The system according to the present invention, which is constituted from a plurality of flash memory packages 130 to which a flash memory package 130 is added or in which one is replaced, can be utilized in a storage system in order to equalize the imbalance in the number of erases not only within each package but also across packages.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/JP2009/056421 | 3/24/2009 | WO | 00 | 8/17/2009