This invention relates to a storage control apparatus, and more particularly to control of writing to a RAID (Redundant Array of Independent (or Inexpensive) Disks) group of the RAID 1 series.
A storage apparatus (e.g. a disk array apparatus) generally comprises multiple HDDs (hard disk drives) and a controller which controls access to those HDDs (hereinafter, such a disk controller is abbreviated to "DKC"). This type of storage apparatus usually comprises one or more RAID groups, each configured of multiple HDDs. The DKC accesses each RAID group in accordance with the RAID level of that RAID group.
RAID 1 is one example of a RAID level (Patent Literature 1). A RAID group to which RAID 1 is applied is configured of a master HDD and a mirror HDD. The DKC receives write target data from a host apparatus and duplicates it. The DKC then writes one copy of the duplicated data to one HDD (hereinafter referred to as the master HDD), and writes the other copy to the other HDD (hereinafter referred to as the mirror HDD).
Generally, the storage area provided by an HDD is configured of multiple sub storage areas (hereinafter referred to as "blocks"), and data is written to the HDD in units of the block size. Depending on the type of HDD, the size of the writing unit of the HDD may differ from the unit size of the write target data.
For example, as an HDD, an SATA (Serial ATA (Advanced Technology Attachment))-HDD exists. As shown in
However, as shown in
Therefore, the data set D1 does not fit into one block in the SATA-HDD and, as shown in
In this case, for writing data in writing units of the SATA-HDD, as shown in
(*) The DKC reads the data E1 stored at the head of the block E10, in the part to which the data set D1 is not written, and adds the data E1 to the head of the data set D1.
(*) The DKC reads the data E2 stored at the end of the block E11, in the part to which the data set D1 is not written, and adds the data E2 to the end of the data set D1.
That is, creating a data unit F1 requires a maximum of two reads from the SATA-HDD. This raises the problem of increasing the load on both the DKC and the SATA-HDD.
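The following is a minimal sketch, not taken from the patent, of the alignment arithmetic behind this read-modify-write; the function name, the example offset, and the returned fields are illustrative only.

```python
BLOCK = 512          # writing unit of the SATA-HDD (bytes)
DATA_SET = 520       # 512 B data element + 8 B guarantee code

def extra_reads(offset: int, length: int = DATA_SET, block: int = BLOCK) -> dict:
    """Count the extra block reads a read-modify-write of one data set needs.

    offset: byte offset of the data set in the HDD address space. A head read
    is needed when the data set starts mid-block; a tail read is needed when
    it ends mid-block.
    """
    head_read = 1 if offset % block != 0 else 0
    tail_read = 1 if (offset + length) % block != 0 else 0
    per_hdd = head_read + tail_read
    return {
        "reads_per_hdd": per_hdd,             # up to two, as described above
        "reads_for_raid1_pair": 2 * per_hdd,  # the same reads repeat on the mirror HDD
    }

# A 520 B data set written at byte offset 520 starts and ends mid-block,
# so up to two reads per HDD (four for a RAID 1 pair) are required.
print(extra_reads(520))   # {'reads_per_hdd': 2, 'reads_for_raid1_pair': 4}
```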
This problem is especially serious where the RAID group is configured of SATA-HDDs and, at the same time, the RAID level is RAID 1, because a maximum of two reads is performed for each of the two HDDs (the master HDD and the mirror HDD). This is described more specifically below. Note that, in the description below, the data which is read for creating a data unit and added to the data set is, for convenience, referred to as the "additional data."
As shown in
As shown by an arrow a in
Next, the DKC 10, as shown by arrows b1 and b2 in
Next, the DKC 10 performs two reads for each of the master HDD 21 and the mirror HDD 22, that is, a total of four reads. More specifically, as shown by arrows c1 and c2 in
Furthermore, the DKC 10, as shown by arrows c3 and c4 in
The DKC 10, as shown by an arrow d1 in
According to the above-mentioned flow, a maximum of four reads is performed. The larger the number of data reads, the heavier the load on both the SATA-HDDs and the DKC 10 becomes, which deteriorates the processing performance of the storage apparatus 1.
This could occur not only with RAID 1 but also with other RAID levels of the RAID 1 series. For example, in the case of RAID 10 (also referred to as "RAID 1+0"), since multiple pairs of master HDDs and mirror HDDs exist in one RAID group, the effect of the above-mentioned problem is considered to be even larger than in the case of RAID 1.
Furthermore, the above-mentioned problem could occur not only with SATA-HDDs but also with other types of physical storage devices whose writing unit differs from the unit size of the write target data.
Therefore, the purpose of this invention is to improve the processing performance of a storage apparatus in which the size of the writing unit of the physical storage devices configuring a RAID group of the RAID 1 series differs from the unit size of the write target data.
The storage apparatus comprises a RAID (Redundant Array of Independent (or Inexpensive) Disks) group configured of multiple physical storage devices, and a controller which controls data write to and data read from the RAID group in accordance with the RAID level of the RAID group. The RAID group is a RAID group of the RAID 1 series which comprises one or more pairs of a first storage device and a second storage device. The unit size of the write target data differs from the size of the writing unit of the storage devices. The storage area which each storage device provides is configured of multiple storage blocks, each of the same size as the writing unit of the storage device. The controller reads data from the entire area of a first storage block group including the write destination of the write target data in the first storage device. From the write target data and the staging data, which is the data thus read, the controller generates one or more data units, each of which is configured of the write target data (or a copy thereof) and a part of the staging data (or a copy thereof), and each of which is of the same size as the first storage block group. The controller writes one of the one or more data units to the first storage block group in the first storage device and, at the same time, writes one of the one or more data units to a second storage block group in the second storage device which corresponds to, and is of the same size as, the first storage block group.
The write target data, for example, may be the data created by adding another type of data (e.g. a guarantee code) to the data from the host apparatus, or may also be the data from the host apparatus itself. The host apparatus is an external apparatus of the storage apparatus.
An embodiment of this invention is described below. Note that, in this embodiment, for simplicity of description, the assumptions below apply.
(*) Physical storage devices are HDDs. The controller is a disk controller (hereinafter referred to as a DKC) which controls the access to the HDDs.
(*) The RAID level of RAID groups is RAID 1. A RAID group is a pair of a master HDD and a mirror HDD.
(*) Each of the master HDD and the mirror HDD is an SATA-HDD.
(*) Each of the sub storage areas configuring the storage area provided by the SATA-HDDs is referred to as a “block.”
(*) The size of a block is assumed to be 512 B (bytes).
(*) The size of each data element configuring the data which the DKC receives from the host apparatus is assumed to be 512 B (bytes). The size of a data element is the minimum size of the data provided by the host apparatus to the storage apparatus.
(*) The size of a guarantee code (e.g. an ECC (Error Correcting Code)) is assumed to be 8 B (bytes).
In the write processing to the RAID group of RAID 1, this embodiment can make data read from either the master HDD or the mirror HDD unnecessary and, at the same time, can limit the number of data reads from the RAID group to at most one. Hereinafter, with reference to
Firstly, the DKC 100, as shown by an arrow a01 in
Next, the DKC 100 adds a guarantee code (e.g. an ECC (Error Correcting Code)) to the stored data element D21, and generates a data set D20. The size of the data set D20 is 520 B (512 B+8 B). The data set D20 is the write target data.
Next, the DKC 100, as shown by an arrow b10 in
Next, the DKC 100, as shown by an arrow c10 in
A data unit F10 generated by overwriting the staging data [element] E20 with the data set D20 is the data for the master HDD 210. The DKC 100, as shown by arrows d10 and d11 in
After that, the DKC 100, as shown by an arrow e10 in
As described above, in the write processing for the RAID group of RAID 1 in accordance with this embodiment, data read from either the master HDD 210 or the mirror HDD 220 becomes unnecessary and, at the same time, the number of data reads from the RAID group 200 is at most one.
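A minimal sketch of this flow follows. It is not the DKC's actual implementation; buffer handling is reduced to Python byte strings, and the `master` and `mirror` parameters stand for hypothetical device objects exposing `read(offset, size)` and `write(offset, data)` calls, while `lba_offset` and `BLOCK` are illustrative names.

```python
BLOCK = 512   # writing unit of the SATA-HDD (bytes)

def raid1_write(master, mirror, data_set: bytes, lba_offset: int):
    """Write one data set (e.g. 520 B) to a RAID 1 pair with at most one staging read."""
    # Read-source block group: the smallest block-aligned range covering the write.
    start = (lba_offset // BLOCK) * BLOCK
    end = -(-(lba_offset + len(data_set)) // BLOCK) * BLOCK   # round up to a block boundary
    staging = master.read(start, end - start)                 # E20: one read, from the master only

    off = lba_offset - start
    # Master data unit F10: the staging data overwritten with the data set.
    f10 = staging[:off] + data_set + staging[off + len(data_set):]
    # Mirror data unit F20: assembled from the same data set and staging parts
    # rather than copied from F10, so a faulty copy of F10 cannot propagate.
    f20 = staging[:off] + data_set + staging[off + len(data_set):]

    master.write(start, f10)   # write destination = read-source block group
    mirror.write(start, f20)   # block group at the same address on the mirror HDD
```

With the sizes of this embodiment, a 520 B data set that straddles a block boundary yields a 1024 B staging read, consistent with the preferred size discussed below.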
In the above-mentioned write processing, the data unit F20 may be a copy of the data unit F10, but should preferably be generated by the above-mentioned method. This is because, if a failure occurs while the data unit F10 is being copied, the generated data unit F20 might become data that differs from the data unit F10.
Furthermore, in the above write processing, the size of the staging data E20 (that is, of the block group as the read source) only has to be equal to or larger than the size of the data set D20 and, at the same time, a multiple of the size of the writing unit of the SATA-HDD, and should preferably be the smallest size in that range. This is for inhibiting the consumption of the cache memory 120. In the above-mentioned example, as the data set D20 is 520 B, the size of the staging data E20 should preferably be 1024 B (512 B×2). The size of the data units F10 and F20 is the same as the size of the staging data E20. The write destination of the data unit F10 is the read source block group of the staging data E20. The write destination of the data unit F20 is the block group corresponding to the read source block group (e.g. a block group at the same address as the read source block group).
Next, with reference to
The storage apparatus 1B comprises duplicated DKCs 100 and an HDD group 900. Host apparatuses 300, which are a type of external apparatus of the storage apparatus 1B and issue I/O (Input/Output) requests, are connected to the storage apparatus 1B.
The DKC 100 comprises an FE-IF (frontend interface apparatus) 110, a cache memory 120, a CPU 130, a local memory 140, and a BE-IF (backend interface apparatus) 150.
The FE-IF 110 and the BE-IF 150 are communication interface apparatuses by which the DKC 100 exchanges data and other information with external apparatuses. The FE-IF 110 communicates with the host apparatuses 300, and the BE-IF 150 communicates with the HDDs. The cache memory 120 is a memory for temporarily storing data read from or written to the HDDs. The cache memories 120 share part of the data. The local memory 140 stores information required for control (hereinafter referred to as management information) and computer programs. The CPU 130 executes the programs stored in the local memory 140 and, in accordance with the management information, controls the storage apparatus 1B.
One or more RAID groups are configured from the HDD group 900. At least one of them is the RAID group 200, that is, a RAID group of RAID 1. Note that this invention is applicable not only to RAID 1 but also to other RAID levels at which data is duplicated and written (that is, other RAID levels of the RAID 1 series).
The cache memory 120 is configured of multiple segments 121.
The local memory 140 stores multiple segment tables (hereinafter referred to as SG tables) 141. Segments 121 and SG tables 141 correspond one to one. Each SG table 141 comprises information related to its corresponding segment 121. Note that, instead of one SG table 141 existing for each segment 121, a single table may comprise multiple records, each record corresponding to one segment 121. Furthermore, the information related to the multiple segments 121 may also be expressed by methods other than tables.
The SG tables 141, for example, comprise the information below. Note that the description with reference to
(*) The data 140a, which is information showing the attribute of the data stored in the target segment 121,
(*) the data 140b, which is information showing where in the target segment 121 the data is located (more specifically, from where to where in the target segment 121 the input/output target data for the HDDs exists),
(*) the queue 140c, which is information showing the queue which manages the target segment 121,
(*) the attribute 140d, which is information showing the attribute of the target segment 121, and
(*) the flag 140e, which is information showing whether or not the data in the target segment 121 also exists in a cache memory 120 other than the cache memory 120 comprising the target segment 121.
The values of the data 140a (that is, the attributes of the data in the target segment 121) are "intermediate dirty data," "staging data," and "physical dirty data." The "intermediate dirty data" is a type of dirty data (data which exists in the cache memory 120 but not in the HDDs) which must not yet be written to the HDDs. More specifically, the intermediate dirty data is, for example, an above-mentioned data set (a set of a 512 B data element and an 8 B guarantee code). The "staging data" is, as described above, data read from the HDDs to the cache memory 120. The "physical dirty data" is a type of dirty data which may be written to the HDDs.
The "queue" referred to in this embodiment is, for example, information in which multiple entries are aligned in a specified order (not shown in the figure). Each entry corresponds to a segment and comprises information related to the corresponding segment. The order of the entries is, for example, the order of the points in time at which data was newly stored. In this embodiment, the queues are, for example, an intermediate dirty queue, a clean queue, a free queue, and a physical dirty queue. The intermediate dirty queue is configured of entries corresponding to the segments in which intermediate dirty data is stored. The clean queue is configured of entries corresponding to the segments in which clean data is stored. Staging data is a type of clean data. The free queue is configured of entries corresponding to free segments. The physical dirty queue is configured of entries corresponding to the segments in which physical dirty data is stored.
As for the flag 140e, ON (e.g. “1”) indicates that the data in the target segment 121 also exists in the other cache memory 120 while OFF (e.g. “0”) indicates that the data in the target segment 121 does not exist in the other cache memory 120.
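As a reading aid only, the following sketch models one SG table as a small record. The field comments mirror 140a to 140e above, while the class names, enum names, and the placeholder attribute value are illustrative, not the patent's.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class DataAttr(Enum):
    INTERMEDIATE_DIRTY = "intermediate dirty data"   # must not be written to the HDDs yet
    STAGING = "staging data"                          # read from the HDDs (a type of clean data)
    PHYSICAL_DIRTY = "physical dirty data"            # may be written to the HDDs

class ManagingQueue(Enum):
    INTERMEDIATE_DIRTY = "intermediate dirty queue"
    CLEAN = "clean queue"
    FREE = "free queue"
    PHYSICAL_DIRTY = "physical dirty queue"

@dataclass
class SGTable:
    data_attr: Optional[DataAttr]           # 140a: attribute of the data in the target segment
    data_range: Optional[Tuple[int, int]]   # 140b: (start, end) of the I/O target data in the segment
    queue: ManagingQueue                    # 140c: queue which manages the target segment
    segment_attr: str                       # 140d: attribute of the target segment (values not enumerated in the text)
    duplicated: bool                        # 140e: True if the data also exists in the other cache memory

# Example: a segment holding one 520 B data set (intermediate dirty data).
sg1 = SGTable(
    data_attr=DataAttr.INTERMEDIATE_DIRTY,
    data_range=(0, 520),
    queue=ManagingQueue.INTERMEDIATE_DIRTY,
    segment_attr="dirty",      # placeholder value; illustrative only
    duplicated=True,           # the write data is also held in the other DKC's cache memory
)
```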
In the write processing, a segment group (one or more segments) is secured from the cache memory 120, and the data is written to the secured segment group, wherein the size of the secured segment group is a common multiple of the unit size of the write target data and the size of the writing unit of the SATA-HDD, preferably the least common multiple. More specifically, as the unit size of the write target data is 520 B and the size of the writing unit of the SATA-HDD is 512 B, the size of the secured segment group is 32 kB (kilobytes), the least common multiple of 520 B and 512 B. Therefore, for example, if the segment size is 4 kB, 8 segments are secured, while if the segment size is 8 kB, 4 segments are secured (hereinafter, the segment size is assumed to be 8 kB).
As described above, by securing a segment group of 32 kB (the size of the least common multiple) and processing the data in units of segment groups, a case where it becomes unnecessary to read staging data from the HDDs (that is, a case where the number of reads can be made zero) can be expected. That case is described in detail with reference to
As shown in
If multiple sequential cache blocks and multiple sequential SATA blocks (SATA-HDD blocks) are aligned with their heads lined up, the boundaries of the cache blocks and the boundaries of the SATA blocks match every 32 kB.
Furthermore, 32 kB is a multiple of 512 B which is the size of the writing unit of the SATA-HDD. Therefore, if the size of the write target data is 32 kB, the write target data can be written in accordance with the writing unit of the SATA-HDD.
Therefore, if data sets (520 B intermediate dirty data) are stored in all the cache blocks of the 32 kB segment group (that is, if the intermediate dirty data exists serially from the head to the end of the 32 kB segment group), then, as the write target data is the 32 kB of data (the data configured of 64 sequential data sets), the write processing can be completed without reading staging data from the HDDs.
As described above, if a segment group is secured in units of 32 kB, cases where reads from the HDDs become unnecessary can be expected. Note that the size of the secured segment group may also be a common multiple of 512 B and 520 B other than 32 kB, but should preferably be the least common multiple from the perspective of inhibiting the consumption of the cache memory 120.
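The following sketch, with illustrative names only, shows this zero-read condition: if the intermediate dirty data covers the whole secured segment group contiguously, no staging read is needed; otherwise the staging read (Step 409, described later) is performed once.

```python
GROUP = 32 * 1024   # size of the secured segment group (32 kB, as in the text)

def staging_read_needed(dirty_ranges, group_size=GROUP):
    """Return False when intermediate dirty data fills the segment group end to end.

    dirty_ranges: non-overlapping (start, end) byte ranges of intermediate
    dirty data within the secured segment group, e.g. [(0, group_size)].
    """
    if not dirty_ranges:
        return True
    start = min(s for s, _ in dirty_ranges)
    end = max(e for _, e in dirty_ranges)
    covered = sum(e - s for s, e in dirty_ranges)
    contiguous = covered == end - start          # no holes between the ranges
    return not (contiguous and start == 0 and end == group_size)

# Fully and serially occupied group: no staging read (zero reads from the HDDs).
assert staging_read_needed([(0, GROUP)]) is False
# Partially occupied group: one staging read will be performed.
assert staging_read_needed([(0, 520), (1040, 2080)]) is True
```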
Hereinafter, with reference to
According to the example shown in
According to the examples shown in
According to the example shown in
According to the examples shown in
The four segments (8 kB each) configuring the above-mentioned segment group (32 kB) do not necessarily have to be sequential in the cache memory 120. For example, four segments separated in the cache memory 120 may also be secured as a segment group.
Next, with reference to
For each of the multiple intermediate dirty queues, the CPU 130 checks the amount of intermediate dirty data and determines which intermediate dirty queue should be the target of processing (Step 401). For example, the intermediate dirty queue whose amount of intermediate dirty data is largest (for example, the intermediate dirty queue comprising the largest number of entries) is determined as the target of processing. The intermediate dirty queues exist, for example, in specified units (for example, one per HDD, or one per area (such as a partition) into which the cache memory 120 is logically partitioned).
Next, among the intermediate dirty data managed by the intermediate dirty queue determined at Step 401, the CPU 130 identifies the intermediate dirty data stored least recently in the cache memory 120 (Step 402). The SG table #1 corresponding to the segment #1 storing the intermediate dirty data identified at this point is as shown in row #1 in
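A minimal sketch of this selection follows, assuming each intermediate dirty queue is an ordered list of entries with the oldest first; the function name and the dictionary layout are illustrative, not the patent's.

```python
def select_write_target(intermediate_dirty_queues):
    """Step 401: pick the queue with the most intermediate dirty data;
    Step 402: pick the entry stored least recently in that queue.

    intermediate_dirty_queues: dict mapping a queue id (e.g. one per HDD,
    or one per cache partition) to a list of entries ordered oldest first.
    """
    queue_id, entries = max(intermediate_dirty_queues.items(),
                            key=lambda item: len(item[1]))   # largest number of entries
    oldest_entry = entries[0]                                # stored least recently
    return queue_id, oldest_entry
```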
Next, the CPU 130 secures the 32 kB segment group including the segment #1 (Step 403). The segment group secured at this point is hereinafter referred to as the "segment group W." Exclusive control is performed on the secured segment group W so that no other data is written to it. The SG table #1 at this step is as shown in row #3 in
Next, the CPU 130 registers information showing the physical dirty reserved number in, for example, the local memory 140 (Step 404). More specifically, the CPU 130 takes twice the number of intermediate dirty segments (the segments in the segment group W in which intermediate dirty data is stored) as the physical dirty reserved number. The number of intermediate dirty segments is doubled because the data units including copies of the intermediate dirty data are eventually managed as physical dirty data, and because two copies of the intermediate dirty data are generated (one for the master HDD 210 and one for the mirror HDD 220). The CPU 130 may sum the physical dirty reserved number and the number of segments in which physical dirty data already exists and, if the total exceeds a specified value, may suspend the operation of generating physical dirty data. Note that the SG table #1 at this Step 404 is as shown in row #4 in
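A minimal sketch of this bookkeeping follows, under the assumption of a simple counter and threshold; the function names and the `limit` parameter are illustrative.

```python
def physical_dirty_reserved(intermediate_dirty_segments: int) -> int:
    # Doubled because each intermediate dirty segment eventually yields two
    # physical dirty copies: one for the master HDD and one for the mirror HDD.
    return 2 * intermediate_dirty_segments

def may_generate_physical_dirty(reserved: int, existing_physical_dirty: int,
                                limit: int) -> bool:
    # Suspend generation when the reserved number plus the segments already
    # holding physical dirty data would exceed the specified value.
    return reserved + existing_physical_dirty <= limit

# Example: 4 intermediate dirty segments in segment group W reserve 8 segments.
assert physical_dirty_reserved(4) == 8
```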
Next, the CPU 130 determines the RAID level of the RAID group which is the write destination of the write target intermediate dirty data (Step 405). The SG table #1 at this Step 405 is as shown in row #5 in
If the RAID level is not of RAID 1 series (Step 405: NO), the CPU 130 processes the intermediate dirty data in accordance with the RAID level (Step 406).
If the RAID level is of RAID 1 series (Step 405: YES), the CPU 130 determines whether staging data is required or not (Step 407). The SG table #1 at this Step 407 is as shown in row #6 in
The result of the determination at Step 407 is, if the intermediate dirty data occupies the entire area of the segment group W (refer to
If staging data is required (Step 407: YES), the CPU 130 secures, from the cache memory 120, a segment group (hereinafter referred to as the segment group R1) for storing the staging data (Step 408). The SG table #1 at this Step 408 is as shown in row #7 in
Next, from the status of the intermediate dirty data, the CPU 130 determines the SATA block group in the master HDD 210 (or the mirror HDD 220) to be the read source of the staging data, and reads the staging data from the entire area of the determined SATA block group to the segment group R1 (Step 409). The status of the intermediate dirty data (for example, the status shown in
The CPU 130 determines whether the intermediate dirty data in the segment group W is sequential or not (Step 410). The SG tables #1 and #2 at this Step 410 are as shown in row #9 in
If the intermediate dirty data is non-sequential (Step 410: NO), the CPU 130 performs inter-cache copy (Step 411 in
The CPU 130 determines whether the staging data part has been copied normally to the segment group W or not (Step 412). The SG tables #1 and #2 at this Step 412 are as shown in row #11 in
If the inter-cache copy is completed normally (Step 412: YES), the CPU 130 performs Step 413 and the following steps. Meanwhile, if the inter-cache copy is not completed normally (Step 412: NO), the CPU 130 performs Step 417. At Step 417, the CPU 130 stores the data whose copy was not successful in a specified save area (e.g. a non-volatile storage area). The CPU 130 may later perform the copy (generating a data unit) of the data in this save area again (that is, the data may be copied from the save area to the segment group W, R1 or R2). The save area may be comprised in the DKC, or a part of the HDDs may be used as the save area.
If the inter-cache copy is completed normally (Step 412: YES), the CPU 130 secures, from the cache memory 120, a segment group (hereinafter referred to as the segment group R2) for the mirror HDD (Step 413). The SG tables #1 and #2 at this Step 413 are as shown in row #12 in
The CPU 130 generates a data unit to be written to the master HDD 210 and a data unit to be written to the mirror HDD 220 (Step 414). The SG tables #1 to #3 at this Step 414 are as shown in row #13 in
At Step 414, for example, the processing below is performed (a sketch in code follows this list).
(*) The CPU 130 overwrites the staging data in the segment group R1 with the sequential data in the segment group W and, at the same time, copies that sequential data to the segment group R2. By this method, a data unit for the master HDD 210 (hereinafter referred to as a master data unit) is generated in the segment group R1 and, at the same time, the sequential data comes to exist in the segment group R2. Note that the sequential data in the segment group W might be configured of sequential intermediate dirty data, or might be configured of non-sequential intermediate dirty data and a staging data part.
(*) The CPU 130 copies the staging data part in the segment group R1 (the part of the staging data which is not overwritten with the sequential data) to the segment group R2. By this method, a data unit for the mirror HDD 220 (hereinafter referred to as a mirror data unit) is generated in the segment group R2.
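The following is a minimal sketch of Step 414 on byte buffers; the bytearrays stand in for the three segment groups, and the parameter names and offsets are illustrative only.

```python
def generate_data_units(w: bytes, r1: bytearray, r2: bytearray,
                        seq_start: int, seq_end: int):
    """Build the master data unit in R1 and the mirror data unit in R2.

    w        : segment group W holding the sequential data (dirty data,
               possibly merged with a staging data part at Step 411).
    r1       : segment group R1 holding the staging data read at Step 409.
    r2       : segment group R2 secured for the mirror HDD at Step 413.
    seq_start, seq_end: byte range of the sequential data within the group.
    """
    sequential = w[seq_start:seq_end]

    # Overwrite the staging data in R1 with the sequential data -> master data unit.
    r1[seq_start:seq_end] = sequential
    # Copy the same sequential data into R2.
    r2[seq_start:seq_end] = sequential
    # Copy the staging data parts of R1 that were not overwritten into R2
    # -> mirror data unit.
    r2[:seq_start] = r1[:seq_start]
    r2[seq_end:] = r1[seq_end:]
    return r1, r2   # both become physical dirty data at Step 416
```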
The CPU 130 determines whether or not the generation of the master data unit and the mirror data unit has been completed normally (Step 415). The SG tables #1 to #3 at this Step 415 are as shown in row #14 in
If the generation of at least one of the master data unit and the mirror data unit is not completed normally (Step 415: NO), the CPU 130 performs the processing of the above-mentioned Step 417.
If the generation of the master data unit and the mirror data unit is completed normally (Step 415: YES), the CPU 130 transitions the attributes of the master data unit and the mirror data unit to physical dirty data (Step 416). More specifically, for example, the CPU 130 causes the one or more segments storing the master data unit and the one or more segments storing the mirror data unit to be managed by physical dirty queues. Note that the SG tables #1 to #3 at this Step 416 are as shown in row #15 in
After that, the CPU 130 releases the segment group W (Step 418). By this method, each segment configuring the segment group W is managed as a free segment. The free segments can be secured. The SG tables #1 to #3 at this Step 418 are as shown in row #16 in
The flow of the intermediate summarized write processing is as described above. After this intermediate summarized write processing, the CPU 130 writes the master data unit (physical dirty data) to the master HDD 210 and, at the same time, writes the mirror data unit (physical dirty data) to the mirror HDD 220.
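A minimal sketch of this final destage follows, assuming hypothetical device objects with a `write(offset, data)` call; the function and parameter names are illustrative.

```python
def destage_raid1(master, mirror, master_unit: bytes, mirror_unit: bytes,
                  block_offset: int):
    """Write both physical dirty data units to the RAID 1 pair.

    block_offset is the byte address of the block group: the read source of
    the staging data on the master HDD, and the block group at the same
    address on the mirror HDD. No further reads are needed at this stage.
    """
    master.write(block_offset, master_unit)   # master data unit -> master HDD 210
    mirror.write(block_offset, mirror_unit)   # mirror data unit -> mirror HDD 220
```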
According to this embodiment, in the write processing for the RAID group of RAID 1, reading staging data from either the master HDD or the mirror HDD can be made unnecessary and, at the same time, the number of reads can be limited to at most one. By this method, the load on both the HDDs and the DKC 100 can be reduced.
Though an embodiment of this invention is described above, this invention is not limited to this embodiment and, as a matter of course, encompasses any changes or modifications within its spirit and scope.
For example, if the entire area of the segment group W is occupied by sequential intermediate dirty data, the intermediate dirty data in the segment group W may be managed as physical dirty data, and that physical dirty data may then be written to the master HDD (or the mirror HDD). Otherwise, the segment group R1 may be secured, the data copied from the segment group W to the segment group R1, the data in the segment group R1 managed as physical dirty data, and that physical dirty data then written to the master HDD (or the mirror HDD).
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/JP2010/001616 | Mar 8, 2010 | WO | 00 | Jul 13, 2010

Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO 2011/111089 | Sep 15, 2011 | WO | A

U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
5,611,069 | Matoba | Mar 1997 | A
6,959,373 | Testardi | Oct 2005 | B2
7,284,086 | Kanai | Oct 2007 | B2
7,346,733 | Kitamura | Mar 2008 | B2
7,711,897 | Chatterjee et al. | May 2010 | B1
7,802,063 | Chatterjee et al. | Sep 2010 | B1
7,873,782 | Terry et al. | Jan 2011 | B2
7,953,939 | Nakamura et al. | May 2011 | B2
7,975,168 | Morita et al. | Jul 2011 | B2
7,991,969 | Chatterjee et al. | Aug 2011 | B1
8,473,672 | Moshayedi et al. | Jun 2013 | B2
8,510,497 | Moshayedi et al. | Aug 2013 | B2
2005/0097273 | Kanai | May 2005 | A1
2005/0198434 | Uchiumi et al. | Sep 2005 | A1
2006/0259709 | Uchiumi et al. | Nov 2006 | A1
2008/0005468 | Faibish et al. | Jan 2008 | A1
2008/0005502 | Kanai | Jan 2008 | A1
2008/0195832 | Takada et al. | Aug 2008 | A1
2009/0138672 | Katsuragi et al. | May 2009 | A1
2009/0150756 | Mori et al. | Jun 2009 | A1
2009/0228652 | Takemoto | Sep 2009 | A1
2010/0070733 | Ng et al. | Mar 2010 | A1
2011/0113194 | Terry et al. | May 2011 | A1

Foreign Patent Documents

Number | Date | Country
---|---|---
2008-197864 | Aug 2008 | JP
2009-129201 | Jun 2009 | JP
2009-217519 | Sep 2009 | JP

U.S. Publication Data

Number | Date | Country
---|---|---
2011/0296102 A1 | Dec 2011 | US