The present invention generally relates to a storage apparatus, and in particular relates to a storage apparatus comprising a plurality of clusters as processing means for providing a data storage service to a host computer, and having improved redundancy of a data processing service to be provided to a user. The present invention additionally relates to a data transfer control method of a storage apparatus.
A storage apparatus used as a computer system for providing a data storage service to a host computer is required to offer reliable data processing and improved responsiveness in such data processing.
Thus, with this kind of storage apparatus, proposals have been made for configuring a controller from a plurality of clusters in order to provide a data storage service to a host computer.
With this kind of storage apparatus, the data processing can be sped up since the processing based on a command received by one cluster can be executed with a processor of that cluster and a processor provided to another cluster.
Meanwhile, since a plurality of clusters exist in the storage apparatus, even if a failure occurs in one cluster, the other cluster can make up for that failure and continue the data processing. Thus, there is an advantage in that the data processing function can be made redundant. A storage apparatus comprising a plurality of clusters is described, for instance, in Japanese Patent Laid-Open Publication No. 2008-134776.
With this kind of storage apparatus, in order to coordinate the data processing between a plurality of clusters, it is necessary for the plurality of clusters to mutually confirm the status of the other cluster. Thus, for example, one cluster writes, at a constant frequency, the status of a micro program into the other cluster.
Moreover, if one cluster needs information concerning the status of the other cluster in real time, it directly accesses the other cluster and reads the status information.
Meanwhile, with the method of one cluster reading data from the other cluster, since the reading requires processing across a plurality of clusters, the cluster that issued the read is not able to perform other processing until a response to the read request is returned from the cluster at the issue destination. Since the read processing is performed in 4-byte units, reading a large amount of status information at once will lead to considerable performance deterioration. Consequently, it will not be possible to achieve the objective of a storage apparatus comprising a plurality of clusters; namely, expeditiously performing data processing by coordinating the plurality of clusters.
In addition, this problem becomes even more prominent when the plurality of clusters are connected with PCI-Express. Specifically, if a read request is issued from a first cluster to a memory of a second cluster, a completion containing the read data is returned from the second cluster to the first cluster. When a read request is issued from the first cluster, data communication using the PCI-Express port connecting the clusters is managed with a timer.
If the second cluster cannot issue a completion within a given period of time in response to the read request from the first cluster, the first cluster determines this to be a completion timeout at the PCI-Express port, and the first cluster or the second cluster deems this PCI-Express port to be in an error status and blocks it.
Here, since a failure has occurred in the second cluster that is unable to issue the completion, the first cluster will need to perform the processing of the I/O from the host computer. However, since the completion timeout has occurred, the management computer will forcibly determine that the first cluster is also in a failure status as with the second cluster, and the overall system of the storage apparatus will crash.
Moreover, when write data from the host computer is written into the first cluster to which the host computer is connected, and is written redundantly into the second cluster by being transferred from the first cluster to the second cluster, the host computer is unable to issue the write end command to the second cluster. Thus, there is a problem in that the data of the second cluster cannot be decided.
In light of the above, an object of the present invention is to provide a storage apparatus and its data transfer control method that is free from delays in cluster interaction processing and system crashes caused by integration of multiple clusters even when it is necessary to transfer data in real time between multiple clusters in a storage apparatus including multiple clusters.
Another object of the present invention is to provide a storage system capable of deciding the data of the second cluster even if the host computer is unable to issue the write end command to the second cluster.
In order to achieve the foregoing object, with the present invention, the first cluster writes a command for transferring data into the second cluster, and the second cluster writes the data requested by the first cluster into the first cluster based on that command. Data can thereby be transferred in real time from the second cluster to the first cluster without the first cluster having to issue a read request to the second cluster.
According to the present invention, it is possible to provide a storage apparatus and its data transfer control method that is free from delays in cluster interaction processing and system crashes caused by integration of multiple clusters even when it is necessary to transfer data in real time between multiple clusters in a storage apparatus including multiple clusters.
Moreover, according to the present invention, as a result of using a command for transferring data from the first cluster to the second cluster in place of the write end command of the host computer, even if the host computer is unable to issue the write end command to the second cluster, it is possible to provide a storage system capable of deciding the data of the second cluster.
Embodiments of the present invention are now explained.
The storage apparatus 10 comprises a first cluster 6A connected to the host computer 2A and a second cluster 6B connected to the host computer 2B. The two clusters are able to independently provide data storage processing to the host computer. In other words, the data storage controller is configured from the cluster 6A and the cluster 6B.
The data storage processing to the host computer 2A is provided by the cluster 6A (cluster A), and also provided by the cluster 6B (cluster B). The same applies to the host computer 2B. Therefore, the two clusters are connected with an inter-cluster connection path 12 for coordinating the data storage processing. The sending and receiving of control information and user data between the first cluster (cluster 6A) and the second cluster (cluster 6B) are conducted via the connection path 12.
As the inter-cluster connection path, a bus and a communication protocol compliant with the PCI (Peripheral Component Interconnect)-Express standard are adopted, which are capable of realizing high-speed data communication in which the data traffic per one-way lane (maximum of eight lanes) is 2.5 Gbit/sec.
The cluster 6A and the cluster 6B respectively comprise the same devices. Thus, the devices provided in these clusters will be explained based on the cluster 6A, and the explanation of the cluster 6B will be omitted. While devices of the cluster 6A and devices of the cluster 6B are identified with the same Arabic numerals, they are differentiated based on the alphabet provided after such Arabic numerals. For example, “**A” shows that it is a device of the cluster 6A and “**B” shows that it is a device of the cluster 6B.
The cluster 6A comprises a microprocessor (MP) 14A for controlling its overall operation, a host controller 16A for controlling the communication with the host computer 2A, an I/O controller 18A for controlling the communication with the storage device 4, a switch circuit (PCI-Express Switch) 20A for controlling the data transfer to the host controller and the storage device and the inter-cluster connection path, a bridge circuit 22A for relaying the MP 14A to the switch circuit 20A, and a local memory 24A.
The host controller 16A comprises an interface for controlling the communication with the host computer 2A, and this interface includes a plurality of communication ports and a host communication protocol chip. The communication port is used for connecting the cluster 6A to a network and the host computer 2A, and, for instance, is allocated with a unique network address such as an IP (Internet Protocol) address or a WWN (World Wide Name).
The host communication protocol chip performs protocol control during the communication with the host computer 2A. Thus, as the host communication protocol chip, for example, if the communication protocol with the host computer 2A is a fibre channel (FC: Fibre Channel) protocol, a fibre channel conversion protocol chip is used and, if such communication protocol is an iSCSI protocol, an iSCSI protocol chip is used. Thus, a host communication protocol chip that matches the communication protocol with the host computer 2A is used.
Moreover, the host communication protocol chip is equipped with a multi microprocessor function capable of communicating with a plurality of microprocessors, and the host computer 2A is thereby able to communicate with the microprocessor 14A of the cluster 6A and the microprocessor 14B of the cluster 6B.
The local memory 24A is configured from a system memory and a cache memory. The system memory and the cache memory may be mounted on the same device as shown in
In addition to storing control programs, the system memory is also used for temporarily storing various commands such as read commands and write commands to be provided by the host computer 2A. The microprocessor 14A sequentially processes the read commands and write commands stored in the local memory 24A in the order that they were stored in the local memory 24A.
Moreover, the system memory 24A records the status of the clusters 6A, 6B and micro programs to be executed by the MP 14A. As the status, there is the processing status of micro programs, version of micro programs, transfer list of the host controller 16A, transfer list of the I/O controller, and so on.
The MP 14A may also write, at a constant frequency, its own status of micro programs into the system memory 24B of the cluster 6B.
The cache memory is used for temporarily storing data that is sent and received between the host computer 2A and the storage device 4, and between the cluster 6A and the cluster 6B.
The switch circuit 20A is preferably configured from a PCI-Express Switch, and comprises a function of controlling the switching of the data transfer with the switch circuit 20B of the cluster 6B and the data transfer with the respective devices in the cluster 6A.
Moreover, the switch circuit 20A comprises a function of writing the write data provided by the host computer 2A in the cache memory 24A of the cluster 6A according to a command from the microprocessor 14A of the cluster 6A, and writing such write data into the cache memory 24B of the cluster 6B via the connection path 12 and the switch circuit 20B of another cluster 6B.
The bridge circuit 22A is used as a relay apparatus for connecting the microprocessor 14A of the cluster 6A to the local memory 24A of the same cluster, and to the switch circuit 20A.
The switch circuit (PCI-Express Switch) 20A comprises a plurality of PCI-Express standard ports (PCIe), and is connected, via the respective ports, to the host controller 16A and the I/O controller 18A, as well as to the PCI-Express standard port (PCIe) of the bridge circuit 22A.
The switch circuit 20A is equipped with a NTB (Non-Transparent Bridge) 26A, and the NTB 26A of the switch circuit 20A and the NTB 26B of the switch circuit 20B are connected with the connection path 12. It is thereby possible to arrange a plurality of MPs in the storage apparatus 10. A plurality of clusters (domains) can be connected by using the NTB. To put it differently, the MP 14A is able to share and access the address space of the cluster 6B (separate cluster) based on the NTB. A system that is able to connect a plurality of MPs is referred to as a multi CPU, and is different from a system using the NTB.
The storage apparatus of the present invention is able to connect a plurality of clusters (domains) by using the NTB. Specifically, the memory space of one cluster can be used; that is, the memory space can be shared among a plurality of clusters.
Meanwhile, the bridge circuit 22A comprises a DMA (Direct Memory Access) controller 28A and a RAID engine 30A. The DMA controller 28A performs the data transfer with the devices of the cluster 6A and the data transfer to the cluster 6B without going through the MP 14A.
The RAID engine 30A is an LSI for executing the RAID operation to user data that is stored in the storage device 4. The bridge circuit 22A comprises a port 32A that is to be connected to the local memory 24A.
As described above, the microprocessor 14A has the function of controlling the operation of the overall cluster 6A. The microprocessor 14A performs processing such as the reading and writing of data from and into the logical volumes that are allocated to itself in advance in accordance with the write commands and read commands stored in the local memory 24A. The microprocessor 14A is also able to execute the control of the cluster 6B.
To which microprocessor 14A (14B) of the cluster 6A or the cluster 6B the writing into and reading from the logical volumes should be allocated can be dynamically changed based on the load status of the respective microprocessors or the reception of a command from the host computer designating the associated microprocessor for each logical volume.
The I/O controller 18A is an interface for controlling the communication with the storage device 4, and comprises a communication protocol chip for communicating with the storage device. As this protocol chip, for example, an FC protocol chip is used if the storage device is an FC hard disk drive, and a SAS protocol chip is used if the storage device is a SAS hard disk drive.
When applying a SATA hard disk drive, the FC protocol chip or the SAS protocol chip can be applied as the storage device communication protocol chips 22A, 22B, and the configuration may also be such that the connection to the SATA hard disk drive is made via a SATA protocol conversion chip.
The storage device is configured from a plurality of hard disk drives; specifically, FC hard disk drives, SAS hard disk drives, or SATA hard disk drives. A plurality of logical units as logical storage areas for reading and writing data are set in a storage area that is provided by the plurality of hard disk drives.
A semiconductor memory such as a flash memory or an optical disk device may be used in substitute for a hard disk drive. As the flash memory, either a first type that is inexpensive, has a relatively slow writing speed, and has a low write endurance, or a second type that is expensive, has faster write command processing than the first type, and has a higher write endurance than the first type may be used.
Although the RAID operation was explained to be executed by the RAID controller (RAID engine) 30A of the bridge circuit 22A, as an alternative method, the RAID operation may also be achieved by the MP executing software such as a RAID manager program.
While the cache memory 24A-2 is connected to the MP 14A via the bridge circuit 22A and the switch circuit 20A in
As shown in
An operational example of the storage apparatus (
In this storage apparatus, when the first cluster is to acquire data from the second cluster, the first cluster does not read data from the second cluster, but rather the first cluster writes a transfer command to the DMA of the second cluster, and the target data is DMA-transferred from the second cluster to the first cluster.
The MP 14A of the cluster 6A or the MP 14B of the cluster 6B writes a transfer list as a data transfer command to the DMA 28B into the system memory 24B of the cluster 6B (S1). The writing of the transfer list occurs when the cluster 6A attempts to acquire the status of the cluster 6B in real time, or otherwise when a read command is issued from the host computer 2A or 2B to the storage apparatus. This transfer list includes control information that prescribes DMA-transferring data of the system memory 24B of the cluster 6B to the system memory 24A of the cluster 6A.
Subsequently, the micro program that is executed by the MP 14A starts up the DMA 28B of the cluster 6B (S2). The DMA 28B that was started up reads the transfer list set in the system memory 24B (S3).
The DMA 28B issues a write request for writing the target data from the system memory 24B of the cluster 6B into the system memory 24A of the cluster 6A according to the transfer list that was read (S4).
If the cluster 6A requires user data of the cluster 6B, the MP 14B stages the target data from the HDD 4 to the cache memory of the local memory 24B.
The DMA 28B writes “completion write” representing the completion of the DMA transfer into a prescribed area of the system memory 24A (S5).
The micro program of the cluster 6A confirms that the data migration is complete by reading the completion write of the DMA transfer completion from the cluster 6B that was written into the memory 24A (S6).
If the micro program of the cluster 6A is unable to obtain a completion write of the DMA transfer completion even after the lapse of a given period of time, the cluster 6A determines that some kind of failure occurred in the cluster 6B, and subsequently continues with failure countermeasure processing such as executing the jobs of the cluster 6B on its behalf.
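As a purely illustrative aid, the flow of steps S1 to S6 could be expressed in C roughly as follows. All structure and function names, the marker value, and the use of memcpy to stand in for the DMA hardware are assumptions made for this sketch, and do not describe the actual firmware of the apparatus.

```c
/* Minimal sketch of the write-only inter-cluster transfer (steps S1-S6).
 * Names and layouts are illustrative assumptions, not the actual firmware. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define XFER_DONE 0xC0FFEEu               /* hypothetical "completion write" value      */

typedef struct {                          /* transfer list written into cluster 6B      */
    uint64_t src;                         /* source address in system memory 24B        */
    uint64_t dst;                         /* destination address in system memory 24A   */
    uint32_t len;                         /* transfer length in bytes                    */
} transfer_list_t;

typedef struct {                          /* registers of DMA 28B (simulated here)       */
    transfer_list_t  list;
    volatile uint32_t start;
} dma_regs_t;

/* S1: MP 14A writes the transfer list into system memory 24B of cluster 6B. */
static void write_transfer_list(dma_regs_t *dma_b, uint64_t src, uint64_t dst, uint32_t len)
{
    dma_b->list = (transfer_list_t){ .src = src, .dst = dst, .len = len };
}

/* S2: MP 14A starts DMA 28B; the DMA work itself is simulated by a memcpy. */
static void start_dma_b(dma_regs_t *dma_b, uint8_t *mem_b, uint8_t *mem_a,
                        volatile uint32_t *completion_a)
{
    dma_b->start = 1;
    /* S3-S4: DMA 28B reads the list and writes the data into memory 24A. */
    memcpy(mem_a + dma_b->list.dst, mem_b + dma_b->list.src, dma_b->list.len);
    /* S5: DMA 28B writes the completion write into the prescribed area of 24A. */
    *completion_a = XFER_DONE;
}

int main(void)
{
    uint8_t mem_a[64] = {0};
    uint8_t mem_b[64] = "status of cluster 6B";
    volatile uint32_t completion_a = 0;
    dma_regs_t dma_b = { .start = 0 };

    write_transfer_list(&dma_b, 0, 0, 32);             /* S1       */
    start_dma_b(&dma_b, mem_b, mem_a, &completion_a);  /* S2 to S5 */

    /* S6: the micro program of cluster 6A only polls its own memory;
     * no read request ever crosses the inter-cluster connection path. */
    if (completion_a == XFER_DONE)
        printf("transferred: %s\n", (char *)mem_a);
    return 0;
}
```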
Consequently, the storage apparatus is able to migrate data between the clusters only with write processing. In comparison to read processing, the time that write processing binds the MP is short. While the MP that issues a read command must stop the other processing until it receives a read result, the MP that issues a write command is released at the point in time that it issues such write command.
Moreover, even if some kind of failure occurs in the cluster 6B, since a read command will not be issued from the cluster A to the cluster B, completion time out will not occur. Thus, the storage apparatus is able to avoid the system crash of the cluster 6A.
In order to substitute the reading of data of the cluster 6B by the cluster 6A with the writing of the transfer list from the cluster 6A into the DMA 28B of the cluster 6B and the DMA data transfer from the DMA 28B of the cluster 6B to the cluster 6A, the system memory 24A is set with a plurality of control tables. The same applies to the system memory 24B.
This control table is now explained with reference to
The DMA 28A of the cluster 6A executes the data transfer within the cluster 6A, as well as the writing of data into the cluster 6B. Accordingly, the DMA descriptor table includes a descriptor table (A-(1)) as a transfer list for the DMA of the self-cluster (cluster 6A) to transfer data within the self-cluster, and a descriptor table (A-(2)) as a transfer list for the DMA of the self-cluster (cluster 6A) to transfer data to the other cluster 6B. The table A-(1) is written by the cluster 6A. The table A-(2) is written by the cluster 6B.
The DMA status table includes a status table for the DMA 28A of the cluster 6A and a status table for the DMA 28B of the cluster 6B. The DMA 28A of the cluster 6A writes data of the cluster 6A into the cluster 6B according to the transfer list that was written by the cluster 6B, and, contrarily, the DMA 28B of the cluster 6B writes data of the cluster 6B into the cluster 6A according to the transfer list written by the cluster 6A.
In order to control the write processing between the cluster 6A and the cluster 6B, either the cluster 6A writes or the cluster 6B writes into the DMA status table of the cluster 6A or the DMA status table of the cluster 6B. The same applies to the DMA descriptor table and the DMA completion status table.
A-(3) is a status table that is written by the self-cluster (cluster 6A) and allocated to the DMA of the cluster 6A.
A-(4) is a status table that is written by the self-cluster and allocated to the DMA 28B of the cluster 6B.
A-(5) is a status table that is written by the cluster 6B and allocated to the DMA 28B of the cluster 6B, and A-(6) is a status table that is written by the cluster 6B and allocated to the DMA 28A of the cluster 6A.
The DMA status includes information concerning whether that DMA is being used in the data transfer, and information concerning whether a transfer list is currently being set in that DMA. The DMA status is a signal configured from a plurality of bits; if "1" (the in-use flag) is set in bit [0], this shows that the DMA is being used in the data transfer.
If “1” (standby flag) is set as the bit [1], this shows that a transfer list is set, currently being set, or is about to be set in the DMA. If neither flag is set, it means that the DMA is not involved in the data transfer.
The foregoing status tables mapped to the memory space of the system memory in the cluster 6A are explained in further detail below.
A-(3) bit [0] ("in-use flag"): Written by the cluster 6A, and shows whether the self-cluster (cluster 6A) is using the self-cluster DMA 28A for data transfer.
A-(3) bit [1] ("standby flag"): Written by the cluster 6A, and shows whether the self-cluster is currently setting the transfer list to the self-cluster DMA 28A.
A-(4) bit [0] ("in-use flag"): Written by the cluster 6A, and shows whether the self-cluster is using the cluster 6B DMA 28B for data transfer.
A-(4) bit [1] ("standby flag"): Written by the cluster 6A, and shows whether the self-cluster is currently setting the transfer list to the cluster 6B DMA 28B.
A-(5) bit [0] ("in-use flag"): Written by the cluster 6B, and shows whether the cluster 6B (separate cluster) is using the cluster 6B DMA 28B for data transfer.
A-(5) bit [1] ("standby flag"): Written by the cluster 6B, and shows whether the cluster 6B is currently setting the transfer list to the DMA 28B.
A-(6) bit [0] ("in-use flag"): Written by the cluster 6B, and shows whether the cluster 6B is using the separate cluster (cluster 6A) DMA 28A for data transfer.
A-(6) bit [1] ("standby flag"): Written by the cluster 6B, and shows whether the cluster 6B is currently setting the transfer list to the separate cluster (cluster 6A) DMA 28A.
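As a purely illustrative aid, the two flag bits of the status tables listed above could be encoded as follows. The macro names and the 32-bit word size are assumptions of this sketch; only the bit positions follow the description.

```c
/* Illustrative encoding of a DMA status table entry (A-(3) through A-(6)). */
#include <stdint.h>
#include <stdbool.h>

#define DMA_STATUS_IN_USE   (1u << 0)   /* bit [0]: DMA is being used for a transfer */
#define DMA_STATUS_STANDBY  (1u << 1)   /* bit [1]: a transfer list is (being) set    */

static inline bool dma_is_free(uint32_t status)
{
    /* The DMA is not involved in any transfer only when neither flag is set. */
    return (status & (DMA_STATUS_IN_USE | DMA_STATUS_STANDBY)) == 0;
}
```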
In order to implement the exclusive control of the DMA as described above, the cluster 6A needs to confirm the status of use of the DMA of the cluster 6B. Here, if the cluster 6A reads the “in-use flag” of the cluster 6B via the inter-cluster connection 12, the latency will be extremely large, and this will lead to the performance deterioration of the cluster 6A. Moreover, as described above, there is the issue of system failure of the cluster 6A that is associated with the fault of the cluster 6B.
Thus, the storage apparatus 10 sets the DMA status table including the “in-use flag” in the local memory of the respective clusters as (A/B-(3), (4), (5), (6)) so as to enable writing in the status table from other clusters.
A-(7) in
A-(9) is a table for setting the priority among a plurality of masters in relation to the DMA 28A of the cluster 6A, and A-(10) is a table for setting the priority among a plurality of masters in relation to the DMA 28B of the cluster 6B. The explanation regarding the respective tables of the cluster A applies to the respective tables of the cluster B by setting the cluster B as the self-cluster and the cluster A as the other cluster.
A master is a control means (software) for realizing the DMA data transfer. If there are a plurality of masters, the DMA transfer job is achieved and controlled by the respective masters. The priority table is the means for arbitrating cases where the jobs of a plurality of masters compete for the same DMA.
The foregoing tables stored in the system memory 24A of the cluster 6A are set or updated by the MP 14A of the cluster 6A and the MP 14B of the cluster 6B during the startup of the system or during the storage data processing. The DMA 28A of the cluster 6A reads the tables of the system memory 24A and executes the DMA transfer within the cluster 6A and the DMA transfer to the cluster 6B.
The processing flow of the cluster 6A receiving the transfer of data from the DMA of the cluster 6B is now explained with reference to the flowchart shown in
If a negative result is obtained in this determination, it means that the DMA of the cluster 6B is being used, and the processing of step 600 is repeatedly executed until the value of both flags becomes "0"; that is, until the DMA enters an unused status.
Subsequently, at step 602, the MP 14A accesses the cluster 6B, sets "1" as the "standby flag" in the bit [1] of the status table B-(6) of its local memory, and thereby obtains the setting right of the transfer list to the DMA 28B of the cluster 6B.
The MP 14A also writes "1" as the "standby flag" to the bit [1] of the status table A-(4) of the local memory 24A. If the standby flag is raised, this means that the cluster 6A is currently setting the transfer list to the DMA 28B of the cluster 6B.
Subsequently, the MP 14A reads the bit [1] of area A-(5) pertaining to the status of the DMA 28B of the cluster 6B, and determines whether the "standby flag" is "1" (604). A-(4) is used when the cluster 6A controls the DMA of the cluster 6B, and A-(5) is used when the cluster 6B controls its own DMA.
If this flag is "0," the MP 14A determines that no other master has the setting right of the transfer list to the DMA 28B, and proceeds to step 606.
Meanwhile, if the flag is “1” and the cluster 6A and the cluster 6B simultaneously have the right of use of the DMA 28B of the cluster 6B, the routine proceeds from step 604 to step 608. If the priority of the cluster 6A master is higher than the priority of the cluster 6B master, the cluster 6A master returns from step 608 to step 606, and attempts to execute the data transfer from the DMA 28B of the cluster 6B to the cluster 6A.
Meanwhile, if the priority of the cluster 6B master is higher, the cluster 6B master notifies a DMA error to the micro program of the cluster 6A (master) to the effect that the data transfer command from the cluster 6A master to the DMA 28B of the cluster 6B cannot be executed (611).
At step 606, the MP 14A sets “in-use flag”=“1” to the bit [0] of the status tables A-(4), A-(6) of the local memory 24B of the cluster 6B, and secures the right of use against the DMA 28B of the cluster 6B.
Subsequently, at step 607, the MP 14A sets a transfer list in the DMA descriptor table of the local memory 24B of the cluster 6B.
Moreover, the MP 14A starts up the DMA 28B of the cluster 6B, and the DMA 28B that was started up reads the transfer list, reads the data of the system memory 24B based on the transfer list that was read, and transfers the read data to the local memory 24A of the cluster 6A (610).
If the DMA 28B normally writes data into the cluster 6A, the DMA 28B writes the completion write into the completion status table allocated to the DMA 28B of the cluster B of the system memory 24A.
Subsequently, the MP 14A checks the completion status of this table; that is, checks whether the completion write has been written (612).
If the completion write has been written, the MP 14A determines that the data transfer from the cluster 6B to the cluster 6A has been performed correctly, and proceeds to step 614.
At step 614, the MP 14A sets “0” to the bit [0] related to the in-use flag of the status table B-(6) of the system memory 24B (table written by the cluster 6A and which shows the DMA status of the cluster 6B) and the status table A-(4) of the system memory 24A of the cluster 6A (table written by the cluster 6A and which shows the DMA status of the cluster 6B).
Subsequently, at step 616, the MP 14A sets "0" to the bit [1] related to the standby flag of these tables, and releases the access right to the DMA 28B of the cluster 6B.
If the cluster 6B is to use the DMA 28B on its own, the MP 14B sets "1" to the bit [0] of A-(5), B-(3), and notifies the other masters that the cluster 6B itself owns the right of use of the DMA 28B of the cluster 6B.
At step 612, if the MP 14A is unable to confirm the completion write, the MP 14A determines this to be a time out (618), and notifies the transfer error of the DMA 28B to the user (610).
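The exclusive-control sequence of steps 600 through 618 can be summarized in the following illustrative sketch. The table layout (plain words standing in for A-(4), A-(5), B-(6) and A-(8)), the priority convention (a larger value meaning a higher priority), and the function names are assumptions, and the descriptor setting and DMA startup of steps 607 and 610 are omitted.

```c
/* Illustrative sketch of the exclusive control of steps 600-618. */
#include <stdint.h>
#include <stdbool.h>

#define IN_USE   (1u << 0)   /* bit [0] of a status table */
#define STANDBY  (1u << 1)   /* bit [1] of a status table */

typedef struct {
    volatile uint32_t a4;    /* A-(4): written by cluster 6A, status of DMA 28B      */
    volatile uint32_t a5;    /* A-(5): written by cluster 6B, status of DMA 28B      */
    volatile uint32_t a8;    /* A-(8): completion status written by DMA 28B into 24A */
} cluster_a_tables_t;

typedef struct {
    volatile uint32_t b6;    /* B-(6): written by cluster 6A, status of DMA 28B      */
} cluster_b_tables_t;

/* Returns true when the transfer from DMA 28B completed normally. */
static bool transfer_from_cluster_b(cluster_a_tables_t *a, cluster_b_tables_t *b,
                                    int my_priority, int other_priority)
{
    while ((a->a4 | a->a5) & (IN_USE | STANDBY))
        ;                                   /* step 600: wait until DMA 28B is idle   */

    b->b6 |= STANDBY;                       /* step 602: claim the setting right      */
    a->a4 |= STANDBY;

    if ((a->a5 & STANDBY) && other_priority > my_priority) {
        a->a4 &= ~STANDBY;                  /* steps 604/608/611: yield, DMA error    */
        b->b6 &= ~STANDBY;
        return false;
    }

    b->b6 |= IN_USE;                        /* step 606: secure the right of use      */
    a->a4 |= IN_USE;

    /* steps 607 and 610 (descriptor setting and DMA startup) omitted in this sketch */

    bool ok = (a->a8 != 0);                 /* step 612: poll the completion write    */

    a->a4 &= ~(IN_USE | STANDBY);           /* steps 614-616: release the DMA 28B     */
    b->b6 &= ~(IN_USE | STANDBY);
    return ok;
}
```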
The processing of the MP 14A of the cluster 6A shown in
When the MP 14A is to set the transfer list in the local memory 24B of the cluster 6B, the address on the memory space in which the descriptor (transfer list) is arranged is set in the DMA register (descriptor address). An example of such an address setting table for setting an address in the register is shown in
The DMA 28B refers to this register to learn of the address where the transfer list is stored in the local memory, and thereby accesses the transfer list. In
When the MP 14A is to start up the DMA 28B, it writes a start flag in the register (start DMA) of the DMA 28B. The DMA 28B is started up once the start flag is set in the register, and starts the data transfer processing.
The setting of the address for writing the completion write into the cluster 6A is performed using the MMIO area of the NTB, and performed to the MMIO area of the cluster 6B DMA. The MP 14A subsequently sets the address of the local memory 24A to issue the completion write in the register (completion write address) shown in
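As an illustration of the register settings described above, the following sketch assumes a single register block containing the descriptor address, the completion write address reached through the NTB MMIO window, and the start flag; the register names follow the description, but the layout and the helper function are assumptions.

```c
/* Sketch of the register settings of DMA 28B; layout is illustrative only. */
#include <stdint.h>

typedef struct {
    volatile uint64_t descriptor_address;       /* address of the transfer list in local memory 24B     */
    volatile uint64_t completion_write_address; /* address in memory 24A, reached via the NTB MMIO area */
    volatile uint32_t start_dma;                /* writing a start flag here starts the transfer        */
} dma_regs_t;

static void kick_dma_28b(dma_regs_t *regs, uint64_t list_addr_in_24b,
                         uint64_t completion_addr_in_24a_via_ntb)
{
    regs->descriptor_address       = list_addr_in_24b;
    regs->completion_write_address = completion_addr_in_24a_via_ntb;
    regs->start_dma                = 1;  /* DMA 28B starts and later issues the completion write */
}
```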
The cluster 6A provides, in the system memory 24A, an area for writing the completion status write of the error notification based on the abort of the DMA 28B as the DMA completion status table (A-8) after the completion of the DMA transfer from the cluster 6B as described above.
The DMA of the storage apparatus is equipped with a completion status write function, rather than an interrupt function, as the method of notifying the transfer destination cluster of the completion or error of the DMA transfer.
Incidentally, the present invention does not exclude the interrupt method, and the storage apparatus may adopt such an interrupt method to execute the DMA transfer completion notice from the cluster 6B to the cluster 6A.
When transferring data from the cluster 6B to the cluster 6A, if the completion write were written into the memory of the cluster 6B and read from the cluster 6A, this read processing would have to be performed across the connection means between the plurality of clusters, and there is a problem in that the latency will increase.
Consequently, the completion status area is allocated in the memory 24A of the cluster 6A in advance, and the master of the cluster 6A executes the completion write from the DMA 28B of the cluster 6B to this area while using software to restrict the write access to this area. Thus, as a result of the master of the cluster 6A reading this area without any reading being performed between the clusters, the completion of the DMA transfer from the cluster 6B to the cluster 6A can thereby be confirmed.
At step 604 and step 608 of
This is because, even though the storage apparatus 10 authorized the cluster 6A to perform the write access to the DMA 28B of the cluster 6B, if the cluster 6A and the cluster 6B both attempt to use the DMA 28B, the DMA 28B will enter a competitive status, and the normal operation of the DMA cannot be guaranteed. The foregoing process is performed to prevent this phenomenon. Details regarding the priority processing will be explained later.
Meanwhile, if the number of DMAs to be mounted increases and the access from the cluster 6A and the cluster 6B is approved for all DMAs, this exclusive processing will be required for each DMA, and there is a possibility that the processing will become complicated and the I/O processing performance of the storage apparatus will deteriorate.
Thus, the following embodiment explains a system that is able to avoid the competition of a plurality of masters for the same DMA, in place of the exclusive processing based on priority, in a mode where a DMA configured from a plurality of channels exists in the cluster.
Moreover, the DMA channel 1 and the DMA channel 2 among the plurality of DMAs of the cluster 6B are allocated to the master of the cluster 6A, and the DMA channel 3 and the DMA channel 4 are allocated to the master of the cluster 6B. The foregoing allocation is set during the software coding of the clusters 6A, 6B.
Accordingly, the master of the cluster 6A and the master of the cluster 6B are prevented from competing for the access right to a single DMA in the cluster 6A and the cluster 6B.
Specifically, in the cluster 6A, the DMA channel 1 is used by the master 1 of the cluster 6A, the DMA channel 2 is used by the master 2 of the cluster 6A, the DMA channel 3 is used by the master 1 of the cluster 6B, and the DMA channel 4 is used by the master 2 of the cluster 6B.
Moreover, in the cluster 6B, the DMA channel 1 is used by the master 1 of the cluster 6A, the DMA channel 2 is used by the master 2 of the cluster 6A, the DMA channel 3 is used by the master 1 of the cluster 6B, and the DMA channel 4 is used by the master 2 of the cluster 6B.
Each of the plurality of DMAs of the cluster 6A is allocated with a table stored in the system memory 24A within the same cluster as shown with the arrows of
The master of the cluster 6A uses the DMA channel 1 or the DMA channel 2 and refers to the transfer list table (self-cluster (cluster 6A) DMA descriptor table to be written by the self-cluster) (A-1) and performs the DMA transfer within the cluster 6A.
Here, the master of the cluster 6A refers to the cluster 6A DMA status table (A-3) of the system memory 24A.
When the master of the cluster 6B requires data of the cluster 6A, it writes a startup flag in the register of the DMA channel 3 or the DMA channel 4 of the cluster 6A. The method of choosing between the two is as follows. Specifically, the master of the cluster 6B is set to normally use the DMA channel 3, to use the DMA channel 4 if it is unable to use the DMA channel 3 due to the priority relationship, and to wait until a DMA channel becomes available if it is also unable to use the DMA channel 4. Otherwise, the relationship between the DMA (hardware) and the master is set to 1:1 during the coding of the software.
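The channel selection rule for the master of the cluster 6B could be sketched as follows; the availability check is merely a stub standing in for the real confirmation of the channel status, and all names are assumptions.

```c
/* Sketch of the channel selection by the master of the cluster 6B. */
#include <stdbool.h>
#include <stdio.h>

enum { DMA_CH3 = 3, DMA_CH4 = 4 };

/* Stub standing in for the real status check of a DMA channel of the cluster 6A. */
static bool dma_channel_available(int channel)
{
    (void)channel;
    return true;
}

static int select_dma_channel(void)
{
    for (;;) {
        if (dma_channel_available(DMA_CH3))
            return DMA_CH3;   /* preferred channel                                   */
        if (dma_channel_available(DMA_CH4))
            return DMA_CH4;   /* fallback when channel 3 is lost due to the priority */
        /* neither channel is usable yet: keep waiting until one becomes available */
    }
}

int main(void)
{
    printf("using DMA channel %d\n", select_dma_channel());
    return 0;
}
```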
Consequently, the DMA channel 3 of the cluster 6A DMA-transfers the data from the cluster 6A to the cluster 6B according to the transfer list stored in the cluster 6B table 110. Moreover, the DMA channel 4 of the cluster 6A DMA-transfers the data from the cluster 6A to the cluster 6B according to the transfer list stored in the cluster 6B table 112.
These tables are set or updated with the transfer list by the master of the cluster 6B.
In the cluster 6B, the access right of the master of the cluster 6A is allocated to the DMA channel 1 and the DMA channel 2. An exclusive right of the master of the cluster 6B is granted to the DMA channel 3 and the DMA channel 4. The allocation of the tables and the DMA channels is as shown with the arrows in
The foregoing priority is now explained.
Accordingly, the micro program of the cluster 6A refers to this priority table when access from the plurality of masters is competing in the same DMA, and grants the access right to the master with the highest priority.
The priority levels are prepared in a quantity that is equivalent to the number of masters. In the foregoing example, four priority levels are set on the premise that the cluster 6A has two masters and the cluster 6B has two masters. If the number of masters is to be increased, then the number of bits for setting the priority will also be increased in order to increase the number of priority levels.
The micro program determines that a plurality of masters are competing in the same DMA as a result of the standby flag “1” being respectively set in the plurality of status tables of that DMA. For example, in
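The arbitration described above can be illustrated with the following sketch, under the assumption that a larger numerical value means a higher priority and that the standby flags raised by the respective masters are held in a simple array; the table layout is not taken from the actual apparatus.

```c
/* Sketch of the arbitration among masters competing for the same DMA. */
#include <stdint.h>

#define NUM_MASTERS 4                      /* two masters per cluster in the example         */

typedef struct {
    uint8_t standby[NUM_MASTERS];          /* standby flag raised by each master             */
    uint8_t priority[NUM_MASTERS];         /* priority table entry; larger = higher (assumed)*/
} dma_arbitration_t;

static int winning_master(const dma_arbitration_t *t)
{
    int winner = -1;
    for (int m = 0; m < NUM_MASTERS; m++) {
        if (!t->standby[m])
            continue;                      /* this master is not requesting the DMA          */
        if (winner < 0 || t->priority[m] > t->priority[winner])
            winner = m;                    /* keep the competing master with highest priority*/
    }
    return winner;                         /* -1 when no master is competing                  */
}
```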
Meanwhile, in the storage apparatus, there are cases where the priority is once set and thereafter changed. For example, when exchanging the firmware in the cluster 6A, the master of the cluster 6A will not use the DMA 28A at all during the exchange of the firmware.
Thus, the DMA of the cluster 6A is preferentially allocated to the master of the cluster 6B on a temporary basis so that the latency of the cluster 6B to use the DMA of the cluster 6A is decreased.
The priority table is set upon booting the storage apparatus. During the startup of the storage apparatus, software is used to write the priority table into the memory of the respective clusters. This writing is performed from the side of the cluster to which the DMA allocated with the priority table belongs. For example, the writing into the table A-(9) is performed with the micro program of the cluster 6A, and the writing into the table A-(10) is performed with the micro program of the cluster 6B.
Even if a plurality of masters exist in each cluster, the setting, change and update of the priority is performed by one of such masters. If an unauthorized master wishes to change the priority, it requests such priority change from the authorized master.
The flowchart for changing the priority is now explained with reference to
The priority change processing job includes the process of identifying the priority change target DMA (1600).
The plurality of masters of the cluster to which this DMA belongs randomly select the job execution authority, and it is determined whether the master that is to execute the job has the priority change authority (1602). If a negative result is obtained in this determination, the priority change job is given to the authorized master (1604).
If a positive result is obtained in this determination, the master with the priority change authority determines whether "1" is set as the in-use flag of the status table allocated to the DMA whose priority is to be changed. If the flag is "1," the priority cannot be changed since the DMA is being used in the data transfer, and the determination is repeated until the flag becomes "0" (1606).
Once the data transfer of the target DMA is complete, the in-use flag is released and becomes "0," and step 1606 is passed. Subsequently, the master sets "1" as the standby flag of the status table allocated to that DMA, and secures the access right to the DMA (1608).
At step 1610, the master executing the priority change job refers to the status table of the priority change target DMA that is written by a separate master and stored in the memory of the cluster to which the job-executing master belongs. If a standby flag has been set in that table by the separate master, the job-executing master refers to the priority table of the target DMA written by that master, compares the priority of the separate master with its own priority, and proceeds to step 1620 if the priority of the former is higher.
At step 1620, since the transfer list to the target DMA is being set by the separate master, the master executing the priority change job releases the standby flag; that is, it sets "0" to the standby flag of the target DMA that it had set itself. It subsequently proceeds to the processing for starting the setting, change and update of the priority of a separate DMA (1622), and then returns to step 1602.
Meanwhile, if the priority of the job-executing master is higher in the processing at step 1610, this master sets "1" as the in-use flag in the status table of the target DMA to be written by that master and in the status table of the target DMA to be written by the separate master, and locks the target DMA for the priority change processing (1612).
At subsequent step 1614, if "1" showing that the DMA is being used is set in the in-use flag of all DMAs belonging to the cluster, the job-executing master deems that the locking of all DMAs belonging to the cluster is complete, performs the priority change processing on all DMAs belonging to that cluster (1616), thereafter clears the flags allocated to all DMAs (1618), and releases all DMAs from the priority change processing.
Accordingly, the priority change and update processing of all DMAs belonging to a plurality of clusters is thereby complete.
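Reduced to its locking idea, the priority change job of steps 1600 through 1622 could be sketched as follows; the structures, the channel count, and the omission of the priority comparison of step 1610 are simplifying assumptions of this illustration.

```c
/* Sketch of the priority change job: lock every DMA, rewrite the priorities, unlock. */
#include <stdint.h>

#define NUM_DMA_CH 4

typedef struct {
    volatile uint8_t in_use;               /* bit [0] of the status table              */
    volatile uint8_t standby;              /* bit [1] of the status table              */
    uint8_t priority;                      /* entry of the priority table A-(9)/A-(10) */
} dma_entry_t;

static void change_priorities(dma_entry_t dma[NUM_DMA_CH], const uint8_t new_prio[NUM_DMA_CH])
{
    for (int ch = 0; ch < NUM_DMA_CH; ch++) {
        while (dma[ch].in_use)
            ;                              /* step 1606: wait until the DMA is idle    */
        dma[ch].standby = 1;               /* step 1608: secure the access right       */
        dma[ch].in_use  = 1;               /* step 1612: lock the DMA for the change   */
    }
    for (int ch = 0; ch < NUM_DMA_CH; ch++)
        dma[ch].priority = new_prio[ch];   /* step 1616: change the priority           */
    for (int ch = 0; ch < NUM_DMA_CH; ch++) {
        dma[ch].in_use  = 0;               /* step 1618: clear the flags and release   */
        dma[ch].standby = 0;
    }
}
```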
In the cluster 6A, the DMA channel 1 and the DMA channel 2 are set with the access right of the cluster 6A, and the DMA channel 3 and the DMA channel 4 are set with the access right of the cluster 6B. In the cluster 6B, the DMA channel 1 and the DMA channel 2 are set with the access right of the cluster 6A, and the DMA channel 3 and the DMA channel 4 are set with the access right of the cluster 6B.
Both the cluster 6A and the cluster 6B are set with a control table to be written by the self-cluster and a control table to be written by the other cluster. Each control table is set with a descriptor table and a status table of the DMA channel.
The cluster A-DMA channel 1 table is set with a self-cluster DMA descriptor table (A-(1)) to be written by the self-cluster (cluster 6A), and a self-cluster DMA status table (A-(3)) to be written by the self-cluster. The same applies to the cluster A-DMA channel 2 table.
The cluster A-DMA channel 3 table is set with a self-cluster (cluster 6A) DMA descriptor table (A-(7)) to be written by the cluster B, and a self-cluster (cluster 6A) DMA status table (A-(6)) to be written by the cluster B. The same applies to the cluster A-DMA channel 4 table. This table configuration is the same in the cluster 6B as with the cluster 6A, and
In addition, the cluster 6A is separately set with a control table that can be written by the self-cluster (cluster 6A) and which is used for managing the usage of both DMA channels 1 and 2 of the cluster 6B. Each control table is set with another cluster (cluster 6B) DMA status table (A-(4)) to be written by the self-cluster (cluster 6A). This table configuration is the same in the cluster 6B, and
Although the foregoing embodiment explained a case where data is written from the cluster 6B into the cluster 6A based on DMA transfer, the reverse is also possible as a matter of course.
The present invention can be applied to a storage apparatus comprising a plurality of clusters as processing means for providing a data storage service to a host computer, and having improved redundancy of a data processing service to be provided to a user. In particular, the present invention can be applied to a storage apparatus and its data transfer control method that is free from delays in cluster interaction processing and system crashes caused by integration of multiple clusters even when it is necessary to transfer data in real time between multiple clusters in a storage apparatus including multiple clusters.
Another embodiment of the DMA startup method is now explained. When the MP 14A is to start up the DMA 28B at step 610 of
The embodiment explained below shows another example of the DMA startup method. Specifically, this startup method sets the number of DMA startups in the DMA counter register. The DMA refers to the descriptor table and executes the data write processing the number of times designated in the register. When the MP executes a micro program and sets a prescribed numerical value in the DMA counter register, the DMA determines the differential from the previous value, starts up the number of times corresponding to that differential, refers to the descriptor table, and executes the data write processing.
The memory 24A of the cluster 6A (cluster A) and the memory 24B of the cluster 6B (cluster B) are respectively set with a counter table area to be referred to by the MP upon controlling the DMA startup. The MP reads the value of the counter table and sets the read value in the DMA register of the cluster to which that MP belongs.
When the MP 14B detects the update of B-(12), it determines whether the DMA 28B is being started up by referring to the DMA status register, and, if the DMA 28B is of a startup status, waits for the startup status to end, proceeds to step 2008, and reads the value of B-(12). The MP 14B thereafter writes the read value in the counter register of the DMA 28B.
When the counter register is updated, the DMA 28B determines the differential with the value before the update, and starts up based on the differential.
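The counter-based startup can be illustrated with the following sketch; the register layout and the stub standing in for one pass over the descriptor table are assumptions made for this illustration.

```c
/* Sketch of the counter-differential startup of the DMA. */
#include <stdint.h>

typedef struct {
    uint32_t counter;                      /* DMA counter register (cumulative startup count) */
} dma_counter_t;

static void run_descriptor(int index)      /* stub: one pass over the descriptor table        */
{
    (void)index;
}

static void update_counter(dma_counter_t *d, uint32_t new_value)
{
    uint32_t diff = new_value - d->counter;    /* the DMA determines the differential          */
    for (uint32_t i = 0; i < diff; i++)
        run_descriptor((int)i);                /* one startup per unit of the differential     */
    d->counter = new_value;                    /* remember the value for the next update       */
}
```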
According to the method shown in
If the MP 14B of the cluster B requests the data transfer to the cluster A, at step 2000, the MP 14B refers to B-(11). In addition, if the MP 14A is to realize the data transfer in its own cluster, it refers to A-(11), and, if the MP 14B is to realize the data transfer in its own cluster, it refers to B-(12).
A practical application of the data transfer method of the present invention is now explained. In a computer system including a plurality of clusters, data that is written from the host computer regarding one cluster is written redundantly in the other cluster via one cluster.
As shown in
Meanwhile, since the host 2A is unable to write the completion write into the separate cluster 6B, the data 2101 sent to the separate cluster remains in an undecided status; that is, the status will be such that the MP 14 is unable to confirm whether all data have reliably reached the cache memory 24B.
Thus, as shown in
Then, as shown in
The write processing from the host computer is completed based on the steps shown in
Thus, the application of the present invention is effective in order to decide the data from the host computer to the other cluster while overcoming the foregoing issue. Specifically, as shown in
The DMA 28B additionally reads the dummy data 2208 in the memory 24B based on the descriptor table 2202, and sends this to the memory 24A of the cluster 6A (2210). As a result of the dummy data 2210 being stored in the memory 24A, the MP 14A is able to confirm that the data of the other cluster 6B has been decided. Incidentally, as shown in
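The role of the dummy data can be illustrated with the following sketch, under the assumption that the DMA 28B processes its descriptors in order, so that the arrival of a small marker transferred by the final descriptor tells the MP 14A that all preceding write data has been decided. The structures, the marker value, and the helper names are illustrative only.

```c
/* Sketch of appending a dummy-data descriptor to confirm that the data is decided. */
#include <stdint.h>
#include <stdbool.h>

#define DUMMY_MARKER 0xDEC1DEDu

typedef struct {
    uint64_t src;                          /* source address in memory 24B                */
    uint64_t dst;                          /* destination address in memory 24A           */
    uint32_t len;
    bool     is_dummy;                     /* final entry: the dummy data                 */
} descriptor_t;

/* Build a descriptor table whose final entry transfers the dummy data. */
static int build_descriptors(descriptor_t *table, int n_data,
                             uint64_t dummy_src_in_24b, uint64_t dummy_dst_in_24a)
{
    table[n_data] = (descriptor_t){
        .src = dummy_src_in_24b, .dst = dummy_dst_in_24a,
        .len = sizeof(uint32_t), .is_dummy = true
    };
    return n_data + 1;                     /* the DMA processes the entries in order      */
}

/* MP 14A: the data of the other cluster is decided once the marker has arrived. */
static bool other_cluster_data_decided(volatile const uint32_t *dummy_area_in_24a)
{
    return *dummy_area_in_24a == DUMMY_MARKER;
}
```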
As shown in
When the MP 14A determines that the data of the other cluster has been decided, as with
Incidentally, although