METHOD, SYSTEM AND PROGRAM FOR SECURING REDUNDANCY IN PARALLEL COMPUTING SYSTEM

Information

  • Patent Application
  • Publication Number
    20070180288
  • Date Filed
    December 08, 2006
  • Date Published
    August 02, 2007
Abstract
A parallel computing system has a plurality of computing node groups, including at least one spare computing node group. A plurality of managing nodes, which allocate jobs to the computing node groups, and an information management server, which holds status information for each computing node group, are associated with the computing node groups. Each managing node updates the status information of the computing node group it is using by accessing the information management server. When a managing node detects a failure, the managing node that was using the computing node group disabled by the failure identifies a spare computing node group by accessing the computing node group status information in the information management server, and then obtains the computing node group information of the identified spare computing node group. On the basis of that information, the managing node continues processing by switching from the disabled computing node group to the identified spare computing node group as the computing node group to be used. Redundancy in the parallel computing system is thereby secured.
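The switching sequence described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the implementation disclosed in the application: the class names (`InfoServer`, `ManagingNode`, `GroupInfo`), the three status values, and the `location` strings are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status values; the application only says that status is recorded.
    IN_USE = "in_use"
    SPARE = "spare"
    FAILED = "failed"


@dataclass
class GroupInfo:
    # Hypothetical subset of the "computing node group information".
    group_id: str
    location: str


class InfoServer:
    """Stand-in for the information management server."""

    def __init__(self):
        self.status = {}  # group_id -> Status
        self.info = {}    # group_id -> GroupInfo

    def register(self, info, status):
        self.info[info.group_id] = info
        self.status[info.group_id] = status

    def find_spare(self):
        # Return the id of the first group marked as spare, or None.
        for group_id, status in self.status.items():
            if status is Status.SPARE:
                return group_id
        return None


class ManagingNode:
    """Stand-in for a managing node that allocates jobs to one group."""

    def __init__(self, name, server, group_id):
        self.name = name
        self.server = server
        self.group = server.info[group_id]
        # Update the in-use status by accessing the server.
        server.status[group_id] = Status.IN_USE

    def on_failure(self):
        # Mark the group this node was using as disabled.
        self.server.status[self.group.group_id] = Status.FAILED
        # Identify a spare group from the status information.
        spare_id = self.server.find_spare()
        if spare_id is None:
            raise RuntimeError("no spare computing node group available")
        # Obtain the spare group's information and switch to it.
        self.group = self.server.info[spare_id]
        self.server.status[spare_id] = Status.IN_USE
        return spare_id


server = InfoServer()
for gid, st in [("g1", Status.IN_USE), ("g2", Status.IN_USE), ("g3", Status.SPARE)]:
    server.register(GroupInfo(gid, f"rack-{gid}"), st)

node1 = ManagingNode("m1", server, "g1")
switched_to = node1.on_failure()
```

In this sketch only the managing node that was using the failed group reacts; the node using `g2` is untouched, which mirrors the abstract's point that the affected managing node performs the switch itself.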
Description

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a parallel computing system (prior art);

FIG. 2 is a diagram of the configuration of node groups of the present invention;

FIG. 3 is a diagram of the configuration of the node groups when a failure occurs;

FIG. 4 is a diagram of the configuration of the node groups when the failure is recovered;

FIG. 5 is a diagram of a hardware and system configuration;

FIG. 6 is a diagram of the system configuration in the manner A;

FIG. 7 is a diagram illustrating the computing node group switching in the manner A;

FIG. 8 is a flow chart illustrating the flow in the normal operation;

FIG. 9 is a flow chart illustrating the flow from a failure occurrence to a failure recovery;

FIG. 10 is a diagram illustrating the computing node group switching in the manner B;

FIG. 11 is a diagram illustrating the computing node group switching in the manner C; and

FIG. 12 is a diagram of the system configuration when a plurality of standby spare computing node groups are provided.

Claims
  • 1. A method for securing redundancy in a parallel computing system having a plurality of computing node groups including at least one spare computing node group, comprising the steps of: associating a plurality of managing nodes for allocating jobs to the computing node groups and an information management server having respective computing node group status information with the computing node groups; updating, by the respective managing nodes, the respective in-use computing node group status information by accessing the information management server; detecting, by the managing node, an occurrence of a failure; identifying, by the managing node using the computing node group disabled due to the failure, a spare computing node group by accessing the computing node group status information in the information management server; obtaining, by the managing node using the disabled computing node group, computing node group information of the identified spare computing node group; and continuing, by the managing node using the disabled computing node group, processing by switching the disabled computing node group to the identified spare computing node group as a computing node group to be used on the basis of the computing node group information of the identified spare computing node group.
  • 2. The method according to claim 1, wherein the step of continuing the processing by switching to the spare computing node group includes the step of processing a job already queued by a job scheduler of the managing node having used then the disabled computing node group at the time of the occurrence of the failure, by the spare computing node group.
  • 3. The method according to claim 1, wherein when the failure of the disabled computing node group is recovered, the disabled computing node group is registered to the information management server as a new spare computing node group.
  • 4. The method according to claim 1, wherein the total number of computing node groups is provided by adding the number of spare computing node groups required for the jobs to be operated simultaneously to the number of the at least one computing node groups.
  • 5. The method according to claim 1, wherein the computing node group information includes identification information of the computing node group, location information of the computing node group, failure information of the computing node group, and the computing node group status information includes information for indicating a status of the computing node group.
  • 6. The method according to claim 1, wherein the respective computing node group status information and the respective computing node group information of the computing node groups are collectively managed by the information management server.
  • 7. The method according to claim 1, wherein respective computing node group status information are collectively managed by the information management server, and the respective computing node group information of the computing node groups are managed by the respective managing nodes.
  • 8. The method according to claim 1, wherein the respective computing node group status information are collectively managed by the information management server, and the respective managing nodes manage the computing node group information of the respective computing node groups, and the computing node group information of the spare computing node group.
  • 9. A parallel computing system having a plurality of computing node groups including at least one spare computing node group for securing redundancy, comprising: an information management server having a plurality of managing nodes for allocating jobs to the computing node groups, and respective computing node group status information; and a managing node configured to: update the respective in-use computing node group status information by accessing the information management server; detect an occurrence of a failure; identify a spare computing node group by accessing the computing node group status information in the information management server; obtain computing node group information of the spare computing node group; and continue processing by switching the disabled computing node group to the spare computing node group as a computing node group to be used on the basis of the computing node group information of the spare computing node group.
  • 10. A parallel computing system having a plurality of computing node groups including at least one spare computing node group for securing redundancy, comprising: an information management server having a plurality of managing nodes for allocating jobs to the computing node groups, and respective computing node group status information; and a managing node having a node managing program product stored in storage media of the managing node, wherein the node managing program product causes the managing node to: update the respective in-use computing node group status information by accessing the information management server; detect an occurrence of a failure; identify, using the computing node group disabled due to the failure, a spare computing node group by accessing the computing node group status information in the information management server; obtain, using the disabled computing node group, computing node group information of the identified spare computing node group; and continue, using the disabled computing node group, processing by switching the disabled computing node group to the identified spare computing node group as a computing node group to be used on the basis of the computing node group information of the identified spare computing node group.
  • 11. A program product for securing redundancy in a parallel computing system having a plurality of computing node groups including at least one spare computing node group, and the program product securing the redundancy in the parallel computing system by causing the parallel computing system to execute the acts of: associating a plurality of managing nodes for allocating jobs to the computing node groups and an information management server having respective computing node group status information with the computing node groups; updating, by the respective managing nodes, the respective in-use computing node group status information by accessing the information management server; detecting, by the managing node, an occurrence of a failure; identifying, by the managing node having used then the computing node group disabled due to the failure, a spare computing node group by accessing the computing node group status information in the information management server; obtaining, by the managing node having used then the disabled computing node group, computing node group information of the identified spare computing node group; and continuing, by the managing node having used then the disabled computing node group, processing by switching the disabled computing node group to the identified spare computing node group as a computing node group to be used on the basis of the computing node group information of the identified spare computing node group.
  • 12. The program product according to claim 11, wherein the step of continuing the processing by switching to the spare computing node group includes the step of processing a job already queued by a job scheduler of the managing node having used then the disabled computing node group at the time of the occurrence of the failure, by the spare computing node group.
  • 13. The program product according to claim 11, wherein when the failure of the disabled computing node group is recovered, the disabled computing node group is registered to the information management server as a new spare computing node group.
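As a rough illustration of the dependent claims on queued jobs and recovery (claims 2, 3, 12 and 13), the sketch below lets jobs already queued at the time of a failure run on the spare computing node group, and then registers the recovered group as a new spare. It is a sketch under stated assumptions, not the claimed implementation: `StatusTable`, `run_queue`, the status strings, and the `fail_on` parameter used to simulate a failure are all hypothetical names introduced here.

```python
from collections import deque


class StatusTable:
    """Hypothetical stand-in for the status information on the server."""

    def __init__(self, in_use, spare):
        self.status = {g: "in_use" for g in in_use}
        self.status.update({g: "spare" for g in spare})

    def take_spare(self, failed_group):
        # Record the failure, then hand out the first spare group found.
        self.status[failed_group] = "failed"
        spares = [g for g, s in self.status.items() if s == "spare"]
        if not spares:
            raise RuntimeError("no spare computing node group available")
        self.status[spares[0]] = "in_use"
        return spares[0]

    def register_recovered(self, group):
        # A recovered group is registered as a new spare.
        self.status[group] = "spare"


def run_queue(queue, group, fail_on=None):
    """Run queued jobs on `group`; stop when the simulated failure occurs.

    A job that hits the failure stays at the head of the queue so that it
    can be processed later by another group.
    """
    done = []
    while queue:
        if queue[0] == fail_on:
            return done, True
        done.append((queue.popleft(), group))
    return done, False


table = StatusTable(in_use=["g1", "g2"], spare=["g3"])
queue = deque(["job-a", "job-b", "job-c"])

# Group g1 fails when job-b is about to run (simulated).
done, failed = run_queue(queue, "g1", fail_on="job-b")
if failed:
    spare = table.take_spare("g1")
    more, _ = run_queue(queue, spare)
    done += more

# Once g1 has been repaired, it becomes the new spare.
table.register_recovered("g1")
```

Keeping the job queue on the managing node rather than on the computing node group is what allows queued jobs to survive the switch in this sketch.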
Priority Claims (1)
Number         Date      Country  Kind
JP2005-369863  Dec 2005  JP       national