Embodiments of the present invention may relate generally to the field of data processing, system control, and data communications, and more specifically to an integrated method, system, and apparatus that may provide resilient and efficient computational task and computational resource management, especially for large, many-component tasks that are executed on multiple processing elements.
Embodiments of the invention may also generally address fault-tolerant computing in computing systems having multiple interacting nodes.
Modern high-end computer (HEC) architectures may embody thousands to millions of processing elements, together with networking and storage resources, often processing jobs with execution times of weeks or even months. The large number of processors and extended duration of computation make it likely that even highly reliable computing systems will experience point failures during the span of some computations. Some measures that mitigate system unreliability may include: requiring higher reliability in processing elements; using redundant processing or voting circuits to provide fail-over; and running system components well below their peak capabilities. Each of these measures, however, may cause significant increases in computation and/or equipment costs, and may additionally reduce system throughput.
In HEC applications, job execution times can far exceed the System Mean-Time Between Failures (SMTBF), leading to inefficient use of the available resources. Long-running executions involving many processors may impair the Reliability, Availability, and Serviceability (RAS) of systems, introducing a high hurdle for system administrators and support staff. With large numbers of components that can fail, such as computation nodes, communication paths, and storage resources, SMTBF scales approximately inversely with the number of nodes used and can result in an Application Mean-Time Between Interrupts (AMTBI) of just a few hours or less. Typically, parallel applications deal with this by frequently preserving the current state of the computation and, after a failure, restarting and continuing the computation from the most recent previously saved state. As a consequence of this current paradigm, AMTBI=SMTBF, and any non-recoverable component failure of the portion of the system used by the application will result in the termination of the job.
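The inverse scaling described above can be illustrated with a short calculation. The sketch below is illustrative only (the per-node reliability figure is an assumption, not taken from the specification); it assumes independent, exponentially distributed node failures, under which the system MTBF is approximately the per-node MTBF divided by the node count:

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Approximate SMTBF assuming independent, exponentially
    distributed node failures: SMTBF ~= node MTBF / N."""
    return node_mtbf_hours / num_nodes

# Assume each node has an MTBF of 5 years (an illustrative figure).
node_mtbf = 5 * 365 * 24  # hours

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} nodes -> SMTBF ~ {system_mtbf_hours(node_mtbf, n):.2f} h")
```

Even with very reliable individual nodes, the system-level figure drops to a few hours at 10,000 nodes and to well under an hour at 100,000 nodes, which is consistent with the "few hours or less" AMTBI noted above.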
Computing systems have been developed using large numbers of computing devices (which will be generically referred to as “nodes” herein) interacting in both parallel and pipelined fashion. Such systems have made possible dramatic increases in overall computing power. Examples of such systems include IBM Corporation's BlueGene/L®, BlueGene/P®, and Cyclops-64® computing systems.
One problem that occurs in such systems is that of node failure, either because of a malfunction of the node itself or because of faulty communication links with one or more other nodes. Such failures may need to be detected and accommodated in order to maintain satisfactory functionality of the overall system. Checkpointing, in which the current state of an application is stored during execution in order to facilitate recovery in the event of a failure, can be used to mitigate node failures in such systems.
Some known checkpointing techniques may require that the user initiate the checkpointing process. Many known checkpointing techniques may also be oriented toward checkpointing entire systems, in which case all (or very large amounts) of data may be required. Because of the large granularity of data and system state that must be saved, maintained and restored, the very process of recovery can become significantly time consuming in such systems. At extreme scales, resources and time used in providing recoverable computing can outweigh those spent on productive work.
Embodiments of the present invention may address the HEC reliable computing problem by providing fine-grained reliability mechanisms that may be used to support task restarts within the execution of a large or long-running application. Such embodiments may provide decoupling of AMTBI and SMTBF for the case of communication failures, and may allow small portions of computing tasks to be transferred on the occasion of node failure, so that computation may be able to continue without large-scale restart of processes. The approach found in embodiments of the invention, generally referred to herein as task-oriented node-centric checkpointing (TONCC), may utilize local persistent storage to support checkpointing of a large amount of data while avoiding access bottlenecks typically encountered with global DRAM.
Glossary of Terms
Application Programmer Interface (API): a set of programmer-accessible procedures that expose the functionality of a system to manipulation by programs written by application developers who do not necessarily have access to the internal components of the system, or may desire a less complex or more consistent interface than that which is available via the underlying functionality of the system, or may desire an interface that adheres to particular standards of interoperation.
Checkpoint: In fault-tolerant computing, a checkpoint is a representation of computational state that can be stored, and from which subsequent computation can be restarted. This may be used to prevent loss of computational work that has been accomplished prior to the checkpoint.
Computer-accessible artifact (CAA): An item of information, media, work, data, or representation that can be stored, accessed, and communicated by a computer.
Logical Node: A logical representative of the processing resource represented by a physical node. A logical node is typically mapped to a physical node, but that mapping may be changed, for instance, if the physical node fails, or if it is desirable to implement the logical node on a different physical node.
Node: An architectural unit of a computing system, typically encompassing, but not limited to, one or more thread processors, local memory, logic to execute instructions and an ability to receive data from off-chip components and to provide data to off-chip components, and more generally defined as any computing device connected to a computer network.
Generalized Actor (GACT): one user or a group of users, or a group of users and software agents, or a computational entity acting in the role of a user, which behaves in a way to achieve some goal.
Local Area Network (LAN): Connects computers and other network devices over a relatively small distance, usually, but not necessarily, within a single organization.
Wide Area Network (WAN): Connects computers and other network devices over a potentially large geographic area.
Scalability: The property of a computer system, architecture, network, or process that allows it to pragmatically meet demands for larger amounts of processing by use of additional processors, memory, and connectivity.
Task: Typically a unified set of data manipulations, performed by one or more processors or thread units that accomplishes some resulting desired data value or data relationships. For the purposes of this application, a “task” will be defined as a set of computations that requires no communication with other nodes. Note that this does not exclude communication between threads running on the same node.
Thread: A thread is a small unit of processing. Typically, in multi-threaded systems, processes are composed of multiple threads, and may accomplish high-level jobs as applications or as services.
Various embodiments of the instant invention may provide several novel approaches for reliable multi-processor computing, which may include: using a logical node to perform a computation to accomplish a first computational task; storing results of the first task to local persistent storage associated with that node; storing results of the first computational task to local persistent storage of another node; using the results to accomplish a second computational task, wherein the second task requires results from the first computational task; and permitting the second task to be restarted. Embodiments of the invention may also provide users with a representational construct capable of specifying computational tasks and their relationships and, using those relationships, may permit tasks to be restarted as needed.
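The summarized flow can be sketched in a few lines. The sketch below is purely illustrative; all class and function names are hypothetical and do not come from the specification. It shows a first task checkpointed to its own node's persistent storage and mirrored to a second node, whose output then feeds a dependent task that could be restarted from the checkpointed input:

```python
class LogicalNode:
    """Hypothetical stand-in for a logical node; a dict stands in
    for its local persistent storage."""
    def __init__(self, name):
        self.name = name
        self.persistent = {}

    def store(self, key, value):
        self.persistent[key] = value


def run_task(node, mirror, task_id, fn, *inputs):
    """Run a task on `node`, checkpoint its output locally,
    mirror the checkpoint to another node, and return the result."""
    result = fn(*inputs)
    node.store(task_id, result)    # local checkpoint of the output
    mirror.store(task_id, result)  # mirrored copy on another node
    return result


node_a, node_b = LogicalNode("A"), LogicalNode("B")
r1 = run_task(node_a, node_b, "task1", lambda: 21)
# Task 2 consumes Task 1's output. If Node B fails mid-run, Task 2 can
# be restarted elsewhere from the mirrored checkpoint on Node A.
r2 = run_task(node_b, node_a, "task2", lambda x: x * 2, r1)
print(r2)  # 42
```

The essential point of the sketch is that each task's inputs survive in at least two places, so a single node failure never forces a restart of more than the tasks in flight on that node.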
Part of the operation of embodiments of the invention may involve the persistent storage of data input to and/or output from a particular node. In the example of
When Task 1 finishes, it may then send data to Task 2, shown on Node B and/or to other tasks running on other nodes (not shown) and/or may be output as overall system output (not shown). According to an embodiment of the invention, when Task 1 finishes, Node A may store the output of Task 1 in its local persistent storage, as shown in
At the same time, the output data from Task 1 may be sent to Logical Node B for use in Task 2. Logical Node B/Task 2 may then store the data locally in memory, and may begin execution when all input data arrives (noting that data may be received from sources other than the output of Task 1, in general), as shown in
As shown in
In embodiments of the invention, as shown in
Similarly, another task (e.g., Task 3) may run on Logical Node A once the run-time system begins the output data checkpointing of Task 1, as long as the checkpointed data is not some or all of the input data needed for Task 3. In this latter case, Task 3 may begin as soon as the data is placed in persistent storage and mirrored.
Finally, Logical Node A may start sending the output data from Task 1 to Task 2 on Logical Node B for checkpointing while it is storing its local copy to persistent storage. As long as the compute time is significantly larger than the checkpoint time (e.g., but not limited to, two-fold), the checkpointing latency may be hidden. If a task completes (including output data checkpointing) before the input data is checkpointed, the input data checkpointing may be cancelled.
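The latency-hiding idea above can be sketched with a background writer thread. This is an illustrative simulation (the sleep stands in for a persistent-storage write; none of the names come from the specification): the checkpoint write proceeds concurrently while the next task computes, so when compute time dominates write time, the checkpoint costs essentially no wall-clock time:

```python
import threading
import time

def checkpoint(data, store, key):
    """Simulated write of checkpoint data to persistent storage."""
    time.sleep(0.05)  # stand-in for storage latency
    store[key] = data

store = {}
output = [i * i for i in range(1000)]  # output of a finished task

# Start the checkpoint write in the background...
writer = threading.Thread(target=checkpoint, args=(output, store, "task1"))
writer.start()

# ...while the next task computes on the same node in the meantime.
next_result = sum(output)

writer.join()  # latency is hidden whenever compute time >> write time
print(next_result, "task1" in store)
```

As the text notes, the overlap is only safe when the next task does not depend on the very data still being checkpointed; in that case the run-time system must wait until the data is persisted and mirrored.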
When a node is detected as failed (by some mechanism such as regular polling intervals from a master node), the system may search for tasks with checkpointed inputs on the failed node's local persistent storage. All of these tasks may then be respawned on other nodes. All tasks that have not started on the failed node but were in the process of starting may be run on a different node.
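The recovery step can be sketched as a small scheduling function. The sketch is hypothetical (function and node names are illustrative, and it abstracts away how checkpoint locations are tracked): given a failed node, it finds the tasks whose checkpointed inputs were held there and assigns each a surviving node for respawning:

```python
def respawn_plan(failed_node, checkpoint_locations, live_nodes):
    """Return {task_id: replacement_node} for every task whose
    checkpointed inputs resided on the failed node.

    checkpoint_locations: {task_id: node_name} recording which node
    holds each task's checkpointed inputs."""
    stranded = [task for task, node in checkpoint_locations.items()
                if node == failed_node]
    # Spread the stranded tasks round-robin over the surviving nodes.
    return {task: live_nodes[i % len(live_nodes)]
            for i, task in enumerate(stranded)}

plan = respawn_plan(
    "node3",
    {"t1": "node1", "t2": "node3", "t3": "node3"},
    ["node1", "node2"],
)
print(plan)  # {'t2': 'node1', 't3': 'node2'}
```

A real run-time system would consult the mirrored copies of those checkpoints rather than the failed node's own storage, and would fold in load information when choosing replacement nodes; the round-robin here is only the simplest possible policy.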
In
In various embodiments of the invention, a generalized actor 601 or 610 may be provided with one or more of the following for specifying tasks and/or task data dependencies: function definitions, procedure definitions, pragmas, annotations, tags, computer language metadata, computer language macros, computer language objects, computer language templates, declarative language constructs, imperative language constructs, glyphs, symbols, or selection of specified sections of task specification via user-interface choices.
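As one hypothetical illustration of the constructs listed above, a declarative annotation can record each task's data dependencies so that a run-time system knows which tasks to restart from checkpointed inputs. The decorator and registry names below are invented for this sketch and are not part of the specification:

```python
TASK_DEPS = {}  # illustrative registry: task name -> names it depends on

def task(*depends_on):
    """Decorator recording a task's data dependencies."""
    def wrap(fn):
        TASK_DEPS[fn.__name__] = list(depends_on)
        return fn
    return wrap

@task()
def load():
    return [1, 2, 3]

@task("load")
def total(xs):
    return sum(xs)

print(TASK_DEPS)  # {'load': [], 'total': ['load']}
```

With the dependency edges captured this way, the same relationships that drive normal scheduling can drive recovery: when a node fails, the system can walk the registry to find which downstream tasks need their inputs replayed.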
In
In
In the preceding description, various memories and/or processor-readable media have been discussed. Such components may comprise, but are not necessarily limited to the following: SRAM, battery-backed SRAM, FLASH memory, FeRAM, MRAM, PCRAM, CBRAM, SONOS, RRAM, Racetrack memory, Carbon Nanotube Memory, Millipede Memory, solid-state-drives, hard-drives, magnetic recording systems, optical drives, optical recording systems, battery-backed DRAM, battery-backed cache memory, capacitor-backed cache memory, contiguous memory, cache memory, main memory, Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Double Datarate Synchronous DRAM (DDR), Synchronous DRAM (SDRAM), Fast-Cycle RAM (FCRAM), Magnetic Random Access Memory (MRAM), Non-Volatile Random Access Memory (NVRAM), Read Only Memory (ROM), Electrically Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), disk storage, Direct Access Storage Device (DASD), Distributed Mass Storage System (DMSS), High Capacity Storage System (HCSS), Hierarchical Storage Management (HSM), Mass Storage Device (MSD), Mass Storage System (MSS), Multiple Virtual Storage (MVS), Network Attached Storage (NAS), Redundant Arrays of Independent Disks (RAID), Storage/System Area Network (SAN), Storage Data Acceleration (SDX), Serial AT Attachment (SATA) devices, Small Computer System Interface (SCSI) devices, Internet Small Computer System Interface (iSCSI) devices, AT Attachment (ATA) devices, Variable Array Storage Technology (VAST), Virtual Storage (VS), Virtual Storage Extended (VSE), Virtual Shared Memory (VSM), and the like.
Various embodiments of the invention have now been discussed in detail; however, the invention should not be understood as being limited to these embodiments. It should also be appreciated that various modifications, adaptations, and alternative embodiments thereof may be made within the scope and spirit of the present invention.
This Application claims the benefit of priority of U.S. Provisional Patent Application No. 61/320,813, filed Apr. 5, 2010, entitled “Task-Oriented Node-Centric Checkpointing,” which is incorporated herein by reference in its entirety.