High-performance computing (HPC) or cluster computing is increasingly used for a large number of computationally intensive tasks, such as web-scale data mining, machine learning, network traffic analysis, and various engineering and scientific tasks. In such systems, jobs may be scheduled to execute concurrently on a computing cluster in which application data is stored on multiple compute nodes.
Previous implementations of HPC clusters have maintained separate node databases in the management and scheduler subsystems (with a one-to-one mapping between the node entries in each subsystem). This can lead to several problems, including the following: (1) Interaction between subsystems is informal and fragile; (2) scalability of a cluster is limited by the least scalable subsystem (for example, a system management subsystem may struggle if there are more than 1000 nodes); and (3) different types of HPC nodes may require different types of management and scheduling solutions.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
A node registrar subsystem is disclosed that, according to one embodiment, is implemented as a service and a database, and acts as a central repository for information about all nodes within an HPC system. The node registrar subsystem formalizes data sharing between the HPC subsystems and allows interaction with heterogeneous subsystems: different types of management, job scheduler, and monitoring solutions. The node registrar subsystem also facilitates scale-out of both the management infrastructure and the job scheduler by delegating responsibility for different nodes to different subsystem instances.
One embodiment is directed to a method of managing nodes in a high-performance computing (HPC) system, which includes a management subsystem and a job scheduler subsystem. The method includes providing a node registrar subsystem. Logical node management functions are performed with the node registrar subsystem. Other management functions are performed with the management subsystem using the node registrar subsystem. Job scheduling functions are performed with the job scheduler subsystem using the node registrar subsystem.
The accompanying drawings are included to provide a further understanding of embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain principles of embodiments. Other embodiments and many of the intended advantages of embodiments will be readily appreciated, as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
In the following Detailed Description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
It is to be understood that features of the various exemplary embodiments described herein may be combined with each other, unless specifically noted otherwise.
The following detailed description is directed to technologies for implementing a node registrar as a central repository for information about all nodes in a high-performance computing (HPC) system. While the subject matter described herein is presented in the general context of program modules that execute in conjunction with the execution of an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
In the embodiments presented herein, the HPC system utilized by the client computer 102 comprises the computing cluster 106. The computing cluster 106 includes a head node 108 and one or more compute nodes 110A-110N (collectively referred to as nodes or compute nodes 110). The head node 108 comprises a computing system responsible for performing tasks such as job management, cluster management, scheduling of tasks, and resource management for all of the compute nodes 110A-110N in the computing cluster 106. The compute nodes 110A-110N are computing systems that perform the actual computations. The computing cluster 106 may have virtually any number of compute nodes 110A-110N. A node or a compute node according to one embodiment is an individually identifiable computer within an HPC system.
It should be appreciated that the network 104 may comprise any type of local area network or wide area network suitable for connecting the client computer 102 and the computing cluster 106. For instance, in one embodiment, the network 104 comprises a high-speed local area network suitable for connecting the client computer 102 and the computing cluster 106. In other embodiments, however, the network 104 may comprise a high-speed wide area network, such as the Internet, for connecting the client computer 102 and the computing cluster 106 over a greater geographical area. It should also be appreciated that the computing cluster 106 may also utilize various high-speed interconnects between the head node 108 and each of the compute nodes 110A-110N.
Computing device 200 may also have additional features/functionality. For example, computing device 200 may also include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in the accompanying drawings.
The various elements of computing device 200 are communicatively coupled together via one or more communication links 215. Computing device 200 also includes one or more communication connections 224 that allow computing device 200 to communicate with other computers/applications 226. Computing device 200 may also include input device(s) 222, such as keyboard, pointing device (e.g., mouse), pen, voice input device, touch input device, etc. Computing device 200 may also include output device(s) 220, such as a display, speakers, printer, etc.
Node registrar subsystem 302 according to one embodiment performs some management functions. In one embodiment, node registrar subsystem 302 performs logical node management (e.g., adding nodes, removing nodes, grouping nodes, and handling state transitions of nodes). Management subsystem 304 according to one embodiment handles: (1) Node deployment (e.g., getting an operating system and HPC Pack running on an actual node); (2) node configuration management (e.g., altering system configuration of a node after initial installation, and then on an ongoing basis); (3) infrastructure configuration management (e.g., altering configuration of network services after cluster setup, and then on an ongoing basis); and (4) node monitoring (e.g., live heat-map and performance charts).
Node registrar subsystem 302 according to one embodiment is implemented as a service and a database, and acts as a central repository for information about all nodes within the HPC system 100 (including, for example, head nodes, compute nodes, broker nodes, workstation nodes, Azure worker nodes, and Azure virtual machine nodes). The node registrar subsystem 302 formalizes data sharing between the HPC subsystems (e.g., between subsystems 302, 304, and 306) and allows interaction with heterogeneous subsystems: different types of management, job scheduler, and monitoring solutions. The node registrar subsystem 302 also facilitates scale-out of both the management infrastructure and the job scheduler by delegating responsibility for different nodes to different subsystem instances, and allows different types of management and job scheduler implementations to run side-by-side.
The node registrar subsystem 302 according to one embodiment maintains information that has common relevance across all HPC node types. In one embodiment, this includes node identifiers (such as name and SID), as well as HPC-logical information (such as type, state, and group membership). The node registrar subsystem 302 additionally maintains resource information about the nodes (e.g., information that job scheduler subsystem 306 uses to make scheduling decisions).
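By way of illustration only, the per-node information described above might be modeled as a record along the lines of the following Python sketch; the field names, and in particular the specific resource fields (cores, sockets, memory), are assumptions used for illustration rather than properties enumerated by this description.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NodeRecord:
    """Illustrative per-node record held by the node registrar (field names are hypothetical)."""
    # Node identifiers, common across all HPC node types
    name: str
    sid: Optional[str] = None              # security identifier, if available
    # HPC-logical information
    node_type: str = "ComputeNode"         # e.g., head node, compute node, broker node, workstation node
    state: str = "Offline"                 # logical node state tracked by the registrar
    groups: set = field(default_factory=set)
    # Resource information the job scheduler consults when making scheduling decisions
    cores: int = 0
    sockets: int = 0
    memory_mb: int = 0
```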
Practically, the node registrar subsystem 302 according to one embodiment efficiently drives the node list (both from a graphical user interface (GUI) and from PowerShell) and acts as an authoritative list of nodes for other components within the HPC system 100. In one embodiment, node registrar subsystem 302 also performs workflows associated with logical changes to the HPC node data, such as adding and removing nodes, updating common node properties, and changing node state.
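A minimal sketch of those logical node management workflows (adding and removing nodes, updating common properties, and changing node state) is shown below; the class, the method names, and the particular state-transition table are hypothetical stand-ins, since this description specifies the functions but not an API.

```python
class NodeRegistrar:
    """Hypothetical in-memory stand-in for the node registrar; the embodiment
    uses a stateless service backed by a SQL database rather than this dict."""

    # Assumed legal state transitions; the actual node states are not enumerated here.
    _TRANSITIONS = {
        "Offline": {"Provisioning", "Online"},
        "Provisioning": {"Online", "Offline"},
        "Online": {"Draining", "Offline"},
        "Draining": {"Offline"},
    }

    def __init__(self):
        self._nodes = {}  # node name -> dict of common node properties

    def add_node(self, name, node_type="ComputeNode"):
        self._nodes[name] = {"type": node_type, "state": "Offline", "groups": set()}

    def remove_node(self, name):
        self._nodes.pop(name, None)

    def update_property(self, name, key, value):
        self._nodes[name][key] = value

    def change_state(self, name, new_state):
        current = self._nodes[name]["state"]
        if new_state not in self._TRANSITIONS.get(current, set()):
            raise ValueError(f"illegal state transition {current} -> {new_state}")
        self._nodes[name]["state"] = new_state

    def add_to_group(self, name, group):
        self._nodes[name]["groups"].add(group)

    def list_nodes(self):
        # The authoritative node list consumed by other HPC components.
        return sorted(self._nodes)
```

For example, registering a node and bringing it online would amount to calling add_node("NODE001") followed by change_state("NODE001", "Online").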
Additional features and advantages of the node registrar subsystem 302 according to one embodiment include the following: (1) The node registrar interfaces are versioned; (2) treatment of shared data between the HPC management 304 and job scheduler 306 components is streamlined through the node registrar 302; (3) HPC management 304 and job scheduler 306 components are explicitly dependent on the node registrar 302 (and not each other); (4) the node registrar 302 supports nodes running with no management component; (5) the node registrar service is stateless and can scale out to meet high availability requirements; (6) the node registrar 302 is integrated with a granular permissions system; (7) the node registrar 302 supports multiple authentication modes; (8) the node registrar 302 can run in Azure, using a SQL Azure store; and (9) the node registrar 302 supports client concurrency, executing both read and write operations against the store.
The node registrar subsystem 302 according to one embodiment includes a node registrar service 408 and a node registrar database 403 stored on a SQL server 402. Within the node registrar service 408, a data access layer (DAL) 502 handles requests against the database 403.
Multiple instances of the node registrar service 408 can run in an active-active configuration against the same database 403 to facilitate high availability. Additionally, each individual node registrar service 408 runs with multiple threads in one embodiment, and there is no locking in the DAL 502 to prevent simultaneous requests to the database 403.
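The stateless, lock-free data access pattern described above might look roughly like the following sketch, which uses Python's built-in sqlite3 module purely as a stand-in for the SQL Server store; the column names and functions are placeholders. The point is only that each request opens its own short-lived connection and concurrency control is left to the database rather than to locks inside the DAL, so any number of service instances (and threads within an instance) can operate against the same database.

```python
import sqlite3  # stand-in for the SQL Server / SQL Azure store used in the embodiment

DB_PATH = "node_registrar.db"  # placeholder connection target

def get_node_state(node_name):
    # Open a connection per request: no shared in-process state, no DAL-level locking.
    conn = sqlite3.connect(DB_PATH)
    try:
        row = conn.execute(
            "SELECT State FROM Node WHERE Name = ?", (node_name,)
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()

def set_node_state(node_name, new_state):
    conn = sqlite3.connect(DB_PATH)
    try:
        with conn:  # transaction scope: commits on success, rolls back on error
            conn.execute(
                "UPDATE Node SET State = ? WHERE Name = ?", (new_state, node_name)
            )
    finally:
        conn.close()
```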
SQL server 402 stores node registrar database 403, which includes a plurality of tables. The tables in database 403 according to one embodiment include a Node table, a NodeProperty table, a NetworkInterface table, a Service table, a NodeGroup table, a GroupMembership table, and a GlobalSettings table. These tables are described in further detail below.
The Node table is the central table of the node registrar 302. In one embodiment, each row in the Node table corresponds to a node in the HPC installation. Node properties that are columns in this table are first-class properties that may be used in filters. All nodes are versioned in one embodiment, so that if semantic changes are made to a node type, the system has the flexibility to exclude that node type in future versions.
The NodeProperty table contains arbitrary id/value pairs associated with particular nodes. These values represent second-class node properties. The id column is indexed for reasonably fast lookups. If a node is deleted, the associated properties are cascade deleted.
The NetworkInterface table stores network interface information for nodes. Each node can have multiple NICs with different MAC addresses.
The Service table contains management and job scheduler components associated with this node registrar. This data serves a few purposes: (1) When a management or scheduler component calls into the node registrar 302, its view of the nodes can be easily scoped to nodes it cares about; (2) the GUI can query the Service table for a list of operation log providers; (3) management and scheduler URIs are associated with each node, allowing the client to find the proper component for data and scenarios that exist outside the node registrar scope.
The NodeGroup table contains a list of HPC Node Groups.
The GroupMembership table provides group membership information for nodes. Each row in this table defines the relationship of a specific node to a specific group. If either the node or node group are deleted, the group membership is cascade deleted.
The GlobalSettings table stores various configuration properties that are common across all active node registrars.
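A schema sketch for the tables described above is given below, again using Python's sqlite3 in place of the SQL Server store. The table names, the cascade deletes on NodeProperty and GroupMembership, the indexed property id, and the per-node MAC addresses follow the description; the remaining columns and types are illustrative assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Node (                 -- central table: one row per node in the HPC installation
    NodeId     INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL UNIQUE,
    Sid        TEXT,
    NodeType   TEXT,
    State      TEXT,
    Version    INTEGER              -- nodes are versioned so changed node-type semantics can be excluded later
);

CREATE TABLE NodeProperty (         -- arbitrary id/value pairs: second-class node properties
    NodeId     INTEGER NOT NULL REFERENCES Node(NodeId) ON DELETE CASCADE,
    PropertyId INTEGER NOT NULL,
    Value      TEXT
);
CREATE INDEX IX_NodeProperty_Id ON NodeProperty(PropertyId);  -- id column indexed for fast lookups

CREATE TABLE NetworkInterface (     -- a node can have multiple NICs with different MAC addresses
    NodeId     INTEGER NOT NULL REFERENCES Node(NodeId),
    MacAddress TEXT NOT NULL
);

CREATE TABLE Service (              -- management and job scheduler components known to this registrar
    ServiceId  INTEGER PRIMARY KEY,
    Kind       TEXT,                -- e.g., management or job scheduler
    Uri        TEXT                 -- lets clients find the proper component for out-of-scope scenarios
);

CREATE TABLE NodeGroup (            -- list of HPC node groups
    GroupId    INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL UNIQUE
);

CREATE TABLE GroupMembership (      -- one row per (node, group) relationship
    NodeId     INTEGER NOT NULL REFERENCES Node(NodeId) ON DELETE CASCADE,
    GroupId    INTEGER NOT NULL REFERENCES NodeGroup(GroupId) ON DELETE CASCADE
);

CREATE TABLE GlobalSettings (       -- configuration common to all active node registrars
    Name       TEXT PRIMARY KEY,
    Value      TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the cascades above in SQLite
conn.executescript(SCHEMA)
```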
In one embodiment, the management subsystem 304 and the job scheduler subsystem 306 in method 600 are each a client of the node registrar subsystem 302. The node registrar subsystem 302 in method 600 according to one embodiment comprises a stateless node registrar service 408 and a database 403 for storing node information for the nodes in the HPC system 100. In one embodiment of method 600, the management subsystem 304 is configured to access the stored node information, update node properties in the database 403, and query the nodes by property and by group, using the node registrar service 408. In one embodiment of method 600, the job scheduler subsystem 306 is configured to access the stored node information, update node properties in the database 403, and query the nodes by property and by group, using the node registrar service 408. The database 403 of the node registrar subsystem 302 in method 600 according to one embodiment includes a node table, with each row in the node table corresponding to one of the nodes in the HPC system 100, and each column listing properties of the nodes in the HPC system 100. The logical node management functions performed by the node registrar subsystem 302 in method 600 according to one embodiment include adding nodes, removing nodes, updating node properties, handling state transitions of nodes, and grouping nodes. The other management functions performed with the management subsystem in method 600 according to one embodiment include node deployment, node configuration management, infrastructure configuration management, and node monitoring.
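To make the client relationship concrete, a rough sketch follows of how either the management subsystem or the job scheduler might use a registrar client to read node information, update a property, and query nodes by property and by group; the client class and its methods are hypothetical and stand in for calls to the stateless node registrar service.

```python
class RegistrarClient:
    """Hypothetical client wrapper around the node registrar service; both the
    management subsystem and the job scheduler act as clients of the registrar."""

    def __init__(self, nodes):
        # nodes: name -> {"state": ..., "groups": set(...), plus arbitrary properties}
        self._nodes = nodes

    def get_node(self, name):
        return self._nodes[name]

    def update_property(self, name, key, value):
        self._nodes[name][key] = value

    def query_by_property(self, key, value):
        return [n for n, props in self._nodes.items() if props.get(key) == value]

    def query_by_group(self, group):
        return [n for n, props in self._nodes.items() if group in props.get("groups", set())]


# Example: the job scheduler asks for online nodes in the "ComputeNodes" group,
# while the management subsystem updates a node's state through the same registrar.
client = RegistrarClient({
    "NODE001": {"state": "Online", "groups": {"ComputeNodes"}},
    "NODE002": {"state": "Offline", "groups": {"ComputeNodes"}},
})
candidates = [n for n in client.query_by_group("ComputeNodes")
              if client.get_node(n)["state"] == "Online"]
client.update_property("NODE001", "state", "Draining")
```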
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.
This application is a continuation of U.S. application Ser. No. 13/162,130, filed Jun. 16, 2011, the specification of which is incorporated by reference herein.
Number | Date | Country
---|---|---
20160004563 A1 | Jan 2016 | US
Relation | Number | Date | Country
---|---|---|---
Parent | 13162130 | Jun 2011 | US
Child | 14741807 | | US