This application is the U.S. National Stage of PCT/FR2015/053368, filed Dec. 8, 2015, which in turn claims priority to French Patent Application No. 1462157, filed Dec. 10, 2014, the entire contents of both of which are incorporated herein by reference.
The invention relates to a method for managing a network of compute nodes.
A network of compute nodes is understood to mean, in the context of the present invention, any network of machines, where a machine is at least one of: a computer, a server, a blade server, etc. The invention relates in particular to clusters of servers, i.e. supercomputers or high-performance computers. It also relates to the field of high-performance computing, referred to as HPC.
Current supercomputers have a computation power of the order of one petaflop (10^15 floating-point operations per second (flops)). This level of performance is attained by operating 5,000 to 6,000 computers/servers simultaneously, interconnected using specific topologies. These interconnections are made using switches and wires. In the context of a supercomputer, a computer/server is referred to as a “compute node”, or simply a “node”.
The networks used in this high-performance computing field are very specialised and require suitable management. Typically, these services, known as “Fabric Management” or “Fabric Manager”, must provide routing functions, in order to make communications possible, but also the acquisition and processing of production data (error feedback and operating counters).
The switches, and therefore the network, are administered by a management node connected to the network. The management node manages the compute nodes and switches via the same physical network. The switch management protocol depends on the nature of the switches. In practice, in a supercomputer, InfiniBand switches are used; the protocol used is therefore defined by the InfiniBand specification.
The management node, implementing a management function, enables the switches to be configured and supervised. The number of failures grows with the number of switches, and therefore becomes high. Requirements for analysis of the properties of the supercomputer are also substantial. This implies that there are many maintenance communications over the network. A maintenance communication is a use of resources which is not related to the computations required of the supercomputer. This relates, for example, to an alert message transmitted by a switch, one of whose ports is defective, or a collection message transmitted by a management node to obtain statistics relating to a switch.
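Purely by way of illustration, the short sketch below (in Python; the message fields are assumptions chosen for the example and are not taken from the InfiniBand specification) models the two kinds of maintenance message just mentioned: an alert raised by a switch with a defective port, and a counter-collection request sent by a management node.

```python
from dataclasses import dataclass

@dataclass
class PortAlert:
    """Alert transmitted by a switch when one of its ports is defective."""
    switch_id: str      # identifier of the reporting switch
    port_number: int    # index of the defective port
    error_code: int     # fabric- or vendor-specific error code (assumed field)

@dataclass
class CounterCollectionRequest:
    """Collection message sent by a management node to obtain statistics on a switch."""
    switch_id: str       # switch whose counters are requested
    counters: list[str]  # e.g. ["xmit_data", "rcv_errors"] (assumed counter names)

# Both kinds of message are maintenance traffic: they consume network resources
# without contributing to the computations required of the supercomputer.
```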
An increase in the power of supercomputers implies an increase in the number of nodes, and therefore in the number of interconnections, and therefore also in the number of network maintenance messages between the switches and the management node.
This has two negative consequences:
The collapse phenomenon is observed in supercomputers with 8,000 nodes and around a thousand switches. Such a supercomputer does not attain an exaflop (10^18 flops), which is nevertheless the current goal of research into supercomputers.
The existing solutions are currently centralised. One example is the pair {OpenSM, IBMS}, which is the solution proposed by certain supercomputer manufacturers in order to manage InfiniBand (IB) type networks.
OpenSM, which is responsible for acquiring the topology and for routing, starts to cause substantial latency in the calculation of routes, while IBMS, which centralises the errors in a single database, results in a high CPU load.
Furthermore, there are few error correlation possibilities, which is particularly regrettable since this function becomes invaluable at these scales.
Finally, operating data (error and performance counters) can be managed only laboriously, since individual requests, which are made from a single central point, must be transmitted and then assembled.
The invention seeks to remedy all or some of the disadvantages of the state of the art identified above, and in particular to propose means for addressing the increasing complexity of supercomputers.
With this aim, the present invention proposes fine modelling of these roles and an implementation which is divided into individual although interconnected and distributable modules. Each module can, in turn, be distributed according to a hierarchical arrangement of the supercomputer, which enables the invention to operate up to the sizes of supercomputers which are the subject of active research, i.e. exaflopic supercomputers, and even more powerful ones.
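Purely as an illustration of this decomposition, the sketch below (the module names are assumptions chosen for the example, not an exhaustive list) shows how individual, interconnected modules could be registered in a management node; each management node of the hierarchy would host its own instances.

```python
class Module:
    """Base class for a fabric-management module (routing, errors, performance, ...)."""
    def __init__(self, name: str):
        self.name = name

    def start(self) -> None:
        print(f"module {self.name} started")

class ManagementNode:
    """A management node hosting a set of independent, distributable modules."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.modules: dict[str, Module] = {}

    def register(self, module: Module) -> None:
        self.modules[module.name] = module

    def start_all(self) -> None:
        for module in self.modules.values():
            module.start()

# Hypothetical module names; each group of switches gets its own management node.
node = ManagementNode("mgmt-0")
for name in ("routing", "error_collection", "network_performance"):
    node.register(Module(name))
node.start_all()
```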
To this end, one aspect of the invention relates to a method for managing a network of compute nodes interconnected by a plurality of interconnection devices characterised by the fact that it includes the following steps:
In addition to the main characteristics described in the preceding paragraph, the method/device according to the invention may have one or more of the following possible additional characteristics, considered individually or in technically possible combinations:
Another object of the invention is a digital storage device including a file with instruction codes implementing the method according to one of the previous claims.
Another object of the invention is a device implementing a method according to a combination of the above characteristics.
Other characteristics and advantages of the invention will be seen clearly on reading the description below, with reference to the appended figures, which illustrate:
For greater clarity, identical or similar elements are identified by identical reference signs in all the figures.
The invention will be better understood on reading the description which follows, and on examining the figures accompanying it. These are shown as an indication only, and are not restrictive of the invention in any manner.
In practice all the compute nodes of the first group of compute nodes are connected to ports of the same kind. These are InfiniBand ports, or equivalent. Each switch in first group S1 of switches is itself connected to at least one other switch in first group S1 of switches. This enables all the compute nodes of first group G1 of compute nodes to be connected to one another, and a network to be established between them by this means. The various physical connections and the corresponding wires are not shown, in order not to overcomplicate the figures.
The connectors, or ports, are physical communication interfaces or network interfaces.
When an action is attributed to a device, it is in fact performed by a microprocessor of the device, controlled by instruction codes recorded in a memory of the device. If an action is attributed to an application, it is in fact performed by a microprocessor of the device in a memory of which the instruction codes of the application are recorded. From a functional standpoint, for the purposes of this paragraph, no distinction is made between a microprocessor, a microcontroller and an arithmetic and logic unit.
The second group of compute nodes and the second group of switches associated with the second group of compute nodes consist of elements identical to those of the first group of compute nodes associated with the first group of switches. The switches of the second group of switches are functionally identical to the switches of the first group of switches. Organisation of these elements may vary from one group to the next, or be identical.
Those skilled in the art will easily understand that the number of groups is not limited to two, but that a description of two is sufficient to illustrate the invention.
The network formed by means of the plurality of connectors is therefore dedicated to the compute nodes. In a general sense, “out-of-band” means that the signals called “out-of-band signals” are exchanged over channels or links which do not influence the performance of the channels or links used by the device to perform its main functions.
The module areas contain instruction codes, execution of which corresponds to the module's functions.
In an out-of-band variant of the invention the first management node is:
In this out-of-band variant the messages exchanged between the management node and the switches of a switch group do not travel on the same wires as the messages exchanged between the management node and the level-2 management node. More generally, this means that a bandwidth dedicated to the exchange of messages between the management node and a group of switches can be allocated. No other message will be able to use this bandwidth. This can be obtained by physical means, by physically separating the networks as has been illustrated, but it can also be accomplished using switches capable of managing a service quality, or QoS, contract.
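As a rough sketch of these two options, the example below (the link names and the boolean "reserved" flag are assumptions made for the illustration) selects either a physically separate link or a QoS-reserved share for maintenance traffic, so that no other message uses that bandwidth.

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    reserved_for_management: bool  # physically separate wire, or QoS-reserved share

def pick_link(is_management_message: bool, links: list[Link]) -> Link:
    """In the out-of-band variant, maintenance traffic uses a link (or QoS share)
    on which no other message is allowed; compute traffic uses the other links."""
    for link in links:
        if link.reserved_for_management == is_management_message:
            return link
    raise RuntimeError("no suitable link available")

# Hypothetical links: a dedicated management wire and the compute fabric.
links = [Link("mgmt-eth0", True), Link("ib-fabric", False)]
print(pick_link(True, links).name)
```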
A group of nodes is obtained by this means. Groups of nodes connected in this manner are also interconnected by making connections between the switches of the various groups of nodes. Grouping nodes therefore amounts to grouping switches. Every compute node connected to a switch in a group of switches, defined according to the invention, forms part of the same group of compute nodes. In other words, all nodes directly connected to a given switch form part of the same group of compute nodes. Such a group also contains a plurality of switches.
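A minimal sketch of this grouping rule, assuming the cabling is available as a simple node-to-switch map and that the assignment of switches to switch groups is already known, might look as follows.

```python
from collections import defaultdict

def group_compute_nodes(node_to_switch: dict[str, str],
                        switch_to_group: dict[str, str]) -> dict[str, set[str]]:
    """Every compute node directly connected to a switch of a given switch
    group belongs to the corresponding group of compute nodes."""
    groups: dict[str, set[str]] = defaultdict(set)
    for node, switch in node_to_switch.items():
        groups[switch_to_group[switch]].add(node)
    return groups

# Hypothetical cabling: nodes n0..n3 attached to switches sw0/sw1 of group "G1".
print(group_compute_nodes(
    {"n0": "sw0", "n1": "sw0", "n2": "sw1", "n3": "sw1"},
    {"sw0": "G1", "sw1": "G1"},
))
```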
The switches in question are interconnection devices as described for
When these connections have been made the supercomputer, corresponding to all the compute nodes of the groups of compute nodes, can be started. Start-up of a supercomputer includes a step of updating the routing tables of the switches contained in the supercomputer. This update is accomplished according to an initial configuration known to the supercomputer.
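The start-up step can be pictured as in the sketch below; the table format (destination to output port) and the initial configuration are assumptions made for the example, since the actual routing tables are defined by the fabric technology in use.

```python
def update_routing_tables(switches: dict[str, dict[str, int]],
                          initial_configuration: dict[str, dict[str, int]]) -> None:
    """At supercomputer start-up, load into each switch the routing table
    given by the initial, known configuration (destination -> output port)."""
    for switch_id, table in initial_configuration.items():
        switches[switch_id] = dict(table)

# Hypothetical initial configuration for two switches.
fabric: dict[str, dict[str, int]] = {}
update_routing_tables(fabric, {"sw0": {"n2": 3, "n3": 3}, "sw1": {"n0": 1, "n1": 1}})
print(fabric)
```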
When the initial settings have been written into the various elements (switches, management nodes and level-2 management node), the various management services are started in the management nodes and the level-2 management node. These management services correspond to the previously described modules.
One feature of the invention is that these services are executed in an independent and decentralised manner. Each service corresponds to at least one process in the device which implements it. These are modules
The services executed by each node include at least:
A management node also implements a communication mechanism between the modules. Such a mechanism is asynchronous, such that management of the messages is not blocking for the various services.
Such an asynchronous mechanism is, for example, a subscription mechanism. A service subscribes to transmitted messages. When a message to which it is subscribed is published it will then read it.
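A minimal, in-process sketch of such a subscription mechanism is given below; the topic names and the queue-based delivery are assumptions made for the example. Publication does not block the publishing service, and a subscriber reads a published message when it chooses to.

```python
from collections import defaultdict
from queue import Queue

class MessageBus:
    """Very small publish/subscribe mechanism between management services."""
    def __init__(self):
        self._queues: dict[str, list[Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> Queue:
        q: Queue = Queue()
        self._queues[topic].append(q)
        return q

    def publish(self, topic: str, message: object) -> None:
        # Non-blocking for the publisher: messages are queued for subscribers.
        for q in self._queues[topic]:
            q.put(message)

# Hypothetical use: the routing service subscribes to port-failure events.
bus = MessageBus()
routing_inbox = bus.subscribe("port_failure")
bus.publish("port_failure", {"switch": "sw0", "port": 3})
print(routing_inbox.get())
```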
Such a mechanism allows management of the switches to be optimised. For example, we can consider a scenario in which a connector of a switch malfunctions. In this case:
The times given above apply to topologies with more than 50,000 compute nodes. With smaller topologies these times are reduced accordingly.
In the above example it can be seen that, with the invention, the calculation of these routing tables can start directly when the malfunction is detected. If the hypotheses relating to the topology have not changed, the routing calculation is accomplished in a few seconds, in an execution time comparable to that obtained before the invention. Conversely, if the recalculation of the hypotheses, according to the invention, reveals that the initial hypothesis is no longer valid, the routing calculation is interrupted and restarted with an algorithm called an “agnostic routing algorithm”, i.e. an algorithm which is insensitive to the topology. Before the invention, the calculation was performed according to the initial hypothesis, and only if it failed was the topology-insensitive algorithm started.
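The decision logic described above can be summarised by the following simplified sketch (sequential rather than parallel, with function names that are assumptions made for the example).

```python
def recompute_routes(topology, initial_hypothesis,
                     hypothesis_still_valid, route_with_hypothesis, route_agnostic):
    """Routing recalculation triggered as soon as a malfunction is detected.

    If the initial topology hypothesis still holds, the specialised algorithm
    runs as before, in a comparable execution time. Otherwise the calculation
    is abandoned and restarted with a topology-agnostic routing algorithm.
    """
    if hypothesis_still_valid(topology, initial_hypothesis):
        return route_with_hypothesis(topology, initial_hypothesis)
    return route_agnostic(topology)
```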
In addition, with the invention these calculations can be made simultaneously in several nodes, i.e. two malfunctions occurring in two switches of two different compute node groups can be managed simultaneously.
Another appropriate communication mechanism would be a letterbox mechanism, where each service has its own letterbox address, and all letterboxes are managed by the communication service.
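A comparable letterbox sketch, in which addresses and delivery are handled by a single communication service (again an illustrative assumption), could look like this.

```python
from queue import Queue

class LetterboxService:
    """Communication service managing one letterbox (queue) per management service."""
    def __init__(self):
        self._boxes: dict[str, Queue] = {}

    def register(self, address: str) -> Queue:
        self._boxes[address] = Queue()
        return self._boxes[address]

    def send(self, address: str, message: object) -> None:
        self._boxes[address].put(message)

# Hypothetical addresses: the routing service owns the "routing" letterbox.
post = LetterboxService()
routing_box = post.register("routing")
post.send("routing", {"event": "recompute"})
print(routing_box.get())
```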
In a variant of the invention a management node also executes a network performance measurement service. This is the network performance module. The network performance indicators are, for example:
This data is available in the switches and is collected at regular intervals by the management nodes. It is then aggregated by the management node and transmitted after aggregation, possibly together with data produced by the other modules.
This enables the bandwidth used by the management node for transmission of performance data to be controlled.
The network performance module also stores the performance data, enabling the supercomputer's unprocessed performance data to be divided between the management nodes. A query can then be made from the level-2 management node, which will then search for the data in the management nodes according to the nature of the query. A form of storage is therefore obtained which is distributed over the management nodes, avoiding any use of a centralised storage device. This is of interest since, in order to be tolerant to malfunctions, such a centralised device must implement system redundancies. With the invention these redundancies are natural since they take the form of management nodes which are physically independent from one another. In addition, a centralised system must be able to have a large storage capacity for the data produced by all the supercomputer's switches. With the invention each management node has a storage capacity corresponding to the data produced by the switches which it manages.
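The distributed query described here can be sketched as follows; the storage layout (an in-memory list per management node) and the form of the query are assumptions made for the example.

```python
class ManagementNodeStore:
    """Each management node keeps only the performance data of the switches it manages."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.records: list[dict] = []

    def store(self, record: dict) -> None:
        self.records.append(record)

    def query(self, predicate) -> list[dict]:
        return [r for r in self.records if predicate(r)]

def level2_query(management_nodes, predicate):
    """The level-2 management node fans the query out to the management nodes
    and assembles their partial results, avoiding any centralised storage device."""
    results = []
    for node in management_nodes:
        results.extend(node.query(predicate))
    return results

# Hypothetical data spread over two management nodes.
m0, m1 = ManagementNodeStore("mgmt-0"), ManagementNodeStore("mgmt-1")
m0.store({"switch": "sw0", "rcv_errors": 2})
m1.store({"switch": "sw7", "rcv_errors": 0})
print(level2_query([m0, m1], lambda r: r["rcv_errors"] > 0))
```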
With the invention, a large and resilient storage capacity is obtained with less complexity than an equivalent centralised capacity.
With the invention all the functions of a conventional “fabric manager” can be found, but they are distributed and executed in such a way that they can be implemented even in the context of a supercomputer with over 1000 switches.
The invention enables known supercomputer design processes to continue to be used, in particular in terms of the layout of the wiring topology, but with a management mode which is much more responsive and appropriate for increasingly large topologies.
Implementation of the method therefore enables an operational and manageable supercomputer to be obtained.