This application claims priority to European Patent Application Number 23307012.7 filed 21 Nov. 2023, the specification of which is hereby incorporated herein by reference.
The technical field of the invention is that of high-performance computing (HPC).
At least one embodiment of the invention relates to a method and a system for inter- and intra-cluster communication, in particular by configuring network cards in a particular way.
High-performance computers (HPCs) are typically distributed in clusters, in order to spread the execution of applications across several machines. These high-performance computing applications require significant computing resources that cannot be installed on a single machine. For the largest calculations, between 10,000 and 100,000 machines are sometimes needed, and these machines are grouped together in clusters.
A cluster is a set of machines, often designed for the same application, at the same time, and with the same components. A cluster, for example, has a predefined topology, adapted to the execution of a particular application or type of application.
To interconnect these machines and create a cluster, specialized networks are required, such as those implementing interconnection protocols like Infiniband® or BXI® (Bull. eXascale Interconnect).
A problem arises when connecting several clusters to one another. Such a problem may arise, for example, if a user wants to combine an existing cluster with a new one, or with another existing cluster. At present, interconnecting several clusters to one another is not done, as each cluster is dedicated to the implementation of a specific application. To interconnect several clusters, it is often decided to use the Internet protocol suite to exchange data, with machines acting as “bridges” or “gateways”. These gateways copy data from the intra-cluster interconnection protocol, e.g. BXI, to the network protocol used to interconnect the two clusters, e.g. IP. But this type of cluster interconnection is mainly used to exchange data between applications, and does not deliver acceptable performance when running the same application on several clusters.
BXI networks use the Portals® application programming interface (API) as a communication protocol for inter-node communications.
A schematic representation of two interconnected clusters according to the prior art is shown in
In the first-generation BXI interconnection network, the network cards and switches use their own link protocol (layer 2 of the OSI network model). This approach makes sense when considering a closed computing cluster as planned for this generation and as it currently exists. In this approach, all OSI layer 3 packets must be encapsulated in a BXI frame to traverse the network. BXI switches 114 and 124 are in charge of switching frames to the right destination using the Portals NID (with NID standing for Network Identifier, which identifies the node's network card) present in the level 2 header, while relying on the cluster topology built by the AFM 113 and 123. In this approach, all devices connected to the BXI network must be compatible with the BXI level 2 link layer. The use of general-purpose switches or routers is therefore not permitted. To overcome this constraint, network gateways 115 and 125 have been set up, with BXI cards on the cluster side and standard Ethernet or InfiniBand cards on the side of the network 2.
Moreover, none of the existing cluster networks is able to take the existence of another cluster into account. It is therefore not possible for a first node 111 of the first cluster 11 to directly address another node 121 of the second cluster 12.
There is therefore a need for an inter-cluster communication solution.
At least one embodiment of the invention offers a solution to the above-mentioned problems, enabling high-performance inter-cluster data exchanges.
One or more embodiments of the invention relate to a high-performance computer comprising a plurality of clusters interconnected by an IP network, each cluster comprising:
By way of at least one embodiment of the invention, it is possible to carry out inter-cluster data exchanges via a high-performance protocol such as BXI or Infiniband, as long as this protocol is at least partially Ethernet-based. This is made possible by simply configuring the physical (that is, non-virtual) network cards of the cluster nodes, initializing them with a predefined routing table and predefined instructions for transmitting an inter-cluster data request. This is also made possible by the implementation of particular naming of the various components of the network, and the use of an address resolution protocol to obtain the address of the target gateway. Finally, the invention enables a significant simplification of the network gateways of the clusters, using simple IP routers rather than complex, costly gateways comprising two network cards and means dedicated to translating one network protocol into another, while at the same time improving inter-cluster data exchange performance.
In addition to the features mentioned in the preceding paragraph, the system according to at least one embodiment of the invention may have one or more complementary features from the following, taken individually or in any technically possible combination:
At least one embodiment of the invention relates to a method of inter-cluster communication in a high-performance computer according to the invention, the method comprising:
In at least one embodiment, the method further comprises, after the transcription step and before the step of encapsulation in an IP packet, a step of comparison, by the network card of the sending node, of the cluster identifier of the network identifier of the destination instance with the cluster identifier of the sending network card, the inter-cluster communication method being continued only if the cluster identifier of the network identifier of the destination instance is different from the cluster identifier of the sending network card.
In at least one embodiment, the method further comprises, between the step of encapsulating the request in the IP packet and the encapsulation of the IP packet in the Ethernet frame, a step of sending, by the network card of the sending node, an address resolution request based on an IP address of the gateway of the first cluster in order to obtain a physical address of the gateway of the first cluster, the IP address of the gateway of the first cluster being stored in the second routing table in association with the identifier of the second cluster.
In at least one embodiment, the method further comprises the sending of an acknowledgment of receipt of the at least one data item, by the receiving network card, to the sending network card.
In at least one embodiment, the high-performance network library used is the Portals® network library, and the format used by the network library is an identifier comprising the network identifier of the destination instance and the process identifier of the destination instance.
In at least one embodiment, the IP packet comprises a header indicating that the encapsulated request is a Portals® request.
One or more embodiments of the invention and its different applications will be better understood upon reading the following disclosure and examining the accompanying figures.
The figures are presented by way of reference and are in no way limiting to the invention.
Unless otherwise stated, the same element appearing in different figures has the same reference.
At least one embodiment of the invention relates to a system and a method for exchanging data between clusters of the system. The system comprises several computing clusters. The system is, for example, a high-performance computing center, configured to run a high-performance computing application. The components described below are physical components, allowing simple implementation and avoiding virtualization, that is the implementation of virtual networks.
The system according to one or more embodiments of the invention is shown schematically in
In order for the clusters to behave like Ethernet subnetworks according to at least one embodiment of the invention, certain physical components of the clusters are modified, in particular the compute nodes.
Firstly, each cluster's physical gateway linking the cluster to the network 2 is a simple physical Ethernet router, that is with physical Ethernet ports. In this way, the gateways 215 and 225 in the clusters 21 and 22 respectively are physical Ethernet routers, in contrast to the prior art, which comprises complex gateways with several network cards. The gateways are physical devices (that is not virtualized devices). Each gateway stores a first routing table, associating a gateway address with a destination IP address comprised in the cluster comprising the gateway. For example, such a routing table stored by gateway 215 comprises the following association:
Each cluster 21 and 22 comprises at least one interconnection switch 214 and 224 respectively. These interconnection switches 214 and 224 are physical devices. For example, these interconnection switches use a high-performance Ethernet-based interconnection protocol, such as BXI® or Infiniband®.
These interconnection switches, 214 and 224 respectively, connect all the nodes in their cluster 21 and 22 respectively to the gateway 215 and 225 respectively. In this way, the interconnection switch 214 interconnects the nodes 211 and 212 and the gateway 215. The interconnection switch 224 interconnects the nodes 221 and 222 and the gateway 225.
The nodes of the cluster are configured to route outgoing data outside the cluster and incoming data inside the cluster.
For this purpose, the node N comprises a network card NIC1. The network card NIC1 is a physical device. The network card NIC1 implements high-performance Ethernet-based interconnection protocols such as BXI® or Infiniband®. The network card NIC1 enables the node N to communicate within and outside the cluster. To achieve this, the virtual machines VMID1 and VMID2 each comprise a virtual port, BXI1 and BXI2 respectively, configured to communicate with the network card NIC1. This enables the network card NIC1 to address the virtual machines VMID1 and VMID2.
According to one or more embodiments of the invention, the node N, which is a physical device, comprises two routing tables. A first routing table, stored by the node N in a memory it comprises, is a routing table comprising, for each other cluster of the system, an association of an application instance identifier (also called “rank”) with the cluster comprising the node running the instance, with a unique network identifier (NID, described later) of the application instance and with an IP address of the network card NIC1 of the node running the application instance. For example, the first routing table of the node 211 of the cluster 21 comprises the identifiers (rank) of all application instances running in the cluster 22 associated with a cluster 22 identifier, an application instance NID, and an IP address of the network card NIC1 of the node 222 of the cluster 22. For example, the first routing table of the node 211 of the cluster 21 comprises at least the following associations:
In addition, the first routing table may comprise the same information for all application instances running in cluster 21, that is the first routing table comprises all information concerning all instances running in the system.
This first routing table is used by the network card NIC1 when the instance being run by one of the two virtual machines VMID1 or VMID2 wants to send data to an instance of the application being run by cluster 22, in order to determine whether the destination instance belongs to cluster 21 or not. To do this, the network card NIC1 can store the first routing table, when the application instance being run by the node registers with the network card NIC1. Alternatively, in at least one embodiment, the first routing table can be loaded into the application instance being run by the node N and, on a network transfer request, the application instance then transmits the data from the first routing table to the network card NIC1.
The network card NIC1 stores a second routing table comprising, for each other cluster in the system, an IP address of the gateway of the first cluster (here, for example, cluster 21) associated with the identifier of the other cluster. Thus, for the system shown in
This second routing table is used by the network card NIC1 when the instance being run by one of the two virtual machines VMID1 or VMID2 wants to send data to an instance of the application being run by the cluster 22, to find out which gateway to use in its cluster 21 to reach cluster 22. In the example in
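By way of illustration only, the two routing tables described above may be sketched as follows. This is a minimal, non-limiting sketch in Python; the dictionary layouts, ranks, cluster identifiers, NIDs and IP addresses are hypothetical values chosen for the example and are not taken from the figures.

```python
# Minimal sketch (hypothetical values): the first routing table maps each
# application instance identifier ("rank") to the cluster running it, the
# instance's unique network identifier (NID) and the IP address of the
# network card NIC1 of the node running it.
FIRST_ROUTING_TABLE = {
    # rank: (cluster identifier, instance NID, IP address of the node's NIC1)
    7: (22, "22-VMID2-10", "10.22.0.2"),
    3: (21, "21-VMID1-10", "10.21.0.1"),
}

# The second routing table associates, for each other cluster of the system,
# the IP address of the gateway of the local cluster (here cluster 21) to be
# used to reach that cluster.
SECOND_ROUTING_TABLE = {
    # destination cluster identifier: IP address of the local gateway
    22: "10.21.0.254",  # hypothetical IP address of the gateway 215
}

LOCAL_CLUSTER_ID = 21  # cluster identifier of the sending network card NIC1


def resolve_destination(rank: int):
    """Return (destination NID, destination NIC1 IP, local gateway IP or None).

    The gateway IP is None when the destination instance runs in the local
    cluster, i.e. when the communication is intra-cluster.
    """
    cluster_id, nid, nic_ip = FIRST_ROUTING_TABLE[rank]
    if cluster_id == LOCAL_CLUSTER_ID:
        return nid, nic_ip, None
    return nid, nic_ip, SECOND_ROUTING_TABLE[cluster_id]


if __name__ == "__main__":
    print(resolve_destination(7))  # inter-cluster: goes through the gateway 215
    print(resolve_destination(3))  # intra-cluster: no gateway needed
```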
Each network card NIC1 uses a high-performance communications library, such as the Portals® library when the protocol is BXI®, or the Verbs® library when the protocol is Infiniband®. For example, Portals® version 4 can be used with BXI® version 2.
At least one embodiment of the invention also covers a method for exchanging data, that is communication, between one cluster of the system and another cluster of the system. The method 4 is shown schematically in
The method 4 comprises a first step 41 of initializing the communication library. To achieve this, each instance of the high-performance computing application running in the system is assigned a unique network identifier and a process identifier (PID), so that it can be addressed by the other instances. The instance's unique network identifier (NID) is a triplet, which will be, for example, a Portals identifier if this network library is used. This Portals NID consists of three fields:
An example of the NID of an instance of the high-performance computing application is 21-VMID1-10. In the initialization step 41, the unique network identifier NID, the IP address associated with the NID and the unique process identifier PID of each application instance are distributed to all participants, that is to all application instances and to the communication library engines of the network cards of the nodes running them.
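Purely as an illustration, the addressing information distributed at the initialization step 41 may be gathered in a small record per application instance, as sketched below. The decomposition of the NID into a cluster identifier, a virtual machine identifier and a third, node-local field is only an assumption inferred from the example 21-VMID1-10, and the field names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceAddress:
    """Addressing record distributed to all participants at initialization.

    The three NID fields below are an assumption inferred from the example
    NID "21-VMID1-10": a cluster identifier, a virtual machine identifier
    and a third, node-local field. Field names are hypothetical.
    """
    cluster_id: int  # e.g. 21
    vm_id: str       # e.g. "VMID1"
    local_id: int    # e.g. 10 (assumed meaning of the third field)
    pid: int         # process identifier (PID) of the instance
    ip: str          # IP address associated with the NID

    @property
    def nid(self) -> str:
        return f"{self.cluster_id}-{self.vm_id}-{self.local_id}"


if __name__ == "__main__":
    instance = InstanceAddress(cluster_id=21, vm_id="VMID1", local_id=10,
                               pid=5, ip="10.21.0.1")
    print(instance.nid)  # -> "21-VMID1-10"
```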
The method 4 then comprises a step 42 wherein the instance of the high-performance computing application being run by the virtual machine VMID1 of the node 211 of cluster 21 sends a data request to the instance being run by the virtual machine VMID2 of the node 222 of cluster 22. The request comprises an identifier of the destination instance.
In step 43, this request is received by the network card NIC1 of node 211, known as the sending node.
The method 4 then comprises a step 44 wherein the network card NIC1 of node 211 transcribes the received request into a request of the network library of the high-performance interconnection protocol, for example into a Portals® request if the protocol used is BXI®. The transcription 44 of the request further comprises the transcription of the destination instance identifier into a unique network identifier in the communication library, that is Portals in this example. To do this, this unique Portals network identifier is used with the Portals process identifier (PID) of the destination instance to form a Portals identifier “ptl_process_t” used for Portals network operations. A ptl_process_t identifier will therefore uniquely identify a Portals process within a computing center, and therefore within the system.
The method 4 then comprises a step 45 wherein the network card NIC1 of the sending node 211 compares the cluster identifier comprised in the network identifier of the destination instance with the cluster identifier of the sending network card NIC1. For example, in this case, the cluster identifier of the sending network card is 21 and the cluster identifier comprised in the network identifier of the destination instance is 22. This comparison 45 enables the sending network card NIC1 to determine whether the request is destined for the cluster 21 itself or for another cluster, in this case the cluster 22. If the cluster identifier comprised in the network identifier of the destination instance is different from the cluster identifier of the sending network card, the sending network card knows that the communication is inter-cluster and the process continues with the following steps.
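Steps 44 and 45 may be sketched as follows, again for illustration only. The sketch uses a simple Python analogue of the ptl_process_t identifier rather than the real Portals data structure, and the routing table excerpt, ranks, PIDs and addresses are hypothetical.

```python
from typing import NamedTuple


class LibraryProcessId(NamedTuple):
    """Simple analogue of a Portals-style process identifier combining a NID
    and a PID; it does not reproduce the real ptl_process_t C structure."""
    nid: str
    pid: int


# Hypothetical excerpt of the first routing table of the node 211 (cluster 21):
# rank -> (cluster identifier, instance NID, IP address of the node's NIC1)
FIRST_ROUTING_TABLE = {
    7: (22, "22-VMID2-10", "10.22.0.2"),
}

# Process identifiers distributed at the initialization step 41 (hypothetical).
PID_TABLE = {7: 5}

SENDING_CLUSTER_ID = 21  # cluster identifier of the sending network card


def transcribe_and_check(destination_rank: int):
    """Step 44: transcribe the destination rank into a library identifier.
    Step 45: compare cluster identifiers to detect an inter-cluster request."""
    cluster_id, nid, dest_ip = FIRST_ROUTING_TABLE[destination_rank]
    process = LibraryProcessId(nid=nid, pid=PID_TABLE[destination_rank])
    # The cluster identifier is also carried in the first field of the NID.
    inter_cluster = cluster_id != SENDING_CLUSTER_ID
    return process, dest_ip, inter_cluster


if __name__ == "__main__":
    process, dest_ip, inter_cluster = transcribe_and_check(7)
    print(process, dest_ip, inter_cluster)  # inter_cluster is True here
```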
The method 4 then comprises a step 46 wherein the network card NIC1 of the sending node 211 encapsulates the request, transcribed into the network library format, in an IP packet. The IP packet then identifies, in its header, the network protocol used in the encapsulated request, that is Portals in this example, as well as the IP address of the network card NIC1 of the target node 222.
To obtain a physical address of the gateway 215 of the first cluster 21, the network card uses an address resolution protocol, such as ARP. Thus, the method 4 comprises a step 47 of sending, by the network card NIC1 of the sending node 211, an address resolution request based on the IP address of the gateway 215 of the first cluster 21 to obtain a physical address, for example a MAC address, of the gateway 215 of the first cluster 21, the IP address of the gateway 215 of the first cluster 21 being stored in the second routing table of the network card NIC1 in association with the identifier of the second cluster 22. Indeed, the gateway 215 is the only gateway of the cluster 21 that can access the cluster 22.
The method 4 comprises a step 48 of Ethernet encapsulation of the IP packet, with the MAC address of the gateway of the first cluster as the destination header.
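The encapsulation steps 46, 48 and 49 may be pictured with the following sketch. It is schematic rather than wire-accurate: the header classes, the protocol number used to mark a Portals payload and all addresses are hypothetical, and the ARP resolution of step 47 is represented simply by passing the already resolved MAC address of the gateway.

```python
from dataclasses import dataclass

# Hypothetical protocol number used in the IP header to indicate that the
# encapsulated request is a Portals request (no real value is implied here).
PORTALS_PROTOCOL = 0x99


@dataclass
class IPPacket:
    """Schematic IP packet restricted to the fields used in the description."""
    dst_ip: str     # IP address of the destination network card NIC1
    protocol: int   # identifies the encapsulated request, e.g. Portals
    payload: bytes  # the request transcribed into the network library format


@dataclass
class EthernetFrame:
    """Schematic Ethernet frame carrying the IP packet."""
    dst_mac: str    # physical address of the gateway of the first cluster
    src_mac: str
    payload: IPPacket


def encapsulate(library_request: bytes, dest_nic_ip: str,
                gateway_mac: str, src_mac: str) -> EthernetFrame:
    """Steps 46, 48 and 49: encapsulate the library request in an IP packet,
    then the IP packet in an Ethernet frame addressed to the local gateway."""
    packet = IPPacket(dst_ip=dest_nic_ip, protocol=PORTALS_PROTOCOL,
                      payload=library_request)
    return EthernetFrame(dst_mac=gateway_mac, src_mac=src_mac, payload=packet)


if __name__ == "__main__":
    frame = encapsulate(b"put-request", "10.22.0.2",
                        gateway_mac="aa:bb:cc:00:02:15",  # gateway 215 (via ARP)
                        src_mac="aa:bb:cc:00:02:11")      # sending NIC1
    print(frame)
```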
The Ethernet frame is then transmitted in step 49 by the network card NIC1 of node 211 to the gateway 215 of the cluster 21, via switch 214. Upon receipt of the Ethernet frame, in a conventional manner, the gateway 215 of the first cluster 21 decapsulates the Ethernet frame, reads the IP packet to obtain the destination IP address, that is the IP address of the destination network card, and encapsulates the IP packet in an Ethernet frame with a physical address of the gateway 225 for forwarding to the gateway 225 of the second cluster 22, based on the routing table stored by the gateway 215 and indicating that the second gateway 225 is the destination gateway for the cluster 22. The ARP protocol can also be used by the gateway 215 of the cluster 21 to obtain the MAC address of the gateway 225 of the cluster 22, if the gateway 215 does not yet store this information in its switching table.
Upon receipt, a step 50 comprises the transmission, by the gateway of the second cluster, of the Ethernet frame to the destination network card of the node running the destination instance. To do this, the gateway 225 of the cluster 22 decapsulates the Ethernet frame, reads the destination IP address, and uses a switching table that it stores to obtain the MAC address of the network library engine of the destination network card. This MAC address can be obtained by the ARP protocol if it is not yet comprised in its switching table.
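The forwarding performed by the gateways 215 and 225 may likewise be sketched as follows. The schematic header classes of the previous sketch are repeated so that the example remains self-contained; the routing table, the ARP cache and all addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IPPacket:
    dst_ip: str
    protocol: int
    payload: bytes


@dataclass
class EthernetFrame:
    dst_mac: str
    src_mac: str
    payload: IPPacket


def forward(frame: EthernetFrame, routing_table: dict, arp_cache: dict,
            own_mac: str) -> EthernetFrame:
    """Gateway-side forwarding as described for the gateways 215 and 225:
    decapsulate the Ethernet frame, read the destination IP address of the
    inner packet, pick the next hop from the routing table and re-encapsulate
    the packet towards it.

    routing_table maps a destination IP address to the IP address of the next
    hop (another gateway, or the destination NIC1 itself); arp_cache maps an
    IP address to a MAC address and stands in for an ARP resolution.
    """
    packet = frame.payload                      # decapsulation
    next_hop_ip = routing_table[packet.dst_ip]  # e.g. gateway 225, then NIC1
    next_hop_mac = arp_cache[next_hop_ip]       # ARP if not already cached
    return EthernetFrame(dst_mac=next_hop_mac, src_mac=own_mac, payload=packet)


if __name__ == "__main__":
    # Hypothetical view of the gateway 215: traffic for the NIC1 of node 222
    # is routed towards the gateway 225 of the cluster 22.
    frame_in = EthernetFrame("aa:bb:cc:00:02:15", "aa:bb:cc:00:02:11",
                             IPPacket("10.22.0.2", 0x99, b"put-request"))
    frame_out = forward(frame_in,
                        routing_table={"10.22.0.2": "10.22.0.254"},
                        arp_cache={"10.22.0.254": "aa:bb:cc:00:02:25"},
                        own_mac="aa:bb:cc:00:02:15")
    print(frame_out.dst_mac)  # MAC address of the gateway 225
```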
In step 51, the network card NIC1 of the node 222 receives the Ethernet frame and decapsulates it. It also decapsulates the IP packet to obtain the Portals request and the at least one piece of data it contains, intended for the virtual machine VMID2 of the node 222. This data is then transmitted in a step 52, to the destination instance, by the network card, using the network library, for example Portals, the destination network identifier, and the destination process identifier.
In an optional but preferable step 53, the destination network card NIC1 of the node 222 acknowledges receipt of the Portals request and the data it contains to the sending network card, with the acknowledgment taking the reverse route.
Finally, it is specified that the Portals engine present in each network card NIC1 has a dedicated physical MAC address. This destination MAC address is used by Ethernet frames encapsulating a Portals payload. As with MAC addresses on virtual machine Ethernet interfaces, this one will be forged from the node's physical NID and a special VM identifier: 128. For example, the MAC address of the Portals engine on the network card NIC1 could be 00:06:128:00:00:00.
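The forging of this MAC address may be sketched as follows. The byte layout is only an assumption made for the example: a fixed 00:06 prefix, the special VM identifier 128 in the third octet, and the node's physical NID in the last three octets; with a physical NID equal to zero this reproduces the form of the example above, with 128 in decimal corresponding to 80 in hexadecimal notation.

```python
def portals_engine_mac(physical_nid: int, vm_id: int = 128) -> str:
    """Forge the MAC address of the Portals engine of a network card NIC1.

    The byte layout is an assumption made for illustration only: a fixed
    "00:06" prefix, the special VM identifier (128) in the third octet and
    the node's physical NID in the last three octets.
    """
    if not 0 <= physical_nid < 2 ** 24:
        raise ValueError("physical NID must fit in three octets")
    octets = [0x00, 0x06, vm_id,
              (physical_nid >> 16) & 0xFF,
              (physical_nid >> 8) & 0xFF,
              physical_nid & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


if __name__ == "__main__":
    print(portals_engine_mac(0))      # 00:06:80:00:00:00
    print(portals_engine_mac(0x2B1))  # 00:06:80:00:02:b1
```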