This application relates to the communication field, and in particular, to a path determining method and a related device.
In a communication network, when there is no direct link between different communication apparatuses, a data flow exchanged between the different communication apparatuses may need to be forwarded by another communication apparatus. The other communication apparatus may be referred to as a forwarding device. For example, the forwarding device may include a router, a switch, or a virtual machine.
Currently, as a networking scale becomes larger, a data flow exchanged between different communication apparatuses usually needs to be forwarded through a multi-layer network. However, in the multi-layer network, there are usually a plurality of paths between a communication apparatus corresponding to a source address of the data flow and a communication apparatus corresponding to a destination address of the data flow. When forwarding the data flow, a communication apparatus serving as a forwarding device selects a path from the plurality of paths according to a local policy, and forwards the data flow based on the locally selected path.
However, in the multi-layer network, there are usually a plurality of communication apparatuses serving as forwarding devices, and each communication apparatus serving as a forwarding device determines a path based on a local data flow of the communication apparatus. Consequently, a conflict easily occurs between paths determined by different forwarding devices, and data flow forwarding efficiency is affected.
This application provides a path determining method and a related device, to improve data flow forwarding efficiency.
A first aspect of this application provides a path determining method. The method may be performed by a first network device, by some components (for example, a processor, a chip, or a chip system) of the first network device, or by a logical module or software that can implement all or some functions of the first network device. In the first aspect and possible implementations of the first aspect, an example in which the method is performed by the first network device is used for description. The first network device may be a router, a switch, a virtual machine, or the like. In the method, the first network device obtains first topology information, where the first topology information includes connection relationships between N second network devices and P third network devices, any second network device is an upstream network device of any third network device, N is an integer greater than or equal to 2, and P is an integer greater than or equal to 1. The first network device obtains communication relationships of M data flows, where a communication relationship of each of the M data flows includes source address information and destination address information, M is an integer greater than or equal to 2, and the M data flows are separately transmitted by the N second network devices to the P third network devices. The first network device determines M paths based on the communication relationships of the M data flows and the first topology information, where the M paths respectively correspond to the M data flows, and the M paths indicate paths through which the M data flows are transmitted by the N second network devices to the P third network devices. The first network device separately sends the M paths to the N second network devices.
According to the foregoing technical solution, after obtaining the first topology information including the connection relationships between the N second network devices and the P third network devices, and obtaining the communication relationships of the M data flows, the first network device determines the M paths based on the communication relationships of the M data flows and the first topology information, and sends the M paths to the N second network devices. Then, the N second network devices may separately send the M data flows to the P third network devices based on the M paths. In other words, the first network device serves as a device for determining a path, and the path is determined by the first network device based on the connection relationships between the N second network devices and the P third network devices, and the communication relationships of the M data flows. Therefore, by comparison with an implementation in which a path conflict is easily caused because the N second network devices determine paths only based on local data flows, in the foregoing method, the first network device can determine the paths based on global information, to avoid a path conflict and improve data flow forwarding efficiency.
It should be understood that any one of the M data flows may be a unidirectional data flow, or may be a bidirectional data flow. This is not limited in this application. If a data flow in the M data flows is the bidirectional data flow, in the communication relationships of the M data flows, a communication relationship of the bidirectional data flow may include only source address information and destination address information of a flow direction, or a communication relationship of the bidirectional data flow may include source address information and destination address information corresponding to each of two flow directions. This is not limited herein.
In a possible implementation of the first aspect, the M data flows are transmitted by the P third network devices to K fourth network devices, where K is an integer greater than or equal to 1. The M data flows include a first data flow and a second data flow, where source address information of the first data flow and source address information of the second data flow correspond to different second network devices, and destination address information of the first data flow and destination address information of the second data flow correspond to a same fourth network device; and the M paths include a first path and a second path, where the first path corresponds to the first data flow, the second path corresponds to the second data flow, and the first path and the second path correspond to different third network devices.
Optionally, any fourth network device is different from any second network device.
Optionally, at least one of the N second network devices and at least one of the K fourth network devices are a same network device.
Optionally, N is equal to K, and the N second network devices and the K fourth network devices are a same network device.
According to the foregoing technical solution, the M paths determined by the first network device include the first path corresponding to the first data flow and the second path corresponding to the second data flow, and the first path and the second path correspond to the different third network devices, where the source address information of the first data flow and the source address information of the second data flow correspond to the different second network devices, and the destination address information of the first data flow and the destination address information of the second data flow correspond to the same fourth network device. In other words, because any fourth network device is a downstream network device of any third network device, after the first data flow and the second data flow are separately sent by different second network devices to different third network devices, the different third network devices separately send the first data flow and the second data flow to a same fourth network device. This can avoid network congestion generated in a process in which data flows from different second network devices are transmitted by a same third network device and then transmitted by the same third network device to a same fourth network device, to improve transmission efficiency of the first data flow and the second data flow.
In a possible implementation of the first aspect, the M paths further indicate egress ports of the M data flows on the N second network devices.
According to the foregoing technical solution, in addition to indicating the paths through which the M data flows are transmitted by the N second network devices to the P third network devices, the M paths determined by the first network device further indicate the egress ports of the M data flows on the N second network devices, so that the N second network devices can determine, after receiving the M paths, the egress ports for sending the M data flows.
In a possible implementation of the first aspect, the M data flows include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to a same second network device; and the M paths include a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path.
According to the foregoing technical solution, the M paths determined by the first network device include the third path corresponding to the third data flow and the fourth path corresponding to the fourth data flow, and the third path is different from the fourth path. The source address information of the third data flow and the source address information of the fourth data flow correspond to the same second network device. In other words, the third data flow and the fourth data flow are separately transmitted by the same second network device through different paths. This can avoid network congestion generated in a process in which data flows from a same second network device are transmitted through a same path, to improve transmission efficiency of the third data flow and the fourth data flow.
In a possible implementation of the first aspect, the M data flows are transmitted by the P third network devices to the K fourth network devices, where K is a positive integer. That the first network device determines M paths based on the communication relationships of the M data flows and the first topology information includes: The first network device determines a first mapping relationship based on the communication relationships of the M data flows and the first topology information, where the first mapping relationship indicates a mapping relationship between a second network device corresponding to the source address information of each of the M data flows and a fourth network device corresponding to the destination address information of each of the M data flows. The first network device determines the M paths based on the first mapping relationship.
Optionally, the first topology information further includes connection relationships between the P third network devices and the K fourth network devices.
According to the foregoing technical solution, an implementation in which the first network device determines the M paths is provided, so that the first network device determines the M paths based on mapping relationships between the N second network devices and the K fourth network devices.
In a possible implementation of the first aspect, that the first network device determines the M paths based on the first mapping relationship includes: The first network device determines first sorting information based on the first mapping relationship, where the first sorting information indicates sorting of a quantity of second network devices corresponding to the K fourth network devices. The first network device sequentially traverses the egress ports on the N second network devices based on the first sorting information, to obtain a second mapping relationship, where the second mapping relationship indicates mapping relationships between the egress ports on the N second network devices and the K fourth network devices. The first network device determines the M paths based on the second mapping relationship.
According to the foregoing technical solution, an implementation in which the first network device determines the M paths is provided, so that the first network device determines the M paths based on the first mapping relationship, the first sorting information, and the second mapping relationship that are sequentially determined.
In a possible implementation of the first aspect, that the first network device sequentially traverses the egress ports on the N second network devices based on the first sorting information, to obtain a second mapping relationship includes: The first network device sequentially traverses the egress ports on the N second network devices based on the first sorting information, to obtain a third mapping relationship, where the third mapping relationship indicates an optional quantity of egress ports on the second network device corresponding to each fourth network device. The first network device determines the second mapping relationship based on the third mapping relationship.
According to the foregoing technical solution, the third mapping relationship indicates the optional quantity of egress ports on the second network device corresponding to each fourth network device. A larger value of the optional quantity indicates smaller uncertainty of a corresponding optional path of the fourth network device. On the contrary, a smaller value of the optional quantity indicates greater uncertainty of a corresponding optional path of the fourth network device. Therefore, when the second mapping relationship is determined based on the third mapping relationship, egress ports on the second network device corresponding to a fourth network device whose optional path has small uncertainty can be preferentially traversed, to improve accuracy of the solution, and avoid a conflict between the M paths subsequently determined based on the second mapping relationship.
In a possible implementation of the first aspect, the method further includes: The first network device obtains second topology information, where the second topology information includes connection relationships between A second network devices and the P third network devices, at least one of the A second network devices is the same as at least one of the N second network devices, and A is an integer greater than or equal to 1. The first network device obtains communication relationships of B data flows, where a communication relationship of each of the B data flows includes source address information and destination address information, and B is an integer greater than or equal to 1; and the B data flows are separately transmitted by the A second network devices to the P third network devices. After the first network device determines the M paths based on the communication relationships of the M data flows and the first topology information, the method further includes: The first network device determines B paths based on the communication relationships of the B data flows and the second topology information, where the B paths respectively correspond to the B data flows, and the B paths indicate paths through which the B data flows are sent by the A second network devices to the P third network devices, where egress ports that are on a second network device and that correspond to the B paths are different from egress ports that are on the second network device and that correspond to the M paths. The first network device separately sends the B paths to the A second network devices.
According to the foregoing technical solution, the egress ports that are on the second network device and that correspond to the B paths determined by the first network device are different from the egress ports that are on the second network device and that correspond to the M paths, to avoid a flow conflict generated when the M data flows and the B data flows correspond to an egress port on a same second network device, and improve data transmission efficiency of the M data flows and the B data flows.
In a possible implementation of the first aspect, that the first network device obtains communication relationships of M data flows includes: The first network device separately receives communication relationships that are of the M data flows and that are from the N second network devices.
According to the foregoing technical solution, the first network device may obtain the communication relationships of the M data flows in a manner of separately receiving the communication relationships that are of the M data flows and that are from the N second network devices, so that the first network device obtains the communication relationships of the M data flows in a manner of interaction between the first network device and the N second network devices.
Optionally, the communication relationships of the M data flows are preconfigured in the first network device, to avoid overhead and latency increase caused by interaction between different network devices.
In a possible implementation of the first aspect, the M data flows correspond to one of a plurality of artificial intelligence (AI) collective communication tasks.
Optionally, the M data flows correspond to a long steady-state flow task, and a flow volume of a data flow in the long steady-state flow task is greater than a preset threshold within specific duration.
In a possible implementation of the first aspect, the first network device is a controller or one of the P third network devices.
According to the foregoing technical solution, the first network device that performs the method to determine and send the M paths may be the controller, or may be one of the P third network devices, to improve flexibility of implementing the solution.
A second aspect of this application provides a path determining method. The method may be performed by a second network device, by some components (for example, a processor, a chip, or a chip system) of the second network device, or by a logical module or software that can implement all or some functions of the second network device. In the second aspect and possible implementations of the second aspect, an example in which the method is performed by the second network device is used for description. The second network device may be a router, a switch, a virtual machine, or the like. In the method, the second network device sends communication relationships of Q data flows to a first network device, where a communication relationship of each of the Q data flows includes source address information and destination address information, and Q is an integer greater than or equal to 1. The second network device receives Q paths from the first network device, where the Q paths indicate paths used when the second network device transmits the Q data flows. The second network device transmits the Q data flows based on the Q paths.
According to the foregoing technical solution, after sending the communication relationships of the Q data flows to the first network device, the second network device receives, from the first network device, the Q paths indicating the paths used when the second network device transmits the Q data flows, and the second network device transmits the Q data flows based on the Q paths. In other words, serving as a device for determining a path, the first network device can determine M paths corresponding to M data flows transmitted between N second network devices and P third network devices. Therefore, by comparison with an implementation in which a path conflict is easily caused because the N second network devices determine paths only based on local data flows, in the foregoing method, the first network device can determine the paths based on global information, to avoid a path conflict and improve data flow forwarding efficiency.
It should be understood that any one of the Q data flows may be a unidirectional data flow, or may be a bidirectional data flow. This is not limited in this application. If a data flow in the Q data flows is the bidirectional data flow, in the communication relationships of the Q data flows, a communication relationship of the bidirectional data flow may include only source address information and destination address information of a flow direction, or a communication relationship of the bidirectional data flow may include source address information and destination address information corresponding to each of two flow directions. This is not limited herein.
In a possible implementation of the second aspect, the Q paths further indicate egress ports of the Q data flows on the second network device.
According to the foregoing technical solution, in addition to indicating the paths through which the Q data flows are transmitted by the second network device, the Q paths received by the second network device further indicate the egress ports of the Q data flows on the second network device, so that the second network device can determine, after receiving the Q paths, egress ports for sending the Q data flows.
In a possible implementation of the second aspect, the Q data flows include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to the second network device; and the Q paths include a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path.
According to the foregoing technical solution, the Q paths received by the second network device include the third path corresponding to the third data flow and the fourth path corresponding to the fourth data flow, and the third path is different from the fourth path. The source address information of the third data flow and the source address information of the fourth data flow correspond to a same second network device. In other words, the third data flow and the fourth data flow are separately transmitted by the same second network device through different paths. This can avoid network congestion generated in a process in which data flows from a same second network device are transmitted through a same path, to improve transmission efficiency of the third data flow and the fourth data flow.
In a possible implementation of the second aspect, the Q data flows correspond to one of a plurality of AI collective communication tasks.
Optionally, the Q data flows correspond to a long steady-state flow task, and a flow volume of a data flow in the long steady-state flow task is greater than a preset threshold within specific duration.
A third aspect of this application provides a communication apparatus. The apparatus can implement the method according to any one of the first aspect or the possible implementations of the first aspect. The apparatus includes a corresponding unit or module configured to perform the method. The unit or module included in the apparatus may be implemented by software and/or hardware. For example, the apparatus may be a first network device, the apparatus may be a component (for example, a processor, a chip, or a chip system) in the first network device, or the apparatus may be a logical module or software that can implement all or some functions of the first network device.
The apparatus includes a transceiver unit and a processing unit. The transceiver unit is configured to obtain first topology information, where the first topology information includes connection relationships between N second network devices and P third network devices, any second network device is an upstream network device of any third network device, N is an integer greater than or equal to 2, and P is an integer greater than or equal to 1. The transceiver unit is further configured to obtain communication relationships of M data flows, where a communication relationship of each of the M data flows includes source address information and destination address information, M is an integer greater than or equal to 2, and the M data flows are separately transmitted by the N second network devices to the P third network devices. The processing unit is configured to determine M paths based on the communication relationships of the M data flows and the first topology information, where the M paths respectively correspond to the M data flows, and the M paths indicate paths through which the M data flows are transmitted by the N second network devices to the P third network devices. The transceiver unit is further configured to separately send the M paths to the N second network devices.
In a possible implementation of the third aspect, the M data flows are transmitted by the P third network devices to K fourth network devices, where K is an integer greater than or equal to 1. The M data flows include a first data flow and a second data flow, where source address information of the first data flow and source address information of the second data flow correspond to different second network devices, and destination address information of the first data flow and destination address information of the second data flow correspond to a same fourth network device; and the M paths include a first path and a second path, where the first path corresponds to the first data flow, the second path corresponds to the second data flow, and the first path and the second path correspond to different third network devices.
In a possible implementation of the third aspect, the M paths further indicate egress ports of the M data flows on the N second network devices.
In a possible implementation of the third aspect, the M data flows include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to a same second network device; and the M paths include a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path.
In a possible implementation of the third aspect, the M data flows are transmitted by the P third network devices to the K fourth network devices, where K is a positive integer. The processing unit is specifically configured to: determine a first mapping relationship based on the communication relationships of the M data flows and the first topology information, where the first mapping relationship indicates a mapping relationship between a second network device corresponding to the source address information of each of the M data flows and a fourth network device corresponding to the destination address information of each of the M data flows; and determine the M paths based on the first mapping relationship.
In a possible implementation of the third aspect, the processing unit is specifically configured to: determine first sorting information based on the first mapping relationship, where the first sorting information indicates sorting of a quantity of second network devices corresponding to the K fourth network devices; sequentially traverse the egress ports on the N second network devices based on the first sorting information, to obtain a second mapping relationship, where the second mapping relationship indicates mapping relationships between the egress ports on the N second network devices and the K fourth network devices; and determine the M paths based on the second mapping relationship.
In a possible implementation of the third aspect, the processing unit is specifically configured to: sequentially traverse the egress ports on the N second network devices based on the first sorting information, to obtain a third mapping relationship, where the third mapping relationship indicates an optional quantity of egress ports on the second network device corresponding to each fourth network device; and determine the second mapping relationship based on the third mapping relationship.
In a possible implementation of the third aspect, the transceiver unit is further configured to obtain second topology information, where the second topology information includes connection relationships between A second network devices and the P third network devices, at least one of the A second network devices is the same as at least one of the N second network devices, and A is an integer greater than or equal to 1. The transceiver unit is further configured to obtain communication relationships of B data flows, where a communication relationship of each of the B data flows includes source address information and destination address information, and B is an integer greater than or equal to 1; and the B data flows are separately transmitted by the A second network devices to the P third network devices. The processing unit is further configured to determine B paths based on the communication relationships of the B data flows and the second topology information, where the B paths respectively correspond to the B data flows, and the B paths indicate paths through which the B data flows are sent by the A second network devices to the P third network devices, where egress ports that are on a second network device and that correspond to the B paths are different from egress ports that are on the second network device and that correspond to the M paths. The transceiver unit is further configured to separately send the B paths to the A second network devices.
In a possible implementation of the third aspect, the transceiver unit is specifically configured to separately receive communication relationships that are of the M data flows and that are from the N second network devices.
In a possible implementation of the third aspect, the M data flows correspond to one of a plurality of AI collective communication tasks.
In a possible implementation of the third aspect, the first network device is a controller or one of the P third network devices.
In the third aspect of this application, composition modules of the communication apparatus may be further configured to: perform a step performed in the possible implementations of the first aspect; and implement a corresponding technical effect. For details, refer to the first aspect. Details are not described herein again.
A fourth aspect of this application provides a communication apparatus. The apparatus can implement the method according to any one of the second aspect or the possible implementations of the second aspect. The apparatus includes a corresponding unit or module configured to perform the method. The unit or module included in the apparatus may be implemented by software and/or hardware. For example, the apparatus may be a second network device, the apparatus may be a component (for example, a processor, a chip, or a chip system) in the second network device, or the apparatus may be a logical module or software that can implement all or some functions of the second network device.
The apparatus includes a transceiver unit and a processing unit. The processing unit is configured to determine communication relationships of Q data flows, where a communication relationship of each of the Q data flows includes source address information and destination address information, and Q is an integer greater than or equal to 1. The transceiver unit is configured to send the communication relationships of the Q data flows to a first network device. The transceiver unit is further configured to receive Q paths from the first network device, where the Q paths indicate paths used when the second network device transmits the Q data flows. The transceiver unit is further configured to transmit the Q data flows based on the Q paths.
In a possible implementation of the fourth aspect, the Q paths further indicate egress ports of the Q data flows on the second network device.
In a possible implementation of the fourth aspect, the Q data flows include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to the second network device; and the Q paths include a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path.
In a possible implementation of the fourth aspect, the Q data flows correspond to one of a plurality of AI collective communication tasks.
In the fourth aspect of this application, composition modules of the communication apparatus may be further configured to: perform a step performed in the possible implementations of the second aspect; and implement a corresponding technical effect. For details, refer to the second aspect. Details are not described herein again.
A fifth aspect of this application provides a communication apparatus. The communication apparatus includes at least one processor. The at least one processor is coupled to a memory. The memory is configured to store a program or instructions. The at least one processor is configured to execute the program or the instructions, to enable the apparatus to implement the method according to any one of the first aspect or the possible implementations of the first aspect, or to enable the apparatus to implement the method according to any one of the second aspect or the possible implementations of the second aspect.
A sixth aspect of this application provides a communication apparatus, including at least one logic circuit and an input/output interface. The logic circuit is configured to perform the method according to any one of the first aspect or the possible implementations of the first aspect, or the logic circuit is configured to perform the method according to any one of the second aspect or the possible implementations of the second aspect.
A seventh aspect of this application provides a computer-readable storage medium, configured to store computer instructions. When the computer instructions are executed by a processor, the processor performs the method according to any one of the first aspect or the possible implementations of the first aspect, or the processor performs the method according to any one of the second aspect or the possible implementations of the second aspect.
An eighth aspect of this application provides a computer program product (or referred to as a computer program). The computer program product includes instructions. When the instructions in the computer program product are executed by a processor, the processor performs the method according to any one of the first aspect or the possible implementations of the first aspect, or the processor performs the method according to any one of the second aspect or the possible implementations of the second aspect.
A ninth aspect of this application provides a chip system. The chip system includes at least one processor, configured to support a communication apparatus in implementing a function according to any one of the first aspect or the possible implementations of the first aspect, or configured to support a communication apparatus in implementing a function according to any one of the second aspect or the possible implementations of the second aspect.
In a possible design, the chip system may further include a memory. The memory is configured to store program instructions and data that are used for the communication apparatus. The chip system may include a chip, or may include a chip and another discrete component. Optionally, the chip system further includes an interface circuit. The interface circuit provides program instructions and/or data for the at least one processor.
A tenth aspect of this application provides a communication system. The communication system includes the first network device according to any one of the foregoing aspects.
Optionally, the communication system includes one or more second network devices according to any one of the foregoing aspects.
Optionally, the communication system includes one or more third network devices according to any one of the foregoing aspects.
Optionally, the communication system includes one or more fourth network devices according to any one of the foregoing aspects.
For technical effects brought by any one of the designs of the second aspect to the tenth aspect, refer to the technical effects brought by the different implementations of the first aspect. Details are not described herein again.
The following describes the technical solutions in embodiments of the present disclosure with reference to the accompanying drawings in embodiments of the present disclosure.
The terms “system” and “network” may be used interchangeably in embodiments of this application. “At least one” means one or more, and “a plurality of” means two or more than two. “And/or” describes an association relationship of associated objects, and represents that three relationships may exist. For example, A and/or B may represent the following cases: Only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. The character “/” usually indicates an “or” relationship between the associated objects. “At least one of the following items (pieces)” or a similar expression thereof indicates any combination of these items, including a single item (piece) or any combination of a plurality of items (pieces). For example, “at least one of A, B, and C” includes A, B, C, AB, AC, BC, or ABC. In addition, unless otherwise specified, ordinal numbers such as “first” and “second” in embodiments of this application are used to distinguish between a plurality of objects, and are not used to limit a sequence, a time sequence, priorities, or importance of the plurality of objects.
It should be noted that in this application, the term such as “example” or “for example” is used to represent giving an example, an illustration, or a description. Any embodiment or design scheme described as “example” or “for example” in this application should not be explained as being more preferred or having more advantages than another embodiment or design scheme. Exactly, use of the term “example”, “for example”, or the like is intended to present a related concept in a specific manner.
In a communication network, when there is no direct link between different communication apparatuses, a data flow exchanged between the different communication apparatuses may need to be forwarded by another communication apparatus. The other communication apparatus may be referred to as a forwarding device. For example, the forwarding device may include a router, a switch, a virtual machine, or the like. The following describes a communication system in this application with reference to implementation examples in
In
Optionally, in
Optionally, the customer edge device may be referred to as an edge switching device.
The implementation scenario in
An implementation example is shown in
Optionally, in a scenario example shown in
In addition, considering complexity of a network, a source device and a destination device of a same data flow may or may not be connected to a same network device, and a source device of one data flow and a destination device of another data flow may or may not be connected to a same network device. In other words, the N second network devices may be completely different from the K fourth network devices, or the N second network devices may be partially or completely the same as the K fourth network devices. The following provides further descriptions with reference to more implementation examples.
An implementation example is shown in
Another implementation example is shown in
Optionally, a multi-layer network including the N second network devices, P third network devices, and the K fourth network devices may be a three-layer network including an access layer, an aggregation layer, and a core layer. Alternatively, the multi-layer network may be a two-layer network including a spine node and a leaf node, or may be another implementation. This is not limited in this application.
In
It should be noted that for ease of description, a communication process between two layers of networks is mainly described in the following embodiments. In actual application of the solution, the following embodiments may be further applied to a communication process between any two networks in another network. For example, for a three-layer network including an access layer, an aggregation layer, and a core layer, the following embodiments may be applied to a network between the access layer and the aggregation layer, may be applied to a network between the aggregation layer and the core layer, or may be applied to a network between the access layer and the core layer.
In
An example in which the foregoing multi-layer network is a data center network is used. With leapfrog development of new technologies such as cloud computing, 5th generation (5G) communication, big data, the internet of things, and AI, and mature commercial use of applications such as autonomous vehicles, 5G smart manufacturing factories, intelligent risk control, and facial recognition, requirements for a network of a data center become higher, and the data center may be required to provide a lossless network with no packet loss, a high throughput, and low latency. As services on the network of the data center are increasingly enriched, a networking scale becomes larger. Therefore, the network of the data center usually uses a multi-root hierarchical network topology, and equal-cost member links exist between switching devices at different layers. In an Internet Protocol (IP) network, this forms a network environment in which a plurality of different links reach a same destination address. When a plurality of routes have same route priorities and same route metrics, the routes are referred to as equal-cost routes, and form an equal-cost multi-path (ECMP). A device at a multi-path bifurcation location selects a path according to a specific policy and sends packets to different paths to implement flow load balancing. In an ECMP mechanism, a feature field (for example, a source media access control (MAC) address, a destination MAC address, or IP 5-tuple information) of a data packet serves as a hash factor, a hash-key value is generated by using a hash algorithm, and then a member link is selected from load balancing links based on the hash-key value to forward the data packet. In this case, data packets with different feature fields may have different hash-key values, and therefore different member links may be selected for forwarding. Data packets with same feature fields have same hash-key values, and therefore same member links are selected for forwarding. In this way, load balancing forwarding is performed on different data flows through different member links, and a time sequence of data packets in a same data flow reaching a receive end is ensured.
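The following is a minimal Python sketch of the hash-based member link selection described above. The concatenation of feature fields, the use of MD5 as the hash algorithm, and the function name are illustrative assumptions, not the actual switch implementation, which is vendor-specific and computed in hardware.

```python
import hashlib

def select_member_link(src_mac, dst_mac, five_tuple, num_links):
    # The feature fields of the data packet serve as the hash factor.
    hash_factor = f"{src_mac}|{dst_mac}|{five_tuple}".encode()
    # A hash-key value is generated by using a hash algorithm (MD5 here, for illustration).
    hash_key = int.from_bytes(hashlib.md5(hash_factor).digest()[:4], "big")
    # A member link is selected from the load balancing links based on the hash-key value.
    return hash_key % num_links
```

Because the hash is deterministic, all packets of a same data flow select a same member link, which is what ensures the time sequence of the data flow at the receive end.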
In other words, in the foregoing multi-layer network, there are usually a plurality of paths between a communication apparatus corresponding to a source address of a data flow and a communication apparatus corresponding to a destination address of the data flow. When forwarding the data flow, a communication apparatus, serving as a forwarding device, selects a path from the plurality of paths according to a local policy, and forwards the data flow based on the locally selected path. The following uses the architecture shown in
As shown in
Specifically, an implementation process of determining the path based on the local selection may be implemented based on a load balancing technology. Based on the granularity of load balancing, common technologies include a packet-based load balancing technology, a flow-based load balancing technology, and a flowlet-based load balancing technology. The following describes the three implementation processes by using some implementation examples.
Specifically, in the implementation process of packet-based multi-path load balancing, a switching device forwards an Nth packet to a path i, forwards an (N+1)th packet to a path (i+1), and so on, and polling is performed on the egress ports of the switching device. Behavior of the packet-based multi-path load balancing may be mathematically described as follows: a path is selected by performing a modulo operation of the packet number on the quantity of optional equal-cost paths. A purpose is to evenly distribute all packets to equal-cost member links of a next hop.
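As a minimal sketch of this behavior (the function name is hypothetical):

```python
def select_path_packet_based(packet_number, num_equal_cost_paths):
    # The packet number performs a modulo operation on the quantity of optional
    # equal-cost paths: packet N goes to path i, packet N+1 to path (i+1), and so
    # on, so that packets are evenly polled across the equal-cost member links.
    return packet_number % num_equal_cost_paths
```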
An implementation example is shown in
It can be learned from the example shown in
Specifically, selecting one from a plurality of next hops depends on an ECMP mechanism. In the mechanism, a feature field (for example, a source MAC address, a destination MAC address, or IP 5-tuple information) of a data packet serves as a hash factor, a hash-key value is generated by using a hash algorithm, and then a member link is selected from load balancing links based on the hash-key value to forward the data packet. Data packets with different feature fields may have different hash-key values, and therefore different member links may be selected for forwarding. Data packets with same feature fields have same hash-key values, and therefore same member links are selected for forwarding. In this way, load balancing forwarding is performed on different data flows through different member links, and a time sequence of data packets in a same data flow reaching a receive end is ensured.
An implementation example is shown in
It can be learned from the example shown in
In addition, the flow-based load balancing causes two types of conflicts, and consequently, network load is unbalanced, and service performance is affected. The following provides descriptions separately by using
A type of conflict may be referred to as a local conflict. A reason is: Because hash-key results computed by using the hash algorithm for different input feature fields are the same, different flows are forwarded to a same path, and a conflict occurs.
For example, an implementation process of the local conflict is shown in
It can be learned from a path selection result that all packets sent from the Server-A and the Server-B are forwarded to a path between the Switch-A and a Switch-1, and there is no packet on a path between the Switch-A and a Switch-2. As a result, the network load is unbalanced, and the service performance is damaged.
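Reusing the select_member_link sketch above, the local conflict can be reproduced as follows. The MAC addresses and 5-tuples are hypothetical; with only two member links, two independent flows hash to the same link with a probability of roughly one half.

```python
paths = ["Switch-A to Switch-1", "Switch-A to Switch-2"]

# Hypothetical flows from Server-A and Server-B toward the same destination.
flow_a = ("aa:aa:aa:00:00:01", "cc:cc:cc:00:00:01", ("10.0.0.1", "10.0.1.1", 6, 5000, 80))
flow_b = ("bb:bb:bb:00:00:02", "cc:cc:cc:00:00:01", ("10.0.0.2", "10.0.1.1", 6, 5001, 80))

link_a = select_member_link(*flow_a[:2], flow_a[2], len(paths))
link_b = select_member_link(*flow_b[:2], flow_b[2], len(paths))
if link_a == link_b:
    # Different input feature fields produced the same hash-key modulo result,
    # so both flows are forwarded to the same path and a local conflict occurs.
    print("local conflict on", paths[link_a])
```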
Another type of conflict may be referred to as a global conflict. A reason is: Because a current load balancing technology uses a distributed decision-making mechanism and lacks a global perspective, a switching device cannot predict and control a flow direction of an upstream flow, and a flow conflict is caused.
For example, an implementation process of the global conflict is shown in
Specifically, when forwarding a data packet, a network device determines a time interval between a to-be-forwarded data packet and a previous data packet in a data flow to which the to-be-forwarded data packet belongs. If the time interval is greater than the maximum link transmission latency (flowlet_gap_time) of member links in a load balancing link, the to-be-forwarded data packet is considered as a first packet of a new flowlet. If the time interval is less than the maximum link transmission latency of the member links in the load balancing link, the to-be-forwarded data packet is considered to belong to a same flowlet as the previous data packet. A device selects a member link with lighter load from a current load balancing link based on a flowlet for forwarding. For data packets in a same flowlet, same member links are selected for forwarding.
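A minimal sketch of the flowlet decision logic follows. The 500 µs gap value, the per-flow state table, and the byte-count load metric are assumptions made for illustration; in practice flowlet_gap_time tracks the maximum member-link latency and the load metric is device-specific.

```python
flowlet_gap_time = 0.0005  # assumed 500 us; in practice the maximum member-link latency
flow_state = {}            # flow 5-tuple -> (time of previous packet, member link of current flowlet)
link_load = [0, 0]         # illustrative load (bytes) per member link

def forward(five_tuple, packet_len, now):
    last_time, link = flow_state.get(five_tuple, (None, None))
    if last_time is None or now - last_time > flowlet_gap_time:
        # The interval exceeds flowlet_gap_time: the packet is the first packet
        # of a new flowlet, and the member link with lighter load is selected.
        link = min(range(len(link_load)), key=link_load.__getitem__)
    # Packets of a same flowlet keep the same member link, preserving packet order.
    link_load[link] += packet_len
    flow_state[five_tuple] = (now, link)
    return link
```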
An implementation example is shown in
In the implementation process of the flowlet-based load balancing in the implementation 3, an entire flow may be divided into a plurality of flowlets based on a packet interval. In addition, paths can be selected for different flowlets based on network features, for example, based on link utilization and egress port queue depth. However, this implementation still has disadvantages. In one aspect, a host side cannot actively construct the flowlet. In another aspect, if the host side is forced to construct the flowlet by using a sending-stopping-sending mechanism, a throughput is affected when the network is not congested, and the link utilization is low; however, when the network is congested, it cannot be ensured that a gap between constructed flowlets is greater than a flowlet gap set by a switch, causing disorder on a receive end and triggering re-transmission. In conclusion, the flowlet-based multi-path load balancing technology has disadvantages of both the packet-based load balancing technology and the flow-based load balancing technology: risks of unbalanced load and packet disorder. In addition, the flowlet-based load balancing technology may need to balance maximization of a network throughput against minimization of the packet disorder, and accurately set a value of flowlet_gap_time. However, in a process of balancing the network throughput and the packet disorder, flowlet_gap_time is not a static parameter, and the value of flowlet_gap_time may need to be dynamically adjusted based on feedback of a network performance indicator. In the mechanism of dynamically adjusting the value of flowlet_gap_time, the network throughput is affected, and a global optimal value cannot be reached.
It can be learned from the implementation processes of the implementation 1 to the implementation 3 that the following technical problems exist.
In one aspect, in a multi-layer network, there are usually a plurality of communication apparatuses that serve as forwarding devices, and each communication apparatus that serves as a forwarding device needs to perform a process of selecting a path from a plurality of paths according to its local policy, to determine a forwarding path. Consequently, efficiency of the path determining manner is low. In another aspect, in the foregoing three implementations, because the manner in which each communication apparatus that serves as the forwarding device locally makes a decision lacks global planning of the plurality of data flows transmitted in the network, the path determining manner easily causes other problems. For example, in the implementation 1, a local flow conflict of the switching device is easily caused; in the implementation 2, the global conflict and the local conflict are easily caused; and in the implementation 3, a network throughput is easily affected.
To resolve the foregoing problems, this application provides a path determining method and a related device, to improve data flow forwarding efficiency. The following further provides detailed descriptions with reference to the accompanying drawings.
S301: The first network device obtains first topology information.
In this embodiment, the first topology information obtained by the first network device in step S301 includes connection relationships between N second network devices and P third network devices, any second network device is an upstream network device of any third network device, N is an integer greater than or equal to 2, and P is an integer greater than or equal to 1.
In a possible implementation, the first network device is a controller or one of the P third network devices. Specifically, the first network device that performs the method to determine and send M paths may be the controller, or may be one of the P third network devices, to improve flexibility of implementing a solution.
In an implementation example, in the scenario example shown in
For example, the scenario shown in
Optionally, the first network device may be any spine node connected to each leaf node in
It should be understood that in the example shown in
For example, when the data flow is a bidirectional data flow, a leaf node is not only a source leaf node in a flow direction of the data flow, but also a destination leaf node in another flow direction of the data flow. For another example, when the data flow is a unidirectional data flow, a leaf node is a source leaf node of the data flow or a destination leaf node of the data flow. For another example, a same leaf node may be separately configured to transmit different data flows. In a data flow, the leaf node may be the source leaf node, and in another data flow, the leaf node may be the source leaf node or the destination leaf node.
S302: The first network device obtains communication relationships of M data flows.
In this embodiment, a communication relationship that is of each of the M data flows and that is obtained by the first network device in step S302 includes source address information and destination address information, M is an integer greater than or equal to 2, and the M data flows are separately transmitted by the N second network devices to the P third network devices.
It should be noted that implementation processes of step S301 and step S302 are not limited in this application. To be specific, the first network device may first perform step S301 and then perform step S302, or the first network device may first perform step S302 and then perform step S301.
It should be understood that any one of the M data flows may be a unidirectional data flow, or may be a bidirectional data flow. This is not limited in this application. If a data flow in the M data flows is the bidirectional data flow, in the communication relationships of the M data flows, a communication relationship of the bidirectional data flow may include only source address information and destination address information of a flow direction, or a communication relationship of the bidirectional data flow may include source address information and destination address information corresponding to each of two flow directions. This is not limited herein.
In a possible implementation, that the first network device obtains communication relationships of M data flows includes: The first network device separately receives communication relationships that are of the M data flows and that are from the N second network devices. Specifically, the first network device may obtain the communication relationships of the M data flows in a manner of separately receiving the communication relationships that are of the M data flows and that are from the N second network devices, so that the first network device obtains the communication relationships of the M data flows in a manner of interaction between the first network device and the N second network devices.
Optionally, in step S302, the communication relationships of the M data flows are preconfigured in the first network device, to avoid overhead and latency increase caused by interaction between different network devices.
Similarly, in step S301, the first topology information obtained by the first network device in step S301 may be determined by using respective topology information sent by the N second network devices, or the first topology information is preconfigured in the first network device.
For example, the following uses, for description, an example in which the first network device determines the first topology information in step S301 and the communication relationships of the M data flows in step S302 by using respective communication relationships and respective topology information that are sent by the N second network devices. Each second network device may need to establish a flow table for local elephant flows. A process is as follows.
First, when a new flow enters the second network device, the second network device adds a 5-tuple and feature information (for example, first packet time and a quantity of bytes) of the flow to an original flow table.
Then, the second network device sets threshold values for an elephant flow and a mice flow based on a quantity of bytes in a collection period, filters out the mice flow, and reserves the elephant flow.
Then, the second network device converts the original flow table into a 2-tuple flow table, and reserves a source IP and a destination IP. A purpose is to reserve key fields required by a global optimal path allocation algorithm, to reduce storage space.
Finally, the second network device aggregates the filtered 2-tuple flow table with local topology knowledge to form a local flow information table. Implementation examples of the local flow information table are shown in Table 1 to Table 3.
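A minimal sketch of the foregoing steps follows. The field names and the 10 MB elephant-flow threshold are hypothetical; the application does not fix these values.

```python
ELEPHANT_BYTES_THRESHOLD = 10 * 1024 * 1024  # assumed threshold per collection period

def build_local_flow_info(original_flow_table, local_topology):
    # original_flow_table: entries added when a new flow enters the device, e.g.
    # {"five_tuple": (src_ip, dst_ip, proto, sport, dport),
    #  "first_packet_time": t, "byte_count": n}
    two_tuple_table = []
    for entry in original_flow_table:
        # Filter out mice flows based on the byte count; reserve elephant flows.
        if entry["byte_count"] < ELEPHANT_BYTES_THRESHOLD:
            continue
        src_ip, dst_ip = entry["five_tuple"][0], entry["five_tuple"][1]
        # Convert to a 2-tuple entry that reserves only the key fields required
        # by the global optimal path allocation algorithm, reducing storage space.
        two_tuple_table.append({"src_ip": src_ip, "dst_ip": dst_ip,
                                "first_packet_time": entry["first_packet_time"]})
    # Aggregate the filtered 2-tuple flow table with local topology knowledge
    # to form the local flow information table.
    return {"topology": local_topology, "flows": two_tuple_table}
```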
Optionally, by comparing flow tables obtained in two successive collection periodicities, the second network device may delete aged entries. In addition, all communication nodes of a task may be provided in a form of a file, for example, all communication devices of the task are explicitly provided in a form of a hostfile. A communication relationship of a current task can be obtained with reference to an explicit communication algorithm name.
Then, after obtaining the local flow information, the second network device aggregates the established flow table to the first network device, so that the first network device obtains the communication relationships of the M data flows in step S302.
Optionally, an implementation example of a table obtained by aggregating Table 1 to Table 3 is shown in Table 4.
Optionally, the foregoing implementation process may be represented as an implementation process in
In addition, the N second network devices periodically obtain local flow communication relationships and topology information, and aggregate the local flow communication relationships and topology information to the first network device. The first network device integrates the aggregated local flow information tables of the N second network devices, and divides communication clusters based on the communication relationships, to determine the M data flows.
Optionally, after the N second network devices aggregate the local communication relationship and the topology information to the first network device, the first network device may determine whether the M data flows that currently need to be computed belong to a single-task scenario or a multi-task scenario. In the multi-task scenario, resource allocation may need to be performed for each individual task. An implementation process may be represented as an implementation process of “Step 402: Obtain local topology information and aggregate topology information of a complete task” in
Optionally, the first network device may further perform “Step 403: Determine whether all nodes are covered” in
The following further describes the implementation process corresponding to step 404 in
4041: Input the communication relationship.
4042: Establish a Dtor-Stor mapping matrix based on the communication relationship and the topology information.
4043: Perform traversal and allocation in descending order of quantities of occurrences of ToRs.
4044: Allocate a link resource in a sequence of Dtor uplink ports.
4045: Determine whether all the Dtors are traversed, where if all the Dtors are traversed, step 4046 is performed; or if not all the Dtors are traversed, step 4043 is performed again.
4046: Determine whether there is another communication relationship to which a resource is to be allocated, where if there is another communication relationship, step 4042 is performed again; or if there is no other communication relationship, step 4047 is performed.
4047: Resource allocation ends.
It can be learned from the foregoing implementation process that the first network device converts the communication relationships of a same task into a communication relationship matrix based on the topology information and the first packet time. In addition, the first network device converts the communication relationship matrix into the Dtor-Stor mapping matrix, and starts to allocate a resource from the ToR that appears most frequently. In a scenario shown in
In the foregoing implementation examples of Table 5 to Table 7, a ToR-1 appears most frequently, and the resource is allocated starting from the ToR-1. To be specific, when allocating the resource, the first network device performs polling allocation on ToR uplink ports based on a quantity of spines. As shown in
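The polling allocation of steps 4041 to 4047 may be sketched as follows. The data layout (a mapping from each Dtor to the Stors that send flows to it) and the function names are assumptions.

```python
from collections import Counter

def poll_allocate(dtor_to_stors, num_spines):
    # dtor_to_stors: maps each Dtor to the list of Stors that send flows to it.
    # num_spines bounds the uplink port numbers used for polling allocation.
    allocation = {}        # (stor, dtor) -> allocated uplink port number
    next_port = Counter()  # round-robin pointer per Stor

    # Step 4043: traverse Dtors in descending order of their quantities of occurrences.
    for dtor in sorted(dtor_to_stors, key=lambda d: len(dtor_to_stors[d]), reverse=True):
        # Step 4044: allocate link resources by polling the uplink ports in sequence.
        for stor in dtor_to_stors[dtor]:
            allocation[(stor, dtor)] = next_port[stor] % num_spines
            next_port[stor] += 1
    # Steps 4045-4047: the loop above ends when all Dtors are traversed; a caller
    # re-invokes the function for each remaining communication relationship.
    return allocation

# Example: Dtor-1 receives flows from three Stors, Dtor-2 from one, with 4 spines.
print(poll_allocate({1: [0, 2, 3], 2: [0]}, num_spines=4))
```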
In a possible implementation, the M data flows that are obtained by the first network device in step S302 correspond to one of a plurality of AI set communication tasks. Optionally, the M data flows correspond to a long steady-state flow task, and a flow volume of a data flow in the long steady-state flow task is greater than a preset threshold within specific duration.
For example, the solutions provided in this application are applicable to an elephant flow scenario in which a communication relationship periodically changes or in which a requirement on convergence time is strict. The elephant flow scenario is mainly subdivided into a long steady-state flow scenario and an AI set communication scenario.
Optionally, a communication relationship feature of the long steady-state flow scenario is that the communication relationship usually does not change, or may be considered as having an infinite change periodicity. In this scenario, as a task starts and ends, the flow table is updated and aged. In the procedure, an edge switching device may need to obtain a local flow communication relationship only after a new task starts, and aggregate the local flow communication relationship into a flow information table on the first network device, so that the flow information table can serve as an input of the global optimal path allocation algorithm for computation. However, if destination IPs are the same but source IPs are different, it indicates that an incast flow exists in the scenario. The incast scenario of a long steady-state flow is not a problem to be resolved in the present disclosure. If such a communication relationship occurs, the global optimal path allocation algorithm does not need to be used for computation.
Optionally, a communication relationship feature of the AI set communication scenario is that for inter-device communication in the AI set communication scenario, an entire process is usually divided into a plurality of phases based on an algorithm. In each phase, all communication relationships between devices change, and all devices in a communication cluster may need to be synchronized when each phase starts. This imposes an extremely high requirement on a convergence speed of the algorithm. The following describes examples of three types of algorithms mainly used in the AI set communication scenario. The three types of algorithms include a halving-doubling (H-D) algorithm, a ring algorithm, and a tree algorithm.
In a possible implementation, the H-D algorithm is divided into two stages. A first stage is halving, and a reduce-scatter operation is performed, as shown in
Specifically, a communication relationship feature of the AI set communication scenario is that the entire inter-device communication process is usually divided into a plurality of phases based on the algorithm, and the communication relationships between the devices change in different phases. A communication step in the nth phase is N/2^n, where n is the current phase, and N is the quantity of communication nodes. In addition, when each phase starts, all the devices in a communication cluster should be synchronized. This imposes an extremely high requirement on a convergence speed of the algorithm.
A second stage is doubling, and an all-gather operation is performed, as shown in
In conclusion, in an all-reduce process of the inter-device communication in the AI set communication that works based on the H-D algorithm, a total of 2 × log2 N phases may be needed, where N is the quantity of communication nodes. Before an optimal path for a flow on a network is computed based on the global optimal path allocation algorithm, the flow communication relationships in all phases of the entire all-reduce process may need to be restored. In the computation stage of the global optimal path allocation algorithm, the optimal paths in all the phases are also computed to obtain an optimal result.
As shown in
In a possible implementation, the ring algorithm is different from the H-D algorithm mainly in that the communication relationships do not change in the entire communication process of the ring algorithm.
For example,
It can be learned that in the all-reduce process, the N communication nodes perform a total of 2(N−1) phases, but the communication relationships of the nodes do not change in any phase. Therefore, a communication relationship processing manner of such an algorithm is the same as that of a long steady-state flow. Only the optimal path of a single phase needs to be obtained through computation, and that result is the global optimal path computation result of all the phases.
In a possible implementation, a difference between a tree algorithm and the ring algorithm lies in that in the tree algorithm, a communication relationship changes in each phase. A difference between the tree algorithm and the H-D algorithm lies in that in the tree algorithm, for a same node in a same phase, the communication object to which data is sent is different from the communication object from which data is received.
For example,
In conclusion, the manners in which the three algorithms change the communication relationships of the nodes, as well as the communication relationship feature of the long steady-state flow, are all different. The communication features of the long steady-state flow are similar to those of the ring algorithm, and do not change in the entire inter-device communication process. The communication relationship between nodes in the H-D algorithm periodically changes, and the halving process and the doubling process are inverse to each other; in a same phase, the sending object and the receiving object of a node are the same. The communication relationship between nodes in the tree algorithm is similar to that in the H-D algorithm, and periodically changes. However, different from the H-D algorithm, in a same phase, for a same node in the tree algorithm, the objects that send and receive data are different. Based on the foregoing features, a flow can be recognized, and the communication relationships of all phases can be restored. In addition, the algorithm used for current communication may be further specified in a file, for example, an algorithm flag of ring, H-D, tree, or static communication is explicitly provided in the file.
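To make these features concrete, the following sketch generates per-phase communication relationships for the three algorithms, assuming nodes numbered 0 to N−1 with N a power of two; the exact pairings used by a real collective communication library may differ, and the tree pairing in particular is only illustrative.

```python
import math

def hd_phases(n):
    # H-D: in phase p of halving, node i exchanges with node i XOR (n >> (p + 1)),
    # so the send peer and the receive peer are the same; doubling mirrors halving
    # (consistent with the communication step N/2^n noted above).
    phases = []
    for p in range(int(math.log2(n))):
        step = n >> (p + 1)
        phases.append([(i, i ^ step) for i in range(n)])
    return phases + phases[::-1]  # 2 * log2(n) phases in total

def ring_phases(n):
    # Ring: the pairing (i -> i+1 mod n) is identical in all 2(n-1) phases.
    pairs = [(i, (i + 1) % n) for i in range(n)]
    return [pairs for _ in range(2 * (n - 1))]

def tree_phases(n):
    # Tree (illustrative pairing): per phase, a node's send peer and receive peer
    # differ; real tree algorithms vary in the exact parent/child pattern.
    phases = []
    for p in range(int(math.log2(n))):
        step = 1 << p
        phases.append([(i + step, i) for i in range(0, n, 2 * step)])
    return phases

print(len(hd_phases(8)), len(ring_phases(8)), len(tree_phases(8)))  # 6 14 3
```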
Optionally, the solutions provided in this application are applicable to a scenario in which overall network bandwidth is sufficient. That the overall network bandwidth is sufficient is represented as follows: Assume that n flows each with bandwidth d exist in a network, and the n flows need to be distributed over m equal-cost links each with capacity c. If nd > mc, the overall network bandwidth is insufficient, and a congestion problem cannot be resolved by a load balancing mechanism alone; the network congestion problem may need to be resolved by the load balancing mechanism in coordination with a scheduling algorithm or a congestion control algorithm. The present disclosure is applicable to a scenario in which nd ≤ mc, for example, the AI set communication scenario mentioned above, in which the elephant flow is dominant, the communication relationship periodically changes, and the requirement on the algorithm convergence time is strict; and the long steady-state flow scenario, in which the communication relationship is fixed.
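For illustration, this applicability condition can be written as a simple check (the variable names are illustrative):

```python
def bandwidth_sufficient(n_flows, d_flow_bandwidth, m_links, c_link_capacity):
    # The solutions apply when n*d <= m*c: the total elephant-flow load fits within
    # the aggregate capacity of the m equal-cost links.
    return n_flows * d_flow_bandwidth <= m_links * c_link_capacity
```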
S303: The first network device determines the M paths based on the communication relationships of the M data flows and the first topology information.
In this embodiment, after the first network device obtains the first topology information in step S301 and obtains the communication relationships of the M data flows in step S302, in step S303, the M paths determined by the first network device based on the communication relationships of the M data flows and the first topology information respectively correspond to the M data flows. The M paths indicate paths through which the M data flows are transmitted by the N second network devices to the P third network devices.
In a possible implementation, the M data flows are transmitted by the P third network devices to the K fourth network devices, where K is a positive integer. A specific process in which the first network device determines the M paths based on the communication relationships of the M data flows and the first topology information in step S303 includes: The first network device determines a first mapping relationship based on the communication relationships of the M data flows and the first topology information, where the first mapping relationship indicates a mapping relationship between a second network device corresponding to the source address information of each of the M data flows and a fourth network device corresponding to the destination address information of each of the M data flows. The first network device determines the M paths based on the first mapping relationship.
Optionally, the first topology information further includes connection relationships between the P third network devices and the K fourth network devices.
Optionally, any fourth network device is different from any second network device.
Optionally, at least one of the N second network devices and at least one of the K fourth network devices are a same network device.
Optionally, N is equal to K, and the N second network devices and the K fourth network devices are a same network device.
Specifically, an implementation in which the first network device determines the M paths is provided, so that the first network device determines the M paths based on mapping relationships between the N second network devices and the K fourth network devices.
In a possible implementation, that the first network device determines the M paths based on the first mapping relationship includes: The first network device determines first sorting information based on the first mapping relationship, where the first sorting information indicates sorting of a quantity of second network devices corresponding to the K fourth network devices. The first network device sequentially traverses the egress ports on the N second network devices based on the first sorting information, to obtain a second mapping relationship, where the second mapping relationship indicates mapping relationships between the egress ports on the N second network devices and the K fourth network devices. The first network device determines the M paths based on the second mapping relationship. Specifically, an implementation in which the first network device determines the M paths is provided, so that the first network device determines the M paths based on the first mapping relationship, the first sorting information, and the second mapping relationship that are sequentially determined.
In a possible implementation, that the first network device sequentially traverses egress ports on the N second network devices based on the first sorting information, to obtain a second mapping relationship includes: The first network device sequentially traverses the egress ports on the N second network devices based on the first sorting information, to obtain a third mapping relationship, where the third mapping relationship indicates an optional quantity of egress ports on the second network device corresponding to each fourth network device. The first network device determines the second mapping relationship based on the third mapping relationship. Specifically, the third mapping relationship indicates the optional quantity of egress ports on the second network device corresponding to each fourth network device. A larger value of the optional quantity indicates smaller uncertainty of a corresponding optional path of the fourth network device. On the contrary, a smaller value of the optional quantity indicates greater uncertainty of a corresponding optional path of the fourth network device. Therefore, the second mapping relationship determined based on the third mapping relationship can be used to preferentially traverse egress ports on the second network device corresponding to a fourth network device whose optional path has small uncertainty, to improve accuracy of the solution, and avoid a conflict between the M paths subsequently determined based on the second mapping relationship.
S304: The first network device separately sends the M paths to the N second network devices.
In this embodiment, after determining the M paths in step S303, the first network device separately sends the M paths to the N second network devices in step S304.
In a possible implementation, the M data flows are transmitted by the P third network devices to the K fourth network devices, where K is an integer greater than or equal to 1. The M data flows corresponding to the communication relationships that are obtained by the first network device in step S302 include a first data flow and a second data flow, where source address information of the first data flow and source address information of the second data flow correspond to different second network devices, and destination address information of the first data flow and destination address information of the second data flow correspond to a same fourth network device. In addition, the M paths determined by the first network device in step S303 include a first path and a second path, where the first path corresponds to the first data flow, the second path corresponds to the second data flow, and the first path and the second path correspond to different third network devices.
Specifically, the M paths determined by the first network device include the first path corresponding to the first data flow and the second path corresponding to the second data flow, and the first path and the second path correspond to the different third network devices, where the source address information of the first data flow and the source address information of the second data flow correspond to the different second network devices, and the destination address information of the first data flow and the destination address information of the second data flow correspond to the same fourth network device. In other words, because any fourth network device is a downstream network device of any third network device, after the first data flow and the second data flow are separately sent by different second network devices to different third network devices, the different third network devices separately send the first data flow and the second data flow to a same fourth network device. This can avoid network congestion generated in a process in which data flows from different second network devices are transmitted by a same third network device and then transmitted to a same fourth network device by the same third network device, to improve transmission efficiency of the first data flow and the second data flow.
In a possible implementation, the M paths determined by the first network device in step S303 further indicate egress ports of the M data flows on the N second network devices. Specifically, in addition to indicating the paths through which the M data flows are transmitted by the N second network devices to the P third network devices, the M paths determined by the first network device further indicate the egress ports of the M data flows on the N second network devices, so that the N second network devices can determine, after receiving the M paths, the egress ports for sending the M data flows.
In a possible implementation, the implementation method in
In a possible implementation, the M data flows corresponding to the communication relationships that are obtained by the first network device in step S302 include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to a same second network device. In addition, the M paths determined by the first network device in step S303 include a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path. Specifically, the M paths determined by the first network device include the third path corresponding to the third data flow and the fourth path corresponding to the fourth data flow, and the third path is different from the fourth path. The source address information of the third data flow and the source address information of the fourth data flow correspond to the same second network device. In other words, the third data flow and the fourth data flow are separately transmitted by the same second network device through different paths. This can avoid network congestion generated in a process in which data flows from a same second network device are transmitted through a same path, to improve transmission efficiency of the third data flow and the fourth data flow.
According to the foregoing technical solution, after obtaining the first topology information including the connection relationships between the N second network devices and the P third network devices in step S301, and obtaining the communication relationships of the M data flows in step S302, the first network device determines the M paths based on the communication relationships of the M data flows and the first topology information in step S303, and sends the M paths to the N second network devices in step S304. Then, the N second network devices may separately send the M data flows to the P third network devices based on the M paths. In other words, serving as a device for determining a path, the first network device can determine the M paths corresponding to M data flows transmitted between N second network devices and P third network devices. Therefore, by comparison with an implementation in which a path conflict is easily caused because the N second network devices determine paths only based on local data flows, in the foregoing method, the first network device can determine the paths based on global information, to avoid a path conflict and improve data flow forwarding efficiency.
The following uses an example in which a network device (that is, a second network device) corresponding to source address information is a source top of rack (Stor) switch and a network device (that is, a fourth network device) corresponding to destination address information is a destination top of rack (Dtor) switch to describe an implementation process of step S303.
In a possible implementation, the M paths determined in step S303 may be used to resolve an optimal path allocation problem. The optimal path allocation problem is mathematically an exact cover (precise coverage) problem, which is NP-complete. It is described as follows: A universal set X is the uplink ports that are on the Stors and that correspond to the Dtors at which the DIPs in all IP pairs are located, and a subset S is the uplink port that is on a Stor and that corresponds to each Dtor. The problem requires that S be an exact cover of X; to be specific, the uplink ports on the Stors corresponding to all the Dtors cannot be repeatedly allocated. In this way, a purpose of allocating an optimal path to each flow in a network is achieved. An entire optimal path computation and allocation process is divided into the following steps. The algorithm used in step S303 to resolve the problem may be referred to as a flow matrix algorithm (FMA).
For example, an implementation process of the first network device in step S303 may be represented as an implementation process (that is, “Step 407: Perform global optimal path computation by task”) in a dashed box 4 in
Step 4071: The first network device obtains the input communication relationship [SIP, DIP].
Step 4072: The first network device establishes the Dtor-Stor mapping matrix based on the communication relationship and the topology information.
Step 4073: The first network device traverses a flow matrix in a Dtor dimension, and guides a matrix column iteration sequence by using a negative effect value (NEV).
Step 4074: The first network device guides a matrix row computation sequence in an iteration by using a horizontal degree of freedom (HDOF) value and a vertical degree of freedom (VDOF) value.
Step 4075: Determine whether a for loop of traversing the Dtors is completed, where if the for loop of traversing the Dtors is completed, step 4076 is performed; or if the for loop of traversing the Dtors is not completed, step 4073 is repeatedly performed.
Step 4076: Output an optimal allocation result.
Optionally, as shown in
The following describes step 4071 and step 4072 by using some implementation examples.
In step 4071, an implementation of aggregating, by the first network device, flow information of a task communication cluster is shown in Table 8.
As shown in Table 8, the complete flow table aggregated by the first network device includes several important items: the flow number, flow table generation switch information, a communication relationship pair, that is, [SIP, DIP], the first packet time, and the quantity of bytes.
Optionally, as an input of the algorithm, initialization processing may need to be performed on the flow table. First, the information needed in a computation process may need to be extracted from the original flow table. The information related to computation in the table includes the switch that generates the flow table entry and the communication IP pair [SIP, DIP]. Then, a unidirectional communication flow table in the original table is converted into a bidirectional flow table, as shown in Table 9 and Table 10.
In addition, the global optimal path allocation algorithm is to perform iterative computation in the Dtor dimension. Therefore, in step 4072, the original bidirectional flow table is converted into the Dtor-Stor mapping matrix, as shown in Table 11.
In step 4071 and step 4072, the original communication relationship table is first converted into a Dtor-SIP mapping matrix, and then the Stor at which each SIP is located is found from the flow table generation switching device entry in the original table. Therefore, the original communication relationship table is finally converted into the Dtor-Stor mapping relationship matrix. A core of the global optimal path allocation algorithm is to compute the Dtor-Stor mapping relationship matrix.
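A simplified sketch of this conversion is shown below. For brevity, both the Stor and the Dtor are derived from an assumed IP-to-ToR mapping instead of from the flow table generation switching device entry.

```python
def build_dtor_stor_matrix(flow_table, tor_of):
    # flow_table: entries with "sip" and "dip"; tor_of maps an IP to its ToR.
    pairs = set()
    for entry in flow_table:
        # Convert the unidirectional table into a bidirectional one (Table 9 -> 10).
        pairs.add((entry["sip"], entry["dip"]))
        pairs.add((entry["dip"], entry["sip"]))
    matrix = {}
    for sip, dip in sorted(pairs):
        stor, dtor = tor_of[sip], tor_of[dip]
        # An entry of -1 marks a flow that passes through only the local ToR.
        matrix.setdefault(dtor, []).append(stor if stor != dtor else -1)
    return matrix

# Example: two IPs per ToR, with one flow in each direction between the ToRs.
tor_of = {"10.0.0.1": 0, "10.0.0.2": 0, "10.0.1.1": 1, "10.0.1.2": 1}
table = [{"sip": "10.0.0.1", "dip": "10.0.1.1"},
         {"sip": "10.0.1.2", "dip": "10.0.0.2"}]
print(build_dtor_stor_matrix(table, tor_of))  # {1: [0, 0], 0: [1, 1]}
```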
It may be understood that the Dtor-Stor mapping matrix shown in Table 11 is an implementation example of the first mapping relationship in step S303.
The following describes step 4073, step 4074, and step 4075 by using some implementation examples.
In step 4073, the Dtor-Stor mapping matrix and an available port matrix of the ToR may need to be initialized first. Specifically, the first network device converts the aggregated original flow table into the Dtor-Stor mapping matrix, as shown in Table 11. In the Dtor-Stor mapping matrix, a column represents a Dtor, and a row represents a number of an egress port that is on a Stor and that is used by a flow to the Dtor. An entry in the matrix represents the Stor corresponding to the flow of the Dtor of the column in which the entry is located. In addition, an entry "−1" in the Dtor-Stor mapping matrix represents a flow that passes through only a local ToR. Two operations may need to be performed in the initialization stage. A first operation is to convert the original flow table into the Dtor-Stor mapping relationship matrix. A second operation is to generate an available egress port matrix of the ToR, as shown in Table 12.
In Table 12, a column of the matrix is defined as a ToR number, and a row is defined as an alternative egress port number when a flow passes through each ToR. The entire optimal path computation stage consists of computing the Dtor-Stor mapping matrix and the available egress port matrix of the ToR.
The flow-based load balancing mentioned above may have a risk of the global conflict. The conditions forming the global conflict are as follows: Different flows pass through different Stors, flow to nodes in a same ToR, and select a same spine as a relay. Therefore, when the Dtor-Stor mapping matrix is processed, if the requirement of Formula 1 is met, it can be ensured that the global conflict problem is avoided:

x(i, j) = k ⇒ x(i, j′) ≠ k for any j′ ≠ j, and x(i′, j) ≠ k for any i′ ≠ i  (Formula 1)

In Formula 1, x represents the Dtor-Stor mapping relationship matrix, i represents a row index of the matrix, j represents a column index of the matrix, and k represents the element of the matrix in cell (i, j). To ensure that the equation is true, corresponding elements are mutually exclusive in the rows and columns of the matrix, which is also referred to as the elements being unique in row space and column space. In this case, there is no overlapping link for the flows on the corresponding network, that is, neither the local conflict problem nor the global conflict problem occurs, so that the path of each flow is optimally allocated.
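A small self-check of the constraint, written directly from Formula 1 as reconstructed above, may look as follows; the list-of-rows matrix layout and the exemption of local-flow entries (−1) are assumptions.

```python
def satisfies_formula_1(x):
    # x: Dtor-Stor mapping matrix as a list of rows; entries of -1 (local flows)
    # are exempt from the uniqueness requirement.
    rows, cols = len(x), len(x[0])
    for i in range(rows):
        for j in range(cols):
            k = x[i][j]
            if k == -1:
                continue
            if any(x[i][jj] == k for jj in range(cols) if jj != j):
                return False  # k repeats in row i: a Stor port is doubly allocated
            if any(x[ii][j] == k for ii in range(rows) if ii != i):
                return False  # k repeats in column j: flows to one Dtor collide
    return True  # unique in row space and column space: no local/global conflict

print(satisfies_formula_1([[1, 2], [2, 1]]))  # True: a Latin-square-like allocation
print(satisfies_formula_1([[1, 2], [1, 2]]))  # False: columns repeat elements
```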
In step 4073, after initializing the algorithm input data structures, the first network device computes the flow matrix based on the FMA algorithm. The FMA algorithm performs iterative computation in the Dtor dimension of the flow matrix, that is, traverses all columns in the Dtor-Stor mapping matrix. The algorithm selects a traversal sequence by using the negative effect value, that is, the NEV indicator, of each column. The NEV is defined as the total quantity of distinct entries that appear in a column minus 1, as shown in Table 13.
In Table 13, when a flow is selected for computation and allocation of an egress port on its Stor, under the constraint of ensuring that there is no conflict between flows, other flows with the same Dtor are affected by the computation result of the 1st flow, and computation and allocation are passively performed on differently numbered egress ports of their respective Stors. The magnitude of the impact that computation and allocation of one flow have on the computation results of other flows is the negative effect value. According to the definition of the NEV indicator, NEV computation is performed on the columns of the to-be-computed flow matrix before each round of iteration of the Dtor traversal.
For example, Table 13 provides the computation before a 1st round of iteration. Except for a Dtor-4 and a Dtor-5, the elements in each Dtor column are four non-repeated values. According to the definition of the NEV, the negative effect value is 4 − 1, that is, 3. When NEV values are the same, the column to enter the current iteration periodicity is selected based on the natural order of the Dtors. When the NEV values are different, for example, after a Dtor-0 (whose elements are 1, 2, 3, and 7) is selected for computation in a 1st iteration periodicity and before computation in a 2nd iteration periodicity is performed, the NEV value of a Dtor-1 is 7 − 1, that is, 6, and the NEV value of a Dtor-3 is 5 − 1, that is, 4. The NEV value of the Dtor-3 is less than that of the Dtor-1, and is also the smallest value in all to-be-traversed Dtor columns. In this case, based on the constraint on the traversal sequence, the Dtor-3 enters iterative computation with a 2nd priority. In this manner, certainty of the algorithm can be ensured. It can be learned from Table 13 that NEV computation is not performed for the Dtor-4 and the Dtor-5 when the current iteration periodicity starts. The reason is that the two Dtors have local flows, so that fewer flows leave the local switching device than there are egress ports. Table 13 is used as an example: only two flows of each of the Dtor-4 and the Dtor-5 need to be computed and allocated on the respective four egress ports, and the total quantity of allocation manners is 12. As a result, the certainty of the algorithm and the correctness of subsequent iterative computation would be seriously affected, and an optimal solution might not even be obtained through computation in a subsequent iteration. Therefore, the traversal sequence of a Dtor having local flows may need to be at the rear. Based on the FMA algorithm, traversal and iteration are performed on all Dtors according to this method.
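A minimal sketch of the NEV computation and the column selection is shown below; the matrix layout (a dictionary from Dtor number to the column's Stor entries) and the deferral of Dtors with local flows are modeled on the foregoing description, with all names assumed.

```python
def nev(column):
    # NEV = total quantity of distinct (non-local) entries in the column, minus 1.
    return len(set(e for e in column if e != -1)) - 1

def next_dtor(matrix, remaining):
    # Defer Dtors that have local flows (-1 entries) to the rear of the traversal;
    # among the rest, pick the smallest NEV, breaking ties in natural Dtor order.
    def key(d):
        has_local = any(e == -1 for e in matrix[d])
        return (has_local, nev(matrix[d]), d)
    return min(remaining, key=key)

matrix = {0: [1, 2, 3, 7], 3: [1, 2, 4, 5, 6]}
print(next_dtor(matrix, matrix.keys()))  # 0: NEV 3 beats NEV 4
```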
In step 4074 and step 4075, the first network device guides allocation of ToR ports by using degree of freedom values. Specifically, after the Dtor of the current iteration is selected by using the NEV, egress port numbers of the Stors may need to be computed and allocated to the different flows of the same Dtor in the iteration, as shown in Table 14.
In Table 14, the Dtor of the 1st round of iteration is selected as the Dtor-0 according to the method described in Table 13, and the elements of the Dtor-0 include flows from Stors 1, 2, 3, and 7. In the current iteration periodicity, the flows on the egress ports of the Stors are computed and allocated based on a vertical degree of freedom (VDOF) value and a horizontal degree of freedom (HDOF) value.
For example, an implementation process in which the VDOF indicator and the HDOF indicator guide the row computation and allocation sequence of the flow matrix is as follows. The HDOF is defined as the total quantity of available egress ports with a same number in the submatrix of the round of iterative computation. The algorithm specifies that computation and allocation are performed in ascending order of HDOF values; if the HDOF values are the same, computation and allocation are performed in ascending natural order. As shown in Table 14, in the initialization phase, the egress ports on all ToRs are in a to-be-computed and to-be-allocated state. Therefore, in the submatrix of the 1st round of iteration, which includes four ToRs, the egress ports with the same numbers are all available, and all HDOF values are 4. In the iteration process of traversing a Dtor, the flowlet matrix formed in each round of iteration is extremely random. Computation and allocation of the egress ports on the ToRs are performed under the constraint of the HDOF value. This can ensure that an optimal solution can be obtained through computation in each iteration periodicity. The other indicator is the VDOF. The VDOF is defined as the quantity of egress ports on a corresponding ToR that can still be selected after the round of iterative computation. The algorithm specifies that computation and allocation are performed in ascending order of VDOF values; if the VDOF values are the same, computation and allocation are performed in ascending natural order. As shown in Table 14, in the current iteration periodicity, for the flows whose Dtor is 0, one port may need to be separately selected from the egress ports on each Stor. The initialized quantity of selectable ports on each Stor is 4. Therefore, after computation and allocation in the current periodicity end, the quantity of selectable ports on each Stor in the current flowlet matrix is 4 − 1, that is, 3. Therefore, in the current iteration periodicity, computation and allocation of the ports may need to be performed in the natural order of the ToRs. In Table 14, the blocks marked with a symbol "x" represent the computation and allocation results of the several flows of the Dtor-0 on the egress ports of the Stors in this round of iteration.
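The port allocation within one iteration periodicity may be sketched as follows. The available-port layout (ToR number to set of free port numbers) and the rule that ports allocated within one column must differ (so that flows to a same Dtor do not share a spine) are assumptions consistent with the conflict conditions above.

```python
def allocate_column(stors, available):
    # Allocate one egress port per Stor for the flows of the current Dtor.
    allocation, used_in_column = {}, set()
    # VDOF: ports still selectable on a Stor; process Stors in ascending VDOF order,
    # breaking ties in natural ToR order.
    for stor in sorted(set(stors), key=lambda s: (len(available[s]), s)):
        candidates = available[stor] - used_in_column
        # HDOF of a port number: on how many of this column's Stors it is still
        # free; allocate the most constrained (smallest HDOF) port first.
        def hdof(port):
            return sum(1 for s in set(stors) if port in available[s])
        port = min(candidates, key=lambda p: (hdof(p), p))
        available[stor].discard(port)
        used_in_column.add(port)
        allocation[stor] = port
    return allocation

# Example mirroring Table 14: the Dtor-0 column holds flows from Stors 1, 2, 3, 7.
available = {s: {0, 1, 2, 3} for s in (1, 2, 3, 7)}
print(allocate_column([1, 2, 3, 7], available))  # {1: 0, 2: 1, 3: 2, 7: 3}
```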
It may be understood that an implementation process in any one of Table 12 to Table 14 is an implementation example of the third mapping relationship in step S303.
It can be learned from the implementation processes of step 4073 to step 4075 that, according to the computation rules of the FMA algorithm, the entire computation process of the global optimal path allocation algorithm is completed based on three indicator values and in four dimensions. In addition to computation of the three indicators in their respective dimensions, historical computation results in the time dimension are considered, to ensure maximum certainty of allocation in each iteration and ensure that the optimal solution can be continuously obtained through computation in the subsequent iteration process until the iteration ends. A global optimal path allocation result may be obtained through computation by traversing all the Dtors according to the foregoing iteration rules, as shown in Table 15.
It may be understood that the global optimal path allocation result shown in Table 15 is an implementation example of the second mapping relationship in step S303.
It may be understood that, as described above, in the global optimal path allocation algorithm, traversal computation may need to be performed on the flow matrix in the respective dimensions of the three indicator values. Therefore, the time complexity is O(n³), where n is the quantity of switch ports. The computation result may be verified by traversing the rows and columns of the flow matrix according to a flow non-crossing determination rule. Therefore, the time complexity of the verification is O(n²), where n is the quantity of switch ports.
Then, the first network device records the optimal path allocation result obtained through computation into an original flow table, as shown in Table 16.
It can be learned from Table 16 that a key output of path planning, that is, the next hop on the Stor, may be obtained through flow path orchestration and computation. The output computation result matrix may be converted into a path planning table related to the egress ports on the Stors. In addition, the result is synchronized to the edge switching devices of the network, to guide selection, for a flow, of the egress port corresponding to the next hop on an edge switching node, thereby completing the global path allocation procedure.
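Recording the allocation result back into a per-flow table may look like the following sketch; the data layouts follow the earlier sketches and are assumptions.

```python
def to_path_planning_table(flow_table, allocation, tor_of):
    # allocation: (stor, dtor) -> egress port number computed by the algorithm.
    planning = []
    for entry in flow_table:
        stor, dtor = tor_of[entry["sip"]], tor_of[entry["dip"]]
        planning.append({
            "sip": entry["sip"], "dip": entry["dip"], "stor": stor,
            # The allocated egress port identifies the next hop on the Stor.
            "egress_port": allocation.get((stor, dtor)),
        })
    return planning
```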
In conclusion, a core innovation point of embodiments of this application lies in a multi-layer network system including the N second network devices and the P third network devices. After obtaining the communication relationships and the topology information, the first network device performs optimization computation based on the aggregated communication relationship information and the network topology information by using the global optimal path allocation algorithm. The first network device sends the optimal result obtained through computation to the N second network devices. Finally, the N second network devices perform path selection by using the received optimal computation result as a local flow path allocation guide, to implement load balancing and control network congestion. Therefore, when the N second network devices perform service packet forwarding, optimal paths are obtained through computation for all flows based on the global optimal path allocation algorithm, to perform one-hop convergence, without performing path selection on a single local device through computation according to a hash function. This resolves the local conflict problem in a flow-based multi-path load balancing method and the global conflict problem caused by a local decision mechanism. In addition, service packet forwarding paths for fixed communication relationships are consistent, which resolves the packet disorder problem in a packet-based multi-path load balancing method.
Optionally, the N second network devices obtain communication relationship information of flows entering the network, including the source IP, the destination IP, and the first packet time. The N second network devices obtain the local topology information, and aggregate the obtained local flow communication relationship information and topology information to the first network device.
Optionally, the first network device combines the aggregated communication relationship information and topology information of the edge switching nodes into a network flow information table. The network flow information table records the topology information and the communication relationship of at least one edge node, and serves as an input of the global optimal path allocation algorithm. A column of the network flow information table represents mapping from a destination edge switching node of a flow to all source switching nodes that have flows to the destination edge switching node. A row of the network flow information table represents an egress port number of a to-be-allocated source switching node.
Optionally, the first network device obtains and outputs a network flow path allocation table through computation based on the global optimal path allocation algorithm. The network flow path allocation table includes network path allocation information of the flows on the N second network devices. A column of the network flow path allocation table represents mapping from a destination edge switching node of a flow to all source switching nodes that have flows to the destination edge switching node. A row of the network flow path allocation table represents the egress port allocated, at a source edge node, to a flow to the destination edge node.
Refer to
When the communication apparatus 800 is configured to implement functions of the foregoing first network device, the communication apparatus includes a transceiver unit 801 and a processing unit 802. The transceiver unit 801 is configured to obtain first topology information, where the first topology information includes connection relationships between N second network devices and P third network devices, any second network device is an upstream network device of any third network device, N is an integer greater than or equal to 2, and P is an integer greater than or equal to 1. The transceiver unit 801 is further configured to obtain communication relationships of M data flows, where a communication relationship of each of the M data flows includes source address information and destination address information, M is an integer greater than or equal to 2, and the M data flows are separately transmitted by the N second network devices to the P third network devices. The processing unit 802 is configured to determine M paths based on the communication relationships of the M data flows and the first topology information, where the M paths respectively correspond to the M data flows, and the M paths indicate paths through which the M data flows are transmitted by the N second network devices to the P third network devices. The transceiver unit 801 is further configured to separately send the M paths to the N second network devices.
In a possible implementation, the M data flows are transmitted by the P third network devices to K fourth network devices, where K is an integer greater than or equal to 1. The M data flows include a first data flow and a second data flow, where source address information of the first data flow and source address information of the second data flow correspond to different second network devices and destination address information of the first data flow and destination address information of the second data flow correspond to a same fourth network device; and the M paths include a first path and a second path, where the first path corresponds to the first data flow, the second path corresponds to the second data flow, and the first path and the second path correspond to different third network devices.
In a possible implementation, the M paths further indicate egress ports of the M data flows on the N second network devices.
In a possible implementation, the M data flows include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to a same second network device; and the M paths include a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path.
In a possible implementation, the M data flows are transmitted by the P third network devices to the K fourth network devices, where K is a positive integer. The processing unit 802 is specifically configured to: determine a first mapping relationship based on the communication relationships of the M data flows and the first topology information, where the first mapping relationship indicates a mapping relationship between a second network device corresponding to the source address information of each of the M data flows and a fourth network device corresponding to the destination address information of each of the M data flows; and determine the M paths based on the first mapping relationship.
In a possible implementation, the processing unit 802 is specifically configured to: determine first sorting information based on the first mapping relationship, where the first sorting information indicates sorting of a quantity of second network devices corresponding to the K fourth network devices; sequentially traverse the egress ports on the N second network devices based on the first sorting information, to obtain a second mapping relationship, where the second mapping relationship indicates mapping relationships between the egress ports on the N second network devices and the K fourth network devices; and determine the M paths based on the second mapping relationship.
In a possible implementation, the processing unit 802 is specifically configured to: sequentially traverse the egress ports on the N second network devices based on the first sorting information, to obtain a third mapping relationship, where the third mapping relationship indicates an optional quantity of egress ports on the second network device corresponding to each fourth network device; and determine the second mapping relationship based on the third mapping relationship.
In a possible implementation, the transceiver unit 801 is further configured to obtain second topology information, where the second topology information includes connection relationships between A second network devices and the P third network devices, at least one of the A second network devices is the same as at least one of the N second network devices, and A is an integer greater than or equal to 1. The transceiver unit 801 is further configured to obtain communication relationships of B data flows, where a communication relationship of each of the B data flows includes source address information and destination address information, B is an integer greater than or equal to 1, and the B data flows are separately transmitted by the A second network devices to the P third network devices. The processing unit 802 is further configured to determine B paths based on the communication relationships of the B data flows and the second topology information, where the B paths respectively correspond to the B data flows, and the B paths indicate paths through which the B data flows are sent by the A second network devices to the P third network devices, where, on a same second network device, egress ports corresponding to the B paths are different from egress ports corresponding to the M paths. The transceiver unit 801 is further configured to separately send the B paths to the A second network devices.
In a possible implementation, the transceiver unit 801 is specifically configured to separately receive communication relationships that are of the M data flows and that are from the N second network devices.
In a possible implementation, the M data flows correspond to one of a plurality of AI set communication tasks.
In a possible implementation, the first network device is a controller or one of the P third network devices.
When the communication apparatus 800 is configured to implement functions of the foregoing second network device, the apparatus includes a transceiver unit 801 and a processing unit 802. The processing unit 802 is configured to determine communication relationships of Q data flows, where a communication relationship of each of the Q data flows includes source address information and destination address information, and Q is an integer greater than or equal to 1. The transceiver unit 801 is configured to send the communication relationships of the Q data flows to a first network device. The transceiver unit 801 is further configured to receive Q paths from the first network device, where the Q paths indicate paths used when the second network device transmits the Q data flows. The transceiver unit 801 is further configured to transmit the Q data flows based on the Q paths.
In a possible implementation, path information further indicates egress ports that are on the second network device and that are of the Q data flows.
In a possible implementation, the Q data flows include a third data flow and a fourth data flow, where source address information of the third data flow and source address information of the fourth data flow correspond to the second network device; and the path information includes a third path and a fourth path, where the third path corresponds to the third data flow, the fourth path corresponds to the fourth data flow, and the third path is different from the fourth path.
In a possible implementation, the Q data flows correspond to one of a plurality of AI set communication tasks.
It should be noted that for specific content such as an information execution process of each unit of the communication apparatus 800, refer to descriptions in the foregoing method embodiments of this application. Details are not described herein again.
An embodiment of this application further provides a communication apparatus 900.
Optionally, the communication apparatus 900 performs functions of the first network device in
Optionally, the communication apparatus 900 performs the functions of the second network device in
The communication apparatus 900 shown in
Optionally, the processor 901 implements the method in the foregoing embodiments by reading instructions stored in the memory 902, or the processor 901 may implement the method in the foregoing embodiments according to instructions stored inside. When the processor 901 implements the method in the foregoing embodiments by reading the instructions stored in the memory 902, the memory 902 stores the instructions for implementing the method provided in the foregoing embodiments of this application.
Optionally, the at least one processor 901 is one or more central processing units (CPUs). The CPU may be a single-core CPU, or may be a multi-core CPU.
Further optionally, the at least one processor 901 may be further configured to perform an implementation process corresponding to a processing unit 602 in the embodiment shown in
The memory 902 includes but is not limited to a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical memory, or the like. The memory 902 stores instructions of an operating system.
After program instructions stored in the memory 902 are read by the at least one processor 901, the communication apparatus performs a corresponding operation in the foregoing embodiments.
Optionally, the communication apparatus shown in
Further optionally, the network interface 903 may be further configured to perform an implementation process corresponding to a transceiver unit 601 in the embodiment shown in
It should be understood that the network interface 903 has functions of receiving data and sending data. The functions of "receiving data" and "sending data" may be integrated into a same transceiver interface for implementation, or the functions of "receiving data" and "sending data" may be respectively implemented in different interfaces. This is not limited herein. In other words, the network interface 903 may include one or more interfaces, configured to implement the functions of "receiving data" and "sending data".
For another function that can be performed by the communication apparatus 900 after the processor 901 reads program instructions in the memory 902, refer to the descriptions in the foregoing method embodiments.
Optionally, the communication apparatus 900 further includes a bus 904. The processor 901 and the memory 902 are usually connected to each other through the bus 904, or may be connected to each other in another manner.
Optionally, the communication apparatus 900 further includes an input/output interface 905. The input/output interface 905 is configured to: connect to an input device, and receive related configuration information that is input, via the input device, by a user or by another device that can interact with the communication apparatus 900. The input device includes but is not limited to a keyboard, a touchscreen, a microphone, and the like.
The communication apparatus 900 provided in this embodiment of this application is configured to: perform the method performed by the communication apparatus (the first network device or the second network device) provided in the foregoing method embodiments, and implement corresponding advantageous effects.
For example, when the communication apparatus 900 performs the functions of the first network device in
For another example, when the communication apparatus 900 performs the functions of the second network device in
For specific implementations of the communication apparatuses shown in
An embodiment of this application further provides a communication system. The communication system includes at least a first network device and N second network devices.
Optionally, the communication system further includes P third network devices.
Optionally, the communication system further includes K fourth network devices.
It should be understood that in the communication system, each network device may further apply another method in the foregoing embodiments, and implement corresponding technical effects. Details are not described herein again.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiments are merely examples. For example, division into the units is merely logical function division and may be another division during actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or another form.
In conclusion, the foregoing embodiments are merely intended for describing the technical solutions of this application, but not for limiting this application. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments or equivalent replacements may still be made to some technical features thereof. However, these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of embodiments of this application.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202210891496.0 | Jul 2022 | CN | national |
This is a continuation of International Patent Application No. PCT/CN2023/103818 filed on Jun. 29, 2023, which claims priority to Chinese Patent Application No. 202210891496.0 filed on Jul. 27, 2022, both of which are hereby incorporated by reference in their entireties.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2023/103818 | Jun 2023 | WO |
| Child | 19027916 | | US |