The present application claims the benefit of priority to Chinese Patent Application No. 201811300794.8, filed on Nov. 2, 2018, which is hereby incorporated by reference herein in its entirety.
Embodiments of the present disclosure relate to the field of data storage, and more specifically, to a method, electronic device and computer program product for handling congestion of data transmission.
More and more distributed storage systems are used in various data centers. In a distributed storage system, each storage node transmits data through a network based on the Transmission Control Protocol (TCP). When an end user reads data, a plurality of data nodes may simultaneously send data back to the client node. This many-to-one traffic pattern is also called incast and is common in data center applications. The presence of incast often causes network congestion, which reduces the performance of distributed storage systems.
Embodiments of the present disclosure provide a solution for handling congestion of data transmission.
In a first aspect of the present disclosure, there is provided a method for handling congestion of data transmission. The method comprises determining whether congestion caused by a plurality of storage nodes occurs at a first port of a switch, the first port being connected to a first storage node, the plurality of storage nodes transmitting data to the first storage node via the first port of the switch. The method also comprises, in response to determining that the congestion occurs at the first port, selecting at least a second storage node from the plurality of storage nodes. The method further comprises updating a configuration of a data transmission path for the second storage node, such that the second storage node transmits data to the first storage node while bypassing the first port.
In a second aspect of the present disclosure, there is provided an electronic device. The electronic device comprises a processor and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform acts. The acts comprise determining whether congestion caused by a plurality of storage nodes occurs at a first port of a switch, the first port being connected to a first storage node, the plurality of storage nodes transmitting data to the first storage node via the first port of the switch. The acts further comprise, in response to determining that the congestion occurs at the first port, selecting at least a second storage node from the plurality of storage nodes. The acts further comprise updating a configuration of a data transmission path for the second storage node, such that the second storage node transmits data to the first storage node while bypassing the first port.
In a third aspect of the present disclosure, there is provided a computer program product. The computer program product is tangibly stored on a computer readable medium and comprises machine executable instructions which, when executed, cause a machine to perform the method according to the first aspect of the present disclosure.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure.
The above and other objectives, features and advantages of the present disclosure will become more apparent through the following more detailed description of the example embodiments of the present disclosure with reference to the accompanying drawings, in which the same reference signs generally refer to like elements in the example embodiments of the present disclosure.
Principles of the present disclosure will now be described with reference to several example embodiments illustrated in the drawings. Although some preferred embodiments of the present disclosure are shown in the drawings, it would be appreciated that description of those embodiments is merely for the purpose of enabling those skilled in the art to better understand and further implement the present disclosure and is not intended for limiting the scope disclosed herein in any manner.
As used herein, the term “includes” and its variants are to be read as open-ended terms that mean “includes, but is not limited to.” The term “or” is to be read as “and/or” unless the context clearly indicates otherwise. The term “based on” is to be read as “based at least in part on.” The terms “an example embodiment” and “an embodiment” are to be read as “at least one example embodiment.” The term “another embodiment” is to be read as “at least one further embodiment.” The terms “first”, “second” and so on can refer to same or different objects. Other definitions, either explicit or implicit, may be included below.
As mentioned above, in a distributed storage system there exists incast (also referred to as TCP incast), where a plurality of sender nodes transmit data to one receiver node. When TCP incast occurs and causes network congestion (abbreviated as congestion below), the switch between the sender nodes and the receiver node drops a large number of packets. In practice, TCP incast is even worse than commonly assumed: most switches cannot handle TCP incast well, even with a cut-through forwarding mode for low latency. Table 1 shows test data of a switch in the presence of TCP incast, wherein “In Packet loss” represents the number of packets lost per second. As can be seen, even though the output network interface controller (NIC) of the switch still has half of its bandwidth available, packets start to drop aggressively at the input NIC of the switch.
Different switches have different capabilities of handling TCP incast, but all of them perform well when there is no TCP incast. By contrast, Table 2 shows test data of a switch without TCP incast. As seen from Table 2, the transmission/reception throughput is much higher than in the incast situation of Table 1, without any packet loss.
TCP controls throughput via its congestion control mechanism, yet the sender and receiver do not know each other's state until they receive an acknowledgment with a window update or a zero window from the peer. The speed of a data flow is also affected by many other factors, such as the receiving speed of the application, the speed of acknowledgments to the sender, and the sender's estimate of the congestion window. When performance degrades, it is therefore difficult for engineers to figure out why a flow has become slow.
In conventional implementations, when a problem occurs, the following methods are usually used to locate the problem in the storage system: (1) Check the application server log; if there is indeed a network error, the log sometimes gives a hint, but it usually does not provide more information, e.g. that incast is ongoing. (2) Use ss/netstat/iftop to roughly check the network situation. (3) Use tcpdump to capture packets for analysis in Wireshark; however, it is not easy to narrow down the problem quickly, these tools are not as accurate as expected, and a final judgment has to be made based on experience. (4) Log in to the switch to check a counter, such as a drop counter.
However, the inventors have realized that there are several problems in such implementations. None of the above approaches uses the logic inside TCP itself, and all the troubleshooting steps are performed manually and are time-consuming. Therefore, with the conventional troubleshooting approaches, it is hard to know what actually goes wrong on the network path and in the software stack, and it is difficult to make a concrete analysis. Due to the congestion caused by TCP incast, the network becomes the performance bottleneck of a distributed storage system under a high load.
The present disclosure provides a solution for handling congestion of data transmission so as to eliminate at least one of the above drawbacks. By monitoring states of a switch and a plurality of storage nodes in a distributed storage system in real time, it may be determined whether network congestion occurs at a port of the switch. When it is determined that congestion occurs at a certain port, at least one storage node is selected from the storage nodes that transmit data via that port. Then, by updating the configuration of a data transmission path for the selected storage node, the selected storage node is caused to transmit data while bypassing the congested port. In embodiments of the present disclosure, a congested portion of the storage system may be determined accurately, and a data transmission path may be controlled dynamically. In this way, more intelligent resource allocation is achieved and the data transmission efficiency between storage nodes is increased, thus improving the overall performance of the storage system.
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Ports 151-157 are arranged on the switch 150. These ports are connected to the storage nodes 110, 120, 130 and 140, respectively, e.g., through NICs on the storage nodes. In the example of FIG. 1, the ports 151, 152, 153 and 154 are connected to the first storage node 110, while the ports 155, 156 and 157 are connected to the storage nodes 120, 130 and 140, respectively.
Note that the numbers of ports and NICs shown in FIG. 1 are merely illustrative and are not intended to limit the scope of the present disclosure.
In order to monitor in real time the data transmission situation of each storage node and the state of the switch, a database 102 may be used in the storage system. The database 102 may be a time series database such as CloudDB. Of course, this is merely an example, and any database that can store time series data or receive data output as a stream may be used in conjunction with embodiments of the present disclosure. Information (e.g. TCP information) of the storage nodes 110, 120, 130 and 140 related to transmission control is streamed to the database 102 (as will be described in detail with reference to FIG. 3), and parameters related to the state of the switch 150 may likewise be output to the database 102.
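By way of illustration, the following Python sketch shows one possible way a storage node could stream its transmission-control information to such a time series database. It is a minimal sketch only: the HTTP endpoint, the record fields and the collect_tcp_info() helper are hypothetical and are not prescribed by the present disclosure.

import json
import time
import urllib.request


def collect_tcp_info():
    # Hypothetical collector. On Linux, such values could be read via the
    # TCP_INFO socket option or a kernel probe; fixed values are used here
    # so that the sketch stays self-contained.
    return {"cwnd": 10, "retrans": 0, "bytes_acked": 123456}


def stream_to_database(db_url, node_id, interval_s=1.0, samples=3):
    # Periodically push TCP metrics as timestamped JSON records.
    for _ in range(samples):
        record = {"node": node_id, "ts": time.time(), **collect_tcp_info()}
        request = urllib.request.Request(
            db_url,
            data=json.dumps(record).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(request)  # fire-and-forget write
        except OSError as exc:
            print(f"write to {db_url} failed: {exc}")
        time.sleep(interval_s)


stream_to_database("http://database-102.example/write", node_id="node-120")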
A control unit 101 may analyze the information in the database 102 so as to determine whether congestion occurs at a port of the switch 150.
Although it is shown that parameters related to the state of the switch 150 are output to the database 102, the control unit 101 may also obtain operation parameters from the switch 150 directly. The control unit 101 may be deployed on a dedicated computing device (e.g. dedicated server) or any storage node. No matter how the control unit 101 is deployed, the control unit 101 may communicate with each of the storage nodes 110, 120, 130 and 140 to update configuration of a data transmission path for the storage node.
Embodiments of the present disclosure are described in detail below with reference to FIG. 2, which shows a flowchart of an example process 200 for handling congestion of data transmission. The process 200 may be performed, for example, by the control unit 101.
At block 210, the control unit 101 determines whether congestion caused by a plurality of storage nodes occurs at the first port 151 of the switch 150. For example, in the example of FIG. 1, the plurality of storage nodes 120, 130 and 140 may be transmitting data to the first storage node 110 via the first port 151 of the switch 150.
As mentioned above, since congestion per se is a complex issue, the control unit 101 needs to determine the congestion at the first port 151 by taking into account factors of both the switch and the storage nodes. For example, if the congestion window of a socket of a certain storage node decreases while a drop counter of the switch keeps growing, it may be considered that congestion occurs in the storage system.
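This combined test can be expressed compactly. The following Python sketch assumes that recent congestion-window samples of a socket and recent drop-counter samples of the switch have already been read from the database 102; the sampling code and the data layout are assumptions made purely for illustration.

def congestion_suspected(cwnd_samples, drop_counter_samples):
    # cwnd_samples: congestion-window values of one socket, oldest first.
    # drop_counter_samples: switch drop-counter values, oldest first.
    cwnd_shrinking = cwnd_samples[-1] < cwnd_samples[0]
    drops_growing = drop_counter_samples[-1] > drop_counter_samples[0]
    return cwnd_shrinking and drops_growing


# Example: the window fell from 20 to 6 while the drop counter kept rising.
print(congestion_suspected([20, 12, 6], [100, 180, 250]))  # True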
The control unit 101 may obtain parameters related to the state of the switch 150, such as operation parameters of the ports 151-157. Such operation parameters may comprise the input NIC bandwidth, input NIC usage, output NIC bandwidth, output NIC usage and input packet loss of the ports, as listed in Tables 1 and 2.
The control unit 101 further needs to obtain and analyze information on the transmission control of the storage nodes 110, 120, 130 and 140. To this end, a TCP probe 330 may be provided at a storage node to collect such information, as shown in FIG. 3.
The TCP probe 330 may stream information (e.g. TCP information) on the transmission control of a storage node to the time series database 102. The information output by the TCP probe 330 may comprise parameters such as a congestion window (cwnd) and acknowledgment/sequence numbers (ack/seq). In addition, other critical information, such as netstat counters and the like, may also be output to the database 102. The TCP probe 330 may be dynamically enabled or disabled based on different policies, in order to reduce side effects of the TCP probe 330.
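One such policy might enable the probe only when the system is under load. The Python sketch below illustrates this idea; the TcpProbe class, the load metric and the watermark values are hypothetical stand-ins rather than the probe implementation contemplated by the disclosure.

import time


class TcpProbe:
    # Toy stand-in for a kernel- or library-level TCP probe.
    def __init__(self):
        self.enabled = False

    def sample(self):
        # A real probe would report cwnd, ack/seq and netstat counters.
        return {"ts": time.time(), "cwnd": 10, "ack": 0, "seq": 0}


def apply_policy(probe, system_load, high_watermark=0.8, low_watermark=0.5):
    # Enable the probe under high load; disable it again when load drops,
    # so that probing overhead is only paid when it is useful.
    if system_load >= high_watermark:
        probe.enabled = True
    elif system_load <= low_watermark:
        probe.enabled = False


probe = TcpProbe()
apply_policy(probe, system_load=0.9)
if probe.enabled:
    print(probe.sample())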
The information mentioned above is merely exemplary, and embodiments of the present disclosure may utilize any information related to the switch and the storage nodes. The control unit 101 may analyze such information in the database 102 in real time so as to determine whether congestion occurs at a port of the switch 150. An example process 400 of determining the congestion is described below with reference to FIG. 4.
At block 410, the control unit 101 determines whether a packet loss occurs at the first port 151 based on operation parameters of the first port 151. For example, if the control unit 101 determines, from the operation parameters output from the switch 150 to the database 102, that the parameter “in packet loss” of the first port 151 is not zero, then the control unit 101 may determine that a packet loss occurs at the first port 151.
If the control unit 101 determines the packet loss at the first port 151, then the process 400 proceeds to block 420. The control unit 101 may determine, using information in the database 102, that the storage nodes 120, 130 and 140 are transmitting data to the first storage node 110 via the first port 151.
At block 420, the control unit 101 obtains (e.g. from the database 102) information on transmission control of the plurality of storage nodes 120, 130 and 140. At block 430, the control unit 101 determines whether such information indicates a delay in data transmission at at least one of the plurality of storage nodes 120, 130 and 140. If the control unit 101 determines that the delay in data transmission occurs at at least one (e.g. storage node 130) of the plurality of storage nodes 120, 130 and 140, then the process 400 may proceed to block 440. At block 440, the control unit 101 determines that the congestion occurs at the first port 151.
In some embodiments, the information obtained at block 420 comprises a congestion window, a reduction of which indicates a delay in data transmission. In such embodiments, the control unit 101 may determine at block 430 whether the congestion window for the storage nodes 120, 130 and 140 is reduced. If the congestion window for at least one (e.g. storage node 130) of the storage nodes 120, 130 and 140 is reduced, then the control unit 101 may determine at block 440 that the congestion occurs at the first port 151.
In some embodiments, the information obtained at block 420 may further comprise other information or parameters that can be used to indicate a delay in data transmission. For example, such information may indicate whether repeated acknowledgments (ACKs) are received from the receiver (the first storage node 110 in this example). A sketch of the overall decision flow is given below.
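The following Python sketch mirrors blocks 410-440: check the port's packet-loss counter first, then inspect per-node TCP information for signs of delayed transmission. The record layout (the 'in_packet_loss', 'cwnd_prev', 'cwnd_now' and 'dup_acks' fields) is an assumption made for illustration.

def port_congested(port_stats, node_tcp_info):
    # Block 410: no packet loss at the first port means no congestion here.
    if port_stats["in_packet_loss"] == 0:
        return False
    # Blocks 420-430: look for a transmission delay at any sender node,
    # signaled by a reduced congestion window or by duplicate ACKs.
    for info in node_tcp_info.values():
        cwnd_reduced = info["cwnd_now"] < info["cwnd_prev"]
        repeated_acks = info["dup_acks"] > 0
        if cwnd_reduced or repeated_acks:
            return True  # block 440: congestion occurs at the first port
    return False


stats = {"in_packet_loss": 42}
tcp = {"node-130": {"cwnd_prev": 20, "cwnd_now": 8, "dup_acks": 3}}
print(port_congested(stats, tcp))  # True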
Due to the complexity of congestion, it is hard to determine the occurrence of congestion based only on the operation state of the switch or of the storage nodes. By combining both, as described above, embodiments of the present disclosure can accurately determine both the occurrence of congestion and the port at which the congestion occurs.
Still referring to FIG. 2, if the control unit 101 determines at block 210 that the congestion occurs at the first port 151, the process 200 proceeds to block 220. At block 220, the control unit 101 selects at least a second storage node from the plurality of storage nodes 120, 130 and 140.
The control unit 101 may select any storage node from the plurality of storage nodes 120, 130 and 140, or may select the second storage node based on data traffic. In the latter case, the control unit 101 may determine the data traffic transmitted from each of the plurality of storage nodes 120, 130 and 140, for example using information in the database 102.
In some embodiments, the control unit 101 may select a storage node with the largest data traffic from the plurality of storage nodes 120, 130 and 140 as the second storage node. In some embodiments, the control unit 101 may select a storage node with the second highest data traffic as the second storage node. In such embodiments, by changing a transmission path for larger data traffic, the data transmission load of a port where the congestion occurs may be reduced effectively, which helps to improve the transmission efficiency.
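The traffic-based selection described in the preceding paragraphs may be sketched in a few lines of Python. The node identifiers and byte counts are illustrative; the traffic figures are assumed to be read from the database 102.

def select_second_node(traffic_by_node, use_runner_up=False):
    # Rank senders by transmitted traffic, highest first, and pick either
    # the heaviest sender or, alternatively, the runner-up.
    ranked = sorted(traffic_by_node, key=traffic_by_node.get, reverse=True)
    return ranked[1] if use_runner_up and len(ranked) > 1 else ranked[0]


traffic = {"node-120": 9_000_000, "node-130": 4_000_000, "node-140": 1_000_000}
print(select_second_node(traffic))                      # node-120
print(select_second_node(traffic, use_runner_up=True))  # node-130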
In some other embodiments, the control unit 101 may select more than one storage node from the plurality of storage nodes 120, 130 and 140, such that data of these storage nodes are transmitted while bypassing the first port 151, and new data transmission paths for these storage nodes may be different. Therefore, in such embodiments, the data transmission efficiency of a port where the congestion occurs may be improved further.
For the sake of discussion, suppose that the control unit 101 selects at least the storage node 120 (referred to as the second storage node 120 below) at block 220. Then, at block 230, the control unit 101 updates the configuration of a data transmission path for the second storage node 120, such that the second storage node 120 transmits data to the first storage node 110 while bypassing the first port 151. The control unit 101 may send the updated configuration to the second storage node 120 in the form of a message, or deliver the updated configuration to the second storage node 120 by other means, such as a remote procedure call (RPC). Embodiments of the present disclosure are not limited in this regard.
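As one hypothetical realization of such a message, the Python sketch below pushes the updated path configuration to the second storage node as a JSON record over TCP. The message fields, the addresses and the control port number are placeholders; an RPC framework could equally be used.

import json
import socket


def push_path_update(node_addr, new_dest_ip, new_dest_port, ctrl_port=9099):
    # Build the configuration update for the second storage node.
    message = {
        "action": "update_path",
        "dest_ip": new_dest_ip,      # address to use for future transmissions
        "dest_port": new_dest_port,  # path that bypasses the congested port
    }
    payload = json.dumps(message).encode("utf-8")
    try:
        with socket.create_connection((node_addr, ctrl_port), timeout=2) as s:
            s.sendall(payload)
    except OSError as exc:
        print(f"configuration push to {node_addr} failed: {exc}")


push_path_update("10.0.0.120", "10.0.0.112", 5001)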
In some embodiments, the control unit 101 may update the configuration of a data transmission path for the second storage node 120, such that the second storage node 120 transmits data to the first storage node 110 via another port of the switch 150. Such embodiments are described below.
In some embodiments, all or some of the storage nodes 110, 120, 130 and 140 may be connected together, such that data may be transmitted to an adjacent storage node directly, or relayed to a destination storage node via an adjacent storage node. In such embodiments, the control unit 101 may update the configuration of a data transmission path for the second storage node 120, such that the second storage node 120 transmits data to the first storage node 110 while bypassing the switch 150. Such embodiments are described below with reference to FIG. 7.
In embodiments of the present disclosure, by monitoring the operation states of the switch and the storage nodes, congestion occurring at a port of the switch may be detected, and part of the data traffic causing the congestion may be redirected to other paths. In this way, the congestion of data transmission may be reduced and the data transmission efficiency may be increased, which helps to improve the overall performance of the storage system.
As mentioned above, the congestion at the first port 151 may be handled by causing the second storage node 120 to transmit data to the first storage node 110 via another port of the switch 150. Such embodiments are now described.
The control unit 101 may select a free port from a plurality of ports of the switch 150 which are connected to the first storage node 110. Specifically, the control unit 101 may select a second port from the plurality of ports 152-154 based on the resource usages of the plurality of ports 152-154 of the switch 150. For example, the control unit 101 may select the port 152 as the second port.
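A minimal Python sketch of this selection, assuming the per-port usage figures (here, fractions of output NIC capacity) have been obtained from the switch 150 or from the database 102:

def select_second_port(port_usage):
    # Pick the connected port with the lowest current resource usage.
    return min(port_usage, key=port_usage.get)


print(select_second_port({152: 0.10, 153: 0.55, 154: 0.80}))  # 152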
Subsequently, the control unit 101 may deactivate the connection of the second storage node 120 to the first port 151 and activate a connection of the second storage node 120 to the second port 152, such that the second storage node 120 transmits data to the first storage node 110 via the second port 152. For example, the control unit 101 may implement the deactivation and activation by modifying the configuration of the socket of the second storage node 120.
The control unit 101 may determine a network address (e.g. IP address) allocated to the NIC 112 of the first storage node 110 to which the second port 152 is connected, and update the destination address of the socket of the second storage node 120 to the IP address allocated to the NIC 112. Where a bonded NIC (network bonding) is used, the control unit 101 may implement the activation of the connection to the second port 152 and the deactivation of the connection to the first port 151 by simply changing the port number of the socket of the second storage node 120.
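On the sender side, such a "deactivate then activate" step might look like the following Python sketch: the existing connection through the congested port is closed and a new one is opened toward the IP address of the NIC 112. The addresses are placeholders, and in a bonded-NIC setup only the port number would change.

import socket


def switch_destination(old_sock, new_dest_ip, dest_port):
    # Deactivate the path through the congested first port 151 ...
    old_sock.close()
    # ... and activate a path toward the NIC reachable via the second port.
    new_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    new_sock.connect((new_dest_ip, dest_port))
    return new_sock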
As mentioned above, the second storage node 120 may also be caused to transmit data to the first storage node 110 while bypassing the switch 150. Such embodiments will now be described with reference to FIG. 7.
As shown in FIG. 7, all or some of the storage nodes 110, 120, 130 and 140 may be directly connected to one another, in addition to being connected to the switch 150.
In some embodiments, the connections between the storage nodes 110, 120, 130 and 140 may be implemented by, for example, NICs (including normal NICs and smart NICs) or field programmable gate arrays (FPGAs).
For the example of FIG. 7, if there is a direct connection between the second storage node 120 and the first storage node 110, the control unit 101 may deactivate the connection between the second storage node 120 and the switch 150 and activate that direct connection, such that the second storage node 120 transmits data to the first storage node 110 directly.
The control unit 101 may implement the deactivation and activation by modifying the configuration of the socket of the second storage node 120. In the example of FIG. 7, however, there may be no direct connection between the second storage node 120 and the first storage node 110; instead, the second storage node 120 is connected to a third storage node 730 via a first direct connection 701, and the third storage node 730 is connected to the first storage node 110 via a second direct connection 702.
In this case, the control unit 101 may deactivate the connection between the second storage node 120 and the switch 150, and activate the first direct connection 701 and the second direct connection 702, such that the third storage node 730 relays data from the second storage node 120 to the first storage node 110. Therefore, in the example of FIG. 7, the second storage node 120 transmits data to the first storage node 110 while bypassing the switch 150 entirely.
Similarly, the control unit 101 may implement the deactivation and activation by modifying the configuration of the socket of the second storage node 120.
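The relay role of the third storage node 730 can be illustrated with a minimal Python sketch: bytes arriving over the first direct connection 701 are forwarded over the second direct connection 702. The listen port and the upstream address are placeholders, and a production relay would of course handle multiple connections and errors.

import socket


def relay(listen_port, upstream_ip, upstream_port, bufsize=65536):
    with socket.create_server(("", listen_port)) as server:
        conn, _ = server.accept()  # first direct connection 701 (from node 120)
        with conn, socket.create_connection((upstream_ip, upstream_port)) as up:
            while True:  # second direct connection 702 (to node 110)
                chunk = conn.recv(bufsize)
                if not chunk:
                    break
                up.sendall(chunk)


# Example invocation on the third storage node (blocking):
# relay(listen_port=5001, upstream_ip="10.0.0.110", upstream_port=5001)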
In the embodiments described above, the data traffic from the second storage node 120 no longer passes through the congested first port 151, so that the load on the first port 151 is reduced and the congestion may be relieved.
Where all or some of the storage nodes are serially connected, when data needs to be transmitted to an adjacent or nearby storage node, such a serial path may be preferentially selected for data transmission. For example, in the example of FIG. 7, the second storage node 120 may transmit data to the adjacent third storage node 730 directly via the first direct connection 701, without passing through the switch 150.
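This path preference might be expressed as in the Python sketch below, where the adjacency map of direct connections is an illustrative assumption:

ADJACENT = {"node-120": {"node-730"}, "node-730": {"node-120", "node-110"}}


def choose_path(src, dst):
    # Prefer a direct serial connection when the destination is adjacent;
    # otherwise fall back to the path through the switch.
    if dst in ADJACENT.get(src, set()):
        return ("direct", src, dst)
    return ("via-switch", src, dst)


print(choose_path("node-730", "node-110"))  # ('direct', 'node-730', 'node-110')
print(choose_path("node-120", "node-110"))  # ('via-switch', ...)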
FIG. 8 shows a schematic block diagram of an example device 800 that may be used to implement embodiments of the present disclosure. The device 800 comprises a central processing unit (CPU) 801 that may perform various appropriate acts and processes in accordance with computer program instructions stored in a read-only memory (ROM) 802 or loaded from a storage unit 808 into a random access memory (RAM) 803, as well as an input/output (I/O) interface 805. Various components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse and the like; an output unit 807, such as various types of displays, loudspeakers and the like; a storage unit 808, such as a magnetic disk, an optical disk and the like; and a communication unit 809, such as a network card, a modem, a wireless communication transceiver and the like. The communication unit 809 enables the device 800 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The processing unit 801 performs the various methods and processes described above, for example any of the processes 200 and 400. For example, in some embodiments, any of the processes 200 and 400 may be implemented as a computer software program or computer program product, which is tangibly included in a machine-readable medium, such as the storage unit 808. In some embodiments, the computer program may be partially or fully loaded and/or installed to the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded to the RAM 803 and executed by the CPU 801, one or more steps of any of the processes 200 and 400 described above are implemented. Alternatively, in other embodiments, the CPU 801 may be configured to implement any of the processes 200 and 400 in any other suitable manner (for example, by means of firmware).
According to some embodiments of the present disclosure, there is provided a computer readable medium. The computer readable medium stores a computer program which, when executed by a processor, implements the method according to the present disclosure.
Those skilled in the art would understand that the various steps of the method of the present disclosure described above may be implemented by a general-purpose computing device, which may be integrated on a single computing device or distributed over a network composed of a plurality of computing devices. Optionally, they may be implemented using program code executable by the computing device, such that they may be stored in a storage device and executed by the computing device; or they may be made into respective integrated circuit modules, or a plurality of modules or steps therein may be made into a single integrated circuit module for implementation. In this way, the present disclosure is not limited to any specific combination of hardware and software.
It would be appreciated that although several means or sub-means of the apparatus have been mentioned in the detailed description above, such partitioning is merely an example and not a limitation. In fact, according to embodiments of the present disclosure, the features and functions of two or more apparatuses described above may be embodied in one apparatus. Conversely, the features and functions of one apparatus described above may be further divided so as to be embodied by multiple apparatuses.
What has been described above comprises only optional embodiments of the present disclosure and is not intended to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and changes. Any modifications, equivalents and improvements made within the spirit and principles of the present disclosure shall be included within the scope of the present disclosure.