The present invention relates to packetized data communication systems and, more particularly, to such systems that dynamically vary an extent to which messages are bundled into packets, based on rates at which the messages are generated and predetermined limits related to estimated communication channel capacity.
Packetized communication systems send and receive data in packets over communication links between senders and receivers. Each packet contains header, and sometimes footer, information (collectively referred to herein as “overhead”), as well as payload data. The overhead is used to store information necessary for delivering the packet, such as source and destination address information, error correcting information, and the like. The format and contents of the overhead depend on which communication protocol is used.
Messages large enough to exceed the payload capacity of a single packet are segmented, and each segment is sent in a separate packet. On the other hand, several small messages may be bundled (sometimes referred to as “aggregated” or “packetized”) together into the payload portion of a single packet. This bundling typically occurs in a transport layer (such as TCP) of a network protocol stack. For example, according to the Nagle Algorithm, a message is delayed until either: (a) enough other messages destined to travel over the same communication link have been accumulated to fill a packet or (b) an acknowledgement (ACK) of a previously transmitted packet is received.
Bundling conserves link bandwidth and reduces packet processing requirements. Bandwidth efficiency is improved by bundling, because the number of packets, and consequently the amount of overhead, carried over a network link are less than if each message were transported in its own packet. Secondary beneficial effects include fewer ACKs being carried over the link, because more payload messages can be acknowledged with a single ACK from the receiver. Furthermore, per-packet processing is reduced due to the smaller number of packets that are needed to transport a given number of messages. This can significantly reduce processor (CPU) load on the sender, the receiver and intermediate routers and switches. However, bundling usually causes some messages to wait before being transmitted over a network link.
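As a rough illustration of the bandwidth effect (the figures below are approximations assumed for this example, not taken from any particular deployment): a TCP segment carried over IPv4 and Ethernet incurs roughly 78 bytes of per-packet overhead on the wire (about 20 bytes of TCP header, 20 bytes of IPv4 header, 18 bytes of Ethernet framing and roughly 20 bytes of preamble and inter-frame gap). Sending ten 100-byte messages in ten separate packets therefore consumes about 780 bytes of overhead for 1,000 bytes of payload, so overhead accounts for roughly 44% of the bits on the link; bundling the same ten messages into one packet reduces the overhead to about 78 bytes, or roughly 7%.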
Some contexts are sensitive to packet latency, i.e., the amount of time that elapses between when a sending application transmits a message and when a receiving application receives the message. For example, some financial applications that support high-frequency trading receive market data, such as messages containing quote and trade data, from electronic exchanges, such as the New York Stock Exchange (NYSE) and NASDAQ, and distribute this data to their clients. Some of the clients employ algorithmic trading methods to analyze this data to identify and take advantage of very short-lived market opportunities. Latencies measured in milliseconds or microseconds may influence the usefulness of the market data to the clients and the ability of the clients to place orders with the exchanges in time to exploit the identified opportunities.
Some packet latency is caused by packet and protocol processing, physical limitations of network links, etc., and is, of course, unavoidable. However, message bundling causes some messages to wait before they can be transported over a link. Bundling latency is the amount of time that elapses while a message sent by an application waits for other messages (or a bundling timeout or another event, such as receipt of an ACK associated with a previously sent packet) before the message can be placed in a packet for transportation over a link. Some users disable bundling, thereby enabling each message to be transported in a separate packet and thus avoiding bundling delays. However, if its link becomes busy, a communication system that has bundling disabled is subject to severe performance degradation due to the large amount of overhead handled by the link, particularly if the average message size is much less than the packet payload capacity.
Prior art systems have addressed packet bundling and bundling delays. For example, Ekl (International Publication Number WO 02/27991) discloses a communication system that dynamically adjusts packet size, and therefore the amount of bundling that can occur, in response to periodically sampled system performance metrics, such as processor utilization, end-to-end packet transit time (referred to in Ekl as “delay”), jitter, bandwidth utilization, queue depth and/or wait time, or events, such as one of these performance metrics exceeding a threshold value.
Baucke, et al. (International Publication Number WO 2007/110096) discloses a system that attempts to minimize bundling delay by calculating a maximum wait time, after which a packet is transmitted, even if the packet has room for one or more additional messages. The maximum wait time is calculated based on an average arrival rate and an average size of previously sent messages, such that the maximum wait time corresponds to an average amount of time to fill a packet.
An embodiment of the present invention provides a system for rate-adaptive control of message transmission. The messages are generated on at least one computer, which has a network port. The network port is configured to support at least two network connections. For example, an Ethernet link may connect the network port to a computer network. The Ethernet link may support several TCP network connections between the computer and several client computers. Each of the messages is to be transported over an associated one of the plurality of network connections.
The system includes at least two local message traffic shapers. Each local message traffic shaper corresponds to one of the network connections. Each local message traffic shaper is configured to limit transfer of the messages associated with its network connection to the network connection. For example, each local message traffic shaper may limit when and/or how often the messages may be dequeued and sent via a writev( ) system call to a network protocol stack, so as to be transmitted by the network protocol stack via TCP packets over the computer network. Each local message traffic shaper is configured to limit transfer of the messages, based at least in part on an aggregate rate at which the messages to be transported over all the network connections are generated. For example, each local message traffic shaper may be assigned a shaping rate, and each shaping rate may be determined according to the aggregate rate at which writev( ) calls are issued in relation to the corresponding network connection. By “generated,” we mean any action that relates to creating or forwarding a message along a message stream, such as creating the message, enqueueing the message, dequeueing the message, etc.
The system also includes a global message traffic shaper coupled to the local message traffic shapers. The global message traffic shaper is configured to limit, in aggregate, transfer of the messages over all the network connections, based at least in part on a predetermined target rate. For example, the predetermined target rate may be set to a value below a rate that would saturate a bottleneck resource or utilize the bottleneck resource at a rate that negatively influences performance of the system.
Each of the local message traffic shapers may be configured to limit the transfer of the messages to a local shape rate. The system may also include a shape rate recalculator that is configured to repeatedly automatically recalculate the local shape rate for the local message traffic shapers. The local shape rate may be recalculated to include an oversubscription amount.
Each local message traffic shaper may include a token bucket, and the global message traffic shaper may include a token bucket different than any of the local message traffic shaper token buckets. A token from the local message traffic shaper and a token from the global message traffic shaper may be required to transfer each of the messages to the corresponding network connection. In other words, before a message may be transferred to the network connection, a token may need to be consumed from the local message traffic shaper and another token may need to be consumed from the global message traffic shaper.
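The dual-token requirement can be summarized with a brief sketch. The following C fragment is a minimal illustration, assuming each token bucket is implemented as a simple up/down counter with a depth limit; the structure and function names are hypothetical and are not taken from the description above.

    /* Minimal sketch (C) of the dual-token gating described above.  The
     * structures and names are illustrative assumptions only. */
    #include <stdbool.h>

    struct token_bucket {
        long tokens;   /* current token count (an up/down counter)     */
        long depth;    /* maximum number of tokens the bucket may hold */
    };

    /* A message may be transferred to its network connection only if one
     * token can be consumed from the connection's local bucket and one
     * token can be consumed from the shared global bucket. */
    static bool try_consume_tokens(struct token_bucket *local_bucket,
                                   struct token_bucket *global_bucket)
    {
        if (local_bucket->tokens < 1 || global_bucket->tokens < 1)
            return false;              /* hold the message for now */
        local_bucket->tokens--;        /* consume one local token  */
        global_bucket->tokens--;       /* consume one global token */
        return true;                   /* the message may now be transferred */
    }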
The messages may be held in an application layer buffer while waiting to be selected for transport over the network connection. Once selected, each message is removed from the application layer buffer for transfer to a network protocol layer for packetization. The rate at which the messages are generated may be considered to be a rate at which the messages are removed from the application layer buffer for transfer to the network protocol layer. As noted, this transfer may be implemented via a call to the writev( ) routine, or any suitable application, operating system or network protocol stack call.
Processing according to the Nagle Algorithm may be disabled, in relation to the network port. The messages may contain financial market data.
The computer may be a multiprocessor computer that includes at least two processors. A distinct subset of the local message traffic shapers may be associated with each one of the processors. For example, each processor may execute a distinct subset of the local message traffic shapers.
In such an embodiment, rather than a global message traffic shaper, there may be a per-processor message traffic shaper associated with each processor. Each per-processor message traffic shaper may be configured to limit, in aggregate, transfer of the messages generated on the associated processor. The per-processor message traffic shapers may collectively share the predetermined target rate. For example, the predetermined target rate may be (equally or unequally) divided among the per-processor message traffic shapers, so the sum of the respective shapers' target rates may add up to the global target rate.
In another multiprocessor embodiment, one global message traffic shaper is configured to limit, in aggregate, transfer of the messages generated on all the processors.
Yet another multiprocessor embodiment includes a per-processor message traffic shaper for each processor, in addition to a global message traffic shaper. Each per-processor message traffic shaper is associated with its processor. Each per-processor message traffic shaper is configured to limit, in aggregate, transfer of the messages generated on its processor. This limitation may be based at least in part on an aggregate rate at which the messages to be transported over all the network connections are generated on all of the processors. The global message traffic shaper ensures the aggregate traffic from all the processors remains below the predetermined target rate.
Each of the per-processor message traffic shapers may be configured to limit the transfer of the messages to a per-processor shape rate. The system may also include a shape rate recalculator configured to repeatedly automatically recalculate the per-processor shape rate for each per-processor message traffic shaper.
The recalculator may also be configured to automatically recalculate the per-processor shape rate, such that the per-processor shape rate is recalculated to include an oversubscription amount. In other words, the sum of the per-processor shape rates may exceed, at least at times, the overall system rate limit.
The local message traffic shapers and the global message traffic shaper may be implemented in an application layer or in a network protocol stack.
An embodiment of the present invention provides a method for rate-adaptively controlling transmission of messages. The messages may be generated on at least one computer having a network port configured to support at least two network connections. Each of the messages is to be transported over an associated one of the network connections. For each of the network connections, transfer of the messages (associated with that network connection) to the network connection is limited. This limitation may be based at least in part on an aggregate rate at which the messages to be transported over all the network connections are generated. (The meaning of “generated” is discussed above.) In addition, transfer of the messages over all the network connections may be limited, in aggregate, based at least in part on a predetermined target rate.
A rate limit on the transfer of the messages to the network connection may be repeatedly automatically recalculated. Recalculating the rate limit may include recalculating the rate limit to include an oversubscription amount. Recalculating the rate limit may include raising the rate limit if the aggregate rate at which the messages to be transported over all the network connections are generated is less than a predetermined value, and decreasing the rate limit if the aggregate rate at which the messages to be transported over all the network connections are generated is greater than the predetermined value.
Processing according to the Nagle Algorithm may be disabled.
For each processor of a multiprocessor computer, transfer of the messages generated on that processor may be limited, in aggregate. This limitation may include sharing the predetermined target rate among the processors. Optionally or alternatively, this limitation may include oversubscribing at least one of the processors.
Yet another embodiment of the present invention provides a method of controlling transmission of packets of financial market data through a port over a network to a set of client computers. At least one distinct buffer is associated with each of the client computers. Data from each buffer is written through the port. Writing of data from any given buffer is limited to a rate that would prevent all of the buffers, collectively, from exceeding an aggregate target rate. The aggregate target rate may be chosen to prevent saturation of hardware resources. Furthermore, the limiting may be designed so as to share the aggregate target rate equitably among the buffers, to the extent required by the buffers' demands.
The invention will be more fully understood by referring to the following Detailed Description of Specific Embodiments in conjunction with the Drawings, of which:
In accordance with embodiments of the present invention, methods and apparatus are disclosed for minimizing message latency time by dynamically controlling an amount of bundling that occurs. Unbundled messages are allowed while a bottleneck resource is lightly utilized, but the amount of bundling is progressively increased as the message rate increases, thereby progressively increasing resource efficiency. In other words, the bottleneck resource is allocated to a set of consumers, such that no consumer “wastes” the resource to the detriment of other consumers. However, while the resource is lightly utilized, a busy consumer is permitted to use more than would otherwise be the consumer's share of the resource. In particular, the consumer is permitted to use the resource in a way that is less than maximally efficient, so as to reduce latency time.
As noted, latency time can be critically important in some communication systems, such as financial applications that support high-frequency trading.
Most or all of the market data messages generated by the application program 108 are small, typically much smaller than the maximum payload capacities of packets utilized in the network link 110 between the quote server 100 and the network 113. Each message generated by the quote server 100 may include information about one or more symbols (securities), and each message may be formed from information from one or more messages received from the exchange computers 103. Nevertheless, for simplicity and without loss of generality, we discuss exemplary embodiments of the present invention by referring to the messages generated within the quote server 100 as atomic units of data sent by the application program 108.
In many cases, the quote server 100 is coupled to the network 113 via a single network link 110 which, of course, has a finite bandwidth. Although not shown, the quote server 100 may execute one or more additional application programs, which generate additional message streams. In addition, the quote server 100 may also be coupled to the network 113 by additional network links or to other sets of client computers via other networks and/or other network links, each with its own finite bandwidth. The descriptions provided herein apply to each such application program, message stream and network link, and the elements in
As noted, if the client computers 116-123 are involved in algorithmic trading, small latencies may influence the usefulness of the market data to the client computers 116-123. An operator of the quote server 100 may wish to minimize the latency of messages sent by the application program 108 to the client computers 116-123. However, a tension exists between disabling bundling to avoid message wait times associated with bundling on the one hand, and bundling as many messages as possible into each packet sent over the network link 110 so as to optimize throughput of (i.e., to get maximum benefit from the overhead carried over) the network link 110 on the other hand. As noted, disabling bundling eliminates bundling latencies; however, if the network link 110 becomes busy, performance of the network link 110 will be severely degraded, due to the large amount of overhead handled by the network link 110.
Embodiments of the present invention influence the amount of bundling in a rate-adaptive manner. The adaptation is performed per message stream semi-independently. That is, each message stream's adaptation depends primarily on the message rate of the message stream. However, the adaptation also takes into account an aggregate of all the message streams utilizing a shared resource, such as the network link 110. In some embodiments, this adaptation is performed in the application layer, i.e., not within the network protocol stack. In these embodiments, bundling by the network protocol stack is preferably disabled.
The rate at which the application program 108 generates messages destined to any one of the client computers 116-123 may vary unpredictably over time, such as in response to fluctuations in trading activity at the exchanges. Message generation by the application program 108 may be bursty. In addition, the application program's message generation rate may be different for different ones of the client computers 116-123. Furthermore, the aggregate rate at which the application program 108 generates messages that are to be transported over the network link 110 may be bursty and vary over time.
As load on any system is increased, some resource (known as a “bottleneck resource”) eventually reaches a utilization level that prevents the system from handling a further increase in the load without a dramatic decrease in system performance, even if other resources are not fully utilized. For example, in the system of
Embodiments of the present invention automatically respond to the rates (and changes in the rates) at which messages are generated by the application program 108 and destined to the various client computers 116-123 and in aggregate over the network link 110 to shape message traffic rates, so as to achieve several goals, such as: (a) preventing a bottleneck resource from reaching a utilization level that would cause an undesirable decrease in performance of the system; (b) exploiting the bottleneck resource as much as possible to decrease the latency of messages generated by the application program 108 and destined to the client computers 116-123; and (c) preventing any one or more of the message streams generated by the application program 108 from utilizing so much of the bottleneck resource as to undesirably negatively impact another one or more of the message streams.
As noted, message generation rates of the application program 108 may vary over time. Sometimes the application program 108 generates bursts of messages, interspersed with relatively quiet periods. Embodiments of the present invention automatically respond to these bursts by increasing the extent to which messages are bundled, thereby increasing the efficiency of the network link 110. On the other hand, during relatively quiet periods, when the network link 110 is not busy, messages are not bundled, or they are bundled to a small degree, thereby decreasing latency time. More detailed descriptions of exemplifying embodiments of the invention are now provided.
Each message stream may be implemented, at least in part, as a TCP connection between a program being executed by the quote server 100 and a program (not shown) being executed by the corresponding client computer 116-123. Although not shown in
A buffer 210, 213 or 216 is associated with each message stream 200-206. Any suitable software or hardware construct, such as a queue or a heap, may be used in the implementation of each buffer 210-216. As indicated by message stream segments 200a-206a, messages generated by the application program 108 are placed in the appropriate buffer 210-216 until they are ready to be transferred to operating system network software and hardware 220. Such transfers are indicated by message stream segments 200b-206b. A message transfer to the operating system network software and hardware 220 may be implemented with a call to an appropriate operating system routine, such as the Linux writev( ) routine. Other, functionally equivalent or similar, routines or system calls may be used under Linux or other operating systems. Once a message is transferred, it no longer needs to reside in the buffer 210-216, if a reliable network connection, such as a TCP connection, is used. (TCP provides reliable connections that handle any retransmissions required as a result of packets being lost or dropped along the way to the respective client computer 116-123.)
The operating system network software and hardware 220 may include an appropriate protocol stack, such as a stack that includes TCP, IP and Ethernet layers (not shown), including a hardware network interface 221, which collectively manage the network connections, including placing the messages into packets and sending the packets across the network link 110. Message segments 200c-206c represent portions of the TCP connections carried over the network link 110, which may be a Gigabit Ethernet link, for example. Message segments 200d-206d represent respective additional network links to the client computers 116-123. One or more of the message segments 200d-206d may be carried over a single shared network link, depending on how the client computers 116-123 are connected to the network 113.
As understood by those of skill in the art, “message traffic shaping” (also known as “packet shaping” or “Internet Traffic Management Practices” (ITMPs)) means the control of computer network traffic in order to optimize or guarantee performance, lower latency and/or increase usable bandwidth by delaying packets that meet certain criteria. Traffic shaping is an action on a set of packets that imposes additional delays on packets, such that the packets (or traffic involving the packets) conform to a predetermined constraint, such as limiting the volume of traffic sent into a network in a specified period of time or the maximum rate at which the traffic is sent.
As used in the present application and appended claims, a message traffic shaper may operate on, limit or control entities in addition to or different than packets, such as the messages generated by the application program 108 or the writev( ) calls to the operating system network software and hardware 220.
A message traffic shaper 223, 226 and 230 is associated with each buffer 210-216 and, therefore, with each message stream 200-206. The message traffic shapers 223-230 are referred to as “local shapers” or “per-buffer shapers.” Each local shaper 223-230 ensures that its corresponding stream 200-206 does not over utilize a resource or its share of the resource, such as the network link 110 or processing power available to execute part or all of the network protocol stack handling the writev( ) calls issued in relation to the corresponding stream 200-206.
In some embodiments, a multiprocessor computer provides a platform on which the quote server 100 is implemented. The multiple processors may be separate CPU integrated circuits, separate cores in a multicore processor integrated circuit or any other suitable circuits or hardware or software emulators. One of the processors of the multiprocessor computer may execute the application program 108, while another one of the processors executes software that implements portions of the network protocol stack and yet another processor handles interrupts generated by a network interface within the operating system network software and hardware 220. In some embodiments, two or more of the processors execute replicas or variants of the application program 108. Each of these processors, and the network link 110, may be considered a resource with a finite capacity. The local shapers 223-230 may be configured to ensure that no more than a predetermined amount of one or more resources is used for their respective message streams 200-206.
In one embodiment, each local shaper 223-230 limits when messages stored in its corresponding buffer 210-216 are transferred (as indicated by message segments 200b-206b) from the buffer 210-216 to the operating system network software and hardware 220. This limit function is indicated at 236, 240 and 243. In one embodiment, this limit function 236-243 is performed by limiting when, and therefore how often, writev( ) calls may be issued.
In addition, a global message traffic shaper 233 is associated with all the message streams 200-206 that are carried over the network link 110. The global message traffic shaper 233 ensures that, in aggregate, the streams 200-206 do not over utilize a resource, such as the network link 110 or processing power available to execute part or all of the network protocol stack handling the writev( ) calls. In one embodiment, the global message traffic shaper 233 limits the aggregate rate at which messages stored in the buffers 210-216 are transferred from the buffers 210-216 to the operating system network software and hardware 220. This limit function is indicated at 246. In one embodiment, this limit function 246 is performed by limiting when, and therefore how often, writev( ) calls may be issued. Thus, permission may be required from both a local shaper 223-230 and from the global shaper 233 to transfer one or more messages from one of the buffers 210-216 to the operating system network software and hardware 220 or to issue a writev( ) call.
In some embodiments, a round-robin scheduler 260 schedules when the transfers of messages from the buffers 210-216 occur, in conjunction with the controls 236-243 and 246 from the local shapers 223-230 and the global shaper 233, as described in more detail below.
Each local shaper 223-230 receives information, such as volume or rate, about message traffic in its respective message stream 200-206 via a feedback loop 233, 236 and 240, respectively. The global message traffic shaper 233 receives information about aggregate message traffic in all the message streams 200-206 via feedback loops 250, 253 and 256. The local shapers 223-230 and the global message traffic shaper 233 use this feedback information to inform their respective limitation functions.
For example, if one of the message streams 200-206 experiences a burst of messages after having been relatively quiet, the corresponding local shaper 223-230 may permit the message stream to proceed with little or no bundling. On the other hand, if the message stream becomes busy (i.e., the burst turns out to be a steady stream of messages), the local shaper 223-230 may progressively throttle down the rate at which the writev( ) routine may be called, thereby progressively increasing the amount of bundling to be performed on messages in this particular message stream.
In some embodiments, the system is configured such that, when the writev( ) routine is called, all of the messages in the corresponding buffer 210, 213 or 216 are transferred to the operating system network software and hardware 220. In some other embodiments, the system is configured such that as many of the messages in the buffer as will fit in the payload section of a packet are transferred to the operating system network software and hardware 220 when the writev( ) routine is called. Thus, a single call to the writev( ) routine may bundle one or more of the messages in the buffer 210, 213 or 216 and pass the bundle to the operating system network software and hardware 220 for transmission over the network link 110.
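As an illustration of how several buffered messages may be passed to the operating system in a single call, the following C sketch builds one iovec entry per message and issues one writev( ) call. The message structure, the fixed bundle limit and the function name are assumptions made for this example; they are not part of the described embodiments.

    /* Illustrative sketch: pass up to MAX_BUNDLE queued messages to the
     * kernel in one writev( ) call, one iovec entry per message. */
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <stddef.h>

    #define MAX_BUNDLE 64          /* assumed bound on messages per call */

    struct app_msg {
        void   *data;              /* start of one application message */
        size_t  len;               /* length of the message in bytes   */
    };

    ssize_t send_bundle(int sockfd, struct app_msg *msgs, int count)
    {
        struct iovec iov[MAX_BUNDLE];
        int n = (count < MAX_BUNDLE) ? count : MAX_BUNDLE;

        for (int i = 0; i < n; i++) {
            iov[i].iov_base = msgs[i].data;
            iov[i].iov_len  = msgs[i].len;
        }
        /* One call may carry one message or a bundle of many messages;
         * with Nagle processing disabled it will typically result in
         * one packet on the wire. */
        return writev(sockfd, iov, n);
    }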
As noted, bundling is disabled in the operating system network software and hardware 220. For example, processing according to the Nagle Algorithm may be disabled in TCP by calling setsockopt and passing the TCP_NODELAY option. Consequently, the message(s) passed with a single writev( ) call may be sent over the network link 110 by the operating system network software and hardware 220, even if the message(s) do not fill the payload portion of a packet, and even if there is an outstanding unacknowledged packet. That is, one writev( ) call is likely to cause the generation of one packet. Thus, if the network link 110 is relatively lightly loaded, the message(s) should incur no bundling delay, thereby minimizing latency.
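For reference, disabling Nagle processing on a TCP socket uses a standard socket option; the following short C fragment shows the conventional call (error handling abbreviated for clarity).

    /* Disable Nagle-algorithm bundling on an already-created TCP socket. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int disable_nagle(int sockfd)
    {
        int one = 1;
        return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY,
                          &one, sizeof(one));   /* 0 on success, -1 on error */
    }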
Because bundling is disabled in the operating system network software and hardware 220, the system would, absent other controls, be susceptible to severe performance degradation if the network link 110 or another bottleneck resource became very busy. However, the local rate shapers 223-230 and the global message traffic shaper 233 use their feedback mechanisms 233-240 and 250-256 to prevent the bottleneck resource from becoming critically busy. As the resource becomes progressively busier, the shapers 223-233 cause progressively more bundling to occur, thereby increasing the efficiency of the resource. However, when the resource is not busy, the shapers 223-233 allow the resource to be utilized (exploited), to the extent possible, to minimize latency. Thus, a target utilization is set for the bottleneck resource. The shapers 223-233 are configured to utilize as much of the bottleneck resource as possible, without exceeding (at least for more than a short burst) the target utilization.
Which resource is the bottleneck resource in a given system, and a target utilization for this resource, may be determined experimentally or analytically. For example, the amount of a processor resource that is used to handle one writev( ) routine call may be measured, or it may be determined by analyzing a program to count computer instructions that must be executed to handle the writev( ) routine call and associated network protocol stack software, interrupt handling routines, etc. The number of writev( ) routine calls that can be handled per unit of time by a given processor may be calculated or estimated by taking into account the number of instructions executed by the processor to complete one operation and the processor's speed. Functions performed by other processors may be similarly analyzed. The number of messages that may be sent over a network link may be calculated or estimated by dividing the network link's goodput (usable data transfer rate) by the average message size. The resource that can sustain the smallest number of operations per unit time can be determined and designated the bottleneck resource.
A target utilization may be set to the number of operations the bottleneck resource can handle without undesirably degrading system performance. For example, the target may be set to some fraction (less than one) of the maximum number of operations the bottleneck resource can handle. As known from a generalization of Little's Law, response time increases with arrival rate (or completion rate, for a balanced queuing model) almost linearly up to about 75% utilization, above which the response time increases progressively more dramatically. Thus, in many cases, setting the target to about 70-80% of the maximum number of operations the bottleneck resource can handle provides a good balance between maximizing utilization of the resource and avoiding bundling delays.
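A simple queueing formula illustrates why this range is a reasonable target. Under an M/M/1 approximation (an assumption introduced here for illustration, not asserted by the description above), a resource with mean service time S and utilization U has mean response time R = S/(1 - U). At U = 0.5 the response time is 2S, at U = 0.75 it is 4S, and it grows without bound as U approaches 1, so holding utilization near 70-80% preserves most of the resource's capacity while keeping queuing delay moderate.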
In one example, the processor that handles interrupts generated by the network interface is the bottleneck resource. In one exemplary configuration, this processor can handle interrupts resulting from up to about 400,000 writev( ) routine calls per second without undue system performance degradation. Thus, the target for this configuration may be set to about 400,000 writev( ) routine calls per second.
It should be noted that the bottleneck resource's utilization need not necessarily be directly measured. Continuing the previous example, the target is expressed in terms of the number of writev( ) routine calls per second that may be executed, not in terms of utilization of the bottleneck resource, i.e., the processor that handles the interrupts. Furthermore, a processor other than the bottleneck resource (i.e., a processor other than the processor that handles the interrupts) may execute the software that issues the calls to the writev( ) routine. Thus, the target may be imposed on operations that are performed by or on a resource other than the bottleneck resource. Furthermore, the target may be set arbitrarily.
Use of a target is conceptually illustrated in
In some embodiments, each rate shaper 223-233 includes a token bucket. A token bucket may be implemented using an up/down counter. A portion of one of the message streams 200-206 is illustrated schematically in
Tokens are added to the global token bucket 404 (i.e., replenished 406) at a rate consistent with a desired shape of the aggregate network traffic on the network link 110 (
The global token bucket 404 has a limit (depth 407) on the number of tokens it can hold. If the global token bucket 404 becomes full, no further tokens are added 406 to it until one or more tokens are consumed 411. The global token bucket's 404 depth 407 is discussed in more detail below.
Similarly, tokens are added to the per-buffer token bucket 403 (i.e., replenished 408) at a rate consistent with a desired shape of network traffic resulting from the message stream 200. As with the global token bucket 404, the per-buffer token bucket 403 has a limit (depth 412) on the number of tokens that it can hold.
In one embodiment, the replenishment rate 408 for the per-buffer token bucket 403 equals the maximum sustained rate at which writev( ) routine calls should be executed for the corresponding message stream 200. The per-buffer token replenishment rate 408 may be simply a fraction of the target aggregate rate for the network link 110, i.e., a fraction of the global token bucket's 404 replenishment rate 406. However, in another embodiment, we include an “Oversubscription factor,” as shown in Equation (1), in the replenishment rate 408. In equation (1), a “Target writev( ) rate” may be the global token bucket's 404 replenishment rate 406, and a “Number of connections” may be the number of message streams 200-206 (
Absent the “Oversubscription factor,” all the local shapers 223-230 (
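To make the oversubscription idea concrete, consider a hypothetical example (the connection count and factor below are assumed for illustration only). Equation (1), as described above, multiplies an equal per-connection share of the target rate, i.e., the Target writev( ) rate divided by the Number of connections, by the Oversubscription factor. With a target aggregate rate of 400,000 writev( ) calls per second, eight connections and an oversubscription factor of 2, each per-buffer bucket would be replenished at (400,000/8)*2 = 100,000 tokens per second. The per-buffer rates then sum to 800,000 tokens per second, twice the target, but the global token bucket still limits the actual aggregate rate to 400,000 calls per second; the oversubscription merely lets a busy connection exceed its equal share while other connections are quiet.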
In one embodiment, the oversubscription factor is determined from a table, based on the aggregate rate at which writev( ) routine calls are issued by all the message streams 200-206. An example of one such table is illustrated in Table 1.
The values in Table 1 are presented merely as one example. Other table values may, of course, be selected based on the amounts of resources required to perform a function, processor speeds, network link capacity, the degree of utilization desired for a bottleneck resource, etc. Furthermore, the table may include more or fewer rows, depending on the granularity with which the oversubscription factor is to be administered.
When the aggregate write rate is low, for example less than about 100 writev( ) calls per second, the oversubscription factor may be large, for example about 10. When the aggregate write rate is high, for example more than about 300 writev( ) calls per second, the oversubscription factor may be smaller, for example about 2. When the aggregate write rate exceeds the desired value, the oversubscription factor may be less than 1.
Although a table of oversubscription factors may be used, in another embodiment we prefer to initialize each per-buffer token replenishment rate to an equal fraction of the global token bucket's 404 replenishment rate 406. We prefer to set the initial per-buffer token replenishment rate to the global token bucket's 404 replenishment rate 406 divided by a number smaller than the number of message streams 200-206, thereby initially oversubscribing each of the local shapers 223-230. We then periodically or occasionally adjust each per-buffer token replenishment rate by a percentage of the then-current value of the per-buffer token replenishment rate, as shown in Equation (2). This adjustment may be performed by the round-robin scheduler 260, as described below.
Per-buffer token replenishment rate = Per-buffer token replenishment rate ± (Per-buffer token replenishment rate * Step rate)  (2)
The “Step rate” is the amount by which the per-buffer token replenishment rate is adjusted. This may be any suitable adjustment amount. In one embodiment, we use about 0.01. Thus, the replenishment rate increases or decreases in steps of about 1%. However, larger or smaller values may be used, depending on how rapidly the replenishment rate is to change. For example, if large or small changes in the rate of writev( ) routine calls are expected over time, larger or smaller step rate values may be used, respectively.
If the aggregate rate of writev( ) calls is less than the target aggregate rate, the per-buffer token replenishment rate may be increased, whereas if the aggregate rate of writev( ) calls equals (within a predetermined range) or exceeds the aggregate target, the per-buffer token replenishment rate may be decreased.
Optionally, if the aggregate rate of writev( ) calls is within a predetermined range of the aggregate target rate, the per-buffer token replenishment rate may remain unchanged. However, in the context of high-frequency trading, each message stream's writev( ) routine call rate and the aggregate writev( ) call rate are expected to vary almost continuously. Thus, there may be no steady-state optimum value for the per-buffer token replenishment rate, and a system that always either increases or decreases the per-buffer token replenishment rate may be sufficient.
Optionally, the step rate may itself be varied, depending on the aggregate writev( ) routine call rate, so as to more quickly adapt to changes in the aggregate rate. In one embodiment, the step rate decreases as the actual aggregate writev( ) routine call rate approaches the target aggregate rate, and the step rate increases as the difference between the actual aggregate rate and the target aggregate rate becomes larger.
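The following C sketch illustrates one way the adjustment of Equation (2) could be performed with a step rate that grows as the measured aggregate rate moves away from the target; the function name, the base step of 1% and the particular scaling of the step are assumptions made for this illustration, not the only possibilities contemplated above.

    /* Recalculate one per-buffer token replenishment rate per Equation (2),
     * increasing it while the aggregate writev( ) rate is below target and
     * decreasing it otherwise.  The step grows with the distance from the
     * target (an illustrative choice). */
    double adjust_replenish_rate(double rate,           /* current per-buffer rate       */
                                 double aggregate_rate, /* measured aggregate writev()/s */
                                 double target_rate)    /* target aggregate writev()/s   */
    {
        double step = 0.01;                             /* base step of about 1%         */
        double gap  = aggregate_rate > target_rate
                    ? (aggregate_rate - target_rate) / target_rate
                    : (target_rate - aggregate_rate) / target_rate;
        step *= 1.0 + gap;                              /* larger gap, larger step       */

        if (aggregate_rate < target_rate)
            return rate + rate * step;                  /* room left: raise the rate     */
        else
            return rate - rate * step;                  /* at or over target: lower it   */
    }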
As noted, the per-buffer token bucket 403 (
Essentially, the per-buffer token bucket 403 enables the system to determine if the corresponding message stream 200 is in a burst, without storing historical message rate data. If the per-buffer token bucket 403 is relatively full, the message stream 200 is not likely to be currently experiencing a burst. However, if the per-buffer token bucket 403 is empty or nearly empty, the message stream 200 is experiencing a burst or a sustained high traffic period.
The depths of the per-buffer token buckets need not be fixed. In some embodiments, the depths of the per-buffer token buckets are adjusted, based on the aggregate rate of calls to the writev( ) routine, as described below.
As noted with reference to
As shown in
Equations (1) and (2) provide two possible ways of calculating a per-buffer token replenishment rate. Other per-buffer token replenishment algorithms may, of course, be used. Operation 510 may not (and need not) be performed at periodic intervals. Thus, whenever operation 510 is performed, a number of tokens to be added to the per-buffer token bucket is calculated, based on the appropriate replenishment rate and on the amount of time that has elapsed since the last time the per-buffer token bucket was replenished. Whole numbers of tokens (or no tokens at all) are added to the token bucket. Thus, the result of any calculation of a number of tokens to be added is truncated (or, in some embodiments, rounded) to arrive at the number of tokens to be added to the token bucket. Note that not every execution of operation 510 necessarily adds any tokens. If the calculated number of tokens to add is less than 1 (or, in some embodiments, less than ½), no token is added. As previously noted, excess tokens, that is, tokens in excess of the token bucket depth 412 (
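A short C sketch of elapsed-time replenishment, as described above, follows; the structure layout and the choice of CLOCK_MONOTONIC are assumptions made for this illustration. Only a whole number of tokens is added (the calculated value is truncated), no tokens are added when the calculation yields less than one, and tokens in excess of the bucket depth are discarded.

    /* Replenish a token bucket based on the time elapsed since it was last
     * replenished.  Fractional tokens are never added; the timestamp is
     * advanced only when at least one whole token is added, so fractional
     * elapsed time is not lost. */
    #include <time.h>

    struct bucket {
        long   tokens;            /* current token count                */
        long   depth;             /* maximum tokens the bucket may hold */
        double rate;              /* replenishment rate, tokens/second  */
        struct timespec last;     /* time of the last replenishment     */
    };

    void replenish(struct bucket *b)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);

        double elapsed = (double)(now.tv_sec - b->last.tv_sec)
                       + (double)(now.tv_nsec - b->last.tv_nsec) / 1e9;
        long add = (long)(elapsed * b->rate);     /* truncate to whole tokens */
        if (add < 1)
            return;                               /* nothing to add this time */

        b->tokens += add;
        if (b->tokens > b->depth)
            b->tokens = b->depth;                 /* excess tokens are discarded */
        b->last = now;
    }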
As a consequence of calculating the per-buffer token replenishment rate, the depth of the token bucket may change. In a sense, the tokens in any one per-buffer token bucket represent a possible burst of packets that may be generated in relation to the corresponding message stream 200-206 (
One reason for maintaining depths of per-buffer token buckets relates to a buildup of tokens that can occur during quiet periods. For example, if all the message streams 200-206 experience only light traffic for a long period of time, all the per-buffer token buckets may fill with tokens. Then, if message traffic suddenly increases in several or all of the message streams 200-206, such as due to market activity in response to a major announcement by the Board of Governors of the Federal Reserve System, many or all of the message streams 200-206 may attempt to “cash in” on their accumulated tokens, resulting in an overload being placed on the bottleneck resource. Thus, some embodiments of the system relatively quickly react to an increase in the aggregate writev( ) routine call rate by reducing the per-buffer token bucket replenishment rate, which in turn causes a reduction in the per-buffer token bucket depths, thereby forcing the message streams to forfeit some of their accumulated tokens.
The above-described token replenishment mechanisms distribute tokens among the per-buffer token buckets equally. However, in some other embodiments, the per-buffer token replenishment is asymmetric. That is, the target aggregate rate of writev( ) routine calls may be unequally divided among the buffers 210-216 (
At 513, if the per-buffer token bucket is empty, the scheduler skips this buffer. It should be noted that operation 510 may not have added any tokens to the per-buffer token bucket, such as if the calculated number of tokens to add was less than 1. Thus, although operation 510 occurs before the decision 513, the per-buffer token bucket may be empty. This may be the case if, for example, the associated message stream is busy and has been busy for some time, thereby depleting its per-buffer token bucket, and the aggregate rate of writev( ) routine calls is high, thereby reducing the replenishment rate for the per-buffer token bucket.
On the other hand, if a per-buffer token is available at 513, control passes to 516, where the number of tokens in the per-buffer token bucket is decremented, i.e., a per-buffer token is consumed. At 520, as many messages as will fit in a packet are dequeued from the message queue.
As noted, the scheduler 260 replenishes the global token bucket 404 (
The global bucket replenishment task may be divided between or among all the schedulers 260, such as on a rotating basis. At 523, if it is this processor's turn to replenish the global token bucket, control passes to 530, where the global token bucket is replenished.
Possible ways of calculating a rate at which tokens are added to the global token bucket were described above. As with the per-buffer token buckets, whenever operation 530 is performed, a whole number (possibly zero) of tokens to be added to the global token bucket is calculated, based on the appropriate replenishment rate and on the amount of time that has elapsed since the last time the global token bucket was replenished. As previously noted, excess tokens, that is, tokens in excess of the global token bucket depth 407 (
Before issuing a call to the writev( ) routine, an attempt is made to consume a global token. At 533, if no global token is available, control passes back to 523, forming a loop. If it is this scheduler's turn to replenish the global token bucket, eventually an execution of operation 530 will add a token to the global token bucket. (Until the calculation performed by operation 530 yields at least one whole token, no tokens are added.) On the other hand, if it is another scheduler's turn to replenish the global token bucket, the other scheduler, being executed by another processor, will eventually replenish the global token bucket. At 540, the number of global tokens is decremented, and at 546 the writev( ) routine is called with the dequeued messages.
At 550, the index is advanced to the next buffer. If the index is advanced beyond the last buffer 216 (
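Pulling these operations together, the following condensed C sketch shows one pass of such a scheduler. It reuses the bucket, message and helper sketches shown earlier (struct bucket, replenish, struct app_msg, MAX_BUNDLE and send_bundle); struct conn and dequeue_up_to_packet are additional assumed stand-ins for the application's per-connection buffer machinery. This is an illustrative outline of operations 510 through 550, not a definitive implementation.

    /* One round-robin pass over all connections (operations 510-550). */
    struct conn {
        int           sockfd;            /* TCP connection to one client */
        struct bucket local_bucket;      /* per-buffer token bucket      */
        /* per-connection message queue omitted for brevity             */
    };

    /* Assumed helper: dequeue as many queued messages as fit in one
     * packet payload, returning how many were dequeued. */
    int dequeue_up_to_packet(struct conn *c, struct app_msg *out);

    void scheduler_pass(struct conn *conns, int nconns,
                        struct bucket *global_bucket, int my_turn_to_replenish)
    {
        for (int i = 0; i < nconns; i++) {              /* 550: next buffer  */
            struct conn *c = &conns[i];

            replenish(&c->local_bucket);                /* 510               */
            if (c->local_bucket.tokens < 1)
                continue;                               /* 513: skip buffer  */
            c->local_bucket.tokens--;                   /* 516: local token  */

            struct app_msg bundle[MAX_BUNDLE];
            int n = dequeue_up_to_packet(c, bundle);    /* 520               */
            if (n == 0)
                continue;                               /* guard: empty queue (not shown in the flow above) */

            for (;;) {                                  /* 523-533           */
                if (my_turn_to_replenish)
                    replenish(global_bucket);           /* 530               */
                if (global_bucket->tokens >= 1)
                    break;
            }
            global_bucket->tokens--;                    /* 540: global token */
            send_bundle(c->sockfd, bundle, n);          /* 546: one writev( )*/
        }
    }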
As noted, in some embodiments multiprocessor computers may be used, and multiple instances of the application program 108 (
Each processor 600, 603 generates a processor-aggregate message stream 606 and 610, respectively. One set of operating system network software and hardware 613 processes both processor-aggregate message streams 606-610. The operating system network software and hardware 613 are preferably handled by a third processor.
The network interface 616 in the operating system network software and hardware 613 has the same or similar limitations as the network interface in a single processor system. Thus, a target is set for a system-wide aggregate rate at which writev( ) routine calls may be made (400,000 writes per second, in the example of
In the embodiment shown in
Preferably, only one crediting operation replenishes all the per-processor token buckets. This may be a scheduler or another similar routine. The crediting operation calculates the aggregate number of global tokens that should be distributed among the processors 600-603, and then the crediting operation distributes the tokens among the per-processor token buckets 620-623. When a per-processor token bucket 620 or 623 is replenished, any tokens that cannot be added to the token bucket due to that processor's per-processor token bucket's depth limit are distributed among the remaining per-processor token buckets (to the extent that the remaining per-processor token buckets are not already full), rather than being discarded. Multiple passes may be performed to distribute the tokens among the per-processor token buckets 620-623.
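One way such a crediting operation could be structured is sketched below in C, reusing the struct bucket sketch from above; the function name and the equal-share-per-pass strategy are assumptions for illustration. Tokens that cannot be placed in a full bucket roll over to later passes instead of being discarded, and distribution stops when either all tokens have been placed or every bucket is full.

    /* Distribute newly generated aggregate tokens among the per-processor
     * token buckets, carrying overflow from full buckets into further
     * passes rather than discarding it. */
    void distribute_tokens(struct bucket *per_proc, int nproc, long new_tokens)
    {
        while (new_tokens > 0) {
            long free_total = 0;
            for (int i = 0; i < nproc; i++)
                free_total += per_proc[i].depth - per_proc[i].tokens;
            if (free_total == 0)
                break;                           /* every bucket is full      */

            long share = new_tokens / nproc;     /* equal share for this pass */
            if (share < 1)
                share = 1;

            for (int i = 0; i < nproc && new_tokens > 0; i++) {
                long room = per_proc[i].depth - per_proc[i].tokens;
                long give = share < room ? share : room;
                if (give > new_tokens)
                    give = new_tokens;
                per_proc[i].tokens += give;
                new_tokens -= give;              /* leftovers go to later passes */
            }
        }
    }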
The per-processor token buckets 806-810 are used to shape the aggregate message traffic generated by their respective processors. The aggregate token bucket 813 is used to shape the message traffic from the entire system, i.e., from all the processors 800-803. The previously described scheduler 260 may be modified to require three tokens before allowing a call to the writev( ) routine: one per-buffer token, one per-processor token from the appropriate per-processor token bucket 806-810 and one aggregate token from the aggregate token bucket 813.
The scheduler may also be modified to replenish the per-processor token buckets 806-810, by extension to the descriptions given above for the single processor case. That is, each per-processor token bucket 806-810 may initially have a replenishment rate that is a (equal or unequal) fraction of the replenishment rate for the aggregate token bucket 813 (400,000 tokens per second, in the example of
The per-processor token bucket 806-810 replenishment rates may add up to more than the aggregate token bucket 813 replenishment rate, using the same logic that allows a single processor's per-buffer token bucket replenishment rates to add up to more than the global (processor) token bucket replenishment rate. That is, the per-processor shapers 806-810 may, in total, oversubscribe the aggregate 813 shaping rate. This permits one or more, but not all, of the processors 800-803 to use more than their “fair share” of the bottleneck resource, as long as the bottleneck resource is not utilized to an extent defined by its aggregate target rate. As noted, the aggregate token bucket 813 ensures the bottleneck resource is not over-utilized.
Although embodiments have been described as using TCP packets, other embodiments may use other types of network packets. For example, UDP provides network connections without the guarantee of reliable delivery provided by TCP. UDP is often used in applications that require low latency. Embodiments of the present invention may use UDP as a transport mechanism, which may provide advantages over using UDP without benefit of the present invention. For example, if an application that uses TCP overloads a network link, packets may be delayed and/or retransmitted by TCP, resulting in high latency. However, the packets are reliably delivered. In contrast, if an application that uses UDP overloads a network link, the result is typically high latency and undelivered packets.
Employing an embodiment of the present invention in an application that utilizes UDP transport can prevent or reduce the likelihood of overloading a network link, because the message streams are shaped, therefore preventing or reducing the likelihood of undelivered packets. At the same time, an application that utilizes UDP transport and an embodiment of the present invention retains the benefits of low latency provided by UDP transport when the network is not heavily loaded.
Embodiments of the present invention have been described as being implemented in the application layer of a system. However, other embodiments may be implemented in another layer, such as the layer where TCP or UDP packetization is performed or where the decision when to transmit a packet is performed (i.e., in the layer where the Nagle Algorithm is implemented). Features provided by such an embodiment may be selectively enabled or disabled, such as by a call to setsockopt, passing appropriate parameters. For example, traffic shaping may be enabled for all network connections or only for designated network connections.
In accordance with an exemplary embodiment, systems and methods are provided for dynamically controlling an amount of bundling that occurs in a network communication application. While specific values chosen for these embodiments are recited, it is to be understood that, within the scope of the invention, the values of all of the parameters may vary over wide ranges to suit different applications.
A system for dynamically controlling an amount of bundling that occurs in a network communication application has been described as including a processor controlled by instructions stored in a memory. The memory may be random access memory (RAM), read-only memory (ROM), flash memory or any other memory, or combination thereof, suitable for storing control software or other instructions and data. Some of the functions performed by the system have been described with reference to flowcharts and/or block diagrams. Those skilled in the art should readily appreciate that functions, operations, decisions, etc. of all or a portion of each block, or a combination of blocks, of the flowcharts or block diagrams may be implemented as computer program instructions, software, hardware, firmware or combinations thereof. Those skilled in the art should also readily appreciate that instructions or programs defining the functions of the present invention may be delivered to a processor in many forms, including, but not limited to, information permanently stored on non-writable storage media (e.g. read-only memory devices within a computer, such as ROM, or devices readable by a computer I/O attachment, such as CD-ROM or DVD disks), information alterably stored on writable storage media (e.g. floppy disks, removable flash memory and hard drives) or information conveyed to a computer through communication media, including wired or wireless computer networks. In addition, while the invention may be embodied in software, the functions necessary to implement the invention may optionally or alternatively be embodied in part or in whole using firmware and/or hardware components, such as combinatorial logic, Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other hardware or some combination of hardware, software and/or firmware components.
While the invention is described through the above-described exemplary embodiments, it will be understood by those of ordinary skill in the art that modifications to, and variations of, the illustrated embodiments may be made without departing from the inventive concepts disclosed herein. For example, although some aspects of the system have been described with reference to a flowchart, those skilled in the art should readily appreciate that functions, operations, decisions, etc. of all or a portion of each block, or a combination of blocks, of the flowchart may be combined, separated into separate operations or performed in other orders. Furthermore, disclosed aspects, or portions of these aspects, may be combined in ways not listed above. Accordingly, the invention should not be viewed as being limited to the disclosed embodiment(s).