The present invention relates generally to computer network monitoring and management system design. More particularly, the present invention relates to high-throughput network traffic collection and processing systems. Furthermore, methods for monitoring and analyzing the network traffic and for determining the pairwise network traffic matrix are described.
The present invention pursues optical switching and wavelength division multiplexing technologies for applications in data center networks, and describes a completely new hardware and software design, which significantly reduces the cost and improves the scalability of the system.
In one embodiment, a network traffic collecting and monitoring system includes a traffic processing and dispatching module that pre-processes network traffic received from one or more traffic tapping modules. A traffic collecting module receives and consolidates the network traffic and sends the network traffic to higher-layer applications. A controller dynamically configures the traffic processing and dispatching module to achieve optimal measurement accuracy and network coverage.
The foregoing summary, as well as the following detailed description of preferred embodiments of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments that are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
In the drawings:
Certain terminology is used in the following description for convenience only and is not limiting. The words “right”, “left”, “lower”, and “upper” designate directions in the drawings to which reference is made. The terminology includes the above-listed words, derivatives thereof, and words of similar import. Additionally, the words “a” and “an”, as used in the claims and in the corresponding portions of the specification, mean “at least one.”
The preferred invention will be described in detail with reference to the drawings. The figures and examples below are not meant to limit the scope of the present invention to a single embodiment, but other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where some of the elements of the present invention can be partially or fully implemented using known components, only portions of such known components that are necessary for an understanding of the present invention will be described, and a detailed description of other portions of such known components will be omitted so as not to obscure the invention.
In general, the present invention relates to network traffic monitoring and system management schemes, specifically in server clusters. According to some aspects, the described high-throughput traffic monitoring system is built upon a non-intrusive, application-transparent network traffic duplication scheme, which is based on an optical broadcast-and-select communication mechanism described in U.S. Provisional Patent Application No. 61/719,026 (attached hereto as Appendix A) which is incorporated herein by reference. This communication mechanism is able to duplicate network traffic onto multiple optical fibers with no additional overhead in the data traffic. According to further aspects, the described traffic monitoring system is able to selectively monitor network packet streams coming from different data transmission or switch ports and subsets of the network traffic according to specific criteria, such that minimum packet losses and maximum network coverage are achieved. By utilizing wavelength division multiplexing (WDM) technologies, the described traffic monitoring system is able to have a fine-grained control of selecting the subsets of network traffic to monitor. By analyzing the collected network traffic data, the described monitoring system is able to obtain a network traffic matrix, to infer application dependency, and to conduct fault diagnosis and other management tasks.
A method of dynamically scheduling network traffic monitoring to optimize the monitoring coverage and accuracy is described. The method includes prioritizing network traffic based on volume, optimizing the port monitoring sequence, and reconstructing incomplete monitoring data. Furthermore, a method of network traffic monitoring is described that collects the network traffic through optical signal broadcasting from the network transmission and switching ports, selects the monitoring channels at allocated time slots, and generates the network traffic pattern matrix.
Referring to the figures, wherein like reference numerals indicate corresponding parts in the various figures,
All servers communicate with all other servers in the cluster through the ToR 102 and the cluster switch 103. Many management tasks, such as network intrusion detection, network fault diagnosis, and application dependency discovery, depend on an effective and efficient network monitoring mechanism. In some emerging network technologies, such as software defined networking, network monitoring is of high importance in providing key information for network optimization and interconnection reconfiguration. However, in a production server cluster, the network traffic volume between servers may easily reach 1 Gb/s and even 10 Gb/s, making network traffic monitoring challenging.
Referring to
Network Traffic Tapping Module 201
The network traffic tapping module 201 employs an optical signal duplication mechanism, which is able to generate a copy of the signal transmitted on an incoming optical fiber onto multiple outgoing fiber channels. One such exemplary mechanism is an optical power splitter, which splits the incoming optical signal across multiple output ports. However, other optical signal duplication mechanisms are known to those skilled in the art. Typically, such signal duplication devices are passive, requiring minimum power consumption to achieve the functionality. These devices are also transparent to the bit rate, allowing such a device to be deployed at low-bandwidth edge networks or high-bandwidth core networks.
Traffic Fusion Module 202
The traffic collection module 204 typically maintains a limited number of receiving ports, and therefore has limited data processing capability. To accommodate the processing and port count limitations of the traffic collection module 204, duplicate network traffic generated by the traffic-tapping module 201 is first directed to the traffic fusion module 202 instead of directly to the traffic collection module 204. The main functionality of the fusion module 202 is to consolidate, sample, and/or filter network traffic such that optimal network coverage and operational efficiency are achieved.
An exemplary architecture of the traffic fusion module 202 is shown in
A multi-wavelength optical channel switch 401 takes as input multiple channels of optical signals 402 and generates multiple channels of output optical signals 403. Each of the connecting fiber ports of the input optical signals 402 can carry multiple wavelength channels, while each of the output connecting fiber ports of the output signals 403 carries only one wavelength channel. In addition, the composition of signals carried on each individual channel may change over time. The dynamic signal composition is managed by the central controller 203, which decides what input traffic goes to what output channel based on the network traffic characteristics, and realizes such decisions by initiating control commands to the multi-wavelength optical channel switch 401.
The output 403 of the multi-wavelength optical channel switch 401 is further fed into an electrical packet-dispatching device 405, which conducts network packet header look-up and forwards the packets to the corresponding outgoing ports. The electrical packet-dispatching device 405 may be implemented in a plurality of ways. For example, the device 405 can be implemented using conventional address-based layer-2 or layer-3 switches, rule-based switches (such as OpenFlow switches), or dedicated flow-processing units equipped with a purpose-built chipset. However, other technologies for implementing the electrical packet-dispatching device 405 are known to those skilled in the art, and are within the scope of this disclosure. The packet-dispatching configurations (i.e., what packets go to which outgoing ports) are not static, but can be dynamically changed by the central controller 203 such that minimum packet loss and optimal load balancing are achieved.
The outputs of the electrical packet-dispatching device 405 are sent to the traffic collection and processing module 204 for further processing.
The central controller 203 communicates with both components of the traffic fusion module 202: the optical channel switch 401 and the electrical packet-dispatching device 405. The optical channel switch 401 receives the multiple channels of input optical signals 402 from each network traffic-tapping module 201 and selectively forwards different channels of optical signals 402 onto different output channels 403. Since the input optical signals may have certain conflicts in their physical properties (e.g., wavelength contention in wavelength division multiplexing), the controller 203 communicates with the optical channel switch 401 to guarantee conflict-free input signal admission. In addition, the controller 203 also configures what channels of optical signals 402 are forwarded onto what output channels 403 such that the maximum amount of network traffic is captured by the traffic fusion module 202.
A plurality of methods may be utilized by the controller 203 to achieve this goal. For instance, the controller 203 can simply use a round-robin-like scheduling scheme (i.e., all channels are ordered and monitored in a circular order) to rotate the optical signal channels 402 to be monitored, such that every channel is monitored for an equal-length period of time. The controller 203 can also use an importance sampling based scheduling mechanism, in which the controller 203 allocates more monitoring time to signal channels 402 of higher priority (i.e., higher traffic volume, carrying more relevant traffic, or the like). The controller 203 can also leverage other physical properties or practical application requirements, such as correlation among traffic, parity of the transmitting/receiving ports of the optical transceiver, and contention between optical wavelengths, to improve the monitoring efficiency and accuracy. Other technologies for further optimizing the monitoring performance of the traffic fusion module 202 are known to those skilled in the art and are within the scope of this disclosure.
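The two scheduling schemes named above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the slot-based time model, and the numeric priority values are assumptions, and the proportional allocation shown is a simple deterministic stand-in for an importance-sampling-based scheduler.

```python
from itertools import cycle, islice


def round_robin_schedule(channels, num_slots):
    """Rotate through the channels in a fixed circular order, one slot each."""
    return list(islice(cycle(channels), num_slots))


def weighted_schedule(priorities, num_slots):
    """Allocate monitoring slots in proportion to each channel's priority.

    priorities: {channel: positive weight}, e.g. measured traffic volume
    (hypothetical input). Uses largest-remainder rounding so the slot
    counts always sum to num_slots.
    """
    total = sum(priorities.values())
    shares = {ch: num_slots * p / total for ch, p in priorities.items()}
    slots = {ch: int(share) for ch, share in shares.items()}
    leftover = num_slots - sum(slots.values())
    # Hand the remaining slots to the channels with the largest fractional parts.
    by_remainder = sorted(shares, key=lambda ch: shares[ch] - slots[ch], reverse=True)
    for ch in by_remainder[:leftover]:
        slots[ch] += 1
    return slots
```

With round-robin scheduling every channel receives the same share of monitoring time regardless of load, whereas the weighted variant gives a channel carrying six times the traffic of another six times as many slots.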
The packet-dispatching device 405 takes as input the multi-channel optical signals 403 and redistributes the signals onto the output channels 404, which further feed into the traffic collection and processing module 204. Since the traffic carried in the output channels 404 changes over time, the traffic volume of an output signal 404 may exceed the physical capacity of the input interface of the traffic collection and processing module 204, resulting in packet loss and incomplete packet capture. Thus, the controller 203 continuously monitors the traffic volume of each input signal to the traffic collection and processing module 204, and dynamically adjusts the distribution of the optical signals 403 on the output channels 404, such that packet losses at all the input interfaces of the traffic collection and processing module 204 are prevented or minimized.
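One way the controller 203 could rebalance traffic across the input interfaces of the traffic collection and processing module 204 is a greedy largest-first placement, sketched below. The function name, the per-channel volume measurements, and the uniform interface capacity are assumptions made for illustration; the disclosure does not prescribe a specific balancing algorithm.

```python
def assign_channels(volumes, num_ports, port_capacity):
    """Place each channel on the currently least-loaded output port.

    volumes: {channel: measured traffic volume} (hypothetical measurements)
    Returns (placement, per-port load, list of ports exceeding capacity).
    """
    load = [0.0] * num_ports
    placement = {}
    # Place the heaviest channels first so lighter ones can fill the gaps.
    for ch in sorted(volumes, key=volumes.get, reverse=True):
        port = min(range(num_ports), key=load.__getitem__)
        placement[ch] = port
        load[port] += volumes[ch]
    overloaded = [p for p in range(num_ports) if load[p] > port_capacity]
    return placement, load, overloaded
```

A non-empty `overloaded` list would indicate that packet loss cannot be avoided by redistribution alone, in which case the controller would have to reduce the set of monitored channels.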
The traffic collection and processing module 204 and the controller 203 may be collocated on the same physical device, or they may be deployed separately. An exemplary architecture of the processing module 204 is shown in
The data received from each interface 406 are first buffered in a receive queue within the receive module 601. The higher-layer application 603 then fetches the data and removes the data from the receive queue. For high-speed network interfaces 406 (i.e., 10 Gbps or higher), the application 603 often cannot fetch data fast enough, so the receive queue overflows, resulting in packet losses. To address this issue, the preferred invention utilizes a two-stage circular buffer, as illustrated in
Referring to
L = (C mod Rs) mod Rl,
where Rs and Rl are the sizes of the small 702 and large 703 circular buffers, respectively. Then, in step 805, the gateway 704 gets the data from the buffer and returns it to the aggregation module 602 and further to the application 603.
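The location computation can be evaluated directly, as in the short sketch below. The function name is illustrative, and the interpretation of C as a non-negative integer counter is an assumption, since the surviving text does not define it.

```python
def buffer_location(c, rs, rl):
    """Evaluate L = (C mod Rs) mod Rl.

    c:  non-negative integer counter (assumed meaning of C)
    rs: size of the small circular buffer 702
    rl: size of the large circular buffer 703
    """
    return (c % rs) % rl
```

Because C mod Rs is always smaller than Rs, the outer modulo only changes the result when Rl is smaller than Rs.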
While one exemplary design and implementation of the traffic collection and processing module 204 has been described, other technologies for implementing the traffic collection and processing module 204 are known to those skilled in the art, and are within the scope of this disclosure.
The described apparatus and the related methods enable efficiently collecting, capturing, and processing high-throughput network traffic in a large-scale data center or enterprise network. The utilized broadcast-and-select communication mechanism enables zero-overhead network traffic duplication and tapping. Furthermore, the reconfigurable multi-wavelength channel switch 401 and the packet dispatching device 405 embedded in the traffic fusion module 202 allow the central controller 203 to dynamically select the set of traffic to be monitored such that minimum packet losses and maximum monitoring coverage are achieved.
Embodiments of the present invention relate generally to computer network switch design and network management. More particularly, the present invention relates to scalable and self-optimizing optical circuit switching networks, and methods for managing such networks.
Inside traditional data centers, network load has evolved from local traffic (i.e., intra-rack or intra-subnet communications) into global traffic (i.e., all-to-all communications). Global traffic requires high network throughput between any pair of servers. The conventional over-subscribed tree-like architectures of data center networks provide abundant network bandwidth to the local areas of the hierarchical tree, but provide scarce bandwidth to the remote areas. For this reason, such conventional architectures are unsuitable for the characteristics of today's global data center network traffic.
Various next-generation data center network switching fabric and server interconnect architectures have been proposed to address the issue of global traffic. One such proposed architecture is a completely flat network architecture, in which all-to-all non-blocking communication is achieved. That is, all servers can communicate with all the other servers at the line speed, at the same time. Representatives of this design paradigm are the Clos-network based architectures, such as FatTree and VL2. These systems use highly redundant switches and cables to achieve high network throughput. However, these designs have several key limitations. First, the redundant switches and cables significantly increase the cost for building the network architecture. Second, the complicated interconnections lead to high cabling complexity, making such designs infeasible in practice. Third, the achieved all-time all-to-all non-blocking network communication is not necessary in practical settings, where high-throughput communications are required only during certain periods of time and are constrained to a subset of servers, which may change over time.
A second such proposed architecture attempts to address these limitations by constructing an over-subscribed network with on-demand high-throughput paths to resolve network congestion and hotspots. Specifically, c-Through and Helios design hybrid electrical and optical network architectures, where the electrical part is responsible for maintaining connectivity between all servers and delivering traffic for low-bandwidth flows and the optical part provides on-demand high-bandwidth links for server pairs with heavy network traffic. Another proposal called Flyways is very similar to c-Through and Helios, except that it replaces the optical links with wireless connections. These proposals suffer from similar drawbacks.
Compared to these architectures, a newly proposed system, called OSA, pursues an all-optical design and employs optical switching and optical wavelength division multiplexing technologies. However, the optical switching matrix or Microelectromechanical systems (MEMS) component in OSA significantly increases the cost of the proposed architecture and more importantly limits the applicability of OSA to only small or medium sized data centers.
Accordingly, it is desirable to provide a high-dimensional optical circuit switching fabric with wavelength division multiplexing and wavelength switching and routing technologies that is suitable for all sizes of data centers, and that reduces the cost and improves the scalability and reliability of the system. It is further desirable to control the optical circuit switching fabric to support high-performance interconnection of a large number of network nodes or servers.
In one embodiment, an optical switching system is described. The system includes a plurality of interconnected wavelength selective switching units. Each of the wavelength selective switching units is associated with one or more server racks. The interconnected wavelength selective switching units are arranged into a fixed structure high-dimensional interconnect architecture comprising a plurality of fixed and structured optical links. The optical links are arranged in a k-ary n-cube, ring, mesh, torus, direct binary n-cube, indirect binary n-cube, Omega network or hypercube architecture.
In another embodiment, a broadcast/select optical switching unit is described. The optical switching unit includes a multiplexer, an optical power splitter, a wavelength selective switch and a demultiplexer. The multiplexer has a plurality of first input ports. The multiplexer is configured to combine a plurality of signals in different wavelengths from the plurality of first input ports into a first signal output on a first optical link. The optical power splitter has a plurality of first output ports. The optical power splitter is configured to receive the first signal from the first optical link and to duplicate the first signal into a plurality of duplicate first signals on the plurality of first output ports. The duplicated first signal is transmitted to one or more second optical switching units. The wavelength selective switch has a plurality of second input ports. The wavelength selective switch is configured to receive one or more duplicated second signals from one or more third optical switching units and to output a third signal on a second optical link. The one or more duplicated second signals are generated by second optical power splitters of the one or more third optical switching units. The demultiplexer has a plurality of second output ports. Each second output port has a distinct wavelength. The demultiplexer is configured to receive the third signal from the second optical link and to separate the third signal into the plurality of second output ports.
In yet another embodiment, an optical switching fabric comprising a plurality of optical switching units is described. The plurality of optical switching units are arranged into a fixed structure high-dimensional interconnect architecture. Each optical switching unit includes a multiplexer, a wavelength selective switch, an optical power combiner and a demultiplexer. The multiplexer has a plurality of first input ports. The multiplexer is configured to combine a plurality of signals in different wavelengths from the plurality of first input ports into a first signal output on a first optical link. The wavelength selective switch has a plurality of first output ports. The wavelength selective switch is configured to receive the first signal from the first optical link and to divide the first signal into a plurality of second signals. Each second signal has a distinct wavelength. The plurality of second signals are output on the plurality of first output ports. The plurality of second signals are transmitted to one or more second optical switching units. The optical power combiner has a plurality of second input ports. The optical power combiner is configured to receive one or more third signals having distinct wavelengths from one or more third optical switching units and to output a fourth signal on a second optical link. The fourth signal is a combination of the received one or more third signals. The demultiplexer has a plurality of second output ports. Each second output port has a distinct wavelength. The demultiplexer is configured to receive the fourth signal from the second optical link and to separate the fourth signal into the plurality of second output ports based on their distinct wavelengths.
Certain terminology is used in the following description for convenience only and is not limiting. The words “right”, “left”, “lower”, and “upper” designate directions in the drawings to which reference is made. The terminology includes the above-listed words, derivatives thereof, and words of similar import. Additionally, the words “a” and “an”, as used in the claims and in the corresponding portions of the specification, mean “at least one.”
The present invention will be described in detail with reference to the drawings. The figures and examples below are not meant to limit the scope of the present invention to a single embodiment, but other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where some of the elements of the present invention can be partially or fully implemented using known components, only portions of such known components that are necessary for an understanding of the present invention will be described, and a detailed description of other portions of such known components will be omitted so as not to obscure the invention.
Referring to the drawings in detail, wherein like reference numerals indicate like elements throughout,
Referring to
The k-ary n-cube architecture is denoted by C_n^k, where n is the dimension and vector k = <k1, k2, ..., kn> denotes the number of elements in each dimension. Referring to
Two designs of the wavelength selective switching unit 1403 of
A symmetric architecture of a broadcast-and-select based wavelength selective switching unit 1503 connected to ToR 1103 and servers 1101 is shown in
In the symmetric wavelength selective switching unit 1503 of
Logically above the ToR 1103 is a broadcast-and-select type design for the wavelength selective switching units 1503. The wavelength selective switching units 1503 are further interconnected via fixed and structured fiber links to support a larger number of inter-server communications. Each wavelength selective switching unit 1503 includes an optical signal multiplexing unit (MUX) 1507 and an optical signal demultiplexing unit (DEMUX) 1508, each with m ports, a 1×2n optical wavelength selective switch (WSS) 1510, a 1×2n optical power splitter (PS) 1509, and 2n optical circulators (c) 1511. The optical MUX 1507 combines the optical signals at different wavelengths for transmission in a single fiber. Typically, two types of optical MUX 1507 devices can be used. In a first type of optical MUX 1507, each of the input ports does not correspond to any specific wavelength, while in the second type of optical MUX 1507, each of the input ports corresponds to a specific wavelength. The optical DEMUX 1508 splits the multiple optical signals in different wavelengths in the same fiber into different output ports. Preferably, each of the output ports corresponds to a specific wavelength. The optical PS 1509 splits the optical signals in a single fiber into multiple fibers. The output ports of the optical PS 1509 do not have optical wavelength selectivity. The WSS 1510 can be dynamically configured to decide the wavelength selectivity of each of the multiple input ports. As for the optical circulators 1511, the optical signals arriving via port “a” come out at port “b”, and optical signals arriving via port “b” come out at port “c”. The optical circulators 1511 are used to support bidirectional optical communications in a single fiber. However, in other embodiments, optical circulators 1511 are not required, and may be replaced with two fibers instead of a single fiber.
In the wavelength selective switching unit 1503 of
In the receiving part of the wavelength selective switching unit 1503, optical signals are received from other wavelength selective switching units 1503. The optical signals arrive at port “b” of optical circulators 1511, and leave at port “c”. Port “c” of each optical circulator 1511 is coupled with one of the 2n ports of WSS 1510. Through dynamic configuration of the WSS 1510 with the algorithms described below, selected channels at different wavelengths from different server racks 1102 can pass the WSS 1510 and be further demultiplexed by the optical DEMUX 1508. Preferably, each of the output ports of optical DEMUX 1508 corresponds to a specific wavelength that is different from other ports. Each of the m output ports of the optical DEMUX 1508 is preferably connected with the receiving port of the optical transceiver 1505 at the corresponding wavelength.
Inter-rack communication is conducted using broadcast and select communication, wherein each of the outgoing fibers from the optical PS 1509 carries all the m wavelengths (i.e., all outgoing traffic of the rack). At the receiving end, the WSS 1510 decides what wavelengths of which port are to be admitted, and then forwards them to the output port of the WSS 1510, which is connected to the optical DEMUX 1508. The optical DEMUX 1508 separates the WDM optical signals onto the individual output ports, which are connected to the receiving ports of the optical transceivers 1505. Each ToR 1103 combined with one wavelength selective switching unit 1503 described above constitutes a node 1202 in
Asymmetric Architecture
The asymmetric broadcast-and-select architecture achieves 100% switch port utilization, but at the expense of lower bisection bandwidth. The asymmetric architecture is therefore more suitable than the symmetric architecture for scenarios where server density is of major concern. In an asymmetric architecture, the inter-rack connection topology is the same as that of the symmetric counterpart. The key difference is that the number of the ports of a ToR 1103 that are connected to servers is greater than the number of the ports of the same ToR 1103 that are connected to the wavelength selective switching unit 1403. More specifically, each electrical ToR 1103 has m downstream ports, all of which are connected to servers 1101 in a server rack 1102. Each ToR 1103 also has u upstream ports, which are equipped with u small form factor optical transceivers at different wavelengths, λ1, λ2, . . . λu. In a typical 48-port GigE switch with four 10 GigE upstream ports, for instance, we have 2m=48 and u=4.
Logically above the ToR 1103 is the wavelength selective switching unit 1503, which consists of a multiplexer 1507 and a demultiplexer 1508, each with u ports, a 1×2n WSS 1510, and a 1×2n power splitter (PS) 1509. The transmitting ports and receiving ports of the optical transceivers are connected to the corresponding ports of the optical multiplexer 1507 and demultiplexer 1508, respectively. The output of the optical multiplexer 1507 is connected to the input of the optical PS 1509, and the input of the optical demultiplexer 1508 is connected to the output of the WSS 1510. Each input port of the WSS 1510 is connected directly or through an optical circulator 1511 to an output port of the PS 1509 of the wavelength selective switching unit 1403 in another rack 1102 via an optical fiber. Again, the optical circulator 1511 may be replaced by two fibers.
In practice, it is possible that the ports, which are originally dedicated for downstream communications connected with servers 1101, can be connected to the wavelength selective switching unit 1403, together with the upstream ports. In this case, the optical transceivers 1505 may carry a different bit rate depending on the link capacity of the ports they are connected to. Consequently, the corresponding control software will also need to consider the bit rate heterogeneity while provisioning network bandwidth, as discussed further below.
In both the symmetric and asymmetric architectures, a network manager 1402 optimizes network traffic flows using a plurality of procedures. These procedures will now be described in further detail.
The first procedure estimates the network bandwidth demand of each flow. Multiple options exist for performing this estimation. One option is to run on each server 1101 a software agent that monitors the sending rates of all flows originating from the local server 1101. Such information from all servers 1101 in a data center can be further aggregated, and the server-to-server traffic demand can be inferred by the network manager 1402. A second option for estimating network demand is to mirror the network traffic at the ToRs 1103 using switched port analyzer (SPAN) ports. After the traffic data are collected, network traffic demand can be inferred in the same way as in the first option. The third option is to estimate the network demand by emulating the additive increase and multiplicative decrease (AIMD) behavior of TCP and dynamically inferring the traffic demand without actually capturing the network packets. Based on the deployment scenario, a network administrator can choose the most efficient mechanism from these or other known options.
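The aggregation step of the first option can be sketched as follows, assuming each software agent reports tuples of source server, destination server, and sending rate. The report format and the function name are hypothetical; the disclosure does not specify them.

```python
from collections import defaultdict


def server_demand_matrix(flow_reports):
    """Sum per-flow sending rates into a server-to-server demand matrix.

    flow_reports: iterable of (src_server, dst_server, rate) tuples collected
    from the per-server agents (hypothetical report format).
    Returns {(src_server, dst_server): total rate}.
    """
    demand = defaultdict(float)
    for src, dst, rate in flow_reports:
        demand[(src, dst)] += rate
    return dict(demand)
```

The resulting dictionary is a sparse form of the pairwise traffic matrix: server pairs that exchange no traffic simply have no entry.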
In the second procedure, routing is allocated in a greedy fashion based on the following steps, as shown in the flow chart of
If the capacity of at least one of the links in the selected path is exceeded, the network manager 1402 goes back to step 1705 and picks the next most direct path and repeats steps 1706 and 1707. Otherwise, the network manager 1402 goes to step 1704 to pick the flow with the second highest bandwidth demand and repeats steps 1705 through 1707.
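The greedy loop described above can be sketched as follows. This is an illustration under stated assumptions: the candidate paths for each flow are taken to be precomputed and ordered from most to least direct, links are identified by arbitrary string labels, and a flow for which no candidate path has enough spare capacity is left unrouted. None of these details is fixed by the description.

```python
def greedy_route(flows, candidate_paths, capacity):
    """Route flows in decreasing order of demand over the most direct feasible path.

    flows:           {flow_id: bandwidth demand}
    candidate_paths: {flow_id: [path, ...]}, most direct first; a path is a
                     list of link ids (hypothetical representation)
    capacity:        {link_id: link capacity}
    Returns (routes, residual capacity per link).
    """
    residual = dict(capacity)
    routes = {}
    for flow in sorted(flows, key=flows.get, reverse=True):
        demand = flows[flow]
        for path in candidate_paths[flow]:
            # Accept the path only if every link on it can still carry the flow.
            if all(residual[link] >= demand for link in path):
                for link in path:
                    residual[link] -= demand
                routes[flow] = path
                break
        else:
            routes[flow] = None  # no candidate path has enough spare capacity
    return routes, residual
```

Processing the largest flows first lets them claim the most direct paths, while smaller flows are detoured only when a direct link is already saturated.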
In a physical network, each server rack 1102 is connected to another server rack 1102 by a single optical fiber. But logically, the link is directed. From the perspective of each server 1101, all the optical links connecting other optical switching modules in both the ingress and egress directions carry all the m wavelengths. But since these m wavelengths will be selected by the WSS 1510 at the receiving end, these links can logically be represented by the set of wavelengths to be admitted.
The logical graph of a 4-ary 2-cube cluster is illustrated in
Next, all the WHITE nodes are placed on top, and all GREY nodes are placed on the bottom, and a bipartite graph is obtained, as shown in
In this procedure, the network manager 1402 provisions the network bandwidth based on the traffic demand obtained from Procedure 1 and/or Procedure 2, and then allocates wavelengths to be admitted at different receiving WSSs 1510, based on the following steps, as shown in the flowchart of
In step 11004, since at the WSS 1510, the same wavelength carried by multiple optical links cannot be admitted simultaneously (i.e., the wavelength contention problem), the network manager 1402 needs to ensure that for each receiving node, there is no overlap of wavelength assignment across the 2n input ports. Thereafter, the process ends at step 11005.
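The contention check for a single receiving node can be sketched as follows. The data layout, a mapping from each WSS input port to the set of wavelength indices admitted from it, is an assumption made for illustration.

```python
def find_contention(admitted):
    """Report wavelengths admitted on more than one input port of one WSS.

    admitted: {input_port: set of wavelength indices admitted from that port}
    Returns {wavelength: [ports]} for each contended wavelength; an empty
    dict means the assignment is free of wavelength contention.
    """
    seen = {}
    for port, wavelengths in admitted.items():
        for w in wavelengths:
            seen.setdefault(w, []).append(port)
    return {w: ports for w, ports in seen.items() if len(ports) > 1}
```

A network manager could run such a check on every receiving node after computing a candidate assignment and reject or repair any assignment for which the result is non-empty.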
Procedure 3 does not consider the impact of changes of wavelength assignment, which may disrupt network connectivity and lead to application performance degradation. Thus, in practice, it is desirable that only a minimum number of wavelength changes are performed to satisfy the bandwidth demands. Therefore, it is desirable to maximize the overlap between the old wavelength assignment πold and the new assignment πnew. The classic Hungarian method can be adopted as a heuristic to achieve this goal. The Hungarian method is a combinatorial optimization algorithm that solves assignment problems in polynomial time. This procedure is described with reference to the flow chart of
such that M×R is minimized, while maintaining routing connectivity. The process ends at step 1105.
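The overlap-maximizing use of the Hungarian method can be illustrated with a small sketch. It assumes, purely for illustration, that the new assignment at one receiving node is available as a list of disjoint wavelength groups, that any group may serve any input port (for example, when the ports have equal demand), and that the number of groups equals the number of ports. The cost of giving a group to a port is the negated size of its overlap with that port's old wavelength set, so a minimum-cost assignment keeps the most wavelengths unchanged. The names `hungarian` and `reassign_with_max_overlap` are hypothetical, and this formulation is one possible reading rather than the method of the disclosure.

```python
def hungarian(cost):
    """Minimum-cost perfect assignment for a square cost matrix.

    Returns a list whose i-th entry is the column assigned to row i.
    Standard O(n^3) potentials-based form of the Hungarian method.
    """
    n = len(cost)
    inf = float("inf")
    u = [0] * (n + 1)      # row potentials (1-based)
    v = [0] * (n + 1)      # column potentials (1-based)
    p = [0] * (n + 1)      # p[j] = row currently matched to column j
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path back to column 0.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    result = [0] * n
    for j in range(1, n + 1):
        result[p[j] - 1] = j - 1
    return result


def reassign_with_max_overlap(old_sets, new_groups):
    """Give each port the new group that keeps the total overlap largest."""
    cost = [[-len(old & group) for group in new_groups] for old in old_sets]
    columns = hungarian(cost)
    return [new_groups[c] for c in columns]


old_sets = [{1, 2}, {3, 4}, {5, 6}]       # old wavelengths per input port
new_groups = [{3, 4}, {5, 6}, {1, 2}]     # new groups, port order not yet fixed
per_port = reassign_with_max_overlap(old_sets, new_groups)
kept = sum(len(o & g) for o, g in zip(old_sets, per_port))
print(per_port, kept)
```

In this example every port can keep both of its old wavelengths, so no WSS port needs to be retuned. With less favorable groups the same call returns the port ordering that retunes the fewest wavelengths under the stated assumptions.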
The fifth procedure achieves highly fault-tolerant routing. Given the n-dimensional architecture, there are 2n node-disjoint parallel paths between any two ToRs 1103. Upon detecting a failure event, the associated ToR 1103 notifies the network manager 1402 immediately, and the network manager 1402 informs all the remaining ToRs 1103. Each ToR 1103 receiving the failure message can readily determine which paths and corresponding destinations are affected, and detour the packets via the remaining paths to the appropriate destinations. Applying this procedure allows the performance of the whole system to degrade gracefully even when a large percentage of network nodes and/or links have failed.
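The path check performed by each ToR on receipt of a failure message can be sketched as follows. This is a minimal illustration, assuming each ToR stores its parallel paths as lists of node coordinates and that a failure message carries a set of failed nodes and a set of failed links. The function name `surviving_paths` and the message layout are assumptions made for this sketch. The example uses a 4-ary 2-cube (n = 2), in which four node-disjoint paths connect nodes (0, 0) and (1, 1).

```python
def surviving_paths(paths, failed_nodes=frozenset(), failed_links=frozenset()):
    """Keep only the paths that avoid every failed node and failed link.

    paths: list of paths, each a list of node identifiers from source to
           destination.
    failed_links: set of frozenset({a, b}) pairs, since a physical link is
           treated as undirected in this sketch.
    """
    alive = []
    for path in paths:
        hits_node = any(node in failed_nodes for node in path)
        hits_link = any(
            frozenset(hop) in failed_links for hop in zip(path, path[1:])
        )
        if not (hits_node or hits_link):
            alive.append(path)
    return alive


# Four node-disjoint paths from (0, 0) to (1, 1) in a 4-ary 2-cube.
paths = [
    [(0, 0), (1, 0), (1, 1)],
    [(0, 0), (0, 1), (1, 1)],
    [(0, 0), (3, 0), (2, 0), (2, 1), (1, 1)],
    [(0, 0), (0, 3), (1, 3), (1, 2), (1, 1)],
]
failed_nodes = {(1, 0)}
failed_links = {frozenset({(2, 0), (2, 1)})}
for path in surviving_paths(paths, failed_nodes, failed_links):
    print(path)
```

Because the paths are node-disjoint, a single failed node or link removes at most one of them, which is why the remaining paths can carry the detoured traffic.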
In the broadcast-and-select based design, each of the 2n egress links of a ToR 1103 carries all the m wavelengths. It is left to the receiving WSS 1510 to decide which wavelengths to admit. Thus, multicast, anycast or broadcast can be efficiently realized by configuring the WSSs 1510 such that the same wavelength of the same ToR 1103 is simultaneously admitted by multiple ToRs 1103. The network manager 1402 needs to employ methods similar to their IP-based counterparts to maintain the group membership for the multicast, anycast or broadcast.
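Admitting one sender wavelength at several receivers can be sketched as a configuration update. The sketch below assumes the receivers are directly connected to the sender, that each receiver's WSS state is a mapping from input port to the set of admitted wavelengths, and that a lookup table gives the input port facing the sender. The function name `admit_on_receivers`, the ToR labels, and both data structures are hypothetical. The update is refused if it would admit the same wavelength on two input ports of one receiver.

```python
def admit_on_receivers(wss_config, port_facing_sender, wavelength, receivers):
    """Admit one sender wavelength at every receiver in a multicast group.

    wss_config: dict receiver -> dict input port -> set of admitted wavelengths
    port_facing_sender: dict receiver -> input port fed by the sender's link
    Raises ValueError, leaving wss_config unchanged, if any receiver already
    admits the wavelength on a different input port.
    """
    # First pass: validate every receiver so the update is all-or-nothing.
    for receiver in receivers:
        port = port_facing_sender[receiver]
        for other_port, admitted in wss_config[receiver].items():
            if other_port != port and wavelength in admitted:
                raise ValueError(
                    f"wavelength {wavelength} already admitted at "
                    f"{receiver} on port {other_port}"
                )
    # Second pass: apply the change.
    for receiver in receivers:
        wss_config[receiver][port_facing_sender[receiver]].add(wavelength)
    return wss_config


# Two neighboring ToRs, each with 2n = 4 input ports (n = 2).
config = {
    "T1": {0: {1}, 1: set(), 2: set(), 3: set()},
    "T2": {0: set(), 1: {2}, 2: set(), 3: set()},
}
facing = {"T1": 0, "T2": 3}   # input port on each receiver fed by the sender
admit_on_receivers(config, facing, wavelength=7, receivers=["T1", "T2"])
print(config)
```

Validating all receivers before changing any of them keeps the group consistent: either every member admits the wavelength or none does.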
In the symmetric architecture described so far, the number of ports of a ToR 1103 switch that are connected to servers equals the number of ports of the same ToR 1103 that are connected to the wavelength selective switching unit 1403. This architecture achieves high bisection bandwidth between the servers 1101 residing in the same server rack 1102 and the rest of the network, at the expense of only 50% switch port utilization.
The architecture of the wavelength selective switching unit 1603 used for point-to-point communication is described in U.S. Patent Application Publication Nos. 2012/0008944 to Ankit Singla and 2012/0099863 to Lei Xu, the entire disclosures of both of which are incorporated by reference herein. In the present invention, these point-to-point based wavelength selective switching units 1603 are arranged into the high-dimensional interconnect architecture 1404 in a fixed structure. In the wavelength selective switching unit 1603, as illustrated with reference to
Logically above the ToR 1103 are the wavelength selective switching units 1603, which are further interconnected to support a larger number of communications between servers 1101. Each wavelength selective switching unit 1603 includes an optical MUX 1507 and DEMUX 1508 each with m ports, a 1×2n optical wavelength selective switch (WSS) 1510, a 1×2n optical power combiner (PC) 601, and 2n optical circulators 1511. In operation, the optical PC 601 combines optical signals from multiple fibers into a single fiber. The WSS 1510 can be dynamically configured to route the optical signal at each wavelength from the single input port to one of the output ports. The optical circulators 1511 are used to support bi-directional optical communications using a single fiber. Again, the optical circulators 1511 are not required, as two fibers can be used to achieve the same function.
Similar to the broadcast-and-select based system described earlier, all the wavelength selective switching units 1403 are interconnected using a high-dimensional architecture and are controlled by the network manager 1402. The network manager 1402 dynamically controls the optical switch fabric following the procedures below.
Procedures 1, 2, 5 and 6 are the same as the corresponding procedures discussed above with respect to the broadcast-and-select based system.
The third procedure of the point-to-point architecture is described with reference to
This procedure is similar to Procedure 4 in the broadcast-and-select based system, finding a minimum set of wavelengths while satisfying the bandwidth demands. This procedure first finds a new wavelength assignment πnew that has a large wavelength overlap with the old assignment πold. It then uses πnew as the initial state and applies an adapted Hungarian method to fine-tune πnew, further increasing the overlap between πnew and πold.
In the present invention, all of the wavelength selective switching units 1603 are interconnected using a fixed, specially designed high-dimensional architecture. Ideal scalability, intelligent network control, high routing flexibility, and excellent fault tolerance are all embedded and efficiently realized in the disclosed fixed high-dimensional architecture. Thus, network downtime and application performance degradation due to the long switching delay of an optical switching matrix are overcome in the present invention.
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.
This application claims priority to U.S. Provisional Patent Application No. 61/825,292 filed May 20, 2013, which is incorporated herein by reference. This patent application is related to U.S. Provisional Patent Application No. 61/719,026 filed Oct. 26, 2012, now U.S. application Ser. No. 14/057,133 filed Oct. 18, 2013, published as U.S. Patent Application Publication No. 2014/0119728. Substantive portions of U.S. Provisional Patent Application No. 61/719,026 are attached hereto in an Appendix to the present application. U.S. Provisional Patent Application No. 61/719,026 is incorporated herein by reference.
Number | Date | Country
---|---|---
61/825,292 | May 2013 | US