The present invention relates generally to computing technology, and more specifically, to resource over-subscription.
Data centers may be configured to process large amounts or volumes of data. In the context of processing large amounts or volumes of data, a map-reduce algorithm may be used. The map-reduce algorithm may entail a mapping of a large data set into smaller data sets or workloads. The workloads may be processed by a plurality of machines, virtual machines, or threads, potentially in parallel, to obtain sub-processed results. The sub-processed results may ultimately be merged or combined to obtain overall results.
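By way of a non-limiting illustration, the map-reduce pattern described above may be sketched as follows. The sketch assumes a word-count task and a pool of threads; the function names (`split_into_workloads`, `process_workload`, `map_reduce`) are hypothetical and are not drawn from any particular map-reduce framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_workloads(records, n_workloads):
    """Partition a large data set into smaller data sets (workloads)."""
    return [records[i::n_workloads] for i in range(n_workloads)]


def process_workload(workload):
    """Process one workload to obtain a sub-processed result (word counts)."""
    counts = Counter()
    for line in workload:
        counts.update(line.split())
    return counts


def map_reduce(records, n_workloads=4):
    """Split, process the workloads on a thread pool, then merge the results."""
    workloads = split_into_workloads(records, n_workloads)
    # The workloads are handed to a pool of threads, potentially in parallel.
    with ThreadPoolExecutor(max_workers=n_workloads) as pool:
        sub_results = list(pool.map(process_workload, workloads))
    # Merge the sub-processed results into overall results.
    total = Counter()
    for sub_result in sub_results:
        total.update(sub_result)
    return total
```

The exchange of sub-processed results between the processing step and the merge step corresponds to the traffic that a switch may be required to carry.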
In the context of network computing, a resource, such as a switch, may enter a so-called “over-subscribed” state. Succinctly stated, the switch may be over-subscribed if the input data or load required to be processed or handled by the switch exceeds the output capacity of the switch. An over-subscribed resource may represent a bottleneck in a network.
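As a minimal sketch of the condition described above, the aggregate input load of a switch may be compared against its aggregate output capacity. The port counts and speeds in the usage note are hypothetical values chosen only for illustration, and the ratio form is an assumed convention rather than a definition taken from this description.

```python
def over_subscription_ratio(input_rates_gbps, output_rates_gbps):
    """Ratio of total offered input load to total output capacity."""
    return sum(input_rates_gbps) / sum(output_rates_gbps)


def is_over_subscribed(input_rates_gbps, output_rates_gbps):
    """True when the input load exceeds the output capacity of the switch."""
    return over_subscription_ratio(input_rates_gbps, output_rates_gbps) > 1.0
```

For example, a hypothetical switch with forty-eight 10 Gb/s inputs all sending at line rate toward four 40 Gb/s outputs would present a ratio of 3, and would therefore be over-subscribed while that load persists.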
To address over-subscription, additional resources (e.g., additional switches) may be allocated. However, allocating additional resources represents additional cost in terms of, e.g., money, complexity, and management. Moreover, over-subscription may represent a dynamic or transient condition. Thus, the additional resources may be idle a majority of the time, resulting in an underutilization of the resources. As such, a network provider or operator may elect to forgo allocating the additional resources. However, if not addressed, over-subscription may result in a loss of data (e.g., data packets). A loss of data may be reflected in terms of degraded network quality or reliability.
Embodiments include a method, system, and computer program product for managing workloads in a network. A switch receives data associated with a workload. The received data is tagged with an identifier that associates the data with the workload. The received data is compressed based on determining that second data stored in a buffer of the switch exceeds a threshold. The switch stores the compressed data in the buffer. The compressed data is transmitted to a second network based on a determination that the switch is over-subscribed.
The subject matter which is regarded as embodiments is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
In accordance with one or more embodiments, systems, apparatuses, and methods are described that address over-subscription of a network resource, such as a switch. Buffers associated with the switch are monitored to determine when input data to be processed by the switch exceeds a threshold. When the input data exceeds the threshold, the data may be compressed and tagged with a unique identifier. The unique identifier distinguishes the source or workload from which the data originates. The compressed, tagged data is transmitted on one or more output links of the switch. The compressed data takes up less bandwidth of the output link(s) than an uncompressed version of the data. In some embodiments, data (e.g., compressed data) may be provided by a switch to a management network for handling or processing in order to leverage bandwidth available in the management network.
Turning now to
The system 100 may include a number of different types of computing devices. For purposes of illustrative simplicity and ease of explanation, the system 100 is shown as including a number of servers 114 and a number of switches 122. A skilled artisan would appreciate that other types of devices may be included in some embodiments.
In some embodiments, the switches 122 may be coupled to one another. For example, data may traverse one or more switches 122, and potentially one or more of the servers 114, as part of a multi-hop path.
The servers 114 may be coupled to one or more ports of the switches 122. For example, the servers 114 may be coupled to data ports (DPs) 130 of the switches 122. A DP 130 may generally be used as a principal port to convey data between a server 114 and a switch 122. A DP 130 may be coupled to one or more management ports (MPs) 140. The role of the MP 140 is described further below.
As data is provided from a server 114 to a switch 122, potentially as part of a so-called “shuffle phase” of a map-reduce algorithm, the switch 122 may buffer the data via one or more buffers 150. The buffer 150 may be used to provide the switch 122 additional time to process or handle incoming data. Such additional time may be needed in the event that the volume of incoming data exceeds the capacity of the switch 122 to process that data. Use of the buffer 150 may help to avoid or minimize data loss.
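The role of the buffer in absorbing a burst may be illustrated with a simple discrete-time sketch. The model below is an assumption made for illustration only: data arrives in whole units per tick, the switch forwards a fixed number of units per tick, and anything that does not fit in the buffer at the end of a tick is counted as lost.

```python
def simulate_buffer(arrivals_per_tick, service_per_tick, buffer_capacity):
    """Return (units_dropped, units_left_in_buffer) for a sequence of ticks."""
    queued = 0
    dropped = 0
    for arrived in arrivals_per_tick:
        queued += arrived
        # The switch handles as much as its capacity allows in this tick.
        queued -= min(queued, service_per_tick)
        # Whatever is left must fit in the buffer; the excess is lost.
        if queued > buffer_capacity:
            dropped += queued - buffer_capacity
            queued = buffer_capacity
    return dropped, queued
```

Under these assumptions, a larger buffer defers the onset of loss during a burst but does not prevent it when the incoming volume exceeds the handling capacity for a sustained period.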
In some embodiments, the state of the buffer 150 may be monitored. Such monitoring may occur at one or more entities. For example, the switch 122 may monitor the state of its own buffer 150. In some embodiments, the state of the buffer 150 may be monitored via the management network 104 (potentially in association with the MP 140).
The processing or handling of the incoming data at a switch 122 may be a function of the state of the buffer 150. For example, if the incoming data stored in the buffer 150 for handling or processing by the switch 122 exceeds a threshold, then the switch 122 may compress the data and may tag the data with a unique identifier (ID). The ID may identify the source or origin of the data in terms of a workload, in order to allow final results of handling or processing to be associated with a given task. On the other hand, if the incoming data stored in the buffer 150 is less than the threshold, the switch might not compress the data.
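One possible, non-limiting sketch of the threshold-based handling described above follows. The `Frame` and `SwitchBuffer` names are hypothetical, the use of `zlib` is an assumed choice of compression scheme, and the sketch assumes that the workload ID is attached to every frame whether or not the frame is compressed.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Frame:
    workload_id: str   # unique ID associating the data with its workload
    compressed: bool
    payload: bytes


@dataclass
class SwitchBuffer:
    threshold_bytes: int
    frames: list = field(default_factory=list)

    def occupancy(self):
        """Number of payload bytes currently held in the buffer."""
        return sum(len(frame.payload) for frame in self.frames)

    def enqueue(self, data, workload_id):
        """Store incoming data, compressing it if the buffer is above threshold."""
        if self.occupancy() > self.threshold_bytes:
            frame = Frame(workload_id, True, zlib.compress(data))
        else:
            # Below the threshold the data is stored without compression.
            frame = Frame(workload_id, False, data)
        self.frames.append(frame)
        return frame
```

Because each frame records whether it was compressed, a downstream entity may restore the original payload with `zlib.decompress` before associating the results with the tagged workload.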
The threshold used to determine whether to compress the data may be a function of one or more parameters. For example, throughput requirements (e.g., the amount of data processed or handled per unit time), an anticipated maximum rate of incoming data at the switch 122, and a capacity of the buffer 150 may be considered in selecting the threshold. Moreover, the selected threshold may be dynamic in nature and may change based on one or more considerations or factors.
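The description above does not prescribe a formula for the threshold. Purely as an assumed example, the threshold may be set so that the buffer retains enough headroom to absorb the worst-case surplus of incoming data during the time the switch needs to begin compressing. The function name, its parameters, and the notion of a `reaction_time` are hypothetical.

```python
def select_threshold(buffer_capacity, max_ingress_rate, drain_rate, reaction_time):
    """Return a compression threshold in the same units as buffer_capacity.

    max_ingress_rate: anticipated maximum rate of incoming data
    drain_rate:       rate at which the switch handles data (throughput)
    reaction_time:    assumed delay before compression takes effect
    """
    surplus_rate = max(0.0, max_ingress_rate - drain_rate)
    headroom = surplus_rate * reaction_time
    # Clamp the result to the physical limits of the buffer.
    return max(0.0, min(buffer_capacity, buffer_capacity - headroom))
```

Re-evaluating such a function as the measured rates change is one way in which the selected threshold could be dynamic in nature.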
Compressed data may consume less bandwidth on an output link of a switch 122 relative to an uncompressed version of the data. On the other hand, compression represents an additional task that increases latency in terms of the time it takes for the data to arrive at a final destination (e.g., a server 114) and/or for a final result of the processing of the data to be generated. Accordingly, the selection of the threshold described above may take the trade-off between bandwidth and latency into consideration.
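The trade-off between bandwidth and latency may be illustrated with a short worked calculation. The model is a simplifying assumption: delivery time is taken as the time spent compressing plus the time to place the (possibly reduced) data on the output link, and decompression and queueing delays are ignored. The function names and the numeric values in the usage note are hypothetical.

```python
def delivery_time(size_bytes, link_bytes_per_s, compression_ratio=1.0,
                  compress_seconds=0.0):
    """Seconds to compress (if at all) and then transmit the data."""
    return compress_seconds + (size_bytes * compression_ratio) / link_bytes_per_s


def compression_pays_off(size_bytes, link_bytes_per_s, compression_ratio,
                         compress_seconds):
    """True when compressing first delivers the data sooner than sending it as-is."""
    plain = delivery_time(size_bytes, link_bytes_per_s)
    packed = delivery_time(size_bytes, link_bytes_per_s,
                           compression_ratio, compress_seconds)
    return packed < plain
```

For example, 1000 bytes over a 100 byte-per-second link take 10 seconds uncompressed. If compression halves the data and takes 2 seconds, delivery takes 7 seconds and compression is worthwhile; if compression takes 6 seconds, delivery takes 11 seconds and the added latency outweighs the bandwidth saved.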
As described above, the management network 104 may monitor the state of the buffers 150. More generally, the management network 104 may monitor the performance of, and manage any errors associated with, the network 102. Referring to
The management network 104 may have spare capacity or bandwidth available after taking into consideration any bandwidth needed for monitoring and management purposes. This extra bandwidth may be exploited in the event that the data network 102 is over-subscribed. In this respect, as shown in
The switches 122 may be software-defined network (SDN) enabled switches. In this respect, data may be transferred between various entities or ports (e.g., DPs 130 and MPs 140) of a switch 122.
Turning to
In block 302, data associated with a workload may be received. For example, the data may be received by a switch via a DP of the switch.
In block 304, a determination may be made regarding a status of a monitoring algorithm. For example, if the monitoring algorithm indicates that data in a buffer of the switch exceeds a threshold or that the switch is over-subscribed, flow may proceed from block 304 to block 306. Otherwise, if the monitoring indicates that the switch/buffer has sufficient capacity to accommodate ongoing data operations, flow may proceed from block 304 to block 340.
In block 306, the received data of block 302 may be compressed and/or tagged with a unique ID. As part of block 306, the compressed and/or tagged data may be stored in a buffer of the switch.
The flow from block 306 may be dictated based on the status of the monitoring of block 304. For example, if the switch is over-subscribed, flow may proceed from block 306 to block 308. Otherwise, if the switch is not over-subscribed, flow may proceed from block 306 to block 340.
In block 308, the (compressed) data may be transferred from the switch to a secondary network (e.g., a management network) for handling/processing.
In block 340, the data (e.g., compressed or uncompressed data) may be processed or handled by the switch to generate results or sub-results. As part of block 340, sub-results may be merged with sub-results associated with a common ID potentially handled by other entities, such as other switches. The merger may allow for a generation of overall results associated with a workload.
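The flow through blocks 302 to 340 may be sketched as follows, under stated assumptions. The function names are hypothetical, `zlib` is an assumed compression scheme, the workload ID is assumed to be known for all received data, and the "processing" of block 340 is reduced to counting payload bytes so that the merging of sub-results by common ID can be shown.

```python
import zlib
from collections import defaultdict


def dispatch(data, workload_id, buffered_bytes, threshold_bytes, over_subscribed):
    """Return (next_block, frame) for data received in block 302."""
    # Block 304: consult the status of the monitoring.
    if buffered_bytes <= threshold_bytes and not over_subscribed:
        frame = {"id": workload_id, "compressed": False, "payload": data}
        return 340, frame
    # Block 306: compress the data and tag it with the unique ID.
    frame = {"id": workload_id, "compressed": True,
             "payload": zlib.compress(data)}
    # Block 308 (secondary network) if over-subscribed, otherwise block 340.
    return (308 if over_subscribed else 340), frame


def process(frame):
    """Block 340 (illustrative only): produce a sub-result for the frame."""
    payload = frame["payload"]
    if frame["compressed"]:
        payload = zlib.decompress(payload)
    return frame["id"], len(payload)


def merge_sub_results(sub_results):
    """Merge sub-results that share a common ID into overall results."""
    totals = defaultdict(int)
    for workload_id, value in sub_results:
        totals[workload_id] += value
    return dict(totals)
```

In this sketch, sub-results produced by other entities, such as other switches, may be passed to `merge_sub_results` alongside locally produced sub-results, provided they carry the same ID.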
The method 300 is illustrative. In some embodiments, one or more of the blocks, or a portion thereof, may be optional. In some embodiments, additional blocks or operations not shown may be included. In some embodiments, the blocks may execute in an order or sequence that is different from what is shown in
Referring to
The instructions stored in the memory 402 may be executed by one or more processors, such as a processor 406. The processor 406 may be coupled to one or more input/output (I/O) devices 408. In some embodiments, the I/O device(s) 408 may include one or more of a keyboard or keypad, a touchscreen or touch panel, a display screen, a microphone, a speaker, a mouse, a button, a remote control, a joystick, a printer, etc. The I/O device(s) 408 may be configured to provide an interface to allow a user to interact with the system 400.
The processor 406 may include one or more hard drives 410. The hard drives 410 may be used to store data.
The system 400 is illustrative. In some embodiments, one or more of the entities may be optional. In some embodiments, additional entities not shown may be included. For example, in some embodiments the system 400 may be associated with one or more networks. In some embodiments, the entities may be arranged or organized in a manner different from what is shown in
Technical effects and benefits include an ability to maximize network performance and reliability by addressing or mitigating the impact of over-subscription. Aspects of the disclosure may be applied in connection with one or more components or devices, such as a HADOOP network switch. In some embodiments, a switch may compress shuffle data provided as input to the switch in order to reduce buffer utilization/requirements in the switch. Signatures or traffic classes may be associated with the compressed data to facilitate bandwidth allocation on available network links, potentially avoiding over-subscription. In cases where over-subscription is unavoidable, spare bandwidth associated with a secondary network may be utilized for temporary data transfer purposes.
As will be appreciated by one of average skill in the art, aspects of embodiments may be embodied as a system, method or computer program product. Accordingly, aspects of embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as, for example, a “circuit,” “module” or “system.” Furthermore, aspects of embodiments may take the form of a computer program product embodied in one or more computer readable storage device(s) having computer readable program code embodied thereon.
One or more of the capabilities of embodiments can be implemented in software, firmware, hardware, or some combination thereof. Further, one or more of the capabilities can be emulated.
An embodiment may be a computer program product for enabling processor circuits to perform elements of the invention, the computer program product comprising a computer readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method.
The computer readable storage medium (or media) is a tangible, non-transitory storage medium having instructions recorded thereon for causing a processor circuit to perform a method. The “computer readable storage medium” is non-transitory at least because once the instructions are recorded on the medium, the recorded instructions can be subsequently read one or more times by the processor circuit at times that are independent of the time of recording. The non-transitory “computer readable storage media” include devices that retain recorded information only while powered (volatile devices) and devices that retain recorded information independently of being powered (non-volatile devices). An example, non-exhaustive list of “non-transitory storage media” includes: a semiconductor storage device comprising, for example, a memory array such as a RAM or a memory circuit such as a latch having instructions recorded thereon; a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon; an optically readable device such as a CD or DVD having instructions recorded thereon; and a magnetically encoded device such as a magnetic tape or a magnetic disk having instructions recorded thereon.
A non-exhaustive list of examples of a computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), and a portable compact disc read-only memory (CD-ROM). Program code can be distributed to respective computing/processing devices from an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface card in each computing/processing device receives a program from the network and forwards the program for storage in a computer-readable storage device within the respective computing/processing device.
Computer program instructions for carrying out operations for aspects of embodiments may be for example assembler code, machine code, microcode or either source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of embodiments are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable storage medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Number | Date | Country | |
---|---|---|---|
20150172383 A1 | Jun 2015 | US |