This application is related to the following patent applications, each of which is hereby incorporated by reference in its entirety:
U.S. patent application Ser. No. 13/109,849, filed May 17, 2011, entitled “SYSTEM AND METHOD FOR ZERO BUFFER COPYING IN A MIDDLEWARE ENVIRONMENT”;
U.S. patent application Ser. No. 13/170,490, filed Jun. 28, 2011, entitled “SYSTEM AND METHOD FOR PROVIDING SCATTER/GATHER DATA PROCESSING IN A MIDDLEWARE ENVIRONMENT”;
U.S. patent application Ser. No. 13/109,871, filed May 17, 2011, entitled “SYSTEM AND METHOD FOR PARALLEL MUXING BETWEEN SERVERS IN A CLUSTER”; and
U.S. patent application Ser. No. 13/167,636, filed Jun. 23, 2011, entitled “SYSTEM AND METHOD FOR SUPPORTING LAZY DESERIALIZATION OF SESSION INFORMATION IN A SERVER CLUSTER”.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention is generally related to computer systems and software such as middleware, and is particularly related to systems and methods for muxing between servers in a cluster.
Within any large organization, over the span of many years the organization often finds itself with a sprawling IT infrastructure that encompasses a variety of different computer hardware, operating-systems, and application software. Although each individual component of such infrastructure might itself be well-engineered and well-maintained, when attempts are made to interconnect such components, or to share common resources, it is often a difficult administration task. In recent years, organizations have turned their attention to technologies such as virtualization and centralized storage, and even more recently cloud computing, which can provide the basis for a shared infrastructure. However, there are few all-in-one platforms that are particularly suited for use in such environments. These are the general areas that embodiments of the invention are intended to address.
Systems and methods are described for efficient, low-latency multiplexing (herein referred to as “muxing”) between servers in a cluster. One such system can include a cluster of one or more high performance computing systems, each including one or more processors and a high performance memory. The cluster communicates over an InfiniBand network. The system can also include a middleware environment, executing on the cluster, which includes one or more application server instances. The system can include one or more selectors, wherein each said selector contains a queue of read-ready file descriptors. Furthermore, the system can include a shared queue, wherein the read-ready file descriptors in each said selector can be emptied into the shared queue. Additionally, a plurality of multiplexer (herein referred to as “muxer”) threads operate to take work from said shared queue.
Other objects and advantages of the present invention will become apparent to those skilled in the art from the following detailed description of the various embodiments, when read in light of the accompanying drawings.
Described herein are systems and methods that can support work sharing muxing in a cluster.
Simple Muxing
A poll device, such as a selector 302, which may be exposed via a selector interface, can include a queue of read-ready file descriptors, such as sockets (shown as dots in the list). The selector 302 can be used by one or more muxer threads 305a-305c to poll the FD cache 301. A thread 305a-305c may be blocked at the selector 302, e.g. while placing a Selector.select( ) function call, waiting for a scan on the FD cache 301 to complete. Then, the thread 305a-305c can copy read-ready file descriptors into the selector list 302.
Each muxer thread 305a-305c can maintain a thread-local list 303a-303c. The thread-local list 303a-303c includes a list of read-ready sockets (shown as dots in the list) that can be processed by the thread 305a-305c. Since the list 303a-303c is thread-local, other threads may not be able to help processing that list, even when the other threads are idle.
As shown in
A request manager 304 can be used to handle one or more requests from different servers in the middleware machine environment 300. The request manager 304 is a component with multiple queues, to which the requests prepared by the muxer threads 305a-305c can be added. These queues can be first-in-first-out (FIFO) queues, or priority queues. Additionally, constraints on the thread counts may be enforced on the various queues in the request manager 304.
As shown in
Furthermore, once the muxer thread 305b returns from Selector.select( ), another thread, e.g. 305c, may enter Selector.select( ) again. In such a case, since the selector 302 has just been emptied, it is likely to block the muxer thread 305c. Thus, there may be a situation where most of the muxer threads are waiting, while one muxer thread is busy.
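The imbalance described above can be illustrated with a minimal sketch. It is not the implementation of the described system: SelectorModel, drain( ), and pollInTurn( ) are hypothetical names, and the stand-in selector returns immediately instead of blocking as Selector.select( ) would. The sketch only shows that the first muxer to poll a single shared selector takes every read-ready socket into its thread-local list, leaving nothing for the muxers that poll after it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of simple muxing: one shared selector, one thread-local list per muxer.
class SimpleMuxingSketch {
    /** Stand-in for a selector; holds the ids of read-ready sockets. */
    static final class SelectorModel {
        private final Queue<Integer> readReady = new ArrayDeque<>();

        void markReady(int socketId) {
            readReady.add(socketId);
        }

        /** Non-blocking stand-in for Selector.select(): hands over everything that is ready. */
        List<Integer> drain() {
            List<Integer> out = new ArrayList<>(readReady);
            readReady.clear();
            return out;
        }
    }

    /** Returns the size of each muxer's thread-local list after the muxers poll once, in turn. */
    static int[] pollInTurn(int readySockets, int muxers) {
        SelectorModel selector = new SelectorModel();
        for (int i = 0; i < readySockets; i++) {
            selector.markReady(i);
        }
        int[] localListSizes = new int[muxers];
        for (int m = 0; m < muxers; m++) {
            // The first muxer empties the selector; later muxers find nothing
            // (in a real system they would block in select() instead).
            localListSizes[m] = selector.drain().size();
        }
        return localListSizes;
    }
}
```

With five read-ready sockets and three muxers, the first thread-local list holds all five sockets and the other two are empty, which is the situation in which one muxer thread is busy while the others wait.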
Thus, in the example, as shown in
Additional information about simple muxing is disclosed in U.S. patent application Ser. No. 13/109,871, filed May 17, 2011, entitled “SYSTEM AND METHOD FOR PARALLEL MUXING BETWEEN SERVERS IN A CLUSTER”, which application is hereby incorporated by reference.
Parallel Muxing
Using parallel muxing, each selector may be accessed by only one muxer thread. For example, the muxer thread 405a uses the selector 402a, while the muxer thread 405b uses the selector 402b and the muxer thread 405c uses the selector 402c. Each of the worker threads 405a-c polls its own selector 402a-c, and processes the read-ready sockets, in a single-threaded manner. The use of individual selectors 402a-c reduces the arrival rate per selector and, therefore, reduces the contention on the system resources.
Using parallel muxing, the uneven load caused by a single selector can be resolved. However, there may still be a need to distribute work evenly among the different selectors and muxer threads.
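The remaining need for even distribution can be seen in a small sketch. The assignment of sockets to selectors by socket id modulo the number of selectors is purely an assumption made for illustration, and ParallelMuxingSketch and readyPerSelector( ) are hypothetical names. Because each selector is served by exactly one muxer thread, the per-selector count is also the backlog that a single thread must work through alone.

```java
// Hypothetical model of parallel muxing: each selector is owned by exactly one muxer thread.
class ParallelMuxingSketch {
    /**
     * Counts read-ready sockets per selector. The mapping socketId % selectors is an
     * illustrative assumption, not a rule taken from the described system.
     */
    static int[] readyPerSelector(int[] readySocketIds, int selectors) {
        int[] counts = new int[selectors];
        for (int id : readySocketIds) {
            counts[Math.floorMod(id, selectors)]++;
        }
        return counts;
    }
}
```

For the read-ready socket ids 0, 3, 6, 9 and 1 with three selectors, the counts are 4, 1 and 0: one muxer has four sockets to process sequentially while another has none, even though no selector is shared.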
Thus, in the example as shown in
Additional information about parallel muxing is disclosed in U.S. patent application Ser. No. 13/109,871, filed May 17, 2011, entitled “SYSTEM AND METHOD FOR PARALLEL MUXING BETWEEN SERVERS IN A CLUSTER”, which application is hereby incorporated by reference.
Work Sharing Muxing
By joining the blocking queues of the selectors 502a-c into one shared queue 506, the queue processing model avoids requiring the individual worker threads 505a-e to process all read-ready sockets sequentially. The worker threads 505a-e can be activated to enable concurrent processing of read-ready sockets from individual selectors 502a-c. Thus, the shared queue 506 can improve the concurrency properties of the FD caches 501a-c maintained by the OS, and the queue processing model offers reduced end-to-end latency.
Using this queue processing model, the one or many read-ready sockets returned from individual selectors 502a-c, and selectors 502a-c themselves can be shared among multiple worker threads, or muxer threads 505a-c. As shown in
As long as the shared queue 506 is not empty, the muxer threads 505a-e need not be suspended, which helps the muxer achieve high throughput, since the queue processing model can avoid having some threads blocked in Selector.select( ) while other threads have more than one socket to process. Thus, this queue processing model can reduce the time that requests would otherwise spend waiting in a thread-local list.
In accordance with various embodiments of the invention, the number of selectors (SELECTORS) can be less than or equal to the number of muxer threads (MUXERS), or 1<=SELECTORS<=MUXERS. The muxer threads can potentially be blocked in Selector.select( ) at every selector. Thus, up to SELECTORS muxer threads may be blocked in Selector.select( ). Once a muxer thread returns from the selector with a list of read-ready sockets, one or more threads may be ready to take work from the shared queue 506, while some of the muxer threads may be busy reading sockets at the time. The number of threads that are ready to read the read-ready sockets can be up to MUXERS-SELECTORS, which is the difference between the number of MUXERS and the number of SELECTORS.
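The thread counts can be worked through with a short sketch, assuming the constraint 1<=SELECTORS<=MUXERS. MuxerCounts and its methods are hypothetical names introduced only for this illustration.

```java
// Hypothetical helper that works through the thread-count bounds for given MUXERS and SELECTORS.
class MuxerCounts {
    /** Rejects configurations that violate 1 <= SELECTORS <= MUXERS. */
    private static void check(int muxers, int selectors) {
        if (selectors < 1 || selectors > muxers) {
            throw new IllegalArgumentException("require 1 <= SELECTORS <= MUXERS");
        }
    }

    /** At most one muxer thread can be blocked in select() per selector. */
    static int maxBlockedInSelect(int muxers, int selectors) {
        check(muxers, selectors);
        return selectors;
    }

    /** Upper bound on the muxer threads left over to take work from the shared queue. */
    static int maxReadyForSharedQueue(int muxers, int selectors) {
        check(muxers, selectors);
        return muxers - selectors;
    }
}
```

For example, with five muxer threads and three selectors, up to three threads may be blocked in Selector.select( ) and up to two threads remain available to take read-ready sockets from the shared queue.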
When a muxer thread 505a-e is idle, the thread can either be blocked trying to get read-ready sockets from a selector 502a-c, or be blocked trying to get read-ready sockets from the shared blocking queue 506. When one or many read-ready sockets become available, the read-ready sockets and their selectors 502a-c can end up in the shared blocking queue 506 in an order that guarantees system-wide progress.
In accordance with various embodiments of the invention, every worker thread 505d-e that returns from the selector can retain one last read-ready socket. Every worker thread 505a-c that gets unblocked from the shared queue 506 can have a read-ready socket. The worker thread 505a-c can continue to process these sockets (e.g. read a request), and then return to get more read-ready sockets from the shared queue 506. Eventually a selector 502a-c can be taken from the shared queue 506, in which case the worker thread 505a-c can proceed to get more read-ready sockets from that selector 502a-c.
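The worker loop described above can be sketched as follows. This is a simplified model under stated assumptions, not the Muxer class of the described system: WorkSharingSketch, SelectorModel and run( ) are hypothetical names, sockets are represented by integer ids, the stand-in select( ) never blocks, and the shared queue is polled rather than blocked on so that the example terminates once all preloaded sockets have been processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of work sharing muxing: sockets and selectors travel through one shared queue.
class WorkSharingSketch {
    /** Stand-in for a selector; select() drains the ready ids and never blocks. */
    static final class SelectorModel {
        private final Queue<Integer> readReady = new ConcurrentLinkedQueue<>();

        void markReady(int socketId) {
            readReady.add(socketId);
        }

        List<Integer> select() {
            List<Integer> out = new ArrayList<>();
            Integer id;
            while ((id = readReady.poll()) != null) {
                out.add(id);
            }
            return out;
        }
    }

    /** Runs the muxers until every preloaded socket is processed; returns the distinct count. */
    static int run(int selectors, int socketsPerSelector, int muxers) {
        final Queue<Object> jobs = new ConcurrentLinkedQueue<>();
        final int total = selectors * socketsPerSelector;
        for (int s = 0; s < selectors; s++) {
            SelectorModel sel = new SelectorModel();
            for (int i = 0; i < socketsPerSelector; i++) {
                sel.markReady(s * socketsPerSelector + i);
            }
            jobs.add(sel);
        }
        final AtomicInteger processed = new AtomicInteger();
        final Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Runnable muxer = () -> {
            while (processed.get() < total) {
                Object job = jobs.poll();
                if (job == null) {
                    Thread.yield(); // a real muxer would block on the shared queue here
                    continue;
                }
                if (job instanceof SelectorModel) {
                    SelectorModel sel = (SelectorModel) job;
                    List<Integer> ready = sel.select();
                    if (ready.isEmpty()) {
                        jobs.add(sel);
                        continue;
                    }
                    // Share all but the last socket, then re-queue the selector behind them.
                    jobs.addAll(ready.subList(0, ready.size() - 1));
                    jobs.add(sel);
                    job = ready.get(ready.size() - 1); // retain one last socket for this thread
                }
                seen.add((Integer) job); // stands in for reading a request from the socket
                processed.incrementAndGet();
            }
        };
        Thread[] threads = new Thread[muxers];
        for (int i = 0; i < muxers; i++) {
            threads[i] = new Thread(muxer);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }
}
```

In this sketch the thread that polls a selector keeps one socket for itself, so it always has work after returning from select( ), while the other sockets become visible to any idle thread through the shared queue.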
Since the order, in which the read-ready sockets are processed, is based on the selector 502a-c, there is a larger opportunity for having more read-ready sockets in the shared queue 506 using this queue processing model. As a result, the read-ready sockets can be obtained from the selector 502a-c without blocking, and significant response time reductions can be achieved for network-intensive workloads.
Furthermore, the sharing scheme enables the worker threads 505a-e to continuously obtain read-ready sockets from the selectors 502a-c and process them without the need to get suspended or to perform context switches. Thus, this queue processing model can achieve a great degree of concurrency.
Furthermore, in order to keep as many muxer threads busy reading sockets as possible, instead of waiting in Selector, the Muxer class tries to add the selector to the queue as late as possible. The reason is that, if the time from the last poll is longer, then it is more likely that the worker thread can return immediately, thus blocking less. Otherwise, if a worker thread enters the Selector.select too soon, the call would more likely get the worker thread blocked, since the selector list was emptied only a short while ago and the file descriptor cache may not have enough time to get populated again.
In accordance with various embodiments of the invention, the efficiency of the system can be achieved based on efficient concurrent selection of one selector out of many selectors, which enables concurrent selection from multiple small FD caches instead of one large cache. Furthermore, the use of non-blocking sharing of read-ready sockets can eliminate thread starvation. It is beneficial to use a non-blocking bulk add operation for a concurrent queue with a fixed memory footprint (e.g., jobs.offerAll( . . . ) as shown in line 20 of
Thus, using work sharing muxing, the system can guarantee efficient queue progress and allow the sharing of various mutable states, and can eliminate thread starvation during the concurrent processing of read-ready sockets.
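A bulk add operation of the kind mentioned above can be sketched as follows. The standard java.util.Queue interface does not define an offerAll method, so the helper below is a hypothetical illustration built on java.util.concurrent.ArrayBlockingQueue, which has a fixed capacity. The helper never waits for space to become available; it is not lock-free, however, since ArrayBlockingQueue guards its state with a lock, and it is not presented as the queue used by the described system.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical bulk offer on a fixed-capacity queue; it never waits for free space.
class BulkOfferSketch {
    /**
     * Offers the items in order and stops at the first one that does not fit.
     * Returns how many items were accepted; the caller keeps the remainder.
     */
    static <T> int offerAll(ArrayBlockingQueue<T> jobs, List<T> items) {
        int accepted = 0;
        for (T item : items) {
            if (!jobs.offer(item)) { // offer() returns false immediately when the queue is full
                break;
            }
            accepted++;
        }
        return accepted;
    }
}
```

A caller that receives a count smaller than the size of the list can process the remaining sockets itself rather than wait, so a full queue does not suspend the calling thread.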
The present invention may be conveniently implemented using one or more conventional general purpose or specialized digital computers, computing devices, machines, or microprocessors, including one or more processors, memory and/or computer readable storage media programmed according to the teachings of the present disclosure. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.
In some embodiments, the present invention includes a computer program product which is a storage medium or computer readable medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the present invention. The storage medium can include, but is not limited to, any type of disk including floppy disks, optical discs, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.
Number | Date | Country |
---|---|---|
20140215475 A1 | Jul 2014 | US |