Scalable, high-availability network

Abstract
A multiplicity of users is connected to a network, as are m servers. The users are organized into n user groups, each including a plurality of users, such that all the users in a group are part of a common database which permits intercommunication between them. That database is duplicated in a subset of p of the servers, and the subset shares the processing load of the corresponding user group. When a user in the respective user group attempts to communicate with another user, one of the servers in the subset p will accommodate the necessary processing to initiate setup of the connection. At the same time, each server accommodates users in q different groups. Should one of the servers fail, each of the other servers in each subset p accommodating the failing server's users will accommodate the failed server's share of those users. Thus, the processing load of each user group is handled with a redundancy of p (the number of servers in a subset), ensuring a high level of availability.
Description

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing brief description, as well as further objects, features, and advantages of the present invention will be understood more completely from the following detailed description of a presently preferred, but nonetheless illustrative embodiment, with reference being had to the accompanying drawings, in which:



FIG. 1 is a schematic block diagram illustrating a fundamental aspect of the present invention;



FIG. 2 is a schematic block diagram illustrating a network configuration in accordance with a preferred embodiment of the invention; and



FIG. 3 is a schematic block diagram illustrating a network configuration in which 1-for-1 redundancy is provided for each server, as is well-known.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Turning now to the drawings, FIG. 1 is a schematic block diagram illustrating a fundamental aspect of the present invention. A multiplicity of users, U, is connected to a network N or a network conglomeration, such as the Internet, as are m servers. The users U are organized into n user groups, each including a plurality of users, such that all the users in a group are part of a common database which permits intercommunication between them. That database is duplicated in a subset of p of the servers, which share the processing load of the corresponding user group. That is, when a user in the respective user group attempts to communicate with another user, one of the servers in the subset p will accommodate the necessary processing. At the same time, each server accommodates users from q different groups. Should one of the servers fail, each of the other servers in each subset p accommodating the failed server's users will accommodate the failed server's share of those users. Thus, the processing load of each user group is handled with a redundancy of p (the number of servers in a subset), assuring a high level of availability.


A preferred embodiment of the previously described network configuration is shown in FIG. 2. Illustrated are the communication links between a plurality of phone groups and a plurality of SIP servers. Although the phone groups are shown as communicating directly with the servers, it will be understood that these communications may actually be through a network. In this embodiment n=6, so six phone groups (user groups) P1 through P6 are illustrated, as an example. Each phone group may be a collection of phones or the users who are supported by the same SIP Server or IP-PBX. In practice, this often means the phones in the same office served by the same IP-PBX in that office. Also, m=6, so there are six SIP Servers, S1 through S6, each of which accommodates two phone groups (q=2) in the network. The dashed lines between the phone groups and servers indicate which SIP Servers accommodate the phones in each Phone Group and to which those phones are to register. For instance, all the phones in Phone Group P1 register with SIP Servers S1 and S2; phones in Group P2 with Servers S2 and S3; phones in Group P3 with Servers S3 and S4; and so on. At the bottom of the group assignment, the connection wraps back to the top, so that phones in Group P6 register with Servers S6 and S1.


To be precise, this assignment diagram is generated by the following mathematical algorithm:

    • Given n Phone Groups, labeled 1 to n, and also n SIP Servers, labeled 1 to n, each Phone Group i (i=1 to n) is assigned to two different SIP Servers j1 and j2 according to the following rule:





$$j_1 = i$$

$$j_2 = j_1 + 1 \pmod{n}$$


This connection pattern is commonly known as a shuffle. The example of FIG. 2 corresponds to the case of n=6. The discussion continues using this example as an illustration.
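The assignment rule can be sketched in a few lines of Python. This is only an illustration: the function name `assign_servers` is hypothetical, and the sketch assumes that the modulo operation is read onto the labels 1 to n, so that Phone Group n wraps around to SIP Server 1.

```python
def assign_servers(n):
    """Return {group: (j1, j2)} for Phone Groups and SIP Servers labeled 1..n."""
    if n < 3:
        # Assumption: the construction is stated for n >= 3.
        raise ValueError("the construction assumes n >= 3")
    assignment = {}
    for i in range(1, n + 1):
        j1 = i
        # j1 + 1 reduced modulo n onto the labels 1..n, so group n wraps to server 1.
        j2 = j1 % n + 1
        assignment[i] = (j1, j2)
    return assignment


if __name__ == "__main__":
    # The n=6 case corresponds to the example of FIG. 2.
    for group, (j1, j2) in assign_servers(6).items():
        print(f"P{group} -> S{j1}, S{j2}")
```

For n=6 the sketch lists P1 with S1 and S2, P2 with S2 and S3, and so on, with P6 wrapping back to S6 and S1.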


Referring to FIG. 2, it can be assumed that for the traffic load generated in each Phone Group, a fraction α is sent to the server connected in the horizontal direction, and the remainder (1−α) is sent in the diagonal direction as shown in the diagram, where 0≤α≤1. This same rule of traffic assignment is applied to all the Phone Groups for symmetry of load balancing.


Under normal operating conditions, the six servers are all active in a load sharing mode. When one server fails, the traffic originally accommodated by the failed server is redirected for service to the two other servers that accommodate the same users. For example, if SIP Server S2 failed, all traffic from Phone Group P1 would be served by SIP Server S1, and all traffic from Phone Group P2 by SIP Server S3. It can be shown mathematically that for achieving the best load balancing condition given identical Phone Group traffic characteristics, the value of α should be 0.5. In other words, traffic from each Phone Group should be split equally between its two servers under normal conditions.
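The effect of the split on post-failure load can be explored numerically. The following Python sketch is a simplified model, not the patent's derivation: it assumes every Phone Group offers exactly one unit of traffic, and the function names `server_loads` and `worst_load_after_failure` are hypothetical.

```python
def server_loads(n, alpha, failed=None):
    """Load on each working server, with every Phone Group offering one unit.

    Group i sends alpha to server i (horizontal) and 1 - alpha to server
    i + 1, wrapping to server 1 (diagonal). If one of a group's two servers
    has failed, the group's whole unit goes to the surviving server.
    """
    loads = {s: 0.0 for s in range(1, n + 1) if s != failed}
    for i in range(1, n + 1):
        j1, j2 = i, i % n + 1
        if j1 == failed:
            loads[j2] += 1.0
        elif j2 == failed:
            loads[j1] += 1.0
        else:
            loads[j1] += alpha
            loads[j2] += 1.0 - alpha
    return loads


def worst_load_after_failure(n, alpha, failed=2):
    """Largest load carried by any surviving server after one failure."""
    return max(server_loads(n, alpha, failed).values())


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, worst_load_after_failure(6, alpha))
```

Under this model, when S2 fails S1 carries 2−α units and S3 carries 1+α units, so the larger of the two is smallest at α=0.5, where each neighbor carries 1.5 units.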


It should be noted that the traffic generated from each Phone Group to the assigned SIP Servers pertains only to the signaling messages. Accordingly, there are multiple ways to implement the intended effect of equally splitting the traffic between two SIP Servers, including assignment of successive session initiation requests randomly to the two servers or toggling the requests between the servers.
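Both splitting strategies can be sketched as simple server selectors. This Python sketch is illustrative only: the names `toggling_selector` and `random_selector` are hypothetical, the server labels are plain strings, and no actual SIP signaling is performed.

```python
import itertools
import random


def toggling_selector(servers):
    """Yield the group's servers in strict alternation, one per request."""
    return itertools.cycle(servers)


def random_selector(servers, rng=None):
    """Yield one of the group's servers, chosen uniformly at random per request."""
    rng = rng or random.Random()
    while True:
        yield rng.choice(servers)


if __name__ == "__main__":
    toggle = toggling_selector(("S1", "S2"))
    print([next(toggle) for _ in range(6)])

    rand = random_selector(("S1", "S2"), random.Random(0))
    picks = [next(rand) for _ in range(1000)]
    print(picks.count("S1"), picks.count("S2"))
```

Toggling gives an exact 50/50 split over any even number of requests, while random assignment approaches an even split only on average.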


From the description so far, those skilled in the art will appreciate that the cost of the disclosed structure (using n servers) is less than a 1-for-n arrangement (using one server to protect n servers, which results in a total of n+1 servers). As for the memory requirements, the disclosed structure requires that each server provide sufficient memory to maintain a database of two Phone Groups, totally independent of how large n may become. Thus, this structure is scalable, and a remaining issue is whether the design achieves High Availability.


Availability of a system like a telephone switch is typically expressed as a fractional number. For example, a digital telephone switch for the Public Switched Telephone Network is often cited as highly reliable with an availability of 0.99999 (the so-called “five nines” standard). Using a simple calculation:





Average Downtime per Year=(365×24×60)(1−A)


where A is the availability number, the five-nine availability standard translates to only 5.3 minutes of average downtime per year.
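The downtime formula is easy to evaluate for other availability figures. A minimal Python sketch follows; the function name `average_downtime_minutes` is hypothetical, and a 365-day year is assumed, as in the formula above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def average_downtime_minutes(availability):
    """Average downtime per year, in minutes, for availability A in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie between 0 and 1")
    return MINUTES_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    for a in (0.99, 0.999, 0.9999, 0.99999):
        print(f"A={a}: {average_downtime_minutes(a):.1f} minutes per year")
```

For A=0.99999 the result is about 5.26 minutes, which rounds to the 5.3 minutes quoted above.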


This meaning of availability for a single server is reasonably clear. However, for a network of servers as shown in FIG. 2, the situation is not so clear. In order to avoid ambiguity, a stringent definition is adopted that if any one Phone Group loses service, the entire network is considered to be down. For example, if SIP Servers S1 and S2 have failed, then Phone Group P1 is out of service, and the network is declared down, regardless of whether the other Phone Groups have service or not. With this definition, the availability of the network using the proposed HA configuration of FIG. 2 will be compared to a conventional design of 1-for-1 redundancy for each server.
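The stringent definition can be expressed as a short check over the set of failed servers. The Python sketch below assumes the FIG. 2 assignment, in which Phone Group i registers with SIP Servers i and i+1 (wrapping to 1); the function name `network_is_up` is hypothetical.

```python
def network_is_up(n, failed_servers):
    """True if every Phone Group still has at least one working SIP Server.

    Assumes Phone Group i is served by servers i and i + 1, wrapping to 1.
    The network counts as down as soon as any one group loses both servers.
    """
    failed = set(failed_servers)
    for i in range(1, n + 1):
        j1, j2 = i, i % n + 1
        if j1 in failed and j2 in failed:
            return False
    return True


if __name__ == "__main__":
    print(network_is_up(6, {1, 2}))  # S1 and S2 both serve P1
    print(network_is_up(6, {1, 3}))  # no group depends on both S1 and S3
```

Under this definition the network survives any set of failures in which no two failed servers are adjacent in the assignment, including the wrap-around pair S6 and S1.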


As is shown in the accompanying Appendix, the disclosed structure has approximately the same availability (or reliability) as the 1-for-1 redundancy arrangement, using the above, stringent definition that no phone group is allowed to be out of service. This is remarkable considering that the disclosed structure is half the cost of the 1-for-1 scheme.


In summary, a preferred method and highly efficient structure have been disclosed for providing High Availability (HA) for a SIP-based VoIP network consisting of n (n≥3) communication servers, called SIP Servers, providing service to n groups of users (or n Phone Groups). In this context, a stringent definition of HA is used: all Phone Groups must receive service, and a service delivery failure to any one group places the entire network in “down” status. In the disclosed HA construction, some key networking characteristics are as follows:

    • The entire network has n SIP servers providing service to n Phone Groups.
    • Each Phone Group is assigned for service by two distinct SIP Servers.
    • For any two Phone Groups, one of the following two conditions must apply:
      • (a) the two Phone Groups are served by 4 distinct SIP Servers, or
      • (b) the two Phone Groups are served by 3 distinct SIP Servers, that is, they are served by one common server.
    • Each SIP Server has to serve at most two Phone Groups, and it continuously maintains the relevant registration information for users (or phones) in these two groups.
    • For every Phone Group, the phones in the group need to maintain their registration continuously with two distinct SIP Servers, and in case of a SIP Server failure, the phones must have the ability to switch service to the other working SIP Server, either automatically or manually.
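The structural characteristics listed above can be checked mechanically for a given n. The Python sketch below assumes the FIG. 2 assignment (Phone Group i served by SIP Servers i and i+1, wrapping to 1); the function names are hypothetical.

```python
from itertools import combinations


def ring_assignment(n):
    """Map each Phone Group i to the set of its two SIP Servers."""
    return {i: {i, i % n + 1} for i in range(1, n + 1)}


def distinct_server_counts(n):
    """Numbers of distinct servers seen across all pairs of Phone Groups."""
    groups = ring_assignment(n)
    return {len(groups[a] | groups[b]) for a, b in combinations(groups, 2)}


def groups_per_server(n):
    """Number of Phone Groups each SIP Server has to serve."""
    counts = {s: 0 for s in range(1, n + 1)}
    for servers in ring_assignment(n).values():
        for s in servers:
            counts[s] += 1
    return counts


if __name__ == "__main__":
    print(distinct_server_counts(6))  # pairs share one server or none
    print(groups_per_server(6))       # every server serves two groups
```

For n=6 every pair of Phone Groups is served by either 3 or 4 distinct servers. With n=2 the same rule would place both groups on the same two servers, which is consistent with the requirement that n≥3.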


The advantages of the aforementioned HA construction are significant compared to other alternatives in the state of the art:
    • The proposed design requires only n SIP Servers to support n Phone Groups, versus 2n servers to do the same in a conventional 1-for-1 fully redundant arrangement. In broad terms, this amounts to a 50% cost reduction.
    • In the proposed design, each SIP Server is only required to have sufficient processing power and memory space to support at most two Phone Groups, completely independent of the size of the network n and thus making the design scalable to arbitrarily large n.
    • In spite of the equipment efficiency cited above, there is no compromise in the reliability achieved in the proposed design. In other words, the reliability or availability achieved in this design is comparable to that of the conventional 1-for-1 fully redundant arrangement for practical applications of interest.


Although a preferred embodiment of the invention has been disclosed for illustrative purposes, those skilled in the art will appreciate that many additions, modifications and substitutions are possible without departing from the scope and spirit of the invention as defined by the accompanying claims.


Appendix: Availability Calculation

It is very complex to calculate an exact availability for the disclosed HA construction shown in FIG. 1. Instead, we will try to evaluate a performance bound and compare it to the reliability of the conventional 1-for-1 redundancy scheme. The conventional 1-for-1 redundancy construction is illustrated in FIG. 3.


With respect to FIG. 3, let A denote the availability for each server (or SIP Server). The availability A2 for a pair of redundant servers serving a particular user group (Phone Group) is given by:






$$A_2 = 1 - (1 - A)^2 = A(2 - A) \qquad (1)$$


which is the availability of each user group in FIG. 3. The total network availability AR for the 1-for-1 scheme in FIG. 3 is given by:






$$A_R = (A_2)^n = A^n (2 - A)^n \qquad (2)$$


Since it is required that no user group is allowed to be down, the total availability is equivalent to the probability that all n groups are available.

It is very difficult to calculate the exact availability for FIG. 2. But those skilled in the art will appreciate that the following is a bound:






$$A_v\{\text{Total HA network}\} > A_{v3} + A_{v1} + A_{v2} \qquad (3)$$


where Av denotes the total availability, Av3 denotes the availability when all servers are working, Av1 denotes the availability when one server has failed, and Av2 denotes the availability when two servers have failed which are not adjacent. In other words, Av is greater than the sum of the probabilities corresponding to those conditions in which the system would not be considered to have failed. This is a lower bound on Av:






$$A_v > A^n + n(1 - A)A^{n-1} + \left({}_{n}C_{2} - n\right)(1 - A)^2 A^{n-2} \qquad (4)$$


where nC2 denotes n(n−1)/2. The values of equations (2) and (4) are computed in the following table for comparison.
















No. of User    Availability of    Network Availability      Network Availability Bound
Groups (n)     Each Server (A)    for the 1-for-1 Scheme    for the HA Design (Av)

4              0.99               0.99960006                0.99999960
4              0.999              0.99999600                0.99999999
6              0.99               0.99940015                0.99998044
6              0.999              0.99999400                0.99999998
8              0.99               0.99920027                0.99994606
8              0.999              0.99999200                0.99999994
10             0.99               0.99900044                0.99988615
10             0.999              0.99999000                0.99999988










In this table of values of practical interest, it can be seen that the performance of the proposed design is at least as good as the 1-for-1 scheme.
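Equations (2) and (4) can also be evaluated directly for other values of n and A. The Python sketch below transcribes the two formulas as written; the function names are hypothetical, and server failures are assumed to be independent. A literal evaluation of equation (4) need not agree digit for digit with the tabulated bound, so the output is best treated as an independent check rather than a reproduction of the table.

```python
from math import comb


def availability_one_for_one(n, a):
    """Equation (2): n redundant pairs, each with availability a(2 - a)."""
    return (a * (2.0 - a)) ** n


def availability_bound_ha(n, a):
    """Equation (4): lower bound from zero, one, or two non-adjacent failures."""
    u = 1.0 - a  # unavailability of a single server
    return (
        a ** n
        + n * u * a ** (n - 1)
        + (comb(n, 2) - n) * u ** 2 * a ** (n - 2)
    )


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        for a in (0.99, 0.999):
            print(
                n,
                a,
                f"{availability_one_for_one(n, a):.8f}",
                f"{availability_bound_ha(n, a):.8f}",
            )
```

The term `comb(n, 2) - n` counts the pairs of servers that are not adjacent in the assignment, which are the only two-server failures that leave every Phone Group in service.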

Claims
  • 1. In a network with a multiplicity of users and a plurality of supervisory servers, a method for providing high availability, comprising the steps of: organizing the users into n user groups, each including a plurality of users, such that all the users in a group are part of a common database which permits intercommunication between them; duplicating the database of a user group in a subset of p of the servers, which share the processing load of the corresponding user group, with each server accommodating users in q different groups; upon failure of a server, causing other servers accommodating the failing server's users to accommodate the failed server's share of those users; whereby the processing load of each user group is handled with a redundancy of p, improving the level of network availability.
  • 2. The method of claim 1 wherein the network is a voice over Internet protocol (VoIP) network utilizing the SIP standard and the supervisory servers are SIP servers.
  • 3. The method of claim 2 wherein p=2 and q=2.
  • 4. A network with a multiplicity of users and a plurality of supervisory servers, comprising: a program module executable by a computer and stored therein to maintain the organization of the users into n user groups, each including a plurality of users, such that all the users in a group are part of a common database which permits intercommunication between them; storage media maintaining a copy of the database of a user group for a subset of p of the servers, which servers are to share the processing load of the corresponding user group, with each server accommodating q users in different user groups; a control module responsive to the failure of a server, causing other servers accommodating the failing server's q users to accommodate the failed server's share of those users.
  • 5. The network of claim 4 wherein the network is a voice over Internet protocol (VoIP) network utilizing the SIP standard and the supervisory servers are SIP servers.
  • 6. The network of claim 5 wherein p=2 and q=2.
  • 7. In a network with a multiplicity of users and a plurality of supervisory servers, a control subsystem comprising: a first program module executable by a computer and stored therein to maintain the organization of the users into n user groups, each including a plurality of users, such that all the users in a group are part of a common database which permits intercommunication between them; a second program module executable by a computer and stored therein causing storage media to maintain a copy of the database of a user group for a subset of p of the servers, which servers are to share the processing load of the corresponding user group, with each server accommodating q users in different user groups; a control program module responsive to the failure of a server, causing other servers accommodating the failed server's q users to accommodate the failed server's share of those users.
  • 8. The control subsystem of claim 7 wherein the network is a voice over Internet protocol (VoIP) network utilizing the SIP standard and the supervisory servers are SIP servers.
  • 9. The control subsystem of claim 8 wherein p=2 and q=2.
  • 10. An executable computer program for use with a network with a multiplicity of users and a plurality of supervisory servers, the computer program being stored in a computer readable medium and comprising: a first executable program module maintaining the organization of the users into n user groups, each including a plurality of users, such that all the users in a group are part of a common database which permits intercommunication between them; a second executable program module causing storage media to maintain a copy of the database of a user group for a subset of p of the servers, which servers are to share the processing load of the corresponding user group, with each server accommodating q users in different user groups; a third executable program module responsive to the failure of a server, causing other servers accommodating the failed server's q users to accommodate the failed server's share of those users.
  • 11. The computer program of claim 10 wherein the network is a voice over Internet protocol (VoIP) network utilizing the SIP standard and the supervisory servers are SIP servers.
  • 12. The computer program of claim 11 wherein p=2 and q=2.