The mean time between failures (MTBF) is the average time between component software or equipment failures that result in partial loss of system capacity.
The mean time between outages (MTBO) is the average time between component failures that result in loss of system continuity or unacceptable capacity, performance, or reliability degradation.
This background information is provided to reveal information believed by the applicant to be of possible relevance. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art.
The disclosed subject matter illustrates an approach to topology configuration and optimization, which may address geo-redundancy issues, such as how many sites, and how many servers per site, are required to meet performance and reliability requirements. The disclosed multi-dimensional component failure mode reference model may be reduced to a one-dimensional service outage mode reference model. In the topology configuration approach, a novel adaptation of the hyper-geometric “balls in urns” distribution with unequally likely combinations may be used.
In an example, an apparatus may include a processor and a memory coupled with the processor that effectuates operations. The operations may include receiving a number of geographically diverse sites for a service; receiving a minimum availability of the service; based on the number of geographically diverse sites and the minimum availability, determining a probability that the service is up (PUP), a mean time between service outages (F), and a mean restoral time (R); and sending an alert that includes PUP, F, and R.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to limitations that solve any or all disadvantages noted in any part of this disclosure.
Reference will now be made to the accompanying drawings, which are not necessarily drawn to scale.
Most availability analyses start by characterizing a failure mode reference model that captures the underlying hardware (HW) and software (SW) components that constitute an application deployment. From a performance:reliability:cost optimization perspective, these models are typically used to determine the minimal topology required to meet the distributed application capacity and availability requirements.
Herein, a simple 2-tiered reference model is used that includes servers and sites to illustrate an approach to topology configuration, with consideration of common geo-redundancy questions like how many sites, and how many servers per site, are required to meet a set of capacity, performance, and reliability requirements. Even for this simple 2-tiered reference model, the number of states grows exponentially. The techniques demonstrate how to reduce the state space of more complex models and how to collapse a state transition diagram into a one-dimensional representation in terms of the amount of available server capacity, where transitions can occur across multiple levels.
At the service level, application outages may matter more than individual failures. Thus, a generic outage mode reference model may be created based on the one-dimensional representation of the failure model. The exact derivation of the outage and restoral rates from the superset of ‘available’ states to the superset of ‘unavailable’ states as a function of the number of servers, the number of sites, and the minimum required server capacity level is disclosed.
Although there have been several attempts to analyze and optimize the availability of redundant, distributed topologies, especially in the context of storage systems and virtualized applications, conventional methods have not derived the exact general formula for the composite service outage and restoral rates, based on the hyper-geometric distribution with unequally likely combinations.
Typical availability analyses start by characterizing a failure mode reference model that captures the underlying HW elements and SW components that constitute the application deployment. These failure models can vary widely in their level of detail, from simple block diagrams to sophisticated failure trees. In most practical cases this detail can be aggregated to reduce the model complexity to a one-dimensional state space without losing the underlying individual component failure and restoral rates, dependencies, or interactions.
Typical reliability optimization questions that these models need to address include the “how many eggs in one basket” type: How many application processes can run on a single host? How many host servers in one rack? How many racks in one datacenter site? How many sites per region? For the analysis to follow, a simple 2-tiered reference model including servers and sites is assumed, and the focus herein is the common geo-redundancy questions: how many sites, and how many servers per site? Generalization to more than two tiers is contemplated.
Let M denote the number of geographically diverse sites (e.g., datacenters) and let N denote the number of hosts (servers). For simplicity, assume that N is an integer multiple of M, and that N identical hosts are spread evenly across M identical sites. Let J=N/M denote the number of hosts per site. Hosts and sites are the HW elements.
For the purposes of this analysis, geographic diversity of sites means that there is no single point of failure that can cause the failure of multiple sites simultaneously. As examples, sites are not geographically diverse if they are located in the same physical building, or share a common HVAC cooling system, or share the same commercial power source at any point along the distribution including point of generation, or share the same transmission links at any point along the data path, etc. Other factors that could be considered include shared natural disaster zones (earthquake fault lines, wildfire regions, storm and flood zones, etc.).
A single identical application instance may be running on each host, and the set of J instances at each site make up the resident application function. Instances and resident functions are the SW elements. Assume that hosts and their associated instances are tightly coupled (that is, if a host is down its associated instance is unavailable, and vice versa). Similarly, assume that sites and their resident function (set of J instances) are tightly coupled (that is, if a site is down its resident function is unavailable, and vice versa). Let K denote the minimum number of instances required for service to be up (e.g., to have adequate capacity to serve the workload with acceptable performance and reliability).
Next, let {λI−1, λF−1, λH−1, λS−1} denote the mean times between failures (MTBFs) and let {μI−1, μF−1, μH−1, μS−1} denote the mean times to restore (MTTRs) of the {instance SW, function SW, host HW, and site HW}, respectively. Then the typical failure modes and associated effects (e.g., capacity impacts) for this canonical reference model are given in Table 1. Table 1 also includes, in brackets [ ], the default MTBF and MTTR values used for the simple example described in more detail herein.
A typical SW fault impacting a single instance may be a memory leak or buffer overflow that leads to an application restart. A typical (less frequent) fault impacting an entire resident function may be the corruption of shared local data, or a latent bug in a code branch that, once triggered by one instance, cascades to the other instances when the transaction retries to execute the same code segment. A typical HW failure impacting a single host may be a fan failure, while a typical failure impacting an entire site may be a transfer switch failure following a commercial power outage.
Subsequent to the failure mode reference model development, a state space transition diagram may be developed and the transition probabilities solved for. In order to make the analysis tractable, the failure and restoral rates are assumed to be exponentially distributed, and the associated stochastic process is assumed to form a Markov chain (MC). A first step in this approach is to characterize states in terms of the amount of available capacity. To illustrate for this simple reference model, let the M-tuple (j1, . . . , jm, . . . , jM) denote the number of instances up at each site m=1, . . . , M, where 0≤jm≤J. There are (J+1)M total states. A ‘level’ in the state diagram may include all states with n total instances up, where Σm=1M(jm)=n for every state on level n (0≤n≤N). For all levels where n≥K, the service is up; otherwise, service is down.
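The state characterization above can be sketched in a few lines of Python. This is a hypothetical helper, not part of the disclosure; the function and variable names are illustrative only:

```python
from itertools import product

def enumerate_states(M, J):
    """All M-tuples (j1, ..., jM) with 0 <= jm <= J; (J+1)^M states in total."""
    return list(product(range(J + 1), repeat=M))

def level(state):
    """Level n = total number of instances up across all sites."""
    return sum(state)

def service_up(state, K):
    """Service is up when at least K instances are up (n >= K)."""
    return level(state) >= K

# Example consistent with the text: M = 2 sites, J = 3 hosts per site, K = 3.
states = enumerate_states(2, 3)
print(len(states))                                  # (J+1)^M = 16 states
print(sum(1 for s in states if service_up(s, 3)))   # 10 states have n >= 3
print(level((3, 3)))                                # the 'all up' state is level 6
```

Even this tiny example shows the exponential growth: moving to M=3 sites with J=3 hosts per site already yields 4^3 = 64 states.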
Next, the state transitions may be specified. In this simple reference model, events can result in 1-level transitions in the case of host/instance failure and restoral, or up to J-level transitions in the case of site/function failure and restoral. Finally, enumerate and solve the resulting balance equations to determine the state probabilities. Unfortunately, the state diagram becomes unwieldy very quickly as N and M grow, and the balance equations become virtually impossible to solve by hand to obtain explicit equations for the state probabilities. Commercial packages and statistical languages, such as MATLAB, provide efficient and stable algorithms for finding the eigenvalues of a matrix, and optimized library routines such as Eigen and Armadillo have been written to embed the code directly in various languages.
As a prelude to the outage mode reference model presented later, looking closely, a service outage can occur from any state other than the level 6 ‘all up’ state (3,3). In general, an outage can occur from any level n state where n−J&lt;K. The exact derivation of the composite outage and restoral rates between the superset of available (‘up’) states and the superset of unavailable (‘down’) states as a function of the input parameters N, M, and K may be based on an adaptation of the hyper-geometric “balls in urns” distribution with unequally likely combinations. Knowing these rates is critical when sizing deployments for services with stringent (e.g., FCC reportable) outage and restoral requirements.
The second step in advancing the state space modeling is collapsing the failure modes; that is, combining all (HW and SW) failure and restoral rates causing single instance as well as single site transitions. To this end, let {AI, AF, AH, AS} denote the availability
and let {ρI, ρF, ρH, ρS} denote the utilization (ρ=λ/μ) of the {instance SW, function SW, host HW, and site HW}, respectively. First, combine the failure rates, loads, and availabilities. Let
λN≡host(HW+SW) failure rate=λI+λH
λM≡site(HW+SW) failure rate=λF+λS (1)
ρN≡host(HW+SW) failure load=ρI+ρH
ρM≡site(HW+SW) failure load=ρF+ρS (2)
AN≡host(HW+SW) availability=AIAH
AM≡site(HW+SW) availability=AFAS (3)
Now considering composite restoral rates, let
μN≡host(HW+SW) restoral rate
μM≡site(HW+SW) restoral rate (4)
The mathematical approach to collapsing restoral rates depends on the particular failure mode interactions and dependencies.
Model 1: μN=λN/ρN=λN/(ρH+ρI)
μM=λM/ρM=λM/(ρS+ρF). (5)
Next, Model 2 is most appropriate if all failure activity continues when failures occur (that is, element failures and restorals are independent). In this case, it can be shown that
Model 2: μN=λN/(ρN+ρIρH)=λN/(ρH+(1+ρH)ρI)
μM=λM/(ρM+ρFρS)=λM/(ρS+(1+ρS)ρF). (6)
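The composite rates in (1)-(2) and the Model 1 and Model 2 restoral rates in (5)-(6) are straightforward to compute. The sketch below implements only those published formulas for the host side; the function names are hypothetical:

```python
def composite_host_rates(lam_I, lam_H, mu_I, mu_H):
    """Combine instance SW and host HW into the composite host failure rate
    and failure load per equations (1)-(2): lam_N = lam_I + lam_H and
    rho_N = rho_I + rho_H, where rho = lambda / mu."""
    rho_I, rho_H = lam_I / mu_I, lam_H / mu_H
    lam_N = lam_I + lam_H
    rho_N = rho_I + rho_H
    return lam_N, rho_N, rho_I, rho_H

def mu_model1(lam_N, rho_I, rho_H):
    """Model 1 (eq. 5): mu_N = lam_N / (rho_H + rho_I)."""
    return lam_N / (rho_H + rho_I)

def mu_model2(lam_N, rho_I, rho_H):
    """Model 2 (eq. 6): mu_N = lam_N / (rho_H + (1 + rho_H) * rho_I)."""
    return lam_N / (rho_H + (1 + rho_H) * rho_I)
```

Since the Model 2 denominator adds the cross-term (1+ρH)ρI−ρI = ρHρI, Model 2 always yields a slightly lower composite restoral rate than Model 1.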
Model 3 is most appropriate if all failure activity stops when a select failure occurs (host failure for μN or site failure for μM). In this case, it can be shown that
Finally, Model 4 is most appropriate if all failure activity stops when a select failure occurs, and restoral activity is sequential (e.g., host then instance for μN, or site then resident function for μM). In this case, it can be shown that
Each model is suitable for different reliability scenarios. The simplicity of Model 1, for instance, makes it a good choice when combining many failure modes (e.g., internal components of a server). Model 2 works well if all element failures and replacements are independent (e.g., PC peripheral devices). Model 3 and Model 4 are most suitable if failure modes are hierarchical (e.g., user session controlled by application SW running on server HW). Model 4 is most appropriate for our reference failure model, since the instance (or function) sits on top of the underlying host (or site) HW, and recovery involves replacing the HW and restarting the SW in sequence.
While these example state space aggregation models are exact in terms of the mean restoral rate, the resulting model may no longer form a MC. For tractability of analysis, the aggregate restoral rates are still assumed to be exponentially distributed, and the resulting collapsed model is still assumed to form a MC.
Additional complexities can be incorporated without complicating the analysis. For example, an important implication of network function virtualization (NFV) is the increased importance and added difficulty of fault detection and test coverage. Separating SW from HW (with possibly different vendors for each) creates additional reliability requirements enforcement challenges, such as how to ensure that different vendors have robust defect instrumentation and detection mechanisms if failures lie within the interaction between SW and HW, and how to ensure that test coverage is adequate. From an analysis standpoint, detection and coverage may be included. Let Cx denote the coverage factors and let vx−1 denote the uncovered MTTRs (including detection time) for x∈{I, F, H, S}. Then replace μx by μx′=Cxμx+(1−Cx)vx.
As another example, consider scheduled maintenance. Single instance or host maintenance may be rolling application or firmware upgrades. Resident function or site maintenance may be shared database upgrades or power backup testing. Let δx denote the maintenance rates, let γx−1 denote the maintenance MTTRs, and let πx=δx/γx denote the maintenance load for x∈{I, F, H, S}. Then we can replace λx by λx′=λx+δx, ρx by ρx′=ρx+πx, and μx by μx′=λx′/ρx′.
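The coverage and maintenance substitutions above are purely mechanical. A minimal sketch (hypothetical helper names), implementing exactly the replacements stated in the text:

```python
def adjust_for_coverage(mu, v, C):
    """Replace mu by mu' = C*mu + (1 - C)*v, where C is the coverage factor
    and 1/v is the uncovered MTTR (including detection time)."""
    return C * mu + (1 - C) * v

def adjust_for_maintenance(lam, mu, delta, gamma):
    """Fold scheduled maintenance into the rates: lam' = lam + delta,
    rho' = rho + pi (with pi = delta / gamma), and mu' = lam' / rho'."""
    rho, pi = lam / mu, delta / gamma
    lam_p = lam + delta
    rho_p = rho + pi
    return lam_p, rho_p, lam_p / rho_p
```

With full coverage (C=1) or zero maintenance (δ=0), the adjusted rates reduce to the original ones, which is a useful sanity check.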
The next step in refining our state space representation is to collapse the failure levels by combining all states with the same number of available instances (capacity levels) and consolidating capacity level transition rates.
As stated, there may be an exact derivation of these composite transition rates, and in particular, the outage rate from the superset of ‘up’ states to the superset of ‘down’ states as a function of N, M, and K.
At the service level, application outages usually matter more than individual failures, hence the need for a generic outage mode reference model (based on the failure modes). To this end, let n∈[0, N] denote the number of instances up, and let m∈[0, M] denote the number of sites up. Next, let Pn denote the probability that n instances are up (0≤n≤N), let PUP denote the probability that ≥K instances are up (e.g., adequate capacity), and let PDN=1−PUP denote the probability that &lt;K instances are up (e.g., service outage). Finally, let F≡λD−1 denote the mean time between service outages and let R≡μU−1 denote the mean time to restore service following an outage.
Then the capacity level state probabilities Pn are given by
where ┌x┐ in (9) denotes the smallest integer≥x.
The probability that the service is up PUP and the ratio F/R are given by
In preparation for the analysis to follow, decompose Pn as
Note that F/R is expressed as a ratio in (10); thus, what remains is to determine λD (the transition rate from the ‘up’ super-state to the ‘down’ super-state).
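The bookkeeping around PUP and the F/R ratio reduces to a few lines once the capacity-level probabilities Pn are known. A sketch, assuming Pn has already been computed (the helper names are illustrative):

```python
def p_up(P, K):
    """P_UP = probability that at least K instances are up, given the
    capacity-level probabilities P[n] for n = 0..N."""
    return sum(P[n] for n in range(K, len(P)))

def mean_restoral_time(F, P_UP):
    """From F/R = P_UP / P_DN: mean time to restore R = F * (1 - P_UP) / P_UP."""
    return F * (1 - P_UP) / P_UP
```

For example, with toy probabilities P = [0.05, 0.05, 0.2, 0.7] and K = 2, p_up gives 0.9, and an MTBO of F = 9 hours implies R = 1 hour.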
Mathematical structure around the solution is provided below. λD is given by
where m*(n) in (12) is the number of sites out of m with &gt;n−K instances up. The quantities Pn|m[m*(n)]PM(m) inside the inner sum are the (weighted) combinations of ways to distribute n instances across m sites. The inner sum is across all numbers of up sites m that are possible,
and the outer sum is across all levels n from which a transition from n to DN due to a site failure is possible.
The solution is a specialized “balls in urns” problem involving the hyper-geometric distribution. There are N balls (instances) distributed in M urns (sites) with exactly J balls in each urn. Of the population of N balls, n are UP balls and N−n are DN balls. For M=2, there are
ways of distributing J instances into site 1 such that i instances are UP and J−i instances are DN (with the remaining instances in site 2). For M=3, there are
ways of distributing i UP instances into site 1, j UP instances into site 2, and n−i−j UP instances into site 3. For general M, there are
ways of distributing n UP instances into M sites.
For simplicity, consider the case of M=2 sites. It would seem that
The sum of indicator functions [Ii>n−K+Ii<K] in (13) is the number of sites with enough UP instances to cause an outage if the site fails.
The problem with the proposed solution in (13) is that the
combinations are not all equally likely. It is true that if all sites are up, then all DN instances must be due to individual failures, thus all combinations are equally likely (and if n>(M−1)J, then all sites are up). And it is true that all combinations where every site has >0 UP instances are equally likely. However, combinations with 0 UP instances in a site could be due to J individual DN instances or 1 DN site. Hence, we need to break Pn apart and condition on m; that is, Pn=Σm=┌n/J┐M Pn|mPM(m).
To illustrate,
distributions of n UP instances into 2 sites, and
distributions of i UP instances to site 1 and n−i UP instances to site 2, where n−3≤i≤3.
As can be seen, for n=5 (left) there are 6 distributions of 5 UP instances to 2 sites (e.g., 3 with 2 in site 1 and 3 with 3 in site 1). Since both sites have UP instances, both sites are up. Since n=5>J=3, only site failures (not individual instance failures) can result in an outage. Since combinations are the result of a single instance failure, all combinations are equally likely. Finally, [Ii>2+Ii<3]=1 for all combinations.
For n=4 (center), there are 15 equally likely distributions of 4 UP instances (3 with 1 in site 1, 9 with 2 in site 1, and 3 with 3 in site 1). The main difference is that for the 9 combinations with 2 in site 1 (and 2 in site 2), [Ii>1+Ii<3]=2 (e.g., failure of either site results in an outage). For the remaining 6 combinations, [Ii>1+Ii<3]=1.
For n=3 (right), things get more interesting and the flaw in the ‘equally likely’ assumption is exposed. There are 20 distributions of 3 UP instances in 2 sites (1 with 0 in site 1, 9 with 1 in site 1, 9 with 2 in site 1, and 1 with 3 in site 1). The 18 combinations with 1 or 2 UP instances in site 1 (and vice versa in site 2) are the result of single instance failures, and all 18 combinations are equally likely. The 2 combinations with either 0 or 3 in site 1 (and vice versa in site 2) could result from 3 individual instance failures or 1 site failure, so these combinations are more likely. In fact, for the defaults in Table 1, these 2 combinations account for 99.999% of P3.
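The combination counts in the three cases above can be verified directly with the hyper-geometric "balls in urns" counting for M=2 sites. This hypothetical helper reproduces the 6, 15, and 20 distributions quoted in the text for N=6, J=3:

```python
from math import comb

def splits(n, J):
    """For M = 2 sites with J slots each: number of ways to place i of the
    n UP instances in site 1 and n - i in site 2, i.e. C(J, i) * C(J, n - i),
    for each feasible i."""
    return {i: comb(J, i) * comb(J, n - i)
            for i in range(max(0, n - J), min(J, n) + 1)}

# N = 6, J = 3 from the worked example:
print(splits(5, 3))   # {2: 3, 3: 3}            -> 6 total
print(splits(4, 3))   # {1: 3, 2: 9, 3: 3}      -> 15 total
print(splits(3, 3))   # {0: 1, 1: 9, 2: 9, 3: 1} -> 20 total
```

Note that each total equals C(N, n), as it must. The code only counts combinations; as the text explains, for n=3 the two combinations with 0 or 3 UP instances in one site are not equally likely with the rest, which is exactly why Pn must be conditioned on m.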
To further illustrate, the erroneous “equally likely combinations” formula suggests
For M=2, this scenario of unequal combinations can only happen when i=0 or n−i=0 (that is, when one site has no UP VMs). The result from the correct formula looks like
As illustrated in this example, we can account for the fact that not all combinations are equally likely by breaking Pn apart and conditioning on m. The resulting exact formula for λD for general M is given by
Although the equation for λD becomes increasingly awkward to express as M increases, it is straightforward to program algorithmically for computation. Now that we have the exact formula for the mean time between service outages F=λD−1, we can also compute the mean time to restore service R=μU−1=F(1−PUP)/PUP. As shown below, these are tools to facilitate the analysis and optimal sizing of application topologies to meet service performance and reliability requirements.
As a hypothetical example, consider a Voice over IP (VoIP) call setup message processing application. The goal is to cost-effectively size the application (M sites and N virtual instances) to satisfy the following requirements and assumptions:
Application (service) availability≥0.99999.
Adequate capacity to process 600 VoIP calls/sec.
Peak traffic rate 1.5× average traffic rate.
Mean message processing latency≤30 ms, and 95th percentile (95%)≤60 ms.
Service outages lasting longer than 30 minutes are reportable.
Probability of a reportable outage in 1 year≤1%.
An outage occurs if available capacity<50% (2× over-engineering).
Local- and geo-redundancy required (minimum 2+ instances at each of 2+ sites).
Instances implemented as virtual machines (VMs) of the 4-vCPU flavor.
First, we consider the latency requirements to determine the required number of instances N. Given that voice call arrivals are reasonably random, and protocol message processing time is reasonably constant, assume an M/D/C service model, where C is the required number of vCPUs. Let E(W) and V(W) denote the mean and variance of the waiting time W prior to service. For simplicity, Kingman–Köllerström heavy-traffic GI/G/C two-moment approximations are used below for E(W) and V(W), based on the coefficients of variation Ca2 and Cs2 of the arrival process and the service process (where Ca2=1 and Cs2=0 for the M/D/C system). Then the mean and variance of the waiting time W are given by
where T0 is the no-load message processing (code execution) time and
This Kingman–Köllerström approximation assumes that W is exponentially distributed with mean T0x, so that latency T=T0+W is a shifted exponential. The 95th percentile latency is given approximately by T0+3E(W)=T0(1+3x). Thus, the performance requirements, combined with the capacity requirement of 600 calls/sec, become
where
and C=number of vCPUs.
This result yields an explicit relationship between the maximum allowable processing time T0 and minimum required number of vCPUs C, as shown in
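The shifted-exponential latency relationship stated above can be sketched directly. Here x is the heavy-traffic waiting-time factor from the approximation (its defining equation is not reproduced here), and the function name is hypothetical:

```python
from math import log

def latency_stats(T0, x):
    """Shifted-exponential latency model: T = T0 + W, with W exponentially
    distributed with mean E(W) = T0 * x. Returns (mean latency, 95th
    percentile latency). Note -ln(0.05) ~= 3, giving p95 ~= T0 * (1 + 3x)."""
    EW = T0 * x
    mean_T = T0 + EW
    p95_T = T0 - log(0.05) * EW
    return mean_T, p95_T

# Illustration: T0 = 10 ms of no-load processing time and x = 1.
mean_T, p95_T = latency_stats(10.0, 1.0)
print(mean_T)   # 20.0 ms
print(p95_T)    # ~39.96 ms, i.e. approximately T0 * (1 + 3x) = 40 ms
```

This makes the rule of thumb visible: because the exponential 95th percentile multiplier −ln(0.05) ≈ 2.996, the approximation T0+3E(W) in the text is accurate to well under 1%.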
Next, given the proposed minimum topology M=2, N=6, and J=K=3 that satisfies the latency requirements, we can now apply the reference outage model. For the default MTBF and MTTR values in Table 1, the model output parameters, explicit formulae, and resulting values are given in Table 2. As can be seen, based on the assumed MTBFs and MTTRs for this topology, F=323567 hours and R=67 minutes.
Now, consider the availability requirement and assume (worst case) that all outages occur during peak traffic periods, where the peak-to-average traffic ratio σ=1.5. Then
Since 323567>166498, the availability requirement is met, and it would appear that the minimum M=2, N=6 topology is sufficient. However, there should be verification that this solution meets the reportable outage requirement.
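The availability check above can be reproduced numerically. The threshold formula F ≥ σR/(1−A) is inferred from the numbers in the text (166498 hours with R ≈ 1.11 h), so treat it as an assumption of this sketch:

```python
sigma = 1.5            # peak-to-average traffic ratio
A_req = 0.99999        # required service availability
R = 1.11               # mean restoral time in hours (Table 2)
F = 323567             # mean time between outages in hours (Table 2)

# Worst case: all outages occur at peak traffic, so the effective
# unavailability is roughly sigma * R / F, giving the minimum MTBO:
F_min = sigma * R / (1 - A_req)
print(round(F_min))    # ~166,500 hours (the text computes 166,498)
print(F >= F_min)      # True: the availability requirement is met
```

The small gap between 166500 here and 166498 in the text is just rounding of R.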
Next, consider the service outage requirement P(no outages>30m in 1 year)≥99%.
Then λDe−μU/2≤−ln(0.99)/8760=871613−1, and F≥871613e−0.5/1.11=556564 hours. Since 323567&lt;556564, the reportable outage requirement is not met.
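The reportable-outage check works out as follows. Outages occur at rate λD=1/F, the fraction lasting longer than 30 minutes under exponential restoral is e−0.5/R, and requiring P(no reportable outage in a year) ≥ 0.99 under a Poisson outage process gives the bound computed here (a sketch, using the Table 2 values):

```python
from math import exp, log

R = 1.11                  # mean restoral time in hours
F = 323567                # mean time between outages in hours
hours_per_year = 8760

# Reportable outages (duration > 0.5 h) occur at rate (1/F) * exp(-0.5/R).
# P(none in a year) >= 0.99 requires that rate to be at most:
max_rate = -log(0.99) / hours_per_year   # ~ 1/871,600 per hour
F_required = exp(-0.5 / R) / max_rate    # ~ 556,000 hours

print(F >= F_required)    # False: requirement not met for M=2, N=6
```

Minor differences from the 871613 and 556564 figures in the text are rounding; the conclusion is the same.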
In view of the above, there are a number of options that can be evaluated using the reference outage model. First, we can model the effect of hardening the HW or SW elements by increasing their MTBFs or decreasing their MTTRs. The details are omitted, but hardening the instance SW (increasing λI−1 from 3 to 13 months) or the resident function SW (increasing λF−1 from 2 to 6.4 years) both result in increasing F above 556564 hours. Interestingly, decreasing the SW MTTRs is not as effective because in this particular example (where the reportable service outage requirement is most constraining), the solution is more sensitive to failure rates than to restoral rates. Notably, hardening the HW (increasing MTBFs or decreasing MTTRs) does not help, lending analytical support to the trend of using commodity hosts and public cloud sites instead of high-end servers and hardened Telco datacenters.
Next, instances can be added (increase N) or sites can be added (increase M). Adding a fourth host/instance to each site (M=2, N=8, J=4) meets the requirement. Adding a third site and redistributing the hosts/instances (M=3, N=6, J=2) also meets the requirement. The reason is that although site failures are more frequent with three sites, making a {2 site} duplex failure more likely, the much more probable {1 site+1 instance} duplex failure is no longer an outage mode.
Consider the following topology configuration and optimization algorithm.
There may be an objective function to minimize {(CM+OM)M+(CN+ON)N} subject to PUP≥A, J≥j, M≥m, F≥f, R≤r, etc. Given the inputs, the approach is to compute a family of feasible solution pairs {M,N} that are generally in the range {m,Nmax}, . . . , {Mmax,j}. The most cost-optimal topology is then easily determined given the capital and operational expense costs.
At step 101, receive, by a server or other device, the number of geographically diverse sites (M) for the service and the availability of the service (AN). For example, set M=m and AN=1 (i.e., only site failures can occur).
At step 102, determine the probability the service is up (PUP), mean time between service outages (F), and mean restoral time (R) based on the information of step 101. Solve for a first {PUP, F, R}.
At step 103, when the output {PUP, F, R} do not meet their respective requirements (that is, no feasible solution exists for M for any N), then increment M←M+1 and repeat step 102, solving for successive {PUP, F, R} values.
At step 104, when the output {PUP, F, R} meets their respective requirements, set J=max(┌K/M┐, j), N=MJ, and AN=μN/(λN+μN).
At step 105, based on the information of step 104, determine the probability the service is up (PUP), mean time between service outages (F), and mean restoral time (R). Solve for a new {PUP, F, R}.
At step 106, when an output of {PUP, F, R} do not meet their respective requirements, then increment N←N+M and J←J+1, and repeat step 105, solving for successive {PUP, F, R} values.
At step 107, when the output {PUP, F, R} meets their respective requirements, send an indication that {M, N} is a feasible solution.
At step 108, when J>j, then increment M←M+1 and go to step 104; otherwise, stop.
At step 109, based on the output of steps 104 through 108, the set of feasible solution pairs {M,N} that are generally in the range {m,Nmax}, . . . , {Mmax,j} have been identified. The objective function {(CM+OM)M+(CN+ON)N} is now computed for each feasible solution pair {M,N}, and the {M,N} pair that minimizes the objective function is identified as the most cost-optimal topology.
At step 110, output of step 109 (or any of the above steps) may be sent within an alert, which may be displayed on a device or used as a trigger. The alert may trigger the search for the M candidate physical sites in which to place the application, and the ordering of physical hardware (N servers and possibly racks, switches, routers, links, etc.) to be placed in those sites.
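The search in steps 104 through 109 can be sketched as follows. Here meets_requirements is a hypothetical stand-in for the {PUP, F, R} reference-model evaluation, and J_max simply bounds the inner loop; all names are illustrative:

```python
from math import ceil

def find_topologies(m, j, K, meets_requirements, M_max=6, J_max=10):
    """Enumerate feasible {M, N} pairs per steps 104-108."""
    feasible = []
    M = m
    while M <= M_max:
        J = max(ceil(K / M), j)          # step 104: minimum hosts per site
        while J <= J_max and not meets_requirements(M, M * J, J):
            J += 1                        # step 106: add one host per site
        if J <= J_max:
            feasible.append((M, M * J))   # step 107: record feasible {M, N}
        if J <= j:
            break                         # step 108: stop once J reaches j
        M += 1
    return feasible

def cheapest(feasible, C_M, O_M, C_N, O_N):
    """Step 109: minimize (C_M + O_M)*M + (C_N + O_N)*N over feasible pairs."""
    return min(feasible, key=lambda t: (C_M + O_M) * t[0] + (C_N + O_N) * t[1])

# Toy predicate: pretend requirements are met with 8+ hosts or 3+ sites.
ok = lambda M, N, J: N >= 8 or M >= 3
pairs = find_topologies(m=2, j=2, K=3, meets_requirements=ok)
print(pairs)                           # [(2, 8), (3, 6)]
print(cheapest(pairs, 10, 5, 1, 1))    # (2, 8): cost 46 vs 57 for (3, 6)
```

With high per-site costs relative to per-host costs, the fewer-sites solution wins, mirroring the trade-off discussed for the VoIP example (M=2, N=8 versus M=3, N=6).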
A 2-tiered reference model is used that consists of servers and sites to illustrate an approach to topology configuration and optimization, with a focus on addressing geo-redundancy questions like how many sites, and how many servers per site, are required to meet performance and reliability requirements. First, develop a multi-dimensional component failure mode reference model; then, exactly reduce this model to a one-dimensional service outage mode reference model. A contribution is the exact derivation of the outage and restoral rates from the set of ‘available’ states to the set of ‘unavailable’ states using an adaptation of the hyper-geometric “balls in urns” distribution with unequally likely combinations. A topology configuration tool for optimizing resources to meet requirements is described, and its effective use is illustrated for a hypothetical VoIP call setup protocol message processing application.
Network device 300 may comprise a processor 302 and a memory 304 coupled to processor 302. Memory 304 may contain executable instructions that, when executed by processor 302, cause processor 302 to effectuate operations associated with mapping wireless signal strength.
In addition to processor 302 and memory 304, network device 300 may include an input/output system 306. Processor 302, memory 304, and input/output system 306 may be coupled together (coupling not shown in
Input/output system 306 of network device 300 also may contain a communication connection 308 that allows network device 300 to communicate with other devices, network entities, or the like. Communication connection 308 may comprise communication media. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, or wireless media such as acoustic, RF, infrared, or other wireless media. The term computer-readable media as used herein includes both storage media and communication media. Input/output system 306 also may include an input device 310 such as keyboard, mouse, pen, voice input device, or touch input device. Input/output system 306 may also include an output device 312, such as a display, speakers, or a printer.
Processor 302 may be capable of performing functions associated with telecommunications, such as functions for processing broadcast messages, as described herein. For example, processor 302 may be capable of, in conjunction with any other portion of network device 300, determining a type of broadcast message and acting according to the broadcast message type or content, as described herein.
Memory 304 of network device 300 may comprise a storage medium having a concrete, tangible, physical structure. As is known, a signal does not have a concrete, tangible, physical structure. Memory 304, as well as any computer-readable storage medium described herein, is not to be construed as a signal. Memory 304, as well as any computer-readable storage medium described herein, is not to be construed as a transient signal. Memory 304, as well as any computer-readable storage medium described herein, is not to be construed as a propagating signal. Memory 304, as well as any computer-readable storage medium described herein, is to be construed as an article of manufacture.
Memory 304 may store any information utilized in conjunction with telecommunications. Depending upon the exact configuration or type of processor, memory 304 may include a volatile storage 314 (such as some types of RAM), a nonvolatile storage 316 (such as ROM, flash memory), or a combination thereof. Memory 304 may include additional storage (e.g., a removable storage 318 or a non-removable storage 320) including, for example, tape, flash memory, smart cards, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, USB-compatible memory, or any other medium that can be used to store information and that can be accessed by network device 300. Memory 304 may comprise executable instructions that, when executed by processor 302, cause processor 302 to effectuate operations to map signal strengths in an area of interest.
The machine may comprise a server computer, a client user computer, a personal computer (PC), a tablet, a smart phone, a laptop computer, a desktop computer, a control system, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. It will be understood that a communication device of the subject disclosure includes broadly any electronic device that provides voice, video or data communication. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein.
Computer system 500 may include a processor (or controller) 504 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 506 and a static memory 508, which communicate with each other via a bus 510. The computer system 500 may further include a display unit 512 (e.g., a liquid crystal display (LCD), a flat panel, or a solid state display). Computer system 500 may include an input device 514 (e.g., a keyboard), a cursor control device 516 (e.g., a mouse), a disk drive unit 518, a signal generation device 520 (e.g., a speaker or remote control) and a network interface device 522. In distributed environments, the examples described in the subject disclosure can be adapted to utilize multiple display units 512 controlled by two or more computer systems 500. In this configuration, presentations described by the subject disclosure may in part be shown in a first of display units 512, while the remaining portion is presented in a second of display units 512.
The disk drive unit 518 may include a tangible computer-readable storage medium on which is stored one or more sets of instructions (e.g., software 526) embodying any one or more of the methods or functions described herein, including those methods illustrated above. Instructions 526 may also reside, completely or at least partially, within main memory 506, static memory 508, or within processor 504 during execution thereof by the computer system 500. Main memory 506 and processor 504 also may constitute tangible computer-readable storage media.
As described herein, a telecommunications system may utilize a software defined network (SDN). SDN and a simple IP may be based, at least in part, on user equipment, and may provide a wireless management and control framework that enables common wireless management and control, such as mobility management, radio resource management, QoS, load balancing, etc., across many wireless technologies, e.g., LTE, Wi-Fi, and future 5G access technologies; decoupling the mobility control from data planes to let them evolve and scale independently; reducing network state maintained in the network based on user equipment types to reduce network cost and allow massive scale; shortening cycle time and improving network upgradability; flexibility in creating end-to-end services based on types of user equipment and applications, thus improving customer experience; or improving user equipment power efficiency and battery life—especially for simple M2M devices—through enhanced wireless management.
While examples of a system in which reliability reference model for topology configuration alerts can be processed and managed have been described in connection with various computing devices/processors, the underlying concepts may be applied to any computing device, processor, or system capable of facilitating a telecommunications system. The various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and devices may take the form of program code (i.e., instructions) embodied in concrete, tangible, storage media having a concrete, tangible, physical structure. Examples of tangible storage media include floppy diskettes, CD-ROMs, DVDs, hard drives, or any other tangible machine-readable storage medium (computer-readable storage medium). Thus, a computer-readable storage medium is not a signal. A computer-readable storage medium is not a transient signal. Further, a computer-readable storage medium is not a propagating signal. A computer-readable storage medium as described herein is an article of manufacture. When the program code is loaded into and executed by a machine, such as a computer, the machine becomes a device for telecommunications. In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile or nonvolatile memory or storage elements), at least one input device, and at least one output device. The program(s) can be implemented in assembly or machine language, if desired. The language can be a compiled or interpreted language, and may be combined with hardware implementations.
The methods and devices associated with a telecommunications system as described herein also may be practiced via communications embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, or the like, the machine becomes a device for implementing telecommunications as described herein. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique device that operates to invoke the functionality of a telecommunications system.
While the disclosed systems have been described in connection with the various examples of the various figures, it is to be understood that other similar implementations may be used or modifications and additions may be made to the described examples of a telecommunications system without deviating therefrom. For example, one skilled in the art will recognize that a telecommunications system as described in the instant application may apply to any environment, whether wired or wireless, and may be applied to any number of such devices connected via a communications network and interacting across the network. Therefore, the disclosed systems as described herein should not be limited to any single example, but rather should be construed in breadth and scope in accordance with the appended claims.
In describing preferred methods, systems, or apparatuses of the subject matter of the present disclosure—reliability reference model for topology configuration—as illustrated in the Figures, specific terminology is employed for the sake of clarity. The claimed subject matter, however, is not intended to be limited to the specific terminology so selected. In addition, the use of the word “or” is generally used inclusively unless otherwise provided herein.
This written description uses examples to enable any person skilled in the art to practice the claimed subject matter, including making and using any devices or systems and performing any incorporated methods. Other variations of the examples are contemplated herein.
Methods, systems, and apparatuses, among other things, as described herein may provide for receiving a number of geographically diverse sites for a service; receiving a minimum availability of the service; based on the number of geographically diverse sites and the minimum availability, determining a probability that the service is up (PUP), mean time between service outages (F), and mean restoral time (R); and sending an alert that includes the PUP, F, and R. F may be determined by:
where λD is the mean service outage rate, λM is the site failure rate, λN is the host failure rate, K is the minimum required capacity, J=N/M is the number of hosts per site, Pn|m is the probability of n hosts up given m sites up, PM(m) is the probability of m sites up, PK is the probability of K hosts up, and is the number of sites out of the m sites up that have more than n−K hosts up. Pn|m, PM(m), and PK are determined by the solution to the Markov chain model arising from the problem formulation, and is determined by the solution to a specialized “balls in urns” model involving the hyper-geometric distribution with unequally likely combinations. All combinations in this paragraph and the below paragraph (including the removal or addition of steps) are contemplated in a manner that is consistent with the other portions of the detailed description.
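By way of a hypothetical, non-limiting illustration, the standard hyper-geometric distribution underlying the “balls in urns” model may be sketched as follows. The adaptation to unequally likely combinations described above is not reproduced here; the function names are illustrative only and do not appear in the disclosure.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Standard hyper-geometric PMF: the probability of drawing exactly k
    'up' hosts when n hosts are sampled without replacement from a
    population of N hosts of which K are up. (Illustrative sketch only;
    the disclosure's unequally-likely-combinations variant differs.)"""
    if k < 0 or k > K or n - k > N - K:
        return 0.0  # impossible draw
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)
```

For example, with a population of N=10 hosts of which K=5 are up, the probability that a sample of n=4 hosts contains exactly k=2 up hosts is C(5,2)·C(5,2)/C(10,4) = 100/210.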
The methods, systems, and apparatuses may provide for when PUP, F, and R do not meet a respective threshold requirement, incrementing the number of geographically diverse sites for the service; and based on the incremented number of geographically diverse sites, determining a second PUP, second F, and second R. The methods, systems, and apparatuses may provide for when PUP, F, and R meet a respective threshold requirement, setting J=max(⌈K/M⌉, j), N=MJ, and AN=μN/(λN+μN). Additionally, AN is the probability that all N hosts are up, and μN is the host restoral rate. The methods, systems, and apparatuses may provide for based on J=max(⌈K/M⌉, j), N=MJ, and AN=μN/(λN+μN), determining a third PUP, third F, and third R. The methods, systems, and apparatuses may provide for when third PUP, third F, and third R do not meet a second respective threshold requirement, incrementing N by M and J by 1 (that is, replace N with N+M and J with J+1). All combinations in this paragraph (including the removal or addition of steps) are contemplated in a manner that is consistent with the other portions of the detailed description.
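The iterative sizing loop described above may be sketched, purely as a hypothetical illustration, as follows. This sketch substitutes a simplified independent-host (binomial) availability model for the disclosure's Markov-chain and urn-model solution, and considers only the PUP threshold; the F and R checks, and all function names, are illustrative assumptions.

```python
from math import comb, ceil

def host_availability(lam_n, mu_n):
    # AN = muN / (lamN + muN), per the disclosure
    return mu_n / (lam_n + mu_n)

def p_service_up(M, J, K, a_host):
    """Simplified stand-in for PUP: probability that at least K of the
    N = M*J hosts are up, treating host failures as independent
    (binomial). The disclosure's Markov-chain model is not reproduced."""
    N = M * J
    return sum(comb(N, n) * a_host**n * (1 - a_host)**(N - n)
               for n in range(K, N + 1))

def size_topology(M, K, lam_n, mu_n, pup_target, max_extra=20):
    """Start at J = ceil(K/M) hosts per site and, while the availability
    target is unmet, increment N by M (i.e., J by 1)."""
    a = host_availability(lam_n, mu_n)
    J = ceil(K / M)
    for _ in range(max_extra):
        pup = p_service_up(M, J, K, a)
        if pup >= pup_target:
            return J, M * J, pup
        J += 1
    return J, M * J, p_service_up(M, J, K, a)
```

For instance, with M=3 sites, a minimum required capacity of K=4 hosts, and per-host rates λN=0.001 and μN=0.1 (so AN≈0.9901), the sketch starts at J=⌈4/3⌉=2 (N=6) and finds that configuration already exceeds a 0.9999 availability target under the simplified model.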