Scalable and Low Computation Cost Method for Optimizing Sampling/Probing in a Large Scale Network

Information

  • Patent Application
  • Publication Number
    20230164054
  • Date Filed
    December 28, 2021
  • Date Published
    May 25, 2023
Abstract
The subject matter described herein provides systems and techniques for monitoring of a large-scale network, such as a large-scale network supporting cloud infrastructures. A minimum fixed probe allocation and/or a sampling budget for monitoring may be set. A probing and/or sampling strategy may be optimized in order to measure network metrics, such as error metrics associated with the latency of a network, with a known accuracy, given a particular probe allocation. The framework and techniques, as described herein, may leverage particular designs to determine efficient probing strategies for the network, while simultaneously conserving computing resources. In some examples, instead of using these frameworks and techniques directly in production networks, a scalable and near optimal approximation technique based on the Frank-Wolfe algorithm may be used.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority from Greek Patent Application No. 20210100818 filed Nov. 22, 2021, the disclosure of which is hereby incorporated herein by reference.


BACKGROUND

Monitoring of large-scale networks, such as those supporting cloud infrastructures, may be a complex and widely studied problem. Most recent systems make two assumptions in order to acquire accurate measurements when monitoring such networks: (1) that a fixed budget of CPU and memory is reserved for monitoring, and (2) that most network flows can be covered using this budget. In practice, however, reserving a fixed amount of resources for monitoring and full coverage may be infeasible at the scale at which large-scale networks, such as major cloud infrastructure networks, operate. This may make it difficult to know the accuracy of the estimated metrics. Moreover, many conventional frameworks and techniques for monitoring networks, and for determining the placement of network probes to be used in the monitoring, may be computationally inefficient and may not scale.


BRIEF SUMMARY

The monitoring of large-scale networks, such as those supporting cloud infrastructures, may involve setting a fixed budget for monitoring. The budget, sometimes referred to as a probe budget, may be defined as a maximum number of network probes. Instead of assuming that the CPU and memory are given a fixed budget for monitoring and that most network flows can be covered using this budget, the framework and techniques, as described herein, optimize the probing strategy in order to measure network metrics with a known accuracy, given a particular probe budget. The framework and techniques, as described herein, may leverage A- and E-optimal experimental designs in statistics to determine the optimal probing strategy. In some examples, instead of using these frameworks and techniques directly in production networks, a scalable and near optimal implementation based on the Frank-Wolfe algorithm may be used.


The framework and techniques may be validated using experimentation, whereby the framework and techniques may be simulated using real network topologies where the latency and loss are generated using idealized models. Empirical evaluation of the implementation of the framework and techniques may be performed using a production probing system in a large cloud network to measure latency and loss. Major gains in reducing the probing budget, while maintaining low error on estimates even with very low probing budgets, may be achieved using the framework and techniques, as described herein.


In some examples, the framework and techniques, as described herein, may model network paths as a structure of interest. In some examples, a path may be viewed as a collection of links, such as hops between routers, or more abstractly as a feature vector with path characteristics. Using such a model may allow for the formulation of an optimization problem in which the estimation error for a metric of interest, such as the maximum error in estimating latency across all paths, may be minimized subject to measurement constraints, such as probing and/or sampling budgets, at each node in the network. Here, the estimation error may be the average or maximum estimation error. The solution of the optimization problem may be an optimal probe allocation vector that corresponds to a sampling and/or probing frequency for each path in the network.


In some examples, the framework and techniques, as described herein, may include two linear models with properties that allow for the estimation of network latency and loss. In some examples, the framework and techniques, as described herein, may use linear regression. In these examples, the optimal allocation of measurement probes/samples, along network paths, may be determined, without prior observations, as a function of the features of the paths. In some examples, the framework and techniques, as described herein, may use a generalized linear model instead of a linear model. This model may allow for the application of the same principles of optimal design as in linear models. In some examples, a general reduction of a non-linear model may be made. This may allow for the estimation of network properties beyond network latency and loss.


In some examples, an approximation algorithm inspired by the Frank-Wolfe algorithm may be used rather than A- and E-optimal design. This approximation algorithm may be near-optimal or optimal and may be implemented in a scalable manner in very large networks. The use of such framework and techniques, as described herein, may be scalable and improve the computational efficiency and speed with which any computing system and/or network resources, implementing such framework and techniques, may solve for an optimal probe allocation vector when compared to conventional frameworks and techniques. As such, when compared to conventional frameworks and techniques, the use of the framework and techniques, as described herein, may reduce power consumption and increase the available bandwidth of any computing system and/or network resources implementing such framework and techniques.


The framework and techniques, as described herein, may be evaluated using simulations with real topologies and idealized loss/latency models. The framework and techniques, as described herein, may be evaluated by probing in a real production cloud network. The framework and techniques, as described herein, may produce more accurate metric estimates for probe budgets that are similar to those used in conventional frameworks and techniques, such as probing all paths uniformly at random. Moreover, even with low probe budgets, the estimation error from the framework and techniques, as described herein, may remain small, making the framework and techniques usable for many telemetry tasks.


The framework and techniques, as described herein, include models to allocate probes and/or sampling in a way that minimizes the error of the performance metrics for a given budget. In addition, the framework and techniques, as described herein, include a near optimal implementation that may allow for deployment in large scale networks. Additionally, the framework and techniques, as described herein, may produce accurate estimates for a large range of probing budgets, including reduced probing budgets. In addition, the framework and techniques, as described herein, may produce an optimal allocation of probes without having observed prior traffic on the network.


In general, one aspect of the subject matter described in this specification includes a process of determining an allocation of probes to monitor a network. A sample covariance matrix for the network may be determined. An error metric may be optimized, using one or more processors, based on the sample covariance matrix and a fixed probe budget value. The error metric may be associated with an error in estimating path latencies associated with the network. The allocation of probes may be determined, using the one or more processors, based on the optimized error metric. The allocation of probes may be associated with at least one feature vector of edge indicators for the network. The optimization of the error metric may include solving a semi-definite program using the fixed probe budget value. The optimization of the error metric may include using an approximation algorithm based on a Frank-Wolfe algorithm. The error metric may be associated with a maximum error in estimating the path latencies associated with the network. The error metric may be associated with an average error in estimating the path latencies associated with the network. The allocation of probes may be less than or equal to the probe budget value.


Another aspect of the subject matter includes a system for determining network failures that includes one or more memories for storing key network failures and one or more processors configured to perform various steps. A sample covariance matrix for the network may be determined. An error metric may be optimized based on the sample covariance matrix and a fixed probe budget value. The error metric may be associated with an error in estimating path latencies associated with the network. The allocation of probes may be determined based on the optimized error metric. The allocation of probes may be stored in the one or more memories. The allocation of probes may be associated with at least one feature vector of edge indicators for the network. The optimization of the error metric may include solving a semi-definite program using the fixed probe budget value. The optimization of the error metric may include using an approximation algorithm based on a Frank-Wolfe algorithm. The error metric may be associated with a maximum error in estimating the path latencies associated with the network. The error metric may be associated with an average error in estimating the path latencies associated with the network. The allocation of probes may be less than or equal to the probe budget value.


Yet another aspect of the subject matter includes a non-transitory computer-readable medium storing instructions, that when executed by one or more processors, cause the one or more processors to perform various steps. A sample covariance matrix for the network may be determined. An error metric may be optimized based on the sample covariance matrix and a fixed probe budget value. The error metric may be associated with an error in estimating path latencies associated with the network. The allocation of probes may be determined based on the optimized error metric. The allocation of probes may be associated with at least one feature vector of edge indicators for the network. The optimization of the error metric may include using an approximation algorithm based on a Frank-Wolfe algorithm. The error metric may be associated with a maximum error in estimating the path latencies associated with the network. The error metric may be associated with an average error in estimating the path latencies associated with the network.


Yet another aspect of the subject matter described in this specification includes a process of determining a sampling rate for monitoring a network. A sample covariance matrix for the network may be determined. An error metric may be optimized, using one or more processors, based on the sample covariance matrix and a fixed sampling budget value. The error metric may be associated with an error in estimating path latencies associated with the network. The sampling rate may be determined, using the one or more processors, based on the optimized error metric. The optimization of the error metric may include using an approximation algorithm based on a Frank-Wolfe algorithm.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 depicts an example of a topology of a network.



FIG. 2 is a flow diagram of an example process for determining an allocation of probes to monitor a network using an E-optimal design.



FIG. 3 is a flow diagram of an example process for determining an allocation of probes to monitor a network using an A-optimal design.



FIG. 4 is a flow diagram of an example process for determining an allocation of probes to monitor a network using an approximation algorithm.



FIG. 5 depicts a block diagram of an example electronic device in accordance with aspects of the disclosure.





DETAILED DESCRIPTION


FIG. 1 depicts an example network topology 150, which may include nodes 152(a), 152(b), 152(c), 152(d), 152(e), and 152(f), generally referred to as nodes 152. Each of the nodes 152 may be a device, such as an electronic device or a router. The logical topology 150 may define how the nodes 152 communicate across the network by showing the logical communication links 112 that are available for communication between the nodes 152. Logical communication links 112 may form the connections between these nodes within the network topology 150. A path may exist between nodes. The path may be viewed as a collection of links 112, such as hops between routers, or more abstractly as a feature vector with path characteristics. An allocation vector may correspond to a sampling and/or probing frequency for each path in the network topology. For example, the allocation vector may be used to determine the allocation of probes in the network. A network probe may be software, a general purpose device, or a special purpose device inserted into a network for the purposes of monitoring or collecting data about network activity. This may be used to monitor a network associated with network topology 150.


In some examples, inferring an allocation of probes in the network can be abstracted. An unknown function f*:X→R that maps d-dimensional features x ∈ X to real values y ∈ R, where X ⊆ Rd is the set of all possible d-dimensional feature vectors, may be estimated. In the networking domain, one example of f* may be the latency of a path in a network with d edges. In this case, x may be a vector of edge indicators in a path, y=f*(x) may be the expected latency of that path, and X ⊆ {0, 1}d may be the set of all paths. Here an edge indicator may be defined as having a value of 1 if the corresponding edge is on the path, and 0 if it is not. The function f* may be estimated from a dataset D={(xi,yi):xi ∈ X, yi ∈ R, i ∈ [n]} of n training examples, where xi is the i-th feature vector and yi is a noisy realization of f*(xi). No assumption that xi≠xj when i≠j is made. As may be common in machine learning, a dataset may contain multiple identical feature vectors with different noisy responses. A best possible approximation to f*, with guarantees such as worst case and average case guarantees, may be learned.


For the worst case guarantee, f* may be approximated well for all x ∈ X. In this case, a natural metric to optimize may be the maximum error,


Lmax(f)=maxx∈X(f(x)−f*(x))2.   (1)


In some examples, the above squared term may be replaced with an absolute value. In the earlier example, X may be the set of all paths and Lmax(f) may be the maximum squared error in estimating path latencies.


For the average case guarantee, P may be a distribution over feature vectors in X. Then the average error over x˜P,






Lavg(f)=Ex˜P[(f(x)−f*(x))2]  (2)


may be optimized. The above squared term assigns higher importance to higher errors. In the earlier example, P may be a distribution over paths X and Lavg(f) may be the mean squared error in estimating path latencies, weighted by P.


Stating the problem more formally, given a fixed error metric, such as Lmax, and a fixed budget of n measurements, an allocation (xi)ni=1 may be designed that allows for a good approximation to f*, as measured by Lmax. That is, f̂ may be learned from the dataset D such that Lmax(f̂)≤c with a high probability, for an appropriately chosen c. Effectively, the allocation determines which feature vectors should be in D, the set of observations/measurements, before fitting f̂. While f̂ may be estimated from measurements (yi)ni=1 in D, these measurements may not be available at the time of deciding what to measure. However, the features (xi)ni=1 may be available.


To optimize errors (1) and (2), a class of functions f:Rd→R may be chosen where |f(x)−f*(x)| is bounded by an expression that only depends on features (xi)ni=1, and not on measurements (yi)ni=1. These may be linear models, which may be optimized. Generality and feasibility of solutions may be ensured in a large and global modern network. Approximate solutions that may also be practical for scaling and generalizations to a non-linear class of functions may also be made.


Linear models may be used for the optimization of errors (1) and (2). In particular, given a feature vector x ∈ Rd, the true value of the unknown function evaluated at x, f*(x), may be assumed to be linear in x and defined by f*(x)=xTθ*, where θ* ∈ Rd is a vector of unknown model parameters. It may also be assumed that there is access to noisy observations y, which may be obtained by adding Gaussian noise to the true value f*(x) as






y=f*(x)+∈=xTθ*+∈,   (3)


for ∈˜N(0,σ2) and some known σ>0. It may be assumed that all examples in dataset D are generated this way, that is, yi=xiTθ*+∈i for all i ∈ [n], where ∈i˜N(0,σ2) are drawn independently of each other.


Least squares regression estimates model parameters θ by minimizing the sum of squares θ̂=arg minθΣni=1(yi−xiTθ)2. Under the assumption that G=Σni=1 xixiT is invertible, this problem may have the unique closed-form solution θ̂=G−1Σni=1 yixi. The matrix G may be known as a sample covariance matrix. The parameter θ̂ may follow a multivariate normal distribution with mean E[θ̂]=θ* and covariance Cov(θ̂)=σ2G−1, for any (xi)ni=1.
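The closed-form estimator above may be illustrated with a small d=2 sketch in pure Python. The helper name and all numeric values are illustrative, not from the application; noiseless observations are used so that the true parameters are recovered exactly:

```python
def least_squares_2d(xs, ys):
    """Closed-form least squares for d = 2: theta_hat = G^-1 * sum_i y_i x_i,
    where G = sum_i x_i x_i^T is the sample covariance matrix."""
    g11 = sum(x[0] * x[0] for x in xs)
    g12 = sum(x[0] * x[1] for x in xs)
    g22 = sum(x[1] * x[1] for x in xs)
    b1 = sum(y * x[0] for x, y in zip(xs, ys))
    b2 = sum(y * x[1] for x, y in zip(xs, ys))
    det = g11 * g22 - g12 * g12              # nonzero iff G is invertible
    # apply the 2x2 inverse of G to the vector (b1, b2)
    return ((g22 * b1 - g12 * b2) / det, (g11 * b2 - g12 * b1) / det)

theta_star = (2.0, 5.0)                      # "true" mean edge latencies
xs = [(1, 0), (0, 1), (1, 1), (1, 1)]        # paths as edge-indicator vectors
ys = [x[0] * theta_star[0] + x[1] * theta_star[1] for x in xs]  # noiseless y_i
theta_hat = least_squares_2d(xs, ys)         # recovers theta_star exactly
```

With Gaussian noise added to each yi, θ̂ would instead be normally distributed around θ* with covariance σ2G−1, as stated above.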


In some examples, using the feature vector x, the true value f*(x)=xTθ* may be estimated by f̂(x)=xTθ̂. Using the knowledge that θ̂˜N(θ*,σ2G−1), it may be derived that f̂(x) is also normally distributed:





f̂(x)˜N(f*(x),σ2xTG−1x).   (4)


From tail inequalities for Gaussian random variables [10], it follows that:





(f̂(x)−f*(x))2≤2 log(1/δ)σ2xTG−1x   (5)


holds with probability at least 1−δ. Therefore, the problem of designing a good approximation f̂ to f*, with a high probability bound on (f̂(x)−f*(x))2, may be reduced to designing a good sample covariance matrix G, with an upper bound on xTG−1x. Note that the expression xTG−1x may not depend on the measurements yi, but it may depend on the features xi. In statistics, this class of problems is referred to as optimal experimental design.
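Once G is known, the bound in equation (5) may be evaluated directly. The sketch below (the `error_bound` helper and all numeric values are illustrative, not from the application) computes 2 log(1/δ)σ2xTG−1x for a diagonal sample covariance matrix:

```python
import math

def error_bound(x, G_inv, sigma=1.0, delta=0.05):
    """High-probability bound (5) on the squared estimation error:
    (f_hat(x) - f*(x))^2 <= 2*log(1/delta) * sigma^2 * x^T G^-1 x.
    G_inv is a 2x2 inverse sample covariance matrix, supplied explicitly."""
    quad = (x[0] * (G_inv[0][0] * x[0] + G_inv[0][1] * x[1])
            + x[1] * (G_inv[1][0] * x[0] + G_inv[1][1] * x[1]))
    return 2.0 * math.log(1.0 / delta) * sigma ** 2 * quad

# Example: 4 probes along the first basis path, 1 along the second,
# so G = diag(4, 1) and G^-1 = diag(0.25, 1).
G_inv = [[0.25, 0.0], [0.0, 1.0]]
bound = error_bound((1, 1), G_inv)   # bound for a path using both edges
```

Note that the bound depends only on the features and the design, not on any measured yi, which is what makes it usable before any probing occurs.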


The linear models may be useful for modeling a canonical networking problem of predicting latency. In particular, because the latency of a path may be the sum of latencies on edges of that path, the latency may be a linear function. In some examples, θ* ∈ Rd may be a vector of mean latencies of d edges, where θ*(e) is the mean latency of edge e. In these examples, x ∈ {0, 1}d may be a vector of edge indicators in a path. As such, in these examples, f*(x)=xTθ* may represent the mean latency of path x.
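This additivity may be shown concretely: the mean latency of a path equals the inner product of its edge-indicator vector with the vector of mean edge latencies. The latency values and helper names below are illustrative:

```python
# Edges indexed 0..3 with mean latencies theta*(e); values are illustrative.
theta_star = [1.5, 0.3, 2.0, 0.7]

def indicator(path_edges, d=4):
    """Edge-indicator feature vector x in {0,1}^d for a path."""
    return [1 if e in path_edges else 0 for e in range(d)]

def mean_latency(x, theta):
    """f*(x) = x^T theta*: mean path latency is the sum of its edge latencies."""
    return sum(xi * ti for xi, ti in zip(x, theta))

path = [0, 2, 3]                       # hops over edges 0, 2 and 3
x = indicator(path)                    # -> [1, 0, 1, 1]
assert mean_latency(x, theta_star) == theta_star[0] + theta_star[2] + theta_star[3]
```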



FIG. 2 is a flow diagram of example process 200 for determining an allocation of probes to monitor a network using an E-optimal design. In some examples, the process 200 may be performed in whole or in part by an electronic device, such as electronic device 500 described in connection with FIG. 5. At block 210, a sample covariance matrix, G, for the network may be determined/defined, such as what is described above.


The sample covariance matrix, G, may be optimized. For any symmetric matrix M ∈ Rd×d, λi(M) may denote the i-th largest eigenvalue of M. Additionally, λmax(M)=λ1(M) and λmin(M)=λd(M). The trace of M is the sum of its eigenvalues, tr(M)=Σdi=1λi(M). The matrix M may be positive semi-definite (PSD), denoted by M⪰0, if xTMx≥0 for all x ∈ Rd.


At block 220, a maximum error metric may be minimized based on maximizing the minimum eigenvalue of the sample covariance matrix G. The error metric may be associated with an error in estimating path latencies associated with the network. In some examples, computing the E-optimal design may involve the minimization of the maximum error in equation (1), above. Minimization of the maximum error in equation (1), above, may be reduced to optimizing G. Without loss of generality, X in equation (1) may be a unit sphere, X={x:∥x∥2=1}. In this case,






Lmax(f̂) ∝ maxx∈XxTG−1x=λmax(G−1)=λmin(G)−1.


The left side of this proportionality relationship may follow from the definition of Lmax in equations (1) and (5) above, where the constant factor 2 log(1/δ)σ2 may be omitted. The first equality may be derived from the definition of the maximum eigenvalue, and the second equality may be derived from a result in linear algebra in which the maximum eigenvalue of any PSD matrix M is the reciprocal of the minimum eigenvalue of its inverse M−1.


Therefore, the minimization of equation (1) above may be achieved by maximizing the minimum eigenvalue of the sample covariance matrix G. This may be known as the E-optimal design and may be formulated as the following semi-definite program (SDP).





max τ


s.t. Σx∈XαxxxT ⪰ τId,


Σx∈Xαx≤n, ∀x ∈ X:αx≥0,   (6)


The SDP above has two types of variables. The variable τ may be the minimum eigenvalue of the sample covariance matrix; it is also the objective that is maximized. The variable αx may be used to encode the expected number of times that x should appear in dataset D.


At block 230, an allocation of probes may be determined based on the maximum error metric being minimized. In particular, αx may be determined and this may be directly related to the allocation of probes. The allocation of probes may be associated with at least one feature vector of edge indicators, x, for the network.


In the equation (6), above, αx may not be integers. For example, when αx=1.5, it may not be obvious how to collect a dataset D where the feature vector x appears 1.5 times. To address this issue, a randomization may be used and a dataset D may be generated where any feature vector x appears αx times in expectation. Such a dataset may be generated in many ways. In some examples, xi=x may be set with probability αx/n, independently for all i ∈ [n]. Note that αx/n may be a valid probability, because αx≥0 and Σx∈X αx=n whenever α=(α1, . . . , α|X|) solves the equation (6), above.
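The randomized rounding described above may be sketched as follows: each of the n dataset slots independently draws a feature vector x with probability αx/n, so each x appears αx times in expectation. The path identifiers and α values are illustrative:

```python
import random

def round_allocation(X, alpha, n, rng):
    """Draw each of the n dataset slots independently: slot i becomes feature
    vector x with probability alpha[x]/n (valid probabilities, since every
    alpha[x] >= 0 and the alpha[x] sum to n for a solution of the SDP)."""
    feats = list(X)
    probs = [alpha[x] / n for x in feats]
    return rng.choices(feats, weights=probs, k=n)

X = ["path_a", "path_b", "path_c"]                       # illustrative paths
alpha = {"path_a": 5.5, "path_b": 3.0, "path_c": 1.5}    # fractional solution
n = 10
rng = random.Random(0)
D = round_allocation(X, alpha, n, rng)
counts = {x: D.count(x) for x in X}   # approximately alpha in expectation
```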



FIG. 3 is a flow diagram of example process 300 for determining an allocation of probes to monitor a network using an A-optimal design. In some examples, the process 300 may be performed in whole or in part by an electronic device, such as electronic device 500 described in connection with FIG. 5. At block 310, a sample covariance matrix, G, for the network may be determined/defined, such as what is described above.


At block 320, an average error metric may be minimized based on minimizing the trace of the inverse sample covariance matrix G−1. The error metric may be associated with an error in estimating path latencies associated with the network. In some examples, computing the A-optimal design may involve the minimization of the average error in equation (2), above. Minimization of the average error in equation (2), above, may involve optimizing the sample covariance matrix G as follows. Without loss of generality, X in equation (2) may be a unit sphere, X={x:∥x∥2=1}. In addition, P may be a uniform distribution over the unit sphere. Then, UΛUT=G−1 may represent the eigen decomposition of G−1, which may be PSD by definition. Then









Lavg(f̂) ∝ EX˜P[xTG−1x]=EX˜P[xTUΛUTx]


=EX˜P[xTΛx] ∝ Σdi=1λi(G−1)=tr(G−1).






The initial portion of the above equations may be realized based on the definition of Lavg in equations (2) and (5), above, where the constant factor 2 log(1/δ)σ2 may be omitted. The second equality of the above equations may be realized because U performs a rotation and X is a unit sphere. Additionally, the final portion of the above equations is realized because P is uniform over the unit sphere X, so the expectation is proportional to the sum of the eigenvalues.


Therefore, minimization of equation (2), above, may involve minimizing the sum of the eigenvalues of G−1, which is equal to the trace of G−1. This may be known as the A-optimal design and may be formulated as the following semi-definite program (SDP).





min Σdi=1 τi


s.t. ∀i ∈ [d]:[Σx∈XαxxxT, ei; eiT, τi] ⪰ 0


Σx∈Xαx≤n, ∀x ∈ X:αx≥0,   (7)


Here, ei is the i-th element of the standard d-dimensional Euclidean basis. This SDP may have two types of variables. The variable τi may bound the i-th diagonal entry of the inverse sample covariance matrix G−1. The sum of these variables Σdi=1 τi may be equal to tr(G−1), which is the objective function that is minimized. The variable αx may be the expected number of times that x should appear in dataset D, similar to what is described above. Upon solving this SDP, the dataset D may be obtained as described above.


At block 330, an allocation of probes may be determined based on the average error metric being minimized. In particular, αx may be determined and this may be directly related to the allocation of probes. The allocation of probes may be associated with at least one feature vector of edge indicators, x, for the network.
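The A-optimality objective tr(G−1) may be compared across candidate allocations. In the simple case where X is the standard basis, G is diagonal, so the objective reduces to a sum of reciprocal per-coordinate probe counts and the balanced split is preferred. This is a toy sketch with illustrative numbers, not from the application:

```python
def trace_inv_diag(probes):
    """tr(G^-1) when X = {e1, ..., ed}: G = diag(probes), so the A-optimality
    objective is the sum of reciprocal per-coordinate probe counts."""
    return sum(1.0 / p for p in probes)

# Budget n = 10 split over two basis paths: the balanced split minimizes tr(G^-1)
balanced = trace_inv_diag([5, 5])    # 1/5 + 1/5 = 0.4
skewed = trace_inv_diag([8, 2])      # 1/8 + 1/2 = 0.625
assert balanced < skewed
```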


Each of the optimal designs from equations (6) and (7), above, may output an allocation α=(α1, . . . , α|X|) subject to a global constraint Σx∈X αx≤n, which means that the total number of measurements is at most n. In some examples, in practice, local budget constraints may be more common. The budget, sometimes referred to as a probe budget, may be defined as a maximum number of network probes. In some examples, it may be useful to enforce a policy that most paths do not start in a single source node, because such an allocation may not be implemented due to resource availability constraints. Such constraints may be enforced as follows. Let src(x) be the source of path x and S={src(x):x ∈ X} be the set of all sources. Then





∀s ∈ S:Σx∈X1{src(x)=s}αx≤b


may be a set of linear constraints in α that limit the number of measurements from any source s to at most b. Because the constraints may be linear in α, they may be easily incorporated into equations (6) and (7), above, without changing the hardness of these problems, which each remain an SDP. The framework and techniques, as described herein, for solving equations (6) and (7), above, may be generalized.
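A per-source budget of this form may be checked with a short sketch. The path and source names are illustrative, and the helper is hypothetical, not from the application:

```python
def violates_source_budget(paths, alpha, b):
    """Return the sources s whose total allocation exceeds the budget b,
    i.e. those violating: sum over paths x with src(x) = s of alpha[x] <= b."""
    per_source = {}
    for x in paths:
        s = x[0]                          # src(x): first node of the path
        per_source[s] = per_source.get(s, 0.0) + alpha[x]
    return [s for s, total in per_source.items() if total > b]

# Paths as node tuples; the first node is the source (names are illustrative)
paths = [("u", "v"), ("u", "w"), ("v", "w")]
alpha = {("u", "v"): 4.0, ("u", "w"): 3.0, ("v", "w"): 2.0}
assert violates_source_budget(paths, alpha, b=8) == []     # u: 7, v: 2 -> OK
assert violates_source_budget(paths, alpha, b=6) == ["u"]  # u exceeds 6
```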


The optimal designs, described above, may be theoretically sound. In some examples, it may not be computationally efficient to solve the optimal designs exactly as SDPs. For these examples, the solutions may be approximated using additional techniques, such as linear programming and gradient descent. The additional techniques may allow for the implementation of approximation algorithms, which determine approximate solutions to the optimal designs, that scale gracefully and that may be used in a large and global production network. In some examples, as described herein, the optimal designs may be extended to generalized linear models and/or general non-linear models.


In some examples, exact solutions to the SDPs defined by equations (6) and (7), above, may be computationally costly. In some examples, these SDPs may be solved using interior-point methods, which, in some cases, may be slow and/or difficult to scale. Therefore, in these examples, an approximate solution to the SDPs defined by equations (6) and (7), above, may be more computationally efficient and easier to scale. For example, solving equation (6), above, may be based on enforcing A ∈ Rd×d to be PSD, which may equate to satisfying infinitely many constraints xTAx≥0, for any x ∈ Rd, which may be linear in the parameterization of A. Continuing with this example, an approximate cutting plane algorithm may be designed, which may generate x and solve a sequence of linear programs (LPs). Such an approach may reduce run time, as compared to interior-point methods. For example, the run time may be reduced by two orders of magnitude when the number of added constraints is small. In some examples, however, there may be a drop in the quality of solutions. In these examples, the quality of the solutions may increase with additional linear program constraints, but the computational cost may also be increased as a result. In these examples, adding additional LP constraints may make this approach difficult to scale because the added LP constraints may not be sparse, and such lack of sparsity may decrease the efficiency of such an approach. Such an approach may be viewed as solving equations (6) and (7), above, exactly on an outer polytope of the feasible sets. Due to the drop in solution quality, in some examples, as described above, an opposite approach may be considered. For example, in this opposite approach, the feasible set may be maintained and the SDPs may be solved approximately using algorithms based on the Frank-Wolfe algorithm.
This may result in near optimal solutions to the SDPs, and these solutions may be achieved with an orders-of-magnitude reduction in run time.



FIG. 4 is a flow diagram of example process 400 for determining an allocation of probes to monitor a network using an approximation algorithm. In some examples, process 400 may be used for determining a sampling rate to monitor the network using the approximation algorithm. The approximation algorithm may be based on the Frank-Wolfe algorithm. In some examples, the process 400 may be performed in whole or in part by an electronic device, such as electronic device 500 described in connection with FIG. 5. At block 410, a sample covariance matrix for the network may be determined/defined. For example, Gα=Σx∈X αxxxT may be defined as the sample covariance matrix for allocation α ∈ Δ(X), where Δ(X) is a simplex with vertices X.


The Frank-Wolfe algorithm is a popular algorithm for constrained optimization problems of the form





minα∈Af(α)


where f is the optimized function, A ⊆ Rd is a convex feasible region defined by linear constraints, and α ∈ A is the optimized parameter vector. The problem may be solved iteratively via a sequence of linear programs (LPs)






g(α)=arg minā∈AāT∇f(α)   (9)


In particular, if α(i) ∈ A is the Frank-Wolfe solution in iteration i, the next solution α(i+1) may be obtained as follows. For example, ā=g(α(i)) may be computed, which may be an LP. Then, α(i+1) may be a solution to the line search problem





α(i+1)=c*α(i)+(1−c*)ā, where c*=arg minc∈[0,1]f(cα(i)+(1−c)ā).


The Frank-Wolfe algorithm converges when f and A are convex. An advantage of the Frank-Wolfe algorithm over gradient descent with a projection to A may be that a projection is not needed in the Frank-Wolfe algorithm. Instead, the LPs in (9), above, may be solved iteratively where the feasibility may be enforced.


The Frank-Wolfe algorithm may strike an elegant balance. On one hand, the LPs in (9), above, may represent the linear constraints in equations (6) and (7), above. On the other hand, the hard problem of eigenvalue optimization may be solved by gradient descent. This approach may be applied to A- and E-optimal designs.
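The iteration described above may be sketched on a toy problem: minimizing a convex quadratic over the simplex, with the LP step (9) solved by picking the best vertex and the line search done over a grid of mixing coefficients c. The objective, target values, and helper names are illustrative, not from the application:

```python
def frank_wolfe_simplex(f, grad_f, d, iters=500):
    """Frank-Wolfe over the simplex A = {alpha : alpha >= 0, sum(alpha) = 1}.
    The LP step, arg min over A of a^T grad f(alpha), is attained at a vertex,
    i.e. the basis vector with the smallest gradient coordinate; the next
    iterate is alpha <- c*alpha + (1-c)*a_bar, with c chosen by line search."""
    alpha = [1.0 / d] * d
    for _ in range(iters):
        g = grad_f(alpha)
        j = min(range(d), key=lambda i: g[i])     # LP solution: vertex e_j
        best_c, best_val = 1.0, f(alpha)          # c = 1 keeps alpha unchanged
        for k in range(101):                      # grid line search over c
            c = k / 100.0
            cand = [c * a + (1.0 - c) * (1.0 if i == j else 0.0)
                    for i, a in enumerate(alpha)]
            val = f(cand)
            if val < best_val:
                best_c, best_val = c, val
        alpha = [best_c * a + (1.0 - best_c) * (1.0 if i == j else 0.0)
                 for i, a in enumerate(alpha)]
    return alpha

# Toy objective: f(alpha) = ||alpha - target||^2, minimized over the simplex
target = [0.2, 0.3, 0.5]
f = lambda a: sum((ai - ti) ** 2 for ai, ti in zip(a, target))
grad_f = lambda a: [2.0 * (ai - ti) for ai, ti in zip(a, target)]
alpha = frank_wolfe_simplex(f, grad_f, d=3)
```

Note that every iterate is a convex combination of simplex points, so feasibility holds by construction and no projection is needed.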


At block 420, an error metric based on the sample covariance matrix and a fixed probe budget value may be optimized. In some examples, an error metric based on the sample covariance matrix and a fixed sampling budget value may be optimized. For example, this may be performed using an approximation algorithm based on the Frank-Wolfe algorithm. The error metric may be associated with an error in estimating path latencies associated with the network. In particular, the error metrics associated with either the E-optimal design or the A-optimal design, described above, may be optimized. As described above, the optimization of an error metric may include solving a semi-definite program using a predetermined fixed probe budget value.


As one example, for the E-optimal design approximation algorithm based on the Frank-Wolfe algorithm, G_α = Σ_{x∈X} α_x xx^T may be the sample covariance matrix for allocation α ∈ Δ(X), where Δ(X) is a simplex with vertices X. In some examples, for the E-optimal design, the error metric to be optimized may be associated with a maximum error in estimating the path latencies associated with the network. To optimize this error metric, the E-optimal design, as described above, may maximize the minimum eigenvalue of the sample covariance matrix:





max_{α∈Δ(X)} λ_min(G_α)   (10)


From the definitions of λ_min and G_α,


λ_min(G_α) = min_{v∈R^d: ∥v∥_2=1} v^T(Σ_{x∈X} α_x xx^T)v = min_{v∈R^d: ∥v∥_2=1} Σ_{x∈X} ∥v^T x∥_2^2 α_x.
Since ∥v^T x∥_2^2 α_x is linear in α for any v, λ_min(G_α) is a minimum of linear functions of α and is therefore concave in α. Thus −λ_min(G_α) is convex in α and equation (10), above, may be solved by an algorithm based on the Frank-Wolfe algorithm. In particular, in this formulation, A = Δ(X), f(α) = −λ_min(G_α), and ∇f(α) may be derived as follows. If x_i is the i-th element in X, then, because v_min^T G_α v_min is linear in α at the minimizing eigenvector v_min,





∇f(α) = −∇λ_min(G_α) = −(∥v_min^T x_1∥_2^2, . . . , ∥v_min^T x_|X|∥_2^2)


where v_min is the eigenvector associated with λ_min(G_α). The time to compute the gradient ∇f(α) may be linear in |X|, which is the number of partial derivatives. Because the time to compute this gradient may be the most computationally-demanding part of this approximation algorithm based on the Frank-Wolfe algorithm, its run time may be close to linear in |X| and the number of iterations. Therefore, the approach may be scalable to large problems.
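The E-optimal gradient computation described above may be sketched as follows; this is an illustrative implementation, with the rows of a hypothetical array X standing in for the feature vectors in X.

```python
import numpy as np

def e_optimal_grad(alpha, X):
    """Gradient of f(alpha) = -lambda_min(G_alpha) for the E-optimal design.

    X is an (n, d) array whose rows stand in for the feature vectors, and
    alpha lies on the simplex. The coordinate for x_i is -(v_min^T x_i)^2,
    where v_min is the eigenvector of G_alpha for its minimum eigenvalue.
    """
    G = (X * alpha[:, None]).T @ X        # G_alpha = sum_x alpha_x x x^T
    _, eigvecs = np.linalg.eigh(G)        # eigh sorts eigenvalues ascending
    v_min = eigvecs[:, 0]                 # eigenvector for lambda_min
    return -((X @ v_min) ** 2)            # one partial derivative per x

# Example with three 2-dimensional feature vectors.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha = np.array([0.4, 0.4, 0.2])
grad = e_optimal_grad(alpha, X)
```

Beyond the single eigendecomposition, the cost is one inner product per element of X, consistent with the near-linear run time in |X| noted above.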


As another example, for the A-optimal design approximation algorithm based on the Frank-Wolfe algorithm, G_α = Σ_{x∈X} α_x xx^T may be the sample covariance matrix for allocation α ∈ Δ(X), where Δ(X) is a simplex with vertices X. In some examples, for the A-optimal design, the error metric to be optimized may be associated with an average error in estimating the path latencies associated with the network. To optimize this error metric, the A-optimal design, as described above, minimizes the trace of the inverse sample covariance matrix:





min_{α∈Δ(X)} tr(G_α^−1)   (11)


Because equation (11), above, may be convex in α, equation (11) may be solved by an algorithm based on the Frank-Wolfe algorithm. In particular, in this formulation, A = Δ(X), f(α) = tr(G_α^−1), and ∇f(α) may be derived as follows. For any invertible matrix M ∈ R^{d×d},





∂tr(M^−1) = tr(∂M^−1) = −tr(M^−1(∂M)M^−1).


Applying this identity to M = G_α and noting that the partial derivative of G_α with respect to α_x may be ∂G_α/∂α_x = xx^T, this yields





∇f(α) = −(tr(G_α^−1 x_1 x_1^T G_α^−1), . . . , tr(G_α^−1 x_|X| x_|X|^T G_α^−1)),


where xi is the i-th element in X. Again, because the time to compute this gradient may be the most computationally-demanding part of this approximation algorithm based on the Frank-Wolfe algorithm, its run time may be close to linear in |X| and the number of iterations. Therefore, the approach may be scalable to large problems.
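The A-optimal gradient above may be sketched similarly; by the cyclic property of the trace, each coordinate tr(G_α^−1 x_i x_i^T G_α^−1) equals ∥G_α^−1 x_i∥_2^2, which the illustrative implementation below exploits. The array X and the example values are hypothetical.

```python
import numpy as np

def a_optimal_grad(alpha, X):
    """Gradient of f(alpha) = tr(G_alpha^{-1}) for the A-optimal design.

    By the cyclic property of the trace, the coordinate for x_i is
    -tr(G^{-1} x_i x_i^T G^{-1}) = -||G^{-1} x_i||_2^2 (G is symmetric).
    """
    G = (X * alpha[:, None]).T @ X        # G_alpha = sum_x alpha_x x x^T
    G_inv_X = np.linalg.solve(G, X.T).T   # row i is G^{-1} x_i
    return -np.sum(G_inv_X ** 2, axis=1)  # one partial derivative per x

# Example with three 2-dimensional feature vectors.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha = np.array([0.4, 0.4, 0.2])
grad = a_optimal_grad(alpha, X)
```

A single linear solve replaces the explicit inverse, after which the per-element work is again linear in |X|.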


At block 430, an allocation of probes may be determined based on the optimized error metric, such as from the solutions of equations (10) or (11). In particular, αx may be determined, for example, from the solutions of equations (10) or (11) above, and this may be directly related to the allocation of probes. The allocation of probes may be associated with at least one feature vector of edge indicators, x, for the network. The allocation of probes may be less than or equal to a predetermined fixed probe budget value. In some examples, rather than an allocation of probes, a sampling rate may be determined using the same technique. The sampling rate may similarly be associated with at least one feature vector of edge indicators for the network.


In some examples, the allocation of probes determined in the last block of process 200, 300, and/or 400, described above, may be compared to a threshold number of probes to ensure that the allocation is valid. For example, the allocation of probes may be compared to a threshold equal to a total budget of n probes, where n may be a positive integer, or to a threshold equal to a minimum number of probes to be used in the network. In some examples, the allocation of probes determined in the last block of process 200, 300, and/or 400, described above, may be displayed together with the associated network topology on a display of an electronic device, such as electronic device 500 described in connection with FIG. 5, on which the process 200, 300, and/or 400 is implemented. In some examples, the allocation of probes and their placements in the network determined in the last block of process 200, 300, and/or 400, described above, may be stored in one or more memories or storage devices associated with an electronic device, such as electronic device 500 described in connection with FIG. 5. The framework and techniques, as described above, may improve the computational efficiency and speed with which any computing system and/or network resources implementing them may solve for an optimal probe allocation vector when compared to conventional frameworks and techniques. As such, the use of the framework and techniques, as described herein, may reduce power consumption and increase the available bandwidth of any computing system and/or network resources implementing them.
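One way the continuous allocation α might be converted into a valid integer probe allocation and checked against a budget threshold is sketched below; the rounding scheme and the function name are hypothetical choices for illustration, not prescribed above.

```python
import numpy as np

def allocate_probes(alpha, n, min_probes=0):
    """Round a simplex allocation alpha into integer probe counts.

    n is the total probe budget. Counts are floored and the leftover
    probes go to the paths with the largest fractional parts, so the
    total never exceeds the budget; the result is also checked against a
    minimum-probe threshold. (The rounding scheme is an illustrative
    choice.)
    """
    raw = alpha * n
    counts = np.floor(raw).astype(int)
    leftover = n - int(counts.sum())
    order = np.argsort(-(raw - counts))   # largest fractional parts first
    counts[order[:leftover]] += 1
    if counts.sum() > n or counts.sum() < min_probes:
        raise ValueError("allocation violates the probe budget constraints")
    return counts

counts = allocate_probes(np.array([0.45, 0.35, 0.2]), n=10)
```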


In some examples, process 200, 300, and/or 400, described above, may be implemented by special purpose logic circuitry, such as a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a general purpose processor, and/or one or more components of a general purpose electronic device, such as electronic device 500, described in connection with FIG. 5, below.


Generalized linear models (GLMs) may extend linear models to non-linear functions in a way that may inherit the beneficial statistical and computational properties of the linear models. Specifically, in a GLM, the measurement y ∈ R at feature vector x ∈ X may have an exponential-family distribution with mean f*(x)=μ(xTθ*), where μ is the mean function and θ* ∈ Rd are model parameters. For any mean function μ, the matrix of second derivatives of the GLM loss at solution θ*, also called the Hessian, may be






H = Σ_{i=1}^n μ̇(x_i^T θ*) x_i x_i^T   (12)


where μ̇ may be the derivative of the mean function μ. This Hessian may play the same role as the sample covariance matrix G in equation (5), above. Therefore, up to the factor of μ̇(x_i^T θ*), the optimal designs, described above, may be used to minimize the maximum and average errors in GLMs.


In some examples, Poisson regression, an instance of GLMs, may be used to estimate packet losses. In particular, f*(x)=exp[xTθ*] may be the probability that the packet is not dropped on path x ∈ X and θ* ∈ Rd may be a vector of log-probabilities that packets are not dropped on all path edges.
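The following toy numerical example, under assumed per-edge delivery probabilities, illustrates this model of packet loss and the corresponding Hessian term from equation (12): here μ(u) = exp(u), so μ̇ = μ and the weight on x_i x_i^T is exp(x_i^T θ*). The edge probabilities and the path are hypothetical.

```python
import numpy as np

# Assumed per-edge probabilities that a packet is NOT dropped (illustrative).
edge_keep_prob = np.array([0.99, 0.95, 0.90])
theta_star = np.log(edge_keep_prob)      # log-probabilities, as above

# A path is an edge-indicator vector x; its end-to-end delivery
# probability is the product of its edge probabilities, exp(x^T theta*).
x = np.array([1.0, 1.0, 0.0])            # path using edges 0 and 1
path_prob = np.exp(x @ theta_star)       # = 0.99 * 0.95

# One Hessian term from equation (12): for mu(u) = exp(u) the derivative
# is mu itself, so the weight on x x^T is exp(x^T theta*).
H_term = np.exp(x @ theta_star) * np.outer(x, x)
```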


The framework and techniques, as described herein, may be extended to non-linear models. A non-linear mapping of features into a lower, r-dimensional space may be learned in order to produce a linear regression problem in the new space. In particular, a function g: R^d → R^r such that f*(x) ≈ g(x)^T θ* may be learned.


The function g may be learned as follows. Let m be the number of tasks and D_j be the dataset corresponding to task j ∈ [m]; in networking, the task j may be viewed as an inference problem on day j, which may be accompanied by its training set D_j. The function g may be learned by solving





min_g min_{θ_1, . . . , θ_m} Σ_{j=1}^m Σ_{(x,y)∈D_j} (g(x)^T θ_j − y)^2   (13)


where θj ∈ Rr may be optimized model parameters for task j.


A key structure in the above loss function may be that the function g is shared among the tasks. Therefore, the minimization over g may lead to learning a common embedding g, which may essentially be a compression of the features x, that may be useful for solving all tasks using linear regression. The above optimization problem may be solved using a multi-headed neural network, where g is the body of the network and head j outputs g(x)^T θ_j, which may be the predicted value at feature vector x for task j.
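The shared-embedding optimization in (13) may be sketched under a simplifying assumption: the embedding is linear, g(x) = Wx, standing in for the neural-network body, so that (13) can be solved by alternating least squares. The synthetic data, names, and dimensions below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m, n = 6, 2, 4, 200        # feature dim, embedding dim, tasks, samples

# Synthetic tasks that share a true linear embedding (stand-in for g).
W_true = rng.standard_normal((r, d))
tasks = []
for _ in range(m):
    theta = rng.standard_normal(r)
    X = rng.standard_normal((n, d))
    tasks.append((X, X @ W_true.T @ theta))   # noiseless targets

# Alternating least squares on (13) with g(x) = W x: fix W and solve each
# theta_j, then fix the theta_j and solve for W, flattening W row-major
# via the identity theta^T W x = kron(theta, x)^T vec(W).
W = rng.standard_normal((r, d))
for _ in range(100):
    thetas = [np.linalg.lstsq(X @ W.T, y, rcond=None)[0] for X, y in tasks]
    A = np.vstack([np.kron(th, X) for (X, _), th in zip(tasks, thetas)])
    b = np.concatenate([y for _, y in tasks])
    W = np.linalg.lstsq(A, b, rcond=None)[0].reshape(r, d)

thetas = [np.linalg.lstsq(X @ W.T, y, rcond=None)[0] for X, y in tasks]
loss = sum(float(np.sum((X @ W.T @ th - y) ** 2))
           for (X, y), th in zip(tasks, thetas))
```

Each alternating step is itself a least-squares problem, mirroring how the neural-network body and per-task heads would be trained jointly.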


The framework and techniques described herein may use a statistical approach to monitoring large scale network infrastructure. The statistical approach may have provable guarantees on the quality of estimates under the constraint of a limited probing budget. Such an approach may be used in any production network to measure most network performance metrics. For example, such an approach may be used when the metric estimation problem may be expressed as a regression problem. The operational use of such framework and techniques, as described herein, does not face the obstacle of computation time of optimal strategies as faced by conventional techniques. The use of such framework and techniques makes no assumption about network switch/router hardware capabilities. In some examples, such framework and techniques may operate in a target budget mode and/or a target measure accuracy mode. As can be shown through simulation and experimentation on real network topologies such as those supporting global cloud services, such framework and techniques may estimate latency and loss with low error, even in the case of low probing budgets. The framework and techniques, as described herein, may outperform conventional techniques, such as a uniform probing strategy, generally used in production networks. The framework and techniques used to determine an allocation of probes to monitor a network, as described herein, may produce more accurate network monitoring than conventional techniques, may be more computationally feasible, may be more efficient than conventional techniques, may use fewer probes than conventional techniques, and may be able to scale better than conventional techniques. In addition, such framework and techniques may be well suited for optimizing probing for network link loss and latency estimation.


The framework and techniques, as described herein, may be extended to other telemetry tasks, such as estimating available bandwidth, flow size distribution, or top flows, in a straightforward manner. For example, the extension to other telemetry tasks may be possible when the estimation problem may be formulated as regression. In addition, in some examples, as described above, it may be possible to learn a non-linear embedding that may transform potentially any estimation problem into regression, with appropriate features. In these examples, it may then be possible to apply the A- and E-optimal designs to the new learned features, as described above. Additionally, the framework and techniques, as described herein, may be extended to optimize for sampling budgets instead of or in addition to optimizing for probing budgets. In some examples, optimizing for sampling budgets may be possible as long as the packet count is known for each end-to-end path, in which case the optimization may be performed with an additional constraint, such as not sampling traffic on a path or link that does not carry any traffic.


The framework and techniques, as presented herein, may be used to offer an operational probing and/or sampling service that may accept the following inputs: a topology, a path level, such as OD pairs, a traffic matrix with a packet count, a telemetry function, such as median latency, flow size distribution, and link failures, and either a probing and/or sampling budget or a target statistical accuracy for a metric to compute. The service may output an optimal sampling and/or probing strategy to perform in hosts. In some examples, the optimal strategy may be related to optimal probe/monitor placement in which determinations are made as to where to monitor the network and at what sampling rate in order to perform accurate measurement of network metrics with limited impact on network resources. Controlling the accuracy of the performance estimation may be used to monitor at scale and at low time granularity in hyper-scale networks.



FIG. 5 depicts a block diagram of an example electronic device 500. Electronic device 500 may be any computing device. The electronic device 500 may include one or more processors 510, system memory 520, a bus 530, networking interface(s) 540, and other components (not shown), such as storage(s), output device interface(s), and input device interface(s). The bus 530 may be used for communicating between the processor 510, the system memory 520, the networking interface(s) 540, and the other components. Any or all components of electronic device 500 may be used in conjunction with the subject of the present disclosure.


Depending on the desired configuration, the processor 510 may be of any type including but not limited to a tensor processing unit (TPU), a microprocessor, a microcontroller, a digital signal processor (DSP), or any combination thereof. The processor 510 may include one or more levels of caching, such as a level one cache 511 and a level two cache 512, a processor core 513, and registers 514. The processor core 513 may include one or more arithmetic logic units (ALUs), one or more floating point units (FPUs), one or more DSP cores, or any combination thereof. A memory controller 515 may also be used with the processor 510, or in some implementations the memory controller 515 may be an internal part of the processor 510.


Depending on the desired configuration, the system memory 520 may be of any type including but not limited to volatile memory, such as RAM, non-volatile memory, such as ROM, flash memory, etc., multiples of these memories, or any combination thereof. The system memory 520 may include an operating system 521, one or more applications 522, and program data 524, which may include service data 525. Program data 524 may include instructions, stored on a non-transitory computer-readable medium, that, when executed by the one or more processing devices, implement a process 523 for determining an allocation of probes to monitor a network. In some examples, the one or more applications 522 may be arranged to operate with program data 524 and service data 525 on an operating system 521. The framework and techniques, as described herein, may improve the functioning of a computing and/or electronic device, such as electronic device 500. In addition, the framework and techniques, such as process 523 as described herein, may increase the computational efficiency and speed with which any computing system and/or network resources implementing them may solve for an optimal probe allocation vector when compared to conventional frameworks and techniques.


The electronic device 500 may have additional features or functionality, and additional interfaces to facilitate communications between the basic configuration 501 and any required devices and interfaces.


Physical memory 520 may be an example of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, or any other medium which can be used to store the desired information and which can be accessed by electronic device 500. Any such computer storage media can be part of the device 500.


Network interface(s) 540 may couple the electronic device 500 to a network (not shown) and/or to another electronic device (not shown). In this manner, the electronic device 500 can be a part of a network of electronic devices, such as a local area network (“LAN”), a wide area network (“WAN”), an intranet, or a network of networks, such as the Internet. In some examples, the electronic device 500 may include a network connection interface for forming a network connection to a network and a local communications connection interface for forming a tethering connection with another device. The connections may be wired or wireless. The electronic device 500 may bridge the network connection and the tethering connection to connect the other device to the network via the network interface(s) 540.


Aspects of the present disclosure may be implemented as a computer implemented process, a system, or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by an electronic device and may comprise instructions for causing an electronic device or other device to perform processes and techniques described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, solid state memory, flash drive, and/or other memory or other non-transitory and/or transitory media. Aspects of the present disclosure may be performed in different forms of software, firmware, and/or hardware. Further, the teachings of the disclosure may be performed by an application specific integrated circuit (ASIC), field programmable gate array (FPGA), or other component, for example.


Aspects of the present disclosure may be performed on a single device or may be performed on multiple devices. For example, program modules including one or more components described herein may be located in different devices and may each perform one or more aspects of the present disclosure. As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.


The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art.


Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the examples should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible examples. Further, the same reference numbers in different drawings can identify the same or similar elements.


Numerous examples are described in the present application, and are presented for illustrative purposes only. The described examples are not, and are not intended to be, limiting in any sense. One of ordinary skill in the art will recognize that the disclosed subject matter may be practiced with various modifications and alterations, such as structural, logical, software, and electrical modifications. It should be understood that the described features are not limited to usage in the one or more particular examples or drawings with reference to which they are described, unless expressly specified otherwise.

Claims
  • 1. A method for determining an allocation of probes to monitor a network, the method comprising: determining a sample covariance matrix for the network;optimizing, using one or more processors, an error metric based on the sample covariance matrix and a fixed probe budget value, wherein the error metric is associated with an error in estimating path latencies associated with the network; anddetermining, using the one or more processors, the allocation of probes based on the optimized error metric, wherein the allocation of probes is associated with at least one feature vector of edge indicators for the network.
  • 2. The method of claim 1, wherein the optimizing the error metric comprises solving a semi-definite program using the fixed probe budget value.
  • 3. The method of claim 1, wherein the optimizing the error metric comprises using an approximation algorithm based on a Frank-Wolfe algorithm.
  • 4. The method of claim 3, wherein the approximation algorithm comprises determining a gradient of a function of a minimum eigenvalue of the sample covariance matrix.
  • 5. The method of claim 3, wherein the approximation algorithm comprises determining a gradient of a function of a trace of a matrix computed using an inverse of the sample covariance matrix.
  • 6. The method of claim 1, wherein the error metric is associated with a maximum error in estimating the path latencies associated with the network.
  • 7. The method of claim 1, wherein the error metric is associated with an average error in estimating the path latencies associated with the network.
  • 8. The method of claim 1, wherein the allocation of probes is less than or equal to the probe budget value, and wherein the vector of edge indicators for the network indicates whether there is an edge between nodes in the network.
  • 9. A system for determining an allocation of probes to monitor a network, the system comprising: one or more memories; andone or more processors in communication with the one or more memories, the one or more processors configured to: determine a sample covariance matrix for the network;optimize an error metric based on the sample covariance matrix and a fixed probe budget value, wherein the error metric is associated with an error in estimating path latencies associated with the network;determine the allocation of probes based on the optimized error metric, wherein the allocation of probes is associated with at least one feature vector of edge indicators for the network; andstore the allocation of probes in the one or more memories.
  • 10. The system of claim 9, wherein the one or more processors are further configured to solve a semi-definite program using the fixed probe budget value.
  • 11. The system of claim 9, wherein the one or more processors are further configured to use an approximation algorithm based on a Frank-Wolfe algorithm.
  • 12. The system of claim 11, wherein the approximation algorithm comprises determining a gradient of a function of a minimum eigenvalue of the sample covariance matrix.
  • 13. The system of claim 11, wherein the approximation algorithm comprises determining a gradient of a function of a trace of a matrix computed using an inverse of the sample covariance matrix.
  • 14. The system of claim 9, wherein the error metric is associated with a maximum error in estimating the path latencies associated with the network.
  • 15. The system of claim 9, wherein the error metric is associated with an average error in estimating the path latencies associated with the network.
  • 16. The system of claim 9, wherein the allocation of probes is less than or equal to the probe budget value, and wherein the vector of edge indicators for the network indicates whether there is an edge between nodes in the network.
  • 17. A non-transitory computer-readable medium storing instructions, that when executed by one or more processors, cause the one or more processors to: determine a sample covariance matrix for the network;optimize an error metric based on the sample covariance matrix and a fixed probe budget value, wherein the error metric is associated with an error in estimating path latencies associated with the network; anddetermine the allocation of probes based on the optimized error metric, wherein the allocation of probes is associated with at least one feature vector of edge indicators for the network.
  • 18. The non-transitory computer-readable medium of claim 17, wherein the instructions, that when executed by one or more processors, further cause the one or more processors to use an approximation algorithm based on a Frank-Wolfe algorithm.
  • 19. The non-transitory computer-readable medium of claim 18, wherein the approximation algorithm comprises determining a gradient of a function of a minimum eigenvalue of the sample covariance matrix or a function of a trace of a matrix computed using an inverse of the sample covariance matrix.
  • 20. A method for determining a sampling rate for monitoring a network, the method comprising: determining a sample covariance matrix for the network;optimizing, using one or more processors, an error metric based on the sample covariance matrix and a fixed sampling budget value, wherein the error metric is associated with an error in estimating path latencies associated with the network; anddetermining, using the one or more processors, the sampling rate based on the optimized error metric, wherein the sampling rate is associated with at least one feature vector of edge indicators for the network.
  • 21. The method of claim 20, wherein the optimizing the error metric comprises using an approximation algorithm based on a Frank-Wolfe algorithm.
Priority Claims (1)
Number Date Country Kind
20210100818 Nov 2021 GR national