The invention relates to wireless communication and, more specifically, to spectrum cartography.
All wireless transmissions use a portion of the radio frequency spectrum. Cellular phones, broadcast television, satellite, and short-distance wireless networks such as Bluetooth and wireless local area networks (WLAN) utilize different portions of the Wi-Fi, for example, typically use wireless frequency spectrum. Often it is important to coordinate the use of the various technologies and frequency ranges to ensure that the technologies do not interfere with each other or with planned future services.
This disclosure describes techniques for constructing power spectral density (PSD) maps representative of the distribution of radio frequency (RF) power as a function of both frequency and space (geographic location). For example, the disclosure describes techniques for construction of PSD maps using non-parametric basis pursuit techniques for signal expansion to model the PSD across space and frequency.
For example, a most parsimonious sparse signal expansion using an overcomplete basis set may be used to constructing the PSD maps. Moreover, a non-parametric basis expansion model of the RF power over frequency and location may be computed, and sparse coefficients computed for the basis expansions functions of the model may entail bases weighting functions instead of scalars. Utilizing this model for signal expansion, many of the weighting functions may be zero; however, a few bases can be identified and selected that more accurately represent the transmitted signals.
In one example, a collaborative scheme is utilized in which cognitive radios cooperate to localize active primary transmitters and reconstruct the power spectral density (PSD) maps (one per frequency band) portraying the power distribution across space. The sensing scheme may utilize a parsimonious linear system model that accounts for the narrow-band nature of transmit-PSDs compared to the large swath of sensed frequencies, and for the group sparsity emerging when adopting a spatial grid of candidate primary user locations. Combining the merits of Lasso, group Lasso, and total least-squares (TLS), the proposed group sparse (GS) TLS approach yields hierarchically-sparse PSD estimates, and copes with model uncertainty induced by channel randomness and grid mismatch effects. Taking advantage of a novel low-complexity solver for the GS-Lasso, a block coordinate descent scheme is developed to solve the formulated GS-TLS problem.
In some examples, a spline-based approach to field estimation is described as one example of a non-parametric basis expansion model of a field of interest. Other example kernel-based interpolation functions include sinc-based kernels. In some examples, the model entails known bases, weighted by generic functions estimated from the field's noisy samples.
A novel field estimator is also described based on a regularized variational least-squares (LS) criterion that yields finitely-parameterized (function) estimates spanned by thin-plate splines. In this way, an over-complete set of (possibly overlapping) basis functions may be used, while a sparsifying regularizer augmenting the LS cost endows the estimator with the ability to select a few of these bases that “better” explain the data using a parsimonious field representation. The sparsity-aware spline-based examples of this disclosure induce a group-Lasso estimator for the coefficients of the thin-plate spline expansions per basis. A distributed algorithm is also developed to obtain the group-Lasso estimator using a network of wireless sensors, or, using multiple processors to balance the load of a single computational unit.
In this way, a basis expansion approach is described in this disclosure to estimate a multi-dimensional field, whose dependence on a subset of its variables is modeled through selected (and generally overlapping) basis functions weighted by unknown coefficient-functions of the remaining variables. The unknown coefficient functions, referred to herein as bases weighting functions and as kernel-based interpolating functions, operate to interpolate the PSD measurements across space using kernel-based method. In one example, the bases weighting functions can be estimated from the field's noisy samples, e.g., by solving a variational thin-plate smoothing spline cost regularized by a term that performs basis selection. The result yields a parsimonious description of the field by retaining those few members of the basis set that “better” explain the data. This attribute is achieved because the added penalty induces a group-Lasso estimator on the parameters of the radial kernels and polynomials. Notwithstanding, group-Lasso here is introduced to effect sparsity in the space of smooth functions. Moreover, in some example implementations, a plurality of different types of kernel-based interpolation functions are used to interpolate the PSD measurements across space. In such cases, multiple kernel can be incorporated within the non-parametric basis expansion model.
Another contribution is in the context of wireless cognitive radio (CR) network sensing (the overarching practical motivation here), where the estimated field enables cartographing the space-frequency distribution of power generated by active RF sources. Using periodogram samples collected by spatially distributed CRs, the sparsity-aware spline-based estimator yields an atlas of PSD maps (one map per frequency). A provably convergent distributed algorithm is described using AD-MoM iterations, to obtain the required group-Lasso estimator using the network of CRs. As corroborated by simulations and tests on real data, the atlas enables localizing the sources and discerning their transmission parameters, even in the presence of frequency-selective Rayleigh fading and pronounced shadowing effects. Simulated tests also confirmed that the sparsity-promoting regularization is effective in selecting those basis functions that strongly influence the field, when the tuning parameters are cross-validated properly.
The techniques may have many applications of the PSD maps, including the distribution of RF power in space and frequency. For example, PSD maps may be used to reveal spatial locations where idle frequency bands can be reused for transmission, even when fading and shadowing effects are pronounced. This information may be useful to wireless service providers in network planning and maintenance. As other example, the PSD maps may be used be devices, such as cellular phones, base stations or wireless-enabled tablet or computing devices, to sense and utilize opportunities within the spectrum for communication, such as dynamic re-use of otherwise pre-allocated frequency bands.
The model and the resultant (parsimonious) estimates of the PSD maps can be used in more general statistical inference and localization problems, whenever the data admit a basis expansion over a proper subset of its dimensions. Furthermore, results in this disclosure extend to kernels other than radial basis functions, whenever the smoothing penalty is replaced by a norm induced from an RKHS.
In one example implementation, techniques are used for cross fertilizing sparsity-aware signal processing tools with kernel-based learning. The disclosure describes non-parametric basis pursuit as foundation for sparse kernel-based learning (KBL), including blind versions that emerge as nonparametric nuclear norm regularization and dictionary learning.
Moreover, KBL is connected herein with Gaussian processes (GP) analysis, including an example implementation of a Bayesian viewpoint in which kernels convey prior information. Alternatively, KBL can be regarded as an interpolation toolset though its connection with the Nyquist-Shannon Theorem (NST), indicating that the impact of a prior model choice is attenuated when the size of the dataset is large, especially when kernel selection is also incorporated.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
In general, the PSD maps representative of the distribution of radio frequency (RF) power as a function of both frequency and spatial location within the geographic region. In the example of
With respect to
As described herein, FC 16 computes sparse coefficients for the basis expansions as bases weighting functions, i.e., the kernel-based interpolating functions used to interpolate across space, instead of scalars to more accurately represent the signals transmitted from RF-enabled devices 15. As used herein, a nonparamentric basis expansion model refers to a basis expansion model where scaling coefficients of the reference basis functions are computed as basis weighting functions that belong to a function space instead of scalars, as used in a parametric basis expansion model. As such, in accordance with the techniques described herein, kernel-based interpolating functions that interpolate the PSD across space are computed as coefficients to the reference basis functions. In this way, the kernel-based interpolating functions operate as bases weighting functions in the non-parametric basis expansion model. Moreover, as described in further detail below, in some example, a number of different types of kernels may be used to accurately model the physical (i.e., geographic) distribution of RF power over the region so as to enhance the ability of the model to accurately reflect fading and multi-path characteristics of the environment. In this way, multiple different types of kernel-based interpolating functions may be incorporate to interpolate the PSD measurements and used as basis weighting functions within the model.
In some example, FC 16 may perform a most parsimonious sparse signal expansion using an overcomplete basis set to compute the basis expansion model and construct one or more the PSD maps. That is, examples of the expansion model described herein may utilize a sparsest subset of the overcomplete set of bases when computing the model to construct the PSD maps.
In one example, FC 16 includes a computing system of one or more computing devices that employ a novel field estimator based on a regularized variational least-squares (LS) criterion that yields finitely-parameterized (function) estimates spanned by thin-plate splines. An over-complete set of (possibly overlapping) basis functions may be used, while a sparsifying regularizer augmenting the LS cost endows the estimator with the ability to select a few of these bases that “better” explain the data using a parsimonious field representation.
In general, spline-based field estimation involves approximating a deterministic map g:Rn→R from a finite number of its noisy data samples, by minimizing a variational least-squares (LS) criterion regularized with a smoothness-controlling functional. In the dilemma of trusting a model (parametric) versus trusting the data (non-parameteric), splines favor the latter since only a mild regularity condition is imposed on the derivatives of g, which is otherwise treated as a generic function. While this generality is inherent to the variational formulation, the smoothness penalty renders the estimated map unique and finitely parameterized. With the variational problem solution expressible by polynomials and specific kernels, the aforementioned map approximation task reduces to a parameter estimation problem. Consequently, thin-plate splines operate as a reproducing kernel Hilbert space (RKHS) learning machine in a suitably defined (Sobolev) space.
Although splines emerge as variational LS estimators of deterministic fields, they are also connected to classes of estimators for random fields. The first class assumes that estimators are linearly related to the measured samples, while the second one assumes that fields are Gaussian distributed. The first corresponds to the Kriging method while the second to the Gaussian process model; but in both cases one deals with a best linear unbiased estimator (BLUE). Typically, wide sense stationarity is assumed for the field's spatial correlation needed to form the BLUE. The so-termed generalized covariance model adds a parametric nonstationary term comprising known functions specified a priori. Inspection of the BLUE reveals that if the nonstationary part is selected to comprise polynomials, and the spatial correlation is chosen to be the splines kernel, then the Kriging, Gaussian process, and spline-based estimators coincide.
Bearing in mind this unifying treatment of deterministic and random fields, one example technique described in this disclosure is spline-based estimation, and the practically motivated sparse (and thus parsimonious) description of the wanted RF field for the spatial region of interest. Toward these goals, the following basis expansion model (BEM) is adopted for the computing PSD maps 17:
with x∈R2, f∈R, and the L2-norms {∥bv(f)∥L
The bases {bv(f)}v=1N
Consider selecting Nb basis functions using the basis pursuit approach, which entails an extensive set of bases thus rendering Nb overly large and the model overcomplete. This motivates augmenting the variational LS problem with a suitable sparsity-encouraging penalty, which endows the map estimator with the ability to discard factors gv(x)bv(f), only keeping a few bases that “better” explain the samples acquired by CRs12. This attribute is inherited because the novel sparsity-aware spline-based method of this disclosure induces a group-Lasso estimator for the coefficients of the optimal finitely-parameterized gv. Group-Lasso estimators set groups of weak coefficients to zero (here the Nb groups associated with coefficients per gv), and outperform the sparsity-agnostic LS estimator by capitalizing on the sparsity present.
In one example, the novel estimator of FC 16 applies an iterative group-Lasso algorithm that yields closed-form estimates per iteration. In another example, CRs 12 or any other plurality of processing centers implement a distributed version of this algorithm, described herein, for the samples collected by the CRs. This provides for computational load-balancing of multi-device or multiprocessor architectures.
In one example, FC 16 implements the basis expansion model described herein for spectrum cartography for wireless cognitive radio (CR) networks. CR technology may be used to address bandwidth under-utilization versus spectrum scarcity, which has rendered fixed-access communication networks inefficient. CRs 12 sense the ambient interference spectrum to enable spatial frequency reuse and allow for dynamic spectrum allocation in accordance with the PSD maps 17 computed by FC 16. Collaboration among CRs 12 can markedly improve the sensing performance and reveal opportunities for spatial frequency reuse. Unlike existing approaches that have mostly relied on detecting spectrum occupancy per radio, the techniques described herein account for spatial changes in the radio frequency (RF) ambiance, especially at intended receiver(s) which may reside several hops away from the sensed area.
The novel field estimation techniques applied in system 10 is a collaborative sensing scheme whereby receiving CRs 12 and FC 16 cooperate to estimate the distribution of power in space x and frequency f, namely the power spectrum density (PSD) map Φ(x,f), from local periodogram measurements. In examples, the estimator is precise enough to identify spectrum holes, which justifies adopting the known bases to capture the PSD frequency dependence. As far as the spatial dependence is concerned, the techniques account for path loss, fading, mobility, and shadowing effects, all of which vary with the propagation medium. For this reason, the collected PSD observations 14 data is used dictate the spatial component. Determination of the PSD at any location may allow remote CRs 12 to reuse dynamically idle bands. It also enables CRs 12 to adapt their transmit-power so as to minimally interfere with licensed transmitters. The spline-based PSD map here provides an alternative to conventional techniques, where known bases are used both in space and frequency. In one example, the field estimator here does not presume a spatial covariance model or pathloss channel model. Moreover, it captures general propagation characteristics including both shadowing and fading.
In this disclosure, operators , (.)′, tr(.), rank(.), bdiag(.), E[•] will denote Kronecker product, transposition, matrix trace, rank, block diagonal matrix and expectation, respectively; |.| will be used for the cardinality of a set, and the magnitude of a scalar. The L2 norm of function b:R→R is
while the lp norm of vector x∈Rp is
for p≧1; and |M|F:=√{square root over (tr(MM′))} is the matrix Frobenius norm. Positive definite matrices will be denoted by >0. The p×p identity matrix will be represented by Ip, while 0p will denote the p×1 vector of all zeros, and 0p×q:=0p0′q. The i-th vector in the canonical basis for Rp will be denoted by ep,i, i=1, . . . , p.
Basis Expanion Model for Spectrum Cartography
Consider a set of Ns sources transmitting signals {us(t)}s=1N
where the basis bv(f) is centered at frequency fv, v=1, . . . , Nb. The example depicted in
The power transmitted by a source s (e.g., any of RF-enabled devices 15) will propagate to the location x∈R2 according to a generally unknown spatial loss function ls(x):R2→R. Specifically, ls takes the form ls(x):=E[|Hsx(f)|2], where Hsx stands for the frequency response of the channel from source s to the receiver positioned at x. The propagation model ls(x) not only captures frequency-flat deterministic pathloss, but also stationary, block-fading and even frequency-selective Rayleigh channel effects, since their statistical moments do not depend on the frequency variable. In this case, the following vanishing memory assumption is required on the transmitted signals for the spatial receive-PSD Φ(x,f) to be factorizable as ls(x)Φs(f).
Sources 15 represented as {us(t)}s=1N
As such, the contribution of source s to the PSD at point x is
and the PSD due to all sources received at x will be given by
Note that Φ(x,f) is not time dependent, but takes into account the randomness of the channels. Such spatial PSD model can be simplified by defining the function
With this definition and upon exchanging the order of summation, the spatial PSD model takes form, where functions {gv(x)}v=1N
In one example, FC 16 and CRs 12 applies cooperative Spline-Based PSD Field Estimation to compute PSDs 17. In this example, the sensing strategy relies on the periodogram estimate φrn(τ) at a set of receiving (sampling) locations X:={xr}r=1N
Hence, the envisioned setup consists of Nr receiving CRs, which collaborate to construct the PSD map based on PSD observations {φrn}. The bulk of processing is performed centrally at a fusion center (FC), which is assumed to know the position vectors X of all CRs, and the sensed tones in F. The FC receives over a dedicated control channel, the vector of samples φr:=[φr1, . . . , φrN]′∈RN taken by node r for all r=1, . . . , Nr.
While a BEM could be introduced for the spatial loss function ls(x) as well, the uncertainty on the source locations and obstructions in the propagation medium may render such a model imprecise. This will happen, e.g., when shadowing is present. Instead, the techniques described here utilize estimation of the functions gv(x) based on the data {φ>rn}. To capture the smooth portions of Φ(x,f), the criterion for selecting gv(x) will be regularized using a so termed thin-plate penalty. This penalty extends to R2 the one-dimensional roughness regularization used in smoothing spline models. Accordingly, functions {gv}v=1N
where ∥∇2gv∥F denotes the Frobenius norm of the Hessian of gv.
The optimization is over S, the space of Sobolev functions. The parameter λ≧0 controls the degree of smoothing. Specifically, for λ=0 the estimates in (eq. 4) correspond to rough functions interpolating the data; while as λ→∞ the estimates yield linear functions (cf. ∇2ĝv(x)=02×2). A smoothing parameter in between these limiting values will be selected using a leave-one-out cross-validation (CV) approach, as discussed later.
The thin-plate splines solution is now described. The optimization problem (eq. 4) is variational in nature, and in principle requires searching over the infinite-dimensional functional space S. It turns out that (eq. 4) admits closed-form, finite dimensional minimizers ĝvx), as presented in the following proposition.
Proposition 1: The estimates {ĝv}v=1N
where K(ρ):=ρ2log(ρ), and βv:=[βv1, . . . , βvN]′ is constrained to the linear subspace β:={β∈N
If the basis functions have finite supports which do not overlap, then (eq. 4) decouples per gv. One novelty of Proposition 1 is that the basis functions with spatial spline coefficients in (eq. 1) are allowed to be overlapping. The implication of Proposition 1 is a finite parametrization of the PSD map [cf. (eq. 5)]. This is particularly important for non-FDMA based CR networks. In the forthcoming description, an overcomplete set is adopted in (eq. 1), and overlapping bases naturally arise therein.
What is left to determine are the parameters
in (eq. 5). To this end, define the vector
containing the network-wide data obtained at all frequencies in F. Three matrices are also introduced collecting the regression inputs: i) T∈RN
ii) B∈RN×N
Upon plugging (eq. 5) into (eq. 4), the optimal {α,β} satisfy the following system of equations:
(BQ′2)φ=[(B′BQ′2KQ2)+NrNλIN
[ΓR]{circumflex over (α)}=(Ω′1Q′1)φ−(ΓQ′1KQ2){circumflex over (γ)} (7)
{circumflex over (β)}=(IN
Matrix Q′2KQ2 is positive definite, and rank(ΓR)=rank(Γ)rank(R). It thus follows that (eq 6)-(eq. 7) admit a unique solution if and only if Γ and R are invertible (correspondingly, B and T have full column rank). These conditions place practical constraints that should be taken into account at the system design stage. Specifically, T has full column rank if and only if the points in X, i.e., the CR locations, are not aligned. Furthermore, B will have linearly independent columns provided the basis functions {bv(f)}v=1N
The condition on X does not introduce an actual limitation as it can be easily satisfied in practice, especially when CRs 12 are randomly deployed. Likewise, the basis set is part of the system design, and can be chosen to satisfy the conditions on B.
Therefore, in one example, FC 16 may execute program to perform the following method constituting the spline-based spectrum cartography algorithm, which amounts to estimating Φ(x,f):
In one embodiment, FC 16 includes PSD tracker 24 that adapts to time-varying PSDs. The real-time requirements on the sensing radios and the convenience of an estimator that adapts to changes in the spectrum map are the motivating reasons behind the PSD tracker introduced in this section. The spectrum map estimator will be henceforth denoted by Φ(x,f,τ), to make its time dependence explicit.
PSD tracker 24 defines the vector φn(τ):=[φln(τ), . . . , (φN
. Per time-slot τ=1, 2, . . . , the periodogram φ(τ) is averaged using the following adaptive counterpart of (eq. 3):
which implements an exponentially weighted moving average operation with forgetting factor δ∈(0,1). For every τ, the online estimator Φ(x,f,τ) is obtained by plugging in (eq. 1) the solution {ĝv(x,τ)}v=1N
Suppose that per time-slot τ, FC 16 receives raw periodogram samples φ(τ) from the CRs in order to update Φ(x,f,τ). The techniques described herein apply for every τ, meaning that {ĝv(x,τ)}v=1N
̂β(τ)=δβ(τ−1)+(IN
̂α(τ)=δα(τ−1)+G2φ(τ) (11)
where the time-invariant matrices G1 and G2 are
Recursions (eq. 10)-(eq. 11) provide a means to update Φ(x,f,τ) sequentially in time, by incorporating the newly acquired data from the CRs in φ(τ). There is no need to separately update φ(τ) as in (eq. 9), yet the desired averaging takes place. Furthermore, matrices G1 and G2 need to be computed only once, during the startup phase of the network.
In one example, FC 16 utilizes Group-Lasso on Splines as an improved spline-based PSD estimator to fit the unknown spatial functions {gv}v=1N
with a large (Nb>>NrN), and a possibly overcomplete set of known basis functions {bv}v=1N
In this context, the envisioned estimation method should provide the CRs with the capability of selecting a few bases that “better explain” the actual transmitted signals. As a result, most functions gv are expected to be identically zero; hence, there is an inherent form of sparsity present that can be exploited to improve estimation. The rationale behind the proposed approach can be rooted in the basis pursuit principle, a technique for finding the most parsimonious sparse signal expansion using an overcomplete basis set. A major differentiating aspect however, is that the sparse coefficients in the basis expansions are not treated as scalars but instead model (eq. 1) implemented by FC 16 entails bases weighted by functions gv.
The proposed approach to sparsity-aware spline-based field estimation from the space-frequency power spectrum measurements φrn [cf. (eq. 3)], is to obtain {ĝv}v=1N
Relative to (eq. 4), the cost here is augmented with an additional regularization term weighted by a tuning parameter μ≧0. If μ=0 then (eq. 12) boils down to (eq. 4). To appreciate the role of the new penalty term, note that the minimization of
intuitively shrinks all pointwise functional values {gv(X1), . . . , gv(XNr)} to zero for sufficiently large μ. Interestingly, it will be shown in the ensuing section that this is enough to guarantee that ĝv(x)≡0 ∀x, for μ large enough.
Estimation using the group-Lasso technique is described. Consider linear regression, where a vector y∈Rn of observations is available, along with a matrix X∈Rn×p of inputs. The group Lasso estimate for the vector of features ζ:=[ζ′1, . . . , ζ′N
This criterion achieves model selection by retaining relevant factors ζv∈Rp/N
The connection between (eq. 13) and the spline-based field estimator (eq. 12) builds on Proposition 1, which still holds in this context. That is, even though criteria (eq. 4) and (eq. 12) purposely differ, their respective solutions ĝv(x) have the same form in (eq. 5). The essential difference manifested by this penalty is revealed when estimating the parameters α and β in (eq. 5), as presented in the following proposition:
Proposition 2: The spline-based field estimator (12) is equivalent to group-Lasso (13), under the identities
with their respective solutions related by
where βv:=[βv1, . . . , βvN
The factors {ζv}v=1N
and thus yields parsimonious PSD map estimates. All in all, the motivation behind the variational problem (eq. 12) is now unravelled. The additional penalty term not present in (eq. 4) renders (eq. 12) equivalent to a group-Lasso problem. This enforces sparsity in the parameters of the splines expansion for Φ(x,f) at a factor level, which potentially nulls the less descriptive functions gv.
The group-Lassoed splines-based approach to spectrum cartography applied by one example of FC 16 can be summarized in the following method to estimate the global PSD map Φ(x,f):
Implementing Steps 1-Stes 4 presumes that CRs 12 communicate their local PSD estimates to a fusion center, which uses their aggregation in φ to estimate the field. In certain examples, a designer or system operator of system 10 may forgo FC 16 to avoid an isolated point of failure, or to aims at a network topology which scales well with an increasing number of CRs 12 based on power considerations. For example, CRs located far away from the FC will drain their batteries more to reach the FC. A fully distributed counterpart of system 10 is described next.
For the purpose of estimating an unknown vector ζ:=[ζ′1, . . . , ζ′N
where y:=[y′1, . . . , y′N
corresponding to the identifications nr=N ∀r ∈ R, p=NbNr. Note that because the locations {xr} are assumed known to the entire network, CRr can form matrices T, K, and thus, the local regression matrix Xr.
In one example, CRs 12 use consensus-based reformulation of the group-Lasso. To distribute the cost in (eq. 17), replace the global variable ζ which couples the per-agent summands with local variables {ζr}r=1N
The equality constraints directly effect local agreement across each CR's neighborhood. Since the communication graph G is assumed connected, these constraints also ensure global consensus a fortiori, meaning that ζr=ζr′, ∀r,r′∈R. Indeed, let P(r,r′):r,r1,r2, . . . , rn, r′ denote a path on G that joins an arbitrary pair of CRs (r,r′). Because contiguous radios in the path are neighbors by definition, the corresponding chain of equalities ζr=ζr
Lemma 1: If G is a connected graph, (17) and (18) are equivalent optimization problems, in the sense that {circumflex over (ζ)}glasso={circumflex over (ζ)}r, ∀r∈R.
To reduce the computational complexity of the resulting algorithm, for a given a∈Rp consider the problem:
and notice that it is separable in the Nb subproblems
Interestingly, each of these subproblems admits a closed-form solution as given in the following lemma.
Lemma 2: The minimizer ζ*v of (20) is obtained via the vector soft-thresholding operator Tμ(•) defined by
where (•)+:=max{•,0}.
Problem (eq. 19) is an instance of the group-Lasso (eq. 13) when X′X=Ip, and a:=X′.
In order to take advantage of Lemma 2, auxiliary variables γr, r=1, . . . , Nr are introduced as copies of ζr. Upon introducing appropriate constraints γr=ζr that guarantee the equivalence of the formulations along the lines of Lemma 1, problem (eq. 18) can be recast as
The dummy variables γrr′ are inserted for technical reasons that will become apparent in the ensuing section, and will be eventually eliminated.
In one example, the distributed group-Lasso algorithm is constructed by optimizing (eq. 22) using an alternating direction method of multipliers (AD-MoM). In this direction, associate Lagrange multipliers r, r−r′ and {hacek over ( )}rr′ with the constraints γr=ζr, ζr′=γrr′ and ζr=γrr′, respectively, and consider the augmented Lagrangian with parameter c>0
where for notational convenience we group the variables γ:={γr,{γrr′}r′∈N
Application of the AD-MoM to the problem at hand consists of a cycle of £c minimizations in a block-coordinate fashion w.r.t. {ζr} firstly, and γ secondly, together with an update of the multipliers per iteration k=0, 1, 2, . . . . The four main properties of this procedure with respect to the resulting algorithm can be highlighted as follows.
and thus can be eliminated.
Building on these four features, the proposed AD-MoM scheme may utilize four parallel recursions run locally per each CR 12 of
Recursions (24)-(27) comprise the novel DGLasso algorithm, tabulated as Algorithm 1.
The algorithm entails the following steps. During iteration k+1, CR r receives the local estimates {ζr′(k)}r′∈N
Distributed Lasso Algorithm as a Special Case: When Nb=p. and there are as many groups as entries of ζ, then the sum Σv=1N
For Nr=1, the network consists of a single CR. In this case, DGLasso yields an AD-MoM-based algorithm for the centralized group-Lasso estimator (eq. 17), which is specified as Algorithm 2 (below). Notice that the thresholding operator τp in GLasso sets the entire subvector ζv(k+1) to zero whenever ∥eγv(k)−vv(k)∥2 does not exceed μ, in par with the group-sparsifying property of group-Lasso. An attractive feature of proximal algorithms relative to GLasso, is that they come with convergence rate guarantees. GLasso can handle a general (not orthonormal) regression matrix X. GLasso does not require an inner Newton-Raphson recursion per iteration. GLasso yields the Lasso estimator.
In one example, CRs 12 computational load balancing the operations. For example, processing associated with updating (eq. 27) involves inversion of the p×p matrix cIp+Xr′Xr, that may be computationally demanding for sufficiently large p. Fortunately, this operation can be carried out offline before running the algorithm. More importantly, the matrix inversion lemma can be invoked to obtain [cIp+Xr′Xr]−3=c−3[Ip−Xr′(cIp+XrXr′)−3Xr]. In this new form, the dimensionality of the matrix to invert becomes nr×nr, where nr is the number of locally acquired data. For highly underdetermined cases (nr<<p) (D)GLasso provides considerable computational savings through the aforementioned matrix inversion identity. The distributed operation parallelizes the numerical computation across CRs 12: if GLasso is run at a central unit (FC 16) with all network-wide data available centrally, then the matrix to invert has dimension n=Σr∈Rnr, which increases linearly with the network size Nr. Beyond a networked scenario, DGLasso provides an alternative for computational load balancing in contemporary multi-processor architectures.
Convergence of Algorithm 1, and thus of Algorithm 2 as well, is ensured by the convergence of the AD-MoM.
Proposition 3: Let be a connected graph, and consider recursions (24)-(27) that comprise the DGLasso algorithm. Then,
for any value of the step-size c>0, the iterates ζr(k) converge to the group-Lasso solution [cf. (17)] as k→∞, i.e.,
In words, all local estimates ζr(k) achieve consensus asymptotically, converging to a common vector that coincides with the desired estimator {circumflex over (ζ)}glasso. Formally, if the number of parameters p exceeds the number of data n, then a unique solution of (eq. 13) is not guaranteed for a general design matrix x. Proposition 3 remains valid however, if the right-hand side of (eq. 28) is replaced by the set of minima; that is,
Three numerical tests were performed, starting from a simulated spectrum cartography example where five frequency bases are identified from an overcomplete set of Nb=90 candidates. The signal propagation is affected by path-loss and Rayleigh fading. This setup is also considered to exemplify the use of cross-validation in selecting the parameters λ and μ in (eq. 12). The second example introduces shadowing effects, and transmit signal parameters adhering to the IEEE 802.11 standard. A third numerical test is run on real RF power measurements taken at different locations in an indoor area, and frequencies in the 2.4 GHz unlicensed band.
Spectrum Cartography Test
Consider a set of Nr=100 CRs uniformly distributed in an area of 1 Km2, cooperating to estimate the PSD map generated by Ns=5 licensed users (sources) TX 1-TX 5 located as in
The PSD generated by source s experiences fading and shadowing effects in its propagation from xs to any location x, where it can be measured in the presence of white Gaussian noise with variance σ2. A 6-tap Rayleigh model was adopted for the multipath channel Hsx(f) between xs and x, whose expected gain adheres to the path-loss law E(∥Hsx(f)∥2)=exp(−∥xs−x∥22/Δ2), with Δ=800 m. A deterministic shadowing effect generated by a 18 m-high and 500 m-wide wall is represented by the black segment in
When designing the bases functions in (eq. 1), it is known a priori that the transmitted signals are indeed normalized raised cosine pulses with roll-off factors ρ∈{0,1}, and bandwidths W∈{10,20,30} MHz. However, the actual combination of bandwidths and roll-off factors used can be unknown, which justifies why an overcomplete set of bases becomes handy. Transmitted signals with bandwidth W=10 MHz are searched over a grid of 16 evenly spaced center frequencies fc in B. Likewise, for W=20 and 30 MHz, 15 and 14 center frequencies are considered, respectively. This amounts to 2×(16+15+14)=90 possible combinations for ρ, W, and fc; thus, Nb=90 raised-cosine bases were adopted with corresponding values of ρ, fv and Bv to match the aforementioned signal specifications; see also
Each CR computes periodogram sampleŝφrn(τ) at N=64 frequencies fn=(101.25+2.5(n−1)) MHz, n=1, . . . , 64 with
Then, these periodogram samples are averaged across T=100 time-slots to form φrn:n=1, . . . , 64. These network-wide observations at T=100 were collected in φ, and following steps S1-S4, the spline-based estimator (eq. 12), and thus the PSD map̂Φ(x,f) was formed. This map was summed across frequencies, and the result is shown in
In summary, this test case demonstrated that the spline-based estimator can reveal which frequency bands are (un)occupied at each point in space, thus allowing for spatial reuse of the idle bands. For instance, transmitter TX5 at the top-right corner is associated with the basis function b46(f), the only one of the transmitted five that occupies the 230-260 MHz sub-band. Therefore, this sub-band can be reused at locations x away from the transmission range of TX5, which is revealed in
Results in
To select λ and μ jointly so that both smoothness and sparsity are properly accounted for, one could consider a two-dimensional grid of candidate pairs, and minimize the CV error over this grid. However, this is computationally demanding. A three-step alternative is followed here. First, estimator (eq. 12) is obtained using an arbitrarily small value of λ=1×10−6, and selecting μ=0.1μmax, where μmax is given above. In the second step, only the surviving bases are kept, and the sparsifying penalty is no longer considered, thus reducing the estimator. If the reduced matrix B, built from the surviving bases, is full rank (otherwise repeat the first step with a larger value of μ), the procedure in described herein is followed to adjust the value of λ via leave-one-out CV. The result of this step is illustrated in
The importance of an appropriate μ value becomes evident when inspecting how many bases are retained by the estimator as μ decreases from μmax to 1×10−4μmax. The Nb lines in
Bias reduction and Improved Support Selection: The penalty term in (eq. 13) introduces bias in the estimator. As μ decreases the bias decreases, reducing the prediction error. There is a tradeoff however, as increasing gives rise to fewer nonzero entries approaching the true support, thus reducing the prediction error as well. The aforementioned CV technique yields an intermediate value of μ balancing these two effects, and thus it tends to overestimate the support. These insights suggest that reducing bias of the estimator improves subset selection and prediction error. Different approaches are available for reducing the bias of (group-) Lasso estimators, using e.g., weighted -norm penalties. Larger weights are given to terms that are most likely to be zero, while smaller weights are assigned to those that are most likely to be nonzero. Another simpler approach is to retain only the support in the minimizer of (13), and re-estimate the amplitudes via, e.g., LS.
IEEE 802.11 signal parameters and shadowing effects—The Nb=14 overlapping frequency bands (channels) specified in the IEEE 802.11 wireless LAN standard, were considered for this second simulated scenario. The frequency bases adopted correspond to Hann-windowed Nyquist pulses, and the center frequencies are fv=(2412+5(v−1)) MHz for v=1, . . . , 13 and f14=2484 MHz. The PSD map to be estimated is generated by two sources located at coordinates xs
Estimator (eq. 12) applied to the simulated data is successful in identifying the actual transmission bands, as can be deduced from
with γ*(v):=(XTX)−1(XTy+v), and the iterates ζ(k), γ(k), v(k) are generated as in Algorithm 2.
Real data test case—A dataset is formatted into triplets (x,fc,p) of positions, carrier frequencies, and aggregate RF power levels of the signals transmitted over carrier frequency fc and measured at position x. These measurements were modeled as acquired by Nr=166 sensing radios located in an indoor area A of 14×34 m, which is represented by the rectangles in
A set of Nb=14 nonoverlapping rectangular bases centered at these frequencies are adopted, and the nonparametric estimator (eq. 12) is run again to obtain the distribution of power across A. Parameters λ and μ are selected via two-fold cross validation, searching over a grid of 30 candidate pairs. A minimum normalized error of 0.0541 is attained for μCV=0.01μmax and λCV=10−4, as shown in
The proposed estimator is capable of recovering the center frequencies that are being utilized for transmission, eliminating the noise affecting the 13th basis. It also recovers the power levels in the surveyed area Ar, with a smooth extrapolation to zones where there are no measurements, and suggests possible locations for the transmitters.
Proof of (eq. 6)-(eq. 8): Upon substituting (eq. 5) into (eq. 4), the optimal coefficients {circumflex over ({)}α{circumflex over (,)}β} specifying {ĝ(x)}v=1N
Observe first that the constraints βv∈B in Proposition 1 can be expressed as T′βv=03 for each v=1, . . . , Nb, or jointly as (IN
Consider next the penalty term in the cost of (eq. 4). Substituting into (eq. 5), it follows that ∫∥∇2ĝv(x)∥F2dx=βv′Kβv. It thus holds that
from which (eq. 30) follows readily.
Now that the equivalence between (eq. 4) and (eq. 30) has been established, the latter must be solved for α and β. Even though K (hence IN
implying that (eq. 30) is convex. Note first that the constraint
implies the existence of a vector γ∈RNb(Nr−3) satisfying (eq. 8). After this change of variables, this is transformed into an unconstrained quadratic program, which can be solved in closed form for {α,γ}. Hence, setting both gradients w.r.t. α and γ} to zero yields (eq. 6) and (eq. 7).
Proof of Proposition 2: After substituting (eq. 15) into (eq. 12), one finds the optimal {α,β} specifying {ĝvx)}v=1N
With reference to (eq. 31), consider grouping and reordering the variables {α,β} in the vector u:=[u′1, . . . , u′N
The second summand due to the thin-plate penalty can be expressed as
while the last term is μΣv=1N
and becomes (13) under the identities (14), and after the change of variables ζ:=[ζ′1, . . . , ζ′N
Selection of the smoothing parameter in (eq. 4): The method to be developed builds on the so-termed leave-one-out CV, which proceeds as follows. Consider removing a single data point φrn from the collection of NrN measurements available to the sensing radios. For a given λ, let̂Φλ(−rn)(x,f) denote the leave-one-out estimated PSD map, following steps S1-S3, using the NrN−1 remaining data points. The aforementioned estimation procedure is repeated NrN times by leaving out each of the data points φrn, r=1, . . . , Nr and n=1, . . . , N, one at a time. The leave-one-out or ordinary CV (OCV) for the problem at hand is given by
while the optimum λ is selected as the minimizer of OCV(λ), over a grid of values λ∈[0,λmax]. Function (eq. 35) constitutes an average of the squared prediction errors over all data points; hence, its minimization offers a natural criterion. The method is quite computationally demanding though, since the system of linear equations (eq. 6)-(eq. 8) has to be solved NrN times for each value of λ on the grid. Fortunately, this computational burden can be significantly reduced for the spline-based PSD map estimator considered here.
Recall the vector φ collecting all data points measured at locations X and frequencies F. Define next a similar vector̂φ containing the respective predicted values at the given locations and frequencies, which is obtained after solving (eq. 4) with all the data in φ and a given value of λ. The following lemma establishes that the PSD map estimator is a linear smoother, which means that the predicted values are linearly related to the measurements, i.e.,̂φ=Sλφ for a λ-dependent matrix Sλ to be determined. Common examples of linear smoothers are ridge regressors and smoothing splines. For linear smoothers, by virtue of the leave-one-out lemma it is possible to rewrite (eq. 35) as
wherêΦλ(x,f) stands for the estimated PSD map when all data in φ are utilized in (eq. 4). The beauty of the leave-one-out lemma stems from the fact that given λ and the main diagonal of matrix Sλ, the right-hand side of (eq. 36) indicates that fitting a single model (rather than NrN of them) suffices to evaluate OCV(λ). The promised lemma stated next specifies the value of Sλ necessary to evaluate (eq. 36).
Lemma 3: The PSD map estimator in (4) is a linear smoother, with smoothing matrix given by
S
λ=(B{KQ2−TR−1Q′1KQ2}) [(B′BQ′2KQ2)+NrNλI]−1(B′Q′2)+(BΓ−1TR−1Q′1). (37)
Proof: Reproduce the structure of φ in Section III-A to form the sapervectar {circumflex over (φ)}:=[{circumflex over (φ)}′1, . . . , {circumflex over (φ)}′N]′ ∈N
{circumflex over (φ)}=(BK){circumflex over (β)}−(BT){circumflex over (α)}. (38)
Because the estimates {{circumflex over (α)},{circumflex over (β)}} are linearly related to the measurements φ [cf, (6)-(8)], so is {circumflex over (φ)} as per (38), establishing that the PSD map estimates in (4) is indeed a linear smoother. Next, solve explicitly for {{circumflex over (α)},{circumflex over (β)}} in (6-(8) and substitute the results in (38), to unveil the structure of the smoothing matrix Sλ such that {circumflex over (φ)}=Sλφ. Simple algebraic manipulations lead, to die expression (37).
The effectiveness of the leave-one-out CV approach is corroborated via simulations in
Proof of (eq. 24)-(eq. 27): Recall the augmented Lagrangian function in (eq. 23), and let ζ:={ζr}r∈R for notational brevity. When used to solve (eq. 22), the three steps in the AD-MoM are given by:
[S1] Local estimate updates:
[S2] Auxiliary variable updates:
[S3] Multiplier updates:
v
r(k+1)=vr(k)+c[ζr(k+1)−γr(k+1(] (41)
{hacek over (v)}
r
r′(k+1)={hacek over (v)}rr′(k)+c[ζr(k+1)−γrr′(k+1)] (42)
r
r′(k+1)=
Focusing first on [S2], observe that (23) is separable across the collection of variables {γj} and {γrr′} that compose γ. The minimization w.r.t. the latter group reduces to
The result in (44) assumes that
The remaining minimization in (40) with respect to {γr} decouples into Nr quadratic subproblems [cf. (23)], that is
which admit the closed-form solutions is (27).
In order to obtain the update (eq. 24) for the prices r, consider their definition together with (eq. 42), (eq. 43) and (eq. 44) to obtain
which coincides with (eq. 24).
Towards obtaining the updates for the local variables in ζ, the optimization (eq. 39) in [S1] can be also split into Nr sub-problems, namely
Upon dividing by c(1+2|Nr|) each subproblem becomes identical to problem (19), and thus by Proposition 2 takes the form of (26).
As described in further detail below, in some example, a number of different types of kernels may be used to accurately model the physical (i.e., geographic) distribution of RF power over the region so as to enhance the ability of the model to accurately reflect fading and multi-path characteristics of the environment.
Reproducing kernel Hilbert spaces (RKHSs) provide an orderly analytical framework for nonparametric regression, with the optimal kernel-based function estimate emerging as the solution of a regularized variational problem. The role of RKHS is further appreciated through its connections to “workhorse” signal processing tasks, such as the Nyquist-Shannon sampling and reconstruction result that involves sine kernels. Alternatively, spline kernels replace sine kernels, when smoothness rather than bandlimitedness is to be present in the underlying function space.
Kernel-based function estimation can be also seen from a Bayesian viewpoint. RKHS and linear minimum mean-square error (LMMSE) function estimators coincide when the pertinent covariance matrix equals the kernel Gram matrix. This equivalence has been leveraged in the context of field estimation, where spatial LMMSE estimation referred to as Kriging, is tantamount to two-dimensional RKHS interpolation. Finally, RKHS based function estimators can linked with Gaussian processes (GPs) obtained upon defining their covariances via kernels.
The techniques described herein recognize use of matrix completion, where data organized in a matrix can have missing entries due to e.g., limitations in the acquisition process. This disclosure makes use of the assertion that imputing missing entries amounts to interpolation, as in classical sampling theory, but with the low-rank constraint replacing that of bandlimitedness. From this point of view, RKHS interpolation emerges as a framework for matrix completion that allows effective incorporation of a priori information via kernels, including sparsity attributes.
Building blocks of sparse signal processing include the (group) least-absolute shrinkage and selection operator (Lasso) and its weighted versions, compressive sampling, and nuclear norm regularization. The common denominator behind these operators is the sparsity on a signal's support that the l1-norm regularizer induces. Exploiting sparsity for kernel-based learning (KBL) leads to several innovations regarding the selection of multiple kernels, additive modeling, collaborative filtering, matrix and tensor completion via dictionary learning, as well as nonparametric basis selection. In this context, techniques for nonparametric basis pursuit (NBP) are described that unifying and advance a number of sparse KBL approaches.
In this disclosure, RKHS in connection with GPs, are described. The Representer Theorem and a kernel trick are described, and the Nyquist-Shannon Theorem (NST) is presented as an example of KBL. Sparse KBL is addressed including sparse additive models (SpAMs) and multiple kernel learning (MKL) as examples of additive nonparametric models. NBP is introduced, with a basis expansion model capturing the general framework for sparse KBL. Blind versions of NBP for matrix completion and dictionary learning are developed. Finally, numerical tests using real and simulated data are presented, including RF spectrum measurements, expression levels in yeast, and network traffic loads.
In the context of reproducing kernel Hilbert spaces (RKHS), nonparametric estimation of a function ƒ:X→R defined over a measurable space X is performed via interpolation of N training points {(x1,z1), . . . , (xNzN)}, where xn∈X, and zn=ƒ(xn)+en∈ER. For this purpose, a kernel function k:X×X→R selected to be symmetric and positive definite, specifies a linear space of interpolating functions ƒ(x) given by
For many choices of k(•,•), HX is exhaustive with respect to (w.r.t) families of functions obeying certain regularity conditions. The spline kernel for example, generates the Sobolev space of all low-curvature functions. Likewise, the sine kernel gives rise to the space of bandlimited functions. Space HX becomes a Hilbert space when equipped with the inner product
and the associated norm is |ƒ|H
admits the finite-dimensional representation
This result is nice in its simplicity, since functions in space HX are compound by a numerable but arbitrarily large number of kernels, while ̂ƒ is a combination of just a finite number of kernels around the training points. In addition, the regularizing term μ|ƒ|H
Remark 1. The finite-dimensional expansion (1.2) solves (1.1) for more general fitting costs and regularizing terms. In its general form, the Representer Theorem asserts that (1.2) is the solution
where the loss function l(zn,f(xn) replacing the LS cost in (1.1) can be selected to serve either robustness (e.g., using the absolute-value instead of the square error); or, application dependent objectives (e.g., the Hinge loss to serve classification applications); or, for accommodating non-Gaussian noise models when viewing (1.3) from a Bayesian angle. On the other hand, the regularization term can be chosen as any increasing function Ω of the norm |ƒ|H
Instead of the deterministic treatment of the previous subsection, the unknown ƒ(x) can be considered as a random process. The KBL estimate (1.2) offered by the Representer Theorem has been linked with the LMMSE-based estimator of random fields ƒ(x), under the term Kriging. To predict the value ζ=ƒ(x) at an exploration point x via Kriging, the predictor̂ƒ(x) is modeled as a linear combination of noisy samples zn:=ƒ(xn)+η(xn) at measurement points {xn}n=1N; that is,
wherê⊕T:={circumflex over ([)}β1, . . . {circumflex over (,)}⊕N] are the expansion coefficients, and zT:=[z1, . . . , zN] collects the data. The MSE criterion is adopted to find the optimal̂β:=argminβ[ƒ(x)−zTβ]2. Solving the latter yieldŝβ=Rzz−1rzζ, where Rzz=E[zzT] and rzζ:=E[zƒ(x)]. If η(x) is zero-mean white noise with power ση2, then Rzz and rzζcan be expressed in terms of the unobserved ζT:=[ƒ(x1), . . . , ƒ(xN)] as Rzz=Rζζ+ση2I, where Rζζ:=E[ζζT], and rζζ, with rζζ:=E[ζƒ(x)]. Hence, the LMMSE estimate in (1.4) takes the form
where αT:=zT(Rζζ+ση2I)−1, and the n-th entry of rζζ, denoted by r(xn,x):=E[ƒ(x)ƒ(xn)], is indeed a function of the exploration point x, and the measurement point xn.
With the Kriging estimate given by (1.5), the RKHS and LMMSE estimates coincide when the kernel in (1.2) is chosen equal to the covariance function r(x,x′) in (1.5).
The linearity assumption in (1.4) is unnecessary when ƒ(x) and e(x) are modeled as zero-mean GPs [25]. GPs are those in which instances of the field at arbitrary points are jointly Gaussian. Zero-mean GPs are specified by cov(x,x′):=E[ƒ(x)ƒ(x′)], which determines the covariance matrix of any vector comprising instances of the field, and thus its specific zero-mean Gaussian distribution. In particular, the vector
By comparing (1.6) with (1.5), one deduces that the MMSE estimator of a GP coincides with the LMMSE estimator, hence with the RKHS estimator, when cov(x,x′)=k(x,x′).
Analogous to the spectral decomposition of matrices, Mercer's Theorem establishes that if the symmetric positive definite kernel is square-integrable, it admits a possibly infinite eigenfunction decomposition
with <ei(x),ei′(x)>H
Such an inner product interpretation forms the basis for the “kernel trick,” as used herein.
The kernel trick allows for approaches that depend on inner products of functions (given by infinite kernel expansions) to be recast and implemented using finite dimensional covariance (kernel) matrices. A simple demonstration of this valuable property can be provided through kernel-based ridge regression. Starting from the standard ridge estimator
it is possible to rewrite and solvêβ=argminβ∈R
Now, if φn=φ(xn) with D=∞ is constructed from xn∈X using eigenfunctions {φi(xn)}i=1∞, then φN+1TΦ=kT(xN+1):=[k(xN+1,x1), . . . , k(xN+1,xN)], and ΦTΦ=K, which yields
coinciding with (1.6), (1.5), and with the solution of (1.1)
Expressing a linear predictor in terms of inner products only is instrumental for mapping it into its kernel-based version. Although the mapping entails the eigenfunctions {φi(x)}, these are not explicitly present in (7), which is given solely in terms of k(x,x′). This is crucial since φ can be infinite dimensional which would render the method computationally intractable, and more importantly the explicit form of φi(x) may not be available. Use of kernel trick was demonstrated in the context of ridge regression. However, the trick can be used in any vectorial regression or classification method whose result can be expressed in terms of inner products only. One such example is offered by support vector machines, which find a kernel-based version of the optimal linear classifier in the sense of minimizing Vapnik's ε-insensitive Hinge loss function, and can be shown equivalent to the Lasso.
In a nutshell, the kernel trick provides a means of designing KBL algorithms, both for nonparametric function estimation, as well as for classification.
Kernels can be clearly viewed as interpolating bases. This viewpoint can be further appreciated if one considers the family of bandlimited functions Bπ:={ƒ∈L2(X): ∫ƒ(x)e−iωxdx=0, ∀|ω|>π}, where L2 denotes the class of square-integrable functions defined over X=R (e.g., continuous-time, finite-power signals). The family Bπ constitutes a linear space. Moreover, any ƒ∈Bπ can be generated as the linear combination (span) of sine functions; that is,
This is the cornerstone of signal processing, namely the NST for sampling and reconstruction, but can be viewed also under the lens of RKHS with k(x,x′)=sinc(x−x′) as a reproducing kernel. The following properties (which are proved in the Appendix) elaborate further on this connection.
P1. The sine-kernel Gram matrix K∈RN×N satisfies k≧0.
P2. The sine kernel decomposes over orthonormal eigenfunctions {φn(x)=sinc(x−n), n∈Z}
P3. The RKHS norm is |ƒ|H
P1 states that sinc(x−x′) qualifies as a kernel, while P2 characterizes the eigenfunctions used in the kernel trick, and P3 shows that the RKHS norm is the restriction of the L2 norm to Bπ.
P1-P3 establish that the space of bandlimited functions Bπ is indeed an RKHS. Any ƒ∈Bπ can thus be decomposed as a numerable combination of eigenfunctions, where the coefficients and eigenfunctions obey the NST. Consequently, existence of eigenfunctions {φn(x)} spanning Bπ is a direct consequence of Bπ being a RKHS, and does not require the NST unless an explicit form for φn(x) is desired. Finally, strict adherence to NST requires an infinite number of samples to reconstruct ƒ∈Bπ. Alternatively, the Representer Theorem fits ƒ∈Bπ to a finite set of (possibly noisy) samples by regularizing the power of ƒ.
The account of sparse KBL methods begins with SpAMs and MKL approaches. Both model the function to be learned as a sparse sum of nonparametric components, and both rely on group Lasso to find it. The additive models considered in this section will naturally lend themselves to the general model for NBP introduced below, and used henceforth.
Additive function models offer a generalization of linear regression to the nonparametric setup, on the premise of dealing with the curse of dimensionality, which is inherent to learning from high dimensional data.
Consider learning a multivariate function ƒ:X→R defined over the Cartesian product X:=X1 . XP of measurable spaces Xi. Let xT:=[x1, . . . , xP] denote a point in X, ki the kernel defined over Xi×Xi, and Hi its associated RKHS. Although ƒ(x) can be interpolated from data via (1.1) after substituting x for x, the fidelity of (1.2) is severely degraded in high dimensions. Indeed, the accuracy of (1.2) depends on the availability of nearby points xn, where the function is fit to the (possibly noisy) data zn. But proximity of points xn in high dimensions is challenged by the curse of dimensionality, demanding an excessively large dataset. For instance, consider positioning TV datapoints randomly in the hypercube [0,1]P, repeatedly for P growing unbounded and N constant. Then
that is, the expected distance between any two points is equal to the side of the hypercube [16].
To overcome this problem, an additional modeling assumption is well motivated, namely constraining ƒ(x) to the family of separable functions of the form
with ci∈Hi depending only on the i-th entry of x, as in e.g., linear regression models
With ƒ(x) separable as in (1.8), the interpolation task is split into P one-dimensional problems that are not affected by the curse of dimensionality.
The additive form in (1.8) is also amenable to subsect selection, which yields a SpAM. As in sparse linear regression, SpAMs involve functions ƒ in (1.8) that can be expressed using only a few entries of x. Those can be learned using a variational version of the Lasso given by [26]
With xni denoting the ith entry of xn, the Representer Theorem (1.3) can be applied per component ci(xi) in (1.9), yielding kernel expansions
with scalar coefficients {γni, i=1, . . . , P, n=1, . . . , N}
The fact that (1.9) yields a SpAM is demonstrated by substituting these expansions back into (1.9) and solving for γiT:=[γi1, . . . , γiN], to obtain
where Ki is the Gram matrix associated with kernel ki, and |•|K
Problem (1.10) constitutes a weighted version of the group Lasso formulation for sparse linear regression. Its solution can be found either via block coordinate descent (BCD) [26], or by substituting γ′i=Ki1/2γi and applying the alternating-direction method of multipliers (ADMM) [6], with convergence guaranteed by its convexity and the separable structure of the its non-differentiable term [30]. In any case, group Lasso regularizes sub-vectors γi separately, effecting group-sparsity in the estimates; that is, some of the vectorŝγi in (1.10) end up being identically zero. To gain intuition on this, (1.10) can be rewritten using the change of variables Ki1/2γi=tiui, with ti≧0 and |ui|=1. It will be argued that if μ exceeds a threshold, then the optimal ti and thuŝγi will be null. Focusing on the minimization of (1.10) w.r.t. a particular sub-vector γi, as in a BCD algorithm, the substitute variables ti and ui should minimize
Minimizing (1.11) over ti is a convex univariate problem whose solution lies either at the border of the constraint, or, at a stationary point; that is,
The Cauchy-Schwarz inequality implies that ziTKi1/2ui≦|Ki1/2zi| holds for any ui with |ui|=1. Hence, it follows from (1.12) that if μ≧|Ki1/2zi|, then ti=0, and thus γi=0.
The sparsifying effect of (1.9) on the additive model (1.8) is now revealed. If μ is selected large enough, some of the optimal sub-vectorŝγi will be null, and the corresponding functions
will be identically zero in (1.8). Thus, estimation via (1.9) provides a nonparametric counterpart of Lasso, offering the flexibility of selecting the most informative component-function regressors in the additive model.
The separable structure postulated in (1.8) facilitates subset selection in the nonparametric setup, and mitigates the problem of interpolating scattered data in high dimensions. However, such a model reduction may render (1.8) inaccurate, in which case extra components depending on two or more variables can be added, turning (1.8) into the ANOVA model.
Specifying the kernel that “shapes” HC, and thus judiciously determineŝƒ in (1.1) is a prerequisite for KBL. Different candidate kernels kk1, . . . , kP would produce different function estimates. Convex combinations can be also employed in (1.1), since elements of the convex hull
conserve the defining properties of kernels.
A data-driven strategy to select “the best” k∈K is to incorporate the kernel as a variable in (1.3), that is
where the notation HkX emphasizes dependence on k.
Then, the following Lemma brings MKL to the ambit of sparse additive nonparametric models. Lemma 1 (MP05) Let {k1, . . . , kP} be a set of kernels and k an element of their convex hull K. Denote by Hi and HkX the RKHSs corresponding to ki and k, respectively, and by HX the direct sum HX:=H1⊕ . . . ⊕HP. It then holds that:
According to Lemma 1, HX replace HkX in (1.13), rendering it equivalent to
MKL as in (1.14) resembles (1.9), differing in that components ci(x) in (1.14) depend on the same variable x. Taking into account this difference, (1.14) is reducible to (1.10) and thus solvable via BCD or ADMoM, after substituting ki(xn,x) for ki(xni,xi). On the other hand, in a more general case of MKL, where K the convex hull of an infinite and possibly uncountable family of kernels.
An example of MKL applied to wireless communications is offered below, where two different kernels are employed for estimating path-loss and shadowing propagation effects in a cognitive radio sensing paradigm.
In the ensuing section, basis functions depending on a second variable y will be incorporated to broaden the scope of the additive models just described.
Consider function ƒ:X×Y→R over the Cartesian product of spaces X and Y with associated RKHSs HX and HY respectively. Let ƒ abide to the bilinear expansion form
where bi:Y→R can be viewed as bases, and ci:X→R as expansion coefficient functions. Given a finite number of training data, learning {ci,bi} under sparsity constraints constitutes the goal of the NBP approaches developed in the following sections.
The first method for sparse KBL of ƒ in (1.15) is related to a nonparametric counterpart of basis pursuit, with the goal of fitting the function ƒ(x,y) to data, where {bi} are prescribed and {ci}s are to be learned. The designer's degree of confidence on the modeling assumptions is key to deciding whether {bi}s should be prescribed or learned from data. If the prescribed {bi}s are unreliable, model (1.15) will be inaccurate and the performance of KBL will suffer. But neglecting the prior knowledge conveyed by {bi}s may be also damaging. Parametric basis pursuit [9] hints toward addressing this tradeoff by offering a compromising alternative.
A functional dependence z=ƒ(y)+e between input y and output z is modeled with an overcomplete set of bases {bi(y)} (a.k.a. regressors) as
Certainly, leveraging an overcomplete set of bases {bi(y)} can accommodate uncertainty. Practical merits of basis pursuit however, hinge on its capability to learn the few {bi}s that “best” explain the given data.
The crux of NBP on the other hand, is to ƒ(x,y) with a basis expansion over the y domain, but learn its dependence on x through nonparametric means. Model (1.15) comes handy for this purpose, when {bi(y)}i=1P is a generally overcomplete collection of prescribed bases.
With {bi(y)}i=1P known, {ci(x)}i=1P need to be estimated, and a kernel-based strategy can be adopted to this end. Accordingly, the optimal function̂ƒ(x,y) is searched over the family
which constitutes the feasible set for the NBP-tailored nonparametric Lasso
The Representer Theorem in its general form (.13) can be applied recursively to minimize (1.17) w.r.t. each ci(x) at a time, renderinĝƒ expressible in terms of the kernel expansion as
where coefficients γiT:=[γi1, . . . , γiN] are learned from data zT:=[z1, . . . , zN] via group Lasso
with Ki:=Diag[bi(y1), . . . , bi(yN)]K.
Group Lasso in (1.18) effects group-sparsity in the subvectors {γi}=i=1P. This property inherited by (1.17) is the capability of selecting bases in the nonparametric setup. Indeed, by zeroing γi the corresponding coefficient function
is driven to zero, and correspondingly bi(y) drops from the expansion (1.15).
Remark 2. A single kernel kX and associated HX can be used for all components ci(x) in (1.17), since the summands in (1.15) are differentiated through the bases. Specifically, for a common K, a different bi(y) per coefficient ci(x), yields a distinct diagonal matrix Diag[bi(y1), . . . , bi(yN)], defining an individual Ki in (1.18) that renders vector γi identifiable. This is a particular characteristic of (1.17), in contrast with (1.9) and Lemma 1 which are designed for, and may utilize, multiple kernels.
Remark 3. The different sparse kernel-based approaches presented so far, namely SpAMs, MKL, and NBP, should not be viewed as competing but rather as complementary choices. Multiple kernels can be used in basis pursuit, and a separable model for c(x) may be due in high dimensions. An NBP-MKL hybrid applied to spectrum cartography illustrates this point, where bases are utilized for the frequency domain Y.
A kernel-based matrix completion scheme will be developed in this section using a blind version of NBP, in which bases {bi} will not be prescribed, but they will be learned together with coefficient functions {ci}. The matrix completion task entails imputation of missing entries of a data matrix Z∈RM×N. Entries of an index matrix W∈{0,1}M×N specify whether datum zmn is available (wmn=1), or missing (wmn=0). Low rank of Z is a popular attribute that relates missing with available data, thus granting feasibility to the imputation task. Low-rank matrix imputation is achieved by solving
where ⊙ stands for the Hadamard (element-wise) product. The low-rank constraint corresponds to an upperbound on the number of nonzero singular values of matrix A, as given by its l0-norm. Specifically, if sT:=[s1, . . . , smin{M,N}] denotes vector of singular values of A, and the cardinality
|{si≠0, i=1, . . . , min {M,N}}|:=|s|0 defines its l0-norm, then the ball of radius P, namely |s|0≦P, can replace rank(A)≦P in (1.19). The feasible set |s|0≦P is not convex because |s|0 is not a proper norm (it lacks linearity), and solving (1.19) requires a combinatorial search for the nonzero entries of s. A convex relaxation is thus well motivated. If the l0-norm is surrogated by the l1-norm, the corresponding ball |s|1≦P becomes the convex hull of the original feasible set. As the singular values of A are non-negative by definition, it follows that
Since the sum of singular values equals the dual norm of the l2-norm of A, |s|1 defines a norm over the matrix A itself, namely the nuclear norm of A, denoted by |A|*.
Upon substituting |A|* for the rank, (1.19) is further transformed to its Lagrangian form by placing the constraint in the objective as a regularization term, i.e.,
The next step towards kernel-based matrix completion relies on an alternative definition of |A|*. Consider bilinear factorizations of matrix A=CBT with B∈RN×P and C∈RM×P, in which the constraint rank(A)≦P is implicit. The nuclear norm of A can be redefined as
Result (1.21) states that the infimum is attained by the singular value decomposition of A. Specifically, if A=UΣVT with U and V unitary and Σ:=diag(s), and if B and C are selected as B=VΣ1/2, and C=UΣ1/2, then
Given (1.21), it is possible to rewrite (1.20) as
Matrix completion in its factorized form (1.22) can be reformulated in terms of (1.15) and RKHSs. Define spaces X:={1, . . . , M} and Y:={1, . . . , N} with associated kernels kX(m,m′) and kY(n,n′), respectively. Let ƒ(m,n) represent the (m,n)-th entry of the approximant matrix A in (1.22), and P a prescribed overestimate of its rank. Consider estimating ƒ:X×Y→R in (1.15) over the family
If both kernels are selected as Kronecker delta functions, then (1.23) coincides with (1.22). This equivalence is stated in the following lemma.
Lemma 2 Consider spaces X:={1, . . . , M}, Y:={1, . . . , N} and kernels kX(m,m′):=δ(m−m′) and kY(n,n′):=δ(n−n′) over the product spaces X×X and Y×Y , respectively. Define functions ƒ:X×Y→R, ci:X→R, and bi:Y→R, i=1, . . . , P, and matrices A∈RM×N, B∈RN×P, and C∈RM×P. It holds that:
According to Lemma 2, the intricacy of rewriting (1.20) as in (1.23) does not introduce any benefit when the kernel is selected as the Kronecker delta. But as it will be argued next, the equivalence between these two estimators generalizes nicely the matrix completion problem to sparse KBL of missing data with arbitrary kernels.
The separable structure of the regularization term in (1.23) enables a finite dimensional representation of functions
Optimal scalars {γim} and {βin} are obtained by substituting (1.24) into (1.23), and solving
where matrix {tilde over (C)} ({tilde over (B)}) is formed with entries γmi (βni).
A Bayesian approach to kernel-based matrix completion is given next, followed by an algorithm to solve for{tilde over ( )}B and{tilde over ( )}C.
To recast (1.23) in a Bayesian framework, suppose that the available entries of Z obey the additive white Gaussian noise (AWGN) model Z=A+E, with E having entries independent identically distributed (i.i.d.) according to the zero-mean Gaussian distribution N(0,σ2).
Matrix A is factorized as A=CBT without loss of generality (w.l.o.g.). Then, a Gaussian prior is assumed for each of the columns bi and ci of B and C, respectively,
bi˜N(0,RB), ci˜N(0,RC) (1.26)
independent across i, and with trace(RB)=trace(RC). Invariance across i is justifiable, since columns are a priori interchangeable, while trace(RB)=trace(RC) is introduced w.l.o.g. to remove the scalar ambiguity in A=CBT.
Under the A WGN model, and with priors (1.26), the maximum a posteriori (MAP) estimator of A given Z at the entries indexed by W takes the form [cf. (1.25)]
With RC=KX and RB=KY, and substituting B:=K{tilde over ( )}YB and C:=K{tilde over ( )}XC, the MAP estimator that solves (1.24) coincides with the estimator solving (1.25) for the coefficients of kernel-based matrix completion, provided that covariance and Gram matrices coincide. From this Bayesian perspective, the KBL matrix completion method (1.23) provides a generalization of (1.20), which can accommodate a priori knowledge in the form of correlation across rows and columns of the incomplete Z.
With prescribed correlation matrices RB and RC, (1.23) can even perform smoothing and prediction. Indeed, if a column (or row) of Z is completely missing, (1.23) can still find an estimatêZ relying on the co variance between the missing and available columns. This feature is not available with (1.20), since the latter relies only on rank-induced co linearities, so it cannot reconstruct a missing column. The prediction capability is useful for instance in collaborative filtering [3], where a group of users rates a collection of items, to enable inference of new-user preferences or items entering the system. Additionally, the Bayesian reformulation (1.27) provides an explicit interpretation for the regularization parameter μ=σ2 as the variance of the model error, which can thus be obtained from training data. The kernel-based matrix completion method (1.27) is summarized in Algorithm 1, which solves (1.27) upon identifying RC=KX, RB=KY, and σ2=μ, and solves (1.25) after changing variables B:=K{tilde over ( )}YB and C:=K{tilde over ( )}XC (compare (1.25) with lines 13-14 in Algorithm 1).
Detailed derivations of the updates in Algorithm 1 are provided in the Appendix. For a high-level description, the columns of B and C are updated cyclically, solving (1.27) via BCD iterations. This procedure converges to a stationary point of (1.27), which in principle does not guarantee global optimality. Opportunely, it can be established that local minima of (1.27) are global minima, by transforming (1.27) into a convex problem through the same change of variables proposed in [22] for the analysis of (1.22). This observation implies that Algorithm 1 yields the global optimum of (1.25), and thus (1.23).
Basis pursuit approaches advocate an overcomplete set of bases to cope with model uncertainty, thus learning from data the most concise subset of bases that represents the signal of interest. But the extensive set of candidate bases (a.k.a. dictionary) still needs to be prescribed. The next step towards model-agnostic KBL is to learn the dictionary from data, along with the sparse regression coefficients. Under the sparse linear model
z
m
=Bγ
m
+e
m, m=1, . . . , M (1.28)
with dictionary of bases B∈RN×P, and vector of coefficients γm∈RP, the goal of dictionary learning is to obtain B and C:=[γ1, . . . , γM]T from data Z:=[z1, . . . , zM]T. A swift count of equations and unknowns yields NP+MP scalar variables to be learned from MN data (see
Having collected sufficient training data, one possible approach to find B and C is to fit the data via the LS cost |Z−CBT|F2 regularized by the l1-norm of C in order to effect sparsity in the coefficients. This dictionary leaning approach can be recast into the form of blind NBP (1.23) by introducing the additional regularizing term
with
The new regularizer on functions ci:X→R depends on their values at the measurement points m only, and can be absorbed in the loss part of (1.3). Thus, the optimal {ci} and {bi} conserve their finite expansion representations dictated by the Representer Theorem. Coefficients {γmp,βnp} must be adapted according to the new cost, and (1.27) becomes
Remark 4. Kernel-based dictionary learning (KDL) via (1.29) inherits two attractive properties of kernel matrix completion (KMC), that is blind NBP, namely its flexibility to introduce a priori information through RB and RC, as well as the capability to cope with missing data. While both KDL and KMC estimate bases {bi} and coefficients {ci} jointly, their difference lies in the size of the dictionary. As in principal component analysis, KMC presumes a low-rank model for the approximant A=CBT, compressing signals {zm} with P′<M components (
Algorithm 1 can be modified to solve (1.29) by replacing the update for column ci in line 7 with the Lasso estimate
Consider a setup with Nc=100 radios distributed over an area X of 100×100 m2 to measure the ambient RF power spectral density (PSD) at Nƒ=24 frequencies equally spaced in the band from 2,400 MHz to 2,496 MHz, as specified by IEEE 802.11 wireless LAN standard. The radios collaborate by sharing their N=NcNƒ measurements with the goal of obtaining a map of the PSD across space and frequency, while specifying at the same time which of the P=14 frequency sub-bands are occupied. The wireless propagation is simulated according to a pathloss model affected by shadowing, with parameters n =3, Δ0=60 m, δ=25 m, σX2=25 dB, and with AWGN variance σn2=−10 dB.
Model (1.15) is adopted for collaborative PSD sensing, with x and y representing the spatial and frequency variables, respectively. Bases {b} are prescribed as Hann-windowed pulses, and the distribution of power across space per sub-band is given by {c(x)} after interpolating the measurements obtained by the radios via (1.17). Two exponential kernels kr(x,x′)=exp(−|x−x′|2/Θr2), r=1,2 with Θ1=10 m and Θ2=20 m are selected, and convex combinations of the two are considered as candidate interpolators k(x,x′). This MKL strategy is intended for capturing two different levels of resolution as produced by pathloss and shadowing. Correspondingly, each ci(x) is decomposed into two functions ci1(x) and ci2(x) which are regularized separately in (1.17).
Solving (1.17) generates the PSD maps of
These results demonstrate the usefulness of model (1.15) for collaborative spectrum sensing, with bases and multi-resolution kernels. The sparse nonparametric estimator (1.17) serves the purpose of revealing the occupied frequency bands, and capturing the PSD map across space per source. Compared to a spline-based approach, the MKL adaptation of (1.17) here provides the appropriate multi-resolution capability to capture pathloss and shadowing effects when interpolating the data across space.
In this example, a computer 500 includes a processor 510 that is operable to execute program instructions or software, causing the computer to perform various methods or tasks. Processor 510 is coupled via bus 520 to a memory 530, which is used to store information such as program instructions and other data while the computer is in operation. A storage device 540, such as a hard disk drive, nonvolatile memory, or other non-transient storage device stores information such as program instructions, data files of the multidimensional data and the reduced data set, and other information. The computer also includes various input-output elements 550, including parallel or serial ports, USB, Firewire or IEEE 1394, Ethernet, and other such ports to connect the computer to external device such a printer, video camera, surveillance equipment or the like. Other input-output elements include wireless communication interfaces such as Bluetooth, Wi-Fi, and cellular data networks.
The computer itself may be a traditional personal computer, a rack-mount or business computer or server as shown in
The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof. Various features described as modules, units or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices or other hardware devices. In some cases, various features of electronic circuitry may be implemented as one or more integrated circuit devices, such as an integrated circuit chip or chipset.
If implemented in hardware, this disclosure may be directed to an apparatus such a processor or an integrated circuit device, such as an integrated circuit chip or chipset. Alternatively or additionally, if implemented in software or firmware, the techniques may be realized at least in part by a computer readable data storage medium comprising instructions that, when executed, cause one or more processors to perform one or more of the methods described above. For example, the computer-readable data storage medium may store such instructions for execution by a processor. Any combination of one or more computer-readable medium(s) may be utilized.
A computer-readable medium may form part of a computer program product, which may include packaging materials. A computer-readable medium may comprise a computer data storage medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage media, and the like. In general, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. Additional examples of computer readable medium include computer-readable storage devices, computer-readable memory, and tangible computer-readable medium. In some examples, an article of manufacture may comprise one or more computer-readable storage media.
In some examples, the computer-readable storage media may comprise non-transitory media. The term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in RAM or cache).
The code or instructions may be software and/or firmware executed by processing circuitry including one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure or any other processing circuitry suitable for implementation of the techniques described herein. In addition, in some aspects, functionality described in this disclosure may be provided within software modules or hardware modules.
Further exemplary details are described in Bazerque, “Basis Pursuite For Spectrum Cartography,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, directly included herein as an Appendix and incorporated herein by reference.
Various embodiments of the invention have been described. These and other embodiments are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application No. 61/649,793, filed May 21, 2012, the entire contents of which are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
61649793 | May 2012 | US |