Any device that captures data from the real world needs a preprocessing filter to clean the signal of noise or to attenuate irrelevant signal features that the user wants to avoid. Moreover, filters are used as models in machine learning and control applications. In addition to being application specific, the development of these filters is a complex process that is both time consuming and computationally intensive.
Many aspects of the present disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
Disclosed herein are various examples of systems and methods related to automatically composing universal filters. Reference will now be made in detail to the description of the embodiments as illustrated in the drawings, wherein like reference numbers indicate like parts throughout the several views.
Kernel methods, such as support vector machine (SVM), kernel principal component analysis (KPCA), and Gaussian process (GP), create a powerful unifying framework for classification, clustering, and regression, with many important applications in machine learning, signal processing, and biomedical engineering. In particular, the theory of adaptive signal processing can be greatly enhanced through the integration of the theory of reproducing kernel Hilbert space (RKHS). By performing classical linear methods in a potentially infinite-dimensional feature space, kernel adaptive filtering (KAF) removes the limitations of the linear model to provide general nonlinear solutions in the original input space. KAF bridges the gap between adaptive signal processing and feedforward artificial neural networks (ANNs), combining the universal approximation property of neural networks and the simple convex optimization of linear adaptive filters.
KAF has gained traction in the scientific community thanks to its usefulness coupled with simplicity, and has been discussed for solving online nonlinear system identification. The kernel least-mean-square (KLMS) algorithm is the simplest feedforward kernel method in the family of kernel adaptive filters. It can be viewed as a growing single-layer neural network, i.e., a finite impulse response (FIR) filter, trained using the LMS algorithm in the RKHS. Other KAF algorithms include the kernel affine projection algorithms (KAPA), kernel recursive least squares (KRLS), and the extended kernel recursive least squares (EX-KRLS) algorithm. While most research has focused on time-delayed feedforward implementations of kernel methods, a recurrent formulation may be utilized to solve nonlinear problems involving non-stationary dynamics. The kernel adaptive autoregressive-moving-average (KAARMA) algorithm can achieve the appropriate memory depth via internal states, by feeding back some or all of the outputs through time-delay units, at the input. As a result, the input and output are no longer independent stationary vectors, but correlated temporal sequences.
A major bottleneck of KAF algorithms is that computation scales with the number of samples. When the reproducing kernel is Gaussian, kernel adaptive filters grow linearly like radial basis function (RBF) networks, which poses significant time-space complexity issues for continuous online adaptation. To address this issue, a variety of sparsification and quantization techniques have been proposed to curb the network growth. In batch mode, sparsification has been addressed by pruning and fixed-size approaches. This disclosure considers online adaptive methods. Existing online sample evaluation and selection criteria include approximate linear dependency (ALD), novelty, prediction variance, surprise, and coherence. The central theme has been to form a compact structure by either eliminating redundant information or minimizing information loss.
One of the most successful methods to date, because of its simplicity and information preservation, is the vector quantization (VQ) technique introduced in the quantized KLMS (QKLMS) algorithm, which was shown to outperform pruning techniques using the novelty, surprise, ALD, and/or prediction variance criteria. Rather than discarding the information associated with redundant data points, VQ updates the coefficients locally within a responsive domain. In practice, only the coefficient of the nearest neighbor is updated. A modified version (M-QKLMS) may be used by computing the exact gradient when performing the VQ coefficient update. Nevertheless, these methods require the participation of all dictionary centers to evaluate or update the function approximation at any new data sample.
Orthogonal Decomposition Using Exponentially Decaying Kernel
In this disclosure, the concept of a simple instance-based learning data structure that self-organizes essential data points is introduced. The nearest-instance-centroid-estimation (NICE) algorithm is complementary to existing sparsification and VQ techniques. Whereas the others bound the network structure from below, by eliminating redundant basis function centers, NICE bounds the network structure from above, by ignoring centers outside a certain responsive domain. NICE divides the growing sum that defines the filter functional into partial sums (subfilters) that have tight support, i.e., that have nonzero output only in a subregion of the RKHS, naturally forming a compactly-supported reproducing kernel functional. The concept is supported by the fact that the Gaussian function decays exponentially to zero; therefore, if the samples are organized in sufficiently distant clusters, each cluster is approximately orthogonal to the others.
Formally, consider a function approximation of the form:

f̂(u)=Σi=1N αiϕ(ui,u), (1)

where the approximating function f̂ is represented as a sum of N Gaussian functions ϕ, each associated with a different center ui and weighted by a coefficient αi. Although the Gaussian function has nonzero values over the full space, computation has finite precision, so the Gaussian tails are effectively zero. Theoretically, this means f̂ can be projected onto the subspace defined by N′:
span{ϕ(uj,⋅):1≤j≤N′; and N′<N}, (2)
obtaining f̂s (the component in the subspace) and f̂⊥ (the component perpendicular to the subspace):

f̂=f̂s+f̂⊥. (3)
Using this decomposition, the basis functions can be partitioned into m orthogonal sets (at the machine precision):
where Σi=1mN(i)=N and ⟨ϕ(uj), ϕ(u)⟩=0 for all j∉N*. For Gaussian functions, from the kernel trick ⟨ϕ(uj), ϕ(u)⟩=ϕ(∥uj−u∥), orthogonality can be approximated using the squared norm to define pseudo-normal bases or neighborhoods, by relaxing the orthogonality constraint to ⟨ϕ(uj), ϕ(u)⟩<ϵ, or equivalently N*≈{uj:∥uj−u∥<dϵ}, where ϵ is an arbitrarily small positive quantity and dϵ is the corresponding distance value.
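To make the neighborhood radius concrete, dϵ can be solved from the kernel decay; the numbers below are an illustrative worked example (the choice ϵ=10−16, roughly the double-precision machine epsilon, is an assumption rather than a value from the disclosure):

exp(−a dϵ2)=ϵ  ⇒  dϵ=√(ln(1/ϵ)/a),

so with a=1 and ϵ=10−16, dϵ=√(16 ln 10)≈6.07; centers farther apart than this contribute less than ϵ to each other's kernel evaluations.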
Nearest Neighbor Search: Nearest neighbor search is a computationally intensive operation, especially in high dimensional spaces. Data-space partitioning and search data structures can be utilized; however, on average, a naive linear search outperforms space-partitioning approaches in higher dimensional spaces, due to the curse of dimensionality. The incremental nature of the representer theorem at the core of KAF algorithms allows a very simple solution that is heavily based on instantaneous computations. By comparing the current sample with a few representatives of the existing data, rather than with every individual sample, and since kernel methods are inherently instance-based, there is diminishing return in maintaining finer data structures. The need to maintain complex search data structures for a sequentially formed, depth-1 forest can be traded for the centroid of each cluster at the roots. The NICE network can learn the clusters directly from the data, using an intuitive kernel bandwidth metric, and update the centroid locations through an iterative update.
To perform an evaluation, a linear search among the centroids determines the local supports-of-interest.
The nearest-neighbor search and computation used in NICE is similar to the k-nearest neighbors (k-NN) algorithm only in the sense that the function is approximated locally with respect to its nearest neighbors. However, rather than computing the distances from the test sample to all stored instances and applying weighted averaging, NICE computes the distances between the input and the set of centroids, then applies a standard KAF algorithm on the instances belonging to the nearest-neighbor cluster. Also, the number of centers in each cluster or neighborhood is not predefined, but rather instance-learned directly from data.
Along similar lines, unlike k-means clustering, which aims to partition the observations or centers into a fixed number of clusters with each center belonging to the cluster with the nearest mean, over several epochs, the clusters in NICE are formed instantaneously, using only the predefined distance threshold.
Compared with standard compactly-supported kernels, there is no fixed cut-off distance or range. The concept of a cut-off is only loosely associated with the minimum centroid distance: NICE-KLMS uses a finite subset of local supports rather than using a compactly-supported kernel, e.g., a truncated kernel. A simple thresholding technique used to sparsify an RBF kernel by setting a cut-off distance produces a sparse Gram matrix, but often destroys its positive definiteness. With knowledge transfer, in which out-of-range but close-by centers are copied to form a new cluster, NICE evaluations can extend beyond the neighborhood defined by the minimum centroid distance.
By partitioning the centers into distinct quasi-orthogonal regions, each cluster of NICE can be thought of as a separate filter or subfilter, specializing in different parts of the input/feature space. From this perspective, the NICE framework becomes a content addressable filter bank (CAFB). Instead of frequency bands, the filters are organized by amplitude bands. This CAFB can be incrementally updated for more and more new applications, always using the past-learned filters, opening the door for transfer learning and much more efficient training for new data scenarios, avoiding training from scratch as has been done since the invention of adaptive filtering.
Compared with multiple and mixture kernel learning, NICE-KLMS uses a single kernel (fixed RKHS) across filters. The appropriate filter (set of weights) is selected based on the minimum centroid distance. In this perspective, the NICE-KLMS can be viewed as a single-kernel multiple- or mixture-filter algorithm. In terms of time-space complexity, instead of running multiple learning algorithms in parallel, as is the case in the mixture model, only one filter is updated by NICE-KLMS at any given time step.
Compared to local-structure based KAF, such as the fixed budget (FB) QKLMS, the network size of NICE-QKLMS is not determined a priori, but rather learned directly from the complexity or dynamic range of the data. The minimum description length (MDL) criterion can be used to adapt the network size, rather than a fixed constant; however, it depends on prior knowledge of the locally stationary environment or window size. The only free parameter in NICE, the centroid distance threshold, is conditionally independent of the data, given the appropriate kernel parameter. Since it relates directly to the kernel bandwidth and the shape of the Gaussian is well understood, it can be set very intuitively. In addition, the two major drawbacks of the existing algorithms are knowledge retention and computational complexity. NICE does not throw away previously learned structures, but rather naturally tucks them away for future use. When the environment changes back to a previous state, QKLMS-FB or QKLMS-MDL has no inherent mechanism for recall and has to relearn the structure from scratch. The centroid computation is also significantly simpler than the respective significance measures, e.g., MDL. Furthermore, the NICE paradigm is complementary to most network reduction algorithms and can be used in conjunction with them.
The following disclosure begins with a brief overview of the KLMS algorithm, and then introduces the novel NICE-KLMS. The mean square convergence analysis for NICE-KLMS is presented using the energy conservation relation, and the performance of the NICE-KLMS algorithm is evaluated with special emphasis on the associative filter storage property of the CAFB framework.
NICE-KLMS Algorithm
First, the KLMS algorithm is briefly discussed, then the NICE extension for KLMS and QKLMS is introduced. In machine learning, supervised learning can be grouped into two broad categories: classification and regression. For a set of N data points D={ui,yi}i=1N, the desired output y is either a categorical variable (e.g., y∈{−1,+1}), in the case of binary classification, or a real number (e.g., y∈ℝ) for the task of regression or interpolation, where X1N≜{ui}i=1N is the set of M-dimensional input vectors, i.e., ui∈ℝM, and y1N≜{yi}i=1N is the corresponding set of desired vectors or observations. In this disclosure, the focus is on the latter problem, although the same approach can be used for classification. The task is to infer the underlying function y=f(u) from the given data D={X1N,y1N} and predict its value, or the value of a new observation y′, for a new input vector u′. Note that the desired data may be noisy in nature, i.e., yi=f(ui)+vi, where vi is the noise at time i, which is assumed to be an independent and identically distributed (i.i.d.) Gaussian random variable with zero mean and unit variance, i.e., V∼𝒩(0,1).
For a parametric approach or weight-space view to regression, the estimated latent function {circumflex over (f)}(u) is expressed in terms of a parameters vector or weights w. In the standard linear form:
{circumflex over (f)}(u)=wTu. (5)
To overcome the limited expressiveness of this model, the M-dimensional input vector u∈𝒰⊆ℝM (where 𝒰 is a compact input domain in ℝM) can be projected into a potentially infinite dimensional feature space 𝔽. Defining a 𝒰→𝔽 mapping Φ(u), the parametric model of Equation (5) becomes:
{circumflex over (f)}(u)=ΩTΦ(u), (6)
where Ω is the weight vector in the feature space.
Using the Representer Theorem and the “kernel trick”, Equation (6) can be expressed as:

f̂(u)=Σi=1N αiK(ui,u), (7)

where K(u, u′) is a Mercer kernel, corresponding to the inner product ⟨Φ(u), Φ(u′)⟩, and N is the number of basis functions or training samples. Note that 𝔽 is equivalent to the reproducing kernel Hilbert space (RKHS) induced by the kernel if identified as Φ(u)=K(u,⋅). The most commonly used kernel is the Gaussian kernel
Ka(u,u′)=exp(−a∥u−u′∥2), (8)
where a>0 is the kernel parameter. Without loss of generality, the focus is on the kernel least-mean-square algorithm, which is the simplest KAF algorithm.
The learning rule for the KLMS algorithm in the feature space follows the classical linear adaptive filtering algorithm, the LMS:

Ω0=0, ei=yi−Ωi−1TΦ(ui), Ωi=Ωi−1+ηeiΦ(ui),

which, in the original input space, becomes

f0=0, ei=yi−fi−1(ui), fi=fi−1+ηeiK(ui,⋅),
where ei is the prediction error in the i-th time step, η is the learning rate or step-size, and fi denotes the learned mapping at iteration i. Using KLMS, the mean of y can be estimated with linear per-iteration computational complexity O(N), making it an attractive online algorithm.
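A minimal Python sketch of this update is given below for illustration; the class and variable names are assumptions, and the growing lists of centers and coefficients mirror the growing RBF structure discussed above.

import numpy as np

class KLMS:
    # Minimal kernel least-mean-square sketch (illustrative, not the disclosed implementation).

    def __init__(self, kernel_a=1.0, step_size=0.1):
        self.a = kernel_a          # Gaussian kernel parameter a
        self.eta = step_size       # learning rate eta
        self.centers = []          # stored inputs u_i
        self.alphas = []           # coefficients eta * e_i

    def kernel(self, u, v):
        return np.exp(-self.a * np.sum((u - v) ** 2))

    def predict(self, u):
        # f_{i-1}(u) = sum_j alpha_j K(u_j, u)
        return sum(a * self.kernel(c, u) for c, a in zip(self.centers, self.alphas))

    def update(self, u, y):
        e = y - self.predict(u)               # prediction error e_i
        self.centers.append(np.asarray(u, dtype=float))
        self.alphas.append(self.eta * e)      # new center weighted by eta * e_i
        return e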
Nearest Instance Centroid Estimation
As described in the previous section, the NICE algorithm operates under the framework of subspace decomposition by organizing new sample points into existing clusters (quasi-orthogonal regions) or forming new ones based on the minimum centroid distance dmin(c) and the threshold distance dc. For continuous online adaptation of the KLMS algorithm, the first data sample can be used to initialize a cluster, which also serves as its centroid and the weight of the KAF associated with the first cluster. For each subsequent data point, the minimum centroid distance is computed, resulting in two types of operations: Cluster, in which the minimum centroid distance is within the threshold dc and the sample is assigned to (and updates) the kernel adaptive filter of its nearest cluster; and Split, in which the minimum centroid distance exceeds dc and a new cluster is formed with the sample as its initial centroid.
Clearly, the Cluster operation does not change the behavior of the KLMS algorithm, except that instead of updating the weights of a global filter, each new sample is assigned to a local filter associated with its nearest cluster or region in the input/feature space. The Split operation, on the other hand, carves out a new local region. If the kernel adaptive filter associated with this new cluster is initialized from scratch with just one sample, a performance discontinuity in time results. For continuous learning, this jump becomes insignificant in the long run; for short-term updates, however, it can be avoided by copying the weights from the nearest-neighbor cluster (out-of-range in terms of the centroid distance threshold, but spatially still the closest). This can be viewed as a smoothing procedure. In the worst case, the last cluster will retain a dictionary size equivalent to KLMS (if the dictionary is passed from one cluster to the next in its entirety), but this occurs with probability zero: for it to happen, the data would have to be preorganized by cluster and presented to the algorithm in order. An exponentially decaying term λ can be used to gradually diminish the effects of the copied coefficients in that particular part of the space. These initial samples can also be removed when their contributions fall below a certain threshold, as new samples are brought into the cluster. More elaborate schemes such as MDL can be used to further reduce the cluster size. Note that the out-of-range centers associated with these weights will never be used to update the centroid location. Since the centroid is the mean of the within-cluster centers, its location can be easily updated with a one-step operation using its previous location, the number of existing within-cluster centers, and the new data point.
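For illustration, the one-step centroid update described above can be written as follows; this is a minimal Python sketch and the function and variable names are illustrative, not from the disclosure.

import numpy as np

def update_centroid(centroid, n, u_new):
    # Incremental centroid update from the previous centroid, the current
    # within-cluster count n, and the new data point.
    centroid = np.asarray(centroid, dtype=float)
    return (n * centroid + np.asarray(u_new, dtype=float)) / (n + 1), n + 1

# example: the centroid of {1.0, 3.0} is 2.0; adding the point 5.0 moves it to 3.0
c, n = update_centroid(np.array([2.0]), 2, np.array([5.0]))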
Since the Gaussian kernel is isotropic, and the interval estimation and coverage probability of a normal distribution are known, for convenience and intuition the NICE centroid distance threshold dc can be expressed in terms of the unnormalized standard deviation. The unnormalized Gaussian-kernel standard deviation σk is defined with respect to the kernel parameter a in Equation (8) as:

σk=1/√(2a),

since exp(−a∥u−u′∥2)=exp(−∥u−u′∥2/(2σk2)).
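As a worked check of this convention: with kernel parameter a=1, σk=1/√2≈0.7071, so a centroid distance threshold of dc=3σk≈2.1213, which is the value used in the Mackey-Glass experiment below.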
An example of the NICE-KLMS algorithm is summarized by the algorithm of
The NICE-KLMS algorithm behaves identically to KLMS when the number of clusters is fixed at one, i.e., with an infinite centroid distance threshold or dc=∞. In practice, it runs much faster than KLMS, since the number of centers needed per cluster/filter is significantly smaller and, on average, the number of clusters (the operations needed to select the appropriate filter) is significantly smaller than the average size of the individual clusters, i.e., |C|<<|Cc| on average.
Vector Quantization
As noted above, the vector quantization technique in QKLMS is complementary to NICE, and can be combined to further reduce the network structure and run-time complexity. Each of the within-cluster centers can be viewed as a mini centroid and compacted using a quantization distance threshold dq. An example of the NICE-QKLMS algorithm is presented in the algorithm of
In the case that the minimum VQ distance is less than the predefined threshold (Cluster & Merge), QKLMS assumes the current input is a direct copy of its nearest neighbor, thus only updating the coefficients with the instantaneous error. This is an approximation with practical value. A more appropriate treatment is to update the coefficient using convex optimization or, in this case, gradient descent. The exact error gradient with respect to the closest-neighbor coefficient αi* can be determined as:

∂εi/∂αi*=−eiK(ui*,ui),

where εi=ei2/2 is the cost function. Clearly, the instantaneous error ei needs to be scaled by a kernel evaluation between the current input ui and its nearest neighbor ui*. This formulation is termed the modified or M-QKLMS. This option is reflected in the algorithm of
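The three operations (Cluster & Retain, Cluster & Merge with the M-QKLMS scaling above, and Split & Retain) can be sketched together in Python as follows; u is a NumPy input vector and y a scalar desired output, and the data layout, parameter defaults, and omission of weight copying on a split are simplifying assumptions for illustration only.

import numpy as np

def nice_qklms_update(clusters, u, y, eta=0.1, a=1.0, d_c=2.1213, d_q=0.1):
    # One NICE-QKLMS step (sketch). Each cluster is a dict with keys
    # 'centroid', 'n', 'centers', 'alphas'; the layout is illustrative.
    k = lambda x, v: np.exp(-a * np.sum((x - v) ** 2))
    if not clusters:
        clusters.append({'centroid': u.copy(), 'n': 1,
                         'centers': [u.copy()], 'alphas': [eta * y]})
        return clusters
    dists_c = [np.linalg.norm(u - cl['centroid']) for cl in clusters]
    ci = int(np.argmin(dists_c))
    if dists_c[ci] > d_c:
        # Split & Retain: open a new cluster (weight copying from the nearest
        # cluster, described above, is omitted), so the fresh filter predicts 0.
        clusters.append({'centroid': u.copy(), 'n': 1,
                         'centers': [u.copy()], 'alphas': [eta * y]})
        return clusters
    c = clusters[ci]
    # evaluate only the local subfilter of the nearest cluster
    e = y - sum(al * k(cj, u) for cj, al in zip(c['centers'], c['alphas']))
    dists_q = [np.linalg.norm(u - cj) for cj in c['centers']]
    qi = int(np.argmin(dists_q))
    if dists_q[qi] <= d_q:
        # Cluster & Merge: M-QKLMS scales the error by the kernel between u and its neighbor
        c['alphas'][qi] += eta * e * k(c['centers'][qi], u)
    else:
        # Cluster & Retain: add a new within-cluster center and update the centroid
        c['centers'].append(u.copy())
        c['alphas'].append(eta * e)
        c['centroid'] = (c['n'] * c['centroid'] + u) / (c['n'] + 1)
        c['n'] += 1
    return clusters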
NICE-QKLMS Mean-Square-Convergence Analysis
Here, the energy conservation relation for adaptive filtering can be used to show the sufficient condition for mean square convergence of the NICE-QKLMS algorithm. The upper and lower steady-state excess-mean-square-error bounds can also be established. Two simplifying hypotheses are imposed here: the clustering operation discussed above is optimal, i.e., no clustering errors are introduced, and the orthogonalization amongst clusters is exact. First, let a general nonlinear model be defined as:
di=f(ui)+vi, (13)
where di is the noisy measurement or desired value, f(⋅) is the unknown nonlinear mapping, and vi denotes measurement noise. In this disclosure, the focus is on the class of kernel adaptive filtering algorithms defined in Equation (6). The universal approximation property states that there exists a vector Ω* such that f(⋅)=Ω*Tψ(⋅). The prediction error becomes:

ei=di−Ωi−1Tψ(ui)=Ω̃i−1Tψ(ui)+vi, (14)

where Ω̃i−1T≜Ω*T−Ωi−1T is the weight error vector in the functional space 𝔽. The steady-state mean-squared-error (MSE) of an adaptive filter is defined as

MSE≜limi→∞E[ei2]. (15)
Under the widely-used and often realistic assumption (A.1) that the noise vi is zero-mean, i.i.d., and independent of the input sequence (and therefore of the a priori estimation error), the steady-state MSE is the sum of the noise variance σv2 and the excess mean-squared error (EMSE), limi→∞E[(ei−)2].
If it is further assumed (A.2) that, at steady state, the feature-space input ψ(ui) is statistically independent of the weight-error vector Ω̃i−1, the EMSE can be expressed in terms of Ci and R, where Ci≜E[Ω̃i−1Ω̃i−1T] is the weight error covariance matrix and R≜E[ψ(ui)ψ(ui)T] is the autocorrelation matrix of the transformed input.
Conservation of Energy for Kernel Adaptive Filtering
First, define the a priori and a posteriori estimation errors, ei− and ei+ respectively, as:

ei−≜Ω̃i−1Tψ(ui), (18)

ei+≜Ω̃iTψ(ui). (19)
Substituting Equation (18) into Equation (14) yields the following relation between the error terms {ei,ei−}:
ei=ei−+vi. (20)
Subtracting the optimal weight Ω* from both sides of the weight update equation, then multiplying both sides by the feature space input ψ(ui), from the right, gives:

Ωi=Ωi−1+ηeiψ(ui)

Ωi−Ω*=Ωi−1−Ω*+ηeiψ(ui) (21)

Ω̃iTψ(ui)=Ω̃i−1Tψ(ui)−ηeiψ(ui)Tψ(ui) (22)

ei+=ei−−ηeiK(ui,ui)

ηei=ei−−ei+, (23)

since K(ui,ui)=exp(−a∥ui−ui∥2)=1. Substituting Equation (23) into Equation (21) yields the following weight-error vector update rule:
Ωi−Ω*=Ωi−1−Ω*+(ei−−ei+)ψ(ui)

Ω̃i=Ω̃i−1−(ei−−ei+)ψ(ui). (24)
To evaluate the energy conservation of Equation (24), square both sides, yielding:
or, in shorthand notation:
∥Ω̃i∥2+(ei−)2=∥Ω̃i−1∥2+(ei+)2, (26)
which describes how the energies of the weight-error vectors for two successive time instants i−1 and i are related to the energies of the a priori and a posteriori estimation errors.
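Because Equation (26) is an algebraic identity whenever ψ(ui)Tψ(ui)=1, it can be verified numerically with a finite-dimensional, unit-norm feature vector standing in for ψ(ui). The Python sketch below (random data and illustrative dimensions, not the disclosed setup) checks the relation at every LMS step.

import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # finite-dimensional stand-in for the RKHS
omega_star = rng.normal(size=D)          # "true" weights defining f(.)
omega = np.zeros(D)                      # adaptive weights, Omega_0 = 0
eta, sigma_v = 0.1, 0.1

for i in range(1000):
    psi = rng.normal(size=D)
    psi /= np.linalg.norm(psi)           # unit norm, mirroring K(u_i, u_i) = 1
    d = omega_star @ psi + sigma_v * rng.normal()   # noisy desired signal
    e = d - omega @ psi                  # prediction error e_i
    e_prior = (omega_star - omega) @ psi             # a priori error e_i^-
    omega_new = omega + eta * e * psi    # LMS update in the feature space
    e_post = (omega_star - omega_new) @ psi          # a posteriori error e_i^+
    lhs = np.sum((omega_star - omega_new) ** 2) + e_prior ** 2
    rhs = np.sum((omega_star - omega) ** 2) + e_post ** 2
    assert np.isclose(lhs, rhs), "energy conservation violated"
    omega = omega_new

print("Equation (26) holds at every step up to numerical precision.")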
Steady-State MSE Performance Analysis
In the steady state, the following assumption holds:

limi→∞E[∥Ω̃i∥2]=limi→∞E[∥Ω̃i−1∥2]<∞, (27)

where the mean square deviation converges to a steady-state value, so the effect of the weight-error vector cancels out. Taking the expectation on both sides of Equation (26) yields:

E[∥Ω̃i∥2]+E[(ei−)2]=E[∥Ω̃i−1∥2]+E[(ei+)2]. (28)
Substituting the expression for the a posteriori estimation error ei+ in Equation (23) into the right-hand side of Equation (28) gives
E[∥Ω̃i∥2]+E[(ei−)2]=E[∥Ω̃i−1∥2]+E[(ei−−ηei)2]

E[∥Ω̃i∥2]=E[∥Ω̃i−1∥2]−2ηE[eiei−]+η2E[ei2]. (29)
Clearly, a sufficient condition for mean square convergence is to ensure a monotonic decrease of the weight-error power E[∥Ω̃i∥2], or:
−2ηE[eiei−]+η2E[ei2]≤0. (30)
Since the step size is lower bounded by 0, Equations (30) and (20) give:

0<η≤2E[eiei−]/E[ei2] =(a) 2E[(ei−)2]/(E[(ei−)2]+σv2), (31)

where equality (a) follows from A.1, i.e., the cross-term E[viei−]=E[vi]E[ei−]=0. From Equation (31), the following sufficient condition can be obtained:
where equality (b) follows from the kernel trick. Summarizing the sufficient conditions below:
it can be seen that, for weight adaptation in 𝔽 using the current feature space input ψ(ui), as long as the step size η is appropriately selected according to Equation (31), the NICE-KLMS algorithm converges in the same fashion as KLMS.
At steady state, the excess mean-squared error (EMSE) is given by simple manipulation of Equations (30) and (31):

EMSE≜limi→∞E[(ei−)2]=ησv2/(2−η). (34)
For the three operations of NICE-QKLMS, the effects on the mean square convergence and the steady-state EMSE are shown below.
Cluster & Retain:
Let the a priori weight vector for cluster Cc be denoted by Ωi−1(c). Updating the weight vector using the current input ψ(ui) does not change the behavior of the mean square convergence or EMSE.
Cluster & Merge:
Instead of using the current input ψ(ui) to update the weight vector for cluster c, its nearest within-cluster neighbor ψ(uq(c)) is used, i.e., Ωi=Ωi−1+ηeiψ(uq(c)), where uq(c)=arg minu∈U(c)∥ui−u∥. This affects the kernel trick used to simplify the expressions throughout the steady-state MSE performance analysis. The simple identity ⟨ψ(ui), ψ(ui)⟩=1 is no longer valid; rather, since the squared quantization distance is bounded by a factor q>0, i.e., ∥ui−uq(c)∥2≤q, the kernel value satisfies K(ui,uq(c))≥exp(−aq).
This introduces a new energy conservation relation. Substituting the current input with its nearest within-cluster neighbor in Equation (22) gives:

ei+=ei−−ηeiK(uq(c),ui), (35)
Substituting this new expression that relates the three error terms {ei+,ei−,ei} during the merge update for Equation (23) in Equation (21), the energy conservation relation in Equation (26) becomes:
where Jq denotes the quantization energy due to the merge operation. It follows that:
In the limit as the quantization factor q→0, i.e., as K(uq(c),ui)→1, the quantization energy Jq→0 and Equation (37) reduces to Equation (26).
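As a worked illustration of this bound, under the assumption (an interpretation, not a statement from the disclosure) that the factor q corresponds to the squared quantization threshold, q=dq2: with a=1 and dq=0.1, as used in the simulations below, exp(−aq)=exp(−0.01)≈0.990, so a merge perturbs the unit kernel identity K(ui,ui)=1 by at most about 1%.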
Again, using Equation (35), the sufficient conditions for mean square convergence in Equation (33) becomes:
which is satisfied with an appropriately selected step size and a sufficiently small quantization factor q such that K(uq(c),ui)=exp(−a∥uq(c)−ui∥2)>0. It follows that the steady-state EMSE is:
The expected value in the numerator, on the right-hand side of Equation (39), can be expanded as:
where equalities (c) and (d) follow from the symmetry property and the scaling-and-distributive property of RKHS, respectively, equality (e) holds because inner products are scalars, and equality (f) results from A.2.
Since the maximum squared distance for the merge operation is determined by the quantization factor q, it follows that Equation (40) is bounded as:
(exp(−aq)−1)E[∥Ω̃i−1∥2]≤(E[exp(−a∥ui−uq(c)∥2)]−1)E[∥Ω̃i−1∥2]≤0, (41)
where the upper-bound is achieved when the current input is an existing support, i.e., uq(c)=ui.
Substituting Equations (40) and (41) into Equation (39) yields the following bounds for the NICE-QKLMS EMSE:
Compared to Equation (34), the NICE-KLMS is a special case of NICE-QKLMS. The universal approximation property and the mean square convergence of Equation (38) indicate that:
when i approaches infinity and the quantization factor is zero, i.e., given infinite training data and no quantization. Note that this is the average asymptotic behavior for the ensemble of KLMS filters; individual performance using finite training data may vary.
Split & Retain:
Creating a new cluster c′ and updating the new weight vector Ωi−1(c′) using the current input ψ(ui) does not change the behavior of the mean square convergence or EMSE. As long as these operations are maintained, which are essentially the same building blocks of QKLMS, the mean square convergence is not changed from the QKLMS analysis.
Simulation Results
Here, the performance of the proposed NICE-KLMS algorithm and the generalized NICE-QKLMS algorithm was evaluated for the task of short-term chaotic time series prediction and transfer learning. Since the QKLMS algorithm has been studied extensively and established as the state-of-the-art performer for curbing the growth of the RBF structure in kernel adaptive filtering, the comparisons focused on the QKLMS algorithm. Specifically, it is shown that the NICE-QKLMS algorithm can outperform the QKLMS algorithm, using finite training data, with fewer data centers per evaluation. Under the framework of transfer learning, NICE-QKLMS can also leverage previously learned knowledge, i.e., filter parameters, for a related task or domain.
Mackey-Glass Time Series Prediction
First, the NICE-QKLMS was tested on the Mackey-Glass (MG) chaotic time series, generated using the following time-delay ordinary differential equation:

dx(t)/dt=βx(t−τ)/(1+x(t−τ)10)−γx(t),

with β=0.2, γ=0.1, τ=30, and discretized at a sampling period of 6 seconds. Chaotic dynamics are highly sensitive to initial conditions: small differences in initial conditions produce widely diverging outcomes, rendering long-term prediction intractable in general. Additive Gaussian noise with zero mean and standard deviation σn=0.04, i.e., V∼𝒩(0,1.6×10−3), was introduced. The time-delay embedding length or filter length was set at L=12; the learning rate for all algorithms at η=0.1; the kernel parameter for all three KAF algorithms at a=1; the quantization threshold at dq=0.1 for QKLMS and NICE-QKLMS; and the centroid distance threshold for NICE-QKLMS at dc=3σk=2.1213. The training set consisted of 3000 consecutive samples. Testing was comprised of 400 independent samples.
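A minimal way to generate such a series is sketched below in Python (Euler integration with a delay buffer; the integration step, initial history, and noise seeding are illustrative assumptions rather than the exact settings used in the experiments).

import numpy as np

def mackey_glass(n_samples, beta=0.2, gamma=0.1, tau=30.0, dt=0.1,
                 sample_period=6.0, x0=1.2):
    # Generate a Mackey-Glass series by Euler integration of the delay ODE.
    delay_steps = int(round(tau / dt))
    history = [x0] * (delay_steps + 1)          # constant initial history
    steps_per_sample = int(round(sample_period / dt))
    out = []
    for step in range(n_samples * steps_per_sample):
        x_t = history[-1]
        x_tau = history[-1 - delay_steps]
        dx = beta * x_tau / (1.0 + x_tau ** 10) - gamma * x_t
        history.append(x_t + dt * dx)
        if (step + 1) % steps_per_sample == 0:
            out.append(history[-1])
    return np.asarray(out)

# example: 3000 noisy training samples with sigma_n = 0.04
series = mackey_glass(3000) + 0.04 * np.random.default_rng(0).normal(size=3000)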
Referring to
Independent trials were run for 100 Monte Carlo simulations, in which training consisted of the same 3000 consecutive samples but with noise re-sampled from the same distribution, and testing consisted of 400 independent consecutive samples with re-sampled noise and random starting index.
For this particular experimental setup, the NICE-QKLMS network used more than 100 fewer centers per evaluation (257.01 vs. 359.51) than QKLMS after 3000 iterations. Left uncurbed, KLMS grew linearly, reaching 3000 centers after the same number of updates. The vector quantization algorithm in QKLMS bounds the center-to-center distances from below, sequentially merging nearby inputs into existing centers. However, it lacks a mechanism to bound center-to-center distances from above. For a given input sample, many of the centers in the QKLMS dictionary are far enough away that the output of the Gaussian reproducing kernel is negligible, contributing little to the evaluation while still costing computation.
On the other hand, by partitioning the input/feature space into distinct spatial regions, NICE-QKLMS is able to specialize and provide better or similar performance using fewer samples per operation. The average number of clusters at the end of the adaptation was 2.53. On average, to evaluate the function approximation at each input, NICE-QKLMS automatically selected one of the 2.53 filters (with an average of 257 centers per filter) based on the minimum input-to-centroid distance and performed KAF. For the same performance, the computational savings of NICE-QKLMS vs. QKLMS is approximately 100 kernel evaluations, taking into account the 2.53 centroid distance computations used for filter selection.
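As a rough arithmetic check of the stated savings: each NICE-QKLMS evaluation costs about 2.53 centroid-distance computations plus an average of 257.01 within-cluster kernel evaluations, roughly 259.5 operations, versus 359.51 kernel evaluations for QKLMS, a difference of approximately 100 per input.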
Lorenz Time Series Prediction
Next, consider the Lorenz chaotic system described by the following three ordinary differential equations:

dx/dt=σ(y−x),
dy/dt=x(ρ−z)−y,
dz/dt=xy−βz,

where σ=10, β=8/3, and ρ=28 are the parameters at which the system exhibits chaotic behavior. The Lorenz system is nonlinear and aperiodic. The x-component is used in the following short-term prediction task. The signal is normalized to be zero mean and unit variance.
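A simple way to produce the normalized x-component is sketched below in Python (fixed-step Euler integration; the step size and duration are illustrative assumptions, while the initial condition matches the one mentioned later in the disclosure).

import numpy as np

def lorenz_x(n_samples, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0,
             init=(0.0, 1.0, 1.05)):
    # Integrate the Lorenz system with Euler steps and return the normalized x-component.
    x, y, z = init
    xs = np.empty(n_samples)
    for i in range(n_samples):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs[i] = x
    return (xs - xs.mean()) / xs.std()      # zero mean, unit variance

x_signal = lorenz_x(6000)                   # e.g., 6000 samples for training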
For a more comprehensive comparison between the NICE-(Q)KLMS and (Q)KLMS algorithms, their performances (prediction gain and per-evaluation network size) were visualized using 3D surface plots as illustrated in
where the prediction gain is defined as 10 log10(σu2/σe2), σu2 is the signal power, and σe2 is the MSE. Each of the six subplots corresponds to the KAF performances using a different vector quantization threshold: dq=0 in
Within each subplot of
As expected, the best performance is achieved when the quantization threshold is at zero, in
Transfer Learning Using Content Addressable Filter Bank (CAFB)
Under the NICE framework, partial functionals comprising the adaptive filter can be quickly stored and retrieved based on the input pattern. Instead of frequency bands, the subfilters are organized by amplitude or spatial bands or patterns. Since each cluster or distinct quasi-orthogonal region corresponds to a specialized “spatial-band” subfilter, the filter evaluation becomes the update of one of the partial filters, creating a content addressable filter bank or associative filter storage. This CAFB can be incrementally updated for new signal applications with mild constraints (e.g., amplitude normalization and the same embedding dimension), opening the door for transfer learning and significantly more efficient training for new data scenarios, avoiding the large initial errors produced by training from scratch, as has been done since the invention of adaptive filtering, and leveraging previously learned knowledge to enhance prediction on limited data.
Here, the multipurpose capability of the NICE algorithm can be demonstrated by showing that each subfilter can be shared across different signals. Specifically, it can be shown that a NICE CAFB trained on one chaotic time series (Mackey-Glass) can be quickly repurposed for another time series (Lorenz), and one trained on the Lorenz time series can be transferred to enhance the performance of the real-world sunspot one-step-ahead prediction task. This is expected to be the case for other applications where a model for the time series is required but the number of labeled data is limited.
(1) Chaotic Time Series Prediction:
The Lorenz time series was generated with the parameters σ=10, β=8/3, and ρ=28, and initial condition (x0,y0,z0)=(0,1,1.05).
NICE self-organizes the data into interchangeable local components that can be used for different signals. To further illustrate its multipurpose capability, the MG-trained filter was adapted using the Lorenz data to show that adapting it is faster than training from scratch. NICE provides fast, native support for automatically isolating and identifying a problem region. In the example, the Lorenz data contained new sample points (higher, narrower peaks) that are not represented in the MG training data. Rather than updating the entire filter, NICE was only allowed to automatically create/split and update a single new cluster, using the exact same centroid parameters as before.
Finally, the learning curves of the updated filters (Lorenz chaotic time series prediction using updated NICE filters initially trained on MG) were compared to the learning curves of filters learned from scratch, as shown in
(2) Sunspot Prediction:
The first 6000 samples of the Lorenz chaotic time series in
The top subplot of
With reference now to the computing device illustrated in the drawings, the computing device includes at least a processor 1503 and a memory 1506, both of which are coupled to a local interface 1509.
Stored in the memory 1506 are both data and several components that are executable by the processor 1503. In particular, stored in the memory 1506 and executable by the processor 1503 are a NICE-KLMS application 1512, one or more CAFB 1515 that may be used for object recognition, and potentially other applications 1518. Also stored in the memory 1506 may be a data store 1521 including, e.g., images, video and other data. In addition, an operating system may be stored in the memory 1506 and executable by the processor 1503. It is understood that there may be other applications that are stored in the memory and are executable by the processor 1503 as can be appreciated.
Where any component discussed herein is implemented in the form of software, any one of a number of programming languages may be employed such as, for example, C, C++, C#, Objective C, Java®, JavaScript®, Perl, PHP, Visual Basic®, Python®, Ruby, Delphi®, Flash®, or other programming languages. A number of software components are stored in the memory and are executable by the processor 1503. In this respect, the term “executable” means a program file that is in a form that can ultimately be run by the processor 1503. Examples of executable programs may be, for example, a compiled program that can be translated into machine code in a format that can be loaded into a random access portion of the memory 1506 and run by the processor 1503, source code that may be expressed in proper format such as object code that is capable of being loaded into a random access portion of the memory 1506 and executed by the processor 1503, or source code that may be interpreted by another executable program to generate instructions in a random access portion of the memory 1506 to be executed by the processor 1503, etc. An executable program may be stored in any portion or component of the memory including, for example, random access memory (RAM), read-only memory (ROM), hard drive, solid-state drive, USB flash drive, memory card, optical disc such as compact disc (CD) or digital versatile disc (DVD), floppy disk, magnetic tape, or other memory components.
The memory is defined herein as including both volatile and nonvolatile memory and data storage components. Volatile components are those that do not retain data values upon loss of power. Nonvolatile components are those that retain data upon a loss of power. Thus, the memory 1506 may comprise, for example, random access memory (RAM), read-only memory (ROM), hard disk drives, solid-state drives, USB flash drives, memory cards accessed via a memory card reader, floppy disks accessed via an associated floppy disk drive, optical discs accessed via an optical disc drive, magnetic tapes accessed via an appropriate tape drive, and/or other memory components, or a combination of any two or more of these memory components. In addition, the RAM may comprise, for example, static random access memory (SRAM), dynamic random access memory (DRAM), or magnetic random access memory (MRAM) and other such devices. The ROM may comprise, for example, a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other like memory device.
Also, the processor 1503 may represent multiple processors 1503 and the memory 1506 may represent multiple memories 1506 that operate in parallel processing circuits, respectively. In such a case, the local interface 1509 may be an appropriate network that facilitates communication between any two of the multiple processors 1503, between any processor 1503 and any of the memories 1506, or between any two of the memories 1506, etc. The processor 1503 may be of electrical or of some other available construction.
Although portions of the NICE-KLMS application 1512, CAFB 1515, and other various systems described herein may be embodied in software or code executed by general purpose hardware, as an alternative the same may also be embodied in dedicated hardware or a combination of software/general purpose hardware and dedicated hardware. If embodied in dedicated hardware, each can be implemented as a circuit or state machine that employs any one of or a combination of a number of technologies. These technologies may include, but are not limited to, discrete logic circuits having logic gates for implementing various logic functions upon an application of one or more data signals, application specific integrated circuits having appropriate logic gates, or other components, etc. Such technologies are generally well known by those skilled in the art and, consequently, are not described in detail herein.
The NICE-KLMS application 1512 and CAFB 1515 can comprise program instructions to implement logical function(s) and/or operations of the system. The program instructions may be embodied in the form of source code that comprises human-readable statements written in a programming language or machine code that comprises numerical instructions recognizable by a suitable execution system such as a processor 1503 in a computer system or other system. The machine code may be converted from the source code, etc. If embodied in hardware, each block may represent a circuit or a number of interconnected circuits to implement the specified logical function(s).
Also, any logic or application described herein, including the NICE-KLMS application 1512 and CAFB 1515 that comprises software or code can be embodied in any non-transitory computer-readable medium for use by or in connection with an instruction execution system such as, for example, a processor 1503 in a computer system or other system. In this sense, the logic may comprise, for example, statements including instructions and declarations that can be fetched from the computer-readable medium and executed by the instruction execution system. In the context of the present disclosure, a “computer-readable medium” can be any medium that can contain, store, or maintain the logic or application described herein for use by or in connection with the instruction execution system.
The computer-readable medium can comprise any one of many physical media such as, for example, magnetic, optical, or semiconductor media. More specific examples of a suitable computer-readable medium would include, but are not limited to, magnetic tapes, magnetic floppy diskettes, magnetic hard drives, memory cards, solid-state drives, USB flash drives, or optical discs. Also, the computer-readable medium may be a random access memory (RAM) including, for example, static random access memory (SRAM) and dynamic random access memory (DRAM), or magnetic random access memory (MRAM). In addition, the computer-readable medium may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other type of memory device.
In this disclosure, a novel online nearest-neighbors approach to organize and curb the growth of the radial basis function (RBF) structure in kernel adaptive filtering (KAF) algorithms is presented. The nearest-instance-centroid-estimation (NICE) kernel least-mean-square (KLMS) algorithm is an instance-based learning algorithm that provides an appropriate time-space tradeoff with good online performance. Its centers or support vectors in the input/feature space form self-organized regions. The need to maintain complex search data structures is traded for a depth-1 forest with the iteratively updated centroid of each cluster at the root. A linear search among the centroids determines the subset of local supports, or subfilter, used to evaluate a given function approximation. Compared to the popular RBF network reduction algorithm used in quantized KLMS, which only bounds the network structure or center-to-center distances from below, NICE bounds the network structure from above, by relocating centers outside of a certain responsive domain to a different subfilter. Using the energy conservation relation for adaptive filtering, the sufficient condition for mean square convergence of the NICE-KLMS algorithm was shown. The upper and lower steady-state excess-mean-square-error (EMSE) bounds were also established. As a proof of concept, vector quantization (VQ) was combined with NICE to formulate the novel NICE-QKLMS algorithm. Simulations on chaotic time-series prediction tasks demonstrated that the proposed method outperforms the existing vector quantization method while using fewer centers per evaluation. Furthermore, the multipurpose capability of the novel approach was demonstrated by performing regression on different signals using the same content addressable filter bank (CAFB) or associative filter storage. The NICE CAFB can leverage previously learned knowledge for a related task or domain.
A novel approach for cluster analysis or unsupervised learning within the kernel adaptive filtering framework for regression was presented. By self-organizing the data centers into distinct spatial regions, and with NICE's ability to detect changes in the data distribution, non-stationary learning systems are possible. As a CAFB, it enables universal filtering of different signals. The NICE framework is also closely related to multiple and mixture kernel learning, but is formulated within a single fixed RKHS. Enhanced versions can be developed by using different kernel parameters, introducing adaptive learning parameters, and applying the associative filter storage to multiple tasks.
A novel nearest-neighbors approach to organize and curb the growth of the radial basis function (RBF) structure in kernel adaptive filtering (KAF) has been discussed. The nearest-instance-centroid-estimation (NICE) kernel least-mean-square (KLMS) algorithm provides an appropriate time-space trade-off with good online performance. Its centers in the input/feature space form self-organized regions. Compared with conventional KAF, instead of using all centers to evaluate/update the function approximation at a given point, a linear search among the iteratively-updated centroids determines the set of local supports used, naturally forming a locally-supported reproducing kernel. NICE is complementary to existing RBF network reduction algorithms. Under the NICE framework, information is quickly stored and retrieved based on its content. Since each cluster corresponds to a specialized spatial-band filter, it becomes a content addressable filter bank (CAFB). This CAFB can be incrementally updated for new applications, always using the past-learned filters, allowing for transfer learning and significantly more efficient training for new data scenarios, avoiding training from scratch as has been done since the beginning of adaptive filtering.
It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described embodiment(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.
The term “substantially” is meant to permit deviations from the descriptive term that don't negatively impact the intended purpose. Descriptive terms are implicitly understood to be modified by the word substantially, even if the term is not explicitly modified by the word substantially.
It should be noted that ratios, concentrations, amounts, and other numerical data may be expressed herein in a range format. It is to be understood that such a range format is used for convenience and brevity, and thus, should be interpreted in a flexible manner to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. To illustrate, a concentration range of “about 0.1% to about 5%” should be interpreted to include not only the explicitly recited concentration of about 0.1 wt % to about 5 wt %, but also include individual concentrations (e.g., 1%, 2%, 3%, and 4%) and the sub-ranges (e.g., 0.5%, 1.1%, 2.2%, 3.3%, and 4.4%) within the indicated range. The term “about” can include traditional rounding according to significant figures of numerical values. In addition, the phrase “about ‘x’ to ‘y’” includes “about ‘x’ to about ‘y’”.
This application claims priority to U.S. Patent Application entitled “AUTOMATIC COMPOSITION OF UNIVERSAL FILTERS,” filed on Aug. 28, 2017, and assigned application No. 62/550,751, which is incorporated herein by reference in its entirety.
This invention was made with government support under grant number N66001-15-1-4054 awarded by the U.S. Department of Defense, Defense Advanced Research Projects Agency (DARPA). The Government has certain rights in the invention.