Succinct representations of data (especially high-dimensional data) are paramount to promptly extract actionable information and reduce the storage and communication requirements related to sharing and archiving data. A fundamental observation, first made in the image processing community, is that a more efficient signal representation is possible if one uses an over-complete dictionary learned from the signals themselves rather than a fixed basis. Given a large number of signal samples, the dictionary learning problem aims to construct a dictionary that can be used to efficiently represent the data as a linear combination of its columns, also known as atoms. In this context, a more efficient data representation is one that uses a smaller number of atoms to achieve the desired signal reconstruction quality.
Usually, the dictionary is learned by fitting training data to a linear model representation for the data via a regularized least-squares criterion, where regularization is used to encourage a sparse structure in the vectors of regression coefficients. Naturally, the data samples used for learning the dictionary are assumed to adhere to the nominal process that generated the data. However, in many cases it is not possible to screen all data samples to guarantee that no datum behaves as an outlier, that is, deviates significantly from the remaining data. A new method is needed for robust dictionary learning.
Disclosed herein is a method for constructing a dictionary to succinctly represent data from a training data set that comprises the following steps. The first step provides for modeling the data as a linear combination of columns of the dictionary. The next step provides for modeling the presence of outliers in the data set via deterministic outlier vectors that take nonzero values only for data vectors that are deemed outliers. The next step provides for formatting the training data set in matrix form for processing. The next step provides for defining an underlying structure in the data set. The next step provides for quantifying a similarity across the data according to the underlying structure. The next step provides for building a Laplacian matrix from the matrix-formatted data set; the underlying data structure is captured within the Laplacian matrix. The next step provides for using Lasso and group-Lasso regularizers to succinctly represent the data as a linear combination of columns of the dictionary. The next step provides for choosing scalar parameters that control the number of dictionary columns used to represent the data and the number of elements of the training data set identified as outliers. The next step provides for using block coordinate descent (BCD) and proximal gradient (PG) methods on the vector-matrix-formatted data set to estimate a dictionary that efficiently represents each datum in the training data set as a linear combination of a subset of dictionary columns, corresponding expansion coefficients, and the outlier vectors. The final step provides for using the length of the outlier vectors to identify outliers in the data.
Throughout the several views, like elements are referenced using like references. The elements in the figures are not drawn to scale and some dimensions are exaggerated for clarity.
The disclosed methods below may be described generally, as well as in terms of specific examples and/or specific embodiments. For instances where references are made to detailed examples and/or embodiments, it should be appreciated that any of the underlying principles described are not to be limited to a single embodiment, but may be expanded for use with any of the other methods described herein as will be understood by one of ordinary skill in the art unless otherwise stated specifically.
An embodiment of method 10 establishes a universal framework for anomaly detection that does not rely on the statistical distribution of anomalies in the dataset. Once outliers have been identified in a data set, they may be removed from the data set. Then, the outlier-free data set may be used to learn a new dictionary and the corresponding vectors of expansion coefficients via the same BCD and PG methods, with all outlier vectors set to zero.
In a given dataset, the outliers alone may be more informative than the remaining data, in which case one would like to quickly identify them and possibly remove them from the dataset. The regularized least-squares (LS) criterion, however, is sensitive to the presence of outliers. Dictionary learning (DL) problem formulations based on the LS cost inherit this sensitivity: outliers can bias the dictionary estimate and reduce the efficiency of DL as a data compression tool.
Instead of using the LS criterion as an error-fitting term, robust DL approaches leverage tools developed in the area of robust statistics and use criteria such as the l1-error and the Huber loss to induce robustness to outliers. Unfortunately, previous robust DL approaches do not lend themselves to constructing efficient solvers, and in many cases they do not directly identify the outliers. In an embodiment of dictionary construction method 10, robust data analysis methods capitalizing on modern signal processing tools are used to leverage a data model that captures the presence of outliers, and the fact that by definition outliers are rare, to develop efficient tools for identifying and tapering the effects of outliers. When there is a structural relationship among elements of the training set, outliers can inherit such structure. For instance, when training a dictionary using image patches contaminated by outliers, the spatial structure of the patches (size and location) induces a similar structure in the outliers. Dictionary construction method 10 leverages outlier structure to enhance the robustness of the resulting dictionary and to identify outliers. An embodiment of method 10 may be described as a robust DL framework that capitalizes on sparsity-aware modeling tools and the predefined structure among outliers. The specific outlier structure may be defined via an undirected graph subsuming the similarity among elements of the training set. This information may be introduced in the robust DL problem via a Laplacian regularization term that couples all outlier vector representations. A BCD algorithm coupled with PG methods may be used for training the dictionary and identifying the outliers. The data set may be partitioned and vectorized in order to format the data in matrix form. Dictionary construction method 10 uses known structure from the dataset, or application domain, as a means to develop enhanced anomaly detection algorithms applicable to a wide range of signal domains. The methodology is able to capture spatiotemporal structures in the data and deal with datasets that are high-dimensional and large.
Consider a training set of M-dimensional real-valued vectors 𝒴 := {y_n}_{n=1}^N with cardinality N := |𝒴|. Let D := [d_1, . . . , d_Q] ∈ ℝ^{M×Q} denote a dictionary with Q atoms, where an atom corresponds to a column of D. It is assumed that each y_n can be represented as a linear combination of atoms from D. Whenever Q > M, D is called overcomplete. If D is overcomplete, the vectors {d_q}_{q=1}^Q are linearly dependent, and the vector of expansion coefficients s_n in y_n = D s_n is not unique for any n. By requiring each s_n to be sparse, a balance can be struck between the quality of the approximation and the stability of the representation of each y_n.
Given 𝒴, the goal of the classical DL problem is to find both D and {s_n}_{n=1}^N as the minimizers of

  min_{D∈𝒟, {s_n}_{n=1}^N} Σ_{n=1}^N [(1/2)∥y_n − D s_n∥_2^2 + η∥s_n∥_1]   (1)

where 𝒟 := {D ∈ ℝ^{M×Q} : ∥d_q∥_2 ≤ 1, ∀q}, η > 0 is a tuning parameter, and ∥s_n∥_1 := Σ_{q=1}^Q |s_{n,q}| represents the l1-norm of s_n with s_{n,q} := [s_n]_q. The l1-norm regularizer encourages sparsity in the s_n's, and η controls the sparsity level of the resulting s_n's. (As used herein, the notation "'s" after a term such as s_n indicates plurality.) The constraints on the sizes of the columns of D defined in 𝒟 are necessary to prevent the dictionary atoms from growing unbounded while the corresponding entries of each s_n approach zero.
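For concreteness, the cost in (1) and the projection onto the constraint set 𝒟 may be sketched in Python with NumPy as follows. This is an illustrative sketch only; the function names are not part of the disclosed method, and Y is assumed to stack the training vectors {y_n} as its columns.

    import numpy as np

    def dl_objective(Y, D, S, eta):
        """Cost in (1): (1/2)*sum_n ||y_n - D s_n||_2^2 + eta*sum_n ||s_n||_1,
        with the training vectors y_n stacked as the columns of Y."""
        return 0.5 * np.sum((Y - D @ S) ** 2) + eta * np.abs(S).sum()

    def project_atoms(D):
        """Scale any atom with ||d_q||_2 > 1 back onto the unit ball, so
        that D remains in the constraint set."""
        return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)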
Problem (1) is non-convex and may be solved with a BCD-based solver, which iteratively updates D with {s_n}_{n=1}^N fixed, and then updates all {s_n}_{n=1}^N with D fixed. This BCD solver solves a constrained LS optimization problem to update the dictionary, and is thus sensitive to the presence of outliers in 𝒴. Note that in the aforementioned BCD solver the updates for the s_n's are also adversely impacted by outliers, both through the outlier-sensitive D and through the optimization criterion used. Further discussion of a BCD approach is disclosed in M. Elad's book "Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing," New York: Springer, 2010, which is incorporated by reference herein in its entirety.
The following describes a sparse and structured process for modeling outliers in a dataset. First, a model for each y_n ∈ 𝒴 that explicitly captures the presence of outliers in the training data set is introduced. To this end, each y_n ∈ 𝒴 is modeled as

  y_n = D s_n + o_n + ε_n,  n = 1, . . . , N   (2)
where o_n denotes a deterministic outlier vector, and ε_n denotes an inlier noise vector affecting y_n. If y_n is an outlier, then o_n takes a nonzero value and lessens the effect of y_n on D; otherwise, o_n = 0_M, where 0_M is an M×1 vector of zeros. Solving for the tuple (D, {s_n, o_n}_{n=1}^N) using the set of equations in (2) entails solving a nonlinear and under-determined system of MN scalar equations with N(M+Q)+MQ unknowns. Thus, auxiliary information about the structure of each s_n and each o_n, and about the proportion of outliers present in 𝒴, becomes essential to obtain a stable estimate for (D, {s_n, o_n}_{n=1}^N) and to avoid over-fitting the model in (2).
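As an illustration of model (2), synthetic data with sparse coefficient vectors and a small number of outlier columns may be generated as follows; all sizes, sparsity levels, and noise magnitudes here are assumptions chosen for illustration only.

    import numpy as np

    rng = np.random.default_rng(1)
    M, Q, N = 64, 100, 500
    D_true = rng.standard_normal((M, Q))
    D_true /= np.linalg.norm(D_true, axis=0)             # unit-norm atoms
    S_true = np.where(rng.random((Q, N)) < 0.05,
                      rng.standard_normal((Q, N)), 0.0)  # sparse s_n's
    O_true = np.zeros((M, N))
    idx = rng.choice(N, 10, replace=False)               # outliers are rare
    O_true[:, idx] = 5.0 * rng.standard_normal((M, 10))  # nonzero o_n's
    Y = D_true @ S_true + O_true \
        + 0.01 * rng.standard_normal((M, N))             # model (2)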
The structural information about the outliers may be summarized by an auxiliary weighted graph 𝒢 := (V, E). The set V defines the vertex set, with cardinality |V| = N, where each vertex is associated with an o_n. The set E defines the edge set whose elements define the structure of the o_n's and the similarity of each pair of o_n's. The connectivity of 𝒢 is compactly summarized by the adjacency matrix W ∈ ℝ^{N×N}, where an entry w_{n,n′} := [W]_{n,n′} > 0 if nodes v_n, v_{n′} ∈ V are connected in 𝒢, and w_{n,n′} = 0 otherwise. The topology of 𝒢, and thus the underlying structure associated with the outliers, is encapsulated in the Laplacian matrix L := diag(W 1_N) − W, where 1_N denotes an N×1 vector of ones and diag(x) a diagonal matrix with x on its main diagonal. Note that 𝒢 does not have to be connected.
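The Laplacian construction L := diag(W 1_N) − W may be sketched as follows, here for an assumed four-node chain graph chosen purely for illustration.

    import numpy as np

    def laplacian(W):
        """L := diag(W 1_N) - W for a symmetric adjacency matrix W."""
        return np.diag(W.sum(axis=1)) - W

    # Chain graph on four nodes: each o_n is coupled to its neighbors.
    W = np.array([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
    L = laplacian(W)
    print(L.sum(axis=1))   # every row of a Laplacian sums to zero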
Usually DL seeks sparse expansion coefficient vectors {s_n}_{n=1}^N. Moreover, outliers by definition are rare and, thus, few of them are assumed to be present in 𝒴. The sparse presence of outliers in equation (2) can be quantified through the support of the vector [∥o_1∥_2, . . . , ∥o_N∥_2]. Let O := [o_1, . . . , o_N] ∈ ℝ^{M×N} denote a matrix of outlier vectors, S := [s_1, . . . , s_N] ∈ ℝ^{Q×N} a matrix of expansion coefficient vectors, and ∥O∥_{2,p} := ∥[∥o_1∥_2, . . . , ∥o_N∥_2]∥_p, where ∥o_n∥_p := (Σ_{m=1}^M |o_{m,n}|^p)^{1/p} represents the lp-norm of o_n with o_{m,n} := [o_n]_m and p ∈ [1, ∞). An estimator for (D, S, O) that captures the spatial structure of the outliers is given by the solution of

  min_{D∈𝒟, S, O} (1/2)Σ_{n=1}^N ∥y_n − D s_n − o_n∥_2^2 + η Σ_{n=1}^N ∥s_n∥_1 + λ∥O∥_{2,1} + (μ/2) Tr(O L Oᵀ)   (3)
where Tr(•) denotes the matrix trace operator, η > 0 is a tuning parameter that controls the sparsity of the s_n's, λ > 0 a tuning parameter that controls the number of vectors in 𝒴 identified as outliers, and μ > 0 a tuning parameter that controls how much emphasis is placed on the structural relation among outliers. The group-Lasso regularizer ∥O∥_{2,1} in equation (3) encourages entire o_n vectors to be set to zero. The Laplacian regularization term in equation (3) can be explicitly written as

  Tr(O L Oᵀ) = (1/2) Σ_{n=1}^N Σ_{n′=1}^N w_{n,n′} ∥o_n − o_{n′}∥_2^2   (4)
From equation (4), it follows that the Laplacian regularization term in equation (3) encourages the o_n's to be similar to each other if they are connected in 𝒢.
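The identity in (4) can be checked numerically; the following sketch verifies that Tr(O L Oᵀ) equals the weighted pairwise-difference sum for a randomly generated example.

    import numpy as np

    rng = np.random.default_rng(2)
    N, M = 6, 4
    W = rng.random((N, N))
    W = np.triu(W, 1); W = W + W.T               # random symmetric weights
    L = np.diag(W.sum(axis=1)) - W
    O = rng.standard_normal((M, N))
    lhs = np.trace(O @ L @ O.T)
    rhs = 0.5 * sum(W[n, m] * np.sum((O[:, n] - O[:, m]) ** 2)
                    for n in range(N) for m in range(N))
    print(np.isclose(lhs, rhs))                  # True: equation (4) holds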
In general, equation (3) is a non-convex optimization problem that is difficult to solve. Nevertheless, when optimizing with respect to any one of D, S, and O with the other two fixed, the resulting optimization problems can be solved efficiently. Alternate DL formulations that rely on a constrained version of equation (3), where the l0-"norm" is used in lieu of the l1-norm, and that bestow structure on the dictionary's atoms may be used. There are no general rules for choosing among these DL formulations, and the selection often depends on the specific problem at hand. Our outlier model in equation (2) and the Laplacian regularizer in equation (4) can be used to develop robust versions of these DL approaches. The resulting robust and structured DL approaches are able to capture the underlying structure of the outliers after appropriately modifying the algorithmic framework outlined next.
The following algorithm is an embodiment of a BCD solver for equation (3). This embodiment capitalizes on the observation that efficient solvers may be realized for updating one of D, S, or O when the other two optimization variables are fixed. Equation (3) is convex with respect to each of D, S, and O individually. With f(D, S, O) denoting the objective function in equation (3), the proposed BCD algorithm minimizes f(D, S, O) with respect to each optimization block, namely D, S, or O, one at a time. Let k denote the BCD-iteration index. Our robust DL algorithm is summarized by the following iterations:

  D̂(k) = arg min_{D∈𝒟} f(D, Ŝ(k−1), Ô(k−1))   (6a)
  Ŝ(k) = arg min_S f(D̂(k), S, Ô(k−1))   (6b)
  Ô(k) = arg min_O f(D̂(k), Ŝ(k), O)   (6c)

where D̂(k), Ŝ(k), and Ô(k) denote the estimates for D, S, and O at iteration k, respectively.
Updating D̂(k) in (6a) may be done via a BCD algorithm iterating over the columns of D. Per BCD iteration, each atom of D is updated with all other atoms fixed by solving the resulting unconstrained LS optimization problem. This BCD algorithm is summarized as Algorithm 1 above. Note that for (5b) to be well-defined, each atom in D̂(k−1) must be used by at least one s_n. In practice, every unused atom of the dictionary is updated to a randomly chosen element of 𝒴 before running Algorithm 1, and (5b) is skipped for such an atom when updating the dictionary. If there are no unused atoms, Algorithm 1 is guaranteed to achieve the minimum of the cost in (6a).
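Algorithm 1 itself is presented in the figures; a minimal sketch consistent with the atom-wise description above follows. The rescaling of each updated atom onto the unit ball is an assumption made here so that the sketch respects the constraint set 𝒟.

    import numpy as np

    def update_dictionary(Yc, D, S):
        """One pass of the atom-wise BCD update for (6a). Yc is the
        outlier-compensated data Y - O. Each atom d_q is updated by
        unconstrained least squares with all other atoms fixed."""
        D = D.copy()
        R = Yc - D @ S                          # residual, all atoms in place
        for q in range(D.shape[1]):
            sq = S[q, :]
            nrm2 = sq @ sq
            if nrm2 == 0.0:                     # unused atom: skip here;
                continue                        # reset it from the data
            Rq = R + np.outer(D[:, q], sq)      # residual excluding atom q
            dq = Rq @ sq / nrm2                 # unconstrained LS minimizer
            dq /= max(1.0, np.linalg.norm(dq))  # keep ||d_q||_2 <= 1
            R = Rq - np.outer(dq, sq)
            D[:, q] = dq
        return D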
The update in (6b) decouples per column of S. The resulting set of optimization problems are convex and correspond to the Lagrangian version of the least-absolute shrinkage and selection operator (Lasso), such as is described in R. Tibshirani's article "Regression shrinkage and selection via the lasso" (Journal of the Royal Statistical Society, 1996). These optimization problems can be efficiently solved via, e.g., BCD or an alternating direction method of multipliers, such as is described by M. Elad in "Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing," 2010, and by D. P. Bertsekas in "Nonlinear Programming," 1999. The update in (6c) entails solving a convex optimization problem. However, tackling (6c) in its current form leads to solvers whose computational complexity can be high due to the coupling across the o_n's caused by the Laplacian regularizer.
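A minimal sketch of the Lasso update for (6b) follows, here via iterative soft-thresholding (ISTA) rather than BCD or ADMM; the solver choice and iteration count are assumptions made for illustration.

    import numpy as np

    def soft(X, t):
        """Entry-wise soft-thresholding, the prox of the l1-norm."""
        return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

    def update_coefficients(Yc, D, S, eta, n_iter=100):
        """Lasso update for (6b) on the outlier-compensated data Yc = Y - O.
        All columns are updated in parallel since (6b) decouples per column."""
        step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/Lipschitz of the gradient
        for _ in range(n_iter):
            grad = D.T @ (D @ S - Yc)            # gradient of the LS term
            S = soft(S - step * grad, step * eta)
        return S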
Equation (6c) may be solved via a PG algorithm such as Algorithm 2, which is shown below.
Although the resulting solver (i.e., Algorithm 2) for (6c) is iterative in nature, it features fast convergence compared to other first-order optimization methods such as gradient descent and BCD. Moreover, the updates are decomposable across each o_n and can, hence, be parallelized. The PG algorithm (Algorithm 2) for updating (6c) at BCD iteration k relies on the following:
  ω_n(O(τ)) := (o_n(τ) + D̂(k) ŝ_n(k) − y_n) + μ O(τ) l_n,  n = 1, . . . , N   (7a)
  Λ := σ_max(I_N + μL)   (7b)

where τ denotes the proximal-method iteration index, ω_n(O(τ)) denotes the gradient of the differentiable part of f with respect to o_n evaluated at O(τ), l_n the n-th column of L, and Λ the Lipschitz constant for the gradient (with respect to O) of the differentiable part of f, which corresponds to the largest singular value of I_N + μL. Our iterative update for computing (6c) is summarized as follows:

  O(τ) = arg min_O {(Λ/2)∥O − (O(τ−1) − (1/Λ)Ω(O(τ−1)))∥_F^2 + λ∥O∥_{2,1}},  Ω(O) := [ω_1(O), . . . , ω_N(O)]   (8)

This update is separable across the o_n's. Computing each o_n(τ) is then done in closed form as

  o_n(τ) = (1 − λ/(Λ∥v_n∥_2))_+ v_n,  v_n := o_n(τ−1) − (1/Λ)ω_n(O(τ−1))   (9)

where (•)_+ := max{0, •}. Note that 1/Λ in (9) can be understood as the step size α of the PG algorithm. From this vantage point, the PG algorithm will converge for any step size α ∈ (0, 1/Λ].
It is also possible to choose α at each PG iteration via a line search, as is known in the art and described in more detail in N. Parikh and S. Boyd, "Proximal Algorithms," Foundations and Trends in Optimization, vol. 1, pp. 123-231, 2013.
The PG algorithm for solving (6c) may be summarized as Algorithm 2. Each iteration of this PG algorithm can be computed in O(MNQ + MN²) run time, where the quadratic dependency on N stems from the second term in (7a), which defines an operation to be performed for each n. The computational complexity of Algorithm 2 can be reduced by noticing that O(τ) is a sparse matrix and that L is often sparse as well. In this case, specialized sparse matrix multiplication algorithms may be used to reduce the run time of (7a) for each n. The PG method outlined by Algorithm 2 converges to the solution of (6c) and features a worst-case convergence rate of O(1/τ).
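A minimal sketch of the PG update (Algorithm 2) follows, combining the gradient in (7a), the Lipschitz constant in (7b), and the column-wise shrinkage in (9); the iteration count is an assumption, and the dense matrix products shown would be replaced by sparse operations in practice.

    import numpy as np

    def update_outliers(Y, D, S, L, lam, mu, n_iter=50):
        """PG solver for (6c). The differentiable part of f has gradient
        (O - (Y - D S)) + mu*O L; each column is then group-soft-thresholded
        as in (9)."""
        M, N = Y.shape
        R = Y - D @ S
        O = np.zeros((M, N))
        Lam = np.linalg.norm(np.eye(N) + mu * L, 2)   # (7b)
        for _ in range(n_iter):
            G = (O - R) + mu * (O @ L)                # gradients (7a)
            V = O - G / Lam                           # gradient step
            norms = np.maximum(np.linalg.norm(V, axis=0), 1e-12)
            O = V * np.maximum(0.0, 1.0 - (lam / Lam) / norms)  # (9)
        return O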
It is possible to enhance the suboptimal convergence rate of the PG algorithm while maintaining its computational simplicity. From that vantage point, one may develop an accelerated PG (APG) algorithm for solving (6c) that features a worst-case convergence rate of O(1/τ²) to the solution of (6c). The resulting APG solver requires a copy of both O(τ) and O(τ−1). Maintaining a copy of O(τ−1) does not add excessive data storage requirements since O(τ−1) is column-wise sparse. Similarly to the PG algorithm, each iteration of the APG algorithm can be computed in O(MNQ + MN²) run time and benefits from specialized sparse matrix multiplication algorithms for updating each ω_n.
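An APG variant may be sketched by adding a momentum extrapolation step over O(τ) and O(τ−1); the FISTA-style momentum schedule shown is a standard choice and is an assumption here, not taken from the original filing.

    import numpy as np

    def update_outliers_apg(Y, D, S, L, lam, mu, n_iter=50):
        """Accelerated PG sketch for (6c), keeping copies of O(tau) and
        O(tau-1) for the extrapolation step."""
        M, N = Y.shape
        R = Y - D @ S
        O = np.zeros((M, N)); O_prev = O.copy(); t = 1.0
        Lam = np.linalg.norm(np.eye(N) + mu * L, 2)
        for _ in range(n_iter):
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            Z = O + ((t - 1.0) / t_next) * (O - O_prev)  # momentum step
            G = (Z - R) + mu * (Z @ L)                   # gradient at Z
            V = Z - G / Lam
            norms = np.maximum(np.linalg.norm(V, axis=0), 1e-12)
            O_prev, O = O, V * np.maximum(0.0, 1.0 - (lam / Lam) / norms)
            t = t_next
        return O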
The numerical performance of the robust DL algorithm defined in (6a)-(6c) was tested on two real datasets: UCSD's Anomaly Detection Dataset and the Resampled United States Postal Service (USPS) dataset. For these tests, all algorithms used in this embodiment of method 10 were implemented in Python™ 2.7, a high-level, object-oriented programming language created by Guido van Rossum and offered by the Python Software Foundation. The UCSD Anomaly Detection Dataset features video of a pedestrian walkway acquired with a stationary camera and a fixed background. With few exceptions, most portions of the video contain only pedestrians moving along the walkway. Abnormal events include the presence of bikers, skaters, and small carts. Three images from this dataset were used for this test (see images A, B, and C in the accompanying figures).
Each image was resized from 158×258 pixels to 152×252 and then partitioned into 8×8-pixel blocks. The resulting N=1,653 blocks (551 per image) were used to construct 𝒴 with M=64. Each 64-dimensional y_n ∈ 𝒴 was constructed by vertically stacking the columns of each block. In this test, no further preprocessing was performed on the elements of the training set. In order to capture the spatial structure among blocks per image, an auxiliary graph with N_0=551 nodes was constructed as follows. Each y_n was assigned to a node of the graph. The edges for each node were chosen so that each y_n is connected to its spatially neighboring blocks only, as shown in the accompanying figures.
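The block extraction and the spatial-neighbor graph may be sketched as follows. The block size and the neighbor-only connectivity follow the description above, while the column-major traversal order and four-neighbor connectivity are assumptions made for illustration.

    import numpy as np

    def image_to_blocks(img, b=8):
        """Partition an image into non-overlapping b x b blocks; each y_n is
        formed by vertically stacking the columns of one block (M = b*b)."""
        H, Wd = img.shape
        rows, cols = H // b, Wd // b
        Y = np.stack([img[i*b:(i+1)*b, j*b:(j+1)*b].flatten(order='F')
                      for j in range(cols) for i in range(rows)], axis=1)
        return Y, rows, cols

    def grid_adjacency(rows, cols):
        """Adjacency connecting each block only to its horizontal and
        vertical spatial neighbors (one graph node per block)."""
        N = rows * cols
        W = np.zeros((N, N))
        for j in range(cols):
            for i in range(rows):
                n = j * rows + i
                if i + 1 < rows:
                    W[n, n + 1] = W[n + 1, n] = 1.0
                if j + 1 < cols:
                    W[n, n + rows] = W[n + rows, n] = 1.0
        return W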
The best convergence rate, however, is achieved by setting α = 1/Λ as in Algorithm 2. In practice it was observed that during the first iteration of the BCD algorithm the PG algorithm took fewer than 20 iterations. As the BCD iterations progressed, the average number of PG iterations required for convergence decreased to around 3-5.
It was observed that setting μ>0 encouraged outlier blocks to be chosen according to the structure defined by the matrix L, as illustrated in the accompanying figures.
The goal of the USPS test was to find the injected anomalies along with other "odd"-looking digits in 𝒴. The size of the dictionary was set to Q=100, and it was initialized with randomly chosen elements from 𝒴. The spatial structure of the digits was captured via an auxiliary graph with N=1,000 nodes. For each node, the cosine similarities between its image and the images corresponding to all other nodes were computed. Per node n, a set of K weighted edges was added to the graph, where the chosen edges corresponded to the node pairs {(n, m)}_{m≠n} with the K largest similarity scores. In this experiment, the constrained DL formulation that uses the l0-"norm" was used in lieu of equation (3). Thus, the solvers used for (6a) and (6b) were modified accordingly. Specifically, orthogonal matching pursuit (OMP) was used to solve (6b), and the constraint set was dropped from (6a). The desired number of nonzero coefficients used by OMP was set to S_0=40.
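The cosine-similarity graph construction may be sketched as follows. Using the similarity scores themselves as edge weights and symmetrizing by the maximum are assumptions consistent with, but not specified by, the description above.

    import numpy as np

    def knn_cosine_adjacency(Y, K=5):
        """Per node n, add weighted edges to the K nodes whose images have
        the largest cosine similarity with image n (columns of Y)."""
        Yn = Y / np.maximum(np.linalg.norm(Y, axis=0), 1e-12)
        C = Yn.T @ Yn                     # all pairwise cosine similarities
        np.fill_diagonal(C, -np.inf)      # exclude self-loops
        N = C.shape[0]
        W = np.zeros((N, N))
        for n in range(N):
            nbrs = np.argsort(C[n])[-K:]  # K most similar nodes
            W[n, nbrs] = C[n, nbrs]
        return np.maximum(W, W.T)         # symmetric weighted adjacency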
Dictionary construction method 10 utilizes a robust DL algorithm that leverages known structure among the elements of the training set. The training set structure may be summarized by an auxiliary graph whose connectivity is defined according to the similarity between elements of the training set. Dictionary construction method 10 further uses a BCD solver to learn the dictionary and to identify the outliers. The outlier variables may be updated with a PG algorithm as part of the aforementioned BCD solver. In addition to being applied to images, method 10 can be applied to other signal domains including video (full motion video, motion imagery, etc.), computer and sensor networks (traffic flows, information flows), social networks, text data, active and passive sonar, and audio. Method 10's ability to construct a dictionary from a dataset that includes outliers/anomalies allows a computer to extract actionable information from a data set faster than previous methods while simultaneously reducing the storage and communications requirements related to sharing and archiving the data.
From the above description of the dictionary construction method 10, it is manifest that various techniques may be used for implementing the concepts of method 10 without departing from the scope of the claims. The described embodiments are to be considered in all respects as illustrative and not restrictive. The method/apparatus disclosed herein may be practiced in the absence of any element that is not specifically claimed and/or disclosed herein. It should also be understood that method 10 is not limited to the particular embodiments described herein, but is capable of many embodiments without departing from the scope of the claims.
This application claims the benefit of U.S. Provisional Application No. 62/315,580, filed 30 Mar. 2016, titled “Spatiotemporal Methods for Anomaly Detection in Dictionary Learning and Sparse Regression” (Navy Case #103757).
The United States Government has ownership rights in this invention. Licensing and technical inquiries may be directed to the Office of Research and Technical Applications, Space and Naval Warfare Systems Center, Pacific, Code 72120, San Diego, Calif., 92152; voice (619) 553-5118; ssc_pac_t2@navy.mil. Reference Navy Case Number 103757.