The invention relates to a method for automatic online detection and classification of anomalous objects in a data stream according to claim 1 and a system for that purpose according to claim 22.
In practical applications of data analysis it is often necessary to evaluate the content of datasets so that the contents can be assigned to certain classes.
One example would be the classification of measurements into normal and anomalous classes. The mathematical boundary between “normal” and “anomalous” is usually a mathematical condition which is either satisfied or not satisfied.
From the prior art (e.g. U.S. Pat. Nos. 5,640,492, 5,649,492, 6,327,581, as well as the articles: P. A. Porras and P. G. Neumann, “Emerald: event monitoring enabling responses to anomalous live disturbances”, Proc. National Information Systems Security Conference, 1997, pp. 353-365, and C. Warrender, S. Forrest and B. Perlmutter, “Detecting intrusions using system calls: alternative data methods”, Proc. IEEE Symposium on Security and Privacy, 1999, pp. 133-145) it is known how to detect outliers online, i.e. one example at a time, when the notion of normality is fixed in advance as a model.
It is not known, however, how to detect outliers in a continuous stream of data while at the same time constructing the representation of normality and dynamically adjusting that representation upon the arrival of new data or the removal of previous data. This form of data processing constitutes the scope of the invention.
The problem in real time application is that offline analysis is often not feasible or desirable.
One example of such an application would be the detection of an attack by a hacker on a computer system through a computer network.
The “normal” characteristics are known, but it cannot be defined beforehand how an attack would be represented in a data stream.
It is only known in advance that a certain deviation from the normal situation will take place.
The current invention relates to such situations, in which datasets are analysed in real time without definite knowledge of the classification criteria to be used in the analysis.
In the following the invention is described by the way of example by
A system and method are disclosed for online detection and classification of anomalous objects in continuous data streams.
In
The overall scheme of an embodiment of the system and the method is depicted in
The data stream 1000 consists of data packets in communication networks.
Alternatively the data stream 1000 can be entries in activity logs, measurements of physical characteristics of operating mechanical devices, measurements of parameters of chemical processes, measurements of biological activity, and others.
The central feature of the method and the system according to the invention is that it can deal with continuous data streams 1000 in an online fashion. The term “continuous” in this context means that data sets are received regularly or irregularly (e.g. random bursts) by the system and processed one at a time.
The term “online” in this context means that the system can start processing the incoming data immediately after deployment without an extensive setup and tuning phase. The tuning of the system is carried out automatically in the process of its operation. This contrasts with an offline mode in which the tuning phase involves extensive training (such as with systems based on neural networks and support vector machines) or manual interaction (such as with expert systems).
The system can alternatively operate in the offline mode, whereby the data obtained from the data stream 1000 are stored in the database 1100 before being used in the further processing stages. Such a mode can be employed in situations when the volume of the incoming data exceeds the throughput of the processing system, and intermediate buffering in the database is required.
It is possible to operate the application in a mixed mode (e.g. in case the data is strongly irregular), in which at least a part of the total data stream is a continuously incoming datastream 1000.
In this case, the system reads the data from the data stream 1000 as long as new data is available. If no new data is available, the system switches its input to the database and processes the previously buffered data. On the other hand, if the arrival rate of the data in the data stream 1000 exceeds the processing capacity of the system, the data is diverted into the database for processing at a later time. In this way, optimal utilization of computing resources is achieved.
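The mixed mode described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `mixed_mode_step`, the per-step `capacity` parameter and the use of an in-memory `deque` as a stand-in for the database 1100 are assumptions made purely for illustration.

```python
from collections import deque


def mixed_mode_step(arrivals, buffer, capacity):
    """Select the objects to process in one time step (illustrative only).

    arrivals: objects received from the live data stream in this step.
    buffer:   deque standing in for the database (hypothetical stand-in).
    capacity: assumed maximum number of objects processed per step.
    """
    # Live data is processed first, up to the processing capacity.
    to_process = list(arrivals[:capacity])
    # Anything beyond the capacity is diverted into the buffer.
    buffer.extend(arrivals[capacity:])
    # Spare capacity is used to drain previously buffered data.
    while len(to_process) < capacity and buffer:
        to_process.append(buffer.popleft())
    return to_process


buffer = deque()
burst = mixed_mode_step([1, 2, 3, 4, 5], buffer, capacity=3)  # overflow is buffered
idle = mixed_mode_step([], buffer, capacity=3)                # buffer is drained
```

In this sketch a burst of five objects leaves two in the buffer, and the following idle step processes them, which mirrors the switching behaviour described in the text.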
Each of the incoming objects is supplied to a feature extraction unit 1200, which performs the pre-processing required to obtain the features 1300 relevant for a particular application.
The purpose of the feature extraction unit is to compute, based on the content of the data, the set of properties (“features”) suitable for subsequent analysis in an online anomaly detection engine 2000. These properties must meet the following requirements:
either
a) each property is a numeric quantity (real or complex), or
b) the set of properties forms a vector in an inner product space (i.e. computer programs are provided which take the said set of properties as arguments and perform the operations of addition, multiplication with a constant and scalar product pertaining to the said sets of properties), or
c) a non-linear mapping is provided transforming the sets of properties in the so-called Reproducing Kernel Hilbert Space (RKHS). The latter requirement can be satisfied by providing a computer program which takes the said sets of properties as arguments and computes a kernel function between the two sets of properties. The function realized by this program must meet (exactly or approximately) the conditions known as “Mercer conditions”.
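Requirement c) can be checked approximately on a sample of objects. The following Python sketch assumes NumPy is available; the names `gram_matrix`, `passes_mercer_check` and `rbf` are hypothetical. It tests a necessary consequence of the Mercer conditions on a finite sample, namely that the kernel matrix is symmetric and has no significantly negative eigenvalues. Passing this test on one sample does not prove that the conditions hold in general.

```python
import numpy as np


def gram_matrix(objects, kernel):
    """Evaluate the kernel function for every pair of objects."""
    n = len(objects)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(objects[i], objects[j])
    return K


def passes_mercer_check(objects, kernel, tol=1e-9):
    """Return True if the sample kernel matrix is symmetric and
    positive semi-definite up to the tolerance tol."""
    K = gram_matrix(objects, kernel)
    if not np.allclose(K, K.T):
        return False
    eigenvalues = np.linalg.eigvalsh(K)
    return bool(eigenvalues.min() >= -tol)


def rbf(p, q, gamma=0.5):
    """Radial basis function kernel (gamma is an illustrative value)."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))
```

A function such as the negative Euclidean distance would fail this check, since its kernel matrix has a zero diagonal and therefore at least one negative eigenvalue.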
In the exemplary embodiment of the system, the features can be (but are not limited to)
If the entire set of properties does not satisfy the imposed requirements as a whole, it can be split into subsets of properties. In this case, the subsets are processed by separate online anomaly detection engines.
Similarly to the data, the features can be buffered in the feature database 1400, if for some reason intermediate storage of features is desired.
Alternatively, if the incoming objects are such that they can be directly used in a detection/classification method, no feature extraction unit 1200 is necessary.
The features 1300 are then passed on to the online anomaly detection engine 2000.
The main step 2100 of the online anomaly detection engine 2000 comprises a construction and an update of a geometric representation of the notion of normality.
The online anomaly detection engine 2000 constitutes the core of the invention. The main principle of its operation lies in the construction and maintenance of a geometric representation of normality 2200. The geometric representation is constructed in the form of a hypersurface (i.e. a manifold in a high-dimensional space) which depends on selected examples contained in the data stream and on parameters which control the shape of the hypersurface. The examples of such hypersurfaces can be (but are not limited to):
The online anomaly detection engine consists of the following components:
The output of an online anomaly detection engine 2000 is an anomaly warning 3100 which can be used in the graphical user interface, in the anomaly logging utilities or in the component for automatic reaction to an anomaly. In the exemplary embodiment for identification of hacker attacks, the consumers of an anomaly warning are, respectively, the security monitoring systems, security auditing software, or network configuration software.
Alternatively, the output of an online anomaly detection engine can be used for further classification of anomalies. Such classification is carried out by the classification unit 4000 which can utilize any known classification method, e.g. a neural network, a Support Vector Machine, a Fisher Discriminant Classifier etc. The anomaly classification message 4100 can be used in the same security management components as the anomaly warning.
In one embodiment the geometric representation of normality 2200 is a parametric hypersurface enclosing the smallest volume among all possible surfaces consistent with the predefined fraction of the anomalous objects (see example in
Alternatively the geometric representation of normality 2200 is a parametric hypersurface enclosing the smallest volume among all possible surfaces consistent with a dynamically adapted fraction of the anomalous objects. An example is depicted in
Said hypersurface is constructed in the feature space induced by a suitably defined similarity function between the data objects (“kernel function”) satisfying the conditions under which the said function acts as an inner product in the said feature space (“Mercer conditions”). The update of the said geometric representation of normality 2200 involves the adjustment so as to incorporate the latest objects from the incoming data stream 1000 and the adjustment so as to remove the least relevant object so as to retain the encapsulation of the smallest volume enclosed by the geometric representation of normality 2200, i.e. the hypersurface. This involves a minimization problem which is automatically solved by the system.
The construction and the update of the geometric representation of normality 2200 will be described in greater detail in connection with
Once the geometric representation of normality 2200 is automatically updated, an anomaly detection 2300 is automatically performed by the online anomaly detection engine 2000 assigning to the object the
The output of the online anomaly detection engine 2000 is used to issue the anomaly warning 3100 and/or to trigger the classification component 4000 which can utilize any known classification method such as decision trees, neural networks, support vector machines (SVM), Fisher discriminant etc.
The use of support vector-machines in connection with the invention is described below in Appendix A.
The geometric representation of normality 2200 can also be supplied to the classification component if this is required by the method.
In an exemplary embodiment of the construction and update of the geometric representation of normality 2100 the hypersurface representing the class of normal events is represented by the set of parameters x1, . . . , xn (i=1 . . . n), one parameter for each object in the working set.
The size n of the working set is chosen in advance by the user. There may be two reasons for this:
The parameters are further restricted to be non-negative, and to have values less than or equal to C=1/(nν), where ν is the expected fraction of the anomalous events in the data stream (e.g. 0.25 for 25% expected outliers), to be set by the user. This estimate is the only a priori knowledge to be provided to the system. There may be some other, kernel-dependent parameters in the system. These parameters reflect some prior knowledge (if available) about the geometry of objects.
This is a very weak limitation since such estimates are readily available.
The working set is partitioned into the
“set O” of the objects whose parameters xk are equal to zero,
“set E” of the objects whose parameters xk are equal to C, and the
“set S” of the remaining objects.
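The partition of the working set can be sketched in a few lines of Python. The function name `partition_working_set` and the numerical tolerance `tol` are assumptions introduced for illustration; the specification only defines the three sets by the values of the parameters.

```python
def partition_working_set(x, C, tol=1e-12):
    """Split object indices into sets O, E and S by their parameter values.

    x:   parameters x_1, ..., x_n of the objects in the working set.
    C:   upper bound on the parameters.
    tol: assumed tolerance for comparing floating point values.
    """
    set_O, set_E, set_S = [], [], []
    for k, xk in enumerate(x):
        if abs(xk) <= tol:
            set_O.append(k)   # parameter equal to zero
        elif abs(xk - C) <= tol:
            set_E.append(k)   # parameter equal to the bound C
        else:
            set_S.append(k)   # parameter strictly between 0 and C
    return set_O, set_E, set_S


# Six objects with C = 0.25; the parameters sum to one.
sets = partition_working_set([0.0, 0.25, 0.1, 0.25, 0.15, 0.25], C=0.25)
```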
The operation of the construction and update of the geometric representation of normality 2100 is illustrated in
Upon the arrival of the data object k, the following three main actions are performed within a loop:
The importation and removal operations maintain the minimal volume enclosed by the hypersurface and consistent with the pre-defined expected fraction of anomalous objects.
For more complicated geometries a volume estimate can be used as the optimization criterion, since for more complicated surfaces such as the hyperellipsoid, the exact knowledge of a volume may not be available.
These operations are explained in more detail in Appendix C. The relevance of the data object can be judged either by the time stamp on the object or by the value of parameter xi assigned to the object.
The steps A2.1 to A2.4 are the initialization operations to be performed when not enough data objects have been observed in order to bring the system into equilibrium (i.e. not enough data to construct a hypersurface).
Construction of the hypersurface 2200 enclosing the smallest volume and consistent with the pre-defined expected fraction of anomalous objects amounts, as shown in the article “Support Vector Data Description” by D. M. J. Tax and R. P. W. Duin, Pattern Recognition Letters, vol. 20, pages 1191-1199, (1999), to solving the following mathematical programming problem:
where:
K is an n×n matrix that consists of evaluations of the given kernel function for all data points in the working set: Ki,j=kernel (pi, pj).
For example, if the objects are vectors in the n-dimensional space, and the solution is sought in the linear feature space, the kernel function is evaluated as follows:
As another example, if the solution is sought in the feature space of radial basis functions (which is an infinite-dimensional space), the kernel function is computed as:
where γ is the kernel parameter.
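As an illustration of how the matrix K might be assembled for the two examples, the following Python sketch assumes the usual textbook forms of these kernels: the dot product for the linear feature space and exp(−γ‖pi − pj‖²) for radial basis functions. These forms, the function names and the use of NumPy are assumptions of this sketch and are not quoted from the specification.

```python
import numpy as np


def linear_kernel_matrix(P):
    """K[i, j] = <p_i, p_j> for the rows p_i of P (assumed linear kernel)."""
    P = np.asarray(P, dtype=float)
    return P @ P.T


def rbf_kernel_matrix(P, gamma):
    """K[i, j] = exp(-gamma * ||p_i - p_j||^2) (assumed RBF form)."""
    P = np.asarray(P, dtype=float)
    sq_norms = np.sum(P * P, axis=1)
    # Squared distances via ||p||^2 + ||q||^2 - 2 <p, q>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    # Rounding can produce tiny negative values; clip them to zero.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-gamma * sq_dists)


points = [[0.0, 0.0], [3.0, 4.0]]
K_linear = linear_kernel_matrix(points)
K_rbf = rbf_kernel_matrix(points, gamma=0.1)
```

With the RBF form assumed here every diagonal entry of K equals one, so the vector of diagonal entries used in the optimization problem is a vector of ones.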
In equation (1) c is the vector of the numbers at the main diagonal of K, a is the vector of n ones and b=−1.
The parameter C is related to the expected fraction of the anomalous objects.
The necessary and sufficient condition for the optimality of the representation attained by the solution to problem (1) is given by the well-known Karush-Kuhn-Tucker conditions.
When all the points in the working set satisfy the said conditions, the working set is said to be in equilibrium.
Importation of a new data object into, or removal of an existing data object from, a working set may result in the violation of the said conditions. In such a case, adjustments of the parameters x1, . . . , xn are necessary in order to bring the working set back into equilibrium.
A framework for performing such adjustments, based on the Karush-Kuhn-Tucker conditions, for a different mathematical programming problem (Support Vector Learning) was presented in the article “Incremental and Decremental Support Vector Learning” by G. Cauwenberghs and T. Poggio, Advances in Neural Information Processing Systems 13, pages 409-415, (2001).
The algorithms for performing the adjustments of the geometric representation are described in more detail in Appendix C.
Special care needs to be taken at the initial phase of the operation of the online anomaly detection engine as described in
The initialization steps A2.1 to A2.4 of the invention are designed to handle this special case and to bring the working set into equilibrium after the smallest possible number of data objects has been seen.
The exemplary embodiment of the online anomaly detection method in the system for detection and classification of computer intrusions is depicted in
The online anomaly detection engine 2000 is used to analyse a data stream 1000 (audit stream) containing network packets and records in the audit logs of computers. The packets and records are the objects to be analysed.
The audit stream 1000 is input into the feature extraction component 1200 comprising a set of filters to extract the relevant features.
The extracted features are read by the online anomaly detection engine 2000 which identifies anomalous objects (packets or log entries) and issues an event warning if the event is discovered to be anomalous. Classification of the detected anomalous events is performed by the classification component 4000 previously trained to classify the anomalous events collected and stored in the event database.
The online anomaly detection engine comprises a processing unit having memory for storing the incoming data, the limited working set, and the geometric representation of the normal (non-anomalous) data objects by means of a parametric hypersurface; stored programs including the programs for processing of incoming data; and a processor controlled by the stored programs. The processor includes the components for construction and update of the geometric representation of normal data objects, and for the detection of anomalous objects based on the stored representation of normal data objects.
The component for construction and update of the geometric representation receives data objects and imports them into the representation such that the smallest volume enclosed by the hypersurface and consistent with the pre-defined expected fraction of anomalous objects is maintained; the component further identifies the least relevant entry in the working set and removes it while maintaining the smallest volume enclosed by the hypersurface. Detection of the anomalous objects is performed by checking if the objects fall within or outside of the hypersurface representing the normality.
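The inside/outside test can be sketched in Python for the special case in which the hypersurface is a hypersphere in the kernel-induced feature space, as in the Support Vector Data Description cited above. The sketch assumes that the weights sum to one, that the centre of the sphere is the weighted combination of the working-set objects, and that the radius is given by the distance of an object from set S to the centre. The function names and the argument `boundary_index` are hypothetical.

```python
import numpy as np


def linear(p, q):
    """Linear kernel, used here only to keep the example easy to verify."""
    return float(np.dot(p, q))


def squared_distance_to_centre(z, points, x, kernel):
    """Squared feature-space distance between z and the weighted centre."""
    x = np.asarray(x, dtype=float)
    k_zz = kernel(z, z)
    k_z = np.array([kernel(p, z) for p in points])
    K = np.array([[kernel(p, q) for q in points] for p in points])
    return float(k_zz - 2.0 * (x @ k_z) + x @ K @ x)


def is_anomalous(z, points, x, boundary_index, kernel):
    """True if z lies outside the sphere whose radius is defined by the
    working-set object at boundary_index (assumed to belong to set S)."""
    radius_sq = squared_distance_to_centre(points[boundary_index], points, x, kernel)
    return squared_distance_to_centre(z, points, x, kernel) > radius_sq


# Two objects on the boundary, centre at the origin, radius one.
points = [(1.0, 0.0), (-1.0, 0.0)]
weights = [0.5, 0.5]
inside = is_anomalous((0.0, 0.5), points, weights, 0, linear)
outside = is_anomalous((2.0, 0.0), points, weights, 0, linear)
```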
As an embodiment of the invention, the architecture of the system for detection and classification of computer intrusions is disclosed. The system consists of the feature extraction component receiving data from the audit stream; of the online anomaly detection engine; and of the classification component, produced by the event learning engine trained on the database of appropriate events.
In
In order to find the optimal geometric representation of normality 2200 of a dataset with respect to the optimality criterion, a certain minimum number of objects is required. Referring to the above mentioned example (e.g.
Each object has an individual weight αi, which is bounded by a parameter C. For the optimal representation the sum of the αi should be one. Given a very small set of objects, the optimality criteria cannot be fulfilled.
Consider a simple example, where a minimum number of seven objects is required (see
Suppose the window size is 100 examples and the expected outlier ratio is 7%. The value of C is then C=1/(100·0.07)=1/7. In order to bring the system into equilibrium, all the constraints must be satisfied; that is, every αi must be ≤1/7 while their sum must be equal to one. These two constraints can only be satisfied after at least 7 points have been observed.
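The arithmetic of this example can be reproduced with a short Python sketch. The function names are hypothetical, and exact fractions are used because floating point rounding of 100·0.07 would otherwise push the result just above 7.

```python
from fractions import Fraction
from math import ceil


def weight_bound(window_size, outlier_fraction):
    """C = 1 / (n * nu), computed exactly."""
    return 1 / (window_size * Fraction(outlier_fraction))


def min_points_for_equilibrium(C):
    """Smallest number of weights, each at most C, that can sum to one."""
    return ceil(1 / C)


C = weight_bound(100, Fraction(7, 100))   # Fraction(1, 7)
needed = min_points_for_equilibrium(C)    # 7
```

With fewer than `needed` objects the sum of the weights cannot reach one without at least one weight exceeding C, which is why the initialization phase requires special handling.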
After adding a seventh object, indicated by the circle in
The new object increases its weight α, while one of the other objects decreases its weight in order to maintain the overall sum of the weights. These two objects are indicated by the ‘x’ marks in
In the final step of the optimization, the added object hits the upper weight bound. This is indicated in
The curve in this figure, as well as in all subsequent figures, shows the shape of the representation of normality. Although it may seem somewhat strange that there are no points inside the normality region, it should be noted that the guarantees as to the upper bound on the number of anomalies can be fulfilled only after at least n=window_size points have been seen. Until then, although a feasible solution exists, the statistical features of this solution cannot be enforced.
In
The three types of data objects are indicated:
In
As can be seen from the above, the geometric representation of normality is updated sequentially, which is essential for on-line (real time) applications. There are no prior assumptions about the classification. The classification (i.e. the membership in a set) is developed automatically while the data is received.
In the next step (
In Appendix B, especially in section 2.4, a particularly advantageous formulation of the geometric representation of normality (2200), i.e. the quarter sphere, is described. The asymmetry of the geometric representation of normality (2200) is well suited for data streams in intrusion problems.
For reasons of simplicity the inventive method and system are described in connection with a two-dimensional data set. The method and the system can be generalised to datasets with arbitrary dimensions, in which case the curve would be a hypersurface enclosing a higher-dimensional volume.
The invention is also applicable to monitoring of the measurements of physical parameters of operating mechanical devices, of the measurements of chemical processes and of the measurement of biological activity. In general the invention is specifically suited in situations in which continuous data is received and no a priori classification or knowledge about the source of the data is available.
Such an application is e.g. image analysis of medical samples where anomalous objects can be distinguished by a different colour or radiation pattern. Another possible medical application would be data streams representing electrical signals obtained from EEG or ECG apparatus. Here anomalous wave patterns can be automatically detected. Using EEG data the imminent occurrence of an epileptic seizure might be detected.
Furthermore, data collected online from mechanical or geophysical systems can be analysed using the inventive method and system. Mechanical stress and resulting fractures can be discerned from the data. As soon as “anomalous” data (i.e. deviations from “normal” data) is received, this might indicate a noteworthy change of conditions.
The inventive method and system could also be applied to pattern recognition in which the pattern is not known a priori which is usually the case. The “anomalous” objects would be the ones not belonging to the pattern.
There is also a possible application of the inventive method and system in connection with financial data. It could be used to identify changes in trading data indicating unwanted risks. Credit card data could be also analysed to identify risks or even fraud.
Appendix A describes the general context of online SVM. Appendix B describes a special application using a quarter-sphere method. Appendix C contains the description of some additional figures C2, C3, C5, C6, C7, C10, C11, C12. Fig. C2 gives a general overview. Appendix D explains some of the formulae.
Abstract. The paper presents two useful extensions of the incremental SVM in the context of online learning. An online support vector data description algorithm enables application of the online paradigm to unsupervised learning. Furthermore, online learning can be used in large scale classification problems to limit the memory requirements for storage of the kernel matrix. The proposed algorithms are evaluated on the task of online monitoring of EEG data, and on the classification task of learning the USPS dataset with an a-priori chosen working set size.
Many real-life machine learning problems can be more naturally viewed as online rather than batch learning problems. Indeed, the data is often collected continuously in time, and, more importantly, the concepts to be learned may also evolve in time. Significant effort has been spent in the recent years on development of online SVM learning algorithms (e.g. [17, 13, 7, 12]). The elegant solution to online SVM learning is the incremental SVM [4] which provides a framework for exact online learning. In the wake of this work two extensions to the regression SVM have been independently proposed [10, 9].
One should note, however, a significant restriction on the applicability of the above mentioned supervised online learning algorithms: the labels may not be available online, as it would require manual intervention at every update step. A more realistic scenario is the update of the existing classifier when a new batch of data becomes available. The true potential of online learning can only be realized in the context of unsupervised learning.
An important and relevant unsupervised learning problem is one-class classification [11, 14]. This problem amounts to constructing a multi-dimensional data description, and its main application is novelty (outlier) detection. In this case online algorithms are essential, for the same reasons that made on-line learning attractive in the supervised case: the dynamic nature of data and drifting concepts. An online support vector data description (SVDD) algorithm based on the incremental SVM is proposed in this paper.
Looking back at supervised learning, a different role can be seen for on-line algorithms. Online learning can be used to overcome memory limitations typical for kernel methods on large scale problems. It has long been known that storage of the full kernel matrix, or even the part of it corresponding to support vectors, can well exceed the available memory. To overcome this problem, several subsampling techniques have been proposed [16, 1]. On-line learning can provide a simple solution to the subsampling problem: make a sweep through the data with a limited working set, each time adding a new example and removing the least relevant one. Although this procedure results in an approximate solution, an experiment on the USPS data presented in this paper shows that a significant reduction of memory requirements can be achieved without a major decrease in classification accuracy.
To present the above-mentioned extensions we first need an abstract formulation of the SVM optimization problem and a brief overview of the incremental SVM. Then the details of our algorithm are presented, followed by their evaluation on real-life problems.
A smooth extension of the incremental SVM to the SVDD can be carried out by using the following abstract form of the SVM optimization problem:
where c and a are n×1 vectors, K is an n×n matrix and b is a scalar. By defining the meaning of the abstract parameters a, b and c for the particular SVM problem at hand, one can use the same algorithmic structure for different SVM algorithms. In particular, for the standard support vector classifiers [19], take c=1, a=y, b=0 and the given regularization constant C; the same definition applies to the ν-SVC [15] except that C=1/(Nν); for the SVDD [14, 18], the parameters are defined as: c=diag(K), a=y and b=−1.
Incremental (decremental) SVM provides a procedure for adding (removing) one example to (from) an existing optimal solution. When a new point k is added, its weight xk is initially assigned to 0. Then the weights of other points and μ should be updated, in order to obtain the optimal solution for the enlarged dataset. Likewise, when a point k is to be removed from the dataset, its weight is forced to 0, while updating the weights of the remaining points and μ so that the solution obtained with xk=0 is optimal for the reduced dataset. The online learning follows naturally from the incremental/decremental learning: the new example is added while some old example is removed from the working set.
The basic principle of the incremental SVM [4] is that updates to the state of the example k should keep the remaining examples in their optimal state. In other words, the Kuhn-Tucker (KT) conditions:
must be maintained for all the examples, except possibly for the current one.
To maintain optimality in practice, one can write out conditions (2)-(3) for the states before and after the update of xk. By subtracting one from the other the following condition on increments of Δx and Δg is obtained:
The subscript s refers to the examples in the set S of unbounded support vectors, and the subscript r refers to the set R of bounded support vectors (E) and other examples (O). It follows from (2) that Δgs=0. Then lines 2 and 4 of the system (4) can be re-written as:
This linear system is easily solved for Δs:
is the gradient of the linear manifold of optimal solutions parameterized by xk.
One can further substitute (6) into the lines 1 and 3 of the system (4) and obtain the following relation:
is the gradient of the linear manifold of the gradients of the examples in set R at the optimal solution parameterized by xk.
Notice that all the reasoning in the preceding section is valid only for sufficiently small Δxk such that the composition of sets S and R does not change. Although computing the optimal Δxk is not possible in one step, one can compute the largest update Δxkmax such that the composition of sets S and R remains intact. Four cases must be accounted for (in the original work of Cauwenberghs and Poggio five cases are used, but two of them easily fold together):
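$$I_{+}^{S} = \{\, i \in S : \operatorname{sign}(\Delta x_k)\,\beta_i > \varepsilon \,\}$$
$$I_{-}^{S} = \{\, i \in S : \operatorname{sign}(\Delta x_k)\,\beta_i < -\varepsilon \,\}$$
$$I_{+}^{R} = \{\, i \in E : \operatorname{sign}(\Delta x_k)\,\gamma_i > \varepsilon \,\}$$
$$I_{-}^{R} = \{\, i \in O : \operatorname{sign}(\Delta x_k)\,\gamma_i < -\varepsilon \,\}.$$
$$\operatorname{sign}(\Delta x_k)\,\gamma_k > \varepsilon,$$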
Finally, the largest possible update is computed among the four cases:
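$$\Delta x_k^{\max} = \operatorname{absmin}\left(\left[\Delta x_k^{S};\ \Delta x_k^{R};\ \Delta x_k^{g};\ \Delta x_k^{x}\right]\right). \tag{14}$$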
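The selection in (14) can be sketched in Python under the assumption that “abs min” denotes the candidate with the smallest absolute value, returned with its sign. The function name, the dictionary keys (taken from the superscripts in (14)) and the convention that the keys "g" and "x" correspond to cases 3 and 4 are assumptions of this sketch.

```python
import math


def largest_update(candidates):
    """Pick the candidate step with the smallest magnitude.

    candidates: mapping from a case label ("S", "R", "g", "x") to the
                largest step allowed by that case; math.inf marks a
                case that imposes no limit.
    Returns (case label, signed step, finished flag).
    """
    case, step = min(candidates.items(), key=lambda item: abs(item[1]))
    # Assumed stopping rule: the iteration ends when the limiting case
    # is the one for g_k ("g") or for x_k ("x").
    finished = case in ("g", "x")
    return case, step, finished


result = largest_update({"S": 0.4, "R": -0.1, "g": 0.3, "x": 0.25})
```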
The rest of the incremental SVM algorithm essentially consists of repeated computation of the update Δxkmax, update of the sets S, E and O, update of the state and of the sensitivity parameters β and γ. The iteration stops when either case 3 or case 4 occurs in the increment computation. Computational aspects of the algorithm can be found in [4].
Applying this incremental algorithm leaves open the possibility of an empty set S. This has two main consequences. First, all the blocks with the subscript s vanish from the KT conditions (4). Second, it is impossible to increase the weight of the current example since this would violate the equality constraint of the SVM. As a result, the KT conditions (4) can be written component-wise as
$$\Delta g_k = a_k\,\Delta\mu \tag{15}$$
$$\Delta g_r = a_r\,\Delta\mu. \tag{16}$$
One can see that the only free variable is Δμ, and [ak; ar] plays the role of sensitivity of the gradient with respect to Δμ. To select the points from E or O which may enter set S, a feasibility relationship similar to the main case can be derived. Resolving (15) for Δμ and substituting the result into (16), we conclude that
Then, using the KT conditions (2), the feasible index sets can be defined as
and the largest possible step Δμmax can be computed as:
As mentioned in the introduction, the online SVDD algorithm uses the same procedure as the incremental SVM, with the following definitions of the abstract parameters in problem (1): c=diag(K), a=y and b=−1. However, special care needs to be taken of the initialization stage, in order to obtain the initial feasible solution.
For the standard support vector classification, an optimal solution for a single point is possible: x1=0, b=y1. In the incremental SVDD the situation is more complicated. The difficulty arises from the fact that the equality constraint $\sum_{i=1}^{n} a_i x_i = 1$ and the box constraint $0 \le x_i \le C$ may be inconsistent; in particular the constraints cannot both be satisfied when fewer than $\lceil 1/C \rceil$ examples are available. The initial solution can be obtained by the following procedure:
This experiment shows the use of online novelty detection on non-stationary time series data. The online SVDD is applied to a BCI (Brain-Computer Interface) project [2, 3]. A subject was sitting in front of a computer, and was asked to press a key on the keyboard using the left or the right hand. During the experiment, the EEG brain signals of the subject were recorded. From these signals, the task is to predict which hand will be used for the key press. The first step in the classification task requires a distinction between ‘movement’ and ‘no-movement’ which should be made online. The incremental SVDD is used to characterize the normal activity of the brain, such that special events, like upcoming keystroke movements, are detected.
After preprocessing the EEG signals, at each time point the brain activity is characterized by 21 feature values. The sampling rate was reduced to 10 Hz. A window of 500 time points (thus 5 seconds long) at the start of the time series was used to train an SVDD. In the top plot of
In the bottom plot of
shown. Here again, an output above zero indicates that an outlier is detected. It is clear that the online version generates less false alarms, because it follows the cling data distribution. Although the detection is far from perfect, as can be observed, many of the keystrokes are indeed clearly detected as outliers. It is also clear that the method is easily triggered by the eye blinks. Unfortunately the signal is very noisy, and it is hard to quantify the exact performance for these methods on this data
To make the SVM learning applicable to very large datasets, the classifier has to be constrained to have a limited number of objects in memory. This is, in principle, exactly what an online classifier with fixed window size M does. The only difference is that removing the oldest object is not useful in this application because the same result is achieved as if the learning had been done on the last M objects. Instead, the “least relevant” object needs to be removed during each window advancement. A reasonable criterion for relevance seems to be the value of the weight. In the experiment presented below the example with the smallest weight is removed from the working set.
The dataset is the standard US Postal Service dataset, containing 7291 training and 2007 test images of handwritten digits of size 16×16 [19]. On this 10-class dataset, 10 support vector classifiers with an RBF kernel, $\sigma^2 = 0.3 \cdot 256$ and $C = 100$, were trained (the best model parameters as reported in [19] were used). During the evaluation of a new object, it is assigned to the class corresponding to the classifier with the largest output. The total classification error on the test set for different window sizes M is shown in table 1.
One can see that the classification accuracy deteriorates marginally (by about 10%) down to a working set size of 150, which is about 2% of the data. Clearly, by discarding “irrelevant” examples, one removes potential support vectors that cannot be recovered at a later stage. Therefore it is expected that the performance of the limited-memory classifier would be worse than that of an unrestricted classifier. It is also obvious that no more points than the number of support vectors are eventually needed, although the latter number is not known in advance. The average number of support vectors per unrestricted 2-class classifier in this experiment is 274. Therefore the results above can be interpreted as reducing the storage requirement by 46% from the minimal at the cost of a 10% increase in classification error.
Notice that the proposed strategy differs from the caching strategy typical of many SVMlight-like algorithms [6, 8, 5], in which kernel products are recomputed if the examples are found missing in the fixed-size cache and the accuracy of the classifier is not sacrificed. Our approach constitutes a trade-off between accuracy and computational load because kernel products never need to be recomputed. It should be noted, however, that the computational cost of re-computing the kernels can be very significant, especially for problems with complicated kernels such as string matching or convolution kernels.
Based on a revised version of the incremental SVM, we have proposed: (a) an online SVDD algorithm which, unlike all previous extensions of incremental SVM, deals with an unsupervised learning problem, and (b) a fixed-memory training algorithm for the classification SVM which makes it possible to limit the memory requirement for storage of the kernel matrix at the expense of classification performance. Experiments on novelty detection in non-stationary time series and on the USPS dataset demonstrate the feasibility of both approaches. More detailed comparisons with other subsampling techniques for limited-memory learning will be carried out in future work.
This research was partially supported through a European Community Marie Curie Fellowship and BMBF FKZ O1IBB02A. We would like to thank K. R. Müller and B. Blankertz for fruitful discussions and the use of BCI data. The authors are solely responsible for information communicated and the European Commission is not responsible for any views or results expressed.
Abstract: Practical application of data mining and machine learning techniques to intrusion detection is often hindered by the difficulty of producing clean data for training. To address this problem, a geometric framework for unsupervised anomaly detection has recently been proposed. In this framework, the data is mapped into a feature space, and anomalies are detected as the entries in sparsely populated regions. In this contribution we propose a novel formulation of a one-class Support Vector Machine (SVM) specially designed for typical IDS data features. The key idea of our “quarter-sphere” algorithm is to encompass the data with a hypersphere anchored at the center of mass of the data in feature space. The proposed method and its behavior on varying percentages of attacks in the data are evaluated on the KDDCup 1999 dataset.
The majority of current intrusion detection methods can be classified as either misuse detection or anomaly detection [NWY02]. The former identify patterns of known illegitimate activity; the latter focus on unusual activity patterns. Both groups of methods have their advantages and disadvantages. Misuse detection methods are generally more accurate but are fundamentally limited to known attacks. Anomaly detection methods are usually less accurate than misuse detection methods—in particular, their false alarm rates are hardly acceptable in practice—however, they are at least in principle capable of detecting novel attacks. This feature makes anomaly detection methods the topic of active research.
In some early approaches, e.g. [DR90, LV92], attempts were made to describe the normal behavior by means of high-level rules. This turned out to be quite a difficult task. More successful was the idea of collecting data from normal operation of a system and computing, based on this data, features describing normality; deviation of such features would be considered an anomaly. This approach is known as “supervised anomaly detection”. Different techniques have been proposed for characterizing the concept of normality, most notably statistical techniques, e.g. [De87, JLA+93, PN97, WFP99], and data mining techniques, e.g. [BCJ+01, VS00]. In practice, however, it is difficult to obtain clean data to implement these approaches. Verifying that no attacks are present in the training data may be an extremely tedious task, and for large samples this is infeasible. On the other hand, if the “contaminated” data is treated as clean, intrusions similar to the ones present in the training data will be accepted as normal patterns.
To overcome the difficulty in obtaining clean data, the idea of unsupervised anomaly detection has been recently proposed and investigated on several intrusion detection problems [PES01, EAP+02, LEK+03]. These methods compute some relevant features and use techniques of unsupervised learning to identify sparsely populated areas in feature space. The points—whether in the training or in the test data—that fall into such areas are treated as anomalies.
More precisely, two kinds of unsupervised learning methods have been investigated: clustering methods and one-class SVM. In this contribution we focus on one-class SVM methods and investigate the application of the underlying geometric ideas in the context of intrusion detection.
We present three formulations of one-class SVM that can be derived following different geometric intuitions. The formulation used in previous work was that of the hyperplane separating the normal data from the origin [SPST+01]. Another formulation, motivated by fitting a sphere over the normal data, is also well-known in the literature on kernel methods [TD99]. The novel formulation we propose in this paper is based on fitting a sphere centered at the origin to normal data. This formulation, to be referred to as a quarter-sphere, is particularly suitable to the features common in intrusion detection, whose distributions are usually one-sided and concentrated at the origin.
Finally, we present an experimental evaluation of the one-class SVM methods under a number of different scenarios.
Support Vector Machines have received great interest in the machine learning community since their introduction in the mid-1990s. We refer the reader interested in the underlying statistical learning theory and the practice of designing efficient SVM learning algorithms to the well-known literature on kernel methods, e.g. [Va95, Va98, SS02]. The one-class SVM constitutes the extension of the main SVM ideas from supervised to unsupervised learning paradigms.
We begin our investigation into the application of the one-class SVM for intrusion detection with a brief re-capitulation and critical analysis of the two known approaches to one-class SVM. It will follow from this analysis that the quarter-sphere formulation, described in section 2.4, could be better suited for the data common in intrusion detection problems.
The original idea of the one-class SVM [SPST+01] was formulated as an “estimation of the support of a high-dimensional distribution”. The essence of this approach is to map the data points xi into the feature space by some non-linear mapping Φ(xi), and to separate the resulting image points from the origin with the largest possible margin by means of a hyperplane. The geometry of this idea is illustrated in
feature space, maximization of the separation margin limits the volume occupied by the normal points to a relatively compact area in feature space. Mathematically, the problem of separating the data from the origin with the largest possible margin is formulated as follows:
The weight vector w, characterizing the hyperplane, “lives” in the feature space F, and therefore is not directly accessible (as the feature space may be extremely high-dimensional). The non-negative slack variables ξi allow for some points, the anomalies, to lie on the “wrong” side of the hyperplane. Instead of the primal problem (1), the following dual problem, in which all the variables have low dimensions, is solved in practice:
Once the solution $\alpha$ is found, one can compute the threshold parameter $\tau = \sum_j \alpha_j k(x_i, x_j)$ for some example $i$ such that $\alpha_i$ lies strictly between the bounds (such points are called support vectors). The decision, whether or not point $x$ is normal, is computed as:
$f(x) = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \tau\right). \quad (3)$
The points with f(x)=−1 are considered to be anomalies.
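The decision rule (3) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names `rbf_kernel` and `plane_decision`, the RBF parameterization exp(-||x-y||^2 / (2 sigma^2)), and the convention that a score of exactly zero counts as normal are assumptions made here.

```python
import math


def rbf_kernel(x, y, sigma):
    """RBF kernel; the 2*sigma**2 scaling is an assumed parameterization."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def plane_decision(x, points, alphas, tau, sigma=1.0):
    """Evaluate f(x) = sgn(sum_i alpha_i k(x_i, x) - tau).

    Returns +1 for a normal point and -1 for an anomaly.
    A score of exactly zero is treated as normal (assumption).
    """
    score = sum(a * rbf_kernel(p, x, sigma) for p, a in zip(points, alphas))
    return 1 if score - tau >= 0.0 else -1


if __name__ == "__main__":
    support_points = [(0.0, 0.0), (1.0, 0.0)]
    weights = [0.5, 0.5]
    print(plane_decision((0.0, 0.0), support_points, weights, tau=0.5))
    print(plane_decision((10.0, 10.0), support_points, weights, tau=0.5))
```

With two support points at (0, 0) and (1, 0), equal weights and a threshold of 0.5, a query at the first support point is classified as normal, while a query far from both is flagged as an anomaly.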
Another, somewhat more intuitive geometric idea for the one-class SVM is realized in the sphere formulation [TD99]. The normal data can be concisely described by a sphere (in a feature space) encompassing the data, as shown in
training data can be treated by introducing slack variables ξi, similarly to the plane formulation. Mathematically the problem of “soft-fitting” the sphere over the data is described as:
Similarly to the primal formulation (1) of the plane one-class SVM, one cannot directly solve the primal problem (4) of the sphere formulation, since the center c belongs to the possibly high-dimensional feature space. The same trick can be employed, i.e. the dual problem is solved instead:
The decision function can be computed as:
The squared radius $R^2$ plays the role of a threshold, and, similarly to the plane formulation, it can be computed by equating the expression under the “sgn” to zero for any support vector.
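For reference, the sphere decision rule is commonly written in kernel form as shown here. This is the standard SVDD expression, stated under the assumption that the center is expanded as $c = \sum_i \alpha_i \Phi(x_i)$, and with the sign chosen so that, as in (3), the value -1 marks an anomaly; the numbering and notation of the original equation may differ.

```latex
\|\Phi(x) - c\|^2
  = k(x, x) - 2 \sum_i \alpha_i k(x_i, x) + \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j),
\qquad
f(x) = \operatorname{sgn}\bigl( R^2 - \|\Phi(x) - c\|^2 \bigr).
```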
The similarity between the plane and the sphere formulations goes beyond merely an analogy. As it was noted in [SPST+01], for kernels k(x, y) which depend only on the difference x−y, the linear term in the objective function of the dual problem (5) is constant, and the solutions are equivalent.
When applying one-class SVM techniques to intrusion detection problems, the following observation turns out to be of crucial importance: a typical distribution of the features used in IDS is one-sided on $\mathbb{R}_0^+$. Several reasons contribute to this property. First, many IDS features are of temporal nature, and their distribution can be modeled using distributions common in survival data analysis, for example by an exponential or a Weibull distribution. Second, a popular approach to attain coherent normalization of numerical attributes is the so-called “data-dependent normalization” [BAP+02]. Under this approach, the features are defined as the deviations from the mean, measured in fractions of the standard deviation. This quantity can be seen as F-distributed. Summing up, the overwhelming mass of data lies in the vicinity of the origin.
The consequences of the one-sidedness of the data distribution for the one-class SVM can be seen in
absolute values of the normally distributed points. The anomaly detection is shown for a fixed value of the parameter ν and varying smoothness σ of the RBF kernel. The contours show the separation between the normal points and anomalies. One can see that even for the heavily regularized separation boundaries, as in the right picture, some points close to the origin are detected as anomalies. As the regularization is diminished, the one-class SVM produces a very ragged boundary and does not detect any anomalies.
The message that can be carried from this example is that, in order to account for the one-sidedness of the data distribution, one needs to use a geometric construction that is in some sense asymmetric. The new construction we propose here is the quarter-sphere one-class SVM described in the next section.
A natural way to extend the ideas of one-class SVM to one-sided non-negative data is to require the center of the fitted sphere be fixed at the origin. The geometry of this approach is shown in
following dual problem is obtained:
Note that, unlike the other two formulations, the dual problem of the quarter-sphere SVM amounts to a linear rather than a quadratic program. Herein lies the key to the significantly lower computational cost of our formulation.
It may seem somewhat strange that the non-linear mapping affects the solution only through the norms k(xi, xi) of the examples, i.e. that the geometric relations between the objects are ignored. This feature indeed poses a problem for the application of the quarter-sphere SVM with the distance-based kernels. In such case, the norms of the points are equal, and no meaningful solution to the dual problem can be found. This predicament, however, can be easily fixed. A well-known technique, originating from kernel PCA [SSM98], is to center the images of the training points Φ(xi) in feature space. In other words, the values of image points are re-computed in the local coordinate system anchored at the center of mass of the image points. This can be done by subtracting the mean from all image values:
Although this operation may not be directly computable in feature space, the impact of centering on the kernel values can be easily computed (e.g. [SSM98, SMB+99]):
where K is the l×l kernel matrix with the values $K_{ij} = k(x_i, x_j)$, and $1_l$ is an l×l matrix with all values equal to 1/l. After centering in feature space, the norms of points in the local coordinate system are no longer all equal, and the dual problem of the quarter-sphere formulation can be easily solved.
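In the notation just defined, the centering identity known from kernel PCA is usually written as follows. It is given here as the standard form and is assumed to correspond to the formula referred to above.

```latex
\tilde{K} = K - 1_l K - K 1_l + 1_l K 1_l
```

The diagonal entries $\tilde{K}_{ii}$ are the squared norms of the centered images, which are the quantities through which the mapping affects the quarter-sphere solution.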
To compare the quarter-sphere formulation with the other one-class SVM approaches and to investigate some properties of our algorithm, experiments are carried out on the KDDCup 1999 dataset. This dataset comprises connection record data collected in the 1998 DARPA IDS evaluation. The features characterizing these connection records are pre-computed in the KDDCup dataset.
One of the problems with the connection record data from the KDDCup/DARPA data is that a large proportion (about 75%) of the connections represent the anomalies. In previous work [PES01, EAP+02] it was assumed that anomalies constitute only a small fraction of the data, and the results are reported on subsampled datasets, in which the ratio of anomalies is artificially reduced to 1-1.5%. To render our results comparable with previous work we also subsample the data. The results reported below are averaged over 10 runs of the algorithms in any particular setup.
We first compare the quarter-sphere one-class SVM with the other two algorithms. Since the sphere and the plane formulations are equivalent for the RBF kernels, identical results are produced for these two formulations.
The experiments are carried out for two different values of the parameter σ of the RBF kernel: 1 and 12 (the latter value used in [EAP+02]). These values correspond to low and moderate regularization. As the evaluation criterion, we use the portion of the ROC curve between the false alarm rates of 0 and 0.1, since higher false alarm rates are unacceptable for intrusion detection. The comparison of ROCs of the three formulations for the two values of σ is shown in
consistently outperforms the other two formulations; especially at the low value of regularization parameter. The best overall results are achieved with the medium regularization with σ=12, which has most likely been selected in [EAP+02] after careful experimentation. The advantage of the quarter-sphere in this case is not so dramatic as with low regularization, but is nevertheless very significant for low false alarm rates.
The assumption that intrusions constitute a small fraction of the data may not be satisfied in a realistic situation. Some attacks, most notably the denial-of-service attacks, manifest themselves precisely in a large number of connections. Therefore, the problem of a large ratio of anomalies needs to be addressed.
In the experiments in this section we investigate the performance of the sphere and the quarter-sphere one-class SVM as a function of the attack ratio. It is known from the literature [TD99, SPST+01] that the parameter ν of the one-class SVM can be interpreted as an upper bound on the ratio of the anomalies in the data. The effect of this parameter on the quarter-sphere formulation is different: it specifies that exactly a fraction ν of the points is expected to be anomalies. This is admittedly a more stringent assumption, and methods for the automatic determination of the anomaly ratio must be further investigated. Herein we perform a simple comparison of the algorithms under the following three scenarios:
Under the scenario that ν matches the anomaly ratio it is assumed that perfect information about the anomaly ratio is available. One would expect that the parameter ν can tune both kinds of one-class SVM to the specific anomaly ratio. This, however, does not happen, as can be seen from
Under the scenario with fixed ν it is assumed that no information about the anomaly ratio is available, and that this parameter is simply set by the user to some arbitrary value. As one can see from
Under the scenario with fixed anomaly ratio and the varying ν we investigate what impact the adjustment of the parameter has on the same dataset. As it can be seen from
achieved on the higher values. The parameter ν does not have any impact on the accuracy of the quarter-sphere one-class SVM.
We have presented a novel one-class SVM formulation, the quarter-sphere SVM, that is optimized for non-negative attributes with one-sided distribution. Such data is frequently used in intrusion detection systems. The one-class SVM formulations previously applied in the context of unsupervised anomaly detection do not account for non-negativity and one-sidedness; as a result, they can potentially detect very common patterns, whose attributes are close to the origin, as anomalies. The quarter-sphere SVM avoids this problem by aligning the center of the sphere fitted to the data with the “center of mass” of the data in feature space.
Our experiments conducted on the KDDCup 1999 dataset demonstrate significantly better accuracy of the quarter-sphere SVM in comparison with the previous, sphere or plane, formulations. Especially noteworthy is the advantage of the new algorithm at low false alarm rates.
We have also investigated the behavior of one-class SVM as a function of attack rate. It is shown that the accuracy of all three formulations of one-class SVM considered here degrades with the growing percentage of attacks, contrary to the expectation that the parameter ν of one-class SVM, if properly set, should tune it to the required anomaly rate. We have found that the performance degradation with the perfectly set tuning parameters is essentially the same as when the parameter is set to some arbitrary value. We believe that performance of anomaly detection algorithms on higher anomaly rates should be given special attention in the future work, especially with respect to the data normalization techniques.
The authors gratefully acknowledge the funding from the Bundesministerium für Bildung und Forschung under the project MIND (FKZ 01-SC40A). We also thank Klaus-Robert Müller and Stefan Harmeling for valuable suggestions and discussions.
FIG. 3—Operation of the Flow Control Unit of the Plane/Sphere Agent
The Flow control unit reads the following data as the arguments:
The following sequence of actions is performed in a loop for each incoming example ‘X’.
The resulting object ‘obj’ is the output data of the Flow control unit and it is passed to other parts of the online anomaly detection engine as the plane/sphere representation.
At the beginning of the system's operation, the initialization unit takes over control from the flow control unit until the system can be brought into the equilibrium state. It reads the examples from the feature stream (1300), assigns them the weight of C and puts them into the set E until floor(1/C) examples have been seen. The next example gets the weight of 1−floor(1/C)·C, so that all weights sum to one, and is put into set S. Afterwards the control is passed back to the flow control unit.
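The start-up weight assignment can be illustrated with a short Python sketch. It assumes that the weights must sum to one, so that the example placed in set S receives the remainder 1 − floor(1/C)·C; the function name `initial_weights` and the returned tuple layout are hypothetical and chosen only for this illustration.

```python
import math


def initial_weights(C):
    """Return (weights_E, weight_S) for the start-up phase.

    weights_E: weights of the first floor(1/C) examples, all equal to C (set E).
    weight_S:  weight of the next example (set S), chosen so that the total is 1.
    """
    if C <= 0.0:
        raise ValueError("C must be positive")
    n_bound = math.floor(1.0 / C)
    weights_E = [C] * n_bound
    weight_S = 1.0 - n_bound * C
    return weights_E, weight_S


if __name__ == "__main__":
    w_E, w_S = initial_weights(0.3)
    print(len(w_E), w_S, sum(w_E) + w_S)
```

For C = 0.3, three examples enter set E with weight 0.3 each, and the fourth example enters set S with a weight of approximately 0.1.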
FIG. 5—Operation of the Importation Unit of the Plane/Sphere Agent
The Importation unit reads the following data as the arguments:
Upon reading the new example the importation unit performs initialization of some internal data structures (expansion of internal data and kernel storage, allocation of memory for gradient and sensitivity parameters etc.)
A check of equilibrium of the system including the new example is performed (i.e. it is verified whether the current assignment of weights satisfies the Karush-Kuhn-Tucker conditions). If the system has reached the equilibrium state, the importation unit terminates and outputs the current state of the object ‘obj’. If the system is not in equilibrium, processing continues until such a state is reached.
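A minimal sketch of such an equilibrium check is given here. It assumes the usual sign convention for a minimization problem with box constraints: the gradient must be non-negative for examples with zero weight (set O), zero for examples with weights strictly inside the box (set S), and non-positive for examples at the upper bound C (set E). The function name `in_equilibrium` and the numerical tolerance are assumptions of this sketch.

```python
def in_equilibrium(weights, gradients, C, tol=1e-9):
    """Check the Karush-Kuhn-Tucker conditions for all examples.

    weights:   current weight of each example, each in [0, C]
    gradients: gradient of the objective with respect to each weight
    """
    for weight, gradient in zip(weights, gradients):
        if weight <= tol:              # set O: weight at the lower bound
            if gradient < -tol:
                return False
        elif weight >= C - tol:        # set E: weight at the upper bound
            if gradient > tol:
                return False
        else:                          # set S: weight strictly inside the box
            if abs(gradient) > tol:
                return False
    return True


if __name__ == "__main__":
    print(in_equilibrium([0.0, 0.2, 0.5], [0.3, 0.0, -0.1], C=0.5))
    print(in_equilibrium([0.0], [-1.0], C=0.5))
```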
Sensitivity parameters are updated so as to account for the latest update of the object's state or to compute the values corresponding to the initial state of the object with the new example added. Sensitivity parameters reflect the sensitivity of the weights and the gradients of all examples in the working set with respect to an infinitesimal change of weight of the incoming example.
Depending on whether the set S (maintained in the internal storage) is empty, one of the following processing paths is taken.
If the set S is empty, the only free parameter of the object is the threshold ‘b’. To update ‘b’, the possible increments of the threshold ‘b’ are computed for all points in sets E and O such that the gradients of these points are forced to zero. Gradient sensitivity parameters are used to carry out this operation efficiently. The smallest of such increments is chosen, and the example whose gradient is brought to zero by this increment is added to set S (and removed from the corresponding index set, E or O).
If the set S is not empty, four possible increments need to be computed so that the selection is made among them. The increment ‘inc_a’ is the smallest increment of the weight of the current example such that the induced change of the weights of the examples in set S brings the weight of some of these examples to the border of the box (i.e. forces it to take on the value of zero or C). This increment is determined as the minimum of all such possible increments for each example in set S individually, computed using the weight sensitivity parameters. The increment ‘inc_g’ is the smallest increment of the weight of the current example such that the induced change of the gradients of the examples in sets E and O brings the gradient of some of these examples to zero. This increment is determined as the minimum of all such possible increments for each example in sets E and O individually, computed using the gradient sensitivity parameters. The increment ‘inc_ac’ is the possible increment of the weight of the new example. It is computed as the difference between the upper bound C on the weight of an example and the current weight a_c of the new example. The increment ‘inc_gc’ is the possible increment of the weight of the new example such that the gradient of the new example becomes zero. This increment is computed using the gradient sensitivity of the new example.
After the four possible increments are computed, the smallest one among them and the index ‘ind’ of the example associated with it are determined. Depending on which of the four increments yields the minimum value, the following processing steps are taken:
If the minimum is yielded by the increment ‘inc_a’ the example referred to by the index ‘ind’ is removed from set S.
If the minimum is yielded by the increment ‘inc_ac’ the example referred to by the index ‘ind’ (in this case it is the new example) is added to set E.
In the other two remaining cases (‘inc_g’ and ‘inc_gc’) the example referred to by the index ‘ind’ is added to set S.
After the composition of index sets is updated, the state of the object is updated. This operation consists of applying the computed increments to the weights of all examples in the working set and to the threshold ‘b’.
The resulting object ‘obj’ is the output data of the Importation unit and it is passed to the flow control unit (2112).
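The bookkeeping step of the importation unit, i.e. choosing the smallest of the four increments and updating the index sets, can be sketched as follows. The sketch covers only the set updates described above; the computation of the increments from the sensitivity parameters and the update of weights and threshold are not shown. The function name `select_and_update` and the dictionary layout are hypothetical.

```python
def select_and_update(candidates, S, E, O):
    """Pick the smallest increment and update the index sets in place.

    candidates: maps an increment name ('inc_a', 'inc_g', 'inc_ac', 'inc_gc')
                to a pair (value, ind), where ind is the associated example.
    S, E, O:    Python sets holding the indices of the three index sets.
    Returns the tuple (name, value, ind) of the selected increment.
    """
    name, (value, ind) = min(candidates.items(), key=lambda item: item[1][0])
    if name == "inc_a":
        # A weight in S reached the border of the box: the example leaves S.
        S.discard(ind)
    elif name == "inc_ac":
        # The weight of the new example reached C: it joins set E.
        E.add(ind)
    else:
        # 'inc_g' or 'inc_gc': a gradient reached zero, the example joins S
        # and is dropped from E or O if it was there.
        E.discard(ind)
        O.discard(ind)
        S.add(ind)
    return name, value, ind


if __name__ == "__main__":
    S, E, O = {1, 2}, {3}, {4}
    increments = {"inc_a": (0.5, 1), "inc_g": (0.2, 4),
                  "inc_ac": (0.9, 5), "inc_gc": (0.7, 5)}
    print(select_and_update(increments, S, E, O), S, E, O)
```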
FIG. 6—Operation of the Relevance Unit of the Plane/Sphere Agent
The Relevance unit reads the following data as the arguments:
If ‘TSFlag’ is set, the oldest example in the working set is the least relevant example;
otherwise the following selection is made:
If set On (not cached examples from set O) of the object is not empty, an example is selected at random from the set On, otherwise
If set Oc (cached examples from set O) of the object is not empty, an example is selected at random from the set Oc, otherwise
If set S is not empty, the example with the minimum weight is selected from set S, otherwise
The example is selected at random from the set E.
The output of the relevance unit is the index ‘ind’ of the selected example. It is passed to the flow control unit (2112).
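The selection cascade of the relevance unit can be sketched as follows. The function name `select_least_relevant`, the argument layout and the representation of the working set as an arrival-ordered list are assumptions of this illustration; only the order of the checks follows the description above.

```python
import random


def select_least_relevant(ts_flag, arrival_order, On, Oc, S, E, weights, rng=random):
    """Return the index 'ind' of the least relevant example.

    ts_flag:       True if the oldest example is to be removed
    arrival_order: indices of the working set, oldest first
    On, Oc, S, E:  index sets (non-cached O, cached O, S and E)
    weights:       mapping from example index to its current weight
    rng:           object with a choice() method (the random module by default)
    """
    if ts_flag:
        return arrival_order[0]
    if On:
        return rng.choice(sorted(On))
    if Oc:
        return rng.choice(sorted(Oc))
    if S:
        return min(S, key=lambda i: weights[i])
    return rng.choice(sorted(E))


if __name__ == "__main__":
    w = {0: 0.3, 1: 0.1, 2: 0.6}
    print(select_least_relevant(False, [0, 1, 2], set(), set(), {0, 1, 2}, set(), w))
```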
FIG. 7—Operation of the Removal Unit of the Plane/Sphere Agent
The Removal unit reads the following data as the arguments:
Upon reading the input arguments the removal unit performs initialization of some internal data structures (contraction of internal data and kernel storage, of gradient and sensitivity parameters etc.)
A check of the weight of the example ‘ind’ is performed. If the weight of this example is equal to zero, control is returned to the flow control unit (2112); otherwise operation continues until the weight of the example ‘ind’ reaches zero.
Sensitivity parameters are updated so as to account for the latest update of the object's state or to compute the values corresponding to the initial state of the object with the example ‘ind’ removed. Sensitivity parameters reflect the sensitivity of the weights and the gradients of all examples in the working set with respect to an infinitesimal change of weight of the outgoing example.
Depending on whether the set S (maintained in the internal storage) is empty, one of the following processing paths is taken.
If the set S is empty, the only free parameter of the object is the threshold ‘b’. To update ‘b’, the possible increments of the threshold ‘b’ are computed for all points in sets E and O such that the gradients of these points are forced to zero. Gradient sensitivity parameters are used to carry out this operation efficiently. The smallest of such increments is chosen, and the example whose gradient is brought to zero by this increment is added to set S (and removed from the corresponding index set, E or O).
If the set S is not empty, three possible increments need to be computed so that the selection is made among them. The increment ‘inc_a’ is the smallest increment of the weight of the example ‘ind’ such that the induced change of the weights of the examples in set S brings the weight of some of these examples to the border of the box (i.e. forces it to take on the value of zero or C). This increment is determined as the minimum of all such possible increments for each example in set S individually, computed using the weight sensitivity parameters. The increment ‘inc_g’ is the smallest increment of the weight of the example ‘ind’ such that the induced change of the gradients of the examples in sets E and O brings the gradient of some of these examples to zero. This increment is determined as the minimum of all such possible increments for each example in sets E and O individually, computed using the gradient sensitivity parameters. The increment ‘inc_ac’ is the possible increment of the weight of the example ‘ind’. It is computed as the negative difference between the current weight a_c of the example ‘ind’ and zero.
After the three possible increments are computed, the one with the smallest absolute value among them and the index ‘ind’ of the example associated with it are determined. Depending on which of the three increments yields the minimum value, the following processing steps are taken:
If the minimum is yielded by the increment ‘inc_a’ the example referred to by the index ‘ind’ is removed from set S.
If the minimum is yielded by the increment ‘inc_ac’, nothing is to be done (this is the termination condition, which is detected in the next iteration).
In the other remaining case (‘inc_g’) the example referred to by the index ‘ind’ is added to set S.
After the composition of index sets is updated, the state of the object is updated. This operation consists of applying the computed increments to the weights of all examples in the working set and to the threshold ‘b’.
After the termination of the loop the example being removed is purged, i.e. all data structures associated with it (kernel cache, index sets etc.) are permanently cleared out.
The resulting object ‘obj’ is the output data of the Removal unit and it is passed to the flow control unit (2112).
FIG. 10—Operation of the Flow Control Unit of the Quarter-Sphere Agent
The Flow control unit reads the following data as the arguments:
The following sequence of actions is performed in a loop for each incoming example ‘X’.
The resulting object ‘obj’ is the output data of the Flow control unit and it is passed to other parts of the online anomaly detection engine as the quarter-sphere representation.
FIG. 11—Operation of the Centering Unit of the Quarter-Sphere Agent
The Centering unit reads the following data as the arguments:
Upon reading of the example ‘X’ the centering unit computes the kernel row for this example, i.e. a row vector of kernel values for this example and all other examples in the working set.
Depending on the value of ‘OPFlag’ the following operations are performed:
If “expand” operation is requested,
The resulting object ‘obj’ is the output data of the Centering unit and it is passed to the flow control unit (2212).
FIG. 12—Operation of the Sorting Unit of the Quarter-Sphere Agent
The Sorting unit reads the following data as the arguments:
Depending on the value of ‘ModeFlag’, the sorting unit invokes the usual sorting operation (e.g. QuickSort) if the adaptive mode is indicated, or the median-finding operation (which is cheaper than sorting) if the fixed mode is indicated.
The output of the Sorting unit is the ordered vector of norms of the examples in the working set, where the ordering depends on the requested mode. This vector is passed to the flow control unit (2122).
This technical report provides some additional mathematical and technical details on implementation of quarter-sphere SVM.
The dual formulation of the quarter-sphere SVM is given by the following linear program:
The simplicity of equality constraints in problem (1) gives rise to an extremely efficient procedure for finding a solution. One can clearly see that in order to minimize the objective function of the problem (1) one should give as much weight as possible to the points with the largest norms $k(x_i, x_i)$. Since the weight $\alpha_i$ is bounded above by $1/(\nu l)$, the solution is to fix the weights at the upper bound for the $\lfloor \nu l \rfloor$ points with the largest norms, and to assign the weight of $1 - \lfloor \nu l \rfloor / (\nu l)$ to the point with the next largest norm. The remaining points receive zero weights. From the algorithmic point of view, the problem amounts to finding the $\lfloor \nu l \rfloor$-th order statistic, which can be done in linear time by a “median-find” type of algorithm.
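The greedy solution described above can be sketched as follows. The sketch assumes that the dual has the form: minimize $-\sum_i \alpha_i k(x_i, x_i)$ subject to $\sum_i \alpha_i = 1$ and $0 \le \alpha_i \le 1/(\nu l)$. The function name `quarter_sphere_weights` is hypothetical, and a full sort is used for brevity instead of the linear-time selection mentioned in the text.

```python
import math


def quarter_sphere_weights(norms, nu):
    """Assign dual weights given the norms k(x_i, x_i) of all examples.

    The floor(nu * l) examples with the largest norms get the upper bound
    1 / (nu * l), the next largest gets the remainder so that the weights
    sum to one, and all other examples get zero.
    """
    if not 0.0 < nu <= 1.0:
        raise ValueError("nu must lie in (0, 1]")
    l = len(norms)
    upper = 1.0 / (nu * l)
    n_full = math.floor(nu * l)
    # Indices ordered by decreasing norm (sorting is a simplification).
    order = sorted(range(l), key=lambda i: norms[i], reverse=True)
    alpha = [0.0] * l
    for i in order[:n_full]:
        alpha[i] = upper
    if n_full < l:
        alpha[order[n_full]] = 1.0 - n_full * upper
    return alpha


if __name__ == "__main__":
    example_norms = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4, 0.6, 1.0]
    print(quarter_sphere_weights(example_norms, nu=0.25))
```

With ten examples and ν = 0.25, the two examples with the largest norms receive the weight 0.4 each, the third largest receives approximately 0.2, and all others receive zero.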
It may seem somewhat strange that the non-linear mapping affects the solution only through the norms $k(x_i, x_i)$ of the examples; that is, the geometric relations between the objects are ignored. This feature indeed poses a problem for the application of the quarter-sphere SVM with distance-based kernels. In such a case, the norms of all points are equal, and no meaningful solution to the dual problem can be found. To avoid this predicament, centering of the images of the training points $\Phi(x_i)$ in feature space, a well-known technique originating from kernel PCA [2], can be applied. In other words, the values of the image points are re-computed in the local coordinate system anchored at the center of mass of the image points. This is done by subtracting the mean from all image values:
$$\tilde{\Phi}(x_i) = \Phi(x_i) - \frac{1}{l} \sum_{j=1}^{l} \Phi(x_j)$$
Although this operation may be intractable in a high-dimensional feature space, the impact of centering on the kernel values can be easily computed (e.g. [2, 1]):
$$\tilde{K} = K - 1_l K - K 1_l + 1_l K 1_l \qquad (2)$$
where $K$ is the $l \times l$ kernel matrix with the values $K_{ij} = k(x_i, x_j)$, and $1_l$ is an $l \times l$ matrix with all values equal to $1/l$. After centering in feature space, the norms of the points in the local coordinate system are no longer all equal, and the dual problem of the quarter-sphere formulation can be easily solved.
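The effect of centering on the norms can be checked with a small sketch. It assumes the standard kernel PCA centering, in which the centered diagonal entry for example i equals K_ii minus 2/l times the i-th row sum plus 1/l² times the sum of all entries (valid for a symmetric kernel matrix). The Gaussian kernel, the linear kernel and the function names are illustrative assumptions.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; a distance-based kernel chosen for illustration."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def centered_diagonal(K):
    """Diagonal of the centered kernel matrix, computed directly in O(l^2)."""
    l = len(K)
    row_sums = [sum(row) for row in K]
    total = sum(row_sums)
    return [K[i][i] - 2.0 * row_sums[i] / l + total / (l * l) for i in range(l)]

# Distance-based kernel: all uncentered norms equal 1, centered norms differ.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
K_rbf = [[rbf_kernel(a, b) for b in points] for a in points]
uncentered = [K_rbf[i][i] for i in range(len(points))]
centered = centered_diagonal(K_rbf)

# Sanity check with a linear kernel, where centering can be done explicitly:
# for x = 0, 2, 4 the mean is 2, so the squared centered norms are 4, 0, 4.
xs = [0.0, 2.0, 4.0]
K_lin = [[a * b for b in xs] for a in xs]
linear_centered = centered_diagonal(K_lin)  # → [4.0, 0.0, 4.0]
```

The linear-kernel check works because the feature map is the identity there, so the centered norm is simply the squared distance to the mean.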
From the computational point of view, the centering operation (2) poses a problem, since it has to be performed every time a new point is added to or removed from the dataset, and the cost of this operation, if performed directly, is $O(l^3)$. Fortunately, only the $l$ diagonal elements of $\tilde{K}$ are used. In the following, formulas are developed for computing the updates to the values of these elements when an example is added or removed.
In this section, the recursive relations connecting the values on the main diagonal of the centered kernel matrix $\tilde{K}$ before and after the addition of the $l$-th example are developed. First consider the centered value $\tilde{K}_{ll}^{(l)}$, where the superscript $(l)$ denotes that the quantity pertains to the state after the example $l$ is added. Observe that:
where the auxiliary term $F^{(l-1)}$, depending only on the previous $l-1$ examples, is defined as:
In a similar way, the value $\tilde{K}_{kk}^{(l)}$, $k < l$, is obtained:
where the auxiliary term $G_k^{(l-1)}$, depending only on the previous $l-1$ examples, is defined as:
It can easily be seen that, apart from the cost of computing the auxiliary terms $F^{(l-1)}$ and $G_k^{(l-1)}$, computing the update to each diagonal entry of $\tilde{K}$ takes $O(1)$ time (taking into account that $K_{li}$ needs to be computed only once and its cost can be amortized over all $l$ diagonal entries). Finally, it remains to be shown that maintaining the auxiliary terms does not cost any extra work. The following recursive relationships hold between the respective auxiliary quantities:
The amortized cost of these operations is O(1).
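The idea of the incremental update on addition can be sketched as follows. The exact definitions of the auxiliary terms are not reproduced here, so this sketch uses stand-ins that are assumptions: `total` (the sum of all kernel matrix entries) plays the role of an F-type term and `row_sums[k]` (the sum of row k) the role of a G-type term. The class name and the Gaussian kernel are likewise hypothetical.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; chosen for illustration only."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

class CenteredNorms:
    """Maintains the diagonal of the centered kernel matrix under additions."""

    def __init__(self, kernel=rbf_kernel):
        self.kernel = kernel
        self.points = []
        self.diag = []       # uncentered values K_kk
        self.row_sums = []   # assumed stand-in for the G-type terms
        self.total = 0.0     # assumed stand-in for the F-type term

    def add(self, x):
        # The kernel row of the new example is computed once.
        row = [self.kernel(x, p) for p in self.points]
        k_xx = self.kernel(x, x)
        row_sum = sum(row)
        # O(1) work per existing example.
        for k, value in enumerate(row):
            self.row_sums[k] += value
        self.points.append(x)
        self.diag.append(k_xx)
        self.row_sums.append(row_sum + k_xx)
        self.total += 2.0 * row_sum + k_xx

    def centered_norms(self):
        # Each centered diagonal entry follows in O(1) from the stored terms.
        l = len(self.points)
        return [d - 2.0 * r / l + self.total / (l * l)
                for d, r in zip(self.diag, self.row_sums)]

tracker = CenteredNorms()
for p in [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]:
    tracker.add(p)
norms = tracker.centered_norms()
```

Each addition touches every stored example once, so the work per diagonal entry is constant, in line with the amortized cost stated above.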
A similar recursive technique underlies the update formulas for the removal of an example. To simplify the notation we assume that the example to be removed has index $l$. In this case only the diagonal values of $\tilde{K}$ for examples with $k < l$ are to be updated:
The recursive relations between the auxiliary terms are computed as follows:
The analysis of the update expressions above reveals that all operations have a running time of $O(1)$, except for the sum $\sum_{i=1}^{l} K_{li}$, which can be computed once and amortized over all $l-1$ entries to be updated.
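A removal step in the same spirit can be sketched as below. As before, the stored quantities are assumed stand-ins rather than the auxiliary terms of the report: `row_sums[k]` is the sum of row k of the kernel matrix and `total` is the sum of all its entries. The function names are hypothetical, and a linear kernel on scalars is used so that the result can be verified by hand.

```python
def remove_last(diag, row_sums, total, last_row):
    """Remove the example with index l (the last one).

    last_row holds K_li for i = 1..l, with K_ll as its final entry."""
    row_sum = sum(last_row)   # computed once, amortized over l - 1 entries
    k_ll = last_row[-1]
    new_diag = diag[:-1]
    # O(1) per remaining example: drop the contribution of the removed column.
    new_row_sums = [r - v for r, v in zip(row_sums[:-1], last_row[:-1])]
    # Drop row l and column l from the grand sum (K_ll was counted once).
    new_total = total - 2.0 * row_sum + k_ll
    return new_diag, new_row_sums, new_total

def centered_norms(diag, row_sums, total):
    """Diagonal of the centered kernel matrix from the stored terms."""
    l = len(diag)
    return [d - 2.0 * r / l + total / (l * l) for d, r in zip(diag, row_sums)]

# Linear kernel on scalars, K_ij = x_i * x_j, for the examples 0, 2, 4, 10.
xs = [0.0, 2.0, 4.0, 10.0]
K = [[a * b for b in xs] for a in xs]
diag = [K[i][i] for i in range(len(xs))]
row_sums = [sum(row) for row in K]
total = sum(row_sums)

diag, row_sums, total = remove_last(diag, row_sums, total, K[-1])
after_removal = centered_norms(diag, row_sums, total)  # → [4.0, 0.0, 4.0]
```

After the removal the remaining examples 0, 2, 4 have mean 2, so the squared centered norms are 4, 0 and 4, which matches the updated values.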
Number | Date | Country | Kind |
---|---|---|---|
03090256.3 | Aug 2003 | EP | regional |
04090263.7 | Jun 2004 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2004/009221 | 8/17/2004 | WO | 00 | 11/2/2007 |