Support vector data description (SVDD) is a machine-learning technique used for single-class classification and for outlier or anomaly detection. The SVDD classifier partitions the whole space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier uses a kernel function, with the Gaussian kernel being a common choice. The Gaussian kernel has a bandwidth parameter, and it is important to set the value of this parameter correctly for good results. A small bandwidth leads to over-fitting, and the resulting SVDD classifier overestimates the number of anomalies. A large bandwidth leads to under-fitting, and the resulting SVDD classifier underestimates the number of anomalies, so that many anomalies or outliers may go undetected by the classifier.
In an example embodiment, a non-transitory computer-readable medium is provided having stored thereon computer-readable instructions that, when executed by a computing device, cause the computing device to determine a bandwidth parameter value for a support vector data description for outlier identification. A mean pairwise distance value is computed between a plurality of observation vectors, wherein each observation vector of the plurality of observation vectors includes a variable value for each variable of a plurality of variables. A scaling factor value is computed based on a number of the plurality of observation vectors and a predefined tolerance value. A Gaussian bandwidth parameter value is computed using the computed mean pairwise distance value and the computed scaling factor value. An optimal value of an objective function is computed that includes a Gaussian kernel function that uses the computed Gaussian bandwidth parameter value. The objective function defines a support vector data description (SVDD) model using the plurality of observation vectors to define a set of support vectors. The computed Gaussian bandwidth parameter value and the defined set of support vectors are output for determining if a new observation vector is an outlier.
In another example embodiment, a computing device is provided. The computing device includes, but is not limited to, a processor and a non-transitory computer-readable medium operably coupled to the processor. The computer-readable medium has instructions stored thereon that, when executed by the computing device, cause the computing device to determine a bandwidth parameter value for a support vector data description for outlier identification.
In yet another example embodiment, a method of determining a bandwidth parameter value for a support vector data description for outlier identification is provided.
Other principal features of the disclosed subject matter will become apparent to those skilled in the art upon review of the following drawings, the detailed description, and the appended claims.
Illustrative embodiments of the disclosed subject matter will hereafter be described referring to the accompanying drawings, wherein like numerals denote like elements.
Support vector data description (SVDD), like other one-class classifiers, provides a geometric description of observed data. For each point in the domain space, the SVDD classifier computes a distance that measures the separation of that point from the training data. During scoring, if an observation is found to be a large distance from the training data, it may be an anomaly, and the user may choose to generate an alert that a system or a device is not performing as expected or that a detrimental event has occurred.
Referring to
Input interface 102 provides an interface for receiving information from the user or another device for entry into SVDD training device 100 as understood by those skilled in the art. Input interface 102 may interface with various input technologies including, but not limited to, a keyboard 112, a microphone 113, a mouse 114, a display 116, a track ball, a keypad, one or more buttons, etc. to allow the user to enter information into SVDD training device 100 or to make selections presented in a user interface displayed on display 116. The same interface may support both input interface 102 and output interface 104. For example, display 116 comprising a touch screen provides a mechanism for user input and for presentation of output to the user. SVDD training device 100 may have one or more input interfaces that use the same or a different input interface technology. The input interface technology further may be accessible by SVDD training device 100 through communication interface 106.
Output interface 104 provides an interface for outputting information for review by a user of SVDD training device 100 and/or for use by another application or device. For example, output interface 104 may interface with various output technologies including, but not limited to, display 116, a speaker 118, a printer 120, etc. SVDD training device 100 may have one or more output interfaces that use the same or a different output interface technology. The output interface technology further may be accessible by SVDD training device 100 through communication interface 106.
Communication interface 106 provides an interface for receiving and transmitting data between devices using various protocols, transmission technologies, and media as understood by those skilled in the art. Communication interface 106 may support communication using various transmission media that may be wired and/or wireless. SVDD training device 100 may have one or more communication interfaces that use the same or a different communication interface technology. For example, SVDD training device 100 may support communication using an Ethernet port, a Bluetooth antenna, a telephone jack, a USB port, etc. Data and messages may be transferred between SVDD training device 100 and another computing device of a distributed computing system 128 using communication interface 106.
Computer-readable medium 108 is an electronic holding place or storage for information so the information can be accessed by processor 110 as understood by those skilled in the art. Computer-readable medium 108 can include, but is not limited to, any type of random access memory (RAM), any type of read only memory (ROM), any type of flash memory, etc. such as magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips, . . . ), optical disks (e.g., compact disc (CD), digital versatile disc (DVD), . . . ), smart cards, flash memory devices, etc. SVDD training device 100 may have one or more computer-readable media that use the same or a different memory media technology. For example, computer-readable medium 108 may include different types of computer-readable media that may be organized hierarchically to provide efficient access to the data stored therein as understood by a person of skill in the art. As an example, a cache may be implemented in a smaller, faster memory that stores copies of data from the most frequently/recently accessed main memory locations to reduce an access latency. SVDD training device 100 also may have one or more drives that support the loading of a memory media such as a CD, DVD, an external hard drive, etc. One or more external hard drives further may be connected to SVDD training device 100 using communication interface 106.
Processor 110 executes instructions as understood by those skilled in the art. The instructions may be carried out by a special purpose computer, logic circuits, or hardware circuits. Processor 110 may be implemented in hardware and/or firmware. Processor 110 executes an instruction, meaning it performs/controls the operations called for by that instruction. The term “execution” refers to the process of running an application or carrying out the operation called for by an instruction. The instructions may be written using one or more programming languages, scripting languages, assembly languages, etc. Processor 110 operably couples with input interface 102, with output interface 104, with communication interface 106, and with computer-readable medium 108 to receive, to send, and to process information. Processor 110 may retrieve a set of instructions from a permanent memory device and copy the instructions in an executable form to a temporary memory device that is generally some form of RAM. SVDD training device 100 may include a plurality of processors that use the same or a different processing technology.
Some machine-learning approaches may be more efficiently and speedily executed and processed with machine-learning specific processors (e.g., not a generic CPU). Such processors may also provide additional energy savings when compared to generic CPUs. For example, some of these processors can include a graphical processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), an artificial intelligence (AI) accelerator, a purpose-built chip architecture for machine learning, and/or some other machine-learning specific processor that implements a machine learning approach using semiconductor (e.g., silicon (Si), gallium arsenide (GaAs)) devices. These processors may also be employed in heterogeneous computing architectures with a number of and a variety of different types of cores, engines, nodes, and/or layers to achieve additional various energy efficiencies, processing speed improvements, data communication speed improvements, and/or data efficiency targets and improvements throughout various parts of the system.
Training application 122 performs operations associated with computing a value for a Gaussian bandwidth parameter value s and defining SVDD 126 from data stored in training dataset 124. SVDD 126 may be used to classify data stored in a dataset 1824 (shown referring to
Referring to the example embodiment of
Training application 122 may be integrated with other system processing tools to automatically process data generated as part of operation of an enterprise, device, system, facility, etc., to identify any outliers in the processed data, to monitor changes in the data, and to provide a warning or alert associated with the monitored data using input interface 102, output interface 104, and/or communication interface 106 so that appropriate action can be initiated in response to changes in the monitored data.
Training application 122 may be implemented as a Web application. For example, training application 122 may be configured to receive hypertext transfer protocol (HTTP) responses and to send HTTP requests. The HTTP responses may include web pages such as hypertext markup language (HTML) documents and linked objects generated in response to the HTTP requests. Each web page may be identified by a uniform resource locator (URL) that includes the location or address of the computing device that contains the resource to be accessed in addition to the location of the resource on that computing device. The type of file or resource depends on the Internet application protocol such as the file transfer protocol, HTTP, H.323, etc. The file accessed may be a simple text file, an image file, an audio file, a video file, an executable, a common gateway interface application, a Java applet, an extensible markup language (XML) file, or any other type of file supported by HTTP.
Training dataset 124 may include, for example, a plurality of rows and a plurality of columns. The plurality of rows may be referred to as observation vectors or records (observations), and the columns may be referred to as variables. Training dataset 124 may be transposed. Training dataset 124 may include unsupervised data. The plurality of variables may define multiple dimensions for each observation vector. An observation vector xi may include a value for each of the plurality of variables associated with the observation i. All or a subset of the columns may be used as variables used to define observation vector xi. Each variable of the plurality of variables describes a characteristic of a physical object. For example, if training dataset 124 includes data related to operation of a vehicle, the variables may include an oil pressure, a speed, a gear indicator, a gas tank level, a tire pressure for each tire, an engine temperature, a radiator level, etc. Training dataset 124 may include data captured as a function of time for one or more physical objects.
The data stored in training dataset 124 may be generated by and/or captured from a variety of sources including one or more sensors of the same or different type, one or more computing devices, etc. The data stored in training dataset 124 may be received directly or indirectly from the source and may or may not be pre-processed in some manner. For example, the data may be pre-processed using an event stream processor such as the SAS® Event Stream Processing, developed and provided by SAS Institute Inc. of Cary, N.C., USA. As used herein, the data may include any type of content represented in any computer-readable format such as binary, alphanumeric, numeric, string, markup language, etc. The data may be organized using delimited fields, such as comma or space separated fields, fixed width fields, using a SAS® dataset, etc. The SAS dataset may be a SAS® file stored in a SAS® library that a SAS® software tool creates and processes. The SAS dataset contains data values that are organized as a table of observations (rows) and variables (columns) that can be processed by one or more SAS software tools.
Training dataset 124 may be stored on computer-readable medium 108 or on one or more computer-readable media of a distributed computing system 128 and accessed by SVDD training device 100 using communication interface 106, input interface 102, and/or output interface 104. Data stored in training dataset 124 may be sensor measurements or signal values captured by a sensor, may be generated or captured in response to occurrence of an event or a transaction, generated by a device such as in response to an interaction by a user with the device, etc. The data stored in training dataset 124 may include any type of content represented in any computer-readable format such as binary, alphanumeric, numeric, string, markup language, etc. The content may include textual information, graphical information, image information, audio information, numeric information, etc. that further may be encoded using various encoding techniques as understood by a person of skill in the art. The data stored in training dataset 124 may be captured at different time points periodically, intermittently, when an event occurs, etc. One or more columns of training dataset 124 may include a time and/or date value.
Training dataset 124 may include data captured under normal operating conditions of the physical object. Training dataset 124 may include data captured at a high data rate such as 200 or more observations per second for one or more physical objects. For example, data stored in training dataset 124 may be generated as part of the Internet of Things (IoT), where things (e.g., machines, devices, phones, sensors) can be connected to networks and the data from these things collected and processed within the things and/or external to the things before being stored in training dataset 124. For example, the IoT can include sensors in many different devices and types of devices, and high value analytics can be applied to identify hidden relationships and drive increased efficiencies. This can apply to both big data analytics and real-time analytics. Some of these devices may be referred to as edge devices, and may involve edge computing circuitry. These devices may provide a variety of stored or generated data, such as network data or data specific to the network devices themselves. Some data may be processed with an event stream processing engine (ESPE), which may reside in the cloud or in an edge device before being stored in training dataset 124.
Training dataset 124 may be stored using various data structures as known to those skilled in the art including one or more files of a file system, a relational database, one or more tables of a system of tables, a structured query language database, etc. on SVDD training device 100 or on distributed computing system 128. SVDD training device 100 may coordinate access to training dataset 124 that is distributed across distributed computing system 128 that may include one or more computing devices. For example, training dataset 124 may be stored in a cube distributed across a grid of computers as understood by a person of skill in the art. As another example, training dataset 124 may be stored in a multi-node Hadoop® cluster. For instance, Apache™ Hadoop® is an open-source software framework for distributed computing supported by the Apache Software Foundation. As another example, training dataset 124 may be stored in a cloud of computers and accessed using cloud computing technologies, as understood by a person of skill in the art. The SAS® LASR™ Analytic Server may be used as an analytic platform to enable multiple users to concurrently access data stored in training dataset 124. The SAS® Viya™ open, cloud-ready, in-memory architecture also may be used as an analytic platform to enable multiple users to concurrently access data stored in training dataset 124. Some systems may use SAS In-Memory Statistics for Hadoop® to read big data once and analyze it several times by persisting it in-memory for the entire session. Some systems may be of other types and configurations.
An SVDD model is used in domains where a majority of data in training dataset 124 belongs to a single class. An SVDD model for normal data description builds a minimum radius hypersphere around the data. The objective function for the SVDD model for normal data description is
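$$\max\left(\sum_{i=1}^{n}\alpha_i\,(x_i\cdot x_i)-\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,(x_i\cdot x_j)\right), \qquad (1)$$
subject to:
$$\sum_{i=1}^{n}\alpha_i=1, \qquad (2)$$
$$0\le\alpha_i\le C,\ \forall i=1,\ldots,n, \qquad (3)$$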
where $x_i\in\mathbb{R}^m$, $i=1,\ldots,n$ represents the n observations in training dataset 124, $\alpha_i\in\mathbb{R}$ are Lagrange constants, $C=1/(nf)$ is a penalty constant that controls a trade-off between a volume and errors, and f is an expected outlier fraction. The expected outlier fraction is generally known to an analyst. Data preprocessing can ensure that training dataset 124 belongs to a single class. In this case, f can be set to a very low value such as 0.001. SV is the set of support vectors that includes the observation vectors in training dataset 124 that have $C\ge\alpha_i>0$ after solving equation (1) above. $SV_{<C}$ is a subset of the support vectors that includes the observation vectors in training dataset 124 that have $C>\alpha_i>0$ after solving equation (1) above. The $SV_{<C}$ are the support vectors located on a boundary of the minimum radius hypersphere defined around the data.
Depending upon a position of an observation vector, the following results are true:
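Center position: $\sum_{i=1}^{n}\alpha_i x_i=a$. (4)
Inside position: $\lVert x_i-a\rVert<R\rightarrow\alpha_i=0$. (5)
Boundary position: $\lVert x_i-a\rVert=R\rightarrow 0<\alpha_i<C$. (6)
Outside position: $\lVert x_i-a\rVert>R\rightarrow\alpha_i=C$. (7)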
where a is a center of the hypersphere and R is a radius of the hypersphere. The radius of the hypersphere is calculated using:
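$$R^2=x_k\cdot x_k-2\sum_{i=1}^{N_{SV}}\alpha_i\,(x_i\cdot x_k)+\sum_{i=1}^{N_{SV}}\sum_{j=1}^{N_{SV}}\alpha_i\alpha_j\,(x_i\cdot x_j), \qquad (8)$$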
where any $x_k\in SV_{<C}$, $x_i$ and $x_j$ are the support vectors, $\alpha_i$ and $\alpha_j$ are the Lagrange constants of the associated support vectors, and $N_{SV}$ is a number of the support vectors included in the set of support vectors. An observation vector z is indicated as an outlier when $\mathrm{dist}^2(z)>R^2$, where
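$$\mathrm{dist}^2(z)=(z\cdot z)-2\sum_{i=1}^{N_{SV}}\alpha_i\,(x_i\cdot z)+\sum_{i=1}^{N_{SV}}\sum_{j=1}^{N_{SV}}\alpha_i\alpha_j\,(x_i\cdot x_j). \qquad (9)$$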
When the outlier fraction f is very small, the penalty constant C is very large resulting in few if any observation vectors in training dataset 124 determined to be in the outside position according to equation (7).
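As a small worked illustration of the relationship between the expected outlier fraction f and the penalty constant C, the following sketch evaluates C = 1/(nf) for two values of f. The function name `penalty_constant` and the sample values of n and f are illustrative assumptions and are not part of the described embodiments.

```python
def penalty_constant(n: int, f: float) -> float:
    """Return C = 1 / (n * f) for n observations and expected outlier fraction f."""
    if n <= 0 or not (0.0 < f <= 1.0):
        raise ValueError("n must be positive and f must lie in (0, 1]")
    return 1.0 / (n * f)


# With n = 100 observations and f = 0.001, C is about 10.
c_small_f = penalty_constant(100, 0.001)

# The Lagrange constants sum to 1 (equation (2)), so no alpha_i can reach C
# when C > 1. In that case no observation is in the outside position.
bound_can_bind_small_f = c_small_f <= 1.0

# With a larger outlier fraction, C drops below 1 and the bound can bind.
c_large_f = penalty_constant(100, 0.05)
bound_can_bind_large_f = c_large_f <= 1.0
```

Under these assumed values, the upper bound in equation (3) is inactive for f = 0.001 and may become active for f = 0.05.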
Referring to
Boundary 200 includes a significant amount of space with a very sparse distribution of training observations. Scoring with the model based on the set of support vectors SV that define boundary 200 can increase the probability of false positives. Instead of a circular shape, a compact bounded outline around the data that better approximates a shape of data in training dataset 124 may be preferred. This is possible using a kernel function. The SVDD is made flexible by replacing the inner product $(x_i\cdot x_j)$ with a suitable kernel function $K(x_i,x_j)$. A Gaussian kernel function is used herein. The Gaussian kernel function may be defined as:
$$K(x_i,x_j)=\exp\left(\frac{-\lVert x_i-x_j\rVert^2}{2s^2}\right), \qquad (10)$$
where s is the Gaussian bandwidth parameter.
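The effect of the bandwidth parameter can be illustrated with a minimal sketch. It assumes the common Gaussian kernel form K(x, y) = exp(−‖x − y‖² / (2s²)); the function name `gaussian_kernel` and the sample points are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], s: float) -> float:
    """Assumed kernel form: exp(-||x - y||^2 / (2 * s^2))."""
    if s <= 0:
        raise ValueError("bandwidth s must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * s * s))


x1, x2 = (0.0, 0.0), (3.0, 4.0)             # Euclidean distance 5
k_same = gaussian_kernel(x1, x1, s=1.0)      # identical points give 1
k_narrow = gaussian_kernel(x1, x2, s=0.5)    # small s: kernel value near 0
k_wide = gaussian_kernel(x1, x2, s=50.0)     # large s: kernel value near 1
```

With a very small s, every pair of distinct points looks unrelated, which corresponds to the over-fitting regime. With a very large s, every pair looks nearly identical, which corresponds to the under-fitting regime.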
The objective function for the SVDD model with the Gaussian kernel function is
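$$\max\left(\sum_{i=1}^{n}\alpha_i K(x_i,x_i)-\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j K(x_i,x_j)\right), \qquad (11)$$
subject to:
$$\sum_{i=1}^{n}\alpha_i=1, \qquad (12)$$
$$0\le\alpha_i\le C,\ \forall i=1,\ldots,n, \qquad (13)$$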
where again SV is the set of support vectors that includes the observation vectors in training dataset 124 that have $C\ge\alpha_i>0$ after solving equation (11) above. $SV_{<C}$ is the subset of the support vectors that includes the observation vectors in training dataset 124 that have $C>\alpha_i>0$ after solving equation (11) above.
The results from equations (4) to (7) above remain valid. A threshold R is computed using:
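$$R^2=K(x_k,x_k)-2\sum_{i=1}^{N_{SV}}\alpha_i K(x_i,x_k)+\sum_{i=1}^{N_{SV}}\sum_{j=1}^{N_{SV}}\alpha_i\alpha_j K(x_i,x_j), \qquad (14)$$
where any $x_k\in SV_{<C}$, where $x_i$ and $x_j$ are the support vectors, $\alpha_i$ and $\alpha_j$ are the Lagrange constants of the associated support vectors, and $N_{SV}$ is a number of the support vectors included in the set of support vectors.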
An observation vector z is indicated as an outlier when dist2(z)>R2, where
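$$\mathrm{dist}^2(z)=K(z,z)-2\sum_{i=1}^{N_{SV}}\alpha_i K(x_i,z)+\sum_{i=1}^{N_{SV}}\sum_{j=1}^{N_{SV}}\alpha_i\alpha_j K(x_i,x_j). \qquad (15)$$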
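The scoring rule can be sketched as follows. The sketch assumes the Gaussian kernel form exp(−‖x − y‖² / (2s²)) and treats the threshold R² as the kernel distance evaluated at a boundary support vector x_k. The function names, the two support vectors, and the Lagrange constants are hand-picked illustrative values, not output of the described training process.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, s: float) -> float:
    """Assumed kernel form: exp(-||x - y||^2 / (2 * s^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * s * s))


def dist2(z: Vector, svs: Sequence[Vector], alphas: Sequence[float], s: float) -> float:
    """K(z,z) - 2 sum_i a_i K(x_i,z) + sum_i sum_j a_i a_j K(x_i,x_j)."""
    cross = sum(
        ai * aj * gaussian_kernel(xi, xj, s)
        for xi, ai in zip(svs, alphas)
        for xj, aj in zip(svs, alphas)
    )
    linear = sum(ai * gaussian_kernel(xi, z, s) for xi, ai in zip(svs, alphas))
    return gaussian_kernel(z, z, s) - 2.0 * linear + cross


def is_outlier(z, svs, alphas, s, x_k) -> bool:
    """Flag z when dist^2(z) exceeds R^2, where R^2 = dist^2(x_k)."""
    r2 = dist2(x_k, svs, alphas, s)
    return dist2(z, svs, alphas, s) > r2


# Hypothetical model: two boundary support vectors with equal weights.
svs = [(0.0, 0.0), (2.0, 0.0)]
alphas = [0.5, 0.5]
s = 1.0
r2 = dist2(svs[0], svs, alphas, s)

midpoint_flag = is_outlier((1.0, 0.0), svs, alphas, s, x_k=svs[0])
far_point_flag = is_outlier((10.0, 0.0), svs, alphas, s, x_k=svs[0])
```

In this toy model the midpoint between the two support vectors scores inside the threshold, while the distant point is flagged as an outlier.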
Referring to
Referring to
Referring to
In an operation 402, a second indicator may be received that indicates a plurality of variables of training dataset 124 to define xi. The second indicator may indicate that all or only a subset of the variables stored in training dataset 124 be used to define SVDD 126. For example, the second indicator indicates a list of variables to use by name, column number, etc. In an alternative embodiment, the second indicator may not be received. For example, all of the variables may be used automatically.
In an operation 404, a third indicator is received that indicates a data filter for a plurality of observations of training dataset 124. The third indicator may indicate one or more rules associated with selection of an observation from the plurality of observations of training dataset 124. In an alternative embodiment, the third indicator may not be received. For example, no filtering of the plurality of observations may be applied. As an example, data may be captured for a vibration level of a washing machine. A washing machine mode, such as “fill”, “wash”, “spin”, etc. may be captured. Because a “normal” vibration level may be different dependent on the washing machine mode, a subset of data may be selected for a specific washing machine mode setting based on a value in a column of training dataset 124 that defines the washing machine mode. For example, SVDD models may be defined for different modes of the machine such that the data filter identifies a column indicating the washing machine mode and which value(s) is(are) used to define the SVDD model.
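A data filter of this kind can be sketched as a simple row selection. The column names `mode` and `vibration`, the sample values, and the function `apply_data_filter` are hypothetical and serve only to illustrate the washing machine example.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def apply_data_filter(rows: Iterable[Row], column: str, allowed_values: Iterable[object]) -> List[Row]:
    """Keep only the rows whose value in `column` is one of `allowed_values`."""
    allowed = set(allowed_values)
    return [row for row in rows if row.get(column) in allowed]


# Hypothetical observations: one vibration reading per row.
rows = [
    {"mode": "fill", "vibration": 0.12},
    {"mode": "spin", "vibration": 0.91},
    {"mode": "wash", "vibration": 0.35},
    {"mode": "spin", "vibration": 0.88},
]

# Select the observations used to train a model for the "spin" mode only.
spin_rows = apply_data_filter(rows, "mode", ["spin"])
spin_vectors = [[row["vibration"]] for row in spin_rows]
```

A separate model could be trained per mode by repeating the selection with a different allowed value.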
In an operation 406, a fourth indicator of a tolerance value δ may be received. In an alternative embodiment, the fourth indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically. In another alternative embodiment, the value of the tolerance value δ may not be selectable. Instead, a fixed, predefined value may be used. For illustration, the tolerance value may satisfy $\sqrt{2}\times10^{-7}\le\delta\le\sqrt{2}\times10^{-5}$. For further illustration, a value of $\delta=\sqrt{2}\times10^{-6}$ has been shown to work well for most training datasets.
In an operation 408, a fifth indicator of a value of the expected outlier fraction f may be received. In an alternative embodiment, the fifth indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically. In another alternative embodiment, the value of the expected outlier fraction f may not be selectable. Instead, a fixed, predefined value may be used.
In an operation 410, a number of observation vectors N is selected after reading all of the observation vectors from training dataset 124 and after applying the data filter indicated in operation 404, if any, to define a selected set of observation vectors X, where $x_i\in X$, $i=1,\ldots,N$. The selected set of observation vectors X is processed to compute SVDD 126.
In an operation 412, a value of the penalty constant $C=1/(Nf)$ may be computed from N and f.
In an operation 414, a determination is made concerning whether or not any xi of the selected set of observation vectors X is a repeat of another observation vector xj. When at least one observation vector is repeated, processing continues in an operation 420. When the observation vectors are each unique, processing continues in an operation 416.
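The repeat check can be sketched by counting identical observation vectors. The sketch assumes that two vectors are repeats only when all of their values compare exactly equal; the function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def distinct_with_counts(vectors: Sequence[Sequence[float]]) -> Tuple[List[tuple], List[int]]:
    """Return the distinct vectors (first-seen order) and their repeat counts."""
    counts = Counter(tuple(v) for v in vectors)
    distinct = list(counts)                      # Counter keeps insertion order
    weights = [counts[v] for v in distinct]
    return distinct, weights


def has_repeats(vectors: Sequence[Sequence[float]]) -> bool:
    """True when at least one observation vector occurs more than once."""
    _, weights = distinct_with_counts(vectors)
    return any(w > 1 for w in weights)


X_unique = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
X_repeated = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]

distinct, weights = distinct_with_counts(X_repeated)
```

The counts returned here play the role of the repetition vector used when repeated observation vectors are present.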
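In operation 416, a central tendency value is computed for pairwise distances between observation vectors. In an illustrative embodiment, a mean pairwise distance $\bar{D}$ is computed using
$$\bar{D}^2=\frac{\sum_{i<k}\lVert x_i-x_k\rVert^2}{\binom{N}{2}}=\frac{2N}{N-1}\sum_{j=1}^{p}\sigma_j^2,$$
where $i=1,\ldots,N$ and $k=1,\ldots,N$, p is a number of variables that define each observation vector $x_i$, and $\sigma_j^2$ is a variance of the jth variable of the number of variables. For illustration, each $\sigma_j^2$ is computed using
$$\sigma_j^2=\frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}_j\right)^2,$$
where $x_{ij}$ is the value of the jth variable of observation vector $x_i$, $\bar{x}_1$ is a mean value computed for a first variable from each observation vector value for the first variable of the selected set of observation vectors X, . . . , and $\bar{x}_p$ is a mean value computed for a pth variable from each observation vector value for the pth variable of the selected set of observation vectors X. Because the column variances can be calculated in one pass through the selected set of observation vectors X, the computation of mean pairwise distance $\bar{D}$ does not require evaluating each of the N(N−1)/2 pairwise distances.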
In another illustrative embodiment, a median pairwise distance $D_{md}$ is computed using $D_{md}=\mathrm{median}_{i<j}\lVert x_i-x_j\rVert$, $i=1,\ldots,N$ and $j=1,\ldots,N$. The user may select either mean pairwise distance $\bar{D}$ or median pairwise distance $D_{md}$.
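The two central tendency values can be compared on a small example. The sketch below assumes population variances (divisor N), for which the mean of the squared pairwise distances equals (2N/(N−1)) times the sum of the column variances. The function names and the three sample vectors are illustrative.

```python
import math
from itertools import combinations
from statistics import median
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def mean_sq_pairwise_from_variances(X: Matrix) -> float:
    """(2N / (N - 1)) * sum_j sigma_j^2, using population column variances."""
    n, p = len(X), len(X[0])
    total_var = 0.0
    for j in range(p):
        col = [row[j] for row in X]
        mean_j = sum(col) / n
        total_var += sum((v - mean_j) ** 2 for v in col) / n
    return 2.0 * n / (n - 1) * total_var


def mean_sq_pairwise_brute_force(X: Matrix) -> float:
    """Average of ||x_i - x_k||^2 over all N(N-1)/2 pairs, for comparison."""
    sq = [sum((a - b) ** 2 for a, b in zip(xi, xk)) for xi, xk in combinations(X, 2)]
    return sum(sq) / len(sq)


def median_pairwise_distance(X: Matrix) -> float:
    """Median of the Euclidean distances over all pairs i < j."""
    return median(math.dist(xi, xj) for xi, xj in combinations(X, 2))


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]     # pairwise distances 5, 10, 5
d2_fast = mean_sq_pairwise_from_variances(X)
d2_slow = mean_sq_pairwise_brute_force(X)
d_md = median_pairwise_distance(X)
```

For these three vectors both routes give a mean squared pairwise distance of 50, and the median pairwise distance is 5. The variance route scales linearly with N, whereas the brute-force and median computations visit every pair.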
In an operation 418, the Gaussian bandwidth parameter s is computed from either mean pairwise distance $\bar{D}$ or median pairwise distance $D_{md}$ and a scaling factor F, where $F=1/\sqrt{\ln[(N-1)/\delta^2]}$. For example, $s=\sqrt{\bar{D}^2}\,F$ or $s=D_{md}F$, and processing continues in operation 426.
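The bandwidth computation can be sketched as follows, assuming that s is the product of the selected pairwise distance and the scaling factor F = 1/√(ln[(N − 1)/δ²]). The function names, the default tolerance constant, and the sample inputs are illustrative assumptions.

```python
import math

# Illustrative default tolerance, sqrt(2) * 10^-6.
DEFAULT_DELTA = math.sqrt(2.0) * 1e-6


def scaling_factor(n: int, delta: float = DEFAULT_DELTA) -> float:
    """F = 1 / sqrt(ln[(N - 1) / delta^2])."""
    if n < 2:
        raise ValueError("at least two observation vectors are required")
    arg = (n - 1) / (delta * delta)
    if arg <= 1.0:
        raise ValueError("(N - 1) / delta^2 must exceed 1")
    return 1.0 / math.sqrt(math.log(arg))


def gaussian_bandwidth(distance: float, n: int, delta: float = DEFAULT_DELTA) -> float:
    """s = distance * F, where distance is the mean or median pairwise distance."""
    return distance * scaling_factor(n, delta)


# Hypothetical inputs: N = 1000 observations with mean pairwise distance 5.
f_1000 = scaling_factor(1000)
s_1000 = gaussian_bandwidth(5.0, 1000)
```

Because F depends on N only through a logarithm, the bandwidth shrinks slowly as the number of observation vectors grows.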
In operation 420, repetition weight factors W, M, and Q are computed from a repetition vector $w_i$, where $x_i$ is repeated $w_i>0$ times and $i=1,\ldots,N$. $W=\sum_{i=1}^{N}w_i$, $M=\sum_{i=1}^{N}w_i^2$, and $Q=(W^2-M)/2$, where $\{x_1,\ldots,x_N\}$ are the distinct observation vectors included in the selected set of observation vectors X.
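In an operation 422, a variance value $\bar{\sigma}^2$ is computed from the selected set of observation vectors X, where $\bar{\sigma}^2=\sum_{j=1}^{p}\sigma_j^2$ and each $\sigma_j^2$ is computed as a repetition-weighted variance of the jth variable, for example using
$$\sigma_j^2=\frac{1}{W}\sum_{i=1}^{N}w_i\left(x_{ij}-\bar{x}_j\right)^2,\qquad \bar{x}_j=\frac{1}{W}\sum_{i=1}^{N}w_i\,x_{ij},$$
where p is the number of variables that define each observation vector $x_i$.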
In an operation 424, the Gaussian bandwidth parameter s is computed from the variance value $\bar{\sigma}^2$ and a weighted scaling factor $F_W$, where $F_W=W/\sqrt{Q\times\ln[2Q/(\delta^2M)]}$. For example, $s=\bar{\sigma}F_W$, where $\bar{\sigma}=\sqrt{\bar{\sigma}^2}$, and processing continues in operation 426.
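Operations 420 through 424 can be sketched together. The sketch assumes that each per-variable variance is the repetition-weighted population variance (weights w_i, divisor W); that choice is an assumption made for illustration. With all weights equal to 1 the result reduces to the bandwidth obtained from the mean pairwise distance. The function names and sample data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def repetition_factors(weights: Sequence[int]) -> Tuple[float, float, float]:
    """W = sum w_i, M = sum w_i^2, Q = (W^2 - M) / 2."""
    W = sum(weights)
    M = sum(w * w for w in weights)
    return W, M, (W * W - M) / 2.0


def weighted_total_variance(distinct, weights) -> float:
    """Sum over variables of the assumed weighted population variance."""
    W = sum(weights)
    total = 0.0
    for j in range(len(distinct[0])):
        mean_j = sum(w * x[j] for x, w in zip(distinct, weights)) / W
        total += sum(w * (x[j] - mean_j) ** 2 for x, w in zip(distinct, weights)) / W
    return total


def weighted_bandwidth(distinct, weights, delta: float) -> float:
    """s = sigma_bar * F_W with F_W = W / sqrt(Q * ln[2Q / (delta^2 * M)])."""
    W, M, Q = repetition_factors(weights)
    if Q <= 0:
        raise ValueError("at least two distinct observation vectors are required")
    f_w = W / math.sqrt(Q * math.log(2.0 * Q / (delta * delta * M)))
    return math.sqrt(weighted_total_variance(distinct, weights)) * f_w


delta = math.sqrt(2.0) * 1e-6
factors = repetition_factors([2, 1])          # one vector seen twice, one once

# Consistency check: unit weights versus the mean pairwise distance route.
X3 = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]     # mean squared pairwise distance 50
s_unit = weighted_bandwidth(X3, [1, 1, 1], delta)
s_plain = math.sqrt(50.0) / math.sqrt(math.log(2.0 / delta ** 2))
```

Here Q equals the number of pairs of observations that are not repeats of each other, which is why the weighted route agrees with the unweighted one when every weight is 1.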
In operation 426, an optimal value is computed for the objective function of the SVDD model using the Gaussian kernel function with the computed Gaussian bandwidth parameter s and the selected set of observation vectors X. For example, equations (11)-(13) above are used to solve for SV, a set of support vectors that have $0<\alpha_i\le C$. Values for the Lagrange constants $\alpha_i$ for each support vector of the set of support vectors, for $R^2$ using equation (14), and for the center position a using equation (4) are computed as part of the optimal solution. Only the $SV_{<C}$ are needed for the computations of $R^2$, and only the SV are needed for the computations of a.
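The optimization step can be illustrated with a small sketch. It does not reproduce the solver of the described embodiments. It swaps in a plain Frank-Wolfe iteration with exact line search and assumes a Gaussian kernel of the form exp(−‖x − y‖² / (2s²)), so that K(x_i, x_i) = 1. It also assumes C ≥ 1, so that the upper bound on each α_i never binds. Under those assumptions, maximizing the objective is equivalent to minimizing α′Kα over the simplex. All names and data are hypothetical.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], s: float) -> float:
    """Assumed kernel form: exp(-||x - y||^2 / (2 * s^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * s * s))


def kernel_matrix(X, s: float) -> List[List[float]]:
    return [[gaussian_kernel(xi, xj, s) for xj in X] for xi in X]


def dual_objective(alpha, K) -> float:
    """sum_i alpha_i K_ii - sum_i sum_j alpha_i alpha_j K_ij."""
    n = len(alpha)
    linear = sum(alpha[i] * K[i][i] for i in range(n))
    quad = sum(alpha[i] * alpha[j] * K[i][j] for i in range(n) for j in range(n))
    return linear - quad


def solve_dual_frank_wolfe(K, iterations: int = 500, tol: float = 1e-12) -> List[float]:
    """Approximately minimise alpha' K alpha on the simplex (assumes C >= 1)."""
    n = len(K)
    alpha = [1.0 / n] * n                        # feasible starting point
    for _ in range(iterations):
        k_alpha = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        quad = sum(alpha[i] * k_alpha[i] for i in range(n))
        i_star = min(range(n), key=lambda i: k_alpha[i])   # best simplex vertex
        a_k_d = k_alpha[i_star] - quad                      # alpha' K d
        d_k_d = K[i_star][i_star] - 2.0 * k_alpha[i_star] + quad  # d' K d
        if a_k_d >= -tol or d_k_d <= 0.0:
            break                                # no further descent possible
        gamma = min(1.0, -a_k_d / d_k_d)         # exact line search step
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[i_star] += gamma
    return alpha


# Hypothetical one-dimensional data: a small cluster and one distant point.
X = [(0.0,), (1.0,), (2.0,), (10.0,)]
K = kernel_matrix(X, s=1.0)
alpha = solve_dual_frank_wolfe(K)
uniform = [0.25] * 4
```

In this toy run the interior point of the cluster loses weight relative to the uniform start, while the isolated point keeps a larger Lagrange constant. Frank-Wolfe converges slowly near the optimum, so a production solver would normally use a dedicated quadratic programming method.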
In an operation 428, the set of support vectors SV, the Lagrange constants αi for each support vector of the set of support vectors SV, the center position a, and/or R2 computed from the set of support vectors may be stored in SVDD 126 in association with the computed Gaussian bandwidth parameter s.
Referring to
Second input interface 1802 provides the same or similar functionality as that described with reference to input interface 102 of SVDD training device 100 though referring to outlier identification device 1800. Second output interface 1804 provides the same or similar functionality as that described with reference to output interface 104 of SVDD training device 100 though referring to outlier identification device 1800. Second communication interface 1806 provides the same or similar functionality as that described with reference to communication interface 106 of SVDD training device 100 though referring to outlier identification device 1800. Data and messages may be transferred between outlier identification device 1800 and distributed computing system 128 using second communication interface 1806. Second computer-readable medium 1808 provides the same or similar functionality as that described with reference to computer-readable medium 108 of SVDD training device 100 though referring to outlier identification device 1800. Second processor 1810 provides the same or similar functionality as that described with reference to processor 110 of SVDD training device 100 though referring to outlier identification device 1800.
Outlier identification application 1822 performs operations associated with creating outlier dataset 1826 from data stored in dataset 1824 using SVDD 126. SVDD 126 may be used to classify data stored in dataset 1824 and to identify outliers in dataset 1824 that are then stored in outlier dataset 1826 to support various data analysis functions as well as provide alert/messaging related to the identified outliers stored in outlier dataset 1826. Dependent on the type of data stored in training dataset 124 and dataset 1824, outlier dataset 1826 may identify anomalies as part of process control, for example, of a manufacturing process, for machine condition monitoring, for example, of an electro-cardiogram device, for image classification, for intrusion detection, for fraud detection, etc. Some or all of the operations described herein may be embodied in outlier identification application 1822. The operations may be implemented using hardware, firmware, software, or any combination of these methods.
Referring to the example embodiment of
Outlier identification application 1822 may be implemented as a Web application. Outlier identification application 1822 may be integrated with other system processing tools to automatically process data generated as part of operation of an enterprise, to identify any outliers in the processed data, and to provide a warning or alert associated with identification of an outlier using second input interface 1802, second output interface 1804, and/or second communication interface 1806 so that appropriate action can be initiated in response to the outlier identification. Outlier identification application 1822 and training application 122 further may be integrated applications.
Training dataset 124 and dataset 1824 may be generated, stored, and accessed using the same or different mechanisms. Similar to training dataset 124, dataset 1824 may include a plurality of rows and a plurality of columns with the plurality of rows referred to as observations or records, and the columns referred to as variables that are associated with an observation. Dataset 1824 may be transposed.
Similar to training dataset 124, dataset 1824 may be stored on second computer-readable medium 1808 or on one or more computer-readable media of distributed computing system 128 and accessed by outlier identification device 1800 using second communication interface 1806. Data stored in dataset 1824 may be a sensor measurement or a data communication value, may be generated or captured in response to occurrence of an event or a transaction, generated by a device such as in response to an interaction by a user with the device, etc. The data stored in dataset 1824 may include any type of content represented in any computer-readable format such as binary, alphanumeric, numeric, string, markup language, etc. The content may include textual information, graphical information, image information, audio information, numeric information, etc. that further may be encoded using various encoding techniques as understood by a person of skill in the art. The data stored in dataset 1824 may be captured at different time points periodically or intermittently, when an event occurs, etc. One or more columns may include a time value. Similar to training dataset 124, data stored in dataset 1824 may be generated as part of the IoT, and some or all data may be processed with an ESPE.
Similar to training dataset 124, dataset 1824 may be stored in various compressed formats such as a coordinate format, a compressed sparse column format, a compressed sparse row format, etc. Dataset 1824 further may be stored using various structures as known to those skilled in the art including a file system, a relational database, a system of tables, a structured query language database, etc. on SVDD training device 100, on outlier identification device 1800, and/or on distributed computing system 128. Outlier identification device 1800 and/or distributed computing system 128 may coordinate access to dataset 1824 that is distributed across a plurality of computing devices. For example, dataset 1824 may be stored in a cube distributed across a grid of computers as understood by a person of skill in the art. As another example, dataset 1824 may be stored in a multi-node Hadoop® cluster. For instance, Apache™ Hadoop® is an open-source software framework for distributed computing supported by the Apache Software Foundation. As another example, dataset 1824 may be stored in a cloud of computers and accessed using cloud computing technologies, as understood by a person of skill in the art. The SAS® LASR™ Analytic Server developed and provided by SAS Institute Inc. of Cary, N.C. may be used as an analytic platform to enable multiple users to concurrently access data stored in dataset 1824.
Referring to
In an operation 1900, a sixth indicator is received that indicates dataset 1824. For example, the sixth indicator indicates a location and a name of dataset 1824. As an example, the sixth indicator may be received by outlier identification application 1822 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, dataset 1824 may not be selectable. For example, a most recently created dataset may be used automatically or observation vectors may be streamed to outlier identification application 1822 from an event publishing application executing at a computing device of distributed computing system 128.
In an operation 1902, a seventh indicator may be received that indicates a plurality of variables of dataset 1824 to define observation vector z. The same set of the plurality of variables selected in operation 402 to define SVDD 126 should be selected. The seventh indicator may indicate that all or only a subset of the variables stored in dataset 1824 be used to determine whether the observation vector z is an outlier. For example, the seventh indicator indicates a list of variables to use by name, column number, etc. In an alternative embodiment, the seventh indicator may not be received. For example, all of the variables may be used automatically.
In an operation 1904, an eighth indicator is received that indicates SVDD 126. For example, the eighth indicator indicates a location and a name of SVDD 126. As an example, the eighth indicator may be received by outlier identification application 1822 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, SVDD 126 may not be selectable. For example, a default name and location for SVDD 126 may be used automatically.
In an operation 1906, the set of support vectors SV, the Lagrange constants $\alpha_i$ for each support vector of the set of support vectors SV, the center position a, $R^2$, and the Gaussian bandwidth parameter s are defined. For example, the set of support vectors SV, the Lagrange constants $\alpha_i$ for each support vector of the set of support vectors SV, the center position a, $R^2$, and the Gaussian bandwidth parameter s are read from SVDD 126, though the center position a and $R^2$ may be computed from the set of support vectors SV and the Lagrange constants $\alpha_i$ instead.
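For illustration, the following is a minimal sketch of how the threshold $R^2$ might be recomputed from the set of support vectors SV and the Lagrange constants rather than read from SVDD 126. It assumes a Gaussian kernel of the form exp(−‖x−y‖²/(2s²)) and assumes that the support vector selected by the hypothetical index `k` lies on the boundary of the data description; neither assumption is stated in this section, and the function name is illustrative only.

```python
import numpy as np


def threshold_from_support_vectors(sv, alpha, s, k=0):
    """Return (G, R2) computed from support vectors and Lagrange constants.

    Assumptions (not taken from the source text): the kernel is
    K(x, y) = exp(-||x - y||^2 / (2 s^2)), and support vector k lies on
    the boundary of the description.
    """
    sv = np.asarray(sv, dtype=float)
    alpha = np.asarray(alpha, dtype=float)

    # Squared Euclidean distance between every pair of support vectors.
    sq = np.sum((sv[:, None, :] - sv[None, :, :]) ** 2, axis=2)
    kernel = np.exp(-sq / (2.0 * s * s))

    # G is the double sum over alpha_i * alpha_j * K(x_i, x_j).
    G = float(alpha @ kernel @ alpha)

    # R^2 is the squared kernel distance from support vector k to the center.
    R2 = float(kernel[k, k] - 2.0 * (alpha @ kernel[:, k]) + G)
    return G, R2
```

Because G does not depend on the observation being scored, it can be computed once and reused for every observation vector.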
In an operation 1908, a ninth indicator is received that indicates outlier dataset 1826. For example, the ninth indicator indicates a location and a name of outlier dataset 1826. As an example, the ninth indicator may be received by outlier identification application 1822 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, outlier dataset 1826 may not be selectable. For example, a default name and location for outlier dataset 1826 may be used automatically.
In an operation 1910, a first observation is read from dataset 1824 and selected as observation vector z. In another embodiment, the first observation may be received from another computing device in an event stream and selected as observation vector z. In still another embodiment, the first observation may be received from a sensor 1812 through second input interface 1802 or second communication interface 1806 and selected as observation vector z. The observation vector may include values received from a plurality of sensors of the same or different types connected to a device or mounted in a location or an area. For example, sensor 1812 may produce a sensor signal value referred to as a measurement data value representative of a measure of a physical quantity in an environment to which sensor 1812 is associated and generate a corresponding measurement datum that may be associated with a time that the measurement datum is generated. The environment to which sensor 1812 is associated for monitoring may include a power grid system, a telecommunications system, a fluid (oil, gas, water, etc.) pipeline, a transportation system, an industrial device, a medical device, an appliance, a vehicle, a computing device, etc. Example sensor types of sensor 1812 include a pressure sensor, a temperature sensor, a position or location sensor, a velocity sensor, an acceleration sensor, a fluid flow rate sensor, a voltage sensor, a current sensor, a frequency sensor, a phase angle sensor, a data rate sensor, a humidity sensor, an acoustic sensor, a light sensor, a motion sensor, an electromagnetic field sensor, a force sensor, a torque sensor, a load sensor, a strain sensor, a chemical property sensor, a resistance sensor, a radiation sensor, an irradiance sensor, a proximity sensor, a distance sensor, a vibration sensor, etc. that may be mounted to various components used as part of the system.
In an operation 1912, a distance value for observation vector z is computed using

$$\mathrm{dist}^2(z) = K(z,z) - 2\sum_{i=1}^{N_{SV}} \alpha_i K(x_i, z) + G,$$

where $x_i$ is any support vector of the defined set of support vectors SV, $N_{SV}$ is the number of support vectors included in the defined set of support vectors SV, $\alpha_i$ is the Lagrange constant associated with support vector $x_i$, and $G = \sum_{i=1}^{N_{SV}} \sum_{j=1}^{N_{SV}} \alpha_i \alpha_j K(x_i, x_j)$.
In an operation 1914, a determination is made concerning whether or not $\mathrm{dist}^2(z) > R^2$. When $\mathrm{dist}^2(z) > R^2$, processing continues in an operation 1916. When $\mathrm{dist}^2(z) \le R^2$, processing continues in an operation 1918.
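For illustration, operations 1912 and 1914 can be sketched as follows for a single observation vector. The sketch assumes a Gaussian kernel of the form exp(−‖x−y‖²/(2s²)), for which K(z,z)=1; the kernel form and the function names `dist2` and `is_outlier` are assumptions made for this example and are not taken from the description above.

```python
import numpy as np


def dist2(z, sv, alpha, s):
    """Squared kernel distance from z to the center of the description.

    Assumes K(x, y) = exp(-||x - y||^2 / (2 s^2)), so that K(z, z) = 1.
    """
    z = np.asarray(z, dtype=float)
    sv = np.asarray(sv, dtype=float)
    alpha = np.asarray(alpha, dtype=float)

    # Kernel values between z and each support vector.
    k_z = np.exp(-np.sum((sv - z) ** 2, axis=1) / (2.0 * s * s))

    # Double sum over pairs of support vectors (independent of z).
    sq = np.sum((sv[:, None, :] - sv[None, :, :]) ** 2, axis=2)
    G = alpha @ np.exp(-sq / (2.0 * s * s)) @ alpha

    return float(1.0 - 2.0 * (alpha @ k_z) + G)


def is_outlier(z, sv, alpha, s, R2):
    """Operation 1914: z is an outlier when dist2(z) exceeds R2."""
    return dist2(z, sv, alpha, s) > R2
```

In a production setting, the double sum would typically be computed once and cached rather than recomputed for every observation, as it is here for brevity.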
In operation 1916, observation vector z and/or an indicator of observation vector z is stored to outlier dataset 1826, and processing continues in operation 1918.
In operation 1918, a determination is made concerning whether or not dataset 1824 includes another observation or another observation vector has been received. When there is another observation, processing continues in an operation 1920. When there is not another observation, processing continues in an operation 1922.
In operation 1920, a next observation is selected as observation vector z from dataset 1824 or is received, and processing continues in operation 1912 to determine if the next observation is an outlier.
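For illustration, the control flow of operations 1910 through 1920 can be sketched as a loop over observations that may come from a stored dataset or from a stream. The function name `score_stream` and the callable `dist2_fn` are hypothetical; `dist2_fn` stands in for whatever computes the distance value of operation 1912, and the in-memory list stands in for outlier dataset 1826.

```python
from typing import Any, Callable, Iterable, List, Tuple


def score_stream(
    observations: Iterable[Any],
    dist2_fn: Callable[[Any], float],
    R2: float,
) -> Tuple[List[Tuple[int, Any]], int]:
    """Score each observation and collect those flagged as outliers.

    Returns the list of (index, observation) pairs flagged as outliers
    and the total number of observations scored.
    """
    outliers: List[Tuple[int, Any]] = []
    n_scored = 0
    # Iterating covers both "another observation in the dataset" and
    # "another observation vector has been received" (operations 1918, 1920).
    for index, z in enumerate(observations):
        n_scored += 1
        # Operation 1914: compare the distance value with the threshold.
        if dist2_fn(z) > R2:
            # Operation 1916: store the observation and an indicator of it.
            outliers.append((index, z))
    return outliers, n_scored
```

Because the loop consumes any iterable, the same sketch applies to a generator that yields observation vectors as they arrive.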
In operation 1922, scoring results are output. For example, statistical results associated with the scoring may be stored on one or more devices and/or on second computer-readable medium 1808 in a variety of formats as understood by a person of skill in the art. Outlier dataset 1826 and/or the scoring results further may be output to a second display 1816, to a second printer 1820, etc. In an illustrative embodiment, an alert message may be sent to another device using second communication interface 1806, printed on second printer 1820 or another printer, presented visually on second display 1816 or another display, presented audibly using a second speaker 1818 or another speaker when an outlier is identified.
Because computation of an SVDD model is an unsupervised learning technique, it is desirable to have an unsupervised bandwidth parameter selection technique, such as that provided by training application 122, which does not depend on labeled data that separates the inliers from the outliers. Training application 122 includes two such techniques. The first technique uses mean pairwise distance D and is referred to herein as a mean criterion. The first technique can be applied with non-repeating observation vectors using operation 418 or repeating observation vectors using operation 424. The second technique uses median pairwise distance $D_{md}$ and is referred to herein as a median criterion. U.S. Patent Publication No. 2017/0236074, titled KERNEL PARAMETER SELECTION IN SUPPORT VECTOR DATA DESCRIPTION FOR OUTLIER IDENTIFICATION, and assigned to SAS Institute Inc., the assignee of the present application, describes an unsupervised bandwidth selection technique referred to herein as a peak criterion. A paper by Charu C. Aggarwal, titled Outlier Analysis, and published by Springer Publishing Company, Incorporated in 2013 describes using $F = 1/\sqrt{2}$, resulting in $s = D_{md}/\sqrt{2} = D_{md}F$. Use of $s = D_{md}/\sqrt{2}$ is referred to herein as a median2 criterion.
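For illustration, the pairwise-distance statistics that underlie these criteria can be sketched as follows. The sketch computes the mean and median pairwise Euclidean distances by brute force and applies the median2 scaling factor F = 1/√2. The scaling factors used by the mean criterion and the median criterion are not restated in this section, so the sketch accepts F as an input rather than assuming a formula; the function names are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance for every unordered pair of distinct rows of X."""
    X = np.asarray(X, dtype=float)
    i, j = np.triu_indices(len(X), k=1)  # index pairs with i < j
    return np.linalg.norm(X[i] - X[j], axis=1)


def bandwidth_from_distance(D, F):
    """Bandwidth s as a pairwise-distance statistic D times a factor F."""
    return D * F


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d = pairwise_distances(X)
D_mean = float(d.mean())     # statistic used by the mean criterion
D_md = float(np.median(d))   # statistic used by the median criteria
s_median2 = bandwidth_from_distance(D_md, 1.0 / np.sqrt(2.0))
```

The brute-force enumeration grows quadratically with the number of observations, so it is suitable only for small samples.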
The performance of the SVDD using the Gaussian bandwidth parameter s computed using the mean criterion, the median criterion, the peak criterion, and the median2 criterion was compared with five different sample datasets. Training application 122 was executed with each sample dataset to compute the Gaussian bandwidth parameter s and the associated set of support vectors SV using the mean criterion and the median criterion. The peak criterion and the median2 criterion were also implemented and the Gaussian bandwidth parameter s and the associated set of support vectors SV were also computed using each of those techniques. The Gaussian bandwidth parameter s and the associated set of support vectors SV computed using each of the four techniques was input to outlier identification application 1822. Dataset 1824 was created for each of the sample datasets using a bounding rectangle defined for the dataset. The observation vectors resulting from the bounding rectangle were two-dimensional and created by dividing each dataset into a 200×200 grid. The observation vectors not identified as outliers were graphed with the results presented below for each technique.
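For illustration, a scoring grid of this kind might be generated as sketched below. The sketch assumes that the bounding rectangle is the axis-aligned rectangle spanned by the minimum and maximum of each of the two variables and that the grid consists of 200 evenly spaced points per axis; the exact construction used for the reported results is not specified here, and the function name is hypothetical.

```python
import numpy as np


def bounding_grid(X, n=200):
    """Return an (n*n, 2) array of grid points covering the bounding rectangle.

    Assumes two-dimensional data and an axis-aligned rectangle defined by
    the per-variable minimum and maximum.
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    xs = np.linspace(lo[0], hi[0], n)
    ys = np.linspace(lo[1], hi[1], n)
    gx, gy = np.meshgrid(xs, ys)
    # Each row is one two-dimensional observation vector to be scored.
    return np.column_stack([gx.ravel(), gy.ravel()])
```

Each row of the returned array can then be scored as an observation vector z, and the rows not identified as outliers can be plotted to visualize the inlier region.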
Referring to
Referring to
The scoring results indicate that the Gaussian bandwidth parameter s computed using the mean and median criteria provides a good quality data description. The descriptions are close to the one obtained using the peak criterion. The median2 criterion did not provide a quality data description.
Referring to
Referring to
The scoring results again indicate that the Gaussian bandwidth parameter s computed using the mean and median criteria provides a good quality data description. The descriptions are close to the one obtained using the peak criterion. The median2 criterion did not provide a quality data description.
Referring to
Referring to
The scoring results again indicate that the Gaussian bandwidth parameter s computed using the mean and median criteria provides a good quality data description. The descriptions are close to the one obtained using the peak criterion. The median2 criterion did not provide a quality data description. In fact, using the median2 criterion the three clusters became a single cluster.
Referring to
Referring to
For fourth sample dataset 1200, the peak criterion significantly outperformed the other three techniques because the set of support vectors SV computed using the peak criterion could separate out all four of the clusters while the mean criterion and the median criterion merged the two clusters that lie close to each other in the bottom left of the graphs. Though the mean and median criteria did not perform as well as the peak criterion, any point in the inlier region was close to fourth sample dataset 1200, while the area of the region that was misclassified was small compared to the bounding region of fourth sample dataset 1200. Therefore, the result was still very reasonable. The median2 criterion again performed very poorly.
Referring to
Referring to
For fifth sample dataset 1500, the peak criterion and the mean criterion significantly outperformed the other two techniques. The median criterion did not perform as well as the peak criterion or the mean criterion, but the result was still very reasonable. The median2 criterion again performed very poorly.
The SVDD approach requires solving a quadratic programming problem. The time needed to solve the quadratic programming problem is directly related to the size of training dataset 124. The illustrative results show that training application 122 provides a nearly identical data description using either the mean criterion or the median criterion as compared to the peak criterion.
Computation of the Gaussian bandwidth parameter s using the mean criterion is extremely fast even when training dataset 124 is very large because it can be computed in a single iteration. Computation of the Gaussian bandwidth parameter s using the peak criterion requires computation of the SVDD solution multiple times using training dataset 124 for a list of bandwidth values that lie on a grid. Additionally, a good starting value for the Gaussian bandwidth parameter s is needed to initiate the grid search, and it is not immediately obvious what a good starting value is. For illustration, Table I below summarizes the computation time in seconds to calculate s using the peak criterion ($s_{peak}$) and s using the mean criterion ($s_{mean}$) with the datasets above.
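For illustration, one way a pairwise-distance summary can be obtained in a single pass over the data, without enumerating pairs, relies on the identity that the mean of the squared pairwise Euclidean distances over all distinct pairs equals twice the sum of the per-variable sample variances. Whether training application 122 uses exactly this identity, and whether D denotes a squared or unsquared distance, is an assumption of this sketch and is not stated in this section. The function names are hypothetical.

```python
import numpy as np


def mean_sq_pairwise_distance_one_pass(rows):
    """Mean squared pairwise distance from one streaming pass over rows.

    Uses Welford's running update for the per-variable sum of squared
    deviations, then applies: mean squared distance = 2 * sum of sample
    variances (variances computed with N - 1 in the denominator).
    """
    n = 0
    mean = None
    m2 = None
    for row in rows:
        x = np.asarray(row, dtype=float)
        if mean is None:
            mean = np.zeros_like(x)
            m2 = np.zeros_like(x)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("at least two observations are required")
    return 2.0 * float(np.sum(m2)) / (n - 1)


def mean_sq_pairwise_distance_brute(X):
    """Reference value obtained by enumerating every distinct pair."""
    X = np.asarray(X, dtype=float)
    i, j = np.triu_indices(len(X), k=1)
    return float(np.mean(np.sum((X[i] - X[j]) ** 2, axis=1)))
```

The single-pass version needs memory proportional only to the number of variables, whereas the brute-force version grows quadratically with the number of observations.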
N is the number of observations in the dataset and M is the number of variables.
Table II below summarizes the computation time in seconds to calculate s using the peak criterion ($s_{peak}$) and s using the mean criterion ($s_{mean}$) with additional datasets.
$F_{peak}$ is an F-score computed for s using the peak criterion ($s_{peak}$) and $F_{mean}$ is an F-score computed for s using the mean criterion ($s_{mean}$), where the F-score can be defined as

$$F = \frac{2\,tp}{2\,tp + fp + fn},$$

where tp is a number of true positives, fp is a number of false positives, and fn is a number of false negatives. The results show the extreme improvement in computation time with a nearly identical data description that results from use of $s_{mean}$. Therefore, use of $s_{mean}$ provides a significant improvement over prior methods.
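For illustration, an F-score can be computed from the three counts as sketched below. The sketch assumes the standard balanced F-score (the harmonic mean of precision and recall), which is one common definition in terms of tp, fp, and fn; the function name and the choice to return 0.0 when all counts are zero are assumptions made for this example.

```python
def f_score(tp, fp, fn):
    """Balanced F-score: harmonic mean of precision and recall.

    precision = tp / (tp + fp), recall = tp / (tp + fn), which simplifies
    to 2*tp / (2*tp + fp + fn).
    """
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # No positives predicted or present; return 0.0 by convention here.
        return 0.0
    return 2.0 * tp / denominator
```

For example, 8 true positives with 2 false positives and 2 false negatives give a precision and a recall of 0.8 each, and therefore an F-score of 0.8.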
Training application 122 can be implemented as a wrapper around a core module for SVDD training computations either in a single machine or in a multi-machine distributed environment. There are applications for training application 122 and outlier identification application 1822 in areas such as process control and equipment health monitoring where the size of training dataset 124 can be very large, consisting of a few million observations. Training dataset 124 may include sensor readings measuring multiple key health or process parameters at a very high frequency. For example, a typical airplane currently has ˜7,000 sensors measuring critical health parameters and creates 2.5 terabytes of data per day. By 2020, this number is expected to triple or quadruple to over 7.5 terabytes. In such applications, multiple SVDD training models may be developed with each representing a different operating mode of the equipment or different process settings. Successful application of an SVDD in these types of applications requires algorithms that can train using huge amounts of training data in an efficient manner, which is provided by training application 122, in particular when using the mean criterion.
The word “illustrative” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “illustrative” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Further, for the purposes of this disclosure and unless otherwise specified, “a” or “an” means “one or more”. Still further, using “and” or “or” in the detailed description is intended to include “and/or” unless specifically indicated otherwise.
The foregoing description of illustrative embodiments of the disclosed subject matter has been presented for purposes of illustration and of description. It is not intended to be exhaustive or to limit the disclosed subject matter to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosed subject matter. The embodiments were chosen and described in order to explain the principles of the disclosed subject matter and as practical applications of the disclosed subject matter to enable one skilled in the art to utilize the disclosed subject matter in various embodiments and with various modifications as suited to the particular use contemplated.
The present application claims the benefit of 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 62/542,006 filed on Aug. 7, 2017, the entire contents of which are hereby incorporated by reference. The present application also claims the benefit of 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 62/544,879 filed on Aug. 13, 2017, the entire contents of which are hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
62542006 | Aug 2017 | US | |
62544879 | Aug 2017 | US |