It is to be understood that both the following general description and the following detailed description are illustrative and explanatory only and are not restrictive.
In one embodiment, the disclosure provides a computing system. The computing system includes at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to access a dataset comprising multiple records; and access at least one configuration attribute. A first configuration attribute of the at least one configuration attribute is indicative of a detection interval. The processor-executable instructions, in response to execution by the at least one processor, also cause the computing system to generate, using a first subset of the multiple records, a detection model to determine presence or absence of an anomalous record within the multiple records; select a second subset of the multiple records, the second subset comprising second records within the detection interval; and generate classification attributes for respective ones of the second records by applying the detection model to the second subset. A first classification attribute of the classification attributes designates a first one of the second records as one of normal or anomalous.
In another embodiment, the disclosure provides a computer-implemented method. The computer-implemented method includes accessing, by a computing system comprising at least one processor, a dataset comprising multiple records; and accessing, by the computing system, at least one configuration attribute. A first configuration attribute of the at least one configuration attribute is indicative of a detection interval. The computer-implemented method also includes generating, by the computing system, using a first subset of the multiple records, a detection model to determine presence or absence of an anomalous record within the multiple records; selecting, by the computing system, a second subset of the multiple records, the second subset comprising second records within the detection interval; and generating, by the computing system, classification attributes for respective ones of the second records by applying the detection model to the second subset. A first classification attribute of the classification attributes designates a first one of the second records as one of normal or anomalous.
In yet another embodiment, the disclosure provides a computer-program product. The computer-program product includes at least one computer-readable non-transitory storage medium having processor-executable instructions stored thereon that, in response to execution, cause a computing system to: access a dataset comprising multiple records; and access at least one configuration attribute. A first configuration attribute of the at least one configuration attribute is indicative of a detection interval. The processor-executable instructions, in response to execution, also cause the computing system to generate, using a first subset of the multiple records, a detection model to determine presence or absence of an anomalous record within the multiple records; select a second subset of the multiple records, the second subset comprising second records within the detection interval; and generate classification attributes for respective ones of the second records by applying the detection model to the second subset. A first classification attribute of the classification attributes designates a first one of the second records as one of normal or anomalous.
Additional elements or advantages of this disclosure will be set forth in part in the description which follows, and in part will be apparent from the description, or may be learned by practice of the subject disclosure. The advantages of the subject disclosure can be attained by means of the elements and combinations particularly pointed out in the appended claims.
This summary is not intended to identify critical or essential features of the disclosure, but merely to summarize certain features and variations thereof. Other details and features will be described in the sections that follow. Further, both the foregoing general description and the following detailed description are illustrative and explanatory only and are not restrictive of the embodiments of this disclosure.
The annexed drawings are an integral part of the disclosure and are incorporated into the subject specification. The drawings illustrate example embodiments of the disclosure and, in conjunction with the description and claims, serve to explain at least in part various principles, elements, or aspects of the disclosure. Embodiments of the disclosure are described more fully below with reference to the annexed drawings. However, various elements of the disclosure can be implemented in many different forms and should not be construed as limited to the implementations set forth herein. Like numbers refer to like elements throughout.
The disclosure recognizes and addresses, among other technical challenges, the issue of anomaly detection in datasets. To that end, embodiments of this disclosure, individually or in combination, provide flexible, interactive configuration of a desired anomaly analysis, and also can provide execution of the configured anomaly analysis. Embodiments that execute such an analysis can determine presence or absence of one or several records that deviate from a pattern obeyed by other records within a dataset. A record that deviates from such a pattern can be referred to as an anomalous record. The anomaly analysis described herein can be performed for various types of data. Those types of data can include, for example, business analytics data, such as pricing, sales, contract, or inventory data. Configuration and execution of the anomaly analysis can be separated into respective environments. Interactive configuration of the desired anomaly analysis can be afforded by a sequence of one or multiple user interfaces presented at a client device. Such an interactive configuration can leverage attributes of a dataset (such as the structure of a table) that is selected for anomaly analysis. In some cases, configuration of the anomaly analysis can be accomplished by means of application programming interfaces (APIs). In addition, or in other cases, execution of a configured anomaly analysis also can be accomplished via one or multiple APIs.
In sharp contrast to existing technologies, by separating configuration of the anomaly analysis from execution of that anomaly analysis, embodiments of the disclosure avoid building (e.g., linking and compiling) case-specific anomaly-detection computational tools. Instead, this disclosure provides a computing system that can be built one time and can then perform a wide variety of anomaly analyses by leveraging configurable attributes that define a desired anomaly analysis. Because the complexities of implementing and performing the desired anomaly analysis can be shifted away from a client domain into a server domain, embodiments of the disclosure can be readily accessible to client devices operated by analysts of disparate computational proficiency (ranging from users to developers, for example). In addition, the flexibility and the access to advanced analytical tools that are afforded by embodiments of this disclosure can improve quality and speed of decision-making by a business unit or other types of organizations.
With reference to the drawings,
Execution of the client application 116 can cause the client device 110 to present a sequence of user interfaces 120 to configure the analysis of a dataset and to review results of the analysis. A display device (not depicted in
To present the first UI, in response to executing the client application 116, the client device 110 can receive first UI data 142 from the anomaly detection subsystem 150. The first UI data 142 can include formatting data defining formatting attributes of UI elements to be presented within the first UI. The formatting data also can define a layout of those UI elements. In this disclosure, a formatting attribute can be embodied in, or can include, a code that defines a characteristic of a UI element presented on a user interface. The code can define, for example, a font type; a font size; a color; a length of a line; a thickness of a line; a size of a viewport or bounding box; presence or absence of an overlay; a type and size of the overlay; or similar characteristics. The code can be a numerical value or an alphanumerical value, in some cases.
As is illustrated in
In some embodiments, the first UI can include a selectable visual element that, in response to being selected, can cause the client device 110 to present a second UI as part of the sequence of user interfaces 120. To that end, the client device 110 can execute, or can continue executing, the client application 116 to receive second UI data 142 from the anomaly detection subsystem 150. The second UI data 142 also can be retained in the UI repository 154, within the UI data 156. The second UI data 142 can include formatting data defining formatting attributes of UI elements to be presented within the second UI. The formatting data also can define a layout of those UI elements.
The second UI can include, in some embodiments, multiple selectable visual elements that can permit supplying a dataset for analysis to the anomaly detection subsystem 150. The dataset comprises multiple records. A first selectable visual element of the multiple selectable visual elements, in response to being selected, can permit the client device 110 to obtain a document from the memory 114. The document contains the dataset, and in some cases, the document can be a comma-separated file. The client device 110 can send the document to the anomaly detection subsystem 150. In some cases, the document can be sent in response to selection of a second selectable visual element of the multiple selectable visual elements. The UI 300 shown in
In response to being selected, a second selectable visual element of the multiple selectable visual elements within the second UI (e.g., UI 300 shown in
In addition, or in some embodiments, one or more selectable visual elements of the multiple selectable visual elements included in the third UI can permit defining a data domain where the query 144 is to be resolved. For instance, a first one of the one or more selectable visual elements can permit identifying a particular server device that administers contents of one or multiple databases. In addition, a second one of the one or more selectable visual elements can permit identifying a particular database of the database(s). The client device 110 can send first data and second data identifying the particular server device and the particular database, respectively, to the anomaly detection subsystem 150. In some cases, the first data and/or the second data can be incorporated into the query 144 as metadata. In other cases, the first data and/or the second data can be sent in one or more transmissions separate from the query 144. For instance, the first data and/or the second data can be sent as part of configuration attributes 146.
As an illustration, the UI 350 shown in
With further reference to
As mentioned, in some cases, in addition to receiving the query 144, the anomaly detection subsystem 150 can receive data identifying a particular server device of the server devices 172. That particular server device can be functionally coupled to one or more of the data repositories 174. By sending the query 144 to that particular server device, the anomaly detection subsystem 150 can confine the resolution of the query 144 to a desired domain of records pertaining to a particular database. Consequently, not only can computing resources be used more efficiently in the resolution of the query 144, but records included in the dataset 164 can pertain to one or several particular databases of a desired type. For example, a particular database can include information related to mail-order pharmacies in specific geographic locations and quantities of medications fulfilled. As another example, the particular database can include information identifying inventory quantity, sales quantity, medication quantity, supply quantity, and/or quantity of prescriptions or medications that have been shipped.
Prior to anomaly analysis of the dataset 164, the anomaly detection subsystem 150 can send structure data identifying dimensions, measures, and a date column of the table corresponding to the dataset 164. Such structure data constitutes particular configuration attributes of the anomaly analysis. Thus, the anomaly detection subsystem 150 can send the structure data as part of the configuration attributes 146. A first one of the particular configuration attributes can identify a first dimension; a second one of the particular configuration attributes can identify a first measure; and a third one of the particular configuration attributes can identify a date column. Additionally, still prior to the anomaly analysis, the anomaly detection subsystem 150 can send particular UI data 142 defining formatting attributes. The particular UI data 142 also can be retained in the UI repository 154, within the UI data 156. The particular UI data 142 can include formatting data defining formatting attributes of UI elements to be presented within one or multiple interactive UIs. The formatting data also can define a layout of those UI elements. In some embodiments, the anomaly detection subsystem 150 can include an output module 260 that sends the structure data and various types of UI data 142.
By sending such structure data and the particular UI data 142 to the client device 110, the anomaly detection subsystem 150 can cause the client device 110 to present one or multiple interactive user interfaces for configuration of characteristics of the anomaly analysis. Hence, in contrast to existing analysis technologies, the anomaly analysis can be interactively customized without changes to the anomaly detection subsystem 150. Accordingly, end-users can create a custom anomaly analysis to be performed by the anomaly detection subsystem 150, without coding or modeling experience.
More specifically, the client device 110 can execute, or can continue executing, the client application 116 to receive both the structure data contained in the configuration attributes 146 and the particular UI data 142 from the anomaly detection subsystem 150. In response to receiving such data, the client device 110 can present a fourth UI in the sequence of user interfaces 120. The fourth UI permits interactively configuring particular attributes of a desired anomaly analysis. To that end, the fourth UI can include multiple selectable visual elements.
A first subset of the multiple selectable visual elements can permit receiving input information defining the data scope of the desired anomaly analysis. That is, the input information can select a measure, a dimension, and a date column within the dataset 164. The measure, dimension, and date column can be selected based on the structure data that has been received from the anomaly detection subsystem 150. The measure defines a target variable (e.g., quantity of a particular product or item) to be analyzed for presence of anomalous records, and the dimension defines at least one independent variable determining values of the target variable. The measure, the dimension, and the date column define respective ones of the particular attributes of the desired anomaly analysis.
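For purposes of illustration only, the data-scope portion of the configuration attributes 146 could be represented as a simple mapping. The field names and values below are hypothetical and are not prescribed by this disclosure.

```python
# Hypothetical data-scope configuration attributes; all names are illustrative.
data_scope = {
    "measure": "units_sold",      # target variable to be analyzed
    "dimension": "item_id",       # independent variable
    "date_column": "order_date",  # column supplying the date index
}
```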
In addition, a first one of the multiple selectable visual elements can permit receiving input information defining a first parameter associated with a detection interval for the desired anomaly analysis. The first parameter defines one of the particular attributes of the desired anomaly analysis. The detection interval defines a time period where the anomaly detection subsystem 150 can determine presence of one or multiple anomalous records within the measure identified as a target variable. The time period has a lower bound defined by a first time and an upper bound defined by a second time after the first time. In some cases, the first parameter defines a span of the detection interval; that is, the difference between the upper bound and the lower bound of the time period. Hence, the first parameter can be expressed in units of time (e.g., day or week). As an illustration, the first parameter can be three weeks, four weeks, or six weeks.
Further, in some embodiments, a second one of the multiple selectable visual elements can permit receiving input information defining a second parameter that can control sensitivity of detection of an anomalous record. Such a sensitivity represents a broadening of a sharp decision boundary corresponding to a detection model of this disclosure. The broadening can be controlled by that second parameter (which can be referred to as a sensitivity parameter). The sensitivity parameter can be defined as an ordinal categorical parameter indicating, for example, one of multiple categories (or types) of sensitivity of detection.
In one example, there can be three categories of sensitivity—e.g., “low,” “medium,” and “high.” Hence, the sensitivity parameter can indicate one of “low” sensitivity, “medium” sensitivity, and “high” sensitivity. In some embodiments, the three sensitivity categories can be converted to the standard error (or confidence interval) of a selected type of detection model. That standard error can then be applied as a constraint during generation of the decision boundary of the selected detection model. In such an example, a sensitivity parameter value of “low” indicates using an 85% confidence interval to determine the decision boundary and differentiate a normal record (which falls within the decision boundary) from an anomalous record (which falls outside of the decision boundary). Further, sensitivity parameter values of “medium” and “high” indicate using 80% and 70% confidence intervals, respectively, to determine the decision boundaries. Embodiments of this disclosure are, of course, not limited to those particular confidence intervals.
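As a minimal sketch, the conversion from sensitivity category to confidence interval could be implemented as follows; the mapping simply mirrors the example above and is not prescriptive.

```python
# Sketch: map an ordinal sensitivity category to the confidence interval
# used when generating decision boundaries (low -> 85%, medium -> 80%,
# high -> 70%, per the example above).
SENSITIVITY_TO_CONFIDENCE = {"low": 0.85, "medium": 0.80, "high": 0.70}

def confidence_for(sensitivity: str) -> float:
    """Return the confidence-interval level for a sensitivity category."""
    return SENSITIVITY_TO_CONFIDENCE[sensitivity]
```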
As an illustration, the UI 400 shown in
The UI 400 also includes a selectable UI element 450 and a selectable UI element 460. Selection of the selectable UI element 450 permits defining a span of a detection interval. The span can be defined as an offset relative to a most recent date present in the date column identified via the selectable UI element 420. Selection of the selectable UI element 450 can present a menu of preset parameters (not depicted in
After input information has been received using a configuration user interface, the client device 110 can execute, or can continue executing, the client application 116 to send the particular attribute(s) that configure characteristics of the desired anomaly analysis to the anomaly detection subsystem 150. The client device 110 can send the particular attributes as part of the configuration attributes 146, via the communication network 140.
The anomaly detection subsystem 150 can receive the particular attributes within the configuration attributes 146, from the client device 110. The anomaly detection subsystem 150 can configure the detection interval based on the first parameter within the received configuration attributes 146. As mentioned, the first parameter can define the span of the time interval (e.g., three weeks) corresponding to the detection interval. The anomaly detection subsystem 150 can then configure the upper bound of the detection interval as the value of the most recent date within the date column identified in the configuration attributes 146. In addition, the anomaly detection subsystem 150 can configure the lower bound of the detection interval as the value of the date index (a date or another type of time index, for example) in the date column that yields the defined span of the detection interval. In other words, the lower bound is the date index that corresponds to the configured span measured backward from the most recent date. In some embodiments, the configuration module 220 (
In addition, the anomaly detection subsystem 150 can determine a training interval using the detection interval and the date column identified in the configuration attributes 146. The training interval precedes the detection interval. That is, the training interval contains historical dimension records relative to dimension records contained in the detection interval. More specifically, the training interval defines a second time period where the anomaly detection subsystem 150 can generate an anomaly detection model to determine presence or absence of anomalous records within a dataset (e.g., values of a target variable). The second time period has a lower bound defined by a first time and an upper bound defined by a second time after the first time. The anomaly detection subsystem 150 can configure the lower bound of the second time period as the value of a date index identifying the earliest time in the date column within the dataset 164. In addition, the anomaly detection subsystem 150 can configure the upper bound of the second time period as the value of another date index that precedes the date index defining the lower bound of the detection interval. In some cases, the date index corresponding to the upper bound of the second time period can be immediately consecutive to the date index defining the lower bound of the detection interval. In some embodiments, the configuration module 220 (
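The interval-configuration logic described in the two preceding paragraphs can be sketched as follows. This is an illustrative implementation that assumes the date column parses to timestamps; it is not a definitive one.

```python
import pandas as pd

def configure_intervals(dates: pd.Series, span: pd.Timedelta):
    """Derive the detection and training intervals from a date column.

    The detection interval ends at the most recent date and extends back by
    the configured span; the training interval covers the earlier dates, its
    upper bound immediately preceding the detection interval.
    """
    dates = pd.to_datetime(dates).sort_values()
    detection_upper = dates.max()             # most recent date index
    detection_lower = detection_upper - span  # yields the configured span
    earlier = dates[dates < detection_lower]
    training_lower = earlier.min()            # earliest date in the column
    training_upper = earlier.max()            # precedes the detection interval
    return (training_lower, training_upper), (detection_lower, detection_upper)

# Example: a three-week detection interval.
# training, detection = configure_intervals(df["order_date"], pd.Timedelta(weeks=3))
```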
Regardless of how the training interval is configured, the anomaly detection subsystem 150 can generate a detection model 158 based on the dataset 164 and the training interval. To that end, the anomaly detection subsystem 150 can select a subset of the multiple records included in the dataset 164. The subset includes first records within the training interval. The first records can include first measure records and first dimension records. As mentioned, the first measure records serve as values of a target variable (e.g., the metric corresponding to the measure records), and the first dimension records serve as values of an independent variable (e.g., time, geographical region, employee identification (ID), item ID, or similar). In addition, the anomaly detection subsystem 150 can train, using such a subset, the detection model 158 to classify a record as being one of a normal record or an anomalous record. The detection model 158 can be embodied in, or can include, a time-series model, a median absolute deviation model, or an isolation forest model, for example. The detection model 158 can be trained using one or several unsupervised training techniques. In some embodiments, the anomaly detection subsystem 150 can include a training module 230 that can train the detection model 158.
Training the detection model 158 includes generating a first decision boundary and a second decision boundary. The first and second decision boundaries define a domain where values of respective measure records are deemed normal. Outside that domain, a value of a measure record is deemed anomalous. In other words, each one of the first and second decision boundary separates that domain from another domain where values of records are deemed anomalous. More specifically, the first decision boundary and the second decision boundary can define, respectively, an upper bound and a lower bound that can be compared to values of measure records. The trained detection model 158 classifies a measure record having a value within the interval defined by the upper bound and the lower bound as a normal record. The trained detection model 158 classifies another measure record having a value outside that interval as an anomalous record.
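A median absolute deviation model, one of the model types named above, admits a compact sketch of this boundary-based classification. The width factor k below stands in for the sensitivity-derived constraint and is an assumption of this example, not a requirement of the disclosure.

```python
import numpy as np

class MADDetector:
    """Illustrative median-absolute-deviation detection model.

    fit() generates the two decision boundaries from measure values within
    the training interval; classify() deems a value inside the boundaries
    normal and a value outside them anomalous.
    """

    def fit(self, values: np.ndarray, k: float = 3.0) -> "MADDetector":
        median = np.median(values)
        mad = np.median(np.abs(values - median))
        self.upper_ = median + k * mad  # first decision boundary (upper bound)
        self.lower_ = median - k * mad  # second decision boundary (lower bound)
        return self

    def classify(self, value: float) -> str:
        if value > self.upper_:
            return "anomalous"  # above the first decision boundary
        if value < self.lower_:
            return "anomalous"  # below the second decision boundary
        return "normal"
```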
Embodiments of the disclosure also provide flexibility with respect to configuration of the detection model 158 that is trained for anomaly detection. In other words, anomaly analyses performed by the anomaly detection subsystem 150 need not be limited to a specific type of detection model 158. In some embodiments, the client device 110 can present a configuration user interface as part of the sequence of user interfaces 120, where the configuration user interface permits selecting the type of detection model 158 to be trained for anomaly analysis. The anomaly detection subsystem 150 can cause the client device 110 to present such a configuration user interface. As is illustrated in
In addition, or in some embodiments, a training interval can be configured independently from a detection interval. Thus, a training interval need not be limited to being immediately consecutive to the detection interval. In some cases, the configuration user interface that permits selecting the type of detection model 158 also can permit defining both the training interval and the detection interval.
As an illustration, the UI 500 shown in
The UI 500 also can include a fillable pane 520 that can receive input information defining one or multiple regressors that can serve as independent variables affecting the target variable defined by the measure selected for anomaly analysis. Examples of regressors include item quantity, item sales, and the like.
The UI 500 can further include a pane 530 having several selectable UI elements that permit incorporating various temporal effects into the relationship between the target variable and independent variable(s). As is illustrated, the temporal effects can include monthly seasonality, weekly seasonality, daily seasonality, and American holidays (international holidays also can be contemplated). Monthly seasonality can be selected via a selectable UI element 532a; weekly seasonality can be selected via a selectable UI element 532b; daily seasonality can be selected via a selectable UI element 532c; and American holidays can be selected via a selectable UI element 532d. Each one of those selectable UI elements is embodied in a checkbox, just for the sake of illustration. Selection of a selectable visual element 534 results in selection of all available seasonality effects and American holidays. Particular table columns can be searched using a selectable UI element 536 and, based on results of the search, a table column can be added as a temporal effect. Further, selection of a selectable element 538 can cause presentation of a menu of table columns available for selection as a temporal effect.
Regardless of the type of temporal effect and the manner in which it is selected, selection of one or more temporal effects results in respective regressors or model parameters being added to a time-series model used for detection of anomalous records. Accordingly, variation caused by seasonality and/or holiday factors can be incorporated into the generation of a decision boundary for a type of detection model that has been selected as described herein.
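By way of example only, such temporal effects could be attached to an open-source time-series model such as Prophet. The library, the column names, and the regressor below are illustrative assumptions rather than requirements of this disclosure.

```python
from prophet import Prophet

# Illustrative time-series detection model with the temporal effects from
# pane 530 attached; interval_width plays the role of the sensitivity-derived
# confidence interval.
model = Prophet(
    weekly_seasonality=True,  # cf. selectable UI element 532b
    daily_seasonality=True,   # cf. selectable UI element 532c
    interval_width=0.80,      # e.g., "medium" sensitivity
)
model.add_seasonality(name="monthly", period=30.5, fourier_order=5)  # cf. 532a
model.add_country_holidays(country_name="US")                        # cf. 532d
model.add_regressor("item_quantity")  # optional regressor (cf. pane 520)

# model.fit(train_df) expects columns "ds" (date) and "y" (measure); the
# forecast columns yhat_lower and yhat_upper can serve as decision boundaries.
```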
The UI 500 also includes a selectable UI element 540 that, in response to being selected, causes the client device 110 to send model information identifying the selection of the type of model, regressor(s), and/or seasonality effect(s). The model information can be sent to the anomaly detection subsystem 150, as part of the configuration attributes 146.
Besides permitting selection of the detection model 158 to be trained for anomaly analysis, the UI 500 can permit defining a lower bound and an upper bound of a training interval, and a lower bound and an upper bound of a detection interval. To that end, the UI 500 includes a first selectable UI element 544a and a second selectable UI element 544b that can receive, respectively, first input information and second input information. The first input information defines the lower bound of the training interval, and the second input information defines the upper bound of the training interval. Further, the UI 500 includes a third selectable UI element 548a and a fourth selectable UI element 548b that can receive, respectively, third input information and fourth input information. The third input information defines the lower bound of the detection interval, and the fourth input information defines the upper bound of the detection interval.
The detection model 158 that has been trained can classify each one of the multiple records within the dataset 164 as either a normal record or an anomalous record. Thus, in some cases, after being trained, the detection model 158 can classify each one of the records within the detection interval. Classification of records in such a fashion constitutes a detection mechanism that can determine presence or absence of anomalous records in a dataset, within the detection interval.
Because the detection model 158 can be trained using an unsupervised training technique after a desired dataset has been obtained from a data repository, the anomaly detection subsystem 150 serves as a data-agnostic anomaly detection tool. Therefore, the anomaly detection subsystem 150 can be reconfigured in response to a dataset becoming available, in sharp contrast to existing technologies that are built (e.g., linked and compiled) for particular types of datasets.
The anomaly detection subsystem 150 can generate classification attributes for respective records of the dataset 164 within the detection interval by applying the trained detection model 158 to the respective records. In some cases, each one of the classification attributes designates a record as one of a normal record or an anomalous record. In other cases, each one of the classification attributes designates a record as one of a normal record, an anomalous record of a first type (e.g., “downtrend”), or an anomalous record of a second type (e.g., “spike”). Spike and downtrend denominations are merely illustrative and are provided for the sake of nomenclature. A first classification attribute of the classification attributes designates a first one of the respective records as either a normal record or an anomalous record; and a second classification attribute of the classification attributes designates a second one of the respective records as either a normal record or an anomalous record. In some embodiments, the anomaly detection subsystem 150 can include a detection module 240 (
A classification attribute can be embodied in, or can include, a label. For purposes of illustration, the label can contain a string of characters that convey that a record is either a normal record or an anomalous record. In one example, the label can be one of “Normal” or “Anomalous.” In another example, the label can be one of “0,” “1,” or “−1,” where “0” designates a normal record, “1” designates an anomalous record of a first type, and “−1” designates an anomalous record of a second type.
In some embodiments, the anomaly detection subsystem 150 can determine anomaly scores for respective anomalous records that may have been identified within the dataset 164. Each one of the anomaly scores represents the magnitude of an anomaly. Specifically, a score σ for an anomalous record can be equal to the smallest distance between a metric value of the anomalous record and the first decision boundary or the second decision boundary. In some embodiments, the anomaly detection subsystem 150 can include a scoring module 250 (
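The score σ described above admits a one-function sketch; the boundary values are assumed to come from the trained detection model 158.

```python
def anomaly_score(value: float, lower: float, upper: float) -> float:
    """Smallest distance between a record's measure value and either decision
    boundary; returns 0.0 for a value inside the normal domain."""
    if lower <= value <= upper:
        return 0.0  # normal record: no anomaly magnitude
    return min(abs(value - lower), abs(value - upper))
```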
The anomaly detection subsystem 150 can generate anomaly data 148 defining an anomaly table. The anomaly table can include dimension records of the dataset 164 and records identifying respective classification attributes for corresponding ones of the dimension records. The dimension records pertain to the detection interval and correspond to the independent variable identified by the configuration attributes 146. In addition, or in other embodiments, the anomaly detection subsystem 150 also can embed anomaly scores into the anomaly table. The anomaly scores constitute second measure records. Each one of the anomaly scores that are added to the anomaly table corresponds to a respective dimension record identifying a record designated as an anomalous record. In some embodiments, the anomaly detection subsystem 150 can format the anomaly data 148 as a comma-separated document that includes multiple rows, each row including a dimension record, a measure record, and a classification attribute. In some cases, at least one of the multiple rows includes an anomaly score. In some embodiments, the anomaly detection subsystem 150 can include an output module 260 (
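For purposes of illustration, an anomaly table formatted as a comma-separated document might be assembled as follows. The column names and values are hypothetical.

```python
import pandas as pd

# Illustrative anomaly table: each row holds a dimension record, a measure
# record, and a classification attribute; an anomaly score is embedded only
# for rows designated anomalous.
anomaly_table = pd.DataFrame({
    "item_id": ["A-100", "A-101", "A-102"],  # dimension records
    "units_sold": [120.0, 480.0, 15.0],      # measure records
    "classification": [0, 1, -1],            # 0 = normal; 1 and -1 = anomalous types
    "anomaly_score": [None, 210.5, 88.0],    # only for anomalous records
})
anomaly_table.to_csv("anomaly_data.csv", index=False)  # comma-separated document
```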
In addition, or in some embodiments, the anomaly detection subsystem 150 can embed other data into the anomaly data 148. For example, the anomaly detection subsystem 150 can embed first data and second data identifying, respectively, the training interval and the detection interval corresponding to the dataset 164. Further, or as another example, the anomaly detection subsystem 150 can embed data summarizing the anomaly analysis into the anomaly data 148. Such data can include first data identifying a number of anomalous records and/or second data identifying a percentage of anomalous records. The output module 260 (
The anomaly detection subsystem 150 can send the anomaly data 148 to the client device 110 by means of the communication network 140. The anomaly detection subsystem 150 also can send other UI data 142 including formatting data defining formatting attributes that control presentation of a results UI in the sequence of user interfaces 120. The results UI can summarize various aspects of anomaly analysis. Thus, the results UI can include multiple UI elements identifying at least a subset of the anomaly data 148.
The results UI can include a selectable visual element that, in response to being selected, permits identifying a data view to be plotted as a time series of the independent variable identified by the configuration attributes 146 and used in the anomaly analysis. In one example, to identify the data view, selection of the selectable visual element causes presentation of a menu of selectable item IDs having at least one anomalous record. Selection of a particular item ID can cause the client device 110 to present a user interface 130 that includes a graph of the data view identified by the particular item ID. The graph can be a two-dimensional plot of measure value as a function of time, where the ordinate corresponds to measure value and the abscissa corresponds to date index. The time domain shown in the abscissa includes a training interval 134 used to generate the detection model 158, and a detection interval 132 defining a detection window. The graph also presents a first decision boundary 136a and a second decision boundary 136b defining a domain where data records can be deemed to be normal. The domain is represented by a stippled rectangle in the user interface 130.
Anomalous records in the graph are represented by solid circles. An anomalous record that has a measure value below the second decision boundary 136b can be referred to as a “downtrend” record. An anomalous record having a measure value above the first decision boundary 136a can be referred to as a “spike” record. As mentioned, spike and downtrend denominations are merely illustrative and are provided for the sake of nomenclature.
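A graph of this kind can be sketched with standard plotting tools. The column names and the constant boundary band below are simplifying assumptions of this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_data_view(view: pd.DataFrame, lower: float, upper: float) -> None:
    """Sketch of the graph in user interface 130: measure value versus date
    index, the normal domain shaded between the decision boundaries, and
    anomalous records drawn as solid circles."""
    fig, ax = plt.subplots()
    ax.plot(view["date"], view["measure"], label="measure")
    ax.axhspan(lower, upper, alpha=0.2, label="normal domain")
    anomalies = view[view["classification"] != 0]
    ax.scatter(anomalies["date"], anomalies["measure"], zorder=3,
               label="anomalous records")
    ax.set_xlabel("date index")
    ax.set_ylabel("measure value")
    ax.legend()
    plt.show()
```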
The UI 600 shown in
The first data that constitutes the anomaly table can be referred to as item data. Because the item data is presented during execution of the client application 116, the client device 110 can retain the item data in system memory. The system memory can be embodied in one or multiple volatile memory devices, such as random-access memory (RAM) device(s). The pane 610, however, can include a selectable UI element 634 that, in response to being selected, causes the client device 110 to retain the item data in mass storage integrated within the client device 110 or functionally coupled thereto. The selectable visual element 634 is labeled “Download Item Data” simply for the sake of nomenclature. The pane 610 also has a selectable UI element 638 that, in response to being selected, causes the client device 110 to retain received anomaly data 148 in mass storage integrated within the client device 110 or functionally coupled thereto. The selectable visual element 638 is labeled “Download Analysis Data” simply for the sake of nomenclature.
The UI 600 also includes a pane 640 that permits controlling presentation of a time series associated with an anomalous record. To that point, the pane 640 includes a selectable UI element 648 that, in response to being selected, causes the client device 110 to present a menu of selectable item IDs. That menu includes the item IDs shown by the UI elements 612. Further, the pane 640 also includes a selectable UI element 648 that, in response to being selected, causes the client device 110 to generate a UI including a graph 650 (
In some embodiments, the anomaly detection subsystem 150 can expose a group of APIs that can permit configuration of a desired anomaly detection analysis or execution of the desired detection analysis, or both. In those embodiments, the anomaly detection subsystem 150 can include an API server that provides the group of APIs. In one example, that server can be retained in the memory 270 (
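One possible shape for such a group of APIs is sketched below using the FastAPI framework purely for illustration; the endpoint paths, the payload fields, and the framework choice are assumptions of this example, not elements of the disclosure.

```python
from uuid import uuid4
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalysisConfig(BaseModel):
    """Hypothetical configuration attributes for a desired anomaly analysis."""
    measure: str
    dimension: str
    date_column: str
    detection_span_weeks: int = 3
    sensitivity: str = "medium"

_configs: dict[str, AnalysisConfig] = {}  # in-memory store for this sketch

@app.post("/anomaly-analyses")
def configure_analysis(config: AnalysisConfig) -> dict:
    """Register the configuration of a desired anomaly analysis."""
    analysis_id = str(uuid4())
    _configs[analysis_id] = config
    return {"analysis_id": analysis_id}

@app.post("/anomaly-analyses/{analysis_id}/run")
def execute_analysis(analysis_id: str) -> dict:
    """Trigger execution of a previously configured analysis (stubbed here)."""
    config = _configs[analysis_id]
    return {"status": "queued", "measure": config.measure}
```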
At block 710, the computing system can access a dataset comprising multiple records. The dataset can be accessed in several ways. In some cases, the computing system can receive a document containing the dataset. The document can be a comma-separated file, for example. In other cases, the computing system can receive a query from a client device (e.g., client device 110 (
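Two of those access paths can be sketched as follows; the file name, database, and table are hypothetical.

```python
import sqlite3
import pandas as pd

# (a) Dataset received as a comma-separated document:
dataset = pd.read_csv("uploaded_dataset.csv")

# (b) Dataset obtained by resolving a received query against a database:
with sqlite3.connect("analytics.db") as conn:
    dataset = pd.read_sql_query("SELECT * FROM sales_records", conn)
```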
At block 720, the computing system can access at least one configuration attribute. Such configuration attribute(s) can define one or more characteristics of an anomaly analysis. A first configuration attribute of the at least one configuration attribute defines a detection interval. As an example, the detection interval can be embodied in the detection interval 132 (
At block 730, the computing system can generate, using a subset of the multiple records, a detection model to determine presence or absence of an anomalous record within the multiple records. The detection model that is generated can classify each one of the multiple records within the dataset as either a normal record or an anomalous record. Thus, in some cases, the detection model that is generated can classify each one of the records within the detection interval. As mentioned, generating the detection model includes generating a first decision boundary and a second decision boundary by training the detection model using the subset of multiple records and one or multiple unsupervised training techniques. Each one of the first decision boundary and the second decision boundary separates a first domain, where values of records are deemed normal, from a second domain, where values of records are deemed anomalous. Accordingly, the detection model classifies a measure record having a value within the first domain as a normal record. Further, the detection model classifies another measure record having a value outside that first domain as an anomalous record. The detection model can be generated by implementing the method illustrated in
At block 740, the computing system can select a second subset of the multiple records. The second subset that is selected includes second records within the detection interval.
At block 750, the computing system can generate classification attributes for respective ones of the second records by applying the detection model to the second subset. In some cases, a first classification attribute of the classification attributes designates a first one of the second records as one of a normal record or an anomalous record. In other cases, the first classification attribute designates the first one of the second records as one of a normal record, an anomalous record of a first type, or an anomalous record of a second type.
At block 810, the computing system can determine a training interval using the detection interval and the dataset. As an example, the training interval can be the training interval 134 depicted in
At block 820, the computing system can select a subset of the multiple records. The subset includes first records within the training interval.
At block 830, the computing system can train, using the subset, a detection model to classify at least one of the multiple records as being either a normal record or an anomalous record.
In order to provide some context, the computer-implemented methods and systems of this disclosure can be implemented on the computing environment illustrated in
The computer-implemented methods and systems in accordance with this disclosure can be operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that can be suitable for use with the systems and methods comprise, but are not limited to, personal computers, server computers, laptop devices, and multiprocessor systems. Additional examples comprise set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that comprise any of the above systems or devices, and the like.
The processing of the disclosed computer-implemented methods and systems can be performed by software components. The disclosed systems and computer-implemented methods can be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers or other devices. Generally, program modules comprise computer code, routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The disclosed methods can also be practiced in grid-based and distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote computer storage media including memory storage devices.
Further, one skilled in the art will appreciate that the systems and computer-implemented methods disclosed herein can be implemented via a general-purpose computing device in the form of a computing device 901. The components of the computing device 901 can comprise, but are not limited to, one or more processors 903, a system memory 912, and a system bus 913 that couples various system components including the one or more processors 903 to the system memory 912. The system can utilize parallel computing.
The system bus 913 represents one or more of several possible types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, or a local bus using any of a variety of bus architectures. The bus 913, and all buses specified in this description, can also be implemented over a wired or wireless network connection and each of the subsystems, including the one or more processors 903, a mass storage device 904, an operating system 905, software 906, data 907, a network adapter 908, the system memory 912, an Input/Output Interface 910, a display adapter 909, a display device 911, and a human-machine interface 902, can be contained within one or more remote computing devices 914a,b,c at physically separate locations, connected through buses of this form, in effect implementing a fully distributed system.
The computing device 901 typically comprises a variety of computer-readable media. Exemplary readable media can be any available media that is accessible by the computing device 901 and comprises, for example and not meant to be limiting, both volatile and non-volatile media, removable and non-removable media. The system memory 912 comprises computer readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM). The system memory 912 typically contains data such as the data 907 and/or program modules such as the operating system 905 and the software 906 that are immediately accessible to and/or are presently operated on by the one or more processors 903. The software 906 can include, in some embodiments, one or more of the modules described herein in connection with detection of anomalous records. As such, in at least some of those embodiments, the software 906 can include the ingestion module 210, the configuration module 220, the training module 230, the detection module 240, the scoring module 250, and the output module 260. In other embodiments, the software 906 can include a different configuration of modules from that shown in
In some embodiments, program modules that constitute the software 906 can be retained (built or otherwise) in one or more remote computing devices functionally coupled to the computing device 901. Such remote computing device(s) can include, for example, remote computing device 914a, remote computing device 914b, and remote computing device 914c. Hence, as mentioned, functionality described herein in connection with detection of anomalous records can be provided in a distributed fashion, using parallel computing, for example.
In another aspect, the computing device 901 can also comprise other removable/non-removable, volatile/non-volatile computer storage media. By way of example,
Optionally, any number of program modules can be stored on the mass storage device 904, including by way of example, the operating system 905 and the software 906. Each of the operating system 905 and the software 906 (or some combination thereof) can comprise elements of the programming and the software 906. The data 907 can also be stored on the mass storage device 904. The data 907 can be stored in any of one or more databases known in the art. Examples of such databases comprise, DB2®, Microsoft® Access, Microsoft® SQL Server, Oracle®, mySQL, PostgreSQL, and the like. The databases can be centralized or distributed across multiple systems.
In another aspect, the user can enter commands and information into the computing device 901 via an input device (not shown). Examples of such input devices comprise, but are not limited to, a keyboard, pointing device (e.g., a “mouse”), a microphone, a joystick, a scanner, tactile input devices such as gloves, and other body coverings, and the like. These and other input devices can be connected to the one or more processors 903 via the human-machine interface 902 that is coupled to the system bus 913, but can be connected by other interface and bus structures, such as a parallel port, game port, an IEEE 1394 Port (also known as a Firewire port), a serial port, or a universal serial bus (USB).
In yet another aspect, the display device 911 can also be connected to the system bus 913 via an interface, such as the display adapter 909. It is contemplated that the computing device 901 can have more than one display adapter 909 and the computing device 901 can have more than one display device 911. For example, the display device 911 can be a monitor, an LCD (Liquid Crystal Display), or a projector. In addition to the display device 911, other output peripheral devices can comprise components such as speakers (not shown) and a printer (not shown) which can be connected to the computing device 901 via the Input/Output Interface 910. Any operation and/or result of the methods can be output in any form to an output device. Such output can be any form of visual representation, including, but not limited to, textual, graphical, animation, audio, tactile, and the like. The display device 911 and computing device 901 can be part of one device, or separate devices.
The computing device 901 can operate in a networked environment using logical connections to one or more remote computing devices 914a,b,c. By way of example, a remote computing device can be a personal computer, portable computer, smartphone, a server, a router, a network computer, a peer device or other common network node, and so on. Logical connections between the computing device 901 and a remote computing device 914a,b,c can be made via a network 915, such as a local area network (LAN) and/or a general wide area network (WAN). Such network connections can be through the network adapter 908. The network adapter 908 can be implemented in both wired and wireless environments. In an aspect, one or more of the remote computing devices 914a,b,c can comprise an external engine and/or an interface to the external engine.
For purposes of illustration, application programs and other executable program components such as the operating system 905 are illustrated herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computing device 901, and are executed by the one or more processors 903 of the computer. An implementation of the software 906 can be stored on or transmitted across some form of computer-readable media. Any of the disclosed methods can be performed by computer readable instructions embodied on computer-readable media. Computer-readable media can be any available media that can be accessed by a computer. By way of example and not meant to be limiting, computer-readable media can comprise “computer storage media” and “communications media.” “Computer storage media” comprise volatile and non-volatile, removable and non-removable media implemented in any methods or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Exemplary computer storage media comprises, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
It is to be understood that the methods and systems described here are not limited to specific operations, processes, components, or structure described, or to the order or particular combination of such operations or components as described. It is also to be understood that the terminology used herein is for the purpose of describing exemplary embodiments only and is not intended to be restrictive or limiting.
As used herein the singular forms “a,” “an,” and “the” include both singular and plural referents unless the context clearly dictates otherwise. Values expressed as approximations, by use of antecedents such as “about” or “approximately,” shall include reasonable variations from the referenced values. If such approximate values are included with ranges, not only are the endpoints considered approximations, the magnitude of the range shall also be considered an approximation. Lists are to be considered exemplary and not restricted or limited to the elements comprising the list or to the order in which the elements have been listed unless the context clearly dictates otherwise.
Throughout the specification and claims of this disclosure, the following words have the meaning that is set forth: “comprise” and variations of the word, such as “comprising” and “comprises,” mean including but not limited to, and are not intended to exclude, for example, other additives, components, integers, or operations. “Include” and variations of the word, such as “including” are not intended to mean something that is restricted or limited to what is indicated as being included, or to exclude what is not indicated. “May” means something that is permissive but not restrictive or limiting. “Optional” or “optionally” means something that may or may not be included without changing the result or what is being described. “Prefer” and variations of the word such as “preferred” or “preferably” mean something that is exemplary and more ideal, but not required. “Such as” means something that serves simply as an example.
Operations and components described herein as being used to perform the disclosed methods and construct the disclosed systems are illustrative unless the context clearly dictates otherwise. It is to be understood that when combinations, subsets, interactions, groups, etc. of these operations and components are disclosed, each individual and collective combination and permutation of these is specifically contemplated and described herein, for all methods and systems, even though specific reference to each may not be explicitly disclosed. This applies to all aspects of this application including, but not limited to, operations in disclosed methods and/or the components disclosed in the systems. Thus, if there are a variety of additional operations that can be performed or components that can be added, it is understood that each of these additional operations can be performed and each of these components can be added with any specific embodiment or combination of embodiments of the disclosed systems and methods.
Embodiments of this disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the methods and systems may take the form of a computer program product on a computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. More particularly, the present methods and systems may take the form of web-implemented computer software. Any suitable computer-readable storage medium may be utilized including hard disks, CD-ROMs, optical storage devices, or magnetic storage devices, whether internal, networked, or cloud-based.
Embodiments of this disclosure have been described with reference to diagrams, flowcharts, and other illustrations of methods, systems, apparatuses, and computer program products. Each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, can be implemented by processor-accessible instructions. Such instructions can include, for example, computer program instructions (e.g., processor-readable and/or processor-executable instructions). The processor-accessible instructions can be built (e.g., linked and compiled) and retained in processor-executable form in one or multiple memory devices or one or many other processor-accessible non-transitory storage media. These computer program instructions (built or otherwise) may be loaded onto a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine. The loaded computer program instructions can be accessed and executed by one or multiple processors or other types of processing circuitry. In response to execution, the loaded computer program instructions provide the functionality described in connection with flowchart blocks (individually or in a particular combination) or blocks in block diagrams (individually or in a particular combination). Thus, such instructions which execute on the computer or other programmable data processing apparatus create a means for implementing the functions specified in the flowchart blocks (individually or in a particular combination) or blocks in block diagrams (individually or in a particular combination).
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including processor-accessible instructions (e.g., processor-readable instructions and/or processor-executable instructions) to implement the function specified in the flowchart blocks (individually or in a particular combination) or blocks in block diagrams (individually or in a particular combination). The computer program instructions (built or otherwise) may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process. The series of operations can be performed in response to execution by one or more processors or other types of processing circuitry. Thus, such instructions that execute on the computer or other programmable apparatus provide operations that implement the functions specified in the flowchart blocks (individually or in a particular combination) or blocks in block diagrams (individually or in a particular combination).
Accordingly, blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions in connection with such diagrams and/or flowchart illustrations, combinations of operations for performing the specified functions and program instruction means for performing the specified functions. Each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, can be implemented by special-purpose hardware-based computer systems that perform the specified functions or operations, or combinations of special-purpose hardware and computer instructions.
The methods and systems can employ artificial intelligence techniques such as machine learning and iterative learning. Examples of such techniques include, but are not limited to, expert systems, case-based reasoning, Bayesian networks, behavior-based AI, neural networks, fuzzy systems, evolutionary computation (e.g. genetic algorithms), swarm intelligence (e.g. ant algorithms), and hybrid intelligent systems (e.g. Expert inference rules generated through a neural network or production rules from statistical learning).
While the computer-implemented methods, apparatuses, devices, and systems have been described in connection with preferred embodiments and specific examples, it is not intended that the scope be limited to the particular embodiments set forth, as the embodiments herein are intended in all respects to be illustrative rather than restrictive.
Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its operations be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its operations or it is not otherwise specifically stated in the claims or descriptions that the operations are to be limited to a specific order, it is in no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including: matters of logic with respect to arrangement of operations or operational flow; plain meaning derived from grammatical organization or punctuation; the number or type of embodiments described in the specification.
It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the scope or spirit. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims.