This disclosure is generally related to the field of data analysis. More specifically, this disclosure is related to a system and method which facilitates automated imputation for multi-state sensor data and outliers.
In many Industrial Internet of Things (IoT) applications, a large volume of high-dimensional data may be continuously collected from heterogeneous sensors for various applications, e.g., planning, prognostics, and diagnostics. Sensor data can be lost and corrupted during the data collection due to sensor malfunctions, unreliable communication channels, and unstable databases. As the number of sensors (i.e., attributes) increases, so increases the chance of corrupted/missing data per database query. This in turn can result in rapidly compromised data quality for machine learning algorithms.
For example, when a feature matrix is constructed for a multivariate analysis algorithm, a significant number of samples or attributes may be discarded due to issues relating to data quality. A simplistic and naïve approach can be to eliminate samples which contain null data points or features with poor data quality (e.g., missing data). However, this elimination can result in a large waste of collected sensor data if the data loss or corruption randomly occurs across the feature matrix. One solution to address this data waste is to perform data imputation, by replacing the missing data with substituted values.
One challenge of data imputation is to avoid introducing unwanted data artifacts. In particular, it can be difficult to perform data imputation for multi-dimensional data with unknown multi-states and outliers, which may occur in sensor data for many industrial applications. For example, multi-dimensional data from industrial sensors can include both null values and outliers. For a large-scale Industrial IoT application, e.g., with high-dimensional and multi-state sensor data, the challenge remains to automate the preprocessing of the data (including outlier elimination and missing-data imputation).
A system and method are provided to facilitate automated data imputation. During operation, the system generates a cluster model based on raw data obtained from sensors with multiple states, wherein the raw data includes missing values. The system replaces the missing values with first imputed data based on the cluster model. The system iterates, until a predetermined threshold has been reached, through a series of operations which include: updating the cluster model based on most recently imputed data; predicting outliers based on the cluster model; marking the outliers as null values to obtain filtered data; updating the cluster model based on the filtered data; and replacing the null values with second imputed data based on the cluster model.
In some embodiments, prior to generating the cluster model based on the raw data, the system: receives a request to process the raw data, wherein a state of a sensor includes one or more of off, idle, and active; subsequent to iterating through the series of operations until the predetermined threshold has been reached, returns final data generated based on the cluster model; and stores, in a database, the final data as preprocessed data.
In some embodiments, generating the cluster model based on the raw data, replacing the missing values with the first imputed data, updating the cluster model based on the filtered data, and replacing the null values with the second imputed data are performed by a first module. Updating the cluster model based on the most recently imputed data, predicting the outliers, and marking the outliers as null values are performed by a second module.
In some embodiments, iterating through the series of operations further involves the first module: receiving, as input data, the raw data or the filtered data; replacing the missing or null values with the most recently imputed data; and transmitting, as output data, the most recently imputed data to the second module.
In some embodiments, iterating through the series of operations further involves the second module: receiving, as input data, the most recently imputed data from the first module; updating the cluster model based on the most recently imputed data; predicting the outliers based on the cluster model; removing the outliers by marking the outliers as null values to obtain the filtered data; and transmitting, as output data, the filtered data to the first module.
In some embodiments, the first module includes a first cluster outlier module, a resampler module, and a denormalizer module. The second module includes a second cluster outlier module and a null value imputer module.
In some embodiments, generating the cluster model based on the raw data and updating the cluster model based on the most recently imputed data or the filtered data comprises one or more of: determining, based on the raw data, the most recently imputed data, or the filtered data, clusters and information associated with the clusters, wherein the information associated with the clusters includes one or more of: a number of clusters; a centroid of a respective cluster; and a standard deviation associated with the respective cluster; classifying a cluster as an outlier cluster; classifying a point as an outlier point; and determining that the outlier point belongs to a first cluster of the determined clusters.
In some embodiments, replacing the missing values with the first imputed data and replacing the null values with the second imputed data comprises: generating, for a missing or null value based on a Gaussian distribution, a sample based on the determined clusters and the information associated with the clusters; and replacing the missing or null value with the generated sample.
In some embodiments, the cluster model is generated or updated based on a Gaussian Mixture Model with a number of centroids. A probability density function of the GMM is based on a Gaussian distribution. An outlier cluster is defined based on a user-defined threshold. An outlier point is defined based on a user-defined confidence level.
In the figures, like reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The embodiments described herein provide a system which facilitates automated data imputation for multi-state sensor data with outliers, using an iterative feedback loop which updates a learned model by replacing missing or null values with resampled data.
As described above, in many IoT applications, a large volume of high-dimensional data may be continuously collected from heterogeneous sensors for various applications, e.g., planning, prognostics, and diagnostics. Sensor data can be lost and corrupted during the data collection due to sensor malfunctions, unreliable communication channels, and unstable databases. As the number of sensors (i.e., attributes) increases, so increases the chance of corrupted/missing data per database query. This in turn can result in rapidly compromised data quality for machine learning algorithms.
For example, when a feature matrix is constructed for a multivariate analysis algorithm, a significant number of samples or attributes may be discarded due to issues relating to data quality. A simplistic and naïve approach can be to eliminate samples which contain null data points or features with poor data quality (e.g., missing data). However, this elimination can result in a large waste of collected sensor data if the data loss or corruption randomly occurs across the feature matrix. One solution to address this data waste is to perform data imputation, by replacing the missing data with substituted values.
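The scale of this waste can be illustrated with a short calculation. The following is a minimal sketch which assumes (an assumption made here for illustration, not stated in this disclosure) that each of the p readings in a row is lost independently with probability q, so that a row contains at least one missing value with probability 1−(1−q)^p. The function name and the numbers are hypothetical.

```python
def row_discard_probability(q: float, p: int) -> float:
    """Probability that a row of p readings has at least one missing value,
    assuming each reading is lost independently with probability q."""
    return 1.0 - (1.0 - q) ** p


# Illustrative numbers only: a 1% loss rate per reading.
for p in (10, 100, 500):
    print(p, round(row_discard_probability(0.01, p), 3))
```

Under this assumption, a 1% loss rate per reading would already cause about 63% of the rows of a 100-sensor feature matrix to be discarded by sample-wise elimination, even though 99% of the individual readings are valid.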
One challenge of data imputation is to avoid introducing unwanted data artifacts. In particular, it can be difficult to perform data imputation for multi-dimensional data with unknown multi-states and outliers, which may occur in sensor data for many industrial applications. For example, vibration sensors that measure a Root Mean Square (RMS) value of three-axis acceleration can be attached to equipment. Note that equipment can often operate with multiple states (e.g., off, idle, and active), each of which emits a distinctive vibration signal. Thus, three-axis sensor measurements can follow a three-dimensional (3D) Gaussian distribution given a hidden state of a unit of equipment under monitoring. Multi-dimensional data from such industrial sensors can include both null values and outliers. For a large-scale Industrial IoT application, e.g., with high-dimensional and multi-state sensor data, the challenge remains to automate the preprocessing of the data (including outlier elimination and missing-data imputation).
The embodiments described herein provide a system which facilitates automated data imputation for high-dimensional, multi-state sensor data with outliers. The system can perform operations in a feedback loop, by iterating through imputing data for missing values and identifying/eliminating outliers. For example, the system can learn a cluster model based on incoming raw data, resample missing values from the learned cluster model, and impute the missing values with the resampled data to obtain imputed data. The system can subsequently relearn the cluster model based on the imputed data, identify outliers from the learned cluster model, and eliminate the predicted outliers, to obtain filtered data. The system can loop back and repeat these operations, e.g., by relearning the cluster model based on the filtered data, etc. The system can iterate through these operations until a certain predetermined threshold is reached or another stopping criterion is met. The system can return final data as preprocessed data, e.g., in response to a request for preprocessed data or to be stored in a database as preprocessed data for subsequent usage. An exemplary high-level system environment is described below in relation to
Thus, by automating the process of data imputation in a matrix based on data obtained from multi-state sensors, the described embodiments provide a system which can address the challenge of preprocessing a large amount of multi-state sensor data (whose multiple states cannot be observed directly by the human eye). The system can perform an automated process which iterates, in a feedback loop, through imputing data by replacing missing or null values with resampled data based on a learned cluster model, and eliminating outliers based on an updated cluster model, until a certain predetermined threshold is reached. The end result can be a matrix for the multi-state sensor data, where the matrix does not include any missing or null values or outliers. This allows the system to retain the valid data, rather than discarding valid data for a particular timestamp based on invalid, missing, null, or outlier values, as in the prior art.
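The control flow of this feedback loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: a single Gaussian (mean and standard deviation) stands in for the cluster model, a fixed z-score bound stands in for outlier prediction, and the stopping criterion (no remaining outliers, or a maximum number of rounds) is assumed for illustration. All function names and parameter values are hypothetical.

```python
import random
import statistics


def fit_model(values):
    """Stand-in for cluster learning: mean and std of the non-null values."""
    observed = [v for v in values if v is not None]
    return statistics.fmean(observed), statistics.pstdev(observed)


def impute(values, model, rng):
    """Replace each null value with a sample drawn from the learned Gaussian."""
    mean, std = model
    return [rng.gauss(mean, std) if v is None else v for v in values]


def mask_outliers(values, model, z_max):
    """Mark points farther than z_max standard deviations from the mean as null."""
    mean, std = model
    return [None if abs(v - mean) > z_max * std else v for v in values]


def autoimpute(raw, z_max=3.0, max_rounds=20, seed=0):
    rng = random.Random(seed)
    model = fit_model(raw)                  # learn the model from raw data
    imputed = impute(raw, model, rng)       # first imputed data
    for _ in range(max_rounds):             # assumed stopping criterion
        model = fit_model(imputed)          # relearn on most recently imputed data
        filtered = mask_outliers(imputed, model, z_max)
        if None not in filtered:            # no outliers left to remove
            break
        model = fit_model(filtered)         # relearn on filtered data
        imputed = impute(filtered, model, rng)  # second imputed data
    return imputed


# Hypothetical single-sensor column: readings near 1.0, two nulls, one outlier.
raw = [1.0 + 0.01 * (i % 5 - 2) for i in range(30)]
raw[7], raw[19] = None, None
raw[12] = 50.0
clean = autoimpute(raw)
```

In this sketch the valid readings pass through unchanged, while the two null entries and the outlier entry are replaced with resampled values.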
The described embodiments of an overall system for automated data imputation include modules, components, or units which can interact in an iterative feedback loop, to provide a solution to the problem of data imputation for multi-state sensor data and outliers, including obtaining, storing, processing, and managing data to obtain preprocessed data, and subsequently using the preprocessed data in various technical applications. Thus, the disclosed system is directed to a solution which is both necessarily rooted in computer technology and provides a specific implementation of a solution to a problem in the software arts.
Furthermore, the described embodiments may be integrated into many different practical applications, i.e., used in many technical fields and for many different applications. For example, the described embodiments may be integrated into applications related to industrial Internet of Things, which can include interconnected sensors, instruments, and other physical devices networked together with industrial applications on various computing devices, including in the technical fields of manufacturing and energy management. Thus, the improvements provided by the disclosed system apply to several technologies and technical fields, including but not limited to: industrial IoT applications; machine data analytics; outlier removal; data imputation; and data mining of voluminous and error-prone sensor data.
The term “autoimputer” refers to the described embodiments of the overall system, which includes a cluster imputer and an outlier remover, and performs the functions described herein.
The terms “cluster imputer” and “cluster imputer module” are used interchangeably in this disclosure, and refer to a component or unit of the overall system which learns the cluster model and replaces missing or null values with resampled values, as described below in relation to
The terms “outlier remover” and “outlier remover module” are used interchangeably in this disclosure, and refer to a component or unit of the overall system which relearns or updates the cluster model based on imputed data, predicts outliers, and removes outliers by marking them as missing or null values, as described below in relation to
The terms “cluster outlier” and “cluster outlier module” are used interchangeably in this disclosure, and refer to a component or unit of the cluster imputer and the outlier remover modules, as described below in relation to
The terms “estimator” and “estimator module” are used interchangeably in this disclosure, and refer to a module in a cluster outlier which performs the operations described below in relation to
The terms “regenerate,” “relearn,” and “update” the cluster model are used interchangeably in this disclosure, and refer to updating a previously-generated cluster model based on imputed data, most recently imputed data, filtered data, updated data, or data that has been modified from data which was used to construct the previously-generated cluster model.
Outlier remover 114 can receive the imputed data (via communication 122), relearn the cluster model based on the imputed data, identify or predict outliers from the learned cluster model, and eliminate the identified or predicted outliers, to obtain filtered data (Xfiltered) 124. Outlier remover 114 can send filtered data (Xfiltered) 124 back to cluster imputer 112.
Cluster imputer 112 can update the current cluster model based on filtered data (Xfiltered) 124, resample missing values from the current cluster model, and impute the missing values with the resampled values, to obtain imputed data. The system can determine whether a certain predetermined threshold has been reached or a predetermined stopping criterion has been met. If it has, the system can return preprocessed data (Xout) 126 to database 102. If it has not, the system can iterate through the above operations, i.e., through outlier remover 114 and back to cluster imputer 112, as described above.
For a formal description of an exemplary algorithm, the following simplified matrix notations can be used. For an n×m matrix A=[aij]nm, an ith row vector and a jth column vector are denoted by Ai. and A.j where 1≤i≤n and 1≤j≤m. Assume that p feature sensors with n samples each can represent certain unknown states of a system of interest. In addition, assume that all sensor data is normalized to have a zero-mean and unit-variance after data preprocessing. Let Xnorm denote the normalized feature matrix defined by X=[xij]np, where xij is the ith sample of the jth feature sensor for 1≤i≤n and 1≤j≤p. The column vector X.j=[x1j, . . . , xnj] is data from feature sensor j. A vector or an array of x is denoted by x=(x1, . . . , xp).
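The zero-mean, unit-variance assumption can be illustrated with a minimal sketch of column-wise normalization that skips null entries and returns the per-column parameters needed to map values back to sensor units. The function names, the use of `None` to represent a null value, and the (mean, std) parameter layout are assumptions made for illustration.

```python
import statistics


def normalize_columns(X):
    """Z-score each column of a row-major matrix, skipping None (null) entries.

    Returns the normalized matrix and a list of per-column (mean, std) pairs.
    """
    params = []
    for j in range(len(X[0])):
        col = [row[j] for row in X if row[j] is not None]
        mean = statistics.fmean(col)
        std = statistics.pstdev(col) or 1.0  # guard against constant columns
        params.append((mean, std))
    X_norm = [
        [None if v is None else (v - params[j][0]) / params[j][1]
         for j, v in enumerate(row)]
        for row in X
    ]
    return X_norm, params


def denormalize_columns(X_norm, params):
    """Invert normalize_columns, leaving null entries as None."""
    return [
        [None if v is None else v * params[j][1] + params[j][0]
         for j, v in enumerate(row)]
        for row in X_norm
    ]


# Hypothetical 3-sample, 2-sensor matrix with one null reading.
X = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
X_norm, params = normalize_columns(X)
X_back = denormalize_columns(X_norm, params)
```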
Outlier remover 220 can receive, by cluster outlier 222, imputed data (X̃imputed) 242. Cluster outlier 222 can relearn or update the cluster model based on imputed data (X̃imputed) 242. Cluster outlier 222 can also generate predicted outliers (Ŷoutlier) 244, as an outlier label where a value of −1 indicates an outlier sample, and a value of +1 indicates an inlier sample. Cluster outlier 222 can transmit predicted outliers (Ŷoutlier) 244 to null value imputer 224, which can mask the detected outlier samples with a null value. This can result in outlier-filtered, null value-corrupted data, i.e., filtered data (X̃filtered) 246. Outlier remover 220 can subsequently send filtered data (X̃filtered) 246 back to cluster imputer 210.
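The masking step performed on the predicted outlier labels can be sketched in a few lines. This is an illustration only: it assumes a single sensor column with one label per entry, uses `None` as the null value, and the function name is hypothetical.

```python
def mask_outliers(column, labels):
    """Replace each entry labeled -1 (outlier) with None; keep +1 (inlier) entries."""
    if len(column) != len(labels):
        raise ValueError("column and labels must have the same length")
    return [None if y == -1 else x for x, y in zip(column, labels)]


# Hypothetical imputed column and predicted outlier labels.
filtered = mask_outliers([0.1, 9.0, -0.2], [1, -1, 1])
```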
The system depicted in architecture 200 can iterate through the above-described operations until a predetermined threshold has been reached or until a predetermined stopping criterion has been met. When the predetermined threshold or stopping criterion is detected, the system can return final data generated based on the current cluster model as preprocessed data (Xout) 280, which is obtained based on imputed data (X̃imputed) 242 through the iterations.
Detailed Description of Cluster Outlier Module and Exemplary Diagram with Clusters, Outlier Points, and Outlier Cluster
The cluster outlier module (e.g., cluster outlier 212 and cluster outlier 222 of
where 0≤wk≤1 is the weight probability with Σk wk=1 and N(x|uk, σk) is a Gaussian distribution of the random variable x with a mean uk and standard deviation σk of cluster k. An outlier can be defined by outlier clusters whose weight probability wk is less than a user-defined threshold wmin and outlier points which are outside the confidence interval (xkl, xku) given a user-provided confidence level αc.i such that N(xkl≤x≤xku|uk, σk)≤αc.i.
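For reference, the definitions above match the standard weighted-sum form of a Gaussian mixture density. The following is written in that standard notation as a sketch, with K denoting the number of clusters, and is not a reproduction of the equation in this disclosure:

```latex
p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid u_k, \sigma_k),
\qquad 0 \le w_k \le 1,
\qquad \sum_{k=1}^{K} w_k = 1
```

Under this form, cluster k is treated as an outlier cluster when $w_k < w_{\min}$.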
The system can determine that the data at times t3, t6, and t10 contain missing or null values (“lost data”). As discussed above, a naïve approach is to discard the entirety of the data at times t3, t6, and t10. However, this would result in discarding valid data, and would be a waste of the obtained valid data (e.g., for sensor_2340). Because the system has already generated the cluster model, the system can replace the lost data with representative data (i.e., imputed data or resampled data) based on the Gaussian distribution.
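Replacing lost data with resampled values can be sketched as follows for a single sensor column. This is a minimal illustration which assumes that the cluster for each lost value is chosen at random in proportion to the cluster weights (an assumption made here; the disclosure does not specify how the cluster is selected) and that each cluster is summarized by a hypothetical (weight, mean, standard deviation) tuple.

```python
import random


def resample_missing(column, clusters, seed=0):
    """Fill None entries with samples drawn from a weighted set of Gaussians.

    clusters: list of (weight, mean, std) tuples, one per cluster.
    """
    rng = random.Random(seed)
    weights = [w for w, _, _ in clusters]
    out = []
    for v in column:
        if v is None:
            # Pick a cluster in proportion to its weight, then sample from it.
            _, mean, std = rng.choices(clusters, weights=weights, k=1)[0]
            v = rng.gauss(mean, std)
        out.append(v)
    return out


# Hypothetical two-state model (e.g., idle near 0.2 and active near 1.5).
model = [(0.4, 0.2, 0.05), (0.6, 1.5, 0.10)]
column = [0.21, None, 1.48, 1.52, None]
imputed = resample_missing(column, model)
```

The valid readings are kept as they are, so only the entries that were lost are replaced.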
The system can subsequently update the cluster model based on the imputed data, identify the outliers, and remove the outliers by replacing the outliers with null values (as described above in relation to
Estimator 410, by normalizer module 412, takes as input data Xin 440, and normalizes or transforms Xin 440 to Xnorm 442, which can have a zero mean and a unit variance for each column. This normalization can produce normalizer parameter θnorm 444. Cluster learning module 414 can take Xnorm 442 and estimate the optimal number of clusters and the density distribution of each cluster, which produces a parameter tuple of clusters θclo 446, wherein θclo=(w, u, σ) for each column such that w=(w1, . . . , wK
Next, build label hashtable module 416 can take θclo 446 and build a label hash table for all clusters, such that a cluster with wk<wmin is assigned to an outlier cluster labeled by −1 and each other valid cluster is reassigned a new unique label. Build label hashtable module 416 can store the label reassignment in the label hash table as hash table parameter θhash. Inlier bound estimation module 418 can compute the inlier bound of a normalized column for each cluster based on θclo 446 (fitted cluster model) and αc.i 456 (user-defined confidence level), which results in producing θbound 458 for all columns. Thus, estimator 410 can produce a set of parameters θparam=(θnorm, θclo, θhash, θbound) which are learned from input data Xin 440 and user-defined control parameters wmin 450 and αc.i 456. By default, the system can use a MeanShift Clustering Algorithm for cluster learning, which empirically shows the best performance for multi-state sensor data when the number of clusters is not provided.
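The label hash table and the inlier bounds described above can be sketched for a single normalized column as follows. This is a minimal illustration under stated assumptions: the confidence level is interpreted as the probability mass of a central, two-sided interval of each cluster's Gaussian, valid clusters are relabeled with consecutive integers, and all function names and values are hypothetical.

```python
from statistics import NormalDist


def build_label_table(weights, w_min):
    """Map cluster index -> -1 if its weight is below w_min, else a new unique label."""
    table, next_label = {}, 0
    for k, w in enumerate(weights):
        if w < w_min:
            table[k] = -1
        else:
            table[k] = next_label
            next_label += 1
    return table


def inlier_bounds(means, stds, confidence):
    """Central interval holding `confidence` of each cluster's Gaussian mass."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return [(u - z * s, u + z * s) for u, s in zip(means, stds)]


def outlier_label(x, k, table, bounds):
    """Return -1 (outlier) or +1 (inlier) for value x assigned to cluster k."""
    if table[k] == -1:          # point belongs to an outlier cluster
        return -1
    lower, upper = bounds[k]
    return 1 if lower <= x <= upper else -1


# Hypothetical three-cluster model; the second cluster is too small to be valid.
weights, means, stds = [0.60, 0.02, 0.38], [0.0, 5.0, 10.0], [1.0, 1.0, 2.0]
table = build_label_table(weights, w_min=0.05)
bounds = inlier_bounds(means, stds, confidence=0.95)
```

With these values, a point at 0.5 assigned to the first cluster is labeled as an inlier, a point at 3.0 assigned to the same cluster falls outside its bound and is labeled as an outlier, and any point assigned to the second cluster is labeled as an outlier regardless of its value.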
Predictor 420 can use the learned parameter θparam for each step, as depicted in
to main detected states Z̃state 470. This can ensure that the state labels have a strong temporal correlation and are further robust against noise. Finally, outlier detection module 430 can produce an outlier label Ỹoutlier 472 by checking whether Xnorm 462 is within inlier bounds θbound 458 for detected states Z̃state 470.
System 501 can be an industrial system and sensors 504 can include industrial sensors, e.g., operating in an industrial setting with various equipment. Device 520 can obtain readings 506 from multi-state sensors 504 of system 501. Sensors 504 can include vibration sensors attached to equipment (not shown) of system 501, where the equipment can operate in multiple states, e.g., off, idle, and active, where a respective sensor can emit a distinctive vibration signal depending on its state. Device 520 can store readings 506 in database 522 as, e.g., raw data 508. As discussed above, raw data 508 (obtained from industrial multi-state sensors 504) may include null values, missing values, and outliers.
During operation, device 520 can transmit raw data to device 518, e.g., in response to a request raw data 524 communication received from device 518, or in response to a user-generated command 532 (from user 514 via device 512) to generate a model. Device 520 can return raw data 526. Device 518 can receive raw data 526 (as raw data 528) along with user command 532 to generate the model (as a command 534). Device 518 can perform the operations described above in relation to
Device 518 can also send preprocessed data 550 and model 552 to device 512. Device 512 can receive preprocessed data 550 (as preprocessed data 558) and model 552 (as model 560), and can display on the screen of display 516 interactive elements 562 (which allow user 514 to, e.g., view the model and view the preprocessed data). Display 516 can also include interactive graphical user interface elements and a visual representation of each iteration of the model 564, which is generated as part of receiving model 560. User 514 can select an interactive element on display 516, which can correspond to, e.g.: viewing the cluster model in detail, as described above in relation to diagram 300 of
In some embodiments, user 514 can use an interactive element displayed on display 516 to locally modify the preprocessed data (e.g., by manually inserting or deleting data for one or more timestamps via device 512), and send a command 572 to regenerate the model with the modified data. Device 518 can receive command 572 (as a command 574) and can perform operations 536-548 as described above, using the modified data instead of the raw data as the initial data. Device 518 can subsequently return updated preprocessed data and an updated model back to device 512, for subsequent redisplay (e.g., updated display) on display 516.
In some embodiments, user 514 can use an interactive element displayed on display 516 to create a model based on additional data, e.g., by adding a set of sensors for a plant sublevel, a plant subsystem, or an asset.
Thus, environment 500 depicts exemplary entities and communications which facilitate automated data imputation. Environment 500 can also include user actions performed in response to device 518 performing the preprocessing of the data (i.e., automated data imputation), e.g., user 514 can use actionable and interactive graphical user interface elements on display 516 associated with device 512. User 514 can also manipulate an interface to modify the data which is preprocessed and used to construct the model, as described above in relation to the exemplary display screens of
The system iterates through one or more or all of operations 610-618 until the predetermined threshold has been reached. If the predetermined threshold has been reached (decision 608), the system returns final data generated based on the cluster model (i.e., the current cluster model as most recently updated in the iterative rounds), and the operation returns.
The system updates, by the outlier remover module, the cluster model based on the most recently imputed data (operation 724) (e.g., the most recently imputed data can be the first or the second imputed data, depending on the iteration or round). The system identifies and predicts, by the outlier remover module, outliers based on the cluster model (operation 726). The system masks, by the outlier remover module, the outliers with null values to obtain filtered data (operation 728). The filtered data can include outlier-filtered null value-corrupted data. The outlier remover module can send to the cluster imputer module the filtered data, and the cluster imputer module can receive from the outlier remover module the filtered data (not shown).
The system updates, by the cluster imputer module, the cluster model based on the filtered data (operation 730). The system generates, by the cluster imputer module, new samples (second imputed data) for the missing or null values (operation 732). The system replaces, by the cluster imputer module, the null values with the second imputed data (operation 734), and the operation returns to decision 708 of
Note that in operations 704, 724, and 730 (i.e., the cluster imputer module generating the cluster model based on the raw data, the outlier remover module updating the cluster model based on the most recently imputed data, and the cluster imputer module updating the cluster model based on the filtered data), the system can perform operations as described above in relation to
Content-processing system 818 can include instructions, which when executed by computer system 802, can cause computer system 802 to perform methods and/or processes described in this disclosure. Specifically, content-processing system 818 may include instructions for sending and/or receiving/obtaining data packets to/from other network nodes across a computer network (communication module 820). A data packet can include a request, data, raw data, imputed data, filtered data, a state, and a classification.
Content-processing system 818 can further include instructions for generating a cluster model based on raw data obtained from sensors with multiple states, wherein the raw data includes missing values (first cluster model-generating module 822). Content-processing system 818 can include instructions for replacing the missing values with first imputed data based on the cluster model (data-resampling/imputing module 824). Content-processing system 818 can include instructions for iterating, until a predetermined threshold has been reached (threshold-detecting module 832), through a series of operations which include the following operations. Content-processing system 818 can include instructions for: updating the cluster model based on most recently imputed data (second cluster model-generating module 826); predicting outliers based on the cluster model (outlier-detecting module 828); marking the outliers as null values to obtain filtered data (missing/null value-managing module 830); updating the cluster model based on the filtered data (first cluster model-generating module 822); and replacing the null values with second imputed data based on the cluster model (data-resampling/imputing module 824).
Data 834 can include any data that is required as input or that is generated as output by the methods and/or processes described in this disclosure. Specifically, data 834 can store at least: data; a command; a request; raw data; imputed data; most recently imputed data; filtered data; normalized or denormalized data; final data; preprocessed data; data with missing values or null values; a null value; a matrix; a model; a cluster model; a result of an iteration; a predetermined threshold; a stopping criterion; a number of rows; data associated with a sensor; data obtained from sensors with multiple states; an indicator of a sensor; a clustering algorithm; a MeanShift Clustering Algorithm; an outlier point; an outlier cluster; a cluster; a sample; a null value-free matrix; a number of clusters; a number of centroids; a standard deviation; a weight probability; a random variable; a mean; a confidence interval; a user-provided confidence interval; a density distribution; a parameter; a parameter tuple; a normalizer parameter; a hash table; a label; an outlier or inlier label; an inlier bound; a user-defined control parameter; and a state.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
Furthermore, the methods and processes described above can be included in hardware modules or apparatus. The hardware modules or apparatus can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), dedicated or shared processors that execute a particular software module or a piece of code at a particular time, and other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.