The invention relates to a method of processing and storing data for the real time anomaly detection problem. The proposed method builds on anomaly detection technology and is applied in the field of real time computing.
Typically, the data processing and storing method for real time anomaly detection is represented by the following simplified steps:
Step 1: incoming data will be stored in the database.
Step 2: perform a comparison of the incoming data with past data points to conclude whether the incoming data is anomalous or not and then issue warnings.
However, as the number of historical data points to be used for comparison increases, three problems arise:
The first is that the computer needs to hold a large amount of historical data in random access memory (RAM), while the amount of RAM is limited.
The second is that repeatedly retrieving historical data from the database is costly and, in the long run, can lead to database failure.
The third is the increased computation time, while for the real time anomaly detection problem (the problem of time constraints from the occurrence of an event until the system responds to that event), the computation time of basic operations needs to reach a certain speed or time limit.
The method of processing and storing data for the real time anomaly detection problem addresses the above three problems. It supports real time detection of anomalous data and offers an approach that can be applied to similar problems to speed up computation.
The purpose of the present invention is to provide a method of processing and storing data for the real time anomaly detection problem. This method increases computing power many times over, depending on how data storage and computation are divided in random access memory (RAM).
To achieve the foregoing, the present invention provides a method of processing and storing data for the real time anomaly detection problem with the following specific implementation steps:
Step 1: build a historical database over time and a database of mean and standard deviation values. More specifically, data arriving at the system is saved to the database according to its timestamp; after each specified time period, the data is averaged and the result is saved to the database.
Step 2: select the number of blocks and the number of points in each block, divide the historical data into blocks of equal size, and build formulas to calculate the mean and standard deviation of each data block and the mean and median standard deviation of the whole data:
In fact, the detection of data anomalies using different algorithms requires different data processing and storage. For algorithms that require the use of the mean and the median standard deviation of historical data to make an outlier assessment, the following steps apply:
Step 2.1: divide the historical data into equal blocks. Namely, suppose the historical data used to compute the mean and standard deviation consists of n×m data points; we divide it into m data blocks, each containing n data points.
Step 2.2: determine the number of historical data points to use.
Step 2.3: construct formulas to calculate the mean and standard deviation of each data block and the mean and median standard deviation of the whole data.
Step 3: create an independently running data mapping process that reads collected data, normalizes the data, and interacts with the in-memory database to write historical data according to time.
Step 4: calculate the mean and standard deviation of the data blocks and the mean and median standard deviation of the whole data, and store them in the database in random access memory (RAM).
To perform anomaly detection according to the data division in Step 2, we use two independent processes. The first calculates the mean and standard deviation for a block once n points have been collected for that block, and then for all historical data, as shown in Step 4.1. The second, the real time anomalous data detection process, reads the data in real time and checks whether each data point is anomalous, as described in Step 4.2.
Step 4.1: calculate the mean and standard deviation of the data blocks and the mean and median standard deviation of the whole data, and save them in the database with the data structure shown in Table 2, stored directly in RAM:
The process of calculating the mean and standard deviation is scheduled to execute after every n×t interval: because data is written to the database once per time period t, n new points are available after n×t, at which point we proceed with the following steps.
Step 4.1.1: read the historical data of the last n points in the database stored in Step 3.
Step 4.1.2: calculate the mean and standard deviation of the n points obtained.
Step 4.1.3: calculate the mean and the median standard deviation of all historical data stored in the database, based on the means and standard deviations of up to m−1 previously calculated data blocks and the mean and standard deviation of the nearest n points, using the formulas established in Step 2.3.
Step 4.1.4: store the mean of the last n points, the standard deviation of the last n points, the mean of all historical data, and the median standard deviation of all historical data in the database structured as Table 2, ready to be queried.
Step 4.2: the real time anomaly detection process reads real time data from the database and performs anomaly detection.
Because the mean and the median standard deviation of the historical data have already been calculated in Step 4.1, it is not necessary to recalculate them each time new incoming data is available. This speeds up the anomaly detection computation and enables a real time response.
This solution helps to solve the problem of real time calculation for anomalous data detection, avoiding repeated hard drive scans and repeated opening and closing of database files.
In order to describe the invention in a more coherent, clear and understandable manner, the figures below depict parts of the invention:
The Anomaly Detection System detects abnormal data occurring in the system. The requirement is that an anomaly be detected as soon as possible, in other words in real time, to minimize the risk of impact on the system.
The method of processing and storing data for real time anomaly detection problem proposed in the present invention consists of sequential implementation steps detailed below:
Step 1: build a historical database over time, a database of mean and standard deviation.
Refer to
System data is collected by agents installed on the servers and includes information such as the percentage of central processor usage, the percentage of internal memory used, network latency, etc. The data is stored on a centralized messaging system so that different systems can use the same data source. An independently running data mapping process (Process Mapping Data) reads data from the centralized messaging system, normalizes it, and interacts with the in-memory database management system (in-memory database) to write data over time.
The content of the data includes: the time the data was written, the source of the data to be written, the value of the data to be written.
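As an illustration only, the three items listed above could be represented as a small record type. The class name `MetricRecord`, the field names, and the sample values are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRecord:
    """One data point as described above (all names are hypothetical)."""
    timestamp: float  # time the data was written, here as Unix seconds
    source: str       # source of the data, e.g. a server and metric name
    value: float      # value of the data as a real number


# Example record for a CPU usage reading (sample values are made up).
record = MetricRecord(
    timestamp=1625097600.0,
    source="server-01/cpu_percent",
    value=42.5,
)
```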
When a record is sent to the messaging system, the Data Mapping Process writes the data to the database with the following structure:
In addition, it is necessary to build a database storage structure for the mean and standard deviation values of historical data points as follows:
Step 2: select the number of blocks and the number of points in each block, divide the historical data into blocks of equal size, and build formulas to calculate the mean and standard deviation of each data block and of the whole data:
Starting from the idea of dividing historical data into smaller blocks to facilitate real time anomaly detection, the mean and standard deviation are calculated as follows. For the mean, the average of n×m data points equals the average of the arithmetic means of the m blocks, where each block has n data points. For the standard deviation, the standard deviation of each of the m blocks of n data points is calculated, and these block values are combined into a single value for the whole data (the median of the block standard deviations, as defined in Step 2.3). Specifically, the method of dividing data into blocks and calculating the mean and standard deviation of each block and of the whole data includes the following steps:
Step 2.1: divide the historical data into equal blocks: assuming the historical data to be averaged consists of n×m data points, we divide it into m data blocks, each containing n data points. The choice of the two parameters n and m depends on the characteristics of each data type and on the required trade-off between data processing speed and how well the data average serves to detect outlier data. For example, when we use more blocks (large m) and each block has a large number of points (large n), data processing is slower and the comparison of new incoming data with the data average is less accurate.
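A minimal sketch of this division, assuming the historical points are held in a simple in-order sequence. The function name `split_into_blocks` and the sample data are hypothetical.

```python
from typing import List, Sequence


def split_into_blocks(points: Sequence[float], n: int) -> List[List[float]]:
    """Split points into consecutive blocks of exactly n points each."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(points) % n != 0:
        raise ValueError("the number of points must be a multiple of n")
    return [list(points[i:i + n]) for i in range(0, len(points), n)]


# n x m = 2 x 3 historical points, divided into m = 3 blocks of n = 2 points.
history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
blocks = split_into_blocks(history, n=2)
```

With these sample values, `blocks` holds three lists of two points each, in their original order.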
Step 2.2: determine the historical data points to use. These points are the past data preceding the present time; assume they are denoted by
$a_{11}, a_{21}, \ldots, a_{n1}, a_{12}, a_{22}, \ldots, a_{n2}, \ldots, a_{1m}, a_{2m}, \ldots, a_{nm}$, which are the first data point, the second data point, . . . , the (n×m)-th data point, respectively.
Step 2.3: The mean (denoted by mean) is calculated by adding all the data points and dividing the result by the number of data points, and the median standard deviation (denoted by median_std) is calculated as the median of the standard deviations of the smaller blocks. Here is the formula:
In which, the standard deviation of each block (denoted by std_block_i) is calculated according to the following formula:
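The formulas themselves do not appear in this text. One reading consistent with the definitions in Steps 2.2 and 2.3 is sketched below, where $a_{ki}$ is the k-th point of block i. The symbol mean_block_i and the population form of the standard deviation (division by n rather than n−1) are assumptions.

```latex
\begin{align*}
\mathrm{mean\_block}_i &= \frac{1}{n}\sum_{k=1}^{n} a_{ki}, \qquad i = 1, \ldots, m \\
\mathrm{mean} &= \frac{1}{n \times m}\sum_{i=1}^{m}\sum_{k=1}^{n} a_{ki}
              = \frac{1}{m}\sum_{i=1}^{m} \mathrm{mean\_block}_i \\
\mathrm{std\_block}_i &= \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(a_{ki} - \mathrm{mean\_block}_i\right)^{2}} \\
\mathrm{median\_std} &= \operatorname{median}\left(\mathrm{std\_block}_1, \ldots, \mathrm{std\_block}_m\right)
\end{align*}
```

The second equality for the mean holds because every block contains the same number of points n, which is why only the m block means need to be kept to obtain the mean of the whole data.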
Refer to
Step 3: the data mapping process (called Process Mapping Data) runs independently to read the collected data. Because the data collected by the agents is often in raw form (usually in JSON, JavaScript Object Notation, format) and includes many different fields, we need to extract the fields required for anomaly detection and normalize the data to real number format. The normalized data is written to the in-memory database by the process over time. The data in the database has the data structure shown in Table 1.
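The mapping step can be sketched as follows, under stated assumptions: the JSON keys `timestamp`, `host` and `cpu_percent` are hypothetical, and a plain Python dictionary stands in for the in-memory database, whose actual product and schema are not specified here.

```python
import json
from typing import Dict, Tuple


def map_raw_record(raw_json: str) -> Tuple[float, str, float]:
    """Extract (timestamp, source, value) from a raw JSON message.

    The keys used here are hypothetical; real agent payloads will
    define their own field names.
    """
    payload = json.loads(raw_json)
    timestamp = float(payload["timestamp"])
    source = str(payload["host"])
    value = float(payload["cpu_percent"])  # normalize to a real number
    return timestamp, source, value


# A dict keyed by (source, timestamp) stands in for the in-memory database.
in_memory_db: Dict[Tuple[str, float], float] = {}

raw = '{"timestamp": 1625097600, "host": "server-01", "cpu_percent": "42.5", "os": "linux"}'
ts, src, val = map_raw_record(raw)
in_memory_db[(src, ts)] = val
```

Fields that are not needed for anomaly detection (here `os`) are simply ignored, and the value, which arrives as a string in this sample, is converted to a float before it is written.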
Step 4: perform anomaly detection on incoming data using the mean and the median standard deviation of historical data already stored in the database in random access memory (RAM).
To perform anomaly detection according to the data division in Step 2, we use two independent processes. The first calculates the mean and standard deviation for a block once n points have been collected for that block, and then for all historical data, as shown in Step 4.1. The second, the real time anomalous data detection process, reads the data in real time and checks whether each data point is anomalous, as described in Step 4.2. As follows:
Step 4.1: calculate the mean and standard deviation of the data blocks and the mean and median standard deviation of the whole data, and save them in the database of mean and standard deviation values, which has the data structure shown in Table 2 and is stored directly in RAM:
The process of calculating the mean and standard deviation is scheduled to execute after every n×t interval: because data is written to the database once per time period t, n new points are available after n×t, at which point we proceed with the next steps.
Step 4.1.1: read historical data for the last n points in the database stored in step 3.
Step 4.1.2: calculate the mean and standard deviation of the n points obtained.
Step 4.1.3: calculate the mean of all historical data blocks stored in the database: based on the means and standard deviations of up to m−1 previously calculated data blocks, and the mean and standard deviation of the nearest n points, we can calculate the mean and the median standard deviation of all historical data using the formulas established in Step 2.3.
Step 4.1.4: save the mean of the last n points, the standard deviation of the last n points, the mean of all historical data, and the median standard deviation of all historical data in the database structured as Table 2, ready to be queried.
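Steps 4.1.1 to 4.1.4 can be sketched as below. This is an illustration under assumptions, not the specified implementation: the class name `BlockStatsStore`, the dictionary keys, and the use of the population standard deviation (`statistics.pstdev`) are all hypothetical choices. A bounded deque keeps the statistics of at most m blocks, so the oldest block drops out when a new one arrives.

```python
from collections import deque
from statistics import fmean, median, pstdev
from typing import Deque, Dict, Sequence, Tuple


class BlockStatsStore:
    """Keeps (mean, std) for at most m blocks and derives whole-data statistics."""

    def __init__(self, m: int) -> None:
        # Each entry is (block_mean, block_std); the oldest is evicted beyond m.
        self.blocks: Deque[Tuple[float, float]] = deque(maxlen=m)

    def add_block(self, last_n_points: Sequence[float]) -> Dict[str, float]:
        """Process the newest n points and return the row to be stored."""
        block_mean = fmean(last_n_points)
        block_std = pstdev(last_n_points)  # population form: an assumption
        self.blocks.append((block_mean, block_std))
        return {
            "block_mean": block_mean,
            "block_std": block_std,
            # Mean of block means equals the overall mean for equal-size blocks.
            "mean": fmean(b[0] for b in self.blocks),
            "median_std": median(b[1] for b in self.blocks),
        }


store = BlockStatsStore(m=3)
store.add_block([1.0, 3.0])        # block mean 2.0, block std 1.0
row = store.add_block([2.0, 6.0])  # block mean 4.0, block std 2.0
```

Only the per-block statistics are kept, not the raw points of earlier blocks, so each scheduled run touches just the n newest points plus at most m stored pairs.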
Step 4.2: the real time anomaly detection process reads real time data from the database and performs anomaly detection:
Incoming data is checked for an anomalous condition by a statistics-based parametric method, namely an algorithm based on the mean and standard deviation of the historical data, as follows:
Let xcurrent be the current value of the data obtained, and let mean and median_std be, respectively, the mean and the median standard deviation of the most recent historical data preceding the current data point, as calculated in Step 4.1.3. Then:
Here, factor is determined by the empirical rule and is usually taken as 3.
If xcurrent is an abnormal data point, it is saved in the database and sent directly to the alarm system so that the operator of the network system can check and correct the error.
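The decision rule is not reproduced in this text. A common form of such a check, assumed here, flags a point when it deviates from mean by more than factor × median_std. The function name `is_anomalous` and the sample values are hypothetical.

```python
def is_anomalous(x_current: float, mean: float, median_std: float,
                 factor: float = 3.0) -> bool:
    """Return True if x_current lies more than factor * median_std from mean.

    The inequality is an assumed reading of the rule; factor defaults
    to 3, the value the text says is usually taken.
    """
    return abs(x_current - mean) > factor * median_std


# With mean = 3.0 and median_std = 1.5, the threshold distance is 4.5.
spike = is_anomalous(10.0, mean=3.0, median_std=1.5)   # |10 - 3| = 7 > 4.5
normal = is_anomalous(5.0, mean=3.0, median_std=1.5)   # |5 - 3| = 2 <= 4.5
```

Because mean and median_std are read from the precomputed values rather than recalculated, the check itself costs a constant number of arithmetic operations per incoming point.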
Refer to
The method solves the problem of real time anomaly detection, with an anomaly response time of less than 1 minute (from the appearance of the anomaly to the issuing of a warning).
It saves storage cost in RAM and avoids repeated scanning of the hard drive.
Number | Date | Country | Kind |
---|---|---|---|
1-2021-04085 | Jul 2021 | VN | national |