The present disclosure generally relates to networking telemetry. More particularly, the present disclosure relates to systems and methods for compressing network data by deploying a Deep Neural Network (DNN) in the network such that the network data can be regenerated using the predictive nature of the DNN.
Useful network data is high in volume, making it difficult to store for extended periods. For example, a network with a million data sources may generate 46 GB of one-minute sampled data in one day, which translates to about 16 TB of data per year. While the number of data sources may seem high, even in a traditional Internet Protocol (IP) network, this number is within the realm of possibility when each flow is considered a data source. The number of data sources would be much higher when considering cloud services, IoT networks with billions of devices, multiple network layers, or high-sampling-rate network measurements (e.g., state of polarization or wireless SNR measurements).
Due to cost, network measurement data is not stored for extended periods, or it is aggregated over larger periods of time (e.g., daily, weekly, monthly, yearly), thus losing fidelity in an important historical record of what has happened in the network. Aggregation/averaging is a very crude form of lossy compression for time-series data. For example, averaging represents a time-series with a single number over a period, so its accuracy is poor.
Due to the large amount of storage typically required for time-series data, a special variation of a database, known as a Time-Series Database (TSDB), has been developed to facilitate the storage, reading, and writing of time-series data. TSDBs make use of chunks to divide time-series by time interval and a primary key. Each of these chunks is stored as a table for which the query planner is aware of the chunk's ranges (in time and keyspace), allowing the query planner to immediately identify which chunks an operation's data belongs to. Lossless compression is used to reduce the amount of data in TSDBs.
One of these compression schemes is XOR-based compression, which is reported to achieve 1.2×-4.2× compression ratios on floating-point numbers, depending on the dataset used. In this algorithm, successive floating-point numbers are XORed together, which means that only the differing bits are stored. The first data point is stored with no compression, whereas subsequent data points are represented using their XORed values, encoded using a bit-packing scheme, which is covered, for example, in www.vldb.org/pvldb/vol8/p1816-teller.pdf. The efficacy of XOR-based compression relies on subsequent values in the time-series being close to one another: if consecutive values in the time-series are close, then fewer bits need to be stored to produce lossless compression.
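For illustration, a minimal Python sketch of the XOR step follows (the bit-packing stage and the leading-zero/meaningful-bits encoding from the cited paper are omitted, and the function names are ours):

```python
import struct

def float_to_bits(x):
    """Reinterpret a 64-bit float as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_stream(values):
    """Yield the first value's raw bits uncompressed, then the XOR of
    each value's bits with the previous value's bits. When consecutive
    values are close, the XOR is mostly zero, which a bit-packing
    stage (not shown) can encode in very few bits."""
    prev = float_to_bits(values[0])
    yield prev
    for v in values[1:]:
        cur = float_to_bits(v)
        yield prev ^ cur
        prev = cur

# Identical neighbors XOR to 0, the cheapest case to pack.
print([hex(w) for w in xor_stream([12.0, 12.0, 24.0])])
```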
Although there exist other compression tools for floating-point numbers, they typically do not measure up to XOR-based compression. For example, the popular compression tools GZIP, LZF, and SZIP offer compression ratios of 1.25×, 1.18×, and 1.19×, respectively, when attempting to compress the floating-point representation of a sine wave. Other commercial compression schemes such as fpzip can achieve compression ratios of 1.5×-2.74×. It should be noted that all the aforementioned compression methods are lossless. For lossy compression, the ZFP compressor has been shown to achieve an average compression ratio of 4.41× with an accuracy (i.e., the number of bits in agreement between two floating-point numbers, whereby the lossless case corresponds to an accuracy of 64) of 34.1 for a 32-bit precision floating-point representation, and a compression ratio of 18.1× with an accuracy of 21.0 for a 16-bit representation.
However, the known solutions have several shortcomings. In the instance of lossless compression techniques, the compression ratios are typically less than an order of magnitude. Due to the immense quantity of time-series data that needs to be stored, these compression schemes, even if they are lossless, do not offer a sufficiently large compression ratio to justify their use. This becomes even more apparent when considering the overhead needed to constantly decompress stored time-series.
One attempt at data compression using Deep Neural Networks (DNNs) is known as semantic compression. Particularly, a state-of-the-art semantic compression approach is called DeepSqueeze. DeepSqueeze uses autoencoders for data compression of each row in a relational table. Autoencoders, for example, refer to a type of unsupervised artificial neural network used to learn efficient encodings or representations of unlabeled data by attempting to regenerate inputs from encodings.
The DeepSqueeze algorithm has a few shortcomings as well. The technique requires significant overhead in the form of training multiple models and functions and storing all the weights associated with this multitude of models. It is designed to compress rows of a relational table, not windows of a time-series. The algorithm is designed for tabular data, and specifically for finding the correlation between different columns in such data, making it unfavorable for dealing with time-series. Based on published data, the performance of the algorithm is also several orders of magnitude lower than needed. Thus, this approach has no clear adaptation for time-series data.
Systems and methods for compressing network data are provided. According to one implementation, a method includes the step of collecting raw telemetry data from a network environment. The raw telemetry data is collected as time-series datasets. The method also includes the step of compressing the time-series datasets by deploying the time-series datasets as a Deep Neural Network (DNN) in the network environment itself. The time-series datasets are configured to be substantially reconstructed from the DNN using predictive functionality of the DNN.
The present disclosure is illustrated and described herein with reference to the various drawings, in which like reference numbers are used to denote like system components/method steps, as appropriate.
In various embodiments, the present disclosure relates to systems and methods for storing network data with Deep Neural Networks (DNN) “memorization” and network telemetry reduction by pruning and recovery of network data. The embodiments of the present disclosure improve upon conventional DNN techniques for compression and achieve much better lossy compression. Compressed measurements could be any number of network observations, such as packet counters, latency, jitter, Signal-to-Noise Ratio (SNR) estimates, State of Polarization (SOP) measurements, wireless Channel Quality Indicator (CQI) reports, alarm states, Performance Monitoring (PM) data, among other types of network data.
In the present disclosure, embodiments of systems and methods are described for “storing” network data in a DNN by deploying the DNN in the network, which greatly reduces the volume of information that needs to be stored. In this sense, data can be compressed by orders of magnitude, which is very useful for reducing the storage requirements on network data. There are multiple benefits to reducing the volume of measurement data produced in the network. For example, reducing data volume reduces the cost of storing the data once it arrives at the point of analysis. Also, reducing data volume reduces the cost of transferring data.
Network data includes several (n) time-series ts_1, . . . , ts_n sampled at the same rate. For network measurements, each time-series will have the same number of samples over the same period. The samples are pooled together into buckets, so ts_i^k = (x_1, . . . , x_p) are the p samples collected in the k-th period for time-series i. For network status (e.g., alarms), each time-series will have a number of indicators in the same period, so ts_i^k = (x_1, . . . , x_p) could be the states of p alarms in the period, or the p alarm counts in the period.
Thus, the compression schemes, procedures, algorithms, techniques, methods, processes, etc. described in the present disclosure include a unique sequence of steps that greatly improves upon prior compression attempts. Essentially, compressing data includes training a DNN model (e.g., DNN 20) and then configuring or deploying the compressed data as the DNN 20, instead of actually storing the compressed data in a database. As described herein, the compression strategy of the present disclosure includes obtaining or collecting the raw (telemetry) data in the form of time-series datasets. The telemetry data can be compressed in windows or chunks of time in the time-series data, which is different from conventional semantic compression (e.g., DeepSqueeze), which uses an autoencoder for data compression of each row in a relational table. In the present disclosure, instead of rows in a table, the operations are applied to the time-series datasets. Then, data can be "decompressed" (i.e., recovered, regenerated, etc.) from the DNN 20, such as by using the predictive functionality of the DNN 20.
Furthermore, the telemetry data is time-series data obtained from a communications network, where data may be related to network data, such as information or PM data related to packet count, latency, jitter, SNR, SNR estimates, state of polarization, Channel Quality Indicator (CQI) reports, alarm states, etc. Compression procedures, according to the present disclosure, may include the steps of 1) dividing the time-series data into equal-sized chunks (or periods of time), 2) feeding each index as an input to the DNN 20, where each chunk is an output of the DNN 20, and 3) training the DNN 20 to adjust weights until a desired compression ratio (or precision) is achieved. Decompression procedures, according to the present disclosure, may include the steps of 1) receiving an index corresponding to a desired time range, 2) inputting this index to the trained DNN 20, and 3) propagating values through the DNN 20 to obtain the reconstructed time-series chunk at the output of the DNN 20.
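As a rough sketch of these compression and decompression procedures (the layer widths, chunk length, optimizer, and epoch count below are illustrative assumptions, not values prescribed by the disclosure), the flow might look like:

```python
import torch
import torch.nn as nn

CHUNK_LEN = 768   # samples per chunk (assumed window size)
INDEX_BITS = 32   # each index presented as a 32-bit binary vector

# A small dense network mapping an index to a reconstructed chunk.
dnn = nn.Sequential(
    nn.Linear(INDEX_BITS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, CHUNK_LEN),
)

def compress(indices, chunks, epochs=5000, lr=1e-3):
    """'Compression' is training: fit the DNN so dnn(index) reproduces
    each raw chunk, stopping once the precision target is met."""
    opt = torch.optim.Adam(dnn.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dnn(indices), chunks)
        loss.backward()
        opt.step()
    return loss.item()  # the raw chunks can be discarded afterwards

def decompress(index):
    """'Decompression' is a forward pass: propagate the index through
    the trained DNN to obtain the reconstructed chunk."""
    with torch.no_grad():
        return dnn(index)
```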
Before collecting and compressing the data, one or more telemetry devices (e.g., NEs 12) may be configured to prune the raw data before transmitting it to the data collector 14. In some embodiments, this pruning process may include detecting a quality of a data decompression and providing a feedback signal to change the parameters of the pruning process in order to reduce the reconstruction error. For example, adjusting the level of pruning may use Reinforcement Learning.
In addition, the step of configuring compressed data as the DNN 20 may include the deployment of the DNN hosting system 22, which may be a server configured to host the DNN 20. In this case, the server may allow the querying of the DNN 20 for data retrieval, which effectively results in a reconstruction of the original raw data. Although this is a lossy compression scheme, the results described below demonstrate that the systems and methods of the present disclosure greatly improve conventional lossless or lossy compression strategies.
The DNN 20 may include indices and may further include relationships between each index and a time-series dataset. In response to receiving an index from a requesting agent, the DNN 20 is configured to reconstruct a time-series bucket related to that index. The indices may be picked randomly or according to a specific predetermined pattern. Also, in some embodiments, the indices may be determined through a bottleneck of an autoencoder.
Also, the DNN 20 may include multiple dense layers, as described below. The DNN 20 may also be configured as a decoder of an autoencoder, which is further described below.
Regarding compression, the system 10 may compress different time-series datasets at different compression rates and at different precisions (or accuracies) depending on the size of the data values (e.g., network traffic flow amount). The system 10 may use the compressor 18 as a Bernoulli transformer autoencoder to increase the compression rate by compressing time-series into Bernoulli distributed latent states while limiting (constraining) the distortion of the reconstructed time-series.
Regarding decompression, the system 10 may decompress or reconstruct the data by using a transformation to the frequency domain. In some embodiments, the system 10 may determine "residuals," which are the differences between the output and the original raw data. The system 10 may then compress these residuals, as described in more detail below.
Furthermore, the system 10 may use a prediction-quantization-entropy coding scheme. For example, the prediction part may correspond to an autoencoder process, while the quantization and entropy parts may be applied to the residuals. In some embodiments, autoencoding may include a Bernoulli Transformer AutoEncoder (BTAE), which may include an encoder acting as a feed-forward device and a decoder acting as a dictionary. The BTAE may be configured to reduce the size of the latent state by a factor related to the number of bits of the floating-point numbers used for the time-series values. Also, the system 10 may determine quantized entropy loss to constrain the size of the encoded residual with respect to the total entropy.
According to still other embodiments of the present disclosure, the telemetry data may be provided by an Internet Protocol Flow Information Export (IPFIX) device of one of the NEs 12, a router, switch, node, etc. of a communication network for providing network traffic information. The time-series datasets may be a collection of univariate and/or multi-variate time-series datasets. The step of configuring the compressed data as the DNN 20 may be a form of DNN memorization. Also, the time-series data may be deployed as the DNN 20 in lieu of storing compressed data in a database or in lieu of encoding comparisons between rows or columns in a relational table (as done in conventional systems). The training of the DNN 20 may include a stochastic gradient descent process. After training the DNN 20, the raw data can be discarded, since it can be effectively reconstructed by the strategies discussed below. Also, in some cases, the system 10 may be configured to reduce Mean Square Error (MSE) between the raw data and output of DNN 20 by increasing the number of iterations and/or by choosing hyper-parameters. Thus, the general aspects of the system 10 are described with respect to
ts_i^k′ = W_3 max{0, W_2 max{0, W_1 i_k + b_1} + b_2} + b_3
Compressing the time-series is done by training the DNN 50. During training, indices are used as the input to the DNN 50, and the raw time-series fragments are used as the desired output. The DNN 50 is trained to minimize the Mean Square Error (MSE) between the raw time-series and the output of the DNN 50 for a given index. Typically, the MSE is not 0. However, it can be made arbitrarily close to 0 by increasing the number of training iterations and by well-chosen training hyper-parameters. After the DNN 50 is trained, the raw time-series datasets can be discarded, since their information is contained in the DNN 50 itself.
The present disclosure describes and/or suggests at least two possible indexing strategies. The first strategy includes generating random integers for the index with a good ratio of 0s and 1s. For example, random 32-bit integers can be generated from 32-long vectors of Bernoulli random variables.
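A minimal sketch of this first strategy (PyTorch; the helper name is ours):

```python
import torch

def random_indices(num_chunks, bits=32, p=0.5):
    """One random index per chunk: a length-`bits` vector of
    Bernoulli(p) variables; p = 0.5 gives a balanced ratio of 0s
    and 1s in the resulting 32-bit indices."""
    return torch.bernoulli(torch.full((num_chunks, bits), p))
```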
One embodiment, which may be referred to as a preferred embodiment, can include the use of random indices. For example, this may be preferable for certain reasons, such as (1) the autoencoder 60 for a network might double the size of the network during training, meaning that it may need more training and would take longer to train than a network consisting of the compressor only.
The level of compression (i.e., compression ratio) may be related to the level of achievable precision. In general, the stronger the compression (i.e., data being compressed greatly), the less achievable precision is possible (i.e., data is not reconstructed as accurately). This means that the compression can be used judiciously to save on space and time to train the DNNs. When comparing compression algorithms, an important metric is the compression ratio, which is equal to the size of the uncompressed data divided by the size of its compressed form. For example, if the original size of a dataset is 16 MB and its compressed size is 4 MB, the compression ratio would be 4×.
For example, in many cases, the size of values in a time-series may vary significantly. Consider the case of “mice” and “elephant” flows in an IP network. There may be several orders of magnitude difference in the size of these flows. In the case of traffic engineering, it may be more important to know the size of large flows precisely than the size of small flows precisely.
The present disclosure describes systems and methods that can be used to compress small and large flows at different accuracies with different DNNs. Thus, more than one DNN can be applied, even on the same network data. For example, small flow measurements can be compressed with less precision, while big flow measurements can be compressed with higher precision. As the amount of measurement information per flow is the same for all flows and there are likely many more smaller flows than large flows, the systems of the present disclosure can achieve better results by storing the information at different compression ratios.
For example, if there are 100 small flows that can be compressed at a compression ratio of 100 and 1 large flow that can be compressed at a compression ratio of 10, to maintain acceptably high precision for each set of flows, the average compression ratio could be about 91. This should be compared to compressing all flows at the compression ratio of 10, which is the minimum required by the elephant flows. It should be clear that using multiple compression ratios results in almost an order of magnitude less storage.
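The 91 figure can be checked directly; assuming each flow contributes the same volume of raw measurement data, the volume-weighted average is:

```latex
\mathrm{CR}_{\mathrm{avg}}
  = \frac{\text{raw size}}{\text{compressed size}}
  = \frac{100 + 1}{\frac{100}{100} + \frac{1}{10}}
  = \frac{101}{1.1} \approx 91.8
```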
For compressing, the process 70 includes receiving time-series data and dividing it into chunks, as indicated in block 72. For example, the chunks may be periods of time and/or may be equal-sized time windows. Each chunk can be contiguous in time with adjacent chunks. The process 70 further includes feeding an index as an input to a DNN to get a time-series chunk as an output, as indicated in block 74. The index can be a time or some other means to uniquely identify a chunk. Finally, the process 70 may include training the DNN to adjust values of weights therein until a desired compression ratio or precision is achieved, as indicated in block 76. For example, the training can be a stochastic gradient optimization.
For decompressing, the process 80 includes retrieving an index of time-series data corresponding to a desired time range, as indicated in block 82. The process 80 also includes inputting the time-series index into a trained DNN, as indicated in block 84. Next, the process 80 includes propagating values (and calculating the values of output layers at each stage) until the chunk is reconstructed at the output of the DNN, as indicated in block 86.
Of note, multiple indices can be used to reconstruct a time-series containing multiple chunks, with the chunks obtained from the multiple indices concatenated to form the reconstructed time-series. The trained DNN can be used to store measurements related to a network, such as packet counters, latency, jitter, signal-to-noise ratio (SNR) estimates, state of polarization (SOP) measurements, wireless channel quality indicator (CQI) reports, alarm states, etc.
The DNN can be made of dense layers as described with respect to
During experimentation to analyze the efficacy of the DNN compression and decompression strategies described in the present disclosure, two datasets were used to check the performance of these compression schemes. A first dataset included a synthetic dataset, which was generated by creating uncorrelated polynomial degree-10 functions. This dataset allowed the performance of the present systems to be checked on very large sample sizes. It should be noted that this was a very challenging dataset. Each generated time-series had 768 floating point samples. The time-series were compressed to a randomly generated 32-bit integer, which was later used as an input to the DNN to regenerate the time-series.
The performance of various compression strategies is shown in the table below. For example, gzip and XOR are comparable lossless methods. It may be noted that replacing 15-minute bins with a single daily measurement is equivalent to a compression ratio of 96×. In the case of this time-series, the systems of the present disclosure were able to obtain a compression ratio of 113× with an average error of 3%. If, for example, the average error from the averaging of daily samples was 20% (not an unreasonable assumption), the equivalent reduction on this time-series using the present methods would be over 300×. As another example, at a Mean Absolute Percentage Error (MAPE) of 10%, the compression is over 200×, meaning that 1 GB of network measurements could be compressed to 5 MB, if a 10% MAPE is acceptable.
Next, a collection of optical SNR estimates from a service provider was used. This dataset was fixed in size, so the compression ratio depended solely on the size of the neural network. In this dataset, 1% error is about 0.1 dB, which can achieve compression of over 20×. Better compression ratio might be expected if the dataset was larger. Again, gzip and XOR are comparable lossless methods.
A traditional way of approaching the problem may require the installation of a specialized monitoring system with intermediate collectors, large-volume data storage, and enormous computational resources to analyze the collected data. The data volume problems start at the Network Element (NE), where the software and hardware are typically not able to track all flows passing the NE. Due to hardware limitations, the monitoring system may be unable to track much of the traffic.
One practical approach to mitigate the collection overhead in IPFIX is a technique called threshold compression. In this technique, the router reports only flows above a threshold to the collection station (e.g., IPFIX collector 94). One possible disadvantage of this method is that the information of flows below the threshold is not sent, while such flows may account for up to 90% of the flows. In conventional systems, the problem of large network data is resolved in unsatisfying ways, such as by reducing the pressure on the switch CPU, whereby flows are simply not reported. However, reported traffic flow data may not be transmitted out of the network. When transmitted to storage solutions, data may often be aged-out and deleted, and sometimes long-term summaries may be taken.
In cellular systems, Channel Quality Indicator (CQI) measurements may be used to report User Equipment (UE) channel quality (e.g., SNR, MIMO channel estimates, etc.). Current standards mandate that these measurements are reported at a regular interval to ensure that the UE's channel is tracked sufficiently fast, especially when it is moving. If there are many UEs, the eNodeB may decide to reduce the frequency of the reports. If the frequency is reduced too much, the quality of transmission of the UEs may suffer, and outages may be possible.
The present disclosure provides implementations or approaches to reduce the amount of network telemetry. Although the problems discussed herein exist in current systems today, solutions are not usually sought outside of conventional approaches. However, with the advent of Artificial Intelligence (AI) approaches, especially in interpolation and imputation, new solution avenues, such as the embodiments discussed herein, are becoming available.
The problem of conventional systems, as described in the present disclosure, fits in the NetFlow and IPFIX architecture, as an example use case, but also in the context of 3GPP architecture where there is a large volume of network measurement data coming from network devices and the volume is so high that it may start to impact the available data bandwidth. The solutions to these problems in the NetFlow and IPFIX, as described herein, can also be used in any context where network telemetry is used.
NetFlow and IPFIX can be used to turn the network into a large collection of sensors collecting information about network traffic flows. Information about Internet Protocol (IP) flows can be used for monitoring of network usage, identification of misconfigured network elements, identification of compromised network endpoints, and detection of network attacks (e.g., see Omar Santos, "Network Security with NetFlow and IPFIX: Big Data Analytics for Information Security," Cisco Press, 2016). However, high fidelity network monitoring with technologies such as IPFIX comes with many challenges due to the amount of data that must be collected by NEs 12, then transmitted to where it can be stored, and finally processed to get the insights that the operator is looking for.
Thus, data gathering and data reporting features are located on the NEs 112 and may be implemented in hardware to handle large volumes of data. The data reporting modules 116 are configured to transmit the data obtained by the data gathering modules 114 to the data collector module 118. Data is analyzed by the network analytics module 120, such as by using network analytics software. The network analytics module 120 may be configured to use Machine Learning (ML) to recover (or impute) the original data or a version that is close to the original data.
The data gathering modules 114 and data reporting modules 116 are configured to prune the data before it is transmitted to the data collector module 118. At any time, only a subset of the collected data is transmitted to the data collector module 118 for analysis or monitoring by the network analytics module 120. Pruning can be done by randomly dropping a data source for a period of time, randomly dropping portions of a dataset from a data source, or by using a periodic schedule to do either. The network analytics module 120 is configured to use ML to recover (impute) the missing data points. In this case, the imputation may work sufficiently because many of the time-series datasets produced in the network are highly correlated inside of a time-series dataset and across time-series datasets. Thus, the system 110 can prune the data to the point where reconstruction still works sufficiently. While the reconstruction may not be perfect, its performance is good enough for analysis tasks that might be examined with various reconstruction approaches.
It may be noted that despite showing this architecture in the context of IPFIX, the architecture of the network data collection system 110 would work particularly well in the context of Channel Quality Indication (CQI) measurements in wireless networks, where many users have correlated CQI measurements. Instead of a per-flow dropping, the data reporting modules 116 may instead be configured to drop CQI measurements per user.
The advantage of the network data collection system 110 of
Network data is sampled time-series data. The samples are taken at a prescribed measurement interval (typically in the order of minutes). When picking the sampling interval, the network operator is typically not concerned about sampling intervals from the point of view of the Nyquist criterion, as they are not trying to reconstruct the data at the point of the collection and processing. Data is collected for other purposes (e.g., forecasting, anomaly detection, etc.), so the precise reconstruction of the underlying random process may not be extremely important.
To reduce the information generated and transmitted by NEs, the embodiments of the present disclosure may be configured to drop (i.e., prune) some of the samples. In some implementations, this may be referred to as "measurement sampling." For example, the present systems may prune every kth sample of each time-series dataset. This is called subsampling and can be undone for each time-series dataset using a low-pass filter if the Nyquist criterion is satisfied for the subsampled time-series.
Furthermore, a “correlation” process may be used in the present systems. For example, “correlation” may refer to the situation where information about one time-series dataset may be available in another time-series dataset. This may be a result of using pruning processes and then imputing the missing information. The embodiments of the present disclosure may be configured to remove parts of a time-series so that some information is always available in another, correlated time-series. The information can be removed in many ways, but one of the easiest may be to use an offset between the time-series when the kth element is removed. For the example in
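A minimal sketch of offset-based pruning (the staggered per-series offsets are an illustrative choice, with NaN marking removed samples):

```python
import numpy as np

def offset_prune(series_list, k):
    """Remove every k-th sample from each time-series, staggering the
    offset per series so that whenever one series is missing a sample,
    a correlated series still carries a sample usable for imputation."""
    pruned = []
    for i, ts in enumerate(series_list):
        out = np.asarray(ts, dtype=float).copy()
        out[(i % k)::k] = np.nan  # NaN marks a pruned sample
        pruned.append(out)
    return pruned
```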
Using the data available, the systems of the present disclosure were able to find multiple correlations in the datasets. Correlation analysis on all the factors showed that delay, jitter, and packet loss ratio are highly correlated. This makes sense from network and queueing theory. From network theory, it may be understood that path and link delay are correlated to link utilization and that link utilization is correlated to end-to-end traffic.
Measurement sampling is an alternative technique that may be used in both packet- and flow-based measurement to reduce the data volumes required for reporting. The main idea in this technique is to take only a subset of packets or flows out of all packets or flows to obtain reasonable results for the measurement. To do so, different selection mechanisms are introduced, including count-based sampling, time-based sampling, random sampling (uniform or weighted probability), etc. However, the main drawback of measurement sampling is the difficulty of determining the right sampling process and corresponding parameters according to the measurement conditions. Moreover, sampled data may not represent the characteristics of the real data. Also, using measurement sampling may introduce inevitable bias into results.
However, the embodiments of the present disclosure are different from measurement sampling in several aspects. First, in measurement sampling, data is reduced without considering the patterns existing in the data, while the present approach learns to reconstruct the reduced data based on the extracted patterns and features. Another major advantage of the present approach over the measurement sampling is that the present method can extract the patterns not only in time dimension, but also in spatial dimension. This is rather important as it is likely that there are significant correlations among datasets belonging to different flows. This correlation can be exploited fully in the present approach to reduce the amount of reporting data, while it is typically ignored in measurement sampling.
First, the data pruning module 138 is configured to receive the data from the data gathering module 136 and remove some portion of the data before transmitting the pruned data to the network analytics device 134. In the network analytics device 134, the data recovery module is next configured to reconstruct (impute) the data and pass it to the recovery evaluation module 142. Then, the recovery evaluation module 142 is configured to determine if the recovery was of high enough quality. Finally, the pruning logic is configured to instruct the data pruning module 138 on how to prune data to improve performance.
Depending on the quality of recovery as detected by the recovery evaluation module 142, the pruning logic 144 may instruct the data pruning module 138 to a) reduce or increase the amount of pruned data, b) change a pruning strategy (e.g., from a random process to a patterned process), and/or other factors. It should be understood that other kinds of changes could also be sent by the pruning logic 144 to the data pruning module 138 to control pruning strategies as desired.
Example of Data Reconstruction with Denoising
The system 150 is configured to perform the reconstruction method by accepting reduced data as an input and providing reconstructed data as an output. The system 150 treats missing values as noise and may use a denoising method (e.g., the autoencoder 60). As certain values are missing, the reduced data is applied to a transformer 152. The transformer 152 is configured to receive the reduced data, which is first transformed into the frequency domain (e.g., using a Fourier transform) or into the wavelet domain (e.g., using a wavelet transform). The transform into the frequency domain interlaces the missing and present values, so the structure of the data is pronounced even if some of the values are missing. The autoencoder structure denoises the frequency representation of the data, thus emphasizing the structures in the data. A second transformer 154 then returns the denoised frequency-domain data into the time domain using an inverse Fourier transform.
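A compact sketch of this reconstruction pipeline (the `denoiser` callable stands in for the trained autoencoder, and zero-filling the gaps before the transform is our assumption):

```python
import numpy as np

def reconstruct(reduced, denoiser):
    """Treat missing samples as noise: zero-fill the gaps, move to the
    frequency domain (transformer 152), denoise the spectrum (e.g.,
    with the autoencoder), and transform back (transformer 154)."""
    filled = np.where(np.isnan(reduced), 0.0, reduced)
    spectrum = np.fft.rfft(filled)
    cleaned = denoiser(spectrum)
    return np.fft.irfft(cleaned, n=len(filled))
```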
The method associated with the DNN system 160 of
To validate the accuracy of the reconstruction process, the network elements may be configured to transmit the full data periodically. For example, once a week, the network elements may transmit the full dataset over any suitable period of time (e.g., about one hour, two hours, a full day, etc.). This dataset may be used to repeat the self-labeling procedure, producing a validation dataset. The validation dataset is used to ensure that the performance of the DNN system 160 is still accurate. In some embodiments, the autoencoder 60 (e.g., encoder 62, bottleneck 64, and decoder 66) may be made from layers of Long Short-Term Memory (LSTM), convolutional or “dense” DNN blocks, etc.
The communication between the data pruning module 138 and the pruning logic 144 in
One embodiment for adjusting the rate of pruning is by using a fixed threshold, as suggested above. However, there may be other ways of doing this as well. For example, the system could use Reinforcement Learning (RL) to automatically adjust the level of pruning in each of the time-series datasets. The RL system may decide for each time-series to increase or decrease the pruning rate, taking the result of the reconstruction process into consideration. With the RL approach, the policy for increasing or decreasing the pruning is found automatically through a process of trial and error. Unlike the process described above, where a fixed threshold is used to decide when to increase or decrease the pruning, the RL approach may determine when to increase or decrease the threshold. For example, the RL approach may use a policy that determines when to increase, decrease, or leave the threshold alone. The input to the policy could be the current error in the reconstruction of the data or the data itself. The policy is determined through a process of exploration, which could be done on live traffic or in simulation. To find a better policy, the RL framework may use a cost function, which may be trained to minimize a weighted total error across all reconstructions of the data. Therefore, every time a policy is used, the RL framework may be configured to measure the error of the reconstruction and add it to the total error observed from the policy up to that point. Over time, the RL may learn which policy works best or may find policies that improve over others.
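For contrast with the learned RL policy, a minimal sketch of the fixed-threshold rule (the step size and bounds are illustrative):

```python
def adjust_prune_rate(rate, recon_error, threshold,
                      step=0.05, lo=0.0, hi=0.95):
    """Fixed-threshold feedback: prune more aggressively while the
    reconstruction error stays below the threshold, and back off once
    it exceeds the threshold. An RL policy would replace this rule,
    learning from trial and error when to move the rate or threshold."""
    if recon_error < threshold:
        return min(rate + step, hi)
    return max(rate - step, lo)
```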
For evaluating the performance of the data reduction systems and methods of the present disclosure, publicly available network traffic trace information was used from an IP backbone network and a private dataset was also used. The public dataset included IP-level traffic flow measurements collected from every Point of Presence (PoP) in the IP backbone. The data was sampled flow data from every router for a period of six days (e.g., Apr. 7 to 13, 2003). For the private dataset, five-minute telemetry data for a five-day period was used.
The table below shows the performance of the network data reduction approach of the present disclosure:
An acceptable error for this dataset is in the range of 15-25% for the purposes of traffic engineering, whereby a maximum data reduction of 25% may be expected. Using the random data removal and imputation techniques outlined in the present disclosure, favorable results were achieved, and it may be expected that the performance of the embodiments of the present disclosure would be even better with a more sophisticated approach to pruning and more data for training the model.
Firstly, the system 170 may convert time-series datasets to time-frequency decompositions. Among the time-frequency decompositions, the system 170 may use spectrograms to represent a time-series dataset as a 2D image. In the resulting image, the vertical and horizontal axes may represent frequency and time, respectively. The sequence of Short Time Fourier Transform (STFT) frames (i.e., frequency components) may be shown over a long time span. The brightness of each pixel shows the strength of a frequency component at each time frame as depicted in
In a next step, the system 170 may apply the CNN autoencoder deep learning approach to predict the noise model. In this case, the noise may be considered as artifacts generated by the missing data. The autoencoder can be used to capture the noise models for both magnitude and phase spectrograms. However, as a magnitude spectrogram contains most of the structure of the signal compared to a phase spectrogram, the system 170 may use just magnitude spectrograms to obtain the noise model, but the phase spectrogram may also be kept for reconstructing the time-series datasets. Finally, the noise spectrogram computed by the autoencoder may be subtracted from the magnitude spectrogram of the noisy dataset, which is the dataset with the missing values. The resulting spectrogram and the original phase spectrogram may be converted to the time domain using the inverse Short Time Fourier Transform (ISTFT).
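A minimal sketch of this magnitude/phase reconstruction (the `predict_noise` callable stands in for the CNN autoencoder, and the window length is an illustrative choice):

```python
import numpy as np
from scipy import signal

def spectrogram_reconstruct(noisy, predict_noise, fs=1.0, nperseg=64):
    """Denoise the magnitude spectrogram, keep the original phase,
    and invert with the ISTFT to recover the time-series."""
    _, _, Z = signal.stft(noisy, fs=fs, nperseg=nperseg)
    magnitude, phase = np.abs(Z), np.angle(Z)
    cleaned = np.maximum(magnitude - predict_noise(magnitude), 0.0)
    _, recon = signal.istft(cleaned * np.exp(1j * phase),
                            fs=fs, nperseg=nperseg)
    return recon
```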
According to some embodiments, the autoencoder 60 (at least the encoder 62 and decoder 66) may be placed in a router of a communications network. A set of samples from the time-series may be selected such that the system 170 has enough samples to perform a frequency conversion. These samples may then be used for the training. Training is performed locally at the router, which may be configured to compute the weights of the decoder 66; these weights are sent to a collecting server (e.g., data collector 14, IPFIX collector 94, data collector 118, etc.) to be used in the decoder 66 residing in the collecting server.
It should be noted that the collecting server may have just the decoder network for reconstructing the compressed data and retrieving the original samples. In the training phase, steps like the ones in Denoising Autoencoders (DAE) can be used. However, the ratio of data corruption may be set equal to the target compression rate. In this way, the local DAE learns how to reconstruct data from partially corrupted data.
Thus, in addition to compressing the original data, it is also possible to compress the residuals. It may be understood that the residual is the difference between the network output and the original data, r_i^k = ts_i^k′ − ts_i^k, and the DNN can be trained so that the residuals are small (|r_i^k| ≤ ε). However, the system 180 can also further use lossless or lossy compression on the residuals to reduce the error further. In this process, data is compressed first. Second, the compressed data is used to determine the residual. Next, the residual is compressed separately with either lossy or lossless compression. In the case of lossless compression, the residual values may be quantized and then compressed using a compressor specific to the data. For example, an entropy encoder can be used for optimal results.
A deep learning-based lossy time-series compression technique is further described with respect to the multiple embodiments of the present disclosure. In this section, a new compression technique is proposed, which is referred to herein as “Deep Dict” alluding to the deep learning in an encoding phase and using a dictionary-type function in a decoding phase. Deep Dict provides a novel lossy time-series data compression technique configured to improve the compression ratio.
Deep Dict may include a new framework configured for lossy compression of time-series data. The results demonstrate that Deep Dict achieves a higher compression ratio than state-of-the-art compressors. Deep Dict may be configured as a novel Bernoulli Transformer-based AutoEncoder (BTAE), or a BTAE-based lossy compressor, that can effectively reduce the size of latent states and reconstruct time-series from the Bernoulli latent states. Aiming at further performance improvement, a new loss function, referred to as Quantized Entropy Loss (QEL), is also introduced in this section; it considers the characteristics of various compression problems and outperforms common regression losses such as L1 and L2 in terms of the compression ratio. QEL is applicable to any prediction-based compressor that utilizes uniform quantization and an entropy coder. The Deep Dict technique was tested on ten time-series datasets across a variety of domains. As demonstrated by the results, Deep Dict outperformed the compression ratio of the state-of-the-art lossy compression methods by up to 53.66%.
With the rapid rise of smart devices, sensors, and IoT networks, massive amounts of time-series data are generated continuously. Thus, as suggested above, reduction of the time-series data volume through compression is critical so as to save network bandwidth and storage space. The intrinsic noise of time-series datasets enables lossy compression to improve the compression ratio significantly when compared to lossless compression.
Again, massive amounts of time-series data may be created, stored, and communicated as a result of the widespread use of smart devices, industrial processes, IoT networks, and scientific research. Various domains (e.g., finance, stock analysis, health monitoring, etc.) benefit from high-resolution time-series datasets. However, transmitting such a large number of time-series datasets can be costly in terms of network bandwidth and storage space. Consequently, many studies focus on compressing time-series datasets with a high compression ratio. Data compression can be roughly classified into two categories: lossless and lossy compression. Lossless compression permits flawless recovery of the original time-series, whereas lossy time-series compression can achieve a considerably higher compression ratio without significantly compromising downstream tasks, due to the inherent noise of time-series datasets.
An AutoEncoder (AE) may be used as a lossy time-series compression technique. An AE encodes time-series data as real-valued latent states and decodes them as a prediction. Compressed latent states are one of the sources of overhead for compressed data. This section of the present disclosure addresses the potential of encoding time-series datasets into Bernoulli distributed latent states rather than real-valued latent states in order to drastically reduce the size of latent states and improve the compression rate. In addition to AE, prediction-based compressors typically employ regression losses. This study formulates a new loss to more accurately describe the problem based on the nature of entropy coders.
Based on a comparison with existing compression techniques, the systems and methods proposed in the present disclosure may be configured to utilize a prediction-quantization-entropy coder paradigm and may use Deep Dict to enhance prediction abilities. The embodiments of the present disclosure can greatly reduce the size of latent states in comparison to conventional AutoEncoder-based compressors. The loss functions for conventional prediction-based compressors are L1/L2. However, the novel loss function embodiments (e.g., QEL) are configured to improve the compression ratio over conventional systems while avoiding certain drawbacks of conventional regression losses. A prediction-quantization-entropy coding scheme may be utilized for lossy signal, picture, and video compression, where QEL may be suitable for compressors that use uniform quantization.
Again, time-series datasets are described as a collection of time-dependent data that can be categorized broadly into Univariate Time-series (UTS) and Multivariate Time-series (MTS). Univariate time-series datasets UTS ∈ ℝ^l contain a single variable that varies over time, whereas multivariate time-series datasets MTS ∈ ℝ^(l×d) contain multiple variables that are not only related to the timestamp but also dependent on one another, where l is the length of a time-series and d is the number of variables. AutoEncoder (AE)-based compressors encode time-series into a latent representation consisting of floating-point numbers. Thus, the size of the latent representation has a direct effect on the compression ratio. Deep Dict is a new compressor that compresses time-series (UTS/MTS) into Bernoulli distributed latent states, significantly reducing the size of latent states, and limits the distortion of the reconstructed time-series.
For evaluating lossy time-series compression, two key metrics are used: data distortion and compression ratio. Data distortion measures the error between the original and reconstructed time-series, including the mean square error, the mean absolute percentage error, and the maximum absolute error. Maximum Absolute Error (MaAE) is a widely accepted measure of distortion in the absence of downstream tasks or domain knowledge. Compression ratio is defined as the ratio of original data size to compressed data size.
Initially, the Deep Dict compressor 200 divides the original long time-series into smaller time-series (x) chunks by a time window. The BTAE 202 encodes x into Bernoulli latent states (c) and decodes c to predict the time-series (x′). In order to limit the error of the reconstructed time-series to a desired range, the residual (r = x − x′) is quantized uniformly to r_q, and an entropy coder is used to compress r_q into r_encoded in a lossless manner. The value c, the decoder, and r_encoded are used for transmission or storage after compression. During decompression, c is fed to the decoder to recover x′, and the entropy coder decodes r_encoded to r_q. The time-series reconstruction is formulated as x_recon = x′ + r_q. Thus, the quantization error determines the distortion of the reconstructed time-series. During decompression, c functions as indices and the BTAE's decoder serves as a dictionary.
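A simplified sketch of this pipeline (zlib stands in for a real entropy coder, `encoder`/`decoder` stand in for the BTAE 202, and packing the quantized residual as float32 is our simplification):

```python
import numpy as np
import zlib

def deep_dict_compress(x, encoder, decoder, eps):
    """Encode a chunk to Bernoulli latent states c, predict x', quantize
    the residual uniformly so the reconstruction error stays within eps,
    and losslessly encode the quantized residual."""
    c = encoder(x)
    x_pred = decoder(c)
    r_q = 2 * eps * np.round((x - x_pred) / (2 * eps))
    r_encoded = zlib.compress(r_q.astype(np.float32).tobytes())
    return c, r_encoded  # stored/transmitted along with the decoder

def deep_dict_decompress(c, r_encoded, decoder):
    """x_recon = x' + r_q: the decoder acts as a dictionary keyed by c,
    and the decoded residual restores bounded-error detail."""
    r_q = np.frombuffer(zlib.decompress(r_encoded), dtype=np.float32)
    return decoder(c) + r_q
```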
The BTAE 202 seeks to discover the Bernoulli latent states/representations of a time-series and to recover the time-series from those representations. The encoder takes the flattened time-series x as input and transforms it into real-valued latent states y, which are then binarized to Bernoulli latent states c as shown in Eq. 1, where a tanh approximation is used to make the binarization function differentiable.
Consequently, the derivative of the binarization function of the binarization unit 208 is computed as in Eq. 2:
df_binarization/dy_i = 1 − tanh²(y_i) (2)
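A minimal PyTorch sketch of such a binarization unit (using sign() in the forward pass is our assumption, since Eq. 1 is not reproduced above; the backward pass follows Eq. 2):

```python
import torch

class Binarize(torch.autograd.Function):
    """Forward: map real-valued latents y to Bernoulli latent states in
    {-1, +1}. Backward: use the surrogate derivative 1 - tanh^2(y) from
    Eq. 2 so gradients flow through the non-differentiable step."""
    @staticmethod
    def forward(ctx, y):
        ctx.save_for_backward(y)
        return torch.sign(y)

    @staticmethod
    def backward(ctx, grad_output):
        (y,) = ctx.saved_tensors
        return grad_output * (1.0 - torch.tanh(y) ** 2)

binarize = Binarize.apply  # usage: c = binarize(encoder_output)
```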
A transformer-based decoder is then fed with the Bernoulli latent states (c) to predict the corresponding time-series. Since c contains limited information, the FFN 220 is used as the encoder. The FFN 220 augments c ∈ {−1, 1}^|c| to a real-valued vector c′, which is then replicated to c″ of length l, where l, the length of the time-series, is considered one of the important hyperparameters of Deep Dict. To avoid the additional overhead of parameters in positional encoding, sine and cosine functions are used as seen in Eqs. 3-4, and c″ with positional encoding is fed into the transformer 228 encoder blocks followed by the FFN 230 to obtain the prediction of the time-series.
PE(pos, 2i) = sin(pos/10000^(2i/d_model)) (3)
PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)) (4)
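A minimal sketch of this parameter-free positional encoding, per Eqs. 3-4 (assuming an even d_model):

```python
import torch

def positional_encoding(length, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine
    on odd ones; it adds no learned parameters to the decoder."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```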
The BTAE 202 may be configured to employ 32-bit float for the training phase, and 16-bit float for validation and testing to decrease the size of the decoder 210. Time-series datasets may contain recurring or similar patterns. Hence, not all Bernoulli hidden states are unique. Encoding Bernoulli latent states with an entropy encoder 216 can further improve the compression ratio.
In comparison to the traditional AE, the BTAE 202 reduces the size of the latent state by a factor of |FP|, the number of bits of a floating-point number; for example, |FP| = 32 bits by default in PyTorch and TensorFlow. Due to its non-autoregressive architecture, the system 200 trains faster and performs better on long sequences than RNN models such as LSTM and GRU in terms of compression ratio.
It may take a long time and many computational resources to train a new model from scratch each time a new time-series is compressed. Therefore, transfer learning can be used to accelerate the compression process. For univariate time-series, a trained model can be directly applied to another, whereas for a multivariate time-series, the encoder 206 and the last FFN 230 of the decoder 210 have to be retrained from scratch due to the varying number of multivariate variables.
The Quantized Entropy Loss (QEL) is defined herein. In general, many ML-based compression schemes contain quantization and entropy coders and apply a trivial L1/L2 loss function. Nevertheless, because of the entropy coder 216, the size of r_encoded is constrained by the total entropy of r_q. Since common regression losses such as L1/L2 consider neither quantization nor minimizing the entropy of r_q, they do not describe the problem precisely. The embodiments of the present disclosure introduce this new loss function, referred to as QEL, so as to minimize the size of r_encoded. In addition, QEL is not specific to the Deep Dict system 200. Rather, it is a loss function applicable to all compression methods that employ uniform quantization followed by an entropy coder.
To formulate QEL, certain symbols are defined for clarification. Given ε as the desired MaAE, r_q = 2ε·round(r/2ε). S = {s_1, s_2, . . . , s_|S|} is the set of unique values of r_q, n(s_j) is a counter that keeps track of the number of times s_j appears in r_q, and p(s_j) specifies the probability of s_j appearing in r_q, where |·| counts the number of items in its argument. A formal expression of p(s_j) can be given as n(s_j)/|r_q|.
As stated in Eq. 5, the size of r_encoded is always greater than the total entropy of r_q due to the nature of the entropy coder. Well-performing entropy coders with a high compression ratio tend to approach this bound.
|r_encoded| ≥ −Σ_(j=0)^(|S|) n(s_j) log p(s_j) (5)
In light of these, the objective function can be formulated as in Eq. 6:
min_r H(r) = −Σ_(j=0)^(|S|) p(s_j) log p(s_j) (6)
However, since neither the probability (p) nor the counter (n) is differentiable, P may be defined to replace p as shown in Eq. 7.
To make g_ε differentiable, g_ε is approximated as g′_ε in Eq. 9:
g_ε ≈ g′_ε = 1/((x/ε)^b + 1) (9)
where g_ε = lim_(b→∞) g′_ε. Under these circumstances, the objective function can be reformulated as Eq. 10:
H(r) ≈ −Σ_(j=0)^(|S|) P(s_j) log P(s_j), where P(s_j) = (1/|r|) Σ_k g′_ε(r_k − s_j) (10)
With these in mind, once the first order derivative of H is taken, it will appear as in Eq. 11:
where d_ij = r_i − s_j. Furthermore, QEL can be easily generalized to a multidimensional matrix, as presented in Eq. 12, where · indexes r's elements and d_·j = r_· − s_j.
∂H/∂r_· ≈ (b ε^(−b)/|r|) Σ_(j=0)^(|S|) [1 + ln p(s_j)] d_·j^(b−1) / [ε^(−b) d_·j^b + 1]² (12)
To accelerate the process, the forward procedure sticks with the objective function formulation in Eq. 6, utilizes the occurrences of unique values of r_q, and saves S and p(s_j) for the backward procedure, while the backward process constructs the distance matrix D = [d_·j] and uses Eq. 12 to calculate the derivatives.
The table of
Timestamp t = [t_low, . . . , t_high]^T determines the length of a time-series. The polynomial timestamp is t_p = [t^0, t^1, . . . , t^(d_p)]^T, where d_p is the degree of the polynomial. To introduce randomness into the synthetic data, the coefficient vector C is randomly generated. A time-series is defined as x = (C × t_p)^T, and the concatenation of multiple x forms a long sequence.
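A minimal sketch of this generator (the coefficient distribution and scale are illustrative assumptions):

```python
import numpy as np

def synthetic_sequence(t_low, t_high, length, degree, n_chunks, seed=0):
    """Concatenate chunks x = (C x t_p)^T built from a polynomial
    timestamp t_p = [t^0, ..., t^degree]^T with random coefficients C."""
    rng = np.random.default_rng(seed)
    t = np.linspace(t_low, t_high, length)
    t_p = np.stack([t ** d for d in range(degree + 1)])
    chunks = [rng.standard_normal(degree + 1) @ t_p for _ in range(n_chunks)]
    return np.concatenate(chunks)
```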
The datasets listed include both multivariate and univariate time-series. The first variable is utilized as a univariate dataset for these multivariate time-series; for instance, the univariate dataset of watch_gyr indicates that only the x-axis is utilized; and univariate mode indicates that the multivariate time-series is flattened prior to being fed into compressors.
To evaluate the performance of Deep Dict, four lossy time-series compressors were used as baselines: Critical Aperture (CA), SZ2, LFZip, and SZ3. CA is an industrially well-received compressor that is computationally simple and efficient. The evaluation also utilized the most recent versions of SZ2 (version 2.1.12.2) and SZ3 (version 3.1.5.3), which are the state-of-the-art prediction-based compressors. LFZip is a cutting-edge prediction-quantization-entropy framework that uses a bidirectional GRU to capture nonlinear structure.
The table of
Regarding the results under multi-variate datasets, six of the ten datasets listed in the table of
Regarding transferability, training a new model for each time-series from scratch is inefficient and time-consuming. Transfer learning is used to accelerate the compression process. The model is initially pre-trained using the largest dataset (i.e., ppg_ecg for univariate datasets and synthetic for multivariate datasets), and then fine-tuned for just 10 epochs with the remaining datasets. As shown in the table of
In a last step, a series of empirical studies was conducted concerning the impact of hyper-parameters and data size. As defined by Eq. 9, a large b improves the precision of the approximation, but it may also cause floating-point overflow in the derivative. As shown in
Furthermore, the raw telemetry data may be network data collected from a communications network. For example, the raw telemetry data may include information related to a) packet count, b) latency, c) jitter, d) SNR, e) SNR estimates, f) state of polarization, g) Channel Quality Indicator (CQI) reports, h) alarm states, and the like. The step of compressing the time-series datasets may include dividing the raw telemetry data into equal-sized chunks of time, feeding indices as inputs to the DNN to obtain the equal-sized chunks of time as outputs from the DNN, and training the DNN to adjust weights until a desired compression ratio or precision is achieved.
The process 270 may further include the step of substantially reconstructing the time-series datasets by a) receiving an index corresponding to a desired time range associated with a desired time-series dataset, b) inputting the index to the trained DNN, and c) propagating values through the DNN to substantially decompress the desired time-series dataset at the output of the DNN. A telemetry device may be configured to prune the raw telemetry data before transmitting the raw telemetry data for collection. The process 270 may also include a) detecting a quality factor of a data decompression process related to reconstruction, b) providing a feedback signal to change the parameters of a pruning process associated with the telemetry device in order to reduce a reconstruction error, and c) adjusting the pruning level of the pruning process using Reinforcement Learning.
The step of deploying the time-series datasets as the DNN in the network environment may include applying the DNN to a host server. The host server, for example, may be configured to allow a query request of the DNN for data retrieval. The process 270 may also include creating the DNN with indices and relationships between each index and a respective time-series dataset. In response to receiving an index for a query request, the DNN is configured to substantially reconstruct a time-series bucket related to the index. Creating the DNN may include picking the indices randomly or according to a pattern and/or may include determining the indices with respect to a bottleneck of an autoencoder. Also, creating the DNN may include forming multiple dense layers and/or may include using a decoder of an autoencoder.
The step of compressing the time-series datasets may include a step of compressing the time-series datasets at multiple different compression rates and at different precisions depending on a size of a value included in the different time-series datasets. The compressing step may also include increasing the compression rate by using a Bernoulli Transformer AutoEncoder (BTAE) so as to compress the time-series datasets into Bernoulli distributed latent states and constrain the distortion of the reconstructed time-series datasets. The BTAE may include an encoder acting as a feed-forward device and the decoder acting as a dictionary. Also, the BTAE may be configured to reduce the size of a latent state by a factor related to the number of bits of the floating-point numbers used for the values of the time-series datasets.
Substantially reconstructing the time-series datasets may include transforming the time-series datasets to the frequency domain. The process 270 may also include determining residuals as a difference between outputs of a reconstruction process and the raw telemetry data and then compressing the residuals. The process 270 may also perform a prediction-quantization-entropy coding scheme, wherein a prediction procedure is related to a decoding element of an autoencoder, and wherein quantization and entropy procedures are related to a distortion constraint element for processing the residuals. Also, the process 270 can determine quantized entropy loss to constrain the size of an encoded residual with respect to total entropy.
It will be appreciated that some embodiments described herein may include or utilize one or more generic or specialized processors ("one or more processors") such as microprocessors; Central Processing Units (CPUs); Digital Signal Processors (DSPs); customized processors such as Network Processors (NPs) or Network Processing Units (NPUs), Graphics Processing Units (GPUs), or the like; Field-Programmable Gate Arrays (FPGAs); and the like along with unique stored program instructions (including both software and firmware) for control thereof to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the methods and/or systems described herein. Alternatively, some or all functions may be implemented by a state machine that has no stored program instructions, or in one or more Application-Specific Integrated Circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic or circuitry. Of course, a combination of the aforementioned approaches may be used. For some of the embodiments described herein, a corresponding device in hardware and optionally with software, firmware, and a combination thereof can be referred to as "circuitry configured to," "logic configured to," etc. perform a set of operations, steps, methods, processes, algorithms, functions, techniques, etc. on digital and/or analog signals as described herein for the various embodiments.
Moreover, some embodiments may include a non-transitory computer-readable medium having instructions stored thereon for programming a computer, server, appliance, device, at least one processor, circuit/circuitry, etc. to perform functions as described and claimed herein. Examples of such non-transitory computer-readable medium include, but are not limited to, a hard disk, an optical storage device, a magnetic storage device, a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically EPROM (EEPROM), Flash memory, and the like. When stored in the non-transitory computer-readable medium, software can include instructions executable by one or more processors (e.g., any type of programmable circuitry or logic) that, in response to such execution, cause the one or more processors to perform a set of operations, steps, methods, processes, algorithms, functions, techniques, etc. as described herein for the various embodiments.
Although the present disclosure has been illustrated and described herein with reference to preferred embodiments and specific examples thereof, it will be readily apparent to those of ordinary skill in the art that other embodiments and examples may perform similar functions and/or achieve like results. All such equivalent embodiments and examples are within the spirit and scope of the present disclosure, are contemplated thereby, and are intended to be covered by the following claims. Moreover, it is noted that the various elements, operations, steps, methods, processes, algorithms, functions, techniques, etc. described herein can be used in any and all combinations with each other.
The present application claims the benefit of priority to Provisional Application No. 63/229,117, filed Aug. 4, 2021, entitled “Storing network data with DNN memorization and network telemetry reduction by pruning and recovery of network data,” the contents of which are incorporated by reference herein.