Cloud-based services can be delivered on demand to companies and other users over the Internet and are becoming more available and important. Many cloud-based services are data based or otherwise need to maintain the integrity and security of data, and these services suffer if their data is compromised. Accordingly, malicious attackers have created malware that targets user data. In particular, ransomware, which is one type of malware, may attempt to penetrate a system and encrypt stored files to prevent a user from accessing the files, or in some cases, the ransomware may attempt to lock a user’s system as a whole. Once an attacker controls access to the user data, the attacker demands payment (typically in the form of cryptocurrency) before the attacker will return control of the system or user data to the user.
Enterprises including providers of cloud-based services often employ cluster storage systems to meet data storage needs. Cluster storage systems may distribute data volumes to a set of storage nodes forming a cluster, and one technique that cluster storage can use to improve data security is to maintain backup or mirror copies of volumes. A storage node maintaining a backup or mirror copy of a base volume is generally not the storage node that maintains the base volume, so that the storage node maintaining the backup copy can be used if the storage node maintaining the base volume becomes unavailable. In a mirrored volume environment, the storage cluster synchronizes the base and backup volumes, so that if the ransomware encrypts base volumes, the backup volumes may also be encrypted. After ransomware has penetrated the system and encrypted files, the user may therefore be unable to access their data even from the backup. Some prior systems and methods for defending against ransomware try to detect and prevent ransomware when the ransomware tries to take control of a system or its data. However, detection and blocking of ransomware can be difficult and is subject to false positives that may disrupt or interfere with the efficiency or operation of a user’s system.
The drawings illustrate examples for the purpose of explanation and are not of the invention itself. Use of the same reference symbols in different figures indicates similar or identical items.
A storage system in accordance with an example of the present disclosure may employ one or more storage nodes implemented in one or more servers with each storage node including at least one storage or service processing unit (SPU). Each of the SPUs provides storage services to storage clients seeking use of virtual volumes that the SPU maintains. In maintaining the virtual volumes, the SPUs always write new data for the virtual volumes to empty physical storage locations, i.e., physical storage locations that do not currently store valid data. The SPUs do not overwrite old data of the virtual volumes, and the old data remains in physical storage until a garbage collection process determines that the old data is unneeded and therefore invalid. Each SPU has a communication link to a cloud-based service, and the SPUs when performing storage services can report I/O information and data analysis to the cloud-based service. Based on the I/O information and data analysis the SPUs report, the cloud-based service may employ methodologies to detect anomalies in storage activity suggesting a malware attack, and the cloud-based service may instruct the SPUs to retain the old data that the anomalous activity was overwriting. If an attack is confirmed, e.g., ransomware or other malware is later determined to have encrypted or taken control of data, the SPUs still retain the old data, e.g., in snapshots of virtual volumes, and the virtual volumes may be rolled back to a state where the data is unencrypted and accessible.
Each SPU 120 has hardware including a host interface 122, communication interfaces 124, a storage interface 128, and a processing system 130.
Host interface 122 provides communications between the SPU 120 and its host server 110. For example, each SPU 120 may be installed and fully resident in the chassis of an associated host server 110, and each SPU 120 may be a card, e.g., a PCI-e card, or printed circuit board with a connector or contacts that plug into a slot in a standard peripheral interface, e.g., a PCI bus in host server 110. Host interface 122 includes circuitry that complies with the protocols of the host server bus.
Communication interfaces 124 in an SPU 120 provide communications with other SPUs 120 and to other network connected devices. Multiple SPUs 120, e.g., SPUs 120-1 to 120-N in
Storage interfaces 128 in SPUs 120-1 to 120-N include circuitry and connectors for attachment to devices of respective backend storage 150-1 to 150-N, sometimes generically referred to herein as backend or persistent storage 150. Each SPU 120 may thereby control its backend storage 150. Backend storage 150 may employ, for example, hard disk drives, solid state drives, or other nonvolatile/persistent storage devices or media in which data may be physically stored, and backend storage 150 particularly may have a redundant array of independent disks (RAID) 5 or 6 configuration for performance and redundancy.
Processing system 130 in an SPU 120 includes one or more microprocessors, microcontrollers, or CPUs 132 with memory 134 that the SPU 120 employs to manage one or more physical storage devices of backend storage 150 and provide storage services to clients. In the illustrated example, processing system 130 particularly implements a set of modules including a management module 141, an I/O processor 142, a garbage collection module 143, and a data analysis module 144. In other examples, SPU 120 may additionally implement modules that provide other storage functions such as data deduplication, encryption and decryption, or compression and decompression. PCT Pub. No. WO 2021/150576 A1, entitled “Primary Storage with Deduplication” describes some examples of storage systems with additional storage functions such as deduplication, which is hereby incorporated by reference in its entirety.
Management module 141 controls processes such as a setup or configuration process for an SPU 120 and communications with cloud-based infrastructure 180. I/O processor 142 processes storage service requests such as read and write requests from storage clients and performs storage operations to fulfill the storage service requests. In accordance with an aspect of the present disclosure, data analysis module 144 may perform analysis of data associated with the storage service requests to help detect encrypted data or the activities of malware such as ransomware. In one example of the current disclosure, data analysis module 144 may periodically sample data blocks (e.g., one 8 KB block of data for every 256th I/O per virtual volume) and may analyze or test the encryption status of the sampled data, i.e., determine whether the incoming block is encrypted or not. The SPU 120, e.g., management module 141, can communicate to cloud-based infrastructure 180 information regarding the storage services that I/O processor 142 handles and regarding the analysis results from data analysis module 144. Cloud-based infrastructure 180 may have its own analytics service 186 that analyzes the information regarding I/O processes and analysis results from data analysis module 144 to determine whether real-time I/O processes suggest a ransomware or other malware attack, and if an attack is suspected, a management service 182 provided by cloud-based infrastructure 180 may instruct management module 141 in the SPU 120 to preserve old data as described further below.
I/O processors 142 of SPUs 120-1 to 120-N generally perform storage services in response to storage service requests targeting the virtual volumes that the SPUs 120-1 to 120-N own. In some implementations of storage platform 100, storage clients, e.g., applications 112 running on a host server 110 or a user device 162 or 164, may request storage service through an SPU 120 resident in the host server 110 associated with the storage client. The I/O processor 142 of the resident SPU 120 may receive the storage service request and provide the requested storage service if the SPU 120 owns the targeted virtual volume or may forward the storage service request through data network 125 to another SPU 120, e.g., to the SPU 120 that owns the virtual volume that the storage service request targeted.
In accordance with an example of the current disclosure, each I/O processor 142 maintains a set of generation numbers 136, each generation number corresponding to an associated virtual volume, and the I/O processor 142 uses a current value of the generation number for a virtual volume to tag and uniquely distinguish each I/O process that changes the content of that virtual volume. For example, for each write request that SPU 120-1 receives requesting writing data to an address or offset in a virtual volume V1, the I/O processor 142 of SPU 120-1 may increment the generation number 136 for the volume V1 and tag (or otherwise identify) the write request using the current value of generation number 136 for the virtual volume V1. The next write request to volume V1 will be tagged with the next value of generation number 136.
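The per-volume tagging described above can be illustrated with a minimal Python sketch. The class and field names (`GenerationTagger`, `TaggedWrite`) are hypothetical and chosen only for illustration; the sketch assumes the generation number is a simple integer counter that is incremented before each write is tagged.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedWrite:
    """A write request identified by volume, offset, and generation number."""
    volume: str
    offset: int
    generation: int


class GenerationTagger:
    """Hypothetical helper keeping one increasing counter per virtual volume."""

    def __init__(self) -> None:
        # volume ID -> current generation number (assumed to start at 0)
        self._generation = defaultdict(int)

    def tag_write(self, volume: str, offset: int) -> TaggedWrite:
        # Increment the volume's generation number, then tag the write
        # with the new current value.
        self._generation[volume] += 1
        return TaggedWrite(volume, offset, self._generation[volume])


tagger = GenerationTagger()
first = tagger.tag_write("V1", 0)
second = tagger.tag_write("V1", 8192)
other = tagger.tag_write("V2", 0)
```

In this sketch, successive writes to volume V1 receive consecutive generation numbers, while volume V2 keeps its own independent counter.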
I/O processor 142, during each write operation, may record in a data index 138 an entry in which the generation number and volume/offset of the write operation are mapped to the physical storage locations where write data is stored in backend storage 150. Data index 138 may be any type of database, but in one example of the present disclosure, data index 138 is a key-value store where each entry in data index 138 includes a key and a value. The key in each entry contains a generation number of a write operation and a volume ID and offset or address in the virtual volume for the write operation, and the value in the entry contains a pointer to the physical location in backend storage 150 containing the data pattern written. When reading from a base volume, I/O processor 142 may query data index 138 to find the entries that correspond to the volume/offset to be read, and of those entries, the I/O processor 142 uses the entry having the newest generation number to identify where the requested data is in backend storage 150. The entries having older generation numbers may be required for snapshots or may be garbage that garbage collection module 143 can identify and reclaim for storage of new data. When reading from a snapshot, I/O processor 142 may query data index 138 to find the entries that correspond to the volume/offset to be read and of those entries uses the entry having the newest generation number that is at least as old as the snapshot, newer entries being ignored. Garbage collection module 143 acts to preserve any entries in data index 138 that may be needed for reading any virtual volume or snapshot. Garbage collection module 143 can reclaim entries and identified data that are not needed for any virtual volume or snapshot.
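The base-volume and snapshot read rules can be sketched as follows. This is a simplified illustration, not the disclosed implementation: the index is modeled as a plain Python dictionary, the `lookup` function and the location strings are hypothetical, and a snapshot is assumed to be identified by the generation number at which it was taken. A real key-value store would likely use an ordered key range scan rather than the linear scan shown here.

```python
from typing import Dict, Optional, Tuple

# Key: (volume ID, offset, generation number); value: physical location.
DataIndex = Dict[Tuple[str, int, int], str]


def lookup(index: DataIndex, volume: str, offset: int,
           max_generation: Optional[int] = None) -> Optional[str]:
    """Return the location of the newest matching entry, or None.

    For a base-volume read, leave max_generation as None. For a snapshot
    read, pass the snapshot's generation number so newer entries are ignored.
    """
    candidates = [
        (gen, loc)
        for (vol, off, gen), loc in index.items()
        if vol == volume and off == offset
        and (max_generation is None or gen <= max_generation)
    ]
    # Generation numbers are unique per volume, so max() picks the newest entry.
    return max(candidates)[1] if candidates else None


index: DataIndex = {
    ("V1", 0, 1): "disk0:100",     # old data, still needed by a snapshot
    ("V1", 0, 5): "disk0:340",     # newest data for offset 0
    ("V1", 8192, 3): "disk1:020",
}
base_read = lookup(index, "V1", 0)                        # newest entry
snapshot_read = lookup(index, "V1", 0, max_generation=4)  # snapshot at gen 4
```

Because the entry with generation number 1 is never overwritten in place, a snapshot taken at generation 4 can still reach the old data even after the write with generation number 5.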
Private network 160, as noted above, may provide a connection through firewall 161 to public network 170, so that user devices 162 and 164, servers 110, and SPUs 120 may communicate with remote devices and particularly with cloud-based infrastructure 180. Cloud-based infrastructure 180 may include a computer or server that is remotely located from host servers 110 and from user devices 162 and 164, and cloud-based infrastructure 180 may provide management service 182 for configuration and management of storage platform 100 to thereby reduce the burden of storage management on an enterprise using storage platform 100. Management service 182, for example, may use an image library 180 to provide SPUs 120 with operating system or software images, provisioning or configuration settings, and operating instructions and thus allows an enterprise to offload the burden of storage setup and management to an automated process that cloud-based management 180 and the SPUs 120 execute. Management service 182 may particularly be used to configure SPUs 120 in a pod or cluster in storage platform 100, to monitor the performance of storage platform 100, to provide analysis services 186, or to provide recovery services for storage platform 100. Management service 182, during a setup process, may determine an allocation of storage volumes to meet the needs of an enterprise or other users of storage platform 100, distribute the allocated volumes to SPUs 120-1 to 120-N, and create recipes for SPUs 120 to execute to bring storage platform 100 to the desired working configuration such as illustrated in
For detection of malicious activity, SPUs 120-1 to 120-N can collect I/O information and analyze data blocks. I/O information may identify a time series of I/O operations on a per-block level for all virtual volumes BT1 to BTN and V1 to VN. Data block analysis may include performing a subset of NIST-800 tests and using a multi-layer perceptron to predict whether a block of data is encrypted. For example, data analysis module 144 in an SPU 120 may only periodically analyze a byte distribution of an 8 KB page of write data or may implement additional analysis techniques such as convolutional neural networks (CNN) and other statistical tests to examine the entire 8 KB page. An SPU 120 may report the I/O information or results from analysis module 144 to management service 182 or analytics service 186 in cloud-based infrastructure 180.
The SPU, in a process block 230, may analyze the data associated with the storage operation performed. For example, an SPU may analyze every storage operation that writes to a page in a virtual volume or may only analyze a sampling of the pages written. The number of pages analyzed may be chosen to minimize the impact that the processing has on storage performance, and in one example, each SPU analyzes 1 in 256 of the pages written to each virtual volume the SPU owns.
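The 1-in-256 sampling policy can be sketched with a per-volume write counter. The `PageSampler` class is hypothetical, and the sketch assumes deterministic sampling of every 256th page written to a volume; the source does not state whether sampling is deterministic or randomized.

```python
from collections import defaultdict

SAMPLE_INTERVAL = 256  # from the example above: 1 in 256 pages per volume


class PageSampler:
    """Hypothetical sampler selecting every Nth page written to each volume."""

    def __init__(self, interval: int = SAMPLE_INTERVAL) -> None:
        self.interval = interval
        self._writes = defaultdict(int)  # volume ID -> pages written so far

    def should_analyze(self, volume: str) -> bool:
        # Count the write, then select it if it is the Nth since the last sample.
        self._writes[volume] += 1
        return self._writes[volume] % self.interval == 0


sampler = PageSampler()
# Out of 1024 page writes to one volume, 4 pages are selected for analysis.
sampled = sum(sampler.should_analyze("V1") for _ in range(1024))
```

Keeping a separate counter per volume matches the per-volume sampling in the example, so a busy volume does not starve a quiet volume of samples.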
A primary focus of the analysis that process block 230 performs may be to determine whether the data is encrypted, and any desired tests or techniques for identifying encrypted data may be employed. In one example, the SPU may analyze I/O blocks using a series of statistical randomness tests, including one or more of the Frequency (Monobit) Test, Index of Coincidence, Chi-Square Test, Chi-Squared Test on Binary Bit Distribution (such as NIST-800-22 discloses), and the SPU may employ a Multi-Layer Perceptron (MLP) on the sampled byte distribution. (In machine learning, a perceptron is an algorithm for supervised learning of binary classifiers, which are functions that can determine whether or not an input vector belongs to some specific class.)
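Two of the named tests can be sketched in a few lines of Python. The monobit function follows the Frequency (Monobit) Test of NIST SP 800-22, and the chi-square function computes the standard Pearson statistic of the byte histogram against a uniform expectation. The function names, the sample pages, and any decision threshold applied to the results are assumptions for illustration; the source does not specify how the test outputs are combined.

```python
import math
from collections import Counter


def monobit_p_value(data: bytes) -> float:
    """Frequency (Monobit) Test: p-value for ones and zeros being balanced."""
    n = len(data) * 8
    ones = sum(bin(b).count("1") for b in data)
    s_n = 2 * ones - n                     # sum of bits mapped to +1 / -1
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def byte_chi_square(data: bytes) -> float:
    """Pearson chi-square statistic of byte counts against a uniform spread."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected
               for v in range(256))


# An 8 KB page of plain text: strongly biased bits and bytes.
text_page = (b"The quick brown fox jumps over the lazy dog. " * 200)[:8192]
# An idealized page with a perfectly flat byte histogram, used only as a
# stand-in for a random-looking page; it is not real ciphertext.
flat_page = bytes(range(256)) * 32

text_p = monobit_p_value(text_page)    # very small: fails the monobit test
flat_p = monobit_p_value(flat_page)    # exactly balanced bits
text_chi = byte_chi_square(text_page)  # large: far from uniform
flat_chi = byte_chi_square(flat_page)  # zero: perfectly uniform
```

A small monobit p-value or a large chi-square statistic suggests structured, non-random data, whereas encrypted data would be expected to pass both tests most of the time.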
A reporting process 240 may follow analysis process 230. The storage platform or one or more of the SPUs in the storage platform may perform reporting process 240 to report information, e.g., I/O pattern information and encryption statistics, to the cloud-based service. In one example of reporting process 240, the SPU reports to a cloud-based service (e.g., management service 182 or analytics service 186 of
In a block 330, the cloud service analyzes the I/O data in the analysis pool. The cloud-based service could use many different analysis techniques to identify an anomaly that may suggest the activity of malware such as ransomware. The cloud-based service may, for example, implement encoder-decoder, i.e., autoencoder (AE), models that are trained using the information from the storage platform. The models may be trained during training periods, e.g., every five days, to recognize normal storage patterns for respective volumes in the storage platform. The models for the virtual volumes of the storage platform generally depend on the specifics of the storage platform and the storage client activity. In the event the cloud-based service does not have a model for a virtual volume, a model can be created using the previous I/O information spanning the required training period, e.g., five days. If there is an existing model, the existing model can be fine-tuned with the unseen data collected in a past period. In one specific example, the data used for training the model is the histogram of the compression ratio and total I/O size recorded at a suitable frequency, e.g., every two minutes. Each AE model analyzes a window of data points, e.g., 20 data points equivalent to 40 minutes at 2 minutes per data point. The primary objective of the encoder-decoder model is to learn the write patterns of storage clients and to raise an alert if any unusual or suspicious pattern is detected.
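The windowing arithmetic in the example (20 data points at 2 minutes per point, or 40 minutes per window) can be sketched as below. Only the window construction is shown; the autoencoder itself is outside the scope of this sketch. The function name and the assumption that consecutive windows overlap by sliding one data point at a time are illustrative choices not stated in the source.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence

WINDOW_POINTS = 20      # data points per model input window
MINUTES_PER_POINT = 2   # one data point recorded every two minutes


def sliding_windows(points: Sequence[float],
                    size: int = WINDOW_POINTS) -> Iterator[List[float]]:
    """Yield each full window of `size` consecutive data points."""
    window: Deque[float] = deque(maxlen=size)
    for point in points:
        window.append(point)       # oldest point drops out automatically
        if len(window) == size:
            yield list(window)


samples = [float(i) for i in range(25)]  # 25 points = 50 minutes of data
windows = list(sliding_windows(samples))
window_minutes = WINDOW_POINTS * MINUTES_PER_POINT
```

With 25 recorded data points, the sketch produces 6 overlapping windows, each covering 40 minutes of activity.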
The cloud-based service may perform anomaly detection using the encryption statistic signal from the storage platform. The cloud-based service may employ CUSUM (Cumulative Sum) to detect a level shift in the encryption percentage signal. The level shift may indicate the activity of ransomware because when ransomware is encrypting data on a machine, the percentage of encrypted data being written increases and creates a noticeable shift in the encryption signal.
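A one-sided CUSUM detector for an upward level shift can be sketched as follows. The recurrence is the standard upper CUSUM statistic; the parameter values (baseline mean, slack, and alarm threshold) and the sample signal are assumptions chosen only to make the example concrete.

```python
from typing import Optional, Sequence


def cusum_upward_shift(signal: Sequence[float], target_mean: float,
                       slack: float, threshold: float) -> Optional[int]:
    """Return the index at which an upward level shift is flagged, else None.

    Uses the upper CUSUM recurrence
        S_t = max(0, S_{t-1} + (x_t - target_mean - slack))
    and raises an alarm when S_t exceeds the threshold.
    """
    s = 0.0
    for i, x in enumerate(signal):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None


# Percentage of sampled pages classified as encrypted, per reporting interval.
baseline = [5.0, 6.0, 4.0, 5.0, 6.0, 4.0]  # normal activity (assumed values)
attack = [40.0, 45.0, 50.0]                # sudden rise in encrypted writes
alarm_index = cusum_upward_shift(baseline + attack,
                                 target_mean=5.0, slack=2.0, threshold=20.0)
quiet = cusum_upward_shift(baseline, target_mean=5.0, slack=2.0,
                           threshold=20.0)
```

The slack term keeps ordinary fluctuations around the baseline from accumulating, so the statistic stays at zero during normal activity and crosses the threshold at the first interval of the shift.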
Another part of anomaly detection may look at the entire 8 KB sampled data from the SPDK and use a larger subset of NIST-800-22 tests along with a convolutional neural network (CNN) to distinguish between encrypted blocks and non-encrypted blocks. A CNN may be used since non-encrypted data has spatial dependencies while encrypted blocks should not have any patterns or spatial correlation.
A publicly available dataset may be used for training models and testing anomaly detection. For example, a public dataset containing a suitable quantity, e.g., approximately 100 GB, of compressed files, such as zip, gzip, tar, mkv, mp4, pdf, and more, could be parsed into a series of 8 KB pages. The byte distribution of the data can be calculated and used as the ground truth for non-encrypted data. Next, AES-256 can be used to encrypt the data, and the process can be repeated, storing the results as encrypted data.
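The page parsing and byte-distribution steps can be sketched as below. The function names and the choice to drop a trailing partial page are assumptions. The AES-256 encryption step is omitted because the Python standard library provides no AES implementation; in practice the same two functions would be applied again to the encrypted output.

```python
from typing import Iterator, List

PAGE_SIZE = 8 * 1024  # 8 KB pages, as in the example above


def iter_pages(data: bytes, page_size: int = PAGE_SIZE) -> Iterator[bytes]:
    """Yield consecutive full pages; a trailing partial page is dropped."""
    for start in range(0, len(data) - page_size + 1, page_size):
        yield data[start:start + page_size]


def byte_distribution(page: bytes) -> List[float]:
    """Return the relative frequency of each of the 256 byte values."""
    counts = [0] * 256
    for b in page:
        counts[b] += 1
    return [c / len(page) for c in counts]


# Synthetic stand-in for file contents: two full pages plus a 4-byte tail.
blob = bytes(range(256)) * 80 + b"tail"
pages = list(iter_pages(blob))
dist = byte_distribution(pages[0])
```

Each distribution is a 256-element feature vector that could be labeled as non-encrypted or encrypted and used as training input.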
In information theory, a string, e.g., an 8kb page, is considered random if there is no shorter description of that string. However, when compression algorithms are applied to a data stream, the compressed version shares certain characteristics with the original uncompressed stream, which means that the compressed stream cannot be considered truly random, despite the fact that the difference in entropy between the compressed and encrypted streams may not be significant. This resemblance between the compressed and uncompressed data can be leveraged to differentiate between compressed and encrypted data. In contrast, encrypted data is truly random by definition because symmetric block encryption algorithms (such as AES-128, AES-256, etc.) use a random vector to XOR with the block of data, thereby producing a completely random output.
Entropy and compressibility are data metrics that may be calculated and used. Compression aims to minimize the number of bytes used to store information. The entropy of compressed data is higher than that of uncompressed data, as fewer bits are used to represent the same information. This means that each bit in compressed data carries more information, resulting in an increased entropy. A similar phenomenon occurs in encryption, where the number of bits representing the data does not decrease, but each bit carries the same amount of information since it is XORed with a random vector. Because compressed and encrypted data may therefore have similar entropy, randomness tests may be used to differentiate between compressed and encrypted data.
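The entropy metric can be computed from the byte histogram using the standard Shannon entropy formula, $H = -\sum_i p_i \log_2 p_i$, which ranges from 0 to 8 bits per byte. The following sketch assumes entropy is measured per byte over a single page; the function name and test pages are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


constant = shannon_entropy(b"\x00" * 8192)         # single repeated value
uniform = shannon_entropy(bytes(range(256)) * 32)  # every value equally likely
text = shannon_entropy(b"The quick brown fox jumps over the lazy dog. " * 100)
```

A page of one repeated byte has zero entropy, a page with a flat byte histogram reaches the maximum of 8 bits per byte, and ordinary text falls in between. Compressed and encrypted pages both tend toward the maximum, which is why entropy alone may not separate them.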
The cloud-based service may also identify I/O access patterns, e.g., whether writes are directed to sequential or random addresses, as being an indicator of ransomware or other malware activity. In accordance with another example of the present disclosure, the cloud service can analyze I/O information using statistical analysis, artificial intelligence (A/I) and machine learning techniques, and predictive modeling to detect anomalies in I/O patterns. For statistical analysis, the cloud-based service may, for example, analyze historic data activity or patterns and compare the historic data activity or patterns with real-time data activity or patterns represented in the analysis pool. Anomalies or discrepancies between historic and real-time activity may indicate the activity of ransomware. The anomalies or discrepancies may be detected using techniques including but not limited to using intersection over union (IoU) of incoming and historic I/O.
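Intersection over union can be sketched by treating the incoming and historic I/O as sets of block addresses. Representing I/O activity as a set of touched blocks is an assumption made for illustration; the source does not specify the representation, and the block ranges below are invented sample data.

```python
from typing import Iterable


def intersection_over_union(incoming: Iterable[int],
                            historic: Iterable[int]) -> float:
    """IoU of two sets of block addresses: |A & B| / |A | B|."""
    a, b = set(incoming), set(historic)
    union = a | b
    # Two empty sets are treated as identical (an assumed convention).
    return len(a & b) / len(union) if union else 1.0


historic_blocks = range(0, 100)      # blocks normally written by the client
normal_blocks = range(10, 90)        # incoming writes within the usual region
suspicious_blocks = range(500, 600)  # incoming writes to an unusual region

normal_iou = intersection_over_union(normal_blocks, historic_blocks)
suspicious_iou = intersection_over_union(suspicious_blocks, historic_blocks)
```

A high IoU indicates that incoming writes overlap the historically active region, while an IoU near zero indicates activity in blocks the client does not usually touch.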
Another technique that a cloud-based service may use is cross entropy loss over incoming I/O and historic I/O compression histograms and entropy histograms. For cross entropy loss over incoming I/O access patterns and historic access patterns, artificial intelligence (A/I), machine learning, and predictive modeling may take advantage of unsupervised learning and detect any anomalies in incoming I/O patterns as they occur. Alternatively, supervised learning may collect in-lab data by running ransomware in a sandbox and use the data to train multi-layer perceptron (MLP) networks to predict whether data is clean or encrypted by ransomware. Again, cross entropy loss is a loss function used in training a machine learning model such as an MLP. Cross entropy and a set of other loss functions may be used to optimize models.
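Cross entropy between an incoming histogram and a historic histogram can be sketched with the standard definition $H(p, q) = -\sum_i p_i \ln q_i$. The histogram bin counts, the direction of comparison (incoming as $p$, historic as $q$), and the small epsilon used to avoid taking the logarithm of zero are assumptions for illustration.

```python
import math
from typing import List, Sequence


def normalize(hist: Sequence[float]) -> List[float]:
    """Convert raw bin counts into a probability distribution."""
    total = sum(hist)
    return [h / total for h in hist]


def cross_entropy(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """H(p, q) = -sum(p_i * ln(q_i)), with q clamped to avoid log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


# Compression-ratio histograms with four bins (assumed sample counts).
historic = normalize([70, 20, 8, 2])  # mostly compressible data
similar = normalize([68, 22, 8, 2])   # incoming I/O resembling history
shifted = normalize([2, 8, 20, 70])   # incoming I/O mostly incompressible

similar_loss = cross_entropy(similar, historic)
shifted_loss = cross_entropy(shifted, historic)
```

Incoming I/O whose histogram departs from the historic pattern yields a markedly larger cross entropy, which could serve as an anomaly score.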
Collected I/O information may be used to train support vector machines to group different file types together and detect ransomware-infected files. Furthermore, the kernel trick may be used to enhance accuracy. The kernel trick is a technique to transfer the data into a higher dimension without computing the coordinates of the data in that dimension. A kernel function may be used to calculate the similarity between pairs of instances.
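The kernel trick can be demonstrated with a degree-2 polynomial kernel on two-dimensional inputs. The source does not name a particular kernel, so the polynomial kernel is chosen here only as a simple, verifiable illustration: the kernel value computed directly in two dimensions equals the dot product of explicit three-dimensional feature vectors.

```python
import math
from typing import Sequence, Tuple


def poly2_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def explicit_features(x: Sequence[float]) -> Tuple[float, float, float]:
    """Explicit 3-D feature map phi with phi(x) . phi(y) = (x . y)^2."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, y)  # computed in the original 2-D space
via_features = dot(explicit_features(x), explicit_features(y))  # 3-D space
```

Both computations give the same similarity value, but the kernel version never constructs the higher-dimensional coordinates, which is the saving the kernel trick provides.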
In process 300, a decision block 340 determines whether an anomaly has been detected. If an anomaly suggesting ransomware or other malware activity is detected, the cloud-based service alerts the storage platform, e.g., one or more SPU 120 in storage platform 100 of
As disclosed herein, systems and methods can automatically detect a ransomware attack and suggest a timestamp to roll back the volume data to the latest point at which, with high probability, the data had not been encrypted by ransomware. The use of a cloud-based service may solve problems by automating all the processes of creating, distributing, and managing virtual volumes in a storage platform, and the cloud-based service may eliminate the need for a dedicated storage administrator while also removing the need for guesswork and hours of experimentation to get the right setup for a storage platform. Further, the user of the storage platform does not require ransomware detection software on host systems. The storage architecture with cloud services already has that capability.
Examples of the systems and methods disclosed herein can avoid conventional ransomware detection techniques that have high false positive rates and that are CPU intensive. The cloud-based solution reduces the processing load on the devices performing the I/O operations, which may improve storage performance.
All or portions of some of the above-described systems and methods can be implemented in computer-readable media, e.g., non-transient media, such as an optical or magnetic disk, a memory card, or other solid state storage containing instructions that a computing device can execute to perform specific processes that are described herein. Such media may further be or be contained in a server or other device connected to a network such as the Internet that provides for the downloading of data and executable instructions.
Although particular implementations have been disclosed, these implementations are only examples and should not be taken as limitations. Various adaptations and combinations of features of the implementations disclosed are within the scope of the following claims.
This patent document claims benefit of the earlier filing date of U.S. Provisional Pat. App. No. 63/314,996, filed Feb. 28, 2022, U.S. Provisional Pat. App. No. 63/314,970, filed Feb. 28, 2022, U.S. Provisional Pat. App. No. 63/314,987, filed Feb. 28, 2022, and U.S. Provisional Pat. App. No. 63/316,081, filed Mar. 3, 2022, all of which are hereby incorporated by reference in their entirety.
Number | Date | Country
---|---|---
63314996 | Feb 2022 | US
63314970 | Feb 2022 | US
63314987 | Feb 2022 | US
63316081 | Mar 2022 | US